Hi everyone, so this is the newest project i've been working on! using a cheap Microcontroller (with ironically no onboard FPU) as an FPU!
but first some back story,
this all pretty much started because i've gotten my fresh new 16kB dual port SRAM chips for my VGA Card, finally giving it the 64kB of VRAM i wanted. allowing me to do 320x200 @ 256 colors!
so to celebrate i wrote (with some online help) a small 3D renderer to draw a simple wireframe and rotate it around.
of course doing any kind of drawing in software is slow, but my fixed point library and some cheeky optimizations (like redrawing the shape in black instead of clearing the whole screen to save time) made it run fairly well.
though the shape changed over time, likely due to the limited precision of the Q16.16 fixed point format.
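for anyone wondering how Q16.16 can make a shape drift: below is a minimal sketch, not the actual code from my fixed point library. every multiply throws away the low 16 bits of the product, and the cos/sin values themselves are rounded to 16 fractional bits, so each rotation step can be very slightly off and the error adds up over many frames. the names `q16_t`, `q16_mul` and `q16_rotate` are made up for this example.

```c
#include <stdint.h>

typedef int32_t q16_t;            /* Q16.16: 16 integer bits, 16 fraction bits */
#define Q16_ONE ((q16_t)65536)    /* 1.0 in Q16.16 */

/* Multiply two Q16.16 values. The 64-bit product has 32 fraction bits,
   dividing by 65536 drops the lowest 16 of them (truncating toward zero). */
q16_t q16_mul(q16_t a, q16_t b)
{
    return (q16_t)(((int64_t)a * (int64_t)b) / 65536);
}

/* Rotate the point (x, y) in place, given cos and sin of the angle
   already converted to Q16.16. Four truncating multiplies per call. */
void q16_rotate(q16_t *x, q16_t *y, q16_t c, q16_t s)
{
    q16_t nx = q16_mul(*x, c) - q16_mul(*y, s);
    q16_t ny = q16_mul(*x, s) + q16_mul(*y, c);
    *x = nx;
    *y = ny;
}
```

the extreme case is `q16_mul(1, 1)`: the smallest representable value times itself comes out as 0. a rotation loses far less per step, but applying it to the same points frame after frame lets those small losses pile up.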
so i modified the code to make use of Calypsi C's built-in floating point library.
now the shape didn't malform over time anymore, but it was noticeably slower.
So then came the idea, what if i could use an external Microcontroller like an Arduino or similar to do such math operations for the CPU.
For that i chose SPI (cause it's easy and doesn't require a lot of wires or a bi-directional connection) and for the Microcontroller my custom made AVR128DB48 board.
the AVR128DB48 (similar to the Atmega1284p) boasts 128kB of Flash and 16kB of RAM. but unlike the Atmega it runs at 24MHz without an external crystal (overclockable to 32MHz, which is what i run it at) and has multiple UART, SPI, and I2C interfaces.
so i set up a simple test program (2 really) with the AVR-DB board and an Arduino Mega to simply test the SPI Slave mode, since i never used it and it doesn't have official Arduino support.
after getting that working i had to think about a protocol, some way for the master device to give commands and operands to the MCU to then work on and return results.
the interface i ended up with was as simple as you could get. a single byte for the command ID followed by a set amount of operands, and once the command is done the result can be read back.
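as a rough sketch of what such a packet could look like on the master side for a two-operand command: one command byte followed by two 32-bit floats. the function name `v1_pack2` is made up, the command ID is just passed in (i'm not listing the real IDs here), and it assumes both sides store floats in the same byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a Version 1 style packet: command ID, then operand a, then operand b.
   `out` must have room for 9 bytes. Returns the number of bytes to send.
   The floats are copied in the host's native byte order (an assumption). */
size_t v1_pack2(uint8_t *out, uint8_t cmd, float a, float b)
{
    out[0] = cmd;
    memcpy(&out[1], &a, sizeof a);
    memcpy(&out[5], &b, sizeof b);
    return 1 + 2 * sizeof(float);
}
```

so a two-operand command is 9 bytes out, and the 4 result bytes are read back afterwards.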
though there were already at least 2 issues with this idea.
first, AVR doesn't have hardware FIFOs or DMA controllers, so i needed the CPU to manually feed the SPI controller.
for that i got myself a highly customizable circular buffer library and used it to create 2 software FIFOs: one for received bytes and one for sending bytes.
the way the SPI controller on AVR works is that you can set up an Interrupt routine to run whenever a Byte has been sent over SPI.
so my ISR would simply take the received Byte from the SPI data register and put it into the "receive" FIFO, then take a byte from the "send" FIFO and put it into the SPI data register so it will be sent next.
if there are no Bytes in the send FIFO it would write $FF instead. and if there is no space in the receive FIFO it would discard the Byte.
This works very well, i thought about redoing the ISR in assembly for maximum speed, but since it has to interface with the buffer library anyways i doubt i could make it any faster than the Compiler.
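here is a stripped down sketch of that ISR logic, written so it can be tried on a PC. it does not use the buffer library from my actual code, the tiny ring buffer (`fifo_t`, `fifo_push`, `fifo_pop`) and the size of 64 are stand-ins i made up for the example. `spi_isr_step` takes the received byte as a parameter and returns the byte to send next, which on the real chip would be a read and a write of the SPI data register.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 64u   /* hypothetical size, power of two so indices wrap with a mask */

typedef struct {
    uint8_t data[FIFO_SIZE];
    volatile uint8_t head;   /* next slot to write */
    volatile uint8_t tail;   /* next slot to read  */
} fifo_t;

/* Number of bytes currently stored. */
uint8_t fifo_count(const fifo_t *f)
{
    return (uint8_t)((f->head - f->tail) & (FIFO_SIZE - 1u));
}

/* Store one byte, returns false (and drops the byte) if the FIFO is full. */
bool fifo_push(fifo_t *f, uint8_t b)
{
    uint8_t next = (uint8_t)((f->head + 1u) & (FIFO_SIZE - 1u));
    if (next == f->tail) return false;
    f->data[f->head] = b;
    f->head = next;
    return true;
}

/* Fetch one byte, returns false (and leaves *b alone) if the FIFO is empty. */
bool fifo_pop(fifo_t *f, uint8_t *b)
{
    if (f->head == f->tail) return false;
    *b = f->data[f->tail];
    f->tail = (uint8_t)((f->tail + 1u) & (FIFO_SIZE - 1u));
    return true;
}

fifo_t rx_fifo, tx_fifo;

/* What the transfer-complete ISR does for every byte. */
uint8_t spi_isr_step(uint8_t received)
{
    uint8_t next = 0xFF;                 /* filler when there is nothing to send */
    (void)fifo_push(&rx_fifo, received); /* discarded if the receive FIFO is full */
    (void)fifo_pop(&tx_fifo, &next);     /* stays 0xFF if the send FIFO is empty */
    return next;
}
```

one slot of the ring buffer always stays empty so that "full" and "empty" can be told apart, so this version holds 63 bytes at most.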
Thanks to the buffer library i also had the ability to have the MCU sit and wait until at least some set amount of bytes are in the receiving FIFO (or wait until the sending FIFO has at least some set amount of empty spaces).
so using those i wrote version 1 of the AVR-FPU.
the main loop works like so:
the MCU waits for 1 byte in the receiving FIFO, once it has one it immediately grabs it as that's the command to execute.
it then uses that command ID on a giant switch case to decode it.
depending on the command it will then wait for more bytes, do some operation (like Add, Sub, Mul, Div, Sin, etc) on them and write the result to the sending FIFO.
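the decode step could be sketched roughly like this for the four basic operations. the command IDs in the enum are invented for the example (they are not the real ones), and the FIFO waiting is left out: `in` stands for the 8 operand bytes already pulled from the receiving FIFO and `out` for the 4 bytes that would go into the sending FIFO.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command IDs, only for this sketch. */
enum { CMD_ADD = 1, CMD_SUB, CMD_MUL, CMD_DIV };

/* Execute one two-operand command.
   in  = operand a followed by operand b (4 bytes each, native byte order)
   out = the 4 result bytes
   Returns false for a command this sketch doesn't know. */
bool v1_execute(uint8_t cmd, const uint8_t in[8], uint8_t out[4])
{
    float a, b, r;
    memcpy(&a, in, sizeof a);
    memcpy(&b, in + 4, sizeof b);

    switch (cmd) {
        case CMD_ADD: r = a + b; break;
        case CMD_SUB: r = a - b; break;
        case CMD_MUL: r = a * b; break;
        case CMD_DIV: r = a / b; break;
        default:      return false;
    }

    memcpy(out, &r, sizeof r);
    return true;
}
```

single-operand commands like Sin would only wait for 4 bytes instead of 8, otherwise the pattern is the same.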
but this made me aware of the second issue, how do i make sure the result is read with no gaps?
with that i mean that SPI is always a 2 way street, whenever you write a byte you also read one at the same time.
so one edge case i thought of was when the master is writing a command to the MCU it could partly read the result of the previous operation, which would require a similar FIFO structure on the master to handle properly.
another potential edge case is reading a byte of the result just as it was put into the buffer, then trying to read the second byte before it was put in, returning $FF and screwing up all future reads by offsetting them by a byte.
to solve both of these issues in exchange for some (maybe a lot of) performance i've opted to add an interrupt line from the MCU to the master.
whenever the MCU has finished a command and has written the entire result to the sending FIFO it will pull the interrupt line low and wait for the sending FIFO to be empty. once it is, it will pull the line high again and continue with the next command.
of course this lowers performance as the MCU will wait after each operation, but it makes the interface very reliable (as i've tested it a lot and never noticed any issues or missing/corrupted data)
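the handshake boils down to a tiny state machine, modelled below without any hardware. this is only a sketch of the idea, the struct and function names (`handshake_t`, `hs_result_ready`, `hs_byte_sent`, `hs_master_may_read`) are made up and the real code works on the send FIFO and an actual pin.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t tx_pending;  /* result bytes still sitting in the send FIFO */
    bool    irq_high;    /* level of the interrupt line, idle = high    */
} handshake_t;

/* MCU side: the complete result has been queued, pull the line low. */
void hs_result_ready(handshake_t *h, uint8_t nbytes)
{
    h->tx_pending = nbytes;
    h->irq_high = (nbytes == 0);
}

/* MCU side: one result byte has been clocked out by the master.
   Once the send FIFO is empty the line goes high again. */
void hs_byte_sent(handshake_t *h)
{
    if (h->tx_pending > 0 && --h->tx_pending == 0) {
        h->irq_high = true;
    }
}

/* Master side: only start reading while the line is low. */
bool hs_master_may_read(const handshake_t *h)
{
    return !h->irq_high;
}
```

since the line only goes low after the whole result is queued, the master can never read a half-written result, and since the MCU doesn't touch the next command until the line is high again, old result bytes can't leak into the next command.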
after that i went ahead and wrote a very simple benchmark program for various floating point operations.
it repeats each operation 1000 times while recording the time it took, it then calculates the average "FLoating point Operations Per Second" (aka FLOPS, but it's written as kFLOPS).
it does this for both the software and MCU functions, finally it then compares the results of both and records the average difference between them (as both use different floating point implementations, so there will always be some difference/error)
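the math behind the two numbers is simple, roughly like the sketch below. the function names are made up, and i'm assuming here that the "average difference" is the mean of the absolute differences between the two result sets.

```c
#include <stddef.h>

/* Thousands of operations per second for `ops` operations in `seconds`. */
double kflops(unsigned ops, double seconds)
{
    return (double)ops / seconds / 1000.0;
}

/* Mean absolute difference between the software and the MCU results. */
double average_difference(const float *sw, const float *mcu, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)sw[i] - (double)mcu[i];
        sum += (d < 0.0) ? -d : d;
    }
    return (n > 0) ? sum / (double)n : 0.0;
}
```

so 1000 operations in 0.060 seconds works out to 1000 / 0.060 / 1000 = 16.667 kFLOPS.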
here are the results for some of the operations:
Code:
----------------------------------------------------------
add - Software/AVR:
Total Time: 0.060 sec, Speed: 16.667 kFLOPS
Total Time: 0.770 sec, Speed: 1.299 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
sub - Software/AVR:
Total Time: 0.070 sec, Speed: 14.286 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mul - Software/AVR:
Total Time: 0.210 sec, Speed: 4.762 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
div - Software/AVR:
Total Time: 0.160 sec, Speed: 6.250 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mod - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 27.131926
----------------------------------------------------------
sin - Software/AVR:
Total Time: 4.020 sec, Speed: 0.248 kFLOPS
Total Time: 0.630 sec, Speed: 1.587 kFLOPS
Average Difference: 0.000325
----------------------------------------------------------
cos - Software/AVR:
Total Time: 4.090 sec, Speed: 0.244 kFLOPS
Total Time: 0.630 sec, Speed: 1.587 kFLOPS
Average Difference: 0.000373
----------------------------------------------------------
tan - Software/AVR:
Total Time: 8.260 sec, Speed: 0.121 kFLOPS
Total Time: 0.650 sec, Speed: 1.538 kFLOPS
Average Difference: 0.827708
----------------------------------------------------------
sqrt - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 0.580 sec, Speed: 1.724 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
inverse sqrt - Software/AVR:
Total Time: 0.690 sec, Speed: 1.449 kFLOPS
Total Time: 0.590 sec, Speed: 1.695 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
((a * cos(b)) + (sin(c) / -d)) - Software/AVR:
Total Time: 8.440 sec, Speed: 0.118 kFLOPS
Total Time: 4.310 sec, Speed: 0.232 kFLOPS
Average Difference: 16090.212891
simple operations like Add, Sub, Mul, and Div are actually faster in software, likely due to the interface bottleneck (software SPI)
trigonometric functions like Sin, Cos, and Tan are way faster on the MCU. though Tan has a noticeably higher error rate compared to the other 2.
modulo also has a very high error rate, i looked at the actual results for both and the software float function seems to have a bug and occasionally spits out weird values, throwing off the average. the MCU results were pretty much always correct.
and the final one is actually just a more complicated sequence of simple operations. the MCU performed well very likely due to the trigonometric functions. you'll also notice the high error rate, and like with modulo there seems to be a bug in the software float library and it sometimes just spits out seemingly random values while the MCU results were normal.
anyways after this i started working on Version 2, since Version 1 was pretty much just a POC.
Version 2 is a lot more complicated and while it uses the same interface and FIFOs on the MCU, the commands work very much differently.
instead of simply taking in floating point numbers, doing some work, and returning a result, it would now be closer to an actual FPU like the 8087, with its own set of registers, and all operations would work on those instead.
so i designed the emulated architecture of Version 2 to be similar to a RISC CPU, with all operations being done exclusively on registers. but it has no program or data memory (though that might change for Version 3) and has all of its instructions fed by the master.
Here are the contents of the text file i used to define everything:
Code:
Registers:
F0-F7 - General Purpose Floating Point Registers (32-bit)
SR - Status Register (16-bit)
SR bits:
15 - 0 (constant 0) \
14 - 0 (constant 0) |
13 - 0 (constant 0) |
12 - 0 (constant 0) |
11 - 0 (constant 0) |-- Currently Unused if you couldn't tell :p
10 - 0 (constant 0) |
9 - 0 (constant 0) |
8 - 0 (constant 0) /
7 - Infinity (Ra == INF, only updated by TEST)
6 - NaN (Ra == NaN, only updated by TEST)
5 - Greater Than (Ra > Rb)
4 - Greater Equal (Ra >= Rb)
3 - Equal (Ra == Rb)
2 - Nearly Equal (Ra ~= Rb) (threshold = 0.001f)
1 - Less Equal (Ra <= Rb)
0 - Less Than (Ra < Rb)
Instruction set:
Ra = Source Register A
Rb = Source Register B
Re = Destination Register
Write Integer (16-bit) - WRITE16 Re, k - 01 000 000 eee 00000 kkkkkkkk kkkkkkkk
Write Integer (32-bit) - WRITE32 Re, k - 10 000 000 eee 00000 kkkkkkkk kkkkkkkk kkkkkkkk kkkkkkkk
Write Float - WRITEF Re, k - 11 000 000 eee 00000 kkkkkkkk kkkkkkkk kkkkkkkk kkkkkkkk
Read Integer (16-bit) - READ16 Ra - 01 001 aaa 000 00000 (returns 2 Bytes)
Read Integer (32-bit) - READ32 Ra - 10 001 aaa 000 00000 (returns 4 Bytes)
Read Float - READF Ra - 11 001 aaa 000 00000 (returns 4 Bytes)
Read Status - READS - 00 010 000 000 00000 (returns 2 Bytes)
Load Constant (0) - LOADZ Re - 00 011 000 eee 00000
Load Constant (1) - LOADO Re - 01 011 000 eee 00000
Load Constant (PI) - LOADP Re - 10 011 000 eee 00000
Load Constant (E) - LOADE Re - 11 011 000 eee 00000
Absolute Value - ABS Re, Ra - 00 000 aaa eee 00001 Re = abs(Ra)
Negate Register - NEG Re, Ra - 01 000 aaa eee 00001 Re = -Ra
Degree to Radian - DTR Re, Ra - 10 000 aaa eee 00001 Re = Ra * (PI / 180)
Radian to Degree - RTD Re, Ra - 11 000 aaa eee 00001 Re = Ra * (180 / PI)
Round down (Floor) - FLOOR Re, Ra - 00 001 aaa eee 00001 Re = floor(Ra)
Round closest - ROUND Re, Ra - 01 001 aaa eee 00001 Re = round(Ra)
Round up (Ceiling) - CEIL Re, Ra - 10 001 aaa eee 00001 Re = ceil(Ra)
Round to 0 (Truncate) - TRUNC Re, Ra - 11 001 aaa eee 00001 Re = trunc(Ra)
Transfer Register - TRF Re, Ra - 00 010 aaa eee 00001 Re = Ra
Exchange Registers - EXCH Re, Ra - 01 010 aaa eee 00001 Re = Ra; Ra = Re (swaps the contents)
Test against Zero - TEST Ra - 00 000 aaa 000 00010 Compare Ra with 0
Compare Registers - COMP Ra, Rb - 01 bbb aaa 000 00010 Compare Ra with Rb
Add - ADD Re, Ra, Rb - 00 bbb aaa eee 00011 Re = Ra + Rb
Subtract - SUB Re, Ra, Rb - 01 bbb aaa eee 00011 Re = Ra - Rb
Multiply - MUL Re, Ra, Rb - 10 bbb aaa eee 00011 Re = Ra * Rb
Divide - DIV Re, Ra, Rb - 11 bbb aaa eee 00011 Re = Ra / Rb
Modulo - MOD Re, Ra, Rb - 00 bbb aaa eee 00100 Re = Ra % Rb
Power - POW Re, Ra, Rb - 01 bbb aaa eee 00100 Re = Ra ^ Rb
Exponent - EXP Re, Ra - 10 000 aaa eee 00100 Re = e ^ Ra
Logarithm - LOG Re, Ra - 11 000 aaa eee 00100 Re = log(Ra)
Decimal Logarithm - LOG10 Re, Ra - 00 000 aaa eee 00101 Re = log10(Ra)
Binary Logarithm - LOG2 Re, Ra - 01 000 aaa eee 00101 Re = log2(Ra)
Square Root - SQRT Re, Ra - 10 000 aaa eee 00101 Re = sqrt(Ra)
Inverse Square Root - ISQRT Re, Ra - 11 000 aaa eee 00101 Re = 1 / sqrt(Ra) (also yes it uses the Fast Inverse Square Root trick from Quake III)
Sine - SIN Re, Ra - 00 000 aaa eee 00110 Re = sin(Ra)
Cosine - COS Re, Ra - 01 000 aaa eee 00110 Re = cos(Ra)
Tangent - TAN Re, Ra - 10 000 aaa eee 00110 Re = tan(Ra)
Inverse Sine - ASIN Re, Ra - 00 000 aaa eee 00111 Re = asin(Ra)
Inverse Cosine - ACOS Re, Ra - 01 000 aaa eee 00111 Re = acos(Ra)
Inverse Tangent - ATAN Re, Ra - 10 000 aaa eee 00111 Re = atan(Ra)
Reset - RESET - 11 111 111 111 11110 Clears all Registers and both FIFO buffers
No Operation - NOP - 11 111 111 111 11111 Does nothing
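reading the bit patterns above from left to right, every instruction word is 16 bits: a 2-bit sub-operation, then the 3-bit Rb, Ra and Re fields, then a 5-bit group number (for single-register instructions the Rb field is reused as extra opcode bits). a small sketch of an encoder for that layout could look like this. the function name `fpu_encode` is made up, and it only builds the 16-bit value, it says nothing about the order the two bytes go over SPI.

```c
#include <stdint.h>

/* Pack the fields of one instruction word:
   bits 15-14 sub-operation, 13-11 Rb, 10-8 Ra, 7-5 Re, 4-0 group. */
uint16_t fpu_encode(unsigned sub, unsigned rb, unsigned ra,
                    unsigned re, unsigned group)
{
    return (uint16_t)(((sub   & 0x03u) << 14) |
                      ((rb    & 0x07u) << 11) |
                      ((ra    & 0x07u) << 8)  |
                      ((re    & 0x07u) << 5)  |
                       (group & 0x1Fu));
}
```

for example `ADD F0, F0, F2` is `fpu_encode(0, 2, 0, 0, 3)` which gives 0x1003, and `NOP` with all fields set to ones is 0xFFFF.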
the main benefit of doing operations like this instead of like in Version 1 is that doing multiple operations in sequence will be faster, as you don't need to constantly transfer the floats between master and MCU, they simply stay on the MCU in the registers.
this also means that individual operations like Version 1 did are actually slower than before, since you need 3-4 instructions in total: one or two to WRITE the values to the MCU, one to do the operation, and one to READ the result back.
also the MCU still has the feature where it stops all operations, pulls the interrupt line low, and waits for the sending FIFO to be empty. but because that only happens with the READ instructions, it can actually continuously work on some data given that the FIFO has a backlog of instructions in it.
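to show the register idea in code, here is a cut down sketch of an instruction executor that only knows ADD/SUB/MUL/DIV, LOADZ/LOADO and NOP. it is not the real firmware, the names `fpu_t` and `fpu_step` are made up, and the field positions are taken from the encoding table above.

```c
#include <stdint.h>

typedef struct {
    float    f[8];  /* F0-F7 */
    uint16_t sr;    /* status register, not touched by this sketch */
} fpu_t;

/* Execute one 16-bit instruction word.
   Returns 0 if it was handled, -1 if this sketch doesn't implement it. */
int fpu_step(fpu_t *fpu, uint16_t insn)
{
    unsigned sub = (insn >> 14) & 0x03u;
    unsigned rb  = (insn >> 11) & 0x07u;
    unsigned ra  = (insn >> 8)  & 0x07u;
    unsigned re  = (insn >> 5)  & 0x07u;
    unsigned grp = insn & 0x1Fu;

    if (insn == 0xFFFFu) return 0;               /* NOP */

    if (grp == 0x00u && rb == 3u && sub < 2u) {  /* LOADZ / LOADO */
        fpu->f[re] = (sub == 0u) ? 0.0f : 1.0f;
        return 0;
    }

    if (grp == 0x03u) {                          /* ADD / SUB / MUL / DIV */
        float a = fpu->f[ra];
        float b = fpu->f[rb];
        switch (sub) {
            case 0:  fpu->f[re] = a + b; break;
            case 1:  fpu->f[re] = a - b; break;
            case 2:  fpu->f[re] = a * b; break;
            default: fpu->f[re] = a / b; break;
        }
        return 0;
    }

    return -1;                                   /* everything else is left out */
}
```

with F0 = 6 and F1 = 7, the word 0x8843 (`MUL F2, F0, F1`) leaves 42 in F2 without a single float going over SPI.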
so here are the benchmark results for Version 2:
Code:
----------------------------------------------------------
add - Software/AVR:
Total Time: 0.060 sec, Speed: 16.667 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
sub - Software/AVR:
Total Time: 0.060 sec, Speed: 16.667 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mul - Software/AVR:
Total Time: 0.210 sec, Speed: 4.762 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
div - Software/AVR:
Total Time: 0.170 sec, Speed: 5.882 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mod - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 1.140 sec, Speed: 0.877 kFLOPS
Average Difference: 27.131926
----------------------------------------------------------
sin - Software/AVR:
Total Time: 4.020 sec, Speed: 0.248 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000325
----------------------------------------------------------
cos - Software/AVR:
Total Time: 4.090 sec, Speed: 0.244 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000373
----------------------------------------------------------
tan - Software/AVR:
Total Time: 8.250 sec, Speed: 0.121 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.827708
----------------------------------------------------------
sqrt - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
inverse sqrt - Software/AVR:
Total Time: 0.700 sec, Speed: 1.429 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
((a * cos(b)) + (sin(c) / -d)) - Software/AVR:
Total Time: 8.440 sec, Speed: 0.118 kFLOPS
Total Time: 2.420 sec, Speed: 0.413 kFLOPS
Average Difference: 16090.212891
as expected, all single operations are slower than before, but the last sequential one is pretty much twice as fast. meaning the more work you can do on the MCU without having to move data back and forth the better the performance compared to Version 1.
for anyone curious here is the breakdown of the SPI transfers required for the last operation:
Code:
Operation: (a * cos(b)) + (sin(c) / -d)
Version 1:
COS b - 9 (+2 dummy reads)
MUL a, b - 13 (+2 dummy reads)
SUB 0, d - 13 (+2 dummy reads)
SIN c - 9 (+2 dummy reads)
DIV c, d - 13 (+2 dummy reads)
ADD a, c - 13 (+2 dummy reads)
Version 2:
WRITEF F0, a - 6
WRITEF F1, b - 6
WRITEF F2, c - 6
WRITEF F3, d - 6
LOADZ F7 - 2 (F7 = 0)
COS F1, F1 - 2 (b = cos(b))
MUL F0, F0, F1 - 2 (a = a * b)
SUB F3, F7, F3 - 2 (d = 0 - d)
SIN F2, F2 - 2 (c = sin(c))
DIV F2, F2, F3 - 2 (c = c / -d)
ADD F0, F0, F2 - 2 (a = a + c)
READF F0 - 6 (+2 dummy reads)
total SPI transfers:
Version 1 - 82
Version 2 - 46
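the totals can be double checked with a few lines of code. this is just the arithmetic from the breakdown above put into two made-up helper functions: in Version 1 every command costs 1 command byte, 4 bytes per operand, 4 result bytes and 2 dummy reads, while in Version 2 a WRITEF costs 6, a register operation costs 2, and a READF costs 6 plus 2 dummy reads.

```c
/* Version 1: each command = 1 command byte + 4 bytes per operand
   + 4 result bytes + 2 dummy reads. */
unsigned v1_total(unsigned unary_cmds, unsigned binary_cmds)
{
    unsigned unary  = 1u + 4u + 4u + 2u;       /* 9  (+2 dummy reads) */
    unsigned binary = 1u + 8u + 4u + 2u;       /* 13 (+2 dummy reads) */
    return unary_cmds * unary + binary_cmds * binary;
}

/* Version 2: WRITEF = 2 instruction bytes + 4 data bytes,
   register operation = 2 instruction bytes,
   READF = 2 instruction bytes + 4 result bytes + 2 dummy reads. */
unsigned v2_total(unsigned writes, unsigned ops, unsigned reads)
{
    return writes * 6u + ops * 2u + reads * (6u + 2u);
}
```

`v1_total(2, 4)` gives 82 (COS and SIN plus the four two-operand commands) and `v2_total(4, 7, 1)` gives 46.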
there are still loads of empty places in the Instruction Encoding to add functions and features, so suggestions are welcome! also any suggestions to the general architecture of the FPU for a potential Version 3.
welp, that's it! that's currently how far i've gotten in this project, i'll adapt my 3D renderer to make use of the MCU and see how performance looks (though i can already tell it will be much better).
i'll attach the Arduino files for both version 1 and 2 though they aren't well polished. (both are renamed to txt files because the forum is weirdly limited when it comes to file extensions for some reason)
and once i have cleaned up and separated the code for the master/C side of the interface i'll throw that into its own library and put it on github with all the other information.