6502.org Forum

All times are UTC




PostPosted: Sat Aug 12, 2023 9:16 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
Hi everyone, so this is the newest project i've been working on! using a cheap Microcontroller (with ironically no onboard FPU) as an FPU!

but first some back story,
this all pretty much started because i've gotten my fresh new 16kB dual port SRAM chips for my VGA Card, finally giving it the 64kB of VRAM i wanted. allowing me to do 320x200 @ 256 colors!
so to celebrate i wrote (with some online help) a small 3D renderer to draw a simple wireframe and rotate it around.

of course doing any kind of drawing in software is slow, but my fixed point library and some cheeky optimizations (like redrawing the shape in black instead of clearing the whole screen to save time) made it run fairly well.
though the shape slowly deformed over time, likely due to the limited precision of the Q16.16 fixed point format.
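
To illustrate how Q16.16 loses precision, here is a minimal sketch of a truncating fixed point multiply. It is not the author's fixed point library; the names `q16_16`, `q16_from_double`, `q16_to_double` and `q16_mul` are made up for this example, and truncation (rather than rounding) is an assumption.

```c
#include <stdint.h>

typedef int32_t q16_16;               /* 16 integer bits, 16 fraction bits */

#define Q16_ONE ((q16_16)1 << 16)     /* 1.0 in Q16.16 */

/* Convert by truncating toward zero; anything below 2^-16 is lost. */
static q16_16 q16_from_double(double x) {
    return (q16_16)(x * 65536.0);
}

static double q16_to_double(q16_16 x) {
    return (double)x / 65536.0;
}

/* Multiply in 64 bits, then drop the low 16 fraction bits of the product.
 * (Right-shifting a negative value is implementation-defined in C; most
 * compilers use an arithmetic shift.) */
static q16_16 q16_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
}
```

With these definitions `q16_mul(q16_from_double(1.0 / 3.0), 3 * Q16_ONE)` gives 65535 instead of 65536, so one third times three is already slightly less than 1.0. Errors of this size could plausibly add up when the same points are rotated again and again.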
so i modified the code to make use of Calypsi C's built-in floating point library.
now the shape didn't deform over time anymore, but it was noticeably slower.

So then came the idea, what if i could use an external Microcontroller like an Arduino or similar to do such math operations for the CPU.
For that i chose SPI (cause it's easy and doesn't require a lot of wires or a bi-directional connection) and for the Microcontroller my custom made AVR128DB48 board.

the AVR128DB48 (similar to the Atmega1284p) boasts 128kB of Flash and 16kB of RAM. but unlike the Atmega it runs at 24MHz without an external crystal (overclockable to 32MHz, which is what i run it at) and has multiple UART, SPI, and I2C interfaces.
so i set up a simple test program (2 really) with the AVR-DB board and an Arduino Mega to simply test the SPI Slave mode, since i never used it and it doesn't have official Arduino support.

after getting that working i had to think about a protocol, some way for the master device to give commands and operands to the MCU to then work on and return results.
the interface i ended up with was as simple as you could get. a single byte for the command ID followed by a set amount of operands, and once the command is done the result can be read back.

though there were already at least 2 issues with this idea.
first, AVR doesn't have hardware FIFOs or DMA controllers, so i needed the CPU to manually feed the SPI controller.
for that i got myself a highly customizable circular buffer library and used it to create 2 software FIFOs: one for received bytes and one for sending bytes.
the way the SPI controller on AVR works is that you can set up an Interrupt routine to run whenever a Byte has been sent over SPI.
so my ISR would simply take the received Byte from the SPI data register and put it into the "receive" FIFO, then take a byte from the "send" FIFO and put it into the SPI data register so it will be sent next.
if there are no Bytes in the send FIFO it would write $FF instead. and if there is no space in the receive FIFO it would discard the Byte.
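
The ISR logic described above can be sketched in portable C as follows. This is not the attached code and not the buffer library that was actually used: `fifo_t`, `FIFO_SIZE`, `fifo_push`, `fifo_pop` and `spi_transfer_complete` are hypothetical names, and the SPI data register is modelled as a function argument and return value so the sketch also runs on a PC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_SIZE     64u     /* hypothetical capacity, must be a power of two */
#define SPI_IDLE_BYTE 0xFFu   /* sent when the send FIFO is empty */

typedef struct {
    uint8_t data[FIFO_SIZE];
    volatile uint8_t head;    /* free-running write counter */
    volatile uint8_t tail;    /* free-running read counter */
} fifo_t;

static size_t fifo_count(const fifo_t *f) {
    return (uint8_t)(f->head - f->tail);
}

static bool fifo_push(fifo_t *f, uint8_t byte) {
    if (fifo_count(f) == FIFO_SIZE) return false;    /* full */
    f->data[f->head & (FIFO_SIZE - 1u)] = byte;
    f->head++;
    return true;
}

static bool fifo_pop(fifo_t *f, uint8_t *byte) {
    if (fifo_count(f) == 0) return false;            /* empty */
    *byte = f->data[f->tail & (FIFO_SIZE - 1u)];
    f->tail++;
    return true;
}

static fifo_t rx_fifo, tx_fifo;

/* Body of the "transfer complete" ISR. On real hardware `received` would be
 * read from the SPI data register and the return value written back to it. */
static uint8_t spi_transfer_complete(uint8_t received) {
    uint8_t next;
    (void)fifo_push(&rx_fifo, received);             /* discarded if full */
    if (!fifo_pop(&tx_fifo, &next)) next = SPI_IDLE_BYTE;
    return next;
}
```

The main loop would then only ever touch `rx_fifo` and `tx_fifo`, never the SPI peripheral itself.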

This works very well, i thought about redoing the ISR in assembly for maximum speed, but since it has to interface with the buffer library anyways i doubt i could make it any faster than the Compiler.

Thanks to the buffer library i also had the ability to have the MCU sit and wait until at least some set amount of bytes are in the receiving FIFO (or wait until the sending FIFO has at least some set amount of empty spaces).

so using those i wrote version 1 of the AVR-FPU.
the main loop works like so:
the MCU waits for 1 byte in the receiving FIFO, once it has one it immediately grabs it as that's the command to execute.
it then uses that command ID in a giant switch statement to decode it.
depending on the command it will then wait for more bytes, do some operation (like Add, Sub, Mul, Div, Sin, etc) on them and write the result to the sending FIFO.
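
A minimal sketch of that decode step, under several assumptions: the command IDs are invented, operands and results are taken to be 4-byte floats in the MCU's native byte order, and `CMD_NEG` is a made-up one-operand command standing in for Sin/Cos so the example does not need the math library. The FIFO waiting is left out; the function is handed the operand bytes directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command IDs, not the ones used in the attached code. */
enum { CMD_ADD = 0x01, CMD_SUB, CMD_MUL, CMD_DIV, CMD_NEG };

/* Number of float operands a command expects (0 = unknown command). */
static size_t fpu_v1_operands(uint8_t cmd) {
    switch (cmd) {
    case CMD_ADD: case CMD_SUB: case CMD_MUL: case CMD_DIV: return 2;
    case CMD_NEG: return 1;
    default: return 0;
    }
}

/* Executes one command on raw operand bytes and writes the 4 result bytes.
 * Returns 1 on success, 0 if the command ID is unknown. */
static int fpu_v1_execute(uint8_t cmd, const uint8_t *operands, uint8_t result[4]) {
    float a = 0.0f, b = 0.0f, r;
    size_t n = fpu_v1_operands(cmd);
    if (n == 0) return 0;
    memcpy(&a, operands, sizeof a);
    if (n == 2) memcpy(&b, operands + 4, sizeof b);
    switch (cmd) {
    case CMD_ADD: r = a + b; break;
    case CMD_SUB: r = a - b; break;
    case CMD_MUL: r = a * b; break;
    case CMD_DIV: r = a / b; break;
    default:      r = -a;    break;   /* CMD_NEG */
    }
    memcpy(result, &r, sizeof r);
    return 1;
}
```

In the real main loop the MCU would first wait for `4 * fpu_v1_operands(cmd)` bytes in the receiving FIFO, then push the 4 result bytes into the sending FIFO.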

but this made me aware of the second issue, how do i make sure the result is read with no gaps?
with that i mean that SPI is always a 2 way street, whenever you write a byte you also read one at the same time.
so one edge case i thought of was when the master is writing a command to the MCU it could partly read the result of the previous operation, which would require a similar FIFO structure on the master to handle properly.
another potential edge case is reading a byte of the result just as it was put into the buffer, then trying to read the second byte before it was put in, returning $FF and screwing up all future reads by offsetting them by a byte.

to solve both of these issues in exchange for some (maybe a lot of) performance i've opted to add an interrupt line from the MCU to the master.
whenever the MCU has finished a command and has written the entire result to the sending FIFO it will pull the interrupt line low and wait for the sending FIFO to be empty. once it is, it will pull the line high again and continue with the next command.

of course this lowers performance as the MCU will wait after each operation, but it makes the interface very reliable (as i've tested it a lot and never noticed any issues or missing/corrupted data)

after that i went ahead and wrote a very simple benchmark program for various floating point operations.
it repeats each operation 1000 times while recording the time it took, it then calculates the average "FLoating point Operations Per Second" (aka FLOPS, reported here in kFLOPS).
it does this for both the software and MCU functions, finally it then compares the results of both and records the average difference between them (as both use different floating point implementations, so there will always be some difference/error)
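
The two numbers the benchmark reports can be computed roughly like this. The function names are made up, and using the mean of the absolute differences is an assumption; the post only says "average difference".

```c
#include <stddef.h>

/* Thousands of operations per second for a timed run. */
static double kflops(unsigned long operations, double seconds) {
    return (double)operations / seconds / 1000.0;
}

/* Mean absolute difference between software and MCU results.
 * Returns 0.0 for an empty run. */
static double average_difference(const float *sw, const float *mcu, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)sw[i] - (double)mcu[i];
        sum += (d < 0.0) ? -d : d;
    }
    return (n > 0) ? sum / (double)n : 0.0;
}
```

For example, 1000 operations in 0.060 sec works out to about 16.667 kFLOPS, which matches the first line of the results below.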

here are the results for some of the operations:

Code:
----------------------------------------------------------
add - Software/AVR:
Total Time: 0.060 sec, Speed: 16.667 kFLOPS
Total Time: 0.770 sec, Speed: 1.299 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
sub - Software/AVR:
Total Time: 0.070 sec, Speed: 14.286 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mul - Software/AVR:
Total Time: 0.210 sec, Speed: 4.762 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
div - Software/AVR:
Total Time: 0.160 sec, Speed: 6.250 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mod - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 0.780 sec, Speed: 1.282 kFLOPS
Average Difference: 27.131926
----------------------------------------------------------
sin - Software/AVR:
Total Time: 4.020 sec, Speed: 0.248 kFLOPS
Total Time: 0.630 sec, Speed: 1.587 kFLOPS
Average Difference: 0.000325
----------------------------------------------------------
cos - Software/AVR:
Total Time: 4.090 sec, Speed: 0.244 kFLOPS
Total Time: 0.630 sec, Speed: 1.587 kFLOPS
Average Difference: 0.000373
----------------------------------------------------------
tan - Software/AVR:
Total Time: 8.260 sec, Speed: 0.121 kFLOPS
Total Time: 0.650 sec, Speed: 1.538 kFLOPS
Average Difference: 0.827708
----------------------------------------------------------
sqrt - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 0.580 sec, Speed: 1.724 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
inverse sqrt - Software/AVR:
Total Time: 0.690 sec, Speed: 1.449 kFLOPS
Total Time: 0.590 sec, Speed: 1.695 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
((a * cos(b)) + (sin(c) / -d)) - Software/AVR:
Total Time: 8.440 sec, Speed: 0.118 kFLOPS
Total Time: 4.310 sec, Speed: 0.232 kFLOPS
Average Difference: 16090.212891


simple operations like Add, Sub, Mul, and Div are actually faster in software, likely due to the interface bottleneck (software SPI)
trigonometric functions like Sin, Cos, and Tan are way faster on the MCU. though Tan has a noticeably higher error compared to the other 2.
modulo also has a very high error rate, i looked at the actual results for both and the software float function seems to have a bug and occasionally spits out weird values, throwing off the average. the MCU results were pretty much always correct.
and the final one is actually just a more complicated sequence of simple operations. the MCU performed well very likely due to the trigonometric functions. you'll also notice the high error rate, and like with modulo there seems to be a bug in the software float library and it sometimes just spits out seemingly random values while the MCU results were normal.

anyways after this i started working on Version 2, since Version 1 was pretty much just a POC.



Version 2 is a lot more complicated and while it uses the same interface and FIFOs on the MCU, the commands work very much differently.
instead of simply taking in floating point numbers, doing some work, and returning a result, it would now be closer to an actual FPU like the 8087, with its own set of registers and all operations would work on those instead.

so i designed the emulated architecture of Version 2 to be similar to a RISC CPU, with all operations being done exclusively on registers. but it has no program or data memory (though that might change for Version 3) and has all of its instructions fed by the master.

Here are the contents of the text file i used to define everything:

Code:
Registers:
F0-F7       - General Purpose Floating Point Registers (32-bit)
SR          - Status Register (16-bit)

SR bits:

15 - 0              (constant 0) \
14 - 0              (constant 0) |
13 - 0              (constant 0) |
12 - 0              (constant 0) |
11 - 0              (constant 0) |-- Currently Unused if you couldn't tell :p
10 - 0              (constant 0) |
 9 - 0              (constant 0) |
 8 - 0              (constant 0) /
 7 - Infinity       (Ra == INF, only updated by TEST)
 6 - NaN            (Ra == NaN, only updated by TEST)
 5 - Greater Than   (Ra >  Rb)
 4 - Greater Equal  (Ra >= Rb)
 3 - Equal          (Ra == Rb)
 2 - Nearly Equal   (Ra ~= Rb) (threshold = 0.001f)
 1 - Less Equal     (Ra <= Rb)
 0 - Less Than      (Ra <  Rb)



Instruction set:

Ra = Source Register A
Rb = Source Register B
Re = Destination Register

Write Integer (16-bit)      - WRITE16 Re, k         - 01 000 000 eee 00000 kkkkkkkk kkkkkkkk
Write Integer (32-bit)      - WRITE32 Re, k         - 10 000 000 eee 00000 kkkkkkkk kkkkkkkk kkkkkkkk kkkkkkkk
Write Float                 - WRITEF  Re, k         - 11 000 000 eee 00000 kkkkkkkk kkkkkkkk kkkkkkkk kkkkkkkk

Read Integer (16-bit)       - READ16 Ra             - 01 001 aaa 000 00000 (returns 2 Bytes)
Read Integer (32-bit)       - READ32 Ra             - 10 001 aaa 000 00000 (returns 4 Bytes)
Read Float                  - READF  Ra             - 11 001 aaa 000 00000 (returns 4 Bytes)

Read Status                 - READS                 - 00 010 000 000 00000 (returns 2 Bytes)

Load Constant (0)           - LOADZ Re              - 00 011 000 eee 00000
Load Constant (1)           - LOADO Re              - 01 011 000 eee 00000
Load Constant (PI)          - LOADP Re              - 10 011 000 eee 00000
Load Constant (E)           - LOADE Re              - 11 011 000 eee 00000

Absolute Value              - ABS Re, Ra            - 00 000 aaa eee 00001      Re = abs(Ra)
Negate Register             - NEG Re, Ra            - 01 000 aaa eee 00001      Re = -Ra
Degree to Radian            - DTR Re, Ra            - 10 000 aaa eee 00001      Re = Ra * (PI / 180)
Radian to Degree            - RTD Re, Ra            - 11 000 aaa eee 00001      Re = Ra * (180 / PI)

Round down (Floor)          - FLOOR Re, Ra          - 00 001 aaa eee 00001      Re = floor(Ra)
Round closest               - ROUND Re, Ra          - 01 001 aaa eee 00001      Re = round(Ra)
Round up (Ceiling)          - CEIL  Re, Ra          - 10 001 aaa eee 00001      Re = ceil(Ra)
Round to 0 (Truncate)       - TRUNC Re, Ra          - 11 001 aaa eee 00001      Re = trunc(Ra)

Transfer Register           - TRF Re, Ra            - 00 010 aaa eee 00001      Re = Ra
Exchange Registers          - EXCH Re, Ra           - 01 010 aaa eee 00001      Re = Ra; Ra = Re (swaps the contents)

Test against Zero           - TEST Ra               - 00 000 aaa 000 00010      Compare Ra with 0
Compare Registers           - COMP Ra, Rb           - 01 bbb aaa 000 00010      Compare Ra with Rb

Add                         - ADD Re, Ra, Rb        - 00 bbb aaa eee 00011      Re = Ra + Rb
Subtract                    - SUB Re, Ra, Rb        - 01 bbb aaa eee 00011      Re = Ra - Rb
Multiply                    - MUL Re, Ra, Rb        - 10 bbb aaa eee 00011      Re = Ra * Rb
Divide                      - DIV Re, Ra, Rb        - 11 bbb aaa eee 00011      Re = Ra / Rb

Modulo                      - MOD Re, Ra, Rb        - 00 bbb aaa eee 00100      Re = Ra % Rb
Power                       - POW Re, Ra, Rb        - 01 bbb aaa eee 00100      Re = Ra ^ Rb
Exponent                    - EXP Re, Ra            - 10 000 aaa eee 00100      Re =  e ^ Ra
Logarithm                   - LOG Re, Ra            - 11 000 aaa eee 00100      Re = log(Ra)

Decimal Logarithm           - LOG10 Re, Ra          - 00 000 aaa eee 00101      Re = log10(Ra)
Binary Logarithm            - LOG2  Re, Ra          - 01 000 aaa eee 00101      Re = log2(Ra)
Square Root                 - SQRT  Re, Ra          - 10 000 aaa eee 00101      Re = sqrt(Ra)
Inverse Square Root         - ISQRT Re, Ra          - 11 000 aaa eee 00101      Re = 1 / sqrt(Ra) (also yes it uses the Fast Inverse Square Root trick from Quake III)

Sine                        - SIN Re, Ra            - 00 000 aaa eee 00110      Re = sin(Ra)
Cosine                      - COS Re, Ra            - 01 000 aaa eee 00110      Re = cos(Ra)
Tangent                     - TAN Re, Ra            - 10 000 aaa eee 00110      Re = tan(Ra)

Inverse Sine                - ASIN Re, Ra           - 00 000 aaa eee 00111      Re = asin(Ra)
Inverse Cosine              - ACOS Re, Ra           - 01 000 aaa eee 00111      Re = acos(Ra)
Inverse Tangent             - ATAN Re, Ra           - 10 000 aaa eee 00111      Re = atan(Ra)

Reset                       - RESET                 - 11 111 111 111 11110      Clears all Registers and both FIFO buffers
No Operation                - NOP                   - 11 111 111 111 11111      Does nothing
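
The bit patterns in the table above can be packed into a 16-bit word as sketched here. This assumes the patterns are written most significant bit first (2-bit sub-opcode, then Rb, Ra, Re, then the 5-bit group); the byte order on the SPI wire is not stated in the post. `fpu_encode` and the `FPU_*` macros are hypothetical names.

```c
#include <stdint.h>

/* Field layout, MSB first: ss bbb aaa eee ggggg */
static uint16_t fpu_encode(unsigned sub, unsigned rb, unsigned ra,
                           unsigned re, unsigned group) {
    return (uint16_t)(((sub   & 0x03u) << 14) |
                      ((rb    & 0x07u) << 11) |
                      ((ra    & 0x07u) <<  8) |
                      ((re    & 0x07u) <<  5) |
                       (group & 0x1Fu));
}

/* A few instructions from the table, built from the generic encoder. */
#define FPU_ADD(re, ra, rb)  fpu_encode(0, (rb), (ra), (re), 0x03)
#define FPU_MUL(re, ra, rb)  fpu_encode(2, (rb), (ra), (re), 0x03)
#define FPU_COS(re, ra)      fpu_encode(1, 0,    (ra), (re), 0x06)
#define FPU_WRITEF(re)       fpu_encode(3, 0,    0,    (re), 0x00)
#define FPU_READF(ra)        fpu_encode(3, 1,    (ra), 0,    0x00)
#define FPU_NOP              fpu_encode(3, 7,    7,    7,    0x1F)
```

Under this reading `FPU_ADD(0, 0, 2)` (ADD F0, F0, F2) is 0x1003 and NOP is 0xFFFF, so two idle $FF bytes would decode as a NOP.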

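As a sketch of how COMP could fill the lower six status bits from the SR table above: the bit positions and the 0.001f threshold come from the table, while the `SR_*` names, the function name, and the use of a strict "less than" on the absolute difference for Nearly Equal are assumptions.

```c
#include <stdint.h>

#define SR_LT   (1u << 0)   /* Ra <  Rb */
#define SR_LE   (1u << 1)   /* Ra <= Rb */
#define SR_NEAR (1u << 2)   /* Ra ~= Rb */
#define SR_EQ   (1u << 3)   /* Ra == Rb */
#define SR_GE   (1u << 4)   /* Ra >= Rb */
#define SR_GT   (1u << 5)   /* Ra >  Rb */

#define SR_NEAR_THRESHOLD 0.001f

/* Returns bits 5..0 of SR for a comparison of ra with rb. */
static uint16_t fpu_compare_flags(float ra, float rb) {
    uint16_t sr = 0;
    float diff = ra - rb;
    if (diff < 0.0f) diff = -diff;
    if (ra <  rb) sr |= SR_LT;
    if (ra <= rb) sr |= SR_LE;
    if (diff < SR_NEAR_THRESHOLD) sr |= SR_NEAR;
    if (ra == rb) sr |= SR_EQ;
    if (ra >= rb) sr |= SR_GE;
    if (ra >  rb) sr |= SR_GT;
    return sr;
}
```

Since bits 7 and 6 are only updated by TEST, a COMP implementation would presumably merge the result with something like `sr = (sr & 0x00C0) | fpu_compare_flags(ra, rb)`.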

the main benefit of doing operations like this instead of like in Version 1 is that doing multiple operations in sequence will be faster, as you don't need to constantly transfer the floats between master and MCU, they simply stay on the MCU in the registers.
this also means that individual operations like Version 1 did are actually slower than before, since you need 3-4 instructions in total, one or two to WRITE the values to the MCU, one to do the operation, and one to read the result back.

also the MCU still has the feature where it stops all operations, pulls the interrupt line low, and waits for the sending FIFO to be empty. but because that only happens with the READ instructions, it can actually continuously work on some data given that the FIFO has a backlog of instructions in it.

so here are the benchmark results for Version 2:
Code:
----------------------------------------------------------
add - Software/AVR:
Total Time: 0.060 sec, Speed: 16.667 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
sub - Software/AVR:
Total Time: 0.060 sec, Speed: 16.667 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mul - Software/AVR:
Total Time: 0.210 sec, Speed: 4.762 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
div - Software/AVR:
Total Time: 0.170 sec, Speed: 5.882 kFLOPS
Total Time: 1.150 sec, Speed: 0.869 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
mod - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 1.140 sec, Speed: 0.877 kFLOPS
Average Difference: 27.131926
----------------------------------------------------------
sin - Software/AVR:
Total Time: 4.020 sec, Speed: 0.248 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000325
----------------------------------------------------------
cos - Software/AVR:
Total Time: 4.090 sec, Speed: 0.244 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000373
----------------------------------------------------------
tan - Software/AVR:
Total Time: 8.250 sec, Speed: 0.121 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.827708
----------------------------------------------------------
sqrt - Software/AVR:
Total Time: 0.610 sec, Speed: 1.639 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
inverse sqrt - Software/AVR:
Total Time: 0.700 sec, Speed: 1.429 kFLOPS
Total Time: 0.840 sec, Speed: 1.190 kFLOPS
Average Difference: 0.000000
----------------------------------------------------------
((a * cos(b)) + (sin(c) / -d)) - Software/AVR:
Total Time: 8.440 sec, Speed: 0.118 kFLOPS
Total Time: 2.420 sec, Speed: 0.413 kFLOPS
Average Difference: 16090.212891

as expected, all single operations are slower than before, but the last sequential one is about 1.8 times as fast (2.420 sec instead of 4.310 sec). meaning the more work you can do on the MCU without having to move data back and forth the better the performance compared to Version 1.

for anyone curious here is the breakdown of the SPI transfers required for the last operation:
Code:
Operation: (a * cos(b)) + (sin(c) / -d)

Version 1:
COS b               -  9 (+2 dummy reads)
MUL a, b            - 13 (+2 dummy reads)
SUB 0, d            - 13 (+2 dummy reads)
SIN c               -  9 (+2 dummy reads)
DIV c, d            - 13 (+2 dummy reads)
ADD a, c            - 13 (+2 dummy reads)


Version 2:
WRITEF F0, a        - 6
WRITEF F1, b        - 6
WRITEF F2, c        - 6
WRITEF F3, d        - 6
LOADZ F7            - 2     (F7 = 0)
COS F1, F1          - 2     (b = cos(b))
MUL F0, F0, F1      - 2     (a = a * b)
SUB F3, F7, F3      - 2     (d = 0 - d)
SIN F2, F2          - 2     (c = sin(c))
DIV F2, F2, F3      - 2     (c = c / -d)
ADD F0, F0, F2      - 2     (a = a + c)
READF F0            - 6 (+2 dummy reads)

total SPI transfers:
Version 1 - 82
Version 2 - 46
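
The totals can be re-derived with a few lines of arithmetic. The per-command byte counts are an interpretation of the breakdown above (Version 1: 1 command byte, 4 bytes per operand, 4 result bytes; Version 2: 2-byte instructions plus 4 data bytes for WRITEF/READF), and all names are made up for the example.

```c
enum { DUMMY_READS = 2 };   /* extra reads listed per result read-back */

/* Version 1: command byte + operands + 32-bit result. */
static unsigned v1_command(unsigned operands) {
    return 1u + 4u * operands + 4u;
}

/* Version 2: every instruction is 16 bits; WRITEF/READF move 4 more bytes. */
static unsigned v2_instruction(void) { return 2u; }
static unsigned v2_write_float(void) { return 2u + 4u; }
static unsigned v2_read_float(void)  { return 2u + 4u; }

/* (a * cos(b)) + (sin(c) / -d) on Version 1:
 * COS and SIN take one operand, MUL, SUB, DIV and ADD take two. */
static unsigned v1_total(void) {
    return 2u * (v1_command(1) + DUMMY_READS) +
           4u * (v1_command(2) + DUMMY_READS);
}

/* Same expression on Version 2:
 * 4 WRITEF, 7 register-only instructions, 1 READF with its dummy reads. */
static unsigned v2_total(void) {
    return 4u * v2_write_float() + 7u * v2_instruction() +
           v2_read_float() + DUMMY_READS;
}
```

This gives 82 for Version 1 and 46 for Version 2, the same totals as listed.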


there are still loads of empty places in the Instruction Encoding to add functions and features, so suggestions are welcome! also any suggestions to the general architecture of the FPU for a potential Version 3.

welp, that's it! that's currently how far i've gotten in this project, i'll adapt my 3D renderer to make use of the MCU and see how performance looks (though i can already tell it will be much better).
i'll attach the Arduino files for both version 1 and 2 though they aren't well polished. (both are renamed to txt files because the forum is weirdly limited when it comes to file extensions for some reason)
and once i have cleaned up and separated the code for the master/C side of the interface i'll throw that into its own library and put it on github with all the other information.


Attachments:
SPI_CoProcessor_v2.txt [14.59 KiB]
Downloaded 75 times
SPI_CoProcessor_v1.txt [15.05 KiB]
Downloaded 53 times
PostPosted: Sat Aug 12, 2023 9:29 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
Will study this in detail later, but FWIW, I've been using the ATmega that 'hosts' my Ruby '816 board to off-load IEEE-754 floating point (and 32-bit integer multiply and division) for some time. This is only from my BCPL operating system though

I do have a mutually-exclusive shared memory interface rather than SPI, but it still has a higher latency than I'd like. That's not going to change for some time.

I can't do any comparisons between software on the '816 compared to off-loading it although it's faster than BBC Basic on the same hardware - however that has a 5-byte FP format rather than the 4 bytes of '754.

I write a command byte then 2 operand words (32-bits) into the transfer area, ping the ATmega and wait for the reply then pick up the 32-bit result.

I did some benchmarks here:

https://projects.drogon.net/retro-basic ... enchmarks/

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Sat Aug 12, 2023 10:11 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Nice projects both! I'd wondered about this sort of thing but gravitated towards the kind of shared-bus architecture that ARM2 used. It may also be how x87 worked, but I don't know about that. The coprocessor only requires connection to the data bus, the clock, and perhaps some control signals for things like stretching the CPU's clock.

Essentially the regular CPU always drives the address bus, but when it sees a coprocessor opcode it signals the coprocessor to act but still keeps performing the address bus operations that are required, while the coprocessor uses the data bus to talk to RAM or the CPU itself. The CPU needs to be running appropriate opcodes with the right memory accesses for whatever the coprocessor operations are. One way to do this is through a prefix code.

I believe Dr Jefyll has used a similar mechanism for adding custom instructions, it seems quite viable.


PostPosted: Sat Aug 12, 2023 10:27 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
It sounds like you're pretty much finished with the project; but you could see how AWC did it, at http://www.awce.com/pak1.htm (these are available for sale), and how Micromega did it, at https://web.archive.org/web/20190726225 ... ducts.html .  The latter seems to be out of business.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Sat Aug 12, 2023 11:20 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
GARTHWILSON wrote:
It sounds like you're pretty much finished with the project; but you could see how AWC did it, at http://www.awce.com/pak1.htm (these are available for sale), and how Micromega did it, at https://web.archive.org/web/20190726225 ... ducts.html . The latter seems to be out of business.


i did take a look at PAK and the uM-FPU before starting the project, because i was looking for an existing modern FPU that could be hooked into a System.
but sadly PAK is way too expensive and uM-FPU is dead so something custom is the next best thing.

in terms of functionality PAK seems the closest to my Version 2 design, with the main difference being that my design has more registers to work with, but also longer instructions.
uM-FPU also seems similar but it has some additional features that i would like to do for Version 3 as well. like custom functions and matrix instructions.

ideally by the "end" (there likely is no end as there is always more to improve and add) of this i'd like to have an open source version of these kinds of chips. it wouldn't be as performant as i assume PAK I/II and uM-FPU use some degree of custom hardware, but it would be a cheaper alternative.

gfoot wrote:
Nice projects both! I'd wondered about this sort of thing but gravitated towards the kind of shared-bus architecture that ARM2 used. It may also be how x87 worked, but I don't know about that. The coprocessor only requires connection to the data bus, the clock, and perhaps some control signals for things like stretching the CPU's clock.

Essentially the regular CPU always drives the address bus, but when it sees a coprocessor opcode it signals the coprocessor to act but still keeps performing the address bus operations that are required, while the coprocessor uses the data bus to talk to RAM or the CPU itself. The CPU needs to be running appropriate opcodes with the right memory accesses for whatever the coprocessor operations are. One way to do this is through a prefix code.

a bus shared design would certainly help with throughput!
though i don't know about having it inline with the CPU's instruction stream. having it more independent of the CPU would likely be better as it keeps the hardware simpler and allows both to run at the same time.
so you'd just have an area in memory that is shared between both, place the code and data, signal the MCU to start, and once done you get an interrupt and can read the results. (similar to drogon's design if i understood it right).


PostPosted: Sun Aug 13, 2023 5:43 am 

Joined: Mon Jan 19, 2004 12:49 pm
Posts: 983
Location: Potsdam, DE
If it helps, one of my early projects (late seventies) bolted a pocket calculator onto the side of a 6502. It had to emulate keyboard presses to get the data in, and to decode the LED seven segment output to get the result.

Speedy, quick, fast, like lightning... all words which could not be applied to the system.

Neil


PostPosted: Sun Aug 13, 2023 9:23 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
I think it's a good lesson that designing the protocol is a fairly big chunk of this sort of project.

I think some people go wrong by making something too complicated, and worry about errors, or timeouts, or resynchronising. Assume a reliable channel, that's a great place to be.

Quote:
For that i chose SPI (cause it's easy, doesn't require a lot of wires, or bi-directional connection)


SPI is nice, not only for reasons given but also because pretty much any microcontroller can play along - whatever you have today, and whatever anyone else might have, or something which you might get hold of tomorrow.

Hooking up an actual calculator is great, and very retro!


PostPosted: Sun Aug 13, 2023 9:35 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
One idea, perhaps, for a protocol like this, where either side might not know what's coming back when something was sent, is to have FF mean 'idle'. Any other value in the bonus reply byte means 'something' and is either an identifier or a length, and tells the receiver how many more bytes, if any, to read.

Another possibly useful tactic, perhaps, is byte-stuffing, where FF always means nothing, and a real FF is a two byte sequence. You need to set aside another value, as the first byte of the sequence. So, for example, FF means nothing, 1B 00 means 1B, and 1B 01 means FF, and all other bytes mean themselves. There might be less clumsy ways of doing this!
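
A minimal sketch of the byte-stuffing scheme from the example above (FF = nothing, 1B 00 = 1B, 1B 01 = FF). The function names are made up, and the decoder here is stateless, so it assumes an escape pair is never split across two calls or interrupted by an idle FF.

```c
#include <stddef.h>
#include <stdint.h>

#define STUFF_IDLE 0xFFu   /* "nothing to send" */
#define STUFF_ESC  0x1Bu   /* first byte of a two-byte sequence */

/* Encodes n bytes; `out` must have room for 2 * n bytes.
 * Returns the number of bytes written. */
static size_t stuff_encode(const uint8_t *in, size_t n, uint8_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == STUFF_ESC) {
            out[count++] = STUFF_ESC;
            out[count++] = 0x00;
        } else if (in[i] == STUFF_IDLE) {
            out[count++] = STUFF_ESC;
            out[count++] = 0x01;
        } else {
            out[count++] = in[i];
        }
    }
    return count;
}

/* Decodes a received stream, skipping idle bytes; `out` needs room for n bytes.
 * Stops early at an invalid or cut-off escape sequence.
 * Returns the number of data bytes recovered. */
static size_t stuff_decode(const uint8_t *in, size_t n, uint8_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t b = in[i];
        if (b == STUFF_IDLE) continue;
        if (b == STUFF_ESC) {
            if (++i >= n) break;                 /* escape cut off at the end */
            if (in[i] == 0x00)      b = STUFF_ESC;
            else if (in[i] == 0x01) b = STUFF_IDLE;
            else break;                          /* invalid escape */
        }
        out[count++] = b;
    }
    return count;
}
```

The cost is that a float containing many $FF or $1B bytes can take up to twice as many transfers.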


PostPosted: Sun Aug 13, 2023 9:48 am 

Joined: Mon Mar 06, 2023 9:26 am
Posts: 88
Location: UK
Or you could send FF repeatedly to mean "I'm not sending anything", then send a packet length <= 127 when the sender actually has data. That way you can just wait for a byte that has bit 7 clear, and then start listening for the actual data.
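
A small sketch of that length-prefix idea, working on a buffer of already received bytes. The names `frame_send` and `frame_receive` are made up, and treating every byte with bit 7 set (not just FF) as idle is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_MAX_LEN 127u   /* length byte must have bit 7 clear */

/* Writes length byte + payload into `wire` (needs len + 1 bytes).
 * Returns the number of bytes written, or 0 if the payload is too long. */
static size_t frame_send(const uint8_t *payload, uint8_t len, uint8_t *wire) {
    if (len > FRAME_MAX_LEN) return 0;
    wire[0] = len;
    memcpy(wire + 1, payload, len);
    return (size_t)len + 1u;
}

/* Scans `wire` for the first byte with bit 7 clear, takes it as the length
 * and copies that many bytes into `payload`.
 * Returns the payload length, or -1 if no complete packet is present. */
static int frame_receive(const uint8_t *wire, size_t n, uint8_t *payload) {
    size_t i = 0;
    while (i < n && (wire[i] & 0x80u)) i++;   /* skip idle bytes */
    if (i == n) return -1;                    /* nothing but idle bytes */
    size_t len = wire[i++];
    if (n - i < len) return -1;               /* packet not complete yet */
    memcpy(payload, wire + i, len);
    return (int)len;
}
```

Payload bytes may still contain $FF, since after the length byte the receiver just counts bytes instead of inspecting them.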

_________________
probably the youngest person on this forum


PostPosted: Sun Aug 13, 2023 9:55 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
That's simpler!


PostPosted: Sun Aug 13, 2023 3:35 pm 

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
Proxy wrote:

a bus shared design would certainly help with throughput!
though i don't know about having it inline with the CPU's instruction stream. having it more independent of the CPU would likely be better as it keeps the hardware simpler and allows both to run at the same time.
so you'd just have an area in memory that is shared between both, place the code and data, signal the MCU to start, and once done you get an interrupt and can read the results. (similar to drogon's design if i understood it right).


Sort of. What happens is that the 65xx side writes a command byte and (optional) data into a fixed memory area ($FF00 - $FF87) then executes WAI instruction.

At which point, the ATmega wakes up - it's been polling the Rdy signal. The ATmega pulls BE Low on the 65xx, then attaches itself to the bus by un-tristating the signals required (8 data bits, 8 address bits, /Rd and /Rw to the RAM chip) then reads the data, interprets the code (used for many things other than math - e.g. serial IO, disk, and other stuff) writes data back to the RAM, then reverses the process and finally sends an NMI back to the 65xx which takes it out of WAI state.

There is added latency in all this so I try to avoid trivial calls - the serial put is actually buffered on the 65xx side for example as it's almost as fast to send one byte as it is to send 128.

There is another small issue in that the data bus is flipped bit 0 for 7, 1 for 6, etc. so data in/out of the ATmega is passed through a lookup table to bit-flip. This was due to the way the PCB was laid out - I could have spent more time "fixing" it, but felt that was an OK compromise. (and I've yet to work out if this bit flipping/table lookup is faster on the 65xx side or the ATmega side - currently it's done on the ATmega side)
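
A bit-flip lookup table of the kind described could be built like this. It is only a sketch, not the actual Ruby firmware; `reverse_bits`, `flip_table` and `flip_table_init` are hypothetical names.

```c
#include <stdint.h>

/* Reverses the bit order of one byte: bit 0 <-> 7, 1 <-> 6, and so on. */
static uint8_t reverse_bits(uint8_t v) {
    v = (uint8_t)((v >> 4) | (v << 4));                    /* swap nibbles */
    v = (uint8_t)(((v & 0xCC) >> 2) | ((v & 0x33) << 2));  /* swap bit pairs */
    v = (uint8_t)(((v & 0xAA) >> 1) | ((v & 0x55) << 1));  /* swap neighbours */
    return v;
}

static uint8_t flip_table[256];

/* Fills the table once at start-up; afterwards a flip is one array read. */
static void flip_table_init(void) {
    for (unsigned i = 0; i < 256; i++) {
        flip_table[i] = reverse_bits((uint8_t)i);
    }
}
```

Because the mapping is its own inverse, the same table works for both directions: `flip_table[flip_table[x]] == x`.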

The ATmega side is also all coded in C - because life's too short and it's good enough.

Data xfer speed of a 'raw' file over the interface is about 40KB/sec right now which is good enough in 128 byte chunks.

To make a true shared memory interface would require something better than an ATmega and/or some more complex circuitry. A thing I did consider was a memory mapped (in the peripheral sense) FIFO - similar to the Acorn/BBC Micro "Tube" interface but it would complicate hardware somewhat.

-Gordon



PostPosted: Sun Aug 13, 2023 6:05 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
allisonlastname wrote:
Or you could send FF repeatedly to mean "I'm not sending anything", then send a packet length <= 127 when the sender actually has data. That way you can just wait for a byte that has bit 7 clear, and then start listening for the actual data.

before i settled on the interrupt line and blocking READ commands i did think about preloading a $00 byte before the start of data, with Version 1 a size parameter was not needed as all responses were 32-bits wide. with Version 2 responses can vary.
and encoding both the start and size of a packet into a single byte is pretty genius!
though the downside is that it will make every response 1 byte longer, therefore lowering throughput. though technically it's 2 bytes longer as all transfers have to be 16-bits wide.

hmm... what if in addition to the Interrupt line (which would just go low when there is data in the sending FIFO) there was a R/W line from the master to the MCU?
so you can specify to the MCU whether you're reading or writing through SPI. that would make it similar to 3-wire SPI as you can only either read or write but not both at the same time. that way you could load up the receiving FIFO with a lot of operations at once and read back the results independently, but it would also technically hurt performance as you cannot overlap reads and writes anymore.

.

very interesting drogon! so the CPU simply makes the AVR the temporary bus master so it can do its thing.

drogon wrote:
To make a true shared memory interface would require something better than an ATmega and/or some more complex circuitry. A thing I did consider was a memory mapped (in the peripheral sense) FIFO - similar to the Acorn/BBC Micro "Tube" interface but it would complicate hardware somewhat.

well a FIFO would only need a bit of extra decoding logic on the CPU side, while an MCU can handle the other side by itself.

a Hardware FIFO avoids the need for Software FIFOs on both the MCU and master side, plus it would increase bandwidth as you're reading/writing whole bytes at once. and since it gives both sides access to flags that show whether there is any data to be read in the first place, complicated packet protocols with size bytes and such would not be required.
also, unlike with regular SPI, data can be read from and written to the FIFOs independently of each other.
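
as a rough sketch of what the CPU side of such a flag-based FIFO could look like in C: the register layout, the flag bits, and the function names below are all assumptions for illustration, not an existing card or chip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout; a real card would define its own. */
typedef struct {
    volatile uint8_t data;    /* read: pop incoming FIFO, write: push outgoing FIFO */
    volatile uint8_t status;  /* flag bits below */
} fifo_regs_t;

#define FIFO_RX_READY 0x01    /* assumed: at least one byte is waiting to be read */
#define FIFO_TX_FULL  0x02    /* assumed: outgoing FIFO cannot take another byte  */

/* Non-blocking read: returns 1 and stores a byte if one is waiting, else 0. */
int fifo_try_read(fifo_regs_t *f, uint8_t *byte)
{
    assert(f != NULL && byte != NULL);
    if (!(f->status & FIFO_RX_READY))
        return 0;
    *byte = f->data;
    return 1;
}

/* Non-blocking write: returns 1 if the byte was accepted, else 0. */
int fifo_try_write(fifo_regs_t *f, uint8_t byte)
{
    assert(f != NULL);
    if (f->status & FIFO_TX_FULL)
        return 0;
    f->data = byte;
    return 1;
}
```

on real hardware the pointer would be set to wherever the decoding logic maps the FIFO. the two functions never touch each other's flag, which is the "independent reads and writes" part.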

hmm, that might be worth making an expansion card for.


PostPosted: Mon Aug 14, 2023 8:27 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Next stage of development is taken up in a new topic, STM interface for 6502 boot and uart - PoC.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Tue Aug 22, 2023 8:13 pm 

Joined: Fri Aug 03, 2018 8:52 am
Posts: 746
Location: Germany
i don't really think that that is the "next stage" for this project.
my idea is pretty much just a peripheral that you can add to existing 65xx systems but is not required for them to function, while that thread is more about using an MCU to boot a 65xx system.

though speaking of STM, i do have an STM32 Bluepill. while it doesn't have an onboard FPU, it still runs at 72MHz and has a full 32-bit CPU Core, so it should still perform much better than the AVR.
downside is that it only runs at 3.3V, though it does have 5V tolerant IO. technically 3.3V is just about too low for the VIA to read as a logic high, but i'm sure that the actual hardware is more lenient than the datasheet makes it seem. i do have some level shifters in case it doesn't work.

anyways, the main reason i haven't posted an update before is because i've been struggling to port the v2 code over to the STM32. it's written for the arduino IDE, which usually makes porting pretty easy, but because i'm dealing with peripheral interrupts and manually fiddling with MCU internal bits, there is quite a bit of hardware dependent stuff that i needed to rewrite manually.

and so far i've not been able to get the SPI interrupt working, despite setting all the bits required in the SPI and Interrupt controller. so i've been kinda looking for alternatives.
external pin interrupts are natively supported by the arduino library, so they're easy to do on the STM32. the idea is to trigger an interrupt when the SS line goes low (ie the master is about to transfer data), and then have the CPU manually poll the SPI controller's status register to hand-feed it bytes until the SS line goes high again.
i tried that on the AVR and it cannot keep up with my SBC doing software SPI, so i don't have high hopes for the STM32 doing better despite its much much higher clock speed.
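
for illustration, the polling approach described above could be structured roughly like this in C. this is a sketch under assumptions: the hook names are hypothetical stand-ins for the SS pin read and the SPI status/data registers, not real Arduino or STM32 calls, and the fake master at the bottom only exists so the loop can be exercised on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware hooks; on a real MCU these would wrap the SS pin
   and the SPI controller's status and data registers. */
typedef struct {
    int     (*ss_is_low)(void *ctx);              /* nonzero while the master selects us */
    int     (*byte_ready)(void *ctx);             /* nonzero once a byte has been received */
    uint8_t (*exchange)(void *ctx, uint8_t next); /* take received byte, queue next reply  */
    void    *ctx;
} spi_hooks_t;

/* Meant to be called from the SS falling-edge interrupt.
   Polls until SS goes high again and returns the number of bytes handled. */
size_t spi_service(const spi_hooks_t *h, const uint8_t *reply, size_t reply_len,
                   uint8_t *rx, size_t rx_cap)
{
    assert(h != NULL && rx != NULL);

    size_t n = 0;
    while (h->ss_is_low(h->ctx)) {
        if (!h->byte_ready(h->ctx))
            continue;                            /* busy-wait for the next byte */
        /* Queue the next reply byte, or $FF filler once the reply runs out. */
        uint8_t next = (n < reply_len) ? reply[n] : 0xFF;
        uint8_t got  = h->exchange(h->ctx, next);
        if (n < rx_cap)
            rx[n] = got;                         /* drop bytes past the buffer end */
        n++;
    }
    return n;
}

/* PC-side stand-in for a master, used only to exercise the loop above. */
typedef struct {
    const uint8_t *bytes;      /* what the master sends */
    size_t         len, pos;
    uint8_t        last_reply; /* last byte the slave queued */
} fake_master_t;

static int fake_ss_is_low(void *c)  { fake_master_t *m = c; return m->pos < m->len; }
static int fake_byte_ready(void *c) { (void)c; return 1; }
static uint8_t fake_exchange(void *c, uint8_t next)
{
    fake_master_t *m = c;
    m->last_reply = next;
    return m->bytes[m->pos++];
}
```

the busy-wait inside the loop is the part that has to finish faster than the master clocks out the next byte, which is where a fast master can outrun the MCU.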

and if that doesn't work i might have to drop SPI until i solve that issue, and make something custom with handshaking so that the system is never too fast for the MCU to handle regardless of the speed of either side. though that does mean that you cannot have it inline with other SPI devices... unless i keep SPI and see if i can add handshaking to it. hmm, that might be worth looking into.

anyways, just wanted to drop this quick update. i've also been working on version 3 a bit, though i'll post about the details once it's done and functional.

