Some further thoughts on how this might be worked.
Adding a number of constraints simplifies the design somewhat:
- Transfers are all 512 bytes (because that's what the rest of my software expects; it could be expanded to any power of two).
- Transfers are all to even page boundaries (it keeps things tidy).
- Transfers are started by writing the target page to the top seven bits of a latch. The remaining bit might be used to indicate whether a read or write is required.
- The CF has to remain accessible on the data bus, since a normal sector read/write begins by setting the sector LBA address and the number of sectors to read (always one in my design), sending a command code to read or write, and waiting until the CF is ready. Access then consists of 512 sequential reads or writes; there is no further addressing since we are reading from or writing to the CF's buffer.
- To avoid bus contention issues, the processor is disconnected from the busses (BE) and halted (either the clock stopped, or RDY) while the DMA is taking place.
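The latch encoding in the list above can be sketched in a few lines of Python. This is only a model under assumptions: the name `decode_latch` is made up for illustration, and the polarity of the direction bit (set meaning memory to CF) is a guess, since the list only says the bit "might be used" for that.

```python
SECTOR_SIZE = 512  # bytes per transfer, fixed by the constraints above

def decode_latch(value):
    """Split a latch byte into (start_address, cf_write).

    The top seven bits select an even page; bit 0 is the direction flag.
    Assumption: bit 0 set means memory -> CF. The real polarity is undecided.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("the latch holds a single byte")
    start_address = (value & 0xFE) << 8   # even page number times 256
    cf_write = bool(value & 0x01)
    return start_address, cf_write

addr, cf_write = decode_latch(0x41)
print(f"${addr:04X}-${addr + SECTOR_SIZE - 1:04X}", "write" if cf_write else "read")
```

Because the low bit of the page number is always zero, one byte holds both the page and the direction, and every transfer covers exactly two pages.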
With these constraints in place, it looks possible to provide a DMA system specifically to allow transfers between memory and the CF. From the processor's point of view, it issues a write to the target selection latch and, by the time it executes its next instruction, the data has miraculously been transferred; it could (should) at that point perform an error check on the CF - but knowing me it probably won't bother.
The CF card interface uses ~RD and ~WR rather than the 65c02's R/~W, which turns out to be handy. We start with a synchronous counter of ten bits wired so that on demand it counts from 0 to 511 and then stops. On each clock phase, we can derive ~RD and ~WR which each use the first and second (respectively) half-phases of the clock for timing.
To transfer from the CF, we assert ~RD and data is available from then until the next assertion. Generating a low-going write pulse for the system memory on the other half of the phase should therefore copy the data at the end of the clock; successive reads and writes will transfer the successive bytes from the buffer. It may be necessary to use a transparent latch to stretch the signal but I don't think so, looking at the timing diagrams.
Similarly, transferring to the CF starts with R/~W high - it remains high for the entire process - which puts the memory data on the bus. The second half of the clock issues a ~WR signal to the CF. Again, we continue until 512 bytes have been transferred.
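The two transfer directions can be modelled in software to check the sequencing. The sketch below is a behavioural model only: it ignores setup and hold times, `dma_transfer` is a hypothetical name, and indexing the CF buffer by the counter stands in for the card's own internal sequencing (the real card takes no address during the transfer).

```python
SECTOR_SIZE = 512

def dma_transfer(memory, cf_buffer, start_address, cf_write):
    """Move one sector between memory and the CF buffer, one byte per clock."""
    for count in range(SECTOR_SIZE):        # the counter runs 0..511
        address = start_address + count     # the latch supplies the high bits
        if cf_write:
            # First half-phase: R/~W is high, so memory drives the bus.
            bus = memory[address]
            # Second half-phase: the ~WR pulse stores the byte in the CF.
            cf_buffer[count] = bus
        else:
            # First half-phase: ~RD makes the CF drive the bus.
            bus = cf_buffer[count]
            # Second half-phase: the memory write pulse captures the byte.
            memory[address] = bus
    return SECTOR_SIZE

memory = bytearray(65536)
cf_buffer = bytearray(range(256)) * 2       # 512 bytes of test data
dma_transfer(memory, cf_buffer, 0x4000, cf_write=False)
print(memory[0x4000:0x4200] == cf_buffer)
```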
At clock speed, this would result in transfer rates (either direction) on my system of 1.8MB/s. But the DMA can run at multiples of the system clock; the speed limit is that of the RAM (70ns on my system) and the CF card (30-60MB/s seems common these days) and of course the limits on the logic delays. Even the 'slow' rate is about fourteen times faster than the current inline software version, which makes it worthwhile.
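For a rough feel of the numbers, here is the arithmetic as a small Python sketch. Two values are assumptions inferred from the figures above rather than measured: a 1.8MHz system clock (from one byte per clock giving 1.8MB/s) and about fourteen clocks per byte for the software loop (from the factor of fourteen).

```python
SECTOR_SIZE = 512        # bytes per transfer
CLOCK_HZ = 1_800_000     # assumed: inferred from the 1.8MB/s figure

def sector_time_us(clocks_per_byte, clock_hz=CLOCK_HZ):
    """Microseconds needed to move one sector at a given cost per byte."""
    return SECTOR_SIZE * clocks_per_byte / clock_hz * 1e6

def rate_bytes_per_s(clocks_per_byte, clock_hz=CLOCK_HZ):
    """Sustained transfer rate in bytes per second."""
    return clock_hz / clocks_per_byte

# 14 clocks per byte for software is an estimate, not a measurement.
for label, clocks in (("DMA", 1), ("software loop", 14)):
    print(f"{label}: {sector_time_us(clocks):.0f} us per sector, "
          f"{rate_bytes_per_s(clocks) / 1000:.0f} kB/s")
```

Under those assumptions a sector takes roughly 284 microseconds by DMA against about 4 milliseconds in software.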
I should also thank DrJeff for his advice; he has suggested a simpler system using invalid 65c02 codes and a much smaller logic collection which reduces the transfer time to about five clocks per byte. However, I am considering using an entire partition as a virtual flat 24-bit addressed memory (i.e. as 32k sectors) and for that I need all the speed I can get.
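The 32k figure follows from dividing a 24-bit address space into 512-byte sectors. A minimal sketch of the mapping, where `split_address` and `partition_base` are hypothetical names and the partition is assumed to be contiguous on the card:

```python
SECTOR_SIZE = 512
ADDRESS_BITS = 24
SECTOR_COUNT = (1 << ADDRESS_BITS) // SECTOR_SIZE   # 32768, i.e. 32k sectors

def split_address(address, partition_base=0):
    """Map a flat 24-bit address to (lba, offset_within_sector).

    partition_base is the LBA of the partition's first sector
    (an assumed parameter, zero by default).
    """
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address must fit in 24 bits")
    return partition_base + (address >> 9), address & 0x1FF

print(SECTOR_COUNT, split_address(0x012345))
```

The top fifteen bits pick the sector and the bottom nine pick the byte, so any access to the virtual memory costs at most one sector transfer.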
I suspect that this approach could be expanded to allow e.g. arbitrary memory transfers of any size, but to be honest that's not on the game plan. It would require multiple latches for the to and from addresses, and two counters instead of one (or perhaps adders?)...
Three '163s count from zero as soon as the logic input (above the clock) goes low; if it remains low then the count will stop at 511.
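The counter behaviour described here can be written as a small behavioural model. This is a sketch under assumptions: `SectorCounter` is a made-up name, and holding the count at zero while the input is high is my reading of "count from zero as soon as the input goes low", not something taken from the schematic.

```python
class SectorCounter:
    """Behavioural model of the cascaded '163s used as a sector counter."""

    LAST = 511  # final count; the counter holds here

    def __init__(self):
        self.count = 0

    def clock(self, start_n):
        """Apply one clock edge. start_n is the active-low control input."""
        if start_n:
            # Assumed: input high holds the counter cleared.
            self.count = 0
        elif self.count < self.LAST:
            self.count += 1
        # Otherwise the input is still low and the count stays at 511.
        return self.count

counter = SectorCounter()
counter.clock(True)                                   # idle, held at zero
counts = [counter.clock(False) for _ in range(600)]   # more edges than needed
print(counts[0], counts[-1])
```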
Neil