Contemplating DMA

barnacle · Post by **barnacle** » Fri Feb 27, 2026 7:26 am

The core reads and writes of my FAT32 CF card system are between the card and a couple of page-aligned pages in memory, and the code requires 13 clocks per byte, plus a little overhead, giving a maximum transfer rate of ~70kB/s per MHz (I'm using a 1.8432MHz clock). In some cases, the read rate is limited by other factors, but speeding this up is a benefit.

Which I can't do in software.

But a hardware approach seems possible... DMA in 6502 land is not new, but mostly it's video circuits, reading only but in the unused half cycle. My board doesn't have video, so perhaps I can use that time to benefit. If I transfer one byte per cycle - the CF cards are currently able to deliver 30-60MB/s but I'm limited by RAM speed - then I'm looking at 1.8MB/second.
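As a rough sanity check on those two figures, the arithmetic can be sketched as below. It assumes exactly 13 clocks per byte for the software loop (no loop or setup overhead, so about 77 kB/s per MHz before the overhead that brings it down to the ~70 kB/s quoted) and one byte per clock for the DMA; the function name is only illustrative.

```python
# Compare the software copy loop with a one-byte-per-clock DMA.
# Assumptions: 13 clocks per byte in software (from the post, overhead
# ignored) and exactly one byte per system clock for the DMA.

CLOCK_HZ = 1_843_200  # the 1.8432 MHz system clock


def bytes_per_second(clock_hz: float, clocks_per_byte: float) -> float:
    """Transfer rate for a mechanism costing clocks_per_byte per byte."""
    return clock_hz / clocks_per_byte


if __name__ == "__main__":
    software = bytes_per_second(CLOCK_HZ, 13)
    dma = bytes_per_second(CLOCK_HZ, 1)
    print(f"software loop: {software / 1000:.1f} kB/s")
    print(f"DMA:           {dma / 1000:.1f} kB/s")
```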

My initial thought is that I need:

A 0-511 counter to provide the low bits of the memory address
A latch to hold the upper bits of the memory address - writing this latch would start the process
A 16-bit multiplexer to switch between CPU and DMA address bus to the memory
A 3-bit multiplexer to switch between CPU and a forced '0' address to the CF address bus (I will still need to be able to write and read the CF parallel interface to set up sectors, read errors, start the transfer, and suchlike)
A bidirectional 8-bit buffer between the data bus and the CF data bus
Some sort of logic to generate appropriate out of phase read and write pulses
A readable flag to indicate transfer complete

There's probably other stuff I've forgotten, but I think that's most of it. This would allow me to specify a buffer page (they're already page aligned for software speed) by writing to the latch, which would start the transfer, and poll for completion. I suppose I could use an interrupt to indicate completion but as the FAT32 code is all single-threaded (the whole system is single-threaded!) it doesn't gain me anything.
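A small behavioural model may help to picture the intended sequence (write the latch, transfer runs, poll the flag). This is only a sketch of the read direction under stated assumptions: a 64 KiB address space, the latch taken as the high byte of the buffer address with bit 0 ignored, and hypothetical class and method names.

```python
# Behavioural model of the proposed CF-to-RAM DMA (read direction only).
# Assumptions: 64 KiB address space, 512-byte buffers on 512-byte
# boundaries, and a latch that takes the high byte of the buffer address
# (bit 0 is ignored here, leaving A15..A9).

SECTOR_SIZE = 512


class CfDmaModel:
    def __init__(self) -> None:
        self.memory = bytearray(65536)
        self.done = True  # the readable "transfer complete" flag

    def address(self, latch: int, count: int) -> int:
        """Latch supplies the upper bits, the 0-511 counter the low nine."""
        return ((latch & 0xFE) << 8) | (count & 0x1FF)

    def start_read(self, latch: int, cf_sector: bytes) -> None:
        """Writing the latch starts the transfer: one byte per count."""
        if len(cf_sector) != SECTOR_SIZE:
            raise ValueError("CF sector buffer must be 512 bytes")
        self.done = False
        for count in range(SECTOR_SIZE):
            self.memory[self.address(latch, count)] = cf_sector[count]
        self.done = True


if __name__ == "__main__":
    dma = CfDmaModel()
    sector = bytes(i & 0xFF for i in range(SECTOR_SIZE))
    dma.start_read(0x20, sector)  # buffer at 0x2000
    print(hex(dma.address(0x20, 0)), dma.memory[0x2000:0x2004].hex(), dma.done)
```

In the real hardware the loop would of course be the counter clocking away while the processor waits or polls; the model only shows where each byte ends up.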

If it turns out that faster still is more desirable, then it would not be beyond the bounds of possibility to use multiple transfers per clock cycle, but to be honest I'm not sure it's worth the effort.

I shall start noodling circuits...

Neil

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Feb 27, 2026 8:38 am

If you can really achieve 70 KB per second at 1 MHz clock with your present hardware and driver, I’m not sure your effort to cobble up DMA makes sense. Cranking up Ø2 would be a better route, methinks, since the 65C02’s performance scales in direct proportion to the clock rate.

For example, if you take the clock up to 8 MHz, now you are (in theory) shoveling coal at the rate of 560 KB/second, certainly nothing to sneeze at. Get Ø2 up to the official 14 MHz and you are (in theory) stoking the old boiler at 980 KB/second. That’s faster than the 715 KB/second I’m getting with POC V1.3 running at 16 MHz, and that’s with SCSI hardware (I/O has to be wait-stated at 16 MHz).

That said, you first need to devise some way of accurately determining the raw throughput of your CF interface (IDE, I presume) so you have a trustworthy baseline. Counting clock cycles in your code presumes that the hardware doesn’t cause handshake pauses while the CF card’s innards do what they do to access the next byte. You might get some surprises when you compare what you’ve measured against what you have calculated.

I did this with POC V1.3’s second-design SCSI host adapter to gauge how much the new-and-improved™ circuit sped up things. I used the logic analyzer to track a train of pulses that occurred with each transfer, one byte per pulse. The actual transfer rate was a bit slower than the computed rate, although significantly faster (about 18 percent) than with the first-design host adapter.

barnacle wrote:

My initial thought is that I need:

A 0-511 counter to provide the low bits of the memory address
A latch to hold the upper bits of the memory address - writing this latch would start the process
A 16-bit multiplexer to switch between CPU and DMA address bus to the memory
A 3-bit multiplexer to switch between CPU and a forced '0' address to the CF address bus (I will still need to be able to write and read the CF parallel interface to set up sectors, read errors, start the transfer, and suchlike)
A bidirectional 8-bit buffer between the data bus and the CF data bus
Some sort of logic to generate appropriate out of phase read and write pulses
A readable flag to indicate transfer complete

That’s a lot of hardware and I suspect the prop time from cascaded logic is going to be a bump in the road.

If going the DMA route, encapsulating it in a CPLD would be a whole lot more efficient, with the CPLD acting as bus master during a DMA transfer. Fortunately, the C02 has the necessary input logic (BE, and RDY or clock-suspension) to work with another bus master. Such an arrangement could achieve a theoretical transfer rate of Ø2 ÷ 2 bytes per second.

BTW, I’ve been periodically wracking my brain over DMA ever since I got SCSI working in POC V1.1, which was nearly 14 years ago. I’ve mulled the CPLD approach, although with the 65C816, things are significantly more complicated than with the 65C02.

I’m nearly as close to trying something as I was 14 years ago. It’s a simple thing to describe on paper...

barnacle · Post by **barnacle** » Fri Feb 27, 2026 9:40 am

At the moment, with overhead, but without doing anything else, reading a large file into the buffer a sector at a time and then immediately overwriting it gives me a throughput of around 33kB/s, so about half the theoretical maximum. So over half the time is spent in overhead (approximate stopwatch timing); using DMA is perhaps going to overall double the throughput.

Might still be worthwhile, but it's simpler to drop in a 3.6MHz oscillator and double things that way. Of course, it would also double the DMA speed... I'll continue to think about it. You know I like to noodle these things.

Meanwhile, I have to solve two issues which have become apparent since I did a full rebuild/restore of the laptop (an upgrade caused kernel panics!): I can no longer execute the assembler, although I can create a C executable which does work; and I can't connect to any of the serial ports e.g. /dev/ttyACM0. Odd. Probably a permissions thing.

Neil

barnacle · Post by **barnacle** » Fri Feb 27, 2026 10:06 am

Ah, fixed the ttyACM0 - turns out you have to be in the dialout group, not tty...

Neil

barnacle · Post by **barnacle** » Fri Feb 27, 2026 10:15 am

And the assembler turns out to be a 32-bit executable, and required

sudo apt-get install gcc-multilib

to install the 32 bit libraries.

Neil

drogon · Post by **drogon** » Fri Feb 27, 2026 6:56 pm

barnacle wrote:

The core reads and writes of my FAT32 CF card system are between the card and a couple of page-aligned pages in memory, and the code requires 13 clocks per byte, plus a little overhead, giving a maximum transfer rate of ~70kB/s per MHz (I'm using a 1.8432MHz clock). In some cases, the read rate is limited by other factors, but speeding this up is a benefit.

Personally, I'd be happy with that.

Probably because my 16MHz system only achieves about 34KB/sec transfer to/from SD card (128-byte packets between the 65xx and the AVR, and highish latency). I've never really felt it was too slow for anything I do. It may not 'stream' Bad Apple, but all else is just fine.

And in a 65C02 system, there's only so much RAM you have, so you can totally fill RAM in under a second at that rate. I have an '816 with 512KB of RAM - the biggest thing I load there is the compiler - all 48KB of it, and there are slower things when compiling than waiting for the compiler to load (e.g. the compiler reads source code one byte at a time, so there is much overhead).

So balance up complexity of that idea vs. the reality of something usable?

-Gordon

Yuri · Post by **Yuri** » Fri Feb 27, 2026 8:43 pm

I agree with BDD, I think a CPLD would be a bit easier to work with in this case. Your latches, logic, counters etc can all work within the propagation delay of the single device. With one of the ATF15xx series that can be as low as 7.5ns; so then your limiting factor is making sure your device and RAM can work with whatever clock speed you are operating at, pretty fast SRAM can be had easily enough, so most of that is really going to be dependent on the CF card.

Though at 1.8432MHz you could probably go about wiring in an 8257 or 6844.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Feb 27, 2026 9:59 pm

barnacle wrote:

At the moment, with overhead, but without doing anything else, reading a large file into the buffer a sector at a time and then immediately overwriting it gives me a throughput of around 33kB/s, so about half the theoretical maximum. So over half the time is spent in overhead (approximate stopwatch timing); using DMA is perhaps going to overall double the throughput.

Have you published the circuit used to interface the CF card with your system? I don’t recall seeing it, although at my advanced age, I don’t recall a lot of things.

Quote:

Might still be worthwhile, but it's simpler to drop in a 3.6MHz oscillator and double things that way. Of course, it would also double the DMA speed... I'll continue to think about it. You know I like to noodle these things.

What’s the upper clock limit with your rig?

Working out how to do DMA would be a worthwhile endeavor, if for no other reason than getting the opportunity to explore arcane hardware stuff. As I said earlier, I’ve given it a fair amount of thought, but a combination of lack of time and insufficient hardware chops has never taken it beyond the cogitation stage. I’m comfortable with my present setup’s throughput, but know there is more lurking somewhere in the shadows. Design “improvements” in POC V1.4.1 should allow it to be clocked faster than POC V1.3, plus reduce the number of wait-states that will occur during a disk <—> core transfer (currently two wait-states per byte transferred, whether a read or write). It is slightly possible that that combination of improvements might get me to the “magical” 1 MB/second realm if I can ramp up Ø2 enough.

drogon wrote:

I have an '816 with 512KB of RAM - the biggest thing I load there is the compiler

The way I see it, a single load event, such as running your BCPL compiler, in itself won’t be a significant delay. After all, you can only type so fast. Even if it takes the better part of a second to load the compiler into core, it’ll be done almost before you have lifted your fingers from the keyboard after hitting [ENTER] to initiate the load.

Where high-performance mass storage becomes valuable is in operations that repetitively go to the disk, e.g., filesystem manipulations in a system with a paucity of buffers. Then it becomes the old effects-of-scale situation.

Yuri wrote:

I agree with BDD, I think a CPLD would be a bit easier to work with in this case. Your latches, logic, counters etc can all work within the propagation delay of the single device. Plus the logic design would be done in a descriptive language, rather than a plethora of wires and chips.

Yep! The integration of all required “machinery” into a single device would make it easier to avoid timing races and other maladies. The MPU can be stopped with RDY (or by freezing Ø2) when the CPLD/DMAC takes over and released when the transfer is completed. BE will take care of bus and RWB isolation, making the DMAC the bus master.

In a 65C02 system, this should be within the capability of the ATF1504AS, since there are 64 macrocells and buried logic for setting up required counters and registers. If the 1504’s logic fabric is insufficient, the 1508 could be used, although only available in QFP or the PLCC-84 monster package.

Quote:

Though at 1.8432MHz you could probably go about wiring in an 8257 or 6844.

The 6844 would be bus-friendlier, but the Intel part has a higher clock ceiling. In both cases, you would be interfacing TTL hardware to a CMOS system.

barnacle · Post by **barnacle** » Sat Feb 28, 2026 6:07 am

With the simple system I'm using as a base, the upper clock limit is probably 2MHz - limited by the UART (for which 1.8432MHz is a very handy baud rate frequency). The eeprom is 130ns, so would go to 4MHz, and the ram is 65ns so 8MHz should be possible, glue logic permitting; I haven't thought it out in detail. I suspect that to increase the speed on what is a deliberately conservative design, I'd need to think about variable clock speed depending on address.

The interface circuit is on the first page of the FAT32 thread; nothing complicated, just a '245 between the data bus and the CF card, and a standard '138 chip select for the appropriate memory area and a couple of gates for ~rd and ~wr.

As I've said before, I have no knowledge of the care and feeding of CPLDs and their friends, and I prefer to work with discrete logic parts unless and until it becomes completely impractical.

It's probably worth remembering that this would not be a generalised DMA controller; it can't e.g. move blocks of memory around (though there could be plenty of uses for doing that at clock speed rather than processor speed). It's more akin to the IO DMA controllers you might find on an ARM microcontroller, which transfer to or from specified peripherals. And even then, it's further limited because it will only transfer to/from even page boundaries... which keeps the hardware down.

The main part is the counter, which counts from 0-511 and then stops; three '163? Or two '393 but the ripple delay might be prohibitive. Multiplexing to the main memory could be done any number of ways, including the BE input to the processor (surely what it's intended for) and a control input to the existing '245. I rather like the possibility of stopping the processor when the DMA starts, so that the STA instruction which sets the target high address is followed 'immediately' by the first instruction to use the data - as far as the processor is concerned, the data is 'just there'...

Neil

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat Feb 28, 2026 6:12 am

barnacle wrote:

The interface circuit is on the first page of the FAT32 thread...

I recall it now; it was in color and I couldn’t read it.

barnacle · Post by **barnacle** » Sat Feb 28, 2026 7:42 am

My apologies; it was a screen grab rather than a pdf. Here you go...

Attachment: 6502_cf_2_a.pdf (585 KiB)

Neil

barnacle · Post by **barnacle** » Wed Mar 04, 2026 9:10 pm

Some further thoughts on how this might be worked.

Adding a number of constraints simplifies the design somewhat:

Transfers are all 512 bytes (because that's what the rest of my software expects; it could be expanded to any power of two).
Transfers are all to even page boundaries (it keeps things tidy).
Transfers are started by writing the target page to the top seven bits of a latch. The remaining bit might be used to indicate whether a read or write is required.
The CF has to remain accessible on the data bus, since a normal sector read/write begins by setting the sector LBA address and the number of sectors to read (always one in my design), sending a command code to read or write, and waiting until the CF is ready. Access then consists of 512 sequential reads or writes; there is no further addressing since we are reading/writing to/from the CF's buffer.
To avoid bus contention issues, the processor is disconnected from the busses (BE) and halted (either the clock stopped, or RDY) while the DMA is taking place.
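To make the latch constraint concrete, here is one possible encoding of the byte the processor would write. It is a sketch only: the bit layout follows the description above (top seven bits for the page, one bit for direction), but the direction polarity and all names are assumptions.

```python
# Hypothetical layout for the DMA start latch:
#   bits 7..1 = A15..A9 of the 512-byte-aligned buffer
#   bit 0     = direction (polarity assumed: 1 = memory to CF)
from typing import Tuple

READ_FROM_CF = 0
WRITE_TO_CF = 1


def encode_latch(buffer_address: int, direction: int) -> int:
    """Build the byte to store in the latch for a given buffer."""
    if not 0 <= buffer_address <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    if buffer_address % 512 != 0:
        raise ValueError("buffer must start on a 512-byte boundary")
    if direction not in (READ_FROM_CF, WRITE_TO_CF):
        raise ValueError("direction must be 0 or 1")
    return (buffer_address >> 8) | direction


def decode_latch(value: int) -> Tuple[int, int]:
    """Recover (buffer_address, direction) as the hardware would see it."""
    return (value & 0xFE) << 8, value & 0x01


if __name__ == "__main__":
    value = encode_latch(0x2000, WRITE_TO_CF)
    print(hex(value), decode_latch(value))
```

With this layout the byte is just the high byte of the buffer address with the direction in bit 0, so a single store is enough to select the buffer and start the transfer.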

With these constraints in place, it looks possible to provide a DMA system specifically to allow transfer from memory to and from the CF. From the processor's point of view, it would issue a write to the target selection latch and, when it executes its next instruction, the data has miraculously been transferred; it could (should) at that point perform an error check on the CF - but knowing me it probably won't bother.

The CF card interface uses ~RD and ~WR rather than the 65c02's R/~W, which turns out to be handy. We start with a synchronous counter of ten bits wired so that on demand it counts from 0 to 511 and then stops. On each clock phase, we can derive ~RD and ~WR which each use the first and second (respectively) half-phases of the clock for timing.

To transfer from the CF, we assert ~RD and data is available from then until the next assertion. Generating a low going write pulse for the system memory on the other half of the phase should therefore copy the data at the end of the clock; successive reads and writes will transfer the successive bytes from the buffer. It may be necessary to use a transparent latch to stretch the signal but I don't think so, looking at the timing diagrams.

Similarly, transferring to the CF starts with R/~W high - it remains high for the entire process - which puts the memory data on the bus. The second half of the clock issues a ~WR signal to the CF. Again, we continue until 512 bytes have been transferred.

At clock speed, this would result in transfer rates (either direction) on my system of 1.8MB/s. But the DMA can run at multiples of the system clock; the speed limit is that of the RAM (70ns on my system) and the CF card (30-60MB/s seems common these days) and of course the limits on the logic delays. Even the 'slow' rate is a significant improvement by a factor of about fourteen on the current inline software version, which is a worthwhile improvement.
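The effect of running the DMA at multiples of the system clock can be estimated with a few lines. This is a crude sketch: it assumes one byte per DMA clock, treats the 70ns RAM access time as a hard back-to-back ceiling, and ignores logic delays and the CF card entirely; the function names are illustrative.

```python
# Estimate DMA rate and time per sector for multiples of the system clock.
# Assumptions: one byte per DMA clock; RAM limit modelled as one byte per
# access time with no other delays (a deliberately crude ceiling).

SECTOR_SIZE = 512


def dma_rate_bytes_per_s(clock_hz: float, multiple: int = 1) -> float:
    """One byte per DMA clock, the DMA clock being multiple x system clock."""
    return clock_hz * multiple


def ram_ceiling_bytes_per_s(access_ns: float) -> float:
    """Upper bound if every byte costs one full RAM access time."""
    return 1e9 / access_ns


def sector_time_us(rate_bytes_per_s: float) -> float:
    """Microseconds needed to move one 512-byte sector at the given rate."""
    return SECTOR_SIZE / rate_bytes_per_s * 1e6


if __name__ == "__main__":
    clock_hz = 1_843_200
    ceiling = ram_ceiling_bytes_per_s(70)
    for multiple in (1, 2, 4, 8):
        rate = min(dma_rate_bytes_per_s(clock_hz, multiple), ceiling)
        print(f"x{multiple}: {rate / 1e6:.2f} MB/s, "
              f"{sector_time_us(rate):.1f} us per sector")
```

At the base clock this gives about 278 µs per sector, against roughly 3.6 ms for a 13-clock-per-byte software loop before overhead; by the time the multiple reaches eight, the assumed RAM ceiling is the limit rather than the clock.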

I should also point out and thank DrJeff for his advice; he has suggested a simpler system using invalid 65c02 codes and a much smaller logic collection which reduces the transfer time to about five clocks per byte. However, I am considering using an entire partition as a virtual flat 24-bit addressed memory (i.e. as 32k sectors) and for that I need all the speed I can get.
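For the flat-memory idea, the mapping from a 24-bit address to a sector and an offset within it might look like the sketch below. The names are hypothetical, and the sector number is relative to the start of the partition (the partition's own starting LBA would have to be added).

```python
# Map a flat 24-bit address onto 512-byte sectors within a partition.
# Assumptions: sector numbers are relative to the partition start, and
# all names here are illustrative.
from typing import Tuple

SECTOR_BITS = 9                      # 512-byte sectors
SECTOR_SIZE = 1 << SECTOR_BITS
ADDRESS_BITS = 24
SECTOR_COUNT = (1 << ADDRESS_BITS) // SECTOR_SIZE  # 32768, i.e. 32k sectors


def split_address(address: int) -> Tuple[int, int]:
    """Return (sector_number, byte_offset) for a 24-bit flat address."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address must fit in 24 bits")
    return address >> SECTOR_BITS, address & (SECTOR_SIZE - 1)


if __name__ == "__main__":
    print(SECTOR_COUNT, split_address(0x123456))
```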

I suspect that this approach could be expanded to allow e.g. arbitrary memory transfers of any size, but to be honest that's not on the game plan. It would require multiple latches for the to and from addresses, and two counters instead of one (or perhaps adders?)...

Three '163s count from zero as soon as the logic input (above the clock) goes low; if it remains low then the count will stop at 511.
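The counting behaviour described can be modelled in a few lines. This is a behavioural sketch, not the actual gating in the schematic: it assumes a synchronous clear while the control input is high (as a '163 provides) and a count enable that drops once 511 is reached; the class name and the `hold` argument are made up for illustration.

```python
# Behavioural model of the sector counter: held at zero while the control
# input is high, counting on each clock once it goes low, stopping at 511.
# Assumption: the stop is done by removing count enable at 511; the real
# circuit may gate it differently.


class SectorCounter:
    LAST = 511

    def __init__(self) -> None:
        self.count = 0

    def clock(self, hold: bool) -> int:
        """Apply one clock edge and return the resulting count."""
        if hold:
            self.count = 0        # synchronous clear
        elif self.count < self.LAST:
            self.count += 1       # count enable true below 511
        return self.count


if __name__ == "__main__":
    counter = SectorCounter()
    counter.clock(hold=True)      # counter sits at 0 before the transfer
    values = [counter.clock(hold=False) for _ in range(600)]
    print(values[0], values[510], values[-1])
```

Address 0 is presented before the first counting edge, so 511 further clocks cover the whole sector, and any extra clocks leave the count parked at 511.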

Neil

Dr Jefyll · Post by **Dr Jefyll** » Fri Mar 06, 2026 3:41 pm

Interesting proposal, Neil! And lots to think about. Here are some points that come to mind.

Quote:

the DMA can run at multiples of the system clock

I agree this is doable, but -- realistically -- is it something you'd actually want to undertake? It implies two different clocks, with part of the system required to "change gears" between the two. That isn't entirely trivial (even granting that the two clocks would have a stable phase relationship).

Myself, I'd be far more inclined to stick with a single clock... but double it! Other than that 68B50 UART (which may require an upgrade), is there any objection to boosting the CPU clock from 1.843 MHz to 3.68 MHz? This'll accelerate both your DMA *and* the somewhat pokey multi-precision math routines you mentioned elsewhere.

Quote:

The CF card interface uses ~RD and ~WR

Yes, I see that... (referring to the pdf schematic in the previous post)... but I'm not clear on why the CF card interface has eleven address inputs (only three of which are supplied with a signal). Is there a part# or data sheet I can refer to? Thanks.

Edit: answering my own question.

Early in a very long topic, Neil lists Resources Online in this post.

BTW, you might wanna consider changing the Enable signal for '245 bus transceiver so it's qualified by PHI2.

-- Jeff

barnacle · Post by **barnacle** » Sat Mar 07, 2026 8:33 am

As you've found, Jeff, the CF interface in IDE mode exposes only eight addresses; all others have to be held low.

My thought on different speeds is that there need be absolutely no correlation between the clocks: once the write to the address latch is made, the processor is stopped by the DMA logic, to be restarted once the DMA is finished. I'm not sharing the bus on alternate clocks between the processor and the DMA but between the DMA and the memory. There's no conflict of either clock or bus. But as you say, at least in the first iteration, I'll use the same clock for both.

Neil

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sat Mar 07, 2026 1:20 pm

barnacle wrote:

My thought on different speeds is that there need be absolutely no correlation between the clocks...

Dunno about that...

In order to initiate a DMA transaction, the MPU has to set up the DMA hardware using the usual select-and-write bus sequences. Hence the DMA hardware is effectively being clocked by the MPU during setup. Furthermore, once setup is done, the DMA hardware has to stop and isolate the MPU, which probably should occur on the fall of Ø2.¹ Timing could end up being touchy, especially when you eventually are overcome with the desire to crank up the clock.

Hence I foresee problemos muchos in trying to make a smooth transition from MPU to DMA and back with two different clock domains running things.

————————————————————
¹If I were doing this, I would disable all interrupt sources, other than the DMA hardware, and halt the MPU with the SEI - WAI sequence. With the MPU WAIting, BE can be driven low by the DMA hardware at any convenient time. The actual DMA start would be triggered by RDY being driven low by the MPU when WAI is executed—Ø2 can continue to run while WAIting. When DMA is done, it would hi-Z itself, drive BE high and then toggle IRQB to resume the MPU. The restart will occur on the next low-going edge of Ø2.
