I've had my PC chug through the code that's running my bench computer. Using the dead cycles, I can rely on a bus bandwidth of about 60-70 kbytes per second per megahertz, with a peak of three times that for certain blocks of code. However, there is also a point in the code with a 42-cycle gap between successive dead cycles. If something needed a data rate fast enough to justify DMA, I don't think it could stand a 42-cycle latency at the wrong moment.
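Assuming one byte can be moved per dead cycle, those figures can be sanity-checked with a little arithmetic (this is just an illustration of the quoted numbers, not a measurement):

```python
# Rough sanity check of the quoted figures, assuming one byte
# transferred per dead bus cycle.
CYCLES_PER_SECOND_PER_MHZ = 1_000_000

for kbytes_per_sec in (60, 70):
    # fraction of all bus cycles that are dead (and thus usable)
    dead_fraction = (kbytes_per_sec * 1000) / CYCLES_PER_SECOND_PER_MHZ
    print(f"{kbytes_per_sec} kB/s/MHz -> {dead_fraction:.1%} of bus cycles are dead")

# "a peak performance of three times that", taking the midpoint of 60-70:
peak_fraction = 3 * 65_000 / CYCLES_PER_SECOND_PER_MHZ
print(f"peak -> roughly {peak_fraction:.1%} of bus cycles")
```

So the sustained figure works out to about 6-7% of the bus cycles, which is the 6% mentioned later in the thread.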
Hi smilingphoenix,
This channel is quite fast, but, most important, it is free in terms of bandwidth cost. Its main virtue is not its speed but its zero bandwidth cost. I mean that the application for this "DMAMagic" thing should be chosen based on that main virtue, not on 'prejudices' about what DMA used to be used for.
I'm not knocking the idea - for a system that already has phase-1 DMA, it would let you squeeze out a little more bandwidth, especially on a faster system where 6% of the bus bandwidth is still considerable. I've no doubt you could get it to work; I just wonder exactly what you would use it for?
I still don't know what would be the killer application for it, but I do know that the idea itself is a killer idea...
Well, seriously, if, for example, you wanted to move 256 bytes of data from a disk sector to RAM (in an existing, already-built 6502 system), you could go one of three ways:
1.- Pure (phase 2) DMA : with 100% of the CPU's memory bandwidth available, this is the fastest way, at least in theory. But in this mode the CPU would be halted for 256 memory cycles.
2.- Via software, using the CPU to do the copy, something like:
LOOP  LDA $somewhere        ; 4 cycles
      STA $somewhereElse,X  ; 5 cycles
      INX                   ; 2 cycles
      BNE LOOP              ; 3 cycles
You'd get a maximum transfer rate of 1 byte every 14 cycles - about 7.1% of the available memory bandwidth, or 14 times slower than (1). In this mode, the CPU would be 'halted' (busy transferring) for 256 * 14 = 3,584 cycles.
3.- Post the transfer request, let the CPU continue doing whatever other productive work while the transfer takes place through the transparent DMA channel, and either poll every now and then, or trigger an interrupt after the transfer is done. CPU cycles dedicated to the transfer: almost 0. Therefore, in a sense, this is the fastest way.
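The three options can be compared in numbers. A quick sketch, using the cycle counts above and assuming the setup/notification overhead of case 3 is negligible:

```python
# CPU cycles consumed (i.e. stolen from useful work) to move 256 bytes,
# per the three approaches described above.
N = 256

# (1) Pure phase-2 DMA: the CPU is halted one cycle per byte transferred.
halted_dma = N * 1

# (2) Software copy loop: LDA(4) + STA abs,X(5) + INX(2) + BNE taken(3)
#     = 14 cycles per byte.
cycles_per_byte = 4 + 5 + 2 + 3
busy_software = N * cycles_per_byte

# (3) Transparent (dead-cycle) DMA: only otherwise-wasted cycles are used,
#     so the CPU cost is essentially zero (setup overhead assumed negligible).
busy_transparent = 0

print(halted_dma, busy_software, busy_transparent)  # 256 3584 0
```

Measured in cycles the CPU loses, the transparent channel wins outright; measured in wall-clock time for the 256 bytes alone, pure phase-2 DMA is still quickest, since the transparent channel only gets the 6-7% of cycles that happen to be dead.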
--Jorge.