6502.org Forum  Projects  Code  Documents  Tools  Forum
PostPosted: Fri Apr 20, 2007 3:28 pm 

Joined: Tue Mar 27, 2007 6:02 pm
Posts: 14
Location: Sevilla, España
http://homepage.mac.com/jorgechamorro/a2things/a2DMAMagic/index.html

Has this ever been done ?
What do you think about it ?

--Jorge.


Last edited by jorge on Wed May 16, 2007 7:28 am, edited 3 times in total.

PostPosted: Sun Apr 22, 2007 7:33 am

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 990
Location: near Heidelberg, Germany
Hi Jorge,

Quote:

links to
http://homepage.mac.com/jorgechamorro/a2things/a2DMAMagic/index.html

I don't think anyone has ever done this. Every system I know of relies on guaranteed throughput and uses the RDY line or Phi2 magic to actually stop the CPU.

Nice idea, though. I am currently designing a blitter (block transfer engine) that could use this idea, but it's already a tough SMD job on a Euro card, so I don't know whether I will do this.

You would have to use a different PROM for each type of CPU (NMOS, Rockwell CMOS, WDC, ...) though.

André


PostPosted: Sun Apr 22, 2007 8:10 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8422
Location: Southern California
Without spending much time on it, it looks like it should work as long as certain limitations are kept in mind. For example, certain instructions are one clock longer in decimal mode than in binary. The logic might also have to know if interrupts are enabled and watch the IRQ and NMI lines (+ ABORT if it's a 65816). If you use WDC's processor, you can use the BE input to disable its address bus outputs during your DMA accesses, otherwise you'll have to add external address bus buffers too, because the processor will be driving the address bus even during "dead" bus cycles. The '816 also makes it easier to identify the dead bus cycles, with its VDA and VPA outputs.
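
As a rough illustration of the last point, here is a minimal Python sketch of how DMA logic might classify 65816 bus cycles from the VDA and VPA outputs. The function names and labels are made up for this example; the signal meanings follow the usual reading of the WDC datasheet, where both outputs low indicates an internal operation with no valid address.

```python
def classify_816_cycle(vda: int, vpa: int) -> str:
    """Classify a 65816 bus cycle from VDA and VPA (hypothetical helper)."""
    if vda and vpa:
        return "opcode fetch"
    if vpa:
        return "program fetch"
    if vda:
        return "data access"
    return "internal (dead)"


def is_dead_cycle(vda: int, vpa: int) -> bool:
    """A cycle is a DMA candidate only when neither address is valid."""
    return not vda and not vpa


for vda, vpa in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(vda, vpa, classify_816_cycle(vda, vpa))
```

On a plain 6502 there are no such outputs, so equivalent information would have to be derived from SYNC and the opcode stream.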

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Sun Apr 22, 2007 11:20 am

Joined: Tue Mar 27, 2007 6:02 pm
Posts: 14
Location: Sevilla, España
fachat wrote:
Hi Jorge,

links to
http://homepage.mac.com/jorgechamorro/a2things/a2DMAMagic/index.html

I don't think anyone has ever done this. Every system I know of relies on guaranteed throughput and uses the RDY line or Phi2 magic to actually stop the CPU.

Nice idea, though. I am currently designing a blitter (block transfer engine) that could use this idea, but it's already a tough SMD job on a Euro card, so I don't know whether I will do this.

You would have to use a different PROM for each type of CPU (NMOS, Rockwell CMOS, WDC, ...) though.

André


André,

What I love about this idea is that it uses bandwidth that was already there, being wasted by the CPU. You could, for example, completely redraw the high-resolution screen of an Apple II (8 KB) at 25 fps! The bandwidth is not subtracted from anywhere else; it is recovered. It adds to the total memory bandwidth of the system, transparently, like magic!
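
The arithmetic behind that screen-redraw figure can be sketched as follows. The CPU clock value is an assumption (roughly the Apple II's average rate of about 1.02 MHz); the screen size and frame rate are the ones given above.

```python
SCREEN_BYTES = 8 * 1024   # Apple II hi-res page, 8 KB as stated above
FRAMES_PER_SECOND = 25    # target redraw rate from the post
CPU_HZ = 1_020_000        # assumption: about 1.02 MHz average CPU clock


def required_bandwidth(screen_bytes: int, fps: int) -> int:
    """Bytes per second needed to rewrite the whole screen every frame."""
    return screen_bytes * fps


def required_dead_fraction(bytes_per_second: float, cpu_hz: float) -> float:
    """Fraction of CPU cycles that must be dead, at one byte per dead cycle."""
    return bytes_per_second / cpu_hz


bw = required_bandwidth(SCREEN_BYTES, FRAMES_PER_SECOND)
frac = required_dead_fraction(bw, CPU_HZ)
print(f"{bw} bytes/s, needing about {frac:.0%} of all cycles to be dead")
```

Under these assumptions, roughly one cycle in five would have to be a dead cycle, so whether the rate is reached depends on the code the CPU happens to be running.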

A PROM would probably be too slow for anything but the slower 6502s. A PAL, or nowadays a CPLD, would very likely be much faster. I have not yet checked how the positions of the dead cycles differ among the 6502 variants. Anyway, a few (2?) additional input bits to a CPLD would probably be enough to deal with the differences.

Regards,
Jorge


PostPosted: Sun Apr 22, 2007 11:34 am

Joined: Tue Mar 27, 2007 6:02 pm
Posts: 14
Location: Sevilla, España
GARTHWILSON wrote:
Without spending much time on it, it looks like it should work as long as certain limitations are kept in mind. For example, certain instructions are one clock longer in decimal mode than in binary. The logic might also have to know if interrupts are enabled and watch the IRQ and NMI lines (+ ABORT if it's a 65816). If you use WDC's processor, you can use the BE input to disable its address bus outputs during your DMA accesses, otherwise you'll have to add external address bus buffers too, because the processor will be driving the address bus even during "dead" bus cycles.


You're right. And the RDY line must also be closely monitored!
The 65816s don't need this circuit, as the VDA and VPA output signals serve the same purpose.
The BE input is a godsend, but it's not usually available in an Apple II.
Do you know which of the extra cycles in decimal mode are dead cycles?
I'm not sure yet what to do about IRQs/NMIs.

Regards,
Jorge.


PostPosted: Tue Apr 24, 2007 10:47 am

Joined: Fri Dec 12, 2003 7:22 am
Posts: 259
Location: Heerlen, NL
Hallo Jorge,


To be honest, IMHO it's too much trouble to get those 199 KB/sec. And how much of it is available at any particular moment depends on the program that is running. You will find that, at the very moment you really need those bytes, Mr. Murphy is running around.

I prefer the two-CPU solution used in various Commodore drives: two 6502s running on the same data bus and sharing the available memory. And why should the second CPU be a 6502? IMHO you can use any thing/method to access the RAM (or whatever) during PHI2 =(L).

--
___
/ __|__
/ / |_/ Groetjes, Ruud
\ \__|_\
\___| URL: Ruud.C64.org

_________________
Code:
    ___
   / __|__
  / /  |_/     Groetjes, Ruud
  \ \__|_\
   \___|       URL: www.baltissen.org



PostPosted: Tue Apr 24, 2007 11:21 am

Joined: Tue Mar 27, 2007 6:02 pm
Posts: 14
Location: Sevilla, España
Ruud wrote:
Hallo Jorge,


To be honest, IMHO it's too much trouble to get those 199 KB/sec.

(..)

I prefer the two-CPU solution used in various Commodore drives: two 6502s running on the same data bus and sharing the available memory. And why should the second CPU be a 6502? IMHO you can use any thing/method to access the RAM (or whatever) during PHI2 =(L).

--Groetjes, Ruud



Hi Ruud,

Imagine for a second that you've got the dual-CPU 6502 system that you prefer. Imagine that the memory bus is running at 5 MHz, 2.5 MHz for each 6502. Well, you've got there, already built in and hidden in this system, a 1 MB/s DMA channel that won't slow down the system at all while the 1 MB/s transfer (transparently) takes place...

That, I think, is the whole point of this idea...

Regards,
Jorge.


PostPosted: Tue Apr 24, 2007 5:05 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Ruud wrote:
To be honest, IMHO it's too much trouble to get those 199 KB/sec.


That's quite a chunk of bandwidth, even if it is only probabilistic. It's faster than the 6502 can move data on its own, and yet it's not quite as fast as a dedicated DMA-fed device on the other clock phase. This makes it perfect for bulk transfers to, oh, I dunno, a hard drive controller, a printer, and, as someone else already mentioned elsewhere in the thread, the video screen.
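
To put a rough number on the comparison with the CPU moving data itself, here is a small sketch. It assumes a 1 MHz 6502 and a simple indexed copy loop with the standard cycle counts and no page crossings; the loop is an illustrative choice, not code from this thread.

```python
CPU_HZ = 1_000_000  # assumption: 1 MHz 6502

# Assumed inner loop, one byte copied per iteration (no page crossings):
#   LDA src,X   4 cycles
#   STA dst,X   5 cycles
#   INX         2 cycles
#   BNE loop    3 cycles (branch taken)
LOOP_CYCLES = 4 + 5 + 2 + 3

cpu_copy_rate = CPU_HZ / LOOP_CYCLES  # bytes per second
print(f"CPU copy loop: about {cpu_copy_rate / 1000:.0f} kB/s")
```

Under these assumptions the CPU manages well under half of the 199 KB/sec figure quoted above, and it can do nothing else while copying.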

This would be a PERFECT conduit for vector processors of all kinds -- the most obvious being a bit-blitter (yes, that IS a vector processor), although perhaps someone could conceivably think of a Connection Machine-style 1-bit ALU coupled to a few DMA channels. The slow rate of feed from the bus could conceivably be perfectly matched with serial ALU architecture, thus producing all sorts of USEFUL background processing, while the 6502 or 65816 does something else.

Personally, I rather LIKE this idea, and will try to play with it at some point. Kudos to jorge for pointing this out!


Post subject: Harvesting dead cycles
PostPosted: Sat Apr 28, 2007 11:55 pm

Joined: Sun Aug 28, 2005 9:01 pm
Posts: 17
Location: Pennsylvania
Hello All,

Interesting idea indeed.
Assuming a system already set up for DMA access and already using
interleaved memory access (video controller?), this has the potential to
eke out a bit more bandwidth for (almost) free. Check me on this:
DMA access throughput (during dead cycles) will be mostly limited
by address decode, DMA MUX/BE and memory R/W delays. Given
a 6502 running at 1 or 2 MHz coupled with modern high-speed
MUX/decode/memory/combo logic components, a lot could be done
during these intervals, IMHO.

Brian M. Phelps
R65C02P3

_________________
Numbers were bad enough. Now I have to add
LETTERS?!


PostPosted: Sun Apr 29, 2007 1:12 pm

Joined: Sat May 20, 2006 2:54 pm
Posts: 43
Location: Brighton, England
Definitely an interesting idea. But whatever is using the transparent DMA channel must be something that can afford to wait a bit for its data. Otherwise, the processor is going to execute a section of code that has few dead cycles at the wrong time and really screw things up.

That is the big problem with this idea. Any section of code that has been optimised for speed will, by design, have very few dead CPU cycles and thus will seriously reduce the DMA bandwidth.

I think I'll stick to doing DMA in phase 1 time; at least I know I have a guaranteed bandwidth that way.
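
The effect of optimisation on the available bandwidth can be estimated with a small table of cycle counts. The sketch below is only illustrative: the (total, dead) pairs are assumed values based on commonly published NMOS 6502 cycle-by-cycle descriptions, without page crossings, and would need checking against the actual CPU variant.

```python
# (total cycles, dead cycles) per instruction.
# Assumed NMOS 6502 values, no page crossings; verify for your CPU variant.
CYCLES = {
    "LDA abs":   (4, 0),
    "STA abs":   (4, 0),
    "LDA abs,X": (4, 0),
    "STA abs,X": (5, 1),
    "INX":       (2, 1),
    "BNE taken": (3, 1),
}


def dead_fraction(program):
    """Fraction of bus cycles that are dead for a list of instruction names."""
    total = sum(CYCLES[op][0] for op in program)
    dead = sum(CYCLES[op][1] for op in program)
    return dead / total


looped_copy = ["LDA abs,X", "STA abs,X", "INX", "BNE taken"]
unrolled_copy = ["LDA abs", "STA abs"] * 8  # speed-optimised version

print("looped copy:  ", round(dead_fraction(looped_copy), 3))
print("unrolled copy:", round(dead_fraction(unrolled_copy), 3))
```

With these assumed figures the unrolled copy leaves no dead cycles at all, which is exactly the worst case described above.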


PostPosted: Sun Apr 29, 2007 7:11 pm

Joined: Tue Mar 27, 2007 6:02 pm
Posts: 14
Location: Sevilla, España
R65C02P3 wrote:
Hello All,

Interesting idea indeed.
Assuming a system already set up for DMA access and already using
interleaved memory access (video controller?), this has the potential to
eke out a bit more bandwidth for (almost) free. Check me on this:
DMA access throughput (during dead cycles) will be mostly limited
by address decode, DMA MUX/BE and memory R/W delays. Given
a 6502 running at 1 or 2 MHz coupled with modern high-speed
MUX/decode/memory/combo logic components, a lot could be done
during these intervals, IMHO.

Brian M. Phelps
R65C02P3


Hi R65C02P3,

In an existing design, no more than a single byte can be transferred in a single dead cycle. In a new design it *might* be possible to squeeze two or even more memory cycles into each dead cycle. I'm not sure I understood what you mean, though.

--Jorge.


PostPosted: Sun Apr 29, 2007 7:59 pm

Joined: Tue Mar 27, 2007 6:02 pm
Posts: 14
Location: Sevilla, España
smilingphoenix wrote:
Definitely an interesting idea. But whatever is using the transparent DMA channel must be something that can afford to wait a bit for its data. Otherwise, the processor is going to execute a section of code that has few dead cycles at the wrong time and really screw things up.

That is the big problem with this idea. Any section of code that has been optimised for speed will, by design, have very few dead CPU cycles and thus will seriously reduce the DMA bandwidth.

I think I'll stick to doing DMA in phase 1 time; at least I know I have a guaranteed bandwidth that way.


Hi smilingphoenix,

"Phase 1 DMA" is a different thing, this isn't intended to replace it. Phase 1 DMA can't be added on top to an existing system, this can. Phase 1 DMA does not recover the bandwidth that the CPU wastes, this does. This is a different thing, therefore has different uses. This is a cool, new, 2007 hack, one that has never been done before. Phase 1 DMA is not so cool anymore, this is... :-)

--Jorge.


PostPosted: Sun Apr 29, 2007 11:56 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
One example of how combining phase-1 and phase-2 DMA can be beneficial is in sprite display logic for video, or blitting.

In the case of sprites, it's often necessary to fetch sprite position data and content. Let's assume 32-bit-wide sprites (4 bytes to fetch per row), plus the horizontal position (2 bytes). Assuming NTSC frequencies, that requires a bandwidth of 6 bytes every 63 µs, or about 95.2 kBps per enabled sprite.

Best case, a 1MHz processor can provide 500kBps phase-2 bandwidth (just repeatedly execute NOPs), so there is a theoretical capacity to fetch and manage 5 sprites per scanline on a 1MHz machine, without impacting phase-1 bandwidth at all. (note: it's assumed that the Y positions of the sprites are maintained in the video chip for architectural simplicity).

On a 4MHz machine, that's 20 sprites, _quite_ an appreciable number.

Since only 5*6=30 bytes in total need to be fetched per scanline, the application need only ensure that, to display the sprites in their entirety, the processor executes instructions such that 30 unused cycles appear, somehow. And if you use JSR, BRA, INX/DEY, CLC, etc., the odds of that happening are pretty darn good.

As long as sprite data is fetched one line prior to its actual use, real-time constraints are alleviated. Hence, the phase-2 DMA channel is almost ideal for this application.
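
The sprite budget above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up, and the 63 µs line time and the all-NOP best case (one dead cycle in every two) are the figures used in the text.

```python
BYTES_PER_SPRITE_LINE = 4 + 2  # 4 bytes of pixels + 2 bytes of X position
LINE_TIME_S = 63e-6            # NTSC scanline time used in the post
CPU_HZ = 1_000_000             # 1 MHz machine

per_sprite_bw = BYTES_PER_SPRITE_LINE / LINE_TIME_S   # bytes/s per sprite
best_case_dma_bw = CPU_HZ / 2                         # all-NOP code
max_sprites = int(best_case_dma_bw // per_sprite_bw)  # whole sprites only

cycles_per_line = round(CPU_HZ * LINE_TIME_S)         # CPU cycles per scanline
dead_needed = max_sprites * BYTES_PER_SPRITE_LINE     # dead cycles per scanline

print(f"per sprite: {per_sprite_bw / 1000:.1f} kB/s")
print(f"max sprites: {max_sprites}")
print(f"dead cycles needed: {dead_needed} of {cycles_per_line} per line")
```

Seen this way, the requirement is that a little under half of the cycles in each scanline are dead, which is close to the all-NOP limit.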

Another use is for blitting, as I've indicated earlier. A blitter executes a 3-in-1-out logical function with a variety of data transforms along the way. Each execution cycle of the blitter involves four data transfers under maximum load (3 reads, 1 write). If these occur concurrently with CPU processing, the performance will probably be about the same as a 6502 running the code itself. But since it's going in parallel with the 6502, you end up with a x2 performance gain for any kind of simple vector operation (which graphics all but defines).
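
As an illustration of what a 3-in-1-out logical function can look like, here is a minimal sketch that uses an 8-bit truth table to pick the output bit for every combination of the three source bits. The name `blit_byte` and the table encoding (bit index = 4*A + 2*B + C) are assumptions made for this example, not a description of any particular blitter.

```python
def blit_byte(a: int, b: int, c: int, truth_table: int) -> int:
    """Combine three source bytes into one destination byte.

    For each bit position, the three source bits form an index 0..7
    (A is the high bit, C the low bit) into the 8-bit truth table.
    """
    out = 0
    for bit in range(8):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        out |= ((truth_table >> index) & 1) << bit
    return out


# Truth table 0xCA selects B where A is 1 and C where A is 0,
# i.e. a masked copy with A as the mask.
MASKED_COPY = 0xCA
result = blit_byte(0xF0, 0xCC, 0xAA, MASKED_COPY)
print(hex(result))
```

Each call corresponds to the four bus transfers mentioned above: three reads for `a`, `b` and `c`, and one write for the result.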

Audio is another example where the bandwidth wasted by the CPU could be recovered. However, audio does require care, because our ears are sensitive to even the slightest interruptions in audio, which isn't the case for video.


PostPosted: Tue May 01, 2007 10:30 am

Joined: Fri Dec 12, 2003 7:22 am
Posts: 259
Location: Heerlen, NL
Hallo Jorge,

jorge wrote:
That's what I think that's the whole point about this idea...


First: I forgot to say that the idea in itself is very good!

Even after a second and third thought I still have my doubts, but please don't let that spoil the fun for you. I really hope you get it working.

_________________
Code:
    ___
   / __|__
  / /  |_/     Groetjes, Ruud
  \ \__|_\
   \___|       URL: www.baltissen.org



PostPosted: Thu May 03, 2007 10:48 pm

Joined: Thu Jan 16, 2003 12:55 pm
Posts: 64
Location: Indianapolis
That is quite a cool idea to push the system a little further, I like it.

Too bad on the platform I use (NES), I don't have access to a sync signal, or I'd try to fit it into my crazy future expansion design.

It's not completely transparent, but surely close enough. I've seen a lot of NMI/IRQ based programs that spend a lot of time in an infinite: JMP infinite loop. Simply adding a bunch of NOPs to the loop would give a ton of free cycles (and also reduce interrupt latency just a bit, heheh). Very nice.
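
How much padding the idle loop with NOPs might help can be estimated like this. The sketch assumes typical NMOS 6502 figures: each NOP takes 2 cycles, one of them dead, and JMP absolute takes 3 cycles with none dead. The function name is made up for the example.

```python
def idle_loop_dead_fraction(nops: int) -> float:
    """Dead-cycle fraction of `nops` NOPs followed by a JMP back to the start.

    Assumes NOP = 2 cycles with 1 dead, JMP absolute = 3 cycles with 0 dead.
    """
    total_cycles = 2 * nops + 3
    return nops / total_cycles


for n in (0, 1, 8, 64):
    print(n, round(idle_loop_dead_fraction(n), 3))
```

Under these assumptions a bare JMP loop offers no dead cycles at all, while a long run of NOPs approaches, but never reaches, one dead cycle in two.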

Ruud wrote:
I rather prefer the two CPU solution as used in various Commodore drives: two 6502's running on the same data bus and sharing the available memory.


In theory, maybe one could then add a 3rd CPU, running during the spare cycles provided by one of the other two. :)

