Realtime 6502 on CUDA
Posted: Sun Aug 26, 2012 1:39 am
Hi everyone. I'm new here, first post.
I've just done the math quickly, and on paper, it fits!
It should be possible to emulate the 6502 (at circuit level, with 14 MHz signaling) in real time, using the latest GeForce GTX 680 graphics card.
(I compared my current GTX 570 and the new mid-range GTX 660 Ti, along with the top-of-the-line GTX 680.)
I'll try to program that, or at least do a proof of concept.
(but in my spare time only: it will probably take me a year)
The way I would simulate with CUDA:
have a table of 2x 16-bit pointers for the circuit connections,
have an 8-bit table for the previous values (only 1 bit used in each byte),
and another 8-bit table for the results.
That would run at 14 MHz, and the simulation would run 14 times (one 1 MHz CPU cycle) before sending the table back to the host CPU.
The GPU would do all of the 6502 simulation; the host CPU would emulate the rest (I/O, memory, display) at 1 MHz.
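To make the data layout concrete, here is a plain-Python reference sketch of one simulation pass (all names are mine, hypothetical; the real version would be a CUDA kernel with one thread per gate, not this loop):

```python
# Reference sketch of the gate-update pass, using the three tables
# described above:
#   conn[i]   -> (a, b): the 2x 16-bit indices of gate i's two inputs
#   prev[i]   -> previous output of gate i (1 bit stored in a byte)
#   result[i] -> new output of gate i

def step(conn, prev):
    """One 14 MHz signaling tick: every gate is a NAND of its two inputs."""
    result = bytearray(len(prev))
    for i, (a, b) in enumerate(conn):
        result[i] = 1 - (prev[a] & prev[b])  # NAND
    return result

def run_cpu_cycle(conn, prev, ticks=14):
    """14 signaling ticks = one ~1 MHz CPU cycle, then hand the table
    back to the host-CPU side for I/O, memory and video emulation."""
    for _ in range(ticks):
        prev = step(conn, prev)
    return prev
```

For example, a two-gate netlist `conn = [(1, 1), (0, 0)]` (two cross-coupled inverters made of NANDs) settles into a stable state and stays there across a full 14-tick cycle.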
I would simplify the master frequency to 14 305 200 Hz, so that the video would be a perfect 60 Hz, instead of the standard 14 318 180 Hz (4x the NTSC color frequency). That's only a -0.1% global speed decrease, but it simplifies many things: video would be pure 60 fps (good for the PC), and all 65 cycles per scan line would have exactly the same duration (no longer 65th cycle anymore). The retrace timings and cycle counts would still be exact, so it should not disturb any program that syncs with the video: well-programmed games should have no flickering and no jittering (they're just globally -0.1% slower).
65 cycles x 262 scan lines x 60 Hz x 14 = 14 305 200 Hz
Divided by 14 = 1 021 800 Hz for the 6502 (1.022 MHz)
Anyway, the original crystal wasn't that precise: it could run faster when very hot, or slower when cold.
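The timing arithmetic above is easy to double-check:

```python
# Double-check of the proposed timing (numbers from the post).
CYCLES_PER_LINE = 65
LINES_PER_FRAME = 262
FPS = 60
DIVIDER = 14  # signaling ticks per 6502 cycle

master = CYCLES_PER_LINE * LINES_PER_FRAME * FPS * DIVIDER
cpu = master // DIVIDER

print(master)              # 14305200  -> proposed master clock, Hz
print(cpu)                 # 1021800   -> 6502 clock, Hz (~1.022 MHz)
print(14318180 / master)   # ~1.0009   -> standard 4x NTSC is ~0.1% faster
```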
My PTX ISA code is certainly not correct, but it's just there to count the LD/ST and ALU instructions, in order to do the math: only the GTX 680 would be powerful enough to sustain the required number of LD/ST instructions plus the necessary memory bandwidth.
(I'm counting on the L2 cache for the pointer table: it would be read-only, so all that data would stay in the L2 cache after the first pass.)
I aimed for 4000 simulated gates, so that it can run all the NAND gates of the original (from visual6502), plus some circuitry around the 6502 for interfacing with the host CPU's I/O, RAM, and video.
But it could use far fewer if I reconstruct the CPU using simpler logic gates (AND, OR, XOR, NOT).
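To illustrate that tradeoff (a generic sketch using the standard NAND identities, not visual6502's actual netlist): every simpler gate decomposes into NANDs, but keeping NAND as the only primitive inflates the gate count, e.g. one XOR costs four NANDs.

```python
# Standard decompositions of the simpler gates into NANDs.
def nand(a, b): return 1 - (a & b)

def not_(a):    return nand(a, a)              # 1 NAND
def and_(a, b): return not_(nand(a, b))        # 2 NANDs
def or_(a, b):  return nand(not_(a), not_(b))  # 3 NANDs
def xor_(a, b):                                # 4 NANDs
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

# Sanity check against the truth tables:
for a in (0, 1):
    for b in (0, 1):
        assert and_(a, b) == (a & b)
        assert or_(a, b)  == (a | b)
        assert xor_(a, b) == (a ^ b)
```

So a schematic drawn with XOR primitives directly needs roughly a quarter of the simulated gates of its all-NAND equivalent, for those parts of the circuit.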
I would like to do a graphical, ultra simple logic circuits editor.
With only AND, OR, XOR, NAND and NOT gates.
Plus the wiring, of course (allowing junctions to split signals).
And it would allow connecting buses of any width (anything from 2 to 32 bits: 8, 16, 32, or 4, 24...), bidirectional, drawn as a single big arrow.
And we would be able to make blocks that we can duplicate and nest: for example, build a simple 1-bit full adder, make a block of it, duplicate it 8 times to make an 8-bit adder, then make a block of that so it can be added to the full CPU schematic, keeping it simple and readable. See?
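That block idea in concrete form (a hypothetical sketch, names mine): define the 1-bit full adder once, then instantiate it eight times to get the 8-bit adder block.

```python
def full_adder(a, b, cin):
    """The reusable 1-bit full adder block: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def adder8(x, y, cin=0):
    """8-bit adder block: the 1-bit block duplicated eight times,
    carry chained from one instance to the next (ripple carry)."""
    out, carry = 0, cin
    for i in range(8):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out, carry
```

For example, `adder8(200, 100)` gives `(44, 1)`: 300 wraps to 44 in 8 bits with the carry-out set, just as a real 6502 ADC would behave.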
It would run at 10 Hz (only) with animated signals, all running on the CPU (the GPU for graphics only).
And we could "burn" the whole design, and have it run in real time once plugged into an emulated motherboard.
I would write an online tutorial (in HTML5) showing the basics, some history, and my progress on the full 6502 processor (so that it can be viewed on any PC / Mac / tablet / smartphone).
And I would make a PC/Win7 + Mac/OS X + iPad program for the editor.
And a PC/Win7-only program for the realtime simulator (requiring a GTX 680 graphics card).
We could run the PC realtime simulator as a server, using an iPad (or the PC) for the display, and two iPhones (over Bluetooth) as joysticks or keyboards, for two-player games.
What do you think? Have I missed something huge that makes it impossible?
Or am I right? Should I start this amazing project?