Realtime 6502 on CUDA

Zetmor · Post by **Zetmor** » Sun Aug 26, 2012 1:39 am

Hi everyone. I'm new here, first post.

I've just done the math quickly, and on paper it fits!
It should be possible to simulate the 6502 (at circuit level, with 14 MHz signaling) in realtime, using the latest GeForce GTX 680 graphics card.

[Image no longer available: http://www.chaunier.com/images/6502.jpg]

(I compared my current GTX 570 and the new mid-range GTX 660 Ti with the top-of-the-line GTX 680)

I'll try to program that, or at least do a proof of concept.
(but in my spare time only: it will probably take me a year)

The way I would simulate it with CUDA:
- a table of 2 x 16-bit pointers per gate for the circuit connections
- an 8-bit table for the previous values (only 1 bit used in each byte)
- another 8-bit table for the results

The simulation would run at 14 MHz: 14 passes (one 1 MHz cycle of the 6502) before the table is sent back to the host CPU.
The GPU would do all the 6502 simulation; the host CPU would emulate the rest (I/O, memory, display) at 1 MHz.
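
To make the table layout concrete, here is a minimal sketch in plain Python rather than CUDA. The names `GATE_OPS`, `step` and `run_cpu_cycle`, the convention that gate i writes slot i, and the tiny four-slot circuit are all assumptions made for illustration; a real kernel would run one thread per gate.

```python
# Double-buffered gate simulation: every gate reads the previous pass's
# values ("old") and writes its result into a separate table ("new").
# All names and the netlist below are illustrative assumptions.

GATE_OPS = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR":  lambda a, b: 1 - (a | b),
}

def step(gates, old):
    """One pass. gates[i] is (op, index_a, index_b) and drives slot i;
    op None marks an external input that just keeps its value."""
    new = list(old)
    for out, (op, a, b) in enumerate(gates):
        if op is not None:
            new[out] = GATE_OPS[op](old[a], old[b])
    return new

def run_cpu_cycle(gates, values, passes=14):
    """14 passes of the gate network = one 1 MHz cycle of the 6502."""
    for _ in range(passes):
        values = step(gates, values)
    return values

# Slots 0 and 1 are inputs, slot 2 = AND(0, 1), slot 3 = NOT(2) built as NAND(2, 2).
example_gates = [(None, 0, 0), (None, 0, 0), ("AND", 0, 1), ("NAND", 2, 2)]
print(run_cpu_cycle(example_gates, [1, 1, 0, 0]))
```

Because each pass only reads `old`, a signal advances by one gate per pass, which is why the depth of the logic decides whether 14 passes are enough.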

I would simplify the master frequency to 14 305 200 Hz instead of the standard 14 318 180 Hz (4x the NTSC color subcarrier frequency), so that the video runs at a perfect 60 Hz. That's an overall speed decrease of only about 0.1%, but it simplifies many things. Video would be a pure 60 fps (good for the PC), and all 65 cycles of a scan line would have exactly the same duration (no longer 65th cycle anymore), while the retrace timings and cycle counts would still be perfect. So it should not disturb any program that syncs with the video: well-programmed games should have no flickering and no jittering (they are just 0.1% slower overall).
65 cycles x 262 scan lines x 60 Hz x 14 = 14 305 200 Hz
Divided by 14 = 1 021 800 Hz for the 6502 (1.022 MHz)
Anyway the original crystal wasn't that precise, it could run faster if very hot, or slower if cold.
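
The arithmetic behind those figures, written out as a small Python check. The constant names are only labels chosen for this sketch; the numbers are the ones quoted above.

```python
CYCLES_PER_LINE = 65          # 6502 cycles per scan line
LINES_PER_FRAME = 262         # scan lines per frame
FRAMES_PER_SECOND = 60        # the simplified, exact refresh rate
PASSES_PER_CPU_CYCLE = 14     # simulation passes per 6502 cycle
NTSC_MASTER_HZ = 14_318_180   # standard master clock

cpu_hz = CYCLES_PER_LINE * LINES_PER_FRAME * FRAMES_PER_SECOND
master_hz = cpu_hz * PASSES_PER_CPU_CYCLE
slowdown_percent = 100 * (NTSC_MASTER_HZ - master_hz) / NTSC_MASTER_HZ

print(cpu_hz)                      # 1021800
print(master_hz)                   # 14305200
print(round(slowdown_percent, 2))  # 0.09
```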

My PTX ISA code is certainly not correct, but it's just there to count the LD/ST and ALU instructions in order to do the math: only the GTX 680 would be powerful enough to sustain the number of LD/ST instructions plus the necessary memory bandwidth.
(I'm counting on the L2 cache for the pointers table: it would be read-only, so all that data will be loaded onto the L2 cache after the first pass)

I aimed for 4000 simulated gates, so that it can run all the NAND gates like the original (from visual6502) plus some circuits around the 6502 for interfacing to the CPU-side I/O / RAM / video.
But it could use far fewer if I reconstruct the CPU using simpler logic circuits (AND, OR, XOR, NOT).

I would like to do a graphical, ultra simple logic circuits editor.
With only AND, OR, XOR, NAND and NOT gates.
Plus the wiring, of course (allowing junctions to split signals).
And it would allow connecting a bus: any width (8, 16, 32, or 4, 24, anything from 2 to 32 bits), bidirectional, and drawn as a simple big arrow.
And we would be able to make blocks that can be duplicated and nested: for example, build a simple 1-bit full adder, make a block of it, then duplicate it 8 times to get an 8-bit adder, and make a block of that so that it can be added to the full CPU schematic, keeping it simple & readable. See?
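
As an illustration of that nesting idea, here is a hedged Python sketch: a 1-bit full adder made only of 2-input XOR, AND and OR operations, then reused eight times as a ripple-carry 8-bit adder. The function names `full_adder` and `adder8` are made up for this example and say nothing about how the editor would store blocks.

```python
def full_adder(a, b, carry_in):
    """1-bit full adder built from 2-input XOR, AND and OR gates."""
    partial = a ^ b
    total = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return total, carry_out

def adder8(x, y, carry_in=0):
    """8-bit ripple-carry adder: eight copies of the full_adder block,
    each carry output wired to the next block's carry input."""
    result, carry = 0, carry_in
    for bit in range(8):
        s, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry

print(adder8(200, 100))  # (44, 1): 300 wraps to 44 with the carry set
```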

It would run at 10 Hz (only) with animated signals. All running on the CPU (GPU for graphics only).
And we could burn it all, and have it run in realtime once plugged into an emulated motherboard.

I would do an online tutorial (in HTML5) showing the basics, some history, and my progress on the full 6502 processor. (so that it can be used on any PC / Mac / Tablet / Smartphone)
And I would do a PC/Win7 + Mac OS X + iPad program for the editor.
And a PC/Win7-only program for the realtime simulator (needing a GTX 680 graphics card).
We could run the PC realtime simulator as a server, using an iPad (or the PC) for the display, and two iPhones (over Bluetooth) as joysticks or keyboards for two-player games.

What do you think? Did I miss something huge that makes it just impossible?
Or am I right? Should I start this amazing project?

GARTHWILSON · Post by **GARTHWILSON** » Sun Aug 26, 2012 3:09 am

Welcome! It sounds like a great project. I would just add Linux, as there are a lot of Linux enthusiasts here. I don't have anything that runs Windows or Mac, or any iPad or iPhone.

BigEd · Post by **BigEd** » Sun Aug 26, 2012 6:52 am

Welcome! This would be amazing!

The first little obstacle will most likely be the simulation of the bidirectional pass gates which connect the Special Bus to the other busses. In the case of Peter Monta's FPGA model, this behaviour was simulated using a 6-bit analog model, with the overall effect that the FPGA clocks at 50MHz or so to emulate a 1MHz 6502. In the case of visual6502, the model is binary but recalcNodeList() can take 20 iterations for the circuit to stabilise (maybe more, but that's the most needed for the test program)
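
To show what "iterations for the circuit to stabilise" means, here is a small Python sketch that repeats full passes until nothing changes and reports how many passes were needed. It is not visual6502's recalcNodeList() algorithm; `settle`, the dict-based netlist and the inverter chain are assumptions made purely for illustration.

```python
def settle(gates, values, max_passes=100):
    """Repeat full passes until no value changes.
    gates maps an output slot to (function, input slots).
    Returns (stable_values, number_of_passes_that_changed_something)."""
    for passes in range(1, max_passes + 1):
        new = list(values)
        for out, (fn, inputs) in gates.items():
            new[out] = fn(*(values[i] for i in inputs))
        if new == values:
            return new, passes - 1  # the last pass changed nothing
        values = new
    raise RuntimeError("circuit did not stabilise (oscillation?)")

def invert(a):
    return 1 - a

# Chain of 5 inverters: slot 0 is the input, slot k = NOT(slot k-1).
chain = {k: (invert, (k - 1,)) for k in range(1, 6)}
final, passes = settle(chain, [1, 0, 0, 0, 0, 0])
print(final, passes)
```

The pass count depends on the depth of the logic and on the starting state, which is why a fixed budget of passes per clock tick has to cover the worst observed case.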

Forum user richarde has reduced the visual6502 netlist to flops, logic gates and remaining transistors: 950 gates, 340 unidirectional and 103 bidirectional transistors. It's believed that only 16 bidirectional transistors should be needed, but this was a programmatic reduction. (It's notable how few NAND structures there are: NMOS technology favours the NOR gate. Likewise, there's only a single case of 3 series transistors in a pull-down structure - otherwise, it's two at most.)

These stats are from a different version of the reduction:

Code:

* Circuit Statistics after optimisation:
*
*   Number of signal nets: 1215
*   Number of gates: 1461
*   Number of modules: 1
*   Primitive instances:
* 
*    AND           1
*    AO            3
*    AOI          87
*    BUF          26
*    DLAT        146
*    DLATB       117
*    INV         277
*    INV_APU      21
*    NAND         39
*    NMOS        103
*    NOR         310
*    NOR_APU      35
*    OR           28
*    UNMOS       265
*    ram32k        1
*    rom32k        1
*    ---------- ----
*    Total      1460
*    ---------- ----

Cheers
Ed

Zetmor · Post by **Zetmor** » Sun Aug 26, 2012 12:37 pm

Great!! Thanks!
So it's about half the gates I thought I'd have to simulate.
Meaning even my GTX 570 could do it, with some margin.
(I could then simulate a few more chips on the motherboard)
And I hope the reduced logic will allow the chip to stabilize in fewer than 14 passes.
(otherwise I'll have to cheat with accelerated blocks, like in Robot Odyssey)

About the Linux version, I'm not sure I'll still have the will and patience once I've done my PC & Mac versions. There are so many nVidia CUDA SDK versions for Linux (Fedora 14, RedHat 5.5 & 6.0, Ubuntu 10.04 & 11.04, OpenSUSE 11.2, SUSE Server 11 SP1): 8 versions! I'm afraid it would need 8 versions of the program. I'll try to code it mainly in plain C, using C++ (and Objective-C) only for the GUI, so that it can be ported easily. I can't promise I'll do it, but if I don't, I'll help make it possible (by giving out the sources). The priority is clearly PC/Win7, because that's the platform I know best, and the realtime sim will need a gamer's graphics card: gamers with that kind of hardware are usually on Win7. And I'll do a Mac version because I really like the idea of having a fully working simulated Apple //e running on an Apple computer, and that will be my initiation to Objective-C & Cocoa, which I'll need for the iPad/iPhone editor version I'll try to do too (if it doesn't take me years and someone else does it before me, hehe). Porting the Mac version to Linux might be easier too.

First step: finish reading all the CUDA guides.
Then I'll do a proof of concept for the CUDA sim: a simple console program that benchmarks the gate simulation. The core CUDA code would be near final (it should be quite small); I'll just run it with randomly connected gates to give it a try.
Then I'll start the editor (PC/Win7), so that at least I can take screenshots of the basic circuits (or save directly to HTML5 with working logic?), and start writing the 'tutorial' and the whole 'story':
- a history of computing and electronics
- simple gates, logic
- binary, BCD, number representation
- building a simple 1-bit adder
- building a simple BCD adder
- building a 12-digit BCD adder
- accumulator (BCD), memory (BCD)
- sync notions, I/O
- building a complete adder (signed, subtraction)
- building a working 4-operation calculator (emulated keyboard & display provided)
- adding a memory (MC/M+/M-/MR)
Then the computer odyssey:
- building 8 bit adder
- doing a complex adder, able to do both BCD & 8 bit
- adding shifts etc... the 6502 ALU.
... etc ... etc ...

I'm not sure it would interest many people.
But it would be a great adventure for myself (I would learn while doing it hehe).
I even found a name for it!

[Image no longer available: http://www.chaunier.com/images/a-computer-Odyssey.jpg]

BigEd · Post by **BigEd** » Sun Aug 26, 2012 1:52 pm

A big project! But yes, if you port to the Mac, and open source your work, it's likely not to be a big extra step to get Linux working.
Cheers
Ed

Zetmor · Post by **Zetmor** » Mon Aug 27, 2012 2:37 pm

With some optimization of CUDA PTX ISA code, I think it could run even faster.

AND|OR|XOR|NAND|NOR gates (2 by 2)

Code:

.global .b64	pointers[2000];	// 4000 pointer pairs (2 x 16 bits each), grouped by 2
.global .b8	old[4000];	// previous values, 1 bit used per byte
.global .b16	new[2000];	// 4000 results (8 bits each), grouped by 2

ld.global.b64	%r, pointers[index]
mov.b64	{a1, b1, a2, b2}, %r	// extract 2 pairs of 16-bit pointers

// load both inputs of both gates
ld.global.b8	x1, old[a1]
ld.global.b8	y1, old[b1]
ld.global.b8	x2, old[a2]
ld.global.b8	y2, old[b2]

// first gate
cvt.b16.b8	x1, x1
cvt.b16.b8	y1, y1
and|or|xor.b16	x1, x1, y1
[cnot.b16	x1, x1]	// only for nand|nor

// second gate
cvt.b16.b8	x2, x2
cvt.b16.b8	y2, y2
and|or|xor.b16	x2, x2, y2
[cnot.b16	x2, x2]	// only for nand|nor

// pack both results and store them with a single 16-bit write
mov.b16	r, {x1,x2}
st.global.b16	new[index], r

(the code is certainly not correct yet, it's just a draft on paper, in order to count the instructions;
there is no selection of gate type: it would simply be run as several kernels, one per gate type, each with the index starting and ending at the right place; gates would be rearranged at 'burn'/synthesis time in order to optimize cache usage)

= 6 LD/ST & 8-10 INT instructions for 2 gates
--> 3 LD/ST & 4-5 INT instructions / gate

= 172 Billion LD/ST instructions per second
Meaning even my GTX 570 could sustain a 4000-gate simulation from the LD/ST point of view
(but still limited by L2 cache throughput)
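
For reference, the instruction-rate figure can be reproduced from the numbers in this thread. A short Python check, with constant names that are just labels for this sketch:

```python
GATES = 4000
LDST_PER_GATE = 3            # 6 LD/ST for every 2 gates
INT_PER_GATE = (4, 5)        # 8 to 10 INT for every 2 gates
PASS_RATE_HZ = 14_305_200    # one pass over all gates per 14 MHz tick

ldst_per_second = GATES * LDST_PER_GATE * PASS_RATE_HZ
int_per_second = tuple(GATES * n * PASS_RATE_HZ for n in INT_PER_GATE)

print(ldst_per_second / 1e9)                # about 172 (billion LD/ST per second)
print([n / 1e9 for n in int_per_second])    # about 229 to 286 (billion INT per second)
```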

We could reduce the number of stores by grouping by 4 or 8, so that we have a 32- or 64-bit store instruction for the result, but it won't go faster, because grouping more gates per thread also reduces the number of parallel threads. On CUDA, because of the quite long latency of each instruction (>100 for a LD/ST, >20 for a typical INT), it takes hundreds or a thousand threads to hide that latency. I could pack values, up to 8 values per byte (using each bit), reducing the bandwidth greatly (divided by 8), but dividing the number of threads by 8 too: it would probably not be fast enough because it would always be waiting on latencies. I'll have to test various optimizations.
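
For what the "8 values per byte" packing would look like, here is a minimal Python sketch. The helper names `pack_bits` and `get_bit` and the least-significant-bit-first order are assumptions for illustration only.

```python
def pack_bits(values):
    """Pack a list of 0/1 values, 8 per byte, least significant bit first."""
    packed = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        packed[i >> 3] |= (v & 1) << (i & 7)
    return bytes(packed)

def get_bit(packed, i):
    """Read value i back out of the packed table."""
    return (packed[i >> 3] >> (i & 7)) & 1

table = pack_bits([0] * 4000)
print(len(table))  # 500 bytes instead of 4000
```

The catch is visible in `get_bit`: every read now needs a shift and a mask, and a thread that writes one byte is responsible for 8 gates at once.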

For the simple NOT gates, grouped by 4 it would be much faster:

Code:

.global .b64	pointers[1000];	// 4000 16-bit pointers, grouped by 4
.global .b8	old[4000];	// previous values, 1 bit used per byte
.global .b32	new[1000];	// 4000 results (8 bits each), grouped by 4

ld.global.b64	%r, pointers[index]
mov.b64	{a, b, c, d}, %r	// extract 4 pointers of 16 bits each

ld.global.b8	x1, old[a]
ld.global.b8	x2, old[b]
ld.global.b8	x3, old[c]
ld.global.b8	x4, old[d]

mov.b32	r, {x1,x2,x3,x4}	// pack the 4 values, one per byte
xor.b32	r, r, 0x01010101	// flip bit 0 of each byte = 4 NOT gates at once

st.global.b32	new[index], r

= 6 LD/ST & 7 INT instruction for 4 NOT gates
--> 1.5 LD/ST & 1.75 INT instruction / NOT gate
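
The "4 NOT gates in one XOR" trick can be checked on the host in a few lines of Python. This is only a sketch of the idea, with a made-up helper name `not4`; with one value per byte, the mask that flips bit 0 of each of the four bytes is 0x01010101.

```python
import struct

def not4(x1, x2, x3, x4):
    """Invert four 1-bit values (stored one per byte) with a single 32-bit XOR."""
    (word,) = struct.unpack("<I", bytes([x1, x2, x3, x4]))
    word ^= 0x01010101  # flip bit 0 of every byte
    return tuple(struct.pack("<I", word))

print(not4(0, 1, 1, 0))  # (1, 0, 0, 1)
```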

Another optimization could be to load the stored result before writing it:
- if the result is the same as the old value, then it has not changed, so I don't need to store it.
If it has changed, I lose one LD/ST instruction.
If it hasn't changed, I don't store the result (it's the same): that reduces write bandwidth, and the read could hit L1 = very fast (no latency). It would be quite a lot faster for big circuits in which most values don't change at each pass. But it would be much slower on circuits where all values change at each pass...
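
A hedged Python sketch of that "compare before store" idea, counting how many stores actually happen. The function name and the dict-based netlist are invented for this example. Note one assumption of mine: the comparison is made against whatever the destination slot currently holds, which stays correct even when two tables are swapped between passes.

```python
def step_skip_unchanged(gates, old, new):
    """Compute every gate from `old`; write into `new` only when the slot
    does not already hold the result. Returns the number of stores done."""
    stores = 0
    for out, (fn, a, b) in gates.items():
        result = fn(old[a], old[b])
        if new[out] != result:   # the extra load of the stored result
            new[out] = result
            stores += 1
    return stores

def and_gate(a, b):
    return a & b

gates = {2: (and_gate, 0, 1)}        # slot 2 = AND(slot 0, slot 1)
old, new = [1, 1, 0], [1, 1, 0]
first = step_skip_unchanged(gates, old, new)    # output changes: 1 store
second = step_skip_unchanged(gates, new, new)   # nothing changes: 0 stores
print(first, second, new)
```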

BigEd · Post by **BigEd** » Mon Aug 27, 2012 3:31 pm

Quick note on convergence: on richarde's unpublished simulator, the logic-reduced 6502 takes 22 iterations to stabilise (worst observed case) - the unreduced 6502 takes 27, or 34 on the very first tick.

That 22 is similar to and probably related to visual6502's 20 (which might be plus or minus one due to my miscounting)

It might be worth planning a 21MHz operation to get stable 1MHz emulation. It's possible the worst cases could be investigated and somehow improved by tweaking the netlist. I just checked and both rising and falling input clocks need 22 - if we were luckier then one phase would have been the harder case and the two phases could be modelled with different timings.

Cheers
Ed

Arlet · Post by **Arlet** » Mon Aug 27, 2012 4:17 pm

Would it be feasible to replace the 16 bi-directional pass gates by pairs of one-way buffers, and add some extra control lines from the PLA so that only a single direction is enabled at any time? Of course, it would not be a 100% replica, but it would be very close, and much faster.

BigEd · Post by **BigEd** » Mon Aug 27, 2012 4:40 pm

Interesting idea. At first I thought it might mean figuring out the 'right' direction for every T-state of every type of instruction, and then make the encoder for those combinations. (There would be a lot of don't cares, probably.) But if it's the case that SB should be driven exactly when one of the 4 registers is written, we could perhaps use the register write control signals, which already exist and (presumably) account for RDY too. So it's a six-input control for each of the two sets of 8: the original enable and an OR of SB/X, SB/Y, SB/S, SB/ADD, SB/AC

Zetmor · Post by **Zetmor** » Mon Aug 27, 2012 9:31 pm

I was leaning more towards redoing the 6502: reinventing it from a blank page, step by step. Using basic logic (plus a 1-pass flip-flop component), it could be optimized so that it stabilizes in 14 passes (nicely fitting the 14 MHz clock). Of course I would be 99% inspired by the real thing, but as the goal rather than the starting point of the project. In the end it would behave exactly the same (same cycles, same signals and signal timings, to 14 MHz precision).

If, even optimized as hell, it really can't fit in 14 iterations, then I'll see.
I could cheat with double-speed blocks (like in Robot Odyssey: each block or 'chip' you burn can be accelerated by a factor). Or, more elegantly, double the whole frequency to 28 MHz and allow blocks to be slowed down (by a factor of 2/3/4/...), so that parts that are already fast enough can be slowed down to reduce the simulation workload. It would be like emulating different propagation speeds.
But I would be much happier if everything could run at the same speed, without any trick anywhere.

For the bidirectional buses (the special internal one, and the outside buses too) I was thinking about simulating them with the special bus component provided: you draw the big bus arrow to a block (which can be a register, or the 6502 itself; blocks are nested) and you assign it to the 8 or 16 (or any number of) pins of the block that will use that bus. Inside the block, those pins no longer show as simple inputs or outputs, but allow two connections: one from a gate output and one to a gate input = 2 wires (one way each). And you connect a direction wire to the bus (the R/W). When you draw the bus, it has a direction: the block from which the bus arrow starts is the only data source when in WRITE mode (and all target blocks receive that data). When in READ mode, the data from all the blocks at the arrow ends are ORed together, and the source block receives the result.

I need to find a simple & quick tool to draw those things before I get to program my editor.

I don't expect to have many buses in the schematic, so they could be emulated separately (without basic gates but still in CUDA code) in a quick manner. Each bus has a list of sources, a list of targets, and the R/W wire source.
If the R/W is on (write), load the source bits and copy them to all target bits.
If the R/W is off (read), load all target bits, OR the values bit by bit, and copy the result to the source bits.
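
The two rules above can be sketched in Python as follows. This is an illustration under my own assumptions: `update_bus` is a made-up name, values live in one flat table of 0/1 slots, and each block's bus pins are given as a list of slot indices, one per bus bit.

```python
def update_bus(values, source_bits, target_bits_list, write):
    """values: flat table of 0/1 slots.
    source_bits: slot indices of the source block's bus pins.
    target_bits_list: one list of slot indices per target block."""
    if write:
        # Write: the source block drives every target block.
        for targets in target_bits_list:
            for src, dst in zip(source_bits, targets):
                values[dst] = values[src]
    else:
        # Read: OR the matching bit of every target into the source block.
        for bit, src in enumerate(source_bits):
            merged = 0
            for targets in target_bits_list:
                merged |= values[targets[bit]]
            values[src] = merged

# 2-bit bus: source pins in slots 0-1, two target blocks in slots 2-3 and 4-5.
table = [1, 0, 0, 0, 0, 0]
update_bus(table, [0, 1], [[2, 3], [4, 5]], write=True)
print(table)  # [1, 0, 1, 0, 1, 0]
```

With a plain OR in read mode, a target that is not driving the bus has to output 0 on its pins, otherwise it corrupts the value seen by the source.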

The provided components would be:

- AND gate (2 inputs)
- NAND gate (2 inputs)
- OR gate (2 inputs)
- NOR gate (2 inputs)
- XOR gate (2 inputs)
- NOT gate
- simple FLIPFLOP (2 inputs, 2 outputs)
- wires of course (you can make junctions: you add a junction anywhere you want on an existing wire, which puts a nice dot there, going to an input of course. For the sim, the destination input pin will be mapped directly to the source of the wire on which you make the junction, so it needs no extra computation in the sim and adds no propagation delay: if you have a wire with many junctions, all the connected wires turn red at the same time when current goes through)
+ I'll probably add a delay component (looking like a diode symbol), for the cases where signals need to arrive at the same time and one path is too fast (a simple load/store in the CUDA code)
+ the special bus, the big arrow, unidirectional, or bidirectional with a direction wire, so that you don't have 8 or 16 wires everywhere in the schematics, but just a nice-looking big arrow
(I have been dreaming of this one since I played Robot Odyssey 30 years ago)
+ block (a square icon with dotted lines, like a selection box)

The most difficult part would be the editor itself. I want it ultra simple, ultra obvious, without icons everywhere or boxes showing bits or values like I see in existing editors. You would just see the gates, as basic as they should be (standard black drawing); when not connected, you see a little empty circle for the inputs and a little arrow for the output (it goes back to a simple straight line when connected). You move the components directly. You add new ones by dragging them from the toolbox, or you can add one quickly by double-clicking / double-tapping (iPad) an empty zone: the toolbox appears after the second click/tap, with the last added component just beneath. If you release without moving, it adds the same component you added before. Or you can slide to select another one. Meaning if you want to add several AND gates (after you've added one), you just double-click on an empty space, double-click, double-click, double-click: each time a new component is added.
For wiring you just click the input or output and draw the line.
(the pathing will be tough to program, to avoid collisions and keep it nice)

To make a block of components (grouping some components, or even grouping almost 'everything' to form the CPU), you arrange your components so that the ones you want to group are in a rectangle. Then you select the block icon in the toolbox, and you draw a box around the components you want to group. And that's all. It will pop up a block window showing your components with the same placement, but wires that are connected to things outside are now connected to pins on the box border. You can name the block and name the pins. Once you're happy, you close that pop-up window, and you see the schematic you had, but instead of the components you had in the selection box, you now have a little block with the wires still connected as they should be. You can 'enter' it again anytime by double-clicking (or long-clicking) on it. That block is added to the toolbox: you can duplicate it, like any other basic component. You can copy the block and edit it: the copy will then differ from the original. But you can also 'freeze' the original block so that the new ones are not just copies: they are clones. If you change the original, you change all the clones at the same time, which avoids redoing modifications 8 times if, for example, you are working on the 1-bit ALU part of the 8-bit ALU.
And like basic components, you can add the same block several times with the double-click on empty space. A complete 8-bit adder could be done from scratch in less than 1 minute (if you know how to do it).

(if you want a standard 3-input gate, you build it from 2 gates and make a block of it; it will be reduced to a small block with 3 inputs / 1 output, and you can name the block and the input & output pins: then you have your own standard 3-input gate you can reuse, still working in 1 iteration since the two 2-input gates are parallel; it just takes twice the computation for the simulator, which is still realistic compared to the transistor count of the real thing)

The simulation (at slow pace, like 10 Hz) is working all the time, even while editing.
You can touch any wire to force it red (like if sending current there), and see the signal go through.
You can touch any red wire to force it black (like if cutting that wire), and see the signal go through.
And on the iPad you can touch multiple wires that way, and play with the circuit without having to connect an emulated peripheral (I'll provide keys, keyboards, simple LEDs, numeric LEDs & LCDs, etc...)

In that editing mode you don't see what's happening inside the blocks anymore, only the displayed wires are colored when signal is going through. But I could add a zoom out with the program exploding all the blocks to show all gates & wires in a single page (hiding the gates themselves, or with very tiny icons).

Once burned (compiled) for realtime CUDA simulation, we don't use the editor anymore.
We'll have real-world-like emulated hardware: you don't see any signal unless real (emulated) hardware is connected to a wire. There will be a keyboard, buttons, LEDs, a screen of course, a mouse, joysticks, etc. I could even wire the Apple slots, and allow plugging in extension boards (what about a SCSI card & 20MB hard disk? Or a Mockingboard?). Those could be simulated at logic level like everything else (running on the CUDA cores), or emulated on the CPU at 1 MHz (standard x86 code).

And of course there would be tools to help make it work. A signal analyser (which you connect to any wire) to check the signal timings, a hex editor or even a compiler? And a debugger, with bit or hex display of the outputs of x-bit blocks (to see register or bus contents) and memory chip dumps (memory would be emulated on the CPU only, I'm afraid, unless I manage to simulate a very small memory for the purpose of explaining how memory chips work; I could simulate just the zero page, but 48k of RAM won't fit for sure: when going for the full-blown Apple ][ the memory would be emulated only).

Sounding good?

(sorry for the very long posts, I need to write down my thoughts; I promise that once I've started the web site I'll be more concise)

org · Post by **org** » Tue Aug 28, 2012 4:59 am

Sounds like Napoleon plans

BigEd · Post by **BigEd** » Tue Aug 28, 2012 5:15 am

It's certainly ambitious. Best to define some intermediate goals, I would think: either the simulator or the GUI project would be worth bringing to a prototype stage.
Cheers
Ed

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Aug 28, 2012 6:23 am

BigEd wrote:

It's certainly ambitious. Best to define some intermediate goals, I would think: either the simulator or the GUI project would be worth bringing to a prototype stage.
Cheers
Ed

I certainly don't want to be discouraging, but Ed's right. It's better to start out smaller and then work up to a large project. Or, as I'm fond of saying now and then, learn to fly a Piper Cub before taking on a 747.

Zetmor · Post by **Zetmor** » Tue Aug 28, 2012 8:56 am

Yep, but I need an editor to start building the CPU and to prepare the web site (with screenshots from it, and with working animated logic in JavaScript later).

The core of the simulator itself would be the easiest part I think: that would be something like 3-4 pages of C code and 2 pages of CUDA code only. But I'll need the editor to feed it. The long part would be to provide the emulated hardware around (like the video decoder & display).
But I can start with a simple console program that takes the schematics from a text file.
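
As a possible starting point for that console program, here is a sketch of a text netlist reader in Python. The file format (`output = OP input input`, with `#` comments) is entirely hypothetical, invented here for illustration; nothing in this thread defines one.

```python
# Hypothetical netlist format, one gate per line:  out = OP in1 [in2]
GATE_INPUTS = {"AND": 2, "OR": 2, "XOR": 2, "NAND": 2, "NOR": 2, "NOT": 1}

def parse_netlist(text):
    """Parse lines like 'n3 = NAND n1 n2' into (output, op, inputs) tuples.
    Blank lines and '#' comments are ignored."""
    gates = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) < 4 or parts[1] != "=":
            raise ValueError(f"line {line_no}: expected 'out = OP in...'")
        out, op, inputs = parts[0], parts[2].upper(), tuple(parts[3:])
        if op not in GATE_INPUTS or len(inputs) != GATE_INPUTS[op]:
            raise ValueError(f"line {line_no}: bad gate '{op}' or input count")
        gates.append((out, op, inputs))
    return gates

SAMPLE = """
# half adder plus an inverted carry
sum    = XOR a b
carry  = AND a b
ncarry = NOT carry
"""
for gate in parse_netlist(SAMPLE):
    print(gate)
```

A later step would map the signal names to table indices, so that the result can be fed to the simulation kernel as the pointer table.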

I'll start with the sim, at least to be sure it's really feasible.

Zetmor · Post by **Zetmor** » Tue Aug 28, 2012 9:01 am

And I'll need someone with good English grammar to correct the web site.
It's about computer history (and MOS & Apple), so it's mainly American, so it will be in English.
