Improving the 6502, some ideas
- GARTHWILSON
Quote:
They are all surface mount though, and the higher capacities are not in socket-friendly packages.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
BigEd wrote:
Does the 65816's ABORTB pin do this job? It says that the instruction completes, but without any register updates; then the machine takes an interrupt, so the interrupt handler can fix up and restart the instruction.
Quote:
I'm not sure how hard that is to implement but I see the timing on the 816 demands that it must be valid before the rising edge of phi2 - that must be a clue.
About being valid before the rising edge of phi2 - I think this should be relaxed to being valid during phi2 high, before the falling edge. Otherwise there is no way for address decoding to be done.
André
fachat wrote:
I think it's basically that all the registers are implemented as some kind of two-level shift register.
Quote:
About being valid before the rising edge of phi2 - I think this should be relaxed to being valid during phi2 high, before the falling edge. Otherwise there is no way for address decoding to be done.
This is why most other CPUs have multiple clock cycles per machine cycle.
I'm returning to this thread very late. Sorry.
Sure it does. The x86 is no more magical an instruction set than the 6502 is (more registers, wider widths, but otherwise a von Neumann architecture all the same), and Oberon works just fine there.
Smalltalk shows that these can be predicted and accounted for in the activation frame of a procedure.
These are not read- or write-barriers. You're describing synchronization points.
A read-barrier is a small clip of code responsible for translating a reference to a machine-readable address, which must be used (in the worst case) at each and every read site. It can be as simple as "@ @" in Forth (e.g., a double fetch) to resolve a handle to the object pointed to by the handle.
Likewise, a write-barrier is a small piece of code which is responsible for somehow marking an object as having been modified, so that a garbage collector will know to traverse its pointers anew, instead of caching the GC state from a previous pass (why re-traverse a tree when it's dynamically known to not have changed since the last time you checked?).
These barriers have nothing to do with cache coherency or thread synchronization.
Indeed, flushing the caches on every read or write in a GC'ed language will utterly destroy performance. Caches exist precisely because the CPU core speed is often at least a factor of 10 (often a factor of 1000!!) faster than bus speed.
I feel so sorry for you. You must hate life.
Whenever I have to deal with Java, I know I hate life.
Quote:
That just doesn't work on 6502 hardware level...
Quote:
And even if we use type descriptors, the stack frame on a 6502 can have very different sizes, depending on the execution flow - as the 6502 stack is often used as intermediary storage in conditional expressions, as there are so few registers.
Quote:
read and write barriers have been greatly improved in Java as well, and are, on modern CPUs, supported by hardware! A write barrier flushes the dirty pages from cache to memory (everything written before the barrier is stored in actual memory, not in cache), a read barrier throws the cache away (everything read after the barrier comes from memory)
Quote:
P.S.: I'm a Java architect for a living, so I might be biased....
Whenever I have to deal with Java, I know I hate life.
BigEd wrote:
Interesting. But I note that Xilinx offer a free 8-bit CPU (picoBlaze) which does 42MHz in the CPLD version, versus 74MHz-200MHz on FPGA. See their PDF
The reason is very simple: FPGAs have HUGE bus wires in them, which take time to drive. More importantly, they have HUGE numbers of multiplexors and other random logic isolating one look-up block from another. Yes, FPGAs can be quite fast, but the architecture that defines a CPLD (namely, a matrix of PALs, each PAL being just a sum-of-products matrix) obviously means lower propagation delays given the same feature sizes on die. For starters, CPLDs usually don't have bus wires that span the whole die. There's not a whole lot of random logic involved (indeed, this is the CPLD's greatest strength and its greatest weakness -- ALL logic must be expressed in sum-of-products form!), and absolutely no storage elements except for I/O pads.
I'd like to see the picoBlaze in a CPLD in the same process as a Xilinx Spartan-III go up against a real Spartan-III implementation.
GARTHWILSON wrote:
I'm not sure exactly what you mean by "program-referenced,"
Quote:
HP-71 BASIC does have a DESTROY instruction
Quote:
Thank you for the explanations. I probably wasn't able to read them as fast as you typed them (Samuel types nearly 100 wpm and seems to be able to talk about something else at the same time!)
Quote:
I believe the Apple IIgs ProDOS used that. Is that correct?
kc5tja wrote:
We'd still need someone to synthesize though.
I had another go at synthing Rob Finch's bc6502 just now: without any special care, it met timing at 60MHz and takes up 18% of a xc3s250e-pq208-5 which seems to be a $20 part. (I only constrained the clock, so input and output timings might be worse.)
That particular design has a slightly restrictive license, so it might be best to look into the other designs out there (or start afresh). And note that any of those designs might have the odd bug.
Rob did post elsewhere that it took him a year (of spare time) to get a working 6502, and another year to get it cycle accurate and finished. This is why I recommend aiming at the simplest design, preferably as an increment on something existing.
ps: might be worth noting that Rob went on to do an enhanced 6502 (with stack-relative mode), a 24-bit and then a 32-bit Sparrow, then 64-bit and possibly other cores. I got the impression Sparrow was in the spirit of the 6502. You may have to dig for details.
Last edited by BigEd on Sun Aug 09, 2009 10:22 am, edited 1 time in total.
kc5tja wrote:
I'm returning to this thread very late. Sorry.
...
These are not read- or write-barriers. You're describing synchronization points.
...
Quote:
read and write barriers have been greatly improved in Java as well, and are, on modern CPUs, supported by hardware! A write barrier flushes the dirty pages from cache to memory (everything written before the barrier is stored in actual memory, not in cache), a read barrier throws the cache away (everything read after the barrier comes from memory)
http://en.wikipedia.org/wiki/Write_barrier or, at a lower level,
http://gee.cs.oswego.edu/dl/jmm/cookbook.html
And yes, those are synchronization points.
Quote:
A read-barrier is a small clip of code responsible for translating a reference to a machine-readable address, which must be used (in the worst case) at each and every read site. It can be as simple as "@ @" in Forth (e.g., a double fetch) to resolve a handle to the object pointed to by the handle.
Likewise, a write-barrier is a small piece of code which is responsible for somehow marking an object as having been modified, so that a garbage collector will know to traverse its pointers anew, instead of caching the GC state from a previous pass (why re-traverse a tree when it's dynamically known to not have changed since the last time you checked?).
These barriers have nothing to do with cache coherency or thread synchronization.
Quote:
Indeed, flushing the caches on every read or write in a GC'ed language will utterly destroy performance. Caches exist precisely because the CPU core speed is often at least a factor of 10 (often a factor of 1000!!) faster than bus speed.
Putting them in the wrong places is a source of performance degradation, though, and is done quite often. Writing efficient multithreaded programs is very difficult, is often not understood well enough, and is a frequent cause of performance-optimization tasks (which I do more often than I'd like to).
André
kc5tja wrote:
BigEd wrote:
picoBlaze reportedly does 42MHz in CPLD versus 74MHz-200MHz on FPGA.
Quote:
CPLD's ... no storage elements except for I/O pads.
fachat wrote:
Not really sure what you're talking about. Maybe read/write barriers have some other meaning in garbage-collection code.
See http://citeseerx.ist.psu.edu/viewdoc/su ... .1.52.8857
kc5tja wrote:
I'm returning to this thread very late. Sorry.
These are not read- or write-barriers. You're describing synchronization points.
A read-barrier is a small clip of code responsible for translating a reference to a machine-readable address, which must be used (in the worst case) at each and every read site. It can be as simple as "@ @" in Forth (e.g., a double fetch) to resolve a handle to the object pointed to by the handle.
Likewise, a write-barrier is a small piece of code which is responsible for somehow marking an object as having been modified, so that a garbage collector will know to traverse its pointers anew, instead of caching the GC state from a previous pass (why re-traverse a tree when it's dynamically known to not have changed since the last time you checked?).
These barriers have nothing to do with cache coherency or thread synchronization.
Quote:
read and write barriers have been greatly improved in Java as well, and are, on modern CPUs, supported by hardware! A write barrier flushes the dirty pages from cache to memory (everything written before the barrier is stored in actual memory, not in cache), a read barrier throws the cache away (everything read after the barrier comes from memory)
You also get full barriers - the optimizer must keep every access on the side of the barrier where you wrote it.
They have nothing to do with caches, of course; and flushing the cache is an utterly pointless (And on most architectures, privileged) exercise.
OwenS wrote:
They have nothing to do with caches, of course; and flushing the cache is an utterly pointless (And on most architectures, privileged) exercise.
If the ordering defines that anything that has been written before the barrier must be persistent after the barrier (write barrier), you have to somehow tell the other processors that these writes have happened.
In a tightly coupled multiprocessor system this is, I think, done automatically by the cache-coherency protocol (with the MESI protocol, for example, a processor snoops the other processors' memory accesses and intercepts a request for a cache line that is dirty in its own cache, etc.).
With explicit synchronization points in the program, this tightly coupled model can be relaxed (it is complex and difficult to implement, and costs performance). Memory accesses need not be snooped by other processors when the program itself tells the processor when to make the information visible to them. This is what I handwavingly called a "cache flush", which of course does not remove the data from the cache, but writes the cache data to main memory while still keeping it.
A read barrier would clear the read cache, so that everything read after the barrier has already been persisted (by a write barrier) to main memory before it.
Removing the need to snoop other processors' memory accesses greatly improves scalability, as snooping only scales IIRC to about 4 procs and quickly gets inefficient. Java programs using the memory model with the read/write barriers I described are capable of running on a lot more processors in parallel.
The explicit Java memory model was a great step in formalizing the contract between the programming language and the hardware, and allows much more standardized handling of synchronization etc. - across platforms!
André