Improving the 6502, some ideas
- GARTHWILSON
Quote:
They are all surface mount though, and the higher capacities are not in socket-friendly packages.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
BigEd wrote:
Does the 65816's ABORTB pin do this job? It says that the instruction completes, but without any register updates; then the machine takes an interrupt, so the interrupt handler can fix up and restart the instruction.
Quote:
I'm not sure how hard that is to implement but I see the timing on the 816 demands that it must be valid before the rising edge of phi2 - that must be a clue.
About being valid before the rising edge of phi2 - I think this should be relaxed to being valid during phi2 high, before the falling edge. Otherwise there is no way for address decoding to be done.
André
fachat wrote:
I think it's basically that all the registers are implemented as some kind of two-level shift register.
Quote:
About being valid before the rising edge of phi2 - I think this should be relaxed to being valid during phi2 high, before the falling edge. Otherwise there is no way for address decoding to be done.
This is why most other CPUs have multiple clock cycles per machine cycle.
I'm returning to this thread very late. Sorry.
Sure it does. The x86 is no more magical an instruction set than the 6502 is (more registers, wider widths, but otherwise a von Neumann architecture all the same), and Oberon works just fine there.
Smalltalk shows that these can be predicted and accounted for in the activation frame of a procedure.
These are not read- or write-barriers. You're describing synchronization points.
A read-barrier is a small clip of code responsible for translating a reference to a machine-readable address, which must be used (in the worst case) at each and every read site. It can be as simple as "@ @" in Forth (e.g., a double fetch) to resolve a handle to the object pointed to by the handle.
Likewise, a write-barrier is a small piece of code which is responsible for somehow marking an object as having been modified, so that a garbage collector will know to traverse its pointers anew, instead of caching the GC state from a previous pass (why re-traverse a tree when it's dynamically known to not have changed since the last time you checked?).
These barriers have nothing to do with cache coherency or thread synchronization.
Indeed, flushing the caches on every read or write in a GC'ed language will utterly destroy performance. Caches exist precisely because the CPU core speed is often at least a factor of 10 (often a factor of 1000!!) faster than bus speed.
I feel so sorry for you. You must hate life.
Whenever I have to deal with Java, I know I hate life.
Quote:
That just doesn't work on 6502 hardware level...
Quote:
And even if we use type descriptors, the stack frame on a 6502 can have very different sizes, depending on the execution flow - as the 6502 stack is often used as intermediary storage in conditional expressions, as there are so few registers.
Quote:
read and write barriers have been greatly improved in Java as well, and are, on modern CPUs, supported by hardware! A write barrier flushes the dirty pages from cache to memory (everything written before the barrier is stored in actual memory, not in cache), a read barrier throws the cache away (everything read after the barrier comes from memory)
Quote:
P.S.: I'm a Java architect for a living, so I might be biased....
Whenever I have to deal with Java, I know I hate life.
BigEd wrote:
Interesting. But I note that Xilinx offer a free 8-bit CPU (picoBlaze) which does 42MHz in the CPLD version, versus 74MHz-200MHz on FPGA. See their PDF
The reason is very simple: FPGAs have HUGE bus wires in them, which take time to drive. More importantly, they have HUGE numbers of multiplexors and other random logic isolating one look-up block from another. Yes, FPGAs can be quite fast, but the architecture that defines a CPLD (namely, a matrix of PALs, each PAL being just a sum-of-products matrix) obviously means lower propagation delays given the same feature sizes on die. For starters, CPLDs usually don't have bus wires that span the whole die. There's not a whole lot of random logic involved (indeed, this is the CPLD's greatest strength and its greatest weakness -- ALL logic must be expressed in sum-of-products form!), and absolutely no storage elements except for I/O pads.
I'd like to see the picoBlaze in a CPLD in the same process as a Xilinx Spartan-III go up against a real Spartan-III implementation.
GARTHWILSON wrote:
I'm not sure exactly what you mean by "program-referenced,"
Quote:
HP-71 BASIC does have a DESTROY instruction
Quote:
Thank you for the explanations. I probably wasn't able to read them as fast as you typed them (Samuel types nearly 100 wpm and seems to be able to talk about something else at the same time!)
Quote:
I believe the Apple IIgs ProDOS used that. Is that correct?
kc5tja wrote:
We'd still need someone to synthesize though.
I had another go at synthing Rob Finch's bc6502 just now: without any special care, it met timing at 60MHz and takes up 18% of a xc3s250e-pq208-5 which seems to be a $20 part. (I only constrained the clock, so input and output timings might be worse.)
That particular design has a slightly restrictive license, so it might be best to look into the other designs out there (or start afresh). And note that any of those designs might have the odd bug.
Rob did post elsewhere that it took him a year (of spare time) to get a working 6502, and another year to get it cycle accurate and finished. This is why I recommend aiming at the simplest design, preferably as an increment on something existing.
ps: might be worth noting that Rob went on to do an enhanced 6502 (with stack-relative mode), a 24-bit and then a 32-bit Sparrow, then 64-bit and possibly other cores. I got the impression Sparrow was in the spirit of the 6502. You may have to dig for details.
Last edited by BigEd on Sun Aug 09, 2009 10:22 am, edited 1 time in total.
kc5tja wrote:
I'm returning to this thread very late. Sorry.
...
These are not read- or write-barriers. You're describing synchronization points.
...
Quote:
read and write barriers have been greatly improved in Java as well, and are, on modern CPUs, supported by hardware! A write barrier flushes the dirty pages from cache to memory (everything written before the barrier is stored in actual memory, not in cache), a read barrier throws the cache away (everything read after the barrier comes from memory)
http://en.wikipedia.org/wiki/Write_barrier or, at a lower level,
http://gee.cs.oswego.edu/dl/jmm/cookbook.html
And yes, those are synchronization points.
Quote:
A read-barrier is a small clip of code responsible for translating a reference to a machine-readable address, which must be used (in the worst case) at each and every read site. It can be as simple as "@ @" in Forth (e.g., a double fetch) to resolve a handle to the object pointed to by the handle.
Likewise, a write-barrier is a small piece of code which is responsible for somehow marking an object as having been modified, so that a garbage collector will know to traverse its pointers anew, instead of caching the GC state from a previous pass (why re-traverse a tree when it's dynamically known to not have changed since the last time you checked?).
These barriers have nothing to do with cache coherency or thread synchronization.
Quote:
Indeed, flushing the caches on every read or write in a GC'ed language will utterly destroy performance. Caches exist precisely because the CPU core speed is often at least a factor of 10 (often a factor of 1000!!) faster than bus speed.
Putting them in the wrong places is a source of performance degradation, though, and is done quite often. Writing efficient multithreaded programs is very difficult, is often not understood well enough, and is a frequent cause of performance-optimization tasks (which I do more often than I'd like to).
André
kc5tja wrote:
BigEd wrote:
picoBlaze reportedly does 42MHz in CPLD versus 74MHz-200MHz on FPGA.
Quote:
CPLD's ... no storage elements except for I/O pads.
fachat wrote:
Not really sure what you're talking about. Maybe read/write barriers have some other meaning in garbage-collection code.
See http://citeseerx.ist.psu.edu/viewdoc/su ... .1.52.8857
kc5tja wrote:
I'm returning to this thread very late. Sorry.
These are not read- or write-barriers. You're describing synchronization points.
A read-barrier is a small clip of code responsible for translating a reference to a machine-readable address, which must be used (in the worst case) at each and every read site. It can be as simple as "@ @" in Forth (e.g., a double fetch) to resolve a handle to the object pointed to by the handle.
Likewise, a write-barrier is a small piece of code which is responsible for somehow marking an object as having been modified, so that a garbage collector will know to traverse its pointers anew, instead of caching the GC state from a previous pass (why re-traverse a tree when it's dynamically known to not have changed since the last time you checked?).
These barriers have nothing to do with cache coherency or thread synchronization.
Quote:
read and write barriers have been greatly improved in Java as well, and are, on modern CPUs, supported by hardware! A write barrier flushes the dirty pages from cache to memory (everything written before the barrier is stored in actual memory, not in cache), a read barrier throws the cache away (everything read after the barrier comes from memory)
You also get full barriers - the optimizer must keep every access on the side of the barrier where you wrote it.
They have nothing to do with caches, of course; and flushing the cache is an utterly pointless (And on most architectures, privileged) exercise.
OwenS wrote:
They have nothing to do with caches, of course; and flushing the cache is an utterly pointless (And on most architectures, privileged) exercise.
If the ordering defines that anything that has been written before the barrier must be persistent after the barrier (write barrier), you have to somehow tell the other processors that these writes have happened.
In a tightly coupled multiprocessor system this is, I think, done automatically by the cache-coherency protocol (with the MESI protocol, for example, a processor snoops the other processors' memory accesses and intercepts a request for a cache line that is dirty in its own cache, etc.).
With explicit synchronization points in the program, this tightly coupled model can be relaxed (it is complex and difficult to implement, and costs performance). Memory accesses need not be snooped by other processors when the program itself tells the processor when to make the information visible to them. This is what I handwavingly called a "cache flush", which of course does not remove the data from the cache, but writes the cache data to main memory while still keeping it.
A read barrier would clear the read cache, so that everything read after the barrier has already been persisted (by a write barrier) to main memory before it.
Removing the need to snoop other processors' memory accesses greatly improves scalability, as snooping only scales IIRC to about 4 procs and quickly gets inefficient. Java programs using the memory model with the read/write barriers I described are capable of running on a lot more processors in parallel.
The explicit Java memory model was a great step in formalizing the contract between the programming language and the hardware, and allows much more standardized handling of synchronization etc. - across platforms!
André