Pipelined 6502

GARTHWILSON · Post by **GARTHWILSON** » Mon Oct 10, 2016 6:22 am

BigEd wrote:

an affordable FPGA has only 64k RAM on board - are you thinking of a different meaning of on board here Garth?

I was thinking of ASICs, not FPGAs, since the school project implies something that would be cheap in high volumes, and low-power. I doubt that FPGAs would qualify. For low-volume designs, we'll do it for fun, or to prove a concept, or to preserve a software investment; otherwise we'd go to an ARM or other processor.

Quote:

As for the idea of fast predictable response for embedded computing, yes and no. A CPU running at 100MHz or 200MHz may well have the freedom to take a few more cycles, or a variable number of cycles, to respond to interrupts, and still be a great improvement on a 1MHz or 14MHz CPU which behaves exactly like a 6502. Or it may not - it depends on what the requirements are for any specific use case. A very fast and very cycle-efficient and entirely deterministic CPU is a difficult target to hit - again, there are engineering tradeoffs. What made sense at 1MHz in the early 70s may need adjusting forty years later.

Bill Mensch said in an interview last year that if the '02 were made in the latest (for last year) deep-submicron geometry, he expects it could do 10GHz, which would be about 2.5 GIPS.

BigEd · Post by **BigEd** » Mon Oct 10, 2016 6:28 am

So, for any given choice of technology, you can have a bigger memory outside the chip than you can on it. So, caches make sense at certain operating points. If you keep changing the technology so you don't need a cache, then you don't need a cache - but you haven't proved that caches are not useful!

As for chip speed, you often make the point that you like low latency and fixed latency. But your latency will always vary by one clock, or indeed by one instruction. For a conventional 6502, that's say seven cycles, or at 14MHz it's 500ns. For a 100MHz pipelined CPU, that's some number of 10ns cycles. I hope you see how it might be that the less deterministic system counted in cycles might have better behaviour when looked at in nanoseconds - just change the numbers accordingly if they don't quite convince. The real world is measured in nanoseconds, so that's what counts.
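To put numbers on that argument, here is a minimal sketch in Python. The 14MHz figures come from the post above; the 5 to 20 cycle spread for the pipelined core is purely an assumed value for illustration, not a measurement of any real design.

```python
def latency_window_ns(clock_mhz, min_cycles, max_cycles):
    """Return (best, worst) interrupt latency in nanoseconds."""
    best = min_cycles * 1000.0 / clock_mhz
    worst = max_cycles * 1000.0 / clock_mhz
    return best, worst

# Conventional 6502 at 14 MHz: the interrupt may have to wait for an
# instruction of up to seven cycles to finish.
classic = latency_window_ns(14, 0, 7)

# Hypothetical pipelined core at 100 MHz. The 5..20 cycle spread is an
# ASSUMPTION chosen only to show a "less deterministic" machine.
pipelined = latency_window_ns(100, 5, 20)

classic_jitter = classic[1] - classic[0]        # spread in ns
pipelined_jitter = pipelined[1] - pipelined[0]  # spread in ns

print("classic   worst/jitter (ns):", classic[1], classic_jitter)
print("pipelined worst/jitter (ns):", pipelined[1], pipelined_jitter)
```

Under these assumed numbers the pipelined core varies over more cycles but fewer nanoseconds. Different cycle counts can be plugged in to see where that stops being true.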

I don't think the 10GHz idea from Bill means much here - you won't have a single-cycle large RAM at that speed, so Bill would end up in the same part of the design space as any other of us.

ttlworks · Post by **ttlworks** » Mon Oct 10, 2016 6:55 am

Welcome, manili.

The 6502 instruction set looks simple at first sight, but when trying to implement something 100% compatible with it,
a few subtle things in there might give you quite a headache.

There is a saying, that the devil hides in the details...
To list a few things:

The difference in the return address between RTS and RTI, as BigEd had already pointed out.
For PHA\PLA etc., unlike most other CPUs, the 6502 uses pre-increment and post-decrement,
meaning that the stack pointer decrements _after_ a Byte is written to the stack,
and increments _before_ a Byte is read from the stack.
The 'B flag' in the status register, which identifies a BRK instruction for the interrupt service routine, isn't really a flag;
it's a control signal from the "instruction sequencer".
That means, after a BRK instruction, the 6502 pushes PC and status register on the stack before fetching the vector,
and only in the status Byte pushed on the stack is the Bit that represents the B flag set to 1.
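The three points above can be sketched as a toy Python model. This is only an illustration, not cycle-accurate and not taken from anyone's implementation; the class and function names (`Stack6502`, `push_for_interrupt`, `rti`, `rts`) are made up for this example.

```python
class Stack6502:
    """Toy model of the 6502 stack in page $01 (not cycle-accurate)."""

    def __init__(self):
        self.mem = bytearray(0x10000)
        self.sp = 0xFF

    def push(self, value):
        # Write first, then decrement the stack pointer.
        self.mem[0x0100 | self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def pull(self):
        # Increment the stack pointer first, then read.
        self.sp = (self.sp + 1) & 0xFF
        return self.mem[0x0100 | self.sp]


B_BIT = 0x10       # bit 4: only meaningful in the pushed status byte
UNUSED_BIT = 0x20  # bit 5: always 1 in the pushed status byte


def push_for_interrupt(stack, pc, status, is_brk):
    """Push PC (high, low) and status, as BRK or a hardware IRQ/NMI does."""
    stack.push(pc >> 8)
    stack.push(pc & 0xFF)
    pushed = (status | UNUSED_BIT) & ~B_BIT & 0xFF
    if is_brk:
        pushed |= B_BIT  # B appears only in the byte written to the stack
    stack.push(pushed)


def rti(stack):
    """Pull status, then PC; the pulled PC is used unchanged."""
    status = stack.pull()
    lo = stack.pull()
    hi = stack.pull()
    return (hi << 8) | lo, status


def rts(stack):
    """Pull PC and add 1 (JSR pushed the address of its own last byte)."""
    lo = stack.pull()
    hi = stack.pull()
    return (((hi << 8) | lo) + 1) & 0xFFFF


s = Stack6502()
push_for_interrupt(s, 0x1234, 0x00, is_brk=True)
print(hex(s.mem[0x01FD]))  # pushed status byte with B and bit 5 set
print(rti(s))
```

An interrupt handler tells BRK from IRQ by inspecting the pushed byte on the stack, since the model (like the real part) has no B bit stored in the status register itself.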

Another discussion that might (or might not) be of interest for your project is here.

What makes the 6502 instruction decoder\sequencer a bit difficult are all those fancy addressing modes.
Would suggest to keep things as simple as possible from the start (while sticking with the NMOS 6502 instruction set),
because things tend to have a habit to become more and more complicated all by themselves later.

Good luck with your project,
looking forward to following your progress.

Arlet · Post by **Arlet** » Mon Oct 10, 2016 6:59 am

BigEd wrote:

The real world is measured in nanoseconds, so that's what counts.

And as systems have gotten faster, the physical limits have not improved. In the time that a hypothetical 10GHz CPU does one clock cycle, the data from external memory has traveled less than an inch across the circuit board. It is unavoidable that a real life system would require several clocks to fetch data, which means that you'd want to send data in bursts, which adds to complexity and latency for individual transfers.
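As a rough check of the "less than an inch" figure, here is a small Python calculation. The velocity factor of 0.5 is an assumption (a commonly quoted ballpark for signals in FR-4 board material); real traces vary.

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum, metres per second


def distance_per_cycle_mm(clock_hz, velocity_factor=0.5):
    """Distance a signal covers in one clock period, in millimetres.

    velocity_factor is the fraction of c at which the signal propagates;
    0.5 is an ASSUMED typical value for a PCB trace.
    """
    return C_M_PER_S * velocity_factor / clock_hz * 1000.0


d_pcb = distance_per_cycle_mm(10e9)          # on an assumed PCB trace
d_vacuum = distance_per_cycle_mm(10e9, 1.0)  # upper bound, in vacuum

print(f"10 GHz, PCB trace: {d_pcb:.1f} mm per cycle")
print(f"10 GHz, vacuum:    {d_vacuum:.1f} mm per cycle")
```

With the assumed velocity factor this comes out at roughly 15 mm per cycle, which is under an inch (25.4 mm). That is one way only, before counting the return trip or the memory's own access time.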

ttlworks · Post by **ttlworks** » Mon Oct 10, 2016 7:20 am

Arlet wrote:

And as systems have gotten faster, the physical limits have not improved.

True, true.
One solution would be having ball grid array pins on top and at the bottom of the CPU chip,
then to solder the CPU chip into the PCB... and SDRAM on top of the CPU chip.

But speaking of speed:
PCs seem to be getting faster with every year, but if you happen to be into process control
or real time data acquisition stuff, it feels like getting the data in and out of the PC
at a reasonable speed is getting more and more difficult\complicated with every year, too.

Of course, having two separate bus systems in a CPU, one for memory and one for I\O,
would require a lot more pins on the chip.

For instance, the obsolete TMS320C30 had two bus systems (block diagram is on page 12).

BigEd · Post by **BigEd** » Mon Oct 10, 2016 7:59 am

Indeed, even the humble Raspberry Pi uses Package on Package - diagram here.

But I feel we've strayed rather off-topic. The idea of this thread is a B.S. project to build a pipelined CPU somewhat like a 6502, with caches.

GARTHWILSON · Post by **GARTHWILSON** » Mon Oct 10, 2016 8:02 am

BigEd wrote:

As for chip speed, you often make the point that you like low latency and fixed latency. But your latency will always vary by one clock, or indeed by one instruction. For a conventional 6502, that's say seven cycles, or at 14MHz it's 500ns. For a 100MHz pipelined CPU, that's some number of 10ns cycles. I hope you see how it might be that the less deterministic system counted in cycles might have better behaviour when looked at in nanoseconds - just change the numbers accordingly if they don't quite convince. The real world is measured in nanoseconds, so that's what counts.

I'm not sure what you're getting at here. If the system is kept simple enough to get all the memory and processor on the same IC, you can run it faster, right? And wouldn't running from cache make it a lot less deterministic? If you're in the middle of a refill because of a cache miss when an interrupt hits, won't that cause a huge increase in latency?

Arlet wrote:

BigEd wrote:

The real world is measured in nanoseconds, so that's what counts.

And as systems have gotten faster, the physical limits have not improved. In the time that a hypothetical 10GHz CPU does one clock cycle, the data from external memory has traveled less than an inch across the circuit board. It is unavoidable that a real life system would require several clocks to fetch data, which means that you'd want to send data in bursts, which adds to complexity and latency for individual transfers.

That's why I'm saying that for the small amount of memory likely to be used in a 65 system (even an '816), all the memory can be on the same chip with the processor and I/O, and still be a much smaller die than many of the modern processors. The memory bus(es) won't go out on the PCB at all. How fast are the fastest caches?

Arlet · Post by **Arlet** » Mon Oct 10, 2016 8:14 am

Quote:

How fast are the fastest caches?

Here's some useful data on a recent high performance CPU. It appears that the L1 cache has a 4 cycle latency, and is only 32kB big. And I assume Intel hardware engineers have pulled every trick from the book to make it as fast as possible on the given technology.

BigEd · Post by **BigEd** » Mon Oct 10, 2016 8:14 am

What I'm saying is that if you move the goalposts, you can always seem to score. There are circumstances in which caches add performance, and there are circumstances in which some variability in latency is acceptable. There are other circumstances in which the original 6502 or '816 are absolutely the ideal solution. But we ought to be able to organise our thoughts well enough to distinguish these cases.

There's yet another case, in which it's worthwhile pursuing the design and implementation of a more advanced microarchitecture, as a learning experience, even if the end result is not an engineering solution to any specific problem.

Perhaps we could take this to another thread? Something like 'Is a cache ever an advantage?' or less problematic 'When is a cache a good solution?'

Arlet · Post by **Arlet** » Mon Oct 10, 2016 8:21 am

GARTHWILSON wrote:

If you're in the middle of a refill because of a cache miss when an interrupt hits, won't that cause a huge increase in latency?

Possibly, yes. But 'huge' is a relative word. We're still talking nanoseconds here, and there are few real-life events that need 0.1 microsecond interrupt response. And for those cases where this is a requirement, people usually solve this by adding some special purpose hardware, or a dedicated I/O processor.

ttlworks · Post by **ttlworks** » Mon Oct 10, 2016 8:33 am

IMHO a cache is a "kludge" to compensate for slow memory speed\bandwidth,
and it won't make the "real time" response of a computer "more deterministic".

6502 can address 64kB of memory in total.

If 64kB of RAM would fit into the FPGA you choose for the CPU implementation,
the interesting question is, if we really need a cache when creatively "arranging"
the RAM blocks in the FPGA. //That's why I had mentioned the TMS320C30 DSP, too.

Before we are getting lost in an endless debate about processor architecture,
it certainly would be good to know what FPGA hardware (or evaluation board)
manili already has (or intends to buy) for his project...

...and then to sort out what is feasible with the hardware and what is not.

BigEd · Post by **BigEd** » Mon Oct 10, 2016 8:36 am

I'm not even sure that's true - the idea is to make a successful B.S. project which demonstrates some level of engineering understanding. Even if it goes really slow, has some bugs, doesn't fit on the FPGA, it still could be a successful project depending on how it is written up.

We need a completely separate discussion for projects we might like to make for other purposes. Some of us are thinking about projects to learn and demonstrate, and others of us are thinking about projects with commercial applicability. We are not all on the same page!

GARTHWILSON · Post by **GARTHWILSON** » Mon Oct 10, 2016 8:49 am

Arlet wrote:

Quote:

How fast are the fastest caches?

Here's some useful data on a recent high performance CPU. It appears that the L1 cache has a 4 cycle latency, and is only 32kB big. And I assume Intel hardware engineers have pulled every trick from the book to make it as fast as possible on the given technology.

I don't see anything there about nanoseconds. I've seen leaded SRAM ICs down to about 5ns, truly random-access memory, ie, not burst mode with a latency to get the burst going, nor requiring the following bytes to come from successive addresses.

Quote:

Perhaps we could take this to another thread? Something like 'Is a cache ever an advantage?' or less problematic 'When is a cache a good solution?'

Good idea. Go ahead. Much appreciated.

Arlet wrote:

GARTHWILSON wrote:

If you're in the middle of a refill because of a cache miss when an interrupt hits, won't that cause a huge increase in latency?

Possibly, yes. But 'huge' is a relative word. We're still talking nanoseconds here, and there are few real-life events that need 0.1 microsecond interrupt response. And for those cases where this is a requirement, people usually solve this by adding some special purpose hardware, or a dedicated I/O processor.

If you have to load 1K at a time, at 100MB/s (for the sake of discussion), that's 10µs, which is a long, long time to wait for interrupt service on a processor designed for performance high enough to justify having caches. However, if interrupt and direct (non-cache) memory performance is high enough, you can eliminate the separate sound cards, and going to the hypothetical extreme, even video cards or video chip sets, in the hypothetical case of a timer-driven interrupt for each pixel.
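The refill arithmetic can be written out as a short Python sketch. The 1K block and 100MB/s are the for-the-sake-of-discussion figures from the paragraph above; the 64-byte line is an extra, assumed size added only for comparison.

```python
def refill_time_us(block_bytes, bandwidth_bytes_per_s):
    """Time to transfer one block at a given bandwidth, in microseconds."""
    return block_bytes / bandwidth_bytes_per_s * 1e6


# Figures from the post: 1K loaded at 100 MB/s.
t_1k = refill_time_us(1024, 100e6)

# ASSUMED smaller line size, for comparison only.
t_64 = refill_time_us(64, 100e6)

print(f"1 KiB refill: {t_1k:.2f} us")
print(f"64 B refill:  {t_64:.2f} us")
```

The worst-case interrupt delay from a refill scales directly with the block size, so the size of a cache line matters as much as the bandwidth when estimating it.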

Arlet · Post by **Arlet** » Mon Oct 10, 2016 8:53 am

GARTHWILSON wrote:

I don't see anything there about nanoseconds.

On the first line it mentions that the clock is 3.4 GHz, so 4 cycles would be just over 1 nanosecond for the L1 cache. The L3 cache has a latency of 36 cycles, or just over 10 nanoseconds.
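The conversion is simple enough to check in a few lines of Python, using only the clock and cycle counts quoted above (the function name is made up for this example):

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / clock_ghz


l1_ns = cycles_to_ns(4, 3.4)   # L1 cache, 4 cycles at 3.4 GHz
l3_ns = cycles_to_ns(36, 3.4)  # L3 cache, 36 cycles at 3.4 GHz

print(f"L1: {l1_ns:.2f} ns, L3: {l3_ns:.2f} ns")
```

For comparison, the roughly 5ns discrete SRAM mentioned earlier in the thread sits between these two figures.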

manili · Post by **manili** » Mon Oct 10, 2016 9:04 am

Thank you all for this hot discussion!

The topic grows really fast, and I don't know how to answer previous REPLYs (that was a typo and thanks for your notification).

First things first: I should give my special thanks to all the people who took part in this discussion and helped me learn many more things; most of them encouraged me to continue this project (as it becomes more difficult day after day). I'm currently busy working on stack-related instructions. And believe me, they are really hard to implement!
There are two important points here:
1. As I said before, this is going to be a B.S. project and the last step is just to synthesize it. So there are no FPGAs or ASICs involved. Anything I do or did should be seen in this scope (including implementing caches, 6 pipeline stages, etc.).
2. I'm really in the middle of the project (80%). So I can't just tear the whole thing down and rebuild it. That's impossible because of point #1.

I think one of the most important questions Garth raised is "Why WISHBONE?"
1. Again, look at #1.
2. Many of the opencores.org IP cores are WISHBONE compatible, so they can talk to my processor without any problem. This is the main reason behind WISHBONE. You can also use memories/peripherals with different clock speeds alongside the processor.

Again, thank you all for your REPLIES.

P.S. : Again sorry for my bad English.
