6502/65816 Pipeline

BigEd · Post by **BigEd** » Fri Mar 27, 2015 11:08 am

Hi cr1901
just in case it helps, here's the tabulation from the visual6502 simulation:

cycle  ab    db  rw  Fetch  pc    a   x   y   s   p         ir
0      0000  88  1   DEY    0000  aa  00  00  fd  nv‑BdIZc  00
0      0000  88  1   DEY    0000  aa  00  00  fd  nv‑BdIZc  00
1      0001  d0  1          0001  aa  00  00  fd  nv‑BdIZc  88
1      0001  d0  1          0001  aa  00  00  fd  nv‑BdIZc  88
2      0001  d0  1   BNE    0001  aa  00  00  fd  nv‑BdIZc  88
2      0001  d0  1   BNE    0001  aa  00  00  fd  nv‑BdIZc  88
3      0002  02  1          0002  aa  00  ff  fd  Nv‑BdIzc  d0
3      0002  02  1          0002  aa  00  ff  fd  Nv‑BdIzc  d0

We can all agree that adding a DEY to a program will add two cycles. As Michael puts it:

Quote:

As pointed out here (or on another thread), a rudimentary amount of overlapped instruction fetch and execution is provided. Perhaps it may be more clearly stated to say that these processors try to overlap the instruction fetch and write back phases of an instruction's execution.

From a CPU implementation perspective, I think it is helpful to see four cycles of activity corresponding to that DEY - two or three of which are overlapped. The final write back cycle is worth including because there is no overlap in the case where it's a memory store, only when it's a write back to a register. For the case of an instruction like DEY, we have:
Cycle 0: fetch, maybe overlapped with previous instruction's writeback
Cycle 1: decode
Cycle 2: execute, may be overlapped with next instruction's fetch
Cycle 3: writeback, may be overlapped with next instruction's decode
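
The four stages listed above can be laid out cycle by cycle. What follows is only a rough Python sketch, not anything produced by visual6502: it assumes a run of two-cycle register instructions, each starting two cycles after the previous one, and the names `STAGES`, `ISSUE_INTERVAL` and `timeline` (and the choice of INX as the second instruction) are made up for illustration.

```python
# Hypothetical model: every instruction passes through the same four stages,
# and a new instruction begins every two cycles (a two-cycle instruction
# such as DEY adds two cycles to the program).
STAGES = ("fetch", "decode", "execute", "writeback")
ISSUE_INTERVAL = 2


def timeline(instructions):
    """Map each cycle number to the (instruction, stage) pairs active in it."""
    cycles = {}
    for index, name in enumerate(instructions):
        start = index * ISSUE_INTERVAL
        for offset, stage in enumerate(STAGES):
            cycles.setdefault(start + offset, []).append((name, stage))
    return cycles


if __name__ == "__main__":
    # Print one line per cycle, listing everything in flight during that cycle.
    for cycle, work in sorted(timeline(["DEY", "INX"]).items()):
        print(cycle, ", ".join(f"{name} {stage}" for name, stage in work))
```

In this model cycles 2 and 3 each show two instructions in flight, which is the overlap described above: the DEY execute and writeback share their cycles with the next instruction's fetch and decode.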

If you trace also the signals sb, alu, DPControl in visual6502, you'll see that the new value for Y is written back to the Y register during the first half of cycle 3, over the sb (Special Bus) - therefore it's valid to say that the 6502 is still working on the DEY in that cycle, even though the IR now contains the next instruction. There are pipelined state bits in the random control logic which continue to control the datapath with the appropriate actions for the previous instruction. For example, node 460 goes low in the second half of Cycle 2, and then dpc1_SBY is valid in the first half of Cycle 3 for the writeback.

It can be useful also to trace the Execute and State pseudosignals, or to use the Logmore button.
http://visual6502.org/JSSim/expert.html ... ,DPControl

If you're not thinking about CPU implementation, but only thinking about programming the CPU, then you don't need to go into such fine detail - you probably only care about the incremental extra cycles for each instruction, if you even care about that. Of course, the datasheets are written for programmers, not for CPU designers.

Hope this helps
Ed

cr1901 · Post by **cr1901** » Sat Sep 12, 2015 8:28 am

Just to make sure... do all 6502 clock cycle counts take into account that the last cycle may not be doing anything besides fetching the next byte, in the case that the instruction needs to write/read memory? (For example, STA $FE00 takes four cycles, but the write is committed during the third clock cycle.)

FWIW, I'm looking at a Verilog 6502 core's signals: the cycle count is identical to an ASIC 6502, but what's actually on the bus during a given clock cycle may not be.

BigEd · Post by **BigEd** » Sat Sep 12, 2015 8:45 am

Arlet's core is quite different from most 6502s, in the bus timing. So let's put that to one side for a moment...

The NMOS 6502 will indeed do a write on the fourth cycle of STA absolute, and you're right to say that it sets up that write in the previous cycle. And yes, in a sense the core can't do anything useful during the write - there are no flags to update, no registers to change, and it can't use the bus to fetch the next opcode. From an outside perspective, cycle 4 is a write. That's the usual perspective. From an inside perspective, it's a wait. (There might be a small optimisation for a core: leave that write pending, fetch the next opcode instead, and perform the write during the following cycle. If that fetched opcode turns out to be a single byte, that's a win. But this will come at some cost in complexity and verification!)
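
As a rough illustration of the outside perspective, here is a small Python sketch of the four bus cycles of STA absolute on the NMOS 6502. It is a simplified model rather than a simulator trace: the function name `sta_absolute_bus_cycles` and the tuple layout are assumptions made for this example, and the program counter and accumulator values in the usage lines are arbitrary.

```python
STA_ABSOLUTE_OPCODE = 0x8D  # opcode for STA with absolute addressing


def sta_absolute_bus_cycles(pc, target, a):
    """Return (address, direction, data) for each bus cycle of STA target.

    Simplified model: three reads to fetch the instruction bytes
    (opcode, address low, address high), then one write of A.
    """
    low, high = target & 0xFF, (target >> 8) & 0xFF
    return [
        (pc, "R", STA_ABSOLUTE_OPCODE),  # cycle 1: opcode fetch
        (pc + 1, "R", low),              # cycle 2: address low byte
        (pc + 2, "R", high),             # cycle 3: address high byte
        (target, "W", a),                # cycle 4: the store itself
    ]


if __name__ == "__main__":
    # STA $FE00 placed at $0200 with A = $AA (values chosen arbitrarily).
    for number, (address, direction, data) in enumerate(
        sta_absolute_bus_cycles(0x0200, 0xFE00, 0xAA), start=1
    ):
        print(f"cycle {number}: {address:04x} {direction} {data:02x}")
```

The bus is busy with the write in the fourth cycle, so the next opcode fetch cannot start until the cycle after it.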

For most (well, many) instructions, the "last" cycle is actually only the last bus cycle. The next cycle is used to fetch the next instruction and is counted as part of that instruction, but it's also used to update the registers and flags - it's the write-back.

Rob Finch · Post by **Rob Finch** » Sat Sep 12, 2015 12:35 pm

The 65xx typically moves one byte per clock cycle, which is what makes it so fast. To get even better performance the processor has to move more than one byte per cycle. In order to improve performance in the RTF65003, I have the processor read all instruction bytes from a cache in a single clock cycle, so that a number of longer instructions (e.g. LDA $123456) take fewer clock cycles to execute than they would on a regular '02 / '816. The LDA $123456 takes only a single cycle to read four bytes, as opposed to the four cycles on the '816.
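
The arithmetic behind that comparison can be written down as a tiny cost model. This is only a sketch under stated assumptions: it counts instruction-stream reads alone (no data access cycles), assumes every cache read hits, and the names `fetch_cycles` and `LDA_ABSOLUTE_LONG_BYTES` are invented for this example rather than taken from the RTF65003.

```python
LDA_ABSOLUTE_LONG_BYTES = 4  # opcode plus a three-byte address, e.g. LDA $123456


def fetch_cycles(instruction_bytes, bytes_per_cycle):
    """Cycles spent reading an instruction's bytes, rounded up."""
    # Ceiling division written with integer arithmetic only.
    return -(-instruction_bytes // bytes_per_cycle)


if __name__ == "__main__":
    # One byte per cycle, as on the '816, versus a four-byte-wide cache read.
    print("byte-wide fetch:", fetch_cycles(LDA_ABSOLUTE_LONG_BYTES, 1))
    print("cache-wide fetch:", fetch_cycles(LDA_ABSOLUTE_LONG_BYTES, 4))
```

A fetch path narrower than the instruction still helps in this model: two bytes per cycle would halve the fetch cycles for the four-byte case.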

For higher performance the '816 could probably benefit from an instruction cache. But it really needs to also be able to read more than one byte of an instruction per clock cycle which would change the processor quite a bit.

BigEd · Post by **BigEd** » Sat Sep 12, 2015 12:41 pm

The other trick that's been proposed but not, I think, implemented by anyone yet is to offer a two-byte-wide port to an on-chip memory for page zero and maybe also page one. You'd end up with a core that has a fetch port, a read port, a write port, and a low memory port.
