6502.org • View topic - 6502/65816 Pipeline


All times are UTC

6502/65816 Pipeline

Page 1 of 2

[ 20 posts ]


cr1901

Post subject: 6502/65816 Pipeline

Posted: Sat Jun 07, 2014 11:28 pm

Joined: Wed Feb 05, 2014 7:02 pm
Posts: 158

Just as the title suggests... is there any information on the hardware internals of the '02/'816 pipeline beyond the 2 paragraphs in the programmer's manual?

Basically, I ask because while I was looking at the number of bytes/clocks required for an '02/'816 instruction, I realized something: If neither of those processors were pipelined, each Direct/Zero Page instruction would always save 2 cycles (1 fetch + 1 execute) over a "normal" instruction. However, if the '02/'816 have pipelines, it is quite possible that shorter instructions will empty the instruction queue faster than it can be filled... or maybe not, since the '02/'816 can fetch a single byte per clock cycle = bus cycle. While a pipelined processor will never be slower than an unpipelined processor, it seems such behavior can at least partially negate the performance (not space) gains from using the Direct Page.

This is of course assuming the '02/'816 pipeline works akin to the Execution Unit and Bus Interface Unit separation of the 8086/8088... A very good example of a string of simple instructions which actually takes longer to execute than a single complex instruction is SHL ax, 1 vs SHL ax, cl: SHL ax, 1 empties the instruction queue faster than it can be replenished.
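To put rough numbers on the 8088 example, here is a minimal sketch of a steady-state model in which each instruction costs whichever is larger: its execution time or the time to fetch its bytes. The timing figures (2 clocks for SHL reg,1; 8 + 4 per bit for SHL reg,CL; 4 clocks per bus cycle; one byte per bus cycle on the 8088) are the commonly quoted ones and are treated here as assumptions to check against Intel's documentation. The model ignores whatever is already sitting in the queue and assumes CL is already loaded.

```python
# Assumed figures; verify against Intel's documentation before relying on them.
BUS_CYCLE_CLOCKS = 4      # clocks per bus cycle
BYTES_PER_BUS_CYCLE = 1   # 8088 has an 8-bit bus (an 8086 would fetch 2)
SHL_BYTES = 2             # both forms of SHL encode in two bytes
SHL_BY_1_CLOCKS = 2       # execution time of SHL reg,1


def shl_by_cl_clocks(count):
    """Execution time of SHL reg,CL for a shift count held in CL."""
    return 8 + 4 * count


def fetch_clocks(nbytes):
    """Clocks the bus needs to fetch nbytes of instruction stream."""
    bus_cycles = -(-nbytes // BYTES_PER_BUS_CYCLE)  # ceiling division
    return bus_cycles * BUS_CYCLE_CLOCKS


def steady_state_clocks(exec_clocks, nbytes):
    """Toy model: the slower of execution and fetch sets the pace."""
    return max(exec_clocks, fetch_clocks(nbytes))


def repeated_shl_by_1(count):
    return count * steady_state_clocks(SHL_BY_1_CLOCKS, SHL_BYTES)


def single_shl_by_cl(count):
    return steady_state_clocks(shl_by_cl_clocks(count), SHL_BYTES)


for n in (1, 2, 4, 8):
    print(n, repeated_shl_by_1(n), single_shl_by_cl(n))
```

Under these assumptions each SHL ax, 1 is fetch-bound, which is the queue-starvation effect described above.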


GARTHWILSON

Post subject: Re: 6502/65816 Pipeline

Posted: Sun Jun 08, 2014 12:05 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

cr1901 wrote:

If neither of those processors were pipelined, each Direct/Zero Page instruction would always save 2 cycles (1 fetch + 1 execute) over a "normal" instruction.

This may not be what you're looking for, but the data sheet does tell what's on the buses in each cycle of an instruction, and the ZP/DP ones are in many cases one clock shorter, and it's not possible to make it two clocks shorter. LDA ZP takes 3: op code, operand, and load from the address. LDA abs takes 4, the extra one being for the high address byte. There's no reason to take 5, and there's no way to get it down to 2. In the case of indexing, the low byte is added while the high byte is being fetched, due to minor pipelining. Normally while the op code is being decoded, the next byte is being fetched so it will already be there for when it figures out if it's needed as the low byte of an operand. (I would have to look at all the addressing modes in the Detailed Instruction Operation table to see if this is universally true.) If there is an operand at all, there will definitely be a low byte but not necessarily a high byte; so it makes sense to have the low one first, and that's part of the minor pipelining. Ed or someone who's been involved with the visual6502 project can probably tell you more. That would be the info on the hardware internals which you might be looking for, although visual6502 is NMOS only, not CMOS. An advantage to not having a longer pipeline is the fact that branches and jumps are quick. A branch taken usually requires only three clocks (as does a JMP abs), which is a lot less than other processors.
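The cycle-by-cycle bus activity described above can be written out as a small table. This is only a sketch of the description in this post, not a copy of the data sheet; the labels are my own.

```python
# Each entry is one clock cycle: (bus operation, address source, what is transferred).
LDA_ZP = [
    ("read", "PC", "op code"),
    ("read", "PC+1", "zero-page address (the only operand byte)"),
    ("read", "zp", "data loaded into A"),
]

LDA_ABS = [
    ("read", "PC", "op code"),
    ("read", "PC+1", "address low byte"),
    ("read", "PC+2", "address high byte"),
    ("read", "abs", "data loaded into A"),
]


def cycles(trace):
    """One bus access per clock, so the cycle count is the trace length."""
    return len(trace)


def show(name, trace):
    print(f"{name}: {cycles(trace)} cycles")
    for number, (operation, address, what) in enumerate(trace, start=1):
        print(f"  cycle {number}: {operation} {address:5} {what}")


show("LDA zp", LDA_ZP)
show("LDA abs", LDA_ABS)
saved = cycles(LDA_ABS) - cycles(LDA_ZP)
print("cycles saved by zero page:", saved)
```

The only cycle that disappears is the high address byte fetch, which is why the saving is one clock and cannot be two.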

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


BigEd

Post subject: Re: 6502/65816 Pipeline

Posted: Sun Jun 08, 2014 6:13 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Visual6502 can tell you everything! But not necessarily in digestible form.

I think it's only barely true to speak of the 6502 as pipelined. The only overlap which saves cycles is when the final cycle of an instruction is an internal operation, which will then overlap with the next instruction's fetch. But there is no buffer for previously fetched information, so there's never a gain from the processor already having read the next opcode byte, which it sometimes will have done. And if the final cycle of an instruction is a write, it can't overlap with the subsequent fetch anyway.

For example,
INX
INY
will take four cycles, not six. There's a gain of two cycles, over an even simpler fetch-decode-execute machine(*), from the fact that the internal operation of writing back the modified register value overlaps with the fetch of the following instruction, which happens for both the INX and the INY.

But there is no gain from the fact that the INY byte was already read during the second cycle of the INX - the INY is read again and then decoded. Every instruction reads a subsequent byte during the decode cycle, which is used as an operand if it is needed, and which therefore constitutes a gain, but the byte is never fed into the instruction decoder even if it turns out that the present instruction was a single byte.
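As a sketch of the INX / INY case just described, the following lists the bus reads for a run of single-byte, two-cycle instructions. The start address $0200 and the helper name are made up for illustration; the behaviour modelled is only what is stated above (opcode fetch, then a read of the next byte that is thrown away).

```python
# Hypothetical memory image: INX, INY, NOP starting at $0200.
MEMORY = {0x0200: 0xE8, 0x0201: 0xC8, 0x0202: 0xEA}


def run_implied(pc, count):
    """Bus trace for `count` single-byte, two-cycle instructions.

    Cycle 1 fetches the opcode. Cycle 2 reads the following byte, but since
    the instruction has no operand the byte is discarded, and the same
    address is read again as the next opcode fetch.
    """
    trace = []
    for _ in range(count):
        trace.append((pc, "opcode fetch"))
        trace.append((pc + 1, "read and discarded"))
        pc += 1
    return trace


trace = run_implied(0x0200, 2)
for cycle, (address, note) in enumerate(trace, start=1):
    print(f"cycle {cycle}: read ${address:04X} = ${MEMORY[address]:02X}  {note}")
```

Four cycles in total, with the INY byte at $0201 appearing on the bus twice.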

(*) I think the Z80, which is clocked faster at any given technology, does use a clock cycle for each step of an instruction and, if that's right, is even less pipelined than the 6502. But because it's clocked faster, that's not a noticeable net loss. In the Z80, a memory access takes several clock cycles. (We don't have a visual Z80 to investigate, although Ken Shirriff has done some detailed analyses: see http://www.righto.com/search/label/Z-80)
I think later reimplementations of the Z80 gained more performance than the slightly improved later implementations of the 6502 which saved a cycle here and there - there was more slack to be taken up.
As we know, later descendants of the 8080 put in successively more elaborate mechanisms to become much much more productive per clock cycle. Having multiple instruction decoders working in parallel on the available pre-fetched instruction bytes is just the start of it. A quick search indicates 3 instructions per cycle is an achievable peak value, with half of that being a more likely best case.

Cheers
Ed


BigEd

Post subject: Re: 6502/65816 Pipeline

Posted: Sun Jun 08, 2014 7:54 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Hmm, I may have to rethink. Perhaps there are four stages to an instruction execution:
fetch
decode
execute
writeback
in which case the two-cycle instructions of 6502 actually save a couple of cycles compared to a strictly sequential implementation.

Certainly it's true that the results of a computation arrive in the registers and flags in cycle 4.
See for example
http://visual6502.org/JSSim/expert.html ... logmore=ir
where of course the fourth cycle is labelled 3 because we're counting from zero.
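One way to picture the four-stage reading is a timeline in which a new two-cycle instruction starts every two clocks while the previous one is still executing and writing back. This is just a sketch of that interpretation (the function names are mine), compared against a strictly sequential machine that starts an instruction every four clocks.

```python
STAGES = ["fetch", "decode", "execute", "writeback"]


def timeline(instruction_count, issue_interval):
    """Cycle number (counting from zero) of each stage of each instruction."""
    return [
        {stage: k * issue_interval + offset for offset, stage in enumerate(STAGES)}
        for k in range(instruction_count)
    ]


def total_cycles(schedule):
    """Cycles elapsed until the last writeback has finished."""
    return max(entry["writeback"] for entry in schedule) + 1


sequential = timeline(2, issue_interval=4)   # no overlap at all
overlapped = timeline(2, issue_interval=2)   # 6502-style two-cycle instructions

print("sequential:", total_cycles(sequential), "cycles")
print("overlapped:", total_cycles(overlapped), "cycles")
for number, entry in enumerate(overlapped):
    print("instruction", number, entry)
```

In the overlapped schedule the first result lands in cycle 3 (counting from zero), and the second instruction's fetch coincides with the first one's execute.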

Cheers
Ed


MichaelM

Post subject: Re: 6502/65816 Pipeline

Posted: Sun Jun 08, 2014 3:51 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL

cr1901:

cr1901 wrote:

Just as the title suggests... is there any information on the hardware internals of the '02/'816 pipeline beyond the 2 paragraphs in the programmer's manual?

There's not a lot of detail available regarding the cycles in the execution of an instruction for these processors. As you've probably already concluded, a lot of effort will be required to establish the kind of data you are looking for from those two paragraphs and the stated timings of the instructions. While documenting the behavior of my synthesized FPGA processor core, the M65C02, I developed a spreadsheet that contains a map of the opcodes, a cross-reference of the opcodes sorted by addressing modes, and a table of the cycles each opcode requires for execution in my core. You may find it useful. A GitHub wiki page presenting and discussing the contents of this table can be found here. The MS Excel spreadsheet can also be found in the GitHub repository for the project, but I dare not point you to it since it might raise BDD's hackles regarding incompatible industry standard file formats.

cr1901 wrote:

This is of course assuming the '02/'816 pipeline works akin to the Execution Unit and Bus Interface Unit separation of the 8086/8088... A very good example of a string of simple instructions which actually takes longer to execute than a single complex instruction is SHL ax, 1 vs SHL ax, cl: SHL ax, 1 empties the instruction queue faster than it can be replenished.

The '02, 'C02 and '816 are not pipelined. The '02 and 'C02 certainly don't break down the bus interface and execution units in the manner of the 8086/8088. Given the pedigree of the '816, I doubt that it is partitioned into bus interface and execution units in the manner of the 8086/8088. With separate bus interface and execution units, the '02/'816 instruction stream could be fetched sequentially and achieve some measure of pipelining, as was achieved in the 8086/8088. However, the 8086/8088 BIU simply fetches sequential locations from the instruction memory. It does not perform any of the complex effective address calculations required by the '02/'816 architectures. I suspect that the benefit of a prefetch queue to the '02/'816 would be negated by the indirect addressing modes, which are not included in the 8086/8088 architecture, and which are often used by programmers of the '02/'816.

As pointed out here (or on another thread), a rudimentary amount of overlapped instruction fetch and execution is provided. Perhaps it is clearer to say that these processors try to overlap the instruction fetch and write-back phases of an instruction's execution.

The REP modifier, although problematic with respect to interrupts, should be applied to instructions such as the '186/'188 string move instructions. That addition to the x86 ISA is very much worth the increased complexity. The improvement in instruction timing is high, and it certainly relieves the issue that you refer to relative to the prefetch queue. Since the instruction is no longer fetched from memory, the full bandwidth of the processor's bus is available for the move operations.

cr1901 wrote:

Basically, I ask because while I was looking at the number of bytes/clocks required for an '02/'816 instruction, I realized something: If neither of those processors were pipelined, each Direct/Zero Page instruction would always save 2 cycles (1 fetch + 1 execute) over a "normal" instruction.

I don't agree with your conclusion regarding the number of cycles that may be saved by an unpipelined '02/'816. I have pasted in two instruction timing tables below. The first shows the instruction cycles for the simple ZP read-only and write-only addressing modes, and the second shows the instruction cycles for the ZP read-modify-write addressing modes. The column labeled Fetch and the column labeled Next represent the memory/instruction cycles that can be overlapped.

As can be seen from the first figure, the M65C02 and the 65SC02 are able to overlap most of these cycles and so report the same number of memory/instruction cycles. The 1 cycle difference reported for the zp,X and zp,Y addressing modes comes from the fact that I included a separate address generator in my implementation of the '02/'C02 ISA that allows the M65C02 to make the indexed calculation directly as the next output address is being generated. (This approach works well in an FPGA using high-speed asynchronous memory, but not so much with synchronous RAM.) In neither addressing mode is there an opportunity, in a von Neumann memory architecture, to save two instruction cycles for ZP variables.

This is more obvious in the second figure, which highlights the instruction cycles for ZP read-modify-write operations. In the second figure, the M65C02 has eliminated one "dead" cycle, and can perform an operand read immediately followed by an ALU result write cycle. A standard '02/'C02 must perform the ALU operation after the operand read cycle, and therefore creates a "dead" memory cycle.

(The most significant improvement between the '02 and 'C02 is that the 'C02 reads the source operand address during the internal "dead" cycle instead of writing to the destination address twice as is done by the '02. This improvement, in general, allows the RMW instructions of the 'C02 to be used with R/W control registers of memory-mapped I/O devices. Again, this is not a generally recommended programming practice for memory-mapped I/O devices. Since the M65C02 does not include the dummy read cycle, its RMW instruction behavior can be used with R/W control registers of memory-mapped I/O devices.)
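To make the read-modify-write difference concrete, here is a sketch of the bus cycles for a zero-page RMW instruction such as INC zp on the three implementations, as I read the description above. The cycle labels are my own wording, and the M65C02 row simply assumes the "dead" cycle is dropped as stated.

```python
# One list entry per bus cycle of a zero-page read-modify-write instruction.
RMW_ZP = {
    "NMOS 6502": [
        "fetch opcode",
        "fetch zp address",
        "read operand",
        "write unmodified value",   # the dead cycle is a second write
        "write result",
    ],
    "65C02": [
        "fetch opcode",
        "fetch zp address",
        "read operand",
        "read operand again",       # the dead cycle is a dummy read
        "write result",
    ],
    "M65C02": [
        "fetch opcode",
        "fetch zp address",
        "read operand",
        "write result",             # dead cycle eliminated
    ],
}


def count_writes(trace):
    """Number of write cycles that hit the operand's address."""
    return sum(1 for step in trace if step.startswith("write"))


for core, trace in RMW_ZP.items():
    print(f"{core}: {len(trace)} cycles, {count_writes(trace)} write(s) to the operand")
```

The write count is what matters for a memory-mapped control register: only the NMOS part touches the location twice with a write.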

Attachment: M65C02_zp_zpX_Cycles.JPG (M65C02 zp; zp,X; and zp,Y Instruction Cycles)

Attachment: M65C02_RMW_zp_zpX_abs_absX_Cycles.JPG (M65C02 RMW zp; and zp,X Instruction Cycles)

Edit: the following figure was inadvertently loaded rather than the preceding figure. MAM

Attachment: M65C02_zpI_zpXI_zpIY_Cycles.JPG (M65C02 (zp); (zp,X); and (zp),Y Instruction Cycles)

_________________
Michael A.


cr1901

Post subject: Re: 6502/65816 Pipeline

Posted: Tue Jun 10, 2014 7:47 pm

Joined: Wed Feb 05, 2014 7:02 pm
Posts: 158

So in effect, the '816 manual's concept of pipelining is stretching the truth a bit, perhaps for the sake of marketing.


MichaelM

Post subject: Re: 6502/65816 Pipeline

Posted: Wed Jun 11, 2014 1:11 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL

I wouldn't say that.

Pipelining is an implementation technique that takes many forms. The opportunity for the 65816 to pipeline instruction execution is limited. Its lack of registers and many complex addressing modes, in particular the indirect addressing modes, tend to reduce the opportunities for instruction pipelining.

As in all realistic designs, there are trade offs, and the 6502/65816 are no exception. One generally acknowledged benefit of the small register set of these processors is their superior interrupt latency when compared to competitors of either the CISC or RISC processor camps. (I find their indirect addressing modes of great benefit for accessing tables and pointers stored in memory.)

In general, I find that these processors take good advantage of the available opportunities for instruction pipelining. Their simple internal architecture makes them easier to implement than many other processors. One particular opportunity for pipelining which is not taken advantage of is the "dead" cycle in a RMW instruction. I eliminated that cycle in my implementation, but even I didn't attempt to use that cycle to fetch the next instruction's opcode.

_________________
Michael A.


BigEd

Post subject: Re: 6502/65816 Pipeline

Posted: Wed Jun 11, 2014 2:12 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Do you also manage to perform the simple single-byte instructions in one cycle? If so, you're overlapping decode and fetch, which is a tad more than most implementations.


Dr Jefyll

Post subject: Re: 6502/65816 Pipeline

Posted: Wed Jun 11, 2014 3:06 pm

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada

cr1901 wrote:

So in effect, the '816 manual's concept of pipelining is stretching the truth a bit, perhaps for the sake of marketing

IMO, yes. In the 21st century context it seems gratuitous for them to be bragging about pipelining.

I'll soften that by saying MOS Technology did have good reason to brag back in the mid-1970's. Their newly-introduced 65xx CPUs made double use of the cycle following an opcode fetch -- unlike the 6800, IIRC. During that cycle the opcode is being decoded, and MOS's innovation was to simultaneously have a prefetch of the following byte. Multi-byte instructions thereby saved one cycle -- a significant boost.

The prefetch I just mentioned did not benefit single-byte instructions, although the potential exists to cut one cycle from the following instruction. (The prefetch could grab the following instruction's opcode.) But the MOS design of the 1970's was too unsophisticated to exploit this -- and so is the design sold by WDC today.

That doesn't make it a poor product! WDC makes up for it in other ways, such as the far-improved process and higher clock rates. But for them to mention pipelining in their hype does seem to me to be gratuitous (in the 21st century context).

If *any* 65xx product deserves recognition for pipelining, I'd say it's the 65CE02 -- which unfortunately was never produced in volume. Among its numerous features is single-cycle execution for many of the single-byte opcodes -- a feat which (partly due to patent restrictions?) remains unmatched even by the modern-day WDC 'C02 and '816.
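As a rough sketch of the three cases mentioned above, the following counts cycles for a short made-up instruction sequence under three toy models: no prefetch during decode (multi-byte instructions cost one cycle more), the actual 65xx behaviour, and a design that also reuses the prefetched byte as the next opcode so that single-byte two-cycle instructions take one cycle. The program and the model functions are illustrative assumptions, not measurements of any real part.

```python
# (mnemonic, length in bytes, cycles on a standard 6502)
PROGRAM = [
    ("LDA zp", 2, 3),
    ("INX", 1, 2),
    ("INY", 1, 2),
    ("STA zp", 2, 3),
]


def cycles_6502(program):
    """Prefetch during decode is used only when it turns out to be an operand."""
    return sum(cycles for _, _, cycles in program)


def cycles_without_prefetch(program):
    """Toy model: operand fetch cannot overlap decode, so multi-byte costs +1."""
    return sum(cycles + (1 if length > 1 else 0) for _, length, cycles in program)


def cycles_reusing_prefetch(program):
    """Toy model: the prefetched byte also serves as the next opcode."""
    return sum(1 if (length == 1 and cycles == 2) else cycles
               for _, length, cycles in program)


print("no prefetch      :", cycles_without_prefetch(PROGRAM))
print("6502             :", cycles_6502(PROGRAM))
print("prefetch reused  :", cycles_reusing_prefetch(PROGRAM))
```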

-- Jeff

ps- I could discuss other 65xx pipelining speedups which may or may not be implemented. In the name of brevity I will refrain.

pps- Ed has posted while I was typing this. I see he too has touched on the question of single-cycle execution.

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


cr1901

Post subject: Re: 6502/65816 Pipeline

Posted: Wed Jun 11, 2014 5:07 pm

Joined: Wed Feb 05, 2014 7:02 pm
Posts: 158

BigEd wrote:

Do you also manage to perform the simple single-byte instructions in one cycle? If so, you're overlapping decode and fetch, which is a tad more than most implementations.

I don't think that counts, since in effect, the '02/'816 are double-pumped (use both the rising and falling edge of the clock cycle to do different things). But from other CPU's points-of-view, where the clock doesn't necessarily do different things on both edges, yes, I can see where you're going with this.

EDIT: I think I'm actually confusing two different things... fetch should happen during posedge of PHI2 (on '816), and decode AND execute should happen during the next cycle. Provided the single byte instruction doesn't access memory (don't think any of them do), then a fetch should also be happening while the previous instruction is executing... I think.


GARTHWILSON

Post subject: Re: 6502/65816 Pipeline

Posted: Wed Jun 11, 2014 6:48 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California

Quote:

Provided the single-byte instruction doesn't access memory (don't think any of them do)

Stack pushes and pulls do, plus RTS & RTI.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Bregalad

Post subject: Re: 6502/65816 Pipeline

Posted: Sun Jun 15, 2014 7:22 pm

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland

The main problems in pipelining the 6502 (and derivatives) are:
1) Variable instruction width: the fetch unit has to know the size of an instruction to fetch the next one
2) Multiple non-instruction memory accesses for each instruction, which needs multiple ports and/or simultaneous 8- and 16-bit reads per port

There are workarounds for these, but nobody has ever gone to the trouble of doing all this work. Also, having a pipeline implicitly means having caches, and this creates many problems/incompatibilities with traditional 6502 systems (memory-mapped I/O, bank switching, etc.). Again there are workarounds, but they depend very specifically on the memory map a given user has, so it's not possible to release a generic 6502 clone with a cache (the location of the memory-mapped I/O would have to be a parameter). And it doesn't make sense to pipeline a processor without having caches, as a non-pipelined 65C02 (6502 without the dead cycles) already uses all the bandwidth you can theoretically get without using caches.

I tried to make a mock-up of a pipelined 6502 here, but it's very incomplete and just a vague idea, subject to many improvements.

We concluded that it might almost make more sense to dynamically translate the machine code to a RISC equivalent and send it into an efficient RISC pipeline, rather than doing this. Of course, without trying both, it's impossible to draw firm conclusions.
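The first problem in the list above (variable instruction width) can be illustrated with a short sketch: the start of each instruction is only known after the previous opcode has been looked at. The length table covers just a handful of standard 6502 opcodes, and the byte sequence is an arbitrary example.

```python
# Instruction length in bytes for a few standard 6502 opcodes.
LENGTH = {
    0xE8: 1,  # INX
    0xC8: 1,  # INY
    0x88: 1,  # DEY
    0xA9: 2,  # LDA #imm
    0xA5: 2,  # LDA zp
    0x85: 2,  # STA zp
    0xD0: 2,  # BNE rel
    0xAD: 3,  # LDA abs
    0x4C: 3,  # JMP abs
}


def instruction_starts(code):
    """Offsets at which instructions begin.

    Each step depends on decoding the opcode at the current offset, so a
    fetch unit cannot find instruction n+1 before it has decoded n.
    """
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += LENGTH[code[pc]]
    return starts


# LDA #$05 / STA $10 / INX / LDA $2000 / DEY / BNE -3
code = bytes([0xA9, 0x05, 0x85, 0x10, 0xE8, 0xAD, 0x00, 0x20, 0x88, 0xD0, 0xFD])
print(instruction_starts(code))
```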


BigEd

Post subject: Re: 6502/65816 Pipeline

Posted: Sun Jun 15, 2014 8:29 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England

Thanks for pointing back to that discussion and in particular the diagram.
To get all the way to single cycle execution of 6502 instructions would be difficult and impressive. But note that some gains including pipelining can be made without going the whole way.
Cheers
Ed


cr1901

Post subject: Re: 6502/65816 Pipeline

Posted: Wed Mar 25, 2015 6:01 am

Joined: Wed Feb 05, 2014 7:02 pm
Posts: 158

BigEd wrote:

Somehow I forgot about this discussion... ah well, better late than never.

I'm not certain I follow your logic here. What's significant about the fourth cycle? WDC's datasheet says for implied mode, an operation takes 2 cycles on both NMOS and CMOS models, as you state (I assume they don't include the time it takes to fetch the instruction in this), I'm not sure why the results of the DEY computation wouldn't arrive at cycle 4. Or are you simply stating that because results arrive on the 4th cycle of a 2-cycle instruction, we can make an educated guess that the third cycle is always execute?

Almost none of the 2-cycle/1-byte instructions touch memory. I'm guessing writeback for cycle 4 is just to satisfy Garth's observation of the exceptions (stack pushes)?

What's more interesting to me is the fact that in your example, the fetch for BNE indeed starts while the calculation for DEY is completing. And since this is a cycle-exact simulation, it represents reality.

Bregalad, re: pipelining requiring cache... isn't that a memory access-time problem, rather than a hard requirement for implementing pipelining?


Bregalad

Post subject: Re: 6502/65816 Pipeline

Posted: Wed Mar 25, 2015 9:01 am

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland

Yes, it's a memory access-time problem, but also a problem of the number of read ports and the amount of data you can read at a time.

With dual- or triple-ported memory you would be able to pipeline without caches. So yeah, my previous statement was wrong. What I had in mind was pipelining while keeping the traditional I/O interface of the original 6502, that is, single-ported memory and 8 bits of data at a time. It would only be able to go faster by prefetching parts of the program in the address space into a more efficient on-chip memory that allows reading multiple bytes per cycle, and executing from there.
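The bandwidth argument can be sketched as a lower bound: with single-ported memory and 8 bits per cycle, an instruction cannot take fewer cycles than its instruction bytes plus its data accesses. The table below uses standard 6502 byte and cycle counts for a few instructions; the "slack" column is my own label for the cycles that are not forced by the bus.

```python
# mnemonic: (instruction bytes, data accesses, cycles on a standard 6502/65C02)
INSTRUCTIONS = {
    "LDA zp": (2, 1, 3),
    "LDA abs": (3, 1, 4),
    "STA abs": (3, 1, 4),
    "JMP abs": (3, 0, 3),
    "INX": (1, 0, 2),
    "INC zp": (2, 2, 5),
}


def bus_bound(instruction_bytes, data_accesses):
    """Minimum cycles when the bus moves exactly one byte per cycle."""
    return instruction_bytes + data_accesses


slack = {
    name: cycles - bus_bound(nbytes, accesses)
    for name, (nbytes, accesses, cycles) in INSTRUCTIONS.items()
}

for name, extra in slack.items():
    print(f"{name:8} slack = {extra}")
```

Only the entries with non-zero slack (the dead cycles) leave anything to gain without widening or multi-porting the memory.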
