6502 instruction cycles

lordsteve · Post by **lordsteve** » Mon Feb 21, 2005 3:48 am

How many clock cycles are in a CPU instruction cycle? Is this what the W65C02 datasheet refers to as a memory cycle? Why does NOP take 8 clock cycles in my circuit which is hardwired to read NOPs?

GARTHWILSON · Post by **GARTHWILSON** » Mon Feb 21, 2005 4:55 am

> How many clock cycles are in a CPU instruction cycle?

One. They're the same thing on the 6502. When an instruction like ADC# takes two cycles, that's two microseconds at 1MHz. The 6502 does a memory read or write in half a cycle, which is why 100ns memory is nowhere near fast enough to do 10MHz on a 6502. Since a "cycle" on many other processors refers to a cycle of several clock cycles, I prefer to call them clocks on the 6502. The average 6502 instruction takes about 4 clocks, depending on how many ZP instructions, indirect and/or indexed instructions, etc. that go into the mix.
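To put numbers on that, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the "half a clock" memory budget is only the rough rule of thumb from the paragraph above; a real design has to subtract address-valid delay, decoding time and data set-up time as given in the data sheet.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Length of one clock (which is also one bus cycle on the 6502), in ns."""
    return 1000.0 / freq_mhz


def instruction_time_ns(clocks: int, freq_mhz: float) -> float:
    """Time taken by an instruction that needs the given number of clocks."""
    return clocks * clock_period_ns(freq_mhz)


def rough_access_budget_ns(freq_mhz: float) -> float:
    """Rule-of-thumb memory budget: about half a clock.

    Assumption for illustration only; the real figure is smaller once
    set-up time and address decoding are taken into account.
    """
    return clock_period_ns(freq_mhz) / 2.0


if __name__ == "__main__":
    print("ADC# at 1MHz:", instruction_time_ns(2, 1.0), "ns")
    print("Access budget at 10MHz:", rough_access_budget_ns(10.0), "ns")
```

At 10MHz the budget comes out at roughly 50ns before any other delays are subtracted, which is why 100ns memory is nowhere near fast enough.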

> Why does NOP take 8 clock cycles in my circuit which is hardwired to
> read NOPs?

NOP is two clocks; so maybe you're starting with a higher frequency oscillator and dividing it down before feeding it to the processor.

Rob Finch · Post by **Rob Finch** » Tue Feb 22, 2005 4:03 am

> Why does NOP take 8 clock cycles in my circuit which is hardwired to
> read NOPs?

What manufacturer's part are you using? There is the odd 6502-compatible out there that isn't as efficient as the original.

lordsteve · Post by **lordsteve** » Tue Feb 22, 2005 4:34 am

It's a W65C02-14 from Western Design Center.

ghaytack · Post by **ghaytack** » Fri Feb 25, 2005 4:40 pm

Quote:

> How many clock cycles are in a CPU instruction cycle?

> One. They're the same thing on the 6502.

Hmmm - surely not, since "CPU instruction cycle" is otherwise known as the "fetch/execute cycle", which in most processors, and certainly the 6502, takes multiple clock cycles. Plus the statement that "NOP takes 2 clock cycles" wouldn't be true, which it is, as NOP does only take 2 clock cycles.

Isn't the correct answer "depends on the instruction - look it up in the manufacturer's data sheet" (see page 32 onward in the WDC 65C02 data sheet)?

Garth is right though that if Steve is seeing NOP take 8 clock cycles then either the clock being looked at isn't PHI 2 (the external output of the clock controlling the fetch/execute cycle) or Steve's not watching the SYNC pin. Keeping an eye on that is essential to be able to determine when an instruction cycle is starting since SYNC goes high at the start of the fetch phase and stays high until the end of the fetch phase. SYNC then goes low for the rest of the instruction cycle.

So counting the clock pulses from one low-high transition of SYNC to the next should give an accurate measure of the clock cycles each instruction is actually taking.
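As a minimal sketch of that measurement, suppose a logic analyser has captured the SYNC level once per PHI2 clock. The capture format and the function name are assumptions for illustration, not any particular analyser's output.

```python
def clocks_per_instruction(sync_samples):
    """Count clocks between successive low-high transitions of SYNC.

    sync_samples: sequence of 0/1 SYNC levels, one sample per PHI2 clock
    (an assumed capture format). A high first sample is treated as the
    start of an instruction, since SYNC is only high during opcode fetch.
    """
    starts = [
        i
        for i, level in enumerate(sync_samples)
        if level == 1 and (i == 0 or sync_samples[i - 1] == 0)
    ]
    # The gap between consecutive starts is the length of one instruction.
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


if __name__ == "__main__":
    nop_stream = [1, 0, 1, 0, 1, 0, 1]
    print(clocks_per_instruction(nop_stream))
```

If the samples are taken against some faster oscillator instead of PHI2, every count is inflated by the divide ratio, which is one way a NOP could appear to take 8 clocks.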

GARTHWILSON · Post by **GARTHWILSON** » Fri Feb 25, 2005 9:18 pm

So you're counting a "cycle" as the length of time to carry out one instruction? On other processors whose instruction cycle is multiple clock cycles, the instructions take varying numbers of these "cycles." For example even if an instruction cycle is 4 clocks, instructions may take one, two, three, or more "cycles", meaning four, eight, twelve, or more clocks-- but always a multiple of four. Even the PIC microcontrollers for which Microchip boasts "single-cycle instruction execution" take 4 clocks for most instructions but 8 for some.

OTOH, maybe those are just "bus cycles" needed for one instruction cycle-- I suppose it's a matter of terminology. I don't know. The 6502 does not have a multi-clock bus cycle like so many other processors do. Its bus cycle is the same as the clock cycle, which is 1µs @ 1MHz, 100ns @ 10MHz, etc.

If the "cycle" is just the amount of time taken for each instruction, you have to remember that many instructions' execution finishes up while the next instruction is being fetched-- i.e., there's an overlap due to minor pipelining.

ghaytack · Post by **ghaytack** » Mon Feb 28, 2005 10:19 am

Hmmm - well, this is where it can get very sticky if one tries to compare different CPUs, since each manufacturer uses different definitions. In addition you have to take into account each CPU's underlying architecture.

With reference to Steve's original question, I used a definition of "instruction cycle" as the time it takes to "run" an instruction from the point at which the CPU starts to fetch it to the point at which that instruction is completely finished with. This fits in with the standard "simple" computer science "fetch/decode/execute" model. At least three factors make this model substantially more complex:

1. Pipelining
2. Caching
3. Relationship between "clock cycles", "bus cycles" and "machine cycles".

The 6502 is, with respect to these, very simple. It has no true pipelining - i.e. there is no pre-fetching of the next instruction before the present one has finished executing. Some manufacturers' data sheets had a box labelled "pipeline" on the block diagram - but it wasn't what would be understood as a pipeline today. It has no cache, and finally, the relationship between "clock cycles", "bus cycles" and "machine cycles" is very simple. A quick comparison of the old MOS 65xx Hardware Manual with the Z80 will show just how simple it is.

With respect to instructions taking multiple and variable numbers of clock cycles, this is true of the 6502. The minimum number of clock cycles for any instruction is 2. The maximum is 8 (absolute address read/modify/write crossing a page boundary). But compared to most other CPUs it is a very simple relationship:

Instructions with implied, immediate and accumulator addressing, i.e. no operand stored after the opcode in memory or a single byte operand, take 2 clock cycles.

Instructions with non-indexed zero page addressing take 3 clock cycles. One for the opcode fetch/decode, one for the operand address read and a third to execute. It takes this extra cycle due to having to transform the operand into an address to read from or write to when it executes the instruction.

And so it goes on (see page 20 of the WDC 65C02 data sheet).
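To show how the pattern continues, here is a small lookup sketch for one instruction, LDA. The figures are the commonly published clock counts for LDA, included as an illustration rather than copied from the page cited above, so check them against the data sheet for your part. The names are made up for the example.

```python
# Commonly published base clock counts for LDA, by addressing mode.
# Illustrative only; verify against the manufacturer's data sheet.
LDA_BASE_CLOCKS = {
    "immediate": 2,
    "zero page": 3,
    "zero page,X": 4,
    "absolute": 4,
    "absolute,X": 4,
    "absolute,Y": 4,
    "(zero page,X)": 6,
    "(zero page),Y": 5,
}

# Modes where indexing across a page boundary costs one extra clock.
PAGE_CROSS_PENALTY_MODES = {"absolute,X", "absolute,Y", "(zero page),Y"}


def lda_clocks(mode: str, page_crossed: bool = False) -> int:
    """Clocks taken by LDA in the given addressing mode."""
    clocks = LDA_BASE_CLOCKS[mode]
    if page_crossed and mode in PAGE_CROSS_PENALTY_MODES:
        clocks += 1
    return clocks


if __name__ == "__main__":
    for mode in LDA_BASE_CLOCKS:
        print(f"LDA {mode}: {lda_clocks(mode)} clocks")
```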

It's very true that some CPUs like the PIC boast "single cycle per instruction". However it does depend on which "cycle" they refer to. I suspect (would need to check Microchip's data) it is a single "machine cycle" and is helped by the fact that both opcode and operand are contained in a single machine "word" read in one "gulp". Trying to understand the relationship between clock cycles and instructions and bus cycles on something like a Pentium is a major mind blowing exercise! I suppose when the 6502 was designed they could have just made all instructions take 8 clock cycles and defined that as a "machine cycle". Then it also would be a single cycle/instruction CPU....

But back to the 6502, the best way of seeing just how straightforward the relationship between clock cycles and instructions and bus cycles really is, is to hook one up to a logic analyser and let it rip with some code. I did this back in the late 80's on a single board 6502 I was developing for some on-board vehicle weighing system. There was some odd shit going on with the prototype my predecessor had built (on HUGE slabs of veroboard!). So I used a Microtan 65 as a cheap logic analyser. Hooked the ports of one VIA up to the target's address bus. One port of the other VIA to the data bus and the last port to the control bus and clock.

It was really fascinating watching the results. Poetry in motion..... Until the clock speed got too high and the miles of copper comprising the buses on this beast started having an effect. You could see everything on all three buses becoming corrupted and skewed - solution KISS (keep it small stupid!).

Anyway, this all comes back to the fact that on the 6502 the number of clock cycles per instruction cycle is exactly what is printed in the data sheet and if you watch the SYNC pin you'll see it do a low-high transition at the start of each and every instruction.

BitWise · Post by **BitWise** » Mon Feb 28, 2005 5:06 pm

ghaytack wrote:

> The 6502 is, with respect to these, very simple. It has no true pipelining - i.e. there is no pre-fetching of the next instruction before the present one has finished executing.

I thought the 6502 'pipelined' its effective address calculation process (for indexed memory access). The CPU adds X or Y to the low byte of the address in parallel with incrementing the PC and fetching the high byte. It's certainly more pipelined than the Z80, which takes a separate cycle just to increment the PC during instruction fetching!

I agree that conventions used by the various manufacturers for 'cycle' times are very confusing and you always need to be on your toes to determine whether they are stating 'clock cycles' or some derived 'machine cycle'.

Intel seems to have had a bit of history with multi-cycle instructions. Their first processor, the 4004, uses 8 clock cycles to execute a single-byte instruction or 16 for a two-byter. Admittedly some of the inefficiency is due to the fact that they were constrained to a 16-pin package and had to multiplex a 12-bit address, an 8-bit opcode and 4-bit data through the same four I/O pins. Even the ROM and RAM chips have to understand some of the instruction decoding! (It's making my VHDL very complicated.)

GARTHWILSON · Post by **GARTHWILSON** » Mon Feb 28, 2005 5:08 pm

Quote:

> The 6502 is, with respect to these, very simple. It has no true pipelining - i.e. there is no pre-fetching of the next instruction before the present one has finished executing.

Actually there is a little bit of true pipelining, and a lot of instructions do finish up while the next one is being fetched. An example given in WDC's programming manual (page 40 in my edition) is ADC#, which requires 5 distinct steps, but only two clocks' time:
Step 1: Fetch the instruction opcode ADC.
Step 2: Interpret the opcode to be ADC of a constant.
Step 3: Fetch the operand, the constant to be added.
Step 4: Add the constant to the accumulator contents.
Step 5: Store the result back to the accumulator.

Steps 2 and 3 both happen in a single clock. The processor fetches the next byte not knowing yet if it will need it or what it will be for. Steps 4 and 5 occur during the next instruction's step 1, eliminating the need for two more clocks. It cannot do steps 3 and 4 in one clock because the memory being read may not have the data valid and stable any more than a small set-up time before phase 2 falls and the data actually gets taken into the processor; so step 4 cannot begin until after step 3 is totally finished. But doing 2 and 3 simultaneously, and then doing 4 and 5 simultaneous with step 1 of the next instruction makes the whole 5-step process appear to take only 2 clocks.
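The overlap described above can be laid out as a small timeline. This is only a sketch of the five steps as listed in this post, with made-up names; it is not a model of the actual internal timing of the chip.

```python
# Clock offset, relative to the instruction's opcode fetch, at which each
# of the five ADC# steps happens according to the description above.
ADC_IMM_STEPS = [
    (0, "1 fetch opcode"),
    (1, "2 decode opcode"),
    (1, "3 fetch operand"),
    (2, "4 add to accumulator"),
    (2, "5 store result"),
]

# The next opcode fetch starts this many clocks after the current one,
# so this is what the instruction appears to cost.
CLOCKS_CHARGED = 2


def timeline(n_instructions: int) -> dict:
    """Map each clock number to the work done in it for back-to-back ADC#."""
    slots = {}
    for n in range(n_instructions):
        start = n * CLOCKS_CHARGED
        for offset, label in ADC_IMM_STEPS:
            slots.setdefault(start + offset, []).append(f"I{n} step {label}")
    return slots


if __name__ == "__main__":
    for clock, work in sorted(timeline(3).items()):
        print(f"clock {clock}: " + "; ".join(work))
```

Printing the timeline shows steps 4 and 5 of one instruction sharing a clock with step 1 of the next, which is how five steps fit into two clocks per instruction.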

Another part of the pipelining is the reason why operands are low-byte-first. The processor starts fetching the operand's low byte before the instruction decode has figured out how many bytes the instruction will have (1, 2, or 3). In the case of indexing before or without any indirection, the low byte needs to be added to the index register first anyway, so the 6502 gets that going before the high byte has finished arriving at the processor. In the case of something like LDA(abs), the first indirect address is fetched before the carry from the low-byte addition is added to the high byte. Then if it finds out there was no carry generated, it already has what it needs, and there's no need to add another cycle to read another address 256 bytes higher in the memory map. This way the whole 7-step instruction process requires only 4 clocks. (This is from the next page of the same programming manual.)
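Here is a minimal sketch of that carry check, assuming an absolute indexed read such as LDA abs,X with the commonly published timing of 4 clocks plus 1 when the index addition carries into the high byte. The function name and the choice of instruction are assumptions for illustration.

```python
def absolute_indexed_read_clocks(base: int, index: int):
    """Return (effective_address, clocks) for an absolute indexed read.

    Assumes the usual published timing for something like LDA abs,X:
    4 clocks, plus 1 if adding the index to the low byte produces a carry
    (a page crossing), because the high byte then has to be fixed up.
    """
    index &= 0xFF
    low_sum = (base & 0xFF) + index      # done while the high byte arrives
    carry = low_sum > 0xFF               # known only after the low-byte add
    effective = (base + index) & 0xFFFF
    clocks = 4 + (1 if carry else 0)
    return effective, clocks


if __name__ == "__main__":
    for base, x in [(0x1234, 0x10), (0x12F0, 0x20)]:
        addr, clocks = absolute_indexed_read_clocks(base, x)
        print(f"base ${base:04X}, X=${x:02X}: address ${addr:04X}, {clocks} clocks")
```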

ghaytack · Post by **ghaytack** » Tue Mar 01, 2005 9:52 am

You are of course correct Garth on that point and I stand corrected.

However without details on how this affects single byte instructions (i.e. those using implied addressing such as CLC or INX) it is difficult to see if this pipelining results in pre-fetching of the NEXT instruction rather than just compressing of the fetch/decode/execute cycle of a single instruction into fewer clock cycles. All the examples given in the WDC programming manual only detail how the pipelining affects a SINGLE instruction. There is no mention of it having an effect on the next instruction.

So, a good question for the boys and girls at WDC would be: what happens to the pre-fetched operand byte if, when the opcode is decoded, it turns out to need no operand? Is it abandoned or somehow used as the next opcode? I suspect (but don't know) that it is just abandoned, whereas the pipeline implementation on some other processor families would see it used as the next opcode.

But getting back to Steve's original question, I still think that the most accurate method of seeing how many clock cycles an instruction actually takes is to watch what happens with the SYNC pin.

Interesting stuff this!

ghaytack · Post by **ghaytack** » Tue Mar 01, 2005 10:12 am

Please ignore the load of tosh I just posted.... A better reading of the WDC Programming Manual has proven you entirely right Garth!

GARTHWILSON · Post by **GARTHWILSON** » Tue Mar 01, 2005 7:31 pm

What I don't understand (not knowing much about the internals of processor design) is why there's a two-clock minimum. I would think that the execution of something like CLC could be carried out in the second clock at the same time that the next op code is being fetched. My guess is that there's something else involved with preparing the instruction decoder to handle the next one, requiring an extra clock. Commodore had a patent on a process they used in the 65CE02 that allowed them to eliminate virtually all the dead bus cycles and make at least 30 op codes take just one clock. There were not very many 65CE02's made, and the '816 was a much better upgrade to the 65c02 than the CE02 was, so I'm glad to have that. Still, it makes you wonder. I would not be opposed to having the processor's input clock be twice as high as the bus speed if it needs extra edges to do the job faster without affecting the bus speed. Having the input clock be twice as high would take another pin or two to get an onboard oscillator, but the original '02 had a couple of NC pins anyway, and I don't think they were just there for production testing. I think they're actually not connected to anything. In fact, now that I think about it, I remember the NMOS 6502 required the oscillator to be external anyway. Now the extra pins are used for the bus-enable input and the memory-lock output; and even though pin 35 is still a NC on the DIP, WDC still took away one of the ground pins (pin 1) for a vector-pull output. I suppose Bill Mensch had a good reason for it, but I sure don't know what it was.

orac · Post by **orac** » Thu Mar 03, 2005 3:13 am

Hi Garth,

You've touched upon something big.

Starting from the late 70's as silicon got faster, micros that had "asynchronous" buses had an advantage over the synchronous bus processors like the 6502 and 6800 family.

Why? Because you could run a Z80/8085 at higher and higher speeds and still connect them to the slowest (and cheapest) speed of 8255. With the 6502/6800 family, each time you went to a faster CPU, you had to buy a faster 6522/6821!

You could slow the 6502 with the "RDY" line, but then the reference clock for the 6522 wouldn't be consistent.
It is much easier, however, to connect a 14MHz 65C02 to an old 8255/8254.

Do you agree?

Cheers,

Paul

GARTHWILSON · Post by **GARTHWILSON** » Thu Mar 03, 2005 6:49 am

Hi Paul,

Yes, the Z80 and 8085 could use the same speed of memory at higher processor clock rates, but they also had to do 4 times as many MHz to keep up with the 6502 in terms of how fast they got a job done. I remember being impressed by the high numbers of clocks (called "T-states" on the Z80) that it took to carry out just a single instruction; then, IIRC, they also lacked some of the implied instructions that save time on the 6502, like the automatic compare-to-zero that comes with the load, logic, and arithmetic op codes, or the automatic implied decimal-adjust, not to mention the efficiency of the 6502's ZP being used, in a sense, as 256 processor registers.

There is definitely an advantage to not having to run everything at the speed of the slowest bus occupant; although I suppose you could clock a 6502 with pulses not of a constant speed but rather whose individual widths depend on how quickly the particular device being addressed on the bus can respond. In that respect, adding the appropriate logic, there's no reason you couldn't have an asynchronous 6502 bus; but as you say, it would keep you from using a 6522's T1 for something like a real-time clock or to output a constant-frequency square wave on PB7. One RTC alternative you could take, if it were still a 6522, would be to use an external, constant-frequency clock for the 6522's T2, input through PB6. From the data book, it looks like it might be a little more of a pain than using T1 in free-run mode. I haven't actually done it myself. One of the 6502's attractions however remains its simplicity, which we've touched on here regarding the synchronous, one-clock bus, and easy bus interfacing.

Garth

ghaytack · Post by **ghaytack** » Thu Mar 03, 2005 12:41 pm

The alternative is to do what Acorn did with the old BBC B and Master series. The 6502 was clocked at 2MHz but almost all of the I/O stuff (6522, 6850 etc) was clocked at 1MHz. They used a nifty cycle-stretching circuit to gear the CPU's Phi2 up and down depending upon the relative phases of the 2MHz and 1MHz clocks every time I/O was needed.

Not as nice or elegant as using full speed I/O parts but it's an alternative. Details of this can be found in the Advanced User Guides on the BBC Documentation Project website www.bbcdocs.com under "Essentials". Take a look at either the Model B Advanced User Guide or BBC B Service Manual.
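For anyone who wants a feel for why the relative phase matters, here is a toy model of the cycle stretching described above. It is not Acorn's circuit: it simply assumes that a CPU cycle touching a 1MHz device is held until it has spanned one complete 1MHz cycle and then ends on that cycle's boundary. The names and the alignment rule are assumptions for illustration; see the manuals mentioned above for the real details.

```python
FAST_NS = 500    # one 2MHz CPU clock
SLOW_NS = 1000   # one 1MHz I/O clock


def stretched_cycle_ns(start_ns: int) -> int:
    """Length of a CPU cycle that accesses a 1MHz device.

    start_ns is when the CPU cycle begins, measured from a 1MHz cycle
    boundary at time 0. Toy assumption: the access must cover one whole
    1MHz cycle and finish exactly on its end.
    """
    # First 1MHz cycle boundary at or after the start of the CPU cycle.
    next_boundary = -(-start_ns // SLOW_NS) * SLOW_NS
    end = next_boundary + SLOW_NS
    return end - start_ns


if __name__ == "__main__":
    for start in (0, FAST_NS):
        length = stretched_cycle_ns(start)
        print(f"start at {start}ns: {length}ns ({length // FAST_NS} fast clocks)")
```

Under this assumption an I/O access costs two or three 2MHz clocks depending on which half of the 1MHz cycle it starts in.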