65C02 bus cycle types and frequencies

For discussing the 65xx hardware itself or electronics projects.
gfoot
Posts: 871
Joined: 09 Jul 2021

65C02 bus cycle types and frequencies

Post by gfoot »

A few months ago I had a hypothesis about what types of bus cycles the CPU would perform most often, e.g. opcode fetch, operand fetch, or general data reading and writing. I think I guessed that it would be about 50% opcode fetches, 25% operands, 20% random access data, and 5% I/O operations. I'd been meaning to test that on some "real world" programs, so I wrote some - a Mandelbrot plotter, two different prime sieves, and a Life simulation - and analysed them.

Hoglet's decoder can output a profiling summary of code that has executed, including how many times each instruction was executed and what the total cycle count was for that instruction. I used that to deduce what kinds of bus cycles were executed - the cycle count, for example, can reveal how often branches are taken, or how often page-crossing penalties are paid. This could be done better inside the decoder code, as it already knows most of this, but I did it as a postprocess. Here are the results: (Edit: replaced with corrected version, the "average" calculation was wrong)
[Attachment: cycletypefrequencies.png - cycle type frequency breakdown per program]
I included a second copy of the data from the Life simulation with a serial output busy-waiting loop edited out, because that loop was dominating the profile otherwise. It's not an invalid data point, just a bit of an outlier perhaps, or a rather different kind of workload.

The "average" row aggregates all of them, weighted by their inverse execution time ("weight" column).
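
The weighting can be sketched like this - note the distributions and cycle counts below are invented for illustration, not the measured figures from the table:

```python
# Sketch of the "average" row: each program's cycle-type distribution is
# weighted by the inverse of its total execution time, so that long-running
# programs don't dominate the aggregate. All numbers here are hypothetical.

def weighted_average(distributions, total_cycles):
    """distributions: one dict per program, mapping cycle type -> fraction.
    total_cycles: total execution time of each program, in the same order."""
    weights = [1.0 / t for t in total_cycles]
    total_weight = sum(weights)
    avg = {}
    for dist, w in zip(distributions, weights):
        for cycle_type, fraction in dist.items():
            avg[cycle_type] = avg.get(cycle_type, 0.0) + fraction * w / total_weight
    return avg

programs = [
    {"opcode": 0.35, "operand": 0.30, "data": 0.25, "internal": 0.10},
    {"opcode": 0.40, "operand": 0.25, "data": 0.20, "internal": 0.15},
]
print(weighted_average(programs, [2_000_000, 500_000]))
```

Getting this backwards - multiplying by execution time instead of dividing - is exactly the bug mentioned in the edit note above: it emphasises the long-running programs instead of balancing them out.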

I separated "internal" cycles, i.e. ones where the CPU doesn't care about the bus, into "internal 1" - ones that read from program memory, e.g. the next opcode - and "internal 2", where the CPU is executing a dummy read somewhere else, e.g. on the stack, or rereading an address in the middle of a read-modify-write. "internal 1" often happens after an opcode fetch when there's no operand - but not always; sometimes it happens after the operand fetch.
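
As a much-simplified sketch of the classification idea (not gfoot's actual postprocess, and ignoring the many timing special cases like page crossings and taken branches), each instruction can be mapped to a fixed sequence of cycle types and the per-instruction execution counts turned into per-cycle-type totals:

```python
# Simplified cycle classification. Cycle type labels: 'op' = opcode fetch,
# 'arg' = operand fetch, 'read'/'write' = data access, 'int1' = internal
# cycle reading program memory, 'int2' = internal dummy read elsewhere.
from collections import Counter

CYCLE_SEQUENCES = {
    "INX":     ["op", "int1"],                                 # implied, 2 cycles
    "LDA abs": ["op", "arg", "arg", "read"],                   # 4 cycles
    "STA abs": ["op", "arg", "arg", "write"],                  # 4 cycles
    "INC abs": ["op", "arg", "arg", "read", "int2", "write"],  # RMW, 6 cycles
}

def cycle_type_totals(execution_counts):
    """execution_counts: dict mapping instruction -> times executed."""
    totals = Counter()
    for insn, count in execution_counts.items():
        for cycle_type in CYCLE_SEQUENCES[insn]:
            totals[cycle_type] += count
    return totals

profile = {"LDA abs": 100, "STA abs": 50, "INC abs": 20, "INX": 200}
print(cycle_type_totals(profile))
```

The read-modify-write entry shows why shift/rotate-heavy code like the Mandelbrot program is heavy on "internal 2": every RMW to memory contributes one such dummy cycle.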

I think the results are interesting. My 50% guess was intended to include most of "internal 1" as opcode fetches - but I still overestimated that by quite a bit. Operand fetches are more common than I'd guessed, as are general purpose reads and writes. I didn't separate out I/O accesses because it wasn't easy to do so from the data I had, and I think that's highly dependent on the type of program.

The mandelbrot program is very light on "internal 1" - I think because a lot of the work it's doing is arithmetic with values stored in memory, so it doesn't have many implied instructions. It is correspondingly heavier on "internal 2" because it does a lot of read-modify-write operations to memory (particularly shifts and rotates).

The prime sieves don't have much "internal 2" - perhaps because they don't do those kinds of read-modify-write operations. They are completely different programs, but have very similar distributions here.

I had wondered whether it was theoretically possible to shorten the CPU clock during internal cycles, as the limiting factors in my system seem to be the address bus transition time and the subsequent decoding - but it's probably not worth it given that those cycles are relatively rare.

Writes are quite rare on the whole, though, so if it were possible to shorten the clock in general but add a wait state for writes, that could be a win.

For reference, the profile data for each program is also included in its repository.
Last edited by gfoot on Wed Oct 18, 2023 5:37 pm, edited 1 time in total.
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: 65C02 bus cycle types and frequencies

Post by BigEd »

Very nice analysis!
drogon
Posts: 1671
Joined: 14 Feb 2018
Location: Scotland
Contact:

Re: 65C02 bus cycle types and frequencies

Post by drogon »

I wonder though ... What might the result look like when it's executing a BASIC version of e.g. Mandelbrot?

It does seem to suggest that the 6502 isn't that efficient, as it's spending more time fetching opcodes and operands than it spends manipulating data...

Hmmm...

-Gordon
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/

Re: 65C02 bus cycle types and frequencies

Post by BigEd »

(That feels like the reason a larger register set is a win. Fetching opcodes is unavoidable... unless you have a cache. Caches became quite popular too, but not in the 8-bit era.)

Re: 65C02 bus cycle types and frequencies

Post by gfoot »

I've edited the image as I got the weighting backwards - the idea was to balance out the differences in execution time, but instead it emphasized them! It doesn't make a huge difference in the scheme of things though.

I considered trying the BASIC Mandelbrot program, but it's not so easy - I'd need to somehow preload it into the interpreter and make it auto-run, and I'd probably need to make it not draw the whole thing, as I'd run out of disk space! I'll see what I can do later on.

Re: 65C02 bus cycle types and frequencies

Post by gfoot »

Regarding time spent fetching opcodes, I guess ARM gets around that through pipelining - it can execute one instruction while another is being fetched - and through having block load/store instructions so that the proportion of memory traffic spent on opcodes is lower in those cases.

I'd also note that the 6502 does overlap other processing with opcode, operand, and data operations, so it's not entirely wasted time.
Chromatix
Posts: 1462
Joined: 21 May 2018

Re: 65C02 bus cycle types and frequencies

Post by Chromatix »

Quote:
Writes are on the whole quite rare though, so if it was possible to shorten the clock in general but add a wait state for writes, that could be a win.
That would be one reason why early caches tended to be write-through rather than write-back. The logic to do the latter correctly is more complex, and the performance penalty for leaving it out was initially not severe. So the caches simply reduced the typical latency for read operations, including opcode fetches.
Quote:
Regarding time spent fetching opcodes, I guess ARM gets around that through pipelining - it can execute one instruction while another is being fetched - and through having block load/store instructions so that the proportion of memory traffic spent on opcodes is lower in those cases.
The early cacheless ARM was designed quite specifically (like most RISC CPUs of the time) so that most of the time it was fetching a continuous sequence of fixed-size instruction words from sequential addresses. These sequential accesses were optimised at the memory-controller level, hinted by a "sequential access" bus control pin, into DRAM fast-page-mode accesses which completed in significantly less time than a random access - and the memory bus was a full 32 bits wide, so every fetch delivered a complete instruction word. Deviating from the sequential access pattern required releasing the DRAM from fast-page mode, selecting the new row address, and only then providing the column address - with potentially a second such switch to get back to the previous sequence.

Operands were mostly embedded in the instruction word or taken from the large register file, with only branches and load/store instructions deviating from that pattern. It was made pretty clear to programmers that, so long as you didn't violate the data dependency requirements of certain early-input (barrel shifter, multiplier) and late-output (memory load, multiplier) operations, only branch and load/store instructions would take more than one cycle each (with minor exceptions due to external DRAM quirks). Multi-register loads and stores were implemented to take advantage of the sequential-access optimisation, so they were faster than a sequence of individual loads and stores (as well as taking fewer instruction words).
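
A toy model of that sequential/non-sequential distinction (the S-cycle/N-cycle timings below are invented, not real ARM numbers): an access to the next sequential word runs at fast-page-mode speed, while any other address pays the full random-access cost.

```python
# Toy ARM bus timing model: next-sequential-word accesses are cheap
# "S-cycles"; anything else is an expensive "N-cycle". Timings invented.
S_CYCLE_NS = 125   # sequential access (same DRAM row, fast-page mode)
N_CYCLE_NS = 250   # non-sequential access (new row address needed)

def access_stream_time(addresses):
    """Total time to access a stream of 32-bit word addresses."""
    total, prev = 0, None
    for addr in addresses:
        sequential = prev is not None and addr == prev + 4
        total += S_CYCLE_NS if sequential else N_CYCLE_NS
        prev = addr
    return total

straight_line = list(range(0, 64, 4))      # 16 sequential opcode fetches
with_branch = straight_line[:8] + list(range(1024, 1056, 4))  # taken branch
burst_store = list(range(512, 544, 4))     # one multi-register store (STM)
scattered = [512, 8192, 1024, 640, 4096, 768, 2048, 896]  # 8 separate stores
print(access_stream_time(straight_line), access_stream_time(with_branch))
print(access_stream_time(burst_store), access_stream_time(scattered))
```

Under this model a taken branch costs an extra N-cycle in the middle of an otherwise sequential stream, and a multi-register store to eight sequential addresses is far cheaper than the same eight stores scattered across memory - roughly the effect described above.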

Some CISC CPUs also implemented pipelining to some degree, but it was much more common for instructions to spend multiple cycles in a given pipeline stage, or even to occupy two or more consecutive pipeline stages for multiple cycles in certain cases. The 68040 is perhaps the most obvious example of this, having a dedicated AGU stage with its own associated operand fetch stage, which some of the complex 68K addressing modes needed to use several times in sequence before getting through to the actual operand fetch stage and execution. Some instructions would also spend several - even dozens of - cycles blocking the ALU, which is the main reason why ARM didn't get a division instruction until much later.

Modern CPUs are really RISC CPUs with a CISC-to-RISC translator engine bolted onto the front. AMD's K5 was almost literally this, with the back-end execution hardware being a slightly enhanced version of their 29K-series RISC CPU. Instruction fetch and decode is a highly optimised area, with its own dedicated cache - or nowadays, separate caches for before and after translation. Ryzen's µOp cache works so similarly to the front end of a superscalar RISC CPU (e.g. I saw something extremely similar in the PowerPC G5 architecture) that it's almost indistinguishable. Again, it's all about making steady progress in each and every cycle when possible.