6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu Nov 21, 2024 10:49 pm

All times are UTC




Post new topic Reply to topic  [ 7 posts ] 
Author Message
PostPosted: Wed Oct 18, 2023 4:47 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
A few months ago I had a hypothesis about what types of bus cycles the CPU would perform most often, e.g. opcode fetch, operand fetch, or general data reading and writing. I think I guessed that it would be about 50% opcode fetches, 25% operands, 20% random access data, and 5% I/O operations. I'd been meaning to test that on some "real world" programs, so I wrote some - a Mandelbrot plotter, a prime sieve, another prime sieve, and a Life simulation - and analysed them.

Hoglet's decoder can output a profiling summary of code that has executed, including how many times each instruction was executed and what the total cycle count was for that instruction. I used that to deduce what kinds of bus cycles were executed - the cycle count for example can inform how often branches are taken, or how often page crossing penalties are paid. This could be done better inside the decoder code, as it already knows most of this, but I did it as a postprocess. Here are the results: (Edit: replaced with corrected version, the "average" calculation was wrong)
Attachment:
cycletypefrequencies.png
cycletypefrequencies.png [ 92.81 KiB | Viewed 30053 times ]


I included a second copy of the data from the Life simulation with a serial output busy-waiting loop edited out, because that loop was dominating the profile otherwise. It's not an invalid data point, just a bit of an outlier perhaps, or a rather different kind of workload.

The "average" row aggregates all of them, weighted by their inverse execution time ("weight" column).

I separated "internal" cycles, i.e. ones where the CPU doesn't care about the bus, into "internal 1" - being ones that read from program memory, e.g. the next opcode - and "internal 2" where the CPU is executing a dummy read somewhere else, e.g. the stack, or rereading an address in the middle of a read-modify-write. "internal 1" often happens after an opcode fetch when there's no operand - but not always, sometimes it happens after the operand fetch.

I think the results are interesting. My 50% guess was intended to include most of "internal 1" as opcode fetches - but I still overestimated that by quite a bit. Operand fetches are more common than I'd guessed, as are general purpose reads and writes. I didn't separate out I/O accesses because it wasn't easy to do so from the data I had, and I think that's highly dependent on the type of program.

The mandelbrot program is very light on "internal 1" - I think because a lot of the work it's doing is arithmetic with values stored in memory, so it doesn't have many implied instructions. It is correspondingly heavier on "internal 2" because it does a lot of read-modify-write operations to memory (particularly shifts and rotates).

The prime sieves don't have much "internal 2" - perhaps because they don't do those kind of read-modify-write operations. They are completely different programs, but have very similar distributions here.

I had wondered whether it was theoretically possible to shorten the CPU clock during internal cycles, as the limiting factors in my system seem to be the address bus transition time and subsequent decoding etc - but it's probably not worth it given they are relatively rare.

Writes are on the whole quite rare though, so if it was possible to shorten the clock in general but add a wait state for writes, that could be a win.

For reference - the profile data for each is also included in its repository:


Last edited by gfoot on Wed Oct 18, 2023 5:37 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 18, 2023 4:52 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Very nice analysis!


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 18, 2023 5:08 pm 
Offline
User avatar

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
I wonder though ... What might the result look like when it's executing a BASIC version of e.g. Mandelbrot?

It does seem to suggest that the 6502 isn't that efficient as it's spending more time fetching opcode and operands than it spends manipulating data...

Hmmm...

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 18, 2023 5:14 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(That feels like the reason a larger register set is a win. Fetching opcodes is unavoidable... unless you have a cache. They are quite popular too, but not in the 8 bit era.)


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 18, 2023 5:40 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I've edited the image as I got the weighting backwards - the idea was to balance out the differences in execution time, but instead it emphasized them! It doesn't make a huge difference in the scheme of things though.

I considered trying the BASIC Mandelbrot program, but it's not so easy as I'd need to somehow preload it into the interpreter and make it auto-run, and I'd probably need to make it not draw the whole thing as I'd run out of disk space! I'll see what I can do later on.


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 18, 2023 5:49 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Regarding time spent fetching opcodes, I guess ARM gets around that through pipelining - it can execute one instruction while another is being fetched - and through having block load/store instructions so that the proportion of memory traffic spent on opcodes is lower in those cases.

I'd also note that the 6502 does overlap other processing with opcode, operand, and data operations, so it's not entirely wasted time.


Top
 Profile  
Reply with quote  
PostPosted: Wed Oct 18, 2023 7:12 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Quote:
Writes are on the whole quite rare though, so if it was possible to shorten the clock in general but add a wait state for writes, that could be a win.

That would be one reason why early caches tended to be write-through rather than write-back. The logic to do the latter correctly is more complex, and the performance penalty for leaving it out was initially not severe. So the caches simply reduced the typical latency for read operations, including opcode fetches.

Quote:
Regarding time spent fetching opcodes, I guess ARM gets around that through pipelining - it can execute one instruction while another is being fetched - and through having block load/store instructions so that the proportion of memory traffic spent on opcodes is lower in those cases.

Considering the early cacheless ARM, it was designed quite specifically (like most RISC CPUs of the time) so that most of the time it was fetching a continuous sequence of constant-size instruction words from sequential addresses. These sequential accesses were optimised at the memory-controller level, hinted by a "sequential access" bus control pin, into DRAM fast-page-mode accesses which completed in significantly less time than a random access - and the memory bus was a full 32 bits wide, so every fetch delivered a complete instruction word. Deviating from the sequential access pattern required releasing the DRAM from fast-page mode, selecting the new row address, and only then being able to provide the column address - with potentially a second such switch to get back to the previous sequence.

Operands were mostly embedded in the instruction word or taken from the large register file, with only branches and load/store instructions deviating from that pattern. It was made pretty clear to programmers that, so long as you didn't violate the data dependency requirements of certain early-input (barrel shifter, multiplier) and late-output (memory load, multiplier) operations, only branch and load/store instructions would take more than one cycle each (with minor exceptions due to external DRAM quirks). Multi-register loads and stores were implemented to take advantage of the sequential-access optimisation, so they were faster than a sequence of individual loads and stores (as well as taking fewer instruction words).

Some CISC CPUs also implemented pipelining to some degree, but it was much more common for instructions to spend multiple cycles in a given pipeline stage, or even occupying two or more consecutive pipeline stages for multiple cycles in certain cases. The 68040 is perhaps the most obvious example of this, having a dedicated AGU stage with its own associated operand fetch stage, which some of the complex 68K addressing modes needed to use several times in sequence, before it got through to the actual operand fetch stage and execution. Some instructions would also spend several, even dozens of cycles blocking the ALU - which is the main reason why ARM didn't get a division instruction until much later.

Modern CPUs are really RISC CPUs with a CISC-to-RISC translator engine bolted onto the front. AMD's K5 was almost literally this, with the back-end execution hardware being a slightly enhanced version of their 29K series RISC CPU. Instruction fetch and decode is a highly optimised area, with its own dedicated cache - or nowadays, separate caches for before and after translation. Ryzen's µOp cache works so similarly to the front end of a superscalar RISC CPU (eg. I saw something extremely similar in the PowerPC G5 architecture) that it's almost indistinguishable. Again, it's all about making steady progress in each and every cycle when possible.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 7 posts ] 

All times are UTC


Who is online

Users browsing this forum: pdragon and 36 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: