hi everyone!
i've mentioned multiple times on the forums that i always wanted to try and make a 65c02 softcore that had an internal Zero Page and Stack Page to see how it would affect performance at a given clock speed.
and the final design turned out a bit more complicated than expected, so i thought it would be a good idea to make an emulator for it first to see how it would do.
well, the emulator is done and it passes all of Klaus' Tests (except the interrupt one, as no interrupts besides BRK exist). but i don't have any timings from cycle accurate Emulators (or real hardware) to compare the results against, so if anyone has some numbers for the "6502_functional_test" programs, that'd be nice!
but before i show the results of the test i'll first try to explain the hardware the emulator implements.
or rather how it came to be.
i started by looking back at my first 65c02 softcore, the one i made in Logisim where i tried to make the CPU faster by reducing the amount of cycles each instruction needed. and i achieved that by making each cycle do more work, which ironically lowered the maximum clock speed it could run at. but it did work and i used that CPU to test ROM code for my SBC so i didn't need to reprogram the ROM each time i made changes.
after that i learned a lot more about hardware and CPU design and got a better idea of how to save cycles without dragging Fmax (the maximum clock speed) down:
the idea was to keep roughly the same cycle counts (if not adding more to make individual cycles faster) and then use the dead cycles (when the CPU isn't accessing Memory) to prefetch instructions into an internal buffer, so after the current instruction is done the CPU doesn't have to spend as many cycles fetching the next one.
This idea evolved into splitting the Control unit into 3 separate parts. before, the control unit did everything from fetching instructions, executing them, and accessing memory. now each of those has its own unit: the Fetching Unit (FEU), the Execution Unit (EXU), and the Bus Interface Unit (BIU).
the Execution Unit does as it says, it executes whatever is in the Instruction Buffer by directly controlling all internal signals related to the registers and the ALU, it can also give the BIU commands to read from/write to memory.
the Fetching Unit is also simple, it was specifically designed to be stupid to avoid excessively complex circuitry.
the FEU simply steps through memory and reads bytes into the Instruction Buffer whenever it's allowed to do so. for this it has its own Program Counter called "FPC" that only jumps when the actual PC gets loaded with a new value (i.e. any control flow instruction gets executed, like JMP, JSR, RTS, B**, etc). this will also clear the buffer, which does waste some cycles.
when an instruction is about to finish it checks how many bytes are in the buffer. if it's less than 3 (the maximum size of an instruction on the 65c02) the FEU will take over starting next cycle and fetch bytes until the buffer has at least 3 bytes in it, afterwards it will give control back to the EXU.
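a minimal python sketch of that refill rule. this is not the actual emulator code: the class and method names are made up for illustration, and it assumes an 8-bit bus so the FEU gets one byte per cycle.

```python
from collections import deque

MAX_INSTR_BYTES = 3  # longest 65c02 instruction


class PrefetchBuffer:
    """Toy model of the Instruction Buffer together with the FEU's FPC."""

    def __init__(self, memory, start, capacity=8):
        if capacity < MAX_INSTR_BYTES:
            raise ValueError("buffer must hold at least one full instruction")
        self.memory = memory
        self.fpc = start & 0xFFFF   # the FEU's own program counter
        self.capacity = capacity
        self.bytes = deque()

    def fetch_one(self):
        """Read the byte at FPC into the buffer if there is room."""
        if len(self.bytes) < self.capacity:
            self.bytes.append(self.memory[self.fpc])
            self.fpc = (self.fpc + 1) & 0xFFFF

    def refill(self):
        """FEU takes over until 3 bytes are buffered; returns cycles spent."""
        cycles = 0
        while len(self.bytes) < MAX_INSTR_BYTES:
            self.fetch_one()
            cycles += 1
        return cycles

    def redirect(self, target):
        """Taken control flow instruction: clear the buffer and reload FPC."""
        self.bytes.clear()
        self.fpc = target & 0xFFFF


mem = bytearray(0x10000)
mem[0x8000:0x8003] = bytes([0x4C, 0x00, 0x90])  # JMP $9000
buf = PrefetchBuffer(mem, 0x8000)
wasted = buf.refill()   # empty buffer, so 3 fetch cycles are needed
buf.redirect(0x9000)    # the jump throws the buffered bytes away again
```

the `redirect` call is where the wasted cycles come from: every taken jump or branch empties the buffer, so the next instruction has to wait for a refill.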
and finally the Bus Interface Unit. likely the most complex of the 3. its main purpose is to abstract memory accesses in such a way that the EXU and FEU don't have to care about the width of the data bus, external control signals, or even each other, when accessing memory.
for example when the EXU requests a 16-bit word to be read from address $BEEF (on an 8-bit data bus) the BIU will do the following:
Cycle 1. Read a byte from address $BEEF and store it into a temporary register (also stop the EXU from advancing to the next cycle)
Cycle 2. Read a byte from $BEEF+1, combine it with the first byte and give the whole 16-bit word to the EXU
as you can see the BIU has the ability to stop the EXU from continuing when necessary. this is done so that from the perspective of the EXU all memory accesses take 1 cycle, which massively simplifies logic and allows me to more easily change the width of the external data bus without having to redo any EXU related logic or microcode.
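the two-cycle sequence above could be modelled roughly like this. it's only a sketch: `Biu8` and `read16` are hypothetical names, and the generator yields one `(stall_exu, value)` pair per bus cycle.

```python
class Biu8:
    """Toy BIU for an 8-bit external data bus."""

    def __init__(self, memory):
        self.memory = memory
        self.temp = 0  # temporary register holding the low byte

    def read16(self, addr):
        """Yield (stall_exu, value) once per cycle for a 16-bit read."""
        # Cycle 1: read the low byte into the temp register, stall the EXU.
        self.temp = self.memory[addr & 0xFFFF]
        yield True, None
        # Cycle 2: read the high byte, combine, hand the word to the EXU.
        high = self.memory[(addr + 1) & 0xFFFF]
        yield False, self.temp | (high << 8)


mem = bytearray(0x10000)
mem[0xBEEF] = 0x34  # low byte (the 6502 is little-endian)
mem[0xBEF0] = 0x12  # high byte
steps = list(Biu8(mem).read16(0xBEEF))
```

the EXU only ever sees the final word, the stall flag hides the extra cycle from it.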
another thing the BIU is handling is the decoding for the 3 memory types (Zero Page, the Stack, and External/Main Memory). it checks the incoming address to see which memory it should access: $0000-$00FF = Zero Page, $0100-$01FF = Stack, $0200-$FFFF = External/Main Memory.
this means i can split the address space even further and only have to change the BIU's Logic while the rest of the CPU remains the same.
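as a sketch, that decoding comes down to looking at the high byte of the address. the `Region` enum and `decode` function are names i'm making up here, not something from the emulator.

```python
from enum import Enum


class Region(Enum):
    ZERO_PAGE = 0
    STACK = 1
    EXTERNAL = 2


def decode(addr):
    """Map a 16-bit address to the memory type the BIU should access."""
    page = (addr >> 8) & 0xFF
    if page == 0x00:
        return Region.ZERO_PAGE   # $0000-$00FF
    if page == 0x01:
        return Region.STACK       # $0100-$01FF
    return Region.EXTERNAL        # $0200-$FFFF
```

splitting the address space further would only mean adding entries to `Region` and extra cases to `decode`, nothing outside of it has to know.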
and finally, a very important feature the BIU also has, is allowing both the FEU and EXU to access memory in the same cycle (assuming they aren't trying to access the same memory type. if they are, the EXU gets priority) so for example if the EXU is accessing Zero Page, then the FEU is allowed to access the Stack or External/Main Memory in the same cycle.
This does help a lot with performance as even non-dead cycles can potentially be used to pre-fetch instructions.
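the priority rule can be sketched like this (hypothetical helper names again; `None` stands for "no request this cycle"):

```python
def region(addr):
    """0 = Zero Page, 1 = Stack, 2 = External/Main Memory."""
    return min((addr & 0xFFFF) >> 8, 2)


def arbitrate(exu_addr, feu_addr):
    """Return (exu_granted, feu_granted) for a single cycle.

    The EXU always wins. The FEU only gets its access if the EXU is
    idle or is using a different memory type.
    """
    exu_granted = exu_addr is not None
    feu_granted = feu_addr is not None and (
        not exu_granted or region(exu_addr) != region(feu_addr)
    )
    return exu_granted, feu_granted


both = arbitrate(0x0010, 0x8000)      # EXU in Zero Page, FEU in main memory
conflict = arbitrate(0x9000, 0x8000)  # both want main memory, FEU waits
```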
on a side note, both the Zero Page and Stack allow for single cycle 16-bit reads (even when unaligned, as they both just function like register files), but only the Stack supports single cycle 16-bit writes as it's the only memory type that ever gets 16-bit words written to it.
overall i'm pretty proud of this little system i made. and i was very happy to see it pretty much work exactly how i imagined it in the emulator.
so i guess it's time to look at the Klaus test results for a few runs i did (and explain the various stats)
Code:
<---------------------------------------------------------------------------------------------------->
< Klaus Functional Test (32 Byte Buffer, 8-bit data bus)
< -Options:
< ROM_vectors = 1
< load_data_direct = 1
< I_flag = 3
< zero_page = $0080
< data_segment = $0000
< code_segment = $8000
< disable_selfmod = 0
< report = 1
< ram_top = -1
< disable_decimal = 0
<---------------------------------------------------------------------------------------------------->
Execution took 1.631 Seconds
Stats:
Executed Instructions: 30,648,428
Total Cycles: 65,626,395
Average Cycles per Instruction: 2.141
Amount of TAKEN Control Flow Instructions: 6,012,259 (19.617% out of all Instructions)
Lowest Buffer Size: 0
Highest Buffer Size: 17
Average Buffer Size: 2.375
Amount of times the Buffer had to be refilled: 12,578,144
(47.799% of them were from Control Flow Instructions)
"Executed Instructions", "Total Cycles", and "Average CPI" are self explanatory.
"Amount of TAKEN Control Flow Instructions" shows how many instructions loaded a new value into the PC
"Lowest Buffer Size", "Highest Buffer Size" shows what the least and most amount of bytes in the buffer during execution were
"Average Buffer Size", says how many bytes on average were in the buffer after an instruction finished executing. any value below 3 means the buffer almost never fills up and has to be refilled a lot.
"Amount of times the Buffer had to be refilled", just as it says. whenever the buffer has fewer than 3 bytes in it, it needs to be refilled up to 3 bytes. this can potentially waste cycles. all taken control flow instructions always cause a refill because they clear the buffer.
even without any results to compare against this already looks pretty good to me judging from that average CPI of 2.141!
i also did place the Data segment into the ZP to speed things up a bit more.
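for reference, the derived numbers in the stats can be reproduced from the raw counters like this. it's just a sketch with a made-up function name, and it assumes (as described above) that every taken control flow instruction causes exactly one refill.

```python
def derived_stats(instructions, cycles, taken_cf, refills):
    """Compute the derived figures from the raw counters."""
    return {
        "cpi": cycles / instructions,
        "taken_cf_pct": 100 * taken_cf / instructions,
        # share of refills that were caused by taken control flow
        "refills_from_cf_pct": 100 * taken_cf / refills,
    }


# raw counters from the 8-bit data bus run above
stats = derived_stats(
    instructions=30_648_428,
    cycles=65_626_395,
    taken_cf=6_012_259,
    refills=12_578_144,
)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```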
and here is a test with the same ROM on a 16-bit external data bus:
Code:
<---------------------------------------------------------------------------------------------------->
< Klaus Functional Test (32 byte buffer, 16-bit data bus)
<---------------------------------------------------------------------------------------------------->
Execution took 1.882 Seconds
Stats:
Executed Instructions: 30,648,428
Total Cycles: 52,937,835
Average Cycles per Instruction: 1.727
Amount of TAKEN Control Flow Instructions: 6,012,259 (19.617% out of all Instructions)
Lowest Buffer Size: 0
Highest Buffer Size: 32
Average Buffer Size: 5.310
Amount of times the Buffer had to be refilled: 6,013,458
(99.980% of them were from Control Flow Instructions)
as you can see it takes 12,688,560 fewer cycles (about 19.3% fewer, which works out to roughly a 24% speed increase at the same clock), and the buffer only has to be refilled around half as many times as before. a wider data bus allows the FEU to fetch whole words at once so it keeps the buffer filled more easily.
you can also see that almost none of the refills were caused by the buffer actually running dry after an instruction (it only happened 1199 times)
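there is more than one way to turn the two cycle counts into a percentage, so here is a small sketch that computes the common ones from the totals of the 8-bit and 16-bit runs (the function name is made up, and it assumes both configurations run at the same clock):

```python
def compare_runs(cycles_before, cycles_after):
    """Compare two cycle counts for the same workload."""
    saved = cycles_before - cycles_after
    return {
        "cycles_saved": saved,
        # reduction relative to the slower run
        "fewer_cycles_pct": 100 * saved / cycles_before,
        # speedup relative to the faster run
        "speedup_pct": 100 * (cycles_before / cycles_after - 1),
        # symmetric difference, relative to the mean of both runs
        "symmetric_diff_pct": 100 * saved / ((cycles_before + cycles_after) / 2),
    }


klaus = compare_runs(65_626_395, 52_937_835)
for name, value in klaus.items():
    print(f"{name}: {value:.2f}")
```

which one to quote is a matter of convention, but "x% faster" usually means the speedup figure.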
the total size of the buffer doesn't seem to have any impact on CPI (i tried 4, 8, 16, 32, 64, 128, and 256 bytes), unless it becomes too small to fit at least 2 full sized instructions (like 4 bytes did).
so 8 bytes seems like the perfect size. small enough to be simple to implement in logic, but large enough to not have any negative impact on performance.
one final Klaus result, this one with a 32-bit data bus. as expected, the wider you go the more the performance improvements diminish.
Code:
<---------------------------------------------------------------------------------------------------->
< Klaus Functional Test (32 byte buffer, 32-bit data bus):
<---------------------------------------------------------------------------------------------------->
Execution took 2.718 Seconds
Stats:
Executed Instructions: 30,648,428
Total Cycles: 49,922,606
Average Cycles per Instruction: 1.629
Amount of TAKEN Control Flow Instructions: 6,012,259 (19.617% out of all Instructions)
Lowest Buffer Size: 0
Highest Buffer Size: 32
Average Buffer Size: 12.254
Amount of times the Buffer had to be refilled: 6,012,328
(99.999% of them were from Control Flow Instructions)
and lastly, i used cc65 to compile and run my Fixed Point Mandelbrot Set program on the emulator as well, to get a better idea of a "real world" work load rather than just a benchmark/testing program.
Code:
<---------------------------------------------------------------------------------------------------->
< 300x100 @ 1020 Iterations Mandelbrot set (8-bit data bus):
<---------------------------------------------------------------------------------------------------->
Execution took 259.999 Seconds
Stats:
Executed Instructions: 4,259,166,627
Total Cycles: 11,133,901,121
Average Cycles per Instruction: 2.614
Amount of TAKEN Control Flow Instructions: 605,588,201 (14.218% out of all Instructions)
Lowest Buffer Size: 0
Highest Buffer Size: 15
Average Buffer Size: 3.292
Amount of times the Buffer had to be refilled: 1,648,663,841
(36.732% of them were from Control Flow Instructions)
<---------------------------------------------------------------------------------------------------->
< 300x100 @ 1020 Iterations Mandelbrot set (16-bit data bus):
<---------------------------------------------------------------------------------------------------->
Execution took 299.207 Seconds
Stats:
Executed Instructions: 4,259,166,627
Total Cycles: 9,459,555,353
Average Cycles per Instruction: 2.221
Amount of TAKEN Control Flow Instructions: 605,588,201 (14.218% out of all Instructions)
Lowest Buffer Size: 0
Highest Buffer Size: 32
Average Buffer Size: 11.722
Amount of times the Buffer had to be refilled: 647,748,606
(93.491% of them were from Control Flow Instructions)
(~15.0% fewer cycles than the 8-bit data bus one, i.e. ~17.7% faster)
so yea, that's kinda what i've been doing the past few weeks besides planning for my new 65816 VGA card with 32/64k of Dual Port RAM, and making a PLCC to DIP adapter for my Amiga 500 (also includes an IDE and RAM expansion connector).
hope this was an interesting read! i'm planning on releasing the code for the emulator on github, but i want to clean it up a bit more so for now i'll just throw the files as they currently are into this thread.
of course any thoughts, comments, and questions are appreciated!