I don't have the energy to recompose the full message from yesterday, but the summary is this:
Thanks for all the help. The tube project is very intriguing, as it has some awesome performance numbers. I have reviewed some of the comments in the sources and have picked up a few ideas that I can try on my build.
I have been simulating the Status Register, which admittedly uses up a lot of instructions. The memory loads and stores also take up quite a bit, due partly to 32-bit word alignment. 16-bit reads are being done as two 8-bit reads, with the results then combined. I read something last night about disabling "alignment", which may help me with that issue. I need to experiment some more.
This build uses the same model as my AVR code, using macros for memory and IO reads and writes, addressing modes, and opcodes. I also use a macro for the fetch process, which gets added to the end of every opcode. I'll explain why in a minute.
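To illustrate the two-byte read described above, here is a minimal C sketch (not my actual ARM macro code). The names mem, load_bytes, read8 and read16 are made up for this example, and a flat 64 KB array stands in for the real RAM/ROM/IO decoding. Because each access is a single byte, the address never has to be aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat 64 KB address space; the real build also decodes ROM and IO. */
static uint8_t mem[0x10000];

/* Helper for the example: copy n bytes into emulated memory at addr. */
static void load_bytes(uint16_t addr, const uint8_t *src, size_t n)
{
    assert((size_t)addr + n <= sizeof mem);
    memcpy(&mem[addr], src, n);
}

/* Read one emulated byte. */
static uint8_t read8(uint16_t addr)
{
    return mem[addr];
}

/* Read a little-endian 16-bit value as two byte reads, then combine them.
 * Works at any address, odd or even, so no word alignment is required. */
static uint16_t read16(uint16_t addr)
{
    uint8_t lo = read8(addr);
    uint8_t hi = read8((uint16_t)(addr + 1)); /* wraps from $FFFF to $0000 */
    return (uint16_t)(lo | (hi << 8));
}
```

For example, with $34 stored at $1001 and $12 at $1002, read16(0x1001) gives $1234 even though the address is odd.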
Here's a sample of my LDA ABS instruction:
Code:
C_AD: ; LDA_ABS - opcode $AD
ABS
C_LDA
CYCLE 4
FETCH
ABS will read the 2 bytes after the opcode, form the target address, and read the byte at that address (RAM, ROM, or IO).
C_LDA will transfer that data to the Acc register and update the proper flags.
CYCLE 4 is a 1 ARM instruction macro that adds the number to the cycle count.
FETCH will get the opcode from the location pointed to by the PC register and branch to the proper code (label C_AD in the case of an LDA ABS).
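The four steps above can be modelled in C roughly as follows. This is only a sketch of the semantics, not the ARM macros themselves: the Cpu struct, addr_abs, op_lda and step are names invented for the example, memory is a flat array, and only opcode $AD is handled. Note that a C switch dispatches from one central place, whereas my macros put the FETCH at the end of every opcode.

```c
#include <assert.h>
#include <stdint.h>

enum { FLAG_Z = 0x02, FLAG_N = 0x80 }; /* 6502 status register bits */

/* Hypothetical CPU state for the example. */
typedef struct {
    uint8_t       mem[0x10000]; /* flat memory; no ROM/IO decoding here */
    uint16_t      pc;
    uint8_t       a;            /* accumulator */
    uint8_t       p;            /* status register */
    unsigned long cycles;
} Cpu;

/* ABS: read the 2 bytes after the opcode and form the target address. */
static uint16_t addr_abs(Cpu *c)
{
    uint8_t lo = c->mem[c->pc++];
    uint8_t hi = c->mem[c->pc++];
    return (uint16_t)(lo | (hi << 8));
}

/* C_LDA: move the data to the accumulator and update N and Z. */
static void op_lda(Cpu *c, uint8_t v)
{
    c->a = v;
    c->p = (uint8_t)(c->p & ~(FLAG_N | FLAG_Z));
    if (v == 0)
        c->p |= FLAG_Z;
    c->p |= (uint8_t)(v & FLAG_N); /* bit 7 of the value is the N flag */
}

/* FETCH plus one opcode: get the opcode at PC and run its handler. */
static void step(Cpu *c)
{
    uint8_t opcode = c->mem[c->pc++];
    switch (opcode) {
    case 0xAD:                          /* LDA ABS */
        op_lda(c, c->mem[addr_abs(c)]); /* ABS, then C_LDA */
        c->cycles += 4;                 /* CYCLE 4 */
        break;
    default:
        assert(!"opcode not modelled in this sketch");
    }
}
```

With AD 34 12 at $0200 and $80 stored at $1234, one call to step() leaves A = $80, PC = $0203, N set, Z clear, and 4 cycles counted.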
By placing the fetch macro at the end of each opcode, I avoid subroutine overhead and/or extra branch instructions. The pipeline architecture of the ARM allows for 4 instructions to be moving through the pipe, starting with the fetch and ending with the operation. If a branch occurs, the pipeline stalls (it has to be refilled), which uses up extra cycles. The cool thing is that most instructions can be conditionally executed, which eliminates the need for forward branches and keeps the pipeline moving. If a condition is not met, a 1 cycle nop replaces the instruction.
I also need to find optimizations for calculating flags for an 8-bit value in a 32-bit register, i.e., the ARM has a V flag, but it's looking at overflows out of bit 31, not bit 7.
Bottom line: streamlining the code, removing complexity from the memory/IO reads and writes, using native flags where possible instead of simulating them, and reducing pipeline stalls will allow me to speed up the simulation.
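As a sketch of the bit 7 versus bit 31 problem, here are two ways to get the 8-bit overflow flag, written in C for illustration. The first derives V from the operand and result sign bits. The second shifts both operands left by 24 so that bit 7 lands in bit 31, which is the position the native V flag watches; C has no access to the ARM flags, so a 64-bit sum stands in for them here. Both function names are made up, the shifted version ignores carry-in, and it assumes the usual two's complement conversion to int32_t. I have not measured whether either is faster in my build.

```c
#include <stdint.h>

/* V for an 8-bit add with carry-in: set when both operands have the same
 * sign and the result's sign differs from them. */
static int overflow8(uint8_t a, uint8_t b, uint8_t carry_in)
{
    uint8_t r = (uint8_t)(a + b + carry_in);
    return ((a ^ r) & (b ^ r) & 0x80) != 0;
}

/* Same flag (no carry-in) by moving bit 7 up to bit 31. On the ARM, an
 * ADDS of the shifted values would set the native V flag directly; here a
 * 64-bit sum is used to detect the 32-bit signed overflow.
 * Assumes two's complement conversion from uint32_t to int32_t. */
static int overflow8_shifted(uint8_t a, uint8_t b)
{
    int32_t sa = (int32_t)((uint32_t)a << 24);
    int32_t sb = (int32_t)((uint32_t)b << 24);
    int64_t wide = (int64_t)sa + (int64_t)sb;
    return wide > INT32_MAX || wide < INT32_MIN;
}
```

For instance, $50 + $50 overflows as a signed 8-bit add (the result $A0 is negative), while $50 + $10 does not, and both functions agree on that.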
More still to come.
Daryl