It's important to note that litwr's aim is clearly to write the highest-performance code, and to write code that runs on both the 6502 and the 65C02. At first I failed to grasp his code fragment, but now I see that it is a very fast 256-way branch, and I can't see a faster way of doing it if every cycle counts. It's a small piece of self-modifying code: the index is stored into the low byte of a JMP (indirect) operand.
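The nice thing is that the 256-byte table needed for the 6502 and the one needed for the 65C02 overlap, so a 257-byte table will, I think, work portably on either CPU. The NMOS 6502 fetches the high byte of JMP ($xxFF) from $xx00, while the 65C02 fetches it from the first byte of the next page, so the high byte of the last odd entry just has to be stored in both places.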
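A minimal Python sketch of the CPU difference the layout relies on, assuming a flat 64 KiB memory image in place of real hardware (the function name and the `cmos` flag are hypothetical, not from litwr's code):

```python
def jmp_indirect_target(mem, ptr, cmos):
    """Return the 16-bit target of JMP (ptr), read from a 64 KiB memory image.

    cmos=False models the NMOS 6502, cmos=True models the 65C02.
    """
    lo = mem[ptr]
    if cmos:
        # 65C02: the pointer is incremented as a full 16-bit value.
        hi_addr = (ptr + 1) & 0xFFFF
    else:
        # NMOS 6502: only the low byte is incremented, so $xxFF wraps to $xx00.
        hi_addr = (ptr & 0xFF00) | ((ptr + 1) & 0x00FF)
    return (mem[hi_addr] << 8) | lo


mem = bytearray(0x10000)
mem[0x12FF] = 0x34   # low byte of the destination
mem[0x1200] = 0xAA   # high byte as seen by the NMOS 6502
mem[0x1300] = 0xBB   # high byte as seen by the 65C02
print(hex(jmp_indirect_target(mem, 0x12FF, cmos=False)))  # 0xaa34
print(hex(jmp_indirect_target(mem, 0x12FF, cmos=True)))   # 0xbb34
```

Only a pointer ending in $FF behaves differently; everywhere else the two CPUs agree.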
Code:
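lda INDEX        ; INDEX = 0..255
lsr              ; bit 0 -> carry (A is clobbered)
bcs l_odd
asl              ; even: C = 0, so this restores INDEX
sta m_even+1     ; patch the low byte of the JMP operand
m_even
jmp (even_table) ; 5 cycles on the 6502, 6 on the 65C02
l_odd
rol              ; odd: C = 1, so this restores INDEX
sta m_odd+1      ; patch the low byte of the JMP operand
m_odd
jmp (odd_table)
align 256        ; both tables must start on a page boundary
even_table .word ... ; entries 0, 2, ..., 254 at even offsets: 256 bytes
odd_table .byte ...  ; offset 0: high byte of entry 255, read only by the 6502 (page wrap)
.word ...            ; entries 1, 3, ..., 255 at offsets 1..256 (offset 256 read only by the 65C02)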
I hope I have that right. It's a bit tricky to think about.
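Since it is tricky, here is a small Python model that checks the layout. It assumes hypothetical addresses $1000 for even_table and $1100 for odd_table and uses a memory image rather than real hardware. It fills both tables, duplicates the high byte of entry 255 at odd_table+0 and odd_table+256, then simulates the patched JMP (indirect) for every INDEX on both CPUs; `base | index` stands in for the `sta m_even+1` / `sta m_odd+1` patch.

```python
EVEN_TABLE = 0x1000  # hypothetical page-aligned address of even_table
ODD_TABLE = 0x1100   # hypothetical address of odd_table (the next page)


def jmp_indirect_target(mem, ptr, cmos):
    """Target of JMP (ptr); the NMOS 6502 wraps the high-byte fetch within the page."""
    if cmos:
        hi_addr = (ptr + 1) & 0xFFFF
    else:
        hi_addr = (ptr & 0xFF00) | ((ptr + 1) & 0x00FF)
    return (mem[hi_addr] << 8) | mem[ptr]


def build_tables(dests):
    """Lay out 256 destinations: 256 bytes of even_table plus 257 bytes of odd_table."""
    mem = bytearray(0x10000)
    for i, dest in enumerate(dests):
        base = ODD_TABLE if i & 1 else EVEN_TABLE
        mem[base + i] = dest & 0xFF      # entry i lives at offset i of its table
        mem[base + i + 1] = dest >> 8    # for i == 255 this is odd_table+256
    # No odd entry starts at offset 0, so that byte is free for the NMOS wrap.
    mem[ODD_TABLE] = dests[255] >> 8
    return mem


def dispatch(mem, index, cmos):
    """Model the branch: bit 0 picks the table, INDEX patches the operand's low byte."""
    base = ODD_TABLE if index & 1 else EVEN_TABLE
    return jmp_indirect_target(mem, base | index, cmos)


dests = [(0x2000 + 0x101 * i) & 0xFFFF for i in range(256)]
mem = build_tables(dests)
ok = all(dispatch(mem, i, cmos) == dests[i] for i in range(256) for cmos in (False, True))
print(ok)  # True
```

Clearing the byte at odd_table+0 breaks only INDEX = 255 on the 6502, which is a quick way to see why the duplicate is needed.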
When Dave and I were doing the upgrade of Arlet's core from 6502 to 65C02, we noticed that he uses a very nice trick for the indirect JMP. His core, like the 6502, has only a byte-wide ALU, and on the face of it, a 16-bit address must be computed and incremented to fetch the two bytes of the destination address. With the byte-wide ALU, that would mean taking an extra cycle whenever the increment crosses a page boundary. But the PC can increment its full 16-bit value in one cycle, so the core pretends that there's a JMP opcode just before the two pointer bytes and proceeds as if fetching the two operand bytes of that JMP. It's a bit like doing half a JMP to load the pointer address into the PC and then another half a JMP to load the destination address. The PC does the incrementing, so the page boundary costs nothing. It sort of works like this:
Code:
JMP ghostly_jmp ; stands in for JMP (far)
; ... lots more code, possibly ...
ghostly_jmp:
.byte $4C ; imaginary JMP abs opcode, whose operand is the pointer at far
far:
.word destination
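A rough Python sketch contrasting the two ways of reaching the pointer's second byte. It assumes, from the description above, that the only cost difference is the extra cycle for fixing up the high byte; both function names are hypothetical, and the cycle figure counts only that penalty, not the whole instruction.

```python
def msb_addr_byte_alu(ptr):
    """Naive approach: increment the pointer's low byte in an 8-bit ALU.

    Returns (address of the MSB, extra cycles); a carry out of the low byte
    needs one more cycle to fix up the high byte.
    """
    lo = (ptr & 0xFF) + 1
    carry = lo >> 8
    hi = ((ptr >> 8) + carry) & 0xFF
    return (hi << 8) | (lo & 0xFF), carry


def jmp_indirect_via_ghost_jmp(mem, far):
    """The trick: treat `far` as the operand of an imaginary $4C at far-1.

    The PC holds the pointer address and its 16-bit incrementer steps to the
    MSB, so there is no page-crossing penalty.
    """
    pc = far
    pcl = mem[pc]              # like JMP0: fetch PCL and hold
    pc = (pc + 1) & 0xFFFF     # full 16-bit increment in one step
    pch = mem[pc]              # like JMP1: fetch PCH
    return (pch << 8) | pcl


mem = bytearray(0x10000)
mem[0x12FF], mem[0x1300] = 0x00, 0x80    # pointer straddles a page boundary
print(msb_addr_byte_alu(0x12FF))          # (4864, 1), i.e. $1300 plus one cycle
print(hex(jmp_indirect_via_ghost_jmp(mem, 0x12FF)))  # 0x8000
```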
Here's the clue, in the comments:
Code:
JMP0 = 6'd22, // JMP - fetch PCL and hold
JMP1 = 6'd23, // JMP - fetch PCH
JMPI0 = 6'd24, // JMP IND - fetch LSB and send to ALU for delay (+0)
JMPI1 = 6'd25, // JMP IND - fetch MSB, proceed with JMP0 state
(The decimal numbers there are just arbitrary state labels.)
We kept that mechanism and added three more states to perform the indexed indirect JMP, JMP (abs,X):
Code:
JMPIX0 = 6'd51, // JMP (,X)- fetch LSB and send to ALU (+X)
JMPIX1 = 6'd52, // JMP (,X)- fetch MSB and send to ALU (+Carry)
JMPIX2 = 6'd53; // JMP (,X)- Wait for ALU (only if needed)
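Very roughly, the three new states add X to the two-byte operand one byte at a time and then reuse the ghost-JMP fetch. This Python sketch assumes that JMPIX2 is needed exactly when the low-byte add carries out; the function name and memory image are hypothetical.

```python
def jmp_abs_x(mem, operand, x):
    """Return (target, extra_cycles) for JMP (operand,X) with an 8-bit ALU."""
    lo_sum = (operand & 0xFF) + x              # JMPIX0: LSB + X
    carry = lo_sum >> 8
    hi_sum = ((operand >> 8) + carry) & 0xFF   # JMPIX1: MSB + carry
    ptr = (hi_sum << 8) | (lo_sum & 0xFF)
    # JMP0/JMP1: fetch the destination through the 16-bit PC, as for JMP (ind).
    target = mem[ptr] | (mem[(ptr + 1) & 0xFFFF] << 8)
    return target, carry                       # carry == 1 -> JMPIX2 wait cycle


mem = bytearray(0x10000)
mem[0x1310], mem[0x1311] = 0xCD, 0xAB
print(jmp_abs_x(mem, 0x12F0, 0x20))  # (43981, 1): target $ABCD, one extra cycle
```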
And here's the optional extra cycle, taken only when CO (the ALU carry out) is set:
Code:
JMPIX0 : state <= JMPIX1;
JMPIX1 : state <= CO ? JMPIX2 : JMP0;
JMPIX2 : state <= JMP0;
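To see how the states chain together, here's a Python walk through the transitions quoted above. JMPI0 to JMPI1 to JMP0 comes from the comments; JMP0 to JMP1 is an assumption, and what follows JMP1 isn't shown, so the walk stops there. The `co` argument is a stand-in for the CO signal.

```python
def state_sequence(first, co=False):
    """List the states from `first` up to JMP1, following the transitions above."""
    seq = [first]
    state = first
    while state != "JMP1":
        if state == "JMPI0":
            state = "JMPI1"
        elif state == "JMPI1":
            state = "JMP0"                  # "proceed with JMP0 state"
        elif state == "JMPIX0":
            state = "JMPIX1"
        elif state == "JMPIX1":
            state = "JMPIX2" if co else "JMP0"
        elif state == "JMPIX2":
            state = "JMP0"
        elif state == "JMP0":
            state = "JMP1"                  # assumed from the JMP0/JMP1 comments
        else:
            raise ValueError(f"unmodelled state {state}")
        seq.append(state)
    return seq


for first, co in (("JMPI0", False), ("JMPIX0", False), ("JMPIX0", True)):
    print(first, co, state_sequence(first, co))
```

The indexed form costs the same number of these states as JMP (ind) unless the carry forces the JMPIX2 wait.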
As is usually the case, Dave had a burst of productivity and gets the credit for the implementation!