32 Bit data bus and 8 Bytes Instruction buffer (IB), here we go:
Attachment: 6502_instruction_buffer.png
Notation, to avoid any confusion about 'little endian' and 'big endian':
Shift IB '-3' means shifting IB 3 Bytes toward IB0.
Shift IB '3' or '+3' means shifting IB 3 Bytes toward IB7.
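To make the sign convention concrete, here is a minimal sketch in Python. It is only a software model of the notation; the function name shift_ib and the "xx" filler for invalid Bytes are my own assumptions, not part of the schematic.

```python
def shift_ib(ib, n, fill="xx"):
    """Shift the buffer by n Bytes: n < 0 moves toward IB0, n > 0 toward IB7.

    Vacated positions are marked with `fill` (assumed to mean "not valid").
    """
    size = len(ib)
    n = max(-size, min(size, n))        # clamp to the buffer size
    if n < 0:
        return ib[-n:] + [fill] * (-n)  # Bytes at the IB0 end fall out
    return [fill] * n + ib[:size - n]   # Bytes at the IB7 end fall out


ib = ["A0", "00", "B9", "00", "04", "99", "00", "05"]
print(".".join(shift_ib(ib, -3)))  # what was in IB3 is now in IB0
print(".".join(shift_ib(ib, +3)))  # what was in IB0 is now in IB3
```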
Edit:
NMOS 6502 and 65C02 instructions are 1..3 Bytes long, so shifting IB by -3..0 Bytes would do.
But for 65816 and K24, instructions are 1..4 Bytes long, so take care.
The box labeled "merge" could alternatively be filled with OR gates.
;---
As an example, we take a short piece of code that copies 256 Bytes in memory,
from $0400..$04FF to $0500..$05FF.
Code:
LDY #$00
foo: LDA $0400,Y
STA $0500,Y
DEY
BNE foo
RTS
Code:
Program memory contents at $0100:
A0.00.B9.00.04.99.00.05.88.D0.F7.60

A0.00      LDY #$00
B9.00.04   foo: LDA $0400,Y
99.00.05   STA $0500,Y
88         DEY
D0.F7      BNE foo
60         RTS
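As a cross-check of the Byte listing, here is a small Python sketch that splits the 12 Bytes into instructions and recomputes the branch target. The LENGTH table only covers the six opcodes used in this example, and the helper names are hypothetical.

```python
# Instruction lengths for the opcodes in this example only (NMOS 6502).
LENGTH = {0xA0: 2, 0xB9: 3, 0x99: 3, 0x88: 1, 0xD0: 2, 0x60: 1}

PROGRAM = bytes.fromhex("A000B9000499000588D0F760")
BASE = 0x0100  # address of the first Byte, as stated above


def split_instructions(code, base):
    """Return a list of (address, instruction Bytes) pairs."""
    out = []
    pos = 0
    while pos < len(code):
        n = LENGTH[code[pos]]
        out.append((base + pos, code[pos:pos + n]))
        pos += n
    return out


def branch_target(addr, operand):
    """Target of a relative branch located at addr with an 8 Bit signed operand."""
    offset = operand - 0x100 if operand >= 0x80 else operand
    return addr + 2 + offset


for addr, ins in split_instructions(PROGRAM, BASE):
    print(f"${addr:04X}: {ins.hex('.').upper()}")
```

With BNE sitting at $0109 and an operand of F7 (that is, -9), branch_target gives $0102, which is where foo starts.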
Instruction Buffer activity, explained slowly and step by step. The CPU might actually try to do several of these steps within one machine cycle.
red: execute an instruction
blue: read 32 Bit from program memory into the Instruction Buffer (short form: buffer).
green: shift or flush buffer
IB: Instruction Buffer
_0._1._2._3._4._5._6._7
xx.xx.xx.xx.xx.xx.xx.xx //No valid Bytes in the buffer. Need to read in 32 Bit.
A0.00.B9.00.xx.xx.xx.xx //Read in 32 Bit from program memory.
A0.00.B9.00.xx.xx.xx.xx //Execute A0 00 //LDY #$00
B9.00.xx.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
B9.00.xx.xx.xx.xx.xx.xx //B9 is a three Bytes instruction, two valid Bytes in the buffer. Need to read in another 32 Bit.
B9.00.04.99.00.05.xx.xx //Read in 32 Bit from program memory.
B9.00.04.99.00.05.xx.xx //Execute B9 00 04 //LDA $0400,Y
99.00.05.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
99.00.05.xx.xx.xx.xx.xx //Execute 99 00 05 //STA $0500,Y
xx.xx.xx.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
xx.xx.xx.xx.xx.xx.xx.xx //No valid Bytes in the buffer. Need to read in 32 Bit.
88.D0.F7.60.xx.xx.xx.xx //Read in 32 Bit from program memory.
88.D0.F7.60.xx.xx.xx.xx //Execute 88 //DEY
D0.F7.60.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
D0.F7.60.xx.xx.xx.xx.xx //Execute D0 F7 //BNE foo. Conditional branch taken.
xx.xx.xx.xx.xx.xx.xx.xx //Change in program flow, flush buffer. The rest of the buffer is not executed.
B9.00.xx.xx.xx.xx.xx.xx //Read in 32 Bit from $0100 (aligned read, see observation 9), drop A0 00. From here, the loop repeats until Y contains zero.
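The walkthrough above can be replayed in software. The following is a rough Python sketch, not a cycle-accurate model: it assumes that program memory is only read as aligned 32 Bit words, that the buffer is refilled only when the next instruction is incomplete, and that a taken branch flushes everything. The names read32 and run are made up for this sketch, and LDA/STA are treated as no-ops because only the fetch behaviour is of interest.

```python
PROGRAM = bytes.fromhex("A000B9000499000588D0F760")
BASE = 0x0100
# Instruction lengths for the opcodes in this example only.
LENGTH = {0xA0: 2, 0xB9: 3, 0x99: 3, 0x88: 1, 0xD0: 2, 0x60: 1}


def read32(addr):
    """Return the 4 Bytes of the aligned 32 Bit word that contains addr."""
    start = (addr & ~3) - BASE
    return PROGRAM[start:start + 4]


def run():
    """Execute the copy loop; return (32 Bit reads, instructions executed)."""
    pc = BASE   # address of the instruction sitting in IB0
    buf = []    # valid Bytes only; buf[0] plays the role of IB0
    y = 0
    fetches = executed = 0
    while True:
        # Read in 32 Bit until opcode and operand Bytes are all valid.
        while not buf or len(buf) < LENGTH[buf[0]]:
            fetch_addr = pc + len(buf)       # first address not yet buffered
            word = read32(fetch_addr)
            buf += word[fetch_addr & 3:]     # drop Bytes below fetch_addr
            fetches += 1
        op = buf[0]
        n = LENGTH[op]
        operands = buf[1:n]
        del buf[:n]                          # shift buffer toward IB0
        pc += n
        executed += 1
        if op == 0xA0:                       # LDY #imm
            y = operands[0]
        elif op == 0x88:                     # DEY
            y = (y - 1) & 0xFF
        elif op == 0xD0 and y != 0:          # BNE taken
            rel = operands[0]
            pc += rel - 0x100 if rel >= 0x80 else rel
            buf.clear()                      # flush buffer
        elif op == 0x60:                     # RTS ends the sketch
            return fetches, executed


print(run())
```

Under these assumptions every pass through the loop costs three 32 Bit reads ($0100, $0104, $0108), because the branch target $0102 is not word aligned and the flush throws away the RTS Byte each time.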
Some observations:
0) IB0 contains the instruction Byte.
1) For 2 Byte instructions, IB1 contains either 8 Bit immediate data or the lower Byte of a 16 Bit address.
2) For 3 Byte instructions, IB2 contains the higher Byte of a 16 Bit address //screaming "build a 16 Bit ALU for address calculation into the CPU"
3) If an instruction was executed, we need to remove the related Bytes from the Instruction buffer (by shifting the buffer toward IB0).
4) If there was a change in program flow (JMP,JSR,RTS,RTI, Bxx true), we need to flush the Instruction Buffer.
5) We need to keep track of how many Bytes in the buffer are valid.
6) If the Instruction Buffer is empty, we need to read 32 Bit from program memory (into IB0..IB3).
7) If there are not enough valid Bytes for executing an instruction, we need to read 32 Bit from program memory.
8) We would have to shift the 32 Bit (4 Bytes) read from program memory by -3..+7 Bytes before merging them with the buffer, to make efficient use of the buffer.
9) For a 32 Bit data bus, the address from which the 4 Bytes are read usually isn't PC, but [PC AND (NOT 3)] //address lines A0 and A1 are zero.
10) It might become necessary to read 4 Bytes from [(PC AND (NOT 3)) + 4].
11) MC68010 featured a "loop mode" for two instruction loops.
If we change the Instruction Buffer size from 8 Bytes (IB0..IB7) to 16 Bytes (IB-7..IB8), make the buffer shiftable Bytewise in both directions, and pull one or two little tricks, the loop in our example could repeat without reading
Bytes into the buffer from program memory.
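Observations 8) to 10) boil down to a little arithmetic on PC and the number of valid Bytes. Here is a Python sketch of how I would compute it; the function name fetch_plan and the idea of deriving the shift as "valid minus misalignment" are my assumptions, not something taken from the schematic.

```python
def fetch_plan(pc, valid):
    """Return (read address, shift) for the next 32 Bit read.

    pc    : address of the instruction whose opcode sits in IB0
    valid : number of valid Bytes currently in the buffer (0..7)
    shift : how far the 4 Bytes read must be moved before merging,
            negative toward IB0, positive toward IB7
    """
    next_addr = pc + valid        # first address not yet in the buffer
    read_addr = next_addr & ~3    # A0 and A1 forced to zero
    skip = next_addr & 3          # Bytes of the word we already passed
    return read_addr, valid - skip


# Empty buffer after the branch to foo at $0102:
print(fetch_plan(0x0102, 0))   # read $0100, shift -2 to drop A0 00
# Two valid Bytes (B9 00), third Byte of LDA still missing:
print(fetch_plan(0x0102, 2))   # read $0104, shift +2 to land in IB2..IB5
```

With valid in 0..7 and a misalignment of 0..3, the shift comes out in the range -3..+7, and the second example is the [(PC AND (NOT 3)) + 4] case.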
For "shifting", we would need something like a barrel shifter working at Byte level.
The barrel shifter could be implemented either with 74CBTLV multiplexers (which won't simplify PCB layout) or with a lattice of 74CBTLV3245 bus switches (which won't simplify generating the control signals).
By sacrificing a little bit of speed, the chip count of the barrel shifter could be reduced (replacing one layer of 74CBTLV3251 8:1 multiplexers with three layers of 74CBTLV3257 2:1 multiplexers).
For "merging", I think it would be something like 74CBTLV3257 2:1 multiplexers working at Byte level... or a lot of OR gates.
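To illustrate the "three layers of 2:1 multiplexers" idea and the OR merge, here is a behavioural Python sketch. It only models shifts toward IB7 (0..7 Bytes); shifts toward IB0 would need the mirrored arrangement. It also assumes that invalid Byte lanes are forced to zero, which is what an OR merge would need in order to work. All names are hypothetical.

```python
def mux_layer(lanes, amount, select):
    """One layer of 2:1 multiplexers working at Byte level.

    select = 0 passes the lanes through unchanged,
    select = 1 moves every Byte `amount` lanes toward IB7 (zeros shift in).
    """
    if not select:
        return list(lanes)
    return [0] * amount + list(lanes[:len(lanes) - amount])


def barrel_shift(lanes, n):
    """Shift by n = 0..7 lanes using three layers (shift by 1, 2 and 4)."""
    for bit, amount in enumerate((1, 2, 4)):
        lanes = mux_layer(lanes, amount, (n >> bit) & 1)
    return lanes


def merge_or(buffer, shifted):
    """Merge with OR gates; only valid if unused lanes are zero on both sides."""
    return [b | s for b, s in zip(buffer, shifted)]


buffer = [0xB9, 0x00, 0, 0, 0, 0, 0, 0]         # two valid Bytes
word = [0x04, 0x99, 0x00, 0x05, 0, 0, 0, 0]     # 32 Bit just read
merged = merge_or(buffer, barrel_shift(word, 2))
print(".".join(f"{b:02X}" for b in merged))
```

The three select inputs are simply the three bits of the shift amount, which is the price paid in propagation delay for avoiding the 8:1 multiplexers.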
Spending any thought on:
How to generate the 32 Bit program memory read address,
How many and which control signals are needed for implementing all this,
How to generate said control signals...
That's something that could be done later, after we have:
Some more spare time on our hands,
A decision on whether we really should go for a 32 Bit data bus and an 8 Bytes Instruction Buffer,
And a clear definition of what a (hypothetical) 32 Bit 6502 bus interface is supposed to look like.
OT: started to wonder if anybody at TI still remembers the SN74AS897 barrel shifter.
BTW: using an Instruction Buffer and staying cycle compatible with the 6502 might be two very different things.
...but now I need a break before "going superscalar".