So here is the basic block diagram of what I had in mind.
The caches are read synchronously while the registers (A, X, Y, SP) are read asynchronously.
The I-cache is optional (the fetch stage and FIFO could do raw access on the bus), but the D-cache is not.
The particular feature is that there are 3 D-Cache read ports instead of just 1, and 3 register read ports.
For the D-Cache, 2 of the read ports are probably going to be exclusive to Z-Page and used only for indirect addressing modes. They might find some other uses at a later point, though.
For the registers, the 1st read port is here mainly for the (,X) addressing mode, while the 2nd is used mainly for the (),Y addressing mode. The normal indexed modes, like $1234,X, can use either. It does not matter whether the instruction is a read, a write or a RMW, as all of them will work with all addressing modes. Even though it doesn't exist on the 6502, this architecture could do a RMW operation on (,X) or (),Y just fine.
At first I would have said that, to save overhead, it would be possible to remove one of the two indexed read ports for the registers, typically the 1st read port which is here for (,X). That way (,X) operations would take 2 cycles instead, and that would be fine because they are rarely used. However, removing it would be a bad idea, because the RTS/RTI instructions use the "hidden" ($100,S) addressing mode, and those instructions ARE often used. We'll see this in due time; now is no time to worry about details like that yet.
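Just to make the port assignment concrete, here is a rough sketch in Python (no HDL yet). The mapping below is only my current guess, and it covers only the two indexed ports discussed above, not the third port.

[code]
# Rough sketch of which register read port serves which addressing mode.
# Port numbers follow the block diagram; the mapping itself is still a guess.
REGISTER_READ_PORT = {
    "(zp,X)":   ("port 1", "X"),      # index needed early, at Pointer Prefetch
    "(zp),Y":   ("port 2", "Y"),      # index needed later, at Operand Fetch
    "abs,X":    ("port 1 or 2", "X"), # normal indexed modes can use either
    "abs,Y":    ("port 1 or 2", "Y"),
    "($100,S)": ("port 1", "SP"),     # hidden stack mode used by RTS/RTI
}

for mode, (port, reg) in REGISTER_READ_PORT.items():
    print(f"{mode:9} reads {reg:2} on {port}")
[/code]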
For a showcase, here is how instructions will typically behave in the pipeline. I use ADC because it does both a read and a write to the accumulator (and is therefore a good example):
1) ADC $1234,X
Fetch/Decode : The opcode plus the "$12" and "$34" arguments are available at the output of the 4-byte FIFO in a single cycle.
The decode stages detect that this opcode takes 3 bytes, which will increment the PC by 3 and take 3 bytes from the FIFO in the next cycle.
Pointer Prefetch : This is a direct addressing mode so we basically have nothing to do here (or we could optionally already add X to $1234, but it doesn't change much).
Operand fetch : We add $1234 to the value of X, which is read from the register file, and send the sum to the D-Cache along with the order to read data.
Alu & Flags : If the cache could not read the information in one cycle, we'll have to stall the pipeline until the data is here. Once it is, we can add the read data to the value read from the accumulator and update the status flags.
Writeback : There is nothing to write to memory, so we just put the result into the accumulator register.
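Just to illustrate the Fetch/Decode step with something executable, here is a tiny behavioural model in Python (not HDL, not cycle-accurate; the one-entry opcode table and the function name are just for the example):

[code]
from collections import deque

INSTR_LEN = {0x7D: 3}   # ADC abs,X is a 3-byte instruction

def fetch_decode(fifo, pc):
    """Peek the FIFO head, decode the length, consume the bytes, bump PC."""
    opcode = fifo[0]
    length = INSTR_LEN[opcode]
    operands = [fifo[i] for i in range(1, length)]
    for _ in range(length):          # take 'length' bytes out of the FIFO
        fifo.popleft()
    return opcode, operands, pc + length

fifo = deque([0x7D, 0x34, 0x12, 0xEA])    # ADC $1234,X plus one more byte
op, args, new_pc = fetch_decode(fifo, 0xC000)
print(hex(op), [hex(a) for a in args], hex(new_pc))   # 0x7d ['0x34', '0x12'] 0xc003
[/code]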
2) ADC ($12,X)
Fetch/Decode : The opcode plus the "$12" argument are available at the output of the 4-byte FIFO in a single cycle.
The decode stages detect that this opcode takes 2 bytes, which will increment the PC by 2 and take 2 bytes from the FIFO in the next cycle.
Pointer Prefetch : We add $12 to the value of X, which is read from the register file, and send the computed address to the D-Cache.
Operand fetch : Let's assume from now on that the data cache is perfect and always retrieves information in 1 cycle. We now have all 16 bits of the pointer we were seeking (fetched as two separate 8-bit cache reads in parallel), so we send this pointer to the D-Cache and order a read.
Alu & Flags : We can add the data read from the D-Cache to the value read from the accumulator, and update the status flags.
Writeback : There is nothing to write to memory, so we just put the result into the accumulator register.
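Here is the same sequence as a behavioural sketch in Python, assuming the perfect D-Cache mentioned above. dcache_read8 is just a placeholder for one 8-bit read port, and the stage timing is not modeled at all:

[code]
def adc_zp_x_indirect(zp_operand, X, A, C, dcache_read8):
    # Pointer Prefetch : add the zero-page operand to X (wrapping inside
    # page 0, as on the original 6502) and issue two 8-bit reads in parallel.
    base = (zp_operand + X) & 0xFF
    lo = dcache_read8(base)
    hi = dcache_read8((base + 1) & 0xFF)
    # Operand fetch : the 16-bit pointer is ready, issue the data read.
    pointer = lo | (hi << 8)
    data = dcache_read8(pointer)
    # Alu & Flags : add to the accumulator; only A and the carry are
    # modeled here, the N/Z/V updates are left out of the sketch.
    result = A + data + C
    return result & 0xFF, (result >> 8) & 1    # new A, new carry

# Example: the pointer at $12+X points to $C000, which holds $40.
mem = {0x16: 0x00, 0x17: 0xC0, 0xC000: 0x40}
print(adc_zp_x_indirect(0x12, 0x04, 0x01, 0, lambda a: mem[a]))   # A = $41, carry = 0
[/code]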
3) ADC ($34),Y
Fetch/Decode : The opcode plus the "$34" argument are available at the output of the 4-byte FIFO in a single cycle.
The decode stages detect that this opcode takes 2 bytes, which will increment the PC by 2 and take 2 bytes from the FIFO in the next cycle.
Pointer Prefetch : We send the addresses $34 and $35 to D-Cache read ports 1 and 2.
Operand fetch : Retrieve the pointer, read from the D-Cache as a 16-bit value, and add to it the value of Y, which is read from the register file. Send the result of this addition to the D-Cache and order a read.
Alu & Flags : We can add the data read from the D-Cache to the value read from the accumulator, and update the status flags.
Writeback : There is nothing to write to memory, so we just put the result into the accumulator register.
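And for comparison, here is the same kind of sketch for the (),Y case, written as a simple per-stage trace. The wording is mine; the point is just to show in which stage each read and each addition happens:

[code]
def trace_adc_ind_y(zp_operand):
    return [
        ("Fetch/Decode",     f"pop opcode + ${zp_operand:02X} from the FIFO, PC += 2"),
        ("Pointer Prefetch", f"read ${zp_operand:02X} and ${(zp_operand + 1) & 0xFF:02X} on D-Cache ports 1 and 2"),
        ("Operand fetch",    "assemble the 16-bit pointer, add Y (register port 2), issue the data read"),
        ("Alu & Flags",      "A + data, update the status flags"),
        ("Writeback",        "result goes to A, nothing to memory"),
    ]

for stage, action in trace_adc_ind_y(0x34):
    print(f"{stage:16} : {action}")
[/code]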
Other addressing modes are trivial so I won't detail them.
Now let's analyze how RMW operations on memory behave. As an example I'll use ASL $1234,X.
Fetch/Decode : The opcode plus the "$12" and "$34" arguments are available at the output of the 4-byte FIFO in a single cycle.
The decode stages detect that this opcode takes 3 bytes, which will increment the PC by 3 and take 3 bytes from the FIFO in the next cycle.
Pointer Prefetch : This is a direct addressing mode so we basically have nothing to do here (or we could optionally already add X to $1234, but it doesn't change much).
Operand fetch : We add $1234 to the value of X, which is read from the register file, and send the sum to the D-Cache along with the order to read data.
Alu & Flags : We shift left the data read from the D-Cache and update the status flags.
Writeback : The result is written back to the D-Cache. In this case we are certain the data is already cached, as we just loaded it a few cycles ago. But in the case of an STA instruction, a cache miss could happen, and the pipeline would have to stall for the bus access.
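To show the difference between the guaranteed-hit RMW writeback and a possibly-missing STA, here is a very rough cache model in Python. The 4-byte lines, the write-allocate policy and the stall_for_bus placeholder are all assumptions of mine for the example, not decisions made for the real design.

[code]
class DCacheModel:
    LINE = ~0x3                      # 4-byte lines, purely for the example

    def __init__(self, backing):
        self.lines = {}              # cached line address -> list of 4 bytes
        self.backing = backing       # stand-in for main memory over the bus

    def _fill(self, line):
        self.lines[line] = list(self.backing[line:line + 4])

    def read8(self, addr):
        line = addr & self.LINE
        if line not in self.lines:   # read miss: fetch the line
            self._fill(line)         # (the stall is not modeled here)
        return self.lines[line][addr & 0x3]

    def write8(self, addr, value, stall_for_bus):
        line = addr & self.LINE
        if line not in self.lines:   # only possible for an STA-style store:
            stall_for_bus()          # an RMW has just read this line, so it hits
            self._fill(line)
        self.lines[line][addr & 0x3] = value
[/code]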
Finally, I'm not sure exactly how I'll do it yet, but let's consider a branch, say a BPL $50 instruction.
Fetch/Decode : The opcode plus the "$50" argument are available at the output of the 4-byte FIFO in a single cycle.
The decode stages detect that this opcode takes 2 bytes, and let's assume branch prediction is not implemented yet (or simply always predicts the branch as not taken), so we will increment the PC by 2 and take 2 bytes from the FIFO in the next cycle.
Pointer Prefetch : There is nothing to be done in this cycle.
Operand fetch : Now it gets complicated. I'm hesitating between two variations. In variation 1), the simpler one, there is nothing to be done here.
In variation 2), we could snoop the flags that are going to be updated by the previous instruction, which is in the following stage, and act accordingly. This could lead to a long critical path, though.
If anyone has comments, I'm all ears.
Alu & Flags : In variation 1), we look at whether the N flag is set. If it is not set, we have to take the branch: invalidate all instructions in the operand fetch, pointer prefetch, decode and fetch stages, and write the new value (old PC + 2) + $50 to the PC. We should also flush the FIFO.
On the next cycle, the 32-bit word containing the target instruction will be loaded into the FIFO, but the program won't be able to continue just yet: 0-3 dummy bytes will have to be taken out of the FIFO after this read, as they precede the instruction we want to execute.
Finally, on the cycle after that, the FIFO will be fully loaded with the next instruction to read, and the execution of the program will continue.
We can immediately see that the branching overhead is major, so a branch prediction mechanism should be added to the to-do list once we get something working (if we ever reach that stage).
Writeback : Nothing to be done in this case.
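To put a number on it, here is a tiny Python sketch of the redirect computation in variation 1). The 32-bit-aligned fetch matches the FIFO description above, but the helper and its names are just mine, and the flush of the earlier stages is not modeled:

[code]
def redirect(branch_pc, offset, fifo):
    # offset is the sign-extended 8-bit operand ($50 here, so positive)
    target = (branch_pc + 2 + offset) & 0xFFFF
    fifo.clear()                     # flush everything already fetched
    fetch_addr  = target & ~0x3      # aligned 32-bit word loaded next cycle
    dummy_bytes = target & 0x3       # 0-3 bytes to discard before the target
    return target, fetch_addr, dummy_bytes

target, fetch_addr, dummies = redirect(0xC000, 0x50, [])
print(hex(target), hex(fetch_addr), dummies)    # 0xc052 0xc050 2
[/code]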
I hope it starts to get a bit clearer in everyone's mind.
I'm open to any suggestions.