Hi Morita,
I've just seen your post, and it's great that you've found flaws in my (entirely theoretical) design, because I can learn from them. I can see you're experienced in this particular domain, so I'm interested.
As for the problems themselves...
Quote:
1) The cache is overbuilt
I guess it's too early to know. The only real reason for a cache is to provide multiple read ports, which are necessary if I hope to execute one 6502 instruction per cycle. As I said before, the first two read ports might or might not be fused somehow (since they will always read two consecutive addresses, mostly in ZP).
I'm not knowledgeable enough about caches to say more, but it looks like I've underestimated the problem, as in "well, I'll just throw more cache/more ports at it and that'll fix it". I'm now sure it's not that simple.
However, the dimensions of the cache can always be reduced, even if it hurts performance, to cut down area or FPGA usage. I'd agree that 80% for cache and 20% for logic is a bit crazy. But if somehow this configuration makes a more efficient processor, then it should not be overlooked. I said IF; I never said I proved that was the case.
Also, I'm not planning to do ASICs, just possibly an FPGA implementation. If I ever get loads of money and loads of potential customers, then why not an ASIC, but hell, the probability of such a thing happening is ridiculously low. An FPGA has SRAM cells whether you use them or not, so using them is good. Duplicating read ports is as simple as duplicating the SRAM itself.
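To illustrate (in software, since I haven't written any HDL for this), here's a minimal sketch of that duplication trick: two RAM copies written in lockstep behave like one memory with two independent read ports. All the names here are made up for the example.

```python
# Sketch of the "duplicate the SRAM" trick: a 2-read/1-write memory
# built from two single-read-port RAM copies written in lockstep.

class DualReadRam:
    def __init__(self, size):
        self.bank_a = [0] * size  # copy serving read port 0
        self.bank_b = [0] * size  # copy serving read port 1

    def write(self, addr, value):
        # Single write port: both copies are updated together,
        # so their contents are always identical.
        self.bank_a[addr] = value
        self.bank_b[addr] = value

    def read2(self, addr0, addr1):
        # Two independent read addresses served in the same "cycle".
        return self.bank_a[addr0], self.bank_b[addr1]

ram = DualReadRam(256)
ram.write(0x12, 0x34)
ram.write(0x13, 0x56)
print(ram.read2(0x12, 0x13))  # the two values written above, in one read
```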
Quote:
2) Basic pipeline has terrible performance
The following parameters can/should be taken into account for branch prediction:
* Which instruction it is (BPL, BMI, etc.)
* Which direction it branches (forwards or backwards)
* What the previous instruction's opcode is
If a thorough analysis is made of a large sample of existing 6502 code, then relatively efficient static branch prediction becomes possible without any cache or SRAM. This doesn't remove the penalty when it guesses wrong, but at least it will guess wrong rarely enough.
For example, a DEX followed by a backwards BNE is more likely to be taken than not, don't you agree?
Of course, if someone somehow manages to write a loop with a backwards "dex bne" that is hardly ever taken, then it will lose ~5 cycles every time, which is terrible. But experience tells me that any processor improvement has some pathological case that makes it worse than the original processor; if that weren't acceptable, pipelines and caches would not exist. It's kind of a hardware equivalent of Huffman compression (for those who see what I'm talking about).
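As a sketch of what such a static predictor could look like, using exactly the three inputs listed above (the rules here are illustrative; a real decision table would come from profiling a large corpus of 6502 code):

```python
# Illustrative static branch predictor for the 6502: no cache, no SRAM,
# just combinational rules over opcode, direction, and previous opcode.

def predict_taken(opcode, backwards, prev_opcode):
    """Return True if the branch is statically predicted taken."""
    # Classic counted loop: DEX/DEY/INX/INY followed by a backwards BNE
    # is almost always taken until the counter finally expires.
    if opcode == "BNE" and backwards and prev_opcode in ("DEX", "DEY", "INX", "INY"):
        return True
    # Generic fallback heuristic: backwards taken, forwards not taken.
    return backwards

print(predict_taken("BNE", True, "DEX"))   # True: counted-loop idiom
print(predict_taken("BEQ", False, "CMP"))  # False: forwards defaults to not-taken
```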
Finally, I think the branch could be resolved at the "operand & fetch" stage by forwarding ("backwarding") the new flag values from the "ALU & flags" stage. This would make the penalty for a wrong prediction about 4 cycles, not much worse than the original 6502's 3 cycles. I know this forwarding technique works because I used it in my ARM processor; however, it can easily become the critical path, since the signals propagate through the entire ALU and only then through the entire branch logic of the previous stage, all within the same clock cycle.
Quote:
3) Bad assumptions made about FIFO
FIFO is just a general concept which means "First In, First Out".
What I mean is a unit that will always provide the 4 bytes following the PC, and that can increment the PC by 1 to 4 bytes every cycle. Since it's possible to read memory in 32-bit words (4 bytes), such a thing is possible (and it was the point of this thread before this guy started getting insulted by everyone). It is not necessarily simple, but it is possible. I call this a FIFO because it "eats bytes in advance" from memory and outputs them at the desired pace. If another name is better suited, just say so.
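A rough software model of what I have in mind (assuming a word-aligned starting PC for simplicity; the class and method names are mine, not part of any design):

```python
# Model of the prefetch unit: it fetches 32-bit words (4 bytes) ahead
# of the PC and always exposes the next 4 bytes, while the decoder
# consumes 1 to 4 bytes per cycle depending on instruction length.

class PrefetchQueue:
    def __init__(self, memory, pc):
        self.memory = memory
        self.fetch_addr = pc   # next word-aligned address to fetch
        self.buffer = []       # bytes queued ahead of the PC

    def _refill(self):
        # Fetch one 4-byte word at a time until >= 4 bytes are queued.
        while len(self.buffer) < 4:
            self.buffer.extend(self.memory[self.fetch_addr:self.fetch_addr + 4])
            self.fetch_addr += 4

    def peek4(self):
        # The 4 bytes following the PC, available every cycle.
        self._refill()
        return self.buffer[:4]

    def consume(self, n):
        # Advance the PC by the instruction's length (1 to 4 bytes).
        assert 1 <= n <= 4
        del self.buffer[:n]

mem = list(range(32))        # dummy memory contents
q = PrefetchQueue(mem, 0)
print(q.peek4())   # [0, 1, 2, 3]
q.consume(3)       # e.g. a 3-byte instruction was decoded
print(q.peek4())   # [3, 4, 5, 6]
```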
Quote:
4) Bad assumption made about Operand Fetch pipeline stage
The result of $1234 + X is available at the end of the cycle. The result of the D-cache read needs to be available at the start of the next clock cycle. This gives the D-cache less than one clock to return the value.
Exactly one clock, no? I mean, it's fairly classical to set the address lines on cycle X and have the data ready on cycle X+1.
However, it can be a problem if the cache is not direct-mapped; then either some trickery or extra dummy pipeline stages are required. I already mentioned this in my previous post, so I won't repeat it. It's probably one of the bigger problems to solve, though.
Quote:
5) Missing bypass from the Writeback stage
We mentioned that the A, X, Y and S registers are asynchronous, because it's not worth using SRAM for only 32 bits of storage. This removes the need for a bypass from the writeback stage - the registers themselves are the bypass.
Quote:
Fundamentally, ISAs with (memory) read-modify-write instructions are not well-suited for a pipelined processor design without weird kludges like breaking into micro-ops.
This is why CISC ISAs are mostly dead.
The reason for this is:
1) RMW instructions require extra pipeline stages
2) Extra pipeline stages increases the branch penalty
3) The amount of performance gained through RMW instructions (typically 2-5%) is much less than the performance loss through the increased branch penalty (>10%)
I believe you; however, such a technique has so far only been tried with the x86 and 68000, as far as I know. Never with the 6502, which, while CISC, is QUITE different from the latter two in terms of instruction set and addressing modes.
Therefore, yes, of course a pipelined implementation of such an ISA may suck completely, but we can't know for sure without trying.
1) has been solved by me (it's only 5 stages so far, while some RISCs have more than 10...), 2) can be solved with smart branch prediction, and 3) can't really be known without further analysis.
I really thought about using RISC microcode, but I came to the conclusion that it would suck in the 6502's case. Why? Let's consider an instruction like adc ($12),Y.
It would translate into something like this (assuming 16-bit registers):
Code:
mov r0, #$12   ; ZP address of the pointer
ldr r1, [r0]   ; fetch the 16-bit pointer from ZP
add r1, ry     ; add the Y index
ldr r0, [r1]   ; fetch the operand byte
adcs ra, r0    ; add with carry into A, updating the flags
This results in 5 RISC instructions and would require 7 cycles to complete, even if the inner processor is multi-issue, because each instruction depends on the previous one (an ldr usually takes at least 2 cycles on pipelined RISCs if its result is used by the next instruction).
This is no faster than the original 6502; in other words, this sucks, as it would overly complicate the processor for no performance gain (except a higher clock speed, which can also be attained without any inner RISC machine or microcode).
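A quick sanity check of that cycle count, under the stated assumption that a load whose result feeds the next instruction costs 2 cycles and everything else costs 1:

```python
# Cycle count of the fully dependent microcode chain for adc ($12),Y.
# Latencies are the assumption stated above, not measured numbers.
LATENCY = {"mov": 1, "ldr": 2, "add": 1, "adcs": 1}

seq = ["mov", "ldr", "add", "ldr", "adcs"]  # each op depends on the previous one

# Since nothing can overlap, even a multi-issue core just sums the latencies.
total = sum(LATENCY[op] for op in seq)
print(total)  # 7
```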
Finally:
Quote:
A cache of this design can support a load-store ISA pipeline that can issue four instructions per clock. The current pipeline design can only issue one instruction per clock.
In fact, no: I'm planning to issue up to 4 instructions per clock, but only in special cases. Some "idioms" could be recognized, bypass the normal instruction decoding, and execute in a single clock.
For example, INX INX INX INX (which is used quite frequently) would translate into something that performs add x, #4 and takes a single cycle.
Of course, most of the time this won't apply, but it could boost performance a little more.
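A sketch of how such idiom recognition could work as a peephole over the decode window (the 4-instruction window and the fused operation's name are illustrative assumptions, not the actual design):

```python
# Peephole idiom fusion: collapse a run of INX at the front of the
# decode window into a single fused add-immediate on X.

def fuse_idioms(window):
    """window: up to 4 decoded opcodes; returns the ops to issue."""
    # Count the leading run of INX instructions.
    run = 0
    while run < len(window) and window[run] == "INX":
        run += 1
    if run >= 2:
        # Collapse the run into one fused op, issued in a single cycle.
        return [("ADD_X_IMM", run)] + [(op,) for op in window[run:]]
    return [(op,) for op in window]

print(fuse_idioms(["INX", "INX", "INX", "INX"]))  # [('ADD_X_IMM', 4)]
print(fuse_idioms(["INX", "LDA", "INX", "INX"]))  # a single leading INX is left alone
```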
Sorry for the wall of text, but I hope this makes my point clear.