In spite of myself, I find myself writing this (OCD). I don't think I or anyone else should withhold information from our community because of the bad behaviour of others. The truth is that I've pursued this direction (fast 6502 core with 32-bit RAM) in the past, and abandoned it as too complex a solution. My interest lies more in cycle-compatible retro-machinery anyway, so I am somewhat biased. I will try to stick to the facts and refer to notes from experiments some years back. To avoid seeing an onslaught of abuse, I have put windfall on my 'foe' list. If windfall posts something useful, please rephrase it in non-judgemental technical language for me.
My initial off-hand dismissal of the idea has to do with the complexity of even the simplest solution, and the fact that the poster has not even remotely considered the consequences of this idea. (I stated this inaptly as 'you are talking out your butt', a regrettable choice of words that no doubt caused much of the escalation. However, that is a pretty good summary of windfall's presentation.) As I can't fully rely on my memory, I found notes from several years ago when I pursued this direction extensively, in a misguided effort to create a really fast 8-bit CPU core. I certainly learned a lot, but it was a total waste of time. I sincerely hope that no one here repeats my mistakes.
Jeff's diagram is a nice start, but it doesn't even remotely approach the entirety of the issues at hand. All parts of the 6502 core must be modified and integrated in order for this to actually run at full speed (or at all). Here are some of my notes paraphrased for the occasion.
Summary of the idea
A 6502 core with 32-bit memory can be made to run as fast as possible, avoiding multi-cycle address and data fetches as long as the data is available in the pre-fetch buffer. In order to avoid corner conditions, the buffer has to be 8 bytes long.
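A minimal sketch of the corner condition behind the 8-byte figure, assuming the buffer holds whole, aligned longwords and that the longest instruction is 3 bytes. The function name is illustrative only and does not come from the original notes.

```python
def fits_in_buffer(start_offset, length, buffer_bytes):
    """True if an instruction of `length` bytes that starts `start_offset`
    bytes into the buffer lies entirely inside the buffer."""
    return start_offset + length <= buffer_bytes

# An instruction may start at any byte of the current longword (offsets 0..3).
# With only one longword buffered, some 3-byte instructions straddle the end.
straddle_4 = [off for off in range(4) if not fits_in_buffer(off, 3, 4)]

# With two consecutive longwords buffered (8 bytes), none of them straddle.
straddle_8 = [off for off in range(4) if not fits_in_buffer(off, 3, 8)]

print(straddle_4, straddle_8)
```

With a single longword, half of the possible start offsets leave a 3-byte instruction incomplete, which is the case the second longword covers.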
In order to provide execution fast enough to take advantage of this pre-fetching, the core has to be able to decode an incoming instruction and mux in the data during a single cycle. In fact, most simple instructions would have to decode and execute via sequential logic during the initial cycle. This affects all aspects of the 6502 core. Therefore this is not a simple pre-fetch modification, but an entirely different core.
Pre-fetch unit
The pre-fetcher has to be able to provide 3 bytes of opcode and data. Since the opcode can be positioned anywhere in bytes 0-7, an 8-1 mux is required to provide output. Since data can be one or two bytes, similar muxes must be created to get data from anywhere in the prefetch buffer. A state machine for sequential pre-fetching of the next longword is required, co-ordinating its activities with the PC register. In addition, to avoid a cycle loss after fetching a new byte, bypasses must be implemented, shunting any of the four 8-bit chunks of the direct memory input to the instruction decoder.
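The three parallel byte selections can be modelled in software as follows. This is only a behavioural sketch: the names `mux8` and `fetch_instruction_bytes` are hypothetical, and treating the buffer as circular (offsets wrap modulo 8) is an assumption of the sketch, not something the notes specify. The memory bypasses are not modelled.

```python
def mux8(buf, sel):
    """8-to-1 byte mux: pick one of the 8 buffered bytes."""
    assert len(buf) == 8
    return buf[sel & 7]  # assumption: selection wraps modulo 8

def fetch_instruction_bytes(buf, pc_offset):
    """Return (opcode, operand_lo, operand_hi) from three parallel muxes."""
    return tuple(mux8(buf, pc_offset + i) for i in range(3))

# NOP, NOP, NOP, LDA $1234 (opcode 0xAD, little-endian operand), NOP, NOP
buf = [0xEA, 0xEA, 0xEA, 0xAD, 0x34, 0x12, 0xEA, 0xEA]
opcode, lo, hi = fetch_instruction_bytes(buf, 3)
operand = lo | (hi << 8)
print(hex(opcode), hex(operand))
```

In hardware each of the three selections is its own 8-input, 8-bit-wide mux, all switching in the same cycle, which is where the cost starts to add up.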
Instruction Decoder
The instruction decoder must be fast enough to decode an instruction in the first cycle and leave enough time for the ALU and other execution units to finish the task, as well as for the address generator to operate (see next item). The instruction decoder must have a mux connecting 8 prefetch bytes and 4 memory bypasses (12-1 mux) in order to operate at full speed.
Address Generation
For absolute addressing, since the normal 6502 circuitry is not applicable, the absolute address from the prefetch buffer needs to be placed onto the address bus during the same first cycle. This requires muxing all 8 bytes into each of the AB bytes, plus an adder for indirect X/Y ops muxed to all 8 prefetch bytes, the X and Y registers, SP etc. Combined with the other possibilities (see next items), this results in a pretty wide mux.
Memory Indirection
Indirect addressing creates an additional problem. Since we don't want to abandon the prefetch buffer and our memory is 32 bits wide, the address from the prefetch buffer needs to be placed onto AB (see above). The pointer read from that address then needs to be placed onto AB for the next cycle. However, since the 16-bit address is not aligned, it may straddle two 32-bit words, requiring an extra cycle 1/4 of the time. No easy solution exists here - bypasses are required for the fast 3/4 case, and a register-bypass combination must be implemented to handle a 32-bit boundary crossing. These are extra inputs to the AB, and must be accounted for in the state machine.
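The 1/4 figure can be checked with a few lines of arithmetic. This sketch assumes 4-byte-aligned longwords and uniformly distributed pointer addresses; the function names are illustrative.

```python
def crosses_longword(addr, size=2):
    """True if `size` bytes starting at `addr` span two 32-bit words."""
    return (addr // 4) != ((addr + size - 1) // 4)

def bus_reads(addr, size=2):
    """Number of 32-bit memory reads needed to fetch the operand."""
    return 2 if crosses_longword(addr, size) else 1

# Check every possible alignment of a 16-bit pointer within a longword.
crossing = [a for a in range(4) if crosses_longword(a)]
fraction = len(crossing) / 4
print(crossing, fraction)
```

Only a pointer whose low byte sits in the last byte lane of a longword needs the second read, hence one alignment in four.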
Write Operations
In fact, a whole other 4-byte buffer should be implemented for data addressing. Since the RAM is 32 bits wide, every read has to go through a 4-1 mux, and every write operation becomes a read-modify-store operation. Making the write efficient is more complicated: the input of the data register, in addition to coming from RAM, has to have an override for the write from every possible write source. That is, every one of the 4 bytes to be written next cycle has to be able to come from any of the registers (or other sources), requiring a crossbar switch or a complex muxing arrangement.
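A behavioural sketch of the read-modify-store sequence and the 4-1 read mux, assuming a RAM that only accepts whole 32-bit writes and a little-endian byte lane order. Both are assumptions made for illustration; `write_byte` and `read_byte` are hypothetical names.

```python
def read_byte(mem, addr):
    """4-to-1 mux on the read path: select one byte lane of the longword."""
    word_index, lane = divmod(addr, 4)
    return (mem[word_index] >> (8 * lane)) & 0xFF

def write_byte(mem, addr, value):
    """Byte write into 32-bit-wide memory as read, modify, store."""
    word_index, lane = divmod(addr, 4)
    shift = 8 * lane
    old = mem[word_index]                               # read
    cleared = old & ~(0xFF << shift) & 0xFFFFFFFF       # modify: clear the lane
    mem[word_index] = cleared | ((value & 0xFF) << shift)  # store
    return mem[word_index]

mem = [0x44332211, 0x88776655]   # two longwords, byte 0 is 0x11
write_byte(mem, 5, 0xAB)
print(hex(mem[1]), hex(read_byte(mem, 5)))
```

The software version hides the cost: in hardware the lane to be replaced must be fed from any register in the same cycle, which is the crossbar described above.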
Jumps
Direct jump address must be placed onto AB immediately. The result is bypassed into the Instruction register, and the prefetch state machine must fill in the rest of the prefetch buffer.
Relative jumps
The PC must include a 16-bit adder circuit to accommodate relative jumps. The adder must mux in any of the 8 prefetch bytes.
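For reference, the arithmetic the 16-bit adder has to perform is the standard 6502 branch calculation: the 8-bit offset is sign-extended and added to the address of the instruction following the branch. The function name below is illustrative.

```python
def branch_target(pc_next, offset_byte):
    """Target of a relative branch.

    pc_next     -- address of the instruction after the 2-byte branch
    offset_byte -- raw 8-bit operand, interpreted as a signed value
    """
    offset = offset_byte - 0x100 if offset_byte & 0x80 else offset_byte
    return (pc_next + offset) & 0xFFFF   # PC wraps at 16 bits

# A branch at $1000 with offset $FE jumps back to itself.
print(hex(branch_target(0x1002, 0xFE)))
```

In the accelerated core the offset byte can sit in any of the 8 prefetch positions, so the adder input is again behind an 8-input mux.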
Indirect jumps
Indirect jump address must be placed onto AB immediately. See the discussion above about indirect memory access - a similar state machine with a possible 1-cycle delay for 32-bit boundary crossing must be implemented. The AB mux has to have extra inputs from the memory bypass (eight 8-bit inputs) and the register in case of misalignment.
Zero-page operations
Since zero-page memory includes indirection for both addresses and data, these must be accounted for in the mux and the state machine.
Indirect X and Y jumps
An adder on the AB output, muxed in from X and Y, is required. In reality, it is a little more complicated due to state machine issues.
Stack operations
Stack operations require temporary data fetches and stores. In addition, the stack supports 16-bit read and write operations, requiring buffering of 32-bit boundary mismatches, which now occur 1/2 the time. To be as fast as possible (that is, incurring only 1 clock delay 1/2 of the time), 16-bit wide muxes and bypasses are required. Consider that since writing is possible, a RAM writing path must be established, and the contents of the prefetch buffer must be flushed. I don't even want to address the full complexity of this in this document.
ALU
Since the ALU no longer uses the restricted 6502 buses and is required to execute in a single cycle, wide muxes must connect the inputs of the ALU to every one of the 8 possible bytes of the prefetch buffer (bypass inputs are not necessary since after a jump an instruction is encountered, not data). But the ALU must also connect to all the registers and memory-read datapaths to maintain full speed.
Register File
Every register capable of loading an immediate must be muxed from every one of the 8 prefetch bytes. Since registers can be written, the paths must be muxed in as well, and the data buffer flushing must occur on write.
Summary
Speed
The complexity of the instruction decoder in even simple addressing modes, combined with the fact that combinational logic must be used to keep the system running at full speed, results in long path delays and unacceptably low maximum clock rates, completely defeating the purpose of the acceleration circuit. Large muxes and routing congestion associated with the core size slow down the design considerably. A fast 8-bit core connected to 8-bit RAM will in most cases outrun a CPU accelerated in this fashion.
Size
The implementation requires many wide (8- and 16-bit wide) muxes, with many inputs (as the prefetch buffer has 8 bytes in addition to all possible paths). 16-1 muxes in FPGA architectures are enormous, leading to unacceptably large core sizes and associated routing congestion.
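As a rough, architecture-independent way to compare the muxes mentioned above: an N-input mux built as a tree of 2-input muxes needs N-1 of them per bit. The sketch below only counts these 2-input equivalents; how they map onto LUTs depends on the FPGA family, so the numbers are a relative measure and not a resource estimate. The function name is illustrative.

```python
def mux2_equivalents(inputs, width):
    """2-input mux equivalents for an `inputs`-to-1 mux, `width` bits wide."""
    return (inputs - 1) * width

sizes = {
    "8-1, 8-bit (prefetch byte select)": mux2_equivalents(8, 8),
    "12-1, 8-bit (decoder with bypasses)": mux2_equivalents(12, 8),
    "16-1, 16-bit": mux2_equivalents(16, 16),
}
for name, count in sizes.items():
    print(name, count)
```

Each of these appears several times in the design (opcode, operands, address bus bytes, ALU inputs, register inputs), so the totals multiply quickly.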
If anyone still feels that this direction is worth pursuing, please use this input as a positive starting point. I am sorry if my descriptions or conclusions are negative - I am a little bitter about wasting a year of my life going down this rabbit hole.
Edit: there is a reason 32-bit processors are a lot more complicated, especially ones that allow 8 and 16-bit unaligned access... As I said before, 32 is not at all the new 8 bit.
_________________ In theory, there is no difference between theory and practice. In practice, there is. ...Jan van de Snepscheut
Last edited by enso on Fri Jun 07, 2013 4:59 pm, edited 1 time in total.