100MHz TTL 6502: Here we go!
Re: 100MHz TTL 6502: Here we go!
ttlworks wrote:
Maybe not relevant for this project:
Die Devices sells bare chip dies, 74AUC2G53 is on their list.
Würth Elektronik is able to bond bare chip dies to PCBs.
Btw, I like the 74AUC2G53 as a replacement for a 74'151 equivalent. Thanks.
Looks like in the right hands, this 74AUC2G53 is a real powerhouse.
C74-6502 Website: https://c74project.com
Re: 100MHz TTL 6502: Here we go!
Windfall wrote:
Do they offer a free, teeny-tiny soldering iron when they ship their dies?
To me, it looks like sort of an "if you have to ask for the price, you know you can't afford it" business.
Drass wrote:
That's so cool! Maybe we can create our own SMD 74AUC283 on a little PCB with tiny castellations. 
BTW: Würth Elektronik is also able to integrate SMD components into PCBs.
Re: 100MHz TTL 6502: Here we go!
Drass wrote:
Looks like in the right hands, this 74AUC2G53 is a real powerhouse. 
It's a pity these things do not come in larger packages, such as quad 2:1 (SPDT) switches with independent control inputs, as is the case for 1:1 switches. Not even dual 2:1 switches seem to be available, as far as I know. However, there's always the possibility of pairing 74xx3125 with 74xx3126 quadruple 1:1 switches to get 2:1 functionality with a lower overall component count.
Re: 100MHz TTL 6502: Here we go!
joanlluch wrote:
Not even dual 2:1 switches seem to be available, as far as I know.
Re: 100MHz TTL 6502: Here we go!
ttlworks wrote:
joanlluch wrote:
Not even dual 2:1 switches seem to be available, as far as I know.
Re: 100MHz TTL 6502: Here we go!
joanlluch wrote:
[...] but not 2:1 fast switches with the same or a similar pin layout as, say, a 74HC4053.
-- Jeff
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: 100MHz TTL 6502: Here we go!
Let's now take a closer look at the pipeline in this design. The objective is to reduce the cycle-time while keeping the cycle-count fixed. The critical path in the CPU falls squarely on the ALU, and associated pre- and post-processing. Rather than cramming all this into one cycle, the basic strategy is to push pre-processing to the prior cycle, and post-processing to the next. This allows the ALU to have the whole cycle to itself, giving us the headroom we need to boost the clock-rate.
Pre-processing here refers to the work required to set up the inputs to the ALU with appropriate values. That seemingly innocuous task takes a surprising amount of time -- we have to fetch microcode, decode control signals, select source values and output-enable the appropriate registers. Post-processing, on the other hand, refers to updating the status flags and writing to the destination register. Rebalancing this workload around the ALU, we end up quite naturally with a four-stage pipeline, as follows:
We have FETCH, DECODE, EXECUTE and WRITEBACK -- the idea is to perform a roughly equal amount of work at each stage and then to pass the baton to the next. Along the way, we capture intermediate results in pipeline registers. Specifically, we have the Microinstruction Register (MIR) after the FETCH stage, we have ALUA, ALUB and ALUC registers at the ALU inputs and we have the R register at its output. The FTM (Flags To Modify) and RTM (Registers To Modify) registers direct the WRITEBACK stage regarding which flags and destination register to update. (More on the WRITEBACK stage below.)
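To make the stage flow concrete, here is a toy Python model of the four-stage arrangement. This is only a sketch of stage occupancy, not the project's hardware: the "microinstructions" are plain integers, and the four pipeline slots loosely stand in for the MIR, the DECODE outputs, the R register, and the WRITEBACK input.

```python
# Toy model of the FETCH/DECODE/EXECUTE/WRITEBACK flow described above.
# Only stage occupancy is modeled: one microinstruction enters per cycle,
# each takes four cycles to retire, and steady-state throughput is one
# retirement per cycle.
def run(n_uops):
    pipe = [None, None, None, None]   # MIR, DECODE out, R, WRITEBACK in
    retired, next_uop, cycle = [], 0, 0
    while len(retired) < n_uops:
        cycle += 1
        incoming = next_uop if next_uop < n_uops else None
        if incoming is not None:
            next_uop += 1
        pipe = [incoming] + pipe[:3]  # every pipeline register advances
        if pipe[3] is not None:
            retired.append(pipe[3])   # WRITEBACK completes this uop
    return cycle

print(run(8))  # 11: three cycles of fill latency, then one uop per cycle
```

The point of the model is just that after a three-cycle fill, one microinstruction completes every cycle even though each individually takes four.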
Memory operations using "flow-through" synch RAMs are a good fit for this arrangement. A key feature of these RAMs is that we can clock an address into the RAM's internal registers then read the data value from its outputs before the next clock edge occurs. The ADL and ADH registers allow the pipeline to work in this same way with asynchronous peripherals. For writes, there is also the WE register and a Data Output Register (DOR).

As we've discussed before, the ALU features a "recirculate" path to allow the result to be fed back into its inputs. This is done during address calculation, for example, when the ALU result is immediately required in the next cycle. Memory reads are also recirculated, as either ALU operands or addresses to be used in the next cycle.

The WRITEBACK stage calculates the flags based on the ALU result, updates the P register according to the FTM, and writes the result to a destination register according to the RTM. One important thing to highlight is that the WRITEBACK stage writes to registers using a mid-cycle rising clock-edge (PHI2 rising edge). Meanwhile, registers are always sampled at the end of the cycle (PHI1 rising edge). This discipline ensures that we always get an up to date value when a given register is being read and written to in the same cycle. For example, the P register may be updated in the same cycle that a branch test is being executed. Delaying the branch test until the second half of the cycle ensures that the branch test evaluates correctly.

Beyond allowing enough time to calculate the flags, a separate WRITEBACK stage allows the R register to neatly buffer the ALU from the rest of the CPU's internal registers (and the added bus capacitance they would impose). There are over ten destinations for the ALU output, all of which would add unnecessary delay to the ALU's critical path were they connected directly (10 loads x 3pF per load x 50Ω + 6" trace delay = 2.5ns).
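The figure at the end of that calculation can be checked with a quick back-of-the-envelope estimate. Note the 170 ps/in trace propagation figure below is an assumed typical value for FR-4, not something stated in the post:

```python
# Back-of-the-envelope check of the loading estimate above:
# 10 loads x 3 pF into ~50 ohms, plus ~6 inches of trace.
loads = 10
c_per_load = 3e-12        # 3 pF input capacitance per load
r_drive = 50.0            # ~50 ohm driver/trace impedance
trace_inches = 6
tpd_per_inch = 170e-12    # ~170 ps/in on FR-4 (assumed typical value)

rc_delay = r_drive * loads * c_per_load       # 50 ohm * 30 pF = 1.5 ns
trace_delay = trace_inches * tpd_per_inch     # ~1.0 ns
total_ns = (rc_delay + trace_delay) * 1e9
print(round(total_ns, 1))  # 2.5, matching the post's ~2.5 ns estimate
```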
Finally, we should note that the DECODE stage must receive a fresh instruction every cycle in order for the pipeline to function smoothly. To begin with, FETCH retrieves a new opcode from main memory (or simply generates a BRK on a CPU reset) and feeds it to the DECODE stage via the Instruction Register (IR). Thereafter, FETCH will retrieve microinstructions associated with that opcode from the microcode store, one per cycle, and feed them to the DECODE stage via the Microinstruction Register (MIR). Once we reach the end of the current opcode, a new opcode is fetched and the sequence repeats. The DECODE stage, meanwhile, always delivers appropriate control signals for downstream pipeline stages, whether by decoding the opcode in one cycle or a microinstruction in another.
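That alternation -- one opcode into the IR, then its microinstructions into the MIR, one per cycle -- can be sketched as a simple generator. The opcode names and microinstruction lists here are invented for illustration; they are not the project's actual microcode:

```python
# Sketch of the FETCH stage's sequencing. Opcode names and microcode
# contents below are made up for illustration only.
MICROCODE = {
    "LDA_abs": ["fetch_adl", "fetch_adh", "read_mem_to_A"],
    "INX":     ["increment_X"],
}

def fetch_stream(program):
    """Yield one DECODE input (opcode or microinstruction) per cycle."""
    for opcode in program:
        yield ("IR", opcode)              # opcode -> Instruction Register
        for uop in MICROCODE[opcode]:
            yield ("MIR", uop)            # uop -> Microinstruction Register

print(list(fetch_stream(["INX", "LDA_abs"])))
```

Either way, DECODE sees exactly one item per cycle, which is what keeps the downstream stages fed.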
And that's it. We'll take a look at how this pipeline executes cycle-accurate 6502 instructions in a future post. For now, the main thing to note is that this is a relatively simple pipeline that still packs a punch in terms of performance. By way of comparison, the critical path on this pipeline is about 20ns long (50MHz) as compared to 50ns (20MHz) on the C74-6502 -- that's assuming similar components in both cases; i.e., AC logic for the ALU and CBT logic for tri-state buffers. The hope of course is that faster components and further optimizations (like the FET Switch Adder) will enable us to double the clock-rate yet again and reach the 100MHz milestone. Only time will tell whether we'll manage to get there.
Cheers for now,
Drass
P.S. Many thanks to Dr Jefyll for helping to clarify and edit this description. It is much better for it. Thanks Jeff!
Last edited by Drass on Tue Oct 27, 2020 1:32 pm, edited 1 time in total.
C74-6502 Website: https://c74project.com
Re: 100MHz TTL 6502: Here we go!
Hi Drass,
That's interesting.
I have a question on the "writeback" stage and the flow of the pipeline. You explain that by performing register writes at mid-cycle, then reading them at the end of the cycle, you avoid data hazards on the pipeline. However, I think this only works (possibly) because the 6502 uses two cycles anyway to complete instructions. So in fact you have a two-step gap between the fetch-decode-execute-writeback sequence of one cycle and the next. Or in other words, the next instruction fetch happens while the current cycle is in the execute stage, not while it is in the decode stage, as would be the case for a standard pipelined RISC processor. Is this right, or am I missing something fundamental here?
I mean, you have this:
Code: Select all
Fetch | Decode | Execute | Writeback
                  Fetch | Decode | Execute | Writeback
                                    Fetch | Decode | Execute | Writeback
                                                      Fetch | Decode | Execute | Writeback
As opposed to this:
Code: Select all
Fetch | Decode | Execute | Writeback
        Fetch | Decode | Execute | Writeback
                Fetch | Decode | Execute | Writeback
                        Fetch | Decode | Execute | Writeback
Thanks
Joan
Re: 100MHz TTL 6502: Here we go!
Drass wrote:
One important thing to highlight is that the WRITEBACK stage writes to registers using a mid-cycle rising clock-edge (PHI2 rising edge).
Meanwhile, registers are always sampled at the end of the cycle (PHI1 rising edge).
Drass, since the ALU input latches will be edge triggered 74AUC16374 chips or such...
have you considered building the registers with transparent latches like 74AUC16373 ?
When using transparent latches for the registers (transparent during PHI2=HIGH),
the register data inputs wouldn't have to be stable/valid before the rising edge of PHI2.
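The behavioural difference between the two parts can be sketched abstractly. This is just an illustration of the edge-triggered vs. transparent distinction, with timing reduced to discrete samples; real setup/hold margins are not modeled:

```python
# Behavioural contrast between an edge-triggered register ('374-style)
# and a transparent latch ('373-style). Timing is abstracted to discrete
# samples; real setup/hold windows are not modeled.
class EdgeFF:
    def __init__(self): self.q = None
    def rising_edge(self, d):
        self.q = d                 # D must already be valid at the edge

class TransparentLatch:
    def __init__(self): self.q = None
    def drive(self, d, gate):
        if gate:                   # transparent while the gate is high
            self.q = d             # late-arriving D still flows through

ff, latch = EdgeFF(), TransparentLatch()
ff.rising_edge("early")            # captures whatever was valid at the edge
latch.drive("early", gate=True)
latch.drive("late", gate=True)     # data settling after the edge still lands
print(ff.q, latch.q)               # early late
```

This is the crux of the suggestion: the latch only needs valid data before its gate closes, not before it opens.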
Re: 100MHz TTL 6502: Here we go!
ttlworks wrote:
Now this gives me quite a headache.
Drass, since the ALU input latches will be edge triggered 74AUC16374 chips or such...
have you considered building the registers with transparent latches like 74AUC16373 ?
When using transparent latches for the registers (transparent during PHI2=HIGH),
the register data inputs wouldn't have to be stable/valid before the rising edge of PHI2.
Re: 100MHz TTL 6502: Here we go!
On the face of it, computing Z should be no worse than a carry-chain problem, and indeed the inputs to the Z function arrive earliest from the LSB and latest from the MSB, so Z might only need to be a gate-delay behind C. (I say this, knowing that computing Z often does seem to be a time-consuming thing. So I'm interested in why the difference between theory and common practice.)
Re: 100MHz TTL 6502: Here we go!
Hi Ed,
As per my limited experience in processor design, essentially acquired from reading books and from my adventure of developing a processor architecture (on paper), I can say that your observation is only true in the context of ripple-carry ALUs. Indeed, the Z flag can be computed with the same delay as the carry flag by just combining each result bit with the previous ones as they ripple in.
However, once we apply carry-lookahead circuits or carry-skip strategies, the Z flag starts to add non-negligible time of its own. This is because, apparently, there's no way to "look ahead" for the Z flag: it must be computed from the result, so any delay in doing so adds directly to the total ALU delay. If the ALU data width is significant, say 16 or 32 bits, then implementing the Z flag circuit with standard 74xx ICs requires cascading up to 3 or 4 levels of gates.
Still, your comment seems quite fair from a conceptual point of view. So I too would be interested to know whether there really is no way to compute the Z flag ahead of (or in parallel with) the result.
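The ripple-carry case can be sketched directly: a "zero so far" signal can ride one gate behind the carry chain, so Z arrives roughly with C. This is an illustrative model of the idea, not the project's circuit:

```python
def ripple_add_with_z(a, b, width=8):
    """Ripple-carry add that computes Z alongside the carry chain."""
    carry, zero_so_far, result = 0, True, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (carry & (ai ^ bi))
        zero_so_far = zero_so_far and (s == 0)  # one gate behind the carry
        result |= s << i
    return result, carry, int(zero_so_far)

print(ripple_add_with_z(0xFF, 0x01))  # (0, 1, 1): result wraps to zero, Z set
```

With lookahead, by contrast, the sum bits arrive nearly together, so the zero-detect tree sits entirely after the adder, which is the extra delay described above.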
Re: 100MHz TTL 6502: Here we go!
I suspect even in more sophisticated ALUs, the LSB results come sooner. So it might be an advantage to structure the Z logic to take that into account, with the MSB bits having a shallower logic depth.
(Not just logic depth: sometimes a many-input logic gate will in practice react faster to some inputs than others, faster to one sense of transition than the other. I don't think TTL spec sheets tend to show this, but the timing models used within chips do, AFAIR.)
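One way to see the benefit is a toy arrival-time model: assume result bit i becomes valid at time i (LSB-first, ripple-style) and each 2-input OR costs one unit of delay. A chain ordered LSB-first, so the late-arriving MSB passes through only one gate level, then beats a balanced tree. The unit delays here are arbitrary assumptions, purely to illustrate the point:

```python
# Toy arrival-time model: bit i of the result is valid at time i, and
# each 2-input OR adds one delay unit. Unit delays are arbitrary.
def combine_linear(arrivals):
    # LSB-first chain: each stage waits for its later input
    t = arrivals[0]
    for a in arrivals[1:]:
        t = max(t, a) + 1
    return t

def combine_balanced(arrivals):
    # balanced tree, ignoring arrival order (width: power of two)
    while len(arrivals) > 1:
        arrivals = [max(arrivals[i], arrivals[i + 1]) + 1
                    for i in range(0, len(arrivals), 2)]
    return arrivals[0]

bits = list(range(8))               # bit i valid at time i (LSB first)
print(combine_linear(bits))         # 8: tracks the arrivals closely
print(combine_balanced(bits))       # 10: balanced tree finishes later
```

So under these assumptions the arrival-aware structure finishes two units sooner than the balanced tree, despite having far greater worst-case logic depth.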
Re: 100MHz TTL 6502: Here we go!
BigEd wrote:
I suspect even in more sophisticated ALUs, the LSB results come sooner. So it might be an advantage to structure the Z logic to take that into account, with the MSB bits having a shallower logic depth.
(Not just logic depth: sometimes a many-input logic gate will in practice react faster to some inputs than others, faster to one sense of transition than the other. I don't think TTL spec sheets tend to show this, but the timing models used within chips do, AFAIR.)