100MHz TTL 6502: Here we go!
Re: 100MHz TTL 6502: Here we go!
ttlworks wrote:
Maybe not relevant for this project:
Die Devices sells bare chip dies, 74AUC2G53 is on their list.
Würth Elektronik is able to bond bare chip dies to PCBs.
Btw, I like the 74AUC2G53 as a replacement for a 74'151 equivalent. Thanks.
Looks like in the right hands, this 74AUC2G53 is a real powerhouse.
C74-6502 Website: https://c74project.com
Re: 100MHz TTL 6502: Here we go!
Windfall wrote:
Do they offer a free, teeny-tiny soldering iron when they ship their dies?
To me, it looks like sort of an "if you have to ask for the price, you know you can't afford it" business.
Drass wrote:
That's so cool! Maybe we can create our own SMD 74AUC283 on a little PCB with tiny castellations. 
BTW: Würth Elektronik is also able to integrate SMD components into PCBs.
Re: 100MHz TTL 6502: Here we go!
Drass wrote:
Looks like in the right hands, this 74AUC2G53 is a real powerhouse. 
It's a pity these things do not come in larger packages, such as quad 2:1 (SPDT) switches with independent control inputs, as is the case for 1:1 switches. Not even dual 2:1 switches seem to be available, as far as I know. However, there's always the possibility of pairing 74xx3125 with 74xx3126 quadruple 1:1 switches to get 2:1 functionality with a lower overall component count.
Re: 100MHz TTL 6502: Here we go!
joanlluch wrote:
Not even dual 2:1 switches seem to be available, as far as I know.
Re: 100MHz TTL 6502: Here we go!
ttlworks wrote:
joanlluch wrote:
Not even dual 2:1 switches seem to be available, as far as I know.
Re: 100MHz TTL 6502: Here we go!
joanlluch wrote:
[...] but not 2:1 fast switches with the same or a similar pin layout as, say, a 74HC4053.
-- Jeff
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: 100MHz TTL 6502: Here we go!
Let's now take a closer look at the pipeline in this design. The objective is to reduce the cycle-time while keeping the cycle-count fixed. The critical path in the CPU falls squarely on the ALU, and associated pre- and post-processing. Rather than cramming all this into one cycle, the basic strategy is to push pre-processing to the prior cycle, and post-processing to the next. This allows the ALU to have the whole cycle to itself, giving us the headroom we need to boost the clock-rate.
Pre-processing here refers to the work required to set up the inputs to the ALU with appropriate values. That seemingly innocuous task takes a surprising amount of time -- we have to fetch microcode, decode control signals, select source values and output-enable the appropriate registers. Post-processing, on the other hand, refers to updating the status flags and writing to the destination register. Rebalancing this workload around the ALU, we end up quite naturally with a four-stage pipeline, as follows:
We have FETCH, DECODE, EXECUTE and WRITEBACK -- the idea is to perform a roughly equal amount of work at each stage and then to pass the baton to the next. Along the way, we capture intermediate results in pipeline registers. Specifically, we have the Microinstruction Register (MIR) after the FETCH stage, we have ALUA, ALUB and ALUC registers at the ALU inputs and we have the R register at its output. The FTM (Flags To Modify) and RTM (Registers To Modify) registers direct the WRITEBACK stage regarding which flags and destination register to update. (More on the WRITEBACK stage below.)
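To make the stage flow concrete, here is a toy Python model of the four-stage arrangement. This is only a sketch of stage occupancy, not the project's hardware: the "microinstructions" are plain integers, and the four pipeline slots loosely stand in for the MIR, the DECODE outputs, the R register, and the WRITEBACK input.

```python
# Toy model of the FETCH/DECODE/EXECUTE/WRITEBACK flow described above.
# Only stage occupancy is modeled: one microinstruction enters per cycle,
# each takes four cycles to retire, and steady-state throughput is one
# retirement per cycle.
def run(n_uops):
    pipe = [None, None, None, None]   # MIR, DECODE out, R, WRITEBACK in
    retired, next_uop, cycle = [], 0, 0
    while len(retired) < n_uops:
        cycle += 1
        incoming = next_uop if next_uop < n_uops else None
        if incoming is not None:
            next_uop += 1
        pipe = [incoming] + pipe[:3]  # every pipeline register advances
        if pipe[3] is not None:
            retired.append(pipe[3])   # WRITEBACK completes this uop
    return cycle

print(run(8))  # 11: three cycles of fill latency, then one uop per cycle
```

The point of the model is just that after a three-cycle fill, one microinstruction completes every cycle even though each individually takes four.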
Memory operations using "flow-through" synch RAMs are a good fit for this arrangement. A key feature of these RAMs is that we can clock an address into the RAM's internal registers then read the data value from its outputs before the next clock edge occurs. The ADL and ADH registers allow the pipeline to work in this same way with asynchronous peripherals. For writes, there is also the WE register and a Data Output Register (DOR).

As we've discussed before, the ALU features a "recirculate" path to allow the result to be fed back into its inputs. This is done during address calculation, for example, when the ALU result is immediately required in the next cycle. Memory reads are also recirculated, as either ALU operands or addresses to be used in the next cycle.

The WRITEBACK stage calculates the flags based on the ALU result, updates the P register according to the FTM, and writes the result to a destination register according to the RTM. One important thing to highlight is that the WRITEBACK stage writes to registers using a mid-cycle rising clock-edge (PHI2 rising edge). Meanwhile, registers are always sampled at the end of the cycle (PHI1 rising edge). This discipline ensures that we always get an up to date value when a given register is being read and written to in the same cycle. For example, the P register may be updated in the same cycle that a branch test is being executed. Delaying the branch test until the second half of the cycle ensures that the branch test evaluates correctly.

Beyond allowing enough time to calculate the flags, a separate WRITEBACK stage allows the R register to neatly buffer the ALU from the rest of the CPU's internal registers (and the added bus capacitance they would impose). There are over ten destinations for the ALU output, all of which would add unnecessary delay to the ALU's critical path were they connected directly (10 loads x 3pF per load x 50Ω + 6" trace delay = 2.5ns).
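The figure at the end of that calculation can be checked with a quick back-of-the-envelope estimate. Note the 170 ps/in trace propagation figure below is an assumed typical value for FR-4, not something stated in the post:

```python
# Back-of-the-envelope check of the loading estimate above:
# 10 loads x 3 pF into ~50 ohms, plus ~6 inches of trace.
loads = 10
c_per_load = 3e-12        # 3 pF input capacitance per load
r_drive = 50.0            # ~50 ohm driver/trace impedance
trace_inches = 6
tpd_per_inch = 170e-12    # ~170 ps/in on FR-4 (assumed typical value)

rc_delay = r_drive * loads * c_per_load       # 50 ohm * 30 pF = 1.5 ns
trace_delay = trace_inches * tpd_per_inch     # ~1.0 ns
total_ns = (rc_delay + trace_delay) * 1e9
print(round(total_ns, 1))  # 2.5, matching the post's ~2.5 ns estimate
```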
Finally, we should note that the DECODE stage must receive a fresh instruction every cycle in order for the pipeline to function smoothly. To begin with, FETCH retrieves a new opcode from main memory (or simply generates a BRK on a CPU reset) and feeds it to the DECODE stage via the Instruction Register (IR). Thereafter, FETCH will retrieve microinstructions associated with that opcode from the microcode store, one per cycle, and feed them to the DECODE stage via the Microinstruction Register (MIR). Once we reach the end of the current opcode, a new opcode is fetched and the sequence repeats. The DECODE stage, meanwhile, always delivers appropriate control signals for downstream pipeline stages, whether by decoding the opcode in one cycle or a microinstruction in another.
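That alternation -- one opcode into the IR, then its microinstructions into the MIR, one per cycle -- can be sketched as a simple generator. The opcode names and microinstruction lists here are invented for illustration; they are not the project's actual microcode:

```python
# Sketch of the FETCH stage's sequencing. Opcode names and microcode
# contents below are made up for illustration only.
MICROCODE = {
    "LDA_abs": ["fetch_adl", "fetch_adh", "read_mem_to_A"],
    "INX":     ["increment_X"],
}

def fetch_stream(program):
    """Yield one DECODE input (opcode or microinstruction) per cycle."""
    for opcode in program:
        yield ("IR", opcode)              # opcode -> Instruction Register
        for uop in MICROCODE[opcode]:
            yield ("MIR", uop)            # uop -> Microinstruction Register

print(list(fetch_stream(["INX", "LDA_abs"])))
```

Either way, DECODE sees exactly one item per cycle, which is what keeps the downstream stages fed.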
And that's it. We'll take a look at how this pipeline executes cycle-accurate 6502 instructions in a future post. For now, the main thing to note is that this is a relatively simple pipeline that still packs a punch in terms of performance. By way of comparison, the critical path on this pipeline is about 20ns long (50MHz) as compared to 50ns (20MHz) on the C74-6502 -- that's assuming similar components in both cases; i.e., AC logic for the ALU and CBT logic for tri-state buffers. The hope of course is that faster components and further optimizations (like the FET Switch Adder) will enable us to double the clock-rate yet again and reach the 100MHz milestone. Only time will tell whether we'll manage to get there.
Cheers for now,
Drass
P.S. Many thanks to Dr Jefyll for helping to clarify and edit this description. It is much better for it. Thanks Jeff!
Last edited by Drass on Tue Oct 27, 2020 1:32 pm, edited 1 time in total.
C74-6502 Website: https://c74project.com
Re: 100MHz TTL 6502: Here we go!
Hi Drass,
That's interesting.
I have a question on the "writeback" stage and the flow of the pipeline. You explain that by performing register writes at mid-cycle, then reading them at the end of the cycle, you avoid data hazards on the pipeline. However, I think this only works (possibly) because the 6502 uses two cycles anyway to complete instructions. So in fact you have a two-step gap between the fetch-decode-execute-writeback sequence of one cycle and the next. Or in other words, the next instruction fetch happens while the current cycle is in the execute stage, not while it is in the decode stage, as would be the case for a standard pipelined RISC processor. Is this right, or am I missing something fundamental here?
I mean, you have this:
Code: Select all
Fetch | Decode | Execute | Writeback
                  Fetch | Decode | Execute | Writeback
                                    Fetch | Decode | Execute | Writeback
                                                      Fetch | Decode | Execute | Writeback
As opposed to this:
Code: Select all
Fetch | Decode | Execute | Writeback
        Fetch | Decode | Execute | Writeback
                Fetch | Decode | Execute | Writeback
                        Fetch | Decode | Execute | Writeback
Thanks
Joan
Re: 100MHz TTL 6502: Here we go!
Drass wrote:
One important thing to highlight is that the WRITEBACK stage writes to registers using a mid-cycle rising clock-edge (PHI2 rising edge).
Meanwhile, registers are always sampled at the end of the cycle (PHI1 rising edge).
Drass, since the ALU input latches will be edge triggered 74AUC16374 chips or such...
have you considered building the registers with transparent latches like 74AUC16373 ?
When using transparent latches for the registers (transparent during PHI2=HIGH),
the register data inputs wouldn't have to be stable/valid before the rising edge of PHI2.
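The behavioural difference between the two parts can be sketched abstractly. This is just an illustration of the edge-triggered vs. transparent distinction, with timing reduced to discrete samples; real setup/hold margins are not modeled:

```python
# Behavioural contrast between an edge-triggered register ('374-style)
# and a transparent latch ('373-style). Timing is abstracted to discrete
# samples; real setup/hold windows are not modeled.
class EdgeFF:
    def __init__(self): self.q = None
    def rising_edge(self, d):
        self.q = d                 # D must already be valid at the edge

class TransparentLatch:
    def __init__(self): self.q = None
    def drive(self, d, gate):
        if gate:                   # transparent while the gate is high
            self.q = d             # late-arriving D still flows through

ff, latch = EdgeFF(), TransparentLatch()
ff.rising_edge("early")            # captures whatever was valid at the edge
latch.drive("early", gate=True)
latch.drive("late", gate=True)     # data settling after the edge still lands
print(ff.q, latch.q)               # early late
```

This is the crux of the suggestion: the latch only needs valid data before its gate closes, not before it opens.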
Re: 100MHz TTL 6502: Here we go!
ttlworks wrote:
Now this gives me quite a headache.
Drass, since the ALU input latches will be edge triggered 74AUC16374 chips or such...
have you considered building the registers with transparent latches like 74AUC16373 ?
When using transparent latches for the registers (transparent during PHI2=HIGH),
the register data inputs wouldn't have to be stable/valid before the rising edge of PHI2.
Re: 100MHz TTL 6502: Here we go!
On the face of it, computing Z should be no worse than a carry-chain problem, and indeed the inputs to the Z function arrive earliest from the LSB and latest from the MSB, so Z might only need to be a gate-delay behind C. (I say this, knowing that computing Z often does seem to be a time-consuming thing. So I'm interested in why the difference between theory and common practice.)
Re: 100MHz TTL 6502: Here we go!
Hi Ed,
As per my limited experience in processor design, essentially acquired from reading books and from my adventure of developing a processor architecture (on paper), I can say that your observation is only true in the context of ripple-carry ALUs. Indeed, the Z flag can be computed with the same delay as the carry flag by just combining each result bit with the previous ones as they ripple in.
However, once we apply carry-lookahead circuits or carry-skip strategies, the Z flag starts to add non-negligible time of its own. This is because, apparently, there's no way to "look ahead" for the Z flag: it must be computed from the result, so any delay in doing so adds directly to the total ALU delay. If the ALU data width is significant, say 16 or 32 bits, then implementing the Z flag circuit with standard 74xx ICs requires cascading up to 3 or 4 levels of gates.
Still, your comment seems quite fair from a conceptual point of view. So I too would be interested to know whether there really is no way to compute the Z flag ahead of (or in parallel with) the result.
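The ripple-carry case can be sketched directly: a "zero so far" signal can ride one gate behind the carry chain, so Z arrives roughly with C. This is an illustrative model of the idea, not the project's circuit:

```python
def ripple_add_with_z(a, b, width=8):
    """Ripple-carry add that computes Z alongside the carry chain."""
    carry, zero_so_far, result = 0, True, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (carry & (ai ^ bi))
        zero_so_far = zero_so_far and (s == 0)  # one gate behind the carry
        result |= s << i
    return result, carry, int(zero_so_far)

print(ripple_add_with_z(0xFF, 0x01))  # (0, 1, 1): result wraps to zero, Z set
```

With lookahead, by contrast, the sum bits arrive nearly together, so the zero-detect tree sits entirely after the adder, which is the extra delay described above.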
Re: 100MHz TTL 6502: Here we go!
I suspect even in more sophisticated ALUs, the LSB results come sooner. So it might be an advantage to structure the Z logic to take that into account, with the MSB bits having a shallower logic depth.
(Not just logic depth: sometimes a many-input logic gate will in practice react faster to some inputs than others, faster to one sense of transition than the other. I don't think TTL spec sheets tend to show this, but the timing models used within chips do, AFAIR.)
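One way to see the benefit is a toy arrival-time model: assume result bit i becomes valid at time i (LSB-first, ripple-style) and each 2-input OR costs one unit of delay. A chain ordered LSB-first, so the late-arriving MSB passes through only one gate level, then beats a balanced tree. The unit delays here are arbitrary assumptions, purely to illustrate the point:

```python
# Toy arrival-time model: bit i of the result is valid at time i, and
# each 2-input OR adds one delay unit. Unit delays are arbitrary.
def combine_linear(arrivals):
    # LSB-first chain: each stage waits for its later input
    t = arrivals[0]
    for a in arrivals[1:]:
        t = max(t, a) + 1
    return t

def combine_balanced(arrivals):
    # balanced tree, ignoring arrival order (width: power of two)
    while len(arrivals) > 1:
        arrivals = [max(arrivals[i], arrivals[i + 1]) + 1
                    for i in range(0, len(arrivals), 2)]
    return arrivals[0]

bits = list(range(8))               # bit i valid at time i (LSB first)
print(combine_linear(bits))         # 8: tracks the arrivals closely
print(combine_balanced(bits))       # 10: balanced tree finishes later
```

So under these assumptions the arrival-aware structure finishes two units sooner than the balanced tree, despite having far greater worst-case logic depth.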
Re: 100MHz TTL 6502: Here we go!
BigEd wrote:
I suspect even in more sophisticated ALUs, the LSB results come sooner. So it might be an advantage to structure the Z logic to take that into account, with the MSB bits having a shallower logic depth.
(Not just logic depth: sometimes a many-input logic gate will in practice react faster to some inputs than others, faster to one sense of transition than the other. I don't think TTL spec sheets tend to show this, but the timing models used within chips do, AFAIR.)