All times are UTC




Posted: Thu Aug 01, 2019 1:19 pm
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
OK, now for the unavoidable proposal for building a non_microcoded 6502 instruction decoder\sequencer from multiplexers,
something we certainly won't build for this project.

Attachment: 6502_MuxDecoder_proposal.png


IR is the instruction register.

"data phase decoder" is a big lump of 8:1 multiplexers like in my previous posting.
It generates control signals for ALU, flags and registers for the mill in machine cycles where the ALU processes data.

"control decoder" is another lump of 8:1 multiplexers, together with a little lump of logic related to interrupts, flag testing etc.
It is responsible for what happens in all of the other machine cycles (addressing modes etc.).

Some more 8:1 multiplexers, controlled by the 3 Bit state machine counter (that's 8 states), then generate the control signals.
//This means one might have to make eight 16*16 OpCode tables, one for every state.
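A minimal Python sketch of that idea for a single control signal, assuming eight 256-entry truth tables (one per state, each drawn as a 16*16 OpCode grid). The names mux8, make_tables, control_signal and example_rule are made up for illustration, and the table contents are hypothetical, not taken from the schematic.

```python
def mux8(inputs, select):
    """8:1 multiplexer: pass the input chosen by the 3 Bit select value."""
    if len(inputs) != 8 or not 0 <= select <= 7:
        raise ValueError("mux8 needs 8 inputs and a select value of 0..7")
    return inputs[select]


def make_tables(rule):
    """Build eight 256-entry tables, one per state (each a 16*16 OpCode grid)."""
    return [[int(bool(rule(state, opcode))) for opcode in range(256)]
            for state in range(8)]


def control_signal(tables, opcode, state):
    """One control signal: IR picks a bit from every table, the state picks one of them."""
    per_state_bits = [tables[s][opcode] for s in range(8)]
    return mux8(per_state_bits, state)


# Hypothetical rule, for illustration only: the signal is active in state 0
# for every OpCode, and in state 3 for OpCode $B9 (LDA abs,Y).
def example_rule(state, opcode):
    return state == 0 or (state == 3 and opcode == 0xB9)


TABLES = make_tables(example_rule)
```

Every control signal would need its own set of tables, which is where the chip count comes from.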

The 2:1 switch which routes either these control signals or the control signals from the "data phase decoder"
to the mill is just there to illustrate the concept:
At implementation level, one could use a 7404 inverter plus 8:1 multiplexers with three_state outputs like 74251 or 74CBT(LV)3251
to get rid of that switch.

Some more clues might be hiding up in the thread.

;---

When trying to take this approach, better expect to end up with >150 TTL chips just for the instruction decoder\sequencer.

Because the select inputs of all these multiplexers are fed by the instruction register,
we are going to have a lot of capacitance, even when creatively buffering the instruction register outputs.

So a design like that would be too slow for the project (it probably won't be getting past 100MHz).

An instruction decoder\sequencer built from 8ns asynch SRAM containing microcode would have about the same speed,
while being a lot smaller, cheaper, more reliable, and simpler to modify (bug fixes).


Posted: Mon Aug 05, 2019 7:16 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
A long time ago, my driving instructor told me a story from the days when he was young.

He got his hands on a used Porsche, but it had no motor, and the only motor he could get hold of came from an old VW beetle.
Eventually he got tired of people sneering at him because of that lawnmower sound, and sold that Porsche with integrated VW beetle motor.

Some years later, he had an old VW beetle, but got his hands on a Porsche motor, and then he wanted to know.
It took him 10 attempts (modifications of the car) to get approval from the authorities for using his car on public roads:
the car needed a different (lower) center of gravity (adding a lead plate), different brakes, different (bigger) tires,
modified fenders for still being able to steer with the bigger tires, maybe a different gear box, and so on.
But he said that driving on the highway with this contraption was fun.

;---

What does this have to do with building a faster TTL CPU ?

When making your CPU faster, you need to have faster main memory too.
Else, the fun would be limited.

Considering a C64 address decoding without cartridges and without VIC-II bus access
(we would have to integrate 16MB RAM and some address decoding inside the VIC-II too, of course),
my proposal for a 4ns address decoder would look like this:

Attachment: c64_adr_dec.png


Typical propagation delays:
TI 74CBTLV3253: select to Q 3.25ns typ., data in to Q 0.25ns max.
TI 74CBTLV3251: select to Q 2.3ns typ., data in to Q 0.25ns max.

So when using 74CBTLV3253 and 74CBTLV3251, the decoder is supposed to have a ca. 4ns propagation delay.
Building it with only 74CBTLV3251 instead would cut it down to ca. 3ns, but maybe this would take too many chips.
Another approach would be buying chips from Potatosemi; in theory this might cut the decoding time down to 2ns.
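For reference, the mapping such a decoder has to implement (C64 without cartridge, CPU reads only, no VIC-II access) can be sketched in Python as below. The function name and the returned strings are made up for illustration; loram, hiram and charen stand for the three 6510 I\O port bits that control the memory layout.

```python
def c64_select(addr, loram=1, hiram=1, charen=1):
    """Return what answers a CPU read at addr (no cartridge, no VIC-II access)."""
    addr &= 0xFFFF
    if addr <= 0x0001:
        return "6510 I/O port"          # on-chip port, never reaches memory
    if 0xA000 <= addr <= 0xBFFF and loram and hiram:
        return "BASIC ROM"
    if 0xE000 <= addr and hiram:
        return "KERNAL ROM"
    if 0xD000 <= addr <= 0xDFFF and (loram or hiram):
        return "I/O" if charen else "CHAR ROM"
    return "RAM"
```

Only A15..A12 and the three port bits take part in the decision, which is why the decoder can be built from a few multiplexers.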

Looking forward to seeing which 16MB memory layout and which memory chips Drass decides on later.


Posted: Mon Aug 05, 2019 11:01 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
If address decoding is too slow, the usual approach is to add waitstates to memory reads\writes.

But in Drass's most recent CPU block diagram, placement of the ADL\ADH latches looks interesting:

Attachment: drass_cpu_rev_b.png


It implies that there is a possibility of breaking the address decoder into two parts:

Attachment: pad_concept.png


Which means: making the address decoder (light blue) pipelined. (Note the red line.)

;---

For instance, if we checked whether A23..A16 are zero at the LEFT (input) side of the ADL\ADH\ADX latches,
this would save us some time on the 16MB address decoding, at the cost of adding a waitstate cycle to K24 at maybe 30MHz+ PHI2
for instructions which load ADX directly from the data bus.

//That's because RAM timing plus A23..A16 decoding time would add up.

Since the K24 instructions supporting 24 Bit addressing are not cycle compatible with the 65816,
maybe that wouldn't be much of a problem.

;---

Putting the A15..A8 decoding on the left side of ADL\ADH would have two disadvantages:

At ca. 30MHz+, we would have to add a waitstate to 6502 instructions which load ADH directly from the data bus:
JMP, JSR, addressing modes like ABS, ABSX, ABSY etc...

The other disadvantage is that the 6510 I\O port signals which control the memory mapping arrive one cycle too late.
This means that for C64 address decoding on the left side of ADH, we might have to add a waitstate cycle after 6510 I\O port writes...
...which might create some other problems.

;---

A7..A0 decoding at the left side of ADL\ADH is a thing we should try to avoid,
because at 30MHz+ we would need to add a waitstate cycle for zero page addressing modes, too.


Posted: Mon Aug 05, 2019 12:50 pm
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
If you can't make memory faster, the second option would be making memory wider,
which means having a 16, 32 or 64 Bit data bus instead of an 8 Bit data bus.

But when reading more than one Byte at a time from program memory,
we should give some more thought to how well this goes together with the C64 memory decoding.
Like:
What might happen if one creatively changes the memory layout by a 6510 I\O port write
shortly before PC rolls over from $9FFF to $A000, $BFFF to $C000 or $DFFF to $E000 ?

Once we have a 32 or 64 Bit read from program memory completed,
the question of how to make efficient use of it could still provide enough fuel for running a neat academic career.

Attachment: svc.png

;---

Now some technical references to past systems with a data bus wider than 8 Bit,
just in case we need them later for starting a 6502 related discussion.

VAX 11/780 had a 32 Bit data bus and an 8 Bytes instruction buffer.
VAX 11/780 maintenance handbook, page 22 of the PDF, page 16 of the document (before it was scanned).
More details about the VAX11\780 instruction buffer starting at page 166 of this PDF.

VAX11/730 had a 32 Bit data bus, but only a 4 Bytes "prefetch register".
Technical description, PDF page 120.

;...

MC68000 has a 16 Bit data bus.
Instructions are multiples of 16 Bit words, but data could be 1..4 Bytes anywhere in memory.
When there is an active bus cycle, the MC68000 considers it completed when the /DTACK input signal goes low
(in contrast to this, 6502 considers a bus cycle not complete when RDY is low).
MC68000 has the output signals /UDS and /LDS, which indicate which Byte of the 16 Bit data bus is active.

MC68020 user's manual, page 53 of the PDF: Bus operation.
MC68020 (32 Bit data bus) doesn't have the /DTACK signal; instead it has the /DSACK0 and /DSACK1 input signals.
If /DSACK0 and /DSACK1 are both high, the CPU doesn't consider the bus cycle as completed.
If /DSACK0 and/or /DSACK1 goes low, this tells the CPU that the bus cycle is completed...
...and whether the data bus was 8, 16 or 32 Bits wide in that bus cycle.
Unfortunately, the 68020 lacks /UDS and /LDS, too.
The end user has to generate the enable signals for the 4 Bytes of the bus with a GAL or such
from the output signals A0,A1 and SIZ0,SIZ1.
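What such a GAL has to compute for a 32 Bit wide port can be sketched in Python as follows. This is only an illustration of the decoding described above; the function names and the lane numbering (lane 0 = D31..D24, lane 3 = D7..D0) are choices made for this sketch.

```python
def mc68020_byte_enables(a1a0, siz):
    """Byte enables (lane 0..3) for one bus cycle on a 32 Bit port.

    a1a0: address bits A1,A0 as a number 0..3.
    siz:  SIZ1,SIZ0 as a number 0..3
          (1 = Byte, 2 = word, 3 = three Bytes, 0 = long word).
    """
    if not (0 <= a1a0 <= 3 and 0 <= siz <= 3):
        raise ValueError("a1a0 and siz must be in 0..3")
    size = siz if siz else 4
    end = min(a1a0 + size, 4)      # whatever does not fit needs another cycle
    return tuple(int(a1a0 <= lane < end) for lane in range(4))


def dsack_meaning(dsack1, dsack0):
    """Decode the two active low inputs (1 = high, 0 = low)."""
    return {(1, 1): "not completed, insert waitstates",
            (1, 0): "completed, 8 Bit port",
            (0, 1): "completed, 16 Bit port",
            (0, 0): "completed, 32 Bit port"}[(dsack1, dsack0)]
```

For example, a word access at an address ending in binary 11 only enables lane 3; the CPU fetches the second Byte in a following bus cycle.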

;...

The PowerPC 60x bus interface featured a 64 Bit data bus, page 82 of the PDF.


Posted: Mon Aug 05, 2019 6:56 pm
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Eventually you need to implement a cache, because your big memory and its decoder can't keep up with the CPU any more. So your fast logic checks for a cache hit and selects the correct cache line and byte, and if not, implements a wait-state so that slow logic can go out to the main memory, ROM, or I/O. You could use a 3-way direct-mapped cache with one entry for program accesses (read only), another for bank zero data accesses (read & write), and a third for data in other banks (read & write); you then don't need content-addressable memory, only an equality comparator on the line tag. I/O addresses should always miss the cache and not allocate into the cache.
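A rough Python model of that three-entry arrangement, for reads only. The line size of 16 Bytes, the single waitstate per miss and all names are assumptions made for this sketch; write handling for the two data entries is left out.

```python
LINE_BYTES = 16  # assumption: the post does not give a line size


class OneLineCache:
    """A single cache line with an equality-compared tag."""

    def __init__(self):
        self.tag = None
        self.data = None

    def lookup(self, addr):
        """Return the cached Byte, or None on a miss."""
        if self.tag == addr // LINE_BYTES:
            return self.data[addr % LINE_BYTES]
        return None

    def fill(self, addr, memory):
        """Slow path: copy the whole line from main memory."""
        base = addr - addr % LINE_BYTES
        self.tag = addr // LINE_BYTES
        self.data = bytes(memory[base:base + LINE_BYTES])


def make_caches():
    return {"program": OneLineCache(), "bank0": OneLineCache(),
            "other": OneLineCache()}


def read(caches, memory, addr, is_fetch=False, is_io=lambda a: False):
    """Return (byte, waitstates) for one CPU read."""
    if is_io(addr):
        return memory[addr], 1          # I/O always misses, never allocates
    if is_fetch:
        kind = "program"
    elif addr < 0x10000:
        kind = "bank0"
    else:
        kind = "other"
    line = caches[kind]
    value = line.lookup(addr)
    if value is not None:
        return value, 0                 # hit: full speed
    line.fill(addr, memory)
    return line.lookup(addr), 1         # miss: slow logic goes to main memory
```

The only comparison needed per access is one tag equality check on the entry selected by the access type.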

Another good strategy, before you get to that point, is to prioritise distinguishing fast RAM from everything else, and selecting the correct fast RAM chip on the assumption that this is what's wanted. If the address is not to fast RAM, insert a wait-state and take advantage of the more lenient timings for a complex decode.


Posted: Mon Aug 05, 2019 7:54 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10834
Location: England
A code buffer can be useful, even before you build a cache, because it frees up memory bandwidth and can help short loops directly.


Posted: Tue Aug 06, 2019 3:02 am
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Or just a 256KB area (say) designated as "fast RAM", and the rest with wait-states. It's then up to you to put hot code and data in the right place.


Posted: Tue Aug 06, 2019 6:35 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
Chromatix wrote:
Eventually you need to implement a cache,

The problem here is that solutions which worked well on a chip don't translate well to TTL.

Building a cache won't get us far:
The 8 Bit comparators available (74HCT688, 74F521) are way too slow, massive use of 74LVC1G86 also won't cut it.
Cache memory built from TTL latches would be big; long PCB traces and a lot of capacitance would kill the speed.
This means we would end up using the same SRAMs in the cache as in main memory.
When considering the edge cases of C64 memory decoding and things like self_modifying code,
the control circuitry for a cache would be a lot of circuitry.

So main memory SRAM plus address decoder would be faster than a (hypothetical) cache built from the same SRAMs
plus cache control logic (which would be bigger than the address decoder). :lol:

To quote Seymour Cray: "You can't fake a bandwidth that isn't there."

Chromatix wrote:
because your big memory and its decoder can't keep up with the CPU any more.

That's why I came up with the concept of integrating the address decoder into the CPU and pipelining some part of it.
A fully pipelined address decoder would require adding a waitstate for data access in some (maybe all)
of the addressing modes at ca. 30MHz+ PHI2 (depending on how much of the decoder is pipelined),
but IMHO program memory reads from [PC+] and stack reads/writes still would be running at full speed.

Chromatix wrote:
Or just a 256KB area (say) designated as "fast RAM", and the rest with wait-states. It's then up to you to put hot code and data in the right place.

At some point, it became obvious that the usual 8ns 512kB asynchronous SRAM won't cut it when trying to approach 100MHz PHI2.
So we started to consider using synchronous SRAM instead, and after an extensive web search,
we think that the GS8640Z18T-250 is quite the thing.

From the timing diagrams on page 20 of the PDF, it is able to do a random access within a single cycle
in NBT flow through mode.
TQFP100 package, which means one still has a chance to solder that chip by hand. 3.3V or 2.5V supply.
4M*18, which means that when putting two opposing chips on a PCB, one on top and the other at the bottom,
that's 16MB in total, the full K24 or 65816 addressing range.

The address lines of the synchronous SRAMs would be tied to the INPUTS of the ADL\ADH\ADX latches,
in other words: the idea is to integrate the whole 16MB of RAM into the CPU (together with the address decoder).
(Note that the GS8640Z SRAM has three chip enable inputs, pretty much like the 74138.)

Using the 250MHz version of the SRAM might sound like overkill, but it's cheaper than the 167MHz version,
and in case the pipelined address decoder idea doesn't work out well this still might give us the possibility
of running the SRAM at twice the CPU clock for having some nanoseconds left for address decoding.

At the moment (August 2019), one GS8640Z18GT-250I chip (which makes 8MB in total) costs ca. 70€ at Mouser.
But hey, one 8ns 512kB asynch SRAM chip costs 4.24€ when ordering 25+ chips, that's ca. 68€ for 8MB, so what.
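The capacity and price figures can be checked with a few lines of Python. The sketch assumes that only 16 of the 18 data bits per word are used (two Bytes per word); the function names are made up for this check.

```python
def sync_sram_bytes(chips, words=4 * 2**20, used_bits=16):
    """Usable Bytes when only 16 of the 18 data bits per word are used (assumption)."""
    return chips * words * used_bits // 8


def async_sram_cost(total_bytes, chip_bytes=512 * 1024, price_eur=4.24):
    """Chip count and price for the same capacity built from 512kB asynch SRAMs."""
    chips = total_bytes // chip_bytes
    return chips, round(chips * price_eur, 2)


# Two 4M*18 chips cover the full 24 Bit address range.
assert sync_sram_bytes(2) == 2**24
```

For 8MB, async_sram_cost(8 * 2**20) gives 16 chips and 67.84€, which is the "ca. 68€" above.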

Edit: there might be bigger synchronous SRAM chips, but they tend to have BGA package and\or 1.8V supply or such.

BigEd wrote:
A code buffer can be useful, even before you build a cache, because it frees up memory bandwidth and can help short loops directly.

Yes, I wanted to post something about building an instruction buffer later, that's why I had put some VAX 11\780 material in my previous post. :)
VAX 11\780 features a 32 Bit data bus and an 8 Bytes instruction buffer, implemented with TTL chips.


Posted: Tue Aug 06, 2019 8:29 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10834
Location: England
(might be worth noting that Arlet's small and fast HDL 6502 also is built to work with synchronous RAM: the address outputs leave just before the clock.)


Posted: Tue Aug 06, 2019 1:03 pm
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
Arlet sure has a nice collection of interesting projects.

Unfortunately, I was "almost" away from the forum from November 2018 to June 2019 for job related reasons,
so I missed Arlet's thread "Yet another TTL 6502". :roll:


Posted: Tue Aug 06, 2019 1:19 pm
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
BigEd wrote:
(might be worth noting that Arlet's small and fast HDL 6502 also is built to work with synchronous RAM: the address outputs leave just before the clock.)
That’s the case here as well BigEd. The CPU pipeline produces addresses in advance of the cycle in which they are required. For asynchronous devices, the address is clocked into an internal address register (ADL/ADH) and is available on the bus for the following cycle. Synchronous RAM connects directly to the ADH/ADL bus before the address registers and thereby receives the address just prior to the clock-edge. The same is true of the R/W signal. In this respect, the design can accommodate synchronous and asynchronous devices at the same time.

_________________
C74-6502 Website: https://c74project.com


Posted: Tue Aug 06, 2019 1:53 pm
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
32 Bit data bus and 8 Bytes Instruction buffer (IB), here we go:

Attachment: 6502_instruction_buffer.png

Notation, to avoid any confusion related to 'little endian' and 'big endian':
Shift IB '-3' means shifting IB 3 Bytes toward IB0.
Shift IB '3' or '+3' means shifting IB 3 Bytes toward IB7.

Edit:
NMOS 6502 and 65C02 instruction size is 1..3 Bytes, and shifting IB -3..0 Bytes would do.
But for 65816 and K24, instruction size is 1..4 Bytes, take care.

The box labeled "merge" alternatively could be filled with OR gates.

;---

As an example, we take a short piece of code that copies 256 Bytes in memory,
from $0400..$04FF to $0500..$05FF.

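Code:
     LDY #$00      ;Y = 0, DEY wraps it to $FF after the first pass
foo: LDA $0400,Y   ;read one Byte from the source page
     STA $0500,Y   ;write it to the destination page
     DEY           ;next index, counting down
     BNE foo       ;repeat until Y is zero again (256 passes)
     RTS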


Code:
Program memory contents at $0100:
A0.00.B9.00.04.99.00.05.88.D0.F7.60

     .        .        .  .     .
LDY #$00      .        .  .     .
foo: .LDA $0400,Y
              .STA $0500,Y
                       .DEY
                          .BNE foo
                                .RTS


Instruction buffer activity, explained slowly and step by step. The CPU actually might try to do some of the steps within one machine cycle.
red: execute an instruction
blue: read 32 Bit from program memory into the Instruction Buffer (short form: buffer).
green: shift or flush buffer

IB: Instruction Buffer
_0._1._2._3._4._5._6._7
xx.xx.xx.xx.xx.xx.xx.xx //No valid Bytes in the buffer. Need to read in 32 Bit.
A0.00.B9.00.XX.XX.XX.XX //Read in 32 Bit from program memory.
A0.00.B9.00.xx.xx.xx.xx //Execute A0 00 //LDY #$00
B9.00.xx.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
B9.00.xx.xx.xx.xx.xx.xx //B9 is a three Bytes instruction, two valid Bytes in the buffer. Need to read in another 32 Bit.
B9.00.04.99.00.05.xx.xx //Read in 32 Bit from program memory.
B9.00.04.99.00.05.xx.xx //Execute B9 00 04 //LDA $0400,Y
99.00.05.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
99.00.05.xx.xx.xx.xx.xx //Execute 99 00 05 //STA $0500,Y
xx.xx.xx.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
xx.xx.xx.xx.xx.xx.xx.xx //No valid Bytes in the buffer. Need to read in 32 Bit.
88.D0.F7.60.xx.xx.xx.xx //Read in 32 Bit from program memory.
88.D0.F7.60.xx.xx.xx.xx //Execute 88 //DEY
D0.F7.60.xx.xx.xx.xx.xx //Shift the buffer for removing the instruction which was executed.
D0.F7.60.xx.xx.xx.xx.xx //Execute D0 F7 //BNE foo. Conditional branch taken.
xx.xx.xx.xx.xx.xx.xx.xx //Change in program memory flow, flush buffer. //the rest in the buffer is not executed.
B9.00.04.99.xx.xx.xx.xx //and from here, the loop repeats until Y contains zero.


Some observations:

0) IB0 contains the instruction Byte.
1) For 2 Byte instructions, IB1 contains either 8 Bit immediate data or the lower Byte of a 16 Bit address.
2) For 3 Byte instructions, IB2 contains the higher Byte of a 16 Bit address //screaming "build a 16 Bit ALU for address calculation into the CPU"

3) If an instruction was executed, we need to remove the related Bytes from the Instruction buffer (by shifting the buffer toward IB0).
4) If there was a change in program flow (JMP,JSR,RTS,RTI, Bxx true), we need to flush the Instruction Buffer.

5) We need to keep track of how many Bytes in the buffer are valid.
6) If the Instruction Buffer is empty, we need to read 32 Bit from program memory (into IB0..IB3).
7) If there are not enough valid Bytes for executing an instruction, we need to read 32 Bit from program memory.

8) We would have to shift the 32 Bit (4 Bytes) read from program memory by -3..+7 Bytes before merging them with the buffer, to make efficient use of the buffer.
9) For a 32 Bit data bus, the address from which the 4 Bytes are read usually isn't PC, but [PC AND (NOT 3)] //A0 and A1 are zero.
10) It might become necessary to read 4 Bytes from [PC AND (NOT 3) +4].

11) MC68010 featured a "loop mode" for two instruction loops.
When changing the Instruction Buffer size from 8 Bytes (IB0..IB7) to 16 Bytes (IB-7..IB8), being able to shift the buffer Bytewise
in both directions, and pulling one or two little tricks, the loop in our example could repeat without the need for reading
Bytes into the buffer from program memory.
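Observations 3) to 9) can be played through with a small Python model of the example program. This is a sketch under simplifying assumptions, not a description of the proposed hardware: the buffer only holds the valid Bytes, reads are aligned as in observation 9), data accesses of LDA\STA are not modeled, and all names are made up for the sketch.

```python
PROGRAM_BASE = 0x0100
PROGRAM = bytes.fromhex("A000B9000499000588D0F760")
# Instruction sizes in Bytes, only for the OpCodes used in the example.
SIZE = {0xA0: 2, 0xB9: 3, 0x99: 3, 0x88: 1, 0xD0: 2, 0x60: 1}


def read32(addr):
    """Observation 9: read 4 Bytes from [addr AND (NOT 3)]."""
    offset = (addr & ~3) - PROGRAM_BASE
    return PROGRAM[offset:offset + 4]


def run_example():
    """Run the copy loop until RTS, return the number of 32 Bit reads."""
    pc, y, reads = PROGRAM_BASE, 0, 0
    ib = bytearray()                    # valid Bytes only, ib[0] is IB0
    while True:
        # Observations 6 and 7: read until the whole instruction is valid.
        while not ib or len(ib) < SIZE[ib[0]]:
            fetch = pc + len(ib)        # address of the first missing Byte
            ib += read32(fetch)[fetch & 3:]   # drop Bytes below that address
            reads += 1
        size = SIZE[ib[0]]
        opcode = ib[0]
        operand = ib[1] if size > 1 else None
        del ib[:size]                   # observation 3: shift toward IB0
        pc += size
        if opcode == 0xA0:              # LDY #imm
            y = operand
        elif opcode == 0x88:            # DEY
            y = (y - 1) & 0xFF
        elif opcode == 0xD0 and y != 0: # BNE taken (Z comes from DEY here)
            pc += (operand - 256) if operand >= 0x80 else operand
            ib.clear()                  # observation 4: flush the buffer
        elif opcode == 0x60:            # RTS ends the example
            return reads
```

run_example() returns 768, that is three 32 Bit reads per pass through the loop, compared to nine OpCode and operand Bytes per pass on an 8 Bit bus.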


For "shifting", we would need something like a barrel shifter working at Byte level.
The barrel shifter could be implemented either by using 74CBTLV multiplexers (which won't simplify PCB layout) or by using a lattice of 74CBTLV3245 bus switches (which won't simplify generating the control signals).
When sacrificing a little bit of speed, the chip count for implementing a barrel shifter could be reduced (replacing one layer of 74CBTLV3251 8:1 multiplexers by three layers of 74CBTLV3257 2:1 multiplexers).

For "merging", I think it would be something like 74CBTLV3257 2:1 multiplexers working at Byte level... or a lot of OR gates.


Spending any thoughts on:
How to generate the 32 Bit program memory read address,
How many and which control signals are needed for implementing all this,
How to generate said control signals...
That's a thing that could be done later, after we have:
Some more spare time at our hands,
A decision if we really should go for a 32 Bit data bus and an 8 Bytes Instruction Buffer,
And a clear definition of what a (hypothetical) 32 Bit 6502 bus interface is supposed to look like.

OT: started to wonder, if anybody at TI still remembers the SN74AS897 barrel shifter.
BTW: using an Instruction Buffer and staying cycle compatible to the 6502 might be two different pairs of shoes.
...but now I need a break before "going superscalar".


Posted: Wed Aug 07, 2019 4:35 am
Joined: Thu Mar 10, 2016 4:33 am
Posts: 170
I just had a quick thought about a way to build the instruction buffer that could be simpler: what if we had an 8 byte circular buffer that we could load into either the low word or the high word, coupled with an instruction pointer into the buffer? This would seem to be easier to implement.

So a reworked example would be (where . is the instruction pointer):

Code:
IB: Instruction Buffer
_0._1._2._3._4._5._6._7
xx.xx.xx.xx.xx.xx.xx.xx //No valid Bytes in the buffer. Need to read in 32 Bit.
A0.00.B9.00.XX.XX.XX.XX //Read in 32 Bit from program memory.
.
A0.00.B9.00.xx.xx.xx.xx //Execute A0 00 //LDY #$00
      .
A0.00.B9.00.04.99.00.05 //Read in next 32 bit value (because IP is 1..3). This should be done in parallel with instruction execution.
      .
A0.00.B9.00.04.99.00.05 //Execute B9 00 04 //LDA $0400,Y
               .
88.D0.F7.60.04.99.00.05 //Read in 32 Bit from program memory (because IP is 5..7).
               .     
88.D0.F7.60.04.99.00.05 //Execute 99 00 05 //STA $0500,Y
.
88.D0.F7.60.xx.xx.xx.xx //Execute 88 //DEY
   .
88.D0.F7.60.B9.00.04.99 //Read in next 32 bit value (because IP is 1..3).
      .

The logic for this should be relatively simple although there could be quite a few ways to implement it. I was thinking that no shift registers would be required. The data bus would either write to the low word or the high word, and bytes would be read sequentially from the other side.
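One possible reading of this scheme as a Python sketch, for straight-line code only (branches are not handled, since the post does not say how they would refill the buffer). The class and method names are made up; the prefetch rule "load the other word while IP is 1..3 or 5..7" is taken from the trace above.

```python
class CircularIB:
    """8 Byte circular buffer, filled one 32 Bit word at a time."""

    def __init__(self, program):
        self.program = program        # word aligned, straight-line code
        self.buf = bytearray(8)
        self.valid = [False, False]   # [low word, high word]
        self.ip = 0                   # instruction pointer into the buffer
        self.next_word = 0            # offset of the next word in program
        self.reads = 0

    def _load(self, half):
        word = self.program[self.next_word:self.next_word + 4]
        self.buf[4 * half:4 * half + 4] = word.ljust(4, b"\x00")
        self.valid[half] = True
        self.next_word += 4
        self.reads += 1

    def _next_byte(self):
        half = self.ip // 4
        if not self.valid[half]:
            self._load(half)          # buffer ran dry: the CPU has to wait
        value = self.buf[self.ip]
        self.ip = (self.ip + 1) % 8
        if self.ip % 4 == 0:
            self.valid[half] = False  # word used up, free for the next load
        return value

    def _prefetch(self):
        if self.ip % 4 == 0:
            return                    # only while IP is 1..3 or 5..7
        other = 1 - self.ip // 4
        if not self.valid[other]:
            self._load(other)

    def fetch(self, size):
        """Return the next instruction (size Bytes), then prefetch."""
        data = bytes(self._next_byte() for _ in range(size))
        self._prefetch()
        return data
```

Feeding it the example program and calling fetch(2), fetch(3), fetch(3) reproduces the buffer contents of the trace with three 32 Bit reads; no Bytes are ever moved inside the buffer.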

This would also generalise to a larger buffer, and from there could be generalised into a cache (with a lot more chips).


Posted: Wed Aug 07, 2019 6:47 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
Hi jds, thanks for joining the discussion.

I'm expecting something like >100MHz PHI2 with a 32 Bit (4 Bytes) data bus, and that's a throughput of >400MB/sec. peak.

When using an 8 Bit circular buffer to handle this, I think the buffer would have to be able to operate at >400MHz.

The fastest latch from TI appears to be the SN74ALVTH16374, and it only makes 250MHz max.

The Potatosemi PO74G374A makes 600MHz, but with a 2.4ns propagation delay and a 15pF load at the output.
For more than 15pF of load capacitance, we could expect the speed to drop dramatically.
When attaching some circuitry to a latch, expect IC input pins to have 4pF of capacitance, and three_state IC output pins to have 6pF.
Wiring the outputs of 8 latches together, plus PCB traces, plus some circuitry reading the latch outputs gives you a >50pF capacitive load pretty fast.
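The two back-of-the-envelope numbers above, as Python. The per-pin values are the ones quoted in the text; counting all 8 three_state outputs plus a single reading input on one bus line is an assumption of this sketch, and PCB traces come on top of it.

```python
def byte_rate_mhz(phi2_mhz, bus_bytes):
    """Clock rate a Byte-wide buffer needs to keep up with a wider bus."""
    return phi2_mhz * bus_bytes


def load_capacitance_pf(outputs, inputs, trace_pf=0.0,
                        c_out_pf=6.0, c_in_pf=4.0):
    """Capacitive load on one bus line, from per-pin estimates."""
    return outputs * c_out_pf + inputs * c_in_pf + trace_pf


# 100MHz PHI2 with a 4 Bytes wide bus -> 400MHz for a Byte-wide buffer.
assert byte_rate_mhz(100, 4) == 400
# 8 latch outputs and one reading input, traces not counted: 52pF.
assert load_capacitance_pf(8, 1) == 52.0
```

That is already more than three times the 15pF the 600MHz figure is specified for.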

8 Bit ECL\LVPECL latches have become pretty exotic by now, main memory RAM and microcode RAM would be TTL,
and TTL to ECL\LVPECL (and back) level converters don't tend to be too fast.


It would be hard to build an 8 Bit circular buffer working at >400MHz with the chips we are able to buy.
Building at least 4 circular 8 Bit buffers working in parallel and running at >100MHz might be an option,
but I'm starting to think that this would require more chips than having an 8 Bytes IB latch with two Bytewise barrel shifters attached to it.

I'm afraid that building a >100MHz cache by using 8 Bit comparators with a >10ns propagation delay won't work out well, sorry.


Posted: Wed Aug 07, 2019 11:05 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1397
On going "superscalar":

Over at anycpu.org, there is a thread about "Readings on high-performance CPU designs", and half of the things in my list were already mentioned there,
including "A Superscalar Out-of-Order x86 Soft Processor for FPGA" (Thesis by Henry Ting-Hei Wong, PDF, 275 pages);
the picture on page 66 might give you an idea of what to do with the Instruction Buffer I had mentioned in my previous posts.

In 1961, IBM had released the 7030 Stretch, featuring some neat technical tricks, more details here.
In 1964, the IBM 7030 was dethroned as the world's fastest computer by the CDC 6600.
IBM president Thomas Watson took this personally and started what became the super secret ACS project (which never went into production).

Lynn Conway had contributed to the ACS project.
Her paper from 1966 about DIS, "dynamic instruction scheduling" is an interesting read (related to OOO, out of order execution).
//U.S. Patent 3,718,912 looks interesting, too.

Digging through old IBM Research & Development Journals for some clues also might be an idea.

The old Metaflow patents are good for giving you a headache, and they don't translate well to TTL.

The principal LSI engineer behind the 65CE02 was Victor F. Andrade (the 4510 CPU in the C65 had a 65CE02 core).
From Secret Weapons of Commodore: After Commodore's demise, 4510 designer Victor Andrade went on to design the K7 for AMD.
Maybe we should take a look at the AMD K7 Athlon.

MC68060 block diagram looks nice, started to wonder if there were related Motorola patents.
Attachment: MC68060.png

//Would it make sense to take a look at the DEC Alpha 21064 and the AMD AM29050 ?

Russian Elbrus 2000, something about its architecture here, reminds me of the Transmeta Crusoe.
Started to wonder if "dynamic binary translation" of 6502 code to VLIW would be cheating... and what this would do to self_modifying code.

;---

Observations:

0) In the 6502 world, something that needs >10 clock cycles from fetching an instruction until executing it is considered "to have the agility of a supertanker".

1) Some of you probably had tried to write something like a BASIC interpreter.

That game when trying to evaluate arithmetic expressions like in "poke 1024 +40*(i/16) +i and 15, i".
It has to be broken into smaller parts, and they have to be evaluated in a specific order for getting the correct result.

When trying to take a "random" bunch of 6502 instructions, and trying to execute as much of it as possible in parallel,
the game feels somewhat similar, it's just bigger, more complicated, and you have to build it in hardware with the chips available.

2) Now to take a break, maybe it helps reducing that headache.

