6502.org Forum

All times are UTC




Posted: Fri Aug 09, 2019 10:08 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Mark Smotherman has quite a collection about historical computer designs on his homepage:

Instruction-Level Parallelism
Historical background for EPIC instruction set architectures

[Attachment: smotherman.png]


Posted: Fri Aug 09, 2019 10:47 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Great links, again!

BTW there's a sort of microcoded 6502 emulator which might be worth a look:
viewtopic.php?f=8&t=5722


Posted: Tue Aug 13, 2019 7:12 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Paul Gardner's GS4502B: An attempt to create a high-performance 4502 and 6502 compatible CPU.

Quote: "It would also have been possible to implement out-of-order execution to further increase IPC, however the logic to do so is notoriously large in area"

;---

Found some neat slides at ETH Zürich, Dept. of Information Technology and Electrical Engineering:
Out of order execution in a nutshell.

Hmm... going by gut feeling, I would now strongly recommend staying away from implementing something like "out of order execution".


Posted: Tue Aug 13, 2019 7:58 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Another nice slide set! Indeed, a superscalar machine is difficult to understand, difficult to reason about, difficult to build and difficult to verify... but of course it can be done.

Here are a couple more links from computer science lectures:
http://hpca23.cse.tamu.edu/taco/utsa-ww ... ture5.html
https://www.archive.ece.cmu.edu/~ece740 ... ecture.pdf


Posted: Tue Aug 13, 2019 8:24 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Some nice links\references.
BTW: there seems to be a little incompatibility in our definitions for "difficult" :lol:

;---

When taking a look at the instruction decoding in the Intel core microarchitecture (2006):

[Attachment: intel_core.png]

There are three "simple decoders", each generating one µop, and a "complex decoder" plus microcode generating up to four µops.
//In the Intel Nehalem microarchitecture (2008), there are three "simple decoders" and one "complex decoder" again.
//Looks like getting hands on good\detailed block diagrams for Intel cores released after the Nehalem isn't that easy.

Of course I haven't looked up how the X86 instruction set breaks down at binary level, maybe because it hurts my brain,
but I'm starting to remember that 8080\8085 instructions with a binary 01xxxxxx and 10xxxxxx pattern tend to be simple register/register operations,
so I suppose the "simple decoders" are mainly there for register/register operations, and the "complex decoder" for the rest.

From the hardware point of view, unfortunately the 6502 instruction set breaks down a bit differently...
...but in 6502 machine code, some instructions tend to show up "paired", like:

STA ABS,X\Y or STA Z,X\Y plus INX\DEX\INY\DEY.
INX\DEX\INY\DEY plus BNE.
CLC plus ADC.
SEC plus SBC.
CMP plus Bxx.
Also maybe some shift/rotate operations.

One complex instruction where execution takes several machine cycles, and one rather simple instruction that could be executed in one machine cycle.
//That's overly simplified because of how flag evaluation goes together with a 4 level pipeline or such.
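
To make the "paired instructions" idea a bit more concrete, here is a minimal Python sketch that scans a list of mnemonics for the pairs listed above. The names `is_pair` and `find_pairs` are made up for illustration, addressing modes are ignored (so any STA counts, not only the indexed forms), and the greedy left-to-right matching is just one possible policy, not a statement about how real hardware would pick pairs.

```python
# Hypothetical pairing rules, taken from the list of pairs in the post above.
INDEX_STEP = {"INX", "DEX", "INY", "DEY"}
BRANCHES = {"BNE", "BEQ", "BCC", "BCS", "BMI", "BPL", "BVC", "BVS"}

def is_pair(first, second):
    """Return True if two adjacent mnemonics match one of the listed pairs."""
    if first == "STA" and second in INDEX_STEP:
        return True
    if first in INDEX_STEP and second == "BNE":
        return True
    if (first, second) in {("CLC", "ADC"), ("SEC", "SBC")}:
        return True
    if first == "CMP" and second in BRANCHES:
        return True
    return False

def find_pairs(mnemonics):
    """Greedily scan a mnemonic list and return index pairs (i, i + 1)."""
    pairs = []
    i = 0
    while i < len(mnemonics) - 1:
        if is_pair(mnemonics[i], mnemonics[i + 1]):
            pairs.append((i, i + 1))
            i += 2  # an instruction takes part in at most one pair
        else:
            i += 1
    return pairs

# The copy loop LDA / STA / DEX / BNE as a mnemonic list.
print(find_pairs(["LDA", "STA", "DEX", "BNE"]))
```

With greedy matching, STA+DEX is taken first, which leaves the BNE unpaired; a different policy could prefer DEX+BNE instead.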


Posted: Tue Aug 13, 2019 9:19 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Just some more observations when it comes to assembly coding:

In the 8085\X86 world, you tend to travel with a big backpack (register set).
Re_packing the backpack during the journey takes some time\effort and isn't much fun,
so you try to have everything in your backpack you might be going to need during the journey
before the journey starts.

In the 6502 world, you travel light (without a big backpack),
but the trick is to miraculously have at hand what you are going to need
at the moment when you are going to need it during the journey.

;---

It would make sense trying to preemptively calculate a few things in the background when function blocks in the CPU are idle,
to have a chance to save a machine cycle here and there when executing the next instructions to come.

;---

In our old 20MHz TTL CPU, the ALU seems to be sitting idle in the cycle after an instruction fetch.

Calculating SP+1 there (if the previous instruction had changed SP) might save a cycle on PLA\PLX\PLY\PLP\RTS\RTI.
//Don't know how well this would fit together with the concept of the ADL\ADH\ADX latches.

Considering loops like
Code:
foo: LDA SRC,X
     STA DST,X
     DEX
     BNE foo
one could preemptively try to calculate the next X-1 (plus the flag evaluation) ahead for speeding up the next DEX.
Same thing for INX and X+1, and for DEY\INY.
//Don't know yet how well this would fit together with our new CPU architecture.

When using an Instruction buffer (and a wider data bus), all Bytes of an instruction might arrive at the same time,
so in theory with two 16 Bit adders one could preemptively calculate "d16+X" and "d16+Y" while the microcode decodes the instruction,
and maybe "truncate" the result for zero page addressing. //"d16" is a 16 Bit value from a three Byte instruction or so.
;
So we would have the EA (effective address) for ABS,X\ABS,Y\Z,X\Z,Y at hand in the next cycle.
//Don't know how well this would fit together with the concept of the ADL\ADH\ADX latches.
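
As a rough illustration of the "d16+X" / "d16+Y" idea, the following Python sketch computes all four indexed effective addresses in one go and lets the decoder pick the one it needs afterwards. The function name `precompute_ea` and the dictionary keys are only assumptions for this example; for the zero page modes only the low Byte of d16 is a real operand, which is why truncating the sum to 8 Bits gives the right wrap-around inside page 0.

```python
def precompute_ea(d16, x, y):
    """Speculatively compute every indexed effective address at once.

    d16  : 16 Bit value taken from the two Bytes following the opcode
    x, y : current 8 Bit index register values
    Models two 16 Bit adders running in parallel with the decode.
    """
    sum_x = (d16 + x) & 0xFFFF
    sum_y = (d16 + y) & 0xFFFF
    return {
        "ABS,X": sum_x,
        "ABS,Y": sum_y,
        # Zero page indexed wraps within page 0: keep only the low Byte.
        "Z,X": sum_x & 0x00FF,
        "Z,Y": sum_y & 0x00FF,
    }

ea = precompute_ea(0x12F0, x=0x20, y=0x01)
for mode, addr in ea.items():
    print(f"{mode}: ${addr:04X}")
```

Once the opcode is decoded, the microcode would simply select one of the four results (or none of them, if the instruction is not indexed).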


Posted: Tue Aug 13, 2019 9:31 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
TREX had the PC (R7), and the link register LR (R6).
When there was a JMP or JSR, TREX wrote the JMP\JSR address into LR,
swapped "contents" of PC and LR by just swapping the control signals
for PC and LR in the right moment.

Image

Now about (hypothetically) speeding up the 6502 instruction DEX:
One could have a register that contains X, another register that contains X-1
(X-1 is calculated in the "background" every time the value of X has changed),
and if X-1 is valid (calculated) and the next instruction is another DEX,
one could swap the control signals for both registers at the right moment,
without causing any traffic on the CPU internal bus systems.

At binary level, this trick would work with the N,Z flags too.

//Of course, if one had three registers for handling X-1 and X+1,
//the game of swapping "register contents" would get more difficult.
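
Here is a small behavioural sketch in Python of the X / X-1 swap trick, just to check that the bookkeeping works out. The class `ShadowedX` and all of its fields are hypothetical names; the `bus_transfers` counter stands in for "traffic on the internal bus", and the model assumes the background calculation gets exactly one idle slot whenever `background_step()` is called.

```python
class ShadowedX:
    """Toy model: two physical registers, one holds X, the other X-1."""

    def __init__(self, x=0):
        self.regs = [x & 0xFF, 0]
        self.flags = [(0, 0), (0, 0)]  # (N, Z) kept alongside each register
        self.x_sel = 0                 # which physical register currently "is" X
        self.shadow_valid = False
        self.bus_transfers = 0         # counts traffic on the internal bus
        self._set_flags(0)

    def _set_flags(self, idx):
        value = self.regs[idx]
        self.flags[idx] = (value >> 7, int(value == 0))

    def background_step(self):
        """Idle-cycle work: compute X-1 and its N,Z flags into the shadow."""
        shadow = 1 - self.x_sel
        self.regs[shadow] = (self.regs[self.x_sel] - 1) & 0xFF
        self._set_flags(shadow)
        self.shadow_valid = True

    def dex(self):
        """Execute DEX; returns True if the swap shortcut was used."""
        if self.shadow_valid:
            self.x_sel = 1 - self.x_sel  # swap the control signals only
            self.shadow_valid = False    # the old X is now a stale shadow
            return True
        self.regs[self.x_sel] = (self.regs[self.x_sel] - 1) & 0xFF
        self._set_flags(self.x_sel)
        self.bus_transfers += 1
        return False

    @property
    def x(self):
        return self.regs[self.x_sel]

    @property
    def nz(self):
        return self.flags[self.x_sel]

r = ShadowedX(2)
r.background_step()
print(r.dex(), r.x, r.nz, r.bus_transfers)
```

Because the N,Z flags travel with each physical register, the swap delivers the flags of X-1 together with the value, as suggested above.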


Posted: Tue Aug 13, 2019 2:35 pm
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3346
Location: Ontario, Canada
ttlworks wrote:
TREX [...] swapped "contents" of PC and LR by just swapping the control signals

Nice. Though the scale is modest, I believe this still qualifies as "Register Renaming"! :mrgreen: I use the same shortcut to "exchange" K0 and K3 when my KK Computer does a Far jump or JSR. (K0 has a function similar to that of the 65816's PBR.)

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Posted: Tue Aug 13, 2019 3:46 pm
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Hi Jeff.
Nah, 'register renaming' in a Pentium is a lot worse. :lol:

;---

Now for a (hypothetical, over_simplified and half_baked) picture of what might happen if a 6502 tried to "go 68060":
//Some of the images in the picture are "borrowed" from wikipedia.

[Attachment: 6502_goes_68060.png]


First we have the (8 Bytes) Instruction Buffer plus the two Byte shifters, fed from the (32 Bit) data bus.
//The Byte shifters could be implemented with a "lattice" of 74CBTLV245 chips.
Attached to (almost) every Byte in the Instruction Buffer we have a predecoder which mainly checks for the length of an instruction.
//Expect each predecoder to be at least 12 chips when checking the length of K24\65816 instructions.
Then some multiplexers for extracting all the Bytes for two 6502 CISC instructions.
//If you want to extract more than two instructions at 100MHz with 74LVC\74CBTLV chips, this needs to be pipelined.
Basically, something like a "bucket wheel excavator" which grinds into program memory for breaking it into single 6502 CISC instructions.
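
A minimal sketch of the "bucket wheel excavator" in Python, assuming a tiny hand-picked subset of NMOS 6502 opcodes: it looks up the length of the instruction at the head of the buffer and cuts out up to two complete instructions. `LENGTH` and `extract_two` are illustrative names only. The hardware described above would run the predecoders on all Bytes in parallel, while this model walks the buffer sequentially and only reproduces the end result.

```python
# Instruction lengths for a small, hand-picked subset of NMOS 6502 opcodes.
# A real predecoder would have to cover the whole opcode map.
LENGTH = {
    0xBD: 3,  # LDA abs,X
    0x9D: 3,  # STA abs,X
    0xCA: 1,  # DEX
    0xE8: 1,  # INX
    0xD0: 2,  # BNE rel
    0x18: 1,  # CLC
    0x69: 2,  # ADC #imm
}

def extract_two(buffer):
    """Split the head of an instruction buffer into up to two instructions.

    Returns (slots, consumed): the extracted instructions as bytes objects
    and the number of Bytes to shift out of the buffer.
    Raises KeyError for opcodes outside the subset above.
    """
    slots = []
    pos = 0
    while len(slots) < 2 and pos < len(buffer):
        length = LENGTH[buffer[pos]]
        if pos + length > len(buffer):
            break  # instruction is not completely in the buffer yet
        slots.append(bytes(buffer[pos:pos + length]))
        pos += length
    return slots, pos

# 8 Byte buffer: LDA $1000,X / STA $2000,X / DEX / first Byte of BNE
buf = bytes([0xBD, 0x00, 0x10, 0x9D, 0x00, 0x20, 0xCA, 0xD0])
slots, consumed = extract_two(buf)
print([s.hex() for s in slots], consumed)
```

The `consumed` value is what the Byte shifters would use to move the remaining Bytes to the front of the buffer before refilling it from the 32 Bit data bus.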

Now that we have two "slots" containing the Bytes for two 6502 CISC instructions, we can't tell which of the instructions might finish first when getting executed.
This calls for something like two little buffers with 2:1 multiplexers in front of them for swapping the slots if needed.
//We need to keep track, in which order the instructions went in for doing a "dependency check" later.
//If we don't have this function block here, this would complicate the Instruction Buffer circuitry a lot.
Sort of a "shell game" with two 6502 CISC instructions.
//Other superscalar CPUs implement a bigger buffer for a lot of "CISC instructions".

The two 6502 CISC instructions then are handled by two microcode blocks, every microcode block has a "state machine counter".
Each microcode block emits one set of µOPs for every machine cycle, that's something like vertical microcode.
The speed of the microcode storage might be a topic.

We then need to check dependencies, how the µOPs would fit together.
Register usage, function block usage in the CPU, which CISC instruction was initiated first, etc.
//The dependency check mechanism has to tell the microcode counters, if a new µOP could be generated.
Might be something like hammering square pegs through round holes, plus something like Lynn Conway's DIS.
A multiplexer (or a set of multiplexers) then "magically" puts together a VLIW word from all this (that part still isn't worked out).
It somehow reminds me of a meat grinder producing sausages.
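
The dependency check could be sketched roughly like this in Python. Everything here is an assumption made for illustration: the `MicroOp` record, the register\unit names, and especially the rule that all reads within one VLIW word sample their registers before any write of that word lands (so write-after-read is not treated as a conflict).

```python
from dataclasses import dataclass, field

@dataclass
class MicroOp:
    name: str
    reads: set = field(default_factory=set)   # registers\flags read
    writes: set = field(default_factory=set)  # registers\flags written
    units: set = field(default_factory=set)   # function blocks occupied

def can_issue_together(older, younger):
    """True if 'younger' may share a VLIW word with 'older'.

    Assumes reads happen before writes inside one word, so only
    read-after-write, write-after-write and structural conflicts count.
    """
    raw = older.writes & younger.reads
    waw = older.writes & younger.writes
    structural = older.units & younger.units
    return not (raw or waw or structural)

# Hypothetical µOP descriptions for three instructions of a copy loop.
sta = MicroOp("STA abs,X", reads={"A", "X"}, units={"AGU", "BUS"})
dex = MicroOp("DEX", reads={"X"}, writes={"X", "N", "Z"}, units={"ALU"})
bne = MicroOp("BNE", reads={"Z"}, writes={"PC"}, units={"BRANCH"})

print(can_issue_together(sta, dex))  # STA still sees the old X
print(can_issue_together(dex, bne))  # BNE needs the Z flag from DEX
```

The order of the arguments matters, which is why the circuitry has to remember which instruction went in first.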

The VLIW word then is executed in the mill.

;---

The picture leaves out a lot of little details:

Control signals between dependency check, microcode counters and Instruction Buffer circuitry.
How to handle conditional\unconditional changes in program flow.
How to put together the µOPs into a VLIW word.
What the VLIW word actually is supposed to look like.
...and maybe the list isn't complete.

The pipeline of the CPU might be ending up being 8 levels in total,
and the basic concept for handling interrupts and conditional\unconditional changes in program flow needs to be re_done.

;---

But my point here is:
If the CPU would support a "native VLIW mode" and we would be using sort of a compiler for static 6502 CISC to VLIW conversion,
with VLIW instructions being long word aligned, this might save us (a lot) more than 100 chips, this could keep the pipeline at 4 levels,
and all the complicated and probably bug ridden hardware for turning 6502 CISC into VLIW would fall out of the design
because then the compiler has to handle all this by software.


Posted: Thu Aug 15, 2019 6:11 am
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Michael's M65C02A was already mentioned in the forum.

There is some nice documentation for the M6502\M65C02S at the github page.

MichaelM wrote:
Essentially, the address generator computes the next address all the time.
Both the address generator and the stack pointer module (within the address generator) provide additional computational elements that speed the calculations of the next address.

I don't use the ADH/ADL address registers for the current address.
Instead, I use those registers to capture the address of the current cycle for use in the next address cycle.
This gives me the opportunity to shave 1 cycle off most instructions and thereby make branches (whether taken or not) two cycle instructions.
Almost all instruction fetches are overlapped with instruction execution.

I did have to take special care with the CLI/SEI instructions.
Those instructions are marked as uninterruptable (by NMI or IRQ), which guarantees that the I bit in P will be properly updated before the next instruction is executed.

I also have to perform a partial decode of the instruction during the fetch cycle.
That leads to me having two instruction decode tables in my core.
One table provides a registered decode of the opcode read from memory that controls the ALU functions.
The other table provides a decode of the opcode read from memory and implements the first cycle of the instruction (for multicycle instructions),
or the fetch cycle for the next instruction (for single byte, single cycle instructions).

See the documentation I provide in my M65C02A project on github.com for a layout of the instruction execution sequence for these two FPGA cores.
It may not work for you, but it may provide some food for thought.


Michael, thanks for the input. :)

This thread is just for collecting ideas for speeding up a 6502 (especially a TTL implementation of the 6502),
so _all_ contributions to that topic are welcome,
sorting out the ideas (into which would make sense and which won't) is something to be done later.

The point is:
If some of the ideas collected here won't be useful for _our_ project, maybe they still could be useful for other projects.


Posted: Thu Aug 15, 2019 8:12 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
For single-byte opcodes, the usual 6502 setup will already fetch the next byte as an operand, which turns out not to be needed. Maybe that could then be repurposed as an early fetch - maybe even run a second decoder on it before making the decision. So the outputs of the two decoders get switched if the first opcode is a single byte.

You certainly get a bit more by using the trick of peeking at the fetched data in the fetch cycle, but it eats into the memory access time, and also assumes that memory is relatively slower than the CPU. So, I imagine, not always an applicable tactic.

I wonder if a simple branch predictor/branch target buffer would help: when fetching a backward branch which has been fetched recently - and I'm thinking here of an inner loop - we could guess what it is, where it goes to, and that it might well be taken. So we can fetch the target address in the third cycle, in parallel with figuring out if we're right. There's no greater gain than cutting the overhead of an inner loop!

Combining these two ideas, if we have a common sequence like
DEX
BNE backwards
we could maybe do it in 3 cycles instead of the usual 5.
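
A back-of-the-envelope model in Python for the "3 cycles instead of 5" estimate. The baseline uses the usual 6502 timings (DEX 2 cycles, BNE 3 cycles when taken within a page, 2 when not taken). The numbers for the improved case are pure assumptions for this sketch: DEX costs 1 cycle thanks to the early fetch, a correctly predicted taken BNE costs 2 cycles, and a wrong "taken" guess costs one extra cycle. `BranchTargetBuffer` and `loop_cycles` are made-up names, and the addresses are arbitrary.

```python
class BranchTargetBuffer:
    """Tiny BTB: remembers targets of backward branches that were taken."""

    def __init__(self):
        self.targets = {}  # branch address -> target address

    def predict(self, branch_addr):
        return self.targets.get(branch_addr)  # None means "no prediction"

    def update(self, branch_addr, target, taken):
        if taken and target < branch_addr:
            self.targets[branch_addr] = target
        else:
            self.targets.pop(branch_addr, None)

def loop_cycles(iterations, use_tricks):
    """Cycles spent on DEX\\BNE while X counts down from 'iterations' to 0."""
    bne_addr, target = 0x0207, 0x0200  # hypothetical addresses
    btb = BranchTargetBuffer()
    cycles = 0
    for remaining in range(iterations - 1, -1, -1):  # X after each DEX
        taken = remaining != 0
        cycles += 1 if use_tricks else 2             # DEX
        predicted = use_tricks and btb.predict(bne_addr) is not None
        if taken and predicted:
            cycles += 2   # target fetch overlapped with the flag check
        elif taken:
            cycles += 3   # ordinary taken branch
        elif predicted:
            cycles += 3   # 2 cycles plus 1 assumed misprediction penalty
        else:
            cycles += 2   # ordinary not-taken branch
        if use_tricks:
            btb.update(bne_addr, target, taken)
    return cycles

print(loop_cycles(4, use_tricks=False), loop_cycles(4, use_tricks=True))
```

In steady state the improved loop overhead is 1 + 2 = 3 cycles per iteration; the first iteration (BTB still empty) and the final fall-through cost a little more.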


Posted: Thu Aug 15, 2019 2:28 pm
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
The 65CE02 had a trick for fetching the next instruction early.

Fetching one Byte after the instruction that currently is executed in advance is easy,
because it's always supposed to be the next instruction Byte.

But when fetching more than one Byte in advance, the Bytes could either be instruction Bytes or data.
Wider data bus, instruction buffer, predecoders etc., we already had that stuff somewhere up in the thread.

;---

Hmm...
Something like two "CPU cores", where one core isn't a 'real/complete' CPU but reads program memory ahead for looking
what could be calculated "in advance" and what should be "flagged/marked", and puts it into a "scratchpad memory"
to be read by the real core some machine cycles later.

But changes in program flow would be limiting the fun, especially conditional branches.
And this calls for dual port RAM program memory.

Reminds me of the old Helios spacecraft concept:
Nuclear rocket motor, dragging the crew module at a "supposed to be safe" distance.

[Attachment: helios.png]


Last edited by ttlworks on Thu Aug 15, 2019 2:50 pm, edited 1 time in total.

Posted: Thu Aug 15, 2019 2:46 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
I think a decoupled front end is what's needed - that's what Intel did, AIUI. If we always think in terms of a monolithic CPU, we have difficulty thinking of it doing two things at once.


Posted: Thu Aug 15, 2019 3:45 pm
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Ed, for the past 19 years or so I only had monolithic CPU designs.

After reflecting a bit on this, my proposal for a wider data bus with instruction buffer now feels
like a "brute force attack" on program memory with a monolithic bucket wheel excavator.

The Helios concept is interesting:

One could build a conventional TTL CPU (original 6502 architecture at 100MHz would be 43 MIPS),
and then plug a "Helios Motor" into it to go faster still.

;---

Most of the ALU operations inside a 6502 CPU seem to be related to address calculations.
The "Helios Motor" would be some instructions ahead of the CPU, trying to pre_emptively do
as many address calculations as possible.

It would have a sequenced microcode like the CPU, just built a bit different.
It would be aware if\when the next instructions the CPU is going to fetch might cause a change to X, Y, PC, SP etc.
It could be passively snooping the contents of X, Y, PC, SP etc. in the CPU through the CPU internal bus systems.

It doesn't need to do data calculations, which means that if it has "a copy" of program memory RAM (with a separate bus system of course),
it could fetch instructions with a faster bandwidth than the CPU.

In the right moment, it just disables the address bus drivers of the CPU,
and places an address on the CPU address bus, while the CPU is reading data or instructions from the data bus.


Posted: Thu Aug 15, 2019 4:08 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Ahem, yes, well I'm very much an amateur and only an armchair pipe-dream kind of CPU designer, whereas you have actually designed things!

