PostPosted: Wed Sep 18, 2019 5:40 am 
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
I’m finally getting ready to order the test PCB. The idea is to run through a bunch of high-speed signal integrity tests on an impedance-controlled PCB.
Attachment: top.png
The board features some hex rotary switches feeding an ICS525 configurable clock to generate frequencies on either side of 100MHz in fine increments. CDCLVC1110 clock buffers distribute the signal to ALVC, AVC and NC7SV family driver ICs. The drivers exercise traces in a variety of configurations — point-to-point lines of various lengths, buses with stubs for other drivers, crosstalk aggressor and victim lines at various clearances, etc. A couple of the buses have options for both diode and RC termination.

Eventually, all the traces arrive at 74LVC163 counters which should track together if edges are clean. Stopping the clock after a short run should reveal which specific counters are having problems at various clock-rates. There is also a provision for an external clock signal just to evaluate what impact inter-board connectors might have.

ALVC and NC7SV point-to-point drivers all have footprints for impedance-matching source series resistors. AVC drivers, on the other hand, rely on Dynamic Output Control (DOC) outputs instead. The idea is that the output impedance of DOC drivers changes dynamically during transitions to tame reflections.
Attachment: 951711CC-2B7D-47CC-9C97-F35EEEFF33B2.jpeg
Quoting from the AVC Logic Family App Note (page 15):
Quote:
The DOC circuitry provides enough drive current to achieve faster slew rates and meet timing requirements, but quickly switches the impedance level to reduce the overshoot and undershoot noise that is often found in high-speed logic. This feature of AVC logic eliminates the need for damping resistors in the output circuit, which are often used in series, and sometimes integrated with logic devices, to limit electrical noise.
It will be interesting to see how that works out.

On the back of the board, there is a FET switch carry chain set up to oscillate. The generated signal drives a frequency-divide counter, which should tell us more about the propagation delay through the Elmore chain. All these ICs are in close proximity to minimize signal integrity issues. We’ll see.
Attachment: bottom.png

A scaled printout of the board shows the footprints are good (those 0402 caps are tiny!), but I’ll give it a couple of days’ review before sending the thing out, just in case.

Cheers for now,
Drass

_________________
C74-6502 Website: https://c74project.com


PostPosted: Sun Oct 27, 2019 11:24 pm 
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Having had the chance to study this thread more closely, I began to experiment with a new “reduced-CPI” 6502 design. After significant noodling, I settled on a seven-stage pipelined architecture, as follows:

[FEtch] —> [DEcode] —> [ADdress] —> [IndeX] —> [DAta] —> [EXecute] —> [WriteBack]

  • FE: Fetch an instruction (opcode and operands) from Program Memory
  • DE: Decode opcode, marshal operands, calculate pc-relative, stack-relative and pre-index offsets
  • AD: Fetch indirect address, including jump vectors, from Address Memory
  • IX: Apply post-indexed address offsets
  • DA: Read Data Memory
  • EX: Perform ALU operation, update register file
  • WB: Update P register, verify branch prediction, write to Data Memory

The basic idea is that a seven-stage pipeline can support the various 6502 addressing modes (including RMW instructions) in one pass. Because the first stage simultaneously fetches both the opcode and its inline operands, each instruction can work its way down the pipeline as an atomic unit. Instructions advance through the pipeline at a rate of one stage per cycle, making use of specific resources as they go — indirection, indexing, memory, the ALU, etc. Once the pipeline fills, the CPU is able to complete one instruction in something approaching one cycle on average (i.e. CPI = 1.x). (Various pipeline hazards make the ideal one-instruction-per-cycle pipeline somewhat of a chimera, but we can try to get close).

As others have suggested, pipelining a 6502 beyond just a few stages requires concurrent access to Program, Address and Data memory. For example, we could have Program Memory supplying a new instruction, Address Memory reading an indirect address from zero page, and Data Memory reading a byte of data... all at the same time. In concept, the requirement is analogous to a large multiport RAM with three simultaneous accesses possible. In practice, modern pipelines are supported by sophisticated multi-level cached memory for this purpose. But we will want to keep things as simple as possible to start -- we can always optimize later.

For now, let's propose three distinct single-port RAMs working together. We will want to be able to read complete instructions and addresses as often as possible, so 32-bit-wide memory is appropriate. Let's also assume 64K bytes of addressable space. One consequence of using three single-port RAMs is that writes must be mirrored across all memories to keep them synchronized. To prevent writes from stalling the pipeline excessively, each memory is outfitted with a one-deep write buffer. Each memory can independently postpone a write if busy, completing it in a cycle when it would otherwise have been idle.
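
To make the write-buffer idea concrete, here is a minimal Python sketch (illustrative only, not the actual P7 code; the names are invented for this example):
Code:
class BufferedRAM:
    """Single-port RAM with a one-deep write buffer.

    A write is posted to the buffer and retired into the RAM on any
    later cycle in which the port is not needed for a read.
    """
    def __init__(self, size=64 * 1024):
        self.mem = bytearray(size)
        self.pending = None                # (address, value) awaiting an idle cycle

    def post_write(self, addr, value):
        if self.pending is not None:
            return False                   # buffer already full: caller must stall
        self.pending = (addr, value)
        return True

    def cycle(self, read_addr=None):
        """One memory cycle: service a read if one is requested, otherwise
        use the otherwise-idle port to drain the write buffer."""
        if read_addr is not None:
            return self.mem[read_addr]
        if self.pending is not None:
            addr, value = self.pending
            self.mem[addr] = value
            self.pending = None
        return None

A CPU write would then be posted to all three such RAMs at once (the mirroring described above), and each one drains independently whenever its own port happens to be idle.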

Beyond memory, certain accessory mechanisms are needed. The 6502's variable-length instructions mean instruction boundaries don't always align neatly to words in 32-bit memory. For this reason, Program Memory feeds into a prefetch buffer, so any instruction whose bytes span a word boundary will still appear intact and complete in the buffer. There is also an alignment problem in Address Memory. Here we can use a second, parallel Address Memory to fetch two consecutive words in one cycle, which guarantees a complete address even when it spans a 32-bit boundary. In addition, as instructions are processed, branches are identified and their target addresses loaded into a Branch Target Buffer (BTB). The BTB is then consulted on every fetch to resolve branch targets as early as possible, and to aid in branch prediction. Finally, with all this concurrent activity, a complement of hazard checks must constantly police the pipeline to make sure the correct program semantics are maintained.
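
As a rough illustration of the alignment handling (again just a sketch, not the real byte-select mux logic), the prefetch buffer can be modeled as a small window of bytes from which a complete instruction is extracted only once all of its bytes have arrived:
Code:
def extract_instruction(window, pc, length_of):
    """Return the complete instruction starting at pc, or None (stall)
    if the prefetch window does not yet hold all of its bytes.

    window    -- dict of byte address -> byte, filled by 32-bit word fetches
    length_of -- function giving the instruction length (1-3) of an opcode
    """
    if pc not in window:
        return None
    length = length_of(window[pc])
    insn = []
    for offset in range(length):
        if pc + offset not in window:
            return None                    # operand byte not fetched yet
        insn.append(window[pc + offset])
    return insn

# Example: a 2-byte LDA #$41 whose bytes straddle a 32-bit word boundary.
window = {0x10FE: 0xA9, 0x10FF: 0x41}      # bytes from two successive word fetches
two_bytes = lambda op: 2                   # stand-in for the real length table
print(extract_instruction(window, 0x10FE, two_bytes))   # prints [169, 65], i.e. $A9 $41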

Ok, so that’s a lot of stuff to keep straight. After some paper-and-pencil exploration, I decided to write a simulation to flesh out the concept and work out the kinks. The result is the P7 Simulator. It’s a fairly straightforward Python script written to simulate the action of pipeline registers working cycle by cycle on 6502 opcodes. It currently supports NMOS opcodes and implements the seven-stage pipeline outlined above. It makes use of the following facilities across the various pipeline stages (a stripped-down sketch of the basic cycle-by-cycle loop follows the list):

  • FE: Program Memory Read, Pre-Fetch Code Buffer and associated Byte-Select Mux Matrix, Branch Target Buffer
  • DE: Decode RAM, PC Increment Adder, Pre-Index Adder
  • AD: Address Memory Read
  • IX: Post-Index Adder
  • DA: Data Memory Read, Register File Read
  • EX: ALU, Register File Write
  • WB: Data Memory Write, Branch evaluation/validation
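
Here is that stripped-down sketch of the cycle-by-cycle loop. To be clear, this is not the actual P7 code, just an illustration: it leaves out the hazard checks, forwarding paths and cache write buffers entirely, and simply shows instruction records marching down the seven pipeline registers, one stage per clock.
Code:
STAGES = ["FE", "DE", "AD", "IX", "DA", "EX", "WB"]

def simulate(program, cycles):
    """Toy 7-stage pipeline: one instruction record per pipeline register,
    everything advancing one stage per cycle (no stalls in this sketch)."""
    pipe = {stage: None for stage in STAGES}
    pc, retired = 0, 0
    for cycle in range(cycles):
        if pipe["WB"] is not None:         # whatever reached WB last cycle retires now
            retired += 1
        # Shift back-to-front so each stage picks up its predecessor's old contents.
        for dst, src in zip(reversed(STAGES), reversed(STAGES[:-1])):
            pipe[dst] = pipe[src]
        # FE brings in the next instruction, if any remain.
        pipe["FE"] = program[pc] if pc < len(program) else None
        if pipe["FE"] is not None:
            pc += 1
    return retired

# Ten instructions: the first needs the full pipeline depth, then one retires
# per cycle, so CPI approaches 1 as the run gets longer.
print(simulate(list(range(10)), 17))       # prints 10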

As mentioned above, writes to the three memories are completed during idle cycles. The Branch Target Buffer (btb), on the other hand, is read every cycle and so has no naturally occurring idle cycles. Thankfully, btb reads can simply be skipped when the need to write arises. We might incur a few more branch stalls as a result, but as we’ll see, this all works out pretty well. (One alternative is to run memory at twice pipeline clock-rate and time-multiplex separate read and write ports. There are a variety of low-cost high-speed synch RAMs available that are quite suitable for this purpose. And indeed, this can work equally well for all three memories in addition to the btb. The P7 simulation includes a run-time switch to simulate dual-port RAMs across the board).

One significant fly in the memory-management ointment is self-modifying code. Unfortunately, an instruction that is already in the pipeline will not be modified by a write to Program Memory. It will therefore not reflect a recent modification when it executes. Rather than instrumenting the pipeline to detect self-modifying writes, the workaround is to impose a minimum distance between dynamically modified instructions and the instructions which modify them. Not the end of the world, but certainly a limitation in terms of compatibility.

Finally, one other architectural note: instructions which require both stack operations and jumps — namely JSR/RTS and BRK/RTI — are done by feeding multiple successive lines of microcode down the pipeline for each of these instructions (whereas the norm is just a single line of microcode). So a JSR <abs>, for example, generates a PUSH PCH / PUSH PCL / JMP <abs> sequence in the pipeline. Similarly, BRK generates three pushes, an SEI, and then a JMP (abs) instruction to jump through the correct interrupt vector. (Speaking of interrupts, I’ve yet to instrument the simulation to handle them. Intuitively, it seems like inserting a BRK into the instruction stream should work, but it will be interesting to see how that bears out, and where interrupt latency ends up).
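
Sketched in Python (the pseudo-opcodes here are purely illustrative; the real pipeline is fed successive lines of microcode rather than these mnemonics):
Code:
def expand(mnemonic, operand=None):
    """Expand multi-operation instructions into the sequence of pipeline
    operations issued one per cycle; everything else passes through as a
    single operation."""
    if mnemonic == "JSR":
        return ["PUSH PCH", "PUSH PCL", f"JMP {operand}"]
    if mnemonic == "RTS":
        return ["ADJUST SP", "JMP (return address)"]   # two operations, as in the trace below
    if mnemonic == "BRK":
        return ["PUSH PCH", "PUSH PCL", "PUSH P", "SEI", "JMP ($FFFE)"]
    return [f"{mnemonic} {operand}" if operand else mnemonic]

print(expand("JSR", "$1234"))   # ['PUSH PCH', 'PUSH PCL', 'JMP $1234']
print(expand("LDA", "#$00"))    # ['LDA #$00']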

Ok, with that as a design outline, I got the simulation more or less put together. I then loaded the Dormann NMOS 6502 Test Suite to give it a whirl. Sure enough, there is self-modifying code in the suite! I had to insert a couple of strategically placed NOPs in the code to create the requisite minimum distance between self-modifying instructions. (Two NOPs just prior to each of the JSRs in the CHKADD routine did the trick. The simulator does not currently support decimal mode, so I didn’t bother to similarly change CHKDADD).

And with that, I was ready for the big test. I let P7 run the test suite for a half million cycles, and then the simulation generated the following summary output:
Code:
(FOR EXPLANATION OF THE FIGURES SEE DISCUSSION BELOW)

------------ Run Stats ------------
Cycles:          500000
NMOS Cycles:    1424511
- - - - - - - - - - - - - - - - - -
Instructions:    462936   (93%)  <-- (JSR and RTS are counted as one instruction each)
Data Stalls        1641   (0%)
Control Stalls:    1104   (0%)
Delay Slots:      13831   (3%)
Generated:        12849   (3%)
Cache Busy:        7640   (2%)
Total:           500000   (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd:        131477   (26%)
Operand Fwd:       3755   (1%)
- - - - - - - - - - - - - - - - - -
Branches:         76460   (17%)
Taken:             4993   (7%)
btb Hit:           4571   (6%)
Mispredictions:     276   (0%)
- - - - - - - - - - - - - - - - - -
bCache Access:        2   (0%)

----------- Performance -----------
CPI:               1.08
NMOS CPI:          3.08
Boost Factor:      2.9x

There’s a lot of information here, but the key takeaway is the performance summary at the bottom. P7 averaged 1.08 CPI during the run! :shock: Honestly, I was not expecting such a good result. We shall see how performance holds up under varying workloads. For now I’m quite happy to report a 2.9x performance boost over the NMOS equivalent CPI for this run. Fantastic!

Some other interesting things to highlight in the numbers above:
  • On the whole, the pipeline manages to resolve interdependencies between instructions very successfully, and very few Data Stalls were triggered on this run.
  • Control Stalls too are very few, which suggests the branch logic manages pretty well despite a 7-cycle branch mis-prediction penalty. (Branches are resolved in the final WB pipeline stage, so a branch mis-prediction requires that all pipeline stages be flushed and reloaded).
  • 3% of cycles are Delay Slots — a direct branch (or jump) will incur a single delay slot penalty. Indirect jumps trigger two delay slots. No attempt is made to fill the slots, so in fact they are full-fledged stalls. Conditional branches and direct jumps that have been seen at least once in the instruction stream do not incur a delay slot penalty; they are managed by the Branch Target Buffer. For various reasons, we don’t use the btb for indirect jumps.
  • Relatively few Cache Busy stalls are generated (about 2% of cycles), which suggests that single-entry cache write buffers are more than adequate in this case. There are usually plenty of idle cycles available to complete writes in a timely manner. Even so, this aspect of pipeline performance may be further improved. Running the simulation with dual port RAMs enabled produced a CPI of 1.06, as all Cache Busy stalls are eliminated. That's a significant improvement that is well worth considering, particularly if full caching is implemented and the inevitable cache misses begin to tax performance (although the locality of the working-set of most 6502 code is probably good enough so that cache miss rates may be very low. Anyway, the right tradeoffs should become clear with further experimentation).
  • About 3% of cycles are additional cycles Generated by JSR and RTS instructions during their translation to multiple pipeline operations. This seems like a very reasonable tradeoff. With it, we avoid undue complication in the pipeline, and pay a relatively modest performance penalty.
  • Fully 26% of cycles needed Data Forwarding to avoid stalls. Data is “forwarded” when it is passed directly from one pipeline stage to another as soon as it is generated (from the output of the ALU back into its input, for example). It’s very interesting to see just how critical this mechanism is. Without it, the pipeline would degrade significantly.
  • It’s interesting also that Operand Forwarding does not have nearly the same impact. The FE Stage goes to some lengths to have complete instructions ready when the DE Stage needs them. However, the first couple of cycles after a branch can be tricky because a branch invalidates the Pre-Fetch Code Buffer. If the branch target instruction spans a word boundary, one or more operands will be missing from the initial fetch, and the FE Stage falls behind. In some cases, it is possible for FE to forward the missing operands from memory directly to the DE stage, and thereby avoid a stall. I thought this feature would be well worth the added complexity, but the results above suggest otherwise. In this run at least, it would be better to tolerate a few more stalls rather than over-burden the FE stage with this additional logic.
  • 17% of instructions executed were Conditional Branches. Of those, only 7% were Taken Branches. This may be because the Dormann test suite is peppered throughout with error-checking branches which are never taken (once the emulation is working correctly that is). I suspect we will see a much higher proportion of taken branches with other code.
  • The Branch Target Buffer (BTB) seems to perform very well, and generates hits for nearly all taken branches. This is not altogether surprising given that looping branches are taken many times while not-taken branches repeat infrequently.
  • Meanwhile, Branch Prediction is “off-the-charts” accurate. I am using a 2-bit Branch Prediction scheme here, with a modified Backward-Taken-Forward-Not-Taken static prediction for the first encounter of a given branch. I say modified because a backward branch to itself is predicted as Not-Taken. There are a large number of these in the Dormann code, which probably accounts for the unusually high accuracy in this run. I would not expect prediction accuracy to stay at these levels for the common case workload. (A small Python sketch of the prediction scheme appears just after this list.)
  • Finally, it’s interesting to note that (contrary to my expectation) the bCache is hardly ever used. The bCache is a parallel Address Memory used by the AD Stage to fetch a complete address in a single cycle, even when that address spans a word-boundary. The above result suggests it might be best to dispense with the bCache altogether, and simply tolerate the occasional pipeline stall when necessary.
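
As promised above, here is a small sketch of the prediction scheme (the counter seeding and thresholds are guesses for illustration, not necessarily what P7 uses):
Code:
counters = {}        # branch PC -> 2-bit saturating counter (0-3)

def predict(branch_pc, target):
    """Predict taken/not-taken. A first encounter uses the modified
    backward-taken / forward-not-taken rule; after that, the 2-bit counter
    decides (0,1 = not taken; 2,3 = taken)."""
    if branch_pc not in counters:
        taken = target < branch_pc         # a branch to itself (target == pc) predicts not-taken
        counters[branch_pc] = 2 if taken else 1
        return taken
    return counters[branch_pc] >= 2

def resolve(branch_pc, actually_taken):
    """Called at WB once the branch outcome is known."""
    c = counters[branch_pc]
    counters[branch_pc] = min(3, c + 1) if actually_taken else max(0, c - 1)

# A backward loop branch: predicted taken on first sight, and the counter
# keeps it that way until the branch falls through twice in a row.
print(predict(0x1010, 0x1000))             # True
resolve(0x1010, True)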

Alright, a lot remains to be worked out here, but I found these results very encouraging indeed. The next step is to profile a wider variety of workloads and see what can be learned. Once the design is fine-tuned, I’ll want to port the simulation to Logisim. That will provide further validation, and some sense of what it would take to implement something like this in the real world. As always, the objective here is to learn and to have fun, so I’ll still consider the exercise a success even if the design proves too ambitious for discrete logic.

Speaking of fun, one highlight for me was being able to witness first-hand how the pipeline stages interact with one another. The P7 simulator reports its progress and displays pipeline activity as it goes. It’s fascinating to see it work through different instruction stream scenarios. To that end, below is a brief snippet of the last 10 cycles of the Dormann test run. (In the report, cycles are delimited by a line showing the cycle number, and one line per pipeline stage is shown below that. Each pipeline stage is annotated with relevant pipeline activity as appropriate. The FE Stage shows the next Program Memory fetch address every cycle, and also dumps the 8-byte Prefetch Code Buffer whenever a new 32-bit word is brought in from Program Memory. In the report, iCache refers to Program Memory, aCache to Address Memory and dCache to Data Memory. Pipeline “bubbles” are indicated by “---“ in the opcode field. Note that an RTS instruction generates two pipeline operations, one to adjust SP, and the other to jump to the return address. This is an indirect jump which triggers two Delay Slots).

I would be delighted to share the python code with anyone wishing to try it out. (Please PM me). Alternatively, feel free to suggest any 6502 code you think might be an instructive test case. All I need to run a test is a binary image of the code (up to 64k), a load address, and a start address to jump to. A listing of the code is useful in case I run into any issues (e.g., self-modifying code). The only caveat is that the test must run without any user input or interrupts.

(One thing I’d like to test is loading BASIC and running one of the sieve type benchmarks. Would anyone be willing to create a binary image containing, say, ehBASIC with a BASIC program already loaded in memory such that it would auto-run without user input? That would be a cool thing to try).

As always, many thanks to Dr Jefyll and ttlworks for their continued help, and thanks in advance for any thoughts and comments on the above.

Cheers for now,
Drass

Code:
Program Start: $c000
------------- 499990 --------------
WB: js* $0213
EX: sbc #$fb
... A = $2e
DA: rts
IX: rt*
AD: ---
DE: ---
... iCache Write Pending
FE: $f0e8: 20 13 2 8 - 20 13 2 8
------------- 499991 --------------
WB: sbc #$fb
EX: rts
DA: rt*
IX: ---
AD: ---
DE: php
... iCache Write Pending
FE: $f0ec: 20 13 2 8 - c5 f d0 fe
------------- 499992 --------------
WB: rts
EX: rt*
DA: ---
IX: ---
AD: php
... iCache Write Pending
DE: cmp $0f
FE: $f0f0: 68 29 c3 c5 - c5 f d0 fe
------------- 499993 --------------
WB: rt*
EX: ---
DA: ---
IX: php
AD: cmp $0f
DE: bne $fe
... iCache Write Pending
FE: $f0f4: 68 29 c3 c5 - 11 d0 fe 28
------------- 499994 --------------
WB: ---
EX: ---
DA: php
IX: cmp $0f
AD: bne $fe
DE: pla
FE: $f0f8:
... Drain: iCache[$1fb] = $ea
------------- 499995 --------------
WB: ---
EX: php
DA: cmp $0f
... Alub = $2e = [$f]
IX: bne $fe
AD: pla
DE: and #$c3
FE: $f0f8:
------------- 499996 --------------
WB: php
... dCache[$1fc] = $30
EX: cmp $0f
DA: bne $fe
IX: pla
AD: and #$c3
... iCache Write Pending
DE: cmp $11
FE: $f0f8: 8 a5 d 75 - 11 d0 fe 28
... Drain: dCache[$1fc] = $30
... Drain: aCache[$1fc] = $30
------------- 499997 --------------
WB: cmp $0f
EX: bne $fe
DA: pla
... Alub = $30 = [$1fc]
IX: and #$c3
AD: cmp $11
DE: bne $fe
FE: $f0fc:
... Drain: iCache[$1fc] = $30
------------- 499998 --------------
WB: bne $fe
EX: pla
... A = $30
DA: and #$c3
... Forward A = $30
IX: cmp $11
AD: bne $fe
DE: plp
FE: $f0fc: 8 a5 d 75 - 0 8 c5 f
------------- 499999 --------------
WB: pla
EX: and #$c3
... A = $0
DA: cmp $11
... Forward A = $0
... Alub = $0 = [$11]
IX: bne $fe
AD: plp
DE: php
FE: $f100:
------------- 500000 --------------
WB: and #$c3
EX: cmp $11
DA: bne $fe
IX: plp
AD: php
DE: lda $0d
FE: $f100:

_________________
C74-6502 Website: https://c74project.com


PostPosted: Mon Oct 28, 2019 1:11 am 
Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Drass:

Great work. I came to a similar realization, but I wasn't motivated to carry it out. Your work with the Program, Address, and Data memories substantiates my thinking on this subject: the primary impediment to increasing performance of a standard 6502 is related to the limited bandwidth of the memory interface. Your solution of providing three separate interfaces, i.e. Program, Address, and Data memories, coupled with the instruction prefetch buffer dramatically increases the bandwidth available to your processor core.

I am not surprised that your simulation shows that providing the processor core increased bandwidth results in such a dramatic improvement in performance. The performance improvement that you are reporting, approximately 2.9x, seems to roughly equal the increase in memory access bandwidth that you provided to your processor core.

Your description of the write buffer did not include any description of an address comparator that checks to see if the contents of the write buffer match the read address of a read cycle that follows the write buffer posting. I think that you may need to consider adding such a comparator and a bypass which returns the write buffer contents when a match occurs. Given the paucity of idle memory cycles in a standard 6502, a match of a read cycle to the write buffer contents could be an opportunity to convert the read cycle into a write cycle because of the added bypass logic.
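
In rough Python terms (just a sketch of the suggestion, with invented names), the comparator and bypass would amount to something like this:
Code:
class WriteBufferWithBypass:
    """One-deep write buffer plus an address comparator on the read path."""
    def __init__(self, ram):
        self.ram = ram                     # backing single-port RAM (e.g. a bytearray)
        self.pending = None                # (address, value) posted but not yet retired

    def post_write(self, addr, value):
        if self.pending is not None:
            return False                   # buffer full: caller must stall
        self.pending = (addr, value)
        return True

    def read(self, addr):
        # Comparator hit: return the buffered value directly, and use the
        # now-free RAM port to retire the pending write in the same cycle.
        if self.pending is not None and self.pending[0] == addr:
            buf_addr, value = self.pending
            self.ram[buf_addr] = value
            self.pending = None
            return value
        return self.ram[addr]              # normal read; the write stays buffered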

Reading your description of how you used the memories, I got the impression that your approach resembles a split cache so that you have in essence created a Harvard memory architecture for your processor core. The transformation of the von Neumann memory interface of a standard 6502 into a Harvard architecture provides substantial improvements in the total memory bandwidth available to a processor core. In essence, your architecture and simulation dramatically demonstrate the efficacy of using separate instruction and data caches.

All that being said, I am wondering if your pipelined processor core would be better served by a small single port instruction cache and instruction prefetch buffer as you described above, along with a small dual port data cache. Given that the working set of a 6502 program is generally small, I would think that a 256 x 32 instruction cache and a 1024 x 32 data cache would likely yield performance similar to what you describe above from your simulation. These caches could be direct mapped, or you could consider using a two-way set-associative organization for the instruction cache to improve the hit rate. A write through data cache policy could be used to good effect. A 32-bit memory could fill a cache line in either cache in a single read cycle.

A data cache holding both the data and address memory areas like you described above is not likely to provide the same performance as your address memory provides by reading both halves of an indirect address pointer in a single read cycle using a dual address generator as you described. It would, however, not require the additional HW that your address memory requires in your current model.

I am a bit short of time at the moment to delve into your processor's simulation model, otherwise I would be very interested in seeing how you modeled your pipelined processor core. As things are going, it's going to be quite a while before I have any free time away from work to study your processor model in detail.

Once again, I look forward to your posts. There's always something thought provoking in them. Looking forward to more posts on this project.

_________________
Michael A.


PostPosted: Mon Oct 28, 2019 6:51 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Absolutely fantastic Drass!

I feel a new thread on this adventure might be a good investment. (A new pipelined microarchitecture isn't only good for a TTL implementation.)

Highlights for me are
- side-stepping the worst constraints of self-modifying code
- starting by writing a simulator
- running Klaus' test suite for a substantial number of cycles
- getting nearly 3x improvement

I don't doubt that there are many possible tweaks and explorations to the memory system architecture.


PostPosted: Mon Oct 28, 2019 7:32 am 
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Drass, that's really great work, and I'm looking forward to reading some more of the details.

NMOS 6502 is rated 0.43MIPS at 1MHz.
So in theory, 100MHz * 0.43MIPS * 2.9 = 124.7MIPS, and that's really impressive.
//Considering that a superscalar MC68060 at 75MHz is rated 110MIPS.

;---

Somewhere above in this thread, I had mentioned the idea of going for a VLIW core (and for a static compilation of 6502 to VLIW machine code) to squeeze as much speed out of the CPU as possible.

While looking around a bit, I had stumbled over the Intel Itanium microarchitecture by accident.

From the Intel Itanium tutorial, page 18:

Attachment: intel_itanium.png


From the TREX instruction set:

Code:
    ------------------------------------
   |1|I|  Slot 1  |  Slot 2  |  Slot 3  |
    ------------------------------------
   31                                  0

And I _swear_ that I was totally unaware of the Intel Itanium when I planned/built TREX. :roll:


PostPosted: Mon Oct 28, 2019 10:38 pm 
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Itanic was a development of the VLIW concept. It's normal in VLIW to have multiple instruction fields with bizarre lengths within a single instruction word of power-of-two (POT) length.

Most VLIWs, though, dedicate each field to one of the execution units, ensuring that each can be fed an operation in every cycle. Itanic tried to put an OoO instruction scheduler in the way, and it didn't work very well. Sure, it *functioned* - but it led to a huge and very hot design for its performance, and it never seemed to live up to the promises in that department either. Mind you, this was at around the same time as the Pentium 4, which ended up with similar problems for different reasons. Both are designs that are worth studying - as counterexamples.

Getting back on topic, it's interesting to see how even a 6502 can be pipelined relatively effectively. The pipeline shown is actually similar to that of the 68040, though that had a different method of de-conflicting memory accesses.


PostPosted: Tue Oct 29, 2019 8:02 am 
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Chromatix wrote:
[Itanium, Pentium 4]
Both are designs that are worth studying - as counterexamples.

True, and Intel seemed to have some problems with that "x86 to Itanium compiler".
//Other VLIW examples would be Transmeta Crusoe and Elbrus E2K.

But the point here is that when trying to make a CISC CPU (with an instruction set where the number of bytes per instruction may vary) superscalar, this naturally leads to a lot of complicated circuitry in the instruction decoder/sequencer (affecting speed), and perhaps to deeper pipelines.
The concept of VLIW just "moves" some of the problems from the hardware side to the software side, where fixing bugs is supposed to be less fuss.


The 68040 had a 6-level pipeline.
Unfortunately, the MC68040 designers handbook doesn't give much info about this on PDF page 27 (2-9).
//But the part about transmission lines and termination starting at PDF page 57 (5-1) looks nice.

Hmm... the 7-level pipeline of the AVR32 AP looks interesting, PDF page 21.
Edit: I still wonder why they "forked" the pipeline like this:

Attachment: AVR32_AP_pipeline.png


PostPosted: Wed Oct 30, 2019 10:02 am 
Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1392
Speaking about transmission lines and termination:
For the physical implementation of a (hypothetical) 100MHz TTL CPU, connectors are still a topic.

//While looking around for stackable PCB connectors, I noticed that 'PC104' seems to be a good search term.

Eventually, I had stumbled over VPX and AdvancedTCA backplanes at Wikipedia.

While looking around at Mouser for 100 Ohm high-speed PCB connectors, I found three families of connectors to be interesting:

Amphenol Backplane connectors:
Amphenol AirMax VS: 12.5 Gb/s, 2.0mm pitch

Molex Backplane connectors:
Molex VHDM: 2.5 Gb/s, 2.0mm pitch
Molex Gbx I-Trac: 12.5 Gb/s, 3.7mm pitch


PostPosted: Fri Nov 01, 2019 8:46 am 
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Thanks very much for the comments gents.

Quote:
the primary impediment to increasing performance of a standard 6502 is related to the limited bandwidth of the memory interface.
Agreed. This only became clear to me once BigEd mentioned it upthread. Then the penny dropped. :)

Quote:
The performance improvement that you are reporting, approximately 2.9x, seems to roughly equal the increase in memory access bandwidth that you provided to your processor core.
Interesting, I had not really thought about it in those terms but it makes intuitive sense. Memory bandwidth is key, but Data and control stalls can still prevent the pipeline from using the additional bandwidth effectively. So it seems plenty of bandwidth with appropriate dependency management is the right mix.

Quote:
Your description of the write buffer did not include any description of an address comparator that checks to see if the contents of the write buffer match the read address of a read cycle that follows the write buffer posting. I think that you may need to consider adding such a comparator and a bypass which returns the write buffer contents when a match occurs. Given the paucity of idle memory cycles in a standard 6502, a match of a read cycle to the write buffer contents could be an opportunity to convert the read cycle into a write cycle because of the added bypass logic.
Yes, very astute of you to note this. There are indeed write-buffer comparators in the simulation, just as you suggest, and a read that hits on the write-buffer leaves memory free to service a write in the same cycle.

Quote:
Reading your description of how you used the memories, I got the impression that your approach resembles a split cache so that you have in essence created a Harvard memory architecture for your processor core.
Exactly right. I didn’t start with that architecture specifically though. Instead, I was guided by the simple notion that any given resource should be used in only one pipeline stage in order to minimize conflicts. Hence, fetching from Program Memory in the FE stage only, and reading from Data Memory in the DA stage only. It turned out rather neatly that a Harvard architecture was the natural consequence of this approach.

Quote:
All that being said, I am wondering if your pipelined processor core would be better served by a small single port instruction cache and instruction prefetch buffer as you described above, along with a small dual port data cache.
Thank you for this commentary! Very helpful. Address memory is somewhat of a luxury, I agree (parallel address memory even more so). We can certainly store both data and addresses in one RAM, and then allow the pipeline to stall on conflicts. It's a tradeoff between performance and additional hardware, as you suggest. A key objective of the P7 simulator is to inform exactly this kind of design choice. Time will tell where we end up on various tradeoffs, but I suspect your intuition regarding Address Memory is correct.

I'm not certain, however, that the instruction cache is best left with a single port. Mirrored writes to the iCache can translate into substantial demand for bandwidth there. A series of absolute RMW instructions in sequence, for example, can easily overwhelm a single-port iCache with frequent instruction fetches and conflicting mirrored data writes. A Dual-port RAM would alleviate this bottleneck. (The same can be achieved with a deeper iCache write-buffer, but that too is not without its costs. If we go to the trouble of time-multiplexing dual-ports on Data Memory, it seems only sensible to do the same for Program Memory).

Quote:
Given that the working set of a 6502 program is generally small, I would think that a 256 x 32 instruction cache and a 1024 x 32 data cache would likely yield performance similar to what you describe above from your simulation. These caches could be direct mapped, or you could consider using a two-way set-associative organization for the instruction cache to improve the hit rate. A write through data cache policy could be used to good effect. A 32-bit memory could fill a cache line in either cache in a single read cycle.
The working-set of 6502 code is indeed small, particularly relative to today's synch RAMs (even the most reasonably priced 32-bit RAMs can comfortably accommodate several banks of 64K). Hit rates are therefore likely to be excellent for simple direct-mapped caches. That seems like a good option for a discrete design. But the tradeoff may be different for an FPGA with on-board RAM. Smaller caches with more sophisticated logic may be more appropriate in that case. I’ll have to think about a generic memory interface, so the memory sub-system can be tuned to either scenario as necessary. If the design allows the pipeline to stall on either reads or writes, then presumably the memory sub-system can manage things independently as needed.

Quote:
A data cache holding both the data and address memory areas like you described above is not likely to provide the same performance as your address memory provides by reading both halves of an indirect address pointer in a single read cycle using a dual address generator as you described. It would, however, not require the additional HW that your address memory requires in your current model.
Agreed. Probably a good tradeoff as discussed above.

Quote:
I am a bit short of time at the moment to delve into your processor's simulation model, otherwise I would be very interested in seeing how you modeled your pipelined processor core. As things are going, it's going to be quite a while before I have any free time away from work to study your processor model in detail.
Not to worry. Let me know if there is anything you'd like further detail on and I'd be happy to try to explain as best I can.

Quote:
Once again, I look forward to your posts. There's always something thought provoking in them. Looking forward to more posts on this project.
Thanks for the kind words Michael, and regards.

Quote:
Absolutely fantastic Drass!
Thanks BigEd! I very much appreciate your enthusiasm.

Quote:
I feel a new thread on this adventure might be a good investment. (A new pipelined microarchitecture isn't only good for a TTL implementation.)
Good point. I'll create a new thread on this when I get a chance and point here. I'll probably recycle the initial post above as an intro, rather than write something new.

Quote:
Drass, that's really great work, and I'm looking forward to reading some more of the details.
Thanks Dieter! And thanks for all the help as well.

Quote:
NMOS 6502 is rated 0.43MIPS at 1MHz.
So in theory, 100MHz * 0.43MIPS * 2.9 = 124.7MIPS, and that's really impressive.
//Considering that a superscalar MC68060 at 75MHz is rated 110MIPS.
I've not yet had a chance to consider the critical path in this design carefully. Intuitively, it seems that the PC incrementer is likely to be the culprit once again. All the more so in this case, since we don't know the instruction length to use when incrementing PC until the instruction is decoded. We will need a really fast decode RAM! :)

Quote:
The pipeline shown is actually similar to that of the 68040, though that had a different method of de-conflicting memory accesses
Thanks for the comment Chromatix. Can you elaborate on the "different method of de-conflicting memory" employed? It would be wonderful to uncover a new optimization we may apply to this design.

_________________
C74-6502 Website: https://c74project.com


PostPosted: Fri Nov 01, 2019 9:22 am 
Joined: Tue Sep 03, 2002 12:58 pm
Posts: 293
Drass wrote:
Intuitively, it seems that the PC incrementer is likely to be the culprit once again. All the more so in this case, since we don't know the instruction length to use when incrementing PC until the instruction is decoded.


An option here is to at least partially decode instructions as they enter the instruction cache. This doesn't need to be clever - you can decode every byte in the cache as if it was an opcode, whether it is or not.

You could have a single decoder, and load the cache with one byte at a time (hoping that they get re-used many times to cover the cost of loading the cache). Or you could have multiple decoders working in parallel. One 64Kx8 ROM will decode two bytes to 4 bits each, which is more than enough for the length. A wider ROM gives you more potential decoded signals.

With a wide enough decode cache, you might be able to eliminate the decode stage of the pipeline entirely.
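
A quick Python sketch of what building such a ROM looks like (the length function here is only a placeholder, not the real 6502 length table):
Code:
def build_predecode_rom(length_of):
    """Build the 64K x 8 ROM: the address is two adjacent program bytes,
    the data is their two instruction lengths packed as 4-bit nibbles,
    decoding every byte as if it were an opcode."""
    rom = bytearray(64 * 1024)
    for first in range(256):
        for second in range(256):
            rom[(first << 8) | second] = (length_of(first) << 4) | length_of(second)
    return rom

# Placeholder length function -- the real thing is the 256-entry 6502 table.
demo_length = lambda op: 1 + (op % 3)
rom = build_predecode_rom(demo_length)
print(rom[(0xA9 << 8) | 0x41] >> 4)      # predecoded length of the first byte (prints 2 here)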


PostPosted: Fri Nov 01, 2019 11:18 pm 
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Regardless of the precise number of pipeline stages, the 68040 integrates the effective-address calculation and operand fetch into the main pipeline ahead of the execution stage, which is the feature I was pointing out. The '040 has to stall instructions using the most complex 68K addressing modes which involve layers of indirection, and thus repeated cycles in the EA-calc and EA-fetch stages. The 6502 is not so complicated as that.

By comparison, modern x86 cores decode instructions with memory operands into separate fetch and execute operations, which are delivered to independent sub-pipelines with separate scheduling. This avoids having the entire pipeline stall due to a slow fetch while other instructions could be executing, but it is a major leap in complexity which is probably only justified with a superscalar design. In that case it could make sense to have a single memory-op pipeline alongside several execution pipelines, but with an ISA as register-poor as the 6502, that seems like a pipe-dream.

The '040 core has only a single port into the D-cache and another single port into the I-cache, which are sufficiently independent that an explicit cache flush is required before executing freshly written program code. Read-write contention at the D-cache is handled by providing a small write-back buffer, which is drained opportunistically into the cache when a read is not required in a particular cycle (which can be determined in advance by the control logic). Only if a write stalls due to the write-back buffer being full is a read denied, but that's actually a side-effect of the overall pipeline stall. I'm not quite sure how data in the write-back buffer is forwarded back into the pipeline if it is re-read; the cheapest way would be to stall the pipeline until the buffer drains and the cache is thus updated.


PostPosted: Sat Nov 02, 2019 3:22 pm 
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
John West wrote:
An option here is to at least partially decode instructions as they enter the instruction cache. This doesn't need to be clever - you can decode every byte in the cache as if it was an opcode, whether it is or not.
What a great idea — Dr Jefyll made a similar suggestion with reference to branch targets. In general, the notion of preprocessing instructions on the way in has huge potential. The cost is additional cache memory, which is not an insignificant thing of course.

Quote:
With a wide enough decode cache, you might be able to eliminate the decode stage of the pipeline entirely.
Wow, that’s great — conceptually, the decode stage moves to the end of the pipeline, after the Write-Back stage. It’s no longer part of the branch mis-prediction penalty at that point, which is significant. Very cool.

I’ve not yet grappled with the width of the microcode in this design, but this is certainly something to keep in mind.

Thank you!

_________________
C74-6502 Website: https://c74project.com


PostPosted: Sat Nov 02, 2019 4:19 pm 
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Chromatix wrote:
the 68040 integrates the effective-address calculation and operand fetch into the main pipeline ahead of the execution stage, which is the feature I was pointing out.
That does seem similar. Thank you for elaborating ...

Quote:
By comparison, modern x86 cores decode instructions with memory operands into separate fetch and execute operations ... but it is a major leap in complexity
Yup, that’s definitely above my pay-grade. :oops:

Quote:
The '040 core has only a single port into the D-cache and another single port into the I-cache, which are sufficiently independent that an explicit cache flush is required before executing freshly written program code.
I considered eliminating mirrored writes to the I-Cache, but self-modifying code is a prevalent practice for the 6502 (and it’s not uncommon to jump to modified code within just a few cycles!)

Quote:
Read-write contention at the D-cache is handled by providing a small write-back buffer, which is drained opportunistically into the cache when a read is not required in a particular cycle (which can be determined in advance by the control logic). Only if a write stalls due to the write-back buffer being full is a read denied, but that's actually a side-effect of the overall pipeline stall.
Actually, this is a very good description of how this design works as well — opportunistically draining a single-entry write-buffer when the D-Cache is idle. The difference here is that mirrored writes to the I-Cache create an additional opportunity for conflicts, this time with instruction fetch operations.

For that reason, among others, I’m currently leaning toward time-multiplexing separate read and write ports for both the I-Cache and the D-Cache. It’s an open question, however, whether construction issues will allow the bus to change directions at twice the pipeline clock-rate. We’ll have to see about that... :)

Quote:
I'm not quite sure how data in the write-back buffer is forwarded back into the pipeline if it is re-read; the cheapest way would be to stall the pipeline until the buffer drains and the cache is thus updated.
The P7 pipeline solution is to detect hits on the write-buffer, and forward the data to the read operation. This avoids the stall, but at the cost of some complication.

_________________
C74-6502 Website: https://c74project.com


PostPosted: Sat Nov 02, 2019 4:57 pm 
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
MichaelM wrote:
I am wondering if your pipelined processor core would be better served by a small single port instruction cache and instruction prefetch buffer as you described above, along with a small dual port data cache.
Inspired by Michael’s comment, I ran a test to count the number of times that fetching an address would trigger a pipeline stall if addresses were stored in the D-Cache (“dCache Conflicts” below). The result is revealing:
Code:
------------ Run Stats ------------
Cycles:          500000
NMOS Cycles:    1425453
- - - - - - - - - - - - - - - - - -
Instructions:    463242   (93%)
Data Stalls        1482   (0%)
Control Stalls:    1128   (0%)
Delay Slots:      13602   (3%)
Generated:        12855   (3%)
Cache Busy:        7692   (2%)
Total:           500000   (100%)
- - - - - - - - - - - - - - - - - -
Data Fwd:        131655   (26%)
Operand Fwd:       3765   (1%)
- - - - - - - - - - - - - - - - - -
Branches:         76512   (17%)
Taken:             4995   (7%)
btb Hit:          75322   (98%)
Mispredictions:     282   (0%)
- - - - - - - - - - - - - - - - - -
bCache Access:        2   (0%)
dCache Conflicts:    48   (0%)

----------- Performance -----------
CPI:               1.08
NMOS CPI:          3.08
Boost Factor:      2.9x

Just 48 dCache Conflicts over a half-million cycles! It turns out that Address Memory is indeed a luxury, at least for the Dormann Test Suite. It will be interesting to see how this bears out in other code.

Thank you Michael.

_________________
C74-6502 Website: https://c74project.com

