
All times are UTC




PostPosted: Wed Aug 07, 2019 11:09 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(Some nice links there!)

Picking up on the idea that a fast CPU which outpaces bulk memory may benefit from something to assist memory accesses: have you considered a local (on-board) RAM for page zero and page one? Not only are they commonly accessed, but they are also often accessed twice over, so a local version which allowed unaligned 16 bit accesses would be an additional win.

Also, if the writeback of a result to page zero or one goes via a different path, it can overlap with the fetch of the next instruction.


PostPosted: Wed Aug 07, 2019 11:31 am 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
//Collecting just a few links sure took me some weeks.

If the CPU features 6502 plus K24\65816 modes and can switch between the two,
the problem here is that zero page and page one could be located in any 64kB block of the 16MB memory.

With 65816 and K24, any of the 256 pages within a 64kB block could be the "zero page".

Using 3 or 4 blocks of 16MB RAM (each with its own address decoder), for being able to handle instruction fetch
plus zero page access plus stack push/pull plus data R/W in parallel sure does look like a nice idea...
...while keeping those 4 blocks consistent during a write cycle does not.
And implementing something like that would need quite some PCB space.

Need to give this some more thought.

BTW: in the 6502 world, 16 Bit pointers on the stack and in the zero page are not meant to be word\longword aligned.


PostPosted: Wed Aug 07, 2019 11:40 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Indeed, no alignment constraint. So the (small) memories need to handle unaligned pairs. So it's not so much a 16 bit wide memory as a pair of banks - one even, one odd.
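
A minimal sketch of the even/odd bank idea, as a behavioural Python model rather than a hardware design. The class name `EvenOddMemory`, the 512-byte size, and the wrap-around at the end of the memory are assumptions made for illustration; nothing here comes from an actual schematic. It shows why each bank needs its own row address: for an odd start address the high byte lives one row further along in the even bank.

```python
class EvenOddMemory:
    """Small memory split into an even-address bank and an odd-address bank."""

    def __init__(self, size=512):
        # Assumed size: two 256-byte pages (e.g. page zero and page one).
        assert size % 2 == 0
        self.size = size
        self.even = bytearray(size // 2)  # holds addresses 0, 2, 4, ...
        self.odd = bytearray(size // 2)   # holds addresses 1, 3, 5, ...

    def write8(self, addr, value):
        """Write one byte; the address LSB selects the bank."""
        addr %= self.size
        bank = self.odd if addr & 1 else self.even
        bank[addr >> 1] = value & 0xFF

    def read16(self, addr):
        """Read a little-endian 16-bit value from any address in one step.

        Both banks are read at once, each with its own row index.
        """
        addr %= self.size
        rows = self.size // 2
        if addr & 1:
            # Odd start: low byte in the odd bank, high byte in the NEXT even row.
            lo = self.odd[addr >> 1]
            hi = self.even[((addr >> 1) + 1) % rows]
        else:
            # Even start: both bytes sit in the same row of their banks.
            lo = self.even[addr >> 1]
            hi = self.odd[addr >> 1]
        return lo | (hi << 8)
```

Note that a real 6502 wraps zero page pointers at $FF back to $00; this model only wraps at the end of the whole memory, which is a simplification.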

As for consistency, I'm not sure there's any need - oops, see below. It is necessary to allow general accesses to be directed to the small memories though, so a comparator is needed.

Hmm, but you're right, in a world where direct page and stack page move around, there is a problem. Hmm.
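
To make the comparator concrete, here is a rough Python sketch of the routing decision. It assumes 24-bit addresses and that the direct page and stack page each start on a 256-byte boundary; the function name `routes_to_local` and both base parameters are hypothetical. With movable pages, the two bases become registers that must be compared against every access, which is where the speed of the comparator starts to matter.

```python
def routes_to_local(addr, direct_page_base, stack_page_base):
    """Return True if a 24-bit address falls in either tracked 256-byte page.

    Assumption: both bases are page-aligned, so only the upper 16 address
    bits need comparing (two 16-bit equality comparators in hardware).
    """
    page = addr & 0xFFFF00
    return page in (direct_page_base & 0xFFFF00, stack_page_base & 0xFFFF00)
```

For a plain 6502 the bases would be fixed at $000000 and $000100, and the comparison collapses to checking that the upper address bits are zero.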


PostPosted: Wed Aug 07, 2019 12:02 pm 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
2 Banks 16 Bit (maybe each bank with its own address decoder) for handling non_aligned pointers, that's a nice idea.

BigEd wrote:
so a comparator is needed

Which had better not be an agonizingly slow 74HCT688 or 74F521.

BTW: The biggest and fastest synchronous dual port SRAM which Mouser currently (August 2019) has in stock is the IDT 70T3509MS133BPGI.
1M*36, 133MHz, ball grid array package, 2.5V supply, 321€.
...it was worth a try. :lol:


PostPosted: Wed Aug 07, 2019 12:19 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Another possible thought is to use a pair of 32k banks to cover the whole of Bank 0. That solves the problem of direct page and stack page moving around, but if you happen to be running code from Bank 0 it doesn't decouple fetch from write.

However... if you had a write buffer (maybe as little as a single byte) then you could again often overlap writeback with fetch. This adds complexity, no doubt. I feel a high performance 6502 needs a separate memory subsystem, with separate ports for data and program, and maybe also separate for direct page and stack page.


PostPosted: Wed Aug 07, 2019 2:05 pm 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
Hmm...
The usual approach seems to be building some memory around the CPU.
It appears that we would have to build some CPU around the memory instead.


PostPosted: Wed Aug 07, 2019 8:38 pm 

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Interesting discussion.

If the objective is to increase total processing speed, and the memory access speed is the limiting factor, then the approach of the CDC 6600 Peripheral Processors (PPs) may be a very reasonable compromise. It certainly allows a pipelined processor implementation to simultaneously support more than one thread of execution without a significant increase in the HW required.

Pipelining the processor, while increasing the number of simultaneously executing threads, will yield improvements in the overall clock frequency of the processor without too much effort. With a pipeline depth of n, the additional memory modules can each operate at 1/n of the pipeline clock rate, and a time division multiplex of the read data from each of the memory modules is very feasible.

This "barrel processor", or c-slow re-timing, approach does not seem to be very popular, but it does work. Although I don't know for sure that Parallax Propeller chips use this approach, it does seem plausible from the descriptions of their operation that I've read.

If the objective is to run standard TTL at as high an operating frequency as possible, is it out of bounds to let the processor run n threads at 1/n of the clock frequency? Implementing a "barrel processor" like the CDC 6600 PPs using TTL running at 100 MHz would beat the performance of the PPs by a factor of 10. An 8-16 deep pipeline would certainly allow the delays of the various TTL components to be compensated for.
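
As a rough illustration of the "n threads at 1/n of the clock frequency" idea, here is a small Python sketch of strict round-robin issue. The function names and the example numbers in the usage note are hypothetical; it only models which thread owns each clock slot, not an actual pipeline.

```python
def barrel_schedule(n_threads, n_cycles):
    """Return the thread id that issues in each clock cycle (strict round-robin)."""
    return [cycle % n_threads for cycle in range(n_cycles)]


def per_thread_rate_mhz(clock_mhz, n_threads):
    """Effective issue rate seen by a single thread of a barrel processor."""
    return clock_mhz / n_threads
```

For example, `barrel_schedule(4, 8)` gives `[0, 1, 2, 3, 0, 1, 2, 3]`: each thread gets every fourth slot, so the hardware between two slots of the same thread has four clock periods to settle, while the aggregate rate over all threads stays at the full clock.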

_________________
Michael A.


PostPosted: Thu Aug 08, 2019 2:36 am 

Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
MichaelM wrote:
If the objective is to run standard TTL at as high an operating frequency as possible, is it out of bounds to let the processor run n threads at 1/n of the clock frequency?
Nothing is out of bounds. :) One of the objectives here is to collect various ideas that might contribute to a faster TTL CPU, either in the current incarnation, or in the future. All ideas are fair game and very much welcome.

Quote:
Implementing a "barrel processor" like the CDC 6600 PPs using TTL running at 100 MHz would beat the performance of the PPs by a factor of 10. An 8-16 deep pipeline would certainly allow the delays of the various TTL components to be compensated for.
Thanks for the pointers to the CDC 6600 PPs — very interesting!

The current TTL design uses a four stage pipeline and may be capable of running at 100MHz (at least on paper; a lot of work remains to see if this bears out). The CPU supports multiple instruction-sets, so the notion is to keep the 6502 and 65C02 cycle-accurate, and perhaps try to push the envelope further with the 65816-K24 instruction-set. This can be by reducing cycles per instruction, increasing parallel processing, or any other means we can devise. Since the 6502 is so memory bound, some scheme will likely be required to better manage access to memory.

BigEd wrote:
I feel a high performance 6502 needs a separate memory subsystem, with separate ports for data and program, and maybe also separate for direct page and stack page.
Concurrent access to these separate elements, along with a wider bus, seems promising. Dr Jefyll also suggested “offset-memory”, with ODD and EVEN banks, to deal with unaligned 16-bit reads. All good ideas.

_________________
C74-6502 Website: https://c74project.com


PostPosted: Thu Aug 08, 2019 6:03 am 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
Barrel processor, I had considered this too, but there are two downsides to this concept:
One would need multiple status registers, and it won't speed up legacy 6502 code.

;---

More info about the CDC 6600 is on Whygee's homepage.

Gave some more thought to "going superscalar".

It became obvious that a truly superscalar TTL 6502 would be huge, and from the logic design point of view it would be very complicated.
Long lines (long PCB traces) and big lumps of conventional logic in the design would be killing speed.
We are only a team of three self-educated hobbyists (none of us studied computer science at a university, that is).
Feels like a project of that sort would require 12 engineers, 4 coders and a janitor.

So we better forget the thought of building a superscalar TTL 6502.

;---

For running legacy 6502 code in a cycle compatible way, we could expect the CPU to have a MIPS rating similar to a 100MHz i486DX2.
When running 6502 code with an instruction buffer and a 16Bit (or 24 Bit) address adder, I think we could expect a performance boost of 25% or such,
that's a conservative estimate, somebody had better check.

My advice here is to give the TTL CPU a "native mode", in which it has a RISC\VLIW instruction set,
and then to have something like "an optimizing compiler" that makes a static translation of 6502 code to RISC\VLIW.
Checking dependencies and making efficient use of the functional blocks within the CPU then is done by software
instead of hardware, and this will save us a lot of grey hairs.

When taking this path, there are two different approaches: to swim with the sharks, or to walk with the dinosaurs.

0) A VLIW instruction set, that resembles vertical microcode.

In fact, it _would_ be a form of "compressed" vertical microcode.
We would have some circuitry that just takes the longword (or long long word) from the Instruction Buffer
and translates it into horizontal microcode while bypassing the CPU microcode RAM, and be done with it.

SHARC DSPs from Analog Devices have a neat VLIW instruction set,
and back in 2000 I did a little bit of ADSP21065L assembly coding for a living.

1) An instruction set where a bunch of simple, atomic RISC instructions is packed into a longword (long long word).

That would be something like EPIC or IA-64.

Reminds me of the ill-fated TREX project. //32 Bit TTL RISC Experiment
"A lumbering monstrosity, supposed to be extinct, but still able to give people the creeps."
TREX had a 32 Bit instruction word, and 13 instructions in total.
If Bit 31 was 0, the 32 Bit word triggered a JMP\JSR... while being loaded into PC.
If Bit 31 was 1, there were three atomic RISC instructions in the 32 Bit word.
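
A hedged Python sketch of how such a word could be split. Only the role of Bit 31 is taken from the description above; the field layout is an assumption made purely for illustration (three 10-bit slots in bits 29..0, bit 30 ignored, jump target in the low 31 bits), and the real TREX encoding may well differ. The name `decode_trex_word` is hypothetical too.

```python
def decode_trex_word(word):
    """Split a 32-bit word into either a jump target or three instruction slots."""
    word &= 0xFFFFFFFF
    if (word >> 31) == 0:
        # Bit 31 = 0: the word is loaded into PC (JMP\JSR...).
        return ("jump", word & 0x7FFFFFFF)
    # Bit 31 = 1: three atomic instructions packed into one word.
    # ASSUMPTION: 10-bit slots at bits 29..20, 19..10 and 9..0; bit 30 unused.
    slots = tuple((word >> shift) & 0x3FF for shift in (20, 10, 0))
    return ("bundle", slots)
```

With 13 instructions in total, a 4-bit opcode per slot would be enough, which under this assumed layout leaves 6 bits per slot for operands.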

The TREX instruction set is explained here.
Getting CISC features by creatively grouping atomic RISC instructions is described there.

Funny thing is that the TREX ISA is screaming "build me superscalar".

TREX only supported 32 Bit data types.
When emulating a 6502\K24\65816, maybe we are going to need 8, 16, 24 Bit data types, too.


PostPosted: Thu Aug 08, 2019 6:48 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
MichaelM wrote:
Implementing a "barrel processor" like the CDC 6600 PPs using TTL running at 100 MHz would beat the performance of the PPs by a factor of 10. An 8-16 deep pipeline would certainly allow the delays of the various TTL components to be compensated for.

Just to note: each pipeline stage costs an insertion delay, the time to traverse the flop (the time for data to leave plus the time for data to enter: clock-to-q and setup time.) So, if in your chosen technology the insertion delay is 4ns, when running at 50MHz you get 16ns of logic time. When running at 100MHz you get only 6ns of logic time - much less than half. So, in a given technology there's some kind of sweet spot as to how short it's worthwhile to make your clock period, and therefore how long it's worthwhile to make your pipeline.
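
The arithmetic above can be checked with a few lines of Python. This is only the simple model from the paragraph (clock period minus insertion delay); the function names are made up for the example, and real designs also lose time to clock skew and wiring, which is ignored here.

```python
def logic_time_ns(clock_mhz, insertion_delay_ns):
    """Time left for combinational logic in one pipeline stage, in ns."""
    period_ns = 1000.0 / clock_mhz
    return period_ns - insertion_delay_ns


def useful_fraction(clock_mhz, insertion_delay_ns):
    """Share of each clock period that is available to logic."""
    period_ns = 1000.0 / clock_mhz
    return logic_time_ns(clock_mhz, insertion_delay_ns) / period_ns
```

With a 4ns insertion delay, `logic_time_ns(50, 4)` gives 16.0 and `logic_time_ns(100, 4)` gives 6.0, so the useful share of the period drops from 80% to 60% when the clock is doubled.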

There's also the cost of the pipeline flops: in chips, in board area, in power.

Having said which, it might indeed make sense, in a TTL processor, to build a pipeline much deeper than the usual 4 or 5 stages - just note that there are tradeoffs. (Quite possibly you know this already!)


PostPosted: Thu Aug 08, 2019 7:07 am 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
Of course a deeper pipeline would allow us to maximize clock speed. :)

But some of the tradeoffs are problems\delays when it comes to a change in program flow,
also the 6502 doesn't have many registers so dependencies between the instructions
when it comes to register usage certainly would be limiting the fun.

A 4 level pipeline already gave us enough of a headache.
Intel sure can afford to have a 20 level deep pipeline just for the ALUs or such.

In theory, the sharks and the dinosaurs might give us a chance to stick with that 4 level pipeline
at 100MHz PHI2 while getting a MIPS rating close to a 100MHz P5 Pentium (when doing it right),
that's my point.


PostPosted: Thu Aug 08, 2019 2:51 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
One thing that a "barrel" multithreaded design might enable is the use of SDRAM with RAS-CAS delays without an overall throughput penalty, even without pipelining. Typical SDRAM has 4 independent banks, a 2-cycle CAS latency at lowish clock speeds (100MHz is feasible), and additionally a RAS delay whose magnitude I forget right now. Idle cycles of each stage of the CPU can be used to execute DRAM refresh cycles. The cost of each SDRAM chip, which usually provides 16-bit bus width, is also much lower than a fast Sync-SRAM.

What you then get is four independent CPUs running in four independent memory spaces on the same lump of hardware. Making good use of that programmatically would be somewhat interesting in itself.
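
A back-of-the-envelope Python model of why the barrel arrangement can hide SDRAM latency. It assumes strict round-robin issue and that each thread talks only to its own bank; the function name and the latency values in the usage note are hypothetical, since the actual RAS delay is not given above.

```python
import math


def cycles_between_accesses(n_threads, latency_cycles):
    """Clock cycles between two memory accesses of the same thread.

    A thread only gets a slot every n_threads cycles. If the data is back
    by its next slot, the latency is fully hidden; otherwise the thread
    has to skip whole rotations until the data has arrived.
    """
    rotations = max(1, math.ceil(latency_cycles / n_threads))
    return rotations * n_threads
```

With four threads, any total access latency up to 4 cycles costs nothing extra (`cycles_between_accesses(4, 4)` is 4), while a 5-cycle latency would cost each thread a whole extra rotation (`cycles_between_accesses(4, 5)` is 8).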


PostPosted: Fri Aug 09, 2019 5:41 am 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
The "barrel processor" concept sure is good for fast I\O,
but I had no luck with talking other hobbyists into building a non_proprietary Ethernet controller.

Nowadays, one can buy a Gigabit Ethernet controller like the WG82574L for 5€, but it's PCIe,
which means that what goes into the chip and what comes out is 8b/10b encoded.
TTL 8b/10b encoding ain't trivial at 100MHz.

;---

When expecting a TTL CPU to handle bursts of data (reads\writes with linearly increasing memory address), SDRAM sure looks interesting.
Like when you are out to do digital signal processing, number crunching, displaying graphics and such.

When expecting a TTL CPU not to handle bursts of data but to do a lot of random memory reads\writes,
then maybe one would have to build a non_proprietary TTL SDRAM controller (which takes some PCB space),
and then at 16MB, synchronous SRAM might be a better and less expensive solution than SDRAM.

Micron MT61M256M32JE-12, RLDRAM2: eight banks and "SRAM type bus interface".
But you never know, for how long a special sort of DRAM stays in production.

Micron GDDR6 design guide, the PCB layout considerations starting at page 9 are an interesting read.


PostPosted: Fri Aug 09, 2019 6:41 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
I see you can get fast wide SRAM with byte-write-enables:
CY7C1345G (8ns)
IDT71V35761S (3.5ns)
Are they any use?


PostPosted: Fri Aug 09, 2019 7:13 am 

Joined: Fri Nov 09, 2012 5:54 pm
Posts: 1431
IDT71V35761S isn't a flow_through synch SRAM.
Random reads take two clock cycles, that's one clock cycle too many.

CY7C1345G 128k*36 does random reads in one clock cycle.
From the datasheet, maybe that chip would make a nice microcode RAM.

