A pool of odd ideas for speeding up a 6502 architecture

ttlworks · Post by **ttlworks** » Tue Aug 02, 2016 8:41 am

Some text from the "Hardware section" of the forum
related to speeding up a hypothetical 6502 compatible TTL CPU
by messing with the architecture...
viewtopic.php?f=4&t=3493&start=150&sid= ... 9c3#p46625

BigEd suggested that I post it here, too:

;------

In the cycle after an instruction fetch, the ALU is sitting idle.
So it would be possible to keep the ALU outputs and some of the
ALU inputs in latches and to do the flag evaluation in the cycle
after an ALU operation (like in M02, which wasn't timing-compatible with the 6502).
But this would complicate the branch logic, especially when a "branch not taken"
still has to complete within two cycles.

Getting rid of the multiplexers in front of the registers
by using bypass latches and "register renaming".
But this would complicate the interrupt logic.

Then, there would be "out of order execution".
We have that circuitry for calculating A&X and DP+1 for the UFOs.
("Unforeseen Opcodes" by the designers of the NMOS 6502).
Now imagine having some circuitry that calculates X+1 and X-1
(plus the flag results) whenever X is loaded... same thing for Y, A, etc.
When reading the instruction stream a little bit ahead,
and making use of some "register renaming" tricks (for flags in the
status register too), such things as INX, DEX etc. could be done
"in the background".

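A minimal behavioural sketch of the "precomputed X+1 / X-1" idea, written in Python purely for illustration. The class name `ShadowedRegister` and its methods are hypothetical, not taken from any real design; the only 6502 fact assumed is that INX/DEX set N and Z from the 8-bit result.

```python
from dataclasses import dataclass
from typing import Tuple


def nz_flags(value: int) -> Tuple[bool, bool]:
    """Return (N, Z) for an 8-bit value, the way INX/DEX set them."""
    return bool(value & 0x80), value == 0


@dataclass
class ShadowedRegister:
    """Hypothetical register with its +1 and -1 neighbours kept ready."""
    value: int = 0x00
    inc: int = 0x01   # precomputed value + 1
    dec: int = 0xFF   # precomputed value - 1

    def load(self, value: int) -> None:
        self.value = value & 0xFF
        # "Background" circuitry: both results exist before INX/DEX is seen.
        self.inc = (self.value + 1) & 0xFF
        self.dec = (self.value - 1) & 0xFF

    def increment(self) -> Tuple[bool, bool]:
        """INX-like step: just select the precomputed result."""
        self.load(self.inc)
        return nz_flags(self.value)

    def decrement(self) -> Tuple[bool, bool]:
        """DEX-like step: just select the precomputed result."""
        self.load(self.dec)
        return nz_flags(self.value)
```

In hardware, `increment` would be a multiplexer select rather than an ALU pass, which is the whole point: the add has already happened by the time the opcode arrives.
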
Which brings up another interesting but complicated topic:
Instruction and data prefetching.
BTW: 65CE02 was microcoded, but the datasheet mentions that the code
would run up to 25% faster than on a mere 6502 running at the same
clock frequency.
Interesting thing is, that on the "secret weapons of Commodore" C65 page
there is a line of text that says that the designer of the 4510
(which had a 65CE02 core) later "went on to design the K7 for AMD".

Hmm... branch prediction. Like calculating the effective address
resulting from a Bxx instruction taken in advance...

Which would complicate branch instructions a bit, of course.

Oh, and there also are tricks to speed up the microcode...
the simplest trick would be using fast RAMs instead of ROMs,
and to load them from a serial EEPROM after a hardware reset
(hey, that's the _standard_ when using FPGAs).
But it also would be possible to have the "microcode ROM" rolled out "flat",
which means the "ROMs" just have 8 address lines fed by the OpCode,
but the "ROM" outputs are much wider than 32 Bit...
and then to tap into them by using fast multiplexers controlled by a state machine...
like in X02.

So the propagation delay of the "ROMs" only would kick in after fetching
the OpCode...
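
To make the "flat" microcode idea a bit more concrete, here is a small Python model. It is only a sketch: the control word width, the number of steps and the control word values are all made up for illustration, and `pack_steps` / `select_step` are hypothetical helper names. The point is that the wide "ROM" is looked up once per OpCode, and a multiplexer driven by the step counter then picks the slice for each cycle.

```python
CONTROL_BITS = 8   # width of one control word (assumed for this toy)
MAX_STEPS = 4      # cycles per instruction covered by one wide word (assumed)
MASK = (1 << CONTROL_BITS) - 1


def pack_steps(steps):
    """Pack the per-cycle control words of one instruction into one wide word."""
    if len(steps) > MAX_STEPS:
        raise ValueError("too many steps for one wide ROM word")
    wide = 0
    for index, control in enumerate(steps):
        wide |= (control & MASK) << (index * CONTROL_BITS)
    return wide


# One entry per OpCode; the control word values are invented placeholders.
FLAT_ROM = {
    0xE8: pack_steps([0x11, 0x22]),        # INX
    0xA9: pack_steps([0x31, 0x32, 0x33]),  # LDA #imm
}


def select_step(wide_word, step):
    """The 'fast multiplexer': pick one control word out of the wide word."""
    return (wide_word >> (step * CONTROL_BITS)) & MASK
```

Only the dictionary lookup models the slow "ROM" access; every later step of the instruction costs just one `select_step`, i.e. one multiplexer delay.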

...Which brings us to "Barrel Processors".
Where the idea is to have a fast ALU and instruction decoder,
then to have multiple register sets (including multiple status registers)
to make efficient use of that speed...
from the hardware point of view, it's still a single CPU of course...
but for the "end user", it just might look like a dual or quad core.
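
A toy Python model of the barrel idea, as a sketch only: one shared "ALU" and several register sets that take turns, one per clock. The names `Context` and `BarrelCore` are hypothetical, and each "program" is reduced to a list of values to add to A, which is of course nothing like a real instruction stream.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """One register set; a real design would also hold X, Y, SP and P."""
    a: int = 0
    pc: int = 0


class BarrelCore:
    def __init__(self, programs):
        self.programs = programs                      # one program per thread
        self.contexts = [Context() for _ in programs]
        self.current = 0                              # whose turn it is

    def tick(self):
        """One clock: the shared ALU works for the current thread only."""
        ctx = self.contexts[self.current]
        program = self.programs[self.current]
        if ctx.pc < len(program):
            ctx.a = (ctx.a + program[ctx.pc]) & 0xFF
            ctx.pc += 1
        # Strict round-robin: the next register set gets the next clock.
        self.current = (self.current + 1) % len(self.contexts)
```

With two programs, each thread sees every second clock, so to the "end user" it looks like two cores running at half the ALU speed.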

;------

But another problem is, that when getting past 20MHz,
the address decoder and the peripheral chips might be getting too slow.

Also, implementing such things as mentioned above tends to be quite
difficult and labor intensive, because when trying to increase speed,
the complexity of the CPU design will increase in a nonlinear fashion.

Oh... and it's all half_baked stuff, of course.

BigEd · Post by **BigEd** » Tue Aug 02, 2016 9:48 am

Great list!

Quote:

do the flag evaluation in the cycle after an ALU operation

In fact Arlet did something a bit like this and got a healthy speedup:

Quote:

With some small optimizations to the ALU flag handling, I could increase speed to 125 MHz. Pushing it harder with SmartXplorer, I got 133 MHz

which brings into play another category of speedup: buy a faster FPGA, and use the tools to their best effect.

Quote:

...branch prediction. Like calculating the effective address resulting from a Bxx instruction taken in advance

As an extreme case, hoglet and I found out the other week that the ARM in the Raspberry Pi can manage a zero-cycle branch in this way: the simplest delay loop is two instructions and was only taking one tick for each time around the loop.

Just for reference, there's an ongoing effort to make a superfast 6502 microarchitecture as linked in this thread. That design also has a wide instruction prefetch buffer which allows the possibility of decoding an entire instruction in one cycle, or even more than one instruction - an idea we also see in this thread.

ttlworks · Post by **ttlworks** » Tue Aug 02, 2016 10:59 am

First, I'll have to admit that I'm sort of an "old fashioned hardware guy",
and to me, toying with VHDL feels like coding...
and I don't like coding.

Back in 2008, I had toyed a little bit with Altera EP3C16:

Ignore that chip below the FPGA at the bottom of the PCB, it's only a PIC microcontroller.
A PCB with voltage regulators was plugged on top,
and another PCB with quite a few 74LVC245s (3.3V to 5V logic level converters)
was plugged at the bottom.

What kept me away from FPGAs were the somewhat impractical IC packages...
and the IDEs.

;---

In fact, I haven't paid too much attention to the FPGA section of the forum, sorry.

Thanks for mentioning that Arlet already did "small optimizations to the ALU flag handling".

BigEd · Post by **BigEd** » Wed Aug 03, 2016 6:42 am

Another idea which has come up before is a faster path to zero page - and possibly the stack - where a two-byte wide access could help remove a cycle from many opcodes. See for example Bregalad's ideas in this thread:
viewtopic.php?p=26294

(And a write-back buffer seems tempting too - writes can be delayed while important reads proceed. If there's an instruction buffer and a fast path to stack and zero page, then there will be spare time for the writes to be committed before too long, and so write operations can take zero cycles.)
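
A rough Python sketch of such a write-back buffer, under some assumptions of my own: a fixed buffer depth, one committed write per idle bus cycle, and forwarding of pending writes to later reads (without that, a read could see stale data). `WriteBuffer` and its method names are hypothetical.

```python
from collections import deque


class WriteBuffer:
    def __init__(self, memory, depth=4):
        self.memory = memory      # backing store, e.g. a bytearray
        self.depth = depth        # assumed number of buffer entries
        self.pending = deque()    # (address, value), oldest first

    def _commit_oldest(self):
        addr, value = self.pending.popleft()
        self.memory[addr] = value

    def write(self, addr, value):
        """Queue a write. Returns True only if the CPU had to stall."""
        stalled = False
        if len(self.pending) >= self.depth:
            self._commit_oldest()   # buffer full: this write costs a bus cycle
            stalled = True
        self.pending.append((addr, value & 0xFF))
        return stalled

    def read(self, addr):
        """Reads go first; the newest pending write to addr is forwarded."""
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory[addr]

    def idle_cycle(self):
        """Bus not needed for a read: commit one delayed write."""
        if self.pending:
            self._commit_oldest()
```

As long as idle cycles turn up often enough, `write` never returns True, which corresponds to the "zero cycle" writes mentioned above.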

ttlworks · Post by **ttlworks** » Wed Aug 03, 2016 8:54 am

// Nice diagram !

Since we happen to have 16 Bit pointers in the zero page and in the stack area:
if block RAM is cheap it certainly would make sense to have zero page
inside the CPU as 16 Bit RAM.

Same thing for the stack area, especially if the CPU happens to have
stackpointer relative addressing modes.

I think that dual port RAM would increase the speed here, of course...

;---

Downsides:

When trying to stay binary compatible with the 6502,
unfortunately not all of the pointers would be aligned to even addresses
(making it necessary to either increase complexity a bit or to sacrifice speed).
...telling your assembler or compiler to keep pointers in the zero page aligned
to even addresses might be possible, but how would this fit with code which already
exists?

Also, since the zero page tends to be a crowded place (in the C64, for instance),
one might want to use other pages in memory as a zero page to compensate for this...
or to use a direct page register for relocating the zero page like in the 65816.
BTW: in the 65816, the stack pointer can be 16 Bit.

;---

Hmm...now that's a good point:

When trying to speed up a 6502 architecture by modifying it,
how much backwards compatibility to original 6502 machine code
should be there ?

When tinkering too much with the ISA, there is a good chance
that the end result would "smell" a little bit like Motorola 88k.

BigEd · Post by **BigEd** » Wed Aug 03, 2016 9:12 am

Oh, I'd stick to speeding up existing 6502 architectures - there's a whole load of ideas for extending the architecture and for me that's a different conversation.

For zero page and stack, I'd recommend two 8-bit wide memories rather than a single 16 bit wide memory - as you say, half the accesses would be unaligned. As it happens the block RAMs on xilinx FPGAs are 2k bytes and dual-ported, so you could use two ports of one block RAM for this. (Edit: but an even RAM and an odd RAM would work perfectly well.)
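
A small Python sketch of the even/odd arrangement, to show why unaligned pointers stop being a problem: each bank is touched exactly once per pointer fetch, only the index differs. The class name `SplitZeroPage` is hypothetical; the wrap from $FF to $00 follows the usual 6502 zero page behaviour for indirect pointer fetches.

```python
class SplitZeroPage:
    """Zero page as two 128-byte banks: even addresses and odd addresses."""

    def __init__(self):
        self.even = bytearray(128)
        self.odd = bytearray(128)

    def write_byte(self, addr, value):
        bank = self.odd if addr & 1 else self.even
        bank[(addr & 0xFF) >> 1] = value & 0xFF

    def read_pointer(self, addr):
        """Fetch a 16-bit pointer; both banks are read in the same cycle."""
        lo_addr = addr & 0xFF
        hi_addr = (lo_addr + 1) & 0xFF   # wraps inside the zero page
        if lo_addr & 1:
            # Unaligned: low byte from the odd bank, high byte from the
            # next even location, so the even bank index is one higher.
            lo = self.odd[lo_addr >> 1]
            hi = self.even[hi_addr >> 1]
        else:
            # Aligned: same index in both banks.
            lo = self.even[lo_addr >> 1]
            hi = self.odd[hi_addr >> 1]
        return lo | (hi << 8)
```

The extra cost compared to a single 16 bit wide memory is an incrementer on one bank's address and a byte swap multiplexer on the outputs.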

Another idea is to make some use of distributed RAM - that's asynchronous rather than synchronous, so could in some cases save a cycle. If not implementing the whole stack, it would be possible to implement the top two or three elements as distributed RAM, more or less making them write-back registers. But this is going to be more complex.

The funny thing about FPGAs is that once you've chosen a chip, there's no special advantage in making your design smaller than it needs to be to fit. But if you go bigger, you need to take action.

ttlworks · Post by **ttlworks** » Wed Aug 03, 2016 12:02 pm

Quote:

Oh, I'd stick to speeding up existing 6502 architectures - there's a whole load of ideas
for extending the architecture and for me that's a different conversation.

Since I'm not much of a coder, I sure understand the need to stay compatible with old code...
when running my beloved DOS box on Pentium 4, for instance.

Nevertheless I'd like to mention that the different sizes of address and data sure add
a lot to the complexity of the instruction decoder\sequencer in the 6502.
If all registers (and data types) would be 16 Bit, if all instructions would be conditional,
if PC and P just would be registers like the others and if instructions would have the option
for pushing the contents of a register on stack before the register is written,
IMHO the control circuitry required for such a CPU would be pretty compact.
...But I'm getting off topic.

When the 6502 was invented, RAM and ROM were small, slow, and expensive...
so pulling any trick possible to save memory made sense, even at the expense of a more sophisticated
instruction decoder\sequencer... by using a CISC instruction set maybe.
Also, hobbyists usually had no compilers (you were happy if you just had an assembler),
and I think that peripherals with SPI interface were widely unknown back then.

Unfortunately, the world has changed since 1975: memory is fast, big and inexpensive,
speeding up CISC is more fuss than speeding up RISC, it became hard to get around using C,
and buying (for instance) A\D and D\A converters with a parallel interface became a little bit difficult.

For a design that old, the 6502 still does well, but maybe a few features could be added...

Quote:

For zero page and stack, I'd recommend two 8-bit wide memories rather than a single 16 bit wide memory

Yes, that's a better idea, but alignment still might require some tricks...

Using registers for the top two or three elements of the stack is a nice idea,
(reminds me of Forth hardware CPUs), unfortunately those registers would have to be
"transparent" to the "end user" because of code like

Code:

TSX
LDA $0101,X

Edit: the 6502 uses pre_increment (on pull) and post_decrement (on push), which means that
unlike in most other CPUs, SP is decremented _after_ the Byte is written to the stack.
So by default, SP is supposed to point to an "empty" Byte.
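
The stack behaviour described here can be modelled in a few lines of Python. This is just a sketch with a hypothetical class name (`Stack6502`); `peek_top` mimics the TSX / LDA $0101,X sequence from above, which is why any "top of stack" registers would have to stay visible at those memory addresses.

```python
class Stack6502:
    def __init__(self):
        self.mem = bytearray(0x10000)   # flat 64 KB address space
        self.sp = 0xFF                  # SP points to the next "empty" Byte

    def push(self, value):
        """PHA-like: write first, then decrement SP (post_decrement)."""
        self.mem[0x0100 + self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def pull(self):
        """PLA-like: increment SP first, then read (pre_increment)."""
        self.sp = (self.sp + 1) & 0xFF
        return self.mem[0x0100 + self.sp]

    def peek_top(self):
        """Same access as TSX / LDA $0101,X: last pushed Byte, SP untouched."""
        x = self.sp
        return self.mem[0x0101 + x]
```

After one push, SP is $FE, the data sits at $01FF, and $01FE is the "empty" Byte that SP points to.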

Starting to remember, that when I tried to add some 65816 instructions to my M02 TTL CPU,
the hardware wasn't designed for that... so for a few of those 65816 instructions,
the microcode just used that "empty Byte" in the stack area as a "scratch pad" for temporary storing a Byte.
Don't try this at home, kids.

Quote:

The funny thing about FPGAs is that once you've chosen a chip, there's no special advantage
in making your design smaller than it needs to be to fit. But if you go bigger, you need to take action.

True, true... of course, you might want to have a complete computer on a chip at some point...

...maybe with more than one CPU core, too.

BigEd · Post by **BigEd** » Wed Aug 03, 2016 12:28 pm

I should clarify - I have lots of interest in modified and derived CPU architectures - I just think it would be off-topic in this thread to talk about that. I've just posted an index, or an initial attempt at one, over here.

ttlworks · Post by **ttlworks** » Wed Aug 03, 2016 2:32 pm

...is there a Z80 related list, too ?

;---

Now just my two cents on the 6502 instruction set:

I think that compilers and the 6502 would "fit better" together,
if the instruction set would support PC relative (relocatable) code,
and handling of SP relative data and pointers (for parameter passing).

Would be nice to have PEA,PEI,PER plus additional addressing modes
like 'd,s' and '(d,s),y' from the 65816.

BRL on a 6502 just would be "the icing on the cake".

I still wonder, why there is no BSR instruction in the 6502 and 65816
for PC relative subroutine calls... but when looking at the 65816
Opcode map, I have no clue where to put that BSR in.

The 8085 from 1976 had two pins for "serial transmitting/receiving data",
of course it would be nice to have SPI in a 6502...
and to be able to boot from serial ROM after a hardware reset
because buying ROMs with a parallel interface won't be getting
simpler in the future...

BTW: when trying to build sort of a game system nowadays,
for a colored screen resolution of 640*480 or more I think that a 64kB address range
won't do, so there is that proposal for giving a 6502 a 24 Bit address bus...

;---

About the 65816:

SPI would be nice, too... maybe in combination with MVP and MVN.

It's annoying that the addresses of the 'native mode' interrupt vectors
conflict with the jump vector table of the Commodore kernal,
but I'm not sure how to give the 65816 something similar to the
VBR (vector base register) from the 68020.

;------

I'm not sure, if the 6502 needs a 'multiply' instruction,
as most of the code probably won't make use of it.

Hmm... in M02, I had two instructions that read a Byte from a 24 Bit address,
then either add or store it into a zero page location.
Address was formed from the registers X:Y:A or such (I don't remember exactly).
The idea was to have better support for table driven math,
but those instructions never were put into use...

So much for the instruction set, now back to architecture.

BigEd · Post by **BigEd** » Wed Aug 03, 2016 3:32 pm

(I would really recommend a new thread for those ideas, or you leave this thread as the only place to comment.)

Bregalad · Post by **Bregalad** » Thu Aug 04, 2016 12:22 pm

Are you interested in enhancing the instruction set, or enhancing the hardware architecture implementing the same instruction set? These are two completely separate things.

BigEd · Post by **BigEd** » Thu Aug 04, 2016 3:58 pm

Indeed so, that's why we would usually have separate threads - any discussion about the architectural enhancements is best carried out somewhere else. If ttlworks hasn't started one, and if you have some ideas or comments, I recommend you take his text and quote it in a new thread, with your contribution.

(It would be good to link both forwards and backwards too.)

ttlworks · Post by **ttlworks** » Fri Aug 05, 2016 6:16 am

Rebooting... sorry for the delay.

Need to clarify:

Quote:

Are you interested in enhancing the instruction set, or enhancing the hardware architecture
implementing the same instruction set? These are two completely separate things.

Actually, I'm no longer active.
While I was active, toying with architecture, instruction set, and digging through some of the literature,
I have noticed the one or other "little trick"...
Of course, my focus was making things work at all, speed never was a primary concern.

Felt a need to add a list to Drass' TTL CPU project on how to further increase speed,
just in case he wants to get close to 20MHz (or maybe above it) and in case I won't be there
when he makes a try.
...Strangely, the list ended up here.

I sense a quite big incompatibility.

When comparing the design process, the approaches for building a TTL\transistor CPU
and an FPGA CPU are pretty different:

The traditional way of building a TTL\transistor CPU goes like this:
First you spend half a year or more with nothing but paper and pencil,
trying to figure out what architecture might be best, and how to speed it up.
The actual building process just takes a few months and can be "intense".
Once you have decided for an architecture, you will have to stick with it
for more than 2 years for some reason.

If you missed something... anything... during the initial design phase,
trying to fix this problem later has a habit of creating two new problems,
which causes "avoidable" extra work (and some more grey hairs maybe).

This means that if you try to "add a feature" late in the design process,
chances are good that this "cripples" either the feature, the CPU, or maybe both.

Because of this, you try to look around very early in the design process
what additional features would be interesting (or would make sense to have),
and you try to plan your design (and how to speed it up) in a way that makes it possible
to add those "features" later, even if it feels like you won't be implementing
_any_ of the said features later, anyway.

So from the "TTL point of view", it feels like the list of "possible features"
just has to be located close to the list of how to possibly speed up a CPU,
because those two things "interact" with each other.

// If this is a problem, please copy the "features list" into another thread.

...Stranger in a strange land.

Considering all this, I'm now starting to wonder:
"How might I be able to be useful\helpful in this part of the forum ?"

BigEd · Post by **BigEd** » Fri Aug 05, 2016 8:00 am

> "How might I be able to be useful\helpful in this part of the forum ?"
I wouldn't worry too much about the different sections of the forum. I personally do tend to worry about the use of threads, perhaps I worry too much, but if a thread has a topic and several posts have pursued that topic, then each time someone adds a new topic as some kind of aside, there's the possibility of having to carry on two conversations in parallel, or following one topic at the expense of the other. If, as a poster, you have some complicated thoughts which bridge two topics, you always have the option of starting a new thread. If you're worried that no-one will see it (which actually won't happen, the forum isn't that busy) then post a link to the new thread in the old one.

Another case for taking care in a thread is when the thread is someone's progress log - then it really is about some particular thing.

Back to the topic of this thread: a pool of ideas for faster implementations, by means of fancier microarchitecture. It's good for anyone making a CPU to know something about how CPUs are made, and there are basic designs and there are aggressive and complex designs. As ever, it makes sense to start by mastering the basics.

You're quite right, of course, that the economics of implementing an idea have a big influence on how best to explore what you're about to do. FPGA implementation is rapid, once you have HDL. Even so, it makes sense to have some plan of attack, some diagrams, before writing that HDL. If you want to end up with a high clock speed, it will matter as to how much logic you place between flops, and what your internal connectivity looks like.

There's another point lurking here too. If I add, for example, a barrel shifter, I might slow down my clock speed by 5% or 10%. But maybe the performance of my CPU will increase, because now I have a barrel shifter. There's a tradeoff, when you add a feature, as to what happens to the cycle time, and similarly a tradeoff when you remove (or reject) a feature.
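
The tradeoff can be put into numbers with a little arithmetic: run time is cycles divided by clock frequency, so a feature pays off only if it removes a larger fraction of the cycles than it removes from the clock. A Python sketch, with a hypothetical function name and made-up example figures:

```python
def speedup(cycles_before, cycles_after, clock_drop):
    """Overall speedup when a feature changes both cycle count and clock.

    clock_drop is the fractional loss in clock frequency (0.10 for 10%).
    time_before = cycles_before / f
    time_after  = cycles_after / (f * (1 - clock_drop))
    """
    return (cycles_before / cycles_after) * (1.0 - clock_drop)


# Made-up workload: 1000 cycles without the feature.
worth_it = speedup(1000, 850, 0.10)       # 15% fewer cycles, 10% slower clock
not_worth_it = speedup(1000, 950, 0.10)   # 5% fewer cycles, 10% slower clock
```

A result above 1.0 means the feature is a net win; the break-even point is where the fraction of cycles saved equals the fraction of clock frequency lost.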

Most importantly of all, the performance of a CPU which is never finished isn't quite so important as the performance of one which is!

ttlworks · Post by **ttlworks** » Fri Aug 05, 2016 10:41 am

Thanks for your text, BigEd, this helps a lot.

BTW: a barrel shifter could be broken into, let's say, a three level pipeline.
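
One possible way to cut an 8 bit barrel shifter into three pipeline levels is to shift by 4, by 2 and by 1 in successive stages, each controlled by one bit of the shift amount. The Python model below is only a sketch of that assumption (left shifts only, hypothetical class name `PipelinedShifter`): the result appears three clocks after the input, but a new input can enter on every clock.

```python
class PipelinedShifter:
    STAGE_SHIFTS = (4, 2, 1)   # one bit of the shift amount per stage

    def __init__(self):
        # Pipeline register behind each stage: (value, amount) or None.
        self.regs = [None, None, None]

    def _stage(self, index, item):
        if item is None:
            return None
        value, amount = item
        step = self.STAGE_SHIFTS[index]
        if amount & step:
            value = (value << step) & 0xFF
        return value, amount

    def tick(self, new_input=None):
        """One clock. new_input is (value, amount 0..7) or None.

        Returns the 8-bit result leaving stage 3, or None if it is empty.
        """
        self.regs = [
            self._stage(0, new_input),
            self._stage(1, self.regs[0]),
            self._stage(2, self.regs[1]),
        ]
        finished = self.regs[2]
        return None if finished is None else finished[0]
```

Each stage is just a 2:1 multiplexer per bit, so the logic depth between pipeline registers stays small and the clock does not have to slow down for the shifter.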

;---

In the world of TTL CPUs, you know the propagation delay of the gates and function blocks,
you know how those elements are tied together (length of the signal paths etc.),
and if your chips happen to be in sockets, you even could plug in ICs from different
logic families (LS,F,HCT,ACT etc.) into a socket to check how the propagation delays
in your design vary (and pile up).

That's engineering (and maybe science) in its purest form.

;---

When comparing this to FPGAs:

HDL\VHDL was invented just to give the compiler a clue of what your design is supposed to do
from the logic-level point of view. That's the highest level of abstraction possible.

Compilers tend to be 'closed source', so you can't really say what actually happens in the compiler,
what "strategy" it really uses for breaking apart your equations and mapping them to the hardware
in the FPGA.

There seems to be no documentation about the meaning of the Bits in the *.jed files (like with PALs\GALs),
and you have no clue what the hardware in the chip actually really might look like.

So I guess that if you want to go as fast as possible, things like "guesswork" or "trial and error"
become part of the game.

Unfortunately, at some point you _always_ will be forced to migrate to the next generation
of compilers (which might react a little differently to certain keywords in the HDL\VHDL than before),
and to the next generation of FPGAs (with a different internal architecture),
turning the "trial and error" experiences you happen to have "in stock" from your previous designs
somewhat... obsolete.

Question to the experts: is the game really as bad as I think it is ?
