Optimizing 6502 core performance in FPGA environments

Topics relating to PALs, CPLDs, FPGAs, and other PLDs used for the support or creation of 65-family processors, both hardware and HDL.
Windfall
Posts: 229
Joined: 27 Nov 2011
Location: Amsterdam, Netherlands

Optimizing 6502 core performance in FPGA environments

Post by Windfall »

Okay, the proof of the pudding (referring to an earlier discussion, started by myself; your mileage may vary ...):

I've always thought (I'm summarizing here) that a 6502 core should not spend such an awful amount of time on fetching instructions. Experimentation time. What did I do? I took Michael Morris's 65C02 core (an old version; let's leave the 'why' of that undiscussed ...) and made the following changes:

a) All reads from <address> also read <address + 1> and <address + 2> (importing either opcode argument bytes or up to 16 extra data bits). A simple memory moulding operation in an FPGA environment (more about that later).
b) Eliminated argument byte fetches from the microcode
c) Similarly, coalesced vector reads into a single access

This shortens, e.g., "LDA (dp),Y" or "BIT abs" by two cycles, just for the price of widening the databus ... Of course, especially in FPGA environments (which I'm targeting), this is all basically free. And therefore it should be exploited! Get off your lazy asses. :wink:
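In rough Python terms, the fetch side of the idea looks like this (a behavioural sketch only, not the actual core logic; names are mine):

```python
# Sketch: a "wide" fetch returns the byte at addr plus the next two
# bytes, so a 3-byte instruction like "LDA abs" arrives complete in
# one access and needs no separate argument-fetch cycles.

def wide_read(mem, addr):
    """Return mem[addr], mem[addr+1], mem[addr+2] in one access."""
    return tuple(mem[(addr + i) & 0xFFFF] for i in range(3))

mem = [0] * 0x10000
mem[0x0200:0x0203] = [0xAD, 0x34, 0x12]    # LDA $1234 (opcode, lo, hi)

opcode, lo, hi = wide_read(mem, 0x0200)    # one cycle instead of three
operand_addr = lo | (hi << 8)
```

In the real core the two extra bytes simply come out of duplicated or banked block RAM in the same cycle as the opcode.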

After hitting most of the instruction set (all zero page and absolute addressing related instructions, plus lucrative instructions like JSR, JMP and RTS), I have already improved core performance by around 50% in one of my own creations (the 'soft' Acorn 6502 Second Processor, see http://www.zeridajh.org/hardware/soft65 ... /index.htm).

And the end is not in sight. Optimization of some instructions is impossible due to contention between the instruction fetch and another memory operation. But what if we gave zero page and the stack separate storage, reflecting all writes to two copies: main memory (for degenerate accesses, like LDA 0010h,X or LDA 0100h,X) and a faster copy, e.g. in registers (for natural accesses like LDA 10h,X or PHA)? The latter copy would have no contention with instruction fetches, and opens the door to single-cycle zero page operations.
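The mirroring scheme, again as a behavioural sketch (the class and method names are hypothetical, just to pin down the idea):

```python
# Sketch of the mirrored zero-page/stack idea: every write to pages 0
# and 1 updates both main memory and a fast copy. "Natural" zero page
# and stack accesses read the fast copy (no contention with the
# instruction fetch); degenerate accesses fall back to main memory.

class MirroredMemory:
    def __init__(self):
        self.main = bytearray(0x10000)
        self.fast = bytearray(0x200)       # pages 0 and 1: zero page + stack

    def write(self, addr, value):
        self.main[addr] = value
        if addr < 0x200:                   # mirror page 0/1 writes
            self.fast[addr] = value

    def read_fast(self, addr):
        """Natural access (LDA 10h,X; PHA): contention-free copy."""
        return self.fast[addr & 0x1FF]

    def read_main(self, addr):
        """Degenerate access (e.g. LDA 0010h,X): authoritative copy."""
        return self.main[addr]
```

Because every write goes to both copies, the two can never disagree, which is what keeps the scheme transparent to software.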

In short: there's a lot to be exploited here!
BigEd
Posts: 11467
Joined: 11 Dec 2008
Location: England

Re: Optimizing 6502 core performance in FPGA environments

Post by BigEd »

Sounds good, very good indeed! By sacrificing efficiency, you've made very respectable gains. I hope you can share more details.

I expect it will be true that some cores will be more amenable to this transformation than others, which will be the reason behind some of the comments last time. The smallest and fastest core, Arlet's, is less amenable to this, I think. But it may be that a microcoded core - maybe Michael's older core is in this category - is a more convenient starting point. So long as the endpoint is a healthy margin faster than, say, Arlet's, it is a net win for performance.

Well done!

I'm interested to know whether your wide fetches are aligned, whether you marshal the 3 bytes you need from an 8-byte buffer which you fill 4 at a time, as previously sketched, or whether you've taken some other approach.

I like the tactic of keeping both a fast local and a slower global copy of the zero page and the stack page - that might help to make the machine more transparent and simpler. Again, less efficient, but more effective.

Another possibility, more complexity again, is to have a write buffer, so that pending writes don't hold up reads - assuming there are still dead cycles in which the write buffer can empty.
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands

Re: Optimizing 6502 core performance in FPGA environments

Post by Arlet »

Quote:
Of course, especially in FPGA environments (which I'm targeting), this is all basically free. And therefore it should be exploited! Get off your lazy asses.
It'll take more resources, so I wouldn't call it 'free'. And in most cases, adding logic will also reduce maximum frequency.

Re: Optimizing 6502 core performance in FPGA environments

Post by BigEd »

It is free in cash terms, though, which I'm sure is John's point. And it's a very good point: once you've decided, somehow, on which FPGA to target, all of the resources of that FPGA are available for the problem in hand.

Of course you are right that adding complexity will sometimes reduce clock speed. Sometimes we are limited by external RAM speed, sometimes by the speed of the core.
Rob Finch
Posts: 465
Joined: 29 Dec 2002
Location: Canada

Re: Optimizing 6502 core performance in FPGA environments

Post by Rob Finch »

Quote:
Just for the price of widening the databus
There have been a few suggested designs with wider databuses just to enhance performance (65org16, 65org32). It's tricky because unless the design is really simple, fmax will suffer. With a wider data bus, a mux has to be inserted to align the instruction, and inserting a mux into the instruction fetch path reduces fmax.
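The alignment mux Rob describes can be sketched behaviourally like this (word-addressed memory and function names are my own assumptions):

```python
# Sketch: with a 32-bit bus the fetch returns aligned words, and a
# byte mux shifts them so the opcode lands in lane 0. That shift is
# the mux sitting in the fetch path that costs fmax.

def fetch_aligned(mem32, addr):
    """mem32 is word-addressed (little-endian); returns the 4 bytes
    starting at byte address addr, spanning two words if needed."""
    word_lo = mem32[addr >> 2]
    word_hi = mem32[(addr >> 2) + 1]           # assumes addr isn't at the very end
    combined = word_lo | (word_hi << 32)       # 8 raw bytes
    shift = (addr & 3) * 8                     # the alignment mux
    aligned = (combined >> shift) & 0xFFFFFFFF
    return [(aligned >> (8 * i)) & 0xFF for i in range(4)]

mem32 = [0x03020100, 0x07060504]               # bytes 0..7 as LE words
```

An unaligned fetch thus always reads two words and selects four bytes from the eight, which is where the extra logic levels come from.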
Quote:
a) All reads from <address> also read <address + 1> and <address +2> (
It's a good idea. The 8086 had a six-byte instruction queue from which instructions were read. If you want good performance from a standard memory system (DRAM/EEPROM), a cache is one way to go. Using a cache, FT832 has a 16-byte window available, which also allows longer instructions to be used. One problem with a cache is that fmax is lower than what can be done running out of block RAMs.

Re: Optimizing 6502 core performance in FPGA environments

Post by BigEd »

What I quite like about John's story here is that there's no attempt to make a different CPU - this is exactly a 6502 in instruction-level behaviour. Other ideas elsewhere have generally aimed to get more out of a different microarchitecture, which is tempting, but makes the problem harder and doesn't even benefit existing software. As John's interest seems to be in enhancing and extending Acorn's 6502-based machines, which have lots of existing software, the tactic is fitting.

(Of course I have lots of interest in different CPUs too, but that's for other discussions!)

Re: Optimizing 6502 core performance in FPGA environments

Post by Windfall »

BigEd wrote:
Sounds good, very good indeed! By sacrificing efficiency, you've made very respectable gains. I hope you can share more details.
The 'sacrifice' can be confined to logic only, if memory remains unduplicated (but is split, with some addressing and byte twisting logic). Of course I can share the changes to the logic (very few, really) and the microcode (a lot of changes there), but it's not in releasable shape and may not be for some time. I've also had to recreate the microcode sources (which I had lost) from the raw bit array, so the microcode generator is now a chunk of C code.
BigEd wrote:
I expect it will be true that some cores will be more amenable to this transformation than others
The differences won't be great. Very little extra work is done, in any core, during opcode argument fetch cycles. And these are the ones being eliminated.
BigEd wrote:
, which will be the reason behind some of the comments last time.
Oh, nonsense. Almost everyone tried to be the sceptical smartass, instead of actually taking in the idea.
BigEd wrote:
I'm interested to know whether your wide fetches are aligned, whether you marshal the 3 bytes you need from an 8-byte buffer which you fill 4 at a time, as previously sketched, or whether you've taken some other approach.
Right now, it's 3 duplicates of main memory. They are written simultaneously (at the same address) and read simultaneously (at different addresses). That's just to keep things simple, and to not pollute the logic with any byte twisting multiplexers. But I could (and briefly have) use a single copy split into separately addressable banks, which then inevitably requires some conditional address tweaks or byte twisting. The accumulator style I suggested in the earlier discussion has its own problems: on any control flow change, the previously read word becomes invalid, incurring a penalty (and probably a hold cycle), at least in the general case.
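The single-copy banked alternative can be modelled like so (a sketch under my own naming, not the actual HDL):

```python
# Sketch: one copy of memory split into three banks by (addr mod 3).
# A read of addr, addr+1, addr+2 then hits each bank exactly once,
# but every bank needs its own conditionally incremented row address,
# and the three results need byte twisting back into program order.

class BankedMemory:
    def __init__(self, size=0x10000):
        self.banks = [bytearray((size + 2) // 3) for _ in range(3)]

    def write(self, addr, value):
        self.banks[addr % 3][addr // 3] = value    # single write port

    def read3(self, addr):
        """Bytes at addr, addr+1, addr+2, in program order."""
        out = []
        for i in range(3):
            a = addr + i                            # each a hits a
            out.append(self.banks[a % 3][a // 3])   # different bank
        return out
```

Compared to outright triplication this saves two copies of the RAM, at the cost of the modulo-3 address logic and the reordering muxes.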
BigEd wrote:
Another possibility, more complexity again, is to have a write buffer, so that pending writes don't hold up reads - assuming there are still dead cycles in which the write buffer can empty.
Write buffers have their own problems. Generally, you will have to snoop the buffer on reads which might clash with uncommitted writes. And you have to consider any effects on interrupt latency.
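The snooping requirement is easy to illustrate with a one-entry buffer (again a behavioural sketch with hypothetical names, not a proposed design):

```python
# Sketch: a one-entry write buffer. Every read must snoop the pending
# entry, otherwise a read of a just-written address would return
# stale data from memory.

class WriteBuffered:
    def __init__(self):
        self.mem = bytearray(0x10000)
        self.pending = None                # (addr, value) or None

    def write(self, addr, value):
        self.drain()                       # only one entry: commit old first
        self.pending = (addr, value)       # buffer instead of stalling

    def read(self, addr):
        if self.pending and self.pending[0] == addr:
            return self.pending[1]         # snoop hit: forward buffered value
        return self.mem[addr]

    def drain(self):
        """Called in a dead cycle: commit the pending write to memory."""
        if self.pending:
            a, v = self.pending
            self.mem[a] = v
            self.pending = None
```

The comparator in `read` is exactly the extra logic (and potential fmax cost) that a real write buffer drags in.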
Last edited by Windfall on Sat May 27, 2017 12:25 pm, edited 1 time in total.

Re: Optimizing 6502 core performance in FPGA environments

Post by Windfall »

Arlet wrote:
Quote:
Of course, especially in FPGA environments (which I'm targeting), this is all basically free. And therefore it should be exploited! Get off your lazy asses.
It'll take more resources, so I wouldn't call it 'free'. And in most cases, adding logic will also reduce maximum frequency.
I still have the exact same fmax. I checked. (92 MHz, although, admittedly, the Stratix III on this board has rather beefy cells : two 4-input LUTs and two registers each).

Re: Optimizing 6502 core performance in FPGA environments

Post by Windfall »

BigEd wrote:
It is free in cash terms, though, which I'm sure is John's point. And it's a very good point: once you've decided, somehow, on which FPGA to target, all of the resources of that FPGA are available for the problem in hand.
That is exactly what I meant, yes. In my particular case, for my particular design, even with tripled memory, I'm using only 2% of the available logic and 29% of the available internal RAM.

Re: Optimizing 6502 core performance in FPGA environments

Post by Windfall »

Rob Finch wrote:
Quote:
Just for the price of widening the databus
There's been a few suggested designs with wider databuses just to enhance performance. (65org32,65org16). It's tricky because unless the design is really simple fmax will suffer. With a wider data bus there has to be a mux inserted to align the instruction. Inserting a mux into the instruction fetch reduces the fmax.
You mean: it might. You can tell that the logic will be more complicated, but you cannot tell offhand how many extra resources the compiler will need to implement it.

In any case, simply duplicating memory banks does not require any byte twisting logic. Just memory which you may or may not already have.

Re: Optimizing 6502 core performance in FPGA environments

Post by Windfall »

BigEd wrote:
What I quite like about John's story here is that there's no attempt to make a different CPU - this is exactly a 6502 in instruction-level behaviour.
Full marks, Ed. This is the thing: the nature of the beast hasn't changed. It's still an 8-bit CPU, with the same instruction set - just a hell of a lot faster, for the tradeoff of using memory that is often already there (and can be moulded into almost any required shape).

Re: Optimizing 6502 core performance in FPGA environments

Post by BigEd »

Looks like you do get a huge amount of on-chip ram with those Altera parts - I'm used to the Xilinx parts with at most 64kByte on board. (Which is enough, for say an Acorn second processor, but it doesn't allow any use of block ram for microcode.)

Re: Optimizing 6502 core performance in FPGA environments

Post by Windfall »

BigEd wrote:
Looks like you do get a huge amount of on-chip ram with those Altera parts
The Stratix III development board I used for my experiments is a high end board with a high end FPGA, so it's hardly representative.
BigEd wrote:
- I'm used to the Xilinx parts with at most 64kByte on board. (Which is enough, for say an Acorn second processor, but it doesn't allow any use of block ram for microcode.)
In my experience, most recent development boards are likely to have plenty more than 64KB. The DE0 Nano SOC (http://www.terasic.com.tw/cgi-bin/page/ ... ish&No=941) is a good example of that (and it's pretty cheap). In any case, the microcode is only 512 bytes in this particular case, and about a third of it contains redundant data. It should be doable to implement it in logic.

Re: Optimizing 6502 core performance in FPGA environments

Post by Windfall »

I see you've added the DE0 Nano SOC to the 'Survey of FPGA dev boards'. You may also want to check out the DE10-Lite, then (http://www.terasic.com.tw/cgi-bin/page/ ... sh&No=1021). I have one of those; soft second processor ports to it are pending.

Re: Optimizing 6502 core performance in FPGA environments

Post by Arlet »

BigEd wrote:
Looks like you do get a huge amount of on-chip ram with those Altera parts - I'm used to the Xilinx parts with at most 64kByte on board. (Which is enough, for say an Acorn second processor, but it doesn't allow any use of block ram for microcode.)
That's because you've only looked at the small Xilinx devices. Here's a comparable Xilinx Artix 7 part with 600kB RAM on board.