6502.org Forum  Projects  Code  Documents  Tools  Forum

All times are UTC




PostPosted: Fri May 26, 2017 8:17 pm 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
Okay, the proof of the pudding (referring implicitly to a discussion from the past, started by myself, your mileage may vary ...) :

I've always thought (I'm summarizing here) that a 6502 core should not spend such an awful amount of time fetching instructions. Experimentation time. What did I do? I took Michael Morris's 65C02 core (an old version; let's leave the 'why' of that undiscussed ...) and changed it as follows:

a) All reads from <address> also read <address + 1> and <address + 2> (importing either opcode argument bytes or up to 16 extra data bits). A simple memory moulding operation in an FPGA environment (more about that later).
b) Eliminated, from the microcode, argument byte fetches
c) Similarly, coalesced vector reads into a single access

This shortens, e.g., "LDA (dp),Y" or "BIT abs" by two cycles, just for the price of widening the databus. Of course, especially in FPGA environments (which I'm targeting), this is all basically free. And therefore it should be exploited! Get off your lazy asses. :wink:
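To make the "two cycles" concrete, here is a minimal Python sketch of the idea, not the actual core or its microcode. It assumes an idealized model in which every memory access costs one cycle, ignores page-crossing penalties and the zero page pointer wrap at $FF, and wraps addresses at 64 KB. The function names are made up for illustration.

```python
def wide_read(mem, addr):
    """One access returns the byte at addr plus the next two (64 KB wrap assumed)."""
    return tuple(mem[(addr + i) & 0xFFFF] for i in range(3))

def lda_ind_y_classic(mem, pc, y):
    """LDA (dp),Y with one byte per access. Returns (value, accesses)."""
    accesses = 0
    _opcode = mem[pc]; accesses += 1                 # opcode fetch
    dp = mem[(pc + 1) & 0xFFFF]; accesses += 1       # argument byte fetch
    lo = mem[dp]; accesses += 1                      # pointer low byte
    hi = mem[(dp + 1) & 0xFFFF]; accesses += 1       # pointer high byte
    value = mem[((hi << 8 | lo) + y) & 0xFFFF]; accesses += 1
    return value, accesses

def lda_ind_y_wide(mem, pc, y):
    """Same instruction when every read also delivers the next two bytes."""
    accesses = 0
    _opcode, dp, _ = wide_read(mem, pc); accesses += 1   # opcode and argument together
    lo, hi, _ = wide_read(mem, dp); accesses += 1        # whole pointer in one access
    value, _, _ = wide_read(mem, ((hi << 8 | lo) + y) & 0xFFFF); accesses += 1
    return value, accesses

mem = bytearray(0x10000)
mem[0x0200:0x0202] = bytes([0xB1, 0x10])   # LDA ($10),Y
mem[0x0010:0x0012] = bytes([0x00, 0x30])   # pointer -> $3000
mem[0x3004] = 0x42
print(lda_ind_y_classic(mem, 0x0200, 4))
print(lda_ind_y_wide(mem, 0x0200, 4))
```

In this model the argument fetch disappears and the two pointer reads collapse into one, which is where the two saved cycles come from.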

After hitting most of the instruction set (all zero page and absolute addressing related instructions, and lucrative instructions like JSR, JMP and RTS), I already improved core performance by around 50% in one of my own creations (the 'soft' Acorn 6502 Second Processor, see http://www.zeridajh.org/hardware/soft6502secondprocessor/index.htm).

And the end is not in sight. Optimization of some instructions is impossible due to contention between the instruction fetch and another memory operation. But what if we gave zero page and stack space separate storage, reflecting all writes to two copies: main memory (for degenerate accesses, like LDA 0010h,X or LDA 0100h,X) and a faster copy in e.g. registers (for natural accesses like LDA 10h,X or PHA)? The latter copy would have no contention with instruction fetches, and opens the door to single cycle zero page operations.
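A rough Python sketch of that two-copy arrangement, purely as an illustration: the class and method names are hypothetical, and nothing here models timing. Every write lands in main memory and, if it falls in page 0 or page 1, also in the fast copy. Natural zero page and stack accesses read the fast copy; everything else reads main memory.

```python
class ShadowedMemory:
    """Main memory plus a separate fast copy of zero page and stack (pages 0 and 1)."""

    def __init__(self):
        self.main = bytearray(0x10000)
        self.fast = bytearray(0x200)      # stands in for registers / a second port

    def write(self, addr, value):
        addr &= 0xFFFF
        self.main[addr] = value & 0xFF    # always written
        if addr < 0x200:
            self.fast[addr] = value & 0xFF  # mirrored so both copies agree

    def read_main(self, addr):
        """Instruction fetches and 'degenerate' accesses such as LDA 0010h,X."""
        return self.main[addr & 0xFFFF]

    def read_zp(self, addr):
        """Natural zero page access such as LDA 10h,X."""
        return self.fast[addr & 0xFF]

    def read_stack(self, sp):
        """Natural stack access such as PLA, with sp being the 8-bit stack pointer."""
        return self.fast[0x100 | (sp & 0xFF)]

m = ShadowedMemory()
m.write(0x0010, 0xAA)
print(hex(m.read_zp(0x10)), hex(m.read_main(0x0010)))
```

Because both copies see every write, either read path returns the same data; only the port being used differs.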

In short : there's a lot to be exploited here !


PostPosted: Sat May 27, 2017 6:44 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Sounds good, very good indeed! By sacrificing efficiency, you've made very respectable gains. I hope you can share more details.

I expect it will be true that some cores will be more amenable to this transformation than others, which will be the reason behind some of the comments last time. The smallest and fastest core, Arlet's, is less amenable to this, I think. But it may be that a microcoded core - maybe Michael's older core is in this category - is a more convenient starting point. So long as the endpoint is a healthy margin faster than, say, Arlet's, it is a net win for performance.

Well done!

I'm interested to know whether your wide fetches are aligned, whether you marshal the 3 bytes you need from an 8-byte buffer which you fill 4 at a time, as previously sketched, or whether you've taken some other approach.

I like the tactic of keeping both a fast local and a slower global copy of pages zero and one - that might help to make the machine more transparent and simpler. Again, less efficient, but more effective.

Another possibility, more complexity again, is to have a write buffer, so that pending writes don't hold up reads - assuming there are still dead cycles in which the write buffer can empty.


PostPosted: Sat May 27, 2017 6:47 am 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Quote:
Of course, especially in FPGA environments (which I'm targeting), this is all basically free. And therefore it should be exploited! Get off your lazy asses.

It'll take more resources, so I wouldn't call it 'free'. And in most cases, adding logic will also reduce maximum frequency.


PostPosted: Sat May 27, 2017 9:03 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
It is free in cash terms, though, which I'm sure is John's point. And it's a very good point: once you've decided, somehow, on which FPGA to target, all of the resources of that FPGA are available for the problem in hand.

Of course you are right that adding complexity will sometimes reduce clock speed. Sometimes we are limited by external RAM speed, sometimes by the speed of the core.


PostPosted: Sat May 27, 2017 11:08 am 
Joined: Sun Dec 29, 2002 8:56 pm
Posts: 449
Location: Canada
Quote:
Just for the price of widening the databus

There have been a few suggested designs with wider databuses just to enhance performance (65org32, 65org16). It's tricky because unless the design is really simple, fmax will suffer. With a wider data bus there has to be a mux inserted to align the instruction. Inserting a mux into the instruction fetch reduces the fmax.
Quote:
a) All reads from <address> also read <address + 1> and <address +2> (

It's a good idea. The 8086 had a six-byte instruction queue from which instructions were read (the 8088's queue was four bytes). If you want good performance from a standard memory system (DRAM / EEPROM), a cache is one way to go. Using a cache, FT832 has a 16-byte window available. This allows longer instructions to be used as well. One problem with a cache is that the fmax is lower than what can be done running out of block RAMs.

_________________
http://www.finitron.ca


PostPosted: Sat May 27, 2017 11:12 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
What I quite like about John's story here is that there's no attempt to make a different CPU - this is exactly a 6502 in instruction-level behaviour. Other ideas elsewhere have generally aimed to get more out of a different microarchitecture, which is tempting, but makes the problem harder and doesn't even benefit existing software. As John's interest seems to be in enhancing and extending Acorn's 6502-based machines, which have lots of existing software, the tactic is fitting.

(Of course I have lots of interest in different CPUs too, but that's for other discussions!)


PostPosted: Sat May 27, 2017 11:55 am 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
Sounds good, very good indeed! By sacrificing efficiency, you've made very respectable gains. I hope you can share more details.

The 'sacrifice' can be logic only if memory remains unduplicated (but with split addressing and some byte twisting logic). Of course I can share the changes to the logic (very few, really) and the microcode (a lot of changes there), but it's not in releasable shape and may not be for some time. I've also had to recreate the microcode sources (which I lost) from the raw bit array, so the microcode generator is a chunk of C code now.

BigEd wrote:
I expect it will be true that some cores will be more amenable to this transformation than others

The differences won't be great. Very little extra work is done, in any core, during opcode argument fetch cycles. And these are the ones being eliminated.

BigEd wrote:
, which will be the reason behind some of the comments last time.

Oh, nonsense. Almost everyone tried to be the sceptical smartass, instead of actually taking in the idea.

BigEd wrote:
I'm interested to know whether your wide fetches are aligned, whether you marshal the 3 bytes you need from an 8-byte buffer which you fill 4 at a time, as previously sketched, or whether you've taken some other approach.

Right now, it's 3 duplicates of main memory. They are written to simultaneously (same addresses) and read from simultaneously (different addresses). That's just to keep things simple, and not pollute the logic with any byte twisting multiplexers. But I could (and I have briefly done so) use a single copy split into separately addressable banks (which then inevitably requires some conditional address tweaks or byte twisting). The accumulator style I suggested in the earlier discussion has its own problems: on any control flow change, your previously read word becomes invalid, incurring a penalty (and probably a hold cycle), at least in the general case.
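The two layouts can be compared in a small Python model. This is a sketch under assumptions rather than a description of the real design: the four byte-wide banks selected by the low two address bits are an assumed arrangement (the post only says "separately addressable banks"), and the class names are invented.

```python
class TripledMemory:
    """Three identical copies: written at the same address, read at addr, addr+1, addr+2."""

    def __init__(self, size=0x10000):
        self.size = size
        self.copies = [bytearray(size) for _ in range(3)]

    def write(self, addr, value):
        for copy in self.copies:
            copy[addr % self.size] = value & 0xFF

    def wide_read(self, addr):
        # No multiplexing needed: copy i is simply addressed at addr + i.
        return tuple(self.copies[i][(addr + i) % self.size] for i in range(3))


class BankedMemory:
    """A single copy split into four byte-wide banks (assumed layout: bank = addr & 3)."""

    def __init__(self, size=0x10000):
        self.size = size
        self.rows = size // 4
        self.banks = [bytearray(self.rows) for _ in range(4)]

    def write(self, addr, value):
        addr %= self.size
        self.banks[addr & 3][addr >> 2] = value & 0xFF

    def wide_read(self, addr):
        addr %= self.size
        row, lane = addr >> 2, addr & 3
        # Conditional address tweak: banks below the starting lane read the next row.
        raw = [self.banks[b][(row + (1 if b < lane else 0)) % self.rows]
               for b in range(4)]
        # Byte twisting: rotate so the byte at addr comes out first.
        return tuple(raw[(lane + i) & 3] for i in range(3))


tripled, banked = TripledMemory(), BankedMemory()
for a in range(0x10000):
    v = (a * 7 + 3) & 0xFF
    tripled.write(a, v)
    banked.write(a, v)
print(tripled.wide_read(0x1237) == banked.wide_read(0x1237))
```

The tripled version spends memory to avoid the row adjustment and the rotate; the banked version stores each byte once but pays for both in logic.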

BigEd wrote:
Another possibility, more complexity again, is to have a write buffer, so that pending writes don't hold up reads - assuming there are still dead cycles in which the write buffer can empty.

Write buffers have their own problems. Generally, you will have to snoop it on reads which might clash with uncommitted writes. And you have to consider any effects on interrupt latency.
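The snooping requirement can be shown with a small Python model. It is a sketch only: the buffer depth, the class name and the one-write-per-dead-cycle drain are assumptions made for the example, and interrupt latency is not modelled.

```python
from collections import deque

class WriteBuffer:
    """Queue of pending writes in front of a memory, with read snooping."""

    def __init__(self, mem, depth=4):
        self.mem = mem
        self.depth = depth
        self.pending = deque()            # oldest write at the left

    def write(self, addr, value):
        """Queue a write. Returns False when full, meaning the core would have to stall."""
        if len(self.pending) >= self.depth:
            return False
        self.pending.append((addr, value & 0xFF))
        return True

    def read(self, addr):
        # Snoop the buffer newest-first so a read sees the latest uncommitted write.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.mem[addr]

    def drain_one(self):
        """Commit the oldest pending write, e.g. during a dead bus cycle."""
        if self.pending:
            a, v = self.pending.popleft()
            self.mem[a] = v

mem = bytearray(16)
wb = WriteBuffer(mem, depth=2)
wb.write(3, 0x11)
print(hex(wb.read(3)), hex(mem[3]))   # buffered value is visible before it is committed
```

Without the snoop in `read`, a load that follows a buffered store to the same address would return stale data from memory.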


Last edited by Windfall on Sat May 27, 2017 12:25 pm, edited 1 time in total.

PostPosted: Sat May 27, 2017 12:00 pm 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
Arlet wrote:
Quote:
Of course, especially in FPGA environments (which I'm targeting), this is all basically free. And therefore it should be exploited! Get off your lazy asses.

It'll take more resources, so I wouldn't call it 'free'. And in most cases, adding logic will also reduce maximum frequency.

I still have the exact same fmax. I checked. (92 MHz, although, admittedly, the Stratix III on this board has rather beefy cells : two 4-input LUTs and two registers each).


PostPosted: Sat May 27, 2017 12:03 pm 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
It is free in cash terms, though, which I'm sure is John's point. And it's a very good point: once you've decided, somehow, on which FPGA to target, all of the resources of that FPGA are available for the problem in hand.

That is exactly what I meant, yes. In my particular case, for my particular design, even with tripled memory, I'm using only 2% of the available logic and 29% of the available internal RAM.


PostPosted: Sat May 27, 2017 12:09 pm 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
Rob Finch wrote:
Quote:
Just for the price of widening the databus

There have been a few suggested designs with wider databuses just to enhance performance (65org32, 65org16). It's tricky because unless the design is really simple, fmax will suffer. With a wider data bus there has to be a mux inserted to align the instruction. Inserting a mux into the instruction fetch reduces the fmax.

You mean : it might. You can tell that the logic will be more complicated, but you cannot tell offhand how many extra resources the compiler will need to implement it.

In any case, simply duplicating memory banks does not require any byte twisting logic. Just memory which you may or may not already have.


PostPosted: Sat May 27, 2017 12:12 pm 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
What I quite like about John's story here is that there's no attempt to make a different CPU - this is exactly a 6502 in instruction-level behaviour.

Full marks Ed. This is the thing. The nature of the beast hasn't changed. It's still an 8-bit CPU, with the same instruction set. Just a hell of a lot faster, against the tradeoff of using memory that is often already there (and can be moulded into almost any required shape).


PostPosted: Sat May 27, 2017 4:45 pm 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Looks like you do get a huge amount of on-chip ram with those Altera parts - I'm used to the Xilinx parts with at most 64kByte on board. (Which is enough, for say an Acorn second processor, but it doesn't allow any use of block ram for microcode.)


PostPosted: Sat May 27, 2017 7:19 pm 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
Looks like you do get a huge amount of on-chip ram with those Altera parts

The Stratix III development board I used for my experiments is a high end board with a high end FPGA, so it's hardly representative.

BigEd wrote:
- I'm used to the Xilinx parts with at most 64kByte on board. (Which is enough, for say an Acorn second processor, but it doesn't allow any use of block ram for microcode.)

In my experience, most recent development boards are likely to have plenty more than 64KB. The DE0 Nano SOC (http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=941) is a good example of that (and it's pretty cheap). In any case, the microcode is only 512 bytes in this particular design, and about a third of it contains redundant data. It should be doable to implement it in logic.


PostPosted: Sun May 28, 2017 12:45 am 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
I see you've added the DE0 Nano SOC to the 'Survey of FPGA dev boards'. You may also want to check out the DE10-Lite, then (http://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=1021). I have one of those, soft 2p ports to it pending.


PostPosted: Sun May 28, 2017 6:10 am 
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
Looks like you do get a huge amount of on-chip ram with those Altera parts - I'm used to the Xilinx parts with at most 64kByte on board. (Which is enough, for say an Acorn second processor, but it doesn't allow any use of block ram for microcode.)

That's because you've only looked at the small Xilinx devices. Here's a comparable Xilinx Artix 7 part with 600kB RAM on board.

