All times are UTC




PostPosted: Sun Sep 13, 2020 1:18 pm 
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
And yet another 65C02 core was born. Yes, don't look away. :-)

Let me first say that making a 6502 core small, efficient, cycle accurate, or any combination thereof, is laudable. I adore efficiency myself. But what about already resource rich environments, like, say, a Stratix V FPGA with nearly a million LEs and 54 Mb of RAM? It seems a waste to have a 2000 cell core run in its little 0.2% corner of such a chip ... So, exclusively in the context of my 'soft' Second Processors for Acorn BBCs (google it), which run on a number of hardware development boards (both cheap and expensive ...), I set out to make something 'ridiculously wasteful but fast'. If you've got it, use it.

Mainly by using many copies of main memory to a) read instructions in one go and b) speculatively read all possible memory operands, the resulting core is able to execute most 65C02 instructions in 1 (immediate or direct addressing) or 2 cycles (indirect addressing). Control flow instructions take a few cycles more than that, but are still awaiting optimization. The current end result is a 65C02 core that executes instructions in roughly half the cycles of a regular core.

There's quite a bit of Fmax optimization left to do (I only just started that), but on a 5SGXEABN2F45C2 Stratix V (yes, I know it is ridiculously high end, but it allows me to get close to actual limits), it currently runs at 170 MHz.

Proof that radically reducing cycle count is a viable concept. If only just to myself ;-)

Don't hold your breath for a public release (yet). It's somewhat of a 'lab' thing, at present.


PostPosted: Sun Sep 13, 2020 1:57 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10980
Location: England
Wow! So how wide is the set of memory databusses? (Is that the same as asking how many 64k copies of memory you have?)

It might be that most of the benefits of such a wide machine would come from a suitably wide cache on a smaller chip, at the cost of having to cope also with cache misses.

In any case, very nice to have a proof of concept. A 1000x cheaper FPGA is more my kind of territory!


PostPosted: Sun Sep 13, 2020 2:16 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
All copies are 8 bits wide. Some copies are 'wider' in that two or three bytes at sequential addresses are read simultaneously (for instant indirection in data processing and jumps, for exception vectors, and for RTS/RTI stack pulls).

The number of copies is reduced by using true dual port memory, so two reads (from sequential addresses) can be done on the same copy, without any 'byte twisting'. Altogether about 450 KB of RAM instead of 64 KB.
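As a rough sanity check on those numbers, here is a small Python sketch (not the core's HDL; the function names are invented for illustration). It assumes every copy is a full 64 KB, which may not hold for all of them, and it models one true dual port copy returning the bytes at two sequential addresses in the same cycle.

```python
ADDRESS_SPACE = 64 * 1024  # bytes in the 6502 address space


def equivalent_copies(total_kb: float, copy_kb: float = 64) -> float:
    """How many full 64 KB copies a given RAM total corresponds to."""
    return total_kb / copy_kb


def dual_port_read(mem: bytearray, addr: int) -> tuple[int, int]:
    """Model a true dual port RAM: port A reads addr, port B reads addr + 1.

    Wrapping at 64 KB is an assumption of this model.
    """
    a = addr % ADDRESS_SPACE
    b = (addr + 1) % ADDRESS_SPACE
    return mem[a], mem[b]


mem = bytearray(ADDRESS_SPACE)
mem[0xFFFC], mem[0xFFFD] = 0x00, 0xC0  # example reset vector pointing at $C000
lo, hi = dual_port_read(mem, 0xFFFC)
print(f"{equivalent_copies(450):.1f} copies, vector = ${hi << 8 | lo:04X}")
```

So 450 KB is about seven copies' worth of 64 KB, and a two byte vector comes out of a single copy in one read without any byte shuffling.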


PostPosted: Sun Sep 13, 2020 2:35 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10980
Location: England
Thanks!

So, if the instruction fetch is not to read 3 sequential bytes every time, did you implement your idea of a buffer of upcoming instruction bytes which is refilled as needed?


PostPosted: Sun Sep 13, 2020 2:39 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
So, if the instruction fetch is not to read 3 sequential bytes every time, did you implement your idea of a buffer of upcoming instruction bytes which is refilled as needed?

Code is read 3 bytes at a time as well.


PostPosted: Sun Sep 13, 2020 2:41 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10980
Location: England
Very good - cheap and cheerful. Well, not so cheap!

Do you consolidate two-byte stack pushes into one write? (Probably not a huge win)


PostPosted: Sun Sep 13, 2020 3:09 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
Very good - cheap and cheerful. Well, not so cheap!

It has no pretenses of 'new design' practicality. But it's a proof of concept. And it will be going for the speed record ...

BigEd wrote:
Do you consolidate two-byte stack pushes into one write? (Probably not a huge win)

No, writes are 8 bits wide (to all RAM copies in parallel). It's likely that you could go 16 or 24 bit, with some rearranging and some more RAM, but especially 24 bit (for exception pushes) will get awkward. And indeed, it will have little effect on the average number of cycles.


PostPosted: Sun Sep 13, 2020 9:29 pm
Joined: Sun Oct 18, 2015 11:02 pm
Posts: 428
Location: Toronto, ON
Very interesting Windfall. Is execution pipelined, or are the benefits derived mainly from wide memory?

_________________
C74-6502 Website: https://c74project.com


PostPosted: Sun Sep 13, 2020 10:35 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
Drass wrote:
Very interesting Windfall. Is execution pipelined, or are the benefits derived mainly from wide memory?

No pipelining. That is reserved for the next version.


PostPosted: Sun Sep 13, 2020 10:53 pm
Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
Windfall wrote:
Control flow instructions take a few cycles more than that, but are still awaiting optimization.


What's your plan for that? I could see decoding both the current and a bit of the next instruction in one go, with a 5-byte code lookahead instead of 3, to get going on upcoming branches earlier.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Sun Sep 13, 2020 11:29 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
White Flame wrote:
Windfall wrote:
Control flow instructions take a few cycles more than that, but are still awaiting optimization.


What's your plan for that? I could see decoding both the current and a bit of the next instruction in one go, with a 5-byte code lookahead instead of 3, to get going on upcoming branches earlier.


I was simply referring to the fact that some control flow instructions can be executed more quickly than they are now. I'm not sure how you concluded I was talking about branch prediction.


PostPosted: Mon Sep 14, 2020 12:44 am
Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
It was just an idea to expand on the multi-byte view of the upcoming instruction, not a prediction. :)

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Mon Sep 14, 2020 1:59 am
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
White Flame wrote:
It was just an idea to expand on the multi-byte view of the upcoming instruction, not a prediction. :)

OK, I see.

Yes, reading two instructions at a time could certainly help speed things up in some scenarios, but then you're well into data dependency territory and things can get complicated very quickly. Actually, I was already planning on looking at simultaneously executing simple combinations of two instructions, but only within the existing 3 byte window (like INY:LDA (zp),Y).
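To illustrate the 3 byte window constraint, a minimal Python sketch (the table is a small hand-picked subset of 65C02 opcodes, and the helper name is made up for this example). It only checks whether a pair of instructions fits in the bytes already fetched, nothing more.

```python
# Instruction lengths in bytes for a few 65C02 opcodes (illustrative subset).
LENGTHS = {
    0xC8: 1,  # INY
    0xE8: 1,  # INX
    0xB1: 2,  # LDA (zp),Y
    0xA9: 2,  # LDA #imm
    0xAD: 3,  # LDA abs
}

WINDOW = 3  # code bytes fetched per cycle


def pair_fits_window(first: int, second: int, window: int = WINDOW) -> bool:
    """True if both instructions are fully contained in one code fetch."""
    return LENGTHS[first] + LENGTHS[second] <= window


print(pair_fits_window(0xC8, 0xB1))  # INY : LDA (zp),Y
print(pair_fits_window(0xC8, 0xAD))  # INY : LDA abs
```

Fitting in the window is necessary but not sufficient: in INY:LDA (zp),Y the load depends on the new Y, so the pair can only execute together if the updated Y is forwarded to the address calculation.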

Actually, I did some similar work in the nineties, for a proprietary graphics processor that kept multiple units busy with one (huge) instruction word. Specifically, I wrote optimization algorithms for an assembler that converted somewhat loosely coded assembly into tighter code, keeping more units busy at the same time (lots of data dependencies on that one ...). I'd like to do something similar here, most ambitiously with multiple execution units that stall each other (or not) based on data dependencies or control flow changes. But that's way more complicated than what I have now, and I'm not sure it'd be worth all the effort.


PostPosted: Tue Sep 22, 2020 8:07 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
OK. By now, further optimization clearly has diminishing returns. It appears that my built-from-scratch 65C02 core, when wrapped up in an Acorn 6502 Second Processor, tops out at 190 MHz on the Stratix V mentioned earlier, whereas Arlet's core does 230 MHz. Respective benchmark speeds are 380 and 230.
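Working those figures through, as a small Python sketch (it assumes the benchmark figure scales linearly with clock frequency; the helper name is made up, and the benchmark's units are not stated here):

```python
def per_clock(benchmark: float, mhz: float) -> float:
    """Benchmark score per MHz, i.e. relative work done per clock cycle."""
    return benchmark / mhz


new_core = per_clock(380, 190)    # this core
arlet_core = per_clock(230, 230)  # Arlet's core
overall = 380 / 230               # net speedup at the respective Fmax

print(f"per clock: {new_core:.2f} vs {arlet_core:.2f}, overall: {overall:.2f}x")
```

That is twice the work per clock, consistent with executing in roughly half the cycles, but the lower Fmax brings the net gain down to about 1.65x.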

For some, it may be interesting to see a description of the cycle counts and a few architectural details. So here goes.

Copies of main memory are used to simultaneously read: code (3 bytes), direct operands (1 or 2 bytes, e.g. for vectors), indirect operands (1 byte) and stack operands (3 bytes, e.g. for RTI pulls). Copies are byte wide, and combined into true dual ported memory (two reads or one write) or simple dual ported memory (one read and one write). Addresses are byte aligned, requiring no 'byte twisting'. All writes are 1 byte at a time (to all memory copies).
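A toy Python model of that arrangement, purely as a sketch (the class, the four copy names and the method signatures are invented for illustration; the real core uses more copies and is written in HDL): every write goes to all copies, while each copy serves its own independent read.

```python
class ReplicatedMemory:
    """Several identical RAM copies: writes are broadcast, reads are per copy."""

    def __init__(self, copies=("code", "direct", "indirect", "stack"), size=0x10000):
        self.size = size
        self.copies = {name: bytearray(size) for name in copies}

    def write(self, addr: int, value: int) -> None:
        # One byte at a time, to all copies in parallel.
        for ram in self.copies.values():
            ram[addr % self.size] = value & 0xFF

    def read(self, copy: str, addr: int, count: int = 1) -> list[int]:
        # Each copy can be read at a different address in the same cycle.
        ram = self.copies[copy]
        return [ram[(addr + i) % self.size] for i in range(count)]


mem = ReplicatedMemory()
for offset, byte in enumerate([0xAD, 0x34, 0x12]):  # LDA $1234
    mem.write(0xC000 + offset, byte)

opcode, lo, hi = mem.read("code", 0xC000, count=3)
print(f"opcode=${opcode:02X} operand=${hi << 8 | lo:04X}")
```

The cost is visible in the constructor: every extra simultaneous read path is another full array, which is where the RAM usage goes.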

In principle, all instructions take either 1 (immediate and direct addressing) or 2 cycles (indirect addressing). This is based on speculative and simultaneous reads of all possible direct operands (both absolute and zero page plus 0, X and Y, and stack plus 1), and feeding the zero page plus 0 and X results to a second stage that speculatively reads the three indirect operands. Instructions changing X or Y cause a bypass of the new X or Y value (i.e. prior to writing the register) to the +X and +Y direct addresses, so as not to slow down all these instructions (or forego the benefits of the +X and +Y reads).
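The two speculative stages can be sketched in Python roughly as follows. This is a behavioural illustration only, with invented function names; zero page pointer wrap-around is an assumption of the model, and in hardware all of the reads in one stage happen in the same cycle on separate memory copies.

```python
def read_pointer(mem, zp_addr):
    """Fetch a 16-bit pointer from zero page (wrapping within the page)."""
    lo = mem[zp_addr & 0xFF]
    hi = mem[(zp_addr + 1) & 0xFF]
    return lo | (hi << 8)


def stage1_direct(mem, op_lo, op_hi, x, y, sp):
    """Read every possible direct operand, whatever the instruction turns out to need."""
    absolute = op_lo | (op_hi << 8)
    addresses = {
        "zp": op_lo,
        "zp,x": (op_lo + x) & 0xFF,
        "zp,y": (op_lo + y) & 0xFF,
        "abs": absolute,
        "abs,x": (absolute + x) & 0xFFFF,
        "abs,y": (absolute + y) & 0xFFFF,
        "stack": 0x0100 | ((sp + 1) & 0xFF),
    }
    return {mode: mem[addr] for mode, addr in addresses.items()}


def stage2_indirect(mem, op_lo, x, y):
    """Use the zp+0 and zp+X pointers to read the three indirect operands."""
    ptr0 = read_pointer(mem, op_lo)
    ptrx = read_pointer(mem, op_lo + x)
    addresses = {"(zp)": ptr0, "(zp),y": (ptr0 + y) & 0xFFFF, "(zp,x)": ptrx}
    return {mode: mem[addr] for mode, addr in addresses.items()}


mem = bytearray(0x10000)
mem[0x10], mem[0x11] = 0x34, 0x12        # pointer at $10 -> $1234
mem[0x1234], mem[0x1236] = 0xAA, 0xBB
direct = stage1_direct(mem, 0x10, 0x00, x=0, y=2, sp=0xFF)
indirect = stage2_indirect(mem, 0x10, x=0, y=2)
print(hex(direct["zp"]), hex(indirect["(zp),y"]))
```

The decoder then simply picks the one result matching the actual addressing mode; the other reads are thrown away, which is the 'wasteful' part.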

A few exceptions are:

- Writing to memory (read-modify-write, store and push) is often one cycle extra, solely to avoid clashing with speculative reads done on true dual ported memory.

- Some PC or SP changing instructions can be one cycle shorter by bypassing PC or SP updates, but overall performance does not increase much, and Fmax seems to suffer in the current architecture.

- This currently means (beyond the 1 or 2 cycle rule mentioned above): RMW, store and push are 2; branch is 1 (not taken) or 3 (taken); JMP/JSR are 3; RTS/RTI are 3; pull and TXS are 2; exceptions are 4.
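The cycle rules above can be written down as a small lookup, sketched here in Python (the category names and the example loop are made up for illustration; the comparison figures are the usual 65C02 datasheet counts without page crossing penalties).

```python
# Cycle counts for the exceptions listed above.
SPECIAL_CYCLES = {
    "rmw": 2, "store": 2, "push": 2,
    "branch_not_taken": 1, "branch_taken": 3,
    "jmp": 3, "jsr": 3, "rts": 3, "rti": 3,
    "pull": 2, "txs": 2, "exception": 4,
}


def cycles(kind: str, indirect: bool = False) -> int:
    """Exceptions come from the table; everything else is 1 cycle, or 2 if indirect."""
    if kind in SPECIAL_CYCLES:
        return SPECIAL_CYCLES[kind]
    return 2 if indirect else 1


# Example loop body: LDA (zp),Y / STA abs,X / INY / BNE (taken)
loop = [("other", True), ("store", False), ("other", False), ("branch_taken", False)]
fast = sum(cycles(kind, indirect) for kind, indirect in loop)
regular = 5 + 5 + 2 + 3  # standard 65C02 counts for the same four instructions

print(fast, regular)
```

For this particular loop that gives 8 cycles against 15, in line with the 'roughly half the cycles' figure.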

That's probably all the performance that can be squeezed out of the chosen architecture. It is likely that only something seriously pipelined, or with multiple execution units, can execute 65C02 instructions more quickly.

At this stage, the only things left are tradeoffs between cycles and Fmax, and it seems likely that the resulting benchmark would suffer ...


Last edited by Windfall on Tue Sep 22, 2020 8:17 pm, edited 1 time in total.

PostPosted: Tue Sep 22, 2020 8:11 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10980
Location: England
Any idea if careful HDL (re)coding could improve fmax? Usually it can, if it hasn't already been done. That is, the microarchitecture is one thing, but the synthesis (and the placement) still has many degrees of freedom. (Here, I suspect Arlet's core has the advantage of being small and simple to start with. But by the same token, a larger newer core might have some low-hanging fruit.)

