
All times are UTC




PostPosted: Mon Nov 22, 2010 7:58 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
I've had another look at Dr Naohiko Shimizu's 6502, and had a quick mail exchange with him. (I'll call him Naohiko from now on. He's a long-time 6502 fan and is now aware of this site and this thread.)

Recall that he uses a high level language called SFL (the idea is higher productivity, and he wrote the 6502 core in a week.) Overtone supplies the SFL tools. For my purposes, a free 30-day non-profit license is enough...

The high-level sources - in SFL - are about 1100 lines, which compares favourably with Arlet's 1200 lines of verilog and retromaster's 2100 lines of VHDL. This high-level description is translated to about 2000 lines of low-level synthesisable verilog.

The synthesis results compare against those other two cores like this:
Code:
        flops  slices   LUTs RAM16 HDL      Notes
A2601     138     467    840    0  vhdl     by retromaster
m65       119     452    873    0  sfl      by Naohiko Shimizu (O2 mode)
m65       122     549   1058    0  sfl      by Naohiko Shimizu (default mode)
cpu.v     155     276    474    8  verilog  by Arlet Ottens (not today's version!)
Because the synthesiser sees only the low level verilog, no adders or similar operators were found:
Code:
Macro Statistics
# Registers                                            : 36
 1-bit register                                        : 25
 3-bit register                                        : 1
 8-bit register                                        : 10
Macro Statistics
# Registers                                            : 108
 Flip-Flops                                            : 108

Note that the speed optimisation then replicates a few of the state bits, which takes us up to 122. The speed is reported as 31MHz (or 38MHz for the O2 version) (again, this is not a carefully constrained synthesis.)

Cheers
Ed

ps. Naohiko has presented various retro FPGA projects: PDP/11, Space Invaders, Apple 1, using his high level languages. See slides from ICCD 2009 and from ASEAN 2003


PostPosted: Tue May 24, 2011 8:11 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Using ISE 12.4, and pushing the synthesis a little (high effort, optimize for timing), I can get my Atom model (6502 CPU + RAM/ROM/PIA/VDG) to pass synthesis with a 14ns clock:

Code:
All constraints were met.
 
 Data Sheet report:
 -----------------
 All values displayed in nanoseconds (ns)

 clk            |   13.893



This is on a Spartan-3 (XC3S200-5FT256). On a Spartan-6, I get a synthesis estimate of 111 MHz. Mapping fails due to bad pin constraints (I just used the Spartan-3 constraints file).
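
For comparison with the MHz figures quoted elsewhere in the thread, the reported Spartan-3 clock period can be converted to a frequency. This is plain arithmetic on the 13.893 ns figure above, not something read from a tool report:

```latex
f_{\max} = \frac{1}{T_{\min}} = \frac{1}{13.893\ \text{ns}} \approx 72.0\ \text{MHz}
```

So the 14ns constraint corresponds to roughly 71.4 MHz, and the achieved period to about 72 MHz.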


PostPosted: Tue May 24, 2011 8:30 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
I notice in some of my experiments that the post-placement timing is sometimes a bit faster than the post-synthesis timing.

A 100MHz Atom is quite something!


PostPosted: Tue May 24, 2011 8:30 pm

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Arlet wrote:
...On a Spartan-6, I get a synthesis estimate of 111 MHz. Mapping fails due to bad pin constraints (I just used the Spartan-3 constraints file).


Nice, no BCD mode?


PostPosted: Tue May 24, 2011 8:33 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Yeah, too bad I don't have a Spartan-6 board.... Digilent has one, but it's $349, which is a bit on the high side.


PostPosted: Tue May 24, 2011 8:33 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
No BCD mode. I used an old project dir, with the pre-BCD/pre-RDY core in it.


PostPosted: Tue May 24, 2011 9:38 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
I'm really looking forward to seeing how my 65k core compares, although I think that, as a first-timer writing a completely new core with extra functionality, mine will be much bigger. But it's generic, configurable to 16-, 32- or 64-bit registers and a maximum memory interface of 8, 16 or 32 bits, so I could at least try to compare the 16-bit registers/8-bit memory interface configuration with the numbers here. Maybe I should add a switch to turn off the extra opcodes and registers, just for comparison...

BTW: current status is that I have almost finished writing the fetch/decode logic - which is quite complex (I think this is the most complex part in my design actually) as I do prefix bytes, but it _should_ support decoding one opcode per cycle as long as memory bandwidth permits (so I should be able to break the one opcode / at least two cycles limit of the 6502 :-) It's untested as of now, though.

André


PostPosted: Mon May 30, 2011 11:57 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Arlet wrote:
Yeah, too bad I don't have a Spartan-6 board.... Digilent has one, but it's $349, which is a bit on the high side.
As discussed privately, there are some less expensive options: I've posted to the FPGA dev board thread.


PostPosted: Fri Nov 04, 2011 10:17 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
To follow up on this old observation
BigEd wrote:
... Arlet's is the only design which uses LUTs as 16-bit RAMs - this could be a reason why the slice count is by far the smallest.

I just ran a couple of runs with two versions of Arlet's core, one of which avoids using RAMs for the register file. The size difference is there, but not enormous. This is with version 12.4 of the Xilinx tools, targeting a xc3s50a-5-tq144:
Code:
         flops  slices  LUTs  RAM16  MHz
6502     153    247     468    8     83.3
         189    265     501          81.5
65Org16  202    360     681   16     78.2
         271    389     737          76.3
(Note - I also disabled BCD for these runs)

Edit: updated with post-synthesis speed. This is an unconstrained synthesis with default tactics.


Last edited by BigEd on Fri Nov 04, 2011 12:13 pm, edited 3 times in total.

PostPosted: Fri Nov 04, 2011 10:46 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
I just ran a couple of runs with two versions of Arlet's core, one of which avoids using RAMs for the register file. The size difference is there, but not enormous.


Any difference in speed? Can you show the code? I'm curious how you implemented it.


PostPosted: Fri Nov 04, 2011 11:57 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
I've updated the table with speeds - the RAM approach is just a little faster.

Here's the code - I dug it up from some old emails:

Code:
/*
 * write to a register. Usually this is the (BCD corrected) output of the
 * ALU, but in case of the JSR0 we use the S register to temporarily store
 * the PCL. This is possible, because the S register itself is stored in
 * the ALU during those cycles.
 */
always @(posedge clk)
    if( write_register & RDY )
        // this version will get us a RAM - denser!   
        AXYS[regsel] <= (state == JSR0) ? DIMUX : { ADD[dw-1:4] + ADJH, ADD[3:0] + ADJL };
        //case( state )
        //    JSR0: AXYS[regsel] <= DIMUX;
        //    default : AXYS[regsel] <= { ADD[dw-1:4] + ADJH, ADD[3:0] + ADJL };
        //endcase
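
For anyone without the full core to hand, here is a minimal stand-alone sketch of the register-file shape that the uncommented line above relies on: one indexed write per clock plus an asynchronous indexed read. The module name `regfile_sketch` and the ports `we`, `din` and `dout` are made up for illustration and are not taken from Arlet's core. Whether a particular tool and version maps this onto LUT RAM is something to check in the synthesis report rather than assume.

```verilog
// Hypothetical stand-alone sketch, not taken from Arlet's cpu.v.
module regfile_sketch #(
    parameter dw = 8                 // data width, like the 'dw' used above
) (
    input  wire          clk,
    input  wire          we,         // stands in for (write_register & RDY)
    input  wire [1:0]    regsel,     // selects one of four registers
    input  wire [dw-1:0] din,        // value to store
    output wire [dw-1:0] dout        // value of the selected register
);
    // Four-entry array. A single indexed write and an asynchronous
    // indexed read is the pattern synthesisers typically map to LUT RAM.
    reg [dw-1:0] AXYS [3:0];

    always @(posedge clk)
        if (we)
            AXYS[regsel] <= din;

    assign dout = AXYS[regsel];
endmodule
```

The point of the comparison is that the write data is computed outside the indexed assignment, so the array itself is only ever written in one place.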


PostPosted: Fri Nov 04, 2011 12:10 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
So, you only changed the code a little bit, by using that case statement, and the synthesiser no longer recognized the RAM.

I was thinking that instead of the register array, indexed by 'regsel', it's possible to use the register more directly by rewriting some of the code. For instance, when doing stack operations, the S register can be sent directly to the address bus, instead of going through the AXYS[] array. This may be a little faster (at the cost of more logic).


PostPosted: Fri Nov 04, 2011 1:41 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Actually, both versions are your code, from this time last year (private emails)! We did discover at that time how fragile (or robust) the RAM-recognition was.

I see what you mean - the S value from the ALU could be routed to the Address pins. I think I have never looked carefully at the timing reports, to see where the critical path is. I did determine that synthesis tactics could make quite a difference to speed.


PostPosted: Fri Nov 04, 2011 2:11 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
It would be interesting to see what would happen if the design was changed from the current one, which mimics the way the original 6502 used the ALU to do everything, to one where the registers are separated, and combined with some logic. So, you could have an X-register unit that was capable of holding the value, but also incrementing and decrementing it. That way you could implement INX and DEX without going through the ALU.

This could still be done at the same number of cycles, but you could also choose to deviate from the original 6502 instruction times, and try to make it faster. In theory, you should be able to do INX and DEX in 1 cycle, for instance.
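
A rough sketch of what such an X-register unit might look like, purely as an illustration of the idea. The module name `xreg_sketch` and all of its ports are hypothetical and do not exist in any of the cores discussed here, and the control signals would have to come from a reworked decoder.

```verilog
// Hypothetical sketch of an X register with its own incrementer and
// decrementer. Not part of any core discussed in this thread.
module xreg_sketch (
    input  wire       clk,
    input  wire       load,    // load X from din (e.g. for LDX or TAX)
    input  wire       inc,     // INX
    input  wire       dec,     // DEX
    input  wire [7:0] din,
    output reg  [7:0] X,
    output wire       Z,       // high when the current X value is zero
    output wire       N        // bit 7 of the current X value
);
    // load takes priority over inc, which takes priority over dec;
    // with none asserted, X holds its value.
    always @(posedge clk)
        if (load)
            X <= din;
        else if (inc)
            X <= X + 8'd1;
        else if (dec)
            X <= X - 8'd1;

    assign Z = (X == 8'd0);
    assign N = X[7];
endmodule
```

Here Z and N simply reflect whatever is in X after the clock edge. A real core would still need to decide when those values get copied into the status register, since other instructions update the same flags.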


PostPosted: Fri Nov 04, 2011 5:59 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Single cycle instructions? I always thought that would mean major rework - be interesting to see it though! And we know the 65CE02 managed it. In an FPGA fabric, an incrementer local to the register might well be efficient. It would presumably simplify push and pop type operations.

Cheers
Ed

