First glimpse: 65Org16 (6502 with 32bit addresses)

BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England

Post by BigEd »

Thanks for the code! One of my problems is that it seems natural to have a combinatorial mux to feed the cpu's data in, which is controlled by address decoders, which read the address bus: but that's a combinatorial loop. So the machine won't come out of reset. I had that working before I added the block rams, by registering the mux output. But I haven't yet reconciled that approach with the approach I need for the block ram.

Is it right to think of this core as inside-out compared to 6502? Instead of producing addresses quite early after the clock, and capturing data just as the cycle closes, it produces addresses late in the cycle (which doesn't show up at all in a delay-free simulation) and will expect the data early in the following cycle. Crucially, data arrives before addresses leave.

With the timings above, I think it can't do any work at 50MHz, even though the clock to clock timing suggests it, because that leaves no time for memory access. Is that right?

Cheers
Ed
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands

Post by Arlet »

It's not really "inside out", rather it's a consequence of a fast single clock synchronous design. All the tools care about is that the data arrives just before the next clock edge.

If you'd run this design at 1-2 MHz, the address would also come out right after a clock edge, and it would sample the data just before the next. And if you'd run a 6502 at higher frequencies, the addresses would also be coming out late.

The reason it won't work at 50 MHz is that there is a considerable delay in going on/off chip, due to much larger capacitances, in addition to the already large delay in the DI->ALU path.

I think the best solution is to register the address outputs, to get rid of combinatorial delay in the address muxes, and to register the inputs, to split the external data path from the internal path. That leaves 20 ns for the memory access, which should be just possible with 10 ns SRAM. To deal with the extra delay, you'd have to insert 1 RDY cycle.
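
A minimal sketch of the registering idea, assuming a small wrapper sits between the core and the pads. The module name `bus_regs` and the signal names `ab_core`, `ab_pad`, `di_pad` and `di_core` are made up for illustration and are not taken from cpu.v. The RDY wait-state logic is deliberately left out.

```verilog
// Sketch only: pipeline registers between the core and the pads.
// All names here are assumptions, not signals from cpu.v.
module bus_regs #(
    parameter dw = 16,   // data width
    parameter aw = 32    // address width
) (
    input               clk,
    input      [aw-1:0] ab_core,  // combinatorial address from the core
    output reg [aw-1:0] ab_pad,   // registered address driven off chip
    input      [dw-1:0] di_pad,   // data arriving from external memory
    output reg [dw-1:0] di_core   // registered data handed to the core
);
    always @(posedge clk) begin
        ab_pad  <= ab_core;  // takes the address mux delay out of the pad path
        di_core <= di_pad;   // separates pad/memory delay from the DI->ALU path
    end
endmodule
```

With both registers in place the data reaches the core later than it would from a block RAM, which is why the extra RDY cycle is needed.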

Post by BigEd »

Good point about running down at a few MHz, but the path from DI->AB is still a conundrum.

It's a while since I did this kind of thing, but when I did, we followed a rule of registering outputs. Not sure if that would allow 6502 cycle timings though - it would be nice to preserve those.

I'm not anxious to make changes - I'll persevere with getting my soc working. It'll become clearer.

Cheers
Ed
ElEctric_EyE
Posts: 3260
Joined: 02 Mar 2009
Location: OH, USA

Post by ElEctric_EyE »

BigEd wrote:
...One of my problems is that it seems natural to have a combinatorial mux to feed the cpu's data in, which is controlled by address decoders, which read the address bus: but that's a combinatorial loop...
Ed
This is what I do too... What is your MUX outputting when your external memory is not selected?

Post by Arlet »

ElEctric_EyE wrote:
BigEd wrote:
...One of my problems is that it seems natural to have a combinatorial mux to feed the cpu's data in, which is controlled by address decoders, which read the address bus: but that's a combinatorial loop...
Ed
This is what I do too... What is your MUX outputting when your external memory is not selected?
No, you must have a delay in there somewhere because the writing of AB and the reading of DI don't happen in the same cycle. Either you need to register the address decoding, or the mux output.

Or, what I did in my code: instead of a MUX, use an OR, and let the memories produce '0' when not selected. There's still an extra cycle of delay, but that's from the block RAMs.

If you use asynchronous memories, and register the AB outputs, you can use this delayed AB to drive the input MUX.
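
The OR approach might look roughly like the sketch below. The address map (`sel_ram`, `sel_rom`), the memory sizes and the module name are assumptions chosen only to make the example self-contained. Each memory registers its read data and outputs zero when it is not selected, so the decode is sampled on the clock edge and no combinatorial path runs from AB back to DI.

```verilog
// Sketch only: two synchronous memories merged with an OR instead of a MUX.
// Address map and sizes are hypothetical.
module or_bus_demo #(
    parameter dw = 16,
    parameter aw = 32
) (
    input           clk,
    input  [aw-1:0] AB,   // address from the core
    output [dw-1:0] DI    // data back to the core, one cycle later
);
    // Hypothetical decode: RAM in the lowest 4K words, ROM in the highest
    wire sel_ram = (AB[aw-1:12] == 0);
    wire sel_rom = (AB[aw-1:12] == {(aw-12){1'b1}});

    reg [dw-1:0] ram [0:4095];
    reg [dw-1:0] rom [0:4095];
    reg [dw-1:0] ram_q, rom_q;

    always @(posedge clk) begin
        // An unselected memory drives zero, so it drops out of the OR
        ram_q <= sel_ram ? ram[AB[11:0]] : {dw{1'b0}};
        rom_q <= sel_rom ? rom[AB[11:0]] : {dw{1'b0}};
    end

    assign DI = ram_q | rom_q;
endmodule
```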

Post by Arlet »

BigEd wrote:
It's a while since I did this kind of thing, but when I did, we followed a rule of registering outputs. Not sure if that would allow 6502 cycle timings though - it would be nice to preserve those.
If you register the outputs, and not the inputs, you can get 6502 cycle timings, as long as you reduce clock frequency to account for extra output pad->memory-> input pad delay.

Even with fast (10ns) SRAM, you'd probably have to allow another 20 ns, so it would mean a drop from 50 to 25 MHz.

Alternatively, you can stay at 50 MHz, and add an extra RDY cycle to cover the 20 ns extra time you need.
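
As a rough illustration of the extra RDY cycle, the sketch below holds RDY low on alternate clocks so that every access spans two clock periods (40 ns at 50 MHz). The module name is hypothetical, and how this lines up with the core's internal handling of DI while RDY is low would need checking in simulation.

```verilog
// Sketch only: one wait state per access by toggling RDY every clock.
module wait_state (
    input      clk,
    input      reset,
    output reg rdy
);
    always @(posedge clk)
        if (reset) rdy <= 1'b1;   // leave the core free to run during reset
        else       rdy <= ~rdy;   // alternate between wait and go
endmodule
```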

Post by ElEctric_EyE »

Well, instead of explaining it, the schematic on pg 14 of 6502SoC shows what I did. Sorry, at work... I do put the data headed to the external flash through a FF...

Post by BigEd »

I'm happy to say my simulation has sprung into life! Thanks for the helpful comments. (Which I have not yet fully digested.)

I haven't run this on FPGA, and my solution feels a bit untidy, but it's good to see fewer red Xs, so I've checked it in.

It's good to hear that registering the outputs would make the core behave in a more familiar way, at some (presumed) cost in flexibility. As we often deal with synchronous memory these days, it's better to anticipate that.

Arlet, you were quite right about BCD mode costing speed: although removing it conditionally is quite a mess of ifdefs, the result is an increase (post-synthesis) from 42MHz to 50MHz, reducing the critical path from 22 gates to 17. (Of course I haven't tested my changes)
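
To give a feel for what conditional removal with ifdefs means, here is a sketch on a single 4-bit digit adder. It is not the core's ALU: the macro name `WITH_BCD` and the module are invented for illustration. With the macro undefined, the compare and the second adder disappear from the carry path.

```verilog
// Sketch only: a one-digit adder with optional decimal correction.
// WITH_BCD is a hypothetical macro, not one used in the 65Org16 source.
module nibble_add (
    input  [3:0] a,
    input  [3:0] b,
    input        cin,
    input        bcd,    // decimal mode flag (D); unused without WITH_BCD
    output [3:0] sum,
    output       cout
);
    wire [4:0] raw = a + b + cin;          // plain binary sum, 0..31

`ifdef WITH_BCD
    wire       adjust = bcd & (raw > 5'd9); // digit overflowed past 9
    wire [4:0] adj    = raw + 5'd6;         // skip the six unused codes
    assign sum  = adjust ? adj[3:0] : raw[3:0];
    assign cout = adjust | raw[4];
`else
    assign sum  = raw[3:0];
    assign cout = raw[4];
`endif
endmodule
```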

I'm not seeing a useful post-routing timing report - it thinks it is unconstrained.

Cheers
Ed

Post by ElEctric_EyE »

BigEd wrote:
I'm happy to say my simulation has sprung into life! Thanks for the helpful comments.
I'm not seeing a useful post-routing timing report - it thinks it is unconstrained.

Cheers
Ed
Excellent!
I've noticed in my designs that without at least a simple constraint like the one shown below, ISE will not show max speed. It'll show max delay... Notice it does not even have a jitter spec, which should be included as well.

Code:

NET "RefClk" TNM_NET = "RefClk";
TIMESPEC TS_RefClk = PERIOD "RefClk" 41.16 ns HIGH 50 %;
Arlet wrote:
...The 16 bit ALU is certainly going to slow down max speed, due to the longer ripple carry path. If you care about speed, removing BCD support should help a bit there.
What else is BCD mode good for? Although I currently use SED for hex to decimal conversion in my 6502SOC, it's been mentioned in this forum that the mode is not even used in modern CPUs. Let's trash it and up the speed, but we will sacrifice any hope of backwards 6502 compatibility?...
Question: If ISE doesn't see a "trigger" that uses BCD mode won't it just optimize it out?
So we would free up 2 commands in microcode? SED and CLD...

I've been looking at the code a bit more closely now as my interest in this is growing. First, nice addition here (and other places), starting at lines 33-40 (in ISE editor):

Code:

parameter dw = 16; // data width (8 for 6502, 16 for 65Org16)
parameter aw = 32; // address width (16 for 6502, 32 for 65Org16)

input clk;		// CPU clock 
input reset;		// reset signal
output reg [aw-1:0] AB;	// address bus
input [dw-1:0] DI;	// data in, read bus
output [dw-1:0] DO; 	// data out, write bus
Making the data bus and address bus widths variable, so they can be changed in situ, maybe based on an input? a command? or 6502 compatibility mode in the future? NICE!
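
One point worth keeping in mind: Verilog parameters are fixed when the design is elaborated, so `dw` and `aw` are chosen per instance at build time and cannot follow a runtime input or instruction. The sketch below shows the override syntax using a stand-in module. `bus_stub` and `widths_demo` are hypothetical and only borrow the port names from the excerpt above.

```verilog
// Sketch only: a stand-in with the same parameters as the excerpt,
// not the real cpu.v.
module bus_stub #(
    parameter dw = 16,   // data width
    parameter aw = 32    // address width
) (
    input               clk,
    input               reset,
    output reg [aw-1:0] AB,
    input      [dw-1:0] DI,
    output     [dw-1:0] DO
);
    assign DO = DI;                      // placeholder data path
    always @(posedge clk)
        if (reset) AB <= {aw{1'b0}};
        else       AB <= AB + 1'b1;      // placeholder address counter
endmodule

module widths_demo (input clk, input reset);
    wire [31:0] ab_wide;
    wire [15:0] do_wide;
    wire [15:0] ab_narrow;
    wire [7:0]  do_narrow;

    // Defaults: 16-bit data, 32-bit address
    bus_stub wide (.clk(clk), .reset(reset),
                   .AB(ab_wide), .DI(16'h0000), .DO(do_wide));

    // Overridden at instantiation: 8-bit data, 16-bit address
    bus_stub #(.dw(8), .aw(16)) narrow (.clk(clk), .reset(reset),
                   .AB(ab_narrow), .DI(8'h00), .DO(do_narrow));
endmodule
```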
Arlet wrote:
...Since you now have 16 bit instructions, it would be nice to add some more registers. Since the core is already using a register file, with enough resources to support 16 registers (only 4 are used right now), you can easily add 12 more without much impact on core size/speed.
I have a serious interest in adding more registers. I've made small progress modifying the cpu.v code to add B, C, D, and E registers to the A, X, Y, and Stack. I made it to line 484 before my understanding collapsed into a crumpled heap:

Code:

/*
 * register file, contains A, X, Y and S (stack pointer) registers. At each
 * cycle only 1 of those registers needs to be accessed, so they are combined
 * in a small memory, saving resources.
 */

reg write_register;		// set when register file is written

// combinational decode, so blocking assignments are used
always @*
    case( state )
	DECODE: write_register = load_reg & ~plp;

	PULL1, 
	 RTS2, 
	 RTI3,
	 BRK3,
	 JSR0,
	 JSR2 : write_register = 1;

       default: write_register = 0;
    endcase
Now that we have a [15:0] IR, maybe there will be room for all the transfers between registers that I suggested: TXB, TBX, TAE, TEA, etc.
More importantly though, which registers will be like the accumulator? I guess I need to look at the ALU.v code... More accumulator type registers will add some serious power.
Another question: Adding registers=easy. Adding accumulators=how difficult?

Post by Arlet »

To add another accumulator, or other registers really isn't all that difficult. Suppose you want to add a register 'B', which can be used the same way as 'A':

- You'll need to widen the [1:0] regsel path, e.g. to [2:0] regsel. This allows 4 more registers. Widen src_reg and dst_reg the same way.

- Change SEL_A .. SEL_Y to 3'd0 .. 3'd3, and add SEL_B 3'd4.

To use the register B, all you need is extra decoding. The data path is already there. The only decoding you need to use the B register instead of the A register is a line for 'dst_reg <= SEL_B' and 'src_reg <= SEL_B' when the IR matches your new instruction pattern. All the instructions that work for A (EOR, ADC, ROL, STA,...) then work automatically the same way for B.

The same thing applies to reg-reg transfers. To add a 'TBX' instruction, you'll need: load_reg <= 1, dst_reg <= SEL_X, src_reg <= SEL_B, and state <= REG when IR==TBX and state == DECODE.
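
The steps above can be sketched in a stripped-down form. This is not the core's data path: in the sketch the value is copied straight from one register file entry to another, the `decode` input stands in for `state == DECODE`, the opcode value `OP_TBX` is invented, and the placement of S and X between A and Y is an assumption.

```verilog
// Sketch only: 3-bit register select with a hypothetical TBX decode.
module regfile_demo #(
    parameter dw = 16
) (
    input           clk,
    input           decode,     // stands in for state == DECODE
    input  [15:0]   IR,
    output [dw-1:0] src_value   // value read from the source register
);
    // Selects widened from 2 to 3 bits; ordering of S and X is assumed
    localparam SEL_A = 3'd0,
               SEL_S = 3'd1,
               SEL_X = 3'd2,
               SEL_Y = 3'd3,
               SEL_B = 3'd4;
    localparam OP_TBX = 16'h00FB;   // hypothetical opcode encoding

    reg [dw-1:0] regfile [0:7];     // room for 8 registers
    reg [2:0]    src_reg, dst_reg;
    reg          load_reg;

    always @(posedge clk) begin
        load_reg <= 1'b0;           // default: no register write
        if (decode && IR == OP_TBX) begin
            load_reg <= 1'b1;
            src_reg  <= SEL_B;
            dst_reg  <= SEL_X;
        end
    end

    assign src_value = regfile[src_reg];

    // Write back one cycle after decode: X receives the value of B
    always @(posedge clk)
        if (load_reg)
            regfile[dst_reg] <= src_value;
endmodule
```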

Post by BigEd »

BigEd wrote:
I'm happy to say my simulation has sprung into life!
And now I've seen this design with block RAMs working on FPGA too - at a shade under 50MHz, so I don't need the DCM to produce a divided clock.

Arlet, your RAM code is so much cleaner than mine I'll have to port mine over to use your approach. I just needed to get mine working to satisfy myself.

Next stop on hardware development is to get host communication working properly.

Cheers
Ed

Post by ElEctric_EyE »

...
Last edited by ElEctric_EyE on Sat May 21, 2011 10:55 pm, edited 1 time in total.

Post by Arlet »

Note that my RAM is used in dual-port configuration, because I'm using the other part for the video generator. If you don't need that, you can pick the single port configuration, and simplify it even more.

Post by ElEctric_EyE »

...
Last edited by ElEctric_EyE on Sat May 21, 2011 10:56 pm, edited 1 time in total.

Post by BigEd »

ElEctric_EyE wrote:
Arlet wrote:
...Since you now have 16 bit instructions, it would be nice to add some more registers. Since the core is already using a register file, with enough resources to support 16 registers (only 4 are used right now), you can easily add 12 more without much impact on core size/speed.
I have a serious interest in adding more registers. I've made small progress modifying the cpu.v code to add B, C, D, and E registers to the A, X, Y, and Stack.

[snip]
Now that we have a [15:0] IR, maybe there will be room for all the transfers between registers that I suggested: TXB, TBX, TAE, TEA, etc.
More importantly though, which registers will be like the accumulator? I guess I need to look at the ALU.v code... More accumulator type registers will add some serious power.
Another question: Adding registers=easy. Adding accumulators=how difficult?
It just might be worth(*) a new thread on "65Org16 - more registers" for example, because there might be several things to explore (including the syntax and the implementation for the assembler, and the opcode encodings...) and we can keep them separate.

It's a worthwhile exploration, and I'd like to think 65Org16 is a good place to do that exploration - we have the opcode space, and a working starting point (and even, almost, a working SoC to experiment in)

You can make your own github fork with a single button press, once you've signed up to a (free) github account: that gives you somewhere to put your code.

You can even edit code on the github website, I think, so you don't have to get involved with using the 'git' command line tool. Which is a good thing, because that's yet another thing to learn.

Cheers
Ed

(*) Edit: note that EE has now started a thread for a register-rich variant of 65Org16
Last edited by BigEd on Mon May 23, 2011 4:08 pm, edited 1 time in total.