First glimpse: 65Org16 (6502 with 32bit addresses)
Thanks for the code! One of my problems is that it seems natural to have a combinatorial mux to feed the cpu's data in, which is controlled by address decoders, which read the address bus: but that's a combinatorial loop. So the machine won't come out of reset. I had that working before I added the block rams, by registering the mux output. But I haven't yet reconciled that approach with the approach I need for the block ram.
Is it right to think of this core as inside-out compared to 6502? Instead of producing addresses quite early after the clock, and capturing data just as the cycle closes, it produces addresses late in the cycle (which doesn't show up at all in a delay-free simulation) and will expect the data early in the following cycle. Crucially, data arrives before addresses leave.
With the timings above, I think it can't do any work at 50MHz, even though the clock to clock timing suggests it, because that leaves no time for memory access. Is that right?
Cheers
Ed
It's not really "inside out", rather it's a consequence of a fast single clock synchronous design. All the tools care about is that the data arrives just before the next clock edge.
If you'd run this design at 1-2 MHz, the address would also come out right after a clock edge, and it would sample the data just before the next. And if you'd run a 6502 at higher frequencies, the addresses would also be coming out late.
The reason it won't work at 50 MHz is that there is a considerable delay in going on/off chip, due to much larger capacitances, in addition to the already large delay in the DI->ALU path.
I think the best solution is to register the address outputs, to get rid of combinatorial delay in the address muxes, and to register the inputs, to split the external data path from the internal path. That leaves 20 ns for the memory access, which should be just possible with 10 ns SRAM. To deal with the extra delay, you'd have to insert 1 RDY cycle.
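As a rough illustration of the boundary registers described above, here is a minimal sketch. The module name `bus_regs` and the signal names `core_ab`, `pad_ab`, `pad_di` and `core_di` are made up for this example and do not come from the core; the RDY wait-state logic is deliberately left out.

```verilog
// Sketch only: registers placed between the CPU core and external SRAM.
// All names here are illustrative assumptions, not taken from the core.
module bus_regs #(
    parameter dw = 16,              // data width
    parameter aw = 32               // address width
) (
    input               clk,
    input  [aw-1:0]     core_ab,    // combinatorial address from the core
    input  [dw-1:0]     pad_di,     // data arriving from the external SRAM
    output reg [aw-1:0] pad_ab,     // registered address driven off chip
    output reg [dw-1:0] core_di     // registered data handed to the core
);

// One register on the way out removes the address mux delay from the
// off-chip path; one on the way in separates the external data path
// from the internal DI->ALU path.
always @(posedge clk) begin
    pad_ab  <= core_ab;
    core_di <= pad_di;
end

endmodule
```

With both registers in place the memory gets a full clock period to itself, but the data reaches the core later than it otherwise would, which is what the extra RDY cycle is meant to absorb.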
Good point about running down at a few MHz, but the path from DI->AB is still a conundrum.
It's a while since I did this kind of thing, but when I did, we followed a rule of registering outputs. Not sure if that would allow 6502 cycle timings though - it would be nice to preserve those.
I'm not anxious to make changes - I'll persevere with getting my soc working. It'll become clearer.
Cheers
Ed
-
ElEctric_EyE
- Posts: 3260
- Joined: 02 Mar 2009
- Location: OH, USA
BigEd wrote:
...One of my problems is that it seems natural to have a combinatorial mux to feed the cpu's data in, which is controlled by address decoders, which read the address bus: but that's a combinatorial loop...
Ed
Or, what I did in my code: instead of a MUX, use an OR, and let the memories produce '0' when not selected. There's still an extra cycle of delay, but that's from the block RAMs.
If you use asynchronous memories, and register the AB outputs, you can use this delayed AB to drive the input MUX.
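A minimal sketch of the OR approach, assuming a synchronous memory with a decoded select input. The module name `ram_zero` and its ports are invented for this example, and whether a particular tool maps it onto block RAM is not guaranteed.

```verilog
// Sketch only: a synchronous memory whose output is all zeros when it
// is not selected, so several memories can be ORed onto the data bus.
// Module and port names are illustrative assumptions.
module ram_zero #(
    parameter dw    = 16,           // data width
    parameter abits = 10            // address bits used by this memory
) (
    input                 clk,
    input                 sel,      // decoded select for this memory
    input                 we,       // write enable
    input  [abits-1:0]    addr,
    input  [dw-1:0]       din,
    output reg [dw-1:0]   dout
);

reg [dw-1:0] mem [0:(1<<abits)-1];

always @(posedge clk) begin
    if (sel & we)
        mem[addr] <= din;
    // The select is sampled on the same edge as the address, so the
    // decoder never feeds back combinatorially into the data input.
    dout <= sel ? mem[addr] : {dw{1'b0}};
end

endmodule
```

At the top level the CPU data input then becomes something like `assign DI = ram_dout | rom_dout;`, where `ram_dout` and `rom_dout` are the (hypothetically named) outputs of two such memories.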
BigEd wrote:
It's a while since I did this kind of thing, but when I did, we followed a rule of registering outputs. Not sure if that would allow 6502 cycle timings though - it would be nice to preserve those.
Even with fast (10ns) SRAM, you'd probably have to allow another 20 ns, so it would mean a drop from 50 to 25 MHz.
Alternatively, you can stay at 50 MHz, and add an extra RDY cycle to cover the 20 ns extra time you need.
I'm happy to say my simulation has sprung into life! Thanks for the helpful comments. (Which I have not yet fully digested.)
I haven't run this on FPGA, and my solution feels a bit untidy, but it's good to see fewer red Xs, so I've checked it in.
It's good to hear that registering the outputs would make the core behave in a more familiar way, at some (presumed) cost in flexibility. As we often deal with synchronous memory these days, it's better to anticipate that.
Arlet, you were quite right about BCD mode costing speed: although removing it conditionally is quite a mess of ifdefs, the result is an increase (post-synthesis) from 42MHz to 50MHz, reducing the critical path from 22 gates to 17. (Of course I haven't tested my changes)
I'm not seeing a useful post-routing timing report - it thinks it is unconstrained.
Cheers
Ed
BigEd wrote:
I'm happy to say my simulation has sprung into life! Thanks for the helpful comments.
I'm not seeing a useful post-routing timing report - it thinks it is unconstrained.
Cheers
Ed
I've noticed that in my designs, without at least a simple constraint like the one shown below, ISE will not show max speed. It'll show max delay... Notice it does not even have a jitter spec, which should be included as well.
Code:
NET "RefClk" TNM_NET = "RefClk";
TIMESPEC TS_RefClk = PERIOD "RefClk" 41.16 ns HIGH 50 %;
Arlet wrote:
...The 16 bit ALU is certainly going to slow down max speed, due to the longer ripple carry path. If you care about speed, removing BCD support should help a bit there.
Question: If ISE doesn't see a "trigger" that uses BCD mode, won't it just optimize it out?
So we would free up 2 commands in microcode? SED and CLD...
I've been looking at the code a bit more closely now, as my interest in this is growing. First, nice addition here (and in other places), starting at lines 33-40 (in the ISE editor):
Code:
parameter dw = 16; // data width (8 for 6502, 16 for 65Org16)
parameter aw = 32; // address width (16 for 6502, 32 for 65Org16)
input clk; // CPU clock
input reset; // reset signal
output reg [aw-1:0] AB; // address bus
input [dw-1:0] DI; // data in, read bus
output [dw-1:0] DO; // data out, write bus
Arlet wrote:
...Since you now have 16 bit instructions, it would be nice to add some more registers. Since the core is already using a register file, with enough resources to support 16 registers (only 4 are used right now), you can easily add 12 more without much impact on core size/speed.
Code:
/*
 * register file, contains A, X, Y and S (stack pointer) registers. At each
 * cycle only 1 of those registers needs to be accessed, so they are combined
 * in a small memory, saving resources.
 */
reg write_register; // set when register file is written
always @*
    case( state )
        DECODE: write_register <= load_reg & ~plp;
        PULL1,
        RTS2,
        RTI3,
        BRK3,
        JSR0,
        JSR2 : write_register <= 1;
        default: write_register <= 0;
    endcase
More importantly though, which registers will be like the accumulator? I guess I need to look at the ALU.v code... More accumulator type registers will add some serious power.
Another question: Adding registers=easy. Adding accumulators=how difficult?
To add another accumulator, or other registers really isn't all that difficult. Suppose you want to add a register 'B', which can be used the same way as 'A':
- You'll need to widen the [1:0] regsel path, e.g. to [2:0] regsel. This allows 4 more registers. Widen src_reg and dst_reg the same way.
- Change SEL_A .. SEL_Y to 3'd0 .. 3'd3, and add SEL_B 3'd4.
To use the register B, all you need is extra decoding. The data path is already there. And the only decoding you need to use the B register instead of the A register, is to add a line for 'dst_reg <= SEL_B' and 'src_reg <=SEL_B ', when the IR matches your new instruction pattern. All the instructions that work for A (EOR, ADC, ROL, STA,...) then work automatically the same way for B.
The same thing applies to reg-reg transfers. To add a 'TBX' instruction, you'll need: load_reg <= 1, dst_reg <= SEL_X, src_reg <= SEL_B, and state <= REG when IR==TBX and state == DECODE.
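The decoding described above might be sketched roughly as follows. This is a standalone illustration, not a patch against the core: the module name `regsel_sketch`, the `decode` input, the `OP_TBX` opcode value and the ordering of S, X and Y in the select encoding are all assumptions made for the example, and the `state <= REG` transition is omitted.

```verilog
// Sketch only: widened register select with a hypothetical B register
// and a hypothetical TBX (transfer B to X) opcode.
module regsel_sketch (
    input             clk,
    input             decode,       // high while the core is in DECODE
    input      [15:0] IR,           // instruction register
    output reg [2:0]  src_reg,      // widened from [1:0]
    output reg [2:0]  dst_reg,      // widened from [1:0]
    output reg        load_reg
);

// Existing selects widened to 3 bits (ordering assumed), plus SEL_B.
localparam SEL_A = 3'd0,
           SEL_S = 3'd1,
           SEL_X = 3'd2,
           SEL_Y = 3'd3,
           SEL_B = 3'd4;

localparam OP_TBX = 16'h00FB;       // made-up opcode value

always @(posedge clk)
    if (decode) begin
        load_reg <= 1'b0;           // default: no register write
        if (IR == OP_TBX) begin
            load_reg <= 1'b1;       // write the destination register
            src_reg  <= SEL_B;      // read from B
            dst_reg  <= SEL_X;      // write to X
        end
    end

endmodule
```

In the real core these assignments would sit alongside the existing decode cases rather than in a separate module.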
BigEd wrote:
I'm happy to say my simulation has sprung into life!
Arlet, your RAM code is so much cleaner than mine I'll have to port mine over to use your approach. I just needed to get mine working to satisfy myself.
Next stop on hardware development is to get host communication working properly.
Cheers
Ed
ElEctric_EyE wrote:
Arlet wrote:
...Since you now have 16 bit instructions, it would be nice to add some more registers. Since the core is already using a register file, with enough resources to support 16 registers (only 4 are used right now), you can easily add 12 more without much impact on core size/speed.
[snip]
Now that we have a [15:0] IR, maybe there will be room for all transfers between all registers. I suggested TXB, TBX, TAE, TEA, etc.
More importantly though, which registers will be like the accumulator? I guess I need to look at the ALU.v code... More accumulator type registers will add some serious power.
Another question: Adding registers=easy. Adding accumulators=how difficult?
It's a worthwhile exploration, and I'd like to think 65Org16 is a good place to do that exploration - we have the opcode space, and a working starting point (and even, almost, a working SoC to experiment in)
You can make your own github fork with a single button press, once you've signed up to a (free) github account: that gives you somewhere to put your code.
You can even edit code on the github website, I think, so you don't have to get involved with using the 'git' command line tool. Which is a good thing, because that's yet another thing to learn.
Cheers
Ed
(*) Edit: note that EE has now started a thread for a register-rich variant of 65Org16
Last edited by BigEd on Mon May 23, 2011 4:08 pm, edited 1 time in total.