My new verilog 65C02 core.
Re: My new verilog 65C02 core.
I ran the test suite for the regular 6502 instructions, but not yet the 65C02 version; I still have to download that one. The 65C02 instructions I tested by hand.
The instructions that I haven't done yet are: STZ, the new BIT addressing modes, and JMP (abs,X), but those are just a matter of microcode. I also need to do the NMI and RDY signals.
Re: My new verilog 65C02 core.
Finished converting the ALU verilog code (alu.v) into Spartan6 instances, including all ALU operations, BCD adjustment, flag updates, and condition code checks.
The tightest hand placement fits in a 3x3 square of slices, but that may not be the optimal arrangement for highest speed.
- 21 LUT5
- 19 LUT6 (8 use carry chain)
- 23 Flip-flops
Re: My new verilog 65C02 core.
Almost done with converting the control logic. A bit of manual coaxing can make it fit in 5 slices (and a ROM). There are still a few small optimizations possible, maybe worth about half a slice.
The control logic only uses 6 flops (the little squares on the right side of each slice), so it's mostly combinatorial logic.
- Attachment: ctl.png
Re: My new verilog 65C02 core.
Preliminary summary so far, with each individual module manually placed, but not yet integrated.
- ALU: 9 slices
- CTL: 5 slices
- ABL: 7 slices
- ABH: 6 slices
- register file: 2 slices
- output mux: 2 slices
- total: 31 slices
Re: My new verilog 65C02 core.
Today I spent some time playing with a real FPGA board instead of running sims, and I discovered a problem related to the late update of the I flag that I talked about earlier. The problem is that the I flag is updated in the fetch cycle of the opcode, but that is also the cycle in which the decision is made whether or not to take the interrupt. This means that when the CLI instruction is executed, you can still catch one interrupt.
At the time, I didn't think it would be a problem, but now I see the same problem happens when the I flag is initialized at reset, or when taking an interrupt. Just as the first instruction of the interrupt handler is fetched, the core decides that the I flag is still 0, and immediately handles a nested IRQ.
Somewhat surprisingly, the nested interrupt is all handled properly, and after two corresponding RTIs, control returns to the main program.
Fortunately, the fix is easy: it's just a matter of looking at the input of the I flag register instead of the output.
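A minimal sketch of that fix, with signal names invented for illustration (not the actual core's code): the interrupt decision is qualified by the value about to be clocked into the flag register, rather than the registered output.

```verilog
// Hypothetical sketch of the fix; all signal names are invented.
module i_flag_fix (
    input  clk,
    input  flag_we,      // assumed: asserted when the I flag is written
    input  flag_in,      // assumed: new I flag value (CLI/SEI/PLP/RTI,
                         //          reset, interrupt entry)
    input  irq_pending,  // assumed: synchronized IRQ request
    output take_irq
);
    reg i_flag;

    // The value about to be clocked into the flag register this cycle:
    wire i_next = flag_we ? flag_in : i_flag;

    always @(posedge clk)
        i_flag <= i_next;

    // Decide on the register *input*, not its output, so an update in
    // the fetch cycle is already seen in that same cycle.
    assign take_irq = irq_pending & ~i_next;
endmodule
```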
Re: My new verilog 65C02 core.
The design is getting close to being finished, at least the Spartan-6 specific stuff. I still need to port some changes back to the generic verilog.
TODO list:
- add RDY support
- fix NMI
- add the missing 65C02 instructions/addressing modes (STZ, BIT, JMP)
- clean up
- documentation
Re: My new verilog 65C02 core.
The RDY support for this core is going to come with some restrictions.
In order to support on-chip synchronous block RAMs, I have combinatorial address bus outputs (AD). In some states, for instance when doing a zeropage access, the DB value from the data bus is fed back into the AD output. If you were to pull RDY low during such a state, and then change the DB value (as would happen naturally when switching from operand address to zeropage address), the AD value would change as well, and become invalid.
I have no intention of adding extra muxes and flops to hold the old address to cover this case. Instead, if someone wants to use RDY, it's their responsibility to grab the AD value in the first cycle and hold it as long as RDY=0. I don't think this should be a problem, and solving the issue outside the core gives more flexibility anyway.
There is no problem when using the registered address bus outputs in combination with asynchronous memories, because the RDY signal can be connected to the clock enable of the output flops.
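Grabbing the address outside the core could look roughly like this (a sketch only; the port names AD, RDY, and ad_mem are assumptions, and RDY is taken as active-high):

```verilog
// External address-hold logic, a sketch with assumed port names.
module rdy_addr_hold (
    input         clk,
    input         RDY,
    input  [15:0] AD,      // the core's combinatorial address bus
    output [15:0] ad_mem   // address actually presented to memory
);
    reg [15:0] ad_hold;

    // Track the core's address in every ready cycle; the last captured
    // value is then held for the whole time RDY stays low.
    always @(posedge clk)
        if (RDY)
            ad_hold <= AD;

    // Freeze the address seen by the memory during wait states.
    assign ad_mem = RDY ? AD : ad_hold;
endmodule
```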
Re: My new verilog 65C02 core.
Preliminary support for RDY appears to be working. It passes the test suite with a random generator hooked up to the RDY input (a 1/16 chance of it being deasserted in any given cycle).
I've added a few logic expressions that still need to be converted to LUT instances. The impact on the design should be limited, because in most cases RDY just goes straight to the clock enable input of the flops, or gets combined with existing logic. I think one extra LUT is needed, plus maybe some extra buffers to limit the fan-out.
Edit: converted new logic to LUTs, including one extra one. Haven't checked fan-out yet.
Also added microcode support for STZ and new BIT instructions (including the BIT # which needs special flag handling), as well as JMP(ABS,X), and the rest of the TSB/TRB instructions. I think that was the last of the 65C02 specific stuff.
Re: My new verilog 65C02 core.
I tried Klaus Dormann's test for the 65C02 extensions, and it found a couple of mistakes in the microcode. The DEA wasn't doing anything, and the TRB/TSB accidentally set the N/V flags. Both have been fixed, and now it passes successfully.
I had to disable tests for WDC and Rockwell extensions, because they are not supported. Also, undefined instructions are not NOPs currently, but that is easily fixed by filling in the microcode table.
Re: My new verilog 65C02 core.
Very nice!! Seems like the job is done? What is the final verdict on speed, say in a system with some on-chip synchronous RAM?
So, now that you've run out of things to do on the core -- any chance you could revisit the use of distributed logic instead of a ROM block for the microcode?
(Yes, I know I'm being obstinate...)
Re: My new verilog 65C02 core.
The generic code needs to be updated still. I've been focusing on the Spartan-6 specific stuff lately.
I haven't checked the speed in a while, but I'll probably do that tomorrow, and then push a bit with constraints. I also want to test a bit more on real hardware as well.
When that's all done, the next step is to make modifications to target synchronous memories, without using dual port for writing.
And then I'll look into distributed logic.
Re: My new verilog 65C02 core.
Oh, I meant to respond to the RDY thing, just to check:
- if we're not using RDY, all is well
- if we're using external asynchronous memory, all is well
- if we're using RDY and using on-chip synchronous memory, we need to add an address flop
Re: My new verilog 65C02 core.
Yes, and if you have both on-chip synchronous and external async, but you only use the RDY when addressing external memory, all is well.
Re: My new verilog 65C02 core.
And if you have slow on-chip peripherals/memory that need a wait state, you're probably pipelining anyway, and then you can just use the register & state machine that you already have for that.
Re: My new verilog 65C02 core.
Ok, that was crazy.
I've been playing with real hardware, including a UART module that can provide interrupts. I noticed that the hardware would sometimes hang completely when running interrupts, not even the reset button would fix it.
In my top level model I have a small 2kB RAM attached to the memory bus, initialized with some test code. The upper address bits are not connected, so the RAM appears duplicated every $800 bytes. The code is assembled for $F800 (this is important).
In my sims, I've always been running with RST asserted from cycle #0, and then deasserted a few cycles later. In my real hardware, I have a reset button instead, so when you configure the FPGA, it doesn't get a proper reset until you press the button. What happens instead is that the microcode ROM outputs all-zeroes on the first cycle. This is interpreted as the BRK instruction, so the CPU starts handling a BRK right away. Likewise, the PC is initialized as all-zeroes, but gets incremented by one, so it is saved on the stack as $0001.
The BRK is handled, and eventually "returns" to address $0001, but this is the operand byte of the first instruction, so it's out of sync with the instruction stream. Because of that, it finds a $F8 operand byte (remember, my code is assembled for address $F800), which it interprets as an instruction. This happens to be the SED instruction. After that, it resyncs with the proper instruction stream, so the code appears to be running fine at first, just with the D flag set, which you'd think wouldn't matter.
The way I handle decimal mode is to feed the D flag directly into one of the microcode address lines, so it switches to another bank in the microcode ROM, which is supposed to be identical to the normal bank except for the ADC/SBC instructions. Except that it wasn't identical: I had forgotten to provide decimal versions of the IRQ code. So, when the IRQ from the UART came in, the processor ended up stuck in a loop. Pressing reset didn't help, because the microcode to handle a reset was also missing from the decimal bank.
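The banking scheme can be sketched like this (names and sizes are illustrative; the real ROM layout differs). Since the D flag is just one more ROM address bit, every sequence that can execute with D=1, including interrupt and reset handling, has to exist in both halves:

```verilog
// Sketch of a banked microcode lookup; names and sizes are illustrative.
module ucode_rom (
    input             clk,
    input             d_flag,   // decimal flag, used directly as an address bit
    input      [8:0]  ucode_pc, // microcode address within a bank
    output reg [31:0] uword
);
    // Two banks of 512 words: D=0 selects the binary bank, D=1 the
    // decimal bank. The decimal bank must therefore contain copies of
    // *all* sequences (including IRQ/RST handling), not just decimal
    // ADC/SBC -- otherwise microcode fetched with D=1 reads back empty.
    reg [31:0] rom [0:1023];

    wire [9:0] uaddr = { d_flag, ucode_pc };

    always @(posedge clk)
        uword <= rom[uaddr];
endmodule
```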