After the 'full' and 'small' versions of this core, here is the 'tiny' one, with its own interesting memory/performance tradeoff.
All three variations have a memory footprint of 524288 * n + 2048 * 5 + 64 bits. The full and small versions have an n of 7 and 4 respectively (the small needs fewer 64 KB blocks, but delays absolute-addressed instructions by one cycle); the tiny has an n of 1, so it requires little more memory than a regular core, yet still benchmarks at 165% (where the full and small both do over 200%). The 165% can get higher still, because some penalty cycles can probably be prevented (although that may decrease Fmax).
So, with 2% memory overhead (which allows speculative reads from zero page and the stack, so the related instructions still take only 1 cycle), and the need for true dual-ported memory (either two reads or one write per cycle), the 'tiny' version of the core still performs 65% better than a real 65C02. Also, where the full and small top out at 200 MHz on a Stratix V, the tiny manages 240 MHz (currently resulting in a 400 MHz benchmark), which is right up there with the other, slightly more resource-efficient cores.
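For concreteness, the footprint formula works out as follows (a quick sketch in Python; the reading of each 524288-bit term as one 64 KB copy of main memory follows from the "64 KB blocks" above):

```python
# Footprint formula from above: 524288 * n + 2048 * 5 + 64 bits,
# where each 524288-bit term is one 64 KB copy of main memory.
def footprint_bits(n):
    return 524288 * n + 2048 * 5 + 64

for name, n in [("full", 7), ("small", 4), ("tiny", 1)]:
    bits = footprint_bits(n)
    print(f"{name}: n = {n}, {bits} bits = {bits // 8 // 1024} KB")

# The fixed part is the roughly 2% overhead mentioned above for n = 1:
print(f"overhead for n = 1: {100 * (2048 * 5 + 64) / 524288:.2f}%")
```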
Yet another (unnamed) 65C02 core
Re: Yet another (unnamed) 65C02 core
Last edited by Windfall on Tue Feb 22, 2022 10:01 pm, edited 1 time in total.
Re: Yet another (unnamed) 65C02 core
Nice illustration of how microarchitectural refinements sometimes do, and sometimes don't, pay off against max frequency. (And of course those payoffs shift around depending on implementation techniques and technology.)
Re: Yet another (unnamed) 65C02 core
BigEd wrote:
Nice illustration of how microarchitectural refinements sometimes do, and sometimes don't, pay off against max frequency. (And of course those payoffs shift around depending on implementation techniques and technology.)
I've now published the 'tiny' version on my website, alongside slightly changed 'full' and 'small' versions. Running the Klaus Dormann 6502 verification tests obtained from https://github.com/mungre/beeb6502test revealed one bug (the SBC V flag was wrong) and two largely irrelevant discrepancies: an issue with P bits 5 and 4, which I did not fix because fixing it affects Fmax, and an issue with the sizes of the (non-EA) NOPs, which I fixed.
Since 'tiny' can make do with one copy of main memory, but requires it to be 3 x 8 = 24 bits wide (for shared code/data accesses), one 'byte twist' is needed: use true dual-ported memory, read two consecutive 16-bit words (disregarding address bit 0), and select the right 24 bits (according to address bit 0).
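As a behavioral sketch of that byte twist (Python rather than Verilog, just to show the selection logic; the little-endian packing of bytes into 16-bit words is my assumption):

```python
# Model a 32K x 16 true dual-ported RAM holding 64 KB of byte-addressable
# memory, little-endian within each 16-bit word (an assumption of this sketch).
mem = list(range(256)) * 256          # 65536 bytes of dummy content
words = [mem[2 * i] | (mem[2 * i + 1] << 8) for i in range(32768)]

def read_24(addr):
    """Read 3 consecutive bytes starting at byte address addr, in one cycle."""
    w0 = words[(addr >> 1) & 0x7FFF]          # port A: word at floor(addr/2)
    w1 = words[((addr >> 1) + 1) & 0x7FFF]    # port B: the next word
    combined = w0 | (w1 << 16)                # 4 consecutive bytes
    if addr & 1:                              # select 24 bits by address bit 0
        return (combined >> 8) & 0xFFFFFF
    return combined & 0xFFFFFF

# The three bytes at addr, addr+1, addr+2 come out low byte first:
b = read_24(0x1234)
assert [b & 0xFF, (b >> 8) & 0xFF, (b >> 16) & 0xFF] == mem[0x1234:0x1237]
```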
The published 'tiny' has a 200 MHz Fmax and a 200% benchmark (instead of 240 MHz Fmax and a 165% benchmark; maybe that variant follows some time).
Re: Yet another (unnamed) 65C02 core
Windfall wrote:
Since 'tiny' can make do with one copy of main memory, but requires it to be 3 x 8 = 24 bits wide (for shared code/data accesses), one 'byte twist' is needed: use true dual-ported memory, read two consecutive 16-bit words (disregarding address bit 0), and select the right 24 bits (according to address bit 0).
If it were 32Kx16, I can see how it might work: BRAM is dual-ported, so you can do two independent reads at the same time. If you want to read 3 bytes from address x, you read from address floor(x/2) on one port and floor(x/2)+1 on the other. Combine those into a 32-bit word and use either bits 0-23 or bits 8-31.
I don't see how a 24 bit wide memory works though. How do you translate a 16 bit address into the BRAM address without dividing by 3?
My own plan was to have four 8-bit memories, each providing 8 bits of a 32-bit word, but with independent addresses. Each memory can access either floor(x/4) or floor(x/4)+1. By controlling which memories get which address, plus a bit of shuffling of the resulting data, you can get four consecutive bytes from any address. The address selection and shuffling would add a little complexity and probably reduce the maximum clock speed. I was hoping that needing fewer cycles for each instruction would compensate for that. I was particularly interested in the ability to have one instruction being read while the previous instruction wrote to memory (also, fetching 32 bits in one go means most of the time you get the opcode of the following instruction a cycle early, which might be helpful).
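A behavioral sketch of that four-bank scheme (Python; the bank layout and the shuffling details are my reading of the description above, not an actual implementation):

```python
# Four 8-bit memories; bank k holds byte k of each aligned 32-bit word,
# i.e. banks[k][i] = mem[4*i + k]. Each bank gets its own address.
mem = [(i * 7 + 3) & 0xFF for i in range(65536)]   # dummy contents
banks = [[mem[4 * i + k] for i in range(16384)] for k in range(4)]

def read_4(addr):
    """Fetch 4 consecutive bytes starting at any byte address."""
    word = addr >> 2                 # floor(addr/4)
    off = addr & 3
    out = []
    for j in range(4):               # result byte j comes from bank (off+j)%4
        k = (off + j) & 3
        # banks that wrapped past the aligned word need the next row (+1)
        row = word + (1 if k < off else 0)
        out.append(banks[k][row & 0x3FFF])
    return out

assert read_4(0x1001) == mem[0x1001:0x1005]   # unaligned fetch
```

The per-bank `+1` only ever applies to the banks whose byte index wrapped, which is the address-selection complexity mentioned above.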
I probably won't be pursuing this any further though, as Windfall has already achieved most of what I was hoping to.
Re: Yet another (unnamed) 65C02 core
BigEd wrote:
Very interesting... but I must be missing something. I can't quite see how you get the sequential 24 bits you want. Could you elaborate a little please? How wide is the dual ported memory? Is it addressed by bytes? Does your 24 bit read take just one cycle total?
The setup in my 6502 Second Processor implementation is as below (how the TDP_W macro couples to memory should be self-explanatory; the backslash-linefeed continuations are written out in the macro part).
Code: Select all
`define RAM_IN_FPGA_TDP_W(ABITS, INAME, RENABLEA, RADDRESSA, RDATAA, RENABLEB, RADDRESSB, RDATAB, WENABLE, WADDRESS, WDATA, ROMFILE) \
ram_in_fpga_tdp # (.DATA_BITS(16), .ADDRESS_BITS(ABITS-1), .RAM_FILE(ROMFILE)) INAME \
( \
.a_clock(ram_clock), .a_address(WENABLE ? WADDRESS[ABITS-1:1] : RADDRESSA), .ar_enable(RENABLEA), .ar_data(RDATAA), .aw_enable(WENABLE), .aw_data({ WDATA, WDATA }), .aw_byte(WADDRESS[0] ? 2'b10 : 2'b01), \
.b_clock(ram_clock), .b_address( RADDRESSB), .br_enable(RENABLEB), .br_data(RDATAB) \
);
[... snipped ...]
reg peek_16x24_1_skew;
wire [15:0] peek_16x24_1_data_0;
wire [15:0] peek_16x24_1_data_1;
// remember address bit 0, so it lines up with the registered read data below
always @(posedge ram_clock)
peek_16x24_1_skew <= peek_16x24_address_1[0];
// port A reads the word at floor(address/2), port B the word after it
`RAM_IN_FPGA_TDP_W(16, peek_16x24_1_1, peek_16x24_enable_1, peek_16x24_address_1[15:1] + 0, peek_16x24_1_data_0, peek_16x24_enable_1, peek_16x24_address_1[15:1] + 1, peek_16x24_1_data_1, peek_16x24_write_1, write_16x8_address, write_16x8_data, "reco6502_main_0_w.hex")
// select the 24 bits starting at the requested byte address out of the two 16-bit words
assign peek_16x24_data_1 = peek_16x24_1_skew ? { peek_16x24_1_data_1, peek_16x24_1_data_0[15:8] } : { peek_16x24_1_data_1[7:0], peek_16x24_1_data_0 };
Last edited by Windfall on Wed Feb 23, 2022 12:32 pm, edited 1 time in total.
Re: Yet another (unnamed) 65C02 core
John West wrote:
My own plan was to have four 8-bit memories, each providing 8 bits of a 32 bit word, but with independent addresses. Each memory can access either floor(x/4) or floor(x/4)+1. By controlling which memories get which address, and a bit of shuffling of the resulting data, you can get four consecutive bytes from any address.
Ah, thanks - the unavoidable thing, then, is to add one to an address - you can't just mask off the last bit - but it might well be that the (time) cost of the increment is no problem.
Re: Yet another (unnamed) 65C02 core
BigEd wrote:
Ah, thanks - the unavoidable thing, then, is to add one to an address - you can't just mask off the last bit - but it might well be that the (time) cost of the increment is no problem.
The addition can be avoided completely by doing the +1 read on another memory block with all its contents shifted down by 1 byte. This is what I do in several places in the 6502 Second Processor. Of course, the write address then incurs a -1. But that often turns out to be cheaper.
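A minimal sketch of that shifted-copy trick (Python; the names and sizes are invented for illustration):

```python
# Two copies of the same memory; mem_b holds everything shifted down one
# byte, so mem_b[i] == mem_a[(i + 1) % SIZE]. A pair read at address x then
# needs no +1 adder: both copies are indexed with the same x.
SIZE = 65536
mem_a = [0] * SIZE
mem_b = [0] * SIZE

def write(addr, value):
    mem_a[addr] = value
    mem_b[(addr - 1) % SIZE] = value    # the write incurs the -1 instead

def read_pair(addr):
    """Bytes at addr and addr+1, with no adder on the read path."""
    return mem_a[addr], mem_b[addr]

write(0x10, 0xAA)
write(0x11, 0xBB)
assert read_pair(0x10) == (0xAA, 0xBB)
```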
Re: Yet another (unnamed) 65C02 core
There's a limited set of places that the addresses can come from. If the +1 is in the critical path, you might be able to duplicate those sources and move it elsewhere: two units handling indexed addressing modes, with one adding an extra carry; two PCs, one a byte ahead of the other; and so on.
Of course, if the critical path is one of those other places, that would end up making it slower. It needs careful reading of the timing reports.
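The "two PCs" idea could look something like this (a toy Python sketch; the register names are invented):

```python
# Keep two program counters with the invariant pc1 == pc + 1 (mod 2^16).
# A fetch that needs address and address+1 reads both registers directly;
# the +1 happens off the fetch path, while the PCs advance.
class TwoPCs:
    def __init__(self, start):
        self.pc = start & 0xFFFF
        self.pc1 = (start + 1) & 0xFFFF

    def fetch_addresses(self):
        return self.pc, self.pc1            # no adder in the fetch path

    def advance(self, n):
        # one shared increment, overlapped with using the fetched bytes
        self.pc = (self.pc + n) & 0xFFFF
        self.pc1 = (self.pc + 1) & 0xFFFF

p = TwoPCs(0xFFFE)
assert p.fetch_addresses() == (0xFFFE, 0xFFFF)
p.advance(2)                                # wraps around
assert p.fetch_addresses() == (0x0000, 0x0001)
```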
Re: Yet another (unnamed) 65C02 core
John West wrote:
There's a limited set of places that the addresses can come from. If the +1 is in the critical path, you might be able to duplicate those sources and move it elsewhere: two units handling indexed addressing modes, with one adding an extra carry; two PCs, one a byte ahead of the other; and so on.