After the 'full' and 'small' versions of this core, here is the 'tiny' one, with its own interesting memory/performance tradeoff.
All three variations have a memory footprint of 524288 * n + 2048 * 5 + 64 bits. The full and small versions have an n of 7 and 4 respectively (the small needs fewer 64 KB blocks, but delays absolute-addressed instructions by one cycle); the tiny has an n of 1, so it requires little more memory than a regular core, yet still benchmarks at 165% (where the full and small both do over 200%). The 165% can get higher still, because some penalty cycles can probably be prevented (although that may decrease Fmax).
So, with 2% memory overhead (which allows speculative reads from zero page and the stack, so the related instructions still take only 1 cycle), and the need for true dual-ported memory (either two reads or one write per cycle), the 'tiny' version of the core still performs 65% better than a real 65C02. Also, where the full and small top out at 200 MHz on a Stratix V, the tiny manages 240 MHz (currently resulting in a 400 MHz benchmark), which is right up there with the other, slightly more resource-efficient cores.
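For concreteness, the footprint formula works out as follows (a quick sketch in Python; the reading of each 524288-bit term as one 64 KB copy of main memory follows from the "64 KB blocks" above):

```python
# Footprint formula from above: 524288 * n + 2048 * 5 + 64 bits,
# where each 524288-bit term is one 64 KB copy of main memory.
def footprint_bits(n):
    return 524288 * n + 2048 * 5 + 64

for name, n in [("full", 7), ("small", 4), ("tiny", 1)]:
    bits = footprint_bits(n)
    print(f"{name}: n = {n}, {bits} bits = {bits // 8 // 1024} KB")

# The fixed part is the roughly 2% overhead mentioned above for n = 1:
print(f"overhead for n = 1: {100 * (2048 * 5 + 64) / 524288:.2f}%")
```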
Yet another (unnamed) 65C02 core
Re: Yet another (unnamed) 65C02 core
Last edited by Windfall on Tue Feb 22, 2022 10:01 pm, edited 1 time in total.
Re: Yet another (unnamed) 65C02 core
Nice illustration of how microarchitectural refinements sometimes do, and sometimes don't, pay off against max frequency. (And of course those payoffs shift around depending on implementation techniques and technology.)
Re: Yet another (unnamed) 65C02 core
BigEd wrote:
Nice illustration of how microarchitectural refinements sometimes do, and sometimes don't, pay off against max frequency. (And of course those payoffs shift around depending on implementation techniques and technology.)
I've now published the 'tiny' version on my website, alongside slightly changed 'full' and 'small' versions. Running the Klaus Dormann 6502 verification tests obtained from https://github.com/mungre/beeb6502test revealed one bug (the SBC V flag was wrong) and two largely irrelevant discrepancies: an issue with P bits 5 and 4, which I did not fix because fixing it affects Fmax, and an issue with the sizes of the (non-EA) NOPs, which I fixed.
Since 'tiny' can make do with one copy of main memory, but requires it to be 3 x 8 = 24 bits wide (for shared code/data accesses), one 'byte twist' is needed: use true dual-ported memory, read two consecutive 16-bit words (disregarding address bit 0), and select the right 24 bits (according to address bit 0).
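As a behavioral sketch of that byte twist (Python rather than Verilog, just to show the selection logic; the little-endian packing of bytes into 16-bit words is my assumption):

```python
# Model a 32K x 16 true dual-ported RAM holding 64 KB of byte-addressable
# memory, little-endian within each 16-bit word (an assumption of this sketch).
mem = list(range(256)) * 256          # 65536 bytes of dummy content
words = [mem[2 * i] | (mem[2 * i + 1] << 8) for i in range(32768)]

def read_24(addr):
    """Read 3 consecutive bytes starting at byte address addr, in one cycle."""
    w0 = words[(addr >> 1) & 0x7FFF]          # port A: word at floor(addr/2)
    w1 = words[((addr >> 1) + 1) & 0x7FFF]    # port B: the next word
    combined = w0 | (w1 << 16)                # 4 consecutive bytes
    if addr & 1:                              # select 24 bits by address bit 0
        return (combined >> 8) & 0xFFFFFF
    return combined & 0xFFFFFF

# The three bytes at addr, addr+1, addr+2 come out low byte first:
b = read_24(0x1234)
assert [b & 0xFF, (b >> 8) & 0xFF, (b >> 16) & 0xFF] == mem[0x1234:0x1237]
```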
The published 'tiny' has a 200 MHz Fmax and a 200% benchmark (instead of 240 MHz Fmax and a 165% benchmark; maybe that variant follows some time).
Re: Yet another (unnamed) 65C02 core
Windfall wrote:
Since 'tiny' can make do with one copy of main memory, but requires it to be 3 x 8 = 24 bits wide (for shared code/data accesses), one 'byte twist' is needed: use true dual-ported memory, read two consecutive 16-bit words (disregarding address bit 0), and select the right 24 bits (according to address bit 0).
If it were 32Kx16, I can see how it might work: BRAM is dual-ported, so you can do two independent reads at the same time. If you want to read 3 bytes from address x, you read from address floor(x/2) on one port and floor(x/2)+1 on the other. Combine those into a 32-bit word and use either bits 0-23 or bits 8-31.
I don't see how a 24 bit wide memory works though. How do you translate a 16 bit address into the BRAM address without dividing by 3?
My own plan was to have four 8-bit memories, each providing 8 bits of a 32-bit word, but with independent addresses. Each memory can access either floor(x/4) or floor(x/4)+1. By controlling which memories get which address, plus a bit of shuffling of the resulting data, you can get four consecutive bytes from any address. The address selection and shuffling would add a little complexity and probably reduce the maximum clock speed. I was hoping that needing fewer cycles for each instruction would compensate for that. I was particularly interested in the ability to have one instruction being read while the previous instruction wrote to memory (also, fetching 32 bits in one go means most of the time you get the opcode of the following instruction a cycle early, which might be helpful).
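A behavioral sketch of that four-bank scheme (Python; the bank layout and the shuffling details are my reading of the description above, not an actual implementation):

```python
# Four 8-bit memories; bank k holds byte k of each aligned 32-bit word,
# i.e. banks[k][i] = mem[4*i + k]. Each bank gets its own address.
mem = [(i * 7 + 3) & 0xFF for i in range(65536)]   # dummy contents
banks = [[mem[4 * i + k] for i in range(16384)] for k in range(4)]

def read_4(addr):
    """Fetch 4 consecutive bytes starting at any byte address."""
    word = addr >> 2                 # floor(addr/4)
    off = addr & 3
    out = []
    for j in range(4):               # result byte j comes from bank (off+j)%4
        k = (off + j) & 3
        # banks that wrapped past the aligned word need the next row (+1)
        row = word + (1 if k < off else 0)
        out.append(banks[k][row & 0x3FFF])
    return out

assert read_4(0x1001) == mem[0x1001:0x1005]   # unaligned fetch
```

The per-bank `+1` only ever applies to the banks whose byte index wrapped, which is the address-selection complexity mentioned above.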
I probably won't be pursuing this any further though, as Windfall has already achieved most of what I was hoping to.
Re: Yet another (unnamed) 65C02 core
BigEd wrote:
Very interesting... but I must be missing something. I can't quite see how you get the sequential 24 bits you want. Could you elaborate a little please? How wide is the dual ported memory? Is it addressed by bytes? Does your 24 bit read take just one cycle total?
The setup in my 6502 Second Processor implementation is as below (how the TDP_W macro couples to memory should be self-explanatory; the backslash-linefeed continuations are written out in the macro part).
Code: Select all
`define RAM_IN_FPGA_TDP_W(ABITS, INAME, RENABLEA, RADDRESSA, RDATAA, RENABLEB, RADDRESSB, RDATAB, WENABLE, WADDRESS, WDATA, ROMFILE) \
ram_in_fpga_tdp # (.DATA_BITS(16), .ADDRESS_BITS(ABITS-1), .RAM_FILE(ROMFILE)) INAME \
( \
.a_clock(ram_clock), .a_address(WENABLE ? WADDRESS[ABITS-1:1] : RADDRESSA), .ar_enable(RENABLEA), .ar_data(RDATAA), .aw_enable(WENABLE), .aw_data({ WDATA, WDATA }), .aw_byte(WADDRESS[0] ? 2'b10 : 2'b01), \
.b_clock(ram_clock), .b_address( RADDRESSB), .br_enable(RENABLEB), .br_data(RDATAB) \
);
[... snipped ...]
reg peek_16x24_1_skew;
wire [15:0] peek_16x24_1_data_0;
wire [15:0] peek_16x24_1_data_1;
// remember address bit 0, so it lines up with the registered read data below
always @(posedge ram_clock)
peek_16x24_1_skew <= peek_16x24_address_1[0];
// port A reads the word at floor(address/2), port B the word after it
`RAM_IN_FPGA_TDP_W(16, peek_16x24_1_1, peek_16x24_enable_1, peek_16x24_address_1[15:1] + 0, peek_16x24_1_data_0, peek_16x24_enable_1, peek_16x24_address_1[15:1] + 1, peek_16x24_1_data_1, peek_16x24_write_1, write_16x8_address, write_16x8_data, "reco6502_main_0_w.hex")
// select the 24 bits starting at the requested byte address out of the two 16-bit words
assign peek_16x24_data_1 = peek_16x24_1_skew ? { peek_16x24_1_data_1, peek_16x24_1_data_0[15:8] } : { peek_16x24_1_data_1[7:0], peek_16x24_1_data_0 };
Last edited by Windfall on Wed Feb 23, 2022 12:32 pm, edited 1 time in total.
Re: Yet another (unnamed) 65C02 core
John West wrote:
My own plan was to have four 8-bit memories, each providing 8 bits of a 32 bit word, but with independent addresses. Each memory can access either floor(x/4) or floor(x/4)+1. By controlling which memories get which address, and a bit of shuffling of the resulting data, you can get four consecutive bytes from any address.
Ah, thanks - the unavoidable thing, then, is to add one to an address - you can't just mask off the last bit - but it might well be that the (time) cost of the increment is no problem.
Re: Yet another (unnamed) 65C02 core
BigEd wrote:
Ah, thanks - the unavoidable thing, then, is to add one to an address - you can't just mask off the last bit - but it might well be that the (time) cost of the increment is no problem.
The addition can be avoided completely by doing the +1 read on another memory block with all its contents shifted down by 1 byte. This is what I do in several places in the 6502 Second Processor. Of course, the write address then incurs a -1. But that often turns out to be cheaper.
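A minimal sketch of that shifted-copy trick (Python; the names and sizes are invented for illustration):

```python
# Two copies of the same memory; mem_b holds everything shifted down one
# byte, so mem_b[i] == mem_a[(i + 1) % SIZE]. A pair read at address x then
# needs no +1 adder: both copies are indexed with the same x.
SIZE = 65536
mem_a = [0] * SIZE
mem_b = [0] * SIZE

def write(addr, value):
    mem_a[addr] = value
    mem_b[(addr - 1) % SIZE] = value    # the write incurs the -1 instead

def read_pair(addr):
    """Bytes at addr and addr+1, with no adder on the read path."""
    return mem_a[addr], mem_b[addr]

write(0x10, 0xAA)
write(0x11, 0xBB)
assert read_pair(0x10) == (0xAA, 0xBB)
```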
Re: Yet another (unnamed) 65C02 core
There's a limited set of places that the addresses can come from. If the +1 is in the critical path, you might be able to duplicate those sources and move it elsewhere: two units handling indexed addressing modes, with one adding an extra carry; two PCs, one a byte ahead of the other; and so on.
Of course, if the critical path is one of those other places, that would end up making it slower. It needs careful reading of the timing reports.
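The "two PCs" idea could look something like this (a toy Python sketch; the register names are invented):

```python
# Keep two program counters with the invariant pc1 == pc + 1 (mod 2^16).
# A fetch that needs address and address+1 reads both registers directly;
# the +1 happens off the fetch path, while the PCs advance.
class TwoPCs:
    def __init__(self, start):
        self.pc = start & 0xFFFF
        self.pc1 = (start + 1) & 0xFFFF

    def fetch_addresses(self):
        return self.pc, self.pc1            # no adder in the fetch path

    def advance(self, n):
        # one shared increment, overlapped with using the fetched bytes
        self.pc = (self.pc + n) & 0xFFFF
        self.pc1 = (self.pc + 1) & 0xFFFF

p = TwoPCs(0xFFFE)
assert p.fetch_addresses() == (0xFFFE, 0xFFFF)
p.advance(2)                                # wraps around
assert p.fetch_addresses() == (0x0000, 0x0001)
```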
Re: Yet another (unnamed) 65C02 core
John West wrote:
There's a limited set of places that the addresses can come from. If the +1 is in the critical path, you might be able to duplicate those sources and move it elsewhere: two units handling indexed addressing modes, with one adding an extra carry; two PCs, one a byte ahead of the other; and so on.