
All times are UTC




PostPosted: Sat Feb 19, 2022 9:44 pm 

Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
After the 'full' and 'small' versions of this core, now follows the 'tiny' one, with its own interesting memory / performance tradeoff.

All three variations have a memory footprint of 524288 * n + 2048 * 5 + 64 bits. The full and small versions have an n of 7 and 4 respectively (the small needs fewer 64 KB blocks, but delays absolute addressed instructions by one cycle), while the tiny has an n of 1, so it requires little more than a regular core, yet still benchmarks at 165% (the full and small both do over 200%). The 165% can get higher still, because some penalty cycles can probably be prevented (although that may decrease FMax).
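To make the footprint formula concrete, here is a quick sketch (in Python rather than the core's Verilog; `footprint_bits` is just an illustrative name) evaluating it for the three variants:

```python
def footprint_bits(n):
    # 524288 bits = one 64 KB x 8 block; 2048 * 5 + 64 bits of fixed overhead
    return 524288 * n + 2048 * 5 + 64

for name, n in [("full", 7), ("small", 4), ("tiny", 1)]:
    bits = footprint_bits(n)
    print(f"{name}: {bits} bits = {bits / 8 / 1024:.1f} KiB")

# the tiny version's overhead beyond a plain 64 KB memory
extra = footprint_bits(1) - 524288
print(f"extra: {extra} bits = {extra / 524288:.1%} of 64 KB")
```

The roughly 2% extra for 'tiny' is the overhead figure quoted below for the speculative zero page and stack reads.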

So, with 2% memory overhead (which allows speculative reads from zero page and the stack, so the related instructions still take only 1 cycle), and the need for true dual-ported memory (either two reads or one write), the 'tiny' version of the core still performs 65% better than a real 65C02. Also, where the full and small top out at 200 MHz on a Stratix V, the tiny manages 240 MHz (currently resulting in a 400 MHz benchmark), which puts it right up there with the other, slightly more resource-efficient cores.


Last edited by Windfall on Tue Feb 22, 2022 10:01 pm, edited 1 time in total.

PostPosted: Mon Feb 21, 2022 7:26 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Nice illustration of how microarchitectural refinements sometimes do, and sometimes don't, pay off against max frequency. (And of course those payoffs shift around depending on implementation techniques and technology.)


PostPosted: Tue Feb 22, 2022 9:56 pm 

Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
Nice illustration of how microarchitectural refinements sometimes do, and sometimes don't, pay off against max frequency. (And of course those payoffs shift around depending on implementation techniques and technology.)

Yes. After sufficient optimisation, architectural changes do tend to result in the same performance ...

I've now published the 'tiny' version on my website, alongside slightly changed 'full' and 'small' versions. Running the Klaus Dormann 6502 verification tests, obtained from https://github.com/mungre/beeb6502test, revealed one bug (the SBC V flag was wrong) and two largely irrelevant discrepancies: an issue with P bits 5 and 4, which I did not fix because it affects Fmax, and an issue with (not-EA) NOP sizes, which I fixed.

Since 'tiny' can make do with one copy of main memory, but requires it to be 3 x 8 = 24 bits wide (for shared code/data accesses), one 'byte twist' is needed: use true dual-ported memory, read two consecutive 16-bit words (disregarding address bit 0), and select the right 24 bits (according to address bit 0).

The published 'tiny' is 200 MHz Fmax, 200% benchmark (instead of 240 MHz Fmax, 165% benchmark, maybe that one follows some time).


PostPosted: Wed Feb 23, 2022 8:38 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Windfall wrote:
Since 'tiny' can make do with one copy of main memory, but requires it to be 3 x 8 = 24 bits wide (for shared code/data accesses), one 'byte twist' is needed: use true dual ported memory, read two consecutive 16-bit words (disregarding address bit 0), select the right 24 bits (according to address bit 0).

Very interesting... but I must be missing something. I can't quite see how you get the sequential 24 bits you want. Could you elaborate a little please? How wide is the dual ported memory? Is it addressed by bytes? Does your 24 bit read take just one cycle total?


PostPosted: Wed Feb 23, 2022 9:42 am 

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 298
If it were 32Kx16, I can see how it might work: BRAM is dual-ported, so you can do two independent reads at the same time. If you want to read 3 bytes from address x, you read from address floor(x/2) on one port and floor(x/2)+1 on the other. Combine those into a 32-bit word and use either bits 0-23 or bits 8-31.
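A minimal Python model of that two-port read (my own sketch; `read24`, the little-endian word packing, and the sample data are illustrative assumptions, not anyone's actual RTL):

```python
def read24(mem16, x):
    """Read 3 consecutive bytes starting at byte address x from a
    32K x 16 memory (little-endian words), using two read ports."""
    w0 = mem16[(x >> 1) & 0x7FFF]         # port A: floor(x/2)
    w1 = mem16[((x >> 1) + 1) & 0x7FFF]   # port B: floor(x/2) + 1
    word32 = w0 | (w1 << 16)              # combine into a 32-bit word
    # even address: bits 0-23; odd address: bits 8-31
    return (word32 >> (8 * (x & 1))) & 0xFFFFFF

# sample contents: bytes 0x10, 0x21, ... packed into 16-bit words
data = [0x10, 0x21, 0x32, 0x43, 0x54, 0x65]
mem16 = [data[2 * i] | (data[2 * i + 1] << 8) for i in range(3)]
```

Reading at an odd address shifts the combined word right by one byte, which is exactly the "select the right 24 bits according to address bit 0" step.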

I don't see how a 24 bit wide memory works though. How do you translate a 16 bit address into the BRAM address without dividing by 3?

My own plan was to have four 8-bit memories, each providing 8 bits of a 32 bit word, but with independent addresses. Each memory can access either floor(x/4) or floor(x/4)+1. By controlling which memories get which address, and a bit of shuffling of the resulting data, you can get four consecutive bytes from any address. The address selection and shuffling would add a little complexity and probably reduce the maximum clock speed. I was hoping that needing fewer cycles for each instruction would compensate for that - I was particularly interested in the ability to have one instruction being read while the previous instruction wrote to memory (also, fetching 32 bits in one go means most of the time you get the opcode of the following instruction a cycle early, which might be helpful).
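The four-bank plan above can be modelled the same way (again a hypothetical Python sketch): each bank gets either floor(x/4) or floor(x/4)+1 depending on x mod 4, and a rotation of the results puts the bytes back in order.

```python
def read32(banks, x):
    """Fetch 4 consecutive bytes starting at byte address x from four
    independently addressed 8-bit banks, where byte a lives in
    banks[a % 4][a // 4]."""
    base, rot = x >> 2, x & 3
    # banks below the rotation point need the next row: floor(x/4) + 1
    data = [banks[b][base + (1 if b < rot else 0)] for b in range(4)]
    # rotate so element k is the byte at address x + k
    return [data[(rot + k) & 3] for k in range(4)]

# sample contents: 16 bytes striped across the four banks
flat = list(range(0x20, 0x30))
banks = [[flat[4 * i + b] for i in range(4)] for b in range(4)]
```

The per-bank address selection and the output rotation are the "bit of shuffling" that would cost some clock speed.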

I probably won't be pursuing this any further though, as Windfall has already achieved most of what I was hoping to.


PostPosted: Wed Feb 23, 2022 12:19 pm 

Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
Very interesting... but I must be missing something. I can't quite see how you get the sequential 24 bits you want. Could you elaborate a little please? How wide is the dual ported memory? Is it addressed by bytes? Does your 24 bit read take just one cycle total?

You map the 64K x 8 address (r) onto 32K x 16 memory. Read 16 bits from r[15:1] + 0 and r[15:1] + 1, then use r[0] to extract the right 24 bits.

The setup in my 6502 Second Processor implementation is as below (how the TDP_W macro couples to memory should be self-explanatory, and imagine backslash-linefeed continuation in the macro part ...).

Code:
`define RAM_IN_FPGA_TDP_W(ABITS, INAME, RENABLEA, RADDRESSA, RDATAA, RENABLEB, RADDRESSB, RDATAB, WENABLE, WADDRESS, WDATA, ROMFILE)

ram_in_fpga_tdp # (.DATA_BITS(16), .ADDRESS_BITS(ABITS-1), .RAM_FILE(ROMFILE)) INAME
(
  .a_clock(ram_clock), .a_address(WENABLE ? WADDRESS[ABITS-1:1] : RADDRESSA), .ar_enable(RENABLEA), .ar_data(RDATAA), .aw_enable(WENABLE), .aw_data({ WDATA, WDATA }), .aw_byte(WADDRESS[0] ? 2'b10 : 2'b01),
  .b_clock(ram_clock), .b_address(                                RADDRESSB), .br_enable(RENABLEB), .br_data(RDATAB)
);

[... snipped ...]

reg         peek_16x24_1_skew;
wire [15:0] peek_16x24_1_data_0;
wire [15:0] peek_16x24_1_data_1;

always @(posedge ram_clock)
  peek_16x24_1_skew <= peek_16x24_address_1[0];

`RAM_IN_FPGA_TDP_W(16, peek_16x24_1_1, peek_16x24_enable_1, peek_16x24_address_1[15:1] + 0, peek_16x24_1_data_0, peek_16x24_enable_1, peek_16x24_address_1[15:1] + 1, peek_16x24_1_data_1, peek_16x24_write_1, write_16x8_address, write_16x8_data, "reco6502_main_0_w.hex")

assign peek_16x24_data_1 = peek_16x24_1_skew ? { peek_16x24_1_data_1, peek_16x24_1_data_0[15:8] } : { peek_16x24_1_data_1[7:0], peek_16x24_1_data_0 };


Last edited by Windfall on Wed Feb 23, 2022 12:32 pm, edited 1 time in total.

PostPosted: Wed Feb 23, 2022 12:30 pm 

Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
John West wrote:
My own plan was to have four 8-bit memories, each providing 8 bits of a 32 bit word, but with independent addresses. Each memory can access either floor(x/4) or floor(x/4)+1. By controlling which memories get which address, and a bit of shuffling of the resulting data, you can get four consecutive bytes from any address.

There is usually no need to shuffle data. Only addresses.


PostPosted: Wed Feb 23, 2022 1:46 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10800
Location: England
Ah, thanks - the unavoidable thing, then, is to add one to an address - you can't just mask off the last bit - but it might well be that the (time) cost of the increment is no problem.


PostPosted: Wed Feb 23, 2022 2:06 pm 

Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
Ah, thanks - the unavoidable thing, then, is to add one to an address - you can't just mask off the last bit - but it might well be that the (time) cost of the increment is no problem.

It will take a bit of time, like everything in front of it (multiplexing, other address calculations). The optimizer may be able to combine it with other additions (a +1 could simply become a carry in of another addition).

The addition can be avoided completely by doing the +1 read on another memory block with all its contents shifted down by 1 byte. This is what I do in several places in the 6502 Second Processor. Of course, the write address then incurs a -1. But that often turns out to be cheaper.
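The shifted-copy trick can be sketched as follows (a hypothetical Python model, not the actual Second Processor code): the second copy holds the same contents shifted down by one word, so the read side needs no adder at all, and the -1 moves to the write side.

```python
class ShiftedRam:
    """Two copies of the same memory: 'a' is normal, 'b' holds the
    contents shifted down by one word, so reading words r and r+1
    needs no +1 on any read address."""
    def __init__(self, size):
        self.a = [0] * size
        self.b = [0] * size          # b[i] mirrors a[i + 1]

    def write(self, w, value):
        self.a[w] = value
        if w > 0:
            self.b[w - 1] = value    # the write side pays the -1

    def read_pair(self, r):
        # both ports use the same address r; no increment needed
        return self.a[r], self.b[r]  # equals a[r], a[r + 1]
```

After writing words 3 and 4, `read_pair(3)` returns both with no increment in the read path, which is where the cycle time matters.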


PostPosted: Wed Feb 23, 2022 3:10 pm 

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 298
There's a limited set of places that the addresses can come from. If the +1 is in the critical path, you might be able to duplicate those sources and move it elsewhere: two units handling indexed addressing modes, with one adding an extra carry; two PCs, one a byte ahead of the other; and so on.

Of course, if the critical path is one of those other places, that would end up making it slower. It needs careful reading of the timing reports.


PostPosted: Wed Feb 23, 2022 3:50 pm 

Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
John West wrote:
There's a limited set of places that the addresses can come from. If the +1 is in the critical path, you might be able to duplicate those sources and have move it elsewhere. So two units handling indexed addressing modes, with one adding an extra carry. Two PCs, one a byte ahead of the other. And so on.

If you build for speed only, the optimizer could very well do that, yes. It could duplicate logic as needed while weighing the benefits against the routing costs (if any), and pick the best one. As long as you don't use 'synthesis keep' it will do its own combinatorial optimization anyway, and probably do a much better job than you ever could :D






