6502.org Forum

All times are UTC




Posted: Tue Oct 27, 2020 12:29 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Is that Verilator? I gather it's long been the highest-performance simulator, but I'd be interested in others.


Posted: Tue Oct 27, 2020 12:37 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Yes, the latest version of Verilator.

Note that this is only true for the generic code. Running the LUT-level simulation is significantly slower, because it then needs to handle each bit separately. It also generates a line of C++ code for each bit initialized in each of the LUT memories, producing thousands of extra lines of source code, which take noticeably longer to compile.


Posted: Tue Oct 27, 2020 12:54 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
A while ago, when I was near the start of this project, I thought I had a clever idea to make a 'bypassed M input' in the ALU shifter:
Code:
 * op       function
 * ==============================
 * 00---  | unmodified adder result
 * 01---  | bypassed M input
 * 10---  | adder shift left
 * 11---  | adder shift right

The idea was that the ALU itself would not need an option to just pass M unmodified (M is the data from the last memory fetch), but would always incorporate the register file in the calculations. The result then goes into the shifter, which can pass it along, or do a shift/rotate left or right. Because there was an unused 4th option, I figured that I could use that as a bypass for unmodified M. This could be useful for doing LDA/LDX/LDY instructions.

But it now turns out that this feature is not needed. Instead, I can just calculate "M <OR> R", where R comes from the register file and is selected to be the fixed zero register. Not only does that reduce routing pressure and fan-out for M, but it also makes it easier to speed up the Z-flag calculation, since the bits always come from the ALU adder output.
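The equivalence between the dropped bypass and "M <OR> zero" can be checked with a tiny behavioural model. This is only a Python sketch, not the core's Verilog: the names `ZERO_REG`, `alu_or`, `z_flag` and `load` are made up for illustration, and the Z flag is assumed to be a plain "all eight bits are zero" test on the ALU output.

```python
# Behavioural sketch with hypothetical names: pass M through the ALU by
# ORing it with a register that always reads as zero.
ZERO_REG = 0x00  # assumed fixed-zero entry in the register file


def alu_or(m, r):
    """8-bit bitwise OR of memory operand M and register operand R."""
    return (m | r) & 0xFF


def z_flag(alu_out):
    """Z flag taken from the ALU output only (no separate bypass path)."""
    return (alu_out & 0xFF) == 0


def load(m):
    """Model of the LDA/LDX/LDY data path: returns (result, Z flag)."""
    result = alu_or(m, ZERO_REG)
    return result, z_flag(result)


if __name__ == "__main__":
    # M | 0 must equal M for every possible byte value.
    assert all(load(m) == (m, m == 0) for m in range(256))
    print("M | 0 == M for all 256 byte values")
```

Because the loaded value always appears on the ALU output in this model, the Z flag needs only one source, which is the point made above.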

I'm not sure whether it's causal or just random variation, but the fastest run is now 6.405 ns (156 MHz).


Posted: Tue Oct 27, 2020 1:31 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
By the way, some of the things are only possible because I'm using a dual port register file. For instance, LDA is doing A <= M | Z, using both A and Zero registers at the same time. And of course, the single-cycle TXA uses both X and A registers.

If you're ever designing a CPU, I highly recommend using dual port (or even triple port) distributed memory for the register file. The costs are acceptable (if your FPGA target supports it), and it makes life a lot easier.
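As a rough illustration of why two ports help, here is a toy Python model with one read address and one independent write address per cycle. Everything in it is an assumption made for the sketch: the class name `DualPortRegFile`, the register names, the `ZERO` entry, and the idea that TXA simply passes the read operand through.

```python
class DualPortRegFile:
    """Toy register file: one read port and one write port per cycle."""

    def __init__(self):
        # ZERO is the assumed always-zero register used as a neutral operand.
        self.regs = {"A": 0, "X": 0, "Y": 0, "ZERO": 0}

    def cycle(self, read_addr, write_addr, m, op):
        """Read one register, combine it with M, write another register."""
        r = self.regs[read_addr]          # read port
        result = op(m, r) & 0xFF          # ALU operation on M and R
        if write_addr != "ZERO":          # the zero register is never written
            self.regs[write_addr] = result  # write port, same cycle
        return result


def lda(rf, m):
    """A <= M | ZERO: reads ZERO and writes A in a single cycle."""
    return rf.cycle("ZERO", "A", m, lambda m, r: m | r)


def txa(rf):
    """A <= X: reads X and writes A in a single cycle (pass-through assumed)."""
    return rf.cycle("X", "A", 0, lambda m, r: r)


if __name__ == "__main__":
    rf = DualPortRegFile()
    lda(rf, 0x5A)
    rf.regs["X"] = 0x33
    txa(rf)
    print(hex(rf.regs["A"]))
```

With a single-port file, the read of X and the write of A in `txa` would need two separate cycles, since only one address could be presented at a time.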


Posted: Tue Oct 27, 2020 5:54 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I finished the (hopefully) final rewrite of the microcode control bits for the address bus, incorporating the F7MUX. I also instantiated more of the design directly as Spartan-6 design elements, including this ALU shifter mux, which worked the first time even though I wrote down the hex INIT strings without first making a table.
Code:
/*
 * op       function
 * ===============================
 * 0?--  | unmodified adder result
 * 10--  | adder shift left
 * 11--  | adder shift right
 */

LUT5 #(.INIT(32'hf0ccaaaa)) out0(.O(OUT[0]), .I0(add[0]), .I1(SI),     .I2(add[1]), .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out1(.O(OUT[1]), .I0(add[1]), .I1(add[0]), .I2(add[2]), .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out2(.O(OUT[2]), .I0(add[2]), .I1(add[1]), .I2(add[3]), .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out3(.O(OUT[3]), .I0(add[3]), .I1(add[2]), .I2(add[4]), .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out4(.O(OUT[4]), .I0(add[4]), .I1(add[3]), .I2(add[5]), .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out5(.O(OUT[5]), .I0(add[5]), .I1(add[4]), .I2(add[6]), .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out6(.O(OUT[6]), .I0(add[6]), .I1(add[5]), .I2(add[7]), .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out7(.O(OUT[7]), .I0(add[7]), .I1(add[6]), .I2(SI),     .I3(op[3]), .I4(op[4]));
LUT5 #(.INIT(32'hf0ccaaaa)) out8(.O(CO),     .I0(C8),     .I1(add[7]), .I2(add[0]), .I3(op[3]), .I4(op[4]));

The LUT5 is a nice thing to use, because on the Spartan-6 each 6-input LUT can be split into two 5-input LUTs that share their inputs, which means you can get up to 8 LUTs in a slice instead of 4.
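The INIT string used in the shifter instantiations above can be decoded mechanically. The Python sketch below assumes the usual Xilinx convention that the INIT bit index is formed from {I4, I3, I2, I1, I0}; the helper names `lut5`, `shifter_mux` and `shift8` are invented for this example.

```python
INIT = 0xF0CCAAAA  # the value used for all nine LUT5 instances


def lut5(init, i0, i1, i2, i3, i4):
    """Evaluate a 5-input LUT: the inputs form an index into the INIT bits."""
    index = (i4 << 4) | (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
    return (init >> index) & 1


def shifter_mux(i0, i1, i2, op3, op4):
    """Intended function: op4=0 pass I0, op=10 pick I1, op=11 pick I2."""
    if op4 == 0:
        return i0
    return i2 if op3 else i1


def shift8(add, si, c8, op3, op4):
    """Apply the nine LUTs to an 8-bit adder result; returns (OUT, CO)."""
    def bit(n):
        return (add >> n) & 1

    out = 0
    for n in range(8):
        left = si if n == 0 else bit(n - 1)    # I1: neighbour for shift left
        right = si if n == 7 else bit(n + 1)   # I2: neighbour for shift right
        out |= lut5(INIT, bit(n), left, right, op3, op4) << n
    co = lut5(INIT, c8, bit(7), bit(0), op3, op4)
    return out, co


if __name__ == "__main__":
    # Exhaustively compare the INIT string against the intended mux.
    for idx in range(32):
        i0, i1, i2, i3, i4 = [(idx >> k) & 1 for k in range(5)]
        assert lut5(INIT, i0, i1, i2, i3, i4) == shifter_mux(i0, i1, i2, i3, i4)
    print("INIT 0xF0CCAAAA matches the op table")
```

Reading the constant byte by byte gives the same answer: the low 16 bits `aaaa` select I0 whenever op[4] is 0, `cc` selects I1 for op = 10, and `f0` selects I2 for op = 11.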

The address bus logic is now fully instantiated, as well as the main ALU path, but not yet the random logic involving the flags and BCD adjustment. Still need to do the microcode control logic, flags, and register file.

Slice count is 44 right now, 60 flip flops, and 134 LUTs. The slice/LUT ratio indicates that slice count could go lower with better packing. Something in the 30's could be possible, maybe.
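A quick back-of-the-envelope check of that estimate, as a Python sketch. It assumes a Spartan-6 slice holds 4 six-input LUTs and 8 flip flops, and it treats every one of the 134 reported LUTs as occupying a full LUT6 site, which is pessimistic if some of them are paired LUT5s. The function name `min_slices` is made up here.

```python
import math

LUT6_PER_SLICE = 4  # Spartan-6 slice: four 6-input LUTs
FF_PER_SLICE = 8    # and eight flip flops


def min_slices(luts, ffs):
    """Lower bound on slices if packing were perfect (no LUT5 pairing)."""
    by_luts = math.ceil(luts / LUT6_PER_SLICE)
    by_ffs = math.ceil(ffs / FF_PER_SLICE)
    return max(by_luts, by_ffs)


if __name__ == "__main__":
    luts, ffs, slices = 134, 60, 44
    print("LUTs per slice now:", round(luts / slices, 2))
    print("perfect-packing bound:", min_slices(luts, ffs))
```

Under these assumptions the bound is 34 slices, which is consistent with the guess that something in the 30s could be reachable. The flip flops are not the limit here, since 60 of them fit in 8 slices.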


Posted: Tue Oct 27, 2020 6:09 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
(Just a thought here, for those wanting a full 64k of block RAM and finding a conflict with this new core's 2k microcode taking some of that up: if there's 2k of ROM in the system, perhaps put that into distributed RAM? Or even synthesise it - if it's say a font, it might be quite compressible. Or maybe give yourself 2k of distributed RAM - would that work? It might be that distributed RAM works better for general purpose byte-wide than it does for microcode.)


Posted: Tue Oct 27, 2020 6:20 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
As you can see here, automatic floorplanning doesn't look very impressive. And if you show the ratsnest for some of these isolated elements, it goes all over the place.

Of course, the Xilinx business plan is to sell you a bigger/faster FPGA instead.


Attachment: floorplan.png (15.54 KiB)
Posted: Tue Oct 27, 2020 6:32 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Here are the 8 base register muxes for the lower address bus, plus all their connections to other elements.

That area in the bottom left is where the lower half of program counter is stored.


Attachment: adl.png (67.36 KiB)
Posted: Tue Oct 27, 2020 7:12 pm
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
That's pretty bad. What does it look like for the same elements after you manually place stuff?


Posted: Tue Oct 27, 2020 7:39 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Here I have manually placed the same input mux in a row (the parts in orange). The blue parts are still left up to the tools, and are still scattered everywhere.

Some of the other orange things are just blocks I moved out of the way to make room.

The user interface of that tool (PlanAhead) is pretty horrible, by the way. I constantly ended up accidentally zooming instead of moving components. It is helpful that it shows you where you can place components. If you hover over the wrong place, it shows that the component won't fit there. For instance, in the slices with the F7MUX used, you can still use 2 of the 8 flip flops in the slice using bypass inputs, but not any of the remaining 6.

Together with the slice diagram, it's helpful to learn the possibilities and restrictions.

Unfortunately, on my (Linux) machine, the FPGA editor doesn't work. I'm missing some (old) dynamic libraries.


Attachment: manual.png (67.45 KiB)


Last edited by Arlet on Tue Oct 27, 2020 7:47 pm, edited 1 time in total.
Posted: Tue Oct 27, 2020 7:41 pm
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
BigEd wrote:
(Just a thought here, for those wanting a full 64k of block RAM and finding a conflict with this new core's 2k microcode taking some of that up: if there's 2k of ROM in the system, perhaps put that into distributed RAM? Or even synthesise it - if it's say a font, it might be quite compressible. Or maybe give yourself 2k of distributed RAM - would that work? It might be that distributed RAM works better for general purpose byte-wide than it does for microcode.)

For the 65F02, I would not want to "hardwire" the design to one specific host -- and would hence want to grab the host ROM upon startup. So it would need to be distributed RAM, not ROM, in the FPGA.

Most host systems will have less than 64k of RAM+ROM anyway, so one could spare 2k for the microcode. Or I could forfeit acceleration for one 2k block of ROM (or RAM), and always access the host memory for that block. Opening libraries for the chess computer, or video RAM for personal computers come to mind; that would not hurt the performance much.

But all the above schemes suffer from the difficulty of excluding (only) a 2k block from the 64k memory range. As I have learned during my experiments with 9-bit-wide memory, Xilinx synthesizes the 64k*8 memory by using the block RAMs in 16k*1 configuration, resulting in much smaller and faster 4:1 multiplexers than if it were built from 2k*8 blocks. So I wouldn't be able to carve a 2k*8 block out of the 64k memory, it seems.
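The difference between the two organisations can be worked out with a few lines of arithmetic. This Python sketch only counts blocks and output mux inputs; the function name `organisation` and the dictionary keys are invented for the example, and it says nothing about what the synthesis tool actually emits.

```python
def organisation(total_depth, data_bits, block_depth, block_width):
    """Count the blocks and the read mux width for a tiled memory."""
    banks = -(-total_depth // block_depth)    # blocks stacked in depth (ceil)
    columns = -(-data_bits // block_width)    # blocks side by side in width
    return {
        "blocks": banks * columns,
        "mux_inputs_per_bit": banks,  # each data bit selects among the banks
    }


if __name__ == "__main__":
    K = 1024
    print("16k*1:", organisation(64 * K, 8, 16 * K, 1))
    print("2k*8: ", organisation(64 * K, 8, 2 * K, 8))
```

Both layouts need 32 blocks, but the 16k*1 layout needs only a 4:1 mux per data bit, while the 2k*8 layout needs a 32:1 mux per data bit. In the 16k*1 layout any 2k address range sits inside all eight bit-column blocks of one 16k bank, so no whole block can be freed by leaving that range out.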


Posted: Tue Oct 27, 2020 7:48 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England
Oh, yes, I'd forgotten that architectural situation.


Posted: Tue Oct 27, 2020 7:53 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
You could still mix and match, using 32 kB + 16 kB + 8 kB blocks, etc. Not ideal, but better than building it from 2k*8.

And instead of using a mux, you can make a 6-input OR to combine them. The block RAMs have a reset pin that allows you to set the output to all zeroes (or any value you want). So, you could do 16, 16, 16, 8, 4, 2, and use a single LUT per data bit.
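A small Python model of the OR-combining idea, purely as a sketch. The base addresses are an assumption (banks packed upward from address 0 in the order listed), as are the names `Bank`, `build_banks` and `read_or`. A bank that is not addressed returns zero, which stands in for holding its output register in reset.

```python
SIZES_KIB = [16, 16, 16, 8, 4, 2]  # six banks, 62 KiB in total


class Bank:
    """One memory bank whose output is forced to zero when not selected."""

    def __init__(self, base, size):
        self.base, self.size = base, size
        self.mem = bytearray(size)

    def selected(self, addr):
        return self.base <= addr < self.base + self.size

    def read(self, addr):
        # An unselected bank drives all zeroes (modelling the reset pin).
        return self.mem[addr - self.base] if self.selected(addr) else 0

    def write(self, addr, value):
        if self.selected(addr):
            self.mem[addr - self.base] = value & 0xFF


def build_banks(sizes_kib):
    """Lay the banks out back to back, starting at address 0 (assumed)."""
    banks, base = [], 0
    for kib in sizes_kib:
        banks.append(Bank(base, kib * 1024))
        base += kib * 1024
    return banks


def read_or(banks, addr):
    """Combine all bank outputs with a bitwise OR instead of a mux."""
    data = 0
    for bank in banks:
        data |= bank.read(addr)
    return data


if __name__ == "__main__":
    banks = build_banks(SIZES_KIB)
    banks[0].write(0x0000, 0xAB)
    banks[5].write(0xF7FF, 0xCD)
    print(hex(read_or(banks, 0x0000)), hex(read_or(banks, 0xF7FF)))
```

With six banks, each data bit is the OR of six signals, which fits in one 6-input LUT. In this layout the top 2 KiB of the address space is left uncovered, which is the room that would be given up to the microcode.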


Posted: Tue Oct 27, 2020 9:00 pm
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Arlet wrote:
And instead of using a mux, you can make a 6-input OR to combine them. The block RAMs have a reset pin that allows you to set the output to all zeroes (or any value you want). So, you could do 16, 16, 16, 8, 4, 2, and use a single LUT per data bit.

But the block RAM's reset option works only with registered outputs, if I recall correctly. And that wouldn't work with your core, right?


Posted: Tue Oct 27, 2020 9:29 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
The block RAM data is also registered, so the reset will work the same.

I've modified my core to also work with synchronous memory, using unregistered address output. The only caveat is that the writes overlap with the reads, so it requires dual port RAM. If you're not using the 2nd port for anything, it should work with the block RAMs.

