6502.org Forum  Projects  Code  Documents  Tools  Forum
All times are UTC




Posted: Sun Oct 25, 2020 6:42 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Arlet wrote:
When I add a pipeline stage to external AB signals, I can get fmax to 150 MHz.

No, unfortunately not. That was a mistake. I messed up my new AB logic and forgot to hook up the carry signal from ABL to ABH. With the carry restored, the AB calculation becomes the longest path. I've also run simulations with the test suite to verify that the current code is correct.

Looking at the details of that path, the logic appears to be implemented optimally for the Verilog source as written, but some design changes may be possible. One option I'm looking at is to create two parallel ABH paths, one for carry input 0 and one for carry input 1, and then use a mux to pick the winner. That would require 2 extra slices of logic for the extra ABH path, and 1 or 2 slices for the mux, which is probably a good trade-off. It could be made a configuration option.
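
For readers following along, the two-parallel-ABH-paths idea can be sketched as a small behavioral model. This is Python rather than the core's Verilog, and the function names and the base/offset split are assumptions made for illustration, not taken from the actual source. It only shows that computing both ABH candidates up front and selecting with the ABL carry gives the same result as a plain 16-bit add.

```python
def abh_carry_select(base_hi, offset_hi, carry_from_abl):
    """Upper address byte via carry-select (hypothetical helper).

    Both candidates are computed without waiting for the ABL carry;
    the late-arriving carry only drives the final 2:1 mux.
    """
    sum_c0 = (base_hi + offset_hi) & 0xFF        # candidate for carry-in = 0
    sum_c1 = (base_hi + offset_hi + 1) & 0xFF    # candidate for carry-in = 1
    return sum_c1 if carry_from_abl else sum_c0


def address(base, offset):
    """16-bit address built from an 8-bit ABL add plus carry-select ABH."""
    lo = (base & 0xFF) + (offset & 0xFF)
    abl = lo & 0xFF
    carry = lo >> 8                              # carry out of the low byte
    abh = abh_carry_select(base >> 8, offset >> 8, carry)
    return (abh << 8) | abl
```

For example, `address(0x12FF, 0x0001)` returns `0x1300`: the low byte wraps to `0x00`, and the carry picks the "carry-in = 1" candidate for the high byte.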


Last edited by Arlet on Sun Oct 25, 2020 6:43 am, edited 1 time in total.

Posted: Sun Oct 25, 2020 6:43 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Carry-select! My favourite...


Posted: Sun Oct 25, 2020 9:02 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
With the carry select, I can get back to 150 MHz.

The ABL path is often the longest. The basic operation is already optimal, and very simple: there's an input mux selecting a base register, followed by an adder that applies an offset. That's it. It's implemented in two layers of logic, which is the minimum considering that it's dealing with 5 different input sources and 4 select bits, and that it needs to do an addition.

However, there's still room for improvement, namely in the select bits for that input mux. Ideally these would come straight from the microcode ROM, but because of the limited width, these bits are compressed and need an extra layer of logic to be decoded. On Xilinx FPGAs, the ROM is actually 36 bits wide (thanks to the 4 parity bits), so there would be enough room to put the mux select bits in there. However, that would make the control vector 33 bits wide, and I think that would be rude to people trying to work with a 32-bit-wide memory (like the iCE40).

But, there is hope. The input mux has a free input pin. The offset adder has no free inputs, but has a don't care in the 'case' statement. Also, the 4 encoded bits in the microcode ROM are currently picked arbitrarily. So, I'm hopeful that by picking a clever encoding scheme, the decoder can actually be integrated inside the mux.


Posted: Sun Oct 25, 2020 9:15 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
It's really exciting to follow your rapid progress! 150 MHz is quite amazing.

I realize that in my application, which uses all the block RAM across the chip, I will incur an unavoidable penalty for routing delays. But compared to your original core this new one should provide a nice extra margin, to hopefully reach 100 MHz in a less "borderline" configuration...


Posted: Sun Oct 25, 2020 1:05 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Arlet wrote:
So, I'm hopeful that by picking a clever encoding scheme, the decoder can actually be integrated inside the mux.


Nope. One big problem is that the branch condition also feeds into the decision. Even with a wider control word from the microcode, there would still be a need for a logic layer.

I also don't see anywhere else in that path how the logic can be reduced. All that's left is better placement.


Posted: Sun Oct 25, 2020 1:08 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Are there some less time-critical microcode bits which could be compressed, so the ones which are time-critical can be less compressed, or uncompressed?


Posted: Sun Oct 25, 2020 1:23 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Probably not, but it wouldn't help because the branch condition signal doesn't come from the microcode, and it needs to be incorporated as well.

There are some options to shuffle some bits, which might be marginally faster, but I'm never getting rid of the mux selection logic.

On the other hand, not all hope for higher speeds is lost. About half the path is routing delay, and there's certainly room for improvement there. However, good manual placement requires manual instantiation, so I need to work on that first. Doing manual instantiation also gives perfect insight into exactly how much logic is required and how it fits together, so there may still be some trick that I'm overlooking.


Posted: Sun Oct 25, 2020 7:19 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I was playing around a bit with manual placement (using LOC attributes in the Verilog), and it seems to be working. On the floorplan, the locked elements appear in orange.

It shouldn't be too much work to place the critical path, and see how much that improves things.


Attachment: loc.png
Posted: Mon Oct 26, 2020 6:59 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Arlet wrote:
Probably not, but it wouldn't help because the branch condition signal doesn't come from the microcode, and it needs to be incorporated as well. There are some options to shuffle some bits, which might be marginally faster, but I'm never getting rid of the mux selection logic.

I did find a way to improve this path.

Like I said, the ABL calculation involves selecting a base register and adding an offset. The problem is that the branch condition test is involved in the base register select. However, it turns out that it is possible to swap the base and offset inputs, which means that the branch condition test isn't needed in the first step, but only in the second, which gives it a bit more time.

Also, because there are only 3 choices for the base register, I have 3 select inputs available on the LUT, which is just enough to feed in 3 compressed microcode bits directly, without having to go through an intermediate decoding stage.

By the way, a LUT decoder would probably be faster, because the distributed memory is asynchronous, so I could put in a free extra register.


Posted: Mon Oct 26, 2020 8:26 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Quote:
I did find a way to improve this path.


Questionable. The branch condition signal is no longer a problem, but now the register file access is needed earlier, which causes problems with the timing. The end result is about the same.


Posted: Tue Oct 27, 2020 10:02 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Thinking about this some more, I can go back to the original plan, using a 6-input LUT for the mux, which has 3 inputs to choose from and 3 selection bits. Ignoring the condition code, it is possible to take 3 bits straight from the microcode, which saves a decoder stage. Of the 3 inputs, 2 come straight from registers, so these are fast. The other comes from the DB, whose delay depends on how far away the memory sits. For my high-speed test setup, that memory is close enough.

Now, we can put 2 LUTs side-by-side, and evaluate both branch-taken, and branch-not-taken cases in parallel. The outputs then feed into a dedicated F7MUX inside the slice, which is operated by the condition code.

The total resources needed are the same: instead of 2 LUTs in series, we put them in parallel. The F7MUX is free, and very fast. In addition, this reduces the setup time requirement for the condition code test.
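
The series-versus-parallel arrangement can be modelled in a few lines. This is a Python sketch under stated assumptions, not the real design: the two select tables are invented for illustration (the actual microcode encoding is not given in this thread), and each `lut6` call stands for one 6-input LUT per output bit with 3 data inputs and 3 select bits.

```python
# Hypothetical encodings: which of the 3 sources each 3-bit select code picks.
# Index 0 and 1 are the two registers, index 2 is the DB input.
TAKEN_TABLE = [0, 1, 2, 0, 1, 2, 0, 1]
NOT_TAKEN_TABLE = [0, 1, 2, 1, 2, 0, 2, 0]


def lut6(table, sel, sources):
    """One LUT: 3 select bits choose among 3 data inputs."""
    return sources[table[sel]]


def base_series(sel, cond, sources):
    """Series form: the condition picks the decode, then the mux runs."""
    table = TAKEN_TABLE if cond else NOT_TAKEN_TABLE   # first logic layer
    return sources[table[sel]]                         # second logic layer


def base_parallel(sel, cond, sources):
    """Parallel form: both LUTs evaluate at once, F7MUX picks the result."""
    taken = lut6(TAKEN_TABLE, sel, sources)
    not_taken = lut6(NOT_TAKEN_TABLE, sel, sources)
    return taken if cond else not_taken                # models the F7MUX
```

Both forms return the same value for every `sel` and `cond`. The difference is only in when `cond` is needed: in the parallel form it arrives at the very last mux, which is where the relaxed setup requirement comes from.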


Attachment: f7mux.png
Posted: Tue Oct 27, 2020 11:05 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
Really nice to see your design thought process in action, thank you for sharing it!

I do hope that you will find the time to look into a version which uses distributed ROM for the instruction decoding instead of the block RAM. If I try to retrofit that into your current design, I'm bound to mess things up...


Posted: Tue Oct 27, 2020 11:22 am
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
For some reason, the new design is not meeting the 150 MHz constraint (one run missed it by 2 picoseconds). Still, the longest path is amazingly short, with only 3 logic layers. Depending on the run, our old friend the Z flag from the ALU is often critical as well, and sometimes it is simply the path from the register file, through the ALU, and back into the register file.

I have some ideas for the Z flag, though. Instead of waiting for the ALU adder output to go through the left/right shifter, we could grab the bits at the shifter input. The 6 center bits will always come out, whether we shift or not. These bits can go into their own LUT6. We can then use another LUT6 to combine this with the 2 outside bits, the shift operation (left/right/none), and the carry shift input.
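
A quick way to check the Z flag idea is an exhaustive behavioral model. The Python sketch below assumes an 8-bit shifter that moves the carry input into the vacated bit position (for a plain shift the carry input would simply be 0); that detail and all names are assumptions for illustration, not taken from the core. The early version looks only at the adder output before the shifter: one 6-input function for the center bits, and a second one combining that result with the 2 outside bits, the shift operation, and the carry input.

```python
NONE, LEFT, RIGHT = 0, 1, 2   # hypothetical encoding of the shift operation


def z_reference(adder_out, shift_op, carry_in):
    """Z flag computed the slow way, after the shifter."""
    if shift_op == LEFT:
        result = ((adder_out << 1) | carry_in) & 0xFF
    elif shift_op == RIGHT:
        result = (adder_out >> 1) | (carry_in << 7)
    else:
        result = adder_out
    return result == 0


def z_early(adder_out, shift_op, carry_in):
    """Z flag computed from the shifter input."""
    # First LUT6: bits 6..1 reach the output whether we shift or not.
    center_zero = (adder_out & 0x7E) == 0
    bit0 = adder_out & 1
    bit7 = (adder_out >> 7) & 1
    # Second LUT6: center_zero, bit0, bit7, 2 shift-op bits, carry_in.
    if shift_op == LEFT:
        edges_zero = bit0 == 0 and carry_in == 0    # bit 7 is shifted out
    elif shift_op == RIGHT:
        edges_zero = bit7 == 0 and carry_in == 0    # bit 0 is shifted out
    else:
        edges_zero = bit0 == 0 and bit7 == 0        # carry_in is ignored
    return center_zero and edges_zero
```

Looping over all 256 adder values, the 3 shift operations, and both carry values shows the two functions agree everywhere, so under these assumptions the shifter can be taken out of the Z flag path.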


Attachment: timing.png
Posted: Tue Oct 27, 2020 12:14 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I found the problem. It was caused by a small, unrelated change: I changed a 4-input signal into a 5-input signal at the end of a critical path. This should not have caused any problems by itself. However, the tools were being too clever and combined that logic with something else, blowing it up to a 7-input function and making it slower (and maybe even bigger).

Longest path is now 6.501 ns, or 153 MHz, again in the ALU Z-flag.
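
The conversion between path delay and clock frequency is just a reciprocal. As a small Python sketch (helper names are made up for illustration), using the numbers quoted in this thread:

```python
def period_ns_to_mhz(period_ns):
    """Maximum clock frequency in MHz for a given longest-path delay in ns."""
    return 1000.0 / period_ns


def mhz_to_period_ns(freq_mhz):
    """Clock period in ns for a given frequency in MHz."""
    return 1000.0 / freq_mhz


fmax_mhz = period_ns_to_mhz(6.501)           # longest path reported above
budget_ns = mhz_to_period_ns(150.0)          # period of the 150 MHz constraint
slack_ps = (budget_ns - 6.501) * 1000.0      # margin left against that constraint
print(f"fmax {fmax_mhz:.1f} MHz, budget {budget_ns:.3f} ns, slack {slack_ps:.0f} ps")
```

A 150 MHz constraint allows about 6.667 ns per cycle, so a 6.501 ns path leaves a bit over 0.16 ns of margin.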


Posted: Tue Oct 27, 2020 12:21 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
By the way, on my desktop PC the Verilog simulator runs the generic version at a simulated speed of 29 MHz, which is pretty impressive for something that has to simulate all the internal bits of the CPU. It only takes 3 seconds to complete the 87 million cycles of Klaus' test suite.

