6502.org

Posted: **Wed Apr 25, 2012 9:42 pm**

ElEctric_EyE wrote:

I'm all ears, Ed. Do you (or anyone) have any ideas on how to actually go about this?

It's fair to challenge me! In truth I'm not sure of the details, only a vague idea. It's always the details! I may have a look at the weekend.

Posted: **Wed Apr 25, 2012 10:02 pm**

EEye:

If I understand your question correctly, the performance difference that you note between your Pentium 4 and i7 machines is mostly due to the differences in their microarchitectures: number of instruction decoders and dispatchers, execution pipeline depth, cache architecture, and memory interface bus speed and efficiency. According to a detailed tutorial/analysis of the Intel/AMD microarchitectures that I found on the web, the performance differences come down essentially to Intel's decision to focus its resources on the Pentium II microarchitecture, on which the i7 processors are based, instead of the Pentium 4 microarchitecture.

Posted: **Wed Apr 25, 2012 10:05 pm**

Thanks for responding, MM, but why should the processor being used make any difference to the outcome in ISE when using SmartXplorer? With any version of ISE, on any x86 processor?

Posted: **Wed Apr 25, 2012 10:09 pm**

EEyE:

I guess that I misunderstood your original question. From your reply, I gather that you are noting performance differences for the synthesized HDL on two different computers??

Posted: **Wed Apr 25, 2012 10:17 pm**

MichaelM wrote:

EEyE:

I guess that I misunderstood your original question. From your reply, I gather that you are noting performance differences for the synthesized HDL on two different computers??

Yes, identical project files. ISE13.4 on both machines...

I would usually post a question like this on the Xilinx forums, as it is a nagging curiosity. I was just wondering if anyone else has come across a similar experience.

BTW, this forum is more active than Xilinx forums.

Thanks for sharing your expertise!

Posted: **Wed Apr 25, 2012 10:44 pm**

EEyE:

That is somewhat curious behaviour. At the synthesis level, I can't think of any reason why there would be any differences. The Xilinx MAP and PAR tools, on the other hand, generally operate with algorithms that are stochastic, i.e. driven by random seeds. Thus, each mapping and place and route operation can yield different results. A simple PERIOD constraint on the clock signal is generally enough to get each "random" output to converge to a consistent result.
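To illustrate the point about seed-driven algorithms, here is a toy sketch in Python. It is not the Xilinx algorithm: the cell names, nets, and cost function are all invented. It only shows that a randomized placer is repeatable for a given seed, while a different seed can land on a different placement with a different wire length.

```python
import random

# Hypothetical toy netlist: pairs of cells that are wired together.
CELLS = ["a", "b", "c", "d", "e"]
NETS = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "e"), ("b", "e")]

def wire_length(order):
    """Total distance between connected cells when placed in a single row."""
    slot = {cell: i for i, cell in enumerate(order)}
    return sum(abs(slot[u] - slot[v]) for u, v in NETS)

def place(seed, tries=3):
    """Try a few random placements and keep the best one.

    A crude stand-in for a stochastic placer: the seed fully
    determines which placements get tried.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(tries):
        order = CELLS[:]
        rng.shuffle(order)
        if best is None or wire_length(order) < wire_length(best):
            best = order
    return best

if __name__ == "__main__":
    for seed in range(4):
        result = place(seed)
        print(seed, result, wire_length(result))
```

Running the same seed twice gives the same placement, which is why identical tool options (including any seed or cost-table setting) matter when comparing two machines.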

In the past, I've had trouble with the Xilinx tools taking shortcuts and using results from previous cycles as guides for a new place and route. Therefore, I periodically purge the project files to force a clean build.

One final thought. If you've not compared the TCL files of both projects, I would run "Generate TCL Script" under the "Project" menu on each machine, and compare the two files to ensure that all of the tool options are set to the same values. (That file is something that I've recently begun using after someone showed me how easy it was to save and restore a project's settings using it. Previously, I simply tried to remember and reset the project settings from their GUIs.)
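A minimal sketch of that comparison, assuming the two generated scripts have already been read into lists of lines. The file names and the sample `project set` lines below are invented for illustration and are not claimed to match what ISE actually writes out.

```python
import difflib

def diff_settings(lines_a, lines_b, name_a="machine_a.tcl", name_b="machine_b.tcl"):
    """Return unified-diff lines for two scripts, ignoring leading/trailing whitespace."""
    a = [ln.strip() for ln in lines_a]
    b = [ln.strip() for ln in lines_b]
    # n=0 suppresses context lines so only the differing settings are shown.
    return list(difflib.unified_diff(a, b, fromfile=name_a, tofile=name_b,
                                     lineterm="", n=0))

# Invented sample content standing in for the two exported scripts.
script_a = [
    'project set "Optimization Goal" "Speed"',
    'project set "Optimization Effort" "High"',
]
script_b = [
    'project set "Optimization Goal" "Speed"',
    'project set "Optimization Effort" "Normal"',
]

if __name__ == "__main__":
    for line in diff_settings(script_a, script_b):
        print(line)
```

An empty result means the two scripts carry the same settings line for line; any `-`/`+` pair points at an option worth checking on both machines.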

Posted: **Wed Apr 25, 2012 11:21 pm**

MichaelM wrote:

...One final thought. If you've not compared the TCL files of both projects, I would run the "Generate TCL Script" under the "Project" menu, and compare them to ensure that all of the tool options are set to the same values. (That file is something that I've recently begun using after someone showed me how easy it was to save and restore a project's settings using it. Previously, I simply tried to remember and reset the project settings from their GUIs.)

Thank You! I'll try this out...
The only thing I've tried is to 'Cleanup Project Files'. This used to work on earlier versions of ISE.
ISE13.4 seems to be more 'tight' as far as lack of errors goes, and this is a good thing!

Posted: **Thu Apr 26, 2012 5:03 am**

Just to note that when I looked into SmartXplorer 12.4 some time ago, there was more gain to be had with further manual exploration beyond the 7 default tactics.

To answer an earlier query, I wouldn't expect changing an 'x' into a value to help with the result, because you're decreasing the freedom of the tool. It might do, if the tool is failing to find a good solution which you know about. I don't advise changing the HDL source in search of higher clock speeds unless you're also looking at the synthesis reports and understanding the nature of the critical path. If the critical path is in the ALU, then changing the instruction decoding is unlikely to help.

Cheers
Ed

Posted: **Thu Apr 26, 2012 9:33 pm**

BigEd wrote:

... I don't advise changing the HDL source in search of higher clock speeds unless you're also looking at the synthesis reports and understanding the nature of the critical path. If the critical path is in the ALU, then changing the instruction decoding is unlikely to help.

Cheers
Ed

There's a nice tool in PlanAhead (Analyze Timing) that lets you see the actual path, and the paths that fail are highlighted in red. Is this what you were referring to?
In my case src_reg_3 is the one that is slowest. Now how to figure out where src_reg_3 is in the code...

Posted: **Thu Apr 26, 2012 9:51 pm**

Interesting - I've only seen textual reports, not looked in the GUI.

Posted: **Fri Apr 27, 2012 12:01 am**

Ah, I found out src_reg_3 is actually bit 3 of the src_reg. Now this would lead me to believe my previous argument, that being careful with the bits in the opcode decoding can actually lead to higher speeds, especially if you can negate them with a '0'. That way, I'm surmising here, the tools will naturally cut the path as close to the 'source' as possible instead of routing 'x's through many levels of MUXes. Am I on the right track?

BTW, that PlanAhead tool is worth some investigating and experimentation.

Posted: **Fri Apr 27, 2012 12:15 pm**

Hi EEye
The 'x' in a case statement should always be working in your favour, because they give the logic synthesis the maximum degree of freedom to implement whatever is smallest or fastest.
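A toy sketch of why a don't-care helps, written in Python rather than HDL. It is only an illustration: the candidate list and the gate costs are made up, and a real synthesis tool uses heuristics rather than this exhaustive search. The idea is that an 'x' row lets more candidate circuits satisfy the specification, so the cheapest match can only get cheaper or stay the same.

```python
# Candidate implementations of a 2-input function, cheapest first.
# The cost column is an invented gate count used only for this illustration.
CANDIDATES = [
    ("0",      0, lambda a, b: 0),
    ("1",      0, lambda a, b: 1),
    ("a",      0, lambda a, b: a),
    ("b",      0, lambda a, b: b),
    ("a & b",  1, lambda a, b: a & b),
    ("a | b",  1, lambda a, b: a | b),
    ("a ^ b",  1, lambda a, b: a ^ b),
    ("a & ~b", 2, lambda a, b: a & (1 - b)),
    ("~a & b", 2, lambda a, b: (1 - a) & b),
]

def cheapest(spec):
    """spec maps (a, b) -> 0, 1, or 'x' (don't care).

    Returns (name, cost) of the first candidate that matches every
    row the spec cares about, or None if no candidate fits.
    """
    for name, cost, fn in CANDIDATES:
        if all(want == "x" or fn(a, b) == want for (a, b), want in spec.items()):
            return name, cost
    return None

# Input combination (1, 1) is treated as an unused encoding.
with_dont_care = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): "x"}
forced_to_zero = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 0}

if __name__ == "__main__":
    print(cheapest(with_dont_care))   # ('a', 0): a plain wire is enough
    print(cheapest(forced_to_zero))   # ('a & ~b', 2): extra logic is needed
```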

In a clean (RISC-like) instruction encoding, the destination register would just be some selection of bits from the IR, and wouldn't take any decoding. But we have this legacy instruction set: and bit 3 is right down there in that legacy set. In the 6502, that was quite efficient, but of course in this branch you've added more registers and operations and filled out the bottom 8 bits. That's evidently slowed down the machine - it's bound to slow down the decode, in retrospect, because unless you're extremely careful, you're perturbing something which was put together with great care for efficient decoding. Up to a point you get away with that because decode hasn't been on the critical path, but if you add extra opcodes without the same care for placement as the original 6502 designers then the decode will get more complex.

A radical suggestion - which would need even more cooperation from the assembler than you already need - would be to modify the encodings of existing opcodes, so that for example LDX would signal in the upper bits that X is the destination. Instead of $00A6, $00B6 and so on, you'd use $20A6, $20B6 or whatever would be appropriate to address X in the register file. However, I see from your code that your register file now has 5-bit addresses... so I'll steer clear of trying to think too hard about this particular implementation. I think I'm revisiting ideas I put forward in the 65Org16.c thread.
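A rough Python sketch of the contrast, under loudly stated assumptions: the field position (bits 15..13) and the register numbering are hypothetical, chosen only so that $20A6 selects X, and they do not describe the actual core. The legacy side uses the standard 6502 LDX opcodes.

```python
# Hypothetical layout: bits 15..13 of the 16-bit opcode name the destination.
DEST_SHIFT = 13
DEST_MASK = 0x7
REG_NAMES = {0: "A", 1: "X"}  # illustrative numbering only

def dest_from_field(opcode):
    """Re-encoded scheme: the destination is a plain bit slice of the opcode."""
    return REG_NAMES.get((opcode >> DEST_SHIFT) & DEST_MASK)

# Standard 6502 LDX opcodes (imm, zp, zp,Y, abs, abs,Y).
LEGACY_LDX = {0xA2, 0xA6, 0xB6, 0xAE, 0xBE}

def dest_legacy(opcode):
    """Legacy scheme (partial): X is recognised only by matching whole opcodes.

    Returns None for opcodes this sketch does not cover.
    """
    return "X" if (opcode & 0xFF) in LEGACY_LDX else None

if __name__ == "__main__":
    for op in (0x00A6, 0x20A6, 0x20B6):
        print(hex(op), dest_from_field(op), dest_legacy(op))
```

In hardware the first function is just wiring, while the second stands for a pattern match across the low opcode bits, which is the kind of logic that grows as more opcodes are added.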

I see in your latest checkin that you have changed some x's to 0's and you have the speed as 91MHz. What was it before?

The problem with lots of complex decode lines - even with Arlet's original - is being absolutely sure that every opcode is doing what it should, and nothing more.

Cheers
Ed

Posted: **Fri Apr 27, 2012 12:44 pm**

Very astute observations, thanks very much for that! I'll be checking for errors in my decoding once again today.

BigEd wrote:

... I see in your latest checkin that you have changed some x's to 0's and you have the speed as 91MHz. What was it before?...
Cheers
Ed

It was passing when my constraint was at 12ns, I think. The max speed was 89MHz, so I was pretty excited I got it back up to at least 90MHz which is my goal for this core.
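For reference, the conversion between a PERIOD constraint in nanoseconds and a clock speed in MHz is just a reciprocal. A small Python helper (the function names are mine) makes the numbers in this exchange easy to check: a 12 ns constraint corresponds to about 83.3 MHz, and a 90 MHz goal needs a period of about 11.11 ns or less.

```python
def mhz_from_ns(period_ns):
    """Clock frequency in MHz for a given period in nanoseconds."""
    return 1000.0 / period_ns

def ns_from_mhz(freq_mhz):
    """Clock period in nanoseconds for a given frequency in MHz."""
    return 1000.0 / freq_mhz

if __name__ == "__main__":
    print(f"12 ns  -> {mhz_from_ns(12.0):.2f} MHz")
    for f in (89, 90, 91):
        print(f"{f} MHz -> {ns_from_mhz(f):.3f} ns")
```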

One question which has been nagging at me for some time now: Do you think that the length of the program in the blockRAM, when running synthesis, will have an effect on top speed? I would hope not, but I'm not sure...

Posted: **Fri Apr 27, 2012 1:01 pm**

No, blockRAM contents would have no effect. (My own codebase has a tiny ROM which is hard-synthesised to logic, so that's a different case)

Posted: **Fri Apr 27, 2012 11:57 pm**

Ok, awesome! So now I'm thinking that once I finally straighten out all my decoding errors, mainly by following Arlet's original 8-bit core, the longest path should still be to the ALU?

I found some more decoding errors in the state machine, which are a bit easier to find now that I'm not adding any more features, or even thinking of adding more at this point.

So tonight I plan to push synthesis to a very narrow failure margin, then run SmartXplorer and apply those settings to the synthesis to make it pass. Then lower the speed constraint till it fails, and rerun SmartXplorer.
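The loop described above could be sketched in Python as follows. This is only an outline of the procedure: `meets_timing` is a caller-supplied stand-in for a full build at a given period with a given set of tool settings, and the strategy names and pass/fail limits in the demo are invented, not real SmartXplorer strategies or results.

```python
def tighten(meets_timing, strategies, start_ns, step_ns=0.1, floor_ns=1.0):
    """Shrink the clock period until no strategy passes.

    meets_timing(period_ns, strategy) -> bool stands in for a full build.
    Returns (period_ns, strategy) for the tightest passing run, or None.
    """
    best = None
    period = start_ns
    current = strategies[0]
    while period >= floor_ns:
        if not meets_timing(period, current):
            # Current settings fail: explore the other strategies at this period.
            winners = [s for s in strategies if meets_timing(period, s)]
            if not winners:
                break
            current = winners[0]
        best = (period, current)
        period = round(period - step_ns, 3)
    return best

# Invented model: each strategy passes down to some minimum period.
LIMITS_NS = {"default": 11.5, "strategy_1": 11.0}

def fake_build(period_ns, strategy):
    """Pretend build result based on the invented limits above."""
    return period_ns >= LIMITS_NS[strategy]

if __name__ == "__main__":
    print(tighten(fake_build, list(LIMITS_NS), start_ns=12.0))
```

With a real flow each call to `meets_timing` would be a full implementation run, so the step size trades build time against how closely the final constraint hugs the limit.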

Will post results, which probably won't mean much yet, as I'm sure I still have a few decoding errors to find...

Will have time tomorrow to find the longest path on this update.

65ORG16.b Core
