Pipelined 6502
Re: Pipelined 6502
Is it fast enough for what?! Typically you'd have an application in mind, and you'd iterate on your HDL and your synthesis tactics (maybe your placement tactics) until you get there.
123MHz isn't too bad, in some general sense - it's a useful speed. To know whether it's impressive, I'd need to know what a PicoBlaze or other CPU would synth to - or indeed, Arlet's core - on this choice of chip with this speed grade.
One worrying thing to note: you have a clock period just over 8ns but you need some signal to arrive 9ns before the clock - so in reality your speed is lower, as you'd assume that signal is ultimately also controlled by the clock. In the best case, it becomes the speed limit. (In the worst case, it comes from something else with a substantial delay. You can see, if two people make designs and one produces a signal with 8ns delay and another uses that signal with an 8ns setup time, the actual period achievable will be something like 16ns. And they both thought they'd done a good job!)
It's a worthwhile skill to dig into the timing reports and be able to make sense of them.
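To put numbers on the setup-time point above, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the 8ns and 9ns figures are simply the ones quoted in this post, not values taken from any timing report.

```python
def fmax_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def min_period_ns(output_delay_ns: float, setup_ns: float) -> float:
    """Smallest usable period when one block's output delay after the clock
    feeds another block's setup requirement before the next clock edge."""
    return output_delay_ns + setup_ns


# Period implied by the reported 123MHz.
print(f"123 MHz -> {1000.0 / 123.0:.2f} ns period")

# Best case from the post: the 9ns requirement alone becomes the limit.
print(f"9 ns limit -> {fmax_mhz(9.0):.1f} MHz")

# Two-designer example: 8ns out of one block, 8ns setup into the next.
combined = min_period_ns(8.0, 8.0)
print(f"combined {combined:.0f} ns -> {fmax_mhz(combined):.1f} MHz")
```

Under these assumptions the two halves, each of which looks like an 8ns design, only run together at about 62.5MHz.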
Re: Pipelined 6502
I'm not sure what stage in the synthesis this is, but for the most accurate result you should set a timing constraint by specifying a clock frequency yourself, and then see if the tools can meet that. If so, you can increase the number until you run into problems.
Without a constraint, you get an estimate, but this isn't always realistic, since the tools don't know what you want, so they can't perform a good area/speed optimization for instance.
It's also a good idea to check the timing analyzer output to see if there are easy improvements. In some designs there may be just a couple of long paths blocking higher speeds, and some local optimizations may yield a good improvement. It's however not always easy to understand the timing analyzer output, and map it back to the source code, so you may want to skip this part for now.
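The "raise the constraint until it fails" procedure can be sketched as a simple loop. This is only an illustration: `meets_timing` is a hypothetical callback standing in for a full constrained build, and `fake_run` is an invented stand-in, not output from any real tool.

```python
from typing import Callable, Optional


def highest_passing_mhz(meets_timing: Callable[[float], bool],
                        start_mhz: float,
                        step_mhz: float,
                        limit_mhz: float) -> Optional[float]:
    """Raise the clock constraint in fixed steps and return the last
    frequency that met timing, or None if even the first one failed."""
    best = None
    mhz = start_mhz
    while mhz <= limit_mhz and meets_timing(mhz):
        best = mhz
        mhz += step_mhz
    return best


# Invented stand-in for a real build: pretend timing closes up to 87 MHz.
def fake_run(mhz: float) -> bool:
    return mhz <= 87.0


print(highest_passing_mhz(fake_run, 50.0, 5.0, 200.0))  # 85.0
```

In practice each call would be a complete synthesis and place-and-route run, so a coarse step first and a finer one near the limit keeps the number of runs down.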
Re: Pipelined 6502
BigEd wrote:
Is it fast enough for what?! Typically you'd have an application in mind, and you'd iterate on your HDL and your synthesis tactics (maybe your placement tactics) until you get there.
Re: Pipelined 6502
Arlet wrote:
I'm not sure what stage in the synthesis this is, but for the most accurate result you should set a timing constraint by specifying a clock frequency yourself, and then see if the tools can meet that. If so, you can increase the number until you run into problems.
Without a constraint, you get an estimate, but this isn't always realistic, since the tools don't know what you want, so they can't perform a good area/speed optimization for instance.
It's also a good idea to check the timing analyzer output to see if there are easy improvements. In some designs there may be just a couple of long paths blocking higher speeds, and some local optimizations may yield a good improvement. It's however not always easy to understand the timing analyzer output, and map it back to the source code, so you may want to skip this part for now.
Re: Pipelined 6502
I just did a quick search for "xilinx how to read timing report" and it looks like it would lead to some good reading.
Re: Pipelined 6502
BigEd wrote:
I just did a quick search for "xilinx how to read timing report" and it looks like it would lead to some good reading.
Re: Pipelined 6502
In my version of ISE it's called "post place & route static timing". But if you enter a timing constraint (basically just the frequency of your input clock), and then generate the design, it should give you an error if it can't meet the timing.
In the menu, there's a Timing Analyzer tool for details.
Note that the timing also depends on the pin constraints and other resources used in the FPGA. A small design in an otherwise empty FPGA can be optimized a lot more than a large design that uses a lot of routing resources, or forces long signals across the chip.
Of course, if you just want to make a ballpark comparison, you don't need to worry about all these details, but you should be aware that the results may be imprecise.
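One common way to turn a constrained run into a frequency estimate is to subtract the worst reported slack from the constrained period. The sketch below assumes the timing report gives a worst-case slack in nanoseconds (positive when timing is met, negative when it fails); the function name and the example numbers are invented for illustration.

```python
def fmax_from_slack(constraint_period_ns: float, worst_slack_ns: float) -> float:
    """Estimate the achievable clock in MHz from a constrained run.

    The longest path takes (period - slack) ns, so that is the shortest
    period the design could run at under the same conditions.
    """
    achievable_ns = constraint_period_ns - worst_slack_ns
    if achievable_ns <= 0:
        raise ValueError("slack must be smaller than the constrained period")
    return 1000.0 / achievable_ns


# Constrained to 100 MHz (10 ns) with 2 ns to spare.
print(fmax_from_slack(10.0, 2.0))   # 125.0

# Constrained to 100 MHz but failing by 2.5 ns.
print(fmax_from_slack(10.0, -2.5))  # 80.0
```

Because the tools stop optimizing once a constraint is met, an estimate taken from a loose constraint tends to be pessimistic, which is one more reason to tighten the constraint step by step.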
Re: Pipelined 6502
Arlet wrote:
In my version of ISE it's called "post place & route static timing". But if you enter a timing constraint (basically just the frequency of your input clock), and then generate the design, it should give you an error if it can't meet the timing.
In the menu, there's a Timing Analyzer tool for details.
Note that the timing also depends on the pin constraints and other resources used in the FPGA. A small design in an otherwise empty FPGA can be optimized a lot more than a large design that uses a lot of routing resources, or forces long signals across the chip.
Of course, if you just want to make a ballpark comparison, you don't need to worry about all these details, but you should be aware that the results may be imprecise.
Re: Pipelined 6502
Hello guys, it's me again 
This is a big day for me, because I'm releasing some of my source code in public. I hope you guys will forgive the delay between opening this topic and releasing the code. This is the pipelined 6502 source code.
Well ... good news first:
1. The code is completely synthesizable. (I ran post-synthesis simulation successfully with the Vivado toolset, but I don't have an FPGA to test the code on.)
2. I did some comparisons with the native 6502 model (I took Arlet's model as a reference), and it seems the new processor can sometimes outperform it by up to 40%. (But I still need to study this claim much, much more.)
Now the bad news:
1. I had to change the architecture, so I just got rid of the caches.
2. It has not passed any test suites yet, so please help me with that.
3. The memory that works with this CPU is very special: it has 4 asynchronous read ports and 1 synchronous write port. I think this will make the processor very slow. (What do you guys think?)
BTW, you can think of this code as a working MODEL, not as a REAL 6502 (not yet!!!).
Looking forward to reading your comments.
Thank you all for encouraging me to finish this hard project.
M. A. Nili
P.S. I wrote the Verilog HDL code in the TextWrangler editor on macOS, so I don't know what will happen if you open the source code in other editors!
P.S. Again, sorry for my bad English!
Re: Pipelined 6502
hi, manili. Nice to see you're persevering with this very challenging project.
I admit I don't understand your diagram (above). Can you choose a typical instruction (such as LDA abs, perhaps) and describe how it would execute? (Your English is just fine, BTW!)
I know that there are several pipeline stages, of course. I guess what I'm missing is exactly how the instructions overlap.
I took the diagram below from another thread, so it's not 100% appropriate. But it does show four pipeline stages, with multiple instructions making their way through. Do your plans specify this kind of detail -- how instructions will overlap -- or is that part yet to be determined?
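As a generic illustration of what "overlap" means here, the sketch below prints which stage each instruction occupies in each cycle. It assumes four hypothetical stages (F, D, X, W for fetch, decode, execute, write-back), one instruction entering per cycle, and no stalls; it is not a description of manili's design or of the diagram from the other thread.

```python
# Hypothetical stage names: fetch, decode, execute, write-back.
STAGES = ["F", "D", "X", "W"]


def pipeline_chart(instructions):
    """Return one text row per instruction showing the stage it occupies in
    each cycle, assuming one instruction issues per cycle with no stalls."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for index, name in enumerate(instructions):
        cells = ["."] * total_cycles
        for offset, stage in enumerate(STAGES):
            cells[index + offset] = stage
        rows.append(f"{name:<10}" + " ".join(cells))
    return rows


for row in pipeline_chart(["LDA abs", "TAX", "ADC abs"]):
    print(row)
```

Each column is one clock cycle, so reading down a column shows which instructions are in flight at the same time. A real 6502 pipeline would also need stalls or extra fetch cycles for multi-byte instructions, which this idealized chart leaves out.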
cheers,
Jeff
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: Pipelined 6502
Hi Manili,
The 6502 address space is small enough that it doesn’t really need caches; the RAM in a PLD can be used directly. But if you use external RAM, retaining the caches would be a good idea. The long-term goals for the project might dictate whether caches are included.
On a clock-for-clock comparison the core may be faster, but the more complex core is also likely to have a lower fmax. Does the 40% figure take fmax into account?
I think you are correct about bad news point #3. Having to interface to a five-port RAM, along with bypassing logic, is bound to make the design larger and slower. Even if it were possible to use asynchronous RAM directly, some loss in fmax is to be expected compared to a simpler design.
The real memory resources in many PLDs (FPGAs) are synchronous, so more pipelining complexity must be added to make use of them in a single cycle. Using two clock cycles to access the RAM would defeat the purpose of an overlapped-pipelined design.
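The fmax question can be made concrete with a little arithmetic: a per-clock advantage only pays off if the clock does not drop by more than the same factor. In the sketch below the 1.4 comes from the "up to 40%" claim in this thread, while the 80MHz, 60MHz and 100MHz figures are invented purely for illustration.

```python
def net_speedup(cycle_speedup: float, fmax_new_mhz: float, fmax_ref_mhz: float) -> float:
    """Overall speedup once the clock-for-clock gain is scaled by the
    ratio of the two cores' maximum clock frequencies."""
    return cycle_speedup * (fmax_new_mhz / fmax_ref_mhz)


def break_even_fmax_ratio(cycle_speedup: float) -> float:
    """Fraction of the reference fmax the new core must reach just to tie."""
    return 1.0 / cycle_speedup


print(f"break-even ratio: {break_even_fmax_ratio(1.4):.3f}")       # 0.714
print(f"80 vs 100 MHz:    {net_speedup(1.4, 80.0, 100.0):.2f}x")   # 1.12x
print(f"60 vs 100 MHz:    {net_speedup(1.4, 60.0, 100.0):.2f}x")   # 0.84x
```

So a core that is 40% faster per clock has to reach roughly 71% of the reference core's fmax on the same device before it is faster in wall-clock terms.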
Re: Pipelined 6502
@Dr Jefyll
Thanks for your reply. I didn't understand your problem. Does the following example help you?
One of my tests during this project was a Fibonacci series generation test. The Verilog code for it is below, and the picture shows how the processor behaves during the execution of the program. Remember that I used the assembly symbols inside the picture (not the Verilog macros).
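MEM[32814] = `LDY_IME;   // LDY #$07    Y = loop counter
MEM[32815] = 8'h07;
MEM[32816] = `LDA_IME;   // LDA #$00
MEM[32817] = 8'h00;
MEM[32818] = `STA_ABS;   // STA $0003   address bytes are low, then high
MEM[32819] = 8'h03;
MEM[32820] = 8'h00;
MEM[32821] = `LDA_IME;   // LDA #$01
MEM[32822] = 8'h01;
MEM[32823] = `TAX;       // TAX         loop entry: keep the current term in X
MEM[32824] = `ADC_ABS;   // ADC $0003   add the previous term
MEM[32825] = 8'h03;
MEM[32826] = 8'h00;
MEM[32827] = `STX_ABS;   // STX $0003   saved term becomes the previous term
MEM[32828] = 8'h03;
MEM[32829] = 8'h00;
MEM[32830] = `DEY;       // DEY
MEM[32831] = `BNE;       // BNE -10     back to TAX at 32823 while Y != 0
MEM[32832] = 8'hF6;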
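To check what the listing above should compute, here is a minimal Python sketch that interprets just the eight opcodes it uses. It assumes the Verilog macros map to the standard 6502 opcode values, that the carry flag starts clear (the program never executes CLC), and that decimal mode is off; it models instruction behavior only, not the pipeline or its timing.

```python
# Standard 6502 opcode values; that the thread's Verilog macros expand to
# these is an assumption.
LDY_IMM, LDA_IMM, STA_ABS, TAX = 0xA0, 0xA9, 0x8D, 0xAA
ADC_ABS, STX_ABS, DEY, BNE = 0x6D, 0x8E, 0x88, 0xD0

START = 32814
PROGRAM = [
    LDY_IMM, 0x07,
    LDA_IMM, 0x00,
    STA_ABS, 0x03, 0x00,
    LDA_IMM, 0x01,
    TAX,
    ADC_ABS, 0x03, 0x00,
    STX_ABS, 0x03, 0x00,
    DEY,
    BNE, 0xF6,
]


def run(program, start, carry=0):
    """Execute the program until the PC falls off its end.

    Returns (A, X, Y, mem[3], list of A after each ADC).
    Carry is assumed clear on entry and binary (non-decimal) mode is assumed.
    """
    mem = bytearray(65536)
    mem[start:start + len(program)] = bytes(program)
    end = start + len(program)
    a = x = y = 0
    zero = False
    pc = start
    sums = []
    while pc != end:
        op = mem[pc]
        if op in (LDA_IMM, LDY_IMM):
            value = mem[pc + 1]
            if op == LDA_IMM:
                a = value
            else:
                y = value
            zero = value == 0
            pc += 2
        elif op in (STA_ABS, STX_ABS, ADC_ABS):
            addr = mem[pc + 1] | (mem[pc + 2] << 8)  # little-endian operand
            if op == STA_ABS:
                mem[addr] = a
            elif op == STX_ABS:
                mem[addr] = x
            else:
                total = a + mem[addr] + carry
                carry = total >> 8
                a = total & 0xFF
                zero = a == 0
                sums.append(a)
            pc += 3
        elif op == TAX:
            x = a
            zero = x == 0
            pc += 1
        elif op == DEY:
            y = (y - 1) & 0xFF
            zero = y == 0
            pc += 1
        elif op == BNE:
            offset = mem[pc + 1]
            if offset >= 0x80:       # sign-extend the 8-bit branch offset
                offset -= 0x100
            pc += 2
            if not zero:
                pc += offset
        else:
            raise ValueError(f"unsupported opcode {op:#04x} at {pc}")
    return a, x, y, mem[3], sums


a, x, y, stored, sums = run(PROGRAM, START)
print(sums)             # [1, 2, 3, 5, 8, 13, 21]
print(a, x, y, stored)  # 21 13 0 13
```

Comparing a pipelined core's final register and memory state against a plain model like this is one cheap sanity check before moving on to a full 6502 test suite.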
Last edited by manili on Wed Mar 29, 2017 9:30 pm, edited 1 time in total.
Re: Pipelined 6502
Rob Finch wrote:
Hi Manili,
The 6502 address space is small enough that it doesn’t really need caches; the RAM in a PLD can be used directly. But if you use external RAM, retaining the caches would be a good idea. The long-term goals for the project might dictate whether caches are included.
On a clock-for-clock comparison the core may be faster, but the more complex core is also likely to have a lower fmax. Does the 40% figure take fmax into account?
I think you are correct about bad news point #3. Having to interface to a five-port RAM, along with bypassing logic, is bound to make the design larger and slower. Even if it were possible to use asynchronous RAM directly, some loss in fmax is to be expected compared to a simpler design.
The real memory resources in many PLDs (FPGAs) are synchronous, so more pipelining complexity must be added to make use of them in a single cycle. Using two clock cycles to access the RAM would defeat the purpose of an overlapped-pipelined design.
Your points are really important. One of my biggest problems is that I don't know how to find the Fmax with the Vivado toolset. I set timing constraints, but when I synthesize the 'Core' module on its own the constraints seem to have no effect and it passes every timing constraint, which is very odd to me. But when I choose the 'System' module as the top module to synthesize, it reaches about 70MHz.
I was thinking of making this unusual memory a very small cache and using an external memory as the main memory, which could make the processor much faster.
You pointed out that "using 2 clock cycles to access the RAM would defeat the purpose of an overlapped-pipeline", so do you think the whole project is kind of a waste of time?
Thanks a lot.
manili
Last edited by manili on Wed Mar 29, 2017 9:33 pm, edited 1 time in total.
Re: Pipelined 6502
@Arlet
Would you tell us the Fmax of your processor, please?
Re: Pipelined 6502
My core synthesized for 100 MHz without effort, including internal memory and peripherals, on a Spartan 6. I didn't attempt to push it harder at the time.
But it depends a lot on the device and synthesis tools, and things attached to the core. So for a fair comparison it's best if you synthesize it yourself, using the same settings as for your own core.