Pipelined 6502

For discussing the 65xx hardware itself or electronics projects.
User avatar
Arlet
Posts: 2353
Joined: 16 Nov 2010
Location: Gouda, The Netherlands
Contact:

Re: Pipelined 6502

Post by Arlet »

GARTHWILSON wrote:
If you have to load 1K at a time, at 100MB/s (for the sake of discussion), that's 10µs, which is a long, long time to wait for interrupt service on a processor designed for performance high enough to justify having caches.
The data I referred to show that this particular processor takes 8 ns to fill a cache line from external memory, which is not a long, long time, even assuming the CPU is actually stalled, which is most likely not the case.
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Re: Pipelined 6502

Post by BigEd »

manili wrote:
Thank you all for this hot discussion :mrgreen: !
I hope we've been able to help! You should feel free to continue here or to start new threads if you have any particular questions. Many of us use this site not just for discussion, but also as a repository of wisdom, so it's good to be able to find things later - even years later.

One thing you did say earlier, about Klaus' testsuite - indeed, it's only intended to test 6502 behaviour, and only some of it, but most implementations or emulations have one or two bugs and this suite is a good one to pass. It will exercise some of your pipelined architecture, but probably not every possibility. One reason to make a machine which is not super-complicated is the difficulty of verifying it. Even Intel don't put every possible smart idea into each revision - they move forward a step at a time.

Good luck, anyway, and hopefully we will see more interesting news as you progress.
User avatar
ttlworks
Posts: 1464
Joined: 09 Nov 2012
Contact:

Re: Pipelined 6502

Post by ttlworks »

Since a problem when implementing an instruction cache would be self-modifying code,
I would suggest taking a look at how the MC68030 data cache works:

"If a cache hit occurs on a write cycle, both the data cache and the external device are updated with the new data.
If a write cycle generates a cache miss, the external device is updated, and a new data cache entry can be
replaced or allocated for that address..."

Would something like that be helpful for building a 6502 instruction cache?

68030 manual, Page 13.
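In software terms, the 68030-style write policy quoted above can be modelled roughly like this. This is only a sketch: the direct-mapped organisation, the 4-byte line size, and write-allocation on every miss are my own assumptions for illustration, not details from the manual.

```python
# Toy model of a write-through cache, after the MC68030 description:
# on a write hit, both the cache line and memory are updated; on a
# write miss, memory is updated and a line is allocated for that
# address. Direct-mapped, 4-byte lines -- arbitrary choices.

LINE_SIZE = 4
NUM_LINES = 16

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory                      # backing store (a bytearray)
        self.tags = [None] * NUM_LINES            # tag per line, None = invalid
        self.lines = [bytearray(LINE_SIZE) for _ in range(NUM_LINES)]

    def _index_tag(self, addr):
        line_addr = addr // LINE_SIZE
        return line_addr % NUM_LINES, line_addr // NUM_LINES

    def _fill(self, idx, tag, addr):
        base = (addr // LINE_SIZE) * LINE_SIZE
        self.lines[idx][:] = self.memory[base:base + LINE_SIZE]
        self.tags[idx] = tag

    def write(self, addr, value):
        idx, tag = self._index_tag(addr)
        self.memory[addr] = value                 # external memory always updated
        if self.tags[idx] == tag:                 # write hit: update cached copy too
            self.lines[idx][addr % LINE_SIZE] = value
        else:                                     # write miss: allocate the line
            self._fill(idx, tag, addr)

    def read(self, addr):
        idx, tag = self._index_tag(addr)
        if self.tags[idx] != tag:                 # read miss: fill from memory
            self._fill(idx, tag, addr)
        return self.lines[idx][addr % LINE_SIZE]

mem = bytearray(256)
cache = WriteThroughCache(mem)
cache.write(0x10, 0xEA)          # miss: memory updated, line allocated
assert mem[0x10] == 0xEA         # memory is never stale
assert cache.read(0x10) == 0xEA
```

The relevance to self-modifying code is the invariant the asserts check: because every write goes through to memory, an instruction fetch that misses (or bypasses) the cache can never see stale code, which a write-back policy cannot guarantee without extra snooping.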
manili
Posts: 31
Joined: 07 Oct 2016

Re: Pipelined 6502

Post by manili »

BigEd wrote:
manili wrote:
Thank you all for this hot discussion :mrgreen: !
I hope we've been able to help! You should feel free to continue here or to start new threads if you have any particular questions. Many of us use this site not just for discussion, but also as a repository of wisdom, so it's good to be able to find things later - even years later.

One thing you did say earlier, about Klaus' testsuite - indeed, it's only intended to test 6502 behaviour, and only some of it, but most implementations or emulations have one or two bugs and this suite is a good one to pass. It will exercise some of your pipelined architecture, but probably not every possibility. One reason to make a machine which is not super-complicated is the difficulty of verifying it. Even Intel don't put every possible smart idea into each revision - they move forward a step at a time.

Good luck, anyway, and hopefully we will see more interesting news as you progress.
I will, Thank you.

You mean the test suite does not test every possible state (I know that's impossible ;) )? So how reliable will my processor be after passing this test suite?
ttlworks wrote:
Since a problem when implementing an instruction cache would be self-modifying code,
I would suggest taking a look at how the MC68030 data cache works:

"If a cache hit occurs on a write cycle, both the data cache and the external device are updated with the new data.
If a write cycle generates a cache miss, the external device is updated, and a new data cache entry can be
replaced or allocated for that address..."

Would something like that be helpful for building a 6502 instruction cache?

68030 manual, Page 13.
Currently I'm using a write-back policy, so do you mean I should change it to write-through?
What do you think about my bypassing idea? viewtopic.php?f=4&t=4270&start=15#p47851
User avatar
ttlworks
Posts: 1464
Joined: 09 Nov 2012
Contact:

Re: Pipelined 6502

Post by ttlworks »

manili wrote:
What do you think about my bypassing idea
I now have to admit that I'm from the "TTL nerd corner", which means building CPUs from individual TTL chips... or transistors.
In my case, memory was always faster than the CPU, so I'm not too deep into implementing caches. :)

What about "having a lookup table somewhere" which tells the CPU what part of the memory should be cached
and what part should not? // 1 KB "page granularity" would take 8 bytes of lookup table for 64 KB of memory.
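The arithmetic behind that table: 64 KB / 1 KB pages = 64 pages, so one cacheability bit per page fits in exactly 8 bytes. A minimal sketch (the "bit set = cacheable" convention is my own choice):

```python
# 64 KB address space / 1 KB pages = 64 pages -> one bit each,
# so the whole cacheability lookup table fits in 8 bytes.
PAGE_SHIFT = 10                      # 1 KB pages

cacheable = bytearray(8)             # 64 bits, everything uncached at first

def set_cacheable(page, flag):
    byte, bit = divmod(page, 8)
    if flag:
        cacheable[byte] |= 1 << bit
    else:
        cacheable[byte] &= ~(1 << bit)

def is_cacheable(addr):
    page = addr >> PAGE_SHIFT
    byte, bit = divmod(page, 8)
    return bool(cacheable[byte] & (1 << bit))

set_cacheable(0x8000 >> PAGE_SHIFT, True)   # mark the ROM page at $8000 cacheable
assert is_cacheable(0x8123)                  # same 1 KB page
assert not is_cacheable(0x0200)              # I/O or self-modifying RAM stays uncached
```

In hardware this would just be a 64x1 RAM or register file indexed by the top six address bits, consulted in parallel with the tag lookup.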
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Re: Pipelined 6502

Post by BigEd »

manili wrote:
BigEd wrote:
...about Klaus' testsuite - indeed, it's only intended to test 6502 behaviour, and only some of it...
You mean the test suite does not test every possible state (I know that's impossible ;) )? So how reliable will my processor be after passing this test suite?
Indeed, complete coverage is almost impossible, and there's something of a challenge in even measuring coverage. I couldn't guess! I see Michael has expressed confidence.

One thing I'll note: Arlet's first version of his core lacked RDY, and at some point it passed Klaus' suite. Later, Arlet added RDY, and a little later ran the suite again, with a randomly-toggling RDY. I think I remember that doing so did show up one bug. Much much later we found another RDY-related bug. So, even something as conceptually simple as stalling a basic 6502 is not completely or easily covered by this one test suite.

(Another interesting datapoint: when the visual6502 was first used to model the boot of a C64 to the Basic prompt, Michael Steil investigated how many transistors could be deleted without breaking the boot. It turned out to be about a third. So, getting good coverage is difficult!)

There's a whole career to be had, specialising in CPU verification. CPU teams are often half staffed with designers and half with verifiers.
User avatar
GARTHWILSON
Forum Moderator
Posts: 8775
Joined: 30 Aug 2002
Location: Southern California
Contact:

Re: Pipelined 6502

Post by GARTHWILSON »

Per Ed's suggestion, the suitability of caches for the 65xx is taken up in a new topic, at viewtopic.php?f=4&t=4271 .
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
manili
Posts: 31
Joined: 07 Oct 2016

Re: Pipelined 6502

Post by manili »

Hi all,
Good news first:
Some people kindly told me to keep the forum informed of my progress. Currently I'm working hard to implement the stack-related instructions, which are the most challenging instructions to implement. My processor now accepts PHA, PHP, PLA, PLP and, most importantly, RTS. After a lot of changes (believe me when I say a lot!) I finally found a way to handle instructions like RTI.
Here are my successfully passed test cases:

Code: Select all

		//RAM

		memory[16'h0000] <= 8'h00;
		memory[16'h0001] <= 8'h00;
		memory[16'h0002] <= 8'h00;
		
		memory[16'h001A] <= 8'h20;
		memory[16'h001B] <= 8'h10;

		memory[16'h00FF] <= 8'h00;

		memory[16'h0020] <= 8'h0F;
		memory[16'h0021] <= 8'h10;
		
		memory[16'h100D] <= 8'h06;
		memory[16'h100E] <= 8'h80;
		memory[16'h100F] <= 8'h0C;

		//ROM
		
		/* Jump + Indirect Addressing Tests */

		memory[`P_C_START + 00] <= `JMP_I;
		memory[`P_C_START + 01] <= 8'h0D;
		memory[`P_C_START + 02] <= 8'h10;
		
		memory[`P_C_START + 06] <= `LDA_IME;
		memory[`P_C_START + 07] <= 8'hFF;
		memory[`P_C_START + 08] <= `LDA_IME;
		memory[`P_C_START + 09] <= 8'h0F;
		
		memory[`P_C_START + 00] <= `LDA_IME;
		memory[`P_C_START + 01] <= 8'h20;
		memory[`P_C_START + 02] <= `LDY_IME;
		memory[`P_C_START + 03] <= 8'hFF;
		memory[`P_C_START + 04] <= `STA_I_Y;
		memory[`P_C_START + 05] <= 8'h00;
		memory[`P_C_START + 06] <= `LDX_ABS;
		memory[`P_C_START + 07] <= 8'hFF;
		memory[`P_C_START + 08] <= 8'h00;
		memory[`P_C_START + 09] <= `JMP_ABS;
		memory[`P_C_START + 10] <= 8'h11;
		memory[`P_C_START + 11] <= 8'h80;
		memory[`P_C_START + 12] <= `ADC_I_X;
		memory[`P_C_START + 13] <= 8'h00;
		memory[`P_C_START + 14] <= `JMP_ABS;
		memory[`P_C_START + 15] <= 8'h1A;
		memory[`P_C_START + 16] <= 8'h80;
		memory[`P_C_START + 17] <= `LDA_IME;
		memory[`P_C_START + 18] <= 8'h01;
		memory[`P_C_START + 19] <= `LDA_IME;
		memory[`P_C_START + 20] <= 8'h02;
		memory[`P_C_START + 21] <= `LDA_IME;
		memory[`P_C_START + 22] <= 8'h03;
		memory[`P_C_START + 23] <= `JMP_ABS;
		memory[`P_C_START + 24] <= 8'h0C;
		memory[`P_C_START + 25] <= 8'h80;

		/* Flags Setting/Clearing Tests */

		memory[`P_C_START + 00] <= `JMP_ABS;
		memory[`P_C_START + 01] <= 8'd26;
		memory[`P_C_START + 02] <= 8'h80;

		memory[`P_C_START + 26] <= `SED;
		memory[`P_C_START + 27] <= `SEC;
		memory[`P_C_START + 28] <= `CLI;
		memory[`P_C_START + 29] <= `CLC;

		/* Loop Test */

		memory[`P_C_START + 00] <= `JMP_ABS;
		memory[`P_C_START + 01] <= 8'd30;
		memory[`P_C_START + 02] <= 8'h80;

		memory[`P_C_START + 30] <= `LDA_ABS;
		memory[`P_C_START + 31] <= 8'h02;
		memory[`P_C_START + 32] <= 8'h00;
		memory[`P_C_START + 33] <= `ADC_IME;
		memory[`P_C_START + 34] <= 8'h01;
		memory[`P_C_START + 35] <= `STA_ABS;
		memory[`P_C_START + 36] <= 8'h02;
		memory[`P_C_START + 37] <= 8'h00;
		memory[`P_C_START + 38] <= `EOR_IME;
		memory[`P_C_START + 39] <= 8'h0F;
		memory[`P_C_START + 40] <= `BNE;
		memory[`P_C_START + 41] <= 8'hF6;
		
		/* Transition Tests */
		
		memory[`P_C_START + 00] <= `JMP_ABS;
		memory[`P_C_START + 01] <= 8'd42;
		memory[`P_C_START + 02] <= 8'h80;
		
		memory[`P_C_START + 42] <= `TXA;
		memory[`P_C_START + 43] <= `TYA;
		memory[`P_C_START + 44] <= `TAX;
		memory[`P_C_START + 45] <= `DEY;

		/* Fibonacci Test */

		memory[`P_C_START + 00] <= `JMP_ABS;
		memory[`P_C_START + 01] <= 8'd46;
		memory[`P_C_START + 02] <= 8'h80;

		memory[`P_C_START + 46] <= `LDY_IME;
		memory[`P_C_START + 47] <= 8'h07;
		memory[`P_C_START + 48] <= `LDA_IME;
		memory[`P_C_START + 49] <= 8'h00;
		memory[`P_C_START + 50] <= `STA_ABS;
		memory[`P_C_START + 51] <= 8'h03;
		memory[`P_C_START + 52] <= 8'h00;
		memory[`P_C_START + 53] <= `LDA_IME;
		memory[`P_C_START + 54] <= 8'h01;
		memory[`P_C_START + 55] <= `TAX;
		memory[`P_C_START + 56] <= `ADC_ABS;
		memory[`P_C_START + 57] <= 8'h03;
		memory[`P_C_START + 58] <= 8'h00;
		memory[`P_C_START + 59] <= `STX_ABS;
		memory[`P_C_START + 60] <= 8'h03;
		memory[`P_C_START + 61] <= 8'h00;
		memory[`P_C_START + 62] <= `DEY;
		memory[`P_C_START + 63] <= `BNE;
		memory[`P_C_START + 64] <= 8'hF8;

		/* Simple Stack Tests */
		
		memory[`P_C_START + 00] <= `JMP_ABS;
		memory[`P_C_START + 01] <= 8'd65;
		memory[`P_C_START + 02] <= 8'h80;

		memory[`P_C_START + 65] <= `TXS;
		memory[`P_C_START + 66] <= `LDX_IME;
		memory[`P_C_START + 67] <= 8'hAA;
		memory[`P_C_START + 68] <= `TXS;
		memory[`P_C_START + 69] <= `TSX;
		
		memory[`P_C_START + 70] <= `PHA;
		memory[`P_C_START + 71] <= `LDA_IME;
		memory[`P_C_START + 72] <= 8'h0F;
		memory[`P_C_START + 73] <= `PHP;
		memory[`P_C_START + 74] <= `PLA;
		memory[`P_C_START + 75] <= `PLP;

		/* Return Oriented Instruction Test */

		memory[`P_C_START + 00] <= `JMP_ABS;
		memory[`P_C_START + 01] <= 8'd76;
		memory[`P_C_START + 02] <= 8'h80;
		
		memory[`P_C_START + 76] <= `LDA_IME;
		memory[`P_C_START + 77] <= 8'h80;
		memory[`P_C_START + 78] <= `PHA;
		memory[`P_C_START + 79] <= `LDA_IME;
		memory[`P_C_START + 80] <= 8'h56;
		memory[`P_C_START + 81] <= `PHA;
		memory[`P_C_START + 82] <= `RTS;

		memory[`P_C_START + 86] <= `LDA_IME;
		memory[`P_C_START + 87] <= 8'hFF;
		memory[`P_C_START + 88] <= `LDA_IME;
		memory[`P_C_START + 89] <= 8'h0F;
Bad news:
The more I implement this guy, the more hopeless I become. When I started the project, I thought this was going to be a useful project, but now I think I'm going to ruin the 6502's performance with this kind of pipeline :( !!! Useless ...
User avatar
MichaelM
Posts: 761
Joined: 23 Apr 2012
Location: Huntsville, AL

Re: Pipelined 6502

Post by MichaelM »

manili:

There are some lessons to be learned from your project. Don't be too discouraged if pipelining cannot be generally applied.

We know that CISC machines can be successfully pipelined, but it generally requires a substantial transformation of the underlying microarchitecture. AMD and Intel have successfully pipelined the x86 instruction set architecture, but even they never attempted to pipeline the 8086/80186/80286/80386 processors. Only when the payoff was sufficient did these companies apply significant resources and talent to adding pipeline stages in the 80486 generation of x86 processors.

Finish your project, and use the knowledge gained to guide your future studies and efforts in this area. I can say that some of my best lessons came from projects where I felt the end result was not good, i.e. failures as you've described. It's those lessons that keep you from making bold statements about cost and schedule.
Michael A.
White Flame
Posts: 704
Joined: 24 Jul 2012

Re: Pipelined 6502

Post by White Flame »

Title your paper "Pros and cons of pipelining in microcontroller environments" or something and discuss the tradeoffs. Your project doesn't (shouldn't?) hinge on speedups, especially if you can explore & show how pipeline design decisions interacted with the 6502.
manili
Posts: 31
Joined: 07 Oct 2016

Re: Pipelined 6502

Post by manili »

MichaelM wrote:
We know that CISC machines can be successfully pipelined, but it generally requires a substantial transformation of the underlying microarchitecture. AMD and Intel have successfully pipelined the x86 instruction set architecture, but even they never attempted to pipeline the 8086/80186/80286/80386 processors. Only when the payoff was sufficient did these companies apply significant resources and talent to adding pipeline stages in the 80486 generation of x86 processors.
Completely true. This is one of the lessons I've learned from this project. Thanks a lot for all of your advice.
White Flame wrote:
Title your paper "Pros and cons of pipelining in microcontroller environments" or something and discuss the tradeoffs. Your project doesn't (shouldn't?) hinge on speedups, especially if you can explore & show how pipeline design decisions interacted with the 6502.
Do you have any recommendations for the comparison? I mean, which items should I compare (e.g. speedup), and what is a good project to compare with? Thanks.
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Re: Pipelined 6502

Post by BigEd »

Some general comments about pros and cons, without reference to the 6502. Hope it helps.

Price, and performance - those are the drivers for microarchitecture changes. Oh, and power consumption, but let's leave that for now.

Price, for chips, is a function of area - an exponential function. And it's also, famously, a function of time, also exponential. So, the 386 came out in 1986 and comprises 275k transistors, whereas the pipelined and cache-containing 486 came out in 1989 and comprised 1200k transistors. That's more than 4x the number of transistors, which can surely make a huge difference, but being 3 years later the cost to manufacture might be about the same. That's amazing, and it's what's driven the progress in performance.

Performance, when comparing implementations of the same instruction set, is the product of two factors: clock frequency, and instructions per clock. Pipelining helps clock frequency, by doing less work between ticks. (Other transistor-expensive tactics improve clock frequency too - faster adders, more adders, or more generally faster logic and more logic.) Caches help instructions per clock, by improving effective memory bandwidth. Other expensive tactics help instructions per clock too: fancier decoders, branch predictors.

So, for an exercise in making a new microarchitecture for an existing instruction set, like this one, you can see directly how difficult it has been - that's a measure of the engineering cost, or the capital cost, not the production cost - but it's possible you can't yet see the two other measures. You can't see instructions per clock until you've run some simulations, you can't see clock speed until you've run through synthesis, and you can't see the production cost until, again, you've run through synthesis.

In principle you can compare your implementation, after synthesis, for clock speed and gate count, with another implementation such as the large T65 or the small Arlet core. Your core, with the caches, will run with proportionally slower memory than a cacheless core, so you could also compare two (or three) implementations on the assumption that the core speed is constrained only by memory speed. But to show the performance benefit of pipelining on instructions per clock, I think you'll need some simulations.

(BTW, often you will see CPI, or clocks per instruction, rather than instructions per clock. Same thing, but upside down.)
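The two factors above multiply out directly; a quick sketch with invented numbers (the clock rates and CPI figures below are made up purely to show the trade-off, not measurements of any real core):

```python
# performance = clock frequency x instructions per clock (IPC).
# CPI is just 1/IPC. Pipelining raises the achievable clock, while
# hazards and stalls lower the IPC -- the question is the net effect.

def mips(freq_mhz, ipc):
    return freq_mhz * ipc            # millions of instructions per second

# Hypothetical numbers for illustration only:
plain_core = mips(freq_mhz=14.0, ipc=1 / 3.5)   # ~3.5 cycles per instruction
pipelined  = mips(freq_mhz=50.0, ipc=1 / 1.4)   # faster clock, some stall cycles

assert pipelined > plain_core        # net win despite the per-instruction stalls
print(round(plain_core, 1), round(pipelined, 1))
```

This is why a pipelined design can still lose: if stalls push the pipelined CPI high enough, or the critical path limits the clock gain, the product comes out below the simple core's - which is exactly what simulation and synthesis will tell you.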

All the above, then, is quantitative. I'm sure it would be good for you also (or instead) to make some qualitative comparisons, which is what White Flame suggests. Probably comparing 6502 with RISC would be useful. The 6502 has variable length instructions, has very few architectural registers and only narrow ones, has complex addressing modes which have multiple memory references. All of those, and probably more, have led to it being difficult to construct an improved microarchitecture.

Intel had the same predicament as you did, when designing the 486. The Wikipedia page isn't at all bad, and hopefully links to some authoritative and technical sources - you wouldn't want to cite Wikipedia! It might be that now, after this project, you can make a good parallel argument and comparison with the 386 to 486 microarchitectures. It wouldn't hurt at all to show your understanding, in my opinion! (I would first break down the equation 486 = 386 + cache + FPU, to estimate how many transistors, or how much area, was spent on the microarchitectural upgrade. You can do the same with your 6502.)
manili
Posts: 31
Joined: 07 Oct 2016

Re: Pipelined 6502

Post by manili »

Hi all,
I have implemented the BRK and RTI instructions. Now I can say that implementing the other instructions is somewhat straightforward, and I think it's going to be finished by tonight. My processor still has no way to handle external interrupts.
Two questions:
- After BRK, should I push the B and I flags onto the stack, because they are both 1?
- After RTI, should I pop the B and I flags off the stack, because they may have been turned off during interrupt handling?

If the answer is yes to both questions, good for me, because this is the current state of my processor :mrgreen: .

Thanks a lot.
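For reference, the usual NMOS 6502 behaviour around those two questions can be modelled in a few lines. This is a sketch of the commonly documented semantics, not of manili's core: BRK pushes the status byte with B (bit 4) and the unused bit 5 forced to 1, the pushed I flag is its pre-interrupt value, and I is set only afterwards; RTI restores the stored flags, with B and bit 5 having no storage in the real status register.

```python
# Minimal model of the 6502 status byte around BRK/RTI.
# Bit layout: N V 1 B D I Z C. B exists only in the pushed copy,
# not as a stored flag inside the CPU; bit 5 has no storage either.
B_BIT, UNUSED_BIT, I_BIT = 1 << 4, 1 << 5, 1 << 2

def push_status_for_brk(p):
    # BRK (like PHP) pushes P with B and the unused bit forced to 1 ...
    pushed = p | B_BIT | UNUSED_BIT
    # ... and only afterwards is I set in the live register.
    return pushed, p | I_BIT

def pull_status_for_rti(pushed):
    # RTI restores every stored flag, including I, from the stacked
    # byte. Clearing B/bit 5 here is a modelling simplification:
    # on real silicon those bits simply have nowhere to be stored.
    return pushed & ~(B_BIT | UNUSED_BIT)

p = 0b00000001                       # carry set, interrupts enabled
pushed, p_after_brk = push_status_for_brk(p)
assert pushed & B_BIT                # the stacked copy shows B = 1
assert p_after_brk & I_BIT           # I is set after BRK completes
assert pull_status_for_rti(pushed) == 0b00000001   # RTI restores the old flags
```

So the key point for an interrupt handler is that RTI brings back whatever I value was pushed, which is why a BRK inside an I-clear main loop returns with interrupts re-enabled automatically.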
Last edited by manili on Mon Oct 17, 2016 1:25 pm, edited 1 time in total.
manili
Posts: 31
Joined: 07 Oct 2016

Re: Pipelined 6502

Post by manili »

BigEd wrote:
In principle you can compare your implementation, after synthesis, for clock speed and gate count, with another implementation such as the large T65 or the small Arlet core. Your core, with the caches, will run with proportionally slower memory than a cacheless core, so you could also compare two (or three) implementations on the assumption that the core speed is constrained only by memory speed. But to show the performance benefit of pipelining on instructions per clock, I think you'll need some simulations.
Thanks a lot, Ed.
The list was a complete one. Would you mind giving a link to the source of the most popular open-core 6502?
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Re: Pipelined 6502

Post by BigEd »

There's a list of cores at http://6502.org/homebuilt#HDL - I couldn't say which are the most popular; you might need to find a way to research that. But I think T65 and the Syntiak core (the Wendrich/Daly core) are well-regarded, and Arlet's core is popular here, at least.