Re: 32 is the new 8-bit
Posted: Sat Jun 08, 2013 7:37 pm
OK, I've done some mockups on the possibility of such a "RISC" machine. I've ended up with something not too complicated in concept, but it will definitely be a major change from a usual RISC machine.
I'll explain myself. My goals were the following :
1) All direct or immediate instructions should be in 1 cycle, including RMW instructions like inc $memory
2) All sta and lda indexed should be done in 1 cycle
3) sta (),Y and lda (),Y should be done in 1 cycle
4) I don't care if adc (),Y, lda (,X), sta (,X), adc (,X), etc. take more than 1 cycle, as those are rarely used
5) Anything not mentioned yet has not been thought through yet, but I don't care if it takes more than 1 cycle
In the end I have come to the conclusion that 4) does not change much, because if I want to fulfil 1-3, it costs nothing more to add support for 4). I might have missed something here; if that is the case, please tell me so.
Because I want to do RMW in 1 cycle, it is necessary to have separate memory read and memory write stages which are independent from each other. And because I want to do indexed instructions in a single cycle, we need an address adder which is independent from the ALU.
So, here are the details of the RISC-like pipeline which I think could execute the vast majority of microcoded 6502 instructions in a single cycle:
Stage 1: Register prefetch. Sends the address of the index registers to the register file. Also reads two zero page locations for indirect addressing modes. The reads are done as two separate byte reads instead of a single 16-bit read, to handle all cases easily: non-aligned reads and the usage of ($ff),Y.
Hardware resources for stage 1: 1 register read port, 2 zero page byte read ports
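For illustration, the two independent byte reads of stage 1 could be modeled like this in Python (hypothetical helper name; the wrap at $ff matches original 6502 behavior):

```python
def fetch_zp_pointer(zp, operand):
    """Stage 1: fetch a 16-bit pointer from zero page as two separate
    byte reads, so ($ff),Y wraps its high byte to $00 like a real 6502."""
    lo = zp[operand]                # first byte read port
    hi = zp[(operand + 1) & 0xFF]  # second byte read port, wraps in page zero
    return lo | (hi << 8)

# Pointer stored at $ff: the high byte comes from $00, not $100.
zp = [0] * 256
zp[0xFF], zp[0x00] = 0x34, 0x12
assert fetch_zp_pointer(zp, 0xFF) == 0x1234
```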
Stage 2 : Address calculation, memory load and accumulator prefetch.
This stage contains a 16-bit adder and is able to add a register with a pointer (all values fetched in stage 1), a register with a literal (fetched from the opcode), etc.
The result of the adder is (optionally) sent to the data (cache) bus for a load access. The accumulator is prefetched in this stage for most instructions (this could easily be extended to prefetch any other register, for instance).
Hardware resources for stage 2: 1 register read port, 1 data master (read only), 1 16-bit+8-bit adder with 16-bit result
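As a sketch, the stage-2 adder just does a 16-bit + 8-bit addition, separate from the 8-bit ALU (Python, hypothetical helper name):

```python
def effective_address(base, index):
    """Stage 2: 16-bit base (pointer or absolute operand) plus 8-bit
    index register, 16-bit result; page crossings cost nothing extra."""
    return (base + (index & 0xFF)) & 0xFFFF

# lda $12F0,X with X = $20 crosses a page but still needs one add:
assert effective_address(0x12F0, 0x20) == 0x1310
```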
Stage 3: Execute stage, and status flags.
This stage contains the normal 8-bit ALU and performs all the arithmetic and logic of the instructions. One operand of the ALU is the register fetched in stage 2; the other operand is the data line from the memory read. It keeps the status flags and updates them, which makes this stage the best candidate to execute branches as well.
Hardware resources for stage 3: 1 8-bit ALU, several flip-flops for the status register, 1 16-bit+8-bit adder with 16-bit result for branches (could possibly be merged with the ALU)
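For example, a binary-mode ADC in stage 3 would combine the prefetched accumulator with the byte loaded in stage 2 and update the flags (Python sketch; decimal mode omitted):

```python
def alu_adc(a, operand, carry_in):
    """Stage 3: 8-bit ADC. One operand is the register prefetched in
    stage 2, the other is the byte on the memory data line."""
    total = a + operand + carry_in
    result = total & 0xFF
    flags = {
        'C': total > 0xFF,                                # unsigned carry out
        'Z': result == 0,
        'N': bool(result & 0x80),
        'V': bool(~(a ^ operand) & (a ^ result) & 0x80),  # signed overflow
    }
    return result, flags

# $50 + $50 overflows signed arithmetic: N and V set, no carry.
res, f = alu_adc(0x50, 0x50, 0)
assert res == 0xA0 and f['V'] and f['N'] and not f['C']
```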
Stage 4 : Register and memory writeback
This stage writes back the results of the instruction into the register file and/or into memory (it is to be determined whether both will be required at the same time or not).
Hardware resources for stage 4: 1 write port to memory, 1 write port to the register file
This stage could be merged with stage 3, but chances are that splitting it like this shortens the critical path.
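Tying the four stages together, here is a toy Python schedule (my own notation, not real hardware) showing how independent instructions would flow down the pipe, one completing per cycle once it is full — assuming no stalls or hazards:

```python
def schedule(program, n_stages=4):
    """Return, for each cycle, which instruction (if any) occupies each
    of the 4 stages; assumes no stalls and no data hazards."""
    rows = []
    for cycle in range(len(program) + n_stages - 1):
        rows.append([program[cycle - s] if 0 <= cycle - s < len(program)
                     else None
                     for s in range(n_stages)])
    return rows

# On cycle 3, inc $10 retires from stage 4 (writeback) while
# lda $20,X is already in stage 3 (ALU/flags):
rows = schedule(["inc $10", "lda $20,X", "sta ($30),Y"])
assert rows[3][3] == "inc $10" and rows[3][2] == "lda $20,X"
```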
If someone has any optimization or improvement, I'm all ears.
So the total external resources we need are:
- 2 register read ports
- 2 specific zero page read ports
- 1 general purpose memory read port
- 1 register write port
- 1 general purpose memory write port
I see two possible approaches for implementing this.
1) Zero page is registers approach
The register file would be 260 bytes (ZP, A, X, Y, SP). It would have 1 write port and 4 read ports. It would be quite wasteful of the FPGA's SRAM, as it will very likely synthesize into 2 KB of SRAM (260 rounded up to the next power of two, times 4). However, this is less SRAM than my current 32-bit ARM core.
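As a sanity check on that sizing (assuming, as the "* 4" implies, one SRAM copy per read port):

```python
# Approach 1 register-file sizing (assumption: the synthesizer
# replicates the SRAM once per read port, hence the "* 4").
entries = 256 + 4                          # ZP + A, X, Y, SP = 260 bytes
rounded = 1 << (entries - 1).bit_length()  # round up to power of two -> 512
total = rounded * 4                        # 4 read ports -> 2048 bytes = 2 KB
print(rounded, total)
```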
Some logic in the memory read and write paths detects whenever the high byte of the address is null. In this case, the memory can't be read/written as normal, and the register file should be accessed instead. I don't know exactly how this would be done, but it does not matter if this process costs additional cycles, as long as the logic is there.
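The address-decoding idea can be sketched as follows (hypothetical helper, in Python for brevity):

```python
def route_access(addr):
    """Approach 1: an access whose high address byte is zero targets
    the register file; anything else goes to the normal memory port."""
    return "register file" if (addr >> 8) == 0 else "memory"

assert route_access(0x0042) == "register file"
assert route_access(0x0142) == "memory"
```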
There are two separate masters for reading and writing data to/from non-zero-page memory. Either this is done at the D-Cache level, or there are simply two actually different masters on the external bus. In any case, the write master should be given higher priority than the read one, and the pipeline would be stalled whenever a read should happen in stage 2 while a write is happening in stage 4.
2) Zero page is cached approach
In this approach, the register file is only 4 bytes (A, X, Y, SP), plus perhaps a couple of extra temp registers used exclusively for microcoding purposes, and it would have 2 read ports. This would ease up on FPGA resources for the register file A LOT.
However, ZP has to be stored somewhere. It would not be acceptable to have it on a different chip, and even if it were, we would need 3 data masters, which does not make any sense and has little chance of being efficient (tell me if I am wrong, please). So this means a D-Cache is mandatory (well, not technically, but it would not make sense not to have one).
The D-Cache would have 3 read ports and 1 write port. Some strategy would be needed to make sure zero page is always, or at least preferentially, cached, although it would end up like this with normal replacement algorithms anyway. It would also be smart to ensure $fff0-$ffff are cached, as the interrupt vectors are there; this would grant fast access to them.
A cache coherency mechanism would allow any memory, including zero page, to be shared with the rest of the system. If the I-Cache contains a coherency mechanism too, executing self-modifying code will be possible. However, the main bus has to support this.
Conclusion
I think a super-efficient 6502 is perfectly possible with a bit of thought put into it. I also forgot to mention that some aggressive forwarding will have to be done.
I think approach 2) is probably cleaner, and closer to what the 6502 originally wanted to be (while being much more efficient). However, I feel like approach 1) is easier to implement, as there is no need for a D-Cache, and it is more independent from the external bus, as long as a multi-master bus with fixed priorities is there. Solution 2) needs a powerful bidirectional D-Cache with a coherency mechanism, which is easier said than done.
So if it were me, I'd try to do it like 1) and see the results. If it happens to be successful, it would not be hard to turn it into 2) at a later stage. If anyone has suggestions or comments, I am all ears.
HOWEVER, I do not want people trolling or starting flame wars as was done in the first 6 pages of this thread. If anyone does, I will ignore their comments, consider them to be stupid and not worth listening to, and will implement my core my own way.
Finally, I am not doing this against the original 6502; I just personally want to see how efficient the 6502 instruction set can become.