Non-uniform memory for the (fast) 6502

BigEd · Post by **BigEd** » Tue Oct 27, 2020 7:03 pm

Beeb816 has a lot of potential flexibility for how to handle the 32k of on-board RAM: it can do anything from all-slow to all-fast (now that we have a means of dealing with the frame buffer). The same goes for the 32k of ROM space. In both cases it's tempting to think of it as running either in maximum compatibility mode or in maximum performance mode, but there are points in between. It's possible, for example, to speed up only the Basic interpreter and the zero page and stack.

This is of course a bit of a red herring because once you have an all-fast system it's a simple matter to decide to slow some of it down, to see what the effect is.
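A rough sketch of one of those "points in between", written as a page-granular speed map. The names (`make_speed_map`, `region_speed`) are hypothetical, and the region boundaries assume the usual BBC Micro layout (RAM at $0000-$7FFF, language ROM slot at $8000-$BFFF); none of this is taken from the Beeb816 sources.

```python
# Hypothetical page-granular speed map. The boundaries below assume the usual
# BBC Micro layout and are NOT taken from the Beeb816 design itself.
FAST, SLOW = "fast", "slow"

def make_speed_map(fast_ranges):
    """Return a 256-entry list with one speed setting per 256-byte page."""
    speed = [SLOW] * 256
    for start, end in fast_ranges:              # inclusive byte addresses
        for page in range(start >> 8, (end >> 8) + 1):
            speed[page] = FAST
    return speed

# In-between configuration: zero page, stack and the Basic ROM slot run fast,
# everything else stays at the slow, compatible speed.
speed_map = make_speed_map([(0x0000, 0x01FF), (0x8000, 0xBFFF)])

def region_speed(addr):
    """Look up the speed of the page containing addr."""
    return speed_map[(addr & 0xFFFF) >> 8]
```

Starting from an all-fast map and flipping individual pages back to slow is then a one-line change, which matches the "slow some of it down and see" approach.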

kakemoms · Post by **kakemoms** » Sun Jan 09, 2022 7:48 am

Interesting discussion.

I haven't been so active lately, but just started to look at my N6502 which had several limitations in order to be small and fast. I would like to extend the core towards a 6502 and keep (some of the) speed.

The original implementation did it this way:

1) Fetch byte x (opcode)
2) Fetch byte x+$100 (data)
3) Run instruction
4) Fetch or Store

Steps 1 and 3 were executed on the rising edge, while 2 and 4 were executed on the falling edge, in a kind of 2-stage pipeline. So it ran at one instruction per cycle and got to around 150 MIPS. It used dual-port memory to access the data area. There are no read-modify-write instructions (yet), as they would require two cycles.

I want to combine opcode and data area. The ultimate goal is to get close to an actual 6502. The number of opcodes is currently limited, but extending that is the easy part. I imagine a modified implementation which could do it this way:

One Cycle:
Instruction prefetch (stage 1)
Data memory preread (stage 2)
Opcode decode (stage 3)
Instruction memory write (stage 4)
Data memory write (stage 4)

Although the 4 stages of any one instruction are spread over 4 cycles, all four stages are active in every cycle (each working on a different instruction), meaning one needs dual-port memory (which is easy to implement in an FPGA).
To have overlapping instruction and data memory, I imagine mirroring the memory: a kind of 4-port memory to handle the 4 read/write accesses in each cycle. Faster memory would obviously be an alternative.

In effect this would be a 4-stage pipeline. For now, I am just ignoring the memory alignment (an input buffer would be required to align the bytes since each access fetches 4 bytes).
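To make the overlap concrete, here is a minimal occupancy sketch of an ideal, stall-free 4-stage pipeline. The stage names are shortened from the list above, and `schedule` is a hypothetical helper, not part of any existing core.

```python
# Ideal (stall-free) occupancy of the 4-stage pipeline described above.
# Stage names are abbreviations chosen for this sketch.
STAGES = ["prefetch", "preread", "decode", "write"]

def schedule(n_instructions):
    """Return one dict per cycle mapping stage name -> instruction index."""
    total_cycles = n_instructions + len(STAGES) - 1
    table = []
    for cycle in range(total_cycles):
        row = {}
        for depth, name in enumerate(STAGES):
            instr = cycle - depth          # instruction currently in this stage
            if 0 <= instr < n_instructions:
                row[name] = instr
        table.append(row)
    return table

timeline = schedule(5)
```

Once the pipeline is full (cycle 3 onward in this example), every stage is busy in the same cycle, which is where the demand for several simultaneous memory ports comes from.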

Example:
INC $1234
Bytes:
$EE $34 $12
(address location $1234 contains $01)

Stage 1:
Prefetch instruction ($EE3412)

Stage 2:
Data memory preread ($1234)

Stage 3:
(Preread returned $01)
$EE decode -> add $01 to $01

Stage 4:
Write $02 to $1234

All this can happen in the same cycle, so we have effectively accessed the same memory 3 times within the same clock cycle. If we have two instances of the same memory (e.g. a mirror), stage 4 needs to write two times to two different memory instances. So we can get 4 memory accesses within one clock cycle.
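The access count for the INC $1234 example can be checked with a small sketch. `MirroredMemory` and `run_inc_abs` are hypothetical names, and treating the instruction prefetch as a single wide read is an assumption based on the remark that each access fetches several bytes.

```python
class MirroredMemory:
    """Two identical 64 KB copies; every write goes to both (assumed model)."""
    def __init__(self):
        self.copies = [bytearray(65536), bytearray(65536)]
        self.accesses = 0

    def read(self, copy, addr, n=1):
        self.accesses += 1                  # one (possibly wide) read access
        return bytes(self.copies[copy][addr:addr + n])

    def write(self, addr, value):
        for mem in self.copies:             # keep the mirror coherent
            mem[addr] = value & 0xFF
            self.accesses += 1

def run_inc_abs(mem, pc):
    opcode, lo, hi = mem.read(0, pc, 3)     # stage 1: instruction prefetch
    if opcode != 0xEE:
        raise ValueError("expected INC absolute ($EE)")
    addr = lo | (hi << 8)
    (old,) = mem.read(1, addr)              # stage 2: data preread
    new = (old + 1) & 0xFF                  # stage 3: decode and execute
    mem.write(addr, new)                    # stage 4: write back to both copies
    return new

mem = MirroredMemory()
for copy in mem.copies:                     # preload without counting accesses
    copy[0x0200:0x0203] = bytes([0xEE, 0x34, 0x12])
    copy[0x1234] = 0x01
result = run_inc_abs(mem, 0x0200)
```

In this model the instruction touches memory four times: one read per copy, plus one write to each copy.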

An alternative would be to allow only two accesses to the same memory at once. This would be OK for almost all instructions. An exception could be made for read-modify-write instructions, which would then take 2 clocks.

Instructions that access the same memory as an instruction still in the pipeline (or that modify the next instruction) would also need to generate an exception, since this is a 4-stage pipeline. Branches would do the same, but a small branch prediction buffer could help with that.
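One way to picture that hazard check is a comparison between the addresses older instructions have yet to write and the bytes a younger instruction is about to read. This is only a sketch under that assumption; `needs_stall` and its parameters are hypothetical.

```python
def needs_stall(in_flight_writes, instr_addr, instr_len, data_addr=None):
    """True if a younger instruction would read a byte that an older
    instruction, still in the pipeline, has not yet written."""
    instr_bytes = range(instr_addr, instr_addr + instr_len)
    for write_addr in in_flight_writes:
        if write_addr in instr_bytes:       # self-modifying code hazard
            return True
        if data_addr is not None and write_addr == data_addr:
            return True                     # read-after-write on data
    return False

# An older INC $1234 is still in flight; the next instruction sits at $0203.
same_data = needs_stall([0x1234], 0x0203, 3, data_addr=0x1234)
self_modifying = needs_stall([0x0204], 0x0203, 3, data_addr=0x2000)
independent = needs_stall([0x1234], 0x0203, 3, data_addr=0x2000)
```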

Thoughts?

Proxy · Post by **Proxy** » Sun Jan 09, 2022 3:26 pm

Not sure if it's entirely relevant to the thread, but I've been toying with the idea of having a 65C02 core where the ZP and Stack are inside the CPU as 2 separate 256-byte register files with 2 read ports and 1 write port.
This would greatly speed up any ZP- and Stack-based instructions, as you can now access ZP/Stack at the same time as external memory. Plus, any 16-bit reads from them can be done in a single cycle.

For example LDA ($00,X) would be done in 3 cycles:

1. Read Opcode from external Memory
2. Read Operand Byte from external Memory
3. Load Byte from external Memory pointed to by the word at ZP location Operand+X (no need to store the word from the ZP as it's already inside the CPU) into the Accumulator

Similarly, all Push/Pull Instructions and RTS would only require a single cycle.
Though any instruction that writes a 16-bit word (JSR) would still need 2 cycles, as building a Register file with 2 write ports is a massive pain and not worth it.
Also, the internal ZP and Stack would only be accessible through their respective instructions, so Instruction fetching and absolute addressing modes would only access external memory (though I guess it could be redirected to the internal memory by just checking the high byte of the address, but it wouldn't give you any speed benefit).
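The three-cycle LDA ($00,X) sequence above can be sketched as follows. `ZeroPageFile` and `lda_indexed_indirect` are hypothetical names, and the cycle counter only counts external memory accesses, on the assumption that the two-port zero page read overlaps with them.

```python
class ZeroPageFile:
    """256-byte register file with two read ports (assumed model)."""
    def __init__(self):
        self.cells = bytearray(256)

    def read_pair(self, index):
        # Both ports are used in the same cycle; the index wraps within the page.
        lo = self.cells[index & 0xFF]
        hi = self.cells[(index + 1) & 0xFF]
        return lo | (hi << 8)

def lda_indexed_indirect(zp, external, pc, x):
    cycles = 0
    opcode = external[pc]                   # cycle 1: opcode fetch
    cycles += 1
    if opcode != 0xA1:
        raise ValueError("expected LDA (zp,X), opcode $A1")
    operand = external[pc + 1]              # cycle 2: operand fetch
    cycles += 1
    pointer = zp.read_pair(operand + x)     # internal read, costs no extra cycle
    value = external[pointer]               # cycle 3: load through the pointer
    cycles += 1
    return value, cycles

zp = ZeroPageFile()
zp.cells[0x04] = 0x34                       # pointer low byte
zp.cells[0x05] = 0x12                       # pointer high byte
external = bytearray(65536)
external[0x0200] = 0xA1                     # LDA ($00,X)
external[0x0201] = 0x00
external[0x1234] = 0x42
value, cycles = lda_indexed_indirect(zp, external, 0x0200, x=0x04)
```

For comparison, the same instruction takes 6 cycles on a stock 6502.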

Another idea: speed is one thing, but what about width? Why not make the data bus wider, 8086 style? I don't think I've really seen any 65C02/816 cores that have a full 16-bit data bus without massively changing the Instruction set. Having a 16-bit wide, fully compatible 65C02 would on average cut the number of cycles required to fetch instructions in half, and any aligned 16-bit accesses would also be twice as fast.
But if driving a full 16-bit data bus is too much (maybe it drags down Fmax too far or it just takes too many IO Pins or traces on a PCB), then just keep it internal.
Have the BlockRAM of the FPGA be 16, 32, or even 64 bits wide with the CPU built to take full advantage of that width, and when accessing external memory cut it into 8-bit pieces (at the cost of speed) so that externally you can just plug it into an existing 65C02 System.
You could also throw a DMA Controller into the mix so you can load the internal BlockRAM much faster than with software.
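A quick way to estimate the fetch saving is to count how many aligned bus words a run of bytes touches. `fetch_cycles` is a hypothetical helper; it assumes one cycle per aligned word and that leftover bytes from a word are kept for the next instruction when it is applied to a whole straight-line run.

```python
def fetch_cycles(addr, length, bus_bytes):
    """Number of aligned bus words touched by `length` bytes starting at addr."""
    first_word = addr // bus_bytes
    last_word = (addr + length - 1) // bus_bytes
    return last_word - first_word + 1

# A single 3-byte instruction at an even address.
narrow = fetch_cycles(0x0200, 3, bus_bytes=1)       # 8-bit bus
wide = fetch_cycles(0x0200, 3, bus_bytes=2)         # 16-bit bus

# A 2-byte instruction at an odd address straddles two 16-bit words.
misaligned = fetch_cycles(0x0201, 2, bus_bytes=2)

# Twelve bytes of straight-line code, fetched as one buffered stream.
stream_narrow = fetch_cycles(0x0200, 12, bus_bytes=1)
stream_wide = fetch_cycles(0x0200, 12, bus_bytes=2)
```

Under these assumptions the halving shows up for buffered straight-line code, while an isolated or misaligned instruction gains less.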

BigEd · Post by **BigEd** » Sun Jan 09, 2022 4:58 pm

I think it's all possible, and it ought to give a nice performance win, and (or but) I don't think it's been done.

I think it's very much a good idea to be able to access all 64k uniformly, for general code to work, so I think checking that high address byte and allowing all addressing modes to see the same memory is a good idea. I'd almost say essential, but of course if you're writing your own code you can choose to do without. (For example, there's code out there which uses absolute indexed in the FF00 range to access zero page by wrapping.)
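The wrapping trick mentioned above comes down to 16-bit address arithmetic. A minimal sketch, with `absolute_indexed` as a hypothetical helper name:

```python
def absolute_indexed(base, index):
    """Effective address for an absolute,X or absolute,Y access on a 6502:
    the 16-bit sum wraps around at $FFFF."""
    return (base + (index & 0xFF)) & 0xFFFF

# LDA $FF80,X with X = $90 lands in zero page.
wrapped = absolute_indexed(0xFF80, 0x90)

# The same base with a small index stays in the top page.
unwrapped = absolute_indexed(0xFF80, 0x10)
```

A core that keeps zero page in a separate internal memory would have to catch `wrapped` (high byte $00) and route it to that memory for such code to keep working.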

What I believe is needed is to decouple the memory subsystem from the core. The original 6502 and all the cores I can think of use a single state machine to decide what the core is doing. But with the machine described, that becomes a very complex piece of state.

Conventionally, when considering this sort of thing one would usually start by writing a cycle-accurate model, and using that to see what the performance benefit of each idea would be, and then to refine the model into mechanisms which can be simulated and built. That is, the architectural model would be in a high level language, not in an HDL. It's a way to de-risk the project, and to decouple design from implementation.

That said, one could also just dive in.

Thinking about this previously, the two-byte accesses needed for zero page and for stack are unaligned but adjacent, so I've imagined having a memory of odd locations and another memory of even locations would be helpful. It's not quite a dual-port setup. But this idea might not be any good.
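A sketch of the odd/even split for one 256-byte page: any two adjacent bytes fall in different banks, so each bank sees exactly one read. `OddEvenPage` is a hypothetical name, and wrapping within the page is an assumption matching zero page pointer behaviour.

```python
class OddEvenPage:
    """One 256-byte page stored as two 128-byte banks (even and odd addresses)."""
    def __init__(self, data):
        if len(data) != 256:
            raise ValueError("a page is 256 bytes")
        self.even = bytearray(data[0::2])
        self.odd = bytearray(data[1::2])
        self.bank_reads = {"even": 0, "odd": 0}

    def _fetch(self, addr):
        name = "odd" if addr & 1 else "even"
        self.bank_reads[name] += 1
        bank = self.odd if addr & 1 else self.even
        return bank[addr >> 1]              # row within the selected bank

    def read_word(self, addr):
        lo = self._fetch(addr & 0xFF)
        hi = self._fetch((addr + 1) & 0xFF) # wraps within the page
        return lo | (hi << 8)

page = OddEvenPage(bytes(range(256)))       # byte at address n holds value n
aligned = page.read_word(0x10)
unaligned = page.read_word(0x11)
wrapped_word = page.read_word(0xFF)
```

After three word reads each bank has been read three times, so neither bank ever needs a second read port for this access pattern.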

It would certainly be interesting to see such a machine take shape!

Windfall · Post by **Windfall** » Sun Jan 09, 2022 6:24 pm

Proxy wrote:

Not sure if it's entirely relevant to the thread, but I've been toying with the idea of having a 65C02 core where the ZP and Stack are inside the CPU as 2 separate 256-byte register files with 2 read ports and 1 write port.

See here (viewtopic.php?t=6284) for a description of a core that takes that to the extreme. It uses block RAM for everything, duplicates RAM as needed (all copies are written to at the same time), and reads whole instructions rather than a byte at a time.

Proxy wrote:

Though any instruction that writes a 16-bit word (JSR) would still need 2 cycles, as building a Register file with 2 write ports is a massive pain and not worth it.
Also, the internal ZP and Stack would only be accessible through their respective instructions, so Instruction fetching and absolute addressing modes would only access external memory (though I guess it could be redirected to the internal memory by just checking the high byte of the address, but it wouldn't give you any speed benefit).

That only works if it's acceptable for locations 0x00aa (absolute) and 0xaa (zero page) to become two different locations. The same goes for stack locations. In the general case, it's not acceptable. Stack locations especially will often be manipulated with absolute addressing.
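The aliasing problem is easy to demonstrate with a toy model. `SplitMemory` and its `route_by_high_byte` flag are hypothetical; the flag stands in for the "check the high byte" redirection discussed in the quote.

```python
class SplitMemory:
    """Internal zero page plus external 64 KB memory (toy model)."""
    def __init__(self, route_by_high_byte):
        self.internal_zp = bytearray(256)
        self.external = bytearray(65536)
        self.route_by_high_byte = route_by_high_byte

    def _is_internal(self, addr):
        return self.route_by_high_byte and (addr >> 8) == 0

    def write_abs(self, addr, value):       # absolute addressing
        addr &= 0xFFFF
        target = self.internal_zp if self._is_internal(addr) else self.external
        target[addr] = value & 0xFF

    def read_zp(self, offset):              # zero page addressing
        return self.internal_zp[offset & 0xFF]

naive = SplitMemory(route_by_high_byte=False)
naive.write_abs(0x00AA, 0x55)               # e.g. STA $00AA
naive_seen = naive.read_zp(0xAA)            # e.g. LDA $AA

routed = SplitMemory(route_by_high_byte=True)
routed.write_abs(0x00AA, 0x55)
routed_seen = routed.read_zp(0xAA)
```

Without the routing, the absolute store and the zero page load see two different bytes; with it, they agree again.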

Proxy wrote:

Another idea: speed is one thing, but what about width? Why not make the data bus wider, 8086 style? I don't think I've really seen any 65C02/816 cores that have a full 16-bit data bus without massively changing the Instruction set. Having a 16-bit wide, fully compatible 65C02 would on average cut the number of cycles required to fetch instructions in half, and any aligned 16-bit accesses would also be twice as fast.
But if driving a full 16-bit data bus is too much (maybe it drags down Fmax too far or it just takes too many IO Pins or traces on a PCB), then just keep it internal.
Have the BlockRAM of the FPGA be 16, 32, or even 64 bits wide with the CPU built to take full advantage of that width, and when accessing external memory cut it into 8-bit pieces (at the cost of speed) so that externally you can just plug it into an existing 65C02 System.

See the core mentioned above. It does most of that.

kakemoms · Post by **kakemoms** » Sun Jan 09, 2022 9:48 pm

I hope to dive into it soon in my untraditional way. My 6502MMU has been halted for a while, so I may finish that first...

The scare with 32-bit wide memory is that the wide bus tends to slow everything down (because it takes a lot of space). I like to use the Lattice MachXO3, which can cope with quite high memory speeds, so faster memory could be the alternative. Anyway, dividing the instructions into several stages seems to be the best way to keep the speed up. 4 stages might not be enough, but one can go to 5 or 6 without much penalty (at least with branch prediction in place).

Stratix V is way above my pay grade. I can't think of a project that would defend the cost.

Windfall · Post by **Windfall** » Sun Jan 09, 2022 10:39 pm

kakemoms wrote:

The scare with 32-bit wide memory is that the wide bus tends to slow everything down (because it takes a lot of space).

That's true. The routing pressure of 'ambitious' memory setups can be very high.

kakemoms wrote:

Stratix V is way above my pay grade. I can't think of a project that would defend the cost.

I paid very little for the board it's on (which is why I got two). Regardless of cost, it's an ideal environment in which to abuse resources for speed.

Windfall · Post by **Windfall** » Fri Jan 14, 2022 4:32 pm

Just in case it helps, I've now published the source of my 65C02 core. See the end of viewtopic.php?t=6284.

IMO, it would still be interesting to see the performance improvement that is possible by duplicating only page 0 and 1, since the memory footprint of that would be really small. I can't really do that with my core, since accessing code and data at the same time is sort-of the heart of the whole setup, and that necessitates duplication of main memory (and the memory footprint of that is so high that you might as well duplicate everything else ...).

Windfall · Post by **Windfall** » Sat Feb 19, 2022 9:54 pm

Windfall wrote:

IMO, it would still be interesting to see the performance improvement that is possible by duplicating only page 0 and 1, since the memory footprint of that would be really small.

Actually, the 'tiny' version of Z65C02 is more or less an implementation of this. Again, see the end of viewtopic.php?t=6284.

kakemoms · Post by **kakemoms** » Wed Feb 23, 2022 4:21 am

Windfall wrote:

IMO, it would still be interesting to see the performance improvement that is possible by duplicating only page 0 and 1, since the memory footprint of that would be really small.

Actually, the 'tiny' version of Z65C02 is more or less an implementation of this. Again, see the end of viewtopic.php?t=6284.

I looked at your tiny core, but could not get my head around the memory access. How do you wire all those PEEKxxxx input/output instances into a single dual port?

Windfall · Post by **Windfall** » Thu Mar 10, 2022 6:27 pm

kakemoms wrote:

I looked at your tiny core, but could not get my head around the memory access. How do you wire all those PEEKxxxx input/output instances into a single dual port?

Somehow, I must have missed your message ... Anyway:

You connect the 8xNs to their own memory blocks (they're just a page each). Only the 16x24 needs 64 KB. Look here for elaboration on that one: viewtopic.php?f=10&t=6284&start=30#p90905.

So in total you need (using dual-ported memory): one 64 KB (for 16x24), five 256 byte (for 8x24/16/8s), one 8 byte (for 3x16).
