Non-uniform memory for the (fast) 6502

BigEd · Post by **BigEd** » Tue Oct 27, 2020 10:15 am

This is (also) a hardware topic, but I think it might have most applicability to FPGA (or CPLD) implementations.

Normally a 6502 system has uniform memory access times, although it might slow down for peripherals. The next most common case might be a slow ROM (or EPROM) which needs a wait state or a slower clock, and in this case it's common enough to copy the slow ROM to fast RAM shortly after reset.

We don't commonly see caches in use in the land of 6502 - perhaps the main reason is that memory accesses are normally single-cycle, so there's no point.

However, an FPGA-based 6502 will commonly have the chance to use fast on-chip RAM, or slower off-chip RAM. It might even be, at 100MHz or more, that the on-chip RAM isn't as fast as we'd like it to be.

In these cases, it might be interesting to consider non-uniform access: page zero should be fast, surely, and perhaps there's a likely area for the application code. Putting the stack in fast memory seems tempting, but in practice the stack might not see very many accesses. The application data area might comprise three zones:
- static and global data, such as the text of interpreted code
- an application stack used for tracking nested structures
- the rest, much of which might be unused anyway

Other than the application, there's also the operating system: it might be useful for the graphics subsystem's data to be fast, along with the frame buffer, if any, and perhaps the buffers for file I/O.

Has anyone any thoughts or experience on using, say, 4k to 32k of fast RAM and the rest being two-cycle?

65f02 · Post by **65f02** » Tue Oct 27, 2020 10:49 am

I had thought about that when I realized that routing delays to the outer block RAMs were limiting the performance of my 65F02 design. But they are not so significant that they would double the access times. E.g. if the full 64k RAM available on a Spartan-6 can be accessed at 100 MHz, a zero-page RAM close to the CPU core could not operate at 200 MHz.

I guess one could think about fractional clock ratios, say 100 and 150 MHz derived from a common master clock, but I did not want to open that can of worms...

BigEd · Post by **BigEd** » Tue Oct 27, 2020 11:04 am

Very good point though: with stretched-clock designs, as opposed to using RDY with a fixed clock, there's room for finer granularity adjustment.

I was thinking something more like a 100MHz CPU+64k being one alternative, and a 120MHz CPU + fast 8k + slow 56k being another. Although it feels possible that a 60MHz bulk of memory would erode the advantage of the 120MHz CPU.
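A rough way to put numbers on that trade-off, assuming every bus cycle is one memory access and an access to the slow region simply costs two CPU clocks (both are simplifications, and the function name and sample fractions are made up for illustration):

```python
def avg_access_ns(cpu_mhz, slow_fraction, slow_cycles=2):
    """Average time per bus access in ns, when slow_fraction of the
    accesses hit memory that takes slow_cycles CPU clocks instead of one."""
    cycles = (1 - slow_fraction) + slow_fraction * slow_cycles
    return cycles * 1000.0 / cpu_mhz

uniform = avg_access_ns(100, 0.0)   # 100MHz CPU + uniform 64k

for slow_fraction in (0.0, 0.1, 0.2, 0.3, 0.5):
    # 120MHz CPU with the bulk of memory effectively at 60MHz
    mixed = avg_access_ns(120, slow_fraction)
    print(f"{slow_fraction:.0%} slow: {mixed:.2f} ns vs {uniform:.2f} ns uniform")
```

Under those assumptions the break-even is at 20% of accesses going to slow memory: below that the 120MHz split wins, above it the plain 100MHz system does.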

65f02 · Post by **65f02** » Tue Oct 27, 2020 11:29 am

The performance would become quite dependent on the code being executed, I assume. I would guesstimate that for most programs, well above half of the bus cycles are for access to the program code itself. So in order for the benefit of the fast access cycles to outweigh the penalty of the slower regular memory, one would have to accelerate the program ROM, or at least the core loops. (Which means knowing where they are.)

BigEd · Post by **BigEd** » Tue Oct 27, 2020 12:50 pm

Agreed, ideally application code and zero page. If, for example, the bottom 8k were faster, it would accelerate small programs anyway, and larger programs could place performance-critical loops there.

It shouldn't be too hard to instrument an emulator to see the effect of various choices.
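As a minimal sketch of what that instrumentation might look like: assume the emulator can hand over a trace with one address per bus cycle (from its memory read/write hook), and that anything outside the fast ranges costs a fixed number of wait states. The trace and the layouts below are hypothetical.

```python
def count_cycles(trace, fast_ranges, wait_states=1):
    """Total cycles for a bus trace: one cycle per access, plus
    wait_states extra for every access outside the fast ranges."""
    cycles = 0
    for addr in trace:
        fast = any(lo <= addr <= hi for lo, hi in fast_ranges)
        cycles += 1 if fast else 1 + wait_states
    return cycles

# Hypothetical trace: LDA $2000 (4 bus cycles) fetched from $0200, repeated.
# A real trace would be recorded by the emulator while running a benchmark.
trace = [0x0200, 0x0201, 0x0202, 0x2000] * 1000

layouts = {
    "all fast":       [(0x0000, 0xFFFF)],
    "bottom 8k fast": [(0x0000, 0x1FFF)],
    "zero page only": [(0x0000, 0x00FF)],
}
for name, ranges in layouts.items():
    print(f"{name:15s} {count_cycles(trace, ranges)} cycles")
```

Swapping in different `fast_ranges` for the same recorded trace gives a quick comparison of memory layouts without touching the emulator core.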

Arlet · Post by **Arlet** » Tue Oct 27, 2020 1:41 pm

200 MHz might just be achievable with manual placement. In that case, adding an extra wait state for "remote" memory could be a valuable trade-off, as long as the extra logic for that wait state doesn't interfere with fast memory access.

Keep in mind that an extra wait state doesn't mean you're effectively running at 100 MHz. Instead, it might turn a 3 cycle instruction into 4 cycles, assuming your code/zeropage can run from fast memory.
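The arithmetic, as a small sketch (the function name is made up, and it assumes one wait state per remote access and nothing else changing):

```python
def effective_mhz(clock_mhz, base_cycles, remote_accesses, wait_states=1):
    """Equivalent uniform-memory clock for one instruction that makes
    remote_accesses accesses to wait-stated memory."""
    stretched = base_cycles + remote_accesses * wait_states
    return clock_mhz * base_cycles / stretched

print(effective_mhz(200, 3, 1))   # 3 cycles stretched to 4 -> 150.0
print(effective_mhz(200, 3, 3))   # every access remote     -> 100.0
```

Only when every single access goes to remote memory does the 200 MHz design fall back to the equivalent of 100 MHz.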

Edit: the wait state logic could run in parallel to the rest of the CPU, and quickly pull down the CE to all the flops before next clock edge.

Chromatix · Post by **Chromatix** » Tue Oct 27, 2020 3:10 pm

I think the most critical areas of RAM to make fast are:

1: Program code. Many 6502 instructions' speed is limited solely by opcode and operand fetches. It may be useful to make the region of accelerated program memory configurable, hence as a write-through or write-invalidating cache without dynamic allocation, and this would enable running code quickly even from a slow ROM. A write-through cache would remain fast even for self-modifying code.

2: Zero page. This is where data that needs to be accessed as fast as possible is routinely placed by programmers, and it is also a dependent step for indirect addressing modes.

3: Stack memory comes a distant third. Stack push and pop instructions are relatively slow on a standard 6502 (taking double the number of cycles that would be expected solely from the number of accesses), so programmers would not expect them to be particularly fast. Stack performance becomes more relevant when considering interrupt latency and overhead, as both taking the interrupt and returning from it require a tight sequence of three stacked bytes. To a lesser extent, it is also important for subroutines.

4: The interrupt vector table is important for interrupt latency and overhead. However, it is not uncommon to intercept vector fetches and substitute a vector that depends on the interrupt source. WDC's own 6502-family microcontrollers do this internally.

Arlet · Post by **Arlet** » Tue Oct 27, 2020 3:24 pm

If we have access to dual ported RAM, the writes can be done without a wait state.

BigEd · Post by **BigEd** » Tue Oct 27, 2020 3:36 pm

I'm not sure how much it costs or how much it gains, but I now recall the SuperCPU had a write buffer: so the first write is free.

Arlet · Post by **Arlet** » Tue Oct 27, 2020 3:47 pm

I don't think a write buffer is needed. The RAM still runs at the same clock, so when the CPU wants to write, the bus is always available.

Arlet · Post by **Arlet** » Tue Oct 27, 2020 3:49 pm

You could do a speculative read from next location, though, and then do an address compare near the CPU input.

You could even hold it there, as long as the CPU doesn't read another location from that same memory. So, if you're reading a bunch of memory sequentially, every read would be valid, as long as the code itself is running from local memory.
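A behavioural sketch of that idea, under some assumptions that are not in the description above: the speculative read of the next location always finishes before the CPU's next remote access (plausible if the code runs from local memory), and writes to remote memory are ignored here, although real hardware would have to invalidate or update the held byte on a write. The class name is hypothetical.

```python
class NextLocationPrefetch:
    """One-entry speculative read buffer in front of slow memory.
    After each slow read of address A, the slow memory is assumed to
    fetch A+1 in the background; a later read of A+1 needs no wait."""

    def __init__(self, wait_states=1):
        self.wait_states = wait_states
        self.held_addr = None   # address whose data is held near the CPU

    def read_cycles(self, addr):
        hit = (addr == self.held_addr)         # the address compare
        self.held_addr = (addr + 1) & 0xFFFF   # speculate on the next location
        return 1 if hit else 1 + self.wait_states

buf = NextLocationPrefetch()
sequential = [buf.read_cycles(a) for a in range(0x4000, 0x4008)]
print(sequential)   # the first read pays the wait state, the rest do not
```

Because `held_addr` only changes on a remote read, the held byte stays valid for as long as the CPU keeps working out of local memory.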

BigEd · Post by **BigEd** » Tue Oct 27, 2020 5:30 pm

It almost starts to look like a cache line!

Windfall · Post by **Windfall** » Tue Oct 27, 2020 6:35 pm

BigEd wrote:

It almost starts to look like a cache line!

It all boils down to 'cache'. So there seems little point in trying to do it piecemeal. Just make zero page and stack fast (every FPGA can accommodate a fast 512 bytes). Cache everything else. Then the cache line fill policy is all you should worry about. Best case, no extra cycles (a hit). Worst case depends only on cache line fill speed (and therefore also its size and however much slower external memory is). And maybe have a separate cache for code and data.
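To make the hit and miss costs concrete, here is a toy cycle model. The direct-mapped organisation, the line size, the number of lines and the fill speed are all assumptions chosen for illustration; reads and writes are treated alike, so write policy is not modelled.

```python
class DirectMappedCache:
    """Cycle model: zero page and stack always fast, everything else
    goes through a direct-mapped cache with whole-line fills."""

    def __init__(self, lines=64, line_bytes=16, fill_cycles_per_byte=2):
        self.lines = lines
        self.line_bytes = line_bytes
        self.miss_penalty = line_bytes * fill_cycles_per_byte
        self.tags = [None] * lines

    def access_cycles(self, addr):
        if addr < 0x0200:
            return 1                       # fast 512 bytes: never a miss
        line_no = addr // self.line_bytes
        index = line_no % self.lines
        tag = line_no // self.lines
        if self.tags[index] == tag:
            return 1                       # hit: no extra cycles
        self.tags[index] = tag             # miss: fill the whole line
        return 1 + self.miss_penalty

cache = DirectMappedCache()
loop = list(range(0x8000, 0x8020))         # 32 bytes spanning two cache lines
total = sum(cache.access_cycles(a) for _ in range(100) for a in loop)
print(total)   # 3200 accesses plus two line fills of 32 cycles each
```

The worst case per access is `1 + line_bytes * fill_cycles_per_byte`, which is why line size and external memory speed together set the penalty.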

BigEd · Post by **BigEd** » Tue Oct 27, 2020 6:40 pm

One difference between non-uniform memory speed and cache is 'determinacy' - not in a strict sense, of course, but in a sense of being easy to know what to expect.

As it happens, Beeb816 offers a sort of experimental platform: it can run from its own fast RAM, at the configured speed according to the crystal and the divider, or from the Beeb's 2MHz RAM. Or it can do a sort of write-through cache whereby reads are full speed but writes slow down. However, as the fast and slow clocks are not related, there's a cycle or two lost every time it has to switch.

Having said that, any LX9 FPGA board is an experimental platform!

Windfall · Post by **Windfall** » Tue Oct 27, 2020 6:53 pm

BigEd wrote:

As it happens, Beeb816 offers a sort of experimental platform: it can run from its own fast RAM, at the configured speed according to the crystal and the divider, or from the Beeb's 2MHz RAM. Or it can do a sort of write-through cache whereby reads are full speed but writes slow down. However, as the fast and slow clocks are not related, there's a cycle or two lost every time it has to switch.

But that system essentially has two copies of main memory, not one with slow and fast parts. It is only burdened by having to do write-throughs to a framebuffer and perhaps other memory mapped hardware I/O.
