
All times are UTC




PostPosted: Tue Oct 27, 2020 10:15 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
This is (also) a hardware topic, but I think it might have most applicability to FPGA (or CPLD) implementations.

Normally a 6502 system has uniform memory access times, although it might slow down for peripherals. The next most common case might be a slow ROM (or EPROM) which needs a wait state or a slower clock, and in this case it's common enough to copy the slow ROM to fast RAM shortly after reset.

We don't commonly see caches in use in the land of 6502 - perhaps the main reason is that memory accesses are normally single-cycle, so there's no point.

However, an FPGA-based 6502 will commonly have the chance to use fast on-chip RAM, or slower off-chip RAM. It might even be, at 100MHz or more, that the on-chip RAM isn't as fast as we'd like it to be.

In these cases, it might be interesting to consider non-uniform access: page zero should be fast, surely, and perhaps there's a likely area for the application code. Putting the stack in fast memory seems tempting, but in practice the stack might not see very many accesses. The application data area might comprise three zones:
- static and global data, such as the text of interpreted code
- an application stack used for tracking nested structures
- the rest, much of which might be unused anyway

Other than the application, there's also the operating system: the data for a graphics subsystem might usefully be fast, as might the frame buffer (if any) and perhaps the buffers for file I/O.

Has anyone any thoughts or experience on using, say, 4k to 32k of fast RAM and the rest being two-cycle?


PostPosted: Tue Oct 27, 2020 10:49 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
I had thought about that when I realized that routing delays to the outer block RAMs were limiting the performance of my 65F02 design. But the delays are not so significant that they would double the access times. For example, if the full 64k RAM available on a Spartan-6 can be accessed at 100 MHz, a zero-page RAM close to the CPU core still could not operate at 200 MHz.

I guess one could think about fractional clock ratios, say 100 and 150 MHz derived from a common master clock, but I did not want to open that can of worms...


PostPosted: Tue Oct 27, 2020 11:04 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Very good point though: with stretched-clock designs, as opposed to using RDY with a fixed clock, there's room for finer granularity adjustment.

I was thinking something more like a 100MHz CPU+64k being one alternative, and a 120MHz CPU + fast 8k + slow 56k being another. Although it feels possible that a 60MHz bulk of memory would erode the advantage of the 120MHz CPU.
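
To put a rough number on that worry, here is a minimal sketch of the break-even point. It assumes one memory access per CPU cycle and ignores everything else about the design; the function name is made up for illustration.

```python
from fractions import Fraction


def break_even_fast_fraction(uniform_mhz, fast_mhz, slow_mhz):
    """Fraction of accesses that must hit fast memory for the split
    system to match the uniform one (one access per cycle assumed)."""
    t_uniform = Fraction(1, uniform_mhz)  # time per access, in microseconds
    t_fast = Fraction(1, fast_mhz)
    t_slow = Fraction(1, slow_mhz)
    # Solve f*t_fast + (1 - f)*t_slow == t_uniform for f.
    return (t_slow - t_uniform) / (t_slow - t_fast)


# 100MHz CPU + 64k versus 120MHz CPU + fast 8k + 60MHz bulk memory
print(break_even_fast_fraction(100, 120, 60))  # 4/5
```

Under those assumptions, four out of five accesses would need to land in the fast 8k just to match the uniform 100MHz system.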


PostPosted: Tue Oct 27, 2020 11:29 am
Joined: Wed Jul 01, 2020 6:15 pm
Posts: 79
Location: Germany
The performance would become quite dependent on the code being executed, I assume. I would guesstimate that for most programs, well above half of the bus cycles are for access to the program code itself. So in order for the benefit of the fast access cycles to outweigh the penalty of the slower regular memory, one would have to accelerate the program ROM, or at least the core loops. (Which means knowing where they are.)


PostPosted: Tue Oct 27, 2020 12:50 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Agreed, ideally application code and zero page. If, for example, the bottom 8k were faster, it would accelerate small programs anyway, and larger programs could place performance-critical loops there.

It shouldn't be too hard to instrument an emulator to see the effect of various choices.
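
As a rough sketch of what such instrumentation might look like, assume the emulator can dump a list of accessed addresses. The function name, the one-cycle penalty, and the tiny trace below are all hypothetical, not taken from any real emulator.

```python
def bus_cycles(trace, fast_ranges, slow_penalty=1):
    """Count bus cycles for a list of accessed addresses.

    fast_ranges: list of (start, end) pairs, end exclusive, served in one cycle.
    Any other address costs 1 + slow_penalty cycles.
    """
    total = 0
    for addr in trace:
        is_fast = any(start <= addr < end for start, end in fast_ranges)
        total += 1 if is_fast else 1 + slow_penalty
    return total


# Hypothetical trace: LDA $80 executed at $0200, then LDA $4000 at $0202.
trace = [0x0200, 0x0201, 0x0080,          # opcode, operand, zero-page read
         0x0202, 0x0203, 0x0204, 0x4000]  # opcode, addr lo, addr hi, data read

bottom_8k = [(0x0000, 0x2000)]
zero_page_only = [(0x0000, 0x0100)]
print(bus_cycles(trace, bottom_8k))       # 8
print(bus_cycles(trace, zero_page_only))  # 13
```

Feeding the same trace through several candidate maps gives a quick comparison of the choices, although a real trace would of course need to be much longer than this.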


PostPosted: Tue Oct 27, 2020 1:41 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
200 MHz might just be achievable with manual placement. In that case, adding an extra wait state for "remote" memory could be a valuable trade-off, as long as the extra logic for that wait state doesn't interfere with fast memory access.

Keep in mind that an extra wait state doesn't mean you're effectively running at 100 MHz. Instead, it might turn a 3 cycle instruction into 4 cycles, assuming your code/zeropage can run from fast memory.
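
A small illustrative calculation of that point, assuming one wait state per remote access and a generic three-cycle instruction whose last bus cycle touches remote memory (the helper name is invented for this sketch):

```python
def instruction_time(remote_flags, clock_mhz, wait_states=1):
    """remote_flags: one bool per bus cycle, True if it targets remote memory.

    Returns (cycles, nanoseconds).
    """
    cycles = sum(1 + (wait_states if remote else 0) for remote in remote_flags)
    return cycles, cycles * 1000 / clock_mhz


# Two fetches from local memory, then one access to remote memory.
three_cycle_op = [False, False, True]

print(instruction_time(three_cycle_op, 200))                 # (4, 20.0)
print(instruction_time(three_cycle_op, 100, wait_states=0))  # (3, 30.0)
```

So under these assumptions the instruction still finishes in 20 ns at 200 MHz, against 30 ns on a uniform 100 MHz system.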

Edit: the wait state logic could run in parallel to the rest of the CPU, and quickly pull down the CE to all the flops before the next clock edge.


PostPosted: Tue Oct 27, 2020 3:10 pm
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
I think the most critical areas of RAM to make fast are:

1: Program code. Many 6502 instructions' speed is limited solely by opcode and operand fetches. It may be useful to make the region of accelerated program memory configurable, that is, to implement it as a write-through or write-invalidating cache without dynamic allocation; this would enable running code quickly even from a slow ROM. A write-through cache would remain fast even for self-modifying code.

2: Zero page. This is where data that needs to be accessed as fast as possible is routinely placed by programmers, and it is also a dependent step for indirect addressing modes.

3: Stack memory comes a distant third place. Stack push and pop instructions are relatively slow on a standard 6502 (taking double the number of cycles that would be expected solely from the number of accesses), so they would not be expected to be particularly fast by programmers. Stack performance becomes more relevant when considering interrupt latency and overhead, as both taking the interrupt and returning from it require a tight sequence of three stacked bytes. To a lesser extent, it is also important for subroutines.

4: The interrupt vector table is important for interrupt latency and overhead. However, it is not uncommon to intercept vector fetches and substitute a vector that depends on the interrupt source. WDC's own 6502-family microcontrollers do this internally.
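
To illustrate the configurable program-memory window from point 1, here is a minimal behavioural sketch of a write-through cache without dynamic allocation. The class name, the preload at construction, and the cycle costs (1 for a window hit, 2 for anything going to slow memory) are assumptions for illustration only.

```python
class FastWindow:
    """Fixed (non-allocating) fast copy of one address window over slow memory."""

    def __init__(self, backing, base, size):
        self.backing = backing  # slow memory, e.g. a 64k bytearray
        self.base, self.size = base, size
        # Preload the window once, e.g. shortly after reset.
        self.shadow = bytearray(backing[base:base + size])

    def _in_window(self, addr):
        return self.base <= addr < self.base + self.size

    def read(self, addr):
        """Return (value, cycles)."""
        if self._in_window(addr):
            return self.shadow[addr - self.base], 1
        return self.backing[addr], 2

    def write(self, addr, value):
        """Write-through: always update slow memory, and the shadow if covered."""
        self.backing[addr] = value
        if self._in_window(addr):
            self.shadow[addr - self.base] = value
        return 2  # assumed cost of a write to slow memory


mem = bytearray(0x10000)
mem[0xC000] = 0xEA                       # a NOP in the slow "ROM" area
win = FastWindow(mem, 0xC000, 0x1000)
print(win.read(0xC000))                  # (234, 1)
win.write(0xC000, 0x60)                  # self-modifying code patches the byte
print(win.read(0xC000))                  # (96, 1)
print(win.read(0x4000))                  # (0, 2)
```

Because the write updates the shadow copy as well, the patched byte is still fetched at full speed. A write-invalidating variant would instead have to mark the byte stale and fall back to slow memory.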


PostPosted: Tue Oct 27, 2020 3:24 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
If we have access to dual ported RAM, the writes can be done without a wait state.


PostPosted: Tue Oct 27, 2020 3:36 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
I'm not sure how much it costs or how much it gains, but I now recall the SuperCPU had a write buffer: so the first write is free.


PostPosted: Tue Oct 27, 2020 3:47 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I don't think a write buffer is needed. The RAM still runs at the same clock, so when the CPU wants to write, the bus is always available.


PostPosted: Tue Oct 27, 2020 3:49 pm
Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
You could do a speculative read from the next location, though, and then do an address compare near the CPU input.

You could even hold it there, as long as the CPU doesn't read another location from that same memory. So, if you're reading a bunch of memory sequentially, every read would be valid, as long as the code itself is running from local memory.
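
A behavioural sketch of that idea, under some loud assumptions: only reads of the remote memory are modelled (writes would have to invalidate the held byte), a miss costs one wait state, and the class and field names are invented here.

```python
class SpeculativeReader:
    """Models a remote RAM that always pre-reads the address after the last one."""

    def __init__(self, memory, wait_states=1):
        self.memory = memory    # 64k of remote memory
        self.wait_states = wait_states
        self.held_addr = None   # address whose data is already waiting at the CPU
        self.held_data = None

    def read(self, addr):
        """Return (value, cycles) for a read of remote memory."""
        if addr == self.held_addr:  # the address compare near the CPU input
            value, cycles = self.held_data, 1
        else:
            value, cycles = self.memory[addr], 1 + self.wait_states
        # Speculatively read the next location and hold it until the next read.
        self.held_addr = (addr + 1) & 0xFFFF
        self.held_data = self.memory[self.held_addr]
        return value, cycles


mem = bytearray(range(256)) * 256  # 64k test pattern
reader = SpeculativeReader(mem)
cycles = [reader.read(a)[1] for a in (0x4000, 0x4001, 0x4002, 0x4003, 0x5000)]
print(cycles)  # [2, 1, 1, 1, 2]
```

In this model only the first read of a sequential run pays the wait state, which matches the case where the code runs from local memory and walks through remote data in order.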


PostPosted: Tue Oct 27, 2020 5:30 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
It almost starts to look like a cache line!


PostPosted: Tue Oct 27, 2020 6:35 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
It almost starts to look like a cache line!

It all boils down to 'cache'. So there seems little point in trying to do it piecemeal. Just make zero page and stack fast (every FPGA can accommodate a fast 512 bytes). Cache everything else. Then the cache line fill policy is all you should worry about. Best case, no extra cycles (a hit). Worst case depends only on cache line fill speed (and therefore also its size and however much slower external memory is). And maybe have a separate cache for code and data.
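
For a feel of the hit and miss costs, here is a minimal read-only model of that arrangement. The direct-mapped organisation, the 64 lines of 16 bytes, and the two cycles per byte of line fill are all assumptions picked for illustration.

```python
class DirectMappedCache:
    """Read-only timing model: fast zero page and stack, cache for the rest."""

    def __init__(self, num_lines=64, line_size=16, fill_cycles_per_byte=2):
        self.num_lines = num_lines
        self.line_size = line_size
        self.miss_penalty = line_size * fill_cycles_per_byte
        self.tags = [None] * num_lines
        self.hits = self.misses = 0

    def access(self, addr):
        """Return the cycles taken by one read of addr."""
        if addr < 0x0200:  # zero page and stack: always-fast 512 bytes
            return 1
        line_addr = addr // self.line_size
        index = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return 1  # best case: no extra cycles
        self.tags[index] = tag  # fill the whole line from slow memory
        self.misses += 1
        return 1 + self.miss_penalty


cache = DirectMappedCache()
cycles = [cache.access(a) for a in (0x0080, 0x8000, 0x8001, 0x800F, 0x8010)]
print(cycles)                     # [1, 33, 1, 1, 33]
print(cache.hits, cache.misses)   # 2 2
```

The worst case here is set entirely by `line_size * fill_cycles_per_byte`, so a longer line or a slower external memory raises the miss cost directly.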


PostPosted: Tue Oct 27, 2020 6:40 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
One difference between non-uniform memory speed and cache is 'determinacy' - not in a strict sense, of course, but in a sense of being easy to know what to expect.

As it happens, Beeb816 offers a sort of experimental platform: it can run from its own fast RAM, at the configured speed according to the crystal and the divider, or from the Beeb's 2MHz RAM. Or it can do a sort of write-through cache whereby reads are full speed but writes slow down. However, as the fast and slow clocks are not related, there's a cycle or two lost every time it has to switch.

Having said that, any LX9 FPGA board is an experimental platform!


PostPosted: Tue Oct 27, 2020 6:53 pm
Joined: Sun Nov 27, 2011 12:03 pm
Posts: 229
Location: Amsterdam, Netherlands
BigEd wrote:
As it happens, Beeb816 offers a sort of experimental platform: it can run from its own fast RAM, at the configured speed according to the crystal and the divider, or from the Beeb's 2MHz RAM. Or it can do a sort of write-through cache whereby reads are full speed but writes slow down. However, as the fast and slow clocks are not related, there's a cycle or two lost every time it has to switch.

But that system essentially has two copies of main memory, not one with slow and fast parts. It is only burdened by having to do write-throughs to a framebuffer and perhaps other memory-mapped hardware I/O.

