PostPosted: Mon Oct 10, 2016 10:01 am 

Joined: Fri Oct 07, 2016 9:44 pm
Posts: 31
It all started from here, so I think it's going to be a hot topic :mrgreen:.

I myself agree with the idea of integrating the caches with the CPU, because if we also integrate an MMU inside the chip we can have memory protection, virtual addressing, process switching, and so on.

Let's talk about it here ...


PostPosted: Mon Oct 10, 2016 11:23 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
(Long reply. Executive summary: yes, use a cache whenever your memory bandwidth is not enough to keep your CPU running at full speed.)

(BTW, for me, before you add an MMU to a design with caches, you first need something in between the caches and the memory proper - whether you call it a memory controller, a memory interface, or whatever. In other words, an MMU, which as you say may serve several possible purposes, is an optional extra, introducing even more questions as to what can be done and what is worth doing.)

I was just about to post about this kind of topic. I guess I can paste it here instead without disturbing your intention too much:

Caches, pipelines, microarchitecture - what's all the fuss?

Elsewhere in a discussion about a pipelined 6502 implementation with caches I was a bit grumpy, so as penance here's an attempt to explain why we sometimes find ourselves looking at CPU designs which are not straightforward clock-for-clock compatible with the original 6502. Others here also understand this territory and can surely help when I miss the mark.

The original 6502 is a fairly simple machine, with a slight degree of pipelining, intended to be a low-cost MPU connected to bytewide memory and I/O devices. It is in some ways a simplification and improvement on the 6800.

Why do we sometimes discuss and implement more complicated approaches? Why is the original not perfect for all purposes?

Performance.

Or, more usefully, increased performance at moderate cost. Why bring in cost? Because it's always a real-world consideration, and without cost restriction, we could just build a much faster version of the original - and for some purposes that's just not practical. (By faster, we mean something with a faster clock, and of course also with faster memory.)

The speed of light is finite, and not especially high in this context, and that turns out to limit the speed of large memory systems connected to a CPU. So, you can either build a really fast memory system of low capacity, or build a much larger one and find the access time is somewhat slower.

Or, it turns out you can use a combination of a small fast memory and a large slow memory and get better performance at a given cost. This is what caching is all about. What do we mean by a small memory? That depends on the technology we're using to implement the system. A small cheap FPGA might have much less than 64kB of memory onboard. If we're building a KIM-1 then that's enough memory. If we're building a 6502 system with 16MByte of (paged) memory then it isn't enough.

So, if we're talking about caches, we just need to make a quick check that, in the context of the problem we're trying to solve and the technology we intend to use, the full complement of RAM won't run at the speed we intend to run the CPU at. Conversely, if it will, then caches are not needed.

What's the magic which allows a small fast memory near the CPU to increase the performance of the system, when there's still a large slow memory out there which needs also to be accessed? It's locality. It's the idea that if I grab a byte here, now, then I will probably be accessing nearby bytes pretty soon. And conversely, if I haven't touched this little area of memory for a while, I probably won't touch it soon.
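
To put rough numbers on that, here's a back-of-the-envelope sketch in Python; the hit rates and cycle counts are invented for illustration, not measured from any real system:

Code:
# Average memory access time (AMAT) = hit time + miss rate * miss penalty.
hit_time = 1       # cycles to read the small fast memory (assumed)
miss_penalty = 10  # extra cycles to go out to the large slow memory (assumed)
for hit_rate in (0.80, 0.90, 0.95, 0.99):
    amat = hit_time + (1.0 - hit_rate) * miss_penalty
    print(f"hit rate {hit_rate:.0%}: {amat:.2f} cycles per access on average")

With good locality the hit rate is high, and the average access time stays close to that of the small fast memory even though most of the data lives in the slow one.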

So, most often, a cache is organised as lines, of some number of bytes like 8 or 16, and the entire line will be present or absent. When the necessary data is not present, it is fetched as a line, and when it must be got rid of to make space, it is written as a line.
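
For concreteness, here's how a simple direct-mapped cache might carve an address into tag, index and offset - a Python sketch with made-up geometry (16-byte lines, 128 lines, i.e. a 2kB cache):

Code:
LINE_BYTES = 16   # bytes per line (assumed)
NUM_LINES = 128   # 128 lines x 16 bytes = 2kB of cache (assumed)

def split_address(addr):
    offset = addr % LINE_BYTES                 # byte within the line
    index = (addr // LINE_BYTES) % NUM_LINES   # which line slot to use
    tag = addr // (LINE_BYTES * NUM_LINES)     # identifies whose data is there
    return tag, index, offset

print(split_address(0x1234))   # -> (2, 35, 4)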

This leads to a second advantage: modern large memories are not really random access. Sending a whole address and getting back just one byte isn't good use of the bus or of the internal organisation of the RAM. So instead we can send half an address and get back a bunch of bytes in quick succession. This fits well with filling and spilling cache lines.
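
A toy calculation of the difference (cycle counts invented):

Code:
# Filling a 16-byte line: full address per byte vs. one address then a burst.
SETUP = 5       # cycles to present an address / open a row (assumed)
PER_BYTE = 1    # cycles per byte once data is streaming (assumed)
LINE = 16
byte_at_a_time = LINE * (SETUP + PER_BYTE)   # 96 cycles
as_one_burst = SETUP + LINE * PER_BYTE       # 21 cycles
print(byte_at_a_time, "cycles vs", as_one_burst, "cycles per line fill")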

There's a third advantage: multi-port memory is more expensive than single port, and wide busses are more expensive than narrow ones. A small fast memory near the CPU can afford to have wide access, and to have multiple ports. Now we can do things like fetch multiple bytes in one cycle into our decoder, and have both opcode and operands available in the same cycle. We can fetch both bytes of a two-byte address in direct page in one cycle, and we can push or pull two bytes at a time from the stack. Or, when we talk about pipelined architectures, we might be able to write from the write stage in the same cycle as we read from the address generation stage.
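
As a toy model of the wide-fetch idea - one access returns several bytes, so opcode and operands arrive together (the length table here is a tiny made-up excerpt, not the real 6502 opcode map):

Code:
# Fetch four bytes in one wide access; decode opcode and operands together.
LENGTHS = {0xA9: 2, 0x8D: 3, 0xEA: 1}   # LDA #imm, STA abs, NOP (excerpt)
memory = bytes([0xA9, 0x42, 0xEA, 0xEA])

def fetch_wide(mem, pc):
    word = mem[pc:pc + 4]   # one wide access instead of 2-3 narrow ones
    n = LENGTHS[word[0]]    # opcode byte tells us the instruction length
    return word[:n]         # opcode + operands available at once

print(fetch_wide(memory, 0).hex())   # 'a942' -> LDA #$42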

Let's talk about pipelining. Why does it make sense to break an instruction down into three, four, five, six or more small steps? We have to execute them all anyway...

Clock speed depends on the slowest part of the design. It depends on how much logic sits between two flops - and to some extent on distance too. If you have a really fast ALU, but it takes a while to increment the PC, you can only run at the rate limited by the PC.

Pipelining breaks down the actions that need to be taken into smaller steps which are simpler, and therefore faster. The pipelined design can be clocked faster than the unpipelined design. If it can commence one instruction per clock, every clock, then it goes faster even if some instructions take six clocks to finish.

But when the program is not purely sequential - when it hits a branch or a jump - the next address to fetch from may not be known until the previous instruction has finished, which is still several clocks away. So pipelined processors always have some penalty for executing branches, and that's one reason the pipelines don't get arbitrarily long. It's also the reason why a pipelined design can generally be sped up further by spending more logic and complexity on predicting branches, and correctly abandoning mispredicted ones. But that's an optional refinement - it's possible and sensible in some situations to have a pipeline and no branch prediction.
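
The effect can be sketched numerically - a Python toy in which the branch frequency, refill cost, and prediction accuracy are all invented for illustration:

Code:
# Cycles per instruction for a pipeline that issues 1 instruction per clock,
# paying a refill penalty whenever a branch is resolved late.
branch_freq = 0.20   # fraction of instructions that are branches (assumed)
refill = 5           # clocks lost per unpredicted/mispredicted branch (assumed)
for accuracy in (0.0, 0.5, 0.9):   # 0.0 = no branch prediction at all
    cpi = 1 + branch_freq * (1 - accuracy) * refill
    print(f"prediction accuracy {accuracy:.0%}: CPI = {cpi:.2f}")

Even with no prediction the pipelined design can come out ahead, provided its clock is enough faster than the unpipelined one.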


PostPosted: Mon Oct 10, 2016 2:32 pm 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Another potential advantage of caching is in multiprocessor systems, where each processor could have a local cache, all connected to a single shared memory.

A variant is a system with multiple bus masters: a single CPU, but one or more peripherals using DMA to access memory directly. While the DMA is accessing the external memory, the CPU can benefit from a local cache to keep running.


PostPosted: Mon Oct 10, 2016 8:12 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8520
Location: Southern California
Very well written, Ed. One question: are the cache lines really as small as 8 bytes? Considering that A., even L1 cache latency is a minimum of four cycles; B., 8 bytes may not even hold three instructions; and C., of the frequent branches in typical 6502 code the vast majority will reach farther than that (even omitting loop branches, since the whole loop will remain loaded), and simple branches can reach forward or back half a page - it seems that such a short cache line would be of no benefit, adding only complexity. I would think that a cache line should be no less than 128 or 256 bytes' worth. At the cost of more complexity, branch prediction may help. For intense branching in some kinds of structures, tables and the use of the complex indexed and indirect addressing modes aid efficiency; but those are apparently a hindrance to what manili wants to do.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Mon Oct 10, 2016 8:34 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
Thanks! I made up the numbers. The web says 64 bytes is common. One would typically simulate to figure out the best cache organisation. The base cost is the area, or total size, so if you have longer lines then you have fewer lines. You'll be more likely to hit within a long line but when you replace it you're more likely to evict something useful. But you're quite right that the characteristics of the memory system are part of the equation - there's probably more overhead, proportionally, to refilling a short line.
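
The proportional-overhead point can be made with a toy calculation (cycle counts invented):

Code:
# Fixed per-refill overhead is amortised better by longer lines.
SETUP = 6             # addressing/turnaround cycles per refill (assumed)
BYTES_PER_CYCLE = 2   # burst transfer rate (assumed)
for line in (8, 16, 32, 64, 128):
    refill = SETUP + line // BYTES_PER_CYCLE
    print(f"{line:3d}-byte line: {refill:3d} cycles per refill, "
          f"{SETUP / refill:.0%} of that is overhead")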

(Another question is associativity: caches come in different organisations which relate to which cache lines are available for use to store which addresses. A fully associative cache has best performance but greatest cost, and so on.)

Edit: in a 6502-on-FPGA context, I'd expect L1 to be one or two cycle access. Just because Intel's huge caches and ultra fast CPUs need lots of cycles doesn't mean we'd be in the same boat. This is something which will vary according to the implementation technology.


PostPosted: Mon Oct 10, 2016 11:54 pm 

Joined: Thu Jan 21, 2016 7:33 pm
Posts: 276
Location: Placerville, CA
BigEd wrote:
in a 6502-on-FPGA context, I'd expect L1 to be one or two cycle access. Just because Intel's huge caches and ultra fast CPUs need lots of cycles doesn't mean we'd be in the same boat. This is something which will vary according to the implementation technology.

In fact, on the 6502 there'd be precious little point to caches that aren't single-cycle access, since it's so dependent on memory access.


PostPosted: Tue Oct 11, 2016 3:49 am 

Joined: Sun Dec 29, 2002 8:56 pm
Posts: 452
Location: Canada
Is it suitable? Depends on the application. Depends on how complex a system is needed.

I've used a combination of caches and single-cycle-access RAM/ROM (for a non-6502 system). Zero page is used so much by the '02 that I'd be inclined just to make it single-cycle-accessible memory without it being cached; that might prevent some cache thrashing. Also, some of the boot vector code could be in a non-cached, single-cycle-access ROM. If you'd like to retain high-speed interrupt servicing, it might help to have the interrupt code non-cached and always available. Alternatively, a means of locking lines in a multi-way associative cache could be used.
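
A sketch of that sort of address decode in Python pseudocode; the region boundaries are my assumptions, chosen to match the suggestion above:

Code:
# Route some regions to uncached single-cycle memory, the rest via the cache.
def route(addr):
    if addr <= 0x01FF:              # zero page and stack: fast uncached RAM
        return "fast RAM (uncached)"
    if 0xFF00 <= addr <= 0xFFFF:    # vectors and boot/interrupt code
        return "fast ROM (uncached)"
    return "cache, backed by slow main memory"

for a in (0x0042, 0x0150, 0x8000, 0xFFFC):
    print(f"${a:04X}: {route(a)}")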

In some of the FPGA systems a 16-bit-wide DDR RAM is in use. These are typically set up for 128-bit accesses (a burst of 8x16 bits). A reasonable size for a cache line is then a multiple of 16 bytes: an entire 16-byte cache line can be loaded with just a single bus cycle. I think the cache line should store at least four instructions.

I've found it helpful to have some non-cached memory in order to get a system up and running.

_________________
http://www.finitron.ca


PostPosted: Tue Oct 11, 2016 6:09 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
Good point about wider memory - not only is wider memory a good way to boost memory bandwidth, but it's often the way that memory is supplied. A cache can mediate between two different widths.

I'm not at all sure that a cache line needs to be very long, in terms of instruction count. If there's a longer code sequence or a larger loop, it will just span several lines, and all of them may be in cache. Indeed, some fraction of a line will be wasted at the beginning and at the end. I'm quite sure that the 'best' length is a function of other parameters of the system, and should be determined by experiment or analysis. With luck, it's a pretty flat function - after all, we expect to quantise to a power of two.

I guessed at 8 or 16 bytes - it's certainly seeming that 8 would be a bit on the short side. But then, if the FPGA offers memories which can be configured as 64 bits wide, it might be natural. But a pair of such memories, if available, would be 128 bits wide.

I see Arlet mentions 8 or 16 bytes at viewtopic.php?f=10&t=2624 although most likely that's an off-the-top-of-the-head remark.


PostPosted: Tue Oct 11, 2016 6:26 am 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
I see Arlet mentions 8 or 16 bytes at viewtopic.php?f=10&t=2624 although most likely that's an off-the-top-of-the-head remark.

Indeed. I was thinking about smallish FPGAs, with only a few 2kB block RAMs for cache, and a simple SDRAM for external memory. The overhead of setting up a burst on the SDRAM is modest, so an 8-16 cycle burst offers a good improvement without blocking the bus excessively. Longer cache lines are more appropriate for wide memories, and/or for big caches.
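
Roughly, with an invented setup overhead:

Code:
# Bus efficiency of an SDRAM burst: fixed setup amortised over the transfer.
SETUP = 5   # ACTIVATE + CAS latency etc., cycles per burst (assumed)
for burst in (1, 4, 8, 16):
    total = SETUP + burst   # one word per cycle once the burst is running
    print(f"burst of {burst:2d}: {burst}/{total} = "
          f"{burst / total:.0%} useful bus cycles")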


PostPosted: Thu Oct 13, 2016 2:25 pm 

Joined: Sat Mar 27, 2010 7:50 pm
Posts: 149
Location: Chexbres, VD, Switzerland
Quote:
There's a third advantage: multi-port memory is more expensive than single port, and wide busses are more expensive than narrow ones. A small fast memory near the CPU can afford to have wide access, and to have multiple ports. Now we can do things like fetch multiple bytes in one cycle into our decoder, and have both opcode and operands available in the same cycle. We can fetch both bytes of a two-byte address in direct page in one cycle, and we can push or pull two bytes at a time from the stack. Or, when we talk about pipelined architectures, we might be able to write from the write stage in the same cycle as we read from the address generation stage.

In the case of an enhanced 6502, this is probably the only advantage we're looking for. The 6502 spends most cycles fetching instructions from memory, time that can be optimized away when it fetches the same instructions again and again, by doing those fetches internally and keeping the external interface mostly for data reads/writes.


PostPosted: Thu Oct 13, 2016 2:48 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
I've just noticed that there might be some confusion here:
GARTHWILSON wrote:
Are the cache lines really as small as 8 bytes? Considering that A., even L1 cache latency is a minimum of four cycles; B., 8 bytes may not even hold three instructions; and C., of the frequent branches in typical 6502 code the vast majority will reach farther than that (even omitting loop branches, since the whole loop will remain loaded), and simple branches can reach forward or back half a page - it seems that such a short cache line would be of no benefit, adding only complexity. I would think that a cache line should be no less than 128 or 256 bytes' worth.

It looks like you're thinking that L1 is 4 cycles away, and the unit of transfer to the CPU is a cache line. That might be true in the x86 world, and you're right that with greater L1 access times a larger transfer to the CPU seems like a good idea. But I'm thinking of FPGA implementations with the L1 being one cycle away, such that a core can run at full speed from L1. Even a naive core with byte-wide access to an on-chip L1 is going to get a benefit, if the main memory is slower. A sophisticated core which gains more useful bandwidth will do even better.

In effect, we're saying that when L1 is too far from the CPU, we need a buffer at the CPU. And we need to size that buffer well, which usually means simulation.

A bit more background:

In general, a cache has many lines. If we have say a 2kbyte cache with 16 byte lines, it will have 128 lines. A routine of 100 bytes or so would span seven or eight lines, and there's room for several such routines before something has to be evicted to make room for something else. There's no need for a loop to sit entirely within a line, and there's no need for a branch target to be in the same line as the branch itself.
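
Spelling out the arithmetic (same made-up 2kbyte / 16-byte geometry):

Code:
import math

CACHE_BYTES, LINE_BYTES = 2048, 16
lines = CACHE_BYTES // LINE_BYTES          # 128 lines in total
routine = 100                              # a ~100-byte routine
spans = math.ceil(routine / LINE_BYTES)    # 7 lines if aligned, 8 if not
print(f"{lines} lines; the routine occupies {spans} or {spans + 1} of them")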

At the time of needing to fetch a line into the cache, there will be a short delay. This argues for lines not being too long. But the general idea is that once a line is resident, it will be used several times before it must be evicted. So the delay in loading is amortised over several accesses.


PostPosted: Thu Oct 13, 2016 5:58 pm 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
BigEd wrote:
In general, a cache has many lines. If we have say a 2kbyte cache with 16 byte lines, it will have 128 lines. A routine of 100 bytes or so would span seven or eight lines, and there's room for several such routines before something has to be evicted to make room for something else. There's no need for a loop to sit entirely within a line, and there's no need for a branch target to be in the same line as the branch itself.


Keep in mind there are different types of cache organisations. Most commonly, they are organised as "set associative", which means that the cache is divided into smaller regions, and some of the address lines are used to pick a fixed one of the regions. This simplifies the matching hardware, at the cost of losing some mapping flexibility.

Here's an explanation: http://www.cs.umd.edu/class/sum2003/cms ... y/set.html
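
A minimal sketch of the lookup in Python; the geometry is invented (4-way set associative, 32 sets, 16-byte lines) and replacement is simply "evict the oldest":

Code:
# Set-associative lookup: the index picks one set, and only the few
# lines within that set are checked for a matching tag.
WAYS, SETS, LINE = 4, 32, 16
cache = [dict() for _ in range(SETS)]   # per set: tag -> line data

def access(addr):
    index = (addr // LINE) % SETS       # address bits pick the set
    tag = addr // (LINE * SETS)         # the rest must match a stored tag
    if tag in cache[index]:
        return "hit"
    if len(cache[index]) >= WAYS:       # set full: evict one of the ways
        cache[index].pop(next(iter(cache[index])))
    cache[index][tag] = "line data"     # fill the line on a miss
    return "miss"

print(access(0x1230), access(0x1234))   # miss, then hit (same 16-byte line)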

Quote:
Considering that A., even an L1 cache latency is a minimum of four cycles,

Like Ed said, that's for a specific CPU (Intel i7 Haswell), and they probably have the access pipelined such that it can still run at full speed for most cases.


PostPosted: Fri Oct 14, 2016 1:20 am 

Joined: Sun Dec 29, 2002 8:56 pm
Posts: 452
Location: Canada
Here is a sample of something I'm working on with a 4-way set associative cache with random replacement. Non-6502, but instructions can span cache lines so it might be adaptable.

http://github.com/robfinch/Cores/blob/master/DSD/trunk/rtl/DSD7.v about line 296.

What would have to be done in order to implement an I-cache for the 6502? It might be useful when running the CPU at 20+ MHz with a slow ROM / RAM.

Since the '02 doesn't have an abort line or cache control lines, some trickery would be involved. The SYNC signal would have to be monitored, then the number of opcode bytes determined (counted) from the first opcode byte of the instruction, in order to know whether instruction or data is being accessed.
On a cache miss, NOP instructions would be fed to the 6502 so it can keep running, followed later by a jump instruction back to the missed code. Perhaps something similar to the tv-typewriter could be done, using the 6502 itself to load the cache lines.

_________________
http://www.finitron.ca


PostPosted: Fri Oct 14, 2016 6:21 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
Isn't RDY enough to hold the machine while it waits for something?


PostPosted: Fri Oct 14, 2016 9:32 am 

Joined: Tue Jul 24, 2012 2:27 am
Posts: 674
RDY doesn't prevent writes, so it wouldn't be enough for a dirty cache line eviction write pause. Of course, the caching system can have its own buffering to deal with this, if the 6502 is bone-stock. But we're generally talking about a FPGA reimplementation, so I would expect the caching system's tendrils to delve deeper into the CPU core itself.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor

