Cache in FPGA

Rob Finch · Post by **Rob Finch** » Tue Aug 20, 2013 3:01 pm

Quote:

You don't need three read ports to read an instruction. Make your read port 16 bits wide, and use two memory blocks: one for even-numbered words, and one for odd-numbered words. Feed the address (without the low bit) into both. If bit 1 is set, then the even-numbered block gets 1 added to its address. Then you just need some multiplexing on the output, driven by the bottom two bits of the address. That will give 24 bits (actually 32 bits if you wanted it) at any alignment, in a single read, without wasting precious block RAM.

That would work. I see, only dual porting is really needed. If the cache width were 8 bytes and instructions could be 1 to 5 bytes long, that would allow 32-bit operands for a 32-bit processor: 5 bytes = opcode + 4-byte operand.
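A minimal software sketch of the even/odd block idea quoted above, covering only the 24-bit case. It assumes little-endian 16-bit words; the names `split_banks` and `fetch24` are made up for illustration and model the addressing and output mux, not any particular block RAM primitive.

```python
def split_banks(data: bytes):
    """Split a byte image into even and odd banks of 16-bit little-endian words."""
    if len(data) % 4:
        data += bytes(4 - len(data) % 4)  # pad so both banks have equal depth
    words = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    return words[0::2], words[1::2]


def fetch24(even, odd, addr):
    """Return 3 bytes starting at byte address addr using one read per bank."""
    row = addr >> 2                  # bank row: address without the low two bits
    bit1 = (addr >> 1) & 1           # set when the first word lives in the odd bank
    even_row = (row + bit1) % len(even)  # even bank gets +1 when bit 1 is set
    e = even[even_row]
    o = odd[row]
    # Output mux: put the two 16-bit words in address order.
    first, second = (o, e) if bit1 else (e, o)
    combined = first | (second << 16)
    if addr & 1:                     # odd byte address: drop the low byte
        combined >>= 8
    return combined & 0xFFFFFF
```

With 16 bytes of test data, `fetch24(even, odd, 3)` returns bytes 3, 4 and 5 packed little-endian, even though they straddle two words in different banks.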

Quote:

For an 8 bit 6502, there is usually enough block RAM. For the 16/32 variants people have discussed on the forum, a large external memory makes sense. And if that external memory is SDRAM, a cache would help a lot. I would prefer just a simple cache and a regular core. Multi-port caches and complicated cores that can exploit them seem like a lot of work for little practical benefit.

Some of the newer FPGA parts have built-in memory controllers. The Spartan6, for instance, uses 32, 64, or 128 bit ports. So the memory side of a cache has to match an available port size, or memory will be wasted. This means some sort of bus size translation must take place in the cache (which adds a mux in the cache). The simplest cache would return the byte indexed by the PC. Changing the bus width adds some complexity, especially to a data cache.

A cache needs a mux and control logic for non-cached addresses. That means there must be one or more control registers somewhere. Zero page and the stack page could be assumed to be non-cached, as could the BIOS ROM (if in block RAM) and the I/O range.
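As a rough sketch of that control logic, the decode could be a table of non-cached address ranges checked on every access. Zero page and the stack page come from the post above; the I/O and ROM windows below are hypothetical placeholders, as are the names `Range`, `UNCACHED` and `is_cacheable`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    start: int
    end: int  # inclusive


# In hardware this table would live in one or more control registers.
UNCACHED = (
    Range(0x0000, 0x00FF),  # zero page
    Range(0x0100, 0x01FF),  # stack page
    Range(0xD000, 0xDFFF),  # I/O window (hypothetical address range)
    Range(0xE000, 0xFFFF),  # BIOS ROM in block RAM (hypothetical address range)
)


def is_cacheable(addr, uncached=UNCACHED):
    """True when the access should go through the cache rather than around it."""
    return not any(r.start <= addr <= r.end for r in uncached)
```

The result of `is_cacheable` would drive the bypass mux: non-cached accesses go straight to the block RAM or I/O device and never allocate a line.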

TMorita · Post by **TMorita** » Fri Aug 23, 2013 6:35 am

MichaelM wrote:

Toshi:

I am sure that after all of the effort that compiler writers and software developers have devoted to extracting the most performance from an instruction set, they could have at least devoted some effort to static analysis of the link image for cache thrashing issues. In fact, I would have thought that the garbage collector for the Java virtual machine would have been designed to spot issues such as cache thrashing and automatically adjust the relative positions of the offending code segments to avoid the issue while it had essentially brought the machine to a standstill. This is such a simple concept that it should be relatively easy to implement since the cache design and strategy of the target is a known quantity. Hennessy and Patterson claimed that these types of problems should be left to the software tools and not be implemented in the computer's hardware, i.e. the cache subsystem or the execution pipeline. After driving processor design to virtually unimaginable levels of complexity, it's time for a bit of innovation on the software development side of the house by the so-called computer scientists.

Few comments:

1) Code placement is a linker issue, not a compiler issue.

2) The linker has a very limited freedom on where to place functions due to file format issues.
When a compiler compiles a C file and produces a COFF or ELF object file, the object file
contains all the code for all the functions in the file in one contiguous code section.
Therefore, the linker usually has no freedom to rearrange the order of the functions in an
object file. The linker only has the ability to determine the order of the object files compiled
into the final executable, so this is very coarse.

3) If the linker had the freedom to reorder individual functions by using a different object file format,
then the linker needs semantic information such as the call graph in order to determine a
theoretically optimal code layout. So the object file would also need to include call graph
information.

4) The call graph information is useful to the linker, but it is statically determined information, and
may not accurately represent the execution path of the executable with different inputs.

5) I know a person who did this as a project, and after optimizing the layout of the functions,
found that it ran slower than the original code. For his case, which was optimizing the layout
of functions in the Mozilla web browser on Linux, the main source of branch latency was the
paging in of the executable from disk rather than the cache latency because disk I/O is
a few orders of magnitude slower than cache line fills from memory. The project was called
GNU Rope - the paper here describes it, but does not mention the final results which I'm
recalling from memory: http://www.nat.org/grope.pdf

So based on my experience watching other people implement this optimization, my opinion is that it requires a significant amount of work to implement but the average return may be close to zero - it helps some executables but hurts some executables for an average gain of zero.

Toshi

enso · Post by **enso** » Fri Aug 23, 2013 4:49 pm

As a minimalist, I am strongly against caching. While there is no arguing that you can gain a lot of performance, you lose deterministic execution times. The complexity of the design increases drastically. And the feel of the system becomes sloppy. Not to mention that the software is dragged into the mess as well - the order of accessing data becomes crucial to performance, size of cache in a 'compatible' system may completely change how the application behaves, etc.

I understand that Intel has no other choice as the cores are running more than an order of magnitude faster than RAM. How fast can our cores run? Maybe a couple of times faster than RAM if we are lucky. Why bother?

Personally, after years of writing low-level code for flavors of x86, I am extremely happy to be back in 6502-land, where I can count cycles and know exactly how long something will run. At this point I find that just mentioning a cache makes me lose interest.

Arlet · Post by **Arlet** » Fri Aug 23, 2013 5:00 pm

The fact is that with external SDRAM, or external SRAM shared with video, you've already lost deterministic timing, even without a cache. Luckily, in most systems, deterministic execution times are only needed in a few key places. In those cases, you put the code in a BRAM.

MichaelM · Post by **MichaelM** » Fri Aug 23, 2013 7:49 pm

Toshi:

Sorry about that, but I was just stirring the pot.

There was no need to respond. As is clear from your response, the issue of optimization in a non-deterministic environment would be very difficult and may not be worth the effort required for the marginal improvement that might be achieved.

enso · Post by **enso** » Sat Aug 24, 2013 4:10 pm

Arlet wrote:

The fact is that with external SDRAM, or external SRAM shared with video, you've already lost deterministic timing, even without a cache. Luckily, in most systems, deterministic execution times are only needed in a few key places. In those cases, you put the code in a BRAM.

That is true. Unless video access is interleaved. The truth is that deterministic timing is not that important.

Video access and refresh are also regular, pretty fast and fairly deterministic when measured over seconds. A thrashing program can create variable and unpredictable delays.

When left unchecked, things like multi-level caches and virtual memory create so much slack in the system that you find yourself waiting for the computer to catch up. I don't really understand how that is possible with a 2GHz+ computer, but I often try to guess what my computer did in the last 3 billion cycles that was so important that I needed to wait.

Arlet · Post by **Arlet** » Sat Aug 24, 2013 4:19 pm

Most of the really bad delays on modern PCs are due to disk access. Disks, especially the rotating kind, are still really slow compared to main memory.

enso · Post by **enso** » Sat Aug 24, 2013 4:25 pm

A thought... If variable instruction size is really an issue, why not just use 24-bit (or 32-bit) memory and modify the compiler to pad the instructions (or create a conversion utility)? Our machines are experimental and making the RAM wider is not prohibitively expensive.

ElEctric_EyE · Post by **ElEctric_EyE** » Sun Aug 25, 2013 12:00 am

There seems to be a discrepancy in this thread regarding the discussion of a cache used inside a FPGA for data or opcodes.

Video uses a data cache with a data width that has a set amount based on the pixel color width.

Then the more difficult opcode cache, which may not be worth discussing since more knowledgeable folks have chimed in that this effort is not worth the invested time.

Rob Finch · Post by **Rob Finch** » Sun Aug 25, 2013 1:07 am

Quote:

Then the more difficult opcode cache

The opcode cache only becomes more difficult when one tries to read entire instructions at once from the cache and has to deal with cache line spanning issues. It can be reasonably simple if one sticks to reading just single bytes from the cache.

A cache should reduce the amount of memory traffic, which might be important if there are multiple devices accessing the memory. In a SoC there might be a video controller and an Ethernet controller also accessing memory. How useful a cache is depends on the system, I guess. I've used caches in a couple of SoCs now. It can add to the debugging fun when one is trying to trace instructions. It's also important to have non-cached accesses available, which adds to the trickery.

A nano-cache might be more useful to translate 8 bit accesses into 32 bit accesses for memory ports. One just needs a 32 bit buffer with an address tag.
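A minimal sketch of such a nano-cache, assuming a little-endian 32-bit memory port. The class name `NanoCache` and the `read_word` callback are hypothetical stand-ins for the real port interface.

```python
class NanoCache:
    """One 32-bit buffer plus an address tag, serving byte reads."""

    def __init__(self, read_word):
        self.read_word = read_word  # callable: word index -> 32-bit value
        self.tag = None             # word index currently held, None when empty
        self.buffer = 0
        self.port_reads = 0         # number of 32-bit accesses made to the port

    def read_byte(self, addr):
        tag = addr >> 2
        if tag != self.tag:         # tag mismatch: refill the buffer from the port
            self.buffer = self.read_word(tag)
            self.tag = tag
            self.port_reads += 1
        # Byte-select mux driven by the low two address bits.
        return (self.buffer >> (8 * (addr & 3))) & 0xFF
```

Reading eight consecutive bytes through this buffer costs two port accesses instead of eight, which is the whole benefit for sequential instruction fetch.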

The problem with a cache is one ends up modifying the processor core in order to couple it to the cache. There's an 'I don't really want to go back and change the processor, now that it's working' impetus.

Quote:

If variable instruction size is really an issue, why not just use 24-bit (or 32-bit) memory and modify the compiler to pad the instructions (or create a conversion utility)? Our machines are experimental and making the RAM wider is not prohibitively expensive.

Could be done, but may waste a significant amount of memory in a small system. It also doesn't always work. E.g. what if instructions need to be more than 32 bits and the memory port size is limited? Also, what if the system is reconfigurable for different processors? How wide does the RAM get? E.g. the 68000 reads up to 10 bytes for instructions.

Arlet · Post by **Arlet** » Sun Aug 25, 2013 7:54 am

I think the goal should be just to make a simple cache. Start with direct mapped, even though it's not optimal for performance, and keep it byte-wide on the CPU side. When it works, make it 2-way set associative. Fancy tricks such as combining opcodes with operands involve a complete redesign of the core and a great increase in complexity, which is probably not worth the effort.
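A behavioural sketch of that starting point: a read-only, direct-mapped cache that is byte-wide on the CPU side. The line count, line size and the class name `DirectMappedCache` are arbitrary assumptions for illustration, not anyone's actual design.

```python
class DirectMappedCache:
    """Read-only direct-mapped cache model, byte-wide on the CPU side."""

    def __init__(self, memory, num_lines=16, line_size=8):
        # memory length is assumed to be a multiple of line_size
        self.memory = memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines          # None marks an invalid line
        self.lines = [bytes(line_size)] * num_lines
        self.hits = 0
        self.misses = 0

    def read_byte(self, addr):
        offset = addr % self.line_size
        line_addr = addr // self.line_size
        index = line_addr % self.num_lines      # which cache line to look in
        tag = line_addr // self.num_lines       # identifies which memory line is held
        if self.tags[index] != tag:             # miss: refill the whole line
            base = line_addr * self.line_size
            self.lines[index] = bytes(self.memory[base:base + self.line_size])
            self.tags[index] = tag
            self.misses += 1
        else:
            self.hits += 1
        return self.lines[index][offset]
```

Two addresses that are `num_lines * line_size` bytes apart share an index, so alternating between them misses every time. That is the weakness the 2-way set associative step is meant to address.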

BigEd · Post by **BigEd** » Sun Aug 25, 2013 5:32 pm

Agreed!

White Flame · Post by **White Flame** » Tue Aug 27, 2013 10:32 pm

Since it's not been mentioned yet, I'll cast a vote for including a prefetch instruction/mechanism. The "illegal" NMOS NOP instructions with address operands would have been perfect for that.

BigEd · Post by **BigEd** » Tue Aug 27, 2013 10:37 pm

Slightly related to that is the trick of refilling a line out of order, to first fetch the byte that caused the fill request, and for the processor to continue as soon as that byte arrives.
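The out-of-order refill described above can be sketched as a wrapping address sequence that starts at the byte that missed. The function name `refill_order` and the 8-byte default line size are assumptions for illustration.

```python
def refill_order(miss_addr, line_size=8):
    """Addresses in the order a critical-byte-first refill would fetch them."""
    start = miss_addr % line_size    # offset of the byte that caused the miss
    base = miss_addr - start         # first address of the cache line
    # Start at the missed byte, then wrap around to the beginning of the line.
    return [base + (start + i) % line_size for i in range(line_size)]
```

For a miss at `0x1005` the first address fetched is `0x1005` itself, so the processor can resume after one transfer while the remaining seven bytes arrive.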

Arlet · Post by **Arlet** » Wed Aug 28, 2013 5:34 am

And of course there's the issue of the different write policies. Still a lot of work to do even for a simple cache.
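To make the write-policy difference concrete, here is a rough single-line model comparing write-through with write-back (with write-allocate). It is only a sketch under assumed parameters; the class name `OneLineCache` and its counters are hypothetical.

```python
class OneLineCache:
    """Single-line cache model for comparing write policies."""

    def __init__(self, memory, line_size=4, write_back=False):
        self.memory = memory             # bytearray standing in for external RAM
        self.line_size = line_size
        self.write_back = write_back
        self.tag = None
        self.line = bytearray(line_size)
        self.dirty = False
        self.memory_writes = 0           # bytes written to external memory

    def flush(self):
        """Write the line back to memory if it holds modified data."""
        if self.dirty:
            base = self.tag * self.line_size
            self.memory[base:base + self.line_size] = self.line
            self.memory_writes += self.line_size
            self.dirty = False

    def _load(self, tag):
        self.flush()                     # evict the old line before replacing it
        base = tag * self.line_size
        self.line = bytearray(self.memory[base:base + self.line_size])
        self.tag = tag

    def write_byte(self, addr, value):
        tag, offset = divmod(addr, self.line_size)
        if self.write_back:
            if tag != self.tag:
                self._load(tag)          # write-allocate on a miss
            self.line[offset] = value
            self.dirty = True            # memory is updated later, on flush
        else:
            if tag == self.tag:
                self.line[offset] = value  # keep a cached copy coherent
            self.memory[addr] = value      # write-through: memory always updated
            self.memory_writes += 1
```

Writing the same byte ten times costs ten memory writes with write-through, but only one line write-back (on eviction or flush) with write-back, at the price of dirty-bit tracking and eviction logic.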
