All times are UTC

Post subject: Cache in FPGA
Posted: Tue Aug 06, 2013 7:24 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10977
Location: England
Several times, talking about our various 6502-like cores running on FPGA, we've mentioned the idea of putting a cache in place as well. That might be instead of, or as well as, some fast on-chip RAM, and of course a cache supposes a larger off-chip RAM. The point of the cache is to allow the CPU to run most of the time faster than the RAM's access speed.

In another thread, Toshi noted a list of pdf papers (and pages) which describe implementing cache on FPGA. A couple of them were behind a paywall, so I've tweaked the list and re-present it here:


Thanks to Toshi for looking these out.

In the case of 6502-style cores, one might implement page 0 and 1 as fast on-chip memory. In the case of 65Org16 where those pages are each 64k, they might be served by private cache, or shared cache.

Our present cores don't distinguish instruction fetch or otherwise lend themselves to a separate instruction cache, but it would be a reasonable thing to make: the data cache or RAM port might be busy with a write from the previous instruction, and the fetch could proceed.

As ever, there are many tradeoffs, and diminishing returns. External RAM cycle times of 4, 6 or more CPU cycles provide the incentive.

Cheers
Ed

Edit (much much later) just to reference Xilinx App Notes xapp201, xapp204, xapp463


Last edited by BigEd on Sat Feb 11, 2023 9:49 pm, edited 1 time in total.

Post subject: Re: Cache in FPGA
Posted: Tue Aug 06, 2013 7:27 pm

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
As far as implementing a cache goes, it seems natural to me to implement it within the FPGA fabric as internal blockRAM, especially for Arlet's core. Michael's core seems to have been specialized to adapt to external slow RAM.

EDIT: A "cache" would be a zero page and stack ram assigned to internal FPGA blockRAM.
All external RAM should be considered 1/2 speed due to all the external routing delays.

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502


Post subject: Re: Cache in FPGA
Posted: Wed Aug 07, 2013 8:01 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
ElEctric_EyE wrote:
All external RAM should be considered 1/2 speed due to all the external routing delays.

With a properly designed memory system, external RAM would be > 90% speed of CPU because of the cache. Whenever you access external SRAM, you check the cache at the same time. When there's a hit, you use the data from the cache with no delay. When there's a miss, you fetch the data from external RAM, but you don't read just one byte from RAM, but a whole cache line (e.g. 8 or 16 bytes) in a burst. Reading a burst from SRAM can be pipelined, so the I/O delays are only incurred once for the whole burst.
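The hit/miss behaviour described above can be modelled in a few lines of Python. This is only a rough sketch: the 16-byte line, the 512 lines and the 12-cycle burst-fill penalty are assumed numbers, not measurements from any real board, and the trace is a made-up 256-byte loop.

```python
LINE_BYTES = 16     # assumed cache line size (the post suggests 8 or 16 bytes)
NUM_LINES = 512     # assumed number of cache lines
MISS_PENALTY = 12   # assumed CPU cycles lost while one line is burst-filled


def simulate(addresses):
    """Return (hits, misses) for a direct-mapped cache over a byte-address trace."""
    tags = [None] * NUM_LINES
    hits = misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        index = line % NUM_LINES
        tag = line // NUM_LINES
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag  # the whole line arrives in one burst
    return hits, misses


def relative_speed(hits, misses):
    """Fraction of full CPU speed, assuming 1 cycle per access plus miss stalls."""
    accesses = hits + misses
    return accesses / (accesses + misses * MISS_PENALTY)


# Hypothetical workload: a 256-byte loop fetched 100 times.
trace = [0x8000 + i for i in range(256)] * 100
hits, misses = simulate(trace)
print(hits, misses, round(relative_speed(hits, misses), 3))
```

With a loop that fits in the cache, only the first pass misses, which is where the "> 90% speed" figure comes from. A trace with poor locality would do much worse.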

The perfect companion for a cache is an SDRAM, by the way. The cache hides most of the SDRAM overhead, so you can benefit from the excellent price/capacity/speed ratio of the SDRAM.


Post subject: Re: Cache in FPGA
Posted: Wed Aug 07, 2013 8:28 pm

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
Arlet, I see what you're saying with the SDRAM being cheap. I just don't like the DRAM capacitor technology where the RAM rows and columns have to be refreshed. From the old C64 days (4164, 64Kx1), I've always disliked it, but I guess I should be willing to learn.

The community sort of got excited around the 65Org16.x DevBoard (judging by views), which I would like to incorporate into a new 65Org16 centered, K1 control board in order to control multiple PVB's situated on the PVB backplane.
The Devboard should be and will be designed in order to be used for multiple purposes, although it will have the same right angle 96-pin female connector as present on a PVB. The mating connector would be necessary for interfacing.

Regarding refresh, video would not seem to be a problem since a bitmap is usually just a linear pattern and the refresh pattern is fully predictable. Right now this is my area of interest, so I guess I should be more interested. Also, power demands are far lower, and a larger (32Mx16) DRAM comes in a package half the size with 8x the storage, compared to the larger SyncRAM footprint (4Mx18).

But for a program space inside SDRAM, the refresh algorithm would be more difficult since the program may branch an unknown amount, and I think a cpu softcore design might change significantly to take advantage of SDRAM. We've hit on this topic before and I know BigEd is interested in this angle of SDRAM.

So some issues maybe you could address:
SDRAM for a program cache, an easy thing to implement?
SDRAM for videoRAM without a FPGA blockRAM line buffer?
SDRAM for videoRAM with a FPGA blockRAM line buffer?

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502


Post subject: Re: Cache in FPGA
Posted: Thu Aug 08, 2013 5:34 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I don't think refresh is something to worry about. On the SDRAM device we've used before, you need to do 8192 refresh cycles every 64 ms. That's on average 1 refresh cycle every 7.8 usecs. But you don't need to worry about doing them very evenly. It is allowed to wait 64 ms, and then do 8192 in a row, or anything in between.
So first of all, you can make the refresh algorithm so that it will try to do refreshes when the SDRAM is idle, and avoid them when SDRAM is busy. I did something like that in my latest SDRAM controller. In any case, refresh only takes a small amount of bandwidth (about 1%), and 99% of a 100 MHz memory bus is better than 100% of a 50 MHz bus, right?
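The refresh numbers above are easy to check. The sketch below redoes the arithmetic; the 70 ns per refresh command is an assumed figure for illustration (the real value comes from the datasheet of whichever SDRAM is used).

```python
REFRESH_CYCLES = 8192      # refresh commands required per window (from the post)
REFRESH_WINDOW_S = 64e-3   # refresh window in seconds (from the post)
T_REFRESH_S = 70e-9        # assumed time the SDRAM is busy per refresh command

# Average spacing between refreshes, in microseconds.
avg_interval_us = REFRESH_WINDOW_S / REFRESH_CYCLES * 1e6

# Fraction of total time spent refreshing.
overhead = REFRESH_CYCLES * T_REFRESH_S / REFRESH_WINDOW_S

# Bandwidth left on a 100 MHz bus, expressed as an equivalent clock in MHz.
usable_mhz = 100.0 * (1.0 - overhead)

print(f"{avg_interval_us:.4f} us between refreshes")
print(f"{overhead * 100:.2f}% of time spent refreshing")
print(f"{usable_mhz:.1f} MHz equivalent left of a 100 MHz bus")
```

Under that assumption the overhead comes out just under 1%, in line with the estimate in the post.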

For video (or other hard real-time data) the solution is to use a block RAM as line buffer or FIFO. That way, there's always enough data available. And for video, the problem is not refresh, but the problem is the CPU getting access to the SDRAM.

For CPU you would use the RDY signal to delay the CPU until SDRAM is ready. This is usually not a problem. It just makes the CPU slow down a little. Your desktop PC does the same thing, and it's not something you notice unless you do some really accurate tests.

So, all we need is a good cache :)


Post subject: Re: Cache in FPGA
Posted: Thu Aug 08, 2013 9:01 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I am too busy right now to make a real implementation, but I spent some time thinking about the general cache design. Assuming something like a 16Mx16 SDRAM device, a simple, small, direct-mapped cache could be made from 4 block RAMs (I'm only doing this from brand "X").

By configuring the block RAMs as 32 bits wide, you get 4 parity bits for free. If you combine the 4 BRAMs, you get 128-bit wide data, plus 16 free parity bits. The 128-bit data would contain a single cache line (minimal cached unit), and would be followed by a 128->8/16/32 MUX for the CPU. The 16 parity bits would hold the tag. The 4 BRAMs would hold 512 cache lines.

Together we have 9 bits for the cache line, 16 bits for the tag, and 4 bits for the offset inside the cache line. That's 29 bits combined, which is more than enough for the 16MB memory.
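To make the bit budget concrete, here is a small sketch of how a byte address would be split into those three fields. The field widths come from the post; the function names are made up for illustration.

```python
OFFSET_BITS = 4    # 16 bytes per 128-bit cache line
INDEX_BITS = 9     # 512 cache lines
TAG_BITS = 16      # stored in the 16 spare parity bits


def split_address(addr):
    """Split a byte address into (tag, index, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset


def join_address(tag, index, offset):
    """Rebuild the byte address from its three fields."""
    return (tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS) | offset


ADDRESS_BITS = OFFSET_BITS + INDEX_BITS + TAG_BITS        # 29
CACHE_DATA_BYTES = (1 << INDEX_BITS) * (1 << OFFSET_BITS)  # 8192

tag, index, offset = split_address(0x123456)
print(hex(tag), hex(index), hex(offset))
```

On a lookup, the index selects one of the 512 lines, the stored tag is compared with the tag field of the address, and the offset drives the output MUX.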

By using two of those direct-mapped caches in parallel, you could implement a 2-way set-associative cache.


Post subject: Re: Cache in FPGA
Posted: Thu Aug 08, 2013 11:23 am

Joined: Mon Mar 02, 2009 7:27 pm
Posts: 3258
Location: NC, USA
This is foreign territory for me! I did a wiki on CPU cache.
So the cache deals directly with CPU opcodes and data, and the address and flag bits/tags are 'controllers'?
Maybe a graphic would help.
Is the "128->8/16/32 MUX for the CPU" for interfacing to different data bus widths for a general type cache?

_________________
65Org16:https://github.com/ElEctric-EyE/verilog-6502


Post subject: Re: Cache in FPGA
Posted: Thu Aug 08, 2013 12:30 pm

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
In the wiki page, check out the graphic for the direct mapped cache. That's the simplest one, and one that I would start to implement.

And, yes, the cache would have to be able to support 8, 16 or 32 bit data widths, for the different cores.

The video controller would not go through the cache, by the way.


Post subject: Re: Cache in FPGA
Posted: Thu Aug 15, 2013 5:04 pm

Joined: Sun Sep 15, 2002 10:42 pm
Posts: 214
Arlet wrote:
In the wiki page, check out the graphic for the direct mapped cache. That's the simplest one, and one that I would start to implement.

And, yes, the cache would have to be able to support 8, 16 or 32 bit data widths, for the different cores.

The video controller would not go through the cache, by the way.


I've worked on one processor with a direct-mapped cache (the Renesas SH4) and I'll mention one thing as a caveat: the problem with direct-mapped caches is that they're very code- and data-alignment sensitive.

I've seen situations where you bum one instruction out of a code sequence and the code runs slower because removing the instruction changed the alignment of another function. So trying to optimize code for a processor with a direct-mapped icache tends to be very frustrating.
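The alignment sensitivity can be reproduced with a toy model. In this sketch each "function" is reduced to a single fetch address, and the cache geometry (16-byte lines, 512 sets) is assumed rather than taken from the SH4. Two functions that sit exactly one cache size apart evict each other on every call in a direct-mapped cache; moving one of them by a single line, or going 2-way, makes the problem disappear.

```python
LINE_BYTES = 16   # assumed line size
NUM_SETS = 512    # assumed number of sets (lines per way)
CACHE_BYTES = LINE_BYTES * NUM_SETS


def count_misses(trace, ways):
    """Count misses for a set-associative cache with LRU replacement.

    ways=1 gives a direct-mapped cache.
    """
    sets = [[] for _ in range(NUM_SETS)]  # each set lists tags, oldest first
    misses = 0
    for addr in trace:
        line = addr // LINE_BYTES
        index, tag = line % NUM_SETS, line // NUM_SETS
        entries = sets[index]
        if tag in entries:
            entries.remove(tag)       # hit: refresh its LRU position
        else:
            misses += 1
            if len(entries) == ways:
                entries.pop(0)        # evict the least recently used tag
        entries.append(tag)
    return misses


def alternating_calls(func_a, func_b, calls=100):
    """Trace of two hypothetical functions called in turn."""
    return [func_a, func_b] * calls


collide = alternating_calls(0x0000, CACHE_BYTES)               # same index
shifted = alternating_calls(0x0000, CACHE_BYTES + LINE_BYTES)  # one line apart

print(count_misses(collide, ways=1))  # every access misses
print(count_misses(shifted, ways=1))  # only the two cold misses
print(count_misses(collide, ways=2))  # 2-way absorbs the conflict
```

Note that the 2-way case here has twice the storage of the direct-mapped one, matching the idea of running two direct-mapped caches in parallel.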

Toshi


Post subject: Re: Cache in FPGA
Posted: Sun Aug 18, 2013 4:15 am

Joined: Sun Dec 29, 2002 8:56 pm
Posts: 460
Location: Canada
Using an i-cache there's the opportunity to eliminate some of the opcode fetches by reading multiple bytes of the opcode all at once. A triple-ported cache can read the PC, PC+1 and PC+2 bytes all at the same time. This means the i-fetch states of the core have to be modified. But if we're going for performance we might as well use a 24-bit instruction port and modify the state machine. Just thinking.

_________________
http://www.finitron.ca


Post subject: Re: Cache in FPGA
Posted: Sun Aug 18, 2013 5:16 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8540
Location: Southern California
Wow Rob, first post in eight years! Nice to have you back! I hope you'll become active again.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Post subject: Re: Cache in FPGA
Posted: Sun Aug 18, 2013 4:28 pm

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
Rob:

I second Garth's welcome. It's good to have you join the discussions here again. I have synthesized your BC6502 core to compare against mine. Perhaps you'd share with us here some of your experiences with your core since it preceded my efforts and those of several others. For example, did you have the opportunity in your work to apply it on a commercial project?

On the subject of your post.
Rob Finch wrote:
Using an i-cache there's the opportunity to eliminate some of the opcode fetches by reading multiple bytes of the opcode all at once. A triple-ported cache can read the PC, PC+1 and PC+2 bytes all at the same time. This means the i-fetch states of the core have to be modified. But if we're going for performance we might as well use a 24-bit instruction port and modify the state machine. Just thinking.

Caveat: applies to 8-bit 6502-like processor.

I agree in general. However, I disagree with the recommended implementation: a 3-port cache. I have looked at implementing internal memory in this style, but limited it to the dual-port nature of the built-in Block RAMs. In my opinion, extending the native number of ports of the internal block RAMs to support additional ports adds more complexity than is needed. (Note: Xilinx and others have demonstrated using the dual-port nature of the block RAMs and overclocking the interface to generate quad port memories. However, once we start talking about overclocking an interface or peripheral, the discussion quickly degenerates into a discussion about whether the core operates in one, two, or more cycles. So let's just keep the discussion focused on the concepts and not the detailed implementation.)

With a processor which has variable length instructions, instruction fetches will sooner or later cross a buffer/cache line boundary. So even with a 16-bit dual ported, 24-bit triple ported, or 32-bit quad ported memory/cache interface there's a very high likelihood that the instruction stream will cross a memory/cache line boundary. In a cache, the line buffer multiplexer can ameliorate this issue for a time, but eventually, the instruction stream will cross a memory/cache line boundary, and the instruction fetch unit will have to deal with the break in the instruction stream, especially if the next byte/word is not in the following buffer/cache line. It will have to hold/buffer that part of the instruction fetched from one memory/cache segment (or restart the fetch sequence), fetch from the next buffer/cache line, and align the required data into the required instruction packet. Whether it is a 16-bit, 24-bit, or 32-bit interface, this issue with the 6502 instruction stream is going to occur.

Instructions in the 6502 instruction set are 1, 2, or 3 bytes in length. If memory is not an issue, then one solution would be to create an internal memory space in which each complete instruction would be assembled and held: 1 byte instructions would not use two of the bytes; 2 byte instructions would not use one of the bytes; and 3 byte instructions would use all bytes. As the instruction fetch unit requests instructions, the entire instruction is written to (as read from main memory on a buffer/cache miss) or read from (on a buffer/cache hit) the instruction buffer. In addition, either the buffer or the instruction fetch unit provides the increment value for the program counter.

I don't have any hard data on the average instruction length in a typical 6502 program, but it appears, without explicit counting, that Klaus' 6502 functional test program (the biggest 6502 program I have access to at the moment) has an average of 2 bytes per instruction, or just less than two bytes per instruction. If this perception of average instruction length holds, then about one third of the memory cells in this type of buffer would be unused. This penalty of underutilized block RAM storage may be worthwhile if it essentially turns most zero page and absolute direct access instructions into single or two cycle instructions instead of the normal 3 or 4 cycle instructions.
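The one-third figure follows directly from the slot size. A quick sketch, using a made-up instruction mix that averages 2 bytes (not counted from any real program):

```python
SLOT_BYTES = 3  # every buffer entry is wide enough for the longest instruction


def slot_utilization(lengths):
    """Fraction of buffer bytes actually occupied by instruction bytes."""
    if any(n not in (1, 2, 3) for n in lengths):
        raise ValueError("6502 instructions are 1, 2 or 3 bytes long")
    return sum(lengths) / (SLOT_BYTES * len(lengths))


# Hypothetical mix: 25% one-byte, 50% two-byte, 25% three-byte instructions.
sample_lengths = [1] * 25 + [2] * 50 + [3] * 25

average_length = sum(sample_lengths) / len(sample_lengths)
unused_fraction = 1.0 - slot_utilization(sample_lengths)

print(average_length, round(unused_fraction, 3))
```

Any mix that averages 2 bytes per instruction leaves a third of the buffer unused; a mix with a shorter average leaves more.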

One issue with such an instruction buffer is that the addresses of the instructions it holds are not linear. I can think of a number of possibilities for aligning blocks of the instruction buffer with blocks of addresses. Flushing blocks in the event of an instruction buffer miss can probably be implemented in a manner similar to that of a standard instruction cache. Thus, there is some additional complexity related to accessing whole instructions that may not be worth the effort.

Another issue with either your approach or mine is that the basic nature of the processor has not changed: it is still an 8-bit machine internally. Thus, to accommodate the wider instruction fetch data path and update the instruction register and any operand registers using a data stream of variable width, there may be a need for additional combinatorial logic in the input data path to align the instruction data with the correct internal registers. This may or may not affect the overall performance, but it will most likely increase the path delays and lower the overall clock frequency that can be achieved. Sometimes a lower operating clock frequency provides better overall performance, and it will also marginally reduce the dynamic power dissipation.

One other thought that I've had along these lines is to use a dual port block RAM with two independent address buses to read indirect address operands from memory in a single cycle. One address is the address of the LSB, and the other address is that address + 1. In this manner, the 8-bit nature of the instruction/data spaces is accommodated naturally, i.e. there's no need to waste storage as in the instruction buffer described above. One port is always set as a read/write port (the LSB port), and the other is a read-only (the MSB) port.
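As a behavioural sketch of that idea (not RTL, and with invented class and function names), a memory with two read addresses returns both bytes of a pointer in one access:

```python
class DualPortRam:
    """Toy model of a block RAM with two independent read addresses."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def write(self, addr, value):
        """Write through the read/write (LSB) port."""
        self.mem[addr] = value & 0xFF

    def read_pair(self, addr_a, addr_b):
        """Return the bytes at both addresses, as if read in the same cycle."""
        return self.mem[addr_a], self.mem[addr_b]


def read_pointer(ram, addr):
    """Fetch a little-endian 16-bit pointer with a single dual-port access."""
    lo, hi = ram.read_pair(addr, (addr + 1) % len(ram.mem))
    return lo | (hi << 8)


ram = DualPortRam(256)   # modelled here as a 256-byte zero page
ram.write(0x20, 0x34)    # pointer low byte
ram.write(0x21, 0x12)    # pointer high byte
print(hex(read_pointer(ram, 0x20)))
```

The modulo on the second address is a modelling choice so that a pointer at the last location wraps to the start of the memory; a real design would decide that behaviour per addressing mode.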

When used in combination with the instruction buffer, zero page indirect, {(zp); (zp,X); (zp),Y} and absolute indirect, {(abs); (abs,X)} instructions will experience a savings of at least three clock cycles per instruction. These savings and the corresponding increase in performance may be sufficient to warrant the additional effort needed to implement the instruction and data caches in the manner described here. (Note: the absolute indirect instructions, {jmp (abs) and jmp (abs,X)}, are likely to be used infrequently, so the performance benefit gained from the instruction buffer and data cache can generally be discounted for these two instructions.)

Toshi recently made a good point regarding the performance of direct mapped caches. The concepts described here are not limited to direct mapped caches, but it certainly makes the concepts significantly easier to implement if a direct mapped approach is taken as a first step. In the special case that he described for his SH4 application, the cache thrashing experienced with the direct mapped cache of the SH4, although not desired, is a common issue with direct mapped caches. That issue is present in any cache, but it is much more apparent in a direct mapped cache. However, regardless of the cache implementation strategy, cache thrashing is statistically possible in any program unless all instructions and data reside in the cache. This last point should raise the question in everyone's mind: is a cache even needed for a 6502-like processor in an FPGA with at least as much block RAM as program space?

In my opinion, driving the memory cycle rate of a 6502-like processor much beyond what can reasonably be achieved using internal block RAM is not very productive. As a controller, the 6502 and its kin are great devices. There are some 6502 limitations that should raise concerns as to its applicability outside its normal application domain. In its domain of 8-bit embedded systems, the 6502 is undoubtedly an outstanding processor.

BigEd, EEyE, and others have demonstrated with the 65Org16 and 65Org32 variants that the 6502 architecture can be extended into other domains, but in doing so, the low-cost nature of the memory subsystem is lost. Because these variants have larger address spaces, wider instructions, etc., a cache subsystem tied to a large external memory like a 16 MB SDRAM is much more reasonable than a cache and a 64kB address space.

Toshi:

I am sure that after all of the effort that compiler writers and software developers have devoted to extracting the most performance from an instruction set, they could have at least devoted some effort to static analysis of the link image for cache thrashing issues. In fact, I would have thought that the garbage collector for the Java virtual machine would have been designed to spot issues such as cache thrashing and automatically adjust the relative positions of the offending code segments to avoid the issue while it had essentially brought the machine to a standstill. This is such a simple concept that it should be relatively easy to implement since the cache design and strategy of the target is a known quantity. Hennessy and Patterson claimed that these types of problems should be left to the software tools and not be implemented in the computer's hardware, i.e. the cache subsystem or the execution pipeline. After driving processor design to virtually unimaginable levels of complexity, it's time for a bit of innovation on the software development side of the house by the so-called computer scientists. :D

_________________
Michael A.


Post subject: Re: Cache in FPGA
Posted: Tue Aug 20, 2013 6:08 am

Joined: Sun Dec 29, 2002 8:56 pm
Posts: 460
Location: Canada
As far as I know my cpu isn't used in a commercial product. I've had a couple of people contact me though and say it works great in their projects.

The code for the cpu is ugly because it was all tuned for the synthesizer. I broke up the state machine into multiple machines for instance in order for the clock tree to be optimized better. I had to explicitly put in a tri-state bus on the address mux and other things. I think this had to do with an older toolset not optimizing the best.

3 read ports plus a single write port just requires multiple copies of the block RAM with the write inputs tied together; it isn't that complex. This is pushing back the line buffer mux into the cache; the control logic is simple then. There is no need for hold logic on a line cache wrap-around. Each of the PC, PC+1 and PC+2 indexes will read from the right cache line. (The PC+1 or PC+2 might read from the next cache line.) When the instruction wraps around into the next cache line, it's treated just like a miss (if it is a miss). There are however three miss signals to deal with (one for each copy of the cache). Which miss(es) is valid depends on the length of the instruction. This is perhaps the stupid but expensive solution (it requires 3x the block RAM).

As you say, the alternative line buffer requires a holding buffer and a shifter to place the instruction properly. The control logic is pretty fancy. Any solution that attempts to handle the whole instruction at once is going to be a bit of a nightmare.

A cache for the 6502 is somewhat academic. It's better to just use the block RAMs as memory. E.g. an 8kb cache for an 8kb app memory doesn't make any sense.

Dual address ports on zero page memory is an interesting idea. For the 816 it would require triple ports. It allows reading a whole address value in a single cycle. I like it.

_________________
http://www.finitron.ca


Post subject: Re: Cache in FPGA
Posted: Tue Aug 20, 2013 7:13 am

Joined: Tue Sep 03, 2002 12:58 pm
Posts: 336
You don't need three read ports to read an instruction. Make your read port 16 bits wide, and use two memory blocks: one for even-numbered words, and one for odd-numbered words. Feed the address (without the low bit) into both. If bit 1 is set, then the even-numbered block gets 1 added to its address. Then you just need some multiplexing on the output, driven by the bottom two bits of the address. That will give 24 bits (actually 32 bits if you wanted it) at any alignment, in a single read, without wasting precious block RAM.
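A software model of this even/odd bank scheme might look like the sketch below. The function names and the 64-byte test memory are invented for illustration, and the memory length is assumed to be a multiple of 4 bytes. In this particular model three bytes are available at every alignment; a fourth byte falls out only when the address is even.

```python
def build_banks(memory):
    """Split a byte sequence into even- and odd-numbered 16-bit word banks."""
    words = [memory[i] | (memory[i + 1] << 8) for i in range(0, len(memory), 2)]
    return words[0::2], words[1::2]


def fetch24(even_bank, odd_bank, addr):
    """Read 3 consecutive bytes starting at any byte address in one access."""
    row = addr >> 2                  # bank address: drop the two low bits
    if addr & 2:
        # The first word is odd-numbered, so the next (even) word is one row on.
        first = odd_bank[row]
        second = even_bank[(row + 1) % len(even_bank)]
    else:
        first = even_bank[row]
        second = odd_bank[row]
    window = first | (second << 16)  # 4 bytes starting at addr with bit 0 cleared
    if addr & 1:
        window >>= 8                 # output mux: skip the byte below addr
    return window & 0xFFFFFF


memory = bytes((7 * i + 3) & 0xFF for i in range(64))  # arbitrary test pattern
even_bank, odd_bank = build_banks(memory)
print(hex(fetch24(even_bank, odd_bank, 5)))
```

The two banks together hold exactly the original bytes, so no block RAM is spent on duplicate copies.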

Depending on the FPGA, the write port can still be 8 bit, to avoid any complications on that side.


Post subject: Re: Cache in FPGA
Posted: Tue Aug 20, 2013 7:17 am

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Rob Finch wrote:
A cache for the 6502 is somewhat academic. It's better to just use the block RAMs as memory. E.g. an 8kb cache for an 8kb app memory doesn't make any sense.

For an 8 bit 6502, there is usually enough block RAM. For the 16/32 variants people have discussed on the forum, a large external memory makes sense. And if that external memory is SDRAM, a cache would help a lot. I would prefer just a simple cache and a regular core. Multi-port caches and complicated cores that can exploit them seem like a lot of work for little practical benefit.

