PostPosted: Mon Jan 14, 2019 9:21 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Another thread prompted me to consider the practicalities of using currently-available DRAM with current WDC CPUs, preferably at full speed.

For starters, if you're using a 6502, just use SRAM. At the memory sizes you typically install in such machines, it's cheap enough to not care about. So I'm going to assume the use of a 65816, which may also help avoid some of the performance tradeoff.

By Mouser prices, there is a potentially significant cost saving here if you intend to install a lot of RAM, to the tune of nearly €3.50 per 2MB, minus the extra cost of babysitting the DRAM versus the very simple method of driving SRAM. For a full 16MB array, that adds up to nearly €28, and you can get a fair amount of 74-series logic for that. The question is, how much 74-series logic do we need to do the necessary babysitting, and how fast can we go with it?

I'm basing my initial analysis on the ISSI 50ns 1Mx16 EDO chips currently available from Mouser. These are 3.3V parts, so I'll be working to WDC's 8MHz 3.3V timings.

The basic 65xx bus contract is that addresses are valid at the end of Phi1 (when Phi2 goes high) and that data is valid in either direction at the end of Phi2 (when Phi2 falls). At 8MHz, the full cycle takes 125ns, so I'll round down and take each full phase as 60ns. But DRAM needs more than two distinct phases to operate, so we could also consider quarter-phases of 30ns each; we'll need a double-speed clock to regulate those.
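
Spelling the arithmetic out (this is just the numbers from the paragraph above, restated in Python):

Code:
# Phase arithmetic at 8 MHz, as used through the rest of this post.
f_cpu_mhz = 8.0
cycle_ns = 1000.0 / f_cpu_mhz                 # 125 ns per full Phi2 cycle
half_phase_ns = cycle_ns / 2                  # 62.5 ns, rounded down to 60 ns for margin
quarter_phase_ns = 60.0 / 2                   # 30 ns quarter-phases
f_2x_mhz = 2 * f_cpu_mhz                      # 16 MHz double-speed clock; each
half_period_2x_ns = 1000.0 / f_2x_mhz / 2     # half-period (~31 ns) is one quarter-phase
print(cycle_ns, half_phase_ns, quarter_phase_ns, half_period_2x_ns)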

At 3.3V, however, the address and data setup times are 40ns. That means we can't rely on addresses being valid and stable at the midpoint of Phi1, only at the end, nor can we rely on write-data being valid and stable at the midpoint of Phi2. We need addresses and data valid before we can even start feeding them into the DRAM, because its control signals are edge-triggered.

So we must assert /RAS with the upper half of the address at the end of Phi1, /CAS (and /OE for read cycles) with the lower half of the address at the midpoint of Phi2, and /WE (for write cycles) at the end of Phi2 (a "late write" cycle in DRAM terms). We have to deassert all control signals at the end of Phi2 on a read cycle to avoid conflicting with the bank address multiplexing, but at the midpoint of Phi1 on a write cycle.
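
To make that sequencing easier to see at a glance, here it is as a small Python table, one entry per quarter-phase boundary (purely a restatement of the scheme above, not a tested design):

Code:
# DRAM control-signal schedule per 65816 bus cycle, as described above.
# "end_phi1" = Phi2 rising edge, "end_phi2" = Phi2 falling edge,
# "mid_phi1_next" = midpoint of the following Phi1.
READ_CYCLE = {
    "end_phi1": ["drive the row address (upper half)", "assert /RAS"],
    "mid_phi2": ["switch the mux to the column address (lower half)", "assert /CAS and /OE"],
    "end_phi2": ["CPU latches the read data", "deassert /RAS, /CAS and /OE"],
}
WRITE_CYCLE = {
    "end_phi1": ["drive the row address (upper half)", "assert /RAS"],
    "mid_phi2": ["switch the mux to the column address (lower half)", "assert /CAS"],
    "end_phi2": ["assert /WE with the write data valid (a 'late write')"],
    "mid_phi1_next": ["deassert /RAS, /CAS and /WE"],
}
for name, schedule in (("read", READ_CYCLE), ("write", WRITE_CYCLE)):
    print(name)
    for point, actions in schedule.items():
        print(f"  {point:14s} {'; '.join(actions)}")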

As far as I can tell, the timings all theoretically make sense with that scheme, for both reads and writes. The timing relationships need careful management in some places, particularly the switch on the DRAM address bus between /RAS and /CAS. We don't need to worry about delays while switching DRAM pages, because we can issue a full RAS-CAS sequence for every CPU cycle. But this still leaves open the question of how to handle refresh.

Side note: this is a 16-bit DRAM, but it has separate /CAS inputs for the high and low bytes, so you can treat it like a pair of 1Mx8 DRAMs stuck together, with the DQ lines commoned and the appropriate /CAS line selected by one of the address bits, perhaps A0. The DQ lines will remain high-impedance and impervious to writes if the corresponding /CAS line stays high.
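
A minimal sketch of the byte-lane decode that side note implies, with Python standing in for a couple of gates (using A0 as the lane select is only the "perhaps A0" suggestion above):

Code:
# Byte-lane select for a 1Mx16 EDO DRAM treated as two 1Mx8 halves.
# Assumption: A0 picks the byte lane, per the "perhaps A0" suggestion above.
def cas_strobes(a0: int, dram_selected: bool) -> dict[str, bool]:
    """Return which /CAS strobes to assert (True = pull the pin low)."""
    if not dram_selected:
        return {"/CASL": False, "/CASH": False}   # both high: DQ stays high-impedance
    return {"/CASL": a0 == 0, "/CASH": a0 == 1}

# A write to an odd address asserts only /CASH, so the low byte is untouched:
print(cas_strobes(a0=1, dram_selected=True))      # {'/CASL': False, '/CASH': True}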

Refresh is greatly simplified by the fact that these DRAM chips include an internal refresh row counter, which means we don't need to intervene on the address bus with our own row counter, only instruct the chip when to perform a refresh cycle (by bringing /RAS low while /CAS is already low). Classic micros, whose DRAMs lacked this feature, often co-opted the CRTC as a refresh counter and designed the video memory access sequence to meet refresh requirements; we don't need to bother with that.

These DRAMs also support "hidden refresh", but I don't think we can use that if we want high performance. The cycle time of /RAS is limited by the sum of tRAS and tRP, which is 80ns for this device, but we need to perform two /RAS cycles during one CPU cycle if we want hidden refresh, and 160ns is much more than the 125ns we can even theoretically tolerate. So we need to find or make opportunities to perform a CBR (/CAS before /RAS) refresh in which no data transfer occurs.

The 65816 obliges by flagging "internal operation" cycles with both VDA and VPA low; these are cycles where the CPU isn't doing any data transfer. These signals are valid at the end of Phi1, so we can hold /OE and /WE high for that cycle, assert /CAS at the end of Phi1, and assert /RAS at the midpoint of Phi2. We can also do the same in cycles where the CPU is accessing something other than DRAM - or, at least in theory, when it accesses a different DRAM (though this will increase the complexity of the glue logic). The DRAM chip will ignore the address lines and keep its data lines high-impedance during a refresh cycle.

However, we can't guarantee that the 65816 will use enough "internal operation" cycles or I/O device accesses to satisfy tREF for the DRAM; we need to cycle through all 1024 rows within 16ms, or 128,000 Phi2 cycles at 8MHz, or 125 cycles per row on average. To work around this, we can set up a 0-63 (refresh every 64 cycles) or 0-99 (refresh every 100 cycles) counter, incremented by Phi2, and reset when we can do a refresh with no performance penalty. If the counter overflows, it will assert its carry signal to trigger a forced refresh cycle. This is the same as an "internal operation" refresh cycle, except that RDY is also negated to keep the CPU out of the way. RDY is valid at the end of Phi2.
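
Here is the refresh bookkeeping as a small behavioural model in Python (a stand-in for the counter logic, not HDL; the 100-cycle reload is one of the two options mentioned):

Code:
# Behavioural model of the refresh scheme: all 1024 rows every 16 ms.
# At 8 MHz that is 0.016 * 8_000_000 = 128_000 cycles, i.e. 125 cycles per row
# on average, so one refresh per 100 CPU cycles is comfortably sufficient.
REFRESH_INTERVAL = 100          # or 64 for a simpler 6-bit counter

class RefreshTimer:
    def __init__(self):
        self.count = 0

    def clock(self, free_refresh_possible: bool) -> bool:
        """Called once per Phi2 cycle; returns True when a forced refresh
        (with RDY negated) must be inserted this cycle."""
        if free_refresh_possible:
            self.count = 0      # internal-operation or non-DRAM cycle: CBR refresh for free
            return False
        self.count += 1
        if self.count >= REFRESH_INTERVAL:
            self.count = 0      # counter "carry out": stall the CPU and refresh
            return True
        return False

timer = RefreshTimer()
print(any(timer.clock(free_refresh_possible=False) for _ in range(99)))   # False
print(timer.clock(free_refresh_possible=False))                           # True: forced refresh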

So that, I think, establishes that we *can* use modern DRAM as a cost-saving measure, with a feasible amount of extra glue logic and very little performance penalty. I think I'll leave checking the 5V 14MHz options and figuring out exactly what that glue logic looks like for another day.


PostPosted: Tue Jan 15, 2019 8:21 am 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Another option, involving a bit more work and modern hardware, is to take a fast SDRAM + FPGA, run them at 100 MHz or so, and have the combination present itself to the 65816 as asynchronous RAM. The FPGA's block RAMs could be used as additional static RAM and/or cache.


PostPosted: Tue Jan 15, 2019 12:05 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
One important takeaway from this summary is that once you move away from SRAM, you'll be lucky to retain the deterministic timing you're used to, because refreshes take time, and in-page and in-bank accesses can be quicker than page and bank misses. The only way to retain fixed access times is to run slower than the DRAM can actually manage, so that the overheads are hidden.

For me, that's fine: only the simplest and smallest computers allow for easy cycle-counting, and I'd rather have the performance.

I like the idea of using an FPGA to make a fast, large, cheap pseudo-SRAM. As noted, it would need some way of stalling the CPU for slower accesses.


PostPosted: Tue Jan 15, 2019 4:58 pm 

Joined: Sat Dec 13, 2003 3:37 pm
Posts: 1004
But it's a difference of $30 (at 16MB), and about 30% of the RAM price in savings. How cheap do you really need to go? And how expensive is the glue logic, in terms of hardware, board space, development time, and the sheer cognitive load of understanding the system?


PostPosted: Tue Jan 15, 2019 5:06 pm 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
For EDO RAM the price difference isn't very big, but hopefully it would not require a lot of glue logic.

For SDRAM, the price difference is much bigger, and would easily pay for an FPGA. Also, at 16 MB the board space would be smaller, because the SDRAM comes as a single chip. Development time is free when you're having fun.


PostPosted: Tue Jan 15, 2019 7:01 pm 

Joined: Fri Dec 21, 2018 1:05 am
Posts: 1120
Location: Albuquerque NM USA
In the last two years I have done five processor designs using DRAM: three used 16MB 72-pin SIMMs, and two used 1Mx16 DRAM. They are not 6502 designs, but I think my DRAM experience is relevant to this discussion.

The SIMM modules are very cheap: $2 for a 16MB SIMM in quantities of ten. (I bought mine from this guy: https://www.ebay.com/itm/LOT-OF-TEN-16M ... 1119011755 ) So that's a motivation for processors that can use this much memory. The processors I targeted are the 68030, 68000, and Z280.

The DRAM controller can be designed with a modest 5V CPLD like the Altera EPM7128. It needs a total of 9 flip-flops for normal DRAM access plus the refresh access. For the 68030 and 68000, there also needs to be a refresh counter that generates periodic refresh requests, so that's another 7 flip-flops. For the Z280, there is already an internal refresh timer, so 9 flip-flops are all it needs. The 680x0 and Z280 have multi-clock bus cycles, so the digital delay taps for the DRAM can take advantage of that. For the 6502, a 4x CPU clock is needed to generate the necessary delay taps.

DRAM has large power surges, so it absolutely needs filter caps (multiple 10uF tantalums in my designs). I was lucky that all five designs managed to work on 2-layer PC boards without transmission-line design rules. I had to add extra ground wires to some of the prototypes, but the final designs have a better ground layout, so the extra wires are not needed.

I should mention that for the 68000 and Z280, I could not figure out how to use all 16MB of memory for anything other than a CP/M RAM disk. Even with an 8MB RAM disk, 68000 CP/M-68K has 7MB of TPA! I eventually redesigned the 68000 and Z280 boards with 2MB of memory, using 1Mx16 DRAM.
Bill


PostPosted: Tue Jan 15, 2019 9:33 pm 

Joined: Thu May 28, 2009 9:46 pm
Posts: 8514
Location: Midwestern USA
Arlet wrote:
Development time is free when you're having fun.

Good point! It is a hobby, after all, not a life's calling.

You mean it is a life's calling? :shock: :roll:

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


PostPosted: Tue Jan 15, 2019 10:05 pm 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Now for a wrinkle: the 5V DRAM parts in the same capacity are no faster than the 3.3V ones. There are 35ns parts at Mouser, but they are 4Mbit instead of 16Mbit and only half the price - twice the cost per bit - so SRAM actually becomes cheaper overall. So for 14MHz+ operation, we have to see what we can do with 50ns parts. (These are already faster than what most PCs used in the 1990s.)

The Phi2 cycle time at 14MHz is a touch over 70ns. It takes 80ns minimum to do a full RAS-CAS cycle on these parts, and that's if you time everything just right. So the strategy we used at 8MHz won't work, but if we resort to inserting a wait-state for every DRAM access, we're potentially throwing away a lot of performance. Instead, let's try using EDO Page Mode as the manufacturer intended. This claims a /CAS cycle time of just 30ns - exactly what we need.

First, we need a register to remember which page we've "opened" in the DRAM, and a comparator to detect when we make an access to a different page, or if we haven't opened a page yet. If we need to open the page, we need to insert a wait-state to allow /RAS to be released (end of Phi1) and reasserted (end of Phi2), at which point the register can be reloaded and the access can continue. A read cycle proceeds by asserting /CAS and /OE at the end of Phi1 and releasing them at the end of Phi2. A write cycle asserts only /CAS at the end of Phi1, then /WE at the end of Phi2, and releases both "shortly" thereafter. The /WE pulse technically only needs to be 8ns wide, and we need to get /CAS deasserted in time for the end of the next Phi1.
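
Here's the page-tracking idea as a behavioural sketch in Python (a model of the register-plus-comparator, not logic; the address split - A0 as byte lane, A1-A10 column, A11-A20 row - is my assumption for a 1Mx16 part):

Code:
# Behavioural model of the open-page register and comparator.
# Assumed address split for a 1Mx16 part: A0 = byte lane, A1-A10 = column,
# A11-A20 = row, so one open row covers a 2 KB page.
class PageTracker:
    def __init__(self):
        self.open_row = None            # no page open yet

    def access(self, address: int) -> int:
        """Return the number of wait-states this access needs."""
        row = (address >> 11) & 0x3FF
        if row == self.open_row:
            return 0                    # page hit: /CAS-only cycle at full speed
        self.open_row = row             # page miss: release /RAS, reload the page
        return 1                        # register, reassert /RAS - one wait-state

tracker = PageTracker()
print([tracker.access(a) for a in (0x1000, 0x1002, 0x17FE, 0x1800)])   # [1, 0, 0, 1]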

Data access patterns without the benefit of a cache can be pretty chaotic. In particular, there will be frequent switches between program memory, direct-page, stack memory, and one or more locations where bulk data is stored, and it's unlikely that all these will be in the same 2KB DRAM page. This pressure can be substantially reduced by providing SRAM for the first 256KB (covering direct-page and stack at a stroke), and keeping program code within that area, using DRAM only for bulk data in high memory. In that case, possibly only one page register and one /RAS line will be required between several DRAM chips. But a more general solution is to provide a page register for each chip; they can be multiplexed onto the same comparator and other logic, as long as a separate /RAS line is provided to each chip.

Now for refresh. Refresh requires pulsing /RAS, and we can only do that every other cycle due to that 80ns limit, so if we inadvertently trigger a refresh just before we need to change pages, we need to insert a second wait-state. That potentially makes refresh more costly. We also need to delay a refresh cycle if it comes due immediately after we've opened a new page for access. But do we need to delay it by a whole cycle, or only half of one? If we can perform it out-of-phase with the page-change /RAS cycles, then we can do a refresh and a page change together (in any order) in a total of three cycles, not four. But we still need to coordinate them to avoid timing violations, and the refresh cycle still requires inserting a wait-state.


PostPosted: Wed Jan 16, 2019 5:57 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Speaking of SDRAM, I looked at a 16Mx8 chip that happens to be available on Mouser and is surprisingly cheap (less than a single EDO DRAM chip).

It turns out I was mistaken about there being a formal minimum clock speed on these things. However, no matter how slowly you clock them, there's a minimum 2-cycle delay between providing the /CAS address and getting data out, whereas the 65xx bus assumes you can do that in half a cycle (unless you want to insert wait-states). So that implies you need a 4x clock to run the SDRAM, *if* you accept the need to insert a wait-state every time you need to change rows. For 14MHz operation, that means you need a 56MHz SDRAM clock, and glue logic which can keep up with that rapid pace. Ouch.
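
Spelling out why that works out to a 4x clock (just the arithmetic from the paragraph above):

Code:
# Why the 2-clock CAS latency forces a 4x SDRAM clock at 14 MHz.
cas_latency_clocks = 2
clocks_per_cpu_cycle = cas_latency_clocks * 2     # the latency must fit in half a Phi2 cycle
f_cpu_mhz = 14.0
print(clocks_per_cpu_cycle * f_cpu_mhz)           # 56 MHz SDRAM clock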

Worse, the 65816 requires 5V to run (officially, at least) at 14MHz, but the SDRAM chip isn't 5V tolerant. To get the most performance, you might need to run on a compromise voltage, such as 4V. You'll have to experiment to find out how fast the WDC chips will go at that voltage. The SDRAM would technically be running out of spec (i.e. it might get hot), but 4V is comfortably within the absolute-maximum rating and not *too* far outside the recommended operating region.

Using a single device for the entire 16MB array is actually feasible, because it is divided into four banks which have independent row selection. Each row is 1KB, which is less than on the EDO devices, but still enough to hold direct-page and stack in one go. If you're smart about which address lines you choose to select the bank, you could go for quite a while without needing to change rows. Some more good news is that an auto-refresh cycle can be completed in 60ns, which is inside one CPU cycle, so can be performed opportunistically with a refresh timer as a failsafe.
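
As a concrete illustration of picking the bank-select lines, here is one possible mapping in Python (the organisation - 4 banks of 4096 rows by 1024 columns - is my assumption of a typical 16Mx8 part, and this particular bit assignment is only an example):

Code:
# One possible CPU-address to SDRAM-address mapping for a 16Mx8 part
# (assumed organisation: 4 banks x 4096 rows x 1024 columns, 1 KB per row).
# Putting the bank-select bits just above the column bits means consecutive
# 1 KB regions land in different banks, each able to hold its own row open.
def split_address(addr: int) -> tuple[int, int, int]:
    column = addr & 0x3FF               # A0-A9: offset within the 1 KB row
    bank   = (addr >> 10) & 0x3         # A10-A11: bank select
    row    = (addr >> 12) & 0xFFF       # A12-A23: row within the bank
    return bank, row, column

# 0x000000 (direct page and stack) and 0x000400 land in different banks:
print(split_address(0x000000), split_address(0x000400))   # (0, 0, 0) (1, 0, 0)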

A slightly more exotic scheme requires using a 5x or 6x SDRAM clock, so that you can squeeze three SDRAM clocks into the CPU's half-cycle. This takes advantage of the fact that the RAS-to-CAS delay is specified in nanoseconds, *not* in cycles, and that up to 66.67MHz the spec corresponds to a single-cycle delay. The 5x option takes, say, a 60MHz SDRAM clock, drives Phi2 low for the first two cycles, then high for the next three, during which you perform a complete RAS-CAS sequence with auto-precharge, eliminating wait-states for row changes. The CPU sees an effective 12MHz clock cycle, with a 40/60 duty cycle. The 6x scheme just makes the CPU clock symmetric, and may or may not be faster overall (depending on whether the CPU or the glue logic is the bottleneck).
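
Checking the 5x scheme numerically (figures straight from the paragraph above):

Code:
# Effective CPU clock for the 5x scheme with a 60 MHz SDRAM clock.
f_sdram_mhz = 60.0
sdram_period_ns = 1000.0 / f_sdram_mhz                # ~16.7 ns per SDRAM clock
low_clocks, high_clocks = 2, 3                        # Phi2 low for 2 clocks, high for 3
cycle_ns = (low_clocks + high_clocks) * sdram_period_ns
f_cpu_mhz = 1000.0 / cycle_ns                         # 12 MHz effective
duty_high = high_clocks / (low_clocks + high_clocks)  # 60% high: the 40/60 duty cycle
# One SDRAM clock still covers the ns-specified RAS-to-CAS delay, since (per the
# paragraph above) a single cycle is long enough at anything up to 66.67 MHz.
print(f"{f_cpu_mhz:.1f} MHz effective, Phi2 high {duty_high:.0%} of the cycle")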

So this is an option, but not one to take at all lightly. Lots of people seem to still have trouble building machines that run at 1MHz (though these are usually people using breadboards and flying wires), let alone complex glue logic that can do 60MHz. One for the experienced folks only.


PostPosted: Wed Jan 16, 2019 6:14 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Good point about the voltage/speed tradeoffs, but an awkward situation.

I think we've seen, with the various people having trouble building slow-speed systems, that the root causes are usually unsafe glue logic design, sometimes power supply or decoupling issues. Perhaps once a very old and resistive breadboard. But it's true to say that going above 20MHz is going to be challenging, and 56MHz is some way over that. It's also true, I think, that it's clocks and strobes which cause the most difficulty, and DRAM is all about strobes: those signals need to have clean edges and no glitches. Quite different from the requirements on address and data busses.


PostPosted: Wed Jan 16, 2019 6:28 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
I think it's also possible to imagine an application where you have fast SRAM for "low memory" and keep all your program code there, but you also need a lot of extra RAM for some reason; it doesn't need to be at all fast, so having it cheaply is nice.

You could then run your '816 in-spec at 8MHz 3.3V, and an SDRAM chip at a mere 16MHz, so that every SDRAM access needs one wait-state. The first Phi2 phase suffices to provide the /RAS address, the second Phi1 phase clocks in the /CAS address and read command, and the data appears neatly at the end of the second Phi2 cycle, exactly when required.
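
Laid out phase by phase (a sketch of the sequence just described; ACTIVE and READ are the standard SDRAM command names, and negating RDY in the first cycle is how I'm assuming the wait-state gets inserted):

Code:
# One SDRAM read with an 8 MHz '816 and a 16 MHz SDRAM clock: a single wait-state.
# Each Phi2 half-phase lines up with one SDRAM clock period.
SEQUENCE = [
    ("cycle 1, Phi1", "CPU address becomes valid; decode selects the SDRAM, RDY negated"),
    ("cycle 1, Phi2", "issue ACTIVE with the row address (the /RAS phase)"),
    ("cycle 2, Phi1", "issue READ with the column address (the /CAS phase)"),
    ("cycle 2, Phi2", "CAS latency of 2 elapses; data valid at the falling edge of Phi2,"
                      " exactly when the CPU samples it"),
]
for phase, action in SEQUENCE:
    print(f"{phase:15s} {action}")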


PostPosted: Wed Jan 16, 2019 6:36 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
If you had two or four banks of DRAM, you could have more than one page open, and be slightly more likely to get page-hits for page 0 accesses, or page 1 accesses, if you mapped the address bits so that different 6502 pages map to different banks. It's not going to manage the hit-rate of an actual cache, or a dedicated SRAM for low pages though.

I'm reminded of the ARM, which has one output pin to indicate an upcoming sequential access and which then helps the memory controller stay in-page, for a pretty significant performance win. A 6502-aware memory controller could, for example, take in the SYNC signal, and model the code flow, stack, and zero page accesses accordingly. That would be rather complex for a TTL implementation but well within the possibilities of an FPGA memory controller. If you tied off the SYNC input because it's difficult to connect in your system, you'd get something which still worked but at lower performance.


PostPosted: Wed Jan 16, 2019 6:42 am 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
I've had more problems keeping I2C lines clean and glitch-free than with 100 MHz memory interfaces. If you have an FPGA side by side with SDRAM, the traces will be so short that there's little chance of ringing or crosstalk. My first FPGA+SDRAM project was on a double-sided board, with wide traces and big vias, and it ran fine at 60 MHz.

I've attached pictures of the front and back. There are ground pours on top/bottom, but it's all cut up by signals and power. On the left/right of the FPGA you can see two more 60 MHz buses.

If you put this on a 4-layer board with a clean ground plane, you will have no problems.


Attachments: back.jpg, front.jpg
PostPosted: Wed Jan 16, 2019 6:46 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Quote:
If you had two or four banks of DRAM, you could have more than one page open, and be slightly more likely to get page-hits…

That's true, and I mentioned it previously, but with the simple/slow alternative I just described, you don't actually get an advantage for staying in-page. You still need exactly 2 SDRAM clocks between /CAS (with address) and data arriving, so unless you run an x4 SDRAM clock (or stretch the CPU clock, which is another can of worms), it's impossible to provide that within a single Phi2 phase.

Since in this alternative the fastest clock is 16MHz, it becomes feasible to build the glue logic in the conventional way, without resorting to an FPGA.


PostPosted: Fri Jan 18, 2019 3:00 am 

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
So, let's open that can of worms. Dynamically stretching the Phi2 clock will distort any timers that rely directly on it, like the 6522, but timers with an independent clock source will be fine. Bear that in mind when considering the following:

Assume an '816 at a nominal 8MHz, 3.3V. We provide an x4 SDRAM clock at 32MHz - high for a hobby project, but less intimidating to implement than 60MHz - and derive the Phi2 clock from it via the SDRAM controller. SRAM, ROM and I/O accesses can proceed as normal, clocked and qualified by Phi2 alone, possibly with /RDY wait-states. So what happens with the SDRAM?

Refresh cycles can be taken opportunistically, whenever the SDRAM is not selected, without modifying the Phi2 clock. In most cases, taking advantage of CPU internal-operation cycles will be sufficient. If you normally run code from low SRAM, even better. The controller merely issues an auto-refresh command at the end of Phi1.

A complete RAS-CAS sequence can be completed in 3 SDRAM cycles, by stretching the Phi2 phase by one SDRAM clock. If done continuously, the effective CPU clock would slow to 6.4 MHz with 2:3 duty cycle. This is a relatively small performance penalty for accessing a large, cheap memory array, and the resulting controller logic should be simple enough to implement in an ATF750 CPLD (which uses the 22V10 pinout and will happily run at 3.3V) plus a small number of support chips.
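
Checking the arithmetic (numbers straight from the paragraph):

Code:
# Effective CPU clock when every access stretches Phi2 by one 32 MHz SDRAM clock.
f_sdram_mhz = 32.0
clocks_per_cpu_cycle = 2 + 3                  # Phi1 = 2 SDRAM clocks, stretched Phi2 = 3
print(f_sdram_mhz / clocks_per_cpu_cycle)     # 6.4 MHz, at a 2:3 duty cycle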

A CAS-only access can be completed in 2 SDRAM cycles, without stretching Phi2, but requires keeping track of the open row addresses. The extra logic costs (to store four 12-bit row addresses and compare the relevant one to the current address) might be offset by the ability to drop the low-memory SRAM from the BOM. Sequential or nearby accesses could then be completed at the full 8MHz rate. However, it would then be necessary to issue a bank-precharge command before changing rows, extending the row-change to 2 SDRAM cycles, for a total of 6 SDRAM cycles in one CPU cycle, at a 1:2 duty cycle and an effective rate of 5.333 MHz.
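
And here is that bookkeeping as a behavioural sketch in Python (a model of the row registers and comparator, not an implementation; the bit positions reuse the example mapping from my earlier post and are only an assumption):

Code:
# Decide how many 32 MHz SDRAM clocks each access needs, tracking the open
# row in each of the four banks (12-bit row registers, as described above;
# the bank/row bit positions reuse the earlier example mapping).
F_SDRAM_MHZ = 32.0

class SdramRowTracker:
    def __init__(self):
        self.open_rows = [None] * 4         # one 12-bit row register per bank

    def access(self, addr: int) -> int:
        bank = (addr >> 10) & 0x3
        row = (addr >> 12) & 0xFFF
        if self.open_rows[bank] == row:
            return 4                        # row hit: /CAS-only, unstretched 2+2 clocks
        self.open_rows[bank] = row
        return 6                            # precharge + activate: 2+4 clocks, Phi2 stretched

tracker = SdramRowTracker()
for addr in (0x000000, 0x000010, 0x001000):
    clocks = tracker.access(addr)
    print(f"{addr:06x}: {clocks} clocks -> {F_SDRAM_MHZ / clocks:.3f} MHz effective")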

Some CPUs might produce addresses early enough to make a decision on changing rows before the address lines are formally "stable". In that case, the penalty for changing rows can be reduced from a half-cycle to a quarter-cycle by issuing the precharge command at the midpoint of Phi1. It's important in this case to choose address bits for the SDRAM bank-select lines that become stable as early as possible, as reliably as possible. Any two bits from the bank address will probably be suitable.

