PostPosted: Fri Jan 18, 2019 7:36 am 
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
BTW, it might be possible to make use of both edges of a fast clock, and of two clocks in quadrature, to make it a bit easier to have all the edges you need to time things from without needing quite such a fast clock.


PostPosted: Fri Jan 18, 2019 8:07 am
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
The SDRAM itself needs a full-speed clock. The glue logic will probably make its signal transitions on the falling edge, to meet the setup and hold times that the SDRAM needs relative to the rising edge.


Last edited by Chromatix on Fri Jan 18, 2019 8:16 am, edited 1 time in total.

PostPosted: Fri Jan 18, 2019 8:15 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Ah.


PostPosted: Fri Jan 18, 2019 8:20 am
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Conversely, there are FPM DRAMs which are a little cheaper than the EDO DRAMs I considered before, in the same sizes and with the same speeds and voltages. Driving them from the 65xx bus may actually be a little easier than with EDO, because the bus turnaround after the read cycle is more favourable; fewer signals need to be deasserted to turn off the outputs.


PostPosted: Sat Jan 19, 2019 5:47 pm
Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Interesting thread! A few incidental points... (in a spirit of opening up options to consider, rather than advocating any particular choice).
Chromatix wrote:
fewer signals need to be deasserted to turn off the outputs.
Might adding a bus transceiver or FET bus switch be a worthwhile tradeoff? There's a certain cost in board area and BOM, but the cost in prop delay is small -- almost zero, in the case of a FET bus switch (and these switches are available with pinouts identical to 74xx245). Also: a transceiver may incidentally offer level translation, potentially solving a TTL-to-CMOS issue, or even opening the door to options involving multiple supply voltages.

Quote:
Dynamically stretching the Phi2 clock will distort any timers that rely directly on it, like the 6522
It's an important point. What does it buy us, and what does it cost us?

The cost of dynamically stretching Phi2 is that we must either
  • tolerate the timer issue (eg: issue a caveat about using 6522 timers), or
  • include a fix, perhaps as follows (a rough sketch appears after this list). An oscillator or Master Clock is divided by two to provide a constant, free-running Phi2 used only by chips like the 6522. The same master clock is divided by two -- or sometimes three or more -- to generate a dynamically stretchable clock for the system (CPU and DRAM). This makes half-cycle "wait states" available for occasions when a DRAM access requires one. A half-cycle wait-state also occurs if the 6522 is accessed and the system clock is presently out of step with the free-running 6522 clock. The circuit isn't complicated, so it's a fairly cheap way to supply a predictable timebase for 6522's (eg). (OTOH, program execution time won't always be easily predictable. But having reliable timers may be enough.)
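
To make that concrete, here's a minimal Python model of the half-cycle wait-state bookkeeping (the event names and structure are mine and purely illustrative -- a sketch of the idea, not a circuit description):

Code:
def stretchable_clock(events):
    """One entry per half-cycle of the Master Clock: 'dram' means the DRAM
    needs a stretch, 'via' means a 6522 access, None is a plain half-cycle.
    Yields 1 whenever a half-cycle wait state is inserted, else 0."""
    offset = 0  # half-cycles by which the system clock lags free-running Phi2
    for ev in events:
        wait = 0
        if ev == 'dram':
            wait = 1                  # DRAM access needs an extra half-cycle
        elif ev == 'via' and offset % 2:
            wait = 1                  # out of step with the 6522 clock: resync
        offset = (offset + wait) % 2  # each stretch flips the phase relationship
        yield wait

print(list(stretchable_clock([None, 'dram', None, 'via', None])))
# -> [0, 1, 0, 1, 0]: the 6522 access pays one half-cycle to get back in step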

What we buy with the fix is better DRAM performance (on the premise that half-cycle wait states do a better job of matching actual DRAM timing needs than do full-cycle wait states as provided by RDY). Arguably, an even better choice would be if the Master Clock were 4x, say (rather than 2x) -- a tradeoff of complexity vs diminishing returns. 2x is nice and simple... :)

Drifting slightly OT, another aspect whose performance benefits from sub-cycle wait states is opcode substitution -- a topic recently touched on elsewhere. Also there may well be I/O devices (or ROM's) whose timing needs are most efficiently met by sub-cycle wait states.

Parting comment on saving money by using DRAM: in many 65xx contexts, 16MB will naturally be the ceiling. But it's no great stretch to contemplate much larger spaces. And that's where DRAM will really become important!

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Sat Jan 19, 2019 7:09 pm
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Currently seeing if I can coax an ATF750C into doing the clever bits of handling an FPM or EDO DRAM. Since these are available in 5V tolerant versions, they're more relevant to high-performance goals, though the overall cost of a full 16MB array seems likely to be higher than for an SDRAM chip. Of course my practical experience with programmable logic was 20 years ago, very brief, and used PALASM rather than WinCUPL - but at the time, I *did* successfully complete the extra-credit assignment.

I'm basing the timings around a fixed 24MHz input clock, and generating the CPU's Phi2 clock within the controller. There are two dynamically selected speed modes, producing nominal 8MHz and 12MHz CPU cycles. This allows me to stretch Phi2 when required, potentially reducing the cost of a row change to half a cycle at 12MHz, and also keeps the CPU properly synchronised with the DRAM signals. The 8MHz mode has a symmetric Phi2 clock and is thus compatible with 3.3V operation; it gives single-cycle DRAM access without any clock stretching, with wait-states only when a refresh is forced by the timer (after 64 CPU cycles with no opportunistic refreshes). The speed selection pin is sampled near the beginning of Phi1, so the transition should be seamless.
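
(For anyone who wants to check the arithmetic, a throwaway Python snippet reproduces those mode timings; a "phase" here is a half-period of the 24MHz clock:)

Code:
MASTER_HZ = 24_000_000
PHASE_NS = 1e9 / (2 * MASTER_HZ)   # one phase = half-period, ~20.8 ns

for name, phases in [("slow mode", 6), ("fast mode", 4)]:
    cycle_ns = phases * PHASE_NS
    print(f"{name}: {phases} phases = {cycle_ns:.1f} ns"
          f" -> {1000 / cycle_ns:.0f} MHz CPU clock")
# slow mode: 6 phases = 125.0 ns -> 8 MHz CPU clock
# fast mode: 4 phases = 83.3 ns -> 12 MHz CPU clock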

Incidentally, I think the best solution to accurate timing is to use a device whose timers run independently of the bus cycle - which includes the 28L92 UART, as well as any true RTC chip. The 6522 is decidedly anomalous in its tight coupling to the 65xx bus.

Some external hardware will be needed, especially for getting the most out of 12MHz "fast mode", as that relies on at least one external row-address register and comparator. I'm still figuring out the details of that. The ATF750C is big enough on the inside to hold the refresh counter as well as a fairly complex state machine, but it doesn't have enough pins or storage to hold a row-address register as well, let alone multiples of them, nor to directly drive the signals that must be routed to individual DRAMs. I have to say that 10 bits is an awkward size of register to implement in discrete logic, especially in multiples.

The 24MHz input clock gives me a stream of phases 20ns or so long. One of the principal constraints is the /RAS cycle, which requires at minimum 50ns low and 30ns high; hence we need a minimum of 5 phases (three phases low = 60ns, plus two phases high = 40ns, for 100ns total) to complete a /RAS cycle - which is needed to complete either a refresh or a row-change. The /CAS cycle can be completed much more quickly; in SDRAM the CAS latency is the biggest headache, but with FPM or EDO the CAS latency is almost trivial. However, we do need to change the address muxes over between the /RAS and /CAS active edges, so having three clock edges during Phi2-high in "slow mode" turns out to be useful. Again, I need to trace out all the possibilities of what happens in "fast mode" to be sure of when the stretched clock needs to happen, but I'm hoping to minimise the penalty associated with refresh cycles.
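
(The /RAS phase budget can be sanity-checked mechanically; the 50ns/30ns minimums are the figures quoted above:)

Code:
from math import ceil

PHASE_NS = 1000 / 48               # 24 MHz -> 48M phases/s, ~20.83 ns each
ras_low  = ceil(50 / PHASE_NS)     # /RAS low  >= 50 ns -> 3 phases (60+ ns)
ras_high = ceil(30 / PHASE_NS)     # /RAS high >= 30 ns -> 2 phases (40+ ns)
print(ras_low, ras_high, ras_low + ras_high)   # 3 2 5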


PostPosted: Wed Jan 23, 2019 12:36 am
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Still haven't completely figured out the external circuitry for the highest-performance option, but I *have* managed to flesh out how to handle the different access cases in "fast mode". I basically worked out that it isn't worth attempting "opportunistic refresh" in that mode, since the controller learns that a cycle is a non-DRAM cycle too late to avoid impacting the timing of the immediately following cycle, which is likely to be a DRAM cycle (unless the system designer has arranged to minimise DRAM accesses).

The key difference in "fast mode" is that /RAS is normally held low, and only cycled high for row-changes and refreshes. In "slow mode", /RAS is normally high, and cycled low as part of an access or refresh. The tradeoff is that it takes longer to change rows when /RAS is already low, because you have to spend two phases in the RAS-precharge state before you can assert it for the row address - but when /RAS is already low and the correct row is active, it takes two phases less time to get around to asserting /CAS.

Active-page accesses and non-DRAM cycles go by in 4 phases of the 24MHz clock, giving the CPU a symmetric 12MHz clock.

Row-change cycles require 6 phases for a write, and 8 phases for a read. If I had a way to shift the state machine by a half-cycle of the 24MHz clock, I'd only need 7 phases for the read case.

Refresh cycles come in no fewer than four different flavours, and are started in Phi1 once the refresh timer expires. The flavour, however, is only determined at the end of Phi1, when the type of CPU cycle in progress becomes known. The refresh cycle begins as a "hidden refresh" which preserves the identity of the active row.

A non-DRAM refresh cycle is fortuitous, since we can complete a pure refresh over a 4-phase cycle and still be set up for a real access in time for the end of Phi1 of the following cycle.

An active-row access during the refresh cycle can be handled with only 2 extra phases, for a total of 6, regardless of whether it's a read or a write.

A row-change access coinciding with the refresh is more awkward, since I now need to cycle /RAS twice - once for the refresh itself, and again to change the active row - and I have to hold /RAS low for 3 phases to meet the timings. But I can send /CAS high during that period, and it works out to a total of 10 phases for a write, 12 for a read.

The worst-case timing is thus 3 times as long as the best-case timing. Ignoring the comparatively rare refresh cycles, the worst-case timing is still twice as long as the best case.
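
(Collecting those phase costs in one place makes the ratios easy to check; the labels are mine, the numbers are from the preceding paragraphs:)

Code:
fast_mode_phases = {
    "active-row access or non-DRAM cycle":  4,
    "row-change write":                     6,
    "row-change read":                      8,
    "refresh, non-DRAM cycle":              4,
    "refresh + active-row access":          6,
    "refresh + row-change write":          10,
    "refresh + row-change read":           12,
}
costs = fast_mode_phases.values()
print(max(costs) / min(costs))   # 3.0: worst case vs best case
print(8 / 4)                     # 2.0: the same ratio ignoring refreshes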

However, in actual 65C02/65816 code, row-change accesses can occur in at most two-thirds of all cycles by my reckoning; this worst case is illustrated by a string of Direct Page load instructions executed from the same bank as, but a different row from, Direct Page itself. The two bytes of each instruction will be accessed with one row-change-read (8 phases) and one active-row-read (4 phases), then the Direct Page access is another row-change-read, for a total of 20 phases per instruction. For the same sequence, the "slow mode" timing is 18 phases, since all cycles are 6 phases.

Hence it is possible for the "fast mode" to be slightly slower than the "slow mode" in pathological conditions. This is unfortunate, but long sequences of Direct Page reads are probably not found in real code, and refinements to the system design (particularly providing SRAM for the low portion of memory) can help avoid this case entirely.
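
(A one-liner checks that pathological example, using the phase costs above:)

Code:
fast = 8 + 4 + 8   # row-change read, active-row read, row-change read
slow = 3 * 6       # every slow-mode cycle is 6 phases
print(fast, slow)  # 20 vs 18: fast mode loses slightly on this sequence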


PostPosted: Wed Jan 23, 2019 7:28 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Have you taken one or two traces of real 6502 programs to see what kind of page-hit rate might be seen?


PostPosted: Wed Jan 23, 2019 7:56 am
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Not for this purpose, no. But I looked through the WDC manual while thinking about what the worst-case timings might be, and what could possibly trigger them - and I have some idea of what real-world 6502 code usually looks like. Most of the instructions actually used would have row-change cycles less than half the time, making the overall performance better than "slow mode" even if a single huge DRAM bank is used.

For example, performance-critical code usually includes some index increments and branches to form loops. An implicit-mode increment instruction has one opcode fetch cycle and one non-access cycle, and the following instruction (often a branch) is fetched sequentially. A taken branch can then be assumed to involve two sequential fetches and one non-DRAM cycle, and the next instruction is nearby (within ~127 bytes), so it is probably in the same DRAM row. So that's a 5-cycle sequence involving only best-case 4-phase cycles, for a total of 20, while in "slow mode" they would take 30 phases (completing two opportunistic refreshes in the process).
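
(The same phase arithmetic, applied to that loop fragment; the cycle counts are as described above:)

Code:
cycles = 2 + 3     # implicit-mode increment (2) + taken branch (3)
fast = cycles * 4  # all five cycles are best-case 4-phase cycles
slow = cycles * 6  # in slow mode every cycle is 6 phases
print(fast, slow)  # 20 vs 30 phases, in favour of fast mode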

To erase that advantage would require five same-bank Direct Page reads, where Direct Page is not backed by SRAM or a different DRAM bank. Additionally, if that read were 16-bit it would be performance-neutral between fast and slow modes.

Applying a similar analysis to subroutine calls and interrupt handlers also shows a clear advantage for even single-bank fast page mode. JSR, BRK, RTS, RTI, PHx, PLx each have some dead (non-access) cycles and/or sequential accesses which end up outweighing the overhead of the possible row changes. As these are consistently some of the slowest operations on the 6502 family, speeding them up is worthwhile.

With that said, the best pure-DRAM performance is offered by having multiple banks with independent active rows. I'm still working out how best to handle that in hardware terms.


PostPosted: Wed Jan 23, 2019 9:56 am
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Good to know that even very short sequences give a net win.

I'm fairly sure hoglet has captured the Beeb's boot sequence as a trace, perhaps here but more likely over on stardot. Such a trace could be useful for analysing the potential performance of various schemes, though I think it does contain a memory sweep and a screen clear, which might unfairly dominate. Of course, in principle pretty much any emulator should be able to provide a trace.
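
Given such a trace, a first-order page-hit estimate takes only a few lines of Python. (This sketch assumes a single DRAM bank with 10 column bits, so the open row is simply addr >> 10, and treats every traced access as a DRAM access -- adjust for the real part and system:)

Code:
def page_hit_rate(addresses, row_shift=10):
    open_row, hits = None, 0
    for addr in addresses:
        row = addr >> row_shift     # which DRAM row this access lands in
        hits += (row == open_row)   # same row as last time = page hit
        open_row = row
    return hits / len(addresses)

# A linear memory sweep is nearly all page hits, which is why it could
# unfairly dominate a boot-sequence trace:
print(page_hit_rate(range(0x2000, 0x3000)))   # ~0.999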


PostPosted: Wed Jan 23, 2019 10:18 am
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
One complication with the NMOS and CMOS 6502 is that a non-access cycle is not clearly identified by the CPU. The addresses generated for these cycles are often bogus and may refer unnecessarily to addresses far away from the action. It's probably no accident that classic micros didn't even attempt to utilise fast-page-mode (not that they needed to).

Analysing the trace as though it were running on a 65816 in emulation mode might be fruitful, but I'm not sure if there's any software already able to do that.

My own simulator currently traces all the necessary accesses, but doesn't reliably count how many non-access cycles are needed (which is necessary to check what effect refreshes have). I need to dig back into that to find out where the discrepancies lie.


PostPosted: Wed Jan 23, 2019 12:03 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Mmm, it's an interesting question whether it makes sense for a memory controller to watch the instruction stream, decode it, and figure out which accesses are real. In the limit you'd have a full 6502 inside the memory controller!

But, well before that point, it might be notable that the memory controller will always see the second access of an instruction and will generally need to act on it. In the case that the instruction takes no operand, that byte could be buffered and served up if requested next. For example, the instruction following a register increment, decrement or transfer will already have been fetched once.
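
A toy Python model of that buffering idea (the names are hypothetical, and this says nothing about whether it would fit in the available logic):

Code:
class FetchBuffer:
    """Remember the last byte read.  The CPU always reads the byte after
    an opcode, so when that byte turns out to be the next opcode (as after
    a register increment, decrement or transfer), it can be served up
    again without opening a DRAM row."""
    def __init__(self, read_dram):
        self.read_dram = read_dram    # underlying DRAM read function
        self.addr = self.data = None

    def read(self, addr):
        if addr == self.addr:
            return self.data, False   # hit: no DRAM access needed
        self.addr = addr
        self.data = self.read_dram(addr)
        return self.data, True        # miss: a real DRAM access

mem = {0x1000: 0xE8, 0x1001: 0xD0}    # e.g. INX followed by a branch opcode
buf = FetchBuffer(mem.__getitem__)
print(buf.read(0x1000))               # opcode fetch: real DRAM access
print(buf.read(0x1001))               # forced second access: real DRAM access
print(buf.read(0x1001))               # next opcode served from the buffer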


PostPosted: Wed Jan 23, 2019 1:03 pm
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
When the memory controller is stuffed into a 22V10 form factor, any storage for a cache has to be external. And really, even a small cache is a big can of worms for a 6502 system; it has to solve the same problems as the multi-bank DRAM and then some, in order to be effective.

But it looks like my sim is more reliable in its timings than I thought, though it still doesn't identify the precise cycles within each instruction which "go spare". So, I could use it to try out some real code to see how it interacts with FPM DRAM.


PostPosted: Wed Jan 23, 2019 2:18 pm
Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Ah, indeed, a 22V10 is pretty minimal!


PostPosted: Wed Jan 23, 2019 2:22 pm
Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
It's actually an ATF750C that I'm targeting, but there still aren't enough pins to implement a cache internally.

