6502.org Forum  Projects  Code  Documents  Tools  Forum
PostPosted: Sun Mar 13, 2011 6:36 am 

Joined: Sun Mar 13, 2011 4:58 am
Posts: 17
Location: Rindge NH USA
Kinda on a lark, I decided to design an MMU for the 6502. The idea was basically to allow for a protected mode, similar to the x86 (and others). Being a bit of a NES hacker, I'm familiar with basic bank swapping mechanisms, but I decided to go for something more advanced.

The main idea was to extend the address space and implement page tables in general purpose memory. I decided to use a 20-bit address bus and 256-byte pages. This requires 12-bit page addresses in the page table, which (when bumped up to a round number of bytes) leaves four bits for flags. So a page table takes two pages.
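The arithmetic behind those numbers can be checked in a few lines. This is only a sketch in Python; the constant names are made up for illustration and are not part of the design.

```python
ADDRESS_BITS = 20        # width of the extended (physical) address bus
PAGE_SIZE = 256          # bytes per page
CPU_ADDRESS_BITS = 16    # what the 6502 itself puts out

offset_bits = PAGE_SIZE.bit_length() - 1           # 8 bits select a byte within a page
page_number_bits = ADDRESS_BITS - offset_bits      # 12 bits select a physical page
entry_bytes = (page_number_bits + 7) // 8          # rounded up to whole bytes: 2
flag_bits = entry_bytes * 8 - page_number_bits     # 4 bits left over for flags
entries = 1 << (CPU_ADDRESS_BITS - offset_bits)    # one entry per CPU page: 256
table_bytes = entries * entry_bytes                # 512 bytes
table_pages = table_bytes // PAGE_SIZE             # 2 pages

print(page_number_bits, flag_bits, table_bytes, table_pages)  # 12 4 512 2
```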

So now that I had figured out how I wanted the page tables to function, I had to come up with a way to implement them. I sketched out a few different designs, but I eventually decided that the best way was to make three memory accesses every time the CPU tries to access the memory.

The first access fetches the low byte of the page table entry. The second access fetches the high byte of the page table entry. And the third access reads or writes to the mapped page.

To make this work, the clock must be run at five times the CPU speed (I think). Three external registers are defined:
  • PTR - Page Table Register
  • PAR - Page Address Register
  • PFR - Page Flags Register
The PTR points to the page table. The PAR holds the address of the page about to be accessed and the PFR holds its flags. Accessing memory works as follows:

Clock Cycle
  1. Connect the PTR, A₈₋₁₅, and a 0 to the address bus. Read memory.
  2. Load PAR₀₋₇ from the data bus.
  3. Connect the PTR, A₈₋₁₅, and a 1 to the address bus. Read memory.
  4. Load PAR₈₋₁₁ and PFR₀₋₃ from the data bus.
  5. Connect the PAR and A₀₋₇ to the address bus. Read or write memory.
It may be possible to do cycles 2 and 4 on the falling (or is it rising) clock edge and reduce the required clock speed. (But I don't know for sure.)
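The five steps can be modelled in software to check the address arithmetic. This is a rough Python sketch, not a description of the hardware. It assumes the PTR is 11 bits wide (20 address bits minus 8 index bits minus the low/high select bit), that PAR₈₋₁₁ sits in the low nibble of the second byte and PFR₀₋₃ in the high nibble. The function name and the example values are hypothetical.

```python
def translate(mem, ptr, addr):
    """Walk the in-memory page table for one 16-bit CPU address.

    mem  -- bytearray of 2**20 bytes standing in for physical memory
    ptr  -- Page Table Register (assumed to be 11 bits wide)
    addr -- 16-bit address from the CPU
    Returns (physical_address, flags).
    """
    page = (addr >> 8) & 0xFF          # A8-15 indexes the table
    offset = addr & 0xFF               # A0-7 passes through untranslated
    entry = (ptr << 9) | (page << 1)   # PTR, A8-15, and a 0
    low = mem[entry]                   # cycles 1-2: load PAR0-7
    high = mem[entry | 1]              # cycles 3-4: PTR, A8-15, and a 1
    par = low | ((high & 0x0F) << 8)   # PAR8-11 assumed in the low nibble
    pfr = high >> 4                    # PFR0-3 assumed in the high nibble
    return (par << 8) | offset, pfr    # cycle 5: PAR and A0-7


mem = bytearray(1 << 20)
ptr = 0x003                            # page table at physical $00600
# Map CPU page $12 to physical page $ABC with flags %0111.
mem[(ptr << 9) | (0x12 << 1)] = 0xBC
mem[(ptr << 9) | (0x12 << 1) | 1] = 0x7A
phys, flags = translate(mem, ptr, 0x1234)
print(hex(phys), bin(flags))           # 0xabc34 0b111
```

Every CPU access costs two table reads plus the real access, which is where the three memory accesses per CPU cycle come from.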

I've developed a little schematic of the basic workings of the idea, which you can look at here. It's somewhat oversimplified and is missing a few things, but I think it conveys the gist of it. (In case you're wondering, PXR0 and PXR1 are 74377s. PAR is PXR0₀₋₇ and PXR1₀₋₃. PFR is PXR1₄₋₇.)

Notably missing are the write line handling (the write line should be ANDed with C₅) and the clocking of the bus. The bus should be clocked at C₁, C₃, and C₅, but I'm not sure that's safe as the clock cycles would be of inconsistent length.

Also missing is all of the flag handling logic, which ought to be pretty simple. Basically, at cycle 5, if a fault is detected, an interrupt is raised, rather than reading the memory.
Code:
if PFR₀ == 0: fault                 # Present
if PFR₁ == 0 and RWB == 0: fault    # Write
if PFR₂ == 0 and SYNC == 1: fault   # Execute (SYNC is high during an opcode fetch)
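For simulation purposes, the same checks can be written as a small Python function. This is a sketch under a couple of assumptions: the flag bit assignments follow the pseudocode, RWB is low for a write, and SYNC is high during an opcode fetch (which is how the 6502 pin behaves). The function and constant names are invented here. Note that SYNC only marks the opcode byte itself, so operand fetches from a no-execute page would not be caught by this test.

```python
PRESENT, WRITABLE, EXECUTABLE = 0x1, 0x2, 0x4   # PFR0, PFR1, PFR2


def check_fault(pfr, rwb, sync):
    """Return the name of the fault for this access, or None if it is allowed.

    pfr  -- 4-bit flags nibble from the page table entry
    rwb  -- 6502 R/W pin: 1 = read, 0 = write
    sync -- 6502 SYNC pin: 1 during an opcode fetch
    """
    if not pfr & PRESENT:
        return "not present"
    if not pfr & WRITABLE and rwb == 0:
        return "write"
    if not pfr & EXECUTABLE and sync == 1:
        return "execute"
    return None


print(check_fault(0b0001, rwb=0, sync=0))   # write
print(check_fault(0b0111, rwb=1, sync=1))   # None
```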

Speaking of interrupts, PTR really ought to be two registers. One for user mode and one for kernel mode. However, I haven't quite worked out the details of that just yet. (Hopefully I can do it without requiring more multiplexers.)

Some parts of the schematic haven't been fully fleshed out yet. In particular, the multiplexers. The only suitable IC that I'm aware of (which isn't saying much) is the 74157 4-bit 2:1 multiplexer, which isn't so bad for the data multiplexer. However, the address multiplexer would require five of them, for a grand total of seven, which is sure to be messy. Any suggestions for a better solution would be greatly appreciated.

Finally, I don't really know how to split the clock in five (or three if that's viable), so there's just a black box. I suppose I could use a decade counter and some AND gates. Although, that wouldn't help with splitting the clock in three (which seems preferable, if it'd work).

So please take a look and let me know what you think. Suggestions for improvement would be very much appreciated. Also, this is the first complicated electronic thing I've ever designed, so please point out any newbie errors I may have made. Thanks.

Anyway, I'm off to read more datasheets (especially the timing diagrams) to try and see where the inevitable bugs are.

_________________
@loop: lda (src),y — sta (dst),y — iny — bne @loop — inc src+1 — inc dst+1 — dex — bne @loop


PostPosted: Sun Mar 13, 2011 8:16 am 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
I don't claim to know anything about protected mode and MMUs, but be sure you check out the 65816 before going to the lengths you're talking about. It has a 24-bit address space in 256 banks of 64K each, and is much better suited for multitasking and relocatable code. It does have an "abort" interrupt for the memory faults you're talking about.

There's a similar topic at viewtopic.php?t=620 .


PostPosted: Sun Mar 13, 2011 10:35 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Hi Karatorian,
Just a couple of hints. I agree with Garth that if you're looking for results with least effort, 65816 can do a lot more than 6502. But if you're intent on flexing your design-from-scratch muscles:

I suggest you arrange to simulate this design before you build it: you could use the free Xilinx tools to capture the design in an HDL or schematics and then simulate using isim. You might use the T65 model for the 6502. See this thread. You don't have to be intending to use an FPGA - just use the tools to help develop and debug your design. Of course, there are other free simulators - Icarus Verilog, IRSIM.

If you use tristate drivers and tristate registers, you won't need the mux(*) - and you can get them in octal sizes. That should help. See 74x574 and 74x244

Your plan is to clock the CPU rather slowly, and have a faster clock which controls the MMU? Another possibility (with a CMOS 6502) is to use RDY to stall the CPU on each access.

I'm not clear on your clocking scheme - it feels to me like you need 3 accesses therefore only 3 clock cycles using the normal 6502 approach. (But this might be premature optimisation.)
1. (read) PXR0 captures middle byte of memory address, using PTR and high byte of address from CPU
2. (read) PXR1 captures high byte of memory address (includes flags)
3. (read or write) using PXR1, PXR0 and low byte of address from CPU

At the beginning of each clock cycle you present the address to memory and at the end you have the data you wanted.

Cheers
Ed
(*) I'm not certain, because I haven't thought carefully about the way you steer the addresses, which isn't aligned by bytes. But if you don't interleave the low and high bytes of the page table, you could I think align the addresses, and deal throughout with high byte/middle byte/low byte.

Edit: typo


Last edited by BigEd on Mon Mar 14, 2011 11:11 am, edited 1 time in total.

PostPosted: Sun Mar 13, 2011 2:06 pm 

Joined: Sun Mar 13, 2011 4:58 am
Posts: 17
Location: Rindge NH USA
Oh, I'm aware of the 65816, but it kinda goes against the grain of this idea. Basically, the idea was to build a full-featured MMU for a processor that was never designed to have one. It's kinda a silly bit of overkill, but it's interesting from a design perspective. To be honest, the idea (to make a 6502 MMU, not the implementation details) came to me when I was half asleep after hacking on some x86 page table code.

As I don't intend on building this thing in hardware anytime soon (I have a much less ambitious design I'm gonna play with first), the simulation pointers are much appreciated. (I'd been thinking about simulating it and my other project already.)

I'd thought a little bit about how tri-stating could be used to remove the need for the muxers (which I used with the PXR registers), but I wasn't aware of the ICs you mention (or how to use them). So thanks for the pointer, I'll have to look them up. (Which just goes to show how out of my league I am here. My electronics knowledge mainly comes from knowing way too much about computers, rather than actual experience.)

As for the clocking, yes, the idea is to run the CPU slow and the memory fast (the exact opposite of the state of the art these days). You're probably right about only needing 3 clocks per CPU cycle. When I was designing it at the conceptual level, I only had three. I added the other two because the value on the data lines needs to be clocked into the PXR registers sometime after they stabilize. There's probably a better way to do that than spending a whole clock cycle on it.

I'll have to look into using the RDY line instead. That might simplify the design somewhat. Also it may allow for adding a cache of some sort later, which wouldn't require 3 cycles for every memory access.

These insights should help me improve the design a lot. Thanks for the feedback.

_________________
@loop: lda (src),y — sta (dst),y — iny — bne @loop — inc src+1 — inc dst+1 — dex — bne @loop


PostPosted: Sun Mar 13, 2011 8:26 pm 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
I understand that the PTR holds the high byte of a 512-byte table. In that table the CPU high address byte (old A8-15) is the index to read two bytes that form a 12-bit address extension (new A8-19) and flags, right?

That is what modern CPUs do too, but they use lookup tables (LUTs) to cache the translation results.

I have built a system like that myself, basically only using a "LUT" as translation mechanism. I'm using 4k pages, with an 8 bit extension (also coming to 20 address bits). Optionally I can use read-only, no-execute and not-mapped flags as well (as a later add-on).
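As a rough illustration of a LUT-only translation with those parameters (4K pages, 8-bit extension), here is a Python sketch. It is not taken from the actual board; the function name and the example mapping are made up.

```python
def lut_translate(lut, addr):
    """Translate a 16-bit CPU address through a 16-entry lookup table."""
    page = addr >> 12                    # top 4 bits pick one of 16 entries
    offset = addr & 0x0FFF               # low 12 bits pass straight through
    return (lut[page] << 12) | offset    # the 8-bit entry becomes A12-19


lut = [0] * 16
lut[0x1] = 0xAB                          # CPU $1000-$1FFF -> physical $AB000-$ABFFF
print(hex(lut_translate(lut, 0x1234)))   # 0xab234
```

With only 16 entries the table fits in a small fast RAM, so the lookup adds no extra memory access.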

Here is the board description: http://www.6502.org/users/andre/csa/cpu/index.html and here http://www.6502.org/users/andre/icapos/mmu65.html some more information about my approach.

An advantage of my approach is that it does not need extra memory accesses per CPU access. A disadvantage of my approach is that it needs a complete MMU reload on context switch, while you only reload the PTR with a new value.

BTW: I use a second 6502 to catch error conditions (see the AUXCPU board http://www.6502.org/users/andre/csa/auxcpu/index.html ) to somewhat replace the missing ABORT input of the 6502.

André


PostPosted: Mon Mar 14, 2011 3:21 am 

Joined: Tue Nov 18, 2003 8:41 pm
Posts: 250
fachat wrote:

An advantage of my approach is that it does not need extra memory accesses per CPU access. A disadvantage of my approach is that it needs a complete MMU reload on context switch, while you only reload the PTR with a new value.

André


I think we've covered this ground here before. (somewhere)

My thought was to build the address translation with fast cache RAMs.
So for example you might use 32K RAMs doing an 8-to-16-bit address
translation. With 256-byte pages, that would give you room for 128 tables.
Then you'd set up the tables beforehand and switch among them with
a single byte.
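A quick Python sketch of that arrangement, assuming the table select byte forms the upper address bits of the translation RAM and the CPU page number forms the lower eight. The names are hypothetical.

```python
TABLES = 128
ENTRIES_PER_TABLE = 256                   # one entry per 256-byte CPU page
table_ram = [0] * (TABLES * ENTRIES_PER_TABLE)   # 32K entries of 16 bits each


def bank_translate(table_select, addr):
    """Translate a CPU address using the table chosen by table_select (0-127)."""
    page = addr >> 8                      # 8 bits in ...
    entry = table_ram[(table_select << 8) | page]
    return (entry << 8) | (addr & 0xFF)   # ... 16 bits out, plus the page offset


table_ram[(5 << 8) | 0x12] = 0xBEEF       # in table 5, CPU page $12 -> page $BEEF
print(hex(bank_translate(5, 0x1234)))     # 0xbeef34
```

A task switch is then a single write to the table select register, with no table reload.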


PostPosted: Mon Mar 14, 2011 9:05 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Here's an idea for a hybrid approach: if the memory system is at least two chips, and fast enough, you can have a 16-bit wide datapath to load both PXR registers in the first phase, and then complete the CPU access during the second phase. (You have to steer that final access to the appropriate 8 bits of the memory data bus, or the appropriate bank if you think of it that way.)

No need to change the CPU clocking, and you retain the in-memory pagetable.

(You can get 128k x8 sram in PDIP with 55ns access time - higher densities typically in surface mount packages.)

(Very nice write-up from André!)


PostPosted: Mon Mar 14, 2011 10:54 am 

Joined: Sun Mar 13, 2011 4:58 am
Posts: 17
Location: Rindge NH USA
One of my earlier designs actually used a dedicated RAM to store the page table, however, I decided against it because of the need to reload the entire table every task switch. If one used larger page sizes, the overhead of doing so wouldn't be too bad, but 256-byte pages just feel natural on the 6502 to me. However, they require a fairly large page table to use.

As for making the page table RAM larger, so as to hold multiple page tables, I considered it, but I didn't really bring the idea to its full potential. I only considered holding two page tables, one for supervisor mode, and one for user mode. That would have avoided a page table load on interrupts and system calls. However, it still would have required page table loads on task switches. The idea of having multiple user mode page tables would definitely speed things up. (Honestly, it's probably a more realistic design. However, somehow the idea of keeping the page tables in general purpose memory appeals to me.)

I actually thought of using a 16-bit RAM and loading the PXR registers in one cycle. However I decided that it would be simpler to keep the data bus eight bits wide. On the other tentacle, twenty-bit addressing and sixteen-bit data is what the 8086 used, so one could use ISA as a bus standard, which could be a bonus in the end.

Thanks for all the feedback. I'm currently messing around with gschem in an attempt to produce better schematics and incorporate some of the suggestions for improvement.

@fachat: I actually checked out your design when I was looking for examples of 6502 boards to get a feel for how things work. I'm very impressed, especially with the MMU. In fact, the fact that you used a look up table only was part of the reason I decided not to. I kinda wanted to do something different.

_________________
@loop: lda (src),y — sta (dst),y — iny — bne @loop — inc src+1 — inc dst+1 — dex — bne @loop


PostPosted: Mon Mar 14, 2011 11:44 am 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
Just quickly: you could use a LUT and use one bit in the LUT to define if the LUT entry is valid, and if not fall back to your read-from-main-memory approach (using the LUT as cache). That is, as far as I understand, what current CPUs do.
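To make the idea concrete, here is a small Python model of a LUT with a valid bit in front of an in-memory page table. It is only a sketch: the class and method names are invented, and the entry layout (two bytes per page at PTR:page:0 and PTR:page:1, low 12 bits the page address, top 4 bits the flags) is an assumption.

```python
class CachedMMU:
    """LUT used as a translation cache; misses fall back to a table walk."""

    def __init__(self, mem, ptr):
        self.mem = mem
        self.ptr = ptr
        self.lut = [None] * 256          # None stands in for a cleared valid bit
        self.walks = 0                   # how often main memory had to be read

    def set_ptr(self, ptr):
        self.ptr = ptr
        self.lut = [None] * 256          # task switch: invalidate every entry

    def translate(self, addr):
        page = addr >> 8
        entry = self.lut[page]
        if entry is None:                # miss: two extra reads, then cache
            base = (self.ptr << 9) | (page << 1)
            entry = self.mem[base] | (self.mem[base | 1] << 8)
            self.lut[page] = entry
            self.walks += 1
        return ((entry & 0x0FFF) << 8) | (addr & 0xFF), entry >> 12


mem = bytearray(1 << 20)
mem[(3 << 9) | (0x12 << 1)] = 0xBC       # CPU page $12 -> page $ABC, flags %0111
mem[(3 << 9) | (0x12 << 1) | 1] = 0x7A
mmu = CachedMMU(mem, ptr=3)
first = mmu.translate(0x1234)
second = mmu.translate(0x1256)
print(hex(first[0]), hex(second[0]), mmu.walks)   # 0xabc34 0xabc56 1
```

One thing the sketch leaves out: a write to the page table itself would also have to invalidate the matching LUT entry.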

André


PostPosted: Mon Mar 14, 2011 6:21 pm 

Joined: Sun Mar 13, 2011 4:58 am
Posts: 17
Location: Rindge NH USA
fachat wrote:
Just quickly: you could use a LUT and use one bit in the LUT to define if the LUT entry is valid, and if not fall back to your read-from-main-memory approach (using the LUT as cache). That is, as far as I understand, what current CPUs do.

Well, they actually use content addressable memory for the cache (at least Intel does), but yeah, that's basically how they work.

_________________
@loop: lda (src),y — sta (dst),y — iny — bne @loop — inc src+1 — inc dst+1 — dex — bne @loop


PostPosted: Tue Mar 15, 2011 2:04 pm 

Joined: Mon Dec 08, 2008 6:32 pm
Posts: 143
Location: Brighton, England
Karatorian wrote:
Some parts of the schematic haven't been fully fleshed out yet. In particular, the multiplexers. The only suitable IC that I'm aware of (which isn't saying much) is the 74157 4-bit 2:1 multiplexer, which isn't so bad for the data multiplexer. However, the address multiplexer would require five of them, for a grand total of seven, which is sure to be messy. Any suggestions for a better solution would be greatly appreciated.


You don't need to use multiplexers if you use registers with tri-state outputs; you can simply enable the appropriate chip onto the bus. Alternatively, use a three-state buffer to enable something that isn't already three-state onto the bus.

Karatorian wrote:
Finally, I don't really know how to split the clock in five (or three if that's viable), so there's just a black box. I suppose I could use a decade counter and some AND gates. Although, that wouldn't help with splitting the clock in three (which seems preferable, if it'd work).


Take a look at the datasheet for the 74HC4017 chip. This handy little chip counts and decodes a clock into individual outputs. It's a decade counter, but you can easily make it count to less than 10 by connecting the appropriate output to the reset pin.
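The behaviour of a decoded counter with an output fed back to reset can be modelled like this. It is a behavioural Python sketch only (the function name is made up), and it ignores the short pulse on the fed-back output that the asynchronous reset produces in real hardware.

```python
def one_hot_counter(modulus, ticks):
    """Yield the index of the active output on each clock tick.

    Feeding output Q[modulus] back to reset makes the count wrap after
    `modulus` states instead of ten.
    """
    if not 1 <= modulus <= 10:
        raise ValueError("a single decade counter has ten outputs")
    state = 0
    for _ in range(ticks):
        yield state
        state += 1
        if state == modulus:      # Q[modulus] goes high and resets the chip
            state = 0


print(list(one_hot_counter(5, 12)))   # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1]
print(list(one_hot_counter(3, 7)))    # [0, 1, 2, 0, 1, 2, 0]
```

So the same part covers both the five-phase and the three-phase scheme; only the feedback connection changes.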

_________________
Shift to the left,
Shift to the right,
Mask in, Mask Out,
BYTE! BYTE! BYTE!


PostPosted: Fri Mar 18, 2011 10:56 pm 

Joined: Tue Nov 18, 2003 8:41 pm
Posts: 250
bogax wrote:
My thought was to build the address translation with fast cache RAMs.
So for example you might use 32K RAMs doing an 8-to-16-bit address
translation. With 256-byte pages, that would give you room for 128 tables.
Then you'd set up the tables beforehand and switch among them with
a single byte.


It occurs to me to mention another thought I had (and I think Garth
Wilson has suggested something like this before).

That is to rig, say, one page so that writes go to the mapper and reads
come from huge lookup tables.
So you could, e.g., write a couple of bytes to particular locations and then
read various functions of those two bytes at particular locations.
I'm thinking stuff like multiplication tables, squares, sines, reciprocals, logs, etc.,
but I expect you could rig it so that it was any function you cared
to calculate, i.e. it needn't just be looking up stuff in ROM; it'd be a different
kind of mapping that you could set up for whatever you needed.
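Seen from the software side, such a page might behave something like the Python model below. Everything in it is hypothetical: the offsets, the choice of functions, and the class name. Plain arithmetic stands in for what would be table lookups in hardware.

```python
class FunctionPage:
    """Model of one page where writes latch operands and reads return lookups.

    Hypothetical layout: write A to +0 and B to +1, then read the 16-bit
    product from +0/+1 and the 16-bit square of A from +2/+3.
    """

    def __init__(self):
        self.a = 0
        self.b = 0

    def write(self, offset, value):
        if offset == 0:
            self.a = value & 0xFF
        elif offset == 1:
            self.b = value & 0xFF

    def read(self, offset):
        # In hardware each result byte would come from a ROM or RAM addressed
        # by the latched operands; arithmetic stands in for the table here.
        results = {
            0: (self.a * self.b) & 0xFF,
            1: (self.a * self.b) >> 8,
            2: (self.a * self.a) & 0xFF,
            3: (self.a * self.a) >> 8,
        }
        return results.get(offset, 0xFF)   # unmapped offsets read as $FF here


page = FunctionPage()
page.write(0, 200)
page.write(1, 123)
product = page.read(0) | (page.read(1) << 8)
print(product)                             # 24600
```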

