At this point in time, hardware development on POC V1 is essentially a done deal. I've fixed all known hardware bugs, SCSI DMA access via IRQs is working, TIA-232 I/O has been refined to run at high speeds with less buffer space, and the unit is stable at 10 MHz, although it can run faster without the SCSI host adapter installed (more about this below). I will continue to use POC V1 to refine my driver primitives, as they will be applicable to later development, as well as perform general coding concept testing. The hardware focus is now on POC V2.
The twin goals for POC V2 are to achieve full-throttle (20 MHz) operation and to create a hardware environment that can support some sort of preemptive multitasking kernel. Getting the unit to run at full tilt is really not that difficult: use programmable logic to avoid performance-killing propagation delays, and carefully lay out the board to minimize distributed capacitance, ground bounce, skewed signal slew, etc. The multitasking part won't be quite so simple.
Just to recapitulate a bit for anyone who hasn't read earlier posts in this topic, my original plan for POC V2 was to set up a segmented memory model, in which the glue logic (CPLD) would create a multitude of isolated RAM segments spanning from $xx0000 to $xxBFFF, where xx is a bank number from $00 to $FF, assuming sufficient RAM is present (16 MB maximum). Addresses above $BFFF would always be some combination of common RAM, ROM and/or I/O, the exact mapping being under the control of a "hardware management unit" or HMU, a virtual device encapsulated in the CPLD. The 65C816's A16-A23 (bank) bits would play no role in setting up addresses—only the HMU would determine what the '816 sees at any given time. I had developed code many years ago for a 65C02 powered SBC that had a similar, though less complete, implementation of this model.
The segmented memory model has a number of potential advantages, not the least of which is that memory protection is practical. Segmentation can be arranged to run each process in a multitasking environment in a hardware-defined "sandbox," preventing all processes except the kernel itself from touching anything in the system that they shouldn't. The CPLD logic, upon detecting an illegal memory access, would toggle the '816's ABORT input. The logic to do so is not especially complicated and, if correctly timed, would guarantee that a user-space process is halted before the access violation actually occurs. The kernel, upon handling the ABORT interrupt, could kill the process, much as a segmentation fault (SIGSEGV) does in UNIX or Linux.
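The software side of that could be pretty minimal. Below is a rough sketch of what I have in mind; the names (curpid, pterm, dispatch) are inventions, as none of the kernel exists yet.
Code:
;rough sketch of the ABORT handler -- curpid, pterm & dispatch are
;made-up names for kernel pieces that don't exist yet
;
abortv   rep #%00110000        ;16-bit registers
         lda curpid            ;whoever tripped the access fault...
         jsr pterm             ;...gets marked for termination
         jmp dispatch          ;its ABORT frame dies with its stack; the
                               ;scheduler picks someone else to run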
Segmentation also makes it relatively easy to give each process an isolated zero page (ZP) and stack. In fact, any process would be free to set its stack and ZP anywhere in its segment that it chooses. With isolated ZPs and stacks, context switches would be relatively easy and fast—little more than saving and loading MPU registers and mapping in a different segment.
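For example, a segment-style context switch might not amount to much more than the following sketch. The HMU segment-select register (hmu_seg) and the PCB layout are strictly imaginary at this point, and it assumes the interrupt front end has already pushed .A, .X and .Y on the outgoing process' stack.
Code:
;rough sketch of a segment-style context switch -- hmu_seg, curpcb,
;newpcb & the pcb_* field offsets are all made up for illustration
;
cswitch  rep #%00110000        ;16-bit registers
         ldx curpcb            ;outgoing process' PCB
         tsc
         sta pcb_sp,x          ;save its stack pointer...
         tdc
         sta pcb_dp,x          ;...and its direct (zero) page pointer
;
         ldx newpcb            ;incoming process' PCB
         stx curpcb
         sep #%00100000        ;8-bit accumulator
         lda pcb_seg,x         ;incoming process' segment number...
         sta hmu_seg           ;...tells the HMU which RAM segment to map
         rep #%00100000        ;back to 16 bits
         lda pcb_dp,x
         tcd                   ;incoming direct page
         lda pcb_sp,x
         tcs                   ;incoming stack, now in the mapped segment
         ply                   ;registers pushed when this process was
         plx                   ;last suspended...
         pla
         rti                   ;...and away it goes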
However, segmentation is not without its drawbacks. Since a user-space process is not allowed to access another process's segment, the time-honored method of calling a kernel API as a subroutine isn't going to work without some serious logic gyrations. The JSR would cause a hardware exception by virtue of accessing "privileged" RAM (the kernel's space, in this case). Allowing such access would mean a special case would have to exist in the logic to permit a JSR to a kernel API address, but nowhere else outside of user process space. I won't even get into the fact that the '816 actually has three different JSR-type instructions (JSR <addr>, JSR (<addr>,X) and JSL), two of which could cause major havoc if incorrectly used. Hence a lot of CPLD capability could be consumed trying to enforce this very narrow access rule.
The alternate method of accessing the kernel API would be via software interrupts, in which an "INT N" call (a BRK instruction followed by a signature byte N, any non-zero eight-bit value) would turn control over to kernel API number N. This method is actually advantageous because applications don't have to know the kernel's jump table addresses. They only have to know which interrupt number does what, making it possible to relocate the kernel in memory without breaking all the applications.
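Something along these lines is what I'm picturing, shown here only as a sketch: the API number, the kernel direct page (kerndp), apiptr and apitable are all made up, and the front end is assumed to live in the always-mapped common area so the caller's code is still visible when the signature byte is fetched.
Code:
;rough sketch of the software interrupt API
;
getchar  =3                    ;imaginary "get a character" API number
;
;--- caller, somewhere in user space ---
         brk                   ;trap into the kernel...
         .byte getchar         ;...signature byte selects the API
         ;execution resumes here when the kernel RTIs
;
;--- kernel BRK front end (in the always-mapped common area) ---
brkapi   rep #%00110000        ;16-bit registers
         pha                   ;protect the caller's .A...
         phd                   ;...& its direct page pointer
         lda #kerndp
         tcd                   ;switch to the kernel's direct page
         lda 6,s               ;pushed return address...
         dec a                 ;...minus 1 = address of the signature byte
         sta apiptr
         sep #%00100000        ;8-bit accumulator
         lda 8,s               ;pushed program bank completes the pointer
         sta apiptr+2
         lda [apiptr]          ;fetch the API number the caller supplied
         rep #%00100000
         and #$00ff            ;discard the stale high byte
         asl a                 ;two bytes per table entry
         tax
         jmp (apitable,x)      ;dispatch -- each API eventually restores
                               ;.D & .A, then RTIs back to the caller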
However, an interrupt-driven API creates an awkward situation in which the stack on which the MPU pushes the program counter (PC) and status register (SR) in response to the BRK instruction is that of the calling process, not the kernel. Changing context to the kernel and its stack wouldn't be possible until after the MPU had reacted to BRK, pushed PC and SR, and jumped through the BRK hardware vector. Once kernel space is in context (it would be in the range $000000-$00BFFF in this model), though, the process space will be invisible and accessing the process' stack for any reason (e.g., parameter passing) would involve some potentially ugly programming.
Also, segmentation doesn't completely prevent undesirable hardware behavior in all situations. For example, if a user process sets the stack pointer to an address outside of process space, the resulting hardware exception will cause the MPU to "auger in" due to the illegal access that results from trying to use privileged space as a stack; the MPU would be repeatedly hammered with ABORT interrupts. The '816 has no privileged (aka "supervisor") instructions, so it can't be told when it is or isn't allowed to accept changes to key registers, such as the stack pointer. This potential nightmare could be averted by monitoring the data bus during the opcode fetch part of an instruction sequence and looking for anything that would change the stack pointer, ZP pointer, etc. Were such an instruction detected, a hardware fault could be raised, same as for a memory access violation. Again, the CPLD code required to do so isn't trivial. And, in some cases, being able to change the stack and ZP pointers could be desirable.
When all is said and done, it would appear that segmentation is not an ideal arrangement. However, it is possible with some tricky glue logic.
The other memory model is linear or "flat" addressing. In this model, the bank bits presented on D0-D7 while Ø2 is low and the MPU indicates a valid address would be used as part of the effective address, with CPLD code translating the bank bits into the appropriate memory chip selects. It's not at all complicated. In this model, a process, kernel or user, would be assigned a 64 KB address space for code, data and storage, which in a maxed-out system would theoretically support 256 processes at any one time (but not really, since address space will be needed for other things).
Memory protection would be achieved by having the CPLD watch for any address in which the bank bits are not the same as the bank in which a user-space process is running. If such a condition is detected, an abort would be triggered. Kernel-space processes would not be subject to this policing—the process table will have flags to tell the CPLD when a kernel-space process has control.
In this memory model, a kernel API function can be called with a software interrupt and the kernel will be able to access the caller's stack space for parameter-passing purposes. This characteristic would also allow the kernel to easily change the RTI address pushed to the process' stack when an access violation triggers an abort, making the killing of a rogue process fairly easy.
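A rough sketch of both ideas, with param and badprx standing in for kernel symbols that don't exist yet:
Code:
;rough sketch of flat-model parameter access & rogue-process disposal --
;assumes nothing has been pushed since the trap, so the native mode
;frame is:  1,S = SR   2,S & 3,S = PC   4,S = PB
;with anything the caller pushed starting at 5,S
;
         rep #%00100000        ;16-bit accumulator
         lda 5,s               ;a word the caller pushed just before BRK
         sta param             ;hand it to the API
;
;if the caller has to die, aim its return address at the executioner:
         lda #badprx           ;kernel's "terminate calling process" routine
         sta 2,s               ;replaces the pushed return address
         sep #%00100000        ;8-bit accumulator
         lda #$00
         sta 4,s               ;pushed program bank now says bank $00
         rti                   ;"resumes" in badprx, not in the caller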
Linear addressing also gives more options in the number and placement of I/O buffers. This has file access performance implications, as any process using buffered file access could have local buffers that the kernel can empty or fill during disk access with the MVN/MVP instructions, which are very fast and take very little code. Other features, such as shared memory, semaphores and pipes, could likewise be implemented without too much difficulty.
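For example, completing a buffered read might boil down to something like this; the buffer names, bank numbers and byte count are arbitrary, and the operand order shown is the WDC convention (source bank first), which some assemblers reverse:
Code:
;rough sketch of a kernel-to-process buffer copy with MVN
;
         phb                   ;MVN will change the data bank, so save it
         rep #%00110000        ;16-bit registers
         ldx #diskbuf          ;source: kernel's disk buffer in bank $00
         ldy #userbuf          ;destination: the process' local buffer
         lda #512-1            ;MVN wants the byte count minus one
         mvn $00,$05           ;copy from bank $00 into bank $05
         plb                   ;restore the data bank register
;MVN leaves .X & .Y pointing just past the copied data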
Linear addressing also presents some significant complications. In such a system, all ZP and stack activity would occur in bank $00, as that characteristic is hard-wired into the MPU's design. This is not good! One problem is that with everything going to bank $00 for ZP and stack space, a hard limit would exist on how many processes could be running at any one time without resorting to swapping idle processes to disk. A good swapping algorithm is not a trivial coding exercise, and the required disk accesses would have to be implemented in a way that is both transparent to other system activity and fast. Having SCSI helps in this regard, but it is only going to go so fast, no matter how well I write my drivers. So it would be possible to work with the bank $00 limitation, but not particularly easy.
A much uglier issue with linear addressing is that bank $00 would have to be read/write enabled for all processes. This gives rise to the potential for a rogue process to start gobbling up bank $00 space and stepping on other processes' stacks and ZPs. A related issue is the specter of a program error inadvertently pointing the ZP or stack pointer into another process' space. As I alluded to earlier, the 65C816 doesn't have any "rings of privilege" to prevent a program from executing an instruction that it shouldn't. So there is really nothing to stop a process from pointing the ZP and stack anywhere it wants and trashing the system.
If I am to use a flat memory model, I will have to somehow make up for the bank $00 issues in glue logic. I have some very vague ideas on what would have to happen, but haven't been able to bring them into focus as yet.
Meanwhile, I went to work on revised POC V2 hardware. The schematics follow:
Attachment: POC V2 Schematic: Memory Map (sbc_p1.gif)
Attachment: POC V2 Schematic: I/O & HMU Decoding (sbc_p2.gif)
Attachment: POC V2 Schematic: MPU Interface (sbc_p3.gif)
Attachment: POC V2 Schematic: RAM, ROM & I/O (sbc_p4.gif)
Attachment: POC V2 Schematic: External Interface (sbc_p5.gif)
Due to the forum software's five attachment limit per post, the preliminary printed circuit board layout is in the next post.
Salient features of this design are:
- 1024 KB of RAM, implemented with two 512 KB SRAMs in a flat memory model. Addressing ranges from $000000 to $0FFFFF. Pages 1 and 2 of the schematic give the details. The first 64 KB of RAM is "base RAM" and everything above is "extended RAM." I used those terms so I could easily refer to these ranges in the CPLD code. Functionally speaking, this design could work with just base RAM, but would then effectively be nothing more than POC V1 with faster logic.
- Glue logic in an Atmel 1508AS CPLD. The 1508AS is a fairly mature product that can support 3.3 or 5 volt I/O operation. The core itself runs at 5 volts. Currently it appears the 1508AS may be a bit of overkill, but that's better than not having sufficient resources. Incidentally, Atmel makes a test rig for this device (and many of their other 1500 series CPLDs) that doubles as a programmer. I decided to invest in one rather than scratch-design and build a JTAG device.
- Segmented ROM. The $00E000-$00FFFF ROM range will contain reset code, the BIOS and a basic interrupt handler. The $00C000-$00CFFF range will be the new home of the machine language monitor. Either or both ROM segments can be mapped out to expose the RAM below. A write to any ROM address will automatically bleed through to RAM, so the BIOS segment could copy itself to RAM and then map itself out and continue running from RAM. This would be effectively like "BIOS shadowing" in the PC architecture, with a corresponding performance improvement.
- I/O at $00D000, with eight device selects available. There will also be RAM under the I/O block that can be exposed if for some as-yet-unknown reason I decide that 4 KB chunk of RAM is good for something. Incidentally, the $00C000 and $00E000 RAM blocks can be write-protected in this scheme. Write-protecting the $00E000 block will be important, as the kernel's interrupt handlers and the MPU vectors will be in this range.
- Hardware memory protection. Once this hardware is working, I may look into policing attempts to execute potentially destructive instructions (e.g., stack pointer changes), assuming there are enough P-terms left in the CPLD when everything else has been handled. If I can do that, I may be able to solve the bank $00 dilemma.
- ROM and I/O wait-stating. At any speed above 12.5 MHz wait-stating will be necessary. As the access times of the ROM and I/O devices are about the same, a one-size-fits-all wait-state period should be okay for initial testing. I did wire an external jumper (EWS) into the circuit which, if shorted, will add an extra wait-state without having to reprogram the CPLD.
- Priority interrupt controller (PIC), with eight IRQs available. I don't actually need that many, but figuring out how to make it work is a good programming exercise. The interrupt controller will eliminate the need to poll each device to see if it's interrupting, which will certainly help performance. As with the HMU, the interrupt controller is a virtual device within the CPLD. The controller will present an IRQ number ($00-$07) on the data bus when interrogated, and will ultimately yank on the MPU's IRQB input to produce the interrupt. That CPLD output will be actively driven in both directions so slew time in the IRQ circuit doesn't get in the way of high-speed interrupt processing (e.g., when an octart has all eight channels going full blast). A rough sketch of the dispatching code appears after this list.
- Active bus drivers. Although these were not present in POC V1, testing has revealed that they will be required in V2. More on this below.
- Backward compatibility to POC V1—sort of. The physical layout of POC V2's printed circuit board will allow me to plug in my existing SCSI host adapter (HBA). A small patch will have to be made to the HBA to account for the use of separate IRQs for the SCSI controller and the watchdog timer. Otherwise, it'll work like it does in POC V1.
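Here's the sketch promised in the interrupt controller item above, showing how the IRQ front end could work with the PIC. pic_stat (the register that returns the active IRQ number) and the handler names are placeholders, since the real register map hasn't been pinned down:
Code:
;rough sketch of IRQ dispatch through the PIC
;
irq      phb                   ;the foreground task's data bank may not
         phk                   ;be $00, so borrow the program bank...
         plb                   ;...while talking to the PIC
         rep #%00110000        ;16-bit registers
         pha                   ;protect the foreground task's .A & .X
         phx
         sep #%00100000        ;8-bit accumulator
         lda pic_stat          ;highest priority active IRQ, $00-$07
         rep #%00100000
         and #$00ff            ;discard the stale high byte
         asl a                 ;two bytes per vector
         tax
         jsr (irqtab,x)        ;run that device's handler
         rep #%00110000
         plx                   ;restore & resume the foreground task
         pla
         plb
         rti
;
irqtab   .word uart_irq        ;IRQ0: octart
         .word scsi_irq        ;IRQ1: SCSI host adapter
         .word wdog_irq        ;IRQ2: watchdog timer
         ;...five more entries; the IRQ assignments above are made up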
When I built the first SCSI HBA, I quickly found that I had to slow down the Ø2 clock in order for the unit to work. During 'scope observations of POC V1's buses with the SCSI HBA installed, it soon became clear that the MPU was having trouble driving the buses hard enough to maintain acceptable slew rates. The problem is that the HBA effectively takes the entire data bus and part of the address bus off the mainboard, which unavoidably adds unwanted reactance to the circuit, mostly a lot of stray capacitance. With that capacitive loading in place, the MPU couldn't generate enough drive to slew the data bus at an acceptable rate once Ø2 got over 10 MHz. With the HBA removed, error-free operation was possible at 15 MHz, although there were timing violations to the I/O hardware that caused occasional funny behavior with the TIA-232 ports.
After mulling this for a while, and mindful that WDC doesn't actually state in the data sheet what the '816's fanout capability is, I concluded that I needed to do something about bus drive. Otherwise, it was a sure bet that the extra loading of the CPLD and two pieces of SRAM would sabotage any attempt to achieve high speed operation. Although it is generally known that the '816's drive strength is better than might be expected, it wouldn't be safe to design around an assumption. So this new design includes 74ABT541 bus drivers for A0-A15 and a 74ABT245 transceiver for D0-D7. ABT logic is exceptionally fast (3-4 ns prop time) and produces high output drive. The only real negative is that there are a few more parts for which to find space on the PCB. I resolved that issue by using SOIC (50 mil) packages.
Naturally, this design will receive more scrutiny before anything gets produced, as it's likely some little oversight is waiting to get me.