Ultra-minimal 3-wire Interface boots up 65xx CPU'sHere is a three-wire interface sufficient to boot up a ROM-less 6502-, 65C02- or 65C816-based computer. Because no ROM/ EPROM/ EEPROM is used the 65xx computer can benefit from
simpler address decoding and from running with
no wait states.
In one possible scenario, the three signals would originate on-board, supplied either by a microcontroller (possibly a tiny, 8-pin device) or by an SPI EEPROM and some glue logic.
In another scenario, the three signals could deliver code supplied by a separate host computer, in which case the on-board microcontroller or SPI EEPROM would be either omitted or just temporarily inhibited. Notice that this offers an opportunity to download arbitrary code, making the scheme attractive for development as well as for testing. (With suitable glue logic you could even temporarily attach the host computer in order to update the contents of the SPI EEPROM.)
Edit (the latest of several): Later in this thread forum member nyef proposes some promising changes, and I follow through by implementing and testing that idea as well as another key improvement. As a result, the 3-wire interface becomes usable on NMOS CPU's (which were ineligible for the original version). Also the bootup sequence becomes somewhat simpler and faster, and it becomes possible to create a substantially more efficient Level 2 Loader.
Thus the description that follows is slightly obsolete (although still valid and workable). In future I hope to provide a revised writeup, but what follows won't lead you too far astray. The original interface often places dummy $00's and $80's on the data bus, and the CPU is fooled into accepting these in place of values fetched from memory -- values which represent instruction opcodes, instruction operands and vector bytes. The revised interface is similar but the dummy values are $00 and $6A instead.
The interface was inspired by Jens Madsen's remarkable
Z-80 version; however, 65xx CPU's present a different set of challenges. Two years ago forum member DavidBuchanan
described a 5-wire 65xx version which looks like it would work, and borrowing slightly from that I developed and successfully tested this 3-wire version using a rudimentary 6502/65C02/ 65C816 system described
here.
Attachment:
3-wire bootup interface.png [ 6.14 KiB | Viewed 14976 times ]
OverviewThe so-called "Level 1 Loader" executes two loops. The first loop fills the stack area ($0100 - $01FF) with a very specific program I devised; then the program is executed. Its sole purpose is to initialize the 65xx's stack pointer, S.
Then the second loop writes another program to the stack area. This program can be one you created, and assuming your program is happy running in the stack area then maybe it's all you'll ever need.
But if not then what you'll run in the stack area is a "Level 2" loader that executes from the stack area but can write to anywhere.
The Level 2 loader could be one that you create, one that activates a UART or mass storage controller specific to your system. But in many cases it'll make sense to just continue using the 3-wire interface instead, and it's my assumption that this is what most users will do.
Now the details of the Level 1 loader.
The 65xx PC register (Program Counter) is adjusted to a desired value by means of a series of two-byte BRA (BRanch Always) relative branch instructions which get faked onto the data bus by 8 resistors. (While this is happening memory is inhibited; see schematic above.) Then PC (or at least PCH,
the high eight bits of PC) gets pushed to stack in response to a BRK (software interrupt) instruction which is also faked. 256 iterations of this procedure result in the entire stack area ($0100 - $01FF) being written. Then memory is enabled (causing the fake instructions to cease) and the newly written bytes are executed, starting at $0100.
Notice the
order of the bytes is known, but their
location within the $0100 - $01FF stack area is probably skewed, because the writes were performed via S (the Stack Pointer). S probably didn't point to $00FF on powerup, which means it probably wrapped around before the last of the 256 bytes got pushed. But, as I'll explain later, the bytes are a program which is written to execute properly regardless of the starting point.
$0100 may turn out to be somewhere in the middle of the program, but nevertheless if you start at $0100 it will always initialize S to a known value. With this important matter resolved, memory is inhibited, the fake BRA and BRK instructions resume, and another 256 bytes are written, this time to a predictable destination.
What's now in the stack area is actual bootstrap code that's customized for whatever you want to happen next. Probably that'll be a Level-2 loader, which might activate mass storage or open a serial port. Alternatively, the Level-2 loader can be one that continues accepting input from the 3-wire interface... but at higher speed, and without being constrained to writing in the stack area.
One of the three signals is simply the 65xx Clock and the other two are used mainly for controlling the data bus, as follows:
- When nCE = 1 and OP =1 memory and I/O are inhibited and the resistors cause $80 to appear.
- When nCE = 1 and OP =0 memory and I/O are inhibited and the resistors cause $00 to appear.
- When nCE = 0 and OP =1 the data bus is normal -- ie; able to read and write memory and I/O in the usual fashion.
- When nCE = 0 and OP =0 reset is asserted (and the data bus = don't-care).
The latter condition (reset) is detected by an OR "gate" -- the diode and resistor. An actual OR gate could of course be substituted. (Also, an octal tristate buffer could replace the 8-resistor array attached to the data bus. Such a substitution isn't helpful in most situations, although exceptions exist. I'll save that topic for future posts.)
My tests used an x86 box as the host computer, and the host code includes a couple of variables which are maintained as images of the 65xx PC (Program Counter) and S (stack pointer) registers. These register-image variables are crucial -- and updates to them are based entirely on inference, because all the host communication is outward bound. There's absolutely *no* feedback from the 65xx system! Therefore you need to keep track of what opcode the 65xx is executing, and keep a
100% perfect tally of thousands of clock pulses that need to be sent! I found this latter goal finicky to achieve. But in principle the code isn't terribly complicated, and things got easier after the basic routines were working.
A few of the maneuvers I'll explain in advance. Sometimes there's a need to use fake instructions merely as a means to alter the Program Counter before PCH gets pushed. The values $80 and $00 allow me to do this, as follows. $80 (as a fake BRA opcode) followed by $00 (as a fake operand) will advance PC by two, and $80 followed $80 will BRA backwards by about half a page. With these two tricks you can move PC all around the 64K map. [ PC can only ever point to even locations but that's OK because it's PCH we're gonna push. In the new version, $6A replaces $80, and $6A aka ROR can be used as a one-byte NOP. Thus PC can be made to point at both odd and even locations, making it feasible to push PCL instead. ]
You can also alter PC by using $00 as a BRK opcode and then faking the vector-low and vector-high bytes, in effect accomplishing a jump. $00 (used as a BRK) can serve as a jump, a push, or both a jump and a push. Its most common use is to push PCH to stack. Of course BRK will try to push *three* bytes (PCH, PCL, and P), and the trick is to enable RAM only during the cycle when the write of PCH is attempted. Here are the details for the push-PCH maneuver:
- with clock high on a cycle when the 65xx is poised to accept an opcode, put $00 (BRK) on the bus.
- wiggle the Clock line low then high again, twice. This takes us to cycle 3 of the BRK instruction (cycle 1 was the opcode), which is when the 65xx tries to write PCH to stack.
- switch the data bus to "normal" mode. This enables RAM, causing the write to take place immediately. Then switch the data bus to $80 mode. RAM is now un-enabled. (And the $80 will, in cycles 6 and 7, serve as the low-byte and high-byte respectively of the software-interrupt vector.)
- wiggle the Clock line low then high 5 more times. This takes us past cycles 4, 5, 6 and 7 of the BRK and leaves the 65xx poised to accept the next opcode.
So, PCH got pushed -- which is good
-- and S got decremented..... by 3,
which may seem
not-good but is actually OK. By the way, note that the "push-PCH" maneuver has the side effect of updating PC to $8080 (thanks to the vector-low and vector-high bytes fetched in cycles 6 and 7 respectively).
the Play By PlayThe host begins by setting clock high. Then it switches to Reset Mode, which immediately resets the 65xx CPU. Next is a switch to $80 Mode (which releases reset) followed by 9 clock pulses -- meaning clock low then high, 9 times. (
Note: that's for a WDC 'C02 or '816. I also tested a Rockwell 'C02, which wanted 8, not 9.)
In any case you need to "get to first base," with the 65xx poised to accept the first opcode. At that time SYNC (or VPA and VDA) will be high, and PC -- which'll appear on the address bus -- will be $8080. (When debugging your host software you can pause to verify the state of these pins manually.) Because PC is now known the host software can give the PC register-image its first update. (S is still unknown, and the S image can remain uninitialized.)
Now begins the first 256-iteration loop. The host has reserved 256 bytes as a source buffer, and its content is what'll be copied to the stack area ($0100 to $01FF) in 65xx RAM. The host software uses the S image as a post-decrementing index to *read* a byte from the buffer. Next it manipulates the 65xx PC until PCH becomes equal to the byte. Then the push-PCH maneuver uses actual S as a post-decremented index to *write* PCH into the stack area. The decrement value is 3, not 1; also the 8-bit index will underflow and wrap around twice before we're done. But the same anomalies occur on both ends, and it all works out alright provided the number of iterations is 256. Here is the procedure in more detail:
- using the S image as an index, the host fetches a byte from the buffer. This byte and PCH from the PC image are compared to see if they are equal.
- if they're not equal then $80 (BRA) is placed on the 65xx data bus and 3 clocks are issued (or 4 in case of a page crossing; see Misc Tips below). As mentioned earlier this BRA $80 maneuver moves the 65xx PC backwards about half a page. (The PC advances by 2 as the instruction is fetched. Then sign-extended $80 aka $FF80 is added to PC.) The PC image is updated accordingly ($7E backwards) and then PCH is tested again. (I opted for a simple BEGIN - WHILE - REPEAT algorithm.) If necessary more BRA $80's are executed, and sooner or later PCH will equal the byte from the buffer.
- PCH is pushed to stack using the BRK-based push-PCH maneuver explained previously.
Relocatable and entry-point irrelevant code to initialize SAlmost all of the 256 bytes in the host buffer are $EA (NOP). But slipped in the deck are two jokers, each being an instance of the sequence
$A2 $EA $9A. Here are the 256 bytes as they appear in the host buffer:
Quote:
$A2 $EA $9A $EA $EA $EA ... [all EA's up to location $7F] $A2 $EA $9A $EA $EA $EA ... [all EA's up to location $FF]
When this gets copied to the 65xx stack area things are likely to get skewed (wrapped around within the circular space). For example, the byte at $0100 certainly came from somewhere in the host buffer but not necessarily from the beginning. Nevertheless,
$0100 is where we will begin execution. We get PC = $100 by faking another BRK. This BRK isn't allowed to write
any bytes to stack, and also we tweak cycles 6 & 7 so it vectors to $0080. Then, to get from $0080 to $0100, we use 64 of the BRA $00- based two-byte NOP's mentioned earlier. With PC = $100 the data bus is then put in normal mode so the bytes in RAM become readable. We issue a few hundred clock cycles, then stop -- and the result is to initialize S, no matter what. Here's how:
If there's no skew,
$A2 $EA $9A $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$A2 $EA ( LDX #$EA ; load X immediate )
$9A ( TXS ; transfer X to S )
$EA ( NOP )
$EA ( NOP )
$EA ( etc )
The CPU will execute the two-byte LDX instruction at $0100 then proceed to the $9A (TXS) and then the $EA's (NOP's) that follow.
In this case S will immediately be initialized, as you can see.
If things are skewed by one,
$EA $9A $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$EA ( NOP )
$9A ( TXS ; transfer X to S )
$EA ( NOP )
$EA ( NOP )
$EA ( etc )
The CPU will execute the $EA (NOP) at $0100 then proceed to the $9A (TXS) and the $EA's (NOP's) that follow. Proper initialization doesn't occur (at least not immediately).
If things are skewed by two,
$9A $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$9A ( TXS ; transfer X to S )
$EA ( NOP )
$EA ( NOP )
$EA ( etc )
The CPU will execute the $9A (TXS) at $0100 then proceed to the $EA's (NOP's) that follow. Proper initialization doesn't occur (at least not immediately).
If things are skewed by 3, 4, 5 ... then $EA $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$EA ( NOP )
$EA ( NOP )
$EA ( NOP )
$EA ( etc )
No initialization occurs (at least not immediately).
In these latter cases S will at first not get properly initialized. But by issuing enough clock cycles you're bound to encounter the second joker sequence, and *it* will succeed because the
LDX #$EA and
TXS will be properly framed and will get the desired interpretation by the CPU. As for timing, all the instructions are 2 cycles each no matter what, and 258 cycles ought to do the trick. Add a few extra if you like (in my tests I rounded up to $110, or 272), but the number must be even so you'll stop on a cycle during which the CPU is poised to accept the next opcode (ie, SYNC=1 or VPA=VDA=1).
Now S's actual value ($EA) can be written to the host's register-image of S, which of course must be kept updated hereafter. Reminder: S decrements by three with every BRK, regardless of what the BRK is used for. And of course the register-image of S must always read as an 8-bit value (or get ANDed with $FF immediately following every read).
From here on it's all smooth sailing. The host's source buffer gets a new set of 256 bytes moved in (or a different buffer is used). Instead of a slightly tainted sea of $EA's, the buffer now contains the 65xx bootstrap code. The copy process is performed just as before, and the only real difference is what happens when the stuff executes. This second set of 256 bytes is more likely to have some semblance of sanity!
SpeedObviously there's no exact specification for how fast bootup needs to be, but as a ballpark goal we can try to keep it under a second or two. That'll be easy to achieve unless many KBytes need to be copied or a bottleneck exists. As an example of a bottleneck, my test host outputs on a 115.2 Kbaud serial link, and at
86.8 uS per event it takes 25 seconds to do enough wire-wiggling to get to the point where the bootstrap code is ready to execute.
An on-board microcontroller could do the same job in perhaps 1/100th the time -- .25 seconds, assuming .868 uS per event. (Any time any of the three signals needs to be set high or low, that counts as one event. Eg: a clock pulse counts as two events.)
Almost all the time is spent doing fake BRA $80's, spinning the 65xx PC backwards $7E at a time. Filling up the stack area involves 256 instances where
PC begins at $8080 and has to be spun backwards until PCH equals whatever byte we want pushed. Then we do the "push-PCH" maneuver -- which is also what sets the $8080
starting point for next time. Two speedup strategies become apparent, useful in cases like mine where a bottleneck is present.
- Don't always use $8080 as the starting point. Instead, have the host look ahead during each "push-PCH" maneuver for a hint about what's gonna get pushed next time. Have it choose whether to put $00 or $80 on the bus during cycle 7 of the "push-PCH" maneuver. PC gets left = $8080 or $0080 (whichever will result in less spinning)... and download speed roughly doubles.
- The first set of 256 bytes is mostly do-nothing filler, and although I mentioned using a sea of NOP's ($EA) I lied. $EA is actually a rather slow value to download ($80 and $00 are the fastest values to download, and $81 and $01 are the slowest) so in reality I replaced every $EA with $78 (SEI - set interrupt mask). $EA and $78 are both one-byte, two-cycle instructions that don't do much. Except for the difference in download speed I consider them equivalent for this application.
Using both of these speedups the host or microcontroller will average (using random bytes for bootup "code") about 290,000 events before the bootstrap code is ready to execute. As noted, most of the time is devoted to lots and lots of BRA $80's.
If there's a Level-2 loader also using the 3-wire interface then I estimate its events-per-byte ratio will be about 10- or 20-fold better -- still not fast, but good enough for lots of situations; and as noted it wouldn't be constrained to 256 bytes, nor to writing only the stack area. For notes on a Level-2 loader using the 3-wire interface see subsequent posts.
Selecting the final Clock SourceDuring the bootload every clock pulse is issued very deliberately. But eventually you'll wanna release the 65xx to free-run at top speed. Maybe your microcontroller will be capable of outputting a full-speed clock on the same pin it uses to output the single-step clock used in bootup. Otherwise you'll probably use a multiplexer to select between the single-step clock and an oscillator. The mux will need something to drive its Select input, and a cool option for that is to use the 816's E output (assuming you're using an '816). The CPU is guaranteed to start off in Emulation Mode, which means E=1. After the bootload when all your ducks are in a row you'd single-step into an XCE instruction that exits Emulation Mode ... and when that happens E goes low (synchronously), the mux switches over, and you'll take off like a slingshot!
Misc. Tips and commentsWouldn't it be faster and easier to push PCL (instead of PCH)? Yes. But we need all 256 values available, and there's no sequence of $80's and $00's that can cause PCL to have 0 in the least-significant bit. Edit: nyef's suggestion sidesteps this by using $E8's and $00's rather than $80's and $00's.
The 3-wire interface won't work with NMOS 65xx CPU's because they lack the BRA instruction. Edit: this, too, is sidestepped by nyef's suggestion.
Be careful to avoid accidental resets. When you're wiggling nCE and OP, always set first then clear. ie: to switch to "BRK" mode, first set nCE then clear OP. To switch to "normal" mode, first set OP then clear nCE.
Your bootstrap code should start by initializing S again, to $FF.
Only high input-impedance chips (eg, NMOS and CMOS) may connect to the bus that's driven by the resistors. The resistors wouldn't be capable of driving any TTL or LSTTL loads.
Especially with 65c816, a '245 transceiver may be present between the memory-I/O data bus and the CPU data bus. If the resistor array attaches to the memory-I/O data bus then nCE must inhibit memory and I/O. If the resistor array attaches to the CPU data bus then nCE must inhibit the '245 and memory and I/O.
Cycle counts must be perfect. Note that a BRA will take 3 or 4 cycles depending on whether there's a page crossing. The exact definition of a page crossing is crucial. You need to compare the ADH of the byte that's 2 past the BRA opcode with the ADH of the byte you're branching to. Also (and maybe this is obvious), having a BRA operand of 0 (as in the BRA $00 "two-bye NOP") doesn't make a BRA execute faster. It's
not like BNE or BMI which are conditional, and take only 2 cycles in the case of "branch not taken."
Questions & comments welcome! Now that the idea is more than just a theory I hope someone is able to use it in a real project. Cheers,
-- Jeff