
All times are UTC




PostPosted: Tue Aug 07, 2018 1:59 am 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Ultra-minimal 3-wire Interface boots up 65xx CPU's

Here is a three-wire interface sufficient to boot up a ROM-less 6502-, 65C02- or 65C816-based computer. Because no ROM/EPROM/EEPROM is used, the 65xx computer can benefit from simpler address decoding and from running with no wait states.

In one possible scenario, the three signals would originate on-board, supplied either by a microcontroller (possibly a tiny, 8-pin device) or by an SPI EEPROM and some glue logic.

In another scenario, the three signals could deliver code supplied by a separate host computer, in which case the on-board microcontroller or SPI EEPROM would be either omitted or just temporarily inhibited. Notice that this offers an opportunity to download arbitrary code, making the scheme attractive for development as well as for testing. (With suitable glue logic you could even temporarily attach the host computer in order to update the contents of the SPI EEPROM.)

    Edit (the latest of several): Later in this thread forum member nyef proposes some promising changes, and I follow through by implementing and testing that idea as well as another key improvement. As a result, the 3-wire interface becomes usable on NMOS CPU's (which were ineligible for the original version). Also the bootup sequence becomes somewhat simpler and faster, and it becomes possible to create a substantially more efficient Level 2 Loader.

    Thus the description that follows is slightly obsolete (although still valid and workable). In future I hope to provide a revised writeup, but what follows won't lead you too far astray. The original interface often places dummy $00's and $80's on the data bus, and the CPU is fooled into accepting these in place of values fetched from memory -- values which represent instruction opcodes, instruction operands and vector bytes. The revised interface is similar but the dummy values are $00 and $6A instead.

The interface was inspired by Jens Madsen's remarkable Z-80 version; however, 65xx CPU's present a different set of challenges. Two years ago forum member DavidBuchanan described a 5-wire 65xx version which looks like it would work, and borrowing slightly from that I developed and successfully tested this 3-wire version using a rudimentary 6502/65C02/65C816 system described here.


Attachment: 3-wire bootup interface.png [ 6.14 KiB ]

Overview

The so-called "Level 1 Loader" executes two loops. The first loop fills the stack area ($0100 - $01FF) with a very specific program I devised; then the program is executed. Its sole purpose is to initialize the 65xx's stack pointer, S.

Then the second loop writes another program to the stack area. This program can be one you created, and assuming your program is happy running in the stack area then maybe it's all you'll ever need. :) But if not then what you'll run in the stack area is a "Level 2" loader that executes from the stack area but can write to anywhere.

The Level 2 loader could be one that you create, one that activates a UART or mass storage controller specific to your system. But in many cases it'll make sense to just continue using the 3-wire interface instead, and it's my assumption that this is what most users will do.

Now the details of the Level 1 loader.

The 65xx PC register (Program Counter) is adjusted to a desired value by means of a series of two-byte BRA (BRanch Always) relative branch instructions which get faked onto the data bus by 8 resistors. (While this is happening memory is inhibited; see schematic above.) Then PC (or at least PCH, the high eight bits of PC) gets pushed to stack in response to a BRK (software interrupt) instruction which is also faked. 256 iterations of this procedure result in the entire stack area ($0100 - $01FF) being written. Then memory is enabled (causing the fake instructions to cease) and the newly written bytes are executed, starting at $0100.

Notice the order of the bytes is known, but their location within the $0100 - $01FF stack area is probably skewed, because the writes were performed via S (the Stack Pointer). S probably didn't point to $00FF on powerup, which means it probably wrapped around before the last of the 256 bytes got pushed. But, as I'll explain later, the bytes are a program which is written to execute properly regardless of the starting point. :shock: $0100 may turn out to be somewhere in the middle of the program, but nevertheless if you start at $0100 it will always initialize S to a known value. With this important matter resolved, memory is inhibited, the fake BRA and BRK instructions resume, and another 256 bytes are written, this time to a predictable destination.

What's now in the stack area is actual bootstrap code that's customized for whatever you want to happen next. Probably that'll be a Level-2 loader, which might activate mass storage or open a serial port. Alternatively, the Level-2 loader can be one that continues accepting input from the 3-wire interface... but at higher speed, and without being constrained to writing in the stack area.

One of the three signals is simply the 65xx Clock and the other two are used mainly for controlling the data bus, as follows:

  • When nCE = 1 and OP = 1, memory and I/O are inhibited and the resistors cause $80 to appear.
  • When nCE = 1 and OP = 0, memory and I/O are inhibited and the resistors cause $00 to appear.
  • When nCE = 0 and OP = 1, the data bus is normal -- i.e., able to read and write memory and I/O in the usual fashion.
  • When nCE = 0 and OP = 0, reset is asserted (and the data bus = don't-care).

The latter condition (reset) is detected by an OR "gate" -- the diode and resistor. An actual OR gate could of course be substituted. (Also, an octal tristate buffer could replace the 8-resistor array attached to the data bus. Such a substitution isn't helpful in most situations, although exceptions exist. I'll save that topic for future posts.)
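
A minimal sketch of the four combinations above, written in Python. The function and mode names are illustrative only and are not taken from the actual host code:

```python
# Illustrative model of the two control wires; all names are hypothetical.
BUS_MODES = {
    (1, 1): "fake $80",  # memory and I/O inhibited, resistors show $80
    (1, 0): "fake $00",  # memory and I/O inhibited, resistors show $00
    (0, 1): "normal",    # memory and I/O readable and writable
    (0, 0): "reset",     # reset asserted, data bus is don't-care
}


def bus_mode(nce: int, op: int) -> str:
    """Return the bus mode selected by the nCE and OP wires."""
    return BUS_MODES[(nce, op)]


def reset_asserted(nce: int, op: int) -> bool:
    """Reset is active only when the OR of the two wires is 0."""
    return (nce | op) == 0
```

For example, `bus_mode(1, 0)` gives `"fake $00"`, the mode used to present a BRK opcode.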

My tests used an x86 box as the host computer, and the host code includes a couple of variables which are maintained as images of the 65xx PC (Program Counter) and S (stack pointer) registers. These register-image variables are crucial -- and updates to them are based entirely on inference, because all the host communication is outward bound. There's absolutely *no* feedback from the 65xx system! Therefore you need to keep track of what opcode the 65xx is executing, and keep a 100% perfect tally of thousands of clock pulses that need to be sent! I found this latter goal finicky to achieve. But in principle the code isn't terribly complicated, and things got easier after the basic routines were working.

A few of the maneuvers I'll explain in advance. Sometimes there's a need to use fake instructions merely as a means to alter the Program Counter before PCH gets pushed. The values $80 and $00 allow me to do this, as follows. $80 (as a fake BRA opcode) followed by $00 (as a fake operand) will advance PC by two, and $80 followed by $80 will BRA backwards by about half a page. With these two tricks you can move PC all around the 64K map. [ PC can only ever point to even locations but that's OK because it's PCH we're gonna push. In the new version, $6A replaces $80, and $6A aka ROR can be used as a one-byte NOP. Thus PC can be made to point at both odd and even locations, making it feasible to push PCL instead. ]

You can also alter PC by using $00 as a BRK opcode and then faking the vector-low and vector-high bytes, in effect accomplishing a jump. $00 (used as a BRK) can serve as a jump, a push, or both a jump and a push. Its most common use is to push PCH to stack. Of course BRK will try to push *three* bytes (PCH, PCL, and P), and the trick is to enable RAM only during the cycle when the write of PCH is attempted. Here are the details for the push-PCH maneuver:

  • with clock high on a cycle when the 65xx is poised to accept an opcode, put $00 (BRK) on the bus.

  • wiggle the Clock line low then high again, twice. This takes us to cycle 3 of the BRK instruction (cycle 1 was the opcode), which is when the 65xx tries to write PCH to stack.

  • switch the data bus to "normal" mode. This enables RAM, causing the write to take place immediately. Then switch the data bus to $80 mode. RAM is now disabled again. (And the $80 will, in cycles 6 and 7, serve as the low-byte and high-byte respectively of the software-interrupt vector.)

  • wiggle the Clock line low then high 5 more times. This takes us past cycles 4, 5, 6 and 7 of the BRK and leaves the 65xx poised to accept the next opcode.

So, PCH got pushed -- which is good :) -- and S got decremented..... by 3, :roll: which may seem not-good but is actually OK. By the way, note that the "push-PCH" maneuver has the side effect of updating PC to $8080 (thanks to the vector-low and vector-high bytes fetched in cycles 6 and 7 respectively).
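
The push-PCH steps above can be sketched as a host-side routine. This is only a model under stated assumptions: the step names and the function are hypothetical, and the bus is assumed to be in one of the inhibited modes when the maneuver starts.

```python
def push_pch(s_image: int):
    """Return the host's steps for one push-PCH maneuver plus the updated images.

    Hypothetical helper: 'bus' steps select a data-bus mode, 'clock_pulses'
    steps mean clock low then high, that many times.
    """
    steps = [
        ("bus", "fake $00"),   # BRK opcode appears on the data bus
        ("clock_pulses", 2),   # advance to cycle 3, the attempted PCH write
        ("bus", "normal"),     # RAM enabled: PCH lands on the stack
        ("bus", "fake $80"),   # RAM inhibited; $80 will serve as both vector bytes
        ("clock_pulses", 5),   # finish the BRK; CPU is poised for the next opcode
    ]
    new_s = (s_image - 3) & 0xFF  # BRK always moves S down by three
    new_pc = 0x8080               # vector low and high bytes were both $80
    return steps, new_pc, new_s
```

The two register images are updated purely by inference, as there is no feedback from the 65xx.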

The Play By Play

The host begins by setting clock high. Then it switches to Reset Mode, which immediately resets the 65xx CPU. Next is a switch to $80 Mode (which releases reset) followed by 9 clock pulses -- meaning clock low then high, 9 times. (Note: that's for a WDC 'C02 or '816. I also tested a Rockwell 'C02, which wanted 8, not 9.)

In any case you need to "get to first base," with the 65xx poised to accept the first opcode. At that time SYNC (or VPA and VDA) will be high, and PC -- which'll appear on the address bus -- will be $8080. (When debugging your host software you can pause to verify the state of these pins manually.) Because PC is now known the host software can give the PC register-image its first update. (S is still unknown, and the S image can remain uninitialized.)

Now begins the first 256-iteration loop. The host has reserved 256 bytes as a source buffer, and its content is what'll be copied to the stack area ($0100 to $01FF) in 65xx RAM. The host software uses the S image as a post-decrementing index to *read* a byte from the buffer. Next it manipulates the 65xx PC until PCH becomes equal to the byte. Then the push-PCH maneuver uses actual S as a post-decremented index to *write* PCH into the stack area. The decrement value is 3, not 1; also the 8-bit index will underflow and wrap around twice before we're done. But the same anomalies occur on both ends, and it all works out alright provided the number of iterations is 256. Here is the procedure in more detail:

  • using the S image as an index, the host fetches a byte from the buffer. This byte and PCH from the PC image are compared to see if they are equal.

  • if they're not equal then $80 (BRA) is placed on the 65xx data bus and 3 clocks are issued (or 4 in case of a page crossing; see Misc Tips below). As mentioned earlier this BRA $80 maneuver moves the 65xx PC backwards about half a page. (The PC advances by 2 as the instruction is fetched. Then sign-extended $80 aka $FF80 is added to PC.) The PC image is updated accordingly ($7E backwards) and then PCH is tested again. (I opted for a simple BEGIN - WHILE - REPEAT algorithm.) If necessary more BRA $80's are executed, and sooner or later PCH will equal the byte from the buffer.

  • PCH is pushed to stack using the BRK-based push-PCH maneuver explained previously.
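
To see why the decrement-by-3 and the wraparound "work out alright", here is a small Python model of the 256 iterations. It is a sketch only: it ignores how PCH is brought to the right value and just assumes each push writes the chosen byte. The starting values for the S image and the real S are arbitrary, since neither is known at this point.

```python
def fill_stack(buffer, s_image_start, s_real_start):
    """Model 256 push-PCH iterations and return the resulting stack page."""
    assert len(buffer) == 256
    stack = [None] * 256
    s_image, s_real = s_image_start, s_real_start
    for _ in range(256):
        byte = buffer[s_image]          # host reads via its S image
        stack[s_real] = byte            # CPU writes via the real S
        s_image = (s_image - 3) & 0xFF  # both ends step down by 3 per BRK
        s_real = (s_real - 3) & 0xFF
    return stack


buffer = list(range(256))
stack = fill_stack(buffer, s_image_start=0x00, s_real_start=0x37)
skew = (0x00 - 0x37) & 0xFF
```

Because 3 and 256 share no common factor, 256 iterations touch every location exactly once, and the stack page ends up as the host buffer rotated by `skew`.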


Relocatable and entry-point irrelevant :shock: code to initialize S

Almost all of the 256 bytes in the host buffer are $EA (NOP). But slipped in the deck are two jokers, each being an instance of the sequence $A2 $EA $9A. Here are the 256 bytes as they appear in the host buffer:
Quote:
$A2 $EA $9A $EA $EA $EA ... [all EA's up to location $7F] $A2 $EA $9A $EA $EA $EA ... [all EA's up to location $FF]

When this gets copied to the 65xx stack area things are likely to get skewed (wrapped around within the circular space). For example, the byte at $0100 certainly came from somewhere in the host buffer but not necessarily from the beginning. Nevertheless, $0100 is where we will begin execution. We get PC = $100 by faking another BRK. This BRK isn't allowed to write any bytes to stack, and also we tweak cycles 6 & 7 so it vectors to $0080. Then, to get from $0080 to $0100, we use 64 of the BRA $00-based two-byte NOP's mentioned earlier. With PC = $100 the data bus is then put in normal mode so the bytes in RAM become readable. We issue a few hundred clock cycles, then stop -- and the result is to initialize S, no matter what. Here's how:

If there's no skew, $A2 $EA $9A $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$A2 $EA   ( LDX #$EA ; load X immediate )
$9A       ( TXS      ; transfer X to S  )
$EA       ( NOP                         )
$EA       ( NOP                         )
$EA       ( etc                         )

The CPU will execute the two-byte LDX instruction at $0100 then proceed to the $9A (TXS) and then the $EA's (NOP's) that follow.
In this case S will immediately be initialized, as you can see.

If things are skewed by one, $EA $9A $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$EA       ( NOP                         )
$9A       ( TXS      ; transfer X to S  )
$EA       ( NOP                         )
$EA       ( NOP                         )
$EA       ( etc                         )

The CPU will execute the $EA (NOP) at $0100 then proceed to the $9A (TXS) and the $EA's (NOP's) that follow.  Proper initialization doesn't occur (at least not immediately).

If things are skewed by two, $9A $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$9A       ( TXS      ; transfer X to S  )
$EA       ( NOP                         )
$EA       ( NOP                         )
$EA       ( etc                         )

The CPU will execute the $9A (TXS) at $0100 then proceed to the $EA's (NOP's) that follow. Proper initialization doesn't occur (at least not immediately).

If things are skewed by 3, 4, 5 ... then $EA $EA $EA $EA ... is what will appear at $0100. The CPU interprets it as this:
Code:
$EA       ( NOP                         )
$EA       ( NOP                         )
$EA       ( NOP                         )
$EA       ( etc                         )

No initialization occurs (at least not immediately).

In these latter cases S will at first not get properly initialized. But by issuing enough clock cycles you're bound to encounter the second joker sequence, and *it* will succeed because the LDX #$EA and TXS will be properly framed and will get the desired interpretation by the CPU. As for timing, all the instructions are 2 cycles each no matter what, and 258 cycles ought to do the trick. Add a few extra if you like (in my tests I rounded up to $110, or 272), but the number must be even so you'll stop on a cycle during which the CPU is poised to accept the next opcode (ie, SYNC=1 or VPA=VDA=1).
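
The claim that S ends up initialized for every possible skew can be checked with a tiny model. This Python sketch interprets only the three opcodes involved and treats X and S as unknown (None) on entry; the names are illustrative.

```python
JOKER = [0xA2, 0xEA, 0x9A]                # LDX #$EA ; TXS
HOST_BUFFER = (JOKER + [0xEA] * 125) * 2  # 256 bytes, jokers at $00 and $80


def run_init(skew: int, cycles: int = 258):
    """Execute the skewed page from offset 0 and return the final S (None = unknown)."""
    page = [HOST_BUFFER[(i + skew) & 0xFF] for i in range(256)]
    pc, x, s = 0, None, None              # X and S are unknown at entry
    for _ in range(cycles // 2):          # every instruction here takes 2 cycles
        opcode = page[pc & 0xFF]
        if opcode == 0xA2:                # LDX #imm (2 bytes)
            x = page[(pc + 1) & 0xFF]
            pc += 2
        elif opcode == 0x9A:              # TXS
            s = x
            pc += 1
        elif opcode == 0xEA:              # NOP
            pc += 1
        else:
            raise ValueError(f"unexpected opcode ${opcode:02X}")
    return s
```

With the default of 258 cycles, `run_init(skew)` returns $EA for every skew from 0 to 255. A skew of 1 is the slowest case: the second TXS completes on exactly the 258th cycle.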

Now S's actual value ($EA) can be written to the host's register-image of S, which of course must be kept updated hereafter. Reminder: S decrements by three with every BRK, regardless of what the BRK is used for. And of course the register-image of S must always read as an 8-bit value (or get ANDed with $FF immediately following every read).

From here on it's all smooth sailing. The host's source buffer gets a new set of 256 bytes moved in (or a different buffer is used). Instead of a slightly tainted sea of $EA's, the buffer now contains the 65xx bootstrap code. The copy process is performed just as before, and the only real difference is what happens when the stuff executes. This second set of 256 bytes is more likely to have some semblance of sanity!

Speed

Obviously there's no exact specification for how fast bootup needs to be, but as a ballpark goal we can try to keep it under a second or two. That'll be easy to achieve unless many KBytes need to be copied or a bottleneck exists. As an example of a bottleneck, my test host outputs on a 115.2 Kbaud serial link, and at 86.8 uS per event it takes 25 seconds to do enough wire-wiggling to get to the point where the bootstrap code is ready to execute. :| An on-board microcontroller could do the same job in perhaps 1/100th the time -- .25 seconds, assuming .868 uS per event. (Any time any of the three signals needs to be set high or low, that counts as one event. Eg: a clock pulse counts as two events.)

Almost all the time is spent doing fake BRA $80's, spinning the 65xx PC backwards $7E at a time. Filling up the stack area involves 256 instances where PC begins at $8080 and has to be spun backwards until PCH equals whatever byte we want pushed. Then we do the "push-PCH" maneuver -- which is also what sets the $8080 starting point for next time. Two speedup strategies become apparent, useful in cases like mine where a bottleneck is present.

  • Don't always use $8080 as the starting point. Instead, have the host look ahead during each "push-PCH" maneuver for a hint about what's gonna get pushed next time. Have it choose whether to put $00 or $80 on the bus during cycle 7 of the "push-PCH" maneuver. PC gets left = $8080 or $0080 (whichever will result in less spinning)... and download speed roughly doubles. :)
  • The first set of 256 bytes is mostly do-nothing filler, and although I mentioned using a sea of NOP's ($EA) I lied. $EA is actually a rather slow value to download ($80 and $00 are the fastest values to download, and $81 and $01 are the slowest) so in reality I replaced every $EA with $78 (SEI - set interrupt mask). $EA and $78 are both one-byte, two-cycle instructions that don't do much. Except for the difference in download speed I consider them equivalent for this application.
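
Both speedups can be illustrated with a rough Python model that counts how many fake BRA $80's are needed for a given byte. It follows the description above in comparing the high byte of PC with the target; the function names are hypothetical.

```python
def branches_needed(start_pc: int, target_pch: int) -> int:
    """Count fake BRA $80's until the high byte of PC equals target_pch."""
    pc, count = start_pc, 0
    while (pc >> 8) != target_pch:
        pc = (pc - 0x7E) & 0xFFFF  # each BRA $80 moves PC back by $7E
        count += 1
    return count


def best_start(target_pch: int):
    """Pick $8080 or $0080, whichever needs less spinning for this byte."""
    costs = {start: branches_needed(start, target_pch) for start in (0x8080, 0x0080)}
    start = min(costs, key=costs.get)
    return start, costs[start]


start_ea, n_ea = best_start(0xEA)  # 44 branches, starting from $0080
start_78, n_78 = best_start(0x78)  # 16 branches, starting from $8080
```

In this model $78 costs 16 branches against 44 for $EA, and with the look-ahead in place the most expensive bytes are $01 and $81 at 258 branches each.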

Using both of these speedups the host or microcontroller will average (using random bytes for bootup "code") about 290,000 events before the bootstrap code is ready to execute. As noted, most of the time is devoted to lots and lots of BRA $80's.
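
As a back-of-envelope check of the timing figures, the following Python sketch multiplies the event count by the time per event. It assumes each event costs one serial character of 10 bit times (start bit, 8 data bits, stop bit), which is what reproduces the 86.8 uS figure at 115.2 Kbaud; that framing is an assumption, not something stated about the test setup.

```python
def event_time_s(baud: float, bits_per_event: int = 10) -> float:
    """Seconds per event, assuming one 10-bit serial character per event."""
    return bits_per_event / baud


def boot_time_s(events: int, seconds_per_event: float) -> float:
    """Total time to issue the given number of events."""
    return events * seconds_per_event


slow = boot_time_s(290_000, event_time_s(115_200))  # about 25 seconds
fast = boot_time_s(290_000, 0.868e-6)               # about 0.25 seconds
```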

If there's a Level-2 loader also using the 3-wire interface then I estimate its events-per-byte ratio will be about 10- or 20-fold better -- still not fast, but good enough for lots of situations; and as noted it wouldn't be constrained to 256 bytes, nor to writing only the stack area. For notes on a Level-2 loader using the 3-wire interface see subsequent posts.

Selecting the final Clock Source

During the bootload every clock pulse is issued very deliberately. But eventually you'll wanna release the 65xx to free-run at top speed. Maybe your microcontroller will be capable of outputting a full-speed clock on the same pin it uses to output the single-step clock used in bootup. Otherwise you'll probably use a multiplexer to select between the single-step clock and an oscillator. The mux will need something to drive its Select input, and a cool option for that is to use the 816's E output (assuming you're using an '816). The CPU is guaranteed to start off in Emulation Mode, which means E=1. After the bootload when all your ducks are in a row you'd single-step into an XCE instruction that exits Emulation Mode ... and when that happens E goes low (synchronously), the mux switches over, and you'll take off like a slingshot! :mrgreen:


Misc. Tips and comments

Wouldn't it be faster and easier to push PCL (instead of PCH)? Yes. But we need all 256 values available, and there's no sequence of $80's and $00's that can cause PCL to have 0 in the least-significant bit. Edit: nyef's suggestion sidesteps this by using $E8's and $00's rather than $80's and $00's.

The 3-wire interface won't work with NMOS 65xx CPU's because they lack the BRA instruction. Edit: this, too, is sidestepped by nyef's suggestion.

Be careful to avoid accidental resets. When you're wiggling nCE and OP, always set first then clear. ie: to switch to "BRK" mode, first set nCE then clear OP. To switch to "normal" mode, first set OP then clear nCE.
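
The "set first, then clear" rule can be expressed as a small helper. This Python sketch is illustrative only; states are (nCE, OP) pairs and the function name is made up.

```python
def safe_transition(current, target):
    """Return the (nCE, OP) states visited, raising wires before lowering them."""
    nce, op = current
    t_nce, t_op = target
    visited = []
    # First pass: set any wire that must go high.
    if t_nce == 1 and nce == 0:
        nce = 1
        visited.append((nce, op))
    if t_op == 1 and op == 0:
        op = 1
        visited.append((nce, op))
    # Second pass: clear any wire that must go low.
    if t_nce == 0 and nce == 1:
        nce = 0
        visited.append((nce, op))
    if t_op == 0 and op == 1:
        op = 0
        visited.append((nce, op))
    return visited
```

Going from normal mode (0, 1) to "BRK" mode (1, 0) visits (1, 1) and then (1, 0), so the reset combination (0, 0) never appears along the way.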

Your bootstrap code should start by initializing S again, to $FF. :wink:

Only high input-impedance chips (eg, NMOS and CMOS) may connect to the bus that's driven by the resistors. The resistors wouldn't be capable of driving any TTL or LSTTL loads.

Especially with 65c816, a '245 transceiver may be present between the memory-I/O data bus and the CPU data bus. If the resistor array attaches to the memory-I/O data bus then nCE must inhibit memory and I/O. If the resistor array attaches to the CPU data bus then nCE must inhibit the '245 and memory and I/O.

Cycle counts must be perfect. Note that a BRA will take 3 or 4 cycles depending on whether there's a page crossing. The exact definition of a page crossing is crucial. You need to compare the ADH of the byte that's 2 past the BRA opcode with the ADH of the byte you're branching to. Also (and maybe this is obvious), having a BRA operand of 0 (as in the BRA $00 "two-byte NOP") doesn't make a BRA execute faster. It's not like BNE or BMI which are conditional, and take only 2 cycles in the case of "branch not taken."
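
The page-crossing rule above can be sketched in Python as follows. The helper names are hypothetical; the logic simply compares the high byte of the address 2 past the opcode with the high byte of the branch target.

```python
def bra_cycles(opcode_addr: int, operand: int):
    """Return (target, cycles) for a BRA at opcode_addr with an 8-bit operand."""
    next_pc = (opcode_addr + 2) & 0xFFFF           # address 2 past the BRA opcode
    offset = operand - 0x100 if operand & 0x80 else operand  # sign-extend
    target = (next_pc + offset) & 0xFFFF
    crossed = (next_pc >> 8) != (target >> 8)      # compare the two ADH values
    return target, 4 if crossed else 3


def spin_to(pc: int, wanted_pch: int):
    """Repeat BRA $80 until PCH matches; return (pc, branches, cycles)."""
    branches = cycles = 0
    while (pc >> 8) != wanted_pch:
        pc, c = bra_cycles(pc, 0x80)
        branches += 1
        cycles += c
    return pc, branches, cycles
```

Starting from $8080, the first BRA $80 lands on $8002 in 3 cycles and the next one lands on $7F84 in 4 cycles. Spinning from $8080 down to page $78 takes 16 branches and 56 cycles in this model.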

Questions & comments welcome! Now that the idea is more than just a theory I hope someone is able to use it in a real project. Cheers,

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Thu Sep 30, 2021 3:05 pm, edited 8 times in total.

PostPosted: Tue Aug 07, 2018 7:09 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Very clever indeed! (Lazy question: roughly how many clock cycles are needed to get to the point of executing the 256 byte program in page 1?)


PostPosted: Wed Aug 08, 2018 2:18 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Oops, I see now you say 250ms should be enough to get to the point of running the 256-byte loader.

In effect that loader is then free to bit-bang a synchronous serial interface, perhaps a bit like this:
Code:
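loop:
LDA #$0        ; start with A clear
ORA #$xx       ; xx is a faked operand carrying one bit
ROR
ORA #$xx
ROR
ORA #$xx
ROR
ORA #$xx
ROR
ORA #$xx
ROR
ORA #$xx
ROR
ORA #$xx
ROR
ORA #$xx       ; eighth bit; A now holds the whole byte
STA (zp),Y     ; store it via a zero-page pointer
INY
BNE loop       ; next byte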


PostPosted: Wed Aug 08, 2018 11:27 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
BigEd wrote:
Oops, I see now you say 250ms should be enough to get to the point of running the 256-byte loader.
Yes, I edited the lead post to clarify this point and a few others. 250 ms would cover the wonky "initialize S" code and the download of 256 bytes of comparatively normal code. (I refer to the latter as the bootstrap code). The delay estimate is based on the microcontroller or host link requiring about 1 uS every time any of the 3 wires goes high or low. (The host link I used happens to be much slower -- about 87 uS per event).

The bootstrap code might be a loader that relies on mass storage or an RS232 link. These are solid options, and it's pretty easy to imagine how they'd work. But it's more fun :) (and in some circumstances it'd be better) to avoid other I/O channels and instead continue accepting input from the 3-wire interface... whose abilities revolve around faking $00's and $80's onto the data bus.

Quote:
In effect that loader is then free to bit-bang a synchronous serial interface

Whoa -- excellent! Yes. In fact you leapfrogged my first idea for the "get-a-byte" snippet. (That version put a constant in X then repeatedly used CPX #$xx to set or clear carry before ROR'ing it into A). But you suggested this...
Code:
LDA #$0        ; initialize A with 0
ORA #$xx       ; xx denotes a faked operand, either $80 or 0.
ROR            ; Shift.
ORA #$xx       ; xx denotes a faked operand, either $80 or 0.
 [ repeat the last two instructions 6 more times  to get an entire byte ]

... and I had exactly the same thing until I noticed a minor improvement: one less instruction at the beginning, like this.
Code:
LDA #$xx       ; initialize A with faked operand, either $80 or 0.
ROR            ; got one bit in the top of A. Shift.
ORA #$xx       ; OR A with faked operand, either $80 or 0.
 [ repeat the last two instructions 6 more times  to get an entire byte ]

This works out to 30 clocks per byte -- I mean just the "get-a-byte" snippet that does the input (and that's where most of the time would be spent).
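
Here is a rough Python model of the shorter "get-a-byte" snippet, mainly to confirm the bit order and the 30-clock figure. It assumes Carry is clear on entry; a set Carry would be rotated into bit 7 by the first ROR, where the following ORA could not clear it. The function name is made up.

```python
def get_a_byte(bits):
    """bits: eight 0/1 values in the order the host sends them (LSB first)."""
    assert len(bits) == 8
    carry = 0                                      # assumed clear on entry
    a = 0x80 if bits[0] else 0x00                  # LDA #xx (faked operand)
    cycles = 2
    for bit in bits[1:]:
        a, carry = (a >> 1) | (carry << 7), a & 1  # ROR
        a |= 0x80 if bit else 0x00                 # ORA #xx (faked operand)
        cycles += 4                                # 2 for ROR + 2 for ORA
    return a, cycles
```

The first bit sent ends up in bit 0 of A, so the host would send each byte least-significant bit first.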

-- Jeff



PostPosted: Thu Aug 09, 2018 6:10 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Nice tweak!


PostPosted: Fri Dec 21, 2018 8:18 am 

Joined: Sun Jul 28, 2013 12:59 am
Posts: 235
I don't know if this would work or not (either electrically in terms of the data bus, or in terms of the NMOS CPU behaviour), but a possible angle for an NMOS 6502 version would be to stuff $70 (BVS) instead of $80 (BRA), and to tie /RESET to /SO. Run the clock once while the CPU is in the reset state to force the V flag to be set coming out of reset. At this point your branches jump forward by $72 instead of backwards by $7E, your RESET and IRQ vectors are $7070, and everything else is "the same".


PostPosted: Fri Dec 21, 2018 3:58 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
nyef wrote:
a possible angle for an NMOS 6502 version would be to stuff $70 (BVS) instead of $80 (BRA)
Hmmm, like this, then.... :)
Attachment: NMOS 3-wire bootup interface.png [ 7.03 KiB ]


Thanks for your interest and your suggestion, nyef! I think this version would indeed work for NMOS '02. Too bad there's no "one size fits all" which could include '02, 'C02 and '816. But the '816 lacks an /SO input! :| BTW, in this diagram I tied /SO to OP instead of /RESET, but probably either would work. And the host software would require only minor changes, as you noted.

Due to time constraints, I'm unlikely to test this version. Comments welcome. What I think would be more valuable is...
  • to clean up the host code and post it
  • to implement a Level-2 Loader similar to what Ed and I discussed

-- Jeff



PostPosted: Fri Dec 21, 2018 9:55 pm 

Joined: Sun Jul 28, 2013 12:59 am
Posts: 235
No "one size fits all"? I think I found one: Use $E8 (INX). It's a two-cycle, one byte operation with a side-effect that we don't need to care about. Valid for all three CPU types.

As the CPU comes out of reset, it reads a vector of $xxE8. At that location it reads a BRK. Suppress PCH, store PCL (now $EA), suppress P. The IRQ vector is $xxE8 again. To store any byte other than $EA, you step PCL forward one byte per two cycles of $E8 until it is two less than the value you need. Your worst-case cost for a byte is now $E9, at 462 cycles to set the correct PCL (starting from $xx00, step to $E7), and this is still faster than your worst-case with stuffing $80 (7 cycles/step for two branches taken with one page crossing and 127 steps is 889 cycles). And we might be able to optimize this further by storing PCH ($E8) as well as PCL during the initial "set S to a known value" phase, since we could set the entire stack in only two passes instead of three, even if many of the bytes set end up being $E8 (which we can compensate for).

Edit: Here's a rough timing analysis of the $80 (BRA) and $E8 (INX) versions. It may be inaccurate in terms of page-boundary penalties for branches, or anything else that I forgot, and only covers to the point that the CPU has a known stack pointer value.

9 cycles to leave RESET.

First, the $80 (BRA) version.

BRK is 7 cycles. To get to $78 from $80 is 8 steps at 7 cycles/step.
That's (* 7 9) => 63 cycles per "NOP" written. We need 252 NOPs, so
15876 cycles for those. We also need two $A2s and two $9As, and those
have to be found backwards from $00. 94 steps (665 cycles with the
BRK) and 102 steps (721 cycles, likewise), respectively. 128 cycles
to run to the start of the stack area, 258 more cycles to guarantee
that S is initialized. (+ 9 15876 665 665 721 721 128 258) => 19043
cycles to get to a "known" S value.
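
The $80 (BRA) tally above can be reproduced with a few lines of Python. The variable names are illustrative; the step counts are the ones quoted in the text.

```python
BRK = 7   # cycles per BRK
STEP = 7  # one "step": two taken branches, one of them crossing a page


def byte_cost(steps: int) -> int:
    """Cycles to spin PC by the given number of steps and then push with BRK."""
    return steps * STEP + BRK


reset = 9
nops = 252 * byte_cost(8)  # $78 is 8 steps back from $80
ldx = 2 * byte_cost(94)    # two $A2 bytes
txs = 2 * byte_cost(102)   # two $9A bytes
run_up = 128               # run to the start of the stack area
settle = 258               # guarantee that S is initialized
total = reset + nops + ldx + txs + run_up + settle
```

This gives `total == 19043`, matching the figure quoted for the BRA version.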

Now the $E8 (INX) version.

We can write two bytes per BRK to the stack, and our "default" value
is already loaded as part of the interrupt vector. So we only need
128 BRKs. All but four of these writes require no adjustment to the
program counter, so are 7 cycles each. That's (* 124 7) => 868
cycles. We also need two $E8A2 and two $E89A writes. For these, we
stuff INX instructions at two cycles per step starting from $E800 to
get to $E8A0 and $E898 (the BRK adds two to these values). These work
out at 318 and 304 cycles each, plus 28 for the BRKs themselves.
Execution starts at $00E8 and needs to step 24 bytes (48 cycles) to
get to the start of stack space, plus the same 258 cycles to guarantee
that S is initialized. (+ 9 868 318 304 28 48 258) => 1833 cycles to
get to a "known" S value.


Last edited by nyef on Fri Dec 21, 2018 10:34 pm, edited 1 time in total.

PostPosted: Fri Dec 21, 2018 10:07 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
(I'd forgotten how satisfyingly minimal this idea was!)


PostPosted: Sat Dec 22, 2018 3:07 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Wow, nyef, I believe you've cracked the "one size fits all" challenge! By avoiding both BRA (which the 6502 lacks) and /SO (which the '816 lacks) you've arrived at a recipe that's equally palatable to 6502, 65C02 and '816! :P That's due to stuffing $E8 (INX) and $00 (BRK) onto the data bus, rather than $80 (BRA) and $00 (BRK).

You're concentrating on pushing PCL, rather than PCH as in the version I presented. ( In an unpublished earlier version there was a reason -- which I've forgotten -- why PCL wasn't easily workable. I guess in the course of subsequent iterations that reason ceased to apply, but I failed to notice.)

The speed enhancement of sometimes pushing both PCH and PCL is very clever, although it may perhaps be a tad tricky to code cleanly -- we'll see. Faster is always nicer, but I'm not that worried about speed when it comes to the "initialize S" phase of the operation. What pleases me most is the "one size fits all" breakthrough. Thanks for your contribution!

-- Jeff



Update 1 (Jun 16, 2021): In this post, Enso asked,
Quote:
Dr Jefyll, how is it going with your 3-wire bootloader?
I have a new version in the works, with two key improvements! The details will be duly posted, but here's a Spoiler. :)

As noted upthread, with the original version I would manipulate PC then push its high byte (PCH) when it had the desired value. That proved successful, but (as also noted) forum member nyef very kindly chimed in with the suggestion that I manipulate and push PCL instead. Besides being somewhat simpler and faster, this makes NMOS 02's eligible -- a "one size fits all" approach -- because the resistor-based "fake" data bus values would be $E8 and 0 rather than $80 and 0 as in my original. ($80 was problematic because I sometimes used it as a BRA opcode; but BRA is not in the NMOS 02's repertoire.)

So, manipulating PCL has a double benefit, being simpler and increasing the scope to include NMOS.

The other key improvement arose when I was looking ahead to the issue of the 2nd-level loader (Loader Two) -- a subject that's been only passingly discussed so far. I'll backtrack briefly for some context:

The bottom-level loader (Loader One) is sufficient to initialize S and do a simple RCE exploit in page $01. :P That's the gist of the hack, and it gets us over the hump. But if you wanna download more than just a small program that runs in Page $01, what you run in Page $01 is Loader Two... and it in turn can download anything to anywhere, with no restrictions.

Loader 1 is necessarily very wonky. Writes executed by the processor sometimes get suppressed, and sometimes during reads the two fake resistor values get slipped onto the bus. This could happen almost anytime; for example, when the processor is fetching an instruction's opcode or its inline operand, or when it's fetching the vector associated with a BRK instruction. The activity each cycle is under control of the host.

Loader 2 is also under cycle-by-cycle control, but the wonky BRKs and PC manipulation are absent. Instead the 65xx is merely running a simple program from (mostly) RAM whose inner loop spends most of the time shifting bits into the A register.

Recently I did some sweating over the Loader 2 inner loop because it seemed suboptimal -- 2 instructions for each bit shifted. :| To improve that, I changed the resistor-based "fake" data bus values, which are now $6A and 0. This is actually a slight but tolerable disadvantage for Loader 1... but it is a major boon for Loader 2! Here's how $6A is used by Loader 2 for quickly shifting in bits.

As stored in memory, Loader 2's inner loop has eight $4A's (LSR instructions) in a row. The setup for the loop has already loaded A and Carry full of ones. What will happen is eight shifts in a row, leaving any of 256 possible values in A. The values vary because each of the eight shifts will be either a real $4A/LSR from memory (to shift in a 0) or a fake $6A/ROR from the resistors (to shift in a 1). Naturally it's the host which'll determine each individual $4A vs $6A choice, based on the data it wants to shift in. I'm pleased with the efficiency boost that results. The inner loop now only requires one instruction for each bit shifted.
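To make the shifting concrete, here is a rough Python model of those eight shifts. It is only a sketch: the function names are invented for illustration, and it assumes A = $FF and Carry = 1 on entry, as described above. Under that assumption the first bit shifted in ends up as bit 0, so the host would present each byte least-significant bit first.

```python
LSR, ROR = 0x4A, 0x6A  # real opcode from RAM vs. fake opcode from the resistors

def opcodes_for(value):
    """Hypothetical host-side helper: choose one opcode per bit, LSB first."""
    return [ROR if (value >> bit) & 1 else LSR for bit in range(8)]

def run_shifts(opcodes, a=0xFF, carry=1):
    """Model LSR A / ROR A acting on an 8-bit accumulator and the Carry flag."""
    for op in opcodes:
        shifted_out = a & 1
        incoming = carry if op == ROR else 0   # ROR rotates Carry into bit 7
        a = (incoming << 7) | (a >> 1)
        carry = shifted_out                    # bit 0 always falls into Carry
    return a, carry

# Example: ask for $A5 and see what lands in A.
a, carry = run_shifts(opcodes_for(0xA5))
print(f"A = ${a:02X}, Carry = {carry}")
```

In this model Carry stays set throughout, because the only bits that fall out during the eight shifts are the ones originally in A.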

Thanks for your interest. If you're eager, feel free to take the ball and run with it! I do hope to post fuller details "soon" but other projects are competing for my attention. Maybe I should give this one a higher priority...

ETA: one of the "competing" projects is somewhat complementary. I've now refined this idea and gotten it working, and such a sequencer could easily act as the "host" for a 3-wire download. :)

Update 2 (Aug 28 2021):

The new Level 1 loader (using dummy values $6A and 0, not $80 and 0) was successfully tested some while ago. I haven't yet found time to write a proper report, but I'm very pleased with the outcome. In summary I can say it's faster and quite a lot simpler. And, as noted above, $6A seems to hold promise for a substantially faster Loader 2. (Loader 2 hasn't gotten off the drawing board yet. I don't foresee any difficulty, though -- other than finding the time. :roll: )

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


Last edited by Dr Jefyll on Sun Oct 22, 2023 9:21 pm, edited 1 time in total.

PostPosted: Wed Aug 09, 2023 12:09 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Looking into potentially using this from SPI EEPROMs rather than connected to another computer, this is the kind of circuit I think I'd need, any thoughts?

Attachment: 3wirespiboot.png


I used two EEPROMs - one per signal. They need to be at least 2Kbit according to the stats posted above - larger would allow for later boot stages to also be present in these EEPROMs. The clock for this stage needs to be slow enough to drive them - the ones I looked at here only support up to 3MHz, but alternatives could go faster. So the clock probably needs to go through a multiplexer with a faster clock to switch to when the boot is complete.

The BOOT signal should be low on power-on to load the shift register, then high during boot, then low again after booting is complete - I think this requires some external feedback, or could possibly be driven by an RC circuit based on the EEPROM outputs so that if they are both high for a long enough period we deem the boot process to be over - that's assuming there's generally quite a bit of transitioning going on on these signals, without long periods of both being high.

I changed the reset wiring a bit to make it tolerant of brief dips as new data is clocked from the EEPROMs, in case both signals need to change at the same time. The RC circuit should be very fast, much less than a clock period - just enough to cover any transition time on the outputs from the EEPROMs after they are clocked. Maybe D2 isn't actually needed and R1 could go there instead, like in Jeff's circuit - so my only addition would be the capacitor.

The EEPROMs are clocked by PHI2 so their state changes at the rising edge of PHI2 and is held steady through the low phase of the clock.

I added the transceiver too, to avoid having resistive loads on the data bus permanently. Jeff, I think you suggested this but hinted it might cause problems; I don't think you said what they were, or I missed it. If it's an issue then we could have resistors after the transceiver, so that at least when the transceiver is disabled after booting, the data bus is not really being pulled to any particular level. The transceiver inputs need to be set up for whatever values are working best - it sounds like this is $00/$6A - and its output enable probably comes from the inverse of the RAM's one.


PostPosted: Wed Aug 09, 2023 2:47 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Thanks for your interest, George! I was dazzled by Jens Madsen's original (ie, Z80) version of this idea, and IMO it and this 65xx version are potent secret weapons for scenarios such as those I describe in the lead post. I'm eager to see your effort succeed. However, I am presently on vacation which means my comments at this time will be somewhat brief and/or half baked! :)

gfoot wrote:
Looking into potentially using this from SPI EEPROMs rather than connected to another computer, this is the kind of circuit I think I'd need, any thoughts?
I like it. :) In case anyone's wondering what the heck the shift register is for, "In order to start (or restart) regurgitation, the SPI EEPROM needs to be sent an 8-bit Read opcode followed by a 24-bit Start Address." There's further description and a timing diagram here (pertaining to an SPI-EEPROM readback scheme similar to George's).

Quote:
I changed the reset wiring a bit to make it tolerant of brief dips as new data is clocked from the EEPROMs, in case both signals need to change at the same time. The RC circuit should be very fast [...] my only addition would be the capacitor.
It's a crucial concern, because asserting the CPU /RES input even very briefly will upset the apple cart. I managed the "potential glitch" problem by wiggling the nCE and OP signals individually, one at a time. If I need to do a transition that threatens to produce a low-going glitch on the diode-OR output that feeds CPU /RES, I break it into two parts. First I transition to the state where both nCE and OP are high; then I transition to the desired state. (This is perhaps explained more clearly in the lead post under "Misc. Tips and comments.") Basically, I simply avoid the situation where both signals change at the same time and a low-going glitch could result. Compared to this, I'm not sure if the capacitor approach offers much advantage.

Quote:
I added the transceiver too to avoid having resistive loads on the data bus permanently. Jeff I think you suggested this, but hinted it might cause problems but I don't think you said what they were, or I missed it.
It's been five years since I wrote those posts, so if there's any hinting I'll need to re-read the material and evaluate the clues myself! Meanwhile, do you have any particular concerns about leaving resistive loads on the data bus permanently? The resistors can have a fairly high value (chosen, of course, as a tradeoff with bootup speed).

Signing off for now... Sorry I've failed to address some of the points you mentioned.

-- Jeff



PostPosted: Wed Aug 09, 2023 4:19 pm 

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
Dr Jefyll wrote:
In case anyone's wondering what the heck the shift register is for, "In order to start (or restart) regurgitation, the SPI EEPROM needs to be sent an 8-bit Read opcode followed by a 24-bit Start Address." There's further description and a timing diagram here (pertaining to an SPI-EEPROM readback scheme similar to George's).

Your EEPROM has a slightly different interface to the ones I was looking at - it requires a different command code and has an active-low chip select. Fairly minor differences overall though, just an extra inverter or two would be needed for that one I think.

In any case I don't actually have any yet, I'm planning to just buy a selection and see what they end up being useful for!

Quote:
Quote:
I changed the reset wiring a bit to make it tolerant of brief dips as new data is clocked from the EEPROMs, in case both signals need to change at the same time. The RC circuit should be very fast [...] my only addition would be the capacitor.
It's a crucial concern, because asserting the CPU /RES input even very briefly will upset the apple cart. I managed the "potential glitch" problem by wiggling the nCE and OP signals individually, one at a time. If I need to do a transition that threatens to produce a low-going glitch on the diode-OR output that feeds CPU /RES, I break it into two parts. First I transition to the state where both nCE and OP are high; then I transition to the desired state. (This is perhaps explained more clearly in the lead post under "Misc. Tips and comments.") Basically, I simply avoid the situation where both signals change at the same time and a low-going glitch could result. Compared to this, I'm not sure if the capacitor approach offers much advantage.

The trouble for me is that the CPU clock is the same as the SPI clock, so I can't really change one signal and then the other without the CPU executing a cycle in between. The capacitor is meant to help with this, but if it's not enough then it might need a more complex clocking circuit.

Quote:
Quote:
I added the transceiver too to avoid having resistive loads on the data bus permanently. Jeff I think you suggested this, but hinted it might cause problems but I don't think you said what they were, or I missed it.
It's been five years since I wrote those posts, so if there's any hinting I'll need to re-read the material myself! Meanwhile, do you have any particular concerns about leaving resistive loads on the data bus permanently? The resistors can have a fairly high value (chosen, of course, as a tradeoff with bootup speed).

I think I figured it out - or at least that my initial scheme wouldn't work very well - it's that when we are blocking writes to RAM, the CPU is still driving the data bus, so we can't simply use the RAM's chip select inverted to control a transceiver, it needs to also be disabled during write operations. There are options, like putting the resistors on the other side of the transceiver and hooking its direction up to RWB, but it's at least not as simple as replacing the resistors with a transceiver alone.

My only concern with the resistive load is that one of the design constraints on the prototype I designed a few weeks ago was to avoid loading this bus as much as possible. It is probably being a bit extreme and overcautious though.

Quote:
Signing off for now... Sorry I've failed to address some of the points you mentioned.

Not at all, good to hear from you and I hope you enjoy your vacation!


PostPosted: Fri Aug 11, 2023 6:26 pm 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
I wonder if you can do it in a simpler way.

Assume a 12 bit counter. Put its output onto the address bits of a 4k parallel ROM, and you have a general purpose signal generator.

Use the ROM output for SPI clock, MOSI and select, as well as RAM /CS etc. One state bit, which disables the counter when done, could be reused as CPU /RES and /BE. Put the upper 8 bits of the counter on address bits 0-7, pulling the other address bits up. Then use a serial-in, parallel-out shift register to read SPI MISO and put it onto the data bus, so that the first SPI ROM block is read into page $FFxx.

That's basically the discrete version of what I have implemented as part of a larger CPLD.

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


PostPosted: Sat Aug 26, 2023 1:37 pm 
Offline

Joined: Fri Jul 09, 2021 10:12 pm
Posts: 741
I've come back to this, reread the notes, and thought through how the stage 1 loader works using the new $6A system. I had to transcribe everything onto paper and doodle there, to fully get my head around the building blocks. I thought I'd summarise my notes here, but am conscious it might not add much to the discussion, as I am just retreading Jeff's path - however, this train of thought helped me figure out how to control this system, so maybe it will be helpful to others.

One thing I've found is that regarding the reset issue, at least in the first stage loader it never comes up because I can always transition from having nCE and OP both high to having one being low - I never need to directly transition between one being low and the other being low.

So we have two control lines going to the CPU/RAM - nCE and OP. I enumerated the combinations of states of these lines as 0, 1, 2, and 3, with the low bit being the state of OP and the high bit being the state of nCE. Thus state 0 is reset; state 1 is normal operation with RAM enabled; state 2 is RAM disabled, pushing 0 to the data bus; and state 3 is RAM disabled, pushing $6A to the data bus.
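A small Python sketch of that numbering may help when reading the state strings further down. The names are my own, and the check for 1-to-2 and 2-to-1 transitions simply encodes the glitch concern discussed earlier in the thread; it is an illustration, not tested tooling.

```python
# state = (nCE << 1) | OP, as enumerated above
STATE_MEANING = {
    0: "reset (nCE low, OP low)",
    1: "normal operation, RAM enabled (nCE low, OP high)",
    2: "RAM disabled, $00 on the data bus (nCE high, OP low)",
    3: "RAM disabled, $6A on the data bus (nCE high, OP high)",
}

def encode(nce, op):
    """Combine the two line levels (0 or 1) into a state number."""
    return (nce << 1) | op

def risky_transitions(states):
    """Return positions i where states[i-1] -> states[i] swaps 1 and 2.

    In those transitions both lines change at once, so the diode-OR that
    feeds /RES could briefly see both low.
    """
    digits = [int(ch) for ch in states]
    return [i for i in range(1, len(digits))
            if {digits[i - 1], digits[i]} == {1, 2}]

print(STATE_MEANING[encode(nce=1, op=1)])
print(risky_transitions("333323113"))   # an example sequence with no 1<->2 swaps
```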

Then we can list the various actions we can take to control the CPU, along with the specific state transitions necessary to achieve them. In contrast to Dr Jefyll's original post, I am considering that there's a clock pulse (low-high-low) after every state change, so there's no need to encode the clock. I haven't built any hardware though so may have missed a reason why this wouldn't work. (Dr Jefyll's description had the clock idling in the high state, and for a RAM write operation he was toggling nCE low then high without ticking the clock.)

I'll enumerate the actions that I think are possible or useful:

RESET - sequence 000...33333 - this forces a reset for a number of clocks, then takes five clocks of non-reset state to recover, during which the value doesn't matter much. I've used 3 because that's the safest state, transition-wise. SP is reduced by 3 and the CPU is on the point of loading a new address, which I'll mention below.

INC PC - sequence 33 - this is a ROR A operation, the second state could equally be a 2 but we avoid 1, perhaps, because of the possible need to follow this up with a "2" state (we want to avoid 1-to-2 or 2-to-1 transitions) and also because perhaps reading from arbitrary PC addresses could cause I/O issues.

BRKxx (stack write and branch) - sequence 23, followed by 1 or 3, then 1 or 3, then 3 - this is a BRK instruction. The second state could be a 2, but again we want to avoid a possible transition from 2 to 1, so use 3 here. The third state controls whether the first stack write - PC high - occurs. The fourth controls the second stack write - PC low. For these, 1 allows the stack write, 3 prevents it. The fifth controls the status register stack write, which is not useful so is always disabled. After this operation the CPU is ready to receive a new PC address.

From now on I'll write BRK for a BRKxx operation with both writes disabled, BRKH for one with the PCH write enabled, BRKL for one with PCL write enabled, and BRKHL for one with both PC byte writes enabled.
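For reference, the four variants can be generated mechanically. This is just a sketch in Python with a made-up helper name; it encodes the five-state pattern described above (opcode fetch, padding byte, PCH write, PCL write, status write).

```python
def brk_states(push_pch=False, push_pcl=False):
    """Five states for one BRK: '2' feeds the $00 opcode, '3' feeds the
    ignored padding byte, then one state per stack write.
    '1' lets a write reach RAM, '3' suppresses it."""
    return ("23"
            + ("1" if push_pch else "3")   # PCH write
            + ("1" if push_pcl else "3")   # PCL write
            + "3")                         # status write, always suppressed

BRK   = brk_states()
BRKH  = brk_states(push_pch=True)
BRKL  = brk_states(push_pcl=True)
BRKHL = brk_states(push_pch=True, push_pcl=True)
print(BRK, BRKH, BRKL, BRKHL)
```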

I think those are all the useful actions. The only other option is starting with state 1, which would read from RAM, and that's only useful if there's something in RAM, so later in the process.

RESET and BRKxx end with the CPU about to read a program counter value, and values we can load there are $0000 (states 22), $006A (states 32), $6A00 (states 23), or $6A6A (states 33). Note that state 2 supplies zero, state 3 supplies $6A, and the low byte is first.

So using these basic primitives we can set the program counter to a specific value, and then write either the low byte, or the high byte, or both to the stack. It is much cheaper in terms of CPU cycles to write arbitrary values via the low byte, because it takes a lot of cycles to change PC's high byte. All these stack writing operations reduce SP by 3. After each stack-writing operation, the CPU will be expecting a new PC value, so it is very convenient to consider supplying this to be part of the next high level operation we perform - i.e. all these high level operations will start by supplying an initial PC value.

As an example of such a higher level operation, we can set PC to $6A6A, advance by one, then write $6A followed by $6D to the stack (note that the low byte is two higher than you'd expect), followed by reducing SP one more time without writing, by executing the sequence "load 6A6A; INC PC; BRKHL" which, in states, is "33; 33; 23113". This state sequence is effectively a macro to perform the operation of writing $6A6D.. to the stack.
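Here is a minimal Python sketch of how such macros could be assembled while keeping track of what ends up on the stack. The class and method names are hypothetical; the model assumes that BRK pushes PC+2 and that only $00 and $6A can be supplied for each byte of a loaded PC, as described above.

```python
class Sequencer:
    """Builds state strings and tracks the PC the CPU is believed to hold."""

    def __init__(self):
        self.states = []   # one string per primitive
        self.pc = None     # unknown until a PC value is loaded
        self.pushed = []   # (PCH or None, PCL or None) per BRK

    def load_pc(self, value):
        lo, hi = value & 0xFF, value >> 8
        assert lo in (0x00, 0x6A) and hi in (0x00, 0x6A), "only $00/$6A bytes"
        # low byte is fetched first; state 2 supplies $00, state 3 supplies $6A
        self.states.append(("3" if lo else "2") + ("3" if hi else "2"))
        self.pc = value

    def inc_pc(self, count=1):
        self.states.extend(["33"] * count)   # each forced ROR advances PC by 1
        self.pc = (self.pc + count) & 0xFFFF

    def brk(self, push_pch=False, push_pcl=False):
        ret = (self.pc + 2) & 0xFFFF         # BRK pushes PC+2
        self.states.append("23" + ("1" if push_pch else "3")
                                + ("1" if push_pcl else "3") + "3")
        self.pushed.append((ret >> 8 if push_pch else None,
                            ret & 0xFF if push_pcl else None))
        self.pc = None                       # CPU now wants a new PC

seq = Sequencer()
seq.load_pc(0x6A6A)
seq.inc_pc()
seq.brk(push_pch=True, push_pcl=True)
print(";".join(seq.states), seq.pushed)
```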

So what values are useful to write to the stack? Dr Jefyll's original stack code was essentially two copies of "A2; xx; 9A" with EAs in between - that's "LDX #xx; TXS; NOP; NOP; ..." - the value of xx doesn't matter so long as it's consistent. In the $6A version we use 6A instead of EA, with much the same effect but using ROR instead of NOP. 6A is by far the cheapest useful value to write to the stack; 00 is also easy to write but we mustn't because it would cause unpredictable BRKs.

We want to write at least two copies of the useful initialisation code into the stack. Writing more copies could reduce the number of instruction cycles necessary later on to execute the stack memory and know confidently that the required operations have taken place. Spreading out the two copies also reduces the number of cycles needed. Writing "A2 6A 9A A2 6A 9A 6A 6A 6A ...", if we happen to start execution at the second "6A", would require 256*2=512 cycles to fully execute the critical 9A (TXS) instruction. But if the second copy is halfway through, then we only need half as many cycles. Using three copies would reduce the number of cycles to one third of the original value. However, as far as I can see, each extra copy costs about 190 cycles to insert, so at that point it's no longer worth adding more copies.
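The trade-off can be put into rough numbers. The following Python sketch uses only the approximate figures quoted above (512 cycles of execution with a single copy, about 190 cycles to insert each extra copy) and assumes evenly spaced copies; the cost model is my own simplification, so treat the result as indicative only.

```python
def total_cycles(copies, run_cost_one_copy=512, insert_cost=190):
    """Estimated cycles: inserting the extra copies plus the worst-case
    execution needed afterwards, assuming evenly spaced copies."""
    return insert_cost * (copies - 1) + run_cost_one_copy / copies

for n in range(1, 5):
    print(n, round(total_cycles(n)))

best = min(range(1, 5), key=total_cycles)
print("cheapest number of copies under these assumptions:", best)
```

With these figures two copies come out cheapest, which agrees with the reasoning above.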

We can fill the whole stack with $6A by executing BRKH instructions back-to-back, loading the program counter with $6A00 or $6A6A after each one. This costs 7 cycles per stack byte. We can also fill it more efficiently by writing the high and low bytes together. This costs 7 cycles for two stack bytes, but the second (low) byte costs a bit more to set up. We can't just load $6A00 or $6A6A because that would result in the low byte being $02 or $6C (two greater than the loaded value of PC), neither of which is an acceptable opcode - $02 is undocumented, and $6C is a JMP instruction. We need the low byte to be a single-byte, two-cycle instruction, occurring as soon after $02 or $6C as possible. The first that looks useful is $0A - ASL.

So we can fill most of the stack with - in descending address order - $6A, $0A, $?? - by loading address $6A00, executing 8 "INC PC" operations, then one BRKHL operation, and repeating this 256/3 times (85 and a third). After that - or before it - we need to fill in the blanks ($??) that were not written during this step.
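The number of "INC PC" operations needed for a given low byte follows from the two loadable starting values and the +2 that BRK adds. A small Python sketch (the helper name is invented, and it assumes we must not let PCL wrap, since that would carry into the high byte):

```python
def cheapest_setup(target_low):
    """Return (loaded_low_byte, inc_count) giving the fewest INC PC
    operations so that BRK pushes target_low, or None if neither
    loadable low byte ($00 or $6A) can reach it without wrapping."""
    options = []
    for base in (0x00, 0x6A):
        incs = target_low - 2 - base      # BRK pushes PC+2
        if incs >= 0:
            options.append((incs, base))
    if not options:
        return None
    incs, base = min(options)
    return base, incs

for target in (0x0A, 0x9A, 0xA2):
    base, incs = cheapest_setup(target)
    print(f"${target:02X}: load low byte ${base:02X}, then {incs} x INC PC "
          f"({2 * incs} cycles)")
```

This reproduces the 8 increments from $6A00 for $0A, and shows why $A2 and $9A are cheaper to reach from a low byte of $6A.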

To understand this better, I considered a 16-element stack instead of 256. The remainder modulo 3 is still 1, and it allows easier visualisation of the layering process. We don't know the initial SP value, but everything here is relative to that. Let's start out by writing the LDX #$6A instruction as that's cheap and easy to do:

Code:
    .. .. .. .. .. .. .. .. .. .. .. .. ..|.. A2 6A|
                                        sp

The state sequence for this is "33" to load $6A6A; "33" $36 times to increment it by $36 to $6AA0; then the BRKHL sequence "23113". I've delimited the stack locations written by this operation with bars.

Now we want to fill the rest of the stack with $6A0A no-ops, except for a bit in the middle that we'll come back to later:
Code:
    ..|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. A2 6A|
    sp

The sequence for writing $6A0A?? is "23;3333333333333333;23113" (the leading "23" loads $6A00, and the eight "33" pairs advance PC to $6A08), and we repeat that (16-1)/3-1 times, leaving the stack pointer's value one higher than it was before we did anything at all.

Now we consider the next three locations down in the stack. We can write this first location as the high byte of the next BRK, but we want it to be $9A, and that's expensive to write as the high byte - so we will leave it blank, and overwrite it on the next pass instead. The second location - the low byte - will wrap and coincide with the first byte we wrote, 6A, and we don't need to change that byte, so let's not. So the next operation is a non-writing BRK; it also doesn't matter what PC address we load, so let's say "33;23333":
Code:
    ..|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. A2 6A|
    .. .. .. .. .. .. .. .. .. .. .. .. .. ..|.. ..
                                           sp

The second row shows second-tier writes to the stack, so the final value for each stack location is the one on the lowest row in which it's filled in. None of the second row is filled yet.

On this next pass through the stack we need to fill in all the high bytes. We can do that very cheaply with $6A, so let's pencil that in for now - bearing in mind, again, that we need to come back to this and make sure we fit a second "$A2 $6A $9A" sequence in the middle later on. This is state sequence "33;23133" repeated the same number of times as above:
Code:
    ..|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. A2 6A|
    .. ..|.. .. 6A|.. .. 6A|.. .. 6A|.. .. 6A|.. ..
       sp
    .. .. 0A 6A 6A 0A 6A 6A 0A 6A 6A 0A 6A 6A A2 6A

For clarity I've shown the combined stack content in a row under the sp line.

Now we need the next byte to be some form of no-op (6A or 0A) and the one after it to be 9A. Let's write $6A9A to achieve that:
Code:
    ..|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. 0A 6A|.. A2 6A|
    9A 6A|.. .. 6A|.. .. 6A|.. .. 6A|.. .. 6A|.. ..
                                                |..
                                              sp
    9A 6A 0A 6A 6A 0A 6A 6A 0A 6A 6A 0A 6A 6A A2 6A

This is very close to what we wanted! But we want a second copy of the magic sequence in the middle. This requires writing A2 and 9A with a one-byte gap. The cheapest way to do these is as the low bytes of existing operations, for example like this:
Code:
    ..|.. 0A 6A|.. 0A 6A|.. 9A 6A|.. 0A 6A|.. A2 6A|
    9A 6A|.. .. 6A|.. A2 6A|.. .. 6A|.. .. 6A|.. ..
                                                |..
                                              sp
    9A 6A 0A 6A 6A 0A A2 6A 9A 6A 6A 0A 6A 6A A2 6A
                      ^^    ^^

This required changing one of the "6A0A.." writes to a "6A9A..", and changing one of the "6A...." writes to "6AA2..". In my 16-byte stack there were 4 of each; now there's just one "6A0A.." followed by the new "6A9A..", then the last two "6A0A.."; and in the second pass it's the other way around - two "6A...." operations first, then the new "6AA2..", then the remaining one "6A....". I think this will also work out in the 256 byte stack - for my 16-byte case, 4 is (16-1)/3-1, and with a 256-byte stack it will be (256-1)/3-1 = 84, and this will split into 42x"6A0A.." + "6A9A.." + 41x"6A0A.." for example.
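The layering is easy to get wrong on paper, so here is a small Python sketch that replays the operations on a wrapping 16-entry stack. Everything in it (names, the starting SP, the dummy pushed value $6A6C for writes whose low byte is suppressed) is an assumption for illustration; it models only the stack writes, not the state strings.

```python
def run_brks(ops, size=16):
    """Apply (pushed_pc, write_pch, write_pcl) operations to a wrapping stack."""
    stack = [None] * size
    sp = size - 1          # arbitrary start; only relative positions matter
    for pushed_pc, write_pch, write_pcl in ops:
        slots = ((pushed_pc >> 8, write_pch),     # PCH is pushed first
                 (pushed_pc & 0xFF, write_pcl),   # then PCL
                 (None, False))                   # status write always suppressed
        for value, enabled in slots:
            if enabled:
                stack[sp] = value
            sp = (sp - 1) % size
    return stack, sp

def show(stack):
    return " ".join(".." if b is None else f"{b:02X}" for b in stack)

HL, H, NONE = (True, True), (True, False), (False, False)
ops = (
    [(0x6AA2, *HL)]                                   # first "A2 6A"
    + [(0x6A0A, *HL), (0x6A9A, *HL), (0x6A0A, *HL), (0x6A0A, *HL)]
    + [(0x6A6C, *NONE)]                               # non-writing BRK
    + [(0x6A6C, *H), (0x6A6C, *H), (0x6AA2, *HL), (0x6A6C, *H)]
    + [(0x6A9A, *HL)]                                 # wraps round to finish
)
final_stack, final_sp = run_brks(ops)
print(show(final_stack))
```

The printed row can be compared directly against the bottom line of the last diagram.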

Finally the next step in Jeff's process is to execute this code, starting at $0100. We get there by loading PC with $006A (state codes "32") and executing a sequence of RORs ("33" for each; we need $100-$6A copies of this). Now we need to execute at least 256+2=258 CPU cycles to be confident that the A2;6A;9A sequence was executed once. Every operation takes two cycles; there is a subtlety in that "A2;6A" is a single instruction though and this means the final value of PC isn't entirely known. That's OK though because we certainly didn't reach the end of page 1, so after 258 cycles it will be slightly after the middle of page 1, plus or minus a byte or two, and it will also be about to execute a new instruction, so we can inject a BRK (23333) and be ready to supply a new PC. Now the stack pointer will be the value we loaded ($6A) minus three due to the BRK.

(As an aside, could we load a different value for SP? Not efficiently - 6A worked well because it's the high byte of a two-byte write operation. It would cost a lot of cycles to load something else.)

This brings us to the next stage, which I haven't thought through a lot yet. I guess we'd just write a short stub at the start of the stack page which loops over the critical code sequence for reading arbitrary data efficiently:
Code:
l0100:
    lda #$ff
    lsr : lsr : lsr : lsr : lsr : lsr : lsr : lsr
    pha
    jmp l0100

This is short, and only capable of filling the stack with arbitrary values more efficiently than we can using the previous mechanism. But it's enough then to get a third stage loader in, which can fill the rest of the memory.

Or maybe it's better to just make this stage go ahead and load the whole memory image instead of just the stack:
Code:
    ldy #2 : sty 1
loop:
    lda #$ff
    lsr : lsr : lsr : lsr : lsr : lsr : lsr : lsr
    sta (0),y
    iny : bne loop
    inc 1 : bne loop

We can override the reads from $00 with forced reads of 0, so no need to initialise that; this then streams data into memory from $0202 upwards, until we end it with a BRK-based jump.

The opcodes are likely rather awkward values for us to load in, so this will probably be best done as three passes through the stack, using BRKL to write single bytes each time. For most of the stack we don't care what's written, so don't need to load specific PC values - we only need to be careful for the bit at the bottom where our actual code is.

I'd be interested in how close my thoughts on all of this are to your existing implementation Jeff. I have tried to make the first stage as efficient as possible, but sticking to $6A as the data bus value as it's ingenious how much that will help with the second stage loader! And having the second stage loader be minimal is also helpful as it's still rather expensive to load its code bytes in.

I will probably prototype something this weekend, just a 6502 with RAM, no I/O, and I'll use hoglet's decoder to check it's operating correctly, loading all the code etc. I may go the SPI EEPROM route, but more likely will start with a microcontroller sending the control signals.

Edit: I wrote some code to generate the sequence for stage 0 and stage 1. My stage 2 is 24 bytes long. Getting that in place costs about 3500 cycles for stage 0 (establishing a known SP) and 4700 cycles for stage 1 (loading the 24 bytes of stage2 into place). It's more than I expected given nyef's calculations, but $6A is less efficient in the early stages. So unless I've missed some optimisation, these two stages require 16kbit of data. Stage 2 can then load at about 28 cycles per byte, requiring 56 bits of storage per byte. So a pair of 16kbit EEPROMs could probably load about 2K of "stage 3" code, while a pair of 32kbit EEPROMs could load about 6K.

Edit 2: After more tinkering, reducing the size of stage 2 helps a lot, even if it hampers its functionality. The shortened code below saves 1200 cycles during stage 1, though it's only able to fill the stack, albeit more efficiently than stage 1 was (24 cycles per byte I think, whereas stage 1 averages something like 250 cycles per byte).

Code:
; On entry, X = $6a, SP = $fe

* = $100

loop:
    lda #$ff
    lsr : lsr : lsr : lsr : lsr : lsr : lsr : lsr
    pha
    bcs loop


It could end by overwriting its own last byte so that execution can flow into the newly-written stage 3 code without needing another costly BRK-based branch. Stage 3 can then use more flexible instructions to read data into other memory, as by the time stage 2 is running, all values we load have the same fairly-low cost; or it could more directly do efficient I/O from another source. In the latter case, a pair of 16Kbit EEPROMs could fill the stack with stage 3 code to execute.
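As a sanity check on the 24-cycles-per-byte figure for the shortened loop, here is the sum in Python. It uses the standard 6502 cycle counts (2 for LDA immediate and for LSR A or ROR A, 3 for PHA, 3 for a taken branch within the same page); the dictionary layout is just for illustration.

```python
# (cycles, bytes) per instruction; ROR A substituted by the host costs
# the same as the LSR A it replaces
TIMING = {
    "lda #imm":  (2, 2),
    "lsr a":     (2, 1),
    "pha":       (3, 1),
    "bcs taken": (3, 2),   # branch stays within page $01, so no extra cycle
}

loop = ["lda #imm"] + ["lsr a"] * 8 + ["pha", "bcs taken"]
cycles_per_byte = sum(TIMING[name][0] for name in loop)
loop_size = sum(TIMING[name][1] for name in loop)
print(cycles_per_byte, "cycles per byte,", loop_size, "bytes of code")
```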

There's a lot of potential to compress the data stream, with an RLE decoding circuit for example - I've played with those before for monochrome video compression, they're fairly simple to build. I'm also conscious that we only used "reset" once in this whole operation - perhaps we can let the system handle the reset itself, and watch out for it happening (does that count as an extra wire? Maybe tristate one of the wires?) so that we can use the both-wires-low signal to mean something different, or just take advantage of this in the data compression.

