6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Sep 20, 2024 12:27 pm

All times are UTC




Post new topic Reply to topic  [ 16 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Sun Feb 10, 2019 10:31 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
When bootstrapping or developing for your SBC, it's convenient to be able to reprogram your EEPROM in-circuit instead of pulling it out of its socket and shoving it in a dedicated programmer. EEPROMs like the AT28C256 make this easy since they can respond to standard CPU write cycles and don't need an external programming voltage. However, they do have the complication of taking several milliseconds to complete the write cycle once initiated, and their read behaviour during this time precludes accessing any routines or IRQ vectors that might normally be stored in that device, even if they are not being overwritten. This includes any fancy interrupt-driven serial routines or monitor programs you might already have implemented.

I've just scratched together a 256-byte wonder (actually 249 bytes at the moment) which can talk over a typical UART to receive data from a loader running on a host computer, verify a strong CRC32C checksum, and perform the correct write-cycle routine for an Atmel EEPROM. It can be configured to work with most typical UARTs through constants at the top of the assembly file, and uses 9 bytes of zero page. There is some error handling to allow the SBC to recover from foreseeable glitches on the serial line, even if the flashing goes wrong; you can send a break condition on the serial line to force resynchronisation to the protocol, and there is a status byte returned for each block. At the end, you can instruct the SBC to return to caller or to jump to an arbitrary address.

Keeping it down to 256 bytes allows loading the code into RAM using the simplest possible bootstrap routines, without having to nest or repeat a copy loop. This includes cases where you might be toggling such a bootstrap in by hand - 256 bytes is too tedious for that, but a loop to pull 256 bytes over serial should be fine.

Currently there's no provision for reading data back from the SBC, nor for comparing the current contents of the EEPROM and skipping unnecessary writes (may be desirable to minimise wear), nor for running an overall checksum over the EEPROM as a final verification step. So far I haven't figured out how those would all fit into 256 bytes. However, you can simply load extra code into RAM this way and execute it.

Now I just need to write the host-side utility which can talk to this. Python should work…


Top
 Profile  
Reply with quote  
PostPosted: Sun Feb 10, 2019 12:20 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
This is the last segment of code in the file, but the first written - the CRC32C routines, occupying 53 bytes:
Code:
crc32c_read:
   JSR serial_read
crc32c_byte:
   ; combines one byte into crc - clobbers X, preserves A & Y
   PHA
   PHY
   STA crctmp
   LDX #8
:   ROL crctmp
   ROL crc+3
   ROL crc+2
   ROL crc+1
   ROL crc+0
   BCC :++
   LDY #3
:   LDA crc,Y
   EOR crc32c_polynomial,Y
   STA crc,Y
   DEY
   BPL :-
:   DEX
   BNE :---
   PLY
   PLA
   RTS

crc32c_polynomial:
   .byte $1E,$DC,$6F,$41

crc32c_check:
   ; returns Z set iff crc matches next 4 bytes (after inversion) - clobbers A,X
   LDX #3
:   JSR serial_read
   EOR crc,X
   INC A
   BNE :+
   DEX
   BMI :-
   INX
   RTS
:   LDA #'C'
   JSR serial_write
   RTS
It turned out to cost very little to run a 32-bit CRC rather than a 16-bit one. The "Castagnoli" version is one of the more robust formulations; by contrast it was hard to tell which of the many 16-bit CRCs was most effective.

A failed CRC causes a "C" to be sent back to the host, and the calling code drops the errored frame and waits for a sync pattern before resuming work. This protects the SBC from a misconfigured or noisy serial line corrupting the EEPROM.


Top
 Profile  
Reply with quote  
PostPosted: Sun Feb 10, 2019 1:35 pm 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1382
I've been doing something quite similar for several years now. As most terminal programs support Xmodem, I wrote a loader that talks to the terminal program via Xmodem-CRC. The CRC is 16-bit, but I don't see a problem with that. My loader also detects and processes S-record files automagically, so code I write using WDC Tools is simple to move over to the SBC. Once loaded into RAM (user can input an offset for S-Record or a load address for binary data) I use the EEPROM program command in the monitor to move it into the EEPROM. This provides a simple and clean method to write code on the PC, quickly load it to the SBC and optionally write it to EEPROM when you're happy with it. The nice part is both the console and the Xmodem data transfers function over the same serial port, so no extra com port or cabling is required. I also don't have to write any code on the PC side.

I've recently added a Xmodem-CRC save function so I can move data from the SBC via the same console port back to the PC. Using Xmodem, I don't have to worry about file names on the SBC, just use the terminal program interface to the filesystem on the PC to send or receive files. I'm adding code to my CMOS version of EhBasic to use the Xmodem Load and Save functions next, which will allow me to easily load and save Basic programs as well.

I used the above scenario to get a working CMOS version of Enhanced Basic last year, as I did this when I away from the house and no access to my programmer. I also used this setup for writing the disassembler some years ago and later integrated it into the Monitor.

A side note: CRC-16 routine is 39 bytes.

Code:
; CRC-16 Generation program. This routine generates the 16-bit CRC for the 128 byte
;  data block stored in RBUFF. It is a separate routine as it's used in both the
;  Xmodem load and save routines. It saves 31 bytes with a small penalty in speed.
CRC16_GEN       STZ     CRCLO           ;Reset the CRC value by
                STZ     CRCHI           ;putting all bits off
                LDY     #$00            ;Set index for data offset
CALCCRC         LDA     RBUFF+2,Y       ;Get first data byte
                PHP                     ;Save status reg
                LDX     #$08            ;Load index for 8 bits
                EOR     CRCHI           ;XOR High CRC byte
CRCLOOP         ASL     CRCLO           ;Shift carry to CRC low byte
                ROL     A               ;Shift bit to carry flag
                BCC     CRCLP1          ;Branch if MSB is 1
                EOR     #$10            ;Exclusive OR with polynomial
                PHA                     ;Save result on stack
                LDA     CRCLO           ;Get CRC low byte
                EOR     #$21            ;Exclusive OR with polynomial
                STA     CRCLO           ;Save it back
                PLA                     ;Get previous result
CRCLP1          DEX                     ;Decrement index
                BNE     CRCLOOP         ;Loop back for all 8 bits
                STA     CRCHI           ;Update CRC high byte
                PLP                     ;Restore status reg
                INY                     ;Increment index to the next data byte
                BPL     CALCCRC         ;Branch back until all 128 fed to CRC routine
                RTS                     ;Return to caller
;

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Sun Feb 10, 2019 2:01 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
If we compare like for like, I have 32 bytes for the "process one byte" subroutine, including the constants and the call to the "get a byte" routine. And yes, it's slow, but it only has to keep up with a serial line.

I looked at XMODEM and promptly broke out in hives, then I looked at S-Record format - which isn't so bad, but parsing it was an extra complication that would be hard to squeeze into the size I wanted. So I have a custom protocol that should be easy to translate from S-Record format on the host, but has error protection and retransmit capabilities, and may be as much as twice as fast on a slow serial link. It should be straightforward to incorporate a minimal terminal in the host application, for interacting with an SBC's monitor and bringing up the flashing code.
Code:
   ; tell host we're ready to begin (SYN ENQ)
   LDA #22
   JSR serial_write
   LDA #5
   JSR serial_write

serial_sync:
   ; wait for sync pattern (SYN ACK)
:   JSR serial_read
:   CMP #22   ; SYN
   BNE :-
   JSR serial_read
   CMP #6   ; ACK
   BNE :-

   ; 'W' frame is command code (1B), address (2B), length (1B), data (1-256B), checksum (4B)
   ; 'X' frame is command code (1B), address (2B), checksum (4B)
   ; 'R' frame is command code (1B), checksum (4B)
   ; write data using 'W' command, acknowledged by '.' character if CRC correct
   ; jump to address using 'X' command, acknowledged by '}'
   ; return to caller using 'R' command, acknowledged by ']'
   ; unknown commands replied to with 'U'; follow with sync
   ; bad CRCs replied to with 'C'; follow with sync
   ; hardware detected errors (and break conditions) replied to with 'H'; follow with sync
   ; addresses are in big-endian order
   ; address and length must fit in a 64-byte aligned page for correct EEPROM writing
   ; when writing to RAM, 256-byte frames are fine


Top
 Profile  
Reply with quote  
PostPosted: Sun Feb 10, 2019 2:33 pm 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1382
I agree it's not exactly apples to apples. As I'm using Xmodem, I receive the entire block first, then check for the correct block number, generate the CRC, compare it to what was received, then either move it into memory or start the S-Record decode and move to memory. Even with a 4MHz CPU, this easily keeps up with a 38.4K baud rate.

While Xmodem isn't the prettiest data transfer implementation, it's not exactly "break out in hives" ugly, but it has been around for many decades and has kinda stood the test of time. It's also an easy fit and handles bad records and the handshaking back and forth.

Outside of the 6 text messages (254 bytes) and the BIOS routines which send/receive data via the UART, my send/receive Xmodem code is sitting at 691 bytes. This includes everything for CRC-16, S-Records, user prompts for setting up the terminal program, error handling, etc. So one could say it's on the large side at 945 bytes, but it works very well and is quite simple to use.

All of this is in the Monitor code, which is about 5KB now, but with a lot of function. The BIOS and I/O page is under 1KB. In my overall memory map, the upper 2KB is reserved for BIOS and I/O followed by another 6KB for the Monitor. The only remaining piece for the Monitor is the assembler, which I'll hopefully get around to later this year. I have over 1KB left in the BIOS area to add more hardware.

In any case, I'm looking forward to see your implementation, as there's always (yet) another way to skin a cat.

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Sun Feb 10, 2019 2:54 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
One big complication I have is with testing this, since I don't presently have working 65xx hardware to hand. I'll need to look into whether one of the emulators has a virtual serial port I can interface with. The BBC Micro came with a 6551…


Top
 Profile  
Reply with quote  
PostPosted: Sun Feb 10, 2019 3:33 pm 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
In my recent project, I used a very simple loader. I just take the binary data, put a header in front and CRC at the end, and then just dump the whole thing over the serial port as-is. The end of the data is simply detected by a period of silence.

The header tells where the data goes. If the CRC is correct, it programs it into the flash, and if the CRC is wrong, it sends an error message.

Worked very well, and only requires screenful of code.


Top
 Profile  
Reply with quote  
PostPosted: Sun Feb 10, 2019 9:54 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
I think my natural approach to the problem of detecting completion is to include a length in the header, rather than mucking about with timing.

One thing you can do with my approach is to reprogram several mapped banks of a large EEPROM, by interspersing one-byte writes to mapping registers. This includes dealing with EEPROMs that are larger than available RAM. The larger address space of an '816 can also be accommodated with small changes to the target-side code, and one extra byte of zero^W direct page memory.


Top
 Profile  
Reply with quote  
PostPosted: Mon Feb 11, 2019 7:01 am 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
The advantage of listening for silence is that you can never lose sync. Plus it's really simple. Here's my code:
Code:
        ldx     #0
@1:     bit     $cff8
        bvs     @got_byte
        dex
        bne     @1                      ;
        bra     @done

The 'bit $cff8' detects whether a new byte is ready. After 256 iterations of silence, it assumes that everything was received.


Top
 Profile  
Reply with quote  
PostPosted: Mon Feb 11, 2019 9:45 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Quick calculations suggest that code would timeout between consecutive bytes received at 9600 baud with the CPU at 4MHz. So that's not really encouraging, as it would be very easy for a newbie to pick those settings by accident, then wonder why your "foolproof" code doesn't work.

A protocol with a length field will only get out of sync if there's a transmission error resulting in bytes getting dropped. My code therefore has no sense of time whatsoever, but *is* sensitive to the UART's error flags as well as the data-available flag. The host can explicitly trigger one of these error flags by setting the "break condition" on the line (a 6551 will detect it as a framing error; a 28L92 will correctly identify it as a break). This gives an opportunity to get back into sync and generally recover the flashing process.

Meanwhile, a bootstrap to load a 256-byte utility might look like:
Code:
start:
   LDX #0
:   BIT uart_status
   BVC :-
   LDA uart_rx
   STA target,X
   INX
   BNE :-
target:
This requires that the utility is padded to *exactly* 256 bytes, but that's easy to arrange. The bootstrap is just 16 bytes and can be placed at, say, $0FF0 to load code at $1000.


Last edited by Chromatix on Mon Feb 11, 2019 10:05 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Mon Feb 11, 2019 9:51 am 
Offline
User avatar

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
Quote:
code would timeout between consecutive bytes received at 9600 baud with the CPU at 4MHz

Correct, but I was going for minimal code in a small (<256 byte) loader, fixed at 115200 baud.


Top
 Profile  
Reply with quote  
PostPosted: Mon Feb 11, 2019 10:53 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Well, here are my UART interfacing routines, along with the configurable constants which allow them to operate with either a 6551 or 28L92, and probably other parallel UARTs as well:
Code:
; the serial model implemented is generic and relies on the UART already being set up
; the default params are correct for the first port of a 28L92 based at $E000

serial_rx       = $E003
serial_tx       = $E003
serial_status   = $E001
serial_status_rx  = $01   ; true when data ready
serial_status_tx  = $04   ; true when ready for data
serial_status_err = $F0 ; true when any error occurs, TX or RX

; parameters for a (non-faulty) 6551:
;serial_rx       = $E000
;serial_tx       = $E000
;serial_status   = $E001
;serial_status_rx  = $08   ; true when data ready
;serial_status_tx  = $10   ; true when ready for data (broken on new WDC chips)
;serial_status_err = $07   ; true when any error occurs, TX or RX

serial_write:
   PHA
   LDA #serial_status_tx
:   BIT serial_status
   BEQ :-   ; wait for ready
   PLA
   STA serial_tx
   RTS

serial_read:
:   LDA serial_status
   BIT #serial_status_err
   BNE serial_error
   BIT #serial_status_rx
   BEQ :-   ; wait for ready
   LDA serial_rx
   RTS

serial_error:
   ; break or framing or overrun error occurred
   ; flag it to host, then fall back to sync wait
   LDA #'H'
   JSR serial_write
   LDX stackptr
   TXS

   ; NB: on 28L92, the overrun error is sticky, and this doesn't reset it
   ; avoid overruns by implementing RTS/CTS handshake lines
   ; other errors are cleared on FIFO read if block error mode isn't enabled
   LDA serial_rx
   BRA serial_sync_trampoline

; Base serial routines total 41 bytes
The faulty versions of the 6551 are not accounted for explicitly, but most of the responses sent back to the host are single bytes which don't need special handling. The exception is the sync pattern, but the host software can probably notice that (seeing only the last byte of the pattern) and make adjustments.

I might remove the ability to return to the flasher's caller, along with some support code keeping the stack orderly for that purpose, and replace it with a jump through the Reset vector. I could use the extra space to add readback and/or checksum commands, making it usable as a basic monitor in itself.


Top
 Profile  
Reply with quote  
PostPosted: Tue Feb 12, 2019 2:35 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
After much refactoring and optimisation, I now have a version taking exactly 256 bytes which includes a readback function (also checksummed). To do this, I had to remove the "return to caller" function completely, but the readback and execute commands can be combined to emulate a "soft reset" which simply jumps through the Reset vector. As a further space-saving measure, the sync pattern is now a null byte followed by the CRC32C of a null byte.

Examples of micro-optimisations for size:
Code:
crc32c_check:
   ; returns Z set iff crc matches next 4 bytes (after inversion)
   ; clobbers A,X and falls through to crc32c_init
   LDX #3
:   JSR serial_read
   EOR crc,X
   INC A
   BNE crc32c_fail
   DEX
   BPL :-
crc32c_init:
   LDA #FF
   LDX #3
:   STA crc,X
   DEX
   BPL :-
   RTS

crc32c_fail:
   LDA #'C'
serial_sync_trampoline:
   JMP serial_sync
Comparing to the earlier version I posted above, you can see I've rolled up sequences of simple operations into loops. The loop overhead in this form is 5 bytes, replacing in one case 6 bytes of individual instructions (saving just 1 byte), and in another case (not shown here) replacing 8 bytes to save 3 bytes.

Falling through from _check to _init saves 1 byte directly from eliminating RTS, but also several bytes elsewhere from not having to add a JSR from the main code. Less obviously, serial_sync now points to a JSR serial_write just *before* the routine to check the sync pattern, saving three bytes in every error handling path:
Code:
serial_flash:
   ; receives a binary image over serial with error detection & retry protocol
   ; writes it to location specified by sender in 64-byte bursts
   ; this works with Atmel EEPROMs and also with plain RAM
   ; these routines must be run from RAM, with IRQs disabled

   ; tell host we're ready to begin (ENQ)
   LDA #5
serial_sync:
   JSR serial_write

   ; wait for sync pattern (NUL followed by CRC of NUL byte: 52 7D 53 51)
:   JSR crc32c_init
   JSR crc32c_read
   BNE :-
   JSR crc32c_check

; 16 bytes for sync handler

   ; 'W' frame is command code (1B), address (2B), length (1B), data (1-256B), checksum (4B)
   ; 'L' frame is command code (1B), address (2B), length (1B), checksum (4B)
   ; 'X' frame is command code (1B), address (2B), checksum (4B)
   ; 'D' frame is acknowledge code (1B), data (1-256B), checksum (4B)
   ; write data using 'W' command, acknowledged by '.' character if CRC correct
   ; read data using 'L' command, acknowledged by 'D' frame
   ; jump to address using 'X' command, acknowledged by '}'
   ; unknown commands replied to with 'U'; follow with sync
   ; bad CRCs replied to with 'C'; follow with sync
   ; hardware detected errors (and break conditions) replied to with 'H'; follow with sync
   ; addresses are in big-endian order, CRCs cover entire frame including command code
   ; address and length must fit in a 64-byte aligned page for correct EEPROM writing
   ; when writing to RAM or reading back, 256-byte frames (length byte 0) are fine
serial_frame:
   JSR crc32c_read
In the above, you can also see that I've eliminated the initial SEI, on the grounds that if you're loading this from a toggled bootstrap then interrupts will already be disabled, while if you've already stored this in ROM then the routine which pulls it out to RAM can include a setup routine as well. That gave me the last byte I needed to squeeze it into 256 bytes.

Still need to actually test this somehow. I think pySerial's RFC2217 support and BeebEm's "IP232" support should be mutually compatible - somehow.


Top
 Profile  
Reply with quote  
PostPosted: Tue Feb 12, 2019 8:34 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Ah, and earlier I said I thought the BBC Micro had a 6551 - no, it has a 6850. Very similar in terms of functionality, though - from this utility's perspective, only the bits in the status and control registers are shuffled around, and it's easy to adjust the constants to work with that.

Unlike the 28L92, neither the 6551 nor 6850 have hardware control of their RTS signal output for flow control, nor do they have a deep FIFO to absorb received data, so the host-side software will need to wait for an acknowledgement before sending each frame, to ensure the target is ready to grab each byte as it arrives. For that reason (if no other), I've abandoned any idea of overlapping reception of the next frame with waiting for write-completion in the current frame. They do however have hardware response to CTS on the transmit side, so a slow response to interrupts by the host (which might particularly happen with USB serial adapters) shouldn't corrupt readback data.

The 16550 is apparently even worse; it has no hardware flow control for RTS *or* CTS. These signals are made available to the host CPU via status/control registers and even linked to edge-triggered interrupts, but there is no way to halt the flow of data from the transmit FIFO to the shift register in response to CTS, except by forcibly clearing the FIFO (in which case there's no easy way to determine how many bytes were dropped and need to be restored later). That's just a recipe for overrun errors at the receiver.

Anyway, my code has no space to add software flow control, except implicitly by performing request/response protocols and delaying the acknowledgement of a W command until the EEPROM write cycle is complete. The 28L92's hardware flow control should work just fine, as long as it was correctly set up in advance.


Top
 Profile  
Reply with quote  
PostPosted: Wed Feb 13, 2019 2:40 pm 
Offline
User avatar

Joined: Wed Aug 17, 2005 12:07 am
Posts: 1228
Location: Soddy-Daisy, TN USA
Chromatix,

This might be exactly what I need. I'm currently trying to get my SBC computer working again. I've run into a hardware issue I'm trying to resolve.

A few months ago, when it did work, I was using a 65C51 (with the bug). So once I get my setup working again, I will give this a shot.

Thanks!

PS, do I just need to assemble the blocks of code or do you have a single listing?


PPS, shoot. I'm not sure this would work on my setup. I don't currently have any RS-232 conversion. Just straight Rx/Tx and no hardware flow control at the moment.

_________________
Cat; the other white meat.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 16 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 31 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: