After much refactoring and optimisation, I now have a version taking exactly 256 bytes which includes a readback function (also checksummed). To do this, I had to remove the "return to caller" function completely, but the readback and execute commands can be combined to emulate a "soft reset" which simply jumps through the Reset vector. As a further space-saving measure, the sync pattern is now a null byte followed by the CRC32C of a null byte.
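For anyone wanting to check the sync pattern on the host side, here is a minimal Python sketch. It assumes the standard CRC-32C parameters (reflected polynomial 0x82F63B78, initial value and final XOR both all-ones); the function name `crc32c` and the constant `SYNC` are just illustrative names, not anything from the 6502 source. The CRC bytes are sent most-significant first, matching the "52 7D 53 51" quoted in the listing further down.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected form, no lookup table."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # shift right one bit, folding in the polynomial if a 1 fell out
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Sync pattern: a NUL byte followed by the CRC32C of that NUL byte, MSB first.
SYNC = b"\x00" + crc32c(b"\x00").to_bytes(4, "big")
print(SYNC.hex(" "))  # 00 52 7d 53 51
```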
Examples of micro-optimisations for size:
Code:
crc32c_check:
    ; compares the next 4 bytes received against the inverted crc (crc+3 first)
    ; on mismatch, bails out via crc32c_fail ('C' reply, then resync)
    ; on match, falls through to crc32c_init; clobbers A,X
    LDX #3
:   JSR serial_read
    EOR crc,X
    INC A               ; $FF (exact inverse) wraps to zero
    BNE crc32c_fail
    DEX
    BPL :-
crc32c_init:
    LDA #$FF
    LDX #3
:   STA crc,X
    DEX
    BPL :-
    RTS
crc32c_fail:
    LDA #'C'
serial_sync_trampoline:
    JMP serial_sync
Comparing to the earlier version I posted above, you can see I've rolled up sequences of simple operations into loops. The loop overhead in this form is 5 bytes, replacing in one case 6 bytes of individual instructions (saving just 1 byte), and in another case (not shown here) replacing 8 bytes to save 3 bytes.
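The comparison trick in crc32c_check (EOR with the stored byte, then INC A and test for zero) can be modelled on the host to confirm that it accepts a normally finalised CRC. This is only a rough sketch: it assumes `crc` holds the running, un-inverted CRC state in little-endian order (so crc+3 is the most significant byte), and the names `crc_update` and `check_like_6502` are made up for illustration.

```python
POLY = 0x82F63B78  # CRC-32C, reflected

def crc_update(state: int, data: bytes) -> int:
    """Advance the raw CRC state (no final inversion) over some bytes."""
    for byte in data:
        state ^= byte
        for _ in range(8):
            state = (state >> 1) ^ (POLY if state & 1 else 0)
    return state

def check_like_6502(state: int, received: bytes) -> bool:
    """Mirror of the check loop: for X = 3..0, (byte EOR crc[X]) + 1 must wrap to 0."""
    crc = state.to_bytes(4, "little")  # assumption: crc+0 is the low byte
    for x, byte in zip((3, 2, 1, 0), received):
        if ((byte ^ crc[x]) + 1) & 0xFF != 0:  # EOR crc,X / INC A / BNE fail
            return False
    return True

# After reading the NUL of the sync pattern, the next four bytes should pass.
state = crc_update(0xFFFFFFFF, b"\x00")
print(check_like_6502(state, bytes.fromhex("527D5351")))  # True
```

The point is that the sender transmits the inverted state, so each received byte XORed with the matching state byte gives $FF, and a single INC A turns that into a zero test without needing a CMP.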
Falling through from _check to _init saves 1 byte directly from eliminating RTS, but also several bytes elsewhere from not having to add a JSR from the main code. Less obviously, serial_sync now points to a JSR serial_write just *before* the routine to check the sync pattern, saving three bytes in every error handling path:
Code:
serial_flash:
    ; receives a binary image over serial with error detection & retry protocol
    ; writes it to location specified by sender in 64-byte bursts
    ; this works with Atmel EEPROMs and also with plain RAM
    ; these routines must be run from RAM, with IRQs disabled

    ; tell host we're ready to begin (ENQ)
    LDA #5
serial_sync:
    JSR serial_write
    ; wait for sync pattern (NUL followed by CRC of NUL byte: 52 7D 53 51)
:   JSR crc32c_init
    JSR crc32c_read
    BNE :-
    JSR crc32c_check
    ; 16 bytes for sync handler

    ; 'W' frame is command code (1B), address (2B), length (1B), data (1-256B), checksum (4B)
    ; 'L' frame is command code (1B), address (2B), length (1B), checksum (4B)
    ; 'X' frame is command code (1B), address (2B), checksum (4B)
    ; 'D' frame is acknowledge code (1B), data (1-256B), checksum (4B)
    ; write data using 'W' command, acknowledged by '.' character if CRC correct
    ; read data using 'L' command, acknowledged by 'D' frame
    ; jump to address using 'X' command, acknowledged by '}'
    ; unknown commands replied to with 'U'; follow with sync
    ; bad CRCs replied to with 'C'; follow with sync
    ; hardware detected errors (and break conditions) replied to with 'H'; follow with sync
    ; addresses are in big-endian order, CRCs cover entire frame including command code
    ; address and length must fit in a 64-byte aligned page for correct EEPROM writing
    ; when writing to RAM or reading back, 256-byte frames (length byte 0) are fine
serial_frame:
    JSR crc32c_read
In the above, you can also see that I've eliminated the initial SEI, on the grounds that if you're loading this from a toggled bootstrap then interrupts will already be disabled, while if you've already stored this in ROM then the routine which pulls it out to RAM can include a setup routine as well. That gave me the last byte I needed to squeeze it into 256 bytes.
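On the host side, the 'W', 'L' and 'X' frame layouts from the comments in the listing could be assembled roughly as follows. This is a sketch under a few assumptions: the checksum is sent most-significant byte first (as the sync pattern is), a length byte of 0 stands for 256, and all the function names here (`write_frame`, `load_frame`, `exec_frame`, `fits_eeprom_page`) are hypothetical helpers rather than part of any existing tool.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (reflected polynomial 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def _seal(body: bytes) -> bytes:
    """Append the CRC of the whole frame, command code included."""
    return body + struct.pack(">I", crc32c(body))

def write_frame(addr: int, data: bytes) -> bytes:
    if not 1 <= len(data) <= 256:
        raise ValueError("data must be 1-256 bytes")
    # big-endian address, then length (256 wraps to 0)
    return _seal(b"W" + struct.pack(">HB", addr, len(data) & 0xFF) + data)

def load_frame(addr: int, length: int) -> bytes:
    if not 1 <= length <= 256:
        raise ValueError("length must be 1-256")
    return _seal(b"L" + struct.pack(">HB", addr, length & 0xFF))

def exec_frame(addr: int) -> bytes:
    return _seal(b"X" + struct.pack(">H", addr))

def fits_eeprom_page(addr: int, length: int, page: int = 64) -> bool:
    """True if the burst stays inside one 64-byte aligned EEPROM page."""
    return length >= 1 and (addr % page) + length <= page
```

A sender targeting EEPROM would presumably split the image so that every 'W' frame passes `fits_eeprom_page`, while RAM loads and readbacks can use full 256-byte frames.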
I still need to actually test this. I think pySerial's RFC2217 support and BeebEm's "IP232" support ought to be able to talk to each other, somehow.
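As a starting point for such a test, the host end of the handshake might look something like the sketch below. It is untested against real hardware or BeebEm and makes several assumptions: the device is sitting in its sync loop when `sync_and_send` is called (that is, just after the ENQ or after an error reply), the URL and baud rate are placeholders, and the helper names are invented. `serial.serial_for_url` is pySerial's real entry point for `rfc2217://` URLs; everything else only needs an object with `read` and `write`.

```python
ENQ = b"\x05"
SYNC = bytes.fromhex("00527D5351")  # NUL + CRC32C of NUL, from the listing
ERRORS = {b"U": "unknown command", b"C": "bad CRC",
          b"H": "hardware error or break"}

def open_port(url: str, baud: int = 9600):
    """Open e.g. 'rfc2217://host:port' via pySerial (placeholder URL and baud)."""
    import serial  # third-party pySerial, imported lazily
    return serial.serial_for_url(url, baudrate=baud, timeout=2)

def wait_for_enq(port) -> bool:
    """Discard bytes until the device's ENQ arrives; False on timeout."""
    while True:
        b = port.read(1)
        if b == ENQ:
            return True
        if not b:
            return False

def sync_and_send(port, frame: bytes, expect: bytes = b".") -> bool:
    """Send the sync pattern and one frame, then check the one-byte reply."""
    port.write(SYNC + frame)
    reply = port.read(1)
    if reply == expect:
        return True
    raise IOError(ERRORS.get(reply, f"unexpected reply {reply!r}"))
```

After an IOError the device should be back in its sync loop, so a retry can simply call `sync_and_send` again with the same frame.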