Page 1 of 2
Getting code size even smaller
Posted: Sun Jan 10, 2016 1:38 am
by floobydust
As I've started doing a refresh on my recent C02Monitor code, I've started crunching the size down on some routines and the results are more than expected. Some of it is the result of reuse in the monitor and other the result of leveraging some of the newer opcodes in the CMOS version. Some of the routines recently rewritten include memory fill, move and compare routines. The result is over 100 bytes saved. As the 6502 has been around a long time, memory fill, move and compare routines have been around and almost standardized on for decades now. The usual method is to handle full pages of 256 bytes first, then handle the remaining page of less than 256 bytes. This method results in two complete routine loops to accomplish the function.
Doing it differently in a single loop, I ended up being able to reuse a core routine that updates source, target and length pointers commonly used to move memory (non-overlapping), fill memory and compare memory. I can use the new routines for the monitor functions which fill memory, compare memory, write to the EEPROM insitu (basically a move/compare) and half of the memory move function. Here's some snippets of the core code:
Code: Select all
USERFILL TAX ;Save fill byte
FILL_LP LDA LENL ;Get length low byte
ORA LENH ;OR in length high byte
BEQ DONEFILL ;Exit if zero
TXA ;Get fill byte
STA (TGTL) ;Store in target location
JSR UPD_TL ;Update Target/Length pointers
BRA FILL_LP ;Loop back until done
DONEFILL RTS ;Return to caller
Code: Select all
;Move memory block first byte to last byte, no overwrite condition, do the move
MVNO_LP LDA LENL ;Get length low byte
ORA LENH ;OR in length high byte
BEQ QUITMV ;Exit if zero bytes to move
LDA (SRCL) ;Load source data
STA (TGTL) ;Store as target data
JSR UPD_STL ;Update Source/Target/Length variables
BRA MVNO_LP ;Branch back until length is zero
;
Code: Select all
COMPLP LDA LENL ;Get low byte of length
ORA LENH ;OR in High byte of length
BEQ QUITMV ;If zero, nothing to write
LDA (SRCL) ;Get the source data
CMP (TGTL) ;Compare to target data
BEQ CMP_OK ;If equal, continue compare
JSR COMPERR ;Print failed compare address
CMP_OK JSR UPD_STL ;Update Source/Target/Length variables
BRA COMPLP ;Branch back until done
QUITMV RTS ;Return to caller
;
COMPERR JSR SPC2 ;Send 2 spaces
JSR DOLLAR ;Print $ sign
LDA TGTH ;Get high byte of address
LDY TGTL ;Get Low byte of address
JSR PRWORD ;Print word
JMP SPC ;Add 1 space for formatting and return
Code: Select all
PROG_LP LDA LENL ;Get low byte of length
ORA LENH ;OR in High byte of length
BEQ QUITPRG ;If zero, nothing to write
;
JSR BURN_BYTE ;Write a byte to EEPROM
LDA (SRCL) ;Get the source data
CMP (TGTL) ;Compare to target data
BEQ PRG_OK ;If equal, continue writing
;
LDA #$FF ;Get flag setting
STA TEMP2 ;Set flag for error
JSR COMPERR ;Print compare fail address
;
PRG_OK JSR UPD_STL ;Update Source/Target/Length variables
BRA PROG_LP ;Branch back until done
;
QUITPRG LDA TEMP2 ;Get Prog error flag
BEQ PRG_GOOD ;Branch to good program exit
LDA #$26 ;Get Prog failed message
BRA BRA_PRMPT ;Branch to Prompt routine
;
PRG_GOOD LDA #$25 ;Get completed message
JSR PROMPT ;Send to console and exit
LDA #$27 ;Get warning message for RTC and Reset
BRA_PRMPT JMP PROMPT ;Send to console and exit
Code: Select all
;UPD_STL subroutine: Increments Source and Target pointers
;UPD_TL subroutine: Increments Target pointers only
; then drops into decrement length pointer. Used by multiple commands
UPD_STL INC SRCL ;Increment source low byte
BNE UPD_TL ;Check for rollover
INC SRCH ;Increment source high byte
UPD_TL INC TGTL ;Increment target low byte
BNE DECLEN ;Check for rollover
INC TGTH ;Increment target high byte
;
;DECLEN subroutine: decrement 16-bit variable LENL/LENH
DECLEN SEC ;Set carry for subtract
LDA LENL ;Get length low byte
SBC #$01 ;Subtract one
STA LENL ;Save it back
LDA LENH ;Get length high byte
SBC #$00 ;Subtract carry
STA LENH ;Save it back
RTS ;Return to caller
;
So far I've reduced ROM size in excess of hundred bytes but have rewritten multiple routines and made a few minor changes to help as well. Just thought I'd share the above routines for anyone looking for shorter code. I have more routines I'm looking at but the above are a good start. Granted, execution time is going to be longer, but when you're typing commands into the monitor, it's still blazingly quick.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 4:17 am
by bogax
why not decrement length in the usual way?
Code: Select all
DECLEN
LDA LENL
BNE SKIP
DEC LENH
SKIP
DEC LENL
RTS ;Return to caller
also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 4:42 am
by barrym95838
... also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)
... with an ORA LENH just before the RTS, right? Oops, no, because A has the old value of LENL.
So, it would have to be
Code: Select all
...
DEC ; dea or dec a
ORA LENH
RTS
Mike B.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 5:07 am
by floobydust
why not decrement length in the usual way?
Code: Select all
DECLEN
LDA LENL
BNE SKIP
DEC LENH
SKIP
DEC LENL
RTS ;Return to caller
also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)
Bogax,
Good catch, probably because I wrote that routine around o'dark-thirty and the focus was elsewhere

Could also move the zero test there as well... as Mike noted the LDA LENL and ORA LENH at the end could also be used to set the zero flag. I'll take another run through the code an knock it down further on size... plus the alternate DECLEN will be a bit quicker on execution. Danke!
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 5:11 am
by theGSman
As I've started doing a refresh on my recent C02Monitor code, I've started crunching the size down on some routines and the results are more than expected. Some of it is the result of reuse in the monitor and other the result of leveraging some of the newer opcodes in the CMOS version. Some of the routines recently rewritten include memory fill, move and compare routines. The result is over 100 bytes saved. As the 6502 has been around a long time, memory fill, move and compare routines have been around and almost standardized on for decades now. The usual method is to handle full pages of 256 bytes first, then handle the remaining page of less than 256 bytes. This method results in two complete routine loops to accomplish the function.
Doing it differently in a single loop, I ended up being able to reuse a core routine that updates source, target and length pointers commonly used to move memory (non-overlapping), fill memory and compare memory. I can use the new routines for the monitor functions which fill memory, compare memory, write to the EEPROM insitu (basically a move/compare) and half of the memory move function. Here's some snippets of the core code:
Code: Select all
USERFILL TAX ;Save fill byte
FILL_LP INC LENL ;Get length low byte
ORA LENH ;OR in length high byte
BEQ DONEFILL ;Exit if zero
TXA ;Get fill byte
STA (TGTL) ;Store in target location
JSR UPD_TL ;Update Target/Length pointers
BRA FILL_LP ;Loop back until done
DONEFILL RTS ;Return to caller
Code: Select all
;Move memory block first byte to last byte, no overwrite condition, do the move
MVNO_LP LDA LENL ;Get length low byte
ORA LENH ;OR in length high byte
BEQ QUITMV ;Exit if zero bytes to move
LDA (SRCL) ;Load source data
STA (TGTL) ;Store as target data
JSR UPD_STL ;Update Source/Target/Length variables
BRA MVNO_LP ;Branch back until length is zero
;
It would appear that you have sacrificed speed for a small saving in code (that JSR UPD_TL will slow things down considerably). I have rewritten your code in a more conventional way - except that LENL/H actually stores the ones complement of LEN since increasing double bytes is quicker than decreasing them.
Code: Select all
USERFILL LDY #0 ;Start of target location
FILL_LP INC LENL ;Update count
BNE FILL_NB ; Store next byte if not zero
INC LENH ;Update hi-byte of count
BNE FILL_NB ;Store next byte if not zero
RTS ;Return if count has advanced to zero
FILL_NB STA (TGTL),Y ;Store in target location
INY ;Point to next target location
BNE FILL_LP ;Return to checking count
INC TGTL+1 ;Point to next page
BNE FILL_NP ; TGTL+1 is always non-zero
This takes 20 bytes compared to the 16 bytes in your code. (Note that I have not included the space taken by your UPD_TL routine).
Code: Select all
;Move memory block first byte to last byte, no overwrite condition, do the move
USERMOV LDY #0 ;Start of source and target location
MVNO_LP INC LENL ;Update count
BNE MVNO_NB ;Move next byte if not zero
INC LENH ;Update hi-byte of count
BNE MVNO_NB ;Store next byte if not zero
RTS ;Return if count has advanced to zero
MVNO_NB LDA (SRCL),Y ;Load source data
STA (TGTL),Y ;Store as target data
INY
BNE MVNO_LP ;Return to checking count
INC SRCL+1 ;Next page of source
INC TGTL+1 ;Next page of target
BNE MVNO_LP ;TGTL+1 is always non-zero
24 bytes compared to your code of 17 bytes (plus UPD_STL).
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 5:29 am
by barrym95838
There's also a way that you can use (ind),y without two loops. You take 256 - LENL and put it in y for your first (partial) page, and then move LENH more full pages, with the same LDA STA INY BNE that you used for the first partial page. You do have to adjust your initial source and destination down by the same initial value of y, so it's probably not going to be shorter, but it should be much quicker than JSRing all over.
viewtopic.php?f=2&t=3054#p34680
Mike B.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 5:44 am
by floobydust
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 6:00 am
by theGSman
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.
Code: Select all
JSR UPDTL
UPDTL INC TGTL
BNE TGTL_EX
INC TGTL+1
TGTL_EX RTS
Inlining UPDTL costs 6 bytes compared to 3 bytes for the subroutine call (plus the initial 7 bytes for the subroutine). You would have to make the call to UPDTL at least 3 times before any code saving was possible.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 6:09 am
by floobydust
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.
Code: Select all
JSR UPDTL
UPDTL INC TGTL
BNE TGTL_EX
INC TGTL+1
TGTL_EX RTS
Inlining UPDTL costs 6 bytes compared to 3 bytes for the subroutine call (plus the initial 7 bytes for the subroutine). You would have to make the call to UPDTL at least 3 times before any code saving was possible.
As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 7:07 am
by barrym95838
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.
Woz would likely be proud. I'm pretty sure that he used a similar tactic in his "{dest}<{start}.{end}M" command for the Apple 2 monitor, which wasn't much of a speed-demon. That original
2K Apple monitor ROM listing is quite a neat little bit of "quaintly-structured" wizardry, and at least 10% of that 2K (lo-res graphics routines, integer multiplication and division, reading paddles) had nothing to do with "normal" monitor operation.
Mike B.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 7:09 am
by theGSman
As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.
My bad; I didn't notice that DECLEN was part of UPD_STL.
Actually, if you can put up with a little less robustness in your FILL/MOVE/COMPARE routines (each loop always executes at least once and you are indifferent about SRCH when executing the USERFILL routine) then even bigger savings are possible.
Code: Select all
USERFILL LDY #0 ;Start of count
FILL_LP STA (TGTL),Y ;Store byte in target location
PHA ;Save byte
JSR UPD_STL ;Update pointers and count
PLA ;Restore byte
BNE FILL_LP ;Continue if not finished
RTS
MOVENO LDY #0 ;Start of count
MVNO_LP LDA (SRCL),Y ;Load source data
STA (TGTL),Y ;Store as target data
JSR UPD_STL ;Update pointers and count
BNE MVNO_LP ;Continue if not finished
RTS
UPD_STL INY ;Point to next byte
BNE DECLEN ;Adjust page if necessary
INC SRCH
INC TGTH
DECLEN LDA LENL
BNE DN_LNL ;Dec Hi byte if necessary
DEC LENH
DN_LNL DEC A ;Dec Lo byte
STA LENL ;Save Lo byte
ORA LENH ;Check if count is at zero
RTS ; Return with Z flag as appropriate
Note that the COMPERR (not written) would become more complicated and the task of checking for zero values before loading LENL/LENH is moved out of these routines.
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 2:39 pm
by floobydust
As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.
My bad; I didn't notice that DECLEN was part of UPD_STL.
Actually, if you can put up with a little less robustness in your FILL/MOVE/COMPARE routines (each loop always executes at least once and you are indifferent about SRCH when executing the USERFILL routine) then even bigger savings are possible.
Code: Select all
USERFILL LDY #0 ;Start of count
FILL_LP STA (TGTL),Y ;Store byte in target location
PHA ;Save byte
JSR UPD_STL ;Update pointers and count
PLA ;Restore byte
BNE FILL_LP ;Continue if not finished
RTS
MOVENO LDY #0 ;Start of count
MVNO_LP LDA (SRCL),Y ;Load source data
STA (TGTL),Y ;Store as target data
JSR UPD_STL ;Update pointers and count
BNE MVNO_LP ;Continue if not finished
RTS
UPD_STL INY ;Point to next byte
BNE DECLEN ;Adjust page if necessary
INC SRCH
INC TGTH
DECLEN LDA LENL
BNE DN_LNL ;Dec Hi byte if necessary
DEC LENH
DN_LNL DEC A ;Dec Lo byte
STA LENL ;Save Lo byte
ORA LENH ;Check if count is at zero
RTS ; Return with Z flag as appropriate
Note that the COMPERR (not written) would become more complicated and the task of checking for zero values before loading LENL/LENH is moved out of these routines.
Looking at your code, you're using the Y reg as an index, so you need to load it with zero and use the indexed indirect page zero addressing. I opted for the 65C02 indirect page zero addressing without the index, which is one of the savings, not needing to load the Y reg with zero first. As I use the update routine in multiple commands, there is a savings. As noted, the COMPERR routine becomes more complicated (which I had before) and there is a decent code size savings by having the common pointer update routine.
You also found the same problem I did with putting the length check for zero in the update routine.... meaning you execute the first loop operation (move, fill compare) before updating the pointers. Unfortunately, the input routine allows you to enter zero as a field for source, target and.... well, length. So if you don't check the length first, you'll do the first operation, then decrement the count from $0000 to $FFFF, which is not a good idea for a fill or move operation.
It's unclear if trying to fix this elsewhere in the code would be worth any space savings, which is the main goal for rewriting multiple routines. I'm still looking at the code for optimization of space first and speed second. Airing the code here is always beneficial as the replies make you rethink some stuff and get some great ideas and code samples to help. Based on this thread, I saved an extra 17 bytes in the monitor code by rewriting DECLEN and two other routines which increment and decrement 16-bit pointers. It also resulted in faster execution of those routines as well.

Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 5:37 pm
by theGSman
Looking at your code, you're using the Y reg as an index, so you need to load it with zero and use the indexed indirect page zero addressing.
Even so, my code saves 4 bytes per block subroutine and 7 bytes on the UPD_STL routine. Of course, as you point out, this is to no avail if you have to waste too much code space checking the LENgth each time before deciding to call one of these routines.
Unfortunately, the input routine allows you to enter zero as a field for source, target and.... well, length.
If you can get your input routine to store the one's complement of the length in the LENL:LENH then a further saving can be made by moving the UPD_STL routine to the start of the loop. This will still save 4 bytes per block subroutine (assuming you don't use Y indexing) since UPD_STL will return with the Z flag set if there are no more bytes to move. The one's complement of 0 is FFFF which, when incremented, will also set the Z flag so nothing will happen if the user inputs 0 for a length. (Of course, you would have to initially store source-1 and target-1 in SRCL:SRCH and TGTL:TGTH respectively which may negate these savings).
Re: Getting code size even smaller
Posted: Sun Jan 10, 2016 5:52 pm
by barrym95838
The one's complement length idea is pretty cool, and I don't think you need to worry about the source-1 and target-1 baggage if you pre-increment your length and post-increment your load and store pointers. If I'm in error here, there's always the option of pre-incrementing your loads/stores as well ... it's not like we have a native addressing mode that does the work for us, like a 6809 or 68k.
Mike B.
@theGSman: I didn't know you were such an optimization freak! After seeing your UM/MOD code
here, I assumed you were a "git er done and move on to the next problem" kind of guy.
Re: Getting code size even smaller
Posted: Mon Jan 11, 2016 6:18 am
by theGSman
I don't think you need to worry about the source-1 and target-1 baggage if you pre-increment your length and post-increment your load and store pointers.
That would require an additional subroutine call for the post-increment.
@theGSman: I didn't know you were such an optimization freak! After seeing your UM/MOD code
here, I assumed you were a "git er done and move on to the next problem" kind of guy.
That was my second effort. The first copied the numbers on the stack to a fixed RAM address then copied the results back onto the stack. By doing it all on the stack I have already saved some code. I'm sure that much greater optimizations can be achieved but that would take a lot of further research and I already have something that works a treat. Note that the chief optimization is in providing only the basic mul/div routines and leaving it to the user to define the other words as required.
My main objective with my Forth system was speed - not size. Everything that pushes a number on the stack (defined variables, constants, literals and even inline string literals) is inlined in assembly language. If (for example) you wanted to push the number 2 on the stack frequently then to save some space you would probably define
2 CONSTANT 2.