6502.org

Posted: **Sun Jan 10, 2016 1:38 am**

As I've started doing a refresh on my recent C02Monitor code, I've started crunching the size down on some routines and the results are more than expected. Some of it is the result of reuse in the monitor and other the result of leveraging some of the newer opcodes in the CMOS version. Some of the routines recently rewritten include memory fill, move and compare routines. The result is over 100 bytes saved. As the 6502 has been around a long time, memory fill, move and compare routines have been around and almost standardized on for decades now. The usual method is to handle full pages of 256 bytes first, then handle the remaining page of less than 256 bytes. This method results in two complete routine loops to accomplish the function.

Doing it differently in a single loop, I ended up being able to reuse a core routine that updates source, target and length pointers commonly used to move memory (non-overlapping), fill memory and compare memory. I can use the new routines for the monitor functions which fill memory, compare memory, write to the EEPROM insitu (basically a move/compare) and half of the memory move function. Here's some snippets of the core code:

Code: Select all

USERFILL TAX ;Save fill byte
FILL_LP LDA LENL ;Get length low byte
	ORA LENH ;OR in length high byte
	BEQ DONEFILL ;Exit if zero
	TXA ;Get fill byte
	STA (TGTL) ;Store in target location
	JSR UPD_TL ;Update Target/Length pointers
	BRA FILL_LP ;Loop back until done
DONEFILL RTS ;Return to caller

Code: Select all

;Move memory block first byte to last byte, no overwrite condition, do the move
MVNO_LP LDA LENL ;Get length low byte
	ORA LENH ;OR in length high byte
	BEQ QUITMV ;Exit if zero bytes to move
	LDA (SRCL) ;Load source data
	STA (TGTL) ;Store as target data
	JSR UPD_STL ;Update Source/Target/Length variables
	BRA MVNO_LP ;Branch back until length is zero
;

Code: Select all

COMPLP LDA LENL ;Get low byte of length
	ORA LENH ;OR in High byte of length
	BEQ QUITMV ;If zero, nothing to write
	LDA (SRCL) ;Get the source data
	CMP (TGTL) ;Compare to target data
	BEQ CMP_OK ;If equal, continue compare
	JSR COMPERR ;Print failed compare address
CMP_OK	JSR UPD_STL ;Update Source/Target/Length variables
		BRA COMPLP ;Branch back until done
QUITMV RTS ;Return to caller
;
COMPERR JSR SPC2 ;Send 2 spaces
	JSR DOLLAR ;Print $ sign
	LDA TGTH ;Get high byte of address
	LDY TGTL ;Get Low byte of address
	JSR PRWORD ;Print word
	JMP SPC ;Add 1 space for formatting and return

Code: Select all

PROG_LP LDA LENL ;Get low byte of length
	ORA LENH ;OR in High byte of length
	BEQ QUITPRG ;If zero, nothing to write
;
	JSR BURN_BYTE ;Write a byte to EEPROM
	LDA (SRCL) ;Get the source data
	CMP (TGTL) ;Compare to target data
	BEQ PRG_OK ;If equal, continue writing
;
	LDA #$FF ;Get flag setting
	STA TEMP2 ;Set flag for error
	JSR COMPERR ;Print compare fail address
;
PRG_OK JSR UPD_STL ;Update Source/Target/Length variables
	BRA PROG_LP ;Branch back until done
;
QUITPRG LDA TEMP2 ;Get Prog error flag
	BEQ PRG_GOOD ;Branch to good program exit
	LDA #$26 ;Get Prog failed message
	BRA BRA_PRMPT ;Branch to Prompt routine
;
PRG_GOOD LDA #$25 ;Get completed message
	JSR PROMPT ;Send to console and exit
	LDA #$27 ;Get warning message for RTC and Reset
BRA_PRMPT JMP PROMPT ;Send to console and exit

Code: Select all

;UPD_STL subroutine: Increments Source and Target pointers
;UPD_TL subroutine: Increments Target pointers only
; then drops into decrement length pointer. Used by multiple commands
UPD_STL INC SRCL ;Increment source low byte
	BNE UPD_TL ;Check for rollover
	INC SRCH ;Increment source high byte
UPD_TL INC TGTL ;Increment target low byte
	BNE DECLEN ;Check for rollover
	INC TGTH ;Increment target high byte
;
;DECLEN subroutine: decrement 16-bit variable LENL/LENH
DECLEN SEC ;Set carry for subtract
	LDA LENL ;Get length low byte
	SBC #$01 ;Subtract one
	STA LENL ;Save it back
	LDA LENH ;Get length high byte
	SBC #$00 ;Subtract carry
	STA LENH ;Save it back
	RTS ;Return to caller
;

So far I've reduced ROM size in excess of hundred bytes but have rewritten multiple routines and made a few minor changes to help as well. Just thought I'd share the above routines for anyone looking for shorter code. I have more routines I'm looking at but the above are a good start. Granted, execution time is going to be longer, but when you're typing commands into the monitor, it's still blazingly quick.

Posted: **Sun Jan 10, 2016 4:17 am**

why not decrement length in the usual way?

Code: Select all

DECLEN
   LDA LENL 
   BNE SKIP
   DEC LENH
SKIP
   DEC LENL
   RTS ;Return to caller

also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)

Posted: **Sun Jan 10, 2016 4:42 am**

bogax wrote:

... also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)

... with an ORA LENH just before the RTS, right? Oops, no, because A has the old value of LENL.

So, it would have to be

Code: Select all

    ...
    DEC         ; dea  or dec a
    ORA  LENH
    RTS

Mike B.

Posted: **Sun Jan 10, 2016 5:07 am**

bogax wrote:

why not decrement length in the usual way?

Code: Select all

DECLEN
   LDA LENL 
   BNE SKIP
   DEC LENH
SKIP
   DEC LENL
   RTS ;Return to caller

also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)

Bogax,
Good catch, probably because I wrote that routine around o'dark-thirty and the focus was elsewhere

Could also move the zero test there as well... as Mike noted the LDA LENL and ORA LENH at the end could also be used to set the zero flag. I'll take another run through the code an knock it down further on size... plus the alternate DECLEN will be a bit quicker on execution. Danke!

Posted: **Sun Jan 10, 2016 5:11 am**

floobydust wrote:

As I've started doing a refresh on my recent C02Monitor code, I've started crunching the size down on some routines and the results are more than expected. Some of it is the result of reuse in the monitor and other the result of leveraging some of the newer opcodes in the CMOS version. Some of the routines recently rewritten include memory fill, move and compare routines. The result is over 100 bytes saved. As the 6502 has been around a long time, memory fill, move and compare routines have been around and almost standardized on for decades now. The usual method is to handle full pages of 256 bytes first, then handle the remaining page of less than 256 bytes. This method results in two complete routine loops to accomplish the function.

Doing it differently in a single loop, I ended up being able to reuse a core routine that updates source, target and length pointers commonly used to move memory (non-overlapping), fill memory and compare memory. I can use the new routines for the monitor functions which fill memory, compare memory, write to the EEPROM insitu (basically a move/compare) and half of the memory move function. Here's some snippets of the core code:

Code: Select all

USERFILL TAX ;Save fill byte
FILL_LP INC LENL ;Get length low byte
	ORA LENH ;OR in length high byte
	BEQ DONEFILL ;Exit if zero
	TXA ;Get fill byte
	STA (TGTL) ;Store in target location
	JSR UPD_TL ;Update Target/Length pointers
	BRA FILL_LP ;Loop back until done
DONEFILL RTS ;Return to caller

Code: Select all

;Move memory block first byte to last byte, no overwrite condition, do the move
MVNO_LP LDA LENL ;Get length low byte
	ORA LENH ;OR in length high byte
	BEQ QUITMV ;Exit if zero bytes to move
	LDA (SRCL) ;Load source data
	STA (TGTL) ;Store as target data
	JSR UPD_STL ;Update Source/Target/Length variables
	BRA MVNO_LP ;Branch back until length is zero
;

It would appear that you have sacrificed speed for a small saving in code (that JSR UPD_TL will slow things down considerably). I have rewritten your code in a more conventional way - except that LENL/H actually stores the ones complement of LEN since increasing double bytes is quicker than decreasing them.

Code: Select all

USERFILL LDY #0 ;Start of target location
FILL_LP INC LENL ;Update count
	BNE FILL_NB ; Store next byte if not zero
	INC LENH ;Update hi-byte of count
	BNE FILL_NB ;Store next byte if not zero
	RTS ;Return if count has advanced to zero
FILL_NB STA (TGTL),Y ;Store in target location
	INY ;Point to next target location
	BNE FILL_LP ;Return to checking count
	INC TGTL+1 ;Point to next page
	BNE FILL_NP ; TGTL+1 is always non-zero

This takes 20 bytes compared to the 16 bytes in your code. (Note that I have not included the space taken by your UPD_TL routine).

Code: Select all

;Move memory block first byte to last byte, no overwrite condition, do the move
USERMOV LDY #0 ;Start of source and target location
MVNO_LP INC LENL ;Update count
	BNE MVNO_NB ;Move next byte if not zero
	INC LENH ;Update hi-byte of count
	BNE MVNO_NB ;Store next byte if not zero
	RTS ;Return if count has advanced to zero
MVNO_NB LDA (SRCL),Y ;Load source data
	STA (TGTL),Y ;Store as target data
	INY
	BNE MVNO_LP ;Return to checking count
	INC SRCL+1 ;Next page of source
	INC TGTL+1 ;Next page of target
	BNE MVNO_LP ;TGTL+1 is always non-zero

24 bytes compared to your code of 17 bytes (plus UPD_STL).

Posted: **Sun Jan 10, 2016 5:29 am**

There's also a way that you can use (ind),y without two loops. You take 256 - LENL and put it in y for your first (partial) page, and then move LENH more full pages, with the same LDA STA INY BNE that you used for the first partial page. You do have to adjust your initial source and destination down by the same initial value of y, so it's probably not going to be shorter, but it should be much quicker than JSRing all over.

viewtopic.php?f=2&t=3054#p34680

Mike B.

Posted: **Sun Jan 10, 2016 5:44 am**

Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

Posted: **Sun Jan 10, 2016 6:00 am**

floobydust wrote:

Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

Code: Select all

	JSR UPDTL

UPDTL INC TGTL
	BNE TGTL_EX
	INC TGTL+1
TGTL_EX RTS

Inlining UPDTL costs 6 bytes compared to 3 bytes for the subroutine call (plus the initial 7 bytes for the subroutine). You would have to make the call to UPDTL at least 3 times before any code saving was possible.

Posted: **Sun Jan 10, 2016 6:09 am**

theGSman wrote:

floobydust wrote:

Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

Code: Select all

	JSR UPDTL

UPDTL INC TGTL
	BNE TGTL_EX
	INC TGTL+1
TGTL_EX RTS

Inlining UPDTL costs 6 bytes compared to 3 bytes for the subroutine call (plus the initial 7 bytes for the subroutine). You would have to make the call to UPDTL at least 3 times before any code saving was possible.

As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.

Posted: **Sun Jan 10, 2016 7:07 am**

floobydust wrote:

Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

Woz would likely be proud. I'm pretty sure that he used a similar tactic in his "{dest}<{start}.{end}M" command for the Apple 2 monitor, which wasn't much of a speed-demon. That original 2K Apple monitor ROM listing is quite a neat little bit of "quaintly-structured" wizardry, and at least 10% of that 2K (lo-res graphics routines, integer multiplication and division, reading paddles) had nothing to do with "normal" monitor operation.

Mike B.

Posted: **Sun Jan 10, 2016 7:09 am**

floobydust wrote:

As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.

My bad; I didn't notice that DECLEN was part of UPD_STL.

Actually, if you can put up with a little less robustness in your FILL/MOVE/COMPARE routines (each loop always executes at least once and you are indifferent about SRCH when executing the USERFILL routine) then even bigger savings are possible.

Code: Select all

USERFILL LDY #0 ;Start of count
FILL_LP STA (TGTL),Y ;Store byte in target location
	PHA ;Save byte
	JSR UPD_STL ;Update pointers and count
	PLA ;Restore byte
	BNE FILL_LP ;Continue if not finished
	RTS

MOVENO LDY #0 ;Start of count
MVNO_LP LDA (SRCL),Y ;Load source data
	STA (TGTL),Y ;Store as target data
	JSR UPD_STL ;Update pointers and count
	BNE MVNO_LP ;Continue if not finished
	RTS

UPD_STL INY ;Point to next byte
	BNE DECLEN ;Adjust page if necessary
	INC SRCH
	INC TGTH
DECLEN LDA LENL
	BNE DN_LNL ;Dec Hi byte if necessary
	DEC LENH
DN_LNL  DEC A ;Dec Lo byte
	STA LENL ;Save Lo byte
	ORA LENH ;Check if count is at zero
	RTS ; Return with Z flag as appropriate

Note that the COMPERR (not written) would become more complicated and the task of checking for zero values before loading LENL/LENH is moved out of these routines.

Posted: **Sun Jan 10, 2016 2:39 pm**

theGSman wrote:

floobydust wrote:

As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.

My bad; I didn't notice that DECLEN was part of UPD_STL.

Actually, if you can put up with a little less robustness in your FILL/MOVE/COMPARE routines (each loop always executes at least once and you are indifferent about SRCH when executing the USERFILL routine) then even bigger savings are possible.

Code: Select all

USERFILL LDY #0 ;Start of count
FILL_LP STA (TGTL),Y ;Store byte in target location
	PHA ;Save byte
	JSR UPD_STL ;Update pointers and count
	PLA ;Restore byte
	BNE FILL_LP ;Continue if not finished
	RTS

MOVENO LDY #0 ;Start of count
MVNO_LP LDA (SRCL),Y ;Load source data
	STA (TGTL),Y ;Store as target data
	JSR UPD_STL ;Update pointers and count
	BNE MVNO_LP ;Continue if not finished
	RTS

UPD_STL INY ;Point to next byte
	BNE DECLEN ;Adjust page if necessary
	INC SRCH
	INC TGTH
DECLEN LDA LENL
	BNE DN_LNL ;Dec Hi byte if necessary
	DEC LENH
DN_LNL  DEC A ;Dec Lo byte
	STA LENL ;Save Lo byte
	ORA LENH ;Check if count is at zero
	RTS ; Return with Z flag as appropriate

Note that the COMPERR (not written) would become more complicated and the task of checking for zero values before loading LENL/LENH is moved out of these routines.

Looking at your code, you're using the Y reg as an index, so you need to load it with zero and use the indexed indirect page zero addressing. I opted for the 65C02 indirect page zero addressing without the index, which is one of the savings, not needing to load the Y reg with zero first. As I use the update routine in multiple commands, there is a savings. As noted, the COMPERR routine becomes more complicated (which I had before) and there is a decent code size savings by having the common pointer update routine.

You also found the same problem I did with putting the length check for zero in the update routine.... meaning you execute the first loop operation (move, fill compare) before updating the pointers. Unfortunately, the input routine allows you to enter zero as a field for source, target and.... well, length. So if you don't check the length first, you'll do the first operation, then decrement the count from $0000 to $FFFF, which is not a good idea for a fill or move operation.

It's unclear if trying to fix this elsewhere in the code would be worth any space savings, which is the main goal for rewriting multiple routines. I'm still looking at the code for optimization of space first and speed second. Airing the code here is always beneficial as the replies make you rethink some stuff and get some great ideas and code samples to help. Based on this thread, I saved an extra 17 bytes in the monitor code by rewriting DECLEN and two other routines which increment and decrement 16-bit pointers. It also resulted in faster execution of those routines as well.

Posted: **Sun Jan 10, 2016 5:37 pm**

floobydust wrote:

Looking at your code, you're using the Y reg as an index, so you need to load it with zero and use the indexed indirect page zero addressing.

Even so, my code saves 4 bytes per block subroutine and 7 bytes on the UPD_STL routine. Of course, as you point out, this is to no avail if you have to waste too much code space checking the LENgth each time before deciding to call one of these routines.

floobydust wrote:

Unfortunately, the input routine allows you to enter zero as a field for source, target and.... well, length.

If you can get your input routine to store the one's complement of the length in the LENL:LENH then a further saving can be made by moving the UPD_STL routine to the start of the loop. This will still save 4 bytes per block subroutine (assuming you don't use Y indexing) since UPD_STL will return with the Z flag set if there are no more bytes to move. The one's complement of 0 is FFFF which, when incremented, will also set the Z flag so nothing will happen if the user inputs 0 for a length. (Of course, you would have to initially store source-1 and target-1 in SRCL:SRCH and TGTL:TGTH respectively which may negate these savings).

Posted: **Sun Jan 10, 2016 5:52 pm**

The one's complement length idea is pretty cool, and I don't think you need to worry about the source-1 and target-1 baggage if you pre-increment your length and post-increment your load and store pointers. If I'm in error here, there's always the option of pre-incrementing your loads/stores as well ... it's not like we have a native addressing mode that does the work for us, like a 6809 or 68k.

Mike B.

@theGSman: I didn't know you were such an optimization freak! After seeing your UM/MOD code here, I assumed you were a "git er done and move on to the next problem" kind of guy.

Posted: **Mon Jan 11, 2016 6:18 am**

barrym95838 wrote:

I don't think you need to worry about the source-1 and target-1 baggage if you pre-increment your length and post-increment your load and store pointers.

That would require an additional subroutine call for the post-increment.

barrym95838 wrote:

@theGSman: I didn't know you were such an optimization freak! After seeing your UM/MOD code here, I assumed you were a "git er done and move on to the next problem" kind of guy.

That was my second effort. The first copied the numbers on the stack to a fixed RAM address then copied the results back onto the stack. By doing it all on the stack I have already saved some code. I'm sure that much greater optimizations can be achieved but that would take a lot of further research and I already have something that works a treat. Note that the chief optimization is in providing only the basic mul/div routines and leaving it to the user to define the other words as required.

My main objective with my Forth system was speed - not size. Everything that pushes a number on the stack (defined variables, constants, literals and even inline string literals) is inlined in assembly language. If (for example) you wanted to push the number 2 on the stack frequently then to save some space you would probably define 2 CONSTANT 2.

6502.org

Getting code size even smaller

Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller

Re: Getting code size even smaller