6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Nov 16, 2024 11:20 am

All times are UTC




Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Sun Jan 10, 2016 1:38 am 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1385
As I've started doing a refresh on my recent C02Monitor code, I've started crunching the size down on some routines and the results are more than expected. Some of it is the result of reuse in the monitor and other the result of leveraging some of the newer opcodes in the CMOS version. Some of the routines recently rewritten include memory fill, move and compare routines. The result is over 100 bytes saved. As the 6502 has been around a long time, memory fill, move and compare routines have been around and almost standardized on for decades now. The usual method is to handle full pages of 256 bytes first, then handle the remaining page of less than 256 bytes. This method results in two complete routine loops to accomplish the function.

Doing it differently in a single loop, I ended up being able to reuse a core routine that updates source, target and length pointers commonly used to move memory (non-overlapping), fill memory and compare memory. I can use the new routines for the monitor functions which fill memory, compare memory, write to the EEPROM insitu (basically a move/compare) and half of the memory move function. Here's some snippets of the core code:

Code:
USERFILL TAX ;Save fill byte
FILL_LP LDA LENL ;Get length low byte
   ORA LENH ;OR in length high byte
   BEQ DONEFILL ;Exit if zero
   TXA ;Get fill byte
   STA (TGTL) ;Store in target location
   JSR UPD_TL ;Update Target/Length pointers
   BRA FILL_LP ;Loop back until done
DONEFILL RTS ;Return to caller


Code:
;Move memory block first byte to last byte, no overwrite condition, do the move
MVNO_LP LDA LENL ;Get length low byte
   ORA LENH ;OR in length high byte
   BEQ QUITMV ;Exit if zero bytes to move
   LDA (SRCL) ;Load source data
   STA (TGTL) ;Store as target data
   JSR UPD_STL ;Update Source/Target/Length variables
   BRA MVNO_LP ;Branch back until length is zero
;


Code:
COMPLP LDA LENL ;Get low byte of length
   ORA LENH ;OR in High byte of length
   BEQ QUITMV ;If zero, nothing to write
   LDA (SRCL) ;Get the source data
   CMP (TGTL) ;Compare to target data
   BEQ CMP_OK ;If equal, continue compare
   JSR COMPERR ;Print failed compare address
CMP_OK   JSR UPD_STL ;Update Source/Target/Length variables
      BRA COMPLP ;Branch back until done
QUITMV RTS ;Return to caller
;
COMPERR JSR SPC2 ;Send 2 spaces
   JSR DOLLAR ;Print $ sign
   LDA TGTH ;Get high byte of address
   LDY TGTL ;Get Low byte of address
   JSR PRWORD ;Print word
   JMP SPC ;Add 1 space for formatting and return


Code:
PROG_LP LDA LENL ;Get low byte of length
   ORA LENH ;OR in High byte of length
   BEQ QUITPRG ;If zero, nothing to write
;
   JSR BURN_BYTE ;Write a byte to EEPROM
   LDA (SRCL) ;Get the source data
   CMP (TGTL) ;Compare to target data
   BEQ PRG_OK ;If equal, continue writing
;
   LDA #$FF ;Get flag setting
   STA TEMP2 ;Set flag for error
   JSR COMPERR ;Print compare fail address
;
PRG_OK JSR UPD_STL ;Update Source/Target/Length variables
   BRA PROG_LP ;Branch back until done
;
QUITPRG LDA TEMP2 ;Get Prog error flag
   BEQ PRG_GOOD ;Branch to good program exit
   LDA #$26 ;Get Prog failed message
   BRA BRA_PRMPT ;Branch to Prompt routine
;
PRG_GOOD LDA #$25 ;Get completed message
   JSR PROMPT ;Send to console and exit
   LDA #$27 ;Get warning message for RTC and Reset
BRA_PRMPT JMP PROMPT ;Send to console and exit


Code:
;UPD_STL subroutine: Increments Source and Target pointers
;UPD_TL subroutine: Increments Target pointers only
; then drops into decrement length pointer. Used by multiple commands
UPD_STL INC SRCL ;Increment source low byte
   BNE UPD_TL ;Check for rollover
   INC SRCH ;Increment source high byte
UPD_TL INC TGTL ;Increment target low byte
   BNE DECLEN ;Check for rollover
   INC TGTH ;Increment target high byte
;
;DECLEN subroutine: decrement 16-bit variable LENL/LENH
DECLEN SEC ;Set carry for subtract
   LDA LENL ;Get length low byte
   SBC #$01 ;Subtract one
   STA LENL ;Save it back
   LDA LENH ;Get length high byte
   SBC #$00 ;Subtract carry
   STA LENH ;Save it back
   RTS ;Return to caller
;


So far I've reduced ROM size in excess of hundred bytes but have rewritten multiple routines and made a few minor changes to help as well. Just thought I'd share the above routines for anyone looking for shorter code. I have more routines I'm looking at but the above are a good start. Granted, execution time is going to be longer, but when you're typing commands into the monitor, it's still blazingly quick.

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 4:17 am 
Offline

Joined: Tue Nov 18, 2003 8:41 pm
Posts: 250
why not decrement length in the usual way?

Code:
DECLEN
   LDA LENL
   BNE SKIP
   DEC LENH
SKIP
   DEC LENL
   RTS ;Return to caller


also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 4:42 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
bogax wrote:
... also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)


... with an ORA LENH just before the RTS, right? Oops, no, because A has the old value of LENL.

So, it would have to be
Code:
    ...
    DEC         ; dea  or dec a
    ORA  LENH
    RTS


Mike B.


Last edited by barrym95838 on Sun Jan 10, 2016 5:08 am, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 5:07 am 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1385
bogax wrote:
why not decrement length in the usual way?

Code:
DECLEN
   LDA LENL
   BNE SKIP
   DEC LENH
SKIP
   DEC LENL
   RTS ;Return to caller


also it looks like it would be useful to have DECLEN
return a zero flag (ie move the zero test to the DECLEN routine)


Bogax,
Good catch, probably because I wrote that routine around o'dark-thirty and the focus was elsewhere ;-) Could also move the zero test there as well... as Mike noted the LDA LENL and ORA LENH at the end could also be used to set the zero flag. I'll take another run through the code an knock it down further on size... plus the alternate DECLEN will be a bit quicker on execution. Danke!

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 5:11 am 
Offline

Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
floobydust wrote:
As I've started doing a refresh on my recent C02Monitor code, I've started crunching the size down on some routines and the results are more than expected. Some of it is the result of reuse in the monitor and other the result of leveraging some of the newer opcodes in the CMOS version. Some of the routines recently rewritten include memory fill, move and compare routines. The result is over 100 bytes saved. As the 6502 has been around a long time, memory fill, move and compare routines have been around and almost standardized on for decades now. The usual method is to handle full pages of 256 bytes first, then handle the remaining page of less than 256 bytes. This method results in two complete routine loops to accomplish the function.

Doing it differently in a single loop, I ended up being able to reuse a core routine that updates source, target and length pointers commonly used to move memory (non-overlapping), fill memory and compare memory. I can use the new routines for the monitor functions which fill memory, compare memory, write to the EEPROM insitu (basically a move/compare) and half of the memory move function. Here's some snippets of the core code:

Code:
USERFILL TAX ;Save fill byte
FILL_LP INC LENL ;Get length low byte
   ORA LENH ;OR in length high byte
   BEQ DONEFILL ;Exit if zero
   TXA ;Get fill byte
   STA (TGTL) ;Store in target location
   JSR UPD_TL ;Update Target/Length pointers
   BRA FILL_LP ;Loop back until done
DONEFILL RTS ;Return to caller


Code:
;Move memory block first byte to last byte, no overwrite condition, do the move
MVNO_LP LDA LENL ;Get length low byte
   ORA LENH ;OR in length high byte
   BEQ QUITMV ;Exit if zero bytes to move
   LDA (SRCL) ;Load source data
   STA (TGTL) ;Store as target data
   JSR UPD_STL ;Update Source/Target/Length variables
   BRA MVNO_LP ;Branch back until length is zero
;

It would appear that you have sacrificed speed for a small saving in code (that JSR UPD_TL will slow things down considerably). I have rewritten your code in a more conventional way - except that LENL/H actually stores the ones complement of LEN since increasing double bytes is quicker than decreasing them.

Code:
USERFILL LDY #0 ;Start of target location
FILL_LP INC LENL ;Update count
   BNE FILL_NB ; Store next byte if not zero
   INC LENH ;Update hi-byte of count
   BNE FILL_NB ;Store next byte if not zero
   RTS ;Return if count has advanced to zero
FILL_NB STA (TGTL),Y ;Store in target location
   INY ;Point to next target location
   BNE FILL_LP ;Return to checking count
   INC TGTL+1 ;Point to next page
   BNE FILL_NP ; TGTL+1 is always non-zero

This takes 20 bytes compared to the 16 bytes in your code. (Note that I have not included the space taken by your UPD_TL routine).

Code:
;Move memory block first byte to last byte, no overwrite condition, do the move
USERMOV LDY #0 ;Start of source and target location
MVNO_LP INC LENL ;Update count
   BNE MVNO_NB ;Move next byte if not zero
   INC LENH ;Update hi-byte of count
   BNE MVNO_NB ;Store next byte if not zero
   RTS ;Return if count has advanced to zero
MVNO_NB LDA (SRCL),Y ;Load source data
   STA (TGTL),Y ;Store as target data
   INY
   BNE MVNO_LP ;Return to checking count
   INC SRCL+1 ;Next page of source
   INC TGTL+1 ;Next page of target
   BNE MVNO_LP ;TGTL+1 is always non-zero

24 bytes compared to your code of 17 bytes (plus UPD_STL).


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 5:29 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
There's also a way that you can use (ind),y without two loops. You take 256 - LENL and put it in y for your first (partial) page, and then move LENH more full pages, with the same LDA STA INY BNE that you used for the first partial page. You do have to adjust your initial source and destination down by the same initial value of y, so it's probably not going to be shorter, but it should be much quicker than JSRing all over.

viewtopic.php?f=2&t=3054#p34680

Mike B.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 5:44 am 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1385
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 6:00 am 
Offline

Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
floobydust wrote:
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

Code:
   JSR UPDTL

UPDTL INC TGTL
   BNE TGTL_EX
   INC TGTL+1
TGTL_EX RTS

Inlining UPDTL costs 6 bytes compared to 3 bytes for the subroutine call (plus the initial 7 bytes for the subroutine). You would have to make the call to UPDTL at least 3 times before any code saving was possible.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 6:09 am 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1385
theGSman wrote:
floobydust wrote:
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

Code:
   JSR UPDTL

UPDTL INC TGTL
   BNE TGTL_EX
   INC TGTL+1
TGTL_EX RTS

Inlining UPDTL costs 6 bytes compared to 3 bytes for the subroutine call (plus the initial 7 bytes for the subroutine). You would have to make the call to UPDTL at least 3 times before any code saving was possible.


As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 7:07 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
floobydust wrote:
Actually, the main goal was to reduce code size over execution speed. I was using similar routines which were faster but opted to reduce ROM space over speed.

Woz would likely be proud. I'm pretty sure that he used a similar tactic in his "{dest}<{start}.{end}M" command for the Apple 2 monitor, which wasn't much of a speed-demon. That original 2K Apple monitor ROM listing is quite a neat little bit of "quaintly-structured" wizardry, and at least 10% of that 2K (lo-res graphics routines, integer multiplication and division, reading paddles) had nothing to do with "normal" monitor operation.

Mike B.


Last edited by barrym95838 on Sun Jan 10, 2016 7:15 am, edited 3 times in total.

Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 7:09 am 
Offline

Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
floobydust wrote:
As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.

My bad; I didn't notice that DECLEN was part of UPD_STL.

Actually, if you can put up with a little less robustness in your FILL/MOVE/COMPARE routines (each loop always executes at least once and you are indifferent about SRCH when executing the USERFILL routine) then even bigger savings are possible.
Code:
USERFILL LDY #0 ;Start of count
FILL_LP STA (TGTL),Y ;Store byte in target location
   PHA ;Save byte
   JSR UPD_STL ;Update pointers and count
   PLA ;Restore byte
   BNE FILL_LP ;Continue if not finished
   RTS

MOVENO LDY #0 ;Start of count
MVNO_LP LDA (SRCL),Y ;Load source data
   STA (TGTL),Y ;Store as target data
   JSR UPD_STL ;Update pointers and count
   BNE MVNO_LP ;Continue if not finished
   RTS

UPD_STL INY ;Point to next byte
   BNE DECLEN ;Adjust page if necessary
   INC SRCH
   INC TGTH
DECLEN LDA LENL
   BNE DN_LNL ;Dec Hi byte if necessary
   DEC LENH
DN_LNL  DEC A ;Dec Lo byte
   STA LENL ;Save Lo byte
   ORA LENH ;Check if count is at zero
   RTS ; Return with Z flag as appropriate

Note that the COMPERR (not written) would become more complicated and the task of checking for zero values before loading LENL/LENH is moved out of these routines.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 2:39 pm 
Offline
User avatar

Joined: Tue Mar 05, 2013 4:31 am
Posts: 1385
theGSman wrote:
floobydust wrote:
As the (further updated) code sits now, the UPD_TL is part is UPD_STL. In the monitor code, UPD_STL is called by the EEPROM program routine, the memory compare routine and the memory move routine. The UPD_TL is called by the memory fill routine. So in all, the update routine is used by four different monitor commands.

My bad; I didn't notice that DECLEN was part of UPD_STL.

Actually, if you can put up with a little less robustness in your FILL/MOVE/COMPARE routines (each loop always executes at least once and you are indifferent about SRCH when executing the USERFILL routine) then even bigger savings are possible.
Code:
USERFILL LDY #0 ;Start of count
FILL_LP STA (TGTL),Y ;Store byte in target location
   PHA ;Save byte
   JSR UPD_STL ;Update pointers and count
   PLA ;Restore byte
   BNE FILL_LP ;Continue if not finished
   RTS

MOVENO LDY #0 ;Start of count
MVNO_LP LDA (SRCL),Y ;Load source data
   STA (TGTL),Y ;Store as target data
   JSR UPD_STL ;Update pointers and count
   BNE MVNO_LP ;Continue if not finished
   RTS

UPD_STL INY ;Point to next byte
   BNE DECLEN ;Adjust page if necessary
   INC SRCH
   INC TGTH
DECLEN LDA LENL
   BNE DN_LNL ;Dec Hi byte if necessary
   DEC LENH
DN_LNL  DEC A ;Dec Lo byte
   STA LENL ;Save Lo byte
   ORA LENH ;Check if count is at zero
   RTS ; Return with Z flag as appropriate

Note that the COMPERR (not written) would become more complicated and the task of checking for zero values before loading LENL/LENH is moved out of these routines.


Looking at your code, you're using the Y reg as an index, so you need to load it with zero and use the indexed indirect page zero addressing. I opted for the 65C02 indirect page zero addressing without the index, which is one of the savings, not needing to load the Y reg with zero first. As I use the update routine in multiple commands, there is a savings. As noted, the COMPERR routine becomes more complicated (which I had before) and there is a decent code size savings by having the common pointer update routine.

You also found the same problem I did with putting the length check for zero in the update routine.... meaning you execute the first loop operation (move, fill compare) before updating the pointers. Unfortunately, the input routine allows you to enter zero as a field for source, target and.... well, length. So if you don't check the length first, you'll do the first operation, then decrement the count from $0000 to $FFFF, which is not a good idea for a fill or move operation.

It's unclear if trying to fix this elsewhere in the code would be worth any space savings, which is the main goal for rewriting multiple routines. I'm still looking at the code for optimization of space first and speed second. Airing the code here is always beneficial as the replies make you rethink some stuff and get some great ideas and code samples to help. Based on this thread, I saved an extra 17 bytes in the monitor code by rewriting DECLEN and two other routines which increment and decrement 16-bit pointers. It also resulted in faster execution of those routines as well. :mrgreen:

_________________
Regards, KM
https://github.com/floobydust


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 5:37 pm 
Offline

Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
floobydust wrote:
Looking at your code, you're using the Y reg as an index, so you need to load it with zero and use the indexed indirect page zero addressing.

Even so, my code saves 4 bytes per block subroutine and 7 bytes on the UPD_STL routine. Of course, as you point out, this is to no avail if you have to waste too much code space checking the LENgth each time before deciding to call one of these routines.

floobydust wrote:
Unfortunately, the input routine allows you to enter zero as a field for source, target and.... well, length.

If you can get your input routine to store the one's complement of the length in the LENL:LENH then a further saving can be made by moving the UPD_STL routine to the start of the loop. This will still save 4 bytes per block subroutine (assuming you don't use Y indexing) since UPD_STL will return with the Z flag set if there are no more bytes to move. The one's complement of 0 is FFFF which, when incremented, will also set the Z flag so nothing will happen if the user inputs 0 for a length. (Of course, you would have to initially store source-1 and target-1 in SRCL:SRCH and TGTL:TGTH respectively which may negate these savings).


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 10, 2016 5:52 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
The one's complement length idea is pretty cool, and I don't think you need to worry about the source-1 and target-1 baggage if you pre-increment your length and post-increment your load and store pointers. If I'm in error here, there's always the option of pre-incrementing your loads/stores as well ... it's not like we have a native addressing mode that does the work for us, like a 6809 or 68k.

Mike B.

@theGSman: I didn't know you were such an optimization freak! After seeing your UM/MOD code here, I assumed you were a "git er done and move on to the next problem" kind of guy.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 11, 2016 6:18 am 
Offline

Joined: Mon Jan 26, 2015 6:19 am
Posts: 85
barrym95838 wrote:
I don't think you need to worry about the source-1 and target-1 baggage if you pre-increment your length and post-increment your load and store pointers.

That would require an additional subroutine call for the post-increment.

barrym95838 wrote:
@theGSman: I didn't know you were such an optimization freak! After seeing your UM/MOD code here, I assumed you were a "git er done and move on to the next problem" kind of guy.

That was my second effort. The first copied the numbers on the stack to a fixed RAM address then copied the results back onto the stack. By doing it all on the stack I have already saved some code. I'm sure that much greater optimizations can be achieved but that would take a lot of further research and I already have something that works a treat. Note that the chief optimization is in providing only the basic mul/div routines and leaving it to the user to define the other words as required.

My main objective with my Forth system was speed - not size. Everything that pushes a number on the stack (defined variables, constants, literals and even inline string literals) is inlined in assembly language. If (for example) you wanted to push the number 2 on the stack frequently then to save some space you would probably define 2 CONSTANT 2.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: Google [Bot] and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: