U* Speedup

U* attracted my attention when whartung was testing NUMBER. Later I took a close look at dclxvi's multiply routines above, and I noticed a significant optimization used there. U* involves four bytes of data getting repeatedly shifted, but dclxvi keeps only three bytes in memory; the remaining byte is held in the accumulator (which of course can be shifted more quickly). The 4-cycle speedup nets a total saving of 64 cycles when all 16 iterations of the loop are considered. I like it!
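For readers who want the baseline in front of them, here is the textbook shape of a 16x16-to-32 right-shifting multiply, sketched in C (illustrative only -- not a transcription of Bruce's or dclxvi's code). All four product bytes get shifted on every one of the 16 iterations; dclxvi's saving comes from keeping one of those bytes in the accumulator, where ROR A costs 2 cycles instead of the 6 a ROR zp,X costs.
Code:
#include <stdint.h>

uint32_t umstar_classic(uint16_t multiplicand, uint16_t multiplier)
{
    uint32_t hi = 0;          /* high word of the product; bit 16 models the carry  */
    uint16_t lo = multiplier; /* multiplier bits shift out as product bits shift in */

    for (int i = 0; i < 16; i++) {
        if (lo & 1)                                /* test the next multiplier bit  */
            hi += multiplicand;                    /* 17-bit sum; bit 16 = 6502 carry */
        lo = (uint16_t)((lo >> 1) | ((hi & 1) << 15)); /* bit falls from hi into lo */
        hi >>= 1;                                  /* carry falls back into hi      */
    }
    return (hi << 16) | lo;
}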
My own U* (below) copies that technique and adds a trick of its own. Shifting still uses a combination of memory and the accumulator. But instead of four bytes of data all getting shifted 16 times, there's a three-byte subset that gets shifted 8 times and then a different three-byte subset that gets shifted 8 times (a C sketch of this appears after the listing):
- the first 8 shifts are A -> N -> 2,X
- the final 8 shifts are A -> N -> 3,X
This is illustrated visually in the following post; see the third of the three animations.
(Edit: In 2021 I did an '816 version of the modified-shift multiply and posted it in this thread. It accepts two 32-bit operands and produces a 64-bit result.)
Execution time improves by about 80 cycles -- and memory usage swells by 19 bytes.
Code:
;
; U*
;
; based on Bruce Clark's UM*
; http://forum.6502.org/viewtopic.php?p=4389#p4389
; Shifting strategy modified for speed - JWL
L386 .BYTE $82,'U',$AA
.WORD L365 ; link to CMOVE
USTAR .WORD *+2
CLC ;it's OK to omit the CLC if NEXT leaves C clear. *Implementation Dependent*
LDA 0,X ;Copy TOS value to N+2,N+3. To eliminate CLC inside the loop, the
SBC #0 ;value at N+2,N+3 is reduced by 1 in advance. Note: in this U*
STA N+2 ;special accommodation is mandatory for the case of TOS = 0.
LDA 1,X
SBC #0
BCC UST_Z ;TOS = 0? Mandatory special treatment for this case
STA N+3
TYA ;Set A=0. Assumes NEXT has left Y=0. *Implementation Dependent*
STA N ;16 bits of zero in A, N
; Note: First 8 shifts are A -> N -> 2,X
; Final 8 shifts are A -> N -> 3,X
STX N+4 ;tested later for exit from outer loop
DEX ;bias applied to X
DEX
UST_OUTLP LDY #8 ;count for inner loop
LSR 4,X ;think "2,X" then later "3,X"
UST_INLP BCC UST_NOADD
STA N+1 ;Save time, don't CLC. Value in N+2,N+3 is reduced by 1 to allow this
LDA N
ADC N+2
STA N
LDA N+1
ADC N+3
UST_NOADD ROR A ;shift
ROR N
ROR 4,X ;think "2,X" then later "3,X"
DEY
BNE UST_INLP ;go back for 1 more shift?
INX
CPX N+4
BNE UST_OUTLP ;go back for 8 more shifts?
STA 1,X ;ms byte of hi-word of result
LDA N
STA 0,X ;ls byte of hi-word of result
UST_EXIT JMP NEXT
UST_Z STY 2,X ;Assumes NEXT has left Y=0. *Implementation Dependent*
STY 3,X
BCC UST_EXIT
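As a cross-check, the same shifting strategy can be modeled in C. This is only a sketch (the name umstar_split and the variable layout are mine, not part of the routine): hi stands in for A and N, m[0] and m[1] for the bytes at 2,X and 3,X, and k for the pre-decremented copy at N+2,N+3. The outer loop runs twice -- 8 shifts through m[0], then 8 through m[1] -- so only three bytes move per shift instead of four.
Code:
#include <stdint.h>

uint32_t umstar_split(uint16_t tos, uint16_t nos)
{
    if (tos == 0)                          /* the BCC UST_Z early exit          */
        return 0;

    uint8_t  m[2] = { (uint8_t)nos, (uint8_t)(nos >> 8) };
    uint16_t k  = (uint16_t)(tos - 1);     /* reduced by 1 so ADC can run C=1   */
    uint16_t hi = 0;                       /* A (high byte) and N (low byte)    */

    for (int b = 0; b < 2; b++) {          /* first pass m[0], second pass m[1] */
        unsigned c = m[b] & 1;             /* LSR 4,X primes the carry          */
        m[b] >>= 1;
        for (int i = 0; i < 8; i++) {      /* the LDY #8 inner loop             */
            if (c) {                       /* BCC UST_NOADD not taken           */
                unsigned sum = hi + k + 1; /* ADC with C=1: adds the true TOS   */
                hi = (uint16_t)sum;
                c  = sum >> 16;            /* carry out of the 16-bit add       */
            }
            /* ROR A : ROR N : ROR 4,X -- rotate (C, hi, m[b]) right one bit    */
            unsigned out = hi & 1;
            hi   = (uint16_t)((c << 15) | (hi >> 1));
            c    = m[b] & 1;               /* next multiplier bit, for the BCC  */
            m[b] = (uint8_t)((out << 7) | (m[b] >> 1));
        }
    }
    return ((uint32_t)hi << 16) | ((uint32_t)m[1] << 8) | m[0];
}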
dclxvi wrote:
The PHA and PLA could be replaced by the slightly faster STA N+1 and LDA N+1.
The CLC can be eliminated, which will make UM* longer, but faster in most cases.
I've used these ideas as well, with a shorter solution to the CLC business.
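Spelled out, the identity behind that shorter solution: with the carry known to be set at the ADC, the 6502 computes A + M + 1, so storing M-1 in advance yields a plain A + M with no CLC inside the loop. A tiny C illustration (mine, not from the thread) of both the identity and the one case where it breaks:
Code:
#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint32_t m = 1; m <= 0xFFFF; m++) {      /* M = 0 deliberately excluded */
        uint32_t m_less_1 = m - 1;                /* the pre-decremented copy    */
        for (uint32_t a = 0; a <= 0xFFFF; a += 257)
            assert(a + m_less_1 + 1 == a + m);    /* ADC with C=1, no CLC needed */
    }
    /* For M = 0 the 16-bit pre-decrement wraps: (uint16_t)(0 - 1) is $FFFF,
       so the loop would add $10000 instead of 0 -- hence the UST_Z exit.    */
    assert((uint16_t)(0 - 1) == 0xFFFF);
    return 0;
}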
As a test, I wrote and ran a looping routine that multiplies every possible combination of operands -- IOW it does 2^32 multiplications. Other code in the loop uses addition to separately keep track of what the multiplication result ought to be.

The test routine revealed some erroneous results, which I fixed by adding the BCC UST_Z etc. This causes an early exit for cases in which TOS=0. Note that this is not just a speedup tactic! The trick of reducing the value at N+2,N+3 by 1 in advance fails when the value is "reduced" from 0 to $FFFF.
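The same exhaustive check, modeled in C against the umstar_split sketch above (the real test of course ran on the 6502 against U* itself), looks roughly like this:
Code:
#include <stdint.h>
#include <stdio.h>

uint32_t umstar_split(uint16_t tos, uint16_t nos);  /* the model shown earlier */

int main(void)
{
    for (uint32_t a = 0; a <= 0xFFFF; a++) {
        uint32_t expect = 0;                        /* a * 0                     */
        for (uint32_t b = 0; b <= 0xFFFF; b++) {
            if (umstar_split((uint16_t)a, (uint16_t)b) != expect) {
                printf("FAIL: %04X * %04X\n", (unsigned)a, (unsigned)b);
                return 1;
            }
            expect += a;                            /* next product, by addition */
        }
    }
    puts("all 2^32 combinations check out");
    return 0;
}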
-- Jeff
Edit: Clarify credit for dclxvi. Also misc. tweaks. 2021: add link to other thread.