1. Repeated additions: 1,596
2. Repeated additions, smaller multiplicand in counter: 1,236.66
3. Shift/add: 300.8
4. 4x8 bit lookup table: 107
5. 4x8 bit lookup tables aligned to $x000: 103
6. Russian peasant algorithm: 274.02
7. Quarter squares with first half of table: 115.33
8. Quarter squares with whole table: 118
Code is below. Please let me know what I can improve or if you see any mistakes.
Main loop
Code: Select all
.START Main ;start execution at Main
;Macros for Kowalski simulator (only needed for debug features)
.INCLUDE "..\k_io.65s"
;4x8 bit lookup table
.ORG $1000
.INCLUDE "table1.65s"
;Quarter square lookup table
.ORG $4000
.INCLUDE "table3.65s"
;4x8 lookup table split into four for alignment
.ORG $5000
.INCLUDE "table5lo.65s"
.ORG $6000
.INCLUDE "table5hi.65s"
.ORG $7000
.INCLUDE "table6lo.65s"
.ORG $8000
.INCLUDE "table6hi.65s"
;Beginning of functions
.ORG $F000
Main:
Arg1 .SET $10 ;First multiplicand
Arg2 .SET $11 ;Second multiplicand
ResultLo .SET $12 ;16 bit result
ResultHi .SET $13
R0 .SET $14 ;Generic counter/pointer variables
R1 .SET $15
R2 .SET $16
R3 .SET $17
STZ Arg1 ;Multiplicand 1
STZ Arg2 ;Multiplicand 2
MultiplyRoutine=1 ;Which of 8 rountines to use
Debug=0 ;Boolean: print multicands and answers?
SED
.multiplyloop ;Loop from $0 x $0 through $99 x $99
.IF Debug==1 ;If debug is enabled, print multiplicands
LDA Arg2
puthex
LDA #'*'
putc
LDA Arg1
puthex
LDA #'='
putc
.ENDIF
;************************
;**Multiplier code here**
;************************
.multiplydone
.IF Debug==1 ;If debug is enabled, print the answer
LDA ResultHi
puthex
LDA ResultLo
puthex
putnl
.ENDIF
LDA Arg1 ;Increase multiplicand 1 by one
CLC
ADC #1
STA Arg1
BCS .carryset ;If no carry, jump back to top of loop
JMP .multiplyloop
.carryset ;If carry is set, increment multiplicand 2
LDA Arg2
ADC #0
STA Arg2
BCS .loopdone ;If multiplicand 1 is $99 and multiplicand 2 roles over, end
JMP .multiplyloop
.loopdone
BRK
Code: Select all
.IF CheckForZero==1 ;If one multiplicand is zero, skip calculation
LDA Arg1 ;Check multiplicand 1 for zero
BEQ .zero
LDA Arg2 ;Check multiplicand 2 for zero
BEQ .zero
BRA .notzero;If neither is zero, continue
.zero
.IF Debug==1 ;Print 'z' to show no calculation was done
LDA #'z'
putc
.ENDIF
STZ ResultLo ;Store zero in result
STZ ResultHi
BRA .multiplydone ;Skip to end of loop avoiding calculation
.notzero
.ENDIF
.IF CheckForOne==1 ;If one multiplicand is one, skip calculation
LDA Arg1 ;Check multiplicand 1 for one
CMP #1
BEQ .Arg1one ;If so, copy multiplicand 2 to result
LDA Arg2 ;Check multiplicand 2 for one
CMP #1
BEQ .Arg2one ;If so, copy multiplicand 1 to result
BRA .notone;If neither is one, continue
.Arg1one
.IF Debug==1
LDA #'a' ;Print 'a' to show no calculation was done
putc
.ENDIF
LDA Arg2 ;Copy multiplicand 2 to result
STA ResultLo
STZ ResultHi
BRA .multiplydone ;Skip to end of loop avoiding calculation
.Arg2one
.IF Debug==1
LDA #'b' ;Print 'a' to show no calculation was done
putc
.ENDIF
LDA Arg1 ;Copy multiplicand 2 to result
STA ResultLo
STZ ResultHi
BRA .multiplydone ;Skip to end of loop avoiding calculation
.notone
.ENDIF
Code: Select all
.IF MultiplyRoutine==1
STZ ResultLo ;Zero results
STZ ResultHi
LDX Arg1 ;Hold multiplicand 1 in X to copy quickly
LDA Arg2 ;Store multiplicand 2 in R0 counter
STA R0
.calcloop ;Calculation loop
BEQ .multiplydone ;Loop as long as R0 counter is not 0
TXA ;Load multiplicand 1 copy stored in X
CLC
ADC ResultLo ;Add to result
STA ResultLo
BCC .calcnocarry ;Skip updating high byte of result if no carry
LDA #0 ;Load 0 since only carry is being added
ADC ResultHi ;Add carry
STA ResultHi
.calcnocarry
;Decrease R0 counter by one
LDA R0
SBC #0
STA R0
BRA .calcloop
.ENDIFCode: Select all
.IF MultiplyRoutine==2
STZ ResultLo ;Zero results
STZ ResultHi
LDA Arg1 ;Check which multiplicand is larger
CMP Arg2
BMI .switchargs
LDX Arg1 ;Hold multiplicand 1 in X to copy quickly
LDA Arg2 ;Store multiplicand 2 in R0 counter
STA R0
BRA .switchdone
.switchargs
LDX Arg2 ;Hold multiplicand 2 in X to copy quickly
LDA Arg1 ;Store multiplicand 2 in R0 counter
STA R0
.switchdone
;REST IS THE SAME AS Method 1 above
.calcloop ;Calculation loop
BEQ .multiplydone ;Loop as long as R0 counter is not 0
TXA ;Load multiplicand 1 copy stored in X
CLC
ADC ResultLo ;Add to result
STA ResultLo
BCC .calcnocarry ;Skip updating high byte of result if no carry
LDA #0 ;Load 0 since only carry is being added
ADC ResultHi ;Add carry
STA ResultHi
.calcnocarry
;Decrease R0 counter by one
LDA R0
SBC #0
STA R0
BRA .calcloop
.ENDIFCode: Select all
.IF MultiplyRoutine==3
STZ ResultLo ;Zero results
STZ ResultHi
;Start with ones place of multiplicand one
LDA Arg1
AND #$F
TAX ;Use X to countdown
.calcloop
BEQ .ones_done ;Loop until ones place of counter reaches 0
;Add multiplicand 2 to result
LDA Arg2
CLC
ADC ResultLo
STA ResultLo
;Carry to upper byte if necessary
BCC .nocarry
LDA #0
ADC ResultHi
STA ResultHi
.nocarry
;Decrease counter works on BCD since only ones place
DEX
BRA .calcloop
.ones_done
;Shift ones nibble of multiplicand 2 left and store in R1
LDA Arg2
ASL
ASL
ASL
ASL
STA R1
;Shift 10s nibble of multiplicand 2 to use as high byte
LDA Arg2
ROR
ROR
ROR
ROR
AND #$F
STA R2
;Use 10s nibble of multiplicand 1 as counter in X
LDA Arg1
ROR
ROR
ROR
ROR
AND #$F
TAX
CLC
.loop10s
BEQ .tens_done ;Loop until X counts down to 0
LDA R1 ;Load low byte of shifted muliplicand 2
ADC ResultLo ;Add to result
STA ResultLo
LDA R2
ADC ResultHi ;Load high byte of shifted multiplicand 2
STA ResultHi
DEX ;Decrement 10s counter of multiplicand 1
BRA .loop10s
.tens_done
.ENDIFTables are stored as $0xyy where x is a 4 bit BCD number and yy is an 8 bit packed BCD number. Each table has a high and low section for the high and low values of the 16 bit answer. Table one has unshifted answers where $0xyy = x * yy and table two is shifted answers where $0xyy = x * yy0. Note that BCD mode is enabled and disabled to do pointer arithmetic.
Code: Select all
.IF MultiplyRoutine==4
LDA Arg1 ;Store multiplicand 1 in table pointer R0
STA R0
LDA Arg2 ;Get low nibble of multiplicand 1
AND #$F
PHA ;Make a copy of nibble for second half of table pointer
CLC
CLD ;Disable BCD mode for pointer arithmetic
ADC #>Table1_Lo ;Add Table1 offset to high byte of table pointer
STA R1
LDA (R0) ;Copy table entry to lowbyte of result
STA ResultLo
PLA ;Get copy of low nibble of multiplicand 2
ADC #>Table1_Hi ;Add Table1 offset to high byte of table pointer
STA R1
LDA (R0) ;Copy table entry to lowbyte of result
STA ResultHi
LDA Arg2 ;Get high nibble of multiplicand 2
ROR
ROR
ROR
ROR
AND #$F
PHA ;Make a copy of nibble for second half of table pointer
CLC
ADC #>Table2_Lo ;Add Table1 offset to high byte of table pointer
SED ;Reenable BCD for adding table entry to result
STA R1 ;High byte of table pointer
LDA (R0) ;Add table entry to low byte of result
ADC ResultLo
STA ResultLo
LDA #0 ;Apply carry to high byte
ADC ResultHi
STA ResultHi
PLA ;Get copy of high nibble o multiplicand 2
CLD ;Disable BCD for pointer arithmetic
ADC #>Table2_Hi ;Add Table1 offset to high byte of table pointer
SED ;Reenable BCD for adding table entry to result
STA R1 ;High byte of table pointer
LDA (R0) ;Add table entry to high byte of result
ADC ResultHi
STA ResultHi
.ENDIFSame as method five but the high and low parts of tables 1 and 2 are split up into four different files and each is aligned to $x000, so pointer arithmetic can be done in BCD mode.
Code: Same as method 4 without SED and CLD. Also, uses Table5 and Table 6 instead of Table1 and Table2.
6. Russian peasant algorithm (274.02 cycles)
Halving the smaller multiplicand is a little faster on average.
Code: Select all
.IF MultiplyRoutine==6
;Put larger of the two multiplicands in R0 to be doubled
;Put smaller multiplicand in R2 to be halved
LDA Arg1
CMP Arg2
BMI .switchargs
STA R0
LDA Arg2
STA R2
BRA .dontswitch
.switchargs
STA R2
LDA Arg2
STA R0
.dontswitch
STZ R1 ;Clear high byte of 16 bit value for doubling
STZ ResultLo ;Clear results
STZ ResultHi
CLC
.calcloop
BBR #0,R2,.dontadd ;Test halving variable for evenness
;Add 16 bit doubling variable (R0-R1) to result
LDA R0
ADC ResultLo
STA ResultLo
LDA R1
ADC ResultHi
STA ResultHi
.dontadd ;Jump here if halving variable is even
;Double 16 bit doubling variable
LDA R0
ADC R0
STA R0
LDA R1
ADC R1
STA R1
;Halve the halving variable
CLC
ROR R2
CLC
BBR #3,R2,.dontcorrect ;Check if BCD correction needed
RMB #3,R2 ;Clear high bit of low nibble shifted in from high nibble
LDA #5 ;Add 5 to low nibble instead
ADC R2
STA R2
.dontcorrect ;No BCD correction needed
BEQ .calcloopdone
JMP .calcloop
.calcloopdone
.ENDIFI tried implementing the algorithm myself and ended up only using half the table that other algorithms I looked at later used. (This is not to save space, it just didn't occur to me to generate the second half since the first half is sufficient).
Code: Select all
.IF MultiplyRoutine==7
CLC ;Store sum of multiplicand 1 and 2 in table pointer R0
LDA Arg1
ADC Arg2
STA R0
LDA #0 ;Store copy of carry in R2 for later
ADC #0
STA R2
ADC #>Table3_Lo ;Add offset into first table to high byte of table pointer
STA R1
LDA (R0) ;Store table entry in result
STA ResultLo
LDA R2 ;Retrieve carry from adding multiplicands
ADC #>Table3_Hi ;Add offset into first table to high byte of table pointer
STA R1
LDA (R0) ;Store table entry in result
STA ResultHi
SEC
LDA Arg1 ;Find difference of multiplicands
SBC Arg2
BCS .notinverse ;Check if difference is negative
STA R3 ;If negative, take absolute value
SEC
LDA #$00
SBC R3
.notinverse ;Jump here if difference is not negative
STA R0 ;Store absolute value of difference of multiplicands in table pointer
LDA #>Table3_Lo ;High byte of pointer to table
STA R1
LDA (R0) ;Retrieve table value
STA R3 ;Save temporarily
LDA ResultLo ;Subract table value from result from earlier table access
SEC
SBC R3
STA ResultLo
BCS .nosubcarry ;Borrow from high byte of result if subtraction requires
LDA ResultHi
SBC #0
STA ResultHi
.nosubcarry
LDA #>Table3_Hi ;High byte of pointer into table
STA R1
LDA (R0) ;Retrieve table value
STA R3 ;Save temporarily
LDA ResultHi ;Subtract table value from high byte of result
SEC
SBC R3
STA ResultHi
.ENDIFI tried to base this off the fast multiplication routine I found here:
http://codebase64.org/doku.php?id=base: ... iplication
In that version, the X register indexes into the table. I can't do the same thing by loading a BCD number into X, since it won't be added to the address in BCD format, so I calculate the address myself, like in method 7. Strangely, using the larger table is slightly slower as I have implemented it. I think it might be possible to improve this one. Any ideas?
Code: Select all
.IF MultiplyRoutine==8
CLC ;Store sum of multiplicand 1 and 2 in table pointer R0
LDA Arg1
ADC Arg2
STA R0
LDA #0 ;Store copy of carry in R2 for later
ADC #0
STA R2
ADC #>Table3_Lo ;Add offset into first table to high byte of table pointer
STA R1
LDA (R0) ;Store table entry in result
STA ResultLo
LDA R2 ;Retrieve carry from adding multiplicands
ADC #>Table3_Hi ;Add offset into first table to high byte of table pointer
STA R1
LDA (R0) ;Store table entry in result
STA ResultHi
;Take inverse of multiplicand 1 before summing multiplicands
;(Same trick as EOR $#FF when calculation is binary)
SEC
LDA #$99
SBC Arg1
CLC ;Find sum of multiplicands
ADC Arg2
STA R0 ;Store in low byte of table pointer
LDA #0 ;Save copy of carry
ADC #0
STA R2
ADC #>Table4_Lo ;Add carry and table address to high byte of table pointer
STA R1
SEC
LDA ResultLo ;Subtract table entry from result from first table access
SBC (R0)
STA ResultLo
LDA ResultHi ;Borrow from high byte of result if necessary
SBC #0
STA ResultHi
CLC
LDA R2 ;Retrieve carry from summing multiplicands
ADC #>Table4_Hi ;Add carry and table address to high byte of table pointer
STA R1
SEC ;Subtract table entry from high byte of result
LDA ResultHi
SBC (R0)
STA ResultHi
.ENDIF