Fast BCD multiplication

Programming the 6502 microprocessor and its relatives in assembly and other languages.
User avatar
Druzyek
Posts: 367
Joined: 12 May 2014
Contact:

Fast BCD multiplication

Post by Druzyek »

I was curious about the fastest way to do 8 bit packed BCD multiplication, so I tried it a few different ways to get the average cycles required. I ran all combinations from $00 x $00 to $99 x $99 in a loop in the simulator. For each one, I subtracted the loop overhead time from the total time and divided by 10,000 to get average cycles per calculation. Here are the results:

1. Repeated additions: 1,596
2. Repeated additions, smaller multiplicand in counter: 1,236.66
3. Shift/add: 300.8
4. 4x8 bit lookup table: 107
5. 4x8 bit lookup tables aligned to $x000: 103
6. Russian peasant algorithm: 274.02
7. Quarter squares with first half of table: 115.33
8. Quarter squares with whole table: 118

Code is below. Please let me know what I can improve or if you see any mistakes.

Main loop

Code: Select all

.START Main	;start execution at Main
	
	;Macros for Kowalski simulator (only needed for debug features)
	.INCLUDE "..\k_io.65s"
	
	;4x8 bit lookup table
	.ORG $1000
	.INCLUDE "table1.65s"
	
	;Quarter square lookup table
	.ORG $4000
	.INCLUDE "table3.65s"
	
	;4x8 lookup table split into four for alignment
	.ORG $5000
	.INCLUDE "table5lo.65s"
	.ORG $6000
	.INCLUDE "table5hi.65s"
	.ORG $7000
	.INCLUDE "table6lo.65s"
	.ORG $8000
	.INCLUDE "table6hi.65s"

;Beginning of functions
	.ORG $F000
Main:
	
Arg1			.SET $10			;First multiplicand
Arg2			.SET $11			;Second multiplicand
ResultLo		.SET $12			;16 bit result
ResultHi		.SET $13
R0			.SET $14			;Generic counter/pointer variables
R1			.SET $15
R2			.SET $16
R3			.SET $17
	
	STZ Arg1					;Multiplicand 1
	STZ Arg2					;Multiplicand 2

MultiplyRoutine=1			;Which of 8 rountines to use
Debug=0					;Boolean: print multicands and answers?

	SED		
.multiplyloop					;Loop from $0 x $0 through $99 x $99
	.IF Debug==1			;If debug is enabled, print multiplicands
		LDA Arg2
		puthex
		LDA #'*'
		putc
		LDA Arg1
		puthex
		LDA #'='
		putc
	.ENDIF

	;************************
	;**Multiplier code here**
	;************************


.multiplydone
	.IF Debug==1			;If debug is enabled, print the answer
   	LDA ResultHi
		puthex
		LDA ResultLo
		puthex
		putnl
	.ENDIF
	LDA Arg1					;Increase multiplicand 1 by one
	CLC
	ADC #1
	STA Arg1
	BCS .carryset			;If no carry, jump back to top of loop
		JMP .multiplyloop
.carryset					;If carry is set, increment multiplicand 2
	LDA Arg2
	ADC #0
	STA Arg2
	BCS .loopdone			;If multiplicand 1 is $99 and multiplicand 2 roles over, end
		JMP .multiplyloop
.loopdone
	
	BRK
I also added checks for 0 and 1 to skip calculating, but using either or both of those was slower on average for all 8 methods, so I left them out.

Code: Select all

	.IF CheckForZero==1	;If one multiplicand is zero, skip calculation
		LDA Arg1				;Check multiplicand 1 for zero
		BEQ .zero			
			LDA Arg2			;Check multiplicand 2 for zero
			BEQ .zero
				BRA .notzero;If neither is zero, continue
.zero
		.IF Debug==1		;Print 'z' to show no calculation was done
			LDA #'z'
			putc
		.ENDIF
		STZ ResultLo		;Store zero in result
		STZ ResultHi
		BRA .multiplydone	;Skip to end of loop avoiding calculation
.notzero
	.ENDIF

	.IF CheckForOne==1	;If one multiplicand is one, skip calculation
		LDA Arg1				;Check multiplicand 1 for one
		CMP #1
		BEQ .Arg1one		;If so, copy multiplicand 2 to result
			LDA Arg2			;Check multiplicand 2 for one
			CMP #1
			BEQ .Arg2one	;If so, copy multiplicand 1 to result
				BRA .notone;If neither is one, continue
.Arg1one
		.IF Debug==1
			LDA #'a'			;Print 'a' to show no calculation was done
			putc
		.ENDIF
		LDA Arg2				;Copy multiplicand 2 to result
		STA ResultLo
		STZ ResultHi
		BRA .multiplydone	;Skip to end of loop avoiding calculation
.Arg2one
		.IF Debug==1
			LDA #'b'			;Print 'a' to show no calculation was done
			putc
		.ENDIF
		LDA Arg1				;Copy multiplicand 2 to result
		STA ResultLo
		STZ ResultHi
		BRA .multiplydone ;Skip to end of loop avoiding calculation
.notone
	.ENDIF
1. Repeated additions (1,596 cycles)

Code: Select all

	.IF MultiplyRoutine==1
		STZ ResultLo 		;Zero results
		STZ ResultHi
		LDX Arg1				;Hold multiplicand 1 in X to copy quickly
		LDA Arg2				;Store multiplicand 2 in R0 counter
		STA R0				
.calcloop					;Calculation loop
		BEQ .multiplydone	;Loop as long as R0 counter is not 0
			TXA				;Load multiplicand 1 copy stored in X
			CLC
			ADC ResultLo	;Add to result
			STA ResultLo	
			BCC .calcnocarry	;Skip updating high byte of result if no carry
				LDA #0			;Load 0 since only carry is being added
				ADC ResultHi	;Add carry
				STA ResultHi
.calcnocarry
			;Decrease R0 counter by one
			LDA R0
			SBC #0
			STA R0
		BRA .calcloop
	.ENDIF
2. Repeated additions, smaller multiplicand in counter (1,236.66 cycles)

Code: Select all

.IF MultiplyRoutine==2
		STZ ResultLo		;Zero results
		STZ ResultHi
		
		LDA Arg1				;Check which multiplicand is larger
		CMP Arg2
		BMI .switchargs
			LDX Arg1			;Hold multiplicand 1 in X to copy quickly
			LDA Arg2			;Store multiplicand 2 in R0 counter
			STA R0
			BRA .switchdone		
.switchargs
		LDX Arg2				;Hold multiplicand 2 in X to copy quickly
		LDA Arg1				;Store multiplicand 2 in R0 counter
		STA R0
.switchdone

;REST IS THE SAME AS Method 1 above
.calcloop					;Calculation loop
		BEQ .multiplydone ;Loop as long as R0 counter is not 0
			TXA				;Load multiplicand 1 copy stored in X
			CLC
			ADC ResultLo	;Add to result
			STA ResultLo
			BCC .calcnocarry	;Skip updating high byte of result if no carry
				LDA #0			;Load 0 since only carry is being added
				ADC ResultHi	;Add carry
				STA ResultHi
.calcnocarry
			;Decrease R0 counter by one
			LDA R0
			SBC #0
			STA R0
		BRA .calcloop
	.ENDIF
3. Shift/add (300.8 cycles)

Code: Select all

	.IF MultiplyRoutine==3
		STZ ResultLo		;Zero results
		STZ ResultHi
		
		;Start with ones place of multiplicand one
		LDA Arg1
		AND #$F
		TAX					;Use X to countdown
.calcloop
		BEQ .ones_done		;Loop until ones place of counter reaches 0
			;Add multiplicand 2 to result
			LDA Arg2
			CLC
			ADC ResultLo
			STA ResultLo
			;Carry to upper byte if necessary
			BCC .nocarry
				LDA #0
				ADC ResultHi
				STA ResultHi
.nocarry
		;Decrease counter works on BCD since only ones place
		DEX
		BRA .calcloop
.ones_done
		
		;Shift ones nibble of multiplicand 2 left and store in R1
		LDA Arg2
		ASL
		ASL
		ASL
		ASL
		STA R1
		
		;Shift 10s nibble of multiplicand 2 to use as high byte
		LDA Arg2
		ROR
		ROR
		ROR
		ROR
		AND #$F
		STA R2
		
		;Use 10s nibble of multiplicand 1 as counter in X
		LDA Arg1
		ROR
		ROR
		ROR
		ROR
		AND #$F
		TAX
		
		CLC
.loop10s
		BEQ .tens_done		;Loop until X counts down to 0
			LDA R1			;Load low byte of shifted muliplicand 2
			ADC ResultLo	;Add to result
			STA ResultLo
			LDA R2
			ADC ResultHi	;Load high byte of shifted multiplicand 2
			STA ResultHi
			DEX				;Decrement 10s counter of multiplicand 1
		BRA .loop10s		
.tens_done
	.ENDIF
4. 4x8 bit lookup tables (107 cycles)
Tables are stored as $0xyy where x is a 4 bit BCD number and yy is an 8 bit packed BCD number. Each table has a high and low section for the high and low values of the 16 bit answer. Table one has unshifted answers where $0xyy = x * yy and table two is shifted answers where $0xyy = x * yy0. Note that BCD mode is enabled and disabled to do pointer arithmetic.

Code: Select all

	.IF MultiplyRoutine==4
		LDA Arg1				;Store multiplicand 1 in table pointer R0
		STA R0				
		LDA Arg2				;Get low nibble of multiplicand 1
		AND #$F
		PHA					;Make a copy of nibble for second half of table pointer
		CLC
		CLD					;Disable BCD mode for pointer arithmetic
		ADC #>Table1_Lo	;Add Table1 offset to high byte of table pointer
		STA R1
		LDA (R0)				;Copy table entry to lowbyte of result
		STA ResultLo		
		PLA					;Get copy of low nibble of multiplicand 2
		ADC #>Table1_Hi	;Add Table1 offset to high byte of table pointer
		STA R1
		LDA (R0)				;Copy table entry to lowbyte of result
		STA ResultHi
		LDA Arg2				;Get high nibble of multiplicand 2
		ROR					
		ROR
		ROR
		ROR
		AND #$F
		PHA					;Make a copy of nibble for second half of table pointer
		CLC 					
		ADC #>Table2_Lo	;Add Table1 offset to high byte of table pointer
		SED					;Reenable BCD for adding table entry to result
		STA R1				;High byte of table pointer
		LDA (R0)				;Add table entry to low byte of result
		ADC ResultLo
		STA ResultLo
		LDA #0				;Apply carry to high byte
		ADC ResultHi
		STA ResultHi
		PLA					;Get copy of high nibble o multiplicand 2
		CLD					;Disable BCD for pointer arithmetic
		ADC #>Table2_Hi	;Add Table1 offset to high byte of table pointer
		SED					;Reenable BCD for adding table entry to result
		STA R1				;High byte of table pointer
		LDA (R0)				;Add table entry to high byte of result
		ADC ResultHi
		STA ResultHi
	.ENDIF
5. 4x8 bit lookup tables aligned to $x000 (103 cycles)
Same as method five but the high and low parts of tables 1 and 2 are split up into four different files and each is aligned to $x000, so pointer arithmetic can be done in BCD mode.
Code: Same as method 4 without SED and CLD. Also, uses Table5 and Table 6 instead of Table1 and Table2.

6. Russian peasant algorithm (274.02 cycles)
Halving the smaller multiplicand is a little faster on average.

Code: Select all

	.IF MultiplyRoutine==6
		;Put larger of the two multiplicands in R0 to be doubled
		;Put smaller multiplicand in R2 to be halved
		LDA Arg1		
		CMP Arg2				
		BMI .switchargs
			STA R0
			LDA Arg2
			STA R2
			BRA .dontswitch
.switchargs
			STA R2
			LDA Arg2
			STA R0
.dontswitch
		STZ R1					;Clear high byte of 16 bit value for doubling
		STZ ResultLo			;Clear results
		STZ ResultHi
		CLC
.calcloop	
		BBR #0,R2,.dontadd	;Test halving variable for evenness
			;Add 16 bit doubling variable (R0-R1) to result
			LDA R0				
			ADC ResultLo
			STA ResultLo
			LDA R1
			ADC ResultHi
			STA ResultHi
.dontadd							;Jump here if halving variable is even
		
		;Double 16 bit doubling variable
		LDA R0
		ADC R0
		STA R0
		LDA R1
		ADC R1
		STA R1
		
		;Halve the halving variable
		CLC
		ROR R2
		CLC
		BBR #3,R2,.dontcorrect	;Check if BCD correction needed
			RMB #3,R2				;Clear high bit of low nibble shifted in from high nibble
			LDA #5					;Add 5 to low nibble instead
			ADC R2
			STA R2
.dontcorrect						;No BCD correction needed
		BEQ .calcloopdone
			JMP .calcloop
.calcloopdone
	.ENDIF
7. Quarter squares with first half of table (115.33 cycles)
I tried implementing the algorithm myself and ended up only using half the table that other algorithms I looked at later used. (This is not to save space, it just didn't occur to me to generate the second half since the first half is sufficient).

Code: Select all

	.IF MultiplyRoutine==7		
		CLC					;Store sum of multiplicand 1 and 2 in table pointer R0
		LDA Arg1						
		ADC Arg2
		STA R0
		LDA #0				;Store copy of carry in R2 for later
		ADC #0
		STA R2
		ADC #>Table3_Lo	;Add offset into first table to high byte of table pointer
		STA R1
		LDA (R0)				;Store table entry in result
		STA ResultLo
		LDA R2				;Retrieve carry from adding multiplicands
		ADC #>Table3_Hi	;Add offset into first table to high byte of table pointer
		STA R1
		LDA (R0)				;Store table entry in result
		STA ResultHi
		SEC					
		LDA Arg1				;Find difference of multiplicands
		SBC Arg2
		BCS .notinverse	;Check if difference is negative
			STA R3			;If negative, take absolute value
			SEC
			LDA #$00
			SBC R3
.notinverse					;Jump here if difference is not negative
		STA R0				;Store absolute value of difference of multiplicands in table pointer
		LDA #>Table3_Lo	;High byte of pointer to table
		STA R1
		LDA (R0)				;Retrieve table value
		STA R3				;Save temporarily
		LDA ResultLo		;Subract table value from result from earlier table access
		SEC
		SBC R3
		STA ResultLo
		BCS .nosubcarry	;Borrow from high byte of result if subtraction requires
			LDA ResultHi
			SBC #0
			STA ResultHi
.nosubcarry
		LDA #>Table3_Hi	;High byte of pointer into table
		STA R1
		LDA (R0)				;Retrieve table value
		STA R3				;Save temporarily
		LDA ResultHi		;Subtract table value from high byte of result
		SEC
		SBC R3
		STA ResultHi
	.ENDIF
8. Quarter squares with whole table (118 cycles)
I tried to base this off the fast multiplication routine I found here:
http://codebase64.org/doku.php?id=base: ... iplication

In that version, the X register indexes into the table. I can't do the same thing by loading a BCD number into X, since it won't be added to the address in BCD format, so I calculate the address myself, like in method 7. Strangely, using the larger table is slightly slower as I have implemented it. I think it might be possible to improve this one. Any ideas?

Code: Select all

	.IF MultiplyRoutine==8
		CLC 					;Store sum of multiplicand 1 and 2 in table pointer R0
		LDA Arg1
		ADC Arg2
		STA R0
		LDA #0				;Store copy of carry in R2 for later
		ADC #0
		STA R2
		ADC #>Table3_Lo	;Add offset into first table to high byte of table pointer
		STA R1
		LDA (R0)				;Store table entry in result
		STA ResultLo
		LDA R2				;Retrieve carry from adding multiplicands
		ADC #>Table3_Hi	;Add offset into first table to high byte of table pointer
		STA R1
		LDA (R0)				;Store table entry in result
		STA ResultHi
		
		;Take inverse of multiplicand 1 before summing multiplicands
		;(Same trick as EOR $#FF when calculation is binary)
		SEC					
		LDA #$99
		SBC Arg1
		CLC					;Find sum of multiplicands
		ADC Arg2
		STA R0				;Store in low byte of table pointer
		LDA #0				;Save copy of carry 
		ADC #0
		STA R2
		ADC #>Table4_Lo	;Add carry and table address to high byte of table pointer
		STA R1				
		SEC
		LDA ResultLo		;Subtract table entry from result from first table access
		SBC (R0)
		STA ResultLo
		LDA ResultHi		;Borrow from high byte of result if necessary
		SBC #0
		STA ResultHi
		CLC					
		LDA R2				;Retrieve carry from summing multiplicands
		ADC #>Table4_Hi	;Add carry and table address to high byte of table pointer
		STA R1
		SEC					;Subtract table entry from high byte of result
		LDA ResultHi
		SBC (R0)
		STA ResultHi
	.ENDIF
Attachments
table6lo.asm
(13.26 KiB) Downloaded 259 times
table6hi.asm
(13.85 KiB) Downloaded 234 times
table5lo.asm
(11.89 KiB) Downloaded 250 times
table5hi.asm
(12.47 KiB) Downloaded 235 times
table3.asm
(8.49 KiB) Downloaded 246 times
table1.asm
(51.47 KiB) Downloaded 231 times
multtest.asm
(13.41 KiB) Downloaded 257 times
k_io.asm
(753 Bytes) Downloaded 248 times
User avatar
Druzyek
Posts: 367
Joined: 12 May 2014
Contact:

Re: Fast BCD multiplication

Post by Druzyek »

With small look up tables to replace 4-bit shifts and the halving of the Russian peasant algorithm, I shaved off 4-16 cycles from routines 3-6.

I tried converting the BCD multiplicands to decimal with a look up table so I could use the faster quarter square algorithm. It finishes in 62 cycles on average and needs 24 more (86 total) to convert decimal back to BCD with a table.

Code: Select all

		;Set up quarter square registers (at startup)
		LDA #>Table7_Lo
		STA QS1_Hi
		LDA #>Table7_Hi
		STA QS2_Hi
		LDA #>Table8_Lo
		STA QS3_Hi
		LDA #>Table8_Hi
		STA QS4_Hi

	.IF MultiplyRoutine==9
		LDX Arg1				;Convert Arg1 from BCD to binary
		LDA HEXtoBIN,X
		TAY					;Keep binary representation in Y
		LDX Arg2
		LDA HEXtoBIN,X		;Convert Arg2 from BCD to binary
		STA QS1_Lo			;Low byte of pointer into first table
		STA QS2_Lo			;High byte of pointer into first table
		EOR #$FF				
		STA QS3_Lo			;Low byte of pointer into second table
		STA QS4_Lo			;High byte of pointer into second table
		
		CLD					;Disable decimal mode for binary arithmetic
		SEC
		LDA (QS1_Lo),Y		;Difference of low bytes from first and second table
		SBC (QS3_Lo),Y
		TAX					;Store in X for indexing into binary to BCD table
		LDA (QS2_Lo),Y		;Difference of high bytes from first and second table
		SBC (QS4_Lo),Y
		TAY					;Store in Y for indexing into binary to BCD table
		
		;Convert decimal to BCD
		SED
		CLC
		LDA Table9_Lo,X	;Add low bytes of BCD look up table
		ADC Table10_Lo,Y	
		STA ResultLo		
		LDA Table9_Hi,X	;Add high bytes of BCD look up table
		ADC Table10_Hi,Y
		STA ResultHi
		
	.ENDIF
Attachments
converters.asm
(4.82 KiB) Downloaded 239 times
table7.asm
(16.92 KiB) Downloaded 237 times
Cray Ze
Posts: 134
Joined: 02 May 2015

Re: Fast BCD multiplication

Post by Cray Ze »

I started working on a test using the ūrdhva tiryagbhyāṃ Sūtra for multiplication, though ran into a 4bit shift issue of my own (too many).
I can probably expand the table to get around it.

Basically, the idea was to break the single 8x8 multiply into four 4x4 multiplies where the results can be fetched from a lookup table.
The table address consists of the high and low nyble to be multiplied with the data containing the result.

Code: Select all


;     0    1    2    3    4    5    6    7    8    9  |---- Address  Padding ----|

.DB $00, $00, $00, $00, $00, $00, $00, $00, $00, $00, $FA, $FB, $FC, $FD, $FE, $FF	; x 0
.DB $00, $01, $02, $03, $04, $05, $06, $07, $08, $09, $FA, $FB, $FC, $FD, $FE, $FF	; x 1
.DB $00, $02, $04, $06, $08, $10, $12, $14, $16, $18, $FA, $FB, $FC, $FD, $FE, $FF	; x 2
.DB $00, $03, $06, $09, $12, $15, $18, $21, $24, $27, $FA, $FB, $FC, $FD, $FE, $FF	; x 3
.DB $00, $04, $08, $12, $16, $20, $24, $28, $32, $36, $FA, $FB, $FC, $FD, $FE, $FF	; x 4
.DB $00, $05, $10, $15, $20, $25, $30, $35, $40, $45, $FA, $FB, $FC, $FD, $FE, $FF	; x 5
.DB $00, $06, $12, $18, $24, $30, $36, $42, $48, $54, $FA, $FB, $FC, $FD, $FE, $FF	; x 6
.DB $00, $07, $14, $21, $28, $35, $42, $49, $56, $63, $FA, $FB, $FC, $FD, $FE, $FF	; x 7
.DB $00, $08, $16, $24, $32, $40, $48, $56, $64, $72, $FA, $FB, $FC, $FD, $FE, $FF	; x 8
.DB $00, $09, $18, $27, $36, $45, $54, $63, $72, $81, $FA, $FB, $FC, $FD, $FE, $FF	; x 9

I put together a couple of examples to help show the workings behind the math.
Urdhva Tiryagbhyam worked examples.
Urdhva Tiryagbhyam worked examples.
kakemoms
Posts: 349
Joined: 02 Mar 2016

Re: Fast BCD multiplication

Post by kakemoms »

Interesting results. We discussed some routines on Denial last year:

Multiplication in assembler

Larger tables translates into quicker routines.

Another way:

Code: Select all

// y*a = ((y+a)^2-(y-a)^2)/4
STA $1
TYA
STA $2
SEC
SBC $1
TAX
TYA
CLC
// 3+2+3+2+3+2+3+2  +3+2+2+2+3+2+2+2 (or 5+3+3+3+3)
ADC $2
TAY
BCS trop
LDA squaretableLSB,Y    // All results pre-divided by 4, 0-255 table
SEC
SBC squaretableLSB,X    // All results pre-divided by 4, 0-255 table
STA resultLSB
LDA squaretableMSB,Y
SBC squaretableMSB,X
STA resultMSB
RTS
trop:
LDA squaretableLSB2,Y    // All results pre-divided by 4, 0-255 table
SEC
SBC squaretableLSB,X    // All results pre-divided by 4, 0-255 table
STA resultLSB
LDA squaretableMSB2,Y
SBC squaretableMSB,X
STA resultMSB
RTS
Clocks in at 53 cycles, but with a 512 byte table. Unceremoniously non-tested and non-optimized. For BCD, you only need a 200 byte table.

Edit 2:
Last edited by kakemoms on Fri Nov 17, 2017 1:28 pm, edited 3 times in total.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Fast BCD multiplication

Post by BigEd »

Quote:
In that version, the X register indexes into the table. I can't do the same thing by loading a BCD number into X, since it won't be added to the address in BCD format, so I calculate the address myself, like in method 7.
I haven't studied this carefully, but even if X holds a BCD number, surely it can still index into a table, it's just that the table is a bit sparsely populated - 100 entries in a field of 256.
User avatar
Druzyek
Posts: 367
Joined: 12 May 2014
Contact:

Re: Fast BCD multiplication

Post by Druzyek »

Quote:
Another way:
Cool. What are the lines before the table read doing?
Quote:
Clocks in at 53 cycles, but with a 512 byte table.
So are the squaretableLSB and squaretableMSB tables each 512 bytes?
Quote:
For BCD, you only need a 200 byte table.
Right, but I don't think you can use X indexing into the square table for BCD numbers (unless that is what you are handling in the first few lines). Have you tested this way with BCD numbers?
BigEd wrote:
Quote:
In that version, the X register indexes into the table. I can't do the same thing by loading a BCD number into X, since it won't be added to the address in BCD format, so I calculate the address myself, like in method 7.
I haven't studied this carefully, but even if X holds a BCD number, surely it can still index into a table, it's just that the table is a bit sparsely populated - 100 entries in a field of 256.
The table holds values for (a+b)^2/4, so any two numbers added together should point to the result of that calculation in the table like this:

Code: Select all

LDA #>Table
STA PtrHi
LDA Arg1
STA PtrLo
LDX Arg2
LDA (PtrLo),X
So imagine if Arg1 is 30 and Arg2 is 40. The low byte of the pointer will be 70 and it will point to the table entry for (70)^2/4. It works the same if Arg1 is 35 and Arg2 is 35.
With BCD, though, if Arg1 is $30 and Arg2 $40, the pointer will not point to the same address as when Arg1 is $35 and Arg2 is $35.
kakemoms
Posts: 349
Joined: 02 Mar 2016

Re: Fast BCD multiplication

Post by kakemoms »

Druzyek wrote:
Quote:
Another way:
Cool. What are the lines before the table read doing?
Quote:
Clocks in at 53 cycles, but with a 512 byte table.
So are the squaretableLSB and squaretableMSB tables each 512 bytes?
Quote:
For BCD, you only need a 200 byte table.
Right, but I don't think you can use X indexing into the square table for BCD numbers (unless that is what you are handling in the first few lines). Have you tested this way with BCD numbers?
I think I edited my post while you were writing an answer, but anyway, corrected code and statements are shown above (in previous post).

The code before the table is to calculate (a+b)/2 and (a-b)/2 for BCD.

You need to correct the LSR for BCD (both a-b and a+b). The only problem with LSR is if there is something in bit 4 before the LSR (=10) in which it ends in bit 3 afterwards (=8) which is 3 more than it should be. So by testing bit 3 and subtracting 3 (if its there) in both a and b, the result becomes correct (see here for a better explanation)

The lookup table is simply to get the square value of a number. There are 256 possible inputs for a byte, so 256 bytes in the table LSB and 256 bytes in the table MSB. For BCD you only have 100 possible inputs, but the table needs to cover as much memory as for a full byte since you can't start filtering out $A,$B,$C,$D,$E and $F values in the first or last 4-bit container of the byte..

I don't show the table (which you either need to type in manually or calculate in assembly), but it will work with BCD values if you make it point to BCD values:

table,X --> if X=25 in BCD this is 2=%0010 for first container and 5=%0101 for second, so X=%00100101 which will point to the 37'th value in the table. As the table is precalculated, that entry will contain 25^2 or 625, which is %00000110 for MSB (=06) and %00100101 for LSB (=25). And so on..
User avatar
Druzyek
Posts: 367
Joined: 12 May 2014
Contact:

Re: Fast BCD multiplication

Post by Druzyek »

Quote:
I think you could convert BCD to binary and divide by two with one table look up and use a 200 byte table like you mentioned.
On second thought, you have to add 1 if a and b are both odd, since halving them leaves two halves (which your code does I think).

Edit: Also, how to do you handle one odd and one even input? Don't you need to subtract one, then add it back after the table look up? I was trying things out on paper and couldn't get it to work out for one even and one odd, then I found this: http://www.trottermath.net/algebra/multsqs.html

Edit 2: It looks like you can do it taking even and odd into account too: http://codebase64.org/doku.php?id=base: ... iplication
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Fast BCD multiplication

Post by BigEd »

Oops, you're right, I do see now that the table indexing wouldn't work: $09+$09 is $12 in binary, but $18 in BCD, but you can also reach $12 with $10+$2 (in either sense) - so the table entry cannot correspond to a unique sum of BCD inputs.
User avatar
Druzyek
Posts: 367
Joined: 12 May 2014
Contact:

Re: Fast BCD multiplication

Post by Druzyek »

BigEd wrote:
Oops, you're right, I do see now that the table indexing wouldn't work: $09+$09 is $12 in binary, but $18 in BCD, but you can also reach $12 with $10+$2 (in either sense) - so the table entry cannot correspond to a unique sum of BCD inputs.
I haven't had time to try it yet, but the fastest version might be to convert BCD to decimal with a lookup table and use that to index into a table of BCD values.
User avatar
Druzyek
Posts: 367
Joined: 12 May 2014
Contact:

Re: Fast BCD multiplication

Post by Druzyek »

Quote:
I haven't had time to try it yet, but the fastest version might be to convert BCD to decimal with a lookup table and use that to index into a table of BCD values.
This does seem to be the fastest method. I got it down to 60 cycles:

Code: Select all

;Method 10 - Quarter square with decimal index and BCD data
	.IF MultiplyRoutine==10
		LDX Arg1				;Convert Arg1 from BCD to binary
		LDA HEXtoBIN,X
		TAY					;Keep binary representation in Y
		LDX Arg2
		LDA HEXtoBIN,X		;Convert Arg2 from BCD to binary
		STA QS1_Lo			;Low byte of pointer into first table
		STA QS2_Lo			;High byte of pointer into first table
		EOR #$FF				
		STA QS3_Lo			;Low byte of pointer into second table
		STA QS4_Lo			;High byte of pointer into second table
		
		SEC
		LDA (QS1_Lo),Y		;Difference of low bytes from first and second table
		SBC (QS3_Lo),Y
		STA ResultLo
		LDA (QS2_Lo),Y		;Difference of high bytes from first and second table
		SBC (QS4_Lo),Y
		STA ResultHi
	.ENDIF
Quote:
Clocks in at 53 cycles, but with a 512 byte table. Unceremoniously non-tested and non-optimized. For BCD, you only need a 200 byte table.
I still don't think that what you are showing is possible without checking if the sum is even or odd, making the sum even, then adding to the sum after the table fetch it the sum was odd. I would be glad to see how you do it though if you can. Here is my implementation of that process for BCD which takes 113 cycles on average:

Code: Select all

	.IF MultiplyRoutine==11
		LDA Arg1
		BEQ .zero				;If a+b is odd, we need to subtract one, so make sure
									;at least one multiplicand is above 0
			CMP Arg2				
			BCC .swap			;Make sure larger argument ends up in Y, and smaller in X
				TAY
				STA R2			;Save R2 for correction at end if a+b was odd
				LDX Arg2			
				BEQ .zero
				BRA .swapdone		
.swap
			TAX
			LDY Arg2
			STY R2
			BEQ .zero 
.swapdone
				LDA HEXtoBIN,Y	;Convert larger multiplicand from BCD to dec
				TAY
				STA R1			;Store in R1 for EORing with other multiplicand
				LDA HEXtoBIN,X	;Convert smaller multiplicand from BCD to dec
				STA R0			;Save fr calculating a+b and a-b
				EOR R1			;EOR a and b to see if a+b is even or odd
				ROR				;Shift bit 0 into Carry for testing
				TYA				;Load larger multiplicand in preparation for a+b
				BCS .odd			;If odd, calculate differently
					CLD			;Calculate a+b
					CLC
					ADC R0	
					ROR			;Divide by 2
					TAX			;Store in X for indexing into table
					TYA			;Larger multiplicand
					SEC
					SBC R0		;Calculate a-b
					CLC
					ROR			;Divide by 2
					TAY			;Store in Y for indexing into table
					SED			;Diference of low bytes from square table
					SEC
					LDA Table13_Lo,X
					SBC Table13_Lo,Y
					STA ResultLo
					LDA Table13_Hi,X	;Difference of high bytes in square table
					SBC Table13_Hi,Y
					STA ResultHi
					BRA .multiplydone
.odd
					DEC R0		;Decrease smaller multiplicand to make a+b even
					;Next 19 lines same as above
					CLD			;Calculate a+b
					CLC
					ADC R0
					ROR			;Divide by 2
					TAX			;Store in X for indexing into table
					TYA			;Larger multiplicand
					SEC
					SBC R0		;Calculate a-b
					CLC
					ROR			;Divide by 2
					TAY			;Store in Y for indexing into table
					SED			;Diference of low bytes from square table
					SEC
					LDA Table13_Lo,X
					SBC Table13_Lo,Y
					STA ResultLo
					LDA Table13_Hi,X	;Difference of high bytes in square table
					SBC Table13_Hi,Y
					STA ResultHi
					CLC					;Add back copy of larger multiplicand
					LDA R2
					ADC ResultLo
					STA ResultLo
					LDA #0
					ADC ResultHi
					STA ResultHi
		BRA .multiplydone
.zero
		STZ ResultLo			;One multplicand was zero
		STZ ResultHi
	.ENDIF
I was also thinking the absolute fastest way would be to map three buffers into zero page that drive a 128k ROM and one address to read the output. That would be 32 cycles like this:

Code: Select all

	LDA Arg1
	STA ROM_PtrLo
	LDA Arg2
	STA ROM_PtrHigh
	STZ ROM_Op
	LDA ROM_Out
	STA ResultLo
	INC ROM_Op
	LDA ROM_Out
	STA ResultHi
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Fast BCD multiplication

Post by BigEd »

Druzyek wrote:
BigEd wrote:
Oops, you're right, I do see now that the table indexing wouldn't work: $09+$09 is $12 in binary, but $18 in BCD, but you can also reach $12 with $10+$2 (in either sense) - so the table entry cannot correspond to a unique sum of BCD inputs.
I haven't had time to try it yet, but the fastest version might be to convert BCD to decimal with a lookup table and use that to index into a table of BCD values.
I wonder, would excess-3 notation be a better bet for this? Add $33 to both operands, and then the sum is $33 greater than the BCD sum - it all becomes very linear.
User avatar
Druzyek
Posts: 367
Joined: 12 May 2014
Contact:

Re: Fast BCD multiplication

Post by Druzyek »

Quote:
I wonder, would excess-3 notation be a better bet for this?
Hmm, I've never heard of this. The Wikipedia page says it was used in some calculators. So, I made a spreadsheet with $33 added to both operands but it still looks like there is overlap. Wouldn't adding $33 to both operands make it $66 greater than the BCD sum? Do you mean to add it as BCD or just a binary add? I tried all the combinations and there is still overlap.
User avatar
BigEd
Posts: 11463
Joined: 11 Dec 2008
Location: England
Contact:

Re: Fast BCD multiplication

Post by BigEd »

Hmm, I thought it worked out neatly, but trying it out I seem to be confused.
kakemoms
Posts: 349
Joined: 02 Mar 2016

Re: Fast BCD multiplication

Post by kakemoms »

Druzyek wrote:
Quote:
I think you could convert BCD to binary and divide by two with one table look up and use a 200 byte table like you mentioned.
On second thought, you have to add 1 if a and b are both odd, since halving them leaves two halves (which your code does I think).

Edit: Also, how to do you handle one odd and one even input? Don't you need to subtract one, then add it back after the table look up? I was trying things out on paper and couldn't get it to work out for one even and one odd, then I found this: http://www.trottermath.net/algebra/multsqs.html

Edit 2: It looks like you can do it taking even and odd into account too: http://codebase64.org/doku.php?id=base: ... iplication
I guess you are right there. We lose 0.5 in instances were one of the original numbers were odd, so doing a LSR is not going to work.

In fact, one should skip the division by two of the original numbers and double the array (to 512 words) to get a correct lookup table (and take that into account that you are retrieving a value 4 times what you intended). After subtraction you simply divide by 4.
Post Reply