yzoer wrote:
I just had to get my $0.02 in and make this my first post
Steve Judd ( I think ) posted something a long long time ago which is a pretty good trade off between speed & accuracy; a 256 byte table is all you need + some trickery
You actually need a 1024-byte table. Here's why:
If a and b are 8-bits, then a+b is 9 bits ($FF+$FF = $1FE). Also, the square of an 8-bit number is a 16-bit number. So, 512 entries at 16-bits each yields 1024 bytes devoted to table data.
That being said, however, you can still use 8-bit arithmetic to address it using some of those "tricks" you've been talking about.
For example:
Code:
ma = $F0 ; input A
mb = $F1 ; input B
mapb = $F2 ; temporary: ((A+B)^2)/4; also holds results
mamb = $f4 ; temporary: ((A-B)^2)/4
mul8x8:
clc ; Compute ((A+B)^2)/4
lda ma
adc mb
bcs 1$
tax
lda sqtable00,x
sta mapb
lda sqtable01,x
sta mapb+1
jmp 2$
1$:
tax
lda sqtable10,x
sta mapb
lda sqtable11,x
sta mapb+1
2$:
sec ; Compute ((A-B)^2)/4
lda ma
sbc mb
bcc 3$
tax
lda sqtable00,x
sta mamb
lda sqtable01,x
sta mamb+1
jmp 4$
3$:
tax
lda sqtable10,x
sta mamb
lda sqtable11,x
sta mamb+1
4$:
sec ; Subtract them to get AB.
lda mapb
sbc mamb
sta mapb
lda mapb+1
sbc mamb+1
sta mamb+1
rts
The key here is the division of the lookup table's duties:
sqtable00 is the table of square's for sums 0-255, low byte.
sqtable01 is the table of square's for sums 0-255, high byte.
sqtable10 is the table of square's for sums 256-511, low byte.
sqtable11 is the table of square's for sums 256-511, high byte.
NOTE: For this to all fit in 16-bits or less, you must store the squares pre-divided by four. That is, the second entry for sqtable00 isn't 4, as you'd expect, it's 1.
Drop any fraction, since the fractional components would get subtracted out in the end anyway.
NOTE: This is an unsigned multiply. You need a wrapper around this multiplication code to handle signed multiplications. But this is true of most 6502 multiplication subroutines anyway.
The 65816 represents a more convenient package to do this type of multiplication with (note: this is my first attempt at coding 65816-version of the above code, so it might have some errors, and could probably stand more optimization):
Code:
ma = $F0 ; input A (byte!)
mb = $F2 ; input B (byte!)
mapb = $F4
mamb = $F6
mul8x8_16:
longa on ; Assume accumulator is 16-bits wide,
longi on ; as are the index registers.
sep #MF ; But first, zero the high-bytes of the input parameters
stz ma+1 ; guarantees the subroutine will work without additional
stz mb+1 ; error handling.
rep #MF
sec ; Compute ((A-B)^2)/4 first.
lda ma
sbc mb
and #$01FF ; AND in case the result is negative; confines result into valid index bounds
asl a
tax
lda sqtable,x
sta mamb
clc ; Compute ((A+B)^2)/4 next. We do this here so we avoid having
lda ma ; to refetch its value from memory. It's already in A, so we just...
adc mb
asl a ; Note we don't need to AND for addition. High bits guaranteed to be clear.
tax
lda sqtable,x
sec
sbc mamb ;... subtract the other term, and A now has the product.
rts
As usual, sqtable is a table of squares predivided by four. However, it's formatted as 512 entries of two bytes each -- still 1KB in size, but more "logical" to address.
Interestingly enough, a 16x16=32 multiplication subroutine for the 65816 would take on the same form as the mul8x8 routine for the 6502, with similar performances. But it would require a 512KB square table.
It's probably better to use mul8x8_16 four times and add partial products instead, or to use an early-out "traditional" multiplication routine.
-- algebraic derivation for those who don't believe --
ab = ((a+b)^2)/4 - ((a-b)^2)/4
ab = ((a+b)^2 - (a-b)^2)/4
ab = ((a^2+2ab+b^2) - (a^2-2ab+b^2))/4
ab = (a^2+2ab+b^2-a^2+2ab-b^2)/4
ab = (a^2-a^2+2ab+2ab+b^2-b^2)/4
ab = (4ab)/4
ab = ab