You either throw a large table at the problem (256*256 = 65536 bytes for 8-bit operands, and that holds only one byte of each product) or cut it down to smaller tables plus some computation. The fastest way is always a full table:
Code:
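stx mod+2 ; patch the high byte of the operand: x selects the table page
mod: lda table,y ; y selects the entry within that page (table page-aligned)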
The table is precomputed, and at 64 KB it fills the entire 6502 address space, so you need banked memory; not very practical unless your system supports it.
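With a smaller 256-byte table you can precompute all 4-bit by 4-bit products and build the 8*8-bit result from four lookups. With x = 16*xh + xl and y = 16*yh + yl:
x*y = 256*xh*yh + 16*(xh*yl + xl*yh) + xl*yl.
The routine takes its operands in X and Y and leaves the 16-bit product in $92 (low byte) and $93 (high byte). tab44 has to start on a page boundary, because only the low byte of each operand address is patched: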
Code:
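mult88f
        txa             ;2
        and #$f         ;2 low nibble of x
        sta m4s1+1      ;4
        asl             ;2
        asl             ;2
        asl             ;2
        asl             ;2 low nibble of x * 16
        sta m4u1+1      ;4
        txa             ;2
        and #$f0        ;2 high nibble of x * 16
        sta m4s2+1      ;4
        lsr             ;2
        lsr             ;2
        lsr             ;2
        lsr             ;2 high nibble of x (c=0)
        sta m4s3+1      ;4
        tya             ;2
        and #$f0        ;2
        tax             ;2 x = high nibble of y * 16
        tya             ;2
        and #$f         ;2
        tay             ;2 y = low nibble of y
m4s1    lda tab44,x     ;4 l4bx*h4by
m4s2    adc tab44,y     ;4 + h4bx*l4by (c=0), 9-bit sum in c:a
        sta $92         ;3
        ror             ;2 shift right and put carry in bit7
        lsr             ;2
        lsr             ;2
        lsr             ;2 sum/16, all bits in bit4-bit0
        sta $93         ;3 high byte of sum*16
        lda $92         ;3
        and #$f         ;2
        asl             ;2
        asl             ;2
        asl             ;2
        asl             ;2 low byte of sum*16 (c=0)
m4u1    adc tab44,y     ;4 + l4bx*l4by
        sta $92         ;3 result low byte
m4s3    lda tab44,x     ;4 h4bx*h4by (weight 16*16)
        adc $93         ;3 add high byte of sum*16 plus carry from the low byte
        sta $93         ;3 result high byte
That comes to 104 cycles for the 16-bit result. The indexed table reads take 4 cycles each, because with a page-aligned tab44 they never cross a page boundary. The 4*4-bit table can be made with this code: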
Code:
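initmult
; set up 16x16 multiplication table: tab44[16*a+b] = a*b for a,b = 0..15
        ldx #$0         ; multiplier (table row)
        ldy #$0         ; index
iml0    tya
        clc
        adc #16
        sta imc1+1      ; end index of this row (wraps to 0 for the last row)
        stx $96         ; add multiplier to A each round
        lda #$0         ; start value = multiplier*0
iml1    sta tab44,y
        clc             ; cpy and the adc #16 above can leave the carry set
        adc $96
        iny
imc1    cpy #00
        bne iml1
        inx
        cpx #16
        bne iml0
        rts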
You can also have slightly larger tables (512 bytes, 1024 bytes) and save another 20-40 cycles on the multiplication (like in the link you referred to).
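If you only want to scale by a predefined factor (like 1/10th), you don't need a general multiply or divide at all; just throw a fixed-point reciprocal at the problem:
A/10 ≈ A*26/256 = (A*16 + A*8 + A*2)/256.
That can be done with a few ADC, ASL and LSR instructions. Since 26/256 is slightly more than 1/10, the truncated result equals A/10 exactly only for A up to 68; the first miss is A = 69, which gives 7 instead of 6.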
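As a rough sketch of that last formula, the routine below builds A*2 + A*8 + A*16 in 16 bits and returns the high byte, which is the same as dividing the sum by 256. The label div10 and the four zero-page locations are made up for this example, so pick whatever is free on your system.

```asm
; Sketch only: div10 and the zero-page addresses are assumptions, not from the code above.
tlo     = $fb           ; shifted copy of the input, low byte (hypothetical)
thi     = $fc           ; shifted copy of the input, high byte (hypothetical)
slo     = $fd           ; running sum, low byte (hypothetical)
shi     = $fe           ; running sum, high byte (hypothetical)

div10                   ; in: A = value, out: A = (value*26)/256
        sta tlo
        lda #0
        sta thi
        asl tlo
        rol thi         ; t = value*2
        lda tlo
        sta slo
        lda thi
        sta shi         ; s = value*2
        asl tlo
        rol thi         ; t = value*4
        asl tlo
        rol thi         ; t = value*8
        clc
        lda slo
        adc tlo
        sta slo
        lda shi
        adc thi
        sta shi         ; s = value*2 + value*8
        asl tlo
        rol thi         ; t = value*16
        clc
        lda slo
        adc tlo         ; low byte is thrown away, only its carry matters
        lda shi
        adc thi         ; A = high byte of value*26
        rts
```

For example, A = 50 gives 50*26 = 1300 = 5*256 + 20, so the routine returns 5. Keeping the full 16-bit sum costs a few extra instructions compared with shifting right in 8 bits, but it avoids losing low bits before the terms are added.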