I believe that is the fastest possible 8-bit signed multiply. I would point out a few things, though. First, the total size is wrong. I break it down like this:
Code:
;zp 0
;data 2044
;data total 2044
;code 35
;code+data 2079
Second, the table generation is slightly wrong. lda $x0FF,X with X=$ff can only reach $x1FE, since 255+255=510, so each table only needs 511 bytes (offsets 0 to 510). I wrote the tables like this (!align is definitely compatible with ACME):
Code:
; Tables must be aligned to page boundary
!align 255,0 ;pad up to the next page boundary ($xx00)
sqr_lo
!for i = -256 TO 254
!byte <((i*i)/4)
!end
free1 = *
!align 255,0
sqr_hi
!for i = -256 TO 254
!byte >((i*i)/4)
!end
free2 = *
!align 255,0
neg_sqr_lo
!for i = -255 TO 255
!byte <((i*i)/4)
!end
free3 = *
!align 255,0
neg_sqr_hi
!for i = -255 TO 255
!byte >((i*i)/4)
!end
free4 = *
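As a rough cross-check of the sizes, the same four tables can be modeled in Python. This is only a sketch: the helper name quarter_square_table is made up for illustration, the index comments assume the eor #$80 bias used in the routine, and the 35-byte code size is taken from the breakdown above rather than measured.

```python
# Model of the four quarter-square tables; list names mirror the assembly labels.

def quarter_square_table(first, last):
    """Return (lo, hi) byte lists of floor(i*i/4) for i in first..last inclusive."""
    values = [(i * i) // 4 for i in range(first, last + 1)]
    lo = [v & 0xFF for v in values]
    hi = [(v >> 8) & 0xFF for v in values]
    return lo, hi

# With both inputs biased by $80, the sum table is indexed by a + b + 256
# and the difference table by b - a + 255, so each needs offsets 0..510.
sqr_lo, sqr_hi = quarter_square_table(-256, 254)
neg_sqr_lo, neg_sqr_hi = quarter_square_table(-255, 255)

tables = [sqr_lo, sqr_hi, neg_sqr_lo, neg_sqr_hi]
data_bytes = sum(len(t) for t in tables)
code_bytes = 35  # figure from the post, not measured here

print(len(sqr_lo), data_bytes, data_bytes + code_bytes)  # 511 2044 2079
```

The largest entry is (-256)^2/4 = 16384 = $4000, so every value fits in the two bytes the lo/hi tables provide.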
I wrote my main routine slightly differently (input in Y is consistent with some other routines that have to use (zp),Y in them):
Code:
smult8
;signed 8 bit multiply, selfmod version
;perform A:X = A*Y
;inputs:
; A, Y
;outputs:
; A (high byte), X (low byte)
;registers used: A X Y
;cycles: 53.99
;memory used:
;zp 0
;data 2044
;data total 2044
;code 35
;code+data 2079
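    eor #$80            ;bias A: signed -128..127 becomes 0..255
    sta p_sqr_lo+1      ;patch low byte of the sqr_lo operand
    sta p_sqr_hi+1      ;patch low byte of the sqr_hi operand
    eor #$ff            ;255 minus the biased A
    sta p_neg_sqr_lo+1  ;patch low byte of the neg_sqr_lo operand
    sta p_neg_sqr_hi+1  ;patch low byte of the neg_sqr_hi operand
    tya
    eor #$80            ;bias Y the same way
    tay
    sec                 ;no borrow going into the low-byte subtract
p_sqr_lo
    lda sqr_lo,y        ;low byte of (a+y)^2/4
p_neg_sqr_lo
    sbc neg_sqr_lo,y    ;minus low byte of (y-a)^2/4
    tax                 ;X = low byte of the product
p_sqr_hi
    lda sqr_hi,y        ;high byte of (a+y)^2/4
p_neg_sqr_hi
    sbc neg_sqr_hi,y    ;minus high byte with borrow, A = high byte
    rts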
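For anyone who wants to check the routine without an emulator, here is a small Python model of the instruction sequence, run over all 65536 input pairs. It is a sketch under two assumptions: the tables are page-aligned (so the patched operand low byte acts as an offset into the table), and the lo/hi table pairs are folded into single 16-bit lists for brevity. The names smult8_model, SQR and NEG_SQR are hypothetical.

```python
def quarter_squares(first, last):
    """floor(i*i/4) for i in first..last inclusive."""
    return [(i * i) // 4 for i in range(first, last + 1)]

SQR = quarter_squares(-256, 254)      # sqr_lo/sqr_hi folded into 16-bit values
NEG_SQR = quarter_squares(-255, 255)  # neg_sqr_lo/neg_sqr_hi folded likewise

def smult8_model(a, y):
    """Return (A, X) for input bytes a, y given as 0..255 two's complement."""
    a ^= 0x80                  # eor #$80
    operand_sqr = a            # sta p_sqr_lo+1 / sta p_sqr_hi+1
    operand_neg = a ^ 0xFF     # eor #$ff, then sta p_neg_sqr_lo+1 / p_neg_sqr_hi+1
    y ^= 0x80                  # tya / eor #$80 / tay
    plus = SQR[operand_sqr + y]
    minus = NEG_SQR[operand_neg + y]
    lo = (plus & 0xFF) - (minus & 0xFF)   # sec, lda sqr_lo,y, sbc neg_sqr_lo,y
    borrow = 1 if lo < 0 else 0
    x = lo & 0xFF                          # tax
    hi = ((plus >> 8) - (minus >> 8) - borrow) & 0xFF  # lda sqr_hi,y, sbc neg_sqr_hi,y
    return hi, x

def to_signed(value, bits):
    """Interpret an unsigned value as two's complement of the given width."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

mismatches = 0
for a in range(256):
    for y in range(256):
        hi, lo = smult8_model(a, y)
        if to_signed((hi << 8) | lo, 16) != to_signed(a, 8) * to_signed(y, 8):
            mismatches += 1
print("mismatches:", mismatches)  # mismatches: 0
```

For example, smult8_model(0xFE, 0x03) models -2 * 3 and returns (0xFF, 0xFA), i.e. $FFFA = -6.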
In order to use a table for (x+y)^2/4, the table index has to be a monotonic (linear) function of the signed sum. That's the purpose of the eor #$80: the values -128 to +127 then map, in order, onto 0 to 255. Otherwise, it's the same as the unsigned version. There is a way to avoid it, but then you have to restrict the domain (input) so that all pairs have a unique place in the table.
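One way to see the point about unique places in the table is to count how many table indices would have to hold more than one value. The sketch below compares raw two's complement bytes against bytes biased with eor #$80. The function names are made up for illustration, and this is just one reading of the "monotonic" requirement.

```python
from collections import defaultdict

def raw(v):
    """Two's complement byte of a signed value, used directly as an index part."""
    return v & 0xFF

def biased(v):
    """Same byte after eor #$80: -128..127 maps in order onto 0..255."""
    return (v & 0xFF) ^ 0x80

def ambiguous_indices(to_index):
    """Count indices that would need more than one distinct (a+b)^2/4 entry."""
    needed = defaultdict(set)
    for a in range(-128, 128):
        for b in range(-128, 128):
            needed[to_index(a) + to_index(b)].add((a + b) ** 2 // 4)
    return sum(1 for values in needed.values() if len(values) > 1)

print(ambiguous_indices(biased))   # 0
print(ambiguous_indices(raw) > 0)  # True
```

With raw bytes, for instance, 100 + 100 and -56 + 0 both land on index 200 but need 10000 and 784 respectively, which is why the unbiased form only works on a restricted input domain.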
As for extending to 16-bit, it's not that simple, unfortunately. I can comment later.