>Yeah, I would certainly be very interested to see a full 32x32 version. It will be close. Three 16x16 multiplies in the best case is already 561 cycles, then a lot of adding, say another 160, estimating over 720 cycles. There is already a 32x32=64 in <750 cycles. It uses the sqr tables but with a variation on the inner columns where most of the adding of partial products is done in the column. In that case, optimized adding is most important. I believe it comes down to txa:adc #k3:ldx #0:ldy y3:adc (sqr),y:bcc s1:inx:clc s1...sta z+3. This looks strange, but remember the sqr method subtracts two table look ups. Some adjustments have to be made, but I add the 2's complement so that I add them all in a row. There's some adjustments which creates constants kn. I have to fudge the sqr tables. There's up to 14 adds in a row in the middle columns. There's a total of 64 adds which are averaging <11 cycles each, so the bulk of the routine is <704 cycles. I haven't published that one yet. You could also try the sqr formula with larger inputs, or the Knuth method. Given x=x1*256+x0, y=y1*256+y0 y1 y0 *x1 x0
Calculation Range k=(y1-y0)*(x1-x0) +-FE01 l=y1*x1 0-FE01 m=y0*x0 0-FE01 n=l+m-k 0-1FC02 l+m 0-1FC02 l-k or m-k +-FE01
x*y=l*65536+n*256+m (for 8 bit Xn/Yn)
Example (16x16 bits): y1 y0 ->$20 10 *x1 x0 40 30
k=(20-10)*(40-30)=100 l=20*40=800 m=10*30=300 n=800+300-100=a00
llll0000 -> 08000000 nnnnn00 00a0000 mmmm 0300 ---------------- 2010*4030 = 080a0300
x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn) --- In the situation of multiplying negative numbers via shift and add, indeed you can save time by checking if there's a lot of 1 bits in the multiplier then eor #$ff, and fix up later.
|