CBM (C64) floating point improvement
Re: CBM (C64) floating point improvement
It might be advantageous to do a lot of 48-bit arithmetic and convert to and from the 40-bit format.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
The byte-parallel multiply I outlined would still work with standard 2^n floating point numbers; just larger code for more speed.
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Hey White Flame, your input is greatly appreciated (and your Codebase work I have often cribbed, so thanks for that).
Basically I am having trouble conceptualising your suggestion. My understanding of the built-in routine is that it computes an 8-byte mantissa product (an n-byte multiplication yields a 2n-byte result), but rotates the answer right through only 4 bytes, shifting and adding, effectively discarding the 4 least significant bytes. I'm clearly going wrong when I extend this idea to individual bytes: each byte pair gives a two-byte result, and if I drop the low byte of each I can't see how these truncated values can get me to a correct product. I guess the question is: I can multiply byte by byte, but what do I do with the numbers I get?
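The long-multiplication view might help here: each 8x8 multiply gives a 16-bit partial product, and those get summed at byte offsets. The low bytes can only be discarded after the sums, because they carry into their neighbours. A Python sketch of the idea (illustrative only, not the ROM routine):

```python
# Illustrative sketch (not the ROM routine): a 32-bit multiply built
# from 8x8->16 partial products, summed at byte offsets.
def mul32_bytewise(m1, m2):
    a = m1.to_bytes(4, "big")
    b = m2.to_bytes(4, "big")
    acc = 0
    for i, x in enumerate(a):                 # i counts from the MSB
        for j, y in enumerate(b):
            shift = 8 * ((3 - i) + (3 - j))   # combined byte offset
            acc += (x * y) << shift           # keep the FULL 16-bit product
    return acc                                # 64-bit result; truncate last
```

The point is that truncation has to happen once, at the very end, after all 16 partial products have been accumulated.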
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Actually I stumbled upon something like it, Vedic Multiplication: http://www.cse.usf.edu/~hthapliy/Papers/v2-57.pdf
Still, byte-by-byte multiplication seems to involve 16 shifts, so I'm confused where your 8-shift number comes from.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
With m*256^e format, normalization should be faster, but I'm also interested in seeing how multiplying the mantissas would be any faster. I have been messing with Woz' FMUL, trying to adapt it to 40-bit MS format ... no joy yet.
Mike B.
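For what it's worth, the byte-granular normalization that an m*256^e format allows can be sketched like so (a Python model under assumed conventions: 32-bit mantissa, base-256 exponent, value = m * 256**e):

```python
# Sketch (assumed m*256^e format, not MS floating point): with a
# base-256 exponent, normalizing moves whole bytes, so it takes at
# most 3 byte moves instead of up to 31 single-bit shifts.
def normalize256(m, e):
    if m == 0:
        return 0, 0
    while m < 0x01000000:       # top byte empty: slide left one byte
        m <<= 8
        e -= 1
    while m > 0xffffffff:       # overflowed 32 bits: slide right
        m >>= 8
        e += 1
    return m, e
```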
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Just tried to implement the algorithm posted above, but haven't been able to get a correct result yet. With all the steps implemented, it's taking about the same amount of time as the built-in version. I'm using the smaller table-based fast multiply from Codebase; maybe the 2K-table version would be better (I'd be using that routine anyway if I wanted to add a 3D view to the simulator).
BTW, the regular multiply takes about 2300 cycles including loading the accumulators.
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Here is what I have so far; it seems to take a little over 1500 cycles. It uses the fast multiply from Codebase and is about an 800-cycle improvement over the built-in one, though perhaps it could be optimized further for multiplying zero bytes. Table generation and test code not included:
Here's what it's doing, with an example of $ffffffff^2; the C64 prg below calculates 3.15151515^2.
Code:
;Fast Floating Point Multiplication
;zp
RESULT = $26 ;-$2c
FAC = $62 ;mantissa - $65
ARG = $6a ;mantissa - $6d
res8 = $fb ;lobyte of 8x8 product
!macro addresult .num {
pha
lda res8
clc
adc RESULT+.num
sta RESULT+.num
pla
adc (RESULT-1)+.num
sta (RESULT-1)+.num
!set .num = .num - 1
!if .num != 0 {
bcc +
inc (RESULT-1)+.num
}
!do while .num > 1 {
!set .num = .num - 1
bne +
inc (RESULT-1)+.num
}
+
}
mul_FP bne +
rts ;0 in FAC returns 0
+ jsr $bab7 ;add exponents
lda #0 ;clear RESULT
sta $26
sta $27
sta $28
sta $29
sta $2a
sta $2b
sta $2c
lda FAC+3
ldx ARG+3
jsr mul8x8
sta RESULT+6
ldx ARG+2
jsr mulf
+addresult 6
ldx ARG+1
jsr mulf
+addresult 5
ldx ARG
jsr mulf
+addresult 4
lda FAC+2
ldx ARG+3
jsr mul8x8
+addresult 6
ldx ARG+2
jsr mulf
+addresult 5
ldx ARG+1
jsr mulf
+addresult 4
ldx ARG
jsr mulf
+addresult 3
lda FAC+1
ldx ARG+3
jsr mul8x8
+addresult 5
ldx ARG+2
jsr mulf
+addresult 4
ldx ARG+1
jsr mulf
+addresult 3
ldx ARG
jsr mulf
+addresult 2
lda FAC
ldx ARG+3
jsr mul8x8
+addresult 4
ldx ARG+2
jsr mulf
+addresult 3
ldx ARG+1
jsr mulf
+addresult 2
ldx ARG
jsr mulf
+addresult 1 ;add to result+1,+0
jmp $bb8f ;normal multiply exit. copies RESULT to FAC and normalises.
;---------------------------------
mul8x8 sta sm1+1
sta sm3+1
eor #$ff
sta sm2+1
sta sm4+1
mulf sec
sm1 lda multablo,x
sm2 sbc multab2lo,x
sta res8
sm3 lda multabhi,x
sm4 sbc multab2hi,x
rts
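For reference, the Codebase mul8x8 looks like the classic table-of-squares trick. Here's a Python model of the identity it appears to rely on (the table layout is an assumption inferred from the code; `multablo/hi` would hold f(i) and `multab2lo/hi` f(i-255)):

```python
# Model of the table-of-squares identity: a*b = f(a+b) - f(a-b),
# where f(x) = floor(x*x / 4).  The floors cancel exactly because
# (a+b) and (a-b) always have the same parity.
ftab = [i * i // 4 for i in range(512)]                   # f(i), a+b <= 510
gtab = [(i - 255) * (i - 255) // 4 for i in range(512)]   # f(i - 255)

def mul8x8(a, b):
    # The self-modifying code bakes 'a' into the table base address;
    # eor #$ff turns the second lookup into index (255-a)+b = f(b-a).
    return ftab[a + b] - gtab[(a ^ 0xff) + b]
```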
Attachment: fastm.zip
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
Here's the breakdown of my "parallel" idea, which is just a tweak on shift-add multiplication. Note that it's orthogonal to how the exponent works.
In calculating M1 * M2, we're going to be reading bits out of M1, and adding shifted versions of M2 to the accumulator.
M1 is made out of the 4 bytes [a,b,c,d], where d is the LSB.
- Right-shift d, and only d, to get the 1's bit out of M1. If set, add M2 to the accumulator.
- Right-shift c, and only c, to get the 256's bit out of M1. If set, add M2 to the accumulator offset by 1 memory location, which is effectively M2*256.
- Right-shift b, if bit is set then add M2 to the accumulator at 2 bytes offset.
- Right-shift a, if bit is set then add M2 to the accumulator at 3 bytes offset.
Shift M2 to the left by 1 bit, multiplying it by 2 for the next round of incoming bits.
At this point, we've done 1 full shift of the 32-bit M1 and M2, and have performed 4 out of 32 potential 32-bit additions. Now, repeat to draw the 2nd bit out of each byte of M1. This will add the M2*2, M2*512, etc components to the accumulator.
After all is said and done, each full 32-bit value will have only been shifted 8 times instead of 32. Or in more specific terms, this requires 64 read-modify-write shifting instructions as opposed to 256, which comes out to hundreds of cycles, if the rest of the code can manage to be of equivalent speed. For 5-cycle zeropage non-indexed shift instructions, that's 192*5 = 960 cycles saved, at the expense of likely a 4x larger code size for the inner loop.
Note that this number isn't necessarily guaranteed, as shifting M2 would likely involve 5 shifts instead of 4. Standard 32-bit shift/add routines tend to butt the various numbers together, sharing shift instructions across field boundaries. But this shouldn't be a huge factor, and the same optimization might be brought into this "parallel" version.
I have a hard time imagining this would be novel. I came up with this probably 15 years ago, but haven't seen anything similar in software optimizations. It seems like a natural optimization to me, though only really emerges from the expense of shifting 32-bit numbers on this particular platform. It wouldn't offer anything to hardware multipliers or 32-bit register platforms.
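The scheme above can be modeled in a few lines of Python (assuming 32-bit unsigned mantissas and the big-endian [a,b,c,d] layout from the description; this checks the bookkeeping, not the cycle counts):

```python
# Model of the byte-parallel shift-add: each of M1's four bytes is
# shifted independently, yielding 4 bit-tests per round and only
# 8 rounds, instead of 32 full-width shifts of M1.
def parallel_mul(m1, m2):
    m1b = list(m1.to_bytes(4, "big"))   # [a, b, c, d], d = LSB
    acc = 0
    for _ in range(8):                  # 8 rounds, one bit per byte
        for pos in range(4):            # pos 0 = d's bit ... pos 3 = a's bit
            byte_i = 3 - pos
            if m1b[byte_i] & 1:
                acc += m2 << (8 * pos)  # add M2 at a byte offset
            m1b[byte_i] >>= 1           # each byte shifts on its own
        m2 <<= 1                        # one shift of M2 per round
    return acc
```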
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
Something like this? (completely raw, non-optimized, and untested)
If my example actually works (dunno yet), it looks like there are 32 8-bit shifts plus eight 40-bit shifts, for a total of 72 native RMW instructions executed to acquire a 64-bit product (which would then need to be normalized and rounded). It could be a win over a more traditional 32 64-bit shift version, which would execute 256 RMW instructions, like you said.
Mike B.
Code:
mul:
ldx #7
lda #0
sta aux ; catches MSBs from m2
mul2:
sta p,x ; zero the 64-bit big-endian product
dex
bpl mul2
ldy #8
mul3:
ldx #3
mul4:
lsr m1,x
bcc mul5
clc
lda p+4,x
adc m2+3
sta p+4,x ; add m2 to p + offset
lda p+3,x
adc m2+2
sta p+3,x
lda p+2,x
adc m2+1
sta p+2,x
lda p+1,x
adc m2
sta p+1,x
lda p,x
adc aux
sta p,x
bcc mul5 ; (added per White Flame's observation)
inc p-1,x ; "
bne mul5 ; "
inc p-2,x ; "
bne mul5 ; "
inc p-3,x ; "
mul5:
dex
bpl mul4
asl m2+3
rol m2+2 ; shift m2 for outer loop
rol m2+1
rol m2
rol aux
dey
bne mul3
rts
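A quick Python walk-through of the routine's bookkeeping (byte arrays standing in for zero-page memory; this models the algorithm and the aux byte, not the cycle-exact behavior):

```python
# Model of the routine above: big-endian byte arrays for m1, m2, and
# the 64-bit product p; aux catches the MSBs shifted out of m2.
def mul64(m1, m2):
    m1 = list(m1.to_bytes(4, "big"))
    m2 = list(m2.to_bytes(4, "big"))
    p = [0] * 8                        # 64-bit big-endian product
    aux = 0
    for _ in range(8):                 # outer loop: ldy #8
        for x in range(3, -1, -1):     # inner loop: ldx #3 ... dex
            bit = m1[x] & 1            # lsr m1,x
            m1[x] >>= 1
            if bit:
                # add the 40-bit value aux:m2 with its LSB at p[x+4],
                # letting the carry propagate up through p[0]
                width = x + 5
                val = int.from_bytes(bytes(p[:width]), "big")
                val += (aux << 32) | int.from_bytes(bytes(m2), "big")
                val &= (1 << (8 * width)) - 1
                p[:width] = list(val.to_bytes(width, "big"))
        # asl m2+3 / rol ... / rol aux: shift m2 left, MSB into aux
        m2v = ((aux << 32) | int.from_bytes(bytes(m2), "big")) << 1
        aux = (m2v >> 32) & 0xff
        m2 = list((m2v & 0xffffffff).to_bytes(4, "big"))
    return int.from_bytes(bytes(p), "big")
```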
Last edited by barrym95838 on Wed Feb 10, 2016 3:25 am, edited 2 times in total.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
In adding M2 to P, you need to check for carry propagation all the way up to the full size of P, which is going to have to be somewhat aware of where you started from. The traditional form gets away without that because of the increasing order in which it adds to the product, ensuring everything above it is zero. However, it should be early-exit in the vast majority of cases.
Unrolling the inner 0-3 loop would save around a dozen cycles per iteration in avoiding indexing costs. Times 32 passes, that's quite a bit as well.
Re: CBM (C64) floating point improvement
You could precompute some partial products, say M2 x 3 and M2 x 4, and do the addition predicated on two bits per shift.
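One reading of this (an assumption on my part; bogax hasn't posted code): radix-4 shift-add, where precomputed small multiples of M2 let two bits of M1 be consumed per step, halving the number of shift rounds at the cost of table setup. A Python sketch:

```python
# Sketch of radix-4 shift-add: two bits of m1 select one of four
# precomputed multiples of m2 (0, 1x, 2x, 3x) per step.
def mul_radix4(m1, m2):
    multiples = [0, m2, 2 * m2, 3 * m2]    # precomputed partial products
    acc = 0
    shift = 0
    while m1:
        acc += multiples[m1 & 3] << shift  # two bits decide the addend
        m1 >>= 2
        shift += 2
    return acc
```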
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
Good catch on the carry propagation, White Flame. It also occurs to me that we'll only be keeping 32 bits of the product, meaning that we might be able to get away with rounding the inputs to just above 16 significant bits, to avoid throwing away about half of the work we did in calculating an exact answer.
@bogax: I'm suffering from a bit of high density this morning. Could you post some code explaining what you mean? Based on some of your previous work (brilliant, BTW), I'll assume that your post above will substitute for the inevitable lack of comments in your source.
Mike B.
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
barrym95838 wrote:
Good catch on the carry propagation, White Flame. It also occurs to me that we'll only be keeping 32-bits of the product, meaning that we might be able to get away with rounding the inputs to just above 16 significant bits.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
In order to make use of the top 16 bits, you'd have to normalize at the bit-shifting level to align the top 1 bit, which is what I tend to try to avoid.
If anything, taking the top 3 bytes would leave more leeway for precision, while nipping off a quarter of the work. Not sure how often it would b0rk the results, though.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
Yeah, the top three bytes of each should be enough, by my guess. If I'm not mistaken, every MS float (except 0.0) is already normalized, so my idea would just ignore the least-significant byte. If the routine shifted the other direction, it could be "tuned" to stop when a 32-bit (plus rounding byte) answer was guaranteed. My guess would be between 20 and 24 shifts. Since the interior addition is quite expensive, it might help average performance to swap the initial values containing differing numbers of 1-bits beforehand to minimize the additions.
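The operand-swap idea at the end can be sketched as a simple popcount comparison (Python, illustrative; `order_operands` is a hypothetical helper, and the real routine would count bits once before the main loop):

```python
# Sketch of the operand-swap idea: each set bit in the bit-scanned
# operand costs one expensive multi-byte addition, so drive the loop
# from whichever mantissa has fewer 1-bits.
def order_operands(m1, m2):
    if bin(m1).count("1") > bin(m2).count("1"):
        m1, m2 = m2, m1               # fewer set bits -> fewer additions
    return m1, m2
```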
Mike B.