CBM (C64) floating point improvement
Re: CBM (C64) floating point improvement
It might be advantageous to do a lot of 48-bit arithmetic and convert to and from the 40-bit format.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
The byte-parallel multiply I outlined would still work with standard 2^n floating point numbers; just larger code for more speed.
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Hey White Flame, your input is greatly appreciated (and your Codebase work I have often cribbed, so thanks for that).
Basically I am having trouble conceptualising your suggestion. My understanding of the built-in routine is that it computes an 8-byte mantissa product (an n-byte multiplication yields a 2n-byte result), but rotates the answer right through only 4 bytes, shifting and adding, effectively discarding the 4 least significant bytes. I'm clearly going wrong when I extend this idea to individual bytes: each byte pair gives a two-byte result, and if I drop the low byte of each I can't see how these truncated values can get me to a correct product. I guess the question is: I can multiply byte by byte, but what do I do with the numbers I get?
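The long-multiplication view might help here: each 8x8 multiply gives a 16-bit partial product, and those get summed at byte offsets. The low bytes can only be discarded after the sums, because they carry into their neighbours. A Python sketch of the idea (illustrative only, not the ROM routine):

```python
# Illustrative sketch (not the ROM routine): a 32-bit multiply built
# from 8x8->16 partial products, summed at byte offsets.
def mul32_bytewise(m1, m2):
    a = m1.to_bytes(4, "big")
    b = m2.to_bytes(4, "big")
    acc = 0
    for i, x in enumerate(a):                 # i counts from the MSB
        for j, y in enumerate(b):
            shift = 8 * ((3 - i) + (3 - j))   # combined byte offset
            acc += (x * y) << shift           # keep the FULL 16-bit product
    return acc                                # 64-bit result; truncate last
```

The point is that truncation has to happen once, at the very end, after all 16 partial products have been accumulated.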
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Actually I stumbled upon something like it, Vedic Multiplication: http://www.cse.usf.edu/~hthapliy/Papers/v2-57.pdf
Still, byte-by-byte multiplication seems to involve 16 shifts, so I'm confused where your 8-shift number comes from.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
With m*256^e format, normalization should be faster, but I'm also interested in seeing how multiplying the mantissas would be any faster. I have been messing with Woz' FMUL, trying to adapt it to 40-bit MS format ... no joy yet.
Mike B.
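For what it's worth, the byte-granular normalization that an m*256^e format allows can be sketched like so (a Python model under assumed conventions: 32-bit mantissa, base-256 exponent, value = m * 256**e):

```python
# Sketch (assumed m*256^e format, not MS floating point): with a
# base-256 exponent, normalizing moves whole bytes, so it takes at
# most 3 byte moves instead of up to 31 single-bit shifts.
def normalize256(m, e):
    if m == 0:
        return 0, 0
    while m < 0x01000000:       # top byte empty: slide left one byte
        m <<= 8
        e -= 1
    while m > 0xffffffff:       # overflowed 32 bits: slide right
        m >>= 8
        e += 1
    return m, e
```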
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Just tried to implement the algorithm posted above, but haven't been able to get a correct result yet. With all the steps implemented, it's taking about the same amount of time as the built-in version. I'm using the smaller table-based fast multiply from Codebase; maybe the 2K-table version would be better (I'd be using that routine anyway if I wanted to add a 3D view to the simulator).
BTW, the regular multiply takes about 2300 cycles including loading the accumulators.
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
Here is what I have so far; it seems to take a little over 1500 cycles. It uses the fast multiply from Codebase and is about an 800-cycle improvement over the built-in one, though perhaps it could be optimized further for multiplying zero bytes. Table generation and test code not included:
Here's what it's doing, with an example of $ffffffff^2; the C64 prg below calculates 3.15151515^2.
Code:
;Fast Floating Point Multiplication
;zp
RESULT = $26 ;-$2c
FAC = $62 ;mantissa - $65
ARG = $6a ;mantissa - $6d
res8 = $fb ;lobyte of 8x8 product
!macro addresult .num {
pha
lda res8
clc
adc RESULT+.num
sta RESULT+.num
pla
adc (RESULT-1)+.num
sta (RESULT-1)+.num
!set .num = .num - 1
!if .num != 0 {
bcc +
inc (RESULT-1)+.num
}
!do while .num > 1 {
!set .num = .num - 1
bne +
inc (RESULT-1)+.num
}
+
}
mul_FP bne +
rts ;0 in FAC returns 0
+ jsr $bab7 ;add exponents
lda #0 ;clear RESULT
sta $26
sta $27
sta $28
sta $29
sta $2a
sta $2b
sta $2c
lda FAC+3
ldx ARG+3
jsr mul8x8
sta RESULT+6
ldx ARG+2
jsr mulf
+addresult 6
ldx ARG+1
jsr mulf
+addresult 5
ldx ARG
jsr mulf
+addresult 4
lda FAC+2
ldx ARG+3
jsr mul8x8
+addresult 6
ldx ARG+2
jsr mulf
+addresult 5
ldx ARG+1
jsr mulf
+addresult 4
ldx ARG
jsr mulf
+addresult 3
lda FAC+1
ldx ARG+3
jsr mul8x8
+addresult 5
ldx ARG+2
jsr mulf
+addresult 4
ldx ARG+1
jsr mulf
+addresult 3
ldx ARG
jsr mulf
+addresult 2
lda FAC
ldx ARG+3
jsr mul8x8
+addresult 4
ldx ARG+2
jsr mulf
+addresult 3
ldx ARG+1
jsr mulf
+addresult 2
ldx ARG
jsr mulf
+addresult 1 ;add to result+1,+0
jmp $bb8f ;normal multiply exit. copies RESULT to FAC and normalises.
;---------------------------------
mul8x8 sta sm1+1
sta sm3+1
eor #$ff
sta sm2+1
sta sm4+1
mulf sec
sm1 lda multablo,x
sm2 sbc multab2lo,x
sta res8
sm3 lda multabhi,x
sm4 sbc multab2hi,x
rts
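For reference, the Codebase mul8x8 looks like the classic table-of-squares trick. Here's a Python model of the identity it appears to rely on (the table layout is an assumption inferred from the code; `multablo/hi` would hold f(i) and `multab2lo/hi` f(i-255)):

```python
# Model of the table-of-squares identity: a*b = f(a+b) - f(a-b),
# where f(x) = floor(x*x / 4).  The floors cancel exactly because
# (a+b) and (a-b) always have the same parity.
ftab = [i * i // 4 for i in range(512)]                   # f(i), a+b <= 510
gtab = [(i - 255) * (i - 255) // 4 for i in range(512)]   # f(i - 255)

def mul8x8(a, b):
    # The self-modifying code bakes 'a' into the table base address;
    # eor #$ff turns the second lookup into index (255-a)+b = f(b-a).
    return ftab[a + b] - gtab[(a ^ 0xff) + b]
```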
Attachment: fastm.zip
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
Here's the breakdown of my "parallel" idea, which is just a tweak on shift-add multiplication. Note that it's orthogonal to how the exponent works.
In calculating M1 * M2, we're going to be reading bits out of M1, and adding shifted versions of M2 to the accumulator.
M1 is made out of the 4 bytes [a,b,c,d], where d is the LSB.
- Right-shift d, and only d, to get the 1's bit out of M1. If set, add M2 to the accumulator.
- Right-shift c, and only c, to get the 256's bit out of M1. If set, add M2 to the accumulator offset by 1 memory location, which is effectively M2*256.
- Right-shift b, if bit is set then add M2 to the accumulator at 2 bytes offset.
- Right-shift a, if bit is set then add M2 to the accumulator at 3 bytes offset.
Shift M2 to the left by 1 bit, multiplying it by 2 for the next round of incoming bits.
At this point, we've done 1 full shift of the 32-bit M1 and M2, and have performed 4 out of 32 potential 32-bit additions. Now, repeat to draw the 2nd bit out of each byte of M1. This will add the M2*2, M2*512, etc components to the accumulator.
After all is said and done, each full 32-bit value will have only been shifted 8 times instead of 32. Or in more specific terms, this requires 64 read-modify-write shifting instructions as opposed to 256, which comes out to hundreds of cycles, if the rest of the code can manage to be of equivalent speed. For 5-cycle zeropage non-indexed shift instructions, that's 192*5 = 960 cycles saved, at the expense of likely a 4x larger code size for the inner loop.
Note that this number isn't necessarily guaranteed, as shifting M2 would likely involve 5 shifts instead of 4. Standard 32-bit shift/add routines tend to butt the various numbers together, sharing shift instructions across field boundaries. But this shouldn't be a huge factor, and the same optimization might be brought into this "parallel" version.
I have a hard time imagining this would be novel. I came up with this probably 15 years ago, but haven't seen anything similar in software optimizations. It seems like a natural optimization to me, though only really emerges from the expense of shifting 32-bit numbers on this particular platform. It wouldn't offer anything to hardware multipliers or 32-bit register platforms.
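The scheme above can be modeled in a few lines of Python (assuming 32-bit unsigned mantissas and the big-endian [a,b,c,d] layout from the description; this checks the bookkeeping, not the cycle counts):

```python
# Model of the byte-parallel shift-add: each of M1's four bytes is
# shifted independently, yielding 4 bit-tests per round and only
# 8 rounds, instead of 32 full-width shifts of M1.
def parallel_mul(m1, m2):
    m1b = list(m1.to_bytes(4, "big"))   # [a, b, c, d], d = LSB
    acc = 0
    for _ in range(8):                  # 8 rounds, one bit per byte
        for pos in range(4):            # pos 0 = d's bit ... pos 3 = a's bit
            byte_i = 3 - pos
            if m1b[byte_i] & 1:
                acc += m2 << (8 * pos)  # add M2 at a byte offset
            m1b[byte_i] >>= 1           # each byte shifts on its own
        m2 <<= 1                        # one shift of M2 per round
    return acc
```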
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
Something like this? (completely raw, non-optimized, and untested)
If my example actually works (dunno yet), it looks like there are 32 8-bit shifts plus eight 40-bit shifts, for a total of 72 native RMW instructions executed to acquire a 64-bit product (which would then need to be normalized and rounded). It could be a win over a more traditional 32 64-bit shift version, which would execute 256 RMW instructions, like you said.
Mike B.
Code:
mul:
ldx #7
lda #0
sta aux ; catches MSBs from m2
mul2:
sta p,x ; zero the 64-bit big-endian product
dex
bpl mul2
ldy #8
mul3:
ldx #3
mul4:
lsr m1,x
bcc mul5
clc
lda p+4,x
adc m2+3
sta p+4,x ; add m2 to p + offset
lda p+3,x
adc m2+2
sta p+3,x
lda p+2,x
adc m2+1
sta p+2,x
lda p+1,x
adc m2
sta p+1,x
lda p,x
adc aux
sta p,x
bcc mul5 ; (added per White Flame's observation)
inc p-1,x ; "
bne mul5 ; "
inc p-2,x ; "
bne mul5 ; "
inc p-3,x ; "
mul5:
dex
bpl mul4
asl m2+3
rol m2+2 ; shift m2 for outer loop
rol m2+1
rol m2
rol aux
dey
bne mul3
rts
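A quick Python walk-through of the routine's bookkeeping (byte arrays standing in for zero-page memory; this models the algorithm and the aux byte, not the cycle-exact behavior):

```python
# Model of the routine above: big-endian byte arrays for m1, m2, and
# the 64-bit product p; aux catches the MSBs shifted out of m2.
def mul64(m1, m2):
    m1 = list(m1.to_bytes(4, "big"))
    m2 = list(m2.to_bytes(4, "big"))
    p = [0] * 8                        # 64-bit big-endian product
    aux = 0
    for _ in range(8):                 # outer loop: ldy #8
        for x in range(3, -1, -1):     # inner loop: ldx #3 ... dex
            bit = m1[x] & 1            # lsr m1,x
            m1[x] >>= 1
            if bit:
                # add the 40-bit value aux:m2 with its LSB at p[x+4],
                # letting the carry propagate up through p[0]
                width = x + 5
                val = int.from_bytes(bytes(p[:width]), "big")
                val += (aux << 32) | int.from_bytes(bytes(m2), "big")
                val &= (1 << (8 * width)) - 1
                p[:width] = list(val.to_bytes(width, "big"))
        # asl m2+3 / rol ... / rol aux: shift m2 left, MSB into aux
        m2v = ((aux << 32) | int.from_bytes(bytes(m2), "big")) << 1
        aux = (m2v >> 32) & 0xff
        m2 = list((m2v & 0xffffffff).to_bytes(4, "big"))
    return int.from_bytes(bytes(p), "big")
```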
Last edited by barrym95838 on Wed Feb 10, 2016 3:25 am, edited 2 times in total.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
In adding M2 to P, you need to check for carry propagation all the way up to the full size of P, which is going to have to be somewhat aware of where you started from. The traditional form gets away without that because of the increasing order in which it adds to the product, ensuring everything above it is zero. However, it should be early-exit in the vast majority of cases.
Unrolling the inner 0-3 loop would save around a dozen cycles per iteration in avoiding indexing costs. Times 32 passes, that's quite a bit as well.
Re: CBM (C64) floating point improvement
You could precompute some partial products, say M2 x 3 and M2 x 4, and do the addition predicated on two bits per shift.
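One reading of this (an assumption on my part; bogax hasn't posted code): radix-4 shift-add, where precomputed small multiples of M2 let two bits of M1 be consumed per step, halving the number of shift rounds at the cost of table setup. A Python sketch:

```python
# Sketch of radix-4 shift-add: two bits of m1 select one of four
# precomputed multiples of m2 (0, 1x, 2x, 3x) per step.
def mul_radix4(m1, m2):
    multiples = [0, m2, 2 * m2, 3 * m2]    # precomputed partial products
    acc = 0
    shift = 0
    while m1:
        acc += multiples[m1 & 3] << shift  # two bits decide the addend
        m1 >>= 2
        shift += 2
    return acc
```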
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
Good catch on the carry propagation, White Flame. It also occurs to me that we'll only be keeping 32 bits of the product, meaning that we might be able to get away with rounding the inputs to just above 16 significant bits, to avoid throwing away about half of the work we did in calculating an exact answer.
@bogax: I'm suffering from a bit of high density this morning. Could you post some code explaining what you mean? Based on some of your previous work (brilliant, BTW), I'll assume that your post above will substitute for the inevitable lack of comments in your source.
Mike B.
-
malcontent
- Posts: 12
- Joined: 28 Jan 2016
Re: CBM (C64) floating point improvement
barrym95838 wrote:
Good catch on the carry propagation, White Flame. It also occurs to me that we'll only be keeping 32-bits of the product, meaning that we might be able to get away with rounding the inputs to just above 16 significant bits.
-
White Flame
- Posts: 704
- Joined: 24 Jul 2012
Re: CBM (C64) floating point improvement
In order to make use of the top 16 bits, you'd have to normalize at the bit-shifting level to align the top 1 bit, which is what I tend to try to avoid.
If anything, taking the top 3 bytes would leave more leeway for precision, while nipping off a quarter of the work. Not sure how often it would b0rk the results, though.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: CBM (C64) floating point improvement
Yeah, the top three bytes of each should be enough, by my guess. If I'm not mistaken, every MS float (except 0.0) is already normalized, so my idea would just ignore the least-significant byte. If the routine shifted the other direction, it could be "tuned" to stop when a 32-bit (plus rounding byte) answer was guaranteed. My guess would be between 20 and 24 shifts. Since the interior addition is quite expensive, it might help average performance to swap the initial values containing differing numbers of 1-bits beforehand to minimize the additions.
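The operand-swap idea at the end can be sketched as a simple popcount comparison (Python, illustrative; `order_operands` is a hypothetical helper, and the real routine would count bits once before the main loop):

```python
# Sketch of the operand-swap idea: each set bit in the bit-scanned
# operand costs one expensive multi-byte addition, so drive the loop
# from whichever mantissa has fewer 1-bits.
def order_operands(m1, m2):
    if bin(m1).count("1") > bin(m2).count("1"):
        m1, m2 = m2, m1               # fewer set bits -> fewer additions
    return m1, m2
```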
Mike B.