PostPosted: Sat Jan 30, 2016 9:03 pm 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10938
Location: England
It might be advantageous to do a lot of the arithmetic in 48-bit form and convert to and from the 40-bit format.


PostPosted: Sat Jan 30, 2016 11:21 pm 

Joined: Tue Jul 24, 2012 2:27 am
Posts: 674
The byte-parallel multiply I outlined would still work with standard 2^n floating point numbers; just larger code for more speed.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Mon Feb 01, 2016 2:54 am 

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
Hey White Flame, your input is greatly appreciated (and your codebase work I have often cribbed, so thanks for that).

Basically I'm having trouble conceptualising your suggestion. My understanding of the built-in routine is that it calculates 8 bytes of mantissa (an n-byte multiplication gives a 2n-byte result) but rotates the answer right through only 4 bytes, shifting and adding, effectively ousting the least significant 4 bytes. I know I'm doing something wrong in extending this idea to individual bytes, where I get two-byte results but drop the LSB, and I can't see how these truncated values can get me to a product. I guess the question is: I can multiply by a byte, but what do I do with the numbers I get?
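
For concreteness, here is the long-multiplication picture I'm trying to map onto bytes (a3..a0 x b3..b0, with a0/b0 the least significant):

Code:
               a3 a2 a1 a0
             x b3 b2 b1 b0
             -------------
result byte:  7 6 5 4 3 2 1 0

each ai*bj is a 16-bit value: its low byte belongs at result byte
i+j and its high byte at byte i+j+1; summing all sixteen of them
(with carries) gives the exact 8-byte product, and only that final
sum gets truncated down to the top bytes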


PostPosted: Mon Feb 01, 2016 7:21 pm 

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
Actually, I stumbled upon something like it, Vedic multiplication: http://www.cse.usf.edu/~hthapliy/Papers/v2-57.pdf

Still, byte-by-byte multiplication seems to involve 16 shifts, so I'm confused about where your 8-shift number comes from.


PostPosted: Mon Feb 01, 2016 9:12 pm 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
With m*256^e format, normalization should be faster, but I'm also interested in seeing how multiplying the mantissas would be any faster. I have been messing with Woz' FMUL, trying to adapt it to 40-bit MS format ... no joy yet.
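
What I mean about normalization: with a base-256 exponent the mantissa only ever moves by whole bytes, one exponent count per byte. Something like this raw, untested sketch (m is a big-endian 4-byte mantissa, e the exponent, both just placeholder zero-page locations):

Code:
norm256:
    lda  m          ; top mantissa byte nonzero means normalized
    bne  done       ; (a zero mantissa must be caught beforehand)
    lda  m+1        ; slide the whole mantissa up one byte
    sta  m
    lda  m+2
    sta  m+1
    lda  m+3
    sta  m+2
    lda  #0
    sta  m+3
    dec  e          ; a whole byte of shift costs one exponent step
    jmp  norm256
done:
    rts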

Mike B.


PostPosted: Tue Feb 02, 2016 6:51 am 

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
Just tried to implement the above-posted algorithm, but haven't been able to get a correct result yet. With all the steps implemented, it's taking about the same amount of time as the built-in version. I'm using the smaller table-based fast multiply from codebase; maybe the 2K-table version would be better (I'd be using that routine anyway if I wanted to add a 3D view to the simulator).

BTW, the regular multiply takes about 2300 cycles including loading the accumulators.


PostPosted: Thu Feb 04, 2016 7:43 pm 

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
Here is what I have so far; it seems to take a little over 1500 cycles. It uses the fast multiply from codebase and is about an 800-cycle improvement over the built-in one, though perhaps it could be optimized further for multiplying zero bytes. Table generation and test code not included:

Code:
;Fast Floating Point Multiplication

;zp
RESULT   = $26 ;RESULT mantissa, $26-$2c (7 bytes)
FAC   = $62 ;FAC mantissa, $62-$65
ARG   = $6a ;ARG mantissa, $6a-$6d
res8   = $fb ;lo byte of the 8x8 product

!macro addresult .num {
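   ;add the 16-bit product returned in A (high byte) and res8 (low
   ;byte) into RESULT at bytes .num and .num-1, then ripple any
   ;carry on up toward RESULT's most significant byte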
   pha
   lda res8
   clc
   adc RESULT+.num
   sta RESULT+.num
   pla
   adc (RESULT-1)+.num
   sta (RESULT-1)+.num
   !set .num = .num - 1
   !if .num != 0 {
      bcc +
      inc (RESULT-1)+.num
   }
   !do while .num > 1 {
      !set .num = .num - 1
      bne +
      inc (RESULT-1)+.num
   }   
+
}
   
mul_FP   bne +
   rts   ;0 in FAC returns 0
+   jsr $bab7   ;add exponents
   lda #0      ;clear RESULT
   sta $26
   sta $27
   sta $28
   sta $29
   sta $2a
   sta $2b
   sta $2c
   
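   ;form all 16 byte products of FAC and ARG, summing each into RESULT
   ;at its byte offset; the low byte of this first, least significant
   ;product is simply dropped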
   lda FAC+3
   ldx ARG+3
   jsr mul8x8   
   sta RESULT+6
   ldx ARG+2
   jsr mulf
   +addresult 6
   ldx ARG+1
   jsr mulf
   +addresult 5
   ldx ARG
   jsr mulf
   +addresult 4
   
   lda FAC+2
   ldx ARG+3
   jsr mul8x8
   +addresult 6
   ldx ARG+2
   jsr mulf
   +addresult 5
   ldx ARG+1
   jsr mulf
   +addresult 4
   ldx ARG
   jsr mulf
   +addresult 3
   
   lda FAC+1
   ldx ARG+3
   jsr mul8x8
   +addresult 5
   ldx ARG+2
   jsr mulf
   +addresult 4
   ldx ARG+1
   jsr mulf
   +addresult 3
   ldx ARG
   jsr mulf
   +addresult 2
   
   lda FAC
   ldx ARG+3
   jsr mul8x8
   +addresult 4
   ldx ARG+2
   jsr mulf
   +addresult 3
   ldx ARG+1
   jsr mulf
   +addresult 2
   ldx ARG
   jsr mulf
   +addresult 1   ;add to result+1,+0
   
   jmp $bb8f   ;normal multiply exit. copies RESULT to FAC and normalises.

;---------------------------------
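;mul8x8: A*X via quarter-square tables: A*X = f(A+X) - f(X-A), with
;f(i) = i*i/4 rounded down (tables as in the codebase fast multiply;
;they must be page-aligned). mul8x8 self-modifies the table addresses
;for multiplier A; mulf reuses them for a new X. High byte is returned
;in A, low byte in res8.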
         
mul8x8   sta sm1+1                                             
   sta sm3+1                                             
   eor #$ff                                             
   sta sm2+1                                             
   sta sm4+1
mulf   sec   
sm1   lda multablo,x
sm2   sbc multab2lo,x
   sta res8   
sm3   lda multabhi,x
sm4   sbc multab2hi,x
   rts


Here's what it's doing, with an example of $ffffffff^2; the c64 prg below calculates 3.15151515^2.
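
For reference, the gist of that worked example:

Code:
every byte pair gives $ff * $ff = $fe01, and summing all sixteen
16-bit partial products at their offsets yields the exact square

   $ffffffff^2 = $fffffffe00000001

so the mantissa bytes kept are $ff $ff $ff $fe, with the low bytes
($00000001) falling off the end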


Attachments:
32bitMul.PNG [13.46 KiB]
fastm.zip [692 Bytes]
PostPosted: Fri Feb 05, 2016 11:10 pm 

Joined: Tue Jul 24, 2012 2:27 am
Posts: 674
Here's the breakdown of my "parallel" idea, which is just a tweak on shift-add multiplication. Note that it's orthogonal to how the exponent works.

In calculating M1 * M2, we're going to be reading bits out of M1, and adding shifted versions of M2 to the accumulator.

M1 is made out of the 4 bytes [a,b,c,d], where d is the LSB.
- Right-shift d, and only d, to get the 1's bit out of M1. If set, add M2 to the accumulator.
- Right-shift c, and only c, to get the 256's bit out of M1. If set, add M2 to the accumulator offset by 1 memory location, which is effectively M2*256.
- Right-shift b, if bit is set then add M2 to the accumulator at 2 bytes offset.
- Right-shift a, if bit is set then add M2 to the accumulator at 3 bytes offset.

Shift M2 to the left by 1 bit, multiplying it by 2 for the next round of incoming bits.

At this point, we've done 1 full shift of the 32-bit M1 and M2, and have performed 4 out of 32 potential 32-bit additions. Now, repeat to draw the 2nd bit out of each byte of M1. This will add the M2*2, M2*512, etc components to the accumulator.

After all is said and done, each full 32-bit value will have only been shifted 8 times instead of 32. Or in more specific terms, this requires 64 read-modify-write shifting instructions as opposed to 256, which saves hundreds of cycles if the rest of the code can manage equivalent speed. For 5-cycle zeropage non-indexed shift instructions, that's 192*5 = 960 cycles saved, at the expense of a likely ~4x larger inner-loop code size.

Note that this number isn't necessarily guaranteed, as shifting M2 would likely involve 5 shifts instead of 4. Standard 32-bit shift/add routines tend to butt the various numbers together, sharing shift instructions across field boundaries. But this shouldn't be a huge factor, and the same optimization might be brought into this "parallel" version.


I have a hard time imagining this would be novel. I came up with it probably 15 years ago, but haven't seen anything similar in software optimizations. It seems like a natural optimization to me, though it only really emerges from the expense of shifting 32-bit numbers on this particular platform. It wouldn't offer anything to hardware multipliers or 32-bit register platforms.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Sat Feb 06, 2016 5:57 am 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
Something like this? (completely raw, non-optimized, and untested)
Code:
mul:
    ldx  #7
    lda  #0
    sta  aux        ; catches MSBs from m2
mul2:
    sta  p,x        ; zero the 64-bit big-endian product
    dex
    bpl  mul2
    ldy  #8
mul3:
    ldx  #3
mul4:
    lsr  m1,x
    bcc  mul5
    clc
    lda  p+4,x
    adc  m2+3
    sta  p+4,x      ; add m2 to p + offset
    lda  p+3,x
    adc  m2+2
    sta  p+3,x
    lda  p+2,x
    adc  m2+1
    sta  p+2,x
    lda  p+1,x
    adc  m2
    sta  p+1,x
    lda  p,x
    adc  aux
    sta  p,x
    bcc  mul5       ; (added per White Flame's observation)
    inc  p-1,x      ;               "
    bne  mul5       ;               "
    inc  p-2,x      ;               "
    bne  mul5       ;               "
    inc  p-3,x      ;               "
mul5:
    dex
    bpl  mul4
    asl  m2+3
    rol  m2+2       ; shift m2 for outer loop
    rol  m2+1
    rol  m2
    rol  aux
    dey
    bne  mul3
    rts

If my example actually works (dunno yet), it looks like there are 32 8-bit shifts plus eight 40-bit shifts, for a total of 72 native RMW instructions executed to acquire a 64-bit product (which would then need to be normalized and rounded). It could be a win over a more traditional version doing 32 full 64-bit shifts, which would execute 256 RMW instructions, like you said.

Mike B.


Last edited by barrym95838 on Wed Feb 10, 2016 3:25 am, edited 2 times in total.

PostPosted: Sat Feb 06, 2016 1:37 pm 

Joined: Tue Jul 24, 2012 2:27 am
Posts: 674
In adding M2 to P, you need to check for carry propagation all the way up to the full size of P, which is going to have to be somewhat aware of where you started from. The traditional form gets away without that because of the increasing order in which it adds to the product, ensuring everything above it is zero. However, it should early-exit in the vast majority of cases.

Unrolling the inner 0-3 loop would save around a dozen cycles per iteration by avoiding indexing costs. Times 32 passes, that's quite a bit as well.
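
For instance, the x=3 slice of the mul4 loop above, with the indexing folded into fixed addresses (sketch only):

Code:
    lsr  m1+3       ; pull a bit out of d
    bcc  skip3
    clc
    lda  p+7        ; plain zeropage: 3 cycles per access
    adc  m2+3       ; instead of 4 for the ,x forms
    sta  p+7
    lda  p+6
    adc  m2+2
    sta  p+6
    lda  p+5
    adc  m2+1
    sta  p+5
    lda  p+4
    adc  m2
    sta  p+4
    lda  p+3
    adc  aux
    sta  p+3
    ; carry ripple as before, but into fixed p+2, p+1, p+0
skip3:

The dex/bpl bookkeeping disappears along with the indexed addressing.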

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Sat Feb 06, 2016 3:31 pm 

Joined: Tue Nov 18, 2003 8:41 pm
Posts: 250
You could precompute some partial products, say M2 x 3 and M2 x 4, and do the addition predicated on two bits per shift.


PostPosted: Sat Feb 06, 2016 4:18 pm 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
Good catch on the carry propagation, White Flame. It also occurs to me that we'll only be keeping 32 bits of the product, meaning that we might be able to get away with rounding the inputs to just above 16 significant bits, to avoid throwing away about half of the work we did in calculating an exact answer.

@bogax: I'm suffering from a bit of high density this morning. Could you post some code explaining what you mean? Based on some of your previous work (brilliant, BTW), I'll assume that your post above will substitute for the inevitable lack of comments in your source. :wink: :mrgreen:
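
In the meantime, here's my blind guess at one two-bit digit of M1, raw and untested (add_m2, add_m2x2 and add_m2x3 are imaginary helpers that add the corresponding precomputed multiple into the product at the current offset):

Code:
    lda  m1+3
    and  #3            ; grab the next two bits of this byte
    lsr  m1+3
    lsr  m1+3          ; ... and consume them
    beq  digit_done    ; %00: nothing to add
    cmp  #2
    bcc  add1          ; %01: add M2
    beq  add2          ; %10: add M2*2
    jsr  add_m2x3      ; %11: add M2*3
    jmp  digit_done
add1:
    jsr  add_m2
    jmp  digit_done
add2:
    jsr  add_m2x2
digit_done:

Four outer passes instead of eight, at the cost of the precompute and of shifting three multiples (with headroom for M2*3) each pass.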

Mike B.


PostPosted: Sun Feb 07, 2016 3:40 pm 

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
barrym95838 wrote:
Good catch on the carry propagation, White Flame. It also occurs to me that we'll only be keeping 32-bits of the product, meaning that we might be able to get away with rounding the inputs to just above 16 significant bits.


I was thinking of this, but you do lose quite a bit of precision that way. Look at FF FF FF FF^2 above and see that FFFF is already in 2 of the 4 saved values after multiplying the 2 LSBs. I know that's an extreme example, though. Maybe some sort of estimation could take place instead. When I get some free time in the coming days I'll put your code through the wringer.


PostPosted: Sun Feb 07, 2016 9:30 pm 

Joined: Tue Jul 24, 2012 2:27 am
Posts: 674
In order to make use of the top 16 bits, you'd have to normalize at the bit-shifting level to align the top 1 bit, which is what I tend to try to avoid.

If anything, taking the top 3 bytes would leave more leeway for precision, while nipping off a quarter of the work. Not sure how often it would b0rk the results, though.

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


PostPosted: Sun Feb 07, 2016 10:14 pm 

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1948
Location: Sacramento, CA, USA
Yeah, the top three bytes of each should be enough, by my guess. If I'm not mistaken, every MS float (except 0.0) is already normalized, so my idea would just ignore the least-significant byte. If the routine shifted the other direction, it could be "tuned" to stop when a 32-bit (plus rounding byte) answer was guaranteed. My guess would be between 20 and 24 shifts. Since the interior addition is quite expensive, it might also help average performance to compare the two operands' 1-bit counts beforehand and swap them so that the one with fewer 1-bits drives the additions.
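
The counting itself could be something quick like this (raw sketch; m and count are placeholder zero-page locations):

Code:
popcnt:
    lda  #0
    sta  count
    ldx  #3
pbyte:
    lda  m,x        ; each mantissa byte in turn
    ldy  #8
pbit:
    lsr             ; next bit into carry
    bcc  pskip
    inc  count
pskip:
    dey
    bne  pbit
    dex
    bpl  pbyte
    rts             ; count = number of 1-bits in the mantissa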

Mike B.

