6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Thu Nov 21, 2024 7:20 pm

All times are UTC




Post new topic Reply to topic  [ 48 posts ]  Go to page 1, 2, 3, 4  Next
Author Message
PostPosted: Mon Jan 02, 2023 11:39 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!


Last edited by TobyLobster on Mon Oct 16, 2023 9:22 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 2:03 am 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8504
Location: Midwestern USA
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet.

I also explore the algorithms and techniques behind them:

https://github.com/TobyLobster/multiply_test

There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!

It might be better if you post them here as a message attachment. Aside from the possibility that your link could go 404 one day, some of us don't do Github.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 2:58 am 
Online
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
BigDumbDinosaur wrote:

It might be better if you post them here as a message attachment. Aside from the possibility that your link could go 404 one day, some of us don't do Github.

Attaching it would be excellent, if it's practical.  One of my problems with github is that I can never can find what I'm looking for on github when someone gives us a link to their repo; but in this case, your detailed summary on the front page was excellent.  I added the link at the end of the UM* topic in the Forth section of the forum at viewtopic.php?f=9&t=689 .  I also put it on my links page.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 9:08 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Well done TobyLobster - a tour de force!

(All the static parts of 6502.org are mastered on github... in that sense, we all use it. But way to derail an interesting offering...)


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 3:50 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
That is a lot of very interesting information. One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.

All the 'shift-and-add' routines I looked at all used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check if the high byte of the multiplier is zero. If so, nothing in it can affect the result, and so the loop counter could be reduced from 16 to 8. It's only a few bytes to check, and any multiplication by less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was greater than 256 but the multiplicand was less (more code, of course). To go further, you could write 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).

Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 4:50 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
teamtempest wrote:
That is a lot of very interesting information. One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.

Just to be clear, the 'average' figures I give are the arithmetic mean: the sum of all cycle counts divided by the number of counts. The error bars shown in the graphs show this average as the dot between the minimum and maximum cycle counts. Many of the routines have a low variance in the min and max values (bars of small height). Those that don't is usually because they have special case code for handling zero.

teamtempest wrote:
All the 'shift-and-add' routines I looked at all used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check if the high byte of the multiplier is zero. If so, nothing in it can affect the result, and so the loop counter could be reduced from 16 to 8. It's only a few bytes to check, and any multiplication by less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was greater than 256 but the multiplicand was less (more code, of course). To go further, you could write 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).

Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.

Good ideas. I wonder if the time to check for these cases is small enough to get win across all inputs? I'd be interested to see. The OR-ing trick is only done by omult1.a I think, but is it worth it? Certainly if you know in advance you are likely to be multiplying small numbers these will win.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 4:53 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
teamtempest wrote:
Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.

That's how I did it in VTL02. It's very quick if the first term is small or the second term is a large-ish power of two, but probably suffers a bit with random inputs. I won't use tables in VTL02, but I might peek at a few of the smaller routines Toby has to see if they might be more suitable for an unsigned 16x16->16:
Code:
;-----------------------------------------------------;
; var[x] = (var[x]*var[x+2])mod65536 (unsigned)
; uses:    plus:, {>}
; exit:    var[x+2] and {>} are modified
; 39 bytes
mul:
    lda  0,x        ;
    sta  gthan      ;
    lda  1,x        ; copy multiplicand to {>}
    sta  gthan+1    ;
    stz  0,x        ; zero the product to begin
    stz  1,x        ;
    .db  $cd        ; "cmp abs" naked op-code
mul1:
    rol  3,x        ;
mul2:
    lda  gthan      ;
    ora  gthan+1    ;
    beq  mulrts     ; exit early if multiplicand = 0
    lsr  gthan+1    ;
    ror  gthan      ; right-shift multiplicand
    bcc  mul3       ; check the bit shifted out
    jsr  plus       ; form the product in var[x]
mul3:
    asl  2,x        ; left-shift multiplier
    bne  mul1       ;
    rol  3,x        ;
    bne  mul2       ; loop until multiplier = 0
mulrts:
    rts             ;

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Last edited by barrym95838 on Tue Jan 03, 2023 10:04 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 7:37 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Quote:
One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.


My bad. I meant to say 'median'. And maybe not the cycle time but what the actual operands are at that point (ie., do they have more than eight bits?).


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 7:59 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
It's a tricky question, I think, because the distribution of input values will come into it. It may well be that in many applications small multipliers will be found often, but in many others they won't.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 8:36 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
barrym95838 wrote:
That's how I did it in VTL02

Interesting technique to index with '0,X' to allow any zero page variable. If speed is important it may actually be faster to copy those inputs into temporary zp storage to avoid losing a cycle on every access.


Top
 Profile  
Reply with quote  
PostPosted: Tue Jan 03, 2023 9:15 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Well, I'm already blowing 12 cycles per iteration by not in-lining the add. The VTL02 "IDE" is a balancing act between small size and high performance, with a (not so) slight bias in favor of small size.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Last edited by barrym95838 on Sun Feb 26, 2023 5:11 pm, edited 1 time in total.

Top
 Profile  
Reply with quote  
PostPosted: Wed Jan 04, 2023 12:12 am 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
The idea of testing for an 8x16 bit multiply inside a 16x16 bit multiply was implemented back in 1980, in 'Micro - The 6502 Journal' magazine https://archive.org/details/micro-6502- ... 1/mode/2up (they also have a separate 8x8 multiply).


Top
 Profile  
Reply with quote  
PostPosted: Wed Jan 04, 2023 8:00 pm 
Offline

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Intuitively it makes sense to shortcut, but it may not be so in practice. Suppose we add a header to an 16x16 multiply routine like this:

Code:
       ldx #8             ; assume 8 loops
       lda MPLIER+1    ; can we get away with that ?
       beq @1            ; b: yes
       ldx #16            ; no, full mutiply
@1   


That seems simple enough, but throughout the full 16-bit range of operands will add 30,081,548,288 cycles (2^24 * 8 branch taken + (2^32 -2^24) * 7 branch not taken) to the overall execution time. On the other hand, it will only save 5,100,273,664 or so cycles overall (2^24 * (8 * 38)), depending on how long a 16x16 do-nothing loop takes. So a net slowdown.

Half measures may not do. Rather than try to optimize one loop, it may be best to split into multiple loops, as the article suggests.


Top
 Profile  
Reply with quote  
PostPosted: Thu Jan 05, 2023 8:01 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
I added the Micro magazine listing that I linked to above into my tests including its branch to 8x16 multiply (mult48). It's not very competitive these days, coming in at 707 cycles average, but should be representative of a typical shift-and-add.

If I remove the supposed 'shortcut' to the 8x16 routine it comes in slightly faster at 703 cycles average (mult49). So the supposed shortcut is actually slightly counterproductive if all inputs are equally likely (as we suspected). However, if in the context of a hypothetical use case the 'acl/h' input number of this routine is limited to < 32768 (i.e. top bit clear) then the shortcut gives a minor improvement. The smaller the top of the range the better the improvement:

Code:
(acl,ach) less than | average cycles
--------------------+---------------
65536 (no limit)    | 707   
32768               | 696
16384               | 685
8192                | 672
4096                | 656
2048                | 632
1024                | 594
512                 | 527
256                 | 404    (always uses the 8x16 code)


This is for unsigned numbers. (Negative numbers will have the top bit set, so don't get a benefit if the unsigned multiply is used before adjusting for signedness).
So a shortcut can be useful if one of the inputs has a smaller range.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 06, 2023 4:23 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
I've added some more good routines to the page, mostly 16 bit.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 48 posts ]  Go to page 1, 2, 3, 4  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 2 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: