Comparing 6502 multiply routines
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Comparing 6502 multiply routines
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet. (EDIT: Now over 120 routines)
I also explore the algorithms and techniques behind them:
https://github.com/TobyLobster/multiply_test
There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!
I also explore the algorithms and techniques behind them:
https://github.com/TobyLobster/multiply_test
There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!
Last edited by TobyLobster on Mon Oct 16, 2023 9:22 pm, edited 1 time in total.
- BigDumbDinosaur
- Posts: 9425
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
Re: Comparing 6502 multiply routines
TobyLobster wrote:
I have exhaustively tested and compared the performance and memory use of over 60 integer multiplication routines from the internet.
I also explore the algorithms and techniques behind them:
https://github.com/TobyLobster/multiply_test
There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!
I also explore the algorithms and techniques behind them:
https://github.com/TobyLobster/multiply_test
There was more to say on this subject than I thought when I started, so it was interesting to explore. If there are other good or interesting multiply routines out there let me know!
It might be better if you post them here as a message attachment. Aside from the possibility that your link could go 404 one day, some of us don't do Github.
x86? We ain't got no x86. We don't NEED no stinking x86!
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: Comparing 6502 multiply routines
BigDumbDinosaur wrote:
It might be better if you post them here as a message attachment. Aside from the possibility that your link could go 404 one day, some of us don't do Github.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Comparing 6502 multiply routines
Well done TobyLobster - a tour de force!
(All the static parts of 6502.org are mastered on github... in that sense, we all use it. But way to derail an interesting offering...)
(All the static parts of 6502.org are mastered on github... in that sense, we all use it. But way to derail an interesting offering...)
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: Comparing 6502 multiply routines
That is a lot of very interesting information. One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.
All the 'shift-and-add' routines I looked at all used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check if the high byte of the multiplier is zero. If so, nothing in it can affect the result, and so the loop counter could be reduced from 16 to 8. It's only a few bytes to check, and any multiplication by less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was greater than 256 but the multiplicand was less (more code, of course). To go further, you could write 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).
Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.
All the 'shift-and-add' routines I looked at all used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check if the high byte of the multiplier is zero. If so, nothing in it can affect the result, and so the loop counter could be reduced from 16 to 8. It's only a few bytes to check, and any multiplication by less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was greater than 256 but the multiplicand was less (more code, of course). To go further, you could write 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).
Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Comparing 6502 multiply routines
teamtempest wrote:
That is a lot of very interesting information. One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.
teamtempest wrote:
All the 'shift-and-add' routines I looked at all used an index register to count loops. In that case, it seems to me a fairly simple optimization for the 16x16 multipliers would be to check if the high byte of the multiplier is zero. If so, nothing in it can affect the result, and so the loop counter could be reduced from 16 to 8. It's only a few bytes to check, and any multiplication by less than 256 would benefit. To go further, you could also check the multiplicand and swap if the multiplier was greater than 256 but the multiplicand was less (more code, of course). To go further, you could write 8x8, 8x16 and full 16x16 routines and branch to one of them based on examining the multiplicand and multiplier (more code, of course).
Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.
Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Comparing 6502 multiply routines
teamtempest wrote:
Instead of counting, another way to control the loop is to check if the multiplier has become zero after each iteration. In a 16x16 multiplier, OR-ing the high and low bytes of the multiplier together would terminate as soon as the highest set bit was shifted out. A multiplier of zero or one would terminate after one iteration with no special check required.
Code: Select all
;-----------------------------------------------------;
; var[x] = (var[x]*var[x+2])mod65536 (unsigned)
; uses: plus:, {>}
; exit: var[x+2] and {>} are modified
; 39 bytes
mul:
lda 0,x ;
sta gthan ;
lda 1,x ; copy multiplicand to {>}
sta gthan+1 ;
stz 0,x ; zero the product to begin
stz 1,x ;
.db $cd ; "cmp abs" naked op-code
mul1:
rol 3,x ;
mul2:
lda gthan ;
ora gthan+1 ;
beq mulrts ; exit early if multiplicand = 0
lsr gthan+1 ;
ror gthan ; right-shift multiplicand
bcc mul3 ; check the bit shifted out
jsr plus ; form the product in var[x]
mul3:
asl 2,x ; left-shift multiplier
bne mul1 ;
rol 3,x ;
bne mul2 ; loop until multiplier = 0
mulrts:
rts ;
Last edited by barrym95838 on Tue Jan 03, 2023 10:04 pm, edited 1 time in total.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: Comparing 6502 multiply routines
Quote:
One piece of additional data that might be helpful is a 'mean' time in cycles as well as an 'average'.
Re: Comparing 6502 multiply routines
It's a tricky question, I think, because the distribution of input values will come into it. It may well be that in many applications small multipliers will be found often, but in many others they won't.
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Comparing 6502 multiply routines
barrym95838 wrote:
That's how I did it in VTL02
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Comparing 6502 multiply routines
Well, I'm already blowing 12 cycles per iteration by not in-lining the add. The VTL02 "IDE" is a balancing act between small size and high performance, with a (not so) slight bias in favor of small size.
Last edited by barrym95838 on Sun Feb 26, 2023 5:11 pm, edited 1 time in total.
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Comparing 6502 multiply routines
The idea of testing for an 8x16 bit multiply inside a 16x16 bit multiply was implemented back in 1980, in 'Micro - The 6502 Journal' magazine https://archive.org/details/micro-6502- ... 1/mode/2up (they also have a separate 8x8 multiply).
-
teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
- Contact:
Re: Comparing 6502 multiply routines
Intuitively it makes sense to shortcut, but it may not be so in practice. Suppose we add a header to an 16x16 multiply routine like this:
That seems simple enough, but throughout the full 16-bit range of operands will add 30,081,548,288 cycles (2^24 * 8 branch taken + (2^32 -2^24) * 7 branch not taken) to the overall execution time. On the other hand, it will only save 5,100,273,664 or so cycles overall (2^24 * (8 * 38)), depending on how long a 16x16 do-nothing loop takes. So a net slowdown.
Half measures may not do. Rather than try to optimize one loop, it may be best to split into multiple loops, as the article suggests.
Code: Select all
ldx #8 ; assume 8 loops
lda MPLIER+1 ; can we get away with that ?
beq @1 ; b: yes
ldx #16 ; no, full mutiply
@1
Half measures may not do. Rather than try to optimize one loop, it may be best to split into multiple loops, as the article suggests.
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Comparing 6502 multiply routines
I added the Micro magazine listing that I linked to above into my tests including its branch to 8x16 multiply (mult48). It's not very competitive these days, coming in at 707 cycles average, but should be representative of a typical shift-and-add.
If I remove the supposed 'shortcut' to the 8x16 routine it comes in slightly faster at 703 cycles average (mult49). So the supposed shortcut is actually slightly counterproductive if all inputs are equally likely (as we suspected). However, if in the context of a hypothetical use case the 'acl/h' input number of this routine is limited to < 32768 (i.e. top bit clear) then the shortcut gives a minor improvement. The smaller the top of the range the better the improvement:
This is for unsigned numbers. (Negative numbers will have the top bit set, so don't get a benefit if the unsigned multiply is used before adjusting for signedness).
So a shortcut can be useful if one of the inputs has a smaller range.
If I remove the supposed 'shortcut' to the 8x16 routine it comes in slightly faster at 703 cycles average (mult49). So the supposed shortcut is actually slightly counterproductive if all inputs are equally likely (as we suspected). However, if in the context of a hypothetical use case the 'acl/h' input number of this routine is limited to < 32768 (i.e. top bit clear) then the shortcut gives a minor improvement. The smaller the top of the range the better the improvement:
Code: Select all
(acl,ach) less than | average cycles
--------------------+---------------
65536 (no limit) | 707
32768 | 696
16384 | 685
8192 | 672
4096 | 656
2048 | 632
1024 | 594
512 | 527
256 | 404 (always uses the 8x16 code)So a shortcut can be useful if one of the inputs has a smaller range.
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Comparing 6502 multiply routines
I've added some more good routines to the page, mostly 16 bit.