Fast multiplication

drogon · Post by **drogon** » Sun Oct 24, 2021 9:30 pm

BigDumbDinosaur wrote:

drogon wrote:

BigDumbDinosaur wrote:

S_LONG = 8, S_WORD = 2. The above runs on the 65C816 in native mode with a 16-bit accumulator and 8-bit index.

I think S_LONG is supposed to be 4 here, however that's good and compact code there.

You're correct. Dunno how that ended up being 8.

Quote:

It's fractionally faster than my original long multiplication code @ 13.311 secs. vs. 13.380 for my version with the loop unrolled, or 14.183 seconds for the loop unrolled 31 times. My version is based on code from: http://nparker.llx.com/a2/mult.html

When you tested it did you define FACA and FACB on direct page?

Yes.

There is some "boilerplate" I use in all the MUL tests that works out the sign of the result, makes both sides positive, then does the multiply, fixes up the sign of the result. All these VM registers are inside the first 64 bytes DP.
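The sign-handling boilerplate described above can be sketched in Python as a model. This is only an illustration under assumptions: the names `signed_mul32` and `unsigned_mul32` are hypothetical, and the unsigned core here is a stand-in for whichever multiply routine is being tested, not the actual VM code.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(u):
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    return u - (1 << 32) if u & 0x80000000 else u

def unsigned_mul32(a, b):
    """Stand-in for any unsigned 32-bit multiply core; keeps the low 32 bits."""
    return (a * b) & MASK32

def signed_mul32(x, y):
    """Sign/magnitude wrapper around an unsigned multiply core.

    x and y are 32-bit two's-complement bit patterns (0..0xFFFFFFFF).
    """
    negative = ((x ^ y) & 0x80000000) != 0   # result sign = XOR of operand signs
    if x & 0x80000000:
        x = (-x) & MASK32                    # make the left side positive
    if y & 0x80000000:
        y = (-y) & MASK32                    # make the right side positive
    product = unsigned_mul32(x, y)
    if negative:
        product = (-product) & MASK32        # fix up the sign of the result
    return product
```

Because the wrapper costs the same for every multiply core, it adds a roughly constant offset to each timing rather than changing the ranking.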

By unrolling the loop and doing a few other trivial optimisations to my boilerplate code I get your code down to 12.1 seconds from 13.311 on the same little test - still fractionally slower than the 8x8 way but it added some 730 bytes which almost takes it closer to the 1024 bytes of tables (+code) needed for the fast 8x8 way...

However the same boilerplate optimisations take the 8x8 way from 11.099 to 10.964, but clutching at single cycles is never a good thing to chase unless I was running code for hours and hours to save a minute or 2...

There's never an easy win!

Cheers,

-Gordon

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Oct 24, 2021 10:33 pm

drogon wrote:

There's never an easy win!

That's true of life in general. We are forever chasing time.

barrym95838 · Post by **barrym95838** » Sun Oct 24, 2021 11:14 pm

BigDumbDinosaur wrote:

I don't claim this is the most optimized way it could be done—it's actually a scaled-up version of a 16-bit multiplication routine I originally devised for the 65C02 back around 1990 when I was doing a contract project.

Yeah, my submission is a scaled-up version of my 16-bit VTL02 multiply, so I have a bit of confidence that it works properly, if I didn't klutz it up. It's not as small as yours, and it discards product bits 32 to 63, but it has the possible advantages of leaving the index registers alone and exiting as soon as it realizes it has reached the final answer. For numerically small inputs, this can be a significant feature.
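To illustrate the early-exit idea, here is a minimal Python model of a shift-and-add multiply that keeps only the low 32 bits and stops as soon as nothing more can change the answer. It is a generic sketch, not the routine submitted in this thread; the function name and the returned iteration count are assumptions added for illustration.

```python
MASK32 = 0xFFFFFFFF

def mul32_early_exit(a, b):
    """Left-shift shift-and-add multiply keeping only the low 32 bits.

    Returns (product, iterations). The loop stops as soon as no 1 bits
    remain in a, or once b has been shifted entirely out of the low 32 bits.
    """
    a &= MASK32
    b &= MASK32
    product = 0
    iterations = 0
    while a and b:
        if a & 1:
            product = (product + b) & MASK32   # add the shifted multiplicand
        a >>= 1                                # next multiplier bit
        b = (b << 1) & MASK32                  # bits above 31 are discarded
        iterations += 1
    return product, iterations
```

In this model a small `a` such as 3 finishes in two passes, while two operands with all bits set need the full 32.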

Dr Jefyll · Post by **Dr Jefyll** » Mon Oct 25, 2021 5:14 am

BigDumbDinosaur wrote:

I don't claim this is the most optimized way it could be done—it's actually a scaled-up version of a 16-bit multiplication routine I originally devised for the 65C02

Agreed -- there is room for optimization.

... and a good place to focus attention is the rotates (which happen unconditionally for every iteration of the loop):


.0000010 ror faca+s_long+s_word
         ror faca+s_long
         ror faca+s_word
         ror faca

What's being shifted is a 64-bit value that'll become the result of the 32- by 32-bit multiply. The '816 can manage this 64-bit value with four consecutive R-M-W instructions operating on memory. But R-M-W instructions operating on memory tend to eat a lot of cycles.

In 2005, Bruce Clark aka dclxvi presented a partial remedy, and, inspired by him, I kicked the ball even further. The context was an '02/'C02 routine that multiplied two 16-bit values to yield a 32-bit result, but the same principles would apply for an '816 version that multiplies two 32-bit values to yield a 64-bit result. Just remember that (as with BDD's code) the result gets shifted in four pieces (four quarters).

Bruce's main loop keeps the most-significant quarter in the Accumulator, which is a lot speedier to shift. The other three quarters still reside in memory, but nevertheless it's a significant improvement. Instead of having each iteration of the loop shift...

Mem3 -> Mem2 -> Mem1 -> Mem0... he shifted...
Accum -> Mem2 -> Mem1 -> Mem0

My variation uses two successive loops. The first half of the multiply shifts...

Accum -> Mem2 -> Mem0... and the second half of the multiply shifts...
Accum -> Mem2 -> Mem1

I know that looks broken. But 25% of the shifting gets eliminated altogether, and all the bits do end up in the right place! (In fact, I ran an exhaustive test to verify this and another optimization. Using the Kowalski simulator running on a Windoze box I verified the result for all 2^32 possible combinations of input values.)
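For readers who want to convince themselves that the two-loop shift pattern puts every bit in the right place, the following Python model mimics it with a configurable quarter width. It is a sketch written for this purpose, not the '02/'C02 source: the names `mul_two_loop`, `mem`, `mem2` and `acc` are assumptions, with `q=8` corresponding to the 16-by-16 case and `q=16` to a 32-by-32 upscale.

```python
def mul_two_loop(a, b, q=8):
    """Model of the two-loop right-shift multiply with q-bit quarters.

    a and b are unsigned 2q-bit values; returns the 4q-bit product.
    """
    qmask = (1 << q) - 1
    b &= (1 << (2 * q)) - 1
    mem = [a & qmask, (a >> q) & qmask]   # Mem0, Mem1: low and high multiplicand quarters
    mem2 = 0                              # third product quarter
    acc = 0                               # top product quarter, held in the "accumulator"

    def ror(value, carry_in):
        """Rotate right through carry: returns (new_value, carry_out)."""
        return (carry_in << (q - 1)) | (value >> 1), value & 1

    for low in (0, 1):                    # first loop uses Mem0, second loop uses Mem1
        carry = mem[low] & 1              # LSR: first multiplicand bit into carry
        mem[low] >>= 1
        for _ in range(q):
            if carry:                     # add the multiplier into Accum:Mem2
                total = ((acc << q) | mem2) + b
                carry = total >> (2 * q)  # carry out of the add
                acc = (total >> q) & qmask
                mem2 = total & qmask
            acc, carry = ror(acc, carry)
            mem2, carry = ror(mem2, carry)
            # A product bit rotates in at the top; the next multiplicand bit drops out.
            mem[low], carry = ror(mem[low], carry)
    return (acc << (3 * q)) | (mem2 << (2 * q)) | (mem[1] << q) | mem[0]
```

Each loop touches only three of the four quarters, which is where the 25% saving in shifts comes from: the quarter being consumed as multiplicand is simultaneously refilled with finished product bits.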

Back in 2005 I made an animation to illustrate (below). What's shown are the shift patterns for...

a somewhat clunky left-shift multiply from FIG Forth (intended to sidestep the famous ROR bug?)
Bruce's right-shift multiply, and
my modified right-shift multiply.

The code and further commentary can be found in the thread UM* (multiplication) bug in common 6502 Forths.

-- Jeff

BigDumbDinosaur · Post by **BigDumbDinosaur** » Mon Oct 25, 2021 8:55 am

Dr Jefyll wrote:

BigDumbDinosaur wrote:

I don't claim this is the most optimized way it could be done—it's actually a scaled-up version of a 16-bit multiplication routine I originally devised for the 65C02

Agreed -- there is room for optimization.

I'm always open to suggestions, especially since that 32-bit multiplication function is going to become important with some programming projects I'll be doing.

Quote:

Back in 2005 I made an animation to illustrate (below).

So I see...although I can't see all of it.

drogon · Post by **drogon** » Mon Oct 25, 2021 9:49 am

Dr Jefyll wrote:

... and a good place to focus attention are the rotates (which happen unconditionally for every iteration of the loop):


.0000010 ror faca+s_long+s_word
         ror faca+s_long
         ror faca+s_word
         ror faca

What's being shifted is a 64-bit value that'll become the result of the 32- by 32-bit multiply. The '816 can manage this 64-bit value with four consecutive R-M-W instructions operating on memory. But R-M-W instructions operating on memory tend to eat a lot of cycles.

I replaced that last ror faca with a tya ; ror ; tay sequence (with a pre-load of Y and a store in the boilerplate code). It saves one cycle per loop, so 33 cycles overall. It does make a measurable difference (about 190ms faster in a 12-13 second run-time mandelbrot generator) but we're back to chasing single cycles again.

At this point I feel it really needs a new algorithm as improvements here are unlikely to be a significant gain - sure, we can shave a cycle here and there, but unless it's something we're calling millions of times in some scientific code then is it worth more time? (and being realistic, much as we love it, we're really not going to be running big scientific code on our old CPUs when there are far better systems for that).

My conclusion so far is that the table-driven way is still (slightly) faster, especially for 8x8 multiplies, but is it that much faster to justify the expense of the RAM required? In my system, I have the spare RAM (I put the tables at $F000 and $F200, which is otherwise unused in Ruby) and I also have an ATmega "co-processor" attached which does many things, including offloading 32-bit integer and floating point maths (in my BCPL system).

Cheers,

-Gordon

drogon · Post by **drogon** » Mon Oct 25, 2021 12:39 pm

Another update on this - in a fit of boredom, I hacked up a multiply benchmark program (in BCPL) and it's raised some interesting observations.

The order of multiplying is sometimes important ie. x := a * b vs. x := b * a. I think this is due to some algorithms terminating early with small numbers that shift to zero quickly. (and it's not always the same - more research needed).

However what surprised me is that the tables of quarter squares is actually faster than handing the numbers over to the co-processor. My guess here is that in the mandelbrot test I was using, it also does divides. These divides are relatively slow and also handled by the co-processor so that may be skewing the results, so take the divide out and here we are (and if someone can come up with a really fast 32-bit divide, then woo-hoo!)
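For anyone unfamiliar with the quarter-squares trick, it rests on the identity a*b = floor((a+b)^2/4) - floor((a-b)^2/4), which turns an 8x8 multiply into two table lookups and a subtraction. The Python sketch below shows the identity and one way to build a 32x32 product from 8x8 pieces. The table layout and the function names are assumptions for illustration; the actual tables at $F000 and $F200 are not shown in this thread.

```python
# Table of floor(n*n/4) for n = 0..510, enough for the sum of two bytes.
QUARTER_SQUARES = [(n * n) // 4 for n in range(511)]

def mul8x8(a, b):
    """Unsigned 8x8 -> 16-bit multiply using a*b = qs(a+b) - qs(|a-b|)."""
    return QUARTER_SQUARES[a + b] - QUARTER_SQUARES[abs(a - b)]

def mul32x32(x, y):
    """Unsigned 32x32 -> 64-bit product from sixteen 8x8 partial products."""
    product = 0
    for i in range(4):
        xi = (x >> (8 * i)) & 0xFF
        for j in range(4):
            yj = (y >> (8 * j)) & 0xFF
            # Each partial product is weighted by its byte position.
            product += mul8x8(xi, yj) << (8 * (i + j))
    return product
```

The subtraction is exact because a+b and a-b always have the same parity, so the two discarded fractions cancel. If only the low 32 bits are wanted, the partial products with i + j > 3 can be skipped.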

Here they are - I've picked the fastest out of a * b vs. b * a to condense the results here and the range is 0 through 9999 being multiplied by a constant 12345. I've avoided negative numbers as that's handled in the boilerplate for each algorithm and is fairly constant. Note that the co-pro code doesn't do the 'boilerplate' of handling the negative numbers; the numbers are passed over to it to handle as a signed multiply regardless of the number range, so it's faster with negative numbers as a result.


Orig:   Time taken: 1.483 - 0.226 = 1.257 seconds: Muls/sec:  7954
BDD:    Time taken: 1.303 - 0.226 = 1.077 seconds: Muls/sec:  9284
CoPro:  Time taken: 1.073 - 0.226 = 0.847 seconds: Muls/sec: 11805
mul8x8: Time taken: 1.040 - 0.226 = 0.814 seconds: Muls/sec: 12283

The 0.226 is the empty loop time. It's all slightly scaled integer arithmetic for the timings which are accurate to 1ms.
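The Muls/sec column can be reproduced approximately from the timings. The sketch below assumes 10000 multiplies per run (the range 0 through 9999) and uses floating point, so it lands a unit or two away from the posted figures, which come from scaled integer arithmetic. The function and variable names are illustrative only.

```python
def muls_per_second(total_seconds, empty_loop_seconds, multiplies=10000):
    """Multiplies per second after subtracting the empty-loop overhead."""
    return multiplies / (total_seconds - empty_loop_seconds)

# Total run times in seconds, copied from the table above.
results = {"Orig": 1.483, "BDD": 1.303, "CoPro": 1.073, "mul8x8": 1.040}
rates = {name: muls_per_second(total, 0.226) for name, total in results.items()}

for name, rate in rates.items():
    print(f"{name:7s} {rate:8.0f} muls/sec")
```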

Benchmarking directly on the Atmega Co-processor is another order of magnitude faster though - it's the relatively high latency of communications between the 2 that's really hindering it here.

What have I learned from this? Well, really nothing new - 30+ years ago we were fudging benchmarks like this for the sales people, so nothing changes! One day I'll port Dhrystone to BCPL as at least that's a good common base...

-Gordon

barrym95838 · Post by **barrym95838** » Mon Oct 25, 2021 5:18 pm

I can't prove it, but I think that my submission can beat 12,000 multiplies/sec at 16 MHz (is 16 MHz right?) if the a in a * b is kept "small" or if the b has a bunch of trailing binary zeros, like a multiple of a power of two would have. It would have no chance of competing with random inputs though ...

drogon · Post by **drogon** » Mon Oct 25, 2021 8:48 pm

barrym95838 wrote:

I can't prove it, but I think that my submission can beat 12,000 multiplies/sec at 16 MHz (is 16 MHz right?) if the a in a * b is kept "small" or if the b has a bunch of trailing binary zeros, like a multiple of a power of two would have. It would have no chance of competing with random inputs though ...

And we have a winner:


Orig:   Time taken: 1.483 - 0.226 = 1.257 seconds: Muls/sec:  7954
BDD:    Time taken: 1.303 - 0.226 = 1.077 seconds: Muls/sec:  9284
CoPro:  Time taken: 1.073 - 0.226 = 0.847 seconds: Muls/sec: 11805
mul8x8: Time taken: 1.040 - 0.226 = 0.814 seconds: Muls/sec: 12283
MB:     Time taken: 0.950 - 0.226 = 0.724 seconds: Muls/sec: 13810

However, as you suggest it might not be the best with random - however my mandelbrot takes 9.348 seconds, so fast... Well done that man!

-Gordon

barrym95838 · Post by **barrym95838** » Mon Oct 25, 2021 9:05 pm


@
 \
:-D
 /
@

(That's my best "Rocky" emoji ... maybe I should stick to coding for 46-year-old microprocessors).

Dr Jefyll · Post by **Dr Jefyll** » Tue Oct 26, 2021 2:10 pm

BigDumbDinosaur wrote:

drogon wrote:

I think S_LONG is supposed to be 4 here, however that's good and compact code there.

You're correct. Dunno how that ended up being 8.

Hm, here's another possible issue. Might it be that the high 4 bytes of FACA need to be initially cleared to zero? I'm looking at the line that says...


        lda faca+s_long

...and, the first time that instruction executes, it's not clear what's being loaded into A. Still a solid and straight-ahead approach, however.

-- Jeff

Dr Jefyll · Post by **Dr Jefyll** » Tue Oct 26, 2021 2:11 pm

barrym95838 wrote:

That's my best "Rocky" emoji

Make some room on the podium, Mike -- we may have a tie for first place!

Based on random input values, it looks as if my modified right-shift algorithm can produce a full 64-bit result in roughly the same time as your routine takes to produce a 32-bit result. I realize Gordon only requires 32 bits, but it'll be interesting to compare the two routines just the same.

The code below is an 816-ified upscale of the modified-shift '02/'C02 approach I mentioned earlier. While preparing the '816 version I opted for a "plain-vanilla" approach because today I wanted to present something that was clear and easy to read. The original version is smaller because it reuses a single loop, but this new version is simpler and bulkier. Also there's no early-exit logic, and my comments don't have the laconic elegance that Mike's do!

Quote:

41 ticks for 0 * j, 79 or 104 ticks for i * 0, ~2332 ticks for 4294967295 * 4294967295, and typically something closer to 1000 ticks. But I don't have an '816 to test it.

Testing, yes... My '02/'C02 version actually has been tested for all 2^32 input combinations; this is something I edited my earlier post to mention. But the code below, while similar, hasn't been tested at all.

As for execution time, most approaches (including Mike's and mine) are strongly affected by the input values, and even the order of the input values. I'll leave it to the judges to determine the statistical weighting of the input value stream that's deemed relevant for this contest (if it is a contest).

The routine below has a fixed overhead plus a variable penalty that's proportional to how many 1 bits are in the multiplicand passed in FACA. Assuming FACA and FACB are in direct page, I tally the best-case performance to be 812 ticks and worst-case to be [edit] 1548. Thus the average, corresponding to random input, is (drum roll, please!) 1180 cycles. This is virtually identical to 1200, which is about what I take Mike's average performance to be. (Correct me if I'm wrong, Mike. I admit I'm confused by the "79 or 104 ticks for i * 0" among your best-case figures.)
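Taking the best-case and worst-case figures above at face value, the timing can be modelled as a straight line in the number of 1 bits in FACA. The per-bit penalty of 23 cycles below is inferred from those two figures, (1548 - 812) / 32, not from recounting the listing, and the function name is an assumption.

```python
BEST_CASE = 812     # cycles when FACA has no 1 bits (figure from the post)
WORST_CASE = 1548   # cycles when all 32 bits of FACA are 1 (figure from the post)
PER_ONE_BIT = (WORST_CASE - BEST_CASE) // 32   # inferred penalty per 1 bit

def estimated_cycles(faca):
    """Linear estimate: fixed overhead plus a penalty per 1 bit in FACA."""
    return BEST_CASE + PER_ONE_BIT * bin(faca & 0xFFFFFFFF).count("1")
```

A random 32-bit value has 16 one bits on average, which gives the 1180-cycle figure. A multiplicand with few 1 bits sits much nearer the best case, so in this model it pays to pass the sparser operand in FACA.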

Random input aside, I suspect Mike's early-exit feature will give him an advantage for the Mandelbrot test. I'm unable to estimate the magnitude of that advantage because I don't know what values Mandelbrot is likely to favor, but studying Mike's routine I see it exits a little early if one of the operands has, in its MS bits, a short string of consecutive zeroes... and it exits a LOT early if there's a long string of zeroes there! The other operand affects performance in a similar way but the zeroes need to be right justified (ie, in the LS bits, not MS).

-- Jeff


;   Computed Returns: FACA: the 8-byte value that becomes the 64 bit product (little-endian).
;                      The "bottom" of FACA also accepts one of the input values.
;   Input values:     FACA: the 4-byte value that's the 32 bit multiplicand (little-endian)
;                     FACB: the 4-byte value that's the 32 bit multiplier   (little-endian)

imul     rep #m_seta|#m_setx   ;16-bit accumulator and XY                       3
         cld                                                                    2
                       ;Clear the high 4 bytes of the 8-byte value at faca.
         lda #0        ;NB: the top 2 bytes are in A, copied to faca+6 at exit. 3
         sta faca+4    ;It helps to think of A as being synonymous with faca+6. 4

         ldx #16               ;number of iterations, 1st loop                  3
         lsr faca+0            ;CY= 1st of 16 bits tested in the 1st loop.      7
                               ;NB:faca+2 is not involved in the 1st loop.
LoopTop1:bcc No_Add1                                                        2 | 3  (Lp)
         tay                   ;Save A                                      2 | na (Lp)
         clc                   ;Add facb into top 32 bits of faca           2 | na (Lp)
         lda faca+4                                                         4 | na (Lp)
         adc facb                                                           4 | na (Lp)
         sta faca+4            ;low 16 bits done                            4 | na (Lp)
         tya                   ;Restore A = "faca+6"                        2 | na (Lp)
         adc facb+2            ;high 16 bits done                           4 | na (Lp)

No_Add1: ; Rotate Carry -> A_aka_faca+6  ->  faca+4  ->  faca
         ror A                 ;"faca+6"                                        2  (Lp)
         ror faca+4            ; faca+4                                         7  (Lp)
         ;;                      faca+2 is not involved in the 1st loop.
         ror faca              ; faca+0                                         7  (Lp)
         dex                   ;                                                2  (Lp)
         bne LoopTop1          ; more iterations for this loop?                 3  (Lp)
;
         ldx #16               ;number of iterations, 2nd loop                  3
         lsr faca+2            ;CY= 1st of 16 bits tested in the 2nd loop.      7
                               ;NB:faca+0 is not involved in the 2nd loop.
LoopTop2:bcc No_Add2                                                        2 | 3  (Lp)
         tay                   ;Save A                                      2 | na (Lp)
         clc                   ;Add facb into top 32 bits of faca           2 | na (Lp)
         lda faca+4                                                         4 | na (Lp)
         adc facb                                                           4 | na (Lp)
         sta faca+4            ;low 16 bits done                            4 | na (Lp)
         tya                   ;Restore A = "faca+6"                        2 | na (Lp)
         adc facb+2            ;high 16 bits done                           4 | na (Lp)

No_Add2: ; Rotate Carry -> A_aka_faca+6  ->  faca+4  ->  faca+2
         ror A                 ;"faca+6"                                        2  (Lp)
         ror faca+4            ; faca+4                                         7  (Lp)
         ror faca+2            ; faca+2                                         7  (Lp)
         ;;                      faca+0 is not involved in the 2nd loop.
         dex                   ;                                                2  (Lp)
         bne LoopTop2          ; more iterations for this loop?                 3  (Lp)

         sta faca+6            ;done!                                           4
Exit:

Edits: typo in cycle count stated in the text, booboo in one of the comments in the code

drogon · Post by **drogon** » Tue Oct 26, 2021 2:29 pm

Dr Jefyll wrote:

barrym95838 wrote:

That's my best "Rocky" emoji

Make some room on the podium, Mike -- we may have a tie for first place!

Based on random input values, it looks as if my modified right-shift algorithm can produce a full 64-bit result in roughly the same time as your routine takes to produce a 32-bit result. I realize Gordon only requires 32 bits, but it'll be interesting to compare the two routines just the same.

Actually, I need the full 64-bits for another operation, MULDIV where the result is

r := muldiv (x, y, z)

with x * y being computed to the full 64-bit value before the divide by (32-bit) z takes place. This is designed to better handle scaled integers amongst other things. However since this isn't a common operation I'm using my original long multiplication for that with a modified divide and it's not changed for some time. It's slow but works fine.
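A small Python sketch shows why the full 64-bit intermediate matters for MULDIV. It covers only non-negative x and y and a positive z, and the truncated variant exists purely for contrast. Both function names are illustrative and are not taken from the BCPL library source.

```python
MASK32 = 0xFFFFFFFF

def muldiv(x, y, z):
    """(x * y) // z with the product held at full 64-bit width.

    Sketch for non-negative 32-bit x and y and a positive 32-bit z.
    """
    product = (x & MASK32) * (y & MASK32)   # full 64-bit intermediate
    return product // z

def muldiv_truncated(x, y, z):
    """Same calculation, but the product is cut to 32 bits first (for contrast)."""
    return (((x & MASK32) * (y & MASK32)) & MASK32) // z
```

With x = y = 100000 and z = 1000 the product is 10^10, which does not fit in 32 bits, so only the full-width version returns the expected 10000000.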

Quote:

The code below is an 816-ified upscale of the modified-shift '02/'C02 approach I mentioned earlier. Today I opted to present a "plain-vanilla" version because I wanted something that was clear and easy to read. The original version is smaller because it reuses a single loop, but this new version is simpler and bulkier. Also there's no early-exit logic, and my comments don't have the laconic elegance that Mike's do!

I'll have a look and better improve my code to swap between algorithms which is by conditional compilation of my BCPL/Cintcode VM, as well as the benchmark program to use a range of random numbers for a possibly better representation.

Cheers,

-Gordon

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Oct 26, 2021 3:19 pm

Dr Jefyll wrote:

Hm, here's another possible issue. Might it be that the high 4 bytes of FACA need to be initially cleared to zero? I'm looking at the line that says...


        lda faca+s_long

...and, the first time that instruction executes, it's not clear what's being loaded into A. Still a solid and straight-ahead approach, however.

The note near the top of the code about maximum operand magnitude implies that FACA needs to be zeroed prior to loading it with an operand. Otherwise, as you noted, the high DWORD of FACA would be undefined content and if non-zero, would bugger up the result. The file that contains that code, along with other math functions, has general notes at the beginning about clearing FACA before loading it, as well as a sub that when called, zeroes FACA and then loads it with an operand pointed to by the .X and .Y registers.

barrym95838 · Post by **barrym95838** » Tue Oct 26, 2021 3:34 pm

Dr Jefyll wrote:

Make some room on the podium, Mike -- we may have a tie for first place!

I'll always make room for you, doc!

Quote:

This is virtually identical to 1200, which is about what I take Mike's average performance to be. (Correct me if I'm wrong, Mike. I admit I'm confused by the "79 or 104 ticks for i * 0" among your best-case figures.)

I think 1200 might be a bit optimistic for mine. The performance drop as the number of binary ones increases is probably the steepest of the group. The "79" comes from a zero in bit zero of i, and the "104" comes from a one in bit zero of i.
