Fast multiplication

drogon · Post by **drogon** » Fri Oct 22, 2021 12:16 pm

I'm fairly sure I've read this before - here or over on the retrocomputing forum, but can't for the life of my find a recent thread on it (pointers welcome!)

However this site pops up from time to time:

http://www.txbobsc.com/aal/1986/aal8603.html

and in it there is possibly the fastest unsigned 8x8 multiply routine for the 6502. I decided to implement it in my Ruby system to see how it compares to my existing code... Especially when used to implement a 32x32 multiply as is used inside my BCPL system which is a 32-bit bytecode VM.

The good news is that it's faster than my native 32x32 multiply, the bad news is that it's slower than my "hardware arithmetic accelerator" which is an ATmega which is effectively an 8-bit CPU which also runs at 16Mhz. The advantage the ATmega has is that it has an on-board single cycle 8x8 multiply instruction. However it's only just faster due mostly to the latency of the calls between the 6502/816 side and the ATmega side, so while an interesting exercise, I'm yet again reminded of Knuth's Premature Optimisation quote.

The code below is in my BCPL VM minus the entry and exit stuff and handling signed numbers - regA and regQ are 32-bits wide located in zero/direct page, regW is 64-bits wide, again in zero page. It implements what's essentially a long multiplication of the bytes, adding them into regW at each step which was previously zeroed. It only needs to do enough to produce a 32-bit result.

Inlining the calls to mul8 will save a JSR/RTS, but again, Knuth is tapping my shoulder as there's only 10 calls, so that would save (6+6) * 10 = 120 cycles... Hmm. Well, might be worth it but at the expense of a lot more code space...

There is no doubt this is faster, so if you need an 8x8 (or more) multiply then this is it. My initial 65816 code for 32x32 mul takes just over 2000 cycles...

Also on that site is code for an 816 ('802) but it's going to be very much slower (at least I think it will be - I'll test it at some point). My own thought are that if I have enough RAM (needs 128KB) I can implement a lookup table which would enable me to do an 8x8 multiply in 16-bit mode with 2 simple lookups and none of the arithmetic needed to handle the table of squares / 4. I will give that a go when I upgrade my Ruby816 board to 1MB of RAM, but with 512KB it's a bit tight to take half it's RAM and leave enough to run the BCPL compiler in.

Code: Select all

        aix8

; 32-32 multiply of
;               P  Q  R  S
;             x T  U  V  W

digS    =       regA+0
digR    =       regA+1
digQ    =       regA+2
digP    =       regA+3

digW    =       regQ+0
digV    =       regQ+1
digU    =       regQ+2
digT    =       regQ+3

.macro  doMul8  d1,d2,res
        lda     d1
        ldx     d2
        jsr     mul8

        tay                     ; Temp. save high byte of result
        txa                     ; Low byte of result into A
        clc
        adc     res
        sta     res

        tya                     ; Restore high byte of result
        adc     res+1
        sta     res+1
.endmacro

        doMul8  digW,digS,regW+0
        doMul8  digW,digR,regW+1
        doMul8  digW,digQ,regW+2
        doMul8  digW,digP,regW+3

        doMul8  digV,digS,regW+1
        doMul8  digV,digR,regW+2
        doMul8  digV,digQ,regW+3
;       doMul8  digV,digP,regW+4

        doMul8  digU,digS,regW+2
        doMul8  digU,digR,regW+3
;       doMul8  digU,digQ,regW+4
;       doMul8  digU,digP,regW+5

        doMul8  digT,digS,regW+3
;       doMul8  digT,digR,regW+4
;       doMul8  digT,digQ,regW+5
;       doMul8  digT,digP,regW+6

        aix16

I've attached my versions of the code on the page above:

mul8.s: (8.62 KiB) Downloaded 130 times

BigEd · Post by **BigEd** » Fri Oct 22, 2021 12:44 pm

Nice! It was only a few weeks ago that I was looking for a comment someone had made, about this algorithm I think, to the effect that although the index registers are only 8 bits, you can wangle a 9 bit addition because the low byte of the base address is another 8 bits, and by using both you get a sort of 9 bit lookup.

But I wasn't able to find that discussion!

This was a recent mention of the quarter-square method, although I'm sure there must be previous mentions on this forum too.

What's awkward of course is that there could have been a series of algorithms, each one described at the time as the fastest, but overall only one of them retains the title.

For a table driven approach, is there any chance that looking up 4x4 results could be worthwhile? (Nibble extraction can be done by table, of course. But reassembling partial products might be a tangle.)

drogon · Post by **drogon** » Fri Oct 22, 2021 6:37 pm

BigEd wrote:

Nice! It was only a few weeks ago that I was looking for a comment someone had made, about this algorithm I think, to the effect that although the index registers are only 8 bits, you can wangle a 9 bit addition because the low byte of the base address is another 8 bits, and by using both you get a sort of 9 bit lookup.

But I wasn't able to find that discussion!

This was a recent mention of the quarter-square method, although I'm sure there must be previous mentions on this forum too.

Ah, I think that's the relatively recent one I was thinking of, thanks.

Quote:

What's awkward of course is that there could have been a series of algorithms, each one described at the time as the fastest, but overall only one of them retains the title.

For a table driven approach, is there any chance that looking up 4x4 results could be worthwhile? (Nibble extraction can be done by table, of course. But reassembling partial products might be a tangle.)

I'll give it some thought, but my concern is the shifting involved...

... thinking out loud, enter with the 4-bit args in A and X as before... store X in zp, shift A up 4 places, OR in the stored X, transfer to X, load from 256 byte table indexed by X and you have the result of the 4x4 multiply. ... then do the long multiply and add to get a 32x32 multiply which involves 8+7+...1 steps = 36 steps where the key part is isolating the nibble and that's either 4 shifts or a table lookup then the call then, as you said; reassembling the partial products...

I'll ponder more over the weekend!

Cheers,

-Gordon

GARTHWILSON · Post by **GARTHWILSON** » Fri Oct 22, 2021 7:34 pm

http://6502.org/source/general/SWN.html has a 12-cycle nybble-swap. You can get the 128KB 8x8 multiply look-up table (and lots of other tables) on my site at http://wilsonminesco.com/16bitMathTables/ . A direct link to the multiplication table itself in Intel Hex format is http://wilsonminesco.com/16bitMathTables/MULT.HEX .

drogon · Post by **drogon** » Fri Oct 22, 2021 8:57 pm

GARTHWILSON wrote:

http://6502.org/source/general/SWN.html has a 12-cycle nybble-swap. You can get the 128KB 8x8 multiply look-up table (and lots of other tables) on my site

Thanks, but I have a macro assembler that has a repeat function, so tables like this are trivial. e.g. from the source code I posted:

Code: Select all

sqLo:
        .repeat 512,i
          .byte <((i*i)/4)
        .endrep

sqHi:
        .repeat 512,i
          .byte >((i*i)/4)
        .endrep

but for anything much bigger calculating it at boot time is faster than loading it from disk (No ROM)

-Gordon

barrym95838 · Post by **barrym95838** » Fri Oct 22, 2021 9:55 pm

drogon wrote:

I can implement a lookup table which would enable me to do an 8x8 multiply in 16-bit mode with 2 simple lookups and none of the arithmetic needed to handle the table of squares / 4. I will give that a go when I upgrade my Ruby816 board to 1MB of RAM, but with 512KB it's a bit tight to take half it's RAM and leave enough to run the BCPL compiler in.

The thought of going that route is completely foreign to me, unless I had a hard real-time requirement that gave me no other choice but to starve. On the other hand, I will eagerly spend many hours in pursuit of a handful of bytes to save, so ... I'm somehow reminded of the Jack Sprat poem, but I decline to offer any opinion on which of us is "Jack".

BigEd · Post by **BigEd** » Sat Oct 23, 2021 6:19 am

Yep, you can save bytes, or save microseconds... sometimes, just sometimes, you can save both!

BillG · Post by **BillG** » Sat Oct 23, 2021 8:03 am

barrym95838 wrote:

drogon wrote:

I can implement a lookup table which would enable me to do an 8x8 multiply in 16-bit mode with 2 simple lookups and none of the arithmetic needed to handle the table of squares / 4. I will give that a go when I upgrade my Ruby816 board to 1MB of RAM, but with 512KB it's a bit tight to take half it's RAM and leave enough to run the BCPL compiler in.

The thought of going that route is completely foreign to me, unless I had a hard real-time requirement that gave me no other choice but to starve. On the other hand, I will eagerly spend many hours in pursuit of a handful of bytes to save, so ... I'm somehow reminded of the Jack Sprat poem, but I decline to offer any opinion on which of us is "Jack".

"Lean" can denote either using a reduced number of clock cycles or fewer bytes of machine code.

My philosophy a bit of the opposite of yours.

I have had only a few times where I had to make code smaller to fit an available space, so I tend to code leaning toward speed though I rarely take the extreme end of the time vs space tradeoff as in this multiplication discussion.

I never had to build "big" programs for an 8-bit (64K address space) platform.

I started Windows programming toward the end of the real-mode era. The best machines at the time were something like a 16 MHz 386. Because the architecture of Windows provided for the loading of segments of code on demand, the emphasis was still more toward speed.

Many developers today are accustomed to using two separate environments, one to build the code and another one to run it. I have no desire to build something like a compiler which runs natively on an 8-bit machine. Shoehorning code is not my idea of fun.

However, working on P-code for several weeks has given me an appreciation of how much more compact interpreted code is compared to native machine code. I have come to consider the value of an ability to mix native and interpreted code within one program for an 8-bit platform.

Microsoft C compilers in the early 90s had the ability to generate either interpreted or native code and the run-time environment managed the interface between the two. Early versions of Microsoft Word were built using this technology until the widespread adoption of protected-mode Windows obsoleted the need. It never JIT the interpreted code to native; machines at the time were considered to be too slow to do that. Developers had to choose between the two when programs were built.

BillG · Post by **BillG** » Sat Oct 23, 2021 8:14 am

drogon wrote:

There is no doubt this is faster, so if you need an 8x8 (or more) multiply then this is it. My initial 65816 code for 32x32 mul takes just over 2000 cycles...

I will have to look at doing something like this for multiplying two bignums using partial products instead of the traditional shift and add approach.

barrym95838 · Post by **barrym95838** » Sun Oct 24, 2021 2:40 am

drogon wrote:

There is no doubt this is faster, so if you need an 8x8 (or more) multiply then this is it. My initial 65816 code for 32x32 mul takes just over 2000 cycles...

I hope that's "worst case". I whipped up a little unsigned 32 x 32 -> 32 routine that spends 41 ticks for 0 * j, 79 or 104 ticks for i * 0, ~2332 ticks for 4294967295 * 4294967295, and typically something closer to 1000 ticks. But I don't have an '816 to test it.

Code: Select all

;-----------------------------------------------------;
; 32-bit unsigned multiply for the '802/816: i *= j.  ;
; entry:   16-bit mem/acc, 32-bit i, j and k in DP    ;
; exit:    overflow is ignored/discarded, j and k     ;
;            are modified, a = 0                      ;
; I, Michael T. Barry, place this code in the public
; domain, so it may be used and/or modified free of
; any legal encumbrances.  Attribution is requested
; but not strictly required.
; 48 bytes                                            ;
mul:
    lda i           ; 4 ;
    sta k           ; 4 ;
    lda i+2         ; 4 ; k = i
    sta k+2         ; 4 ;
    stz i           ; 4 ; i = 0
    stz i+2         ; 4 ;
mul2:
    lda k           ; 4 ;
    ora k+2         ; 4 ;
    beq mulrts      ; 2(3) ; exit early if k = 0
    lsr k+2         ; 7 ;
    ror k           ; 7 ; k /= 2
    bcc mul3        ; 2(3) ;
    clc             ; 2 ;
    lda i           ; 4 ; i += j
    adc j           ; 4 ;
    sta i           ; 4 ;
    lda i+2         ; 4 ;
    adc j+2         ; 4 ;
    sta i+2         ; 4 ;
mul3:
    asl j           ; 7 ;
    rol j+2         ; 7 ; j *= 2
    bne mul2        ; 2(3) ;
    lda j           ; 4 ; loop until done
    bne mul2        ; 2(3) ;
mulrts:
    rts             ; 6

[edit: corrected estimates for tick counts]

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Oct 24, 2021 2:59 am

When I wrote my mkfs filesystem building code, I needed to be able to do 32-bit integer multiplication with a 64-bit result. Here is what I conjured.

Code: Select all

;================================================================================
;
;imul: FACA = FACA × FACB  (64 bit integer multiplication)
;
;	—————————————————————————————————————————————————
;	Preparatory Ops : FACA: 32 bit multiplicand
;	                  FACB: 32 bit multplier
;
;	Computed Returns: FACA: 64 bit product
;	                  FACB: entry value
;
;	Register Usage  : .A: used
;	                  .B: used
;	                  .X: used
;	                  .Y: undefined
;
;	MPU Flags: nvmxdizc
;	           ||||||||
;	           |||||||+———> 0
;	           ||||||+————> undefined
;	           |||||+—————> entry value
;	           ||+++——————> 0
;	           ++—————————> undefined
;
;	Notes: 1) Results are undefined if either operand
;	          is greater than $FFFFFFFF.
—————————————————————————————————————————————————
;
imul     rep #m_seta           ;16-bit accumulator
         sep #m_setx           ;8-bit index
         cld
         ldx #s_bdword+1       ;bits in DWORD +1 (32+1)
         clc
;
.0000010 ror faca+s_long+s_word
         ror faca+s_long
         ror faca+s_word
         ror faca
         bcc .0000020
;
         clc
         lda faca+s_long
         adc facb
         sta faca+s_long
         lda faca+s_long+s_word
         adc facb+s_word
         sta faca+s_long+s_word
;
.0000020 dex
         bne .0000010
;
         rep #m_setx           ;16-bit index
         rts

S_LONG = 4, S_WORD = 2. The above runs on the 65C816 in native mode with a 16-bit accumulator and 8-bit index.

——————————
Edit: Corrected S_LONG size.

Edit: Noted a DWORD is 32 bits.

drogon · Post by **drogon** » Sun Oct 24, 2021 4:33 pm

BigDumbDinosaur wrote:

S_LONG = 8, S_WORD = 2. The above runs on the 65C816 in native mode with a 16-bit accumulator and 8-bit index.

I think S_LONG is supposed to be 4 here, however that's good and compact code there.

It's fractionally faster than my original long multiplication code @ 13.311 secs. vs. 13.380 for my version with the loop unrolled, or 14.183 seconds for the loop unrolled 31 times. My version is based on code from: http://nparker.llx.com/a2/mult.html

With my accelerated 8x8 multiplier, the same benchmark (another Mandelbrot using scaled integers written in BCPL) is 11.099 seconds.

If I offload the multiply to the ATmega then it's 10.985 seconds.

Will try MikeB's code when I get time.

Cheers,

-Gordon

barrym95838 · Post by **barrym95838** » Sun Oct 24, 2021 6:52 pm

I knew something was fishy about my cycle counts ... the 16-bit shift and rotate instructions are seven cycles for DP, not five, so my estimates were low. @BDD: that's a nifty little subroutine, but it comes at a speed cost, like so many nifty little subroutines often do.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Oct 24, 2021 8:07 pm

drogon wrote:

BigDumbDinosaur wrote:

S_LONG = 8, S_WORD = 2. The above runs on the 65C816 in native mode with a 16-bit accumulator and 8-bit index.

I think S_LONG is supposed to be 4 here, however that's good and compact code there.

You're correct. Dunno how that ended up being 8.

Quote:

It's fractionally faster than my original long multiplication code @ 13.311 secs. vs. 13.380 for my version with the loop unrolled, or 14.183 seconds for the loop unrolled 31 times. My version is based on code from: http://nparker.llx.com/a2/mult.html

When you tested it did you define FACA and FACB on direct page?

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Oct 24, 2021 8:39 pm

barrym95838 wrote:

I knew something was fishy about my cycle counts ... the 16-bit shift and rotate instructions are seven cycles for DP, not five, so my estimates were low. @BDD: that's a nifty little subroutine, but it comes at a speed cost, like so many nifty little subroutines often do.

The speed cost isn't as bad as one might expect for a 32-bit routine, and if a slight additional speed penalty can be tolerated, the code can be further compacted by doing the four RORs in a loop—the classic speed vs. size dichotomy.

If anything, it illustrates the significant advantage the 65C816 has in native mode over the 65C02 when it comes to doing math of any kind, especially if larger numbers are involved (picture the above written for the 65C02—there'd be a bunch more RORs and ADCs).

I don't claim this is the most optimized way it could be done—it's actually a scaled-up version of a 16-bit multiplication routine I originally devised for the 65C02 back around 1990 when I was doing a contract project. Despite processing 32-bit operands, the 65C816 rendition is smaller than its 16-bit 65C02 ancestor and substantially faster.

Fast multiplication

Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication

Re: Fast multiplication