integer division and multiplication

barrym95838 · Post by **barrym95838** » Tue Mar 17, 2015 4:17 am

That's a nice improvement from your original UD/MOD attempt. Have you been able to measure the speed difference yet?

I know that you are going for speed as well as compactness, and I can't figure out how to beat Garth's improved FIG UM/MOD for speed, but if I was going for smallest size, I would probably reuse code for UD/MOD and UM/MOD like so (58 code bytes total, headers not included):

Code: Select all

; - - - - - - - - - - - - - - - - - - - - - - - - - - -
; UM/MOD  ( ud u1 -- u2 u3 )
; Dividend ud is 32-bits unsigned.  Divisor u1,
; Remainder u2 and Quotient u3 are 16-bits unsigned
; Invalid inputs and/or outputs are silently ignored
; 1823 cycles (+ NEXT) best case (zero quotient)
; 2239 cycles (+ NEXT) worst case (zero divisor)
;
umslashmod:
        jsr  xudivmod   ; (do the dirty work)
        inx             ; NIP
        jmp  swap       ; SWAP

; - - - - - - - - - - - - - - - - - - - - - - - - - - -
; UD/MOD  ( ud1 u1 -- u2 ud2 )
; Dividend ud1 and Quotient ud2 are 32-bits unsigned
; Divisor u1 and Remainder u2 are 16-bits unsigned
; Invalid inputs and/or outputs are silently ignored
; 1856 cycles (+ NEXT) best case (zero quotient)
; 2272 cycles (+ NEXT) worst case (zero divisor)
;
udslashmod:
        jsr  xudivmod   ; (do the dirty work)
        jmp  dashrot    ; -ROT

; - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Internal divmod routine used by UM/MOD and UD/MOD
; ( ud1 u1 -- ud2 u2 ) 
; ud1 is dividend, u1 is divisor
; ud2 is quotient, u2 is remainder
xudivmod:
        ldy  #32        ;[2] init loop counter
        lda  #0         ;[2] init partial remainder
        sta  N          ;[3]   in N:A (h:l)
xudm2:
        asl  stackl+1,x ;[6] dividend in NOS:3OS (h:l)
        rol  stackh+1,x ;[6]   is gradually replaced
        rol  stackl,x   ;[6]   with the quotient
        rol  stackh,x   ;[6]
        rol             ;[2] N:A is gradually replaced
        rol  N          ;[5]   with the remainder
        pha             ;[3]
        cmp  TOS        ;[3] TOS holds divisor
        lda  N          ;[3] partial remainder >= TOS?
        sbc  TOS+1      ;[3]
        bcc  xudm3      ;[3]*
        sta  N          ;[3]   yes: update the partial
        pla             ;[4]     remainder and set the
        sbc  TOS        ;[3]     low bit in the partial
        inc  stackl+1,x ;[6]     quotient
        .db  $c9        ;[2]*  cmp# naked opcode
xudm3:
        pla             ;[4]*
        dey             ;[2] loop 32 times
        bne  xudm2      ;[3]*
        sta  TOS        ;[3]
        ldy  N          ;[3]
        sty  TOS+1      ;[3]
        rts             ;[6]

I'm not going to win many speed contests that way, especially with UM/MOD, but I'm 51% confident that this way is going to have the smallest footprint. It's certainly not the smallest theoretically, because you could do repeated subtractions without shifts, but I think that my proposal gives good "bang for the byte", assuming you want to have UM/MOD and UD/MOD available to the user, with reasonable performance.

YMMV, untested, as usual.

Mike B.

GARTHWILSON · Post by **GARTHWILSON** » Tue Mar 17, 2015 5:18 am

UD/MOD doesn't usually need to be super fast; but if it did, I would look for an efficient, dedicated way to divide by ten (ie, $0A), then test the input and route the divide-by-10 to that, route the divide-by-two and divide-by-16 to dedicated parts that just shift, and anything else to the normal routine (which will seldom get used).

White Flame · Post by **White Flame** » Tue Mar 17, 2015 6:40 am

All you need is to go Forth and multiply.

(I'll see myself out...)

BigEd · Post by **BigEd** » Tue Mar 17, 2015 9:15 am

GARTHWILSON wrote:

UD/MOD doesn't usually need to be super fast; but if it did, I would look for an efficient, dedicated way to divide by ten (ie, $0A), then test the input and route the divide-by-10 to that, route the divide-by-two and divide-by-16 to dedicated parts that just shift, and anything else to the normal routine (which will seldom get used).

Nice idea - make the common case fast.

rwiker · Post by **rwiker** » Wed Mar 18, 2015 12:31 pm

White Flame wrote:

All you need is to go Forth and multiply.

(I'll see myself out...)

Not an entirely new idea: http://www.jupiter-ace.co.uk/Forth_PCWGoForthAnd.html

chitselb · Post by **chitselb** » Thu Mar 19, 2015 2:27 am

There are some great ideas here, and to do them justice, I'm going to have to dedicate some time to implementing and benchmarking them all within the same context. I never really did figure out how to do the reciprocal multiplication approach, and that division by constants technique with shifting and adding tables of constants, that's black magic to me as well. I'll publish a table of the results when I'm done.

barrym95838 · Post by **barrym95838** » Sat Apr 04, 2015 6:58 pm

barrym95838 wrote:

... and I can't figure out how to beat Garth's improved FIG UM/MOD for speed ...

... or can I???

... details to come, soon, I hope.

Mike B.

barrym95838 · Post by **barrym95838** » Tue Apr 14, 2015 6:25 am

Okay, I have an embarrassing confession ... I can't figure out how to use the IIgs monitor well enough to test this 16-bit UM/MOD ... I'm quite proficient with the original Apple monitor, but success in working the enhanced version somehow eludes me. Can anyone comment on this brutally raw attempt, and possibly give me a thumbs up or down? Garth is buried in legitimate work, and he's the only one I know with an '802/'816 Forth.

Code: Select all

;------------------------------------------------------
; UM/MOD for the 65802/816 by Michael T. Barry.
; Free to copy and/or modify, but without warranty.
;
; ( ud u1 -- u2 u3 )
;
; ud is the 32-bit dividend.
; u1 is the 16-bit divisor.
; u2 is the 16-bit remainder.
; u3 is the 16-bit quotient.
; All values are unsigned.
;
; Quotient overflow or /0 results in a quotient and
;   remainder of $FFFF.  Although $FFFF could be a
;   valid quotient, it cannot be a valid remainder,
;   so the caller may use this fact to check for an
;   invalid result.
;
; 0,x is TOS ( divisor -- quotient )
; 2,x is NOS ( hi dividend -- remainder )
; 4,x is 3OS ( lo dividend --  )
;
; Accumulator and memory are in 16-bit mode.
; Index registers are in 8-bit mode.
;
HEADER "UM/MOD", NOT_IMMEDIATE ; ( ud u -- rem quot )
        CODE            ; Machine code primitive.
umslashmod:
        lda  2,x        ; If hi cell of dividend is
        cmp  0,x        ;   >= divisor, make a note
        php             ;   of overflow for later.
        ldy  #17        ; Init loop counter.
        .db  $c0        ; Skip following rol.
umsm2:
        rol             ; Shift hi cell of dividend.
; !!!!!!!!! Bug !!!!!!!!! 17th bit is lost here !!!!!!!!!
        cmp  0,x        ; if (hi dividend >= divisor)
        bcc  umsm3      ; then update remainder
        sbc  0,x        ;   and leave carry set
umsm3:
        rol  4,x        ; Shift carry into quotient.
        dey             ; Loop until done.
        bne  umsm2
        plp             ; Check for overflow.
        bcc  umsm5      ;
        tya             ; If so, put $ffff in
        dec             ;   quotient and remainder.
        sta  4,x        ;
umsm5:                  ;
        sta  2,x        ; Save remainder in NOS.
        inx             ;
        inx             ; DROP
        jmp  swap       ; SWAP ;
; 34 bytes, not counting HEADER.

If I made any glaring errors, I will correct them here, with appropriate credit.

Mike B.

[Edit: I've been able to play with it just a bit, and the quotient doesn't seem to be correct for any divisor other than 1 ... that could be considered a significant bug! Work continues ...]

[Edit #2: I have substituted a second (but still not tested) attempt above, completely replacing the first, which was s**t. It was inspired by https://raw.githubusercontent.com/Acher ... muldiv.asm
, but please don't hold that against White Flame if this attempt turns out to be s**t as well.]

[Edit: This version is buggy for larger inputs. See below for a properly debugged version.]

barrym95838 · Post by **barrym95838** » Mon Apr 20, 2015 12:47 am

Okay, I have no solid excuse for not testing this one, only that I've been a bit under the weather for the past couple of days, and I got excited when I finally got it ironed out and it "looked" right:

Code: Select all

;------------------------------------------------------
; UM/MOD for any 6502 by Michael T. Barry.
; Free to copy and/or modify, but without warranty.
;
; ( ud u1 -- u2 u3 )
; ud is the 32-bit dividend.
; u1 is the 16-bit divisor.
; u2 is the 16-bit remainder.
; u3 is the 16-bit quotient.
; All values are unsigned.
;
; Quotient overflow or /0 results in a quotient and
;   remainder of $FFFF.  Although $FFFF could be a
;   valid quotient, it cannot be a valid remainder,
;   so the caller may use this fact to check for an
;   invalid result.
;
; TOS and TOS+1 are at fixed points in ZP.
; stackl,x and stackh,x are NOS.
; stackl+1,x and stackh+1,x are 3OS.
;
umslashmod:
        ldy  #17        ;[2] Init loop counter
        lda  stackl,x   ;[4]
        cmp  TOS        ;[3] If (dividend:H < divisor)
        lda  stackh,x   ;[4]
        sbc  TOS+1      ;[3] then
        lda  stackh,x   ;[4]   keep NOS:H in A for speed
        bcc  umsm4      ;[3]*  and proceed with division
        lda  #$ff       ;[2] else
        sta  stackl+1,x ;[4]   overflow: Put $ffff in
        sta  stackh+1,x ;[4]   quotient and remainder
        sta  stackl,x   ;[4]
umsm2:                  ;
        sta  stackh,x   ;[4] Exit sequence:
        jsr  slide      ;[28]  DROP the divisor
        jmp  SWAP       ;[34]* SWAP quotient and remainder
umsm3:                  ;
        sbc  TOS        ;[3]
        sta  stackl,x   ;[4] Update remainder
        pla             ;[4]  (gradually replacing
        sbc  TOS+1      ;[3]   dividend:H)
        sec             ;[2]
umsm4:                  ;
        rol  stackl+1,x ;[6] Shift carry into quotient
        rol  stackh+1,x ;[6]  (gradually replacing
        dey             ;[2]   dividend:L)
        beq  umsm2      ;[2]* Exit if done
        rol  stackl,x   ;[6] Shift dividend:H
        rol             ;[2]
        pha             ;[3]
umsm5:                  ;
        lda  stackl,x   ;[4]
        bcs  umsm3      ;[2]*
        cmp  TOS        ;[3] If (17th bit set) or
        pla             ;[4]   (dividend:H >= divisor)
        pha             ;[3]
        sbc  TOS+1      ;[3] then
        bcs  umsm5      ;[2]*  update remainder
        pla             ;[4] else
        bcc  umsm4      ;[3]   continue without update
;------------------- 64 bytes, not counting HEADER

Comments, corrections, critiques, or congratulations are welcome, as always.

Mike B.

[Edit: Corrected 17th-bit bug.]
[Edit #2: 17th-bit needs to be checked before the compare (a subtle but nasty little bug) ... corrected the correction.]

Dr Jefyll · Post by **Dr Jefyll** » Mon Apr 20, 2015 5:54 am

It looks right to me, Mike -- and a tidy piece of work it is! There is a way you could cut about 90 cycles off the execution time, but it adds about 30 bytes, and I have a feeling it's the memory economy that's your priority.

FWIW:

Right now you have a loop that iterates 16 times and contains a 4-byte (32-bit) shift.
But only 3 bytes are actually used each iteration. One byte isn't read or written but still gets shifted. That can be avoided.
The faster alternative is a loop that iterates 8 times and contains a 3-byte shift,
followed by another loop that iterates 8 times and contains a slightly different 3-byte shift.

To see what I mean, compare the middle and lower sections in the animation above. The middle section illustrates sixteen 4-byte shifts, and the bottom section illustrates eight 3-byte shifts followed by eight slightly different 3-byte shifts. The animation illustrates a few multiplication routines, discussed here, but the same shifting shortcut can be used for division. Were your division routine rewritten to exploit this, the first loop could omit shifting stackl+1,X and the second could omit shifting stackh+1,X.

cheers,
Jeff

ps: "TOS and TOS+1 are at fixed points in ZP." TOS resides someplace other than on stack??? I guess this relates to the split stack you're using. TOS is treated differently than the other stack items so it can be used as an indirect pointer -- is that right?

[Edit: add illustration; modify text]

barrym95838 · Post by **barrym95838** » Mon Apr 20, 2015 8:06 pm

Dr Jefyll wrote:

... "TOS and TOS+1 are at fixed points in ZP." TOS resides someplace other than on stack??? I guess this relates to the split stack you're using. TOS is treated differently than the other stack items so it can be used as an indirect pointer -- is that right?

That's Charlie's deal with his Pettil implementation. It involves some trade-offs. A favorable example: NIP becomes INX. There are pros and cons to any decision like that. For my 65m32 DTC Forth, I keep TOS in the accumulator, but that would be impossible for a 6502, and a dubious optimization for an '802/'816. Perhaps Charlie did some benchmarks that led to his design decision, and would be able to share them briefly with us, here or in a fresh thread?

Mike B.

[Addendum: Here's a tidied-up version for the '802/'816 ... same size, but slightly faster (?!), on average:

Code: Select all

;------------------------------------------------------
; UM/MOD for the 65802/816 by Michael T. Barry.
; Free to copy and/or modify, but without warranty.
;
; ( ud u1 -- u2 u3 )
;
; ud is the 32-bit dividend.
; u1 is the 16-bit divisor.
; u2 is the 16-bit remainder.
; u3 is the 16-bit quotient.
; All values are unsigned.
;
; Quotient overflow or /0 results in a quotient and
;   remainder of $FFFF.  Although $FFFF could be a
;   valid quotient, it cannot be a valid remainder,
;   so the caller may use this fact to check for an
;   invalid result.
;
; 0,x is TOS ( divisor -- quotient )
; 2,x is NOS ( hi dividend -- remainder )
; 4,x is 3OS ( lo dividend --  )
;
; Accumulator and memory are in 16-bit mode.
; Index registers are in 8-bit mode.
;
HEADER "UM/MOD", NOT_IMMEDIATE ; ( ud u -- rem quot )
        CODE            ; Machine code primitive.
umslashmod:
        ldy  #17        ; Init loop counter.
        lda  2,x        ; If (hi dividend >= divisor)
        cmp  0,x        ;   then we have an overflow
        bcc  umsm4      ;   condition, so we exit 
        lda  #$ffff     ;   early with $ffff in the
        sta  4,x        ;   quotient and remainder.
umsm2:                  ;
        sta  2,x        ; Save remainder in NOS.
        inx             ;
        inx             ; DROP
        jmp  swap       ; SWAP ;
umsm3:
        sbc  0,x        ; Update remainder
        sec
umsm4:
        rol  4,x        ; Shift carry into quotient.
        dey             ; Loop until done.
        beq  umsm2
        rol             ; Shift hi cell of dividend.
        bcs  umsm3
        cmp  0,x        ; If (hi dividend >= divisor)
        bcs  umsm3      ; then update remainder
        bcc  umsm4      ; else continue
;------------------- 37 bytes, not counting HEADER.
;Inspired by https://raw.githubusercontent.com/AcheronVM/acheronvm/master/src/ops-muldiv.asm

]
[Edit: Corrected 17th-bit bug.]

barrym95838 · Post by **barrym95838** » Sun May 03, 2015 9:37 pm

Garth found some time to check my routine, and found some strange errors with larger inputs. I believe that this is the bug report for the '802/'816 version, but he wasn't 100% clear on that, and I'm not sure if both of my attempts share the same bug(s):

Code: Select all

12345678 / 60000 says        64 rem 51534      This one in decimal.
                 should be  205 rem 45678      All others below are in hex.

12345678 / ABCD  says        42 rem  B9E
                 should be 1B20 rem 3DD8

 2345678 /   "   says        42 rem  B9E
                 should be  348 rem 9DD0

  345678 /   "   says        42 rem  B9E  (wrong again)
  123456 /   "   says         "  "    "      "     "

    1234 /   "   says         0 rem 1234   correct
    ABCD /   "   says         1 rem    0     "
    ABCE /   "   says         1 rem    1     "
   1ABCE /   "   says         2 rem 5434     "
  11ABCE /   "   says         2 rem 5434   wrong
                 should be   1A rem 38FC

   10000 / 1000  says        10 rem    0   correct
   20000 /   "   says        20 rem    0     "
  200000 /   "   says       200 rem    0     "
  200000 / 2000  says       100 rem    0     "
     "   / 4000  says        80 rem    0     "
     "   / 8000  says        40 rem    0     "
     "   / ABCD  says         0 rem    0   wrong
     "   / 8BCD  says         " rem    "     "
     "   / A000  says         " rem    "     "     Adding just the one bit (bit 13) in the divisor made it go wrong.  (It was ok with 8000.)

Does anyone feel like showing me the error of my ways? I seem to have lost the spark of inspiration that prompted me to post the attempts, and I don't know if I'll be able to find the bug soon enough to avoid more embarrassment.

Mike B.

[Edit: Since my quotient seems to wind up too low in the error cases, I am probably missing a 17th bit in my compare, and need to place a BCS umsm3 just after my ROL and an SEC just after my SBC 0,X. I'll give that a try and report back when I can. This potential fix would increase the '02 version to 64 bytes and the '802/'816 version to 37 bytes, which isn't bad, as long as it fixes the bug.]

barrym95838 · Post by **barrym95838** » Tue May 19, 2015 7:15 am

I tried the mods mentioned in the edit directly above and verified a half-width (8-bit accumulator) exhaustive test in about an hour under AppleWin at maximum speed on my old PC. I think that I can extrapolate full-width (16-bit accumulator) success without waiting about 1900 years for the exhaustive test at full-width using the same setup. The code above is corrected.

Mike B.

barrym95838 · Post by **barrym95838** » Mon Aug 03, 2015 6:12 am

Here's a short NMOS version without error checking, in only 46 bytes, using a standard fig-style stack implementation!

Code: Select all

;------------------------------------------------------
; UM/MOD for any 6502 by Michael T. Barry.
; Free to copy and/or modify, but without warranty.
;
; ( ud u1 -- u2 u3 )
; ud is the 32-bit dividend.
; u1 is the 16-bit divisor.
; u2 is the 16-bit remainder.
; u3 is the 16-bit quotient.
; All values are unsigned.
;
; Quotient overflow or /0 produces invalid results.
;
; 0,x and 1,x are TOS  ( divisor -- quotient )
; 2,x and 3,x are NOS  ( dividend:H -- remainder)
; 4,x and 5,x are 3OS  ( dividend:L --  )
;
umslashmod:
        ldy  #17        ;[2] Init loop counter
        clc             ;[2]
        bcc  umsm4      ;[3]
umsm2:                  ;
        rol  2,x        ;[6] Shift dividend:H
        rol  3,x        ;[6]
        bcs  umsm3      ;[2]* <<<<<<<<<<<<<<<<<<
        lda  2,x        ;[4]
        cmp  0,x        ;[4] If (17th bit clear) or
        lda  3,x        ;[4]   (dividend:H < divisor)
        sbc  1,x        ;[4] then
        bcc  umsm4      ;[2]*  skip remainder update
umsm3:                  ;
        lda  2,x        ;[4]
        sbc  0,x        ;[4]
        sta  2,x        ;[4] Update remainder
        lda  3,x        ;[4]  (gradually replacing
        sbc  1,x        ;[4]   dividend:H)
        sta  3,x        ;[4]
        sec             ;[2]
umsm4:                  ;
        rol  4,x        ;[6] Shift carry into quotient
        rol  5,x        ;[6]  (gradually replacing
        dey             ;[2]   dividend:L)
        bne  umsm2      ;[3]* Loop until done
        inx             ;[2] Exit sequence:
        inx             ;[2]   DROP the divisor
        jmp  SWAP       ;[34]* SWAP quotient and remainder
;------------------- 46 bytes, not counting HEADER

It passed my exhaustive half-width test as well, but still shouldn't be fully trusted until someone with a working 6502 Forth can try some of the test cases Garth detailed above.

Mike B.

barrym95838 · Post by **barrym95838** » Fri Aug 07, 2015 5:08 am

... and for you byte-squeezing freaks (you know who you are), I have 44-byte NMOS and 27-byte '802/'816 versions that pass the exhaustive half-width test as well:

Code: Select all

;------------------------------------------------------
; UM/MOD for any 6502 by Michael T. Barry.
; Free to copy and/or modify, but without warranty.
                        ;
; ( ud u1 -- u2 u3 )
; ud is the 32-bit dividend.
; u1 is the 16-bit divisor.
; u2 is the 16-bit remainder.
; u3 is the 16-bit quotient.
; All values are unsigned.
                        ;
; Invalid results for quotient overflow or /0
                        ;
; 0,x and 1,x are TOS  ( divisor -- quotient )
; 2,x and 3,x are NOS  ( dividend:H -- remainder )
; 4,x and 5,x are 3OS  ( dividend:L --  )
                        ;
umslashmod:             ;
        ldy  #16        ; loop counter
umsm2:                  ;
        asl  4,x        ; dividend:L is gradually replaced
        rol  5,x        ;   with the quotient
        rol  2,x        ; dividend:H is gradually replaced
        rol  3,x        ;   with the remainder
        bcs  umsm3      ;
        lda  2,x        ;
        cmp  0,x        ; compare remainder to divisor
        lda  3,x        ;
        sbc  1,x        ;
        bcc  umsm4      ;
umsm3:                  ;
        lda  2,x        ;
        sbc  0,x        ; if (remainder >= divisor) then
        sta  2,x        ;   update partial remainder
        lda  3,x        ;
        sbc  1,x        ;
        sta  3,x        ;
        inc  4,x        ;   and set low bit in quotient
umsm4:                  ;
        dey             ;
        bne  umsm2      ; Loop until done
        inx             ; Exit sequence:
        inx             ;   DROP the divisor
        jmp  SWAP       ;   SWAP quotient and remainder
                        ; (SWAP's machine code, not its CFA)
;------------------- 44 bytes, not counting HEADER

;------------------------------------------------------
; UM/MOD for the 65802/816 by Michael T. Barry.
; Free to copy and/or modify, but without warranty.
                        ;
; ( ud u1 -- u2 u3 )
; ud is the 32-bit dividend.
; u1 is the 16-bit divisor.
; u2 is the 16-bit remainder.
; u3 is the 16-bit quotient.
; All values are unsigned.
                        ;
; Invalid results for quotient overflow or /0
; Accumulator and memory are in 16-bit mode
; Index registers are in 8-bit mode
                        ;
; 0,x is TOS  ( divisor -- quotient )
; 2,x is NOS  ( dividend:H -- remainder )
; 4,x is 3OS  ( dividend:L --  )
                        ;
umslashmod:             ;
        ldy  #16        ; loop counter
        lda  2,x        ;
umsm2:                  ;
        asl  4,x        ; dividend:L is gradually replaced
                        ;   with the quotient
        rol             ; dividend:H is gradually replaced
                        ;   with the remainder
        bcs  umsm3      ;
        cmp  0,x        ; compare remainder to divisor
        bcc  umsm4      ;
umsm3:                  ; if (remainder >= divisor) then
        sbc  0,x        ;   update partial remainder
        inc  4,x        ;   and set low bit in quotient
umsm4:                  ;
        dey             ;
        bne  umsm2      ; Loop until done
        sta  2,x        ;
        inx             ; Exit sequence:
        inx             ;   DROP the divisor
        jmp  SWAP       ;   SWAP quotient and remainder
                        ; (SWAP's machine code, not its CFA)
;------------------- 27 bytes, not counting HEADER

Mike B.

[Edit: 44 bytes, not 45!]

integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication