integer division and multiplication

barrym95838 · Post by **barrym95838** » Sun Mar 08, 2015 4:56 am

: #       ( +d1 -- +d2)
     base @ ud/mod rot dup 9 > 7 and + Ascii 0 + hold ;

If this is the ONLY use you have for UD/MOD in Pettil, it might be better to just code # as a primitive, and embed a tailor-made (optimized just for #) version of UD/MOD. I think that's kind of what you were suggesting earlier, but it just now sank in. I'll whip up a primitive for # and post it here when I get it at least half-baked.

[Update: Something like this (36 bytes, 11xx cycles):

Code: Select all

; - - - - - - - - - - - - - - - - - - - - - - - - - - -
; #  ( ud1 -- ud2 )
; Extract least significant digit of ud1, convert to
;   ASCII and save that digit in HOLD area.
; ud2 is the quotient of ud1 / BASE
; Undefined behavior for a BASE outside [2..20]
;
sharp:
        lda  #0         ;partial remainder
        ldy  #32        ;loop counter
sharp2:
        asl  stackl,x   ;Dividend in TOS:NOS (h:l)
        rol  stackh,x   ;  is gradually replaced
        rol  TOS        ;  with the quotient
        rol  TOS+1      ;a is gradually replaced
        rol             ;  with the remainder
        cmp  BASE       ;partial remainder < BASE?
        bcc  sharp3     ;  no: update the partial
        sbc  BASE       ;    remainder and partial
        inc  stackl,x   ;    quotient
sharp3:
        dey
        bne  sharp2     ;loop 32 times
        cmp  #10
        sed
        adc  #'0'       ;adjust remainder to ASCII
        cld
        jsr  pushya
        jmp  hold       ;HOLD ;

UNTESTED!]

Mike B.

[Edit: It seems kind of silly for UD/MOD to end with two ROTs, especially after I find that its only application in the standard word set is followed by a third ROT!]

[Edit: Improved comments, corrected pushay -> pushya.]

theGSman · Post by **theGSman** » Mon Mar 09, 2015 6:48 am

barrym95838 wrote:

[Edit: It seems kind of silly for UD/MOD to end with two ROTs, especially after I find that its only application in the standard word set is followed by a third ROT!]

I'm surprised that you don't have a -ROT coded for such a purpose. I have found that I use this word so often that it is worthy of being a system word.

2PICK is also a worthy stack operator in its own right.

barrym95838 · Post by **barrym95838** » Mon Mar 09, 2015 7:25 am

If by "system word" you mean machine-coded primitive, then you are most likely correct that it could be justified. I am quite new to Forth, so I don't really know how to juggle yet. My interest in Forth stemmed from my attempt to button-up the instruction set on my experimental hobby processor design (by writing a Forth for it). I must say that it's an interesting language, but I'm running into the old problem of "too many pans in the fire", so I need to pace myself or face a nervous breakdown from all the directions in which my brain is getting pulled.

Code: Select all

; DTC 65m32 Forth
; a = TOS, x = PSP, y = IP, s = RSP, u = UP

        NOT_IMM '2PICK'
_2pick:  ; ( x1 x2 x3 -- x1 x2 x3 x1 )
        sta  ,-x
        lda  2,x
        jmp  (,y+)      ; NEXT

        NOT_IMM '-ROT'
_mrot:  ; ( x1 x2 x3 -- x3 x1 x2 )
        exa  1,x
        jmp  _swap

        NOT_IMM 'PICK'
_pick:  ; ( xu ... x2 x1 x0 u -- xu ... x2 x1 x0 xu )
        add  #,x
        jmp  _fetch

Mike B.

theGSman · Post by **theGSman** » Mon Mar 09, 2015 12:21 pm

barrym95838 wrote:

If by "system word" you mean machine-coded primitive, then you are most likely correct that it could be justified. I am quite new to Forth, so I don't really know how to juggle yet. My interest in Forth stemmed from my attempt to button-up the instruction set on my experimental hobby processor design (by writing a Forth for it). I must say that it's an interesting language, but I'm running into the old problem of "too many pans in the fire", so I need to pace myself or face a nervous breakdown from all the directions in which my brain is getting pulled.

I know what you mean about time management.

By "system word" I mean those words that are already available when you first fire up your forth system. You don't need to stress too much about which words you should include in your system since any words not already present can easily be coded. The makers of EFORTH claim that you can derive all your Forth words from a very small set of primitives. (For example, every single arithmetic operation can be derived from UM+ and XOR). You can even get away with not including an assembler since CODE words are just bytes COMMA'd in (the initial hand assembly might be a drag though).

So you can just have a skeletal Forth which loads additional words from files. I am thinking of going this way once my file management is completed.

GARTHWILSON · Post by **GARTHWILSON** » Mon Mar 09, 2015 7:26 pm

Quote:

The makers of EFORTH claim that you can derive all your Forth words from a very small set of primitives. (For example, every single arithmetic operation can be derived from UM+ and XOR).

True, but the performance will be terrible. They used to have such discussions back in the years when Forth Dimensions magazine was still being published, and the minimum seemed to be about 30 primitives; but again at a severe performance penalty. I believe the purpose was for maximum portability, ie, that there's little to re-write if you go to another processor. In the example theGSman gave above of -ROT, it's four times as fast in ITC as a primitive as it is as a secondary. For multiplication and division, I would expect an even much greater difference.

theGSman · Post by **theGSman** » Mon Mar 09, 2015 8:59 pm

If I needed an arithmetic or stack operator word that didn't exist in my system then I would be coding it in assembly since it usually is almost as easy. Of course, knowing that you can quickly whip up a high level version of the word for testing purposes is still an advantage (and you may elect to keep it that way if time is not an issue).

The point was that you don't have to agonize over which words you include in your Forth system since Forth is so easily extendable.

Brad R · Post by **Brad R** » Mon Mar 09, 2015 9:23 pm

GARTHWILSON wrote:

They used to have such discussions back in the years when Forth Dimensions magazine was still being published, and the minimum seemed to be about 30 primitives; but again at a severe performance penalty. I believe the purpose was for maximum portability, ie, that there's little to re-write if you go to another processor.

Correct. (We used to have these discussions at Forth conferences, too.) eForth was designed to have only about 30-32 primitives, as I recall, for quickest porting. But its multiplication was extremely slow.

For CamelForth I wanted a small set of primitives, for ease of porting, but large enough to give good performance. So I ended up with about 70 primitives (including unsigned multiply and divide).

At the other extreme is gForth, which I think has hundreds and hundreds of primitives.

nyef · Post by **nyef** » Tue Mar 10, 2015 2:06 am

Brad R wrote:

GARTHWILSON wrote:

They used to have such discussions back in the years when Forth Dimensions magazine was still being published, and the minimum seemed to be about 30 primitives; but again at a severe performance penalty. I believe the purpose was for maximum portability, ie, that there's little to re-write if you go to another processor.

Correct. (We used to have these discussions at Forth conferences, too.) eForth was designed to have only about 30-32 primitives, as I recall, for quickest porting. But its multiplication was extremely slow.

For CamelForth I wanted a small set of primitives, for ease of porting, but large enough to give good performance. So I ended up with about 70 primitives (including unsigned multiply and divide).

At the other extreme is gForth, which I think has hundreds and hundreds of primitives.

As a data point, my current 32-bit x86 ITC Forth (not particularly optimized in any direction) has 73 CODE words, seven of which are for a Linux syscall interface.

barrym95838 · Post by **barrym95838** » Tue Mar 10, 2015 2:38 am

Hey guys,

After staring at my UD/MOD for longer than I should have (because it hasn't been properly tested yet) I came up with this for my second stab ... 49 code bytes, and still roughly 1800 cycles:

Code: Select all

; - - - - - - - - - - - - - - - - - - - - - - - - - - -
; UD/MOD  ( ud1 u1 -- u2 ud2 )
; Dividend ud1 and Quotient ud2 are 32-bits unsigned
; Divisor u1 and Remainder u2 are 16-bits unsigned
; Invalid inputs and/or outputs are silently ignored
;
udslashmod:
        ldy  #32        ; init loop counter
        lda  #0         ; init partial remainder
        sta  N          ;   in N:A (h:l)
udsm2:
        asl  stackl+1,x ; dividend in NOS:3OS (h:l)
        rol  stackh+1,x ;   is gradually replaced
        rol  stackl,x   ;   with the quotient
        rol  stackh,x
        rol             ; N:A is gradually replaced
        rol  N          ;   with the remainder
        pha
        cmp  TOS        ; TOS holds the divisor
        lda  N          ; partial remainder >= TOS?
        sbc  TOS+1
        bcc  udsm3
        sta  N          ;   yes: update the partial
        pla             ;     remainder and set the
        sbc  TOS        ;     low bit in the partial
        pha             ;     quotient
        inc  stackl+1,x
udsm3:
        pla
        dey
        bne  udsm2      ; loop 32 times
        ldy  N
        jsr  put
        jsr  rot
        jmp  rot

I golfed it up the best I could ... but it's still untested. Any volunteers?

Mike B.

chitselb · Post by **chitselb** » Tue Mar 10, 2015 2:13 pm

Mike,

I'm looking at your code and my inner 6502 Lee Trevino is wondering if it's even possible to rearrange things a bit so that the asl stackl+1,x and inc stackl+1,x are combined into this single instruction: rol stackl+1,x (after setting up the C flag with the compare/subtract). I fooled around with it for a couple hours in the VICE PET debugger, no joy yet.

barrym95838 · Post by **barrym95838** » Tue Mar 10, 2015 3:11 pm

Yeah, it's so tempting to combine the inc and the asl into an rol, but I have found that it's harder than it looks, because ... well, you probably know, since you've been trying it too. I was actually successful in doing it in my 19-byte 40/MOD, but only by sacrificing the possibility of odd divisors.

I'm a Tom Watson fan, by the way. I hope that you don't hold that against me.

Mike B.

[Update: Now slightly faster, with a BIT zp naked op-code trick ... should be safe on >99% of the hardware out there in the field:

Code: Select all

; UD/MOD  ( ud1 u1 -- u2 ud2 )
; Dividend ud1 and Quotient ud2 are 32-bits unsigned
; Divisor u1 and Remainder u2 are 16-bits unsigned
; Invalid inputs and/or outputs are silently ignored
;
udslashmod:
        ldy  #32        ; init loop counter
        lda  #0         ; init partial remainder
        sta  N          ;   in N:A (h:l)
udsm2:
        asl  stackl+1,x ; dividend in NOS:3OS (h:l)
        rol  stackh+1,x ;   is gradually replaced
        rol  stackl,x   ;   with the quotient
        rol  stackh,x
        rol             ; N:A is gradually replaced
        rol  N          ;   with the remainder
        pha
        cmp  TOS        ; TOS holds the divisor
        lda  N          ; partial remainder >= TOS?
        sbc  TOS+1
        bcc  udsm3
        sta  N          ;   yes: update the partial
        pla             ;     remainder and set the
        sbc  TOS        ;     low bit in the partial
        inc  stackl+1,x ;     quotient
        .db  $24        ;   BIT zp naked opcode
udsm3:
        pla
        dey
        bne  udsm2      ; loop 32 times
        ldy  N
        jsr  put
        jsr  rot
        jmp  rot

]

chitselb · Post by **chitselb** » Wed Mar 11, 2015 1:22 pm

Using BIT zp takes 3 cycles, 1 byte. Using PHA takes 3 cycles, 1 byte. Which more clearly describes the intent of the code? Just sayin'

Dr Jefyll · Post by **Dr Jefyll** » Wed Mar 11, 2015 3:12 pm

chitselb wrote:

Using BIT zp takes 3 cycles, 1 byte. Using PHA takes 3 cycles, 1 byte.

Hmmm, yes... But the BIT zp causes the following PLA not to execute, so the net result is a 4 cycle saving doing it Mike's way.

barrym95838 wrote:

a BIT zp naked op-code trick ... should be safe on >99% of the hardware out there in the field

Right -- the safety issue arises because BIT zp touches an address which may contain a read-sensitive IO register (example: the IFR on a 6522). But, since it's just a single byte you're skipping over, not two, you can avoid the danger entirely -- and save 1 more cycle -- by substituting a different naked opcode. Use BIT immediate (or CMP/CPX/CPY immediate if you prefer).

barrym95838 · Post by **barrym95838** » Wed Mar 11, 2015 3:26 pm

Correct on both points, as usual, Jeff. The idea was to save 4 cycles for every '1' bit in the quotient by doing a BIT $68 instead of a PHA PLA. That .db $24 should be replaced with a .db $c9 (CMP #) for an additional (very slight) speed increase, and 100% compatibility.

Thank you sirs,

Mike B.

[Edit: Unless someone feels like busting my balloon, I must say that it feels really good straying off the beaten (UD/MOD) path and finding a little nugget in the rough, on a processor that has been around for so long. It feels like I missed the fairway on my tee shot, but came up with a birdie anyway.]

[Edit 2: Looking at CH's github source, it appears that I shouldn't JSR to put, because it ends with NEXT, not RTS. There is a -ROT though, so the end should probably change from:

Code: Select all

        jsr  put
        jsr  rot
        jmp  rot

to

Code: Select all

        sta  TOS
        sty  TOS+1
        jmp  dashrot

]

chitselb · Post by **chitselb** » Mon Mar 16, 2015 4:06 pm

Here's the (nominally tested) UD/MOD code.

It weighs in at 53 or 67 bytes, depending on whether it uses PETTIL's extended Sweet-16 to push the results back on the stack. The inner loop executes 32.5 times (we start in the middle) at either [54] or [59] clocks, or [1836] clocks total (on average).

The one-time setup/teardown code includes an additional performance penalty of 17+27x2 = [71 clocks] for 'jsr locals', to move the Dividend off of the stack, but then the 'ROR zp' (x6 x33 x1 clock) saves a total of [196 clocks] more than making up the difference. To return everything in the proper order requires moving the Dividend out of the way anyway. It seems a little crazy to squeeze a bunch of clocks with a tight loop and then squander them on Sweet-16, but it saves 14 bytes vs. doing the one-time exit code in assembler.

Code: Select all

UD/MOD   ( ud1 u2 – u3 ud4 )
udslashmod
        ldy #2
        jsr locals        ; dividend to n0..n3
        ;clc
        ;ldy #0           ; locals does this for us
        stx storex
        ldx #33
        tya
        beq udslashmodb   ; bra to initialization code
udslashmoda
        rol n+4           ; [5] I don't know what this 2-bytes is called
        rol n+5           ; [5] trial minuend or something like that
        sec               ; [2]
        lda n+4           ; [3]
        sbc tos           ; [3]
        tay               ; [2]
        lda n+5           ; [3]
        sbc tos+1         ; [3]
        bcc udslashmodc   ; save or abandon the trial subtraction
udslashmodb
        sty n+4           ; [2+3+3] or [3]?
        sta n+5           ; treating this as [~6] clocks on average
udslashmodc
        rol n+0           ; [5]
        rol n+1           ; [5]
        rol n+2           ; [5]
        rol n+3           ; [5]
        dex               ; [2]
        bne udslashmoda   ; [3] 
                ; [54] or [59] clocks for 33x main loop = [1839]

        ldx storex

; this is the end [22-bytes]
        ldy n+5
        sty tos+1
        lda n+4
        sta tos           ; put remainder on stack
        ldy n+1
        lda n+0
        jsr push6502      ; push quotient(lo) on stack
        ldy n+3
        lda n+2
        jmp pushya        ; push quotient(hi) on stack

; alternate ending [8-bytes]
        brk
        .byt ld | N2
        .byt st | TOS
        .byt ld | N0
        .byt push
        .byt ld | N1
        .byt push
        .byt nxt

integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication