6502.org

Posted: **Thu Oct 14, 2010 1:58 am**

chitselb wrote:

Here it is in all it's "I just now got it working" glory, my 32-bit unsigned division code. My first thought is, it's gigantic.

A couple thoughts about this routine I never got around to posting:

1. Is it really necessary to save/restore zi, zi+1, etc. on the stack?

2. You can pre-shift the numerator left rather the denominator right. Then you're only subtracting two bytes inside the (divmod02) loop instead of four. See also the UM/MOD routine (though it doesn't do any pre-shifting) at:

http://6502org.wikidot.com/errata-software-figforth

Posted: **Thu Oct 14, 2010 1:18 pm**

dclxvi wrote:

1. Is it really necessary to save/restore zi, zi+1, etc. on the stack?

The zi...zi+3 is a 4-byte zeropage area where the innermost DO LOOP index and limit are kept, for speed. So division within a loop would require this. I also took another look at Blazin' Forth's UM/MOD and noticed that it gets the job done properly with a 32/16=16R16 approach (my test is -1 -1 UD. should print 4294967295)

Posted: **Mon Oct 18, 2010 5:59 am**

Quote:

On a semi-related note, I am thinking about implementing a complete suite of floating point, based on the simple expedient of calling the PET ROM math routines and passing things around as 24-bit values on the stack.

I will mention that there are small (18- & 28-pin DIP & SOIC), serially interfaced 32-bit floating-point coprocessors avaiable at http://www.awce.com/pak1.htm .

Posted: **Sat Feb 04, 2012 7:31 am**

Doggone! How 'bout a comparison? I just got the equivalent code for UM/MOD for the PIC16 microcontrollers, ie, unsigned 32-bit/16-bit with 16-bit quotient and remainder, from Microchip's Embedded Control Handbook update: 237 instructions to do what the 6502 does in 35!

Posted: **Thu Mar 05, 2015 10:54 pm**

In the wake of folding Mike's excellent 40/MOD code and rewriting UM/MOD (unsigned 32/16 = rem16 quot16), I started giving the fisheye to UD/MOD (unsigned 32/16 = rem16 quot32) but couldn't figure out how to get rid of it because D. and DU. both require the 32-bit quotient to work.

Any ideas?

Posted: **Fri Mar 06, 2015 5:11 am**

Give it the fisheye? Does that mean get rid of it? It doesn't take much memory when defined in high-level, in terms of UM/MOD, as opposed to making it a primitive. I did it in assembly for the PIC16 for a big work project last year because I couldn't find what I needed in Microchip's pre-written toolbox. I kind of sprained my head, but I finally got it done. I found that you need the whole dividend to be the number of bits in the divisor plus the number of bits in the quotient; so for UD/MOD, you'll pad the input to get there. Something that may help, if indeed you want to do UD/MOD as a primitive for U. and friends, is that it's only used for dividing by 16 or less, never a number over 255.

Posted: **Fri Mar 06, 2015 5:57 am**

Camel Forth (thanks to Dr. Brad):

Code: Select all

: UD/MOD  ( ud1 u2 -- u3 ud4 ) \ 32/16->32 divide
  >R 0 R@ UM/MOD ROT ROT R> UM/MOD ROT
;

Mike B.

[Edit: Looks kinda slow ... I'll see if I can conjure up a much faster primitive, but I doubt that it'll be as space-efficient. Are you pressed for space?]

[Edit: I'll give you a rough prediction of about 70-something bytes of native 6502 code for the primitive. Say the word, and I'll give it the old college try ... ]

Posted: **Fri Mar 06, 2015 8:31 am**

I should have said stinkeye meaning I would like to get rid of the UD/MOD (32-bit quotient) primitive. There will be a UM/MOD (16-bit quotient) primitive, but having both is overkill. I'll build / /MOD FM/MOD */MOD as secondaries on top of UM/MOD .

Implementing UD/MOD in high-level Forth would substantially slow down numeric output. Right now it takes 81 jiffies (60ths of a second) to perform : FOO 100 0 DO I . LOOP ; in PETTIL, and I haven't tried it in Blazin' Forth. ( JIFFY@ FOO JIFFY@ D- D. )

The *only* place in the code where UD/MOD is *ever* called is from within # (sharp) for numeric output. It's fair to require that 2<=BASE<=36 because [0..9A..Z] are the only characters to (reasonably) output a number in human-readable form on a PET. Of these, I commonly use only HEX and DECIMAL. A special-purpose helper primitive will probably do the trick. I'm calling it (#) ( ud1 base -- ud2 'digit' ) pronounced 'paren-sharp'. This inner (#) primitive will JSR to reuse the same code that's in the 40/MOD primitive, but with BASE @ (instead of 40) as the divisor. I'll have to go back to the fat version of 40/MOD in case the BASE is an odd number, but no biggie there. Here's the high-level code for sharp:

Code: Select all

: #  (+d1 -- +d2 )
\ d1 is divided by BASE and the quotient is d2. 
\ The remainder is appended to the numeric output string via HOLD
BASE @  ( +d1 base )
(#)  ( +d2 'digit' )
HOLD ; ( +d2 )

And here's the (currently working) sharp code
: #   BASE @  UD/MOD ROT >6502  
      TOS LDA,
      SED,
      10 # CMP,
      ASCII 0 # ADC,
      CLD,
      TOS STA,
      TOFORTH JSR,
  HOLD ;

Posted: **Fri Mar 06, 2015 12:21 pm**

barrym95838 wrote:

Camel Forth (thanks to Dr. Brad):

Code: Select all

: UD/MOD  ( ud1 u2 -- u3 ud4 ) \ 32/16->32 divide
  >R 0 R@ UM/MOD ROT ROT R> UM/MOD ROT
;

Wouldn't it be better to do it the other way around?
ie.

Code: Select all

: UM/MOD UD/MOD DROP ;

My own UM/MOD routine (shown below) pretty much did 32-bit division although it only pushed the lower 16-bit result on the stack. It would be a simple matter to make it a UD/MOD instead (and I am thinking of doing that).

BTW with Forth-79, Forth-83, Fig Forth and ANSI Forth (not to mention all the individuals who make up their own standards) there are too many standards to make it worth while to just pick one. I prefer to just provide UM* and UM/MOD (or UD/MOD) and leave it to the user to deal with signed multiplication and division.

Code: Select all

;
; UM/MOD -      ( ud u1 -- u2 u3) 
;               An unsigned division of the 32 bit IACC (ud) 
;               by the 32 bit IOPR (u1 extended to 32 bits)
;               REMAINDER IS u2 (IACCl)
;		AND QUOTIENT IS u3 (IQUOTl moved to IACCh)
;		 
UMDMOD  TXA                     ;      +---------------+
        SEC                     ;  +12 | h         udl |
        SBC     #6              ;  +11 | l          u2 | IACCl
        TAX                     ;      +---------------+
        LDA     #0              ;  +10 | h         udh |
        STA     STACK+1,X       ;   +9 | l             | IACCh
        STA     STACK+2,X       ;      +---------------+
        STA     STACK+3,X       ;   +8 | h          u1 |
        STA     STACK+4,X       ;   +7 | l             | IOPRh
        STA     STACK+5,X       ;      +---------------+
        STA     STACK+6,X       ;   +6 | h           0 | 
        LDY     #16             ;   +5 | l             | IOPRl
; IOPR shifted 16 bits already  ;      +---------------+
_SHIFTL LDA     STACK+8,X       ;   +4 | h          u3 |
        BMI     _LOOP           ;   +3 | l             | IQUOTl
        ASL     STACK+7,X       ;      +---------------+
        ROL     STACK+8,X       ;   +2 | h           0 |
        INY                     ;   +1 | l             | IQUOTh
        BNE     _SHIFTL         ;      +---------------+
_LOOP   LDA     STACK+10,X      ;  X-->|               |
        CMP     STACK+8,X       ;
; Check if IOPR  > IACC         ;
        BCC     _SHIFTR         ;
        BNE     _SUBTR          ;
        LDA     STACK+9,X       ;
        CMP     STACK+7,X
        BCC     _SHIFTR
        BNE     _SUBTR
        LDA     STACK+12,X
        CMP     STACK+6,X
        BCC     _SHIFTR
        BNE     _SUBTR
        LDA     STACK+11,X
        CMP     STACK+5,X
        BCC     _SHIFTR
_SUBTR  SEC                     ; If IOPR <= IACC THEN
        LDA     STACK+11,X      ; subtract IOPR from IACC
        SBC     STACK+5,X       ; and shift 1 into IQUOT
        STA     STACK+11,X
        LDA     STACK+12,X
        SBC     STACK+6,X
        STA     STACK+12,X
        LDA     STACK+9,X
        SBC     STACK+7,X
        STA     STACK+9,X
        LDA     STACK+10,X
        SBC     STACK+8,X
        STA     STACK+10,X
_SHIFTR ROL     STACK+3,X       ; else shift 0 into IQUOT
        ROL     STACK+4,X
        ROL     STACK+1,X
        ROL     STACK+2,X
        LSR     STACK+8,X       ; shift IOPR right
        ROR     STACK+7,X
        ROR     STACK+6,X
        ROR     STACK+5,X
        DEY                     ; continue counted shifts and subtracts
        BPL     _LOOP
        LDA     STACK+3,X       ; copy IQUOT(L) TO right spot on stack
        STA     STACK+9,X       ; (remainder already in right place)
        LDA     STACK+4,X
        STA     STACK+10,X
        TXA                     ; CLEAN UP STACK
        CLC
        ADC     #8
        TAX
        RTS

Posted: **Fri Mar 06, 2015 4:27 pm**

theGSman wrote:

Wouldn't it be better to do it the other way around?
ie.

Code: Select all

: UM/MOD UD/MOD DROP ;

That looks a lot cleaner, but would slow down UM/MOD considerably. If you have horsepower to spare, it's no big deal, but CH is using "quaint" hardware, and UD/MOD is a bit expensive.

Mike B.

Posted: **Fri Mar 06, 2015 4:50 pm**

The UD/MOD base primitive approach was what I first considered and implemented. I came across this in VolksForth, which builds UD/MOD on top of UM/MOD , instead of the other way around. This is cool because, in my opinion, 32/32-bit division is seldom invoked and will be even more rare in my own application code, so I don't want to punish the user on every occurrence of division by crunching the whole 32-bits using UD/MOD as the basic primitive. Also UD/MOD isn't part of the Fig, Forth-79, Forth-83 or ANS standards, while UM/MOD is in the required wordset for Forth-83, aka U/MOD in Forth-79.

Code: Select all

  : ud/mod ( ud1 u2 -- urem udquot )
     >r   0 r@ um/mod   r> swap >r   um/mod r> ;

For general purpose division, I'm thinktanking an approach to UM/MOD that shifts the divisor left until it's >= the dividend, then do the usual trial-subtract from the dividend and right-shift the divisor thing until the divisor is back to its original alignment. Counting the divisor shifts and counting them back down. If something like this already exists, or anybody here is bored, or if this is a really dumb idea, I'm all ears. The reason for this is that numerator and denominator values are small values most of the time, and operating on all those leading zero bits is a waste of CPU. If somewhat bigger, more complicated code can avoid iterating through trial-subtract for each bit alignment every time through, I'd consider replacing the smaller, simpler UM/MOD code.

Most commonly I'll be using 40/MOD which is a specialty division by the constant 40, to index into the 40-column wide screen on a PET or C=64, and for unpacking text encoded in the RADIX-50 scheme. Pictured numeric output e.g. <# #S #> also requires division, so I'll overload the 40/MOD primitive a bit to work with BASE @ as the divisor for that purpose. All the other standard general-purpose division and modulo words, including UD/MOD will be built on top of UM/MOD. I'm going to use floored division (remainders are always positive) as per the Forth-83 standard.

Posted: **Fri Mar 06, 2015 5:26 pm**

barrym95838 wrote:

That looks a lot cleaner, but would slow down UM/MOD considerably.

Unfortunately, my routine already has that speed penalty since I couldn't think of a way to do the division without making the divisor 32 bits anyway.

The only slowdown for my code would be the 2 INX instructions (4 cycles) and the JSR/RTS pair (6+6 cycles) for a total of 16 cycles. The UD/MOD instruction would also have a additional two LDA/STA instructions adding a further 16 cycles. That's small bikkies when you consider all the compares and shifts involved in the division.

Posted: **Fri Mar 06, 2015 5:34 pm**

chitselb wrote:

For general purpose division, I'm thinktanking an approach to UM/MOD that shifts the divisor left until it's >= the dividend, then do the usual trial-subtract from the dividend and right-shift the divisor thing until the divisor is back to its original alignment. Counting the divisor shifts and counting them back down. If something like this already exists, or anybody here is bored, or if this is a really dumb idea, I'm all ears. The reason for this is that numerator and denominator values are small values most of the time, and operating on all those leading zero bits is a waste of CPU.

That is pretty much how my routine works (see above) except that I shifted the divisor all the way to the MSB (simpler than than comparing the relative sizes after each shift and not much of a time penalty). As a bonus, the division is done entirely on the stack which improves re-entrancy.

Unfortunately, my approach makes the whole thing slower when dealing with small divisors.

Posted: **Fri Mar 06, 2015 6:00 pm**

Skipping over 8 leading 0-bits at a time during the initial shift-left phase is a matter of a well-placed BEQ instruction

Posted: **Sat Mar 07, 2015 5:44 am**

Hey guys,

My initial stab at UD/MOD came out better than I expected: 56 bytes, ~18xx cycles (plus two ROTs)
This is completely untested!

Code: Select all

; - - - - - - - - - - - - - - - - - - - - - - - - - - -
; UD/MOD  ( ud1 u1 -- u2 ud2 )
; Dividend ud1 and Quotient ud2 are 32-bits unsigned
; Divisor u1 and Remainder u2 are 16-bits unsigned
; Invalid inputs and/or outputs are silently ignored
;
udslashmod:
        lda  #0
        sta  N          ;initialize partial remainder
        sta  N+1        ;  in N, N+1
        ldy  #32        ;loop counter
udsm2:
        asl  stackl+1,x ;Dividend in NOS:3OS (h:l)
        rol  stackh+1,x ;  is gradually replaced
        rol  stackl,x   ;  with the quotient
        rol  stackh,x
        rol  N          ;N is gradually replaced
        rol  N+1        ;  with the remainder
        lda  N
        cmp  TOS        ;TOS holds divisor
        lda  N+1        ;partial remainder >= TOS?
        sbc  TOS+1
        bcc  udsm3
        sta  N+1        ;  yes: update the partial
        lda  N          ;    remainder and set the
        sbc  TOS        ;    low bit in the partial
        sta  N          ;    quotient
        inc  stackl+1,x
udsm3:
        dey
        bne  udsm2      ;loop 32 times
        lda  N
        ldy  N+1        ;remainder goes to TOS
        jsr  put
        jsr  rot        ;ROT
        jmp  rot        ;ROT ;

We could make it about 10% faster by copying NOS and 3OS to N+... and back, but that would make the code more than 10% longer, and I don't think that it's worth the effort.
Please let me know if I goofed up somewhere, and I'll correct it here.

Mike B.

[Edit: adjusted stackl and stackh offsets to the way I think CH is doing it.]

[Edit: fixed obvious N->TOS bug, borrowed CH's PUT, and adjusted code size claim to 56 bytes.]

6502.org

integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication

Re: integer division and multiplication