A 6502 divide routine

wsimms · Post by **wsimms** » Tue Dec 31, 2024 3:04 pm

I want to present an integer division routine I have come up with. I have to imagine the algorithm is already well known, but I am too lazy to go find the prior art. The reason I am presenting this algorithm is because it seems to be common sentiment online that 6502 divide routines are difficult to understand, but this one seems quite simple to me and is reasonably fast.

The routine as presented divides a 15-bit unsigned dividend by a 15-bit unsigned divisor yielding a 15-bit unsigned quotient and 15-bit unsigned remainder. It is fairly trivial to produce a wrapper routine that handles 16-bit signed dividends and divisors yielding 16-bit signed quotients and remainders. That can also be the place where division by zero is handled.

First, a C implementation of the algorithm. I have tested this version by comparing its results to the results of the C '/' and '%' operators for all valid values of dividend and divisor:

Code: Select all

uint16_t uquotient;
uint16_t uremainder;

/*
  unsigned divide
  The largest possible multiples of the the divisor by a power of 2
  are subtracted from the dividend and the corresponding partial quotients
  (the powers of 2 by which the divisor is multiplied) are added to the
  quotient. Whatever is leftover is the remainder.
 */

void udiv(uint16_t dividend, uint16_t divisor)
{
    uint16_t partial_quotient = 1;
    uquotient = 0;

    // double the divisor until it exceeds the dividend. this is ok
    // because the divisor and dividend are no more than 15 bits long

    while (divisor < dividend) {
        divisor <<= 1;
        partial_quotient <<= 1;
    }

    while (1) {

        // halve the divisor until it is less than the dividend
        // this yields the largest multiple of the divisor by a power of 2
        // that can be subtracted from the dividend with a positive difference
	
        while (dividend < divisor) {
            divisor >>= 1;
            partial_quotient >>= 1;
            if (partial_quotient == 0) goto done;
        }

        // subtract the multiple of the divisor from the dividend and add
        // the corresponding power of 2 (partial quotient) to the quotient
	
        dividend -= divisor;
        uquotient += partial_quotient;
    }

done:
    uremainder = dividend;
}

Next, a basic 6502 version. I have only tested this minimally, using an Apple II emulation program (Virtual II) and ORCA/M assembler. Unfortunately it is difficult to export the assembler source files, so this is retyped (hopefully without typos):

Code: Select all

rmdr 	  
dend    ds    2
dsor    ds    2
quot    ds    2
pquo    ds    2

udiv    lda   #1
        sta   pquo
        lda   #0
        sta   pquo+1
        sta   quot
        sta   quot+1

;; double the divisor until it exceeds the dividend. this is ok
;; because the divisor and dividend are no more than 15 bits long

L0      lda   dsor+1
        cmp   dend+1
        bcc   L1
        bne   L2
        lda   dsor
        cmp   dend
        bcs   L2
L1      asl   dsor
        rol   dsor+1
        asl   pquo
        rol   pquo+1
        jmp   L0

;; halve the divisor until it is less than the dividend
;; this yields the largest multiple of the divisor by a power of 2
;; that can be subtracted from the dividend with a positive difference

L2      lda   dend+1
        cmp   dsor+1
        bcc   L3
        bne   L4
        lda   dend
        cmp   dsor
        bcs   L4
L3      lsr   dsor+1
        ror   dsor
        lsr   pquo+1
        ror   pquo
        bne   L2
        lda   pquo+1
        bne   L2
        rts

;; subtract the multiple of the divisor from the dividend and add
;; the corresponding power of 2 (partial quotient) to the quotient

L4      sec
        lda   dend
        sbc   dsor
        sta   dend
        lda   dend+1
        sbc   dsor+1
        sta   dend+1
        clc
        lda   pquo
        adc   quot
        sta   quot
        lda   pquo+1
        adc   quot+1
        sta   quot+1
        jmp   L2

P.S.: Edited twice to correct mistakes

drogon · Post by **drogon** » Tue Dec 31, 2024 3:47 pm

Interesting. Thanks. I'll have a look next year ...

However I think the JMP L0 at the very end might ought to be JMP L2 ?

-Gordon

GARTHWILSON · Post by **GARTHWILSON** » Tue Dec 31, 2024 3:58 pm

wsimms wrote:

I have to imagine the algorithm is already well known, but I am too lazy to go find the prior art.

Welcome. This link, if I formed it right, should take you directly to the relevant section of the source-code repository on this site:
http://6502.org/source/#integermath

wsimms · Post by **wsimms** » Tue Dec 31, 2024 4:02 pm

drogon wrote:

Interesting. Thanks. I'll have a look next year ...

However I think the JMP L0 at the very end might ought to be JMP L2 ?

You are correct. I have edited it.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Tue Dec 31, 2024 10:55 pm

wsimms wrote:

I want to present an integer division routine I have come up with...

Welcome to 6502-land.

Quote:

I have to imagine the algorithm is already well known, but I am too lazy to go find the prior art.

There are several such algorithms in existence, although I’d be hard-pressed to single out any one of them as “the best.” Most are based on the same principle, which is repetitive dividend rotations and trial subtractions.

Quote:

The reason I am presenting this algorithm is because it seems to be common sentiment online that 6502 divide routines are difficult to understand, but this one seems quite simple to me and is reasonably fast...

6502 division seems hard to understand because the methods usually used are not exactly like the long division method we learned in school. Not helping is the 6502 can only add, subtract and bit-shift, whereas we humans (supposedly) know how to multiply and divide.

It is a rare case in which an algorithm developed in C or another higher-level language translates to assembly language in an optimum way. Often, the inherent restrictions/limitations of the higher-level language get in the way of producing succinct machine code. Also, compilers seldom produce code that is as small and tight as that produced by a skilled assembly language programmer.

Anyhow, here’s a 16-bit division routine I concocted many years ago that will run on any member of the 6502 family. Over time, I massaged it in an effort to make it as fast as possible, a goal that is often at odds with trying to keep code as small as possible.

It’s based on an algorithm that was presented by Lance Leventhal in his first edition of 6502 Assembly Language Programming, and includes a division-by-zero check. With simple modifications, 24-bit and larger quantities may be processed.

Code: Select all

;16-BIT UNSIGNED DIVISION
;
;	------------------------------------
;	FAC1 = 16 bit dividend
;	FAC2 = 16 bit divisor
;
;	FAC1 is 32 bits in size; FAC2 is 16
;	bits in size.  Ideally, both should
;	be on page zero, but the algorithm
;	will work no matter where the accum-
;	ulators are located.
;
;	Bits 16-31 of FAC1 are used to store
;	the partial quotient.
;	------------------------------------
;	FAC1   = 16-bit quotient
;	FAC1+2 = 8-bit remainder
;	------------------------------------
;
dpdiv    sec
         lda fac2              ;divisor LSB
         ora fac2+1            ;divisor MSB	
         beq dpdiv03           ;can't divide by zero
;
         lda #0
         sta fac1+2            ;clear...
         sta fac1+3            ;partial quotient
         ldx #16               ;iteration counter
         clc                   ;no initial carry!
;
;	---------
;	Main Loop
;	---------
;
dpdiv01  rol fac1              ;rotate bit from...
         rol fac1+1            ;dividend into...
         rol fac1+2            ;partial quotient
         rol fac1+3
         sec                   ;do trial subtraction
         lda fac1+2            ;partial quotient LSB
         sbc fac2              ;divisor LSB
         tay                   ;we might not need this
         lda fac1+3            ;partial quotient MSB
         sbc fac2+1            ;divisor MSB
         bcc dpdiv02           ;discard trial difference
;
         sty fac1+2            ;new partial quotient LSB
         sta fac1+3            ;new partial quotient MSB
;
dpdiv02  dex                   ;we out of bits?
         bne dpdiv01           ;no
;
;	--------------------------------
;	Rotate Final Carry Into Quotient
;	--------------------------------
;
         rol fac1
         rol fac1+1
         clc                   ;return happy
;
dpdiv03  rts

In the above example, I replaced symbolic constants with “magic numbers” in the code, e.g., fac1+1, to make it easier to understand. In practice, “magic numbers” should not be buried in code, for reasons that become painfully apparent when a widely-used “magic number” constant in a large program has to be changed.

barnacle · Post by **barnacle** » Wed Jan 01, 2025 10:31 am

Almost identical to the division routine I use:

Code: Select all

	; core routine taken from https://www.llx.com/Neil/a2/mult.html
	; divides dividend by divisor, leaves result in dividend and the
	; remainder in remainder
	; dividend and divisor are both positive
	stz remainder
	stz remainder+1		; remainder initialised to zero
	ldx #16				; bit counter
ldv_1:
		asl dividend
		rol	dividend+1
		rol remainder
		rol remainder+1		; move high bit of dividend into remainder
		lda remainder
		sec					
		sbc divisor
		tay					; trial subtraction
		lda remainder+1
		sbc divisor+1
		bcc ldv_2
			sta remainder+1		; and if it worked, save it
			sty remainder
			inc dividend		; slide the result in as the low bit 
								; of dividend
ldv_2:
		dex			
		bne ldv_1			; and back sixteen times
	lda dividend
	ldy dividend+1		; return the result in y:a

There is an external wrapper to invert one or both parameters if they are (signed-16) negative, setting a flag if the result should be inverted at the end, and a call to an inversion routine (also used for subtraction).

Neil

(The Neil in the link is not me!)

wsimms · Post by **wsimms** » Wed Jan 01, 2025 3:13 pm

BigDumbDinosaur wrote:

wsimms wrote:

I want to present an integer division routine I have come up with...

Welcome to 6502-land. ... Anyhow, here’s a 16-bit division routine I concocted many years ago.

This was helpful. I wrote a wrapper program to call my routine and yours to compare the results and let them go over all dividends and divisors in the range [0...32767]. It took a while to run, but I found no discrepancies in the output, although I didn't compare the high byte of the remainder, since you stated your routine produces only an 8-bit remainder.

I should add that in checking the results of my routine against yours, I found a dumb bug in my routine. I had screwed up the two comparisons divisor < dividend and dividend < divisor. I edited my original post to fix the comparisons, so my code in the original post has now been edited twice.

BigDumbDinosaur wrote:

In the above example, I replaced symbolic constants with “magic numbers” in the code, e.g., fac1+1, to make it easier to understand.

I approve of this practice.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jan 01, 2025 4:16 pm

barnacle wrote:

Almost identical to the division routine I use...

Funny how these routines all seem to resemble each other.

I use an almost-identical version in my 64-bit 65C816 math library. The primary difference is that version processes in 16-bit chunks and uses loops to rotate and trial-subtract the (64-bit) operands. Here’s the listing as assembled in one of my programs:

Code: Select all

03921  0006CF  A9 40 00               lda !#s_blword        ;bits in dividend
03922  0006D2  85 C4                  sta icounter          ;use them as iteration counter
03923  0006D4  18                     clc                   ;no initial carry
03924  ;
03925  ;	—————————
03926  ;	Main Loop
03927  ;	—————————
03928  ;
03929  0006D5  A2 00         .main    ldx #0
03930  0006D7  A0 04                  ldy #s_dlong/s_word
03931  ;
03932  0006D9  36 A0         .main010 rol faca,x
03934  0006DB  E8                     inx
03935  0006DC  E8                     inx
03936  0006DD  88                     dey
03937  0006DE  D0 F9                  bne .main010
03938  ;
03939  0006E0  A2 00                  ldx #0
03940  0006E2  A0 04                  ldy #s_dlong/s_word
03941  ;
03942  0006E4  36 B8         .main020 rol faco,x
03944  0006E6  E8                     inx
03945  0006E7  E8                     inx
03946  0006E8  88                     dey
03947  0006E9  10 F9                  bpl .main020
03948  ;
03949  0006EB  A2 00                  ldx #0
03950  0006ED  A0 04                  ldy #s_dlong/s_word
03951  0006EF  38                     sec
03952  ;
03953  0006F0  B5 B8         .main030 lda faco,x
03954  0006F2  F5 A8                  sbc facb,x
03955  0006F4  95 B0                  sta facc,x
03957  0006F6  E8                     inx
03958  0006F7  E8                     inx
03959  0006F8  88                     dey
03960  0006F9  D0 F5                  bne .main030
03961  ;
03962  0006FB  90 0A                  bcc .main050
03963  ;
03964  0006FD  A2 04                  ldx #s_dlong/s_word
03965  ;
03966  0006FF  B5 B0         .main040 lda facc,x
03967  000701  95 B8                  sta faco,x
03969  000703  CA                     dex
03970  000704  CA                     dex
03971  000705  10 F8                  bpl .main040
03972  ;
03973  000707  C6 C4         .main050 dec icounter
03974  000709  D0 CA                  bne .main
03975  ;
03976  00070B  A2 00                  ldx #0
03977  00070D  A0 04                  ldy #s_dlong/s_word
03978  ;
03979  00070F  36 A0         .main060 rol faca,x
03981  000711  E8                     inx
03982  000712  E8                     inx
03983  000713  88                     dey
03984  000714  D0 F9                  bne .main060

Despite the extra monkey-motion of the inner loops, the above runs about 60 percent faster than the equivalent 8-bits-at-a-time routine without the inner loops. I could get it to run faster with the inner loops unrolled, but the code size would significantly increase. As always, there’s that trade-off between size and speed...

A separate scratch accumulator is used to temporarily hold the results of the trial subtraction, since it can’t be held solely in the registers. That “feature” eats up direct page space, but doesn’t seem to be avoidable.

Quote:

(The Neil in the link is not me!)

An imposter Neil? Is that anything like fake chips from you-know-where?

barnacle · Post by **barnacle** » Wed Jan 01, 2025 4:32 pm

I dunno, but the Neil in the link seems to know what he's talking about, so there's a difference

Neil

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed Jan 01, 2025 5:24 pm

barnacle wrote:

I dunno, but the Neil in the link seems to know what he's talking about, so there's a difference

Operative word is “seems”...

BTW, here’s the above 64-bit division routine with the inner loops unrolled:

Code: Select all

03920  0006CD  A2 40                  ldx #s_blword         ;iteration counter
03921  0006CF  C2 31                  rep #m_setr | sr_car  ;16-bits & no initial carry
03922  ;
03923  ;	—————————
03924  ;	Main Loop
03925  ;	—————————
03926  ;
03927  0006D1  26 A0         .main    rol faca              ;shift a bit...
03928  0006D3  26 A2                  rol faca+s_word       ;out of dvidend &...
03929  0006D5  26 A4                  rol faca+s_dword      ;...
03930  0006D7  26 A6                  rol faca+s_xdword     ;...
03931  0006D9  26 B8                  rol faco              ;into partial...
03932  0006DB  26 BA                  rol faco+s_word       ;quotient
03933  0006DD  26 BB                  rol faco+s_xword
03934  0006DF  26 BE                  rol faco+s_xdword
03935  0006E1  38                     sec
03936  0006E2  A5 B8                  lda faco              ;64-bit...
03937  0006E4  E5 A8                  sbc facb              ;trial subtraction...
03938  0006E6  85 B0                  sta facc              ;difference is...
03939  0006E8  A5 BA                  lda faco+s_word       ;saved in the...
03940  0006EA  E5 AA                  sbc facb+s_word       ;scratch...
03941  0006EC  85 B2                  sta facc+s_word       ;accumulator
03942  0006EE  A5 BC                  lda faco+s_dword
03943  0006F0  E5 AC                  sbc facb+s_dword
03944  0006F2  A8                     tay                   ;might not need this
03945  0006F3  A5 BE                  lda faco+s_xdword
03946  0006F5  E5 AE                  sbc facb+s_xdword
03947  0006F7  90 0C                  bcc .main010          ;discard trial difference
03948  ;
03949  0006F9  85 BE                  sta faco+s_xdword     ;trial difference is...
03950  0006FB  84 BC                  sty faco+s_dword      ;our new...
03951  0006FD  A5 B2                  lda facc+s_word       ;partial quotient
03952  0006FF  85 BA                  sta faco+s_word
03953  000701  A5 B0                  lda facc
03954  000703  85 B8                  sta faco
03955  ;
03956  000705  CA            .main010 dex                   ;all bits processed?
03957  000706  D0 C9                  bne .main             ;no
03958  ;
03959  000708  26 A0                  rol faca              ;handle...
03960  00070A  26 A2                  rol faca+s_word       ;final...
03961  00070C  26 A4                  rol faca+s_dword      ;carry
03962  00070E  26 A6                  rol faca+s_xdword

It’s slightly bigger than the previous rendition, but also somewhat faster. When I originally wrote the math library of which this routine is part, I was concerned about overall memory footprint, so I went for the smallest code I could concoct. With POC V1.3 now having about 110KB for user code and data, I can be a little more liberal with code size, as long as increased size doesn’t translate to more cycles.

wsimms · Post by **wsimms** » Thu Jan 02, 2025 10:44 pm

I have instrumented both my divide routine and the routine from BigDumbDinosaur to see how they compare in terms of time required. The results are interesting, in my opinion.

When both dividend and divisor range from 1 to 4096, for a total of 16,777,216 divisions, my routine takes an average of 198 cycles. However the number of cycles for a single division ranges from a minimum of 81 cycles to a maximum of 1713 cycles.

The routine posted by BigDumbDinosaur exhibits very different behavior. The number of cycles required for a single division is quite consistent, ranging from a minimum of 863 cycles to a maximum of 947 cycles, with an average of 868 cycles.

In tabular form:

Code: Select all

                         Avg # cycles   Min # cycles   Max # cycles
         wsimms divide:      198             81           1713
BigDumbDinosaur divide:      868            863            947

Both dividend and divisor allowed to range over [1...4096] for a total of 16,777,216 divisions.

BigEd · Post by **BigEd** » Thu Jan 02, 2025 10:57 pm

impressive performance!

(there is a paper somewhere which looks into what distribution of inputs might be most representative)

BigDumbDinosaur · Post by **BigDumbDinosaur** » Fri Jan 03, 2025 3:12 am

wsimms wrote:

I have instrumented both my divide routine and the routine from BigDumbDinosaur to see how they compare in terms of time required...In tabular form:

Code: Select all

                         Avg # cycles   Min # cycles   Max # cycles
         wsimms divide:      198             81           1713
BigDumbDinosaur divide:      868            863            947

The shift-and-subtract method I use tends to be sensitive only to the magnitude of the divisor relative to the dividend. A small divisor results in more-frequent partial quotient replacements. However, that part of the code is very economical with clock cycles when compared to the rotations, which, being R-M-W instructions, eat clock cycles for lunch. Hence the relatively small spread between minimum and maximum clock cycles is to be expected.

As is often the case, different methods produce both desirable and undesirable effects. It’s kind of like a joke from back in the early days of railroads with regards to the running of trains: you can have speed, safety and comfort. Pick any two...

wsimms · Post by **wsimms** » Fri Jan 03, 2025 4:49 am

BigDumbDinosaur wrote:

wsimms wrote:

I have instrumented both my divide routine and the routine from BigDumbDinosaur to see how they compare in terms of time required...In tabular form:

Code: Select all

                         Avg # cycles   Min # cycles   Max # cycles
         wsimms divide:      198             81           1713
BigDumbDinosaur divide:      868            863            947

The shift-and-subtract method I use tends to be sensitive only to the magnitude of the divisor relative to the dividend. A small divisor results in more-frequent partial quotient replacements. However, that part of the code is very economical with clock cycles when compared to the rotations, which, being R-M-W instructions, eat clock cycles for lunch. Hence the relatively small spread between minimum and maximum clock cycles is to be expected.

As is often the case, different methods produce both desirable and undesirable effects. It’s kind of like a joke from back in the early days of railroads with regards to the running of trains: you can have speed, safety and comfort. Pick any two...

My results so far (if correct) show that my routine has a much lower average cycle count over a wide range of operand values, but the maximum cycle count is double that of your routine and it is not at all clear which sorts of operands are "typical" in actual application. At the moment, I don't have any idea what types of operands are "good" for my routine. It might well be that common operands are all in the 1000+ cycle time range for my routine.

BigEd · Post by **BigEd** » Fri Jan 03, 2025 9:52 am

BigEd wrote:

(there is a paper somewhere which looks into what distribution of inputs might be most representative)

Ah, found it, in a thread comparing 6502 multiplication routines!

pdragon wrote:

just found this nice paper by Hamming, (Bell labs, 1970), On the Distribution of Numbers that talks about the same idea in the context of optimizing floating point representations.

Quote:

This paper examines the distribution of the mantissas of floating point numbers and shows how the arithmetic operations of a computer transform various distributions toward the limiting distribution

A 6502 divide routine

A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine

Re: A 6502 divide routine