Page 1 of 2
A 6502 divide routine
Posted: Tue Dec 31, 2024 3:04 pm
by wsimms
I want to present an integer division routine I have come up with. I have to imagine the algorithm is already well known, but I am too lazy to go find the prior art. The reason I am presenting this algorithm is because it seems to be common sentiment online that 6502 divide routines are difficult to understand, but this one seems quite simple to me and is reasonably fast.
The routine as presented divides a 15-bit unsigned dividend by a 15-bit unsigned divisor yielding a 15-bit unsigned quotient and 15-bit unsigned remainder. It is fairly trivial to produce a wrapper routine that handles 16-bit signed dividends and divisors yielding 16-bit signed quotients and remainders. That can also be the place where division by zero is handled.
First, a C implementation of the algorithm. I have tested this version by comparing its results to the results of the C '/' and '%' operators for all valid values of dividend and divisor:
Code: Select all
uint16_t uquotient;
uint16_t uremainder;
/*
unsigned divide
The largest possible multiples of the the divisor by a power of 2
are subtracted from the dividend and the corresponding partial quotients
(the powers of 2 by which the divisor is multiplied) are added to the
quotient. Whatever is leftover is the remainder.
*/
void udiv(uint16_t dividend, uint16_t divisor)
{
uint16_t partial_quotient = 1;
uquotient = 0;
// double the divisor until it exceeds the dividend. this is ok
// because the divisor and dividend are no more than 15 bits long
while (divisor < dividend) {
divisor <<= 1;
partial_quotient <<= 1;
}
while (1) {
// halve the divisor until it is less than the dividend
// this yields the largest multiple of the divisor by a power of 2
// that can be subtracted from the dividend with a positive difference
while (dividend < divisor) {
divisor >>= 1;
partial_quotient >>= 1;
if (partial_quotient == 0) goto done;
}
// subtract the multiple of the divisor from the dividend and add
// the corresponding power of 2 (partial quotient) to the quotient
dividend -= divisor;
uquotient += partial_quotient;
}
done:
uremainder = dividend;
}
Next, a basic 6502 version. I have only tested this minimally, using an Apple II emulation program (Virtual II) and ORCA/M assembler. Unfortunately it is difficult to export the assembler source files, so this is retyped (hopefully without typos):
Code: Select all
rmdr
dend ds 2
dsor ds 2
quot ds 2
pquo ds 2
udiv lda #1
sta pquo
lda #0
sta pquo+1
sta quot
sta quot+1
;; double the divisor until it exceeds the dividend. this is ok
;; because the divisor and dividend are no more than 15 bits long
L0 lda dsor+1
cmp dend+1
bcc L1
bne L2
lda dsor
cmp dend
bcs L2
L1 asl dsor
rol dsor+1
asl pquo
rol pquo+1
jmp L0
;; halve the divisor until it is less than the dividend
;; this yields the largest multiple of the divisor by a power of 2
;; that can be subtracted from the dividend with a positive difference
L2 lda dend+1
cmp dsor+1
bcc L3
bne L4
lda dend
cmp dsor
bcs L4
L3 lsr dsor+1
ror dsor
lsr pquo+1
ror pquo
bne L2
lda pquo+1
bne L2
rts
;; subtract the multiple of the divisor from the dividend and add
;; the corresponding power of 2 (partial quotient) to the quotient
L4 sec
lda dend
sbc dsor
sta dend
lda dend+1
sbc dsor+1
sta dend+1
clc
lda pquo
adc quot
sta quot
lda pquo+1
adc quot+1
sta quot+1
jmp L2
P.S.: Edited twice to correct mistakes
Re: A 6502 divide routine
Posted: Tue Dec 31, 2024 3:47 pm
by drogon
Interesting. Thanks. I'll have a look next year ...
However I think the JMP L0 at the very end might ought to be JMP L2 ?
-Gordon
Re: A 6502 divide routine
Posted: Tue Dec 31, 2024 3:58 pm
by GARTHWILSON
I have to imagine the algorithm is already well known, but I am too lazy to go find the prior art.
Welcome. This link, if I formed it right, should take you directly to the relevant section of the source-code repository on this site:
http://6502.org/source/#integermath
Re: A 6502 divide routine
Posted: Tue Dec 31, 2024 4:02 pm
by wsimms
Interesting. Thanks. I'll have a look next year ...
However I think the JMP L0 at the very end might ought to be JMP L2 ?
You are correct. I have edited it.
Re: A 6502 divide routine
Posted: Tue Dec 31, 2024 10:55 pm
by BigDumbDinosaur
I want to present an integer division routine I have come up with...
Welcome to 6502-land.
I have to imagine the algorithm is already well known, but I am too lazy to go find the prior art.
There are several such algorithms in existence, although I’d be hard-pressed to single out any one of them as “the best.” Most are based on the same principle, which is repetitive dividend rotations and trial subtractions.
The reason I am presenting this algorithm is because it seems to be common sentiment online that 6502 divide routines are difficult to understand, but this one seems quite simple to me and is reasonably fast...
6502 division seems hard to understand because the methods usually used are not exactly like the long division method we learned in school. Not helping is the 6502 can only add, subtract and bit-shift, whereas we humans (supposedly) know how to multiply and divide. 
It is a rare case in which an algorithm developed in C or another higher-level language translates to assembly language in an optimum way. Often, the inherent restrictions/limitations of the higher-level language get in the way of producing succinct machine code. Also, compilers seldom produce code that is as small and tight as that produced by a skilled assembly language programmer.
Anyhow, here’s a 16-bit division routine I concocted many years ago that will run on any member of the 6502 family. Over time, I massaged it in an effort to make it as fast as possible, a goal that is often at odds with trying to keep code as small as possible.
It’s based on an algorithm that was presented by Lance Leventhal in his first edition of 6502 Assembly Language Programming, and includes a division-by-zero check. With simple modifications, 24-bit and larger quantities may be processed.
Code: Select all
;16-BIT UNSIGNED DIVISION
;
; ------------------------------------
; FAC1 = 16 bit dividend
; FAC2 = 16 bit divisor
;
; FAC1 is 32 bits in size; FAC2 is 16
; bits in size. Ideally, both should
; be on page zero, but the algorithm
; will work no matter where the accum-
; ulators are located.
;
; Bits 16-31 of FAC1 are used to store
; the partial quotient.
; ------------------------------------
; FAC1 = 16-bit quotient
; FAC1+2 = 8-bit remainder
; ------------------------------------
;
dpdiv sec
lda fac2 ;divisor LSB
ora fac2+1 ;divisor MSB
beq dpdiv03 ;can't divide by zero
;
lda #0
sta fac1+2 ;clear...
sta fac1+3 ;partial quotient
ldx #16 ;iteration counter
clc ;no initial carry!
;
; ---------
; Main Loop
; ---------
;
dpdiv01 rol fac1 ;rotate bit from...
rol fac1+1 ;dividend into...
rol fac1+2 ;partial quotient
rol fac1+3
sec ;do trial subtraction
lda fac1+2 ;partial quotient LSB
sbc fac2 ;divisor LSB
tay ;we might not need this
lda fac1+3 ;partial quotient MSB
sbc fac2+1 ;divisor MSB
bcc dpdiv02 ;discard trial difference
;
sty fac1+2 ;new partial quotient LSB
sta fac1+3 ;new partial quotient MSB
;
dpdiv02 dex ;we out of bits?
bne dpdiv01 ;no
;
; --------------------------------
; Rotate Final Carry Into Quotient
; --------------------------------
;
rol fac1
rol fac1+1
clc ;return happy
;
dpdiv03 rts
In the above example, I replaced symbolic constants with “magic numbers” in the code, e.g., fac1+1, to make it easier to understand. In practice, “magic numbers” should not be buried in code, for reasons that become painfully apparent when a widely-used “magic number” constant in a large program has to be changed.
Re: A 6502 divide routine
Posted: Wed Jan 01, 2025 10:31 am
by barnacle
Almost identical to the division routine I use:
Code: Select all
; core routine taken from https://www.llx.com/Neil/a2/mult.html
; divides dividend by divisor, leaves result in dividend and the
; remainder in remainder
; dividend and divisor are both positive
stz remainder
stz remainder+1 ; remainder initialised to zero
ldx #16 ; bit counter
ldv_1:
asl dividend
rol dividend+1
rol remainder
rol remainder+1 ; move high bit of dividend into remainder
lda remainder
sec
sbc divisor
tay ; trial subtraction
lda remainder+1
sbc divisor+1
bcc ldv_2
sta remainder+1 ; and if it worked, save it
sty remainder
inc dividend ; slide the result in as the low bit
; of dividend
ldv_2:
dex
bne ldv_1 ; and back sixteen times
lda dividend
ldy dividend+1 ; return the result in y:a
There is an external wrapper to invert one or both parameters if they are (signed-16) negative, setting a flag if the result should be inverted at the end, and a call to an inversion routine (also used for subtraction).
Neil
(The Neil in the link is not me!)
Re: A 6502 divide routine
Posted: Wed Jan 01, 2025 3:13 pm
by wsimms
I want to present an integer division routine I have come up with...
Welcome to 6502-land. ... Anyhow, here’s a 16-bit division routine I concocted many years ago.
This was helpful. I wrote a wrapper program to call my routine and yours to compare the results and let them go over all dividends and divisors in the range [0...32767]. It took a while to run, but I found no discrepancies in the output, although I didn't compare the high byte of the remainder, since you stated your routine produces only an 8-bit remainder.
I should add that in checking the results of my routine against yours, I found a dumb bug in my routine. I had screwed up the two comparisons divisor < dividend and dividend < divisor. I edited my original post to fix the comparisons, so my code in the original post has now been edited twice.
In the above example, I replaced symbolic constants with “magic numbers” in the code, e.g., fac1+1, to make it easier to understand.
I approve of this practice.
Re: A 6502 divide routine
Posted: Wed Jan 01, 2025 4:16 pm
by BigDumbDinosaur
Almost identical to the division routine I use...
Funny how these routines all seem to resemble each other.
I use an almost-identical version in my 64-bit 65C816 math library. The primary difference is that version processes in 16-bit chunks and uses loops to rotate and trial-subtract the (64-bit) operands. Here’s the listing as assembled in one of my programs:
Code: Select all
03921 0006CF A9 40 00 lda !#s_blword ;bits in dividend
03922 0006D2 85 C4 sta icounter ;use them as iteration counter
03923 0006D4 18 clc ;no initial carry
03924 ;
03925 ; —————————
03926 ; Main Loop
03927 ; —————————
03928 ;
03929 0006D5 A2 00 .main ldx #0
03930 0006D7 A0 04 ldy #s_dlong/s_word
03931 ;
03932 0006D9 36 A0 .main010 rol faca,x
03934 0006DB E8 inx
03935 0006DC E8 inx
03936 0006DD 88 dey
03937 0006DE D0 F9 bne .main010
03938 ;
03939 0006E0 A2 00 ldx #0
03940 0006E2 A0 04 ldy #s_dlong/s_word
03941 ;
03942 0006E4 36 B8 .main020 rol faco,x
03944 0006E6 E8 inx
03945 0006E7 E8 inx
03946 0006E8 88 dey
03947 0006E9 10 F9 bpl .main020
03948 ;
03949 0006EB A2 00 ldx #0
03950 0006ED A0 04 ldy #s_dlong/s_word
03951 0006EF 38 sec
03952 ;
03953 0006F0 B5 B8 .main030 lda faco,x
03954 0006F2 F5 A8 sbc facb,x
03955 0006F4 95 B0 sta facc,x
03957 0006F6 E8 inx
03958 0006F7 E8 inx
03959 0006F8 88 dey
03960 0006F9 D0 F5 bne .main030
03961 ;
03962 0006FB 90 0A bcc .main050
03963 ;
03964 0006FD A2 04 ldx #s_dlong/s_word
03965 ;
03966 0006FF B5 B0 .main040 lda facc,x
03967 000701 95 B8 sta faco,x
03969 000703 CA dex
03970 000704 CA dex
03971 000705 10 F8 bpl .main040
03972 ;
03973 000707 C6 C4 .main050 dec icounter
03974 000709 D0 CA bne .main
03975 ;
03976 00070B A2 00 ldx #0
03977 00070D A0 04 ldy #s_dlong/s_word
03978 ;
03979 00070F 36 A0 .main060 rol faca,x
03981 000711 E8 inx
03982 000712 E8 inx
03983 000713 88 dey
03984 000714 D0 F9 bne .main060
Despite the extra monkey-motion of the inner loops, the above runs about 60 percent faster than the equivalent 8-bits-at-a-time routine without the inner loops. I could get it to run faster with the inner loops unrolled, but the code size would significantly increase. As always, there’s that trade-off between size and speed...
A separate scratch accumulator is used to temporarily hold the results of the trial subtraction, since it can’t be held solely in the registers. That “feature” eats up direct page space, but doesn’t seem to be avoidable.
(The Neil in the link is not me!)
An imposter Neil? Is that anything like fake chips from you-know-where? 
Re: A 6502 divide routine
Posted: Wed Jan 01, 2025 4:32 pm
by barnacle
I dunno, but the Neil in the link seems to know what he's talking about, so there's a difference
Neil
Re: A 6502 divide routine
Posted: Wed Jan 01, 2025 5:24 pm
by BigDumbDinosaur
I dunno, but the Neil in the link seems to know what he's talking about, so there's a difference

Operative word is “seems”... 
BTW, here’s the above 64-bit division routine with the inner loops unrolled:
Code: Select all
03920 0006CD A2 40 ldx #s_blword ;iteration counter
03921 0006CF C2 31 rep #m_setr | sr_car ;16-bits & no initial carry
03922 ;
03923 ; —————————
03924 ; Main Loop
03925 ; —————————
03926 ;
03927 0006D1 26 A0 .main rol faca ;shift a bit...
03928 0006D3 26 A2 rol faca+s_word ;out of dvidend &...
03929 0006D5 26 A4 rol faca+s_dword ;...
03930 0006D7 26 A6 rol faca+s_xdword ;...
03931 0006D9 26 B8 rol faco ;into partial...
03932 0006DB 26 BA rol faco+s_word ;quotient
03933 0006DD 26 BB rol faco+s_xword
03934 0006DF 26 BE rol faco+s_xdword
03935 0006E1 38 sec
03936 0006E2 A5 B8 lda faco ;64-bit...
03937 0006E4 E5 A8 sbc facb ;trial subtraction...
03938 0006E6 85 B0 sta facc ;difference is...
03939 0006E8 A5 BA lda faco+s_word ;saved in the...
03940 0006EA E5 AA sbc facb+s_word ;scratch...
03941 0006EC 85 B2 sta facc+s_word ;accumulator
03942 0006EE A5 BC lda faco+s_dword
03943 0006F0 E5 AC sbc facb+s_dword
03944 0006F2 A8 tay ;might not need this
03945 0006F3 A5 BE lda faco+s_xdword
03946 0006F5 E5 AE sbc facb+s_xdword
03947 0006F7 90 0C bcc .main010 ;discard trial difference
03948 ;
03949 0006F9 85 BE sta faco+s_xdword ;trial difference is...
03950 0006FB 84 BC sty faco+s_dword ;our new...
03951 0006FD A5 B2 lda facc+s_word ;partial quotient
03952 0006FF 85 BA sta faco+s_word
03953 000701 A5 B0 lda facc
03954 000703 85 B8 sta faco
03955 ;
03956 000705 CA .main010 dex ;all bits processed?
03957 000706 D0 C9 bne .main ;no
03958 ;
03959 000708 26 A0 rol faca ;handle...
03960 00070A 26 A2 rol faca+s_word ;final...
03961 00070C 26 A4 rol faca+s_dword ;carry
03962 00070E 26 A6 rol faca+s_xdword
It’s slightly bigger than the previous rendition, but also somewhat faster. When I originally wrote the math library of which this routine is part, I was concerned about overall memory footprint, so I went for the smallest code I could concoct. With POC V1.3 now having about 110KB for user code and data, I can be a little more liberal with code size, as long as increased size doesn’t translate to more cycles.
Re: A 6502 divide routine
Posted: Thu Jan 02, 2025 10:44 pm
by wsimms
I have instrumented both my divide routine and the routine from BigDumbDinosaur to see how they compare in terms of time required. The results are interesting, in my opinion.
When both dividend and divisor range from 1 to 4096, for a total of 16,777,216 divisions, my routine takes an average of 198 cycles. However the number of cycles for a single division ranges from a minimum of 81 cycles to a maximum of 1713 cycles.
The routine posted by BigDumbDinosaur exhibits very different behavior. The number of cycles required for a single division is quite consistent, ranging from a minimum of 863 cycles to a maximum of 947 cycles, with an average of 868 cycles.
In tabular form:
Code: Select all
Avg # cycles Min # cycles Max # cycles
wsimms divide: 198 81 1713
BigDumbDinosaur divide: 868 863 947
Both dividend and divisor allowed to range over [1...4096] for a total of 16,777,216 divisions.
Re: A 6502 divide routine
Posted: Thu Jan 02, 2025 10:57 pm
by BigEd
impressive performance!
(there is a paper somewhere which looks into what distribution of inputs might be most representative)
Re: A 6502 divide routine
Posted: Fri Jan 03, 2025 3:12 am
by BigDumbDinosaur
I have instrumented both my divide routine and the routine from BigDumbDinosaur to see how they compare in terms of time required...In tabular form:
Code: Select all
Avg # cycles Min # cycles Max # cycles
wsimms divide: 198 81 1713
BigDumbDinosaur divide: 868 863 947
The shift-and-subtract method I use tends to be sensitive only to the magnitude of the divisor relative to the dividend. A small divisor results in more-frequent partial quotient replacements. However, that part of the code is very economical with clock cycles when compared to the rotations, which, being R-M-W instructions, eat clock cycles for lunch. Hence the relatively small spread between minimum and maximum clock cycles is to be expected.
As is often the case, different methods produce both desirable and undesirable effects. It’s kind of like a joke from back in the early days of railroads with regards to the running of trains: you can have speed, safety and comfort. Pick any two... 
Re: A 6502 divide routine
Posted: Fri Jan 03, 2025 4:49 am
by wsimms
I have instrumented both my divide routine and the routine from BigDumbDinosaur to see how they compare in terms of time required...In tabular form:
Code: Select all
Avg # cycles Min # cycles Max # cycles
wsimms divide: 198 81 1713
BigDumbDinosaur divide: 868 863 947
The shift-and-subtract method I use tends to be sensitive only to the magnitude of the divisor relative to the dividend. A small divisor results in more-frequent partial quotient replacements. However, that part of the code is very economical with clock cycles when compared to the rotations, which, being R-M-W instructions, eat clock cycles for lunch. Hence the relatively small spread between minimum and maximum clock cycles is to be expected.
As is often the case, different methods produce both desirable and undesirable effects. It’s kind of like a joke from back in the early days of railroads with regards to the running of trains: you can have speed, safety and comfort. Pick any two... 
My results so far (if correct) show that my routine has a much lower average cycle count over a wide range of operand values, but the maximum cycle count is double that of your routine and it is not at all clear which sorts of operands are "typical" in actual application. At the moment, I don't have any idea what types of operands are "good" for my routine. It might well be that common operands are all in the 1000+ cycle time range for my routine.
Re: A 6502 divide routine
Posted: Fri Jan 03, 2025 9:52 am
by BigEd
(there is a paper somewhere which looks into what distribution of inputs might be most representative)
Ah, found it, in a thread
comparing 6502 multiplication routines!
just found this nice paper by Hamming, (Bell labs, 1970),
On the Distribution of Numbers that talks about the same idea in the context of optimizing floating point representations.
This paper examines the distribution of the mantissas of floating point numbers and shows how the arithmetic operations of a computer transform various distributions toward the limiting distribution