Division routine help sought!
Posted: Sat Apr 25, 2020 8:58 pm
Hope everyone's doing OK during these strange times.

Collective groan from the 6502.org community. But wait! I'm just wondering if I can improve on what I've got, and there's no better collection of minds than here.

So, I have two unsigned 16-bit values A and B, where A < B, and I want to compute A/B to 8-bit fractional precision.

My starting point was to look at a simpler case: dividing two 8-bit values, or, in this case, 16-bit divided by 8-bit, yielding an 8-bit value. To my moderate surprise, such a routine is already presented in the 6502 wiki:
Code:
        LDA TH
        LDX #8
        ASL TLQ
L1      ROL             ; next numerator bit into A via the carry
        BCS L2          ; remainder overflowed 8 bits, so it must be >= B
        CMP B
        BCC L3          ; remainder < B: this quotient bit is 0
L2      SBC B           ; carry is already set on both paths into the SBC
        SEC             ; needed when the BCS L2 branch above was taken
L3      ROL TLQ         ; quotient bit in at the bottom, next numerator bit out
        DEX
        BNE L1

Given that the LSB of the numerator is always zero, this can be subtly optimized to the following:

Code:
        LDA TH
        LDX #8
L1      ASL             ; numerator low byte is zero, so no ASL TLQ feed is needed
        BCS L2
        CMP B
        BCC L3
L2      SBC B
        SEC             ; as above, needed when the BCS L2 branch was taken
L3      ROL TLQ
        DEX
        BNE L1

That actually looks pretty efficient, and I can use it if I know that both inputs are < 256. But I needed a 16-bit case, so I came up with the following:

Code:
divide:
        LDA num_hi      ; 16-bit remainder is held in A (high) : num_lo (low)
        LDX #8          ; eight quotient bits
loop:
        ASL num_lo
        ROL A           ; 17-bit left shift; bit 16 lands in the carry
        BCS subtract    ; bit 16 set: remainder is certainly >= denominator
        CMP denom_hi
        BCC skip        ; high byte smaller: quotient bit is 0
        BNE subtract    ; high byte bigger: quotient bit is 1
        LDY num_lo      ; high bytes equal: decide on the low bytes
        CPY denom_lo
        BCC skip
subtract:
        TAY             ; stash the remainder's high byte
        LDA num_lo
        SBC denom_lo    ; carry is set on every path into this SBC
        STA num_lo
        TYA
        SBC denom_hi
        SEC             ; force the quotient bit when the wide value borrowed
skip:
        ROL result      ; shift the quotient bit into the result
        DEX
        BNE loop
        RTS
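For anyone who wants to poke at the algorithm without firing up an emulator, here's a quick Python model of the shift-and-subtract loop above (the function name is mine, not part of the routine):

```python
def udiv16_frac8(num, denom):
    """Model of the 8-step restoring division: num and denom are unsigned
    16-bit with num < denom; returns floor(num * 256 / denom), i.e. the
    eight fraction bits the routine leaves in `result`."""
    rem = num
    result = 0
    for _ in range(8):
        rem <<= 1            # ASL num_lo / ROL A: the 17-bit left shift
        result <<= 1         # ROL result (the quotient bit arrives below)
        if rem >= denom:     # the BCS / CMP / CPY comparison chain
            rem -= denom     # the 16-bit SBC pair
            result |= 1      # carry forced set into ROL result
    return result

print(udiv16_frac8(1, 2))          # -> 128 (i.e. 0.5 in 1/256ths)
print(udiv16_frac8(10000, 30000))  # -> 85  (floor(10000 * 256 / 30000))
```

The 8-bit routines above are just the special cases of the same loop where the numerator's low byte is zero or the denominator's high byte is zero.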
This normally completes (depending on the inputs) in between 250 and 300 cycles, which is OK, but I'm wondering if I can do better.

I wondered if there was something in doing the subtract speculatively, caching the result, and only committing it if it didn't overflow, but I feel the routine above short-circuits that work when it isn't required, and probably gains a little performance from doing so (I'd need to measure it to be sure).

But the main reason I'm posting is to ask whether anyone knows a better way, or a totally different way! I'm interested in performance here, so if there's a table-based approach which uses a modest amount of RAM for tables, I'm open to ideas.

Note that I don't need an enormous amount of precision: 1/256 is fine. The use case is as an input to an arctan table: an index from 0-255 (representing 0...1) yields an angle between 0 and 45 degrees, where 45 degrees is represented by 128. At that resolution I don't need any more precision in the division result, since every angle in the table is covered. What I do need is to calculate hundreds of these in real time, so the quicker the better!
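For reference, a table along those lines can be generated offline with a few lines of Python (the table name and exact rounding are my sketch of the scheme described, not a quote of my actual data):

```python
import math

# Sketch of the lookup table: the index 0..255 encodes a ratio 0..1 in
# steps of 1/256, and each entry is the corresponding angle scaled so
# that 45 degrees comes out as 128.
ATAN_TABLE = bytes(
    round(math.atan(i / 256) / (math.pi / 4) * 128) for i in range(256)
)

print(ATAN_TABLE[0])    # -> 0  (a ratio of 0 is 0 degrees)
print(ATAN_TABLE[128])  # -> 76 (atan(0.5) ~ 26.6 degrees ~ 76/128 of 45)
```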
Over to you!