Re: CBM (C64) floating point improvement
Posted: Sat Feb 20, 2016 6:59 am
by barrym95838
malcontent,
Have you seen Michael J. Mahon's fast square root for Applesoft? It claims to be about twice as fast as the Newton-Raphson version mentioned in your blog, and about nine times as fast as the MS exp(log(x)/2) method. It doesn't look prohibitively large either.
http://michaeljmahon.com/USR.SQR.pdf
Mike B.
[Edit: He made his routine relocatable, but employed a bit of negative stack addressing to do it, which is not generally safe, except on something that doesn't use interrupts, like an old Apple 1 or 2.]
Re: CBM (C64) floating point improvement
Posted: Sun Feb 21, 2016 3:47 pm
by malcontent
Nice find, I have not seen that. Looks pretty speedy as it doesn't use any kernal math like the version I've got now.
Re: CBM (C64) floating point improvement
Posted: Sun Feb 21, 2016 6:13 pm
by BigEd
I'm baffled by this square root computation. But I see the same general idea is used in BBC Basic:
http://8bs.com/basic/basic4-a7b5.htm
There is a worked example there, and this:
I am not exactly sure how this routine works, so if anyone else can provide a better description or an example of the algorithm using decimal (base 10) values, then please do so.
I notice that the Beeb routine finishes up with a normalisation step.
Re: CBM (C64) floating point improvement
Posted: Tue Feb 23, 2016 6:06 pm
by malcontent
Well, that Applesoft routine absolutely blazes compared to the built-in one, and beats the one I had by quite a fair margin too, since it avoids calling the kernal divide. I can't quite follow the bit-level math, but it does seem to be doing some flavor of Newton (subtraction and division via shifts). A few test cases showed it's about 10 times faster: ~4800 cycles compared to ~48000.
Re: CBM (C64) floating point improvement
Posted: Tue Feb 23, 2016 10:54 pm
by barrym95838
One of us should try to make contact with Mr. Mahon and see if he has any other hot-rod stories he would like to share.
Mike B.
Re: CBM (C64) floating point improvement
Posted: Tue Feb 23, 2016 11:28 pm
by malcontent
I am in contact... just received permission to post a version of his routine on the wiki. Actually he developed it for the same reason I'm interested in it, speeding up an orbit simulation.
Re: CBM (C64) floating point improvement
Posted: Wed Feb 24, 2016 2:24 am
by White Flame
This might be the algorithm from
Wikipedia. It's sort of a riff on the basic shift-and-subtract division, which has a cumulative effect such that it's seeing if (accum_shifted + bit)^2 fits as opposed to if (numerator_shifted > denom).
Re: CBM (C64) floating point improvement
Posted: Wed Feb 24, 2016 5:30 pm
by BigEd
Thanks for the algorithm link! As you say, it's like shift and subtract, but with a double shift and a tweak. Each time around you compute one bit of result and consume two bits of the input. I notice the code given only needs to compute a half-width result so only iterates 16 times max to compute the root of a 32 bit number. In these floating point routines we need to keep going, so in both cases we iterate 32 times or so. Looks like Michael Mahon's code uses TEMP1 as a 32-bit extension to the input ARG. Acorn's code uses a moving bit mask and seems to do less shifting. Also it does some 40-bit subtraction and shifting.
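To make the "one result bit per two input bits" idea concrete, here is a C sketch of the plain restoring digit-by-digit square root; this is my own illustration of the general method, not a transcription of Michael Mahon's or Acorn's code:

```c
#include <stdint.h>

/* Digit-by-digit (shift-and-subtract) square root of a 32-bit integer.
 * Each trip around the loop shifts two bits of the input into a running
 * remainder and produces one bit of the root, by testing whether the
 * trial value (root_so_far appended with the bits 01) still fits under
 * the remainder -- i.e. whether (root_so_far + bit)^2 fits. 16 iterations
 * give the full 16-bit root of a 32-bit input. */
uint16_t isqrt32(uint32_t n) {
    uint32_t rem  = 0;   /* running remainder */
    uint32_t root = 0;   /* root, built up one bit per iteration */
    for (int i = 0; i < 16; i++) {
        /* consume the top two bits of the input */
        rem = (rem << 2) | (n >> 30);
        n <<= 2;
        root <<= 1;
        uint32_t trial = (root << 1) | 1;   /* append 01 to the root */
        if (rem >= trial) {
            rem -= trial;
            root |= 1;                      /* this result bit is a 1 */
        }
    }
    return (uint16_t)root;
}
```

A floating-point version keeps iterating past this point to fill the full mantissa width, which is why both the Applesoft and Acorn routines run about 32 rounds.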
Re: CBM (C64) floating point improvement
Posted: Wed Mar 09, 2016 9:20 pm
by GARTHWILSON
Something that may be helpful and relevant: On the MoHPC calculator forum, there's a discussion of ways to get better accuracy on ARCTAN without running a ridiculous number of terms of the Taylor series. Chasfield started the topic because he's making himself an Arduino-based floating-point RPN calculator. See
http://hpmuseum.org/forum/thread-5825.html .
I ran into the same problem when I was writing routines perhaps 25 years ago to calculate the 16-bit scaled-integer trig and log functions in Forth (before I did the big look-up tables that are on my website). For SIN and COS, you can use a small number of terms and tweak the coefficients a little to make up for the fact that you're not using many; but ARCTAN converges just toooooooo slowly, requiring a huge number of terms. Hence the discussion linked above. What I found, those years ago, to my delight, was that since I didn't need 10 digits of accuracy, I could use a 16-segment look-up table (actually 18 points: 17 including both ends, plus one more) for 0-45°, and interpolate, and from there you can get the rest of the circle. Here are my notes in the code:
Code:
\ ARCTAN can be interpolated nicely for 0-45°. It works out much better than the approximation x(1+bx²)/(1+cx²)-dx^6, not to
\ mention the infinite series! With arctangents from 0-45° (π/4 radians, or 1/8 of the circle), you can get the rest of the
\ circle, except that the tangent has this nasty habit of reaching infinity as you get to ±90°.
\ The numbers in the arctangent table below are not exactly the right values for the points they represent; but since the
\ interpolated values sag below, the numbers in the table are optimized so the maximum error from the function atan45 will be
\ <.02°. The resolution of the table--that is, the angle difference made by each count--is about .007° worst case (near 0).
\ Each segment is a sixteenth of 45°, and the last number is actually the scaled atan of 17/16 of 45°.
CREATE ATAN_TBL 0 , 28C , 512 , 78E , 9FC , C58 , E9F , 10CF ,
12E5 , 14E2 , 16C4 , 188B , 1A39 , 1BCE , 1D4B , 1EB1 ,
2000 , 213C , \ Resolution is .0084° worst case (near 45°).
: atan45 ( n1 -- n2 ) \ Input is scaled by 4000H (0-1) and output is in PiRads.
<snip. atan45 gets defined here, then it is used by ATAN to get the whole range of arctangents>
IIRC, the maximum error was inversely proportional to the square of the size of the table.
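Since the Forth definition of atan45 is snipped above, here is my own C sketch of how linear interpolation into that 18-entry table might work; the segment/fraction split and the rounding behavior are my assumptions, not Garth's actual code:

```c
#include <stdint.h>

/* Garth's 18-entry arctangent table (values copied from the Forth source
 * above). Input: tangent scaled by 0x4000 (0x4000 = 1.0 = tan 45 deg).
 * Output: angle on the 16-bit circle, where 0x2000 = 45 degrees.
 * The last entry is the scaled atan of 17/16 of 45 degrees, so that
 * interpolation works right up to the top of the input range. */
static const uint16_t atan_tbl[18] = {
    0x0000, 0x028C, 0x0512, 0x078E, 0x09FC, 0x0C58, 0x0E9F, 0x10CF,
    0x12E5, 0x14E2, 0x16C4, 0x188B, 0x1A39, 0x1BCE, 0x1D4B, 0x1EB1,
    0x2000, 0x213C
};

uint16_t atan45(uint16_t x) {       /* x in 0..0x4000 */
    uint16_t seg  = x >> 10;        /* which of the 16 segments (0..16) */
    uint16_t frac = x & 0x03FF;     /* position within the segment */
    uint16_t lo = atan_tbl[seg];
    uint16_t hi = atan_tbl[seg + 1];
    /* linear interpolation between adjacent table entries */
    return lo + (uint16_t)(((uint32_t)(hi - lo) * frac) >> 10);
}
```

With this scaling, atan45(0x4000) lands exactly on 0x2000 (45°), and intermediate inputs fall on the tweaked segment endpoints described in the comments above.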
Warning: slight divergence from original subject matter ahead.
atan45 done this way takes about 880µs with a 5MHz 65c02 in Forth (not assembly). With the
huge look-up tables on my site, it would take only 72µs (again at 5MHz) using the slow serial method of interfacing (shown on the same page) to the memory which is much larger than the 6502 can address directly.
And yes, even legacy 6502 computers can take advantage of this. Putting the tables in memory directly on the 65816 bus, a 14MHz '816 will take 2.5µs to get the answer (yes, two-point-five microseconds), and it will be accurate to the last bit, requiring no interpolation. The input is scaled by $8000, so 1.00000 is represented by $8000. 1.00000 is of course the tangent of 45°. The output is again in the 16-bit circle, but its 45° maximum is represented by $2000 (since 65,536 counts is 360°).
Re: CBM (C64) floating point improvement
Posted: Thu Sep 06, 2018 9:33 pm
by JimBoyd
barrym95838 wrote:
malcontent,
Have you seen Michael J. Mahon's fast square root for Applesoft? It claims to be about twice as fast as the Newton-Raphson version mentioned in your blog, and about nine times as fast as the MS exp(log(x)/2) method. It doesn't look prohibitively large either.
http://michaeljmahon.com/USR.SQR.pdf
Mike B.
[Edit: He made his routine relocatable, but employed a bit of negative stack addressing to do it, which is not generally safe, except on something that doesn't use interrupts, like an old Apple 1 or 2.]
I've got an integer square root routine, written some time ago for my Forth on the C64, that some might find useful. It also uses shifts and subtraction. I wasn't sure whether to start a new topic in the Forth section or post here with this discussion of square root algorithms. BTW, it does not round the result. Here's the code:
Code:
HEX
CODE SQRT ( UD -- U )
TYA, 5 # LDY,
BEGIN, N ,Y STA, DEY,
0< UNTIL,
10 # LDY,
BEGIN,
2 ,X ASL, 3 ,X ROL,
0 ,X ROL, 1 ,X ROL,
N ROL, N 1+ ROL,
N 2+ ROL,
2 ,X ASL, 3 ,X ROL,
0 ,X ROL, 1 ,X ROL,
N ROL, N 1+ ROL,
N 2+ ROL,
N 3 + ASL, N 4 + ROL,
N 5 + ROL, SEC,
N 3 + ROL, N 4 + ROL,
N 5 + ROL, SEC,
N LDA, N 3 + SBC,
N 1+ LDA, N 4 + SBC,
N 2+ LDA, N 5 + SBC,
CS IF,
N LDA, N 3 + SBC, N STA,
N 1+ LDA, N 4 + SBC, N 1+ STA,
N 2+ LDA, N 5 + SBC, N 2+ STA,
N 3 + LDA, 2 # ORA, N 3 + STA,
THEN,
N 5 + LSR,
N 4 + ROR, N 3 + ROR,
DEY,
0= UNTIL,
N 3 + LDA, 2 ,X STA,
N 4 + LDA, 3 ,X STA,
POP JMP,
END-CODE
N is a scratch pad region in zero page from [N-1] to [N+7].
The code takes a 32-bit number (a double number on this 16-bit Forth) and returns a 16-bit result.
Re: CBM (C64) floating point improvement
Posted: Fri Sep 07, 2018 1:33 pm
by cbmeeks
GARTHWILSON wrote:
Almost everything people think requires floating-point math can actually be done, just as accurately and more efficiently, in scaled integer. Notice I didn't say "fixed point," as that's only a limited subset of scaled integer. I have an article on this at
http://wilsonminesco.com/16bitMathTables/, along with large look-up tables for many 16-bit math functions. There is no interpolation involved, because every single value is there, pre-calculated, accurate to all 16 bits; and in many cases it's literally
hundreds of times as fast to just look up the answer as it is to actually have to calculate it.
Not exactly the same thing, but I can confirm what you said in another scenario. Years ago we had a database driven application that monitored key metrics. It was hierarchical in nature but had a severe bug that only allowed the parent/child relationship to go one level deep.
I altered the program not only to go N levels deep (effectively unlimited), but also to store the summation of the values IN the database. Plus, we used integer (fixed-point) calculations because our project only needed X.XXXX levels of accuracy. But it could have been much deeper than that if needed.
Anyway, people thought I was crazy storing the summation in the DB each time the children were altered. But I had a quick method to do that up and down the tree each time.
So my point? When running reports on the parent (or even the children), the sums/totals came back in single-digit milliseconds no matter the time span. No amount of querying could produce a slow report, because everything was basically a giant lookup table with no floating-point conversions, etc.
I'm certainly no math wizard but unless you're sending satellites to Pluto, I think lookup tables and integer math can work for most situations. Especially anything you do with an 8/16 bit computer.
Re: CBM (C64) floating point improvement
Posted: Fri Sep 07, 2018 1:43 pm
by BigEd
Lookup tables are great for limited precision, but their size doubles for each extra bit of input precision you want. Garth's tables are already much bigger than the 6502's address space, so they need to be mapped into memory in some way.
You don't need to go to Pluto to find applications which need more than 16 bit precision. (There's a fair chance that engine management has tighter deadlines and needs more accuracy than that, and audio is very demanding, especially when mixing various signals and post-processing them, and still aiming at a high fidelity output.)
Now, on a machine like the 6502 with no hardware for multiply or divide, some small lookup tables can help to construct a multi-digit-precision facility while still being faster than shift-and-operate approaches.
I don't think I've seen a floating point library which uses anything other than shift-and-operate, but it would be interesting to hear of one.
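One classic instance of the small-table idea (my example, not necessarily the kind of facility meant above) is quarter-square multiplication, which turns a multiply into two lookups and a subtraction via a*b = (a+b)²/4 - (a-b)²/4:

```c
#include <stdint.h>

/* Quarter-square multiply: a*b = f(a+b) - f(|a-b|), where f(x) = x*x/4.
 * On a 6502 this replaces a shift-and-add loop with two table lookups
 * and one 16-bit subtraction; the 512-entry table covers all sums of
 * two 8-bit operands (a+b <= 510). */
static uint16_t qsq[512];   /* f(x) = floor(x*x / 4) for x = 0..511 */

void init_qsq(void) {
    for (int x = 0; x < 512; x++)
        qsq[x] = (uint16_t)((x * x) / 4);
}

uint16_t mul8(uint8_t a, uint8_t b) {
    /* The two floors discard the same fractional part, because a+b and
     * a-b always have the same parity, so the result is exact. */
    int d = (a >= b) ? a - b : b - a;
    return qsq[a + b] - qsq[d];
}
```

On real hardware the table would be split into separate low-byte and high-byte pages so each lookup is a single indexed load; chaining 8x8 partial products this way builds up the multi-digit precision mentioned above faster than pure shift-and-add.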
Re: CBM (C64) floating point improvement
Posted: Fri Sep 07, 2018 8:13 pm
by GARTHWILSON
BigEd wrote:
Lookup tables are great for limited precision, but they double for each extra bit you want. Garth's tables are already much bigger than the 6502's address space, so they need to be mapped into memory in some way.
I show how to interface it there through a VIA's serial port and 74HC595's and '165. It's still much, much faster than having to actually calculate the values.
BigEd wrote:
You don't need to go to Pluto to find applications which need more than 16 bit precision.
I said in the article that going up to 24-bit would require 48MB of ROM (or more conveniently, 64MB, so each answer is on a 4-byte boundary) which would be a lot of ROM even today. That was true at the time I wrote it, and 48MB or 64MB of EPROM is
still a lot; but maybe an SPI flash memory would now be the key. The next step would be 32-bit with 16GB. Hmmm... That would take way too long to calculate to fill the tables the way I was doing it before, since my HP-71 worked for a couple of weeks to do the set of 16-bit tables in BASIC the first time around! (Part of that was turning the data into Intel Hex files.)
Re: CBM (C64) floating point improvement
Posted: Sat Sep 08, 2018 1:15 am
by JimBoyd
GARTHWILSON wrote:
The next step would be 32-bit with 16GB. Hmmm... That would take way too long to calculate to fill the tables the way I was doing it before, since my HP-71 worked for a couple of weeks to do the set of 16-bit tables in BASIC the first time around! (Part of that was turning the data into Intel Hex files.)
So... for square roots anyway, it looks like if you ever needed 32-bit results (from 64-bit numbers), you'd either need a square root routine to calculate the roots directly as needed (I've got a Forth code word for that) or use interpolation with the existing tables.
Re: CBM (C64) floating point improvement
Posted: Tue Sep 11, 2018 4:02 pm
by Jeff_Birt
White Flame wrote:
This might be the algorithm from
Wikipedia. It's sort of a riff on the basic shift-and-subtract division, which has a cumulative effect such that it's seeing if (accum_shifted + bit)^2 fits as opposed to if (numerator_shifted > denom).
The code example on Wikipedia was wrong. After trying for a few hours to figure out why my Forth implementation of it was not working, I compared the code block shown on Wikipedia to the linked source on the Wayback Machine. I had already noticed the superfluous 'while' in the Wiki code, but the actual error was in how 'res' was calculated in the if statement.
Code:
res = (res >> 1) + bit; // WRONG!
res += bit << 1;        // CORRECT
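For context, here is my reconstruction in C of the routine with the corrected line in place; the setup of 'bit' and the loop structure are my assumptions, based on the usual form of this algorithm rather than the exact Wayback source:

```c
#include <stdint.h>

/* Integer square root, shift-and-subtract style, with the corrected
 * update of 'res' discussed above. 'bit' walks down through the powers
 * of 4; 'res' is shifted right once per iteration, so the bit added by
 * 'res += bit << 1' ends up in the right place by the time the loop
 * finishes. */
uint16_t isqrt(uint32_t num) {
    uint32_t res = 0;
    uint32_t bit = 1UL << 30;        /* highest power of 4 in a uint32 */
    while (bit > num)                /* drop down to the input's range */
        bit >>= 2;
    while (bit != 0) {
        if (num >= res + bit) {
            num -= res + bit;
            res += bit << 1;         /* the corrected line */
        }
        res >>= 1;
        bit >>= 2;
    }
    return (uint16_t)res;
}
```

With the wrong version of that line, the conditional and unconditional shifts of 'res' fight each other and the root comes out garbled, which matches the debugging experience described above.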