6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Nov 16, 2024 5:12 pm

All times are UTC




Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next
Author Message
PostPosted: Sun Jan 30, 2022 9:52 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
So you need an integer SQRT function, but which to choose?

I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:

https://github.com/TobyLobster/sqrt_test

I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 30, 2022 10:09 pm 
Offline
User avatar

Joined: Thu May 28, 2009 9:46 pm
Posts: 8491
Location: Midwestern USA
TobyLobster wrote:
So you need an integer SQRT function, but which to choose?

I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:

https://github.com/TobyLobster/sqrt_test

I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.

Sounds interesting and in fact, I'm working on something that likely will need a good square root function, especially one that can be extended to work with 32-bit integers.

I don't know if you are aware of it, but you can attach source code files to posts here, which makes it more convenient for forum members who don't do Github.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!


Top
 Profile  
Reply with quote  
PostPosted: Sun Jan 30, 2022 11:23 pm 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8541
Location: Southern California
Memory is cheap now; so if performance is a high-enough priority, it makes sense to have large look-up tables. I have such 16-bit scaled-integer (notice I did not say "fixed-point" which is a limited subset of scaled-integer) look-up tables available for many different functions at http://wilsonminesco.com/16bitMathTables/, not requiring any interpolation, as every answer is there, pre-calculated. Most of the tables are 128KB each, ie, 64K two-byte cells, although there are exceptions. Three of them are square-root tables:

Code:
SQRT1.HEX is a 64K table of square roots.  Input is 16-bit (unsigned), and output is 8-bit.  For bank A ($A:0000-A:FFFF) of ROM
        1.  Output is truncated in this one, not rounded to the nearest integer.  Even for getting the 16-bit root of a 32-bit
        input, the table can serve to give the r=(x+r²)/2r method a better starting point so you don't need so many iterations.


SQRT2.HEX is another 64K table of square roots.  Like SQRT1.HEX, input is 16-bit (unsigned) and output is 8-bit; but unlike
        SQRT1.HEX, this one is rounded.  Max output is of course $FF, not $100.  For bank B ($B:0000-B:FFFF) of ROM 1.


SQRT3.HEX is a 128K table of square roots.  Output is 16-bit.  Input in a sense is 32-bit (unsigned); but with only 64K cells,
        it assumes the 16 input bits are actually the high 16 bits of a 32-bit input whose low cell is 0000.  Output is rounded.
        For 32-bit input numbers whose low cell is not 0000, interpolating may bring a few more bits of accuracy.
        For banks C & D ($C:0000-D:FFFF) of ROM 1.


The contents of front page of that section of the site are:

  • Introduction & Contents
    • The unexpected, efficient solution (scaled-integer, that is)
    • Unexpected accuracy and range without floating point
    • Dollars and cents accuracy even in hex
    • Angles in trigonometry
    • Easy conversion to and from any base
    • What's in the files
  • Ideas for interfacing
  • So how do we look up a value? (even for the 6502's limited memory-map space)

Other parts of this section of the site, besides the tables themselves, are:
math table files' descriptions
rational-number approximations
how Intel Hex files work
how the files were calculated & formed

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 31, 2022 7:22 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10980
Location: England
Thanks for posting this work over here TobyLobster - I enjoyed the thread over on stardot!

Interesting results, for sure.


Top
 Profile  
Reply with quote  
PostPosted: Mon Jan 31, 2022 7:30 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Michael J. Mahon wrote a very fast square root routine to replace the small but very slow exp(log(x)/2) Applesoft version. It performs its magic in 40-bit Microsoft floating point, but I have a crazy notion to simplify it way down for integers. I don't understand how it works, which is a bit of an impediment, and it's quite possible that its algorithm is already in one of the routines listed.

http://michaeljmahon.com/USR.SQR.html

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Tue Feb 01, 2022 3:55 pm 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
That sounds intriguing, I hadn't thought to convert a floating point SQRT to integer, but it sounds like it should work. It's not obvious to me if it will turn out to be similar to one I've already tested or not, but it will be interesting to find out! Let us know how you get on.


Top
 Profile  
Reply with quote  
PostPosted: Wed Feb 02, 2022 12:01 am 
Offline
User avatar

Joined: Mon Apr 23, 2012 12:28 am
Posts: 760
Location: Huntsville, AL
TobyLobster wrote:
I hadn't thought to convert a floating point SQRT to integer
I am sure that many of the algorithms you've linked to in this thread are much more efficient than the algorithm I used to simultaneously calculate sqrt(S) and 1 / sqrt(S) for a control application that I implemented with a microprogrammed FPGA. It was based on a floating point algorithm, and I had to derive the integer form of the algorithm for my particular application where I specifically needed both the sqrt(S) and 1 / S. I derived the inverse of S by squaring the 1 / sqrt(S) term that the algorithm provided for free. I've described the derivation of my algorithm here.

_________________
Michael A.


Top
 Profile  
Reply with quote  
PostPosted: Wed Feb 02, 2022 7:29 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Thanks for that article, Michael. I think I'll have to read that more than a couple of times to get the hang of the math.

In other news, I was playing around with MJM's MS float routine, and I found what I thought was an interesting little resting point:
Code:
ACC = $80
ARG = $82
TMP = $84
sqrt:
    lda ACC+1
    sta ARG+1
    lda ACC
    sta ARG         ; Move ACC to ARG
    lda #0          ; Zero ACC & TMP
    sta ACC
    sta TMP
    sta ACC+1
    sta TMP+1
    ldx #16         ; Gen X bits of sqrt
    bne start       ; (always)
loop:
    asl ARG         ; Left shift TMP
    rol ARG+1
    rol TMP
    rol TMP+1
    asl ARG         ; by 2 bits.
    rol ARG+1
    rol TMP
    rol TMP+1
    lda TMP+1       ; Compare __ bits of TMP
    cmp ACC+1       ; with current sqrt.
    bne check
    lda TMP         ; Compare next byte
    cmp ACC
    bne check
start:
    lda ARG+1       ; Compare final byte
    cmp #$40
check:
    bcc shift0      ; TMP > sqrt
    lda ARG+1       ; TMP <= sqrt, so subtract
    sbc #$40
    sta ARG+1
    lda TMP
    sbc ACC
    sta TMP
    lda TMP+1
    sbc ACC+1
    sta TMP+1
shift0:
    rol ACC         ; Rotate C into sqrt
    rol ACC+1
    dex             ; Done?
    bne loop        ; -No, keep looping.
    rts

Load the 16-bit integer into ACC, and it appears to come back as the square root, in 8.8 fixed point format! I still don't understand how it works, but it seems I can get just the truncated integer result by dumbing the ldx #16 down to ldx #8.

There appears to be room for improvement if the truncated result is all that's needed, but I'm chuffed at the current state of affairs. Can anyone with a better brain than mine try to explain what's going on here?

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Wed Feb 02, 2022 12:04 pm 
Offline

Joined: Wed Jun 23, 2021 8:02 am
Posts: 166
Here's an attempt at an explanation of the algorithm.
Assume the input x is a fraction with 0 <= x < 1
In binary, this means the binary point is immediately to the left of the input value
Each iteration generates a single result bit
The first iteration determines the 2^-1 bit, the second the 2^-2 bit etc.
After n iterations, let y_n be the partial result
For iteration n+1 we need to decide the value of the 2^-(n+1) bit
The question which needs to be answered is as follows:

Is (y_n + 2^-(n+1))^2 >= x ?

If the answer is yes, the next bit is 1, otherwise it is 0

(y_n + 2^-(n+1))^2 = y_n^2 + 2^-n y_n + 2^-(2n+2)

This is a multiple of 2^-(2n+2), so at iteration (n+1), 2(n+1) fraction bits are involved
in the comparison, which means you need to take the next 2 input bits at each iteration

We rearrange the equations in terms of the error after n iterations:

e_n+1 = x - (y_n + 2^-(n+1))^2 = (x - y_n^2) - (2^-n y_n + 2^-(2n+2))
= e_n - (2^-n y_n + 2^-(2n+2))

Multiplying both sides by 2^2n gives

2^2n e_n+1 = 2^2n e_n - 2^n y_n - 2^-2

The terms on the RHS are as follows:

2^2n e_n = current error term shifted left so it is an integer, along with 2 extra input bits shifted into 2^-1 and 2^-2 positions
2^n y_n = current partial result shifted left so it is an integer
-2^-2 = an extra 1/4, which is where the CMP #$40 comes in

If the subtraction gives a borrow, the next result bit is 0 so we shift the partial result left 1
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction

Note that after n iterations, the difference between y_n and the true square root is guaranteed to be <2^-n
Let the true square root be r.

(r - y_n)(r + y_n) = r^2 - y_n^2
= x - y_n^2

e_n = (x - y_n^2) < 2^-n . 2 = 2^-(n-1)

So the error accumulator e_n grows by only 1 bit per iteration rather than 2


Top
 Profile  
Reply with quote  
PostPosted: Wed Feb 02, 2022 12:16 pm 
Offline

Joined: Wed Jun 23, 2021 8:02 am
Posts: 166
There's a similar algorithm for calculating cube roots. Here's a C implementation for 32 bit values:

Code:
/* Calculate cube root of a 32 bit unsigned integer
   Parameters:
       x = input argument
      rem = pointer to where remainder should be written
   Return value = integer part of cube root
   x = (return value) ^ 3 + remainder
 */
uint32_t icbrt32(uint32_t x, uint32_t* rem)
{
    uint32_t a = 0;
    uint32_t y = 0;
    uint32_t yy = 0;
    uint32_t z;
    uint32_t i;

    a = x >> 30;
    x <<= 2;
    if (a >= 1)
    {
        --a;
        y = 1;
        yy = 1;
    }
    for (i=1; i<11; ++i)
    {
        a = (a << 3) | (x >> 29);
        x <<= 3;
        y <<= 1;
        yy <<= 2;
        z = y + yy;
        z += (z << 1);
        z += 1;
        if (a >= z)
        {
            a -= z;
            yy += (y << 1);
            ++y;
            ++yy;
        }
    }
   if (rem)
      *rem = a;
    return y;
}


Top
 Profile  
Reply with quote  
PostPosted: Wed Feb 02, 2022 4:06 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
@kernelthread: Thank you so much for taking the time and effort to write that up, and for solving the mystery of the #$40 constant! My 19-year-old brain would have gobbled up the theory, but my 56-year-old brain is struggling mightily.
Quote:
If the subtraction gives a borrow, the next result bit is 0 so we shift the partial result left 1
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction

Neat how the 6502's SBC instruction seems tailor-made for that, huh?

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Top
 Profile  
Reply with quote  
PostPosted: Wed Feb 02, 2022 11:45 pm 
Offline

Joined: Sun Apr 26, 2020 3:08 am
Posts: 357
I can't believe how close the SQRT routine is to WOZ's MOD routine.

This is from WOZ's Integer Basic. Note: In my assembler ":" is always a forward branch and "]" is always a backwards branch

Code:
TMP = $4
ACC = $6
ARG = $8

MOD   lda #0
   sta TMP
   sta TMP+1

   ldy #$10
   lda ARG
   bne :1

   lda ARG+1   ; division by zero check
   beq error

:1   equ *
]lp   asl ACC
   rol ACC+1
   rol TMP
   rol TMP+1

   lda TMP
   cmp ARG

   lda TMP+1
   sbc ARG+1
   bcc :1
   sta TMP+1

   lda TMP
   sbc ARG
   sta TMP

   inc ACC

:1   dey
   bne ]lp

   clc
   rts

error   sec
   rts


Top
 Profile  
Reply with quote  
PostPosted: Thu Feb 03, 2022 9:19 am 
Offline

Joined: Thu Apr 23, 2020 12:57 pm
Posts: 11
TobyLobster wrote:
So you need an integer SQRT function, but which to choose?

Here is my implementation that is using 2 tables of 256 bytes and binary search. The code is in MADS assembler syntax for table generation.
Code:
; 2 tables with low and high bytes of all x*x values
squared_table_lo:
:256 dta l(#*#)   
squared_table_hi:
:256 dta h(#*#)

l: .byte 0
r: .byte 0
val: .word 0

; input: 16bit val
; returns: sqrt of 16bit value in X register
; base on interative binary search
; https://www.geeksforgeeks.org/c-program-for-binary-search-recursive-and-iterative/
sqrt:
   ldx #0
   stx l
   dex ; #255
   stx r
   ; iterative binary search
   ; first on the higher byte, then on lower
loop:
       ;int m = l + (r-l)/2;
   ; (r-l)
   sec
   lda r
   sbc l
   lsr ; /2
   clc ; !!!todo - check if needed
   adc l
   ; m is x register
   tax
   
;   compare arr[m] and val        
     LDA squared_table_hi,x
     CMP val+1
     BNE skip_lo
     LDA squared_table_lo,x
     CMP val+0
skip_lo: 
     BCC lower
     BNE higher
same:
     ; we return index in X register    
     rts   
lower:
   ;    // If x greater, ignore left half   
   ;    if (arr[m] < val) 
   ;        l = m + 1; 
     inx
     stx l
     jmp check_end
higher:
   ;    // If val is smaller, ignore right half 
   ;    else
   ;         r = m - 1; 
     dex
     stx r
check_end:
   ; ending condition in the algorithm is:
   ;   while (l <= r)
   ; which is equal:
    ;   while (r>l)
   ; and we can use BCS
     LDA r
   CMP l
   BCS loop
stop:   
   ; we return index in X register
   rts


Top
 Profile  
Reply with quote  
PostPosted: Fri Feb 04, 2022 8:06 am 
Offline

Joined: Thu Jun 17, 2021 7:53 am
Posts: 37
ilmenit wrote:
TobyLobster wrote:
So you need an integer SQRT function, but which to choose?

Here is my implementation that is using 2 tables of 256 bytes and binary search. The code is in MADS assembler syntax for table generation.

Interesting approach I've not seen elsewhere. A couple of minor fixes are needed: You need "ldx r" before the rts in stop: to round the result down properly e.g. for SQRT(5), and inputs over 255*255 need handling properly (either with a special check at the start, such as:

Code:
sqrt:
    ldx #0
    stx l
    dex ; #255
    stx r

    ; special case for 255
    lda val+1
    cmp #$fe
    bcc +
    bne same
    lda val
    bne same
+


or by adding BEQ stop into lower:

Code:
lower:
    ;    // If x greater, ignore left half
    ;    if (arr[m] < val)
    ;        l = m + 1;
    inx
    beq stop        ; <- added


Timing shown in the graph below as sqrt11, in red


Attachments:
results.png
results.png [ 220.96 KiB | Viewed 2110 times ]
Top
 Profile  
Reply with quote  
PostPosted: Sat Feb 05, 2022 11:11 pm 
Offline

Joined: Thu Apr 23, 2020 12:57 pm
Posts: 11
Thanks, very good points about the rounding. I needed it for an Euclidean distance and rounding problem was negligible for me, so I didn't catch it before.
The code is on the MIT license so feel free to use it or add it to your benchmark.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 19 posts ]  Go to page 1, 2  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 15 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: