Comparing 16 bit Integer Square Root routines (SQRT)

TobyLobster · Post by **TobyLobster** » Sun Jan 30, 2022 9:52 pm

So you need an integer SQRT function, but which to choose?

I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:

https://github.com/TobyLobster/sqrt_test

I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Sun Jan 30, 2022 10:09 pm

TobyLobster wrote:

So you need an integer SQRT function, but which to choose?

I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:

https://github.com/TobyLobster/sqrt_test

I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.

Sounds interesting and in fact, I'm working on something that likely will need a good square root function, especially one that can be extended to work with 32-bit integers.

I don't know if you are aware of it, but you can attach source code files to posts here, which makes it more convenient for forum members who don't do Github.

GARTHWILSON · Post by **GARTHWILSON** » Sun Jan 30, 2022 11:23 pm

Memory is cheap now; so if performance is a high-enough priority, it makes sense to have large look-up tables. I have such 16-bit scaled-integer (notice I did not say "fixed-point" which is a limited subset of scaled-integer) look-up tables available for many different functions at http://wilsonminesco.com/16bitMathTables/, not requiring any interpolation, as every answer is there, pre-calculated. Most of the tables are 128KB each, ie, 64K two-byte cells, although there are exceptions. Three of them are square-root tables:

Code: Select all

SQRT1.HEX is a 64K table of square roots.  Input is 16-bit (unsigned), and output is 8-bit.  For bank A ($A:0000-A:FFFF) of ROM
        1.  Output is truncated in this one, not rounded to the nearest integer.  Even for getting the 16-bit root of a 32-bit
        input, the table can serve to give the r=(x+r²)/2r method a better starting point so you don't need so many iterations.


SQRT2.HEX is another 64K table of square roots.  Like SQRT1.HEX, input is 16-bit (unsigned) and output is 8-bit; but unlike
        SQRT1.HEX, this one is rounded.  Max output is of course $FF, not $100.  For bank B ($B:0000-B:FFFF) of ROM 1.


SQRT3.HEX is a 128K table of square roots.  Output is 16-bit.  Input in a sense is 32-bit (unsigned); but with only 64K cells,
        it assumes the 16 input bits are actually the high 16 bits of a 32-bit input whose low cell is 0000.  Output is rounded.
        For 32-bit input numbers whose low cell is not 0000, interpolating may bring a few more bits of accuracy.
        For banks C & D ($C:0000-D:FFFF) of ROM 1.

The contents of front page of that section of the site are:

Introduction & Contents
- The unexpected, efficient solution (scaled-integer, that is)
- Unexpected accuracy and range without floating point
- Dollars and cents accuracy even in hex
- Angles in trigonometry
- Easy conversion to and from any base
- What's in the files
Ideas for interfacing
So how do we look up a value? (even for the 6502's limited memory-map space)

Other parts of this section of the site, besides the tables themselves, are:
math table files' descriptions
rational-number approximations
how Intel Hex files work
how the files were calculated & formed

BigEd · Post by **BigEd** » Mon Jan 31, 2022 7:22 am

Thanks for posting this work over here TobyLobster - I enjoyed the thread over on stardot!

Interesting results, for sure.

barrym95838 · Post by **barrym95838** » Mon Jan 31, 2022 7:30 pm

Michael J. Mahon wrote a very fast square root routine to replace the small but very slow exp(log(x)/2) Applesoft version. It performs its magic in 40-bit Microsoft floating point, but I have a crazy notion to simplify it way down for integers. I don't understand how it works, which is a bit of an impediment, and it's quite possible that its algorithm is already in one of the routines listed.

http://michaeljmahon.com/USR.SQR.html

TobyLobster · Post by **TobyLobster** » Tue Feb 01, 2022 3:55 pm

That sounds intriguing, I hadn't thought to convert a floating point SQRT to integer, but it sounds like it should work. It's not obvious to me if it will turn out to be similar to one I've already tested or not, but it will be interesting to find out! Let us know how you get on.

MichaelM · Post by **MichaelM** » Wed Feb 02, 2022 12:01 am

TobyLobster wrote:

I hadn't thought to convert a floating point SQRT to integer

I am sure that many of the algorithms you've linked to in this thread are much more efficient than the algorithm I used to simultaneously calculate sqrt(S) and 1 / sqrt(S) for a control application that I implemented with a microprogrammed FPGA. It was based on a floating point algorithm, and I had to derive the integer form of the algorithm for my particular application where I specifically needed both the sqrt(S) and 1 / S. I derived the inverse of S by squaring the 1 / sqrt(S) term that the algorithm provided for free. I've described the derivation of my algorithm here.

barrym95838 · Post by **barrym95838** » Wed Feb 02, 2022 7:29 am

Thanks for that article, Michael. I think I'll have to read that more than a couple of times to get the hang of the math.

In other news, I was playing around with MJM's MS float routine, and I found what I thought was an interesting little resting point:

Code: Select all

ACC = $80
ARG = $82
TMP = $84
sqrt:
    lda ACC+1
    sta ARG+1
    lda ACC
    sta ARG         ; Move ACC to ARG
    lda #0          ; Zero ACC & TMP
    sta ACC
    sta TMP
    sta ACC+1
    sta TMP+1
    ldx #16         ; Gen X bits of sqrt
    bne start       ; (always)
loop:
    asl ARG         ; Left shift TMP
    rol ARG+1
    rol TMP
    rol TMP+1
    asl ARG         ; by 2 bits.
    rol ARG+1
    rol TMP
    rol TMP+1
    lda TMP+1       ; Compare __ bits of TMP
    cmp ACC+1       ; with current sqrt.
    bne check
    lda TMP         ; Compare next byte
    cmp ACC
    bne check
start:
    lda ARG+1       ; Compare final byte
    cmp #$40
check:
    bcc shift0      ; TMP > sqrt
    lda ARG+1       ; TMP <= sqrt, so subtract
    sbc #$40
    sta ARG+1
    lda TMP
    sbc ACC
    sta TMP
    lda TMP+1
    sbc ACC+1
    sta TMP+1
shift0:
    rol ACC         ; Rotate C into sqrt
    rol ACC+1
    dex             ; Done?
    bne loop        ; -No, keep looping.
    rts

Load the 16-bit integer into ACC, and it appears to come back as the square root, in 8.8 fixed point format! I still don't understand how it works, but it seems I can get just the truncated integer result by dumbing the ldx #16 down to ldx #8.

There appears to be room for improvement if the truncated result is all that's needed, but I'm chuffed at the current state of affairs. Can anyone with a better brain than mine try to explain what's going on here?

kernelthread · Post by **kernelthread** » Wed Feb 02, 2022 12:04 pm

Here's an attempt at an explanation of the algorithm.
Assume the input x is a fraction with 0 <= x < 1
In binary, this means the binary point is immediately to the left of the input value
Each iteration generates a single result bit
The first iteration determines the 2^-1 bit, the second the 2^-2 bit etc.
After n iterations, let y_n be the partial result
For iteration n+1 we need to decide the value of the 2^-(n+1) bit
The question which needs to be answered is as follows:

Is (y_n + 2^-(n+1))^2 >= x ?

If the answer is yes, the next bit is 1, otherwise it is 0

(y_n + 2^-(n+1))^2 = y_n^2 + 2^-n y_n + 2^-(2n+2)

This is a multiple of 2^-(2n+2), so at iteration (n+1), 2(n+1) fraction bits are involved
in the comparison, which means you need to take the next 2 input bits at each iteration

We rearrange the equations in terms of the error after n iterations:

e_n+1 = x - (y_n + 2^-(n+1))^2 = (x - y_n^2) - (2^-n y_n + 2^-(2n+2))
= e_n - (2^-n y_n + 2^-(2n+2))

Multiplying both sides by 2^2n gives

2^2n e_n+1 = 2^2n e_n - 2^n y_n - 2^-2

The terms on the RHS are as follows:

2^2n e_n = current error term shifted left so it is an integer, along with 2 extra input bits shifted into 2^-1 and 2^-2 positions
2^n y_n = current partial result shifted left so it is an integer
-2^-2 = an extra 1/4, which is where the CMP #$40 comes in

If the subtraction gives a borrow, the next result bit is 0 so we shift the partial result left 1
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction

Note that after n iterations, the difference between y_n and the true square root is guaranteed to be <2^-n
Let the true square root be r.

(r - y_n)(r + y_n) = r^2 - y_n^2
= x - y_n^2

e_n = (x - y_n^2) < 2^-n . 2 = 2^-(n-1)

So the error accumulator e_n grows by only 1 bit per iteration rather than 2

kernelthread · Post by **kernelthread** » Wed Feb 02, 2022 12:16 pm

There's a similar algorithm for calculating cube roots. Here's a C implementation for 32 bit values:

Code: Select all

/* Calculate cube root of a 32 bit unsigned integer
   Parameters:
       x = input argument
	   rem = pointer to where remainder should be written
   Return value = integer part of cube root
   x = (return value) ^ 3 + remainder
 */
uint32_t icbrt32(uint32_t x, uint32_t* rem)
{
    uint32_t a = 0;
    uint32_t y = 0;
    uint32_t yy = 0;
    uint32_t z;
    uint32_t i;

    a = x >> 30;
    x <<= 2;
    if (a >= 1)
    {
        --a;
        y = 1;
        yy = 1;
    }
    for (i=1; i<11; ++i)
    {
        a = (a << 3) | (x >> 29);
        x <<= 3;
        y <<= 1;
        yy <<= 2;
        z = y + yy;
        z += (z << 1);
        z += 1;
        if (a >= z)
        {
            a -= z;
            yy += (y << 1);
            ++y;
            ++yy;
        }
    }
	if (rem)
		*rem = a;
    return y;
}

barrym95838 · Post by **barrym95838** » Wed Feb 02, 2022 4:06 pm

@kernelthread: Thank you so much for taking the time and effort to write that up, and for solving the mystery of the #$40 constant! My 19-year-old brain would have gobbled up the theory, but my 56-year-old brain is struggling mightily.

Quote:

If the subtraction gives a borrow, the next result bit is 0 so we shift the partial result left 1
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction

Neat how the 6502's SBC instruction seems tailor-made for that, huh?

IamRob · Post by **IamRob** » Wed Feb 02, 2022 11:45 pm

I can't believe how close the SQRT routine is to WOZ's MOD routine.

This is from WOZ's Integer Basic. Note: In my assembler ":" is always a forward branch and "]" is always a backwards branch

Code: Select all

TMP = $4
ACC = $6
ARG = $8

MOD	lda #0
	sta TMP
	sta TMP+1

	ldy #$10
	lda ARG
	bne :1

	lda ARG+1	; division by zero check
	beq error

:1	equ *
]lp	asl ACC
	rol ACC+1
	rol TMP
	rol TMP+1

	lda TMP
	cmp ARG

	lda TMP+1
	sbc ARG+1
	bcc :1
	sta TMP+1

	lda TMP
	sbc ARG
	sta TMP

	inc ACC

:1	dey
	bne ]lp

	clc
	rts

error	sec
	rts

ilmenit · Post by **ilmenit** » Thu Feb 03, 2022 9:19 am

TobyLobster wrote:

So you need an integer SQRT function, but which to choose?

Here is my implementation that is using 2 tables of 256 bytes and binary search. The code is in MADS assembler syntax for table generation.

Code: Select all

; 2 tables with low and high bytes of all x*x values
squared_table_lo:
:256 dta l(#*#)	
squared_table_hi:
:256 dta h(#*#)

l: .byte 0
r: .byte 0
val: .word 0 

; input: 16bit val
; returns: sqrt of 16bit value in X register
; base on interative binary search 
; https://www.geeksforgeeks.org/c-program-for-binary-search-recursive-and-iterative/
sqrt: 
	ldx #0
	stx l
	dex ; #255
	stx r
	; iterative binary search
	; first on the higher byte, then on lower
loop:
    	;int m = l + (r-l)/2; 
	; (r-l)
	sec
	lda r 
	sbc l
	lsr ; /2
	clc ; !!!todo - check if needed
	adc l
	; m is x register
	tax 
	
;	compare arr[m] and val  		
	  LDA squared_table_hi,x
	  CMP val+1
	  BNE skip_lo
	  LDA squared_table_lo,x
	  CMP val+0
skip_lo:  
	  BCC lower
	  BNE higher
same:
	  ; we return index in X register	  
	  rts	
lower:
	;    // If x greater, ignore left half   
	;    if (arr[m] < val)  
	;        l = m + 1;  
	  inx
	  stx l
	  jmp check_end
higher:
	;    // If val is smaller, ignore right half  
	;    else 
	;         r = m - 1;  
	  dex
	  stx r
check_end:
	; ending condition in the algorithm is:
	;   while (l <= r)
	; which is equal:
 	;   while (r>l)
	; and we can use BCS
  	LDA r
	CMP l
	BCS loop
stop:	
	; we return index in X register
	rts

TobyLobster · Post by **TobyLobster** » Fri Feb 04, 2022 8:06 am

ilmenit wrote:

TobyLobster wrote:

So you need an integer SQRT function, but which to choose?

Here is my implementation that is using 2 tables of 256 bytes and binary search. The code is in MADS assembler syntax for table generation.

Interesting approach I've not seen elsewhere. A couple of minor fixes are needed: You need "ldx r" before the rts in stop: to round the result down properly e.g. for SQRT(5), and inputs over 255*255 need handling properly (either with a special check at the start, such as:

Code: Select all

sqrt:
    ldx #0
    stx l
    dex ; #255
    stx r

    ; special case for 255
    lda val+1
    cmp #$fe
    bcc +
    bne same
    lda val
    bne same
+

or by adding BEQ stop into lower:

Code: Select all

lower:
    ;    // If x greater, ignore left half
    ;    if (arr[m] < val)
    ;        l = m + 1;
    inx
    beq stop        ; <- added

Timing shown in the graph below as sqrt11, in red

ilmenit · Post by **ilmenit** » Sat Feb 05, 2022 11:11 pm

Thanks, very good points about the rounding. I needed it for an Euclidean distance and rounding problem was negligible for me, so I didn't catch it before.
The code is on the MIT license so feel free to use it or add it to your benchmark.

Comparing 16 bit Integer Square Root routines (SQRT)

Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)

Re: Comparing 16 bit Integer Square Root routines (SQRT)