Comparing 16 bit Integer Square Root routines (SQRT)
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Comparing 16 bit Integer Square Root routines (SQRT)
So you need an integer SQRT function, but which to choose?
I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:
https://github.com/TobyLobster/sqrt_test
I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.
I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:
https://github.com/TobyLobster/sqrt_test
I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.
- BigDumbDinosaur
- Posts: 9431
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
- Contact:
Re: Comparing 16 bit Integer Square Root routines (SQRT)
TobyLobster wrote:
So you need an integer SQRT function, but which to choose?
I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:
https://github.com/TobyLobster/sqrt_test
I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.
I've been analysing the performance of many different implementations of 16 bit integer square root, testing them exhaustively over all possible inputs, cycle counting them, then graphing the results. There are big differences in both performance and memory use:
https://github.com/TobyLobster/sqrt_test
I've omitted routines that calculate squares by adding successive odd numbers, since these are *way* too slow for anything but small numbers.
Sounds interesting and in fact, I'm working on something that likely will need a good square root function, especially one that can be extended to work with 32-bit integers.
I don't know if you are aware of it, but you can attach source code files to posts here, which makes it more convenient for forum members who don't do Github.
x86? We ain't got no x86. We don't NEED no stinking x86!
- GARTHWILSON
- Posts: 8777
- Joined: 30 Aug 2002
- Location: Southern California
- Contact:
Re: Comparing 16 bit Integer Square Root routines (SQRT)
Memory is cheap now; so if performance is a high-enough priority, it makes sense to have large look-up tables. I have such 16-bit scaled-integer (notice I did not say "fixed-point" which is a limited subset of scaled-integer) look-up tables available for many different functions at http://wilsonminesco.com/16bitMathTables/, not requiring any interpolation, as every answer is there, pre-calculated. Most of the tables are 128KB each, ie, 64K two-byte cells, although there are exceptions. Three of them are square-root tables:
The contents of front page of that section of the site are:
Other parts of this section of the site, besides the tables themselves, are:
math table files' descriptions
rational-number approximations
how Intel Hex files work
how the files were calculated & formed
Code: Select all
SQRT1.HEX is a 64K table of square roots. Input is 16-bit (unsigned), and output is 8-bit. For bank A ($A:0000-A:FFFF) of ROM
1. Output is truncated in this one, not rounded to the nearest integer. Even for getting the 16-bit root of a 32-bit
input, the table can serve to give the r=(x+r²)/2r method a better starting point so you don't need so many iterations.
SQRT2.HEX is another 64K table of square roots. Like SQRT1.HEX, input is 16-bit (unsigned) and output is 8-bit; but unlike
SQRT1.HEX, this one is rounded. Max output is of course $FF, not $100. For bank B ($B:0000-B:FFFF) of ROM 1.
SQRT3.HEX is a 128K table of square roots. Output is 16-bit. Input in a sense is 32-bit (unsigned); but with only 64K cells,
it assumes the 16 input bits are actually the high 16 bits of a 32-bit input whose low cell is 0000. Output is rounded.
For 32-bit input numbers whose low cell is not 0000, interpolating may bring a few more bits of accuracy.
For banks C & D ($C:0000-D:FFFF) of ROM 1.The contents of front page of that section of the site are:
- Introduction & Contents
- The unexpected, efficient solution (scaled-integer, that is)
- Unexpected accuracy and range without floating point
- Dollars and cents accuracy even in hex
- Angles in trigonometry
- Easy conversion to and from any base
- What's in the files
- Ideas for interfacing
- So how do we look up a value? (even for the 6502's limited memory-map space)
Other parts of this section of the site, besides the tables themselves, are:
math table files' descriptions
rational-number approximations
how Intel Hex files work
how the files were calculated & formed
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Comparing 16 bit Integer Square Root routines (SQRT)
Thanks for posting this work over here TobyLobster - I enjoyed the thread over on stardot!
Interesting results, for sure.
Interesting results, for sure.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Comparing 16 bit Integer Square Root routines (SQRT)
Michael J. Mahon wrote a very fast square root routine to replace the small but very slow exp(log(x)/2) Applesoft version. It performs its magic in 40-bit Microsoft floating point, but I have a crazy notion to simplify it way down for integers. I don't understand how it works, which is a bit of an impediment, and it's quite possible that its algorithm is already in one of the routines listed.
http://michaeljmahon.com/USR.SQR.html
http://michaeljmahon.com/USR.SQR.html
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Comparing 16 bit Integer Square Root routines (SQRT)
That sounds intriguing, I hadn't thought to convert a floating point SQRT to integer, but it sounds like it should work. It's not obvious to me if it will turn out to be similar to one I've already tested or not, but it will be interesting to find out! Let us know how you get on.
Re: Comparing 16 bit Integer Square Root routines (SQRT)
TobyLobster wrote:
I hadn't thought to convert a floating point SQRT to integer
Michael A.
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Comparing 16 bit Integer Square Root routines (SQRT)
Thanks for that article, Michael. I think I'll have to read that more than a couple of times to get the hang of the math.
In other news, I was playing around with MJM's MS float routine, and I found what I thought was an interesting little resting point:
Load the 16-bit integer into ACC, and it appears to come back as the square root, in 8.8 fixed point format! I still don't understand how it works, but it seems I can get just the truncated integer result by dumbing the ldx #16 down to ldx #8.
There appears to be room for improvement if the truncated result is all that's needed, but I'm chuffed at the current state of affairs. Can anyone with a better brain than mine try to explain what's going on here?
In other news, I was playing around with MJM's MS float routine, and I found what I thought was an interesting little resting point:
Code: Select all
ACC = $80
ARG = $82
TMP = $84
sqrt:
lda ACC+1
sta ARG+1
lda ACC
sta ARG ; Move ACC to ARG
lda #0 ; Zero ACC & TMP
sta ACC
sta TMP
sta ACC+1
sta TMP+1
ldx #16 ; Gen X bits of sqrt
bne start ; (always)
loop:
asl ARG ; Left shift TMP
rol ARG+1
rol TMP
rol TMP+1
asl ARG ; by 2 bits.
rol ARG+1
rol TMP
rol TMP+1
lda TMP+1 ; Compare __ bits of TMP
cmp ACC+1 ; with current sqrt.
bne check
lda TMP ; Compare next byte
cmp ACC
bne check
start:
lda ARG+1 ; Compare final byte
cmp #$40
check:
bcc shift0 ; TMP > sqrt
lda ARG+1 ; TMP <= sqrt, so subtract
sbc #$40
sta ARG+1
lda TMP
sbc ACC
sta TMP
lda TMP+1
sbc ACC+1
sta TMP+1
shift0:
rol ACC ; Rotate C into sqrt
rol ACC+1
dex ; Done?
bne loop ; -No, keep looping.
rtsThere appears to be room for improvement if the truncated result is all that's needed, but I'm chuffed at the current state of affairs. Can anyone with a better brain than mine try to explain what's going on here?
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
-
kernelthread
- Posts: 166
- Joined: 23 Jun 2021
Re: Comparing 16 bit Integer Square Root routines (SQRT)
Here's an attempt at an explanation of the algorithm.
Assume the input x is a fraction with 0 <= x < 1
In binary, this means the binary point is immediately to the left of the input value
Each iteration generates a single result bit
The first iteration determines the 2^-1 bit, the second the 2^-2 bit etc.
After n iterations, let y_n be the partial result
For iteration n+1 we need to decide the value of the 2^-(n+1) bit
The question which needs to be answered is as follows:
Is (y_n + 2^-(n+1))^2 >= x ?
If the answer is yes, the next bit is 1, otherwise it is 0
(y_n + 2^-(n+1))^2 = y_n^2 + 2^-n y_n + 2^-(2n+2)
This is a multiple of 2^-(2n+2), so at iteration (n+1), 2(n+1) fraction bits are involved
in the comparison, which means you need to take the next 2 input bits at each iteration
We rearrange the equations in terms of the error after n iterations:
e_n+1 = x - (y_n + 2^-(n+1))^2 = (x - y_n^2) - (2^-n y_n + 2^-(2n+2))
= e_n - (2^-n y_n + 2^-(2n+2))
Multiplying both sides by 2^2n gives
2^2n e_n+1 = 2^2n e_n - 2^n y_n - 2^-2
The terms on the RHS are as follows:
2^2n e_n = current error term shifted left so it is an integer, along with 2 extra input bits shifted into 2^-1 and 2^-2 positions
2^n y_n = current partial result shifted left so it is an integer
-2^-2 = an extra 1/4, which is where the CMP #$40 comes in
If the subtraction gives a borrow, the next result bit is 0 so we shift the partial result left 1
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction
Note that after n iterations, the difference between y_n and the true square root is guaranteed to be <2^-n
Let the true square root be r.
(r - y_n)(r + y_n) = r^2 - y_n^2
= x - y_n^2
e_n = (x - y_n^2) < 2^-n . 2 = 2^-(n-1)
So the error accumulator e_n grows by only 1 bit per iteration rather than 2
Assume the input x is a fraction with 0 <= x < 1
In binary, this means the binary point is immediately to the left of the input value
Each iteration generates a single result bit
The first iteration determines the 2^-1 bit, the second the 2^-2 bit etc.
After n iterations, let y_n be the partial result
For iteration n+1 we need to decide the value of the 2^-(n+1) bit
The question which needs to be answered is as follows:
Is (y_n + 2^-(n+1))^2 >= x ?
If the answer is yes, the next bit is 1, otherwise it is 0
(y_n + 2^-(n+1))^2 = y_n^2 + 2^-n y_n + 2^-(2n+2)
This is a multiple of 2^-(2n+2), so at iteration (n+1), 2(n+1) fraction bits are involved
in the comparison, which means you need to take the next 2 input bits at each iteration
We rearrange the equations in terms of the error after n iterations:
e_n+1 = x - (y_n + 2^-(n+1))^2 = (x - y_n^2) - (2^-n y_n + 2^-(2n+2))
= e_n - (2^-n y_n + 2^-(2n+2))
Multiplying both sides by 2^2n gives
2^2n e_n+1 = 2^2n e_n - 2^n y_n - 2^-2
The terms on the RHS are as follows:
2^2n e_n = current error term shifted left so it is an integer, along with 2 extra input bits shifted into 2^-1 and 2^-2 positions
2^n y_n = current partial result shifted left so it is an integer
-2^-2 = an extra 1/4, which is where the CMP #$40 comes in
If the subtraction gives a borrow, the next result bit is 0 so we shift the partial result left 1
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction
Note that after n iterations, the difference between y_n and the true square root is guaranteed to be <2^-n
Let the true square root be r.
(r - y_n)(r + y_n) = r^2 - y_n^2
= x - y_n^2
e_n = (x - y_n^2) < 2^-n . 2 = 2^-(n-1)
So the error accumulator e_n grows by only 1 bit per iteration rather than 2
-
kernelthread
- Posts: 166
- Joined: 23 Jun 2021
Re: Comparing 16 bit Integer Square Root routines (SQRT)
There's a similar algorithm for calculating cube roots. Here's a C implementation for 32 bit values:
Code: Select all
/* Calculate cube root of a 32 bit unsigned integer
Parameters:
x = input argument
rem = pointer to where remainder should be written
Return value = integer part of cube root
x = (return value) ^ 3 + remainder
*/
uint32_t icbrt32(uint32_t x, uint32_t* rem)
{
uint32_t a = 0;
uint32_t y = 0;
uint32_t yy = 0;
uint32_t z;
uint32_t i;
a = x >> 30;
x <<= 2;
if (a >= 1)
{
--a;
y = 1;
yy = 1;
}
for (i=1; i<11; ++i)
{
a = (a << 3) | (x >> 29);
x <<= 3;
y <<= 1;
yy <<= 2;
z = y + yy;
z += (z << 1);
z += 1;
if (a >= z)
{
a -= z;
yy += (y << 1);
++y;
++yy;
}
}
if (rem)
*rem = a;
return y;
}
- barrym95838
- Posts: 2056
- Joined: 30 Jun 2013
- Location: Sacramento, CA, USA
Re: Comparing 16 bit Integer Square Root routines (SQRT)
@kernelthread: Thank you so much for taking the time and effort to write that up, and for solving the mystery of the #$40 constant! My 19-year-old brain would have gobbled up the theory, but my 56-year-old brain is struggling mightily.
Neat how the 6502's SBC instruction seems tailor-made for that, huh?
Quote:
If the subtraction gives a borrow, the next result bit is 0 so we shift the partial result left 1
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction
If the subtraction gives no borrow, the next result bit is 1 so we shift the partial result left 1 and add 1,
and update the current error value according to the result of the subtraction
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!
Mike B. (about me) (learning how to github)
Mike B. (about me) (learning how to github)
Re: Comparing 16 bit Integer Square Root routines (SQRT)
I can't believe how close the SQRT routine is to WOZ's MOD routine.
This is from WOZ's Integer Basic. Note: In my assembler ":" is always a forward branch and "]" is always a backwards branch
This is from WOZ's Integer Basic. Note: In my assembler ":" is always a forward branch and "]" is always a backwards branch
Code: Select all
TMP = $4
ACC = $6
ARG = $8
MOD lda #0
sta TMP
sta TMP+1
ldy #$10
lda ARG
bne :1
lda ARG+1 ; division by zero check
beq error
:1 equ *
]lp asl ACC
rol ACC+1
rol TMP
rol TMP+1
lda TMP
cmp ARG
lda TMP+1
sbc ARG+1
bcc :1
sta TMP+1
lda TMP
sbc ARG
sta TMP
inc ACC
:1 dey
bne ]lp
clc
rts
error sec
rtsRe: Comparing 16 bit Integer Square Root routines (SQRT)
TobyLobster wrote:
So you need an integer SQRT function, but which to choose?
Code: Select all
; 2 tables with low and high bytes of all x*x values
squared_table_lo:
:256 dta l(#*#)
squared_table_hi:
:256 dta h(#*#)
l: .byte 0
r: .byte 0
val: .word 0
; input: 16bit val
; returns: sqrt of 16bit value in X register
; base on interative binary search
; https://www.geeksforgeeks.org/c-program-for-binary-search-recursive-and-iterative/
sqrt:
ldx #0
stx l
dex ; #255
stx r
; iterative binary search
; first on the higher byte, then on lower
loop:
;int m = l + (r-l)/2;
; (r-l)
sec
lda r
sbc l
lsr ; /2
clc ; !!!todo - check if needed
adc l
; m is x register
tax
; compare arr[m] and val
LDA squared_table_hi,x
CMP val+1
BNE skip_lo
LDA squared_table_lo,x
CMP val+0
skip_lo:
BCC lower
BNE higher
same:
; we return index in X register
rts
lower:
; // If x greater, ignore left half
; if (arr[m] < val)
; l = m + 1;
inx
stx l
jmp check_end
higher:
; // If val is smaller, ignore right half
; else
; r = m - 1;
dex
stx r
check_end:
; ending condition in the algorithm is:
; while (l <= r)
; which is equal:
; while (r>l)
; and we can use BCS
LDA r
CMP l
BCS loop
stop:
; we return index in X register
rts
-
TobyLobster
- Posts: 37
- Joined: 17 Jun 2021
Re: Comparing 16 bit Integer Square Root routines (SQRT)
ilmenit wrote:
TobyLobster wrote:
So you need an integer SQRT function, but which to choose?
Code: Select all
sqrt:
ldx #0
stx l
dex ; #255
stx r
; special case for 255
lda val+1
cmp #$fe
bcc +
bne same
lda val
bne same
+
Code: Select all
lower:
; // If x greater, ignore left half
; if (arr[m] < val)
; l = m + 1;
inx
beq stop ; <- added
Re: Comparing 16 bit Integer Square Root routines (SQRT)
Thanks, very good points about the rounding. I needed it for an Euclidean distance and rounding problem was negligible for me, so I didn't catch it before.
The code is on the MIT license so feel free to use it or add it to your benchmark.
The code is on the MIT license so feel free to use it or add it to your benchmark.