6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Fri Nov 22, 2024 1:07 am

All times are UTC




Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 6 posts ] 
Author Message
PostPosted: Sun Apr 22, 2001 2:58 pm 
I have 16 bit routines. I am looking for FAST math routines.

Multiply, divide, add, subtract. I realize that each routine
is specialized for a certain result ranges, and other considerations.

I would like to be able to take a look at what source is available
and design my code with the help.

II


Report this post
Top
  
Reply with quote  
PostPosted: Mon Apr 23, 2001 4:24 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
The slowest of my basic math building blocks in Forth is UM/MOD, which takes in a 32-bit unsigned number, divides by a 16-bit, and gives a 16-bit quotient and a 16-bit remainder. Any overflow condition including /0 results in both quotient and remainder being FFFF, which is easy to catch and handle as an error condition, or in some cases, even use as-is. This is not a problem, since UM/MOD is unsigned. In my 65816 Forth, UM/MOD takes 59 microseconds @ 16MHz to do 26,274,859/7458 (expressed here in decimal). I haven't measured yet on the 65c02. Logically the multiply and especially the add and subtract will be much faster. Beyond that, my Forth words for log, trig, square-root, random numbers, multiple-precision, etc. are secondaries, as the speed penalty is minimal once you have the basics done as primitives, which are hand-optimized in assembly language. Forth secondaries are of course not processor-specific. Some of my functions like square root are not really optimized for speed; but I think that for my next workbench computer, I'll have large 64K-word look-up tables in memory. If your system can afford the memory requirements, this of course eliminates the speed-versus-accuracy tradeoff, giving you the best of both worlds. What I have not taken the time to figure out yet is if a 1/X table can be made practical to use for division. [Edit, years later: What I did for the 1/X table is made it to give 32-bit answers to 16-bit inputs, in order to get the full resolution at both ends of the scale. This table then is 256KB, instead of 128KB like most of the other tables in the set. (That's not to say you have to use all 32 bits every time if you don't want to.)]

I've collected a few algorithms over the years, and found some really surprising things. For example, there is a way to get square root with no multiplication or division, if you can wait a long, long time for the answer. I think one of the next ones I want to write is an FFT word. [Edit, years later: Done. It takes the 6502 five seconds to do a 2K complex FFT in 16-bit scaled-integer, without the large tables mentioned above. It took the original IBM PC about a hundred times as long to do one half that size in GWBASIC.] That would really come in handy sometimes on my workbench computer. 16-bit output would be more than adequate since the input will usually be an array of numbers read from 8-bit A/D converters.

About 10 years ago I began finding out how little you really need floating-point arithmetic for. FP really slows things down. It's necessary for scientific calculators, but has little use in embedded computers that control machines and processes, take data, even calculate graphics, and so on. For speed, use scaled integers, even for things like trig and log functions. If you really need more resolution, you can usually use multiple-precision scaled integers and avoid FP.

Jack Ganssle has a new book out called "Math Tool Kit for Real-Time Programming" (Lawrence, KS: CMP Books, 2000). He has his own website at www.ganssle.com . I do intend to get it myself. [Edit, years later: Our wonderful computer-science daughter-in-law got it for me. It's definitely a good book.] It's not at all specific to the 6502 family, but has all kinds of algorithms that are well explained without getting any more heavily into theory than necessary. I understand another good book is John Hart's "Computer Approximations" (Malabar, FL: Robert E. Drieger Publishing Company, 1968); but while Crenshaw lays out the Welcome mat, Hart's explanations will require a Ph.D. in math to understand. (That means it's not for me!)

As for my own code, you're welcome to it, except that I have a small problem now. It's on the DOS computer I use most, which has neither Windoze nor internet access; and when I put things on floppy disc, this other computer I'm on right now is unable to find and E-mail those files. It swallows and says thankyou, but then nothing happens. [Edit, years later: After more FD problems on newer PCs, I finally got an IDE SD-card interface for the DOS computer, so file transfer between computers is more trouble-free.] If you're interested, I'll have to mail you a floppy or a printout. The routines are explained in detail, but do assume at least a little knowledge of Forth. I hope to get a website going possibly later this year, and this will definitely be on it. [Edit: I finally got the website going 10+ years later and have the LUT material there and other major features, and have far more I want to post as time permits.]

Edit, 6/25/12: added link to the look-up tables for super fast, accurate 16-bit math. More edits done later.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Report this post
Top
 Profile  
Reply with quote  
PostPosted: Mon Apr 23, 2001 9:06 am 
Offline
Site Admin
User avatar

Joined: Fri Aug 30, 2002 1:08 am
Posts: 281
Location: Northern California
Hi Guys,

The Source Code Repository on www.6502.org would be an excellent place to post any math routines you have. Please feel free to send any material my way.

Best Regards,
Mike

_________________
- Mike Naberezny (mike@naberezny.com) http://6502.org


Report this post
Top
 Profile  
Reply with quote  
PostPosted: Thu Jul 26, 2001 7:31 pm 
I would like to develop an arithmetic library for fast computation in 6502; then this could be submitted to all compiler writers for the benefit of all. Of course, the world also deserves a small sized version but that is not my present interest.
I come from the commodore background but this applies to all 6502 users. Even the 65816? needs a soft multiply.
Now, I think I've found a <190 cycles 16x16=32 unsigned multiply.
I wonder if anyone can best this? Are there still any coders left here?
I'm working on an article for it. I would like to challenge anyone to beat
this.
I start with an improved version of the fastmult from C=hacking 16, with
better tables (g(x)=(x-255)^2,f(x)=x^2) you can remove the A=0 bug and adc #1
instruction to get 38 cycles from input A:Y to output X:A (low,high).
The basic idea, for reference, is:
a*b = f(a+b) - f(a-b), f(x) = x^2/4
Note that:
(a+b)^2=a^2+b^2+2*a*b
(a-b)^2=a^2+b^2-2*a*b
(a+b)^2-(a-b)^2=4*a*b
(proof).
So I just did f(x)-g(x). Note you now need 2x510 byte tables. C=hacking 16
doesn't explain it very well (sorry S. Judd).
I should note that the standard fastmult always ends with carry set, so if
you do another fast mult right after you can shave another 2 cycles off it.
It's because f(a+b) is always >= f(a-b), which means the subtraction of the
high bytes will never go negative.

Next I tried the three mult formula for 16bit mults, from Knuth:
(note: value ranges arising in actual use of unsigned mults are noted
next to the formulas)
k=(a-b)*(c-d) +-FE01
l=a*c 0-FE01
m=b*d 0-FE01
n=l+m-k l+m 0-1FC02 l-k or m-k +-FE01 L+m-k 0-1FC02
x*y=l*65536+n*256+m

Where:
a is high byte of first number x
b is low byte of first number x
c is high byte of 2nd number y
d is low byte of 2nd number y

I found that while this saves on mults, there are some tricky 17 bit adds
which slow it down. So far I get 262 cycles, but that can definitely be
improved with my 3add speedup technique (see below).

Next I tried the usual 4 mult method, with a*c*65536+a*d*256+b*c*256+b*d.
As mentioned in C=hacking 16, you can use the fastmult routine and leave the
pointers set up to do a second 24 cycles multiply. That is, x*y and x*z can
be done in 2/3 the time as x*y and w*z, because of the common x.
By using the following order of computation, you can use this to achive just
less than the equivalent of 3 mults time when doing 4 mults, so already it
beats the above method.
a*c
a*d
b*c
b*d
The first argument is put in the A register, the second argument in the Y
register for the fastmult macro.
The problem still remains of doing the two adds of the middle products, which
are 16bit+24bit adds.

So, now I investigate techniques to do quick adds of three x 8bit numbers:
Fast version
2 clc
3 lda n1
3 adc n2
2 bcs s1;11
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2*4+3*5=23
rts
2 s1 clc;** bug fixed
3 adc n3;11+...
3 sta r
2 lda #1
2 adc #0
3 sta r+1;11+13=26
rts

Standard version
2 clc
3 lda n1
3 adc n2
3 sta r
2 lda #0
2 adc #0
3 sta r+1
3 lda n3
3 adc r
3 sta r
3 lda r+1
2 adc #0
3 sta r+1;2*4+3*9=35
rts

So I get 24 compared to 35 cycles, which is pretty good. This can be
extended to add even more numbers with greater savings. I just wish there
were a 2bit carry flag...
To incorporate this technique into the four x 8bit fast mults, I need to
change fastmult again to use add instead of subtract. I just make
g(x)=-(x-255)^2.
Now I can add two products directly in four adds:
f(a+b)+g(a-b)+f(c+d)+g(c-d)
Of course, those aren't the real numbers I should be adding...
I can't fully make use of this technique because of the opposing goals of
reusing the pointers of the fastmult and directly adding the middle partial
products. Since each use of the technique only saves 11 cycles it is better
to reuse the pointers.
So I expect a 170cycles routine for 16x16=32 mult. This is incredible, since
some textbook looped code for 8bit multiply takes this long! (though a good
one is about 100 cycles).
Moving forward, I expect a 32x32=64 bit to take <650 cycles, because I WILL
use the 3 mult method here; where each mult is a 16bit mult of 170 cycles. I
can't reuse the pointers because there is just too many, that's sixteen x
8bit multiplies to do, and 12 partials to add (the first 4 are just stored
into the space of the product). Even 4 full and 12 partial mults is
4*38+12*24=440, which is good but all those adds slow it down. And four x
16bit multiplies is already 4*170=680 cycles (plus just 2 adds), so at this
level saving 170 for a multiply can easily be worthwhile when traded for four
x 24bit or so adds.

Any comments? Any good code for direct four x 8bit adds?
Ultimately I should be able to write very fast 40bit floating point to
replace basic's, which can also be donated to compiler writers etc.
I'm also looking into convolution based multiplies, it just takes a while to
calculate the direct fourier formula by hand. I think I can also speed up S.
Judds 3dlib quite a bit, by using this with matrices (and investigating
Strassens fast matrix mult method).
I am also writing a compiler but the arithmetic library is sidetracking me a
bit, I really want fast math.
Jim Butterfield wrote:

> On Tue, 17 Jul 2001 04:51:14 -0300, asdf <asdf@sdf.com> wrote:
>
> <snip>
> > ... I seem to recall that the best
> >standard way accululated the bits of the answer in one end while
> >shifting the multiplier out the other in the same memory location, but I
> >don't remember exactly how it worked.
>
> 1. Put the multiplicand in the high order end of a work area
> 2. Put zero bytes (same number as in the multiplier) in the low order
> end.
> 3. Loop the following for however many bits are in the multiplicand:
> Do a left-shift of the entire work area (ASL then multiple
> ROLs);
> If a bit dropped out of the high end (carry set), add the
> multiplier into the low end of the work area; if necessary, propogate
> the addition carry as far up into the high end as it needs to go
> (i.e., add zeros).
> 4. The work area contains the result. All the bits of the
> multiplicand have been shifted out of the number.
>
> Most of the timing on this multiply is in the shifting and rotating.
>
> There are "clever tricks" that might or might not speed up the
> operation; most of them depend on what you might know about the nature
> of the numbers you are handling.
>
> --Jim


Report this post
Top
  
Reply with quote  
PostPosted: Thu Jul 26, 2001 7:31 pm 
I would like to develop an arithmetic library for fast computation in 6502; then this could be submitted to all compiler writers for the benefit of all. Of course, the world also deserves a small sized version but that is not my present interest.
I come from the commodore background but this applies to all 6502 users. Even the 65816? needs a soft multiply.
Now, I think I've found a <190 cycles 16x16=32 unsigned multiply.
I wonder if anyone can best this? Are there still any coders left here?
I'm working on an article for it. I would like to challenge anyone to beat
this.
I start with an improved version of the fastmult from C=hacking 16, with
better tables (g(x)=(x-255)^2,f(x)=x^2) you can remove the A=0 bug and adc #1
instruction to get 38 cycles from input A:Y to output X:A (low,high).
The basic idea, for reference, is:
a*b = f(a+b) - f(a-b), f(x) = x^2/4
Note that:
(a+b)^2=a^2+b^2+2*a*b
(a-b)^2=a^2+b^2-2*a*b
(a+b)^2-(a-b)^2=4*a*b
(proof).
So I just did f(x)-g(x). Note you now need 2x510 byte tables. C=hacking 16
doesn't explain it very well (sorry S. Judd).
I should note that the standard fastmult always ends with carry set, so if
you do another fast mult right after you can shave another 2 cycles off it.
It's because f(a+b) is always >= f(a-b), which means the subtraction of the
high bytes will never go negative.

Next I tried the three mult formula for 16bit mults, from Knuth:
(note: value ranges arising in actual use of unsigned mults are noted
next to the formulas)
k=(a-b)*(c-d) +-FE01
l=a*c 0-FE01
m=b*d 0-FE01
n=l+m-k l+m 0-1FC02 l-k or m-k +-FE01 L+m-k 0-1FC02
x*y=l*65536+n*256+m

Where:
a is high byte of first number x
b is low byte of first number x
c is high byte of 2nd number y
d is low byte of 2nd number y

I found that while this saves on mults, there are some tricky 17 bit adds
which slow it down. So far I get 262 cycles, but that can definitely be
improved with my 3add speedup technique (see below).

Next I tried the usual 4 mult method, with a*c*65536+a*d*256+b*c*256+b*d.
As mentioned in C=hacking 16, you can use the fastmult routine and leave the
pointers set up to do a second 24 cycles multiply. That is, x*y and x*z can
be done in 2/3 the time as x*y and w*z, because of the common x.
By using the following order of computation, you can use this to achive just
less than the equivalent of 3 mults time when doing 4 mults, so already it
beats the above method.
a*c
a*d
b*c
b*d
The first argument is put in the A register, the second argument in the Y
register for the fastmult macro.
The problem still remains of doing the two adds of the middle products, which
are 16bit+24bit adds.

So, now I investigate techniques to do quick adds of three x 8bit numbers:
Fast version
2 clc
3 lda n1
3 adc n2
2 bcs s1;11
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2*4+3*5=23
rts
2 s1 clc;** bug fixed
3 adc n3;11+...
3 sta r
2 lda #1
2 adc #0
3 sta r+1;11+13=26
rts

Standard version
2 clc
3 lda n1
3 adc n2
3 sta r
2 lda #0
2 adc #0
3 sta r+1
3 lda n3
3 adc r
3 sta r
3 lda r+1
2 adc #0
3 sta r+1;2*4+3*9=35
rts

So I get 24 compared to 35 cycles, which is pretty good. This can be
extended to add even more numbers with greater savings. I just wish there
were a 2bit carry flag...
To incorporate this technique into the four x 8bit fast mults, I need to
change fastmult again to use add instead of subtract. I just make
g(x)=-(x-255)^2.
Now I can add two products directly in four adds:
f(a+b)+g(a-b)+f(c+d)+g(c-d)
Of course, those aren't the real numbers I should be adding...
I can't fully make use of this technique because of the opposing goals of
reusing the pointers of the fastmult and directly adding the middle partial
products. Since each use of the technique only saves 11 cycles it is better
to reuse the pointers.
So I expect a 170cycles routine for 16x16=32 mult. This is incredible, since
some textbook looped code for 8bit multiply takes this long! (though a good
one is about 100 cycles).
Moving forward, I expect a 32x32=64 bit to take <650 cycles, because I WILL
use the 3 mult method here; where each mult is a 16bit mult of 170 cycles. I
can't reuse the pointers because there is just too many, that's sixteen x
8bit multiplies to do, and 12 partials to add (the first 4 are just stored
into the space of the product). Even 4 full and 12 partial mults is
4*38+12*24=440, which is good but all those adds slow it down. And four x
16bit multiplies is already 4*170=680 cycles (plus just 2 adds), so at this
level saving 170 for a multiply can easily be worthwhile when traded for four
x 24bit or so adds.

Any comments? Any good code for direct four x 8bit adds?
Ultimately I should be able to write very fast 40bit floating point to
replace basic's, which can also be donated to compiler writers etc.
I'm also looking into convolution based multiplies, it just takes a while to
calculate the direct fourier formula by hand. I think I can also speed up S.
Judds 3dlib quite a bit, by using this with matrices (and investigating
Strassens fast matrix mult method).
I am also writing a compiler but the arithmetic library is sidetracking me a
bit, I really want fast math.
Jim Butterfield wrote:

> On Tue, 17 Jul 2001 04:51:14 -0300, asdf <asdf@sdf.com> wrote:
>
> <snip>
> > ... I seem to recall that the best
> >standard way accululated the bits of the answer in one end while
> >shifting the multiplier out the other in the same memory location, but I
> >don't remember exactly how it worked.
>
> 1. Put the multiplicand in the high order end of a work area
> 2. Put zero bytes (same number as in the multiplier) in the low order
> end.
> 3. Loop the following for however many bits are in the multiplicand:
> Do a left-shift of the entire work area (ASL then multiple
> ROLs);
> If a bit dropped out of the high end (carry set), add the
> multiplier into the low end of the work area; if necessary, propogate
> the addition carry as far up into the high end as it needs to go
> (i.e., add zeros).
> 4. The work area contains the result. All the bits of the
> multiplicand have been shifted out of the number.
>
> Most of the timing on this multiply is in the shifting and rotating.
>
> There are "clever tricks" that might or might not speed up the
> operation; most of them depend on what you might know about the nature
> of the numbers you are handling.
>
> --Jim


Report this post
Top
  
Reply with quote  
PostPosted: Thu Jul 26, 2001 8:50 pm 
The standard 6502 fast 8 bit multiply (unsigned).
Takes 38 cycles, this is almost the best multiply possible on 6502. There is actually a much faster one, but I don't have source.
a*b = f(a+b) - f(a-b), f(x) = x^2/4
3 STA ZP1 ;ZP1 points to f(x) low byte
6 STA ZP2 ;high byte
8 EOR #$FF
10 ADC #01;take this out with new tables TAKE THIS LINE OUT
13 STA ZP3 ;table of f(x-255) -- see below
16 STA ZP4
half LDA (ZP1),Y ;f(Y+A), low byte
23 SEC
28 SBC (ZP3),Y ;g(Y-A), low byte
30 TAX
35 LDA (ZP2),Y ;f(Y+A), high byte
40 SBC (ZP4),Y ;g(Y-A), high byte
;result is now in .X .A = low, high
;carry is always set

40cycles/38 cycles
14 cycles for halfmult
This leads to the following algorithm for doing signed
multiplications:

multiply x and y as normal with some routine
if x<0 then subtract y from the high bytes of the result
if y<0 then subtract x from the high bytes
compute
f(x)=x*x
g(x)=(x-255)^2
so ie g(254) is 1^1/4.

A Y a+y a-y f(a+y) f(a-y)
3 2 5 257 6 0
3 3 6 256 9 0
3 4 7 255 12 0
3 5 8 254 16 1
3 6 9 253 20 2

(a+b)^2=a^2+b^2+2*a*b
(a-b)^2=a^2+b^2-2*a*b
(a+b)^2-(a-b)^2=4*a*b (proof)


Report this post
Top
  
Reply with quote  
Display posts from previous:  Sort by  
Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 6 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 32 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: