6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Nov 23, 2024 2:02 am

All times are UTC




Post new topic Reply to topic  [ 45 posts ]  Go to page 1, 2, 3  Next
Author Message
PostPosted: Thu Jan 28, 2016 11:47 pm 
Offline

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
Hello forum! I've been writing an entry for codebase64 about the built-in FP routines on the C64. Much of this knowledge was obtained in the quest to write a space simulator, but I've mostly put that project on the back-burner. It did however, lead to quest to find alternate routines to speed-up the notoriously slow ones included with BASIC.

I've seen a few FP topics here, but it seems they are aimed at machines that have more ram than the humble 64 and its brethren. So if anyone knows of methods alternate to the ones used in BASIC 2.0 that I could add to the optimising section it would be greatly appreciated. Clearly multiplication would be the most important to improve (as it would improve other functions) but I suspect this is also the least trivial to improve as well.

WIP simulator here: "accurately" simulates gravity from the Moon and the Earth acting on a spacecraft. the Earth also affects the Moon. Why I decided to take on this project when I am not very mathematically inclined is beyond me, but every time I look at the Moon in the sky I think back on reviving it. Having a speedier multiply would certainly help.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 12:09 am 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Can you sacrifice accuracy for speed, by going with a 24-bit mantissa? It was "good enough" for many applications.

Also, not to pick nits, but near the bottom of your linked page:
Quote:
If we do exponentiation with our new routines like the Kernal does (y^x = exp(x*log(y))) for 8^8 we get 16777148.6, while the Kernal routines return 16777216, which is a little closer to the true value of 16777224.

The correct answer is 16777216, actually.

Mike B.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 12:50 am 
Offline

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
I did try to use the Woz FP routines for the simulator, but the nature of integration (quick and dirty Euler especially) is error prone, and the limited accuracy compounded that, given that the scale of the system is similar to the real world, the amount of acceleration change in some instances is essentially zero with a lower precision. I could decrease the resolution of the co-ordinates, but I think it's at a Goldilocks point. I don't know the FPS, and it is also dependent on the graphics routine, but if you look at the program you'll see the ship zips along at a good rate (make sure to zoom in).

And it's not nitpicking, it's exactly why I asked for people to look. Thanks!

Also rewriting an entire FP engine is not exactly what I'm looking for, as the c64 Kernal is very handy given that you can still have 64k when using it.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 1:38 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8544
Location: Southern California
From my links page:
6502 BCD floating-point scientific math package, 12-digit plus 3-digit exp, up to hyperbolic functions
math coprocessors, 32-bit floating-point, serial-interfaced

Almost everything people think requires floating-point math can actually be done, just as accurately and more efficiently, in scaled integer. Notice I didn't say "fixed point," as that's only a limited subset of scaled integer. I have an article on this at http://wilsonminesco.com/16bitMathTables/, along with large look-up tables for many 16-bit math functions. There is no interpolation involved, because every single value is there, pre-calculated, accurate to all 16 bits; and in many cases it's literally hundreds of times as fast to just look up the answer as it is to actually have to calculate it. Even if 16 bits (with 32-bit intermediate results) is not enough for what you want to do (so I can't help you with the look-up tables), be sure you don't discount scaled-integer math just because it does not intuitively seem up to the task. Perhaps 24 or 32 bits would be fine (with 48- or 64-bit intermediate results).

At first I was skeptical myself; but eventually I came to prefer it (in systems with no floating-point coprocessor). I wrote the software for some automated test equipment without floating-point. There will always be a place for floating-point, especially with calculators where the applicable ranges cannot be anticipated; but I have mostly quit using floating-point for data acquisition and control for everything up to and including the fast Fourier transform (FFT) for spectral analysis. Digital signal processing is frequently done without floating-point.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 7:24 am 
Offline

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
I did see your math package, but am I wrong when I see that just one of the required tables is 256k? No way this would help with legacy computers. Unless staight BCD is still faster even without tables. While this is ostensibly about making my simulator for c64, it's more about improving methods on all CBM machines without having to fill the RAM with code and tables leaving no room for anything else.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 8:06 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Welcome, malcontent!

I'm sure you're right to try to find a faster multiply. If you're also using trig functions, you could consider using faster implementations of those, which use fewer multiplies than the method used in MS Basics. I haven't a good high-level reference, but for a very low level you could study the routines in BBC Basic V4. For example, at
http://www.8bs.com/basic/basic4-a90d.htm
we have the continued-fraction evaluation of sine. This is faster and more accurate than a power series evaluation.

I think the BBC Basic 4 routines do use some 65C02 opcodes, so you might need to use very slightly slower code on a 6502/6510.

Cheers
Ed


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 8:44 am 
Offline
User avatar

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8544
Location: Southern California
malcontent wrote:
I did see your math package, but am I wrong when I see that just one of the required tables is 256k? No way this would help with legacy computers. Unless staight BCD is still faster even without tables. While this is ostensibly about making my simulator for c64, it's more about improving methods on all CBM machines without having to fill the RAM with code and tables leaving no room for anything else.

The biggest table (inverses) is 256KB, since it has 64K 32-bit answers. Most tables are 128KB, two bytes for each 16-bit answer. Near the bottom of the index page there are suggestions for implementation, including a synchronous-serial interface for 8-bit computers that don't have nearly enough memory or even the address space for it. Even communicating over synchronous serial, it's still much faster than having to actually calculate the values. In the case of a C64, such memory would probably be on a board that gets plugged into the back, unless you want to wire it internally. For only the price of postage, I do supply one-megabyte EPROM pairs, pre-programmed with the whole set of tables by our late forum member NightmareTony who died with cancer in Nov 2012 just a month or two after the last time I saw him. (My EPROM programmer doesn't do EPROMs that big, so he helped me out, even donating the EPROMs.)

Tables or no tables though, part of my point is the unexpected value and capability of scaled-integer math which is more efficient if the machine does not have a floating-point coprocessor.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 4:11 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
Ed, your link 404s from over here. Could you fix it for us (me)?

I browsed the MS multiply routine, and I agree that it could be a candidate for optimization, but a question would remain ... how to wedge it into all the existing MS code which calls $BA28 and which the OP wishes to keep, without shadowing the ROM to underlying RAM and modifying it there.

Mike B.


Top
 Profile  
Reply with quote  
PostPosted: Fri Jan 29, 2016 4:19 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
Hmm, that site I linked to seems to have some access control - whether it's region blocking or something else I don't know. It doesn't like to be archived. Try
http://beebwiki.mdfs.net/COS
instead. Or search for 'A90D SIN' and get the cached version from your search provider, if you can.


Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 30, 2016 5:04 am 
Offline

Joined: Tue Jul 24, 2012 2:27 am
Posts: 679
I had an idea many years ago that I never got around to implementing: Represent the exponent in terms of 256^n, instead of 2^n. In other words, it dictates how many bytes to shift the mantissa instead of how many bits. Normalization will be a lot cheaper, and aligning different floating point values for addition or subtraction gets reduced down to indexing instead of bit-shifting.

In regular floating point numbers, they're normalized such that there's an assumed 1 bit on top, and the mantissa holds the next 32 bits of precision. With a byte-aligned scheme, that top 1 bit would have to be included in the top mantissa byte, as its location could be anywhere within the byte. This means that the effective precision would be variable in the range of 24-31 bits of equivalent precision from the bitwise mantissa representation. Since the math processing is expected to be far faster, maybe adding another byte of mantissa to bring the minimum precision back up to 32 bits would be prudent.

Regarding multiplication specifically, it can be done with only 8 shifts instead of 32 in a sort of byte-parallel fashion. Each byte provides a shift-and-add source from a byte-preshifted perspective. You can shift bits out of each byte independently, instead of the more expensive full 32-bit shift of the entire mantissa. The code would be bigger this way, but certainly not as large as lookup tables (though tables would be faster still).

_________________
WFDis Interactive 6502 Disassembler
AcheronVM: A Reconfigurable 16-bit Virtual CPU for the 6502 Microprocessor


Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 30, 2016 8:12 am 
Offline

Joined: Thu Jan 28, 2016 11:12 pm
Posts: 12
barrym95838 wrote:
how to wedge it into all the existing MS code which calls $BA28


At least for the "real-time" rendering of the simulator and the screen, multiply is only called directly in the main loop. The exception is when I want to display other information, I.E. Velocity, Altitude, etc. Which must call the slow float-string conversion. If I wrote a faster multiply I would probably drop that section into ram and change the references.

The (partly coded) parts of the simulator that use trig are not in that version on codebase. It also calculates the predicted orbit using Kepler's laws, then plots that ellipse/parabola/hyperbola. It has a 2D look, but it is working in 3D. That takes a bit of time, so the simulation is paused for that.

Whiteflame: That exponent idea is really cool. I'm also intrigued by your multiply suggestion and I am sort of following what you are saying, you are talking about about only working on one byte of the partial product at a time right? I'm still scratching my head a little on how I'd tack them all together... Perhaps you could dumb it down a wee bit more? :oops:


Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 30, 2016 9:34 am 
Offline
User avatar

Joined: Tue Mar 02, 2004 8:55 am
Posts: 996
Location: Berkshire, UK
White Flame wrote:
I had an idea many years ago that I never got around to implementing: Represent the exponent in terms of 256^n, instead of 2^n. In other words, it dictates how many bytes to shift the mantissa instead of how many bits. Normalization will be a lot cheaper, and aligning different floating point values for addition or subtraction gets reduced down to indexing instead of bit-shifting.

In regular floating point numbers, they're normalized such that there's an assumed 1 bit on top, and the mantissa holds the next 32 bits of precision. With a byte-aligned scheme, that top 1 bit would have to be included in the top mantissa byte, as its location could be anywhere within the byte. This means that the effective precision would be variable in the range of 24-31 bits of equivalent precision from the bitwise mantissa representation. Since the math processing is expected to be far faster, maybe adding another byte of mantissa to bring the minimum precision back up to 32 bits would be prudent.

Regarding multiplication specifically, it can be done with only 8 shifts instead of 32 in a sort of byte-parallel fashion. Each byte provides a shift-and-add source from a byte-preshifted perspective. You can shift bits out of each byte independently, instead of the more expensive full 32-bit shift of the entire mantissa. The code would be bigger this way, but certainly not as large as lookup tables (though tables would be faster still).

IBM mainframes originally had exponents that where powers of 16 (e.g. nybble shifts). They found that there was a noticeable loss of accuracy in calculations and they eventually added IEEE 754 support.

https://en.wikipedia.org/wiki/IBM_Floating_Point_Architecture#Precision_issues

With larger exponents you would be discarding even more mantissa bits during operations and the results with be very inaccurate.

_________________
Andrew Jacobs
6502 & PIC Stuff - http://www.obelisk.me.uk/
Cross-Platform 6502/65C02/65816 Macro Assembler - http://www.obelisk.me.uk/dev65/
Open Source Projects - https://github.com/andrew-jacobs


Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 30, 2016 10:03 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10986
Location: England
I do like the idea of stretching the mantissa to compensate - that seems like a good tradeoff to me.


Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 30, 2016 10:56 am 
Offline

Joined: Sat Jul 28, 2012 11:41 am
Posts: 442
Location: Wiesbaden, Germany
BitWise wrote:
IBM mainframes originally had exponents that where powers of 16 (e.g. nybble shifts). They found that there was a noticeable loss of accuracy in calculations and they eventually added IEEE 754 support.

https://en.wikipedia.org/wiki/IBM_Floating_Point_Architecture#Precision_issues

With larger exponents you would be discarding even more mantissa bits during operations and the results with be very inaccurate.
That is comparing apples with oranges. IBM's shift instruction can shift by any number of bits in a single instruction. There is no overhead in shifting by bit versus shifting by nibble. If you would add bits to the mantissa to compensate for the loss of accuracy you need to extend by an extra word (32 bit registers!) versus a byte architecture CPU only needs another byte.

_________________
6502 sources on GitHub: https://github.com/Klaus2m5


Top
 Profile  
Reply with quote  
PostPosted: Sat Jan 30, 2016 8:56 pm 
Offline
User avatar

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
I think that White Flame's ^256 idea has merit (and should be attempted by someone), but it goes against the OP's original implied intention to keep the MS 40-bit format and just tweak a few critical subroutines.

Mike B.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 45 posts ]  Go to page 1, 2, 3  Next

All times are UTC


Who is online

Users browsing this forum: No registered users and 36 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron