6502.org

Posted: **Thu Jan 28, 2016 11:47 pm**

Hello forum! I've been writing an entry for codebase64 about the built-in FP routines on the C64. Much of this knowledge was obtained in the quest to write a space simulator, but I've mostly put that project on the back-burner. It did however, lead to quest to find alternate routines to speed-up the notoriously slow ones included with BASIC.

I've seen a few FP topics here, but it seems they are aimed at machines that have more ram than the humble 64 and its brethren. So if anyone knows of methods alternate to the ones used in BASIC 2.0 that I could add to the optimising section it would be greatly appreciated. Clearly multiplication would be the most important to improve (as it would improve other functions) but I suspect this is also the least trivial to improve as well.

WIP simulator here: "accurately" simulates gravity from the Moon and the Earth acting on a spacecraft. the Earth also affects the Moon. Why I decided to take on this project when I am not very mathematically inclined is beyond me, but every time I look at the Moon in the sky I think back on reviving it. Having a speedier multiply would certainly help.

Posted: **Fri Jan 29, 2016 12:09 am**

Can you sacrifice accuracy for speed, by going with a 24-bit mantissa? It was "good enough" for many applications.

Also, not to pick nits, but near the bottom of your linked page:

Quote:

If we do exponentiation with our new routines like the Kernal does (y^x = exp(x*log(y))) for 8^8 we get 16777148.6, while the Kernal routines return 16777216, which is a little closer to the true value of 16777224.

The correct answer is 16777216, actually.

Mike B.

Posted: **Fri Jan 29, 2016 12:50 am**

I did try to use the Woz FP routines for the simulator, but the nature of integration (quick and dirty Euler especially) is error prone, and the limited accuracy compounded that, given that the scale of the system is similar to the real world, the amount of acceleration change in some instances is essentially zero with a lower precision. I could decrease the resolution of the co-ordinates, but I think it's at a Goldilocks point. I don't know the FPS, and it is also dependent on the graphics routine, but if you look at the program you'll see the ship zips along at a good rate (make sure to zoom in).

And it's not nitpicking, it's exactly why I asked for people to look. Thanks!

Also rewriting an entire FP engine is not exactly what I'm looking for, as the c64 Kernal is very handy given that you can still have 64k when using it.

Posted: **Fri Jan 29, 2016 1:38 am**

From my links page:
6502 BCD floating-point scientific math package, 12-digit plus 3-digit exp, up to hyperbolic functions
math coprocessors, 32-bit floating-point, serial-interfaced

Almost everything people think requires floating-point math can actually be done, just as accurately and more efficiently, in scaled integer. Notice I didn't say "fixed point," as that's only a limited subset of scaled integer. I have an article on this at http://wilsonminesco.com/16bitMathTables/, along with large look-up tables for many 16-bit math functions. There is no interpolation involved, because every single value is there, pre-calculated, accurate to all 16 bits; and in many cases it's literally hundreds of times as fast to just look up the answer as it is to actually have to calculate it. Even if 16 bits (with 32-bit intermediate results) is not enough for what you want to do (so I can't help you with the look-up tables), be sure you don't discount scaled-integer math just because it does not intuitively seem up to the task. Perhaps 24 or 32 bits would be fine (with 48- or 64-bit intermediate results).

At first I was skeptical myself; but eventually I came to prefer it (in systems with no floating-point coprocessor). I wrote the software for some automated test equipment without floating-point. There will always be a place for floating-point, especially with calculators where the applicable ranges cannot be anticipated; but I have mostly quit using floating-point for data acquisition and control for everything up to and including the fast Fourier transform (FFT) for spectral analysis. Digital signal processing is frequently done without floating-point.

Posted: **Fri Jan 29, 2016 7:24 am**

I did see your math package, but am I wrong when I see that just one of the required tables is 256k? No way this would help with legacy computers. Unless staight BCD is still faster even without tables. While this is ostensibly about making my simulator for c64, it's more about improving methods on all CBM machines without having to fill the RAM with code and tables leaving no room for anything else.

Posted: **Fri Jan 29, 2016 8:06 am**

Welcome, malcontent!

I'm sure you're right to try to find a faster multiply. If you're also using trig functions, you could consider using faster implementations of those, which use fewer multiplies than the method used in MS Basics. I haven't a good high-level reference, but for a very low level you could study the routines in BBC Basic V4. For example, at
http://www.8bs.com/basic/basic4-a90d.htm
we have the continued-fraction evaluation of sine. This is faster and more accurate than a power series evaluation.

I think the BBC Basic 4 routines do use some 65C02 opcodes, so you might need to use very slightly slower code on a 6502/6510.

Cheers
Ed

Posted: **Fri Jan 29, 2016 8:44 am**

malcontent wrote:

I did see your math package, but am I wrong when I see that just one of the required tables is 256k? No way this would help with legacy computers. Unless staight BCD is still faster even without tables. While this is ostensibly about making my simulator for c64, it's more about improving methods on all CBM machines without having to fill the RAM with code and tables leaving no room for anything else.

The biggest table (inverses) is 256KB, since it has 64K 32-bit answers. Most tables are 128KB, two bytes for each 16-bit answer. Near the bottom of the index page there are suggestions for implementation, including a synchronous-serial interface for 8-bit computers that don't have nearly enough memory or even the address space for it. Even communicating over synchronous serial, it's still much faster than having to actually calculate the values. In the case of a C64, such memory would probably be on a board that gets plugged into the back, unless you want to wire it internally. For only the price of postage, I do supply one-megabyte EPROM pairs, pre-programmed with the whole set of tables by our late forum member NightmareTony who died with cancer in Nov 2012 just a month or two after the last time I saw him. (My EPROM programmer doesn't do EPROMs that big, so he helped me out, even donating the EPROMs.)

Tables or no tables though, part of my point is the unexpected value and capability of scaled-integer math which is more efficient if the machine does not have a floating-point coprocessor.

Posted: **Fri Jan 29, 2016 4:11 pm**

Ed, your link 404s from over here. Could you fix it for us (me)?

I browsed the MS multiply routine, and I agree that it could be a candidate for optimization, but a question would remain ... how to wedge it into all the existing MS code which calls $BA28 and which the OP wishes to keep, without shadowing the ROM to underlying RAM and modifying it there.

Mike B.

Posted: **Fri Jan 29, 2016 4:19 pm**

Hmm, that site I linked to seems to have some access control - whether it's region blocking or something else I don't know. It doesn't like to be archived. Try
http://beebwiki.mdfs.net/COS
instead. Or search for 'A90D SIN' and get the cached version from your search provider, if you can.

Posted: **Sat Jan 30, 2016 5:04 am**

I had an idea many years ago that I never got around to implementing: Represent the exponent in terms of 256^n, instead of 2^n. In other words, it dictates how many bytes to shift the mantissa instead of how many bits. Normalization will be a lot cheaper, and aligning different floating point values for addition or subtraction gets reduced down to indexing instead of bit-shifting.

In regular floating point numbers, they're normalized such that there's an assumed 1 bit on top, and the mantissa holds the next 32 bits of precision. With a byte-aligned scheme, that top 1 bit would have to be included in the top mantissa byte, as its location could be anywhere within the byte. This means that the effective precision would be variable in the range of 24-31 bits of equivalent precision from the bitwise mantissa representation. Since the math processing is expected to be far faster, maybe adding another byte of mantissa to bring the minimum precision back up to 32 bits would be prudent.

Regarding multiplication specifically, it can be done with only 8 shifts instead of 32 in a sort of byte-parallel fashion. Each byte provides a shift-and-add source from a byte-preshifted perspective. You can shift bits out of each byte independently, instead of the more expensive full 32-bit shift of the entire mantissa. The code would be bigger this way, but certainly not as large as lookup tables (though tables would be faster still).

Posted: **Sat Jan 30, 2016 8:12 am**

barrym95838 wrote:

how to wedge it into all the existing MS code which calls $BA28

At least for the "real-time" rendering of the simulator and the screen, multiply is only called directly in the main loop. The exception is when I want to display other information, I.E. Velocity, Altitude, etc. Which must call the slow float-string conversion. If I wrote a faster multiply I would probably drop that section into ram and change the references.

The (partly coded) parts of the simulator that use trig are not in that version on codebase. It also calculates the predicted orbit using Kepler's laws, then plots that ellipse/parabola/hyperbola. It has a 2D look, but it is working in 3D. That takes a bit of time, so the simulation is paused for that.

Whiteflame: That exponent idea is really cool. I'm also intrigued by your multiply suggestion and I am sort of following what you are saying, you are talking about about only working on one byte of the partial product at a time right? I'm still scratching my head a little on how I'd tack them all together... Perhaps you could dumb it down a wee bit more?

Posted: **Sat Jan 30, 2016 9:34 am**

White Flame wrote:

I had an idea many years ago that I never got around to implementing: Represent the exponent in terms of 256^n, instead of 2^n. In other words, it dictates how many bytes to shift the mantissa instead of how many bits. Normalization will be a lot cheaper, and aligning different floating point values for addition or subtraction gets reduced down to indexing instead of bit-shifting.

In regular floating point numbers, they're normalized such that there's an assumed 1 bit on top, and the mantissa holds the next 32 bits of precision. With a byte-aligned scheme, that top 1 bit would have to be included in the top mantissa byte, as its location could be anywhere within the byte. This means that the effective precision would be variable in the range of 24-31 bits of equivalent precision from the bitwise mantissa representation. Since the math processing is expected to be far faster, maybe adding another byte of mantissa to bring the minimum precision back up to 32 bits would be prudent.

Regarding multiplication specifically, it can be done with only 8 shifts instead of 32 in a sort of byte-parallel fashion. Each byte provides a shift-and-add source from a byte-preshifted perspective. You can shift bits out of each byte independently, instead of the more expensive full 32-bit shift of the entire mantissa. The code would be bigger this way, but certainly not as large as lookup tables (though tables would be faster still).

IBM mainframes originally had exponents that where powers of 16 (e.g. nybble shifts). They found that there was a noticeable loss of accuracy in calculations and they eventually added IEEE 754 support.

https://en.wikipedia.org/wiki/IBM_Float ... ion_issues

With larger exponents you would be discarding even more mantissa bits during operations and the results with be very inaccurate.

Posted: **Sat Jan 30, 2016 10:03 am**

I do like the idea of stretching the mantissa to compensate - that seems like a good tradeoff to me.

Posted: **Sat Jan 30, 2016 10:56 am**

BitWise wrote:

IBM mainframes originally had exponents that where powers of 16 (e.g. nybble shifts). They found that there was a noticeable loss of accuracy in calculations and they eventually added IEEE 754 support.

https://en.wikipedia.org/wiki/IBM_Float ... ion_issues

With larger exponents you would be discarding even more mantissa bits during operations and the results with be very inaccurate.

That is comparing apples with oranges. IBM's shift instruction can shift by any number of bits in a single instruction. There is no overhead in shifting by bit versus shifting by nibble. If you would add bits to the mantissa to compensate for the loss of accuracy you need to extend by an extra word (32 bit registers!) versus a byte architecture CPU only needs another byte.

Posted: **Sat Jan 30, 2016 8:56 pm**

I think that White Flame's ^256 idea has merit (and should be attempted by someone), but it goes against the OP's original implied intention to keep the MS 40-bit format and just tweak a few critical subroutines.

Mike B.

6502.org

CBM (C64) floating point improvement

CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement

Re: CBM (C64) floating point improvement