6502.org Users Forum » Programming

128 bit Floating Point 65C816 implementation

Page 3 of 3 [ 44 posts ]

All times are UTC

sanny

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Sun Jul 24, 2022 7:28 pm

Joined: Fri Sep 01, 2017 6:20 pm
Posts: 6

granati wrote:

Now all source code is browsable online at:

http://65xx.unet.bz/repos

I have also put the source code for gal online (it was missing until now)

Marco

Can I download / checkout over SVN?

tmr4

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Sun Jul 24, 2022 9:39 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147

sanny wrote:

granati wrote:

Now all source code is browsable online at:

http://65xx.unet.bz/repos

I have also put the source code for gal online (it was missing until now)

Marco

Can I download / checkout over SVN?

Here is a direct link to the floating-point code that was provided earlier in this thread: http://65xx.unet.bz/fpu.txt. It's a bit newer than what's in the repo.

tmr4

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Sat Aug 13, 2022 8:34 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147

I've been playing with this floating-point package for a few weeks now. Here's a Mandelbrot set with 100 iterations per point:

Code:

128-bit fp (100 iterations)
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôòòòà·Îp cÄÛåòòòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                à¨º∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ ¨                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ                           Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                              Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü    ¿            =                             Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü    ¿            =                             Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                              Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ                           Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ ¨                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                à¨º∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôòòòà·Îp cÄÛåòòòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù

I haven't tried this on my hardware build yet, but it took about 26 minutes to produce on my 65816 emulator running on my fastest PC. I estimate the emulator runs at the equivalent of about 800 kHz, so this would still take over 2 1/2 minutes on my hardware at 8 MHz.

I wanted something faster, so I looked for ways to speed up the floating-point package. I noted a few possibilities when I added the fp package to my system:

Of course, the biggest slowdown in my Mandelbrot calculation above was using a 128-bit fp package in the first place. Did I really need that much precision? Curious, I decided to convert the key fp routines needed for the Mandelbrot calculation to 32-bit precision, incorporating the improvements noted above as well. Here is the output with the 32-bit fp routines:

Code:

32-bit fp (100 iterations)
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôùôùùòòòà·ùp cÄÛåòùòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                à¨º∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ â                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ   ù ùù                    Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                     Ý        Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü                                               Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü                                               Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                     Ý        Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ   ù ùù                    Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ â                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                à¨º∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôùôùùòòòà·ùp cÄÛåòùòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù

This took under 5 minutes to produce, about 5.5 times faster than the 128-bit fp version. It should run in under 30 seconds on my hardware build at 8 MHz. That's more attractive.
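The hardware estimates above can be checked with a little arithmetic. Here is a minimal sketch in Python, assuming the emulator really does behave like an 800 kHz part and that run time is inversely proportional to clock speed (both are estimates from this post, not measurements; the variable and function names are made up for illustration):

```python
# Scale emulator run times to a target clock, assuming time ~ 1 / clock.
EMULATOR_HZ = 800_000      # estimated emulator-equivalent clock (assumption)
HARDWARE_HZ = 8_000_000    # hardware build clock


def scale_seconds(seconds, from_hz, to_hz):
    """Estimate the run time at to_hz from a run time observed at from_hz."""
    return seconds * from_hz / to_hz


fp128_emulator_s = 26 * 60                 # about 26 minutes on the emulator
fp32_emulator_s = fp128_emulator_s / 5.5   # 32-bit was about 5.5 times faster

fp128_hw_s = scale_seconds(fp128_emulator_s, EMULATOR_HZ, HARDWARE_HZ)
fp32_hw_s = scale_seconds(fp32_emulator_s, EMULATOR_HZ, HARDWARE_HZ)

print(f"128-bit at 8 MHz: about {fp128_hw_s:.0f} s")
print(f"32-bit at 8 MHz: about {fp32_hw_s:.0f} s")
```

Under those assumptions the 128-bit run comes out at a little over 2 1/2 minutes and the 32-bit run at just under 30 seconds, matching the figures quoted.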

But what about the loss of precision? Comparing the two plots, you can see slight discrepancies in the 32-bit plot. A few of these are corrected by increasing the number of iterations per point, but others are not (like the anomaly in the center of the Mandelbrot set). Thus, it seems as if 32-bit precision won't provide as precise a result even when looking at the coarse granularity shown above.

If there is interest, I can clean up and post my 32-bit fp code. So far, I've only implemented 32-bit fp addition, subtraction, multiplication, division, square and a few conversion routines, just what I needed to run my Mandelbrot routine.

BigEd

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Sun Aug 14, 2022 5:39 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10793
Location: England

Nice work! I'd definitely look into those two stray interior pixels in the 32 bit case - there is a bug lurking in the arithmetic. If you tabulate the orbit of that point (there's only one point, if we take symmetry into account) you should be able to see it diverge from a correct calculation.

100 iterations is a lot - I wonder if perhaps that will provoke rounding errors at all? But first, check those pixels. Then, perhaps, see if 20 iterations gives you any difference from 100.
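One way to tabulate an orbit for comparison is to run the same iteration twice on a PC, once in double precision as the reference and once with every intermediate rounded to 32-bit. The sketch below does that in Python. It is only an illustration: the helper names are invented, the sample point is a placeholder to be replaced with the coordinates of the stray pixel, and the rounding emulates IEEE 754 single precision, which may not match the 65816 package's format bit for bit.

```python
import struct


def f32(x):
    """Round a Python float (64-bit) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def orbit(cx, cy, steps, rnd=lambda v: v):
    """Iterate z -> z*z + c, applying rnd after every arithmetic operation."""
    zx, zy = 0.0, 0.0
    cx, cy = rnd(cx), rnd(cy)
    points = []
    for _ in range(steps):
        x2 = rnd(zx * zx)
        y2 = rnd(zy * zy)
        if x2 + y2 > 4.0:          # escaped: stop tabulating
            break
        new_zx = rnd(rnd(x2 - y2) + cx)
        zy = rnd(rnd(rnd(2.0 * zx) * zy) + cy)
        zx = new_zx
        points.append((zx, zy))
    return points


# Placeholder point inside the main cardioid (an assumption, not the
# actual stray pixel from the plot).
CX, CY = -0.25, 0.1
ref = orbit(CX, CY, 20)
single = orbit(CX, CY, 20, rnd=f32)
for n, ((rx, ry), (sx, sy)) in enumerate(zip(ref, single), start=1):
    diff = abs(rx - sx) + abs(ry - sy)
    print(f"{n:3d}  {rx:+.9f} {ry:+.9f}   {sx:+.9f} {sy:+.9f}   diff={diff:.2e}")
```

For an interior point the two columns should stay close together, so the first step where the package under test departs from both is the operation worth inspecting.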

Martin_H

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Sun Aug 14, 2022 11:11 am

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 562

Nice images.

With regards to 32 vs 128 bit. When you aren't zooming in, 32 bit should yield a fine image. I've even gotten passable images with 16 bit fixed point. So I think BigEd is right that the image artifacts are a lurking bug.

With 128 bit you should be able to do some deep zooming before the arithmetic breaks down.
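To illustrate the 16-bit fixed-point idea, here is a rough sketch in Python. It is not the code behind the images mentioned above; the Q4.12 layout, the names, and the plot dimensions are all assumptions chosen for the example. Coordinates are held as signed 16-bit values with 12 fraction bits, and products are kept at 32-bit width, as a 16x16 multiply would produce.

```python
FRAC_BITS = 12                   # Q4.12: sign, 3 integer bits, 12 fraction bits
ONE = 1 << FRAC_BITS
ESCAPE = 4 << (2 * FRAC_BITS)    # 4.0 in the Q8.24 format of a 16x16 product


def to_fixed(x):
    """Convert a float to Q4.12, checking that it fits a signed 16-bit word."""
    v = round(x * ONE)
    assert -32768 <= v <= 32767
    return v


def escape_count(cx, cy, max_iter):
    """Iterations completed before |z|^2 exceeds 4, in integer arithmetic."""
    zx = zy = 0
    for n in range(max_iter):
        x2 = zx * zx                 # 32-bit products in Q8.24
        y2 = zy * zy
        if x2 + y2 > ESCAPE:
            return n
        new_zx = ((x2 - y2) >> FRAC_BITS) + cx
        zy = ((2 * zx * zy) >> FRAC_BITS) + cy
        zx = new_zx
    return max_iter


PALETTE = ".,'~=+:;[/<&?oxOX# "     # 19 characters, lightest to darkest
W, H, MAX_ITER = 64, 24, 18
for row in range(H + 1):
    cy = to_fixed(1.25 - 2.5 * row / H)   # floats used only to set up the grid
    line = ""
    for col in range(W + 1):
        cx = to_fixed(-2.0 + 2.5 * col / W)
        line += PALETTE[escape_count(cx, cy, MAX_ITER)]
    print(line)
```

Because the escape test runs before each update, |z| never exceeds about 6.4 for this view, so the 16-bit coordinates stay inside the Q4.12 range of -8 to just under 8.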

tmr4

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Sun Aug 14, 2022 9:44 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147

BigEd wrote:

...there is a bug lurking in the arithmetic.

Martin_H wrote:

So I think BigEd is right that the image artifacts are a lurking bug.

Thanks, I think you're both right. I had done some basic testing on my 32-bit fp package and was actually using the Mandelbrot plot to give it more rigorous testing. I was too dense to realize that the artifacts in the 32-bit plot were due to a bug rather than insufficient precision. With some more testing, it appears that on occasion I lose a sign for some of my 32-bit fp numbers. I still need to track down the specific cause.

BigEd wrote:

100 iterations is a lot ... see if 20 iterations gives you any difference from 100.

Thanks. I didn't have a feel for the number of iterations required for a basic plot. The higher iteration counts provide some refined detail at the boundary, but at this resolution too high a number is overkill and really slows down the calculation for points within the set.

Martin_H wrote:

With regards to 32 vs 128 bit. When you aren't zooming in, 32 bit should yield a fine image. I've even gotten passable images with 16 bit fixed point. ... With 128 bit you should be able to do some deep zooming before the arithmetic breaks down.

Thanks for the feedback. Once I track down my bug, I'll try to vary some of my input parameters to see how far I can push it.

GARTHWILSON

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Sun Aug 14, 2022 10:38 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8427
Location: Southern California

tmr4 wrote:

BigEd wrote:

...there is a bug lurking in the arithmetic.

Martin_H wrote:

So I think BigEd is right that the image artifacts are a lurking bug.

Thanks, I think you're both right. I had done some basic testing on my 32-bit fp package and was actually using the Mandelbrot plot to give it more rigorous testing. I was too dense to realize that the artifacts in the 32-bit plot were due to a bug rather than insufficient precision. With some more testing, it appears that on occasion I lose a sign for some of my 32-bit fp numbers. I still need to track down the specific cause.

Make sure the handling of the mantissa doesn't have the same bugs that made it into the scaled-integer multiplication and division in Forth's UM* and UM/MOD.

UM* (multiplication) bug in common 6502 Forths (and my fix) Also shows some faster variations, with code size and speed comparisons.
UM/MOD (32-bit division) bug in common 6502 Forths (and my fix)

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

tmr4

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Mon Aug 15, 2022 4:52 am

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147

GARTHWILSON wrote:

Make sure the handling of the mantissa doesn't have the same bugs that made it into the scaled-integer multiplication and division in Forth's UM* and UM/MOD.

I always note to check this, at least for my division routines (I'm less clear on the historic UM* problem and how to test for it).

I've been puzzling how to check this with floating point and am wondering if it's even possible to test with the values on your UM/MOD page, only one of which is a proper normalized floating-point value. I suppose it's possible for subnormal values, but I haven't factored those in yet (and maybe never will).

drogon

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Mon Aug 15, 2022 9:51 am

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1398
Location: Scotland

My "go to" Mandelbrot for some time now has been the following BASIC program. It's deliberately simplified to run on different MS and MS-style BASICs (of the 8-bit era). I've used it as a crude benchmark, and it should be easy to translate on a line-by-line basis to other languages without deliberately trying to speed it up (but such is the nature of benchmarking!)

Its maximum "depth" is 18, which produces a decent image on a text terminal.

I'm curious about the performance of the 32-bit FP code though - currently my BCPL '816 system uses 32-bit IEEE floats, but I offload the processing to the board "co-processor" which is an ATmega running at 16 MHz.

Recently we (my learned peers here) found that some integer 32-bit calculations could be faster on the 6502/816 than on the ATmega - even though the ATmega has a hardware 2-cycle 8x8 multiplier! There were code experiments here: viewtopic.php?f=2&t=6838&hilit=multiply

But back to Mandelbrots:

This is the code I use - it's BASIC, but very adaptable:

Code:

  100 REM A BASIC, ASCII MANDELBROT
  110 REM
  120 REM This implementation copyright (c) 2019, Gordon Henderson
  130 REM
  140 REM Permission to use/abuse anywhere for any purpose granted, but
  150 REM it comes with no warranty whatsoever. Good luck!
  160 REM
  170 C$ = ".,'~=+:;[/<&?oxOX# " : REM 'Pallet' Lightest to darkest...
  180 SO = 1 : REM Set to 0 if your MID$() indexes from 0.
  190 MI = LEN(C$)
  200 MX = 4
  210 LS = -2.0
  220 TP = 1.25
  230 XS = 2.5
  240 YS = -2.5
  250 W = 64
  260 H = 48
  270 SX = XS / W
  280 SY = YS / H
  290 Q = TIME
  300 FOR Y = 0 TO H
  310   CY = Y * SY + TP
  320   FOR X = 0 TO W
  330     CX = X * SX + LS
  340     ZX = 0
  350     ZY = 0
  360     CC = SO
  370     X2 = ZX * ZX
  380     Y2 = ZY * ZY
  390     IF CC > MI THEN GOTO 460
  400     IF (X2 + Y2) > MX THEN GOTO 460
  410     T = X2 - Y2 + CX
  420     ZY = 2 * ZX * ZY + CY
  430     ZX = T
  440     CC = CC + 1
  450     GOTO 370
  460     PRINT MID$(C$, CC - SO, 1);
  470   NEXT
  480   PRINT
  490 NEXT
  500 PRINT
  510 PRINT (TIME - Q) / 100
  520 END

and the output is (or should be!):

Code:

............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
...,'''''''''''''''''~~~========++:;<                    x    :=~
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.'''~~~~~~~=====+::[&         <<&                             /:=
.'~~~~~~~~=====+::;/?          oO                              :=
.'~~~~~~~====++/;[/o                                          [+=
.~~~~~~=++++::;/???X                                         #:+=
.==++:/::+:;;[[o                                             :+==
                                                           &[:+==
.==++:/::+:;;[[o                                             :+==
.~~~~~~=++++::;/???X                                         #:+=
.'~~~~~~~====++/;[/o                                          [+=
.'~~~~~~~~=====+::;/?          oO                              :=
.'''~~~~~~~=====+::[&         <<&                             /:=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
...,'''''''''''''''''~~~========++:;<                    x    :=~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,

     48.21

That time, 48.21 seconds, was on my Ruby816 board running BBC Basic (4) in 65C02 emulation mode at 16 MHz. BBC Basic has its own 'standard' 5-byte floating point format. MS BASICs with 4-byte FPs typically take double the time.
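For anyone who wants to check the output on a modern machine, the BASIC listing translates almost line for line into Python. The version below is a sketch for comparison rather than part of the original benchmark: it uses Python's 64-bit floats, so characters lying right on the set boundary may differ from a run with 4- or 5-byte floats, and it leaves out the timing lines. The function name is made up.

```python
PALETTE = ".,'~=+:;[/<&?oxOX# "   # C$: lightest to darkest
MI = len(PALETTE)                 # iteration cap, as in line 190
MX = 4.0                          # escape threshold on x^2 + y^2
LS, TP = -2.0, 1.25               # left edge and top edge
XS, YS = 2.5, -2.5                # width and (negative) height
W, H = 64, 48


def mandelbrot_rows():
    """Return the plot as a list of strings, mirroring BASIC lines 300-490."""
    sx, sy = XS / W, YS / H
    rows = []
    for y in range(H + 1):            # FOR Y = 0 TO H is inclusive
        cy = y * sy + TP
        row = []
        for x in range(W + 1):
            cx = x * sx + LS
            zx = zy = 0.0
            cc = 1                    # CC = SO, with SO = 1
            while True:
                x2, y2 = zx * zx, zy * zy
                if cc > MI or (x2 + y2) > MX:
                    break
                t = x2 - y2 + cx
                zy = 2 * zx * zy + cy
                zx = t
                cc += 1
            # MID$(C$, CC - SO, 1) is 1-based; Python strings are 0-based.
            row.append(PALETTE[cc - 2])
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    print("\n".join(mandelbrot_rows()))
```

The loop always completes at least one pass, so `cc` is between 2 and 20 on exit and the palette index stays in range.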

When I translated it line for line into BCPL in my Ruby816 board the time is much less: - 26 seconds on the same hardware. This uses the ATmega for IEEE floating point. The BCPL compiles into a bytecode which is then interpreted by an assembly language program running on the '816.

So I'd be really interested if your 32-bit floating point could improve on that...

Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/

tmr4

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Mon Aug 15, 2022 5:39 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147

drogon wrote:

I'm curious about the performance of the 32-bit FP code though - currently my BCPL '816 system uses 32-bit IEEE floats, but I offload the processing to the board "co-processor" which is an ATmega running at 16 MHz.

One of the speed enhancements I made when adding Marco's package to my system was to create a dedicated floating-point stack. In converting to 32-bit floating-point I translated all functions to operate with values on the stack, eliminating the transfer of values to dedicated floating-point registers and back. This comes at a cost, as an index register is used for stack access and the resulting address modes are less efficient than those available with dedicated floating-point registers. Still, the savings seem significant compared with Marco's package. I estimate that using a floating-point stack was about 1.5 times faster than switching to the dedicated floating-point direct page and loading the registers in Marco's package (assuming the switch to 32-bit was about 4 times faster than 128-bit, giving the overall 5.5 times speed improvement I saw with my 32-bit version on my Mandelbrot code).

It would be interesting to see how much time was consumed simply by transferring values back and forth to the ATmega in your system. It seems with that you'd need a double transfer, even less efficient than what's required with Marco's 128-bit package. I'd guess it's not a trivial amount, but hopefully much less than doing the equivalent work on the '816 though.

drogon wrote:

Recently we (my learned peers here) found that some integer 32-bit calculations could be faster on the 6502/816 than on the ATmega - even though the ATmega has a hardware 2-cycle 8x8 multiplier! There were code experiments here: viewtopic.php?f=2&t=6838&hilit=multiply
That time, 48.21 seconds, was on my Ruby816 board running BBC Basic (4) in 65C02 emulation mode at 16 MHz. BBC Basic has its own 'standard' 5-byte floating point format. MS BASICs with 4-byte FPs typically take double the time.

Interesting, you'd expect the MS version to be faster just based on the smaller float size. Perhaps they store a compacted value internally to save memory but at the cost of having to unpack it for each calculation. I follow Marco's method of using an unpacked internal representation. With that, floating-point calcs are pretty straightforward.
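As an illustration of the packed versus unpacked distinction, here is a sketch in Python using the IEEE 754 single layout. It says nothing about how MS BASIC or Marco's package actually store values; the helper names are invented and only normalized numbers are handled.

```python
import struct


def unpack_single(x):
    """Split an IEEE 754 single into (sign, unbiased exponent, 24-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if biased_exp == 0 or biased_exp == 0xFF:
        raise ValueError("zero, subnormal, infinity and NaN are not handled here")
    mantissa = fraction | 0x800000      # restore the hidden leading 1
    return sign, biased_exp - 127, mantissa


def pack_single(sign, exponent, mantissa):
    """Inverse of unpack_single for normalized values."""
    bits = (sign << 31) | ((exponent + 127) << 23) | (mantissa & 0x7FFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


sign, exponent, mantissa = unpack_single(-2.5)
print(sign, exponent, hex(mantissa))
```

A packed format has to go through something like `unpack_single` before every operation and `pack_single` after it, which is the per-calculation cost guessed at above. Keeping values unpacked trades a little memory for skipping both steps.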

drogon wrote:

When I translated it line for line into BCPL in my Ruby816 board the time is much less: - 26 seconds on the same hardware. This uses the ATmega for IEEE floating point. The BCPL compiles into a bytecode which is then interpreted by an assembly language program running on the '816.

You're one step from being able to assess the transfer overhead. How do you interface with the ATmega? Transfer fp values to registers and execute a COP?

drogon wrote:

So I'd be really interested if your 32-bit floating point could improve on that...

I haven't made any refinements to the algorithms Marco used in his floating-point package and as mentioned above, in some cases I had to use less efficient address modes in implementing a floating-point stack. Also, in moving to using 16-bit registers, I eliminated some of the byte manipulation areas of his code where I didn't feel the overhead was justified for a 32-bit mantissa. I lost some efficiency there as well. So, I have some optimization potential, but for now, ultimate speed isn't a focus.

That said, using the timing estimates from my previous post, I think my system running at 16 MHz would produce a 20 iteration, 64x48 plot similar to yours in about 24 seconds. Very comparable to what you're seeing. Of course, that's all very theoretical. I haven't even transferred my floating-point package to my hardware yet and right now, my breadboard-based '816 system is unstable above 10 MHz.

BTW - I discovered the error producing the anomalies in my 32-bit plot. When adding a very small value to a much larger negative value my 32-bit code produces an infinite value. In translating Marco's code to 32-bit I stripped out a lot of stuff. Not being interested in infinite values, I didn't focus much on those areas of the code. Obviously, I messed up in one of those areas.
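For reference, the case described here (a tiny addend next to a much larger operand) is usually handled in the alignment step of addition: if the exponents differ by at least the mantissa width, the smaller value cannot change the result and the larger operand is returned as is. The Python sketch below shows that step on unpacked values. It is not the 32-bit code from this thread, which has not been posted; the tuple layout, names, and 24-bit mantissa are assumptions, and it truncates rather than rounds and ignores infinities and exponent overflow.

```python
import math

MANT_BITS = 24
TOP_BIT = 1 << (MANT_BITS - 1)


def fp_add(a, b):
    """Add two unpacked values (sign, exponent, mantissa); zero has mantissa 0."""
    if a[2] == 0:
        return b
    if b[2] == 0:
        return a
    if (a[1], a[2]) < (b[1], b[2]):    # make a the larger magnitude
        a, b = b, a
    sa, ea, ma = a
    sb, eb, mb = b
    shift = ea - eb
    if shift >= MANT_BITS:
        return a                       # b is too small to affect a
    mb >>= shift                       # align b to a's exponent
    if sa == sb:
        m = ma + mb
        if m >> MANT_BITS:             # carry out: shift right once
            m >>= 1
            ea += 1
    else:
        m = ma - mb
        if m == 0:
            return (0, 0, 0)
        while not (m & TOP_BIT):       # cancellation: renormalize left
            m <<= 1
            ea -= 1
    return (sa, ea, m)


def to_float(v):
    """Convert an unpacked value to a Python float for checking."""
    sign, exponent, mantissa = v
    return (-1 if sign else 1) * math.ldexp(mantissa, exponent - (MANT_BITS - 1))


big_negative = (1, 1, 0xA00000)        # -2.5
tiny_positive = (0, -30, 0x800000)     # 2**-30
print(to_float(fp_add(big_negative, tiny_positive)))
```

A test along these lines, feeding pairs with widely separated exponents and mixed signs, is one way to exercise the early-exit path directly instead of waiting for a stray pixel to reveal it.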

drogon

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Mon Aug 15, 2022 6:00 pm

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1398
Location: Scotland

tmr4 wrote:

You're one step from being able to assess the transfer overhead. How do you interface with the ATmega? Transfer fp values to registers and execute a COP?

The latency is relatively high, but I've never measured it. In essence, the 6502/816 writes data into RAM then executes a WAI instruction. The ATmega meanwhile has been polling for the RDY line to go low and when it does, it pulls BE low on the 65xx, then attaches itself to the RAM (un-tristates an 8-bit port for data and another for AD[0:7], AD[8:15] is pulled high), reads the data, does the action (floating point operation, serial IO, filing system, etc.) then writes the result back to the RAM, and reverses the procedure to transfer control back to the 65xx, ending by sending an NMI to wake up the 65xx which does nothing more than RTI.

Access to/from the SD card goes at about 33KB/sec in 128 byte bursts (native speed on the ATmega is 1MB/sec) so I could work it out from that or just time it but timing from the 6502 side is hard as everything stops at that point including the timer interrupt...
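Working it out from the SD figures might look like the sketch below, in Python. The assumptions are loud ones: KB is taken as 1000 bytes, MB as 1,000,000 bytes, and each 128-byte burst is assumed to cost exactly one handshake. The result is only a rough upper bound for a floating-point call, since it also includes copying 128 bytes through the shared RAM, far more than an fp operation needs to move.

```python
BURST_BYTES = 128
OBSERVED_BPS = 33_000        # "about 33KB/sec", taking KB as 1000 bytes
NATIVE_BPS = 1_000_000       # "1MB/sec" native on the ATmega

burst_time = BURST_BYTES / OBSERVED_BPS    # seconds per burst via shared RAM
native_time = BURST_BYTES / NATIVE_BPS     # seconds for the SD transfer alone
overhead = burst_time - native_time        # handshake plus copying per burst

print(f"per burst: {burst_time * 1e3:.2f} ms, overhead about {overhead * 1e3:.2f} ms")
```

On those numbers each round trip costs a few milliseconds, so a direct timing would still be needed to say how much of the 26-second BCPL run is transfer overhead.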

-Gordon

tmr4

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Mon Aug 15, 2022 7:27 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147

drogon wrote:

In essence, the 6502/816 writes data into RAM then executes a WAI instruction. The ATmega meanwhile has been polling for the RDY line to go low and when it does, it pulls BE low on the 65xx, then attaches itself to the RAM (un-tristates an 8-bit port for data and another for AD[0:7], AD[8:15] is pulled high), reads the data, does the action (floating point operation, serial IO, filing system, etc.) then writes the result back to the RAM, and reverses the procedure to transfer control back to the 65xx, ending by sending an NMI to wake up the 65xx which does nothing more than RTI.

Nice setup.

When I first looked into floating-point, I considered using the floating-point capability of the smart display I use with my builds. I think a setup similar to yours is possible with it, but I just communicate with it over a serial connection. That was too slow for floating-point, so I went the software route.

drogon

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Mon Aug 15, 2022 7:59 pm

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1398
Location: Scotland

This is sort-of a side-effect of my original intention to be "ROMless". The shared RAM area is $FFxx and it initially writes a small bootloader there then un-resets the 65xx where it boots, relocates the bootloader then uses that area to get the rest of the operating system into RAM. Initially the ATmega was also doing video, but a revision later and I dropped the video and started to use it for other stuff like serial and SD card access.

-Gordon

leepivonka

Post subject: Re: 128 bit Floating Point 65C816 implementation

Posted: Fri Aug 19, 2022 2:01 am

Joined: Fri Apr 15, 2016 1:03 am
Posts: 135

Here are some more 65816 software-only Mandelbrot samples.

Using 128bit mantissa floating-point, runs in 108 sec at 10MHz.

Code:

mainh
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
...,'''''''''''''''''~~~========++:;<                    x    :=~
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.'''~~~~~~~=====+::[&         <<&                             /:=
.'~~~~~~~~=====+::;/?          oO                              :=
.'~~~~~~~====++/;[/o                                          [+=
.~~~~~~=++++::;/???X                                         #:+=
.==++:/::+:;;[[o                                             :+==
                                                           &[:+==
.==++:/::+:;;[[o                                             :+==
.~~~~~~=++++::;/???X                                         #:+=
.'~~~~~~~====++/;[/o                                          [+=
.'~~~~~~~~=====+::;/?          oO                              :=
.'''~~~~~~~=====+::[&         <<&                             /:=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
...,'''''''''''''''''~~~========++:;<                    x    :=~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,

1076 MCycles ok

Using 32bit mantissa floating-point, runs in 18 sec at 10MHz.

Code:

mainf
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
...,'''''''''''''''''~~~========++:;<                    x    :=~
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.'''~~~~~~~=====+::[&         <<&                             /:=
.'~~~~~~~~=====+::;/?          oO                              :=
.'~~~~~~~====++/;[/o                                          [+=
.~~~~~~=++++::;/???X                                         #:+=
.==++:/::+:;;[[o                                             :+==
/                                                          &[:+==
.==++:/::+:;;[[o                                             :+==
.~~~~~~=++++::;/???X                                         #:+=
.'~~~~~~~====++/;[/o                                          [+=
.'~~~~~~~~=====+::;/?          oO                              :=
.'''~~~~~~~=====+::[&         <<&                             /:=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
...,'''''''''''''''''~~~========++:;<                    x    :=~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,

181 MCycles ok
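The run times follow directly from the reported cycle counts and the 10 MHz clock. A small Python check, using only the figures from the two logs above (the variable names are made up):

```python
CLOCK_HZ = 10_000_000
runs = {"128-bit mantissa": 1076e6, "32-bit mantissa": 181e6}   # reported cycles

seconds = {name: cycles / CLOCK_HZ for name, cycles in runs.items()}
for name, s in seconds.items():
    print(f"{name}: {s:.1f} s at 10 MHz")

speedup = runs["128-bit mantissa"] / runs["32-bit mantissa"]
print(f"32-bit is about {speedup:.1f} times faster")
```

The ratio comes out a little under 6, close to the 5.5 times improvement tmr4 reported for his 32-bit conversion earlier in the thread.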

Attachments:

File comment: 32bit console log

F_MandF4_1.zip [4.04 KiB]

File comment: 128bit console log

F_MandH4_1.zip [7.15 KiB]
