6502.org Forum
PostPosted: Sun Jul 24, 2022 7:28 pm 
Joined: Fri Sep 01, 2017 6:20 pm
Posts: 6
granati wrote:
Now all source code is browsable online at:

http://65xx.unet.bz/repos

I've also put the source code for the GAL online (it was missing until now)

Marco


Can I download / checkout over SVN?


PostPosted: Sun Jul 24, 2022 9:39 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147
sanny wrote:
granati wrote:
Now all source code is browsable online at:

http://65xx.unet.bz/repos

I've also put the source code for the GAL online (it was missing until now)

Marco


Can I download / checkout over SVN?

Here is a direct link to the floating-point code that was provided earlier in this post: http://65xx.unet.bz/fpu.txt. It's a bit newer than what's in the repo.


PostPosted: Sat Aug 13, 2022 8:34 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147
I've been playing with this floating-point package for a few weeks now. Here's a Mandelbrot set with 100 iterations per point:

Code:
128-bit fp (100 iterations)
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôòòòà·Îp cÄÛåòòòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                ਺∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ ¨                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ                           Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                              Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü    ¿            =                             Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü    ¿            =                             Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                              Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ                           Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ ¨                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                ਺∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôòòòà·Îp cÄÛåòòòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù

I haven't tried this on my hardware build yet, but it took about 26 minutes to produce on my fastest PC using my 65816 emulator, which I estimate runs at the equivalent of about 800 kHz. It would still take over 2 1/2 minutes on my hardware at 8 MHz.

I wanted something faster, so I looked for ways to speed up the floating-point package. I noted a few possibilities when I added the fp package to my system:
    * Use the system direct page. The fp package uses a dedicated direct page, so the direct page register must be switched there and back for each fp operation.
    * Use a dedicated fp stack. The fp package uses registers, which must be loaded before each fp operation and read back for the result afterwards.
    * Use 16-bit registers unless 8-bit ones are needed. The fp package requires every fp operation to be called with 8-bit registers. Since my system uses 16-bit registers, this required a switch before and after every fp operation.

Of course, the biggest slowdown in my Mandelbrot calculation above was using a 128-bit fp package in the first place. Did I really need that much precision? Curious, I decided to convert the key fp routines needed for the Mandelbrot calculation to 32-bit precision, incorporating the improvements noted above as well. Here is the output with the 32-bit fp routines:

Code:
32-bit fp (100 iterations)
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôùôùùòòòà·ùp cÄÛåòùòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                ਺∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ â                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ   ù ùù                    Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                     Ý        Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü                                               Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüü                                               Ýïòòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüòòííØ            Ó                              âíòôô÷÷÷÷÷÷ùùùùù
üüüüüüüüüüüüüüüüüòííèèª É     Äââà                     Ý        Ýàòôô÷÷÷÷÷ùùùùùù
üüüüüüüüüüüüüüüüü÷÷÷÷÷èíèåØèèíííêêàw                              ïô÷÷÷÷÷ùùùùùùù
üüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷òòòòòòïïíêâ   ù ùù                    Áêíò÷÷÷÷÷ùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷òòòòòòïïïÝâ â                      Äèïòò÷÷÷÷ùùùùùùùùù
üüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôòòòïíèØ  B¼                ਺∟Îèò÷÷÷÷ùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷ôôôôôôàà åêèÎØÛÛÖÄ   ÆØàÛå êïïïíÌï÷÷÷ùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷ôôôôôôôôïïïííÝ      Áêíïòòòòòô÷÷÷ùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷ôôôôùôùùòòòà·ùp cÄÛåòùòôôôô÷ùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôôôòòòòíêÛØêíïôôôôùùùùùùùùùùùùùùùùùùùùù
üüüüüüüüüüüüüüüüüüüüüüüüüüü÷÷÷÷÷÷÷÷÷÷÷÷÷÷ôôôôòòòïèàèíâùùùùùùùùùùùùùùùùùùùùùùùùùù

This took under 5 minutes to produce, about 5.5 times faster than the 128-bit fp version. The 32-bit version should run in under 30 seconds on my hardware build at 8 MHz. That's more attractive.
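As a rough check on these estimates, here is a small sketch of the scaling arithmetic in Python. It assumes run time scales inversely with clock speed (same cycle count, no wait states); the function name and constants are only illustrative, with the figures taken from the post above.

```python
def scale_runtime(seconds, from_hz, to_hz):
    """Scale a run time measured at one clock rate to another,
    assuming the cycle count stays the same."""
    return seconds * from_hz / to_hz

EMULATOR_HZ = 800_000      # estimated effective emulator speed
HARDWARE_HZ = 8_000_000    # hardware build clock

t128_emulator = 26 * 60               # 128-bit run: about 26 minutes
t32_emulator = t128_emulator / 5.5    # 32-bit run: about 5.5 times faster

t128_hw = scale_runtime(t128_emulator, EMULATOR_HZ, HARDWARE_HZ)
t32_hw = scale_runtime(t32_emulator, EMULATOR_HZ, HARDWARE_HZ)

print(f"128-bit on hardware: {t128_hw / 60:.1f} min")
print(f"32-bit on hardware:  {t32_hw:.0f} s")
```

Under that assumption the 128-bit run comes out a little over 2 1/2 minutes and the 32-bit run a little under 30 seconds, in line with the figures quoted.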

But what about the loss of precision? Comparing the two plots you can see slight discrepancies in the 32-bit plot. A few of these are corrected by increasing the number of iterations per point, but others are not (like the anomaly in the center of the Mandelbrot set). Thus, it seems that 32-bit precision won't provide as precise a result, even at the coarse granularity shown above.

If there is interest, I can clean up and post my 32-bit fp code. So far, I've only implemented 32-bit fp addition, subtraction, multiplication, division, square and a few conversion routines, just what I needed to run my Mandelbrot routine.


PostPosted: Sun Aug 14, 2022 5:39 am

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Nice work! I'd definitely look into those two stray interior pixels in the 32 bit case - there is a bug lurking in the arithmetic. If you tabulate the orbit of that point (there's only one point, if we take symmetry into account) you should be able to see it diverge from a correct calculation.

100 iterations is a lot - I wonder if perhaps that will provoke rounding errors at all? But first, check those pixels. Then, perhaps, see if 20 iterations gives you any difference from 100.
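One way to tabulate an orbit and compare it against a reference could look like the sketch below. It is in Python rather than 65816 code, it emulates single precision by rounding every intermediate result through `struct`, and the helper names (`f32`, `orbit`, `first_divergence`) are made up for illustration. The sample point is a placeholder, not the actual coordinates of the stray pixels.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def orbit(cx, cy, max_iter, rnd=lambda v: v):
    """List the iterates of z -> z*z + c, stopping once |z|^2 > 4.
    rnd is applied after every arithmetic step."""
    cx, cy = rnd(cx), rnd(cy)
    zx = zy = 0.0
    points = []
    for _ in range(max_iter):
        x2 = rnd(zx * zx)
        y2 = rnd(zy * zy)
        if rnd(x2 + y2) > 4.0:
            break
        new_zx = rnd(rnd(x2 - y2) + cx)
        zy = rnd(rnd(rnd(2.0 * zx) * zy) + cy)
        zx = new_zx
        points.append((zx, zy))
    return points

def first_divergence(cx, cy, max_iter=100, tol=1e-3):
    """Index of the first iterate where the double and single orbits
    disagree by more than tol (or differ in length), else None."""
    exact = orbit(cx, cy, max_iter)
    single = orbit(cx, cy, max_iter, rnd=f32)
    for i, ((ax, ay), (bx, by)) in enumerate(zip(exact, single)):
        if abs(ax - bx) > tol or abs(ay - by) > tol:
            return i
    if len(exact) != len(single):
        return min(len(exact), len(single))
    return None

# Placeholder point; substitute the coordinates of the suspect pixel.
print(first_divergence(-0.75, 0.1))
```

Dumping the iterates from the routine under test next to the `orbit` output for the same point should show the step at which the two part ways. Running `orbit` with 20 and then 100 iterations gives the comparison suggested above.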


PostPosted: Sun Aug 14, 2022 11:11 am

Joined: Wed Jan 08, 2014 3:31 pm
Posts: 578
Nice images.

With regard to 32 vs 128 bit: when you aren't zooming in, 32 bit should yield a fine image. I've even gotten passable images with 16 bit fixed point. So I think BigEd is right that the image artifacts are a lurking bug.

With 128 bit you should be able to do some deep zooming before the arithmetic breaks down.
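To give an idea of what a fixed-point version involves, here is a minimal sketch in Python. The Q4.12 layout (4 integer bits, 12 fraction bits) is an assumption chosen for illustration, not necessarily the format I used, and Python integers don't overflow, so the comments mark where a real 16-bit implementation would need a wider intermediate.

```python
FRAC_BITS = 12            # assumed Q4.12 layout
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0

def to_fixed(x):
    """Convert a float to the nearest Q4.12 integer."""
    return int(round(x * ONE))

def fmul(a, b):
    """Multiply two Q4.12 values. The product needs a 32-bit
    intermediate on real hardware; >> floors like an arithmetic shift."""
    return (a * b) >> FRAC_BITS

def escape_count(cx, cy, max_iter=20):
    """Number of iterations completed before |z|^2 exceeds 4,
    using integer arithmetic only."""
    cxf, cyf = to_fixed(cx), to_fixed(cy)
    zx = zy = 0
    limit = 4 * ONE
    for n in range(max_iter):
        x2 = fmul(zx, zx)   # squares can exceed the 16-bit range
        y2 = fmul(zy, zy)   # just before escape is detected
        if x2 + y2 > limit:
            return n
        zx, zy = x2 - y2 + cxf, 2 * fmul(zx, zy) + cyf
    return max_iter

print(escape_count(-0.125, 0.0), escape_count(2.0, 2.0))
```

With 12 fraction bits the step between representable values is about 0.00024, which is plenty for a text-mode plot but runs out quickly once you zoom in.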


PostPosted: Sun Aug 14, 2022 9:44 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147
BigEd wrote:
...there is a bug lurking in the arithmetic.

Martin_H wrote:
So I think BigEd is right that the image artifacts are a lurking bug.

Thanks, I think you're both right. I had done some basic testing on my 32-bit fp package and actually was using the Mandelbrot plot to give it a more rigorous test. I was too dense to realize that the artifacts in the 32-bit plot were due to a bug rather than insufficient precision. With some more testing, it appears that on occasion I lose a sign for some of my 32-bit fp numbers. I still need to track down the specific cause.

BigEd wrote:
100 iterations is a lot ... see if 20 iterations gives you any difference from 100.
Thanks. I didn't have a feel for the number of iterations required for a basic plot. The higher iterations provide some refined detail at the boundary, but at this resolution too high a number is overkill and really slows down the calculation for points within the set.

Martin_H wrote:
With regard to 32 vs 128 bit: when you aren't zooming in, 32 bit should yield a fine image. I've even gotten passable images with 16 bit fixed point. ... With 128 bit you should be able to do some deep zooming before the arithmetic breaks down.
Thanks for the feedback. Once I track down my bug, I'll try to vary some of my input parameters to see how far I can push it.


PostPosted: Sun Aug 14, 2022 10:38 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
tmr4 wrote:
BigEd wrote:
...there is a bug lurking in the arithmetic.

Martin_H wrote:
So I think BigEd is right that the image artifacts are a lurking bug.

Thanks, I think you're both right. I had done some basic testing on my 32-bit fp package and actually was using the Mandelbrot plot to give it a more rigorous test. I was too dense to realize that the artifacts in the 32-bit plot were due to a bug rather than insufficient precision. With some more testing, it appears that on occasion I lose a sign for some of my 32-bit fp numbers. I still need to track down the specific cause.

Make sure the handling of the mantissa doesn't have the same bugs that made it into the scaled-integer multiplication and division in Forth's UM* and UM/MOD.


_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Mon Aug 15, 2022 4:52 am

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147
GARTHWILSON wrote:
Make sure the handling of the mantissa doesn't have the same bugs that made it into the scaled-integer multiplication and division in Forth's UM* and UM/MOD.

I always note to check this for at least my division routines (I'm less clear on the historic UM* problem and how to test for it).

I've been puzzling how to check this with floating point and am wondering if it's even possible to test with the values on your UM/MOD page, only one of which is a proper normalized floating-point value. I suppose it's possible for subnormal values, but I haven't factored those in yet (and maybe never will).
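For what it's worth, here is a sketch of the kind of check I have in mind for the division side, written in Python. It models a 32-bit by 16-bit shift-and-subtract divide and, as an assumption for illustration, a faulty variant that drops the bit shifted out of the 16-bit remainder. That is one well-known way such routines go wrong with large divisors; I'm not claiming it is exactly the bug described on the UM/MOD page. The name `um_mod` and the `keep_carry` switch are mine.

```python
def um_mod(dividend, divisor, keep_carry=True):
    """Unsigned 32-bit / 16-bit restoring division.
    Returns (remainder, quotient). With keep_carry=False the bit
    shifted out of the 16-bit remainder is discarded, which models
    the assumed failure mode."""
    assert 0 < divisor <= 0xFFFF
    assert (dividend >> 16) < divisor      # quotient must fit in 16 bits
    rem = dividend >> 16
    low = dividend & 0xFFFF
    quot = 0
    for _ in range(16):
        rem = (rem << 1) | (low >> 15)     # shift the next dividend bit in
        low = (low << 1) & 0xFFFF
        if not keep_carry:
            rem &= 0xFFFF                  # the 17th bit is lost here
        quot <<= 1
        if rem >= divisor:
            rem -= divisor
            quot |= 1
    return rem, quot

# Divisor with its top bit set, as a normalized mantissa would have.
n, d = 0xFFFEFFFF, 0xFFFF
q, r = divmod(n, d)
print(um_mod(n, d) == (r, q), um_mod(n, d, keep_carry=False) == (r, q))
```

Since a normalized mantissa always has its top bit set, mantissa division would exercise this case on nearly every call, so comparing a routine's results against a known-good divide for a handful of such operands seems like a reasonable test even without subnormals.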


PostPosted: Mon Aug 15, 2022 9:51 am

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
My "go to" Mandelbrot for some time now has been the following BASIC program. It's deliberately simplified to run on different MS and MS-style BASICs (of the 8-bit era). I've used it as a crude benchmark, and it should be easy to translate on a line-by-line basis to other languages without deliberately trying to speed it up (but such is the nature of benchmarking!)

Its maximum "depth" is 18, which produces a decent image on a text terminal.

I'm curious about the performance of the 32-bit FP code though - currently my BCPL '816 system uses 32-bit IEEE floats, but I offload the processing to the board "co-processor" which is an ATmega running at 16 MHz.

Recently we (my learned peers here) found that some integer 32-bit calculations could be faster on the 6502/816 than on the ATmega - even though the ATmega has a hardware 2-cycle 8x8 multiplier! There were code experiments here: viewtopic.php?f=2&t=6838&hilit=multiply

But back to Mandelbrots:

This is the code I use - it's BASIC, but very adaptable:

Code:
  100 REM A BASIC, ASCII MANDELBROT
  110 REM
  120 REM This implementation copyright (c) 2019, Gordon Henderson
  130 REM
  140 REM Permission to use/abuse anywhere for any purpose granted, but
  150 REM it comes with no warranty whatsoever. Good luck!
  160 REM
  170 C$ = ".,'~=+:;[/<&?oxOX# " : REM 'Palette' Lightest to darkest...
  180 SO = 1 : REM Set to 0 if your MID$() indexes from 0.
  190 MI = LEN(C$)
  200 MX = 4
  210 LS = -2.0
  220 TP = 1.25
  230 XS = 2.5
  240 YS = -2.5
  250 W = 64
  260 H = 48
  270 SX = XS / W
  280 SY = YS / H
  290 Q = TIME
  300 FOR Y = 0 TO H
  310   CY = Y * SY + TP
  320   FOR X = 0 TO W
  330     CX = X * SX + LS
  340     ZX = 0
  350     ZY = 0
  360     CC = SO
  370     X2 = ZX * ZX
  380     Y2 = ZY * ZY
  390     IF CC > MI THEN GOTO 460
  400     IF (X2 + Y2) > MX THEN GOTO 460
  410     T = X2 - Y2 + CX
  420     ZY = 2 * ZX * ZY + CY
  430     ZX = T
  440     CC = CC + 1
  450     GOTO 370
  460     PRINT MID$(C$, CC - SO, 1);
  470   NEXT
  480   PRINT
  490 NEXT
  500 PRINT
  510 PRINT (TIME - Q) / 100
  520 END


and the output is (or should be!):

Code:
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
...,'''''''''''''''''~~~========++:;<                    x    :=~
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.'''~~~~~~~=====+::[&         <<&                             /:=
.'~~~~~~~~=====+::;/?          oO                              :=
.'~~~~~~~====++/;[/o                                          [+=
.~~~~~~=++++::;/???X                                         #:+=
.==++:/::+:;;[[o                                             :+==
                                                           &[:+==
.==++:/::+:;;[[o                                             :+==
.~~~~~~=++++::;/???X                                         #:+=
.'~~~~~~~====++/;[/o                                          [+=
.'~~~~~~~~=====+::;/?          oO                              :=
.'''~~~~~~~=====+::[&         <<&                             /:=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
...,'''''''''''''''''~~~========++:;<                    x    :=~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,

     48.21


That time, 48.21 seconds, was on my Ruby816 board running BBC Basic (4) in 65C02 emulation mode at 16 MHz. BBC Basic has its own 'standard' 5-byte floating point format. MS BASICs with 4-byte floats typically take double the time.

When I translated it line for line into BCPL on my Ruby816 board the time is much less: 26 seconds on the same hardware. This uses the ATmega for IEEE floating point. The BCPL compiles into a bytecode which is then interpreted by an assembly language program running on the '816.
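To show how directly the BASIC listing above carries over to another language, here is a sketch of the same loop in Python (chosen only for illustration; it is not the BCPL version). The BASIC variable names are noted in comments. It uses double-precision floats, so a character right on a boundary could in principle differ from the BBC Basic output.

```python
PALETTE = ".,'~=+:;[/<&?oxOX# "   # C$: lightest to darkest
MAX_COUNT = len(PALETTE)          # MI
ESCAPE = 4.0                      # MX
LEFT, TOP = -2.0, 1.25            # LS, TP
X_SPAN, Y_SPAN = 2.5, -2.5        # XS, YS
WIDTH, HEIGHT = 64, 48            # W, H

def render():
    """Return the plot as a list of strings, one per text row."""
    sx = X_SPAN / WIDTH            # SX
    sy = Y_SPAN / HEIGHT           # SY
    rows = []
    for y in range(HEIGHT + 1):    # FOR Y = 0 TO H is inclusive
        cy = y * sy + TOP
        row = []
        for x in range(WIDTH + 1): # FOR X = 0 TO W is inclusive
            cx = x * sx + LEFT
            zx = zy = 0.0
            count = 1              # CC = SO, with SO = 1
            while count <= MAX_COUNT:          # line 390
                x2, y2 = zx * zx, zy * zy
                if x2 + y2 > ESCAPE:           # line 400
                    break
                zx, zy = x2 - y2 + cx, 2 * zx * zy + cy
                count += 1
            # MID$(C$, CC - SO, 1) is 1-based, Python indexing is 0-based
            row.append(PALETTE[count - 2])
        rows.append("".join(row))
    return rows

print("\n".join(render()))
```

The structured `while` loop replaces the `GOTO 370` / `GOTO 460` pair but performs the same tests in the same order, so the iteration counts should match the BASIC version.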

So I'd be really interested if your 32-bit floating point could improve on that...

Cheers,

-Gordon

_________________
--
Gordon Henderson.
See my Ruby 6502 and 65816 SBC projects here: https://projects.drogon.net/ruby/


PostPosted: Mon Aug 15, 2022 5:39 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147
drogon wrote:
I'm curious about the performance of the 32-bit FP code though - currently my BCPL '816 system uses 32-bit IEEE floats, but I offload the processing to the board "co-processor" which is an ATmega running at 16 MHz.
One of the speed enhancements I made when adding Marco's package to my system was to create a dedicated floating-point stack. In converting to 32-bit floating-point I translated all functions to operate on values on the stack, eliminating the transfer of values to dedicated floating-point registers and back. This comes at a cost, as an index register is used for stack access and the resulting address modes are less efficient than those available with dedicated floating-point registers. Still, the savings seem significant compared with Marco's package. I estimate that using a floating-point stack was about 1.5 times faster than switching to the dedicated floating-point direct page and loading the registers in Marco's package (assuming the switch to 32-bit was about 4 times faster than 128-bit, giving the overall 5.5 times speed improvement I saw with my 32-bit version on my Mandelbrot code).

It would be interesting to see how much time was consumed simply by transferring values back and forth to the ATmega in your system. It seems with that you'd need a double transfer, even less efficient than what's required with Marco's 128-bit package. I'd guess it's not a trivial amount, but hopefully much less than doing the equivalent work on the '816 though.
drogon wrote:
Recently we (my learned peers here) found that some integer 32-bit calculations could be faster on the 6502/816 than on the ATmega - even though the ATmega has a hardware 2-cycle 8x8 multiplier! There were code experiments here: viewtopic.php?f=2&t=6838&hilit=multiply
That time, 48.21 seconds, was on my Ruby816 board running BBC Basic (4) in 65C02 emulation mode at 16 MHz. BBC Basic has its own 'standard' 5-byte floating point format. MS BASICs with 4-byte floats typically take double the time.
Interesting, you'd expect the MS version to be faster just based on the smaller float size. Perhaps they store a compacted value internally to save memory but at the cost of having to unpack it for each calculation. I follow Marco's method of using an unpacked internal representation. With that, floating-point calcs are pretty straightforward.
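To illustrate what I mean by packed versus unpacked, here is a small Python sketch that splits an IEEE 754 single into separate sign, exponent and mantissa fields and puts it back together. IEEE single is used here only as a familiar stand-in; it is not the internal layout of Marco's package or of any particular BASIC, and the function names are made up.

```python
import struct

def unpack_single(x):
    """Split an IEEE 754 single into (sign, unbiased exponent,
    24-bit mantissa with the hidden bit made explicit).
    Normal numbers only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if biased in (0, 0xFF):
        raise ValueError("zero, subnormal, infinity and NaN not handled here")
    return sign, biased - 127, frac | 0x800000

def pack_single(sign, exponent, mantissa):
    """Inverse of unpack_single: drop the hidden bit and re-bias."""
    bits = (sign << 31) | ((exponent + 127) << 23) | (mantissa & 0x7FFFFF)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

sign, exponent, mantissa = unpack_single(-2.5)
print(sign, exponent, hex(mantissa))
print(pack_single(sign, exponent, mantissa))
```

A package that stores values packed has to do the equivalent of `unpack_single` on every operand and `pack_single` on every result, which is the overhead I'm guessing at above. Keeping the fields separate costs memory but skips both steps.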

drogon wrote:
When I translated it line for line into BCPL on my Ruby816 board the time is much less: 26 seconds on the same hardware. This uses the ATmega for IEEE floating point. The BCPL compiles into a bytecode which is then interpreted by an assembly language program running on the '816.
You're one step from being able to assess the transfer overhead. How do you interface with the ATmega? Transfer fp values to registers and execute a COP?

drogon wrote:
So I'd be really interested if your 32-bit floating point could improve on that...
I haven't made any refinements to the algorithms Marco used in his floating-point package and as mentioned above, in some cases I had to use less efficient address modes in implementing a floating-point stack. Also, in moving to using 16-bit registers, I eliminated some of the byte manipulation areas of his code where I didn't feel the overhead was justified for a 32-bit mantissa. I lost some efficiency there as well. So, I have some optimization potential, but for now, ultimate speed isn't a focus.

That said, using the timing estimates from my previous post, I think my system running at 16 MHz would produce a 20 iteration, 64x48 plot similar to yours in about 24 seconds. Very comparable to what you're seeing. Of course, that's all very theoretical. I haven't even transferred my floating-point package to my hardware yet and right now, my breadboard-based '816 system is unstable above 10 MHz.

BTW - I discovered the error producing the anomalies in my 32-bit plot. When adding a very small value to a much larger negative value my 32-bit code produces an infinite value. In translating Marco's code to 32-bit I stripped out a lot of stuff. Not being interested in infinite values, I didn't focus much on those areas of the code. Obviously, I messed up in one of those areas.
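For reference, the expected result in that situation is simply the larger operand, since the small one is shifted out entirely during alignment. A quick Python sketch, emulating single precision by rounding through `struct` (the operand values are illustrative, not the ones from my plot):

```python
import math
import struct

def f32(x):
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_add(a, b):
    """Add two values as single-precision numbers."""
    return f32(f32(a) + f32(b))

big = -2.0      # much larger negative value
tiny = 1e-10    # far below half the spacing of singles near 2.0

result = f32_add(big, tiny)
print(result, math.isfinite(result))
```

So a useful regression test for the add routine is a set of operand pairs whose exponents differ by more than the mantissa width, checking that the larger operand comes back unchanged and finite.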


PostPosted: Mon Aug 15, 2022 6:00 pm

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
tmr4 wrote:
drogon wrote:
When I translated it line for line into BCPL on my Ruby816 board the time is much less: 26 seconds on the same hardware. This uses the ATmega for IEEE floating point. The BCPL compiles into a bytecode which is then interpreted by an assembly language program running on the '816.
You're one step from being able to assess the transfer overhead. How do you interface with the ATmega? Transfer fp values to registers and execute a COP?


The latency is relatively high, but I've never measured it. In essence, the 6502/816 writes data into RAM then executes a WAI instruction. The ATmega meanwhile has been polling for the RDY line to go low and when it does, it pulls BE low on the 65xx, then attaches itself to the RAM (un-tristates an 8-bit port for data and another for AD[0:7], AD[8:15] is pulled high), reads the data, does the action (floating point operation, serial IO, filing system, etc.) then writes the result back to the RAM, and reverses the procedure to transfer control back to the 65xx, ending by sending an NMI to wake up the 65xx which does nothing more than RTI.

Access to/from the SD card goes at about 33KB/sec in 128 byte bursts (native speed on the ATmega is 1MB/sec), so I could work it out from that or just time it, but timing from the 6502 side is hard as everything stops at that point, including the timer interrupt...
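The sequence described above can be modelled roughly as follows. This is a Python simulation only: the mailbox address, the opcode numbers and the operand layout are hypothetical and chosen for illustration, and the real signalling (WAI, RDY, BE, NMI) is reduced to comments.

```python
import struct

MAILBOX = 0xFF00          # hypothetical command block in the shared $FFxx page
OP_FADD, OP_FMUL = 1, 2   # hypothetical opcodes

ram = bytearray(0x10000)  # stand-in for the shared RAM

def cpu_request(op, a, b):
    """65xx side: write the opcode and two single-precision operands
    into shared RAM, then (on real hardware) execute WAI."""
    ram[MAILBOX] = op
    ram[MAILBOX + 1:MAILBOX + 9] = struct.pack("<ff", a, b)

def coprocessor_service():
    """ATmega side: runs after it sees RDY low and pulls BE low,
    i.e. while the 65xx is stopped and off the bus."""
    op = ram[MAILBOX]
    a, b = struct.unpack("<ff", ram[MAILBOX + 1:MAILBOX + 9])
    result = a + b if op == OP_FADD else a * b
    ram[MAILBOX + 9:MAILBOX + 13] = struct.pack("<f", result)

def cpu_resume():
    """65xx side: woken by the NMI (handler is just RTI), then reads
    the result back from shared RAM."""
    return struct.unpack("<f", ram[MAILBOX + 9:MAILBOX + 13])[0]

def fp_call(op, a, b):
    cpu_request(op, a, b)     # 1. operands into RAM, WAI
    coprocessor_service()     # 2. ATmega takes the bus and computes
    return cpu_resume()       # 3. bus released, NMI, result read back

print(fp_call(OP_FADD, 1.5, 2.25), fp_call(OP_FMUL, 1.5, 2.0))
```

Every floating-point operation pays for steps 1 and 3 plus the bus handover, which is the latency that is hard to measure from the 65xx side.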

-Gordon

PostPosted: Mon Aug 15, 2022 7:27 pm

Joined: Sat Feb 19, 2022 10:14 pm
Posts: 147
drogon wrote:
In essence, the 6502/816 writes data into RAM then executes a WAI instruction. The ATmega meanwhile has been polling for the RDY line to go low and when it does, it pulls BE low on the 65xx, then attaches itself to the RAM (un-tristates an 8-bit port for data and another for AD[0:7], AD[8:15] is pulled high), reads the data, does the action (floating point operation, serial IO, filing system, etc.) then writes the result back to the RAM, and reverses the procedure to transfer control back to the 65xx, ending by sending an NMI to wake up the 65xx which does nothing more than RTI.
Nice setup.

When I first looked into floating-point, I considered using the floating-point capability of the smart display I use with my builds. I think a similar setup as yours is possible with it, but I just communicate with it over a serial connection. That was too slow for floating-point so I went the software route.


PostPosted: Mon Aug 15, 2022 7:59 pm

Joined: Wed Feb 14, 2018 2:33 pm
Posts: 1488
Location: Scotland
tmr4 wrote:
drogon wrote:
In essence, the 6502/816 writes data into RAM then executes a WAI instruction. The ATmega meanwhile has been polling for the RDY line to go low and when it does, it pulls BE low on the 65xx, then attaches itself to the RAM (un-tristates an 8-bit port for data and another for AD[0:7], AD[8:15] is pulled high), reads the data, does the action (floating point operation, serial IO, filing system, etc.) then writes the result back to the RAM, and reverses the procedure to transfer control back to the 65xx, ending by sending an NMI to wake up the 65xx which does nothing more than RTI.
Nice setup.

When I first looked into floating-point, I considered using the floating-point capability of the smart display I use with my builds. I think a similar setup as yours is possible with it, but I just communicate with it over a serial connection. That was too slow for floating-point so I went the software route.


This is sort-of a side-effect of my original intention to be "ROMless". The shared RAM area is $FFxx. The ATmega initially writes a small bootloader there, then un-resets the 65xx, which boots, relocates the bootloader, then uses that area to get the rest of the operating system into RAM. Initially the ATmega was also doing video, but a revision later I dropped the video and started to use it for other stuff like serial and SD card access.

-Gordon

PostPosted: Fri Aug 19, 2022 2:01 am

Joined: Fri Apr 15, 2016 1:03 am
Posts: 140
Here are some more 65816 software-only Mandelbrot samples.

Using 128bit mantissa floating-point, runs in 108 sec at 10MHz.
Code:
mainh
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
...,'''''''''''''''''~~~========++:;<                    x    :=~
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.'''~~~~~~~=====+::[&         <<&                             /:=
.'~~~~~~~~=====+::;/?          oO                              :=
.'~~~~~~~====++/;[/o                                          [+=
.~~~~~~=++++::;/???X                                         #:+=
.==++:/::+:;;[[o                                             :+==
                                                           &[:+==
.==++:/::+:;;[[o                                             :+==
.~~~~~~=++++::;/???X                                         #:+=
.'~~~~~~~====++/;[/o                                          [+=
.'~~~~~~~~=====+::;/?          oO                              :=
.'''~~~~~~~=====+::[&         <<&                             /:=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
...,'''''''''''''''''~~~========++:;<                    x    :=~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,

1076 MCycles ok


Using 32bit mantissa floating-point, runs in 18 sec at 10MHz.
Code:
mainf
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
...,'''''''''''''''''~~~========++:;<                    x    :=~
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.'''~~~~~~~=====+::[&         <<&                             /:=
.'~~~~~~~~=====+::;/?          oO                              :=
.'~~~~~~~====++/;[/o                                          [+=
.~~~~~~=++++::;/???X                                         #:+=
.==++:/::+:;;[[o                                             :+==
/                                                          &[:+==
.==++:/::+:;;[[o                                             :+==
.~~~~~~=++++::;/???X                                         #:+=
.'~~~~~~~====++/;[/o                                          [+=
.'~~~~~~~~=====+::;/?          oO                              :=
.'''~~~~~~~=====+::[&         <<&                             /:=
.,''''~~~~~~~===++;<xXo<<X &<[[[/                             X:+
.,''''''~~~~~~===+:X[;:;; ;;::;;[                             o/=
..''''''''''~~~~==:/++++++++::::;/x                          [;:=
..,'''''''''''''~~~=========++++:< ##                      X<;:+=
..,,'''''''''''''''~~=========+++:;/<O                       ;+==
...,'''''''''''''''''~~~========++:;<                    x    :=~
...,,'''''''''''''''''~~~~~~=====+:;    &O              /[</&/:=~
....,,''''''''''''''''''~~~~~~~===+[<&x[[? <&x     o&//<;:::[[=~~
....,,,'''''''''''''''''''~~~~~~~~==+++:::;[/      X/;:+++++==~~'
.....,,,'''''''''''''''''''~~~~~~~~~==++++:;/X      [:++====~~~''
.....,,,,''''''''''''''''''''~~~~~~~~~===++:#      X/+====~~~~'''
......,,,,''''''''''''''''''''~~~~~~~~~====+:O/x  <;:+==~~~~~~'''
......,,,,,,''''''''''''''''''''~~~~~~~~~===+:;/?o[:+==~~~~~'''''
.......,,,,,,'''''''''''''''''''''~~~~~~~~===+:[ / [+~~~~~~''''''
........,,,,,,,'''''''''''''''''''''~~~~~~~==+: ;+++~~~~~~''''',,
.........,,,,,,,,'''''''''''''''''''''~~~~~~=+[&+==~~~~~''''',,,,
..........,,,,,,,,,'''''''''''''''''''''''~~~~===~~~~''''',,,,,,,
...........,,,,,,,,,,,''''''''''''''''''''''''''''''''',,,,,,,,,,
............,,,,,,,,,,,,,,'''''''''''''''''''''''''',,,,,,,,,,,,,

181 MCycles ok


Attachments:
F_MandF4_1.zip [4.04 KiB] (32bit console log)
F_MandH4_1.zip [7.15 KiB] (128bit console log)