6502.org Forum  Projects  Code  Documents  Tools  Forum
It is currently Sat Nov 16, 2024 7:56 pm

All times are UTC




Post new topic Reply to topic  [ 12 posts ] 
Author Message
PostPosted: Sat Sep 07, 2019 9:52 am 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10981
Location: England
I thought this might be an interesting programming challenge, as it needs multi-byte precision, a little more precision I think than the usual floating point formats seen on 6502.

Quote:
In 1977, Isaac Asimov asked how many terms of the slowly converging series
    π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – …
would you have to sum before doing better than the approximation
    π ≈ 355/113.

The answer is known, as of a couple of years later, and you could easily look it up. In fact it's here, where I found the question. But how much more interesting to embark on a calculation!

The quoted article goes on to say:
Quote:
If he had access to an Apple II at the time, it would have run at 1 MHz. My calculation took around a minute to run on a 2 GHz laptop, so I’m guessing the same calculation would have taken a day or two on an Apple II. This assumes he could find extended precision software on an Apple II, which is doubtful.


So, a day or two of calculation! A good test, perhaps, of an arithmetic library. Although not the smartest way to answer Asimov's question.

I know about Karatsuba's speedup for multi-digit multiplication, and I know about some table-driven approaches for fast short multiplication on 6502. And over here I learn about Goldschmidt's algorithms which can be pressed into service for division. All the pieces are in place! (I'm tempted to think that fixed-point would be enough here: although the terms necessarily become ever smaller, the sum is going to be pi-sized.)

A much quicker question to answer, I imagine, would be how many terms are needed to improve on 22/7.


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 07, 2019 5:47 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Well, he says it took a minute of calculation on a PC - but that's because he was using Python. A quick try in C (using 80-bit 'long double' format) took less than a tenth of a second to find the correct answer on one of my less powerful machines. Switching to 64-bit 'double' format perturbed the calculation enough to result in a slightly *smaller* number of terms being needed - though only by two - and essentially no difference in speed.

We can infer from this that using fixed-point calculations with 64-bit precision should be sufficient to replicate the feat, and that the time taken might not be too excessive if coded well. Incidentally, I think I'm not giving away too much by pointing out that the both the number of terms, and the divisors derived from the term counter, will comfortably fit in 3 bytes (and you can use that as a safety valve to terminate the calculation in the event of a bug causing improper convergence).

Now, most versions of BASIC can handle a 24-bit integer without any trouble, but getting them to support 64-bit precision could be tricky. I do note, however, that there exists a quadruple-precision float library for the 65816 - and a IIgs is still technically an Apple II!


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 07, 2019 6:58 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Some hexadecimal constants may help in constructing an accurate solution:
Code:
     Pi as a 4.60 bit fixed point value: 3.243 F6A8 885A 308D
355/113 as a 4.60 bit fixed point value: 3.243 F6F0 243F 6F02

     Pi as a 2.62 bit fixed point value: C90F DAA2 2168 C235
355/113 as a 2.62 bit fixed point value: C90F DBC0 90FD BC09
With 2.62 format, the zeroth term 4.0 cannot be directly represented. Instead, just take an implicit borrow from the left end when subtracting the first term. All subsequent terms are smaller, so overflow will not subsequently occur. An advantage here is that it's easy to treat the division within each term as a straight reciprocal.

I may attempt this using ordinary trial-subtraction division, and leave the fancy algorithms to others.


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 07, 2019 7:41 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Even better, if we recursively apply Euler's Transform to the series - observing that it's exactly the sort of slowly-converging alternating series that it applies to - then we only actually need to calculate 17 terms after the initial 4.0, plus about 300 averages of consecutive partial results. This can be done within 256 bytes of RAM, and probably within a second even on an NMOS 6502.

ETA: on a 3.5MHz 65C02 emulation, BBC BASIC manages to verify this method within a couple of seconds. In fact, that's how long it takes to produce 32 terms, by which time the approximation is better than the precision of the floats used.


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 08, 2019 3:01 am 
Offline

Joined: Mon Sep 17, 2018 2:39 am
Posts: 138
Hi!

BigEd wrote:
I thought this might be an interesting programming challenge, as it needs multi-byte precision, a little more precision I think than the usual floating point formats seen on 6502.

Quote:
In 1977, Isaac Asimov asked how many terms of the slowly converging series
    π = 4 – 4/3 + 4/5 – 4/7 + 4/9 – …
would you have to sum before doing better than the approximation
    π ≈ 355/113.

The answer is known, as of a couple of years later, and you could easily look it up. In fact it's here, where I found the question. But how much more interesting to embark on a calculation!

The quoted article goes on to say:
Quote:
If he had access to an Apple II at the time, it would have run at 1 MHz. My calculation took around a minute to run on a 2 GHz laptop, so I’m guessing the same calculation would have taken a day or two on an Apple II. This assumes he could find extended precision software on an Apple II, which is doubtful.


So, a day or two of calculation! A good test, perhaps, of an arithmetic library. Although not the smartest way to answer Asimov's question.

I know about Karatsuba's speedup for multi-digit multiplication, and I know about some table-driven approaches for fast short multiplication on 6502. And over here I learn about Goldschmidt's algorithms which can be pressed into service for division. All the pieces are in place! (I'm tempted to think that fixed-point would be enough here: although the terms necessarily become ever smaller, the sum is going to be pi-sized.)

A much quicker question to answer, I imagine, would be how many terms are needed to improve on 22/7.


This is a good challenge.

First, there is an obvious way to calculate the required number of terms: as the series converges alternating signs, the error is always less than half the next term in the series. This means that the sum should be better when 1/2 * 4/(2*N-1) < 315/113 - PI, so 2*N-1 > 2/(315/113 -PI) = 7 497 258.185..., so N > 3 748 629.59... and the smallest N = 3 748 630.

Now, the slow answer, just calculate the sum until the result is closer to PI than 315/113. This code runs in an Atari 8-bit computer (so, available in 1979), it is a quick and dirty job using base 10 arithmetic to show the sum in real-time (as the numbers are stored in the screen).

Using 20 digit numbers, it finishes in 5 hours with 12 minutes, 28722673559 CPU cycles. The numbers printed are the current sum/4 , the last added/subtracted term, the value of "2*N-1" and the time in frames, NTSC Ataris have 59.92271 frames per second.

Attachment:
Screenshot from 2019-09-07 22-58-22.png
Screenshot from 2019-09-07 22-58-22.png [ 5.22 KiB | Viewed 1785 times ]


And the code:

Code:
        org     $2000

.zpvar  num1, num2,ptr   .word
.zpvar  count,tmp1,tmp2  :4 .byte

digits = 20

        ; Shorten screen - makes faster updates
        mwa     560 ptr

        lda     20
w1      cmp     20
        beq     w1

        ldy     #10
        lda     561
        sta     (ptr),y
        dey
        lda     560
        sta     (ptr),y
        dey
        lda     #65
        sta     (ptr),y

        ; Setup pointers
        mwa     88  num1
        adw     88  #40 num2
        lda     #1
        sta     count
        lda     #0
        sta     count+1
        sta     count+2
        sta     count+3
        sta     18
        sta     19
        sta     20

        jsr     set1_n1

do_calc
        ; Increment count
        jsr     inc2
        jsr     cinv_n2
        jsr     num_sub
;        wait

        jsr     compare_low
        bcc     ready

        jsr     inc2
        jsr     cinv_n2
        jsr     num_add
;        wait

        jsr     compare_hi
        bcc     ready

        ; Show N after a few iterations
        lda     count
        and     #$3E
        bne     do_calc

        jsr     show_count
;        wait

        jmp     do_calc

ready
        jsr     show_count

        ; Loop forever
        jmp     *


; Low threshold number: 1/4 * (2*PI - 355/113)
low_thr:
.byte   $10,$17,$18,$15,$13,$19,$18,$10,$19,$16 ; 0.785398096...
.byte   $17,$10,$16,$14,$10,$11,$10,$14,$14,$10
.byte   $11,$10,$10,$18,$12,$17,$15,$13,$15,$18
.byte   $16,$16,$15,$14,$10,$19,$16,$19,$16,$15

; High threshold number: 1/4 * (PI - 355//113)
hi_thr:
.byte   $10,$17,$18,$15,$13,$19,$18,$12,$13,$10 ; 0.785398230...
.byte   $10,$18,$18,$14,$19,$15,$15,$17,$15,$12
.byte   $12,$11,$12,$13,$18,$19,$13,$18,$10,$15
.byte   $13,$10,$19,$17,$13,$14,$15,$11,$13,$12

; Check if num1 is > low
.proc   compare_low
        ldy     #0

nxt
        lda     low_thr, y
        cmp     (num1), y
        bcc     ret
        bne     ret
        iny
        cpy     #digits
        bne     nxt
ret     rts
.endp

; Check if num1 is < high
.proc   compare_hi
        ldy     #0

nxt
        lda     (num1), y
        cmp     hi_thr, y
        bcc     ret
        bne     ret
        iny
        cpy     #digits
        bne     nxt
ret     rts
.endp

.proc   show_count
        mva     count   tmp1
        mva     count+1 tmp1+1
        mva     count+2 tmp1+2
        mva     count+3 tmp1+3
        lda     #80
        sta     ptr
        jsr     show_number

again
        ldx     20
        stx     tmp1
        mva     19      tmp1+1
        mva     18      tmp1+2
        cpx     20
        bne     again
        mva     #0      tmp1+3

        lda     #100
        sta     ptr
        jmp     show_number
.endp


        ; Prints number in "tmp1"
.proc show_number
        ldx     #0
        stx     tmp2
        stx     tmp2+1
        stx     tmp2+2
        stx     tmp2+3

        ldy     #32
        sed
tobcd
        ; Multiplies tmp2 by 2, in BCD
        clc
        lda     tmp2
        adc     tmp2
        sta     tmp2
        lda     tmp2+1
        adc     tmp2+1
        sta     tmp2+1
        lda     tmp2+2
        adc     tmp2+2
        sta     tmp2+2
        lda     tmp2+3
        adc     tmp2+3
        sta     tmp2+3

        ; Extract one bit
        asl     tmp1
        rol     tmp1+1
        rol     tmp1+2
        rol     tmp1+3

        ; Adds 1 to tmp2, in BCD
        txa
        adc     tmp2
        sta     tmp2
        txa
        adc     tmp2+1
        sta     tmp2+1
        txa
        adc     tmp2+2
        sta     tmp2+2
        txa
        adc     tmp2+3
        sta     tmp2+3


        dey
        bne     tobcd

        cld
        ; Prints BCD
        ldy     ptr
        lda     tmp2+3
        jsr     print2
        lda     tmp2+2
        jsr     print2
        lda     tmp2+1
        jsr     print2
        lda     tmp2

print2
        tax
        lsr
        lsr
        lsr
        lsr
        ora     #$10
        sta     (88),y
        iny
        txa
        and     #$0F
        ora     #$10
        sta     (88),y
        iny
        rts
.endp

.proc inc2
        inc     count
        bne     nxt
        inc     count+1
        bne     nxt
        inc     count+2
        bne     nxt
        inc     count+3
nxt
        inc     count
        bne     ret
        inc     count+1
        bne     ret
        inc     count+2
        bne     ret
        inc     count+3
ret
        rts
.endp

        ; Adds num1 = num1 + num2
.proc num_add
        ldy     #digits-1
        clc
        sed

loop:
        lda     (num1), y
        eor     #$90
        adc     (num2), y
        and     #$0F
        ora     #$10
        sta     (num1), y
        dey
        bpl     loop

        cld
        rts
.endp

        ; Subtract num1 = num1 - num2
.proc num_sub
        ldy     #digits-1
        sec
        sed

loop:
        lda     (num1), y
        sbc     (num2), y
        and     #$0F
        ora     #$10
        sta     (num1), y
        dey
        bpl     loop

        cld
        rts
.endp

        ; Calculate num2 = 1/N
.proc cinv_n2

        ; Init REM to 0
        lda     #1
        sta     tmp1
        lda     #0
        sta     tmp1+1
        sta     tmp1+2
        sta     tmp1+3

        ; Start loop
        ldy     #0
        lda     #16
        sta     (num2),y
        iny

loop_dig

        ; Multiply remainder by 10 and add new digit = 0
        jsr     mul10

        ; Divide by N
        ldx     #0

loop_div
        inx
        lda     tmp1
        sec
        sbc     count
        sta     tmp1
        lda     tmp1+1
        sbc     count+1
        sta     tmp1+1
        lda     tmp1+2
        sbc     count+2
        sta     tmp1+2
        lda     tmp1+3
        sbc     count+3
        sta     tmp1+3
        bpl     loop_div

        ; Restore last value
        lda     tmp1
        clc
        adc     count
        sta     tmp1
        lda     tmp1+1
        adc     count+1
        sta     tmp1+1
        lda     tmp1+2
        adc     count+2
        sta     tmp1+2
        lda     tmp1+3
        adc     count+3
        sta     tmp1+3
        dex

        txa
        ora     #16
        sta     (num2),y

        iny
        cpy     #digits
        bne     loop_dig
        rts
.endp

.proc mul10
        mva     tmp1+1  tmp2+1
        mva     tmp1+2  tmp2+2
        mva     tmp1+3  tmp2+3
        lda     tmp1
        sta     tmp2

        asl             ; *2
        rol     tmp1+1
        rol     tmp1+2
        rol     tmp1+3

        asl             ; *4
        rol     tmp1+1
        rol     tmp1+2
        rol     tmp1+3

        adc     tmp2    ; *5
        sta     tmp1
        lda     tmp1+1
        adc     tmp2+1
        sta     tmp1+1
        lda     tmp1+2
        adc     tmp2+2
        sta     tmp1+2
        lda     tmp1+3
        adc     tmp2+3

        asl     tmp1    ; *10
        rol     tmp1+1
        rol     tmp1+2
        rol
        sta     tmp1+3
        rts
.endp

.proc set1_n1
        ; Set NUM1 = 1
        ldy     #digits-1
        lda     #16
set0
        sta     (num1),y
        dey
        bne     set0
        lda     #17
        sta     (num1),y
        rts
.endp


Have Fun!


Top
 Profile  
Reply with quote  
PostPosted: Sun Sep 08, 2019 8:36 am 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
So, we have a decimal digit based implementation which seems to work, but takes a while. One assumes that a binary version could be faster.

Core code for such a version is written over here. I've used 65SC02 instructions (STZ and BRA), but a version for NMOS should have very little difference in performance from replacing those. There are no terribly clever coding tricks, just straightforward trial-subtraction division, add, subtract, and compare routines. I suspect that the Newton-Raphson and Goldschmidt algorithms, relying on a fast multiplier as they do, would not be faster on the 6502 without special hardware measures being taken. In particular, until about 1984 RAM was too expensive to waste on a 256x256x16-bit lookup table, and I think most micros didn't make it very easy to bolt such a thing on.

Still need to write a harness to run it in and produce readable output, and of course the inevitable round of debugging. Too tired now to do either, so they'll have to wait.


Top
 Profile  
Reply with quote  
PostPosted: Mon Sep 09, 2019 6:45 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10981
Location: England
Chromatix wrote:
A quick try in C (using 80-bit 'long double' format) took less than a tenth of a second to find the correct answer on one of my less powerful machines. Switching to 64-bit 'double' format perturbed the calculation enough to result in a slightly *smaller* number of terms being needed - though only by two - and essentially no difference in speed.

Hmm. If 64 bit and 80 bit formats produce different answers, I would take that to mean that we need at least 80 bits and perhaps more. (If 64 bit happens to produce a known-good answer anyway, that would presumably be a coincidence.) [The mantissa being 53 vs 64 bits, or thereabouts, in the two formats.]


Top
 Profile  
Reply with quote  
PostPosted: Mon Sep 09, 2019 6:47 pm 
Offline
User avatar

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10981
Location: England
dmsc wrote:
This is a good challenge.

Thanks - and thanks for your code!


Top
 Profile  
Reply with quote  
PostPosted: Mon Sep 09, 2019 8:07 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
I mentioned earlier that the correct result fitting neatly in a 24-bit integer could be used as a safety valve for debugging. Here's how I implemented that safety valve on an emulated BBC Master:
Code:
MainLoop:
 INC termCount+0
 BNE Divide
 INC termCount+1
 BNE Divide
 INC termCount+2
 BNE Divide
 BRK ; trap with Accuracy Lost error
 .byte 23, "termCount wrapped, suspect bug!", 0

Divide:
 ...
This actually proved useful this afternoon, as there was in fact a bug which triggered this safety valve - after several hours. Since it's hard to predict how long this sort of program *should* take to run, it would have been hard to decide when to halt it otherwise.


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 14, 2019 1:11 pm 
Offline

Joined: Mon May 21, 2018 8:09 pm
Posts: 1462
Still puzzling over what's going wrong with my code. However I have been able to calculate how quickly it *should* find the answer: about 12,000 megacycles, or 3h20m at 1MHz. Even if my fix slows it down a little, that shows it's entirely feasible to verify the answer on an Apple ][ with some fairly ordinary fixed-point arithmetic.


Top
 Profile  
Reply with quote  
PostPosted: Sat Sep 21, 2019 8:41 am 
Offline

Joined: Fri Apr 15, 2016 1:03 am
Posts: 139
Here is another attempt. Runs in about 11820 megacycles.
Developed in FORTH on a 65816 in native mode. The calculation routines are 8-bit internally to simulate a 6502.


Attachments:
File comment: Simulator console log
F_AsimovPiQB_6.txt [23.2 KiB]
Downloaded 127 times
Top
 Profile  
Reply with quote  
PostPosted: Sat Feb 15, 2020 11:46 am 
Offline
User avatar

Joined: Tue Jul 12, 2005 6:29 am
Posts: 16
leepivonka wrote:
Here is another attempt. Runs in about 11820 megacycles.
Developed in FORTH on a 65816 in native mode. The calculation routines are 8-bit internally to simulate a 6502.


Very nice. But not quite 6502 from 1977. Looks more like 65c02 with PLX etc. I'm sure it could easily be fixed up for 1977 and I like your code.


Top
 Profile  
Reply with quote  
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 12 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 3 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to: