Lots of macro assembly and self-modifying code in this project, also lots of optimisations, large and small. It's been a blast.
For some months now I've been thinking about, then eventually collaborating on, a pi program. Dave/hoglet is my team-mate on this, and has (as ever) done most of the actual coding. It's been working, and delivering great results, for a while now, although we can't yet say it's finished.
FAQ:
Do we need another pi program? Yes, of course.
Is this one interesting or different? Yes, indeed.
Is there a silent video with minimal information? Yes, look
here.
Is this open source? Of course, repo
here.
What does it look like? Like this:
Attachment:
File comment: screenshot, Mode 7 on a Beeb, 800 digits of pi, in just under 54 seconds
Bellard-Beeb-800-digits-54-seconds.png [ 41.99 KiB | Viewed 904 times ]
Some full writeups are being posted in this thread over on Stardot:
A tale of two spigots - more digits of pi, and fasterYou'll find tabulations, code snippets, a chronology, links, images and even videos over there.
A few highlights, then, in case you haven't already clicked on that link:
- it's a pi spigot, which means it starts producing digits at once - you don't have to wait for the calculation to finish before seeing anything
- written in 6502 assembly, using BeebAsm, with lots of macros for loop unrolling and elaborating variations of code
- Edit to add: we also used Owlet, for early Basic experiments, Beebjit, for accelerated accurate emulation, b2 similarly, and hoglet's 6502Decoder for debugging and profiling. For later and longer runs, we used PiTubeDirect as a very fast 6502 second processor
- almost all the time is in long division, of a bignum by a modest multi-byte divisor, and accumulation of the quotient into a sum
- we have three or four division routines, for different lengths of divisor, and each is patched in place (self-modifying code) for the specific divisor
- the accumulation of terms by addition or alternatively by subtraction is handled by patching in place
- we're using a spigotised version of the 4-term BBP formula (for slightly greater capacity) or the 7-term Bellard formula (for much better speed.)
- the code is written for the BBC Micro, with or without 6502 second processor, is around 8k in size, but need do nothing more OS-specific than print a character (and read the system time.)
- all available RAM can be used for the pi calculation - we can get a little better than 2 digits of pi for each free byte of RAM
- we have a visualisation mode which places the pi calculation workspace in screen memory so you can see what's going on. (Example video
here)
- on a BBC Micro we can readily fit a 50000 digit calculation, with a second processor we can push it to 135000. That's using nearly 55k of RAM for the bignums
- on a 2MHz 6502 we can compute 100 digits in under a second, 1000 digits in under a minute and a half, 3000 digits in under a quarter of an hour (like almost all pi calculations, it's quadratic in the number of digits)
Previous threads relating to pi calculations:
the fast realization of π-spigot algorithmCalculating Pi: HP-41C versus Apple ][A program for todayApple Pi: BASIC vs. assemblycomputing pi with square roots / RPN calculator buildIt's worth noting that a spigot algorithm, which generates successive digits in a very user-friendly way, is not the fastest way to compute pi, on a 6502 or elsewhere. But among spigots, ours seems to be very fast. Also, I believe our program uses less RAM than previous spigots, and less RAM than other means of calculating pi. So our program can compute more digits on any given machine than previous methods - presuming we're only using RAM, and not storage.
Edit: here's a glimpse of some of the code - it's patched in place, so you see immediates which are not in fact constants. This code runs very many times, and versions of this code account for almost all the runtime. This is the listing as output by the assembler, so you don't see the comments or the symbolic names for things:
Code:
.bit_loop_next
269F A2 00 LDX #&00
26A1 E4 67 CPX &67
26A3 90 13 BCC &26B8
26A5 D0 27 BNE &26CE
.skip_br
26A7 A2 00 LDX #&00
26A9 E4 66 CPX &66
26AB 90 0B BCC &26B8
26AD D0 1F BNE &26CE
.skip_br
26AF A2 00 LDX #&00
26B1 E4 65 CPX &65
26B3 90 03 BCC &26B8
26B5 D0 17 BNE &26CE
.skip_br
26B7 18 CLC
.do_subtract
26B8 AA TAX
26B9 A5 65 LDA &65
26BB E9 00 SBC #&00
26BD 85 65 STA &65
26BF A5 66 LDA &66
26C1 E9 00 SBC #&00
26C3 85 66 STA &66
26C5 A5 67 LDA &67
26C7 E9 00 SBC #&00
26C9 85 67 STA &67
26CB 8A TXA
26CC 09 20 ORA #&20
Edit: here's some corresponding source,
from the repoCode:
.modify
; Unroll the bit loop
; FOR J%=0 TO 7
FOR j,0,7
; IF T%>=D%
FOR i,bytes-1,0,-1
LDX #&00 ; operand is modified dynamically
CPX temp+i ; X is used so as not to corrupt A
; Shortcut the first BCC/BNE (optimization suggested by profiling)
IF i=bytes-1 AND j>=COMP_OPT_THRESHOLD
BEQ skip_br
ENDIF
BCC do_subtract
BNE bit_loop_next
.skip_br
NEXT
CLC
; T%=T%-D%:B%=B%+1
.do_subtract
TAX ; save A
FOR i,0,bytes-1
LDA temp+i
SBC #&00 ; operand is modified dynamically
STA temp+i
NEXT
TXA ; restore A
ORA #(&80 >> j) ; dynmically modified (e.g. ORA #&40 for add, AND #&BF for subtract)
; NEXT J%
.bit_loop_next
NEXT