Calculating Pi: HP-41C versus Apple ][

BigEd · Post by **BigEd** » Fri Jul 27, 2012 3:49 pm

Over at hpmuseum.org, Gerson W. Barbosa pointed out this article which revisits a speed contest between a calculator and one of our well-known 6502-based computers. The computer is hampered in this race by running BASIC, so the author investigates Forth and C, and mentions assembly too:
Pi Day Rematch: Apple II vs. HP-41C

(The assembly reference is to a program by Glen Bredon, bundled with the Merlin assembler. I haven't yet tracked down this program.)

Edit: I found the program - see attached. Will remove if there's any copyright complaint. Used the 'fid' utility to unpack a *.dsk image found on macgui.com

GARTHWILSON · Post by **GARTHWILSON** » Fri Jul 27, 2012 7:08 pm

There are several interesting things there. First, the BASIC was that slow, even compared to a 41 in user language. Second, going to MCode in the 41 didn't even double the speed, which I suppose is because so much of the time was already taken in multiply and divide routines. Third, in spite of Apple Forth not being a very good Forth, it was still faster than the fastest C (cc65 ANSI C 2.13) [*] and 4.5 times as fast as the slowest C (Aztec K&R C 3.2b). Fourth, in the comments Richard Nelson points out that the 41CL is now online, which is 50x as fast as the original 41 for operations not involving I/O or access to plug-in modules (as opposed to ROM images, of which the CL has over 300 integrated), but a 20MHz 65816 would also be about 50 times as fast as the 1MHz 6502.

From a post of mine in 2003:

Quote:

I was looking at a programming magazine from 1982 and came across an ad for MicroSpeed for the Apple II. At the top it says in big letters, "TEST-FLY A $20 MILLION JET ON AN APPLE? YES. WITH MICROSPEED." It starts by saying,

(Quote:) At the Bethesda Naval Research Center, they've discovered the power of MicroSPEED. The Navy's engineers use this remarkable hardware/software combination to "fly" an advanced fighter aircraft in real time-- even making vertical landings on a simulated carrier deck. A "crash" is merely another learning experience, and an opportunity to modify the research aircraft-- inside the Apple-- to improve tomorrow's combat planes. Surprised that such a sophisticated task is possible on the Apple? So were the Navy's officials, and many others who have discovered THE MICROSPEED DIFFERENCE <snip> ...and incredible Forth extensibility <snip>

As I understand it, this was a 4MHz Forth-hardware plug-in board for the Apple II, and the MicroSpeed software for it came on a single 5.25" floppy.

I'll take all the speed I can get, but there are not really any thresholds for what generally constitutes "enough" without specifying the application—it's just that more speed opens up more possibilities in certain areas, without having much value in others. I still use my HP-41cx every day, as its design gives it certain functional advantages over the alternatives in spite of its slowness. High speed is not needed for everything. I formed the large 16-bit math tables with my HP-71 which is many times as fast as the 41, and it still took many hours per table—but I have to sleep and do other things anyway, so there's no real problem in letting it work while I do something else. The slowness of the 41 OTOH would have been nearly prohibitive!

[*] Edit, 2/14/21: I just came across this page about benchmarking the various C compilers for the 6502. CC65 produced much slower, more bloated code than the other C options, although it was more solid.

datajerk · Post by **datajerk** » Mon Sep 10, 2012 6:52 pm

Hello BigEd,

I'm the author of that article. One thing I didn't get time to do was write my own 6502 assembly Pi to 1000 digits. Well, I finally got around to it. On the Apple II it takes 1:52; Glen's took 3:14 (interesting number :-). On the Apple I (960 kHz) it takes 2:15. I had to reduce the amount of unrolling and dump all tables to fit within the Apple-1's 4K limit (code, data, monitor, and stack). The Apple II version fits in well under the 8K mark (.org $800).

Glen's version used base 10, whereas I used base 256. That should cut the time in half; however, converting to base 10 for display takes as many multiplies as divides, so that time is lost.
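To make the conversion cost concrete, here is a minimal Python sketch of the usual way to print a base-256 fraction in decimal: multiply the whole array by 10 and take the overflow out of the top byte as the next digit. The thread does not show the actual conversion routine, so the function name `fraction_to_decimal` and the byte layout (most significant byte first) are assumptions for illustration only.

```python
def fraction_to_decimal(frac_bytes, ndigits):
    """Convert a base-256 fraction (most significant byte first) to decimal digits."""
    work = list(frac_bytes)          # working copy; it is consumed by the conversion
    digits = []
    for _ in range(ndigits):
        carry = 0
        # Multiply the whole array by 10, starting at the least significant byte.
        for i in range(len(work) - 1, -1, -1):
            prod = work[i] * 10 + carry
            work[i] = prod & 0xFF    # low 8 bits stay in place
            carry = prod >> 8        # high bits carry into the next byte up
        digits.append(str(carry))    # overflow out of the top byte is the next digit
    return "".join(digits)


if __name__ == "__main__":
    # 0x80 / 256 = 0.5
    print(fraction_to_decimal([0x80], 3))
```

Each decimal digit costs one full pass over the array, which is why the display step can eat much of what the binary arithmetic saved.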

What is most interesting is that I created a 6800 version last weekend, using the Apple-1 version as a guide. And it runs in 2:12 @ 960 KHz. That was unexpected. The dual accumulators really helped optimize the div/mult code.

Cheers,

Egan

BigEd · Post by **BigEd** » Mon Sep 10, 2012 7:00 pm

Hi Egan
welcome!
Interesting point about the 6800. I started a thread the other day to compare the two micros, but the best comparison would always be to code a solution to the same problem on both, using the best techniques in each case - which is a tall order! Of course, different types of problem could well produce widely varying results, as the respective strengths and weaknesses play out.
What do you think is behind Glen's time of 3:14? That he stopped optimising at that point? A complete coincidence??
(Does it make any sense to wonder about base 100? Or is that in effect what Glen does, maybe using BCD mode?)
Cheers
Ed

GARTHWILSON · Post by **GARTHWILSON** » Mon Sep 10, 2012 7:48 pm

I would remind the court that the 6502 ran Ragsdale's fig-Forth 25% faster than the 6800 at a given clock speed, and of course the 6502 has reached clock speeds an order of magnitude higher in off-the-shelf µP's, and two orders of magnitude higher as a core in custom ICs.

fachat · Post by **fachat** » Mon Sep 10, 2012 7:50 pm

One thing I find interesting from the article is that the energy consumed by the apple II is about twice as much as the calculator's energy, but the speed up is much much more than a factor of two!

While the apple II only has about half (0.47) Operations per Watt-second (or Flops per Watt) as the calculator, the speedup is still 155!
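The two figures quoted above (0.47 and 155) are enough to back out the implied energy and power ratios. The following Python sketch does only that arithmetic; it assumes both machines perform the same amount of work, and the function names are illustrative rather than anything from the article.

```python
def energy_ratio(efficiency_ratio):
    """Energy used by the Apple II relative to the calculator for the same work."""
    # Operations per watt-second is work / energy, so the energy ratio is its inverse.
    return 1.0 / efficiency_ratio


def power_ratio(speedup, efficiency_ratio):
    """Average power drawn by the Apple II relative to the calculator."""
    # energy = power * time, so power ratio = energy ratio * speedup.
    return energy_ratio(efficiency_ratio) * speedup


if __name__ == "__main__":
    print(round(energy_ratio(0.47), 2))      # about twice the energy
    print(round(power_ratio(155, 0.47)))     # implied power ratio
```

Under that assumption the Apple II draws a few hundred times the power but finishes 155 times sooner, which is how the energy ends up only about double.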

datajerk · Post by **datajerk** » Mon Sep 10, 2012 8:44 pm

BigEd wrote:

I started a thread the other day to compare the two micros, but the best comparison would always be to code a solution to the same problem on both, using the best techniques in each case - which is a tall order!

That is exactly what I tried to do. And it took a lot of time. The 6800 is stupid simple. The 6502, with its 8-bit indexes, a relatively vast array of addressing modes, and page crossing penalties, took a lot longer than expected. I still think I could get a bit more out of the 6502. Not so sure about the 6800.

To maximize the 6800 performance I had to use both accumulators as much as possible. The changeable stack pointer comes in handy for loading and storing 16-bit pointers. Doing block moves without "indirect indexed" (i.e. (foo),y) is a real pain with the 6800. However, I was able to cheat by having one pointer use reg x and the other use the stack pointer. Then a simple block move looks like:

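pula        ;pull next source byte via SP (auto increments!)
staa ,x     ;store it at the destination pointed to by X
inx         ;advance the destination pointer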

Every example 6800 program I looked at had reg x being reassigned for each block, back and forth. That would kill performance. Esp. with long loops.

BigEd wrote:

Of course, different types of problem could well produce widely varying results, as the respective strengths and weaknesses play out.

Yep. Here is the profile of the 6502 code:

The 6800 and just about any other 8-bit processor without a hard div or mult would look the same. ROL dominates and many of the top 15 are used for division routines. The 6800 with its dual accumulators helped with optimization. (Notice the lack of BNE? Lots of unrolling :-).
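For readers wondering why a rotate instruction dominates the profile, here is a minimal Python sketch of shift-and-subtract (restoring) division, the standard technique on CPUs without a divide instruction. It is not taken from the attached source; the function name and the 16-bit width are assumptions for illustration. Each loop iteration corresponds to one rotate through the quotient and remainder registers.

```python
def shift_subtract_divide(dividend, divisor, bits=16):
    """Restoring division: returns (quotient, remainder) for unsigned operands."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    mask = (1 << bits) - 1
    quotient = dividend & mask   # dividend is shifted out as quotient bits shift in
    remainder = 0
    for _ in range(bits):
        # Equivalent of ROL across the register pair: the top bit of the
        # dividend/quotient register moves into the bottom of the remainder.
        top = (quotient >> (bits - 1)) & 1
        quotient = (quotient << 1) & mask
        remainder = (remainder << 1) | top
        # Trial subtraction: keep it only if the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


if __name__ == "__main__":
    print(shift_subtract_divide(1000, 7))
```

With one rotate per register byte per bit, a 16-bit divide already needs dozens of rotates, and a long-array division repeats that for every element.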

BigEd wrote:

What do you think is behind Glen's time of 3:14? That he stopped optimising at that point? A complete coincidence??

Complete coincidence. Glen's code prompts for number of digits. I entered 1000 and timed it by recording the screen output. 3:14. Awesome.

BigEd wrote:

(Does it make any sense to wonder about base 100? Or is that in effect what Glen does, maybe using BCD mode?)

Go for it! I am not interested, since BCD math on any 8-bit processor is not as complete as binary math.

datajerk · Post by **datajerk** » Mon Sep 10, 2012 8:58 pm

fachat wrote:

One thing I find interesting from the article is that the energy consumed by the apple II is about twice as much as the calculator's energy, but the speed up is much much more than a factor of two!

While the apple II only has about half (0.47) Operations per Watt-second (or Flops per Watt) as the calculator, the speedup is still 155!

If I had to guess, it would be that battery operated devices are optimized for battery life before performance. I am fairly certain that Woz didn't care about the Apple II power consumption.

On CAPEX cost/performance, the Apple II wins hands down as well.

BigEd · Post by **BigEd** » Mon Sep 10, 2012 9:12 pm

Very interesting data and analysis, thanks! I'm glad you put in the effort and time and shared the results here.

I skimmed Glen's code, and it doesn't seem to use BCD. (By base 100, I meant to keep a single base-100 digit in each byte, not to use BCD. On reflection, even producing a carry for each single-byte digit's addition is going to need a trick, but there might be a saving in the final base conversion, or indeed BCD might be better. It would be interesting to know what proportion of time is taken in the final conversion.)
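As an illustration of the carry issue with one base-100 digit per byte, here is a hedged Python sketch of multi-digit addition. The explicit compare-and-adjust stands in for whatever trick an 8-bit implementation would use, since the hardware carry flag only fires at 256. The function name and the most-significant-first layout are assumptions, not anything from Glen's code.

```python
def add_base100(a, b):
    """Add two equal-length base-100 digit lists (most significant digit first).

    Returns (sum_digits, carry_out).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of digits")
    result = [0] * len(a)
    carry = 0
    for i in range(len(a) - 1, -1, -1):   # least significant digit first
        s = a[i] + b[i] + carry
        if s >= 100:                      # manual carry detection: a byte would not overflow here
            s -= 100
            carry = 1
        else:
            carry = 0
        result[i] = s
    return result, carry


if __name__ == "__main__":
    print(add_base100([12, 99], [0, 1]))
```

The payoff would come at display time, where each byte already holds two decimal digits and no array-wide conversion is needed.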

If you would post your code (as an attachment) that would be much appreciated, but I understand if you prefer not to.

Cheers
Ed

datajerk · Post by **datajerk** » Mon Sep 10, 2012 9:37 pm

Code attached. I used ca65 for the 6502 code and asl for the 6800 code. I have not completed commenting all the 6800 code. I wrote it very quickly over the weekend (mostly late last night).

Both codes were written for the Apple-1. The Apple-1 was designed to support the 6502 and the 6800; however, Woz picked the 6502 (reason: $25 vs. $175 cost, source: http://apple2history.org/history/ah02/). Last year, Eric Smith, as part of the Summer Retro Challenge, created the 6800 Monitor ROM for the hypothetical 6800-based Apple-1. Eric also created a set of MESS patches you can apply so that you can own a 6800-based Apple-1 today! :-)

The code below was created for both Apple-1 emulators (6502 and 6800)

apple1_6800_pi.zip: asl code, sorry still working on comments. (4.14 KiB)

apple1_6502_pi.zip: cc65/ca65 assembly source. (6.04 KiB)

I'll be tinkering with optimization over the next six months as I complete my 8080, z80, 6809, and 8088 versions. The rules are simple:

1. Must be able to rerun without reloading, i.e. no precomputed values (e.g. pre zero'd arrays).
2. No self-modifying code.
3. Optimize for target CPU, use all tricks.
4. Same O(N^2) algorithm for all.
5. Have a life. Do not spend hours to shave off sub seconds.
6. Assume and respect some form of native environment or OS (e.g. do not tape load to direct page of running monitor in Apple-1 to load tables).
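For context on the quadratic-cost algorithm in rule 4: the thread does not say which series the attached programs use, so the sketch below simply assumes Machin's arctangent formula, a common choice whose cost grows with the square of the digit count (roughly N terms, each needing a pass over an N-digit number). Python's built-in big integers stand in for the byte arrays an 8-bit version would manage by hand, and the names `arctan_inv` and `pi_digits` are illustrative.

```python
def arctan_inv(x, scale):
    """Return arctan(1/x) * scale using the alternating Taylor series (integer math)."""
    power = scale // x            # (1/x)^(2k+1), scaled; starts at k = 0
    total = power
    x_squared = x * x
    k = 1
    while power:
        power //= x_squared       # one long division by a small number per term
        term = power // (2 * k + 1)
        if k % 2:
            total -= term
        else:
            total += term
        k += 1
    return total


def pi_digits(n, guard=10):
    """Return pi as a digit string: '3' followed by n decimals (no decimal point)."""
    scale = 10 ** (n + guard)     # guard digits absorb truncation error
    # Machin: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 16 * arctan_inv(5, scale) - 4 * arctan_inv(239, scale)
    return str(pi_scaled // 10 ** guard)


if __name__ == "__main__":
    print(pi_digits(50))
```

Every pass of the loop is the kind of long division by a small integer that an 8-bit version would do byte by byte, which is where the shift and rotate work piles up.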

If any of you can find ways to significantly improve performance, then please pass on to me via email (datajerk@gmail.com).

Thanks.

datajerk · Post by **datajerk** » Mon Sep 10, 2012 9:42 pm

apple1_6502_pi.zip: 6502 source with macros. (7.61 KiB)

Oops, use this 6502 version (I forgot to include the macros).

BigEd · Post by **BigEd** » Mon Sep 10, 2012 9:50 pm

Excellent - thanks!
