Re: Comparisons and contrasts
Posted: Sun Nov 14, 2021 1:44 pm
by BillG
On further reflection, I am thinking that the upper bound check can be done with an unsigned test.
Concur?
Re: Comparisons and contrasts
Posted: Sun Nov 14, 2021 5:23 pm
by barrym95838
I might be completely wrong, but doesn't PASCAL allow signed indices?
If you're just comparing a 16-bit value to another, you can almost always replace SEC SBC with CMP for the low byte, signed or unsigned.
Re: Comparisons and contrasts
Posted: Mon Nov 15, 2021 11:17 am
by BillG
I might be completely wrong, but doesn't PASCAL allow signed indices?
Yes it does.
Consider this:
Code:
Offset := Subscript - LowerBound
if Offset < 0 { This must be a signed comparison, that is, -32768 - LowerBound must be reported as an error }
then
report OutOfBounds error
{ Offset is now an unsigned quantity }
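A minimal sketch of the idea above, in Python rather than Pascal or 6502 assembly: the lower bound is checked with a signed test on the difference, after which the offset is non-negative and the upper bound can be checked with a single unsigned comparison. The names `to_u16` and `checked_offset` are hypothetical, and Python's unbounded integers stand in for a signed compare that handles overflow correctly.

```python
def to_u16(value):
    """Reinterpret an integer as an unsigned 16-bit quantity."""
    return value & 0xFFFF


def checked_offset(subscript, lower, upper):
    """Return the zero-based offset, or None when out of bounds.

    The function and parameter names are hypothetical; they are not
    taken from any compiler discussed in the thread.
    """
    # Lower bound: signed test on the true difference. Python integers
    # do not wrap, so this models a signed compare that accounts for
    # 16-bit overflow.
    offset = subscript - lower
    if offset < 0:
        return None
    # Upper bound: the offset is non-negative now, so an unsigned
    # compare against the largest valid offset is enough.
    if to_u16(offset) > to_u16(upper - lower):
        return None
    return offset
```

For an array declared over the full signed range, the offset can reach 65535, which is why it has to be treated as unsigned from this point on.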
If you're just comparing a 16-bit value to another, you can almost always replace SEC SBC with CMP for the low byte, signed or unsigned.
There are two ways to compare 16-bit numbers:
Code:
subtract NumberInRange from Offset
if the difference < 0
then
report OutOfBounds error
and
Code:
compare upper byte of Offset with upper byte of NumberInRange
if <
then
proceed
else if >
then
report error
else
compare lower byte of Offset with lower byte of NumberInRange
if <
then
proceed
else
report error
For the lower bound, we need the difference anyway, so do the subtraction.
For the upper bound, I'll have to analyze the two methods.
Edit: Oh, I see what you are saying. Since the difference is not kept, do a compare of the low bytes instead of a subtraction...
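The variants discussed here can be compared side by side in a small model. This is only a sketch in Python, assuming the usual 6502 binary-mode behaviour (CMP sets carry when A >= M; SBC subtracts the memory operand plus the inverted carry and leaves carry set when no borrow occurs). All function names are made up for illustration.

```python
def cmp8(a, m):
    """Model of 6502 CMP: carry out is 1 when a >= m (unsigned)."""
    return 1 if a >= m else 0


def sbc8(a, m, carry):
    """Model of 6502 SBC in binary mode: returns (result, carry_out)."""
    diff = a - m - (1 - carry)
    return diff & 0xFF, 1 if diff >= 0 else 0


def less_u16_subtract(a, b):
    """SEC / SBC low bytes / SBC high bytes; carry clear means a < b."""
    _, carry = sbc8(a & 0xFF, b & 0xFF, 1)
    _, carry = sbc8(a >> 8, b >> 8, carry)
    return carry == 0


def less_u16_cmp_low(a, b):
    """CMP low bytes (no SEC needed), then SBC high bytes."""
    carry = cmp8(a & 0xFF, b & 0xFF)
    _, carry = sbc8(a >> 8, b >> 8, carry)
    return carry == 0


def less_u16_bytewise(a, b):
    """Compare high bytes first; look at low bytes only on a tie."""
    a_hi, b_hi = a >> 8, b >> 8
    if a_hi != b_hi:
        return a_hi < b_hi
    return (a & 0xFF) < (b & 0xFF)
```

All three agree on the unsigned "<" result. The CMP variant gives up the low byte of the difference, which matches the point that it only pays off when the difference is not kept.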
Re: Comparisons and contrasts
Posted: Mon Nov 15, 2021 10:48 pm
by barrym95838
Oh, I see what you are saying. Since the difference is not kept, do a compare of the low bytes instead of a subtraction...
That's what I meant (although in hindsight checking for "<=" or ">" or "=" or "<>" isn't as nice as checking for "<" or ">=" if you use CMP on the low byte, because you might need the low result later).
Re: Comparisons and contrasts
Posted: Thu Dec 23, 2021 9:09 am
by litwr
I have done some tests for the 6502, Z80, K1801VM1, 8088, and 68000, all running the same Mandelbrot calculations. Results are here. It seems that for intensive 16-bit calculations the 6502 shows rather mediocre results. However, getting well-optimized code for the Z80 is a very long and expensive process.
Re: Comparisons and contrasts
Posted: Thu Dec 23, 2021 10:22 am
by BillG
The linked page says:
Emulators were used to get these results: BK2010 v0.5 for the BK0010, GID v3.10 for the BK0011M, plus4emu v1.2.10 for the Commodore+4, ep128emu v2.0.11 for the Amstrad CPC, mess 0.229 for the BBC Master, FS-UAE 3.0.5 for the Amiga 500, pce-ibmpc version 20140222-4b05f0c for the IBM PC 5160 EGA. The emulators are quite accurate for timings, the only exception is the emulator for the IBM PC which appears to be about 25% faster than real hardware. So all the speed results for the IBM PC may be degraded by this 25% – the degraded ER result is shown in parentheses.
I am not familiar with pce-ibmpc. Did you measure the time using a stopwatch or does the emulator report the number of CPU cycles used?
The fastest code in an emulator as measured with a stopwatch is not likely to be the fastest on actual hardware. It is not unlike trying to optimize code for a BASIC compiler as opposed to for one of the many interpreters.
Code optimization on x86 processors is notoriously difficult. If you have ever read The Zen of Assembly Language by Michael Abrash, you will know what I am talking about.
For instance, your code does not make use of the STOSW instruction which is especially advantageous on the 8088/8086 processor but not so much on modern members of the family.
Re: Comparisons and contrasts
Posted: Thu Dec 23, 2021 1:17 pm
by BigEd
For me, it's a major difficulty with any kind of comparison using hand-coded assembly - you need to be able to marshal the same level of expertise with each micro. Not insurmountable, but otherwise you may leave unrealised improvements for some which distort the rankings.
Re: Comparisons and contrasts
Posted: Thu Dec 23, 2021 7:18 pm
by litwr
The linked page says:
I am not familiar with pce-ibmpc. Did you measure the time using a stopwatch or does the emulator report the number of CPU cycles used?
The fastest code in an emulator as measured with a stopwatch is not likely to be the fastest on actual hardware. It is not unlike trying to optimize code for a BASIC compiler as opposed to for one of the many interpreters.
Code optimization on x86 processors is notoriously difficult. If you have ever read The Zen of Assembly Language by Michael Abrash, you will know what I am talking about.
For instance, your code does not make use of the STOSW instruction which is especially advantageous on the 8088/8086 processor but not so much on modern members of the family.
Let me show you the original PDP-11 assembly for the main loop.
Code:
1$: mov sqr(r1), r3 ; r3 = y^2
add r0, r1 ; r1 = x+y
mov sqr(r0), r0 ; r0 = x^2
add r3, r0 ; r0 = x^2+y^2
cmp r0, r6 ; if r0 >= 4.0 then
bge 2$ ; overflow
mov sqr(r1), r1 ; r1 = (x+y)^2
sub r0, r1 ; r1 = (x+y)^2-x^2-y^2 = 2*x*y
add r5, r1 ; r1 = 2*x*y+b, updated y
sub r3, r0 ; r0 = x^2
sub r3, r0 ; r0 = x^2-y^2
add r4, r0 ; r0 = x^2-y^2+a, updated x
sob r2, 1$ ; to next iteration
It is just a straightforward implementation of the Mandelbrot fractal algorithm main loop.
Code:
while (x*x + y*y ≤ 2*2 AND iteration < max_iteration) do
xtemp := x*x - y*y + x0
y := 2*x*y + y0
x := xtemp
iteration := iteration + 1
So it was not impossible to write such small optimized code for several platforms: only these 13 lines need to be translated.
All the programs simply print their timings, as stated on the linked page with the results. Those emulators can run tricky code from games and demos, so they are quite accurate. The only exception is the IBM PC emulator, because timings were almost never a compatibility issue for this computer: there were too many variants of it, using different processors (the 8088, 8086, V20, V30, 80286, ...) at different clocks. So emulators for the IBM PC are usually faster than the original machines. IMHO the emulators just don't simulate the instruction queue delays. BTW, if anybody knows the best IBM PC emulator, please let me know.
It is possible to get just the machine cycle count for some processors (the Z80 and 6502), but that requires different tools, and IMHO it can't change the results. That approach also misses the fine resulting pictures.
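The register comments in the PDP-11 loop can be followed step by step in a short sketch. This is an illustration in Python, not the author's code: the fixed-point scale `SCALE`, the starting point x = y = 0, and the function names are assumptions, and `sqr()` computes the square directly where the assembly uses a lookup table.

```python
SCALE = 512  # hypothetical fixed-point scale: 1.0 is represented as 512


def sqr(v):
    """Fixed-point square; stands in for the sqr() table lookup."""
    return (v * v) // SCALE


def escape_count(a, b, max_iter):
    """Iterations before x^2 + y^2 reaches 4.0, or max_iter if it never does."""
    x = y = 0  # assumed starting point, as in the pseudocode
    four = 4 * SCALE
    for i in range(max_iter):
        y2 = sqr(y)               # r3 = y^2
        s = x + y                 # r1 = x+y
        x2y2 = sqr(x) + y2        # r0 = x^2+y^2
        if x2y2 >= four:          # cmp r0, r6 / bge 2$
            return i
        y = sqr(s) - x2y2 + b     # (x+y)^2-x^2-y^2 = 2*x*y, then + b
        x = x2y2 - y2 - y2 + a    # x^2-y^2, then + a
    return max_iter
```

The identity (x+y)^2 - x^2 - y^2 = 2*x*y is what lets the loop get by with three squares and no general multiplication.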
For me, it's a major difficulty with any kind of comparison using hand-coded assembly - you need to be able to marshal the same level of expertise with each micro. Not insurmountable, but otherwise you may leave unrealised improvements for some which distort the rankings.
I can only claim that my code is well optimized; I can't claim it is perfect. The Z80 code was the most difficult to optimize: it took me about a month, and I had help from several Z80 experts. In the end I sped up my initial Z80 code by about 50%. The other processors are more straightforward to optimize. It would be great if somebody could find better code for the 6502 or another CPU.
It is interesting that in per-MHz efficiency the 6502 beats the 8088 even on 16-bit calculations.

Re: Comparisons and contrasts
Posted: Thu Dec 23, 2021 7:21 pm
by BigEd
Re: Comparisons and contrasts
Posted: Fri Dec 24, 2021 9:05 am
by litwr
I like this emulator; I use it for AT emulation. But being cycle-exact is not enough for perfect IBM PC emulation, because the x86 has an instruction queue unit which can stall the main processing unit. This was documented very well by Michael Abrash. The instruction queue delays are especially significant for the 8088. Sorry, I have been too lazy to check the PCem sources thoroughly, so I am not sure about the instruction queue emulation. But PCem's IBM PC 5160 timings are faster than real iron too.
Re: Comparisons and contrasts
Posted: Mon Jan 10, 2022 3:27 pm
by litwr
I have ported the pi-spigot algorithm to the 6803. This CPU was used very rarely; I know of only the Tandy TRS-80 MC-10 and its French clones. The results show that this processor is slightly faster than the 6809! It seems Motorola followed the DEC way of making processor instructions more complex and slower. The 6803 has faster instructions than the 6809, but the 6809 has more registers, instructions and addressing modes. However, some instructions like LSRD or ASLD exist only on the 6803. If the 6803 also had ROLD, it would greatly speed up the division procedure for this processor.
It would also be interesting to guess how the 6502 might have evolved if MOS Technology had been regularly upgrading it. It is possible that they could have chosen the same path that Motorola did for the 6803. That would have meant using a 16-bit accumulator.
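To make the ROLD remark concrete, here is a small model of what such an instruction would do and how the same effect is built from two 8-bit rotates through carry (low byte first, then high byte). It is a sketch in Python with made-up function names; `rold` models a hypothetical instruction, and no cycle counts are implied.

```python
def rol8(value, carry):
    """8-bit rotate left through carry: returns (result, carry_out)."""
    result = ((value << 1) | carry) & 0xFF
    return result, (value >> 7) & 1


def rold(d, carry):
    """Hypothetical ROLD: 16-bit rotate left through carry."""
    result = ((d << 1) | carry) & 0xFFFF
    return result, (d >> 15) & 1


def rold_via_bytes(d, carry):
    """Same effect with two 8-bit rotates: low byte B, then high byte A."""
    a, b = d >> 8, d & 0xFF
    b, carry = rol8(b, carry)
    a, carry = rol8(a, carry)
    return (a << 8) | b, carry
```

A shift-and-subtract division loop performs this kind of rotate on every iteration, so a single 16-bit instruction would replace a two-instruction sequence in the inner loop.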
Re: Comparisons and contrasts
Posted: Mon Jan 10, 2022 4:01 pm
by barrym95838
I'm not surprising anyone here, but the ability to run legacy binaries was the constraint chosen for the '802 and '816, and they're full of 16-bit stuff behind those annoying mode bits. x86 followed a similar upgrade path, but threw much, much more money for R&D into the mix.
Re: Comparisons and contrasts
Posted: Tue Jan 11, 2022 2:06 pm
by litwr
I'm not surprising anyone here, but the ability to run legacy binaries was the constraint chosen for the '802 and '816, and they're full of 16-bit stuff behind those annoying mode bits. x86 followed a similar upgrade path, but threw much, much more money for R&D into the mix.
I have found out that the 6800 family consists of families of binary incompatible processors. I could count up to 5 families:
1) 6800/6802/6808, 6801/6803 (backward compatible with the 6800);
2) 6800, 68HC08;
3) 6800, 68HC11;
4) 6804, 6805;
5) 6809, 6309 (backward compatible with the CMOS 6809).
This is not a complete list. I am not able to finish it; there are too many controllers. The 6502 family is a set of more compatible processors, but it also has its varieties:
1) NMOS 6502 (undocumented instructions);
2) CMOS 6502 (almost 100% compatible with the NMOS 6502 without undocumented instructions);
3) 6509 (almost 100% compatible with the NMOS 6502);
4) 65CE02 (backward compatible with the CMOS 6502);
5) HuC6280 (backward compatible with the CMOS 6502?);
6) DTV (backward compatible with the CMOS 6502);
7) WDC65C02 (backward compatible with the CMOS 6502);
8) 65816 (backward compatible with the CMOS 6502).
It is interesting that the almost universal assembler VASM supports all (?) 8-bit 6502 varieties but still lacks the 65816.
Re: Comparisons and contrasts
Posted: Tue Jan 11, 2022 4:12 pm
by barrym95838
I have found out that the 6800 family consists of families of binary incompatible processors. I could count up to 5 families:
1) 6800/6802/6808, 6801/6803 (backward compatible with the 6800);
2) 6800, 68HC08;
3) 6800, 68HC11;
4) 6804, 6805;
5) 6809, 6309 (backward compatible with the CMOS 6809).
This is not a complete list. I am not able to finish it; there are too many controllers.
I don't have any practical experience with it, but the 68HC12 looks like it would be my favorite cousin from that family. It lost the 6809/6309's U register, but in doing so gained an impressive amount of binary code efficiency with a carefully overhauled opcode matrix.
The 68HC12 adds to and replaces a small number of 68HC11 instructions with new forms that are closer to the 6809 processor. More significantly it changes the instruction encodings to be far more dense and adds many 6809 like indexing features, some with even more flexibility. The net result is that code sizes are typically 30% smaller.
Re: Comparisons and contrasts
Posted: Tue Jan 11, 2022 5:50 pm
by BigEd
Wow, 30% smaller code is very impressive!