Comparisons and contrasts

BillG · Post by **BillG** » Sun Nov 14, 2021 1:44 pm

On further reflection, I am thinking that the upper bound check can be done with an unsigned test.

Concur?

barrym95838 · Post by **barrym95838** » Sun Nov 14, 2021 5:23 pm

I might be completely wrong, but doesn't PASCAL allow signed indices?

If you're just comparing a 16-bit value to another, you can almost always replace SEC SBC with CMP for the low byte, signed or unsigned.

BillG · Post by **BillG** » Mon Nov 15, 2021 11:17 am

barrym95838 wrote:

I might be completely wrong, but doesn't PASCAL allow signed indices?

Yes it does.

Consider this:

Code:

Offset := Subscript - LowerBound
if Offset < 0  { This must be a signed comparison, that is, -32768 - LowerBound must be reported as an error }
  then
    report OutOfBounds error
{ Offset is now an unsigned quantity }
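
The check above can be sketched in Python, assuming 16-bit two's-complement values and 6502-style N and V flags. The helper names `sub16` and `check_subscript` and the `upper` parameter are illustrative assumptions, not from the thread. The lower-bound test is signed (N xor V); once it passes, the offset is treated as unsigned, so the upper bound needs only an unsigned compare.

```python
MASK16 = 0xFFFF


def sub16(a, b):
    """Subtract two 16-bit values and return (result, negative, overflow).

    negative and overflow mimic the 6502 N and V flags after a 16-bit
    SEC / SBC chain.
    """
    a &= MASK16
    b &= MASK16
    result = (a - b) & MASK16
    negative = bool(result & 0x8000)
    # Signed overflow: the operands had different signs and the sign of
    # the result differs from the sign of a.
    overflow = bool((a ^ b) & (a ^ result) & 0x8000)
    return result, negative, overflow


def check_subscript(subscript, lower, upper):
    """Return the unsigned offset of subscript, or raise IndexError."""
    offset, negative, overflow = sub16(subscript, lower)
    # Signed "less than zero" is N xor V, so -32768 - 1 is caught even
    # though the wrapped result looks positive.
    if negative != overflow:
        raise IndexError("OutOfBounds")
    # Hypothetical element count; plain Python int, so it may reach 65536.
    number_in_range = upper - lower + 1
    # From here on the offset is unsigned, so one unsigned compare suffices.
    if offset >= number_in_range:
        raise IndexError("OutOfBounds")
    return offset
```

With these assumptions, `check_subscript(5, 1, 10)` returns 4, while `check_subscript(-32768, 1, 10)` raises the error even though the wrapped difference is 0x7FFF.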

barrym95838 wrote:

If you're just comparing a 16-bit value to another, you can almost always replace SEC SBC with CMP for the low byte, signed or unsigned.

There are two ways to compare 16-bit numbers:

Code:

subtract NumberInRange from Offset
if the difference >= 0  { that is, Offset >= NumberInRange }
  then
    report OutOfBounds error

and

Code:

compare upper byte of Offset with upper byte of NumberInRange
if <
  then
    proceed
  else if >
    then
      report error
    else
      compare lower byte of Offset with lower byte of NumberInRange
      if <
        then
          proceed
        else
          report error
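
A minimal Python sketch of the byte-by-byte method, assuming both values are unsigned 16-bit quantities. The function name `offset_in_range` is made up for illustration.

```python
def offset_in_range(offset, number_in_range):
    """Return True when offset < number_in_range (both unsigned 16-bit)."""
    off_hi, off_lo = (offset >> 8) & 0xFF, offset & 0xFF
    num_hi, num_lo = (number_in_range >> 8) & 0xFF, number_in_range & 0xFF
    if off_hi < num_hi:
        return True           # upper bytes decide: proceed
    if off_hi > num_hi:
        return False          # upper bytes decide: report error
    return off_lo < num_lo    # upper bytes equal: lower bytes decide
```

The lower bytes are only examined when the upper bytes are equal, which is why this form can be faster than a full subtraction when the upper bytes usually differ.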

For the lower bound, we need the difference anyway, so do the subtraction.

For the upper bound, I'll have to analyze the two methods.

Edit: Oh, I see what you are saying. Since the difference is not kept, do a compare of the low bytes instead of a subtraction...

barrym95838 · Post by **barrym95838** » Mon Nov 15, 2021 10:48 pm

BillG wrote:

Oh, I see what you are saying. Since the difference is not kept, do a compare of the low bytes instead of a subtraction...

That's what I meant (although in hindsight checking for "<=" or ">" or "=" or "<>" isn't as nice as checking for "<" or ">=" if you use CMP on the low byte, because you might need the low result later).
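
To illustrate the point, here is a rough Python model of the CMP-low, SBC-high sequence, assuming unsigned operands and binary (non-decimal) mode. The helper names are hypothetical. The carry out of the high-byte SBC answers "<" versus ">=" directly, but the zero flag reflects only the high byte, because CMP discards the low-byte difference.

```python
def cmp8(a, m):
    """Model 6502 CMP: return the carry flag, set when a >= m (unsigned)."""
    return a >= m


def sbc8(a, m, carry):
    """Model 6502 SBC in binary mode: a - m - (1 - carry).

    Returns (result, carry_out, zero).
    """
    total = a - m - (0 if carry else 1)
    result = total & 0xFF
    return result, total >= 0, result == 0


def compare16(a, b):
    """Return (carry, zero) after CMP on the low bytes, SBC on the high bytes."""
    carry = cmp8(a & 0xFF, b & 0xFF)                 # LDA a_lo : CMP b_lo
    _, carry, zero = sbc8((a >> 8) & 0xFF,           # LDA a_hi : SBC b_hi
                          (b >> 8) & 0xFF, carry)
    return carry, zero
```

For example, `compare16(0x0105, 0x0103)` returns `(True, True)`: the carry correctly says a >= b, but the zero flag is set although the values differ, so "=" and "<>" need the low-byte result as well.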

litwr · Post by **litwr** » Thu Dec 23, 2021 9:09 am

I have done some tests for the 6502, Z80, K1801VM1, 8088, and 68000, all of which do the same Mandelbrot calculations. Results are here. It seems that for intensive 16-bit calculations the 6502 shows rather mediocre results. However, getting well-optimized code for the Z80 is a very long and expensive process.

BillG · Post by **BillG** » Thu Dec 23, 2021 10:22 am

The linked page says:

Quote:

Emulators were used to get these results: BK2010 v0.5 for the BK0010, GID v3.10 for the BK0011M, plus4emu v1.2.10 for the Commodore+4, ep128emu v2.0.11 for the Amstrad CPC, mess 0.229 for the BBC Master, FS-UAE 3.0.5 for the Amiga 500, pce-ibmpc version 20140222-4b05f0c for the IBM PC 5160 EGA. The emulators are quite accurate for timings, the only exception is the emulator for the IBM PC which appears to be about 25% faster than real hardware. So all the speed results for the IBM PC may be degraded by this 25% – the degraded ER result is shown in parentheses.

I am not familiar with pce-ibmpc. Did you measure the time using a stopwatch or does the emulator report the number of CPU cycles used?

The fastest code in an emulator as measured with a stopwatch is not likely to be the fastest on actual hardware. It is not unlike trying to optimize code for a BASIC compiler as opposed to for one of the many interpreters.

Code optimization on x86 processors is notoriously difficult. If you have ever read The Zen of Assembly Language by Michael Abrash, you will know what I am talking about.

For instance, your code does not make use of the STOSW instruction which is especially advantageous on the 8088/8086 processor but not so much on modern members of the family.

BigEd · Post by **BigEd** » Thu Dec 23, 2021 1:17 pm

For me, it's a major difficulty with any kind of comparison using hand-coded assembly - you need to be able to marshal the same level of expertise with each micro. Not insurmountable, but otherwise you may leave unrealised improvements for some which distort the rankings.

litwr · Post by **litwr** » Thu Dec 23, 2021 7:18 pm

BillG wrote:

The linked page says:
I am not familiar with pce-ibmpc. Did you measure the time using a stopwatch or does the emulator report the number of CPU cycles used?

The fastest code in an emulator as measured with a stopwatch is not likely to be the fastest on actual hardware. It is not unlike trying to optimize code for a BASIC compiler as opposed to for one of the many interpreters.

Code optimization on x86 processors is notoriously difficult. If you have ever read The Zen of Assembly Language by Michael Abrash, you will know what I am talking about.

For instance, your code does not make use of the STOSW instruction which is especially advantageous on the 8088/8086 processor but not so much on modern members of the family.

Let me show you the original PDP-11 assembly for the main loop.

Code:

1$:	mov	sqr(r1), r3	; r3 = y^2
	add	r0, r1		; r1 = x+y
	mov	sqr(r0), r0	; r0 = x^2
	add	r3, r0		; r0 = x^2+y^2
	cmp	r0, r6		; if r0 >= 4.0 then
	bge	2$		; overflow
	mov	sqr(r1), r1	; r1 = (x+y)^2
	sub	r0, r1		; r1 = (x+y)^2-x^2-y^2 = 2*x*y
	add	r5, r1		; r1 = 2*x*y+b, updated y
	sub	r3, r0		; r0 = x^2
	sub	r3, r0		; r0 = x^2-y^2
	add	r4, r0		; r0 = x^2-y^2+a, updated x
	sob	r2, 1$		; to next iteration

It is just a straightforward implementation of the Mandelbrot fractal algorithm main loop.

Code:

    while (x*x + y*y ≤ 2*2 AND iteration < max_iteration) do
        xtemp := x*x - y*y + x0
        y := 2*x*y + y0
        x := xtemp
        iteration := iteration + 1

So it was not hard to write such small, optimized code for several platforms: we only need to translate these 13 lines.
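
The assembly loop avoids a multiply for 2*x*y by using the identity (x+y)^2 - x^2 - y^2 = 2*x*y, so only squares are needed. A rough Python rendering of the same 13 steps follows, using floats in place of the fixed-point `sqr` table. The function name `escape_count` and the starting point x = y = 0 are assumptions, since the loop setup is not shown in the post.

```python
def escape_count(a, b, max_iter):
    """Count iterations before x^2 + y^2 reaches 4 for the point c = a + b*i."""
    x = y = 0.0  # assumed starting point; not shown in the original listing
    for iteration in range(max_iter):
        y2 = y * y                # r3 = y^2
        s = x + y                 # r1 = x+y
        norm = x * x + y2         # r0 = x^2+y^2
        if norm >= 4.0:           # cmp r0, r6 / bge 2$
            return iteration
        y = s * s - norm + b      # (x+y)^2 - x^2 - y^2 + b = 2*x*y + b
        x = norm - y2 - y2 + a    # x^2 - y^2 + a
    return max_iter               # sob r2, 1$ ran out of iterations
```

Note that the assembly exits on ">= 4.0", while the textbook loop continues on "≤ 2*2", so the two differ only for points that land exactly on the boundary.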

All the programs simply print their timings, as stated on the linked page with the results. Those emulators are capable of running tricky code for games and demos, so they are quite accurate. The only exception is the IBM PC emulator, because timings were almost never a compatibility issue for this computer: there were too many varieties of it, using different processors (the 8088, 8086, V20, V30, 80286, ...) at different clocks. So emulators for the IBM PC are usually faster than the original machines. IMHO the emulators just don't simulate the instruction queue delays. BTW, if anybody knows the best IBM PC emulator, please tell me about it.

It is possible to get just the machine cycle count for some processors (the Z80 and 6502), but that requires different tools, and IMHO it can't change the results. This approach also misses the fine resulting pictures.

BigEd wrote:

For me, it's a major difficulty with any kind of comparison using hand-coded assembly - you need to be able to marshal the same level of expertise with each micro. Not insurmountable, but otherwise you may leave unrealised improvements for some which distort the rankings.

I can only claim that my code is well optimized; I can't claim it is perfect. The Z80 code was the most difficult to optimize: it took me about a month, and I had help from several Z80 experts. In the end I sped up my initial Z80 code by about 50%. The other processors are more straightforward to optimize. It would be great if somebody could find better code for the 6502 or another CPU.
It is interesting that in per-MHz efficiency the 6502 beats the 8088 even on 16-bit calculations.

BigEd · Post by **BigEd** » Thu Dec 23, 2021 7:21 pm

Sarah Walker's PCem is cycle accurate, apparently.
https://retrocomputing.stackexchange.co ... 6-emulator

litwr · Post by **litwr** » Fri Dec 24, 2021 9:05 am

BigEd wrote:

Sarah Walker's PCem is cycle accurate, apparently.
https://retrocomputing.stackexchange.co ... 6-emulator

I like this emulator; I use it for AT emulation. But being cycle exact is not enough for perfect IBM PC emulation, because the x86 uses an instruction queue unit which can stall the main processing unit. This was documented very well by Michael Abrash. The instruction queue delays are especially significant for the 8088. Sorry, I have been too lazy to check the PCem sources thoroughly, so I am not sure about its instruction queue emulation. But PCem's IBM PC 5160 timings are faster than real iron too.
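
As a very rough illustration of why the instruction queue matters on the 8088, here is a toy Python model. It assumes the queue is empty and that each instruction byte costs one 4-clock bus cycle to fetch. The function name and the model itself are simplifying assumptions, not how PCem or any other emulator works.

```python
BUS_CYCLES_PER_BYTE = 4  # assumed: the 8088 fetches one byte per 4-clock bus cycle


def effective_cycles(instruction_bytes, execution_cycles):
    """Rough lower bound on clocks per instruction with an empty queue.

    The instruction cannot finish faster than its bytes can be fetched,
    nor faster than its nominal execution time.
    """
    fetch_cycles = instruction_bytes * BUS_CYCLES_PER_BYTE
    return max(fetch_cycles, execution_cycles)
```

Under this model a hypothetical 2-byte instruction with a nominal 2-cycle execution time still costs 8 clocks, which is the kind of delay an emulator misses if it only adds up nominal instruction timings.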

litwr · Post by **litwr** » Mon Jan 10, 2022 3:27 pm

I have ported the pi-spigot algorithm to the 6803. This CPU was used very rarely; I know of only the Tandy TRS-80 MC-10 and its French clones. The results show that this processor is slightly faster than the 6809! It seems Motorola followed the DEC way of making processor instructions more complex and slower. The 6803 has faster instructions than the 6809, but the 6809 has more registers, instructions and addressing modes. However, some instructions like LSRD or ASLD exist only for the 6803. If the 6803 also had ROLD, it would greatly speed up the division procedure for this processor.
It would also be interesting to guess how the 6502 might have evolved if MOS Technology had been regularly upgrading it. It is possible that they could have chosen the same path that Motorola did for the 6803. That would have meant using a 16-bit accumulator.
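
To show where a 16-bit rotate would help, here is a generic shift-and-subtract (restoring) division sketched in Python. It is not the routine from the pi-spigot port, which is not shown, and the function name `divide_u16` is made up. Each iteration shifts the dividend left (what ASLD does) and rotates the outgoing bit into the remainder (what a hypothetical ROLD would do in one instruction).

```python
def divide_u16(dividend, divisor):
    """Unsigned 16-bit restoring division; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = dividend & 0xFFFF  # dividend bits shift out as quotient bits shift in
    remainder = 0
    for _ in range(16):
        carry = (quotient >> 15) & 1
        quotient = (quotient << 1) & 0xFFFF    # ASLD: top bit goes to carry
        # Rotate the carry into the remainder; this is the ROLD-style step.
        # Python ints keep the 17th bit that real code would hold in carry.
        remainder = (remainder << 1) | carry
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1                      # record a 1 bit in the quotient
    return quotient, remainder
```

Without a 16-bit rotate, the remainder step has to be done as two 8-bit rotates, once per loop iteration.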

barrym95838 · Post by **barrym95838** » Mon Jan 10, 2022 4:01 pm

I'm not surprising anyone here, but the ability to run legacy binaries was the constraint chosen for the '802 and '816, and they're full of 16-bit stuff behind those annoying mode bits. x86 followed a similar upgrade path, but threw much, much more money for R&D into the mix.

litwr · Post by **litwr** » Tue Jan 11, 2022 2:06 pm

barrym95838 wrote:

I'm not surprising anyone here, but the ability to run legacy binaries was the constraint chosen for the '802 and '816, and they're full of 16-bit stuff behind those annoying mode bits. x86 followed a similar upgrade path, but threw much, much more money for R&D into the mix.

I have found out that the 6800 family consists of families of binary incompatible processors. I could count up to 5 families:
1) 6800/6802/6808, 6801/6803 (backward compatible with the 6800);
2) 6800, 68HC08;
3) 6800, 68HC11;
4) 6804, 6805;
5) 6809, 6309 (backward compatible with the CMOS 6809).
This is not a complete list. I am not able to finish it; there are too many controllers. The 6502 family is a set of more compatible processors. But it also contains varieties:
1) NMOS 6502 (undocumented instructions);
2) CMOS 6502 (almost 100% compatible with the NMOS 6502 without undocumented instructions);
3) 6509 (almost 100% compatible with the NMOS 6502);
4) 65CE02 (backward compatible with the CMOS 6502);
5) HuC6280 (backward compatible with the CMOS 6502?);
6) DTV (backward compatible with the CMOS 6502);
7) WDC65C02 (backward compatible with the CMOS 6502);
8) 65816 (backward compatible with the CMOS 6502).
It is interesting that the almost universal assembler VASM supports all (?) 8-bit 6502 varieties but still lacks the 65816.

barrym95838 · Post by **barrym95838** » Tue Jan 11, 2022 4:12 pm

litwr wrote:

I have found out that the 6800 family consists of families of binary incompatible processors. I could count up to 5 families:
1) 6800/6802/6808, 6801/6803 (backward compatible with the 6800);
2) 6800, 68HC08;
3) 6800, 68HC11;
4) 6804, 6805;
5) 6809, 6309 (backward compatible with the CMOS 6809).
This is not a complete list. I am not able to finish it; there are too many controllers.

I don't have any practical experience with it, but the 68HC12 looks like it would be my favorite cousin from that family. It lost the 6809/6309's U register, but in doing so gained an impressive amount of binary code efficiency with a carefully overhauled opcode matrix.

Quote:

The 68HC12 adds to and replaces a small number of 68HC11 instructions with new forms that are closer to the 6809 processor. More significantly it changes the instruction encodings to be far more dense and adds many 6809 like indexing features, some with even more flexibility. The net result is that code sizes are typically 30% smaller.

BigEd · Post by **BigEd** » Tue Jan 11, 2022 5:50 pm

Wow, 30% smaller code is very impressive!
