BillG wrote:
The linked page says:
I am not familiar with pce-ibmpc. Did you measure the time using a stopwatch or does the emulator report the number of CPU cycles used?
The fastest code in an emulator as measured with a stopwatch is not likely to be the fastest on actual hardware. It is not unlike trying to optimize code for a BASIC compiler as opposed to for one of the many interpreters.
Code optimization on x86 processors is notoriously difficult. If you have ever read The Zen of Assembly Language by Michael Abrash, you will know what I am talking about.
For instance, your code does not make use of the STOSW instruction which is especially advantageous on the 8088/8086 processor but not so much on modern members of the family.
Let me show you the original PDP-11 assembly for the main loop.
Code:
1$: mov sqr(r1), r3 ; r3 = y^2
add r0, r1 ; r1 = x+y
mov sqr(r0), r0 ; r0 = x^2
add r3, r0 ; r0 = x^2+y^2
cmp r0, r6 ; if r0 >= 4.0 then
bge 2$ ; overflow
mov sqr(r1), r1 ; r1 = (x+y)^2
sub r0, r1 ; r1 = (x+y)^2-x^2-y^2 = 2*x*y
add r5, r1 ; r1 = 2*x*y+b, updated y
sub r3, r0 ; r0 = x^2
sub r3, r0 ; r0 = x^2-y^2
add r4, r0 ; r0 = x^2-y^2+a, updated x
sob r2, 1$ ; to next iteration
It is just a straightforward implementation of the Mandelbrot fractal algorithm main loop.
Code:
while (x*x + y*y ≤ 2*2 AND iteration < max_iteration) do
xtemp := x*x - y*y + x0
y := 2*x*y + y0
x := xtemp
iteration := iteration + 1
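For anyone who wants to try the pseudocode directly, here is a minimal Python sketch of the same loop. The function name `escape_iterations` and the starting point x = y = 0 are my assumptions for illustration; they are not taken from the benchmark programs.

```python
def escape_iterations(x0, y0, max_iteration=100):
    """Return how many iterations run before the point (x0, y0) escapes."""
    x = 0.0  # assumed start value, as in the usual Mandelbrot iteration
    y = 0.0
    iteration = 0
    # Same condition as the pseudocode: stay while x^2 + y^2 <= 2^2
    while x * x + y * y <= 4.0 and iteration < max_iteration:
        xtemp = x * x - y * y + x0
        y = 2.0 * x * y + y0
        x = xtemp
        iteration += 1
    return iteration


if __name__ == "__main__":
    print(escape_iterations(0.0, 0.0, 50))  # a point inside the set
    print(escape_iterations(2.0, 2.0, 50))  # a point far outside the set
```

A point that never escapes returns `max_iteration`, and a point far outside returns after very few iterations.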
So it was quite feasible to write such small, optimized code for several platforms. We only need to translate these 13 lines.
All the programs simply print their timings, as stated on the linked page with the results. These emulators are able to run tricky code from games and demos, so they are quite accurate. The only exception is the IBM PC emulator, because timing was almost never a compatibility issue for this computer. There were too many variants of it, using different processors (the 8088, 8086, V20, V30, 80286, ...) at different clock speeds. So emulators for the IBM PC are usually faster than the original machines. IMHO the emulators just don't simulate the instruction queue delays. BTW, if anybody knows the best IBM PC emulator, please let me know.
It is possible to get just the machine cycle counts for some processors (the Z80 and 6502), but that requires different tools, and IMHO it would not change the results. That approach also loses the nice resulting pictures.
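To show what the 13 PDP-11 lines do, here is a rough Python sketch of the same steps. It uses a table of squares and the identity (x+y)^2 - x^2 - y^2 = 2*x*y, so the loop needs no multiplication. The fixed-point scale `SCALE`, the table range `LIMIT` and all the names are my assumptions for illustration; the real programs may use a different number format and table size.

```python
SCALE = 256          # assumed fixed-point scale: 1.0 is stored as 256
LIMIT = 8 * SCALE    # assumed table range, enough while |a|, |b| <= 2.0
FOUR = 4 * SCALE     # the constant 4.0 (kept in r6 in the assembly)

# Table of squares in fixed point, indexed by value + LIMIT
SQR = [(v * v) // SCALE for v in range(-LIMIT, LIMIT + 1)]


def sqr(v):
    """Table lookup, the counterpart of sqr(rN) in the assembly."""
    return SQR[v + LIMIT]


def iterations_table(a, b, max_iter):
    """Count iterations for the point (a, b) given in fixed point."""
    x = y = 0
    for n in range(max_iter):
        y2 = sqr(y)              # r3 = y^2
        s = x + y                # r1 = x+y
        x2 = sqr(x)              # r0 = x^2
        r = x2 + y2              # r0 = x^2+y^2
        if r >= FOUR:            # cmp / bge: leave the loop
            return n
        y = sqr(s) - r + b       # 2*x*y+b, updated y
        x = r - y2 - y2 + a      # x^2-y^2+a, updated x
    return max_iter


if __name__ == "__main__":
    print(iterations_table(0, 0, 20))
    print(iterations_table(2 * SCALE, 2 * SCALE, 20))
```

Because the table entries are truncated, the 2*x*y term can be off by a unit or so in the last place. Also note that the assembly leaves the loop at exactly 4.0 (bge), while the pseudocode still continues at exactly 4.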
BigEd wrote:
For me, it's a major difficulty with any kind of comparison using hand-coded assembly - you need to be able to marshal the same level of expertise with each micro. Not insurmountable, but otherwise you may leave unrealised improvements for some which distort the rankings.
I can only claim that my code is well optimized, but I can't claim it is perfect. The Z80 code was the most difficult to optimize: it took me about a month, and I had help from several Z80 experts. In the end I sped up my initial Z80 code by about 50%. The other processors are more straightforward to optimize for. It would be great if somebody could find better code for the 6502 or another CPU.
It is interesting that in per-MHz efficiency the 6502 beats the 8088 even on 16-bit calculations.