
All times are UTC




Posted: Sun Nov 14, 2021 1:44 pm

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas
On further reflection, I am thinking that the upper bound check can be done with an unsigned test.

Concur?


Posted: Sun Nov 14, 2021 5:23 pm

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
I might be completely wrong, but doesn't PASCAL allow signed indices?

If you're just comparing a 16-bit value to another, you can almost always replace SEC SBC with CMP for the low byte, signed or unsigned.

_________________
Got a kilobyte lying fallow in your 65xx's memory map? Sprinkle some VTL02C on it and see how it grows on you!

Mike B. (about me) (learning how to github)


Posted: Mon Nov 15, 2021 11:17 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas
barrym95838 wrote:
I might be completely wrong, but doesn't PASCAL allow signed indices?

Yes it does.

Consider this:
Code:
Offset := Subscript - LowerBound
if Offset < 0  { This must be a signed comparison, that is, -32768 - LowerBound must be reported as an error }
  then
    report OutOfBounds error
{ Offset is now an unsigned quantity }
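
A minimal Python sketch of the check described above, for illustration only. The names `checked_offset`, `OutOfBounds` and `upper_bound` are hypothetical and not from any compiler discussed here. Python integers do not wrap, so the signed 16-bit subtraction is modeled with ordinary arithmetic; the point is only that once the lower-bound test has passed, the offset fits in 0..65535 and the upper-bound test needs no sign handling.

```python
class OutOfBounds(Exception):
    """Raised when a subscript falls outside the declared bounds."""


def checked_offset(subscript, lower_bound, upper_bound):
    """Return subscript - lower_bound, or raise OutOfBounds.

    All three arguments are assumed to be signed 16-bit values
    (-32768..32767) with lower_bound <= upper_bound.
    """
    # Signed step: a subscript below the lower bound gives a negative offset.
    offset = subscript - lower_bound
    if offset < 0:
        raise OutOfBounds(subscript)
    # The offset is now non-negative, so it can be treated as an unsigned
    # 16-bit quantity (0..65535) and compared without sign handling.
    unsigned_offset = offset & 0xFFFF
    if unsigned_offset > upper_bound - lower_bound:
        raise OutOfBounds(subscript)
    return unsigned_offset
```

For example, `checked_offset(-3, -5, 5)` returns 2, while a subscript of 11 with bounds 1..10 raises `OutOfBounds`.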


barrym95838 wrote:
If you're just comparing a 16-bit value to another, you can almost always replace SEC SBC with CMP for the low byte, signed or unsigned.

There are two ways to compare 16-bit numbers:
Code:
subtract NumberInRange from Offset
if the difference < 0
  then
    report OutOfBounds error

and
Code:
compare upper byte of Offset with upper byte of NumberInRange
if <
  then
    proceed
  else if >
    then
      report error
    else
      compare lower byte of Offset with lower byte of NumberInRange
      if <
        then
          proceed
        else
          report error
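
As a rough illustration, the two methods can be written side by side in Python. This is a sketch assuming both operands are unsigned 16-bit values; the function names are made up for the example, and `number_in_range` stands in for NumberInRange above.

```python
def in_range_by_subtraction(offset, number_in_range):
    """Method 1: subtract and test whether the difference went negative."""
    return offset - number_in_range < 0


def in_range_by_bytes(offset, number_in_range):
    """Method 2: compare the upper bytes first, the lower bytes only on a tie."""
    offset_hi, limit_hi = offset >> 8, number_in_range >> 8
    if offset_hi < limit_hi:
        return True            # proceed
    if offset_hi > limit_hi:
        return False           # report error
    # Upper bytes are equal, so the lower bytes decide.
    return (offset & 0xFF) < (number_in_range & 0xFF)
```

Both functions return True exactly when the offset is below the limit; the byte-wise version can stop early whenever the upper bytes differ.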

For the lower bound, we need the difference anyway, so do the subtraction.

For the upper bound, I'll have to analyze the two methods.

Edit: Oh, I see what you are saying. Since the difference is not kept, do a compare of the low bytes instead of a subtraction...
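
To make the CMP-for-the-low-byte idea concrete, here is a small Python model of the 6502 carry flag through a `CMP` on the low bytes followed by an `SBC` on the high bytes. It is a sketch for the unsigned case only (a signed compare would also have to look at the overflow flag), and the function name is hypothetical.

```python
def less_than_cmp_sbc(a, b):
    """Unsigned 16-bit a < b, modeled on CMP (low byte) then SBC (high byte)."""
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF

    # LDA a_lo / CMP b_lo
    # CMP sets carry when no borrow is needed (a_lo >= b_lo); unlike
    # SEC + SBC it does not need carry preset and does not change A.
    carry = 1 if a_lo >= b_lo else 0

    # LDA a_hi / SBC b_hi
    # SBC subtracts an extra 1 when carry is clear (borrow from the low byte).
    difference = a_hi - b_hi - (1 - carry)
    carry = 1 if difference >= 0 else 0

    # BCC (carry clear) would be taken when a < b.
    return carry == 0
```

The low-byte difference is never stored, which is why the cheaper `CMP` can replace `SEC` plus `SBC` there.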


Posted: Mon Nov 15, 2021 10:48 pm

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
BillG wrote:
Oh, I see what you are saying. Since the difference is not kept, do a compare of the low bytes instead of a subtraction...

That's what I meant (although in hindsight checking for "<=" or ">" or "=" or "<>" isn't as nice as checking for "<" or ">=" if you use CMP on the low byte, because you might need the low result later).

Posted: Thu Dec 23, 2021 9:09 am

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
I have run some tests in which the 6502, Z80, K1801VM1, 8088, and 68000 all perform the same Mandelbrot calculations. Results are here. It seems that for intensive 16-bit calculations the 6502 shows rather mediocre results. However, getting well-optimized code for the Z80 is a very long and expensive process.

_________________
my blog about processors


Posted: Thu Dec 23, 2021 10:22 am

Joined: Thu Mar 12, 2020 10:04 pm
Posts: 704
Location: North Tejas
The linked page says:

Quote:
Emulators were used to get these results: BK2010 v0.5 for the BK0010, GID v3.10 for the BK0011M, plus4emu v1.2.10 for the Commodore+4, ep128emu v2.0.11 for the Amstrad CPC, mess 0.229 for the BBC Master, FS-UAE 3.0.5 for the Amiga 500, pce-ibmpc version 20140222-4b05f0c for the IBM PC 5160 EGA. The emulators are quite accurate for timings, the only exception is the emulator for the IBM PC which appears to be about 25% faster than real hardware. So all the speed results for the IBM PC may be degraded by this 25% – the degraded ER result is shown in parentheses.


I am not familiar with pce-ibmpc. Did you measure the time using a stopwatch or does the emulator report the number of CPU cycles used?

The fastest code in an emulator as measured with a stopwatch is not likely to be the fastest on actual hardware. It is not unlike trying to optimize code for a BASIC compiler as opposed to for one of the many interpreters.

Code optimization on x86 processors is notoriously difficult. If you have ever read The Zen of Assembly Language by Michael Abrash, you will know what I am talking about.

For instance, your code does not make use of the STOSW instruction which is especially advantageous on the 8088/8086 processor but not so much on modern members of the family.


Posted: Thu Dec 23, 2021 1:17 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
For me, it's a major difficulty with any kind of comparison using hand-coded assembly - you need to be able to marshal the same level of expertise with each micro. Not insurmountable, but otherwise you may leave unrealised improvements for some which distort the rankings.


Posted: Thu Dec 23, 2021 7:18 pm

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
BillG wrote:
The linked page says:
I am not familiar with pce-ibmpc. Did you measure the time using a stopwatch or does the emulator report the number of CPU cycles used?

The fastest code in an emulator as measured with a stopwatch is not likely to be the fastest on actual hardware. It is not unlike trying to optimize code for a BASIC compiler as opposed to for one of the many interpreters.

Code optimization on x86 processors is notoriously difficult. If you have ever read The Zen of Assembly Language by Michael Abrash, you will know what I am talking about.

For instance, your code does not make use of the STOSW instruction which is especially advantageous on the 8088/8086 processor but not so much on modern members of the family.

Let me show you the original PDP-11 assembly for the main loop.
Code:
1$:     mov     sqr(r1), r3     ; r3 = y^2
        add     r0, r1          ; r1 = x+y
        mov     sqr(r0), r0     ; r0 = x^2
        add     r3, r0          ; r0 = x^2+y^2
        cmp     r0, r6          ; if r0 >= 4.0 then
        bge     2$              ; overflow
        mov     sqr(r1), r1     ; r1 = (x+y)^2
        sub     r0, r1          ; r1 = (x+y)^2-x^2-y^2 = 2*x*y
        add     r5, r1          ; r1 = 2*x*y+b, updated y
        sub     r3, r0          ; r0 = x^2
        sub     r3, r0          ; r0 = x^2-y^2
        add     r4, r0          ; r0 = x^2-y^2+a, updated x
        sob     r2, 1$          ; to next iteration

It is just a straightforward implementation of the Mandelbrot fractal algorithm main loop.
Code:
    while (x*x + y*y ≤ 2*2 AND iteration < max_iteration) do
        xtemp := x*x - y*y + x0
        y := 2*x*y + y0
        x := xtemp
        iteration := iteration + 1

So it was quite possible to write such small optimized code for several platforms. Only these 13 lines need to be translated.
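
For readers who don't read PDP-11 assembly, here is a rough Python transcription of the 13-line loop next to the textbook formula. It is only a sketch: `sqr()` is an ordinary function standing in for the table of squares, exact fractions replace the fixed-point values, starting from x = y = 0 is an assumption (the listing does not show the setup), and both functions stop on x^2 + y^2 >= 4 as the assembly does.

```python
from fractions import Fraction


def sqr(value):
    """Stand-in for the sqr() lookup table used in the assembly listing."""
    return value * value


def iterate_table_style(a, b, max_iter):
    """Follow the register-level steps: only squares, adds and subtracts."""
    x = y = Fraction(0)
    for count in range(max_iter):
        y2 = sqr(y)                  # r3 = y^2
        xy_sum = x + y               # r1 = x+y
        norm = sqr(x) + y2           # r0 = x^2+y^2
        if norm >= 4:                # cmp / bge: escaped
            return count
        y = sqr(xy_sum) - norm + b   # (x+y)^2 - x^2 - y^2 + b = 2*x*y + b
        x = norm - y2 - y2 + a       # x^2 - y^2 + a
    return max_iter


def iterate_direct(a, b, max_iter):
    """Textbook form with an explicit 2*x*y multiplication."""
    x = y = Fraction(0)
    for count in range(max_iter):
        if x * x + y * y >= 4:
            return count
        x, y = x * x - y * y + a, 2 * x * y + b
    return max_iter
```

The trick is that 2*x*y never has to be multiplied out: it falls out of three table lookups, because (x+y)^2 - x^2 - y^2 = 2*x*y.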

All the programs simply print their timings, as stated on the linked page with the results. Those emulators can run tricky code from games and demos, so they are quite accurate. The only exception is the IBM PC emulator, because timing was almost never a compatibility issue for this computer: there were too many variants of it using different processors (the 8088, 8086, V20, V30, 80286, ...) at different clock speeds. So emulators for the IBM PC are usually faster than the original machines. IMHO the emulators just don't simulate the instruction queue delays. BTW, if anybody knows the best IBM PC emulator, please let me know.

It is possible to get the machine cycle count directly for some processors (the Z80 and 6502), but that requires different tools and IMHO can't change the results. And this approach misses the fine resulting pictures. :)

BigEd wrote:
For me, it's a major difficulty with any kind of comparison using hand-coded assembly - you need to be able to marshal the same level of expertise with each micro. Not insurmountable, but otherwise you may leave unrealised improvements for some which distort the rankings.

I can only claim that my code is well optimized; I can't claim it is perfect. The Z80 code was the most difficult to optimize: it took me about a month, and I had help from several Z80 experts. In the end I sped up my initial Z80 code by about 50%. The other processors are more straightforward to optimize for. It would be great if somebody could find better code for the 6502 or another CPU.
It is interesting that in per-MHz efficiency the 6502 beats the 8088 even on 16-bit calculations. :)



Last edited by litwr on Thu Dec 23, 2021 7:28 pm, edited 2 times in total.

Posted: Thu Dec 23, 2021 7:21 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Sarah Walker's PCem is cycle accurate, apparently.
https://retrocomputing.stackexchange.co ... 6-emulator


Posted: Fri Dec 24, 2021 9:05 am

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
BigEd wrote:
Sarah Walker's PCem is cycle accurate, apparently.
https://retrocomputing.stackexchange.co ... 6-emulator

I like this emulator; I use it for AT emulation. But being cycle-exact is not enough for perfect IBM PC emulation, because the x86 has an instruction queue unit which can stall the main central processing unit. This was documented very well by Michael Abrash. The instruction queue delays are especially significant for the 8088. Sorry, I have been too lazy to check the PCem sources thoroughly, so I am not sure about its instruction queue emulation. But PCem's IBM PC 5160 timings are faster than real iron too.

Posted: Mon Jan 10, 2022 3:27 pm

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
I have ported the pi-spigot algorithm to the 6803. This CPU was used very rarely; I know only of the Tandy TRS-80 MC-10 and its French clones. The results show that this processor is slightly faster than the 6809! It seems Motorola followed the DEC way of making processor instructions more complex and slower. The 6803 has faster instructions than the 6809, but the 6809 has more registers, instructions and addressing modes. However, some instructions like LSRD or ASLD exist only on the 6803. If the 6803 also had ROLD, it would greatly speed up the division procedure for this processor.
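
To show where a 16-bit rotate would come in, here is a generic shift-and-subtract division sketched in Python. It is not the actual 6803 routine from the pi-spigot port, and ROLD is a hypothetical instruction as noted above; the comments only mark which step a double-register rotate would cover in one go.

```python
def divide16(dividend, divisor):
    """Unsigned 16-bit shift-and-subtract division: (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = dividend & 0xFFFF   # quotient bits replace dividend bits
    remainder = 0
    for _ in range(16):
        # ASLD-like step: shift the dividend left, top bit falls into carry.
        carry = (quotient >> 15) & 1
        quotient = (quotient << 1) & 0xFFFF
        # ROLD-like step (hypothetical): rotate carry into the remainder.
        # Without a 16-bit rotate this takes two 8-bit rotates.
        remainder = (remainder << 1) | carry
        # Trial subtraction: keep it only if the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

The rotate sits inside a loop that runs 16 times per division, so saving even one instruction there adds up quickly in a division-heavy program.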
It would also be interesting to guess how the 6502 might have evolved if MOS Technology had been regularly upgrading it. It is possible that they could have chosen the same path that Motorola did for the 6803. That would have meant using a 16-bit accumulator.

Posted: Mon Jan 10, 2022 4:01 pm

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
I'm not surprising anyone here, but the ability to run legacy binaries was the constraint chosen for the '802 and '816, and they're full of 16-bit stuff behind those annoying mode bits. x86 followed a similar upgrade path, but threw much, much more money for R&D into the mix.

Posted: Tue Jan 11, 2022 2:06 pm

Joined: Sat Jul 09, 2016 6:01 pm
Posts: 180
barrym95838 wrote:
I'm not surprising anyone here, but the ability to run legacy binaries was the constraint chosen for the '802 and '816, and they're full of 16-bit stuff behind those annoying mode bits. x86 followed a similar upgrade path, but threw much, much more money for R&D into the mix.

I have found out that the 6800 family consists of several families of binary-incompatible processors. I could count up to 5 families:
1) 6800/6802/6808, 6801/6803 (backward compatible with the 6800);
2) 6800, 68HC08;
3) 6800, 68HC11;
4) 6804, 6805;
5) 6809, 6309 (backward compatible with the CMOS 6809).
This is not a complete list. I am not able to finish it; there are too many controllers. The 6502 family is a set of more compatible processors, but it also contains varieties:
1) NMOS 6502 (undocumented instructions);
2) CMOS 6502 (almost 100% compatible with the NMOS 6502 without undocumented instructions);
3) 6509 (almost 100% compatible with the NMOS 6502);
4) 65CE02 (backward compatible with the CMOS 6502);
5) HuC6280 (backward compatible with the CMOS 6502?);
6) DTV (backward compatible with the CMOS 6502);
7) WDC65C02 (backward compatible with the CMOS 6502);
8) 65816 (backward compatible with the CMOS 6502).
It is interesting that the almost universal assembler VASM supports all (?) 8-bit 6502 varieties but still lacks the 65816.

Posted: Tue Jan 11, 2022 4:12 pm

Joined: Sun Jun 30, 2013 10:26 pm
Posts: 1949
Location: Sacramento, CA, USA
litwr wrote:
I have found out that the 6800 family consists of several families of binary-incompatible processors. I could count up to 5 families:
1) 6800/6802/6808, 6801/6803 (backward compatible with the 6800);
2) 6800, 68HC08;
3) 6800, 68HC11;
4) 6804, 6805;
5) 6809, 6309 (backward compatible with the CMOS 6809).
This is not a complete list. I am not able to finish it; there are too many controllers.

I don't have any practical experience with it, but the 68HC12 looks like it would be my favorite cousin from that family. It lost the 6809/6309's U register, but in doing so gained an impressive amount of binary code efficiency with a carefully overhauled opcode matrix.
Quote:
The 68HC12 adds to and replaces a small number of 68HC11 instructions with new forms that are closer to the 6809 processor. More significantly it changes the instruction encodings to be far more dense and adds many 6809 like indexing features, some with even more flexibility. The net result is that code sizes are typically 30% smaller.

Posted: Tue Jan 11, 2022 5:50 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
Wow, 30% smaller code is very impressive!

