slightly OT: a simple Benchmark

GaBuZoMeu · Post by **GaBuZoMeu** » Wed Jul 04, 2018 11:14 pm

barrym95838 wrote:

GaBuZoMeu wrote:

Thank you Mike

657 s = 11 minutes - too much for staying unnoticed in a computer shop

...and also too long for me to pay attention to the clock. The number didn't look like it fit in with the rest, so I re-ran it and discovered that I lost two minutes somewhere. The "real" result is C=777 for Applesoft. Sorry about the error ...

Mike B.

No matter - I took this occasion to re-sort the table, grouping the 6502s together and pushing the fast ones down, and to add some more results for the RasPi.

GaBuZoMeu · Post by **GaBuZoMeu** » Wed Jul 04, 2018 11:31 pm

John West wrote:

This will be a good test for my 65020. I'll have a go at implementing both algorithms tonight. The simulator gives an approximate cycle count, so we can do hand-wavy comparisons.

That was fun. I've got a 65020 translation of the second Pascal version running. The source is below, although it isn't pretty.

The times are only estimates, and it's very possible it's counting them wrong. I've assumed that mul, div, and mod take the usual cycles to fetch opcode and operands, plus one cycle per bit (that doesn't sound unreasonable for mid-1980s technology). The C640 will probably end up running at 5MHz (which also doesn't sound unreasonable for an improved Commodore 64), and I've translated the cycle counts with that assumption.

A: 194368 cycles = 39ms
B: 331858 cycles = 66ms
C: 4603325 cycles = 0.92s
D: 12183830 cycles = 2.44s
E: 23183225 cycles = 4.64s
F: 700264537 cycles = 140.05s
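The conversion from cycle counts to times quoted above is a plain division by the clock rate. A minimal sketch, assuming the 5MHz clock mentioned in the post (the names `CYCLES`, `CLOCK_HZ` and `cycles_to_seconds` are only illustrative):

```python
# Cycle counts quoted above for the 65020 simulator runs A..F.
CYCLES = {
    "A": 194368,
    "B": 331858,
    "C": 4603325,
    "D": 12183830,
    "E": 23183225,
    "F": 700264537,
}

CLOCK_HZ = 5_000_000  # assumption from the post: the C640 runs at 5MHz


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to seconds at the given clock frequency."""
    return cycles / clock_hz


if __name__ == "__main__":
    for run, cycles in CYCLES.items():
        print(f"{run}: {cycles} cycles = {cycles_to_seconds(cycles):.3f} s")
```

Changing `CLOCK_HZ` rescales all six times at once, which is handy if the final clock turns out to differ from 5MHz.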

I am highly impressed.

Somehow I'm running out of words.

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 1:13 am

Well, I finally located and excised the bugs in my assembly routine. Too tired to try it in BeebEm tonight, but it should run well tomorrow.

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 12:07 pm

If the cycle counting in my own emulator is correct, at 4MHz my assembly implementation should beat the Amiga 2000 entries in the table! That's a CPU with hardware divide instructions and, presumably, a much higher clock speed. Makes me wonder what some of those compilers are up to.

Code: Select all

macbook-pro:lib6502++ chromi$ ./test6502 primegap.bin 4000 4498 | fgrep RTS
4122: RTS             ; 340071 cyc
4122: RTS             ; 945137 cyc
4122: RTS             ; 11181386 cyc
4497: RTS             ; 42165323 cyc
4497: RTS             ; 113473325 cyc
4497: RTS             ; 3767409738 cyc

John West · Post by **John West** » Thu Jul 05, 2018 12:41 pm

Chromatix wrote:

If the cycle counting in my own emulator is correct, at 4MHz my assembly implementation should beat the Amiga 2000 entries in the table! That's a CPU with hardware divide instructions and, presumably, a much higher clock speed. Makes me wonder what some of those compilers are up to.

It's a 68000 at 7MHz, so not much higher. And the 68000 had a fairly slow memory interface, taking (from memory) 4 cycles per access. The compiler probably isn't doing any optimisation beyond a simple peephole pass.

Your cycle counts are impressive. I've got some work to do.

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 12:48 pm

Actual timings from the emulated Second Processor:

Code: Select all

A: 0.11
B: 0.2
C: 3.4
D: 10.27
E: 23.65
F: 1212.3

I suspect BeebEm is using NMOS cycle counts even when emulating a 65C02. Still, these numbers can be compared directly with the HiBASIC numbers from the same emulator.

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 1:05 pm

John West wrote:

It's a 68000 at 7MHz, so not much higher. And the 68000 had a fairly slow memory interface, taking (from memory) 4 cycles per access. The compiler probably isn't doing any optimisation beyond a simple peephole pass.

If it's an *original* 68000, and not one of the later pipelined versions... turns out DIVU takes 140 cycles. Ouch. But memory access shouldn't be a factor, since all the required data fits in the 68K's generous register bank, except for the instructions themselves. If the compiler is still keeping variables in RAM, then that's the sort of inefficiency I'm talking about.

I'm reminded of the Modula-2 compiler used for developing ARX, which would emit dozens of instructions and *then* a subroutine call to perform a simple multiply, on an ARM CPU that had loads of registers *and* hardware multiply. Eventually a competitive demo was held between ARX and the MOS-derived Arthur, with Arthur running on a much smaller and cheaper configuration of the prototype Archimedes - Arthur won handily, and ARX was promptly cancelled.

GaBuZoMeu · Post by **GaBuZoMeu** » Thu Jul 05, 2018 1:05 pm

Chromatix wrote:

If the cycle counting in my own emulator is correct, at 4MHz my assembly implementation should beat the Amiga 2000 entries in the table! That's a CPU with hardware divide instructions and, presumably, a much higher clock speed. Makes me wonder what some of those compilers are up to.

Code: Select all

macbook-pro:lib6502++ chromi$ ./test6502 primegap.bin 4000 4498 | fgrep RTS
4122: RTS             ; 340071 cyc
4122: RTS             ; 945137 cyc
4122: RTS             ; 11181386 cyc
4497: RTS             ; 42165323 cyc
4497: RTS             ; 113473325 cyc
4497: RTS             ; 3767409738 cyc

In this case I can simply divide your cycle counts by 1,000,000 and get the execution time in seconds for a 1 MHz CPU - that would make comparisons to other results easier. Did you use 65C02 instructions? Perhaps you could attach a source file - a listing might be too long?

The 68000 has a different clock scheme than the 6502. A cycle took 4 clocks (IIRC the TAS instruction took 6). I don't look at the clock speed anymore - it isn't helpful. I think it is better to ask for the required memory speed (access time). Using the required memory speed (e.g. 500ns for a 1 MHz 6502 system) you can compare various clock architectures easily. The Z-80 took 4 cycles minimum (6 for opcode fetch), the TMS 9900 requires 3 cycles (with 4 non-overlapping phases), an RCA1802 uses 4 clocks for each memory cycle. Running a TMS-9900 @ 3 MHz requires roughly 600ns RAMs, similar to a MC-68K with 4 MHz. With an equal memory speed requirement as a calculation base you may then compare different CPUs fairly. So 1 MHz for 6502 corresponds to 3 MHz for TMS9900, 4 MHz for RCA1802 or MC68000. I have no databook at hand for the Z-80, but I assume a 500ns RAM would work for clock speeds around 3.3 MHz (perhaps 3.0 MHz only).
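The equal-memory-speed comparison can be sketched as a simple scaling of clock rates. The ratios below are only the rough correspondences given in this post (1 MHz 6502 = 3 MHz TMS9900 = 4 MHz RCA1802 or MC68000), not data-sheet values, and the Z-80 is left out because its figure is only a guess here. The function names are made up for illustration.

```python
# Clock multiple relative to a 6502 for roughly the same RAM speed requirement.
# Assumption: these are the ratios stated in the post, not data-sheet figures.
CLOCK_RATIO_VS_6502 = {
    "6502": 1,
    "TMS9900": 3,
    "RCA1802": 4,
    "MC68000": 4,
}


def equivalent_clock_mhz(cpu, mhz_6502=1.0):
    """Clock a given CPU needs to match the memory speed of a 6502 at mhz_6502."""
    return mhz_6502 * CLOCK_RATIO_VS_6502[cpu]


def normalised_to_6502_mhz(cpu, clock_mhz):
    """Express a CPU's clock as the 6502 clock with the same memory speed."""
    return clock_mhz / CLOCK_RATIO_VS_6502[cpu]


if __name__ == "__main__":
    for cpu in CLOCK_RATIO_VS_6502:
        print(f"{cpu}: {equivalent_clock_mhz(cpu):.1f} MHz matches a 1 MHz 6502")
    # The 7MHz 68000 mentioned earlier in the thread, on the same scale:
    print(f"MC68000 @ 7 MHz ~ 6502 @ {normalised_to_6502_mhz('MC68000', 7.0):.2f} MHz")
```

On this scale the 7MHz 68000 has the memory speed of a 6502 at 1.75 MHz, which says nothing yet about how much work each CPU does per memory cycle.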

Cheers,
Arne

BigEd · Post by **BigEd** » Thu Jul 05, 2018 1:08 pm

Chromatix wrote:

... turns out DIVU takes 140 cycles. Ouch.

Ouch indeed. Sounds like one of those CISCy situations where a hand-coded DIVIDE (or MOD) might outperform the instruction.

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 1:16 pm

Actually no, because all instructions take multiple cycles on the 68000. The later pipelined versions (68020 onwards) are vastly faster in that respect.

The 68000 implements only a 32/16 bit divide, so 140 cycles corresponds to about 8-9 cycles per quotient bit. A simple 16-bit add, subtract or compare takes 4 cycles, and you need more than two such instructions to implement one stage of division.
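To see why one stage of division needs several simple instructions, here is a minimal sketch of a restoring shift-and-subtract 32/16 divide in Python. It is not the 68000's microcode and not anyone's posted routine; the function name is invented, and returning `None` for divide-by-zero or quotient overflow is purely an illustrative choice. For reference, 140 cycles over 16 quotient bits is 8.75 cycles per bit.

```python
def divu_32_16(dividend, divisor):
    """Unsigned 32-bit / 16-bit divide giving a 16-bit quotient and remainder.

    Returns (quotient, remainder), or None if the divisor is zero or the
    quotient would not fit in 16 bits (illustrative error handling only).
    """
    if divisor == 0:
        return None
    remainder = (dividend >> 16) & 0xFFFF
    if remainder >= divisor:
        return None  # quotient would need more than 16 bits
    low = dividend & 0xFFFF
    quotient = 0
    for _ in range(16):  # one pass per quotient bit
        # Shift: move the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((low >> 15) & 1)
        low = (low << 1) & 0xFFFF
        quotient <<= 1
        # Compare, then subtract if the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


if __name__ == "__main__":
    print(divu_32_16(1000, 7))
```

Each loop pass contains a shift, a compare and a conditional subtract, which matches the point that more than two 4-cycle instructions are needed per quotient bit.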

BigEd · Post by **BigEd** » Thu Jul 05, 2018 1:25 pm

I suppose that's a good thing then - the DIVU does earn its keep after all.

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 1:25 pm

Quote:

Did you use 65C02 instructions? Perhaps you could attach a source file - a listing might be too long?

Yes, I assumed a 65c02, so for example the setup and cleanup phases use STZ and DEC A. However I think the bulk of the code is NMOS-clean, so it should be possible to convert without losing much performance.

NB: the cycle counts are all from the beginning of the simulation and therefore *cumulative*. You'll have to subtract the previous one from all but the first to get the timing for that particular gap run.

I've attached the source code.
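This is not the attached source, just a minimal sketch of the subtraction described in the NB: each run's cycle count is the cumulative total minus the total reported at the previous RTS. The names are illustrative, and the 1 MHz conversion follows the suggestion made earlier in the thread.

```python
# Cumulative cycle counts printed at each RTS by the emulator run above.
CUMULATIVE = [340071, 945137, 11181386, 42165323, 113473325, 3767409738]


def per_run_cycles(cumulative):
    """Turn cumulative totals into per-run counts by subtracting the previous total."""
    runs = []
    previous = 0
    for total in cumulative:
        runs.append(total - previous)
        previous = total
    return runs


if __name__ == "__main__":
    for label, cycles in zip("ABCDEF", per_run_cycles(CUMULATIVE)):
        # At 1 MHz one cycle is one microsecond, so seconds = cycles / 1,000,000.
        print(f"{label}: {cycles} cycles = {cycles / 1_000_000:.3f} s @ 1 MHz")
```

Only the immediately preceding total is subtracted each time, so the per-run counts add back up to the final cumulative figure.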

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 1:35 pm

BigEd wrote:

I suppose that's a good thing then - the DIVU does earn its keep after all.

Indeed. For comparison, my 6502 code for a 24/16 division, discarding the quotient, takes 43 cycles worst-case and 23 cycles best-case to handle each bit - and it has to produce the same number of (virtual) quotient bits as the 68000. The 16/8 division routine takes 13-15 cycles per bit. So with the clock speed advantage, the 68000 should easily be outpacing me.

Chromatix · Post by **Chromatix** » Thu Jul 05, 2018 7:57 pm

When interpreting cycle counts as a measure of time spent, there are a number of caveats to consider - not the least of which is whether my own emulator's cycle counts are in fact accurate. I haven't yet got around to verifying that in detail, though I have tried to model all the 65C02 timing quirks I know about.

Many 6502 based machines avoid stalling the CPU for DMA accesses by performing the latter during the Phi1 cycle, effectively driving the RAM at twice the speed of the CPU. This is true of the BBC Micro, at least, which only modifies the normal 2MHz CPU timing by "stretching the clock" while accessing addresses in the I/O window ($FC00-$FEFF), known as the "1MHz bus". The Second Processor doesn't have even this modification of the clock, and always runs at full speed with no interruptions.

The C64 is an example of a machine that *doesn't* fit this model. Most of the VIC-II's accesses are done during Phi1, but it also needs to use the Phi2 cycles during certain scanlines, known as "bad lines" in the demoscene. It asserts the appropriate 6502 signal 3 cycles before these "cycle stealing" accesses begin, because the NMOS 6502 ignores that signal during writes (but it only ever performs 3 writes in a row). So the effective performance of the C64's CPU is somewhat less than its clock speed would normally imply - and it's only about 1MHz to begin with.

It seems that Z80 based machines are much more prone to "cycle stealing" as an alternative to staying out of the CPU's way. Some of the early Sinclair ZX machines were particularly notorious in this respect. The effect of this can be relatively benign (extending the timing of each instruction by a cycle or three at random) or catastrophic (completely halting the CPU except in the CRT blanking intervals). But some machines have a clean separation between the CPU and graphics memory buses, and achieve correspondingly better and more predictable performance.

I'm not familiar enough with the early 6502 micros in the table to know how much they're affected by the above.

Of course, now I'm looking at the CoCo 3 results and thinking that a 6809 - and particularly a 6309 - should do that much better. The 6309 can run at up to 5MHz reliably and consumes far fewer cycles in its divide instruction - yes it has one! - than the 68000. So let's see if I can find a cycle-accurate emulator to hack around on...

GaBuZoMeu · Post by **GaBuZoMeu** » Thu Jul 05, 2018 8:53 pm

@Chromatix: There is a discrepancy in the numbers you have given between the cycle counts and the times in seconds from simulation.

Code: Select all

0           cycles      [s]@4MHz  Sim-T   Sim-T/4MHz
340071      340071      0,0850    0,11    1,294
945137      605066      0,1513    0,2     1,322
11181386    9896178     2,4740    3,4     1,374
42165323    29698729    7,4247    10,27   1,383
113473325   58841408    14,7104   23,65   1,608   <===
3767409738  3599304496  899,8261  1212,3  1,347

It seems the time or cycle counts in the 5th row are wrong. Second, there is a roughly constant factor of 1.33 between the numbers. Any idea?

Cheers
Arne
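One way to cross-check the table is to recompute the ratio with each run's cycle count taken as the difference between consecutive cumulative totals, as the NB about cumulative counts describes. This is only a sketch under that assumption, with invented names. Computed this way, all six ratios fall between roughly 1.29 and 1.33 and the fifth row no longer stands out; the per-run column in the table appears to subtract every earlier total rather than just the preceding one.

```python
CUMULATIVE_CYCLES = [340071, 945137, 11181386, 42165323, 113473325, 3767409738]
EMULATOR_SECONDS = [0.11, 0.2, 3.4, 10.27, 23.65, 1212.3]  # BeebEm timings A..F
ASSUMED_CLOCK_HZ = 4_000_000  # the 4MHz assumed for the cycle-count estimate


def timing_ratios(cumulative, measured, clock_hz):
    """Ratio of measured time to the time predicted from per-run cycle counts."""
    ratios = []
    previous = 0
    for total, seconds in zip(cumulative, measured):
        predicted = (total - previous) / clock_hz  # this run only
        ratios.append(seconds / predicted)
        previous = total
    return ratios


RATIOS = timing_ratios(CUMULATIVE_CYCLES, EMULATOR_SECONDS, ASSUMED_CLOCK_HZ)

if __name__ == "__main__":
    for label, ratio in zip("ABCDEF", RATIOS):
        print(f"{label}: Sim-T / predicted = {ratio:.3f}")
```

The shortest runs are the least reliable here, since their emulator timings are given to only one or two significant digits.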
