I have to correct my previous post. BCD multiplication and division are three or more times slower than their binary counterparts, because BCD data occupy about 50% more bytes. Addition and subtraction are about 50% slower for the same reason.
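To make the size argument concrete, here is a small Python sketch (my own illustration) that compares how many bytes a given binary range needs in plain binary and in packed BCD, which stores two decimal digits per byte.

```python
def bcd_bytes_for(max_value):
    """Packed-BCD bytes (two decimal digits per byte) needed to hold max_value."""
    digits = len(str(max_value))
    return (digits + 1) // 2

for width in (8, 16, 24, 32):
    max_value = 2 ** width - 1
    # 8-bit: 1 vs 2, 16-bit: 2 vs 3, 24-bit: 3 vs 4, 32-bit: 4 vs 5 bytes
    print(f"{width}-bit range: binary {width // 8} bytes, "
          f"packed BCD {bcd_bytes_for(max_value)} bytes")
```

For a 16-bit range the overhead is exactly 50%; for wider numbers it shrinks, since packed BCD spends 4 bits per digit where log2(10) ≈ 3.32 bits would suffice.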
GARTHWILSON wrote:
Since then of course the lines have been quite blurred. ARM is not a RISC.
Thank you very much for the valuable information and the links.
I agree that ARM and the 6502 are not RISC. IMHO the terms RISC, CISC, ARA are more or less meaningless slogans. The Wikipedia page about the ARM-based Acorn Archimedes (https://en.wikipedia.org/wiki/Acorn_Archimedes) contains this "information": "since the 68000 is a CISC and ARM2 is a RISC architecture, the 68000 could execute more complex instructions in one step, while the ARM2 must do it in several steps. So, depending on the task and the code, the 68000 may outperform the ARM2 in several cases". IMHO the 68000 can outperform the ARM2 only in BCD operations.
I would add that many ARM instructions are so complex that the 68000 needs 2 to 5 of its own instructions to emulate one of them.
GARTHWILSON wrote:
the 6502's decimal-mode operations don't consume any space on the op-code table except SED and CLD.
The 6502's decimal mode consumes only two opcodes, but it also takes some silicon area that could have been used for one or two additional instructions...
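For reference, here is a simplified Python model of what ADC has to do when D=1: a nibble-by-nibble add where any digit above 9 gets a +6 correction and a carry into the next digit. It is my own sketch and assumes valid packed-BCD operands; it computes only A and C and ignores the N, V and Z flags, whose decimal-mode behaviour differs between NMOS and CMOS parts.

```python
def adc_decimal(a, b, carry):
    """Simplified 6502 ADC in decimal mode; a and b must be valid packed BCD bytes.
    Returns (new accumulator, carry out)."""
    lo = (a & 0x0F) + (b & 0x0F) + carry
    half_carry = 0
    if lo > 9:                  # low digit overflowed: +6 skips codes A-F
        lo = (lo + 6) & 0x0F
        half_carry = 1
    hi = (a >> 4) + (b >> 4) + half_carry
    carry_out = 0
    if hi > 9:                  # high digit overflowed: same correction
        hi = (hi + 6) & 0x0F
        carry_out = 1
    return (hi << 4) | lo, carry_out

print(hex(adc_decimal(0x19, 0x28, 0)[0]))  # 0x47, i.e. 19 + 28 = 47
```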
GARTHWILSON wrote:
I seem to remember a DIV instruction took over 170 cycles, and that interrupts had to wait.
I've checked the tables: the 68000's DIVU instruction takes 144 cycles. The best-known 6502 division routine (32-bit dividend, 16-bit divisor) takes about 700 cycles and occupies about 1 KB of memory; its looped version is only about 100 bytes long but takes about 850 cycles. Is an interrupt delay of 144 cycles really a problem for an 8 MHz 68000? That is only 18 µs.
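Looped 6502 division routines are typically shift-and-subtract (restoring) division, one iteration per quotient bit. Below is a Python model of that technique for a 32-bit dividend and a 16-bit divisor; it is my sketch of the general idea, not a transcription of any particular 6502 routine. Like such routines, it needs the quotient to fit in 16 bits, i.e. the high word of the dividend must be smaller than the divisor.

```python
def udiv32_by16(dividend, divisor):
    """Restoring division: 32-bit dividend / 16-bit divisor -> (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    rem = dividend >> 16          # high word starts as the partial remainder
    quo = dividend & 0xFFFF       # low word is shifted out as quotient bits come in
    if rem >= divisor:
        raise OverflowError("quotient does not fit in 16 bits")
    for _ in range(16):           # one iteration per quotient bit
        # shift the 32-bit pair rem:quo left one bit (rem may briefly hold 17 bits)
        rem = (rem << 1) | (quo >> 15)
        quo = (quo << 1) & 0xFFFF
        if rem >= divisor:        # trial subtraction succeeds: quotient bit is 1
            rem -= divisor
            quo |= 1
    return quo, rem

print(udiv32_by16(1000000, 1000))  # (1000, 0)
```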
GARTHWILSON wrote:
Both the 6800 and the '02 at 1MHz routinely outperformed a 4MHz Z80 in benchmarks in BASIC.
Is there any proof? I'm aware of http://www.hpmuseum.org/cgi-sys/cgiwrap/hpmuseum/archv017.cgi?read=120687, but IMHO it gives very odd results, close to absurd, such as the Z80 being faster than the 80188. It uses different algorithms, multi-level emulation and other things that make an objective comparison almost impossible.
I worked with the Amstrad CPC (Z80 at an effective 3.2 MHz) and PCW models. I have to say that the Amstrad CPC ROM BASIC is 2 to 3 times faster than Commodore BASIC (on a 1.1 MHz 6502). I also have testimony from commercial game programmers in the USA, who wrote code for both the 6502 and the Z80, that the 6502 is generally about 2.2 to 2.4 times faster. My project http://litwr2.atspace.eu/xlife/retro/xlife8.html shows about the same ratio.
GARTHWILSON wrote:
My 65816 Forth is two to three times as fast as my '02 Forth again at a given clock speed.
My measurements show that the 65816 in 16-bit mode is less than 50% faster at arithmetic, 100% faster at memory block transfers (thanks to MVN/MVP), and no faster at branches. The 65816 has powerful stack-relative addressing modes, so for a stack-based Forth they should give a significant speed boost, maybe even more than 100%. A smaller boost applies to programming languages that use subroutines with formal parameters and local variables. However, work with byte tables is faster in 8-bit mode, and the 65816 has no advantage over the 6502 in this area.
IMHO the 65816 was developed in a somewhat extensive way, which makes it a bit like an ugly Z80. It could have been faster on instructions like CLC, TXA, ..., as the later 65CE02/4510 are. XBA, MVP, ... could also have been faster.
I'd like to find an opportunity to study the 32016, the first 32-bit CPU... B-em, a BBC Micro emulator, has some support for it, though I doubt its accuracy.
GARTHWILSON wrote:
An increment-by-two and decrement-by-two would be nice.
It would be better to have the 68000's ADDQ and SUBQ, which add or subtract an immediate value from 1 to 8.
GARTHWILSON wrote:
It is faster than the 68000's DIV instruction.
The 68000 can also use tables.
I got a surprising result with the R800 (the fast Z80-compatible CPU of the MSX turboR, about 3 to 4 times faster than the original Z80). The R800 has fast hardware multiplication (36 cycles for 16-bit operands), but with a 768-byte table I could get a 35-cycle multiplication by a constant factor.
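The post doesn't describe how the 768-byte table is organised. One plausible layout (an assumption on my part) is three 256-byte tables holding bytes 0, 1 and 2 of b*K for every byte value b, so that a 16-bit x times a 16-bit constant K becomes two lookups and an add: x*K = T[lo(x)] + 256*T[hi(x)]. A Python model of that layout, with K = 1234 as a hypothetical constant:

```python
K = 1234  # hypothetical constant factor; must fit in 16 bits

def make_tables(k):
    """Three 256-byte tables: bytes 0, 1 and 2 of b*k for b = 0..255."""
    products = [b * k for b in range(256)]       # each fits in 24 bits
    return (bytes(p & 0xFF for p in products),
            bytes((p >> 8) & 0xFF for p in products),
            bytes((p >> 16) & 0xFF for p in products))

def mul_by_const(x, tables):
    """Multiply a 16-bit x by the constant baked into `tables`."""
    t0, t1, t2 = tables
    def lookup(b):  # rebuild the 24-bit product b*k from the three tables
        return t0[b] | (t1[b] << 8) | (t2[b] << 16)
    return lookup(x & 0xFF) + (lookup(x >> 8) << 8)

tables = make_tables(K)
print(sum(len(t) for t in tables))   # 768
print(mul_by_const(40000, tables))   # 49360000
```

On a real Z80/R800 the final line would be a short multi-byte addition; the point is that no multiply instruction is needed at all.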
IMHO the fastest division can be achieved with logarithmic tables. I suspect that the 80186 uses them.
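The idea behind logarithmic division is that a/b = 2^(log2 a - log2 b), so a division becomes two table lookups, a subtraction and one more lookup. Below is a minimal Python sketch for 8-bit operands; the table resolution (32 steps per octave) and the 8.8 fixed-point result are my own arbitrary choices. The result is only approximate (within about 2.5% here), so an exact integer quotient would need a correction step. This illustrates the principle only and is not a claim about the 80186's internals.

```python
import math

STEPS = 32  # table entries per octave (arbitrary choice)

# LOG[n] ~ STEPS * log2(n) for n = 1..255 (LOG[0] is unused)
LOG = [0] + [round(STEPS * math.log2(n)) for n in range(1, 256)]
# EXP[d] ~ 256 * 2**(d / STEPS): antilog table giving an 8.8 fixed-point result
EXP = [round(256 * 2 ** (d / STEPS)) for d in range(LOG[255] + 1)]

def log_div(a, b):
    """Approximate a/b for 1 <= b <= a <= 255, returned in 8.8 fixed point."""
    return EXP[LOG[a] - LOG[b]]

print(log_div(200, 7) / 256)   # close to 200/7 = 28.57...
```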
GARTHWILSON wrote:
Earlier related topics include:
I have just browsed all these topics. Thanks again.
IMHO an LSR4 instruction would be worth mentioning too. The 6502 requires four LSRs to get the high nibble; the low nibble can be obtained with AND #15. So what is the purpose of SWN (swap nibbles)? It would be better to have the Z80's RLD, which allows a fast 4-bit shift of a sequence of bytes.
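To show why RLD is attractive, here is a small Python model (my own sketch) of the 6502-style nibble extraction next to the Z80's RLD and a 4-bit left shift of a multi-byte number built from it. The byte list is assumed to be stored least significant byte first, so the nibble shifted out of each byte flows through A into the next one.

```python
def high_nibble(b):
    for _ in range(4):   # 6502: LSR A, four times
        b >>= 1
    return b

def low_nibble(b):
    return b & 15        # 6502: AND #15

def rld(a, m):
    """Z80 RLD: (HL) low -> (HL) high, (HL) high -> A low, A low -> (HL) low."""
    new_m = ((m << 4) & 0xF0) | (a & 0x0F)
    new_a = (a & 0xF0) | (m >> 4)     # A's high nibble is unaffected
    return new_a, new_m

def shift_left_4(buf):
    """Shift a multi-byte number (least significant byte first) left by 4 bits.
    Returns the new bytes and the nibble shifted out of the top."""
    a = 0                # A's low nibble is the nibble shifted in (0 here)
    out = []
    for m in buf:
        a, m = rld(a, m)
        out.append(m)
    return out, a & 0x0F

print([hex(x) for x in shift_left_4([0x34, 0x12])[0]])  # ['0x40', '0x23']
```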
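I have another, somewhat ignorant, question. Why is the behaviour of JMP ($xxFF) often considered a 6502 bug? On the NMOS 6502 the high byte of the target is fetched from $xx00 instead of from the start of the next page. IMHO this is very good for working with tables, because it allows a jump table to occupy exactly one page. For example, the following code provides jumps for 128 odd values: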
Code:
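ldx value          ; value must be odd: 1, 3, ..., 255
stx mjmp+1         ; self-modifying: patch the low byte of the JMP operand
mjmp jmp (divjmp)  ; divjmp must start at $xx00; this code must be in RAM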
The divjmp table occupies exactly one page on the 6502, but on the 65C02/65816 it requires an ugly misaligned 256-byte block ($xx01 to $(xx+1)00)...
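A small Python model makes the page-wrap behaviour and the one-page table explicit. It is my own sketch: the table page ($3000) and the handler addresses are arbitrary, hypothetical values; `nmos=True` models the NMOS 6502's JMP ($xxFF) behaviour and `nmos=False` the 65C02/65816 fix.

```python
def jmp_indirect(mem, ptr, nmos=True):
    """Return the target address of JMP (ptr)."""
    lo = mem[ptr]
    if nmos:
        # NMOS 6502: the high byte comes from the same page, so $xxFF wraps to $xx00
        hi_addr = (ptr & 0xFF00) | ((ptr + 1) & 0x00FF)
    else:
        # 65C02/65816: the high byte comes from ptr+1, crossing into the next page
        hi_addr = (ptr + 1) & 0xFFFF
    return lo | (mem[hi_addr] << 8)

PAGE = 0x3000                                            # hypothetical page of divjmp
targets = {v: 0x8000 + 4 * v for v in range(1, 256, 2)}  # hypothetical handlers

mem = bytearray(0x10000)
used = set()
for v, target in targets.items():
    lo_off, hi_off = v, (v + 1) & 0xFF   # entry for v uses offsets v and v+1 (mod 256)
    mem[PAGE + lo_off] = target & 0xFF
    mem[PAGE + hi_off] = target >> 8
    used.update((lo_off, hi_off))

print(len(used))                                        # 256: the page is exactly full
print(hex(jmp_indirect(mem, PAGE + 0xFF)))              # 0x83fc on NMOS
print(hex(jmp_indirect(mem, PAGE + 0xFF, nmos=False)))  # 0xfc on 65C02 ($3100 is 0 here)
```

The odd offsets hold the low bytes and the even offsets (including $00 for value 255) hold the high bytes, which is why 128 entries tile the page with no gap only when the wrap-around happens.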