What if independent MOS Technology had survived?

litwr · Post by **litwr** » Wed May 03, 2017 6:06 am

BigDumbDinosaur wrote:

litwr wrote:

BTW I don't like 65816 PER instruction.

What do you have against PER? I find all sorts of use for that instruction, as it is the basis for developing fully relocatable code.

Quote:

PEI and PEA are just extenders to PUSH.

PEA and PEI are indispensable for creating stack frames prior to calling full reentrant subroutines. Just about all of my 65C816 programs use those instructions, along with more orthodox register pushes.

This idea of fully relocatable code is a chimeric descendant of the PDP-11/VAX-11 ISA. But even the PDP-11 and VAX-11 couldn't achieve it. Nobody needs this. It is impossible to use a program without a proper loader in a complex OS. IMHO PER is just a waste of opcode space. Did anybody use it?
On the contrary, PEA looks quite useful. I don't find PEI very useful; it is almost a shorthand for LDA (dp),y PHA DEY LDA (dp),y PHA. IMHO it would be better to store 2 adjacent zp-values to the stack like LDA zp,X PHA DEX LDA zp,X PHA.

BigDumbDinosaur wrote:

litwr wrote:

And anyway 6502+ would be much better because 6309 has a lot of bulky instructions, can use only 64 KB of memory, ...

...and the 65C816 is better still with 16 bit registers, 24 bit memory addressing, block copy instructions, stack relative addressing, fully relocatable direct page, support for high clock speeds (20 MHz if the rest of the circuit can handle it), etc.

6309 is also a mighty beast. It has 4 accumulators (2 of them are rather slow), fast division (for 32-bit dividend!) and multiplication, 16-bit ALU, auto increment and decrement for index registers, ... 65816 is much slower with arithmetic. The only advantages of 65816 are 24-bit address, higher frequency of produced chips, better support.

EDIT. It is also worth noting that there is a true OS for the 6809/6309: FLEX. The 6502 family missed this. The 6502 has only BASICs and Forth.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed May 03, 2017 7:02 am

Arlet wrote:

BigDumbDinosaur wrote:

Arlet wrote:

And while it's true that you don't need to load/unload the pointers in zeropage, it is quite painful to do pointer arithmetic, such as [B+N*X+offset] addressing, that's typically used in higher level languages. Anybody who disagrees is invited to write a 6502 version of my memory allocation challenge.

The pain is nowhere as severe with the 65C816 in native mode.

The '816 makes it easier to do 16 bit arithmetic, but if you want to use the full 24 bit address, it's still painful.

Painful in what way?

Arlet · Post by **Arlet** » Wed May 03, 2017 7:25 am

BigDumbDinosaur wrote:

Quote:

The '816 makes it easier to do 16 bit arithmetic, but if you want to use the full 24 bit address, it's still painful.

Painful in what way?

Because the index/accumulator registers are only 16 bits, so you still need multiple operations on memory for a simple address calculation. But I may be wrong, so if you disagree, I'm curious to see how you would do [B + N * X + offset] access using 24 bit addresses (B,X are variables/registers, and N,offset are constants, you may reduce X to 16 bit, since that would be the most common case).

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed May 03, 2017 7:33 am

litwr wrote:

I don't find PEI very useful; it is almost a shorthand for LDA (dp),y PHA DEY LDA (dp),y PHA.

PEI effectively performs the following:

Code:

         rep #%00100000        ;16 bit accumulator
         lda dirpag            ;somewhere in direct page
         pha                   ;push it

Despite the meaning of the PEI mnemonic, the instruction does not use indirection as you suggest.

PEA and PEI are two poorly-named mnemonics—a single mnemonic, such as PHW (push word), should have been used, with PHW # pushing its 16 bit operand (equivalent to PEA) and PHW <dp> pushing the 16 bit value at <dp> and <dp>+1 (equivalent to PEI). It would have been nice if these instructions had a complement that pulls a word and discards it, such as PLW, which would simplify stack housekeeping following a return from a subroutine whose invocation was preceded by the pushing of a stack frame, but they don't.

Quote:

IMHO it would be better to store 2 adjacent zp-values to the stack like LDA zp,X PHA DEX LDA zp,X PHA.

Even simpler if you are going to go that route is:

Code:

         lda zp+1
         pha
         lda zp
         pha

which avoids the penalty of manipulating .X, as well as the extra clock cycles consumed with indexed reads. However, PEI is twice as fast as my above example: assuming DP is page-aligned, it uses six clock cycles instead of the 12 clock cycles needed to read and push both direct page locations. Furthermore, PEI doesn't clobber the accumulator.

Quote:

6309 is also a mighty beast...The only advantages of 65816 are 24-bit address, higher frequency of produced chips, better support.

...not to mention lower power consumption and substantially lower interrupt latency.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed May 03, 2017 7:52 am

Arlet wrote:

BigDumbDinosaur wrote:

Quote:

The '816 makes it easier to do 16 bit arithmetic, but if you want to use the full 24 bit address, it's still painful.

Painful in what way?

Because the index/accumulator registers are only 16 bits, so you still need multiple operations on memory for a simple address calculation. But I may be wrong, so if you disagree, I'm curious to see how you would do [B + N * X + offset] access using 24 bit addresses (B,X are variables/registers, and N,offset are constants, you may reduce X to 16 bit, since that would be the most common case).

I can't speak for everyone who writes 65C816 assembly language, but I use 32 bit math to compute addresses, and simply ignore bits 24-31 in the result, as long as all those bits are zero (an easy test with the '816). Otherwise, it means constantly fiddling with the m bit in the status register to go between a narrow and wide accumulator. Since memory management involves integers and since a 32 bit quantity with bits 24-31 equal to %00000000 is effectively a 24 bit address, 32 bit unsigned integer math is practical, and is relatively succinct on the 65C816.

As part of developing my 816NIX filesystem, I've written routines that do 64 bit integer multiplication and division on the '816, which routines can be "dumbed down" to 32 bits to compute memory allocation values. The addition part is a sequence of two loads, additions and stores, which is plenty fast. Hence I see no problem with efficiently computing B + N * X + offset.
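
To make the "two loads, additions and stores" idea concrete, here is a rough Python model of computing B + N * X + offset with 32-bit math done 16 bits at a time, then checking that bits 24-31 are zero. It is only a sketch, not the 816NIX routines mentioned above: the names `add32` and `element_address` are invented for illustration, the multiply is a stand-in for a software multiply routine, and all inputs are assumed to be non-negative 32-bit values.

```python
MASK16 = 0xFFFF

def add32(a, b):
    """Add two 32-bit values 16 bits at a time, as a CLC / ADC / ADC sequence would."""
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16                      # carry out of the low word
    high = (a >> 16) + (b >> 16) + carry   # high words plus the carry
    return ((high & MASK16) << 16) | (low & MASK16)

def element_address(base, n, x, offset):
    """Return base + n*x + offset as a 24-bit address, or raise if it overflows."""
    product = (n * x) & 0xFFFFFFFF         # stand-in for a software multiply routine
    result = add32(add32(base, product), offset)
    if result >> 24:                       # bits 24-31 must all be zero
        raise ValueError("address does not fit in 24 bits")
    return result
```

For example, `element_address(0x01FFF0, 4, 8, 2)` crosses a bank boundary, which is exactly the case that a 16-bit index register alone cannot handle.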

Arlet · Post by **Arlet** » Wed May 03, 2017 8:14 am

BigDumbDinosaur wrote:

As part of developing my 816NIX filesystem, I've written routines that do 64 bit integer multiplication and division on the '816, which routines can be "dumbed down" to 32 bits to compute memory allocation values. The addition part is a sequence of two loads, additions and stores, which is plenty fast. Hence I see no problem with efficiently computing B + N * X + offset.

I think we differ in our threshold for pain. In my book, calling a subroutine to do a common array lookup address calculation is "painful", and so is a sequence of loads, additions and stores. On modern CPUs those calculations can be done in 1 or 2 instructions, using only a handful of bytes.

In practice, the 'N' value is often a small constant integer, such as 2, 4, 5, 10, or 16, so calling a multiply routine is a bit of overkill. On the other hand, inlining that kind of code without a multiply instruction or a barrel shifter isn't very attractive either.
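
As an illustration of what inlining costs for a small constant N, the following Python sketch multiplies by shifts and adds only, and counts how many single-bit shifts and additions the decomposition needs. The function names are made up for this example, and the cost model (one operation per single-bit shift, no barrel shifter) is a simplifying assumption rather than a cycle count for any particular CPU.

```python
def times_constant(x, n):
    """Multiply x by a non-negative constant n using only shifts and adds."""
    result = 0
    shifted = x                  # holds x * 2**k on the k-th pass
    while n:
        if n & 1:
            result += shifted    # this bit of n contributes x * 2**k
        shifted <<= 1            # one single-bit shift per pass
        n >>= 1
    return result

def shift_add_cost(n):
    """Return (single-bit shifts, additions) needed to multiply by n >= 1."""
    shifts = n.bit_length() - 1          # shifts needed to reach the top set bit
    adds = bin(n).count("1") - 1         # one add per set bit beyond the first
    return shifts, adds
```

Under this model a power of two such as 16 needs four shifts and no additions, while 10 needs three shifts and one addition plus somewhere to hold the intermediate value, which is where the inlined code starts to grow.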

GARTHWILSON · Post by **GARTHWILSON** » Wed May 03, 2017 8:42 am

In my '816 ITC Forth, I used PEI in nest (not NEXT) which gets used all the time:

Code:

nest:   PEI     IP      ; PEI IP replaces LDA IP , PHA here.  nest is the
        LDA     W       ; runtime code of : (often called DOCOL ). It is not
        INA_INA         ; really a Forth word itself per se-- but it is pointed
        STA     IP      ; to by the CFA of secondaries.
        GO_NEXT
 ;-------------------

I'm sure the Apple IIgs's ProDOS used PER.

Quote:

6309 is also a mighty beast. It has 4 accumulators (2 of them are rather slow), fast division (for 32-bit dividend!) and multiplication, 16-bit ALU, auto increment and decrement for index registers, ... 65816 is much slower with arithmetic. The only advantages of 65816 are 24-bit address, higher frequency of produced chips, better support.

If math performance really needs to be that hot and you can't use something like an ARM, consider:
The 816's 24-bit address space allows it to use large look-up tables in memory for fast 16-bit scaled-integer math; for example, you can get a sine or logarithm in 23 clocks (1.44µs @ 16MHz), or 35 clocks (2.19µs @ 16MHz) if you include the JSR & RTS pair, and it will be accurate to all 16 bits, this way:

Code:

                         ; Start with the input number in the 16-bit accumulator.                  number of clocks:
SIN:  ASL  A             ; Double the input number by shifting left one bit position,                     2
      STA  TBL_ADR       ; and store the low 16 bits in the DP variable that we'll use as a pointer.      4
      LDA  #SIN_TBL_BANK ; Get the bank number where the table starts.                                    3
      ADC  #0            ; If the ASL above left the C flag set, increment the bank number.               3
      STA  TBL_ADR+2     ; Store it in the bank byte of the pointer variable.                             4
      LDA  [TBL_ADR]     ; Read the sine value.  The two bytes of the answer will never straddle the      7
      RTS                ;                                                             bank boundary.
 ;------------------
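
For readers who want to check the pointer arithmetic and the clock totals, here is a small Python model of what the listing above computes. It is a sketch under stated assumptions: the table is assumed to start at offset $0000 of its bank (which the listing implies, since the low 16 bits of the pointer are just the doubled input), and the names `sine_table_pointer` and `CLOCKS` are invented here.

```python
def sine_table_pointer(value, table_bank):
    """Model ASL / STA / LDA #bank / ADC #0 / STA for a 16-bit input value."""
    doubled = value << 1          # ASL A: two bytes per table entry
    low16 = doubled & 0xFFFF      # what lands in TBL_ADR
    carry = doubled >> 16         # the C flag left by the ASL
    bank = table_bank + carry     # ADC #0 bumps the bank when C is set
    return (bank << 16) | low16   # the 24-bit address that LDA [TBL_ADR] reads

# Per-instruction clocks as listed in the comments of the routine above.
CLOCKS = [2, 4, 3, 3, 4, 7]
LOOKUP_CLOCKS = sum(CLOCKS)               # the lookup itself
CALL_CLOCKS = LOOKUP_CLOCKS + 6 + 6       # plus JSR and RTS at 6 clocks each
```

Dividing `LOOKUP_CLOCKS` and `CALL_CLOCKS` by 16 gives the times in microseconds at 16MHz quoted above.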

The 6809 takes more than twice as long to do a single multiplication; and it will take a lot of multiplications and divisions to calculate a logarithm. The natural logarithm is defined as:

ln(x) = 2*(z + z^3/3 + z^5/5 + z^7/7 + z^9/9 + ...) where z=(x-1)/(x+1)

...and it converges very slowly. There are various ways to approximate it, mostly by adjusting the range and adjusting the polynomial coefficients to make up for the fact that you want to minimize the number of terms you use. In the days of the Curta mechanical calculator, one way they'd jury-rig it is to take 11 successive square roots of x, subtract 1, then multiply the result by 2048. Here you can just look it up, and the most you'd have to do is a single multiplication or division to make the best of the range.
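
Both approaches are easy to try out. The Python sketch below implements the series as written above and the square-root trick, using floating point purely for illustration (a real '816 or Curta version would be fixed-point or manual); the function names are invented for this example.

```python
import math

def ln_series(x, terms):
    """Sum the first `terms` terms of 2*(z + z^3/3 + z^5/5 + ...), z = (x-1)/(x+1)."""
    z = (x - 1) / (x + 1)
    total = 0.0
    power = z                      # holds z**(2k+1)
    for k in range(terms):
        total += power / (2 * k + 1)
        power *= z * z
    return 2 * total

def ln_square_roots(x, roots=11):
    """Take `roots` successive square roots, subtract 1, multiply by 2**roots."""
    for _ in range(roots):
        x = math.sqrt(x)
    return (x - 1) * (1 << roots)  # 2**11 = 2048 for the default
```

Comparing `ln_series(x, terms)` against `math.log(x)` for x far from 1 shows how many terms the series needs there, since z approaches 1 and the powers shrink slowly. The square-root version works because x^(1/2048) is close to 1 + ln(x)/2048.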

Getting the 32-bit inverse of a 16-bit input takes the '816 only a little longer than what's shown above, and you can use it to divide by multiplying by the inverse.

BTW, I was not able to find a DIVide instruction in the 6809. I remember the one in the 68000 took an awful lot of clock cycles—something like 170.

Arlet · Post by **Arlet** » Wed May 03, 2017 8:51 am

Quote:

I was not able to find a DIVide instruction in the 6809. I remember the one in the 68000 took an awful lot of clock cycles—something like 170.

The nice thing about DIV/MUL instructions, even if they are slow, is that your old programs that use those will automatically get faster on newer processors.

litwr · Post by **litwr** » Wed May 03, 2017 4:40 pm

BigDumbDinosaur wrote:

Despite the meaning of the PEI mnemonic, the instruction does not use indirection as you suggest.

Sorry, that was my mistake; the name of this mnemonic confused me. Thank you for the information.

So PEI would be very useful for 6502+ too.

BigDumbDinosaur wrote:

...not to mention lower power consumption and substantially lower interrupt latency.

Why are you so sure that the 6809 or 6309 has higher power consumption and interrupt latency?

GARTHWILSON wrote:

BTW, I was not able to find a DIVide instruction in the 6809. I remember the one in the 68000 took an awful lot of clock cycles—something like 170.

65816 is much faster with arithmetic than 6809 even without big tables because it has 16 bit ALU and faster operations with memory. But don't confuse 6809 with later and much more powerful 6309. 65816 can't match 6309 division even with tables.
BTW I am still astonished by the achievement of Intel's engineers with the 80286 in 1982: it can perform the mentioned 68000 division in about 20 cycles. IMHO there were only three good CPU designs: 6502, x86 and ARM. Of course, 6502 was the best.

GARTHWILSON · Post by **GARTHWILSON** » Wed May 03, 2017 7:13 pm

litwr wrote:

GARTHWILSON wrote:

BTW, I was not able to find a DIVide instruction in the 6809. I remember the one in the 68000 took an awful lot of clock cycles—something like 170.

65816 is much faster with arithmetic than 6809 even without big tables because it has 16 bit ALU and faster operations with memory. But don't confuse 6809 with later and much more powerful 6309. 65816 can't match 6309 division even with tables.
BTW I am still astonished by the achievement of Intel's engineers with the 80286 in 1982: it can perform the mentioned 68000 division in about 20 cycles. IMHO there were only three good CPU designs: 6502, x86 and ARM. Of course, 6502 was the best.

Ah yes, I should have looked up the 6309, not 6809. So now I see the 6309 has a divide instruction, and that it takes a minimum of 25 clocks, so it takes more clocks to do just a divide than the '816 with tables takes to look up a trig or log function as I showed. Now consider also that the '816 can run at four times the clock rate of the 6309, and we have an '816 that can get a log function for example in about one-fifth the time it takes the 6309 to do the first of many divisions it will have to do to calculate the function, making the '816 potentially a hundred times as fast in this case. And again, the tables have every one of the 65,536 answers pre-calculated, accurate to the last bit, so there's no interpolation necessary.

Arlet · Post by **Arlet** » Wed May 03, 2017 7:17 pm

Quote:

So now I see the 6309 has a divide instruction, and that it takes a minimum of 25 clocks, so it takes more clocks to do just a divide than the '816 with tables takes to look up a trig or log function as I showed

But that's comparing apples to oranges. How long does the '816 take to do the same divide?

GaBuZoMeu · Post by **GaBuZoMeu** » Wed May 03, 2017 8:16 pm

I don't know whether I should cry or laugh - I merely wonder about this stuff here:

We all participate in this forum because we have some relationship to the bride named 6502. But in the same breath that we praise her beauty, we start to dream about amplifying her attributes... some of us don't even take a step back, take in the whole picture, and realize that the beauty is long gone.

my 2 cents

BigEd · Post by **BigEd** » Wed May 03, 2017 8:24 pm

GARTHWILSON wrote:

... I see the 6309 has a divide instruction, and that it takes a minimum of 25 clocks, so it takes more clocks to do just a divide than the '816 with tables takes to look up a trig or log function as I showed.

(It's worth noting that there are many ways to calculate log and trig, Taylor series being only one and not especially efficient. CORDIC, I think, doesn't use division. So, best not to assume that division is the limiting factor: it's slower than multiplication in every case I've seen, and so the people who design the algorithms for other calculations don't overuse it.)
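
As a rough illustration of the CORDIC point, here is a fixed-point Python sketch of rotation-mode CORDIC for sine and cosine. The loop uses only shifts, additions, subtractions and a small table lookup; the floating-point calls (including the divisions) appear only when the constants are built, which on real hardware would be done ahead of time. The names, the 14-bit fraction and the iteration count are arbitrary choices for this example, and the input angle is assumed to lie within roughly ±π/2 radians.

```python
import math

FRAC_BITS = 14
ONE = 1 << FRAC_BITS          # fixed-point representation of 1.0
ITERATIONS = 14

# Precomputed constants: atan(2**-i) in fixed point, and the CORDIC gain correction.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))
GAIN_FIXED = round(GAIN * ONE)

def cordic_sin_cos(angle_fixed):
    """Return (sin, cos) in fixed point for an angle given in fixed-point radians."""
    x, y, z = GAIN_FIXED, 0, angle_fixed
    for i in range(ITERATIONS):
        if z >= 0:    # rotate counter-clockwise by atan(2**-i)
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:         # rotate clockwise by atan(2**-i)
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y, x
```

Calling `cordic_sin_cos(round(angle * ONE))` and dividing both results by `ONE` gives values that can be compared with `math.sin` and `math.cos`.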

GARTHWILSON · Post by **GARTHWILSON** » Wed May 03, 2017 8:38 pm

Quote:

But that's comparing apples to oranges. How long does the '816 take to do the same divide?

What I'm saying is that a divide instruction is not always as valuable as the tables, which the 6809 and 6309 do not have the memory range to address. I haven't counted the cycles the '816 needs to do a division. If I do, it will be to start with a 32-bit dividend and a 16-bit divisor, and get a 16-bit quotient and a 16-bit remainder, which are important to me. This is apparently equivalent to the 6309's DIVQ instruction, which takes a minimum of 34 clock cycles. The fastest ASIC '816s run over 100MHz, which is about 30 times as high a clock rate as the fastest 6309s, so they'll run over a thousand clock cycles in the amount of time the 6309 takes to do a division. That's probably enough to match it in a division (while going way, way faster in the other stuff), but again, I haven't counted. I'm sure Bruce Clark (dclxvi on the forum, our algorithms expert) has. This level of speed is not important to me in my uses. I just want others to see what's possible. If I needed ARM speed, I guess I'd get an ARM. But I don't.

BigDumbDinosaur · Post by **BigDumbDinosaur** » Wed May 03, 2017 9:11 pm

litwr wrote:

BigDumbDinosaur wrote:

Despite the meaning of the PEI mnemonic, the instruction does not use indirection as you suggest.

Sorry, that was my mistake; the name of this mnemonic confused me. Thank you for the information.

So PEI would be very useful for 6502+ too.

As I said earlier, the naming of the PEA and PEI instructions was poorly thought out, as was the selection of mnemonics. The two instructions do the same thing, differing only in where they get the word that they push.

Quote:

Why are you so sure that the 6809 or 6309 has higher power consumption and interrupt latency?

Data sheets, my friend, data sheets.

Also, it is a well-known fact that the 6502's interrupt latency is one of the lowest of any microprocessor design.
