Instructions that I missed
The 6502 was a nice processor to program, but looking around at the alternatives of the time (Z80, 6809, etc.) there were a few things I missed. Interestingly, the clever double-width accumulator of the 6809 and the similar 16-bit registers of the Z80 were not among them. Rather, DJNZ was something I would have liked. Inner loops could never get tight enough, so something that decremented X or Y and branched, preferably with a backward bias, in a single instruction was something I wanted. If anyone is about to improve the 6502 instruction set, I'd like to hear your view on such end-of-loop instructions.
Steve Wozniak created SWEET16, which I had thought would provide the necessary 16-bit functionality and also a transition to a 16-bit processor. So why was SWEET16 never implemented in hardware? Like the 6502, it was elegant and orthogonal and required no prefix codes. It seems to me like a lost opportunity.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Re: Instructions that I missed
You could just use the 65816. Although the high address byte (A16-A23) may look scary, you don't have to latch, decode, or use that high address byte to still get a ton of benefits. The code length is dramatically shortened anytime you need to deal with 16-bit values. See an example here. Keep in mind however that the 6502 out-benchmarked the Z80 even though the Z80 had more and wider registers, and ran at higher clock rates.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Instructions that I missed
Quote:
something that decremented X or Y and branched [...] in a single instruction was something I wanted
One caveat is that it has to be more than just a convenience for producing pretty code. IOW I'd want to be certain there's a bottom line saving in clock cycles. (Did the Z80's DJNZ live up to this?) At 5 cycles, the DEY BNE sequence is already pretty fast, even though not as sexy as a single instruction equivalent. In the spirit of discussion, what would be the costs and benefits of separately optimizing the DEY instruction and the BNE instruction? The overall payoff would be broader...
cheers
Jeff
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html
Re: Instructions that I missed
I could've done with a 2 cycle instruction to swap the top and bottom 4 bits of the accumulator (SWap Nybble?). Would've saved on sequences like
Code:
ASL A
ASL A
ASL A
ASL A
which could just have become
Code:
SWN A
AND #$F0
I guess it would've been a whole new special path in the random control logic, as it's not remotely like any other kind of operation, so I can see why it wasn't there.
Re: Instructions that I missed
GARTHWILSON wrote:
You could just use the 65816. Although the high address byte (A16-A23) may look scary, you don't have to latch, decode, or use that high address byte to still get a ton of benefits. The code length is dramatically shortened anytime you need to deal with 16-bit values.
Quote:
See an example here. Keep in mind however that the 6502 out-benchmarked the Z80 even though the Z80 had more and wider registers, and ran at higher clock rates.
Re: Instructions that I missed
Collapsing multiple instructions into one, esp. with something like incrementing or decrementing has its caveats. For example Motorola got it wrong in the 68000 where the postincrement/predecrement of index registers was broken when for example a bus error ABORTed the opcode. Collapsing dec/inc and branch should be fine in this respect though.
However, if you start combining INX with BNE, you could also combine INX with BEQ, and INY instead of INX and maybe even a memory location, and other branches.... You quickly clutter your opcode space.
Here is a set of opcodes I defined for my 65002: http://www.6502.org/users/andre/65k/specsprog.html
BLT - Branch Less Than (C=0 or Z=1)
BGT - Branch Greater Than (C=1 and Z=0)
RDL - Rotate Direct Left (rotate without carry, i.e. bit 7 is directly moved to bit 0)
RDR - Rotate Direct Right (rotate without carry)
ASR - arithmetic shift right (shift in the sign from the left)
SWP - swap upper and lower nibble
INV - two's complement
BCN - bit count, compute number of 1-bits
SXY - swap X and Y
PSH - push all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
PLL - pull all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
FIL - fill a memory area with a byte value
ADS - add value to stack pointer
SBS - subtract value from stack pointer
MVN/MVP - move a memory area (see 65816)
HBS - determine bit number of highest bit that is set (like log2)
HBC - determine bit number of highest bit that is clear
BSW - bit swap - exchange bit 7 with bit 0, bit 6 with bit 1, etc
(Not all of them are already on the web page though).
INX/INY can be done in a single cycle in the 65002, so there is not much penalty of not combining them.
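Several of the opcodes in the list above are easy to pin down as 8-bit operations. The following is a minimal Python sketch of their intended results, not the 65002 specification: the function names simply mirror the mnemonics, flag side effects are not modelled, and the result of `hbs` for a zero input is an assumption. The `cmp_flags` helper uses the usual 6502 CMP convention (C=1 when A >= M, Z=1 when A == M), under which the listed BGT condition holds exactly when A > M unsigned, and the listed BLT condition is its complement, so it is also taken when the operands are equal.

```python
# Hypothetical 8-bit models of a few opcodes from the list above.
# Names mirror the mnemonics; flag side effects are not modelled.

def rdl(a: int) -> int:
    """Rotate Direct Left: bit 7 moves straight into bit 0 (no carry)."""
    return ((a << 1) | (a >> 7)) & 0xFF

def rdr(a: int) -> int:
    """Rotate Direct Right: bit 0 moves straight into bit 7 (no carry)."""
    return ((a >> 1) | (a << 7)) & 0xFF

def asr(a: int) -> int:
    """Arithmetic shift right: the sign bit (bit 7) is shifted back in."""
    return (a >> 1) | (a & 0x80)

def swp(a: int) -> int:
    """Swap the upper and lower nibble."""
    return ((a << 4) | (a >> 4)) & 0xFF

def inv(a: int) -> int:
    """Two's complement (negate modulo 256)."""
    return (-a) & 0xFF

def bcn(a: int) -> int:
    """Bit count: number of 1-bits."""
    return bin(a & 0xFF).count("1")

def hbs(a: int) -> int:
    """Bit number of the highest set bit. Returning -1 for 0 is an assumption."""
    return (a & 0xFF).bit_length() - 1

def bsw(a: int) -> int:
    """Bit swap: bit 7 <-> bit 0, bit 6 <-> bit 1, and so on."""
    return int(f"{a & 0xFF:08b}"[::-1], 2)

def cmp_flags(a: int, m: int) -> tuple[bool, bool]:
    """(C, Z) as the 6502's CMP sets them for unsigned operands."""
    return a >= m, a == m

def bgt_taken(a: int, m: int) -> bool:
    c, z = cmp_flags(a, m)
    return c and not z          # C=1 and Z=0

def blt_taken(a: int, m: int) -> bool:
    c, z = cmp_flags(a, m)
    return (not c) or z         # C=0 or Z=1, as listed
```

HBC (highest clear bit) would then just be `hbs` applied to the inverted byte, `hbs(~a & 0xFF)`.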
André
Edit: added ASR which I had forgotten
Last edited by fachat on Tue Apr 24, 2012 7:38 pm, edited 1 time in total.
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/
Re: Instructions that I missed
I've got some notes on Z80 vs 6502 which I wanted to put into some order for a coherent post about what happened with chess machines... But here are some raw clippings:
The fridge (Judd) reckons a roughly 3:1 clock speed comparability
http://www.ffd2.com/fridge/speccy/score
and has a good architectural comparison
(The implication is that a 4MHz Z80 wins against a 1MHz 6502, but not against a 2MHz 6502.)
The hpmuseum queens benchmark comes out at 2.8:1
http://www.hpmuseum.org/cgi-sys/cgiwrap ... ead=120687
- Z80 283 msec
- 6502 100 msec
- 68000 220 msec
There are some cycle count comparisons in the slightly excited
http://www.alfonsomartone.itb.it/aunlzr.html
In this thread
https://groups.google.com/forum/#!msg/a ... I1im1b2DsJ
William H Ivey says he found a 1MHz 6502 to be more performant than a 4MHz Z80
That thread and another page both say that the Z80 is good for floating point. The other page says it's because of the large register set, so less memory bandwidth is needed.
http://www.andreadrian.de/oldcpu/Z80_nu ... ncher.html
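The arithmetic behind the clippings above is short enough to write out. This sketch assumes the quoted ratios can be read as "Z80 clock needed per unit of 6502 clock for equal work", which the clippings imply but do not state. The constant and function names are made up for illustration; only the 283 msec and 100 msec figures and the 3:1 estimate come from the post.

```python
# Figures quoted in the post; names are hypothetical.
Z80_QUEENS_MS = 283
M6502_QUEENS_MS = 100

def equivalent_6502_mhz(z80_mhz: float, ratio: float) -> float:
    """6502 clock that would match a Z80 at z80_mhz, given a Z80:6502 ratio."""
    return z80_mhz / ratio

queens_ratio = Z80_QUEENS_MS / M6502_QUEENS_MS   # 283 / 100

for ratio in (queens_ratio, 3.0):
    mhz = equivalent_6502_mhz(4.0, ratio)
    print(f"at {ratio:.2f}:1 a 4 MHz Z80 is worth about a {mhz:.2f} MHz 6502")
```

Under either ratio the 4 MHz Z80 lands between a 1 MHz and a 2 MHz 6502, which is consistent with the implication drawn from the fridge's comparison.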
Re: Instructions that I missed
Dr Jefyll wrote:
Quote:
something that decremented X or Y and branched [...] in a single instruction was something I wanted
One caveat is that it has to be more than just a convenience for producing pretty code. IOW I'd want to be certain there's a bottom line saving in clock cycles. (Did the Z80's DJNZ live up to this?) At 5 cycles, the DEY BNE sequence is already pretty fast, even though not as sexy as a single instruction equivalent. In the spirit of discussion, what would be the costs and benefits of separately optimizing the DEY instruction and the BNE instruction? The overall payoff would be broader...
cheers
Jeff
1: Loops never get tight enough.
2: The BNE is (nearly) symmetric as to the maximum distance of branching forward or backward and it gets inelegant if you have to branch more than 128 away from present address. An instruction as proposed here would lend itself to a backward bias, with branches -192 to +64, perhaps even -224 to +32, saving ugly branches plus JMP.
3: BNE is 2 cycles if branch not taken, 3 if taken. The proposal here is to assume branch is taken and optimise for that, shaving off another cycle.
4: DEX + BNE is in sum 3 bytes and 5 cycles, while a LPX (just to give it a name) would be 2 bytes and 3 cycles, perhaps 2 cycles (see point 3).
5: It overcomes the lack of post/pre decrementing index registers during load/store, as found in many other processors, and does so without adding more combinatorials to the existing already substantial number of addressing modes.
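The backward bias in point 2 can be sketched as a change in how the 8-bit branch operand is decoded. This is only one possible encoding, assumed here for illustration: operand bytes below a threshold are forward offsets and the rest wrap backward. With the threshold at 128 it is the normal 6502 two's-complement decoding. Since one byte has 256 values, the biased ranges come out as -192 to +63 and -224 to +31. The function and parameter names are hypothetical.

```python
def branch_offset(operand: int, forward_span: int = 128) -> int:
    """Decode an 8-bit branch operand into a signed offset.

    Operands below forward_span are forward offsets; the rest wrap
    backward. forward_span=128 is the standard 6502 decoding.
    """
    if not 0 <= operand <= 0xFF:
        raise ValueError("operand must be one byte")
    return operand if operand < forward_span else operand - 256

def branch_target(next_pc: int, operand: int, forward_span: int = 128) -> int:
    """Target address, relative to the instruction after the branch."""
    return (next_pc + branch_offset(operand, forward_span)) & 0xFFFF

def offset_range(forward_span: int) -> tuple[int, int]:
    """(most negative, most positive) offset for a given threshold."""
    offsets = [branch_offset(b, forward_span) for b in range(256)]
    return min(offsets), max(offsets)
```

A nice property of this particular choice is that small backward offsets such as $FE keep their usual meaning under every threshold, so only the far ends of the range move.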
LPX (=DEX; BNE) might even be combined as LPX ++Y (=INY; DEX; BNE) allowing fast reversal. Simultaneous decrementing both X and Y seems less useful.
All in all I believe the benefits are substantial. As for the cost it requires 2 (perhaps 4) new opcodes: LPX, LPY, LPX ++Y, LPY ++X with corresponding logic. Considering decrementing and branching logic already are in place I guess the added logic would not be that great.
To speed things up you might prepare branch if register is not 1 and then run the decrement in parallel, retiring the updated value while the branch is taking place.
I was never comfortable about comparing with Z80 so I cannot comment much on the timing issues for Z80. Still, I believe these looping instructions are sexy, draped in leather and crack the whip over Z80, as it should...
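To put numbers on point 4, here is a rough model of the cycles spent on loop control alone. The DEX and BNE timings are the standard 6502 ones (DEX 2 cycles, BNE 3 taken without a page crossing, 2 not taken). The LPX figures are the hypothetical ones from the post, and the fall-through cost of LPX on the final pass is a guess, since the post only gives the taken case.

```python
DEX_CYCLES = 2       # DEX or DEY
BNE_TAKEN = 3        # branch taken, no page crossing
BNE_NOT_TAKEN = 2

def loop_overhead(iterations: int, taken: int, fall_through: int) -> int:
    """Cycles spent on loop control for a count-down loop."""
    if iterations < 1:
        raise ValueError("a count-down loop runs at least once")
    # Every pass but the last takes the branch; the last falls through.
    return (iterations - 1) * taken + fall_through

def dex_bne(iterations: int) -> int:
    return loop_overhead(iterations,
                         DEX_CYCLES + BNE_TAKEN,
                         DEX_CYCLES + BNE_NOT_TAKEN)

def lpx(iterations: int, taken: int = 3, fall_through: int = 3) -> int:
    # Hypothetical instruction: timings are the post's estimate (taken)
    # and an assumption (fall_through), not measured values.
    return loop_overhead(iterations, taken, fall_through)

for n in (16, 256):
    print(n, dex_bne(n), lpx(n), lpx(n, taken=2))
```

For a full 256-pass loop this gives 1279 cycles of overhead for DEX/BNE against 768 for a 3-cycle LPX, or 513 if the taken case could be brought down to 2 cycles. How much that matters depends on how much work the loop body does per pass.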
Re: Instructions that I missed
fachat wrote:
However, if you start combining INX with BNE, you could also combine INX with BEQ, and INY instead of INX and maybe even a memory location, and other branches.... You quickly clutter your opcode space.
Incidentally the DSP65300 has a zero-overhead loop function. It is fast, elegant and clever, but regrettably requires a fair bit of logic and a flag in the processor status.
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Re: Instructions that I missed
Quote:
1: Loops never get tight enough.
2: The BNE is (nearly) symmetric as to the maximum distance of branching forward or backward and it gets inelegant if you have to branch more than 128 away from present address. An instruction as proposed here would lend itself to a backward bias, with branches -192 to +64, perhaps even -224 to +32, saving ugly branches plus JMP.
I never noticed any injustice in the backward-versus-forward branch distance myself. Although I've done the branches around JMPs, I don't remember it ever being to get back to the top of a long loop. However there have been many times when longish forward branches were needed to bypass a portion of code that should be skipped under the current conditions.
The benchmarks truly are favorable to the 6502; but the one that really counts of course is your own application. I have a short list of new instructions I would like added, but I can imagine a list of reasons why they have not been added, ranging from instruction-decoding complexity versus speed, to silicon real-estate costs which apparently are one of the major motivators for WDC's licensees to choose the 65c02 over the competition for many jobs.
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
Re: Instructions that I missed
GARTHWILSON wrote:
I have a short list of new instructions I would like added, ...
André
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/
- teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
Re: Instructions that I missed
Quote:
LPX (=DEX; BNE) might even be combined as LPX ++Y (=INY; DEX; BNE) allowing fast reversal. Simultaneous decrementing both X and Y seems less useful.
Presumably the answer is the change in X. Okay, but what am I going to know about the state of Y (ie., the value it holds) after the loop terminates? In the general case, presumably nothing.
My second reaction is: oh, this adds to the complexity of trying to visualize just what the instruction really does. Every other instruction changes just one register at a time; this would be un-orthogonal to them. Oh, my poor head!
I do pity myself quite easily.
- teamtempest
- Posts: 443
- Joined: 08 Nov 2009
- Location: Minnesota
Re: Instructions that I missed
Quote:
BLT - Branch Less Than (C=0 or Z=1)
BGT - Branch Greater Than (C=1 and Z=0)
Quote:
RDL - Rotate Direct Left (rotate without carry, i.e. bit 7 is directly moved to bit 0)
RDR - Rotate Direct Right (rotate without carry)
ASR - arithmetic shift right (shift in the sign from the left)
Quote:
SWP - swap upper and lower nibble
Quote:
INV - two's complement
Quote:
BCN - bit count, compute number of 1-bits
Quote:
SXY - swap X and Y
Quote:
PSH - push all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
PLL - pull all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
But an instruction to do it would be just as useful. The idea of multiple dedicated stacks would of course be to minimize response time to an interrupt. I don't think these would be as useful for calling subroutines, mainly because of the difficulty of returning results in registers (if the callee did PSH at the start then PLL at the end would trash anything you put in them...unless maybe the caller did PSH? Then after the callee returns, save any results passed in registers before the caller does PLL...might work. Okay, I like it)
Quote:
FIL - fill a memory area with a byte value
Quote:
ADS - add value to stack pointer
SBS - subtract value from stack pointer
Quote:
MVN/MVP - move a memory area (see 65816)
Quote:
HBS - determine bit number of highest bit that is set (like log2)
HBC - determine bit number of highest bit that is clear
BSW - bit swap - exchange bit 7 with bit 0, bit 6 with bit 1, etc
- GARTHWILSON
- Forum Moderator
- Posts: 8773
- Joined: 30 Aug 2002
- Location: Southern California
Re: Instructions that I missed
Quote:
I'm eager to hear of them!
This is mostly from a list that's at least 10 years old, from when I probably wasn't thinking as much about relocatable code, where relative addresses for data access would be a plus. The '816 can do that though, maybe not as nimbly as I could wish, but a hundred times as well as the 6502 can.
STF STore FF to a memory location without affecting processor registers. Many times a byte is used as a flag variable, and STZ clears it, so STF would set it much more efficiently than STZ, DEC. LDA #$FF (or $FFFF) followed by STA ___ affects a processor register.
DIN and DDE Double INcrement and Double DEcrement. Same as INC INC or DEC DEC but faster, and the C flag would tell if you went from FF to 01 or vice-versa without having to test in between the two INC or DEC instructions. This is particularly useful in higher-level languages that are always incrementing pointers to the next two-byte address.
BEV & BOD ("Branch if EVen" and "Branch if ODd"), using another flag in the status register. I can't remember anymore why I wanted these. The need might go away with the DIN and DDE above. Since the NMOS 6502 had the JMP (xxFF) bug, Forth on that one required keeping 16-bit values aligned on even addresses. It's not a problem on the CMOS one, so potentially thousands of bytes can be saved by not having to align.
SWN SWap Nybbles $12 becomes $21. Useful in cobbling together fast math routines? 65816 has an XBA instruction to swap bytes in the 16-bit accumulator. Edit, Oct 2017: We have that in an 8-byte, 12-clock routine, here.
JSR relative long, to anywhere in the 64K memory space. (This and long relative branching are possible on the 6502 by doing JSR to a routine that calculates the target address, but it's very inefficient. The JSR puts the calling address on the stack so the routine can add the offset to it.) The '816 has BRL, and you can synthesize a BSR with it by preceding it with PER.
Several other desirables are already implemented in the 65816, like a branch relative long (BRL), stack-relative addressing, block move, push an indirect address or a relative address or a literal on the stack, movable base page that doesn't have to be ZP, 16-bit stack pointer, etc.
Putting just a few of ITC (indirect-threaded code) Forth's internals like NEXT , nest , and unnest in the machine-language instruction set would make a big difference in execution speed and I believe could be used for other higher-level languages as well, although the extra stack pointer mentioned recently to use an RTS-like instruction for DTC (direct-threaded code) Forth's NEXT would probably be more efficient. STC (subroutine-threaded code) Forth doesn't need any of them.
Eliminate most dead bus cycles. I don't know if it gets easier if the input clock were faster than the bus, like 20 or 40MHz input for a 10MHz bus.
Looking over my own lists and others', I'd have to say that most of the dreamed-of extra instructions would not be used enough to justify them, or there's already a provision in the '816, like the block-fill one which can be done with MVN or MVP, starting from the opposite end from how you would do a move. Ones like MULtiply take a lot of silicon real estate.
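On the SWN item above: the linked 8-byte, 12-clock routine is not reproduced in the post. One well-known sequence that fits that size and timing is ASL A, ADC #$80, ROL A, executed twice (1+2+1 bytes and 2+2+2 cycles per pass). Whether that is the routine the link points to is an assumption. The Python sketch below models those three instructions on an 8-bit accumulator with a carry flag, in binary mode, so the sequence can be checked against a plain nibble swap for every input value.

```python
def asl(a: int) -> tuple[int, int]:
    """ASL A: shift left, old bit 7 goes to carry. Returns (A, C)."""
    return (a << 1) & 0xFF, a >> 7

def adc(a: int, operand: int, c: int) -> tuple[int, int]:
    """ADC #imm in binary mode: add operand plus carry. Returns (A, C)."""
    total = a + operand + c
    return total & 0xFF, total >> 8

def rol(a: int, c: int) -> tuple[int, int]:
    """ROL A: rotate left through carry. Returns (A, C)."""
    return ((a << 1) | c) & 0xFF, a >> 7

def swap_nybbles(a: int) -> int:
    """Model of ASL A / ADC #$80 / ROL A, run twice."""
    c = 0  # the incoming carry does not matter; ASL overwrites it
    for _ in range(2):
        a, c = asl(a)
        a, c = adc(a, 0x80, c)
        a, c = rol(a, c)
    return a

def reference_swap(a: int) -> int:
    """Plain nibble swap, used only to check the model."""
    return ((a << 4) | (a >> 4)) & 0xFF
```

Each pass rotates the accumulator left by two bits without involving the carry in the final result, so two passes give the four-bit rotation that a nibble swap amounts to. It also matches the earlier point in the thread that a swap followed by AND #$F0 does the work of four ASL A instructions.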
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?
- BigDumbDinosaur
- Posts: 9425
- Joined: 28 May 2009
- Location: Midwestern USA (JB Pritzker’s dystopia)
Re: Instructions that I missed
fachat wrote:
SWP - swap upper and lower nibble
x86? We ain't got no x86. We don't NEED no stinking x86!