
All times are UTC




PostPosted: Mon Apr 23, 2012 9:15 pm

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
The 6502 was a nice processor to program, but looking around at the alternatives of the time (Z80, 6809, etc.) there were a few things I missed. Interestingly, the clever double-width accumulator of the 6809 and the similar 16-bit registers of the Z80 were not among them. Rather, DJNZ was something I would have liked. Inner loops could never get tight enough, so I wanted something that decremented X or Y and branched, preferably with a backward bias, in a single instruction. If anyone is about to improve the 6502 instruction set, I'd like to hear your view on such end-of-loop instructions.

Steve Wozniak created SWEET16, which I had thought would provide the necessary 16-bit functionality and also a transition to a 16-bit processor. So why was SWEET16 never implemented in hardware? Like the 6502, it was elegant, orthogonal and required no prefix codes. It seems to me like a lost opportunity.


PostPosted: Mon Apr 23, 2012 10:25 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
You could just use the 65816.  Although the high address byte (A16-A23) may look scary, you don't have to latch, decode, or use that high address byte to still get a ton of benefits.  The code length is dramatically shortened anytime you need to deal with 16-bit values.  See an example here. Keep in mind however that the 6502 out-benchmarked the Z80 even though the Z80 had more and wider registers, and ran at higher clock rates.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Mon Apr 23, 2012 11:37 pm

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
Quote:
something that decremented X or Y and branched [...] in a single instruction was something I wanted
Hello, Alienthe! It is an interesting idea. Loops account for a high proportion of the execution time for most programs, so speeding up the critical loops is usually worth the effort -- an optimization almost as reliable as boosting clock speed. Loops can be unrolled, of course, but that's a bit OT. I can certainly see the appeal of a DJNZ equivalent for 6500 family.

One caveat is that it has to be more than just a convenience for producing pretty code. IOW I'd want to be certain there's a bottom line saving in clock cycles. (Did the Z80's DJNZ live up to this?) At 5 cycles, the DEY BNE sequence is already pretty fast, even though not as sexy as a single instruction equivalent. In the spirit of discussion, what would be the costs and benefits of separately optimizing the DEY instruction and the BNE instruction? The overall payoff would be broader...

cheers
Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Tue Apr 24, 2012 6:34 am

Joined: Wed Oct 06, 2010 9:05 am
Posts: 95
Location: Palma, Spain
I could've done with a 2 cycle instruction to swap the top and bottom 4 bits of the accumulator (SWap Nybble?). Would've saved on sequences like
Code:
ASL A
ASL A
ASL A
ASL A
which could just have become
Code:
SWN A
AND #$F0

I guess it would've been a whole new special path in the random control logic, as it's not remotely like any other kind of operation, so I can see why it wasn't there.
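
The equivalence claimed above can be checked with a small model. This is a minimal sketch in Python rather than 6502 code; `swn` is the hypothetical instruction proposed in this post, and its 2-cycle timing is the poster's wish, not an existing opcode. On a real 6502, four ASL A instructions cost 4 bytes and 8 cycles, while the proposed pair would cost 3 bytes and 4 cycles under that assumption.

```python
def asl(a):
    """Model of ASL A: shift left one bit, keep 8 bits (carry ignored)."""
    return (a << 1) & 0xFF


def swn(a):
    """Hypothetical SWN A: exchange the high and low nibbles."""
    return ((a << 4) | (a >> 4)) & 0xFF


def four_asl(a):
    """ASL A executed four times in a row."""
    for _ in range(4):
        a = asl(a)
    return a


def swn_and_mask(a):
    """Hypothetical SWN A followed by AND #$F0."""
    return swn(a) & 0xF0


# The two sequences agree for every possible accumulator value.
print(all(four_asl(a) == swn_and_mask(a) for a in range(256)))
```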


PostPosted: Tue Apr 24, 2012 7:12 pm

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
GARTHWILSON wrote:
You could just use the 65816. Although the high address byte (A16-A23) may look scary, you don't have to latch, decode, or use that high address byte to still get a ton of benefits. The code length is dramatically shortened anytime you need to deal with 16-bit values.

SWEET16 was launched as part of the Apple II back in 1977; the 65816 was launched late in 1984. These 7 years represent, in my view, a lost opportunity. SWEET16, while not as capable as the 65816, would have been simple to implement due to its orthogonality and a well-structured ISA and instruction code design, as one would indeed expect from the Woz. Moreover, SWEET16 is more "spacious", having more registers, while the 65816 adds no new general-purpose registers.

Quote:
See an example here. Keep in mind however that the 6502 out-benchmarked the Z80 even though the Z80 had more and wider registers, and ran at higher clock rates.

I remember many benchmark discussions back in the day, but I was never really sure how the speeds actually differed.


PostPosted: Tue Apr 24, 2012 7:34 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
Collapsing multiple instructions into one, esp. with something like incrementing or decrementing has its caveats. For example Motorola got it wrong in the 68000 where the postincrement/predecrement of index registers was broken when for example a bus error ABORTed the opcode. Collapsing dec/inc and branch should be fine in this respect though.

However, if you start combining INX with BNE, you could also combine INX with BEQ, and INY instead of INX and maybe even a memory location, and other branches.... You quickly clutter your opcode space.

Here is a set of opcodes I defined for my 65002: http://www.6502.org/users/andre/65k/specsprog.html

BLT - Branch Less Than (C=0 or Z=1)
BGT - Branch Greater Than (C=1 and Z=0)
RDL - Rotate Direct Left (rotate without carry, i.e. bit 7 is directly moved to bit 0)
RDR - Rotate Direct Right (rotate without carry)
ASR - arithmetic shift right (shift in the sign from the left)
SWP - swap upper and lower nibble
INV - two's complement
BCN - bit count, compute number of 1-bits
SXY - swap X and Y
PSH - push all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
PLL - pull all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
FIL - fill a memory area with a byte value
ADS - add value to stack pointer
SBS - subtract value from stack pointer
MVN/MVP - move a memory area (see 65816)
HBS - determine bit number of highest bit that is set (like log2)
HBC - determine bit number of highest bit that is clear
BSW - bit swap - exchange bit 7 with bit 0, bit 6 with bit 1, etc

(Not all of them are already on the web page though).
INX/INY can be done in a single cycle in the 65002, so there is not much penalty of not combining them.

André

Edit: added ASR which I had forgot

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


Last edited by fachat on Tue Apr 24, 2012 7:38 pm, edited 1 time in total.

PostPosted: Tue Apr 24, 2012 7:36 pm

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
I've got some notes on Z80 vs 6502 which I wanted to put into some order for a coherent post about what happened with chess machines... But here are some raw clippings:

The fridge (Judd) reckons a roughly 3:1 clock speed comparability
http://www.ffd2.com/fridge/speccy/score
and has good architectural comparison
(The implication is that a 4MHz Z80 wins versus a 1MHz 6502, but not against a 2MHz 6502.)
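
The arithmetic behind that implication is simple enough to spell out. A minimal sketch, assuming the rough 3:1 clock ratio quoted above holds as a flat scaling factor (real programs will vary); the function name is made up for illustration.

```python
def equivalent_6502_mhz(z80_mhz, ratio=3.0):
    """6502 clock that would roughly match a Z80 at z80_mhz,
    assuming a flat clock-for-clock ratio (3:1 is the estimate above)."""
    return z80_mhz / ratio


eq = equivalent_6502_mhz(4.0)
print(round(eq, 2))        # 1.33
print(1.0 < eq < 2.0)      # faster than a 1MHz 6502, slower than a 2MHz one
```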

The hpmuseum queens benchmark comes out at 2.8:1
http://www.hpmuseum.org/cgi-sys/cgiwrap ... ead=120687
       -  Z80      283 msec
       -  6502     100 msec
       -  68000    220 msec

There are some cycle count comparisons in the slightly excited
http://www.alfonsomartone.itb.it/aunlzr.html

In this thread
https://groups.google.com/forum/#!msg/a ... I1im1b2DsJ
William H. Ivey says he found a 1MHz 6502 to be more performant than a 4MHz Z80

That thread and another page both say that the Z80 is good for floating point. The other page says it's because of the large register set, so less memory bandwidth is needed.
http://www.andreadrian.de/oldcpu/Z80_nu ... ncher.html


PostPosted: Tue Apr 24, 2012 7:45 pm

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
Dr Jefyll wrote:
Quote:
something that decremented X or Y and branched [...] in a single instruction was something I wanted
Hello, Alienthe! It is an interesting idea. Loops account for a high proportion of the execution time for most programs, so speeding up the critical loops is usually worth the effort -- an optimization almost as reliable as boosting clock speed. Loops can be unrolled, of course, but that's a bit OT. I can certainly see the appeal of a DJNZ equivalent for 6500 family.

One caveat is that it has to be more than just a convenience for producing pretty code. IOW I'd want to be certain there's a bottom line saving in clock cycles. (Did the Z80's DJNZ live up to this?) At 5 cycles, the DEY BNE sequence is already pretty fast, even though not as sexy as a single instruction equivalent. In the spirit of discussion, what would be the costs and benefits of separately optimizing the DEY instruction and the BNE instruction? The overall payoff would be broader...

cheers
Jeff

Well, syntactic sugar always has an appeal. My value proposition here is actually more real than that:
1: Loops never get tight enough.
2: The BNE is (nearly) symmetric as to the maximum distance of branching forward or backward and it gets inelegant if you have to branch more than 128 away from present address. An instruction as proposed here would lend itself to a backward bias, with branches -192 to +64, perhaps even -224 to +32, saving ugly branches plus JMP.
3: BNE is 2 cycles if branch not taken, 3 if taken. The proposal here is to assume branch is taken and optimise for that, shaving off another cycle.
4: DEX + BNE is in sum 3 bytes and 5 cycles, while a LPX (just to give it a name) would be 2 bytes and 3 cycles, perhaps 2 cycles (see point 3).
5: It overcomes the lack of post/pre decrementing index registers during load/store, as found in many other processors, and does so without adding more combinatorials to the existing already substantial number of addressing modes.
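
Point 2 can be made concrete with a small decoding model. This is only a sketch in Python: the ordinary 6502 branch offset is a two's-complement byte, while the biased form here is one possible encoding invented for illustration (offset = byte minus bias). Note that a single byte has 256 values, so a bias of 192 actually spans -192 to +63 rather than +64.

```python
def standard_offset(byte):
    """Ordinary 6502 relative branch: two's complement, -128..+127."""
    return byte - 256 if byte >= 0x80 else byte


def biased_offset(byte, bias=192):
    """Hypothetical backward-biased encoding: -bias .. (255 - bias)."""
    return byte - bias


print(standard_offset(0x80), standard_offset(0x7F))   # -128 127
print(biased_offset(0x00), biased_offset(0xFF))       # -192 63
print(biased_offset(0x00, bias=224), biased_offset(0xFF, bias=224))  # -224 31
```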

LPX (=DEX; BNE) might even be combined as LPX ++Y (=INY; DEX; BNE) allowing fast reversal. Simultaneous decrementing both X and Y seems less useful.

All in all I believe the benefits are substantial. As for the cost it requires 2 (perhaps 4) new opcodes: LPX, LPY, LPX ++Y, LPY ++X with corresponding logic. Considering decrementing and branching logic already are in place I guess the added logic would not be that great.
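
To put rough numbers on the benefit, here is a minimal sketch of the loop-control overhead only (the loop body is excluded). The DEX and BNE timings are the standard 6502 ones with no page crossing; the LPX timings are the hypothetical figures from this post, and the not-taken cost of LPX is an assumption.

```python
def dex_bne_overhead(iterations):
    """Cycles spent in DEX + BNE over a whole loop.
    DEX = 2 cycles; BNE = 3 when taken (same page), 2 on the final fall-through."""
    taken = iterations - 1
    return iterations * 2 + taken * 3 + 2


def lpx_overhead(iterations, taken=3, not_taken=3):
    """Cycles for a hypothetical LPX; both timings are assumptions."""
    return (iterations - 1) * taken + not_taken


print(dex_bne_overhead(256))               # 1279
print(lpx_overhead(256))                   # 768
print(lpx_overhead(256, taken=2))          # 513
```

Code size follows the same pattern: 3 bytes for DEX plus BNE against 2 bytes for the proposed instruction.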

To speed things up you might prepare branch if register is not 1 and then run the decrement in parallel, retiring the updated value while the branch is taking place.

I was never comfortable about comparing with Z80 so I cannot comment much on the timing issues for Z80. Still, I believe these looping instructions are sexy, draped in leather and crack the whip over Z80, as it should...


PostPosted: Tue Apr 24, 2012 7:54 pm

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
fachat wrote:
However, if you start combining INX with BNE, you could also combine INX with BEQ, and INY instead of INX and maybe even a memory location, and other branches.... You quickly clutter your opcode space.

I quite agree. A clean structure is to me one of the attractions of the 6502 (and also the SWEET16). For looping purposes I was only thinking of decrementing and a BNE with backward bias. There are some obvious variations outlined in my earlier reply.

Incidentally, the DSP56300 has a zero-overhead loop function. It is fast, elegant and clever but regrettably requires a fair bit of logic and a flag in the processor status.


PostPosted: Tue Apr 24, 2012 8:53 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Quote:
1: Loops never get tight enough.
2: The BNE is (nearly) symmetric as to the maximum distance of branching forward or backward and it gets inelegant if you have to branch more than 128 away from present address. An instruction as proposed here would lend itself to a backward bias, with branches -192 to +64, perhaps even -224 to +32, saving ugly branches plus JMP.

I never noticed any injustice in the backward-versus-forward branch distance myself.  Although I've done the branches around JMPs, I don't remember it ever being to get back to the top of a long loop.  However there have been many times when longish forward branches were needed to bypass a portion of code that should be skipped under the current conditions.

The benchmarks truly are favorable to the 6502; but the one that really counts of course is your own application.  I have a short list of new instructions I would like added, but I can imagine a list of reasons why they have not been added, ranging from instruction-decoding complexity versus speed, to silicon real-estate costs which apparently are one of the major motivators for WDC's licensees to choose the 65c02 over the competition for many jobs.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Tue Apr 24, 2012 9:28 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
GARTHWILSON wrote:
I have a short list of new instructions I would like added, ...


I'm eager to hear of them!

André

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


PostPosted: Wed Apr 25, 2012 1:24 am

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Quote:
LPX (=DEX; BNE) might even be combined as LPX ++Y (=INY; DEX; BNE) allowing fast reversal. Simultaneous decrementing both X and Y seems less useful.


My first reaction is: what sets the status register flags? The change in Y or the change in X?

Presumably the answer is the change in X. Okay, but what am I going to know about the state of Y (ie., the value it holds) after the loop terminates? In the general case, presumably nothing.

My second reaction is: oh, this adds to the complexity of trying to visualize just what the instruction really does. Every other instruction changes just one register at a time; this would be un-orthogonal to them. Oh, my poor head!

I do pity myself quite easily.


PostPosted: Wed Apr 25, 2012 2:16 am

Joined: Sun Nov 08, 2009 1:56 am
Posts: 411
Location: Minnesota
Quote:
BLT - Branch Less Than (C=0 or Z=1)
BGT - Branch Greater Than (C=1 and Z=0)


Signed or unsigned comparison? As things stand now, signed comparisons are always a tricky thing on the 6502. You have to use that mysterious V flag if you want to do signed. It's difficult enough to grasp what's going on that a lot of people stay away from it. Perhaps explicitly signed comparison, subtraction and addition instructions, plus signed branch instructions, would be helpful as well.
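
For anyone puzzled by the V flag here, a small model may help. This is a sketch in Python of the usual idiom: CMP does not touch V, so a signed compare is normally done with SEC followed by SBC, after which "A is less than M" as signed numbers corresponds to N exclusive-or V. The helper names are made up for illustration.

```python
def sbc_flags(a, m):
    """Model of SEC / SBC m on 8-bit values: return (result, N, V, C)."""
    result = (a - m) & 0xFF
    n = (result >> 7) & 1
    c = 1 if a >= m else 0                       # carry set means no borrow
    # Overflow: operands have different signs and the result's sign differs from A's.
    v = 1 if ((a ^ m) & (a ^ result) & 0x80) else 0
    return result, n, v, c


def to_signed(byte):
    """Interpret a byte as a two's-complement number."""
    return byte - 256 if byte >= 0x80 else byte


def signed_less(a, m):
    """Signed A < M, read from N xor V after the subtraction."""
    _, n, v, _ = sbc_flags(a, m)
    return bool(n ^ v)


def unsigned_less(a, m):
    """Unsigned A < M, read from carry clear (what BCC tests)."""
    return sbc_flags(a, m)[3] == 0


print(signed_less(0x80, 0x01), unsigned_less(0x80, 0x01))   # True False
```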

Quote:
RDL - Rotate Direct Left (rotate without carry, i.e. bit 7 is directly moved to bit 0)
RDR - Rotate Direct Right (rotate without carry)
ASR - arithmetic shift right (shift in the sign from the left)


I like these. They can be synthesized already by interspersing CMP immediates among ROL/ROR sequences, but they're occasionally nice to have.
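
The CMP trick works because CMP #$80 sets carry exactly when bit 7 of the accumulator is set, and leaves the accumulator alone. A minimal Python model of that synthesis for RDL and ASR (RDR needs carry loaded from bit 0 instead, so it takes a different sequence and is not modelled here):

```python
def cmp_imm(a, imm):
    """CMP #imm: carry is set when A >= imm (unsigned); A is unchanged."""
    return 1 if a >= imm else 0


def rol(a, carry):
    """ROL A: carry enters bit 0; returns (result, new carry)."""
    return ((a << 1) | carry) & 0xFF, (a >> 7) & 1


def ror(a, carry):
    """ROR A: carry enters bit 7; returns (result, new carry)."""
    return (a >> 1) | (carry << 7), a & 1


def rdl(a):
    """CMP #$80 then ROL A: rotate left with bit 7 moved straight to bit 0."""
    return rol(a, cmp_imm(a, 0x80))[0]


def asr(a):
    """CMP #$80 then ROR A: shift right, replicating the sign bit."""
    return ror(a, cmp_imm(a, 0x80))[0]


print(hex(rdl(0x81)), hex(asr(0x80)))   # 0x3 0xc0
```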

Quote:
SWP - swap upper and lower nibble


This would be extremely handy. This might be the best addition of them all.

Quote:
INV - two's complement


I dunno. It's easy enough already to do "EOR #$FF" and add one (EOR #$FF by itself gives only the one's complement). I don't know that I want to do this often enough to make it worthwhile as a separate instruction. Is there some address mode other than accumulator where this is useful?
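
A quick model of the difference between the two complements, as a sketch in Python: EOR #$FF alone flips every bit, and the two's complement (negation) needs one more added, for example with CLC and ADC #1.

```python
def ones_complement(a):
    """EOR #$FF: flip every bit."""
    return a ^ 0xFF


def twos_complement(a):
    """EOR #$FF, then CLC / ADC #1: negate an 8-bit value."""
    return ((a ^ 0xFF) + 1) & 0xFF


print(hex(ones_complement(0x01)), hex(twos_complement(0x01)))   # 0xfe 0xff
# A value plus its two's complement always wraps to zero.
print(all((a + twos_complement(a)) & 0xFF == 0 for a in range(256)))
```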

Quote:
BCN - bit count, compute number of 1-bits


This would take a (very) short routine to duplicate, and the answer would come in log(N) time. Or a big lookup table and constant time. Is there enough need to make it a separate instruction? I can see that it would be handy if you wanted to do parity checks, but beyond that I'm stumped. If parity is all it's good for, why not just a parity status flag?
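
Both approaches mentioned here can be sketched briefly. The following Python model shows a simple shift-and-add loop and a three-step parallel version (three steps for eight bits, which is the log(N) behaviour); the function names are illustrative, and parity is shown as the low bit of the count.

```python
def bit_count_loop(a):
    """Shift the byte right one bit at a time, adding up the bits that fall out."""
    count = 0
    while a:
        count += a & 1
        a >>= 1
    return count


def bit_count_parallel(a):
    """Add neighbouring bit fields pairwise: 1-bit, then 2-bit, then 4-bit sums."""
    a = (a & 0x55) + ((a >> 1) & 0x55)
    a = (a & 0x33) + ((a >> 2) & 0x33)
    a = (a & 0x0F) + ((a >> 4) & 0x0F)
    return a


def parity_odd(a):
    """1 when the byte holds an odd number of 1-bits."""
    return bit_count_loop(a) & 1


print(bit_count_loop(0xFF), bit_count_parallel(0xFF), parity_odd(0x07))   # 8 8 1
```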

Quote:
SXY - swap X and Y


Mmm. Maybe. All I can think of is setting up registers prior to subroutine (or OS) calls, but it might be handy at that.

Quote:
PSH - push all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
PLL - pull all registers (in a 6502 that would be AC, XR, YR, more in a 65002)


So the other day I was thinking "What if each register had its own dedicated on-chip stack space?" I.e., N registers, N stacks. And then I thought "What would that possibly be good for?" And then I thought "Interrupts!". I was thinking the processor would do it automatically upon recognizing an interrupt, so an ISR could start out trashing any register it wanted. RTI would restore them all. If each stack was 8- or 16-deep, you could even deal with re-entrant interrupts. The stacks would not be user-accessible, just places to stash register values during interrupts.

But an instruction to do it would be just as useful. The idea of multiple dedicated stacks would of course be to minimize response time to an interrupt. I don't think these would be as useful for calling subroutines, mainly because of the difficulty of returning results in registers (if the callee did PSH at the start then PLL at the end would trash anything you put in them...unless maybe the caller did PSH? Then after the callee returns, save any results passed in registers before the caller does PLL...might work. Okay, I like it)
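
The per-register stack idea can be sketched as a toy model. This is purely illustrative Python: the class, its method names and the depth of 8 are assumptions taken from the description above, not any existing hardware.

```python
class ShadowStacks:
    """Toy model: one private stack per register, used only around interrupts."""

    NAMES = ("A", "X", "Y")

    def __init__(self, depth=8):
        self.depth = depth
        self.regs = {name: 0 for name in self.NAMES}
        self.stacks = {name: [] for name in self.NAMES}

    def interrupt(self):
        """On interrupt entry, every register is saved to its own stack."""
        if len(self.stacks["A"]) >= self.depth:
            raise OverflowError("interrupts nested deeper than the shadow stacks")
        for name in self.NAMES:
            self.stacks[name].append(self.regs[name])

    def rti(self):
        """On RTI, every register is restored from its own stack."""
        for name in self.NAMES:
            self.regs[name] = self.stacks[name].pop()


cpu = ShadowStacks()
cpu.regs.update(A=1, X=2, Y=3)
cpu.interrupt()                 # the ISR may now trash A, X and Y freely
cpu.regs.update(A=9, X=9, Y=9)
cpu.rti()
print(cpu.regs)                 # {'A': 1, 'X': 2, 'Y': 3}
```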

Quote:
FIL - fill a memory area with a byte value


Again a short subroutine. Is the speed advantage (one or two cycles per byte?) worth it?

Quote:
ADS - add value to stack pointer
SBS - subtract value from stack pointer


Fun. I like these, although playing with stack pointer is not for novices.

Quote:
MVN/MVP - move a memory area (see 65816)


I never liked this instruction much. Kind of ugly. The coolest answer to this in the 6502 world I ever saw were the Commodore RAM expanders. That REC chip could do transfers and fills at a byte per clock cycle, and swap two bytes in two clock cycles.

Quote:
HBS - determine bit number of highest bit that is set (like log2)
HBC - determine bit number of highest bit that is clear
BSW - bit swap - exchange bit 7 with bit 0, bit 6 with bit 1, etc


How is the result returned, BTW? What if no bit is set? If bits are numbered 7..0 then zero is not an answer to that. I guess I'm having trouble seeing how useful they'd be, ie., what problem do they solve?
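
One possible answer to the "no bit set" question is to return a separate found/not-found indication next to the bit number, much as a status flag would. The sketch below is Python and that convention is an assumption for illustration, not something taken from the 65002 specification.

```python
def hbs(a):
    """Highest bit set: return (bit_number, found). found is False for zero."""
    for bit in range(7, -1, -1):
        if a & (1 << bit):
            return bit, True
    return 0, False


def hbc(a):
    """Highest bit clear: same search on the inverted byte."""
    return hbs(a ^ 0xFF)


def bsw(a):
    """Bit swap: bit 7 exchanges with bit 0, bit 6 with bit 1, and so on."""
    result = 0
    for bit in range(8):
        if a & (1 << bit):
            result |= 0x80 >> bit
    return result


print(hbs(0x1F), hbs(0x00), hex(bsw(0x01)))   # (4, True) (0, False) 0x80
```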


PostPosted: Wed Apr 25, 2012 5:03 am

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Quote:
I'm eager to hear of them!

This is mostly from a list that's at least 10 years old, from when I probably wasn't thinking as much about relocatable code, where relative addresses for data access would be a plus.  The '816 can do that though, maybe not as nimbly as I could wish, but a hundred times as well as the 6502 can.

STF STore FF to a memory location without affecting processor registers.  Many times a byte is used as a flag variable, and STZ clears it, so STF would set it much more efficiently than STZ, DEC.  LDA #$FF (or $FFFF) followed by STA ___ affects a processor register.

DIN and DDE Double INcrement and Double DEcrement.  Same as INC INC or DEC DEC but faster, and the C flag would tell if you went from FF to 01 or vice-versa without having to test in between the two INC or DEC instructions.  This is particularly useful in higher-level languages that are always incrementing pointers to the next two-byte address.
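
A small model of what DIN and DDE might do on a single byte, as a sketch in Python. The flag polarity (carry set when the value wraps) is an assumption; the wish above only says that C would report the wrap.

```python
def din(value):
    """Hypothetical double increment: return (new_value, carry)."""
    total = value + 2
    return total & 0xFF, 1 if total > 0xFF else 0


def dde(value):
    """Hypothetical double decrement: return (new_value, carry)."""
    total = value - 2
    return total & 0xFF, 1 if total < 0 else 0


print(din(0xFF))   # (1, 1)   wrapped from $FF to $01
print(din(0x10))   # (18, 0)  no wrap
print(dde(0x01))   # (255, 1) wrapped from $01 to $FF
```

When the byte is the low half of a 16-bit pointer, the carry is what tells the caller that the high byte needs adjusting too.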

BEV & BOD ("Branch if EVen" and "Branch if ODd"), using another flag in the status register.  I can't remember anymore why I wanted these.  The need might go away with the DIN and DDE above.  Since the NMOS 6502 had the JMP (xxFF) bug, Forth on that one required keeping 16-bit values aligned on even addresses.  It's not a problem on the CMOS one, so potentially thousands of bytes can be saved by not having to align.

SWN SWap Nybbles $12 becomes $21.  Useful in cobbling together fast math routines?  65816 has an XBA instruction to swap bytes in the 16-bit accumulator.  Edit, Oct 2017:  We have that in an 8-byte, 12-clock routine, here.

JSR relative long, to anywhere in the 64K memory space.  (This and long relative branching are possible on the 6502 by doing JSR to a routine that calculates the target address, but it's very inefficient.  The JSR puts the calling address on the stack so the routine can add the offset to it.)  The '816 has BRL, and you can synthesize a BSR with it by preceding it with PER.

Several other desirables are already implemented in the 65816, like a branch relative long (BRL), stack-relative addressing, block move, push an indirect address or a relative address or a literal on the stack, movable base page that doesn't have to be ZP, 16-bit stack pointer, etc..

Putting just a few of ITC (indirect-threaded code) Forth's internals like NEXT , nest , and unnest in the machine-language instruction set would make a big difference in execution speed and I believe could be used for other higher-level languages as well, although the extra stack pointer mentioned recently to use an RTS-like instruction for DTC (direct-threaded code) Forth's NEXT would probably be more efficient.  STC (subroutine-threaded code) Forth doesn't need any of them.

Eliminate most dead bus cycles.  I don't know if it gets easier if the input clock were faster than the bus, like 20 or 40MHz input for a 10MHz bus.

Looking over my own lists and others', I'd have to say that most of the dreamed-of extra instructions would not be used enough to justify them, or there's already a provision in the '816, like the block-fill one which can be done with MVN or MVP, starting from the opposite end from how you would do a move.  Ones like MULtiply take a lot of silicon real estate.
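
The block-fill trick relies on an overlapping move running in the "wrong" direction, so that each byte copied is the one just written. A minimal Python model, assuming a simplified MVN that copies upward within one bank (on the real 65816 the accumulator holds the byte count minus one, and X and Y hold the source and destination addresses):

```python
def mvn(mem, src, dst, count):
    """Simplified model of MVN: copy count bytes, lowest address first."""
    for _ in range(count):
        mem[dst] = mem[src]
        src += 1
        dst += 1


def fill(mem, start, length, value):
    """Fill by seeding one byte, then moving the block onto itself one byte higher."""
    if length == 0:
        return
    mem[start] = value
    mvn(mem, start, start + 1, length - 1)


mem = bytearray(16)
fill(mem, 4, 8, 0xEA)
print(mem.hex())   # 00000000eaeaeaeaeaeaeaea00000000
```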

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?


PostPosted: Wed Apr 25, 2012 2:53 pm

Joined: Thu May 28, 2009 9:46 pm
Posts: 8504
Location: Midwestern USA
fachat wrote:
SWP - swap upper and lower nibble
The 65C816's XBA instruction does the analogous swap on the 16-bit accumulator, exchanging its two bytes rather than two nibbles.

_________________
x86?  We ain't got no x86.  We don't NEED no stinking x86!

