
All times are UTC




PostPosted: Wed Apr 25, 2012 3:55 pm 

Joined: Fri Dec 11, 2009 3:50 pm
Posts: 3367
Location: Ontario, Canada
BigDumbDinosaur wrote:
fachat wrote:
SWP - swap upper and lower nibble
The 65C816's XBA instruction does that on the accumulator.
Sort of, but not quite. XBA swaps the two halves of the 816's 16-bit accumulator C, i.e., it exchanges the 8 bits in accumulator A with the 8 bits in accumulator B. I think André meant swapping the 4-bit halves of an 8-bit accumulator.

Alienthe wrote:
So why was SWEET16 never implemented in hardware? Like the 6502 it was elegant, orthogonal and required no prefix codes. It seems to me like a lost opportunity.
I have to admit I like this idea. But I suspect it won't do much in terms of a useful speedup of existing code. Sweet 16 was used to shrink the memory footprint when a task wasn't especially performance-critical, isn't that right? But writing new Sweet 16 code to run natively in hardware.... that would be cool (and fast!)

-- Jeff

_________________
In 1988 my 65C02 got six new registers and 44 new full-speed instructions!
https://laughtonelectronics.com/Arcana/ ... mmary.html


PostPosted: Wed Apr 25, 2012 9:47 pm 

Joined: Wed Oct 06, 2010 9:05 am
Posts: 95
Location: Palma, Spain
Another thing that was done on some 6502 variant, and which I always liked as an extension, was the addition of a Z index register whose default value was 0. Many of the new opcodes corresponded with existing ones, so STZ did exactly that, and the zp-indirect instructions like LDA (zp) became LDA (zp),Z. Always seemed like a smart way of getting some extra mileage out of a limited number of opcodes, while retaining backward compatibility.

ASR - yes, that would've been nice. Also ADD and SUB (without taking C as input) would've been great, and while we're at it (taking inspiration from the ARM), RSB and RSC for reverse subtract [with carry] would've saved the whole horrible sequence of:
Code:
STA temp   ; save the original A
LDA #$08
SEC        ; C=1 means "no borrow"
SBC temp   ; A = 8 - original A
or, slightly more cleverly:
Code:
EOR #$FF   ; A = one's complement of A (-A - 1)
SEC        ; the carry-in supplies the missing +1
ADC #$08   ; A = 8 - original A
RSB #0 would of course provide a nice 2 cycle 2's complement operation.
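
As a quick sanity check that the two sequences above really compute the same thing, here is a small Python model (not 6502 code). The helper names `adc`, `sbc`, `rsb_via_temp` and `rsb_via_eor` are made up for this sketch, and it assumes binary mode and ignores the V flag.

```python
def adc(a, operand, carry):
    """8-bit binary-mode ADC: returns (result, carry_out)."""
    total = a + operand + carry
    return total & 0xFF, total >> 8


def sbc(a, operand, carry):
    """8-bit binary-mode SBC, modelled as ADC of the one's complement."""
    return adc(a, operand ^ 0xFF, carry)


def rsb_via_temp(a, k):
    # STA temp / LDA #k / SEC / SBC temp
    temp = a
    result, _ = sbc(k, temp, 1)
    return result


def rsb_via_eor(a, k):
    # EOR #$FF / SEC / ADC #k
    result, _ = adc(a ^ 0xFF, k, 1)
    return result


# Both sequences give (k - A) mod 256 for every accumulator value.
assert all(
    rsb_via_temp(a, 8) == rsb_via_eor(a, 8) == (8 - a) & 0xFF
    for a in range(256)
)
```

With k = 0 the same model gives the two's complement of A, which is what an `RSB #0` would deliver in one instruction.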


PostPosted: Wed Apr 25, 2012 10:09 pm 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
RichTW wrote:
RSB and RSC for reverse subtract [with carry] would've saved the whole horrible sequence of:
Code:
STA temp
LDA #$08
SEC
SBC temp
or, slightly more cleverly:
Code:
EOR #$FF
SEC
ADC #$08
RSB #0 would of course provide a nice 2 cycle 2's complement operation.


That's what I made INV for - the two's complement. With it you could write:
Code:
   INV A
   CLC
   ADC #value

But looking at that it doesn't save very much...

André

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


PostPosted: Wed Apr 25, 2012 10:31 pm 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
teamtempest wrote:
Quote:
BLT - Branch Less Than (C=0 or Z=1)
BGT - Branch Greater Than (C=1 and Z=0)

Signed or unsigned comparison? As things stand now signed comparisons are always a tricky thing on the 6502. You have to use that mysterious V-flag if you want to do signed. It's difficult enough to grasp what's going on that a lot of people stay away from it. Perhaps explicitly signed comparison, subtraction, addition and branch instructions would be helpful as well.

Unsigned. I had noticed that I had sequences of Branches to do just that, branch on greater (where BCS is basically greater or equal) or branch on less or equal (where BCC is basically branch on less).

Forgive my ignorance, but isn't signed the same as unsigned, as long as V is NOT set?
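
To make the flag conditions concrete, here is a rough Python model (a sketch, not 6502 code) of the C and Z flags after CMP and of the N and V flags after SEC / SBC. The function names are invented for illustration. It checks the two proposed branch conditions as unsigned tests, and the usual "N xor V" rule for signed less-than. Note that in this model the condition listed for BLT (C=0 or Z=1) behaves as "less than or equal".

```python
def cmp_flags(a, m):
    """(C, Z) after CMP of 8-bit A against 8-bit M."""
    return int(a >= m), int(a == m)


def sbc_nv(a, m):
    """(N, V) after SEC / SBC M in binary mode."""
    result = (a - m) & 0xFF
    n = result >> 7
    # Overflow: A and M differ in sign, and the result's sign differs from A's.
    v = ((a ^ m) & (a ^ result) & 0x80) >> 7
    return n, v


def signed(x):
    """Interpret an 8-bit value as two's complement."""
    return x - 256 if x & 0x80 else x


for a in range(256):
    for m in range(256):
        c, z = cmp_flags(a, m)
        assert (c == 1 and z == 0) == (a > m)    # proposed BGT condition
        assert (c == 0 or z == 1) == (a <= m)    # condition listed for BLT
        n, v = sbc_nv(a, m)
        assert ((n ^ v) == 1) == (signed(a) < signed(m))
```

One case worth looking at is A = $FF, M = $01: V stays clear, yet the unsigned test says "greater" (C=1) while the signed test says "less" (N=1), so in this model a clear V alone does not make the two comparisons agree.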
Quote:
Quote:
SWP - swap upper and lower nibble

This would be extremely handy. This might be the best addition of them all.

As has been asked in another post in this thread, yes, that exchanges the two nibbles in a single byte. Due to the extended nature of my 65k, of course there are SWP.W to exchange the bytes in a word, SWP.L to exchange the WORDs in a LONG, and SWP.Q to exchange the LONGs in a QUAD (64 bit).
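
A minimal Python sketch of what SWP and its wider variants would compute, as described above. The function name `swp` and the `width_bits` parameter are illustrative only and say nothing about the actual 65k encoding.

```python
def swp(value, width_bits=8):
    """Swap the upper and lower halves of a value that fits in width_bits."""
    half = width_bits // 2
    mask = (1 << half) - 1
    return ((value & mask) << half) | (value >> half)


# SWP: nibbles in a byte; SWP.W: bytes in a word; SWP.L: words in a long.
assert swp(0xA5) == 0x5A
assert swp(0x1234, 16) == 0x3412
assert swp(0x12345678, 32) == 0x56781234
```

Applying the operation twice returns the original value at every width.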
Quote:
Quote:
INV - two's complement

I dunno. It's easy enough already to do "EOR #$FF". I don't know that I want to do this often enough to make it worthwhile as a separate instruction. Is there some address mode other than accumulator where this is useful?

"EOR #$FF" is the one's complement, not the two's complement. I made this to easily do a reverse subtraction as mentioned in another post here. But in fact, replacing
Code:
   EOR #$FF
   SEC
   ADC value

with
Code:
   INV A
   CLC
   ADC value

does not help much - I should probably investigate the reverse subtract as well (in addition to ADD and SUB that don't use the carry).
Quote:
Quote:
BCN - bit count, compute number of 1-bits

This would take a (very) short routine to duplicate, and the answer would come in log(N) time. Or a big lookup table and constant time. Is there enough need to make it a separate instruction? I can see that it would be handy if you wanted to do parity checks, but beyond that I'm stumped. If parity is all it's good for, why not just a parity status flag?

Parity flag is a good point. I currently just don't have a status register bit left... My 65k can have up to 64-bit values. The routine would still be very short, but log(64) steps start to add up, so I thought a dedicated instruction was in order.
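
For reference, the log(N) routine mentioned in the quote can be sketched like this in Python. This is just the well-known add-and-mask bit count, written with an illustrative `width_bits` parameter; it is not a description of how BCN would be implemented in the 65k.

```python
def bit_count(value, width_bits=8):
    """Count 1-bits using log2(width_bits) add-and-mask steps."""
    assert width_bits in (8, 16, 32, 64)
    full = (1 << width_bits) - 1
    value &= full
    shift = 1
    while shift < width_bits:
        # Mask selecting the low `shift` bits of every (2 * shift)-bit field,
        # e.g. 0x55, 0x33, 0x0F for an 8-bit value.
        mask = full // ((1 << shift) + 1)
        value = (value & mask) + ((value >> shift) & mask)
        shift *= 2
    return value


def parity(value, width_bits=8):
    """Parity falls out of the bit count for free: it is bit 0 of the count."""
    return bit_count(value, width_bits) & 1
```

An 8-bit value needs three steps and a 64-bit value needs six, which is the growth being discussed here.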
Quote:
Quote:
PSH - push all registers (in a 6502 that would be AC, XR, YR, more in a 65002)
PLL - pull all registers (in a 6502 that would be AC, XR, YR, more in a 65002)

So the other day I was thinking "What if each register had its own dedicated on-chip stack space?" I.e., N registers, N stacks. And then I thought "What would that possibly be good for?" And then I thought "Interrupts!". I was thinking the processor would do it automatically upon recognizing an interrupt, so an ISR could start out trashing any register it wanted. RTI would restore them all. If each stack was 8- or 16-deep, you could even deal with re-entrant interrupts. The stacks would not be user-accessible, just places to stash register values during interrupts.

But an instruction to do it would be just as useful. The idea of multiple dedicated stacks would of course be to minimize response time to an interrupt. I don't think these would be as useful for calling subroutines, mainly because of the difficulty of returning results in registers (if the callee did PSH at the start then PLL at the end would trash anything you put in them...unless maybe the caller did PSH? Then after the callee returns, save any results passed in registers before the caller does PLL...might work. Okay, I like it)

A per-register stack is a neat idea, but what do you do if it overflows?

I thought of these opcodes as a way to write model/family-independent interrupt routines. A later 650x0 could have more registers, and this PLL/PSH would simply be extended to pull/push them from/onto the stack.
This would not necessarily be used for calling subroutines - here I would use the registers for parameters and return values anyway, and selectively save those on the stack that are needed.
Quote:
Quote:
FIL - fill a memory area with a byte value

Again a short subroutine. Is the speed advantage (one or two cycles per byte?) worth it?

I don't intend to make it as inefficient as the 65816 one. I was more thinking of one cycle per byte (or even less with a wider memory interface).
Quote:
Quote:
ADS - add value to stack pointer
SBS - subtract value from stack pointer

Fun. I like these, although playing with stack pointer is not for novices.

No, but those can be used for handling pointers to parameters on the stack.
Quote:
Quote:
MVN/MVP - move a memory area (see 65816)

I never liked this instruction much. Kind of ugly. The coolest answer to this in the 6502 world I ever saw were the Commodore RAM expanders. That REC chip could do transfers and fills at a byte per clock cycle, and swap two bytes in two clock cycles.

I wonder how that would work, swap two bytes in two clock cycles. I can only see this with two memory banks read in parallel and written back in parallel.
Anyway, the plan is to have two cycles per byte maximum (maybe less with wider memory interfaces)
Quote:
Quote:
HBS - determine bit number of highest bit that is set (like log2)
HBC - determine bit number of highest bit that is clear
BSW - bit swap - exchange bit 7 with bit 0, bit 6 with bit 1, etc

How is the result returned, BTW? What if no bit is set? If bits are numbered 7..0 then zero is not an answer to that. I guess I'm having trouble seeing how useful they'd be, ie., what problem do they solve?


Admittedly HBS/HBC are "imagined", but could possibly come in handy to shorten an integer multiply/division routine.
BSW I would find useful for computing the swapped addresses in a fast Fourier transform (although then it would probably be better used on an index register).
Those would either be read memory -> do operation -> store in AC, or AC -> operation -> AC. I don't think a R/M/W would be useful here though.
If no bit is set/cleared on HBS/HBC respectively, the carry bit could be set, for example. I haven't thought about that yet, I have to admit.
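
Here is a rough Python sketch of what these three operations could return. The names and the (value, carry) return convention are assumptions for illustration; in particular, setting carry when no bit qualifies is only the possibility floated above, not a settled design.

```python
def hbs(value, width_bits=8):
    """Highest set bit: returns (bit_number, carry); carry=1 if no bit is set."""
    value &= (1 << width_bits) - 1
    if value == 0:
        return 0, 1
    return value.bit_length() - 1, 0


def hbc(value, width_bits=8):
    """Highest clear bit: the highest set bit of the complement."""
    return hbs(~value, width_bits)


def bsw(value, width_bits=8):
    """Bit swap: exchange bit 7 with bit 0, bit 6 with bit 1, and so on."""
    result = 0
    for _ in range(width_bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Bit-reversed index order for an 8-point FFT (3-bit indices).
assert [bsw(i, 3) for i in range(8)] == [0, 4, 2, 6, 1, 5, 3, 7]
```

The FFT use case only needs as many bits as the index has, which is why the sketch takes the width as a parameter.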

André

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


PostPosted: Wed Apr 25, 2012 10:39 pm 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
GARTHWILSON wrote:
STF STore FF to a memory location without affecting processor registers. Many times a byte is used as a flag variable, and STZ clears it, so STF would set it much more efficiently than STZ, DEC. LDA #$FF (or $FFFF) followed by STA ___ affects a processor register.

I think I even have (had?) that on a todo list somewhere :-)
Quote:
DIN and DDE Double INcrement and Double DEcrement. Same as INC INC or DEC DEC but faster, and the C flag would tell if you went from FF to 01 or vice-versa without having to test in between the two INC or DEC instructions. This is particularly useful in higher-level languages that are always incrementing pointers to the next two-byte address.

For this I wanted to invent "quick" opcodes that contain a number from 1 to 8 in the opcode and work on Y or X, like
Code:
  INY #2
  DEX #4

Quote:
BEV & BOD ("Branch if EVen" and "Branch if ODd"), using another flag in the status register. I can't remember anymore why I wanted these. The need might go away with the DIN and DDE above. Since the NMOS 6502 had the JMP(xxFF) bug, Forth on that one required keeping 16-bit values aligned on even addresses. It's not a problem on the CMOS one, so potentially thousands of bytes can be saved by not having to align.

Hm, interesting idea. Never thought of it before. But then it seems I probably haven't really needed it.
Quote:
Eliminate most dead bus cycles. I don't know if it gets easier if the input clock were faster than the bus, like 20 or 40MHz input for a 10MHz bus.

As you have to compute values just read into a new address (indirect indexed addressing modes), there will always be "dead" cycles on the core. But for example I was able to hide the 65816 dead cycles in my computer if the CPU was fast enough to still provide the address lines even though the "dead" cycle was in the beginning of Phi1 - however, the clock factor needed IIRC was about 1:8 or 1:10.
Quote:
Ones like MULtiply take a lot of silicon real estate.

Yeah, that would be a real dream...

André

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


PostPosted: Thu Apr 26, 2012 5:28 am 

Joined: Thu Dec 11, 2008 1:28 pm
Posts: 10985
Location: England
A couple of points:

On FPGA, multiply is cheap. The multipliers are already there, whether you use them or not. The argument that they use up a lot of silicon area applies to the era of the 6502 rather more strongly than to the era of the 250k gate FPGA. (But division remains difficult.)

If you supply bitcount, parity is cheap. The converse isn't true! Whether bitcount is worth bothering with is another question, but again, on FPGA we don't pay for silicon (within reason), we pay for lines of code and we pay if clock speed is affected.

Final point: beware of wish-list features which trim one or two bytes or clock cycles. These thoughts probably come from experience with 1MHz or 2MHz systems with 64k memory. If a project is delivering a CPU which runs at 25MHz with a 32-bit address space, the baseline performance should already be a lot higher and the memory pressure a lot lower. Time might be better spent on increasing clock speed, or making a smarter SDRAM interface which will speed up all programs, instead of adding or tweaking a few special-case instructions. (Of course, some instructions are worth the effort, and others are not.)

Cheers
Ed


PostPosted: Thu Apr 26, 2012 6:16 am 

Joined: Wed Oct 06, 2010 9:05 am
Posts: 95
Location: Palma, Spain
I wrote:
Also ADD and SUB (without taking C as input) would've been great

Thinking a little more about this - how much extra logic would've been necessary to achieve this, other than some extra lines on the PLA to implement the opcodes? It's just that, as far as I can tell, the 6502 already knows how to do adds and subtracts without using the carry as input, because CMP is just a SBC operation with C=1, and left shifts are implemented as N+N, which means that ASL is just an ADC with C=0.
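
The observation can be sketched in Python with a single shared adder. This only mirrors the description in this post; it makes no claim about how the real silicon is wired. `add_no_carry` and `sub_no_borrow` are the hypothetical ADD and SUB, and the other names are likewise made up for the sketch (binary mode only).

```python
def adder(a, b, carry_in):
    """The one 8-bit adder everything below shares: (result, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def adc(a, m, c):
    return adder(a, m, c)


def sbc(a, m, c):
    return adder(a, m ^ 0xFF, c)


def add_no_carry(a, m):
    """Hypothetical ADD: ADC with the carry input forced to 0."""
    return adder(a, m, 0)


def sub_no_borrow(a, m):
    """Hypothetical SUB: SBC with the carry input forced to 1."""
    return adder(a, m ^ 0xFF, 1)


def cmp_nzc(a, m):
    """CMP seen as SBC with C forced to 1: keep (N, Z, C), drop the result."""
    result, carry = sub_no_borrow(a, m)
    return result >> 7, int(result == 0), carry


def asl(a):
    """ASL seen as A + A with the carry input forced to 0."""
    return adder(a, a, 0)


assert all(asl(a) == ((a << 1) & 0xFF, a >> 7) for a in range(256))
```

In this model ADD and SUB need nothing beyond what CMP and ASL already use, so the cost is in opcodes rather than in adder logic.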


PostPosted: Thu Apr 26, 2012 7:27 am 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
RichTW wrote:
I wrote:
Also ADD and SUB (without taking C as input) would've been great

Thinking a little more about this - how much extra logic would've been necessary to achieve this, other than some extra lines on the PLA to implement the opcodes? It's just that, as far as I can tell, the 6502 already knows how to do adds and subtracts without using the carry as input, because CMP is just a SBC operation with C=1, and left shifts are implemented as N+N, which means that ASL is just an ADC with C=0.

It would take almost no logic, but it would take a large chunk of opcode space, which may be better used for something else.


PostPosted: Thu Apr 26, 2012 9:41 am 

Joined: Wed Oct 06, 2010 9:05 am
Posts: 95
Location: Palma, Spain
That's what I figured, but the original 6502 had plenty of spare opcode space. Why were ASL/ROL and LSR/ROR afforded the luxury of having versions which ignored the carry or used the carry, while ADC and SBC were not? Arguably, addition and subtraction are more basic and common operations than multiplication/division by 2.


PostPosted: Thu Apr 26, 2012 11:45 am 

Joined: Tue Nov 16, 2010 8:00 am
Posts: 2353
Location: Gouda, The Netherlands
RichTW wrote:
That's what I figured, but the original 6502 had plenty of spare opcode space. Why were ASL/ROL and LSR/ROR afforded the luxury of having versions which ignored the carry or used the carry, while ADC and SBC were not? Arguably, addition and subtraction are more basic and common operations than multiplication/division by 2.

Probably because it only takes one instruction to clear/set the carry flag, while it would take multiple instructions to simulate a ROR with LSR, or the other way around, which would be very annoying if you had to do multiple shifts in a row. Also, the shift/rotate instructions only have 5 addressing modes, while ADC/SBC have 8. And while there's plenty of spare opcode space, I suspect the designers wanted to have the option of using it for something they thought was more useful.


PostPosted: Thu Apr 26, 2012 7:01 pm 

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
Arlet wrote:
It would take almost no logic, but it would take a large chunk of opcode space, which may be better used for something else.


In my 65k I have an "LEA" (load effective address) instruction that loads the E register using any addressing mode, so some of the new opcodes only get immediate and E-indirect modes:
Code:
    RDR (E)

would do a direct rotate of the memory location whose address is stored in E. That helps avoid cluttering up the opcode space.

André

_________________
Author of the GeckOS multitasking operating system, the usb65 stack, designer of the Micro-PET and many more 6502 content: http://6502.org/users/andre/


PostPosted: Fri Apr 27, 2012 9:30 pm 

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
GARTHWILSON wrote:
Quote:
1: Loops never get tight enough.
2: The BNE is (nearly) symmetric as to the maximum distance of branching forward or backward and it gets inelegant if you have to branch more than 128 away from present address. An instruction as proposed here would lend itself to a backward bias, with branches -192 to +64, perhaps even -224 to +32, saving ugly branches plus JMP.

I never noticed any injustice in the backward-versus-forward branch distance myself. Although I've done the branches around JMPs, I don't remember it ever being to get back to the top of a long loop. However there have been many times when longish forward branches were needed to bypass a portion of code that should be skipped under the current conditions.

I noticed it only a few times, typically in large nested loops. However most of my work was time critical, which was part of the reason to write assembly code in the first place. I remember compiler authors also back then said their compilers created better code than hand crafted assembly. My employers disagreed and I was brought in on many a project to squeeze that extra bit of oomph out of the code or shave off memory or bandwidth requirements. The 6502, in my view, was very appealing for coding and lent itself to ultra high density code.

Quote:
The benchmarks truly are favorable to the 6502; but the one that really counts of course is your own application. I have a short list of new instructions I would like added, but I can imagine a list of reasons why they have not been added, ranging from instruction-decoding complexity versus speed, to silicon real-estate costs which apparently are one of the major motivators for WDC's licensees to choose the 65c02 over the competition for many jobs.

I did all sorts of projects, data compression, image compression, map and navigation systems and more. It was amazing what you could fit into 32 KB memory. And the 6502 delivered every time. Unfortunately I don't have any of that code. If we had a substantial code corpus we could do some statistical analysis on what new instructions could gain us in terms of memory and cycle savings.

Is anyone from WDC on this forum? It would be interesting to hear their views on what this discussion is bringing up. Implementing a LPX instruction should be trivial to them.


PostPosted: Fri Apr 27, 2012 9:38 pm 

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
teamtempest wrote:
Quote:
LPX (=DEX; BNE) might even be combined as LPX ++Y (=INY; DEX; BNE) allowing fast reversal. Simultaneous decrementing both X and Y seems less useful.

My first reaction is: what sets the status register flags? The change in Y or the change in X?

The idea was that for LPX the X-register sets the flag. Part of my thinking in writing "++Y" rather than "Y++" was to indicate that Y is incremented first, in other words before X is decremented.

Quote:
Presumably the answer is the change in X. Okay, but what am I going to know about the state of Y (ie., the value it holds) after the loop terminates? In the general case, presumably nothing.

Leaving Y unknown is easy and convenient, but in my view also untidy. Considering that a common trick of the trade is to rely on a known state of a register for further work, it would be counterproductive to leave big unknowns around. I would prefer that Y was incremented every time X was decremented.

Quote:
My second reaction is: oh, this adds to the complexity of trying to visualize just what the instruction really does. Every other instruction changes just one register at a time; this would be un-orthogonal to them. Oh, my poor head!

I do pity myself quite easily.

Um, well, changing two registers at a time is common, like X-register and processor status. What I suggest here is, strictly speaking, changing 3 registers. It would be an unusual case but the benefit would be significantly increased code density and an opportunity for cycle saving.
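
To pin down the semantics I have in mind, here is a small Python model of one execution of LPX ++Y (INY, then DEX, then branch while X is non-zero), plus a loop that uses it to copy a buffer in reverse order. The function names and the returned tuple are only illustrative; nothing here is an existing instruction.

```python
def lpx_inc_y(x, y):
    """One LPX ++Y: increment Y first, then decrement X; X decides the branch."""
    y = (y + 1) & 0xFF
    x = (x - 1) & 0xFF
    branch_taken = x != 0
    return x, y, branch_taken


def reverse_copy(src):
    """Copy src reversed, with X counting down while Y counts up."""
    count = len(src)
    assert 1 <= count <= 255
    dst = [0] * count
    x, y = count, 0
    while True:
        dst[y] = src[x - 1]          # like LDA src-1,X / STA dst,Y
        x, y, taken = lpx_inc_y(x, y)
        if not taken:
            break
    # Y is in a known state on exit: start value plus the iteration count.
    assert (x, y) == (0, count)
    return dst
```

Because Y moves on every pass, its final value is always the start value plus the loop count (modulo 256), which is the tidy, known state argued for above.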


PostPosted: Fri Apr 27, 2012 9:49 pm 

Joined: Mon Apr 16, 2012 8:45 pm
Posts: 60
Dr Jefyll wrote:
Alienthe wrote:
So why was SWEET16 never implemented in hardware? Like the 6502 it was elegant, orthogonal and required no prefix codes. It seems to me like a lost opportunity.
I have to admit I like this idea. But I suspect it won't do much in terms of a useful speedup of existing code. Sweet 16 was used to shrink the memory footprint when a task wasn't especially performance-critical, isn't that right? But writing new Sweet 16 code to run natively in hardware.... that would be cool (and fast!)

Actually SWEET16 is already in use, in the Apple II. And a search on the net shows that a few assemblers also support this syntax. This means that use and infrastructure is already in place, moreover it also means that SWEET16 is a proven concept.

I would propose some minor tweaks, say SWEET17 (she grew up, right?), by synchronising the A, X and Y registers with SWEET16 registers on entry and exit:
Code:
R0 low byte: A
R0 high byte: zero on entry
R1: B register - concatenated with A when switching to SWEET32 mode (she grew up to a lady)
R4 low byte: X
R4 high byte: zero on entry
R6 low byte: Y
R6 high byte: zero on entry


Going from 6502 to SWEET17 provides, in my view, a clear path for scaling up in width.


PostPosted: Fri Apr 27, 2012 10:02 pm 

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Alienthe wrote:
I remember compiler authors also back then said their compilers created better code than hand crafted assembly. My employers disagreed and I was brought in on many a project to squeeze that extra bit of oomph out of the code or shave off memory or bandwidth requirements. The 6502, in my view, was very appealing for coding and lent itself to ultra high density code.

Yes, for 6502, I would find it hard to believe that a human couldn't do a lot better than a compiler.  What the compiler authors were saying might be true of other processors whose assembly languages are nearly too complex for most programmers to do well with in assembly language.

Quote:
Is anyone from WDC on this forum? It would be interesting to hear their views on what this discussion is bringing up. Implementing a LPX instruction should be trivial to them.

There is one, but I don't think I've ever seen him post.

_________________
http://WilsonMinesCo.com/ lots of 6502 resources
The "second front page" is http://wilsonminesco.com/links.html .
What's an additional VIA among friends, anyhow?

