I'm not sure about "minus most three-byte instructions" - the instructions are still there, but their length is different. I think perhaps two interesting things happen:
- detecting short addresses to choose between zero page and absolute will always choose zero page
- so some of the absolute forms never get used. They need not be implemented, which does leave room for alternate opcodes
I think you're probably right that memory-referencing instructions that have both zero-page and absolute forms will always 'naturally' resolve to the two-byte zero-page forms if the assembler is told that zero page spans 4GB. Which is quite a minimal change to make, actually.
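For comparison, this size-based selection is exactly what assemblers already do on a stock 6502, where both forms exist (the opcodes below are the standard ones):

Code: Select all
LDA $44     ; fits in zero page: opcode $A5, two bytes
LDA $4400   ; needs absolute:    opcode $AD, three bytes

Widening the "fits in zero page" test to cover the whole address space would simply make the first case fire every time.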
As you also point out, that won't work for JSR or JMP. They really do have to change from emitting three bytes to just two, as would any other absolute instruction without an equivalent zero-page form (no examples of which come immediately to mind). From a macro-implementation point of view, it's just changing
Code: Select all
JSR .macro ?addr
.byte $20       ; JSR opcode
.byte <(?addr)  ; low half of the address
.byte >(?addr)  ; high half of the address
.endm
to
Code: Select all
JSR .macro ?addr
.byte $20   ; JSR opcode
.byte ?addr ; whole address in one word
.endm
Since there aren't that many instructions affected, it's possible one version or the other could even be conditionally defined within one source file. A native version wouldn't be much harder than that, either.
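The conditional definition might look something like this - assuming the assembler supports .if/.else/.endif conditionals (syntax varies between assemblers), and with WIDE as a hypothetical symbol the programmer defines up front:

Code: Select all
.if WIDE            ; assembling for the wide machine
JSR .macro ?addr
.byte $20           ; JSR opcode
.byte ?addr         ; whole address in one word
.endm
.else               ; assembling for a stock 6502
JSR .macro ?addr
.byte $20           ; JSR opcode
.byte <(?addr)      ; low byte of the address
.byte >(?addr)      ; high byte of the address
.endm
.endif

The rest of the source could then stay identical for both targets.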
The one fewer memory fetch would make them faster, wouldn't it?
One concern: bit shifting by up to 32 bits becomes quite cumbersome with only single-bit shift instructions, and because there are no 8-bit units (octets), you can't take any shortcuts by shifting per octet either.
So it seems an extension for a barrel-shift instruction would be quite useful, even though a full barrel shifter might slow down the core by quite a bit (although you could make it an N-clock-cycle instruction instead).
I'm glad to see others are thinking about what a nuisance having only single-bit shifts might be. Not being a hardware guy, I don't know much about what the solution might be. I don't quite see why there can't be instructions that shift by eight or 16 bits at a time either, since these are natural sizes that come up quite often regardless of how many bits are in the accumulator (for example, ASCII and Unicode character codes).
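To make the cost concrete, here's what a variable shift looks like with only single-bit shifts, in the same hypothetical macro-assembler style as above (the LDX/LSR/DEX/BNE pattern is borrowed from the stock 6502, where that loop body costs about 7 cycles per bit):

Code: Select all
        LDX #24     ; shift count in X
shift:  LSR A       ; one bit per pass
        DEX
        BNE shift   ; ~7 cycles per bit: ~168 cycles for 24 bits

versus a single hypothetical multi-bit instruction, whatever form it might take:

Code: Select all
        LSR A, #24  ; hypothetical barrel-shift form, one instruction

Even a fixed shift-by-eight would cut the loop count by a factor of eight for the common octet-aligned cases.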
One thought regarding speed, though. Is it really such a problem? I've been told that actual sustained reading speeds of DDR2 memory chips and the like are closer to 50 MHz than the 800 MHz (or whatever) burst speeds they are capable of. That would seem to match what Ed's tests are showing pretty well. Much faster, and the problems of matching a faster CPU with a slower memory are going to become more prominent.
Or am I totally off base here?
Oh, and one other thought: if there were an N-clock shift instruction, how would that affect interrupt latency? Or would it be interruptible rather than atomic? I suppose from a programming point of view I'd care most only if I were trying to interface hardware of some kind, but I'm kind of curious anyway. I suppose the same question could be asked about a multiply or divide instruction, actually, unless they could be done in 6 cycles or fewer (unlikely without a lot of additional circuitry, I suspect).