Quote:
I don't think this is the case, as the 65xx architecture is pretty much optimized to use as many cycles for valid accesses due to its internal pipelining.
Obviously, if we're discussing widening the buses, the fundamental architecture of the chip must be altered too. This completely invalidates your core assumption that the 65xx's execution core remains unaltered. Remember, the internal instruction set is
fundamentally architected around an 8-bit wide bus. Looking at the micrograph of the 6502, it's clear that each instruction implementation is responsible for driving the buses too (e.g., there is no distinct chunk of logic dedicated to managing the address bus, none dedicated to the data bus, etc), which implies a
very tight degree of coupling between RAM and the CPU's registers. Looking at the block diagram of the 6502 seems to confirm this, where the only thing separating the external and internal buses are a set of transceivers. Yikes! That means, if we were to widen the buses, then we'd need to rearchitect the core
completely to take advantage of the new bus width.
Quote:
For example if you read an aligned "CLC", this opcode takes a single byte, but two cycles, where the second cycle is a "dead" cycle (see also the discussion about the transparent DMA channel
http://forum.6502.org/viewtopic.php?t=1160&start=15 )
So if you read the next opcode in the same cycle, you may already decode the next opcode during the dead cycle, which saves you a cycle.
Irrelevant. If you read CLC:ADC in a single cycle, your instruction decoder (which is aware of 16-bit wide fetches now, remember!) can recognize this as a "macro" instruction, which allows it to clear the carry
while fetching the ADC operand bytes. In essence, you get CLC for free, shaving off 2 cycles (1 cycle for the CLC dead cycle and 1 for the ADC opcode fetch, since it's
already co-decoded in the instruction register).
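To make the idea concrete, here's a toy sketch of such a fused decode. The pairing rule and return shape are purely hypothetical (mine, not WDC's); only the opcode values are the real 65xx encodings:

```python
# Sketch of "macro" decoding for a hypothetical 65xx with 16-bit fetches.
# CLC = $18 and ADC abs = $6D are the real 65xx encodings; the fusion
# logic itself is an assumption for illustration.

CLC, ADC_ABS = 0x18, 0x6D

def decode_pair(lo, hi):
    """Decode two opcode bytes fetched in one cycle.

    If the pair is CLC followed by ADC, fuse them: the carry clear
    happens as a side effect while the ADC decode proceeds, so the
    CLC costs zero extra cycles.
    """
    if lo == CLC and hi == ADC_ABS:
        return ("ADC_ABS", {"clear_carry": True})   # fused: CLC is free
    if lo == CLC:
        return ("CLC", {})
    return ("UNKNOWN", {})

op, flags = decode_pair(0x18, 0x6D)
assert op == "ADC_ABS" and flags["clear_carry"]
```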
Quote:
What, however, if you have opcodes that access memory: Just like "LDA abs" which has three bytes, but 4 cycles with 4 valid memory addresses - three for the opcode+address, one for the actual read of the data.
Once again, Intel CPUs have been doing this since the 8086, back in 1978.
The answer is trivially simple:
1. Fetch opcode + low byte of absolute address. The instruction decoder sees that it's not yet complete, so we ...
2. Fetch high byte of absolute address + next opcode. The full instruction is fetched now, so we can ...
3. Fetch from absolute address.
3b. *IF* address is unaligned, *AND* it's a 16-bit fetch, fetch the next byte too.
4. We still have one byte left over from fetching the previous opcode, which we take as the next opcode. Fetch operands from next opcode.
We can simplify this architecture by making the EU a simple byte-wide execution engine with the ability to "look ahead" in the instruction queue. Which, of course, means we now need an instruction queue. It doesn't need to be terribly big -- even the largest of queues on the x86 platform are 16 bytes long. Since the longest 65816 instruction is 4 bytes, I claim that a 5-byte prefetch queue is plenty sufficient.
Note, when consuming operands from the queue, it can do so in entire chunks. So while it pops 8-bit opcodes from the queue sequentially, it can freely pop 16- and 24-bit items in a single clock as required. No need to spend 2 to 4 cycles removing operand bytes, when they're sitting right there for inspection directly from the queue.
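A minimal model of such a queue, assuming the 5-byte depth and little-endian operands discussed above (the class and method names are made up for illustration):

```python
from collections import deque

class PrefetchQueue:
    """A 5-byte prefetch queue, as proposed above (hypothetical design).

    The bus side pushes fetched bytes in; the execution unit pops
    opcodes one byte at a time, but may pull 16- or 24-bit operands
    out in a single call, modeling a single-clock multi-byte pop.
    """
    def __init__(self, depth=5):
        self.depth = depth
        self.q = deque()

    def push(self, *fetched):
        assert len(self.q) + len(fetched) <= self.depth
        self.q.extend(fetched)

    def pop(self, nbytes):
        """Pop an n-byte little-endian item in one 'clock'."""
        assert len(self.q) >= nbytes
        value = 0
        for shift in range(0, 8 * nbytes, 8):
            value |= self.q.popleft() << shift
        return value

# LDA $1234: opcode $AD, then a 16-bit little-endian address.
pq = PrefetchQueue()
pq.push(0xAD, 0x34, 0x12)
assert pq.pop(1) == 0xAD      # opcode, one byte
assert pq.pop(2) == 0x1234    # whole operand in a single pop
```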
Quote:
The last (data) access saves you nothing
This is ambiguous. What does "last (data)" refer to here?
Quote:
it even increases power consumption, as reading 16 bit in general draws more power than 8 bit and you never know whether you actually need the second byte (This would work better if the AC were 16 bit). With the second opcode read you may read the following opcode, saving this cycle again.
If you're after performance, you are
positively not after minimum power consumption anyway. By definition, the fact that there are more transistors in the CPU, regardless of the bus width, means that you'll be pulling more power. The 65816 should definitely draw more power than the 6502 because of this, despite the fact that it still has only an 8-bit bus.
Still, I think you're badly over-estimating the effects of these changes. A higher performance 65816 can be made with just those performance-boosting features that are required. If you're going all-out desktop-class processor, be aware that there are dual-core MIPS-4 architecture chips (64-bit wide RISC processors with millions of transistors and lots of on-board cache to play with) that draw only 1.5W of power (compare with x86 processors of similar performance, and you're looking at at least 50W).
Quote:
You can actually not do this with a store - think self-modifying code. If a "STA abs" stores a new value at the position of the next opcode, you have to detect this and invalidate the next opcode that had already been read.
No, you don't. It's NICE to do, but you don't HAVE to. First of all, what business do you have storing data in the next instruction? If you make that kind of thing illegal to do without an explicit "instruction prefetch flush" instruction, that requirement disappears. Finally, if you're worried about power consumption, you're also going to be running mainly from ROM (e.g., an embedded environment) where such an instruction does absolutely nothing anyway.
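For what it's worth, if you did choose to detect stores into the prefetched stream (as x86 does), the check itself is cheap. A sketch, assuming the queue tracks the address of its oldest byte (the names are hypothetical):

```python
def store_hits_prefetch(store_addr, queue_base, queue_len):
    """Return True when a store lands inside bytes already prefetched.

    queue_base is the address of the oldest byte still in the queue.
    This is roughly what an x86-style prefetch snoop amounts to,
    sketched in Python; a real chip would spend extra transistors
    comparing addresses in hardware.
    """
    return queue_base <= store_addr < queue_base + queue_len

# STA into the very next instruction: must flush (if we bother to detect it)
assert store_hits_prefetch(0x0203, 0x0202, 4)
# STA somewhere harmless: queue stays valid
assert not store_hits_prefetch(0x8000, 0x0202, 4)
```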
You can implement logic to detect this (the x86 architecture does), thus preserving the illusion of non-prefetched code. But, as you say above, that will impair performance and, the extra logic needing extra transistors, will absolutely draw more power. But, then again, since we're looking for ways to get the 65xx architecture to execute code faster to begin with, then obviously, you'll also want to avoid code that is known to slow it down too. This is a clear case of a patient going to the doctor to complain of a headache after slamming his head into the wall a couple times. The prescription is clear:
don't do that.
Quote:
And then think of branch misses. I.e. when a branch opcode branches, the possibly read next opcode is not usable.
Irrelevant. When a branch is taken, you simply flush the queue and resume fetching from the target, exactly as every prefetching CPU since the 8086 has done. You lose nothing relative to the non-prefetched case.
Quote:
I would rather expect a save of less than 1 cycle per opcode, which is not 50% faster, is it?
What? You're not clear here.
Quote:
Also on a misaligned address (e.g. even the 2-byte address of an absolute addressing opcode that is aligned) the CPU internally needs to load the high byte of the memory word into the low byte of the address latch, and with the next read put the low byte of the memory word into the high byte of the address latch. Thinking of all those combinations (remember, there are addressing modes like "(zp),y", "(abs)" or "(zp,x)" that all read addresses from different memory locations) makes for a very complicated system.
No it doesn't. Not even close.
Execution unit presents a 16- or 24-bit address to the Bus Interface Unit, and it's the BIU's job to handle memory misalignments. The EU will just sit and wait until all data is fetched. The BIU, in the meantime, need only concern itself with
two possibilities (assuming a 16-bit bus width, which you're assuming here):
* A full word-aligned access
* A two-cycle, word-misaligned access
These are all solved problems.
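A sketch of that BIU splitting logic, assuming a 16-bit bus with D0-D7/D8-D15 byte lanes (the function and its return format are illustrative, not any shipping part's interface):

```python
def bus_cycles(addr, nbytes):
    """Split an access into the bus cycles the BIU must run.

    Returns a list of (word_address, byte_enables) pairs, where the
    enables name the active byte lane(s). Assumes a 16-bit data bus;
    the EU just waits until the list is exhausted.
    """
    cycles = []
    end = addr + nbytes
    while addr < end:
        word = addr & ~1
        if addr & 1:                       # misaligned: high lane only
            cycles.append((word, ("D8-D15",)))
            addr += 1
        elif end - addr >= 2:              # aligned full word, one cycle
            cycles.append((word, ("D0-D7", "D8-D15")))
            addr += 2
        else:                              # trailing single byte, low lane
            cycles.append((word, ("D0-D7",)))
            addr += 1
    return cycles

# Aligned word fetch: one cycle.
assert bus_cycles(0x1000, 2) == [(0x1000, ("D0-D7", "D8-D15"))]
# Misaligned word fetch: two cycles, one lane each.
assert bus_cycles(0x1001, 2) == [(0x1000, ("D8-D15",)),
                                 (0x1002, ("D0-D7",))]
```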
Quote:
I still remember the issues the 68k had with 16bit opcodes, 32bit registers, and access widths ranging from 8 to 16 bit.
Issues which turned out to be non-issues, considering the
sheer quantity of platforms built around the 68K line, the vociferous protest of designers everywhere when Motorola stated they'd stop production of the 68K altogether, and the sheer volume of programming language research performed with the 68K as the foundation. Indeed, ColdFire exists
only to sate the embedded designers. The Dragonball's only real customer was Palm, and once they switched to Arm, that spelt the death of the 68K-as-such completely. Nowadays, if you want to avoid the Intel architecture like the plague, Arm and PowerPC are the only way to go for moderately high performance, low-power applications.
Quote:
The hardware interface had select lines so it could read single bytes even though it was a 16bit interface - but some peripherals are simply just 8 bit wide.
Every single integrated CPU made since the 68000 has had a similar interface. Even Intel's approach for the 8086-80286 implemented it:
Code:
BHE A0 Description
0 0 Data on D0-D15 (word-size data)
0 1 Data on D8-D15 (odd address)
1 0 Data on D0-D7 (even address)
1 1 Never generated as far as I know.
The use of byte lanes for a byte-addressing, 16- or 32-bit wide 65xx is of course assumed.
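For illustration, the /BHE and A0 strobes in the table above can be derived from the low address bit and the access size like so (a sketch of the 8086 convention, not production logic):

```python
def bhe_a0(addr, word_access):
    """Derive the 8086-style /BHE and A0 strobes for one bus cycle.

    Active-low /BHE enables D8-D15; A0 low selects D0-D7. A word
    access must be even-aligned here, since the 8086 splits an
    odd-aligned word into two byte cycles before this point.
    """
    a0 = addr & 1
    if word_access:
        assert a0 == 0, "split misaligned words into two byte cycles first"
        return (0, 0)       # both lanes active: word on D0-D15
    return (a0 ^ 1, a0)     # even byte: (1,0) low lane; odd byte: (0,1) high lane

assert bhe_a0(0x2000, True)  == (0, 0)   # aligned word
assert bhe_a0(0x2000, False) == (1, 0)   # even byte on D0-D7
assert bhe_a0(0x2001, False) == (0, 1)   # odd byte on D8-D15
```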
Unless you fundamentally redefine the byte width of the machine to 16-bits. Which brings me to . . .
Quote:
A side note: With a 16-bit bus width 6502 all applications using I/O would have to be rewritten for those peripherals that are still 8-bit (with for example the VIA registers appearing not at +0,+1,+2,...+f, but at +0,+2,+4,+6, ... +1e so you could access them with 16bit. And then again once you go 32bit
If the byte width is widened to 16-bits, then by definition,
no smaller addressable unit exists. Therefore, 8-bit-wide devices can sit on D0-D7, ignoring D8-D15, while addresses still increment by one. The TMS9900 16-bit processor (the world's first, by the way) did things this way, as did its immediate predecessor the TMS990-series mainframe processor. As far as I can tell, it really kicked butt.
This is also the approach that WDC's 32-bit Terbium processor takes, which has a 32-bit internal, 16-bit external architecture, capable of addressing 4.2 giga-words (they're continuing to use the term "byte" for an 8-bit quantity, which is actually incorrect for that architecture). Whether or not they eliminate dead-cycles remains to be seen, however.
Quote:
- optimize the internal pipeline such that there are no dead opcodes. Why does "CLC" for example need two cycles?
Most likely, it's a design simplification.
Quote:
then independently build a cache in front of the CPU, where the CPU interface is 8 bit and blazingly fast on a cache hit. The memory interface can be 16 or even more bit. Use this cache for opcodes only, so loops don't get pushed out of the cache by data.
The whole purpose of a cache is to increase data flow in and out of the processor, not to constrain it. If the core EU has only an 8-bit path to cache, then what in the world is the point? I would like to point out, also, that this equally invalidates support for self-modifying code.