PostPosted: Sun Jun 03, 2007 5:12 pm 

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
pipettas wrote:
I believe that the 65816 has a lot of improvements w.r.t. the 6502, in fact, but it is still a 16-bit processor. Offering real 32-bit access, things should go faster, isn't it?


Garth,

What's meant here is that the CPU would have higher performance with a 32-bit bus. The 65816 still has only an 8-bit bus. Making a 65xx processor that is 32-bit but still bottlenecked by an 8-bit external bus will have diminishing returns. It'd take 4 cycles to save one word of data.

What pipettas is arguing for is a wider bus width. If the 65816 is double the speed of a 6502 with an 8-bit bus, imagine how much faster it could go with a real 16-bit bus. I predict 50% faster easily, since at least half of the words you store or read to/from memory will be aligned, and therefore, can be fetched in a single clock. Macro-instruction sequences, such as CLC/ADC or SEC/SBC, if properly aligned, can be decoded in a single clock, thus shaving 3 cycles off the operation. Things like that.
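The alignment argument can be sanity-checked by counting bus cycles for 16-bit data accesses. This is only a back-of-the-envelope sketch: the function name is made up, and it assumes an aligned word costs one bus cycle and a misaligned word costs two, which is a modeling assumption rather than a measurement of any real part.

```python
def avg_cycles_per_word(aligned_fraction, bus_bits=16):
    """Average bus cycles to move one 16-bit word (toy model)."""
    if bus_bits == 8:
        return 2.0  # an 8-bit bus always needs two transfers per word
    # 16-bit bus: one cycle when aligned, two when the word straddles two bus words
    return aligned_fraction * 1 + (1 - aligned_fraction) * 2

narrow = avg_cycles_per_word(0.5, bus_bits=8)
wide = avg_cycles_per_word(0.5, bus_bits=16)
print(narrow, wide, narrow / wide)
```

In this model the data traffic alone covers only part of the predicted gain when half the words are aligned; the rest would have to come from cheaper opcode and operand fetches.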

Of course, these are all no-brainers. Intel has been using tricks like these to make the 80x86 architecture more performant than the instruction set would suggest ever since the 80286.


PostPosted: Sun Jun 03, 2007 8:49 pm

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
kc5tja wrote:
What's meant here is that the CPU would have higher performance with a 32-bit bus. The 65816 still has only an 8-bit bus. Making a 65xx processor that is 32-bit but still bottlenecked by an 8-bit external bus will have diminishing returns. It'd take 4 cycles to save one word of data.

What pipettas is arguing for is a wider bus width. If the 65816 is double the speed of a 6502 with an 8-bit bus, imagine how much faster it could go with a real 16-bit bus. I predict 50% faster easily, since at least half of the words you store or read to/from memory will be aligned, and therefore, can be fetched in a single clock. Macro-instruction sequences, such as CLC/ADC or SEC/SBC, if properly aligned, can be decoded in a single clock, thus shaving 3 cycles off the operation. Things like that.


I don't think this is the case, as the 65xx architecture is pretty much optimized to use as many cycles as possible for valid accesses, thanks to its internal pipelining.

For example if you read an aligned "CLC", this opcode takes a single byte, but two cycles, where the second cycle is a "dead" cycle (see also the discussion about the transparent DMA channel http://forum.6502.org/viewtopic.php?t=1160&start=15 )
So if you read the next opcode in the same cycle, you may already decode the next opcode during the dead cycle, which saves you a cycle.

What, however, if you have opcodes that access memory: Just like "LDA abs" which has three bytes, but 4 cycles with 4 valid memory addresses - three for the opcode+address, one for the actual read of the data. The last (data) access saves you nothing - it even increases power consumption, as reading 16 bit in general draws more power than 8 bit and you never know whether you actually need the second byte (This would work better if the AC were 16 bit). With the second opcode read you may read the following opcode, saving this cycle again.

You can actually not do this with a store - think self-modifying code. If a "STA abs" stores a new value at the position of the next opcode, you have to detect this and invalidate the next opcode that had already been read.

And then think of branch misses. I.e. when a branch opcode branches, the possibly read next opcode is not usable.

I would rather expect a saving of less than 1 cycle per opcode, which is not 50% faster, is it?
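The arithmetic behind that question can be written down directly. The sketch below is illustrative only: the 4-cycle average per opcode is an assumed figure, not a measured instruction mix, and both function names are hypothetical.

```python
def speedup(avg_cycles, saved_cycles):
    """Speedup factor when every opcode gets `saved_cycles` cheaper."""
    return avg_cycles / (avg_cycles - saved_cycles)

def cycles_to_save(avg_cycles, target_speedup):
    """Cycles that must be saved per opcode to reach `target_speedup`."""
    return avg_cycles - avg_cycles / target_speedup

AVG = 4.0  # assumed average cycles per opcode
for saved in (0.5, 1.0):
    print(saved, round(speedup(AVG, saved), 3))
print(round(cycles_to_save(AVG, 1.5), 3))
```

Under that assumption, saving a full cycle per opcode gives roughly a one-third speedup, and reaching 50% would need more than one cycle saved per opcode on average.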

Also on a misaligned address (e.g. even the 2-byte address of an absolute addressing opcode that is aligned) the CPU internally needs to load the high byte of the memory word into the low byte of the address latch, and with the next read put the low byte of the memory word into the high byte of the address latch. Thinking of all those combinations (remember, there are addressing modes like "(zp),y", "(abs)" or "(zp,x)" that all read addresses from different memory locations) makes for a very complicated system. I still remember the issues the 68k had with 16bit opcodes, 32bit registers, and access widths ranging from 8 to 16 bit. The hardware interface had select lines so it could read single bytes even though it was a 16bit interface - but some peripherals are simply just 8 bit wide.

A side note: With a 16-bit bus width 6502 all applications using I/O would have to be rewritten for those peripherals that are still 8-bit (with for example the VIA registers appearing not at +0,+1,+2,...+f, but at +0,+2,+4,+6, ... +1e so you could access them with 16bit). And then again once you go 32bit ;-)
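To make the register spacing concrete, here is a minimal sketch that assumes each 8-bit register simply occupies one full bus word. The function name and the $C000 base address are hypothetical, chosen only for illustration.

```python
def via_register_address(base, reg, bus_bytes=1):
    """Address of VIA register `reg` (0..15) when registers sit one bus word apart."""
    if not 0 <= reg <= 15:
        raise ValueError("a VIA has 16 registers")
    return base + reg * bus_bytes

BASE = 0xC000  # hypothetical I/O base address
print([hex(via_register_address(BASE, r, 2) - BASE) for r in (0, 1, 2, 15)])
```

With `bus_bytes=2` the last register lands at offset +1e as described above, and with `bus_bytes=4` the spacing doubles again.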

IMHO the way to improve it would be two stages:
- optimize the internal pipeline such that there are no dead cycles. Why does "CLC" for example need two cycles?
- then independently build a cache in front of the CPU, where the CPU interface is 8 bit and blazingly fast on a cache hit. The memory interface can be 16 or even more bit. Use this cache for opcodes only, so loops don't get pushed out of the cache by data.

It's late, I hope you can make sense out of my comments :-)

André


PostPosted: Sun Jun 03, 2007 10:34 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Quote:
A side note: With a 16-bit bus width 6502 all applications using I/O would have to be rewritten for those peripherals that are still 8-bit (with for example the VIA registers appearing not at +0,+1,+2,...+f, but at +0,+2,+4,+6, ... +1e so you could access them with 16bit). And then again once you go 32bit ;-)

What might be better is to just leave bits 8-15 (or 8-31) empty, and still keep the same addresses for all the VIA registers. If you load from one of those addresses, whatever garbage comes from the higher bits is simply ANDed out in the 8-bit fetch operation, and the higher bits of a word stored to that address go nowhere. Because of the peripherals, it would be good to have the 8-bit load operations, but basically the smallest unit of memory would be a set of 16 (or 32) bits, not 8. It would be similar to when I've used a 4-bit RTC on the 6502's 8-bit bus, and the four high bits were irrelevant. There was nothing on the bus in those bit positions at that address. For that I did however need to AND#0FH since there's no way to do that in the same instruction with the LDA.
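A minimal sketch of the masking idea, with hypothetical helper names. It only models which bits survive a read; the bus itself is not modeled.

```python
def read_8bit_device(bus_word):
    """8-bit load from a wider bus: the upper lanes carry garbage and are dropped."""
    return bus_word & 0xFF

def read_4bit_rtc(bus_byte):
    """4-bit RTC on an 8-bit bus: software masks the floating high nibble."""
    return bus_byte & 0x0F

print(hex(read_8bit_device(0xABCD)), hex(read_4bit_rtc(0xF7)))
```

The first helper stands for what a hardware 8-bit fetch would do for free; the second corresponds to the explicit AND that had to follow the LDA in the RTC case.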

For eliminating the dead bus cycles, I would have nothing against running the processor at 4 or 8 times the bus speed so the processor can do more steps per bus cycle. In fact, this would allow running the bus faster with the same memory speeds, and better set-up and hold times could be guaranteed. The way technology has developed, this should be much more feasible than it was in the mid-70's when the original 6502 was designed. The processor might internally run at 100 or 200 MHz with a 25MHz bus speed and memory that is only fast enough for 16MHz with the 65c02's made today.
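As a quick check on those numbers, a sketch assuming the core clock is an integer multiple of the bus clock (the function name is hypothetical):

```python
def internal_steps_per_bus_cycle(core_mhz, bus_mhz):
    """How many core clock steps fit into one bus cycle."""
    steps, remainder = divmod(core_mhz, bus_mhz)
    if remainder:
        raise ValueError("this sketch assumes an integer clock ratio")
    return steps

for core in (100, 200):
    print(core, internal_steps_per_bus_cycle(core, 25))
```

A 100 or 200 MHz core against a 25 MHz bus gives the 4 or 8 internal steps per bus cycle mentioned above.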


PostPosted: Sun Jun 03, 2007 10:47 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Quote:
I don't think this is the case, as the 65xx architecture is pretty much optimized to use as many cycles as possible for valid accesses, thanks to its internal pipelining.


Obviously, if we're discussing widening the buses, the fundamental architecture of the chip must also be altered too. This completely invalidates your core assumption that the 65xx's execution core remains unaltered. Remember, the internal instruction set is fundamentally architected around an 8-bit wide bus. Looking at the micrograph of the 6502, it's clear that each instruction implementation is responsible for driving the buses too (e.g., there is no distinct chunk of logic dedicated to managing the address bus, none dedicated to the data bus, etc), which implies a very tight degree of coupling between RAM and the CPU's registers. Looking at the block diagram of the 6502 seems to confirm this, where the only thing separating the external and internal buses is a set of transceivers. Yikes! That means, if we were to widen the buses, then we'd need to rearchitect the core completely to take advantage of the new bus width.

Quote:
For example if you read an aligned "CLC", this opcode takes a single byte, but two cycles, where the second cycle is a "dead" cycle (see also the discussion about the transparent DMA channel http://forum.6502.org/viewtopic.php?t=1160&start=15 )
So if you read the next opcode in the same cycle, you may already decode the next opcode during the dead cycle, which saves you a cycle.


Irrelevant. If you read CLC:ADC in a single cycle, your instruction decoder (which is aware of 16-bit wide fetches now, remember!) can recognize this as a "macro" instruction, which allows it to clear the carry while fetching the ADC operand bytes. In essence, you get CLC for free, shaving off 2 cycles (1 cycle for the CLC dead cycle and 1 for the ADC opcode fetch, since it's already co-decoded in the instruction register).
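A toy decoder shows what recognizing such a pair could look like. The opcode values are the standard 6502 encodings; the table, the function name, and the assumption that the low byte of an aligned 16-bit fetch is always an opcode are simplifications made for this sketch.

```python
CLC, SEC = 0x18, 0x38
ADC_ABS, SBC_ABS = 0x6D, 0xED

# Opcode pairs the hypothetical decoder treats as one "macro" instruction.
FUSABLE = {
    (CLC, ADC_ABS): "CLC+ADC abs",
    (SEC, SBC_ABS): "SEC+SBC abs",
}

def decode_word(first, second):
    """Return (name, opcode_bytes_consumed) for one aligned 16-bit fetch."""
    name = FUSABLE.get((first, second))
    if name is not None:
        return name, 2          # both opcodes retire together
    return f"opcode ${first:02X}", 1

print(decode_word(0x18, 0x6D))
print(decode_word(0xEA, 0x18))
```

A misaligned CLC (sitting in the high byte of a word) never reaches the fused path in this model, which is why the saving only applies to properly aligned pairs.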

Quote:
What, however, if you have opcodes that access memory: Just like "LDA abs" which has three bytes, but 4 cycles with 4 valid memory addresses - three for the opcode+address, one for the actual read of the data.


Once again, Intel CPUs have been doing this since the 8086, back in 1978.

The answer is trivially simple:

1. Fetch opcode + low byte of absolute address. The instruction decoder sees that it's not yet complete, so we ...
2. Fetch high byte of absolute address + next opcode. The full instruction is fetched now, so we can ...
3. Fetch from absolute address.
3b. *IF* address is unaligned, *AND* it's a 16-bit fetch, fetch the next byte too.
4. We still have one byte left over from fetching the previous opcode, which we take as the next opcode. Fetch operands from next opcode.

We can simplify this architecture by making the EU a simple byte-wide execution engine with the ability to "look ahead" in the instruction queue. Which, of course, means we now need an instruction queue. It doesn't need to be terribly big -- even the largest of queues on the x86 platform are 16 bytes long. Since the longest 65816 instruction is 4 bytes, I claim that a 5-byte prefetch queue is plenty sufficient.

Note, when consuming operands from the queue, it can do so in entire chunks. So while it pops 8-bit opcodes from the queue sequentially, it can freely pop 16- and 24-bit items in a single clock as required. No need to spend 2 to 4 cycles removing operand bytes, when they're sitting right there for inspection directly from the queue.
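The fetch sequence and the queue can be sketched as a small model. Everything here is an assumption for illustration: the class name, the 2-byte bus, the rule that one bus cycle reads the rest of the current bus word, and the list standing in for memory.

```python
from collections import deque

class PrefetchQueue:
    """Toy model of a small instruction queue fed by a 16-bit bus."""

    def __init__(self, memory, pc=0, capacity=5, bus_bytes=2):
        self.memory = memory
        self.next_fetch = pc
        self.capacity = capacity
        self.bus_bytes = bus_bytes
        self.queue = deque()
        self.bus_cycles = 0

    def fill(self):
        """Spend one bus cycle reading the rest of the current bus word."""
        word_end = (self.next_fetch // self.bus_bytes + 1) * self.bus_bytes
        room = self.capacity - len(self.queue)
        count = min(word_end - self.next_fetch, room)
        for _ in range(count):
            self.queue.append(self.memory[self.next_fetch])
            self.next_fetch += 1
        self.bus_cycles += 1

    def take(self, n):
        """Pop an opcode (n=1) or a whole operand (n=2 or 3) in one go."""
        if n > self.capacity:
            raise ValueError("request larger than the queue")
        while len(self.queue) < n:
            self.fill()
        return [self.queue.popleft() for _ in range(n)]

program = [0xAD, 0x34, 0x12, 0xEA]   # LDA $1234 ; NOP
q = PrefetchQueue(program)
opcode = q.take(1)
operand = q.take(2)
next_opcode = q.take(1)
print(q.bus_cycles)
```

In this run the three instruction bytes of LDA abs plus the following opcode arrive in two bus cycles, and the next opcode is already waiting in the queue when the data read completes.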

Quote:
The last (data) access saves you nothing


This is ambiguous. What does "last (data)" refer to here?

Quote:
it even increases power consumption, as reading 16 bit in general draws more power than 8 bit and you never know whether you actually need the second byte (This would work better if the AC were 16 bit). With the second opcode read you may read the following opcode, saving this cycle again.


If you're after performance, you are positively not after minimum power consumption anyway. By definition, the fact that there are more transistors in the CPU, regardless of the bus width, means that you'll be pulling more power. The 65816 should definitely draw more power than the 6502 because of this, despite the fact that it still has only an 8-bit bus.

Still, I think you're badly over-estimating the effects of these changes. A higher performance 65816 can be made with just those performance-boosting features that are required. If you're going all-out desktop-class processor, be aware that there are dual-core MIPS-4 architecture chips (64-bit wide RISC processors with millions of transistors and lots of on-board cache to play with) that draw only 1.5W of power (compare with x86 processors of similar performance, and you're looking at at least 50W).

Quote:
You can actually not do this with a store - think self-modifying code. If a "STA abs" stores a new value at the position of the next opcode, you have to detect this and invalidate the next opcode that had already been read.


No, you don't. It's NICE to do, but you don't HAVE to. First of all, what business do you have storing data into the next instruction? If you make that kind of thing illegal to do without an explicit "instruction prefetch flush" instruction, that requirement disappears. Finally, if you're worried about power consumption, you're also going to be running mainly from ROM (e.g., an embedded environment) where such an instruction does absolutely nothing anyway.

You can implement logic to detect this (the x86 architecture does), thus preserving the illusion of non-prefetched code. But, as you say above, that will impair performance and, the extra logic needing extra transistors, will absolutely draw more power. But, then again, since we're looking for ways to get the 65xx architecture to execute code faster to begin with, then obviously, you'll also want to avoid code that is known to slow it down too. This is a clear case of a patient going to the doctor to complain of a headache after slamming his head into the wall a couple times. The prescription is clear: don't do that.
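The detection logic amounts to an address-range comparison on every store. A minimal sketch, with hypothetical names and the assumption that the queue holds a contiguous run of bytes starting at `queue_start`:

```python
def store_hits_prefetch(store_addr, queue_start, queue_len):
    """True if a store lands on a byte that is already sitting in the queue."""
    return queue_start <= store_addr < queue_start + queue_len

def queue_after_store(queue, store_addr, queue_start):
    """Flush the whole queue on a hit (the simplest possible policy)."""
    if store_hits_prefetch(store_addr, queue_start, len(queue)):
        return []            # stale bytes discarded, must be refetched
    return queue

print(queue_after_store([0xEA, 0xEA], 0x0201, 0x0200))
print(queue_after_store([0xEA, 0xEA], 0x0300, 0x0200))
```

The comparison is cheap in software, but in hardware it means a comparator on every store, which is the extra logic and power being traded against compatibility with self-modifying code.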

Quote:
And then think of branch misses. I.e. when a branch opcode branches, the possibly read next opcode is not usable.


Irrelevant.

Quote:
I would rather expect a saving of less than 1 cycle per opcode, which is not 50% faster, is it?


What? You're not clear here.

Quote:
Also on a misaligned address (e.g. even the 2-byte address of an absolute addressing opcode that is aligned) the CPU internally needs to load the high byte of the memory word into the low byte of the address latch, and with the next read put the low byte of the memory word into the high byte of the address latch. Thinking of all those combinations (remember, there are addressing modes like "(zp),y", "(abs)" or "(zp,x)" that all read addresses from different memory locations) makes for a very complicated system.


No it doesn't. Not even close.

The execution unit presents a 16- or 24-bit address to the Bus Interface Unit, and it's the BIU's job to handle memory misalignments. The EU will just sit and wait until all data is fetched. The BIU, in the meantime, need only concern itself with 2 possibilities (assuming a 16-bit bus width, which you're assuming here):

* A full word-aligned access
* A two-cycle, word-misaligned access

These are all solved problems.
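The two cases reduce to counting how many bus words an access touches. A minimal sketch (the function name is hypothetical, and wait states are ignored):

```python
def biu_cycles(address, size_bytes, bus_bytes=2):
    """Bus cycles the BIU needs for an access of `size_bytes` at `address`."""
    first_word = address // bus_bytes
    last_word = (address + size_bytes - 1) // bus_bytes
    return last_word - first_word + 1

print(biu_cycles(0x1000, 2))  # word-aligned 16-bit access
print(biu_cycles(0x1001, 2))  # misaligned 16-bit access
print(biu_cycles(0x1001, 1))  # single byte, alignment is irrelevant
```

The EU never sees the difference except as a longer wait; all the alignment handling stays inside this one calculation.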

Quote:
I still remember the issues the 68k had with 16bit opcodes, 32bit registers, and access widths ranging from 8 to 16 bit.


Issues which turned out to be non-issues, considering the sheer quantity of platforms built around the 68K line, the vociferous protest of designers everywhere when Motorola stated they'd stop production of the 68K altogether, and the sheer volume of programming language research performed with the 68K as the foundation. Indeed, ColdFire exists only to sate the embedded designers. The Dragonball's only real customer was Palm, and once they switched to ARM, that spelt the death of the 68K-as-such completely. Nowadays, if you want to avoid the Intel architecture like the plague, ARM and PowerPC are the only way to go for moderately high performance, low-power applications.

Quote:
The hardware interface had select lines so it could read single bytes even though it was a 16bit interface - but some peripherals are simply just 8 bit wide.


Every single integrated CPU made since the 68000 has had a similar interface. Even Intel's approach for the 8086-80286 implemented it:

Code:
BHE  A0 Description
  0   0 Data on D0-D15 (word-size data)
  0   1 Data on D8-D15 (odd address)
  1   0 Data on D0-D7 (even address)
  1   1 Never generated as far as I know.
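The table above can be turned into a small lookup, together with the reverse direction (which signals a transfer would drive). This is a sketch of the encoding as listed in the table only; the function name is hypothetical, and both signals are modeled as plain 0/1 values.

```python
# (BHE, A0) -> data lines in use, exactly as in the table above
LANES = {
    (0, 0): "D0-D15",  # whole word
    (0, 1): "D8-D15",  # odd byte
    (1, 0): "D0-D7",   # even byte
}

def bus_signals(address, size_bytes):
    """Return (BHE, A0) for a 1- or 2-byte transfer that fits in one bus cycle."""
    a0 = address & 1
    if size_bytes == 2:
        if a0:
            raise ValueError("a misaligned word needs two bus cycles")
        return (0, 0)
    if size_bytes == 1:
        return (0, 1) if a0 else (1, 0)
    raise ValueError("only 1- or 2-byte transfers are modeled")

for addr, size in ((0x1000, 2), (0x1001, 1), (0x1000, 1)):
    print(hex(addr), size, LANES[bus_signals(addr, size)])
```

An 8-bit peripheral wired only to D0-D7 would then respond at even addresses, which is where the +0, +2, +4 register spacing discussed earlier comes from.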


The use of byte lanes for a byte-addressing, 16- or 32-bit wide 65xx is of course assumed.

Unless you fundamentally redefine the byte width of the machine to 16-bits. Which brings me to . . .

Quote:
A side note: With a 16-bit bus width 6502 all applications using I/O would have to be rewritten for those peripherals that are still 8-bit (with for example the VIA registers appearing not at +0,+1,+2,...+f, but at +0,+2,+4,+6, ... +1e so you could access them with 16bit). And then again once you go 32bit ;-)


If the byte width is widened to 16-bits, then by definition, no smaller addressable unit exists. Therefore, 8-bit-wide devices can sit on D0-D7, ignoring D8-D15, while addresses still increment by one. The TMS9900 16-bit processor (one of the first single-chip 16-bit microprocessors, by the way) did things this way, as did its immediate predecessor, the TI-990 series of minicomputers. As far as I can tell, it really kicked butt.

This is also the approach that WDC's 32-bit Terbium processor takes, which has a 32-bit internal, 16-bit external architecture, capable of addressing 4.2 gigawords (they're continuing to use the term byte for an 8-bit quantity, which is actually incorrect for that architecture). Whether or not they eliminate dead-cycles remains to be seen, however.

Quote:
- optimize the internal pipeline such that there are no dead cycles. Why does "CLC" for example need two cycles?


Most likely, it's a design simplification.

Quote:
then independently build a cache in front of the CPU, where the CPU interface is 8 bit and blazingly fast on a cache hit. The memory interface can be 16 or even more bit. Use this cache for opcodes only, so loops don't get pushed out of the cache by data.


The whole purpose of a cache is to increase data flow in and out of the processor, not to constrain it. If the core EU has only an 8-bit path to cache, then what in the world is the point? I would like to point out, also, that this equally invalidates support for self-modifying code.


PostPosted: Mon Jun 04, 2007 7:43 am

Joined: Tue Jul 05, 2005 7:08 pm
Posts: 1043
Location: near Heidelberg, Germany
kc5tja,

I'm sorry if my last post sounded too negative. I was just thinking about the idea and writing at the same time, which is no good.

And yes, my assumption was based on keeping the original 8-bit EU architecture. If you change that, everything's possible.

André


PostPosted: Mon Jun 04, 2007 3:28 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
I think Fachat's points and my counters to them bring up very important issues with respect to processor design:

* Splitting a processor into multiple, does-one-thing-and-does-it-well components (e.g., strong modularization) is considered a "good thing." The 6502 and 65816 aren't modular in their current silicon-only design (I cannot speak for WDC's Verilog models).

* Modularization supports quicker turn-around times in the marketplace. Having just granted the 6502 a 32-bit wide bus, suddenly your boss is screaming, "We need 64 bits pronto!" Provided you split the bus interface unit apart from everything else, you need only change the BIU and respin the design. Everything else remains the same.

* Modularization, however, increases die size. Because modularization implies related gates are kept in close proximity, that also means more die area is needed for metalization of buses spanning the chip. If a design doesn't fit as a result of routing issues, you may need to alter other units as well. Hopefully these modifications are kept as small as possible. But it IS an issue.

* Modularization increases transistors, and indirectly, power consumed. Because transistors are dedicated to specific functions, the odds of being able to re-use a transistor for different functions are reduced. For example, a 6502 with a dedicated BIU might not rely on the CPU's core ALU to perform effective address arithmetic. It might have its own ALU instead (maybe this produces faster execution, but it remains that more transistors exist to do it).

* NOT modularizing a design produces the smallest, most efficient design possible. But it is a one-time deal, so you better get it right. It's like writing Microsoft Windows XP in pure assembly language. It's absolutely doable, and hoo-boy will it be screamin' fast. But, it'll take forever to complete.


PostPosted: Tue Jun 05, 2007 10:38 am

Joined: Tue Mar 27, 2007 6:02 pm
Posts: 14
Location: Sevilla, España
Quote:

For eliminating the dead bus cycles, I would have nothing against running the processor at 4 or 8 times the bus speed so the processor can do more steps per bus cycle. In fact, this would allow running the bus faster with the same memory speeds, and better set-up and hold times could be guaranteed.
The processor might internally run at 100 or 200 MHz with a 25MHz bus speed and memory that is only fast enough for 16MHz with the 65c02's made today.


I don't get it.
If the 6502 needs to access the memory bus at every cycle ( -except during dead cycles- ), how could the cpu run 4 or 8 times faster than the memory bus ?

BTW, your post led me to this idea:

Given that during a dead cycle's memory access, the data put/retrieved into/from the memory bus is irrelevant :

1.- Isolate the cpu from the memory bus at the beginning of the dead cycle.
2.- inject into the cpu clock input a single phi clock cycle (*) let's say 20 times shorter (faster) just once (in order to make the dead cycle pass-by).
3.- Reconnect the cpu to the memory bus.
4.- The dead cycle has been evaporated... (and my much beloved DMAMagic channel too)

The phi1 part of the current cpu cycle will be shortened a bit (about 1/10th) because an extra cpu cycle has been injected, but the phi2 part won't be affected at all.
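The "about 1/10th" figure can be checked with a few lines. The sketch assumes a 1 MHz bus clock with a 50% duty cycle (phi1 and phi2 each half a cycle) and an injected cycle 20 times shorter than a normal one; the function name is hypothetical.

```python
def phi1_shortening(bus_mhz=1.0, speedup=20):
    """Return (injected cycle length in ns, fraction of phi1 it consumes)."""
    cycle_ns = 1000.0 / bus_mhz
    injected_ns = cycle_ns / speedup
    phi1_ns = cycle_ns / 2          # assumes a 50% duty cycle
    return injected_ns, injected_ns / phi1_ns

injected, fraction = phi1_shortening()
print(injected, fraction)
```

At 1 MHz the injected cycle is 50 ns, so the CPU would have to tolerate a clock period equivalent to 20 MHz for that one pulse.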

Would it work ?
Or would the 6502 go crazy (most likely) ?

Hmm, if I fit a 14MHz 65c02 part in my 1MHz Apple II, it *shouldn't* go crazy, right ?

(*)That is, in fact, a single short negative-going pulse.

_________________
--JorgeChamorro.


PostPosted: Tue Jun 05, 2007 2:32 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
jorge wrote:
I don't get it.
If the 6502 needs to access the memory bus at every cycle ( -except during dead cycles- ), how could the cpu run 4 or 8 times faster than the memory bus ?


The 6809 is a perfect example of a CPU that did this. To get a 6809 that ran at "1MHz", you gave it a clock at 4MHz. This allowed the 6809's internal logic to run with sufficient performance to not waste external bus cycles.

Quote:
1.- Isolate the cpu from the memory bus at the beginning of the dead cycle.
2.- inject into the cpu clock input a single phi clock cycle (*) let's say 20 times shorter (faster) just once (in order to make the dead cycle pass-by).


Unless your CPU can take 5x clock speeds in the first place, this won't work. The 6502's logic is fundamentally clocked _at_ the same speed as the bus. So suddenly giving it a clock cycle that is 5x as fast as it previously was runs the risk of over-clocking the CPU to a degree that it cannot function.

And, if it can handle those kinds of speeds, why not just run the CPU at 5x what you were planning on anyway? The end result will be faster even with the dead cycles, than it would be by contracting only dead cycles.

Quote:
Hmm, if I fit a 14MHz 65c02 part in my 1MHz Apple II, it *shouldn't* go crazy, right ?


It probably will, because the 14MHz part will have much, much faster rise and fall times. The phi0 input will very likely not respond correctly due to the slower slew rates. You'll want to make sure the phi0 is suitably conditioned first (e.g., via a pair of 74F00 gates).


PostPosted: Tue Jun 05, 2007 4:50 pm

Joined: Sat May 20, 2006 2:54 pm
Posts: 43
Location: Brighton, England
jorge wrote:
Quote:
Hmm, if I fit a 14MHz 65c02 part in my 1MHz Apple II, it *shouldn't* go crazy, right ?


If you are really trying this - watch out for pin 1. On a WDC 65C02 pin 1 is the VP output. On everyone else's 65C02s and on NMOS 6502s pin 1 is a ground connection. Plugging a WDC 65C02 into a socket intended for a 6502 could short the VP output to ground and kill the chip. :cry:

_________________
Shift to the left, shift to the right,
mask in, mask out,
BYTE! BYTE! BYTE!


PostPosted: Tue Jun 05, 2007 5:10 pm

Joined: Fri Aug 30, 2002 1:09 am
Posts: 8543
Location: Southern California
Quote:
Plugging a WDC 65C02 into a socket intended for a 6502 could short the VP output to ground and kill the chip.

I seriously doubt that it would kill the whole thing, but it would probably damage the VP output buffer. The supply current would be high, but probably not as high as the NMOS 6502 current was. Just to test maximum output current on the WDC VIAs which I think have the same output drivers, I actually shorted a bit to Vcc and then to ground, and got 50mA. It did not damage the IC but I didn't keep it shorted for a long time either. If you want to plug the WDC 65c02 into a socket that has pin 1 grounded, just bend the one pin out so it doesn't go into the socket.


PostPosted: Tue Jun 05, 2007 5:50 pm

Joined: Tue Dec 30, 2003 10:35 am
Posts: 42
jorge wrote:
kc5tja wrote:
For eliminating the dead bus cycles, I would have nothing against running the processor at 4 or 8 times the bus speed so the processor can do more steps per bus cycle. [...]

I don't get it. If the 6502 needs to access the memory bus at every cycle ( -except during dead cycles- ), how could the cpu run 4 or 8 times faster than the memory bus ?

kc5tja's comment was in the context of a system with memory more than 8 bits wide, where a single memory access would read/write 2 or more bytes at a time.


PostPosted: Tue Jun 05, 2007 5:59 pm

Joined: Sat Jan 04, 2003 10:03 pm
Posts: 1706
Sorry, but that is an incorrect attribution. Garth wrote the comment, not me.

Also, the technique has been applied to everything from 8-bit CPUs on up in width. The 6809 uses a 4MHz clock for 1MHz operation, and the TMS9900A uses something like a 12MHz clock for 1MHz operation, IIRC. As does the original 8051. So it's hardly a unique thing.

Indeed, the fact that the 6502 and 65816 run at bus speed makes them completely unique in the industry.


PostPosted: Tue Jun 05, 2007 6:25 pm

Joined: Sat May 20, 2006 2:54 pm
Posts: 43
Location: Brighton, England
kc5tja wrote:
Indeed, the fact that the 6502 and 65816 run at bus speed makes them completely unique in the industry.


Not quite - the original 6800 managed it as well. I can't think of any other CPU that does.

_________________
Shift to the left, shift to the right,
mask in, mask out,
BYTE! BYTE! BYTE!

