There are others on the forum who know more about things like cache and pipelining than I, but I'll jump in.
The 6502 is a simple processor that is not intended to compete with some of the high-end ones you might have already been looking at, based on your post. Yet for many applications, it offers an excellent ratio of performance to complexity (both hardware and software).
The 6502 does not have a deep pipeline. Instructions never overlap by more than one clock. The pipelining it does have lets the next byte in the program, whether instruction or operand, be fetched while the processor's instruction-decoding logic is still working out what that byte is for, or whether the current instruction needs it at all. Even with its single 8-bit ALU, the 6502 is able to keep the bus busy with a different read or write in almost every clock (or T-state, as it is called on some processors). The link below points to more about this.
Take an instruction that uses indexing on a 16-bit (2-byte) operand: the low byte is fetched first so the indexing addition can get started while the high byte is still being fetched. By the time the high byte arrives, the processor knows whether the addition on the low byte produced a carry.
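To put numbers on the indexing case, here is a rough Python model of the address arithmetic and clock count for LDA addr,X. It is only a sketch: the function name `lda_abs_x` is made up for illustration, and the counts assume the usual 4 clocks for this instruction plus 1 more when the low-byte addition carries into the high byte.

```python
def lda_abs_x(base, x):
    """Return (effective_address, clocks) for LDA base,X (illustrative model only)."""
    # Low-byte addition; on the real chip this overlaps the high-byte fetch.
    lo = (base & 0xFF) + (x & 0xFF)
    carry = lo > 0xFF                      # known by the time the high byte arrives
    # The high byte only needs adjusting when the low-byte add carried.
    hi = ((base >> 8) + (1 if carry else 0)) & 0xFF
    address = (hi << 8) | (lo & 0xFF)
    clocks = 4 + (1 if carry else 0)       # assumed: one extra clock for the fix-up
    return address, clocks


if __name__ == "__main__":
    for base, x in [(0x1234, 0x10), (0x12F0, 0x20)]:
        addr, clocks = lda_abs_x(base, x)
        print(f"LDA ${base:04X},X with X=${x:02X} -> ${addr:04X} in {clocks} clocks")
```

The second call crosses from page $12 into page $13, which is the case that costs the extra clock.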
Without a deep pipeline, there are no branch-time pipeline flushes and refills, and no need for complex branch-prediction algorithms. The 6502's test of a status flag and the subsequent conditional branch takes only two clocks if the branch is not taken, and usually three clocks if the branch is taken (four if the branch target is in a different page). Many other 8-bit processors can't do anything in less than four clocks, and may take two or three (or more) of these 4-clock machine cycles to test the condition flag and make the branch. Even Microchip's PIC16, which has a 4-clock pipeline and is supposedly so fast, takes 12 clocks to test a flag and branch on the condition, which the 6502 can do in 3. An absolute jump to anywhere in memory on the 6502 always takes 3 clocks.
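The branch timing can be written out as a small Python sketch. The helper name `branch_cycles` is hypothetical, and the model assumes the standard rule: 2 clocks when not taken, 3 when taken, and 4 when the taken branch lands in a different page from the instruction that follows the branch.

```python
def branch_cycles(branch_addr, offset, taken):
    """Clocks for a 6502 conditional branch (illustrative model only).

    branch_addr: address of the branch opcode
    offset:      signed 8-bit displacement, -128..127
    taken:       whether the tested flag causes the branch to be taken
    """
    if not taken:
        return 2
    next_pc = (branch_addr + 2) & 0xFFFF        # PC after the 2-byte branch
    target = (next_pc + offset) & 0xFFFF
    # Assumed rule: one extra clock when the target is in another page.
    crosses_page = (next_pc & 0xFF00) != (target & 0xFF00)
    return 4 if crosses_page else 3


if __name__ == "__main__":
    print(branch_cycles(0x1000, 0x10, taken=False))
    print(branch_cycles(0x1000, 0x10, taken=True))
    print(branch_cycles(0x10F0, 0x20, taken=True))
```

Keeping tight loops from straddling a page boundary is what keeps the taken case at 3 clocks.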
The '02 usually uses no cache. This, combined with a continuous address space, allows, for example, an indexed one-byte fetch from a table anywhere in memory using LDA addr,X to take 4-5 clocks (the fifth only when the indexing crosses a page boundary). The same fetch requires 8 instructions and 48 clocks on the PIC16 if a page boundary might be crossed.
Another result of not having a deep pipeline on the 6502 is outstanding interrupt performance. It is not unreasonable to have an entire interrupt, including overhead, get carried out in one microsecond at 20MHz. An x86, if it is running Windows CE, can't do that at any clock speed!
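As a back-of-the-envelope check on the one-microsecond figure, here is a small Python sketch of the clock budget. It assumes the usual 7 clocks for the interrupt sequence and 6 for RTI, and it does not count the time spent finishing whatever instruction was in progress when the interrupt arrived. The function names are made up for illustration.

```python
IRQ_ENTRY_CLOCKS = 7   # assumed: push PC and status, fetch the vector
RTI_CLOCKS = 6         # assumed: pull status and PC


def isr_time_us(body_clocks, clock_mhz):
    """Total time in microseconds for entry + ISR body + RTI."""
    total_clocks = IRQ_ENTRY_CLOCKS + body_clocks + RTI_CLOCKS
    return total_clocks / clock_mhz


def body_budget(clock_mhz, limit_us):
    """Clocks left for the ISR body within limit_us microseconds."""
    return int(limit_us * clock_mhz) - IRQ_ENTRY_CLOCKS - RTI_CLOCKS


if __name__ == "__main__":
    print(body_budget(20, 1.0))    # clocks available for the ISR body
    print(isr_time_us(7, 20))      # microseconds for a 7-clock body
```

At 20MHz one microsecond is 20 clocks, so under these assumptions the body gets about 7 clocks: enough for something very short, such as bumping a counter in memory.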
> Also, any interesting facts/info you have would be helpful as well!
The thread at http://www.6502.org/forum/viewtopic.php?t=737 is related.