Ruud wrote:
Wait states are still used by PCs. The faster your memory, the fewer wait states the CPU needs.
This is actually not strictly true; contemporary memories (yes, even static RAMs) are all synchronous in design nowadays, and the "speed" of a memory refers more to its bus speed than to its access latency.
SxRAMs (where x=D for DRAM, S for SRAM) are built for cache-supporting CPUs, and transfer their data in bursts ranging from 8 to 256 bytes at a time, configurable at system reset time (even before the CPU's reset is negated). The idea is to amortize the typical access latency across the multiple bytes fetched or stored, thus effectively reducing net access times. Each byte (or word, in some cases) transfers in a single clock cycle, but you still have to wait for the RAM to load the row of bytes in the first place.
Note that actual data access times have not gone down over the years; all the burst does is amortize them. So, really, the faster the CPU, the more wait states you'll need. I'll come back to this in a moment.
Also, SxRAMs support pipelining in many cases, turning what otherwise would be wait states into useful transactions.
For a single memory transaction, the client typically does:
1. Issues the row address.
2. Issues the column address.
3. Waits 4 to 8 cycles. (Double this for DDR SxRAMs.)
4. Transfers 8 to 256 bytes worth of data.
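
To put rough numbers on that sequence, here's a minimal C sketch of the cycle count for one such read. All the timing constants are illustrative assumptions (a 6-cycle wait, a 64-byte burst), not figures from any particular datasheet:

    /* Cycle-count model of the four-step read above; all numbers assumed. */
    #include <stdio.h>

    int main(void)
    {
        const int t_row  = 1;   /* issue the row address (activate)      */
        const int t_col  = 1;   /* issue the column address (read)       */
        const int t_wait = 6;   /* assumed access latency, 4..8 typical  */
        const int burst  = 64;  /* assumed burst length, in bytes        */

        /* one byte transfers per clock once the burst begins */
        int total = t_row + t_col + t_wait + burst;
        printf("closed-row read: %d cycles, %.3f cycles/byte\n",
               total, (double)total / burst);   /* 72 cycles, 1.125 */
        return 0;
    }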
If a following read hits the same row (it's still open), you can skip a few steps:
1. (skipped -- the row is already open)
2. Issues the column address.
3. Waits 2 to 4 cycles.
4. Transfers 8 to 256 bytes.
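
Under the same assumed numbers, that open-row fast path looks like this; skipping the row address and taking the shorter wait shaves a few cycles off every hit:

    /* Same model, open-row ("page hit") case; numbers still assumed. */
    #include <stdio.h>

    int main(void)
    {
        const int t_col  = 1;   /* column address only; row already open */
        const int t_wait = 3;   /* assumed 2..4 cycle open-row wait      */
        const int burst  = 64;  /* assumed burst length, in bytes        */

        int total = t_col + t_wait + burst;
        printf("open-row read: %d cycles, %.3f cycles/byte\n",
               total, (double)total / burst);   /* 68 vs. 72 closed-row */
        return 0;
    }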
There are things called pages and banks, which also have their own (usually longer) wait times.
If you exploit pipelining, you overlap the access delays:
1. Issues row address 1.
2. Issues column address 1.
3. Issues row address 2 -- overlaps the access time for chunk 1.
4. Issues column address 2.
5. Issues row address 3 -- overlaps the access times for chunks 1 and 2.
6. Issues column address 3.
7. Waits maybe 2 to 4 cycles, if at all.
8. Receives 8 to 256 bytes for chunk 1.
9. Receives 8 to 256 bytes for chunk 2.
10. Receives 8 to 256 bytes for chunk 3.
and so on.
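
Counting cycles shows what the overlap buys you. This sketch compares three bursts issued strictly back-to-back against the pipelined schedule above, with the same assumed timing; it deliberately ignores bank conflicts, precharge, and refresh, all of which a real controller must juggle:

    /* Serial vs. pipelined issue of three bursts; all numbers assumed. */
    #include <stdio.h>

    int main(void)
    {
        const int t_addr = 2;    /* row + column address cycles per chunk */
        const int t_wait = 6;    /* assumed access latency                */
        const int burst  = 64;   /* assumed burst length, in bytes        */
        const int chunks = 3;

        /* serial: pay the full access latency for every chunk */
        int serial = chunks * (t_addr + t_wait + burst);

        /* pipelined: the address cycles for chunks 2 and 3 hide under
           chunk 1's latency, so the wait is paid roughly once */
        int addr_cycles = chunks * t_addr;
        int hidden      = addr_cycles - t_addr;
        int wait_left   = (t_wait > hidden) ? t_wait - hidden : 0;
        int pipelined   = addr_cycles + wait_left + chunks * burst;

        printf("serial: %d cycles, pipelined: %d cycles\n",
               serial, pipelined);   /* 216 vs. 200 */
        return 0;
    }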
The chip protocols are pretty doggone complex, actually, which is why even the simplest of homebrew projects involving lots of RAM _requires_ at least an FPGA to implement an SxRAM controller. But it's the price you pay if you want a memory system that runs at near-CPU speeds at anything above 20MHz these days.
Concerning CPU speed and the number of wait states: the reality is that computer shops don't quote wait states anymore because memory wait states no longer dominate performance. How could they, when computers have a plurality of wait-state-inducing mechanisms that more than dwarf the wait states on a memory chip? Consider: cache misses, TLB misses, branch mispredictions, pipeline stalls in superscalar architectures, and more all contribute way more wasted cycles than a memory access typically would. With today's 7-way integer execution units, to say that a single cache miss costs the CPU several thousand instructions' worth of work is not outside the realm of possibility.
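
A back-of-the-envelope check on that claim, using assumed round numbers (a 300-cycle trip to main memory, and the 7-wide issue width mentioned above):

    /* Instruction slots forfeited during one miss; numbers assumed. */
    #include <stdio.h>

    int main(void)
    {
        const int miss_penalty = 300;  /* assumed cycles to main memory */
        const int issue_width  = 7;    /* the "7-way" integer units     */
        printf("instruction slots lost per miss: %d\n",
               miss_penalty * issue_width);   /* 2100 */
        return 0;
    }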
Assuming everything works well, though, and considering the size of most cache lines and what SxRAMs can deliver in a single burst, most access times are amortized into sub-cycle latencies. With an 8-cycle wait time and a 64-byte burst, you're looking at an average of 0.125 wait cycles per byte fetched. Next to the other performance-affecting factors in the CPU, that isn't even worth worrying about.
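
The same arithmetic, swept across the 8-to-256-byte burst range mentioned earlier (still assuming the 8-cycle wait):

    /* Amortized wait per byte for each burst length; wait assumed. */
    #include <stdio.h>

    int main(void)
    {
        const int wait_cycles = 8;
        for (int burst = 8; burst <= 256; burst *= 2)
            printf("burst %3d bytes -> %.4f wait cycles/byte\n",
                   burst, (double)wait_cycles / burst);
        return 0;
    }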