Code optimization !?

Programming the 6502 microprocessor and its relatives in assembly and other languages.
kc5tja
Posts: 1706
Joined: 04 Jan 2003

Post by kc5tja »

BigEd wrote:
I think perhaps Toshi's point is that Samuel implies that getting adequate bandwidth to memory by pipelining accesses is good enough, but in fact if the CPU is waiting for the data, it isn't.
I never claimed that pipelining would solve everything. But supporting three concurrent memory accesses with the usual 4-cycle access times is a great way of filling the gap, provided the processor can keep itself busy in the meantime -- which, it turns out, it usually can.
Quote:
As we're entirely and completely off-topic now, I thought it interesting that the recent "5 trillion digits of pi" calculation (using a single PC, for the most part) had this to say about ram sizes and bus speeds:
Quote:
Due to time constraints, most of the algorithms that are used for disk operations had been "forcibly ported" from the in-ram equivalents with little or no regard to performance implications. Most of these performance penalties come in the form of sub-optimal cache use.

This sub-optimal cache use leads to excessive memory-thrashing which explains why Shigeru Kondo noticed that 96 GB of ram @ 1066 MHz had higher performance than 144 GB of ram @ 800 MHz
This amply illustrates my point. Memory access latencies are important, but if you abuse or misuse other performance-enhancing features of the system, it turns out you've got bigger problems.

Where I work, we use moderately powerful servers to host our web application software. But, they're not mainframes, and even in the x86 world, they're hardly what I'd consider the top of the line hardware. Since we're only a medium-sized company, we must remain cost conscious.

Our solutions to memory bottlenecks rarely involve swapping out a motherboard for a faster FSB or for faster RAM sticks. Most of our performance gains come from decomposing problems differently, compensating for cache effects, or parallelization through the use of multiple machines. We service the same volume of customer traffic as Facebook, using only 1/100th the number of servers. RAM speed is almost never an issue in contemporary high-performance software.

There are places where RAM speed absolutely matters, of course. Supercomputing comes to mind, but even here, properly built, industrial-grade vector processors (not the toys you see in Intel or PowerPC instruction sets) can stream data from SDRAM at bus-speed. I have a friend who works for nVidia, and their GPUs do all sorts of neat tricks with streaming data from SDRAM at bus speed, for example. To benefit from these optimizations, coders must place their texture (and other) data in memory in ways that fit the SDRAM bursting model, which isn't always the nice, clean, flat 2-D bitmaps we've all come to appreciate over the years.

But, on the whole, across the entire industry and probing all sectors, where CPUs dwarf RAM in performance, we are finding that RAM is nonetheless faster than most software running on these CPUs. This is particularly the case with systems where the FSB approximates internal core operating speeds! So, the end result is a computer running software at woefully suboptimal speeds.

I stand by my statements: the problem rarely is the RAM performance these days. It's how you use the RAM that determines overall performance now. This isn't 1996 anymore.
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Post by BigEd »

fachat wrote:
GARTHWILSON wrote:
Modern 6502's can run at much higher clock rates, but that doesn't mean the rest of your hardware can handle it. Since the 65802 is no longer available, we had kicked around the idea here of making a board with a 65816 on it that could be plugged into a 6502 socket, but we never did it.
In case you didn't notice - I built such a card.
http://www.6502.org/users/andre/cbmhw/pet816/index.html
I am currently working on making it run in a 2 MHz host at 2 MHz... (due to internal delays I'm not sure if I can get along without wait states in a 2 MHz host)

André
We've got a similar project, which we term our level1b board, but I think Garth is referring to the simpler substitution of just the CPU, which we term level0:
Image
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Post by BigEd »

kc5tja wrote:
I received a private message indicating the game in question was an implementation of chess.
Dragging us back to the topic of running chess on a faster CPU, and to compare with the primitive stripboard picture above, our old friend H.G.Muller built a nice 65816 chess computer in a matchbox:
Image
His pages tell the fascinating story of his Micro-Max short(!) chess program. Maybe even those 2kbytes of C would compile down to an 8k ROM...

Edit: update link to Harms' profile.
Last edited by BigEd on Thu Mar 14, 2013 9:51 pm, edited 2 times in total.
TMorita
Posts: 217
Joined: 15 Sep 2002

Post by TMorita »

kc5tja wrote:
BigEd wrote:
...
But, on the whole, across the entire industry and probing all sectors, where CPUs dwarf RAM in performance, we are finding that RAM is nonetheless faster than most software running on these CPUs. This is particularly the case with systems where the FSB approximates internal core operating speeds! So, the end result is a computer running software at woefully suboptimal speeds.

I stand by my statements: the problem rarely is the RAM performance these days. It's how you use the RAM that determines overall performance now. This isn't 1996 anymore.
The average cache fill latency with 533 MHz DDR3 DRAM is about 50 clocks at 533 MHz, or basically around 100 nanoseconds. This was calculated from cache traces on our current design by a tool I wrote called cacheperf.

So basically, DDR SDRAM is nowhere near as fast as you assume it to be.

Toshi
kc5tja
Posts: 1706
Joined: 04 Jan 2003

Post by kc5tja »

What are the figures for a 1GHz FSB?
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Post by BigEd »

TMorita wrote:
kc5tja wrote:
...
But, ...
I stand by my statements...
...cache fill latency ... about 50 clocks...
Toshi
Just for the record - a bit of finger trouble has Toshi appearing to attribute Samuel's comment to me.

This stackoverflow Q&A has some good pointers, for those interested in performance of modern memory systems. The bottom line is: if your program treats memory as random access then the performance can be remarkably low. You need to be aware of the cache architecture and the relative costs of missing each level of cache. See also "Analysis of Processor Cache Effects" and "The hidden cycle-eating demon: L1 cache misses".

As we'd be lucky in 6502-land to have even a single level of cache, when we find ourselves counting cycles we only have to take care with crossing page boundaries. It's simple, which is probably why we love it.
User avatar
BigEd
Posts: 11464
Joined: 11 Dec 2008
Location: England
Contact:

Post by BigEd »

kc5tja wrote:
What are the figures for a 1GHz FSB?
One of the links I just posted includes a pointer to some results from Lavalys' Everest analysis program (I've never heard of it before.)

This tabulation concludes with "Mem latency: 124.775ns (431 clks)" for this machine with a 3.46 GHz CPU and 1066 MHz FSB. There are heaps of other results nearby.

400+ clocks penalty for accessing RAM - you need to find a lot of other work to do to hide that cost!
kc5tja
Posts: 1706
Joined: 04 Jan 2003

Post by kc5tja »

Yes; my point is that it's quite doable to fill those 430 clocks with useful work. With that kind of latency hitting RAM, ultra-optimization of data structures to use B-trees and the like actually makes somewhat less sense. Counter-intuitive? Yup!

The reason is that many CPUs have cache-line prefetch instructions, which you can use to speculatively pull a chunk of RAM into cache while you're busy executing other code. In other words, you're explicitly multitasking the chip's different functional units: while the memory subsystem is prefetching the next chunk into cache, the core can be brute-forcing through an array of data known to be entirely in cache.

(Of course, you can also go the multiple-core route too.)

Note: I'm not suggesting that one can just abandon sophisticated indices. But there does come a point of diminishing returns. This is why B-trees almost never refer to anything smaller than individual sectors on a disk: due to I/O overhead, it's faster to just DMA the sector into memory and brute-force through it while you prefetch another sector.

Substitute cache-line for sector, and you get the same idea, but applied at a much finer granularity. It's actually an old technique by today's standards, but it works well.
Dimitri
Posts: 142
Joined: 08 Mar 2010
Contact:

Post by Dimitri »

kc5tja wrote:
I have a friend who works for nVidia, and their GPUs do all sorts of neat tricks with streaming data from SDRAM at bus speed, for example. To benefit from these optimizations, coders must place their texture (and other) data in memory in ways that fit the SDRAM bursting model, which isn't always the nice, clean, flat 2-D bitmaps we've all come to appreciate over the years.
Consider the newest FireStream cards from AMD/ATI, which are workstation-class cards designed to be used as math co-processors.

The FireStream 9370, which was released in late June of this year, has a core running at only 825 MHz and uses 1150 MHz memory. However, it runs a 256-bit data bus, not a 32-bit or 64-bit one like your desktop, and with its 4 GBytes of memory at its disposal it manages to squeeze out 528 GFLOPS in double precision (64-bit) and 2640 GFLOPS in single precision (32-bit).

Using benchmarking software, most current "high end" models of CPUs only manage to squeeze out 10-20 GFLOPS (a 50/50 mix of 32-bit and 64-bit precision).

PS: for the 9350 (the above card's little brother, with only 2 GBytes of RAM) the code name was "Kestrel".

Dimitri
Post Reply