MichaelM wrote:
Toshi:
I am sure that after all of the effort that compiler writers and software developers have devoted to extracting the most performance from an instruction set, they could have at least devoted some effort to static analysis of the link image for cache thrashing issues. In fact, I would have thought that the garbage collector for the Java virtual machine would have been designed to spot issues such as cache thrashing and automatically adjust the relative positions of the offending code segments to avoid the issue, since it has essentially brought the machine to a standstill anyway. This is such a simple concept that it should be relatively easy to implement, since the cache design and strategy of the target is a known quantity. Hennessy and Patterson claimed that these types of problems should be left to the software tools and not be implemented in the computer's hardware, i.e., the cache subsystem or the execution pipeline. After driving processor design to virtually unimaginable levels of complexity, it's time for a bit of innovation on the software development side of the house by the so-called computer scientists.
A few comments:
1) Code placement is a linker issue, not a compiler issue.
2) The linker has very limited freedom in where to place functions, due to object file format constraints.
When a compiler compiles a C file and produces a COFF or ELF object file, the object file
contains the code for all of the functions in that file in one contiguous code section.
Therefore, the linker usually has no freedom to rearrange the order of the functions within an
object file. The linker only has the ability to determine the order of the object files linked
into the final executable, so this is very coarse-grained.
3) If the linker had the freedom to reorder individual functions, by using a different object file format,
then it would also need semantic information such as the call graph in order to compute a
theoretically optimal code layout. So the object file format would also need to carry call
graph information.
4) Call graph information is useful to the linker, but it is statically determined, and
may not accurately reflect the actual execution paths of the executable for different inputs.
5) I know a person who did this as a project and, after optimizing the layout of the functions,
found that the result ran slower than the original code. In his case, which was optimizing the layout
of functions in the Mozilla web browser on Linux, the main source of latency was
paging the executable in from disk rather than cache misses, because disk I/O is
a few orders of magnitude slower than cache line fills from memory. The project was called
GNU Rope - the paper here describes it, but does not mention the final results, which I'm
recalling from memory:
http://www.nat.org/grope.pdf

So based on my experience watching other people implement this optimization, my opinion is that it requires a significant amount of work to implement, but the average return may be close to zero: it helps some executables and hurts others.
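Incidentally, the static analysis the quoted post asks for is easy to sketch for a simple cache model. The snippet below checks whether two functions' code ranges map to the same sets of a direct-mapped instruction cache; the cache parameters, addresses, and sizes are all made-up values for illustration, not a real target:

```python
# Sketch: detect potential I-cache thrashing between two functions in a
# direct-mapped cache. All parameters below are illustrative assumptions.

CACHE_SIZE = 16 * 1024   # 16 KiB direct-mapped I-cache (assumption)
LINE_SIZE = 32           # 32-byte cache lines (assumption)
NUM_SETS = CACHE_SIZE // LINE_SIZE

def cache_sets(addr, size):
    """Return the set of cache-set indices a code range occupies."""
    first = addr // LINE_SIZE
    last = (addr + size - 1) // LINE_SIZE
    return {line % NUM_SETS for line in range(first, last + 1)}

def conflict(addr_a, size_a, addr_b, size_b):
    """Cache sets where the two functions would evict each other."""
    return cache_sets(addr_a, size_a) & cache_sets(addr_b, size_b)

# Two functions whose start addresses differ by a multiple of the cache
# size land on the same sets; if they call each other in a loop, every
# iteration refills the same lines:
clash = conflict(0x08000, 256, 0x0C000, 256)
print(len(clash))   # → 8 (all 8 lines of each function collide)
```

A tool with the link map and the target's cache geometry could run this check over every pair of hot functions; the hard part, as noted above, is knowing which pairs are actually hot at run time.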
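And to make point 3 concrete, here is a toy sketch of call-graph-driven layout, in the spirit of Pettis and Hansen's procedure-ordering heuristic: greedily merge the function chains joined by the heaviest call edge, so hot caller/callee pairs end up adjacent. The function names and edge weights are invented for the example, and a real implementation would be considerably more careful about how chains are joined:

```python
# Toy call-graph-driven function layout (simplified, illustrative only).

def layout(edges):
    """edges: {(caller, callee): call_count} -> ordered list of functions."""
    # Start with every function in its own chain.
    chains = {f: [f] for pair in edges for f in pair}
    # Process call edges from heaviest to lightest.
    for (a, b), _weight in sorted(edges.items(), key=lambda e: -e[1]):
        chain_a, chain_b = chains[a], chains[b]
        if chain_a is chain_b:
            continue                  # already placed in the same chain
        merged = chain_a + chain_b    # keep the hot pair close together
        for f in merged:
            chains[f] = merged
    # Emit functions in final chain order, without duplicates.
    seen, order = set(), []
    for chain in chains.values():
        for f in chain:
            if f not in seen:
                seen.add(f)
                order.append(f)
    return order

edges = {("main", "parse"): 90, ("main", "render"): 80,
         ("parse", "lex"): 70, ("render", "draw"): 10}
print(layout(edges))   # → ['main', 'parse', 'render', 'lex', 'draw']
```

The same algorithm fed profile-derived weights instead of static call counts is exactly where point 4 bites: a layout that is optimal for one input's weights can be pessimal for another's.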
Toshi