Calltree: a call-graph cache profiler

To use this skin, specify --skin=calltree on the Valgrind command line or use the supplied script calltree. Calltree is an extension of the Cachegrind skin. Read the documentation of the Cachegrind skin first; this page only describes the features Calltree supports in addition to Cachegrind's features.

Detailed technical documentation on how Calltree works is available here. If you want to know how to use it, you only need to read this page.

1. Purpose

2. Usage

2.1 Basics

To start a profile run for a program, run it under Valgrind with --skin=calltree, or use the supplied calltree script. After program termination, a profile dump file named "cachegrind.out.pid" is generated, where pid is the process ID of the profile run.

This will collect information

  1. on memory accesses of your program, and whether each access can be satisfied from the 1st or 2nd level cache,
  2. on the calls made in your program among the functions executed.

If you are only interested in the first item, it's enough to use the Cachegrind skin from Valgrind. If you are only interested in the second item, use Calltree with option "--simulate-cache=no". This gives you instruction read accesses only, but it typically speeds up profiling by a factor of 2 or 3.
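As an illustration of the two ways of running Calltree described above, here is a minimal sketch that only builds the command line; it does not run anything. The helper name calltree_command and the program name "./myprog" are hypothetical; the two flags are the ones named in the text.

```python
def calltree_command(program, args=(), simulate_cache=True):
    """Build an argv list that runs `program` under Valgrind's Calltree skin."""
    cmd = ["valgrind", "--skin=calltree"]
    if not simulate_cache:
        # Call-graph only: switch the cache simulation off for a faster run.
        cmd.append("--simulate-cache=no")
    cmd.append(program)
    cmd.extend(args)
    return cmd


if __name__ == "__main__":
    # "./myprog" and "input.dat" are placeholders, not part of Calltree.
    print(" ".join(calltree_command("./myprog", ["input.dat"], simulate_cache=False)))
```

The resulting list can be passed to subprocess.run if you want to start the profile run from a script.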

2.2 Multiple dumps from one program run

Often, you aren't interested in time characteristics of a full program run, but only of a small part of it (e.g. execution of one algorithm). If there are multiple algorithms or one algorithm running with different input data, it's even useful to get different profile information for multiple parts of one program run.

The generated dump file names extend the "cachegrind.out.pid" name from section 2.1 with the optional suffixes ".part" and "-threadID". Here, pid is the PID of the running program, part is a number incremented on each dump (".part" is skipped for the dump at program termination), and threadID is a thread identification ("-threadID" is only used if you request dumps of individual threads).
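The naming rule can be sketched as a small function. This is an illustration only: the function name dump_file_name is hypothetical, and it assumes the suffixes are appended to "cachegrind.out.pid" in the order the text lists them (".part" first, then "-threadID").

```python
def dump_file_name(pid, part=None, thread_id=None):
    """Compose a dump file name from its parts (assumed order: pid, part, thread)."""
    name = f"cachegrind.out.{pid}"
    if part is not None:
        # Skipped for the dump written at program termination.
        name += f".{part}"
    if thread_id is not None:
        # Only present when dumps of individual threads are requested.
        name += f"-{thread_id}"
    return name
```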

There are different ways to generate multiple profile dumps while a program is running under supervision of Calltree. Still, all methods trigger the same action "dump all profile information since last dump or program start, and zero cost counters afterwards". To allow for zeroing cost counters without dumping, there exists a second action "zero all cost counters now". The different methods are:

If you are running a multi-threaded application and specify the command line option "--dump-threads=yes", every thread will be profiled on its own and will create its own profile dump. Thus, the last two methods will only generate one dump of the currently running thread. With the other methods, you will get multiple dumps (one for each thread) on a dump request.

2.3 Limiting range of event collection

You can control for which part of your program event costs are collected by using --toggle-collect=funcprefix. This toggles the collection state on entering and leaving a function. When this option is specified, the default collection state at program start is "off". Thus, only events happening while running inside of funcprefix are collected. Recursive calls of funcprefix don't influence collection at all.
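The effect of toggling can be illustrated with a small simulation over a made-up event trace. This is a sketch of the behaviour described above, not Calltree's implementation; the trace format and the function name collected_cost are assumptions chosen for the example.

```python
def collected_cost(trace, funcprefix):
    """Sum the cost of events that happen while inside a matching function.

    trace: iterable of ("enter", name), ("leave", name) or ("cost", amount).
    """
    depth = 0   # how many matching functions are currently active
    total = 0   # collection starts in the "off" state (depth == 0)
    for kind, value in trace:
        if kind == "enter" and value.startswith(funcprefix):
            depth += 1
        elif kind == "leave" and value.startswith(funcprefix):
            depth -= 1
        elif kind == "cost" and depth > 0:
            total += value
    return total


trace = [
    ("cost", 5), ("enter", "main"), ("cost", 1),
    ("enter", "work"), ("cost", 10),
    ("enter", "work"), ("cost", 20), ("leave", "work"),  # recursive call
    ("cost", 30), ("leave", "work"),
    ("cost", 2), ("leave", "main"),
]
```

With this trace, collected_cost(trace, "work") counts the costs 10, 20 and 30: the inner, recursive call of "work" neither switches collection off on entry nor on exit.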

2.4 Avoiding cycles

A group of functions in which every function can reach every other one through some call chain is called a cycle. E.g. with A calling B, B calling C, and C calling A, the three functions A, B, C form one cycle.

If a call chain goes around a cycle multiple times, you can't distinguish costs coming from the first round from those of the second. Thus, it makes no sense to attach any cost to a call among functions in one cycle: if "A > B" appears multiple times in a call chain, you have no way to partition the one big sum of all appearances of "A > B". Thus, for profile data presentation, all functions of a cycle are seen as one big virtual function.
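To make the notion of a cycle concrete, the following sketch finds the cycles of a small call graph with Tarjan's strongly connected components algorithm. It illustrates the definition only; the text does not say how Calltree or its front ends detect cycles, and the function name find_cycles and the graph representation are assumptions.

```python
def find_cycles(calls):
    """Return the groups of mutually reachable functions in a call graph.

    calls: dict mapping a function name to the list of functions it calls.
    Only real cycles are returned: groups with more than one member, or a
    single function that calls itself.
    """
    index = {}      # discovery order of each visited function
    lowlink = {}    # smallest discovery index reachable from the function
    stack, on_stack = [], set()
    cycles = []

    def visit(fn):
        index[fn] = lowlink[fn] = len(index)
        stack.append(fn)
        on_stack.add(fn)
        for callee in calls.get(fn, ()):
            if callee not in index:
                visit(callee)
                lowlink[fn] = min(lowlink[fn], lowlink[callee])
            elif callee in on_stack:
                lowlink[fn] = min(lowlink[fn], index[callee])
        if lowlink[fn] == index[fn]:
            # fn is the root of a component: pop its members off the stack.
            group = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                group.append(member)
                if member == fn:
                    break
            if len(group) > 1 or fn in calls.get(fn, ()):
                cycles.append(sorted(group))

    for fn in list(calls):
        if fn not in index:
            visit(fn)
    return cycles


calls = {"main": ["A"], "A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
```

For the example graph, find_cycles(calls) reports the single cycle made of A, B and C; main and D are outside of it. The recursive visit is fine for small graphs; a very deep call graph would need an iterative variant.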

Unfortunately, if you have an application using some callback mechanism (like any GUI program), or even just normal polymorphism (as in OO languages like C++), it's quite possible to get large cycles. As it is often impossible to say anything about performance behaviour inside of cycles, it is useful to have mechanisms that avoid cycles in call graphs altogether. This is done either by treating the same function as different functions depending on the current execution context, giving each a different name, or by ignoring calls to certain functions altogether.

There is an option to ignore calls to a function with "--fn-skip=funcprefix". E.g., you usually don't want to see the trampoline functions in the PLT sections for calls to functions in shared libraries. You can see the difference if you profile with "--skip-plt=no". If a call is ignored, the cost events happening inside the ignored function are attached to the enclosing function.
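The cost attribution for ignored calls can be sketched with a made-up event trace. This is an illustration of the rule stated above, not Calltree's code; the trace format, the function name attribute_costs, and the function name "plt_stub" are hypothetical.

```python
def attribute_costs(trace, skip_prefixes):
    """Attribute each cost event to the innermost active function not skipped.

    trace: iterable of ("enter", name), ("leave", name) or ("cost", amount).
    """
    stack = []   # active functions, innermost last, as (name, is_skipped)
    costs = {}
    for kind, value in trace:
        if kind == "enter":
            stack.append((value, value.startswith(tuple(skip_prefixes))))
        elif kind == "leave":
            stack.pop()
        elif kind == "cost":
            # Walk outwards until a function that is not ignored is found.
            for name, skipped in reversed(stack):
                if not skipped:
                    costs[name] = costs.get(name, 0) + value
                    break
    return costs


skip_trace = [
    ("enter", "main"), ("cost", 1),
    ("enter", "plt_stub"), ("cost", 4), ("leave", "plt_stub"),
    ("cost", 2), ("leave", "main"),
]
```

With skip_prefixes set to ["plt_"], the cost of 4 spent in "plt_stub" is added to "main"; with an empty list, it is reported under "plt_stub" itself.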

If you have a recursive function, you can distinguish the first 10 recursion levels by specifying "--fn-recursion10=funcprefix". You can do the same for all functions with "--fn-recursion=10", but this will give you much bigger profile dumps. In the profile data, you will see the recursion levels of "func" as different functions with the names "func", "func'2", "func'3" and so on.
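The naming scheme for recursion levels can be sketched as follows. The function name recursion_names is hypothetical, it matches the function name exactly rather than by prefix, and the handling of activations deeper than the requested number of levels (they share the last name) is an assumption that the text does not confirm.

```python
def recursion_names(call_chain, func, levels):
    """Rename nested activations of `func` in a call chain by recursion level."""
    depth = 0
    renamed = []
    for fn in call_chain:
        if fn == func:
            depth += 1
            # Assumption: activations deeper than `levels` share the last name.
            level = min(depth, levels)
            renamed.append(fn if level == 1 else f"{fn}'{level}")
        else:
            renamed.append(fn)
    return renamed
```

For a call chain main > fib > fib > fib and 10 levels, this yields the names main, fib, fib'2 and fib'3.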

If you have call chains "A > B > C" and "A > C > B" in your program, you usually get a "false" cycle "B <> C". Use "--fn-caller2=B --fn-caller2=C", and functions "B" and "C" will be treated as different functions depending on the direct caller. Using the apostrophe for appending this "context" to the function name, you get "A > B'A > C'B" and "A > C'A > B'C", and there will be no cycle. Use "--fn-callers=3" to get a 2-caller dependency for all functions. Again, this will multiply the profile data size.
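The renaming by direct caller can be reproduced with a few lines. This sketch only mirrors the "A > B'A > C'B" example from the text; the function name with_caller_context is hypothetical, and it covers only the case of a single direct caller as context.

```python
def with_caller_context(call_chain, separated):
    """Append the direct caller to the names of the functions in `separated`."""
    renamed = []
    for i, fn in enumerate(call_chain):
        if fn in separated and i > 0:
            # The context is the plain name of the direct caller.
            renamed.append(f"{fn}'{call_chain[i - 1]}")
        else:
            renamed.append(fn)
    return " > ".join(renamed)
```

Because B called from A and B called from C now carry different names, the two chains no longer share a node that would close the false cycle.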

3. Command line option reference

--base=<prefix>

--separate-dumps=yes|no

--simulate-cache=yes|no

--collect-state=yes|no

--skip-plt=no|yes

--fn-skip=<function>

--fn-group<number>=<function>

--fn-recursion<number>=<function>

--fn-caller<number>=<function>

--dump-before=<function>

--zero-before=<function>

--dump-after=<function>

--toggle-collect=<function>

--fn-recursion=<level>

--fn-caller=<callers>

--mangle-names=no|yes

--dump-threads=no|yes

--compress-strings=no|yes

--dump-bbs=no|yes

--dumps=<count>

4. Profile data file format