Diplom Thesis: Automatic Profile-Based Characterisation of Performance Traces for Highly Parallel Applications

About four years ago I defended my diplom thesis. The title is quite a mouthful. Another way to put it is: We came up with a new way to cluster threads of execution for performance analysis.

I made the defense slides, thesis, paper and source code available.

Developing Vampir led us to the realization that scalability of performance analysis and visualization is fundamentally difficult. Information about more and more processing elements needs to be processed and displayed in a limited number of pixels on the screen.

Around 2012 we started exploring what you can do with differences between program execution traces. One thing we did was to use sequence alignment and diff algorithms to compute the difference between two function call stacks over time [1] [2]. This technique is rather expensive, but for example allows comparing the function-call-by-function-call timing of two threads of execution.

Two ways to tackle visual complexity is through folding similar executions into one [3], and e.g. by using edge bundling to cluster inter-process communication [4]. The problems these two ideas address are specific to our tools, and might not be a problem for e.g. profile-based visualizations.

Throughout the rest of the article I'll use the word process as a short form for thread, process and MPI rank.

Before I started working on my thesis, as part of an independent study course, I explored some more what kind of differences you can derive from processes and how to visualize them (slides), (report). One thing I tried was using box plots / percentiles to draw profiles on a per-process basis, instead of summing or averaging function timings across processes. The other main thing I investigated was whether to use flat function profiles, call trees, or something in between for comparing processes. One bucket in a flat function profile is a function, irrespective of which arguments it is called with, and what the call stack leading up to the call is. E.g. all calls to a matrix-matrix multiplication function are lumped into one bucket, regardless of the matrix size. A call tree profile retains the call stack for each bucket and thus provides more context. Comparing two call tree profiles is not that intuitive, and dealing with recursion is awkward. In between the two sits the idea of using the pair of functions of who-calls-whom. That provides some context, and works well with recursion. For each of these three profile types, the buckets are (1) a function, (2) a sequence of functions, and (3) a pair of functions. I also spent some time visualizing this so-called call matrix. It didn't work well. Not intuitive enough.

The important realization for my diplom thesis is that I can apply formal concept analysis to sets of function pairs to derive a kind-of hierarchy of processes. Each process is characterized by its set of function pairs (which function calls which), disregarding invocation counts and timing.

This example is a weather simulation using four MPI ranks with three threads each. All processes and threads call the top set of functions. The threads do not call any MPI or I/O functions. Process 0, on top of what processes 1-3 do also performs I/O. (The percentage displayed is the ratio of function pairs in the node to all function pairs. E.g. 37% of the pairs are shared between all processes and threads. 41% are unique to processes 1-3.)

To iron out small dissimilarities introduced by performance recording technicalities, you use the transitive closure of the function pair sets, rather than the sets themselves. Then the above example becomes three nodes on top of each other, the top-most being the threads. Thus, process 0 subsumes all other processes and threads, and process 1-3 subsume all threads. The threads are spawned by process 0-3! (The defense slides, specifically slides 16-18 and 26, have some more detail and examples on this.)

The thesis devotes much space to demonstrating the techniques on many real world examples and investigating two representatives in greater detail.

The expected complexity of creating the concept lattice (above graph) is linear in the number of processes. But the worst-case complexity is exponential, because the maximum number of nodes is. For that to happen you need a certain pattern of differences, and every process needs to have a different function pair set. In normal applications, and especially SPMD programs this never happens.

But there are three scenarios where the approach does not perform well:
  • The function call sequence becomes kind of random if you observe scheduling.
  • Load-balancing can lead to the same.
  • An application assigns a different cell of a space to each process. The space contains objects. As the simulation progresses objects interact and move between cells/processes. Then each process can call different functions depending on what happens within each cell. ParaDiS does something like that.
Thus the technique does not work on every code, but most. You'd need to implement fall backs to handle those gracefully.

To summarize, the technique takes the set of function pairs for each process and creates a concept lattice. This lattice represents a clustering with an attached hierarchy.

You can also use processes from multiple application runs, and not only compare processes within one run.

Clustering processes is a promising avenue that, to my knowledge, no performance analysis tool has taken, yet. Every performance tool presents a profile accumulated or averaged across all processes. With this they lump together profiles of processes that perform different work. You want to create one profile per group of processes that performs similar work. In the above example you would perhaps have one for process 0, one for process 1-3 and one for the threads. The difference between threads that share memory, and processes which communicate across nodes should be considered!

Or for example, if you think about our sequence alignment work, you don't want to align dissimilar processes, because that would result in an alignment that is just not useful. Instead you use our technique to cluster processes and then only align and compare detailed timings of similar ones.

You can also use the clustering to safe screen space by only displaying information about one representative of each group, perhaps with an indicator for timing variance.

This concludes the walkthrough of what we did. There are two things I want to mention: In the code on GitHub, I was experimenting with a more C-style C++ with templates. I don't endorse that style.

Secondly, the above technique is rather heavy-weight. You can get pretty far by just hashing the function pair set. Equal sets share the hash, unequal sets don't. Clustering done. (Then you can e.g. compute similarity between each group) This approach is easier to implement and does not have the exponential worst-case complexity. But you lose the the hierarchy. My above example talks about the process 0 vs process 1-3 vs threads relationship which is rather simplistic. A performance monitor or runtime should be able to know about this. But then again, if your analysis technique automatically yields that and more intricate relationships, that's pretty cool, too.