Diplom Thesis: Automatic Profile-Based Characterisation of Performance Traces for Highly Parallel Applications

About four years ago I defended my diplom thesis. The title is quite a mouthful. Another way to put it is: We came up with a new way to cluster threads of execution for performance analysis.

I made the defense slides, thesis, paper and source code available.

Developing Vampir led us to the realization that scalability of performance analysis and visualization is fundamentally difficult. Information about more and more processing elements needs to be processed and displayed in a limited number of pixels on the screen.

Around 2012 we started exploring what you can do with differences between program execution traces. One thing we did was to use sequence alignment and diff algorithms to compute the difference between two function call stacks over time [1] [2]. This technique is rather expensive, but for example allows comparing the function-call-by-function-call timing of two threads of execution.

Two ways to tackle visual complexity is through folding similar executions into one [3], and e.g. by using edge bundling to cluster inter-process communication [4]. The problems these two ideas address are specific to our tools, and might not be a problem for e.g. profile-based visualizations.

Throughout the rest of the article I'll use the word process as a short form for thread, process and MPI rank.

Before I started working on my thesis, as part of an independent study course, I explored some more what kind of differences you can derive from processes and how to visualize them (slides), (report). One thing I tried was using box plots / percentiles to draw profiles on a per-process basis, instead of summing or averaging function timings across processes. The other main thing I investigated was whether to use flat function profiles, call trees, or something in between for comparing processes. One bucket in a flat function profile is a function, irrespective of which arguments it is called with, and what the call stack leading up to the call is. E.g. all calls to a matrix-matrix multiplication function are lumped into one bucket, regardless of the matrix size. A call tree profile retains the call stack for each bucket and thus provides more context. Comparing two call tree profiles is not that intuitive, and dealing with recursion is awkward. In between the two sits the idea of using the pair of functions of who-calls-whom. That provides some context, and works well with recursion. For each of these three profile types, the buckets are (1) a function, (2) a sequence of functions, and (3) a pair of functions. I also spent some time visualizing this so-called call matrix. It didn't work well. Not intuitive enough.

The important realization for my diplom thesis is that I can apply formal concept analysis to sets of function pairs to derive a kind-of hierarchy of processes. Each process is characterized by its set of function pairs (which function calls which), disregarding invocation counts and timing.

This example is a weather simulation using four MPI ranks with three threads each. All processes and threads call the top set of functions. The threads do not call any MPI or I/O functions. Process 0, on top of what processes 1-3 do also performs I/O. (The percentage displayed is the ratio of function pairs in the node to all function pairs. E.g. 37% of the pairs are shared between all processes and threads. 41% are unique to processes 1-3.)

To iron out small dissimilarities introduced by performance recording technicalities, you use the transitive closure of the function pair sets, rather than the sets themselves. Then the above example becomes three nodes on top of each other, the top-most being the threads. Thus, process 0 subsumes all other processes and threads, and process 1-3 subsume all threads. The threads are spawned by process 0-3! (The defense slides, specifically slides 16-18 and 26, have some more detail and examples on this.)

The thesis devotes much space to demonstrating the techniques on many real world examples and investigating two representatives in greater detail.

The expected complexity of creating the concept lattice (above graph) is linear in the number of processes. But the worst-case complexity is exponential, because the maximum number of nodes is. For that to happen you need a certain pattern of differences, and every process needs to have a different function pair set. In normal applications, and especially SPMD programs this never happens.

But there are three scenarios where the approach does not perform well:
  • The function call sequence becomes kind of random if you observe scheduling.
  • Load-balancing can lead to the same.
  • An application assigns a different cell of a space to each process. The space contains objects. As the simulation progresses objects interact and move between cells/processes. Then each process can call different functions depending on what happens within each cell. ParaDiS does something like that.
Thus the technique does not work on every code, but most. You'd need to implement fall backs to handle those gracefully.

To summarize, the technique takes the set of function pairs for each process and creates a concept lattice. This lattice represents a clustering with an attached hierarchy.

You can also use processes from multiple application runs, and not only compare processes within one run.

Clustering processes is a promising avenue that, to my knowledge, no performance analysis tool has taken, yet. Every performance tool presents a profile accumulated or averaged across all processes. With this they lump together profiles of processes that perform different work. You want to create one profile per group of processes that performs similar work. In the above example you would perhaps have one for process 0, one for process 1-3 and one for the threads. The difference between threads that share memory, and processes which communicate across nodes should be considered!

Or for example, if you think about our sequence alignment work, you don't want to align dissimilar processes, because that would result in an alignment that is just not useful. Instead you use our technique to cluster processes and then only align and compare detailed timings of similar ones.

You can also use the clustering to safe screen space by only displaying information about one representative of each group, perhaps with an indicator for timing variance.

This concludes the walkthrough of what we did. There are two things I want to mention: In the code on GitHub, I was experimenting with a more C-style C++ with templates. I don't endorse that style.

Secondly, the above technique is rather heavy-weight. You can get pretty far by just hashing the function pair set. Equal sets share the hash, unequal sets don't. Clustering done. (Then you can e.g. compute similarity between each group) This approach is easier to implement and does not have the exponential worst-case complexity. But you lose the the hierarchy. My above example talks about the process 0 vs process 1-3 vs threads relationship which is rather simplistic. A performance monitor or runtime should be able to know about this. But then again, if your analysis technique automatically yields that and more intricate relationships, that's pretty cool, too.


Costa Rica - Wildlife Refuge Laguna Urpiano

Time flies. It's been a year already. I am thoroughly enjoying my job at Oak Ridge National Lab.

I got used to much of the U.S. American weirdness and other differences to home. Deep-fried sushi, deep-fried pickles, shitty bread, disgusting tap water, cars everywhere, lots of driving and so on.

July last year, three friends and I spent two weeks in Costa Rica volunteering at the sea turtle refuge Laguna Urpiano - a breathtaking experience. I'm not sure where to start. The refuge is located on the eastern side of the country without any direct access except through a canal. There is no sewage, no tap water, no electricity. The refuge has a well, runs a generator to pump water, and some solar cells that charge batteries for lighting at night, and smart phones.

On the way to the refuge

The canal

The toilets, showers and a little shack at the refuge

The housing is simple wood with a metal roof, not completely closed.

Early morning, and sometimes during the day Howler Monkeys made a lot of noise. They live high in the trees so you can't see them well. But you can definitely hear them, when they argue with their neighbors! It's not that bad, though. You get used to it.

We spent our days planting trees, helping out with work at the refuge, relaxing, drinking fresh coconut milk, and most importantly guarding one kilometer of beach every night from about 22:00 to 4:00, if I remember correctly. Napping during the day is mandatory.

Sea turtle eggs and, depending on the species, parts of the turtle are valuable. If countries like Costa Rica would not try to secure breeding grounds for sea turtles, they would go extinct. Sea turtles are very vulnerable when they lay eggs. Plus, laying eggs takes long and poachers can most of the time easily find the nests.

So we went out every night, traversing the same one kilometer every 30 minutes. When we found a turtle, we took its eggs and brought them to the refuge and built a new nest there. When we encountered a poacher, we would wait for him or her to pass. The idea is to try and be the first at the nest. The unspoken rule is that you leave the finder alone with its nest. No guns, just a machete for safety and perhaps sticks.

I can't show you any pictures of our nightly walks, and no turtles, because turtles only come ashore at night. And they use the moon for navigation. We needed to be dressed darkly and use no light except a bit of red light for writing into a notebook. No flashes allowed.

There are a few different species of sea turtles landing here, but during early July only the Leatherback visits (You can find some pictures here: (1), (2), (3)). It can grow to around 2m and 700kg. We saw just two in all our time, but during high season there are lot more. Both turtles were around 1.6m long. Impressive animals - Huge!

This is the trail of a Leatherback to its nest. The actual nest is a 20 by 20 by 40 cm hole. The rest is dug up sand to hide its exact location.

This is the remains of a fight with the sea. A Leatherback turtle decided to lay its eggs in a bad spot. The waves came in too far. If the nest / the turtle gets wet in process, the turtle aborts its mission, goes back into the sea and takes a new attempt in the next few hours or days. So we fortified the position in hopes to keep the water in check. ...But to no avail. The turtle went back without laying its eggs.

Another day we were lucky and witnessed everything. It was amazing! Such a huge animal. Paddling across the beach, inch by inch. Then it used its back fins to dig the nest hole and put the eggs in. We saw the strain on her face from the exertion. By the way, the head of such a turtle is as large as a human's. I felt deeply in awe. After that the turtle did what looks like a dance and dug up sand here and there to hide the true position of the nest.

The refuge's hatchery

The beach was littered with crabs that would hide in little holes and come out if no one is around. A bit like Whack-a-Mole where they hide when you are near. Another type of crab repeatedly rides in on a wave and tries to catch small insects like flies. Then it stands there and watches and eventually runs back into the ocean. It was a fun pass time chasing them. ... Wave comes, wave retreats, crab remains on the beach and watches around, I run in between it and the sea to try and cut it off. If that was successful, which it rarely was, I'd chase it around and try to catch it :D. And of course I'd just free it again, afterwards. I admit I injured one in the process once.

Marinera crab in its hole

A pineapple plant

We spent the last few days in Puerto Viejo, a very touristy place. Lots of surfers, lots of Germans.

The prime customer in the rescue center is the sloth, because it's too slow to cross the street quickly, and often gets electrocuted while climbing an electric pole.

These volunteers socialize with monkey children. Monkeys need that to grow up healthy.

And then it was already time to go to San Jose, and catch the flight home.

I really enjoyed my time in Costa Rica, and recommend spending a vacation there. The people at the refuge took great care of us and helped us navigate the country.

Thank you Barbara, Johnny, Emilia, Pedro, Eliseo, Alan, Maylin, Elena, Suzette and Claire!


Retirement Calculator

I made my own little retirement calculator, because I was dissatisfied with how the calculators I checked do the math. And I added a bit more flexibility to play with the numbers.

It's a simple python script: (GitHub link)

    Usage: retirement-calculator [options]
    Optional arguments:
      -h, --help                   Show this help message and exit

      --initial-savings            <USD>
      --working-annual-return-rate <rate>
      --retired-annual-return-rate <rate> 
      --inflation                  <rate> 
      --working                    <years> 
      --retired                    <years>
      --working-annual-savings     <USD> 
      --retired-annual-withdraw    <USD>

The calculation works as follows:

    for year in range(1, working+1):
        return = savings * working_annual_return_rate

        savings += return
        savings += working_annual_savings
        savings *= (1-inflation)

And in retirement:

    for year in range(1, retired+1):
        savings -= retired_annual_withdraw
        return = savings * retired_annual_return_rate

        savings += return

        savings *= (1-inflation)

To help with picking values for annual return rates and inflation, the calculator provides some historic data.

One consideration I glossed over is taxation. Depending on your account type (brokerage, IRA, Roth IRA, ...) you do or do not pay certain taxes like capital gain tax on your realized gains or income tax on your retirement income. Keep that in mind when looking at the numbers.

The other thing I simplified is that you would normally gradually move your investments towards safer options as you get closer to your retirement date. The calculator assumes a hard switch in annual return rate at the start of retirement.

Note that all numbers are in today's US Dollars. That also means that the calculator assumes you increase your annual savings to match the rate of inflation.

For example, if you work for 35 years, save $6000 a year, assume an annual return rate of 7% during work life and 5% during retirement and 3% annual rate of inflation, you end up with $411020 of today's US Dollars, and an initial annual return of $19351 at the start of retirement. With that you can roughly withdraw $17000 per year for 30 years.

So that's no good.

The difficult part is estimating future returns on your investments.

Playing with the numbers, I'd say if you want to be safe, saving $10000 to $15000 per year seems to be about right. The more the better.

Here is a sample output: (link)