Harmless Half-knowledge

While I was shopping for a new laptop, I wanted to know which SSDs Dell includes in their XPS 15" models. So I called sales support. It turns out they don't know, and as of 2019, you lose warranty if you change any part. They should stop calling it CRU then! I don't know whether or not there is a seal or how they make sure you didn't tamper with the machine, or if this policy is even enforced.

So going with the included, unspecified disk was the only safe option. In my case it turned out to be a Micron 2200S NVMe disk, which is likely the same as the 2200. According to AnandTech the warranty endurance is 300 TB written, which is OK for a 1TB disk. A Samsung 970 Evo Plus has 600, my old Intel 600p has 200 TB.

It took me a while to figure out which laptop to get. In short, for me, Dell's XPS 15" cost 700€ less than the Lenovo X1 Extreme or P1 and Dell offers US international keyboards. Lenovo offers only European layouts in Germany. And you can't change them yourself in these two models.

In hindsight I'm not sure about my choice, anymore. The 4K screen is badly supported in KDE/Kubuntu. Everything is off, dialog sizes are weird, all icon sizes are weird, margins and icon placement in buttons is off. The mouse cursor switches between normal-size and half-size depending on the window you are hovering over. And if you attach an external monitor, KDE's settings offer only one global scaling factor for all screens. I.e. everything is huge if you set it 2x for the 4K internal display. The keyboard is way worse than Lenovo's and is missing the Home, End, Page-Up and Page-Down keys that I heavily use(d).

Maybe going with the Lenovo X1 Extreme or P1 with a more affordable 1080p screen and the English (EU) layout would have worked out better.

About four years ago I defended my diplom thesis. The title is quite a mouthful. Another way to put it is: We came up with a new way to cluster threads of execution for performance analysis.

I made the defense slides, thesis, paper and source code available.

Developing Vampir led us to the realization that scalability of performance analysis and visualization is fundamentally difficult. Information about more and more processing elements needs to be processed and displayed in a limited number of pixels on the screen.

Around 2012 we started exploring what you can do with differences between program execution traces. One thing we did was to use sequence alignment and diff algorithms to compute the difference between two function call stacks over time [1] [2]. This technique is rather expensive, but for example allows comparing the function-call-by-function-call timing of two threads of execution.

Two ways to tackle visual complexity is through folding similar executions into one [3], and e.g. by using edge bundling to cluster inter-process communication [4]. The problems these two ideas address are specific to our tools, and might not be a problem for e.g. profile-based visualizations.

Throughout the rest of the article I'll use the word process as a short form for thread, process and MPI rank.

Before I started working on my thesis, as part of an independent study course, I explored some more what kind of differences you can derive from processes and how to visualize them (slides), (report). One thing I tried was using box plots / percentiles to draw profiles on a per-process basis, instead of summing or averaging function timings across processes. The other main thing I investigated was whether to use flat function profiles, call trees, or something in between for comparing processes. One bucket in a flat function profile is a function, irrespective of which arguments it is called with, and what the call stack leading up to the call is. E.g. all calls to a matrix-matrix multiplication function are lumped into one bucket, regardless of the matrix size. A call tree profile retains the call stack for each bucket and thus provides more context. Comparing two call tree profiles is not that intuitive, and dealing with recursion is awkward. In between the two sits the idea of using the pair of functions of who-calls-whom. That provides some context, and works well with recursion. For each of these three profile types, the buckets are (1) a function, (2) a sequence of functions, and (3) a pair of functions. I also spent some time visualizing this so-called call matrix. It didn't work well. Not intuitive enough.

The important realization for my diplom thesis is that I can apply formal concept analysis to sets of function pairs to derive a kind-of hierarchy of processes. Each process is characterized by its set of function pairs (which function calls which), disregarding invocation counts and timing.

This example is a weather simulation using four MPI ranks with three threads each. All processes and threads call the top set of functions. The threads do not call any MPI or I/O functions. Process 0, on top of what processes 1-3 do also performs I/O. (The percentage displayed is the ratio of function pairs in the node to all function pairs. E.g. 37% of the pairs are shared between all processes and threads. 41% are unique to processes 1-3.)

To iron out small dissimilarities introduced by performance recording technicalities, you use the transitive closure of the function pair sets, rather than the sets themselves. Then the above example becomes three nodes on top of each other, the top-most being the threads. Thus, process 0 subsumes all other processes and threads, and process 1-3 subsume all threads. The threads are spawned by process 0-3! (The defense slides, specifically slides 16-18 and 26, have some more detail and examples on this.)

The thesis devotes much space to demonstrating the techniques on many real world examples and investigating two representatives in greater detail.

The expected complexity of creating the concept lattice (above graph) is linear in the number of processes. But the worst-case complexity is exponential, because the maximum number of nodes is. For that to happen you need a certain pattern of differences, and every process needs to have a different function pair set. In normal applications, and especially SPMD programs this never happens.

But there are three scenarios where the approach does not perform well:

The function call sequence becomes kind of random if you observe scheduling.
Load-balancing can lead to the same.
An application assigns a different cell of a space to each process. The space contains objects. As the simulation progresses objects interact and move between cells/processes. Then each process can call different functions depending on what happens within each cell. ParaDiS does something like that.

Thus the technique does not work on every code, but most. You'd need to implement fall backs to handle those gracefully.

To summarize, the technique takes the set of function pairs for each process and creates a concept lattice. This lattice represents a clustering with an attached hierarchy.

You can also use processes from multiple application runs, and not only compare processes within one run.

Clustering processes is a promising avenue that, to my knowledge, no performance analysis tool has taken, yet. Every performance tool presents a profile accumulated or averaged across all processes. With this they lump together profiles of processes that perform different work. You want to create one profile per group of processes that performs similar work. In the above example you would perhaps have one for process 0, one for process 1-3 and one for the threads. The difference between threads that share memory, and processes which communicate across nodes should be considered!

Or for example, if you think about our sequence alignment work, you don't want to align dissimilar processes, because that would result in an alignment that is just not useful. Instead you use our technique to cluster processes and then only align and compare detailed timings of similar ones.

You can also use the clustering to safe screen space by only displaying information about one representative of each group, perhaps with an indicator for timing variance.

This concludes the walkthrough of what we did. There are two things I want to mention: In the code on GitHub, I was experimenting with a more C-style C++ with templates. I don't endorse that style.

Secondly, the above technique is rather heavy-weight. You can get pretty far by just hashing the function pair set. Equal sets share the hash, unequal sets don't. Clustering done. (Then you can e.g. compute similarity between each group) This approach is easier to implement and does not have the exponential worst-case complexity. But you lose the the hierarchy. My above example talks about the process 0 vs process 1-3 vs threads relationship which is rather simplistic. A performance monitor or runtime should be able to know about this. But then again, if your analysis technique automatically yields that and more intricate relationships, that's pretty cool, too.

Harmless Half-knowledge

2020-02-17

Load Local Vimrc Automatically From Parent Directories

2019-12-19

Dell XPS 7590 SSD Benchmark

2018-11-30

Diplom Thesis: Automatic Profile-Based Characterisation of Performance Traces for Highly Parallel Applications

Archive

Search

Subscribe