2015-07-26

OTF2 Knowledge: How to Match MPI Messages

OTF2 (part of Score-P) has a terrible reader interface. This post explains one little part of the boilerplate you need to write in order to process MPI messages.

To properly process MPI sends and receives, you need to determine which send belongs to which receive. This, in conjunction with the previous two posts (1, 2), enables you to generate a single list of messages that your tool can subsequently use.

Note that the following explanation also applies to OTF and likely to processing MPI send and receive events in general.

Formally, given:

    struct SentMessage {
        sender, receiver, time, communicator, length, tag }
    struct ReceivedMessage {
        receiver, sender, time, communicator, length, tag }

    list<SentMessage>     sentMessages
    list<ReceivedMessage> receivedMessages


Where both sent and received messages are ordered according to when these operations were issued, we want to determine:

    struct Message { sender, receiver, time, duration, length }

    list<Message> messages

First, we transform the received messages into another data structure:

    struct Key { sender, receiver, communicator, tag }

    map<Key, queue<ReceivedMessage>> receivedMessagesMap

Second, we iterate over all sent messages and index into the received messages map to find the matching receive (a C++ sketch of the whole procedure follows below).
  • For each sent message
    • Use its sender, receiver, communicator and tag fields to index into receivedMessagesMap and retrieve the queue
    • Is there an item in the queue? Yes:
      • The first item in the queue is the receive that matches this send
      • Append the matched message to messages. The duration is the receive time minus the send time.
      • Dequeue this item
      • If duration ≤ 0: Warning: a message should be received later than it was sent; we have undefined, infinite, or negative bandwidth.
      • If send length > receive length: Warning: a sent message is not allowed to be larger than what the receive expects.
    • No:
      • Warning: Missing receive
  • Print out a condensed version of the above warnings

We use one queue per (sender, receiver, communicator, tag) combination, because a send belongs to a receive only if all four coincide.
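
Here is a minimal C++ sketch of the whole matching procedure. The struct layouts and container names follow the pseudocode above; the tuple key, the function name matchMessages, and the condensed warning counters are my own choices.

    #include <cstdint>
    #include <iostream>
    #include <list>
    #include <map>
    #include <queue>
    #include <tuple>

    // (sender, receiver, communicator, tag); std::tuple provides operator< for the map
    using Key = std::tuple<uint64_t, uint64_t, uint64_t, uint32_t>;

    struct SentMessage     { uint64_t sender, receiver, time, communicator, length; uint32_t tag; };
    struct ReceivedMessage { uint64_t receiver, sender, time, communicator, length; uint32_t tag; };
    struct Message         { uint64_t sender, receiver, time; int64_t duration; uint64_t length; };

    std::list<Message> matchMessages(const std::list<SentMessage>& sentMessages,
                                     const std::list<ReceivedMessage>& receivedMessages) {
        // First: bucket the receives by key, preserving their order within each bucket
        std::map<Key, std::queue<ReceivedMessage>> receivedMessagesMap;
        for (const auto& r : receivedMessages) {
            receivedMessagesMap[Key{r.sender, r.receiver, r.communicator, r.tag}].push(r);
        }

        // Second: walk the sends in order and pop the matching receive, if any
        std::list<Message> messages;
        int missingReceives = 0, nonPositiveDurations = 0, truncatedReceives = 0;
        for (const auto& s : sentMessages) {
            auto it = receivedMessagesMap.find(Key{s.sender, s.receiver, s.communicator, s.tag});
            if (it == receivedMessagesMap.end() || it->second.empty()) {
                missingReceives += 1; // not fatal: the MPI library may not have cared either
                continue;
            }
            ReceivedMessage r = it->second.front();
            it->second.pop();
            int64_t duration = int64_t(r.time) - int64_t(s.time);
            if (duration <= 0)       { nonPositiveDurations += 1; }
            if (s.length > r.length) { truncatedReceives += 1; }
            messages.push_back({s.sender, s.receiver, s.time, duration, s.length});
        }

        // Print a condensed version instead of one warning per offending message
        if (missingReceives > 0)      std::cerr << missingReceives << " missing receives\n";
        if (nonPositiveDurations > 0) std::cerr << nonPositiveDurations << " non-positive durations\n";
        if (truncatedReceives > 0)    std::cerr << truncatedReceives << " sends larger than their receives\n";
        return messages;
    }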

Recording the warnings/errors makes sense, because MPI libraries are less strict than the standard. For example, if you send messages, never receive them, and just exit the program, neither the MPI library nor Score-P may complain, and thus you end up with missing receives. I can confirm that this happens sometimes. Therefore, you should not treat missing receives as fatal.
Also, be aware that durations will likely be zero or negative from time to time. Timestamps depend on the clocks used, and clocks across different nodes/processes can yield values that imply zero or negative durations. Score-P does not sanitize time values, so it is left to the tool developer to decide how to handle this situation.

With these three posts you can now properly process MPI messages, for example to compute correct bandwidths or to visualize them.


Happy coding!

2015-07-05

OTF2 Knowledge: How to Process Non-Blocking MPI Messages

OTF2 (part of Score-P) has a terrible reader interface. This post explains one part of the boilerplate you need to write in order to process MPI messages.

If your tool processes MPI messages, it does not suffice to only consider MpiSend and MpiRecv records. Applications using MPI can also use non-blocking send and receive operations. Mixing blocking and non-blocking operations is possible as well. In order to handle MPI messages correctly, you need to consider all of the following records:

    OTF2_CallbackCode handleMpiSend(OTF2_LocationRef sender,
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint32_t receiver, OTF2_CommRef com, uint32_t tag,
        uint64_t length)

    OTF2_CallbackCode handleMpiIsend(OTF2_LocationRef sender,
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint32_t receiver, OTF2_CommRef com, uint32_t tag,
        uint64_t length, uint64_t requestId)

    OTF2_CallbackCode handleMpiIsendComplete(OTF2_LocationRef sender,
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint64_t requestId)

    OTF2_CallbackCode handleMpiRecv(OTF2_LocationRef receiver,
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint32_t sender, OTF2_CommRef com, uint32_t tag,
        uint64_t length)

    OTF2_CallbackCode handleMpiIrecv(OTF2_LocationRef receiver,
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint32_t sender, OTF2_CommRef com, uint32_t tag,
        uint64_t length, uint64_t requestId)

    OTF2_CallbackCode handleMpiIrecvRequest(OTF2_LocationRef receiver,
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint64_t requestId)

    OTF2_CallbackCode handleMpiRequestCancelled(OTF2_LocationRef location,
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint64_t requestId)

This post explains how to translate non-blocking send and receive records into ordinary blocking sends and receives, so that your tool can subsequently process all types of messages in a homogeneous way.

In a previous post I explained a detail you need to know for this to work.

Explaining the Involved Records

  • Send: Is issued when MPI_Send is called
  • Isend: Is issued when MPI_Isend is called
  • IsendComplete: Is issued when an MPI_Wait, MPI_Test or a similar function confirms that the Isend operation has been completed
  • Receive: Is issued when MPI_Recv is called
  • IreceiveRequest:
    • Is issued when MPI_Irecv is called
    • Similar to the Isend record, except you don't yet know the tag, communicator, and length of the to-be-received message, partly because such requests may use wildcards such as MPI_ANY_TAG and MPI_ANY_SOURCE
  • Ireceive:
    • Issued when an MPI_Wait, MPI_Test or a similar function confirms that the Ireceive operation has been completed
    • Similar to IsendComplete, except it contains the parameters of the received message, whereas on the send side the Isend itself (not the completion) carries them
  • RequestCancelled: Cancels Isends and IreceiveRequests, and possibly other requests that are not recorded.

The naming scheme of these records is a bit confusing. A more consistent scheme would have been: Isend, IsendComplete, Ireceive (= IreceiveRequest here), IreceiveComplete (= Ireceive here).

Data Structures

Send/Receive

    struct SentMessage {
        time, sender, receiver, communicator, length, tag }
    struct ReceivedMessage {
        time, receiver, sender, communicator, length, tag }

Non-blocking

    struct Isend { time, sender, receiver, communicator, length,
        tag, requestId, queue<SentMessage> blockedSends }
    struct IreceiveRequest { requestId,
        queue<ReceivedMessage> blockedReceives }

    map<process, queue<Isend>>           isends
    map<process, queue<IreceiveRequest>> ireceiveRequests

The Algorithm

Send

  • If isends for this process is empty
    • Record the sent message
  • Otherwise
    • Append this Send to the blockedSends queue of the latest entry in isends[sender]

Explanation: To preserve the correct ordering of messages, previously issued Isends have to be processed before this Send. We can only process Isends that have completed, because they might still be cancelled. We therefore enqueue this Send until all previous Isends are processed.

Isend

  • Append this Isend to isends[sender] with an empty queue

IsendComplete

  • Search for a matching Isend on this process
  • If it matches the first in the queue
    • Record the sent message (time is the Isend's time)
    • For each blocked Send in the attached queue
      • Record the sent message
  • Otherwise
    • Append this completed Isend's message to the previous Isend's blocked queue
    • Append the blocked queue of the completed Isend to the previous Isend's queue as well
  • Remove this entry from isends[sender]

Explanation: An Isend has completed. Therefore, we have a successfully sent message that we can record, unless there are previously issued, uncompleted Isends (similar to when a Send happens). If we completed the earliest remaining Isend, we can now process all messages that were blocked by it. If we completed an Isend that has earlier Isends still pending, we append its message and everything it blocked to the previous Isend's queue, because these messages can only be processed once that previous Isend has completed.
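
To make this concrete, here is a minimal C++ sketch of the send side. All names (onSend, onIsend, completeIsend, recordSentMessage) are my own, not OTF2 API; recordSentMessage stands in for whatever your tool does with a finished message, and the record flag anticipates RequestCancelled below, which runs the same logic without recording anything.

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <iterator>
    #include <map>

    struct SentMessage { uint64_t time, sender, receiver, communicator, length; uint32_t tag; };

    struct Isend {
        SentMessage message;                  // parameters known when MPI_Isend was called
        uint64_t requestId;
        std::deque<SentMessage> blockedSends; // later sends waiting for this Isend
    };

    std::map<uint64_t /*process*/, std::deque<Isend>> isends;
    std::deque<SentMessage> recordedSends;    // stands in for your tool's output

    void recordSentMessage(const SentMessage& m) { recordedSends.push_back(m); }

    // Send record
    void onSend(const SentMessage& m) {
        auto& pending = isends[m.sender];
        if (pending.empty()) recordSentMessage(m);     // nothing pending: record directly
        else pending.back().blockedSends.push_back(m); // otherwise preserve message order
    }

    // Isend record
    void onIsend(const SentMessage& m, uint64_t requestId) {
        isends[m.sender].push_back({m, requestId, {}});
    }

    // IsendComplete record (record = true); RequestCancelled uses record = false
    void completeIsend(uint64_t process, uint64_t requestId, bool record) {
        auto& pending = isends[process];
        auto it = std::find_if(pending.begin(), pending.end(),
            [&](const Isend& i) { return i.requestId == requestId; });
        if (it == pending.end()) return; // no matching Isend; a real tool would warn here

        if (it == pending.begin()) {
            // Earliest pending Isend: record it and everything it blocked
            if (record) recordSentMessage(it->message);
            for (const auto& m : it->blockedSends) recordSentMessage(m);
        } else {
            // Earlier Isends still pending: hand our message and queue to the predecessor
            auto& prev = *std::prev(it);
            if (record) prev.blockedSends.push_back(it->message);
            for (const auto& m : it->blockedSends) prev.blockedSends.push_back(m);
        }
        pending.erase(it);
    }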

The receive records are handled similarly, with slight modifications due to differences in what is known at request start versus completion.

Receive

  • If ireceiveRequests for this process is empty
    • Record the received message
  • Otherwise
    • Append this Receive to the blockedReceives queue of the latest entry in ireceiveRequests[receiver]

IreceiveRequest

  • Append the request to ireceiveRequests[receiver] with an empty queue.

Ireceive

  • Search for a matching IreceiveRequest on this process
  • If it matches the first in the queue
    • Record the received message (time is Ireceive's time)
    • For each blocked Receive in the attached queue
      • Record the received message
  • Otherwise
    • Append this completed Ireceive's message to the previous IreceiveRequest's blocked queue
    • Append the blocked queue of the completed Ireceive to the previous request's queue as well
  • Remove this entry from ireceiveRequests[receiver]
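
Continuing the sketch from the send side, the receive side looks like this. The main difference is that the message parameters arrive with the completion (Ireceive), not with the request, so the request only stores the requestId; again, all names are my own:

    struct ReceivedMessage { uint64_t time, receiver, sender, communicator, length; uint32_t tag; };

    struct IreceiveRequest {
        uint64_t requestId;
        std::deque<ReceivedMessage> blockedReceives; // later receives waiting for this request
    };

    std::map<uint64_t /*process*/, std::deque<IreceiveRequest>> ireceiveRequests;
    std::deque<ReceivedMessage> recordedReceives;    // stands in for your tool's output

    void recordReceivedMessage(const ReceivedMessage& m) { recordedReceives.push_back(m); }

    // Receive record
    void onReceive(const ReceivedMessage& m) {
        auto& pending = ireceiveRequests[m.receiver];
        if (pending.empty()) recordReceivedMessage(m);
        else pending.back().blockedReceives.push_back(m);
    }

    // IreceiveRequest record: only the requestId is known at this point
    void onIreceiveRequest(uint64_t receiver, uint64_t requestId) {
        ireceiveRequests[receiver].push_back({requestId, {}});
    }

    // Ireceive record (record = true); RequestCancelled uses record = false,
    // in which case message is unused
    void completeIreceiveRequest(uint64_t process, uint64_t requestId,
                                 bool record, const ReceivedMessage& message = {}) {
        auto& pending = ireceiveRequests[process];
        auto it = std::find_if(pending.begin(), pending.end(),
            [&](const IreceiveRequest& r) { return r.requestId == requestId; });
        if (it == pending.end()) return;

        if (it == pending.begin()) {
            // Earliest pending request: record it and everything it blocked
            if (record) recordReceivedMessage(message);
            for (const auto& m : it->blockedReceives) recordReceivedMessage(m);
        } else {
            // Earlier requests still pending: hand everything to the predecessor
            auto& prev = *std::prev(it);
            if (record) prev.blockedReceives.push_back(message);
            for (const auto& m : it->blockedReceives) prev.blockedReceives.push_back(m);
        }
        pending.erase(it);
    }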

RequestCancelled

  • If it matches an Isend's requestId
    • Handle the same as IsendComplete, but without recording this send
  • If it matches an IreceiveRequest's requestId
    • Handle the same as Ireceive, but without recording this receive
  • It should never match both, but it can match neither
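
Assuming the completeIsend and completeIreceiveRequest helpers sketched above, RequestCancelled is then short:

    // Run the completion logic, but drop the cancelled message itself.
    // Messages blocked behind the cancelled request are still flushed, because
    // they were real operations. Calling both helpers is safe: an unmatched
    // requestId is a no-op, and it should never match both.
    void onRequestCancelled(uint64_t process, uint64_t requestId) {
        completeIsend(process, requestId, /*record =*/ false);
        completeIreceiveRequest(process, requestId, /*record =*/ false);
    }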

Wrap Up

A sent message has the timestamp of the initial Isend. A receive has the timestamp of the completed Ireceive. Thus, we take the timestamp of the earliest possible sending moment and the latest possible receiving moment, because the message is actually transmitted sometime during this interval. This is as accurate as it gets in OTF2. Because of this, IreceiveRequest's and IsendComplete's timestamps are ignored. This also means that the order in which you process the records is not necessarily chronological for receives, but it is for sends.

After applying the above algorithm you have a list of sent and received MPI messages. You don't yet know which send belongs to which receive. Determining this relationship is called message matching. I intend to explain how it works in a future blog post.

Once this is done, you have one list of messages with correct durations that can finally be used in your tool.

Happy Coding!


P.S.: I used OTF2 version 1.5.1

2015-05-29

OTF2 Knowledge: How to Map Local MPI Ranks to OTF2 Locations

OTF2 (part of the Score-P infrastructure) has a terrible reader interface. This post explains one little part of the boilerplate you need to write in order to process MPI records.

For example, to process an MPI_Send record you use a handler with the following signature:

    OTF2_CallbackCode handleMpiSend(OTF2_LocationRef sender,
        OTF2_TimeStamp time, void* userData,
        OTF2_AttributeList* a, uint32_t receiver,
        OTF2_CommRef com, uint32_t tag, uint64_t length);


Notice that the data types of the sender and receiver processes are not the same. This is because the event is recorded as-is during tracing, without trying to match MPI-specific, communicator-dependent identifiers to communicator-independent OTF2 ones. Since Score-P/OTF2 does not have a post-processing step, the connection between these identifiers and the OTF2 locations is never established. It is left to the programmer to apply this mapping while reading the trace.

Required Handlers

    OTF2_CallbackCode handleOtf2DefGroup(void *userData,
        OTF2_GroupRef group, OTF2_StringRef name,
        OTF2_GroupType groupType, OTF2_Paradigm paradigm,
        OTF2_GroupFlag groupFlags, uint32_t memberCount,
        const uint64_t* members);

    OTF2_CallbackCode handleOtf2DefCommunicator(void *userData,
        OTF2_CommRef com, OTF2_StringRef name,
        OTF2_GroupRef group, OTF2_CommRef parent);


Data Structures

    map<OTF2_CommRef, OTF2_GroupRef> communicatorToGroup;
    map<OTF2_GroupRef, map<uint32_t /*local rank*/,
        uint64_t /*rank in comm world*/>> localRankToGlobalRank;
    OTF2_GroupRef mpiLocationGroup;

The Algorithm

DefGroup

    localRankToGlobalRank[group] = {};
    for (uint32_t i = 0; i < memberCount; i += 1) {
        localRankToGlobalRank[group][i] = members[i];
    }

    if (groupType == OTF2_GROUP_TYPE_COMM_LOCATIONS &&
            paradigm == OTF2_PARADIGM_MPI) {
        mpiLocationGroup = group;
    }

DefCommunicator

    communicatorToGroup[com] = group;

Mapping Local MPI Ranks to OTF2 Locations

Input: a local MPI rank rank and the MPI communicator com it belongs to
Output: the corresponding OTF2 location
Procedure: First, map the local MPI rank to its rank in MPI_COMM_WORLD. Next, map this rank to the OTF2 location using the magic group mpiLocationGroup:

    localRankToGlobalRank[mpiLocationGroup][
        localRankToGlobalRank[communicatorToGroup[com]][rank]]
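
Wrapped in a small helper (a sketch assuming the data structures above are in scope; the function name rankToLocation is my own):

    // Maps a communicator-local MPI rank to an OTF2 location in two steps:
    // local rank -> rank in MPI_COMM_WORLD -> OTF2 location
    uint64_t rankToLocation(OTF2_CommRef com, uint32_t rank) {
        uint64_t worldRank = localRankToGlobalRank[communicatorToGroup[com]][rank];
        return localRankToGlobalRank[mpiLocationGroup][static_cast<uint32_t>(worldRank)];
    }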

Happy Coding!

P.S.: I used OTF2 version 1.5.1

2015-05-09

Automatically flushing QTextStream

QTextStream is way more convenient for printing to the console than C++ streams are. Therefore, I usually replace cerr and cout with qerr and qout in projects using Qt:

Header file:
     #include <QTextStream>

     extern QTextStream qerr;
     extern QTextStream qout;

Source file:
     QTextStream qerr(stderr, QIODevice::WriteOnly | QIODevice::Text);
     QTextStream qout(stdout, QIODevice::WriteOnly | QIODevice::Text);

Unfortunately, QTextStream buffers its output and cannot flush automatically. This means it may print a message later than expected, or not at all if the program crashes. This is obviously undesirable for a cerr replacement.

One way to work around this is to intercept all stream operator calls and add a call to flush():

Header file:
    #include <QTextStream>
    #include <utility>

    class AutoFlushingQTextStream : public QTextStream {
    public:
        AutoFlushingQTextStream(FILE* f, QIODevice::OpenMode o)
                : QTextStream(f, o) {}

        template<typename T>
        AutoFlushingQTextStream& operator<<(T&& s) {
            // Forward to the base class operator, then flush immediately
            static_cast<QTextStream&>(*this) << std::forward<T>(s);
            flush();
            return *this;
        }
    };

    extern AutoFlushingQTextStream qerr;
    extern QTextStream qout;

Source file:
    AutoFlushingQTextStream qerr(stderr, QIODevice::WriteOnly | QIODevice::Text);
    QTextStream qout(stdout, QIODevice::WriteOnly | QIODevice::Text);
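
With this in place, each chunk written to qerr reaches the console immediately. For example (crashyFunction is a made-up stand-in):

    qerr << "about to do something dangerous\n"; // flushed right away
    crashyFunction(); // the message above is visible even if this crashes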


Any tips for improving the code are very much appreciated.

2015-02-04

Minor Thesis - Development and Analysis of Barrier Protocols

A year ago, I defended my minor thesis (the equivalent of a bachelor thesis; in German: Großer Beleg) (PDF download). The repository, including smaller course work on similar topics from before, is on GitHub: https://github.com/hydroo/barrier.

Thesis

Contents

  • 1/2 Intro/Background
    • Cache coherency and atomic operations don't scale
    • Probabilistic synchronization is a good idea
    • Survey of currently used barrier protocols and of general ways to implement barriers
    • Detailed look at MPI 3's remote memory access API for implementing less deterministic barrier/synchronization protocols
  • 3 Three Novel Barrier Protocols
    • MGB Barrier: Inspired by Nicholas Mc Guire
    • B1 Barrier: The opposite of the Central Counter Barrier
    • B2 Barrier: Variant of the B1 Barrier that computes more and communicates less
  • 4 Evaluation
    • Central Counter Barrier vs B1 Barrier
    • Dissemination Barrier vs B2 Barrier
    • Some high-level observations about the four protocols
    • Verify correctness of the protocols via model checking
    • Compare the speed and energy consumption of the four barriers using quantitative model checking

Personal Comment

Sadly, I had to prioritize model checking over measurement-based analysis, so the quantitative results were never validated against real measurements. Therefore, I have little confidence in the quantitative model checking results. Aside from that, I quite like the thesis and think there are valuable ideas in it. Especially:
  • The so-called progress problem that most barriers have. It is explained in Section 4.1.
  • The B1 and B2 Barriers can be faster than every other barrier protocol. Depending on the computation speed and communication latency, different variants of them should be used. They avoid the progress problem. They have their weaknesses, too: more analysis is required to state definitively whether the additional resource requirements outweigh the speed advantage.
  • The so-called barrier building blocks (Section 2.3)

Repository

Abstract

This is a semi-organized repository of ~1 year of university work. During this time I:
  • Surveyed existing barrier protocols
  • Surveyed actual implementations
  • Surveyed what low-level tools are available to develop synchronization primitives, and in particular barriers
  • Invented new protocols
  • Analyzed, model checked, and benchmarked them
  • Took an in-depth look at MPI 3's remote memory access API and its suitability for implementing (less deterministic) barriers and other synchronization protocols

License

Everything is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license. If this licensing scheme is a problem for your intentions, please contact me. I'm open to suggestions.

Why upload the whole thing?

  • Because there are few reasons not to
  • I would oftentimes be very happy if I could dig through the source code of someone's research
  • For people interested in researching barrier protocols and synchronization primitives in general
  • For the useful code snippets contained
  • As a portfolio entry so potential employers can see what I did and how I did it

2015-01-30

New PGP and OTR Keys

PGP key id: e94b3df55b06c8e4
OTR fingerprint: 1d070cab 116463a6 0f94b424 075b6c83 abb4a140

When in doubt, contact me in some more reliable way, since this blog isn't a secure channel for distributing this information.