2015-07-26

OTF2 Knowledge: How to Match MPI Messages

OTF2 (part of Score-P) has a terrible reader interface. This post explains one little part of boilerplate you need to write in order to process MPI messages.

To properly process MPI send and receive, you need to determine which send belongs to which receive. This in conjunction with the previous two posts (1, 2) enables you to generate one list of messages to be used in your tool, subsequently.

Note that the following explanation also applies to OTF and likely to processing MPI send and receive events in general.

Formally, given:

    struct SentMessage {
        sender, receiver, time, communicator, length, tag }
    struct ReceivedMessage {
        receiver, sender, time, communicator, length, tag }

    list<SentMessage>     sentMessages
    list<ReceivedMessage> receivedMessages


Where both sent and received messages are ordered according to when these operations where issued, we want to determine:

    struct Message { sender, receiver, time, duration, length }

    list<Message> messages

First, we transform the received messages into another data structure:

    struct Key { sender, receiver, communicator, tag }

    map<Key, queue<ReceivedMessage>> receivedMessagesMap

Second, we iterate over all sent messages and index into the received messages map, to find the matching receive.
  • For each sent message
    • Use its sender, receiver, communicator and tag fields to index into receivedMessagesMap and retrieve the queue
    • Is there an item in the queue? Yes:
      • The first item in the queue is the receive that matches to this send
      • Append the matched message to messages. Duration is the difference between sent and receive time.
      • Dequeue this item
      • If duration ≤ 0: Warning: A message should be received later than it has been sent. We have undefined/infinite or negative bandwidth.
      • If send length > receive length: Warning: A sent message is not allowed to be larger than the receive assumes.
    • No:
      • Warning: Missing receive
  • Print out a condensed version of the above warnings

We use one queue per sender, receiver, communicator and tag, because a send belongs to a receive only if all of the four coincide.

Recording the warnings/errors makes sense, because MPI libraries are less strict than the standard. For example, if you send messages and never receive them and just exit the program, neither the MPI library nor Score-P might care and thus you have missing receives. I can confirm that this happens sometimes. Therefore, you should not disallow missing receives.
Also, be aware that durations will likely be zero or negative from time to time. Time depends on the clocks used. Clocks across different nodes/processes can yield values that imply zero or negative durations. Score-P does not sanitize time values. Thus, it is left to the tools developer to decide how to handle this situation.

With these three posts you can now properly process MPI messages, for example to compute correct bandwidths, or to visualize them:


Happy coding!

2015-07-05

OTF2 Knowledge: How to Process Non-Blocking MPI Messages

OTF2 (part of the Score-P) has a terrible reader interface. This post attempts to explain one part of boilerplate you need to write in order to process MPI messages.

If your tool processes MPI messages, it does not suffice to only consider MpiSend and MpiRecv records. Applications using MPI can also use non-blocking send and receive operations. Mixing blocking and non-blocking operations is possible as well. In order to handle MPI messages correctly, you need to consider all of the following records:

    OTF2_CallbackCode handleMpiSend(OTF2_LocationRef sender, 
        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint32_t receiver, OTF2_CommRef com, uint32_t tag,
        uint64_t length)
    OTF2_CallbackCode handleMpiIsend(OTF2_LocationRef sender,
        OTF2_TimeStamp time, void *userData, OTF2_AttributeList *a,
        uint32_t receiver, OTF2_CommRef com, uint32_t tag,
        uint64_t length, uint64_t requestId)
    OTF2_CallbackCode handleMpiIsendComplete(

        OTF2_LocationRef sender, OTF2_TimeStamp time,
        void *userData, OTF2_AttributeList *a, uint64_t requestId)
    OTF2_CallbackCode handleMpiRecv(OTF2_LocationRef receiver,

        OTF2_TimeStamp time, void* userData, OTF2_AttributeList* a,
        uint32_t sender, OTF2_CommRef com, uint32_t tag,
        uint64_t length)
    OTF2_CallbackCode handleMpiIrecv(OTF2_LocationRef receiver,

        OTF2_TimeStamp time, void *userData, OTF2_AttributeList *a,
        uint32_t sender, OTF2_CommRef com, uint32_t tag,
        uint64_t length, uint64_t requestId)
    OTF2_CallbackCode handleMpiIrecvRequest(
        OTF2_LocationRef receiver, OTF2_TimeStamp time,
        void *userData, OTF2_AttributeList *a, uint64_t requestId)
    OTF2_CallbackCode handleMpiRequestCancelled(

        OTF2_LocationRef location, OTF2_TimeStamp time,
        void *userData, OTF2_AttributeList *a, uint64_t requestId)

This post explains how to translate non-blocking send and receive records to normal/blocking send and receives, so that your tool can subsequently process all types of messages in a homogenous way.

In a previous post I explained a detail you need know for this to work. 

Explaining the Involved Records

  • Send: Is issued when MPI_Send is called
  • Isend: Is issued when MPI_Isend is called
  • IsendComplete: Is issued when an MPI_Wait, MPI_Test or a similar function confirms that the Isend operation has been completed
  • Receive: Is issued when MPI_Recv is called
  • IreceiveRequest:
    • Is issued when MPI_Irecv is called
    • Similar to the Isend record, except you don't yet know the tag, communicator and length of the to-be-received message, because of possible wildcards for tags and communicators in such requests
  • Ireceive:
    • Issued when an MPI_Wait, MPI_Test or a similar function confirms that the Ireceive operation has been completed
    • Similar to IsendComplete, except it contains the parameters of the received message, whereas Isend itself (not the complete) contains them
  • RequestCancelled: Can cancel Isends and IreceiveRequests, and other maybe not-recorded requests. 
The naming scheme of these records is a bit confusing. A more consistent scheme would have been: Isend, IsendComplete, Ireceive (=IreceiveRequest here), IreceiveComplete (=Ireceive here).

Data Structures

Send/Receive

    struct SentMessage {
        time, sender, receiver, communicator, length, tag }
    struct ReceivedMessage {
        time, receiver, sender , communicator, length, tag }

Non-blocking

    struct Isend { time, sender, receiver, com, length,
        tag, requestId, queue<SentMessage> blockedSends }
    struct IreceiveRequest { requestId,
        queue<ReceivedMessage> blockedReceives }

    map<process, queue<Isend>>           isends
    map<process, queue<IreceiveRequest>> ireceiveRequests

The Algorithm

Send

  • If isends for this process is empty
    • Record the sent message
  • Otherwise
    • Append this Send to the latest isends[sender]'s blockedSends 
Explanation: To preserve the correct ordering of messages, previously issued Isends have to be processes before this Send. We can only process Isends that are completed, because they might be cancelled subsequently. We therefore enqueue this Send until all previous Isends are processed.

Isend

  • Append this Isend to isends[sender] with an empty queue

IsendComplete

  • Search for a matching Isend on this process
  • If it matches the first in the queue
    • Record the sent message (time is Isend's time)
    • For each blocked Send in the attached queue
      • Record the sent message
  • Otherwise
    • Append this completed Send to the previous Isend's blocked queue
    • Append the blocked queue of the completed Send to the previous Isend's queue as well
  • remove this entry from isends[sender]
Explanation: An Isend has been completed. Therefore, we have a succesfully sent message that we can record unless there are previously issued, uncompleted Isends (similar to when Send happens). If we completed the earliest remaining Isend, we can now process all messages that have been blocked by it. If we completed an Isend that has previous other Isends, then we complete this Send and enqueue everything to the previous Isend's queue, because we can only process these messages when this previous Isend has been completed.

The receive records will be handled similarly, with some slight modifications due to differences in whether information is known during the request start or completion.

Receive

  • If ireceiveRequests for this process is empty
    • Record the received message
  • Otherwise
    • Append this Receive to the latest ireceiveRequests[receiver]'s blockedReceives

IreceiveRequest

  • Append the request to ireceiveRequest[receiver] with an empty queue.

Ireceive

  • Search for a matching IreceiveRequest on this process
  • If it matches the first in the queue
    • Record the received message (time is Ireceive's time)
    • For each blocked Receive in the attached queue
      • Record the received message
  • Otherwise
    • Append this completed Receive to the previous IreceiveRequest's blocked queue
    • Append the queue of the completed Receive to the previous request's queue as well
  • Remove this entry from ireceiveRequests[receiver]

RequestCancelled

  • If it matches an Isend's requestId
    • Handle the same as IsendComplete, but without recording this send
  • If it matches an IreceiveRequest's requestId
    • Handle the same as Ireceive, but without recording this receive
  • It should never be both, but can be neither

Wrap Up

A sent message has the timestamp of the initial Isend. A receive has the timestamp of the completed Ireceive. Thus, we take the timestamp of the earliest possible sending moment and the latest possible receiving moment, because the message is actually transmitted sometime during this interval. This is as accurate as it gets in OTF2. Because of this, IreceiveRequest's and IsendComplete's timestamps are ignored. This also means that the order in which you process the records is not necessarily chronological for receives, but it is for sends.

After applying the above algorithm you have a list of sent and received MPI messages. You don't yet know which send belongs to which receive. Determining this relationship is called message matching. I intend to explain how it works in a future blog post.

Once this is done, you have one list of messages with correct durations, that can finally be used in your tool.

Happy Coding!


P.S.: I used OTF2 version 1.5.1