2015-07-05

OTF2 Knowledge: How to Process Non-Blocking MPI Messages

OTF2 (part of Score-P) has a terrible reader interface. This post attempts to explain one piece of the boilerplate you need to write in order to process MPI messages.

If your tool processes MPI messages, it does not suffice to only consider MpiSend and MpiRecv records. Applications using MPI can also use non-blocking send and receive operations. Mixing blocking and non-blocking operations is possible as well. In order to handle MPI messages correctly, you need to consider all of the following records:

    OTF2_CallbackCode handleMpiSend(OTF2_LocationRef sender,
        OTF2_TimeStamp time, void *userData, OTF2_AttributeList *a,
        uint32_t receiver, OTF2_CommRef com, uint32_t tag,
        uint64_t length);
    OTF2_CallbackCode handleMpiIsend(OTF2_LocationRef sender,
        OTF2_TimeStamp time, void *userData, OTF2_AttributeList *a,
        uint32_t receiver, OTF2_CommRef com, uint32_t tag,
        uint64_t length, uint64_t requestId);
    OTF2_CallbackCode handleMpiIsendComplete(
        OTF2_LocationRef sender, OTF2_TimeStamp time,
        void *userData, OTF2_AttributeList *a, uint64_t requestId);
    OTF2_CallbackCode handleMpiRecv(OTF2_LocationRef receiver,
        OTF2_TimeStamp time, void *userData, OTF2_AttributeList *a,
        uint32_t sender, OTF2_CommRef com, uint32_t tag,
        uint64_t length);
    OTF2_CallbackCode handleMpiIrecv(OTF2_LocationRef receiver,
        OTF2_TimeStamp time, void *userData, OTF2_AttributeList *a,
        uint32_t sender, OTF2_CommRef com, uint32_t tag,
        uint64_t length, uint64_t requestId);
    OTF2_CallbackCode handleMpiIrecvRequest(
        OTF2_LocationRef receiver, OTF2_TimeStamp time,
        void *userData, OTF2_AttributeList *a, uint64_t requestId);
    OTF2_CallbackCode handleMpiRequestCancelled(
        OTF2_LocationRef location, OTF2_TimeStamp time,
        void *userData, OTF2_AttributeList *a, uint64_t requestId);

This post explains how to translate non-blocking send and receive records into normal (blocking) sends and receives, so that your tool can subsequently process all types of messages in a homogeneous way.

In a previous post I explained a detail you need to know for this to work.

Explaining the Involved Records

  • Send: Is issued when MPI_Send is called
  • Isend: Is issued when MPI_Isend is called
  • IsendComplete: Is issued when an MPI_Wait, MPI_Test or a similar function confirms that the Isend operation has been completed
  • Receive: Is issued when MPI_Recv is called
  • IreceiveRequest:
    • Is issued when MPI_Irecv is called
    • Similar to the Isend record, except you don't yet know the tag, communicator and length of the to-be-received message, because of possible wildcards for tags and communicators in such requests
  • Ireceive:
    • Issued when an MPI_Wait, MPI_Test or a similar function confirms that the Ireceive operation has been completed
    • Similar to IsendComplete, except that it contains the parameters of the received message, whereas on the send side the Isend record itself (not IsendComplete) contains them
  • RequestCancelled: Can cancel Isends and IreceiveRequests, as well as other requests that may not have been recorded

The naming scheme of these records is a bit confusing. A more consistent scheme would have been: Isend, IsendComplete, Ireceive (=IreceiveRequest here), IreceiveComplete (=Ireceive here).

Data Structures

Send/Receive

    struct SentMessage {
        time, sender, receiver, communicator, length, tag }
    struct ReceivedMessage {
        time, receiver, sender, communicator, length, tag }

Non-blocking

    struct Isend { time, sender, receiver, com, length,
        tag, requestId, queue<SentMessage> blockedSends }
    struct IreceiveRequest { requestId,
        queue<ReceivedMessage> blockedReceives }

    map<process, queue<Isend>>           isends
    map<process, queue<IreceiveRequest>> ireceiveRequests
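
The pseudocode above could be spelled out in C++ roughly as follows. This is only a sketch: the integer widths are my assumption about what the OTF2 typedefs (OTF2_LocationRef, OTF2_TimeStamp, OTF2_CommRef) boil down to, and the wrapper name `MessageState` is made up for illustration. I use std::deque rather than std::queue because a completion record may refer to an entry in the middle of the queue.

```cpp
#include <cstdint>
#include <deque>
#include <map>

// Assumed widths; a real tool would use the typedefs from <otf2/otf2.h>.
using Location  = std::uint64_t;
using Timestamp = std::uint64_t;
using Comm      = std::uint32_t;

struct SentMessage {
    Timestamp time; Location sender; std::uint32_t receiver;
    Comm com; std::uint64_t length; std::uint32_t tag;
};

struct ReceivedMessage {
    Timestamp time; Location receiver; std::uint32_t sender;
    Comm com; std::uint64_t length; std::uint32_t tag;
};

// A pending Isend already knows all message parameters.
struct Isend {
    Timestamp time; Location sender; std::uint32_t receiver;
    Comm com; std::uint64_t length; std::uint32_t tag;
    std::uint64_t requestId;
    std::deque<SentMessage> blockedSends;  // sends waiting behind this Isend
};

// A pending receive request only knows its request id so far.
struct IreceiveRequest {
    std::uint64_t requestId;
    std::deque<ReceivedMessage> blockedReceives;
};

// Hypothetical container for the per-process queues.
struct MessageState {
    std::map<Location, std::deque<Isend>> isends;
    std::map<Location, std::deque<IreceiveRequest>> ireceiveRequests;
};
```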

The Algorithm

Send

  • If isends for this process is empty
    • Record the sent message
  • Otherwise
    • Append this Send to the blockedSends of the latest entry in isends[sender]

Explanation: To preserve the correct ordering of messages, previously issued Isends have to be processed before this Send. We can only process Isends that are completed, because they might be cancelled subsequently. We therefore enqueue this Send until all previous Isends are processed.

Isend

  • Append this Isend to isends[sender] with an empty queue

IsendComplete

  • Search for a matching Isend on this process
  • If it matches the first in the queue
    • Record the sent message (time is Isend's time)
    • For each blocked Send in the attached queue
      • Record the sent message
  • Otherwise
    • Append this completed send to the previous Isend's blocked queue
    • Append the blocked queue of the completed Isend to the previous Isend's queue as well
  • Remove this entry from isends[sender]

Explanation: An Isend has been completed. Therefore, we have a successfully sent message that we can record, unless there are previously issued, uncompleted Isends (similar to when a Send happens). If we completed the earliest remaining Isend, we can now process all messages that have been blocked by it. If we completed an Isend that still has earlier uncompleted Isends in front of it, we enqueue its message and everything it was blocking to the previous Isend's queue, because we can only process these messages once this previous Isend has been completed.
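
To make the send side more concrete, here is a minimal C++ sketch of the three steps above. All names (`SendTracker`, `PendingIsend`, `onSend`, `onIsend`, `onIsendComplete`, `recorded`) are made up for this example and are not part of OTF2. For brevity the pending entry stores the message as one struct instead of separate fields, and unknown request ids are silently ignored, which is my assumption rather than something OTF2 prescribes.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <iterator>
#include <map>
#include <vector>

struct SentMessage {
    std::uint64_t time; std::uint64_t sender; std::uint32_t receiver;
    std::uint32_t com; std::uint64_t length; std::uint32_t tag;
};

struct PendingIsend {
    std::uint64_t requestId;
    SentMessage message;                   // parameters are known at Isend time
    std::deque<SentMessage> blockedSends;  // messages waiting behind this Isend
};

class SendTracker {
public:
    std::vector<SentMessage> recorded;  // final list of sends, in issue order

    // Send: record directly, or park it behind the latest pending Isend.
    void onSend(const SentMessage& m) {
        std::deque<PendingIsend>& q = pending_[m.sender];
        if (q.empty()) recorded.push_back(m);
        else q.back().blockedSends.push_back(m);
    }

    // Isend: remember it with an empty blocked queue.
    void onIsend(const SentMessage& m, std::uint64_t requestId) {
        pending_[m.sender].push_back({requestId, m, {}});
    }

    // IsendComplete: release the message and everything it was blocking.
    void onIsendComplete(std::uint64_t sender, std::uint64_t requestId) {
        std::deque<PendingIsend>& q = pending_[sender];
        auto it = std::find_if(q.begin(), q.end(),
            [requestId](const PendingIsend& p) { return p.requestId == requestId; });
        if (it == q.end()) return;  // unknown request: ignored in this sketch
        if (it == q.begin()) {
            // Earliest pending Isend: nothing older can still be cancelled.
            recorded.push_back(it->message);
            recorded.insert(recorded.end(),
                            it->blockedSends.begin(), it->blockedSends.end());
        } else {
            // Hand everything over to the Isend issued just before this one.
            PendingIsend& earlier = *std::prev(it);
            earlier.blockedSends.push_back(it->message);
            earlier.blockedSends.insert(earlier.blockedSends.end(),
                            it->blockedSends.begin(), it->blockedSends.end());
        }
        q.erase(it);
    }

private:
    std::map<std::uint64_t, std::deque<PendingIsend>> pending_;
};
```

In a real tool, the OTF2 callbacks would forward their arguments to these methods, with the tracker passed in through userData. Note that the recorded message keeps the timestamp of the Isend, not of the completion.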

The receive records will be handled similarly, with some slight modifications due to differences in whether information is known during the request start or completion.

Receive

  • If ireceiveRequests for this process is empty
    • Record the received message
  • Otherwise
    • Append this Receive to the latest ireceiveRequests[receiver]'s blockedReceives

IreceiveRequest

  • Append the request to ireceiveRequests[receiver] with an empty queue

Ireceive

  • Search for a matching IreceiveRequest on this process
  • If it matches the first in the queue
    • Record the received message (time is Ireceive's time)
    • For each blocked Receive in the attached queue
      • Record the received message
  • Otherwise
    • Append this completed Receive to the previous IreceiveRequest's blocked queue
    • Append the queue of the completed Receive to the previous request's queue as well
  • Remove this entry from ireceiveRequests[receiver]

RequestCancelled

  • If it matches an Isend's requestId
    • Handle the same as IsendComplete, but without recording this send
  • If it matches an IreceiveRequest's requestId
    • Handle the same as Ireceive, but without recording this receive
  • It should never be both, but can be neither
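
The receive side, including cancellation, could look roughly like the following C++ sketch. Again, every name here (`ReceiveTracker`, `PendingIrecv`, `onRecv`, `onIrecvRequest`, `onIrecv`, `onRequestCancelled`, `finish`) is hypothetical. The idea I am assuming is that completion and cancellation share one helper, and that a cancellation is simply a completion without a message of its own. The send side would handle cancellation analogously.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <iterator>
#include <map>
#include <vector>

struct ReceivedMessage {
    std::uint64_t time; std::uint64_t receiver; std::uint32_t sender;
    std::uint32_t com; std::uint64_t length; std::uint32_t tag;
};

struct PendingIrecv {
    std::uint64_t requestId;
    std::deque<ReceivedMessage> blockedReceives;
};

class ReceiveTracker {
public:
    std::vector<ReceivedMessage> recorded;  // final list of receives

    void onRecv(const ReceivedMessage& m) {
        std::deque<PendingIrecv>& q = pending_[m.receiver];
        if (q.empty()) recorded.push_back(m);
        else q.back().blockedReceives.push_back(m);
    }

    void onIrecvRequest(std::uint64_t receiver, std::uint64_t requestId) {
        pending_[receiver].push_back({requestId, {}});
    }

    // Ireceive: the message parameters only become known here.
    void onIrecv(const ReceivedMessage& m, std::uint64_t requestId) {
        finish(m.receiver, requestId, &m);
    }

    // RequestCancelled: like a completion, but without a message of its own.
    void onRequestCancelled(std::uint64_t receiver, std::uint64_t requestId) {
        finish(receiver, requestId, nullptr);
    }

private:
    void finish(std::uint64_t receiver, std::uint64_t requestId,
                const ReceivedMessage* done) {
        std::deque<PendingIrecv>& q = pending_[receiver];
        auto it = std::find_if(q.begin(), q.end(),
            [requestId](const PendingIrecv& p) { return p.requestId == requestId; });
        if (it == q.end()) return;  // an Isend or an unrecorded request
        // Everything that this request was holding back, in order.
        std::deque<ReceivedMessage> released;
        if (done) released.push_back(*done);
        released.insert(released.end(),
                        it->blockedReceives.begin(), it->blockedReceives.end());
        if (it == q.begin()) {
            recorded.insert(recorded.end(), released.begin(), released.end());
        } else {
            PendingIrecv& earlier = *std::prev(it);
            earlier.blockedReceives.insert(earlier.blockedReceives.end(),
                                           released.begin(), released.end());
        }
        q.erase(it);
    }

    std::map<std::uint64_t, std::deque<PendingIrecv>> pending_;
};
```

Because the helper returns early for unknown request ids, a RequestCancelled record can be passed to both the send and the receive tracker, and at most one of them will react.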

Wrap Up

A sent message has the timestamp of the initial Isend. A receive has the timestamp of the completed Ireceive. Thus, we take the timestamp of the earliest possible sending moment and the latest possible receiving moment, because the message is actually transmitted sometime during this interval. This is as accurate as it gets in OTF2. Because of this, IreceiveRequest's and IsendComplete's timestamps are ignored. This also means that the order in which you process the records is not necessarily chronological for receives, but it is for sends.

After applying the above algorithm you have a list of sent and received MPI messages. You don't yet know which send belongs to which receive. Determining this relationship is called message matching. I intend to explain how it works in a future blog post.

Once this is done, you have one list of messages with correct durations that can finally be used in your tool.

Happy Coding!


P.S.: I used OTF2 version 1.5.1
