International Business Machines Corporation
Inter-processor communications fault handling in high performance computing networks

Abstract:

A computer-implemented method and system for inter-processor communications fault handling in high performance computing networks. The method includes detecting that an InfiniBand (IB) queue pair has transitioned into an error state based on an unsuccessful completion status that relates to unsuccessful delivery of a message from an initiator endpoint at a first server device to at least one target endpoint at a second server device. The initiator and target endpoints are associated with at least one application under execution. An embodiment includes inferring, when the unsuccessful completion status is indicated as flushed, that the message was in a send queue of the IB queue pair when the IB queue pair transitioned into the error state. An embodiment includes establishing an IB Direct Connect queue pair connection between the target and initiator endpoints. An embodiment includes re-queueing the message in the IB queue pair for dispatch to the target endpoint.
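The recovery flow summarized in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patented implementation and not a real InfiniBand API: `WcStatus`, `Completion`, `QueuePair`, `recover`, and the `reconnect` callback are hypothetical names introduced here, and the queue pair is modeled as an in-memory queue. The `FLUSHED` status plays the role that a flush-error work completion (for example `IBV_WC_WR_FLUSH_ERR` in libibverbs) plays on real hardware. How the message whose completion carried the original, non-flushed error is handled is not stated in the abstract, so the sketch leaves it to the caller.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto


class WcStatus(Enum):
    """Hypothetical completion statuses (not a real verbs enum)."""
    SUCCESS = auto()
    FLUSHED = auto()      # analogous to a flush-error work completion
    OTHER_ERROR = auto()  # any other unsuccessful completion status


@dataclass
class Completion:
    msg_id: int
    status: WcStatus


@dataclass
class QueuePair:
    """Toy stand-in for an IB queue pair: a send queue plus an error flag."""
    send_queue: deque = field(default_factory=deque)
    in_error: bool = False

    def post_send(self, msg_id: int) -> None:
        self.send_queue.append(msg_id)


def recover(qp, completions, reconnect):
    """Detect an errored queue pair, reconnect, and re-queue flushed messages.

    Returns (queue_pair_to_use, re_queued_message_ids).
    """
    flushed = []
    for wc in completions:
        if wc.status is WcStatus.SUCCESS:
            continue
        # Any unsuccessful completion implies the queue pair is in error.
        qp.in_error = True
        if wc.status is WcStatus.FLUSHED:
            # Inference from the abstract: a flushed message was still in
            # the send queue when the queue pair entered the error state.
            flushed.append(wc.msg_id)

    if not qp.in_error:
        return qp, []

    new_qp = reconnect()          # assumption: caller supplies reconnection
    for msg_id in flushed:
        new_qp.post_send(msg_id)  # re-queue for dispatch to the target
    return new_qp, flushed


if __name__ == "__main__":
    completions = [
        Completion(1, WcStatus.SUCCESS),
        Completion(2, WcStatus.OTHER_ERROR),
        Completion(3, WcStatus.FLUSHED),
        Completion(4, WcStatus.FLUSHED),
    ]
    new_qp, requeued = recover(QueuePair(), completions, QueuePair)
    print(requeued, list(new_qp.send_queue))
```

In this sketch, passing the `QueuePair` class itself as `reconnect` simply builds a fresh queue pair; in a real system that step would correspond to establishing the new connection between the initiator and target endpoints.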

Status:

Grant

Type:

Utility

Filing date:

26 Nov 2019

Issue date:

31 May 2022