Fenix @develop
 
Process Recovery

Functions for managing process recovery in Fenix. More...

Macros

#define Fenix_Init(_role, _comm, _newcomm, _argc, _argv, _spare_ranks, _spawn, _info, _error)
 

Enumerations

enum  Fenix_Rank_role { FENIX_ROLE_INITIAL_RANK = 0 , FENIX_ROLE_RECOVERED_RANK = 1 , FENIX_ROLE_SURVIVOR_RANK = 2 }
 All possible roles returned by Fenix_Init. More...
 

Functions

void Fenix_Init (int *role, MPI_Comm comm, MPI_Comm *newcomm, int **argc, char ***argv, int spare_ranks, int spawn, MPI_Info info, int *error)
 Build a resilient communicator and set the restart point.
 
int Fenix_Initialized (int *flag)
 Sets flag to true if Fenix_Init has been called, else false.
 
int Fenix_Callback_register (void(*recover)(MPI_Comm, int, void *), void *callback_data)
 Register a callback to be invoked after failure process recovery.
 
int Fenix_Callback_pop ()
 Pop the most recently registered callback from the callback stack.
 
int Fenix_Process_detect_failures (int do_recovery)
 Check for any failed ranks.
 
int Fenix_get_number_of_ranks_with_role (int, int *)
  UNIMPLEMENTED Returns the number of ranks with a given Fenix_Rank_role
 
int Fenix_get_role (MPI_Comm comm, int rank, int *role)
  UNIMPLEMENTED Returns the Fenix_Rank_role for a given rank
 
int Fenix_Process_fail_list (int **fail_list)
 Get the list of ranks that failed in the most recent failure.
 
int Fenix_check_cancelled (MPI_Request *request, MPI_Status *status)
 Check a pre-recovery request without error.
 
int Fenix_Finalize ()
 Clean up Fenix state. Each active rank must call Fenix_Finalize before exiting.
 

Overview

Functions for managing process recovery in Fenix.

Process recovery within Fenix can be broken down into three steps: detection, communicator recovery, and application recovery.


Detecting Failures

Fenix is built on top of ULFM MPI, so specific fault detection mechanisms and options can be found in the ULFM documentation. At a high level, this means that Fenix will detect failures when an MPI function call is made which involves a failed rank. Detection is not collectively consistent, meaning some ranks may fail to complete a collective while other ranks finish successfully. Once a failure is detected, Fenix will 'revoke' the communicator that the failed operation was using and the top-level communicator output by Fenix_Init (these communicators are usually the same). The revocation is permanent, and means that all future operations on the communicator by any rank will fail. This allows knowledge of the failed rank to be propagated to all ranks in the communicator, even if some ranks would never have directly communicated with the failed rank.

Since failures can only be detected during MPI function calls, applications with long periods of communication-free computation will experience delays in beginning recovery. Such applications may benefit from inserting periodic calls to Fenix_Process_detect_failures to allow ranks to participate in global recovery operations with less delay.
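
For illustration, a minimal sketch of this pattern is shown below. The loop structure and polling interval are hypothetical; only Fenix_Process_detect_failures is taken from the Fenix API, and the fenix.h header name is assumed.

    #include <fenix.h>
    #include <mpi.h>

    /* Hypothetical compute loop: poll for failures every few iterations so that
     * this rank can join global recovery promptly even during long
     * communication-free phases. */
    void compute_loop(int num_steps)
    {
        for (int step = 0; step < num_steps; step++) {
            /* ... local, communication-free computation ... */

            if (step % 10 == 0) {
                /* With do_recovery = 1, Fenix begins communicator recovery
                 * immediately if a failure has already been detected elsewhere. */
                Fenix_Process_detect_failures(1);
            }
        }
    }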

Fenix will only detect and respond to failures that occur on the communicator provided by Fenix_Init or any communicators derived from it. Faults on other communicators will, by default, abort the application. Note that having multiple derived communicators is not currently recommended, and may lead to deadlock. In fact, even one derived communicator may lead to deadlock if not used carefully. If you have a use case that requires multiple communicators, please contact us about your use case – we can provide guidance and may be able to update Fenix to support it.

Advanced: Applications may wish to handle some failures themselves - either ignoring them or implementing custom recovery logic in certain code regions. This is not generally recommended. Significant care must be taken to ensure that the application does not attempt to enter two incompatible recovery steps. However, if you wish to do this, you can include "fenix_ext.h" and manually set fenix.ignore_errs to a non-zero value. This will cause Fenix's error handler to simply return any errors it encounters as the exit code of the application's MPI function call. Alternatively, applications may temporarily replace the communicator's error handler to avoid Fenix recovery. If you have a use case that would benefit from this, you can contact us for guidance and/or to request some specific error handling features.
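
As a rough illustration of the first approach, a hedged sketch follows. The fenix global and its ignore_errs field are named exactly as in the paragraph above; the surrounding send logic and error handling are hypothetical.

    #include <fenix.h>
    #include "fenix_ext.h"   /* exposes the internal fenix state, as described above */
    #include <mpi.h>

    /* Not generally recommended: temporarily bypass Fenix recovery and handle an
     * MPI error inline. */
    int try_send(MPI_Comm comm, const int *buf, int count, int dest)
    {
        fenix.ignore_errs = 1;   /* errors are now returned from the MPI call */
        int rc = MPI_Send(buf, count, MPI_INT, dest, /* tag */ 0, comm);
        fenix.ignore_errs = 0;   /* restore normal Fenix error handling */

        if (rc != MPI_SUCCESS) {
            /* Custom handling, e.g. drop this neighbor or defer recovery. */
        }
        return rc;
    }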


Communicator Recovery

Once a failure has been detected, Fenix will begin the collective process of rebuilding the resilient communicator provided by Fenix_Init. There are two ways to rebuild: replacing failed ranks with spares, or shrinking the communicator to exclude the failed ranks. If any spares are available, Fenix will use them to replace the failed ranks, maintaining the original communicator size and guaranteeing that surviving processes keep the same rank ID. If there are not enough spares, some processes may receive a different rank ID on the new communicator, and Fenix will warn the user by setting the error code of Fenix_Init to FENIX_WARNING_SPARE_RANKS_DEPLETED.

Advanced: Communicator recovery is collective, blocking, and not interruptible. ULFM exposes some functions (e.g. MPIX_Comm_agree, MPIX_Comm_shrink) that are also not interruptible – meaning they will continue despite any failures or revocations. If multiple collective, non-interruptible operations are started by different ranks in different orders, the application will deadlock. This is similar to what would happen if a non-resilient application called multiple collectives (e.g. MPI_Allreduce) in different orders. However, the preemptive and inconsistent nature of failure recovery makes it more complex to reason about ordering between ranks. Fenix uses these ULFM functions internally, so care is taken to ensure that the order of operations is consistent across ranks. Before any such operation begins, Fenix first uses MPIX_Comm_agree on the resilient communicator provided by Fenix_Init to agree on which 'location' will proceed – if there is any disagreement, all ranks enter recovery as if they had detected a failure. Applications that wish to use these functions themselves should follow this pattern, providing a unique 'location' value for any operations that may be interrupted.
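
A hedged sketch of this agreement pattern is shown below. The one-bit-per-location encoding and the helper name are illustrative assumptions, not part of the Fenix API; MPIX_Comm_agree is the ULFM function mentioned above.

    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_agree */

    /* Agree on which non-interruptible operation ('location') all ranks are about
     * to enter.  Each location is given a distinct bit so that the bitwise-AND
     * semantics of MPIX_Comm_agree expose any disagreement as a changed value.
     * Returns 1 if every rank passed the same location bit, 0 otherwise. */
    int agree_on_location(MPI_Comm resilient_comm, int my_location_bit)
    {
        int flag = my_location_bit;
        int rc = MPIX_Comm_agree(resilient_comm, &flag);
        if (rc != MPI_SUCCESS || flag != my_location_bit) {
            /* Disagreement or failure: enter recovery instead of starting the
             * non-interruptible operation. */
            return 0;
        }
        return 1;
    }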


Application Recovery

Once a new communicator has been constructed, application recovery begins. There are two recovery modes: jumping (default) and non-jumping. With jumping recovery, Fenix will automatically longjmp to the Fenix_Init call site once communicator recovery is complete. This allows for very simple recovery logic, since it mimics the traditional teardown-restart pattern. However, longjmp has many undefined semantics according to the C and C++ specifications and may result in unexpected behavior due to compiler assumptions and optimizations. Additionally, some applications may be able to more efficiently recover by continuing inline. Users can initialize Fenix as non-jumping (see test/no_jump) to instead return an error code from the triggering MPI function call after communicator recovery. This may require more intrusive code changes (checking return statuses of each MPI call).

Fenix also allows applications to register one or more callback functions with Fenix_Callback_register and Fenix_Callback_pop, which removes the most recently registered callback. These callbacks are invoked after communicator recovery, just before control returns to the application. Callbacks are executed in the reverse order they were registered.

For C++ applications, it is recommended to use Fenix in non-jumping mode and to register a callback that throws an exception. At its simplest, wrapping everything between Fenix_Init and Fenix_Finalize in a single try-catch can give the same simple recovery logic as jumping mode, but without the undefined behavior of longjmp.

Fenix_Init outputs a role, from Fenix_Rank_role, which helps inform the application about the recovery state of the rank. It is important to note that all spare ranks are captured inside Fenix_Init until they are used for recovery. Therefore, after recovery, recovered ranks will not have the same callbacks registered – recovered ranks will need to manually invoke any callbacks that use MPI functions. These roles also help the application more generally modify its behavior based on each rank's recovery state.
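
For reference, a minimal sketch of a jumping-mode program structure that branches on the role is shown below. The spare_ranks count and the checkpoint-restore placeholder are hypothetical; the Fenix calls follow the signatures documented on this page, and the fenix.h header name is assumed.

    #include <fenix.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int role, error;
        MPI_Comm resilient_comm;

        /* In the default (jumping) mode, execution resumes here after recovery. */
        Fenix_Init(&role, MPI_COMM_WORLD, &resilient_comm, &argc, &argv,
                   /* spare_ranks */ 2, /* spawn */ 0, MPI_INFO_NULL, &error);

        if (role != FENIX_ROLE_INITIAL_RANK) {
            /* Survivor or recovered rank: restore application state here,
             * e.g. from an application-level checkpoint (placeholder). */
        }

        /* ... resilient application work on resilient_comm ... */

        Fenix_Finalize();
        MPI_Finalize();
        return 0;
    }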

Enumeration Type Documentation

◆ Fenix_Rank_role

All possible roles returned by Fenix_Init.

Describes the current process's state in reference to process recovery.

Note that FENIX_ROLE_RECOVERED_RANK is only guaranteed to be the role after a single failure, so users should not rely on the role alone to verify a valid state if they wish to be resilient to failures that occur during the recovery process itself.

Enumerator
FENIX_ROLE_INITIAL_RANK 

No failures have occurred yet.

FENIX_ROLE_RECOVERED_RANK 

This rank was a spare before the most recent failure, or was just spawned.

FENIX_ROLE_SURVIVOR_RANK 

This rank was not a spare before the most recent failure.

Function Documentation

◆ Fenix_Callback_pop()

int Fenix_Callback_pop ( )

Pop the most recently registered callback from the callback stack.

Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_Callback_register()

int Fenix_Callback_register ( void(* recover )(MPI_Comm, int, void *),
void * callback_data )

Register a callback to be invoked after failure process recovery.

This function registers a callback to be invoked after a failure has been recovered by Fenix, and right before resuming application execution (e.g. returning from Fenix_Init by default). If this function is called more than once, the different callbacks will be called in the reverse order that they were registered (i.e. as a callback stack).

Callback functions are passed the newly-repaired resilient communicator, the error code returned by MPI in the communication action which caused a failure recovery, and the user-provided void* callback data.

Callbacks will only be invoked by survivor ranks, since spare ranks or respawned ranks had no way to register them before a failure.

Parameters
[in] recover        The callback function to register.
[in] callback_data  The user-provided data which will be passed to the callback.
Returns
FENIX_SUCCESS if successful, any return code otherwise.
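
As an illustration, a hedged sketch of registering a callback follows. The callback body, the app_state type, and the setup helper are hypothetical; the registration call matches the signature above.

    #include <fenix.h>
    #include <mpi.h>
    #include <stdio.h>

    struct app_state { int step; };   /* hypothetical application data */

    /* Invoked on survivor ranks after communicator recovery, just before control
     * returns to the application. */
    static void on_recovery(MPI_Comm repaired_comm, int mpi_error, void *data)
    {
        struct app_state *state = (struct app_state *)data;
        int rank;
        MPI_Comm_rank(repaired_comm, &rank);
        fprintf(stderr, "rank %d: recovering at step %d (MPI error %d)\n",
                rank, state->step, mpi_error);
    }

    /* Call after Fenix_Init.  Callbacks registered later run first (stack order);
     * Fenix_Callback_pop removes the most recently registered one. */
    void setup_recovery(struct app_state *state)
    {
        Fenix_Callback_register(on_recovery, state);
    }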

◆ Fenix_check_cancelled()

int Fenix_check_cancelled ( MPI_Request * request,
MPI_Status * status )

Check a pre-recovery request without error.

Parameters
[in] request   The request to check
[out] status   The status of the request
Returns
True if the request was cancelled or has unknown completion status, false if it completed successfully.
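
A hedged usage sketch follows; the helper name and the notion of reposting are illustrative, while the call itself matches the signature above.

    #include <fenix.h>
    #include <mpi.h>

    /* After recovery, decide whether an outstanding pre-failure receive needs to
     * be reposted.  Returns nonzero if the request was cancelled by recovery or
     * its completion status is unknown, zero if it completed successfully. */
    int needs_repost(MPI_Request *pending_recv)
    {
        MPI_Status status;
        return Fenix_check_cancelled(pending_recv, &status);
    }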

◆ Fenix_Finalize()

int Fenix_Finalize ( )

Clean up Fenix state. Each active rank must call Fenix_Finalize before exiting.

This function cleans up all Fenix state, if any. If an MPI program using the Fenix library terminates normally (i.e. not due to a call to MPI_Abort or an unrecoverable error), then each rank must call Fenix_Finalize before it exits. It must be called after Fenix_Init and before MPI_Finalize. No Fenix functions may be called after this function, except Fenix_Initialized.

As noted in the description of Fenix_Init, all spare ranks that have not been used to recover from failures (and therefore are still reserved by Fenix and kept inside Fenix_Init) will call MPI_Finalize and exit when all active ranks have called Fenix_Finalize.

Advice: Sometimes users may want to remove ranks proactively from the execution, for example because monitoring data shows that failure of a rank is imminent or that a rank is executing un-manageably slowly. This can be accomplished by calling exit on the targeted ranks, followed by an invocation of MPI_Barrier. The removed ranks will be reported as failed and error handling will progress appropriately. No calls to finalize are needed in this case.
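
A hedged sketch of this proactive-removal pattern follows; the health-check argument is a hypothetical placeholder, and the exit/MPI_Barrier sequence is the one described above.

    #include <mpi.h>
    #include <stdlib.h>

    /* Proactively retire this rank if it is deemed unhealthy.  The remaining
     * ranks observe the exit at the barrier, the rank is reported as failed, and
     * Fenix error handling proceeds normally.  The exiting rank calls neither
     * Fenix_Finalize nor MPI_Finalize. */
    void maybe_retire_self(MPI_Comm resilient_comm, int i_am_unhealthy)
    {
        if (i_am_unhealthy) {
            exit(0);
        }
        MPI_Barrier(resilient_comm);   /* surviving ranks detect the failure here */
    }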

◆ Fenix_Init()

void Fenix_Init ( int * role,
MPI_Comm comm,
MPI_Comm * newcomm,
int ** argc,
char *** argv,
int spare_ranks,
int spawn,
MPI_Info info,
int * error )

Build a resilient communicator and set the restart point.

This function must be called by all ranks in comm, after MPI initialization. All calling ranks must pass the same values for the parameters comm, spare_ranks, spawn, and info. Fenix_Init must be called exactly once by each rank. This function is used (1) to activate the Fenix library, (2) to specify extra resources in case of rank failure, and (3) to create a logical resumption point in case of rank failure.

For C, the program may rely on the state of any variables defined and set before the call to Fenix_Init. Note, however, that the code before Fenix_Init is executed by all ranks in the system (including spare ranks, see below). For C++, the state of objects declared before Fenix_Init but within the same scope as Fenix_Init is compiler-dependent, and it is recommended to place Fenix_Init within a subscope that excludes any variables which are not expected to be destructed.

It is recommended to access argc and argv only after Fenix_Init has executed, since command line arguments passed to this function that apply to Fenix may be removed by Fenix_Init.

Fenix_Init is blocking in the following sense. If it is entered for the first time via a regular, explicit function call, it must be entered by all ranks in communicator comm. If it is entered after an error intercepted by Fenix (i.e. if it is the default execution resumption point, see the info parameter below), no ranks are allowed to exit from it until all non-failed ranks have returned control to it. Note: typically control is returned automatically through revocation of the resilient communicator, which means that ranks with long delays between MPI function calls, or ranks that only use communicators unaffected by the failure, may cause long delays between a failure and its recovery.

Ranks to be used as spare ranks by Fenix will be available to the application only before Fenix_Init, or after they are used to replace a failed rank, in which case they turn into active ranks. This document refers to the latter as RECOVERED ranks (see Fenix_Rank_role). Note that all spare ranks that have not been used to recover from failures (and, therefore, are still reserved by Fenix and kept inside Fenix_Init) will automatically call MPI_Finalize and exit when all active ranks have entered the Fenix_Finalize call.

No Fenix functions may be called before Fenix_Init, except Fenix_Initialized.

Parameters
[out] role           The current role of this rank (see Fenix_Rank_role)
[in] comm            The base communicator to construct a resilient communicator from, which must include any spare ranks (see below) the user deems necessary. MPI_COMM_WORLD is a valid value, but MPI_COMM_SELF is not.
[out] newcomm        Resilient output communicator, managed by Fenix and derived from comm, to be used by the application instead of comm.
[in,out] argc        Pointer to the application main's argc parameter
[in,out] argv        Pointer to the application main's argv parameter
[in] spare_ranks     The number of ranks in comm that are exempted by Fenix from the construction of the resilient communicator by Fenix_Init. These ranks are kept in reserve to substitute for failed ranks. Failed ranks in resilient communicators are replaced by spare or spawned ranks.
[in] spawn           Unimplemented: whether to enable spawning new ranks to replace failed ranks when spares are unavailable.
[in] info            Fenix recovery configuration parameters; may be MPI_INFO_NULL. Supports the "FENIX_RESUME_MODE" key, used to indicate where execution should resume upon rank failure for all active (non-spare) ranks in any resilient communicators, not only for those ranks in communicators that failed. The following values for this key are supported:
  • "Fenix_init" (default): execution resumes at the logical exit of Fenix_Init.
  • "NO_JUMP": execution continues from the failing MPI call. Errors are otherwise handled as normal, but the error code is also returned. Applications should typically either check return codes or register an error callback through Fenix.
[out] error          The return status of Fenix_Init.
Used to signal that a non-fatal error or special condition was encountered in the execution of Fenix_Init; set to FENIX_SUCCESS otherwise. It has the same value across all ranks released by Fenix_Init. If spawning is explicitly disabled (spawn equals false) and spare ranks have been depleted, Fenix will repair resilient communicators by shrinking them and will report such shrinkage in this return parameter through the value FENIX_WARNING_SPARE_RANKS_DEPLETED.
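
For example, a hedged sketch of selecting non-jumping recovery through the info argument is shown below. Only the "FENIX_RESUME_MODE" key and "NO_JUMP" value come from the list above; the fragment is assumed to sit inside main, after MPI_Init, with argc and argv in scope, and the spare_ranks count is illustrative.

    /* Request non-jumping recovery: failing MPI calls return their error code
     * instead of execution jumping back to Fenix_Init. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "FENIX_RESUME_MODE", "NO_JUMP");

    int role, error;
    MPI_Comm resilient_comm;
    Fenix_Init(&role, MPI_COMM_WORLD, &resilient_comm, &argc, &argv,
               /* spare_ranks */ 2, /* spawn */ 0, info, &error);
    MPI_Info_free(&info);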

◆ Fenix_Initialized()

int Fenix_Initialized ( int * flag)

Sets flag to true if Fenix_Init has been called, else false.

Parameters
[out] flag   Pointer to the flag to be set.
Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_Process_detect_failures()

int Fenix_Process_detect_failures ( int do_recovery)

Check for any failed ranks.

Parameters
[in] do_recovery   If true, Fenix will attempt to recover from any detected failures. Else, it will ignore any failures and simply return the MPI return code.
Returns
MPI_SUCCESS if no failures were detected, else the MPI return code.

◆ Fenix_Process_fail_list()

int Fenix_Process_fail_list ( int ** fail_list)

Get the list of ranks that failed in the most recent failure.

Parameters
[out] fail_list   Set to a list of failed ranks.
Returns
The number of failed ranks.
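
A hedged usage sketch follows; the reporting loop is illustrative, while the call itself matches the signature above.

    #include <fenix.h>
    #include <stdio.h>

    /* Report which ranks were lost in the most recent failure. */
    void report_failures(void)
    {
        int *fail_list = NULL;
        int num_failed = Fenix_Process_fail_list(&fail_list);
        for (int i = 0; i < num_failed; i++) {
            printf("rank %d failed\n", fail_list[i]);
        }
    }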