Fenix @develop
 
Process Recovery

Functions for managing process recovery in Fenix.

Macros

#define Fenix_Init(_role, _comm, _newcomm, _argc, _argv, _spare_ranks, _err)
 

Enumerations

enum  Fenix_Rank_role { FENIX_ROLE_INITIAL_RANK = 0 , FENIX_ROLE_RECOVERED_RANK = 1 , FENIX_ROLE_SURVIVOR_RANK = 2 }
 All possible roles returned by Fenix_Init.
 
enum  Fenix_Setting_name {
  FENIX_RECOVERY_MODE , FENIX_RESUME_MODE , FENIX_UNHANDLED_MODE , FENIX_CALLBACK_EXCEPTION_MODE ,
  FENIX_MLOG_RECOVERY_MODE , FENIX_SPARE_WAIT_MODE , FENIX_SETTING_NAME_MAXCODE
}
 Global Fenix settings.
 
enum  Fenix_Recovery_mode {
  FENIX_RECOVERY_IGNORE , FENIX_RECOVERY_NOOP , FENIX_RECOVERY_REPAIR , FENIX_RECOVERY_SPAWN ,
  FENIX_RECOVERY_MODE_MAXCODE
}
 Options for recovering after a failed rank is detected.
 
enum  Fenix_Mlog_recovery_mode { FENIX_MLOG_RECOVERY_MANUAL , FENIX_MLOG_RECOVERY_INLINE , FENIX_MLOG_RECOVERY_INLINE_AUTOSYNC , FENIX_MLOG_RECOVERY_MODE_MAXCODE }
 
enum  Fenix_Resume_mode { FENIX_RESUME_JUMP , FENIX_RESUME_RETURN , FENIX_RESUME_THROW , FENIX_RESUME_MODE_MAXCODE }
 Options for passing control back to application after recovery.
 
enum  Fenix_Unhandled_mode { FENIX_UNHANDLED_SILENT , FENIX_UNHANDLED_PRINT , FENIX_UNHANDLED_ABORT , FENIX_UNHANDLED_MODE_MAXCODE }
 Options for dealing with 'unhandled' errors, e.g. invalid rank IDs.
 
enum  Fenix_Spare_wait_mode { FENIX_SPARE_WAIT_BUSY , FENIX_SPARE_WAIT_YIELD , FENIX_SPARE_WAIT_SLEEP , FENIX_SPARE_WAIT_MODE_MAXCODE }
 Options for how spare ranks wait to be needed. Must be set before Fenix_Init to take effect.
 
enum  Fenix_Callback_exception_mode { FENIX_CALLBACK_EXCEPTION_RETHROW , FENIX_CALLBACK_EXCEPTION_SQUASH , FENIX_CALLBACK_EXCEPTION_MODE_MAXCODE }
 Options for dealing with CommExceptions generated in callbacks.
 

Functions

void Fenix_Init (int *role, MPI_Comm comm, MPI_Comm *newcomm, int **argc, char ***argv, int spare_ranks, int *error)
 Build a resilient communicator and set the restart point.
 
int Fenix_Initialized (int *flag)
 Sets flag to true if Fenix_Init has been called, else false.
 
int Fenix_Finalized (int *flag)
 Sets flag to true if Fenix_Finalize has been called, else false.
 
int Fenix_set_option (Fenix_Setting_name setting, unsigned option)
 Configure a global Fenix setting.
 
int Fenix_get_option (Fenix_Setting_name setting, unsigned *option)
 Get the current option for a Fenix setting.
 
int Fenix_Callback_register (void(*recover)(MPI_Comm, int, void *), void *callback_data)
 Register a callback to be invoked after recovery from a process failure.
 
int Fenix_Callback_pop ()
 Pop the most recently registered callback from the callback stack.
 
int Fenix_Callback_invoke_all ()
 Invoke all callbacks with information from the last recovered fault.
 
int Fenix_Process_detect_failures (int do_recovery)
 Check for any failed ranks.
 
int Fenix_get_number_of_ranks_with_role (int, int *)
  UNIMPLEMENTED Returns the number of ranks with a given Fenix_Rank_role
 
int Fenix_get_rank_role (MPI_Comm comm, int rank, int *role)
  UNIMPLEMENTED Returns the Fenix_Rank_role for a given rank
 
Fenix_Rank_role Fenix_get_role ()
 Returns this rank's Fenix_Rank_role.
 
int Fenix_get_error ()
 Returns the error value from Fenix_Init or the latest recovery.
 
int Fenix_get_nspare ()
 Returns the number of spare ranks currently available to Fenix.
 
int Fenix_Process_fail_list (int **fail_list)
 Get the list of ranks that failed in the most recent failure.
 
int Fenix_check_cancelled (MPI_Request *request, MPI_Status *status)
 Check a pre-recovery request without error.
 
int Fenix_Finalize ()
 Clean up Fenix state. Each active rank must call Fenix_Finalize before exiting.
 

Overview

Functions for managing process recovery in Fenix.

Process recovery within Fenix can be broken down into three steps: detection, communicator recovery, and application recovery.


Detecting Failures

Fenix is built on top of ULFM MPI, so specific fault detection mechanisms and options can be found in the ULFM documentation. At a high level, this means that Fenix will detect failures when an MPI function call is made which involves a failed rank. Detection is not collectively consistent, meaning some ranks may fail to complete a collective while other ranks finish successfully. Once a failure is detected, Fenix will 'revoke' the communicator that the failed operation was using and the top-level communicator output by Fenix_Init (these communicators are usually the same). The revocation is permanent, and means that all future operations on the communicator by any rank will fail. This allows knowledge of the failed rank to be propagated to all ranks in the communicator, even if some ranks would never have directly communicated with the failed rank.

Since failures can only be detected during MPI function calls, applications with long periods of communication-free computation will experience delays in beginning recovery. Such applications may benefit from inserting periodic calls to Fenix_Process_detect_failures to allow ranks to participate in global recovery operations with less delay.
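The polling idea above can be sketched in plain C. This is a minimal illustration, not Fenix code: compute_with_polling and fake_detect are hypothetical names, and the detector is passed as a function pointer with the same shape as Fenix_Process_detect_failures(int do_recovery) so that the sketch builds without Fenix or MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as Fenix_Process_detect_failures(int do_recovery). */
typedef int (*detect_fn)(int do_recovery);

/* Hypothetical stand-in detector so the sketch builds without Fenix.
 * It only counts how often it was polled; its return value is arbitrary. */
static int fake_polls = 0;
static int fake_detect(int do_recovery)
{
    (void)do_recovery;
    ++fake_polls;
    return 0;
}

/* Run `steps` units of communication-free work and poll for failures
 * after every `interval` steps. Returns the number of polls performed. */
static int compute_with_polling(int steps, int interval, detect_fn detect)
{
    assert(detect != NULL);
    int polls = 0;
    for (int step = 0; step < steps; ++step) {
        /* ... one unit of communication-free work goes here ... */
        if (interval > 0 && (step + 1) % interval == 0) {
            (void)detect(1); /* non-zero: allow recovery to start here */
            ++polls;
        }
    }
    return polls;
}
```

In an application, the detector argument would presumably be Fenix_Process_detect_failures itself. The interval is a tuning choice: a smaller value shortens the delay before recovery starts, at the cost of more frequent checks.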

Fenix will only detect and respond to failures that occur on the communicator provided by Fenix_Init or any communicators derived from it. Faults on other communicators will, by default, abort the application. Note that having multiple derived communicators is not currently recommended, and may lead to deadlock. In fact, even one derived communicator may lead to deadlock if not used carefully. If you have a use case that requires multiple communicators, please contact us about your use case – we can provide guidance and may be able to update Fenix to support it.

Advanced: Applications may wish to handle some failures themselves, either ignoring them or implementing custom recovery logic in certain code regions. This is not generally recommended. Significant care must be taken to ensure that the application does not attempt to enter two incompatible recovery steps. However, if you wish to do this, you can include "fenix_ext.h" and manually set fenix.ignore_errs to a non-zero value. This will cause Fenix's error handler to simply return any errors it encounters as the return value of the application's MPI function call. Alternatively, applications may temporarily replace the communicator's error handler to avoid Fenix recovery. If you have a use case that would benefit from this, you can contact us for guidance and/or to request some specific error handling features.


Communicator Recovery

Once a failure has been detected, Fenix will begin the collective process of rebuilding the resilient communicator provided by Fenix_Init. There are two ways to rebuild: replacing failed ranks with spares, or shrinking the communicator to exclude the failed ranks. If enough spares are available, Fenix will use them to replace the failed ranks, which maintains the original communicator size and guarantees that surviving processes keep the same rank ID. If there are not enough spares, Fenix shrinks the communicator, so some processes may have a different rank ID on the new communicator. Fenix warns the user about this by setting the error code for Fenix_Init to FENIX_WARNING_SPARE_RANKS_DEPLETED.
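The arithmetic behind the two rebuild strategies can be worked through with a small model. This is a sketch under stated assumptions, not Fenix's implementation: comm_layout and layout_after_failures are hypothetical names, and the model assumes that only active (non-spare) ranks fail.

```c
#include <assert.h>

/* Sizes of the resilient communicator and the remaining spare pool. */
typedef struct {
    int active; /* ranks in the resilient communicator */
    int spares; /* spare ranks still held in reserve   */
} comm_layout;

/* Model the layout after `failed_active` active ranks have failed.
 * Assumption: spare ranks themselves do not fail. */
static comm_layout layout_after_failures(int world_size, int spare_ranks,
                                         int failed_active)
{
    assert(world_size > 0 && spare_ranks >= 0 && spare_ranks < world_size);
    int active = world_size - spare_ranks;
    assert(failed_active >= 0 && failed_active <= active);

    comm_layout out;
    if (failed_active <= spare_ranks) {
        /* Enough spares: every failed rank is replaced, size is kept. */
        out.active = active;
        out.spares = spare_ranks - failed_active;
    } else {
        /* Spares depleted: the communicator shrinks by the shortfall. */
        out.active = active - (failed_active - spare_ranks);
        out.spares = 0;
    }
    return out;
}
```

For example, with 16 ranks and 2 spares, one failure leaves 14 active ranks and 1 spare, while three failures leave 13 active ranks and no spares, which is the situation that FENIX_WARNING_SPARE_RANKS_DEPLETED reports.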

Advanced: Communicator recovery is collective, blocking, and not interruptible. ULFM exposes some functions (e.g. MPIX_Comm_agree, MPIX_Comm_shrink) that are also not interruptible, meaning they will continue despite any failures or revocations. If multiple collective, non-interruptible operations are started by different ranks in different orders, the application will deadlock. This is similar to what would happen if a non-resilient application called multiple collectives (e.g. MPI_Allreduce) in different orders. However, the preemptive and inconsistent nature of failure recovery makes it more complex to reason about ordering between ranks. Fenix uses these ULFM functions internally, so care is taken to ensure that the order of operations is consistent across ranks. Before any such operation begins, Fenix first uses MPIX_Comm_agree on the resilient communicator provided by Fenix_Init to agree on which 'location' will proceed. If there is any disagreement, all ranks will enter recovery as if they had detected a failure. Applications which wish to use these functions themselves should follow this pattern, providing a unique 'location' value for any operations that may be interrupted.
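One way to detect disagreement on a 'location' value is sketched below. It relies on MPIX_Comm_agree combining the contributed flags with a bitwise AND, and simulates that combination over an array with one entry per rank so that it runs without MPI. The encoding (agreeing on both the value and its complement) is an assumption chosen for illustration and is not necessarily how Fenix encodes locations; simulated_agree and may_proceed are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_RANKS = 64 };

/* Stand-in for the combined result of MPIX_Comm_agree:
 * the bitwise AND of the flags contributed by all ranks. */
static unsigned simulated_agree(const unsigned *flags, size_t n)
{
    assert(n > 0);
    unsigned result = ~0u;
    for (size_t i = 0; i < n; ++i) {
        result &= flags[i];
    }
    return result;
}

/* Returns 1 if rank `me` may proceed with its operation, or 0 if it
 * should enter recovery because some rank is at a different location. */
static int may_proceed(const unsigned *locations, size_t n, size_t me)
{
    assert(n > 0 && n <= MAX_RANKS && me < n);
    unsigned inverted[MAX_RANKS];
    for (size_t i = 0; i < n; ++i) {
        inverted[i] = ~locations[i];
    }
    /* First agreement: bits that are set on every rank.
     * Second agreement: bits that are clear on every rank. */
    unsigned all_set = simulated_agree(locations, n);
    unsigned all_clear = simulated_agree(inverted, n);
    return all_set == locations[me] && all_clear == ~locations[me];
}
```

Because every rank sees the same two agreed values, either all ranks find a match (everyone is at the same location) or no rank does, so the decision to proceed or to enter recovery is the same on all ranks.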


Application Recovery

Once a new communicator has been constructed, application recovery begins. There are two recovery modes: jumping (default) and non-jumping. With jumping recovery, Fenix will automatically longjmp to the Fenix_Init call site once communicator recovery is complete. This allows for very simple recovery logic, since it mimics the traditional teardown-restart pattern. However, longjmp has many undefined semantics according to the C and C++ specifications and may result in unexpected behavior due to compiler assumptions and optimizations. Additionally, some applications may be able to more efficiently recover by continuing inline. Users can initialize Fenix as non-jumping (see test/no_jump) to instead return an error code from the triggering MPI function call after communicator recovery. This may require more intrusive code changes (checking return statuses of each MPI call).

Fenix also allows applications to register one or more callback functions with Fenix_Callback_register. Fenix_Callback_pop removes the most recently registered callback. Registered callbacks are invoked after communicator recovery, just before control returns to the application, in the reverse order of their registration.
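The stack-like ordering of callbacks can be modeled in a few lines of C. This is an illustrative model rather than Fenix's implementation: comm_t stands in for MPI_Comm so the sketch builds without MPI, and callback_register, callback_pop, callback_invoke_all, and record_tag are hypothetical names that only mirror the behavior described above.

```c
#include <assert.h>
#include <stddef.h>

typedef int comm_t; /* stand-in for MPI_Comm */
typedef void (*recover_fn)(comm_t comm, int error, void *data);

enum { MAX_CALLBACKS = 16 };

typedef struct {
    recover_fn fn;
    void *data;
} callback_entry;

static callback_entry callback_stack[MAX_CALLBACKS];
static size_t callback_depth = 0;

/* Push a callback; returns 0 on success, -1 if the stack is full. */
static int callback_register(recover_fn fn, void *data)
{
    assert(fn != NULL);
    if (callback_depth == MAX_CALLBACKS) return -1;
    callback_stack[callback_depth].fn = fn;
    callback_stack[callback_depth].data = data;
    ++callback_depth;
    return 0;
}

/* Drop the most recently registered callback; -1 if none is registered. */
static int callback_pop(void)
{
    if (callback_depth == 0) return -1;
    --callback_depth;
    return 0;
}

/* Invoke callbacks from the most recent registration to the oldest. */
static void callback_invoke_all(comm_t comm, int error)
{
    for (size_t i = callback_depth; i > 0; --i) {
        callback_stack[i - 1].fn(comm, error, callback_stack[i - 1].data);
    }
}

/* Example callback: appends the character in `data` to a trace string. */
static char trace[MAX_CALLBACKS + 1];
static size_t trace_len = 0;
static void record_tag(comm_t comm, int error, void *data)
{
    (void)comm;
    (void)error;
    if (trace_len < MAX_CALLBACKS) {
        trace[trace_len++] = *(const char *)data;
        trace[trace_len] = '\0';
    }
}
```

Registering callbacks tagged 'A', 'B', and 'C' in that order and then invoking them leaves "CBA" in the trace. After one pop, a second invocation runs only 'B' and then 'A'.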

For C++ applications, it is recommended to use Fenix in non-jumping mode and to register a callback that throws an exception. At its simplest, wrapping everything between Fenix_Init and Fenix_Finalize in a single try-catch can give the same simple recovery logic as jumping mode, but without the undefined behavior of longjmp.

Fenix_Init outputs a role, from Fenix_Rank_role, which helps inform the application about the recovery state of the rank. It is important to note that all spare ranks are captured inside Fenix_Init until they are used for recovery. Therefore, after recovery, recovered ranks will not have the same callbacks registered; they will need to manually invoke any callbacks that use MPI functions. These roles also help the application more generally modify its behavior based on each rank's recovery state.
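A role-based dispatch after Fenix_Init might look like the following sketch. The enum mirrors the values listed in the Enumerations section so that the block builds without fenix.h; real code would take it from the Fenix header instead. The startup_plan type and plan_for_role are hypothetical, and what an application actually does for each role is its own design decision.

```c
#include <assert.h>

/* Mirrors the documented values; real code gets this from the Fenix header. */
typedef enum {
    FENIX_ROLE_INITIAL_RANK = 0,
    FENIX_ROLE_RECOVERED_RANK = 1,
    FENIX_ROLE_SURVIVOR_RANK = 2
} Fenix_Rank_role;

/* Hypothetical summary of what a rank should do after Fenix_Init returns. */
typedef struct {
    int fresh_start;               /* no failure yet: normal setup        */
    int invoke_callbacks_manually; /* rank had no callbacks registered    */
} startup_plan;

static startup_plan plan_for_role(int role)
{
    startup_plan plan = {0, 0};
    switch (role) {
    case FENIX_ROLE_INITIAL_RANK:
        plan.fresh_start = 1;
        break;
    case FENIX_ROLE_RECOVERED_RANK:
        /* Former spare: it was held inside Fenix_Init, so none of the
         * application's callbacks were registered on it before the failure. */
        plan.invoke_callbacks_manually = 1;
        break;
    case FENIX_ROLE_SURVIVOR_RANK:
        /* Registered callbacks were already invoked by Fenix. */
        break;
    default:
        assert(0 && "unknown Fenix_Rank_role");
        break;
    }
    return plan;
}
```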

Enumeration Type Documentation

◆ Fenix_Callback_exception_mode

Options for dealing with CommExceptions generated in callbacks.

Enumerator
FENIX_CALLBACK_EXCEPTION_RETHROW 

CommExceptions are allowed to propagate out of callbacks.

FENIX_CALLBACK_EXCEPTION_SQUASH 

CommExceptions from callbacks are squashed.

FENIX_CALLBACK_EXCEPTION_MODE_MAXCODE 

Not a valid option.

◆ Fenix_Mlog_recovery_mode

Enumerator
FENIX_MLOG_RECOVERY_MANUAL 

All message logging recovery is manual.

FENIX_MLOG_RECOVERY_INLINE 

Automatically repeats failed, logged MPI operations without disrupting normal application control flow.

User is responsible for handling any recovery steps in Fenix callbacks.

FENIX_MLOG_RECOVERY_INLINE_AUTOSYNC 

As INLINE, but automatically sync logs with FENIX_MLOG_CONTINUE.

Invoked after post-recovery callbacks, immediately before resuming. Invoked regardless of the logged-or-not status of the failing message.

FENIX_MLOG_RECOVERY_MODE_MAXCODE 

Not a valid option.

◆ Fenix_Rank_role

All possible roles returned by Fenix_Init.

Describes the current process's state in reference to process recovery.

It is important to note that FENIX_ROLE_RECOVERED_RANK is only guaranteed to be the value after a single failure, so users ought not use the role to directly ensure a valid state if they desire to be resilient to failures during their failure recovery process.

Enumerator
FENIX_ROLE_INITIAL_RANK 

No failures have occurred yet.

FENIX_ROLE_RECOVERED_RANK 

This rank was a spare before the most recent failure, or was just spawned.

FENIX_ROLE_SURVIVOR_RANK 

This rank was not a spare before the most recent failure.

◆ Fenix_Recovery_mode

Options for recovering after a failed rank is detected.

Enumerator
FENIX_RECOVERY_IGNORE 

Do not repair communicator, immediately resume per FENIX_RESUME_MODE.

FENIX_RECOVERY_NOOP 

Do not repair communicator, otherwise behave normally.

This includes calling the PRE_RECOVERY and POST_RECOVERY callbacks.

FENIX_RECOVERY_REPAIR 

Repair the communicator with spares or by shrinking.

FENIX_RECOVERY_SPAWN 

UNIMPLEMENTED As REPAIR, but attempt to respawn failed processes

FENIX_RECOVERY_MODE_MAXCODE 

Not a valid option.

◆ Fenix_Resume_mode

Options for passing control back to application after recovery.

Enumerator
FENIX_RESUME_JUMP 

Return to Fenix_Init via longjmp (default)

The values of variables set before the longjmp are subject to undefined behavior from compiler optimizations. To ensure expected behavior, it is recommended that any variables that will be used across the longjmp are declared as volatile, are heap allocated, or are global in scope.

For C++ applications, whether stack variables are automatically destructed when leaving stack frames via longjmp is undefined. For this reason and the above, it is highly recommended to instead use FENIX_RESUME_THROW for C++ applications.
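The volatile advice for FENIX_RESUME_JUMP can be illustrated with plain setjmp/longjmp, without Fenix or MPI. In this minimal sketch, restart_point stands in for the resume point that Fenix_Init sets, and simulate_failure stands in for a failed MPI call that triggers the jump; both names, along with restart_result and run_with_restart, are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf restart_point; /* stand-in for the Fenix_Init resume point */

/* Stand-in for a failed MPI call: jump back to the restart point. */
static void simulate_failure(void)
{
    longjmp(restart_point, 1);
}

typedef struct {
    int attempts;   /* how many times the work section was entered */
    int recoveries; /* how many times we arrived via longjmp       */
} restart_result;

/* Run a work section that fails `failures` times before succeeding. */
static restart_result run_with_restart(int failures)
{
    assert(failures >= 0);
    /* volatile: both are modified between setjmp and longjmp, so without
     * it their values after the jump would be indeterminate. */
    volatile int attempts = 0;
    volatile int recoveries = 0;

    if (setjmp(restart_point) != 0) {
        recoveries = recoveries + 1; /* reached through longjmp */
    }

    attempts = attempts + 1;
    if (attempts <= failures) {
        simulate_failure(); /* does not return */
    }

    restart_result result = {attempts, recoveries};
    return result;
}
```

With two simulated failures, the work section is entered three times and the restart point is reached through longjmp twice. If the counters were not volatile, the compiler would be free to keep them in registers, and their values after the jump would not be reliable.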

FENIX_RESUME_RETURN 

Return the error code inline.

FENIX_RESUME_THROW 

Throw a fenix::CommException.

FENIX_RESUME_MODE_MAXCODE 

Not a valid option.

◆ Fenix_Setting_name

Global Fenix settings.

Enumerator
FENIX_RECOVERY_MODE 

See Fenix_Recovery_mode.

FENIX_RESUME_MODE 

See Fenix_Resume_mode.

FENIX_UNHANDLED_MODE 

See Fenix_Unhandled_mode.

FENIX_CALLBACK_EXCEPTION_MODE 

See Fenix_Callback_exception_mode.

FENIX_MLOG_RECOVERY_MODE 

See Fenix_Mlog_recovery_mode.

FENIX_SPARE_WAIT_MODE 

See Fenix_Spare_wait_mode.

FENIX_SETTING_NAME_MAXCODE 

Not a valid option.

◆ Fenix_Spare_wait_mode

Options for how spare ranks wait to be needed. Must be set before Fenix_Init to take effect.

Enumerator
FENIX_SPARE_WAIT_BUSY 

Busy wait, consuming CPU time in exchange for faster response.

FENIX_SPARE_WAIT_YIELD 

Tell MPI to yield this thread while waiting (if supported, else busy wait)

FENIX_SPARE_WAIT_SLEEP 

Sleep 100ms between checks to see if this thread is needed for recovery.

FENIX_SPARE_WAIT_MODE_MAXCODE 

Not a valid option.

◆ Fenix_Unhandled_mode

Options for dealing with 'unhandled' errors, e.g. invalid rank IDs.

Enumerator
FENIX_UNHANDLED_SILENT 

Ignore unhandled errors.

FENIX_UNHANDLED_PRINT 

Print error and continue without handling.

FENIX_UNHANDLED_ABORT 

Print error and abort Fenix's world (default)

FENIX_UNHANDLED_MODE_MAXCODE 

Not a valid option.

Function Documentation

◆ Fenix_Callback_invoke_all()

int Fenix_Callback_invoke_all ( )

Invoke all callbacks with information from the last recovered fault.

Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_Callback_pop()

int Fenix_Callback_pop ( )

Pop the most recently registered callback from the callback stack.

Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_Callback_register()

int Fenix_Callback_register ( void(* recover )(MPI_Comm, int, void *),
void * callback_data )

Register a callback to be invoked after recovery from a process failure.

This function registers a callback to be invoked after a failure has been recovered by Fenix, and right before resuming application execution (e.g. returning from Fenix_Init by default). If this function is called more than once, the different callbacks will be called in the reverse order that they were registered (i.e. as a callback stack).

Callback functions are passed the newly-repaired resilient communicator, the error code returned by MPI in the communication action which caused a failure recovery, and the user-provided void* callback data.

Callbacks will only be invoked by survivor ranks, since spare ranks or respawned ranks had no way to register them before a failure.

Any active mlog will be deactivated for the duration of callbacks.

Parameters
[in] recover: The callback function to register.
[in] callback_data: The user-provided data which will be passed to the callback.
Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_check_cancelled()

int Fenix_check_cancelled ( MPI_Request * request,
MPI_Status * status )

Check a pre-recovery request without error.

Parameters
[in] request: The request to check.
[out] status: The status of the request.
Returns
True if the request was cancelled or has unknown completion status, false if it completed successfully.

◆ Fenix_Finalize()

int Fenix_Finalize ( )

Clean up Fenix state. Each active rank must call Fenix_Finalize before exiting.

This function cleans up all Fenix state, if any. If an MPI program using the Fenix library terminates normally (i.e. not due to a call to MPI_Abort, or an unrecoverable error) then each rank must call Fenix_Finalize before it exits. It must be called before MPI_Finalize, and after Fenix_Init. No Fenix functions may be called after this function, except Fenix_Initialized.

As noted in the description of Fenix_Init, all spare ranks that have not been used to recover from failures (and therefore are still reserved by Fenix and kept inside Fenix_Init) will call MPI_Finalize and exit when all active ranks have called Fenix_Finalize.

Supports inline recovery when it is active (see Fenix_Mlog_activate). In this case, a rank will not leave this function until success (or an error with the message logs).

Advice: Sometimes users may want to remove ranks proactively from the execution, for example because monitoring data shows that failure of a rank is imminent or that a rank is executing unmanageably slowly. This can be accomplished by calling exit on the targeted ranks, followed by an invocation of MPI_Barrier. The removed ranks will be reported as failed and error handling will progress appropriately. No calls to Fenix_Finalize are needed in this case.

◆ Fenix_Finalized()

int Fenix_Finalized ( int * flag)

Sets flag to true if Fenix_Finalize has been called, else false.

Parameters
[out] flag: Pointer to the flag to be set.
Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_get_option()

int Fenix_get_option ( Fenix_Setting_name setting,
unsigned * option )

Get the current option for a Fenix setting.

Each Fenix_Setting_name will describe its function and possible options.

Parameters
[in] setting: The Fenix_Setting_name to query.
[out] option: The current option of the setting.
Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_Init()

void Fenix_Init ( int * role,
MPI_Comm comm,
MPI_Comm * newcomm,
int ** argc,
char *** argv,
int spare_ranks,
int * error )

Build a resilient communicator and set the restart point.

This function must be called by all ranks in comm, after MPI initialization. All calling ranks must pass the same values for the parameters comm and spare_ranks. Fenix_Init must be called exactly once by each rank. This function is used (1) to activate the Fenix library, (2) to specify extra resources in case of rank failure, and (3) to create a logical resume point for when FENIX_RESUME_JUMP is set.

Note that this function uses FENIX_RESUME_JUMP by default, which exposes applications to various undefined behaviors if they do not take care about how variables are used before and after the jump. See FENIX_RESUME_JUMP for more information.

It is recommended to access argc and argv only after executing Fenix_Init, since command line arguments passed to this function that apply to Fenix may be removed by Fenix_Init.

Fenix_Init is blocking in the following sense. If it is entered for the first time via a regular, explicit function call, it must be entered by all ranks in communicator comm. If it is entered after an error intercepted by Fenix (due to FENIX_RESUME_MODE FENIX_RESUME_JUMP), no ranks are allowed to exit from it until all non-failed ranks have returned control to it. Note: Typically, control is returned automatically through revocation of the resilient communicator, which means ranks that have long delays between MPI function calls or ranks that only use communicators unaffected by failure may lead to long delays between a failure and its recovery. See Fenix_Process_detect_failures for a way to improve this behavior.

Ranks to be used as spare ranks by Fenix will be available to the application only before Fenix_Init or after they are used to replace a failed rank (in which case they become active ranks). This document refers to the latter as RECOVERED ranks (see Fenix_Rank_role). Note that all spare ranks that have not been used to recover from failures (and, therefore, are still reserved by Fenix and kept inside Fenix_Init) will automatically call MPI_Finalize and exit when all active ranks have entered the Fenix_Finalize call.

No Fenix functions may be called before Fenix_Init, except Fenix_Initialized and Fenix_set_option.

Parameters
[out] role: The current Fenix_Rank_role of this rank.
[in] comm: Communicator to construct the resilient communicator from.
[out] newcomm: Resilient output communicator.
[in,out] argc: Pointer to application main's argc parameter.
[in,out] argv: Pointer to application main's argv parameter.
[in] spare_ranks: Number of ranks in comm to reserve from newcomm. These ranks are reserved to substitute for failed ranks.
[out] error: The return status of Fenix_Init. Used to signal that a non-fatal error or special condition was encountered in the execution of Fenix_Init, or FENIX_SUCCESS otherwise. It has the same value across all ranks released by Fenix_Init. If spawning is not enabled (FENIX_RECOVERY_MODE is not FENIX_RECOVERY_SPAWN) and spare ranks have been depleted, Fenix will repair resilient communicators by shrinking them and will report such shrinkage in this return parameter through the value FENIX_WARNING_SPARE_RANKS_DEPLETED.

◆ Fenix_Initialized()

int Fenix_Initialized ( int * flag)

Sets flag to true if Fenix_Init has been called, else false.

Parameters
[out] flag: Pointer to the flag to be set.
Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_Process_detect_failures()

int Fenix_Process_detect_failures ( int do_recovery)
local

Check for any failed ranks.

This function is a local best-effort check for detecting any failed ranks. It does not perform communication, and does not guarantee a consistent view across ranks in the way an MPIX_Comm_agree would. This function's goal is to allow applications with long periods of compute between communication to more quickly respond to failures during those periods of compute.

If recovery is enabled, this will behave exactly as a failed MPI operation. Otherwise, this function behaves as if FENIX_RECOVERY_IGNORE and FENIX_RESUME_RETURN are set and returns FENIX_ERROR_PROCESS_FAILURE if any rank failures are detected on the resilient communicator provided by Fenix.

If recovery is enabled and inline recovery is active (see Fenix_Mlog_activate), this function supports inline recovery. In this case, any detected errors are recovered but the FENIX_RESUME_MODE is ignored. The function will automatically replay after recovery until no failures are detected.

Parameters
[in] do_recovery: Whether to enable recovery for this function.
Returns
FENIX_SUCCESS if successful, any return code otherwise.

◆ Fenix_Process_fail_list()

int Fenix_Process_fail_list ( int ** fail_list)

Get the list of ranks that failed in the most recent failure.

Parameters
[out] fail_list: Set to a list of failed ranks.
Returns
The number of failed ranks.
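A typical use of the returned list is to ask whether a particular rank was among the failures. The helper below is a hypothetical sketch, not part of Fenix, and takes the list and count as plain arguments so that it builds without Fenix. Going by the signature above, an application would obtain them with something like int *list; int n = Fenix_Process_fail_list(&list);. Ownership of the list is not stated in this section, so the sketch neither frees nor modifies it.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if `rank` appears in the first `count` entries of
 * `fail_list`, 0 otherwise. The list is only read, never modified. */
static int rank_has_failed(const int *fail_list, int count, int rank)
{
    assert(count >= 0);
    assert(count == 0 || fail_list != NULL);
    for (int i = 0; i < count; ++i) {
        if (fail_list[i] == rank) {
            return 1;
        }
    }
    return 0;
}
```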

◆ Fenix_set_option()

int Fenix_set_option ( Fenix_Setting_name setting,
unsigned option )
local

Configure a global Fenix setting.

Each Fenix_Setting_name will describe its function and valid options.

If called prior to Fenix_Init, the setting will apply to future Fenix_Inits. Settings configured during initialization override prior configurations.

Unless otherwise noted, it is undefined behavior if settings are not consistent across all ranks.

Parameters
[in] setting: The Fenix_Setting_name to configure.
[in] option: The option to configure setting to.
Returns
FENIX_SUCCESS if successful, any return code otherwise.