Nutanix, Inc.
Executing resource management operations in distributed computing systems

Abstract:

Computing cluster system management. Embodiments implement fine-grained, rule-based approaches to error recovery. A service dispatches tasks to components of the computing cluster. At the time of task dispatching, entries are made into a write-ahead log. The write-ahead log entries record task and component attributes. A monitor detects a failure event raised by one or more of the components of the computing cluster. Responses to the failure event include determining a set of conditions that are present in the computing cluster at the time of the detection, and then using the failure event and the determined conditions in combination with a set of fine-grained failure processing rules to determine one or more recovery actions to take. Recovery actions include redistributing the failed task to a different node or to a different service. Certain conditions and rules initiate actions that roll back the state of a component to a previous success point.
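The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (LogEntry, Rule, Dispatcher, RULES, the event names, the action strings, and the spare_nodes condition) is a hypothetical chosen for illustration, and the write-ahead log is modeled as an in-memory list rather than durable storage.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class LogEntry:
    """One write-ahead log record, created when a task is dispatched."""
    task_id: str
    component: str           # component the task was dispatched to
    node: str                # node hosting that component
    status: str = "dispatched"


@dataclass
class Rule:
    """A failure-processing rule: an event type plus a condition predicate."""
    event: str
    condition: Callable[[Dict[str, Any]], bool]
    action: str


# Hypothetical rule set; first matching rule wins.
RULES: List[Rule] = [
    Rule("timeout", lambda c: c.get("spare_nodes", 0) > 0,
         "redistribute_to_other_node"),
    Rule("timeout", lambda c: c.get("spare_nodes", 0) == 0,
         "redistribute_to_other_service"),
    Rule("corrupt_state", lambda c: True, "rollback_to_last_success"),
]


class Dispatcher:
    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.wal: List[LogEntry] = []   # in-memory stand-in for the log

    def dispatch(self, task_id: str, component: str, node: str) -> LogEntry:
        # Record task and component attributes at dispatch time.
        entry = LogEntry(task_id, component, node)
        self.wal.append(entry)
        return entry

    def on_failure(self, task_id: str, event: str,
                   conditions: Dict[str, Any]) -> str:
        # Mark the most recent log entry for this task as failed.
        for entry in reversed(self.wal):
            if entry.task_id == task_id:
                entry.status = "failed"
                break
        else:
            raise KeyError(f"no log entry for task {task_id!r}")
        # Combine the event with the observed conditions to pick an action.
        for rule in self.rules:
            if rule.event == event and rule.condition(conditions):
                return rule.action
        return "report_error"           # fallback when no rule matches
```

For example, after Dispatcher(RULES).dispatch("t1", "storage", "node-a"), calling on_failure("t1", "timeout", {"spare_nodes": 1}) selects the node-redistribution action, while the same event with no spare nodes selects redistribution to a different service. Keeping the rules as data, separate from the dispatcher, is one way to get the fine-grained behavior the abstract describes, since new event and condition combinations can be added without changing the dispatch logic.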

Status:

Grant

Type:

Utility

Filing date:

20 Nov 2017

Issue date:

22 Sep 2020