Nutanix, Inc.
Executing resource management operations in distributed computing systems
Last updated:
Abstract:
Computing cluster system management. Embodiments implement fine-grained rule-based approaches to error recovery. A service dispatches tasks to components of the computing cluster. At the time of task dispatching, entries are made into a write-ahead log. The write-ahead log entries serve for recording task and component attributes. A monitor detects a failure event raised by one or more of the components of the computing cluster. Responses to the failure event include determining a set of conditions that are present in the computing cluster at the time of the detection, and then using the failure event and the determined conditions in combination with a set of fine-grained failure processing rules to determine one or more recovery actions to take. Recovery actions include redistributing the failed task to a different node or to different service. Certain conditions and rules initiate actions that rollback the state of a component to a previous success point.
Utility
20 Nov 2017
22 Sep 2020