International Business Machines Corporation
FAILURE DETECTION AND CORRECTION IN A DISTRIBUTED COMPUTING SYSTEM
Abstract:
A failure detection and correction module (FDCM) uses statistical measurement to detect failures in a distributed computing system. The failures may be caused by hardware, software, workflow, deployment, environmental, or other factors, and may arise in a component of a computing system, in the computing system as a whole, or across multiple computing systems. The FDCM identifies issues from the various components, correlates the estimated failures at each component level, and rolls up the actual and estimated failures from each level into system-level failure estimates. It then reevaluates the system reliability factors, readjusts the system reliability and system functions based on the adjusted reliability factors, and produces intelligent corrective actions to improve both system reliability and system efficiency. Corrective actions include changing slice storage parameters and rebuild priorities in a dispersed storage system.
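The roll-up and corrective-action steps described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method: the abstract does not say how estimates are combined, so the sketch assumes independent failures and combines probabilities as 1 minus the product of survival probabilities. The function names, the level names ("disk", "node"), the 0.05 threshold, and the `extra_slices` setting are all hypothetical and chosen only for illustration.

```python
from math import prod


def rollup_failure_probability(component_probs):
    """Probability that at least one component fails.

    Assumption (not from the source): failures are independent.
    """
    return 1.0 - prod(1.0 - p for p in component_probs)


def estimate_system(levels):
    """Roll per-component estimates up to per-level and system-level estimates.

    `levels` maps a hypothetical level name to a list of per-component
    failure probability estimates.
    """
    per_level = {
        name: rollup_failure_probability(probs) for name, probs in levels.items()
    }
    system = rollup_failure_probability(per_level.values())
    return per_level, system


def choose_corrective_action(system_prob, threshold=0.05):
    """Pick hypothetical dispersed-storage settings from the system estimate."""
    if system_prob > threshold:
        # Higher estimated risk: rebuild sooner and store more redundancy.
        return {"rebuild_priority": "high", "extra_slices": 2}
    return {"rebuild_priority": "normal", "extra_slices": 0}


if __name__ == "__main__":
    levels = {"disk": [0.01, 0.02], "node": [0.03]}
    per_level, system = estimate_system(levels)
    print(per_level, system, choose_corrective_action(system))
```

In this sketch the system-level estimate is recomputed from the per-level estimates each time, so new component measurements change the chosen rebuild priority without any other bookkeeping. A real system would likely need correlated-failure modeling, which the independence assumption deliberately ignores.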
Utility
6 Jan 2020
8 Jul 2021