Alibaba Group Holding Limited
Method and apparatus for machine-exception handling and learning rate adjustment
Abstract:
The present disclosure provides methods and apparatuses for machine-exception handling and learning rate adjustment. One exemplary method comprises: acquiring a gradient consumption time of a target machine, wherein the gradient consumption time indicates a gradient-related time consumed by the target machine in a training process; determining whether the gradient consumption time satisfies a predetermined condition relative to a pre-acquired average consumption time, wherein the average consumption time indicates an average value of the gradient-related time consumed in the training process by all machines in a cluster other than the target machine; and determining that the target machine is abnormal if the gradient consumption time satisfies the predetermined condition relative to the average consumption time. The present disclosure addresses the technical problem of high training costs caused by the low computation or communication speeds of some machines in a cluster.
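The check described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state what the predetermined condition is, so the sketch assumes one possible form: the target machine's gradient consumption time exceeds a fixed multiple of the average over all other machines. The function names, the `ratio` parameter, and its default value of 2.0 are hypothetical and are not taken from the disclosure.

```python
from statistics import mean


def is_abnormal(target_time, other_times, ratio=2.0):
    """Return True if the target machine looks abnormal.

    target_time: gradient-related time consumed by the target machine.
    other_times: gradient-related times of all other machines in the cluster.
    ratio: ASSUMED threshold multiple; the abstract only says
           "a predetermined condition" and does not define it.
    """
    if not other_times:
        raise ValueError("need at least one other machine to compare against")
    # Average consumption time excludes the target machine itself.
    average_time = mean(other_times)
    return target_time > ratio * average_time


def find_abnormal_machines(times_by_machine, ratio=2.0):
    """Apply the check to every machine in turn, treating each as the target."""
    abnormal = []
    for machine, target_time in times_by_machine.items():
        others = [t for m, t in times_by_machine.items() if m != machine]
        if is_abnormal(target_time, others, ratio):
            abnormal.append(machine)
    return abnormal


# Illustrative values only (seconds per training step), not measured data.
example_times = {"worker-a": 1.0, "worker-b": 1.1, "worker-c": 0.9, "worker-d": 5.0}
flagged = find_abnormal_machines(example_times)
```

With these illustrative values, only worker-d is flagged: the other three machines average 1.0 second, and 5.0 exceeds twice that. Excluding the target from the average, as the abstract specifies, keeps a single very slow machine from inflating the baseline it is compared against.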
Utility
23 Jul 2018
18 Aug 2020