International Business Machines Corporation
DYNAMIC POLICY PROGRAMMING FOR CONTINUOUS ACTION SPACES
Abstract:
A method, system, and computer program product for dynamic policy programming in continuous action spaces are provided. The method generates a replay buffer containing a set of transitions from a set of states. A plurality of transitions is sampled from the replay buffer as a set of samples. The method generates an action value based on a dynamic policy programming (DPP) recursion using the set of samples and a first hyperparameter. A current policy is computed from a KL divergence using the set of samples, the action value, and a second hyperparameter, and is evaluated using the action value and a DPP-based error. The method then generates a subsequent policy by updating the current policy to minimize a sum of two KL divergences using a third hyperparameter, where the second KL divergence is computed from the current policy and the set of samples.
Utility
1 Sep 2020
3 Mar 2022
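The loop described in the abstract (replay buffer, sampled minibatches, a DPP recursion on the action value, and a policy update that minimizes a sum of two KL divergences) can be sketched as follows. Everything here is an illustrative assumption rather than the patented implementation: the toy MDP, the discretized action space standing in for a continuous one, the hyperparameter names `eta` and `beta`, the Boltzmann-average backup operator, and the closed-form KL-regularized update are all choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 7
gamma, alpha = 0.95, 0.1
eta = 1.0    # "first hyperparameter": temperature inside the DPP recursion
beta = 0.5   # "third hyperparameter": weight on the KL-to-current-policy term

def boltzmann_avg(psi_row, eta):
    """Soft backup operator: Boltzmann-weighted average of action preferences."""
    w = np.exp(eta * (psi_row - psi_row.max(axis=-1, keepdims=True)))
    w /= w.sum(axis=-1, keepdims=True)
    return (w * psi_row).sum(axis=-1)

def kl(p, q):
    """Per-state KL divergence between action distributions."""
    return np.sum(p * np.log(p / q), axis=-1)

# 1) Generate a replay buffer of transitions from a random toy MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))
buffer, s = [], 0
for _ in range(2000):
    a = int(rng.integers(n_actions))
    s2 = int(rng.choice(n_states, p=P[s, a]))
    buffer.append((s, a, R[s, a], s2))
    s = s2

# 2) Sample minibatches and run a DPP-style recursion on the action value,
#    using a soft (Boltzmann) backup in place of the hard max.
psi = np.zeros((n_states, n_actions))
for _ in range(300):
    idx = rng.integers(len(buffer), size=64)
    for i in idx:
        s, a, r, s2 = buffer[i]
        delta = r + gamma * boltzmann_avg(psi[s2], eta) - boltzmann_avg(psi[s], eta)
        psi[s, a] += alpha * delta

# 3) Policy update minimizing a sum of two KL divergences:
#    KL(pi || exp(eta*psi)/Z) + beta * KL(pi || pi_old), whose minimizer has
#    the closed form pi_new ∝ (exp(eta*psi) * pi_old**beta)**(1/(1+beta)).
pi_old = np.full((n_states, n_actions), 1.0 / n_actions)
log_pi_new = (eta * psi + beta * np.log(pi_old)) / (1.0 + beta)
log_pi_new -= log_pi_new.max(axis=-1, keepdims=True)
pi_new = np.exp(log_pi_new)
pi_new /= pi_new.sum(axis=-1, keepdims=True)
```

In a genuinely continuous action space the tabular `psi` and the exact softmax normalization above would be replaced by function approximators (e.g. a parameterized Gaussian policy), with the two KL terms estimated from the sampled transitions; the discretization here only keeps the sketch self-contained.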