International Business Machines Corporation
DYNAMIC POLICY PROGRAMMING FOR CONTINUOUS ACTION SPACES

Last updated:

Abstract:

A method, system, and computer program product for dynamic policy programming in continuous action spaces are provided. The method generates a replay buffer containing a set of transitions from a set of states. A plurality of transitions is sampled from the replay buffer as a set of samples. The method generates an action value based on a dynamic policy programming (DPP) recursion using the set of samples and a first hyperparameter. The current policy is evaluated by computing a first KL divergence using the set of samples, the action value, and a second hyperparameter, and by using the action value and a DPP-based error. The method generates a subsequent policy by updating the current policy, using a third hyperparameter, to minimize a sum of the first KL divergence and a second KL divergence, where the second KL divergence is computed from the current policy and the set of samples.
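The pipeline the abstract describes can be sketched as follows. This is a minimal illustration only, not the patented implementation: the toy environment, the univariate Gaussian policy, the hyperparameter names (`eta`, `tau`, `alpha` standing in for the first, second, and third hyperparameters), the Monte Carlo soft-max operator, and the moment-matched policy update are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95

# Toy 1-D environment (illustrative stand-in): reward peaks when a = -s.
def env_step(s, a):
    r = -(s + a) ** 2
    return r, float(np.clip(s + 0.1 * a, -1.0, 1.0))

# Replay buffer holding a set of transitions (s, a, r, s').
buffer = []
s = 0.0
for _ in range(500):
    a = float(rng.uniform(-1.0, 1.0))
    r, s_next = env_step(s, a)
    buffer.append((s, a, r, s_next))
    s = s_next

# Sample a plurality of transitions from the buffer as the set of samples.
batch = [buffer[i] for i in rng.integers(len(buffer), size=128)]

# Gaussian policy over the continuous action space: pi(a|s) = N(mu, sigma^2).
mu, sigma = 0.0, 1.0
eta = 5.0    # "first hyperparameter": inverse temperature of the DPP recursion
tau = 0.1    # "second hyperparameter": weight of the KL term in evaluation
alpha = 0.5  # "third hyperparameter": mixing weight of the KL-based update

def soft_value(s_next, n=32):
    """Boltzmann soft-max of the action value at s', estimated by Monte Carlo
    over actions drawn from the current policy (a DPP-style operator)."""
    acts = rng.normal(mu, sigma, size=n)
    q = -(s_next + acts) ** 2  # toy action value; a learned model in practice
    return np.log(np.mean(np.exp(eta * q))) / eta

def kl_gaussian(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) || N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

# DPP recursion: one-step action values from the sampled transitions.
Q = np.array([r + gamma * soft_value(sn) for (_, _, r, sn) in batch])

# Policy evaluation: mean action value penalised by a KL divergence to a
# reference (behaviour) policy, taken here to be N(0, 1) for illustration.
eval_score = Q.mean() - tau * kl_gaussian(mu, sigma, 0.0, 1.0)

# Policy improvement: reweight sampled actions by exp(eta * Q), moment-match
# a target Gaussian, then mix it with the current policy; the mixing step can
# be read as trading off two KL divergences (to the reweighted target and to
# the current policy) with weight alpha.
acts = np.array([a for (_, a, _, _) in batch])
w = np.exp(eta * (Q - Q.max()))
w /= w.sum()
mu_t = float(np.sum(w * acts))
sig_t = float(np.sqrt(np.sum(w * (acts - mu_t) ** 2)) + 1e-6)
mu_new = (1 - alpha) * mu + alpha * mu_t
sigma_new = (1 - alpha) * sigma + alpha * sig_t
```

A design note on the sketch: the exponential reweighting concentrates the updated policy on high-value actions while the mixing weight `alpha` keeps it close to the current policy, which is the usual role of a KL-regularized update in DPP-style methods.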

Status: Application
Type: Utility
Filing date: 1 Sep 2020
Issue date: 3 Mar 2022