Natural Policy Gradient for Exponential Families

Abstract

Recent work has highlighted how a misalignment between the support of the policy and the action space of a reinforcement learning problem can introduce bias and unnecessary variance into policy gradient estimates. To better align the two, we consider modeling the policy with arbitrary exponential families. Exponential families are a natural choice because the class is rich enough to match the support of most action spaces of practical interest. While the multivariate Gaussian remains the most commonly used policy distribution, both natural policy gradient and TRPO can in general be implemented efficiently for any exponential family. In this technical report we derive efficient natural policy gradient update rules for several exponential families. We also apply the Gamma distribution to an optimal production problem and show that it substantially outperforms the Gaussian.
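The key fact the abstract relies on is standard: for an exponential family written in its natural parameterization, p(x | η) ∝ exp(η · T(x)), the Fisher information matrix equals the covariance of the sufficient statistics, F(η) = Cov[T(X)], so the natural gradient is F(η)⁻¹∇J. The sketch below (not code from the report; the function names and Monte Carlo approach are illustrative assumptions) demonstrates this for the Gamma family, whose sufficient statistics are T(x) = (ln x, x), by estimating F via sampling and then preconditioning a gradient vector.

```python
import math
import random

random.seed(0)

def gamma_fisher_mc(shape, rate, n=200_000):
    # Monte Carlo estimate of the Fisher information of the Gamma family
    # in its natural parameterization: F = Cov[T(X)], with sufficient
    # statistics T(x) = (ln x, x). The exact values are
    # [[psi'(shape), 1/rate], [1/rate, shape/rate^2]] (psi' = trigamma).
    xs = [random.gammavariate(shape, 1.0 / rate) for _ in range(n)]
    t = [(math.log(x), x) for x in xs]
    m0 = sum(a for a, _ in t) / n
    m1 = sum(b for _, b in t) / n
    c00 = sum((a - m0) ** 2 for a, _ in t) / n
    c11 = sum((b - m1) ** 2 for _, b in t) / n
    c01 = sum((a - m0) * (b - m1) for a, b in t) / n
    return [[c00, c01], [c01, c11]]

def natural_gradient(F, g):
    # Precondition a vanilla gradient g by the Fisher matrix:
    # solve F d = g for the 2x2 case via the explicit inverse.
    det = F[0][0] * F[1][1] - F[0][1] * F[1][0]
    return [( F[1][1] * g[0] - F[0][1] * g[1]) / det,
            (-F[1][0] * g[0] + F[0][0] * g[1]) / det]
```

For Gamma(k=3, rate=2) the exact Fisher matrix is [[ψ′(3), 0.5], [0.5, 0.75]] with ψ′(3) = π²/6 − 5/4 ≈ 0.3949, which the Monte Carlo estimate recovers to a few decimal places; `natural_gradient` then rescales the gradient so the update is invariant to the chosen parameterization, which is the property that makes the method attractive for non-Gaussian policies.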

Publication
Technical Report