Improving Exploration in Actor-Critic With Weakly Pessimistic Value Estimation and Optimistic Policy Optimization

IEEE Trans Neural Netw Learn Syst. 2022 Oct 28:PP. doi: 10.1109/TNNLS.2022.3215596. Online ahead of print.

Abstract

Deep off-policy actor-critic algorithms have been successfully applied to challenging continuous-control tasks. However, these methods typically suffer from poor sample efficiency, which limits their adoption in real-world domains. To mitigate this issue, we propose a novel actor-critic algorithm with weakly pessimistic value estimation and optimistic policy optimization (WPVOP) for continuous control. WPVOP integrates two key ingredients: 1) a weakly pessimistic value estimate, which compensates for the pessimism of the lower confidence bound used in the conventional value function (i.e., clipped double Q-learning) to trigger exploration in low-value state-action regions, and 2) an optimistic policy optimization scheme that samples the actions expected to benefit policy learning most toward the optimal Q-values, enabling efficient exploration. We theoretically show that the proposed weakly pessimistic value estimate is bounded from below and above, and empirically show that it avoids extremely over-optimistic value estimates. The two ideas are largely complementary and can be fruitfully combined to improve performance and the sample efficiency of exploration. We evaluate WPVOP on a suite of continuous-control tasks from MuJoCo, achieving state-of-the-art sample efficiency and performance.
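The abstract does not give the exact form of the weakly pessimistic target, but a common way to soften the pessimism of clipped double Q-learning is to mix the minimum and maximum of the twin critics rather than taking the minimum alone. The minimal NumPy sketch below illustrates this idea under that assumption; the function `weakly_pessimistic_target` and the mixing coefficient `beta` are hypothetical names and are not taken from the paper.

```python
import numpy as np

def clipped_double_q_target(q1, q2, reward, gamma, done):
    """Standard clipped double Q-learning target: the pessimistic
    minimum of the two critic estimates at the next state-action."""
    q_min = np.minimum(q1, q2)
    return reward + gamma * (1.0 - done) * q_min

def weakly_pessimistic_target(q1, q2, reward, gamma, done, beta=0.75):
    """Hypothetical weakly pessimistic target (illustration only):
    a convex mix of the min and max of the twin critics, so the
    estimate lies between the pessimistic (min) and optimistic (max)
    values. beta = 1 recovers clipped double Q-learning."""
    q_min = np.minimum(q1, q2)
    q_max = np.maximum(q1, q2)
    q_mix = beta * q_min + (1.0 - beta) * q_max
    return reward + gamma * (1.0 - done) * q_mix

# Example: the mixed target sits between the min-based and max-based targets.
q1 = np.array([1.0, 2.5]); q2 = np.array([1.4, 2.0])
r = np.array([0.1, 0.1]); done = np.array([0.0, 0.0])
print(clipped_double_q_target(q1, q2, r, 0.99, done))
print(weakly_pessimistic_target(q1, q2, r, 0.99, done))
```

Because min(q1, q2) <= q_mix <= max(q1, q2), such a mixed estimate is bounded below by the clipped double-Q value and above by the optimistic maximum, which is consistent with (though not necessarily identical to) the lower- and upper-boundedness claimed in the abstract.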