Two RL papers

  1. arXiv:2202.09699 [pdf, other]

    cs.LG cs.AI stat.ML

    Selective Credit Assignment

    Authors: Veronica Chelu, Diana Borsa, Doina Precup, Hado van Hasselt

    Abstract: Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view of temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, and introduce new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.

    Submitted 19 February, 2022; originally announced February 2022.
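
    To make the abstract's notion of weighted updates concrete, here is a minimal tabular sketch, assuming a per-state interest vector in the loose spirit of interest functions from emphatic TD: the eligibility trace of TD(lambda) is scaled by the interest of each visited state, so credit is distributed backward selectively. The function name, the `interest` vector, and where the weighting enters the update are illustrative assumptions, not the paper's algorithms.

    ```python
    import numpy as np

    def selective_td_lambda(episode, V, interest, alpha=0.1, gamma=0.99, lam=0.9):
        """TD(lambda) prediction with a per-state 'interest' weighting
        (a hypothetical illustration of selective credit assignment):
        states with zero interest accumulate no eligibility of their own,
        so TD errors are credited back only to states we care about."""
        e = np.zeros_like(V)  # eligibility trace
        for (s, r, s_next, done) in episode:
            # One-step TD error; bootstrap only if the episode continues.
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            # Decay past credit, then add this state's weighted eligibility.
            e = gamma * lam * e
            e[s] += interest[s]
            # Distribute the TD error backward along the weighted trace.
            V = V + alpha * delta * e
        return V
    ```

    For example, on a five-state chain with `interest = np.array([1.0, 1.0, 0.0, 0.0, 0.0])`, a reward observed in state 2 still updates the values of states 0 and 1 through their decayed traces, while the zero-interest states receive no direct credit.
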

  2. arXiv:2201.06468 [pdf, other]

    cs.LG cs.AI stat.ML

    Chaining Value Functions for Off-Policy Learning

    Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

    Abstract: To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This matters when the agent wants to learn counterfactuals, or when the experience was generated outside its control. However, off-policy learning is non-trivial: standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms that are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable, hence the complete algorithm is guaranteed to be stable. Under mild conditions the result comes arbitrarily close to the off-policy TD solution as the length of the chain increases, so the scheme can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore, it can be interpreted as estimating a novel objective -- which we call a 'k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.

    Submitted 2 February, 2022; v1 submitted 17 January, 2022; originally announced January 2022.
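
    The chaining construction described above lends itself to a short tabular sketch, assuming a tabular MDP and one-step importance weighting; the names (`chained_td`, `sweeps`) and the exact update form are illustrative assumptions rather than the paper's notation. Link V[0] is plain on-policy TD(0) for the behaviour policy; each later link bootstraps one importance-weighted step of the target policy on the frozen previous link, so every link is a stable learning problem on its own.

    ```python
    import numpy as np

    def chained_td(transitions, pi, mu, n_states, chain_len,
                   alpha=0.05, gamma=0.99, sweeps=200):
        """Tabular sketch of chaining value functions for off-policy
        prediction (illustrative, not the paper's exact algorithm).

        transitions: list of (s, a, r, s_next) tuples gathered under mu.
        pi, mu:      (n_states, n_actions) action-probability tables;
                     mu must put positive probability wherever pi does.
        """
        V = np.zeros((chain_len + 1, n_states))
        for k in range(chain_len + 1):
            for _ in range(sweeps):
                for (s, a, r, s_next) in transitions:
                    if k == 0:
                        # Plain on-policy TD(0): estimate v_mu.
                        target = r + gamma * V[0, s_next]
                    else:
                        # One importance-weighted step of pi, bootstrapping
                        # on the previous (already learned, frozen) link.
                        rho = pi[s, a] / mu[s, a]
                        target = rho * (r + gamma * V[k - 1, s_next])
                    V[k, s] += alpha * (target - V[k, s])
        # V[k] estimates a 'k-step expedition': follow pi for k steps, then mu.
        return V[chain_len]
    ```

    Per the abstract, as the chain grows the final link should come arbitrarily close to the off-policy TD solution while each individual link remains stable, which is what lets the scheme succeed on examples such as Baird's counterexample where off-policy TD itself diverges.
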