Path-Specific Objectives for Safer Agent Incentives
Sebastian Farquhar, Ryan Carey, Tom Everitt
University of Oxford, DeepMind
Abstract
We present a general framework for training safe agents whose
naive incentives are unsafe. For example, manipulative or deceptive behaviour can improve rewards but should be avoided.
Most approaches fail here: agents maximize expected return
by any means necessary. We formally describe settings with
‘delicate’ parts of the state which should not be used as a
means to an end. Using Causal Influence Diagram analysis, we then train agents to maximize only the causal effect of actions on the expected return that is not mediated by the delicate parts of the state. The resulting agents have no incentive to control the
delicate state. We further show how our framework unifies and
generalizes existing proposals.
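As a minimal illustrative sketch of the idea (the notation here is informal, not the paper's formal definition): let $X$ denote the delicate part of the state, $S$ the agent's observation, $\pi$ the agent's policy, and $\pi_0$ a fixed reference policy. A path-specific objective evaluates return in a counterfactual world where $X$ responds to $\pi_0$ rather than to the agent:

$$J_{\mathrm{PS}}(\pi) \;=\; \mathbb{E}\Big[\, R_{A = \pi(S),\; X = X_{A = \pi_0(S)}} \,\Big]$$

The agent's action influences the return along every causal path except those passing through $X$, whose value is held at the counterfactual value it would take under $\pi_0$. Because no choice of $\pi$ can change $X$ inside this objective, maximizing $J_{\mathrm{PS}}$ removes the incentive to control the delicate state.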