Grégoire Delétang∗, Anian Ruoss∗, Jordi Grau-Moya, Tim Genewein,
Li Kevin Wenliang, Elliot Catt, Marcus Hutter, Shane Legg, Pedro A. Ortega
DeepMind
London, UK
Abstract
Reliable generalization lies at the heart of safe ML and AI. However, understanding
when and how neural networks generalize remains one of the most important
unsolved problems in the field. In this work, we conduct an extensive empirical
study (2200 models, 16 tasks) to investigate whether insights from the theory of
computation can predict the limits of neural network generalization in practice.
We demonstrate that grouping tasks according to the Chomsky hierarchy allows
us to forecast whether certain architectures will be able to generalize to
out-of-distribution inputs. This includes negative results where even extensive amounts of
data and training time never led to any non-trivial generalization, despite models
having sufficient capacity to perfectly fit the training data. Our results show
that, for our subset of tasks, RNNs and Transformers fail to generalize on
non-regular tasks, LSTMs can solve regular and counter-language tasks, and only
networks augmented with structured memory (such as a stack or memory tape) can
successfully generalize on context-free and context-sensitive tasks.