

Mathematical theory of deep learning
arXiv (published version): 10.48550/arXiv.2407.18384

Philipp Petersen, Universität Wien, Fakultät für Mathematik, 1090 Wien, Austria

Jakob Zech, Universität Heidelberg, Interdisziplinäres Zentrum für Wissenschaftliches Rechnen, 69120 Heidelberg, Germany

Preface

This book serves as an introduction to the key ideas in the mathematical analysis of deep learning. It is designed to help students and researchers to quickly familiarize themselves with the area and to provide a foundation for the development of university courses on the mathematics of deep learning. Our main goal in the composition of this book was to present various rigorous, but easy to grasp, results that help to build an understanding of fundamental mathematical concepts in deep learning. To achieve this, we prioritize simplicity over generality.

As a mathematical introduction to deep learning, this book does not aim to give an exhaustive survey of the entire (and rapidly growing) field, and some important research directions are missing. In particular, we have favored mathematical results over empirical research, even though an accurate account of the theory of deep learning requires both.

The book is intended for students and researchers in mathematics and related areas. While we believe that every diligent researcher or student will be able to work through this manuscript, it should be emphasized that a familiarity with analysis, linear algebra, probability theory, and basic functional analysis is recommended for an optimal reading experience. To assist readers, a review of key concepts in probability theory and functional analysis is provided in the appendix.

The material is structured around the three main pillars of deep learning theory: Approximation theory, Optimization theory, and Statistical Learning theory. This structure, which corresponds to the three error terms typically occurring in the theoretical analysis of deep learning models, is inspired by other recent texts on the topic following the same outline [1, 2, 3]. More specifically, Chapter 2 provides an overview and introduces key questions for understanding deep learning. Chapters 3-10 explore results in approximation theory, Chapters 11-14 discuss optimization theory for deep learning, and the remaining Chapters 15-17 address the statistical aspects of deep learning.

This book is the result of a series of lectures given by the authors. Parts of the material were presented by P.P. in a lecture titled “Neural Network Theory” at the University of Vienna, and by J.Z. in a lecture titled “Theory of Deep Learning” at Heidelberg University. The lecture notes of these courses formed the basis of this book. We are grateful to the many colleagues and students who contributed to this book through insightful discussions and valuable suggestions. We would like to offer special thanks to the following individuals:

Jonathan Garcia Rebellon, Jakob Lanser, Andrés Felipe Lerma Pineda, Marvin Koß, Martin Mauser, Davide Modesto, Martina Neuman, Bruno Perreaux, Johannes Asmus Petersen, Milutin Popovic, Tuan Quach, Tim Rakowski, Lorenz Riess, Jakob Fabian Rohner, Jonas Schuhmann, Peter Školnìk, Matej Vedak, Simon Weissmann, Josephine Westermann, Ashia Wilson.

Notation

Symbol | Description | Reference
\( \mathcal{A}\) | vector of layer widths | Definition 26
\( \mathfrak{A}\) | a sigma-algebra | Definition 38
\( {\rm {aff}}(S)\) | affine hull of \( S\) | (44)
\( \mathfrak{B}_d\) | the Borel sigma-algebra on \( \mathbb{R}^d\) | Section 18.1
\( \mathcal{B}^n\) | B-splines of order \( n\) | Definition 8
\( B_r(x)\) | ball of radius \( r\ge 0\) around \( x\) in a metric space \( X\) | (282)
\( B_r^d\) | ball of radius \( r\ge 0\) around \( \boldsymbol{0}\) in \( \mathbb{R}^d\)
\( C^k(\Omega)\) | \( k\)-times continuously differentiable functions from \( \Omega\to\mathbb{R}\)
\( C^\infty_c(\Omega)\) | infinitely differentiable functions from \( \Omega\to\mathbb{R}\) with compact support in \( \Omega\)
\( C^{0,s}(\Omega)\) | \( s\)-Hölder continuous functions from \( \Omega\to\mathbb{R}\) | Definition 14
\( C^{k,s}(\Omega)\) | \( C^k(\Omega)\) functions \( f\) for which \( f^{(k)}\in C^{0,s}(\Omega)\) | Definition 18
\( f_n \xrightarrow{{\rm {cc}}} f\) | compact convergence of \( f_n\) to \( f\) | Definition 3
\( {\rm {co}}(S)\) | convex hull of a set \( S\) | (38)
\( f * g\) | convolution of \( f\) and \( g\) | (11)
\( \mathcal{D}\) | data distribution | (4), Section 15.1
\( D^{\boldsymbol{\alpha}}\) | partial derivative
\( {\rm {depth}}(\Phi)\) | depth of \( \Phi\) | Definition 1
\( \varepsilon_{\mathrm{approx}}\) | approximation error | (233)
\( \varepsilon_{\mathrm{gen}}\) | generalization error | (233)
\( \varepsilon_{\mathrm{int}}\) | interpolation error | (234)
\( \mathbb{E}[X]\) | expectation of random variable \( X\) | (274)
\( \mathbb{E}[X|Y]\) | conditional expectation of random variable \( X\) given \( Y\) | Subsection 18.3.3
\( \mathcal{G}(S, \varepsilon, X)\) | \( \varepsilon\)-covering number of a set \( S \subseteq X\) | Definition 35
\( \Gamma_C\) | Barron space with constant \( C\) | Section 9.1
\( \nabla_x f\) | gradient of \( f\) w.r.t. \( x\)
\( \oslash\) | componentwise (Hadamard) division
\( \otimes\) | componentwise (Hadamard) product
\( h_S\) | empirical risk minimizer for a sample \( S\) | Definition 33
\( {\Phi^{\rm {id}}_{L}}\) | identity ReLU neural network | Lemma 6
\( \boldsymbol{1}_S\) | indicator function of the set \( S\)
\( \left\langle \cdot, \cdot\right\rangle_{}\) | Euclidean inner product on \( \mathbb{R}^d\)
\( \left\langle \cdot, \cdot\right\rangle_{H}\) | inner product on a vector space \( H\) | Definition 54
\( k_\mathcal{T}\) | maximal number of elements shared by a single node of a triangulation | (39)
\( \hat K_n({\boldsymbol{x}},{\boldsymbol{x}}')\) | empirical tangent kernel | (175)
\( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) | loss landscape defining function | Definition 27
\( {\rm {Lip}}(f)\) | Lipschitz constant of a function \( f\) | (97)
\( {\rm {Lip}}_M(\Omega)\) | \( M\)-Lipschitz continuous functions on \( \Omega\) | (100)
\( \mathcal{L}\) | general loss function | Section 15.1
\( \mathcal{L}_{0-1}\) | 0-1 loss | Section 15.1
\( \mathcal{L}_{\rm {ce}}\) | binary cross entropy loss | Section 15.1
\( \mathcal{L}_2\) | square loss | Section 15.1
\( \ell^p(\mathbb{N})\) | space of \( p\)-summable sequences indexed over \( \mathbb{N}\) | Section 19.2.3
\( L^p(\Omega)\) | Lebesgue space over \( \Omega\) | Section 19.2.3
\( \mathcal{M}\) | piecewise continuous and locally bounded functions | Definition 8
\( \mathcal{N}_d^m(\sigma;L,n)\) | set of multilayer perceptrons with \( d\)-dim input, \( m\)-dim output, activation function \( \sigma\), depth \( L\), and width \( n\) | Definition 5
\( \mathcal{N}_d^m(\sigma;L)\) | union of \( \mathcal{N}_d^m(\sigma;L,n)\) for all \( n\in\mathbb{N}\) | Definition 5
\( \mathcal{N}(\sigma;\mathcal{A}, B)\) | set of neural networks with architecture \( \mathcal{A}\), activation function \( \sigma\), and all weights bounded in modulus by \( B\) | Definition 26
\( \mathcal{N}^*(\sigma, \mathcal{A}, B)\) | neural networks in \( \mathcal{N}(\sigma;\mathcal{A}, B)\) with range in \( [-1,1]\) | (239)
\( \mathbb{N}\) | positive natural numbers
\( \mathbb{N}_0\) | natural numbers including \( 0\)
\( {\rm {N}}({\boldsymbol{m}},{\boldsymbol{C}})\) | multivariate normal distribution with mean \( {\boldsymbol{m}}\in\mathbb{R}^d\) and covariance \( {\boldsymbol{C}}\in\mathbb{R}^{d\times d}\)
\( n_\mathcal{A}\) | number of parameters of a neural network with layer widths described by \( \mathcal{A}\) | Definition 26
\( \| \cdot \|_{}\) | Euclidean norm for vectors in \( \mathbb{R}^d\) and spectral norm for matrices in \( \mathbb{R}^{n\times d}\)
\( \| \cdot \|_{F}\) | Frobenius norm for matrices
\( \| \cdot \|_{\infty}\) | \( \infty\)-norm on \( \mathbb{R}^d\) or supremum norm for functions
\( \| \cdot \|_{p}\) | \( p\)-norm on \( \mathbb{R}^d\)
\( \| \cdot \|_{X}\) | norm on a vector space \( X\)
\( \boldsymbol{0}\) | zero vector or zero matrix
\( O(\cdot)\) | Landau notation
\( \omega(\eta)\) | patch of the node \( \eta\) | (42)
\( \Omega_\Lambda(c)\) | sublevel set of loss landscape | Definition 28
\( \partial f({\boldsymbol{x}})\) | set of subgradients of \( f\) at \( {\boldsymbol{x}}\) | Definition 23
\( \mathcal{P}_n(\mathbb{R}^d)\) or \( \mathcal{P}_n\) | space of multivariate polynomials of degree \( n\) on \( \mathbb{R}^d\) | Example 1
\( \mathcal{P}(\mathbb{R}^d)\) or \( \mathcal{P}\) | space of multivariate polynomials of arbitrary degree on \( \mathbb{R}^d\) | Example 1
\( \mathbb{P}_X\) | distribution of random variable \( X\) | Definition 43
\( \mathbb{P}[A]\) | probability of event \( A\) | Definition 40
\( \mathbb{P}[A|B]\) | conditional probability of event \( A\) given \( B\) | Definition 276
\( \mathcal{PN}(\mathcal{A}, B)\) | parameter set of neural networks with architecture \( \mathcal{A}\) and all weights bounded in modulus by \( B\) | Definition 26
\( {\rm {Pieces}}(f, \Omega)\) | number of pieces of \( f\) on \( \Omega\) | Definition 15
\( \Phi({\boldsymbol{x}})\) | model (e.g. neural network) in terms of input \( {\boldsymbol{x}}\) (parameter dependence suppressed)
\( \Phi({\boldsymbol{x}},{\boldsymbol{w}})\) | model (e.g. neural network) in terms of input \( {\boldsymbol{x}}\) and parameters \( {\boldsymbol{w}}\)
\( \Phi^{\rm {lin}}\) | linearization around initialization | (172)
\( {\Phi^{\min}_{n}}\) | minimum neural network | Lemma 11
\( {\Phi^{\times}_{\varepsilon}}\) | multiplication neural network | Lemma 20
\( {\Phi^{\times}_{n,\varepsilon}}\) | multiplication of \( n\) numbers neural network | Proposition 9
\( \Phi_2\circ\Phi_1\) | composition of neural networks | Lemma 7
\( \Phi_2\bullet\Phi_1\) | sparse composition of neural networks | Lemma 7
\( (\Phi_1, \dots, \Phi_m)\) | parallelization of neural networks | (28)
\( {\boldsymbol{A}}^\dagger\) | pseudoinverse of a matrix \( {\boldsymbol{A}}\)
\( \mathbb{Q}\) | rational numbers
\( \mathbb{R}\) | real numbers
\( \mathbb{R}_-\) | non-positive real numbers
\( \mathbb{R}_+\) | non-negative real numbers
\( R_\sigma\) | realization map | Definition 26
\( R^*\) | Bayes risk | (230)
\( \mathcal{R}(h)\) | risk of hypothesis \( h\) | Definition 31
\( \widehat{\mathcal{R}}_S(h)\) | empirical risk of \( h\) for sample \( S\) | (3), Definition 32
\( \mathcal{S}_n\) | cardinal B-spline | Definition 7
\( \mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d\) | multivariate cardinal B-spline | Definition 8
\( |S|\) | cardinality of an arbitrary set \( S\), or Lebesgue measure of \( S\subseteq\mathbb{R}^d\)
\( \mathring{S}\) | interior of a set \( S\)
\( \overline{S}\) | closure of a set \( S\)
\( \partial{S}\) | boundary of a set \( S\)
\( S^c\) | complement of a set \( S\)
\( S^\perp\) | orthogonal complement of a set \( S\) | Definition 56
\( \sigma\) | general activation function
\( \sigma_{a}\) | parametric ReLU activation function | Section 3.3
\( \sigma_{\rm {ReLU}}\) | ReLU activation function | Section 3.3
\( {\rm {sign}}\) | sign function
\( s_{\rm {max}}({\boldsymbol{A}})\) | maximal singular value of a matrix \( {\boldsymbol{A}}\)
\( s_{\rm {min}}({\boldsymbol{A}})\) | minimal (positive) singular value of a matrix \( {\boldsymbol{A}}\)
\( {\rm {size}}(\Phi)\) | number of free network parameters in \( \Phi\) | Definition 2
\( {\rm {span}}(S)\) | linear hull or span of \( S\)
\( \mathcal{T}\) | triangulation | Definition 12
\( \mathcal{V}\) | set of nodes in a triangulation | Definition 12
\( \mathbb{V}[X]\) | variance of random variable \( X\) | Section 18.2.2
\( \mathrm{VCdim}(\mathcal{H})\) | VC dimension of a set of functions \( \mathcal{H}\) | Definition 36
\( \mathcal{W}\) | distribution of weight initialization | Section 12.6.1
\( {\boldsymbol{W}}^{(\ell)}, {\boldsymbol{b}}^{(\ell)}\) | weights and biases in layer \( \ell\) of a neural network | Definition 1
\( {\rm {width}}(\Phi)\) | width of \( \Phi\) | Definition 1
\( {\boldsymbol{x}}^{(\ell)}\) | output of the \( \ell\)-th layer of a neural network | Definition 1
\( \bar{\boldsymbol{x}}^{(\ell)}\) | preactivations | (150)
\( X'\) | dual space to a normed space \( X\) | Definition 53

2 Introduction

2.1 Mathematics of deep learning

In 2012, a deep learning architecture revolutionized the field of computer vision by achieving unprecedented performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [4]. The deep learning architecture, known as AlexNet, significantly outperformed all competing approaches. A few years later, in March 2016, a deep learning-based architecture called AlphaGo defeated the best Go player at the time, Lee Sedol, in a five-game match [5]. Go is a highly complex board game with a vast number of possible moves, making it a challenging problem for artificial intelligence. Because of this complexity, many researchers believed that defeating a top human Go player was a feat that would only be achieved decades later.

These breakthroughs along with many others, including DeepMind’s AlphaFold [6], which revolutionized protein structure prediction in 2020, the unprecedented language capabilities of large language models like GPT-3 (and later versions) [7, 8], and the emergence of generative AI models like Stable Diffusion, Midjourney, and DALL-E, have sparked interest among scientists across (almost) all disciplines. Likewise, while mathematical research on neural networks has a long history, these groundbreaking developments revived interest in the theoretical underpinnings of deep learning among mathematicians. However, initially, there was a clear consensus in the mathematics community: We do not understand why this technology works so well! In fact, there are many mathematical reasons that, at least superficially, should prevent the observed success.

Over the past decade the field has matured, and mathematicians have gained a more profound understanding of deep learning, although many open questions remain. Recent years have brought various new explanations and insights into the inner workings of these models. Before discussing them in detail in the following chapters, we first give a high-level introduction to deep learning, with a focus on the supervised learning framework for function approximation, which is the central theme of this book.

2.2 High-level overview of deep learning

Deep learning refers to the use of deep neural networks trained by gradient-based methods to identify unknown input-output relationships. This approach has three key ingredients: deep neural networks, gradient-based training, and prediction. We now explain each of these ingredients separately.

Figure 1. Illustration of a single neuron \( \nu\) . The neuron receives six inputs \( (x_1, \dots, x_6) = {\boldsymbol{x}}\) , computes their weighted sum \( \sum_{j=1}^6x_jw_j\) , adds a bias \( b\) , and finally applies the activation function \( \sigma\) to produce the output \( \nu(x)\) .
2.2.1 Deep Neural Networks

Deep neural networks are formed by a combination of neurons. A neuron is a function of the form

\[ \begin{align} \mathbb{R}^d \ni {\boldsymbol{x}} \mapsto \nu({\boldsymbol{x}}) = \sigma( {\boldsymbol{w}}^\top {\boldsymbol{x}} +b ), \end{align} \]

(1)

where \( {\boldsymbol{w}} \in \mathbb{R}^d\) is a weight vector, \( b\in \mathbb{R}\) is called bias, and the function \( \sigma\) is referred to as an activation function. This concept is due to McCulloch and Pitts [9] and is a mathematical model for biological neurons. If we consider \( \sigma\) to be the Heaviside function, i.e., \( \sigma = \boldsymbol{1}_{\mathbb{R}_+}\) with \( \mathbb{R}_+\mathrm{:}= [0,\infty)\) , then the neuron “fires” if the weighted sum of the inputs \( {\boldsymbol{x}}\) surpasses the threshold \( -b\) . We depict a neuron in Figure 1. Note that if we fix \( d\) and \( \sigma\) , then the set of neurons can be naturally parameterized by the \( d+1\) real values \( w_1,\dots,w_d,b\in\mathbb{R}\) .
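
To make (1) concrete, the following NumPy sketch (an illustration only; the specific inputs, weights, and the Heaviside choice of \( \sigma\) are ours) evaluates a single neuron on a six-dimensional input as in Figure 1.

```python
import numpy as np

def heaviside(z):
    """Heaviside activation: 1 if z >= 0, else 0 (the McCulloch-Pitts choice)."""
    return np.where(z >= 0, 1.0, 0.0)

def neuron(x, w, b, sigma=heaviside):
    """Evaluate nu(x) = sigma(w^T x + b)."""
    return sigma(np.dot(w, x) + b)

# Example: six inputs as in Figure 1; the neuron "fires" iff sum_j w_j x_j >= -b.
x = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.5])
w = np.array([0.2,  0.1, 0.4, 0.3, 0.1,  0.5])
b = -1.0
print(neuron(x, w, b))
```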

Neural networks are functions formed by connecting neurons, where the output of one neuron becomes the input to another. One simple but very common type of neural network is the so-called feedforward neural network. This structure distinguishes itself by having the neurons grouped in layers, and the inputs to neurons in the \( (\ell+1)\) -st layer are exclusively neurons from the \( \ell\) -th layer.

We start by defining a shallow feedforward neural network as an affine transformation applied to the output of a set of neurons that share the same input and the same activation function. Here, an affine transformation is a map \( T:\mathbb{R}^p\to\mathbb{R}^q\) such that \( T({\boldsymbol{x}})={\boldsymbol{W}}{\boldsymbol{x}}+{\boldsymbol{b}}\) for some \( {\boldsymbol{W}}\in\mathbb{R}^{q\times p}\) , \( {\boldsymbol{b}}\in\mathbb{R}^q\) where \( p\) , \( q \in \mathbb{N}\) .

Formally, a shallow feedforward neural network is, therefore, a map \( \Phi\) of the form

\[ \mathbb{R}^d \ni {\boldsymbol{x}} \mapsto \Phi({\boldsymbol{x}}) = T_1\circ\sigma\circ T_0({\boldsymbol{x}}) \]

where \( T_0\) , \( T_1\) are affine transformations and the application of \( \sigma\) is understood to be componentwise, i.e., in each component of \( T_0({\boldsymbol{x}})\) . A visualization of a shallow neural network is given in Figure 2.

A deep feedforward neural network is constructed by compositions of shallow neural networks. This yields a map of the type

\[ \mathbb{R}^d \ni {\boldsymbol{x}} \mapsto \Phi({\boldsymbol{x}}) = T_{L+1}\circ\sigma\circ \cdots \circ T_1 \circ\sigma\circ T_0({\boldsymbol{x}}), \]

where \( L \in \mathbb{N}\) and \( (T_j)_{j= 0}^{L+1}\) are affine transformations. The number of compositions \( L\) is referred to as the number of layers of the deep neural network. Similar to a single neuron, (deep) neural networks can be viewed as a parameterized function class, with the parameters being the entries of the matrices and vectors determining the affine transformations \( (T_j)_{j= 0}^{L+1}\) .

Figure 2. Illustration of a shallow neural network. The affine transformation \( T_0\) is of the form \( (x_1, \dots, x_6) = {\boldsymbol{x}}\mapsto {\boldsymbol{W}}{\boldsymbol{x}} + {\boldsymbol{b}}\) , where the rows of \( {\boldsymbol{W}}\) are the weight vectors \( {\boldsymbol{w}}_1\) , \( {\boldsymbol{w}}_2\) , \( {\boldsymbol{w}}_3\) for each respective neuron.
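
The composition structure translates directly into code. The sketch below (illustrative only; the tanh activation and the random affine maps are arbitrary choices of ours) builds a shallow network \( T_1\circ\sigma\circ T_0\) from two affine maps, mirroring Figure 2, and the same helper yields deep networks when more affine maps are passed in.

```python
import numpy as np

def affine(W, b):
    """Return the affine map T(x) = W x + b."""
    return lambda x: W @ x + b

def compose_network(affine_maps, sigma):
    """Build Phi = T_k o sigma o ... o T_1 o sigma o T_0, with sigma applied componentwise."""
    def phi(x):
        z = affine_maps[0](np.asarray(x, dtype=float))
        for T in affine_maps[1:]:
            z = T(sigma(z))
        return z
    return phi

# Shallow example R^6 -> R as in Figure 2: T_0 maps into R^3 (three neurons), T_1 into R.
rng = np.random.default_rng(0)
T0 = affine(rng.standard_normal((3, 6)), rng.standard_normal(3))
T1 = affine(rng.standard_normal((1, 3)), rng.standard_normal(1))
phi = compose_network([T0, T1], sigma=np.tanh)
print(phi(rng.standard_normal(6)))
# A deep network is obtained the same way by passing more affine maps to compose_network.
```
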
2.2.2 Gradient-based training

After defining the structure or architecture of the neural network, e.g., the activation function and the number of layers, the second step of deep learning consists of determining suitable values for its parameters. In practice this is achieved by minimizing an objective function. In supervised learning, which will be our focus, this objective depends on a collection of input-output pairs, commonly known as training data or simply as a sample. Concretely, let \( S = ({\boldsymbol{x}}_i, {\boldsymbol{y}}_i)_{i=1}^m\) be a sample, where \( {\boldsymbol{x}}_i \in \mathbb{R}^d\) represents the inputs and \( {\boldsymbol{y}}_i \in \mathbb{R}^k\) the corresponding outputs with \( d\) , \( k \in \mathbb{N}\) . Our goal is to find a deep neural network \( \Phi\) such that

\[ \begin{align} \Phi({\boldsymbol{x}}_i) \approx {\boldsymbol{y}}_i~~\text{for all } i=1,\dots,m \end{align} \]

(2)

in a meaningful sense. For example, we could interpret “\( \approx\) ” to mean closeness with respect to the Euclidean norm, or more generally, that \( \mathcal{L}(\Phi({\boldsymbol{x}}_i), {\boldsymbol{y}}_i)\) is small for a function \( \mathcal{L}\) measuring the dissimilarity between its inputs. Such a function \( \mathcal{L}\) is called a loss function. A standard way of achieving (2) is by minimizing the so-called empirical risk of \( \Phi\) with respect to the sample \( S\) defined as

\[ \begin{align} \widehat{\mathcal{R}}_S(\Phi) = \frac{1}{m}\sum_{i=1}^m\mathcal{L}(\Phi({\boldsymbol{x}}_i), {\boldsymbol{y}}_i). \end{align} \]

(3)

This quantity serves as a measure of how well \( \Phi\) predicts \( {\boldsymbol{y}}_i\) at the training points \( {\boldsymbol{x}}_1, \dots, {\boldsymbol{x}}_m\) .

If \( \mathcal{L}\) is differentiable, and for all \( {\boldsymbol{x}}_i\) the output \( \Phi({\boldsymbol{x}}_i)\) depends differentiably on the parameters of the neural network, then the gradient of the empirical risk \( \widehat{\mathcal{R}}_S(\Phi)\) with respect to the parameters is well-defined. This gradient can be efficiently computed using a technique called backpropagation. This allows one to minimize (3) with optimization algorithms such as (stochastic) gradient descent. These algorithms produce a sequence of neural network parameters, and corresponding neural network functions \( \Phi_1, \Phi_2, \dots \) , for which the empirical risk is expected to decrease. Figure 3 illustrates a possible behavior of this sequence.
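
To illustrate (3) and gradient-based training, the following self-contained NumPy sketch (a toy setup of our own; the data, the tanh activation, the learning rate, and the variable names are not taken from the text) fits a shallow network to a one-dimensional sample with plain gradient descent, with the backpropagation of the gradient written out by hand.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D regression sample S = (x_i, y_i), i = 1, ..., m.
m = 50
X = rng.uniform(-2, 2, size=(m, 1))
Y = np.sin(2 * X) + 0.1 * rng.standard_normal((m, 1))

# Shallow network Phi(x) = W1 tanh(W0 x + b0) + b1 with n hidden neurons.
n = 20
W0 = rng.standard_normal((n, 1)); b0 = np.zeros((n, 1))
W1 = rng.standard_normal((1, n)); b1 = np.zeros((1, 1))

lr = 0.05
for step in range(2000):
    # Forward pass (all m samples at once; columns are samples).
    Z = W0 @ X.T + b0          # pre-activations, shape (n, m)
    A = np.tanh(Z)             # hidden layer outputs
    P = W1 @ A + b1            # network outputs, shape (1, m)
    R = P - Y.T                # residuals
    risk = np.mean(R ** 2)     # empirical risk with the square loss

    # Backward pass (backpropagation of the gradient of the empirical risk).
    dP = 2 * R / m                     # d risk / d P
    dW1 = dP @ A.T; db1 = dP.sum(axis=1, keepdims=True)
    dA = W1.T @ dP
    dZ = dA * (1 - A ** 2)             # tanh'(z) = 1 - tanh(z)^2
    dW0 = dZ @ X; db0 = dZ.sum(axis=1, keepdims=True)

    # Gradient-descent update of all parameters.
    W0 -= lr * dW0; b0 -= lr * db0
    W1 -= lr * dW1; b1 -= lr * db1

print("final empirical risk:", risk)
```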

2.2.3 Prediction

The final part of deep learning concerns the question of whether we have actually learned something by the procedure above. Suppose that our optimization routine has either converged or has been terminated, yielding a neural network \( \Phi_*\) . While the optimization aimed to minimize the empirical risk on the training sample \( S\) , our ultimate interest is not in how well \( \Phi_*\) performs on \( S\) . Rather, we are interested in its performance on new data points \( ({\boldsymbol{x}}_{\rm {new}}, {\boldsymbol{y}}_{\rm new})\) outside of \( S\) .

To make meaningful statements about this, we assume the existence of a data distribution \( \mathcal{D}\) on the input-output space—in our case, this is \( \mathbb{R}^d \times \mathbb{R}^k\) —such that both the elements of \( S\) and all other data points are drawn from this distribution. In other words, we treat \( S\) as an i.i.d. draw from \( \mathcal{D}\) , and \( ({\boldsymbol{x}}_{\rm {new}}, {\boldsymbol{y}}_{\rm new})\) also as sampled independently from \( \mathcal{D}\) . If we want \( \Phi_*\) to perform well on average, then this amounts to controlling the following expression

\[ \begin{align} \mathcal{R}(\Phi_*) = \mathbb{E}_{({\boldsymbol{x}}_{\rm new}, {\boldsymbol{y}}_{\rm new}) \sim \mathcal{D}}[\mathcal{L}(\Phi_*({\boldsymbol{x}}_{\rm new}), {\boldsymbol{y}}_{\rm new})], \end{align} \]

(4)

which is called the risk of \( \Phi_*\) . If the risk is not much larger than the empirical risk, then we say that the neural network \( \Phi_*\) has a small generalization error. On the other hand, if the risk is much larger than the empirical risk, then we say that \( \Phi_*\) overfits the training data, meaning that \( \Phi_*\) has memorized the training samples, but does not generalize well to data outside of the training set.

Figure 3. A sequence of one dimensional neural networks \( \Phi_1, \dots, \Phi_4\) that successively minimizes the empirical risk for the sample \( S = (x_i,y_i)_{i=1}^6\) .
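
The gap between (3) and (4) can be illustrated with a deliberately overfitting hypothesis. In the following sketch (our own toy construction; the distribution, the polynomial hypothesis, and the sample sizes are arbitrary), a polynomial interpolates the training sample exactly, so the empirical risk is essentially zero, while a Monte Carlo estimate of the risk on fresh draws from \( \mathcal{D}\) is much larger.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_D(m):
    """Draw m i.i.d. pairs from a toy data distribution D: x ~ Uniform(-1, 1), y = x**2 + noise."""
    x = rng.uniform(-1, 1, size=m)
    y = x ** 2 + 0.1 * rng.standard_normal(m)
    return x, y

def square_loss(pred, y):
    return (pred - y) ** 2

# A hypothesis that interpolates a small training sample exactly (a degree m-1 polynomial).
x_train, y_train = sample_D(10)
coeffs = np.polyfit(x_train, y_train, deg=len(x_train) - 1)
phi_star = lambda x: np.polyval(coeffs, x)

empirical_risk = np.mean(square_loss(phi_star(x_train), y_train))  # essentially zero
x_new, y_new = sample_D(100_000)                                   # fresh draws from D
risk_estimate = np.mean(square_loss(phi_star(x_new), y_new))       # Monte Carlo estimate of (4)

print(empirical_risk, risk_estimate)  # a large gap indicates overfitting
```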

2.3 Why does it work?

It is natural to wonder why the deep learning pipeline, as outlined in the previous subsection, ultimately succeeds in learning, i.e., in achieving a small risk. Is it true that for a given sample \( ({\boldsymbol{x}}_i,{\boldsymbol{y}}_i)_{i=1}^m\) there exists a neural network such that \( \Phi({\boldsymbol{x}}_i) \approx {\boldsymbol{y}}_i\) for all \( i = 1, \dots, m\) ? Does the optimization routine produce a meaningful result? Can we control the risk, knowing only that the empirical risk is small?

While most of these questions can be answered affirmatively under certain assumptions, these assumptions often do not apply to deep learning in practice. We next explore some potential explanations and show that they lead to even more questions.

2.3.1 Approximation

A fundamental result in the study of neural networks is the so-called universal approximation theorem, which will be discussed in Chapter 4. This result states that every continuous function on a compact domain can be approximated arbitrarily well (in a uniform sense) by a shallow neural network.

This result, however, does not address the practically relevant question of efficiency. For example, if we aim for computational efficiency, then we may be interested in identifying the smallest neural network that fits the data. This naturally raises the question: What is the role of the architecture for the expressive capabilities of neural networks? Furthermore, viewing empirical risk minimization as an approximation problem, we are confronted with a central challenge in approximation theory: the curse of dimensionality. Function approximation in high dimensions is notoriously difficult and becomes exponentially harder as the dimensionality increases. Yet, many successful deep learning architectures operate in this high-dimensional regime. Why do these neural networks appear to overcome this so-called curse?

2.3.2 Optimization

While gradient descent can sometimes be proven to converge to a global minimum, as we will discuss in Chapter 11, this typically requires the objective function to be at least convex. However, there is no reason to believe that, for example, the empirical risk is a convex function of the network parameters. In fact, due to the repeated compositions with the nonlinear activation function in the network, the empirical risk is typically highly non-linear and not convex. Therefore, there is generally no guarantee that the optimization routine will converge to a global minimum; it may get stuck in a local (and non-global) minimum or a saddle point. Why is the output of the optimization nonetheless often meaningful in practice?

2.3.3 Generalization

In traditional statistical learning theory, which we will review in Chapter 15, the extent to which the risk exceeds the empirical risk can be bounded a priori; such bounds are often expressed in terms of a notion of complexity of the set of admissible functions (the class of neural networks) divided by the number of training samples. For the class of neural networks of a fixed architecture, the complexity roughly amounts to the number of neural network parameters. In practice, neural networks with more parameters than training samples are typically used. This is dubbed the overparameterized regime. In this regime, the classical estimates described above are void.

Why is it that, nonetheless, deep overparameterized architectures are capable of making accurate predictions on unseen data? Furthermore, while deep architectures often generalize well, they sometimes fail spectacularly on specific, carefully crafted examples. In image classification tasks, these examples may differ only slightly from correctly classified images in a way that is not perceptible to the human eye. Such examples are known as adversarial examples, and their existence poses a great challenge for applications of deep learning.

2.4 Outline and philosophy

This book addresses the questions raised in the previous section, providing answers that are mathematically rigorous and accessible. Our focus will be on provable statements, presented in a manner that prioritizes simplicity and clarity over generality. We will sometimes illustrate key ideas only in special cases, or under strong assumptions, both to avoid an overly technical exposition, and because definitive answers are often not yet available. In the following, we summarize the content of each chapter and highlight parts pertaining to the questions stated in the previous section.

Chapter 3: Feedforward neural networks. In this chapter, we introduce the main object of study of this book—the feedforward neural network.

Chapter 4: Universal approximation. We present the classical view of function approximation by neural networks, and give two instances of so-called universal approximation results. Such statements describe the ability of neural networks to approximate every function of a given class to arbitrary accuracy, given that the network size is sufficiently large. The first result, which holds under very broad assumptions on the activation function, is on uniform approximation of continuous functions on compact domains. The second result shows that for a very specific activation function, the network size can be chosen independently of the desired accuracy, highlighting that universal approximation needs to be interpreted with caution.

Chapter 5: Splines. Going beyond universal approximation, this chapter starts to explore approximation rates of neural networks. Specifically, we examine how well certain functions can be approximated relative to the number of parameters in the network. For so-called sigmoidal activation functions, we establish a link between neural-network- and spline-approximation. This reveals that smoother functions require fewer network parameters. However, achieving this increased efficiency necessitates the use of deeper neural networks. This observation offers a first glimpse into the importance of depth in deep learning.

Chapter 6: ReLU neural networks. This chapter focuses on one of the most popular activation functions in practice—the ReLU. We prove that the class of ReLU networks is equal to the set of continuous piecewise linear functions, thus providing a theoretical foundation for their expressive power. Furthermore, given a continuous piecewise linear function, we investigate the necessary width and depth of a ReLU network to represent it. Finally, we leverage approximation theory for piecewise linear functions to derive convergence rates for approximating Hölder continuous functions.

Chapter 7: Affine pieces for ReLU neural networks. Having gained some intuition about ReLU neural networks, in this chapter we address some potential limitations. We analyze ReLU neural networks by counting the number of affine regions that they generate. The key insight of this chapter is that deep neural networks can generate exponentially more regions than shallow ones. This observation provides further evidence for the potential advantages of depth in neural network architectures.

Chapter 8: Deep ReLU neural networks. Having identified the ability of deep ReLU neural networks to generate a large number of affine regions, we investigate whether this translates into an actual advantage in function approximation. Indeed, for approximating smooth functions, we prove substantially better approximation rates than we obtained for shallow neural networks. This adds again to our understanding of depth and its connections to expressive power of neural network architectures.

Chapter 9: High-dimensional approximation. The convergence rates established in the previous chapters deteriorate significantly in high-dimensional settings. This chapter examines three scenarios under which neural networks can provably overcome the curse of dimensionality.

Chapter 10: Interpolation. In this chapter we shift our perspective from approximation to exact interpolation of the training data. We analyze conditions under which exact interpolation is possible, and discuss the implications for empirical risk minimization. Furthermore, we present a constructive proof showing that ReLU networks can express an optimal interpolant of the data (in a specific sense).

Chapter 11: Training of neural networks. We start to examine the training process of deep learning. First, we study the fundamentals of (stochastic) gradient descent and convex optimization. Additionally, we examine accelerated methods and highlight the key principles behind popular training algorithms such as Adam. Finally, we discuss how the backpropagation algorithm can be used to implement these optimization algorithms for training neural networks.

Chapter 12: Wide neural networks and the neural tangent kernel. This chapter introduces the neural tangent kernel as a tool for analyzing the training behavior of neural networks. We begin by revisiting linear and kernel regression for the approximation of functions based on data. Additionally, we discuss the effect of adding a regularization term to the objective function. Afterwards, we show that, for certain architectures of sufficient width, the training dynamics of gradient descent resemble those of kernel regression and converge to a global minimum. This analysis provides insights into why, under certain conditions, we can train neural networks without getting stuck in (bad) local minima, despite the non-convexity of the objective function. Finally, we discuss a well-known link between neural networks and Gaussian processes, giving some indication of why overparameterized networks do not necessarily overfit in practice.

Chapter 13: Loss landscape analysis. In this chapter, we present an alternative view on the optimization problem, by analyzing the loss landscape—the empirical risk as a function of the neural network parameters. We give theoretical arguments showing that increasing overparameterization leads to greater connectivity between the valleys and basins of the loss landscape. Consequently, overparameterized architectures make it easier to reach a region where all minima are global minima. Additionally, we observe that most stationary points associated with non-global minima are saddle points. This sheds further light on the empirically observed fact that deep architectures can often be optimized without getting stuck in non-global minima.

Chapter 14: Shape of neural network spaces. While Chapters 12 and 13 highlight potential reasons for the success of neural network training, in this chapter, we show that the set of neural networks of a fixed architecture has some undesirable properties from an optimization perspective. Specifically, we show that this set is typically non-convex. Moreover, in general it does not possess the best-approximation property, meaning that there might not exist a neural network within the set yielding the best approximation for a given function.

Chapter 15: Generalization properties of deep neural networks. To understand why deep neural networks successfully generalize to unseen data points (outside of the training set), we study classical statistical learning theory, with a focus on neural network functions as the hypothesis class. We then show how to establish generalization bounds for deep learning, providing theoretical insights into the performance on unseen data.

Chapter 16: Generalization in the overparameterized regime. The generalization bounds of the previous chapter are not meaningful when the number of parameters of a neural network surpasses the number of training samples. However, this overparameterized regime is where many successful network architectures operate. To gain a deeper understanding of generalization in this regime, we describe the phenomenon of double descent and present a potential explanation. This addresses the question of why deep neural networks perform well despite being highly overparameterized.

Chapter 17: Robustness and adversarial examples. In the final chapter, we explore the existence of adversarial examples—inputs designed to deceive neural networks. We provide some theoretical explanations of why adversarial examples arise, and discuss potential strategies to prevent them.

2.5 Material not covered in this book

This book studies some central topics of deep learning but leaves out even more. Interesting questions associated with the field that were omitted, as well as some pointers to related works are listed below:

Advanced architectures: The (deep) feedforward neural network is far from the only type of neural network. In practice, architectures must be adapted to the type of data. For example, images exhibit strong spatial dependencies in the sense that adjacent pixels often have similar values. Convolutional neural networks [10] are particularly well suited for this type of input, as they employ convolutional filters that aggregate information from neighboring pixels, thus capturing the data structure better than a fully connected feedforward network. Similarly, graph neural networks [11] are a natural choice for graph-based data. For sequential data, such as natural language, architectures with some form of memory component are used, including Long Short-Term Memory (LSTM) networks [12] and attention-based architectures like transformers [7].

Unsupervised and Reinforcement Learning: While this book focuses on supervised learning, where each data point \( x_i\) has a label \( y_i\) , there is a vast field of machine learning called unsupervised learning, where labels are absent. Classical unsupervised learning problems include clustering and dimensionality reduction [13, Chapters 22/23].

A popular area in deep learning, where no labels are used, is physics-informed neural networks [14]. Here, a neural network is trained to satisfy a partial differential equation (PDE), with the loss function quantifying the deviation from this PDE.

Finally, reinforcement learning is a technique where an agent can interact with an environment and receives feedback based on its actions. The actions are guided by a so-called policy, which is to be learned [15, Chapter 17]. In deep reinforcement learning, this policy is modeled by a deep neural network. Reinforcement learning is the basis of the aforementioned AlphaGo.

Interpretability/Explainability and Fairness: The use of deep neural networks in critical decision-making processes, such as allocating scarce resources (e.g., organ transplants in medicine, financial credit approval, hiring decisions) or engineering (e.g., optimizing bridge structures, autonomous vehicle navigation, predictive maintenance), necessitates an understanding of their decision-making process. This is crucial for both practical and ethical reasons.

Practically, understanding how a model arrives at a decision can help us improve its performance and mitigate problems. It allows us to ensure that the model performs according to our intentions and does not produce undesirable outcomes. For example, in bridge design, understanding why a model suggests or rejects a particular configuration can help engineers identify potential vulnerabilities, ultimately leading to safer and more efficient designs. Ethically, transparent decision-making is crucial, especially when the outcomes have significant consequences for individuals or society; biases present in the data or model design can lead to discriminatory outcomes, making explainability essential.

However, explaining the predictions of deep neural networks is not straightforward. Despite knowledge of the network weights and biases, the repeated and complex interplay of linear transformations and non-linear activation functions often renders these models black boxes. A comprehensive overview of various techniques for interpretability, not only for deep neural networks, can be found in [16]. Regarding the topic of fairness, we refer for instance to [17, 18].

Implementation: While this book focuses on provable theoretical results, the field of deep learning is strongly driven by applications, and a thorough understanding of deep learning cannot be achieved without practical experience. For this, there exist numerous resources with excellent explanations. We recommend [19, 20, 21] as well as the countless online tutorials that are just a Google (or alternative) search away.

Many more: The field is evolving rapidly, and new ideas are constantly being generated and tested. This book cannot give a complete overview. However, we hope that it provides the reader with a solid foundation in the fundamental knowledge and principles to quickly grasp and understand new developments in the field.

Bibliography and further reading

Throughout this book, we will end each chapter with a short overview of related work and the references used in the chapter.

In this introductory chapter, we highlight several other recent textbooks and works on deep learning. For a historical survey on neural networks see [22] and also [23]. For general textbooks on neural networks and deep learning, we refer to the recent monographs [24, 25, 21]. More mathematical introductions to the topic are given, for example, in [26, 3, 27]. For the implementation of neural networks we refer, for example, to [19, 20].

3 Feedforward neural networks

Feedforward neural networks, henceforth simply referred to as neural networks (NNs), constitute the central object of study of this book. In this chapter, we provide a formal definition of neural networks, discuss the size of a neural network, and give a brief overview of common activation functions.

3.1 Formal definition

We previously defined a single neuron \( \nu\) in (1) and Figure 1. A neural network is constructed by connecting multiple neurons. Let us now make precise this connection procedure.

Definition 1

Let \( L\in \mathbb{N}\) , \( d_0,\dots,d_{L+1}\in\mathbb{N}\) , and let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) . A function \( \Phi \colon \mathbb{R}^{d_0}\to\mathbb{R}^{d_{L+1}}\) is called a neural network if there exist matrices \( {\boldsymbol{W}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}\times d_\ell}\) and vectors \( {\boldsymbol{b}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}}\) , \( \ell = 0, \dots, L\) , such that with

\[ \begin{align} {\boldsymbol{x}}^{(0)} &\mathrm{:}= {\boldsymbol{x}} \end{align} \]

(5.a)

\[ \begin{align} {\boldsymbol{x}}^{(\ell)}&\mathrm{:}= \sigma({\boldsymbol{W}}^{(\ell-1)}{\boldsymbol{x}}^{(\ell-1)}+{\boldsymbol{b}}^{(\ell-1)})&&\text{for } \ell \in\{1, \dots, L\} \end{align} \]

(5.b)

\[ \begin{align} {\boldsymbol{x}}^{(L+1)}&\mathrm{:}= {\boldsymbol{W}}^{(L)}{\boldsymbol{x}}^{(L)}+{\boldsymbol{b}}^{(L)} \end{align} \]

(5.c)

we have

\[ \begin{align*} \Phi({\boldsymbol{x}}) = {\boldsymbol{x}}^{(L+1)}~~ \text{for all }{\boldsymbol{x}}\in\mathbb{R}^{d_0}. \end{align*} \]

We call \( L\) the depth, \( d_{\rm {max}} =\max_{\ell= 1, \dots, L} d_\ell\) the width, \( \sigma\) the activation function, and \( (\sigma;d_0,\dots,d_{L+1})\) the architecture of the neural network \( \Phi\) . Moreover, \( {\boldsymbol{W}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}\times d_\ell}\) are the weight matrices and \( {\boldsymbol{b}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}}\) the bias vectors of \( \Phi\) for \( \ell = 0, \dots L\) .

Remark 1

Typically, there exist different choices of architectures, weights, and biases yielding the same function \( \Phi:\mathbb{R}^{d_0}\to\mathbb{R}^{d_{L+1}}\) . For this reason we cannot associate a unique meaning to these notions solely based on the function realized by \( \Phi\) . In the following, when we refer to the properties of a neural network \( \Phi\) , it is always understood to mean that there exists at least one construction as in Definition 1, which realizes the function \( \Phi\) and uses parameters that satisfy those properties.

The architecture of a neural network is often depicted as a connected graph, as illustrated in Figure 4. The nodes in such graphs represent (the output of) the neurons. They are arranged in layers, with \( {\boldsymbol{x}}^{(\ell)}\) in Definition 1 corresponding to the neurons in layer \( \ell\) . We also refer to \( {\boldsymbol{x}}^{(0)}\) in (5.a) as the input layer and to \( {\boldsymbol{x}}^{(L+1)}\) in (5.c) as the output layer. All layers in between are referred to as the hidden layers and their output is given by (5.b). The number of hidden layers corresponds to the depth. For the correct interpretation of such graphs, we note that by our conventions in Definition 1, the activation function is applied after each affine transformation, except in the final layer.

Neural networks of depth one are called shallow; if the depth is larger than one, they are called deep. The notion of deep neural networks is not used entirely consistently in the literature, and some authors use the word deep only if the depth is much larger than one, where the precise meaning of “much larger” depends on the application.

Throughout, we only consider neural networks in the sense of Definition 1. We emphasize however, that this is just one (simple but very common) type of neural network. Many adjustments to this construction are possible and also widely used. For example:

  • We may use different activation functions \( \sigma_\ell\) in each layer \( \ell\) or we may even use a different activation function for each node.
  • Residual neural networks allow “skip connections” [28]. This means that information is allowed to skip layers in the sense that the nodes in layer \( \ell\) may have \( {\boldsymbol{x}}^{(0)},\dots,{\boldsymbol{x}}^{(\ell-1)}\) as their input (and not just \( {\boldsymbol{x}}^{(\ell-1)}\) ), cf. (5).
  • In contrast to feedforward neural networks, recurrent neural networks allow information to flow backward, in the sense that \( {\boldsymbol{x}}^{(\ell-1)},\dots,{\boldsymbol{x}}^{(L+1)}\) may serve as input for the nodes in layer \( \ell\) (and not just \( {\boldsymbol{x}}^{(\ell-1)}\) ). This creates loops in the flow of information, and one has to introduce a time index \( t\in\mathbb{N}\) , as the output of a node in time step \( t\) might be different from the output in time step \( t+1\) .
Figure 4. Sketch of a neural network with three hidden layers, and \( d_0 = 3\) , \( d_1= 4\) , \( d_2=3\) , \( d_3 = 4\) , \( d_4 = 2\) . The neural network has depth three and width four.
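
The recursion (5.a)-(5.c) translates directly into code. The sketch below (illustrative only; the ReLU activation and the random weights are arbitrary choices of ours) evaluates a network with the architecture of Figure 4.

```python
import numpy as np

def nn_forward(x, weights, biases, sigma):
    """Direct transcription of (5.a)-(5.c): weights[l] = W^(l), biases[l] = b^(l) for l = 0, ..., L;
    sigma is applied componentwise in the hidden layers only."""
    z = np.asarray(x, dtype=float)                   # x^(0) := x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = sigma(W @ z + b)                         # x^(l) = sigma(W^(l-1) x^(l-1) + b^(l-1))
    return weights[-1] @ z + biases[-1]              # x^(L+1) = W^(L) x^(L) + b^(L)

# Architecture (sigma; d_0, ..., d_4) = (ReLU; 3, 4, 3, 4, 2) as in Figure 4: depth 3, width 4.
dims = [3, 4, 3, 4, 2]
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]
print(nn_forward([1.0, -2.0, 0.5], Ws, bs, sigma=lambda z: np.maximum(z, 0.0)))
```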

Let us clarify some further common terminology used in the context of neural networks:

  • parameters: The parameters of a neural network refer to the set of all entries of the weight matrices and bias vectors. These are often collected in a single vector

    \[ \begin{equation} {\boldsymbol{w}}=(({\boldsymbol{W}}^{(0)},{\boldsymbol{b}}^{(0)}),\dots,({\boldsymbol{W}}^{(L)},{\boldsymbol{b}}^{(L)})). \end{equation} \]

    (6)

    These parameters are adjustable and are learned during the training process, determining the specific function realized by the network (a short sketch that builds and counts these parameters for a concrete architecture is given after this list).

  • hyperparameters: Hyperparameters are settings that define the network’s architecture (and training process), but are not directly learned during training. Examples include the depth, the number of neurons in each layer, and the choice of activation function. They are typically set before training begins.
  • weights: The term “weights” is often used broadly to refer to all parameters of a neural network, including both the weight matrices and bias vectors.
  • model: For a fixed architecture, every choice of network parameters \( {\boldsymbol{w}}\) as in (6) defines a specific function \( {\boldsymbol{x}}\mapsto \Phi_{\boldsymbol{w}}({\boldsymbol{x}})\) . In deep learning this function is often referred to as a model. More generally, “model” can be used to describe any function parameterized by a set of parameters \( {\boldsymbol{w}}\in\mathbb{R}^n\) , \( n \in \mathbb{N}\) .
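
For a fixed architecture \( (\sigma; d_0,\dots,d_{L+1})\) , the number of entries of the parameter vector \( {\boldsymbol{w}}\) in (6) is \( \sum_{\ell=0}^{L}(d_{\ell+1}d_\ell + d_{\ell+1})\) . The following sketch (our own illustration, using the architecture of Figure 4) builds such a parameter list and checks the count.

```python
import numpy as np

def init_params(dims, rng):
    """Parameters w = ((W^(0), b^(0)), ..., (W^(L), b^(L))) for layer widths dims = (d_0, ..., d_{L+1})."""
    return [(rng.standard_normal((dims[l + 1], dims[l])), rng.standard_normal(dims[l + 1]))
            for l in range(len(dims) - 1)]

def num_params(dims):
    """Number of entries of w: for each layer, d_{l+1} * d_l weight entries plus d_{l+1} bias entries."""
    return sum(dims[l + 1] * dims[l] + dims[l + 1] for l in range(len(dims) - 1))

dims = [3, 4, 3, 4, 2]                    # the architecture of Figure 4
w = init_params(dims, np.random.default_rng(0))
flat = np.concatenate([np.concatenate([W.ravel(), b]) for W, b in w])
print(num_params(dims), flat.size)        # both equal 16 + 15 + 16 + 10 = 57
```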

3.1.1 Basic operations on neural networks

There are various ways in which neural networks can be combined with one another. The next proposition addresses this for linear combinations, compositions, and parallelization. The formal proof, which is a good exercise to familiarize oneself with neural networks, is left as Exercise 1; a code sketch of the block construction behind the parallelization is given after the proposition.

Proposition 1

For two neural networks \( \Phi_1\) , \( \Phi_2\) , with architectures

\[ (\sigma;d_0^1, d_1^1, \dots, d_{L_1+1}^1)~ \text{ and }~ (\sigma;d_0^2, d_1^2, \dots, d_{L_2+1}^2) \]

respectively, it holds that

  1. for all \( \alpha \in \mathbb{R}\) there exists a neural network \( \Phi_\alpha\) with architecture \( (\sigma;d_0^1, d_1^1, \dots, d_{L_1+1}^1)\) such that

    \[ \Phi_\alpha({\boldsymbol{x}}) = \alpha \Phi_1({\boldsymbol{x}}) ~~ \text{ for all } {\boldsymbol{x}} \in \mathbb{R}^{d_0^1}, \]
  2. if \( d_0^1 = d_0^2 \mathrm{:=} d_0\) and \( L_1 = L_2 \mathrm{:=} L\) , then there exists a neural network \( \Phi_{{\rm {parallel}}}\) with architecture \( (\sigma;d_0, d_1^1 + d_1^2, \dots, d_{L+1}^1 + d_{L+1}^2)\) such that

    \[ \Phi_{\rm parallel}({\boldsymbol{x}}) = (\Phi_1({\boldsymbol{x}}), \Phi_2({\boldsymbol{x}})) ~~ \text{ for all } {\boldsymbol{x}} \in \mathbb{R}^{d_0}, \]
  3. if \( d_0^1 = d_0^2 \mathrm{:=} d_0\) , \( L_1 = L_2 \mathrm{:=} L\) , and \( d_{L+1}^1 = d_{L+1}^2 \mathrm{:=} d_{L+1}\) , then there exists a neural network \( \Phi_{\rm{sum}}\) with architecture \( (\sigma;d_0, d_1^1 + d_1^2, \dots, d_{L}^1 + d_{L}^2, d_{L+1})\) such that

    \[ \Phi_{\rm sum}({\boldsymbol{x}}) = \Phi_1({\boldsymbol{x}}) + \Phi_2({\boldsymbol{x}}) ~~ \text{ for all } {\boldsymbol{x}} \in \mathbb{R}^{d_0}, \]
  4. if \( d_{L_1+1}^1 = d_0^2\) , then there exists a neural network \( \Phi_{\rm {comp}}\) with architecture \( (\sigma; d_0^1, d_1^1, \dots, d_{L_1}^1, d_1^2, \dots, d_{L_2+1}^2)\) such that

    \[ \Phi_{\rm comp}({\boldsymbol{x}}) = \Phi_2 \circ \Phi_1({\boldsymbol{x}}) ~~ \text{ for all } {\boldsymbol{x}} \in \mathbb{R}^{d_0^1}. \]
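
The idea behind item 2 can be sketched in code. This is only an illustration of the block construction and not the proof requested in Exercise 1; the helper names and the ReLU used for checking are our own choices. Since both networks read the same input, the first weight matrices are stacked vertically, while all later layers are combined block-diagonally.

```python
import numpy as np

def block_diag(A, B):
    """The block matrix [[A, 0], [0, B]]."""
    return np.block([[A, np.zeros((A.shape[0], B.shape[1]))],
                     [np.zeros((B.shape[0], A.shape[1])), B]])

def parallelize(params1, params2):
    """Given parameters [(W^(0), b^(0)), ...] of two networks with the same depth and input
    dimension, return parameters of a network computing x -> (Phi1(x), Phi2(x))."""
    (W1, b1), (W2, b2) = params1[0], params2[0]
    params = [(np.vstack([W1, W2]), np.concatenate([b1, b2]))]   # shared input: stack vertically
    for (W1, b1), (W2, b2) in zip(params1[1:], params2[1:]):
        params.append((block_diag(W1, W2), np.concatenate([b1, b2])))
    return params

def forward(x, params, sigma=lambda z: np.maximum(z, 0.0)):
    z = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        z = sigma(W @ z + b)
    W, b = params[-1]
    return W @ z + b

rng = np.random.default_rng(0)
make = lambda dims: [(rng.standard_normal((dims[i + 1], dims[i])), rng.standard_normal(dims[i + 1]))
                     for i in range(len(dims) - 1)]
p1, p2 = make([3, 5, 2]), make([3, 4, 1])
x = rng.standard_normal(3)
print(forward(x, parallelize(p1, p2)))                     # coincides with the next line
print(np.concatenate([forward(x, p1), forward(x, p2)]))
```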

3.2 Notion of size

Neural networks provide a framework to parametrize functions. Ultimately, our goal is to find a neural network that fits some underlying input-output relation. As mentioned above, the architecture (depth, width, and activation function) is typically chosen a priori and considered fixed. During training of the neural network, its parameters (weights and biases) are suitably adapted by some algorithm. Depending on the application, on top of the stated architecture choices, further restrictions on the weights and biases can be desirable. For example, the following two appear frequently:

  • weight sharing: This is a technique where specific entries of the weight matrices (or bias vectors) are constrained to be equal. Formally, this means imposing conditions of the form \( W_{k,l}^{(i)}=W^{(j)}_{s,t}\) , i.e., the entry \( (k,l)\) of the \( i\) th weight matrix is equal to the entry at position \( (s,t)\) of the \( j\) th weight matrix. We denote this assumption by \( (i,k,l)\sim (j,s,t)\) , noting that “\( \sim\) ” is an equivalence relation. During training, shared weights are updated jointly, meaning that any change to one weight is simultaneously applied to all other weights of this class. Weight sharing can also be applied to the entries of bias vectors.
  • sparsity: This refers to imposing a sparsity structure on the weight matrices (or bias vectors). Specifically, we set \( W_{k,l}^{(i)}=0\) a priori for certain \( (k,l,i)\) , i.e., we impose entry \( (k,l)\) of the \( i\) th weight matrix to be \( 0\) . These zero-valued entries are considered fixed, and are not adjusted during training. The condition \( W_{k,l}^{(i)}=0\) corresponds to node \( l\) of layer \( i-1\) not serving as an input to node \( k\) in layer \( i\) . If we represent the neural network as a graph, this is indicated by not connecting the corresponding nodes. Sparsity can also be imposed on the bias vectors.

Both of these restrictions decrease the number of learnable parameters in the neural network. The number of parameters can be seen as a measure of the complexity of the represented function class. For this reason, we introduce \( {\rm {size}}(\Phi)\) as a notion for the number of learnable parameters. Formally (with \( |S|\) denoting the cardinality of a set \( S\) ):

Definition 2

Let \( \Phi\) be as in Definition 1. Then the size of \( \Phi\) is

\[ \begin{align} {\rm size}(\Phi)\mathrm{:}= \left|\left(\{(i,k,l)\,|\,W_{k,l}^{(i)}\neq 0\}\cup \{(i,k)\,|\,b_{k}^{(i)}\neq 0\} \right)\big/\sim\right|. \end{align} \]

(7)
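
In the special case without weight sharing, the quotient by \( \sim\) in (7) is trivial, and \( {\rm {size}}(\Phi)\) is simply the number of nonzero weight and bias entries. A minimal sketch under this assumption (the concrete matrices are our own example):

```python
import numpy as np

def size_no_sharing(weights, biases):
    """size(Phi) as in (7) when no weight sharing is imposed:
    the number of nonzero entries of all weight matrices and bias vectors."""
    return sum(int(np.count_nonzero(W)) for W in weights) + \
           sum(int(np.count_nonzero(b)) for b in biases)

# A sparse 2 -> 3 -> 1 network: several weight entries and one bias are fixed to zero.
W0 = np.array([[1.0, 0.0],
               [0.0, 2.0],
               [0.5, 0.0]])
b0 = np.array([0.0, 1.0, -1.0])
W1 = np.array([[1.0, -1.0, 0.0]])
b1 = np.array([0.5])
print(size_no_sharing([W0, W1], [b0, b1]))   # 3 + 2 + 2 + 1 = 8
```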

3.3 Activation functions

Activation functions are a crucial part of neural networks, as they introduce nonlinearity into the model. If an affine activation function were used, the resulting neural network function would also be affine and hence very restricted in what it can represent.

The choice of activation function can have a significant impact on the performance, but there does not seem to be a universally optimal one. We next discuss a few important activation functions and highlight some common issues associated with them.

Figure 5. Different activation functions: (a) Sigmoid, (b) ReLU and SiLU, (c) Leaky ReLU.

Sigmoid: The sigmoid activation function is given by

\[ \sigma_{\rm sig}(x) = \frac{1}{1+e^{-x}}~~\text{for }x \in \mathbb{R}, \]

and depicted in Figure 5 (a). Its output ranges between zero and one, making it interpretable as a probability. The sigmoid is a smooth function, which allows the application of gradient-based training.

It has the disadvantage that its derivative becomes very small if \( |x| \to \infty\) . This can affect learning due to the so-called vanishing gradient problem. Consider the simple neural network \( \Phi_n(x) = \sigma \circ \dots \circ \sigma(x + b)\) defined with \( n \in \mathbb{N}\) compositions of \( \sigma\) , and where \( b\in\mathbb{R}\) is a bias. Its derivative with respect to \( b\) is

\[ \frac{\,\mathrm{d}}{\,\mathrm{d} b} \Phi_n(x) = \sigma'(\Phi_{n-1}(x)) \frac{\,\mathrm{d}}{\,\mathrm{d} b} \Phi_{n-1}(x). \]

If \( \sup_{x\in\mathbb{R}}|\sigma'(x)|\le 1-\delta\) , then by induction, \( |\frac{\,\mathrm{d}}{\,\mathrm{d} b} \Phi_n(x)|\le (1-\delta)^n\) . The opposite effect happens for activation functions with derivatives uniformly larger than one. This argument shows that the derivative of \( \Phi_n(x,b)\) with respect to \( b\) can become exponentially small or exponentially large when propagated through the layers. This effect, known as the vanishing- or exploding gradient effect, also occurs for activation functions which do not admit the uniform bounds assumed above. However, since the sigmoid activation function exhibits areas with extremely small gradients, the vanishing gradient effect can be strongly exacerbated.
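
The decay \( (1-\delta)^n\) can be observed numerically. The sketch below (our own illustration; for the sigmoid one can take \( 1-\delta = 1/4\) , since \( \sup_{x}\sigma_{\rm sig}'(x) = 1/4\) ) propagates the derivative \( \frac{\mathrm{d}}{\mathrm{d}b}\Phi_n(x)\) through the \( n\) compositions via the chain rule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def value_and_grad_wrt_bias(x, b, n):
    """Return Phi_n(x) = sigma(...sigma(x + b)...) and d Phi_n / d b,
    accumulating the chain rule over the n compositions."""
    z = x + b
    grad = 1.0                 # derivative of the innermost argument x + b w.r.t. b
    for _ in range(n):
        grad *= dsigmoid(z)    # multiply by sigma'(current pre-activation)
        z = sigmoid(z)
    return z, grad

for n in [1, 5, 10, 20]:
    _, g = value_and_grad_wrt_bias(x=0.5, b=0.0, n=n)
    print(n, g)   # decays at least as fast as (1/4)**n, since sup |sigmoid'| = 1/4
```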

ReLU (Rectified Linear Unit): The ReLU is defined as

\[ \sigma_{\rm ReLU}(x) = \max\{x, 0\}~~\text{for }x \in \mathbb{R}, \]

and depicted in Figure 5 (b). It is piecewise linear, and due to its simplicity its evaluation is computationally very efficient. It is one of the most popular activation functions in practice. Since its derivative is always zero or one, it does not suffer from the vanishing gradient problem to the same extent as the sigmoid function. However, ReLU can suffer from the so-called dead neurons problem. Consider the neural network

\[ \Phi(x) = \sigma_{\rm ReLU}( b - \sigma_{\rm ReLU}(x))~~\text{for }x\in\mathbb{R} \]

depending on the bias \( b\in\mathbb{R}\) . If \( b < 0\) , then \( \Phi(x) = 0\) for all \( x \in \mathbb{R}\) . The neuron corresponding to the second application of \( \sigma_{\rm {ReLU}}\) thus produces a constant signal. Moreover, if \( b < 0\) , \( \frac{\,\mathrm{d}}{\,\mathrm{d} b}\Phi(x)=0\) for all \( x\in\mathbb{R}\) . As a result, every negative value of \( b\) yields a stationary point of the empirical risk. A gradient-based method will not be able to further train the parameter \( b\) . We thus refer to this neuron as a dead neuron.
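
The dead neuron can be observed directly. The following sketch (our own illustration with an arbitrary negative bias) evaluates \( \Phi(x) = \sigma_{\rm ReLU}(b - \sigma_{\rm ReLU}(x))\) together with a finite-difference approximation of \( \frac{\mathrm{d}}{\mathrm{d}b}\Phi(x)\) ; both vanish identically for \( b<0\) .

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def phi(x, b):
    """Phi(x) = ReLU(b - ReLU(x)); for b < 0 the outer neuron never activates."""
    return relu(b - relu(x))

xs = np.linspace(-3, 3, 7)
b = -0.5
print(phi(xs, b))                                 # all zeros: the neuron is "dead"
eps = 1e-6
print((phi(xs, b + eps) - phi(xs, b)) / eps)      # finite-difference d/db is zero as well
```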

SiLU (Sigmoid Linear Unit): An important difference between the ReLU and the Sigmoid is that the ReLU is not differentiable at \( 0\) . The SiLU activation function (also referred to as “swish”) can be interpreted as a smooth approximation to the ReLU. It is defined as

\[ \sigma_{\rm SiLU}(x)\mathrm{:}= x\sigma_{\rm sig}(x) = \frac{x}{1+e^{-x}}~~\text{for }x \in \mathbb{R}, \]

and is depicted in Figure 5 (b). There exist various other smooth activation functions that mimic the ReLU, including the Softplus \( x\mapsto \log(1+\exp(x))\) , the GELU (Gaussian Error Linear Unit) \( x\mapsto xF(x)\) where \( F(x)\) denotes the cumulative distribution function of the standard normal distribution, and the Mish \( x\mapsto x\tanh(\log(1+\exp(x)))\) .

Parametric ReLU or Leaky ReLU: This variant of the ReLU addresses the dead neuron problem. For some \( a\in (0,1)\) , the parametric ReLU is defined as

\[ \sigma_{a}(x) = \max\{x, ax\}~~\text{for }x\in\mathbb{R}, \]

and is depicted in Figure 5 (c) for three different values of \( a\) . Since \( \sigma_a\) does not have flat regions like the ReLU, the dead neuron problem is mitigated. If \( a\) is not chosen too small, the vanishing gradient problem is also less pronounced than for the sigmoid. In practice, the additional parameter \( a\) has to be fine-tuned depending on the application. Like the ReLU, the parametric ReLU is not differentiable at \( 0\) .
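For concreteness, the activation functions discussed in this section can be evaluated with a few lines of Python/NumPy (a minimal sketch; the function names and sample points are our own choices):

import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def relu(x):               return np.maximum(x, 0.0)
def silu(x):               return x * sigmoid(x)
def softplus(x):           return np.log1p(np.exp(x))
def leaky_relu(x, a=0.1):  return np.maximum(x, a * x)

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
for name, f in [("sigmoid", sigmoid), ("ReLU", relu), ("SiLU", silu),
                ("softplus", softplus), ("leaky ReLU", leaky_relu)]:
    print(f"{name:11s}", np.round(f(xs), 4))
# SiLU and softplus are smooth and close to the ReLU for large |x|.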

Bibliography and further reading

The concept of neural networks was first introduced by McCulloch and Pitts in [9]. Later, Rosenblatt [29] introduced the perceptron, an artificial neuron with adjustable weights that forms the basis of the multilayer perceptron (a fully connected feedforward neural network). The vanishing gradient problem briefly addressed in Section 3.3 was discussed by Hochreiter in his diploma thesis [30] and later in [31, 12].

Exercises

Exercise 1

Prove Proposition 1.

Exercise 2

In this exercise, we show that ReLU and parametric ReLU create similar sets of neural network functions. Fix \( a>0\) .

  1. Find a set of weight matrices and bias vectors, such that the associated neural network \( \Phi_1\) , with the ReLU activation function \( \sigma_{\rm {ReLU}}\) satisfies \( \Phi_1(x) = \sigma_{a}(x)\) for all \( x \in \mathbb{R}\) .
  2. Find a set of weight matrices and bias vectors, such that the associated neural network \( \Phi_2\) , with the parametric ReLU activation function \( \sigma_a\) satisfies \( \Phi_2(x) = \sigma_{\rm {ReLU}}(x)\) for all \( x \in \mathbb{R}\) .
  3. Conclude that every ReLU neural network can be expressed as a leaky ReLU neural network and vice versa.

Exercise 3

Let \( d\in \mathbb{N}\) , and let \( \Phi_1\) be a neural network with the ReLU as activation function, input dimension \( d\) , and output dimension \( 1\) . Moreover, let \( \Phi_2\) be a neural network with the sigmoid activation function, input dimension \( d\) , and output dimension \( 1\) . Show that, if \( \Phi_1 = \Phi_2\) , then \( \Phi_1\) is a constant function.

Exercise 4

In this exercise, we show that for the sigmoid activation function, dead-neuron-like behavior is very rare. Let \( \Phi\) be a neural network with the sigmoid activation function. Assume that \( \Phi\) is a constant function. Show that for every \( \varepsilon >0\) there is a non-constant neural network \( \widetilde{\Phi}\) with the same architecture as \( \Phi\) such that for all \( \ell = 0, \dots, L\) ,

\[ \begin{align*} \| {\boldsymbol{W}}^{(\ell)} - \widetilde{{\boldsymbol{W}}}^{(\ell)} \|_{} \leq \varepsilon \text{ and } \| {\boldsymbol{b}}^{(\ell)} - \widetilde{{\boldsymbol{b}}}^{(\ell)} \|_{} \leq \varepsilon \end{align*} \]

where \( {\boldsymbol{W}}^{(\ell)}\) , \( {\boldsymbol{b}}^{(\ell)}\) are the weights and biases of \( \Phi\) and \( \widetilde{{\boldsymbol{W}}}^{(\ell)}\) , \( \widetilde{{\boldsymbol{b}}}^{(\ell)}\) are the weights and biases of \( \widetilde{\Phi}\) .

Show that such a statement does not hold for ReLU neural networks. What about leaky ReLU?

4 Universal approximation

After introducing neural networks in Chapter 3, it is natural to inquire about their capabilities. Specifically, we might wonder if there exist inherent limitations to the type of functions a neural network can represent. Could there be a class of functions that neural networks cannot approximate? If so, it would suggest that neural networks are specialized tools, similar to how linear regression is suited for linear relationships, but not for data with nonlinear relationships.

In this chapter, primarily following [32], we will show that this is not the case, and neural networks are indeed a universal tool. More precisely, given sufficiently large and complex architectures, they can approximate almost every sensible input-output relationship. We will formalize and prove this claim in the subsequent sections.

4.1 A universal approximation theorem

To analyze what kind of functions can be approximated with neural networks, we start by considering the uniform approximation of continuous functions \( f:\mathbb{R}^d\to\mathbb{R}\) on compact sets. To this end, we first introduce the notion of compact convergence.

Definition 3

Let \( d\in \mathbb{N}\) . A sequence of functions \( f_n:\mathbb{R}^d\to\mathbb{R}\) , \( n\in\mathbb{N}\) , is said to converge compactly to a function \( f:\mathbb{R}^d\to\mathbb{R}\) , if for every compact \( K\subseteq \mathbb{R}^d\) it holds that \( \lim_{n\to\infty}\sup_{{\boldsymbol{x}}\in K} |f_n({\boldsymbol{x}})-f({\boldsymbol{x}})|=0\) . In this case we write \( f_n\xrightarrow{{\rm {cc}}} f\) .

Throughout what follows, we always consider \( C^0(\mathbb{R}^d)\) equipped with the topology of Definition 3 (also see Exercise 5), and every subset such as \( C^0(D)\) with the subspace topology: for example, if \( D\subseteq\mathbb{R}^d\) is bounded, then convergence in \( C^0(D)\) refers to uniform convergence \( \lim_{n\to\infty}\sup_{x\in D}|f_n(x)-f(x)|=0\) .

4.1.1 Universal approximators

As stated before, we want to show that deep neural networks can approximate every continuous function in the sense of Definition 3. We call sets of functions that satisfy this property universal approximators.

Definition 4

Let \( d\in \mathbb{N}\) . A set of functions \( \mathcal{H}\) from \( \mathbb{R}^d\) to \( \mathbb{R}\) is a universal approximator (of \( C^0(\mathbb{R}^d)\) ), if for every \( \varepsilon>0\) , every compact \( K\subseteq\mathbb{R}^d\) , and every \( f\in C^0(\mathbb{R}^d)\) , there exists \( g\in\mathcal{H}\) such that \( \sup_{{\boldsymbol{x}}\in K}|f({\boldsymbol{x}})-g({\boldsymbol{x}})|<\varepsilon\) .

For a set of (not necessarily continuous) functions \( \mathcal{H}\) mapping between \( \mathbb{R}^d\) and \( \mathbb{R}\) , we denote by \( \overline{\mathcal{H}}^{\rm {cc}}\) its closure with respect to compact convergence.

The relationship between a universal approximator and the closure with respect to compact convergence is established in the proposition below.

Proposition 2

Let \( d\in \mathbb{N}\) and \( \mathcal{H}\) be a set of functions from \( \mathbb{R}^d\) to \( \mathbb{R}\) . Then, \( \mathcal{H}\) is a universal approximator of \( C^0(\mathbb{R}^d)\) if and only if \( C^0(\mathbb{R}^d)\subseteq \overline{\mathcal{H}}^{\rm {cc}}\) .

Proof

Suppose that \( \mathcal{H}\) is a universal approximator and fix \( f\in C^0(\mathbb{R}^d)\) . For \( n \in \mathbb{N}\) , define \( K_n\mathrm{:}= [-n,n]^d\subseteq\mathbb{R}^d\) . Then for every \( n\in\mathbb{N}\) there exists \( f_n\in\mathcal{H}\) such that \( \sup_{{\boldsymbol{x}}\in K_n}|f_n({\boldsymbol{x}})-f({\boldsymbol{x}})|<1/n\) . Since for every compact \( K\subseteq\mathbb{R}^d\) there exists \( n_0\) such that \( K\subseteq K_n\) for all \( n\ge n_0\) , it holds \( f_n\xrightarrow{{\rm {cc}}} f\) . The reverse implication follows directly from the definition of compact convergence.

A key tool to show that a set is a universal approximator is the Stone-Weierstrass theorem, see for instance [1, Sec. 5.7].

Theorem 1 (Stone-Weierstrass)

Let \( d\in \mathbb{N}\) , let \( K\subseteq\mathbb{R}^d\) be compact, and let \( \mathcal{H}\subseteq C^0(K,\mathbb{R})\) satisfy that

  1. for all \( {\boldsymbol{x}}\in K\) there exists \( f\in\mathcal{H}\) such that \( f({\boldsymbol{x}})\neq 0\) ,
  2. for all \( {\boldsymbol{x}}\neq {\boldsymbol{y}}\in K\) there exists \( f\in\mathcal{H}\) such that \( f({\boldsymbol{x}})\neq f({\boldsymbol{y}})\) ,
  3. \( \mathcal{H}\) is an algebra of functions, i.e., \( \mathcal{H}\) is closed under addition, multiplication and scalar multiplication.

Then \( \mathcal{H}\) is dense in \( C^0(K)\) .

Example 1 (Polynomials are a universal approximator)

For a multiindex \( {\boldsymbol{\alpha}}=(\alpha_1,\dots,\alpha_d)\in\mathbb{N}_0^d\) and a vector \( {\boldsymbol{x}}=(x_1,\dots,x_d)\in\mathbb{R}^d\) denote \( {\boldsymbol{x}}^{\boldsymbol{\alpha}}\mathrm{:}= \prod_{j=1}^d x_j^{\alpha_j}\) . In the following, with \( |{\boldsymbol{\alpha}}|\mathrm{:}= \sum_{j=1}^d\alpha_j\) , we write

\[ \begin{align*} \mathcal{P}_n\mathrm{:}= {\rm span}\{{\boldsymbol{x}}^{\boldsymbol{\alpha}}\,|\,{\boldsymbol{\alpha}}\in\mathbb{N}_0^d,~|{\boldsymbol{\alpha}}|\le n\} \end{align*} \]

i.e., \( \mathcal{P}_n\) is the space of polynomials of degree at most \( n\) (with real coefficients). It is easy to check that \( \mathcal{P}\mathrm{:}=\bigcup_{n\in\mathbb{N}}\mathcal{P}_n(\mathbb{R}^d)\) satisfies the assumptions of Theorem 1 on every compact set \( K\subseteq\mathbb{R}^d\) . Thus the space of polynomials \( \mathcal{P}\) is a universal approximator of \( C^0(\mathbb{R}^d)\) , and by Proposition 2, \( \mathcal{P}\) is dense in \( C^0(\mathbb{R}^d)\) . In case we wish to emphasize the dimension of the underlying space, in the following we will also write \( \mathcal{P}_n(\mathbb{R}^d)\) or \( \mathcal{P}(\mathbb{R}^d)\) to denote \( \mathcal{P}_n\) , \( \mathcal{P}\) respectively.

4.1.2 Shallow neural networks

With the necessary formalism established, we can now show that shallow neural networks of arbitrary width form a universal approximator under certain (mild) conditions on the activation function. The results in this section are based on [2], and for the proofs we follow the arguments in that paper.

We first introduce notation for the set of all functions realized by certain architectures.

Definition 5

Let \( d\) , \( m\) , \( L\) , \( n \in \mathbb{N}\) and \( \sigma \colon \mathbb{R} \to \mathbb{R}\) . The set of all functions realized by neural networks with \( d\) -dimensional input, \( m\) -dimensional output, depth at most \( L\) , width at most \( n\) , and activation function \( \sigma\) is denoted by

\[ \begin{align*} \mathcal{N}_d^m(\sigma;L,n)\mathrm{:}= \{\Phi:\mathbb{R}^d\to\mathbb{R}^m\,|\,\Phi\text{ as in Definition 1, }{\rm depth}(\Phi)\le L,~{\rm width}(\Phi)\le n\}. \end{align*} \]

Furthermore,

\[ \begin{align*} \mathcal{N}_d^m(\sigma;L)\mathrm{:}= \bigcup_{n\in\mathbb{N}}\mathcal{N}_d^m(\sigma;L,n). \end{align*} \]

In the sequel, we require the activation function \( \sigma\) to belong to the set of piecewise continuous and locally bounded functions

\[ \begin{equation} \begin{aligned} \mathcal{M}\mathrm{:}= \big\{\sigma\in L_{\rm loc}^\infty(\mathbb{R}) \;\big|\;&\text{there exist intervals }I_1,\dots,I_M \text{ partitioning } \mathbb{R},\\ &\text{s.t. }\sigma\in C^0(I_j) \text{ for all } j = 1, \dots, M \big\}. \end{aligned} \end{equation} \]

(8)

Here, \( M\in\mathbb{N}\) is finite, and the intervals \( I_j\) are understood to have positive (possibly infinite) Lebesgue measure, i.e., \( I_j\) is not allowed to be empty or a single point. Hence, \( \sigma\) is a piecewise continuous function with at most finitely many discontinuities.

Example 2

Activation functions belonging to \( \mathcal{M}\) include, in particular, all continuous non-polynomial functions, which in turn includes all practically relevant activation functions such as the ReLU, the SiLU, and the Sigmoid discussed in Section 3.3. In these cases, we can choose \( M=1\) and \( I_1=\mathbb{R}\) . Discontinuous functions include for example the Heaviside function \( x \mapsto \boldsymbol{1}_{x>0}\) (also called a “perceptron” in this context) but also \( x \mapsto \boldsymbol{1}_{x>0}\sin(1/x)\) : Both belong to \( \mathcal{M}\) with \( M=2\) , \( I_1=(-\infty,0]\) and \( I_2=(0,\infty)\) . We exclude for example the function \( x \mapsto {1}/{x}\) , which is not locally bounded.

The rest of this subsection is dedicated to proving the following theorem, which has already been announced repeatedly.

Theorem 2

Let \( d\in\mathbb{N}\) and \( \sigma\in\mathcal{M}\) . Then \( \mathcal{N}_d^1(\sigma;1)\) is a universal approximator of \( C^0(\mathbb{R}^d)\) if and only if \( \sigma\) is not a polynomial.

Remark 2

We will see in Corollary 2 and Exercise 9 that neural networks can also arbitrarily well approximate non-continuous functions with respect to suitable norms.

The universal approximation theorem by Leshno, Lin, Pinkus and Schocken [2]—of which Theorem 2 is a special case—is even formulated for a much larger set \( \mathcal{M}\) , which allows for activation functions that have discontinuities at a (possibly non-finite) set of Lebesgue measure zero. Instead of proving the theorem in this generality, we resort to the simpler case stated above. This allows us to avoid some technicalities, but the main ideas remain the same. The proof strategy is to verify the following three claims:

  1. if \( C^0(\mathbb{R}^1)\subseteq \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) then \( C^0(\mathbb{R}^d)\subseteq \overline{\mathcal{N}_d^1(\sigma;1)}^{\rm {cc}}\) ,
  2. if \( \sigma\in C^\infty(\mathbb{R})\) is not a polynomial then \( C^0(\mathbb{R}^1)\subseteq \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) ,
  3. if \( \sigma\in \mathcal{M}\) is not a polynomial then there exists \( \tilde\sigma \in C^\infty(\mathbb{R})\cap \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) which is not a polynomial.

Upon observing that \( \tilde\sigma\in \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) implies \( \overline{\mathcal{N}_1^1(\tilde\sigma,1)}^{\rm {cc}}\subseteq \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm cc}\) , it is easy to see that these statements together with Proposition 2 establish the implication “\( \Leftarrow\) ” asserted in Theorem 2. The reverse direction is straightforward to check and will be the content of Exercise 6.

We start with a more general version of claim 1 and reduce the problem to the one-dimensional case, following [3, Theorem 2.1].

Lemma 1

Assume that \( \mathcal{H}\) is a universal approximator of \( C^0(\mathbb{R})\) . Then for every \( d\in\mathbb{N}\)

\[ \begin{align*} {\rm span}\{{\boldsymbol{x}} \mapsto g({\boldsymbol{w}}\cdot{\boldsymbol{x}})\,|\,{\boldsymbol{w}}\in\mathbb{R}^d,~g\in\mathcal{H}\} \end{align*} \]

is a universal approximator of \( C^0(\mathbb{R}^d)\) .

Proof

For \( k\in\mathbb{N}_0\) , denote by \( \mathbb{H}_k\) the space of all \( k\) -homogenous polynomials, that is

\[ \begin{align*} \mathbb{H}_k\mathrm{:}= {\rm span}\left\{\mathbb{R}^d \ni {\boldsymbol{x}} \mapsto {\boldsymbol{x}}^{\boldsymbol{\alpha}}\, \middle|\,{\boldsymbol{\alpha}}\in\mathbb{N}_0^d,~|{\boldsymbol{\alpha}}|=k\right\}. \end{align*} \]

We claim that

\[ \begin{align} \mathbb{H}_k\subseteq \overline{{\rm span}\{\mathbb{R}^d \ni {\boldsymbol{x}} \mapsto g({\boldsymbol{w}}\cdot{\boldsymbol{x}})\,|\,{\boldsymbol{w}}\in\mathbb{R}^d,~g\in\mathcal{H}\}}^{\rm cc} =\mathrm{:} X \end{align} \]

(9)

for all \( k\in\mathbb{N}_0\) . This implies that all multivariate polynomials belong to \( X\) . An application of the Stone-Weierstrass theorem (cp. Example 1) and Proposition 2 then conclude the proof.

For every \( {\boldsymbol{\alpha}}\) , \( {\boldsymbol{\beta}}\in\mathbb{N}_0^d\) with \( |{\boldsymbol{\alpha}}|=|{\boldsymbol{\beta}}|=k\) , it holds \( D^{\boldsymbol{\beta}} {\boldsymbol{x}}^{\boldsymbol{\alpha}}=\delta_{{\boldsymbol{\beta}},{\boldsymbol{\alpha}}} {\boldsymbol{\alpha}} !\) , where \( {\boldsymbol{\alpha}} !\mathrm{:}= \prod_{j=1}^d\alpha_j!\) and \( \delta_{{\boldsymbol{\beta}},{\boldsymbol{\alpha}}}=1\) if \( {\boldsymbol{\beta}}={\boldsymbol{\alpha}}\) and \( \delta_{{\boldsymbol{\beta}},{\boldsymbol{\alpha}}}=0\) otherwise. Hence, since \( \{{\boldsymbol{x}} \mapsto {\boldsymbol{x}}^{\boldsymbol{\alpha}}\,|\,|{\boldsymbol{\alpha}}|=k\}\) is a basis of \( \mathbb{H}_k\) , the set \( \{D^{\boldsymbol{\alpha}}\,|\,|{\boldsymbol{\alpha}}|=k\}\) is a basis of its topological dual \( \mathbb{H}_k'\) . Thus each linear functional \( l \in \mathbb{H}_k'\) allows the representation \( l=p(D)\) for some \( p\in \mathbb{H}_k\) (here \( D\) stands for the differential).

By the multinomial formula

\[ \begin{align*} ({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k = \left(\sum_{j=1}^d w_jx_j\right)^k= \sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|=k\}} \frac{k!}{{\boldsymbol{\alpha}} !} {\boldsymbol{w}}^{\boldsymbol{\alpha}}{\boldsymbol{x}}^{\boldsymbol{\alpha}}. \end{align*} \]

Therefore, we have that \( ({\boldsymbol{x}} \mapsto ({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k) \in \mathbb{H}_k\) . Moreover, for every \( l=p(D)\in \mathbb{H}_k'\) and all \( {\boldsymbol{w}}\in\mathbb{R}^d\) we have that

\[ \begin{align*} l({\boldsymbol{x}} \mapsto ({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k)= k! p({\boldsymbol{w}}). \end{align*} \]

Hence, if \( l({\boldsymbol{x}} \mapsto({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k)=p(D)({\boldsymbol{x}} \mapsto({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k)=0\) for all \( {\boldsymbol{w}}\in\mathbb{R}^d\) , then \( p\equiv 0\) and thus \( l\equiv 0\) .

This implies \( {\rm {span}}\{{\boldsymbol{x}} \mapsto({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k\,|\,{\boldsymbol{w}}\in\mathbb{R}^d\}=\mathbb{H}_k\) . Indeed, if there exists \( h \in \mathbb{H}_k\) which is not in \( {\rm {span}}\{{\boldsymbol{x}} \mapsto ({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k\,|\,{\boldsymbol{w}}\in\mathbb{R}^d\}\) , then by the theorem of Hahn-Banach (see Theorem 52), there exists a non-zero functional in \( \mathbb{H}_k'\) vanishing on \( {\rm {span}}\{{\boldsymbol{x}} \mapsto({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k\,|\,{\boldsymbol{w}}\in\mathbb{R}^d\}\) . This contradicts the previous observation.

By the universality of \( \mathcal{H}\) it is not hard to see that \( {\boldsymbol{x}} \mapsto ({\boldsymbol{w}}\cdot{\boldsymbol{x}})^k\in X\) for all \( {\boldsymbol{w}}\in\mathbb{R}^d\) . Therefore, we have \( \mathbb{H}_k\subseteq X\) for all \( k\in\mathbb{N}_0\) .

By the above lemma, in order to verify that \( \mathcal{N}_d^1(\sigma;1)\) is a universal approximator, it suffices to show that \( \mathcal{N}_1^1(\sigma;1)\) is a universal approximator. We first show that this is the case for sigmoidal activations.

Definition 6

An activation function \( \sigma:\mathbb{R}\to\mathbb{R}\) is called sigmoidal, if \( \sigma\in C^0(\mathbb{R})\) , \( \lim_{x\to\infty}\sigma(x)=1\) and \( \lim_{x\to-\infty}\sigma(x)=0\) .

For sigmoidal activation functions we can now conclude the universality in the univariate case.

Lemma 2

Let \( \sigma:\mathbb{R}\to\mathbb{R}\) be monotonically increasing and sigmoidal. Then \( C^0(\mathbb{R})\subseteq \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) .

We prove Lemma 2 in Exercise 7. Lemma 1 and Lemma 2 show Theorem 2 in the special case where \( \sigma\) is monotonically increasing and sigmoidal. For the general case, let us continue with claim 2 and consider \( C^\infty\) activations.

Lemma 3

If \( \sigma\in C^\infty(\mathbb{R})\) and \( \sigma\) is not a polynomial, then \( \mathcal{N}_1^1(\sigma;1)\) is dense in \( C^0(\mathbb{R})\) .

Proof

Denote \( X\mathrm{:}=\overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) . We show again that all polynomials belong to \( X\) . An application of the Stone-Weierstrass theorem then gives the statement.

Fix \( b\in\mathbb{R}\) and denote \( f_x(w)\mathrm{:}= \sigma(wx+b)\) for all \( x\) , \( w\in\mathbb{R}\) . By Taylor’s theorem, for \( h\neq 0\)

\[ \begin{align} \frac{\sigma((w+h)x+b)-\sigma(wx+b)}{h} &=\frac{f_x(w+h)-f_x(w)}{h}\nonumber \end{align} \]

(10)

\[ \begin{align} &=f_x'(w)+\frac{h}{2}f_x''(\xi)\\ &=f_x'(w)+\frac{h}{2}x^2\sigma''(\xi x+b) \end{align} \]

(11)

for some \( \xi=\xi(h)\) between \( w\) and \( w+h\) . Note that the left-hand side belongs to \( \mathcal{N}_1^1(\sigma;1)\) as a function of \( x\) . Since \( \sigma''\in C^0(\mathbb{R})\) , for every compact set \( K\subseteq\mathbb{R}\)

\[ \begin{align*} \sup_{x\in K}\sup_{|h|\le 1}|x^2\sigma''(\xi(h) x+b)|\le \sup_{x\in K}\sup_{\eta\in[w-1,w+1]}|x^2\sigma''(\eta x+b)|<\infty. \end{align*} \]

Letting \( h\to 0\) , as a function of \( x\) the term in (10) thus converges uniformly towards \( K\ni x\mapsto f_x'(w)\) . Since \( K\) was arbitrary, \( x\mapsto f_x'(w)\) belongs to \( X\) . Inductively applying the same argument to \( f_x^{(k-1)}(w)\) , we find that \( x\mapsto f_x^{(k)}(w)\) belongs to \( X\) for all \( k\in\mathbb{N}\) , \( w\in\mathbb{R}\) . Observe that \( f_x^{(k)}(w) = x^k\sigma^{(k)}(wx+b)\) . Since \( \sigma\) is not a polynomial, for each \( k\in\mathbb{N}\) there exists \( b_k\in\mathbb{R}\) such that \( \sigma^{(k)}(b_k)\neq 0\) . Choosing \( w=0\) , we obtain that \( x\mapsto \sigma^{(k)}(b_k) x^k\) belongs to \( X\) , and thus also \( x\mapsto x^k\) belongs to \( X\) .
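The difference-quotient argument in this proof is easy to try out numerically. In the sketch below (Python/NumPy; the choices \( \sigma=\sigma_{\rm sig}\) , \( b=1\) and \( k=2\) are ours, for illustration only), the second difference quotient of \( w\mapsto\sigma(wx+b)\) at \( w=0\) (which, as a function of \( x\) , is a shallow neural network with three neurons) approximates \( x\mapsto x^2\sigma''(b)\) uniformly on a compact set:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d2_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

b, h = 1.0, 1e-3                # bias with sigma''(b) != 0, and the step size
xs = np.linspace(-2.0, 2.0, 9)  # points in the compact set K = [-2, 2]

# second difference quotient in the weight w at w = 0:
# (sigma(h*x+b) - 2*sigma(b) + sigma(-h*x+b)) / h^2  ~  x^2 * sigma''(b)
approx = (sigmoid(h * xs + b) - 2.0 * sigmoid(b) + sigmoid(-h * xs + b)) / h**2
print(np.max(np.abs(approx / d2_sigmoid(b) - xs**2)))  # small uniform error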

Finally, we come to the proof of claim 3, i.e., the claim that there exists at least one non-polynomial \( C^\infty(\mathbb{R})\) function in the closure of \( \mathcal{N}_1^1(\sigma;1)\) . The argument is split into two lemmata. Denote in the following by \( C_c^\infty(\mathbb{R})\) the set of compactly supported \( C^\infty(\mathbb{R})\) functions, and for two functions \( f\) , \( g:\mathbb{R}\to\mathbb{R}\) let

\[ \begin{equation} f*g(x)\mathrm{:}= \int_{\mathbb{R}}f(x-y)g(y)\,\mathrm{d} y~~\text{for all }x\in\mathbb{R} \end{equation} \]

(12)

be the convolution of \( f\) and \( g\) .

Lemma 4

Let \( \sigma\in\mathcal{M}\) . Then for each \( \varphi\in C_c^\infty(\mathbb{R})\) it holds \( \sigma*\varphi\in \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) .

Proof

Fix \( \varphi\in C_c^\infty(\mathbb{R})\) and let \( a>0\) such that \( {\rm {supp}}\varphi\subseteq [-a,a]\) . Denote \( y_j\mathrm{:}= -a+2a {j}/{n}\) for \( j=0,\dots,n\) and define for \( x \in \mathbb{R}\)

\[ \begin{align*} f_n(x) \mathrm{:=} \frac{2a}{n} \sum_{j=0}^{n-1}\sigma(x-y_j)\varphi(y_j). \end{align*} \]

Clearly, \( f_n \in \mathcal{N}_1^1(\sigma;1).\) We will show \( f_n\xrightarrow{{\rm {cc}}} \sigma*\varphi\) as \( n\to\infty\) . To do so we verify uniform convergence of \( f_n\) towards \( \sigma*\varphi\) on the interval \( [-b,b]\) with \( b>0\) arbitrary but fixed.

For \( x\in [-b,b]\)

\[ \begin{align} |\sigma*\varphi(x)-f_n(x)| \le \sum_{j=0}^{n-1}\left|\int_{y_{j}}^{y_{j+1}}\sigma(x-y)\varphi(y)-\sigma(x-y_j)\varphi(y_j)\,\mathrm{d} y\right|. \end{align} \]

(13)

Fix \( \varepsilon\in (0,1)\) . Since \( \sigma\in\mathcal{M}\) , there exist \( z_1,\dots,z_M\in\mathbb{R}\) such that \( \sigma\) is continuous on \( \mathbb{R}\backslash\{z_1,\dots,z_M\}\) (cp. (8)). With \( D_\varepsilon\mathrm{:}= \bigcup_{j=1}^M(z_j-\varepsilon,z_j+\varepsilon)\) , observe that \( \sigma\) is uniformly continuous on the compact set \( K_\varepsilon\mathrm{:}=[-a-b,a+b]\cap D_\varepsilon^c\) . Now let \( J_c\cup J_d =\{0,\dots,n-1\}\) be a partition (depending on \( x\) ), such that \( j\in J_c\) if and only if \( [x-y_{j+1},x-y_j]\subseteq K_\varepsilon\) . Hence, \( j\in J_d\) implies the existence of \( i\in\{1,\dots,M\}\) such that the distance of \( z_i\) to \( [x-y_{j+1},x-y_j]\) is at most \( \varepsilon\) . Due to the interval \( [x-y_{j+1},x-y_{j}]\) having length \( {2a}/{n}\) , we can bound

\[ \begin{align*} \sum_{j\in J_d}y_{j+1}-y_{j} &=\left|\bigcup_{j\in J_d}[x-y_{j+1},x-y_j]\right|\\ &\le \left|\bigcup_{i=1}^M\Big[z_i-\varepsilon-\frac{2a}{n},z_i+\varepsilon+\frac{2a}{n}\Big]\right|\\ &\le M \cdot \Big(2\varepsilon+\frac{4a}{n}\Big), \end{align*} \]

where \( |A|\) denotes the Lebesgue measure of a measurable set \( A\subseteq\mathbb{R}\) . Next, because of the local boundedness of \( \sigma\) and the fact that \( \varphi\in C_c^\infty\) , it holds \( \sup_{|y|\le a+b}|\sigma(y)|+\sup_{|y|\le a}|\varphi(y)|=\mathrm{:} \gamma<\infty\) . Hence

\[ \begin{align} &|\sigma*\varphi(x)-f_n(x)| \nonumber \end{align} \]

(14)

\[ \begin{align} &~~ \le \sum_{j\in J_c\cup J_d}\left|\int_{y_{j}}^{y_{j+1}}\sigma(x-y)\varphi(y)-\sigma(x-y_j)\varphi(y_j)\,\mathrm{d} y\right| \\ &~~ \le 2\gamma^2 M \cdot \left(2\varepsilon+\frac{4a}{n}\right) \\ &~~ ~~ + 2a\sup_{j\in J_c} \max_{y\in [y_{j},y_{j+1}]}| \sigma(x-y)\varphi(y)-\sigma(x-y_j)\varphi(y_j)|. \end{align} \]

We can bound the term in the last maximum by

\[ \begin{align*} &|\sigma(x-y)\varphi(y)-\sigma(x-y_j)\varphi(y_j)|\\ &~~ \le |\sigma(x-y)-\sigma(x-y_j)||\varphi(y)| +|\sigma(x-y_j)||\varphi(y)-\varphi(y_j)|\nonumber\\ &~~ \le \gamma \cdot \left( \sup_{\substack{z_1,z_2\in K_\varepsilon\\ |z_1-z_2|\le\frac{2a}{n}}} |\sigma(z_1)-\sigma(z_2)| + \sup_{\substack{z_1,z_2\in [-a,a] \\ |z_1-z_2|\le\frac{2a}{n}}} |\varphi(z_1)-\varphi(z_2)| \right). \end{align*} \]

Finally, uniform continuity of \( \sigma\) on \( K_\varepsilon\) and \( \varphi\) on \( [-a,a]\) imply that the last term tends to \( 0\) as \( n\to\infty\) uniformly for all \( x\in [-b,b]\) . This shows that there exist \( C<\infty\) (independent of \( \varepsilon\) and \( x\) ) and \( n_\varepsilon\in\mathbb{N}\) (independent of \( x\) ) such that the term in (14) is bounded by \( C\varepsilon\) for all \( n\ge n_\varepsilon\) . Since \( \varepsilon\) was arbitrary, this yields the claim.
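The Riemann-sum networks \( f_n\) from this proof are easy to simulate. The sketch below (Python/NumPy; our own choices are \( \sigma=\sigma_{\rm ReLU}\) , which is continuous so none of the discontinuity bookkeeping is needed, and a standard smooth bump function \( \varphi\) supported in \( [-1,1]\) ) compares \( f_n\) on \( [-b,b]\) with a fine Riemann sum serving as a stand-in for \( \sigma*\varphi\) ; the uniform error decreases as \( n\) grows:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bump(y):
    # smooth bump function supported in [-1, 1]
    out = np.zeros_like(y)
    inside = np.abs(y) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - y[inside] ** 2))
    return out

a, b = 1.0, 2.0
xs = np.linspace(-b, b, 201)

def f_n(x, n):
    # the shallow network from the proof: (2a/n) * sum_j sigma(x - y_j) * phi(y_j)
    yj = -a + 2.0 * a * np.arange(n) / n
    return (2.0 * a / n) * np.sum(relu(x[:, None] - yj[None, :]) * bump(yj)[None, :], axis=1)

ref = f_n(xs, 20000)          # fine Riemann sum approximating sigma * phi
for n in (10, 100, 1000):
    print(n, np.max(np.abs(f_n(xs, n) - ref)))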

Lemma 5

If \( \sigma\in \mathcal{M}\) and \( \sigma*\varphi\) is a polynomial for all \( \varphi\in C_c^\infty(\mathbb{R})\) , then \( \sigma\) is a polynomial.

Proof

Fix \( -\infty<a<b<\infty\) and consider \( C_c^\infty(a,b)\mathrm{:}= \{\varphi\in C^\infty(\mathbb{R})\,|\,{\rm supp}\,\varphi\subseteq [a,b]\}\) . Define a metric \( \rho\) on \( C_c^\infty(a,b)\) via

\[ \begin{align*} \rho(\varphi,\psi)\mathrm{:}= \sum_{j\in\mathbb{N}_0} 2^{-j} \frac{|\varphi-\psi|_{C^j(a,b)}}{1+|\varphi-\psi|_{C^j(a,b)}}, \end{align*} \]

where

\[ \begin{align*} |\varphi|_{C^j(a,b)} \mathrm{:}= \sup_{x\in [a,b]}|\varphi^{(j)}(x)|. \end{align*} \]

Since the space of \( j\) times differentiable functions on \( [a,b]\) is complete with respect to the norm \( \sum_{i=0}^j|\cdot|_{C^i(a,b)}\) , see for instance [4, Satz 104.3], the space \( C_c^\infty(a,b)\) is complete with the metric \( \rho\) . For \( k\in\mathbb{N}\) set

\[ \begin{align*} V_k\mathrm{:}= \{\varphi\in C_c^\infty(a,b)\,|\,\sigma*\varphi\in\mathcal{P}_k\}, \end{align*} \]

where \( \mathcal{P}_k\mathrm{:}= {\rm {span}}\{\mathbb{R} \ni x \mapsto x^j\,|\,0\le j\le k\}\) denotes the space of polynomials of degree at most \( k\) . Then \( V_k\) is closed with respect to the metric \( \rho\) . To see this, we need to show that for a converging sequence \( \varphi_j \to \varphi^*\) with respect to \( \rho\) and \( \varphi_j \in V_k\) , it follows that \( D^{k+1}(\sigma * \varphi^*)= 0\) and hence \( \sigma * \varphi^*\) is a polynomial: Using \( D^{k+1}(\sigma *\varphi_j) = 0\) if \( \varphi_j\in V_k\) , the linearity of the convolution, and the fact that \( D^{k+1}(\sigma*g) = \sigma*D^{k+1}(g)\) for differentiable \( g\) and if both sides are well-defined, we get

\[ \begin{align*} &\sup_{x \in [a,b]}| D^{k+1}(\sigma * \varphi^*)(x)|\\ & ~~ = \sup_{x \in [a,b]}| \sigma * D^{k+1}( \varphi^* - \varphi_j)(x)| \\ & ~~ \leq |b-a| \sup_{z \in [a-b, b-a]} |\sigma(z)| \cdot \sup_{x \in [a,b]}| D^{k+1}( \varphi_j -\varphi^*)(x)|. \end{align*} \]

Since \( \sigma\) is locally bounded, the right-hand side converges to \( 0\) as \( j\to\infty\) .

By assumption we have

\[ \begin{align*} \bigcup_{k\in\mathbb{N}}V_k = C_c^\infty(a,b). \end{align*} \]

Baire’s category theorem (Theorem 51) implies the existence of \( k_0\in\mathbb{N}\) (depending on \( a\) , \( b\) ) such that \( V_{k_0}\) contains an open subset of \( C_c^\infty(a,b)\) . Since \( V_{k_0}\) is a vector space, it must hold \( V_{k_0}=C_c^\infty(a,b)\) .

We now show that \( \varphi*\sigma\in \mathcal{P}_{k_0}\) for every \( \varphi\in C_c^\infty(\mathbb{R})\) ; in other words, \( k_0=k_0(a,b)\) can be chosen independent of \( a\) and \( b\) . First consider a shift \( s\in\mathbb{R}\) and let \( \tilde a\mathrm{:}= a+s\) and \( \tilde b\mathrm{:}= b+s\) . Then with \( S(x)\mathrm{:}= x+s\) , for any \( \varphi\in C_c^\infty(\tilde a,\tilde b)\) it holds that \( \varphi\circ S\in C_c^\infty(a,b)\) , and thus \( (\varphi\circ S)*\sigma\in\mathcal{P}_{k_0}\) . Since \( (\varphi\circ S)*\sigma(x) = \varphi*\sigma(x+s)\) , we conclude that \( \varphi*\sigma\in\mathcal{P}_{k_0}\) . Next let \( -\infty<\tilde a<\tilde b<\infty\) be arbitrary. Then, for any integer \( n>(\tilde b-\tilde a)/(b-a)\) we can cover \( (\tilde a,\tilde b)\) with \( n\) overlapping open intervals \( (a_{1},b_{1}),\dots,(a_n,b_n)\) , each of length \( b-a\) . Any \( \varphi\in C_c^\infty(\tilde a,\tilde b)\) can be written as \( \varphi=\sum_{j=1}^n\varphi_j\) where \( \varphi_j\in C_c^\infty(a_j,b_j)\) . Then \( \varphi*\sigma = \sum_{j=1}^n\varphi_j*\sigma\in\mathcal{P}_{k_0}\) , and thus \( \varphi*\sigma\in\mathcal{P}_{k_0}\) for every \( \varphi\in C_c^\infty(\mathbb{R})\) .

Finally, Exercise 8 implies \( \sigma\in\mathcal{P}_{k_0}\) .

Now we can put everything together to show Theorem 2.

Proof (of Theorem 2)

By Exercise 6 we have the implication “\( \Rightarrow\) ”.

For the other direction we assume that \( \sigma\in\mathcal{M}\) is not a polynomial. Then by Lemma 5 there exists \( \varphi\in C_c^\infty(\mathbb{R})\) such that \( \sigma*\varphi\) is not a polynomial. According to Lemma 4 we have \( \sigma*\varphi\in \overline{\mathcal{N}_1^1(\sigma;1)}^{\rm {cc}}\) . We conclude with Lemma 3 that \( \mathcal{N}_1^1(\sigma;1)\) is a universal approximator of \( C^0(\mathbb{R})\) .

Finally, by Lemma 1, \( \mathcal{N}_d^1(\sigma;1)\) is a universal approximator of \( C^0(\mathbb{R}^d)\) .

4.1.3 Deep neural networks

Theorem 2 shows the universal approximation capability of single-hidden-layer neural networks with activation functions \( \sigma\in\mathcal{M}\backslash\mathcal{P}\) : they can approximate every continuous function on every compact set to arbitrary precision, given sufficient width. This result directly extends to neural networks of any fixed depth \( L\ge 1\) . The idea is to use the fact that the identity function can be approximated with a shallow neural network. Composing a shallow neural network approximation of the target function \( f\) with (multiple) shallow neural networks approximating the identity function, gives a deep neural network approximation of \( f\) .

Instead of directly applying Theorem 2, we first establish the following proposition regarding the approximation of the identity function. Rather than \( \sigma\in\mathcal{M}\backslash\mathcal{P}\) , it requires a different (mild) assumption on the activation function. This allows for a constructive proof, yielding explicit bounds on the neural network size, which will prove useful later in the book.

Proposition 3

Let \( d\) , \( L\in \mathbb{N}\) , let \( K \subseteq\mathbb{R}^d\) be compact, and let \( \sigma: \mathbb{R} \to \mathbb{R}\) be such that there exists an open set on which \( \sigma\) is differentiable and not constant. Then, for every \( \varepsilon >0\) , there exists a neural network \( \Phi \in \mathcal{N}_d^d(\sigma; L, d)\) such that

\[ \| \Phi({\boldsymbol{x}}) - {\boldsymbol{x}} \|_{\infty} < \varepsilon~~\text{for all }{\boldsymbol{x}}\in K. \]

Proof

The proof uses the same idea as in Lemma 3, where we approximate the derivative of the activation function by a simple neural network. Let us first assume \( d \in \mathbb{N}\) and \( L = 1\) .

Let \( x^* \in \mathbb{R}\) be such that \( \sigma\) is differentiable on a neighborhood of \( x^*\) and \( \sigma'(x^*) = \theta \neq 0\) . Moreover, let \( {\boldsymbol{x}}^* = (x^*, \dots, x^*) \in \mathbb{R}^d\) . Then, for \( \lambda >0\) we define

\[ \Phi_\lambda({\boldsymbol{x}}) \mathrm{:=} \frac{\lambda}{\theta} \sigma\left( \frac{{\boldsymbol{x}}}{\lambda} + {\boldsymbol{x}}^*\right) - \frac{\lambda}{\theta} \sigma({\boldsymbol{x}}^*). \]

Then, for all \( {\boldsymbol{x}} \in K\) ,

\[ \begin{align} \Phi_\lambda({\boldsymbol{x}}) - {\boldsymbol{x}} = \lambda \frac{\sigma({\boldsymbol{x}} /\lambda + {\boldsymbol{x}}^*) - \sigma({\boldsymbol{x}}^*)}{\theta} - {\boldsymbol{x}}. \end{align} \]

(15)

If \( x_i=0\) for \( i \in \{1, \dots, d\}\) , then (15) shows that \( (\Phi_\lambda({\boldsymbol{x}}) - {\boldsymbol{x}})_i = 0\) . Otherwise

\[ |(\Phi_\lambda({\boldsymbol{x}}) - {\boldsymbol{x}})_i| = \frac{|x_i|}{|\theta|}\left| \frac{\sigma(x_i/\lambda + x^*) - \sigma(x^*)}{x_i/\lambda} - \theta\right|. \]

By the definition of the derivative, we have that \( |(\Phi_\lambda({\boldsymbol{x}}) - {\boldsymbol{x}})_i| \to 0\) for \( \lambda \to \infty\) uniformly for all \( {\boldsymbol{x}} \in K\) and \( i \in \{1, \dots, d\}\) . Therefore, \( \| \Phi_\lambda({\boldsymbol{x}}) - {\boldsymbol{x}} \|_{\infty} \to 0\) for \( \lambda \to \infty\) uniformly for all \( {\boldsymbol{x}} \in K\) .

The extension to \( L > 1\) is straightforward and is the content of Exercise 10.
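The construction of \( \Phi_\lambda\) can be checked numerically. In the following minimal Python/NumPy sketch we choose (as an illustration, not as part of the proof) \( \sigma=\sigma_{\rm sig}\) , \( x^*=0\) with \( \theta=\sigma'(0)=1/4\) , \( d=1\) , and \( K=[-2,2]\) ; the uniform error decays as \( \lambda\) grows:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xstar, theta = 0.0, 0.25          # sigma'(0) = 1/4 for the sigmoid
xs = np.linspace(-2.0, 2.0, 401)  # the compact set K = [-2, 2]

def phi_lam(x, lam):
    # the one-hidden-layer network from the proof of Proposition 3
    return (lam / theta) * (sigmoid(x / lam + xstar) - sigmoid(xstar))

for lam in (1.0, 10.0, 100.0, 1000.0):
    print(lam, np.max(np.abs(phi_lam(xs, lam) - xs)))  # sup-error on K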

Using the aforementioned generalization of Proposition 3 to arbitrary non-polynomial activation functions \( \sigma\in\mathcal{M}\) , we obtain the following extension of Theorem 2.

Corollary 1

Let \( d\in\mathbb{N}\) , \( L\in\mathbb{N}\) and \( \sigma\in\mathcal{M}\) . Then \( \mathcal{N}_d^1(\sigma;L)\) is a universal approximator of \( C^0(\mathbb{R}^d)\) if and only if \( \sigma\) is not a polynomial.

Proof

We only show the implication “\( \Leftarrow\) ”. The other direction is again left as an exercise, see Exercise 6.

Assume \( \sigma\in\mathcal{M}\) is not a polynomial, let \( K\subseteq\mathbb{R}^d\) be compact, and let \( f\in C^0(\mathbb{R}^d)\) . Fix \( \varepsilon\in (0,1)\) . We need to show that there exists a neural network \( \Phi\in\mathcal{N}_d^1(\sigma;L)\) such that \( \sup_{{\boldsymbol{x}}\in K}|f({\boldsymbol{x}})-\Phi({\boldsymbol{x}})|<\varepsilon\) . The case \( L=1\) holds by Theorem 2, so let \( L>1\) .

By Theorem 2, there exists \( \Phi_{\rm {shallow}}\in \mathcal{N}_d^1(\sigma;1)\) such that

\[ \begin{equation} \sup_{{\boldsymbol{x}}\in K}|f({\boldsymbol{x}})-\Phi_{\rm shallow}({\boldsymbol{x}})|<\frac{\varepsilon}{2}. \end{equation} \]

(16)

Compactness of \( \{f({\boldsymbol{x}})\,|\,{\boldsymbol{x}}\in K\}\) together with (16) implies that we can find \( n>0\) such that

\[ \begin{equation} \{\Phi_{\rm shallow}({\boldsymbol{x}})\,|\,{\boldsymbol{x}}\in K\}\subseteq [-n,n]. \end{equation} \]

(17)

Let \( \Phi_{\rm {id}}\in \mathcal{N}_1^1(\sigma;L-1)\) be an approximation to the identity such that

\[ \begin{equation} \sup_{x\in [-n,n]}|x-\Phi_{\rm id}(x)|<\frac{\varepsilon}{2}, \end{equation} \]

(18)

which is possible by the extension of Proposition 3 to general non-polynomial activation functions \( \sigma\in\mathcal{M}\) .

Denote \( \Phi\mathrm{:}= \Phi_{\rm {id}}\circ \Phi_{\rm shallow}\) . According to Proposition 1 (II), it holds that \( \Phi\in\mathcal{N}_d^1(\sigma;L)\) , as desired. Moreover, (16), (17), and (18) imply

\[ \begin{align*} \sup_{{\boldsymbol{x}}\in K}|f({\boldsymbol{x}})-\Phi({\boldsymbol{x}})| &=\sup_{{\boldsymbol{x}}\in K}|f({\boldsymbol{x}})-\Phi_{\rm id}(\Phi_{\rm shallow}({\boldsymbol{x}}))|\\ &\le \sup_{{\boldsymbol{x}}\in K}\big(|f({\boldsymbol{x}})-\Phi_{\rm shallow}({\boldsymbol{x}})|+ |\Phi_{\rm shallow}({\boldsymbol{x}})-\Phi_{\rm id}(\Phi_{\rm shallow}({\boldsymbol{x}}))|\big)\\ &\le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \end{align*} \]

This concludes the proof.

4.1.4 Other norms

In addition to the case of continuous functions, universal approximation theorems can be shown for various other function classes and topologies, which may also allow for the approximation of functions exhibiting discontinuities or singularities. To give but one example, we next state such a result for Lebesgue spaces on compact sets. The proof is left to the reader, see Exercise 9.

Corollary 2

Let \( d\in\mathbb{N}\) , \( L\in\mathbb{N}\) , \( p \in [1, \infty)\) , and let \( \sigma\in\mathcal{M}\) not be a polynomial. Then for every \( \varepsilon>0\) , every compact \( K \subseteq\mathbb{R}^d\) , and every \( f \in L^p(K)\) there exists \( \Phi^{f,\varepsilon} \in \mathcal{N}_d^1(\sigma;L)\) such that

\[ \left(\int_{K} | f({\boldsymbol{x}}) - \Phi^{f,\varepsilon}({\boldsymbol{x}})|^p \,\mathrm{d}{\boldsymbol{x}}\right)^{1/p} \leq \varepsilon. \]

4.2 Superexpressive activations and Kolmogorov’s superposition theorem

In the previous section, we saw that a large class of activation functions allow for universal approximation. However, these results did not provide any insights into the necessary neural network size for achieving a specific accuracy.

Before exploring this topic further in the following chapters, we next present a remarkable result that shows how the required neural network size is significantly influenced by the choice of activation function. The result asserts that, with the appropriate activation function, every \( f\in C^0(K)\) on a compact set \( K\subseteq\mathbb{R}^d\) can be approximated to every desired accuracy \( \varepsilon>0\) using a neural network of size \( O(d^2)\) ; in particular the neural network size is independent of \( \varepsilon>0\) , \( K\) , and \( f\) . We will first discuss the one-dimensional case.

Proposition 4

There exists a continuous activation function \( \sigma:\mathbb{R}\to\mathbb{R}\) such that for every compact \( K\subseteq\mathbb{R}\) , every \( \varepsilon>0\) and every \( f\in C^0(K)\) there exists \( \Phi(x)=\sigma(wx+b)\in\mathcal{N}_1^1(\sigma;1,1)\) such that

\[ \begin{align*} \sup_{x\in K}|f(x)-\Phi(x)|<\varepsilon. \end{align*} \]

Proof

Denote by \( \tilde{\mathcal{P}}_n\) all polynomials \( p(x)=\sum_{j=0}^n q_jx^j\) with rational coefficients, i.e., such that \( q_j\in\mathbb{Q}\) for all \( j=0,\dots,n\) . Then \( \tilde{\mathcal{P}}_n\) can be identified with the \( (n+1)\) -fold Cartesian product \( \mathbb{Q}\times\dots\times\mathbb{Q}\) , and thus \( \tilde{\mathcal{P}}_n\) is a countable set. Consequently also the set \( \tilde{\mathcal{P}}\mathrm{:}=\bigcup_{n\in\mathbb{N}}\tilde{\mathcal{P}}_n\) of all polynomials with rational coefficients is countable. Let \( (p_i)_{i\in\mathbb{Z}}\) be an enumeration of these polynomials, and set

\[ \begin{align*} \sigma(x)\mathrm{:}= \left\{ \begin{array}{ll} p_i(x-2i) &\text{if }x\in [2i,2i+1]\\ p_i(1)(2i+2-x)+p_{i+1}(0)(x-2i-1) &\text{if }x\in (2i+1,2i+2). \end{array}\right. \end{align*} \]

In words, \( \sigma\) equals \( p_i\) on even intervals \( [2i, 2i+1]\) and is linear on odd intervals \( [2i+1, 2i+2]\) , resulting in a continuous function overall.

We first assume \( K=[0,1]\) . By Example 1, for every \( \varepsilon>0\) there exists \( p(x)=\sum_{j=0}^nr_jx^j\) such that \( \sup_{x\in [0,1]}|p(x)-f(x)|<{\varepsilon}/{2}\) . Now choose \( q_j\in\mathbb{Q}\) so close to \( r_j\) that \( \tilde p(x)\mathrm{:}= \sum_{j=0}^n q_jx^j\) satisfies \( \sup_{x\in [0,1]}|\tilde p(x)-p(x)|<{\varepsilon}/{2}\) . Let \( i\in\mathbb{Z}\) be such that \( \tilde p=p_i\) ; by construction, \( p_i(x)=\sigma(2i+x)\) for all \( x\in [0,1]\) . Then \( \sup_{x\in [0,1]}|f(x)-\sigma(x+2i)|<\varepsilon\) .

For general compact \( K\) assume that \( K\subseteq [a,b]\) . By Tietze’s extension theorem, \( f\) allows a continuous extension to \( [a,b]\) , so without loss of generality \( K=[a,b]\) . By the first case we can find \( i\in\mathbb{Z}\) such that with \( y=(x-a)/(b-a)\) (i.e., \( y\in [0,1]\) if \( x\in [a,b]\) )

\[ \begin{align*} \sup_{x\in [a,b]}\left|f(x)-\sigma\left(\frac{x-a}{b-a}+2i\right)\right| = \sup_{y\in [0,1]}\left|f(y\cdot (b-a)+a)-\sigma(y+2i)\right| <\varepsilon, \end{align*} \]

which gives the statement with \( w={1}/{(b-a)}\) and bias \( -{a}/{(b-a)}+2i\) .
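A toy version of this construction can be implemented directly. In the sketch below (Python/NumPy; entirely our own illustration), \( \sigma\) is glued together from a small, hand-picked list of polynomials rather than from an enumeration of all rational polynomials, and the index \( i\) is found by brute-force search; it merely illustrates how a single neuron \( x\mapsto\sigma(x+2i)\) can reproduce a whole polynomial on \( [0,1]\) :

import numpy as np

# Toy finite analogue of the "magic" activation: sigma equals candidates[i] on [2i, 2i+1]
# and interpolates linearly in between. The candidate list is ours and purely illustrative.
candidates = [np.poly1d(c) for c in ([0.0], [1.0, 0.0], [-1.0, 1.0, 0.0],
                                     [0.5, -1.5, 1.0, 0.1], [2.0, -3.0, 0.0, 1.0])]

def sigma(x):
    i = min(max(int(np.floor(x / 2.0)), 0), len(candidates) - 1)
    t = x - 2 * i
    if t <= 1.0 or i == len(candidates) - 1:
        return candidates[i](min(t, 1.0))
    return candidates[i](1.0) * (2.0 - t) + candidates[i + 1](0.0) * (t - 1.0)

f = lambda x: x * (1.0 - x)               # target function on K = [0, 1]
xs = np.linspace(0.0, 1.0, 101)
errs = [max(abs(f(x) - sigma(x + 2 * i)) for x in xs) for i in range(len(candidates))]
i = int(np.argmin(errs))
print("best shift 2i =", 2 * i, "uniform error =", errs[i])
# Phi(x) = sigma(x + 2i) is a single neuron realizing the best candidate polynomial.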

To extend this result to arbitrary dimension, we will use Kolmogorov’s superposition theorem. It states that every continuous function of \( d\) variables can be expressed as a composition of functions that each depend only on one variable. We omit the technical proof, which can be found in [36].

Theorem 3 (Kolmogorov)

For every \( d\in\mathbb{N}\) there exist \( 2d^2+d\) monotonically increasing functions \( \varphi_{i,j}\in C^0(\mathbb{R})\) , \( i=1,\dots,d\) , \( j=1,\dots,2d+1\) , such that for every \( f\in C^0([0,1]^d)\) there exist functions \( f_j\in C^0(\mathbb{R})\) , \( j=1,\dots,2d+1\) satisfying

\[ \begin{align*} f({\boldsymbol{x}})=\sum_{j=1}^{2d+1} f_j\left(\sum_{i=1}^d \varphi_{i,j}(x_i) \right)~~\text{for all }{\boldsymbol{x}}\in [0,1]^d. \end{align*} \]

Corollary 3

Let \( d\in\mathbb{N}\) . With the activation function \( \sigma:\mathbb{R}\to\mathbb{R}\) from Proposition 4, for every compact \( K\subseteq\mathbb{R}^d\) , every \( \varepsilon>0\) and every \( f\in C^0(K)\) there exists \( \Phi\in\mathcal{N}_d^1(\sigma;2,2d^2+d)\) (i.e., \( {\rm width}(\Phi)=2d^2+d\) and \( {\rm depth}(\Phi)=2\) ) such that

\[ \begin{align*} \sup_{{\boldsymbol{x}}\in K}|f({\boldsymbol{x}})-\Phi({\boldsymbol{x}})|<\varepsilon. \end{align*} \]

Proof

Without loss of generality we can assume \( K=[0,1]^d\) : the extension to the general case then follows by Tietze’s extension theorem and a scaling argument as in the proof of Proposition 4.

Let \( f_j\) , \( \varphi_{i,j}\) , \( i=1,\dots,d\) , \( j=1,\dots,2d+1\) be as in Theorem 3. Fix \( \varepsilon>0\) . Let \( a>0\) be so large that

\[ \begin{align*} \sup_{i,j}\sup_{x\in [0,1]}|\varphi_{i,j}(x)|\le a. \end{align*} \]

Since each \( f_j\) is uniformly continuous on the compact set \( [-da,da]\) , we can find \( \delta>0\) such that

\[ \begin{align} \sup_j\sup_{\substack{|y-\tilde y|<\delta\\ |y|,|\tilde y|\le da}}|f_j(y)-f_j(\tilde y)|< \frac{\varepsilon}{2(2d+1)}. \end{align} \]

(19)

By Proposition 4 there exist \( w_{i,j}\) , \( b_{i,j}\in\mathbb{R}\) such that

\[ \begin{align} \sup_{i,j}\sup_{x\in [0,1]}|\varphi_{i,j}(x)-\underbrace{\sigma(w_{i,j}x+b_{i,j})}_{=\mathrm{:} \tilde\varphi_{i,j}(x)}|<\frac{\delta}{d} \end{align} \]

(20)

and \( w_{j}\) , \( b_{j}\in\mathbb{R}\) such that

\[ \begin{align} \sup_{j}\sup_{|y|\le a+\delta}|f_j(y)-\underbrace{\sigma(w_{j}y+b_{j})}_{=\mathrm{:} \tilde f_j(y)}|<\frac{\varepsilon}{2(2d+1)}. \end{align} \]

(21)

Then for all \( {\boldsymbol{x}}\in [0,1]^d\) by (20)

\[ \begin{align*} \left|\sum_{i=1}^d\varphi_{i,j}(x_i) -\sum_{i=1}^d\tilde \varphi_{i,j}(x_i)\right| < d\frac{\delta}{d}=\delta. \end{align*} \]

Thus with

\[ \begin{align*} y_j\mathrm{:}= \sum_{i=1}^d\varphi_{i,j}(x_i),~~ \tilde y_j\mathrm{:}= \sum_{i=1}^d\tilde\varphi_{i,j}(x_i) \end{align*} \]

it holds \( |y_j-\tilde y_j|<\delta\) . Using (19) and (21) we conclude

\[ \begin{align*} &\left|f({\boldsymbol{x}})-\sum_{j=1}^{2d+1}\sigma \left(w_{j} \cdot \left(\sum_{i=1}^d\sigma(w_{i,j}x_i+b_{i,j})\right)+b_{j} \right)\right|= \left|\sum_{j=1}^{2d+1}(f_j(y_j)- \tilde f_j(\tilde y_j))\right|\nonumber\\ &~~~~\le \sum_{j=1}^{2d+1}\left(|f_j(y_j)- f_j(\tilde y_j)|+|f_j(\tilde y_j)-\tilde f_j(\tilde y_j)| \right)\nonumber\\ &~~~~\le \sum_{j=1}^{2d+1}\left(\frac{\varepsilon}{2(2d+1)}+\frac{\varepsilon}{2(2d+1)}\right)\le \varepsilon. \end{align*} \]

This concludes the proof.

Kolmogorov’s superposition theorem is intriguing as it shows that approximating \( d\) -dimensional functions can be reduced to the (generally much simpler) one-dimensional case through compositions. Neural networks, by nature, are well suited to approximate functions with compositional structures. However, the functions \( f_j\) in Theorem 3, even though only one-dimensional, could become very complex and challenging to approximate themselves if \( d\) is large.

Similarly, the “magic” activation function in Proposition 4 encodes the information of all rational polynomials on the unit interval, which is why a neural network of size \( O(1)\) suffices to approximate every function to arbitrary accuracy. Naturally, no practical algorithm can efficiently identify appropriate neural network weights and biases for this architecture. As such, the results presented in Section 4.2 should be taken with a pinch of salt as their practical relevance is highly limited. Nevertheless, they highlight that while universal approximation is a fundamental and important property of neural networks, it leaves many aspects unexplored. To gain further insight into practically relevant architectures, in the following chapters, we investigate neural networks with activation functions such as the ReLU.

Bibliography and further reading

The foundation of universal approximation theorems goes back to the late 1980s with seminal works by Cybenko [37], Hornik et al. [38, 39], Funahashi [40], and Carroll and Dickinson [41]. These results were subsequently extended to a wider range of activation functions and architectures. The present analysis in Section 4.1 closely follows the arguments in [32], where it was essentially shown that universal approximation can be achieved if the activation function is not polynomial. The proof of Lemma 1 is from [34, Theorem 2.1], with earlier results of this type being due to [42].

Kolmogorov’s superposition theorem stated in Theorem 3 was originally proven in 1957 [36]. For a more recent and constructive proof see for instance [43]. Kolmogorov’s theorem and its obvious connections to neural networks have inspired various research in this field, e.g., [44, 45, 46, 47, 48], with its practical relevance being debated [49, 50]. The idea for the “magic” activation function in Section 4.2 comes from [51], where it is shown that such an activation function can even be chosen monotonically increasing.

Exercises

Exercise 5

Write down a generator of a (minimal) topology on \( C^0(\mathbb{R}^d)\) such that \( f_n\to f\in C^0(\mathbb{R}^d)\) if and only if \( f_n\xrightarrow{\rm{ cc}} f\) , and show this equivalence. This topology is referred to as the topology of compact convergence.

Exercise 6

Show the implication “\( \Rightarrow\) ” of Theorem 2 and Corollary 1.

Exercise 7

Prove Lemma 2. Hint: Consider \( \sigma(nx)\) for large \( n\in\mathbb{N}\) .

Exercise 8

Let \( k\in\mathbb{N}\) , \( \sigma\in\mathcal{M}\) and assume that \( \sigma *\varphi\in \mathcal{P}_k\) for all \( \varphi\in C_c^\infty(\mathbb{R})\) . Show that \( \sigma\in\mathcal{P}_k\) .

Hint: Consider \( \psi\in C_c^\infty(\mathbb{R})\) such that \( \psi\ge 0\) and \( \int_\mathbb{R}\psi(x)\,\mathrm{d} x =1\) and set \( \psi_\varepsilon(x):=\psi(x/\varepsilon)/\varepsilon\) . Use that away from the discontinuities of \( \sigma\) it holds \( \psi_\varepsilon *\sigma(x)\to\sigma(x)\) as \( \varepsilon\to 0\) . Conclude that \( \sigma\) is piecewise in \( \mathcal{P}_k\) , and finally show that \( \sigma\in C^{k}(\mathbb{R})\) .

Exercise 9

Prove Corollary 2 with the use of Corollary 1.

Exercise 10

Complete the proof of Proposition 3 for \( L>1\) .

5 Splines

In Chapter 4, we saw that sufficiently large neural networks can approximate every continuous function to arbitrary accuracy. However, these results did not further specify the meaning of “sufficiently large” or what constitutes a suitable architecture. Ideally, given a function \( f\) , and a desired accuracy \( \varepsilon>0\) , we would like to have a (possibly sharp) bound on the required size, depth, and width guaranteeing the existence of a neural network approximating \( f\) up to error \( \varepsilon\) .

The field of approximation theory establishes such trade-offs between properties of the function \( f\) (e.g., its smoothness), the approximation accuracy, and the number of parameters needed to achieve this accuracy. For example, given \( k\) , \( d\in\mathbb{N}\) , how many parameters are required to approximate a function \( f:[0,1]^d\to\mathbb{R}\) with \( \| f \|_{C^k({[0,1]^d})}\le 1\) up to uniform error \( \varepsilon\) ? Splines are known to achieve this approximation accuracy with a superposition of \( O(\varepsilon^{-d/k})\) simple (piecewise polynomial) basis functions. In this chapter, following [52], we show that certain sigmoidal neural networks can match this performance in terms of the neural network size. In fact, from an approximation theoretical viewpoint we show that the considered neural networks are at least as expressive as superpositions of splines.

5.1 B-splines and smooth functions

We introduce a simple type of spline and its approximation properties below.

Definition 7

For \( n \in \mathbb{N}\) , the univariate cardinal B-spline of order \( n\) is given by

\[ \begin{align} \mathcal{S}_n(x) \mathrm{:=} \frac{1}{(n-1)!} \sum_{\ell = 0}^{n} (-1)^\ell \binom{n}{\ell} \sigma_{\rm ReLU}(x-\ell)^{n-1} ~~ \text{ for } x \in \mathbb{R}, \end{align} \]

(22)

where \( 0^0 \mathrm{:}= 0\) and \( \sigma_{\rm {ReLU}}\) denotes the ReLU activation function.
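Formula (22) can be evaluated directly. The following minimal Python/NumPy sketch (the function names are ours) computes \( \mathcal{S}_n\) and, for \( n=2\) , compares it with the hat function \( x\mapsto\max\{0,1-|x-1|\}\) supported on \( [0,2]\) :

import numpy as np
from math import comb, factorial

def relu(x):
    return np.maximum(x, 0.0)

def cardinal_bspline(n, x):
    # formula (22): S_n(x) = 1/(n-1)! * sum_{l=0}^{n} (-1)^l binom(n,l) relu(x-l)^(n-1)
    x = np.asarray(x, dtype=float)
    s = sum((-1) ** l * comb(n, l) * relu(x - l) ** (n - 1) for l in range(n + 1))
    return s / factorial(n - 1)

xs = np.linspace(-1.0, 3.0, 9)
print(cardinal_bspline(2, xs))                  # order-2 cardinal B-spline
print(np.maximum(0.0, 1.0 - np.abs(xs - 1.0)))  # the hat function, for comparison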

By shifting and dilating the cardinal B-spline, we obtain a system of univariate splines. Taking tensor products of these univariate splines yields a set of higher-dimensional functions known as the multivariate B-splines.

Definition 8

For \( t \in \mathbb{R}\) and \( n\) , \( \ell \in \mathbb{N}\) we define \( \mathcal{S}_{\ell, t, n} \mathrm{:=} \mathcal{S}_n (2^{\ell}(\cdot - t))\) . Additionally, for \( d \in \mathbb{N}\) , \( {\boldsymbol{t}} \in \mathbb{R}^d\) , and \( n\) , \( \ell \in \mathbb{N}\) , we define the multivariate B-spline \( \mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d\) as

\[ \mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d({\boldsymbol{x}}) \mathrm{:=} \prod_{i = 1}^d \mathcal{S}_{\ell, t_i, n}(x_i)~~ \text{ for } {\boldsymbol{x}} = (x_1, \dots x_d) \in \mathbb{R}^d, \]

and

\[ \mathcal{B}^n \mathrm{:=} \left\{\mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d\, \middle|\, \ell \in \mathbb{N}, {\boldsymbol{t}} \in \mathbb{R}^d\right\} \]

is the dictionary of B-splines of order \( n\) .

Having introduced the system \( \mathcal{B}^n\) , we would like to understand how well we can represent each smooth function by superpositions of elements of \( \mathcal{B}^n\) . The following theorem is adapted from the more general result [1, Theorem 7]; also see [2, Theorem D.3] for a presentation closer to the present formulation.

Theorem 4

Let \( d\) , \( n\) , \( k \in \mathbb{N}\) such that \( 0< k \leq n\) . Then there exists \( C\) such that for every \( f \in C^k([0,1]^d)\) and every \( N \in \mathbb{N}\) , there exist \( c_i \in \mathbb{R}\) with \( |c_i| \leq C \| f \|_{{L^\infty([0,1]^d)}}\) and \( B_i \in \mathcal{B}^n\) for \( i = 1, \dots, N\) , such that

\[ \left\|f - \sum_{i=1}^N c_i B_i \right\|_{L^\infty([0,1]^d)} \leq C N^{-\frac{k}{d}} \|f \|_{C^k([0,1]^d)}. \]

Remark 3

There are a couple of critical concepts in Theorem 4 that will reappear throughout this book. The number of parameters \( N\) determines the approximation accuracy \( N^{-k/d}\) . This implies that achieving accuracy \( \varepsilon>0\) requires \( O(\varepsilon^{-d/k})\) parameters (according to this upper bound), which grows exponentially in \( d\) . This exponential dependence on \( d\) is referred to as the “curse of dimension” and will be discussed again in the subsequent chapters. The smoothness parameter \( k\) has the opposite effect of \( d\) , and improves the convergence rate. Thus, smoother functions can be approximated with fewer B-splines than rougher functions. This more efficient approximation requires the use of B-splines of order \( n\) with \( n\ge k\) . We will see in the following, that the order of the B-spline is closely linked to the concept of depth in neural networks.

5.2 Reapproximation of B-splines with sigmoidal activations

We now show that the approximation rates of B-splines can be transferred to certain neural networks. The following argument is based on [1].

Definition 9

A function \( \sigma: \mathbb{R} \to \mathbb{R}\) is called sigmoidal of order \( q\in \mathbb{N}\) , if \( \sigma \in C^{q-1}(\mathbb{R})\) and there exists \( C >0\) such that

\[ \begin{align*} \frac{\sigma(x)}{x^q} &\to 0 &&\text{as } x\to - \infty,\\ \frac{\sigma(x)}{x^q} &\to 1 &&\text{as } x\to \infty,\\ |\sigma(x)| &\leq C \cdot (1+|x|)^q &&\text{for all } x \in \mathbb{R}. \end{align*} \]

Example 3

The rectified power unit \( x\mapsto\sigma_{\rm {ReLU}}(x)^q\) is sigmoidal of order \( q\) .

Our goal in the following is to show that neural networks can approximate a linear combination of \( N\) B-splines with a number of parameters that is proportional to \( N\) . As an immediate consequence of Theorem 4, we then obtain a convergence rate for neural networks. Let us start by approximating a single univariate B-spline with a neural network of fixed size.

Proposition 5

Let \( n\in \mathbb{N}\) , \( n \geq 2\) , \( K>0\) , and let \( \sigma: \mathbb{R} \to \mathbb{R}\) be sigmoidal of order \( q\geq 2\) . There exists a constant \( C>0\) such that for every \( \varepsilon >0\) there is a neural network \( \Phi^{ \mathcal{S}_n}\) with activation function \( \sigma\) , \( \lceil\log_{q}(n-1)\rceil\) layers, and size \( C\) , such that

\[ \left\|\mathcal{S}_n - \Phi^{ \mathcal{S}_n}\right\|_{L^\infty([-K,K])} \leq \varepsilon. \]

Proof

By definition (22), \( \mathcal{S}_n\) is a linear combination of \( n+1\) shifts of \( \sigma_{\rm {ReLU}}^{n-1}\) . We start by approximating \( \sigma_{\rm {ReLU}}^{n-1}\) . It is not hard to see (Exercise 11) that, for every \( K'>0\) and every \( t \in \mathbb{N}\)

\[ \begin{align} \left|a^{-q^t} \underbrace{\sigma \circ \sigma \circ \dots \circ \sigma(a x)}_{t-\text{ times}} - \sigma_{\rm ReLU}(x)^{q^t} \right|\to 0 ~~\text{as } a \to \infty \end{align} \]

(23)

uniformly for all \( x \in [-K',K']\) .

Set \( t \mathrm{:=} \lceil\log_{q}(n-1)\rceil\) . Then \( t \geq 1\) since \( n \geq 2\) , and \( q^t \geq n-1\) . Thus, for every \( K'>0\) and \( \varepsilon >0\) there exists a neural network \( \Phi^{q^t}_\varepsilon\) with \( \lceil\log_{q}(n-1)\rceil\) layers satisfying

\[ \begin{align} \left|\Phi^{q^t}_\varepsilon(x) - \sigma_{\rm ReLU}(x)^{q^t}\right| \leq \varepsilon ~~\text{for all }x \in [-K', K']. \end{align} \]

(24)

This shows that we can approximate the ReLU to the power of \( q^t\ge n-1\) . However, our goal is to obtain an approximation of the ReLU raised to the power \( n-1\) , which could be smaller than \( q^t\) . To reduce the order, we emulate approximate derivatives of \( \Phi^{q^t}_\varepsilon\) . Concretely, we show the following claim: For all \( 1\leq p \leq q^t\) , every \( K'>0\) , and every \( \varepsilon >0\) there exists a neural network \( \Phi^{p}_\varepsilon\) having \( \lceil\log_{q}(n-1)\rceil\) layers and satisfying

\[ \begin{align} \left|\Phi^{p}_\varepsilon(x) - \sigma_{\rm ReLU}(x)^{p}\right| \leq \varepsilon ~~\text{for all }x\in [-K',K']. \end{align} \]

(25)

The claim holds for \( p = q^t\) . We now proceed by induction over \( p=q^t,q^t-1,\dots\) . Assume (25) holds for some \( p\in\{2,\dots,q^t\}\) . Fix \( \delta > 0\) . Then

\[ \begin{align*} &\left|\frac{\Phi^{p}_{\delta^2}(x + \delta) - \Phi^{p}_{\delta^2}(x )}{p \delta} - \sigma_{\rm ReLU}(x)^{p-1}\right|\\ &~~ \leq 2\frac{\delta}{p} + \left|\frac{\sigma_{\rm ReLU}(x + \delta)^{p} - \sigma_{\rm ReLU}(x)^{p}}{p \delta} - \sigma_{\rm ReLU}(x)^{p-1}\right|. \end{align*} \]

Hence, by the binomial theorem it follows that there exists \( \delta_{*}>0\) such that

\[ \begin{align*} \left|\frac{\Phi^{p}_{\delta_{*}^2}(x + \delta_{*}) - \Phi^{p}_{\delta_{*}^2}(x)}{p \delta_{*}} - \sigma_{\rm ReLU}(x)^{p-1}\right| \leq \varepsilon, \end{align*} \]

for all \( x \in [-K', K']\) . By Proposition 1, \( (\Phi^{p}_{\delta_{*}^2}(x + \delta_{*}) - \Phi^{p}_{\delta_{*}^2}(x))/(p \delta_{*})\) is a neural network with \( \lceil\log_{q}(n-1)\rceil\) layers and size independent of \( \varepsilon\) . Calling this neural network \( \Phi^{p-1}_{\varepsilon}\) shows that (25) holds for \( p-1\) , which concludes the induction argument and proves the claim.

For every neural network \( \Phi\) , every spatial translation \( \Phi(\cdot-t)\) is a neural network of the same architecture. Hence, every term in the sum (22) can be approximated to arbitrary accuracy by a neural network of a fixed size. Since by Proposition 1, sums of neural networks of the same depth are again neural networks of the same depth, the result follows.
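The limit (23), which starts the above construction, can be illustrated numerically. In the sketch below (Python/NumPy; the specific choice \( \sigma(x)=x^q\sigma_{\rm sig}(x)\) with \( q=2\) , which one can check is sigmoidal of order \( 2\) , as well as \( t=2\) and \( K'=1.5\) , are ours), the rescaled \( t\) -fold composition approaches \( \sigma_{\rm ReLU}(x)^{q^t}\) uniformly as \( a\) grows:

import numpy as np

q, t = 2, 2

def sigmoid(x):
    # numerically stable logistic function
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def sig_q(x):
    # x^q * logistic(x): one concrete activation that is sigmoidal of order q = 2
    return x ** q * sigmoid(x)

def relu(x):
    return np.maximum(x, 0.0)

xs = np.linspace(-1.5, 1.5, 301)
target = relu(xs) ** (q ** t)

for a in (10.0, 100.0, 1000.0):
    y = a * xs
    for _ in range(t):
        y = sig_q(y)                      # t-fold composition applied to a*x
    print(a, np.max(np.abs(a ** (-q ** t) * y - target)))  # uniform error on [-1.5, 1.5]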

Next, we extend Proposition 5 to the multivariate splines \( \mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d\) for arbitrary \( \ell\) , \( d \in \mathbb{N}\) , \( {\boldsymbol{t}} \in \mathbb{R}^d\) .

Proposition 6

Let \( n\) , \( d\in \mathbb{N}\) , \( n \geq 2\) , \( K>0\) , and let \( \sigma: \mathbb{R} \to \mathbb{R}\) be sigmoidal of order \( q\geq 2\) . Further let \( \ell \in \mathbb{N}\) and \( {\boldsymbol{t}} \in \mathbb{R}^d\) .

Then, there exists a constant \( C>0\) such that for every \( \varepsilon >0\) there is a neural network \( \Phi^{ \mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d}\) with activation function \( \sigma\) , \( \lceil \log_2(d)\rceil + \lceil\log_{q}(n-1)\rceil\) layers, and size \( C\) , such that

\[ \left\|\mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d - \Phi^{\mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d}\right\|_{L^\infty([-K,K]^d)} \leq \varepsilon. \]

Proof

By definition \( \mathcal{S}_{\ell, {\boldsymbol{t}}, n}^d({\boldsymbol{x}})=\prod_{i=1}^d\mathcal{S}_{\ell,t_i,n}(x_i)\) where

\[ \mathcal{S}_{\ell, t_i, n}(x_i)= \mathcal{S}_{n}(2^\ell(x_i-t_i)). \]

By Proposition 5 there exists a constant \( C'>0\) such that for each \( i = 1, \dots, d\) and all \( \varepsilon >0\) , there is a neural network \( \Phi^{\mathcal{S}_{\ell, t_i, n}}\) with size \( C'\) and \( \lceil \log_q(n-1)\rceil\) layers such that

\[ \left\|\mathcal{S}_{\ell, t_i, n}- \Phi^{\mathcal{S}_{\ell, t_i, n}}\right\|_{L^\infty([-K,K]^d)} \leq \varepsilon. \]

If \( d=1\) , this shows the statement. For general \( d\) , it remains to show that the product of the \( \Phi^{\mathcal{S}_{\ell, t_i, n}}\) for \( i = 1, \dots, d\) can be approximated.

We first prove the following claim by induction: For every \( d\in \mathbb{N}\) , \( d \geq 2\) , there exists a constant \( C''>0\) , such that for all \( {K'} \geq 1\) and all \( \varepsilon>0\) there exists a neural network \( \Phi_{{\rm {mult}},\varepsilon,d}\) with size \( C''\) , \( \lceil \log_2(d) \rceil\) layers, and activation function \( \sigma\) such that for all \( x_1, \dots, x_d\) with \( |x_i| \leq {K'}\) for all \( i = 1, \dots, d\) ,

\[ \begin{align} \left|\Phi_{{\rm mult}, \varepsilon, d}(x_1, \dots, x_d) - \prod_{i =1}^d x_i\right| < \varepsilon. \end{align} \]

(26)

For the base case, let \( d = 2\) . Similar to the proof of Proposition 5, one can show that there exists \( C'''>0\) such that for every \( \varepsilon>0\) and \( K'>0\) there exists a neural network \( \Phi_{{\rm {square}}, \varepsilon}\) with one hidden layer and size \( C'''\) such that

\[ \begin{align*} |\Phi_{{\rm square}, \varepsilon}(x) - \sigma_{\rm ReLU}(x)^2| \leq \varepsilon ~~\text{for all } |x| \leq {K'}. \end{align*} \]

For every \( x = (x_1,x_2) \in \mathbb{R}^2\)

\[ \begin{align} x_1 x_2 &= \frac{1}{2} \left((x_1+x_2)^2 - x_1^2 - x_2^2\right) \nonumber \\ &= \frac{1}{2} \left(\sigma_{\rm ReLU}(x_1+x_2)^2 + \sigma_{\rm ReLU}(-x_1-x_2)^2 - \sigma_{\rm ReLU}(x_1)^2 \right.\nonumber\\ &~~ - \left.\sigma_{\rm ReLU}(-x_1)^2 - \sigma_{\rm ReLU}(x_2)^2 - \sigma_{\rm ReLU}(-x_2)^2\right). \end{align} \]

(27)

Each term on the right-hand side can be approximated up to uniform error \( \varepsilon /6\) with a network of size \( C'''\) and one hidden layer. By Proposition 1, we conclude that there exists a neural network \( \Phi_{{\rm {mult}}, \varepsilon, 2}\) satisfying (26) for \( d=2\) .
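The base case rests on the polarization identity (27) combined with \( y^2=\sigma_{\rm ReLU}(y)^2+\sigma_{\rm ReLU}(-y)^2\) . The following short sketch checks the identity numerically; it uses exact squares where the construction would use the one-hidden-layer approximants \( \Phi_{{\rm square},\varepsilon/6}\) , so it is an illustration rather than the network itself.

```python
# Sketch: x1*x2 recovered from six squared ReLUs via the identity (27).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def product_from_relu_squares(x1, x2):
    s = x1 + x2
    return 0.5 * (relu(s)**2 + relu(-s)**2
                  - relu(x1)**2 - relu(-x1)**2
                  - relu(x2)**2 - relu(-x2)**2)

rng = np.random.default_rng(0)
for x1, x2 in rng.uniform(-3, 3, size=(5, 2)):
    print(x1 * x2, product_from_relu_squares(x1, x2))   # pairs agree
```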

Assume now that the induction hypothesis (26) holds for all \( 2\le d'\le d-1\) , and let \( \varepsilon >0\) and \( {K'}\geq 1\) . We have

\[ \begin{align} \prod_{i =1}^d x_i = \prod_{i = 1}^{\lfloor d/2 \rfloor} x_i \cdot \prod_{i = \lfloor d/2 \rfloor + 1}^{d} x_i. \end{align} \]

(28)

We will now approximate each of the terms in the product on the right-hand side of (28) by a neural network using the induction assumption.

For simplicity assume in the following that \( \lceil\log_2(\lfloor d/2 \rfloor)\rceil = \lceil\log_2(d - \lfloor d/2 \rfloor)\rceil\) . The general case can be addressed via Proposition 3. By the induction assumption there then exist neural networks \( \Phi_{{\rm {mult}}, 1}\) and \( \Phi_{{\rm {mult}}, 2}\) both with \( \lceil\log_2(\lfloor d/2 \rfloor)\rceil\) layers, such that for all \( x_i\) with \( |x_i| \leq {K'}\) for \( i = 1, \dots, d\)

\[ \begin{align*} \left|\Phi_{{\rm mult}, 1}(x_1, \dots, x_{\lfloor d/2 \rfloor}) - \prod_{i = 1}^{\lfloor d/2 \rfloor} x_i\right| &< \frac{\varepsilon}{4 (({K'})^{\lfloor d/2 \rfloor} + \varepsilon)}, \\ \left|\Phi_{{\rm mult}, 2}(x_{\lfloor d/2 \rfloor+1}, \dots, x_{d}) - \prod_{i = \lfloor d/2 \rfloor + 1}^{d} x_i\right| &< \frac{\varepsilon}{4 (({K'})^{\lfloor d/2 \rfloor} + \varepsilon)}. \end{align*} \]

By Proposition 1, \( \Phi_{{\rm {mult}}, \varepsilon, d} \mathrm{:=} \Phi_{{\rm mult}, \varepsilon /2, 2} \circ (\Phi_{{\rm mult}, 1}, \Phi_{{\rm mult}, 2})\) is a neural network with \( 1 + \lceil\log_2(\lfloor d/2 \rfloor)\rceil = \lceil\log_2(d)\rceil\) layers. By construction, the size of \( \Phi_{{\rm {mult}},\varepsilon,d}\) does not depend on \( K'\) or \( \varepsilon\) . Thus, to complete the induction, it only remains to show (26).

For all \( a\) , \( b\) , \( c\) , \( d \in \mathbb{R}\) it holds that

\[ |ab - cd| \leq |a| |b-d| + |d| |a-c|. \]

Hence, for \( x_1, \dots, x_d\) with \( |x_i| \leq {K'}\) for all \( i = 1, \dots, d\) , we have that

\[ \begin{align*} &\left|\prod_{i =1}^d x_i - \Phi_{{\rm mult}, \varepsilon, d}(x_1, \dots, x_d)\right| \\ &\leq \frac{\varepsilon}{2} + \left|\prod_{i = 1}^{\lfloor d/2 \rfloor} x_i \cdot \prod_{i = \lfloor d/2 \rfloor + 1}^{d} x_i - \Phi_{{\rm mult}, 1} (x_1, \dots, x_{\lfloor d/2 \rfloor}) \Phi_{{\rm mult}, 2} (x_{\lfloor d/2 \rfloor+1}, \dots, x_d ) \right|\\ &\leq \frac{\varepsilon}{2} + |{K'}|^{\lfloor d/2 \rfloor} \frac{\varepsilon}{4 (({K'})^{\lfloor d/2 \rfloor} + \varepsilon)} + (|{K'}|^{\lceil d/2 \rceil} + \varepsilon) \frac{\varepsilon}{4 (({K'})^{\lfloor d/2 \rfloor} + \varepsilon)} < \varepsilon. \end{align*} \]

This completes the proof of (26).

The overall result follows by using Proposition 1 to show that the multiplication network can be composed with a neural network comprised of the \( \Phi^{\mathcal{S}_{\ell, t_i, n}}\) for \( i = 1, \dots, d\) . Since in no step above the size of the individual networks was dependent on the approximation accuracy, this is also true for the final network.

Proposition 6 shows that we can approximate a single multivariate B-spline with a neural network with a size that is independent of the accuracy. Combining this observation with Theorem 4 leads to the following result.

Theorem 5

Let \( d\) , \( n\) , \( k \in \mathbb{N}\) such that \( 0< k \leq n\) and \( n\ge 2\) . Let \( q \geq 2\) , and let \( \sigma\) be sigmoidal of order \( q\) .

Then there exists \( C\) such that for every \( f \in C^k([0,1]^d)\) and every \( N \in \mathbb{N}\) there exists a neural network \( \Phi^N\) with activation function \( \sigma\) , \( \lceil \log_2(d)\rceil + \lceil\log_{q}(k-1)\rceil\) layers, and size bounded by \( CN\) , such that

\[ \left\|f - \Phi^N \right\|_{L^\infty([0,1]^d)} \leq C N^{-\frac{k}{d}}\| f \|_{{C^k([0,1]^d)}}. \]

Proof

Fix \( N\in\mathbb{N}\) . By Theorem 4, there exist coefficients \( |c_i|\leq C\| f \|_{{L^\infty([0,1]^d)}}\) and \( B_i \in \mathcal{B}^n\) for \( i = 1, \dots, N\) , such that

\[ \begin{align*} \left\|f - \sum_{i=1}^N c_i B_i \right\|_{L^\infty([0,1]^d)} \leq C N^{-\frac{k}{d}} \|f \|_{C^k([0,1]^d)}. \end{align*} \]

Moreover, by Proposition 6, for each \( i = 1, \dots, N\) there exists a neural network \( \Phi^{B_i}\) with \( \lceil \log_2(d)\rceil + \lceil\log_{q}(k-1)\rceil\) layers and a fixed size, which approximates \( B_i\) on \( [-1,1]^d \supseteq [0,1]^d\) up to an error of \( \varepsilon\mathrm{:}= N^{-k/d}/N\) . The size of \( \Phi^{B_i}\) is independent of \( i\) and \( N\) .

By Proposition 1, the linear combination \( \Phi^N\mathrm{:}=\sum_{i=1}^N c_i \Phi^{B_i}\) is a neural network with \( \lceil \log_2(d)\rceil + \lceil\log_{q}(k-1)\rceil\) layers, and it uniformly approximates \( \sum_{i=1}^N c_i B_i\) on \( [0,1]^d\) up to error \( \sum_{i=1}^N |c_i|\,\varepsilon \le C \|f\|_{L^\infty([0,1]^d)} N^{-\frac{k}{d}}\) . The size of this network is linear in \( N\) (see Exercise 12). Combining the two estimates via the triangle inequality (and adjusting the constant \( C\) ) concludes the proof.

Theorem 5 shows that neural networks with higher-order sigmoidal activation functions can approximate smooth functions with the same accuracy as spline approximations while using a comparable number of parameters. The required network depth grows only like \( O(\log(k))\) in the smoothness parameter \( k\) ; cf. Remark 3.

Bibliography and further reading

The argument of linking sigmoidal activation functions with spline based approximation was first introduced in [52, 55]. For further details on spline approximation, see [53] or the book [56].

The general strategy of approximating basis functions by neural networks, and then lifting approximation results for those bases has been employed widely in the literature, and will also reappear again in this book. While the following chapters primarily focus on ReLU activation, we highlight a few notable approaches with non-ReLU activations based on the outlined strategy: To approximate analytic functions, [57] emulates a monomial basis. To approximate periodic functions, a basis of trigonometric polynomials is recreated in [58]. Wavelet bases have been emulated in [59]. Moreover, neural networks have been studied through the representation system of ridgelets [60] and ridge functions [61]. A general framework describing the emulation of representation systems to transfer approximation results was presented in [62].

Exercises

Exercise 11

Show that (23) holds.

Exercise 12

Let \( L \in \mathbb{N}\) , \( \sigma \colon \mathbb{R} \to \mathbb{R}\) , and let \( \Phi_1\) , \( \Phi_2\) be two neural networks with architecture \( (\sigma; d_0, d_1^{(1)}, \dots, d_{L}^{(1)}, d_{L+1})\) and \( (\sigma; d_0, d_1^{(2)}, \dots, d_{L}^{(2)}, d_{L+1})\) . Show that \( \Phi_1 + \Phi_2\) is a neural network with \( {\rm size}(\Phi_1 + \Phi_2) \leq {\rm size}(\Phi_1) + {\rm size}(\Phi_2)\) .

Exercise 13

Show that, for \( \sigma = \sigma_{\rm {ReLU}}^2\) and \( k \leq 2\) , for all \( f \in C^{k}([0,1]^d)\) all weights of the approximating neural network of Theorem 5 can be bounded in absolute value by \( O(\max\{2, \|f \|_{C^k([0,1]^d)}\})\) .

6 ReLU neural networks

In this chapter, we discuss feedforward neural networks using the ReLU activation function \( \sigma_{\rm {ReLU}}\) introduced in Section 3.3. We refer to these functions as ReLU neural networks. Due to its simplicity and the fact that it reduces the vanishing and exploding gradients phenomena, the ReLU is one of the most widely used activation functions in practice.

A key component of the proofs in the previous chapters was the approximation of derivatives of the activation function to emulate polynomials. Since the ReLU is piecewise linear, this trick is not applicable. This makes the analysis fundamentally different from the case of smoother activation functions. Nonetheless, we will see that even this extremely simple activation function yields a very rich class of functions possessing remarkable approximation capabilities.

To formalize these results, we begin this chapter by adopting a framework from [1], which enables the tracking of the number of network parameters for basic manipulations such as adding up or composing two neural networks. This will allow us to bound the network complexity when constructing more elaborate networks from simpler ones. With these preliminaries at hand, the rest of the chapter is dedicated to the exploration of links between ReLU neural networks and the class of “continuous piecewise linear functions.” In Section 6.2, we will see that every such function can be exactly represented by a ReLU neural network. Afterwards, in Section 6.3 we will give a more detailed analysis of the required network complexity. Finally, we will use these results to prove a first approximation theorem for ReLU neural networks in Section 6.4. The argument is similar in spirit to Chapter 5, in that we transfer established approximation theory for piecewise linear functions to the class of ReLU neural networks of a certain architecture.

6.1 Basic ReLU calculus

The goal of this section is to formalize how to combine and manipulate ReLU neural networks. We have seen an instance of such a result already in Proposition 1. Now we want to make this result more precise under the assumption that the activation function is the ReLU. We sharpen Proposition 1 by adding bounds on the number of weights that the resulting neural networks have. The following four operations form the basis of all constructions in the sequel.

  • Reproducing an identity: We have seen in Proposition 3 that for most activation functions, an approximation to the identity can be built by neural networks. For the ReLU, an even stronger result holds: the identity can be reproduced exactly. This will play a crucial role in extending neural networks to deeper ones and in facilitating an efficient composition operation.
  • Composition: We saw in Proposition 1 that the composition of two neural networks is again a neural network. There we did not study the size of the resulting neural networks. For the ReLU activation function, this composition can be done very efficiently, leading to a neural network whose number of weights is, up to a constant, bounded by the combined number of weights of the two initial neural networks.
  • Parallelization: The parallelization of two neural networks was also discussed in Proposition 1. We will refine this notion and make the size of the resulting neural networks precise.
  • Linear combinations: Similarly, for the sum of two neural networks, we will give precise bounds on the size of the resulting neural network.

6.1.1 Identity

We start with expressing the identity on \( \mathbb{R}^d\) as a neural network of depth \( L\in\mathbb{N}\) .

Lemma 6 (Identity)

Let \( L\in\mathbb{N}\) . Then, there exists a ReLU neural network \( {\Phi^{\rm {id}}_{L}}\) such that \( {\Phi^{\rm {id}}_{L}}({\boldsymbol{x}})={\boldsymbol{x}}\) for all \( {\boldsymbol{x}}\in\mathbb{R}^d\) . Moreover, \( {\rm {depth}}({\Phi^{\rm id}_{L}})=L\) , \( {\rm {width}}({\Phi^{\rm id}_{L}})=2d\) , and \( {\rm {size}}({\Phi^{\rm id}_{L}})=2d\cdot (L+1)\) .

Proof

Writing \( {\boldsymbol{I}}_d\in\mathbb{R}^{d\times d}\) for the identity matrix, we choose the weights

\[ \begin{align*} &({\boldsymbol{W}}^{(0)},{\boldsymbol{b}}^{(0)}),\dots,({\boldsymbol{W}}^{(L)},{\boldsymbol{b}}^{(L)}) \\ &~~ \mathrm{:}= \left(\begin{pmatrix}{\boldsymbol{I}}_d\\ -{\boldsymbol{I}}_d\end{pmatrix},\boldsymbol{0}\right), \underbrace{({\boldsymbol{I}}_{2d},\boldsymbol{0}),\dots,({\boldsymbol{I}}_{2d},\boldsymbol{0})}_{L-1\text{ times}}, (({\boldsymbol{I}}_d,-{\boldsymbol{I}}_d),\boldsymbol{0}) . \end{align*} \]

Using that \( x=\sigma_{\rm {ReLU}}(x)-\sigma_{\rm ReLU}(-x)\) for all \( x\in\mathbb{R}\) and \( \sigma_{\rm {ReLU}}(x)=x\) for all \( x\ge 0\) it is obvious that the neural network \( {\Phi^{\rm {id}}_{L}}\) associated to the weights above satisfies the assertion of the lemma.
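The weight choice in the proof is easy to implement directly. The following minimal sketch (a plain NumPy forward pass, not part of the text; it assumes networks are stored as lists of weight-bias tuples with the ReLU applied after every layer except the last) builds the weights of \( {\Phi^{\rm id}_{L}}\) and checks that the realized map is the identity.

```python
# Sketch of the weight choice in Lemma 6: a depth-L ReLU network realizing
# the identity on R^d, using x = relu(x) - relu(-x).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_network(d, L):
    """Return the weight/bias list ((W0,b0),...,(WL,bL)) from the proof."""
    I = np.eye(d)
    W0 = np.vstack([I, -I])                      # lift to 2d nonnegative coordinates
    hidden = [(np.eye(2 * d), np.zeros(2 * d)) for _ in range(L - 1)]
    WL = np.hstack([I, -I])                      # recombine: relu(x) - relu(-x)
    return [(W0, np.zeros(2 * d))] + hidden + [(WL, np.zeros(d))]

def forward(params, x):
    # ReLU after every layer except the last (affine output layer)
    for W, b in params[:-1]:
        x = relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

x = np.array([1.5, -2.0, 0.0])
print(forward(identity_network(d=3, L=4), x))    # prints x unchanged
```

Counting the nonzero entries of these matrices recovers the size bound \( 2d(L+1)\) stated in the lemma.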

We will see in Exercise 14 that the property to exactly represent the identity is not shared by sigmoidal activation functions. It does hold for polynomial activation functions though; also see Proposition 3.

6.1.2 Composition

Assume we have two neural networks \( \Phi_1\) , \( \Phi_2\) with architectures \( (\sigma_{\rm {ReLU}};d_0^1, \dots, d_{L_1+1}^1)\) and \( (\sigma_{\rm {ReLU}};d_0^2, \dots, d_{L_2+1}^2)\) respectively. Moreover, we assume that they have weights and biases given by

\[ \begin{align*} ({\boldsymbol{W}}^{(0)}_1,{\boldsymbol{b}}^{(0)}_1),\dots,({\boldsymbol{W}}^{(L_1)}_1,{\boldsymbol{b}}^{(L_1)}_1), \text{ and } ({\boldsymbol{W}}^{(0)}_2,{\boldsymbol{b}}^{(0)}_2),\dots,({\boldsymbol{W}}^{(L_2)}_2,{\boldsymbol{b}}^{(L_2)}_2), \end{align*} \]

respectively. If the output dimension \( d^1_{L_1+1}\) of \( \Phi_1\) equals the input dimension \( d_0^2\) of \( \Phi_2\) , we can define two types of concatenations: First \( \Phi_2\circ\Phi_1\) is the neural network with weights and biases given by

\[ \begin{align*} &\left({\boldsymbol{W}}^{(0)}_1,{\boldsymbol{b}}^{(0)}_1\right),\dots,\left({\boldsymbol{W}}^{(L_1-1)}_1,{\boldsymbol{b}}^{(L_1-1)}_1\right),\left({\boldsymbol{W}}^{(0)}_2{\boldsymbol{W}}^{(L_1)}_1,{\boldsymbol{W}}^{(0)}_2 {\boldsymbol{b}}^{(L_1)}_1+{\boldsymbol{b}}_2^{(0)}\right),\\ & ~~ \left({\boldsymbol{W}}^{(1)}_2,{\boldsymbol{b}}^{(1)}_2\right),\dots,\left({\boldsymbol{W}}^{(L_2)}_2,{\boldsymbol{b}}^{(L_2)}_2\right). \end{align*} \]

Second, \( \Phi_2\bullet\Phi_1\) is the neural network defined as \( \Phi_2\circ {\Phi^{\rm {id}}_{1}}\circ \Phi_1\) . In terms of weights and biases, \( \Phi_2\bullet\Phi_1\) is given as

\[ \begin{align*} & \left({\boldsymbol{W}}^{(0)}_{1},{\boldsymbol{b}}^{(0)}_{1}\right),\dots,\left({\boldsymbol{W}}^{(L_1-1)}_1,{\boldsymbol{b}}^{(L_1-1)}_1\right), \left(\begin{pmatrix} {\boldsymbol{W}}^{(L_1)}_1\\ -{\boldsymbol{W}}^{(L_1)}_1 \end{pmatrix}, \begin{pmatrix} {\boldsymbol{b}}^{(L_1)}_1\\ -{\boldsymbol{b}}^{(L_1)}_1 \end{pmatrix}\right),\nonumber\\ &~\left(\left({\boldsymbol{W}}^{(0)}_2,-{\boldsymbol{W}}^{(0)}_2\right),{\boldsymbol{b}}^{(0)}_2\right), \left({\boldsymbol{W}}^{(1)}_2,{\boldsymbol{b}}^{(1)}_2\right), \dots,\left({\boldsymbol{W}}^{(L_2)}_2,{\boldsymbol{b}}^{(L_2)}_2\right). \end{align*} \]

The following lemma collects the properties of the constructions above.

Lemma 7 (Composition)

Let \( \Phi_1\) , \( \Phi_2\) be neural networks with architectures \( (\sigma_{\rm {ReLU}};d_0^1, \dots, d_{L_1+1}^1)\) and \( (\sigma_{\rm {ReLU}};d_0^2, \dots, d_{L_2+1}^2)\) . Assume \( d_{L_1+1}^1=d_0^2\) . Then \( \Phi_2\circ\Phi_1({\boldsymbol{x}})= \Phi_2\bullet\Phi_1({\boldsymbol{x}})= \Phi_2(\Phi_1({\boldsymbol{x}}))\) for all \( {\boldsymbol{x}}\in\mathbb{R}^{d_0^1}\) . Moreover,

\[ \begin{align*} {\rm width}(\Phi_2\circ\Phi_1) &\le \max\{{\rm width}(\Phi_1),{\rm width}(\Phi_2)\},\\ {\rm depth}(\Phi_2\circ\Phi_1) &= {\rm depth}(\Phi_1)+{\rm depth}(\Phi_2),\\ {\rm size}(\Phi_2\circ\Phi_1) &\le {\rm size}(\Phi_1)+{\rm size}(\Phi_2)+(d_{L_1}^1+1)d_1^2, \end{align*} \]

and

\[ \begin{align*} {\rm width}(\Phi_2\bullet\Phi_1) &\le 2\max\{{\rm width}(\Phi_1),{\rm width}(\Phi_2)\},\\ {\rm depth}(\Phi_2\bullet\Phi_1) &= {\rm depth}(\Phi_1)+{\rm depth}(\Phi_2)+1,\\ {\rm size}(\Phi_2\bullet\Phi_1) &\le 2({\rm size}(\Phi_1)+{\rm size}(\Phi_2)). \end{align*} \]

Proof

The fact that \( \Phi_2\circ\Phi_1({\boldsymbol{x}})= \Phi_2\bullet\Phi_1({\boldsymbol{x}})= \Phi_2(\Phi_1({\boldsymbol{x}}))\) for all \( {\boldsymbol{x}}\in\mathbb{R}^{d_0^1}\) follows immediately from the construction. The same can be said for the width and depth bounds. To confirm the size bound, we note that \( {\boldsymbol{W}}_2^{(0)}{\boldsymbol{W}}_1^{(L_1)}\in\mathbb{R}^{d^2_1\times d^1_{L_1}}\) , and hence \( {\boldsymbol{W}}_2^{(0)}{\boldsymbol{W}}_1^{(L_1)}\) has at most \( d_1^2\cdot d_{L_1}^1\) (nonzero) entries. Moreover, \( {\boldsymbol{W}}^{(0)}_2 {\boldsymbol{b}}^{(L_1)}_1+{\boldsymbol{b}}_2^{(0)}\in\mathbb{R}^{d_1^2}\) . Thus, the \( L_1\) -th layer of \( \Phi_2\circ\Phi_1\) has at most \( d_1^2 (1+d_{L_1}^1)\) nonzero entries. The rest is obvious from the construction.

Interpreting linear transformations as neural networks of depth \( 0\) , the previous lemma is also valid in case \( \Phi_1\) or \( \Phi_2\) is a linear mapping.
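Both concatenations amount to simple bookkeeping on the weight lists. The following sketch (assuming networks are stored as lists of \( ({\boldsymbol{W}},{\boldsymbol{b}})\) tuples with the ReLU applied after every layer except the last) implements \( \circ\) and \( \bullet\) and checks them on a toy example; the helper networks phi1 and phi2 are ad-hoc illustrations, not objects from the text.

```python
# Sketch of the two concatenations of Section 6.1.2 on plain weight lists.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(params, x):
    for W, b in params[:-1]:
        x = relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

def compose(net2, net1):
    """Phi2 o Phi1: merge output layer of Phi1 with input layer of Phi2."""
    W1L, b1L = net1[-1]
    W20, b20 = net2[0]
    merged = (W20 @ W1L, W20 @ b1L + b20)
    return net1[:-1] + [merged] + net2[1:]

def compose_bullet(net2, net1):
    """Phi2 . Phi1 = Phi2 o Phi_id_1 o Phi1 (one extra layer)."""
    W1L, b1L = net1[-1]
    W20, b20 = net2[0]
    lift = (np.vstack([W1L, -W1L]), np.concatenate([b1L, -b1L]))
    drop = (np.hstack([W20, -W20]), b20)
    return net1[:-1] + [lift, drop] + net2[1:]

# toy test: phi1(x) = |x|, phi2(y) = 2*relu(y) - 1, both depth-1 nets on R
phi1 = [(np.array([[1.0], [-1.0]]), np.zeros(2)), (np.array([[1.0, 1.0]]), np.zeros(1))]
phi2 = [(np.array([[1.0]]), np.zeros(1)), (np.array([[2.0]]), np.array([-1.0]))]
x = np.array([-0.3])
print(forward(compose(phi2, phi1), x), forward(compose_bullet(phi2, phi1), x))  # both [-0.4]
```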

6.1.3 Parallelization

Let \( (\Phi_i)_{i=1}^m\) be neural networks with architectures \( (\sigma_{\rm {ReLU}};d_0^i, \dots, d_{L_i+1}^i)\) , respectively. We proceed to build a neural network \( (\Phi_1, \dots, \Phi_m)\) realizing the function

\[ \begin{align} (\Phi_1, \dots, \Phi_m) \colon \mathbb{R}^{\sum_{j=1}^m d^j_{0}}&\to \mathbb{R}^{\sum_{j=1}^m d^j_{L_j+1}} \end{align} \]

(29)

\[ \begin{align} ({\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m)&\mapsto (\Phi_1({\boldsymbol{x}}_1),\dots,\Phi_m({\boldsymbol{x}}_m)). \\ \\\end{align} \]

(30)

To do so we first assume \( L_1=\dots=L_m=L\) , and define \( (\Phi_1, \dots, \Phi_m)\) via the following sequence of weight-bias tuples:

\[ \begin{align} \left(\begin{pmatrix} {\boldsymbol{W}}^{(0)}_1 & &\\ &\ddots &\\ & &{\boldsymbol{W}}^{(0)}_m\\ \end{pmatrix}, \begin{pmatrix} {\boldsymbol{b}}^{(0)}_1\\ \vdots\\ {\boldsymbol{b}}^{(0)}_m \end{pmatrix} \right),\dots, \left(\begin{pmatrix} {\boldsymbol{W}}^{(L)}_1 & &\\ &\ddots &\\ && {\boldsymbol{W}}^{(L)}_m \end{pmatrix}, \begin{pmatrix} {\boldsymbol{b}}^{(L)}_1\\ \vdots\\ {\boldsymbol{b}}^{(L)}_m \end{pmatrix} \right) \end{align} \]

(31)

where these matrices are understood as block-diagonal filled up with zeros. For the general case where the \( \Phi_j\) might have different depths, let \( L_{\max}\mathrm{:}=\max_{1\le i\le m}L_i\) and \( I\mathrm{:}=\{1\le i\le m\,|\,L_i<L_{\max}\}\) . For \( j\in I^c\) set \( \widetilde{\Phi}_j\mathrm{:}=\Phi_j\) , and for each \( j\in I\)

\[ \begin{align} \widetilde\Phi_j&\mathrm{:}= {\Phi^{\rm id}_{L_{\max}-L_j}}\circ \Phi_j. \end{align} \]

(32)

Finally,

\[ \begin{align} (\Phi_1,\dots,\Phi_m)\mathrm{:}=(\widetilde\Phi_1,\dots,\widetilde\Phi_m). \end{align} \]

(33)

We collect the properties of the parallelization in the lemma below.

Lemma 8 (Parallelization)

Let \( m \in \mathbb{N}\) and \( (\Phi_i)_{i=1}^m\) be neural networks with architectures \( (\sigma_{\rm {ReLU}};d_0^i, \dots, d_{L_i+1}^i)\) , respectively. Then the neural network \( (\Phi_1, \dots, \Phi_m)\) satisfies

\[ (\Phi_1, \dots, \Phi_m)({\boldsymbol{x}}) = (\Phi_1({\boldsymbol{x}}_1),\dots,\Phi_m({\boldsymbol{x}}_m)) \text{ for all } {\boldsymbol{x}} \in \mathbb{R}^{\sum_{j=1}^m d^j_{0}}. \]

Moreover, with \( L_{\max}\mathrm{:}= \max_{j\le m}L_j\) it holds that

\[ \begin{align} {\rm width}((\Phi_1, \dots, \Phi_m)) &\le 2\sum_{j=1}^m{\rm width}(\Phi_j), \end{align} \]

(34.a)

\[ \begin{align} {\rm depth}((\Phi_1, \dots, \Phi_m)) &= \max_{j\le m}{\rm depth}(\Phi_j),\\ {\rm size}((\Phi_1, \dots, \Phi_m)) &\le {2}\sum_{j=1}^m{\rm size}(\Phi_j)+2\sum_{j=1}^m(L_{\max}-L_j) {d_{L_j+1}^j}. \\ \\\end{align} \]

(34.b)

Proof

All statements except for the bound on the size follow immediately from the construction. To obtain the bound on the size, we note that by construction the sizes of the \( (\widetilde\Phi_i)_{i=1}^m\) in (32) will simply be added. The size of each \( \widetilde\Phi_i\) can be bounded with Lemma 7.

If all input dimensions \( d_0^1=\dots=d_0^m \mathrm{:=} d_0\) are the same, we will also use parallelization with shared inputs to realize the function \( {\boldsymbol{x}}\mapsto (\Phi_1({\boldsymbol{x}}),\dots,\Phi_m({\boldsymbol{x}}))\) from \( \mathbb{R}^{d_0}\to \mathbb{R}^{d^1_{L_1+1}+\dots+d^m_{L_m+1}}\) . In terms of the construction (31), the only required change is that the block-diagonal matrix \( \rm{{diag}}({\boldsymbol{W}}_1^{(0)},\dots,{\boldsymbol{W}}_m^{(0)})\) becomes the matrix in \( \mathbb{R}^{\sum_{j=1}^m d_1^j\times d_0^1}\) which stacks \( {\boldsymbol{W}}_1^{(0)},\dots,{\boldsymbol{W}}_m^{(0)}\) on top of each other. Similarly, we will allow \( \Phi_j\) to only take some of the entries of \( {\boldsymbol{x}}\) as input. For parallelization with shared inputs we will use the same notation \( (\Phi_j)_{j=1}^m\) as before, where the precise meaning will always be clear from context. Note that Lemma 8 remains valid in this case.
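Parallelization with shared inputs is again a purely structural operation on the weight lists: stack the first-layer matrices and keep all later layers block-diagonal. The sketch below (for two networks of equal depth; the general case would first pad the shallower one with \( \Phi^{\rm id}\) as in (32)) illustrates this.

```python
# Sketch of parallelization with shared inputs (Section 6.1.3),
# for two ReLU nets of equal depth.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(params, x):
    for W, b in params[:-1]:
        x = relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

def block_diag(A, B):
    return np.block([[A, np.zeros((A.shape[0], B.shape[1]))],
                     [np.zeros((B.shape[0], A.shape[1])), B]])

def parallel_shared(net1, net2):
    (W10, b10), (W20, b20) = net1[0], net2[0]
    first = (np.vstack([W10, W20]), np.concatenate([b10, b20]))   # stacked first layer
    rest = [(block_diag(Wa, Wb), np.concatenate([ba, bb]))        # block-diagonal later layers
            for (Wa, ba), (Wb, bb) in zip(net1[1:], net2[1:])]
    return [first] + rest

# |x| and relu(x), both depth-1 nets on R
abs_net  = [(np.array([[1.0], [-1.0]]), np.zeros(2)), (np.array([[1.0, 1.0]]), np.zeros(1))]
relu_net = [(np.array([[1.0]]), np.zeros(1)),         (np.array([[1.0]]),      np.zeros(1))]
x = np.array([-0.7])
print(forward(parallel_shared(abs_net, relu_net), x))   # -> [0.7, 0.0]
```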

6.1.4 Linear combinations

Let \( m \in \mathbb{N}\) and let \( (\Phi_i)_{i=1}^m\) be ReLU neural networks that have architectures \( (\sigma_{\rm {ReLU}};d_0^i, \dots, d_{L_i+1}^i)\) , respectively. Assume that \( d^1_{L_1+1}=\dots=d^m_{L_m+1}\) , i.e., all \( \Phi_1,\dots,\Phi_m\) have the same output dimension. For scalars \( \alpha_j\in\mathbb{R}\) , we wish to construct a ReLU neural network \( \sum_{j=1}^m\alpha_j\Phi_j\) realizing the function

\[ \begin{align*} \left\{ \begin{array}{l} \mathbb{R}^{\sum_{j=1}^m d^j_{0}}\to \mathbb{R}^{d^1_{L_1+1}}\\ ({\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m)\mapsto \sum_{j=1}^m\alpha_j\Phi_j({\boldsymbol{x}}_j). \end{array}\right. \end{align*} \]

This corresponds to the parallelization \( (\Phi_1, \dots, \Phi_m)\) composed with the linear transformation \( ({\boldsymbol{z}}_1,\dots,{\boldsymbol{z}}_m)\mapsto \sum_{j=1}^m\alpha_j{\boldsymbol{z}}_j\) . The following result holds.

Lemma 9 (Linear combinations)

Let \( m \in \mathbb{N}\) and \( (\Phi_i)_{i=1}^m\) be neural networks with architectures \( (\sigma_{\rm {ReLU}};d_0^i, \dots, d_{L_i+1}^i)\) , respectively. Assume that \( d^1_{L_1+1}=\dots=d^m_{L_m+1}\) , let \( \alpha\in\mathbb{R}^m\) and set \( L_{\max}\mathrm{:}= \max_{j\le m}L_j\) . Then, there exists a neural network \( \sum_{j=1}^m\alpha_j\Phi_j\) such that \( (\sum_{j=1}^m\alpha_j\Phi_j)({\boldsymbol{x}}) = \sum_{j=1}^m\alpha_j\Phi_j({\boldsymbol{x}}_j)\) for all \( {\boldsymbol{x}} = ({\boldsymbol{x}}_j)_{j=1}^m \in \mathbb{R}^{\sum_{j=1}^m d^j_{0}}\) . Moreover,

\[ \begin{align} {\rm width}\left(\sum_{j=1}^m\alpha_j\Phi_j\right) &\le 2\sum_{j=1}^m{\rm width}(\Phi_j), \end{align} \]

(35.a)

\[ \begin{align} {\rm depth}\left(\sum_{j=1}^m\alpha_j\Phi_j\right) &= \max_{j\le m}{\rm depth}(\Phi_j),\\ {\rm size}\left(\sum_{j=1}^m\alpha_j\Phi_j\right) &\le {2}\sum_{j=1}^m{\rm size}(\Phi_j)+2\sum_{j=1}^m(L_{\max}-L_j) {d_{L_j+1}^j}. \\ \\\end{align} \]

(35.b)

Proof

The construction of \( \sum_{j=1}^m\alpha_j\Phi_j\) is analogous to that of \( (\Phi_1, \dots, \Phi_m)\) , i.e., we first define the linear combination of neural networks with the same depth. Then the weights are chosen as in (31), but with the last linear transformation replaced by

\[ \begin{align*} \left(( \alpha_1{\boldsymbol{W}}^{(L)}_1 \cdots \alpha_m{\boldsymbol{W}}^{(L)}_m), \sum_{j=1}^m \alpha_j{\boldsymbol{b}}^{(L)}_j \right). \end{align*} \]

For general depths, we define the sum of the neural networks to be the sum of the extended neural networks \( \widetilde{\Phi}_i\) as in (32). All statements of the lemma follow immediately from this construction.

In case \( d_0^1=\dots=d_0^m \mathrm{:=} d_0\) (all neural networks have the same input dimension), we will also consider linear combinations with shared inputs, i.e., a neural network realizing

\[ {\boldsymbol{x}}\mapsto \sum_{j=1}^m\alpha_j\Phi_j({\boldsymbol{x}})~~\text{for } {\boldsymbol{x}} \in \mathbb{R}^{d_0}. \]

This requires the same minor adjustment as discussed at the end of Section 6.1.3. Lemma 9 remains valid in this case and again we do not distinguish in notation for linear combinations with or without shared inputs.
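Compared to parallelization, only the output layer changes in a linear combination with shared inputs: the last weight matrices are scaled and placed side by side, and the last biases are summed with the same scalars, as in the proof of Lemma 9. A minimal sketch for two networks of equal depth (same list-of-tuples convention as above):

```python
# Sketch of a linear combination with shared inputs (Lemma 9).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(params, x):
    for W, b in params[:-1]:
        x = relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

def lincomb_shared(net1, net2, a1, a2):
    (W10, b10), (W20, b20) = net1[0], net2[0]
    layers = [(np.vstack([W10, W20]), np.concatenate([b10, b20]))]     # shared input layer
    for (Wa, ba), (Wb, bb) in zip(net1[1:-1], net2[1:-1]):             # block-diagonal hidden layers
        layers.append((np.block([[Wa, np.zeros((Wa.shape[0], Wb.shape[1]))],
                                 [np.zeros((Wb.shape[0], Wa.shape[1])), Wb]]),
                       np.concatenate([ba, bb])))
    (W1L, b1L), (W2L, b2L) = net1[-1], net2[-1]
    layers.append((np.hstack([a1 * W1L, a2 * W2L]), a1 * b1L + a2 * b2L))  # merged output layer
    return layers

abs_net  = [(np.array([[1.0], [-1.0]]), np.zeros(2)), (np.array([[1.0, 1.0]]), np.zeros(1))]
relu_net = [(np.array([[1.0]]), np.zeros(1)),         (np.array([[1.0]]),      np.zeros(1))]
x = np.array([-2.0])
print(forward(lincomb_shared(abs_net, relu_net, 3.0, -1.0), x))   # 3*|x| - relu(x) = 6.0
```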

6.2 Continuous piecewise linear functions

In this section, we will relate ReLU neural networks to a large class of functions. We first formally introduce the set of continuous piecewise linear functions from a set \( \Omega\subseteq\mathbb{R}^d\) to \( \mathbb{R}\) . Note that we admit in particular \( \Omega=\mathbb{R}^d\) in the following definition.

Definition 10

Let \( \Omega\subseteq\mathbb{R}^d\) , \( d\in\mathbb{N}\) . We call a function \( f:\Omega\to\mathbb{R}\) continuous, piecewise linear (cpwl) if \( f\in C^0(\Omega)\) and there exist \( n\in\mathbb{N}\) affine functions \( g_j \colon \mathbb{R}^d\to\mathbb{R},\) \( g_j({\boldsymbol{x}}) = {\boldsymbol{w}}_j^\top{\boldsymbol{x}}+b_j\) such that for each \( {\boldsymbol{x}}\in\Omega\) it holds that \( f({\boldsymbol{x}})= g_j({\boldsymbol{x}})\) for at least one \( j\in\{1,\dots,n\}\) . For \( m>1\) we call \( f:\Omega\to\mathbb{R}^m\) cpwl if and only if each component of \( f\) is cpwl.

Remark 4

A “continuous piecewise linear function” as in Definition 10 is actually piecewise affine. To maintain consistency with the literature, we use the terminology cpwl.

In the following, we refer to the connected domains on which \( f\) coincides with one of the functions \( g_j\) as regions or pieces. If \( f\) is cpwl with \( q\in\mathbb{N}\) regions, then, with \( n \in \mathbb{N}\) denoting the number of affine functions, it holds that \( n\le q\) .

Note that the mapping \( {\boldsymbol{x}} \mapsto \sigma_{\rm {ReLU}}({\boldsymbol{w}}^\top{\boldsymbol{x}}+b)\) , which is a ReLU neural network with a single neuron, is cpwl (with two regions). Consequently, every ReLU neural network is a repeated composition of linear combinations of cpwl functions. It is not hard to see that the set of cpwl functions is closed under compositions and linear combinations. Hence, every ReLU neural network is a cpwl function. Interestingly, the reverse direction of this statement is also true, meaning that every cpwl function can be represented by a ReLU neural network as we shall demonstrate below. Therefore, we can identify the class of functions realized by arbitrary ReLU neural networks as the class of cpwl functions.

Theorem 6

Let \( d\in \mathbb{N}\) , let \( \Omega\subseteq\mathbb{R}^d\) be convex, and let \( f:\Omega\to\mathbb{R}\) be cpwl with \( n \in \mathbb{N}\) as in Definition 10. Then, there exists a ReLU neural network \( \Phi^f\) such that \( \Phi^f({\boldsymbol{x}})=f({\boldsymbol{x}})\) for all \( {\boldsymbol{x}}\in\Omega\) and

\[ {\rm size}(\Phi^f)=O(dn2^n),~ {\rm width}(\Phi^f)=O(dn2^n),~ {\rm depth}(\Phi^f)=O(n). \]

A statement similar to Theorem 6 can be found in [2, 3]. There, the authors give a construction with a depth that behaves logarithmically in \( d\) and is independent of \( n\) , but with significantly larger bounds on the size. As we shall see, the proof of Theorem 6 is a simple consequence of the following well-known result from [4]; also see [5], and for sharper bounds [6]. It states that every cpwl function can be expressed as a finite maximum of a finite minimum of certain affine functions.

Proposition 7

Let \( d\in \mathbb{N}\) , \( \Omega\subseteq\mathbb{R}^d\) be convex, and let \( f:\Omega\to\mathbb{R}\) be cpwl with \( n\in\mathbb{N}\) affine functions as in Definition 10. Then there exists \( m\in\mathbb{N}\) and sets \( s_j\subseteq\{1,\dots,n\}\) for \( j\in \{1,\dots,m\}\) , such that

\[ \begin{align} f({\boldsymbol{x}}) = \max_{1\le j\le m}\min_{i\in s_j}(g_i({\boldsymbol{x}}))~~\text{ for all } {\boldsymbol{x}}\in\Omega. \end{align} \]

(36)

Proof

Step 1. We start with \( d=1\) , i.e., \( \Omega\subseteq\mathbb{R}\) is a (possibly unbounded) interval and for each \( x\in \Omega\) there exists \( j\in\{1,\dots,n\}\) such that with \( g_j(x) \mathrm{:=} w_jx+b_j\) it holds that \( f(x)=g_j(x)\) . Without loss of generality, we can assume that \( g_i \neq g_j\) for all \( i \neq j\) . Since the graphs of the \( g_j\) are lines, they intersect at (at most) finitely many points in \( \Omega\) .

Since \( f\) is continuous, we conclude that there exist finitely many intervals covering \( \Omega\) , such that \( f\) coincides with one of the \( g_j\) on each interval. For each \( x\in \Omega\) let

\[ \begin{align*} s_x\mathrm{:}= \{1\le j\le n\,|\,g_j(x)\ge f(x)\} \end{align*} \]

and

\[ \begin{align*} f_x(y)\mathrm{:}= \min_{j\in s_x}g_j(y)~~\text{ for all } y\in \Omega. \end{align*} \]

Clearly, \( f_x(x)=f(x)\) . We claim that, additionally,

\[ \begin{align} f_x(y)\le f(y)~~\text{ for all } y\in \Omega. \end{align} \]

(37)

This then shows that

\[ \begin{align*} f(y)=\max_{x\in \Omega}f_x(y) = \max_{x\in \Omega}\min_{j\in s_x}g_j(y)~~\text{ for all } y\in\Omega. \end{align*} \]

Since there exist only finitely many possibilities to choose a subset of \( \{1,\dots,n\}\) , we conclude that (36) holds for \( d=1\) .

It remains to verify the claim (37). Fix \( y\neq x\in\Omega\) . Without loss of generality, let \( x<y\) and let \( x=x_0<\dots<x_k=y\) be such that \( f|_{[x_{i-1},x_{i}]}\) equals some \( g_j\) for each \( i\in\{1,\dots,k\}\) . In order to show (37), it suffices to prove that there exists at least one \( j\) such that \( g_j(x_0)\ge f(x_0)\) and \( g_j(x_k)\le f(x_k)\) . The claim is trivial for \( k=1\) . We proceed by induction. Suppose the claim holds for \( k-1\) , and consider the partition \( x_0<\dots<x_k\) . Let \( r\in\{1,\dots,n\}\) be such that \( f|_{[x_0,x_1]}=g_r|_{[x_0,x_1]}\) . Applying the induction hypothesis to the interval \( [x_1,x_k]\) , we can find \( j\in\{1,\dots,n\}\) such that \( g_j(x_1)\ge f(x_1)\) and \( g_j(x_k)\le f(x_k)\) . If \( g_j(x_0)\ge f(x_0)\) , then \( g_j\) is the desired function. Otherwise, \( g_j(x_0)<f(x_0)\) . Then \( g_r(x_0)=f(x_0)>g_j(x_0)\) and \( g_r(x_1)=f(x_1)\le g_j(x_1)\) . Therefore \( g_r(x)\le g_j(x)\) for all \( x\ge x_1\) , and in particular \( g_r(x_k)\le g_j(x_k)\) . Thus \( g_r\) is the desired function.

Step 2. For general \( d\in\mathbb{N}\) , let \( g_j({\boldsymbol{x}})\mathrm{:}= {\boldsymbol{w}}_j^\top{\boldsymbol{x}}+b_j\) for \( j=1,\dots,n\) . For each \( {\boldsymbol{x}}\in\Omega\) , let

\[ \begin{align*} s_{\boldsymbol{x}}\mathrm{:}= \{1\le j\le n\,|\,g_j({\boldsymbol{x}})\ge f({\boldsymbol{x}})\} \end{align*} \]

and for all \( {\boldsymbol{y}}\in\Omega\) , let

\[ \begin{align*} f_{\boldsymbol{x}}({\boldsymbol{y}})\mathrm{:}= \min_{j\in s_{\boldsymbol{x}}}g_j({\boldsymbol{y}}). \end{align*} \]

For an arbitrary \( 1\) -dimensional affine subspace \( S\subseteq \mathbb{R}^d\) passing through \( {\boldsymbol{x}}\) consider the line (segment) \( I\mathrm{:}= S\cap\Omega\) , which is connected since \( \Omega\) is convex. By Step 1, it holds

\[ \begin{align*} f({\boldsymbol{y}}) = \max_{{\boldsymbol{x}}\in\Omega}f_{\boldsymbol{x}}({\boldsymbol{y}})=\max_{{\boldsymbol{x}}\in\Omega}\min_{j\in s_{\boldsymbol{x}}}g_j({\boldsymbol{y}}) \end{align*} \]

on all of \( I\) . Since \( I\) was arbitrary the formula is valid for all \( {\boldsymbol{y}}\in\Omega\) . This again implies (36) as in Step 1.
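A concrete instance of the max-min form (36) is the hat function \( x\mapsto\max\{0,1-|x|\}\) on \( \mathbb{R}\) , which uses the three affine functions \( g_1(x)=x+1\) , \( g_2(x)=1-x\) , \( g_3(x)=0\) together with the sets \( s_1=\{1,2\}\) and \( s_2=\{3\}\) . The following sketch (an illustration, not part of the proof) verifies this numerically.

```python
# Sketch of the max-min form (36) for the 1D hat function:
# max{ min{x+1, 1-x}, 0 } equals max{0, 1 - |x|}.
import numpy as np

g = [lambda x: x + 1.0, lambda x: 1.0 - x, lambda x: 0.0 * x]
subsets = [[0, 1], [2]]            # s_1 = {1,2}, s_2 = {3} in the text's numbering

def f_maxmin(x):
    return np.max([np.min([g[i](x) for i in s], axis=0) for s in subsets], axis=0)

def f_direct(x):                   # the hat function itself
    return np.maximum(0.0, 1.0 - np.abs(x))

xs = np.linspace(-2, 2, 401)
print(np.max(np.abs(f_maxmin(xs) - f_direct(xs))))   # 0.0
```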

Remark 5

For any \( a_1,\dots,a_k\in\mathbb{R}\) , it holds that \( \min\{-a_1,\dots,-a_k\}=-\max\{a_1,\dots,a_k\}\) . Thus, in the setting of Proposition 7, there exists \( \tilde m\in\mathbb{N}\) and sets \( \tilde s_j\subseteq\{1,\dots,n\}\) for \( j=1,\dots,\tilde m\) , such that for all \( {\boldsymbol{x}}\in\Omega\)

\[ \begin{align*} f({\boldsymbol{x}})=-(-f({\boldsymbol{x}})) &= -\max_{1\le j\le \tilde m}\min_{i\in \tilde s_j}(-g_i({\boldsymbol{x}}))\\ &=-\max_{1\le j\le \tilde m}(-\max_{i\in \tilde s_j}(g_i({\boldsymbol{x}})))\\ &=\min_{1\le j\le \tilde m}(\max_{i\in \tilde s_j}(g_i({\boldsymbol{x}}))). \end{align*} \]

To prove Theorem 6, it therefore suffices to show that the minimum and the maximum are expressible by ReLU neural networks.

Lemma 10

For every \( x\) , \( y\in\mathbb{R}\) it holds that

\[ \begin{align*} \min\{x,y\} = \sigma_{\rm ReLU}(y)-\sigma_{\rm ReLU}(-y) - \sigma_{\rm ReLU}(y-x) \in \mathcal{N}_2^1( \sigma_{\rm ReLU};1,3) \end{align*} \]

and

\[ \begin{align*} \max\{x,y\} = \sigma_{\rm ReLU}(y)-\sigma_{\rm ReLU}(-y)+\sigma_{\rm ReLU}(x-y)\in \mathcal{N}_2^1( \sigma_{\rm ReLU};1,3). \end{align*} \]

Proof

We have

\[ \begin{align*} \max\{x,y\} &= y + \left\{ \begin{array}{ll} 0 &\text{if }y>x\\ x-y &\text{if }x\ge y \end{array}\right. \\ &= y + \sigma_{\rm ReLU}(x-y). \end{align*} \]

Using \( y=\sigma_{\rm {ReLU}}(y)-\sigma_{\rm ReLU}(-y)\) , the claim for the maximum follows. For the minimum observe that \( \min\{x,y\}=-\max\{-x,-y\}\) .

Figure 9. Sketch of the neural network in Lemma 10. Only edges with non-zero weights are drawn.
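The two formulas of Lemma 10 are easy to check numerically; the following short snippet (an illustration only) does so on random inputs.

```python
# Quick numerical check of the formulas in Lemma 10.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
for x, y in rng.uniform(-5, 5, size=(5, 2)):
    min_net = relu(y) - relu(-y) - relu(y - x)
    max_net = relu(y) - relu(-y) + relu(x - y)
    assert np.isclose(min_net, min(x, y)) and np.isclose(max_net, max(x, y))
print("Lemma 10 formulas verified on random samples.")
```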

The minimum of \( n\ge 2\) inputs can be computed by repeatedly applying the construction of Lemma 10. The resulting neural network is described in the next lemma.

Lemma 11

For every \( n\ge 2\) there exists a neural network \( {\Phi^{\min}_{n}}:\mathbb{R}^n\to\mathbb{R}\) with

\[ {\rm size}({\Phi^{\min}_{n}})\le 16 n,~~ {\rm width}({\Phi^{\min}_{n}})\le 3n,~~ {\rm depth}({\Phi^{\min}_{n}})\le \lceil\log_2(n)\rceil \]

such that \( {\Phi^{\min}_{n}}(x_1,\dots,x_n)=\min_{1\le j\le n} x_j\) . Similarly, there exists a neural network \( {\Phi^{\max}_{n}}:\mathbb{R}^n\to\mathbb{R}\) realizing the maximum and satisfying the same complexity bounds.

Proof

Throughout denote by \( {\Phi^{\min}_{2}}:\mathbb{R}^2\to\mathbb{R}\) the neural network from Lemma 10. It is of depth \( 1\) and size \( 7\) (since all biases are zero, it suffices to count the number of connections in Figure 9).

Step 1. Consider first the case where \( n=2^k\) for some \( k\in\mathbb{N}\) . We proceed by induction over \( k\) . For \( k=1\) the claim is proven. For \( k\ge 2\) set

\[ \begin{equation} {\Phi^{\min}_{2^k}}\mathrm{:}= {\Phi^{\min}_{2}}\circ ({\Phi^{\min}_{2^{k-1}}},{\Phi^{\min}_{2^{k-1}}}). \end{equation} \]

(38)

By Lemma 7 and Lemma 8 we have

\[ \begin{align*} {\rm depth}({\Phi^{\min}_{2^k}})\le {\rm depth}({\Phi^{\min}_{2}})+{\rm depth}({\Phi^{\min}_{2^{k-1}}})\le\cdots\le k. \end{align*} \]

Next, we bound the size of the neural network. Note that all biases in this neural network are set to \( 0\) , since the \( {\Phi^{\min}_{2}}\) neural network in Lemma 10 has no biases. Thus, the size of the neural network \( {\Phi^{\min}_{2^k}}\) corresponds to the number of connections in the graph (the number of nonzero weights). Careful inspection of the neural network architecture, see Figure 10, reveals that

\[ \begin{align*} {\rm size}({\Phi^{\min}_{2^k}})&=4\cdot 2^{k-1} +\sum_{j=0}^{k-2} 12\cdot 2^{j} +3\\ &= 2n+12\cdot (2^{k-1}-1)+3 = 2n+6n-9\le 8n, \end{align*} \]

and that \( {\rm {width}}({\Phi^{\min}_{2^k}})\le ({3}/{2}) 2^{k}\) . This concludes the proof for the case \( n=2^k\) .

Step 2. For the general case, we first let

\[ {\Phi^{\min}_{1}}(x)\mathrm{:}= x~~\text{for all }x\in\mathbb{R} \]

be the identity on \( \mathbb{R}\) , i.e., a linear transformation and thus formally a depth \( 0\) neural network. Then, for all \( n\ge 2\)

\[ \begin{equation} {\Phi^{\min}_{n}}\mathrm{:}= {\Phi^{\min}_{2}}\circ \left\{ \begin{array}{ll} ({\Phi^{\rm id}_{1}}\circ{\Phi^{\min}_{\lfloor\frac{n}{2}\rfloor}},{\Phi^{\min}_{\lceil\frac{n}{2}\rceil}}) &\text{if }n\in\{2^k+1\,|\,k\in\mathbb{N}\}\\ ({\Phi^{\min}_{\lfloor\frac{n}{2}\rfloor}},{\Phi^{\min}_{\lceil\frac{n}{2}\rceil}}) &\text{otherwise.} \end{array}\right. \end{equation} \]

(39)

This definition extends (38) to arbitrary \( n\ge 2\) , since the first case in (39) never occurs if \( n\ge 2\) is a power of two.

To analyze (39), we start with the depth and claim that

\[ {\rm depth}({\Phi^{\min}_{n}})=k~~\text{for all }2^{k-1}<n\le 2^{k} \]

and all \( k\in\mathbb{N}\) . We proceed by induction over \( k\) . The case \( k=1\) is clear. For the induction step, assume the statement holds for some fixed \( k\in\mathbb{N}\) and fix an integer \( n\) with \( 2^{k}<n\le 2^{k+1}\) . Then

\[ \Big\lceil \frac{n}{2}\Big\rceil\in (2^{k-1},2^k]\cap\mathbb{N} \]

and

\[ \Big\lfloor \frac{n}{2}\Big\rfloor \in\left\{ \begin{array}{ll} \{2^{k-1}\} &\text{if }n=2^{k}+1\\ (2^{k-1},2^k]\cap\mathbb{N} &\text{otherwise.} \end{array}\right. \]

Using the induction assumption, (39) and Lemmas 6 and 7, this shows

\[ {\rm depth}({\Phi^{\min}_{n}}) = {\rm depth}({\Phi^{\min}_{2}}) + k = 1+k, \]

and proves the claim.

For the size and width bounds, we only sketch the argument: Fix \( n\in\mathbb{N}\) such that \( 2^{k-1}<n\le 2^k\) . Then \( {\Phi^{\min}_{n}}\) is constructed from at most as many subnetworks as \( {\Phi^{\min}_{2^k}}\) , but with some \( {\Phi^{\min}_{2}}:\mathbb{R}^2\to\mathbb{R}\) blocks replaced by \( {\Phi^{\rm {id}}_{1}}:\mathbb{R}\to\mathbb{R}\) , see Figure 11. Since \( {\Phi^{\rm {id}}_{1}}\) has the same depth as \( {\Phi^{\min}_{2}}\) , but is smaller in width and number of connections, the width and size of \( {\Phi^{\min}_{n}}\) is bounded by the width and size of \( {\Phi^{\min}_{2^k}}\) . Due to \( 2^k\le 2n\) , the bounds from Step 1 give the bounds stated in the lemma.

Step 3. For the maximum, define

\[ {\Phi^{\max}_{n}} (x_1,\dots,x_n)\mathrm{:}= -{\Phi^{\min}_{n}}(-x_1,\dots,-x_n). \]
Figure 10. Architecture of the \( {\Phi^{\min}_{2^k}}\) neural network in Step 1 of the proof of Lemma 11 and the number of connections in each layer for \( k=3\) . Each grey box corresponds to \( 12\) connections in the graph.
Figure 11. Construction of \( {\Phi^{\min}_{n}}\) for general \( n\) in Step 2 of the proof of Lemma 11.
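On the level of the realized functions, the recursion (39) is a binary tree of pairwise minima of depth \( \lceil\log_2(n)\rceil\) . The sketch below mirrors this recursion; it ignores the identity padding that equalizes depths, which only matters for the network bookkeeping and not for the computed value.

```python
# Sketch of the recursion in Lemma 11, acting on values rather than weights.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def min2(x, y):                          # the one-hidden-layer net of Lemma 10
    return relu(y) - relu(-y) - relu(y - x)

def min_n(values):
    values = list(values)
    if len(values) == 1:
        return values[0]                 # Phi_min_1: a linear (identity) map
    half = len(values) // 2              # split into floor(n/2) and ceil(n/2) inputs
    return min2(min_n(values[:half]), min_n(values[half:]))

rng = np.random.default_rng(2)
v = rng.uniform(-10, 10, size=13)
print(min_n(v), np.min(v))               # both equal the minimum of the 13 inputs
```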

Proof (of Theorem 6)

By Proposition 7 the neural network

\[ \begin{align*} \Phi\mathrm{:}= {\Phi^{\max}_{m}}\bullet ({\Phi^{\min}_{|s_j|}})_{j=1}^m\bullet (({\boldsymbol{w}}_i^\top {\boldsymbol{x}}+b_i)_{i\in s_j})_{j=1}^m \end{align*} \]

realizes the function \( f\) .

Since the number of possibilities to choose subsets of \( \{1,\dots,n\}\) equals \( 2^n\) we have \( m\le 2^n\) . Since each \( s_j\) is a subset of \( \{1,\dots,n\}\) , the cardinality \( |s_j|\) of \( s_j\) is bounded by \( n\) . By Lemma 7, Lemma 8, and Lemma 11

\[ \begin{align*} {\rm depth}(\Phi) &\le 2+{\rm depth}({\Phi^{\max}_{m}})+\max_{1\le j\le m}{\rm depth}({\Phi^{\min}_{|s_j|}})\\ &\le 2+\lceil\log_2(2^n)\rceil+\lceil\log_2(n)\rceil = O(n) \end{align*} \]

and

\[ \begin{align*} {\rm width}(\Phi)&\le 2\max\Big\{{\rm width}({\Phi^{\max}_{m}}),\sum_{j=1}^m{\rm width}({\Phi^{\min}_{|s_j|}}),\sum_{j=1}^m{\rm width}(({\boldsymbol{w}}_i^\top {\boldsymbol{x}}+b_i)_{i\in s_j}))\Big\}\nonumber\\ &\le 2\max\{3 m, 3 m n, mdn\} = O(dn2^n) \end{align*} \]

and

\[ \begin{align*} {\rm size}(\Phi)&\le 4\Big({\rm size}({\Phi^{\max}_{m}})+{\rm size}(({\Phi^{\min}_{|s_j|}})_{j=1}^m)+{\rm size}{(({\boldsymbol{w}}_i^\top {\boldsymbol{x}}+b_i)_{i\in s_j})_{j=1}^m)}\Big)\nonumber\\ &\le 4\left(16m +2\sum_{j=1}^m (16|s_j|+2\lceil\log_2(n)\rceil)+{nm(d+1)}\right) = O(dn2^n). \end{align*} \]

This concludes the proof.

6.3 Simplicial pieces

This section studies the case where we do not have arbitrary cpwl functions, but where the regions on which \( f\) is affine are simplices. Under this condition, we can construct neural networks that scale merely linearly in the number of such regions, a substantial improvement over the exponential dependence of the size on the number of regions found in Theorem 6.

6.3.1 Triangulations of \( \Omega\)

For the ensuing discussion, we will consider \( \Omega\subseteq\mathbb{R}^d\) to be partitioned into simplices. This partitioning will be termed a triangulation of \( \Omega\) . Other notions prevalent in the literature include a tessellation of \( \Omega\) , or a simplicial mesh on \( \Omega\) . To give a precise definition, let us first recall some terminology. For a set \( S\subseteq\mathbb{R}^d\) we denote the convex hull of \( S\) by

\[ \begin{align} {\rm co}(S)\mathrm{:}= \left\{\sum_{j=1}^n\alpha_j {\boldsymbol{x}}_j\, \middle|\,n\in\mathbb{N},~{\boldsymbol{x}}_j\in S,\alpha_j\ge 0,~\sum_{j=1}^n\alpha_j=1\right\}. \end{align} \]

(40)

An \( n\) -simplex is the convex hull of \( n+1\) points that are independent in a specific sense. This is made precise in the following definition.

Definition 11

Let \( n\in\mathbb{N}_0\) , \( d\in\mathbb{N}\) and \( n\le d\) . We call \( {\boldsymbol{x}}_0,\dots,{\boldsymbol{x}}_n\in\mathbb{R}^d\) affinely independent if and only if either \( n=0\) or \( n\ge 1\) and the vectors \( {\boldsymbol{x}}_1-{\boldsymbol{x}}_0,\dots,{\boldsymbol{x}}_n-{\boldsymbol{x}}_0\) are linearly independent. In this case, we call \( {\rm {co}}({\boldsymbol{x}}_0,\dots,{\boldsymbol{x}}_n)\mathrm{:}={\rm co}(\{{\boldsymbol{x}}_0,\dots,{\boldsymbol{x}}_n\})\) an \( n\) -simplex.

As mentioned before, a triangulation refers to a partition of a space into simplices. We give a formal definition below.

Definition 12

Let \( d\in\mathbb{N}\) , and \( \Omega\subseteq\mathbb{R}^d\) be compact. Let \( \mathcal{T}\) be a finite set of \( d\) -simplices, and for each \( \tau\in\mathcal{T}\) let \( V(\tau)\subseteq \Omega\) have cardinality \( d+1\) such that \( \tau={\rm {co}}(V(\tau))\) . We call \( \mathcal{T}\) a regular triangulation of \( \Omega\) , if and only if

  1. \( \bigcup_{\tau\in\mathcal{T}}\tau = \Omega\) ,
  2. for all \( \tau\) , \( \tau'\in\mathcal{T}\) it holds that \( \tau\cap\tau'={\rm {co}}(V(\tau)\cap V(\tau'))\) .

We call \( {\boldsymbol{\eta}}\in \mathcal{V}\mathrm{:}= \bigcup_{\tau\in\mathcal{T}}V(\tau)\) a node (or vertex) and \( \tau\in\mathcal{T}\) an element of the triangulation.

Figure 12. The first is a regular triangulation, while the second and the third are not.

For a regular triangulation \( \mathcal{T}\) with nodes \( \mathcal{V}\) we also introduce the constant

\[ \begin{align} k_\mathcal{T}\mathrm{:}= \max_{{\boldsymbol{\eta}}\in\mathcal{V}}|\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}| \end{align} \]

(41)

corresponding to the maximal number of elements shared by a single node.

6.3.2 Size bounds for regular triangulations

Throughout this subsection, let \( \mathcal{T}\) be a regular triangulation of \( \Omega\) , and we adhere to the notation of Definition 12. We will say that \( f:\Omega\to\mathbb{R}\) is cpwl with respect to \( \mathcal{T}\) if \( f\) is cpwl and \( f|_\tau\) is affine for each \( \tau\in\mathcal{T}\) . The rest of this subsection is dedicated to proving the following result. It was first shown in [7] with a more technical argument, and extends an earlier statement from [3] to general triangulations (also see Section 6.3.3).

Theorem 7

Let \( d \in \mathbb{N}\) , \( \Omega\subseteq\mathbb{R}^d\) be a bounded domain, and let \( \mathcal{T}\) be a regular triangulation of \( \Omega\) . Let \( f:\Omega\to\mathbb{R}\) be cpwl with respect to \( \mathcal{T}\) and \( f|_{\partial\Omega}=0\) . Then there exists a ReLU neural network \( \Phi:\Omega\to\mathbb{R}\) realizing \( f\) , and it holds

\[ \begin{align} {\rm size}(\Phi) = O(|\mathcal{T}|),~~ {\rm width}(\Phi)=O(|\mathcal{T}|),~~ {\rm depth}(\Phi) = O(1), \end{align} \]

(42)


where the constants in the Landau notation depend on \( d\) and \( k_\mathcal{T}\) in (41).

We will split the proof into several lemmata. The strategy is to introduce a basis of the space of cpwl functions on \( \mathcal{T}\) whose elements vanish on the boundary of \( \Omega\) . We then show that there exist \( O(|\mathcal{T}|)\) basis functions, each of which can be represented by a neural network whose size depends only on \( k_\mathcal{T}\) and \( d\) . To construct this basis, we first point out that an affine function on a simplex is uniquely defined by its values at the nodes.

Lemma 12

Let \( d \in \mathbb{N}\) . Let \( \tau\mathrm{:}= {\rm {co}}({\boldsymbol{\eta}}_0,\dots,{\boldsymbol{\eta}}_d)\) be a \( d\) -simplex. For every \( y_0,\dots,y_{d}\in\mathbb{R}\) , there exists a unique \( g\in\mathcal{P}_1(\mathbb{R}^d)\) such that \( g({\boldsymbol{\eta}}_i)=y_i\) , \( i=0,\dots,d\) .

Proof

Since \( {\boldsymbol{\eta}}_1-{\boldsymbol{\eta}}_0,\dots,{\boldsymbol{\eta}}_d-{\boldsymbol{\eta}}_0\) is a basis of \( \mathbb{R}^d\) , there is a unique \( {\boldsymbol{w}}\in\mathbb{R}^d\) such that \( {\boldsymbol{w}}^\top ({\boldsymbol{\eta}}_i-{\boldsymbol{\eta}}_0)=y_i-y_0\) for \( i=1,\dots,d\) . Then \( g({\boldsymbol{x}})\mathrm{:}= {\boldsymbol{w}}^\top{\boldsymbol{x}}+(y_0-{\boldsymbol{w}}^\top{\boldsymbol{\eta}}_0)\) is as desired. Moreover, for every \( g\in\mathcal{P}_1\) it holds that \( g(\sum_{i=0}^d\alpha_i{\boldsymbol{\eta}}_i)=\sum_{i=0}^d\alpha_ig({\boldsymbol{\eta}}_i)\) whenever \( \sum_{i=0}^d\alpha_i=1\) (this is in general not true if the coefficients do not sum to \( 1\) ). Hence, \( g\) is uniquely determined by its values at the nodes.
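Computationally, Lemma 12 amounts to solving a small linear system: the rows \( ({\boldsymbol{\eta}}_i^\top,1)\) are linearly independent exactly when the nodes are affinely independent. A minimal sketch (with an ad-hoc example simplex, for illustration only):

```python
# Sketch of Lemma 12: recover the unique affine g(x) = w.x + c from its
# values at the d+1 vertices of a d-simplex by solving a linear system.
import numpy as np

def affine_from_simplex(nodes, values):
    """nodes: (d+1, d) array of affinely independent points; values: (d+1,)."""
    d = nodes.shape[1]
    A = np.hstack([nodes, np.ones((d + 1, 1))])   # rows (eta_i, 1)
    sol = np.linalg.solve(A, values)              # solve for (w, c)
    return sol[:d], sol[d]

nodes = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # a 2-simplex
values = np.array([1.0, 0.0, 0.0])                        # value 1 at eta_0, 0 elsewhere
w, c = affine_from_simplex(nodes, values)
print(w, c)                                               # w = [-1, -1], c = 1
```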

Since \( \Omega\) is the union of the simplices \( \tau\in\mathcal{T}\) , every cpwl function with respect to \( \mathcal{T}\) is thus uniquely defined through its values at the nodes. Hence, the desired basis consists of cpwl functions \( \varphi_{\boldsymbol{\eta}}:\Omega\to\mathbb{R}\) with respect to \( \mathcal{T}\) such that

\[ \begin{align} \varphi_{\boldsymbol{\eta}}({\boldsymbol{\mu}})=\delta_{{\boldsymbol{\eta}}{\boldsymbol{\mu}}}~~\text{ for all }{\boldsymbol{\mu}}\in\mathcal{V}, \end{align} \]

(44)

where \( \delta_{{\boldsymbol{\eta}}{\boldsymbol{\mu}}}\) denotes the Kronecker delta. Assuming \( \varphi_{\boldsymbol{\eta}}\) to be well-defined for the moment, we can then represent every cpwl function \( f:\Omega\to\mathbb{R}\) that vanishes on the boundary \( \partial\Omega\) as

\[ \begin{align*} f({\boldsymbol{x}}) = \sum_{{\boldsymbol{\eta}}\in\mathcal{V}\cap\mathring{\Omega}}f({\boldsymbol{\eta}})\varphi_{\boldsymbol{\eta}}({\boldsymbol{x}})~~\text{for all }{\boldsymbol{x}}\in\Omega. \end{align*} \]

Note that it suffices to sum over the set of interior nodes \( \mathcal{V}\cap\mathring{\Omega}\) , since \( f({\boldsymbol{\eta}})=0\) whenever \( {\boldsymbol{\eta}}\in\partial\Omega\) . To formally verify existence and well-definedness of \( \varphi_{\boldsymbol{\eta}}\) , we first need a lemma characterizing the boundary of so-called “patches” of the triangulation: For each \( {\boldsymbol{\eta}}\in\mathcal{V}\) , we introduce the patch \( \omega({\boldsymbol{\eta}})\) of the node \( {\boldsymbol{\eta}}\) as the union of all elements containing \( {\boldsymbol{\eta}}\) , i.e.,

\[ \begin{align} \omega({\boldsymbol{\eta}})\mathrm{:}= \bigcup_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}}\tau. \end{align} \]

(45)

Lemma 13

Let \( {\boldsymbol{\eta}}\in\mathcal{V}\cap\mathring{\Omega}\) be an interior node. Then,

\[ \begin{align*} \partial\omega({\boldsymbol{\eta}}) =\bigcup_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}}{\rm co}(V(\tau)\backslash\{{\boldsymbol{\eta}}\}). \end{align*} \]
Figure 13. Visualization of Lemma 13 in two dimensions. The patch \( \omega({\boldsymbol{\eta}})\) consists of the union of all \( 2\) -simplices \( \tau_i\) containing \( {\boldsymbol{\eta}}\) . Its boundary consists of the union of all \( 1\) -simplices made up by the nodes of each \( \tau_i\) without the center node, i.e., the convex hulls of \( V(\tau_i)\backslash\{{\boldsymbol{\eta}}\}\) .

We refer to Figure 13 for a visualization of Lemma 13. The proof of Lemma 13 is quite technical but nonetheless elementary. We therefore only outline the general argument and leave the details to the reader in Exercise 18: The boundary of \( \omega({\boldsymbol{\eta}})\) must be contained in the union of the boundaries of all \( \tau\) in the patch \( \omega({\boldsymbol{\eta}})\) . Since \( {\boldsymbol{\eta}}\) is an interior point of \( \Omega\) , it must also be an interior point of \( \omega({\boldsymbol{\eta}})\) . This can be used to show that for every \( S\mathrm{:}= \{{\boldsymbol{\eta}}_{i_0},\dots,{\boldsymbol{\eta}}_{i_{k}}\}\subseteq V(\tau)\) of cardinality \( k+1\le d\) , the interior of (the \( k\) -dimensional manifold) \( {\rm {co}}(S)\) belongs to the interior of \( \omega({\boldsymbol{\eta}})\) whenever \( {\boldsymbol{\eta}}\in S\) . Using Exercise 18, it then only remains to check that \( {\rm {co}}(S)\subseteq\partial\omega({\boldsymbol{\eta}})\) whenever \( {\boldsymbol{\eta}}\notin S\) , which yields the claimed formula. We are now in a position to show the well-definedness of the basis functions in (44).

Lemma 14

For each interior node \( {\boldsymbol{\eta}}\in\mathcal{V}\cap \mathring{\Omega}\) there exists a unique cpwl function \( \varphi_{{\boldsymbol{\eta}}}:\Omega\to\mathbb{R}\) satisfying (44). Moreover, \( \varphi_{{\boldsymbol{\eta}}}\) can be expressed by a ReLU neural network with size, width, and depth bounds that only depend on \( d\) and \( k_\mathcal{T}\) .

Proof

By Lemma 12, on each \( \tau\in\mathcal{T}\) , the affine function \( \varphi_{\boldsymbol{\eta}}|_{\tau}\) is uniquely defined through the values at the nodes of \( \tau\) . This defines a continuous function \( \varphi_{{\boldsymbol{\eta}}}:\Omega\to\mathbb{R}\) . Indeed, whenever \( \tau\cap\tau'\neq\emptyset\) , then \( \tau\cap\tau'\) is a subsimplex of both \( \tau\) and \( \tau'\) in the sense of condition 2 of Definition 12. Thus, applying Lemma 12 again, the affine functions on \( \tau\) and \( \tau'\) coincide on \( \tau\cap\tau'\) .

Using Lemma 12, Lemma 13 and the fact that \( \varphi_{{\boldsymbol{\eta}}}({\boldsymbol{\mu}})=0\) whenever \( {\boldsymbol{\mu}}\neq{\boldsymbol{\eta}}\) , we find that \( \varphi_{\boldsymbol{\eta}}\) vanishes on the boundary of the patch \( \omega({\boldsymbol{\eta}})\subseteq\Omega\) . Thus, \( \varphi_{\boldsymbol{\eta}}\) vanishes on the boundary of \( \Omega\) . Extending by zero, it becomes a cpwl function \( \varphi_{\boldsymbol{\eta}}:\mathbb{R}^d\to\mathbb{R}\) . This function is nonzero only on elements \( \tau\) for which \( {\boldsymbol{\eta}}\in\tau\) . Hence, it is a cpwl function with at most \( n\mathrm{:}= k_\mathcal{T}+1\) affine functions. By Theorem 6, \( \varphi_{\boldsymbol{\eta}}\) can be expressed as a ReLU neural network with the claimed size, width and depth bounds; to apply Theorem 6 we used that (the extension of) \( \varphi_{\boldsymbol{\eta}}\) is defined on the convex domain \( \mathbb{R}^d\) .

Finally, Theorem 7 is now an easy consequence of the above lemmata.

Proof (of Theorem 7)

With

\[ \begin{align} \Phi({\boldsymbol{x}})\mathrm{:}= \sum_{{\boldsymbol{\eta}}\in\mathcal{V}\cap\mathring{\Omega}} f({\boldsymbol{\eta}})\varphi_{\boldsymbol{\eta}}({\boldsymbol{x}}) ~~ \text{ for } {\boldsymbol{x}} \in \Omega, \end{align} \]

(46)

it holds that \( \Phi:\Omega\to\mathbb{R}\) satisfies \( \Phi({\boldsymbol{\eta}})=f({\boldsymbol{\eta}})\) for all \( {\boldsymbol{\eta}}\in\mathcal{V}\) . By Lemma 12 this implies that \( f\) equals \( \Phi\) on each \( \tau\) , and thus \( f\) equals \( \Phi\) on all of \( \Omega\) . Since each element \( \tau\) is the convex hull of \( d+1\) nodes \( {\boldsymbol{\eta}}\in\mathcal{V}\) , the cardinality of \( \mathcal{V}\) is bounded by the cardinality of \( \mathcal{T}\) times \( d+1\) . Thus, the summation in (46) is over \( O(|\mathcal{T}|)\) terms. Using Lemma 9 and Lemma 14 we obtain the claimed bounds on size, width, and depth of the neural network.
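In one dimension, the representation (46) reduces to the classical hat-function interpolant, and each hat function is itself a small ReLU network. The following sketch (with an arbitrary choice of \( f\) vanishing at the boundary, for illustration only) assembles the sum in (46) on a uniform partition of \( [0,1]\) .

```python
# Sketch of the representation (46) on a 1D "triangulation" of [0,1]:
# nodal values times hat basis functions, each hat written with ReLUs.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x, left, mid, right):
    # cpwl basis function: 1 at mid, 0 outside (left, right)
    return (relu(x - left) - relu(x - mid)) / (mid - left) \
         - (relu(x - mid) - relu(x - right)) / (right - mid)

nodes = np.linspace(0.0, 1.0, 6)            # interior nodes are nodes[1:-1]
f = lambda x: np.sin(np.pi * x)             # vanishes at the boundary {0, 1}

def interpolant(x):
    total = np.zeros_like(x)
    for i in range(1, len(nodes) - 1):      # sum over interior nodes, as in (46)
        total += f(nodes[i]) * hat(x, nodes[i - 1], nodes[i], nodes[i + 1])
    return total

xs = np.linspace(0, 1, 201)
print(np.max(np.abs(interpolant(xs) - f(xs))))   # small interpolation error
```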

6.3.3 Size bounds for locally convex triangulations

Assuming local convexity of the triangulation, in this section we make the dependence of the constants in Theorem 7 explicit in the dimension \( d\) and in the maximal number of simplices \( k_{\mathcal{T}}\) touching a node, see (41). As such, the improvement over Theorem 7 is modest, and the reader may choose to skip this section on a first pass. Nonetheless, the proof, originally from [3], is entirely constructive and gives some further insight into how ReLU networks express functions. Let us start by stating the required convexity constraint.

Definition 13

A regular triangulation \( \mathcal{T}\) is called locally convex if and only if \( \omega({\boldsymbol{\eta}})\) is convex for all interior nodes \( {\boldsymbol{\eta}}\in\mathcal{V}\cap\mathring{\Omega}\) .

The following theorem is a variant of [3, Theorem 3.1].

Theorem 8

Let \( d \in \mathbb{N}\) , and let \( \Omega\subseteq\mathbb{R}^d\) be a bounded domain. Let \( \mathcal{T}\) be a locally convex regular triangulation of \( \Omega\) . Let \( f:\Omega\to\mathbb{R}\) be cpwl with respect to \( \mathcal{T}\) and \( f|_{\partial\Omega}=0\) . Then, there exists a constant \( C>0\) (independent of \( d\) , \( f\) and \( \mathcal{T}\) ) and there exists a neural network \( \Phi^f:\Omega\to\mathbb{R}\) such that \( \Phi^f = f\) ,

\[ \begin{align*} {\rm size}(\Phi^f) &\le C \cdot (1+d^2 k_{\mathcal{T}} |\mathcal{T}|),\\ {\rm width}(\Phi^f)&\le C \cdot (1+d \log(k_{\mathcal{T}})|\mathcal{T}|),\\ {\rm depth}(\Phi^f)&\le C \cdot (1+\log_2(k_{\mathcal{T}})). \end{align*} \]

Assume in the following that \( \mathcal{T}\) is a locally convex triangulation. We will split the proof of the theorem again into a few lemmata. First, we will show that a convex patch can be written as an intersection of finitely many half-spaces. Specifically, with the affine hull of a set \( S\) defined as

\[ \begin{align} {\rm aff}(S)\mathrm{:}= \left\{\sum_{j=1}^n\alpha_j {\boldsymbol{x}}_j\, \middle|\,n\in\mathbb{N},~{\boldsymbol{x}}_j\in S,~\alpha_j\in\mathbb{R},~\sum_{j=1}^n\alpha_j=1\right\} \end{align} \]

(47)

let, in the following, for \( \tau\in\mathcal{T}\) and \( {\boldsymbol{\eta}}\in V(\tau)\) ,

\[ \begin{align*} H_0(\tau,{\boldsymbol{\eta}})\mathrm{:}= {\rm aff}(V(\tau)\backslash\{{\boldsymbol{\eta}}\}) \end{align*} \]

be the affine hyperplane passing through all nodes in \( V(\tau)\backslash\{{\boldsymbol{\eta}}\}\) , and let further

\[ \begin{align*} H_+(\tau,{\boldsymbol{\eta}})\mathrm{:}= \{{\boldsymbol{x}}\in\mathbb{R}^d\,|\,{\boldsymbol{x}}\text{ is on the same side of }H_0(\tau,{\boldsymbol{\eta}})\text{ as }{\boldsymbol{\eta}}\}\cup H_0(\tau,{\boldsymbol{\eta}}). \end{align*} \]

Lemma 15

Let \( {\boldsymbol{\eta}}\) be an interior node. Then the patch \( \omega({\boldsymbol{\eta}})\) is convex if and only if

\[ \begin{align} \omega({\boldsymbol{\eta}})=\bigcap_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}}H_+(\tau,{\boldsymbol{\eta}}). \end{align} \]

(48)

Proof

The right-hand side is a finite intersection of (convex) half-spaces, and thus itself convex. It remains to show that if \( \omega({\boldsymbol{\eta}})\) is convex, then (48) holds. We start with ``\( \supseteq\) ''. Suppose \( {\boldsymbol{x}}\notin \omega({\boldsymbol{\eta}})\) . Then the line segment \( {\rm {co}}(\{{\boldsymbol{x}},{\boldsymbol{\eta}}\})\) must pass through \( \partial\omega({\boldsymbol{\eta}})\) , and by Lemma 13 this implies that there exists \( \tau\in\mathcal{T}\) with \( {\boldsymbol{\eta}}\in\tau\) such that \( {\rm {co}}(\{{\boldsymbol{x}},{\boldsymbol{\eta}}\})\) passes through \( {\rm {aff}}(V(\tau)\backslash\{{\boldsymbol{\eta}}\})=H_0(\tau,{\boldsymbol{\eta}})\) . Hence \( {\boldsymbol{\eta}}\) and \( {\boldsymbol{x}}\) lie on different sides of this affine hyperplane, which shows ``\( \supseteq\) ''. Now we show ``\( \subseteq\) ''. Let \( \tau\in\mathcal{T}\) be such that \( {\boldsymbol{\eta}}\in\tau\) and fix \( {\boldsymbol{x}}\) in the complement of \( H_+(\tau,{\boldsymbol{\eta}})\) . Suppose that \( {\boldsymbol{x}}\in\omega({\boldsymbol{\eta}})\) . By convexity, we then have \( {\rm {co}}(\{{\boldsymbol{x}}\}\cup\tau)\subseteq\omega({\boldsymbol{\eta}})\) . This implies that there exists a point in \( {\rm {co}}(V(\tau)\backslash\{{\boldsymbol{\eta}}\})\) belonging to the interior of \( \omega({\boldsymbol{\eta}})\) . This contradicts Lemma 13. Thus, \( {\boldsymbol{x}}\notin\omega({\boldsymbol{\eta}})\) .

The above lemma allows us to explicitly construct the basis functions \( \varphi_{\boldsymbol{\eta}}\) in (44). To see this, denote in the following for \( \tau\in\mathcal{T}\) and \( {\boldsymbol{\eta}}\in V(\tau)\) by \( g_{\tau,{\boldsymbol{\eta}}}\in\mathcal{P}_1(\mathbb{R}^d)\) the affine function such that

\[ \begin{align*} g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{\mu}})=\left\{ \begin{array}{ll} 1 & \text{if }{\boldsymbol{\eta}}={\boldsymbol{\mu}}\\ 0 &\text{if }{\boldsymbol{\eta}}\neq {\boldsymbol{\mu}} \end{array}\right. ~~ \text{ for all } {\boldsymbol{\mu}}\in V(\tau). \end{align*} \]

This function exists and is unique by Lemma 12. Observe that \( \varphi_{\boldsymbol{\eta}}({\boldsymbol{x}})=g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}})\) for all \( {\boldsymbol{x}}\in\tau\) .

Lemma 16

Let \( {\boldsymbol{\eta}}\in\mathcal{V}\cap\mathring{\Omega}\) be an interior node and let \( \omega({\boldsymbol{\eta}})\) be a convex patch. Then

\[ \begin{align} \varphi_{{\boldsymbol{\eta}}}({\boldsymbol{x}}) = \max\left\{0,\min_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}}g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}})\right\}~~ \text{ for all } {\boldsymbol{x}}\in\mathbb{R}^d. \end{align} \]

(49)

Proof

First let \( {\boldsymbol{x}}\notin\omega({\boldsymbol{\eta}})\) . By Lemma 15 there exists \( \tau\in \mathcal{T}\) with \( {\boldsymbol{\eta}}\in\tau\) such that \( {\boldsymbol{x}}\) is in the complement of \( H_+(\tau,{\boldsymbol{\eta}})\) . Observe that

\[ \begin{align} g_{\tau,{\boldsymbol{\eta}}}|_{H_+(\tau,{\boldsymbol{\eta}})}\ge 0 ~~\text{and} ~~ g_{\tau,{\boldsymbol{\eta}}}|_{H_+(\tau,{\boldsymbol{\eta}})^c}<0. \end{align} \]

(50)

Thus

\[ \begin{align*} \min_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}}g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}}) <0~~\text{ for all } {\boldsymbol{x}}\in\omega({\boldsymbol{\eta}})^c, \end{align*} \]

i.e., (49) holds for all \( {\boldsymbol{x}}\in \mathbb{R}^d\backslash \omega({\boldsymbol{\eta}})\) . Next, let \( \tau\) , \( \tau'\in\mathcal{T}\) be such that \( {\boldsymbol{\eta}}\in\tau\) and \( {\boldsymbol{\eta}}\in\tau'\) . We wish to show that \( g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}})\le g_{\tau',{\boldsymbol{\eta}}}({\boldsymbol{x}})\) for all \( {\boldsymbol{x}}\in\tau\) . Since \( g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}})=\varphi_{{\boldsymbol{\eta}}}({\boldsymbol{x}})\) for all \( {\boldsymbol{x}}\in\tau\) , this then concludes the proof of (49). By Lemma 15 it holds

\[ \begin{align*} {\boldsymbol{\mu}}\in H_+(\tau',{\boldsymbol{\eta}})~~\text{for all}~~{\boldsymbol{\mu}}\in V(\tau). \end{align*} \]

Hence, by (50)

\[ \begin{align*} g_{\tau',{\boldsymbol{\eta}}}({\boldsymbol{\mu}})\ge 0 = g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{\mu}})~~\text{for all}~~ {\boldsymbol{\mu}} \in V(\tau)\backslash\{{\boldsymbol{\eta}}\}. \end{align*} \]

Moreover, \( g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{\eta}})=g_{\tau',{\boldsymbol{\eta}}}({\boldsymbol{\eta}})=1\) . Thus, \( g_{\tau',{\boldsymbol{\eta}}}({\boldsymbol{\mu}})\ge g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{\mu}})\) for all \( {\boldsymbol{\mu}}\in V(\tau)\) and therefore, since both functions are affine,

\[ \begin{align*} g_{\tau',{\boldsymbol{\eta}}}({\boldsymbol{x}})\ge g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}})~~\text{ for all } {\boldsymbol{x}}\in{\rm co}(V(\tau))=\tau. \end{align*} \]
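To make (49) concrete, the following small Python sketch (our illustration, not part of the original text) evaluates the max-min formula for a univariate interior node \( \eta\) with neighboring nodes \( \eta-h_{\rm l}\) and \( \eta+h_{\rm r}\) ; the result is the familiar hat function that equals \( 1\) at \( \eta\) and vanishes outside the patch.

# Illustrative sketch (not from the book): in one dimension, formula (49)
# with the two affine functions g_{tau,eta} reproduces the hat function.
import numpy as np

def hat_via_max_min(x, eta, hl, hr):
    g_left = (x - (eta - hl)) / hl    # affine: 0 at eta - hl, 1 at eta
    g_right = ((eta + hr) - x) / hr   # affine: 1 at eta, 0 at eta + hr
    return np.maximum(0.0, np.minimum(g_left, g_right))

x = np.linspace(-1.0, 2.0, 13)
print(hat_via_max_min(x, eta=0.5, hl=0.5, hr=0.5))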

Proof (of Theorem 8)

For every interior node \( {\boldsymbol{\eta}}\in\mathcal{V}\cap\mathring{\Omega}\) , the cpwl basis function \( \varphi_{\boldsymbol{\eta}}\) in (44) can be expressed as in (49), i.e.,

\[ \begin{align*} \varphi_{\boldsymbol{\eta}}({\boldsymbol{x}}) = \sigma\bullet {\Phi^{\min}_{|\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}|}} \bullet (g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}}))_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}}, \end{align*} \]

where \( (g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}}))_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}}\) denotes the parallelization with shared inputs of the functions \( g_{\tau,{\boldsymbol{\eta}}}({\boldsymbol{x}})\) for all \( \tau\in\mathcal{T}\) such that \( {\boldsymbol{\eta}}\in\tau\) .

For this neural network, with \( |\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}|\le k_\mathcal{T}\) , we have by Lemma 7

\[ \begin{align} {\rm size}(\varphi_{\boldsymbol{\eta}})&\le 4\big({\rm size}(\sigma)+{\rm size}({\Phi^{\min}_{|\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}|}})+{\rm size}((g_{\tau,{\boldsymbol{\eta}}})_{\{\tau\in\mathcal{T}\,|\,{\boldsymbol{\eta}}\in\tau\}})\big) \nonumber\\ &\le 4(2+16k_\mathcal{T}+k_\mathcal{T} d) \end{align} \]

(52)

and similarly

\[ \begin{align} {\rm depth}(\varphi_{\boldsymbol{\eta}})\le 4+\lceil\log_2(k_\mathcal{T})\rceil, ~~{\rm width}(\varphi_{\boldsymbol{\eta}})\le \max\{1,3k_\mathcal{T},d\}. \end{align} \]

(53)

Since for every interior node, the number of simplices touching the node must be greater than or equal to \( d\) , we can assume \( \max\{k_\mathcal{T},d\}=k_{\mathcal{T}}\) in the following (otherwise there exist no interior nodes, and the function \( f\) is identically \( 0\) ). As in the proof of Theorem 7, the neural network

\[ \begin{align*} \Phi({\boldsymbol{x}})\mathrm{:}= \sum_{{\boldsymbol{\eta}}\in\mathcal{V}\cap\mathring{\Omega}}f({\boldsymbol{\eta}})\varphi_{\boldsymbol{\eta}}({\boldsymbol{x}}) \end{align*} \]

realizes the function \( f\) on all of \( \Omega\) . Since the number of nodes \( |\mathcal{V}|\) is bounded by \( (d+1)|\mathcal{T}|\) , an application of Lemma 9 yields the desired bounds.

6.4 Convergence rates for Hölder continuous functions

Theorem 7 immediately implies convergence rates for certain classes of (low regularity) functions. Recall for example the space \( C^{0,s}\) of Hölder continuous functions.

Definition 14

Let \( s\in (0,1]\) and \( \Omega\subseteq\mathbb{R}^d\) . Then for \( f:\Omega\to\mathbb{R}\)

\[ \begin{align} \| f \|_{C^{0,s}(\Omega)}\mathrm{:}= \sup_{{\boldsymbol{x}}\in \Omega}|f({\boldsymbol{x}})|+ \sup_{{\boldsymbol{x}}\neq{\boldsymbol{y}}\in \Omega}\frac{|f({\boldsymbol{x}})-f({\boldsymbol{y}})|}{\| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{2}^s}, \end{align} \]

(54)

and we denote by \( C^{0,s}(\Omega)\) the set of functions \( f\in C^0(\Omega)\) for which \( \| f \|_{C^{0,s}(\Omega)}<\infty\) .

Hölder continuous functions can be approximated well by cpwl functions. This leads to the following result.

Theorem 9

Let \( d\in\mathbb{N}\) and \( s\in (0,1]\) . There exists a constant \( C=C(d)\) such that for every \( f\in C^{0,s}([0,1]^d)\) and every \( N\in\mathbb{N}\) there exists a ReLU neural network \( \Phi_N^f\) with

\[ \begin{align*} {\rm size}(\Phi_N^f)\le CN,~~ {\rm width}(\Phi_N^f)\le CN,~~ {\rm depth}(\Phi_N^f)= C \end{align*} \]

and

\[ \begin{align*} \sup_{{\boldsymbol{x}}\in [0,1]^d}\left|f({\boldsymbol{x}})-\Phi_N^f({\boldsymbol{x}})\right|\le C \| f \|_{C^{0,s}({[0,1]^d})} N^{-\frac{s}{d}}. \end{align*} \]

Proof

For \( M\ge 2\) , consider the set of nodes \( \{{{\boldsymbol{\nu}}}/{M}\,|\,{\boldsymbol{\nu}}\in\{-1,\dots,M+1\}^d\}\) where \( {{\boldsymbol{\nu}}}/{M}=({\nu_1}/{M},\dots,{\nu_d}/{M})\) . These nodes induce a partition of \( [-1/M,1+1/M]^d\) into \( (2+M)^d\) sub-hypercubes. Each such sub-hypercube can be partitioned into \( d!\) simplices, so that we obtain a regular triangulation \( \mathcal{T}\) of \( [-1/M,1+1/M]^d\supseteq [0,1]^d\) with \( d!(2+M)^d\) elements. According to Theorem 7 there exists a neural network \( \Phi\) that is cpwl with respect to \( \mathcal{T}\) and \( \Phi({{\boldsymbol{\nu}}}/{M})=f({{\boldsymbol{\nu}}}/{M})\) whenever \( {\boldsymbol{\nu}}\in\{0,\dots,M\}^d\) and \( \Phi({{\boldsymbol{\nu}}}/{M})=0\) for all other (boundary) nodes. It holds

\[ \begin{align} \begin{split} {\rm size}(\Phi)&\le C |\mathcal{T}|=C d!(2+M)^{d},\\ {\rm width}(\Phi)&\le C |\mathcal{T}|=C d!(2+M)^{d},\\ {\rm depth}(\Phi)&\le C \end{split} \end{align} \]

(55)

for a constant \( C\) that only depends on \( d\) (since for our regular triangulation \( \mathcal{T}\) , \( k_\mathcal{T}\) in (41) is a fixed \( d\) -dependent constant).

Let us bound the error. Fix a point \( {\boldsymbol{x}}\in [0,1]^d\) . Then \( {\boldsymbol{x}}\) belongs to one of the interior simplices \( \tau\) of the triangulation. Any two nodes of this simplex have distance at most

\[ \begin{align*} \left(\sum_{j=1}^d \left(\frac{1}{M}\right)^2 \right)^{1/2}=\frac{\sqrt{d}}{M}=\mathrm{:}\varepsilon. \end{align*} \]

Since \( \Phi|_{\tau}\) is the linear interpolant of \( f\) at the nodes \( V(\tau)\) of the simplex \( \tau\) , \( \Phi({\boldsymbol{x}})\) is a convex combination of the \( (f({\boldsymbol{\eta}}))_{{\boldsymbol{\eta}}\in V(\tau)}\) . Fix an arbitrary node \( {\boldsymbol{\eta}}_0\in V(\tau)\) . Then \( \| {\boldsymbol{x}}-{\boldsymbol{\eta}}_0 \|_{2}\le\varepsilon\) and

\[ \begin{align*} |\Phi({\boldsymbol{x}})-\Phi({\boldsymbol{\eta}}_0)| \le \max_{{\boldsymbol{\eta}},{\boldsymbol{\mu}}\in V(\tau)}|f({\boldsymbol{\eta}})-f({\boldsymbol{\mu}})| &\le \sup_{\substack{{\boldsymbol{x}},{\boldsymbol{y}}\in [0,1]^d\\ \| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{2}\le \varepsilon}}|f({\boldsymbol{x}})-f({\boldsymbol{y}})|\\ &\le \| f \|_{C^{0,s}({[0,1]^d})} \varepsilon^s. \end{align*} \]

Hence, using \( f({\boldsymbol{\eta}}_0)=\Phi({\boldsymbol{\eta}}_0)\) ,

\[ \begin{align} |f({\boldsymbol{x}})-\Phi({\boldsymbol{x}})| &\le |f({\boldsymbol{x}})-f({\boldsymbol{\eta}}_0)|+ |\Phi({\boldsymbol{x}})-\Phi({\boldsymbol{\eta}}_0)|\nonumber\\ &\le 2 \| f \|_{C^{0,s}({[0,1]^d})} \varepsilon^s\nonumber\\ & = 2 \| f \|_{C^{0,s}({[0,1]^d})} d^{\frac{s}{2}} M^{-s}\nonumber\\ & = 2 d^{\frac{s}{2}} \| f \|_{C^{0,s}({[0,1]^d})} N^{-\frac{s}{d}}, \end{align} \]

(56)

where \( N \mathrm{:}= M^d\) . The statement follows by (55) and (56).
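To put the rate into perspective (a direct consequence of Theorem 9, stated here only as an example): for a Lipschitz function, i.e. \( s=1\) , on \( [0,1]^2\) , reaching uniform accuracy \( \varepsilon\) requires \( N^{-1/2}\lesssim \varepsilon\) and hence networks of size \( O(\varepsilon^{-2})\) ; on \( [0,1]^4\) the same accuracy already requires size \( O(\varepsilon^{-4})\) . The exponent grows linearly in \( d\) , an instance of the curse of dimensionality for this approach.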

The principle behind Theorem 9 can be applied in even more generality. Since we can represent every cpwl function on a regular triangulation with a neural network of size \( O(N)\) , where \( N\) denotes the number of elements, most classical (e.g., finite element) approximation theory for cpwl functions can be lifted to generate statements about ReLU approximation. For instance, it is well known that functions in the Sobolev space \( H^{2}([0,1]^d)\) can be approximated by cpwl functions on a regular triangulation in terms of \( L^2([0,1]^d)\) with the rate \( {2}/{d}\) , e.g., [8, Chapter 22]. Similarly to the proof of Theorem 9, for every \( f\in H^2([0,1]^d)\) and every \( N\in\mathbb{N}\) there then exists a ReLU neural network \( \Phi_N\) such that \( {\rm{size}}(\Phi_N)=O(N)\) and

\[ \begin{align*} \| f-\Phi_N \|_{{L^2([0,1]^d)}}\le C \| f \|_{{H^2([0,1]^d)}} N^{-\frac{2}{d}}. \end{align*} \]

Finally, we may consider how to approximate smoother functions such as \( f\in C^k([0,1]^d)\) , \( k>1\) , with ReLU neural networks. As discussed in Chapter 5 for sigmoidal activation functions, larger \( k\) can lead to faster convergence. However, we will see in the following chapter, that the emulation of piecewise affine functions on regular triangulations will not yield improved approximation rates as \( k\) increases. To leverage such smoothness with ReLU networks, in Chapter 8 we will first build networks that emulate polynomials. Surprisingly, it turns out that polynomials can be approximated very efficiently by deep ReLU neural networks.

Bibliography and further reading

The ReLU calculus introduced in Section 6.1 was similarly given in [1]. The fact that every cpwl function can be expressed as a maximum over a minimum of linear functions goes back to the papers [9, 4]; see also [5] for an accessible presentation of this result. Additionally, [6] provides sharper bounds on the number of required nestings in such representations.

The main result of Section 6.2, which shows that every cpwl function can be expressed by a ReLU network, is then a straightforward consequence. This was first observed in [2], which also provided bounds on the network size. These bounds were significantly improved in [3] for cpwl functions on triangular meshes that satisfy a local convexity condition. Under this assumption, it was shown that the network size essentially only grows linearly with the number of pieces. The paper [7] showed that the convexity assumption is not necessary for this statement to hold. We give a similar result in Section 6.3.2, using a simpler argument than [7]. The locally convex case from [3] is separately discussed in Section 6.3.3, as it allows for further improvements in some constants.

The implications for the approximation of Hölder continuous functions discussed in Section 6.4 follow from standard approximation theory for cpwl functions; see for example [10] or the finite element literature such as [11, 12, 8], which focus on approximation in Sobolev spaces. Additionally, [13] provides a stronger result, where it is shown that ReLU networks can essentially achieve twice the rate proven in Theorem 9, and this is sharp. For a general reference on splines and piecewise polynomial approximation see for instance [14]. Finally, we mention that similar convergence results can also be shown for other activation functions, see, e.g., [15].

Exercises

Exercise 14

Let \( p:\mathbb{R}\to\mathbb{R}\) be a polynomial of degree \( n\ge 1\) (with leading coefficient nonzero) and let \( s:\mathbb{R}\to\mathbb{R}\) be a continuous sigmoidal activation function. Show that the identity map \( x\mapsto x:\mathbb{R}\to\mathbb{R}\) belongs to \( \mathcal{N}_1^1(p;1,n+1)\) but not to \( \mathcal{N}_1^1(s;L)\) for any \( L\in\mathbb{N}\) .

Exercise 15

Consider cpwl functions \( f:\mathbb{R}\to\mathbb{R}\) with \( n\in\mathbb{N}_0\) breakpoints (points where the function is not \( C^1\) ). Determine the minimal size required to exactly express every such \( f\) with a depth-\( 1\) ReLU neural network.

Exercise 16

Show that the notion of affine independence is invariant under permutations of the points.

Exercise 17

Let \( \tau={\rm {co}}({\boldsymbol{x}}_0,\dots,{\boldsymbol{x}}_d)\) be a \( d\) -simplex. Show that the coefficients \( \alpha_i\ge 0\) such that \( \sum_{i=0}^d\alpha_i=1\) and \( {\boldsymbol{x}}=\sum_{i=0}^d\alpha_i {\boldsymbol{x}}_i\) are unique for every \( {\boldsymbol{x}}\in \tau\) .

Exercise 18

Let \( \tau={\rm {co}}({\boldsymbol{\eta}}_0,\dots,{\boldsymbol{\eta}}_d)\) be a \( d\) -simplex. Show that the boundary of \( \tau\) is given by \( \bigcup_{i=0}^d{\rm {co}}(\{{\boldsymbol{\eta}}_0,\dots,{\boldsymbol{\eta}}_d\}\backslash\{{\boldsymbol{\eta}}_i\})\) .

7 Affine pieces for ReLU neural networks

In the previous chapters, we observed some remarkable approximation results for shallow ReLU neural networks. In practice, however, deeper architectures are more common. To understand why, in this chapter we discuss some potential shortcomings of shallow ReLU networks compared to deep ReLU networks.

Traditionally, an insightful approach to study limitations of ReLU neural networks has been to analyze the number of linear regions these functions can generate.

Definition 15

Let \( d\in \mathbb{N}\) , \( \Omega \subseteq \mathbb{R}^d\) , and let \( f\colon \Omega \to \mathbb{R}\) be cpwl (see Definition 10). We say that \( f\) has \( p \in \mathbb{N}\) pieces (or linear regions), if \( p\) is the smallest number of connected open sets \( (\Omega_i)_{i=1}^p\) such that \( \bigcup_{i=1}^p \overline{\Omega_i} = \Omega\) , and \( f|_{\Omega_i}\) is an affine function for all \( i = 1, \dots, p\) . We denote \( \rm{ Pieces}(f, \Omega) \mathrm{:=} p\) .

For \( d=1\) we call every point where \( f\) is not differentiable a break point of \( f\) .

To get an accurate cpwl approximation of a function, the approximating function needs to have many pieces. The next theorem, corresponding to [1, Theorem 2], quantifies this statement.

Theorem 10

Let \( -\infty<a<b<\infty\) and \( f \in C^3([a,b])\) so that \( f\) is not affine. Then there exists a constant \( C>0\) depending only on \( \int_{a}^b\sqrt{|f''(x)|}\,\mathrm{d} x\) so that

\[ \| g-f \|_{{L^\infty([a,b])}} > C p^{-2} \]

for all cpwl \( g\) with at most \( p\in\mathbb{N}\) pieces.

The proof of the theorem is left to the reader, see Exercise 19.
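As a concrete example, consider \( f(x)=x^2\) on \( [0,1]\) : the cpwl interpolant of \( f\) at \( p+1\) equidistant break points has \( p\) pieces and uniform error exactly \( 1/(4p^2)\) (the linear interpolation error of \( x^2\) on an interval of length \( h\) equals \( h^2/4\) ), so for this function the \( p^{-2}\) rate of Theorem 10 is attained up to a constant.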

Theorem 10 implies that for ReLU neural networks we need architectures allowing for many pieces, if we want to approximate non-linear functions to high accuracy. How many pieces can we create for a fixed depth and width? We establish a simple theoretical upper bound in Section 7.1. Subsequently, we investigate under which conditions these upper bounds are attainable in Section 7.2. Lastly, in Section 7.3, we will discuss the practical relevance of this analysis by examining how many pieces “typical” neural networks possess. Surprisingly, it turns out that randomly initialized deep neural networks on average do not have a number of pieces that is anywhere close to the theoretically achievable maximum.

7.1 Upper bounds

Neural networks are based on the composition and addition of neurons. These two operations increase the possible number of pieces in a very specific way. Figure 14 depicts the two operations and their effect. They can be described as follows:

  • Summation: Let \( \Omega\subseteq\mathbb{R}\) . The sum of two cpwl functions \( f_1\) , \( f_2:\Omega\to\mathbb{R}\) satisfies

    \[ \begin{align} {\rm Pieces}(f_1 + f_2, \Omega) \leq {\rm Pieces}(f_1, \Omega) + {\rm Pieces}(f_2, \Omega)-1. \end{align} \]

    (53)

    This holds because the sum is affine at every point where both \( f_1\) and \( f_2\) are affine. Therefore, the sum has at most as many break points as \( f_1\) and \( f_2\) combined. Moreover, the number of pieces of a univariate cpwl function equals the number of its break points plus one.

  • Composition: Let again \( \Omega \subseteq \mathbb{R}\) . The composition of two functions \( f_1 \colon \mathbb{R}^d \to \mathbb{R}\) and \( f_2\colon \Omega \to \mathbb{R}^d\) satisfies

    \[ \begin{align} {\rm Pieces}(f_1 \circ f_2, \Omega) \leq {\rm Pieces}(f_1, \mathbb{R}^d) \cdot {\rm Pieces}(f_2, \Omega). \end{align} \]

    (54)

    This is because for each of the affine pieces of \( f_2\) —let us call one of those pieces \( A \subseteq \Omega\) —we have that \( f_2\) is either constant or injective on \( A\) (recall that \( \Omega\subseteq\mathbb{R}\) , so an affine function on an interval is either constant or injective). If it is constant, then \( f_1 \circ f_2\) is constant on \( A\) . If it is injective, then \( \rm{ Pieces}(f_1 \circ f_2, A) = \rm{ Pieces}(f_1, f_2(A)) \leq \rm{ Pieces}(f_1, \mathbb{R}^d)\) . Since this holds for all pieces of \( f_2\) we get (54).

Figure 14. Top: Composition of two cpwl functions \( f_1 \circ f_2\) can create a piece whenever the value of \( f_2\) crosses a level that is associated to a break point of \( f_1\) . Bottom: Addition of two cpwl functions \( f_1 + f_2\) produces a cpwl function that can have break points at positions where either \( f_1\) or \( f_2\) has a break point.

These considerations give the following result, which follows the argument of [2, Lemma 2.1]. We state it for general cpwl activation functions. The ReLU activation function corresponds to \( p=2\) . Recall that the notation \( (\sigma;d_0,\dots,d_{L+1})\) denotes the architecture of a feedforward neural network, see Definition 1.

Theorem 11

Let \( L \in \mathbb{N}\) and let \( \sigma\) be cpwl with \( p\) pieces. Then, every neural network \( \Phi\) with architecture \( (\sigma;1, d_1, \dots, d_L, 1)\) has at most \( (p \cdot \rm{ width}(\Phi))^{L}\) pieces.

Proof

The proof is via induction over the depth \( L\) . Let \( L=1\) , and let \( \Phi:\mathbb{R}\to\mathbb{R}\) be a neural network of architecture \( (\sigma;1,d_1,1)\) . Then

\[ \Phi(x) = \sum_{k = 1}^{d_1} w_k^{(1)} \sigma(w_k^{(0)} x + b_{k}^{(0)}) + b^{(1)}~~\text{for }x\in\mathbb{R}, \]

for certain \( {\boldsymbol{w}}^{(0)}\) , \( {\boldsymbol{w}}^{(1)}\) , \( {\boldsymbol{b}}^{(0)}\in\mathbb{R}^{d_1}\) and \( b^{(1)}\in\mathbb{R}\) . Each summand \( x\mapsto w_k^{(1)} \sigma(w_k^{(0)} x + b_{k}^{(0)})\) has at most \( p\) pieces, so by (53), \( \rm{ Pieces}(\Phi)\le p \cdot \rm{ width}(\Phi)\) .

For the induction step, assume the statement holds for \( L\in \mathbb{N}\) , and let \( \Phi:\mathbb{R} \to\mathbb{R}\) be a neural network of architecture \( (\sigma;1,d_1,\dots,d_{L+1},1)\) . Then, we can write

\[ \Phi(x) = \sum_{j=1}^{d_{L+1}}w_j\sigma(h_j(x)) + b~~\text{for }x\in\mathbb{R}, \]

for some \( {\boldsymbol{w}}\in\mathbb{R}^{d_{L+1}}\) , \( b\in\mathbb{R}\) , and where each \( h_j\) is a neural network of architecture \( (\sigma;1,d_1,\dots,d_{L},1)\) . Using the induction hypothesis and (54), each \( \sigma \circ h_j\) has at most \( p \cdot (p \cdot \rm{ width}(\Phi))^{L}\) affine pieces. Hence, by (53), \( \Phi\) has at most \( \rm{ width}(\Phi) \cdot p \cdot (p\cdot \rm{ width}(\Phi))^{L} = (p \cdot \rm{ width}(\Phi))^{L+1}\) affine pieces. This completes the proof.

Theorem 11 shows that there are limits to how many pieces can be created with a certain architecture. It is noteworthy that the effects of the depth and the width of a neural network are vastly different. While increasing the width can polynomially increase the number of pieces, increasing the depth can result in exponential increase. This is a first indication of the prowess of depth of neural networks.
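For a concrete (purely illustrative) comparison with the ReLU activation, i.e. \( p=2\) : a network of width \( 20\) and depth \( L=5\) may have up to \( (2\cdot 20)^5\approx 10^8\) pieces by Theorem 11, whereas a network of depth \( L=1\) that spends the same \( 100\) neurons in a single hidden layer is limited to \( 2\cdot 100=200\) pieces.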

To understand the effect of this on the approximation problem, we apply the bound of Theorem 11 to Theorem 10.

Theorem 12

Let \( d_0 \in \mathbb{N}\) and \( f \in C^3([0,1]^{d_0})\) . Assume there exists a line segment \( \mathfrak{s} \subseteq [0,1]^{d_0}\) of positive length such that \( 0<c\mathrm{:}= \int_\mathfrak{s} \sqrt{|f''(x)|}\,\mathrm{d} x\) . Then, there exists \( C>0\) solely depending on \( c\) , such that for all ReLU neural networks \( \Phi:\mathbb{R}^{d_0}\to\mathbb{R}\) with \( L\) hidden layers

\[ \| f-\Phi \|_{{L^\infty([0,1]^{d_0})}} \geq C \cdot (2 {\rm width}(\Phi))^{-2L}. \]

Theorem 12 gives a lower bound on achievable approximation rates depending on the depth \( L\) . As target functions become smoother, we expect to achieve faster convergence rates (cf. Chapter 5). However, without increasing the depth, it seems to be impossible to leverage such additional smoothness.

This observation strongly indicates that deeper architectures can be superior. Before making this more concrete, we first explore whether the upper bounds of Theorem 11 are also achievable.

7.2 Tightness of upper bounds

We follow [2] to construct a ReLU neural network that realizes the upper bound of Theorem 11. First let \( h_1:[0,1]\to\mathbb{R}\) be the hat function

\[ \begin{align*} h_1(x)\mathrm{:}= \left\{ \begin{array}{ll} 2x &\text{if }x\in [0,\frac{1}{2}]\\ 2-2x &\text{if }x\in [\frac{1}{2},1]. \end{array}\right. \end{align*} \]

This function can be expressed by a ReLU neural network of depth one and with two nodes

\[ \begin{align} h_1(x)= \sigma_{\rm ReLU}(2x)-\sigma_{\rm ReLU}(4x-2)~~\text{ for all } x\in [0,1]. \end{align} \]

(55.a)

We recursively set

\[ \begin{align} h_n\mathrm{:}= h_{n-1}\circ h_1~~\text{for all }n\ge 2, \end{align} \]

(55.b)


i.e., \( h_n=h_1\circ…\circ h_1\) is the \( n\) -fold composition of \( h_1\) . Since \( h_1:[0,1]\to [0,1]\) , we have \( h_n:[0,1]\to [0,1]\) and

\[ \begin{align*} h_n\in\mathcal{N}_1^1(\sigma_{\rm ReLU};n,2). \end{align*} \]

It turns out that this function has a rather interesting behavior. It is a “sawtooth” function with \( 2^{n-1}\) spikes, see Figure 15.

Lemma 17

Let \( n \in \mathbb{N}\) . It holds for all \( x\in[0,1]\)

\[ \begin{align*} h_n(x) = \left\{ \begin{array}{ll} 2^n(x-i2^{-n}) &\text{if \( i\ge 0\) is even and }x\in [i2^{-n},(i+1)2^{-n}]\\ 2^n((i+1)2^{-n}-x) &\text{if \( i\ge 1\) is odd and }x\in [i2^{-n},(i+1)2^{-n}]. \end{array}\right. \end{align*} \]

Proof

The case \( n=1\) holds by definition. We proceed by induction, and assume the statement holds for \( n\) . Let \( x\in[0,{1}/{2}]\) and \( i\ge 0\) even such that \( x\in [i2^{-(n+1)},(i+1)2^{-(n+1)}]\) . Then \( 2x\in [i2^{-n},(i+1)2^{-n}]\) . Thus

\[ \begin{align*} h_{n}(h_1(x))=h_n(2x) = 2^{n}(2x-i2^{-n}) = 2^{n+1}(x-i2^{-(n+1)}). \end{align*} \]

Similarly, if \( x\in[0,{1}/{2}]\) and \( i\ge 1\) odd such that \( x\in [i2^{-(n+1)},(i+1)2^{-(n+1)}]\) , then \( h_1(x)=2x\in [i2^{-n},(i+1)2^{-n}]\) and

\[ \begin{align*} h_{n}(h_1(x))=h_n(2x) = 2^{n}((i+1)2^{-n}-2x) = 2^{n+1}((i+1)2^{-(n+1)}-x). \end{align*} \]

The case \( x\in[{1}/{2},1]\) follows by observing that \( h_{n+1}\) is symmetric around \( {1}/{2}\) .

Figure 15. The functions \( h_n\) in Lemma 17.

The neural network \( h_n\) has size \( O(n)\) and is piecewise linear with \( 2^n\) pieces on \( [0,1]\) . This shows that the number of pieces can indeed increase exponentially in the neural network size; see also the upper bound in Theorem 11.
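The following short Python sketch (ours, for illustration only) builds \( h_n\) by \( n\) -fold composition of the two-neuron representation (55.a) and counts the linear pieces on a fine grid; the count matches \( 2^n\) , in line with Lemma 17.

# Illustrative sketch (not from the book): compose h_1 to obtain h_n and
# count its linear pieces on [0, 1] by locating slope changes on a grid.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h1(x):
    return relu(2.0 * x) - relu(4.0 * x - 2.0)

def h(n, x):
    y = x
    for _ in range(n):
        y = h1(y)
    return y

x = np.linspace(0.0, 1.0, 2**15 + 1)
for n in range(1, 6):
    slopes = np.diff(h(n, x)) / np.diff(x)
    breaks = int(np.sum(np.abs(np.diff(slopes)) > 1e-6))  # slope changes
    print(n, breaks + 1, 2**n)   # observed number of pieces vs. 2^n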

7.3 Number of pieces in practice

We have seen in Theorem 11 that deep neural networks can have many more pieces than their shallow counterparts. This raises the question of whether deep neural networks tend to generate more pieces in practice. More formally: If we randomly initialize the weights of a neural network, what is the expected number of linear regions? Will this number scale exponentially with the depth? This question was analyzed in [3], and surprisingly, it was found that the number of pieces of randomly initialized neural networks typically does not depend exponentially on the depth. In Figure 16, we depict two neural networks, one shallow and one deep, that were randomly initialized according to He initialization [4]. Both neural networks have essentially the same number of pieces (\( 114\) and \( 110\) ), and there is no clear indication that one has a deeper architecture than the other.
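A rough version of this experiment can be sketched in a few lines of Python (our own illustration, with arbitrarily chosen widths, a random-bias initialization as in Definition 16 below, and a crude region count; it is not the code behind Figure 16): the number of activation-pattern changes along a segment serves as a proxy for the number of linear pieces.

# Illustrative sketch (not the experiment of [3]): count activation-pattern
# changes of a randomly initialized ReLU network along a line segment; each
# activation cell is a region on which the network is affine, so this count
# is a simple proxy for the number of linear pieces along the segment.
import numpy as np
rng = np.random.default_rng(0)

def random_net(widths, delta=0.1):
    # He-initialized weights, biases uniform on [-delta/2, delta/2]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m)),
             rng.uniform(-delta / 2, delta / 2, size=n))
            for m, n in zip(widths[:-1], widths[1:])]

def regions_along_segment(params, a, b, samples=200000):
    t = np.linspace(0.0, 1.0, samples)
    h = (1 - t)[:, None] * a + t[:, None] * b      # points on the segment
    patterns = []
    for W, bias in params[:-1]:                    # hidden layers only
        pre = h @ W.T + bias
        patterns.append(pre > 0)
        h = np.maximum(pre, 0.0)
    pat = np.concatenate(patterns, axis=1)
    return int(np.any(pat[1:] != pat[:-1], axis=1).sum()) + 1

a, b = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
shallow = random_net([2, 100, 1])
deep = random_net([2, 10, 10, 10, 10, 10, 1])
print(regions_along_segment(shallow, a, b), regions_along_segment(deep, a, b))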

In the following, we will give a simplified version of the main result of [3] to show why random deep neural networks often behave like shallow neural networks.

Figure 16. Two randomly initialized neural networks \( \Phi_1\) and \( \Phi_2\) with architectures \( (\sigma_{\rm {ReLU}};2, 10, 10, 1)\) and \( (\sigma_{\rm {ReLU}};2,5,5,5,5,5,1)\) . The initialization scheme was He initialization [1]. The number of linear regions equals \( 114\) and \( 110\) , respectively.

We recall from Figure 14 that pieces are generated through composition of two functions \( f_1\) and \( f_2\) , if the values of \( f_2\) cross a level that is associated to a break point of \( f_1\) . In the case of a simple neuron of the form

\[ \begin{align*} {\boldsymbol{x}} \mapsto \sigma_{\rm ReLU}(\langle {\boldsymbol{a}}, h({\boldsymbol{x}}) \rangle + b) \end{align*} \]

where \( h\) is a cpwl function, \( {\boldsymbol{a}}\) is a vector, and \( b\) is a scalar, many pieces can be generated if \( \langle{\boldsymbol{a}}, h({\boldsymbol{x}}) \rangle \) crosses the \( -b\) level often.

If \( {\boldsymbol{a}}\) , \( b\) are random variables, and we know that \( h\) does not oscillate too much, then we can quantify the probability of \( \langle {\boldsymbol{a}}, h({\boldsymbol{x}}) \rangle \) crossing the \( -b\) level often. The following lemma from [5, Lemma 3.1] provides the details.

Lemma 18

Let \( c >0\) and let \( h\colon [0,c] \to \mathbb{R}\) be a cpwl function on \( [0,c]\) . Let \( t \in \mathbb{N}\) , let \( A\subseteq\mathbb{R}\) be a Lebesgue measurable set, and assume that for every \( y \in A\)

\[ |\{x \in [0,c]\,|\,h(x) = y\}| \geq t. \]

Then, \( c \| h' \|_{L^\infty} \geq \| h' \|_{L^1} \geq |A| \cdot t\) , where \( |A|\) is the Lebesgue measure of \( A\) . In particular, if \( h\) has at most \( P\in \mathbb{N}\) pieces and \( \| h' \|_{L^1}<\infty\) , then for all \( \delta>0\) and all \( t\le P\)

\[ \begin{align*} \mathbb{P}\left[|\{x \in [0,c]\,|\,h(x) = U \}| \geq t\right] &\leq \frac{\| h' \|_{L^1}}{\delta t},\\ \mathbb{P}\left[|\{x \in [0,c]\,|\,h(x) = U\}| > P \right] &= 0, \end{align*} \]

where \( U\) is a uniformly distributed variable on \( [-\delta/2, \delta/2]\) .

Proof

We will assume \( c = 1\) . The general case then follows by considering \( \tilde{h}(x) \mathrm{:}= h(cx)\) for \( x\in[0,1]\) .

Let \( (c_i)_{i=1}^{P+1} \subseteq [0,1]\) with \( c_1 = 0\) , \( c_{P+1} = 1\) and \( c_i \leq c_{i+1}\) for all \( i = 1,\dots, P\) be such that the pieces of \( h\) are given by \( ( (c_i, c_{i+1}) )_{i=1}^P\) . We denote

\[ \begin{align*} V_1 \mathrm{:=} [0, c_2], ~ V_i \mathrm{:=} (c_i, c_{i+1}] \text{ for } i = 2, \dots, P \end{align*} \]

and for \( i =1, \dots, P+1\)

\[ \begin{align*} \widetilde{V}_i \mathrm{:=} \bigcup_{j=1}^{i-1} V_j. \end{align*} \]

We define, for \( n \in \mathbb{N} \cup \{\infty\}\)

\[ \begin{align*} T_{i, n} &\mathrm{:=} h(V_i)\cap \left\{y \in A\, \middle|\,|\{ x \in \widetilde{V}_i\,|\,h(x) = y\}| = n-1\right\}. \end{align*} \]

In words, \( T_{i, n}\) contains the values in \( A\) that are attained on \( V_i\) for the \( n\) th time. Since \( h\) is cpwl, we observe that for all \( i = 1, \dots, P\)

  (i) \( T_{i, n_1} \cap T_{i, n_2} = \emptyset\) for all \( n_1,n_2 \in \mathbb{N} \cup \{\infty\}\) , \( n_1\neq n_2\) ,
  (ii) \( T_{i, \infty} \cup \bigcup_{n=1}^\infty T_{i, n} = h(V_i) \cap A\) ,
  (iii) \( T_{i, n} = \emptyset\) for all \( P < n < \infty\) ,
  (iv) \( |T_{i, \infty}| = 0\) .

Note that, since \( h\) is affine on \( V_i\) , it holds that \( |h'| = |h(V_i)| / |V_i|\) on \( V_i\) , so that \( \int_{V_i}|h'(x)|\,\mathrm{d} x = |h(V_i)|\) . Hence, for \( t \leq P\)

\[ \begin{align*} \| h' \|_{L^1} &\geq \sum_{i=1}^P |h(V_i)| \geq \sum_{i=1}^P |h(V_i) \cap A|\\ & = \sum_{i=1}^P \left(\sum_{n=1}^\infty |T_{i, n}|\right) + |T_{i, \infty}|\\ & = \sum_{i=1}^P \sum_{n=1}^\infty |T_{i, n}|\\ & \geq \sum_{n=1}^t \sum_{i=1}^P |T_{i, n}|, \end{align*} \]

where the first equality follows by (i), (ii), the second by (iv), and the last inequality by (iii). Note that, by assumption for all \( n \leq t\) every \( y \in A\) is an element of \( T_{i, n}\) or \( T_{i, \infty}\) for some \( i \leq P\) . Therefore, by (iv)

\[ \sum_{i=1}^P |T_{i, n}| \geq |A|, \]

which shows \( \| h' \|_{L^1} \geq |A| \cdot t\) ; the bound \( c\| h' \|_{L^\infty}\ge \| h' \|_{L^1}\) is immediate. For the probability bounds, apply this to \( A\mathrm{:}= \{y\in[-\delta/2,\delta/2]\,|\,|\{x \in [0,c]\,|\,h(x) = y\}| \geq t\}\) , which yields \( |A|\le \| h' \|_{L^1}/t\) and hence \( \mathbb{P}[U\in A]=|A|/\delta\le \| h' \|_{L^1}/(\delta t)\) . Finally, a value that is attained more than \( P\) times must be attained at least twice on a single piece, which forces \( h\) to be constant on that piece; there are at most \( P\) such values, so the corresponding event has probability \( 0\) .

Lemma 18, applied to neural networks, essentially states that if, in a single neuron, the bias term is chosen uniformly at random on an interval of length \( \delta\) , then the probability of generating at least \( t\) pieces by composition scales like \( 1/t\) .

Next, we will analyze how Lemma 18 implies an upper bound on the number of pieces generated in a randomly initialized neural network. For simplicity, we only consider random biases in the following, but mention that similar results hold if both the biases and weights are random variables [3].

Definition 16

Let \( L\in \mathbb{N}\) , \( (d_0, d_1, \dots, d_{L}, 1)\in \mathbb{N}^{L+2}\) and \( {\boldsymbol{W}}^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_{\ell}}\) for \( \ell = 0, \dots, L\) . Furthermore, let \( \delta>0\) and let the bias vectors \( {\boldsymbol{b}}^{(\ell)} \in \mathbb{R}^{d_{\ell+1}}\) , for \( \ell=0,\dots,L\) , be random variables such that each entry of each \( {\boldsymbol{b}}^{(\ell)}\) is independently and uniformly distributed on the interval \( [-\delta/2,\delta/2]\) . We call the associated ReLU neural network a random-bias neural network.

To apply Lemma 18 to a single neuron with random biases, we also need some bound on the derivative of the input to the neuron.

Definition 17

Let \( L\in \mathbb{N}\) , \( (d_0, d_1, \dots, d_{L}, 1)\in \mathbb{N}^{L+2}\) , and \( {\boldsymbol{W}}^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_{\ell}}\) and \( {\boldsymbol{b}}^{(\ell)} \in \mathbb{R}^{d_{\ell+1}}\) for \( \ell = 0, \dots, L\) . Moreover let \( \delta>0\) .

For \( \ell = 1, \dots, L+1\) , \( i=1,\dots, d_{\ell}\) introduce the functions

\[ \begin{align*} \eta_{\ell, i}({\boldsymbol{x}}; ({\boldsymbol{W}}^{(j)}, {\boldsymbol{b}}^{(j)})_{j = 0}^{\ell-1}) = ({\boldsymbol{W}}^{(\ell-1)}{\boldsymbol{x}}^{(\ell-1)})_i~~ \text{for } {\boldsymbol{x}} \in \mathbb{R}^{d_0}, \end{align*} \]

where \( {\boldsymbol{x}}^{(\ell-1)}\) is as in (5). We call

\[ \begin{align*} \nu\left(({\boldsymbol{W}}^{(\ell)})_{\ell = 0}^L, \delta\right) \mathrm{:=} &\max\Bigg\{ \left\| \eta_{\ell, i}'(\, \cdot \, ; ({\boldsymbol{W}}^{(j)}, {\boldsymbol{b}}^{(j)})_{j = 0}^{\ell-1}) \right\|_{2} \Bigg| \\ & ({\boldsymbol{b}}^{(j)})_{j = 0}^L \in \prod_{j = 0}^L [-\delta/2,\delta/2]^{d_{j+1}},\ell = 1, \dots, L, i=1,\dots, d_{\ell}\Bigg\} \end{align*} \]

the maximal internal derivative of \( \Phi\) .

We can now formulate the main result of this section.

Theorem 13

Let \( L\in \mathbb{N}\) and let \( (d_0, d_1, \dots, d_{L}, 1) \in \mathbb{N}^{L+2}\) . Let \( \delta\in (0,1]\) . Let \( {\boldsymbol{W}}^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_{\ell}}\) , for \( \ell = 0, \dots, L\) , be such that \( \nu\left(({\boldsymbol{W}}^{(\ell)})_{\ell = 0}^L, \delta\right) \leq C_\nu\) for a \( C_\nu >0\) .

For an associated random-bias neural network \( \Phi\) , we have that for a line segment \( \mathfrak{s} \subseteq \mathbb{R}^{d_0}\) of length 1

\[ \begin{align} \mathbb{E}[{\rm Pieces}(\Phi, \mathfrak{s})]\leq 1 + d_1 + \frac{C_\nu}{\delta} (1 + (L-1) \ln(2 {\rm width}(\Phi))) \sum_{j = 2}^{L} d_j. \end{align} \]

(56)

Proof

Let \( {\boldsymbol{W}}^{(\ell)} \in \mathbb{R}^{d_{\ell+1} \times d_{\ell}}\) for \( \ell = 0, \dots, L\) . Moreover, let \( {\boldsymbol{b}}^{(\ell)} \in [-\delta/2, \delta/2]^{d_{\ell+1}}\) for \( \ell = 0, \dots, L\) be uniformly distributed random variables. We denote

\[ \begin{align*} \theta_\ell \colon \mathfrak{s} &\to \mathbb{R}^{d_\ell}\\ {\boldsymbol{x}} &\mapsto (\eta_{\ell, i}({\boldsymbol{x}}; ({\boldsymbol{W}}^{(j)}, {\boldsymbol{b}}^{(j)})_{j = 0}^{\ell-1}))_{i=1}^{d_\ell}. \end{align*} \]

Let \( \kappa\colon \mathfrak{s} \to [0,1]\) be an isomorphism. Since each coordinate of \( \theta_{\ell}\) is cpwl, there are points \( {\boldsymbol{x}}_0, {\boldsymbol{x}}_1, …, {\boldsymbol{x}}_{q_\ell} \in \mathfrak{s}\) with \( \kappa({\boldsymbol{x}}_j) < \kappa({\boldsymbol{x}}_{j+1})\) for \( j = 0, \dots, q_{\ell} - 1\) , such that \( \theta_{\ell}\) is affine (as a function into \( \mathbb{R}^{d_\ell}\) ) on \( [\kappa({\boldsymbol{x}}_j), \kappa({\boldsymbol{x}}_{j+1})]\) for all \( j = 0, \dots, q_{\ell}-1\) as well as on \( [0, \kappa({\boldsymbol{x}}_0)]\) and \( [\kappa({\boldsymbol{x}}_{q_\ell}),1]\) .

We will now inductively find an upper bound on the \( q_\ell\) .

Let \( \ell = 2\) , then

\[ \theta_2({\boldsymbol{x}}) = {\boldsymbol{W}}^{(1)} \sigma_{\rm ReLU}( {\boldsymbol{W}}^{(0)} {\boldsymbol{x}} + {\boldsymbol{b}}^{(0)}). \]

Since \( {\boldsymbol{x}}\mapsto {\boldsymbol{W}}^{(1)}{\boldsymbol{x}}\) is linear, \( \theta_2\) can only be non-affine at points where \( \sigma_{\rm {ReLU}}( {\boldsymbol{W}}^{(0)} \cdot + {\boldsymbol{b}}^{(0)})\) is not affine. Therefore, \( \theta_2\) is only non-affine at points where one coordinate of \( {\boldsymbol{W}}^{(0)} \cdot + {\boldsymbol{b}}^{(0)}\) changes sign along \( \mathfrak{s}\) . This can happen at most \( d_1\) times. We conclude that we can choose \( q_2 = d_1\) .

Next, let us find an upper bound on \( q_{\ell+1}\) from \( q_\ell\) . Note that

\[ \theta_{\ell+1}({\boldsymbol{x}}) ={\boldsymbol{W}}^{(\ell)} \sigma_{\rm ReLU}( \theta_{\ell}({\boldsymbol{x}}) + {\boldsymbol{b}}^{(\ell-1)}). \]

Now \( \theta_{\ell+1}\) is affine in every point \( {\boldsymbol{x}}\in \mathfrak{s}\) where \( \theta_{\ell}\) is affine and \( (\theta_{\ell}({\boldsymbol{x}}) + {\boldsymbol{b}}^{(\ell-1)})_i\neq 0\) for all coordinates \( i = 1, \dots, d_\ell\) . As a result, we have that we can choose \( q_{\ell+1}\) such that

\[ q_{\ell+1} \leq q_{\ell} + \big|\{{\boldsymbol{x}} \in \mathfrak{s}\,|\,(\theta_{\ell}({\boldsymbol{x}}) + {\boldsymbol{b}}^{(\ell-1)})_i = 0 \text{ for at least one } i =1, \dots, d_{\ell}\}\big|. \]

Therefore, for \( \ell \geq 2\)

\[ \begin{align*} q_{\ell+1} &\leq d_{1} + \sum_{j = 2}^{\ell} \big|\{{\boldsymbol{x}} \in \mathfrak{s}\,|\,(\theta_{j}({\boldsymbol{x}}) + {\boldsymbol{b}}^{(j-1)})_i = 0 \text{ for at least one } i =1, \dots, d_{j}\}\big|\\ &\leq d_1 + \sum_{j = 2}^{\ell} \sum_{i=1}^{d_j} \big|\{{\boldsymbol{x}} \in \mathfrak{s}\,|\,\eta_{j, i}({\boldsymbol{x}}) = - {\boldsymbol{b}}^{(j-1)}_i \}\big|. \end{align*} \]

By Theorem 11, we have that

\[ {\rm Pieces}\left(\eta_{\ell, i}( \, \cdot \, ; ({\boldsymbol{W}}^{(j)}, {\boldsymbol{b}}^{(j)})_{j = 0}^{\ell-1}), \mathfrak{s}\right) \leq (2 {\rm width}(\Phi))^{\ell-1}. \]

We define for \( k\in \mathbb{N} \cup \{\infty\}\)

\[ \begin{align*} p_{k,\ell, i} \mathrm{:=} \mathbb{P} \left[\big|\{{\boldsymbol{x}} \in \mathfrak{s}\,|\,\eta_{\ell, i}({\boldsymbol{x}}) = - {\boldsymbol{b}}^{(\ell-1)}_i \}\big| \geq k\right]. \end{align*} \]

Then by Lemma 18

\[ \begin{align*} p_{k,\ell, i} \leq \frac{C_\nu}{\delta k} \end{align*} \]

and for \( k > (2 \rm{ width}(\Phi))^{\ell-1}\)

\[ \begin{align*} p_{k,\ell, i} = 0. \end{align*} \]

It holds

\[ \begin{align*} &\mathbb{E}\left[\sum_{j = 2}^{L} \sum_{i=1}^{d_j} \Big|\left\{{\boldsymbol{x}} \in \mathfrak{s}\, \middle|\,\eta_{j, i}({\boldsymbol{x}}) = - {\boldsymbol{b}}^{(j-1)}_i \right\}\Big|\right] \\ \leq & \sum_{j = 2}^{L} \sum_{i=1}^{d_j} \sum_{k =1}^{\infty} k \cdot \mathbb{P}\left[ \Big|\left\{{\boldsymbol{x}} \in \mathfrak{s}\, \middle|\,\eta_{j, i}({\boldsymbol{x}}) = - {\boldsymbol{b}}^{(j-1)}_i\right\}\Big| = k\right]\\ \leq & \sum_{j = 2}^{L} \sum_{i=1}^{d_j} \sum_{k =1}^{\infty} k \cdot (p_{k,j, i} - p_{k+1,j, i}). \end{align*} \]

The inner sum can be bounded by

\[ \begin{align*} \sum_{k = 1}^\infty k \cdot (p_{k,j, i} - p_{k+1,j, i}) &= \sum_{k = 1}^\infty k \cdot p_{k,j, i} - \sum_{k = 1}^\infty k \cdot p_{k+1,j, i}\\ &= \sum_{k = 1}^\infty k \cdot p_{k,j, i} - \sum_{k = 2}^\infty (k-1) \cdot p_{k,j, i}\\ &= p_{1,j, i} + \sum_{k = 2}^\infty p_{k,j, i}\\ &= \sum_{k = 1}^\infty p_{k,j, i}\\ &\leq C_\nu \delta^{-1} \sum_{k =1}^{(2 {\rm width}(\Phi))^{L-1}} \frac{1}{k}\\ &\leq C_\nu \delta^{-1} \left(1+ \int_{1}^{(2 {\rm width}(\Phi))^{L-1}} \frac{1}{x} \,\mathrm{d} x\right)\\ &\leq C_\nu \delta^{-1} (1 + (L-1) \ln((2 {\rm width}(\Phi)))). \end{align*} \]

We conclude that, in expectation, we can bound \( q_{L+1}\) by

\[ \begin{align*} d_1 + C_\nu \delta^{-1} (1 + (L-1) \ln(2 {\rm width}(\Phi))) \sum_{j = 2}^{L} d_j. \end{align*} \]

Finally, since \( \Phi|_{\mathfrak{s}} = \theta_{L+1} + b^{(L)}\) is affine wherever \( \theta_{L+1}\) is affine, it follows that

\[ \begin{align*} {\rm Pieces}(\Phi, \mathfrak{s}) \leq q_{L + 1} + 1 \end{align*} \]

which yields the result.

Remark 6

We make the following observations about Theorem 13:

  • Non-exponential dependence on depth: If we consider (56), we see that the number of pieces scales in expectation essentially like \( O(LN)\) , where \( N\) is the total number of neurons of the architecture. This shows that in expectation, the number of pieces is linear in the number of layers, as opposed to the exponential upper bound of Theorem 11.
  • Maximal internal derivative: Theorem 13 requires the weights to be chosen such that the maximal internal derivative is bounded by a certain number. However, if they are randomly initialized in such a way that with high probability the maximal internal derivative is bounded by a small number, then similar results can be shown. In practice, weights in the \( \ell\) th layer are often initialized according to a centered normal distribution with standard deviation \( \sqrt{2/d_{\ell}}\) , [4]. Since the variance is inversely proportional to the layer width, the internal derivatives remain bounded with high probability, independently of the width of the neural network. This explains the observation from Figure 16.

Bibliography and further reading

Establishing bounds on the number of linear regions of a ReLU network has been a popular tool to investigate the complexity of ReLU neural networks, see [6, 7, 8, 9, 3]. The bound presented in Section 7.1 is based on [2]. For the construction of the sawtooth function in Section 7.2, we follow the arguments in [2, 10]. Together with the lower bound on the number of required linear regions given in [1], this analysis shows how depth can be a limiting factor in terms of achievable convergence rates, as stated in Theorem 12. Finally, the analysis of the number of pieces attained by randomly initialized deep neural networks (Section 7.3) is based on [3] and [5].

Exercises

Exercise 19

Let \( -\infty<a<b<\infty\) and let \( f\in C^3([a,b])\backslash\mathcal{P}_1\) . Denote by \( p(\varepsilon)\in\mathbb{N}\) the minimal number of intervals partitioning \( [a,b]\) , such that a (not necessarily continuous) piecewise linear function on \( p(\varepsilon)\) intervals can approximate \( f\) on \( [a,b]\) uniformly up to error \( \varepsilon>0\) . In this exercise, we wish to show

\[ \begin{align} \liminf_{\varepsilon\searrow 0} p(\varepsilon)\sqrt{\varepsilon} >0. \end{align} \]

(57)

Therefore, we can find a constant \( C>0\) such that \( \varepsilon\ge C p(\varepsilon)^{-2}\) for all \( \varepsilon>0\) . This shows a variant of Theorem 10. Proceed as follows to prove (57):

  1. Fix \( \varepsilon>0\) and let \( a=x_0<x_1<\dots<x_{p(\varepsilon)}=b\) be a partitioning into \( p(\varepsilon)\) pieces. For \( i=0,\dots,p(\varepsilon)-1\) and \( x\in [x_i,x_{i+1}]\) let

    \[ \begin{align*} e_i(x)\mathrm{:}= f(x) - \left(f(x_i)+\frac{f(x_{i+1})-f(x_i)}{x_{i+1}-x_i}(x-x_i)\right). \end{align*} \]

    Show that \( |e_i(x)|\le 2\varepsilon\) for all \( x\in [x_i,x_{i+1}]\) .

  2. With \( h_i\mathrm{:}= x_{i+1}-x_i\) and \( m_i\mathrm{:}= (x_i+x_{i+1})/{2}\) show that

    \[ \begin{align*} \max_{x\in [x_i,x_{i+1}]}|e_i(x)| = \frac{h_i^2}{8} |f''(m_i)| +O(h_i^3). \end{align*} \]
  3. Assuming that \( c\mathrm{:}= \inf_{x\in [a,b]}|f''(x)|>0\) show that

    \[ \begin{align*} \liminf_{\varepsilon\searrow 0}p(\varepsilon)\sqrt{\varepsilon} \ge \frac{1}{4}\int_a^b \sqrt{|f''(x)|}\,\mathrm{d} x. \end{align*} \]
  4. Conclude that (57) holds for general non-linear \( f\in C^3([a,b])\) .

Exercise 20

Show that, for \( L = 1\) , Theorem 11 holds for piecewise smooth functions, when replacing the number of affine pieces by the number of smooth pieces. These are defined by replacing “affine” by “smooth” (meaning \( C^\infty\) ) in Definition 15.

Exercise 21

Show that, for \( L > 1\) , Theorem 11 does not hold for piecewise smooth functions, when replacing the number of affine pieces by the number of smooth pieces.

Exercise 22

For \( p \in \mathbb{N}\) , \( p > 2\) and \( n\in \mathbb{N}\) , construct a function \( h^{(p)}_n\) similar to the sawtooth function \( h_n\) from Section 7.2, such that \( h^{(p)}_n \in \mathcal{N}_1^1(\sigma_{\rm {ReLU}};n,p)\) and such that \( h^{(p)}_n\) has \( p^n\) pieces and size \( O(p^2 n)\) .

8 Deep ReLU neural networks

In the previous chapter, we observed that many layers are a necessary prerequisite for ReLU neural networks to approximate smooth functions with high rates. We now analyze which depth is sufficient to achieve good approximation rates for smooth functions.

To approximate smooth functions efficiently, one of the main tools in Chapter 5 was to rebuild polynomial-based functions, such as higher-order B-splines. For smooth activation functions, we were able to reproduce polynomials by using the nonlinearity of the activation functions. This argument certainly cannot be repeated for the piecewise linear ReLU. On the other hand, up until now, we have seen that deep ReLU neural networks are extremely efficient at producing the strongly oscillating sawtooth functions discussed in Lemma 17. The main observation in this chapter is that the sawtooth functions are intimately linked to the squaring function, which again leads to polynomials. This observation was first made by Dmitry Yarotsky [1] in 2016, and the present chapter is primarily based on this paper.

In Sections 8.1 and 8.2, we give Yarotsky’s approximation of the squaring and multiplication functions. As a direct consequence, we show in Section 8.3 that deep ReLU neural networks can be significantly more efficient than shallow ones in approximating analytic functions such as polynomials and (certain) trigonometric functions. Using these tools, we conclude in Section 8.4 that deep ReLU neural networks can efficiently approximate \( k\) -times continuously differentiable functions with Hölder continuous derivatives.

8.1 The square function

We start with the approximation of the map \( x\mapsto x^2\) . The construction, first given in [1], is based on the sawtooth functions \( h_n\) defined in Section 7.2 and originally introduced in [2], see Figure 15. The proof idea is visualized in Figure 19.

Figure 19. Construction of \( s_n\) in Proposition 8.

Proposition 8

Let \( n\in\mathbb{N}\) . Then

\[ \begin{align*} s_n(x)\mathrm{:}= x-\sum_{j=1}^n \frac{h_j(x)}{2^{2j}} \end{align*} \]

is a piecewise linear function on \( [0,1]\) with break points \( x_{n,j}=j2^{-n}\) , \( j=0,\dots,2^n\) . Moreover, \( s_n(x_{n,k})=x_{n,k}^2\) for all \( k=0,\dots,2^n\) , i.e., \( s_n\) is the piecewise linear interpolant of \( x^2\) on \( [0,1]\) .

Proof

The statement holds for \( n=1\) . We proceed by induction. Assume the statement holds for \( s_n\) and let \( k\in\{0,\dots,2^{n+1}\}\) . By Lemma 17, \( h_{n+1}(x_{n+1,k})=0\) whenever \( k\) is even. Hence for even \( k\in\{0,\dots,2^{n+1}\}\)

\[ \begin{align*} s_{n+1}(x_{n+1,k}) &= x_{n+1,k}-\sum_{j=1}^{n+1} \frac{h_j(x_{n+1,k})}{2^{2j}}\\ & =s_n(x_{n+1,k})-\frac{h_{n+1}(x_{n+1,k})}{2^{2(n+1)}} =s_n(x_{n+1,k})=x_{n+1,k}^2, \end{align*} \]

where we used the induction assumption \( s_n(x_{n+1,k})=x_{n+1,k}^2\) for \( x_{n+1,k}=k2^{-(n+1)}=\frac{k}{2}2^{-n}=x_{n,k/2}\) .

Now let \( k\in\{1,\dots,2^{n+1}-1\}\) be odd. Then by Lemma 17, \( h_{n+1}(x_{n+1,k})=1\) . Moreover, since \( s_n\) is linear on \( [x_{n,(k-1)/{2}},x_{n,(k+1)/{2}}]=[x_{n+1,k-1},x_{n+1,k+1}]\) and \( x_{n+1,k}\) is the midpoint of this interval,

\[ \begin{align*} s_{n+1}(x_{n+1,k}) &= s_n(x_{n+1,k}) - \frac{h_{n+1}(x_{n+1,k})}{2^{2(n+1)}}\nonumber\\ &= \frac{1}{2}(x_{n+1,k-1}^2+x_{n+1,k+1}^2)-\frac{1}{2^{2(n+1)}}\nonumber\\ &=\frac{(k-1)^2}{2^{2(n+1)+1}}+\frac{(k+1)^2}{2^{2(n+1)+1}} -\frac{2}{2^{2(n+1)+1}}\nonumber\\ &= \frac{1}{2}\frac{2k^2}{2^{2(n+1)}} = \frac{k^2}{2^{2(n+1)}}=x_{n+1,k}^2. \end{align*} \]

This completes the proof.

As a consequence, we obtain the following result, which corresponds to [1, Proposition 2].

Lemma 19

For \( n \in \mathbb{N}\) , it holds

\[ \begin{align*} \sup_{x\in[0,1]}|x^2-s_n(x)|\le 2^{-2n-1}. \end{align*} \]

Moreover \( s_n\in\mathcal{N}_{1}^1(\sigma_{\rm {ReLU}};n,3)\) , and \( {\rm {size}}(s_n)\le 7n\) and \( {\rm {depth}}(s_n)=n\) .

Proof

Set \( e_n(x)\mathrm{:}= x^2-s_n(x)\) . Let \( x\) be in the interval \( [x_{n,k},x_{n,k+1}]=[k2^{-n},(k+1)2^{-n}]\) of length \( 2^{-n}\) . Since \( s_n\) is the linear interpolant of \( x^2\) on this interval, we have

\[ \begin{align*} |e_n'(x)| = \left|2x-\frac{x_{n,k+1}^2-x_{n,k}^2}{2^{-n}}\right| =\left|2x - \frac{2k+1}{2^n}\right|\le \frac{1}{2^n}. \end{align*} \]

Thus \( e_n:[0,1]\to\mathbb{R}\) has Lipschitz constant \( 2^{-n}\) . Since \( e_n(x_{n,k})=0\) for all \( k=0,\dots,2^n\) , and the length of the interval \( [x_{n,k},x_{n,k+1}]\) equals \( 2^{-n}\) we get

\[ \begin{align*} \sup_{x\in [0,1]}|e_n(x)|\le \frac{1}{2} 2^{-n} 2^{-n}=2^{-2n-1}. \end{align*} \]

Finally, to see that \( s_n\) can be represented by a neural network of the claimed architecture, note that for \( n\ge 2\)

\[ \begin{align*} s_n(x) = x-\sum_{j=1}^n\frac{h_j(x)}{2^{2j}} = s_{n-1}(x)-\frac{h_n(x)}{2^{2n}} = \sigma_{\rm ReLU}\circ s_{n-1}(x)-\frac{h_1\circ h_{n-1}(x)}{2^{2n}}. \end{align*} \]

Here we used that \( s_{n-1}\) is the piecewise linear interpolant of \( x^2\) , so that \( s_{n-1}(x)\ge 0\) and thus \( s_{n-1}(x)=\sigma_{\rm {ReLU}}(s_{n-1}(x))\) for all \( x\in [0,1]\) . Hence \( s_n\) is of depth \( n\) and width \( 3\) , see Figure 20.

Figure 20. The neural networks \( h_1(x)=\sigma_{\rm {ReLU}}(2x)-\sigma_{\rm ReLU}(4x-2)\) and \( s_n(x) = \sigma_{\rm {ReLU}}(s_{n-1}(x))-{h_n(x)}/{2^{2n}}\) where \( h_n = h_1\circ h_{n-1}\) . Figure based on [1, Fig. 2c] and [3, Fig. 1a].

In conclusion, we have shown that \( s_n:[0,1]\to [0,1]\) approximates the square function uniformly on \( [0,1]\) with exponentially decreasing error in the neural network size. Note that due to Theorem 12, this would not be possible with a shallow neural network, which can at best interpolate \( x^2\) on a partition of \( [0,1]\) with polynomially many (w.r.t. the neural network size) pieces.
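A quick numerical check of Lemma 19 (our own sketch, not part of the original text): the following Python code assembles \( s_n\) from the composed sawtooth functions and compares the resulting maximal deviation from \( x^2\) against the bound \( 2^{-2n-1}\) .

# Illustrative sketch: s_n(x) = x - sum_{j<=n} h_j(x)/2^(2j), compared with x^2.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h1(x):
    return relu(2.0 * x) - relu(4.0 * x - 2.0)   # the hat function on [0, 1]

def s(n, x):
    y, hx = x.copy(), x.copy()
    for j in range(1, n + 1):
        hx = h1(hx)            # hx equals h_j(x) after j applications of h_1
        y = y - hx / 4.0**j
    return y

x = np.linspace(0.0, 1.0, 2**16 + 1)
for n in range(1, 9):
    print(n, np.max(np.abs(x**2 - s(n, x))), 2.0**(-2 * n - 1))  # error vs. bound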

8.2 Multiplication

According to Lemma 19, depth can help in the approximation of \( x\mapsto x^2\) , which, at first sight, seems like a rather specific example. However, as we shall discuss in the following, this opens up a path towards fast approximation of functions with high regularity, e.g., \( C^k([0,1]^d)\) for some \( k>1\) . The crucial observation is that, via the polarization identity, we can write the product of two numbers as a difference of squares

\[ \begin{align} x\cdot y = \frac{(x+y)^2-(x-y)^2}{4} \end{align} \]

(62)

for all \( x\) , \( y\in\mathbb{R}\) . Efficient approximation of the operation of multiplication allows efficient approximation of polynomials. Those in turn are well-known to be good approximators for functions exhibiting \( k\in\mathbb{N}\) derivatives. Before exploring this idea further in the next section, we first make precise the observation that neural networks can efficiently approximate the multiplication of real numbers.

We start with the multiplication of two numbers, in which case neural networks of logarithmic size in the desired accuracy are sufficient, [1, Proposition 3].

Lemma 20

For every \( \varepsilon>0\) there exists a ReLU neural network \( {\Phi^{\times}_{\varepsilon}}:[-1,1]^2\to [-1,1]\) such that

\[ \begin{align*} \sup_{x,y\in [-1,1]}|x\cdot y-{\Phi^{\times}_{\varepsilon}}(x,y)|\le \varepsilon, \end{align*} \]

and it holds \( {\rm {size}}({\Phi^{\times}_{\varepsilon}})\le C \cdot (1+|\log(\varepsilon)|)\) and \( {\rm {depth}}({\Phi^{\times}_{\varepsilon}})\le C\cdot(1+|\log(\varepsilon)|)\) for a constant \( C>0\) independent of \( \varepsilon\) . Moreover, \( {\Phi^{\times}_{\varepsilon}}(x,y)=0\) if \( x=0\) or \( y=0\) .

Proof

With \( n=\lceil|\log_4(\varepsilon)|\rceil\) , define the neural network

\[ \begin{align} {\Phi^{\times}_{\varepsilon}}(x,y)\mathrm{:}= s_n\left(\frac{\sigma_{\rm ReLU}(x+y)+\sigma_{\rm ReLU}(-x-y)}{2}\right) - s_n \left(\frac{\sigma_{\rm ReLU}(x-y)+\sigma_{\rm ReLU}(y-x)}{2}\right). \end{align} \]

(63)

Since \( |a|=\sigma_{\rm {ReLU}}(a)+\sigma_{\rm ReLU}(-a)\) , by (62) we have for all \( x\) , \( y\in [-1,1]\)

\[ \begin{align*} \left| x\cdot y-{\Phi^{\times}_{\varepsilon}}(x,y)\right| &= \left|\frac{(x+y)^2-(x-y)^2}{4}-\left(s_n\left(\frac{|x+y|}{2}\right)-s_n\left(\frac{|x-y|}{2}\right)\right)\right|\nonumber\\ &= \left|\frac{4(\frac{x+y}{2})^2-4(\frac{x-y}{2})^2}{4}-\frac{4s_n(\frac{|x+y|}{2})-4s_n(\frac{|x-y|}{2})}{4}\right|\nonumber\\ &\le \frac{4(2^{-2n-1}+2^{-2n-1})}{4}=4^{-n}\le \varepsilon, \end{align*} \]

where we used \( |x+y|/2\) , \( |x-y|/2\in [0,1]\) . We have \( {\rm {depth}}({\Phi^{\times}_{\varepsilon}})=1+{\rm depth}(s_n)=1+n\le 2+|\log_4(\varepsilon)|\) and \( {\rm {size}}({\Phi^{\times}_{\varepsilon}}) \le C+2{\rm size}(s_n)\le C n\le C\cdot (1+|\log(\varepsilon)|)\) for some constant \( C>0\) .

The fact that \( {\Phi^{\times}_{\varepsilon}}\) maps from \( [-1,1]^2\) to \( [-1,1]\) follows by (63) and because \( s_n:[0,1]\to [0,1]\) . Finally, if \( x=0\) , then \( {\Phi^{\times}_{\varepsilon}}(x,y)=s_n(|x+y|/2)-s_n(|x-y|/2)=s_n(|y|/2)-s_n(|y|/2)=0\) . If \( y=0\) the same argument applies.
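Continuing the sketch from Section 8.1 (again only an illustration of ours), the network of Lemma 20 can be mimicked numerically: via the polarization identity (62), two calls to \( s_n\) approximate the product, and the observed error stays below \( 4^{-n}\) .

# Illustrative sketch: approximate multiplication on [-1,1]^2 via (62),
# using the interpolant s_n of the square function from Section 8.1.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h1(x):
    return relu(2.0 * x) - relu(4.0 * x - 2.0)

def s(n, x):
    y, hx = x.copy(), x.copy()
    for j in range(1, n + 1):
        hx = h1(hx)
        y = y - hx / 4.0**j
    return y

def mult(n, x, y):
    # |a| = relu(a) + relu(-a); the arguments |x+y|/2, |x-y|/2 lie in [0, 1]
    return s(n, np.abs(x + y) / 2.0) - s(n, np.abs(x - y) / 2.0)

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 100000)
y = rng.uniform(-1.0, 1.0, 100000)
for n in range(1, 9):
    print(n, np.max(np.abs(x * y - mult(n, x, y))), 4.0**(-n))  # error vs. 4^{-n}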

In a similar way as in Proposition 6 and Lemma 11, we can apply operations with two inputs in the form of a binary tree to extend them to an operation on arbitrarily many inputs; see again [1], and [3, Proposition 3.3] for the specific argument considered here.

Proposition 9

For every \( n\ge 2\) and \( \varepsilon>0\) there exists a ReLU neural network \( {\Phi^{\times}_{n,\varepsilon}}:[-1,1]^n\to [-1,1]\) such that

\[ \begin{align*} \sup_{x_j\in [-1,1]}\left|\prod_{j=1}^n x_j-{\Phi^{\times}_{n,\varepsilon}}(x_1,\dots,x_n)\right|\le \varepsilon, \end{align*} \]

and it holds \( {\rm {size}}({\Phi^{\times}_{n,\varepsilon}})\le Cn\cdot(1+|\log(\varepsilon /n)|)\) and \( {\rm {depth}}({\Phi^{\times}_{n,\varepsilon}})\le C\log(n)(1+|\log(\varepsilon /n)|)\) for a constant \( C>0\) independent of \( \varepsilon\) and \( n\) .

Proof

We begin with the case \( n=2^k\) . For \( k=1\) let \( {\tilde\Phi^{\times}_{2,\delta}}\mathrm{:}= {\Phi^{\times}_{\delta}}\) . If \( k\ge 2\) let

\[ \begin{align*} {\tilde\Phi^{\times}_{2^k,\delta}}\mathrm{:}= {\Phi^{\times}_{\delta}}\circ \left({\tilde\Phi^{\times}_{2^{k-1},\delta}},{\tilde\Phi^{\times}_{2^{k-1},\delta}}\right). \end{align*} \]

Using Lemma 20, we find that this neural network has depth bounded by

\[ \begin{align*} {\rm depth}\left({\tilde\Phi^{\times}_{2^k,\delta}}\right) \le k{\rm depth}({\Phi^{\times}_{\delta}})\le Ck \cdot (1+|\log(\delta)|) \le C\log(n)(1+|\log(\delta)|). \end{align*} \]

Observing that the number of occurrences of \( {\Phi^{\times}_{\delta}}\) equals \( \sum_{j=0}^{k-1}2^j\le n\) , the size of \( {\tilde\Phi^{\times}_{2^k,\delta}}\) can be bounded by \( Cn {\rm {size}}({\Phi^{\times}_{\delta}})\le Cn \cdot (1+|\log(\delta)|)\) .

To estimate the approximation error, denote with \( {\boldsymbol{x}}=(x_j)_{j=1}^{2^k}\)

\[ \begin{align*} e_k\mathrm{:}= \sup_{x_j\in [-1,1]} \left|\prod_{j\le 2^{k}}x_j-{\tilde\Phi^{\times}_{2^{k},\delta}}({\boldsymbol{x}})\right|. \end{align*} \]

Then, using short notation of the type \( {\boldsymbol{x}}_{\le 2^{k-1}}\mathrm{:}= (x_1,\dots,x_{2^{k-1}})\) ,

\[ \begin{align*} e_k &=\sup_{x_j\in [-1,1]}\left|\prod_{j=1}^{2^k}x_j-{\Phi^{\times}_{\delta}}\left({\tilde\Phi^{\times}_{2^{k-1},\delta}}({\boldsymbol{x}}_{\le 2^{k-1}}),{\tilde\Phi^{\times}_{2^{k-1},\delta}}({\boldsymbol{x}}_{>2^{k-1}})\right) \right|\nonumber\\ &\le \delta + \sup_{x_j\in [-1,1]}\left( \left|\prod_{j\le 2^{k-1}}x_j\right| e_{k-1} +\left|{\tilde\Phi^{\times}_{2^{k-1},\delta}}({\boldsymbol{x}}_{>2^{k-1}})\right| e_{k-1} \right)\nonumber\\ &\le \delta +2 e_{k-1}\le \delta+2(\delta+2e_{k-2})\le\dots\le \delta\sum_{j=0}^{k-2}2^j+2^{k-1}e_1\nonumber\\ &\le 2^k \delta =n\delta. \end{align*} \]

Here we used \( e_1\le\delta\) , and that \( {\tilde\Phi^{\times}_{2^{k-1},\delta}}\) maps \( [-1,1]^{2^{k-1}}\) to \( [-1,1]\) , which is a consequence of Lemma 20.

The case of general \( n\ge 2\) (not necessarily \( n=2^k\) ) is treated similarly to Lemma 11, by replacing some of the \( {\Phi^{\times}_{\delta}}\) neural networks with identity neural networks.

Finally, setting \( \delta\mathrm{:}= {\varepsilon}/{n}\) and \( {\Phi^{\times}_{n,\varepsilon}}\mathrm{:}= {\tilde\Phi^{\times}_{n,\delta}}\) concludes the proof.
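The error recursion in this proof only uses that \( {\Phi^{\times}_{\delta}}\) is \( \delta\) -accurate and maps into \( [-1,1]\) . The following sketch therefore replaces the pairwise network by a generic (hypothetical) \( \delta\) -accurate surrogate and checks empirically that the binary-tree composition stays within the bound \( n\delta\) for \( n=2^k\) inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_mult_pair(a, b, delta):
    """Hypothetical delta-accurate surrogate for the pairwise network Phi^x_delta:
    the exact product, perturbed by at most delta and clipped to [-1, 1]."""
    return np.clip(a * b + delta * (2 * rng.random(np.shape(a)) - 1), -1.0, 1.0)

def tree_product(xs, delta):
    """Binary-tree composition of pairwise approximate multiplications (n = 2^k)."""
    vals = np.asarray(xs, dtype=float)
    assert vals.size & (vals.size - 1) == 0, "this demo assumes n is a power of two"
    while vals.size > 1:
        vals = approx_mult_pair(vals[0::2], vals[1::2], delta)
    return float(vals[0])

n, eps = 16, 1e-3
delta = eps / n                        # as in the last step of the proof
worst = 0.0
for _ in range(2000):
    x = 2 * rng.random(n) - 1
    worst = max(worst, abs(np.prod(x) - tree_product(x, delta)))
print(f"observed worst error {worst:.2e}  vs  bound n*delta = {n * delta:.2e}")
```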

8.3 Polynomials and depth separation

As a first consequence of the above observations, we consider approximating the polynomial

\[ \begin{equation} p(x)=\sum_{j=0}^n c_jx^j. \end{equation} \]

(65)

One possibility to approximate \( p\) is via the Horner scheme and the approximate multiplication \( \Phi^{\times}_\varepsilon\) from Lemma 20, yielding

\[ \begin{align*} p(x) &= c_0+x\cdot(c_1+x\cdot(\dots +x\cdot c_n)\dots)\\ &\simeq c_0+\Phi_\varepsilon^\times(x,c_1+\Phi_\varepsilon^\times(x,c_2\dots +\Phi_\varepsilon^\times(x,c_n))\dots). \end{align*} \]

This scheme requires depth \( O(n)\) due to the nested multiplications. An alternative is to approximate all monomials \( 1,x,\dots,x^n\) with a binary tree using approximate multiplications \( \Phi_\varepsilon^\times\) , and combining them in the output layer, see Figure 21. This idea leads to a network of size \( O(n\log(n))\) and depth \( O(\log(n))\) . The following lemma formalizes this, see [4, Lemma A.5], [5, Proposition III.5], and in particular [6, Lemma 4.3]. The proof is left as Exercise 24.

Lemma 21

There exists a constant \( C>0\) , such that for any \( \varepsilon\in (0,1)\) and any polynomial \( p\) of degree \( n\ge 2\) as in (65), there exists a neural network \( \Phi_\varepsilon^p\) such that

\[ \sup_{x\in [-1,1]}|p(x)-\Phi_\varepsilon^p(x)|\le C \varepsilon \sum_{j=0}^n|c_j| \]

and \( {\rm {size}}(\Phi_\varepsilon^p)\le C n\log(n/\varepsilon)\) and \( {\rm {depth}}(\Phi_\varepsilon^p)\le C\log(n/\varepsilon)\) .

Figure 21. Monomials \( 1,\dots,x^n\) with \( n=2^{k}\) can be generated in a binary tree of depth \( k\) . Each node represents the product of its inputs, with single-input nodes interpreted as squares.
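The following sketch mimics the monomial tree of Figure 21 with a generic (hypothetical) \( \delta\) -accurate surrogate in place of \( \Phi_\varepsilon^\times\) : each power \( x^j\) is an approximate product of \( x^{\lfloor j/2\rfloor}\) and \( x^{\lceil j/2\rceil}\) , and the monomials are combined linearly in the output layer. With \( \delta=\varepsilon/n\) , the observed error stays below \( \varepsilon\sum_j|c_j|\) , in line with Lemma 21 (whose proof is Exercise 24).

```python
import numpy as np

rng = np.random.default_rng(1)

def approx_mult(a, b, delta):
    """Hypothetical delta-accurate surrogate for Phi^x_delta on [-1, 1]^2."""
    return float(np.clip(a * b + delta * (2 * rng.random() - 1), -1.0, 1.0))

def approx_powers(x, n, delta):
    """Monomials x^0, ..., x^n via the binary tree of Figure 21:
    x^j is an approximate product of x^(j//2) and x^(j - j//2)."""
    p = [1.0, float(x)]                           # x^0 and x^1 are exact
    for j in range(2, n + 1):
        p.append(approx_mult(p[j // 2], p[j - j // 2], delta))
    return p

def approx_poly(x, coeffs, delta):
    """p(x) = sum_j c_j x^j, monomials combined linearly in the output layer."""
    return sum(c * m for c, m in zip(coeffs, approx_powers(x, len(coeffs) - 1, delta)))

n, eps = 20, 1e-4
coeffs = 2 * rng.random(n + 1) - 1                # coefficients c_0, ..., c_n in [-1, 1]
delta = eps / n                                   # accuracy per multiplication
worst = 0.0
for x in np.linspace(-1, 1, 401):
    exact = sum(c * x**j for j, c in enumerate(coeffs))
    worst = max(worst, abs(exact - approx_poly(x, coeffs, delta)))
print(f"worst error {worst:.2e}  vs  eps * sum|c_j| = {eps * np.abs(coeffs).sum():.2e}")
```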

Lemma 21 shows that deep ReLU networks can approximate polynomials efficiently. This leads to an interesting implication regarding the superiority of deep architectures. Recall that \( f:[-1,1]\to\mathbb{R}\) is analytic if its Taylor series around any point \( x\in [-1,1]\) converges to \( f\) in a neighbourhood of \( x\) . For instance, all polynomials, \( \sin\) , \( \cos\) , \( \exp\) , etc. are analytic. We now show that these functions (except linear ones) can be approximated much more efficiently with deep ReLU networks than by shallow ones: for networks of fixed depth, the number of parameters must grow faster than any polynomial in the size required by deep architectures. More precisely, there holds the following.

Proposition 10

Let \( L\in\mathbb{N}\) and let \( f:[-1,1]\to\mathbb{R}\) be analytic but not linear. Then there exist constants \( C\) , \( \beta>0\) such that for every \( \varepsilon>0\) , there exists a ReLU neural network \( \Phi_{\rm {deep}}\) satisfying

\[ \begin{equation} \sup_{x\in [-1,1]}|f(x)-\Phi_{\rm deep}(x)|\le C\exp\Big(-\beta \sqrt{{\rm size}(\Phi_{\rm deep})}\Big)\le \varepsilon, \end{equation} \]

(66)

but for any ReLU neural network \( \Phi_{\rm {shallow}}\) of depth at most \( L\) it holds that

\[ \begin{equation} \sup_{x\in [-1,1]}|f(x)-\Phi_{\rm shallow}(x)|\ge C^{-1} {\rm size}(\Phi_{\rm shallow})^{-2L}. \end{equation} \]

(67)

Proof

The lower bound (67) holds by Theorem 12.

Let us show the upper bound on the deep neural network. Assume first that the convergence radius of the Taylor series of \( f\) around \( 0\) is \( r>1\) . Then for all \( x\in [-1,1]\)

\[ f(x)=\sum_{j\in\mathbb{N}_0}c_j x^j~~\text{where}~~ c_j= \frac{f^{(j)}(0)}{j!}~~\text{and}~~ |c_j|\le C_r r^{-j}, \]

for all \( j\in\mathbb{N}_0\) and some \( C_r>0\) . Hence \( p_n(x)\mathrm{:}= \sum_{j=0}^n c_j x^j\) satisfies

\[ \sup_{x\in [-1,1]}|f(x)-p_n(x)|\le \sum_{j>n}|c_j| \le C_r \sum_{j>n} r^{-j}\le \frac{C_r r^{-n}}{1-r^{-1}}. \]

Let now \( \Phi^{p_n}_\varepsilon\) be the network in Lemma 21. Then

\[ \sup_{x\in [-1,1]}|f(x)-\Phi^{p_n}_\varepsilon(x)|\le \tilde C \cdot\Big(\varepsilon+\frac{r^{-n}}{1-r^{-1}}\Big), \]

for some \( \tilde C=\tilde C(r,C_r)\) . Choosing \( n=n(\varepsilon)=\lceil \log(1/\varepsilon)/\log(r)\rceil\) , with the bounds from Lemma 21 we find that

\[ \sup_{x\in [-1,1]}|f(x)-\Phi^{p_n}_\varepsilon(x)|\le 2 \tilde C \varepsilon \]

and for another constant \( \hat C=\hat C(r)\)

\[ {\rm size}(\Phi_\varepsilon^{p_n})\le \hat C\cdot (1+\log(\varepsilon)^2)~~ \text{and}~~ {\rm depth}(\Phi_\varepsilon^{p_n})\le \hat C\cdot (1+|\log(\varepsilon)|). \]

This implies the existence of \( C\) , \( \beta>0\) and \( \Phi_{\rm {deep}}\) as in (66). The general case, where the Taylor expansion of \( f\) converges only locally, is left as Exercise 25.
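As a purely numerical illustration of this argument (not part of the proof), consider \( f(x)=1/(2-x)\) , whose Taylor coefficients around \( 0\) are \( c_j=2^{-(j+1)}\) with convergence radius \( r=2>1\) . Using the size proxy \( N\approx n\log(n/\varepsilon)\) suggested by Lemma 21, the quantity \( -\log({\rm error})/\sqrt{N}\) should stay roughly constant, which is the \( \exp(-\beta\sqrt{N})\) behavior of (66).

```python
import numpy as np

# f(x) = 1/(2 - x) is analytic on [-1, 1]; its Taylor coefficients around 0 are
# c_j = 2^{-(j+1)}, so the convergence radius is r = 2 > 1.  Truncating after n
# terms and charging n*log(n/eps) parameters for the polynomial (Lemma 21) gives
# a size proxy N; the last column should stay roughly constant, reflecting (66).
xs = np.linspace(-1, 1, 1001)
f = 1.0 / (2.0 - xs)
print(f"{'n':>3} {'trunc. error':>14} {'size proxy N':>14} {'-log(err)/sqrt(N)':>18}")
for n in range(2, 28, 4):
    c = 0.5 ** (np.arange(n + 1) + 1)             # c_j = 2^{-(j+1)}, j = 0, ..., n
    p = np.polynomial.polynomial.polyval(xs, c)   # truncated Taylor polynomial p_n
    err = np.max(np.abs(f - p))
    N = n * np.log(n / err)                       # size proxy from Lemma 21 with eps = err
    print(f"{n:3d} {err:14.3e} {N:14.1f} {-np.log(err) / np.sqrt(N):18.3f}")
```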

The proposition shows that the approximation of certain (highly relevant) functions requires significantly more parameters when using shallow instead of deep architectures. Such statements are known as depth separation results. We refer for instance to [2, 7, 8], where such a result was shown by Telgarsky based on the sawtooth function constructed in Section 7.2. Lower bounds on the approximation in the spirit of Proposition 10 were also given in [9] and [1].

Remark 7

Proposition 10 shows in particular that for analytic \( f:[-1,1]\to\mathbb{R}\) , the error bound \( \exp(-\beta \sqrt{N})\) holds in terms of the network size \( N\) . This can be generalized to multivariate analytic functions \( f:[-1,1]^d\to\mathbb{R}\) , in which case the bound reads \( \exp(-\beta N^{1/(1+d)})\) , see [10, 11].

8.4 \( C^{k,s}\) functions

We will now discuss the implications of our observations in the previous sections for the approximation of functions in the class \( C^{k,s}\) .

Definition 18

Let \( k\in\mathbb{N}_0\) , \( s\in [0,1]\) and \( \Omega\subseteq\mathbb{R}^d\) . Then for \( f:\Omega\to\mathbb{R}\)

\[ \begin{align} \begin{split} \| f \|_{C^{k,s}(\Omega)}\mathrm{:}=& \sup_{{\boldsymbol{x}}\in \Omega}\max_{\{{\boldsymbol{\alpha}}\in \mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|\le k\}} |D^{\boldsymbol{\alpha}} f({\boldsymbol{x}})| \\ &+ \sup_{{\boldsymbol{x}}\neq{\boldsymbol{y}}\in \Omega} \max_{\{{\boldsymbol{\alpha}}\in \mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|=k\}} \frac{|D^{\boldsymbol{\alpha}} f({\boldsymbol{x}})-D^{\boldsymbol{\alpha}} f({\boldsymbol{y}})|}{\| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{}^s}, \end{split} \end{align} \]

(68)

and we denote by \( C^{k,s}(\Omega)\) the set of functions \( f\in C^k(\Omega)\) for which \( \| f \|_{C^{k,s}(\Omega)}<\infty\) .

Note that these spaces are ordered according to

\[ \begin{align*} C^k(\Omega)\supseteq C^{k,s}(\Omega)\supseteq C^{k,t}(\Omega)\supseteq C^{k+1}(\Omega) \end{align*} \]

for all \( 0<s\le t\le 1\) .

In order to state our main result, we first recall a version of Taylor’s remainder formula for \( C^{k,s}(\Omega)\) functions.

Lemma 22

Let \( d\in\mathbb{N}\) , \( k\in\mathbb{N}\) , \( s\in [0,1]\) , \( \Omega=[0,1]^d\) and \( f\in C^{k,s}(\Omega)\) . Then for all \( {\boldsymbol{a}}\) , \( {\boldsymbol{x}}\in\Omega\)

\[ \begin{align} f({\boldsymbol{x}})=\sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,0\le |{\boldsymbol{\alpha}}|\le k\}} \frac{D^{\boldsymbol{\alpha}} f({\boldsymbol{a}})}{{\boldsymbol{\alpha}} !}({\boldsymbol{x}}-{\boldsymbol{a}})^{\boldsymbol{\alpha}}+R_k({\boldsymbol{x}}) \end{align} \]

(69)


where with \( h\mathrm{:}= \max_{i\le d}|a_i-x_i|\) we have \( |R_k({\boldsymbol{x}})|\le h^{k+s}\frac{d^{k+1/2}}{k!}\| f \|_{C^{k,s}(\Omega)}\) .

Proof

First, for a function \( g\in C^{k}(\mathbb{R})\) and \( a\) , \( t\in\mathbb{R}\)

\[ \begin{align*} g(t)&=\sum_{j=0}^{k-1}\frac{g^{(j)}(a)}{j!} (t-a)^j + \frac{g^{(k)}(\xi)}{k!}(t-a)^{k}\\ &=\sum_{j=0}^{k}\frac{g^{(j)}(a)}{j!} (t-a)^j + \frac{g^{(k)}(\xi)-g^{(k)}(a)}{k!}(t-a)^{k}, \end{align*} \]

for some \( \xi\) between \( a\) and \( t\) . Now let \( f\in C^{k,s}(\mathbb{R}^d)\) and \( {\boldsymbol{a}}\) , \( {\boldsymbol{x}}\in\mathbb{R}^d\) . With \( g(t)\mathrm{:}= f({\boldsymbol{a}}+t \cdot ({\boldsymbol{x}}-{\boldsymbol{a}}))\) we have \( f({\boldsymbol{x}})=g(1)\) , and thus

\[ \begin{align*} f({\boldsymbol{x}})= \sum_{j=0}^{k-1}\frac{g^{(j)}(0)}{j!} + \frac{g^{(k)}(\xi)}{k!}. \end{align*} \]

By the chain rule

\[ \begin{align*} g^{(j)}(t) = \sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|=j\}}\binom{j}{{\boldsymbol{\alpha}}}D^{\boldsymbol{\alpha}} f({\boldsymbol{a}}+t \cdot ({\boldsymbol{x}}-{\boldsymbol{a}}))({\boldsymbol{x}}-{\boldsymbol{a}})^{\boldsymbol{\alpha}}, \end{align*} \]

where we use the multivariate notations \( \binom{j}{{\boldsymbol{\alpha}}}=\frac{j!}{{\boldsymbol{\alpha}} !}=\frac{j!}{\prod_{i=1}^d\alpha_i!}\) and \( ({\boldsymbol{x}}-{\boldsymbol{a}})^{\boldsymbol{\alpha}}=\prod_{i=1}^d(x_i-a_i)^{\alpha_i}\) .

\[ \begin{align*} f({\boldsymbol{x}}) = &\underbrace{\sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,0\le |{\boldsymbol{\alpha}}|\le k\}} \frac{D^{\boldsymbol{\alpha}} f({\boldsymbol{a}})}{{\boldsymbol{\alpha}} !}({\boldsymbol{x}}-{\boldsymbol{a}})^{\boldsymbol{\alpha}}}_{\in \mathcal{P}_k} \\ & ~~ +\underbrace{\sum_{|{\boldsymbol{\alpha}}|=k} \frac{D^{\boldsymbol{\alpha}} f({\boldsymbol{a}}+\xi\cdot ({\boldsymbol{x}}-{\boldsymbol{a}}))-D^{\boldsymbol{\alpha}} f({\boldsymbol{a}})}{{\boldsymbol{\alpha}} !}({\boldsymbol{x}}-{\boldsymbol{a}})^{{\boldsymbol{\alpha}}}}_{=\mathrm{:} R_k}, \end{align*} \]

for some \( \xi\in [0,1]\) . Using the definition of \( h\) , the remainder term can be bounded by

\[ \begin{align*} |R_k|&\le h^{k} \max_{|{\boldsymbol{\alpha}}|=k}\sup_{\substack{{\boldsymbol{x}}\in\Omega\\ t\in [0,1]}}|D^{\boldsymbol{\alpha}} f({\boldsymbol{a}}+t\cdot ({\boldsymbol{x}}-{\boldsymbol{a}}))-D^{\boldsymbol{\alpha}} f({\boldsymbol{a}})| \frac{1}{k!}\sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|=k\}}\binom{k}{{\boldsymbol{\alpha}}}\nonumber\\ &\le h^{k+s}\frac{d^{k+\frac s 2}}{k!}\| f \|_{C^{k,s}({\Omega})}, \end{align*} \]

where we used (68), \( \| {\boldsymbol{x}}-{\boldsymbol{a}} \|_{}\le \sqrt{d}h\) , and \( \sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|=k\}}\binom{k}{{\boldsymbol{\alpha}}}=(1+…+1)^k=d^k\) by the multinomial formula.
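A quick numerical sanity check of the remainder bound: for \( f(x_1,x_2)=\sin(x_1)\sin(x_2)\) , \( k=2\) , \( s=1\) , \( d=2\) , the crude estimate \( \| f \|_{C^{2,1}}\le 1+\sqrt 2\) (supremum part plus Hölder part) is a valid upper bound for the norm in (68), and the ratio of the observed remainder to the bound of Lemma 22 should stay below one. The sketch below uses randomly sampled points \( {\boldsymbol{a}}\) , \( {\boldsymbol{x}}\in [0,1]^2\) .

```python
import numpy as np
from math import factorial

# f(x1, x2) = sin(x1)*sin(x2) on [0, 1]^2 with k = 2, s = 1, d = 2.  All partial
# derivatives are (up to sign) products of sines and cosines, so a crude upper
# bound for the norm (68) is 1 + sqrt(2): supremum part <= 1, Hoelder part <= sqrt(2).
k, s, d = 2, 1.0, 2
norm_bound = 1.0 + np.sqrt(2.0)

def f(x1, x2):
    return np.sin(x1) * np.sin(x2)

def taylor2(a1, a2, x1, x2):
    """Order-2 Taylor polynomial of f around (a1, a2)."""
    dx1, dx2 = x1 - a1, x2 - a2
    return (np.sin(a1) * np.sin(a2)
            + np.cos(a1) * np.sin(a2) * dx1 + np.sin(a1) * np.cos(a2) * dx2
            - 0.5 * np.sin(a1) * np.sin(a2) * dx1**2
            + np.cos(a1) * np.cos(a2) * dx1 * dx2
            - 0.5 * np.sin(a1) * np.sin(a2) * dx2**2)

rng = np.random.default_rng(2)
worst_ratio = 0.0
for _ in range(10000):
    a, x = rng.random(2), rng.random(2)
    h = np.max(np.abs(a - x))
    if h < 1e-6:
        continue
    remainder = abs(f(*x) - taylor2(*a, *x))
    bound = h**(k + s) * d**(k + 0.5) / factorial(k) * norm_bound
    worst_ratio = max(worst_ratio, remainder / bound)
print(f"max observed |R_k| / bound = {worst_ratio:.3f}  (should be <= 1)")
```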

We now come to the main statement of this section. Up to logarithmic terms, it shows the convergence rate \( (k+s)/{d}\) for approximating functions in \( C^{k,s}([0,1]^d)\) .

Theorem 14

Let \( d\in\mathbb{N}\) , \( k\in\mathbb{N}_0\) , \( s\in [0,1]\) , and \( \Omega=[0,1]^d\) . Then, there exists a constant \( C>0\) such that for every \( f\in C^{k,s}(\Omega)\) and every \( N\ge 2\) there exists a ReLU neural network \( \Phi_N^f\) such that

\[ \begin{align} \sup_{{\boldsymbol{x}}\in\Omega}|f({\boldsymbol{x}})-\Phi_N^f({\boldsymbol{x}})|\le C \| f \|_{C^{k,s}(\Omega)} N^{-\frac{k+s}{d}}, \end{align} \]

(71)

\( {\rm {size}}(\Phi_N^f)\le CN\log(N)\) and \( {\rm {depth}}(\Phi_N^f)\le C \log(N)\) .

Proof

The idea of the proof is to use the so-called “partition of unity method”: First we will construct a partition of unity \( (\varphi_{\boldsymbol{\nu}})_{{\boldsymbol{\nu}}}\) , such that for an appropriately chosen \( M \in \mathbb{N}\) each \( \varphi_{\boldsymbol{\nu}}\) has support on a \( O({1}/{M})\) neighborhood of a point \( {\boldsymbol{\eta}}\in\Omega\) . On each of these neighborhoods we will use the local Taylor polynomial \( p_{\boldsymbol{\nu}}\) of \( f\) around \( {\boldsymbol{\eta}}\) to approximate the function. Then \( \sum_{{\boldsymbol{\nu}}}\varphi_{\boldsymbol{\nu}} p_{\boldsymbol{\nu}}\) gives an approximation to \( f\) on \( \Omega\) . This approximation can be emulated by a neural network of the type \( \sum_{{\boldsymbol{\nu}}}{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}},\hat p_{\boldsymbol{\nu}})\) , where \( \hat p_{\boldsymbol{\nu}}\) is a neural network approximation to the polynomial \( p_{\boldsymbol{\nu}}\) .

It suffices to show the theorem in the case where

\[ \max\left\{\frac{d^{k+1/2}}{k!},\exp(d)\right\}\| f \|_{C^{k,s}(\Omega)}\le 1. \]

The general case can then be immediately deduced by a scaling argument.

Step 1. We construct the neural network. Define

\[ \begin{align} M\mathrm{:}= \lceil N^{1/d}\rceil~~\text{and}~~\varepsilon\mathrm{:}= N^{-\frac{k+s}{d}}. \end{align} \]

(72)

Consider a uniform simplicial mesh with nodes \( \{{{{\boldsymbol{\nu}}}/{M}}\,|\,{\boldsymbol{\nu}}\le M\}\) where \( {{{\boldsymbol{\nu}}}/{M}}\mathrm{:}= ({\nu_1}/{M},\dots,{\nu_d}/{M})\) , and where “\( {\boldsymbol{\nu}}\le M\) ” is short for \( \{{\boldsymbol{\nu}}\in\mathbb{N}_0^d\,|\,\nu_i\le M \text{ for all } i\le d\}\) . We denote by \( \varphi_{\boldsymbol{\nu}}\) the cpwl basis function on this mesh such that \( \varphi_{\boldsymbol{\nu}}({{{\boldsymbol{\nu}}}/{M}})=1\) and \( \varphi_{\boldsymbol{\nu}}({{{\boldsymbol{\mu}}}/{M}}) = 0\) whenever \( {\boldsymbol{\mu}}\neq{\boldsymbol{\nu}}\) . As shown in Chapter 6, \( \varphi_{\boldsymbol{\nu}}\) is a neural network of size \( O(1)\) . Then

\[ \begin{align} \sum_{{\boldsymbol{\nu}}\le M}\varphi_{\boldsymbol{\nu}} \equiv 1~~\text{on }\Omega, \end{align} \]

(73)

is a partition of unity. Moreover, observe that

\[ \begin{align} {\rm supp}(\varphi_{\boldsymbol{\nu}})\subseteq \left\{{\boldsymbol{x}}\in\Omega\, \middle|\,\left\| {\boldsymbol{x}}-{\frac{{\boldsymbol{\nu}}}{M}} \right\|_{\infty}\le \frac{1}{M}\right\}, \end{align} \]

(74)

where \( \| {\boldsymbol{x}} \|_{\infty}=\max_{i\le d}|x_i|\) .

For each \( {\boldsymbol{\nu}}\le M\) define the multivariate polynomial

\[ \begin{align*} p_{\boldsymbol{\nu}}({\boldsymbol{x}})\mathrm{:}= \sum_{|{\boldsymbol{\alpha}}|\le k} \frac{D^{\boldsymbol{\alpha}} f\left({\frac{{\boldsymbol{\nu}}}{M}}\right)}{{\boldsymbol{\alpha}} !}\left({\boldsymbol{x}}-{\frac{{\boldsymbol{\nu}}}{M}}\right)^{\boldsymbol{\alpha}}\in \mathcal{P}_k, \end{align*} \]

and the approximation

\[ \begin{align*} \hat p_{\boldsymbol{\nu}}({\boldsymbol{x}})\mathrm{:}= \sum_{|{\boldsymbol{\alpha}}|\le k} \frac{D^{\boldsymbol{\alpha}} f\left({\frac{{\boldsymbol{\nu}}}{M}}\right)}{{\boldsymbol{\alpha}} !} {\Phi^{\times}_{|{\boldsymbol{\alpha}}|,\varepsilon}}\left(x_{i_{{\boldsymbol{\alpha}},1}}-\frac{\nu_{i_{{\boldsymbol{\alpha}},1}}}{M},\dots,x_{i_{{\boldsymbol{\alpha}},k}}-\frac{\nu_{i_{{\boldsymbol{\alpha}},k}}}{M}\right), \end{align*} \]

where \( (i_{{\boldsymbol{\alpha}},1},\dots,i_{{\boldsymbol{\alpha}},k})\in \{0,\dots,d\}^k\) is arbitrary but fixed such that \( |\{j\,|\,i_{{\boldsymbol{\alpha}},j}=r\}|=\alpha_r\) for all \( r=1,\dots,d\) . Finally, define

\[ \begin{align} \Phi_N^f\mathrm{:}= \sum_{{\boldsymbol{\nu}}\le M}{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}},\hat p_{\boldsymbol{\nu}}). \end{align} \]

(75)

Step 2. We bound the approximation error. First, for each \( {\boldsymbol{x}}\in\Omega\) , using (73) and (74)

\[ \begin{align*} \left|f({\boldsymbol{x}})-\sum_{{\boldsymbol{\nu}}\le M}\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})p_{\boldsymbol{\nu}}({\boldsymbol{x}})\right| &\le \sum_{{\boldsymbol{\nu}}\le M}|\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})| |p_{\boldsymbol{\nu}}({\boldsymbol{x}})-f({\boldsymbol{x}})|\nonumber\\ &\le \max_{{\boldsymbol{\nu}}\le M} \sup_{\{{\boldsymbol{y}}\in\Omega\,|\,\| {\frac{{\boldsymbol{\nu}}}{M}}-{\boldsymbol{y}} \|_{\infty}\le \frac{1}{M}\}} |f({\boldsymbol{y}})-p_{\boldsymbol{\nu}}({\boldsymbol{y}})|. \end{align*} \]

By Lemma 22 we obtain

\[ \begin{align} \sup_{{\boldsymbol{x}}\in\Omega}\left|f({\boldsymbol{x}})-\sum_{{\boldsymbol{\nu}}\le M}\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})p_{\boldsymbol{\nu}}({\boldsymbol{x}})\right| \le M^{-(k+s)}\frac{d^{k+\frac{1}{2}}}{k!}\| f \|_{C^{k,s}(\Omega)}\le M^{-(k+s)}. \end{align} \]

(76)

Next, fix \( {\boldsymbol{\nu}}\le M\) and \( {\boldsymbol{y}}\in\Omega\) such that \( \| {{{\boldsymbol{\nu}}}/{M}}-{\boldsymbol{y}} \|_{\infty}\le {1}/{M}\le 1\) . Then by Proposition 9

\[ \begin{align} \begin{split} |p_{\boldsymbol{\nu}}({\boldsymbol{y}})-\hat p_{\boldsymbol{\nu}}({\boldsymbol{y}})| &\le \sum_{|{\boldsymbol{\alpha}}|\le k} \left|\frac{D^{\boldsymbol{\alpha}} f\left({\frac{{\boldsymbol{\nu}}}{M}}\right)}{{\boldsymbol{\alpha}} !}\right| \left|\prod_{j=1}^k \left(y_{i_{{\boldsymbol{\alpha}},j}}-\frac{\nu_{i_{{\boldsymbol{\alpha}},j}}}{M}\right) -{\Phi^{\times}_{|{\boldsymbol{\alpha}}|,\varepsilon}}\left(y_{i_{{\boldsymbol{\alpha}},1}}-\frac{\nu_{i_{{\boldsymbol{\alpha}},1}}}{M},\dots,y_{i_{{\boldsymbol{\alpha}},k}}-\frac{\nu_{i_{{\boldsymbol{\alpha}},k}}}{M}\right)\right| \\ &\le \varepsilon \sum_{|{\boldsymbol{\alpha}}|\le k} \frac{\left|D^{\boldsymbol{\alpha}} f({\frac{{\boldsymbol{\nu}}}{M}})\right|}{{\boldsymbol{\alpha}} !} \le \varepsilon \exp(d)\| f \|_{C^{k,s}(\Omega)}\le \varepsilon, \end{split} \end{align} \]

(77)

where we used \( |D^{\boldsymbol{\alpha}} f({{{\boldsymbol{\nu}}}/{M}})|\le \| f \|_{C^{k,s}(\Omega)}\) and

\[ \begin{align*} \sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|\le k\}}\frac{1}{{\boldsymbol{\alpha}} !}= \sum_{j=0}^k\frac{1}{j!}\sum_{\{{\boldsymbol{\alpha}}\in\mathbb{N}_0^d\,|\,|{\boldsymbol{\alpha}}|=j\}}\frac{j!}{{\boldsymbol{\alpha}} !}= \sum_{j=0}^k\frac{d^j}{j!} \le \sum_{j=0}^\infty\frac{d^j}{j!} = \exp(d). \end{align*} \]

Similarly, one shows that

\[ \begin{align*} |\hat p_{\boldsymbol{\nu}}({\boldsymbol{x}})|\le \exp(d)\| f \|_{C^{k,s}(\Omega)}\le 1~~\text{ for all } {\boldsymbol{x}}\in\Omega. \end{align*} \]

Fix \( {\boldsymbol{x}}\in\Omega\) . Then \( {\boldsymbol{x}}\) belongs to a simplex of the mesh, and thus \( {\boldsymbol{x}}\) can be in the support of at most \( d+1\) (the number of nodes of a simplex) functions \( \varphi_{\boldsymbol{\nu}}\) . Moreover, Lemma 20 implies that \( {\rm supp}\,{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}}(\cdot),\hat p_{\boldsymbol{\nu}}(\cdot))\subseteq {\rm supp}\,\varphi_{\boldsymbol{\nu}}\) . Hence, using Lemma 20 and (77)

\[ \begin{align*} &\left|\sum_{{\boldsymbol{\nu}}\le M}\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})p_{\boldsymbol{\nu}}({\boldsymbol{x}})- \sum_{{\boldsymbol{\nu}}\le M}{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}}),\hat p_{\boldsymbol{\nu}}({\boldsymbol{x}})) \right| \nonumber\\ &~~\le \sum_{\{{\boldsymbol{\nu}}\le M\,|\,{\boldsymbol{x}}\in{\rm supp}\varphi_{\boldsymbol{\nu}}\}} \left(|\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})p_{\boldsymbol{\nu}}({\boldsymbol{x}})- \varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})\hat p_{\boldsymbol{\nu}}({\boldsymbol{x}})| \right.\\ &~~ ~~ \left. +|\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})\hat p_{\boldsymbol{\nu}}({\boldsymbol{x}})-{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}}),\hat p_{\boldsymbol{\nu}}({\boldsymbol{x}}))|\right)\nonumber\\ &~~\le \varepsilon + (d+1)\varepsilon=(d+2)\varepsilon. \end{align*} \]

In total, together with (76)

\[ \begin{align*} \sup_{{\boldsymbol{x}}\in\Omega}|f({\boldsymbol{x}})-\Phi_N^f({\boldsymbol{x}})|\le M^{-(k+s)}+\varepsilon \cdot (d+2). \end{align*} \]

With our choices in (72) this yields the error bound (71).

Step 3. It remains to bound the size and depth of the neural network in (75).

By Lemma 14, for each \( 0\le{\boldsymbol{\nu}}\le M\) we have

\[ \begin{align} {\rm size}(\varphi_{\boldsymbol{\nu}})\le C\cdot (1+k_\mathcal{T}),~~ {\rm depth}(\varphi_{\boldsymbol{\nu}})\le C\cdot (1+\log(k_\mathcal{T})), \end{align} \]

(80)

where \( k_\mathcal{T}\) is the maximal number of simplices attached to a node in the mesh. Note that \( k_\mathcal{T}\) is independent of \( M\) , so that the size and depth of \( \varphi_{\boldsymbol{\nu}}\) are bounded by a constant \( C_\varphi\) independent of \( M\) .

Lemma 20 and Proposition 9 thus imply with our choice of \( \varepsilon=N^{-(k+s)/d}\)

\[ \begin{align*} {\rm depth}(\Phi_N^f)&={\rm depth}({\Phi^{\times}_{\varepsilon}})+\max_{{\boldsymbol{\nu}}\le M}{\rm depth}(\varphi_{\boldsymbol{\nu}})+\max_{{\boldsymbol{\nu}}\le M}{\rm depth}(\hat p_{\boldsymbol{\nu}})\nonumber\\ &\le C\cdot(1+|\log(\varepsilon)|+C_\varphi)+{\rm depth}({\Phi^{\times}_{k,\varepsilon}})\nonumber\\ &\le C\cdot(1+|\log(\varepsilon)|+C_\varphi)\nonumber\\ &\le C\cdot(1+\log(N)) \end{align*} \]

for some constant \( C>0\) depending on \( k\) and \( d\) (we use “\( C\) ” to denote a generic constant that can change its value in each line).

To bound the size, we first observe with Lemma 9 that

\[ \begin{align*} {\rm size}(\hat p_{\boldsymbol{\nu}})\le C\cdot \left(1+\sum_{|{\boldsymbol{\alpha}}|\le k}{\rm size}\left({\Phi^{\times}_{|{\boldsymbol{\alpha}}|,\varepsilon}}\right)\right)\le C\cdot (1+|\log(\varepsilon)|) \end{align*} \]

for some \( C\) depending on \( k\) . Thus, for the size of \( \Phi_N^f\) we obtain with \( M=\lceil N^{1/d}\rceil\)

\[ \begin{align*} {\rm size}(\Phi_N^f)&\le C\cdot \left(1+\sum_{{\boldsymbol{\nu}}\le M}\left({\rm size}({\Phi^{\times}_{\varepsilon}})+{\rm size}(\varphi_{\boldsymbol{\nu}})+{\rm size}(\hat p_{\boldsymbol{\nu}})\right)\right)\nonumber\\ &\le C\cdot (1+M)^d(1+|\log(\varepsilon)|+C_\varphi)\nonumber\\ &\le C\cdot (1+N^{1/d})^d(1+C_\varphi+\log(N))\nonumber\\ &\le CN\log(N), \end{align*} \]

which concludes the proof.
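The partition-of-unity method of the preceding proof can be illustrated in one dimension, where the cpwl basis functions are simply hat functions. In the following sketch, exact multiplication stands in for \( {\Phi^{\times}_{\varepsilon}}\) and the local Taylor polynomials are of order \( k=2\) ; the printed errors should decay at the rate \( M^{-(k+s)}=M^{-3}\) , up to a constant.

```python
import numpy as np

# One-dimensional sketch (d = 1, k = 2, s = 1): hat functions phi_nu on a uniform
# grid form the partition of unity, p_nu is the local order-2 Taylor polynomial of
# f around nu/M, and exact multiplication stands in for Phi^x_eps.
f   = lambda x: np.sin(3 * x)
df  = lambda x: 3 * np.cos(3 * x)
d2f = lambda x: -9 * np.sin(3 * x)

def phi(x, nu, M):
    """cpwl basis function with phi(nu/M) = 1 and support of width 2/M."""
    return np.clip(1 - M * np.abs(x - nu / M), 0.0, None)

def pou_approx(x, M):
    out = np.zeros_like(x)
    for nu in range(M + 1):
        a = nu / M
        p_nu = f(a) + df(a) * (x - a) + 0.5 * d2f(a) * (x - a) ** 2
        out += phi(x, nu, M) * p_nu
    return out

xs = np.linspace(0, 1, 2001)
for M in (4, 8, 16, 32, 64):
    err = np.max(np.abs(f(xs) - pou_approx(xs, M)))
    print(f"M = {M:3d}   sup error = {err:.3e}   M^-3 = {M**-3.0:.3e}")
```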

Theorem 14 is similar in spirit to [1, Section 3.2]; the main differences are that [1] considers the class \( C^k([0,1]^d)\) instead of \( C^{k,s}([0,1]^d)\) , and uses an approximate partition of unity, while we use the exact partition of unity constructed in Chapter 6. Up to logarithmic terms, the theorem shows the convergence rate \( (k+s)/{d}\) . As long as \( k\) is large, in principle we can achieve arbitrarily large (and \( d\) -independent if \( k\ge d\) ) convergence rates. In contrast to Theorem 9, achieving error \( N^{-\frac{k+s}{d}}\) requires depth \( O(\log(N))\) , i.e. the neural network depth is required to increase. This can be avoided however, and networks of depth \( O(k/d)\) suffice to attain these convergence rates [4].

Remark 8

Let \( L:{\boldsymbol{x}}\mapsto{\boldsymbol{A}}{\boldsymbol{x}}+{\boldsymbol{b}}:\mathbb{R}^d\to\mathbb{R}^d\) be a bijective affine transformation and set \( \Omega\mathrm{:}= L([0,1]^d)\subseteq\mathbb{R}^d\) . Then for a function \( f\in C^{k,s}(\Omega)\) , by Theorem 14 there exists a neural network \( \Phi_N^f\) such that

\[ \begin{align*} \sup_{{\boldsymbol{x}}\in\Omega}|f({\boldsymbol{x}})-\Phi_N^f(L^{-1}({\boldsymbol{x}}))| &=\sup_{{\boldsymbol{x}}\in [0,1]^d}|f(L({\boldsymbol{x}}))-\Phi_N^f({\boldsymbol{x}})|\\ &\le C \| f\circ L \|_{C^{k,s}({[0,1]^d})} N^{-\frac{k+s}{d}}. \end{align*} \]

Since for \( {\boldsymbol{x}}\in [0,1]^d\) it holds that \( |f(L({\boldsymbol{x}}))|\le \sup_{{\boldsymbol{y}}\in\Omega}|f({\boldsymbol{y}})|\) , and since for every multiindex \( \boldsymbol{0}\neq {\boldsymbol{\alpha}}\in\mathbb{N}_0^d\) we have \( |D^{\boldsymbol{\alpha}} (f\circ L)({\boldsymbol{x}})|\le \| {\boldsymbol{A}} \|_{}^{|{\boldsymbol{\alpha}}|}\sup_{{\boldsymbol{y}}\in \Omega}|D^{\boldsymbol{\alpha}} f({\boldsymbol{y}})|\) , it follows that \( \| f\circ L \|_{C^{k,s}({[0,1]^d})}\le (1+\| {\boldsymbol{A}} \|_{}^{k+s})\| f \|_{C^{k,s}(\Omega)}\) . Thus the convergence rate \( N^{-\frac{k+s}{d}}\) is achieved on every set of the type \( L([0,1]^d)\) for an affine map \( L\) , and in particular on every hypercube \( \times_{j=1}^d[a_j,b_j]\) .

Bibliography and further reading

This chapter is based on the seminal 2017 paper by Yarotsky [1], where the construction of approximating the square function, the multiplication, and polynomials (discussed in Sections 8.1, 8.2, 8.3) was first introduced and analyzed. The construction relies on the sawtooth function discussed in Section 7.2 and originally constructed by Telgarsky in [2]. Similar results were obtained around the same time by Liang and Srikant via a bit extraction technique using both the ReLU and the Heaviside function as activation functions [9]. These works have since sparked a large body of research, as they allow to lift polynomial approximation theory to neural network classes. Convergence results based on this type of argument include for example [4, 12, 13, 10, 11]. We also refer to [14] for related results on rational approximation.

The depth separation result in Section 8.3 is based on the exponential convergence rates obtained for analytic functions in [10, 11], also see [5, Lemma III.7]. For the approximation of polynomials with ReLU neural networks stated in Lemma 21, see, e.g., [4, 5, 6], and also [15, 16] for constructions based on Chebyshev polynomials, which can be more efficient. For further depth separation results, we refer to [2, 7, 17, 18, 19]. Moreover, closely related to such statements is the 1987 thesis by Håstad [15], which considers the limitations of logic circuits in terms of depth.

The approximation result derived in Section 8.4 for \( C^{k,s}\) functions follows by standard approximation theory for piecewise polynomial functions, and is similar as in [1]. We point out that such statements can also be shown for other activation functions than ReLU; see in particular the works of Mhaskar [20, 21] and Section 6 in Pinkus’ Acta Numerica article [22] for sigmoidal and smooth activations. Additionally, the more recent paper [23] specifically addresses the hyperbolic tangent activation. Finally, [24] studies general activation functions that allow for the construction of approximate partitions of unity.

Exercises

Exercise 23

We show another type of depth separation result: Let \( d\ge 2\) . Prove that there exist ReLU NNs \( \Phi:\mathbb{R}^d\to\mathbb{R}\) of depth two, which cannot be represented exactly by ReLU NNs \( \Phi:\mathbb{R}^d\to\mathbb{R}\) of depth one.

Hint: Show that nonzero ReLU NNs of depth one necessarily have unbounded support.

Exercise 24

Prove Lemma 21.

Hint: Proceed by induction over the iteration depth in Figure 21.

Exercise 25

Show Proposition 10 in the general case where the Taylor series of \( f\) only converges locally (see proof of Proposition 10).

Hint: Use the partition of unity method from the proof of Theorem 14.

9 High-dimensional approximation

In the previous chapters we established convergence rates for the approximation of a function \( f:[0,1]^d\to\mathbb{R}\) by a neural network. For example, Theorem 14 provides the error bound \( \mathcal{O}(N^{-(k+s)/d})\) in terms of the network size \( N\) (up to logarithmic terms), where \( k\) and \( s\) describe the smoothness of \( f\) . Achieving an accuracy of \( \varepsilon>0\) , therefore, necessitates a network size \( N=O(\varepsilon^{-d/(k+s)})\) (according to this bound). Hence, the size of the network needs to increase exponentially in \( d\) . This exponential dependence on the dimension \( d\) is referred to as the curse of dimensionality [102]. For classical smoothness spaces, such exponential \( d\) dependence cannot be avoided [102, 103, 104]. However, functions \( f\) that are of interest in practice may have additional properties, which allow for better convergence rates.

In this chapter, we discuss three scenarios under which the curse of dimensionality can be mitigated. First, we examine an assumption limiting the behavior of functions in their Fourier domain. This assumption allows for slow but dimension-independent approximation rates. Second, we consider functions with a specific compositional structure. Concretely, these functions are constructed by compositions and linear combinations of simple low-dimensional subfunctions. In this case, the curse of dimension is present, but only through the input dimension of the subfunctions. Finally, we study the situation where we still approximate high-dimensional functions, but only care about the approximation accuracy on a lower-dimensional submanifold. Here, the approximation rate is governed by the smoothness and the dimension of the manifold.

9.1 The Barron class

In [105], Barron introduced a set of functions that can be approximated by neural networks without a curse of dimensionality. This set, known as the Barron class, is characterized by a specific type of bounded variation. To define it, for \( g\in L^1(\mathbb{R}^d)\) we denote by

\[ \begin{align*} \check g({\boldsymbol{w}})\mathrm{:}= \int_{\mathbb{R}^d} g({\boldsymbol{x}}) e^{\rm{i} {\boldsymbol{w}}^\top {\boldsymbol{x}}} \,\mathrm{d} {\boldsymbol{x}} \end{align*} \]

its inverse Fourier transform. Then, for \( C>0\) the Barron class is defined as

\[ \Gamma_{C} \mathrm{:=} \left\{f\in C(\mathbb{R}^d)\, \middle|\,\exists g \in L^1(\mathbb{R}^d) ,~\int_{\mathbb{R}^d} |{\boldsymbol{\xi}}| |g({\boldsymbol{\xi}})| \,\mathrm{d}{\boldsymbol{\xi}} \leq C \text{ and } f = \check{g}\right\}. \]

We say that a function \( f \in \Gamma_C\) has a finite Fourier moment, even though technically the Fourier transform of \( f\) may not be well-defined, since \( f\) does not need to be integrable. By the Riemann-Lebesgue Lemma, [106, Lemma 1.1.1], the condition \( f\in C(\mathbb{R}^d)\) in the definition of \( \Gamma_{C}\) is automatically satisfied if \( g \in L^1(\mathbb{R}^d)\) as in the definition exists.

The following approximation result for functions in \( \Gamma_C\) is due to [105]. The presentation of the proof is similar to [107, Section 5].

Theorem 15

Let \( \sigma:\mathbb{R}\to\mathbb{R}\) be sigmoidal (see Definition 6) and let \( f\in \Gamma_C\) for some \( C>0\) . Denote by \( B_1^d\mathrm{:}= \{{\boldsymbol{x}}\in\mathbb{R}^d\,|\,\| {\boldsymbol{x}} \|_{}\le 1\}\) the unit ball. Then, for every \( c > 4C^2\) and every \( N\in\mathbb{N}\) there exists a neural network \( \Phi^f\) with architecture \( (\sigma; d, N, 1)\) such that

\[ \begin{align} \frac{1}{|B_1^d|}\int_{B_1^d} \left|f({\boldsymbol{x}})- \Phi^f({\boldsymbol{x}})\right|^2 \,\mathrm{d}{\boldsymbol{x}} \leq \frac{c}{N}, \end{align} \]

(73)

where \( |B_1^d|\) is the Lebesgue measure of \( B_1^d\) .

Remark 9

The approximation rate in (73) can be slightly improved under some assumptions on the activation function such as powers of the ReLU, [108].

Importantly, the dimension \( d\) does not enter on the right-hand side of (73); in particular, the convergence rate is not directly affected by the dimension, which is in stark contrast to the results of the previous chapters. However, it should be noted that the constant \( C\) may still have some inherent \( d\) -dependence, see Exercise 29.
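For a concrete impression of this \( d\) -dependence, consider the Gaussian \( f({\boldsymbol{x}})=\exp(-\| {\boldsymbol{x}} \|^2/2)\) . With the convention for \( \check g\) introduced above, one finds \( g({\boldsymbol{\xi}})=(2\pi)^{-d/2}\exp(-\| {\boldsymbol{\xi}} \|^2/2)\) , so the Fourier moment equals the expected Euclidean norm of a \( d\) -dimensional standard Gaussian vector and grows like \( \sqrt d\) . The short computation below illustrates this; it is an illustration only and not claimed to reproduce Exercise 29.

```python
from math import exp, lgamma, sqrt

# Barron constant of the Gaussian f(x) = exp(-|x|^2/2): with the convention
# f = g-check used above, g(xi) = (2*pi)^(-d/2) * exp(-|xi|^2/2), and the Fourier
# moment is the expected Euclidean norm of a standard Gaussian vector in R^d,
#   C_f = sqrt(2) * Gamma((d+1)/2) / Gamma(d/2)  ~  sqrt(d).
for d in (1, 2, 5, 10, 100, 1000):
    C_f = sqrt(2.0) * exp(lgamma((d + 1) / 2) - lgamma(d / 2))
    print(f"d = {d:5d}   C_f = {C_f:9.3f}   sqrt(d) = {sqrt(d):9.3f}")
```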

The proof of Theorem 15 is based on a peculiar property of high-dimensional convex sets, which is described by the (approximate) Carathéodory theorem, the original version of which was given in [1]. The more general version stated in the following lemma follows [2, Theorem 0.0.2] and [3, 4]. For its statement recall that \( \overline{\rm{ co}}(G)\) denotes the closure of the convex hull of \( G\) .

Lemma 23

Let \( H\) be a Hilbert space, and let \( G\subseteq H\) be such that for some \( B>0\) it holds that \( \| g \|_{H}\le B\) for all \( g\in G\) . Let \( f \in \overline{\mathrm{co}}(G)\) . Then, for every \( N \in \mathbb{N}\) and every \( c> B^2\) there exist \( (g_i)_{i=1}^N \subseteq G\) such that

\[ \begin{align} \left\|f - \frac{1}{N}\sum_{i=1}^N g_i\right\|_{H}^2 \leq \frac{c}{N}. \end{align} \]

(74)

Proof

Fix \( \varepsilon>0\) and \( N\in\mathbb{N}\) . Since \( f\in\overline{\rm{ co}}(G)\) , there exist coefficients \( \alpha_1,\dots,\alpha_m\in [0,1]\) summing to \( 1\) , and linearly independent elements \( h_1,\dots,h_m\in G\) such that

\[ f^*\mathrm{:}= \sum_{j=1}^m\alpha_jh_j \]

satisfies \( \| f-f^* \|_{H}<\varepsilon\) . We claim that there exist \( g_1,\dots,g_N\) , each in \( \{h_1,\dots,h_m\}\) , such that

\[ \begin{equation} \left\| f^*-\frac{1}{N}\sum_{j=1}^Ng_j \right\|_{H}^2\le\frac{B^2}{N}. \end{equation} \]

(75)

Since \( \varepsilon>0\) was arbitrary, this then concludes the proof. Since there exists an isometric isomorphism from \( \rm{ span}\{h_1,\dots,h_m\}\) to \( \mathbb{R}^m\) , there is no loss of generality in assuming \( H=\mathbb{R}^m\) in the following.

Let \( X_i\) , \( i=1,\dots,N\) , be i.i.d. \( \mathbb{R}^m\) -valued random variables with

\[ \mathbb{P}[X_i=h_j] = \alpha_j~~\text{for all }i=1,\dots,N~\text{ and }~j=1,\dots,m. \]

In particular \( \mathbb{E}[X_i]=\sum_{j=1}^m\alpha_jh_j=f^*\) for each \( i\) . Moreover,

\[ \begin{align} \begin{split} \mathbb{E}\left[\left\| f^*-\frac{1}{N}\sum_{j=1}^N X_j \right\|_{H}^2\right] &= \mathbb{E}\left[\left\| \frac{1}{N}\sum_{j=1}^N (f^*-X_j) \right\|_{H}^2\right] \\ &= \frac{1}{N^2}\mathbb{E}\Bigg[\sum_{j=1}^N\| f^*-X_j \|_{H}^2+\sum_{i\neq j}\left\langle f^*-X_i, f^*-X_j\right\rangle_{H} \Bigg]\\ &= \frac{1}{N} \mathbb{E}[\| f^*-X_1 \|_{H}^2]\\ &=\frac{1}{N}\mathbb{E}[\| f^* \|_{H}^2-2\left\langle f^*, X_1\right\rangle_{H}+\| X_1 \|_{H}^2]\\ &=\frac{1}{N}\mathbb{E}[\| X_1 \|_{H}^2-\| f^* \|_{H}^2] \le \frac{B^2}{N}. \end{split} \end{align} \]

(76)

Here we used that the \( (X_i)_{i=1}^N\) are i.i.d., the fact that \( \mathbb{E}[X_i]=f^*\) , as well as \( \mathbb{E}{\left\langle f^*-X_i, f^*-X_j\right\rangle_{}}=0\) if \( i\neq j\) . Since the expectation in (76) is bounded by \( B^2/N\) , there must exist at least one realization of the random variables \( X_i\in\{h_1,\dots,h_m\}\) , denoted as \( g_i\) , for which (75) holds.

Lemma 23 provides a powerful tool: If we want to approximate a function \( f\) with a superposition of \( N\) elements in a set \( G\) , then it is sufficient to show that \( f\) can be represented as an arbitrary (infinite) convex combination of elements of \( G\) .
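The probabilistic argument in the proof of Lemma 23 (sometimes referred to as Maurey's sampling argument) is easy to reproduce empirically: sampling \( N\) elements i.i.d. according to the convex weights and averaging gives a mean squared error of at most \( B^2/N\) . The following sketch performs this experiment in \( H=\mathbb{R}^m\) with points on the sphere of radius \( B\) .

```python
import numpy as np

# Empirical check of Lemma 23 in H = R^m: f* is a convex combination of points
# g_j with |g_j| <= B; drawing N of them i.i.d. with probabilities alpha_j and
# averaging gives mean squared error at most B^2 / N.
rng = np.random.default_rng(3)
m, n_points, B = 50, 200, 1.0
G = rng.normal(size=(n_points, m))
G = B * G / np.linalg.norm(G, axis=1, keepdims=True)   # points on the sphere of radius B
alpha = rng.random(n_points)
alpha /= alpha.sum()                                    # convex weights
f_star = alpha @ G

for N in (10, 100, 1000):
    sq_errs = []
    for _ in range(500):
        idx = rng.choice(n_points, size=N, p=alpha)
        sq_errs.append(np.sum((f_star - G[idx].mean(axis=0)) ** 2))
    print(f"N = {N:5d}   mean squared error = {np.mean(sq_errs):.4e}   B^2/N = {B**2 / N:.4e}")
```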

Lemma 23 suggests that we can prove Theorem 15 by showing that each function in \( \Gamma_C\) belongs to the closure of the convex hull of all neural networks with a single neuron, i.e. the set of all affine transforms of the sigmoidal activation function \( \sigma\) . We make a small detour before proving this result. We first show that each function \( f \in \Gamma_C\) is in the closure of the convex hull of the set of affine transforms of Heaviside functions, i.e. the set

\[ G_C \mathrm{:=} \left\{B_1^d \ni {\boldsymbol{x}} \mapsto \gamma \cdot \boldsymbol{1}_{\mathbb{R}_+}( \langle {\boldsymbol{a}}, {\boldsymbol{x}} \rangle + b)\, \middle|\,{\boldsymbol{a}} \in \mathbb{R}^d, b \in \mathbb{R}, |\gamma| \leq 2C\right\}. \]

The following lemma, corresponding to [105, Theorem 2] and [107, Lemma 5.12], provides a link between \( \Gamma_C\) and \( G_C\) .

Lemma 24

Let \( d\in \mathbb{N}\) , \( C >0\) and \( f \in \Gamma_C\) . Then \( f|_{B_1^d}-f(\boldsymbol{0}) \in \overline{\mathrm{co}}(G_C)\) , where the closure is taken with respect to the norm

\[ \begin{align} \|g\|_{{L}^{2, \diamond}(B_1^d)} \mathrm{:=} \left(\frac{1}{|B_1^d|} \int_{B_1^d} |g({\boldsymbol{x}})|^2 \,\mathrm{d}{\boldsymbol{x}}\right)^{1/2}. \end{align} \]

(77)

Proof

Step 1. We express \( f({\boldsymbol{x}})\) via an integral.

Since \( f \in \Gamma_C\) , there exists \( g \in L^1(\mathbb{R}^d)\) such that for all \( {\boldsymbol{x}} \in \mathbb{R}^d\)

\[ \begin{align} \begin{split} f({\boldsymbol{x}}) - f(\boldsymbol{0})&= \int_{\mathbb{R}^d} g({\boldsymbol{\xi}}) \left(e^{\rm{i} \langle {\boldsymbol{x}}, {\boldsymbol{\xi}}\rangle} - 1\right)\,\mathrm{d}{\boldsymbol{\xi}}\\ &= \int_{\mathbb{R}^d} \left|g({\boldsymbol{\xi}})\right| \left(e^{\rm{i} (\langle {\boldsymbol{x}}, {\boldsymbol{\xi}}\rangle + \kappa({\boldsymbol{\xi}}))} - e^{\rm{i}\kappa({\boldsymbol{\xi}})}\right)\,\mathrm{d}{\boldsymbol{\xi}}\\ &= \int_{\mathbb{R}^d} \left|g({\boldsymbol{\xi}})\right| \big(\cos(\langle {\boldsymbol{x}}, {\boldsymbol{\xi}}\rangle + \kappa({\boldsymbol{\xi}})) - \cos(\kappa({\boldsymbol{\xi}}))\big) \,\mathrm{d}{\boldsymbol{\xi}}, \end{split} \end{align} \]

(78)

where \( \kappa({\boldsymbol{\xi}})\) is the phase of \( g({\boldsymbol{\xi}})\) , i.e. \( g({\boldsymbol{\xi}})=|g({\boldsymbol{\xi}})|e^{\rm{i} \kappa({\boldsymbol{\xi}})}\) , and the last equality follows since \( f\) is real-valued. Define a measure \( \mu\) on \( \mathbb{R}^d\) via its Lebesgue density

\[ \,\mathrm{d}\mu({\boldsymbol{\xi}}) \mathrm{:=} \frac{1}{C'}|{\boldsymbol{\xi}}| |g({\boldsymbol{\xi}})| \,\mathrm{d}{\boldsymbol{\xi}}, \]

where \( C'\mathrm{:=}\int|{\boldsymbol{\xi}}||g({\boldsymbol{\xi}})|\,\mathrm{d}{\boldsymbol{\xi}}\leq C\) ; this is possible since \( f \in \Gamma_C\) . Then (78) leads to

\[ \begin{align} f({\boldsymbol{x}}) - f(\boldsymbol{0}) = C' \int_{\mathbb{R}^d} \frac{\cos(\langle {\boldsymbol{x}}, {\boldsymbol{\xi}} \rangle + \kappa({\boldsymbol{\xi}})) - \cos(\kappa({\boldsymbol{\xi}}))}{|{\boldsymbol{\xi}}|} \,\mathrm{d}\mu({\boldsymbol{\xi}}). \end{align} \]

(79)

Step 2. We show that \( {\boldsymbol{x}}\mapsto f({\boldsymbol{x}})-f(\boldsymbol{0})\) is in the \( L^{2, \diamond}(B_1^d)\) closure of convex combinations of the functions \( {\boldsymbol{x}}\mapsto q_{\boldsymbol{x}}({\boldsymbol{\theta}})\) , where \( {\boldsymbol{\theta}} \in \mathbb{R}^d\) , and

\[ \begin{equation} \begin{aligned} q_{\boldsymbol{x}}:&\mathbb{R}^d \to\mathbb{R}\\ &{\boldsymbol{\xi}} \mapsto C' \frac{\cos(\langle {\boldsymbol{x}}, {\boldsymbol{\xi}} \rangle+ \kappa({\boldsymbol{\xi}})) - \cos(\kappa({\boldsymbol{\xi}}))}{|{\boldsymbol{\xi}}|}. \end{aligned} \end{equation} \]

(80)

The cosine function is 1-Lipschitz. Hence, for \( {\boldsymbol{x}}\in B_1^d\) the map (80) is bounded in absolute value by \( C'\le C\) . In addition, it is easy to see that \( q_{\boldsymbol{x}}\) is well-defined and continuous even at the origin. Therefore, for \( {\boldsymbol{x}} \in B_1^d\) , the integral (79) can be approximated by a Riemann sum, i.e.,

\[ \begin{align} &\left|\int_{\mathbb{R}^d} q_{\boldsymbol{x}}({\boldsymbol{\xi}}) \,\mathrm{d}\mu({\boldsymbol{\xi}}) - \sum_{{\boldsymbol{\theta}} \in \frac{1}{n} \mathbb{Z}^d } q_{\boldsymbol{x}}({\boldsymbol{\theta}}) \cdot \mu(I_{\boldsymbol{\theta}})\right| \to 0~~\text{as }n\to\infty \end{align} \]

(81)

where \( I_{\boldsymbol{\theta}} \mathrm{:=} [0,1/n)^d + {\boldsymbol{\theta}}\) . Since \( {\boldsymbol{x}} \mapsto f({\boldsymbol{x}}) - f(\boldsymbol{0})\) is continuous and thus bounded on \( B_{1}^d\) , we have by the dominated convergence theorem that

\[ \begin{align} &\frac{1}{|B_1^d|}\int_{B_1^d} \left|f({\boldsymbol{x}}) - f(\boldsymbol{0})- \sum_{{\boldsymbol{\theta}} \in \frac{1}{n} \mathbb{Z}^d} q_{\boldsymbol{x}}({\boldsymbol{\theta}})\cdot \mu(I_{\boldsymbol{\theta}})\right|^2 \,\mathrm{d}{\boldsymbol{x}} \to 0. \end{align} \]

(82)

Since \( \sum_{{\boldsymbol{\theta}} \in \frac{1}{n} \mathbb{Z}^d} \mu(I_{\boldsymbol{\theta}}) = \mu(\mathbb{R}^d) = 1\) , the claim holds.

Step 3. We prove that \( {\boldsymbol{x}} \mapsto q_{\boldsymbol{x}}({\boldsymbol{\theta}})\) is in the \( L^{2, \diamond}(B_1^d)\) closure of convex combinations of \( G_C\) for every \( {\boldsymbol{\theta}}\in\mathbb{R}^d\) . Together with Step 2, this then concludes the proof.

Setting \( z = \langle {\boldsymbol{x}} , {\boldsymbol{\theta}} /|{\boldsymbol{\theta}}| \rangle\) , the result follows if the maps

\[ \begin{equation} \begin{aligned} h_{\boldsymbol{\theta}}:&[-1,1] \to\mathbb{R}\\ &z \mapsto C' \frac{\cos(|{\boldsymbol{\theta}}| z + \kappa({\boldsymbol{\theta}})) - \cos(\kappa({\boldsymbol{\theta}}))}{|{\boldsymbol{\theta}}|} \end{aligned} \end{equation} \]

(83)

can be approximated arbitrarily well by convex combinations of functions of the form

\[ \begin{align} [-1,1] \ni z \mapsto \gamma \boldsymbol{1}_{\mathbb{R}_+}\left( a' z + b'\right), \end{align} \]

(84)

where \( a'\) , \( b' \in \mathbb{R}\) and \( |\gamma| \leq 2C\) . To show this define for \( T \in \mathbb{N}\)

\[ \begin{align*} g_{T,+} &\mathrm{:=} \sum_{i=1}^T \frac{\left|h_{\boldsymbol{\theta}}\left(\frac{i}{T}\right) -h_{\boldsymbol{\theta}}\left(\frac{i-1}{T}\right)\right|}{2C} \left( 2C \mathrm{sign}\left( h_{\boldsymbol{\theta}}\left(\frac{i}{T}\right) - h_{\boldsymbol{\theta}}\left(\frac{i-1}{T}\right)\right) \boldsymbol{1}_{\mathbb{R}_+}\left(z - \frac{i}{T}\right)\right),\\ g_{T,-} &\mathrm{:=} \sum_{i=1}^T \frac{\left|h_{\boldsymbol{\theta}}\left( - \frac{i}{T} \right) - h_{\boldsymbol{\theta}}\left(\frac{1-i}{T}\right)\right|}{2C} \left( 2C \mathrm{sign}\left(h_{\boldsymbol{\theta}}\left(-\frac{i}{T}\right) - h_{\boldsymbol{\theta}}\left(\frac{1-i}{T}\right)\right) \boldsymbol{1}_{\mathbb{R}_+}\left(-z - \frac{i}{T}\right)\right). \end{align*} \]

By construction, \( g_{T,-} + g_{T,+}\) is a piecewise constant approximation to \( h_{\boldsymbol{\theta}}\) that interpolates \( h_{\boldsymbol{\theta}}\) at the points \( \pm i/T\) for \( i = 1, \dots, T\) . Since \( h_{\boldsymbol{\theta}}\) is continuous, we have that \( g_{T,-} + g_{T,+}\to h_{\boldsymbol{\theta}}\) uniformly as \( T\to \infty\) . Moreover, \( \| h_{\boldsymbol{\theta}}' \|_{L^\infty(\mathbb{R})} \leq C\) and hence

\[ \begin{align*} &\sum_{i=1}^T \frac{|h_{\boldsymbol{\theta}}(i/T) - h_{\boldsymbol{\theta}}((i-1)/T)|}{2C} + \sum_{i=1}^T \frac{|h_{\boldsymbol{\theta}}( - i/T ) - h_{\boldsymbol{\theta}}((1-i)/T)|}{2C}\\ & \leq \frac{2}{2C T} \sum_{i=1}^T \| h_{\boldsymbol{\theta}}' \|_{L^\infty(\mathbb{R})} \leq 1, \end{align*} \]

where we used \( C'\le C\) for the last inequality. We conclude that \( g_{T,-} + g_{T,+}\) is a convex combination of functions of the form (84). Hence, \( h_{\boldsymbol{\theta}}\) can be arbitrarily well approximated by convex combinations of the form (84). This concludes the proof.
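The step-function approximation \( g_{T,-}+g_{T,+}\) can be visualized directly. The sketch below places the Heaviside jumps at the points \( \pm i/T\) , checks that the total weight (after dividing by \( 2C\) ) stays below one, and confirms the uniform convergence to \( h_{\boldsymbol{\theta}}\) as \( T\to\infty\) for one arbitrary choice of \( |{\boldsymbol{\theta}}|\) , \( \kappa\) and \( C'\le C\) .

```python
import numpy as np

# Approximating h_theta on [-1, 1] by a (sub)convex combination of scaled
# Heaviside functions gamma * 1_{R+}(a'z + b') with |gamma| <= 2C.
C_prime, C = 1.3, 1.5                 # any values with C' <= C
abs_theta, kappa = 4.0, 0.7           # |theta| and the phase kappa(theta)
h = lambda z: C_prime * (np.cos(abs_theta * z + kappa) - np.cos(kappa)) / abs_theta

def step_approx(z, T):
    """g_{T,+} + g_{T,-}: piecewise constant, with jumps at the points +-i/T."""
    out = np.zeros_like(z)
    total_weight = 0.0
    for i in range(1, T + 1):
        for jump, sgn in ((h(i / T) - h((i - 1) / T), +1),      # positive side
                          (h(-i / T) - h((1 - i) / T), -1)):    # negative side
            out += jump * (sgn * z >= i / T)                    # 1_{R+}(+-z - i/T)
            total_weight += abs(jump) / (2 * C)
    assert total_weight <= 1.0 + 1e-12                          # weights stay admissible
    return out

zs = np.linspace(-1, 1, 4001)
for T in (10, 100, 1000):
    err = np.max(np.abs(h(zs) - step_approx(zs, T)))
    print(f"T = {T:5d}   sup |h - (g_T,- + g_T,+)| = {err:.3e}")
```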

We now have all tools to complete the proof of Theorem 15.

Proof (of Theorem 15)

Let \( f \in \Gamma_C\) . By Lemma 24

\[ f|_{B_1^d} - f(\boldsymbol{0}) \in \overline{\mathrm{co}}(G_C), \]

where the closure is understood with respect to the norm (77). It is not hard to see that for every \( g \in G_C\) it holds that \( \|g\|_{L^{2,\diamond}(B_1^d)}\leq 2C\) . Applying Lemma 23 with the Hilbert space \( L^{2, \diamond}(B_1^d)\) , we get that for every \( N \in \mathbb{N}\) there exist \( |\gamma_i|\leq 2C\) , \( {\boldsymbol{a}}_i\in \mathbb{R}^d\) , \( b_i \in \mathbb{R}\) , for \( i = 1, \dots, N\) , so that

\[ \frac{1}{|B_1^d|} \int_{B_1^d} \left| f({\boldsymbol{x}}) - f(\boldsymbol{0}) - \frac{1}{N} \sum_{i=1}^N \gamma_i\boldsymbol{1}_{\mathbb{R}_+}(\langle {\boldsymbol{a}}_i, {\boldsymbol{x}}\rangle +b_i)\right|^2 \,\mathrm{d}{\boldsymbol{x}} \leq \frac{4C^2}{N}. \]

By Exercise 7, it holds that \( \sigma(\lambda \cdot) \to \boldsymbol{1}_{\mathbb{R}_+}\) for \( \lambda \to \infty\) almost everywhere. Thus, for every \( \delta>0\) there exist \( \tilde{{\boldsymbol{a}}}_i\) , \( \tilde{b}_i\) , \( i = 1, \dots,N\) , so that

\[ \frac{1}{|B_1^d|} \int_{B_1^d} \left| f({\boldsymbol{x}}) - f(\boldsymbol{0}) - \frac{1}{N} \sum_{i=1}^N \gamma_i\sigma\left(\langle \tilde{{\boldsymbol{a}}}_i, {\boldsymbol{x}}\rangle + \tilde{b}_i\right)\right|^2 \,\mathrm{d}{\boldsymbol{x}} \leq \frac{4C^2}{N} + \delta. \]

Since \( c>4C^2\) , choosing \( \delta\le (c-4C^2)/N\) gives the bound \( c/N\) in (73). The result follows by observing that

\[ \frac{1}{N}\sum_{i=1}^N \gamma_i\sigma\left(\langle \tilde{{\boldsymbol{a}}}_i, {\boldsymbol{x}}\rangle + \tilde{b}_i\right) + f(\boldsymbol{0}) \]

is a neural network with architecture \( (\sigma; d,N,1)\) .

The dimension-independent approximation rate of Theorem 15 may seem surprising, especially when comparing to the results in Chapters 5 and 6. However, this can be explained by recognizing that the assumption of a finite Fourier moment is effectively a dimension-dependent regularity assumption. Indeed, the condition becomes more restrictive in higher dimensions and hence the complexity of \( \Gamma_C\) does not grow with the dimension.

To further explain this, let us relate the Barron class to classical function spaces. In [105, Section II] it was observed that a sufficient condition for membership in the Barron class is that all derivatives of order up to \( \lfloor d/2 \rfloor +2\) are square-integrable. In other words, if \( f\) belongs to the Sobolev space \( H^{\lfloor d/2 \rfloor +2}(\mathbb{R}^d)\) , then \( f\) is a Barron function. Importantly, the functions must become smoother as the dimension increases. This assumption would also imply an approximation rate of \( N^{-1/2}\) in the \( L^2\) norm by sums of at most \( N\) B-splines, see [53, 103]. However, in such estimates some constants may still depend exponentially on \( d\) , whereas all constants in Theorem 15 are controlled independently of \( d\) .

Another notable aspect of the approximation of Barron functions is that the absolute values of the weights other than the output weights are not bounded by a constant. To see this, we refer to (81), where arbitrarily large \( {\boldsymbol{\theta}}\) need to be used. While \( \Gamma_C\) is a compact set, the set of neural networks of the specified architecture for a fixed \( N\in \mathbb{N}\) is not parameterized with a compact parameter set. In a certain sense, this is reminiscent of Proposition 4 and Theorem 3, where arbitrarily strong approximation rates were achieved by using a very complex activation function and a non-compact parameter space.

9.2 Functions with compositionality structure

As a next instance of a class of functions for which the curse of dimensionality can be overcome, we study functions with compositional structure. In words, this means that we study high-dimensional functions that are constructed by composing many low-dimensional functions. This point of view was proposed in [112]. Note that this can be a realistic assumption in many cases, such as for sensor networks, where local information is first aggregated in smaller clusters of sensors before some information is sent to a processing unit for further evaluation.

We introduce a model for compositional functions next. Consider a directed acyclic graph \( \mathcal{G}\) with \( M\) vertices \( \eta_1,\dots,\eta_M\) such that

  • exactly \( d\) vertices, \( \eta_1,\dots,\eta_d\) , have no ingoing edge,
  • each vertex has at most \( m\in\mathbb{N}\) ingoing edges,
  • exactly one vertex, \( \eta_M\) , has no outgoing edge.

With each vertex \( \eta_j\) for \( j>d\) we associate a function \( f_j:\mathbb{R}^{d_j}\to\mathbb{R}\) . Here \( d_j\) denotes the cardinality of the set \( S_j\) , which is defined as the set of indices \( i\) corresponding to vertices \( \eta_i\) for which we have an edge from \( \eta_i\) to \( \eta_j\) . Without loss of generality, we assume that \( m\ge d_j=|S_j|\ge 1\) for all \( j>d\) . Finally, we let

\[ \begin{align} F_j\mathrm{:}= x_j~ \text{ for all } ~ j\le d \\\end{align} \]

(85.a)

and

\[ \begin{align} F_j\mathrm{:}= f_j((F_i)_{i\in S_j})~ \text{ for all } ~ j>d. \end{align} \]

(85.b)

Then \( F_M(x_1,\dots,x_d)\) is a function from \( \mathbb{R}^d\to\mathbb{R}\) . Assuming

\[ \begin{align} \| f_j \|_{C^{k,s}(\mathbb{R}^{d_j})}\le 1~ \text{ for all } ~ j=d+1,\dots,M, \end{align} \]

(86)

we denote the set of all functions of the type \( F_M\) by \( \mathcal{F}^{k,s}(m,d,M)\) . Figure 22 shows possible graphs of such functions.

Clearly, for \( s=0\) , \( \mathcal{F}^{k,0}(m,d,M)\subseteq C^{k}(\mathbb{R}^d)\) since the composition of functions in \( C^k\) belongs again to \( C^k\) . A direct application of Theorem 14 allows us to approximate \( F_M\in \mathcal{F}^{k,0}(m,d,M)\) with a neural network of size \( O(N\log(N))\) and error \( O(N^{-\frac{k}{d}})\) . Since each \( f_j\) depends only on \( m\) variables, intuitively we expect an error convergence of type \( O(N^{-\frac{k}{m}})\) with the constant somehow depending on the number \( M\) of vertices. To show that this is actually possible, in the following we associate with each node \( \eta_j\) a depth \( l_j\ge 0\) , such that \( l_j\) is the maximum number of edges connecting \( \eta_j\) to one of the nodes \( \{\eta_1,\dots,\eta_d\}\) .

Figure 22. Three types of graphs that could be the basis of compositional functions. The associated functions are composed of two or three-dimensional functions only.
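The recursion (85) and the role of per-node accuracies can be made concrete with a binary-tree graph as in Figure 22: the following sketch evaluates \( F_M\) for \( d=4\) , \( m=2\) , \( M=7\) and replaces each \( f_j\) by a perturbed surrogate with the accuracies \( \delta_j=\varepsilon(2m)^{-(M+1-j)}\) used in Step 1 of the proof of Proposition 12 below; the output then deviates by at most \( \varepsilon\) . The concrete choice of the subfunctions is ours and only serves the illustration.

```python
import numpy as np

# A compositional function as in (85): d = 4 inputs, m = 2, M = 7 vertices,
# arranged as a binary tree (cf. Figure 22).  Each f_j is two-dimensional and
# scaled so that the bound (86) holds comfortably.
d, m, M = 4, 2, 7
S = {5: (1, 2), 6: (3, 4), 7: (5, 6)}                     # incoming edges S_j
f = {j: (lambda u, v: 0.25 * np.sin(u + v)) for j in S}   # ||f_j||_{C^{k,s}} <= 1

def evaluate(x, subfuns):
    """Evaluate F_M by the recursion (85) in topological order."""
    F = {j + 1: x[j] for j in range(d)}                   # F_j = x_j for j <= d
    for j in sorted(subfuns):
        F[j] = subfuns[j](*(F[i] for i in S[j]))
    return F[M]

# Replace each f_j by a delta_j-accurate surrogate with the per-node accuracies
# delta_j = eps * (2m)^{-(M+1-j)} used in Step 1 of the proof of Proposition 12.
rng = np.random.default_rng(4)
eps = 1e-3
f_hat = {j: (lambda u, v, dj=eps * (2 * m) ** -(M + 1 - j), g=f[j]:
             g(u, v) + dj * (2 * rng.random() - 1)) for j in S}

worst = 0.0
for _ in range(5000):
    x = 2 * rng.random(d) - 1
    worst = max(worst, abs(evaluate(x, f) - evaluate(x, f_hat)))
print(f"worst deviation {worst:.2e}  <=  eps = {eps:.2e}")
```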

Proposition 12

Let \( k\) , \( m\) , \( d\) , \( M \in \mathbb{N}\) and \( s>0\) . Let \( F_M\in\mathcal{F}^{k,s}(m,d,M)\) . Then there exists a constant \( C=C(m,k+s,M)\) such that for every \( N \in \mathbb{N}\) there exists a ReLU neural network \( \Phi^{F_M}\) such that

\[ \begin{align*} {\rm size}(\Phi^{F_M})\le C N\log(N),~~{\rm depth}(\Phi^{F_M})\le C\log(N) \end{align*} \]

and

\[ \begin{align*} \sup_{{\boldsymbol{x}}\in [0,1]^d}|F_M({\boldsymbol{x}})-\Phi^{F_M}({\boldsymbol{x}})|\le N^{-\frac{k+s}{m}}. \end{align*} \]

Proof

Throughout this proof we assume without loss of generality that the indices follow a topological ordering, i.e., they are ordered such that \( S_j\subseteq\{1,\dots,j-1\}\) for all \( j\) (i.e. the inputs of vertex \( \eta_j\) can only be vertices \( \eta_i\) with \( i<j\) ).

Step 1. First assume that \( \hat f_j\) are functions such that with \( 0 < \varepsilon \leq 1\)

\[ \begin{align} |f_j({\boldsymbol{x}})-\hat f_j({\boldsymbol{x}})|\le \delta_j\mathrm{:}= \varepsilon \cdot (2m)^{-(M+1-j)} ~ \text{ for all } ~ {\boldsymbol{x}}\in [-2,2]^{d_j}. \end{align} \]

(87)

Let \( \hat F_j\) be defined as in (85), but with all \( f_j\) in (85.b) replaced by \( \hat f_j\) . We now check the error of the approximation \( \hat F_M\) to \( F_M\) . To do so we proceed by induction over \( j\) and show that for all \( {\boldsymbol{x}}\in [-1,1]^d\)

\[ \begin{align} |F_j({\boldsymbol{x}})-\hat F_j({\boldsymbol{x}})|\le (2m)^{-(M-j)}\varepsilon. \end{align} \]

(88)

Note that due to \( \| f_j \|_{C^k}\le 1\) we have \( |F_j({\boldsymbol{x}})|\le 1\) and thus (88) implies in particular that \( \hat F_j({\boldsymbol{x}})\in [-2,2]\) .

For \( j\le d\) it holds \( F_j({\boldsymbol{x}})=\hat F_j({\boldsymbol{x}})=x_j\) , and thus (88) is valid for all \( {\boldsymbol{x}}\in [-1,1]^d\) . For the induction step, for all \( {\boldsymbol{x}}\in [-1,1]^d\) by (87) and the induction hypothesis

\[ \begin{align*} |F_j({\boldsymbol{x}})-\hat F_j({\boldsymbol{x}})| & = |f_j((F_i)_{i\in S_j})-\hat f_j((\hat F_i)_{i\in S_j})|\nonumber\\ & \le |f_j((F_i)_{i\in S_j})-f_j((\hat F_i)_{i\in S_j})|+|f_j((\hat F_i)_{i\in S_j})-\hat f_j((\hat F_i)_{i\in S_j})|\nonumber\\ & \le \sum_{i\in S_j}|F_i-\hat F_i|+\delta_j\nonumber\\ &\le m \cdot (2m)^{-(M-(j-1))}\varepsilon+(2m)^{-(M+1-j)}\varepsilon\nonumber\\ &\le (2m)^{-(M-j)}\varepsilon. \end{align*} \]

Here we used that \( |\frac{d}{dx_r}f_j((x_i)_{i\in S_j})|\le 1\) for all \( r\in S_j\) so that by the triangle inequality and the mean value theorem

\[ \begin{align*} |f_j((x_i)_{i\in S_j})-f_j((y_i)_{i\in S_j})| &\le \sum_{r\in S_j}|f_j((x_i)_{\substack{i\in S_j\\ i\le r}},(y_i)_{\substack{i\in S_j\\ i>r}}) -f_j((x_i)_{\substack{i\in S_j\\ i< r}},(y_i)_{\substack{i\in S_j\\ i\ge r}})|\nonumber\\ &\le \sum_{r\in S_j}|x_r-y_r|. \end{align*} \]

This shows that (88) holds, and thus for all \( {\boldsymbol{x}}\in [-1,1]^d\)

\[ \begin{align*} |F_M({\boldsymbol{x}})-\hat F_M ({\boldsymbol{x}})|\le \varepsilon. \end{align*} \]

Step 2. We sketch a construction of how to write \( \hat F_M\) from Step 1 as a neural network \( \Phi^{F_M}\) with the asserted size and depth bounds. Fix \( N\in\mathbb{N}\) and let

\[ \begin{align*} N_j\mathrm{:}= \lceil N(2m)^{\frac{m}{k+s}(M+1-j)}\rceil. \end{align*} \]

By Theorem 14, since \( d_j\le m\) , we can find a neural network \( \Phi^{f_j}\) satisfying

\[ \begin{align} \sup_{{\boldsymbol{x}}\in [-2,2]^{d_j}}|f_j({\boldsymbol{x}})-\Phi^{f_j}({\boldsymbol{x}})|\le N_j^{-\frac{k+s}{m}}\le N^{-\frac{k+s}{m}}(2m)^{-(M+1-j)} \end{align} \]

(89)

and

\[ \begin{align*} {\rm size}(\Phi^{f_j}) \le C N_j\log(N_j) \le C N (2m)^{\frac{m(M+1-j)}{k+s}}\left(\log(N)+\log(2m)\frac{m(M+1-j)}{k+s}\right) \end{align*} \]

as well as

\[ \begin{align*} {\rm depth}(\Phi^{f_j})\le C \cdot \left(\log(N)+\log(2m)\frac{m(M+1-j)}{k+s}\right). \end{align*} \]

Then

\[ \begin{align*} \sum_{j=1}^{M}{\rm size}(\Phi^{f_j})&\le 2 C N\log(N)\sum_{j=1}^M (2m)^{\frac{m(M+1-j)}{k+s}}\nonumber\\ &\le 2C N\log(N) \sum_{j=1}^M \left((2m)^{\frac{m}{k+s}}\right)^j\nonumber\\ &\le 2C N\log(N) (2m)^{\frac{m(M+1)}{k+s}}. \end{align*} \]

Here we used \( \sum_{j=1}^M a^j\le\int_1^{M+1}\exp(\log(a)x)\,\mathrm{d} x\le \frac{1}{\log(a)}a^{M+1}\) .

The function \( \hat F_M\) from Step 1 will then yield error \( N^{-\frac{k+s}{m}}\) by (87) and (89). We observe that \( \hat F_M\) can be constructed inductively as a neural network \( \Phi^{F_M}\) by propagating all values \( \Phi^{F_1},\dots,\Phi^{F_j}\) to all consecutive layers using identity neural networks, and then using the outputs of \( (\Phi^{F_i})_{i\in S_{j+1}}\) as input to \( \Phi^{f_{j+1}}\) . The depth of this neural network is bounded by

\[ \begin{align*} \sum_{j=1}^M{\rm depth}(\Phi^{f_j}) = O(M\log(N)). \end{align*} \]

We have at most \( \sum_{j=1}^M |S_j|\le mM\) values which need to be propagated through these \( O(M\log(N))\) layers, amounting to an overhead of \( O(mM^2\log(N))=O(\log(N))\) for the identity neural networks. In all, the neural network size is thus \( O(N\log(N))\) .

Remark 10

From the proof we observe that the constant \( C\) in Proposition 12 behaves like \( O((2m)^{\frac{m(M+1)}{k+s}})\) .

9.3 Functions on manifolds

Another instance in which the curse of dimension can be mitigated is if the input to the network belongs to \( \mathbb{R}^d\) , but stems from an \( m\) -dimensional manifold \( \mathcal{M}\subseteq\mathbb{R}^d\) . If we only measure the approximation error on \( \mathcal{M}\) , then we can again show that it is \( m\) rather than \( d\) that determines the rate of convergence.

Figure 23. One-dimensional sub-manifold of three-dimensional space. At the orange point, we depict a ball and the tangent space of the manifold.

To explain the idea, we assume in the following that \( \mathcal{M}\) is a smooth, compact \( m\) -dimensional manifold in \( \mathbb{R}^d\) . Moreover, we suppose that there exists \( \delta>0\) and finitely many points \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_M\in\mathcal{M}\) such that the balls \( B_{\delta/2}({\boldsymbol{x}}_j)\mathrm{:}= \{{\boldsymbol{y}}\in\mathbb{R}^d\,|\,\| {\boldsymbol{y}}-{\boldsymbol{x}}_j \|_{2}<{\delta}/{2}\}\) for \( j=1,\dots,M\) cover \( \mathcal{M}\) (for every \( \delta>0\) such \( {\boldsymbol{x}}_j\) exist since \( \mathcal{M}\) is compact). Moreover, denoting by \( T_{{\boldsymbol{x}}}\mathcal{M}\simeq \mathbb{R}^m\) the tangent space of \( \mathcal{M}\) at \( {\boldsymbol{x}}\) , we assume \( \delta>0\) to be so small that the orthogonal projection

\[ \begin{align} \pi_j:B_{\delta}({\boldsymbol{x}}_j)\cap\mathcal{M}\to T_{{\boldsymbol{x}}_j}\mathcal{M} \end{align} \]

(90)

is injective, the set \( \pi_j(B_{\delta}({\boldsymbol{x}}_j)\cap\mathcal{M})\subseteq T_{{\boldsymbol{x}}_j}\mathcal{M}\) has \( C^\infty\) boundary, and the inverse of \( \pi_j\) , i.e.

\[ \begin{align} \pi_j^{-1}:\pi_j(B_{\delta}({\boldsymbol{x}}_j)\cap\mathcal{M})\to\mathcal{M} \end{align} \]

(91)

is \( C^\infty\) (this is possible because \( \mathcal{M}\) is a smooth manifold). A visualization of this assumption is shown in Figure 23.

Note that \( \pi_j\) in (90) is a linear map, whereas \( \pi_j^{-1}\) in (91) is in general non-linear.

For a function \( f:\mathcal{M}\to\mathbb{R}\) we can then write

\[ \begin{align} f({\boldsymbol{x}}) = f(\pi_j^{-1}(\pi_j({\boldsymbol{x}})))=f_j(\pi_j({\boldsymbol{x}})) ~~\text{for all }{\boldsymbol{x}}\in B_\delta({\boldsymbol{x}}_j)\cap\mathcal{M} \end{align} \]

(92)

where

\[ \begin{align*} f_j\mathrm{:}= f\circ \pi_j^{-1}:\pi_j(B_{\delta}({\boldsymbol{x}}_j)\cap\mathcal{M})\to\mathbb{R}. \end{align*} \]
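The charts (90)–(92) can be written down explicitly for simple manifolds. The following sketch treats the unit circle in \( \mathbb{R}^2\) (so \( m=1\) , \( d=2\) ): \( \pi_j\) is the linear projection onto the tangent line at a chart point, \( \pi_j^{-1}\) maps a tangential coordinate back onto the circle, and the identity \( f({\boldsymbol{x}})=f_j(\pi_j({\boldsymbol{x}}))\) holds exactly (up to floating point) on the chart. The chosen function \( f\) is arbitrary.

```python
import numpy as np

# The unit circle M = {(cos t, sin t)} in R^2, so m = 1 and d = 2.
f = lambda p: np.exp(p[0]) * np.sin(2 * p[1])      # a smooth function, restricted to M

def chart(t_j):
    x_j = np.array([np.cos(t_j), np.sin(t_j)])     # chart point x_j on M
    tau = np.array([-np.sin(t_j), np.cos(t_j)])    # unit tangent at x_j
    pi = lambda p: float(tau @ (p - x_j))          # linear projection onto T_{x_j}M
    # inverse projection: the point of M near x_j with tangential coordinate s
    pi_inv = lambda s: x_j * np.sqrt(1.0 - s**2) + tau * s
    return pi, pi_inv

pi_j, pi_j_inv = chart(t_j=0.8)
f_j = lambda s: f(pi_j_inv(s))                     # the local function f_j = f o pi_j^{-1}
for t in 0.8 + np.linspace(-0.3, 0.3, 7):          # points of M inside the chart
    p = np.array([np.cos(t), np.sin(t)])
    print(f"t = {t:+.2f}   |f(p) - f_j(pi_j(p))| = {abs(f(p) - f_j(pi_j(p))):.2e}")
```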

In the following, for \( f:\mathcal{M}\to\mathbb{R}\) , \( k\in\mathbb{N}_0\) , and \( s\in [0,1)\) we let

\[ \begin{align*} \| f \|_{C^{k,s}(\mathcal{M})}\mathrm{:}= \sup_{j=1,\dots,M}\| f_j \|_{C^{k,s}(\pi_j(B_{\delta}({\boldsymbol{x}}_j)\cap\mathcal{M}))}. \end{align*} \]

We now state the main result of this section.

Proposition 13

Let \( d\) , \( k \in \mathbb{N}\) , \( s \geq 0\) , and let \( \mathcal{M}\) be a smooth, compact \( m\) -dimensional manifold in \( \mathbb{R}^d\) . Then there exists a constant \( C>0\) such that for all \( f\in C^{k,s}(\mathcal{M})\) and every \( N\in\mathbb{N}\) there exists a ReLU neural network \( \Phi_N^f\) such that \( \rm{ size}(\Phi_N^f)\le CN\log(N)\) , \( \rm{ depth}(\Phi_N^f)\le C\log(N)\) and

\[ \begin{align*} \sup_{{\boldsymbol{x}}\in\mathcal{M}}|f({\boldsymbol{x}})-\Phi_N^f({\boldsymbol{x}})|\le C \| f \|_{C^{k,s}(\mathcal{M})}N^{-\frac{k+s}{m}}. \end{align*} \]

Proof

Since \( \mathcal{M}\) is compact there exists \( A>0\) such that \( \mathcal{M}\subseteq [-A,A]^d\) . As in the proof of Theorem 14, we consider a uniform mesh with nodes \( \{-A+2A\frac{{\boldsymbol{\nu}}}{n}\,|\,{\boldsymbol{\nu}}\le n\}\) , and the corresponding piecewise linear basis functions forming the partition of unity \( \sum_{{\boldsymbol{\nu}}\le n}\varphi_{\boldsymbol{\nu}}\equiv 1\) on \( [-A,A]^d\) where \( \rm{supp}\varphi_{\boldsymbol{\nu}}\subseteq \{{\boldsymbol{y}}\in\mathbb{R}^d\,|\,\| (-A+2A\frac{{\boldsymbol{\nu}}}{n})-{\boldsymbol{y}} \|_{\infty}\le \frac{2A}{n}\}\) . Let \( \delta>0\) be as in the beginning of this section. Since \( \mathcal{M}\) is covered by the balls \( (B_{\delta/2}({\boldsymbol{x}}_j))_{j=1}^M\) , fixing \( n \in \mathbb{N}\) large enough, for each \( {\boldsymbol{\nu}}\) such that \( \rm{supp}\varphi_{\boldsymbol{\nu}} \cap\mathcal{M}\neq\emptyset\) there exists \( j({\boldsymbol{\nu}})\in\{1,\dots,M\}\) such that \( \rm{supp}\varphi_{\boldsymbol{\nu}}\subseteq B_\delta({\boldsymbol{x}}_{j({\boldsymbol{\nu}})})\) and we set \( I_j\mathrm{:}= \{{\boldsymbol{\nu}}\le n \,|\,j=j({\boldsymbol{\nu}})\}\) . Using (92) we then have for all \( {\boldsymbol{x}}\in\mathcal{M}\)

\[ \begin{align} f({\boldsymbol{x}})= \sum_{{\boldsymbol{\nu}}\le n}\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}}) f({\boldsymbol{x}}) =\sum_{j=1}^M\sum_{{\boldsymbol{\nu}}\in I_j}\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}}) f_j(\pi_j({\boldsymbol{x}})). \end{align} \]

(93)

Next, we approximate the functions \( f_j\) . Let \( C_j\) be the smallest (\( m\) -dimensional) cube in \( T_{{\boldsymbol{x}}_j}\mathcal{M}\simeq\mathbb{R}^m\) such that \( \pi_j(B_\delta({\boldsymbol{x}}_j)\cap\mathcal{M})\subseteq C_j\) . The function \( f_j\) can be extended to a function on \( C_j\) (we will use the same notation for this extension) such that

\[ \begin{align*} \| f_j \|_{C^{k,s}(C_j)}\le C \| f_j \|_{C^{k,s}(\pi_j(B_\delta({\boldsymbol{x}}_j)\cap\mathcal{M}))}, \end{align*} \]

for some constant \( C\) depending on \( \pi_j(B_\delta({\boldsymbol{x}}_j)\cap\mathcal{M})\) but independent of \( f\) . Such an extension result can, for example, be found in [113, Chapter VI]. By Theorem 14 (also see Remark 8), there exists a neural network \( \hat f_j:C_j\to\mathbb{R}\) such that

\[ \begin{align} \sup_{{\boldsymbol{x}}\in C_j}|f_j({\boldsymbol{x}})-\hat f_j({\boldsymbol{x}})|\le C N^{-\frac{k+s}{m}} \end{align} \]

(94)

and

\[ \begin{align*} {\rm size}(\hat f_j)\le C N\log(N),~~ {\rm depth}(\hat f_j)\le C\log(N). \end{align*} \]

To approximate \( f\) in (93) we now let with \( \varepsilon\mathrm{:}= N^{-\frac{k+s}{m}}\)

\[ \begin{align*} \Phi_N\mathrm{:}= \sum_{j=1}^M\sum_{{\boldsymbol{\nu}}\in I_j}{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}},\hat f_j\circ\pi_j), \end{align*} \]

where we note that \( \pi_j\) is linear and thus \( \hat f_j\circ\pi_j\) can be expressed by a neural network. First let us estimate the error of this approximation. For \( {\boldsymbol{x}}\in\mathcal{M}\)

\[ \begin{align*} |f({\boldsymbol{x}})-\Phi_N({\boldsymbol{x}})| &\le\sum_{j=1}^M \sum_{{\boldsymbol{\nu}}\in I_j}|\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})f_j(\pi_j({\boldsymbol{x}}))-{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}}),\hat f_j(\pi_j({\boldsymbol{x}})))|\nonumber\\ &\le\sum_{j=1}^M \sum_{{\boldsymbol{\nu}}\in I_j}\left(|\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})f_j(\pi_j({\boldsymbol{x}})) -\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})\hat f_j(\pi_j({\boldsymbol{x}}))| \right.\\ & ~~ \left. + |\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})\hat f_j(\pi_j({\boldsymbol{x}})) -{\Phi^{\times}_{\varepsilon}}(\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}}),\hat f_j(\pi_j({\boldsymbol{x}})))|\right)\nonumber\\ &\le\sup_{i\le M}\| f_i-\hat f_i \|_{L^\infty(C_i)} \sum_{j=1}^M \sum_{{\boldsymbol{\nu}}\in I_j}|\varphi_{\boldsymbol{\nu}}({\boldsymbol{x}})| + \sum_{j=1}^M \sum_{\{{\boldsymbol{\nu}}\in I_j\,|\,{\boldsymbol{x}}\in\rm{supp}\varphi_{\boldsymbol{\nu}}\}}\varepsilon\nonumber\\ &\le CN^{-\frac{k+s}{m}} + d\varepsilon\le CN^{-\frac{k+s}{m}}, \end{align*} \]

where we used that \( {\boldsymbol{x}}\) can be in the support of at most \( d\) of the \( \varphi_{\boldsymbol{\nu}}\) , and where \( C\) is a constant depending on \( d\) and \( \mathcal{M}\) .

Finally, let us bound the size and depth of this approximation. Using \( \rm{ size}(\varphi_{\boldsymbol{\nu}})\le C\) , \( \rm{ depth}(\varphi_{\boldsymbol{\nu}})\le C\) (see (49)) as well as \( \rm{ size}({\Phi^{\times}_{\varepsilon}})\le C\log(1/\varepsilon)\le C\log(N)\) and \( \rm{ depth}({\Phi^{\times}_{\varepsilon}})\le C\log(1/\varepsilon)\le C\log(N)\) (see Lemma 20) we find

\[ \begin{align*} \sum_{j=1}^M\sum_{{\boldsymbol{\nu}}\in I_j}\left({\rm size}({\Phi^{\times}_{\varepsilon}})+{\rm size}(\varphi_{\boldsymbol{\nu}})+{\rm size}(\hat f_j\circ\pi_j)\right) &\le \sum_{j=1}^M\sum_{{\boldsymbol{\nu}}\in I_j}C\log(N)+C+CN\log(N)\nonumber\\ &=O(N\log(N)), \end{align*} \]

which implies the bound on \( \rm{ size}(\Phi_N)\) . Moreover,

\[ \begin{align*} {\rm depth}(\Phi_N)&\le {\rm depth}({\Phi^{\times}_{\varepsilon}})+\max\left\{{\rm depth}(\varphi_{\boldsymbol{\nu}}),\,{\rm depth}(\hat f_j\circ\pi_j)\right\}\\ &\le C \log(N)+\log(N)=O(\log(N)). \end{align*} \]

This completes the proof.

Bibliography and further reading

The ideas of Section 9.1 were originally developed in [105], with an extension to \( L^\infty\) approximation provided in [114]. These arguments can be extended to yield dimension-independent approximation rates for high-dimensional discontinuous functions, provided the discontinuity follows a Barron function, as shown in [107]. The Barron class has been generalized in various ways, as discussed in [115, 116, 117, 118, 119].

The compositionality assumption of Section 9.2 was discussed in the form presented in [112]. An alternative approach, known as the hierarchical composition/interaction model, was studied in [120].

The manifold assumption discussed in Section 9.3 is frequently found in the literature, with notable examples including [121, 122, 123, 124, 125, 126].

Another prominent direction, omitted in this chapter, pertains to scientific machine learning. High-dimensional functions often arise from (parametric) PDEs, which have a rich literature describing their properties and structure. Various results have shown that neural networks can leverage the inherent low-dimensionality known to exist in such problems. Efficient approximation of certain classes of high-dimensional (or even infinite-dimensional) analytic functions, ubiquitous in parametric PDEs, has been verified in [86, 127]. Further general analyses for high-dimensional parametric problems can be found in [128, 129], and results exploiting specific structural conditions of the underlying PDEs, e.g., in [130, 131]. Additionally, [90, 93, 91] provide results regarding fast convergence for certain smooth functions in potentially high but finite dimensions.

For high-dimensional PDEs, elliptic problems have been addressed in [132], linear and semilinear parabolic evolution equations have been explored in [133, 134, 135], and stochastic differential equations in [136, 137].

Exercises

Exercise 27

Let \( C>0\) and \( d\in \mathbb{N}\) . Show that, if \( g \in \Gamma_C\) , then

\[ a^{-d} g\left(a (\cdot -{\boldsymbol{b}})\right) \in \Gamma_C, \]

for every \( a\in \mathbb{R}_+\) , \( {\boldsymbol{b}}\in \mathbb{R}^d\) .

Exercise 28

Let \( C>0\) and \( d\in \mathbb{N}\) . Show that, for \( g_i \in \Gamma_C\) , \( i = 1, \dots, m\) and \( c= (c_i)_{i=1}^m \) it holds that

\[ \sum_{i=1}^m c_i g_i \in \Gamma_{\|c\|_1 C}. \]

Exercise 29

Show that for every \( d\in\mathbb{N}\) the function \( f({\boldsymbol{x}})\mathrm{:}= \exp(-{\| {\boldsymbol{x}} \|_{2}^2}/{2})\) , \( {\boldsymbol{x}}\in \mathbb{R}^d\) , belongs to \( \Gamma_d\) , and it holds \( C_f=O(\sqrt{d})\) , for \( d \to \infty\) .

Exercise 30

Let \( d\in \mathbb{N}\) , and let \( f({\boldsymbol{x}}) = \sum_{i=1}^\infty c_i \sigma_{\rm {ReLU}}(\langle {\boldsymbol{a}}_i, {\boldsymbol{x}} \rangle + b_i)\) for \( {\boldsymbol{x}} \in \mathbb{R}^d\) with \( \|{\boldsymbol{a}}_i\| = 1, |b_i| \leq 1 \) for all \( i \in \mathbb{N}\) and \( \sum_{i=1}^\infty |c_i|<\infty\) . Show that for every \( N \in \mathbb{N}\) , there exists a ReLU neural network \( f_N\) with \( N\) neurons and one hidden layer such that

\[ \|f - f_N\|_{L^2(B_1^d)} \leq \frac{3 \|c\|_1}{\sqrt{N}}. \]

Hence, every infinite ReLU neural network can be approximated at a rate \( O(N^{-1/2})\) by finite ReLU neural networks of width \( N\) .

Exercise 31

Let \( C>0\) . Prove that every \( f \in \Gamma_C\) is continuously differentiable.

10 Interpolation

The learning problem associated to minimizing the empirical risk of (3) is based on minimizing an error that results from evaluating a neural network on a finite set of (training) points. In contrast, all previous approximation results focused on achieving uniformly small errors across the entire domain. Finding neural networks that achieve a small training error appears to be much simpler, since, instead of \( \|f - \Phi_n \|_\infty \to 0\) for a sequence of neural networks \( \Phi_n\) , it suffices to have \( \Phi_n({\boldsymbol{x}}_i) \to f({\boldsymbol{x}}_i)\) for all \( {\boldsymbol{x}}_i\) in the training set.

In this chapter, we study the extreme case of the aforementioned approximation problem. We analyze under which conditions it is possible to find a neural network that coincides with the target function \( f\) at all training points. This is referred to as interpolation. To make this notion more precise, we state the following definition.

Definition 19 (Interpolation)

Let \( d\) , \( m \in \mathbb{N}\) , and let \( \Omega \subseteq \mathbb{R}^d\) . We say that a set of functions \( \mathcal{H} \subseteq \{h \colon \Omega \to \mathbb{R}\}\) interpolates \( m\) points in \( \Omega\) , if for every \( S = ({\boldsymbol{x}}_i, y_i)_{i=1}^m \subseteq \Omega \times \mathbb{R}\) , such that \( {\boldsymbol{x}}_i \neq {\boldsymbol{x}}_j\) for \( i \neq j\) , there exists a function \( h \in \mathcal{H}\) such that \( h({\boldsymbol{x}}_i) = y_i\) for all \( i = 1, \dots, m\) .

Knowing the interpolation properties of an architecture represents extremely valuable information for two reasons:

  • Consider an architecture that interpolates \( m\) points and let the number of training samples be bounded by \( m\) . Then (3) always has a solution.
  • Consider again an architecture that interpolates \( m\) points and assume that the number of training samples is less than \( m\) . Then for every point \( \tilde{{\boldsymbol{x}}}\) not in the training set and every \( y \in \mathbb{R}\) there exists a minimizer \( h\) of (3) that satisfies \( h(\tilde{{\boldsymbol{x}}}) = y\) . As a consequence, without further restrictions (many of which we will discuss below), such an architecture cannot generalize to unseen data.

The existence of solutions to the interpolation problem does not follow trivially from the approximation results provided in the previous chapters (even though we will later see that there is a close connection). We also remark that the question of how many points neural networks with a given architecture can interpolate is closely related to the so-called VC dimension, which we will study in Chapter 15.

We start our analysis of the interpolation properties of neural networks by presenting a result similar to the universal approximation theorem but for interpolation in the following section. In the subsequent section, we then look at interpolation with desirable properties.

10.1 Universal interpolation

Under what conditions on the activation function and architecture can a set of neural networks interpolate \( m \in \mathbb{N}\) points? According to Chapter 4, particularly Theorem 2, we know that shallow neural networks can approximate every continuous function with arbitrary accuracy, provided the neural network width is large enough. As the neural network’s width and/or depth increases, the architectures become increasingly powerful, leading us to expect that at some point, they should be able to interpolate \( m\) points. However, this intuition may not be correct:

Example 4

Let \( \mathcal{H}:=\{f\in C^0([0,1])\,|\,f(0)\in\mathbb{Q}\}\) . Then \( \mathcal{H}\) is dense in \( C^0([0,1])\) , but \( \mathcal{H}\) does not even interpolate one point in \( [0,1]\) .

Moreover, Theorem 2 is an asymptotic result that only states that a given function can be approximated for sufficiently large neural network architectures, but it does not state how large the architecture needs to be.

Surprisingly, Theorem 2 can nonetheless be used to give a guarantee that a fixed-size architecture yields sets of neural networks that allow the interpolation of \( m\) points. This result is due to [1]; for a more detailed discussion of previous results see the bibliography section. Due to its similarity to the universal approximation theorem and the fact that it uses the same assumptions, we call the following theorem the “Universal Interpolation Theorem”. For its statement recall the definition of the set of allowed activation functions \( \mathcal{M}\) in (8) and the class \( \mathcal{N}_d^1(\sigma,1,n)\) of shallow neural networks of width \( n\) introduced in Definition 5.

Theorem 16 (Universal Interpolation Theorem)

Let \( d\) , \( n \in \mathbb{N}\) and let \( \sigma \in \mathcal{M}\) not be a polynomial. Then \( \mathcal{N}_d^1(\sigma,1,n)\) interpolates \( n+1\) points in \( \mathbb{R}^d\) .

Proof

Fix \( ({\boldsymbol{x}}_i)_{i=1}^{n+1}\subseteq\mathbb{R}^d\) arbitrary. We will show that for any \( (y_i)_{i=1}^{n+1}\subseteq\mathbb{R}\) there exist weights and biases \( ({\boldsymbol{w}}_j)_{j=1}^n\subseteq \mathbb{R}^d\) , \( (b_j)_{j=1}^n\) , \( (v_j)_{j=1}^n\subseteq \mathbb{R}\) , \( c\in \mathbb{R}\) such that

\[ \begin{align} \Phi({\boldsymbol{x}}_i):=\sum_{j=1}^{n} v_j\sigma({\boldsymbol{w}}_j^\top{\boldsymbol{x}}_i+b_j)+c=y_i~\text{for all }~ i=1,\dots,n+1. \end{align} \]

(103)

Since \( \Phi\in \mathcal{N}_d^1(\sigma,1,n)\) this then concludes the proof.

Denote

\[ \begin{align} {\boldsymbol{A}}\mathrm{:}=\begin{pmatrix} 1& \sigma({\boldsymbol{w}}_1^\top{\boldsymbol{x}}_1+b_1)&\cdots &\sigma({\boldsymbol{w}}_{n}^\top{\boldsymbol{x}}_1+b_{n}) \\ \vdots &\vdots&\ddots&\vdots \\ 1& \sigma({\boldsymbol{w}}_1^\top{\boldsymbol{x}}_{n+1}+b_1)&\cdots &\sigma({\boldsymbol{w}}_{n}^\top{\boldsymbol{x}}_{n+1}+b_{n}) \\ \end{pmatrix}\in\mathbb{R}^{(n+1)\times (n+1)}. \end{align} \]

(104)

Then \( {\boldsymbol{A}}\) being regular implies that for each \( (y_i)_{i=1}^{n+1}\) there exist \( c\) and \( (v_j)_{j=1}^{n}\) such that (103) holds. Hence, it suffices to find \( ({\boldsymbol{w}}_j)_{j=1}^n\) and \( (b_j)_{j=1}^n\) such that \( {\boldsymbol{A}}\) is regular.

To do so, we proceed by induction over \( k=0,\dots,n\) , to show that there exist \( ({\boldsymbol{w}}_j)_{j=1}^k\) and \( (b_j)_{j=1}^k\) such that the first \( k+1\) columns of \( {\boldsymbol{A}}\) are linearly independent. The case \( k=0\) is trivial. Next let \( 0<k<n\) and assume that the first \( k\) columns of \( {\boldsymbol{A}}\) are linearly independent. We wish to find \( {\boldsymbol{w}}_{k}\) , \( b_{k}\) such that the first \( k+1\) columns are linearly independent. Suppose such \( {\boldsymbol{w}}_{k}\) , \( b_{k}\) do not exist and denote by \( Y_k\subseteq\mathbb{R}^{n+1}\) the space spanned by the first \( k\) columns of \( {\boldsymbol{A}}\) . Then for all \( {\boldsymbol{w}}\in\mathbb{R}^d\) , \( b\in\mathbb{R}\) the vector \( (\sigma({\boldsymbol{w}}^\top{\boldsymbol{x}}_i+b))_{i=1}^{n+1}\in\mathbb{R}^{n+1}\) must belong to \( Y_k\) . Fix \( {\boldsymbol{y}}=(y_i)_{i=1}^{n+1}\in\mathbb{R}^{n+1}\backslash Y_k\) . Then

\[ \begin{align*} \inf_{\tilde\Phi\in\mathcal{N}_d^1(\sigma,1)}\| (\tilde\Phi({\boldsymbol{x}}_i))_{i=1}^{n+1}-{\boldsymbol{y}} \|_{2}^2&=\inf_{N,{\boldsymbol{w}}_j,b_j,v_j,c} \sum_{i=1}^{n+1}\Big(\sum_{j=1}^N v_j\sigma({\boldsymbol{w}}_j^\top{\boldsymbol{x}}_i+b_j)+c-y_i\Big)^2\\ &\ge \inf_{\tilde{\boldsymbol{y}}\in Y_k}\| \tilde{\boldsymbol{y}}-{\boldsymbol{y}} \|_{2}^2>0. \end{align*} \]

Since we can find a continuous function \( f:\mathbb{R}^d\to\mathbb{R}\) such that \( f({\boldsymbol{x}}_i)=y_i\) for all \( i=1,\dots,n+1\) , this contradicts Theorem 2.
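The proof is non-constructive with respect to the inner weights, but in practice the matrix \( {\boldsymbol{A}}\) in (104) is regular for generically chosen \( ({\boldsymbol{w}}_j,b_j)\) . The following Python sketch (an illustration under this assumption, with NumPy and \( \sigma=\tanh\) ; if \( {\boldsymbol{A}}\) happened to be singular one would re-draw the inner weights) solves the linear system (103) for \( c\) and \( (v_j)_{j=1}^n\) .

import numpy as np

# Numerical sketch of Theorem 16 (illustrative only): interpolate n+1 points
# in R^d with a width-n shallow network, activation sigma = tanh.
rng = np.random.default_rng(0)
d, n = 2, 7
X = rng.normal(size=(n + 1, d))        # n+1 distinct points x_i
y = rng.normal(size=n + 1)             # arbitrary target values y_i

W = rng.normal(size=(n, d))            # inner weights w_j (fixed at random)
b = rng.normal(size=n)                 # inner biases b_j

# Matrix A from (104): first a column of ones, then sigma(w_j^T x_i + b_j).
A = np.hstack([np.ones((n + 1, 1)), np.tanh(X @ W.T + b)])
coeff = np.linalg.solve(A, y)          # solvable whenever A is regular
c, v = coeff[0], coeff[1:]

Phi = lambda x: v @ np.tanh(W @ x + b) + c
print(max(abs(Phi(X[i]) - y[i]) for i in range(n + 1)))   # ~ 1e-12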

10.2 Optimal interpolation and reconstruction

Consider a bounded domain \( \Omega\subseteq\mathbb{R}^d\) , a function \( f:\Omega\to\mathbb{R}\) , distinct points \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m\in \Omega\) , and corresponding function values \( y_i\mathrm{:}= f({\boldsymbol{x}}_i)\) . Our objective is to approximate \( f\) based solely on the data pairs \( ({\boldsymbol{x}}_i,y_i)\) , \( i=1,\dots,m\) . In this section, we will show that, under certain assumptions on \( f\) , ReLU neural networks can express an “optimal” reconstruction which also turns out to be an interpolant of the data.

10.2.1 Motivation

In the previous section, we observed that neural networks with \( m-1\in \mathbb{N}\) hidden neurons can interpolate \( m\) points for every reasonable activation function. However, not all interpolants are equally suitable for a given application. For instance, consider Figure 24 for a comparison between polynomial and piecewise affine interpolation on the unit interval.

Figure 24. Interpolation of eight points by a polynomial of degree seven and by a piecewise affine spline. The polynomial interpolation has a significantly larger derivative or Lipschitz constant than the piecewise affine interpolator.

The two interpolants exhibit rather different behaviors. In general, there is no way of determining which constitutes a better approximation to \( f\) . In particular, given our limited information about \( f\) , we cannot accurately reconstruct any additional features that may exist between interpolation points \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m\) . In accordance with Occam’s razor, it thus seems reasonable to assume that \( f\) does not exhibit extreme oscillations or behave erratically between interpolation points. As such, the piecewise interpolant appears preferable in this scenario. One way to formalize the assumption that \( f\) does not “exhibit extreme oscillations” is to assume that the Lipschitz constant

\[ \begin{align} {\rm Lip}(f):=\sup_{{\boldsymbol{x}}\neq {\boldsymbol{y}}}\frac{|f({\boldsymbol{x}})-f({\boldsymbol{y}})|}{\| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{}} \end{align} \]

(105)

of \( f\) is bounded by a fixed value \( M\in\mathbb{R}\) . Here \( \| \cdot \|_{}\) denotes an arbitrary fixed norm on \( \mathbb{R}^d\) .

How should we choose \( M\) ? For every function \( f:\Omega\to\mathbb{R}\) satisfying

\[ \begin{align} f({\boldsymbol{x}}_i)=y_i~\text{ for all }~ i=1,\dots,m, \end{align} \]

(106)

we have

\[ \begin{align} {\rm Lip}(f) = \sup_{{\boldsymbol{x}}\neq {\boldsymbol{y}}\in\Omega} \frac{|f({\boldsymbol{x}}) - f({\boldsymbol{y}})|}{\| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{}} \geq \sup_{i \neq j} \frac{|y_i - y_j|}{\| {\boldsymbol{x}}_i - {\boldsymbol{x}}_j \|_{}} =\mathrm{:} \tilde M. \end{align} \]

(107)

Because of this, we fix \( M\) as a real number greater than or equal to \( \tilde M\) for the remainder of our analysis.
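The quantity \( \tilde M\) in (107) is computable directly from the data; a minimal Python sketch (ours, assuming NumPy and the Euclidean norm) is the following.

import numpy as np

def minimal_lipschitz_bound(X, y):
    # M_tilde from (107): the largest difference quotient over the data pairs
    # (here with the Euclidean norm; any fixed norm on R^d could be used).
    m = len(y)
    quotients = [abs(y[i] - y[j]) / np.linalg.norm(X[i] - X[j])
                 for i in range(m) for j in range(i + 1, m)]
    return max(quotients)

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.0, 1.0, 0.5])
print(minimal_lipschitz_bound(X, y))   # 2.0, attained between the first two points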

10.2.2 Optimal reconstruction for Lipschitz continuous functions

The above considerations raise the following question: Given only the information that the function has Lipschitz constant at most \( M\) , what is the best reconstruction of \( f\) based on the data? We consider here the “best reconstruction” to be a function that minimizes the \( L^\infty\) -error in the worst case. Specifically, with

\[ \begin{align} {\rm Lip}_M(\Omega) \mathrm{:}=\{f:\Omega\to\mathbb{R}\,|\,{\rm Lip}(f)\le M\}, \end{align} \]

(108)

denoting the set of all functions with Lipschitz constant at most \( M\) , we want to solve the following problem:

Problem 1

We wish to find an element

\[ \begin{align} \Phi\in {\rm argmin}_{h:\Omega\to\mathbb{R}}~\sup_{\substack{f \in {\rm Lip}_M(\Omega)\\ \text{\( f\) satisfies (106)}}} ~\sup_{{\boldsymbol{x}}\in\Omega}|f({\boldsymbol{x}})-h({\boldsymbol{x}})|. \end{align} \]

(109)

The next theorem shows that a function \( \Phi\) as in (109) indeed exists. This \( \Phi\) not only allows for an explicit formula, it also belongs to \( {\rm{Lip}}_M(\Omega)\) and additionally interpolates the data. Hence, it is not just an optimal reconstruction, it is also an optimal interpolant. This theorem goes back to [2], which, in turn, is based on [3].

Theorem 17

Let \( m\) , \( d \in \mathbb{N}\) , \( \Omega\subseteq\mathbb{R}^d\) , \( f:\Omega\to\mathbb{R}\) , and let \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m \in \Omega\) , \( y_1,\dots,y_m\in \mathbb{R}\) satisfy (106) and (107) with \( \tilde{M} \ge 0\) . Further, let \( M\geq \tilde{M}\) .

Then, Problem 1 has at least one solution given by

\[ \begin{align} \Phi({\boldsymbol{x}}):=\frac{1}{2}(f_{\rm upper}({\boldsymbol{x}})+f_{\rm lower}({\boldsymbol{x}})) ~~ \text{ for } {\boldsymbol{x}} \in \Omega, \end{align} \]

(110)

where

\[ \begin{align*} f_{\rm upper}({\boldsymbol{x}})&\mathrm{:}= \min_{k=1,\dots,m}(y_k+M\| {\boldsymbol{x}}-{\boldsymbol{x}}_k \|_{})\\ f_{\rm lower}({\boldsymbol{x}})&\mathrm{:}= \max_{k=1,\dots,m}(y_k-M\| {\boldsymbol{x}}-{\boldsymbol{x}}_k \|_{}). \end{align*} \]

Moreover, \( \Phi\in{{\rm {Lip}}}_M(\Omega)\) and \( \Phi\) interpolates the data (i.e., it satisfies (106)).

Proof

First we claim that for all \( h_1\) , \( h_2\in{{\rm {Lip}}}_M(\Omega)\) it holds that \( \max\{h_1,h_2\}\in {\rm{Lip}}_M(\Omega)\) as well as \( \min\{h_1,h_2\}\in {\rm{Lip}}_M(\Omega)\) . Since \( \min\{h_1,h_2\}=-\max\{-h_1,-h_2\}\) , it suffices to show the claim for the maximum. We need to check that

\[ \begin{align} \frac{|\max\{h_1({\boldsymbol{x}}),h_2({\boldsymbol{x}})\}-\max\{h_1({\boldsymbol{y}}),h_2({\boldsymbol{y}})\}|}{\| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{}}\le M \end{align} \]

(111)

for all \( {\boldsymbol{x}}\neq{\boldsymbol{y}}\in\Omega\) . Fix \( {\boldsymbol{x}}\neq{\boldsymbol{y}}\) . Without loss of generality we assume that

\[ \begin{align*} \max\{h_1({\boldsymbol{x}}),h_2({\boldsymbol{x}})\}\ge \max\{h_1({\boldsymbol{y}}),h_2({\boldsymbol{y}})\}~\text{and}~ \max\{h_1({\boldsymbol{x}}),h_2({\boldsymbol{x}})\}=h_1({\boldsymbol{x}}). \end{align*} \]

If \( \max\{h_1({\boldsymbol{y}}),h_2({\boldsymbol{y}})\}=h_1({\boldsymbol{y}})\) then the numerator in (111) equals \( h_1({\boldsymbol{x}})-h_1({\boldsymbol{y}})\) which is bounded by \( M\| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{}\) . If \( \max\{h_1({\boldsymbol{y}}),h_2({\boldsymbol{y}})\}=h_2({\boldsymbol{y}})\) , then the numerator equals \( h_1({\boldsymbol{x}})-h_2({\boldsymbol{y}})\) which is bounded by \( h_1({\boldsymbol{x}})-h_1({\boldsymbol{y}})\le M\| {\boldsymbol{x}}-{\boldsymbol{y}} \|_{}\) . In either case (111) holds.

Clearly, \( {\boldsymbol{x}}\mapsto y_k-M\| {\boldsymbol{x}}-{\boldsymbol{x}}_k \|_{}\in{{\rm {Lip}}}_M(\Omega)\) for each \( k=1,\dots,m\) and thus \( f_{\rm{upper}}\) , \( f_{\rm{lower}}\in{{\rm Lip}}_M(\Omega)\) as well as \( \Phi\in{{\rm {Lip}}}_M(\Omega)\) .

Next we claim that for all \( f\in{{\rm {Lip}}}_M(\Omega)\) satisfying (106) it holds that

\[ \begin{align} f_{\rm lower}({\boldsymbol{x}})\le f({\boldsymbol{x}})\le f_{\rm upper}({\boldsymbol{x}})~\text{ for all } ~ {\boldsymbol{x}}\in\Omega. \end{align} \]

(112)

This is true since for every \( k\in\{1,\dots,m\}\) and \( {\boldsymbol{x}}\in\Omega\)

\[ \begin{align*} |y_k-f({\boldsymbol{x}})| = |f({\boldsymbol{x}}_k)-f({\boldsymbol{x}})|\le M \| {\boldsymbol{x}}-{\boldsymbol{x}}_k \|_{} \end{align*} \]

so that for all \( {\boldsymbol{x}}\in\Omega\)

\[ \begin{align*} f({\boldsymbol{x}})\le \min_{k=1,\dots,m}(y_k+M\| {\boldsymbol{x}}-{\boldsymbol{x}}_k \|_{}),~~ f({\boldsymbol{x}})\ge \max_{k=1,\dots,m}(y_k-M\| {\boldsymbol{x}}-{\boldsymbol{x}}_k \|_{}). \end{align*} \]

Since \( f_{\rm{upper}}\) , \( f_{\rm{lower}}\in{{\rm Lip}}_M(\Omega)\) satisfy (106), we conclude that for every \( h:\Omega\to\mathbb{R}\) it holds that

\[ \begin{align} \sup_{\substack{f\in{\rm Lip}_M(\Omega)\\ \text{\( f\) satisfies (106)}}}\sup_{{\boldsymbol{x}}\in\Omega}|f({\boldsymbol{x}})-h({\boldsymbol{x}})| &\ge \sup_{{\boldsymbol{x}}\in\Omega} \max\{|f_{\rm lower}({\boldsymbol{x}})-h({\boldsymbol{x}})|,|f_{\rm upper}({\boldsymbol{x}})-h({\boldsymbol{x}})|\}\nonumber \end{align} \]

(113)

\[ \begin{align} & \ge \sup_{{\boldsymbol{x}}\in\Omega} \frac{|f_{\rm lower}({\boldsymbol{x}}) - f_{\rm upper}({\boldsymbol{x}})|}{2}. \end{align} \]

(114)


Moreover, using (112),

\[ \begin{align} \sup_{\substack{f\in{\rm Lip}_M(\Omega)\\ \text{\( f\) satisfies (106)}}}\sup_{{\boldsymbol{x}}\in\Omega}|f({\boldsymbol{x}})-\Phi({\boldsymbol{x}})|&\le \sup_{{\boldsymbol{x}}\in\Omega} \max\{|f_{\rm lower}({\boldsymbol{x}})-\Phi({\boldsymbol{x}})|,|f_{\rm upper}({\boldsymbol{x}})-\Phi({\boldsymbol{x}})|\} \nonumber \end{align} \]

(117)

\[ \begin{align} &= \sup_{{\boldsymbol{x}}\in\Omega} \frac{|f_{\rm lower}({\boldsymbol{x}}) - f_{\rm upper}({\boldsymbol{x}})|}{2}. \end{align} \]

(118)


Finally, (113) and (117) imply that \( \Phi\) is a solution of Problem 1.

Figure 25 depicts \( f_{\rm{upper}}\) , \( f_{\rm{lower}}\) , and \( \Phi\) for the interpolation problem shown in Figure 24, while Figure 26 provides a two-dimensional example.

Figure 25. Interpolation of the points from Figure 24 with the optimal Lipschitz interpolant.
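Since \( f_{\rm upper}\) , \( f_{\rm lower}\) , and \( \Phi\) in (110) are given in closed form, they are straightforward to evaluate. The following Python sketch (an illustration of the formula, not an implementation from the text; the function names are ours) does so for the 1-norm and checks that \( \Phi\) interpolates the data.

import numpy as np

def optimal_reconstruction(X, y, M, norm=lambda z: np.linalg.norm(z, 1)):
    # Evaluate f_upper, f_lower and Phi from (110); assumes M >= M_tilde
    # from (107). Any fixed norm on R^d may be passed via `norm`.
    def f_upper(x):
        return min(yk + M * norm(x - xk) for xk, yk in zip(X, y))
    def f_lower(x):
        return max(yk - M * norm(x - xk) for xk, yk in zip(X, y))
    Phi = lambda x: 0.5 * (f_upper(x) + f_lower(x))
    return f_upper, f_lower, Phi

X = np.array([[0.0], [0.5], [1.0]])
y = np.array([0.0, 1.0, 0.5])
f_up, f_lo, Phi = optimal_reconstruction(X, y, M=2.0)
print([float(Phi(x)) for x in X])   # reproduces the data: [0.0, 1.0, 0.5]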

10.2.3 Optimal ReLU reconstructions

So far everything was valid with an arbitrary norm on \( \mathbb{R}^d\) . For the next theorem, we will restrict ourselves to the \( 1\) -norm \( \| {\boldsymbol{x}} \|_{1}=\sum_{j=1}^d|x_j|\) . Using the explicit formula of Theorem 17, we will now show the remarkable result that ReLU neural networks can exactly express an optimal reconstruction (in the sense of Problem 1) with a neural network whose size scales linearly in the product of the dimension \( d\) and the number of data points \( m\) . Additionally, the proof is constructive, thus allowing in principle for an explicit construction of the neural network without the need for training.

Figure 26. Two-dimensional example of the interpolation method of (110). From top left to bottom we see \( f_{\mathrm{upper}}\) , \( f_{\mathrm{lower}}\) , and \( \Phi\) . The interpolation points \( ({\boldsymbol{x}}_i, y_i)_{i=1}^6\) are marked with red crosses.

Theorem 18 (Optimal Lipschitz Reconstruction)

Let \( m\) , \( d \in \mathbb{N}\) , \( \Omega\subseteq\mathbb{R}^d\) , \( f:\Omega\to\mathbb{R}\) , and let \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m \in \Omega\) , \( y_1,\dots,y_m\in \mathbb{R}\) satisfy (106) and (107) with \( \tilde{M} > 0\) . Further, let \( M\geq \tilde{M}\) and let \( \| \cdot \|_{}=\| \cdot \|_{1}\) in (107) and (108).

Then, there exists a ReLU neural network \( \Phi \in {\rm{Lip}}_M(\Omega)\) that interpolates the data (i.e., satisfies (106)) and satisfies

\[ \begin{align*} \Phi \in {\rm argmin}_{h:\Omega\to\mathbb{R}} \sup_{\substack{f \in {\rm Lip}_M(\Omega)\\ \text{\( f\) satisfies (106)}}} \sup_{{\boldsymbol{x}}\in\Omega}|f({\boldsymbol{x}})-h({\boldsymbol{x}})|. \end{align*} \]

Moreover, \( {\rm{depth}}(\Phi)=O(\log(m))\) , \( {\rm{width}}(\Phi)=O(dm)\) and all weights of \( \Phi\) are bounded in absolute value by \( \max\{M,\| {\boldsymbol{y}} \|_{\infty}\}\) .

Proof

To prove the result, we simply need to show that the function in (110) can be expressed as a ReLU neural network with the size bounds described in the theorem. First we notice that there is a simple ReLU neural network that implements the 1-norm. It holds for all \( {\boldsymbol{x}} \in \mathbb{R}^d\) that

\[ \begin{align*} \| {\boldsymbol{x}} \|_{1} = \sum_{i=1}^d \left(\sigma(x_i) + \sigma(-x_i)\right). \end{align*} \]

Thus, there exists a ReLU neural network \( \Phi^{\| \cdot \|_{1}}\) such that for all \( {\boldsymbol{x}}\in\mathbb{R}^d\)

\[ \begin{align*} \mathrm{width}(\Phi^{\| \cdot \|_{1}}) = 2d, ~~ \mathrm{depth}(\Phi^{\| \cdot \|_{1}}) = 1, ~~ \Phi^{\| \cdot \|_{1}}({\boldsymbol{x}}) = \| {\boldsymbol{x}} \|_{1} \end{align*} \]

As a result, there exist ReLU neural networks \( \Phi_k:\mathbb{R}^d\to\mathbb{R}\) , \( k = 1, \dots, m\) , such that

\[ \begin{align*} \mathrm{width}(\Phi_k) = 2d, ~~ \mathrm{depth}(\Phi_k) = 1, ~~ \Phi_k({\boldsymbol{x}}) = y_k + M \| {\boldsymbol{x}} - {\boldsymbol{x}}_k \|_{1} \end{align*} \]

for all \( {\boldsymbol{x}} \in \mathbb{R}^d\) . Using the parallelization of neural networks introduced in Section 6.1.3, there exists a ReLU neural network \( \Phi_{\rm{all}}\mathrm{:}= (\Phi_1, \dots, \Phi_m)\colon \mathbb{R}^d \to \mathbb{R}^m\) such that

\[ \begin{align*} \mathrm{width}(\Phi_{\mathrm{all}} ) &= 4md, ~~ \mathrm{depth}(\Phi_{\mathrm{all}} ) = 1 \end{align*} \]

and

\[ \begin{align*} \Phi_{\mathrm{all}} ({\boldsymbol{x}}) &= (y_k + M \| {\boldsymbol{x}} - {\boldsymbol{x}}_k \|_{1})_{k=1}^m ~~ \text{ for all } {\boldsymbol{x}} \in \mathbb{R}^d. \end{align*} \]

Using Lemma 11, we can now find a ReLU neural network \( \Phi_{\mathrm{upper}}\) such that \( \Phi_{\mathrm{upper}}({\boldsymbol{x}}) = f_{\mathrm{upper}}({\boldsymbol{x}})\) for all \( {\boldsymbol{x}} \in \Omega\) , \( \mathrm{width}(\Phi_{\mathrm{upper}}) \leq \max\{16 m, 4md\}\) , and \( \mathrm{depth}(\Phi_{\mathrm{upper}}) \leq 1 + \log(m)\) .

Essentially the same construction yields a ReLU neural network \( \Phi_{\mathrm{lower}}\) with the respective properties. Lemma 9 then completes the proof.
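The starting point of the construction, the identity \( \| {\boldsymbol{x}} \|_{1}=\sum_{i=1}^d(\sigma(x_i)+\sigma(-x_i))\) and the networks \( \Phi_k({\boldsymbol{x}})=y_k+M\| {\boldsymbol{x}}-{\boldsymbol{x}}_k \|_{1}\) , can be checked numerically; a minimal Python sketch (ours, assuming NumPy):

import numpy as np

relu = lambda t: np.maximum(t, 0.0)

def norm1_relu(x):
    # Width-2d, depth-1 ReLU representation of the 1-norm used in the proof:
    # ||x||_1 = sum_i relu(x_i) + relu(-x_i).
    return np.sum(relu(x) + relu(-x))

def Phi_k(x, x_k, y_k, M):
    # One of the networks Phi_k(x) = y_k + M * ||x - x_k||_1.
    return y_k + M * norm1_relu(x - x_k)

x = np.array([1.5, -2.0, 0.25])
print(norm1_relu(x), np.linalg.norm(x, 1))   # both 3.75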

Bibliography and further reading

The universal interpolation theorem stated in this chapter is due to [1, Theorem 5.1]. There exist several earlier interpolation results, which were shown under stronger assumptions: In [4], the interpolation property is already linked with a rank condition on the matrix (104). However, no general conditions on the activation functions were formulated. In [5], the interpolation theorem is established under the assumption that the activation function \( \sigma\) is continuous and nondecreasing, \( \lim_{x \to -\infty} \sigma(x) = 0\) , and \( \lim_{x \to \infty} \sigma(x) = 1\) . This result was improved in [6], which dropped the nondecreasing assumption on \( \sigma\) .

The main idea of the optimal Lipschitz interpolation theorem in Section 10.2 is due to [2]. A neural network construction of Lipschitz interpolants, which however is not the optimal interpolant in the sense of Problem 1, is given in [7, Theorem 2.27].

Exercises

Exercise 32

Under the assumptions of Theorem 17, we define for \( x \in \Omega\) the set of nearest neighbors by

\[ I_x \mathrm{:}= {\rm argmin}_{i = 1, \dots, m} \| x_i - x \|. \]

The one-nearest-neighbor classifier \( f_{\rm{1NN}}\) is defined by

\[ \begin{align*} f_{\rm 1NN}(x) = \frac{1}{2} (\min_{i\in I_x} y_i + \max_{i\in I_x} y_i). \end{align*} \]

Let \( \Phi_M\) be the function in (110). Show that for all \( x \in \Omega\)

\[ \Phi_M(x) \to f_{\rm 1NN}(x)~~ \text{as } M \to \infty. \]

Exercise 33

Extend Theorem 18 to the \( \| \cdot \|_{\infty}\) -norm.

Hint: The resulting neural network will need to be deeper than the one of Theorem 18.

11 Training of neural networks

Up to this point, we have discussed the representation and approximation of certain function classes using neural networks. The second pillar of deep learning concerns the question of how to fit a neural network to given data, i.e., having fixed an architecture, how to find suitable weights and biases. This task amounts to minimizing a so-called objective function such as the empirical risk \( \hat{\mathcal{R}}_S\) in (3). Throughout this chapter we denote the objective function by

\[ \begin{align*} f :\mathbb{R}^n\to\mathbb{R}, \end{align*} \]

and interpret it as a function of all neural network weights and biases collected in a vector in \( \mathbb{R}^n\) . The goal is to (approximately) determine a minimizer, i.e., some \( {\boldsymbol{w}}_*\in\mathbb{R}^n\) satisfying

\[ \begin{align*} f ({\boldsymbol{w}}_*)\le f ({\boldsymbol{w}})~~\text{for all }{\boldsymbol{w}}\in\mathbb{R}^n. \end{align*} \]

Standard approaches primarily include variants of (stochastic) gradient descent. These are the focus of the present chapter, in which we discuss basic ideas and results in convex optimization using gradient-based algorithms. In Sections 11.1, 11.2, and 11.3, we explore gradient descent, stochastic gradient descent, and accelerated gradient descent, and provide convergence proofs for smooth and strongly convex objectives. Section 11.4 discusses adaptive step sizes and explains the core principles behind popular algorithms such as Adam. Finally, Section 11.5 introduces the backpropagation algorithm, which enables the efficient application of gradient-based methods to neural network training.

11.1 Gradient descent

The general idea of gradient descent is to start with some \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , and then apply sequential updates by moving in the direction of steepest descent of the objective function. Assume for the moment that \( f \in C^2(\mathbb{R}^n)\) , and denote the \( k\) th iterate by \( {\boldsymbol{w}}_k\) . Then

\[ \begin{align} f ({\boldsymbol{w}}_k+{\boldsymbol{v}})= f ({\boldsymbol{w}}_k)+{\boldsymbol{v}}^\top\nabla f ({\boldsymbol{w}}_k)+O(\| {\boldsymbol{v}} \|_{}^2)~~ \text{for }\| {\boldsymbol{v}} \|_{}^2 \to 0. \end{align} \]

(109)

This shows that the change in \( f \) around \( {\boldsymbol{w}}_k\) is locally described by the gradient \( \nabla f ({\boldsymbol{w}}_k)\) . For small \( {\boldsymbol{v}}\) the contribution of the second order term is negligible, and the direction \( {\boldsymbol{v}}\) along which the decrease of the risk is maximized equals the negative gradient \( -\nabla f ({\boldsymbol{w}}_k)\) .

Thus, \( -\nabla f ({\boldsymbol{w}}_k)\) is also called the direction of steepest descent. This leads to an update of the form

\[ \begin{align} {\boldsymbol{w}}_{k+1}\mathrm{:}= {\boldsymbol{w}}_k-h_k \nabla f ({\boldsymbol{w}}_k), \end{align} \]

(110)

where \( h_k>0\) is referred to as the step size or learning rate. We refer to this iterative algorithm as gradient descent.
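In pseudocode, the update (110) is a one-line loop. The following Python sketch (our illustration, assuming NumPy; the helper name gradient_descent is not from the text) runs it on a simple quadratic objective.

import numpy as np

def gradient_descent(grad_f, w0, h, num_steps):
    # Plain gradient descent, update (110), with a constant learning rate h.
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for _ in range(num_steps):
        w = w - h * grad_f(w)
        iterates.append(w.copy())
    return iterates

# Example: f(w) = 0.5 * w^T A w with A symmetric positive definite.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
grad_f = lambda w: A @ w
ws = gradient_descent(grad_f, w0=[2.0, -1.5], h=1.0 / 3.0, num_steps=20)
print(np.linalg.norm(ws[-1]))   # close to the minimizer w_* = 0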

Figure 30. Two examples of gradient descent as defined in (110). The red points represent the \( {\boldsymbol{w}}_k\) .

By (109) and (110) it holds (also see [144, Section 1.2])

\[ \begin{equation} f ({\boldsymbol{w}}_{k+1})= f ({\boldsymbol{w}}_k)-h_k\| \nabla f ({\boldsymbol{w}}_k) \|_{}^2+O(h_k^2), \end{equation} \]

(111)

so that if \( \nabla f ({\boldsymbol{w}}_k)\neq\boldsymbol{0}\) , a small enough step size \( h_k\) ensures that the algorithm decreases the value of the objective function. In practice, tuning the learning rate \( h_k\) can be a subtle issue as it should strike a balance between the following competing requirements:

  1. \( h_k\) needs to be sufficiently small so that the second-order term in (111) is not dominating, and the update (110) decreases the objective function.
  2. \( h_k\) should be large enough to ensure significant decrease of the objective function, which facilitates faster convergence of the algorithm.

A learning rate that is too high might overshoot the minimum, while a rate that is too low results in slow convergence. Common strategies include, in particular, constant learning rates (\( h_k=h\) for all \( k\in\mathbb{N}_0\) ), learning rate schedules such as decaying learning rates (\( h_k\searrow 0\) as \( k\to\infty\) ), and adaptive methods. For adaptive methods the algorithm dynamically adjusts \( h_k\) based on the values of \( f ({\boldsymbol{w}}_j)\) or \( \nabla f ({\boldsymbol{w}}_j)\) for \( j\le k\) .

11.1.1 Structural conditions and existence of minimizers

We start our analysis by discussing three key notions for analyzing gradient descent algorithms, beginning with an intuitive (but loose) geometric description. A continuously differentiable objective function \( f :\mathbb{R}^n\to\mathbb{R}\) will be called

  1. smooth if, at each \( {\boldsymbol{w}}\in\mathbb{R}^n\) , \( f \) is bounded above and below by a quadratic function that touches its graph at \( {\boldsymbol{w}}\) ,
  2. convex if, at each \( {\boldsymbol{w}}\in\mathbb{R}^n\) , \( f \) lies above its tangent at \( {\boldsymbol{w}}\) ,
  3. strongly convex if, at each \( {\boldsymbol{w}}\in\mathbb{R}^n\) , \( f \) lies above its tangent at \( {\boldsymbol{w}}\) plus a convex quadratic term.

These concepts are illustrated in Figure 33. We next give the precise mathematical definitions.

Figure 33. The graph of \( L\) -smooth functions lies between two quadratic functions at each point, see (112), the graph of convex function lies above the tangent at each point, see (116), and the graph of \( \mu\) -strongly convex functions lies above a convex quadratic function at each point, see (117).

Definition 20

Let \( n\in \mathbb{N}\) and \( L>0\) . A function \( f :\mathbb{R}^n\to\mathbb{R}\) is called \( L\) -smooth if \( f \in C^1(\mathbb{R}^n)\) and

\[ \begin{align} f ({\boldsymbol{v}})&\le f ({\boldsymbol{w}})+\left\langle \nabla f ({\boldsymbol{w}}), {\boldsymbol{v}}-{\boldsymbol{w}}\right\rangle_{} + \frac{L}{2}\| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{}^2~~\text{for all }{\boldsymbol{w}},{\boldsymbol{v}}\in\mathbb{R}^n, \end{align} \]

(112.a)

\[ \begin{align} f ({\boldsymbol{v}})&\ge f ({\boldsymbol{w}})+\left\langle \nabla f ({\boldsymbol{w}}), {\boldsymbol{v}}-{\boldsymbol{w}}\right\rangle_{} - \frac{L}{2}\| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{}^2~~\text{for all }{\boldsymbol{w}},{\boldsymbol{v}}\in\mathbb{R}^n. \end{align} \]

(112.b)

By definition, \( L\) -smooth functions satisfy the first of the three geometric properties listed above. In the literature, \( L\) -smoothness is often instead defined as Lipschitz continuity of the gradient

\[ \begin{align} \| \nabla f ({\boldsymbol{w}})-\nabla f ({\boldsymbol{v}}) \|_{} \le L \| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{}~~\text{for all }{\boldsymbol{w}},{\boldsymbol{v}}\in\mathbb{R}^n. \end{align} \]

(113)

Lemma 25

Let \( L>0\) . Then \( f \in C^1(\mathbb{R}^n)\) is \( L\) -smooth if and only if (113) holds.

Proof

We show that (112) implies (113). To this end assume first that \( f\in C^2(\mathbb{R}^n)\) , and that (113) does not hold. Then we can find \( {\boldsymbol{w}}\neq{\boldsymbol{v}}\) with

\[ \| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{} \sup_{\| {\boldsymbol{e}} \|_{}=1}\int_0^1 {\boldsymbol{e}}^\top\nabla^2 f ({\boldsymbol{v}}+t({\boldsymbol{w}}-{\boldsymbol{v}}))\frac{{\boldsymbol{w}}-{\boldsymbol{v}}}{\| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{}}\,\mathrm{d} t=\| \nabla f ({\boldsymbol{w}})-\nabla f ({\boldsymbol{v}}) \|_{} > L \| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{}, \]

where \( \nabla^2 f \in\mathbb{R}^{n\times n}\) denotes the Hessian. Since the Hessian is symmetric, this implies existence of \( {\boldsymbol{u}}\) , \( {\boldsymbol{e}}\in\mathbb{R}^n\) with \( \| {\boldsymbol{e}} \|_{}=1\) and \( |{\boldsymbol{e}}^\top\nabla^2 f ({\boldsymbol{u}}){\boldsymbol{e}}|>L\) . Without loss of generality

\[ \begin{equation} {\boldsymbol{e}}^\top\nabla^2 f ({\boldsymbol{u}}){\boldsymbol{e}}>L. \end{equation} \]

(114)

Then for \( h>0\) by Taylor’s formula

\[ f ({\boldsymbol{u}}+h{\boldsymbol{e}}) = f ({\boldsymbol{u}}) + h\left\langle \nabla f ({\boldsymbol{u}}), {\boldsymbol{e}}\right\rangle_{} + \int_0^h {\boldsymbol{e}}^\top\nabla^2 f ({\boldsymbol{u}}+t{\boldsymbol{e}}){\boldsymbol{e}} (h-t)\,\mathrm{d} t. \]

Continuity of \( t\mapsto {\boldsymbol{e}}^\top\nabla^2 f ({\boldsymbol{u}}+t{\boldsymbol{e}}){\boldsymbol{e}}\) and (114) implies that for \( h>0\) small enough

\[ \begin{align*} f ({\boldsymbol{u}}+h{\boldsymbol{e}}) &> f ({\boldsymbol{u}}) + h\left\langle \nabla f ({\boldsymbol{u}}), {\boldsymbol{e}}\right\rangle_{} + L\int_0^h (h-t)\,\mathrm{d} t\\ &= f ({\boldsymbol{u}}) + \left\langle \nabla f ({\boldsymbol{u}}), h{\boldsymbol{e}}\right\rangle_{} + \frac{L}{2}\| h{\boldsymbol{e}} \|_{}^2. \end{align*} \]

This contradicts (112.a).

Now let \( f \in C^1(\mathbb{R}^n)\) and assume again that (113) does not hold. Then for every \( \varepsilon>0\) and every compact set \( K\subseteq\mathbb{R}^n\) there exists \( f_{\varepsilon,K}\in C^2(\mathbb{R}^n)\) such that \( \| f-f_{\varepsilon,K} \|_{C^1(K)}<\varepsilon\) . By choosing \( \varepsilon>0\) sufficiently small and \( K\) sufficiently large, it follows that \( f_{\varepsilon,K}\) violates (113). Consequently, by the previous argument, \( f_{\varepsilon,K}\) must also violate (112), which in turn implies that \( f\) does not satisfy (112) either.

Finally, the fact that (113) implies (112) is left as Exercise 34.

Definition 21

Let \( n \in \mathbb{N}\) . A function \( f :\mathbb{R}^n\to\mathbb{R}\) is called convex if and only if

\[ \begin{align} f (\lambda{\boldsymbol{w}}+(1-\lambda){\boldsymbol{v}}) \le \lambda f ({\boldsymbol{w}})+(1-\lambda) f ({\boldsymbol{v}}) \end{align} \]

(115)

for all \( {\boldsymbol{w}}\) , \( {\boldsymbol{v}}\in\mathbb{R}^n\) , \( \lambda\in (0,1)\) .

In case \( f \) is continuously differentiable, this is equivalent to the second geometric property listed above, as the next lemma shows. The proof is left as Exercise 35.

Lemma 26

Let \( f \in C^1(\mathbb{R}^n)\) . Then \( f \) is convex if and only if

\[ \begin{align} f ({\boldsymbol{v}})\ge f ({\boldsymbol{w}})+\left\langle \nabla f ({\boldsymbol{w}}), {\boldsymbol{v}}-{\boldsymbol{w}}\right\rangle_{} ~~ \text{for all }{\boldsymbol{w}},{\boldsymbol{v}}\in\mathbb{R}^n. \end{align} \]

(116)

The concept of convexity is strengthened by so-called strong convexity, which requires an additional positive quadratic term on the right-hand side of (116), and thus corresponds to the third geometric property listed above by definition.

Definition 22

Let \( n \in \mathbb{N}\) and \( \mu>0\) . A function \( f :\mathbb{R}^n\to\mathbb{R}\) is called \( \mu\) -strongly convex if \( f\in C^1(\mathbb{R}^n)\) and

\[ \begin{align} f ({\boldsymbol{v}}) \ge f ({\boldsymbol{w}})+\left\langle \nabla f ({\boldsymbol{w}}), {\boldsymbol{v}}-{\boldsymbol{w}}\right\rangle_{}+ \frac{\mu}{2}\| {\boldsymbol{v}}-{\boldsymbol{w}} \|_{}^2 ~~\text{for all }{\boldsymbol{w}},{\boldsymbol{v}}\in\mathbb{R}^n. \end{align} \]

(117)
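A standard example illustrating all three notions (added here for concreteness; it is not part of the definitions above) is the quadratic objective

\[ \begin{align*} f ({\boldsymbol{w}})\mathrm{:}= \tfrac{1}{2}{\boldsymbol{w}}^\top{\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{b}}^\top{\boldsymbol{w}},~~{\boldsymbol{w}}\in\mathbb{R}^n, \end{align*} \]

with \( {\boldsymbol{A}}\in\mathbb{R}^{n\times n}\) symmetric positive definite and \( {\boldsymbol{b}}\in\mathbb{R}^n\) . Then \( \nabla f ({\boldsymbol{w}})={\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{b}}\) and a direct computation shows

\[ \begin{align*} f ({\boldsymbol{v}})= f ({\boldsymbol{w}})+\left\langle \nabla f ({\boldsymbol{w}}), {\boldsymbol{v}}-{\boldsymbol{w}}\right\rangle_{}+\tfrac{1}{2}({\boldsymbol{v}}-{\boldsymbol{w}})^\top{\boldsymbol{A}}({\boldsymbol{v}}-{\boldsymbol{w}}), \end{align*} \]

so that (112) and (117) hold with \( L=\lambda_{\max}({\boldsymbol{A}})\) and \( \mu=\lambda_{\min}({\boldsymbol{A}})\) , the largest and smallest eigenvalues of \( {\boldsymbol{A}}\) , and the unique minimizer is \( {\boldsymbol{w}}_*={\boldsymbol{A}}^{-1}{\boldsymbol{b}}\) .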

A convex function need not be bounded from below (e.g., \( w\mapsto w\) ) and thus need not have any (global) minimizers. Even if it is bounded from below, there need not exist minimizers (e.g., \( w\mapsto \exp(w)\) ). However, we have the following statement.

Lemma 27

Let \( f :\mathbb{R}^n\to\mathbb{R}\) . If \( f \) is

  1. convex, then the set of minimizers of \( f \) is convex and has cardinality \( 0\) , \( 1\) , or \( \infty\) ,
  2. \( \mu\) -strongly convex, then \( f \) has exactly one minimizer.

Proof

Let \( f \) be convex, and assume that \( {\boldsymbol{w}}_*\) and \( {\boldsymbol{v}}_*\) are two minimizers of \( f \) . Then every convex combination \( \lambda{\boldsymbol{w}}_*+(1-\lambda){\boldsymbol{v}}_*\) , \( \lambda\in [0,1]\) , is also a minimizer due to (115). This shows the first claim.

Now let \( f \) be \( \mu\) -strongly convex. Then (117) implies that \( f \) is bounded from below by a convex quadratic function, and in particular \( f ({\boldsymbol{v}})\to\infty\) as \( \| {\boldsymbol{v}} \|_{}\to\infty\) . Since \( f \) is also continuous, there exists at least one minimizer \( {\boldsymbol{w}}_*\) , and \( \nabla f ({\boldsymbol{w}}_*)=\boldsymbol{0}\) . By (117) we then have \( f ({\boldsymbol{v}})> f ({\boldsymbol{w}}_*)\) for every \( {\boldsymbol{v}}\neq{\boldsymbol{w}}_*\) .

11.1.2 Convergence of gradient descent

As announced before, to analyze convergence, we focus on \( \mu\) -strongly convex and \( L\) -smooth objectives only; we refer to the bibliography section for further reading under weaker assumptions. The following theorem, which establishes linear convergence of gradient descent, is a standard result (see, e.g., [145, 146, 147]). The proof presented here is taken from [148, Theorem 3.6].

Recall that a sequence \( e_k\) is said to converge linearly to \( 0\) , if and only if there exist constants \( C>0\) and \( c\in [0,1)\) such that

\[ \begin{align*} e_k\le Cc^k~~\text{for all } k\in\mathbb{N}_0. \end{align*} \]

The constant \( c\) is also referred to as the rate of convergence. Before giving the statement, we first note that comparing (112.a) and (117) it necessarily holds \( L\ge\mu\) and therefore \( \kappa\mathrm{:}= L/\mu\ge 1\) . This term, known as the condition number of \( f \) , determines the rate of convergence.

Theorem 19

Let \( n \in \mathbb{N}\) and \( L\geq \mu > 0\) . Let \( f \colon \mathbb{R}^n \to \mathbb{R}\) be \( L\) -smooth and \( \mu\) -strongly convex. Further, let \( h_k=h\in (0,1/L]\) for all \( k\in\mathbb{N}_0\) , let \( ({\boldsymbol{w}}_k)_{k=0}^\infty \subseteq \mathbb{R}^n\) be defined by (110), and let \( {\boldsymbol{w}}_*\) be the unique minimizer of \( f \) .

Then, \( f ({\boldsymbol{w}}_k)\to f ({\boldsymbol{w}}_*)\) and \( {\boldsymbol{w}}_k\to{\boldsymbol{w}}_*\) converge linearly for \( k \to \infty\) . For the specific choice \( h=1/L\) it holds for all \( k\in\mathbb{N}\)

\[ \begin{align} \| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2&\le \Big(1-\frac{\mu}{L}\Big)^{k}\| {\boldsymbol{w}}_0-{\boldsymbol{w}}_* \|_{}^2 \end{align} \]

(118.a)

\[ \begin{align} f ({\boldsymbol{w}}_k)- f ({\boldsymbol{w}}_*)&\le \frac{L}{2} \Big(1-\frac{\mu}{L}\Big)^{k}\| {\boldsymbol{w}}_0-{\boldsymbol{w}}_* \|_{}^2. \end{align} \]

(118.b)

Proof

It suffices to show (118.a), since (118.b) follows directly by (118.a) and (112.a) because \( \nabla f ({\boldsymbol{w}}_*)=0\) . The case \( k=0\) is trivial, so let \( k\in\mathbb{N}\) .

Expanding \( {\boldsymbol{w}}_{k}={\boldsymbol{w}}_{k-1}-h\nabla f ({\boldsymbol{w}}_{k-1})\) and using \( \mu\) -strong convexity (117)

\[ \begin{align} \| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2 &= \| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2-2h\left\langle \nabla f ({\boldsymbol{w}}_{k-1}), {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_*\right\rangle_{} + h^2\| \nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2\nonumber \end{align} \]

(119)

\[ \begin{align} &\le (1-\mu h)\| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2-2h\cdot ( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_*)) + h^2\| \nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2. \end{align} \]

(120)

To bound the sum of the last two terms, we first use (112.a) to get

\[ \begin{align*} f ({\boldsymbol{w}}_{k})&\le f ({\boldsymbol{w}}_{k-1}) + \left\langle \nabla f ({\boldsymbol{w}}_{k-1}), -h\nabla f ({\boldsymbol{w}}_{k-1})\right\rangle_{}+\frac{L}{2}\| h\nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2\\ &= f ({\boldsymbol{w}}_{k-1})+\left(\frac{L}{2}-\frac{1}{h}\right)h^2\| \nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2 \end{align*} \]

so that for \( h< 2/L\)

\[ \begin{align*} h^2\| \nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2&\le \frac{1}{1/h-L/2}( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_{k}))\\ &\le \frac{1}{1/h-L/2}( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_{*})), \end{align*} \]

and therefore

\[ \begin{align*} &-2h\cdot ( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_*)) + h^2\| \nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2\\ &~~\le \Big(-2h+\frac{1}{1/h-L/2}\Big)( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_{*})). \end{align*} \]

If \( 2h\ge 1/(1/h-L/2)\) , which is equivalent to \( h\le 1/L\) , then the last term is less than or equal to zero.

Hence (119) implies for \( h\le 1/L\)

\[ \begin{align*} \| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2 \le (1-\mu h)\| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2\le\dots\le (1-\mu h)^{k} \| {\boldsymbol{w}}_0-{\boldsymbol{w}}_* \|_{}^2. \end{align*} \]

This concludes the proof.
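The contraction factor \( 1-\mu/L\) in (118.a) can be observed numerically; a small Python sketch (ours, assuming NumPy) checks the bound for the quadratic \( f ({\boldsymbol{w}})=\frac{1}{2}{\boldsymbol{w}}^\top{\boldsymbol{A}}{\boldsymbol{w}}\) with eigenvalues \( \mu=1\) , \( L=3\) , and step size \( h=1/L\) .

import numpy as np

# Check the bound (118.a) on f(w) = 0.5 * w^T A w with mu = 1, L = 3, h = 1/L.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
mu, L, h = 1.0, 3.0, 1.0 / 3.0
w, w_star = np.array([2.0, -1.5]), np.zeros(2)

lhs, rhs = [], []
for k in range(1, 21):
    w = w - h * (A @ w)                               # update (110)
    lhs.append(np.sum((w - w_star) ** 2))             # ||w_k - w_*||^2
    rhs.append((1 - mu / L) ** k * np.sum(np.array([2.0, -1.5]) ** 2))
print(all(l <= r + 1e-12 for l, r in zip(lhs, rhs)))  # True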

Remark 11 (Convex objective functions)

Let \( f \colon \mathbb{R}^n \to \mathbb{R}\) be a convex and \( L\) -smooth function with a minimizer at \( {\boldsymbol{w}}_*\) . As shown in Lemma 27, the minimizer need not be unique, so we cannot expect \( {\boldsymbol{w}}_k \to {\boldsymbol{w}}_*\) in general. However, the objective values still converge. Specifically, under these assumptions, the following holds [145, Theorem 2.1.14, Corollary 2.1.2]: If \( h_k = h \in (0,2/L)\) for all \( k \in \mathbb{N}_0\) and \( ({\boldsymbol{w}}_k)_{k=0}^\infty \subseteq \mathbb{R}^n\) is generated by (110), then

\[ \begin{align*} f ({\boldsymbol{w}}_k) - f ({\boldsymbol{w}}_*) = O(k^{-1}) ~ \text{as } k \to \infty. \end{align*} \]

11.2 Stochastic gradient descent

We next discuss a stochastic variant of gradient descent. The idea, which originally goes back to Robbins and Monro [149], is to replace the gradient \( \nabla f ({\boldsymbol{w}}_{k})\) in (110) by a random variable that we denote by \( {\boldsymbol{G}}_{k}\) . We interpret \( {\boldsymbol{G}}_k\) as an approximation to \( \nabla f ({\boldsymbol{w}}_{k})\) . More precisely, throughout we assume that given \( {\boldsymbol{w}}_{k}\) , \( {\boldsymbol{G}}_k\) is an unbiased estimator of \( \nabla f ({\boldsymbol{w}}_k)\) conditionally independent of \( {\boldsymbol{w}}_0,\dots,{\boldsymbol{w}}_{k-1}\) (see Appendix 18.3.3), so that

\[ \begin{align} \mathbb{E}[{\boldsymbol{G}}_{k}|{\boldsymbol{w}}_{k}]=\mathbb{E}[{\boldsymbol{G}}_{k}|{\boldsymbol{w}}_{k},\dots,{\boldsymbol{w}}_0] = \nabla f ({\boldsymbol{w}}_{k}). \end{align} \]

(121)

After choosing some initial value \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , the update rule becomes

\[ \begin{align} {\boldsymbol{w}}_{k+1}\mathrm{:}= {\boldsymbol{w}}_{k}-{h}_k {\boldsymbol{G}}_{k}, \end{align} \]

(122)

where \( {h}_k>0\) denotes again the step size. Unlike in Section 11.1, we focus here on \( k\) -dependent step sizes \( h_k\) .

11.2.1 Motivation and decreasing learning rates

The next example motivates the algorithm in the standard setting of empirical risk minimization; see, e.g., [25, Chapter 8].

Example 5 (Empirical risk minimization)

Suppose we have a data sample \( S\mathrm{:}= ({\boldsymbol{x}}_j,y_j)_{j=1}^{m}\) , where \( y_j\in\mathbb{R}\) is the label corresponding to the data point \( {\boldsymbol{x}}_j\in\mathbb{R}^d\) . Using the square loss, we wish to fit a neural network \( \Phi(\cdot,{\boldsymbol{w}}):\mathbb{R}^d\to\mathbb{R}\) depending on parameters (i.e., weights and biases) \( {\boldsymbol{w}}\in\mathbb{R}^n\) , such that the empirical risk in (3)

\[ \begin{align*} f ({\boldsymbol{w}})\mathrm{:}=\hat{\mathcal{R}}_S({\boldsymbol{w}})= \frac{1}{m}\sum_{j=1}^{{m}}(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-y_j)^2, \end{align*} \]

is minimized. Performing one step of gradient descent requires the computation of

\[ \begin{align} \nabla f ({\boldsymbol{w}})=\frac{2}{m}\sum_{j=1}^{{m}}(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-y_j)\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}), \end{align} \]

(123)

and thus the computation of \( {{m}}\) gradients of the neural network \( \Phi\) . For large \( {{m}}\) (in practice \( {{m}}\) can be in the millions or even larger), this computation might be infeasible.

To reduce computational cost, we replace the full gradient (123) by the stochastic gradient

\[ \begin{align*} {\boldsymbol{G}}\mathrm{:}= 2(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-y_j)\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}) \end{align*} \]

where \( j\sim{\rm{uniform}}(1,\dots,{{m}})\) is a random variable with uniform distribution on the discrete set \( \{1,\dots,{{m}}\}\) . Then

\[ \begin{align*} \mathbb{E}[{\boldsymbol{G}}]=\frac{2}{m}\sum_{j=1}^{{m}}(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-y_j)\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})=\nabla f ({\boldsymbol{w}}), \end{align*} \]

meaning that \( {\boldsymbol{G}}\) is an unbiased estimator of \( \nabla f ({\boldsymbol{w}})\) . Importantly, computing (a realization of) \( {\boldsymbol{G}}\) merely requires a single gradient evaluation of the neural network.

More generally, one can choose mini-batches of size \( m_b\) (where \( m_b\ll {{m}}\) ) and let

\[ {\boldsymbol{G}}=\frac{2}{m_b}\sum_{j\in J}(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}) - y_j)\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}), \]

where \( J\) is a random subset of \( \{1,\dots,{{m}}\}\) of cardinality \( m_b\) . A larger mini-batch size reduces the variance of \( {\boldsymbol{G}}\) (thus giving a more robust estimate of the true gradient) but increases the computational cost.

A related common variant is the following: Let \( m_b k=m\) for \( m_b\) , \( k\) , \( m\in\mathbb{N}\) , i.e., the number of data points \( m\) is a \( k\) -fold multiple of the mini-batch size \( m_b\) . In each epoch, first a random partition \( \dot\bigcup_{i=1}^k J_i=\{1,\dots,m\}\) with \( |J_i|=m_b\) for each \( i\) , is determined. Then for each \( i=1,\dots,k\) , the weights are updated with the gradient estimate

\[ \begin{align*} \frac{2}{m_b}\sum_{j\in J_i}(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-y_j)\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}). \end{align*} \]

Hence, in one epoch (corresponding to \( k\) updates of the weights), the algorithm sweeps through the whole dataset, and each data point is used exactly once.
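A minimal Python sketch of this epoch-wise scheme (our illustration; model and grad_model are hypothetical placeholders for \( \Phi\) and \( \nabla_{\boldsymbol{w}}\Phi\) , and NumPy is assumed):

import numpy as np

def epoch_minibatch_indices(m, m_b, rng):
    # One epoch of the scheme above: a random partition of {1,...,m}
    # into k = m / m_b disjoint mini-batches of size m_b.
    perm = rng.permutation(m)
    return [perm[i:i + m_b] for i in range(0, m, m_b)]

def minibatch_gradient(w, X, y, J, model, grad_model):
    # Unbiased estimate of the full gradient (123) over the mini-batch J.
    # `model(x, w)` and `grad_model(x, w)` stand in for Phi and its w-gradient.
    g = np.zeros_like(w)
    for j in J:
        g += 2.0 * (model(X[j], w) - y[j]) * grad_model(X[j], w)
    return g / len(J)

rng = np.random.default_rng(0)
print(epoch_minibatch_indices(m=8, m_b=2, rng=rng))   # 4 disjoint batches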

Figure 35. Constant learning rate for SGD.
Figure 36. Decreasing learning rate for SGD.
Figure 34. 20 steps of gradient descent (GD) and stochastic gradient descent (SGD) for a strongly convex quadratic objective function. GD was computed with a constant learning rate, while SGD was computed with either a constant learning rate (\( h_k=h\) ) or a decreasing learning rate (\( h_k \sim 1/k\) ).
Figure 38. \( {\boldsymbol{w}}_k\) far from \( {\boldsymbol{w}}_*\)
Figure 39. \( {\boldsymbol{w}}_k\) close to \( {\boldsymbol{w}}_*\)
Figure 37. The update vector \( -h_k{\boldsymbol{G}}_k\) (black) is a draw from a random variable with expectation \( -h_k \nabla f ({\boldsymbol{w}}_k)\) (blue). In order to get convergence, the variance of the update vector should decrease as \( {\boldsymbol{w}}_k\) approaches the minimizer \( {\boldsymbol{w}}_*\) . Otherwise, convergence will in general not hold, see Figure 34 (a).

Let \( {\boldsymbol{w}}_k\) be generated by (122). Using \( L\) -smoothness (113) we then have [150, Lemma 4.2]

\[ \begin{align*} \mathbb{E}[ f ({\boldsymbol{w}}_{k+1})|{\boldsymbol{w}}_k]- f ({\boldsymbol{w}}_{k}) &\le \mathbb{E}[\left\langle \nabla f ({\boldsymbol{w}}_k), {\boldsymbol{w}}_{k+1}-{\boldsymbol{w}}_k\right\rangle_{}|{\boldsymbol{w}}_k] +\frac{L}{2}\mathbb{E}[\| {\boldsymbol{w}}_{k+1}-{\boldsymbol{w}}_k \|_{}^2|{\boldsymbol{w}}_k]\nonumber\\ &= -h_k \| \nabla f ({\boldsymbol{w}}_k) \|_{}^2 + \frac{L}{2}\mathbb{E}[\| h_k{\boldsymbol{G}}_k \|_{}^2|{\boldsymbol{w}}_k]. \end{align*} \]

For the objective function to decrease in expectation, the first term \( h_k\| \nabla f ({\boldsymbol{w}}_k) \|_{}^2\) should dominate the second term \( \frac{L}{2}\mathbb{E}[\| h_k{\boldsymbol{G}}_k \|_{}^2|{\boldsymbol{w}}_k]\) . As \( {\boldsymbol{w}}_k\) approaches the minimum, we have \( \| \nabla f ({\boldsymbol{w}}_k) \|_{}\to 0\) , which suggests that \( \mathbb{E}[\| h_k{\boldsymbol{G}}_k \|_{}^2|{\boldsymbol{w}}_k]\) should also decrease over time.

This is illustrated in Figure 34 (a), which shows the progression of stochastic gradient descent (SGD) with a constant learning rate, \( h_k = h\) , on a quadratic objective function and using artificially added gradient noise, such that \( \mathbb{E}[\| {\boldsymbol{G}}_k \|_{}^2|{\boldsymbol{w}}_k]\) does not tend to zero. The stochastic updates in (122) cause fluctuations in the optimization path. Since these fluctuations do not decrease as the algorithm approaches the minimum, the iterates will not converge. Instead they stabilize within a neighborhood of the minimum and oscillate around it, see, e.g., [148, Theorem 9.8]. In practice, this might yield a good enough approximation to \( {\boldsymbol{w}}_*\) . To achieve convergence in the limit, however, the variance of the update vector \( -h_k {\boldsymbol{G}}_k\) must decrease over time. This can be achieved either by reducing the variance of \( {\boldsymbol{G}}_k\) , for example through larger mini-batches (cf. Example 5), or, more commonly, by decreasing the step size \( h_k\) as \( k\) progresses. Figure 34 (b) shows this for \( h_k \sim 1/k\) ; also see Figure 37.
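The qualitative effect can be reproduced on a one-dimensional toy problem; the following Python sketch (ours, not the exact setup of Figure 34) runs SGD on \( f (w)=\frac{1}{2}w^2\) with artificial gradient noise, once with a constant and once with a decreasing step size.

import numpy as np

# SGD on f(w) = 0.5 * w^2 with artificial gradient noise, cf. Figure 34.
rng = np.random.default_rng(1)

def sgd(step, num_steps=2000, noise=1.0, w0=5.0):
    w = w0
    for k in range(num_steps):
        g = w + noise * rng.normal()   # unbiased estimate of grad f(w) = w, cf. (121)
        w = w - step(k) * g            # update (122)
    return w

w_const = sgd(step=lambda k: 0.1)            # stays in a noise-dominated neighborhood of w_* = 0
w_decay = sgd(step=lambda k: 1.0 / (k + 1))  # fluctuations shrink as h_k decays
print(abs(w_const), abs(w_decay))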

11.2.2 Convergence of stochastic gradient descent

Since \( {\boldsymbol{w}}_k\) in (122) is a random variable by construction, a convergence statement can only be stochastic, e.g., in expectation or with high probability. The next theorem, which is based on [151, Theorem 3.2] and [150, Theorem 4.7], concentrates on the former. The result is stated under assumption (126), which bounds the second moments of the stochastic gradients \( {\boldsymbol{G}}_k\) and ensures that they grow at most linearly with \( \| \nabla f ({\boldsymbol{w}}_k) \|_{}^2\) . Moreover, the analysis relies on the following decreasing step sizes

\[ \begin{align} {h}_k\mathrm{:}= \min\Big(\frac{\mu}{L^2\gamma},\frac{1}{\mu} \frac{(k+1)^2-k^2}{(k+1)^2}\Big)~~ \text{for all }k\in\mathbb{N}_0, \end{align} \]

(124)

from [151]. Note that \( h_k=O(k^{-1})\) as \( k\to\infty\) , since

\[ \begin{align} \frac{(k+1)^2-k^2}{(k+1)^2}= \frac{2k+1}{(k+1)^2}=\frac{2}{(k+1)}+O(k^{-2}). \end{align} \]

(125)

This learning rate decay will allow us to establish a convergence rate. However, in practice, a less aggressive decay, such as \( h_k = O(k^{-1/2})\) , or heuristic methods that decrease the learning rate based on the observed convergence behaviour may be preferred.
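For concreteness, here is a small sketch of the schedule (124) (the values of \( \mu\) , \( L\) and \( \gamma\) are assumed for illustration); for large \( k\) the second argument of the minimum is the active one, so that \( h_k=O(k^{-1})\) as in (125).

```python
# Sketch of the step-size schedule (124); mu, L, gamma are assumed problem constants.
def step_size(k, mu, L, gamma):
    return min(mu / (L**2 * gamma), (1.0 / mu) * ((k + 1)**2 - k**2) / (k + 1)**2)

mu, L, gamma = 1.0, 3.0, 1.0
print([round(step_size(k, mu, L, gamma), 4) for k in (0, 1, 10, 100, 1000)])
```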

Theorem 20

Let \( n \in \mathbb{N}\) and \( L\) , \( \mu\) , \( \gamma > 0\) . Let \( f \colon \mathbb{R}^n \to \mathbb{R}\) be \( L\) -smooth and \( \mu\) -strongly convex. Fix \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , let \( (h_k)_{k=0}^\infty\) be as in (124) and let \( ({\boldsymbol{G}}_{k})_{k=0}^\infty\) , \( ({\boldsymbol{w}}_k)_{k=1}^\infty\) be sequences of random variables as in (121) and (122). Assume that for all \( k\in\mathbb{N}_0\)

\[ \begin{align} \mathbb{E}[\| {\boldsymbol{G}}_{k} \|_{}^2|{\boldsymbol{w}}_{k}]\le\gamma (1+\| \nabla f ({\boldsymbol{w}}_k) \|_{}^2). \end{align} \]

(126)

Then there exists a constant \( C=C(\gamma,\mu,L)\) such that for all \( k\in\mathbb{N}\)

\[ \begin{align*} \mathbb{E}[\| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2]&\le \frac{C}{k}, \nonumber\\ \mathbb{E}[ f ({\boldsymbol{w}}_k)]- f ({\boldsymbol{w}}_*)&\le \frac{C}{k}. \end{align*} \]

Proof

Using (121) and (126) it holds for \( k\ge 1\)

\[ \begin{align*} &\mathbb{E}[\| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-1}]\\ &~=\| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2 -2 h_{k-1}\mathbb{E}[\left\langle {\boldsymbol{G}}_{k-1}, {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_*\right\rangle_{}|{\boldsymbol{w}}_{k-1}]+h_{k-1}^2 \mathbb{E}[\| {\boldsymbol{G}}_{k-1} \|_{}^2|{\boldsymbol{w}}_{k-1}]\nonumber\\ &~\le \| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2 -2 h_{k-1}\left\langle \nabla f ({\boldsymbol{w}}_{k-1}), {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_*\right\rangle_{}+h_{k-1}^2 \gamma(1+\| \nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2). \end{align*} \]

By \( \mu\) -strong convexity (117)

\[ \begin{align*} &-2 h_{k-1}\left\langle \nabla f ({\boldsymbol{w}}_{k-1}), {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_*\right\rangle_{}\\ &~~\le -\mu h_{k-1}\| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2-2h_{k-1}\cdot ( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_*)). \end{align*} \]

Moreover, \( L\) -smoothness, \( \mu\) -strong convexity and \( \nabla f ({\boldsymbol{w}}_*)=\boldsymbol{0}\) imply

\[ \| \nabla f ({\boldsymbol{w}}_{k-1}) \|_{}^2\le L^2\| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2 \le \frac{2L^2}{\mu}( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_*)). \]

Combining the previous estimates we arrive at

\[ \begin{align*} \mathbb{E}[\| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-1}] \le &(1-\mu h_{k-1})\| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2+h_{k-1}^2\gamma \\ &+2h_{k-1}\Big(\frac{L^2\gamma}{\mu}h_{k-1}-1\Big)( f ({\boldsymbol{w}}_{k-1})- f ({\boldsymbol{w}}_*)). \end{align*} \]

The choice of \( h_{k-1}\le \mu/(L^2\gamma)\) and the fact that (cf. (121))

\[ \begin{align*} \mathbb{E}[\| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-1},{\boldsymbol{w}}_{k-2}]=\mathbb{E}[\| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-1}] \end{align*} \]

thus give

\[ \begin{align*} \mathbb{E}[\| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-1}]\le (1-\mu h_{k-1}) \mathbb{E}[\| {\boldsymbol{w}}_{k-1}-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-2}] +h_{k-1}^2\gamma. \end{align*} \]

With \( e_0\mathrm{:}= \| {\boldsymbol{w}}_0-{\boldsymbol{w}}_* \|_{}^2\) and \( e_k\mathrm{:}= \mathbb{E}[\| {\boldsymbol{w}}_{k}-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-1}]\) for \( k\ge 1\) we have found

\[ \begin{align*} e_{k}&\le (1-\mu h_{k-1})e_{k-1}+{h}_{k-1}^2\gamma\nonumber\\ &\le (1-\mu{h}_{k-1})((1-\mu{h}_{k-2})e_{k-2}+{h}_{k-2}^2\gamma)+{h}_{k-1}^2\gamma\nonumber\\ &\le\dots\le e_0\prod_{j=0}^{k-1}(1-\mu{h}_j)+\gamma\sum_{j=0}^{k-1}{h}_j^2\prod_{i=j+1}^{k-1}(1-\mu{h}_i). \end{align*} \]

Note that there exists \( k_0\in\mathbb{N}\) such that by (124) and (125)

\[ h_i= \frac{1}{\mu}\frac{(i+1)^2-i^2}{(i+1)^2}~ \text{ for all } ~ i\ge k_0. \]

Hence there exists \( \tilde C\) depending on \( \gamma\) , \( \mu\) , \( L\) (but independent of \( k\) ) such that

\[ \begin{align*} \prod_{i=j}^{k-1}(1-\mu{h}_i) \le \tilde C \prod_{i=j}^{k-1}\frac{i^2}{(i+1)^2} =\tilde C \frac{j^2}{k^2}~~\text{ for all }0\le j\le k \end{align*} \]

and thus

\[ \begin{align*} e_k &\le \tilde C \frac{\gamma}{\mu^2}\sum_{j=0}^{k-1} \left(\frac{(j+1)^2-j^2}{(j+1)^2}\right)^2\frac{(j+1)^2}{{k^2}}\nonumber\\ &\le \frac{\tilde C\gamma}{\mu^2} \frac{1}{{k^2}}\sum_{j=0}^{k-1} \underbrace{\frac{(2j+1)^2}{(j+1)^2}}_{\le 4}\nonumber\\ &\le \frac{\tilde C\gamma}{\mu^2} \frac{4k}{{k^2}}= \frac{C}{k}, \end{align*} \]

with \( C\mathrm{:}= 4\tilde C\gamma/\mu^2\) .

Since \( \mathbb{E}[\| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2]\) is the expectation of \( e_k=\mathbb{E}[\| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2|{\boldsymbol{w}}_{k-1}]\) with respect to the random variable \( {\boldsymbol{w}}_{k-1}\) , and \( C/k\) is a constant independent of \( {\boldsymbol{w}}_{k-1}\) , we obtain

\[ \begin{align*} \mathbb{E}[\| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2]\le \frac{C}{k}. \end{align*} \]

Finally, using \( L\) -smoothness

\[ \begin{align*} f ({\boldsymbol{w}}_k)- f ({\boldsymbol{w}}_*)\le \left\langle \nabla f ({\boldsymbol{w}}_*), {\boldsymbol{w}}_k-{\boldsymbol{w}}_*\right\rangle_{}+\frac{L}{2}\| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2 =\frac{L}{2}\| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2, \end{align*} \]

and taking the expectation concludes the proof.

The specific choice of \( {h}_k\) in (124) simplifies the calculations in the proof, but it is not necessary in order for the asymptotic convergence to hold; see for instance [150, Theorem 4.7] or [152, Chapter 4] for more general statements.

11.3 Acceleration

Acceleration is an important tool for the training of neural networks [153]. The idea was first introduced by Polyak in 1964 under the name “heavy ball method” [154]. It is inspired by the dynamics of a heavy ball rolling down the valley of the loss landscape. Since then other types of acceleration have been proposed and analyzed, with Nesterov acceleration being the most prominent example [155]. In this section, we first give some intuition by discussing the heavy ball method for a simple quadratic loss. Afterwards we turn to Nesterov acceleration and give a convergence proof for \( L\) -smooth and \( \mu\) -strongly convex objective functions that improves upon the bounds obtained for gradient descent.

11.3.1 Heavy ball method

We follow [156, 157, 158] to motivate the idea. Consider the quadratic objective function in two dimensions

\[ \begin{align} f ({\boldsymbol{w}})\mathrm{:}= \frac{1}{2}{\boldsymbol{w}}^\top {\boldsymbol{D}} {\boldsymbol{w}}~~\text{where}~~ {\boldsymbol{D}} = \begin{pmatrix} \zeta_1 &0\\ 0 &\zeta_2 \end{pmatrix} \end{align} \]

(127)

with \( \zeta_1\ge\zeta_2>0\) . Clearly, \( f \) has a unique minimizer at \( {\boldsymbol{w}}_*=\boldsymbol{0}\in\mathbb{R}^2\) . Starting at some \( {\boldsymbol{w}}_0\in\mathbb{R}^2\) , gradient descent with constant step size \( h>0\) computes the iterates

\[ \begin{align*} {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k - h {\boldsymbol{D}}{\boldsymbol{w}}_k = \begin{pmatrix} 1-h\zeta_1 &0\\ 0&1-h\zeta_2 \end{pmatrix} {\boldsymbol{w}}_k = \begin{pmatrix} (1-h\zeta_1)^{k+1} &0\\ 0&(1-h\zeta_2)^{k+1} \end{pmatrix} {\boldsymbol{w}}_0. \end{align*} \]

The method converges for arbitrary initialization \( {\boldsymbol{w}}_0\) if and only if

\[ \begin{align*} |1-h\zeta_1|<1~~\text{and}~~ |1-h\zeta_2|<1. \end{align*} \]

The optimal step size balancing the rate of convergence in both coordinates is

\[ \begin{align} h_* = \rm{argmin}_{h>0} \max\{|1-h\zeta_1|,|1-h\zeta_2|\} = \frac{2}{\zeta_1+\zeta_2}. \end{align} \]

(128)

With \( \kappa=\zeta_1/\zeta_2\) we then obtain the convergence rate

\[ \begin{align} |1-h_*\zeta_1| = |1-h_*\zeta_2| = \frac{\zeta_1-\zeta_2}{\zeta_1+\zeta_2}=\frac{\kappa-1}{\kappa+1}\in [0,1). \end{align} \]

(129)

If \( \zeta_1\gg \zeta_2\) , this term is close to \( 1\) , and thus the convergence will be slow. This is consistent with our analysis for strongly convex objective functions; by Exercise 39 the condition number of \( f \) equals \( \kappa=\zeta_1/\zeta_2\gg 1\) . Hence, the upper bound in Theorem 19 converges only slowly. Similar considerations hold for general quadratic objective functions in \( \mathbb{R}^n\) such as

\[ \begin{align} \tilde f ({\boldsymbol{w}})=\frac{1}{2}{\boldsymbol{w}}^\top {\boldsymbol{A}}{\boldsymbol{w}}+{\boldsymbol{b}}^\top{\boldsymbol{w}}+c \end{align} \]

(130)

with \( {\boldsymbol{A}}\in\mathbb{R}^{n\times n}\) symmetric positive definite, \( {\boldsymbol{b}}\in\mathbb{R}^n\) and \( c\in\mathbb{R}\) , see Exercise 40.

Remark 12

Interpreting (130) as a second-order Taylor approximation of some objective function around its minimizer, we note that the described effects also occur for general objective functions with ill-conditioned Hessians at the minimizer.

Figure 40. 20 steps of gradient descent (GD) and the heavy ball method (HB) on the objective function (127) with \( \zeta_1=12\gg 1=\zeta_2\) , step size \( h=\alpha=h_*\) as in (128), and \( \beta=1/3\) . Figure based on [157, Fig. 6].

Figure 40 gives further insight into the poor performance of gradient descent for (127) with \( \zeta_1\gg\zeta_2\) . The loss landscape looks like a ravine (the derivative is much larger in one direction than in the other), and away from the valley floor, \( \nabla f \) points mainly towards the opposite wall of the ravine rather than along the valley. Therefore the iterates oscillate back and forth in the first coordinate, and make little progress in the direction of the valley along the second coordinate axis. To address this problem, the heavy ball method introduces a “momentum” term which can mitigate this effect to some extent. The idea is to choose the update not just according to the gradient at the current location, but to add information from the previous steps. After initializing \( {\boldsymbol{w}}_0\) and, e.g., \( {\boldsymbol{w}}_1={\boldsymbol{w}}_0-\alpha\nabla f ({\boldsymbol{w}}_0)\) , let for \( k\in \mathbb{N}\)

\[ \begin{align} {\boldsymbol{w}}_{k+1} = {\boldsymbol{w}}_k - \alpha\nabla f ({\boldsymbol{w}}_k)+\beta({\boldsymbol{w}}_k-{\boldsymbol{w}}_{k-1}). \end{align} \]

(131)

This is known as Polyak’s heavy ball method [154, 157]. Here \( \alpha>0\) and \( \beta\in (0,1)\) are hyperparameters (that could also depend on \( k\) ) and in practice need to be carefully tuned to balance the strength of the gradient and the momentum term. Iteratively expanding (131) with the given initialization, observe that for \( k\ge 0\)

\[ \begin{align} {\boldsymbol{w}}_{k+1} ={\boldsymbol{w}}_k-\alpha\Bigg(\sum_{j=0}^k \beta^j\nabla f ({\boldsymbol{w}}_{k-j})\Bigg). \end{align} \]

(132)

Thus, \( {\boldsymbol{w}}_k\) is updated using an exponentially weighted moving average of all past gradients. Choosing the momentum parameter \( \beta\) in the interval \( (0,1)\) ensures that the influence of previous gradients on the update decays exponentially. The concrete value of \( \beta\) determines the balance between the impact of recent and past gradients.

Intuitively, this linear combination of the past gradients averages out some of the oscillation observed for gradient descent in Figure 40; the update vector is strengthened in directions where past gradients are aligned (the second coordinate axis), while it is dampened in directions where the gradients’ signs alternate (the first coordinate axis). Similarly, when using stochastic gradients, it can help to reduce some of the variance.
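The following sketch (initialization and number of steps chosen for illustration) compares gradient descent with the heavy ball update (131) in the setting of Figure 40, i.e., \( \zeta_1=12\) , \( \zeta_2=1\) , \( h=\alpha=h_*\) from (128) and \( \beta=1/3\) .

```python
import numpy as np

# Illustration: GD versus the heavy ball method (131) on f(w) = 0.5 * w^T D w, D = diag(12, 1).
zeta = np.array([12.0, 1.0])
grad = lambda w: zeta * w              # gradient of the quadratic objective (127)
h = alpha = 2.0 / (zeta[0] + zeta[1])  # optimal GD step size (128)
beta = 1.0 / 3.0                       # momentum parameter

w_gd = np.array([1.0, 10.0])
w_hb, w_hb_prev = np.array([1.0, 10.0]), np.array([1.0, 10.0])
for _ in range(20):
    w_gd = w_gd - h * grad(w_gd)
    # heavy ball update (131): gradient step plus momentum term
    w_hb, w_hb_prev = w_hb - alpha * grad(w_hb) + beta * (w_hb - w_hb_prev), w_hb

print("GD after 20 steps:", np.linalg.norm(w_gd))
print("HB after 20 steps:", np.linalg.norm(w_hb))
```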

As mentioned earlier, the heavy ball method can be interpreted as a discretization of the dynamics of a ball rolling down the valley of the loss landscape. If the ball has positive mass, i.e., is “heavy”, its momentum prevents it from bouncing back and forth too strongly. The following remark elucidates this connection.

Remark 13

As pointed out, e.g., in [157, 158], for suitable choices of \( \alpha\) and \( \beta\) , (131) can be interpreted as a discretization of the second-order ODE

\[ \begin{align} m {\boldsymbol{w}}''(t) = -\nabla f ({\boldsymbol{w}}(t)) - r{\boldsymbol{w}}'(t). \end{align} \]

(133)

This equation describes the movement of a point mass \( m\) under influence of the force field \( -\nabla f ({\boldsymbol{w}}(t))\) ; the term \( -{\boldsymbol{w}}'(t)\) , which points in the negative direction of the current velocity, corresponds to friction, and \( r>0\) is the friction coefficient. The discretization

\[ \begin{align*} m\frac{{\boldsymbol{w}}_{k+1}-2{\boldsymbol{w}}_k+{\boldsymbol{w}}_{k-1}}{h^2} = -\nabla f ({\boldsymbol{w}}_k)-r\frac{{\boldsymbol{w}}_{k+1}-{\boldsymbol{w}}_k}{h} \end{align*} \]

then leads to

\[ \begin{align} {\boldsymbol{w}}_{k+1} = {\boldsymbol{w}}_k-\underbrace{\frac{h^2}{m+rh}}_{=\alpha}\nabla f ({\boldsymbol{w}}_k)+\underbrace{\frac{m}{m+rh}}_{=\beta}({\boldsymbol{w}}_k-{\boldsymbol{w}}_{k-1}), \end{align} \]

(134)

and thus to (131), [158].

Letting \( m=0\) in (134), we recover the gradient descent update (110). Hence, positive mass \( m>0\) corresponds to the momentum term. The gradient descent update in turn can be interpreted as an Euler discretization of the gradient flow

\[ \begin{equation} {\boldsymbol{w}}'(t)=-\nabla f ({\boldsymbol{w}}(t)). \end{equation} \]

(135)

Note that \( -\nabla f ({\boldsymbol{w}}(t))\) represents the velocity of \( {\boldsymbol{w}}(t)\) in (135), whereas in (133), up to the friction term, it corresponds to an acceleration.

11.3.2 Nesterov acceleration

Nesterov’s accelerated gradient method (NAG) [155, 145] builds on the heavy ball method. After initializing \( {\boldsymbol{w}}_0\) , \( {\boldsymbol{v}}_0\in\mathbb{R}^n\) , the update is formulated for \( k\ge 0\) as the two-step process

\[ \begin{align} {\boldsymbol{w}}_{k+1} &= {\boldsymbol{v}}_k - \alpha\nabla f ({\boldsymbol{v}}_k) \end{align} \]

(136.a)

\[ \begin{align} {\boldsymbol{v}}_{k+1} &= {\boldsymbol{w}}_{k+1} + \beta ({\boldsymbol{w}}_{k+1}-{\boldsymbol{w}}_{k}) \end{align} \]

(136.b)

where again \( \alpha>0\) and \( \beta\in (0,1)\) are hyperparameters. Substituting the second line into the first we get for \( k\ge 1\)

\[ \begin{align*} {\boldsymbol{w}}_{k+1} = {\boldsymbol{w}}_k -\alpha\nabla f ({\boldsymbol{v}}_k) + \beta({\boldsymbol{w}}_k-{\boldsymbol{w}}_{k-1}). \end{align*} \]

Comparing with the heavy ball method (131), the key difference is that the gradient is not evaluated at the current position \( {\boldsymbol{w}}_k\) , but instead at the point \( {\boldsymbol{v}}_k={\boldsymbol{w}}_k+\beta({\boldsymbol{w}}_k-{\boldsymbol{w}}_{k-1})\) , which can be interpreted as an estimate of the position at the next iteration. This improves stability and robustness with respect to hyperparameter settings, see [159, Sections 4 and 5].
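Here is a minimal sketch of the two-step update (136) on the quadratic (127) (initialization and number of steps are illustrative; the parameter choice \( \alpha=1/L\) , \( \beta=(1-\tau)/(1+\tau)\) with \( \tau=\sqrt{\mu/L}\) anticipates the convergence analysis below).

```python
import numpy as np

# Illustration: NAG (136) on the quadratic (127) with eigenvalues 12 and 1.
zeta = np.array([12.0, 1.0])
grad = lambda w: zeta * w

L_smooth, mu = zeta[0], zeta[1]
alpha = 1.0 / L_smooth                 # step size
tau = np.sqrt(mu / L_smooth)
beta = (1 - tau) / (1 + tau)           # momentum parameter

w = v = np.array([1.0, 10.0])
for _ in range(20):
    w_new = v - alpha * grad(v)        # gradient step at the look-ahead point v_k
    v = w_new + beta * (w_new - w)     # extrapolation, giving v_{k+1}
    w = w_new

print("NAG after 20 steps:", np.linalg.norm(w))
```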

We now discuss the convergence of NAG for \( L\) -smooth and \( \mu\) -strongly convex objective functions \( f \) . To give the analysis, it is convenient to first rewrite (136) as a three-sequence update: Let \( \tau=\sqrt{\mu/L}\) , \( \alpha=1/L\) , and \( \beta = (1-\tau)/(1+\tau)\) . After initializing \( {\boldsymbol{w}}_0\) , \( {\boldsymbol{v}}_0\in\mathbb{R}^n\) , (136) can also be written as \( {\boldsymbol{u}}_0 = ((1+\tau){\boldsymbol{v}}_0-{\boldsymbol{w}}_0)/\tau\) and for \( k\ge 0\)

\[ \begin{align} {\boldsymbol{v}}_k &= \frac{\tau}{1+\tau}{\boldsymbol{u}}_k+\frac{1}{1+\tau}{\boldsymbol{w}}_k \end{align} \]

(137.a)

\[ \begin{align} {\boldsymbol{w}}_{k+1} &={\boldsymbol{v}}_k - \frac{1}{L}\nabla f ({\boldsymbol{v}}_k) \end{align} \]

(137.b)

\[ \begin{align} {\boldsymbol{u}}_{k+1}&={\boldsymbol{u}}_k+\tau\cdot ({\boldsymbol{v}}_k-{\boldsymbol{u}}_k)-\frac{\tau}{\mu}\nabla f ({\boldsymbol{v}}_k), \end{align} \]

(137.c)

see Exercise 41. The proof of the next theorem proceeds along the lines of [160, Theorem A.3.1], [161, Proposition 10]; also see [162, Proposition 20], where a similar proof of a related result is presented based on the same references.

Theorem 21

Let \( n \in \mathbb{N}\) , \( 0<\mu\le L\) , and let \( f \colon \mathbb{R}^n \to \mathbb{R}\) be \( L\) -smooth and \( \mu\) -strongly convex. Further, let \( {\boldsymbol{w}}_0\) , \( {\boldsymbol{v}}_0\in\mathbb{R}^n\) and let \( \tau=\sqrt{\mu/L}\) . Let \( ({\boldsymbol{v}}_k,{\boldsymbol{w}}_{k+1},{\boldsymbol{u}}_{k+1})_{k=0}^\infty \subseteq \mathbb{R}^n\) be defined by (137.a)-(137.c), and let \( {\boldsymbol{w}}_*\) be the unique minimizer of \( f \) .

Then, for all \( k\in\mathbb{N}_0\) , it holds that

\[ \begin{align} \| {\boldsymbol{u}}_k-{\boldsymbol{w}}_* \|_{}^2 & \le \frac{2}{\mu} \Big(1-\sqrt{\frac{\mu}{L}}\Big)^{k}\Big( f ({\boldsymbol{w}}_0)- f ({\boldsymbol{w}}_*)+\frac{\mu}{2}\| {\boldsymbol{u}}_0-{\boldsymbol{w}}_* \|_{}^2\Big), \end{align} \]

(138.a)

\[ \begin{align} f ({\boldsymbol{w}}_k)- f ({\boldsymbol{w}}_*)&\le \Big(1-\sqrt{\frac{\mu}{L}}\Big)^{k}\Big( f ({\boldsymbol{w}}_0)- f ({\boldsymbol{w}}_*)+\frac{\mu}{2}\| {\boldsymbol{u}}_0-{\boldsymbol{w}}_* \|_{}^2\Big). \end{align} \]

(138.b)

Proof

Define

\[ \begin{align} e_k\mathrm{:}= ( f ({\boldsymbol{w}}_k)- f ({\boldsymbol{w}}_*))+\frac{\mu}{2}\| {\boldsymbol{u}}_k-{\boldsymbol{w}}_* \|_{}^2. \end{align} \]

(139)

To show (138), it suffices to prove with \( c\mathrm{:}= 1-\tau\) that \( e_{k+1}\le ce_k\) for all \( k\in\mathbb{N}_0\) .

Step 1. We bound the first term in \( e_{k+1}\) defined in (139). Using \( L\) -smoothness (112.a) and (137.b)

\[ \begin{align*} f ({\boldsymbol{w}}_{k+1})- f ({\boldsymbol{v}}_k) \le \left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_{k+1}-{\boldsymbol{v}}_k\right\rangle_{} + \frac{L}{2}\| {\boldsymbol{w}}_{k+1}-{\boldsymbol{v}}_k \|_{}^2 = -\frac{1}{2L}\| \nabla f ({\boldsymbol{v}}_k) \|_{}^2. \end{align*} \]

Thus, since \( c+\tau=1\) ,

\[ \begin{align} f ({\boldsymbol{w}}_{k+1})- f ({\boldsymbol{w}}_*)&\le ( f ({\boldsymbol{v}}_k)- f ({\boldsymbol{w}}_*))-\frac{1}{2L}\| \nabla f ({\boldsymbol{v}}_k) \|_{}^2\nonumber \end{align} \]

(140)

\[ \begin{align} &= c\cdot ( f ({\boldsymbol{w}}_k)- f ({\boldsymbol{w}}_*)) +c\cdot ( f ({\boldsymbol{v}}_k)- f ({\boldsymbol{w}}_k))\\ &~ +\tau\cdot ( f ({\boldsymbol{v}}_k)- f ({\boldsymbol{w}}_*)) -\frac{1}{2L}\| \nabla f ({\boldsymbol{v}}_k) \|_{}^2. \\\end{align} \]

Step 2. We bound the second term in \( e_{k+1}\) defined in (139). By (137.c)

\[ \begin{align} &\frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{w}}_* \|_{}^2 - \frac{\mu}{2}\| {\boldsymbol{u}}_{k}-{\boldsymbol{w}}_* \|_{}^2\nonumber \end{align} \]

(141)

\[ \begin{align} &~ = \frac{\mu}{2} \| {\boldsymbol{u}}_{k+1}-{\boldsymbol{u}}_k + {\boldsymbol{u}}_k-{\boldsymbol{w}}_* \|_{}^2- \frac{\mu}{2}\| {\boldsymbol{u}}_{k}-{\boldsymbol{w}}_* \|_{}^2\\ &~ = \frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{u}}_k \|_{}^2+\mu \left\langle \tau\cdot ({\boldsymbol{v}}_k-{\boldsymbol{u}}_k)-\frac{\tau}{\mu}\nabla f ({\boldsymbol{v}}_k), {\boldsymbol{u}}_k-{\boldsymbol{w}}_*\right\rangle_{}\nonumber\\ &~= \frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{u}}_k \|_{}^2 +\tau\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_*-{\boldsymbol{u}}_k\right\rangle_{}-\tau\mu\left\langle {\boldsymbol{v}}_k-{\boldsymbol{u}}_k, {\boldsymbol{w}}_*-{\boldsymbol{u}}_k\right\rangle_{}. \\\end{align} \]

Using \( \mu\) -strong convexity (117), we get

\[ \begin{align*} &\tau\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_*-{\boldsymbol{u}}_k\right\rangle_{}= \tau\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{v}}_k-{\boldsymbol{u}}_k\right\rangle_{}+ \tau\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_*-{\boldsymbol{v}}_k\right\rangle_{}\\ &~~~~\le \tau\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{v}}_k-{\boldsymbol{u}}_k\right\rangle_{} - \tau\cdot ( f ({\boldsymbol{v}}_k)- f ({\boldsymbol{w}}_*))-\frac{\tau\mu}{2}\| {\boldsymbol{v}}_k-{\boldsymbol{w}}_* \|_{}^2. \end{align*} \]

Moreover,

\[ \begin{align*} &-\frac{\tau\mu}{2}\| {\boldsymbol{v}}_k-{\boldsymbol{w}}_* \|_{}^2-\tau \mu\left\langle {\boldsymbol{v}}_k-{\boldsymbol{u}}_k, {\boldsymbol{w}}_*-{\boldsymbol{u}}_k\right\rangle_{}\\ &~~~~=-\frac{\tau\mu}{2}\Big(\| {\boldsymbol{v}}_k-{\boldsymbol{w}}_* \|_{}^2-2\left\langle {\boldsymbol{v}}_k-{\boldsymbol{u}}_k, {\boldsymbol{v}}_k-{\boldsymbol{w}}_*\right\rangle_{}+2\left\langle {\boldsymbol{v}}_k-{\boldsymbol{u}}_k, {\boldsymbol{v}}_k-{\boldsymbol{u}}_k\right\rangle_{}\Big)\\ &~~~~=-\frac{\tau\mu}{2}(\| {\boldsymbol{u}}_k-{\boldsymbol{w}}_* \|_{}^2+\| {\boldsymbol{v}}_k-{\boldsymbol{u}}_k \|_{}^2). \end{align*} \]

Thus, (141) is bounded by

\[ \begin{align*} &\frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{u}}_k \|_{}^2 + \tau\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{v}}_k-{\boldsymbol{u}}_k\right\rangle_{}-\tau\cdot ( f ({\boldsymbol{v}}_k)- f ({\boldsymbol{w}}_*))\nonumber\\ & -\frac{\tau\mu}{2}\| {\boldsymbol{u}}_k-{\boldsymbol{w}}_* \|_{}^2-\frac{\tau\mu}{2}\| {\boldsymbol{v}}_k-{\boldsymbol{u}}_k \|_{}^2. \end{align*} \]

From (137.a) we have \( \tau\cdot ({\boldsymbol{v}}_k-{\boldsymbol{u}}_k)={\boldsymbol{w}}_k-{\boldsymbol{v}}_k\) , so that with \( c=1-\tau\) we arrive at

\[ \begin{align} &\frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{w}}_* \|_{}^2 \le c\frac{\mu}{2}\| {\boldsymbol{u}}_{k}-{\boldsymbol{w}}_* \|_{}^2+\frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{u}}_k \|_{}^2\nonumber \end{align} \]

(142)

\[ \begin{align} &~~ + \left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_k-{\boldsymbol{v}}_k\right\rangle_{}-\tau\cdot ( f ({\boldsymbol{v}}_k)- f ({\boldsymbol{w}}_*)) -\frac{\mu}{2\tau}\| {\boldsymbol{w}}_k-{\boldsymbol{v}}_k \|_{}^2. \\\end{align} \]

(143)

Step 3. We show \( e_{k+1}\le c e_k\) . Adding (140) and (142) gives

\[ \begin{align*} e_{k+1} &\le ce_k +c\cdot ( f ({\boldsymbol{v}}_k)- f ({\boldsymbol{w}}_k))-\frac{1}{2L}\| \nabla f ({\boldsymbol{v}}_k) \|_{}^2+ \frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{u}}_k \|_{}^2\nonumber\\ &~+ \left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_k-{\boldsymbol{v}}_k\right\rangle_{} -\frac{\mu}{2\tau}\| {\boldsymbol{w}}_k-{\boldsymbol{v}}_k \|_{}^2. \end{align*} \]

Using (137.a), (137.c) we expand

\[ \begin{align*} \frac{\mu}{2}\| {\boldsymbol{u}}_{k+1}-{\boldsymbol{u}}_k \|_{}^2 &= \frac{\mu}{2}\left\| {\boldsymbol{w}}_k-{\boldsymbol{v}}_k-\frac{\tau}{\mu}\nabla f ({\boldsymbol{v}}_k) \right\|_{}^2\\ &= \frac{\mu}{2}\| {\boldsymbol{w}}_k-{\boldsymbol{v}}_k \|_{}^2 -\tau\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_k-{\boldsymbol{v}}_k\right\rangle_{}+\frac{\tau^2}{2\mu}\| \nabla f ({\boldsymbol{v}}_k) \|_{}^2, \end{align*} \]

to obtain

\[ \begin{align*} e_{k+1}&\le c e_k +\Big(\frac{\tau^2}{2\mu}-\frac{1}{2L}\Big)\| \nabla f ({\boldsymbol{v}}_k) \|_{}^2 -\frac{\mu}{2\tau}\| {\boldsymbol{w}}_k-{\boldsymbol{v}}_k \|_{}^2\\ &~+c\cdot \big(f({\boldsymbol{v}}_k)-f({\boldsymbol{w}}_k)+\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_k-{\boldsymbol{v}}_k\right\rangle_{}\big)+\frac{\mu}{2}\| {\boldsymbol{w}}_k-{\boldsymbol{v}}_k \|_{}^2. \end{align*} \]

The last line can be bounded using \( \mu\) -strong convexity (117) and \( \mu\le L\)

\[ \begin{align*} &c\cdot(f({\boldsymbol{v}}_k)-f({\boldsymbol{w}}_k)+\left\langle \nabla f ({\boldsymbol{v}}_k), {\boldsymbol{w}}_k-{\boldsymbol{v}}_k\right\rangle_{})+\frac{\mu}{2}\| {\boldsymbol{v}}_k-{\boldsymbol{w}}_k \|_{}^2\\ &~~\le -(1-\tau)\frac{\mu}{2}\| {\boldsymbol{v}}_k-{\boldsymbol{w}}_k \|_{}^2+\frac{\mu}{2}\| {\boldsymbol{v}}_k-{\boldsymbol{w}}_k \|_{}^2 \le \frac{\tau L}{2}\| {\boldsymbol{v}}_k-{\boldsymbol{w}}_k \|_{}^2. \end{align*} \]

In all

\[ \begin{align*} e_{k+1}\le ce_k+ \Big(\frac{\tau^2}{2\mu}-\frac{1}{2L}\Big)\| \nabla f ({\boldsymbol{v}}_k) \|_{}^2 +\Big(\frac{\tau L}{2}-\frac{\mu}{2\tau}\Big)\| {\boldsymbol{w}}_k-{\boldsymbol{v}}_k \|_{}^2= c e_k, \end{align*} \]

where the terms in brackets vanished since \( \tau=\sqrt{\mu/L}\) . This concludes the proof.

Comparing the result for gradient descent (118) with NAG (138), the improvement for strongly convex objectives lies in the convergence rate, which is \( 1-\kappa^{-1}\) for gradient descent and \( 1-\kappa^{-1/2}\) for NAG, where \( \kappa=L/\mu\) . For NAG the convergence rate thus depends on the condition number \( \kappa\) only through its square root. For ill-conditioned problems where \( \kappa\) is large, we therefore expect much better performance for accelerated methods.
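To quantify this, the following small computation (tolerance and values of \( \kappa\) chosen for illustration only) estimates how many iterations the two rates need to shrink the respective error bounds by a factor of \( 10^{-6}\) .

```python
import math

# Illustration: iterations needed to reduce the error bound by a factor 1e-6
# under the rates 1 - 1/kappa (GD) and 1 - 1/sqrt(kappa) (NAG).
for kappa in [10, 100, 10_000]:
    k_gd = math.log(1e-6) / math.log(1 - 1 / kappa)
    k_nag = math.log(1e-6) / math.log(1 - 1 / math.sqrt(kappa))
    print(f"kappa = {kappa:6d}: GD ~ {k_gd:9.0f} iterations, NAG ~ {k_nag:7.0f} iterations")
```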

11.4 Adaptive and coordinate-wise learning rates

In recent years, a multitude of first order (gradient descent) methods has been proposed and studied for the training of neural networks. Many of them incorporate some or all of the following key strategies: stochastic gradients, acceleration, and adaptive step sizes. The concepts of stochastic gradients and acceleration have been covered in Sections 11.2 and 11.3, and we touch upon adaptive learning rates in the present one. Specifically, following the original papers [163, 164, 95, 165] and in particular the overviews [25, Section 8.5], [166], and [19, Chapter 11], we explain the main ideas behind AdaGrad, RMSProp, and Adam. The paper [166] provides an intuitive general overview of first order methods and discusses several additional variants that are omitted here. Moreover, in practice, various other techniques and heuristics such as batch normalization, gradient clipping, regularization and dropout, early stopping, specific weight initializations etc. are used. We do not discuss them here, and refer for example to [167], [25], or [19, Chapter 11].

11.4.1 Coordinate-wise scaling

In Section 11.3.1, we saw why plain gradient descent can be inefficient for ill-conditioned objective functions. This issue can be particularly pronounced in high-dimensional optimization problems, such as when training neural networks, where certain parameters influence the network output much more than others. As a result, a single learning rate may be suboptimal; directions in parameter space with small gradients are updated too slowly, while in directions with large gradients the algorithm might overshoot. To address this, one approach is to precondition the gradient by multiplying it with a matrix that accounts for the geometry of the parameter space, e.g., [168, 169]. A simpler and computationally efficient alternative is to scale each component of the gradient individually, corresponding to a diagonal preconditioning matrix. This allows different learning rates for different coordinates. The key question is how to set these learning rates. The main idea, first proposed in [163], is to scale each component inversely proportional to the magnitude of past gradients. In the words of the authors of [163]: “Our procedures give frequently occurring features very low learning rates and infrequent features high learning rates.”

After initializing \( {\boldsymbol{u}}_{0}=\boldsymbol{0}\in\mathbb{R}^n\) , \( {\boldsymbol{s}}_{0}=\boldsymbol{0}\in\mathbb{R}^n\) , and \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , all methods discussed below are special cases of

\[ \begin{align} {\boldsymbol{u}}_{k+1} &= \beta_1{\boldsymbol{u}}_{k}+\beta_2\nabla f ({\boldsymbol{w}}_{k}) \end{align} \]

(144.a)

\[ \begin{align} {\boldsymbol{s}}_{k+1} &= \gamma_1{\boldsymbol{s}}_{k}+\gamma_2\nabla f ({\boldsymbol{w}}_{k})\odot\nabla f ({\boldsymbol{w}}_{k}) \end{align} \]

(144.b)

\[ \begin{align} {\boldsymbol{w}}_{k+1} &= {\boldsymbol{w}}_{k}-\alpha_k{\boldsymbol{u}}_{k+1}\oslash \sqrt{{\boldsymbol{s}}_{k+1}+\varepsilon} \end{align} \]

(144.c)

for \( k\in\mathbb{N}_0\) , and certain hyperparameters \( \alpha_k\) , \( \beta_1\) , \( \beta_2\) , \( \gamma_1\) , \( \gamma_2\) , and \( \varepsilon\) . Here \( \odot\) and \( \oslash\) denote the componentwise (Hadamard) multiplication and division, respectively, and \( \sqrt{{\boldsymbol{s}}_{k+1}+\varepsilon}\) is understood as the vector \( (\sqrt{s_{k+1,i}+\varepsilon})_i\) . Equation (144.a) defines an update vector and corresponds to heavy ball momentum if \( \beta_1\in (0,1)\) . If \( \beta_1=0\) , then \( {\boldsymbol{u}}_{k+1}\) is simply a multiple of the current gradient. Equation (144.b) defines a scaling vector \( {\boldsymbol{s}}_{k+1}\) that is used to set a coordinate-wise learning rate of the update vector in (144.c). The constant \( \varepsilon>0\) is chosen small but positive to avoid division by zero in (144.c). These types of methods are often applied using mini-batches, see Section 11.2. For simplicity we present them with the full gradients.
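A compact sketch of the generic scheme (144) follows (the quadratic test objective, the initialization, and the hyperparameter values are illustrative choices; the parameter settings corresponding to AdaGrad, RMSProp, and Adam are given in the next subsections).

```python
import numpy as np

# Sketch of the generic update (144).
def generic_step(w, u, s, grad_w, alpha_k, beta1, beta2, gamma1, gamma2, eps=1e-7):
    u = beta1 * u + beta2 * grad_w                 # (144.a): update vector
    s = gamma1 * s + gamma2 * grad_w * grad_w      # (144.b): coordinate-wise scaling
    w = w - alpha_k * u / np.sqrt(s + eps)         # (144.c): scaled parameter update
    return w, u, s

# Example run with an RMSProp-style choice: beta1 = 0, beta2 = 1, gamma2 = 1 - gamma1.
grad = lambda w: np.array([12.0, 1.0]) * w         # gradient of the quadratic (127)
w, u, s = np.array([1.0, 10.0]), np.zeros(2), np.zeros(2)
for k in range(100):
    w, u, s = generic_step(w, u, s, grad(w), alpha_k=0.1,
                           beta1=0.0, beta2=1.0, gamma1=0.9, gamma2=0.1)
print("after 100 steps:", w)
```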

Example 6

Consider an objective function \( f :\mathbb{R}^n\to\mathbb{R}\) , and its rescaled version

\[ f _{\boldsymbol{\zeta}}({\boldsymbol{w}})\mathrm{:}= f ({\boldsymbol{w}}\odot{\boldsymbol{\zeta}})~~\text{with gradient}~~ \nabla f _{\boldsymbol{\zeta}}({\boldsymbol{w}}) = {\boldsymbol{\zeta}}\odot \nabla f ({\boldsymbol{w}}\odot {\boldsymbol{\zeta}}), \]

for some \( {\boldsymbol{\zeta}}\in (0,\infty)^n\) . Gradient descent (110) applied to \( f _{\boldsymbol{\zeta}}\) performs the update

\[ {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k - h_k {\boldsymbol{\zeta}}\odot \nabla f ({\boldsymbol{w}}_k\odot{\boldsymbol{\zeta}}). \]

Setting \( \varepsilon=0\) , (144) on the other hand performs the update

\[ {\boldsymbol{w}}_{k+1} = {\boldsymbol{w}}_{k}-\alpha_k \Bigg(\beta_2\sum_{j=0}^k\beta_1^j\nabla f ({\boldsymbol{w}}_{k-j}\odot{\boldsymbol{\zeta}})\Bigg)\oslash \sqrt{\gamma_2\sum_{j=0}^k\gamma_1^j\nabla f ({\boldsymbol{w}}_{k-j}\odot{\boldsymbol{\zeta}})\odot \nabla f ({\boldsymbol{w}}_{k-j}\odot{\boldsymbol{\zeta}})}. \]

Note how the outer scaling factor \( {\boldsymbol{\zeta}}\) has vanished due to the division, in this sense making the update invariant to a componentwise rescaling of the objective.
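A quick numerical check of this invariance for the first step of an AdaGrad-type update with \( \varepsilon=0\) (the objective and the two rescaling vectors \( {\boldsymbol{\zeta}}\) below are illustrative choices): starting from \( {\boldsymbol{s}}_0=\boldsymbol{0}\) , the first update moves every coordinate by \( \pm\alpha\) , independently of \( {\boldsymbol{\zeta}}\) (assuming nonzero gradient entries).

```python
import numpy as np

# Illustration: the first step of (144) with AdaGrad-type parameters and epsilon = 0
# is alpha * sign(gradient), hence invariant to a componentwise rescaling zeta.
alpha = 0.01
w0 = np.array([1.0, -2.0])
for zeta in (np.array([1.0, 1.0]), np.array([100.0, 0.01])):
    g = zeta * (np.array([12.0, 1.0]) * (w0 * zeta))  # gradient of f_zeta for f from (127)
    s = g * g
    w1 = w0 - alpha * g / np.sqrt(s)                  # first step with epsilon = 0
    print(zeta, "->", w1 - w0)                        # +/- alpha in each coordinate
```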

11.4.2 Algorithms

11.4.2.1 AdaGrad

AdaGrad [163], which stands for Adaptive Gradient Algorithm, corresponds to (144) with

\[ \begin{align*} \beta_1=0,~~\gamma_1=\beta_2=\gamma_2=1,~~ \alpha_k=\alpha~\text{for all }k\in\mathbb{N}_0. \end{align*} \]

This leaves the hyperparameters \( \varepsilon>0\) and \( \alpha>0\) . Here \( \alpha>0\) can be understood as a “global” learning rate. The default values in tensorflow [170] are \( \alpha=0.001\) and \( \varepsilon=10^{-7}\) . The AdaGrad update then reads

\[ \begin{align*} {\boldsymbol{s}}_{k+1} &= {\boldsymbol{s}}_{k}+\nabla f ({\boldsymbol{w}}_{k})\odot\nabla f ({\boldsymbol{w}}_{k})\\ {\boldsymbol{w}}_{k+1}&={\boldsymbol{w}}_{k}-\alpha\nabla f ({\boldsymbol{w}}_{k})\oslash\sqrt{{\boldsymbol{s}}_{k+1}+\varepsilon}. \end{align*} \]

Due to

\[ \begin{align} {\boldsymbol{s}}_{k+1}=\sum_{j=0}^{k}\nabla f ({\boldsymbol{w}}_j)\odot\nabla f ({\boldsymbol{w}}_j), \end{align} \]

(145)

the algorithm therefore scales the gradient \( \nabla f ({\boldsymbol{w}}_{k})\) in the update componentwise by the inverse square root of the sum over all past squared gradients plus \( \varepsilon\) . Note that the scaling factor \( (s_{k+1,i}+\varepsilon)^{-1/2}\) for component \( i\) will be large, if the previous gradients for that component were small, and vice versa.

11.4.2.2 RMSProp

RMSProp, which stands for Root Mean Squared Propagation, was introduced by Tieleman and Hinton [95]. It corresponds to (144) with

\[ \begin{align*} \beta_1=0,~~\beta_2=1,~~ \gamma_2=1-\gamma_1\in (0,1),~~ \alpha_k=\alpha~\text{for all }k\in\mathbb{N}_0, \end{align*} \]

effectively leaving the hyperparameters \( \varepsilon>0\) , \( \gamma_1\in (0,1)\) and \( \alpha>0\) . The default values in tensorflow [170] are \( \varepsilon=10^{-7}\) , \( \alpha=0.001\) and \( \gamma_1=0.9\) . The algorithm is thus given through

\[ \begin{align} {\boldsymbol{s}}_{k+1} &= \gamma_1{\boldsymbol{s}}_{k}+(1-\gamma_1)\nabla f ({\boldsymbol{w}}_{k})\odot\nabla f ({\boldsymbol{w}}_{k}) \end{align} \]

(146.a)

\[ \begin{align} {\boldsymbol{w}}_{k+1} &= {\boldsymbol{w}}_{k}- \alpha \nabla f ({\boldsymbol{w}}_{k})\oslash \sqrt{{\boldsymbol{s}}_{k+1}+\varepsilon}. \end{align} \]

(146.b)

The scaling vector can be expressed as

\[ \begin{align*} {\boldsymbol{s}}_{k+1} = (1-\gamma_1)\sum_{j=0}^{k}\gamma_1^j\nabla f ({\boldsymbol{w}}_{k-j})\odot\nabla f ({\boldsymbol{w}}_{k-j}), \end{align*} \]

and corresponds to an exponentially weighted moving average over the past squared gradients. Unlike for AdaGrad (145), where past gradients accumulate indefinitely, RMSProp exponentially downweights older gradients, giving more weight to recent updates. This prevents the overly rapid decay of learning rates and slow convergence sometimes observed in AdaGrad, e.g., [171, 19]. For the same reason, the authors of Adadelta [164] proposed to use as a scaling vector the average over a moving window of the past \( m\) squared gradients, for some fixed \( m\in\mathbb{N}\) . For more details on Adadelta, see [164, 166]. The standard RMSProp algorithm does not incorporate momentum; however, this possibility is already mentioned in [95], also see [153].

11.4.2.3 Adam

Adam [165], which stands for Adaptive Moment Estimation, corresponds to (144) with

\[ \begin{align*} \beta_2=1-\beta_1\in (0,1),~~ \gamma_2=1-\gamma_1\in (0,1),~~ \alpha_k = \alpha \frac{\sqrt{1-\gamma_1^{k+1}}}{1-\beta_1^{k+1}} \end{align*} \]

for all \( k\in\mathbb{N}_0\) , for some \( \alpha>0\) . The default values for the remaining parameters recommended in [165] are \( \varepsilon=10^{-8}\) , \( \alpha=0.001\) , \( \beta_1=0.9\) and \( \gamma_1=0.999\) . The update can be formulated as

\[ \begin{align} {\boldsymbol{u}}_{k+1} &= \beta_1{\boldsymbol{u}}_{k}+(1-\beta_1)\nabla f ({\boldsymbol{w}}_k), &&\hat {\boldsymbol{u}}_{k+1} = \frac{{\boldsymbol{u}}_{k+1}}{1-\beta_1^{k+1}}\\ {\boldsymbol{s}}_{k+1} &= \gamma_1{\boldsymbol{s}}_{k}+(1-\gamma_1)\nabla f ({\boldsymbol{w}}_k)\odot\nabla f ({\boldsymbol{w}}_k), &&\hat{\boldsymbol{s}}_{k+1} = \frac{{\boldsymbol{s}}_{k+1}}{1-\gamma_1^{k+1}} \end{align} \]

(147.a)

\[ \begin{align} {\boldsymbol{w}}_{k+1}&={\boldsymbol{w}}_k-\alpha \hat{\boldsymbol{u}}_{k+1}\oslash\sqrt{\hat{\boldsymbol{s}}_{k+1}+\varepsilon}. \end{align} \]

(147.b)

Compared to RMSProp, Adam introduces two modifications. First, due to \( \beta_1>0\) ,

\[ {\boldsymbol{u}}_{k+1}=(1-\beta_1)\sum_{j=0}^k\beta_1^{j}\nabla f ({\boldsymbol{w}}_{k-j}) \]

which corresponds to heavy ball momentum (cf. (132)). Second, to counteract the initialization bias from \( {\boldsymbol{u}}_0 = \boldsymbol{0} \) and \( {\boldsymbol{s}}_0 = \boldsymbol{0} \), Adam applies a bias correction via

\[ \hat {\boldsymbol{u}}_k = \frac{{\boldsymbol{u}}_k}{1-\beta_1^k}, ~ \hat {\boldsymbol{s}}_k = \frac{{\boldsymbol{s}}_k}{1-\gamma_1^k}. \]
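A minimal sketch of the Adam step (147) with bias correction is given below (the test gradient, initialization, and number of iterations are illustrative; the hyperparameter defaults are those recommended in [165]).

```python
import numpy as np

# Sketch of a single Adam update following (147), with bias correction.
def adam_step(w, u, s, grad_w, k, alpha=0.001, beta1=0.9, gamma1=0.999, eps=1e-8):
    u = beta1 * u + (1 - beta1) * grad_w                 # first-moment estimate
    s = gamma1 * s + (1 - gamma1) * grad_w * grad_w      # second-moment estimate
    u_hat = u / (1 - beta1 ** (k + 1))                   # bias-corrected first moment
    s_hat = s / (1 - gamma1 ** (k + 1))                  # bias-corrected second moment
    w = w - alpha * u_hat / np.sqrt(s_hat + eps)         # parameter update (147.b)
    return w, u, s

# usage: iterate with k = 0, 1, 2, ... and u = s = 0 initially
w, u, s = np.array([1.0, 10.0]), np.zeros(2), np.zeros(2)
for k in range(5):
    g = np.array([12.0, 1.0]) * w                        # gradient of the quadratic (127)
    w, u, s = adam_step(w, u, s, g, k)
print(w)
```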

It should be noted that there exist specific settings and convex optimization problems for which Adam (and RMSProp and Adadelta) does not necessarily converge to a minimizer, see [172]. The authors of [172] propose a modification termed AMSGrad, which avoids this issue. Nonetheless, Adam remains a highly popular algorithm for the training of neural networks. We also note that, in the stochastic optimization setting, convergence proofs of such algorithms in general still require \( k\) -dependent decrease of the “global” learning rate such as \( \alpha=O(k^{-1/2})\) in (146.b) and (147.b).

11.5 Backpropagation

We now explain how to apply gradient-based methods to the training of neural networks. Let \( \Phi\in\mathcal{N}_{d_0}^{d_{L+1}}(\sigma;L,n)\) (see Definition 5) and assume that the activation function satisfies \( \sigma\in C^1(\mathbb{R})\) . As earlier, we denote the neural network parameters by

\[ \begin{align} {\boldsymbol{w}} = (({\boldsymbol{W}}^{(0)},{\boldsymbol{b}}^{(0)}),\dots,({\boldsymbol{W}}^{(L)},{\boldsymbol{b}}^{(L)})) \end{align} \]

(148)

with weight matrices \( {\boldsymbol{W}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}\times d_\ell}\) and bias vectors \( {\boldsymbol{b}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}}\) . Additionally, we fix a differentiable loss function \( \mathcal{L}:\mathbb{R}^{d_{L+1}}\times \mathbb{R}^{d_{L+1}}\to\mathbb{R}\) , e.g., \( \mathcal{L}({\boldsymbol{w}},\tilde{\boldsymbol{w}})=\| {\boldsymbol{w}}-\tilde{\boldsymbol{w}} \|_{}^2/2\) , and assume given data \( ({\boldsymbol{x}}_j,{\boldsymbol{y}}_j)_{j=1}^m\subseteq \mathbb{R}^{d_0}\times\mathbb{R}^{d_{L+1}}\) . The goal is to minimize an empirical risk of the form

\[ \begin{align*} f ({\boldsymbol{w}}) \mathrm{:}= \frac{1}{m}\sum_{j=1}^m\mathcal{L}(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}),{\boldsymbol{y}}_j) \end{align*} \]

as a function of the neural network parameters \( {\boldsymbol{w}}\) . An application of the gradient step (110) to update the parameters requires the computation of

\[ \begin{align*} \nabla f ({\boldsymbol{w}}) = \frac{1}{m}\sum_{j=1}^m \nabla_{\boldsymbol{w}}\mathcal{L}(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}),{\boldsymbol{y}}_j). \end{align*} \]

For stochastic methods, as explained in Example 5, we only compute the average over a (random) mini-batch of the dataset. In either case, we need an algorithm to determine \( \nabla_{\boldsymbol{w}}\mathcal{L}(\Phi({\boldsymbol{x}},{\boldsymbol{w}}),{\boldsymbol{y}})\) , i.e., the gradients

\[ \begin{align} \nabla_{{\boldsymbol{b}}^{(\ell)}}\mathcal{L}(\Phi({\boldsymbol{x}},{\boldsymbol{w}}),{\boldsymbol{y}})\in\mathbb{R}^{d_{\ell+1}},~ \nabla_{{\boldsymbol{W}}^{(\ell)}}\mathcal{L}(\Phi({\boldsymbol{x}},{\boldsymbol{w}}),{\boldsymbol{y}})\in\mathbb{R}^{d_{\ell+1}\times d_\ell} \end{align} \]

(149)

for all \( \ell=0,\dots,L\) .

The backpropagation algorithm [173] provides an efficient way to do so, by storing intermediate values in the calculation. To explain it, for fixed input \( {\boldsymbol{x}}\in\mathbb{R}^{d_0}\) introduce the notation

\[ \begin{align} \bar{\boldsymbol{x}}^{(1)} &\mathrm{:}= {\boldsymbol{W}}^{(0)}{\boldsymbol{x}}+{\boldsymbol{b}}^{(0)} \end{align} \]

(150.a)

\[ \begin{align} \bar{\boldsymbol{x}}^{(\ell+1)}&\mathrm{:}= {\boldsymbol{W}}^{(\ell)}\sigma(\bar{\boldsymbol{x}}^{(\ell)})+{\boldsymbol{b}}^{(\ell)} ~~\text{for }\ell\in\{1,\dots,L\}, \end{align} \]

(150.b)

where the application of \( \sigma:\mathbb{R}\to\mathbb{R}\) to a vector is, as always, understood componentwise.

With the notation of Definition 1, \( {\boldsymbol{x}}^{(\ell)}=\sigma(\bar{\boldsymbol{x}}^{(\ell)})\in\mathbb{R}^{d_\ell}\) for \( \ell=1,\dots,L\) and \( \bar{\boldsymbol{x}}^{(L+1)}={\boldsymbol{x}}^{(L+1)}=\Phi({\boldsymbol{x}},{\boldsymbol{w}})\in\mathbb{R}^{d_{L+1}}\) is the output of the neural network. Therefore, the \( \bar{\boldsymbol{x}}^{(\ell)}\) for \( \ell=1,\dots,L\) are sometimes also referred to as the preactivations.

In the following, we additionally fix \( {\boldsymbol{y}}\in\mathbb{R}^{d_{L+1}}\) and write

\[ \begin{align*} \mathcal{L}\mathrm{:}= \mathcal{L}(\Phi({\boldsymbol{x}},{\boldsymbol{w}}),{\boldsymbol{y}})=\mathcal{L}(\bar{\boldsymbol{x}}^{(L+1)},{\boldsymbol{y}}). \end{align*} \]

Note that \( \bar{\boldsymbol{x}}^{(k)}\) depends on \( ({\boldsymbol{W}}^{(\ell)},{\boldsymbol{b}}^{(\ell)})\) only if \( k>\ell\) . Since \( \bar{\boldsymbol{x}}^{(\ell+1)}\) is a function of \( \bar{\boldsymbol{x}}^{(\ell)}\) for each \( \ell\) , by repeated application of the chain rule

\[ \begin{align} \frac{\partial \mathcal{L}}{\partial W_{ij}^{(\ell)}} = \underbrace{\frac{\partial \mathcal{L}}{\partial \bar{\boldsymbol{x}}^{(L+1)}}}_{\in \mathbb{R}^{1\times d_{L+1}}} \underbrace{\frac{\partial \bar{\boldsymbol{x}}^{(L+1)}}{\partial \bar{\boldsymbol{x}}^{(L)}}}_{\in\mathbb{R}^{d_{L+1}\times d_L}}\cdots \underbrace{\frac{\partial \bar{\boldsymbol{x}}^{(\ell+2)}}{\partial \bar{\boldsymbol{x}}^{(\ell+1)}}}_{\in\mathbb{R}^{d_{\ell+2}\times d_{\ell+1}}} \underbrace{\frac{\partial \bar{\boldsymbol{x}}^{(\ell+1)}}{\partial W_{ij}^{(\ell)}}}_{\in\mathbb{R}^{d_{\ell+1}\times 1}}. \end{align} \]

(151)

An analogous calculation holds for \( \partial\mathcal{L} /\partial b_j^{(\ell)}\) . Since all terms in (151) are easy to compute (see (150)), in principle we could use this formula to determine the gradients in (149). To avoid unnecessary computations, the main idea of backpropagation is to introduce

\[ \begin{align*} {\boldsymbol{\alpha}}^{(\ell)}\mathrm{:}= \nabla_{\bar{\boldsymbol{x}}^{(\ell)}}\mathcal{L}\in\mathbb{R}^{d_{\ell}}~~\text{for all }\ell=1,\dots,L+1 \end{align*} \]

and observe that

\[ \begin{align*} \frac{\partial \mathcal{L}}{\partial W_{ij}^{(\ell)}} = ({\boldsymbol{\alpha}}^{(\ell+1)})^\top \frac{\partial \bar{\boldsymbol{x}}^{(\ell+1)}}{\partial W_{ij}^{(\ell)}}. \end{align*} \]

As the following lemma shows, the \( {\boldsymbol{\alpha}}^{(\ell)}\) can be computed recursively for \( \ell=L+1,\dots,1\) . This explains the name “backpropagation”. As before, \( \odot\) denotes the componentwise product.

Lemma 28

It holds

\[ \begin{align} {\boldsymbol{\alpha}}^{(L+1)} = \nabla_{\bar{\boldsymbol{x}}^{(L+1)}}\mathcal{L}(\bar{\boldsymbol{x}}^{(L+1)},{\boldsymbol{y}}) \end{align} \]

(152)

and

\[ \begin{align*} {\boldsymbol{\alpha}}^{(\ell)} = \sigma'(\bar{\boldsymbol{x}}^{(\ell)}) \odot ({\boldsymbol{W}}^{(\ell)})^\top{\boldsymbol{\alpha}}^{(\ell+1)} ~~\text{for all }\ell=L,\dots,1. \end{align*} \]

Proof

Equation (152) holds by definition. For \( \ell\in\{1,\dots,L\}\) by the chain rule

\[ \begin{align*} {\boldsymbol{\alpha}}^{(\ell)} = \frac{\partial \mathcal{L}}{\partial \bar{\boldsymbol{x}}^{(\ell)}} = \underbrace{\Big(\frac{\partial \bar{\boldsymbol{x}}^{(\ell+1)}}{\partial \bar{\boldsymbol{x}}^{(\ell)}}\Big)^\top}_{\in\mathbb{R}^{d_{\ell}\times d_{\ell+1}}} \underbrace{\frac{\partial \mathcal{L}}{\partial \bar{\boldsymbol{x}}^{(\ell+1)}}}_{\in\mathbb{R}^{d_{\ell+1}\times 1}} = \Big(\frac{\partial \bar{\boldsymbol{x}}^{(\ell+1)}}{\partial \bar{\boldsymbol{x}}^{(\ell)}}\Big)^\top {\boldsymbol{\alpha}}^{(\ell+1)}. \end{align*} \]

By (150.b) for \( i\in\{1,\dots,d_{\ell+1}\}\) , \( j\in\{1,\dots,d_{\ell}\}\)

\[ \begin{align*} \Big(\frac{\partial \bar{\boldsymbol{x}}^{(\ell+1)}}{\partial \bar{\boldsymbol{x}}^{(\ell)}}\Big)_{ij} = \frac{\partial \bar x^{(\ell+1)}_i}{\partial \bar x^{(\ell)}_j} = W_{ij}^{(\ell)}\sigma'(\bar x_j^{(\ell)}). \end{align*} \]

Thus the claim follows.

Putting everything together, we obtain explicit formulas for (149).

Proposition 14

It holds

\[ \begin{align*} \nabla_{{\boldsymbol{b}}^{(\ell)}}\mathcal{L} = {\boldsymbol{\alpha}}^{(\ell+1)}\in\mathbb{R}^{d_{\ell+1}} ~~\text{for }\ell=0,\dots,L \end{align*} \]

and

\[ \begin{align*} \nabla_{{\boldsymbol{W}}^{(0)}}\mathcal{L} = {\boldsymbol{\alpha}}^{(1)} {\boldsymbol{x}}^\top \in\mathbb{R}^{d_{1}\times d_0} \end{align*} \]

and

\[ \begin{align*} \nabla_{{\boldsymbol{W}}^{(\ell)}}\mathcal{L} = {\boldsymbol{\alpha}}^{(\ell+1)} \sigma(\bar{\boldsymbol{x}}^{(\ell)})^\top \in\mathbb{R}^{d_{\ell+1}\times d_\ell} ~~ \text{for }\ell=1,\dots,L. \end{align*} \]

Proof

By (150.a) for \( i\) , \( k\in\{1,\dots,d_{1}\}\) , and \( j\in\{1,\dots,d_{0}\}\)

\[ \begin{align*} \frac{\partial \bar x_k^{(1)}}{\partial b_i^{(0)}} = \delta_{ki}~~ \text{and}~~ \frac{\partial \bar x_k^{(1)}}{\partial W_{ij}^{(0)}} = \delta_{ki} x_j, \end{align*} \]

and by (150.b) for \( \ell\in\{1,\dots,L\}\) and \( i\) , \( k\in\{1,\dots,d_{\ell+1}\}\) , and \( j\in\{1,\dots,d_{\ell}\}\)

\[ \begin{align*} \frac{\partial \bar x_k^{(\ell+1)}}{\partial b_i^{(\ell)}} = \delta_{ki}~~ \text{and}~~ \frac{\partial \bar x_k^{(\ell+1)}}{\partial W_{ij}^{(\ell)}} = \delta_{ki}\sigma(\bar x^{(\ell)}_j). \end{align*} \]

Thus, with \( {\boldsymbol{e}}_i=(\delta_{ki})_{k=1}^{d_{\ell+1}}\)

\[ \begin{align*} \frac{\partial\mathcal{L}}{\partial b_i^{(\ell)}} = \Big(\frac{\partial\bar{\boldsymbol{x}}^{(\ell+1)}}{\partial b_i^{(\ell)}}\Big)^\top \frac{\partial \mathcal{L}}{\partial\bar{\boldsymbol{x}}^{(\ell+1)}} = {\boldsymbol{e}}_i^\top {\boldsymbol{\alpha}}^{(\ell+1)} = \alpha_i^{(\ell+1)}~~\text{for }\ell\in\{0,\dots,L\} \end{align*} \]

and similarly

\[ \begin{align*} \frac{\partial\mathcal{L}}{\partial W_{ij}^{(0)}} = \Big(\frac{\partial\bar{\boldsymbol{x}}^{(1)}}{\partial W_{ij}^{(0)}}\Big)^\top {\boldsymbol{\alpha}}^{(1)} = x_j{\boldsymbol{e}}_i^\top {\boldsymbol{\alpha}}^{(1)} = x_j\alpha_i^{(1)} \end{align*} \]

and

\[ \begin{align*} \frac{\partial\mathcal{L}}{\partial W_{ij}^{(\ell)}} = \sigma(\bar x_j^{(\ell)})\alpha_i^{(\ell+1)} ~~\text{for }\ell\in\{1,\dots,L\}. \end{align*} \]

This concludes the proof.

Lemma 28 and Proposition 14 motivate Algorithm 1, in which a forward pass computing \( \bar{\boldsymbol{x}}^{(\ell)}\) , \( \ell=1,\dots,L+1\) , is followed by a backward pass to determine the \( {\boldsymbol{\alpha}}^{(\ell)}\) , \( \ell=L+1,\dots,1\) , and the gradients of \( \mathcal{L}\) with respect to the neural network parameters. This shows how to use gradient-based optimizers from the previous sections for the training of neural networks.

Two important remarks are in order. First, the objective function associated to neural networks is typically not convex as a function of the neural network weights and biases. Thus, the analysis of the previous sections will in general not be directly applicable. It may, however, still give some insight into the convergence behavior locally around a (local) minimizer. Second, we assumed the activation function to be continuously differentiable, which does not hold for ReLU. Using the concept of subgradients, gradient-based algorithms and their analysis may be generalized to some extent to also accommodate non-differentiable loss functions, see Exercises 36-38.


Algorithm 1 Backpropagation
1. Input: Network input \( {\boldsymbol{x}}\) , target output \( {\boldsymbol{y}}\) , neural network parameters \( (({\boldsymbol{W}}^{(0)},{\boldsymbol{b}}^{(0)}),\dots,({\boldsymbol{W}}^{(L)},{\boldsymbol{b}}^{(L)}))\)
2. Output: Gradients of the loss function \( \mathcal{L}\) with respect to neural network parameters
3.
4. Forward pass
5. \( \bar{\boldsymbol{x}}^{(1)} \leftarrow {\boldsymbol{W}}^{(0)}{\boldsymbol{x}}+{\boldsymbol{b}}^{(0)}\)
6. for \( \ell = 1,\dots,L\) do
7. \( \bar{\boldsymbol{x}}^{(\ell+1)}\leftarrow {\boldsymbol{W}}^{(\ell)}\sigma(\bar{\boldsymbol{x}}^{(\ell)})+{\boldsymbol{b}}^{(\ell)}\)
8. end for
9.
10. Backward pass
11. \( {\boldsymbol{\alpha}}^{(L+1)} \leftarrow \nabla_{\bar{\boldsymbol{x}}^{(L+1)}}\mathcal{L}(\bar{\boldsymbol{x}}^{(L+1)},{\boldsymbol{y}})\)
12. for \( \ell = L,\dots,1\) do
13. \( \nabla_{{\boldsymbol{b}}^{(\ell)}}\mathcal{L}\leftarrow {\boldsymbol{\alpha}}^{(\ell+1)}\)
14. \( \nabla_{{\boldsymbol{W}}^{(\ell)}}\mathcal{L}\leftarrow {\boldsymbol{\alpha}}^{(\ell+1)} \sigma(\bar{\boldsymbol{x}}^{(\ell)})^\top\)
15. \( {\boldsymbol{\alpha}}^{(\ell)} \leftarrow \sigma'(\bar{\boldsymbol{x}}^{(\ell)}) \odot ({\boldsymbol{W}}^{(\ell)})^\top{\boldsymbol{\alpha}}^{(\ell+1)}\)
16. end for
17. \( \nabla_{{\boldsymbol{b}}^{(0)}}\mathcal{L} \leftarrow {\boldsymbol{\alpha}}^{(1)}\)
18. \( \nabla_{{\boldsymbol{W}}^{(0)}}\mathcal{L}\leftarrow {\boldsymbol{\alpha}}^{(1)}{\boldsymbol{x}}^\top\)
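The following NumPy sketch implements Algorithm 1 for the squared loss \( \mathcal{L}({\boldsymbol{w}},\tilde{\boldsymbol{w}})=\| {\boldsymbol{w}}-\tilde{\boldsymbol{w}} \|_{}^2/2\) and the \( \tanh\) activation; the architecture, the random weights, and the finite-difference check at the end are illustrative choices and not part of the text.

```python
import numpy as np

# Sketch of Algorithm 1; weights is a list [(W0, b0), ..., (WL, bL)].
def backprop(x, y, weights, act=np.tanh, act_prime=lambda z: 1 - np.tanh(z) ** 2):
    # forward pass: store the preactivations xbar^(1), ..., xbar^(L+1)
    xbar = [weights[0][0] @ x + weights[0][1]]
    for W, b in weights[1:]:
        xbar.append(W @ act(xbar[-1]) + b)
    # backward pass
    alpha = xbar[-1] - y                               # gradient of the squared loss
    grads = [None] * len(weights)
    for ell in range(len(weights) - 1, 0, -1):
        W, _ = weights[ell]
        grads[ell] = (np.outer(alpha, act(xbar[ell - 1])), alpha)  # (dW^(ell), db^(ell))
        alpha = act_prime(xbar[ell - 1]) * (W.T @ alpha)           # alpha^(ell)
    grads[0] = (np.outer(alpha, x), alpha)             # (dW^(0), db^(0))
    return grads

# small finite-difference sanity check on one weight entry (illustration only)
rng = np.random.default_rng(0)
weights = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
           (rng.standard_normal((2, 4)), rng.standard_normal(2))]
x, y = rng.standard_normal(3), rng.standard_normal(2)

def loss(ws):
    out = ws[1][0] @ np.tanh(ws[0][0] @ x + ws[0][1]) + ws[1][1]
    return 0.5 * np.sum((out - y) ** 2)

g = backprop(x, y, weights)
eps = 1e-6
ws_pert = [(W.copy(), b.copy()) for W, b in weights]
ws_pert[1][0][0, 1] += eps
print(g[1][0][0, 1], (loss(ws_pert) - loss(weights)) / eps)  # should nearly agree
```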

Bibliography and further reading

The convergence proof of gradient descent for smooth and strongly convex functions presented in Section 11.1 follows [1], which provides a collection of simple proofs for various (stochastic) gradient descent methods together with detailed references. For standard textbooks on gradient descent and convex optimization, see [2, 3, 4, 5, 6, 7, 8]. These references also include convergence proofs under weaker assumptions than those considered here. For convergence results assuming for example the Polyak-Łojasiewicz inequality, which does not require convexity, see, e.g., [9].

Stochastic gradient descent (SGD) discussed in Section 11.2 originally dates back to Robbins and Monro [149]. The proof presented here for strongly convex objective functions is based on [151, 150] and in particular uses the step size from [151]; also see [177, 178, 179, 180]. For insights into the potential benefits of SGD in terms of generalization properties, see, e.g., [181, 182, 183, 184, 185].

The heavy ball method in Section 11.3 goes back to Polyak [154]. To motivate the algorithm we proceed as in [156, 157, 158], and also refer to [186, 187]. The analysis of Nesterov acceleration [155] follows the arguments in [160, 161], with a similar proof also given in [162].

For Section 11.4 on adaptive learning rates, we follow the overviews [25, Section 8.5], [166], and [19, Chapter 11] and the original works that introduced AdaGrad [163], Adadelta [164], RMSProp [95] and Adam [165]. Regarding the analysis of RMSProp and Adam, we refer to [172], which gives an example of non-convergence and provides a modification of the algorithm together with a convergence analysis. Convergence proofs for (variations of) AdaGrad and Adam can also be found in [188].

The backpropagation algorithm discussed in Section 11.5 was popularized by Rumelhart, Hinton and Williams [173]; for further details on the historical development we refer to [22, Section 5.5], and for a more in-depth discussion of the algorithm, see for instance [24, 189, 190].

Similar discussions of gradient descent algorithms in the context of deep learning as given here were presented in [2] and [3]: [2, Chapter 7] provides accessible convergence proofs of (stochastic) gradient descent and gradient flow under different smoothness and convexity assumptions, and [3, Part III] gives a broader overview of optimization techniques in deep learning, but restricts part of the analysis to quadratic objective functions. As in [150], our analysis of gradient descent, stochastic gradient descent, and Nesterov acceleration, exclusively focused on strongly convex objective functions. We also refer to this paper for a more detailed general treatment and analysis of optimization algorithms in machine learning, covering various methods that are omitted here. Details on implementations in Python can for example be found in [19], and for recommendations and tricks regarding the implementation we also refer to [191, 167].

Exercises

Exercise 34

Let \( L>0\) and let \( f:\mathbb{R}^n\to\mathbb{R}\) be continuously differentiable. Show that (113) implies (112).

Exercise 35

Let \( f \in C^1(\mathbb{R}^n)\) . Show that \( f \) is convex in the sense of Definition 21 if and only if

\[ \begin{align*} f ({\boldsymbol{w}})+\left\langle \nabla f ({\boldsymbol{w}}), {\boldsymbol{v}}-{\boldsymbol{w}}\right\rangle_{}\le f ({\boldsymbol{v}})~~ \text{for all }{\boldsymbol{w}},{\boldsymbol{v}}\in\mathbb{R}^n. \end{align*} \]

Definition 23

For convex \( f :\mathbb{R}^n\to\mathbb{R}\) , \( {\boldsymbol{g}}\in\mathbb{R}^n\) is called a subgradient (or subdifferential) of \( f \) at \( {\boldsymbol{v}}\) if and only if

\[ \begin{equation} f ({\boldsymbol{w}})\ge f ({\boldsymbol{v}})+\left\langle {\boldsymbol{g}}, {\boldsymbol{w}}-{\boldsymbol{v}}\right\rangle_{}~~\text{for all }{\boldsymbol{w}}\in\mathbb{R}^n. \end{equation} \]

(153)

The set of all subgradients of \( f \) at \( {\boldsymbol{v}}\) is denoted by \( \partial f ({\boldsymbol{v}})\) .

For convex functions \( f \) , a subgradient always exists, i.e., \( \partial f ({\boldsymbol{v}})\) is necessarily nonempty; see, e.g., [146, Section 1.2]. Subgradients generalize the notion of gradients for convex functions, since for any convex continuously differentiable \( f \) , (153) is satisfied with \( {\boldsymbol{g}}=\nabla f ({\boldsymbol{v}})\) . The following three exercises on subgradients are based on the lecture notes [192]. Also see, e.g., [193, 146, 147] for more details on subgradient descent.

Exercise 36

Let \( f :\mathbb{R}^n\to\mathbb{R}\) be convex and \( \rm{Lip}( f )\le L\) . Show that \( \| {\boldsymbol{g}} \|_{}\le L\) holds for every \( {\boldsymbol{g}}\in\partial f ({\boldsymbol{v}})\) .

Exercise 37

Let \( f :\mathbb{R}^n\to\mathbb{R}\) be convex, \( \rm{ Lip}( f )\le L\) and suppose that \( {\boldsymbol{w}}_*\) is a minimizer of \( f \) . Fix \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , and for \( k\in\mathbb{N}_0\) define the subgradient descent update

\[ {\boldsymbol{w}}_{k+1}\mathrm{:}= {\boldsymbol{w}}_{k}-h_k {\boldsymbol{g}}_{k}, \]

where \( {\boldsymbol{g}}_{k}\) is an arbitrary fixed element of \( \partial f ({\boldsymbol{w}}_{k})\) . Show that

\[ \min_{i\le k} f ({\boldsymbol{w}}_{i}) - f ({\boldsymbol{w}}_*) \le \frac{ \| {\boldsymbol{w}}_{0} - {\boldsymbol{w}}_* \|_{}^2 + L^2 \sum \limits_{i=1}^k h_i^2 }{2 \sum \limits_{i=1}^k h_i }. \]

Hint: Start by recursively expanding \( \| {\boldsymbol{w}}_k-{\boldsymbol{w}}_* \|_{}^2=…\) , and then apply the property of the subgradient.

Exercise 38

Consider the setting of Exercise 37. Determine step sizes \( h_1,\dots,h_k\) (which may depend on \( k\) , i.e., \( h_{k,1},\dots,h_{k,k}\) ) such that for any arbitrarily small \( \delta>0\)

\[ \min_{i\le k} f ({\boldsymbol{w}}_i)- f ({\boldsymbol{w}}_*) = O(k^{-1/2+\delta})~~\text{as }k\to\infty. \]

Exercise 39

Let \( {\boldsymbol{A}}\in\mathbb{R}^{n\times n}\) be symmetric positive semidefinite, \( {\boldsymbol{b}}\in\mathbb{R}^n\) and \( c\in\mathbb{R}\) . Denote the eigenvalues of \( {\boldsymbol{A}}\) by \( \zeta_1\ge\dots\ge\zeta_n\ge 0\) . Show that the objective function

\[ \begin{align} f ({\boldsymbol{w}})\mathrm{:}= \frac{1}{2}{\boldsymbol{w}}^\top {\boldsymbol{A}}{\boldsymbol{w}}+{\boldsymbol{b}}^\top{\boldsymbol{w}}+c \end{align} \]

(154)

is convex and \( \zeta_1\) -smooth. Moreover, if \( \zeta_n>0\) , then \( f \) is \( \zeta_n\) -strongly convex. Show that these values are optimal in the sense that \( f \) is not \( L\) -smooth if \( L<\zeta_1\) , and not \( \mu\) -strongly convex if \( \mu>\zeta_n\) .

Hint: Note that \( L\) -smoothness and \( \mu\) -strong convexity are invariant under shifts and the addition of constants. That is, for every \( \alpha\in\mathbb{R}\) and \( {\boldsymbol{\beta}}\in\mathbb{R}^n\) , \( \tilde f ({\boldsymbol{w}})\mathrm{:}= \alpha+ f ({\boldsymbol{w}}+{\boldsymbol{\beta}})\) is \( L\) -smooth or \( \mu\) -strongly convex if and only if \( f \) is. It thus suffices to consider \( {\boldsymbol{w}}^\top{\boldsymbol{A}}{\boldsymbol{w}} /2\) .

Exercise 40

Let \( f \) be as in Exercise 39. Show that gradient descent converges for arbitrary initialization \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , if and only if

\[ \begin{align*} \max_{j=1,\dots,n}|1-h\zeta_j|<1. \end{align*} \]

Show that \( \rm{argmin}_{h>0} \max_{j=1,\dots,n} |1-h\zeta_j|=2/(\zeta_1+\zeta_n)\) and conclude that the convergence will be slow if \( f \) is ill-conditioned, i.e., if \( \zeta_1/\zeta_n\gg 1\) .

Hint: Assume first that \( {\boldsymbol{b}}=\boldsymbol{0}\in\mathbb{R}^n\) and \( c=0\in\mathbb{R}\) in (154), and use the singular value decomposition \( {\boldsymbol{A}}={\boldsymbol{U}}^\top{\rm{diag}}(\zeta_1,\dots,\zeta_n){\boldsymbol{U}}\) .

Exercise 41

Show that (136) can equivalently be written as (137) with \( \tau=\sqrt{\mu/L}\) , \( \alpha=1/L\) , \( \beta = (1-\tau)/(1+\tau)\) and the initialization \( {\boldsymbol{u}}_0 = ((1+\tau){\boldsymbol{v}}_0-{\boldsymbol{w}}_0)/\tau\) .

12 Wide neural networks and the neural tangent kernel

In this chapter we explore the dynamics of training (shallow) neural networks of large width. Throughout assume given data pairs

\[ \begin{equation} ({\boldsymbol{x}}_i,y_i)\in\mathbb{R}^d\times \mathbb{R},~~ i\in\{1,\dots,m\}, \end{equation} \]

(155.a)

for distinct \( {\boldsymbol{x}}_i\) . We wish to train a model (e.g., a neural network) \( \Phi({\boldsymbol{x}},{\boldsymbol{w}})\) depending on the input \( {\boldsymbol{x}}\in\mathbb{R}^d\) and the parameters \( {\boldsymbol{w}}\in\mathbb{R}^n\) . To this end we consider either minimization of the ridgeless (unregularized) objective

\[ \begin{equation} f ({\boldsymbol{w}})\mathrm{:}= \sum_{i=1}^m (\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}})-y_i)^2, \end{equation} \]

(155.b)

or, for some regularization parameter \( \lambda\ge 0\) , of the ridge regularized objective

\[ \begin{equation} f _\lambda({\boldsymbol{w}})\mathrm{:}= f ({\boldsymbol{w}}) + \lambda \| {\boldsymbol{w}} \|_{}^2. \end{equation} \]

(155.c)

The adjectives ridge and ridgeless thus indicate the presence or absence of the regularization term \( \| {\boldsymbol{w}} \|_{}^2\) . In the ridgeless case, the objective is a multiple of the empirical risk \( \widehat{\mathcal{R}}_S(\Phi)\) in (3) for the sample \( S = ({\boldsymbol{x}}_i,y_i)_{i=1}^m\) and the square loss. Regularization is a common tool in machine learning to improve model generalization and stability. The goal of this chapter is to get some insight into the dynamics of \( \Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)\) as the parameter vector \( {\boldsymbol{w}}_k\) progresses during training. Additionally, we want to gain some intuition about the influence of regularization, and the behavior of the trained model \( {\boldsymbol{x}}\mapsto \Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)\) for large \( k\) . We do so through the lens of so-called kernel methods. As a training algorithm we exclusively focus on gradient descent with constant step size.

If \( \Phi({\boldsymbol{x}},{\boldsymbol{w}})\) depends linearly on the parameters \( {\boldsymbol{w}}\) , the objective function (155.c) is convex. As established in the previous chapter (cf. Remark 11), gradient descent then finds a global minimizer. For typical neural network architectures, \( {\boldsymbol{w}}\mapsto \Phi({\boldsymbol{x}},{\boldsymbol{w}})\) is not linear, and such a statement is in general not true. Recent research has shown that neural network behavior tends to linearize in \( {\boldsymbol{w}}\) as the network width increases [194]. This allows one to transfer some of the results and techniques from the linear case to the training of neural networks.

We start this chapter in Sections 12.1 and 12.2 by recalling (kernel) least-squares methods, which describe linear (in \( {\boldsymbol{w}}\) ) models. Following [195], the subsequent sections examine why neural networks exhibit linear-like behavior in the infinite-width limit. In Section 12.3 we introduce the so-called tangent kernel. Section 12.4 presents abstract results showing, under suitable assumptions, convergence towards a global minimizer when training the model. Section 12.5 builds on this analysis and discusses connections to kernel regression with the tangent kernel. In Section 12.6 we then detail the implications for wide neural networks. A similar treatment of these results was previously given by Telgarsky in [2, Chapter 8] for gradient flow (rather than gradient descent), based on [196].

12.1 Linear least-squares regression

Arguably one of the simplest machine learning algorithms is linear least-squares regression, e.g., [1, 2, 3, 4]. Given data (155.a), we fit a linear function \( {\boldsymbol{x}}\mapsto\Phi({\boldsymbol{x}},{\boldsymbol{w}})\mathrm{:}= {\boldsymbol{x}}^\top{\boldsymbol{w}}\) by minimizing \( f \) in (155.b) or \( f _\lambda\) in (155.c). With

\[ \begin{equation} {\boldsymbol{A}} = \begin{pmatrix} {\boldsymbol{x}}_1^\top\\ \vdots\\ {\boldsymbol{x}}_m^\top \end{pmatrix}\in\mathbb{R}^{m\times d} ~~\text{and}~~ {\boldsymbol{y}} = \begin{pmatrix} y_1\\ \vdots\\ y_m \end{pmatrix}\in\mathbb{R}^m \end{equation} \]

(168)

it holds

\[ \begin{equation} f ({\boldsymbol{w}})=\| {\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{y}} \|_{}^2~~\text{and}~~ f _\lambda({\boldsymbol{w}})= f ({\boldsymbol{w}})+\lambda\| {\boldsymbol{w}} \|_{}^2. \end{equation} \]

(169)

The \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m\) are referred to as the training points (or design points), and throughout the rest of Section 12.1, we denote their span by

\[ \begin{equation} \tilde H \mathrm{:}= {\rm span}\{{\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m\}\subseteq\mathbb{R}^d. \end{equation} \]

(170)

This is the subspace spanned by the rows of \( {\boldsymbol{A}}\) .

Remark 14

More generally, the ansatz \( \Phi({\boldsymbol{x}},({\boldsymbol{w}},b))\mathrm{:}={\boldsymbol{w}}^\top {\boldsymbol{x}}+b\) corresponds to

\[ \Phi({\boldsymbol{x}},({\boldsymbol{w}},b))=(1,{\boldsymbol{x}}^\top) \begin{pmatrix}b\\ {\boldsymbol{w}}\end{pmatrix}. \]

Therefore, additionally allowing for a bias can be treated similarly.

12.1.1 Existence of minimizers

We start with the ridgeless case \( \lambda=0\) . The model \( \Phi({\boldsymbol{x}},{\boldsymbol{w}})={\boldsymbol{x}}^\top{\boldsymbol{w}}\) is linear in both \( {\boldsymbol{x}}\) and \( {\boldsymbol{w}}\) . In particular, \( {\boldsymbol{w}}\mapsto f ({\boldsymbol{w}})\) is a convex function by Exercise 39. If \( {\boldsymbol{A}}\) is invertible (which requires \( m=d\) ), then \( f \) has the unique minimizer \( {\boldsymbol{w}}_*={\boldsymbol{A}}^{-1}{\boldsymbol{y}}\) . If \( {\rm {rank}}({\boldsymbol{A}})=d\) , then \( f \) is strongly convex by Exercise 39, and there still exists a unique minimizer. If however \( {\rm {rank}}({\boldsymbol{A}})<d\) , then \( \ker({\boldsymbol{A}})\neq\{\boldsymbol{0}\}\) and there exist infinitely many minimizers of \( f \) . To guarantee uniqueness, we consider the minimum norm solution

\[ \begin{equation} {\boldsymbol{w}}_*\mathrm{:}={\rm argmin}_{{\boldsymbol{w}}\in M}\| {\boldsymbol{w}} \|_{},~~ M\mathrm{:}= \{{\boldsymbol{w}}\in\mathbb{R}^d\,|\, f ({\boldsymbol{w}})\le f ({\boldsymbol{v}})~\forall{\boldsymbol{v}}\in\mathbb{R}^d\}. \end{equation} \]

(171)

It is a standard result that \( {\boldsymbol{w}}_*\) is well-defined and belongs to the span \( \tilde H\) of the training points, e.g., [2, 1, 4]. While one way to prove this is through the pseudoinverse (see Theorem 50), we provide an alternative argument here, which extends directly to the infinite-dimensional case discussed in Section 12.2 below.

Theorem 22

There is a unique minimum norm solution \( {\boldsymbol{w}}_*\in\mathbb{R}^d\) in (171). It lies in the subspace \( \tilde H\) , and is the unique minimizer of \( f \) in \( \tilde H\) , i.e.

\[ \begin{equation} {\boldsymbol{w}}_*={\rm argmin}_{\tilde{\boldsymbol{w}}\in \tilde H} f (\tilde{\boldsymbol{w}}). \end{equation} \]

(172)

Proof

We start with existence and uniqueness of \( {\boldsymbol{w}}_*\in \mathbb{R}^d\) in (171). Let

\[ C\mathrm{:}= {\rm span}\left\{{\boldsymbol{A}}{\boldsymbol{w}}\, \middle|\,{\boldsymbol{w}}\in \mathbb{R}^d\right\}\subseteq\mathbb{R}^m. \]

Then \( C\) is a finite dimensional space, and as such it is closed and convex. Therefore \( {\boldsymbol{y}}_*={\rm {argmin}}_{\tilde{\boldsymbol{y}}\in C}\| \tilde{\boldsymbol{y}}-{\boldsymbol{y}} \|_{}\) exists and is unique (this is a fundamental property of Hilbert spaces, see Theorem 55). In particular, the set \( M=\{{\boldsymbol{w}}\in \mathbb{R}^d\,|\,{\boldsymbol{A}}{\boldsymbol{w}}={\boldsymbol{y}}_*\}\subseteq \mathbb{R}^d\) of minimizers of \( f \) is not empty. Clearly \( M\subseteq \mathbb{R}^d\) is closed and convex. As before, \( {\boldsymbol{w}}_*={\rm {argmin}}_{{\boldsymbol{w}}\in M}\| {\boldsymbol{w}} \|_{}\) exists and is unique.

It remains to show (172). Decompose \( {\boldsymbol{w}}_*=\tilde{\boldsymbol{w}}+\hat{\boldsymbol{w}}\) with \( \tilde{\boldsymbol{w}}\in\tilde H\) and \( \hat{\boldsymbol{w}}\in\tilde H^\perp\) (see Definition 56). By definition of \( {\boldsymbol{A}}\) it holds \( {\boldsymbol{A}}{\boldsymbol{w}}_*={\boldsymbol{A}}\tilde{\boldsymbol{w}}\) and \( f ({\boldsymbol{w}}_*)= f (\tilde{\boldsymbol{w}})\) . Moreover \( \| {\boldsymbol{w}}_* \|_{}^2=\| \tilde{\boldsymbol{w}} \|_{}^2+\| \hat{\boldsymbol{w}} \|_{}^2\) . Since \( {\boldsymbol{w}}_*\) is the minimum norm solution, necessarily \( \hat{\boldsymbol{w}}=\boldsymbol{0}\) , i.e., \( {\boldsymbol{w}}_*=\tilde{\boldsymbol{w}}\in\tilde H\) . To conclude the proof, we need to show that \( {\boldsymbol{w}}_*\) is the only minimizer of \( f \) in \( \tilde H\) . Assume there exists a minimizer \( {\boldsymbol{v}}\) of \( f \) in \( \tilde H\) different from \( {\boldsymbol{w}}_*\) . Then \( \boldsymbol{0}\neq {\boldsymbol{w}}_*-{\boldsymbol{v}}\in\tilde H\) . Since \( \tilde H\) is spanned by the rows of \( {\boldsymbol{A}}\) , i.e., \( \tilde H=\ker({\boldsymbol{A}})^\perp\) , the map \( {\boldsymbol{A}}\) is injective on \( \tilde H\) . Thus \( {\boldsymbol{A}}({\boldsymbol{w}}_*-{\boldsymbol{v}})\neq\boldsymbol{0}\) and \( {\boldsymbol{y}}_*={\boldsymbol{A}}{\boldsymbol{w}}_*\neq {\boldsymbol{A}}{\boldsymbol{v}}\) , which contradicts that \( {\boldsymbol{v}}\) minimizes \( f \) (every minimizer satisfies \( {\boldsymbol{A}}{\boldsymbol{v}}={\boldsymbol{y}}_*\) ).
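Theorem 22 can also be checked numerically. The following sketch (ours; the random data are an arbitrary example) uses the fact that a standard least-squares solver returns the minimum norm solution for rank-deficient problems, and verifies that this solution lies in the span \( \tilde H\) of the training points:

import numpy as np

rng = np.random.default_rng(0)
m, d = 3, 5                                   # fewer data points than parameters: rank(A) < d
A = rng.standard_normal((m, d))               # rows are the training points x_i^T
y = rng.standard_normal(m)

# minimum norm solution; np.linalg.lstsq returns it for rank-deficient problems
w_star, *_ = np.linalg.lstsq(A, y, rcond=None)

# orthogonal projector onto tilde{H} (the row space of A)
P = A.T @ np.linalg.pinv(A @ A.T) @ A
print(np.allclose(P @ w_star, w_star))        # True: w_* lies in tilde{H}
print(np.allclose(A @ w_star, y))             # True: the residual vanishes since rank(A) = m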

Next let \( \lambda>0\) in (169). Then minimizing \( f _\lambda\) is referred to as ridge regression or Tikhonov regularized least squares [5, 6, 7, 3]. The next theorem shows that there exists a unique minimizer of \( f _\lambda\) , which is closely connected to the minimum norm solution, e.g., [7, Theorem 5.2].

Theorem 23

Let \( \lambda>0\) . Then, with \( f _\lambda\) in (169), there exists a unique minimizer

\[ {\boldsymbol{w}}_{*,\lambda}\mathrm{:}= {\rm argmin}_{{\boldsymbol{w}}\in\mathbb{R}^d} f _\lambda({\boldsymbol{w}}). \]

It holds \( {\boldsymbol{w}}_{*,\lambda}\in\tilde H\) , and

\[ \begin{equation} \lim_{\lambda\to 0}{\boldsymbol{w}}_{*,\lambda} = {\boldsymbol{w}}_*. \end{equation} \]

(173)

Proof

According to Exercise 42, \( {\boldsymbol{w}}\mapsto f _\lambda({\boldsymbol{w}})\) is strongly convex on \( \mathbb{R}^d\) , and thus also on the subspace \( \tilde H\subseteq\mathbb{R}^d\) . Therefore, there exists a unique minimizer of \( f _\lambda\) in \( \tilde H\) , which we denote by \( {\boldsymbol{w}}_{*,\lambda}\in\tilde H\) . To show that there exists no other minimizer of \( f _\lambda\) in \( \mathbb{R}^d\) , fix \( {\boldsymbol{w}}\in \mathbb{R}^d\backslash\tilde H\) and decompose \( {\boldsymbol{w}}=\tilde{\boldsymbol{w}}+\hat{\boldsymbol{w}}\) with \( \tilde{\boldsymbol{w}}\in\tilde H\) and \( \boldsymbol{0}\neq\hat{\boldsymbol{w}}\in\tilde H^\perp\) . Then

\[ f ({\boldsymbol{w}})=\| {\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{y}} \|_{}^2=\| {\boldsymbol{A}}\tilde{\boldsymbol{w}}-{\boldsymbol{y}} \|_{}^2= f (\tilde{\boldsymbol{w}}) \]

and

\[ \| {\boldsymbol{w}} \|_{}^2=\| \tilde{\boldsymbol{w}} \|_{}^2+\| \hat{\boldsymbol{w}} \|_{}^2 > \| \tilde{\boldsymbol{w}} \|_{}^2. \]

Thus \( f _\lambda({\boldsymbol{w}})> f _\lambda(\tilde{\boldsymbol{w}})\ge f _\lambda({\boldsymbol{w}}_{*,\lambda})\) .

It remains to show (173). We have

\[ \begin{align*} f_\lambda({\boldsymbol{w}}) &= ({\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{y}})^\top({\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{y}})+\lambda{\boldsymbol{w}}^\top{\boldsymbol{w}}\\ &={\boldsymbol{w}}^\top({\boldsymbol{A}}^\top{\boldsymbol{A}}+\lambda{\boldsymbol{I}}_d){\boldsymbol{w}}-2{\boldsymbol{w}}^\top{\boldsymbol{A}}^\top{\boldsymbol{y}}+\| {\boldsymbol{y}} \|_{}^2, \end{align*} \]

where \( {\boldsymbol{I}}_d\in\mathbb{R}^{d\times d}\) is the identity matrix. Since \( f_\lambda\) is strongly convex, its unique minimizer is characterized by \( \nabla f_\lambda({\boldsymbol{w}})=0\) , which yields

\[ {\boldsymbol{w}}_{*,\lambda} = ({\boldsymbol{A}}^\top{\boldsymbol{A}}+\lambda{\boldsymbol{I}}_d)^{-1}{\boldsymbol{A}}^\top{\boldsymbol{y}}. \]

Let \( {\boldsymbol{A}}={\boldsymbol{U}}{\boldsymbol{ \Sigma }}{\boldsymbol{V}}^\top\) be the singular value decomposition of \( {\boldsymbol{A}}\) , where \( {\boldsymbol{ \Sigma }}\in\mathbb{R}^{m\times d}\) contains the nonzero singular values \( s_1\ge \dots \ge s_r>0\) , and \( {\boldsymbol{U}}\in\mathbb{R}^{m\times m}\) , \( {\boldsymbol{V}}\in\mathbb{R}^{d\times d}\) are orthogonal. Then

\[ \begin{align*} {\boldsymbol{w}}_{*,\lambda} &= ({\boldsymbol{V}}({\boldsymbol{ \Sigma }}^\top{\boldsymbol{ \Sigma }}+\lambda{\boldsymbol{I}}_d){\boldsymbol{V}}^\top)^{-1}{\boldsymbol{V}}{\boldsymbol{ \Sigma }}^\top{\boldsymbol{U}}^\top{\boldsymbol{y}}\\ &={\boldsymbol{V}}\underbrace{\begin{pmatrix} \frac{s_1}{s_1^2+\lambda} &&&\\ &\ddots&&\boldsymbol{0}\\ &&\frac{s_r}{s_r^2+\lambda}&\\ &\boldsymbol{0}&&\boldsymbol{0}\\ \end{pmatrix}}_{\in\mathbb{R}^{d\times m}} {\boldsymbol{U}}^\top{\boldsymbol{y}}, \end{align*} \]

where \( \boldsymbol{0}\) stands for a zero block of suitable size. As \( \lambda\to 0\) , this converges towards \( {\boldsymbol{A}}^\dagger{\boldsymbol{y}}\) , where \( {\boldsymbol{A}}^\dagger\) denotes the pseudoinverse of \( {\boldsymbol{A}}\) , see (293). By Theorem 50, \( {\boldsymbol{A}}^\dagger{\boldsymbol{y}}={\boldsymbol{w}}_*\) .
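The limit (173) is easy to observe numerically. In the following sketch (ours, with arbitrary random data), the ridge solution \( {\boldsymbol{w}}_{*,\lambda}=({\boldsymbol{A}}^\top{\boldsymbol{A}}+\lambda{\boldsymbol{I}}_d)^{-1}{\boldsymbol{A}}^\top{\boldsymbol{y}}\) approaches the minimum norm solution \( {\boldsymbol{A}}^\dagger{\boldsymbol{y}}\) as \( \lambda\to 0\) :

import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 6
A = rng.standard_normal((m, d))
y = rng.standard_normal(m)

w_star = np.linalg.pinv(A) @ y                                   # minimum norm solution A^+ y
for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    w_lam = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)  # ridge solution w_{*,lambda}
    print(lam, np.linalg.norm(w_lam - w_star))                   # distance shrinks like O(lambda)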

12.1.2 Gradient descent

Consider gradient descent to minimize the objective \( f _\lambda\) in (169). Starting with \( {\boldsymbol{w}}_0\in\mathbb{R}^d\) , the iterative update with constant step size \( h>0\) reads

\[ \begin{equation} {\boldsymbol{w}}_{k+1} = {\boldsymbol{w}}_k-2h{\boldsymbol{A}}^\top({\boldsymbol{A}}{\boldsymbol{w}}_k-{\boldsymbol{y}})-2h\lambda{\boldsymbol{w}}_k~~\text{for all }k\in\mathbb{N}_0. \end{equation} \]

(174)

Consider again first the case \( \lambda=0\) , i.e., \( f _\lambda= f \) . Since \( f \) is convex and quadratic, by Remark 11 for sufficiently small step size \( h>0\) it holds \( f ({\boldsymbol{w}}_k)\to f ({\boldsymbol{w}}_*)\) as \( k\to\infty\) . Is it also true that \( {\boldsymbol{w}}_k\) converges to the minimal norm solution \( {\boldsymbol{w}}_*\in\tilde H\) ? Recall that \( \tilde H\) is spanned by the columns of \( {\boldsymbol{A}}^\top\) . Thus, if \( {\boldsymbol{w}}_0\in\tilde H\) , then by (174), the iterates \( {\boldsymbol{w}}_k\) never leave the subspace \( \tilde H\) . Since there is only one minimizer in \( \tilde H\) , it follows that \( {\boldsymbol{w}}_k\to{\boldsymbol{w}}_*\) as \( k\to\infty\) .

This shows that gradient descent does not find an arbitrary optimum when minimizing \( f \) , but converges towards the minimum norm solution as long as \( {\boldsymbol{w}}_0\in\tilde H\) (e.g., \( {\boldsymbol{w}}_0=\boldsymbol{0}\) ). It is well known [8, Theorem 16] that iterations of type (174) lead to minimal norm solutions, as made more precise by the next proposition. To state it, we write in the following \( s_{\rm {max}}({\boldsymbol{A}})\) for the maximal singular value of \( {\boldsymbol{A}}\) , and \( s_{\rm {min}}({\boldsymbol{A}})\) for the minimal positive singular value, with the convention \( s_{\rm {min}}({\boldsymbol{A}})\mathrm{:}= \infty\) in case \( {\boldsymbol{A}}=0\) . The full proof is left as Exercise 41.

Proposition 14

Let \( \lambda=0\) and fix \( h\in (0,s_{\rm {max}}({\boldsymbol{A}})^{-2})\) . Let \( {\boldsymbol{w}}_0=\tilde{\boldsymbol{w}}_0+\hat{\boldsymbol{w}}_0\) where \( \tilde {\boldsymbol{w}}_0 \in\tilde H\) and \( \hat{\boldsymbol{w}}_0\in\tilde H^\perp\) , and let \( ({\boldsymbol{w}}_k)_{k\in\mathbb{N}}\) be defined by (174). Then

\[ \lim_{k\to\infty}{\boldsymbol{w}}_k={\boldsymbol{w}}_*+\hat{\boldsymbol{w}}_0. \]
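Proposition 14 can be illustrated with a few lines of code (a sketch of ours with arbitrary random data): iterating (174) with \( \lambda=0\) from a generic initialization converges to \( {\boldsymbol{w}}_*+\hat{\boldsymbol{w}}_0\) , i.e., the component of \( {\boldsymbol{w}}_0\) orthogonal to \( \tilde H\) is never touched:

import numpy as np

rng = np.random.default_rng(2)
m, d = 3, 5
A = rng.standard_normal((m, d))
y = rng.standard_normal(m)

w_star = np.linalg.pinv(A) @ y                 # minimum norm solution
P = A.T @ np.linalg.pinv(A @ A.T) @ A          # projector onto tilde{H} (row space of A)

w = rng.standard_normal(d)                     # arbitrary initialization w_0
w_hat0 = w - P @ w                             # component of w_0 in tilde{H}^perp

h = 0.4 / np.linalg.norm(A, 2) ** 2            # step size below s_max(A)^{-2}
for _ in range(20000):
    w = w - 2 * h * A.T @ (A @ w - y)          # iteration (174) with lambda = 0

print(np.linalg.norm(w - (w_star + w_hat0)))   # close to zero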

Next we consider ridge regression, where \( \lambda>0\) in (169), (174). The condition on the step size in the next proposition can be weakened to \( h\in (0,(\lambda+s_{\rm {max}}({\boldsymbol{A}})^2)^{-1})\) , but we omit doing so for simplicity.

Proposition 15

Let \( \lambda>0\) , and fix \( h\in (0,(2\lambda+2s_{\rm {max}}({\boldsymbol{A}})^2)^{-1})\) . Let \( {\boldsymbol{w}}_0\in\mathbb{R}^d\) and let \( ({\boldsymbol{w}}_k)_{k\in\mathbb{N}}\) be defined by (174). Then

\[ \lim_{k\to\infty}{\boldsymbol{w}}_k={\boldsymbol{w}}_{*,\lambda} \]

and

\[ \| {\boldsymbol{w}}_*-{\boldsymbol{w}}_{*,\lambda} \|_{}\le \Big|\frac{\lambda}{s_{\rm min}({\boldsymbol{A}})^3+s_{\rm min}({\boldsymbol{A}})\lambda}\Big|\| {\boldsymbol{y}} \|_{}= O(\lambda)~~\text{as }\lambda\to 0. \]

Proof

By Exercise 39, \( f _\lambda\) is \( (2\lambda+2s_{\rm {max}}({\boldsymbol{A}})^2)\) -smooth, and by Exercise 42, \( f _\lambda\) is strongly convex. Thus Theorem 19 implies convergence of gradient descent towards the unique minimizer \( {\boldsymbol{w}}_{*,\lambda}\) .

For the bound on the distance to \( {\boldsymbol{w}}_*\) , assume \( {\boldsymbol{A}}\neq 0\) (the case \( {\boldsymbol{A}}=0\) is trivial). Expressing \( {\boldsymbol{w}}_*\) via the pseudoinverse of \( {\boldsymbol{A}}\) (see Appendix 19.1) we get

\[ {\boldsymbol{w}}_*={\boldsymbol{A}}^\dagger{\boldsymbol{y}} ={\boldsymbol{V}}\begin{pmatrix} \frac{1}{s_1} &&&\\ &\ddots&&\boldsymbol{0}\\ &&\frac{1}{s_r}&\\ &\boldsymbol{0}&&\boldsymbol{0}\\ \end{pmatrix} {\boldsymbol{U}}^\top{\boldsymbol{y}}, \]

where \( {\boldsymbol{A}}={\boldsymbol{U}}{\boldsymbol{ \Sigma }}{\boldsymbol{V}}^\top\) is the singular value decomposition of \( {\boldsymbol{A}}\) , and \( s_1\ge\dots\ge s_r>0\) denote the singular values of \( {\boldsymbol{A}}\) . The explicit formula for \( {\boldsymbol{w}}_{*,\lambda}\) obtained in the proof of Theorem 23 then yields

\[ \| {\boldsymbol{w}}_*-{\boldsymbol{w}}_{*,\lambda} \|_{}\le \max_{i\le r} \Big|\frac{s_i}{s_i^2+\lambda}-\frac{1}{s_i}\Big|\| {\boldsymbol{y}} \|_{}. \]

This gives the claimed bound.

By Proposition 15, if we use ridge regression with a small regularization parameter \( \lambda>0\) , then gradient descent converges to a vector \( {\boldsymbol{w}}_{*,\lambda}\) which is \( O(\lambda)\) close to the minimal norm solution \( {\boldsymbol{w}}_*\) , regardless of the initialization \( {\boldsymbol{w}}_0\) .

12.2 Feature methods and kernel least-squares regression

Linear models are often too simplistic to capture the true relationship between \( {\boldsymbol{x}}\) and \( y\) . Feature- and kernel-based methods (e.g., [205, 206, 199]) address this by replacing \( {\boldsymbol{x}} \mapsto \left\langle {\boldsymbol{x}}, {\boldsymbol{w}}\right\rangle_{}\) with \( {\boldsymbol{x}} \mapsto \left\langle \phi({\boldsymbol{x}}), {\boldsymbol{w}}\right\rangle_{}\) where \( \phi:\mathbb{R}^d \to \mathbb{R}^n\) is a (typically nonlinear) map. This introduces nonlinearity in \( {\boldsymbol{x}}\) while retaining linearity in the parameter \( {\boldsymbol{w}}\in\mathbb{R}^n\) .

Example 7

Let data \( (x_i,y_i)_{i=1}^m\subseteq \mathbb{R}\times\mathbb{R}\) be given, and define for \( x\in\mathbb{R}\)

\[ \phi(x)\mathrm{:}= (1,x,\dots,x^{n-1})^\top\in\mathbb{R}^n. \]

For \( {\boldsymbol{w}}\in\mathbb{R}^n\) , the model \( x\mapsto \left\langle \phi(x), {\boldsymbol{w}}\right\rangle_{}=\sum_{j=0}^{n-1}w_j x^j\) can represent any polynomial of degree at most \( n-1\) .
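In code, Example 7 amounts to ordinary least squares in the feature space. The following minimal sketch (ours; the data and the degree are arbitrary choices) fits a cubic polynomial to noisy samples of a sine function:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(np.pi * x) + 0.05 * rng.standard_normal(x.size)   # noisy samples of a sine (toy data)

n = 4                                           # features 1, x, x^2, x^3
Phi = np.vander(x, N=n, increasing=True)        # row i is phi(x_i)^T = (1, x_i, ..., x_i^{n-1})
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # linear least squares in the parameters w
print(w)                                        # coefficients of the fitted cubic polynomial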

Let us formalize this idea. For reasons that will become apparent later (see Remark 15), it is useful to allow for the case \( n=\infty\) . To this end, let \( (H,\left\langle \cdot, \cdot\right\rangle_{H})\) be a Hilbert space (see Appendix 19.2.4), referred to as the feature space, and let \( \phi:\mathbb{R}^d\to H\) denote the feature map. The model is defined as

\[ \begin{equation} \Phi({\boldsymbol{x}},w)\mathrm{:}= \left\langle \phi({\boldsymbol{x}}), w\right\rangle_{H} \end{equation} \]

(163)

with \( w \in H\) . We may think of \( H\) in the following either as \( \mathbb{R}^n\) for some \( n\in\mathbb{N}\) , or as \( \ell^2(\mathbb{N})\) (see Example 25); in this case the components of \( \phi\) are referred to as features. For some \( \lambda\ge0\) , the goal is to minimize the objective

\[ \begin{equation} f (w)\mathrm{:}= \sum_{j=1}^{m}\big(\left\langle \phi({\boldsymbol{x}}_j), w\right\rangle_{H}-y_j\big)^2 ~~\text{or} ~~ f _\lambda(w)\mathrm{:}= f(w)+\lambda\| w \|_{H}^2. \end{equation} \]

(164)

In analogy to (170), throughout the rest of Section 12.2 we denote by

\[ \tilde H\mathrm{:}= {\rm span}\{\phi({\boldsymbol{x}}_1),\dots,\phi({\boldsymbol{x}}_m)\}\subseteq H \]

the space spanned by the feature vectors at the training points.

12.2.1 Existence of minimizers

We start with the ridgeless case \( \lambda=0\) in (164). To guarantee uniqueness and regularize the problem, we again consider the minimum norm solution

\[ \begin{equation} w_*\mathrm{:}= \rm{argmin}_{w\in M}\| w \|_{H},~~ M\mathrm{:}= \{w\in H\,|\, f (w)\le f (v)~\forall v\in H\}. \end{equation} \]

(165)

Theorem 24

There is a unique minimum norm solution \( w_*\in H\) in (165). It lies in the subspace \( \tilde H\) , and is the unique minimizer of \( f \) in \( \tilde H\) , i.e.

\[ \begin{equation} w_* = \rm{argmin}_{\tilde w\in\tilde H} f (\tilde w). \end{equation} \]

(166)

The proof of Theorem 22 is formulated such that it extends verbatim to Theorem 24, upon replacing \( \mathbb{R}^d\) with \( H\) and the matrix \( {\boldsymbol{A}} \in \mathbb{R}^{m \times d}\) with the linear map

\[ \begin{align*} A:&H\to \mathbb{R}^m\\ &w\mapsto (\left\langle \phi({\boldsymbol{x}}_i), w\right\rangle_{H})_{i=1}^m. \end{align*} \]

Similarly, Theorem 23 extends to the current setting with small modifications. The key observation is that by Theorem 24, the minimizer is attained in the finite-dimensional subspace \( \tilde H\) . Selecting a basis for \( \tilde H\) , the proof then proceeds analogously. We leave it to the reader to check this, see Exercise 44. This leads to the following statement.

Theorem 25

Let \( \lambda>0\) . Then, with \( f _\lambda\) in (164), there exists a unique minimizer

\[ \begin{equation} w_{*,\lambda}\mathrm{:}= \rm{argmin}_{w\in H} f _\lambda(w). \end{equation} \]

(167)

It holds \( w_{*,\lambda}\in\tilde H\) , and

\[ \lim_{\lambda\to 0}w_{*,\lambda} = w_*. \]

Statements as in Theorems 24 and 25, which yield that the minimizer is attained in the finite dimensional subspace \( \tilde H\) , are known in the literature as representer theorems, [207, 208].

12.2.2 The kernel trick

We now explain the connection to “kernels”. At first glance, minimizing (164) in the potentially infinite-dimensional Hilbert space \( H\) seems infeasible. However, the so-called kernel trick enables this computation [209].

Definition 24

A symmetric function \( K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}\) is called a kernel, if for any \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_k\in\mathbb{R}^d\) , \( k\in\mathbb{N}\) , the kernel matrix \( {\boldsymbol{G}}=(K({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^k\in\mathbb{R}^{k\times k}\) is symmetric positive semidefinite.

Given a feature map \( \phi:\mathbb{R}^d\to H\) , it is easy to check that

\[ \begin{equation} K({\boldsymbol{x}},{\boldsymbol{z}})\mathrm{:}= \left\langle \phi({\boldsymbol{x}}), \phi({\boldsymbol{z}})\right\rangle_{H}~~\text{for all } {\boldsymbol{x}},{\boldsymbol{z}}\in\mathbb{R}^d, \end{equation} \]

(168)

defines a kernel. The corresponding kernel matrix \( {\boldsymbol{G}}\in\mathbb{R}^{{m}\times {m}}\) is

\[ G_{ij}=\left\langle \phi({\boldsymbol{x}}_i), \phi({\boldsymbol{x}}_j)\right\rangle_{H}=K({\boldsymbol{x}}_i,{\boldsymbol{x}}_j). \]

The ansatz \( w=\sum_{j=1}^{m}\alpha_j\phi({\boldsymbol{x}}_j)\) then turns the minimization of \( f _\lambda\) in (164) over \( \tilde H\) into

\[ \begin{equation} \rm{argmin}_{{\boldsymbol{\alpha}}\in\mathbb{R}^m}\| {\boldsymbol{G}}{\boldsymbol{\alpha}}-{\boldsymbol{y}} \|_{}^2 + \lambda {\boldsymbol{\alpha}}^\top{\boldsymbol{G}}{\boldsymbol{\alpha}}. \end{equation} \]

(169)

Such a minimizing \( {\boldsymbol{\alpha}}\) need not be unique (if \( {\boldsymbol{G}}\) is not regular); however, any such \( {\boldsymbol{\alpha}}\) yields a minimizer in \( \tilde H\) , and thus \( w_{*,\lambda}=\sum_{j=1}^{m}\alpha_j\phi({\boldsymbol{x}}_j)\) for any \( \lambda\ge 0\) by Theorems 24 and 25. This suggests the following algorithm:


Algorithm 2 Kernel least-squares regression
1. Input: Data \( ({\boldsymbol{x}}_i,y_i)_{i=1}^m\in\mathbb{R}^d\times \mathbb{R}\) , kernel \( K:\mathbb{R}^d\times\mathbb{R}^d\to \mathbb{R}\) , regularization parameter \( \lambda\ge 0\) , evaluation point \( {\boldsymbol{x}}\in\mathbb{R}^d\)
2. Output: (Ridge or ridgeless) kernel least-squares estimator at \( {\boldsymbol{x}}\)
3. Compute the kernel matrix \( {\boldsymbol{G}}=(K({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m\)
4. Determine a minimizer \( {\boldsymbol{\alpha}}\in\mathbb{R}^m\) of \( \| {\boldsymbol{G}}{\boldsymbol{\alpha}}-{\boldsymbol{y}} \|_{}^2+\lambda{\boldsymbol{\alpha}}^\top{\boldsymbol{G}}{\boldsymbol{\alpha}}\)
5. Evaluate \( \Phi({\boldsymbol{x}},w_{*,\lambda})\) via \[ \Phi({\boldsymbol{x}},w_{*,\lambda})=\left\langle \phi({\boldsymbol{x}}), \sum_{j=1}^{m}\alpha_j\phi({\boldsymbol{x}}_j)\right\rangle_{H}=\sum_{j=1}^{m}\alpha_jK({\boldsymbol{x}},{\boldsymbol{x}}_j) \]

Given the well-definedness of \( w_{*,0}\mathrm{:}= w_*\) and \( w_{*,\lambda}\) for \( \lambda\ge 0\) , we refer to

\[ {\boldsymbol{x}}\mapsto \Phi({\boldsymbol{x}},w_{*,\lambda})= \left\langle \phi({\boldsymbol{x}}), w_{*,\lambda}\right\rangle_{H} \]

as the (ridge or ridgeless) kernel least-squares estimator. By the above considerations, its computation neither requires explicit knowledge of the feature map \( \phi\) nor of \( w_{*,\lambda}\in H\) . It is sufficient to choose a kernel \( K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}\) and perform all computations in finite dimensional spaces. This is known as the kernel trick. While Algorithm 2 will not play a role in the rest of the chapter, we present it here to give a more complete picture.

Remark 15

If \( \Omega\subseteq\mathbb{R}^d\) is compact and \( K:\Omega\times\Omega\to\mathbb{R}\) is a continuous kernel, then Mercer’s theorem implies existence of a Hilbert space \( H\) and a feature map \( \phi:\mathbb{R}^d\to H\) such that

\[ K({\boldsymbol{x}},{\boldsymbol{z}})=\left\langle \phi({\boldsymbol{x}}), \phi({\boldsymbol{z}})\right\rangle_{H}~~\text{for all }{\boldsymbol{x}},{\boldsymbol{z}}\in\Omega, \]

i.e., \( K\) is the kernel corresponding to the feature map \( \phi\) . See for instance [210, Sec. 3.2] or [211, Thm. 4.49].

12.2.3 Gradient descent

In practice we may either minimize \( f _\lambda\) in (164) (in the Hilbert space \( H\) ) or the objective in (169) (in \( \mathbb{R}^m\) ). We now focus on the former, as this will allow us to draw connections to neural network training in the subsequent sections. In order to use gradient descent, we assume \( H=\mathbb{R}^n\) equipped with the Euclidean inner product. Initializing \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , gradient descent with constant step size \( h>0\) to minimize \( f _\lambda\) reads

\[ {\boldsymbol{w}}_{k+1} = {\boldsymbol{w}}_k-2h{\boldsymbol{A}}^\top({\boldsymbol{A}}{\boldsymbol{w}}_k-{\boldsymbol{y}})-2h\lambda{\boldsymbol{w}}_k~~\text{for all }k\in\mathbb{N}_0, \]

where now

\[ {\boldsymbol{A}}=\begin{pmatrix} \phi({\boldsymbol{x}}_1)^\top\\ \vdots\\ \phi({\boldsymbol{x}}_m)^\top \end{pmatrix}. \]

This corresponds to the situation discussed in Section 12.1.2.

Let \( \lambda=0\) . For sufficiently small step size, by Proposition 14 for \( {\boldsymbol{x}}\in\mathbb{R}^d\)

\[ \begin{equation} \lim_{k\to\infty}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)=\left\langle \phi({\boldsymbol{x}}), {\boldsymbol{w}}_*\right\rangle_{}+\left\langle \phi({\boldsymbol{x}}), \hat{\boldsymbol{w}}_0\right\rangle_{}, \end{equation} \]

(170)

where

\[ {\boldsymbol{w}}_0=\tilde{\boldsymbol{w}}_0+\hat{\boldsymbol{w}}_0 \]

with \( \tilde{\boldsymbol{w}}_0\in\tilde H={\rm span}\{\phi({\boldsymbol{x}}_1),\dots,\phi({\boldsymbol{x}}_m)\}\subseteq\mathbb{R}^n\) , and \( \hat{\boldsymbol{w}}_0\in\tilde H^\perp\) . For \( \lambda=0\) , gradient descent thus yields the ridgeless kernel least squares estimator plus an additional term \( \left\langle \phi({\boldsymbol{x}}), \hat{\boldsymbol{w}}_0\right\rangle_{}\) depending on the initialization. Notably, on the set

\[ \begin{equation} \{{\boldsymbol{x}}\in\mathbb{R}^d\,|\,\phi({\boldsymbol{x}})\in {\rm span}\{\phi({\boldsymbol{x}}_1),\dots,\phi({\boldsymbol{x}}_m)\}\}, \end{equation} \]

(171)

(170) always coincides with the ridgeless least squares estimator.

Now let \( \lambda>0\) . For sufficiently small step size, by Proposition 15 for \( {\boldsymbol{x}}\in\mathbb{R}^d\)

\[ \lim_{k\to\infty}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)=\left\langle \phi({\boldsymbol{x}}), {\boldsymbol{w}}_{*,\lambda}\right\rangle_{}= \left\langle \phi({\boldsymbol{x}}), {\boldsymbol{w}}_{*}\right\rangle_{}+O(\lambda)~~\text{as }\lambda\to 0. \]

Thus, for \( \lambda>0\) gradient descent determines the ridge kernel least-squares estimator regardless of the initialization. Moreover, for fixed \( {\boldsymbol{x}}\) , the limiting model is \( O(\lambda)\) close to the ridgeless kernel least-squares estimator.

12.3 Tangent kernel

Consider a general model \( \Phi({\boldsymbol{x}},{\boldsymbol{w}})\) with input \( {\boldsymbol{x}}\in\mathbb{R}^d\) and parameters \( {\boldsymbol{w}}\in\mathbb{R}^n\) . The goal is to minimize the square loss objective (155.b) given data (155.a). Our analysis in this and the following two sections focuses on the ridgeless case. We will revisit ridge regression in Section 12.6.4, where we consider a simple test example of training a neural network with and without regularization.

If \( {\boldsymbol{w}}\mapsto \Phi({\boldsymbol{x}},{\boldsymbol{w}})\) is not linear, then unlike in Sections 12.1 and 12.2, the objective function (155.b) is in general not convex, and most results on first order methods in Chapter 11 are not directly applicable. We thus simplify the situation by linearizing the model in the parameter \( {\boldsymbol{w}}\in\mathbb{R}^n\) around initialization: Fixing \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , let

\[ \begin{equation} \Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}})\mathrm{:}= \Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)+\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)^\top({\boldsymbol{p}}-{\boldsymbol{w}}_0)\\ ~~\text{for all }{\boldsymbol{p}}\in\mathbb{R}^n, \end{equation} \]

(184)

which is the first order Taylor approximation of \( \Phi\) around the initial parameter \( {\boldsymbol{w}}_0\) . The parameters of the linearized model will always be denoted by \( {\boldsymbol{p}}\in\mathbb{R}^n\) to distinguish them from the parameters \( {\boldsymbol{w}}\) of the full model. Introduce

\[ \begin{equation} \delta_j\mathrm{:}= y_j-\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}_0)+\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}_0)^\top{\boldsymbol{w}}_0~~\text{for all }j=1,\dots,m. \end{equation} \]

(185)

The square loss objective for the linearized model then reads

\[ \begin{align} f ^{\rm lin}({\boldsymbol{p}})&\mathrm{:}= \sum_{j=1}^m(\Phi^{\rm lin}({\boldsymbol{x}}_j,{\boldsymbol{p}})-y_j)^2 = \sum_{j=1}^m \big(\left\langle \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}_0), {\boldsymbol{p}}\right\rangle_{} - \delta_j\big)^2 \end{align} \]

(186)

where \( \left\langle \cdot, \cdot\right\rangle_{}\) stands for the Euclidean inner product in \( \mathbb{R}^n\) . Comparing with (164), minimizing \( f ^{\rm {lin}}\) corresponds to kernel least squares regression with feature map

\[ \phi({\boldsymbol{x}})=\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)\in\mathbb{R}^n. \]

By (168) the corresponding kernel is

\[ \begin{equation} \hat K_n({\boldsymbol{x}},{\boldsymbol{z}}) = \left\langle \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0), \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{z}},{\boldsymbol{w}}_0)\right\rangle_{}. \end{equation} \]

(187)

We refer to \( \hat K_n\) as the empirical tangent kernel, as it arises from the first order Taylor approximation (the tangent) of the original model \( \Phi\) around the initialization \( {\boldsymbol{w}}_0\) . Note that \( \hat K_n\) depends on the choice of \( {\boldsymbol{w}}_0\) . For later reference we point out that, as explained in Section 12.2.3, minimizing \( f ^{\rm {lin}}\) with gradient descent with sufficiently small step size and without regularization yields a sequence \( ({\boldsymbol{p}}_k)_{k\in\mathbb{N}_0}\) satisfying

\[ \begin{equation} \lim_{k\to\infty}\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_k) = \underbrace{\left\langle \phi({\boldsymbol{x}}), {\boldsymbol{p}}_*\right\rangle_{}}_{\substack{\text{ridgeless kernel least-squares} \\ \text{estimator with kernel } \hat K_n}} \;+\; \underbrace{\left\langle \phi({\boldsymbol{x}}), \hat{\boldsymbol{p}}_0\right\rangle_{}}_{\substack{\text{term depending on initialization,} \\ \text{which vanishes at training points}}}, \end{equation} \]

(188)

where \( \hat{\boldsymbol{p}}_0\) is the projection of the initialization \( {\boldsymbol{p}}_0\) onto \( {\rm {span}}\{\phi({\boldsymbol{x}}_1),\dots,\phi({\boldsymbol{x}}_m)\}^\perp\) . In particular, the second term vanishes at the training points.
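To make this concrete, the following sketch (ours) computes the empirical tangent kernel (187) on a set of training points for a shallow model \( \Phi({\boldsymbol{x}},{\boldsymbol{w}})=n^{-1/2}\sum_{j=1}^n a_j\tanh({\boldsymbol{b}}_j^\top{\boldsymbol{x}})\) with \( {\boldsymbol{w}}=({\boldsymbol{a}},{\boldsymbol{B}})\) ; the architecture, the \( 1/\sqrt{n}\) scaling, and the Gaussian initialization are illustrative choices of ours:

import numpy as np

rng = np.random.default_rng(4)
d, n, m = 2, 500, 5                          # input dimension, width, number of data points
X = rng.standard_normal((m, d))              # training points x_1, ..., x_m

a = rng.standard_normal(n)                   # outer weights a_j
B = rng.standard_normal((n, d))              # inner weights, row j is b_j^T

def grad_Phi(x):
    """Gradient of Phi(x, w) = n^{-1/2} sum_j a_j tanh(b_j^T x) with respect to w = (a, B)."""
    z = np.tanh(B @ x)                                           # activations tanh(b_j^T x)
    grad_a = z / np.sqrt(n)                                      # dPhi / da_j
    grad_B = (a * (1.0 - z ** 2))[:, None] * x / np.sqrt(n)      # dPhi / db_j
    return np.concatenate([grad_a, grad_B.ravel()])

feats = np.stack([grad_Phi(x) for x in X])   # row i is nabla_w Phi(x_i, w_0)^T
K_hat = feats @ feats.T                      # empirical tangent kernel matrix (hat K_n(x_i, x_j))_{i,j}
print(np.linalg.eigvalsh(K_hat))             # its eigenvalues; all nonnegative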

12.4 Global minimizers

Consider a general model \( \Phi:\mathbb{R}^d\times\mathbb{R}^n\to\mathbb{R}\) , data as in (155.a), and the ridgeless square loss

\[ f ({\boldsymbol{w}}) = \sum_{j=1}^m (\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-y_j)^2. \]

In this section we discuss sufficient conditions under which gradient descent converges to a global minimizer.

The idea is as follows: if \( {\boldsymbol{w}} \mapsto \Phi({\boldsymbol{x}}, {\boldsymbol{w}})\) is nonlinear but sufficiently close to its linearization \( \Phi^{\rm {lin}}\) in (184) within some region, the objective function behaves almost like a convex function there. If the region is large enough to contain both the initial value \( {\boldsymbol{w}}_0\) and a global minimum, then we expect gradient descent to never leave this (almost convex) basin during training and to find a global minimizer.

To illustrate this, consider Figures 41 and 42 where we set the number of training samples to \( m=1\) and the number of parameters to \( n=1\) . For the above reasoning to hold, the difference between \( \Phi\) and \( \Phi^{\rm {lin}}\) , as well as the difference in their derivatives, must remain small within a neighborhood of \( {\boldsymbol{w}}_0\) . The neighbourhood should be large enough to contain the global minimizer, and thus depends critically on two factors: the initial error \( \Phi({\boldsymbol{x}}_1,w_0)-y_1\) , and the magnitude of the derivative \( \frac{\partial}{\partial w}\Phi({\boldsymbol{x}}_1,w_0)\) .

Figure 41. Graph of \( w\mapsto \Phi({\boldsymbol{x}}_1,w)\) and the linearization \( w\mapsto \Phi^{\rm {lin}}({\boldsymbol{x}}_1,w)\) at the initial parameter \( w_0\) , such that \( \frac{\partial}{\partial w}\Phi({\boldsymbol{x}}_1,w_0)\neq 0\) . If \( \Phi\) and \( \Phi^{\rm {lin}}\) are close, then there exists \( w\) such that \( \Phi({\boldsymbol{x}}_1,w)=y_1\) (left). If the derivatives are also close, the loss \( (\Phi({\boldsymbol{x}}_1,w)-y_1)^2\) is nearly convex in \( w\) , and gradient descent finds a global minimizer (right).
Figure 42. Same as Figure 41. If \( \Phi\) and \( \Phi^{\rm {lin}}\) are not close, there need not exist \( w\) such that \( \Phi({\boldsymbol{x}}_1,w)=y_1\) , and gradient descent need not converge to a global minimizer.

For general \( m\) and \( n\) , we now make the required assumptions on \( \Phi\) precise.

Assumption 1

Let \( \Phi\in C^1(\mathbb{R}^d\times\mathbb{R}^n)\) and \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) . There exist constants \( r,R,U,L>0\) and \( 0<\theta_{\rm {min}}\le \theta_{\rm max}<\infty\) such that \( \| {\boldsymbol{x}}_i \|_{}\le R\) for all \( i=1,\dots,m\) , and it holds that

  1. the kernel matrix of the empirical tangent kernel

    \[ \begin{equation} (\hat K_n({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m = \big(\left\langle \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0), {\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}_0)}\right\rangle_{}\big)_{i,j=1}^m\in\mathbb{R}^{m\times m} \end{equation} \]

    (189)

    is regular and its eigenvalues belong to \( [\theta_{\rm {min}},\theta_{\rm max}]\) ,

  2. for all \( {\boldsymbol{x}}\in\mathbb{R}^d\) with \( \| {\boldsymbol{x}} \|_{}\le R\)

    \[ \begin{align} \| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{}&\le U &&\text{for all }{\boldsymbol{w}}\in B_r({\boldsymbol{w}}_0) \end{align} \]

    (190.a)

    \[ \begin{align} \| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{v}}) \|_{}&\le L\| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{} &&\text{for all }{\boldsymbol{w}},~{\boldsymbol{v}}\in B_r({\boldsymbol{w}}_0), \end{align} \]

    (190.b)

  3. \[ \begin{equation} L\le \frac{\theta_{\rm min}^2}{12m^{3/2} U^2 \sqrt{ f ({\boldsymbol{w}}_0)}} ~~\text{and}~~ r=\frac{2 \sqrt{m} U\sqrt{ f ({\boldsymbol{w}}_0)}}{\theta_{\rm min}}. \end{equation} \]

    (191)

Let us give more intuitive explanations of these technical assumptions: Item 1 implies in particular that the matrix \( (\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0)^\top)_{i=1}^m\in\mathbb{R}^{m\times n}\) has full rank \( m\le n\) (thus we have at least as many parameters \( n\) as training points \( m\) ). In the context of Figure 41, this means that \( \frac{\partial}{\partial w}\Phi({\boldsymbol{x}}_1,w_0)\neq 0\) and thus \( \Phi^{\rm {lin}}\) is not a constant function. This guarantees the existence of \( {\boldsymbol{p}}\) such that \( \Phi^{\rm {lin}}({\boldsymbol{x}}_i,{\boldsymbol{p}})=y_i\) for all \( i=1,\dots,m\) . Next, item 2 formalizes in particular the required closeness of \( \Phi\) and its linearization \( \Phi^{\rm {lin}}\) . For example, since \( \Phi^{\rm {lin}}\) is the first order Taylor approximation of \( \Phi\) at \( {\boldsymbol{w}}_0\) ,

\[ |\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{w}})| = |(\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}})-\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0))^\top({\boldsymbol{w}}-{\boldsymbol{w}}_0)|\le L \| {\boldsymbol{w}}-{\boldsymbol{w}}_0 \|_{}^2, \]

for some \( \tilde{\boldsymbol{w}}\) in the convex hull of \( {\boldsymbol{w}}\) and \( {\boldsymbol{w}}_0\) . Finally, item 3 ties together all constants, ensuring that the full model is sufficiently close to its linearization on a large enough ball of radius \( r\) around \( {\boldsymbol{w}}_0\) . Notably, \( r\) may be smaller for smaller initial error \( \sqrt{ f ({\boldsymbol{w}}_0)}\) and for larger \( \theta_{\rm {min}}\) , which aligns with our intuition from Figure 41. We are now ready to state the following theorem, which is a variant of [1, Thm. G.1]; the proof closely follows the arguments given there. In Section 12.6 we will see that the theorem’s main requirement, Assumption 1, is satisfied with high probability for certain (wide) neural networks.

Theorem 26

Let Assumption 1 hold. Fix a positive learning rate

\[ \begin{equation} h \le \frac{1}{\theta_{\rm min}+\theta_{\rm max}}. \end{equation} \]

(192)

Let \( ({\boldsymbol{w}}_k)_{k\in\mathbb{N}}\) be generated by gradient descent, i.e., for all \( k\in\mathbb{N}_0\)

\[ \begin{equation} {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k- h\nabla f ({\boldsymbol{w}}_k). \end{equation} \]

(193)

It then holds for all \( k\in\mathbb{N}\)

\[ \begin{align} \| {\boldsymbol{w}}_k-{\boldsymbol{w}}_0 \|_{}&\le r \end{align} \]

(194.a)

\[ \begin{align} f ({\boldsymbol{w}}_k)&\le (1-h\theta_{\rm min})^{2k} f ({\boldsymbol{w}}_0). \end{align} \]

(194.b)

Proof

Denote the model prediction error at the data points by

\[ {\boldsymbol{e}}({\boldsymbol{w}})\mathrm{:}= \begin{pmatrix} \Phi({\boldsymbol{x}}_1,{\boldsymbol{w}})-y_1\\ \vdots\\ \Phi({\boldsymbol{x}}_m,{\boldsymbol{w}})-y_m \end{pmatrix}\in\mathbb{R}^m ~~\text{s.t.}~~ \nabla {\boldsymbol{e}}({\boldsymbol{w}}) = \begin{pmatrix} \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_1,{\boldsymbol{w}})^\top\\ \vdots\\ \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_m,{\boldsymbol{w}})^\top \end{pmatrix}\in\mathbb{R}^{m\times n} \]

and with the empirical tangent kernel \( \hat K_n\) from item 1 of Assumption 1

\[ \begin{equation} \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)^\top =(\hat K_n({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m\in\mathbb{R}^{m\times m}. \end{equation} \]

(195)

By (190.a)

\[ \begin{equation} \| \nabla {\boldsymbol{e}}({\boldsymbol{w}}) \|_{}^2 \le \| \nabla {\boldsymbol{e}}({\boldsymbol{w}}) \|_{F}^2 =\sum_{j=1}^m\| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}) \|_{}^2 \le m U^2. \end{equation} \]

(196.a)

Similarly, using (190.b)

\[ \begin{align} \| \nabla {\boldsymbol{e}}({\boldsymbol{w}})-\nabla {\boldsymbol{e}}({\boldsymbol{v}}) \|_{}^2&\le \sum_{j=1}^m\| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{v}}) \|_{}^2 \le m L^2 \| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{}^2 ~~\text{for all }{\boldsymbol{w}},~{\boldsymbol{v}}\in B_r({\boldsymbol{w}}_0). \end{align} \]

(196.b)

Step 1. Denote \( c\mathrm{:}= 1-h\theta_{\rm {min}}\in (0,1)\) . In the remainder of the proof we use induction over \( k\) to show

\[ \begin{align} \sum_{j=0}^{k-1}\| {\boldsymbol{w}}_{j+1}-{\boldsymbol{w}}_j \|_{}&\le 2h\sqrt{m}U\| {\boldsymbol{e}}({\boldsymbol{w}}_0) \|_{} \sum_{j=0}^{k-1}c^{j}, \end{align} \]

(197.a)

\[ \begin{align} \| {\boldsymbol{e}}({\boldsymbol{w}}_k) \|_{}^2&\le \| {\boldsymbol{e}}({\boldsymbol{w}}_0) \|_{}^2 c^{2k}, \end{align} \]

(197.b)

for all \( k\in\mathbb{N}_0\) , where an empty sum is understood as zero. Since \( \sum_{j=0}^\infty c^j = (1-c)^{-1}=(h\theta_{\rm min})^{-1}\) and \( \| {\boldsymbol{e}}({\boldsymbol{w}}) \|_{}=\sqrt{ f ({\boldsymbol{w}})}\) , using (191) we have

\[ \begin{equation} 2h\sqrt{m}U\| {\boldsymbol{e}}({\boldsymbol{w}}_0) \|_{}\sum_{j=0}^\infty c^j = 2h\sqrt{m}U\sqrt{ f ({\boldsymbol{w}}_0)}\frac{1}{h\theta_{\rm min}}\le r, \end{equation} \]

(198)

so that (197) directly implies (194).

The case \( k=0\) is trivial. For the induction step, assume (197) holds for some \( k\in\mathbb{N}_0\) .

Step 2. We show (197.a) for \( k+1\) . The induction assumption (197.a) and (198) give \( {\boldsymbol{w}}_k\in B_r({\boldsymbol{w}}_0)\) . Next

\[ \begin{equation} \nabla f ({\boldsymbol{w}}_k) = \nabla ({\boldsymbol{e}}({\boldsymbol{w}}_k)^\top {\boldsymbol{e}}({\boldsymbol{w}}_k)) = 2\nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top {\boldsymbol{e}}({\boldsymbol{w}}_k). \end{equation} \]

(199)

Using the iteration rule (193) and the bounds (196.a) and (197.b)

\[ \begin{align*} \| {\boldsymbol{w}}_{k+1}-{\boldsymbol{w}}_k \|_{}&=2h\| \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top {\boldsymbol{e}}({\boldsymbol{w}}_k) \|_{}\\ &\le 2h \sqrt{m}U \| {\boldsymbol{e}}({\boldsymbol{w}}_k) \|_{}\\ &\le 2h \sqrt{m}U \| {\boldsymbol{e}}({\boldsymbol{w}}_0) \|_{} c^{k}. \end{align*} \]

This shows (197.a) for \( k+1\) . In particular by (198)

\[ \begin{equation} {\boldsymbol{w}}_{k+1},~{\boldsymbol{w}}_k \in B_r({\boldsymbol{w}}_0). \end{equation} \]

(200)

Step 3. We show (197.b) for \( k+1\) . Since \( {\boldsymbol{e}}:\mathbb{R}^n\to\mathbb{R}^m\) is continuously differentiable, there exists \( \tilde{\boldsymbol{w}}_k\) in the convex hull of \( {\boldsymbol{w}}_k\) and \( {\boldsymbol{w}}_{k+1}\) such that

\[ {\boldsymbol{e}}({\boldsymbol{w}}_{k+1}) = {\boldsymbol{e}}({\boldsymbol{w}}_k)+\nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) ({\boldsymbol{w}}_{k+1}-{\boldsymbol{w}}_k) = {\boldsymbol{e}}({\boldsymbol{w}}_k)-h\nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) \nabla f ({\boldsymbol{w}}_k), \]

and thus by (199)

\[ \begin{align*} {\boldsymbol{e}}({\boldsymbol{w}}_{k+1}) &= {\boldsymbol{e}}({\boldsymbol{w}}_k) -2h\nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top {\boldsymbol{e}}({\boldsymbol{w}}_k)\\ &= \big({\boldsymbol{I}}_m-2h\nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top\big){\boldsymbol{e}}({\boldsymbol{w}}_k), \end{align*} \]

where \( {\boldsymbol{I}}_m\in\mathbb{R}^{m\times m}\) is the identity matrix. We wish to show that

\[ \begin{equation} \| {\boldsymbol{I}}_m-2h\nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top \|_{}\le c, \end{equation} \]

(201)

which then implies (197.b) for \( k+1\) and concludes the proof.

Using (196) and the fact that \( {\boldsymbol{w}}_k\) , \( \tilde {\boldsymbol{w}}_k\in B_r({\boldsymbol{w}}_0)\) by (200),

\[ \begin{align*} &\| \nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top-\nabla {\boldsymbol{e}}({\boldsymbol{w}}_0) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)^\top \|_{}\\ &~~~~\le \| \nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top-\nabla {\boldsymbol{e}}({\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top \|_{}\\ &~~~~~+ \| \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top-\nabla {\boldsymbol{e}}({\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)^\top \|_{} \\ &~~~~~+\| \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)^\top-\nabla {\boldsymbol{e}}({\boldsymbol{w}}_0) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)^\top \|_{}\\ &~~~~\le 3 mULr. \end{align*} \]

Since the eigenvalues of \( \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)\nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)^\top\) belong to \( [\theta_{\rm {min}},\theta_{\rm max}]\) by (195) and item 1 of Assumption 1, as long as \( h\le (\theta_{\rm {min}}+\theta_{\rm max})^{-1}\) we have

\[ \begin{align*} \| {\boldsymbol{I}}_m-2h\nabla {\boldsymbol{e}}(\tilde{\boldsymbol{w}}_k) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_k)^\top \|_{} &\le \| {\boldsymbol{I}}_m-2h\nabla {\boldsymbol{e}}({\boldsymbol{w}}_0) \nabla {\boldsymbol{e}}({\boldsymbol{w}}_0)^\top \|_{}+6h m ULr\\ &\le 1-2h\theta_{\rm min}+6h m ULr. \end{align*} \]

Due to (191)

\[ \begin{align*} 1-2h\theta_{\rm min}+6h m ULr &\le 1-2h\theta_{\rm min}+6h m U\frac{\theta_{\rm min}^2}{12 m^{3/2} U^2\sqrt{ f ({\boldsymbol{w}}_0)}}\frac{2\sqrt{m}U\sqrt{ f ({\boldsymbol{w}}_0)}}{\theta_{\rm min}}\\ &= 1-h\theta_{\rm min}= c, \end{align*} \]

which concludes the proof.

Let us emphasize that (194.b) implies that gradient descent (193) achieves zero loss in the limit. Consequently, the limiting model interpolates the training data. This shows in particular convergence to a global minimizer for the (generally nonconvex) optimization problem of minimizing \( f ({\boldsymbol{w}})\) .
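As a numerical sanity check of Theorem 26 (a sketch of ours; the shallow tanh network, its \( 1/\sqrt{n}\) scaling, the random data, and the number of steps are illustrative choices, and the sketch does not verify the Lipschitz condition (191), it merely monitors the quantities appearing in the theorem), one can train a wide model with step size \( h=1/(\theta_{\rm min}+\theta_{\rm max})\) and compare the loss after \( k\) steps with the bound (194.b):

import numpy as np

rng = np.random.default_rng(5)
d, n, m = 2, 2000, 5                           # a wide model: n >> m
X = rng.standard_normal((m, d))                # training inputs
y = rng.standard_normal(m)                     # training targets (toy data)

a = rng.standard_normal(n)                     # initialization w_0 = (a, B)
B = rng.standard_normal((n, d))

def predictions(a_, B_):
    Z = np.tanh(X @ B_.T)                      # Z[i, j] = tanh(b_j^T x_i)
    return Z @ a_ / np.sqrt(n)                 # Phi(x_i, w) at all training points

# tangent features nabla_w Phi(x_i, w_0) and the kernel matrix from item 1 of Assumption 1
Z0 = np.tanh(X @ B.T)
featB = ((1.0 - Z0 ** 2) * a[None, :])[:, :, None] * X[:, None, :] / np.sqrt(n)
feats = np.hstack([Z0 / np.sqrt(n), featB.reshape(m, -1)])
eig = np.linalg.eigvalsh(feats @ feats.T)
theta_min, theta_max = eig[0], eig[-1]
h = 1.0 / (theta_min + theta_max)              # step size as in (192)

f0 = np.sum((predictions(a, B) - y) ** 2)      # f(w_0)
steps = 200
for _ in range(steps):                         # gradient descent (193) on the full model
    Z = np.tanh(X @ B.T)
    e = Z @ a / np.sqrt(n) - y                 # residuals Phi(x_i, w_k) - y_i
    grad_a = 2.0 * Z.T @ e / np.sqrt(n)
    grad_B = 2.0 * ((e[:, None] * (1.0 - Z ** 2) * a[None, :]).T @ X) / np.sqrt(n)
    a, B = a - h * grad_a, B - h * grad_B

f_final = np.sum((predictions(a, B) - y) ** 2)
print(f_final, (1.0 - h * theta_min) ** (2 * steps) * f0)   # observed loss vs. the bound (194.b)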

12.5 Proximity to trained linearized model

The analysis in Section 12.4 was based on the observation that the linearization \( \Phi^{\rm {lin}}\) closely mimics the behavior of the full model \( \Phi\) for parameters at distance at most \( r\) (cf. Assumption 1) from the initial parameter \( {\boldsymbol{w}}_0\) . Theorem 26 states that the parameters remain within this range throughout training. This suggests that the predictions of the trained full model \( \lim_{k\to\infty}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)\) are similar to those of the trained linear model \( \lim_{k\to\infty}\Phi^{\rm {lin}}({\boldsymbol{x}},{\boldsymbol{p}}_k)\) . In this section we formalize this statement.

12.5.1 Evolution of model predictions

We adopt again the notation \( \Phi^{\rm {lin}}:\mathbb{R}^d\times \mathbb{R}^n\to\mathbb{R}\) from (184) to represent the linearization of \( \Phi:\mathbb{R}^d\times \mathbb{R}^n\to\mathbb{R}\) around \( {\boldsymbol{w}}_0\) . The parameters of the linearized model are represented by \( {\boldsymbol{p}} \in \mathbb{R}^n\) , and the corresponding loss function is written as \( f ^{\rm {lin}}({\boldsymbol{p}})\) , as in (186). Additionally, we define \( {\boldsymbol{X}}\mathrm{:}= ({\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m)\) and

\[ \begin{align*} \Phi({\boldsymbol{X}},{\boldsymbol{w}})&\mathrm{:}=(\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}))_{i=1}^m\in\mathbb{R}^m\\ \Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}})&\mathrm{:}=(\Phi^{\rm lin}({\boldsymbol{x}}_i,{\boldsymbol{p}}))_{i=1}^m\in\mathbb{R}^m \end{align*} \]

to denote the predicted values at the training points \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m\) for given parameter choices \( {\boldsymbol{w}}\) , \( {\boldsymbol{p}}\in\mathbb{R}^n\) . Moreover

\[ \nabla_{\boldsymbol{w}} \Phi({\boldsymbol{X}},{\boldsymbol{w}}) = \begin{pmatrix} \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_1,{\boldsymbol{w}})^\top\\ \vdots\\ \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_m,{\boldsymbol{w}})^\top \end{pmatrix}\in\mathbb{R}^{m\times n} \]

and similarly for \( \nabla_{\boldsymbol{w}}\Phi^{\rm {lin}}({\boldsymbol{X}},{\boldsymbol{w}})\) . Given \( {\boldsymbol{x}}\in\mathbb{R}^d\) , the model predictions at \( {\boldsymbol{x}}\) and \( {\boldsymbol{X}}\) evolve under gradient descent as follows:

  • full model: Initialize \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) , and set for step size \( h>0\) and all \( k\in\mathbb{N}_0\)

    \[ \begin{equation} {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k-h\nabla_{\boldsymbol{w}} f ({\boldsymbol{w}}_k). \end{equation} \]

    (202)

    Then

    \[ \nabla_{\boldsymbol{w}} f ({\boldsymbol{w}}) = \nabla_{\boldsymbol{w}}\| \Phi({\boldsymbol{X}},{\boldsymbol{w}})-{\boldsymbol{y}} \|_{}^2 = 2\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}})^\top(\Phi({\boldsymbol{X}},{\boldsymbol{w}})-{\boldsymbol{y}}). \]

    Thus

    \[ \begin{align*} \Phi({\boldsymbol{x}},{\boldsymbol{w}}_{k+1})&= \Phi({\boldsymbol{x}},{\boldsymbol{w}}_{k})+(\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},\tilde {\boldsymbol{w}}_k))^\top({\boldsymbol{w}}_{k+1}-{\boldsymbol{w}}_k)\nonumber\\ &=\Phi({\boldsymbol{x}},{\boldsymbol{w}}_{k})-2h \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},\tilde {\boldsymbol{w}}_k)^\top\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)^\top(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}}), \end{align*} \]

    where \( \tilde{\boldsymbol{w}}_k\) is in the convex hull of \( {\boldsymbol{w}}_k\) and \( {\boldsymbol{w}}_{k+1}\) . Introducing

    \[ \begin{equation} \begin{aligned} {\boldsymbol{G}}^k({\boldsymbol{x}},{\boldsymbol{X}})&\mathrm{:}= \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}_k)^\top\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)^\top\in\mathbb{R}^{1\times m}\\ {\boldsymbol{G}}^k({\boldsymbol{X}},{\boldsymbol{X}})&\mathrm{:}= \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},\tilde{\boldsymbol{w}}_k)\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)^\top\in\mathbb{R}^{m\times m} \end{aligned} \end{equation} \]

    (203)

    this yields

    \[ \begin{align} \Phi({\boldsymbol{x}},{\boldsymbol{w}}_{k+1}) &=\Phi({\boldsymbol{x}},{\boldsymbol{w}}_{k})-2h {\boldsymbol{G}}^k({\boldsymbol{x}},{\boldsymbol{X}})(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}}), \end{align} \]

    (204.a)

    \[ \begin{align} \Phi({\boldsymbol{X}},{\boldsymbol{w}}_{k+1}) &=\Phi({\boldsymbol{X}},{\boldsymbol{w}}_{k})-2h {\boldsymbol{G}}^k({\boldsymbol{X}},{\boldsymbol{X}})(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}}). \end{align} \]

    (204.b)

  • linearized model: Initialize \( {\boldsymbol{p}}_0\mathrm{:}={\boldsymbol{w}}_0\in\mathbb{R}^n\) , and set for step size \( h>0\) and all \( k\in\mathbb{N}_0\)

    \[ \begin{equation} {\boldsymbol{p}}_{k+1}={\boldsymbol{p}}_k-h\nabla_{\boldsymbol{p}} f ^{\rm lin}({\boldsymbol{p}}_k). \end{equation} \]

    (205)

    Then, since \( \nabla_{\boldsymbol{p}}\Phi^{\rm {lin}}({\boldsymbol{x}},{\boldsymbol{p}})=\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)\) for any \( {\boldsymbol{p}}\in\mathbb{R}^n\) ,

    \[ \nabla_{\boldsymbol{p}} f ^{\rm lin}({\boldsymbol{p}}) = \nabla_{\boldsymbol{p}}\| \Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}})-{\boldsymbol{y}} \|_{}^2 = 2\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0)^\top(\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}})-{\boldsymbol{y}}) \]

    and

    \[ \begin{align*} \Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_{k+1})&= \Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_{k})+ \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)^\top({\boldsymbol{p}}_{k+1}-{\boldsymbol{p}}_k)\nonumber\\ &=\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_{k})-2h \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)^\top\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0)^\top(\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k)-{\boldsymbol{y}}). \end{align*} \]

    Introducing (cf. (189))

    \[ \begin{equation} \begin{aligned} {\boldsymbol{G}}^{\rm lin}({\boldsymbol{x}},{\boldsymbol{X}})&\mathrm{:}=\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)^\top\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0)^\top\in\mathbb{R}^{1\times m},\\ {\boldsymbol{G}}^{\rm lin}({\boldsymbol{X}},{\boldsymbol{X}})&\mathrm{:}=\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0)\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0)^\top=(\hat K_n({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m\in\mathbb{R}^{m\times m} \end{aligned} \end{equation} \]

    (206)

    this yields

    \[ \begin{align} \Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_{k+1})&=\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_{k})-2h {\boldsymbol{G}}^{\rm lin}({\boldsymbol{x}},{\boldsymbol{X}})(\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k)-{\boldsymbol{y}}) \end{align} \]

    (207.a)

    \[ \begin{align} \Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_{k+1})&=\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_{k})-2h {\boldsymbol{G}}^{\rm lin}({\boldsymbol{X}},{\boldsymbol{X}})(\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k)-{\boldsymbol{y}}). \end{align} \]

    (207.b)

The full dynamics (204) are governed by the \( k\) -dependent kernel matrices \( {\boldsymbol{G}}^k\) . In contrast, the linear model’s dynamics are entirely determined by the initial kernel matrix \( {\boldsymbol{G}}^{\rm {lin}}\) . The following corollary gives an upper bound on how much these matrices may deviate during training; see [1, Thm. G.1].
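The closeness of the two dynamics can be observed directly in a short experiment (a sketch of ours, reusing the shallow tanh model with \( 1/\sqrt{n}\) scaling from the earlier sketch, which is an illustrative choice). For the linearized model only the training-point predictions need to be tracked, since by (207.b) they satisfy a closed recursion driven by \( {\boldsymbol{G}}^{\rm lin}\) :

import numpy as np

rng = np.random.default_rng(6)
d, n, m = 2, 2000, 5
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)
a = rng.standard_normal(n)                    # initialization w_0 = p_0 = (a, B)
B = rng.standard_normal((n, d))

Z0 = np.tanh(X @ B.T)
featB = ((1.0 - Z0 ** 2) * a[None, :])[:, :, None] * X[:, None, :] / np.sqrt(n)
feats = np.hstack([Z0 / np.sqrt(n), featB.reshape(m, -1)])   # rows: nabla_w Phi(x_i, w_0)^T
G_lin = feats @ feats.T                                      # kernel matrix G^lin(X, X) in (206)
eig = np.linalg.eigvalsh(G_lin)
h = 1.0 / (eig[0] + eig[-1])

phi_full = Z0 @ a / np.sqrt(n)                # Phi(X, w_0)
phi_lin = phi_full.copy()                     # Phi^lin(X, p_0) = Phi(X, w_0)
gap = 0.0
for _ in range(200):
    # full model: one gradient descent step (202) for w = (a, B)
    Z = np.tanh(X @ B.T)
    e = Z @ a / np.sqrt(n) - y
    grad_a = 2.0 * Z.T @ e / np.sqrt(n)
    grad_B = 2.0 * ((e[:, None] * (1.0 - Z ** 2) * a[None, :]).T @ X) / np.sqrt(n)
    a, B = a - h * grad_a, B - h * grad_B
    phi_full = np.tanh(X @ B.T) @ a / np.sqrt(n)
    # linearized model: its training-point predictions follow the closed recursion (207.b)
    phi_lin = phi_lin - 2.0 * h * G_lin @ (phi_lin - y)
    gap = max(gap, np.linalg.norm(phi_full - phi_lin))

print(gap)                                    # remains small when the width n is large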

Corollary 4

Let \( {\boldsymbol{w}}_0={\boldsymbol{p}}_0\in\mathbb{R}^n\) , and let Assumption 1 be satisfied for some \( r,R,U,L,\theta_{\rm {min}},\theta_{\rm max}>0\) . Let \( ({\boldsymbol{w}}_k)_{k\in\mathbb{N}}\) , \( ({\boldsymbol{p}}_k)_{k\in\mathbb{N}}\) be generated by gradient descent (202), (205) with a positive step size

\[ h<\frac{1}{\theta_{\rm min}+\theta_{\rm max}}. \]

Then for all \( {\boldsymbol{x}}\in\mathbb{R}^d\) with \( \| {\boldsymbol{x}} \|_{}\le R\)

\[ \begin{align} \sup_{k\in\mathbb{N}}\| {\boldsymbol{G}}^k({\boldsymbol{x}},{\boldsymbol{X}})-{\boldsymbol{G}}^{\rm lin}({\boldsymbol{x}},{\boldsymbol{X}}) \|_{}&\le 2\sqrt{m}ULr, \end{align} \]

(208.a)

\[ \begin{align} \sup_{k\in\mathbb{N}}\| {\boldsymbol{G}}^k({\boldsymbol{X}},{\boldsymbol{X}})-{\boldsymbol{G}}^{\rm lin}({\boldsymbol{X}},{\boldsymbol{X}}) \|_{}&\le 2 mULr. \end{align} \]

(208.b)

Proof

By Theorem 26 it holds \( {\boldsymbol{w}}_k\in B_r({\boldsymbol{w}}_0)\) for all \( k\in\mathbb{N}\) , and thus also \( \tilde{\boldsymbol{w}}_k\in B_r({\boldsymbol{w}}_0)\) for \( \tilde{\boldsymbol{w}}_k\) in the convex hull of \( {\boldsymbol{w}}_k\) and \( {\boldsymbol{w}}_{k+1}\) as in (203). Using item 2 of Assumption 1, the definitions of \( {\boldsymbol{G}}^k\) and \( {\boldsymbol{G}}^{\rm {lin}}\) give

\[ \begin{align*} \| {\boldsymbol{G}}^k({\boldsymbol{x}},{\boldsymbol{X}})-{\boldsymbol{G}}^{\rm lin}({\boldsymbol{x}},{\boldsymbol{X}}) \|_{} &\le \| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}_k) \|_{}\| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0) \|_{}\\ &~+\| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0) \|_{}\| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}_k)-\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0) \|_{}\\ &\le U\sqrt{m}Lr+\sqrt{m}ULr = 2\sqrt{m}ULr. \end{align*} \]

The proof for the second inequality is similar.

12.5.2 Limiting model predictions

We begin by stating the main result of this section, which is based on and follows the arguments in [1, Thm. H.1]. It gives an upper bound on the discrepancy between the full and linearized models at each training step, and thus in the limit.

Theorem 27

Consider the setting of Corollary 4, in particular let \( r\) , \( R\) , \( \theta_{\rm {min}}\) , \( \theta_{\rm {max}}\) be as in Assumption 1. Then for all \( {\boldsymbol{x}}\in\mathbb{R}^d\) with \( \| {\boldsymbol{x}} \|_{}\le R\)

\[ \sup_{k\in\mathbb{N}}\| \Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)-\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_k) \|_{}\le \frac{4\sqrt{m}ULr}{\theta_{\rm min}}\left(1+\frac{mU^2}{(h\theta_{\rm min})^2(\theta_{\rm min}+\theta_{\rm max})}\right)\sqrt{ f ({\boldsymbol{w}}_0)}. \]

To prove the theorem, we first examine the difference between the full and linearized models on the training data.

Proposition 16

Consider the setting of Corollary 4 and set

\[ \alpha\mathrm{:}= \frac{2mULr}{h\theta_{\rm min}(\theta_{\rm min}+\theta_{\rm max})}\sqrt{ f ({\boldsymbol{w}}_0)}. \]

Then for all \( k\in\mathbb{N}\)

\[ \| \Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k) \|_{}\le \alpha k(1-h\theta_{\rm min})^{k-1}. \]

Proof

Throughout this proof we write for short

\[ {\boldsymbol{G}}^k={\boldsymbol{G}}^k({\boldsymbol{X}},{\boldsymbol{X}})~~\text{and}~~ {\boldsymbol{G}}^{\rm lin}={\boldsymbol{G}}^{\rm lin}({\boldsymbol{X}},{\boldsymbol{X}}), \]

and set for \( k\in\mathbb{N}\)

\[ {\boldsymbol{e}}_k\mathrm{:}= \Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k). \]

Subtracting (207.b) from (204.b) we get for \( k\ge 0\)

\[ \begin{align*} {\boldsymbol{e}}_{k+1} &= {\boldsymbol{e}}_k-2h{\boldsymbol{G}}^k(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}})+2h{\boldsymbol{G}}^{\rm lin}(\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k)-{\boldsymbol{y}})\\ &=({\boldsymbol{I}}_m-2h{\boldsymbol{G}}^{\rm lin}){\boldsymbol{e}}_k-2h ({\boldsymbol{G}}^k-{\boldsymbol{G}}^{\rm lin})(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}}) \end{align*} \]

where \( {\boldsymbol{I}}_m\in\mathbb{R}^{m\times m}\) is the identity. Set \( c\mathrm{:}= 1-h\theta_{\rm {min}}\) . Then by (208.b), (194.b), and because \( h<(\theta_{\min}+\theta_{\max})^{-1}\) , we can bound the second term by

\[ \| 2h ({\boldsymbol{G}}^k-{\boldsymbol{G}}^{\rm lin})(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}}) \|_{} \le \underbrace{2 \frac{mULr}{\theta_{\rm min}+\theta_{\rm max}} \sqrt{ f ({\boldsymbol{w}}_0)}}_{=\mathrm{:} \tilde\alpha} c^k. \]

Moreover, Assumption 1 (X) and \( h< (\theta_{\min}+\theta_{\max})^{-1}\) yield

\[ \| {\boldsymbol{I}}_m-2h{\boldsymbol{G}}^{\rm lin} \|_{}\le 1-2h\theta_{\rm min} \le c. \]

Hence, using \( \sum_{j=0}^\infty c^j = (h\theta_{\rm {min}})^{-1}\)

\[ \| {\boldsymbol{e}}_{k+1} \|_{}\le c\| {\boldsymbol{e}}_k \|_{}+\tilde\alpha c^k\le\dots\le c^k\| {\boldsymbol{e}}_0 \|_{}+\sum_{j=0}^{k}c^{k-j}\tilde\alpha c^j \le c^{k} \| {\boldsymbol{e}}_0 \|_{} + \frac{\tilde\alpha}{h\theta_{\rm min}} (k+1) c^k. \]

Since \( {\boldsymbol{w}}_0={\boldsymbol{p}}_0\) it holds \( \Phi^{\rm {lin}}({\boldsymbol{X}},{\boldsymbol{p}}_0)=\Phi({\boldsymbol{X}},{\boldsymbol{w}}_0)\) (cf. (184)). Thus \( \| {\boldsymbol{e}}_0 \|_{}=0\) which gives the statement.

We are now in a position to prove the theorem.

Proof (of Theorem 27)

Throughout this proof we write for short

\[ {\boldsymbol{G}}^k={\boldsymbol{G}}^k({\boldsymbol{x}},{\boldsymbol{X}})\in\mathbb{R}^{1\times m}~~\text{and}~~ {\boldsymbol{G}}^{\rm lin}={\boldsymbol{G}}^{\rm lin}({\boldsymbol{x}},{\boldsymbol{X}})\in\mathbb{R}^{1\times m}, \]

and set for \( k\in\mathbb{N}\)

\[ e_k\mathrm{:}= \Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)-\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_k). \]

Subtracting (207.a) from (204.a)

\[ \begin{align*} e_{k+1} &= e_k-2h{\boldsymbol{G}}^k(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}})+2h{\boldsymbol{G}}^{\rm lin}(\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k)-{\boldsymbol{y}})\\ &= e_k-2h({\boldsymbol{G}}^k-{\boldsymbol{G}}^{\rm lin})(\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}})+2h {\boldsymbol{G}}^{\rm lin}(\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k)-\Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)). \end{align*} \]

Denote \( c\mathrm{:}= 1-h\theta_{\rm {min}}\) . By (208.a) and (194.b)

\[ \begin{align*} 2h\| {\boldsymbol{G}}^k-{\boldsymbol{G}}^{\rm lin} \|_{}&\le 4h\sqrt{m}ULr\\ \| \Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-{\boldsymbol{y}} \|_{}&\le c^k\sqrt{ f ({\boldsymbol{w}}_0)} \end{align*} \]

and by (190.a) (cf. (206)) and Proposition 16

\[ \begin{align*} 2h\| {\boldsymbol{G}}^{\rm lin} \|_{}&\le 2h\sqrt{m}U^2\\ \| \Phi({\boldsymbol{X}},{\boldsymbol{w}}_k)-\Phi^{\rm lin}({\boldsymbol{X}},{\boldsymbol{p}}_k) \|_{}&\le \alpha k c^{k-1}. \end{align*} \]

Hence for \( k\ge 0\)

\[ \begin{align*} |e_{k+1}|&\le |e_k| + \underbrace{4h\sqrt{m}ULr\sqrt{ f ({\boldsymbol{w}}_0)}}_{=\mathrm{:} \beta_1}c^k + \underbrace{2h\sqrt{m}U^2\alpha}_{=\mathrm{:}\beta_2} k c^{k-1}. \end{align*} \]

Repeatedly applying this bound and using \( \sum_{j\ge 0}c^j=(1-c)^{-1}=(h\theta_{\rm {min}})^{-1}\) and \( \sum_{j\ge 0}jc^{j-1}=(1-c)^{-2}=(h\theta_{\rm {min}})^{-2}\)

\[ |e_{k+1}|\le |e_0|+\beta_1\sum_{j=0}^k c^j+\beta_2 \sum_{j=0}^k j c^{j-1} \le \frac{\beta_1}{h\theta_{\rm min}}+\frac{\beta_2}{(h\theta_{\rm min})^2}. \]

Here we used that due to \( {\boldsymbol{w}}_0={\boldsymbol{p}}_0\) it holds \( \Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)=\Phi^{\rm {lin}}({\boldsymbol{x}},{\boldsymbol{p}}_0)\) so that \( e_0=0\) .

12.6 Training dynamics for shallow neural networks

In this section, following [1], we discuss the implications of Theorems 26 and 27 for wide neural networks. As in [2], for ease of presentation we focus on a shallow architecture with only one hidden layer, but stress that similar considerations also hold for deep networks, see the bibliography section.

12.6.1 Architecture

Let \( \Phi:\mathbb{R}^{d}\to\mathbb{R}\) be a neural network of depth one and width \( n\in\mathbb{N}\) of type

\[ \begin{equation} \Phi({\boldsymbol{x}},{\boldsymbol{w}}) = {\boldsymbol{v}}^\top\sigma({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})+ c. \end{equation} \]

(209)

Here \( {\boldsymbol{x}}\in\mathbb{R}^{d}\) is the input, and \( {\boldsymbol{U}}\in\mathbb{R}^{n\times d}\) , \( {\boldsymbol{v}}\in\mathbb{R}^{n}\) , \( {\boldsymbol{b}}\in\mathbb{R}^{n}\) and \( c\in\mathbb{R}\) are the parameters which we collect in the vector \( {\boldsymbol{w}}=({\boldsymbol{U}},{\boldsymbol{b}},{\boldsymbol{v}},c)\in\mathbb{R}^{n(d+2)+1}\) (with \( {\boldsymbol{U}}\) suitably reshaped). For future reference we note that

\[ \begin{equation} \begin{aligned} \nabla_{{\boldsymbol{U}}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) &=({\boldsymbol{v}}\odot \sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})) {\boldsymbol{x}}^\top \in\mathbb{R}^{n\times d}\\ \nabla_{{\boldsymbol{b}}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) &= {\boldsymbol{v}}\odot \sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}}) \in\mathbb{R}^n\\ \nabla_{{\boldsymbol{v}}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) &= \sigma({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})\in\mathbb{R}^n\\ \nabla_{c}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) &= 1\in\mathbb{R}, \end{aligned} \end{equation} \]

(210)

where \( \odot\) denotes the Hadamard product. We also write \( \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})\in\mathbb{R}^{n(d+2)+1}\) to denote the full gradient with respect to all parameters.
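As a sanity check of the formulas (210), the following NumPy sketch (our own illustration, with \( \sigma=\tanh\) as an assumed activation and illustrative data) evaluates the network (209) and the analytic parameter gradients, and compares one entry against a finite-difference quotient.

import numpy as np

def forward(x, U, b, v, c, sigma=np.tanh):
    # shallow network (209): v^T sigma(U x + b) + c
    return v @ sigma(U @ x + b) + c

def gradients(x, U, b, v, c, sigma=np.tanh, dsigma=lambda z: 1.0 - np.tanh(z) ** 2):
    # parameter gradients as in (210)
    z = U @ x + b
    grad_b = v * dsigma(z)
    grad_U = np.outer(grad_b, x)
    grad_v = sigma(z)
    grad_c = 1.0
    return grad_U, grad_b, grad_v, grad_c

# finite-difference check of one entry of grad_U (illustrative data)
rng = np.random.default_rng(1)
d, n = 3, 7
x = rng.normal(size=d)
U, b, v, c = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=n), 0.5
gU, gb, gv, gc = gradients(x, U, b, v, c)
eps = 1e-6
U_pert = U.copy()
U_pert[0, 0] += eps
fd = (forward(x, U_pert, b, v, c) - forward(x, U, b, v, c)) / eps
print(abs(fd - gU[0, 0]))   # of the order of eps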

In practice, it is common to initialize the weights randomly, and in this section we consider so-called LeCun initialization [3]. The following condition on the distribution \( \mathcal{W}\) used for this initialization will be assumed throughout the rest of Section 12.6.

Assumption 2

The distribution \( \mathcal{W}\) on \( \mathbb{R}\) has expectation zero, variance one, and finite moments up to order eight.

To explicitly indicate the expectation and variance in the notation, we also write \( \mathcal{W}(0,1)\) instead of \( \mathcal{W}\) , and for \( \mu\in\mathbb{R}\) and \( \varsigma>0\) we use \( \mathcal{W}(\mu,\varsigma^2)\) to denote the corresponding scaled and shifted measure with expectation \( \mu\) and variance \( \varsigma^2\) ; thus, if \( X\sim\mathcal{W}(0,1)\) then \( \mu+\varsigma X\sim\mathcal{W}(\mu,\varsigma^2)\) . LeCun initialization sets the variance of the weights in each layer to be reciprocal to the input dimension of the layer: the idea is to normalize the output variance of all network nodes. The initial parameters

\[ {\boldsymbol{w}}_0=({\boldsymbol{U}}_0,{\boldsymbol{b}}_0,{\boldsymbol{v}}_0,c_0) \]

are thus randomly initialized with components

\[ \begin{equation} U_{0;ij}\overset{\rm iid}{\sim} \mathcal{W}\Big(0,\frac{1}{d}\Big),~~ v_{0;i}\overset{\rm iid}{\sim} \mathcal{W}\Big(0,\frac{1}{n}\Big),~~ b_{0;i},~c_0=0, \end{equation} \]

(211)

independently for all \( i=1,\dots,n\) , \( j=1,\dots,d\) . For a fixed \( \varsigma>0\) one might choose variances \( \varsigma^2/d\) and \( \varsigma^2/n\) in (211), which would require only minor modifications in the rest of this section. Biases are set to zero for simplicity, with nonzero initialization discussed in the exercises. All expectations and probabilities in Section 12.6 are understood with respect to this random initialization.
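The initialization (211) itself is a one-liner per parameter group. A minimal sketch, assuming that \( \mathcal{W}(0,1)\) is the standard normal distribution (one of the admissible choices, cf. Example 8 below); the function name is ours.

import numpy as np

def lecun_init(d, n, rng):
    # draw w_0 = (U_0, b_0, v_0, c_0) according to (211)
    U0 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))   # variance 1/d
    v0 = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)        # variance 1/n
    b0 = np.zeros(n)                                       # biases set to zero
    c0 = 0.0
    return U0, b0, v0, c0

rng = np.random.default_rng(0)
d, n = 10, 5000
U0, b0, v0, c0 = lecun_init(d, n, rng)
x = rng.normal(size=d)
# each pre-activation (U_0 x)_i has variance ||x||^2 / d, independently of the width n
print(np.var(U0 @ x), np.linalg.norm(x) ** 2 / d)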

Example 8

Typical examples for \( \mathcal{W}(0,1)\) are the standard normal distribution on \( \mathbb{R}\) or the uniform distribution on \( [-\sqrt{3},\sqrt{3}]\) .

12.6.2 Neural tangent kernel

We begin our analysis by investigating the empirical tangent kernel

\[ \hat K_n({\boldsymbol{x}},{\boldsymbol{z}}) = \left\langle \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0), \nabla_{\boldsymbol{w}} \Phi({\boldsymbol{z}},{\boldsymbol{w}}_0)\right\rangle_{} \]

of the shallow network (209) with initialization (211). Scaled properly, it converges in the infinite width limit \( n\to\infty\) towards a specific kernel known as the neural tangent kernel (NTK) [4]. This kernel depends on both the architecture and the initialization scheme. Since in the following we focus on the specific setting introduced in Section 12.6.1, we simply denote it by \( K^{\rm {NTK}}\) .

Theorem 28

Let \( R<\infty\) be such that \( |\sigma(x)|\le R\cdot (1+|x|)\) and \( |\sigma'(x)|\le R\cdot (1+|x|)\) for all \( x\in\mathbb{R}\) . For any \( {\boldsymbol{x}}\) , \( {\boldsymbol{z}}\in\mathbb{R}^d\) and \( u_i\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1/d)\) , \( i=1,\dots,d\) , it then holds

\[ \lim_{n\to\infty} \frac{1}{n}\hat K_n({\boldsymbol{x}},{\boldsymbol{z}}) =\mathbb{E} [\sigma({\boldsymbol{u}}^\top{\boldsymbol{x}})\sigma({\boldsymbol{u}}^\top{\boldsymbol{z}})]=\mathrm{:} K^{\rm NTK}({\boldsymbol{x}},{\boldsymbol{z}}) \]

almost surely.

Moreover, for every \( \delta\) , \( \varepsilon>0\) there exists \( n_0(\delta,\varepsilon,R)\in\mathbb{N}\) such that for all \( n\ge n_0\) and all \( {\boldsymbol{x}}\) , \( {\boldsymbol{z}}\in\mathbb{R}^d\) with \( \| {\boldsymbol{x}} \|_{}\) , \( \| {\boldsymbol{z}} \|_{}\le R\)

\[ \mathbb{P}\Bigg[\left\| \frac{1}{n}\hat K_n({\boldsymbol{x}},{\boldsymbol{z}})-K^{\rm NTK}({\boldsymbol{x}},{\boldsymbol{z}}) \right\|_{}<\varepsilon\Bigg]\ge 1-\delta. \]

Proof

Denote \( {\boldsymbol{x}}^{(1)}={\boldsymbol{U}}_0{\boldsymbol{x}}+{\boldsymbol{b}}_0\in\mathbb{R}^n\) and \( {\boldsymbol{z}}^{(1)}={\boldsymbol{U}}_0{\boldsymbol{z}}+{\boldsymbol{b}}_0\in\mathbb{R}^n\) . Due to the initialization (211) and our assumptions on \( \mathcal{W}(0,1)\) , the components

\[ x_i^{(1)}=\sum_{j=1}^d U_{0;ij}x_j\sim{\boldsymbol{u}}^\top{\boldsymbol{x}}~~ i=1,\dots,n \]

are i.i.d. with finite \( p\) th moment (independent of \( n\) ) for all \( 1\le p\le 8\) . Due to the linear growth bound on \( \sigma\) and \( \sigma'\) , the same holds for the \( (\sigma(x_i^{(1)}))_{i=1}^n\) and the \( (\sigma'(x_i^{(1)}))_{i=1}^n\) . Similarly, the \( (\sigma(z^{(1)}_i))_{i=1}^n\) and \( (\sigma'(z^{(1)}_i))_{i=1}^n\) are collections of i.i.d. random variables with finite \( p\) th moment for all \( 1\le p\le 8\) .

Denote \( \tilde v_i=\sqrt{n}v_{0;i}\) such that \( \tilde v_i\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1)\) . By (210)

\[ \begin{align*} \frac{1}{n}\hat K_n({\boldsymbol{x}},{\boldsymbol{z}}) &=(1+{\boldsymbol{x}}^\top{\boldsymbol{z}}) \frac{1}{n^2}\sum_{i=1}^n \tilde v_{i}^2\sigma'(x_i^{(1)})\sigma'(z_i^{(1)}) +\frac{1}{n}\sum_{i=1}^n \sigma(x_i^{(1)})\sigma(z_i^{(1)}) +\frac{1}{n}. \end{align*} \]

Since

\[ \begin{equation} \frac{1}{n}\sum_{i=1}^n \tilde v_{i}^2\sigma'(x_i^{(1)})\sigma'(z_i^{(1)}) \end{equation} \]

(212)

is an average over i.i.d. random variables with finite variance, the law of large numbers implies almost sure convergence of this expression towards

\[ \begin{align*} \mathbb{E}\big[\tilde v_{i}^2\sigma'(x_i^{(1)})\sigma'(z_i^{(1)})\big] &=\mathbb{E}[\tilde v_{i}^2] \mathbb{E}[\sigma'(x_i^{(1)})\sigma'(z_i^{(1)})]\\ &= \mathbb{E}[\sigma'({\boldsymbol{u}}^\top{\boldsymbol{x}})\sigma'({\boldsymbol{u}}^\top{\boldsymbol{z}})], \end{align*} \]

where we used that \( \tilde v_{i}^2\) is independent of \( \sigma'(x_i^{(1)})\sigma'(z_i^{(1)})\) . By the same argument

\[ \frac{1}{n}\sum_{i=1}^n \sigma(x_i^{(1)})\sigma(z_i^{(1)})\to \mathbb{E}[\sigma({\boldsymbol{u}}^\top{\boldsymbol{x}})\sigma({\boldsymbol{u}}^\top{\boldsymbol{z}})] \]

almost surely as \( n\to\infty\) . Since (212) converges almost surely, the additional prefactor \( 1/n\) forces the first term in the expression for \( \frac{1}{n}\hat K_n({\boldsymbol{x}},{\boldsymbol{z}})\) to vanish in the limit, and the final \( 1/n\) term vanishes as well. This shows the first statement.

The existence of \( n_0\) follows similarly by an application of Theorem 46.
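A numerical illustration of Theorem 28, written as our own sketch and not taken from the text: we assemble \( \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)\) from (210), compute \( \tfrac1n\hat K_n({\boldsymbol{x}},{\boldsymbol{z}})\) for growing width, and compare with a Monte Carlo estimate of \( K^{\rm NTK}({\boldsymbol{x}},{\boldsymbol{z}})\) . The activation \( \sigma=\tanh\) and all names are assumptions.

import numpy as np

def empirical_ntk(x, z, U0, b0, v0, sigma=np.tanh, dsigma=lambda t: 1.0 - np.tanh(t) ** 2):
    # <grad_w Phi(x, w_0), grad_w Phi(z, w_0)> for the shallow network (209), using (210)
    def grads(inp):
        pre = U0 @ inp + b0
        gb = v0 * dsigma(pre)
        return np.concatenate([np.outer(gb, inp).ravel(), gb, sigma(pre), [1.0]])
    return grads(x) @ grads(z)

rng = np.random.default_rng(0)
d = 4
x, z = rng.normal(size=d), rng.normal(size=d)

# Monte Carlo reference for K^NTK(x, z) = E[sigma(u^T x) sigma(u^T z)], u_i iid N(0, 1/d)
u = rng.normal(0.0, 1.0 / np.sqrt(d), size=(200000, d))
K_ref = np.mean(np.tanh(u @ x) * np.tanh(u @ z))

for n in [100, 1000, 10000]:
    U0 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
    v0 = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)
    b0 = np.zeros(n)
    print(n, empirical_ntk(x, z, U0, b0, v0) / n, K_ref)   # first value approaches the second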

Example 9 (\( K^{\rm {NTK}}\) for ReLU)

Let \( \sigma(x)=\max\{0,x\}\) and let \( \mathcal{W}(0,1)\) be the standard normal distribution. For \( {\boldsymbol{x}}\) , \( {\boldsymbol{z}}\in\mathbb{R}^d\) denote by

\[ \vartheta= \arccos\left(\frac{{\boldsymbol{x}}^\top{\boldsymbol{z}}}{\| {\boldsymbol{x}} \|_{}\| {\boldsymbol{z}} \|_{}}\right) \]

the angle between these vectors. Then according to [5, Appendix A], it holds with \( u_i\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1/d)\) , \( i=1,\dots,d\) , as in Theorem 28,

\[ K^{\rm NTK}({\boldsymbol{x}},{\boldsymbol{z}}) = \mathbb{E}[\sigma({\boldsymbol{u}}^\top{\boldsymbol{x}})\sigma({\boldsymbol{u}}^\top{\boldsymbol{z}})]=\frac{\| {\boldsymbol{x}} \|_{}\| {\boldsymbol{z}} \|_{}}{2\pi d}(\sin(\vartheta)+(\pi-\vartheta)\cos(\vartheta)). \]
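The closed form of Example 9 can be checked against a direct Monte Carlo estimate of \( \mathbb{E}[\sigma({\boldsymbol{u}}^\top{\boldsymbol{x}})\sigma({\boldsymbol{u}}^\top{\boldsymbol{z}})]\) with \( u_i\sim{\rm N}(0,1/d)\) ; a short sketch of our own, purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
d = 5
x, z = rng.normal(size=d), rng.normal(size=d)

# closed form from Example 9
nx, nz = np.linalg.norm(x), np.linalg.norm(z)
theta = np.arccos(np.clip(x @ z / (nx * nz), -1.0, 1.0))
K_closed = nx * nz / (2.0 * np.pi * d) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

# Monte Carlo estimate of E[relu(u^T x) relu(u^T z)] with u_i iid N(0, 1/d)
u = rng.normal(0.0, 1.0 / np.sqrt(d), size=(1_000_000, d))
K_mc = np.mean(np.maximum(u @ x, 0.0) * np.maximum(u @ z, 0.0))

print(K_closed, K_mc)   # agreement up to Monte Carlo error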

12.6.3 Training dynamics and model predictions

We now proceed as in [1, Appendix G], to show that the analysis in Sections 12.4-12.5 is applicable to the wide neural network (209) with high probability under random initialization (211). We work under the following assumptions on the activation function and training data [1, Assumptions 1-4].

Assumption 3

There exist \( 1\le R<\infty\) and \( 0<\theta_{\rm {min}}^{\rm NTK}\le \theta_{\rm max}^{\rm NTK}<\infty\) such that

  1. \( \sigma:\mathbb{R}\to\mathbb{R}\) belongs to \( C^1(\mathbb{R})\) and \( |\sigma(0)|\) , \( {\rm {Lip}}(\sigma)\) , \( {\rm {Lip}}(\sigma')\le R\) ,
  2. \( \| {\boldsymbol{x}}_i \|_{}\) , \( |y_i|\le R\) for all training data \( ({\boldsymbol{x}}_i,y_i)\in\mathbb{R}^d\times\mathbb{R}\) , \( i=1,\dots,m\) ,
  3. the kernel matrix of the neural tangent kernel

    \[ (K^{\rm NTK}({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m\in\mathbb{R}^{m\times m} \]

    is regular and its eigenvalues belong to \( [\theta_{\rm {min}}^{\rm NTK},\theta_{\rm max}^{\rm NTK}]\) .

We start by showing Assumption 1 (X) for the present setting. More precisely, we give bounds for the eigenvalues of the empirical tangent kernel.

Lemma 29

Let Assumption 3 be satisfied. Then for every \( \delta>0\) there exists \( n_0(\delta,\theta_{\rm {min}}^{\rm NTK},m,R)\in\mathbb{N}\) such that for all \( n\ge n_0\) it holds with probability at least \( 1-\delta\) that all eigenvalues of

\[ (\hat K_n({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m= \big(\left\langle \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0), \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}}_0)\right\rangle_{}\big)_{i,j=1}^m\in\mathbb{R}^{m\times m} \]

belong to \( [n\theta_{\rm {min}}^{\rm NTK}/2,2n\theta_{\rm max}^{\rm NTK}]\) .

Proof

Denote \( \hat{\boldsymbol{G}}_n\mathrm{:}= (\hat K_n({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m\) and \( {\boldsymbol{G}}^{\rm {NTK}}\mathrm{:}= (K^{\rm NTK}({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^m\) . By Theorem 28, there exists \( n_0\) such that for all \( n\ge n_0\) it holds with probability at least \( 1-\delta\) that

\[ \left\| {\boldsymbol{G}}^{\rm NTK}-\frac{1}{n}\hat{\boldsymbol{G}}_n \right\|_{}\le\frac{\theta_{\rm min}^{\rm NTK}}{2}. \]

Assuming this bound to hold

\[ \frac{1}{n}\inf_{\substack{{\boldsymbol{a}}\in\mathbb{R}^m\\ \| {\boldsymbol{a}} \|_{}=1}} \| \hat{\boldsymbol{G}}_n{\boldsymbol{a}} \|_{} \ge \inf_{\substack{{\boldsymbol{a}}\in\mathbb{R}^m\\ \| {\boldsymbol{a}} \|_{}=1}} \| {\boldsymbol{G}}^{\rm NTK}{\boldsymbol{a}} \|_{}-\frac{\theta_{\rm min}^{\rm NTK}}{2} \ge \theta_{\rm min}^{\rm NTK}-\frac{\theta_{\rm min}^{\rm NTK}}{2}= \frac{\theta_{\rm min}^{\rm NTK}}{2}, \]

where we have used that \( \theta_{\rm {min}}^{\rm NTK}\) is the smallest eigenvalue, and thus singular value, of the symmetric positive definite matrix \( {\boldsymbol{G}}^{\rm {NTK}}\) . This shows that (with probability at least \( 1-\delta\) ) the smallest eigenvalue of \( \hat{\boldsymbol{G}}_n\) is larger than or equal to \( n\theta_{\rm {min}}^{\rm NTK}/2\) . Similarly, we conclude that the largest eigenvalue is bounded from above by \( n(\theta_{\rm {max}}^{\rm NTK}+\theta_{\rm min}^{\rm NTK}/2)\le 2n\theta_{\rm max}^{\rm NTK}\) . This concludes the proof.

Next we check Assumption 1 (XI). To this end we first bound the norm of a random matrix.

Lemma 30

Let \( \mathcal{W}(0,1)\) be as in Assumption 2, and let \( {\boldsymbol{W}}\in\mathbb{R}^{n\times d}\) with \( W_{ij}\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1)\) . Denote the fourth moment of \( \mathcal{W}(0,1)\) by \( \mu_4\) . Then

\[ \mathbb{P}\Big[\| {\boldsymbol{W}} \|_{}\le \sqrt{n(d+1)}\Big]\ge 1-\frac{d\mu_4}{n}. \]

Proof

It holds

\[ \| {\boldsymbol{W}} \|_{} \le \| {\boldsymbol{W}} \|_{F} = \Big(\sum_{i=1}^n\sum_{j=1}^dW_{ij}^2\Big)^{1/2}. \]

The \( \alpha_i\mathrm{:}= \sum_{j=1}^dW_{ij}^2\) , \( i=1,\dots,n\) , are i.i.d. with expectation \( d\) and finite variance \( d C\) , where \( C\le\mu_4\) is the variance of \( W_{11}^2\) . By Theorem 46

\[ \mathbb{P}\Big[\| {\boldsymbol{W}} \|_{}>\sqrt{n(d+1)}\Big] \le \mathbb{P}\Big[\frac{1}{n}\sum_{i=1}^n\alpha_i>d+1\Big]\le \mathbb{P}\Big[\Big|\frac{1}{n}\sum_{i=1}^n\alpha_i-d\Big|>1\Big]\le \frac{d\mu_4}{n}, \]

which concludes the proof.

Lemma 31

Let Assumption 3 (XIII) be satisfied with some constant \( R\) . Then there exists \( M(R)>0\) such that for all \( c\) , \( \delta>0\) there exists \( n_0(c,d,\delta)\in\mathbb{N}\) such that for all \( n\ge n_0\) it holds with probability at least \( 1-\delta\) that

\[ \begin{aligned} \| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{}&\le M\sqrt{n} &&\text{for all }{\boldsymbol{w}}\in B_{c n^{-1/2}}({\boldsymbol{w}}_0)\\ \| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{v}}) \|_{}&\le M\sqrt{n}\| {\boldsymbol{w}}-{\boldsymbol{v}} \|_{} &&\text{for all }{\boldsymbol{w}},~{\boldsymbol{v}}\in B_{c n^{-1/2}}({\boldsymbol{w}}_0) \end{aligned} \]

for all \( {\boldsymbol{x}}\in\mathbb{R}^d\) with \( \| {\boldsymbol{x}} \|_{}\le R\) .

Proof

Due to the initialization (211), by Lemma 30 we can find \( \tilde n_0(\delta,d)\) such that for all \( n\ge \tilde n_0\) it holds with probability at least \( 1-\delta\) that

\[ \begin{equation} \| {\boldsymbol{v}}_0 \|_{}\le 2~~\text{and}~~ \| {\boldsymbol{U}}_0 \|_{}\le 2\sqrt{n}. \end{equation} \]

(213)

For the rest of this proof we let \( {\boldsymbol{x}}\in\mathbb{R}^d\) arbitrary with \( \| {\boldsymbol{x}} \|_{}\le R\) , we set

\[ n_0\mathrm{:}= \max\{c^2,\tilde n_0(\delta,d)\} \]

and we fix \( n\ge n_0\) so that \( n^{-1/2}c\le 1\) . To prove the lemma we need to show that the claimed inequalities hold as long as (213) is satisfied. We will use several times that for all \( {\boldsymbol{p}}\) , \( {\boldsymbol{q}}\in\mathbb{R}^n\)

\[ \| {\boldsymbol{p}}\odot{\boldsymbol{q}} \|_{}\le\| {\boldsymbol{p}} \|_{}\| {\boldsymbol{q}} \|_{}~~\text{and}~~ \| \sigma({\boldsymbol{p}}) \|_{}\le R\sqrt{n}+R\| {\boldsymbol{p}} \|_{} \]

since \( |\sigma(x)|\le R\cdot (1+|x|)\) . The same holds for \( \sigma'\) .

Step 1. We show the bound on the gradient. Fix

\[ {\boldsymbol{w}}=({\boldsymbol{U}},{\boldsymbol{b}},{\boldsymbol{v}},c)~~\text{s.t.}~~ \| {\boldsymbol{w}}-{\boldsymbol{w}}_0 \|_{}\le cn^{-1/2}. \]

Using formula (210) for \( \nabla_{\boldsymbol{b}}\Phi\) , the fact that \( {\boldsymbol{b}}_0=\boldsymbol{0}\) by (211), and the above inequalities

\[ \begin{align} \| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{} &\le \| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0) \|_{}+ \| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0) \|_{}\nonumber\\ &=\| {\boldsymbol{v}}_0\odot\sigma'({\boldsymbol{U}}_0{\boldsymbol{x}}) \|_{}+ \| {\boldsymbol{v}}\odot\sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})-{\boldsymbol{v}}_0\odot\sigma'({\boldsymbol{U}}_0{\boldsymbol{x}}) \|_{}\nonumber\\ &\le 2(R\sqrt{n}+2R^2\sqrt{n}) + \| {\boldsymbol{v}}\odot\sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})-{\boldsymbol{v}}_0\odot\sigma'({\boldsymbol{U}}_0{\boldsymbol{x}}) \|_{}. \end{align} \]

(214)

Due to

\[ \begin{equation} \| {\boldsymbol{U}} \|_{}\le\| {\boldsymbol{U}}_0 \|_{}+\| {\boldsymbol{U}}_0-{\boldsymbol{U}} \|_{F}\le 2\sqrt{n}+cn^{-1/2}\le 3\sqrt{n}, \end{equation} \]

(216)

and using the fact that \( \sigma'\) has Lipschitz constant \( R\) , the last norm in (214) is bounded by

\[ \begin{align*} &\| ({\boldsymbol{v}}-{\boldsymbol{v}}_0)\odot\sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}}) \|_{}+\| {\boldsymbol{v}}_0\odot(\sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})-\sigma'({\boldsymbol{U}}_0{\boldsymbol{x}})) \|_{}\\ &~~\le cn^{-1/2}(R\sqrt{n}+R\cdot (\| {\boldsymbol{U}} \|_{}\| {\boldsymbol{x}} \|_{}+\| {\boldsymbol{b}} \|_{})) + 2 R\cdot (\| {\boldsymbol{U}}-{\boldsymbol{U}}_0 \|_{}\| {\boldsymbol{x}} \|_{}+\| {\boldsymbol{b}} \|_{})\\ &~~\le R\sqrt{n}+3\sqrt{n} R^2+cn^{-1/2}R+2R\cdot (cn^{-1/2}R+cn^{-1/2})\nonumber\\ &~~\le \sqrt{n}(4R+5R^2) \end{align*} \]

and therefore

\[ \| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{}\le \sqrt{n}(6R+9R^2). \]

For the gradient with respect to \( {\boldsymbol{U}}\) we use \( \nabla_{\boldsymbol{U}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})=\nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}){\boldsymbol{x}}^\top\) , so that

\[ \| \nabla_{\boldsymbol{U}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{F} =\| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}){\boldsymbol{x}}^\top \|_{F} =\| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{}\| {\boldsymbol{x}} \|_{} \le \sqrt{n} (6R^2+9R^3). \]

Next

\[ \begin{align*} \| \nabla_{\boldsymbol{v}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{} &= \| \sigma({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}}) \|_{}\\ &\le R\sqrt{n} + R\| {\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}} \|_{}\\ &\le R\sqrt{n} + R\cdot (3\sqrt{n}R+ cn^{-1/2})\\ &\le \sqrt{n}(2R+3R^2), \end{align*} \]

and finally \( \nabla_c\Phi({\boldsymbol{x}},{\boldsymbol{w}})=1\) . In all, with \( M_1(R)\mathrm{:}= (1+8R+12R^2)\)

\[ \| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}) \|_{}\le \sqrt{n} M_1(R). \]

Step 2. We show Lipschitz continuity. Fix

\[ {\boldsymbol{w}}=({\boldsymbol{U}},{\boldsymbol{b}},{\boldsymbol{v}},c)~~\text{and}~~\tilde{\boldsymbol{w}}=(\tilde{\boldsymbol{U}},\tilde{\boldsymbol{b}},\tilde{\boldsymbol{v}},\tilde c) \]

such that \( \| {\boldsymbol{w}}-{\boldsymbol{w}}_0 \|_{}\) , \( \| \tilde{\boldsymbol{w}}-{\boldsymbol{w}}_0 \|_{}\le cn^{-1/2}\) . Then

\[ \| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}) \|_{} =\| {\boldsymbol{v}}\odot\sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})-\tilde{\boldsymbol{v}}\odot\sigma'(\tilde{\boldsymbol{U}}{\boldsymbol{x}}+\tilde {\boldsymbol{b}}) \|_{}. \]

Using \( \| \tilde{\boldsymbol{v}} \|_{}\le \| {\boldsymbol{v}}_0 \|_{}+cn^{-1/2}\le 3\) and (216), this term is bounded by

\[ \begin{align*} &\| ({\boldsymbol{v}}-\tilde{\boldsymbol{v}})\odot\sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}}) \|_{}+\| \tilde{\boldsymbol{v}}\odot(\sigma'({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})-\sigma'(\tilde{\boldsymbol{U}}{\boldsymbol{x}}+\tilde{\boldsymbol{b}})) \|_{}\\ &~~\le \| {\boldsymbol{v}}-\tilde{\boldsymbol{v}} \|_{}(R\sqrt{n}+R\cdot (\| {\boldsymbol{U}} \|_{}\| {\boldsymbol{x}} \|_{}+\| {\boldsymbol{b}} \|_{})) +3R\cdot (\| {\boldsymbol{x}} \|_{}\| {\boldsymbol{U}}-\tilde{\boldsymbol{U}} \|_{}+\| {\boldsymbol{b}}-\tilde{\boldsymbol{b}} \|_{})\\ &~~\le \| {\boldsymbol{w}}-\tilde{\boldsymbol{w}} \|_{}\sqrt{n}(5R+6R^2). \end{align*} \]

For \( \nabla_{\boldsymbol{U}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})\) we obtain similar as in Step 1

\[ \begin{align*} \| \nabla_{\boldsymbol{U}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{U}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}) \|_{F}&= \| {\boldsymbol{x}} \|_{}\| \nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{b}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}) \|_{}\\ &\le \| {\boldsymbol{w}}-\tilde{\boldsymbol{w}} \|_{}\sqrt{n}(5R^2+6R^3). \end{align*} \]

Next

\[ \begin{align*} \| \nabla_{\boldsymbol{v}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{v}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}) \|_{} &= \| \sigma({\boldsymbol{U}}{\boldsymbol{x}}+{\boldsymbol{b}})-\sigma(\tilde{\boldsymbol{U}}{\boldsymbol{x}}+\tilde{\boldsymbol{b}}) \|_{}\\ &\le R\cdot (\| {\boldsymbol{U}}-\tilde{\boldsymbol{U}} \|_{}\| {\boldsymbol{x}} \|_{}+\| {\boldsymbol{b}}-\tilde{\boldsymbol{b}} \|_{})\\ &\le \| {\boldsymbol{w}}-\tilde{\boldsymbol{w}} \|_{}(R^2+R) \end{align*} \]

and finally \( \nabla_c\Phi({\boldsymbol{x}},{\boldsymbol{w}})=1\) is constant. With \( M_2(R)\mathrm{:}= R+6R^2+6R^3\) this shows

\[ \| \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}})-\nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},\tilde{\boldsymbol{w}}) \|_{}\le \sqrt{n} M_2(R)\| {\boldsymbol{w}}-\tilde{\boldsymbol{w}} \|_{}. \]

In all, this concludes the proof with \( M(R)\mathrm{:}=\max\{M_1(R),M_2(R)\}\) .

Next, we show that the initial error \( f ({\boldsymbol{w}}_0)\) remains bounded with high probability.

Lemma 32

Let Assumption 3 (XIII), (XIV) be satisfied. Then for every \( \delta>0\) there exists \( R_0(\delta,m,R)>0\) such that for all \( n\in\mathbb{N}\)

\[ \mathbb{P}[ f ({\boldsymbol{w}}_0)\le R_0]\ge 1-\delta. \]

Proof

Let \( i\in\{1,\dots,m\}\) , and set \( {\boldsymbol{\alpha}}\mathrm{:}= {\boldsymbol{U}}_0{\boldsymbol{x}}_i\) and \( \tilde v_j\mathrm{:}= \sqrt{n} v_{0;j}\) for \( j=1,\dots,n\) , so that \( \tilde v_j\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1)\) . Then

\[ \Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0) = \frac{1}{\sqrt{n}}\sum_{j=1}^n\tilde v_j\sigma(\alpha_j). \]

By Assumption 2 and (211), the \( \tilde v_{j}\sigma(\alpha_j)\) , \( j=1,\dots,n\) , are i.i.d. centered random variables with finite variance bounded by a constant \( C(R)\) independent of \( n\) . Thus the variance of \( \Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0)\) is also bounded by \( C(R)\) . By Chebyshev’s inequality, see Lemma 37, for every \( k>0\)

\[ \mathbb{P}[|\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0)|\ge k\sqrt{C}]\le \frac{1}{k^2}. \]

Setting \( k=\sqrt{m/\delta}\)

\[ \begin{align*} \mathbb{P}\Big[\sum_{i=1}^m|\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0)-y_i|^2 \ge m (k \sqrt{C}+R)^2\Big] &\le \sum_{i=1}^m\mathbb{P}\Big[|\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0)-y_i| \ge k \sqrt{C}+R\Big]\\ &\le \sum_{i=1}^m\mathbb{P}\Big[|\Phi({\boldsymbol{x}}_i,{\boldsymbol{w}}_0)| \ge k \sqrt{C}\Big]\le \delta, \end{align*} \]

which shows the claim with \( R_0=m\cdot (\sqrt{Cm/\delta}+R)^2\) .

The next theorem, which corresponds to [1, Thms. G.1 and H.1], is the main result of this section. It summarizes our findings in the present setting for a shallow network of width \( n\) : with high probability, gradient descent converges to a global minimizer and the limiting network interpolates the data. During training the network weights remain close to initialization. The trained network gives predictions that are \( O(n^{-1/2})\) close to the predictions of the trained linearized model. In the statement of the theorem we denote again by \( \Phi^{\rm {lin}}\) the linearization of \( \Phi\) defined in (184), and by \( f ^{\rm {lin}}\) , \( f \) , the corresponding square loss objectives defined in (167.b), (186), respectively.

Theorem 29

Let Assumption 3 be satisfied, and let the parameters \( {\boldsymbol{w}}_0\) of the width-\( n\) neural network \( \Phi\) in (209) be initialized according to (211). Fix the learning rate

\[ h= \frac{1}{\theta_{\rm min}^{\rm NTK}+4\theta_{\rm max}^{\rm NTK}}\frac{1}{n}, \]

set \( {\boldsymbol{p}}_0\mathrm{:}={\boldsymbol{w}}_0\) and let for all \( k\in\mathbb{N}_0\)

\[ {\boldsymbol{w}}_{k+1}={\boldsymbol{w}}_k-h\nabla f ({\boldsymbol{w}}_k)~~\text{and}~~ {\boldsymbol{p}}_{k+1}={\boldsymbol{p}}_k-h\nabla f ^{\rm lin}({\boldsymbol{p}}_k). \]

Then for every \( \delta>0\) there exist \( C>0\) , \( n_0\in\mathbb{N}\) such that for all \( n\ge n_0\) it holds with probability at least \( 1-\delta\) that for all \( k\in\mathbb{N}\) and all \( {\boldsymbol{x}}\in\mathbb{R}^d\) with \( \| {\boldsymbol{x}} \|_{}\le R\)

\[ \begin{align} \| {\boldsymbol{w}}_k-{\boldsymbol{w}}_0 \|_{}&\le \frac{C}{\sqrt{n}} \end{align} \]

(217.a)

\[ \begin{align} f ({\boldsymbol{w}}_k)&\le C \Big(1-\frac{h n\theta_{\rm min}^{\rm NTK}}{2}\Big)^{2k} \end{align} \]

(217.b)

\[ \begin{align} \| \Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)-\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_k) \|_{}&\le \frac{C}{\sqrt{n}}. \end{align} \]

(217.c)

Proof

We wish to apply Theorems 26 and 27. This requires Assumption 1 to be satisfied.

Fix \( \delta>0\) and let \( R_0=R_0(\delta/2)\) be as in Lemma 32, so that with probability at least \( 1-\delta/2\) it holds \( f ({\boldsymbol{w}}_0)\le R_0\) , and thus \( \sqrt{ f ({\boldsymbol{w}}_0)}\le \sqrt{R_0}\) . Next, let \( M=M(R)\) be as in Lemma 31, and fix \( n_{0,1}\in\mathbb{N}\) and \( c>0\) so large that for all \( n\ge n_{0,1}\)

\[ \begin{equation} M\sqrt{n}\le \frac{n^2(\theta_{\rm min}^{\rm NTK}/2)^2}{12m^{3/2}M^2n\sqrt{R_0}}~~ \text{and}~~ c n^{-1/2}=\frac{2\sqrt{m}M\sqrt{n}}{n\theta_{\rm min}^{\rm NTK}}\sqrt{R_0}. \end{equation} \]

(218)

By Lemmas 29 and 31, we can then find \( n_{0,2}\) such that for all \( n\ge n_{0,2}\) , with probability at least \( 1-\delta/2\) , Assumption 1 (X), (XI) holds with the values

\[ \begin{equation} L = M \sqrt{n},~ U = M \sqrt{n},~ r = cn^{-1/2},~ \theta_{\rm min}=\frac{n\theta_{\rm min}^{\rm NTK}}{2},~ \theta_{\rm max}=2n\theta_{\rm max}^{\rm NTK}. \end{equation} \]

(219)

Together with (218), this shows that Assumption 1 holds with probability at least \( 1-\delta\) as long as \( n\ge n_0\mathrm{:}= \max\{n_{0,1},n_{0,2}\}\) .

Inequalities (217.a), (217.b) are then a direct consequence of Theorem 26. For (217.c), we plug the values of (219) into the bound in Theorem 27 to obtain for \( k\in\mathbb{N}\)

\[ \begin{align*} \| \Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)-\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_k) \|_{}&\le \frac{4\sqrt{m}ULr}{\theta_{\rm min}}\left(1+\frac{mU^2}{(h\theta_{\rm min})^2(\theta_{\rm min}+\theta_{\rm max})}\right)\sqrt{ f ({\boldsymbol{w}}_0)}\\ &\le \frac{C_1}{\sqrt{n}}(1+C_2)\sqrt{ f ({\boldsymbol{w}}_0)}, \end{align*} \]

for some constants \( C_1\) , \( C_2\) depending on \( m\) , \( M\) , \( c\) , \( \theta_{\rm {min}}^{\rm NTK}\) , \( \theta_{\rm {max}}^{\rm NTK}\) but independent of \( n\) .

Note that the convergence rate in (217.b) does not improve as \( n\) grows, since \( h\) is bounded by a constant times \( 1/n\) .
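The statement of Theorem 29 can be probed empirically by training the full network and its linearization side by side and recording the maximal prediction gap (217.c) on a test grid. The following is a hedged NumPy sketch under our own choices (tanh activation, \( d=1\) , \( m=3\) random data points, and a step size taken from the empirical kernel eigenvalues in the spirit of the theorem); the gap is expected to shrink roughly like \( n^{-1/2}\) as the width grows.

import numpy as np

def unpack(w, d, n):
    U = w[:n * d].reshape(n, d)
    return U, w[n * d:n * d + n], w[n * d + n:n * d + 2 * n], w[-1]

def phi(w, X, d, n):
    U, b, v, c = unpack(w, d, n)
    return np.tanh(X @ U.T + b) @ v + c

def grad_phi(w, X, d, n):
    # rows: grad_w Phi(x_i, w), assembled from (210)
    U, b, v, c = unpack(w, d, n)
    Z = X @ U.T + b
    S, dS = np.tanh(Z), 1.0 - np.tanh(Z) ** 2
    gU = (dS * v)[:, :, None] * X[:, None, :]
    return np.concatenate([gU.reshape(len(X), -1), dS * v, S, np.ones((len(X), 1))], axis=1)

rng = np.random.default_rng(0)
d, m = 1, 3
X, y = rng.normal(size=(m, d)), rng.normal(size=m)
x_test = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)
for n in [200, 800, 3200]:
    w0 = np.concatenate([rng.normal(0, 1 / np.sqrt(d), n * d), np.zeros(n),
                         rng.normal(0, 1 / np.sqrt(n), n), [0.0]])   # LeCun init (211)
    w, p = w0.copy(), w0.copy()
    J0, phi0_X = grad_phi(w0, X, d, n), phi(w0, X, d, n)
    lam = np.linalg.eigvalsh(J0 @ J0.T)                 # empirical kernel eigenvalues
    h = 1.0 / (lam[0] + 4.0 * lam[-1])                  # step size in the spirit of Theorem 29
    for _ in range(2000):
        w = w - 2.0 * h * grad_phi(w, X, d, n).T @ (phi(w, X, d, n) - y)
        p = p - 2.0 * h * J0.T @ (phi0_X + J0 @ (p - w0) - y)
    lin = phi(w0, x_test, d, n) + grad_phi(w0, x_test, d, n) @ (p - w0)
    print(n, np.max(np.abs(phi(w, x_test, d, n) - lin)))   # gap expected to shrink with n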

12.6.4 Connection to kernel least-squares and Gaussian processes

Theorem 29 establishes that the trained neural network mirrors the behaviour of the trained linearized model. As pointed out in Section 12.3, the prediction of the trained linearized model corresponds to the ridgeless least squares estimator plus a term depending on the choice of random initialization \( {\boldsymbol{w}}_0\in\mathbb{R}^n\) . We should thus understand both the model at initialization \( {\boldsymbol{x}}\mapsto\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)\) and the model after training \( {\boldsymbol{x}}\mapsto\Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)\) , as random draws of a certain distribution over functions. To explain this further, we introduce Gaussian processes.

Definition 25

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space (see Section 18.1), and let \( g:\mathbb{R}^d\times \Omega\to\mathbb{R}\) . We call \( g\) a Gaussian process with mean function \( \mu:\mathbb{R}^d\to\mathbb{R}\) and covariance function \( c:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}\) if

  1. for each \( {\boldsymbol{x}}\in\mathbb{R}^d\) it holds that \( \omega\mapsto g({\boldsymbol{x}},\omega)\) is a random variable,
  2. for all \( k\in\mathbb{N}\) and all \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_k\in\mathbb{R}^d\) the random variables \( g({\boldsymbol{x}}_1,\cdot),\dots,g({\boldsymbol{x}}_k,\cdot)\) are jointly Gaussian distributed with

    \[ (g({\boldsymbol{x}}_1,\omega),\dots,g({\boldsymbol{x}}_k,\omega))\sim {\rm N}\Big(\mu({\boldsymbol{x}}_i)_{i=1}^k,(c({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^k\Big). \]

In words, \( g\) is a Gaussian process if \( \omega\mapsto g({\boldsymbol{x}},\omega)\) defines a collection of random variables indexed over \( {\boldsymbol{x}}\in\mathbb{R}^d\) , and the joint distribution of \( (g({\boldsymbol{x}}_j,\cdot))_{j=1}^k\) is Gaussian with mean and covariance determined by \( \mu\) and \( c\) respectively. Fixing \( \omega\in\Omega\) , we can then interpret \( {\boldsymbol{x}}\mapsto g({\boldsymbol{x}},\omega)\) as a random draw from a distribution over functions.

As first observed in [6], certain neural networks at initialization tend to Gaussian processes in the infinite width limit.

Proposition 17

Let \( |\sigma(x)|\le R(1+|x|)^4\) for all \( x\in\mathbb{R}\) . Consider width-\( n\) networks \( \Phi\) as in (209) with initialization (211). Let \( K^{\rm {NTK}}:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}\) be as in Theorem 28.

Then for all distinct \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_k\in\mathbb{R}^d\) it holds that

\[ \lim_{n\to\infty}(\Phi({\boldsymbol{x}}_1,{\boldsymbol{w}}_0),\dots,\Phi({\boldsymbol{x}}_k,{\boldsymbol{w}}_0)) \sim {\rm N}(\boldsymbol{0},(K^{\rm NTK}({\boldsymbol{x}}_i,{\boldsymbol{x}}_j))_{i,j=1}^k) \]

with convergence in distribution.

Proof

Set \( \tilde v_i\mathrm{:}= \sqrt{n}v_{0,i}\) and \( \tilde{\boldsymbol{u}}_i=(U_{0,i1},\dots,U_{0,id})\in\mathbb{R}^d\) , so that \( \tilde v_i\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1)\) , and the \( \tilde{\boldsymbol{u}}_i\in\mathbb{R}^d\) are also i.i.d., with each component distributed according to \( \mathcal{W}(0,1/d)\) .

Then for any \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_k\)

\[ {\boldsymbol{Z}}_i\mathrm{:}= \begin{pmatrix} \tilde v_i \sigma(\tilde{\boldsymbol{u}}_i^\top{\boldsymbol{x}}_1)\\ \vdots\\ \tilde v_i \sigma(\tilde{\boldsymbol{u}}_i^\top{\boldsymbol{x}}_k) \end{pmatrix}\in\mathbb{R}^k~~ i=1,\dots,n, \]

defines \( n\) centered i.i.d. vectors in \( \mathbb{R}^k\) with finite second moments (here we use the assumption on \( \sigma\) and the fact that \( \mathcal{W}(0,1)\) has finite moments of order \( 8\) by Assumption 2). By the central limit theorem, see Theorem 48,

\[ \begin{pmatrix} \Phi({\boldsymbol{x}}_1,{\boldsymbol{w}}_0)\\ \vdots\\ \Phi({\boldsymbol{x}}_k,{\boldsymbol{w}}_0)\\ \end{pmatrix} =\frac{1}{\sqrt{n}}\sum_{i=1}^n{\boldsymbol{Z}}_i \]

converges in distribution to \( {\rm {N}}(\boldsymbol{0},{\boldsymbol{C}})\) , where

\[ C_{ij}=\mathbb{E}[\tilde v_1^2\sigma(\tilde{\boldsymbol{u}}_1^\top{\boldsymbol{x}}_i)\sigma(\tilde{\boldsymbol{u}}_1^\top{\boldsymbol{x}}_j)] =\mathbb{E}[\sigma(\tilde{\boldsymbol{u}}_1^\top{\boldsymbol{x}}_i)\sigma(\tilde{\boldsymbol{u}}_1^\top{\boldsymbol{x}}_j)]=K^{\rm NTK}({\boldsymbol{x}}_i,{\boldsymbol{x}}_j). \]

This concludes the proof.

In the sense of Proposition 17, the network \( \Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)\) converges to a Gaussian process as the width \( n\) tends to infinity. It can also be shown that the linearized network after training corresponds to a Gaussian process, with a mean and covariance function that depend on the data, architecture, and initialization. Since the full and linearized models coincide in the infinite width limit (see Theorem 29) we can infer that wide networks post-training resemble draws from a Gaussian process, see [1, Section 2.3.1] and [7].
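Proposition 17 can be visualized by sampling many independent initializations and comparing the empirical covariance of \( (\Phi({\boldsymbol{x}}_1,{\boldsymbol{w}}_0),\dots,\Phi({\boldsymbol{x}}_k,{\boldsymbol{w}}_0))\) with the kernel matrix of \( K^{\rm NTK}\) . A hedged sketch of our own, assuming a tanh activation and Gaussian \( \mathcal{W}(0,1)\) :

import numpy as np

rng = np.random.default_rng(0)
d, n, k, trials = 3, 2000, 4, 3000
Xs = rng.normal(size=(k, d))

# empirical covariance of (Phi(x_1, w_0), ..., Phi(x_k, w_0)) over independent initializations (211)
samples = np.empty((trials, k))
for t in range(trials):
    U0 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
    v0 = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)
    samples[t] = np.tanh(Xs @ U0.T) @ v0          # zero hidden and output biases
emp_cov = np.cov(samples, rowvar=False)

# Monte Carlo estimate of the limiting covariance matrix (K^NTK(x_i, x_j))_{i,j}
u = rng.normal(0.0, 1.0 / np.sqrt(d), size=(200000, d))
Su = np.tanh(Xs @ u.T)                             # shape (k, 200000)
ref_cov = Su @ Su.T / u.shape[0]

print(np.max(np.abs(emp_cov - ref_cov)))           # small for large width and many trials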

To motivate the mean function of this Gaussian process, we informally take an expectation (over random initialization) of (188), yielding

\[ \mathbb{E}\Big[\lim_{k\to\infty}\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_k)\Big] = \underbrace{\left\langle \phi({\boldsymbol{x}}), {\boldsymbol{p}}_*\right\rangle_{}}_{\substack{\text{ridgeless kernel least-squares} \\ \text{estimator with kernel \( \hat K_n\) }}} \;+\; \underbrace{\mathbb{E}[\left\langle \phi({\boldsymbol{x}}), \hat{\boldsymbol{w}}_0\right\rangle_{}]}_{=0}. \]

Here the second term vanishes because \( \hat{\boldsymbol{w}}_0\) is the orthogonal projection of the centered random variable \( {\boldsymbol{w}}_0\) onto a subspace, so that \( \mathbb{E}[\hat{\boldsymbol{w}}_0]=\boldsymbol{0}\) . This suggests that after training for a long time, the mean of the trained linearized model resembles the ridgeless kernel least-squares estimator with kernel \( \hat K_n\) . Since \( \hat K_n/n\) converges to \( K^{\rm {NTK}}\) by Theorem 28, and a scaling of the kernel by \( 1/n\) does not affect the corresponding kernel least-squares estimator, we expect that for large widths \( n\) and large \( k\)

\[ \begin{equation} \mathbb{E}\Big[\Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)\Big] \simeq \mathbb{E}\Big[\Phi^{\rm lin}({\boldsymbol{x}},{\boldsymbol{p}}_k)\Big] \simeq {\substack{\text{ridgeless kernel least-squares estimator} \\ \text{with kernel \( K^{\rm NTK}\) evaluated at \( {\boldsymbol{x}}\) }}}. \end{equation} \]

(220)

In words, after sufficient training, the mean (over random initializations) of the trained neural network \( {\boldsymbol{x}}\mapsto\Phi({\boldsymbol{x}},{\boldsymbol{w}}_k)\) resembles the kernel least-squares estimator with kernel \( K^{\rm {NTK}}\) . Thus, under these assumptions, we obtain an explicit characterization of what the network prediction looks like after training with gradient descent. For more details and a characterization of the corresponding covariance function, we refer again to [1, Section 2.3.1].
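The right-hand side of (220) is straightforward to form once a kernel is fixed. A minimal sketch of the ridgeless kernel least-squares estimator, here with the ReLU kernel from Example 9 purely as an illustrative choice (all names are ours):

import numpy as np

def k_ntk_relu(x, z, d):
    # closed-form kernel from Example 9 (ReLU activation, Gaussian weights)
    nx, nz = np.linalg.norm(x), np.linalg.norm(z)
    t = np.arccos(np.clip(x @ z / (nx * nz), -1.0, 1.0))
    return nx * nz / (2.0 * np.pi * d) * (np.sin(t) + (np.pi - t) * np.cos(t))

def ridgeless_predict(X, y, x, d):
    # kernel least-squares prediction k(x, X) G^{-1} y (pseudo-inverse for robustness)
    G = np.array([[k_ntk_relu(xi, xj, d) for xj in X] for xi in X])
    k = np.array([k_ntk_relu(x, xi, d) for xi in X])
    return k @ np.linalg.pinv(G) @ y

rng = np.random.default_rng(0)
d, m = 2, 3
X, y = rng.normal(size=(m, d)), rng.normal(size=m)
print([ridgeless_predict(X, y, X[i], d) for i in range(m)])   # reproduces y on the training data
print(y)

As expected from a ridgeless (interpolating) estimator, the predictions at the training points reproduce the labels whenever the kernel matrix is invertible.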

Let us now consider a numerical experiment to visualize this observation. In Figure 43 we plot 80 different realizations of a neural network before and after training, i.e. the functions

\[ \begin{equation} {\boldsymbol{x}}\mapsto\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)~~\text{and}~~ {\boldsymbol{x}}\mapsto\Phi({\boldsymbol{x}},{\boldsymbol{w}}_k). \end{equation} \]

(221)

The architecture was chosen as in (209) with activation function \( \sigma(x)=\arctan(x)\) , width \( n=250\) , and initialization

\[ \begin{equation} U_{0;ij}\overset{\rm iid}{\sim} {\rm N}\Big(0,\frac{3}{d}\Big),~~ v_{0;i}\overset{\rm iid}{\sim} {\rm N}\Big(0,\frac{3}{n}\Big),~~ b_{0;i},~c_0\overset{\rm iid}{\sim}{\rm N}(0,2). \end{equation} \]

(222)

The network was trained on the ridgeless square loss

\[ f ({\boldsymbol{w}})=\sum_{j=1}^m(\Phi({\boldsymbol{x}}_j,{\boldsymbol{w}})-y_j)^2, \]

and a dataset of size \( m=3\) with \( k=5000\) steps of gradient descent and constant step size \( h=1/n\) . Before training, the network’s outputs resemble random draws from a Gaussian process with a constant zero mean function. Post-training, the outputs show minimal variance at the training points, since they essentially interpolate the data, as can be expected from Theorem 29, specifically (217.b). Outside of the training points, we observe increased variance stemming from the second term in (188). The mean should be close to the ridgeless kernel least squares estimator with kernel \( K^{\rm {NTK}}\) by (220).

Figure 44. Before training.
Figure 45. After training without regularization.
Figure 43. 80 realizations of a neural network at initialization (a) and after training without regularization on the red data points (b). The dashed line shows the mean. Figure based on [4, Fig. 2], [1, Fig. 2].

Figure 46 shows realizations of the network trained with ridge regularization, i.e. using the loss function (167.c). Initialization and all hyperparameters match those in Figure 43, with the regularization parameter \( \lambda\) set to \( 0.01\) . For a linear model, the prediction after training with ridge regularization is expected to exhibit reduced randomness, as the trained model is \( O(\lambda)\) close to the ridgeless kernel least-squares estimator (see Section 12.2.3). We emphasize that Theorem 27, showing closeness of the trained linearized and full model, does not encompass ridge regularization; however, in this example we observe a similar effect.

Figure 46. 80 realizations of the neural network in Figure 43 after training on the red data points with added ridge regularization. The dashed line shows the mean.

12.6.5 Role of initialization

Consider the gradient \( \nabla_{\boldsymbol{w}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0)\) as in (210) with LeCun initialization (211), so that \( v_{0;i}\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1/n)\) and \( U_{0;ij}\overset{{\rm {iid}}}{\sim}\mathcal{W}(0,1/d)\) . For the gradient norms in terms of the width \( n\) we obtain

\[ \begin{aligned} \mathbb{E}[\| \nabla_{{\boldsymbol{U}}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0) \|_{}^2]&= \mathbb{E}[\| ({\boldsymbol{v}}_0\odot \sigma'({\boldsymbol{U}}_0{\boldsymbol{x}})){\boldsymbol{x}}^\top \|_{}^2]&&=O(1)\\ \mathbb{E}[\| \nabla_{{\boldsymbol{b}}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0) \|_{}^2] &= \mathbb{E}[\| {\boldsymbol{v}}_0\odot \sigma'({\boldsymbol{U}}_0{\boldsymbol{x}}) \|_{}^2]&&= O(1)\\ \mathbb{E}[\| \nabla_{{\boldsymbol{v}}}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0) \|_{}^2] &= \mathbb{E}[\| \sigma({\boldsymbol{U}}_0{\boldsymbol{x}}) \|_{}^2]&&=O(n)\\ \mathbb{E}[\| \nabla_{c}\Phi({\boldsymbol{x}},{\boldsymbol{w}}_0) \|_{}^2] &= \mathbb{E}[|1|] &&=O(1). \end{aligned} \]

Due to this different scaling, gradient descent with step size \( O(n^{-1})\) as in Theorem 29 will primarily adjust the weights \( {\boldsymbol{v}}\) in the output layer, while only slightly modifying the remaining parameters \( {\boldsymbol{U}}\) , \( {\boldsymbol{b}}\) , and \( c\) . This is also reflected in the expression for the obtained kernel \( K^{\rm {NTK}}\) computed in Theorem 28, which corresponds only to the contribution of the term \( \left\langle \nabla_{\boldsymbol{v}}\Phi, \nabla_{\boldsymbol{v}}\Phi\right\rangle_{}\) .
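These scalings are easy to confirm numerically; a short sketch of our own, with a tanh activation and standard normal \( \mathcal{W}\) as assumptions:

import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)
for n in [100, 1000, 10000]:
    U0 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
    v0 = rng.normal(0.0, 1.0 / np.sqrt(n), size=n)
    z = U0 @ x
    g_b = v0 * (1.0 - np.tanh(z) ** 2)   # grad w.r.t. b, cf. (210)
    g_U = np.outer(g_b, x)               # grad w.r.t. U
    g_v = np.tanh(z)                     # grad w.r.t. v
    # squared norms: O(1), O(1) and O(n), respectively
    print(n, np.sum(g_U ** 2), np.sum(g_b ** 2), np.sum(g_v ** 2))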

LeCun initialization [3] sets the variance of the weight initialization inversely proportional to the input dimension of each layer, so that the variance of all node outputs remains stable and does not blow up as the width increases; also see [8]. However, it does not normalize the backward dynamics, i.e., it does not ensure that the gradients with respect to the parameters have similar variance. To balance the normalization of both the forward and backward dynamics, Glorot and Bengio proposed a normalized initialization, where the variance is chosen inversely proportional to the sum of the input and output dimensions of each layer [9]. We emphasize that the choice of initialization strongly affects the neural tangent kernel (NTK) and, consequently, the predictions of the trained network. For an initialization that explicitly normalizes the backward dynamics, we refer to the original NTK paper [4].

Bibliography and further reading

The discussion on linear and kernel least-squares in Sections 12.1 and 12.2 is mostly standard, and can similarly be found in many textbooks, e.g., [199, 13, 15]; moreover, for more details on least-squares problems and algorithms see [200, 198, 175, 169], for iterative algorithms to compute the pseudoinverse, e.g., [204, 216], and for regularization of ill-posed problems, e.g., [203]. The representer theorem was originally introduced in [207], with a more general version presented in [208]. For an easily accessible formulation, see, e.g., [13, Theorem 16.1]. The kernel trick is commonly attributed to [209], see also [217, 218]. For more details on kernel methods we refer to [205, 206, 219]. For recent works regarding in particular generalization properties of kernel ridgeless regression see for instance [220, 221, 222].

The neural tangent kernel and its connection to the training dynamics were first investigated in [194]. Since then, many works have extended this idea and presented differing perspectives on the topic, see for instance [223, 224, 225, 196]. Our presentation in Sections 12.3–12.6 is based on and closely follows [195], especially for the main results in these sections, where we adhere to the arguments in this paper. Moreover, a similar treatment of these results for gradient flow (rather than gradient descent) was given in [2] based on [196]; in particular, as in [2], we only consider shallow networks and first provide an abstract analysis valid for arbitrary function parametrizations before specifying to neural networks. The paper [195] and some of the other references cited above also address the case of deep architectures, which are omitted here. The explicit formula for the NTK of ReLU networks as presented in Example 9 was given in [212].

The observation that neural networks at initialization behave like Gaussian processes, discussed in Section 12.6.4, was first made in [213]. For a general reference on Gaussian processes see the textbook [226]. When only training the last layer of a network (in which the network is affine linear), there are strong links to random feature methods [227]. Recent developments on this topic can also be found in the literature under the name “Neural network Gaussian processes”, or NNGPs for short [228, 229].

Exercises

Exercise 42

Prove Proposition 15.

Hint: Assume first that \( {\boldsymbol{w}}_0\in \ker({\boldsymbol{A}})^\perp\) (i.e. \( {\boldsymbol{w}}_0\in\tilde H\) ). For \( \rm{ rank}({\boldsymbol{A}})<d\) , using \( {\boldsymbol{w}}_{k}={\boldsymbol{w}}_{k-1}-h\nabla f ({\boldsymbol{w}}_{k-1})\) and the singular value decomposition of \( {\boldsymbol{A}}\) , write down an explicit formula for \( {\boldsymbol{w}}_k\) . Observe that due to \( 1/(1-x)=\sum_{k\in\mathbb{N}_0}x^k\) for all \( x\in (0,1)\) it holds \( {\boldsymbol{w}}_k\to{\boldsymbol{A}}^\dagger{\boldsymbol{y}}\) as \( k\to\infty\) , where \( {\boldsymbol{A}}^\dagger\) is the Moore-Penrose pseudoinverse of \( {\boldsymbol{A}}\) .

Exercise 43

Let \( {\boldsymbol{A}}\in\mathbb{R}^{d\times d}\) be symmetric positive semidefinite, \( {\boldsymbol{b}}\in\mathbb{R}^d\) , and \( c\in\mathbb{R}\) . Let for \( \lambda>0\)

\[ f ({\boldsymbol{w}})\mathrm{:}= {\boldsymbol{w}}^\top{\boldsymbol{A}}{\boldsymbol{w}}+{\boldsymbol{b}}^\top{\boldsymbol{w}}+c~~ \text{and} ~~ f _\lambda({\boldsymbol{w}})\mathrm{:}= f ({\boldsymbol{w}})+\lambda \| {\boldsymbol{w}} \|_{}^2. \]

Show that \( f _\lambda\) is \( 2\lambda\) -strongly convex.

Hint: Use Exercise 39.

Exercise 44

Let \( (H,\left\langle \cdot, \cdot\right\rangle_{H})\) be a Hilbert space, and \( \phi:\mathbb{R}^d\to H\) a mapping. Given \( ({\boldsymbol{x}}_j,y_j)_{j=1}^m\in (\mathbb{R}^d\times\mathbb{R})^m\) , for \( \lambda>0\) denote

\[ f _\lambda({\boldsymbol{w}})\mathrm{:}= \sum_{j=1}^{m}\big(\left\langle \phi({\boldsymbol{x}}_j), {\boldsymbol{w}}\right\rangle_{H}-y_j\big)^2+\lambda\| {\boldsymbol{w}} \|_{H}^2~~\text{for all }{\boldsymbol{w}}\in H. \]

Prove that \( f _\lambda\) has a unique minimizer \( {\boldsymbol{w}}_{*,\lambda}\in H\) , that \( {\boldsymbol{w}}_{*,\lambda}\in\tilde H\mathrm{:}= \rm{ span}\{\phi({\boldsymbol{x}}_1),\dots,\phi({\boldsymbol{x}}_m)\}\) , and that \( \lim_{\lambda\to 0}{\boldsymbol{w}}_{*,\lambda}={\boldsymbol{w}}_*\) , where \( {\boldsymbol{w}}_*\) is as in (165).

Hint: Assuming existence of \( {\boldsymbol{w}}_{*,\lambda}\) , first show that \( {\boldsymbol{w}}_{*,\lambda}\) belongs to the finite dimensional space \( \tilde H\) . Now express \( {\boldsymbol{w}}_{*,\lambda}\) in terms of an orthonormal basis of \( \tilde H\) , and prove that \( {\boldsymbol{w}}_{*,\lambda}\to{\boldsymbol{w}}_*\) .

Exercise 45

Let \( {\boldsymbol{x}}_i\in\mathbb{R}^d\) , \( i=1,\dots,m\) . Show that there exists a “feature map” \( \phi:\mathbb{R}^d\to\mathbb{R}^m\) , such that for any configuration of labels \( y_i\in\{-1,1\}\) , there always exists a hyperplane in \( \mathbb{R}^m\) separating the two sets \( \{\phi({\boldsymbol{x}}_i)\,|\,y_i=1\}\) and \( \{\phi({\boldsymbol{x}}_i)\,|\,y_i=-1\}\) .

Exercise 46

Let \( n\in\mathbb{N}\) and consider the polynomial kernel \( K:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}\) , \( K({\boldsymbol{x}},{\boldsymbol{z}})=(1+{\boldsymbol{x}}^\top{\boldsymbol{z}})^n\) . Find a Hilbert space \( H\) and a feature map \( \phi:\mathbb{R}^d\to H\) , such that \( K({\boldsymbol{x}},{\boldsymbol{z}})=\left\langle \phi({\boldsymbol{x}}), \phi({\boldsymbol{z}})\right\rangle_{H}\) .

Hint: Use the multinomial formula.

Exercise 47

Consider the radial basis function (RBF) kernel \( K:\mathbb{R}\times\mathbb{R}\to\mathbb{R}\) , \( K(x,z)\mathrm{:}= \exp(-(x-z)^2)\) . Find a Hilbert space \( H\) and a feature map \( \phi:\mathbb{R}\to H\) such that \( K(x,z)=\left\langle \phi(x), \phi(z)\right\rangle_{H}\) .

Exercise 48

Consider the network (209) with LeCun initialization as in (211), but with the biases instead initialized as

\[ \begin{equation} c,~b_i\overset{\rm iid}{\sim}\mathcal{W}(0,1)~~\text{for all }i=1,\dots,n. \end{equation} \]

(210)

Compute the corresponding NTK as in Theorem 28.

13 Loss landscape analysis

In Chapter 11, we saw how the weights of neural networks get adapted during training, using, e.g., variants of gradient descent. For certain cases, including the wide networks considered in Chapter 12, the corresponding iterative scheme converges to a global minimizer. In general, this is not guaranteed, and gradient descent can for instance get stuck in non-global minima or saddle points.

To get a better understanding of these situations, in this chapter we discuss the so-called loss landscape. This term refers to the graph of the empirical risk as a function of the weights. We give a more rigorous definition below, and first introduce notation for neural networks and their realizations for a fixed architecture.

Definition 26

Let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be an activation function, and let \( B>0\) . We denote the set of neural networks \( \Phi\) with \( L\) layers, layer widths \( d_0, d_1, \dots, d_{L+1}\) , all weights bounded in modulus by \( B\) , and using the activation function \( \sigma\) by \( \mathcal{N}(\sigma; \mathcal{A}, B)\) . Additionally, we define

\[ \begin{align*} \mathcal{PN}(\mathcal{A}, B) \mathrm{:=} \times_{\ell=0}^{L} \left([-B,B]^{d_{\ell+1} \times d_{\ell}} \times [-B,B]^{d_{\ell+1}}\right), \end{align*} \]

and the realization map

\[ \begin{align} \begin{split} R_\sigma \colon \mathcal{PN}(\mathcal{A}, B) &\to \mathcal{N}(\sigma; \mathcal{A}, B) \\ ({\boldsymbol{W}}^{(\ell)}, {\boldsymbol{b}}^{(\ell)})_{\ell = 0}^L &\mapsto \Phi, \end{split} \end{align} \]

(211)

where \( \Phi\) is the neural network with weights and biases given by \( ({\boldsymbol{W}}^{(\ell)}, {\boldsymbol{b}}^{(\ell)})_{\ell = 0}^L\) .

Throughout, we will identify \( \mathcal{PN}(\mathcal{A},B)\) with the cube \( [-B,B]^{n_\mathcal{A}}\) , where \( n_{\mathcal{A}} \mathrm{:=} \sum_{\ell = 0}^{L}d_{\ell+1} (d_{\ell} + 1)\) . Now we can introduce the loss landscape of a neural network architecture.

Definition 27

Let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) and let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be an activation function. Let \( m \in \mathbb{N}\) , let \( S = ({\boldsymbol{x}}_i, {\boldsymbol{y}}_i)_{i=1}^m \in (\mathbb{R}^{d_0} \times \mathbb{R}^{d_{L+1}})^m\) be a sample, and let \( \mathcal{L}\) be a loss function. Then the loss landscape is the graph of the function \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) defined as

\[ \begin{align*} \begin{split} \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\, \colon \, \mathcal{PN}(\mathcal{A}; \infty) &\to \mathbb{R} \\ \theta &\mapsto \widehat{\mathcal{R}}_S(R_\sigma(\theta)). \end{split} \end{align*} \]

with \( \widehat{\mathcal{R}}_S\) in (3) and \( R_\sigma\) in (211).

Identifying \( \mathcal{PN}(\mathcal{A}, \infty)\) with \( \mathbb{R}^{n_{\mathcal{A}}}\) , we can consider \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) as a map on \( \mathbb{R}^{n_{\mathcal{A}}}\) and the loss landscape is a subset of \( \mathbb{R}^{n_{\mathcal{A}}} \times \mathbb{R}\) . The loss landscape is a high-dimensional surface, with hills and valleys. For visualization a two-dimensional section of a loss landscape is shown in Figure 47.

Figure 47. Two-dimensional section of a loss landscape. The loss landscape shows a spurious valley with local minima, global minima, as well as a region where saddle points appear. Moreover, a sharp minimum is shown.

Questions of interest regarding the loss landscape include for example: How likely is it that we find local instead of global minima? Are these local minima typically sharp, having small volume, or are they part of large flat valleys that are difficult to escape? How bad is it to end up in a local minimum? Are most local minima as deep as the global minimum, or can they be significantly higher? How rough is the surface generally, and how do these characteristics depend on the network architecture? While providing complete answers to these questions is hard in general, in the rest of this chapter we give some intuition and mathematical insights for specific cases.

13.1 Visualization of loss landscapes

Visualizing loss landscapes can provide valuable insights into the effects of neural network depth, width, and activation functions. However, we can only visualize an at most two-dimensional surface embedded into three-dimensional space, whereas the loss landscape is a very high-dimensional object (unless the neural networks have only very few weights and biases).

To make the loss landscape accessible, we need to reduce its dimensionality. This can be achieved by evaluating the function \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) on a two-dimensional subspace of \( \mathcal{PN}(\mathcal{A}, \infty)\) . Specifically, we choose three parameters \( \mu\) , \( \theta_1\) , \( \theta_2\) and examine the function

\[ \begin{align} \mathbb{R}^2 \ni (\alpha_1,\alpha_2) \mapsto \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\mu + \alpha_1 \theta_1 + \alpha_2 \theta_2). \end{align} \]

(212)

There are various natural choices for \( \mu\) , \( \theta_1\) , \( \theta_2\) :

  • Random directions: This was used, for example, in [230, 231]. Here \( \theta_1, \theta_2\) are chosen randomly, while \( \mu\) is either a minimum of \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) or also chosen randomly (a minimal sketch of this slice evaluation is given after this list). This simple approach can offer a quick insight into how rough the surface can be. However, as was pointed out in [232], random directions will very likely be orthogonal to the trajectory of the optimization procedure. Hence, they will likely miss the most relevant features.
  • Principal components of learning trajectory: To address the shortcomings of random directions, another possibility is to determine \( \mu\) , \( \theta_1\) , \( \theta_2\) , which best capture some given learning trajectory. For example, if \( \theta^{(1)}, \theta^{(2)}, \dots, \theta^{(N)}\) are the parameters resulting from the training by SGD, we may determine \( \mu\) , \( \theta_1\) , \( \theta_2\) such that the hyperplane \( \{\mu + \alpha_1 \theta_1 + \alpha_2 \theta_2\,|\,\alpha_1,\alpha_2 \in \mathbb{R}\}\) minimizes the mean squared distance to the \( \theta^{(j)}\) for \( j \in \{1, \dots, N\}\) . This is the approach of [232], and can be achieved by a principal component analysis.
  • Based on critical points: For a more global perspective, \( \mu\) , \( \theta_1\) , \( \theta_2\) can be chosen to ensure the observation of multiple critical points. One way to achieve this is by running the optimization procedure three times with final parameters \( \theta^{(1)}\) , \( \theta^{(2)}\) , \( \theta^{(3)}\) . If the procedures have converged, then each of these parameters is close to a critical point of \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) . We can now set \( \mu = \theta^{(1)}\) , \( \theta_1 = \theta^{(2)}- \mu\) , \( \theta_2 = \theta^{(3)}- \mu\) . This then guarantees that (212) passes through or at least comes very close to three critical points (at \( (\alpha_1,\alpha_2) = (0,0), (0,1), (1,0)\) ). We present six visualizations of this form in Figure 48.
The six panels of Figure 48 are shown individually as Figures 49–54.
Figure 48. A collection of loss landscapes. The left column shows loss landscapes of neural networks with the ReLU activation function; the right column shows loss landscapes of neural networks with the hyperbolic tangent activation function. All neural networks have five-dimensional input and one-dimensional output. Moreover, from top to bottom the hidden layers have widths 1000, 20, 10, and the numbers of hidden layers are 1, 4, 7.

Figure 48 gives some interesting insight into the effect of depth and width on the shape of the loss landscape. For very wide and shallow neural networks, we observe the widest minima, which, in the case of the tanh activation function, also seem to belong to the same valley. With increasing depth and decreasing width, the minima become steeper and more disconnected.
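
As a concrete illustration of (212), the sketch below evaluates a loss function on the plane spanned by three parameter vectors, following the critical-point-based choice above. All names, dimensions, and the synthetic quadratic loss in the usage example are illustrative assumptions; this is not the code used to produce Figure 48.

```python
import numpy as np

def landscape_slice(loss, theta1, theta2, theta3, n_grid=50, lo=-0.5, hi=1.5):
    """Evaluate `loss` on the plane through three parameter vectors.

    Following the critical-point-based choice, we set mu = theta1,
    dir1 = theta2 - mu, dir2 = theta3 - mu, so the grid points
    (alpha1, alpha2) = (0, 0), (1, 0), (0, 1) hit the three parameters.
    """
    mu = theta1
    dir1, dir2 = theta2 - mu, theta3 - mu
    alphas = np.linspace(lo, hi, n_grid)
    values = np.empty((n_grid, n_grid))
    for i, a1 in enumerate(alphas):
        for j, a2 in enumerate(alphas):
            values[i, j] = loss(mu + a1 * dir1 + a2 * dir2)
    return alphas, values

# Toy usage with a synthetic quadratic "loss" on R^10:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((10, 10))
    quad = lambda th: float(th @ (A.T @ A) @ th) / 2
    th1, th2, th3 = (rng.standard_normal(10) for _ in range(3))
    alphas, vals = landscape_slice(quad, th1, th2, th3)
    print(vals.shape, vals.min(), vals.max())
```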

13.2 Spurious valleys

From the perspective of optimization, the ideal loss landscape has one global minimum in the center of a large valley, so that gradient descent converges towards the minimum irrespective of the chosen initialization.

This situation is not realistic for deep neural networks. Indeed, for a simple shallow neural network

\[ \mathbb{R}^d \ni {\boldsymbol{x}} \mapsto \Phi({\boldsymbol{x}}) = {\boldsymbol{W}}^{(1)} \sigma( {\boldsymbol{W}}^{(0)} {\boldsymbol{x}} + {\boldsymbol{b}}^{(0)}) + {\boldsymbol{b}}^{(1)}, \]

it is clear that for every permutation matrix \( {\boldsymbol{P}}\)

\[ \Phi({\boldsymbol{x}}) = {\boldsymbol{W}}^{(1)} {\boldsymbol{P}}^T \sigma( {\boldsymbol{P}} {\boldsymbol{W}}^{(0)} {\boldsymbol{x}} + {\boldsymbol{P}} {\boldsymbol{b}}^{(0)}) + {\boldsymbol{b}}^{(1)}~~ \text{for all } {\boldsymbol{x}} \in \mathbb{R}^d. \]

Hence, in general there exist multiple parameterizations realizing the same output function. Moreover, if at least one global minimum with non-permutation-invariant weights exists, then there is more than one global minimum of the loss landscape.
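
The permutation symmetry above is easy to verify numerically. The following sketch (with assumed dimensions and a tanh activation) checks that permuting the hidden neurons of a shallow network, together with the corresponding rows of \( {\boldsymbol{W}}^{(0)}, {\boldsymbol{b}}^{(0)}\) and columns of \( {\boldsymbol{W}}^{(1)}\) , leaves the realization unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
d, width = 3, 5
W0, b0 = rng.standard_normal((width, d)), rng.standard_normal(width)
W1, b1 = rng.standard_normal((1, width)), rng.standard_normal(1)
sigma = np.tanh  # any activation applied componentwise

def shallow(x, W0, b0, W1, b1):
    return W1 @ sigma(W0 @ x + b0) + b1

P = np.eye(width)[rng.permutation(width)]            # random permutation matrix
x = rng.standard_normal(d)
out   = shallow(x, W0, b0, W1, b1)
out_p = shallow(x, P @ W0, P @ b0, W1 @ P.T, b1)     # permuted parameterization
print(np.allclose(out, out_p))                       # True
```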

This is not problematic; in fact, having many global minima is beneficial. The larger issue is the existence of non-global minima. Following [233], we start by generalizing the notion of non-global minima to spurious valleys.

Definition 28

Let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) and \( \sigma\colon \mathbb{R} \to \mathbb{R}\) . Let \( m \in \mathbb{N}\) , and \( S = ({\boldsymbol{x}}_i, {\boldsymbol{y}}_i)_{i=1}^m \in (\mathbb{R}^{d_0} \times \mathbb{R}^{d_{L+1}})^m\) be a sample and let \( \mathcal{L}\) be a loss function. For \( c \in \mathbb{R}\) , we define the sub-level set of \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) as

\[ \Omega_\Lambda(c) \mathrm{:=} \{\theta \in \mathcal{PN}(\mathcal{A}, \infty)\,|\,\Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\theta) \leq c\}. \]

A path-connected component of \( \Omega_\Lambda(c)\) , which does not contain a global minimum of \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) is called a spurious valley.

The next proposition shows that spurious valleys do not exist for shallow overparameterized neural networks, i.e., for neural networks that have at least as many neurons in the hidden layer as there are training samples.

Proposition 19

Let \( \mathcal{A} = (d_0,d_1, 1) \in \mathbb{N}^{3}\) and let \( S = ({\boldsymbol{x}}_i, y_i)_{i=1}^m \in (\mathbb{R}^{d_0} \times \mathbb{R})^m\) be a sample such that \( m \leq d_1\) . Furthermore, let \( \sigma \in \mathcal{M}\) be not a polynomial, and let \( \mathcal{L}\) be a convex loss function. Further assume that \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) has at least one global minimum. Then \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) has no spurious valleys.

Proof

Let \( \theta_a, \theta_b \in \mathcal{PN}(\mathcal{A}, \infty)\) with \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\theta_a) > \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\theta_b)\) . Then we will show below that there is another parameter \( \theta_c\) such that

  • \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\theta_b) \geq \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\theta_c)\)
  • there is a continuous path \( \alpha:[0,1] \to \mathcal{PN}(\mathcal{A}, \infty)\) such that \( \alpha(0) = \theta_a\) , \( \alpha(1) = \theta_c\) , and \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\alpha)\) is monotonically decreasing.

By Exercise 50, the construction above rules out the existence of spurious valleys by choosing \( \theta_a\) an element of a spurious valley and \( \theta_b\) a global minimum.

Next, we present the construction: Let us denote

\[ \theta_o = \left(\left({\boldsymbol{W}}^{(\ell)}_o, {\boldsymbol{b}}^{(\ell)}_o\right)_{\ell = 0}^1\right) ~~ \text{ for } o \in \{a,b,c\}. \]

Moreover, for \( j = 1, \dots, d_1\) , we introduce \( {\boldsymbol{v}}_o^j \in \mathbb{R}^{m}\) defined as

\[ \begin{align*} ({\boldsymbol{v}}_o^j)_i = \left(\sigma\left({\boldsymbol{W}}^{(0)}_o {\boldsymbol{x}}_i + {\boldsymbol{b}}^{(0)}_o\right)\right)_j ~~ \text{ for } i = 1, \dots, m. \end{align*} \]

Notice that, if we set \( {\boldsymbol{V}}_o = (({\boldsymbol{v}}_o^j)^\top)_{j=1}^{d_1}\) , then

\[ \begin{align} {\boldsymbol{W}}^{(1)}_o {\boldsymbol{V}}_o = \left(R_\sigma(\theta_o)({\boldsymbol{x}}_i) - {\boldsymbol{b}}^{(1)}_o\right)_{i=1}^m, \end{align} \]

(213)

where the right-hand side is considered a row-vector.

We will now distinguish between two cases. For the first, the result is immediate; the second can be transformed into the first.

Case 1: Assume that \( {\boldsymbol{V}}_a\) has rank \( m\) . In this case, it is clear from (213) that there exists \( \widetilde{{\boldsymbol{W}}}\) such that

\[ \widetilde{{\boldsymbol{W}}} {\boldsymbol{V}}_a = \left(R_\sigma(\theta_b)({\boldsymbol{x}}_i)- {\boldsymbol{b}}^{(1)}_a\right)_{i=1}^m. \]

We can thus set \( \alpha(t) = (({\boldsymbol{W}}^{(0)}_a, {\boldsymbol{b}}^{(0)}_a),((1-t){\boldsymbol{W}}^{(1)}_a + t \widetilde{{\boldsymbol{W}}}, {\boldsymbol{b}}^{(1)}_a))\) .

Note that by construction \( \alpha(0) = \theta_a\) and \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\alpha(1)) = \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\theta_b)\) . Moreover, \( t\mapsto (R_\sigma(\alpha(t))({\boldsymbol{x}}_i))_{i=1}^m\) describes a straight path in \( \mathbb{R}^m\) and hence, by the convexity of \( \mathcal{L}\) it is clear that \( t\mapsto \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\alpha(t))\) has a minimum \( t^*\) on \( [0,1]\) with \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\alpha(t^*)) \leq \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\theta_b)\) . Moreover, \( t\mapsto \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\alpha(t))\) is monotonically decreasing on \( [0, t^*]\) . Setting \( \theta_c = \alpha(t^*)\) completes this case.

Case 2: Assume that \( {\boldsymbol{V}}_a\) has rank less than \( m\) . In this case, we construct a continuous path from \( \theta_a\) to another neural network parameter whose associated matrix has higher rank. The path will be such that \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) is monotonically decreasing along it.

Under the assumptions, we have that one \( {\boldsymbol{v}}_a^j\) can be written as a linear combination of the remaining \( {\boldsymbol{v}}_a^i\) , \( i \neq j\) . Without loss of generality, we assume \( j = 1\) . Then, there exist \( (\alpha_i)_{i=2}^{d_1}\) such that

\[ \begin{align} {\boldsymbol{v}}_a^1 = \sum_{i=2}^{d_1} \alpha_i {\boldsymbol{v}}_a^i. \end{align} \]

(214)

Next, we observe that there exists \( {\boldsymbol{v}}^* \in \mathbb{R}^m\) which is linearly independent from all \( ({\boldsymbol{v}}_a^j)_{j=1}^{d_1}\) and can be written as \( ({\boldsymbol{v}}^*)_i = \sigma(({\boldsymbol{w}}^*)^\top {\boldsymbol{x}}_i + b^*)\) for some \( {\boldsymbol{w}}^* \in \mathbb{R}^{d_0}, b^*\in \mathbb{R}\) . Indeed, if we assume that such \( {\boldsymbol{v}}^*\) does not exist, then for all \( {\boldsymbol{w}} \in \mathbb{R}^{d_0}, b\in \mathbb{R}\) the vector \( (\sigma({\boldsymbol{w}}^\top {\boldsymbol{x}}_i + b))_{i=1}^m\) belongs to \( \mathrm{span}\{{\boldsymbol{v}}_a^j\,|\,j = 1,\dots,d_1\}\) , which has dimension at most \( m-1\) . It would follow that \( \mathrm{span} \{(\sigma({\boldsymbol{w}}^\top {\boldsymbol{x}}_i + b))_{i=1}^m\,|\,{\boldsymbol{w}} \in \mathbb{R}^{d_0}, b\in \mathbb{R}\}\) is contained in an \( (m-1)\) -dimensional subspace of \( \mathbb{R}^m\) , which yields a contradiction to Theorem 16.

Now, we define two paths: First,

\[ \alpha_1(t) = (({\boldsymbol{W}}^{(0)}_a, {\boldsymbol{b}}^{(0)}_a),({\boldsymbol{W}}^{(1)}_a(t), {\boldsymbol{b}}^{(1)}_a)), ~~ \text{ for } t \in [0,1/2] \]

where

\[ ({\boldsymbol{W}}^{(1)}_a(t))_1 = (1-2t)({\boldsymbol{W}}^{(1)}_a)_1~~\text{and}~~({\boldsymbol{W}}^{(1)}_a(t))_i = ({\boldsymbol{W}}^{(1)}_a)_i + 2 t \alpha_i ({\boldsymbol{W}}^{(1)}_a)_1 \]

for \( i = 2,\dots, d_1\) , for \( t \in [0,1/2]\) . Second,

\[ \alpha_2(t) = (({\boldsymbol{W}}^{(0)}_a(t), {\boldsymbol{b}}^{(0)}_a(t)),({\boldsymbol{W}}^{(1)}_a(1/2), {\boldsymbol{b}}^{(1)}_a)), \text{ for } t \in (1/2,1], \]

where

\[ ({\boldsymbol{W}}^{(0)}_a(t))_1 = (2-2t) ({\boldsymbol{W}}^{(0)}_a)_1 + (2t-1) {\boldsymbol{w}}^* ~~\text{and}~~ ({\boldsymbol{W}}^{(0)}_a(t))_i = ({\boldsymbol{W}}^{(0)}_a)_i \]

for \( i = 2, \dots, d_1\) , \( ({\boldsymbol{b}}^{(0)}_a(t))_1 = (2-2t)({\boldsymbol{b}}^{(0)}_a)_1 + (2t-1) b^*\) , and \( ({\boldsymbol{b}}^{(0)}_a(t))_i = ({\boldsymbol{b}}^{(0)}_a)_i\) for \( i = 2, \dots, d_1\) . Note that with this choice the concatenation of \( \alpha_1\) and \( \alpha_2\) is continuous at \( t = 1/2\) , and \( ({\boldsymbol{W}}^{(0)}_a(1))_1 = {\boldsymbol{w}}^*\) , \( ({\boldsymbol{b}}^{(0)}_a(1))_1 = b^*\) .

It is clear by (214) that \( t \mapsto (R_\sigma(\alpha_1(t))({\boldsymbol{x}}_i))_{i=1}^m\) is constant. Moreover, since the outgoing weight of the first neuron vanishes along \( \alpha_2\) , the map \( t \mapsto R_\sigma(\alpha_2(t))({\boldsymbol{x}})\) is constant for all \( {\boldsymbol{x}} \in \mathbb{R}^{d_0}\) . In addition, by construction for

\[ \bar{{\boldsymbol{v}}}^j \mathrm{:=} \left(\left(\sigma\left({\boldsymbol{W}}^{(0)}_a(1) {\boldsymbol{x}}_i + {\boldsymbol{b}}^{(0)}_a(1)\right)\right)_j\right)_{i=1}^m \]

it holds that \( ((\bar{{\boldsymbol{v}}}^j)^\top)_{j=1}^{d_1}\) has rank larger than that of \( {\boldsymbol{V}}_a\) . Concatenating \( \alpha_1\) and \( \alpha_2\) now yields a continuous path from \( \theta_a\) to another neural network parameter with higher associated rank such that \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}\) is monotonically decreasing along the path. Iterating this construction, we can find a path to a neural network parameter where the associated matrix has full rank. This reduces the problem to Case 1.
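
The key step of Case 1 in the proof above can be checked numerically: whenever the hidden-layer feature matrix \( {\boldsymbol{V}}_a\) has rank \( m\) , adjusting the output weights alone suffices to match any targets on the sample. The sketch below (assumed dimensions, tanh activation, random data) solves the corresponding linear system.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d0, d1 = 8, 4, 12            # m <= d1: overparameterized hidden layer
X = rng.standard_normal((m, d0))
y = rng.standard_normal(m)      # arbitrary targets

W0, b0 = rng.standard_normal((d1, d0)), rng.standard_normal(d1)
V = np.tanh(X @ W0.T + b0)      # V[i, j] = sigma(W0 x_i + b0)_j, shape (m, d1)
assert np.linalg.matrix_rank(V) == m

b1 = 0.0
# Solve for output weights w such that the network matches the targets exactly;
# with rank(V) = m an exact solution exists (cf. (213)).
w, *_ = np.linalg.lstsq(V, y - b1, rcond=None)
print(np.max(np.abs(V @ w + b1 - y)))   # ~1e-14: exact interpolation
```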

13.3 Saddle points

Saddle points are critical points of the loss landscape at which the loss decreases in some directions and increases in others. In this sense, saddle points are not as problematic as local minima or spurious valleys if the updates in the learning iteration have some stochasticity. Eventually, a random step in the right direction could be taken and the saddle point can be escaped.

If most of the critical points are saddle points, then, even though the loss landscape is challenging for optimization, one still has a good chance of eventually reaching the global minimum. Saddle points of the loss landscape were studied in [234, 235] and we will review some of the findings in a simplified way below. The main observation in [235] is that, under some quite strong assumptions, it holds that critical points in the loss landscape associated to a large loss are typically saddle points, whereas those associated to small loss correspond to minima. This situation is encouraging for the prospects of optimization in deep learning, since, even if we get stuck in a local minimum, it will very likely be such that the loss is close to optimal.

The results of [235] use random matrix theory, which we do not recall here. Moreover, it is hard to gauge if the assumptions made are satisfied for a specific problem. Nonetheless, we recall the main idea, which provides some intuition to support the above claim.

Let \( \mathcal{A} = (d_0, d_1, 1) \in \mathbb{N}^3\) . Then, for a neural network parameter \( \theta \in \mathcal{PN}(\mathcal{A}, \infty)\) and activation function \( \sigma\) , we set \( \Phi_\theta \mathrm{:=} R_\sigma(\theta)\) and define for a sample \( S = ({\boldsymbol{x}}_i,y_i)_{i=1}^m\) the errors

\[ \begin{align*} e_i = \Phi_\theta({\boldsymbol{x}}_i) - y_i~~ \text{for } i = 1, \dots, m. \end{align*} \]

If we use the square loss, then

\[ \begin{align} \widehat{\mathcal{R}}_S(\Phi_\theta) = \frac{1}{m} \sum_{i=1}^m e_i^2. \end{align} \]

(215)

Next, we study the Hessian of \( \widehat{\mathcal{R}}_S(\Phi_\theta)\) .

Proposition 20

Let \( \mathcal{A} = (d_0, d_1, 1)\) and \( \sigma : \mathbb{R} \to \mathbb{R}\) . Then, for every \( \theta \in \mathcal{PN}(\mathcal{A}, \infty)\) where \( \widehat{\mathcal{R}}_S(\Phi_\theta)\) in (215) is twice continuously differentiable with respect to the weights, it holds that

\[ {\boldsymbol{H}}(\theta) = {\boldsymbol{H}}_0(\theta) + {\boldsymbol{H}}_1(\theta), \]

where \( {\boldsymbol{H}}(\theta)\) is the Hessian of \( \widehat{\mathcal{R}}_S(\Phi_\theta)\) at \( \theta\) , \( {\boldsymbol{H}}_0(\theta)\) is a positive semi-definite matrix which is independent from \( (y_i)_{i=1}^m\) , and \( {\boldsymbol{H}}_1(\theta)\) is a symmetric matrix that for fixed \( \theta\) and \( ({\boldsymbol{x}}_i)_{i=1}^m\) depends linearly on \( (e_i)_{i=1}^m\) .

Proof

Using the identification introduced after Definition 27, we can consider \( \theta\) a vector in \( \mathbb{R}^{n_\mathcal{A}}\) . For \( k = 1, \dots, n_\mathcal{A}\) , we have that

\[ \begin{align*} \frac{\partial \widehat{\mathcal{R}}_S(\Phi_\theta)}{\partial \theta_k} = \frac{2}{m}\sum_{i=1}^m e_i \frac{\partial \Phi_\theta({\boldsymbol{x}}_i)}{\partial \theta_k} . \end{align*} \]

Therefore, for \( j = 1, \dots, n_\mathcal{A}\) , we have, by the Leibniz rule, that

\[ \begin{align} \frac{\partial^2 \widehat{\mathcal{R}}_S(\Phi_\theta)}{\partial \theta_j\partial \theta_k} &= \frac{2}{m}\sum_{i=1}^m \left(\frac{\partial \Phi_\theta({\boldsymbol{x}}_i)}{\partial \theta_j} \frac{\partial \Phi_\theta({\boldsymbol{x}}_i)}{\partial \theta_k} \right) + \frac{2}{m} \left(\sum_{i=1}^m e_i \frac{\partial^2 \Phi_\theta({\boldsymbol{x}}_i)}{\partial \theta_j \partial \theta_k} \right) \\ &\mathrm{=:} ({\boldsymbol{H}}_0(\theta))_{jk} + ({\boldsymbol{H}}_1(\theta))_{jk}. \end{align} \]

(216)

It remains to show that \( {\boldsymbol{H}}_0(\theta)\) and \( {\boldsymbol{H}}_1(\theta)\) have the asserted properties. Note that, setting

\[ J_{i, \theta} = \begin{pmatrix} \frac{\partial \Phi_\theta({\boldsymbol{x}}_i)}{\partial \theta_1} \\ \vdots \\ \frac{\partial \Phi_\theta({\boldsymbol{x}}_i)}{\partial \theta_{n_\mathcal{A}}} \end{pmatrix}\in\mathbb{R}^{n_{\mathcal{A}}}, \]

we have that \( {\boldsymbol{H}}_0(\theta) =\frac{2}{m}\sum_{i=1}^m J_{i, \theta} J_{i, \theta}^\top\) and hence \( {\boldsymbol{H}}_0(\theta)\) is a sum of positive semi-definite matrices, which shows that \( {\boldsymbol{H}}_0(\theta)\) is positive semi-definite.

The symmetry of \( {\boldsymbol{H}}_1(\theta)\) follows directly from the symmetry of second derivatives which holds since we assumed twice continuous differentiability at \( \theta\) . The linearity of \( {\boldsymbol{H}}_1(\theta)\) in \( (e_i)_{i=1}^m\) is clear from (216).
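
The decomposition of Proposition 20 can be verified numerically for a tiny smooth network. The sketch below (an assumed toy setup with tanh activation and finite-difference derivatives) builds \( {\boldsymbol{H}}_0(\theta)\) from the gradients \( J_{i,\theta}\) , and checks that the remainder \( {\boldsymbol{H}}_1(\theta)\) vanishes when all errors \( e_i\) are zero and is symmetric otherwise.

```python
import numpy as np

rng = np.random.default_rng(3)
d0, d1, m = 1, 2, 5
X = rng.standard_normal((m, d0))

def unpack(theta):
    W0 = theta[:d1 * d0].reshape(d1, d0)
    b0 = theta[d1 * d0:d1 * d0 + d1]
    W1 = theta[d1 * d0 + d1:d1 * d0 + 2 * d1].reshape(1, d1)
    b1 = theta[-1]
    return W0, b0, W1, b1

def net(theta, x):
    W0, b0, W1, b1 = unpack(theta)
    return float(W1 @ np.tanh(W0 @ x + b0) + b1)

def risk(theta, y):
    return np.mean([(net(theta, x) - yi) ** 2 for x, yi in zip(X, y)])

def grad(f, th, eps=1e-6):               # central finite differences
    g = np.zeros_like(th)
    for k in range(th.size):
        e = np.zeros_like(th); e[k] = eps
        g[k] = (f(th + e) - f(th - e)) / (2 * eps)
    return g

def hessian(f, th, eps=1e-4):            # second-order central differences
    n = th.size
    H = np.zeros((n, n))
    for j in range(n):
        for k in range(n):
            ej = np.zeros(n); ej[j] = eps
            ek = np.zeros(n); ek[k] = eps
            H[j, k] = (f(th + ej + ek) - f(th + ej - ek)
                       - f(th - ej + ek) + f(th - ej - ek)) / (4 * eps ** 2)
    return H

n_par = d1 * d0 + d1 + d1 + 1
theta = rng.standard_normal(n_par)

# Gauss-Newton part H0, built from the gradients J_i of the network outputs.
J = np.array([grad(lambda t: net(t, x), theta) for x in X])
H0 = 2.0 / m * J.T @ J                   # positive semi-definite by construction

y_fit = np.array([net(theta, x) for x in X])        # sample with zero errors e_i
H_fit = hessian(lambda t: risk(t, y_fit), theta)
print(np.max(np.abs(H_fit - H0)))        # small: H1 vanishes when all e_i = 0

y_off = y_fit + rng.standard_normal(m)              # sample with nonzero errors
H1 = hessian(lambda t: risk(t, y_off), theta) - H0
print(np.max(np.abs(H1 - H1.T)))         # small: H1 is symmetric
```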

How does Proposition 20 imply the claimed relationship between the size of the loss and the prevalence of saddle points?

Let \( \theta\) correspond to a critical point. If \( {\boldsymbol{H}}(\theta)\) has at least one negative eigenvalue, then \( \theta\) cannot be a minimum, but instead must be either a saddle point or a maximum. While we do not know anything about \( {\boldsymbol{H}}_1(\theta)\) other than that it is symmetric, it is not unreasonable to assume that it has a negative eigenvalue, especially if \( n_{\mathcal{A}}\) is very large. With this in mind, let us consider the following model:

Fix a parameter \( \theta\) . Let \( S^0 = ({\boldsymbol{x}}_i, y_i^0)_{i=1}^m\) be a sample and \( (e_i^0)_{i=1}^m\) be the associated errors. Further let \( {\boldsymbol{H}}^0(\theta), {\boldsymbol{H}}_0^0(\theta), {\boldsymbol{H}}_1^0(\theta)\) be the matrices according to Proposition 20.

Further, for \( \lambda >0\) , let \( S^\lambda = ({\boldsymbol{x}}_i, y_i^\lambda)_{i=1}^m\) be such that the associated errors are \( (e_i)_{i=1}^m = \lambda (e_i^0)_{i=1}^m\) . The Hessian of \( \widehat{\mathcal{R}}_{S^\lambda}(\Phi_\theta)\) at \( \theta\) is then \( {\boldsymbol{H}}^\lambda(\theta)\) satisfying

\[ {\boldsymbol{H}}^\lambda(\theta) = {\boldsymbol{H}}^0_0(\theta) + \lambda {\boldsymbol{H}}_1^0(\theta). \]

Hence, if \( \lambda\) is large, then \( {\boldsymbol{H}}^\lambda(\theta)\) is a perturbation of an amplified version of \( {\boldsymbol{H}}_1^0(\theta)\) . Clearly, if \( {\boldsymbol{v}}\) is an eigenvector of \( {\boldsymbol{H}}_1^0(\theta)\) with negative eigenvalue \( -\mu\) , then

\[ {\boldsymbol{v}}^\top {\boldsymbol{H}}^\lambda(\theta){\boldsymbol{v}} \leq (\|{\boldsymbol{H}}^0_0(\theta)\| - \lambda \mu) \|{\boldsymbol{v}}\|^2, \]

which we can expect to be negative for large \( \lambda\) . Thus, \( {\boldsymbol{H}}^\lambda(\theta)\) has a negative eigenvalue for large \( \lambda\) .

On the other hand, if \( \lambda\) is small, then \( {\boldsymbol{H}}^\lambda(\theta)\) is merely a perturbation of \( {\boldsymbol{H}}^0_0(\theta)\) and we can expect its spectrum to resemble that of \( {\boldsymbol{H}}^0_0\) more and more.

What we see is that the same parameter is more likely to be a saddle point for a sample that produces a high empirical risk than for a sample with small risk. Note that, since \( {\boldsymbol{H}}^0_0(\theta)\) was only shown to be positive semi-definite, the argument above does not rule out saddle points even for very small \( \lambda\) . But it does show that for small \( \lambda\) , every negative eigenvalue must be very small.
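
The role of the error scale \( \lambda\) can be illustrated with random matrices (a toy sketch not tied to an actual network): we fix a positive semi-definite \( {\boldsymbol{H}}_0^0\) and a symmetric \( {\boldsymbol{H}}_1^0\) , and track the smallest eigenvalue of \( {\boldsymbol{H}}_0^0 + \lambda {\boldsymbol{H}}_1^0\) .

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
A = rng.standard_normal((n, n))
H0 = A @ A.T / n                      # positive semi-definite "Gauss-Newton" part
B = rng.standard_normal((n, n))
H1 = (B + B.T) / 2                    # symmetric, generically has negative eigenvalues

for lam in [0.0, 0.1, 1.0, 10.0]:
    smallest = np.linalg.eigvalsh(H0 + lam * H1)[0]
    print(f"lambda = {lam:5.1f}   smallest eigenvalue = {smallest:8.3f}")
# For small lambda the smallest eigenvalue stays near 0 (H0 is PSD);
# for large lambda it becomes clearly negative, so a critical point with
# such a Hessian would be a saddle point rather than a minimum.
```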

A more refined analysis where we compare different parameters but for the same sample and quantify the likelihood of local minima versus saddle points requires the introduction of a probability distribution on the weights. We refer to [235] for the details.

Bibliography and further reading

The results on visualization of the loss landscape are inspired by [232, 230, 231]. Results on the non-existence of spurious valleys can be found in [233] with similar results in [236]. In [237] the loss landscape was studied by linking it to so-called spin-glass models. There it was found that under strong assumptions critical points associated to lower losses are more likely to be minima than saddle points. In [235], random matrix theory is used to provide similar results, that go beyond those established in Section 13.3. On the topic of saddle points, [234] identifies the existence of saddle points as more problematic than that of local minima, and an alternative saddle-point aware optimization algorithm is introduced.

Two essential topics associated to the loss landscape that have not been discussed in this chapter are mode connectivity and the sharpness of minima. Mode connectivity, roughly speaking, describes the phenomenon that local minima found by SGD over deep neural networks are often connected by simple curves of equally low loss [238, 239]. Moreover, the sharpness of minima has been analyzed and linked to the generalization capabilities of neural networks, the idea being that wide minima are easier to find and also yield robust neural networks [240, 241, 242]. However, this does not appear to prevent sharp minima from generalizing well [243].

Exercises

Exercise 49

In view of Definition 28, show that a strict local minimum of a differentiable function which is not a global minimum is contained in a spurious valley.

Exercise 50

Show that if there exists a continuous path \( \alpha\) between a parameter \( \theta_1\) and a global minimum \( \theta_2\) such that \( \Lambda_{\mathcal{A}, \sigma, S, \mathcal{L}}(\alpha)\) is monotonically decreasing, then \( \theta_1\) cannot be an element of a spurious valley.

Exercise 51

Find an example of a spurious valley for a simple architecture.

Hint: Use a single neuron ReLU neural network and observe that, for two networks one with positive and one with negative slope, every continuous path in parameter space that connects the two has to pass through a parameter corresponding to a constant function.

14 Shape of neural network spaces

As we have seen in the previous chapter, the loss landscape of neural networks can be quite intricate and is typically not convex. In some sense, the reason for this is that we view the empirical risk as a map defined on the parameterization of a neural network. Let us consider a loss function \( \mathcal{L}:\mathbb{R} \times \mathbb{R} \to \mathbb{R}\) that is convex in its first argument and a sample \( S = ({\boldsymbol{x}}_i, y_i)_{i=1}^m \in (\mathbb{R}^d \times \mathbb{R})^m\) .

Then, for two neural networks \( \Phi_1, \Phi_2\) and for \( \alpha \in (0,1)\) it holds that

\[ \begin{align*} \widehat{\mathcal{R}}_S(\alpha \Phi_1 + (1-\alpha) \Phi_2) &= \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\alpha \Phi_1({\boldsymbol{x}}_i) + (1-\alpha) \Phi_2 ({\boldsymbol{x}}_i), y_i)\\ & \leq \frac{1}{m}\sum_{i=1}^m \alpha \mathcal{L}(\Phi_1({\boldsymbol{x}}_i), y_i) + (1- \alpha) \mathcal{L}(\Phi_2({\boldsymbol{x}}_i), y_i)\\ & = \alpha \widehat{\mathcal{R}}_S(\Phi_1) + (1-\alpha) \widehat{\mathcal{R}}_S(\Phi_2). \end{align*} \]

Hence, the empirical risk is convex when considered as a map depending on the neural network functions rather than the neural network parameters. A convex function does not have spurious minima or saddle points. As a result, the issues from the previous chapter are avoided if we take the perspective of neural network sets.
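
The convexity computation above can be checked numerically for the squared loss. The sketch below (assumed names, random ReLU networks and random data) compares the empirical risk of a convex combination of two network functions with the convex combination of their risks.

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda z: np.maximum(z, 0.0)

def random_net(d, width):
    W0, b0 = rng.standard_normal((width, d)), rng.standard_normal(width)
    W1, b1 = rng.standard_normal((1, width)), rng.standard_normal(1)
    return lambda X: (relu(X @ W0.T + b0) @ W1.T + b1).ravel()

d, m = 3, 50
X, y = rng.standard_normal((m, d)), rng.standard_normal(m)
risk = lambda pred: np.mean((pred - y) ** 2)          # squared loss is convex

phi1, phi2 = random_net(d, 8), random_net(d, 8)
for alpha in np.linspace(0, 1, 6):
    mix = alpha * phi1(X) + (1 - alpha) * phi2(X)     # convex combination of functions
    lhs = risk(mix)
    rhs = alpha * risk(phi1(X)) + (1 - alpha) * risk(phi2(X))
    print(f"alpha={alpha:.1f}:  risk(mix)={lhs:7.3f}  <=  {rhs:7.3f}")
```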

So why do we not optimize over the sets of neural networks instead of the parameters? To understand this, we will now study the set of neural networks associated with a fixed architecture as a subset of other function spaces.

We start by investigating the realization map \( R_\sigma\) introduced in Definition 26. Concretely, we show in Section 14.1, that if \( \sigma\) is Lipschitz, then the set of neural networks is the image of \( \mathcal{PN}(\mathcal{A}, \infty)\) under a locally Lipschitz map. We will use this fact to show in Section 14.2 that sets of neural networks are typically non-convex, and even have arbitrarily large holes. Finally, in Section 14.3, we study the extent to which there exist best approximations to arbitrary functions, in the set of neural networks. We will demonstrate that the lack of best approximations causes the weights of neural networks to grow infinitely during training.

14.1 Lipschitz parameterizations

In this section, we study the realization map \( R_\sigma\) . The main result is the following simplified version of [244, Proposition 4].

Proposition 21

Let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be \( C_\sigma\) -Lipschitz continuous with \( C_\sigma \geq 1\) , let \( |\sigma(x)| \leq C_\sigma|x|\) for all \( x \in \mathbb{R}\) , and let \( B\geq 1\) .

Then, for all \( \theta, \theta' \in \mathcal{PN}(\mathcal{A}, B)\) ,

\[ \begin{align*} \| R_\sigma(\theta) - R_\sigma(\theta') \|_{{L^{\infty}([-1,1]^{d_0})}} \leq (2C_\sigma B d_{\rm max})^{L} n_{\mathcal{A}} \| \theta - \theta' \|_{\infty}, \end{align*} \]

where \( d_{\rm {max}} = \max_{\ell = 0, \dots, L+1} d_\ell\) and \( n_{\mathcal{A}} = \sum_{\ell = 0}^{L}d_{\ell+1} (d_{\ell} + 1)\) .

Proof

Let \( \theta, \theta' \in \mathcal{PN}(\mathcal{A}, B)\) and define \( \delta \mathrm{:=} \| \theta - \theta' \|_{\infty}\) . Repeatedly using the triangle inequality we find a sequence \( (\theta_j)_{j=0}^{n_{\mathcal{A}}}\) such that \( \theta_0 = \theta\) , \( \theta_{n_{\mathcal{A}}} = \theta'\) , \( \|\theta_j - \theta_{j+1}\|_{\infty} \leq \delta\) , and \( \theta_j\) and \( \theta_{j+1}\) differ in one entry only for all \( j = 0,\dots {n_{\mathcal{A}}}-1\) . We conclude that for all \( {\boldsymbol{x}} \in [-1,1]^{d_0}\)

\[ \begin{align} \|R_\sigma(\theta)({\boldsymbol{x}}) - R_\sigma(\theta')({\boldsymbol{x}})\|_\infty \leq \sum_{j=0}^{{n_{\mathcal{A}}}-1} \|R_\sigma(\theta_j)({\boldsymbol{x}}) - R_\sigma(\theta_{j+1})({\boldsymbol{x}})\|_\infty. \end{align} \]

(217)

To upper bound (217), we now only need to understand the effect of changing one weight in a neural network by \( \delta\) .

Before we can complete the proof, we need two auxiliary lemmas. The first holds under slightly weaker assumptions than Proposition 21.

Lemma 33

Let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be \( C_\sigma\) -Lipschitz continuous with \( C_\sigma \geq 1\) , and let \( B\geq 1\) . Then every \( \Phi \in \mathcal{N}(\sigma; \mathcal{A}, B)\) satisfies

\[ \begin{align} \|\Phi({\boldsymbol{x}}) - \Phi({\boldsymbol{x}}')\|_\infty \leq C_\sigma^{L} (B d_{\rm max})^{L+1} \|{\boldsymbol{x}} - {\boldsymbol{x}}'\|_\infty ~~ \text{ for all } {\boldsymbol{x}}, {\boldsymbol{x}}' \in \mathbb{R}^{d_0}. \end{align} \]

(218)

Proof

We start with the case \( L = 1\) . Then, for \( (d_0, d_1,d_2) = \mathcal{A}\) , we have that

\[ \Phi({\boldsymbol{x}}) = \mathbf{W}^{(1)} \sigma(\mathbf{W}^{(0)}{\boldsymbol{x}} + \mathbf{b}^{(0)}) + \mathbf{b}^{(1)}, \]

for certain \( \mathbf{W}^{(0)}, \mathbf{W}^{(1)}, \mathbf{b}^{(0)}, \mathbf{b}^{(1)}\) with all entries bounded by \( B\) . As a consequence, we can estimate

\[ \begin{align*} \|\Phi({\boldsymbol{x}}) - \Phi({\boldsymbol{x}}')\|_\infty &= \left\|\mathbf{W}^{(1)} \left( \sigma(\mathbf{W}^{(0)}{\boldsymbol{x}} + \mathbf{b}^{(0)}) - \sigma(\mathbf{W}^{(0)}{\boldsymbol{x}}' + \mathbf{b}^{(0)})\right) \right\|_\infty\\ &\leq d_{1}B \left\| \sigma(\mathbf{W}^{(0)}{\boldsymbol{x}} + \mathbf{b}^{(0)}) - \sigma(\mathbf{W}^{(0)}{\boldsymbol{x}}' + \mathbf{b}^{(0)})\right\|_\infty\\ &\leq d_{1}B C_\sigma\left\| \mathbf{W}^{(0)}({\boldsymbol{x}} - {\boldsymbol{x}}')\right\|_\infty\\ &\leq d_{1}d_0 B^2 C_\sigma \left\|{\boldsymbol{x}} - {\boldsymbol{x}}'\right\|_\infty \leq C_\sigma \cdot (d_{\rm max} B)^2 \left\|{\boldsymbol{x}} - {\boldsymbol{x}}'\right\|_\infty, \end{align*} \]

where we used the Lipschitz property of \( \sigma\) and the fact that \( \|\mathbf{A}{\boldsymbol{x}}\|_\infty \leq n \max_{i,j}|A_{ij}| \|{\boldsymbol{x}}\|_\infty\) for every matrix \( {\boldsymbol{A}}=(A_{ij})_{i = 1, j = 1}^{m,n} \in \mathbb{R}^{m\times n}\) .

The induction step from \( L\) to \( L+1\) follows similarly. This concludes the proof of the lemma.

Lemma 34

Under the assumptions of Proposition 21 it holds that

\[ \begin{align} \|{\boldsymbol{x}}^{(\ell)}\|_\infty \leq (2 C_\sigma B d_{\rm max})^\ell ~ \text{ for all } {\boldsymbol{x}} \in [-1,1]^{d_{0}}. \end{align} \]

(219)

Proof

Per Definitions (5.b) and (5.c), we have that for \( \ell = 1, \dots, L+1\)

\[ \begin{align*} \|{\boldsymbol{x}}^{(\ell)}\|_\infty &\leq C_\sigma\left\|\mathbf{W}^{(\ell-1)}{\boldsymbol{x}}^{(\ell-1)} + \mathbf{b}^{(\ell-1)} \right\|_\infty\\ &\leq C_\sigma B d_{\rm max} \|{\boldsymbol{x}}^{(\ell-1)}\|_\infty + B C_\sigma, \end{align*} \]

where we used the triangle inequality and the estimate \( \|\mathbf{A}{\boldsymbol{x}}\|_\infty \leq n \max_{i,j}|A_{ij}| \|{\boldsymbol{x}}\|_\infty\) , which holds for every matrix \( \mathbf{A} \in \mathbb{R}^{m\times n}\) . We obtain that

\[ \begin{align*} \|{\boldsymbol{x}}^{(\ell)}\|_\infty &\leq C_\sigma B d_{\rm max}\cdot (1+ \|{\boldsymbol{x}}^{(\ell-1)}\|_\infty )\\ &\leq 2 C_\sigma B d_{\rm max}\cdot (\max\{1,\|{\boldsymbol{x}}^{(\ell-1)}\|_\infty\}). \end{align*} \]

Resolving the recursion \( \|{\boldsymbol{x}}^{(\ell)}\|_\infty \leq 2 C_\sigma B d_{\rm {max}} \max\{1,\|{\boldsymbol{x}}^{(\ell-1)}\|_\infty\}\) , we conclude that

\[ \begin{align*} \|{\boldsymbol{x}}^{(\ell)}\|_\infty \leq (2 C_\sigma B d_{\rm max})^\ell \max\{1,\|{\boldsymbol{x}}^{(0)}\|_\infty\} = (2 C_\sigma B d_{\rm max})^\ell. \end{align*} \]

This concludes the proof of the lemma.

We can now proceed with the proof of Proposition 21. Assume that \( \theta_{j+1}\) and \( \theta_{j}\) differ only in one entry. We assume this entry to be in the \( \ell\) th layer, and we start with the case \( \ell<L\) . It holds

\[ \begin{align*} |R_\sigma(\theta_j)({\boldsymbol{x}}) - R_\sigma(\theta_{j+1})({\boldsymbol{x}})| = |\Phi^\ell( \sigma(\mathbf{W}^{{(\ell)}}{\boldsymbol{x}}^{(\ell)} + \mathbf{b}^{(\ell)})) - \Phi^\ell( \sigma(\overline{\mathbf{W}^{{(\ell)}}}{\boldsymbol{x}}^{(\ell)} + \overline{\mathbf{b}^{(\ell)}}))|, \end{align*} \]

where \( \Phi^\ell \in \mathcal{N}(\sigma; \mathcal{A}^\ell, B)\) for \( \mathcal{A}^\ell = (d_{\ell+1}, \dots, d_{L+1})\) and \( (\mathbf{W}^{{(\ell)}}, \mathbf{b}^{{(\ell)}})\) , \( (\overline{\mathbf{W}}^{{(\ell)}}, \overline{\mathbf{b}}^{{(\ell)}})\) differ in one entry only.

Using the Lipschitz continuity of \( \Phi^\ell\) of Lemma 33, we have

\[ \begin{align*} &|R_\sigma(\theta_j)({\boldsymbol{x}}) - R_\sigma(\theta_{j+1})({\boldsymbol{x}})| \\ &~~ \leq C_\sigma^{L-\ell - 1} (B d_{\rm max})^{L-\ell}|\sigma(\mathbf{W}^{{(\ell)}}{\boldsymbol{x}}^{(\ell)} + \mathbf{b}^{(\ell)}) - \sigma(\overline{\mathbf{W}^{{(\ell)}}}{\boldsymbol{x}}^{(\ell)} + \overline{\mathbf{b}^{(\ell)}})|\\ &~~ \leq C_\sigma^{L-\ell} (B d_{\rm max})^{L-\ell} \| \mathbf{W}^{{(\ell)}}{\boldsymbol{x}}^{(\ell)} + \mathbf{b}^{(\ell)} - \overline{\mathbf{W}^{{(\ell)}}}{\boldsymbol{x}}^{(\ell)} - \overline{\mathbf{b}^{(\ell)}}\|_\infty\\ &~~ \leq 2 C_\sigma^{L-\ell} (B d_{\rm max})^{L-\ell} \delta \max\{1,\|{\boldsymbol{x}}^{(\ell)}\|_\infty\}, \end{align*} \]

where \( \delta \mathrm{:=} \|\theta - \theta'\|_{\infty}\) . Invoking Lemma 34, we conclude that

\[ \begin{align*} |R_\sigma(\theta_j)({\boldsymbol{x}}) - R_\sigma(\theta_{j+1})({\boldsymbol{x}})| &\leq 2 (2 C_\sigma B d_{\rm max})^\ell C_\sigma^{L-\ell} (B d_{\rm max})^{L-\ell} \delta\\ &\leq (2C_\sigma B d_{\rm max})^{L} \|\theta - \theta'\|_{\infty}. \end{align*} \]

For the case \( \ell=L\) , a similar estimate can be shown. Combining this with (217) yields the result.

Using Proposition 21, we can now consider the set of neural networks with a fixed architecture \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) as a subset of \( L^\infty([-1,1]^{d_0})\) . What is more, \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) is the image of \( \mathcal{PN}(\mathcal{A}, \infty)\) under a locally Lipschitz map.
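
The estimate of Proposition 21 can be sanity-checked numerically. The sketch below (an assumed small tanh architecture with weights bounded by \( B = 1\) ) perturbs every parameter by the same small amount, measures the resulting sup-norm change of the realization on a grid, and compares it with the bound \( (2 C_\sigma B d_{\rm max})^{L} n_{\mathcal{A}} \|\theta - \theta'\|_{\infty}\) ; in practice the bound is very pessimistic.

```python
import numpy as np

rng = np.random.default_rng(6)
widths = [1, 3, 3, 1]                  # architecture A = (d_0, ..., d_{L+1}), L = 2
L = len(widths) - 2
B, C_sigma = 1.0, 1.0                  # tanh is 1-Lipschitz with |tanh(x)| <= |x|

def random_params(scale):
    return [(rng.uniform(-scale, scale, (widths[l + 1], widths[l])),
             rng.uniform(-scale, scale, widths[l + 1])) for l in range(L + 1)]

def realize(params, x):
    for l, (W, b) in enumerate(params):
        x = W @ x + b
        if l < L:                      # no activation after the last layer
            x = np.tanh(x)
    return x

theta = random_params(scale=0.9)       # margin so the perturbation stays within B
delta = 1e-3
theta_p = [(W + delta, b + delta) for W, b in theta]   # perturb every entry by delta

grid = np.linspace(-1, 1, 201).reshape(-1, 1)
diff = max(np.max(np.abs(realize(theta, x) - realize(theta_p, x))) for x in grid)

d_max = max(widths)
n_A = sum(widths[l + 1] * (widths[l] + 1) for l in range(L + 1))
bound = (2 * C_sigma * B * d_max) ** L * n_A * delta
print(f"observed sup-difference {diff:.2e}  <=  bound {bound:.2e}")
```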

14.2 Convexity of neural network spaces

As a first step towards understanding \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) as a subset of \( L^\infty([-1,1]^{d_0})\) , we notice that it is star-shaped with few centers. Let us first introduce the necessary terminology.

Definition 29

Let \( Z\) be a subset of a linear space. A point \( x \in Z\) is called a center of \( Z\) if, for every \( y \in Z\) it holds that

\[ \{tx+ (1-t)y\,|\,t \in [0,1]\} \subseteq Z. \]

A set is called star-shaped if it has at least one center.

The following proposition follows directly from the definition of a neural network and is the content of Exercise 52.

Proposition 22

Let \( L\in \mathbb{N}\) and \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1})\in \mathbb{N}^{L+2}\) and \( \sigma \colon \mathbb{R} \to \mathbb{R}\) . Then \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) is scaling invariant, i.e. for every \( \lambda \in \mathbb{R}\) it holds that \( \lambda f \in \mathcal{N}(\sigma; \mathcal{A}, \infty)\) if \( f \in \mathcal{N}(\sigma; \mathcal{A}, \infty)\) , and hence \( 0 \in \mathcal{N}(\sigma; \mathcal{A}, \infty)\) is a center of \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) .
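
Numerically, the scaling invariance is immediate: multiplying the weights and the bias of the output layer by \( \lambda\) scales the realization by \( \lambda\) , so \( \lambda f\) is again a network of the same architecture. A minimal sketch with assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(7)
relu = lambda z: np.maximum(z, 0.0)
d, width, lam = 3, 4, -2.5

W0, b0 = rng.standard_normal((width, d)), rng.standard_normal(width)
W1, b1 = rng.standard_normal((1, width)), rng.standard_normal(1)

net = lambda x, W1, b1: W1 @ relu(W0 @ x + b0) + b1   # shallow ReLU network
x = rng.standard_normal(d)
# Scaling the last affine layer by lam realizes lam * f with the same architecture.
print(np.allclose(lam * net(x, W1, b1), net(x, lam * W1, lam * b1)))   # True
```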

Knowing that \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) is star-shaped with center \( 0\) , we can also ask ourselves if \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) has more than this one center. It is not hard to see that also every constant function is a center. The following theorem, which corresponds to [244, Proposition C.4], yields an upper bound on the number of linearly independent centers.

Theorem 30

Let \( L\in \mathbb{N}\) and \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1})\in \mathbb{N}^{L+2}\) , and let \( \sigma: \mathbb{R} \to \mathbb{R}\) be Lipschitz continuous. Then, \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) contains at most \( n_{\mathcal{A}} = \sum_{\ell = 0}^{L} (d_{\ell} + 1) d_{\ell+1}\) linearly independent centers.

Proof

Assume by contradiction, that there are functions \( (g_i)_{i=1}^{n_{\mathcal{A}}+1} \subseteq \mathcal{N}( \sigma ; \mathcal{A}, \infty) \subseteq L^\infty([-1,1]^{d_0})\) that are linearly independent and centers of \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) .

By the Hahn-Banach theorem, there exist \( (g_i')_{i=1}^{n_{\mathcal{A}}+1} \subseteq (L^\infty([-1,1]^{d_0}))'\) such that \( g_i'(g_j) = \delta_{ij}\) for all \( i,j \in \{1, \dots, n_{\mathcal{A}}+1\}\) . We define

\[ \begin{align*} T\colon L^\infty([-1,1]^{d_0}) \to \mathbb{R}^{n_{\mathcal{A}}+1}, ~~ g \mapsto \left(\begin{array}{c} g_1'(g) \\ g_2'(g)\\ \vdots\\ g_{n_{\mathcal{A}}+1}'(g) \end{array} \right). \end{align*} \]

Since \( T\) is continuous and linear, we have that \( T \circ R_\sigma\) is locally Lipschitz continuous by Proposition 21. Moreover, since the \( (g_i)_{i=1}^{n_{\mathcal{A}}+1}\) are linearly independent, we have that \( T(\mathrm{span}((g_i)_{i=1}^{n_{\mathcal{A}}+1}))= \mathbb{R}^{n_{\mathcal{A}}+1}\) . We denote \( V \mathrm{:=} \mathrm{span}((g_i)_{i=1}^{n_{\mathcal{A}}+1})\) .

Next, we would like to establish that \( \mathcal{N}(\sigma; \mathcal{A}, \infty) \supset V\) . Let \( g \in V\) then

\[ g = \sum_{\ell = 1}^{n_{\mathcal{A}}+1} a_\ell g_\ell, \]

for some \( a_1, \dots, a_{n_{\mathcal{A}}+1} \in \mathbb{R}\) . We show by induction that \( \tilde{g}^{(m)} \mathrm{:=} \sum_{\ell = 1}^{m} a_\ell g_\ell \in \mathcal{N}(\sigma; \mathcal{A}, \infty)\) for every \( m \leq n_{\mathcal{A}}+1\) . This is obviously true for \( m = 1\) . Moreover, we have that \( \tilde{g}^{(m+1)} = a_{m+1}g_{m+1} + \tilde{g}^{(m)}\) . Hence, the induction step holds true if \( a_{m+1}= 0\) . If \( a_{m+1} \neq 0\) , then we have that

\[ \begin{align} \widetilde{g}^{(m+1)} &= 2 a_{m+1} \cdot \left(\frac{1}{2} g_{m+1} + \frac{1}{2a_{m+1}}\widetilde{g}^{(m)}\right). \end{align} \]

(220)

By the induction assumption \( \widetilde{g}^{(m)} \in \mathcal{N}(\sigma; \mathcal{A}, \infty)\) and hence by Proposition 22 \( \widetilde{g}^{(m)}/(a_{m+1}) \in \mathcal{N}(\sigma; \mathcal{A}, \infty)\) . Additionally, since \( g_{m+1}\) is a center of \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) , we have that \( \frac{1}{2} g_{m+1} + \frac{1}{2a_{m+1}}\widetilde{g}^{(m)} \in \mathcal{N}(\sigma; \mathcal{A}, \infty)\) . By Proposition 22, we conclude that \( \widetilde{g}^{(m+1)}\in \mathcal{N}( \sigma; \mathcal{A},\infty)\) .

The induction shows that \( g \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) and thus \( V\subseteq \mathcal{N}( \sigma; \mathcal{A},\infty)\) . As a consequence, \( T \circ R_\sigma(\mathcal{PN}(\mathcal{A}, \infty)) \supseteq T(V) = \mathbb{R}^{n_\mathcal{A}+1}\) .

It is a well known fact of basic analysis that for every \( n \in \mathbb{N}\) there does not exist a surjective and locally Lipschitz continuous map from \( \mathbb{R}^n\) to \( \mathbb{R}^{n+1}\) . We recall that \( n_{\mathcal{A}} = \mathrm{dim}(\mathcal{PN}(\mathcal{A}, \infty))\) . This yields the contradiction.

For a convex set \( X\) , the line segment between any two points of \( X\) is a subset of \( X\) . Hence, every point of a convex set is a center. This yields the following corollary.

Corollary 5

Let \( L \in \mathbb{N}\) , let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , and let \( \sigma: \mathbb{R} \to \mathbb{R}\) be Lipschitz continuous. If \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) contains more than \( n_\mathcal{A}= \sum_{\ell = 0}^{L} (d_{\ell} + 1) d_{\ell+1}\) linearly independent functions, then \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is not convex.

Corollary 5 tells us that we cannot expect convex sets of neural networks if the set of neural networks has many linearly independent elements. Sets of neural networks contain, for each \( f \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) , also all shifts of this function, i.e., \( f(\cdot + {\boldsymbol{b}}) \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) for every \( {\boldsymbol{b}} \in \mathbb{R}^{d_0}\) . For a set of functions, being shift invariant and at the same time having only finitely many linearly independent elements is a very restrictive condition. Indeed, it was shown in [244, Proposition C.6] that if \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) contains only finitely many linearly independent functions and \( \sigma\) is differentiable in at least one point with non-zero derivative there, then \( \sigma\) is necessarily a polynomial.

We conclude that the set of neural networks is in general non-convex and star-shaped with 0 and constant functions being centers. One could visualize this set in 3D as in Figure 55.

Figure 55. Sketch of the space of neural networks in 3D. The vertical axis corresponds to the constant neural network functions, each of which is a center. The set of neural networks consists of many low-dimensional linear subspaces spanned by certain neural networks (\( \Phi_1, \dots, \Phi_6\) in this sketch) and linear functions. Between these low-dimensional subspaces, there is not always a straight-line connection by Corollary 5 and Theorem 31.

The fact that the neural network space is not convex could also mean that it merely fails to be convex at a single point. For example, \( \mathbb{R}^2 \setminus \{0\}\) is not convex, but for an optimization algorithm this would likely not pose a problem.

We will next observe that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) does not have such a benign non-convexity and in fact, has arbitrarily large holes.

To make this claim mathematically precise, we first introduce the notion of \( \varepsilon\) -convexity.

Definition 30

For \( \varepsilon > 0\) , we say that a subset \( A\) of a normed vector space \( X\) is \( \varepsilon\) -convex if

\[ {\rm co}(A) \subseteq A + B_\varepsilon(0), \]

where \( \rm{ co}(A)\) denotes the convex hull of \( A\) and \( B_\varepsilon(0)\) is an \( \varepsilon\) ball around \( 0\) with respect to the norm of \( X\) .

Intuitively speaking, a set that is convex when one fills up all holes smaller than \( \varepsilon\) is \( \varepsilon\) -convex. Now we show that there is no \( \varepsilon>0\) such that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is \( \varepsilon\) -convex.

Theorem 31

Let \( L \in \mathbb{N}\) and \( \mathcal{A} = (d_0, d_1, \dots, d_{L}, 1) \in \mathbb{N}^{L+2}\) . Let \( K \subseteq \mathbb{R}^{d_0}\) be compact and let \( \sigma \in \mathcal{M}\) , with \( \mathcal{M}\) as in (8) and assume that \( \sigma\) is not a polynomial. Moreover, assume that there exists an open set, where \( \sigma\) is differentiable and not constant.

If there exists an \( \varepsilon>0\) such that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is \( \varepsilon\) -convex, then \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is dense in \( C(K)\) .

Proof

Step 1. We show that \( \varepsilon\) -convexity implies \( \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) to be convex. By Proposition 22, we have that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is scaling invariant. This implies that \( \rm{ co}(\mathcal{N}( \sigma; \mathcal{A},\infty))\) is scaling invariant as well. Hence, if there exists \( \varepsilon >0\) such that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is \( \varepsilon\) -convex, then for every \( \varepsilon' >0\)

\[ \begin{align*} {\rm co}(\mathcal{N}( \sigma; \mathcal{A},\infty)) &= \frac{\varepsilon'}{\varepsilon}{\rm co}(\mathcal{N}( \sigma; \mathcal{A},\infty)) \subseteq \frac{\varepsilon'}{\varepsilon}\left(\mathcal{N}( \sigma; \mathcal{A},\infty) + B_\varepsilon(0)\right)\\ &= \mathcal{N}( \sigma; \mathcal{A},\infty) + B_{\varepsilon'}(0). \end{align*} \]

This yields that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is \( \varepsilon'\) -convex. Since \( \varepsilon'\) was arbitrary, we have that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is \( \varepsilon\) -convex for all \( \varepsilon>0\) .

As a consequence, we have that

\[ \begin{align*} {\rm co}(\mathcal{N}( \sigma; \mathcal{A},\infty)) \subseteq & \bigcap_{\varepsilon>0} (\mathcal{N}( \sigma; \mathcal{A},\infty) + B_\varepsilon(0))\\ \subseteq &\bigcap_{\varepsilon>0} \overline{(\mathcal{N}( \sigma; \mathcal{A},\infty) + B_\varepsilon(0))} = \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}. \end{align*} \]

Hence, \( \overline{\rm{ co}}(\mathcal{N}( \sigma; \mathcal{A},\infty)) \subseteq \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) and, by the well-known fact that in every metric vector space \( \rm{ co}(\overline{A}) \subseteq \overline{\rm{ co}}(A)\) , we conclude that \( \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) is convex.

Step 2. We show that \( \mathcal{N}_d^1(\sigma; 1)\subseteq\overline{\mathcal{N}(\sigma;\mathcal{A},\infty)}\) . If \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is \( \varepsilon\) -convex, then by Step 1 \( \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) is convex. The scaling invariance of \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) then shows that \( \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) is a closed linear subspace of \( C(K)\) .

Note that, by Proposition 3 for every \( {\boldsymbol{w}} \in \mathbb{R}^{d_0}\) and \( b \in \mathbb{R}\) there exists a function \( f \in \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) such that

\[ \begin{align} f({\boldsymbol{x}}) = \sigma({\boldsymbol{w}}^\top {\boldsymbol{x}} + b) ~~\text{for all }{\boldsymbol{x}}\in K. \end{align} \]

(221)

By definition, every constant function is an element of \( \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) .

Since \( \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) is a closed vector space, this implies that for all \( n \in \mathbb{N}\) and all \( {\boldsymbol{w}}_1^{(1)}, \dots, {\boldsymbol{w}}_n^{(1)} \in \mathbb{R}^{d_0}\) , \( w_1^{(2)}, \dots, w_n^{(2)} \in \mathbb{R}\) , \( b_1^{(1)}, \dots, b_n^{(1)} \in \mathbb{R}\) , \( b^{(2)} \in \mathbb{R}\)

\[ \begin{align} {\boldsymbol{x}} \mapsto \sum_{i=1}^n w_i^{(2)}\sigma(({\boldsymbol{w}}_i^{(1)})^\top {\boldsymbol{x}} + b_i^{(1)}) + b^{(2)} \in \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}. \end{align} \]

(222)

Step 3. From (222), we conclude that \( \mathcal{N}_d^1(\sigma; 1) \subseteq \overline{\mathcal{N}( \sigma; \mathcal{A},\infty)}\) . In words, the whole set of shallow neural networks of arbitrary width is contained in the closure of the set of neural networks with a fixed architecture. By Theorem 2, we have that \( \mathcal{N}_d^1(\sigma;1)\) is dense in \( C(K)\) , which yields the result.

For any activation function of practical relevance, a set of neural networks with fixed architecture is not dense in \( C(K)\) . This is only the case for very strange activation functions such as the one discussed in Subsection 4.2. Hence, Theorem 31 shows that in general, sets of neural networks of fixed architectures have arbitrarily large holes.

14.3 Closedness and best-approximation property

The non-convexity of the set of neural networks can have some serious consequences for the way we think of the approximation or learning problem by neural networks.

Consider \( \mathcal{A} = (d_0, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) and an activation function \( \sigma\) . Let \( H\) be a normed function space on \( [-1,1]^{d_{0}}\) such that \( \mathcal{N}( \sigma; \mathcal{A},\infty) \subseteq H\) . For \( h \in H\) we would like to find a neural network that best approximates \( h\) , i.e., to find \( \Phi \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) such that

\[ \begin{align} \| \Phi - h \|_{H} = \inf_{\Phi^* \in \mathcal{N}( \sigma; \mathcal{A},\infty)} \| \Phi^* - h \|_{H}. \end{align} \]

(223)

We say that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\subseteq H\) has

  • the best approximation property, if for all \( h\in H\) there exists at least one \( \Phi\in \mathcal{N}( \sigma; \mathcal{A},\infty)\) such that (223) holds,
  • the unique best approximation property, if for all \( h\in H\) there exists exactly one \( \Phi\in \mathcal{N}( \sigma; \mathcal{A},\infty)\) such that (223) holds,
  • the continuous selection property, if there exists a continuous function \( \phi \colon H \to \mathcal{N}( \sigma; \mathcal{A},\infty)\) such that \( \Phi = \phi(h)\) satisfies (223) for all \( h \in H\) .

We will see in the sequel that, in the absence of the best approximation property, the learning problem necessarily requires the weights of the neural networks to tend to infinity, which may or may not be desirable in applications.

Moreover, having a continuous selection procedure is desirable as it implies the existence of a stable selection algorithm; that is, an algorithm which, for similar problems yields similar neural networks satisfying (223).

Below, we will study the properties above for \( L^p\) spaces, \( p \in [1,\infty)\) . As we will see, neural network classes typically neither satisfy the continuous selection nor the best approximation property.

14.3.1 Continuous selection

As shown in [245], neural network spaces essentially never admit the continuous selection property. To give the argument, we first recall the following result from [245, Theorem 3.4] without proof.

Theorem 32

Let \( p \in (1, \infty)\) . Every subset of \( L^p([-1,1]^{d_0})\) with the unique best approximation property is convex.

This allows us to show the next proposition.

Proposition 23

Let \( L\in \mathbb{N}\) , \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1})\in \mathbb{N}^{L+2}\) , let \( \sigma: \mathbb{R} \to \mathbb{R}\) be Lipschitz continuous and not a polynomial, and let \( p \in (1,\infty)\) .

Then, \( \mathcal{N}( \sigma; \mathcal{A},\infty)\subseteq L^{p}([-1,1]^{d_0})\) does not have the continuous selection property.

Proof

We observe from Theorem 30 and the discussion following it that, under the assumptions of this proposition, \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is not convex. We conclude from Theorem 32 that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) does not have the unique best approximation property. Moreover, if the set \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) does not have the best approximation property, then it is obvious that it cannot have the continuous selection property. Thus, we can assume without loss of generality that \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) has the best approximation property and that there exist a point \( h \in L^p([-1,1]^{d_0})\) and two different \( \Phi_1, \Phi_2 \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) such that

\[ \begin{align} \| \Phi_1 - h \|_{L^p} = \| \Phi_2 - h \|_{L^p} = \inf_{\Phi^* \in \mathcal{N}( \sigma; \mathcal{A},\infty)} \| \Phi^* - h \|_{L^p}. \end{align} \]

(224)

Note that (224) implies that \( h \not \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) .

Let us consider the following function:

\[ \begin{align*} [-1,1] \ni \lambda \mapsto P(\lambda) = \left\{ \begin{array}{ll} (1+ \lambda) h - \lambda \Phi_1 & \text{ for }\lambda \leq 0,\\ (1- \lambda)h + \lambda \Phi_2& \text{ for }\lambda \geq 0.\\ \end{array}\right. \end{align*} \]

It is clear that \( P(\lambda)\) is a continuous path in \( L^{p}\) . Moreover, for \( \lambda \in (-1,0)\)

\[ \|\Phi_1 -P(\lambda)\|_{L^p} = (1+\lambda) \|\Phi_1 - h\|_{L^p}. \]

Assume towards a contradiction, that there exists \( \Phi^* \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) with \( \Phi^* \neq \Phi_1\) such that for \( \lambda \in (-1,0)\)

\[ \begin{align*} \|\Phi^* -P(\lambda)\|_{L^p} \leq \|\Phi_1 - P(\lambda)\|_{L^p}. \end{align*} \]

Then

\[ \begin{align} \|\Phi^* - h\|_{L^p} &\leq \|\Phi^* - P(\lambda)\|_{L^p} + \|P(\lambda) - h\|_{L^p} \nonumber\\ &\leq \|\Phi_1 - P(\lambda)\|_{L^p} + \|P(\lambda) - h\|_{L^p} \nonumber\\ & = (1+\lambda) \| \Phi_1 - h\|_{L^p} + |\lambda| \| \Phi_1 - h\|_{L^p} = \| \Phi_1 - h\|_{L^p} . \end{align} \]

(225)

Since \( \Phi_1\) is a best approximation to \( h\) this implies that every inequality in the estimate above is an equality. Hence, we have that

\[ \begin{align*} \|\Phi^* - h\|_{L^p} = \|\Phi^* - P(\lambda)\|_{L^p} + \|P(\lambda) - h\|_{L^p}. \end{align*} \]

However, in a strictly convex space like \( L^p([-1,1]^{d_0})\) for \( p > 1\) this implies that

\[ \Phi^* - P(\lambda) = c\cdot (P(\lambda) - h) \]

for a constant \( c \neq 0\) . This yields that

\[ \Phi^* = h + (c+1) \lambda \cdot (h - \Phi_1) \]

and plugging into (225) yields \( | (c+1) \lambda| = 1\) . If \( (c+1) \lambda = -1\) , then we have \( \Phi^* = \Phi_1\) which produces a contradiction. If \( (c+1) \lambda = 1\) , then

\[ \begin{align*} \|\Phi^* - P(\lambda)\|_{L^p} &= \|2h - \Phi_1 - (1+\lambda) h + \lambda \Phi_1\|_{L^p} \\ &= \|(1-\lambda) h - (1-\lambda) \Phi_1\|_{L^p} > \|P(\lambda) - \Phi_1\|_{L^p}, \end{align*} \]

which is another contradiction.

Hence, for every \( \lambda \in (-1,0)\) we have that \( \Phi_1\) is the unique minimizer to \( P(\lambda)\) in \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) . The same argument holds for \( \lambda \in (0,1)\) and \( \Phi_2\) . We conclude that for every selection function \( \phi \colon L^p([-1,1]^{d_0}) \to \mathcal{N}( \sigma; \mathcal{A},\infty)\) such that \( \Phi = \phi(h)\) satisfies (223) for all \( h \in L^p([-1,1]^{d_0})\) it holds that

\[ \lim_{\lambda \downarrow 0} \phi(P(\lambda)) = \Phi_2 \neq \Phi_1 = \lim_{\lambda \uparrow 0} \phi(P(\lambda)). \]

As a consequence, \( \phi\) is not continuous, which shows the result.

14.3.2 Existence of best approximations

We have seen in Proposition 23 that under very mild assumptions, the continuous selection property cannot hold. Moreover, the next result shows that in many cases, also the best approximation property fails to be satisfied. We provide below a simplified version of [244, Theorem 3.1]. We also refer to [246] for earlier work on this problem.

Proposition 24

Let \( \mathcal{A} = (1, 2, 1)\) and let \( \sigma: \mathbb{R} \to \mathbb{R}\) be Lipschitz continuous. Additionally assume that there exist \( r >0\) and \( \alpha' \neq \alpha\) such that \( \sigma\) is differentiable for all \( |x|>r\) and \( \sigma'(x) \to \alpha\) for \( x \to \infty\) , \( \sigma'(x) \to \alpha'\) for \( x \to -\infty\) .

Then, there exists a sequence in \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) which converges in \( L^{p}([-1,1])\) , for every \( p \in (1,\infty)\) , and the limit of this sequence is discontinuous. In particular, the limit of the sequence does not lie in \( \mathcal{N}(\sigma; \mathcal{A}', \infty)\) for any \( \mathcal{A}'\) .

Proof

For all \( n\in\mathbb{N}\) let

\[ \begin{align*} f_n(x) = \sigma(n x+ 1) - \sigma(n x)~~ \text{for all } x \in \mathbb{R}. \end{align*} \]

Then \( f_n\) can be written as a neural network with activation function \( \sigma\) and architecture \( \mathcal{A} = (1,2,1)\) . Moreover, for \( x > 0\) and \( n\) large enough such that \( nx > r\) , we observe with the fundamental theorem of calculus and integration by substitution that

\[ \begin{align} f_n(x) = \int_{x}^{x+1/n} n \sigma'(n z) dz = \int_{nx}^{nx+1} \sigma'(z) dz. \end{align} \]

(226)

It is not hard to see that the right hand side of (226) converges to \( \alpha\) for \( n \to \infty\) .

Similarly, for \( x < 0\) , we observe that \( f_n(x)\) converges to \( \alpha'\) for \( n \to \infty\) . We conclude that

\[ f_n \to \alpha \boldsymbol{1}_{\mathbb{R}_+} + \alpha' \boldsymbol{1}_{\mathbb{R}_-} \]

almost everywhere as \( n\to\infty\) . Since \( \sigma\) is Lipschitz continuous, the \( f_n\) are uniformly bounded by the Lipschitz constant of \( \sigma\) . Therefore, we conclude that \( f_n \to \alpha \boldsymbol{1}_{\mathbb{R}_+} + \alpha' \boldsymbol{1}_{\mathbb{R}_-}\) in \( L^p([-1,1])\) for all \( p \in [1,\infty)\) by the dominated convergence theorem.
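
A concrete instance of this construction is obtained for the ELU activation \( \sigma(x) = x\) for \( x \geq 0\) and \( \sigma(x) = e^x - 1\) for \( x < 0\) , which is 1-Lipschitz with \( \sigma'(x) \to 1\) as \( x \to \infty\) and \( \sigma'(x) \to 0\) as \( x \to -\infty\) , i.e., \( \alpha = 1\) and \( \alpha' = 0\) . The sketch below (illustrative grid-based computation) shows the \( L^2([-1,1])\) distance of \( f_n\) to the discontinuous limit \( \boldsymbol{1}_{\mathbb{R}_+}\) tending to zero.

```python
import numpy as np

def elu(x):
    # 1-Lipschitz; slope tends to 1 at +infinity and to 0 at -infinity
    return np.where(x >= 0, x, np.expm1(np.minimum(x, 0.0)))

xs = np.linspace(-1, 1, 20001)
limit = (xs > 0).astype(float)                 # alpha * 1_{R+} + alpha' * 1_{R-} with alpha=1, alpha'=0

for n in [1, 10, 100, 1000]:
    f_n = elu(n * xs + 1) - elu(n * xs)        # a (1, 2, 1) network with two hidden neurons
    l2 = np.sqrt(np.mean((f_n - limit) ** 2) * 2.0)   # Riemann-sum approximation of the L2([-1,1]) norm
    print(f"n = {n:5d}   L2 distance to the step function = {l2:.4f}")
# The L2 error tends to 0, but the limit is discontinuous and hence cannot be
# the realization of a network with a Lipschitz activation function.
```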

There is a straightforward extension of Proposition 24 to arbitrary architectures, which will be the content of Exercises 53 and 54.

Remark 16

The proof of Proposition 24 does not extend to the \( L^\infty\) norm. This, of course, does not mean that generally \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) is a closed set in \( L^\infty([-1,1]^{d_0})\) . In fact, almost all activation functions used in practice still give rise to non-closed neural network sets, see [1, Theorem 3.3]. However, there is one notable exception. For the ReLU activation function, it can be shown that \( \mathcal{N}(\sigma_{\rm {ReLU}}; \mathcal{A}, \infty)\) is a closed set in \( L^\infty([-1,1]^{d_0})\) if \( \mathcal{A}\) has only one hidden layer. The closedness of deep ReLU spaces in \( L^\infty\) is an open problem.

14.3.3 Exploding weights phenomenon

Finally, we discuss one of the consequences of the non-existence of best approximations of Proposition 24.

Consider a regression problem, where we aim to learn a function \( f\) using neural networks with a fixed architecture, i.e., elements of \( \mathcal{N}(\sigma; \mathcal{A}, \infty)\) . As discussed in Chapters 11 and 12, we wish to produce a sequence of neural networks \( (\Phi_n)_{n=1}^\infty\) such that the risk defined in (4) converges to \( 0\) . If the loss \( \mathcal{L}\) is the squared loss, \( \mu\) is a probability measure on \( [-1,1]^{d_0}\) , and the data is given by \( ({\boldsymbol{x}}, f({\boldsymbol{x}}))\) for \( {\boldsymbol{x}} \sim \mu\) , then

\[ \begin{align} \begin{split} \mathcal{R}(\Phi_n) &= \| \Phi_n - f \|_{L^2([-1,1]^{d_0}, \mu)}^2\\ &= \int_{[-1,1]^{d_0}}|\Phi_n({\boldsymbol{x}}) - f({\boldsymbol{x}})|^2 d \mu({\boldsymbol{x}}) \to 0 ~~ \text{ for } n \to \infty. \end{split} \end{align} \]

(227)

According to Proposition 24, for a given \( \mathcal{A}\) , and an activation function \( \sigma\) , it is possible that (227) holds, but \( f \not \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) . The following result shows that in this situation, the weights of \( \Phi_n\) diverge.

Proposition 25

Let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be \( C_\sigma\) -Lipschitz continuous with \( C_\sigma \geq 1\) and \( |\sigma(x)| \leq C_\sigma|x|\) for all \( x \in \mathbb{R}\) , and let \( \mu\) be a measure on \( [-1,1]^{d_0}\) .

Assume that there exists a sequence \( \Phi_n \in \mathcal{N}( \sigma; \mathcal{A},\infty)\) and \( f \in L^2([-1,1]^{d_0}, \mu) \setminus \mathcal{N}( \sigma; \mathcal{A},\infty)\) such that

\[ \begin{align} \| \Phi_n - f \|_{L^2([-1,1]^{d_0}, \mu)}^2 \to 0. \end{align} \]

(228)

Then

\[ \begin{align} \limsup_{n \to \infty}\max \left\{\|{\boldsymbol{W}}_n^{(\ell)}\|_\infty, \|{\boldsymbol{b}}_n^{(\ell)}\|_\infty\, \middle|\,\ell = 0, \dots, L\right\} = \infty. \end{align} \]

(229)

Proof

We assume towards a contradiction that the left-hand side of (229) is finite. As a result, there exists \( C >0\) such that \( \Phi_n \in \mathcal{N}(\sigma; \mathcal{A}, C)\) for all \( n \in \mathbb{N}\) .

By Proposition 21, we conclude that \( \mathcal{N}(\sigma; \mathcal{A}, C)\) is the image of a compact set under a continuous map and hence is itself a compact set in \( L^2([-1,1]^{d_0}, \mu)\) . In particular, we have that \( \mathcal{N}(\sigma; \mathcal{A}, C)\) is closed. Hence, (228) implies \( f \in \mathcal{N}(\sigma; \mathcal{A}, C)\) . This gives a contradiction.

Proposition 25 can be extended to all \( f\) for which there is no best approximation in \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) , see Exercise 55. The results imply that for functions we wish to learn that lack a best approximation within a neural network set, we must expect the weights of the approximating neural networks to grow to infinity. This can be undesirable because, as we will see in the following sections on generalization, a bounded parameter space facilitates many generalization bounds.
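
The exploding-weights phenomenon can be observed directly in the sequence from the proof of Proposition 24 (same assumed ELU setup as in the sketch above): the largest weight of \( f_n\) equals \( n\) and diverges, while the \( L^2\) error to the discontinuous target tends to zero.

```python
import numpy as np

def elu(x):
    return np.where(x >= 0, x, np.expm1(np.minimum(x, 0.0)))

xs = np.linspace(-1, 1, 20001)
target = (xs > 0).astype(float)          # discontinuous limit from the proof of Proposition 24

for n in [1, 10, 100, 1000, 10000]:
    # f_n(x) = sigma(n x + 1) - sigma(n x): first-layer weights are n, so the
    # largest parameter of this network equals n.
    f_n = elu(n * xs + 1) - elu(n * xs)
    err = np.sqrt(np.mean((f_n - target) ** 2) * 2.0)   # approximate L2([-1,1]) error
    print(f"max weight = {n:6d}   L2 error = {err:.4f}")
# The error tends to 0 only along parameter sequences whose entries blow up,
# in line with Proposition 25.
```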

Bibliography and further reading

The properties of neural network sets were first studied with a focus on the continuous selection property in [245, 247, 248] and [246]. The results in [245, 248, 247] already use the non-convexity of sets of shallow neural networks. The results on convexity and closedness presented in this chapter follow mostly the arguments of [244]. Similar results were also derived for other norms in [249].

Exercises

Exercise 52

Prove Proposition 22.

Exercise 53

Extend Proposition 24 to \( \mathcal{A}=(d_0,d_1,1)\) for arbitrary \( d_0\) , \( d_1 \in \mathbb{N}\) , \( d_1\geq 2\) .

Exercise 54

Use Proposition 3 to extend Proposition 24 to arbitrary depth.

Exercise 55

Extend Proposition 25 to functions \( f\) for which there is no best-approximation in \( \mathcal{N}( \sigma; \mathcal{A},\infty)\) . To do this, replace (228) by

\[ \| \Phi_n - f \|_{L^2}^2 \to \inf_{\Phi \in \mathcal{N}( \sigma; \mathcal{A},\infty)} \| \Phi - f \|_{L^2}^2. \]

15 Generalization properties of deep neural networks

As discussed in the introduction in Section 2.2, we generally learn based on a finite data set. For example, given data \( (x_i,y_i)_{i=1}^m\) , we try to find a network \( \Phi\) that satisfies \( \Phi(x_i)=y_i\) for \( i=1,\dots,m\) . The field of generalization is concerned with how well such a \( \Phi\) performs on unseen data, that is, on any \( x\) outside of the training data \( \{x_1,\dots,x_m\}\) . In this chapter we discuss generalization through the use of covering numbers.

In Sections 15.1 and 15.2 we revisit and formalize the general setup of learning and empirical risk minimization in a general context. Although some notions introduced in these sections have already appeared in the previous chapters, we reintroduce them here for a more coherent presentation. In Sections 15.3-15.5, we first discuss the concepts of generalization bounds and covering numbers, and then apply these arguments specifically to neural networks. In Section 15.6 we explore the so-called “approximation-complexity trade-off”, and finally in Sections 15.7-15.8 we introduce the “VC dimension” and give some implications for classes of neural networks.

15.1 Learning setup

A general learning problem [15, 13, 250] requires a feature space \( {X}\) and a label space \( {Y}\) , which we assume throughout to be measurable spaces. We observe joint data pairs \( (x_i,y_i)_{i=1}^m\subseteq {X}\times{Y}\) , and aim to identify a connection between the \( x\) and \( y\) variables. Specifically, we assume a relationship between features \( x\) and labels \( y\) modeled by a probability distribution \( \mathcal{D}\) over \( {X} \times {Y}\) , which generated the observed data \( (x_i,y_i)_{i=1}^m\) . While this distribution is unknown, our goal is to extract information from it, so that we can make predictions of \( y\) for a given \( x\) that are as good as possible. Importantly, the relationship between \( x\) and \( y\) need not be deterministic.

To make these concepts more concrete, we next present an example that will serve to explain ideas throughout this chapter. This example is of high relevance for many mathematicians, as ensuring a steady supply of high-quality coffee is essential for maximizing the output of our mathematical work.

Figure 56. Collection of coffee data. The last row lacks a “Quality” label. Our aim is to predict the label without the need for an (expensive) taste test.

Example 10 (Coffee Quality)

Our goal is to determine the quality of different coffees. To this end we model the quality as a number in

\[ Y = \Big\{\frac{0}{10},\dots,\frac{10}{10}\Big\}, \]

with higher numbers indicating better quality. Let us assume that our subjective assessment of quality of coffee is related to six features: “Acidity”, “Caffeine content”, “Price”, “Aftertaste”, “Roast level”, and “Origin”. The feature space \( X\) thus corresponds to the set of six-tuples describing these attributes, which can be either numeric or categorical (see Figure 56).

We aim to understand the relationship between elements of \( X\) and elements of \( Y\) , but we can neither afford, nor do we have the time to taste all the coffees in the world. Instead, we can sample some coffees, taste them, and grow our database accordingly as depicted in Figure 56. This way we obtain samples of pairs in \( X\times Y\) . The distribution \( \mathcal{D}\) from which they are drawn depends on various external factors. For instance, we might have avoided particularly cheap coffees, believing them to be inferior. As a result they do not occur in our database. Moreover, if a colleague contributes to our database, he might have tried the same brand and arrived at a different rating. In this case, the quality label is not deterministic anymore.

Based on our database, we wish to predict the quality of an untasted coffee. Before proceeding, we first formalize what it means to be a “good” prediction.

Characterizing how good a predictor is requires a notion of discrepancy in the label space. This is the purpose of the so-called loss function, which is a measurable mapping \( \mathcal{L} \colon {Y} \times {Y} \to \mathbb{R}_+\) .

Definition 31

Let \( \mathcal{L} \colon {Y} \times {Y} \to \mathbb{R}_+\) be a loss function and let \( \mathcal{D}\) be a distribution on \( {X} \times {Y}\) . For a measurable function \( h \colon {X} \to {Y}\) we call

\[ \begin{align*} \mathcal{R}(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\mathcal{L}(h(x), y) \right] \end{align*} \]

the (population) risk of \( h\) .

Based on the risk, we can now formalize what we consider a good predictor. The best predictor is one such that its risk is as close as possible to the smallest that any function can achieve. More precisely, we would like a risk that is close to the so-called Bayes risk

\[ \begin{align} R^* \mathrm{:=} \inf_{h\colon {X} \to {Y}} \mathcal{R}(h), \end{align} \]

(230)

where the infimum is taken over all measurable \( h:X\to Y\) .

Example 11 (Loss functions)

The choice of a loss function \( \mathcal{L}\) usually depends on the application. For a regression problem, i.e., a learning problem where \( {Y}\) is a non-discrete subset of a Euclidean space, a common choice is the square loss \( \mathcal{L}_2({\boldsymbol{y}},{\boldsymbol{y}}') = \|{\boldsymbol{y}}-{\boldsymbol{y}}'\|^2\) .

For binary classification problems, i.e., when \( {Y}\) is a discrete set of cardinality two, the “\( 0-1\) loss”

\[ \mathcal{L}_{0-1}(y,y') = \left\{ \begin{array}{ll} 1 & y\neq y'\\ 0 & y= y' \end{array}\right. \]

seems more natural.

Another frequently used loss for binary classification, especially when we want to predict probabilities (i.e., if \( {Y} = [0,1]\) but all labels are binary), is the binary cross-entropy loss

\[ \mathcal{L}_{\rm ce}(y, y') = -(y \log(y') + (1 - y) \log(1 - y')). \]

In contrast to the \( 0-1\) loss, the cross-entropy loss is differentiable, which is desirable in deep learning as we saw in Chapter 11.

In the coffee quality prediction problem, the quality is given as a fraction of the form \( k/10\) for \( k = 0, \dots, 10\) . While this is a discrete set, it makes sense to more heavily penalize predictions that are wrong by a larger amount. For example, predicting \( 4/10\) instead of \( 8/10\) should produce a higher loss than predicting \( 7/10\) . Hence, we would not use the \( 0-1\) loss but, for example, the square loss.
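
For illustration, here is a minimal sketch of the three loss functions above in Python (the function names are ours, not from the text); it also reproduces the observation from the coffee example that predicting \( 4/10\) instead of \( 8/10\) incurs a larger square loss than predicting \( 7/10\) .

```python
import numpy as np

def square_loss(y, y_pred):
    # L_2(y, y') = ||y - y'||^2
    return np.sum((np.asarray(y) - np.asarray(y_pred)) ** 2)

def zero_one_loss(y, y_pred):
    # L_{0-1}(y, y') = 1 if y != y' else 0
    return float(y != y_pred)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    # L_ce(y, y') = -(y log y' + (1 - y) log(1 - y')); eps avoids log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))

# Coffee example: a prediction that is off by more is penalized more heavily
# by the square loss, while the 0-1 loss does not distinguish the two errors.
print(square_loss(0.8, 0.4), square_loss(0.8, 0.7))      # approx. 0.16 vs 0.01
print(zero_one_loss(0.8, 0.4), zero_one_loss(0.8, 0.7))  # both 1.0
```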

How do we find a function \( h\colon {X} \to {Y}\) with a risk that is as close as possible to the Bayes risk? We will introduce a procedure to tackle this task in the next section.

15.2 Empirical risk minimization

Finding a minimizer of the risk constitutes a considerable challenge. First, we cannot search through all measurable functions. Therefore, we need to restrict ourselves to a specific set \( \mathcal{H} \subseteq \{h:{X}\to{Y}\}\) called the hypothesis set. In the following, this set will be some set of neural networks. Second, we are faced with the problem that we cannot evaluate \( \mathcal{R}(h)\) for non-trivial loss functions, because the distribution \( \mathcal{D}\) is typically unknown so that expectations with respect to \( \mathcal{D}\) cannot be computed. To approximate the risk, we will assume access to an i.i.d. sample of \( m\) observations drawn from \( \mathcal{D}\) . This is precisely the situation described in the coffee quality example of Figure 56, where \( m=6\) coffees were sampled. For a given hypothesis \( h\) we can then check how well it performs on our sampled data.

Definition 32

Let \( m\in \mathbb{N}\) , let \( \mathcal{L} \colon {Y} \times {Y} \to \mathbb{R}\) be a loss function and let \( S = (x_i, y_i)_{i=1}^m \in ({X} \times {Y})^m\) be a sample. For \( h \colon {X} \to {Y}\) , we call

\[ \begin{align*} \widehat{\mathcal{R}}_S(h) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(h(x_i), y_i) \end{align*} \]

the empirical risk of \( h\) .

If the sample \( S\) is drawn i.i.d. according to \( \mathcal{D}\) , then we immediately see from the linearity of the expected value that \( \widehat{\mathcal{R}}_S(h)\) is an unbiased estimator of \( \mathcal{R}(h)\) , i.e., \( \mathbb{E}_{S \sim \mathcal{D}^m}[\widehat{\mathcal{R}}_S(h)] = \mathcal{R}(h)\) . Moreover, the weak law of large numbers states that the sample mean of an i.i.d. sequence of integrable random variables converges to the expected value in probability. Hence, there is some hope that, at least for large \( m \in \mathbb{N}\) , minimizing the empirical risk instead of the population risk might lead to a good hypothesis. We formalize this approach in the next definition.
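
The following minimal sketch illustrates this concentration effect on a synthetic distribution (our own choice, not from the text): \( x\) uniform on \( [-1,1]\) , \( y = x^2\) plus Gaussian noise, the square loss, and the fixed hypothesis \( h(x)=x\) . The empirical risk approaches a Monte Carlo estimate of the population risk as \( m\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    # Synthetic distribution D: x ~ Unif[-1,1], y = x^2 + Gaussian noise
    x = rng.uniform(-1.0, 1.0, size=m)
    y = x ** 2 + 0.1 * rng.normal(size=m)
    return x, y

h = lambda x: x                      # a fixed hypothesis
loss = lambda yp, y: (yp - y) ** 2   # square loss

# Monte Carlo estimate of the population risk R(h) with a very large sample
x_big, y_big = sample(10 ** 6)
risk = np.mean(loss(h(x_big), y_big))

for m in [10, 100, 1000, 10000]:
    x, y = sample(m)
    emp_risk = np.mean(loss(h(x), y))   # empirical risk of h on the sample
    print(f"m = {m:6d}   empirical risk = {emp_risk:.4f}   population risk approx. {risk:.4f}")
```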

Definition 33

Let \( \mathcal{H} \subseteq \{h \colon {X} \to {Y}\}\) be a hypothesis set. Let \( m\in \mathbb{N}\) , let \( \mathcal{L} \colon {Y} \times {Y} \to \mathbb{R}\) be a loss function and let \( S = (x_i, y_i)_{i=1}^m \in ({X} \times {Y})^m\) be a sample. We call a function \( h_S\) such that

\[ \begin{align} \widehat{\mathcal{R}}_S(h_S) = \inf_{h \in \mathcal{H}}\widehat{\mathcal{R}}_S(h) \end{align} \]

(245)

an empirical risk minimizer.

From a generalization perspective, supervised deep learning is empirical risk minimization over sets of neural networks. The question we want to address next is how effective this approach is at producing hypotheses that achieve a risk close to the Bayes risk.

Let \( \mathcal{H}\) be some hypothesis set, such that an empirical risk minimizer \( h_S\) exists for all \( S\in ({X}\times{Y})^m\) ; see Exercise 56 for an explanation of why this is a reasonable assumption. Moreover, let \( g \in \mathcal{H}\) be arbitrary. Then

\[ \begin{align} \mathcal{R}(h_S) - R^* &= \mathcal{R}(h_S)-\widehat{\mathcal{R}}_S(h_S) + \widehat{\mathcal{R}}_S(h_S) - R^* \nonumber \end{align} \]

(247)

\[ \begin{align} &\leq |\mathcal{R}(h_S)-\widehat{\mathcal{R}}_S(h_S)| + \widehat{\mathcal{R}}_S(g) - R^* \\\end{align} \]

(248)

\[ \begin{align} &\leq 2 \sup_{h\in \mathcal{H}}|\mathcal{R}(h)-\widehat{\mathcal{R}}_S(h)| + \mathcal{R}(g) - R^*, \\\end{align} \]

(249)

where in the first inequality we used that \( h_S\) is the empirical risk minimizer. By taking the infimum over all \( g\) , we conclude that

\[ \begin{align} \mathcal{R}(h_S) - R^* &\leq 2 \sup_{h\in \mathcal{H}}|\mathcal{R}(h)-\widehat{\mathcal{R}}_S(h)| + \inf_{g \in \mathcal{H}}\mathcal{R}(g) - R^* \nonumber \end{align} \]

(250)

\[ \begin{align} &\mathrm{:=} 2\varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{approx}}. \\\end{align} \]

(251)

Similarly, considering only (247) yields that

\[ \begin{align} \mathcal{R}(h_S) &\leq \sup_{h\in \mathcal{H}}|\mathcal{R}(h)-\widehat{\mathcal{R}}_S(h)| + \inf_{g \in \mathcal{H}}\widehat{\mathcal{R}}_S(g) \nonumber \end{align} \]

(252)

\[ \begin{align} &\mathrm{:=} \varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{int}}. \\\end{align} \]

(253)

How to choose \( \mathcal{H}\) to reduce the approximation error \( \varepsilon_{\mathrm{approx}}\) or the interpolation error \( \varepsilon_{\mathrm{int}}\) was discussed at length in the previous chapters. The final piece is to figure out how to bound the generalization error \( \sup_{h\in \mathcal{H}}|\mathcal{R}(h)-\widehat{\mathcal{R}}_S(h)|\) . This will be discussed in the sections below.

15.3 Generalization bounds

We have seen that one aspect of successful learning is to bound the generalization error \( \varepsilon_{\rm {gen}}\) in (251). Let us first formally describe this problem.

Definition 34 (Generalization bound)

Let \( \mathcal{H} \subseteq \{h \colon {X} \to {Y} \}\) be a hypothesis set, and let \( \mathcal{L} \colon {Y} \times {Y} \to \mathbb{R}\) be a loss function. Let \( \kappa \colon (0,1) \times \mathbb{N} \to \mathbb{R}_+\) be such that for every \( \delta \in (0,1)\) it holds that \( \kappa(\delta, m) \to 0\) as \( m \to \infty\) . We call \( \kappa\) a generalization bound for \( \mathcal{H}\) if for every distribution \( \mathcal{D}\) on \( {X} \times {Y}\) , every \( m \in \mathbb{N}\) and every \( \delta \in (0,1)\) , it holds with probability at least \( 1-\delta\) over the random sample \( S \sim \mathcal{D}^m\) that

\[ \begin{align*} \sup_{h \in \mathcal{H}}|\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq \kappa(\delta, m). \end{align*} \]

Remark 17

For a generalization bound \( \kappa\) it holds that

\[ \begin{align*} \mathbb{P}\left[ \left|\mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S)\right| \leq \varepsilon \right] \geq 1-\delta \end{align*} \]

as soon as \( m\) is so large that \( \kappa(\delta, m) \leq \varepsilon\) . If there exists an empirical risk minimizer \( h_S\) such that \( \widehat{\mathcal{R}}_S(h_S) = 0\) , then with high probability the empirical risk minimizer will also have a small risk \( \mathcal{R}(h_S)\) . Empirical risk minimization is often referred to as a “PAC” algorithm, which stands for probably (\( \delta\) ) approximately correct (\( \varepsilon\) ).

Definition 34 requires the upper bound \( \kappa\) on the discrepancy between the empirical risk and the risk to be independent of the distribution \( \mathcal{D}\) . Why should this be possible? After all, we could have an underlying distribution that is not uniform and hence certain data points could appear very rarely in the sample. As a result, it should be very hard to produce a correct prediction for such points. At first sight, this suggests that non-uniform distributions should be much more challenging than uniform distributions. This intuition is incorrect, as the following argument based on Example 10 demonstrates.

Example 12 (Generalization in the coffee quality problem)

In Example 10, the underlying distribution describes both our process of choosing coffees and the relation between the attributes and the quality. Suppose we do not enjoy drinking coffee that costs less than \( 1\) €. Consequently, we do not have a single sample of such coffee in the dataset, and therefore we have no chance of learning the quality of cheap coffees.

However, the absence of coffee samples costing less than 1€ in our dataset is due to our general avoidance of such coffee. As a result, we run a low risk of incorrectly classifying the quality of a coffee that is cheaper than \( 1\) €, since it is unlikely that we will choose such a coffee in the future.

To establish generalization bounds, we use stochastic tools that guarantee that the empirical risk converges to the true risk as the sample size increases. This is typically achieved through concentration inequalities. One of the simplest and most well-known is Hoeffding’s inequality, see Theorem 47. We will now apply Hoeffding’s inequality to obtain a first generalization bound. This generalization bound is well-known and can be found in many textbooks on machine learning, e.g., [15, 13]. Although the result does not yet encompass neural networks, it forms the basis for a similar result applicable to neural networks, as we discuss subsequently.

Proposition 26 (Finite hypothesis set)

Let \( \mathcal{H} \subseteq \{h \colon {X} \to {Y} \}\) be a finite hypothesis set. Let \( \mathcal{L} \colon {Y} \times {Y} \to \mathbb{R}\) be such that \( \mathcal{L}( {Y} \times {Y}) \subseteq [c_1,c_2]\) with \( c_2 - c_1 = C>0\) .

Then, for every \( m \in \mathbb{N}\) and every distribution \( \mathcal{D}\) on \( {X} \times {Y}\) it holds with probability at least \( 1-\delta\) over the sample \( S \sim \mathcal{D}^m\) that

\[ \begin{align*} \sup_{h \in \mathcal{H}}|\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq C\sqrt{\frac{\log(|\mathcal{H}|) + \log(2/\delta)}{2m}}. \end{align*} \]

Proof

Let \( \mathcal{H} = \{h_1, \dots, h_n\}\) . Then it holds by a union bound that

\[ \begin{align*} \mathbb{P}\left[\exists h_i \in \mathcal{H} \colon |\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| > \varepsilon\right] \leq \sum_{i=1}^n \mathbb{P}\left[|\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| > \varepsilon\right]. \end{align*} \]

Note that \( \widehat{\mathcal{R}}_S(h_i)\) is the mean of independent random variables which take their values almost surely in \( [c_1,c_2]\) . Additionally, \( \mathcal{R}(h_i)\) is the expectation of \( \widehat{\mathcal{R}}_S(h_i)\) . The proof can therefore be finished by applying Theorem 47. This will be addressed in Exercise 57.
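
As a quick numerical illustration (the concrete numbers are our own, chosen purely for orientation), the bound of Proposition 26 can be evaluated directly; for a loss with values in \( [0,1]\) (so \( C=1\) ), a hypothesis set of size \( 10^6\) and confidence \( 95\%\) , it decays like \( 1/\sqrt{m}\) .

```python
import numpy as np

def finite_hypothesis_bound(n_hypotheses, m, delta, C=1.0):
    # C * sqrt((log|H| + log(2/delta)) / (2m)) as in Proposition 26
    return C * np.sqrt((np.log(n_hypotheses) + np.log(2.0 / delta)) / (2.0 * m))

# e.g. |H| = 10^6 hypotheses, loss values in [0,1] (C = 1), confidence 95%
for m in [100, 1000, 10000, 100000]:
    print(m, finite_hypothesis_bound(1e6, m, delta=0.05))
```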

Consider now a non-finite set of neural networks \( \mathcal{H}\) , and assume that it can be covered by a finite set of (small) balls. Applying Proposition 26 to the centers of these balls then allows us to derive a bound for \( \mathcal{H}\) similar to that in the proposition. This intuitive argument will be made rigorous in the following section.

15.4 Generalization bounds from covering numbers

To derive a generalization bound for classes of neural networks, we start by introducing the notion of covering numbers.

Definition 35

Let \( A\) be a relatively compact subset of a metric space \( (X, d)\) . For \( \varepsilon >0\) , we call

\[ \begin{align*} \mathcal{G}(A, \varepsilon, (X,d)) \mathrm{:=} \min\left\{n \in \mathbb{N}\, \middle|\,\exists\, (x_i)_{i=1}^n \subseteq X \text{ s.t. } \bigcup_{i=1}^n B_{\varepsilon}(x_i) \supset A \right\}, \end{align*} \]

where \( B_\varepsilon(x) = \{ z \in X\,|\, d(z,x) \leq \varepsilon\}\) , the \( \varepsilon\) -covering number of \( A\) in \( X\) . In case \( X\) or \( d\) are clear from context, we also write \( \mathcal{G}(A, \varepsilon, d)\) or \( \mathcal{G}(A, \varepsilon, X)\) instead of \( \mathcal{G}(A, \varepsilon, (X,d))\) .

A visualization of Definition 35 is given in Figure 57.

Figure 57. Illustration of the concept of covering numbers of Definition 35. The shaded set \( A\subseteq\mathbb{R}^2\) is covered by sixteen Euclidean balls of radius \( \varepsilon\) . Therefore, \( \mathcal{G}(A, \varepsilon, \mathbb{R}^2) \leq 16\) .

As we will see, it is possible to upper bound the \( \varepsilon\) -covering numbers of neural networks as a subset of \( L^\infty([0,1]^d)\) , assuming the weights are confined to a fixed bounded set. The precise estimates are postponed to Section 15.5. Before that, let us show how a finite covering number facilitates a generalization bound. We only consider Euclidean feature spaces \( X\) in the following result. A more general version could be easily derived.

Theorem 33

Let \( C_{Y}\) , \( C_{\mathcal{L}} >0\) and \( \alpha >0\) . Let \( {Y} \subseteq [-C_{Y},C_{Y}]\) , \( X \subseteq \mathbb{R}^d\) for some \( d \in \mathbb{N}\) , and \( \mathcal{H} \subseteq \{h \colon {X} \to {Y}\}\) . Further, let \( \mathcal{L} \colon {Y} \times {Y} \to \mathbb{R}\) be \( C_{\mathcal{L}}\) -Lipschitz.

Then, for every distribution \( \mathcal{D}\) on \( {X} \times {Y}\) and every \( m \in \mathbb{N}\) it holds with probability at least \( 1-\delta\) over the sample \( S \sim \mathcal{D}^m\) that for all \( h \in \mathcal{H}\)

\[ \begin{align*} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq 4 C_{Y} C_{\mathcal{L}} \sqrt{\frac{\log(\mathcal{G}(\mathcal{H}, m^{-\alpha}, L^\infty(X))) + \log(2/\delta)}{m}} + \frac{2C_{\mathcal{L}}}{m^{\alpha}}. \end{align*} \]

Proof

Let

\[ \begin{equation} M = \mathcal{G}(\mathcal{H},m^{-\alpha}, L^\infty(X)) \end{equation} \]

(235)

and let \( \mathcal{H}_M = (h_i)_{i=1}^M\subseteq \mathcal{H}\) be such that for every \( h \in \mathcal{H}\) there exists \( h_i \in \mathcal{H}_M\) with \( \| h - h_i \|_{L^\infty(X)} \leq 1/m^{\alpha}\) . The existence of \( \mathcal{H}_M\) follows by Definition 35.

Fix for the moment such \( h\in\mathcal{H}\) and \( h_i\in\mathcal{H}_M\) . By the reverse and normal triangle inequalities, we have

\[ \begin{align*} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h) |- | \mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| &\leq |\mathcal{R}(h) - \mathcal{R}(h_i) | + |\widehat{\mathcal{R}}_S(h) - \widehat{\mathcal{R}}_S(h_i)|. \end{align*} \]

Moreover, from the monotonicity of the expected value and the Lipschitz property of \( \mathcal{L}\) it follows that

\[ \begin{align*} |\mathcal{R}(h) - \mathcal{R}(h_i) | &\leq \mathbb{E}| \mathcal{L}(h(x), y) - \mathcal{L}(h_i(x), y) |\\ & \leq C_{\mathcal{L}} \mathbb{E} |h(x) - h_i(x)| \leq \frac{C_{\mathcal{L}}}{m^{\alpha}}. \end{align*} \]

A similar estimate yields \( |\widehat{\mathcal{R}}_S(h) - \widehat{\mathcal{R}}_S(h_i)| \leq C_{\mathcal{L}}/m^{\alpha}\) .

We thus conclude that for every \( \varepsilon>0\)

\[ \begin{align} &\mathbb{P}_{S \sim \mathcal{D}^m}\left[\exists h \in \mathcal{H} \colon |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h) |\geq \varepsilon \right]\nonumber \end{align} \]

(236)

\[ \begin{align} &~~ \leq \mathbb{P}_{S \sim \mathcal{D}^m}\left[\exists h_i \in \mathcal{H}_M \colon |\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i) |\geq \varepsilon - \frac{2C_{\mathcal{L}}}{m^{\alpha}} \right]. \\\end{align} \]

(237)

From Proposition 26, we know that for \( \varepsilon>0\) and \( \delta \in (0,1)\)

\[ \begin{align} \mathbb{P}_{S \sim \mathcal{D}^m}\left[\exists h_i \in \mathcal{H}_M\colon |\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i) |\geq \varepsilon - \frac{2C_{\mathcal{L}}}{m^{\alpha}} \right] \leq \delta \end{align} \]

(238)

as long as

\[ \varepsilon - \frac{2C_{\mathcal{L}}}{m^{\alpha}} > C \sqrt{\frac{\log(M) + \log(2/\delta)}{2m}}, \]

where \( C\) is such that \( \mathcal{L}({Y} \times {Y}) \subseteq [c_1,c_2]\) with \( c_2 - c_1 \leq C\) . By the Lipschitz property of \( \mathcal{L}\) we can choose \( C = 2\sqrt{2} C_{\mathcal{L}} C_{Y}\) .

Therefore, the definition of \( M\) in (235) together with (236) and (238) give that with probability at least \( 1-\delta\) it holds for all \( h\in\mathcal{H}\)

\[ \begin{align*} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq 2 \sqrt{2} C_{\mathcal{L}} C_{{Y}} \sqrt{\frac{\log(\mathcal{G}(\mathcal{H}, m^{-\alpha}, L^\infty)) + \log(2/\delta)}{2m}} + \frac{2C_{\mathcal{L}}}{m^{\alpha}}. \end{align*} \]

This concludes the proof.

15.5 Covering numbers of deep neural networks

We have seen in Theorem 33 that estimating \( L^\infty\) -covering numbers is crucial for understanding the generalization error. How can we determine these covering numbers? The set of neural networks of a fixed architecture can be a quite complex set (see Chapter 14), so it is not immediately clear how to cover it with balls, let alone know the number of required balls. The following lemma suggests a simpler approach.

Lemma 35

Let \( {X}_1\) , \( {X}_2\) be two metric spaces and let \( f\colon {X}_1 \to {X}_2\) be Lipschitz continuous with Lipschitz constant \( C_{\rm {Lip}}\) . For every relatively compact \( A \subseteq {X}_1\) it holds that for all \( \varepsilon>0\)

\[ \begin{align*} \mathcal{G}(f(A), C_{\rm Lip} \varepsilon, X_2) \leq \mathcal{G}(A, \varepsilon, X_1). \end{align*} \]

The proof of Lemma 35 is left as an exercise. If we can represent the set of neural networks as the image, under a Lipschitz map, of another set with known covering numbers, then Lemma 35 gives a direct way to bound the covering number of the neural network class.

Conveniently, we have already observed in Proposition 21 that the set of neural networks is the image of \( \mathcal{PN}(\mathcal{A}, B)\) as in Definition 26 under the Lipschitz continuous realization map \( R_\sigma\) . It thus suffices to establish the \( \varepsilon\) -covering number of \( \mathcal{PN}(\mathcal{A}, B)\) or equivalently of \( [-B,B]^{n_\mathcal{A}}\) . Then, using the Lipschitz property of \( R_\sigma\) that holds by Proposition 21, we can apply Lemma 35 to find the covering numbers of \( \mathcal{N}(\sigma; \mathcal{A}, B)\) . This idea is depicted in Figure 58.

Figure 58. Illustration of the main idea to deduce covering numbers of neural network spaces. Points \( \theta\in\mathbb{R}^2\) in parameter space in the left figure correspond to functions \( R_\sigma(\theta)\) in the right figure (with matching colors). By Lemma 35, a covering of the parameter space on the left translates to a covering of the function space on the right.

Proposition 27

Let \( B\) , \( \varepsilon >0\) and \( q \in \mathbb{N}\) . Then

\[ \mathcal{G}([-B,B]^{q}, \varepsilon, (\mathbb{R}^{q}, \| \cdot \|_{\infty})) \leq \lceil B/\varepsilon \rceil^q. \]

Proof

We start with the one-dimensional case \( q = 1\) . We choose \( k = \lceil B/\varepsilon \rceil\) points

\[ x_0 = -B + \varepsilon \text{ and } x_j = x_{j-1} + 2\varepsilon \text{ for } j = 1, \dots, k-1. \]

It is clear that all points between \( -B\) and \( x_{k-1}\) have distance at most \( \varepsilon\) to one of the \( x_j\) . Also, \( x_{k-1} = -B + \varepsilon + 2 (k - 1) \varepsilon \geq B - \varepsilon\) , since \( k \geq B/\varepsilon\) . We conclude that \( \mathcal{G}([-B,B], \varepsilon,\mathbb{R}) \leq k = \lceil B/\varepsilon \rceil\) . Set \( X_k \mathrm{:=} \{x_0, \dots, x_{k-1}\}\) .

For arbitrary \( q\) , we observe that for every \( x \in [-B,B]^q\) there is an element in the \( q\) -fold Cartesian product \( X_k^q = X_k \times \dots \times X_k\) with \( \| \cdot \|_{\infty}\) distance at most \( \varepsilon\) . Clearly, \( |X_k^q| = \lceil B/\varepsilon \rceil^q\) , which completes the proof.
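
The construction in the proof is easy to implement; the following minimal sketch builds the one-dimensional grid \( X_k\) , forms the product grid, and checks empirically (on random points, an assumption of the sketch) that every point of \( [-B,B]^q\) lies within sup-norm distance \( \varepsilon\) of the grid.

```python
import numpy as np

def cover_centers_1d(B, eps):
    # Centers x_0 = -B + eps, x_j = x_{j-1} + 2*eps, j = 1, ..., k-1, k = ceil(B/eps)
    k = int(np.ceil(B / eps))
    return -B + eps + 2.0 * eps * np.arange(k)

B, eps, q = 1.0, 0.1, 2
centers_1d = cover_centers_1d(B, eps)
print("number of balls per dimension:", len(centers_1d))   # <= ceil(B/eps)

# Check the covering property on random points of [-B,B]^q in the sup-norm:
# per coordinate, take the distance to the nearest 1d center, then the max.
rng = np.random.default_rng(1)
pts = rng.uniform(-B, B, size=(10000, q))
dist = np.max(np.min(np.abs(pts[:, :, None] - centers_1d[None, None, :]), axis=2), axis=1)
print("max sup-norm distance to the grid:", dist.max(), " (should be <= eps =", eps, ")")
```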

Having established a covering number for \( [-B,B]^{{n_\mathcal{A}}}\) and hence \( \mathcal{PN}(\mathcal{A}, B)\) , we can now estimate the covering numbers of deep neural networks by combining Lemma 35 and Propositions 21 and 27.

Theorem 34

Let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be \( C_\sigma\) -Lipschitz continuous with \( C_\sigma \geq 1\) , let \( |\sigma(x)| \leq C_\sigma|x|\) for all \( x \in \mathbb{R}\) , and let \( B\geq 1\) . Then

\[ \begin{align*} &\mathcal{G}(\mathcal{N}(\sigma; \mathcal{A},B), \varepsilon, L^\infty([0,1]^{d_0}))\\ &~~ \leq \mathcal{G}([-B,B]^{n_\mathcal{A}}, \varepsilon /(n_\mathcal{A}(2 C_\sigma B d_{\rm max})^L),(\mathbb{R}^{n_\mathcal{A}}, \| \cdot \|_{\infty}))\\ &~~ \leq \lceil n_\mathcal{A} /\varepsilon\rceil^{{n_\mathcal{A}}} \lceil 2 C_\sigma B d_{\rm max}\rceil^{{n_\mathcal{A}}L}. \end{align*} \]

We end this section by applying the previous theorem to the generalization bound of Theorem 33 with \( \alpha = 1/2\) . To simplify the analysis, we restrict the discussion to neural networks with range \( [-1,1]\) . To this end, denote

\[ \begin{align} \mathcal{N}^*(\sigma; \mathcal{A}, B) \mathrm{:}= \big\{ &\Phi \in \mathcal{N}( \sigma;\mathcal{A}, B) \big|\nonumber \end{align} \]

(239)

\[ \begin{align} & \Phi({\boldsymbol{x}}) \in [-1,1] \text{ for all } {\boldsymbol{x}} \in [0,1]^{d_0}\big\}. \\\end{align} \]

(240)

Since \( \mathcal{N}^*( \sigma; \mathcal{A}, B) \subseteq \mathcal{N}(\sigma;\mathcal{A}, B)\) we can bound the covering numbers of \( \mathcal{N}^*(\sigma; \mathcal{A}, B)\) by those of \( \mathcal{N}(\sigma; \mathcal{A}, B)\) . This yields the following result.

Theorem 35

Let \( C_{\mathcal{L}} >0\) and let \( \mathcal{L} \colon [-1,1] \times [-1,1] \to \mathbb{R}\) be \( C_{\mathcal{L}}\) -Lipschitz continuous. Further, let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be \( C_\sigma\) -Lipschitz continuous with \( C_\sigma \geq 1\) , and \( |\sigma(x)| \leq C_\sigma|x|\) for all \( x \in \mathbb{R}\) , and let \( B\geq 1\) .

Then, for every \( m \in \mathbb{N}\) , and every distribution \( \mathcal{D}\) on \( {X} \times [-1,1]\) it holds with probability at least \( 1-\delta\) over \( S \sim \mathcal{D}^m\) that for all \( \Phi \in \mathcal{N}^*( \sigma; \mathcal{A}, B)\)

\[ \begin{align*} |\mathcal{R}(\Phi) -\widehat{\mathcal{R}}_S(\Phi)|\leq &4 C_{\mathcal{L}} \sqrt{\frac{{n_\mathcal{A}} \log(\lceil n_\mathcal{A} \sqrt{m}\rceil) + L {n_\mathcal{A}}\log(\lceil 2 C_\sigma B d_{\rm max}\rceil) + \log(2/\delta)}{m}} \\ &+ \frac{2 C_{\mathcal{L}}}{\sqrt{m}}. \end{align*} \]
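
To get a feeling for the magnitudes involved, the following minimal sketch evaluates the right-hand side of Theorem 35 for a small fully connected architecture \( \mathcal{A}=(5,20,1)\) (our own choice) with \( C_\sigma = B = C_{\mathcal{L}} = 1\) ; only for rather large sample sizes \( m\) does the bound become informative.

```python
import numpy as np

def nn_generalization_bound(n_A, L, d_max, m, delta, C_L=1.0, C_sigma=1.0, B=1.0):
    # Right-hand side of Theorem 35 (covering radius m^{-1/2}, i.e. alpha = 1/2)
    log_cover = (n_A * np.log(np.ceil(n_A * np.sqrt(m)))
                 + L * n_A * np.log(np.ceil(2 * C_sigma * B * d_max)))
    return 4 * C_L * np.sqrt((log_cover + np.log(2 / delta)) / m) + 2 * C_L / np.sqrt(m)

# Architecture A = (5, 20, 1): n_A counts all weights and biases, d_max is the widest layer
n_A = 5 * 20 + 20 + 20 * 1 + 1
for m in [10 ** 3, 10 ** 5, 10 ** 7]:
    print(m, nn_generalization_bound(n_A, L=1, d_max=20, m=m, delta=0.05))
```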

15.6 The approximation-complexity trade-off

We recall the decomposition of the error in (251)

\[ \begin{align*} \mathcal{R}(h_S) - R^* &\leq 2\varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{approx}}, \end{align*} \]

where \( R^*\) is the Bayes risk defined in (230). We make the following observations about the approximation error \( \varepsilon_{\mathrm{approx}}\) and generalization error \( \varepsilon_{\mathrm{gen}}\) in the context of neural network based learning:

  • Scaling of generalization error: By Theorem 35, for a hypothesis class \( \mathcal{H}\) of neural networks with \( {n_\mathcal{A}}\) weights and \( L\) layers, and for a sample of size \( m\in \mathbb{N}\) , the generalization error \( \varepsilon_{\mathrm{gen}}\) essentially scales like

    \[ \varepsilon_{\mathrm{gen}} = O(\sqrt{({n_\mathcal{A}}\log({n_\mathcal{A}} m) + L {n_\mathcal{A}} \log({n_\mathcal{A}}))/m}) ~~\text{as }m \to \infty. \]
  • Scaling of approximation error: Assume there exists \( h^*\) such that \( \mathcal{R}(h^*) = R^*\) , and let the loss function \( \mathcal{L}\) be Lipschitz continuous in the first coordinate. Then

    \[ \begin{align*} \varepsilon_{\mathrm{approx}} = \inf_{h \in \mathcal{H}} \mathcal{R}(h) -\mathcal{R}(h^*) &= \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y) \sim \mathcal{D}}[ \mathcal{L}(h(x), y) - \mathcal{L}(h^*(x), y) ]\\ & \leq C \inf_{h \in \mathcal{H}}\|h - h^*\|_{L^\infty}, \end{align*} \]

    for some constant \( C>0\) . We have seen in Chapters 6 and 8 that if we choose \( \mathcal{H}\) as a set of neural networks with size \( {n_\mathcal{A}}\) and \( L\) layers, then, for appropriate activation functions, \( \inf_{h \in \mathcal{H}} \|h - h^*\|_{L^\infty}\) behaves like \( {n_\mathcal{A}}^{-r}\) if, e.g., \( h^*\) is a \( d\) -dimensional \( s\) -Hölder regular function and \( r = s/d\) (Theorem 9), or \( h^*\in C^{k,s}([0,1]^d)\) and \( r<( k+s)/d\) (Theorem 14).

By these considerations, we conclude that for an empirical risk minimizer \( \Phi_S\) from a set of neural networks with \( {n_\mathcal{A}}\) weights and \( L\) layers, it holds that

\[ \begin{align} \mathcal{R}(\Phi_S) - R^* \leq O(\sqrt{({n_\mathcal{A}}\log(m) + L {n_\mathcal{A}} \log({n_\mathcal{A}}))/m}) + O({n_\mathcal{A}}^{-r}), \end{align} \]

(241)

for \( m \to \infty\) and for some \( r\) depending on the regularity of \( h^*\) . Note that enlarging the neural network set, i.e., increasing \( {n_\mathcal{A}}\) , has two effects: the term associated with approximation decreases, and the term associated with generalization increases. This trade-off is known as the approximation-complexity trade-off. The situation is depicted in Figure 59. The figure and (241) suggest that the perfect model achieves the optimal trade-off between the approximation and generalization errors. Using this notion, we can also separate all models into three classes:

  • Underfitting: If the approximation error decays faster than the estimation error increases.
  • Optimal: If the sum of approximation error and generalization error is at a minimum.
  • Overfitting: If the approximation error decays slower than the estimation error increases.
Figure 59. Illustration of the approximation-complexity trade-off of Equation (241). Here we chose \( r = 1\) and \( m = 10000\) , and all implicit constants are assumed to be equal to 1.
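
The curve in Figure 59 is straightforward to reproduce; the following minimal sketch evaluates the two terms of (241) with \( r=1\) , \( m=10000\) , \( L=1\) and all implicit constants set to \( 1\) (as in the figure) and reports the value of \( {n_\mathcal{A}}\) at which their sum is minimal.

```python
import numpy as np

m, r, L = 10_000, 1.0, 1
n_A = np.arange(2, 2000, dtype=float)

# The two terms of (241), with all implicit constants set to 1
generalization = np.sqrt((n_A * np.log(m) + L * n_A * np.log(n_A)) / m)
approximation = n_A ** (-r)
total = generalization + approximation

best = int(n_A[np.argmin(total)])
print("number of parameters n_A at the optimal trade-off:", best)
print("value of the error bound at the optimum:", float(total.min()))
```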

In Chapter 16, we will see that deep learning often operates in the regime where the number of parameters \( {n_\mathcal{A}}\) exceeds the optimal trade-off point. For certain architectures used in practice, \( {n_\mathcal{A}}\) can be so large that the theory of the approximation-complexity trade-off suggests that learning should be impossible. However, we emphasize that the present analysis only provides upper bounds. It does not prove that learning is impossible or even impractical in the overparameterized regime. Moreover, in Chapter 12 we have already seen indications that learning in the overparameterized regime need not necessarily lead to large generalization errors.

15.7 PAC learning from VC dimension

In addition to covering numbers, there are several other tools to analyze the generalization capacity of hypothesis sets. In the context of classification problems, one of the most important is the so-called Vapnik–Chervonenkis (VC) dimension.

15.7.1 Definition and examples

Let \( \mathcal{H}\) be a hypothesis set of functions mapping from \( \mathbb{R}^d\) to \( \{0,1\}\) . A set \( S=\{{\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_n\} \subseteq \mathbb{R}^d\) is said to be shattered by \( \mathcal{H}\) if for every \( (y_1,\dots,y_n) \in \{0,1\}^n\) there exists \( h\in\mathcal{H}\) such that \( h({\boldsymbol{x}}_j)=y_j\) for all \( j=1,\dots,n\) .

The VC dimension quantifies the complexity of a function class via the number of points that can in principle be shattered.

Definition 36

The VC dimension of \( \mathcal{H}\) is the cardinality of the largest set \( S\subseteq\mathbb{R}^d\) that is shattered by \( \mathcal{H}\) . We denote the VC dimension by \( \mathrm{VCdim}(\mathcal{H})\) .

Example 13 (Intervals)

Let \( \mathcal{H} = \{\boldsymbol{1}_{[a,b]}\,|\,a, b \in \mathbb{R}\}\) . It is clear that \( \mathrm{VCdim}(\mathcal{H}) \geq 2\) since for \( x_1 < x_2\) the functions

\[ \boldsymbol{1}_{[x_1-2,x_1-1]}, ~ \boldsymbol{1}_{[x_1-{1},x_1]}, ~ \boldsymbol{1}_{[x_1,x_2]}, ~ \boldsymbol{1}_{[x_2,x_2+1]}, \]

realize all four possible labelings of \( S= \{x_1,x_2\}\) , so that \( S\) is shattered.

On the other hand, if \( x_1 < x_2 < x_3\) , then, since \( h^{-1}(\{1\})\) is an interval for all \( h \in \mathcal{H}\) , we have that \( h(x_1) = 1 = h(x_3)\) implies \( h(x_2) = 1\) . Hence, no set of three elements can be shattered. Therefore, \( \mathrm{VCdim}(\mathcal{H}) = 2\) . The situation is depicted in Figure 60.

Figure 60. Different ways to classify two or three points. The colored blocks correspond to intervals that produce different classifications of the points.
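
The shattering argument of Example 13 can also be verified by brute force; the following minimal sketch (the helper function is ours) enumerates all labelings of a given point set and checks whether each one is realized by the indicator of an interval.

```python
import itertools

def shattered_by_intervals(points):
    """Check whether a finite set of reals is shattered by H = {1_[a,b]}."""
    points = sorted(points)
    for labels in itertools.product([0, 1], repeat=len(points)):
        # An interval realizes the labeling iff taking [a,b] = [min, max] of the
        # points labeled 1 classifies every point correctly (empty label set: take
        # an interval containing none of the points).
        ones = [x for x, y in zip(points, labels) if y == 1]
        if ones:
            a, b = min(ones), max(ones)
            realized = all((a <= x <= b) == bool(y) for x, y in zip(points, labels))
        else:
            realized = True
        if not realized:
            return False
    return True

print(shattered_by_intervals([0.0, 1.0]))        # True:  VCdim(H) >= 2
print(shattered_by_intervals([0.0, 1.0, 2.0]))   # False: this 3-point set is not shattered
```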

Example 14 (Half-spaces)

Let \( \mathcal{H}_2 = \{ \boldsymbol{1}_{[0,\infty)}(\langle {\boldsymbol{w}} , \cdot\rangle + b )\,|\, {\boldsymbol{w}}\in \mathbb{R}^2, b \in \mathbb{R}\}\) be a hypothesis set of rotated and shifted two-dimensional half-spaces. In Figure 61 we see that \( \mathcal{H}_2\) shatters a set of three points. More generally, for \( d\ge 2\) with

\[ \begin{align*} \mathcal{H}_d\mathrm{:}= \{{\boldsymbol{x}}\mapsto \boldsymbol{1}_{[0,\infty)}({\boldsymbol{w}}^\top{\boldsymbol{x}}+b)\,|\,{\boldsymbol{w}}\in\mathbb{R}^d,~b\in\mathbb{R}\} \end{align*} \]

the VC dimension of \( \mathcal{H}_d\) equals \( d+1\) .

Figure 61. Different ways to classify three points by a half-space, [206, Figure 1.4].

In the example above, the VC dimension coincides with the number of parameters. However, this is not true in general as the following example shows.

Example 15 (Infinite VC dimension)

For \( x\in\mathbb{R}\) , let

\[ \begin{align*} \mathcal{H}\mathrm{:}= \{x\mapsto \boldsymbol{1}_{[0,\infty)}(\sin(wx))\,|\,w\in\mathbb{R}\}. \end{align*} \]

Then the VC dimension of \( \mathcal{H}\) is infinite (Exercise 60).

15.7.2 Generalization based on VC dimension

In the following, we consider a classification problem. Denote by \( \mathcal{D}\) the data-generating distribution on \( \mathbb{R}^{d}\times \{0,1\}\) . Moreover, we let \( \mathcal{H}\) be a set of functions from \( \mathbb{R}^d\to \{0,1\}\) .

In the binary classification set-up, the natural choice of a loss function is the \( 0-1\) loss \( \mathcal{L}_{0-1}(y,y') = \boldsymbol{1}_{y \neq y'}\) . Thus, given a sample \( S\) , the empirical risk of a function \( h \in\mathcal{H}\) is

\[ \begin{align*} \widehat{\mathcal{R}}_S(h) = \frac{1}{m}\sum_{i=1}^{m}\boldsymbol{1}_{h({\boldsymbol{x}}_i)\neq y_i}. \end{align*} \]

Moreover, the risk can be written as

\[ \begin{align*} \mathcal{R}(h) = \mathbb{P}_{({\boldsymbol{x}},y)\sim \mathcal{D}}[h({\boldsymbol{x}})\neq y], \end{align*} \]

i.e., the probability under \( ({\boldsymbol{x}},y)\sim \mathcal{D}\) of \( h\) misclassifying the label \( y\) of \( {\boldsymbol{x}}\) .

We can now give a generalization bound in terms of the VC dimension of \( \mathcal{H}\) , see, e.g., [15, Corollary 3.19]:

Theorem 36

Let \( d,k \in \mathbb{N}\) and \( \mathcal{H} \subseteq \{h \colon \mathbb{R}^d\to \{0,1\}\}\) have VC dimension \( k\) . Let \( \mathcal{D}\) be a distribution on \( \mathbb{R}^d\times \{0,1\}\) . Then, for every \( \delta >0\) and \( m \in \mathbb{N}\) , it holds with probability at least \( 1-\delta\) over a sample \( S \sim \mathcal{D}^m\) that for every \( h \in \mathcal{H}\)

\[ \begin{align} |\mathcal{R}(h)- \widehat{\mathcal{R}}_S(h)| \leq \sqrt{\frac{2k\log(e m /k)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}. \end{align} \]

(242)

In words, Theorem 36 tells us that if a hypothesis class has finite VC dimension, then a hypothesis with a small empirical risk will have a small risk if the number of samples is large. This shows that empirical risk minimization is a viable strategy in this scenario. Will this approach also work if the VC dimension is not bounded? No, in fact, in that case, no learning algorithm will succeed in reliably producing a hypothesis for which the risk is close to the best possible. We omit the technical proof of the following theorem from [15, Theorem 3.23].

Theorem 37

Let \( k \in \mathbb{N}\) and let \( \mathcal{H} \subseteq \{h \colon {X} \to \{0,1\}\}\) be a hypothesis set with VC dimension \( k\) . Then, for every \( m \in \mathbb{N}\) and every learning algorithm \( \mathrm{A} \colon ({X} \times \{0,1\})^m \to \mathcal{H}\) there exists a distribution \( \mathcal{D}\) on \( {X} \times \{0,1\}\) such that

\[ \begin{align*} \mathbb{P}_{S \sim \mathcal{D}^m} \left[ \mathcal{R}(\mathrm{A}(S)) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) > \sqrt{\frac{k}{320 m}} \right] \geq \frac{1}{64}. \end{align*} \]

Theorem 37 immediately implies the following statement for the generalization bound.

Corollary 6

Let \( k \in \mathbb{N}\) and let \( \mathcal{H} \subseteq \{h \colon {X} \to \{0,1\}\}\) be a hypothesis set with VC dimension \( k\) . Then, for every \( m \in \mathbb{N}\) there exists a distribution \( \mathcal{D}\) on \( {X} \times \{0,1\}\) such that

\[ \begin{align*} \mathbb{P}_{S \sim \mathcal{D}^m} \left[ \sup_{h\in \mathcal{H}}|\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| > \sqrt{\frac{k}{1280 m}} \right] \geq \frac{1}{64}. \end{align*} \]

Proof

For a sample \( S\) , let \( h_S\in\mathcal{H}\) be an empirical risk minimizer, i.e., \( \widehat{\mathcal{R}}_S(h_S) = \min_{h \in \mathcal{H}} \widehat{\mathcal{R}}_S(h)\) . Let \( \mathcal{D}\) be the distribution of Theorem 37. Moreover, for \( \delta >0\) , let \( h_{\delta} \in \mathcal{H}\) be such that

\[ \begin{align*} \mathcal{R}(h_{\delta}) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) < \delta. \end{align*} \]

Then, applying Theorem 37 with \( \mathrm{A}(S) = h_S\) it holds that

\[ \begin{align*} 2\sup_{h \in \mathcal{H}}|\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| &\geq |\mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S)| + |\mathcal{R}(h_\delta) - \widehat{\mathcal{R}}_S(h_\delta)|\\ &\geq \mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S) + \widehat{\mathcal{R}}_S(h_\delta) - \mathcal{R}(h_\delta)\\ &\geq \mathcal{R}(h_S) - \mathcal{R}(h_\delta)\\ &>\mathcal{R}(h_S) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) - \delta, \end{align*} \]

where we used the definition of \( h_S\) in the third inequality. The proof is completed by applying Theorem 37 and using that \( \delta\) was arbitrary.

We have now seen that we have a generalization bound scaling like \( O(1/\sqrt{m})\) for \( m\to \infty\) if and only if the VC dimension of a hypothesis class is finite. In more quantitative terms, we require the VC dimension of the neural network class to be smaller than \( m\) .

What does this imply for neural network functions? For ReLU neural networks, the following holds [26, Theorem 8.8].

Theorem 38

Let \( \mathcal{A} \in \mathbb{N}^{L+2}\) , \( L\in\mathbb{N}\) and set

\[ \begin{align*} \mathcal{H}\mathrm{:}= \{\boldsymbol{1}_{[0,\infty)}\circ \Phi\,|\,\Phi\in\mathcal{N}(\sigma_{\rm ReLU}; \mathcal{A}, \infty)\}. \end{align*} \]

Then, there exists a constant \( C>0\) independent of \( L\) and \( \mathcal{A}\) such that

\[ \begin{align*} {\rm VCdim}(\mathcal{H})\le C\cdot (n_{\mathcal{A}}L\log(n_{\mathcal{A}})+n_{\mathcal{A}}L^2). \end{align*} \]

The bound (242) is meaningful if \( m \gg k\) . For ReLU neural networks as in Theorem 38, this means \( m\gg n_{\mathcal{A}} L\log(n_{\mathcal{A}})+n_{\mathcal{A}} L^2\) . Fixing \( L=1\) , this amounts to \( m\gg n_{\mathcal{A}} \log(n_{\mathcal{A}})\) for a shallow neural network with \( n_{\mathcal{A}}\) parameters. This condition is contrary to what we assumed in Chapter 12, where it was crucial that \( n_{\mathcal{A}} \gg m\) . If the VC dimension of the neural network set scales like \( O(n_{\mathcal{A}}\log (n_{\mathcal{A}}))\) , then Theorem 37 and Corollary 6 indicate that, at least for certain distributions, generalization should not be possible in this regime. We will discuss the resolution of this potential paradox in Chapter 16.
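
For orientation, the following minimal sketch evaluates the upper bound of Theorem 38 for shallow ReLU networks (\( L=1\) ), with the unknown constant set to \( C=1\) purely as an assumption for illustration; the printed values indicate the order of magnitude of samples \( m\) required for the bound (242) to be meaningful.

```python
import numpy as np

def vc_upper_bound_relu(n_A, L, C=1.0):
    # Upper bound of Theorem 38; the constant C is unknown, so C = 1 is an
    # assumption made purely for illustration
    return C * (n_A * L * np.log(n_A) + n_A * L ** 2)

# Shallow ReLU networks (L = 1): the bound (242) is informative only when the
# sample size m greatly exceeds the VC dimension bound printed below
for n_A in [10 ** 3, 10 ** 5, 10 ** 7]:
    print(f"n_A = {n_A:>9d}   VC dimension bound ~ {vc_upper_bound_relu(n_A, L=1):.2e}")
```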

15.8 Lower bounds on achievable approximation rates

We conclude this chapter on the complexities and generalization bounds of neural networks by using the established VC dimension bound of Theorem 38 to deduce limitations to the approximation capacity of neural networks. The result described below was first given in [85].

Theorem 39

Let \( k\) , \( d\in\mathbb{N}\) . Assume that for every \( \varepsilon>0\) there exists \( L_\varepsilon\in\mathbb{N}\) and \( \mathcal{A}_\varepsilon\) with \( L_\varepsilon\) layers and input dimension \( d\) such that

\[ \begin{align*} \sup_{\| f \|_{{C^k([0,1]^d)}}\le 1} \inf_{\Phi\in\mathcal{N}(\sigma_{\rm ReLU}; \mathcal{A}_\varepsilon, \infty)}\|f-\Phi\|_{C^0({[0,1]^d})}<\frac{\varepsilon}{2}. \end{align*} \]

Then there exists \( C>0\) solely depending on \( k\) and \( d\) , such that for all \( \varepsilon\in (0,1)\)

\[ \begin{align*} n_{\mathcal{A_\varepsilon}} L_\varepsilon\log(n_{\mathcal{A_\varepsilon}}) +n_{\mathcal{A_\varepsilon}} L_\varepsilon^2 \ge C \varepsilon^{-\frac{d}{k}}. \end{align*} \]

Proof

For \( {\boldsymbol{x}}\in\mathbb{R}^d\) consider the “bump function”

\[ \begin{align*} \tilde f({\boldsymbol{x}})\mathrm{:}= \left\{ \begin{array}{ll} \exp\left(1-\frac{1}{1-\| {\boldsymbol{x}} \|_{2}^2}\right) &\text{if }\|{\boldsymbol{x}}\|_2<1\\ 0 &\text{otherwise,} \end{array}\right. \end{align*} \]

and its scaled version

\[ \begin{align*} \tilde f_\varepsilon({\boldsymbol{x}}) \mathrm{:=} \varepsilon \tilde f\left(2\varepsilon^{-1/k} {\boldsymbol{x}}\right), \end{align*} \]

for \( \varepsilon\in (0,1)\) . Then

\[ {\rm supp}(\tilde f_\varepsilon)\subseteq \Big[-\frac{\varepsilon^{1/k}}{2},\frac{\varepsilon^{1/k}}{2}\Big]^d \]

and

\[ \| \tilde f_\varepsilon \|_{C^k}\le 2^k\| \tilde f \|_{C^k} \mathrm{:=} \tau_k>0. \]

Consider the equispaced point set \( \{{\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_{N(\varepsilon)}\}= \varepsilon^{1/k}\mathbb{Z}^d \cap [0,1]^d\) . The cardinality of this set is \( N(\varepsilon)\simeq \varepsilon^{-d/k}\) . Given \( {\boldsymbol{y}}\in\{0,1\}^{N(\varepsilon)}\) , let for \( {\boldsymbol{x}} \in \mathbb{R}^d\)

\[ \begin{align} f_{{\boldsymbol{y}}}({\boldsymbol{x}}) \mathrm{:=} \tau_k^{-1} \sum_{j=1}^{N(\varepsilon)} y_j \tilde f_{\varepsilon}({\boldsymbol{x}}-{\boldsymbol{x}}_j). \end{align} \]

(256)

Then \( f_{{\boldsymbol{y}}}({\boldsymbol{x}}_j)= \tau_k^{-1} \varepsilon y_j\) for all \( j = 1, \dots, N(\varepsilon)\) and \( \| f_{{\boldsymbol{y}}} \|_{C^k}\le 1\) .

For every \( {\boldsymbol{y}}\in\{0,1\}^{N(\varepsilon)}\) let \( \Phi_{{\boldsymbol{y}}} \in \mathcal{N}(\sigma_{\rm {ReLU}};\mathcal{A}_{\tau_k^{-1} \varepsilon},\infty)\) be such that

\[ \sup_{{\boldsymbol{x}}\in [0,1]^d}|f_{{\boldsymbol{y}}}({\boldsymbol{x}})-\Phi_{{\boldsymbol{y}}} ({\boldsymbol{x}})|<\frac{\varepsilon}{2 \tau_k}. \]

Then

\[ \boldsymbol{1}_{[0,\infty)}\Big(\Phi_y({\boldsymbol{x}}_j)-\frac{\varepsilon}{2 \tau_k}\Big)=y_j~~\text{for all }j = 1, \dots, N(\varepsilon). \]

Hence, the VC dimension of \( \{\boldsymbol{1}_{[0,\infty)}\circ \Phi\,|\,\Phi\in\mathcal{N}(\sigma_{\rm {ReLU}};\mathcal{A}_{\tau_k^{-1} \varepsilon},\infty)\}\) is greater than or equal to \( N(\varepsilon)\) . Theorem 38 thus implies

\[ \begin{align*} N(\varepsilon)\simeq\varepsilon^{-\frac{d}{k}}\le C \cdot \Big(n_{\mathcal{A}_{\tau_k^{-1} \varepsilon}} L_{\tau_k^{-1} \varepsilon}\log(n_{\mathcal{A}_{\tau_k^{-1} \varepsilon}})+n_{\mathcal{A}_{\tau_k^{-1} \varepsilon}} L_{\tau_k^{-1} \varepsilon}^2\Big) \end{align*} \]

or equivalently

\[ \begin{align*} \tau_k^{\frac{d}{k}} \varepsilon^{-\frac{d}{k}}\le C \cdot \Big(n_{\mathcal{A}_{\varepsilon}} L_{ \varepsilon}\log(n_{\mathcal{A}_{\varepsilon}})+n_{\mathcal{A}_{\varepsilon}} L_{\varepsilon}^2\Big). \end{align*} \]

This completes the proof.

Figure 62. Illustration of \( f_{\boldsymbol{y}}\) from Equation (256) on \( [0,1]^2\) .
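
The construction of \( f_{\boldsymbol{y}}\) in (256) can be reproduced numerically; the following minimal sketch (with \( d=2\) , \( k=1\) , and \( \tau_k\) set to \( 1\) for simplicity, which only rescales the function) builds the grid of bump functions and verifies that \( f_{\boldsymbol{y}}\) attains the prescribed values at the grid points.

```python
import numpy as np

def bump(x):
    # Smooth bump supported on the unit ball
    r2 = np.sum(x ** 2, axis=-1)
    out = np.zeros_like(r2)
    inside = r2 < 1.0
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - r2[inside]))
    return out

def f_y(x, centers, y, eps, k, tau_k=1.0):
    # f_y(x) = tau_k^{-1} * sum_j y_j * eps * bump(2 * eps^{-1/k} * (x - x_j))
    scale = 2.0 * eps ** (-1.0 / k)
    val = np.zeros(x.shape[0])
    for xj, yj in zip(centers, y):
        val += yj * eps * bump(scale * (x - xj))
    return val / tau_k

# d = 2, k = 1: grid of spacing eps^{1/k} in [0,1]^2 and random binary labels
d, k, eps = 2, 1, 0.2
grid_1d = np.arange(0.0, 1.0 + 1e-9, eps ** (1.0 / k))
centers = np.array([[a, b] for a in grid_1d for b in grid_1d])
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=len(centers))

# f_y interpolates the (scaled) labels at the grid points: f_y(x_j) = eps * y_j
vals = f_y(centers, centers, y, eps, k)
print(np.allclose(vals, eps * y))   # True
```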

To interpret Theorem 39, we consider two situations:

  • In case the depth is allowed to increase at most logarithmically in \( \varepsilon\) , then reaching uniform error \( \varepsilon\) for all \( f\in C^k([0,1]^d)\) with \( \|f\|_{C^k({[0,1]^d})}\le 1\) requires

    \[ \begin{align*} n_{\mathcal{A}_\varepsilon}\log(n_{\mathcal{A}_\varepsilon})\log(1/\varepsilon)+n_{\mathcal{A}_\varepsilon}\log(1/\varepsilon)^2 \ge C \varepsilon^{-\frac{d}{k}}. \end{align*} \]

    In terms of the neural network size, this (necessary) condition becomes \( n_{\mathcal{A}_\varepsilon}\ge C \varepsilon^{-d/k}/\log(\varepsilon)^2\) . As we have shown in Chapter 8, in particular Theorem 14, up to log terms this condition is also sufficient. Hence, while the constructive proof of Theorem 14 might have seemed rather specific, under the assumption of the depth increasing at most logarithmically (which the construction in Chapter 8 satisfies), it was essentially optimal! The neural networks in this proof are shown to have size \( O(\varepsilon^{-d/k})\) up to log terms.

  • If we allow the depth \( L_\varepsilon\) to increase faster than logarithmically in \( \varepsilon\) , then the lower bound on the required neural network size improves. Fixing, for example, \( \mathcal{A}_\varepsilon\) with \( L_\varepsilon\) layers such that \( n_{\mathcal{A}_\varepsilon} \le WL_\varepsilon\) for some fixed, \( \varepsilon\) -independent \( W\in\mathbb{N}\) , the (necessary) condition on the depth becomes

    \[ \begin{align*} W\log(WL_\varepsilon)L_\varepsilon^2+W L_\varepsilon^3 \ge C \varepsilon^{-\frac{d}{k}} \end{align*} \]

    and hence \( L_\varepsilon\gtrsim \varepsilon^{-d/(3k)}\) .

    We add that, for arbitrary depth, the upper bound on the VC dimension in Theorem 38 can be improved to \( n_\mathcal{A}^2\) [26, Theorem 8.6], and using this would improve the just established lower bound to \( L_\varepsilon\gtrsim \varepsilon^{-d/(2k)}\) .

    For fixed width, this corresponds to neural networks of size \( O(\varepsilon^{-d/(2k)})\) , which would mean twice the convergence rate proven in Theorem 14. Indeed, it turns out that neural networks can achieve this rate in terms of the neural network size [75].

To sum up, in order to get error \( \varepsilon\) uniformly for all \( \| f \|_{{C^k([0,1]^d)}}\le 1\) , the size of a ReLU neural network is required to increase at least like \( O(\varepsilon^{-d/(2k)})\) as \( \varepsilon\to 0\) , i.e., the best attainable convergence rate is \( 2k/d\) . It has been proven that this rate is also achievable, and thus the bound is sharp. Achieving this rate requires neural network architectures that grow faster in depth than in width.

Bibliography and further reading

Classical statistical learning theory is based on the foundational work of Vapnik and Chervonenkis [251]. This led to the formulation of the probably approximately correct (PAC) learning model in [252], which is primarily utilized in this chapter. A streamlined mathematical introduction to statistical learning theory can be found in [250].

Since statistical learning theory is well-established, there exists a substantial amount of excellent expository work describing this theory. Some highly recommended books on the topic are [15, 13, 26]. The specific approach of characterizing learning via covering numbers has been discussed extensively in [26, Chapter 14]. Specific results for ReLU activation used in this chapter were derived in [253, 254]. The results of Section 15.8 describe some of the findings in [85, 75]. Other scenarios in which the tightness of the upper bounds were shown are, for example, if quantization of weights is assumed, [62, 87, 63], or when some form of continuity of the approximation scheme is assumed, see [255] for general lower bounds (also applicable to neural networks).

Exercises

Exercise 56

Let \( \mathcal{H}\) be a set of neural networks with fixed architecture, where the weights are taken from a compact set. Moreover, assume that the activation function is continuous. Show that for every sample \( S\) there always exists an empirical risk minimizer \( h_S\) .

Exercise 57

Complete the proof of Proposition 26.

Exercise 58

Prove Lemma 35.

Exercise 59

Show that the VC dimension of \( \mathcal{H}_2\) of Example 14 is indeed 3, by demonstrating that no set of four points can be shattered by \( \mathcal{H}_2\) .

Exercise 60

Show that the VC dimension of

\[ \begin{align*} \mathcal{H}\mathrm{:}= \{x\mapsto \boldsymbol{1}_{[0,\infty)}(\sin(wx))\,|\,w\in\mathbb{R}\} \end{align*} \]

is infinite.

16 Generalization in the overparameterized regime

In the previous chapter, we discussed the theory of generalization for deep neural networks trained by minimizing the empirical risk. A key conclusion was that good generalization is possible as long as we choose an architecture that has a moderate number of neural network parameters relative to the number of training samples. Moreover, we saw in Section 15.6 that the best performance can be expected when the neural network size is chosen to balance the generalization and approximation errors, by minimizing their sum.

Figure 63. ImageNet Classification Competition: Final score on the test set in the Top 1 category vs. Parameters-to-Training-Samples Ratio. Note that all architectures have more parameters than training samples. Architectures include AlexNet [1], VGG16 [2], GoogLeNet [3], ResNet50/ResNet152 [4], DenseNet121 [5], ViT-G/14 [6], EfficientNetB0 [7], and AmoebaNet [8].

Surprisingly, successful neural network architectures do not necessarily follow these theoretical observations. Consider the neural network architectures in Figure 63. They represent some of the most renowned image classification models, and all of them participated in the ImageNet Classification Competition [9]. The training set consisted of 1.2 million images. The \( x\) -axis shows the model performance, and the \( y\) -axis displays the ratio of the number of parameters to the size of the training set; notably, all architectures have a ratio larger than one, i.e., they have more parameters than training samples. For the largest model, there are more neural network parameters than training samples by a factor of \( 1000\) .

Given that the practical application of deep learning appears to operate in a regime significantly different from the one analyzed in Chapter 15, we must ask: Why do these methods still work effectively?

16.1 The double descent phenomenon

The success of deep learning in a regime not covered by traditional statistical learning theory puzzled researchers for some time. In [10], an intriguing set of experiments was performed. These experiments indicate that while the risk follows the upper bound from Section 15.6 for neural network architectures that do not interpolate the data, the curve does not grow to infinity in the way that Figure 59 suggests. Instead, after surpassing the so-called “interpolation threshold”, the risk starts to decrease again. This behavior, known as double descent, is illustrated in Figure 64.

Figure 64. Illustration of the double descent phenomenon.

16.1.1 Least-squares regression revisited

To gain further insight, we consider ridgeless kernel least-squares regression as introduced in Section 12.2. Consider a data sample \( ({\boldsymbol{x}}_j,y_j)_{j=1}^m\subseteq \mathbb{R}^d\times\mathbb{R}\) generated by some ground-truth function \( f\) , i.e.,

\[ \begin{equation} y_j=f({\boldsymbol{x}}_j)~~\text{for }j=1,\dots,m. \end{equation} \]

(244)

Let \( \phi_j: \mathbb{R}^d\to\mathbb{R}\) , \( j\in\mathbb{N}\) , be a sequence of ansatz functions. For \( n\in\mathbb{N}\) , we wish to fit a function \( {\boldsymbol{x}} \mapsto \sum_{i=1}^n w_i \phi_i({\boldsymbol{x}})\) to the data using linear least-squares. To this end, we introduce the feature map

\[ \mathbb{R}^d \ni {\boldsymbol{x}} \mapsto \phi({\boldsymbol{x}})\mathrm{:}= (\phi_1({\boldsymbol{x}}),\dots,\phi_n({\boldsymbol{x}}))^\top\in\mathbb{R}^n. \]

The goal is to determine coefficients \( {\boldsymbol{w}}\in\mathbb{R}^n\) minimizing the empirical risk

\[ \widehat{\mathcal{R}}_S({\boldsymbol{w}})= \frac{1}{m}\sum_{j=1}^m\Big(\sum_{i=1}^n w_i\phi_i({\boldsymbol{x}}_j) -y_j\Big)^2 = \frac{1}{m}\sum_{j=1}^m(\left\langle \phi({\boldsymbol{x}}_j), {\boldsymbol{w}}\right\rangle_{}-y_j)^2. \]

With

\[ \begin{equation} {\boldsymbol{A}}_n \mathrm{:}= \begin{pmatrix} \phi_1({\boldsymbol{x}}_1) &\dots &\phi_n({\boldsymbol{x}}_1)\\ \vdots &\ddots &\vdots\\ \phi_1({\boldsymbol{x}}_m) &\dots &\phi_n({\boldsymbol{x}}_m) \end{pmatrix} = \begin{pmatrix} \phi({\boldsymbol{x}}_1)^\top\\ \vdots\\ \phi({\boldsymbol{x}}_m)^\top \end{pmatrix} \in\mathbb{R}^{m\times n} \end{equation} \]

(245)

and \( {\boldsymbol{y}}=(y_1,\dots,y_m)^\top\) it holds

\[ \begin{equation} \widehat{\mathcal{R}}_{S}({\boldsymbol{w}}) = \frac{1}{m}\| {\boldsymbol{A}}_n{\boldsymbol{w}}-{\boldsymbol{y}} \|_{}^2. \end{equation} \]

(246)

As discussed in Sections 12.1-12.2, a unique minimizer of (246) only exists if \( {\boldsymbol{A}}_n\) has rank \( n\) . For a minimizer \( {\boldsymbol{w}}_n\) , the fitted function reads

\[ \begin{equation} f_n(x)\mathrm{:}= \sum_{j=1}^n w_{n,j}\phi_j(x). \end{equation} \]

(247)

We are interested in the behavior of the \( f_n\) as a function of \( n\) (the number of ansatz functions/parameters of our model), and distinguish between two cases:

  • Underparameterized: If \( n<m\) we have fewer parameters \( n\) than training points \( m\) . For the least squares problem of minimizing \( \widehat{\mathcal{R}}_S\) , this means that there are more conditions \( m\) than free parameters \( n\) . Thus, in general, we cannot interpolate the data, and we have \( \min_{{\boldsymbol{w}}\in\mathbb{R}^n}\widehat{\mathcal{R}}_S({\boldsymbol{w}})>0\) .
  • Overparameterized: If \( n\ge m\) , then we have at least as many parameters \( n\) as training points \( m\) . If the \( {\boldsymbol{x}}_j\) and the \( \phi_j\) are such that \( {\boldsymbol{A}}_n\in\mathbb{R}^{m\times n}\) has full rank \( m\) , then there exists \( {\boldsymbol{w}}\) such that \( \widehat{\mathcal{R}}_S({\boldsymbol{w}})=0\) . If \( n>m\) , then \( {\boldsymbol{A}}_n\) necessarily has a nontrivial kernel, and there exist infinitely many parameter choices \( {\boldsymbol{w}}\) that yield zero empirical risk \( \widehat{\mathcal{R}}_S\) . Some of them lead to better, and some lead to worse prediction functions \( f_n\) in (247).
Figure 66. ansatz functions \( \phi_j\)
Figure 67. Runge function \( f\) and data points
Figure 65. Ansatz functions \( \phi_1, \dots, \phi_{40}\) drawn from a Gaussian process, along with the Runge function and \( 18\) equispaced data points.

In the overparameterized case, there exist many minimizers of \( \widehat{\mathcal{R}}_S\) . The training algorithm we use to compute a minimizer determines the type of prediction function \( f_n\) we obtain. We argued in Chapter 12 that, for suitable initialization, gradient descent converges towards the minimal norm minimizer

\[ \begin{equation} {\boldsymbol{w}}_{n,*}=\rm{argmin}_{{\boldsymbol{w}}\in M}\| {\boldsymbol{w}} \|_{}\in\mathbb{R}^n,~~ M=\{{\boldsymbol{w}}\in\mathbb{R}^n\,|\,\widehat{\mathcal{R}}_S({\boldsymbol{w}})\le\widehat{\mathcal{R}}_S({\boldsymbol{v}})~\forall{\boldsymbol{v}}\in\mathbb{R}^n\}. \end{equation} \]

(248)
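In numerical experiments, the minimal norm minimizer (248) can be computed via the Moore–Penrose pseudoinverse from Appendix 19.1. The following numpy sketch is purely illustrative; the matrix and the data vector are random placeholders rather than the quantities of a concrete regression problem.

```python
import numpy as np

# Illustration only: A is a random placeholder for the matrix A_n in (245),
# y a random placeholder for the data vector.
rng = np.random.default_rng(0)
m, n = 18, 40
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Minimal norm minimizer (248): np.linalg.pinv computes the Moore-Penrose
# pseudoinverse, and A^dagger y is the least-squares solution of smallest
# Euclidean norm (unique even when A w = y has infinitely many solutions).
w_star = np.linalg.pinv(A) @ y

# np.linalg.lstsq returns the same minimal norm solution.
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(w_star, w_lstsq)
```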

16.1.2 An example

We consider a concrete example. In Figure 65 we plot a set of \( 40\) ansatz functions \( \phi_1,\dots,\phi_{40}\) , which are drawn from a Gaussian process. Additionally, the figure shows a plot of the Runge function \( f\) , and \( m=18\) equispaced points which are used as the training data points. We then fit a function in \( \rm{ span}\{\phi_1,\dots,\phi_n\}\) via (248) and (247). The result is displayed in Figure 68:

  • \( n=2\) : The model can only represent functions in \( \rm{ span}\{\phi_1,\phi_2\}\) . It is not yet expressive enough to give a meaningful approximation of \( f\) .
  • \( n=15\) : The model has sufficient expressivity to capture the main characteristics of \( f\) . Since \( n=15<18=m\) , it is not yet able to interpolate the data. Thus it allows us to strike a good balance between the approximation and generalization error, which corresponds to the scenario discussed in Chapter 15.
  • \( n=18\) : We are at the interpolation threshold. The model is capable of interpolating the data, and there is a unique \( {\boldsymbol{w}}\) such that \( \widehat{\mathcal{R}}_S({\boldsymbol{w}})=0\) . Yet, in between data points the behavior of the predictor \( f_{18}\) seems erratic, and displays strong oscillations. This is referred to as overfitting, and is to be expected due to our analysis in Chapter 15; while the approximation error at the data points has improved compared to the case \( n=15\) , the generalization error has gotten worse.
  • \( n=40\) : This is the overparameterized regime, where we have significantly more parameters than data points. Our prediction \( f_{40}\) interpolates the data and appears to be the best overall approximation to \( f\) so far, due to a “good” choice of minimizer of \( \widehat{\mathcal{R}}_S\) , namely (248). We also note that, while quite good, the fit is not perfect. We cannot expect significant improvement in performance by further increasing \( n\) , since at this point the main limiting factor is the amount of available data. Also see Figure 73 (a).
Figure 69. \( n=2\) (underparameterization)
Figure 70. \( n=15\) (balance of approximation and generalization error)
Figure 71. \( n=18\) (interpolation threshold)
Figure 72. \( n=40\) (overparameterization)
Figure 68. Fit of the \( m=18\) red data points using the ansatz functions \( \phi_1, \dots, \phi_n\) from Figure 65, employing equations (248) and (247) for different numbers of ansatz functions \( n\) .

Figure 73 (a) displays the error \( \| f-f_n \|_{{L^2([-1,1])}}\) over \( n\) . We observe the characteristic double descent curve, where the error initially decreases and then peaks at the interpolation threshold, which is marked by the dashed red line. Afterwards, in the overparameterized regime, it starts to decrease again. Figure 73 (b) displays \( \| {\boldsymbol{w}}_{n,*} \|_{}\) . Note how the Euclidean norm of the coefficient vector also peaks at the interpolation threshold.

We emphasize that the precise nature of the convergence curves depends strongly on various factors, such as the distribution and number of training points \( m\) , the ground truth \( f\) , and the choice of ansatz functions \( \phi_j\) (e.g., the specific kernel used to generate the \( \phi_j\) in Figure 65). In the present setting we achieve a good approximation of \( f\) for \( n=15<18=m\) , corresponding to the regime where the approximation and generalization errors are balanced. However, as Figure 73 (a) shows, it can be difficult to determine a suitable value of \( n<m\) a priori, and the acceptable range of \( n\) values can be quite narrow. For overparametrization (\( n\gg m\) ), the precise choice of \( n\) is less critical, potentially making the algorithm more stable in this regime. We encourage the reader to conduct similar experiments and explore different settings to get a better feeling for the double descent phenomenon.

Figure 74. \( \|f-f_n\|_{L^2([-1,1])}\)
Figure 75. \( \|{\boldsymbol{w}}_{n,*}\|\)
Figure 73. The \( L^2\) -error for the fitted functions in Figure 68, and the Euclidean norm of the corresponding coefficient vector \( {\boldsymbol{w}}_{n,*}\) defined in (248).
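A sketch for such an experiment is given below. It is only a starting point under several simplifying assumptions: the ansatz functions are random cosine features (a stand-in for the Gaussian process draws of Figure 65), the length scale and the random seed are arbitrary, and the \( L^2([-1,1])\) -error is approximated on a fine grid. The qualitative shape of the curves may vary with these choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def runge(x):                        # ground truth f(x) = 1 / (1 + 25 x^2)
    return 1.0 / (1.0 + 25.0 * x**2)

# smooth random ansatz functions phi_j(x) = cos(omega_j x + b_j),
# a stand-in for the Gaussian process draws used in Figure 65
n_max, length_scale = 40, 0.3
omega = rng.normal(0.0, 1.0 / length_scale, size=n_max)
bias = rng.uniform(0.0, 2.0 * np.pi, size=n_max)

def features(x, n):                  # feature map (phi_1(x), ..., phi_n(x))
    return np.cos(np.outer(x, omega[:n]) + bias[:n])

m = 18                               # number of equispaced training points
x_train = np.linspace(-1.0, 1.0, m)
y_train = runge(x_train)
x_grid = np.linspace(-1.0, 1.0, 2000)  # grid to approximate the L^2-error

errors, norms = [], []
for n in range(1, n_max + 1):
    A_n = features(x_train, n)                         # matrix A_n from (245)
    w, *_ = np.linalg.lstsq(A_n, y_train, rcond=None)  # minimal norm solution (248)
    f_n = features(x_grid, n) @ w                      # fitted function (247)
    errors.append(np.sqrt(2.0 * np.mean((runge(x_grid) - f_n) ** 2)))
    norms.append(np.linalg.norm(w))

# In this setting one typically observes both the error and the coefficient
# norm peaking near the interpolation threshold n = m, cf. Figure 73.
```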

16.2 Size of weights

In Figure 73, we observed that the norm of the coefficients \( \| {\boldsymbol{w}}_{n,*} \|_{}\) exhibits similar behavior to the \( L^2\) -error, peaking at the interpolation threshold \( n=18\) . In machine learning, large weights are usually undesirable, as they are associated with large derivatives or oscillatory behavior. This is evident in the example shown in Figure 68 for \( n=18\) . Assuming that the data in (244) was generated by a “smooth” function \( f\) , e.g., a function with moderate Lipschitz constant, these large derivatives of the prediction function could lead to poor generalization. Such a smoothness assumption about \( f\) may or may not be satisfied. However, if \( f\) is not smooth, there is little hope of accurately recovering \( f\) from limited data (see the discussion in Section 10.2).

The next result gives an explanation for the observed behavior of \( \| {\boldsymbol{w}}_{n,*} \|_{}\) .

Proposition 27

Assume that \( {\boldsymbol{x}}_1,\dots,{\boldsymbol{x}}_m\) and the \( (\phi_j)_{j\in\mathbb{N}}\) are such that \( {\boldsymbol{A}}_n\) in (245) has full rank \( n\) for all \( n\le m\) . Given \( {\boldsymbol{y}}\in \mathbb{R}^m\) , denote by \( {\boldsymbol{w}}_{n,*}({\boldsymbol{y}})\) the vector in (248). Then

\[ n\mapsto\sup_{\| {\boldsymbol{y}} \|_{}=1}\| {\boldsymbol{w}}_{n,*}({\boldsymbol{y}}) \|_{}~\text{is monotonically}~ \left\{ \begin{array}{ll} \text{increasing} &\text{for }n< m,\\ \text{decreasing} &\text{for }n\ge m. \end{array}\right. \]

Proof

We start with the case \( n\ge m\) . By assumption \( {\boldsymbol{A}}_m\) has full rank \( m\) , and thus \( {\boldsymbol{A}}_n\) has rank \( m\) for all \( n\ge m\) , see (245). In particular, there exists \( {\boldsymbol{w}}_n\in\mathbb{R}^n\) such that \( {\boldsymbol{A}}_n{\boldsymbol{w}}_{n}={\boldsymbol{y}}\) . Now fix \( {\boldsymbol{y}}\in\mathbb{R}^m\) and let \( {\boldsymbol{w}}_n\) be any such vector. Then \( {\boldsymbol{w}}_{n+1}\mathrm{:}= ({\boldsymbol{w}}_n,0)\in\mathbb{R}^{n+1}\) satisfies \( {\boldsymbol{A}}_{n+1}{\boldsymbol{w}}_{n+1}={\boldsymbol{y}}\) and \( \| {\boldsymbol{w}}_{n+1} \|_{}=\| {\boldsymbol{w}}_n \|_{}\) . Thus necessarily \( \| {\boldsymbol{w}}_{n+1,*} \|_{}\le\| {\boldsymbol{w}}_{n,*} \|_{}\) for the minimal norm solutions defined in (248). Since this holds for every \( {\boldsymbol{y}}\) , we obtain the statement for \( n\ge m\) .

Now let \( n<m\) . Recall that the minimal norm solution can be written through the pseudo inverse

\[ {\boldsymbol{w}}_{n,*}({\boldsymbol{y}})={\boldsymbol{A}}_n^\dagger {\boldsymbol{y}}, \]

see Appendix 19.1. That is,

\[ {\boldsymbol{A}}_n^\dagger = {\boldsymbol{V}}_n\begin{pmatrix} s_{n,1}^{-1}& & &\\ &\ddots &&\boldsymbol{0}\\ && s_{n,n}^{-1}& \end{pmatrix} {\boldsymbol{U}}_n^\top\in\mathbb{R}^{n\times m} \]

where \( {\boldsymbol{A}}_n={\boldsymbol{U}}_n{\boldsymbol{ \Sigma }}_n{\boldsymbol{V}}_n^\top\) is the singular value decomposition of \( {\boldsymbol{A}}_n\) , and

\[ {\boldsymbol{ \Sigma }}_n = \begin{pmatrix} s_{n,1} & &\\ & \ddots &\\ &&s_{n,n}\\ &\boldsymbol{0}& \end{pmatrix}\in\mathbb{R}^{m\times n} \]

contains the singular values \( s_{n,1}\ge\dots\ge s_{n,n}>0\) of \( {\boldsymbol{A}}_n\in\mathbb{R}^{m\times n}\) ordered by decreasing size. Since \( {\boldsymbol{V}}_n\in\mathbb{R}^{n\times n}\) and \( {\boldsymbol{U}}_n\in\mathbb{R}^{m\times m}\) are orthogonal matrices, we have

\[ \sup_{\| {\boldsymbol{y}} \|_{}=1}\| {\boldsymbol{w}}_{n,*}({\boldsymbol{y}}) \|_{}=\sup_{\| {\boldsymbol{y}} \|_{}=1}\| {\boldsymbol{A}}_n^\dagger{\boldsymbol{y}} \|_{} =s_{n,n}^{-1}. \]

Finally, since the minimal singular value \( s_{n,n}\) of \( {\boldsymbol{A}}_n\) can be written as

\[ s_{n,n}=\inf_{\substack{{\boldsymbol{x}}\in\mathbb{R}^n\\ \| {\boldsymbol{x}} \|_{}=1}}\| {\boldsymbol{A}}_n{\boldsymbol{x}} \|_{}\ge \inf_{\substack{{\boldsymbol{x}}\in\mathbb{R}^{n+1}\\ \| {\boldsymbol{x}} \|_{}=1}}\| {\boldsymbol{A}}_{n+1}{\boldsymbol{x}} \|_{} = s_{n+1,n+1}, \]

we observe that \( n\mapsto s_{n,n}\) is monotonically decreasing for \( n\le m\) . This concludes the proof.
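The quantity appearing in Proposition 27 is the spectral norm of the pseudoinverse \( {\boldsymbol{A}}_n^\dagger\) , i.e., the reciprocal of the smallest nonzero singular value of \( {\boldsymbol{A}}_n\) . A quick numerical check, with a random matrix standing in for the feature matrix \( (\phi_j({\boldsymbol{x}}_i))_{i,j}\) :

```python
import numpy as np

rng = np.random.default_rng(2)
m, n_max = 18, 40
A_full = rng.standard_normal((m, n_max))  # placeholder for (phi_j(x_i))_{i,j}

op_norms = []
for n in range(1, n_max + 1):
    A_n = A_full[:, :n]
    # spectral norm of A_n^dagger = sup_{||y||=1} ||A_n^dagger y||
    # = 1 / (smallest nonzero singular value of A_n)
    op_norms.append(np.linalg.norm(np.linalg.pinv(A_n), ord=2))

# Up to numerical noise, op_norms is increasing for n < m and decreasing
# for n >= m, in accordance with Proposition 27.
```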

16.3 Theoretical justification

Let us now examine one possible explanation of the double descent phenomenon for neural networks. While there are many alternative arguments available in the literature (see the bibliography section), the explanation presented here is based on a simplification of the ideas in [11].

The key assumption underlying our analysis is that large overparameterized neural networks tend to be Lipschitz continuous with a Lipschitz constant independent of their size. This is a consequence of neural networks typically having relatively small weights. To motivate this, let us consider the class of neural networks \( \mathcal{N}(\sigma;\mathcal{A},B)\) for an architecture \( \mathcal{A}\) of width \( d\in\mathbb{N}\) and depth \( L\in\mathbb{N}\) . If \( \sigma\) is \( C_\sigma\) -Lipschitz continuous with \( C_\sigma \geq 1\) , and \( B \leq c_B \cdot (d C_\sigma)^{-1}\) for some \( c_B >0\) , then by Lemma 33

\[ \begin{align} \mathcal{N}(\sigma; \mathcal{A}, B) \subseteq \mathrm{Lip}_{c_B^L}(\mathbb{R}^{d_0}). \end{align} \]

(249)

An assumption of the type \( B \leq c_B \cdot (d C_\sigma)^{-1}\) , i.e., a scaling of the weights by the reciprocal \( 1/d\) of the width, is not unreasonable in practice: Standard initialization schemes, such as LeCun [12] or He [13] initialization, use random weights with variance scaled inversely proportional to the input dimension of each layer. Moreover, as we saw in Chapter 12, for very wide neural networks, the weights do not move significantly from their initialization during training. Additionally, many training routines use regularization terms on the weights, thereby encouraging the optimization routine to find small weights.

We study the generalization capacity of Lipschitz functions through the covering-number-based learning results of Chapter 15. The set \( \mathrm{Lip}_C(\Omega)\) of \( C\) -Lipschitz functions on a compact \( d\) -dimensional Euclidean domain \( \Omega\) has covering numbers bounded according to

\[ \begin{align} \log( \mathcal{G}(\mathrm{Lip}_C(\Omega), \varepsilon, L^\infty)) \le C_{\rm cov} \cdot \left(\frac{C}{\varepsilon}\right)^d ~ \text{ for all } \varepsilon >0 \end{align} \]

(250)

for some constant \( C_{\rm {cov}}\) independent of \( \varepsilon>0\) . A proof can be found in [1, Lemma 7], see also [2].

As a result of these considerations, we can identify two regimes:

  • Standard regime: For small neural network size \( n_{\mathcal{A}}\) , we consider neural networks as a set parameterized by \( n_\mathcal{A}\) parameters. As we have seen before, this yields a bound on the generalization error that scales linearly with \( n_\mathcal{A}\) . As long as \( n_\mathcal{A}\) is small in comparison to the number of samples, we can expect good generalization by Theorem 35.
  • Overparameterized regime: For large neural network size \( n_{\mathcal{A}}\) and small weights, we consider neural networks as a subset of \( \mathrm{Lip}_C(\Omega)\) for a constant \( C>0\) . This set has a covering number bound that is independent of the number of parameters \( n_\mathcal{A}\) .

Choosing the better of the two generalization bounds for each regime yields the following result. Recall that \( \mathcal{N}^*(\sigma;\mathcal{A},B)\) denotes all neural networks in \( \mathcal{N}(\sigma;\mathcal{A},B)\) with a range contained in \( [-1,1]\) (see (239)).

Theorem 40

Let \( C\) , \( C_{\mathcal{L}} >0\) and let \( \mathcal{L} \colon [-1,1] \times [-1,1] \to \mathbb{R}\) be \( C_{\mathcal{L}}\) -Lipschitz. Further, let \( \mathcal{A} = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}\) , let \( \sigma\colon \mathbb{R} \to \mathbb{R}\) be \( C_\sigma\) -Lipschitz continuous with \( C_\sigma \geq 1\) , and \( |\sigma(x)| \leq C_\sigma|x|\) for all \( x \in \mathbb{R}\) , and let \( B>0\) .

Then, there exist \( c_1\) , \( c_2>0\) , such that for every \( m \in \mathbb{N}\) and every distribution \( \mathcal{D}\) on \( [-1,1]^{d_0} \times [-1,1]\) it holds with probability at least \( 1-\delta\) over \( S \sim \mathcal{D}^m\) that for all \( \Phi \in \mathcal{N}^*(\sigma; \mathcal{A}, B) \cap \mathrm{Lip}_C([-1,1]^{d_0})\)

\[ \begin{align} |\mathcal{R}(\Phi) -\widehat{\mathcal{R}}_S(\Phi) |&\leq g(\mathcal{A}, C_\sigma , B,m) + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}, \end{align} \]

(251)

where

\[ \begin{align*} \nonumber g(\mathcal{A}, C_\sigma , B, m) = \min\left\{c_1 \sqrt{\frac{n_{\mathcal{A}} \log(n_{\mathcal{A}} \lceil \sqrt{m}\rceil) + L n_{\mathcal{A}}\log(d_{\rm max})}{m} }, c_2 m ^{-\frac{1}{2+d_0}}\right\}. \end{align*} \]

Proof

Applying Theorem 33 with \( \alpha = 1/(2+d_0)\) and (250), we obtain that with probability at least \( 1-\delta/2\) it holds for all \( \Phi \in \mathrm{Lip}_C([-1,1]^{d_0})\)

\[ \begin{align*} |\mathcal{R}(\Phi) - \widehat{\mathcal{R}}_S(\Phi)| &\leq 4 C_{\mathcal{L}} \sqrt{\frac{{C_{\rm cov}} (m^{\alpha}C )^{d_0} + \log(4/\delta)}{m}} + \frac{2C_{\mathcal{L}}}{m^{\alpha}}\\ & \leq 4 C_{\mathcal{L}} \sqrt{{C_{\rm cov}} C^{d_0} (m^{d_0/(d_0+2) - 1})} + \frac{2C_{\mathcal{L}}}{m^{\alpha}} + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}\\ & = 4 C_{\mathcal{L}} \sqrt{{C_{\rm cov}} C^{d_0} (m^{-2/(d_0+2)})} + \frac{2C_{\mathcal{L}}}{m^{\alpha}} + 4 C_{\mathcal{L}}\sqrt{\frac{\log(4/\delta)}{m}}\\ & = \frac{(4 C_{\mathcal{L}} \sqrt{{C_{\rm cov}} C^{d_0}} + 2C_{\mathcal{L}})}{m^{\alpha}} + 4 C_{\mathcal{L}}\sqrt{\frac{\log(4/\delta)}{m}}, \end{align*} \]

where we used in the second inequality that \( \sqrt{x+y}\le \sqrt{x}+\sqrt{y}\) for all \( x\) , \( y\ge 0\) .

In addition, Theorem 35 yields that with probability at least \( 1-\delta/2\) it holds for all \( \Phi \in \mathcal{N}^*(\sigma; \mathcal{A}, B)\)

\[ \begin{align*} |\mathcal{R}(\Phi) - \widehat{\mathcal{R}}_S(\Phi)| &\leq 4 C_{\mathcal{L}} \sqrt{\frac{n_{\mathcal{A}} \log( \lceil n_{\mathcal{A}} \sqrt{m}\rceil ) + L n_{\mathcal{A}} \log( \lceil 2 C_\sigma B d_{\rm max} \rceil ) + \log(4/\delta)}{m}}\\ &~~ + \frac{2C_{\mathcal{L}}}{\sqrt{m}}\\ & \leq 6 C_{\mathcal{L}} \sqrt{\frac{n_{\mathcal{A}} \log( \lceil n_{\mathcal{A}} \sqrt{m}\rceil ) + L n_{\mathcal{A}} \log( \lceil 2 C_\sigma B d_{\rm max} \rceil )}{m}}\\ &~~ +4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}. \end{align*} \]

Then, for \( \Phi \in \mathcal{N}^*(\sigma; \mathcal{A}, B) \cap \mathrm{Lip}_C([-1,1]^{d_0})\) the minimum of both upper bounds holds with probability at least \( 1-\delta\) .

The two regimes in Theorem 40 correspond to the two terms comprising the minimum in the definition of \( g(\mathcal{A}, C_\sigma , B, m)\) . The first term increases with \( n_\mathcal{A}\) while the second is constant. In the first regime, where the first term is smaller, the bound on the generalization gap \( |\mathcal{R}(\Phi) - \widehat{\mathcal{R}}_S(\Phi)|\) increases with \( n_\mathcal{A}\) .

In the second regime, where the second term is smaller, the bound on the generalization gap is independent of \( n_\mathcal{A}\) . Moreover, it is reasonable to assume that the empirical risk \( \widehat{\mathcal{R}}_S\) decreases as the number of parameters \( n_\mathcal{A}\) increases.

By (251) we can bound the risk by

\[ \mathcal{R}(\Phi) \leq \widehat{\mathcal{R}}_S(\Phi) + g(\mathcal{A}, C_\sigma , B,m) + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}. \]

In the second regime, this upper bound is monotonically decreasing. In the first regime it may both decrease and increase. In some cases, this behavior can lead to an upper bound on the risk resembling the curve of Figure 64.
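The shape of this upper bound can be inspected numerically by evaluating the two terms in the minimum defining \( g(\mathcal{A}, C_\sigma, B, m)\) for growing architectures. In the sketch below, the constants \( c_1, c_2\) , the fixed depth, and the crude parameter count are arbitrary illustrative choices and not quantities derived from the proofs.

```python
import numpy as np

def g_bound(n_A, L, d_max, d0, m, c1=1.0, c2=1.0):
    """Evaluate the two terms in the minimum defining g(A, C_sigma, B, m)."""
    term_param = c1 * np.sqrt(
        (n_A * np.log(n_A * np.ceil(np.sqrt(m))) + L * n_A * np.log(d_max)) / m
    )
    term_lip = c2 * m ** (-1.0 / (2.0 + d0))
    return min(term_param, term_lip), term_param, term_lip

m, d0, L = 1000, 5, 3
for width in [4, 16, 64, 256, 1024]:
    # crude parameter count for a network with L hidden layers of equal width
    n_A = d0 * width + (L - 1) * width**2 + width
    g, t_param, t_lip = g_bound(n_A, L, width, d0, m)
    print(width, round(t_param, 3), round(t_lip, 3), round(g, 3))

# The first term grows with the number of parameters n_A, the second does not
# depend on n_A; their minimum therefore levels off for large networks.
```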

Remark 18

Theorem 40 assumes \( C\) -Lipschitz continuity of the neural networks. As we saw in Sections 16.1.2 and 16.2, this assumption may not hold near the interpolation threshold. Hence, Theorem 40 likely gives an overly optimistic upper bound near the interpolation threshold.

Bibliography and further reading

The discussion on kernel regression and the effect of the number of parameters on the norm of the weights was already given in [10]. Similar analyses, with more complex ansatz systems and more precise asymptotic estimates, are found in [16, 17]. Our results in Section 16.3 are inspired by [11]; see also [18].

For a detailed account of further arguments justifying the surprisingly good generalization capabilities of overparameterized neural networks, we refer to [19, Section 2]. Here, we only briefly mention two additional directions of inquiry. First, if the learning algorithm introduces a form of robustness, this can be leveraged to yield generalization bounds [20, 21, 22, 23]. Second, for very overparameterized neural networks, it was stipulated in [24] that neural networks become linear kernel interpolators as discussed in Chapter 12. Thus, for large neural networks, generalization can be studied through kernel regression [24, 25, 26, 27].

Exercises

Exercise 60

Let \( f:[-1,1]\to\mathbb{R}\) be a continuous function, and let \( -1\le x_1<\dots<x_m\le 1\) for some fixed \( m\in\mathbb{N}\) . As in Section 16.1.2, we wish to approximate \( f\) by a least squares approximation. To this end we use the Fourier ansatz functions

\[ \begin{equation} b_0(x)\mathrm{:}= \frac{1}{2}~~\text{and}~~ b_j(x)\mathrm{:}= \left\{ \begin{array}{ll} \sin(\lceil\frac j2\rceil\pi x) &j\ge 1\text{ is odd}\\ \cos(\lceil\frac j2\rceil\pi x) &j\ge 1\text{ is even}. \end{array}\right. \end{equation} \]

(252)

With the empirical risk

\[ \widehat{\mathcal{R}}_S({\boldsymbol{w}})= \frac{1}{m}\sum_{j=1}^m\Big(\sum_{i=0}^n w_ib_i(x_j) -y_j\Big)^2, \]

denote by \( {\boldsymbol{w}}_{*}^n\in\mathbb{R}^{n+1}\) the minimal norm minimizer of \( \widehat{\mathcal{R}}_S\) , and set \( f_n(x)\mathrm{:}= \sum_{i=0}^n w_{*,i}^nb_i(x)\) .

Show that in this case generalization fails in the overparametrized regime: for sufficiently large \( n\gg m\) , \( f_n\) is not necessarily a good approximation to \( f\) . What does \( f_n\) converge to as \( n\to\infty\) ?

Exercise 61

Consider the setting of Exercise 60. We adapt the ansatz functions in (252) by rescaling them via

\[ \tilde b_j\mathrm{:}= c_j b_j. \]

Choose real numbers \( c_j\in\mathbb{R}\) , such that the corresponding minimal norm least squares solution avoids the phenomenon encountered in Exercise 60.

Hint: Should ansatz functions corresponding to large frequencies be scaled by large or small numbers to avoid overfitting?

Exercise 62

Prove (250) for \( d = 1\) .

17 Robustness and adversarial examples

Figure 76. Sketch of an adversarial example.

How sensitive is the output of a neural network to small changes in its input? Real-world observations of trained neural networks often reveal that even barely noticeable modifications of the input can lead to drastic variations in the network’s predictions. This intriguing behavior was first documented in the context of image classification in [277].

Figure 76 illustrates this concept. The left panel shows a picture of a panda that the neural network correctly classifies as a panda. By adding an almost imperceptible amount of noise to the image, we obtain the modified image in the right panel. To a human, there is no visible difference, but the neural network classifies the perturbed image as a wombat. This phenomenon, where a correctly classified image is misclassified after a slight perturbation, is termed an adversarial example.

In practice, such behavior is highly undesirable. It indicates that our learning algorithm might not be very reliable and poses a potential security risk, as malicious actors could exploit it to trick the algorithm. In this chapter, we describe the basic mathematical principles behind adversarial examples and investigate simple conditions under which they might or might not occur. For simplicity, we restrict ourselves to a binary classification problem but note that the main ideas remain valid in more general situations.

17.1 Adversarial examples

Let us start by formalizing the notion of an adversarial example. We consider the problem of assigning a label \( y\in\{-1,1\}\) to a vector \( {\boldsymbol{x}}\in\mathbb{R}^d\) . It is assumed that the relation between \( {\boldsymbol{x}}\) and \( y\) is described by a distribution \( \mathcal{D}\) on \( \mathbb{R}^d\times\{-1,1\}\) . In particular, for a given \( {\boldsymbol{x}}\) , both values \( -1\) and \( 1\) could have positive probability, i.e., the label is not necessarily deterministic. Additionally, we let

\[ \begin{equation} D_{\boldsymbol{x}}\mathrm{:}= \{{\boldsymbol{x}}\in\mathbb{R}^d\,|\,\exists y\text{ s.t. }({\boldsymbol{x}},y)\in\rm{supp}(\mathcal{D})\}, \end{equation} \]

(253)

and refer to \( D_{\boldsymbol{x}}\) as the feature support. Throughout this chapter we denote by

\[ g \colon \mathbb{R}^{d} \to \{-1,0,1\} \]

a fixed so-called ground-truth classifier, satisfying

\[ \begin{equation} \mathbb{P}[y=g({\boldsymbol{x}})|{\boldsymbol{x}}]\ge \mathbb{P}[y=-g({\boldsymbol{x}})|{\boldsymbol{x}}]~~\text{for all }{\boldsymbol{x}}\in D_{\boldsymbol{x}}. \end{equation} \]

(254)

Note that we allow \( g\) to take the value \( 0\) , which is to be understood as an additional label corresponding to nonrelevant or nonsensical input data \( {\boldsymbol{x}}\) . We will refer to \( g^{-1}(0)\) as the nonrelevant class. The ground truth \( g\) is interpreted as how a human would classify the data, as the following example illustrates.

Example 16

We wish to classify whether an image shows a panda (\( y=1\) ) or a wombat (\( y=-1\) ). Consider again Figure 76, and denote the three images by \( {\boldsymbol{x}}_1\) , \( {\boldsymbol{x}}_2\) , \( {\boldsymbol{x}}_3\) . The first image \( {\boldsymbol{x}}_1\) is a photograph of a panda. Together with a label \( y\) , it can be interpreted as a draw \( ({\boldsymbol{x}}_1,y)\) from a distribution of images \( \mathcal{D}\) , i.e., \( {\boldsymbol{x}}_1\in D_{\boldsymbol{x}}\) and \( g({\boldsymbol{x}}_1)=1\) . The second image \( {\boldsymbol{x}}_2\) displays noise and corresponds to nonrelevant data as it shows neither a panda nor a wombat. In particular, \( {\boldsymbol{x}}_2\in D_{\boldsymbol{x}}^c\) and \( g({\boldsymbol{x}}_2)=0\) . The third (perturbed) image \( {\boldsymbol{x}}_3\) also belongs to \( D_{\boldsymbol{x}}^c\) , as it is not a photograph but a noise-corrupted version of \( {\boldsymbol{x}}_1\) . Nonetheless, it is not nonrelevant, as a human would classify it as a panda. Thus \( g({\boldsymbol{x}}_3)=1\) .

Additional to the ground truth \( g\) , we denote by

\[ h \colon \mathbb{R}^{d} \to \{-1,1\} \]

some trained classifier.

Definition 37

Let \( g \colon \mathbb{R}^{d} \to \{-1,0,1\}\) be the ground-truth classifier, let \( h \colon \mathbb{R}^{d} \to \{-1,1\}\) be a classifier, and let \( \| \cdot \|_{*}\) be a norm on \( \mathbb{R}^d\) . For \( {\boldsymbol{x}}\in\mathbb{R}^d\) and \( \delta >0\) , we call \( {\boldsymbol{x}}' \in \mathbb{R}^{d}\) an adversarial example to \( {\boldsymbol{x}} \in \mathbb{R}^{d}\) with perturbation \( \delta\) , if and only if

  1. \( \| {\boldsymbol{x}}' - {\boldsymbol{x}} \|_{*} \leq \delta\) ,
  2. \( g({\boldsymbol{x}})g({\boldsymbol{x}}')>0\) ,
  3. \( h({\boldsymbol{x}}) = g({\boldsymbol{x}})\) and \( h({\boldsymbol{x}}')\neq g({\boldsymbol{x}}')\) .

In words, \( {\boldsymbol{x}}'\) is an adversarial example to \( {\boldsymbol{x}}\) with perturbation \( \delta\) , if (i) the distance of \( {\boldsymbol{x}}\) and \( {\boldsymbol{x}}'\) is at most \( \delta\) , (ii) \( {\boldsymbol{x}}\) and \( {\boldsymbol{x}}'\) belong to the same (not nonrelevant) class according to the ground truth classifier, and (iii) the classifier \( h\) correctly classifies \( {\boldsymbol{x}}\) but misclassifies \( {\boldsymbol{x}}'\) .

Remark 19

We emphasize that the concept of a ground-truth classifier \( g\) differs from a minimizer of the Bayes risk (230) for two reasons. First, we allow for an additional label \( 0\) corresponding to the nonrelevant class, which does not exist for the data generating distribution \( \mathcal{D}\) . Second, \( g\) should correctly classify points outside of \( D_{\boldsymbol{x}}\) ; small perturbations of images, as we find them in adversarial examples, are not regular images in \( D_{\boldsymbol{x}}\) . Nonetheless, a human classifier can still classify these images, and \( g\) models this property of human classification.

17.2 Bayes classifier

At first sight, an adversarial example seems to be no more than a misclassified sample. Naturally, these exist if the model does not generalize well. In this section we present the more nuanced view of [278].

To avoid edge cases, we assume in the following that for all \( {\boldsymbol{x}}\in D_{\boldsymbol{x}}\)

\[ \begin{equation} \text{either} ~\mathbb{P}[y=1|{\boldsymbol{x}}]>\mathbb{P}[y=-1|{\boldsymbol{x}}]~\text{or}~ \mathbb{P}[y=1|{\boldsymbol{x}}]<\mathbb{P}[y=-1|{\boldsymbol{x}}] \end{equation} \]

(255)

so that (254) uniquely defines \( g({\boldsymbol{x}})\) for \( {\boldsymbol{x}}\in D_{\boldsymbol{x}}\) . We say that the distribution exhausts the domain if \( D_{\boldsymbol{x}} \cup g^{-1}(0) = \mathbb{R}^{d}\) . This means that every point is either in the feature support \( D_{\boldsymbol{x}}\) or it belongs to the nonrelevant class. Moreover, we say that \( h\) is a Bayes classifier if

\[ \mathbb{P}[h({\boldsymbol{x}}) | {\boldsymbol{x}}] \geq \mathbb{P}[-h({\boldsymbol{x}}) | {\boldsymbol{x}}]~~\text{for all }{\boldsymbol{x}}\in D_{\boldsymbol{x}}. \]

By (254), the ground truth \( g\) is a Bayes classifier, and (255) ensures that \( h\) coincides with \( g\) on \( D_{\boldsymbol{x}}\) if \( h\) is a Bayes classifier. It is easy to see that a Bayes classifier minimizes the Bayes risk.

With these two notions, we now distinguish between four cases.

  1. Bayes classifier/exhaustive distribution: If \( h\) is a Bayes classifier and the data exhausts the domain, then there are no adversarial examples. This is because every \( {\boldsymbol{x}}\in \mathbb{R}^d\) either belongs to the nonrelevant class or is classified the same by \( h\) and \( g\) .
  2. Bayes classifier/non-exhaustive distribution: If \( h\) is a Bayes classifier and the distribution does not exhaust the domain, then adversarial examples can exist. Even though the learned classifier \( h\) coincides with the ground truth \( g\) on the feature support, adversarial examples can be constructed for data points on the complement of \( D_{\boldsymbol{x}} \cup g^{-1}(0)\) , which is not empty.
  3. Not a Bayes classifier/exhaustive distribution: The set \( D_{\boldsymbol{x}}\) can be covered by the four subdomains

    \[ \begin{equation} \begin{aligned} C_1 &= h^{-1}(1) \cap g^{-1}(1), ~ F_1 = h^{-1}(-1) \cap g^{-1}(1),\\ C_{-1} &= h^{-1}(-1) \cap g^{-1}(-1), ~ F_{-1} = h^{-1}(1) \cap g^{-1}(-1). \end{aligned} \end{equation} \]

    (256)

    If \( \mathrm{dist}(C_1 \cap D_{\boldsymbol{x}}, F_1 \cap D_{\boldsymbol{x}})\) or \( \mathrm{dist}(C_{-1} \cap D_{\boldsymbol{x}}, F_{-1} \cap D_{\boldsymbol{x}})\) is smaller than \( \delta\) , then there exist points \( {\boldsymbol{x}}\) , \( {\boldsymbol{x}}' \in D_{\boldsymbol{x}}\) such that \( {\boldsymbol{x}}'\) is an adversarial example to \( x\) with perturbation \( \delta\) . Hence, adversarial examples in the feature support can exist. This is, however, not guaranteed to happen. For example, \( D_{\boldsymbol{x}}\) does not need to be connected if \( g^{-1}(0) \neq \emptyset\) , see Exercise 65. Hence, even for classifiers that have incorrect predictions on the data, adversarial examples do not need to exist.

  4. Not a Bayes classifier/non-exhaustive distribution: In this case everything is possible. Adversarial examples to elements of the feature support can lie within the feature support itself, or they can be created by leaving the feature support. We will see examples in the following section.

17.3 Affine classifiers

For linear classifiers, a simple argument outlined in [277] and [279] showcases that the high-dimensionality of the input, common in image classification problems, is a potential cause for the existence of adversarial examples.

A linear classifier is a map of the form

\[ {\boldsymbol{x}}\mapsto \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}})~~ \text{where }{\boldsymbol{w}}, {\boldsymbol{x}} \in \mathbb{R}^d. \]

Let

\[ {\boldsymbol{x}}' \mathrm{:}= {\boldsymbol{x}} - 2 |{\boldsymbol{w}}^\top {\boldsymbol{x}}| \frac{\mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}}) \mathrm{sign}({\boldsymbol{w}})}{\|{\boldsymbol{w}}\|_1} \]

where \( \mathrm{sign}({\boldsymbol{w}})\) is understood coordinate-wise. Then \( \|{\boldsymbol{x}} - {\boldsymbol{x}}'\|_\infty \leq 2 |{\boldsymbol{w}}^\top {\boldsymbol{x}}| /\|{\boldsymbol{w}}\|_1\) and it is not hard to see that \( \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}}') \neq \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}})\) .

For high-dimensional vectors \( {\boldsymbol{w}}\) , \( {\boldsymbol{x}}\) chosen at random (possibly dependently), such that \( {\boldsymbol{w}}\) is uniformly distributed on the \( (d-1)\) -dimensional sphere, it holds with high probability that

\[ \begin{align*} \frac{|{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|_1} \leq \frac{\|{\boldsymbol{x}}\| \|{\boldsymbol{w}}\|}{\|{\boldsymbol{w}}\|_1} \ll \|{\boldsymbol{x}}\|. \end{align*} \]

This can be seen by noting that for every \( c>0\)

\[ \begin{align} \mu(\{{\boldsymbol{w}} \in \mathbb{R}^d \,|\,\|{\boldsymbol{w}}\|_1 > c, \|{\boldsymbol{w}}\| \leq 1\} ) \to 1 \text{ for } d \to \infty, \end{align} \]

(257)

where \( \mu\) is the uniform probability measure on the \( d\) -dimensional Euclidean unit ball, see Exercise 64. Thus, if \( {\boldsymbol{x}}\) has a moderate Euclidean norm, the perturbation needed to produce \( {\boldsymbol{x}}'\) is likely small in high dimensions.
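Both observations, the explicit sign-flipping perturbation and the fact that \( |{\boldsymbol{w}}^\top {\boldsymbol{x}}|/\|{\boldsymbol{w}}\|_1\) is typically tiny in high dimensions, can be checked in a few lines. The sketch below uses random draws of \( {\boldsymbol{w}}\) and \( {\boldsymbol{x}}\) and is an illustration, not part of the formal argument.

```python
import numpy as np

rng = np.random.default_rng(3)

for d in [10, 100, 1000, 10000]:
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)          # w uniformly distributed on the unit sphere
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)          # normalize so that ||x|| = 1

    # sign-flipping perturbation from the construction above
    x_adv = x - 2.0 * abs(w @ x) * np.sign(w @ x) * np.sign(w) / np.linalg.norm(w, 1)

    flipped = np.sign(w @ x_adv) != np.sign(w @ x)
    pert = np.linalg.norm(x - x_adv, np.inf)   # equals 2 |w^T x| / ||w||_1
    print(d, bool(flipped), float(pert))

# The sup-norm perturbation shrinks as d grows, while the label assigned by
# the linear classifier sign(w^T x) always flips.
```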

Below we give a sufficient condition for the existence of adversarial examples, in case both \( h\) and the ground truth \( g\) are linear classifiers.

Theorem 41

Let \( {\boldsymbol{w}}\) , \( \overline{{\boldsymbol{w}}} \in \mathbb{R}^{d}\) be nonzero. For \( {\boldsymbol{x}} \in \mathbb{R}^d\) , let \( h({\boldsymbol{x}}) = \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}})\) be a classifier and let \( g({\boldsymbol{x}}) = \mathrm{sign}(\overline{{\boldsymbol{w}}}^\top x)\) be the ground-truth classifier.

For every \( {\boldsymbol{x}}\in \mathbb{R}^{d}\) with \( h({\boldsymbol{x}})g({\boldsymbol{x}})>0\) and all \( \varepsilon \in (0, |{\boldsymbol{w}}^\top {\boldsymbol{x}}|)\) such that

\[ \begin{align} \frac{|\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}|}{\|\overline{{\boldsymbol{w}}}\|} > \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|} \frac{|{\boldsymbol{w}}^\top \overline{{\boldsymbol{w}}}|}{\|{\boldsymbol{w}}\| \| \overline{{\boldsymbol{w}}}\|} \end{align} \]

(258)

it holds that

\[ \begin{align} {\boldsymbol{x}}' = {\boldsymbol{x}} - h({\boldsymbol{x}}) \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|^2} {\boldsymbol{w}} \end{align} \]

(259)

is an adversarial example to \( {\boldsymbol{x}}\) with perturbation \( \delta = (\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|)/{\|{\boldsymbol{w}}\|}\) .

Before we present the proof, we give some interpretation of this result. First, note that \( \{{\boldsymbol{x}}\in\mathbb{R}^d\,|\,{\boldsymbol{w}}^\top{\boldsymbol{x}} = 0\}\) is the decision boundary of \( h\) , meaning that points lying on opposite sides of this hyperplane are classified differently by \( h\) . Due to \( |{\boldsymbol{w}}^\top\overline{{\boldsymbol{w}}}|\le \| {\boldsymbol{w}} \|_{}\| \overline{{\boldsymbol{w}}} \|_{}\) , (258) implies that an adversarial example always exists whenever

\[ \begin{equation} \frac{|\overline{{\boldsymbol{w}}}^\top{\boldsymbol{x}}|}{\| \overline{{\boldsymbol{w}}} \|_{}}> \frac{|{\boldsymbol{w}}^\top{\boldsymbol{x}}|}{\| {\boldsymbol{w}} \|_{}}. \end{equation} \]

(260)

The left term is the decision margin of \( {\boldsymbol{x}}\) for \( g\) , i.e., the distance of \( {\boldsymbol{x}}\) to the decision boundary of \( g\) . Similarly, the term on the right is the decision margin of \( {\boldsymbol{x}}\) for \( h\) . Thus we conclude that adversarial examples exist if the decision margin of \( {\boldsymbol{x}}\) for the ground truth \( g\) is larger than that for the classifier \( h\) .

Second, the term \( ({\boldsymbol{w}}^\top \overline{{\boldsymbol{w}}})/{(\|{\boldsymbol{w}}\| \| \overline{{\boldsymbol{w}}}\|)}\) describes the alignment of the two classifiers. If the classifiers are not aligned, i.e., \( {\boldsymbol{w}}\) and \( \overline{{\boldsymbol{w}}}\) have a large angle between them, then adversarial examples exist even if the margin of the classifier is larger than that of the ground-truth classifier.

Finally, adversarial examples with small perturbation are possible if \( |{\boldsymbol{w}}^\top{\boldsymbol{x}}|\ll\| {\boldsymbol{w}} \|_{}\) . The extreme case \( {\boldsymbol{w}}^\top{\boldsymbol{x}}=0\) means that \( {\boldsymbol{x}}\) lies on the decision boundary of \( h\) , and if \( |{\boldsymbol{w}}^\top{\boldsymbol{x}}|\ll \| {\boldsymbol{w}} \|_{}\) then \( {\boldsymbol{x}}\) is close to the decision boundary of \( h\) .

Proof (of Theorem 41)

We verify that \( {\boldsymbol{x}}'\) in (259) satisfies the conditions of an adversarial example in Definition 37. In the following we will use that due to \( h({\boldsymbol{x}})g({\boldsymbol{x}})>0\)

\[ \begin{equation} g({\boldsymbol{x}}) = {\rm sign}(\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}) ={\rm sign}({\boldsymbol{w}}^\top {\boldsymbol{x}}) = h({\boldsymbol{x}})\neq 0. \end{equation} \]

(261)

First, it holds

\[ \begin{align*} \|{\boldsymbol{x}} - {\boldsymbol{x}}'\| = \left\| \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|^2} {\boldsymbol{w}}\right\| =\frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|} = \delta. \end{align*} \]

Next we show \( g({\boldsymbol{x}})g({\boldsymbol{x}}') > 0\) , i.e., that \( (\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}) (\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}')\) is positive. Plugging in the definition of \( {\boldsymbol{x}}'\) , this term reads

\[ \begin{align} \overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}} \left(\overline{{\boldsymbol{w}}}^\top{\boldsymbol{x}} - h({\boldsymbol{x}}) \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|^2}\overline{{\boldsymbol{w}}}^\top {\boldsymbol{w}}\right) &= |\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}|^2 - |\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}| \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|^2}\overline{{\boldsymbol{w}}}^\top {\boldsymbol{w}}\nonumber \end{align} \]

(262)

\[ \begin{align} &\geq |\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}|^2 - |\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}| \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|^2}|\overline{{\boldsymbol{w}}}^\top {\boldsymbol{w}}|, \\\end{align} \]

(263)

where the equality holds because \( h({\boldsymbol{x}}) = g({\boldsymbol{x}}) = \mathrm{sign}(\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}})\) by (261). Dividing the right-hand side of (263) by \( |\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}| \|\overline{{\boldsymbol{w}}}\|\) , which is positive by (261), we obtain

\[ \begin{align} \frac{|\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}|}{\|\overline{{\boldsymbol{w}}}\|} - \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|} \frac{{|\overline{{\boldsymbol{w}}}^\top {\boldsymbol{w}}|}}{\|{\boldsymbol{w}}\| \|\overline{{\boldsymbol{w}}}\|}. \end{align} \]

(264)

The term (264) is positive thanks to (258).

Finally, we check that \( 0\neq h({\boldsymbol{x}}') \neq h({\boldsymbol{x}})\) , i.e., \( ({\boldsymbol{w}}^\top {\boldsymbol{x}}) ({\boldsymbol{w}}^\top {\boldsymbol{x}}')<0\) . We have that

\[ \begin{align*} ({\boldsymbol{w}}^\top {\boldsymbol{x}}) ({\boldsymbol{w}}^\top {\boldsymbol{x}}') & = |{\boldsymbol{w}}^\top {\boldsymbol{x}}|^2 - {\boldsymbol{w}}^\top {\boldsymbol{x}} h({\boldsymbol{x}}) \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|^2}{\boldsymbol{w}}^\top {\boldsymbol{w}}\\ & = |{\boldsymbol{w}}^\top {\boldsymbol{x}}|^2 - |{\boldsymbol{w}}^\top {\boldsymbol{x}}| (\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|)<0, \end{align*} \]

where we used that \( h({\boldsymbol{x}}) = \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}})\) . This completes the proof.

Theorem 41 readily implies the following proposition for affine classifiers.

Proposition 29

Let \( {\boldsymbol{w}}\) , \( \overline{{\boldsymbol{w}}} \in \mathbb{R}^{d}\) and \( b\) , \( \overline{b}\in\mathbb{R}\) . For \( {\boldsymbol{x}}\in\mathbb{R}^d\) let \( h({\boldsymbol{x}}) = \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}} + b)\) be a classifier and let \( g({\boldsymbol{x}}) = \mathrm{sign}(\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}} + \overline{b})\) be the ground-truth classifier.

For every \( {\boldsymbol{x}}\in \mathbb{R}^{d}\) with \( \overline{{\boldsymbol{w}}}^\top{\boldsymbol{x}} \neq 0\) , \( h({\boldsymbol{x}})g({\boldsymbol{x}})>0\) , and all \( \varepsilon \in (0, |{\boldsymbol{w}}^\top {\boldsymbol{x}} + b|)\) such that

\[ \begin{align*} \frac{|\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}} + \overline{b}|^2}{\|\overline{{\boldsymbol{w}}}\|^2 + \overline{b}^2} > \frac{(\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}} + b|)^2}{\|{\boldsymbol{w}}\|^2 + b^2} \frac{({\boldsymbol{w}}^\top \overline{{\boldsymbol{w}}} + b\overline{b})^2}{(\|{\boldsymbol{w}}\|^2 + b^2) (\| \overline{{\boldsymbol{w}}}\|^2 + \overline{b}^2)} \end{align*} \]

it holds that

\[ {\boldsymbol{x}}' = {\boldsymbol{x}} - h({\boldsymbol{x}}) \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}} + b|}{\|{\boldsymbol{w}}\|^2} {\boldsymbol{w}} \]

is an adversarial example with perturbation \( \delta = (\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}} + b|)/{\|{\boldsymbol{w}}\|}\) to \( {\boldsymbol{x}}\) .

The proof is left to the reader, see Exercise 66.

Figure 77. Illustration of the two types of adversarial examples in Examples 17 and 18. In panel A) the feature support \( D_{\boldsymbol{x}}\) corresponds to the dashed line. We depict the two decision boundaries \( \rm{ DB}_h = \{{\boldsymbol{x}}\,|\,{\boldsymbol{w}}^\top{\boldsymbol{x}} = 0\}\) of \( h({\boldsymbol{x}}) = \mathrm{sign}({\boldsymbol{w}}^\top{\boldsymbol{x}})\) and \( \rm{ DB}_g = \{{\boldsymbol{x}}\,|\,\overline{{\boldsymbol{w}}}^\top{\boldsymbol{x}} = 0\}\) of \( g({\boldsymbol{x}}) = \mathrm{sign}(\overline{{\boldsymbol{w}}}^\top{\boldsymbol{x}})\) . Both \( h\) and \( g\) perfectly classify every data point in \( D_{\boldsymbol{x}}\) . One data point \( {\boldsymbol{x}}\) is shifted outside of the support of the distribution in a way that changes its label according to \( h\) . This creates an adversarial example \( {\boldsymbol{x}}'\) . In panel B) the data distribution is globally supported. However, \( h\) and \( g\) do not coincide. Thus the decision boundaries \( \rm{ DB}_h\) and \( \rm{ DB}_g\) do not coincide. Moving data points across \( \rm{ DB}_h\) can create adversarial examples, as depicted by \( {\boldsymbol{x}}\) and \( {\boldsymbol{x}}'\) .

Let us now study two cases of linear classifiers, which allow for different types of adversarial examples. In the following two examples, the ground-truth classifier \( g:\mathbb{R}^{d} \to \{-1,1\}\) is given by \( g({\boldsymbol{x}}) = \mathrm{sign}(\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}})\) for \( \overline{{\boldsymbol{w}}} \in \mathbb{R}^d\) with \( \|\overline{{\boldsymbol{w}}}\| = 1\) .

For the first example, we construct a Bayes classifier \( h\) admitting adversarial examples in the complement of the feature support. This corresponds to case 2 in Section 17.2.

Example 17

Let \( \mathcal{D}\) be the uniform distribution on

\[ \{ (\lambda \overline{{\boldsymbol{w}}}, g(\lambda \overline{{\boldsymbol{w}}}))\,|\,\lambda \in [-1,1] \setminus \{0\}\}\subseteq \mathbb{R}^d\times \{-1,1\}. \]

The feature support equals

\[ D_{\boldsymbol{x}} = \{\lambda \overline{{\boldsymbol{w}}}\,|\,\lambda \in [-1,1] \setminus \{0\}\} \subseteq {\rm span}\{\overline{{\boldsymbol{w}}}\}. \]

Next fix \( \alpha \in (0,1)\) and set \( {\boldsymbol{w}} \mathrm{:}= \alpha\overline{{\boldsymbol{w}}} + \sqrt{1-\alpha^2}\, {\boldsymbol{v}}\) for some \( {\boldsymbol{v}} \in \overline{{\boldsymbol{w}}}^\perp\) with \( \|{\boldsymbol{v}}\| = 1\) , so that \( \| {\boldsymbol{w}} \|_{}=1\) . We let \( h({\boldsymbol{x}}) \mathrm{:}= \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}})\) . We now show that every \( {\boldsymbol{x}}\in D_{\boldsymbol{x}}\) satisfies the assumptions of Theorem 41, and therefore admits an adversarial example.

Note that \( h({\boldsymbol{x}}) = g({\boldsymbol{x}})\) for every \( {\boldsymbol{x}} \in D_{\boldsymbol{x}}\) . Hence \( h\) is a Bayes classifier. Now fix \( {\boldsymbol{x}}\in D_{\boldsymbol{x}}\) . Then \( |{\boldsymbol{w}}^\top {\boldsymbol{x}}| \leq \alpha|\overline{{\boldsymbol{w}}}^\top {\boldsymbol{x}}|\) , so that (258) is satisfied. Furthermore, for every \( \varepsilon >0\) it holds that

\[ \delta\mathrm{:}= \frac{\varepsilon + |{\boldsymbol{w}}^\top {\boldsymbol{x}}|}{\|{\boldsymbol{w}}\|} \leq \varepsilon + \alpha. \]

Hence, for \( \varepsilon < |{\boldsymbol{w}}^\top {\boldsymbol{x}}|\) it holds by Theorem 41 that there exists an adversarial example with perturbation less than \( \varepsilon + \alpha\) . For small \( \alpha\) , the situation is depicted in the upper panel of Figure 77.

For the second example, we construct a distribution with global feature support and a classifier which is not a Bayes classifier. This corresponds to case 3 in Section 17.2.

Example 18

Let \( \mathcal{D}_{\boldsymbol{x}}\) be a distribution on \( \mathbb{R}^d\) with positive Lebesgue density everywhere outside the decision boundary \( \rm{ DB}_g = \{{\boldsymbol{x}}\,|\,\overline{{\boldsymbol{w}}}^\top{\boldsymbol{x}} = 0\}\) of \( g\) . We define \( \mathcal{D}\) to be the distribution of \( (X, g(X))\) for \( X \sim \mathcal{D}_{\boldsymbol{x}}\) . In addition, let \( {\boldsymbol{w}} \notin \{\pm \overline{{\boldsymbol{w}}}\}\) , \( \|{\boldsymbol{w}}\| = 1\) and \( h({\boldsymbol{x}}) = \mathrm{sign}({\boldsymbol{w}}^\top {\boldsymbol{x}})\) . We exclude \( {\boldsymbol{w}}=-\overline{{\boldsymbol{w}}}\) because, in this case, every prediction of \( h\) is wrong. Thus no adversarial examples are possible.

By construction the feature support is given by \( D_{\boldsymbol{x}}=\mathbb{R}^d\) . Moreover, \( h^{-1}(\{-1\}), h^{-1}(\{1\})\) and \( g^{-1}(\{-1\}), g^{-1}(\{1\})\) are half spaces, which implies, in the notation of (256), that

\[ \mathrm{dist}(C_{\pm 1} \cap D_{\boldsymbol{x}}, F_{\pm 1} \cap D_{\boldsymbol{x}}) = \mathrm{dist}(C_{\pm 1}, F_{\pm 1}) = 0. \]

Hence, for every \( \delta>0\) there is a positive probability of observing \( {\boldsymbol{x}}\) to which an adversarial example with perturbation \( \delta\) exists.

The situation is depicted in the lower panel of Figure 77.

17.4 ReLU neural networks

So far we discussed classification by affine classifiers. A binary classifier based on a ReLU neural network is a function \( \mathbb{R}^d \ni {\boldsymbol{x}}\mapsto \mathrm{sign}(\Phi({\boldsymbol{x}}))\) , where \( \Phi\) is a ReLU neural network. As noted in [277], the arguments for affine classifiers, see Proposition 29, can be applied to the affine pieces of \( \Phi\) , to show existence of adversarial examples.

Consider a ground-truth classifier \( g\colon \mathbb{R}^d \to \{-1,0,1\}\) . For each \( {\boldsymbol{x}} \in \mathbb{R}^d\) we define the geometric margin of \( g\) at \( {\boldsymbol{x}}\) as

\[ \begin{align} \mu_g({\boldsymbol{x}}) \mathrm{:}= \mathrm{dist}({\boldsymbol{x}}, g^{-1}(\{g({\boldsymbol{x}})\})^c ), \end{align} \]

(265)

i.e., \( \mu_g({\boldsymbol{x}})\) is the distance of \( {\boldsymbol{x}}\) to the closest element that is classified differently from \( {\boldsymbol{x}}\) (or, if no closest element exists, the infimum of the distances to elements of other classes). Additionally, we denote the distance of \( {\boldsymbol{x}}\) to the closest adjacent affine piece by

\[ \begin{align} \nu_{\Phi}({\boldsymbol{x}}) \mathrm{:}= \mathrm{dist}({\boldsymbol{x}}, A_{\Phi, {\boldsymbol{x}}}^c), \end{align} \]

(266)

where \( A_{\Phi, {\boldsymbol{x}}}\) is the largest connected region on which \( \Phi\) is affine and which contains \( {\boldsymbol{x}}\) . We have the following theorem.

Theorem 42

Let \( \Phi\colon \mathbb{R}^d \to \mathbb{R}\) and for \( {\boldsymbol{x}}\in\mathbb{R}^d\) let \( h({\boldsymbol{x}}) = \mathrm{sign}(\Phi({\boldsymbol{x}}))\) . Denote by \( g\colon \mathbb{R}^d \to \{-1,0,1\}\) the ground-truth classifier. Let \( {\boldsymbol{x}}\in\mathbb{R}^d\) and \( \varepsilon>0\) be such that \( \nu_{\Phi}({\boldsymbol{x}})>0\) , \( g({\boldsymbol{x}})\neq 0\) , \( \nabla\Phi({\boldsymbol{x}}) \neq 0\) and

\[ \begin{align*} \mu_g({\boldsymbol{x}}), \nu_{\Phi}({\boldsymbol{x}}) > \frac{\varepsilon + |\Phi({\boldsymbol{x}})|}{\| \nabla\Phi({\boldsymbol{x}})\|}. \end{align*} \]

Then

\[ \begin{align*} {\boldsymbol{x}}' \mathrm{:}= {\boldsymbol{x}} - h({\boldsymbol{x}}) \frac{\varepsilon + |\Phi({\boldsymbol{x}})|}{\| \nabla\Phi({\boldsymbol{x}})\|^2} \nabla\Phi({\boldsymbol{x}}) \end{align*} \]

is an adversarial example to \( {\boldsymbol{x}}\) with perturbation \( \delta = ({\varepsilon + |\Phi({\boldsymbol{x}})|})/{\| \nabla\Phi({\boldsymbol{x}})\|}\) .

Proof

We show that \( {\boldsymbol{x}}'\) satisfies the properties in Definition 37.

By construction \( \|{\boldsymbol{x}} - {\boldsymbol{x}}'\| \leq \delta\) . Since \( \mu_g({\boldsymbol{x}})>\delta\) it follows that \( g({\boldsymbol{x}}) = g({\boldsymbol{x}}')\) . Moreover, by assumption \( g({\boldsymbol{x}}) \neq 0\) , and thus \( g({\boldsymbol{x}})g({\boldsymbol{x}}')>0\) .

It only remains to show that \( h({\boldsymbol{x}}') \neq h({\boldsymbol{x}})\) . Since \( \delta < \nu_{\Phi}({\boldsymbol{x}})\) , we have that \( \Phi({\boldsymbol{x}}) = \nabla \Phi({\boldsymbol{x}})^\top {\boldsymbol{x}} + b\) and \( \Phi({\boldsymbol{x}}') = \nabla \Phi({\boldsymbol{x}})^\top {\boldsymbol{x}}' + b\) for some \( b \in \mathbb{R}\) . Therefore,

\[ \begin{align*} \Phi({\boldsymbol{x}}) - \Phi({\boldsymbol{x}}') &= \nabla \Phi({\boldsymbol{x}})^\top({\boldsymbol{x}} - {\boldsymbol{x}}') = \nabla \Phi({\boldsymbol{x}})^\top\left(h({\boldsymbol{x}}) \frac{\varepsilon + |\Phi({\boldsymbol{x}})|}{\| \nabla\Phi({\boldsymbol{x}})\|^2} \nabla\Phi({\boldsymbol{x}})\right)\\ &=h({\boldsymbol{x}}) (\varepsilon + |\Phi({\boldsymbol{x}})|). \end{align*} \]

Since \( h({\boldsymbol{x}})|\Phi({\boldsymbol{x}})| = \Phi({\boldsymbol{x}})\) it follows that \( \Phi({\boldsymbol{x}}') = - h({\boldsymbol{x}}) \varepsilon\) . Hence, \( h({\boldsymbol{x}}') = - h({\boldsymbol{x}})\) , which completes the proof.

Remark 20

We look at the key parameters in Theorem 42 to understand which factors facilitate adversarial examples.

  • The geometric margin of the ground-truth classifier \( \mu_{g}({\boldsymbol{x}})\) : To make the construction possible, we need to be sufficiently far away from points that belong to a different class than \( {\boldsymbol{x}}\) or to the nonrelevant class.
  • The distance to the next affine piece \( \nu_{\Phi}({\boldsymbol{x}})\) : Since we are looking for an adversarial example within the same affine piece as \( {\boldsymbol{x}}\) , we need this piece to be sufficiently large.
  • The perturbation \( \delta\) : The perturbation is given by \( ({\varepsilon + |\Phi({\boldsymbol{x}})|})/{\| \nabla\Phi({\boldsymbol{x}})\|}\) , which depends on the classification margin \( |\Phi({\boldsymbol{x}})|\) of the ReLU classifier and its sensitivity to inputs \( \| \nabla\Phi({\boldsymbol{x}})\|\) . For adversarial examples to be possible, we either want a small classification margin of \( \Phi\) or a high sensitivity of \( \Phi\) to its inputs.
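The construction in Theorem 42 can be carried out explicitly for a small ReLU network. The sketch below uses a randomly initialized two-layer network as a placeholder, computes the gradient on the affine piece containing \( {\boldsymbol{x}}\) by hand, and picks an arbitrary \( \varepsilon\) . Whether the resulting point is a genuine adversarial example additionally depends on the ground truth \( g\) and on \( \nu_\Phi({\boldsymbol{x}})\) , which the code does not check.

```python
import numpy as np

rng = np.random.default_rng(4)
d, width = 10, 32

# placeholder two-layer ReLU network Phi(x) = a^T relu(W x + b) + c
W = rng.standard_normal((width, d)) / np.sqrt(d)
b = 0.1 * rng.standard_normal(width)
a = rng.standard_normal(width) / np.sqrt(width)
c = 0.0

def phi(x):
    return a @ np.maximum(W @ x + b, 0.0) + c

def grad_phi(x):
    # gradient on the affine piece containing x: only active ReLUs contribute
    active = (W @ x + b > 0).astype(float)
    return W.T @ (a * active)

x = rng.standard_normal(d)
eps = 0.1
h_x = np.sign(phi(x))
grad = grad_phi(x)
assert np.linalg.norm(grad) > 0       # Theorem 42 requires a nonzero gradient

# candidate adversarial example from Theorem 42
x_adv = x - h_x * (eps + abs(phi(x))) / np.linalg.norm(grad) ** 2 * grad

# if x_adv lies on the same affine piece as x (i.e. delta < nu_Phi(x)),
# then Phi(x_adv) = -h(x) * eps, so the predicted label flips
print(np.sign(phi(x)), np.sign(phi(x_adv)))
print("perturbation:", (eps + abs(phi(x))) / np.linalg.norm(grad))
```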

17.5 Robustness

Having established that adversarial examples can arise in various ways under mild assumptions, we now turn our attention to conditions that prevent their existence.

17.5.1 Global Lipschitz regularity

We have repeatedly observed in the previous sections that a large value of \( \|{\boldsymbol{w}}\|\) for linear classifiers \( \rm{ sign}({\boldsymbol{w}}^\top{\boldsymbol{x}})\) , or \( \|\nabla\Phi({\boldsymbol{x}})\|\) for ReLU classifiers \( \rm{ sign}(\Phi({\boldsymbol{x}}))\) , facilitates the occurrence of adversarial examples. Naturally, both these values are upper bounded by the Lipschitz constant of the classifier’s inner functions \( {\boldsymbol{x}}\mapsto{\boldsymbol{w}}^\top{\boldsymbol{x}}\) and \( {\boldsymbol{x}}\mapsto\Phi({\boldsymbol{x}})\) . Consequently, it was stipulated early on that bounding the Lipschitz constant of the inner functions could be an effective measure against adversarial examples [277].

We have the following result for general classifiers of the form \( {\boldsymbol{x}} \mapsto \mathrm{sign}(\Phi({\boldsymbol{x}}))\) .

Proposition 30

Let \( \Phi \colon \mathbb{R}^d \to \mathbb{R}\) be \( C_L\) -Lipschitz with \( C_L >0\) , and let \( s>0\) . Let \( h({\boldsymbol{x}}) = \mathrm{sign}(\Phi({\boldsymbol{x}}))\) be a classifier, and let \( g\colon \mathbb{R}^d \to \{-1,0,1\}\) be a ground-truth classifier. Moreover, let \( {\boldsymbol{x}}\in\mathbb{R}^d\) be such that

\[ \begin{align} \Phi({\boldsymbol{x}}) g({\boldsymbol{x}}) \geq s. \end{align} \]

(267)

Then there does not exist an adversarial example to \( {\boldsymbol{x}}\) of perturbation \( \delta < s/C_L\) .

Proof

Let \( {\boldsymbol{x}}\in \mathbb{R}^d\) satisfy (267) and assume that \( \|{\boldsymbol{x}}' - {\boldsymbol{x}} \| \leq \delta\) . The Lipschitz continuity of \( \Phi\) implies

\[ \begin{align*} |\Phi({\boldsymbol{x}}') - \Phi({\boldsymbol{x}}) | \le C_L \| {\boldsymbol{x}}'-{\boldsymbol{x}} \|_{} \le C_L \delta < s. \end{align*} \]

Since \( |\Phi({\boldsymbol{x}})| \geq s\) we conclude that \( \Phi({\boldsymbol{x}}')\) has the same sign as \( \Phi({\boldsymbol{x}})\) which shows that \( {\boldsymbol{x}}'\) cannot be an adversarial example to \( {\boldsymbol{x}}\) .

Remark 21

As we have seen in Lemma 33, we can bound the Lipschitz constant of ReLU neural networks by restricting the magnitude and number of their weights and the number of layers.
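Combining Proposition 30 with a weight-based Lipschitz bound of this kind yields a simple, if often pessimistic, robustness certificate. In the sketch below the Lipschitz constant is upper bounded by the product of the spectral norms of the weight matrices, which is a valid bound for networks with \( 1\) -Lipschitz activations; the network weights are random placeholders, and the ground-truth label is assumed to agree with the prediction.

```python
import numpy as np

rng = np.random.default_rng(5)

def relu(z):
    return np.maximum(z, 0.0)

# placeholder ReLU network Phi: R^d -> R with two hidden layers (no biases)
d = 10
W1 = rng.standard_normal((32, d)) / np.sqrt(d)
W2 = rng.standard_normal((32, 32)) / np.sqrt(32)
W3 = rng.standard_normal((1, 32)) / np.sqrt(32)

def phi(x):
    return (W3 @ relu(W2 @ relu(W1 @ x)))[0]

# global Lipschitz bound (Euclidean norm): product of spectral norms,
# valid since ReLU is 1-Lipschitz
C_L = np.prod([np.linalg.norm(W, ord=2) for W in (W1, W2, W3)])

x = rng.standard_normal(d)
g_x = np.sign(phi(x))      # assume the prediction agrees with the ground truth
s = phi(x) * g_x           # classification margin, cf. (267)

# Proposition 30: no adversarial example with perturbation delta < s / C_L
print("certified radius:", s / C_L)
```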

There has been some criticism of results of this form, see, e.g., [280], since an assumption on the Lipschitz constant may restrict the capabilities of the neural network too much. We next present a result that shows under which assumptions on the training set there exists a neural network that classifies the training set correctly, but does not allow for adversarial examples within the training set.

Theorem 43

Let \( m \in \mathbb{N}\) , let \( g \colon{\mathbb{R}^d} \to \{-1,0,1\}\) be a ground-truth classifier, and let \( ({\boldsymbol{x}}_i, g({\boldsymbol{x}}_i))_{i=1}^m \in (\mathbb{R}^{d} \times \{-1,1\})^m\) . Assume that

\[ \sup_{i \neq j} \frac{|g({\boldsymbol{x}}_i) - g({\boldsymbol{x}}_j)|}{\| {\boldsymbol{x}}_i - {\boldsymbol{x}}_j \|_{}} =\mathrm{:} \widetilde{M} >0. \]

Then there exists a ReLU neural network \( \Phi\) with \( \rm{ depth}(\Phi)=O(\log(m))\) and \( \rm{ width}(\Phi)=O(dm)\) such that for all \( i = 1,\dots, m\)

\[ \mathrm{sign}(\Phi({\boldsymbol{x}}_i)) = g({\boldsymbol{x}}_i) \]

and there is no adversarial example of perturbation \( \delta = 1/\widetilde{M}\) to \( {\boldsymbol{x}}_i\) .

Proof

The result follows directly from Theorem 18 and Proposition 30. The reader is invited to complete the argument in Exercise 67.

17.5.2 Local regularity

One issue with upper bounds involving global Lipschitz constants such as those in Proposition 30, is that these bounds may be quite large for deep neural networks. For example, the upper bound given in Lemma 33 is

\[ \|\Phi({\boldsymbol{x}}) - \Phi({\boldsymbol{x}}')\|_\infty \leq C_\sigma^L \cdot (B d_{\rm max})^{L+1} \|{\boldsymbol{x}}-{\boldsymbol{x}}' \|_\infty \]

which grows exponentially with the depth of the neural network. However, in practice this bound may be pessimistic, and locally the neural network might have significantly smaller gradients than the global Lipschitz constant.

Because of this, it is reasonable to study results preventing adversarial examples under local Lipschitz bounds. Such a result together with an algorithm providing bounds on the local Lipschitz constant was proposed in [281]. We state the theorem adapted to our set-up.

Theorem 44

Let \( h \colon \mathbb{R}^d \to \{-1,1\}\) be a classifier of the form \( h({\boldsymbol{x}}) = \mathrm{sign}(\Phi({\boldsymbol{x}}))\) and let \( g \colon \mathbb{R}^d \to \{-1,0,1\}\) be the ground-truth classifier. Let \( {\boldsymbol{x}} \in \mathbb{R}^d\) satisfy \( g({\boldsymbol{x}}) \neq 0\) , and set

\[ \begin{align} \alpha \mathrm{:}= \max_{R >0} \min \left\{ \Phi({\boldsymbol{x}}) g({\boldsymbol{x}}) \Big/ \sup_{\substack{\| {\boldsymbol{y}}-{\boldsymbol{x}} \|_{\infty}\le R\\ {\boldsymbol{y}}\neq{\boldsymbol{x}}}} \frac{|\Phi({\boldsymbol{y}}) - \Phi({\boldsymbol{x}})|}{\|{\boldsymbol{x}}-{\boldsymbol{y}}\|_\infty} , R \right\}, \end{align} \]

(268)

where the minimum is understood to be \( R\) in case the supremum is zero. Then there are no adversarial examples to \( {\boldsymbol{x}}\) with perturbation \( \delta< \alpha\) .

Proof

Let \( {\boldsymbol{x}} \in \mathbb{R}^d\) be as in the statement of the theorem, and let \( R>0\) be a radius attaining the maximum in (268). Assume, towards a contradiction, that for some \( 0<\delta < \alpha\) there exists an adversarial example \( {\boldsymbol{x}}'\) to \( {\boldsymbol{x}}\) with perturbation \( \delta\) .

If the supremum in (268) is zero, then \( \Phi\) is constant on a ball of radius \( R\) around \( {\boldsymbol{x}}\) . In particular for \( \| {\boldsymbol{x}}'-{\boldsymbol{x}} \|_{}\le\delta<R\) it holds that \( h({\boldsymbol{x}}')=h({\boldsymbol{x}})\) and \( {\boldsymbol{x}}'\) cannot be an adversarial example.

Now assume that the supremum in (268) is not zero. Since \( \delta<\alpha\le R\) , it holds by (268) that

\[ \begin{align} \delta < \Phi({\boldsymbol{x}}) g({\boldsymbol{x}})\Big/ \sup_{\substack{\| {\boldsymbol{y}}-{\boldsymbol{x}} \|_{\infty}\le R\\ {\boldsymbol{y}}\neq{\boldsymbol{x}}}} \frac{|\Phi({\boldsymbol{y}}) - \Phi({\boldsymbol{x}})|}{\|{\boldsymbol{x}}-{\boldsymbol{y}}\|_\infty}. \end{align} \]

(269)

Moreover,

\[ \begin{align*} |\Phi({\boldsymbol{x}}') - \Phi({\boldsymbol{x}})| &\leq \sup_{\substack{\| {\boldsymbol{y}}-{\boldsymbol{x}} \|_{\infty}\le R\\ {\boldsymbol{y}}\neq{\boldsymbol{x}}}} \frac{|\Phi({\boldsymbol{y}}) - \Phi({\boldsymbol{x}})|}{\|{\boldsymbol{x}}-{\boldsymbol{y}}\|_\infty} \| {\boldsymbol{x}} - {\boldsymbol{x}}'\|_\infty\\ &\leq \sup_{\substack{\| {\boldsymbol{y}}-{\boldsymbol{x}} \|_{\infty}\le R\\ {\boldsymbol{y}}\neq{\boldsymbol{x}}}} \frac{|\Phi({\boldsymbol{y}}) - \Phi({\boldsymbol{x}})|}{\|{\boldsymbol{x}}-{\boldsymbol{y}}\|_\infty} \delta < \Phi({\boldsymbol{x}})g({\boldsymbol{x}}), \end{align*} \]

where we applied (269) in the last line. It follows that

\[ \begin{align*} g({\boldsymbol{x}}) \Phi({\boldsymbol{x}}') &= g({\boldsymbol{x}}) \Phi({\boldsymbol{x}}) + g({\boldsymbol{x}}) (\Phi({\boldsymbol{x}}') - \Phi({\boldsymbol{x}}))\\ &\geq g({\boldsymbol{x}}) \Phi({\boldsymbol{x}}) - |\Phi({\boldsymbol{x}}') - \Phi({\boldsymbol{x}})| > 0. \end{align*} \]

This rules out \( {\boldsymbol{x}}'\) as an adversarial example.

The supremum in (268) is bounded by the Lipschitz constant of \( \Phi\) on \( B_R({\boldsymbol{x}})\) . Thus Theorem 44 depends only on the local Lipschitz constant of \( \Phi\) . One obvious criticism of this result is that the computation of (268) is potentially prohibitive. We next show a different result, for which the assumptions can immediately be checked by applying a simple algorithm that we present subsequently.

To state the following proposition, for a continuous function \( \Phi:\mathbb{R}^d\to\mathbb{R}\) , a point \( {\boldsymbol{x}} \in \mathbb{R}^d\) , and \( \delta >0\) we define

\[ \begin{align} z^{\delta, \mathrm{max}} &\mathrm{:}= \max\{\Phi({\boldsymbol{y}})\,|\,\|{\boldsymbol{y}} - {\boldsymbol{x}}\|_\infty \leq \delta\}\\ z^{\delta, \mathrm{min}} &\mathrm{:}= \min\{\Phi({\boldsymbol{y}})\,|\,\|{\boldsymbol{y}} - {\boldsymbol{x}}\|_\infty \leq \delta\}. \\\end{align} \]

(270)

Proposition 31

Let \( h\colon \mathbb{R}^d \to \{-1,1\}\) be a classifier of the form \( h({\boldsymbol{x}}) = \mathrm{sign}(\Phi({\boldsymbol{x}}))\) , let \( g \colon \mathbb{R}^d \to \{-1,0,1\}\) be a ground-truth classifier, and let \( {\boldsymbol{x}}\in\mathbb{R}^d\) be such that \( h({\boldsymbol{x}}) = g({\boldsymbol{x}})\) . Then \( {\boldsymbol{x}}\) does not have an adversarial example of perturbation \( \delta\) if \( z^{\delta, \mathrm{max}} z^{\delta, \mathrm{min}}>0\) .

Proof

The proof is immediate, since \( z^{\delta, \mathrm{max}} z^{\delta, \mathrm{min}}>0\) implies that all points in a \( \delta\) -neighborhood of \( {\boldsymbol{x}}\) are classified the same as \( {\boldsymbol{x}}\) .

To apply Proposition 31, we only have to compute \( z^{\delta, \mathrm{max}}\) and \( z^{\delta, \mathrm{min}}\) . It turns out that if \( \Phi\) is a neural network, then \( z^{\delta, \mathrm{max}}\) , \( z^{\delta, \mathrm{min}}\) can be bounded by a computation similar to a forward pass of \( \Phi\) . Denote by \( |{\boldsymbol{A}}|\) the matrix obtained by taking the absolute value of each entry of the matrix \( {\boldsymbol{A}}\) . Additionally, we define

\[ {\boldsymbol{A}}^+ = (|{\boldsymbol{A}}| + {\boldsymbol{A}})/2 \text{ and } {\boldsymbol{A}}^- = (|{\boldsymbol{A}}| - {\boldsymbol{A}})/2. \]

The idea behind Algorithm 3 is common in the area of neural network verification, see, e.g., [282, 283, 284, 285].


Algorithm 3 Compute \( \Phi({\boldsymbol{x}})\) and bounds for \( z^{\delta, \mathrm{max}}\) and \( z^{\delta, \mathrm{min}}\) for a given neural network
1.Input: weight matrices \( {\boldsymbol{W}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}\times d_\ell}\) and bias vectors \( {\boldsymbol{b}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}}\) for \( \ell = 0, \dots, L\) with \( d_{L+1}= 1\) , monotonically increasing activation function \( \sigma\) , input vector \( {\boldsymbol{x}} \in \mathbb{R}^{d_0}\) , neighborhood size \( \delta>0\)
2.Output: Bounds for \( z^{\delta,\max}\) and \( z^{\delta,\min}\)
3.
4.\( {\boldsymbol{x}}^{(0)}= {\boldsymbol{x}}\)
5.\( \delta^{(0), \mathrm{up}} = \delta \boldsymbol{1} \in \mathbb{R}^{d_0}\)
6.\( \delta^{(0), \mathrm{low}} = \delta \boldsymbol{1} \in \mathbb{R}^{d_0}\)
7.for \( \ell = 0,\dots,L-1\) do
8.\( {\boldsymbol{x}}^{(\ell+1)} = \sigma({\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} + {\boldsymbol{b}}^{(\ell)})\)
9.\( \delta^{(\ell+1), \mathrm{up}} = \sigma({\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} + ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{up}} + ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{low}}+ {\boldsymbol{b}}^{(\ell)}) - {\boldsymbol{x}}^{(\ell+1)}\)
10.\( \delta^{(\ell+1), \mathrm{low}} = {\boldsymbol{x}}^{(\ell+1)} - \sigma({\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} - ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{low}} - ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{up}} + {\boldsymbol{b}}^{(\ell)})\)
11.end for
12.\( {\boldsymbol{x}}^{(L+1)} = {\boldsymbol{W}}^{(L)} {\boldsymbol{x}}^{(L)} + {\boldsymbol{b}}^{(L)}\)
13.\( \delta^{(L+1), \mathrm{up}} = ({\boldsymbol{W}}^{(L)})^+ \delta^{(L), \mathrm{up}} + ({\boldsymbol{W}}^{(L)})^- \delta^{(L), \mathrm{low}}\)
14.\( \delta^{(L+1), \mathrm{low}} = ({\boldsymbol{W}}^{(L)})^+ \delta^{(L), \mathrm{low}} + ({\boldsymbol{W}}^{(L)})^- \delta^{(L), \mathrm{up}}\)
15.return \( {\boldsymbol{x}}^{(L+1)}\) , \( {\boldsymbol{x}}^{(L+1)} + \delta^{(L+1), \mathrm{up}}\) , \( {\boldsymbol{x}}^{(L+1)} - \delta^{(L+1), \mathrm{low}}\)
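One possible NumPy transcription of Algorithm 3 for ReLU networks reads as follows (weights, biases, and the input below are arbitrary placeholders). If the two returned bounds have the same sign, then \( z^{\delta, \mathrm{max}} z^{\delta, \mathrm{min}}>0\) and Proposition 31 certifies that \( {\boldsymbol{x}}\) has no adversarial example of perturbation \( \delta\) .

import numpy as np

def pos_neg(W):
    # W^+ = (|W| + W)/2 and W^- = (|W| - W)/2, as defined above
    return (np.abs(W) + W) / 2, (np.abs(W) - W) / 2

def algorithm3(weights, biases, x, delta, sigma=lambda t: np.maximum(t, 0.0)):
    """Return Phi(x), an upper bound for z^{delta,max} and a lower bound for z^{delta,min}."""
    d_up = np.full(x.shape, float(delta))
    d_low = np.full(x.shape, float(delta))
    for W, b in zip(weights[:-1], biases[:-1]):        # hidden layers
        Wp, Wm = pos_neg(W)
        z = W @ x + b
        x_new = sigma(z)
        d_up_new = sigma(z + Wp @ d_up + Wm @ d_low) - x_new
        d_low_new = x_new - sigma(z - Wp @ d_low - Wm @ d_up)
        x, d_up, d_low = x_new, d_up_new, d_low_new
    W, b = weights[-1], biases[-1]                      # affine output layer
    Wp, Wm = pos_neg(W)
    out = W @ x + b
    return out, out + Wp @ d_up + Wm @ d_low, out - (Wp @ d_low + Wm @ d_up)

rng = np.random.default_rng(0)
dims = [4, 8, 8, 1]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
bs = [rng.standard_normal(dims[i + 1]) for i in range(3)]
x0 = rng.standard_normal(4)

value, upper, lower = algorithm3(Ws, bs, x0, delta=0.01)
print("Phi(x) =", value[0], " bounds:", lower[0], upper[0])
print("certified (Proposition 31):", lower[0] * upper[0] > 0)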

Remark 22

Up to constants, Algorithm 3 has the same computational complexity as a forward pass; see also Algorithm 1. In addition, in contrast to upper bounds based on estimating the global Lipschitz constant of \( \Phi\) via its weights, the bounds found via Algorithm 3 include the effect of the activation function \( \sigma\) . For example, if \( \sigma\) is the ReLU, then \( \delta^{(\ell), \mathrm{up}}\) or \( \delta^{(\ell), \mathrm{low}}\) can have many entries that are \( 0\) . If an entry of \( {\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} + {\boldsymbol{b}}^{(\ell)}\) is nonpositive, then it is guaranteed that the associated entry in \( \delta^{(\ell+1), \mathrm{low}}\) will be zero. Similarly, if \( {\boldsymbol{W}}^{(\ell)}\) has only few positive entries, then most of the entries of \( \delta^{(\ell), \mathrm{up}}\) are not propagated to \( \delta^{(\ell+1), \mathrm{up}}\) .

Next, we prove that Algorithm 3 indeed produces sensible output.

Proposition 32

Let \( \Phi\) be a neural network with weight matrices \( {\boldsymbol{W}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}\times d_\ell}\) and bias vectors \( {\boldsymbol{b}}^{(\ell)}\in\mathbb{R}^{d_{\ell+1}}\) for \( \ell = 0, \dots, L\) , and a monotonically increasing activation function \( \sigma\) .

Let \( {\boldsymbol{x}} \in \mathbb{R}^{d_0}\) and \( \delta>0\) . Then the output of Algorithm 3 satisfies

\[ {\boldsymbol{x}}^{(L+1)} + \delta^{(L+1), \mathrm{up}} \geq z^{\delta, \mathrm{max}} \text{ and } {\boldsymbol{x}}^{(L+1)} - \delta^{(L+1), \mathrm{low}} \leq z^{\delta, \mathrm{min}}. \]

Proof

Fix \( {\boldsymbol{y}}\) , \( {\boldsymbol{x}} \in \mathbb{R}^{d_0}\) with \( \|{\boldsymbol{y}} - {\boldsymbol{x}}\|_\infty \leq \delta\) and let \( {\boldsymbol{y}}^{(\ell)}\) , \( {\boldsymbol{x}}^{(\ell)}\) for \( \ell = 0, \dots, L+1\) be as in Algorithm 3 applied to \( {\boldsymbol{y}}\) , \( {\boldsymbol{x}}\) , respectively. Moreover, let \( \delta^{(\ell), \mathrm{up}}\) , \( \delta^{(\ell), \mathrm{low}}\) for \( \ell = 0, \dots, L+1\) be as in Algorithm 3 applied to \( {\boldsymbol{x}}\) . We will prove by induction over \( \ell = 0, \dots, L+1\) that

\[ \begin{align} {\boldsymbol{y}}^{(\ell)} - {\boldsymbol{x}}^{(\ell)} \leq \delta^{(\ell), \mathrm{up}}~~\text{and}~~ {\boldsymbol{x}}^{(\ell)} - {\boldsymbol{y}}^{(\ell)} \leq \delta^{(\ell), \mathrm{low}}, \end{align} \]

(271)

where the inequalities are understood entry-wise for vectors. Since \( {\boldsymbol{y}}\) was arbitrary this then proves the result.

The case \( \ell = 0\) follows immediately from \( \|{\boldsymbol{y}} - {\boldsymbol{x}}\|_\infty \leq \delta\) . Assume now that the statement has been shown for some \( \ell < L\) . We have that

\[ \begin{align*} {\boldsymbol{y}}^{(\ell+1)} - {\boldsymbol{x}}^{(\ell+1)} - \delta^{(\ell+1), \mathrm{up}}= &\sigma({\boldsymbol{W}}^{(\ell)} {\boldsymbol{y}}^{(\ell)} + {\boldsymbol{b}}^{(\ell)})\\ &- \sigma\big({\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} + ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{up}} + ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{low}} + {\boldsymbol{b}}^{(\ell)}\big). \end{align*} \]

The monotonicity of \( \sigma\) implies that

\[ {\boldsymbol{y}}^{(\ell+1)} - {\boldsymbol{x}}^{(\ell+1)} \leq \delta^{(\ell+1), \mathrm{up}} \]

if

\[ \begin{align} {\boldsymbol{W}}^{(\ell)} {\boldsymbol{y}}^{(\ell)} \leq {\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} + ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{up}} + ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{low}}. \end{align} \]

(272)

To prove (272), we observe that

\[ \begin{align*} {\boldsymbol{W}}^{(\ell)} ({\boldsymbol{y}}^{(\ell)} - {\boldsymbol{x}}^{(\ell)}) &= ({\boldsymbol{W}}^{(\ell)})^+ ({\boldsymbol{y}}^{(\ell)} - {\boldsymbol{x}}^{(\ell)}) - ({\boldsymbol{W}}^{(\ell)})^- ({\boldsymbol{y}}^{(\ell)} - {\boldsymbol{x}}^{(\ell)})\\ &=({\boldsymbol{W}}^{(\ell)})^+ ({\boldsymbol{y}}^{(\ell)} - {\boldsymbol{x}}^{(\ell)}) + ({\boldsymbol{W}}^{(\ell)})^- ({\boldsymbol{x}}^{(\ell)} - {\boldsymbol{y}}^{(\ell)})\\ &\leq ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{up}} + ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{low}}, \end{align*} \]

where we used the induction assumption in the last line. This shows the first estimate in (271). Similarly,

\[ \begin{align*} &{\boldsymbol{x}}^{(\ell+1)} - {\boldsymbol{y}}^{(\ell+1)} - \delta^{(\ell+1), \mathrm{low}} \\ &= \sigma({\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} - ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{low}} - ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{up}} + {\boldsymbol{b}}^{(\ell)}) - \sigma({\boldsymbol{W}}^{(\ell)} {\boldsymbol{y}}^{(\ell)} + {\boldsymbol{b}}^{(\ell)}). \end{align*} \]

Hence, \( {\boldsymbol{x}}^{(\ell+1)} - {\boldsymbol{y}}^{(\ell+1)} \leq \delta^{(\ell+1), \mathrm{low}}\) if

\[ \begin{align} {\boldsymbol{W}}^{(\ell)} {\boldsymbol{y}}^{(\ell)} \geq {\boldsymbol{W}}^{(\ell)} {\boldsymbol{x}}^{(\ell)} - ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{low}} - ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{up}}. \end{align} \]

(273)

To prove (273), we observe that

\[ \begin{align*} {\boldsymbol{W}}^{(\ell)} ({\boldsymbol{x}}^{(\ell)} - {\boldsymbol{y}}^{(\ell)}) &= ({\boldsymbol{W}}^{(\ell)})^+ ({\boldsymbol{x}}^{(\ell)} - {\boldsymbol{y}}^{(\ell)}) - ({\boldsymbol{W}}^{(\ell)})^- ({\boldsymbol{x}}^{(\ell)} - {\boldsymbol{y}}^{(\ell)})\\ &= ({\boldsymbol{W}}^{(\ell)})^+ ({\boldsymbol{x}}^{(\ell)} - {\boldsymbol{y}}^{(\ell)}) + ({\boldsymbol{W}}^{(\ell)})^- ({\boldsymbol{y}}^{(\ell)} - {\boldsymbol{x}}^{(\ell)})\\ &\leq ({\boldsymbol{W}}^{(\ell)})^+ \delta^{(\ell), \mathrm{low}} + ({\boldsymbol{W}}^{(\ell)})^- \delta^{(\ell), \mathrm{up}}, \end{align*} \]

where we used the induction assumption in the last line. This completes the proof of (271) for all \( \ell \leq L\) .

The case \( \ell = L+1\) follows by the same argument, but replacing \( \sigma\) by the identity.

Bibliography and further reading

The topic of this chapter goes back to the foundational paper [277], but it should be remarked that adversarial examples for non-deep-learning models in machine learning were studied earlier in [286].

The results in this chapter are inspired by various results in the literature, though they may not be found in precisely the same form. The overall setup is inspired by [277]. The explanation based on the high-dimensionality of the data given in Section 17.3 was first formulated in [277] and [279]. The formalism reviewed in Section 17.2 is inspired by [278]. The results on robustness via local Lipschitz properties are due to [281]. Algorithm 3 is covered by results in the area of network verifiability [282, 283, 284, 285]. For a more comprehensive overview of modern approaches, we refer to the survey article [287].

Important directions not discussed in this chapter are the transferability of adversarial examples, defense mechanisms, and alternative adversarial operations. Transferability refers to the phenomenon that adversarial examples for one model often also fool other models, [288, 289]. Defense mechanisms, i.e., techniques for specifically training a neural network to prevent adversarial examples, include for example the Fast Gradient Sign Method of [279], and more sophisticated recent approaches such as [290]. Finally, adversarial examples can be generated not only through additive perturbations, but also through smooth transformations of images, as demonstrated in [291, 292].

Exercises

Exercise 64

Prove (257) by comparing the volume of the \( d\) -dimensional Euclidean unit ball with the volume of the \( d\) -dimensional 1-ball of radius \( c\) for a given \( c>0\) .

Exercise 65

Fix \( \delta>0\) . For a pair of classifiers \( h\) and \( g\) such that \( C_1 \cup C_{-1} = \emptyset\) in (256), there trivially cannot exist any adversarial examples. Construct an example of \( h\) , \( g\) , \( \mathcal{D}\) such that \( C_1\) , \( C_{-1} \neq \emptyset\) , \( h\) is not a Bayes classifier, and \( g\) is such that no adversarial examples with a perturbation \( \delta\) exist.

Is this also possible if \( g^{-1}(0) = \emptyset\) ?

Exercise 66

Prove Proposition 29.

Hint: Repeat the proof of Theorem 41. In the first part set \( {\boldsymbol{x}}^{(\rm{ext})} = ({\boldsymbol{x}}, 1)\) , \( {\boldsymbol{w}}^{(\rm{ext})} = ({\boldsymbol{w}}, b)\) and \( \overline{{\boldsymbol{w}}}^{(\rm{ext})} = (\overline{{\boldsymbol{w}}}, \overline{b})\) . Then show that \( h({\boldsymbol{x}}') \neq h({\boldsymbol{x}})\) by plugging in the definition of \( {\boldsymbol{x}}'\) .

Exercise 67

Complete the proof of Theorem 43.

Appendix

18 Probability theory

This appendix provides some basic notions and results in probability theory required in the main text. It is intended as a revision for a reader already familiar with these concepts. For more details and further proofs, we refer for example to the standard textbook [293].

18.1 Sigma-algebras, topologies, and measures

Let \( \Omega\) be a set, and denote by \( 2^\Omega\) the powerset of \( \Omega\) .

Definition 38

A subset \( \mathfrak{A}\subseteq 2^\Omega\) is called a sigma-algebra on \( \Omega\) if it satisfies

  1. \( \Omega\in\mathfrak{A}\) ,
  2. \( A^c\in\mathfrak{A}\) whenever \( A\in\mathfrak{A}\) ,
  3. \( \bigcup_{i\in\mathbb{N}}A_i\in\mathfrak{A}\) whenever \( A_i\in\mathfrak{A}\) for all \( i\in\mathbb{N}\) .

For a sigma-algebra \( \mathfrak{A}\) on \( \Omega\) , the tuple \( (\Omega,\mathfrak{A})\) is also referred to as a measurable space. For a measurable space, a subset \( A\subseteq\Omega\) is called measurable, if \( A\in\mathfrak{A}\) . Measurable sets are also called events.

Another key system of subsets of \( \Omega\) is that of a topology.

Definition 39

A subset \( \mathfrak{T}\subseteq 2^\Omega\) is called a topology on \( \Omega\) if it satisfies

  1. \( \emptyset\) , \( \Omega\in\mathfrak{T}\) ,
  2. \( \bigcap_{j=1}^n O_j\in\mathfrak{T}\) whenever \( n\in\mathbb{N}\) and \( O_1,\dots,O_n\in\mathfrak{T}\) ,
  3. \( \bigcup_{i\in I}O_i\in\mathfrak{T}\) whenever for an index set \( I\) holds \( O_i\in\mathfrak{T}\) for all \( i\in I\) .

If \( \mathfrak{T}\) is a topology on \( \Omega\) , we call \( (\Omega,\mathfrak{T})\) a topological space, and a set \( O\subseteq \Omega\) is called open if and only if \( O\in\mathfrak{T}\) .

Remark 23

The two notions differ in that a topology allows for unions of arbitrary (possibly uncountably many) sets, but only for finite intersection, whereas a sigma-algebra allows for countable unions and intersections.

Example 19

Let \( d\in\mathbb{N}\) and denote by \( B_\varepsilon({\boldsymbol{x}})=\{{\boldsymbol{y}}\in\mathbb{R}^d\,|\,\| {\boldsymbol{y}}-{\boldsymbol{x}} \|_{}<\varepsilon\}\) the set of points whose Euclidean distance to \( {\boldsymbol{x}}\) is less than \( \varepsilon\) . Then for every \( A\subseteq\mathbb{R}^d\) , the smallest topology on \( A\) containing \( A\cap B_\varepsilon({\boldsymbol{x}})\) for all \( \varepsilon>0\) , \( {\boldsymbol{x}}\in\mathbb{R}^d\) , is called the Euclidean topology on \( A\) .

If \( (\Omega,\mathfrak{T})\) is a topological space, then the Borel sigma-algebra refers to the smallest sigma-algebra on \( \Omega\) containing all open sets, i.e., all elements of \( \mathfrak{T}\) . Throughout this book, subsets of \( \mathbb{R}^d\) are always understood to be equipped with the Euclidean topology and the Borel sigma-algebra. The Borel sigma-algebra on \( \mathbb{R}^d\) is denoted by \( \mathfrak{B}_d\) .

We can now introduce measures.

Definition 40

Let \( (\Omega,\mathfrak{A})\) be a measurable space. A mapping \( \mu:\mathfrak{A}\to [0,\infty]\) is called a measure if it satisfies

  1. \( \mu(\emptyset)=0\) ,
  2. for every sequence \( (A_i)_{i\in\mathbb{N}}\subseteq \mathfrak{A}\) such that \( A_i\cap A_j=\emptyset\) whenever \( i\neq j\) , it holds

    \[ \mu\Big(\bigcup_{i\in\mathbb{N}}A_i\Big) = \sum_{i\in\mathbb{N}}\mu(A_i). \]

We say that the measure is finite if \( \mu(\Omega)<\infty\) , and it is sigma-finite if there exists a sequence \( (A_i)_{i\in\mathbb{N}}\subseteq\mathfrak{A}\) such that \( \Omega=\bigcup_{i\in\mathbb{N}}A_i\) and \( \mu(A_i)<\infty\) for all \( i\in\mathbb{N}\) . In case \( \mu(\Omega)=1\) , the measure is called a probability measure.

Example 20

One can show that there exists a unique measure \( \lambda\) on \( (\mathbb{R}^d,\mathfrak{B}_d)\) , such that for all sets of the type \( \times_{i=1}^d [a_i,b_i)\) with \( -\infty<a_i\le b_i<\infty\) it holds

\[ \lambda(\times_{i=1}^d [a_i,b_i)) = \prod_{i=1}^d (b_i-a_i). \]

This measure is called the Lebesgue measure.

If \( \mu\) is a measure on the measurable space \( (\Omega,\mathfrak{A})\) , then the triplet \( (\Omega,\mathfrak{A},\mu)\) is called a measure space. In case \( \mu\) is a probability measure, it is called a probability space.

Let \( (\Omega,\mathfrak{A},\mu)\) be a measure space. A subset \( N\subseteq\Omega\) is called a null-set, if \( N\) is measurable and \( \mu(N)=0\) . Moreover, an equality or inequality is said to hold \( \mu\) -almost everywhere or \( \mu\) -almost surely, if it is satisfied on the complement of a null-set. In case \( \mu\) is clear from context, we simply write “almost everywhere” or “almost surely” instead. Usually this refers to the Lebesgue measure.

18.2 Random variables

18.2.1 Measurability of functions

To define random variables, we first need to recall the measurability of functions.

Definition 41

Let \( (\Omega_1,\mathfrak{A}_1)\) and \( (\Omega_2,\mathfrak{A}_2)\) be two measurable spaces. A function \( f:\Omega_1\to \Omega_2\) is called measurable if

\[ f^{-1}(A_2)\mathrm{:}= \{\omega\in\Omega_1\,|\,f(\omega)\in A_2\}\in\mathfrak{A}_1~~ \text{for all }A_2\in\mathfrak{A}_2. \]

A mapping \( X:\Omega_1\to\Omega_2\) is called an \( \Omega_2\) -valued random variable if it is measurable.

Remark 24

We again point out the parallels to topological spaces: A function \( f:\Omega_1\to\Omega_2\) between two topological spaces \( (\Omega_1,\mathfrak{T}_1)\) and \( (\Omega_2,\mathfrak{T}_2)\) is called continuous if \( f^{-1}(O_2)\in\mathfrak{T}_1\) for all \( O_2\in\mathfrak{T}_2\) .

Let \( \Omega_1\) be a set and let \( (\Omega_2,\mathfrak{A}_2)\) be a measurable space. For \( X:\Omega_1\to\Omega_2\) , we can ask for the smallest sigma-algebra \( \mathfrak{A}_X\) on \( \Omega_1\) , such that \( X\) is measurable as a mapping from \( (\Omega_1,\mathfrak{A}_X)\) to \( (\Omega_2,\mathfrak{A}_2)\) . Clearly, for every sigma-algebra \( \mathfrak{A}_1\) on \( \Omega_1\) , \( X\) is measurable as a mapping from \( (\Omega_1,\mathfrak{A}_1)\) to \( (\Omega_2,\mathfrak{A}_2)\) if and only if every \( A\in \mathfrak{A}_X\) belongs to \( \mathfrak{A}_1\) ; or in other words, \( \mathfrak{A}_X\) is a sub sigma-algebra of \( \mathfrak{A}_1\) . It is easy to check that \( \mathfrak{A}_X\) is given through the following definition.

Definition 42

Let \( X:\Omega_1\to\Omega_2\) be a random variable. Then

\[ \mathfrak{A}_X\mathrm{:}= \{X^{-1}(A_2)\,|\,A_2\in\mathfrak{A}_2\}\subseteq 2^{\Omega_1} \]

is the sigma-algebra induced by \( X\) on \( \Omega_1\) .

18.2.2 Distribution and expectation

Now let \( (\Omega_1,\mathfrak{A}_1,\mathbb{P})\) be a probability space, let \( (\Omega_2,\mathfrak{A}_2)\) be a measurable space, and let \( X:\Omega_1\to\Omega_2\) be a random variable. Then \( X\) naturally induces a measure on \( (\Omega_2,\mathfrak{A}_2)\) via

\[ \mathbb{P}_X[A_2] \mathrm{:}= \mathbb{P}[X^{-1}(A_2)]~~\text{for all }A_2\in\mathfrak{A}_2. \]

Note that due to the measurability of \( X\) it holds \( X^{-1}(A_2)\in\mathfrak{A}_1\) , so that \( \mathbb{P}_X\) is well-defined.

Definition 43

The measure \( \mathbb{P}_X\) is called the distribution of \( X\) . If \( (\Omega_2,\mathfrak{A}_2)=(\mathbb{R}^d,\mathfrak{B}_d)\) , and there exists a function \( f_X:\mathbb{R}^d\to\mathbb{R}\) such that

\[ \mathbb{P}_X[A] = \int_A f_X({\boldsymbol{x}})\,\mathrm{d} {\boldsymbol{x}}~~ \text{ for all } A\in\mathfrak{B}_d, \]

then \( f_X\) is called the (Lebesgue) density of \( X\) .

Remark 25

The term distribution is often used without specifying an underlying probability space and random variable. In this case, “distribution” stands interchangeably for “probability measure”. For example, the statement “\( \mu\) is a distribution on \( \Omega_2\) ” means that \( \mu\) is a probability measure on the measurable space \( (\Omega_2,\mathfrak{A}_2)\) . In this case, there always exists a probability space \( (\Omega_1,\mathfrak{A}_1,\mathbb{P})\) and a random variable \( X:\Omega_1\to\Omega_2\) such that \( \mathbb{P}_X=\mu\) ; namely \( (\Omega_1,\mathfrak{A}_1,\mathbb{P})=(\Omega_2,\mathfrak{A}_2,\mu)\) and \( X(\omega)=\omega\) .

Example 21

Some important distributions include the following.

  • Bernoulli distribution: A random variable \( X:\Omega\to\{0,1\}\) is Bernoulli distributed if there exists \( p\in [0,1]\) such that \( \mathbb{P}[X=1]=p\) and \( \mathbb{P}[X=0]=1-p\) .
  • Uniform distribution: A random variable \( X:\Omega\to\mathbb{R}^d\) is uniformly distributed on a measurable set \( A\in\mathfrak{B}_d\) , if its density equals

    \[ f_X({\boldsymbol{x}})=\frac{1}{|A|}\boldsymbol{1}_A({\boldsymbol{x}}) \]

    where \( |A|<\infty\) is the Lebesgue measure of \( A\) .

  • Gaussian distribution: A random variable \( X:\Omega\to\mathbb{R}^d\) is Gaussian distributed with mean \( {\boldsymbol{m}}\in\mathbb{R}^d\) and the regular covariance matrix \( {\boldsymbol{C}}\in\mathbb{R}^{d\times d}\) , if its density equals

    \[ f_X({\boldsymbol{x}}) = \frac{1}{\sqrt{(2\pi)^{d}\det({\boldsymbol{C}})}}\exp\left(-\frac{1}{2}({\boldsymbol{x}}-{\boldsymbol{m}})^\top{\boldsymbol{C}}^{-1}({\boldsymbol{x}}-{\boldsymbol{m}})\right). \]

    We denote this distribution by \( \rm{ N}({\boldsymbol{m}},{\boldsymbol{C}})\) .

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space, let \( X:\Omega\to\mathbb{R}^d\) be an \( \mathbb{R}^d\) -valued random variable. We then call the Lebesgue integral

\[ \begin{equation} \mathbb{E}[X]\mathrm{:}= \int_{\Omega}X(\omega)\,\mathrm{d}\mathbb{P}(\omega) = \int_{\mathbb{R}^d}{\boldsymbol{x}}\,\mathrm{d}\mathbb{P}_X({\boldsymbol{x}}) \end{equation} \]

(274)

the expectation of \( X\) . Moreover, for \( k\in\mathbb{N}\) we say that \( X\) has finite \( k\) -th moment if \( \mathbb{E}[\| X \|_{}^k]<\infty\) . Similarly, for a probability measure \( \mu\) on \( \mathbb{R}^d\) and \( k\in\mathbb{N}\) , we say that \( \mu\) has finite \( k\) -th moment if

\[ \int_{\mathbb{R}^d}\| {\boldsymbol{x}} \|_{}^k\,\mathrm{d}\mu({\boldsymbol{x}})<\infty. \]

Furthermore, the matrix

\[ \int_{\Omega}(X(\omega)-\mathbb{E}[X])(X(\omega)-\mathbb{E}[X])^\top\,\mathrm{d}\mathbb{P}(\omega)\in\mathbb{R}^{d\times d} \]

is the covariance of \( X:\Omega\to\mathbb{R}^d\) . For \( d=1\) , it is called the variance of \( X\) and denoted by \( \mathbb{V}[X]\) .

Finally, we recall different variants of convergence for random variables.

Definition 44

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space, let \( X_j:\Omega\to\mathbb{R}^d\) , \( j\in\mathbb{N}\) , be a sequence of random variables, and let \( X:\Omega\to\mathbb{R}^d\) also be a random variable. The sequence is said to

  1. converge almost surely to \( X\) , if

    \[ \mathbb{P}\left[\left\{\omega\in\Omega\, \middle|\,\lim_{j\to\infty}X_j(\omega)=X(\omega)\right\}\right]=1, \]
  2. converge in probability to \( X\) , if

    \[ \text{for all }\varepsilon>0:~\lim_{j\to\infty}\mathbb{P}\left[\left\{\omega\in\Omega\, \middle|\,|X_j(\omega)-X(\omega)|>\varepsilon\right\}\right]=0, \]
  3. converge in distribution to \( X\) , if for all bounded continuous functions \( f:\mathbb{R}^d\to\mathbb{R}\)

    \[ \lim_{j\to\infty}\mathbb{E}[f\circ X_j]=\mathbb{E}[f\circ X]. \]

The notions in Definition 44 are ordered by decreasing strength, i.e., almost sure convergence implies convergence in probability, and convergence in probability implies convergence in distribution, see for example [293, Chapter 13]. Since \( \mathbb{E}[f\circ X] = \int_{\mathbb{R}^d}f(x)\,\mathrm{d}\mathbb{P}_X(x)\) , the notion of convergence in distribution only depends on the distribution \( \mathbb{P}_X\) of \( X\) . We thus also say that a sequence of random variables converges in distribution towards a measure \( \mu\) .

18.3 Conditionals, marginals, and independence

In this section, we concentrate on \( \mathbb{R}^d\) -valued random variables, although the following concepts can be extended to more general spaces.

18.3.1 Joint and marginal distribution

Let again \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space, and let \( X:\Omega\to\mathbb{R}^{d_X}\) , \( Y:\Omega\to \mathbb{R}^{d_Y}\) be two random variables. Then

\[ Z\mathrm{:}= (X,Y):\Omega\to\mathbb{R}^{d_X+d_Y} \]

is also a random variable. Its distribution \( \mathbb{P}_Z\) is a measure on the measurable space \( (\mathbb{R}^{d_X+d_Y},\mathfrak{B}_{d_X+d_Y})\) , and \( \mathbb{P}_Z\) is referred to as the joint distribution of \( X\) and \( Y\) . On the other hand, \( \mathbb{P}_{X}\) , \( \mathbb{P}_Y\) are called the marginal distributions of \( X\) , \( Y\) . Note that

\[ \mathbb{P}_{X}[A] = \mathbb{P}_Z[A\times\mathbb{R}^{d_Y}]~~\text{for all }A\in \mathfrak{B}_{d_X}, \]

and similarly for \( \mathbb{P}_{Y}\) . Thus the marginals \( \mathbb{P}_X\) , \( \mathbb{P}_Y\) , can be constructed from the joint distribution \( \mathbb{P}_Z\) . In turn, knowledge of the marginals is not sufficient to construct the joint distribution.

18.3.2 Independence

The concept of independence serves to formalize the situation, where knowledge of one random variable provides no information about another random variable. We first give the formal definition, and afterwards discuss the roll of a die as a simple example.

Definition 45

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space. Then two events \( A\) , \( B\in \mathfrak{A}\) are called independent if

\[ \mathbb{P}[A\cap B] =\mathbb{P}[A]\mathbb{P}[B]. \]

Two random variables \( X:\Omega\to\mathbb{R}^{d_X}\) and \( Y:\Omega\to\mathbb{R}^{d_Y}\) are called independent, if

\[ A,~B\text{ are independent for all }A\in \mathfrak{A}_{X},~B\in\mathfrak{A}_{Y}. \]

Two random variables are thus independent, if and only if all events in their induced sigma-algebras are independent. This turns out to be equivalent to the joint distribution \( \mathbb{P}_{(X,Y)}\) being equal to the product measure \( \mathbb{P}_X\otimes\mathbb{P}_Y\) ; the latter is characterized as the unique measure \( \mu\) on \( \mathbb{R}^{d_X+d_Y}\) satisfying \( \mu(A\times B)=\mathbb{P}_X[A]\mathbb{P}_Y[B]\) for all \( A\in\mathfrak{B}_{d_X}\) , \( B\in\mathfrak{B}_{d_Y}\) .

Example 22

Let \( \Omega=\{1,\dots,6\}\) represent the outcomes of rolling a fair die, let \( \mathfrak{A}=2^\Omega\) be the sigma-algebra, and let \( \mathbb{P}[\omega]=1/6\) for all \( \omega\in\Omega\) . Consider the three random variables

\[ X_1(\omega) = \left\{ \begin{array}{ll} 0 &\text{if \( \omega\) is odd}\\ 1 &\text{if \( \omega\) is even} \end{array}\right. ~ X_2(\omega) = \left\{ \begin{array}{ll} 0 &\text{if }\omega\le 3\\ 1 &\text{if }\omega\ge 4 \end{array}\right. ~ X_3(\omega) = \left\{ \begin{array}{ll} 0 &\text{if }\omega\in\{1,2\}\\ 1 &\text{if }\omega\in\{3,4\}\\ 2 &\text{if }\omega\in\{5,6\}. \end{array}\right. \]

These random variables can be interpreted as follows:

  • \( X_1\) indicates whether the roll yields an odd or even number.
  • \( X_2\) indicates whether the roll yields a number at most \( 3\) or at least \( 4\) .
  • \( X_3\) categorizes the roll into one of the groups \( \{1,2\}\) , \( \{3,4\}\) or \( \{5,6\}\) .

The induced sigma-algebras are

\[ \begin{align*} \mathfrak{A}_{X_1}&=\{\emptyset,\Omega,\{1,3,5\},\{2,4,6\}\}\\ \mathfrak{A}_{X_2}&=\{\emptyset,\Omega,\{1,2,3\},\{4,5,6\}\}\\ \mathfrak{A}_{X_3}&=\{\emptyset,\Omega,\{1,2\},\{3,4\},\{5,6\},\{1,2,3,4\},\{1,2,5,6\},\{3,4,5,6\}\}. \end{align*} \]

We leave it to the reader to formally check that \( X_1\) and \( X_2\) are not independent, but \( X_1\) and \( X_3\) are independent. This reflects the fact that, for example, knowing the outcome to be odd, makes it more likely that the number belongs to \( \{1,2,3\}\) rather than \( \{4,5,6\}\) . However, this knowledge provides no information on the three categories \( \{1,2\}\) , \( \{3,4\}\) , and \( \{5,6\}\) .
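The check can also be carried out by brute-force enumeration of the six equally likely outcomes, using that for discrete random variables independence is equivalent to \( \mathbb{P}[X=a,Y=b]=\mathbb{P}[X=a]\,\mathbb{P}[Y=b]\) for all values \( a\) , \( b\) . A small Python sketch:

from fractions import Fraction
from itertools import product

omega = range(1, 7)                          # outcomes of the fair die
P = {w: Fraction(1, 6) for w in omega}       # uniform probability

X1 = lambda w: 0 if w % 2 == 1 else 1        # 0 = odd, 1 = even
X2 = lambda w: 0 if w <= 3 else 1            # at most 3 / at least 4
X3 = lambda w: (w - 1) // 2                  # categories {1,2}, {3,4}, {5,6}

def independent(X, Y):
    vals_X, vals_Y = {X(w) for w in omega}, {Y(w) for w in omega}
    pX = {a: sum(P[w] for w in omega if X(w) == a) for a in vals_X}
    pY = {b: sum(P[w] for w in omega if Y(w) == b) for b in vals_Y}
    return all(
        sum(P[w] for w in omega if X(w) == a and Y(w) == b) == pX[a] * pY[b]
        for a, b in product(vals_X, vals_Y)
    )

print("X1, X2 independent:", independent(X1, X2))   # False
print("X1, X3 independent:", independent(X1, X3))   # True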

If \( X:\Omega\to\mathbb{R}\) , \( Y:\Omega\to\mathbb{R}\) are two independent random variables, then, due to \( \mathbb{P}_{(X,Y)}=\mathbb{P}_X\otimes\mathbb{P}_Y\)

\[ \begin{align*} \mathbb{E}[XY]&=\int_{\Omega} X(\omega)Y(\omega)\,\mathrm{d}\mathbb{P}(\omega)\\ &=\int_{\mathbb{R}^{2}} x y \,\mathrm{d}\mathbb{P}_{(X,Y)}(x,y)\\ &=\int_{\mathbb{R}}x \,\mathrm{d}\mathbb{P}_X(x)\int_{\mathbb{R}}y \,\mathrm{d}\mathbb{P}_X(y)\\ &=\mathbb{E}[X]\mathbb{E}[Y]. \end{align*} \]

Using this observation, it is easy to see that for a sequence of independent \( \mathbb{R}\) -valued random variables \( (X_i)_{i=1}^n\) with bounded second moments, there holds Bienaymé’s identity

\[ \begin{align} \mathbb{V}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathbb{V}\left[X_i\right]. \end{align} \]

(275)

18.3.3 Conditional distributions

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space, and let \( A\) , \( B\in\mathfrak{A}\) be two events. In case \( \mathbb{P}[B]>0\) , we define

\[ \begin{equation} \mathbb{P}[A|B]\mathrm{:}= \frac{\mathbb{P}[A\cap B]}{\mathbb{P}[B]}, \end{equation} \]

(276)

and call \( \mathbb{P}[A|B]\) the conditional probability of \( A\) given \( B\) .

Example 23

Consider the setting of Example 22. Let \( A=\{\omega\in\Omega\,|\,X_1(\omega)=0\}\) be the event that the outcome of the die roll was an odd number and let \( B=\{\omega\in\Omega\,|\,X_2(\omega)=0\}\) be the event that the outcome yielded a number at most \( 3\) . Then \( \mathbb{P}[B]=1/2\) , and \( \mathbb{P}[A\cap B]=1/3\) . Thus

\[ \mathbb{P}[A|B]= \frac{\mathbb{P}[A\cap B]}{\mathbb{P}[B]} = \frac{1/3}{1/2} = \frac{2}{3}. \]

This reflects that, given we know the outcome to be at most \( 3\) , the probability of the number being odd, i.e., in \( \{1,3\}\) , is larger than the probability of the number being even, i.e., equal to \( 2\) .

The conditional probability in (276) is only well-defined if \( \mathbb{P}[B]>0\) . In practice, we often encounter the case where we would like to condition on an event of probability zero.

Example 24

Consider the following procedure: We first draw a random number \( p\in [0,1]\) according to a uniform distribution on \( [0,1]\) . Afterwards we draw a random number \( X\in\{0,1\}\) according to a \( p\) -Bernoulli distribution, i.e., \( \mathbb{P}[X=1]=p\) and \( \mathbb{P}[X=0]=1-p\) . Then \( (p,X)\) is a joint random variable taking values in \( [0,1]\times\{0,1\}\) . What is \( \mathbb{P}[X=1|p=0.5]\) in this case? Intuitively, it should be \( 1/2\) , but note that \( \mathbb{P}[p=0.5]=0\) , so that (276) is not meaningful here.
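A small Monte Carlo sketch (with arbitrary sample sizes) illustrates the intuition: conditioning on shrinking windows around \( p=0.5\) , the empirical frequency of \( \{X=1\}\) approaches \( 1/2\) , in line with the notion of a regular conditional distribution introduced next.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
p = rng.uniform(0.0, 1.0, size=n)            # p ~ Uniform([0,1])
X = rng.uniform(0.0, 1.0, size=n) < p        # X | p ~ Bernoulli(p)

# {p = 0.5} has probability zero; condition on shrinking windows around 0.5 instead.
for eps in (0.1, 0.01, 0.001):
    window = np.abs(p - 0.5) < eps
    print(f"P[X=1 | |p - 0.5| < {eps}] =", round(X[window].mean(), 3))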

Definition 46 (regular conditional distribution)

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space, and let \( X:\Omega\to \mathbb{R}^{d_X}\) and \( Y:\Omega\to \mathbb{R}^{d_Y}\) be two random variables. Let \( \tau_{X|Y}:\mathfrak{B}_{d_X}\times \mathbb{R}^{d_Y} \to [0,1]\) satisfy

  1. \( y\mapsto \tau_{X|Y}(A,y):\mathbb{R}^{d_Y}\to [0,1]\) is measurable for every fixed \( A\in\mathfrak{B}_{d_X}\) ,
  2. \( A\mapsto \tau_{X|Y}(A,y)\) is a probability measure on \( (\mathbb{R}^{d_X},\mathfrak{B}_{d_X})\) for every \( y\in Y(\Omega)\) ,
  3. for all \( A\in\mathfrak{B}_{d_X}\) and all \( B\in\mathfrak{B}_{d_Y}\) holds

    \[ \mathbb{P}[X\in A,Y\in B]=\int_{B}\tau_{X|Y}(A,y)\,\mathrm{d}\mathbb{P}_{Y}(y). \]

Then \( \tau_{X|Y}\) is called a regular (version of the) conditional distribution of \( X\) given \( Y\) . In this case, we denote

\[ \mathbb{P}[X\in A|Y=y]:=\tau_{X|Y}(A,y), \]

and refer to this measure as the conditional distribution of \( X|Y=y\) .

Definition 46 provides a mathematically rigorous way of assigning a distribution to a random variable conditioned on an event that may have probability zero, as in Example 24. Existence and uniqueness of these conditional distributions hold in the following sense, see for example [293, Chapter 8] or [294, Chapter 3] for the specific statement given here.

Theorem 45

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space, and let \( X:\Omega\to\mathbb{R}^{d_X}\) , \( Y:\Omega\to\mathbb{R}^{d_Y}\) be two random variables. Then there exists a regular version \( \tau_1\) of the conditional distribution of \( X\) given \( Y\) .

Let \( \tau_2\) be another regular version of the conditional distribution. Then there exists a \( \mathbb{P}_Y\) -null set \( N\subseteq\mathbb{R}^{d_Y}\) , such that for all \( y\in N^c\cap Y(\Omega)\) , the two probability measures \( \tau_1(\cdot,y)\) and \( \tau_2(\cdot,y)\) coincide.

In particular, conditional distributions are only well-defined in a \( \mathbb{P}_Y\) -almost everywhere sense.

Definition 47

Let \( (\Omega,\mathfrak{A},\mathbb{P})\) be a probability space, and let \( X:\Omega\to\mathbb{R}^{d_X}\) , \( Y:\Omega\to\mathbb{R}^{d_Y}\) , \( Z:\Omega\to\mathbb{R}^{d_Z}\) be three random variables. We say that \( X\) and \( Z\) are conditionally independent given \( Y\) , if the two distributions \( X|Y=y\) and \( Z|Y=y\) are independent for \( \mathbb{P}_Y\) -almost every \( y\in Y(\Omega)\) .

18.4 Concentration inequalities

Let \( X_i:\Omega\to\mathbb{R}\) , \( i\in\mathbb{N}\) , be a sequence of random variables with finite first moments. The centered average over the first \( n\) terms

\[ \begin{equation} S_n\mathrm{:}= \frac{1}{n}\sum_{i=1}^n (X_i-\mathbb{E}[X_i]) \end{equation} \]

(277)

is another random variable, and by linearity of the expectation it holds \( \mathbb{E}[S_n]=0\) . The sequence is said to satisfy the strong law of large numbers if

\[ \mathbb{P}\Big[\limsup_{n\to\infty} |S_n|=0\Big]=1. \]

This is for example the case if the \( X_i\) are independent and there exists \( C<\infty\) such that \( \mathbb{V}[X_i]\le C\) for all \( i\in\mathbb{N}\) . Concentration inequalities provide bounds on the rate of this convergence.

We start with Markov’s inequality.

Lemma 36 (Markov’s inequality)

Let \( X:\Omega\to\mathbb{R}\) be a random variable, and let \( \varphi:[0,\infty)\to [0,\infty)\) be monotonically increasing. Then for all \( \varepsilon>0\) with \( \varphi(\varepsilon)>0\)

\[ \mathbb{P}[|X|\ge\varepsilon]\le \frac{\mathbb{E}[\varphi(|X|)]}{\varphi(\varepsilon)}. \]

Proof

We have

\[ \mathbb{P}[|X|\ge\varepsilon] =\int_{\{|X|\ge \varepsilon\}} 1\,\mathrm{d}\mathbb{P}(\omega) \le \int_{\Omega} \frac{\varphi(|X(\omega)|)}{\varphi(\varepsilon)}\,\mathrm{d}\mathbb{P}(\omega) =\frac{\mathbb{E}[\varphi(|X|)]}{\varphi(\varepsilon)}, \]

which gives the claim.

Applying Markov’s inequality with \( \varphi(x)\mathrm{:}= x^2\) to the random variable \( X-\mathbb{E}[X]\) directly gives Chebyshev’s inequality.

Lemma 37 (Chebyshev’s inequality)

Let \( X:\Omega\to\mathbb{R}\) be a random variable with finite variance. Then for all \( \varepsilon>0\)

\[ \mathbb{P}[|X-\mathbb{E}[X]|\ge\varepsilon]\le \frac{\mathbb{V}[X]}{\varepsilon^2}. \]

From Chebyshev’s inequality we obtain the next result, which is a quite general concentration inequality for random variables with finite variances.

Theorem 46

Let \( X_1,\dots,X_n\) be \( n\in\mathbb{N}\) independent real-valued random variables such that for some \( \varsigma>0\) holds \( \mathbb{E}[|X_i-\mu|^2]\le \varsigma^2\) for all \( i=1,\dots,n\) . Denote

\[ \begin{equation} \mu\mathrm{:}= \mathbb{E}\Big[\frac{1}{n}\sum_{j=1}^n X_j\Big]. \end{equation} \]

(278)

Then for all \( \varepsilon>0\)

\[ \mathbb{P}\Bigg[\Big|\frac{1}{n}\sum_{j=1}^n X_j-\mu\Big|\ge \varepsilon\Bigg]\le \frac{\varsigma^2}{\varepsilon^2n}. \]

Proof

Let \( S_n=\sum_{j=1}^n (X_j-\mathbb{E}[X_j])/n=(\sum_{j=1}^nX_j)/n-\mu\) . By Bienaymé’s identity (275), it holds that

\[ \begin{align*} \mathbb{V}[S_n]&=\frac{1}{n^2}\sum_{j=1}^n\mathbb{E}[(X_j-\mathbb{E}[X_j])^2] \le \frac{1}{n^2}\sum_{j=1}^n\mathbb{E}[(X_j-\mu)^2] \le \frac{\varsigma^2}{n}, \end{align*} \]

where we used that \( \mathbb{E}[(X_j-c)^2]=\mathbb{V}[X_j]+(\mathbb{E}[X_j]-c)^2\ge \mathbb{V}[X_j]\) for every \( c\in\mathbb{R}\) . Since \( \mathbb{E}[S_n]=0\) , Chebyshev’s inequality applied to \( S_n\) gives the statement.

If we have additional information about the random variables, then we can derive sharper bounds. In case of uniformly bounded random variables (rather than just bounded variance), Hoeffding’s inequality, which we recall next, shows an exponential rate of concentration around the mean.

Theorem 47 (Hoeffding’s inequality)

Let \( a\) , \( b \in \mathbb{R}\) . Let \( X_1,\dots,X_n\) be \( n\in\mathbb{N}\) independent real-valued random variables such that \( a \leq X_i \leq b\) almost surely for all \( i = 1, \dots, n\) , and let \( \mu\) be as in (278). Then, for every \( \varepsilon >0\)

\[ \begin{align*} \mathbb{P}\left[ \left| \frac{1}{n}\sum_{j=1}^n X_j - \mu \right| > \varepsilon \right] \leq 2 e^{-\frac{2n \varepsilon^2}{(b-a)^2}}. \end{align*} \]

A proof can, for example, be found in [13, Section B.4], from which this version is also taken.
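To get a feeling for the two concentration bounds, the following Monte Carlo sketch (with arbitrary choices of \( n\) , \( \varepsilon\) , and the number of trials) compares the bound of Theorem 46 and Hoeffding's bound for i.i.d. random variables that are uniformly distributed on \( [0,1]\) , so that \( \mu=1/2\) , \( \varsigma^2=1/12\) , and \( a=0\) , \( b=1\) .

import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 200, 0.1, 50_000

samples = rng.uniform(0.0, 1.0, size=(trials, n))
empirical = (np.abs(samples.mean(axis=1) - 0.5) > eps).mean()

chebyshev = (1 / 12) / (n * eps**2)          # Theorem 46 with varsigma^2 = 1/12
hoeffding = 2 * np.exp(-2 * n * eps**2)      # Theorem 47 with b - a = 1

print("empirical frequency:", empirical)
print("bound of Theorem 46:", chebyshev)
print("Hoeffding bound:    ", hoeffding)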

Finally, we recall the central limit theorem, in its multivariate formulation. We say that \( (X_j)_{j\in\mathbb{N}}\) is an i.i.d. sequence of random variables, if the random variables are independent and identically distributed. For a proof see [293, Theorem 15.58].

Theorem 48 (Multivariate central limit theorem)

Let \( ({\boldsymbol{X}}_n)_{n\in\mathbb{N}}\) be an i.i.d. sequence of \( \mathbb{R}^d\) -valued random variables, such that \( \mathbb{E}[{\boldsymbol{X}}_n]=\boldsymbol{0}\in\mathbb{R}^d\) and \( \mathbb{E}[X_{n,i}X_{n,j}]=C_{ij}\) for all \( i\) , \( j=1,\dots,d\) . Let

\[ {\boldsymbol{Y}}_n\mathrm{:}= \frac{{\boldsymbol{X}}_1+\dots+{\boldsymbol{X}}_n}{\sqrt{n}}\in\mathbb{R}^d. \]

Then \( {\boldsymbol{Y}}_n\) converges in distribution to \( \rm{ N}(\boldsymbol{0},{\boldsymbol{C}})\) as \( n\to\infty\) .
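A small simulation (with an arbitrary choice of distribution and sizes) illustrates Theorem 48 for \( d=2\) : for i.i.d. vectors with independent coordinates uniformly distributed on \( [-1,1]\) , one has \( \mathbb{E}[{\boldsymbol{X}}_n]=\boldsymbol{0}\) and \( {\boldsymbol{C}}=\frac{1}{3}{\boldsymbol{I}}\) , and both the empirical covariance of \( {\boldsymbol{Y}}_n\) and the expectation of a bounded continuous test function approach their Gaussian limits.

import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 2, 1_000, 5_000

X = rng.uniform(-1.0, 1.0, size=(trials, n, d))   # i.i.d. vectors with C = I/3
Y = X.sum(axis=1) / np.sqrt(n)                    # Y_n = (X_1 + ... + X_n)/sqrt(n)

print("empirical covariance of Y_n:\n", np.cov(Y.T))
print("limit covariance I/3:\n", np.eye(d) / 3)

# Convergence in distribution, tested with f(y) = cos(y_1 + y_2):
# for Z ~ N(0, 2/3) one has E[cos(Z)] = exp(-(2/3)/2).
print("E[f(Y_n)]   :", np.cos(Y.sum(axis=1)).mean())
print("E[f(N(0,C))]:", np.exp(-(2 / 3) / 2))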

19 Linear algebra and functional analysis

This appendix provides some basic notions and results in linear algebra and functional analysis required in the main text. It is intended as a revision for a reader already familiar with these concepts. For more details and further proofs, we refer for example to the standard textbooks [198, 295, 33, 296, 106].

19.1 Singular value decomposition and pseudoinverse

Let \( {\boldsymbol{A}}\in\mathbb{R}^{m\times n}\) , \( m\) , \( n\in\mathbb{N}\) . Then the square roots of the positive eigenvalues of \( {\boldsymbol{A}}^\top{\boldsymbol{A}}\) (or equivalently of \( {\boldsymbol{A}}{\boldsymbol{A}}^\top\) ) are referred to as the singular values of \( {\boldsymbol{A}}\) . We denote them in the following by \( s_1\ge s_2\ge\dots\ge s_r>0\) , where \( r\mathrm{:}= \rm{ rank}({\boldsymbol{A}})\) , so that \( r\le \min\{m,n\}\) . Every matrix allows for a singular value decomposition (SVD) as stated in the next theorem, see, e.g., [198, Theorem 1.2.1]. Recall that a matrix \( {\boldsymbol{V}}\in\mathbb{R}^{n\times n}\) is called orthogonal, if \( {\boldsymbol{V}}^\top{\boldsymbol{V}}\) is the identity.

Theorem 49 (Singular value decomposition)

Let \( {\boldsymbol{A}}\in\mathbb{R}^{m\times n}\) . Then there exist orthogonal matrices \( {\boldsymbol{U}}\in\mathbb{R}^{m\times m}\) , \( {\boldsymbol{V}}\in\mathbb{R}^{n\times n}\) such that with

\[ {\boldsymbol{ \Sigma }} \mathrm{:}= \begin{pmatrix} s_1 & & & \\ & \ddots & & \boldsymbol{0}\\ & & s_r & \\ & \boldsymbol{0} & & \boldsymbol{0} \end{pmatrix}\in\mathbb{R}^{m\times n} \]

it holds that \( {\boldsymbol{A}}={\boldsymbol{U}}{\boldsymbol{ \Sigma }}{\boldsymbol{V}}^\top\) , where \( \boldsymbol{0}\) stands for a zero block of suitable size.
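The decomposition can be inspected numerically; the following NumPy sketch (with an arbitrary placeholder matrix) verifies \( {\boldsymbol{A}}={\boldsymbol{U}}{\boldsymbol{ \Sigma }}{\boldsymbol{V}}^\top\) and that the singular values are the square roots of the eigenvalues of \( {\boldsymbol{A}}^\top{\boldsymbol{A}}\) .

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A)            # U is 5x5, s holds s_1 >= s_2 >= s_3, Vt is 3x3
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))                 # A = U Sigma V^T
eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]     # eigenvalues of A^T A, descending
print(np.allclose(s, np.sqrt(eigs)))                  # s_i = sqrt(lambda_i)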

Given \( {\boldsymbol{y}}\in\mathbb{R}^m\) , consider the linear system

\[ \begin{equation} {\boldsymbol{A}}{\boldsymbol{w}}={\boldsymbol{y}}. \end{equation} \]

(279)

If \( {\boldsymbol{A}}\) is not a regular square matrix, then in general there need not be a unique solution \( {\boldsymbol{w}}\in\mathbb{R}^n\) to (279). However, there exists a unique minimal norm solution

\[ \begin{equation} {\boldsymbol{w}}_*=\rm{argmin}_{{\boldsymbol{w}}\in M}\| {\boldsymbol{w}} \|_{},~~ M = \{{\boldsymbol{w}}\in\mathbb{R}^n\,|\,\| {\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{y}} \|_{}\le \| {\boldsymbol{A}}{\boldsymbol{v}}-{\boldsymbol{y}} \|_{}~\forall{\boldsymbol{v}}\in\mathbb{R}^n\}. \end{equation} \]

(280)

The minimal norm solution can be expressed via the Moore-Penrose pseudoinverse \( {\boldsymbol{A}}^\dagger\in\mathbb{R}^{n\times m}\) of \( {\boldsymbol{A}}\) ; given an (arbitrary) SVD \( {\boldsymbol{A}}={\boldsymbol{U}}\Sigma{\boldsymbol{V}}^\top\) , it is defined as

\[ \begin{equation} {\boldsymbol{A}}^\dagger\mathrm{:}= {\boldsymbol{V}}{\boldsymbol{ \Sigma }}^\dagger {\boldsymbol{U}}^\top~~\text{where}~~ {\boldsymbol{ \Sigma }}^\dagger\mathrm{:}= \begin{pmatrix} s_1^{-1} & & & \\ & \ddots & & \boldsymbol{0}\\ & & s_r^{-1} & \\ & \boldsymbol{0} & & \boldsymbol{0} \end{pmatrix}\in\mathbb{R}^{n\times m}. \end{equation} \]

(281)

The following theorem makes this precise, e.g., [198, Theorem 1.2.10].

Theorem 50

Let \( {\boldsymbol{A}}\in\mathbb{R}^{m\times n}\) . Then there exists a unique minimum norm solution \( {\boldsymbol{w}}_*\in\mathbb{R}^n\) in (280) and it holds \( {\boldsymbol{w}}_*={\boldsymbol{A}}^\dagger{\boldsymbol{y}}\) .

Proof

Denote by \( {\boldsymbol{ \Sigma }}_r\in\mathbb{R}^{r\times r}\) the upper left \( r\times r\) block of \( {\boldsymbol{ \Sigma }}\) . Since \( {\boldsymbol{U}}\in\mathbb{R}^{m\times m}\) is orthogonal,

\[ \| {\boldsymbol{A}}{\boldsymbol{w}}-{\boldsymbol{y}} \|_{} = \left\| \begin{pmatrix} {\boldsymbol{ \Sigma }}_r&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{0} \end{pmatrix} {\boldsymbol{V}}^\top{\boldsymbol{w}}-{\boldsymbol{U}}^\top{\boldsymbol{y}} \right\|_{}. \]

We can thus write \( M\) in (280) as

\[ \begin{align*} M &= \left\{{\boldsymbol{w}}\in\mathbb{R}^n\, \middle|\,\big(\begin{pmatrix} {\boldsymbol{ \Sigma }}_r \,\boldsymbol{0} \end{pmatrix}{\boldsymbol{V}}^\top{\boldsymbol{w}}\big)_{i=1}^r = ({\boldsymbol{U}}^\top{\boldsymbol{y}})_{i=1}^r \right\}\\ &= \left\{{\boldsymbol{w}}\in\mathbb{R}^n\, \middle|\,({\boldsymbol{V}}^\top{\boldsymbol{w}})_{i=1}^r = {\boldsymbol{ \Sigma }}_r^{-1}({\boldsymbol{U}}^\top{\boldsymbol{y}})_{i=1}^r\right\}\\ &= \left\{{\boldsymbol{V}}{\boldsymbol{z}}\, \middle|\,{\boldsymbol{z}}\in\mathbb{R}^n,~({\boldsymbol{z}})_{i=1}^r = {\boldsymbol{ \Sigma }}_r^{-1}({\boldsymbol{U}}^\top{\boldsymbol{y}})_{i=1}^r\right\} \end{align*} \]

where \( ({\boldsymbol{a}})_{i=1}^r\) denotes the first \( r\) entries of a vector \( {\boldsymbol{a}}\) , and for the last equality we used orthogonality of \( {\boldsymbol{V}}\in\mathbb{R}^{n\times n}\) . Since \( \| {\boldsymbol{V}}{\boldsymbol{z}} \|_{}=\| {\boldsymbol{z}} \|_{}\) , the unique minimal norm solution is obtained by setting components \( r+1,\dots,n\) of \( {\boldsymbol{z}}\) to zero, which yields

\[ {\boldsymbol{w}}_* = {\boldsymbol{V}} \begin{pmatrix} {\boldsymbol{ \Sigma }}_r^{-1}({\boldsymbol{U}}^\top{\boldsymbol{y}})_{i=1}^r\\ \boldsymbol{0} \end{pmatrix}={\boldsymbol{V}}{\boldsymbol{ \Sigma }}^\dagger{\boldsymbol{U}}^\top{\boldsymbol{y}} = {\boldsymbol{A}}^\dagger{\boldsymbol{y}} \]

as claimed.
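Theorem 50 can likewise be checked numerically. In the following NumPy sketch (with an arbitrary placeholder matrix and right-hand side), the Moore-Penrose solution coincides with the minimum-norm least-squares solution returned by np.linalg.lstsq, and adding a kernel direction of \( {\boldsymbol{A}}\) yields another solution of \( {\boldsymbol{A}}{\boldsymbol{w}}={\boldsymbol{y}}\) with strictly larger norm.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))            # underdetermined system: many solutions
y = rng.standard_normal(3)

w_star = np.linalg.pinv(A) @ y                       # w_* = A^dagger y
w_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]       # minimum-norm least-squares solution

print(np.allclose(w_star, w_lstsq))                  # both give the same w_*
print(np.allclose(A @ w_star, y))                    # A has full row rank here, so A w_* = y

kernel_dir = np.linalg.svd(A)[2][-1]                 # a unit vector in the kernel of A
w_other = w_star + kernel_dir                        # another solution of A w = y
print(np.allclose(A @ w_other, y))
print(np.linalg.norm(w_other) > np.linalg.norm(w_star))   # w_* has the smaller norm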

19.2 Vector spaces

Definition 48

Let \( \mathbb{K}\in\{\mathbb{R},\mathbb{C}\}\) . A vector space (over \( \mathbb{K}\) ) is a set \( X\) such that the following holds:

  1. Properties of addition: For every \( x\) , \( y \in X\) there exists \( x+y \in X\) such that for all \( z \in X\)

    \[ \begin{align*} x+y = y+x~\text{ and }~ x + (y+z) = (x+y) + z. \end{align*} \]

    Moreover, there exists a unique element \( 0 \in X\) such that \( x + 0 = x\) for all \( x \in X\) and for each \( x\in X\) there exists a unique \( -x \in X\) such that \( x + (-x) = 0\) .

  2. Properties of scalar multiplication: There exists a map \( (\alpha, x) \mapsto \alpha x\) from \( \mathbb{K}\times X\) to \( X\) called scalar multiplication. It satisfies \( 1x = x\) , \( (\alpha \beta) x = \alpha(\beta x)\) , \( \alpha(x+y) = \alpha x + \alpha y\) , and \( (\alpha + \beta)x = \alpha x + \beta x\) for all \( \alpha\) , \( \beta\in\mathbb{K}\) and all \( x\) , \( y\in X\) .

We call the elements of a vector space vectors.

If the field is clear from context, we simply refer to \( X\) as a vector space. We will primarily consider the case \( \mathbb{K}=\mathbb{R}\) , and in this case we also say that \( X\) is a real vector space.

To introduce a notion of convergence on a vector space \( X\) , it needs to be equipped with a topology, see Definition 39. A topological vector space is a vector space which is also a topological space, and in which addition and scalar multiplication are continuous maps. We next discuss the most important instances of topological vector spaces.

19.2.1 Metric spaces

An important class of topological vector spaces consists of vector spaces that are also metric spaces.

Definition 49

For a set \( X\) , we call a map \( d_X \colon X\times X \to [0,\infty)\) a metric, if

  1. \( d_X(x, y) = 0\) if and only if \( x = y\) ,
  2. \( d_X(x, y) = d_X(y, x)\) for all \( x\) , \( y \in X\) ,
  3. \( d_X(x, z) \leq d_X(x, y) + d_X(y, z)\) for all \( x\) , \( y\) , \( z \in X\) .

We call \( (X,d_X)\) a metric space.

In a metric space \( (X,d_X)\) , we denote the open ball with center \( x\) and radius \( r>0\) by

\[ \begin{equation} B_r(x) \mathrm{:}= \{y \in X\,|\,d_X(x, y) < r\}. \end{equation} \]

(282)

Every metric space is naturally equipped with a topology: A set \( A\subseteq X\) is open if and only if for every \( x\in A\) there exists \( \varepsilon>0\) such that \( B_\varepsilon(x)\subseteq A\) . Therefore every metric vector space is a topological vector space.

Definition 50

A metric space \( (X, d_X)\) is called complete, if every Cauchy sequence with respect to \( d_X\) converges to an element in \( X\) .

For complete metric spaces, an immensely powerful tool is Baire’s category theorem. To state it, we require the notion of density of sets. Let \( A\) , \( B \subseteq X\) for a topological space \( X\) . Then \( A\) is dense in \( B\) if the closure of \( A\) , denoted by \( \overline{A}\) , satisfies \( \overline{A} \supseteq B\) .

Theorem 51 (Baire’s category theorem)

Let \( X\) be a complete metric space. Then the intersection of every countable collection of dense open subsets of \( X\) is dense in \( X\) .

Theorem 51 implies that if \( X = \bigcup_{i=1}^\infty V_i\) for a sequence of closed sets \( V_i\) , then at least one of the \( V_i\) has to contain an open set. Indeed, assuming all \( V_i\) ’s have empty interior implies that \( V_i^c = X \setminus V_i\) is open and dense for all \( i \in \mathbb{N}\) . By De Morgan’s laws, it then holds that \( \emptyset = \bigcap_{i=1}^\infty V_i^c\) which contradicts Theorem 51.

19.2.2 Normed spaces

A norm is a way of assigning a length to a vector. A normed space is a vector space with a norm.

Definition 51

Let \( X\) be a vector space over a field \( \mathbb{K}\in\{\mathbb{R},\mathbb{C}\}\) . A map \( \| \cdot \|_{X}:X\to [0,\infty)\) is called a norm if the following hold for all \( x\) , \( y\in X\) and all \( \alpha\in \mathbb{K}\) :

  1. triangle inequality: \( \| x+y \|_{X} \leq \| x \|_{X} + \| y \|_{X}\) ,
  2. absolute homogeneity: \( \| \alpha x \|_{X} = |\alpha| \| x \|_{X}\) ,
  3. positive definiteness: \( \| x \|_{X}=0\) if and only if \( x=0\) .

We call \( (X, \| \cdot \|_{X})\) a normed space and omit \( \| \cdot \|_{X}\) from the notation if it is clear from the context.

Every norm induces a metric \( d_X\) and hence a topology via \( d_X(x,y) \mathrm{:}= \|x-y\|_X\) . In particular, every normed vector space is a topological vector space with respect to this topology.

19.2.3 Banach spaces

Definition 52

A normed vector space is called a Banach space if and only if it is complete.

Before presenting the main results on Banach spaces, we collect a couple of important examples.

  • Euclidean spaces: Let \( d\in\mathbb{N}\) . Then \( (\mathbb{R}^d,\| \cdot \|_{})\) is a Banach space.
  • Continuous functions: Let \( d \in \mathbb{N}\) and let \( K \subseteq \mathbb{R}^d\) be compact. The set of continuous functions from \( K\) to \( \mathbb{R}\) is denoted by \( C(K)\) . For \( \alpha\) , \( \beta \in \mathbb{R}\) and \( f\) , \( g \in C(K)\) , we define addition and scalar multiplication by \( (\alpha f + \beta g)({\boldsymbol{x}}) = \alpha f({\boldsymbol{x}}) + \beta g({\boldsymbol{x}})\) for all \( {\boldsymbol{x}} \in K\) . The vector space \( C(K)\) equipped with the supremum norm

    \[ \begin{align*} \| f \|_{\infty} \mathrm{:}= \sup_{{\boldsymbol{x}} \in K}|f({\boldsymbol{x}})|, \end{align*} \]

    is a Banach space.

  • Lebesgue spaces: Let \( (\Omega, \mathfrak{A}, \mu)\) be a measure space and let \( 1 \leq p < \infty\) . Then the Lebesgue space \( L^p(\Omega, \mu)\) is defined as the vector space of all equivalence classes of measurable functions \( f:\Omega\to\mathbb{R}\) that coincide \( \mu\) -almost everywhere and satisfy

    \[ \begin{equation} \|f\|_{L^p(\Omega, \mu)} \mathrm{:}= \left(\int_{\Omega} |f(x)|^p d\mu(x)\right)^{1/p} < \infty. \end{equation} \]

    (283)

    The integral is independent of the choice of representative of the equivalence class of \( f\) . Addition and scalar multiplication are defined pointwise as for \( C(K)\) . It then holds that \( L^p(\Omega, \mu)\) is a Banach space. If \( \Omega\) is a measurable subset of \( \mathbb{R}^d\) for \( d \in \mathbb{N}\) , and \( \mu\) is the Lebesgue measure, we typically omit \( \mu\) from the notation and simply write \( L^p(\Omega)\) . If \( \Omega = \mathbb{N}\) and the measure is the counting measure, we denote these spaces by \( \ell^p(\mathbb{N})\) or simply \( \ell^p\) .

    The definition can be extended to complex or \( \mathbb{R}^d\) -valued functions. In the latter case the integrand in (283) is replaced by \( \| f(x) \|_{}^p\) . We denote these spaces again by \( L^p(\Omega,\mu)\) with the precise meaning being clear from context.

  • Essentially bounded functions: Let \( (\Omega, \mathfrak{A}, \mu)\) be a measure space. The \( L^p\) spaces can be extended to \( p=\infty\) by defining the \( L^\infty\) -norm

    \[ \begin{align*} \| f \|_{{L^\infty(\Omega, \mu)}} \mathrm{:}= \inf\{C \geq 0\,|\, \mu(\{|f| > C\}) = 0\}. \end{align*} \]

    This is indeed a norm on the space of equivalence classes of measurable functions from \( \Omega\to\mathbb{R}\) that coincide \( \mu\) -almost everywhere. Moreover, with this norm, \( L^\infty(\Omega, \mu)\) is a Banach space. If \( \Omega = \mathbb{N}\) and \( \mu\) is the counting measure, we denote the resulting space by \( \ell^\infty(\mathbb{N})\) or simply \( \ell^\infty\) . As in the case \( p<\infty\) , it is straightforward to extend the definition to complex or \( \mathbb{R}^d\) -valued functions, for which the same notation will be used.

We continue by introducing the concept of dual spaces.

Definition 53

Let \( (X, \| \cdot \|_{X})\) be a normed vector space over \( \mathbb{K}\in\{\mathbb{R},\mathbb{C}\}\) . Linear maps from \( X\to\mathbb{K}\) are called linear functionals. The vector space of all continuous linear functionals on \( X\) is called the (topological) dual space of \( X\) and is denoted by \( X'\) .

Together with the natural addition and scalar multiplication (for all \( h\) , \( g\in X'\) , \( \alpha\in\mathbb{K}\) and \( x\in X\) )

\[ (h+g)(x) \mathrm{:=} h(x) + g(x)~\text{and}~ (\alpha h) (x) \mathrm{:=} \alpha (h(x)), \]

\( X'\) is a vector space. We equip \( X'\) with the norm

\[ \| f \|_{X'}\mathrm{:}= \sup_{\substack{x\in X\\ \| x \|_{X}=1}}|f(x)|. \]

The space \( (X',\| \cdot \|_{X'})\) is always a Banach space, even if \( (X,\| \cdot \|_{X})\) is not complete [33, Theorem 4.1].

The dual space can often be used to characterize the original Banach space. One way in which the dual space \( X'\) captures certain algebraic and geometric properties of the Banach space \( X\) is through the Hahn-Banach theorem. In this book, we use one specific variant of this theorem and its implication for the existence of dual bases, see for instance [33, Theorem 3.5].

Theorem 52 (Geometric Hahn-Banach, subspace version)

Let \( M\) be a subspace of a Banach space \( X\) and let \( x_0 \in X\) . If \( x_0\) is not in the closure of \( M\) , then there exists \( f \in X'\) such that \( f(x_0) = 1\) and \( f(x) = 0\) for every \( x \in M\) .

An immediate consequence of Theorem 52 that will be used throughout this book is the existence of a dual basis. Let \( X\) be a Banach space and let \( (x_i)_{i\in\mathbb{N}} \subseteq X\) be such that for all \( i \in \mathbb{N}\)

\[ \begin{align*} x_i \not \in \overline{\mathrm{span}\{x_j\,|\,j\in\mathbb{N},~j \neq i\}}. \end{align*} \]

Then, for every \( i \in \mathbb{N}\) , there exists \( f_i \in X'\) such that \( f_i(x_j) = 0\) if \( i \neq j\) and \( f_i(x_i) = 1\) .

19.2.4 Hilbert spaces

Often, we require more structure than that provided by normed spaces. An inner product offers additional tools to compare vectors by introducing notions of angle and orthogonality. For simplicity we restrict ourselves to real vector spaces in the following.

Definition 54

Let \( X\) be a real vector space. A map \( \langle \cdot , \cdot \rangle_X:X \times X\to \mathbb{R}\) is called an inner product on \( X\) if the following hold for all \( x\) , \( y\) , \( z \in X\) and all \( \alpha\) , \( \beta\in \mathbb{R}\) :

  1. linearity: \( \langle \alpha x + \beta y, z\rangle_X = \alpha \langle x, z\rangle_X + \beta \langle y, z \rangle_X\) ,
  2. symmetry: \( \langle x, y\rangle_X = \langle y, x\rangle_X\) ,
  3. positive definiteness: \( \langle x, x\rangle_X > 0\) for all \( x \neq 0\) .

Example 25

For \( p=2\) , the Lebesgue spaces \( L^2(\Omega)\) and \( \ell^2(\mathbb{N})\) are Hilbert spaces with inner products

\[ \begin{align*} \left\langle f, g\right\rangle_{L^2(\Omega)}=\int_\Omega f(x)g(x)\,\mathrm{d} x~~\text{for all }f,~g\in L^2(\Omega), \end{align*} \]

and

\[ \begin{align*} \left\langle {\boldsymbol{x}}, {\boldsymbol{y}}\right\rangle_{\ell^2(\mathbb{N})}=\sum_{j\in\mathbb{N}}x_jy_j~~\text{for all }{\boldsymbol{x}}=(x_j)_{j\in\mathbb{N}},~{\boldsymbol{y}}=(y_j)_{j\in\mathbb{N}}\in\ell^2(\mathbb{N}). \end{align*} \]

On inner product spaces the so-called Cauchy-Schwarz inequality holds.

Theorem 53 (Cauchy-Schwarz inequality)

Let \( X\) be a vector space with inner product \( \langle \cdot , \cdot \rangle_X\) . Then it holds for all \( x\) , \( y \in X\)

\[ \begin{align*} |\langle x, y \rangle_X| \leq \sqrt{\left\langle x, x\right\rangle_{X}\left\langle y, y\right\rangle_{X} }. \end{align*} \]

Moreover, equality holds if and only if \( x\) and \( y\) are linearly dependent.

Proof

Let \( x\) , \( y\in X\) . If \( y=0\) then \( \left\langle x, y\right\rangle_{X}=0\) and thus the statement is trivial. Assume in the following \( y\neq 0\) , so that \( \left\langle y, y\right\rangle_{X}>0\) . Using the linearity and symmetry properties it holds for all \( \alpha\in\mathbb{R}\)

\[ 0\le \left\langle x-\alpha y, x-\alpha y\right\rangle_{X} = \left\langle x, x\right\rangle_{X}-2\alpha\left\langle x, y\right\rangle_{X} +\alpha^2\left\langle y, y\right\rangle_{X}. \]

Letting \( \alpha\mathrm{:}= \left\langle x, y\right\rangle_{X}/\left\langle y, y\right\rangle_{X}\) we get

\[ 0\le \left\langle x, x\right\rangle_{X}-2\frac{\left\langle x, y\right\rangle_{X}^2}{\left\langle y, y\right\rangle_{X}}+\frac{\left\langle x, y\right\rangle_{X}^2}{\left\langle y, y\right\rangle_{X}}=\left\langle x, x\right\rangle_{X}-\frac{\left\langle x, y\right\rangle_{X}^2}{\left\langle y, y\right\rangle_{X}}. \]

Rearranging terms gives the inequality. Moreover, if \( y\neq 0\) and equality holds, then the above computation shows \( \left\langle x-\alpha y, x-\alpha y\right\rangle_{X}=0\) , so that \( x=\alpha y\) by positive definiteness; conversely, if \( x\) and \( y\) are linearly dependent, equality follows directly from linearity and symmetry.
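As a concrete check, let \( X=\mathbb{R}^2\) with the Euclidean inner product \( \left\langle x, y\right\rangle_{X}=x_1y_1+x_2y_2\) , and let \( x=(1,2)\) , \( y=(3,4)\) . Then

\[ \begin{align*} |\langle x, y \rangle_X| = 11 \leq \sqrt{\left\langle x, x\right\rangle_{X}\left\langle y, y\right\rangle_{X}} = \sqrt{5\cdot 25}=\sqrt{125}\approx 11.18, \end{align*} \]

and the inequality is strict since \( x\) and \( y\) are linearly independent.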

Every inner product \( \langle \cdot , \cdot \rangle_X\) induces a norm via

\[ \begin{equation} \| x \|_{X} \mathrm{:}= \sqrt{\langle x, x \rangle_X}~~\text{for all }x \in X. \end{equation} \]

(284)

The properties of the inner product immediately yield the polar identity

\[ \begin{align} \|x + y\|_X^2 = \|x\|_X^2 + 2 \langle x, y\rangle_X + \|y\|_X^2. \end{align} \]

(285)

The fact that (284) indeed defines a norm follows by an application of the Cauchy-Schwarz inequality to (285), which yields that \( \| \cdot \|_{X}\) satisfies the triangle inequality. This gives rise to the definition of a Hilbert space.
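Explicitly, combining (285) with the Cauchy-Schwarz inequality gives

\[ \begin{align*} \|x + y\|_X^2 \leq \|x\|_X^2 + 2 \|x\|_X\|y\|_X + \|y\|_X^2 = \left(\|x\|_X + \|y\|_X\right)^2, \end{align*} \]

and taking square roots yields the triangle inequality \( \|x+y\|_X\leq\|x\|_X+\|y\|_X\) .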

Definition 55

Let \( H\) be a real vector space with inner product \( \left\langle \cdot, \cdot\right\rangle_{H}\) . Then \( (H,\left\langle \cdot, \cdot\right\rangle_{H})\) is called a Hilbert space if and only if \( H\) is complete with respect to the norm \( \| \cdot \|_{H}\) induced by the inner product.

A standard example of a Hilbert space is \( L^2\) : Let \( (\Omega, \mathfrak{A}, \mu)\) be a measure space. Then

\[ \begin{align*} \langle f, g \rangle_{L^2(\Omega, \mu)} = \int_{\Omega} f(x)g(x) \,\mathrm{d}\mu(x)~~ \text{ for all }f,g \in L^2(\Omega, \mu), \end{align*} \]

defines an inner product on \( L^2(\Omega,\mu)\) compatible with the \( L^2(\Omega, \mu)\) -norm.

In a Hilbert space, we can compare vectors not only via their distance, measured by the norm, but also by using the inner product, which corresponds to their relative orientation. This leads to the concept of orthogonality.

Definition 56

Let \( (H,\left\langle \cdot, \cdot\right\rangle_{H})\) be a Hilbert space and let \( f\) , \( g \in H\) . We say that \( f\) and \( g\) are orthogonal if \( \left\langle f, g\right\rangle_{H} = 0\) , denoted by \( f \perp g\) . For \( F\) , \( G \subseteq H\) we write \( F \perp G\) if \( f\perp g\) for all \( f \in F\) , \( g \in G\) . Finally, for \( F\subseteq H\) , the set \( F^\perp=\{g\in H\,|\,g\perp f\text{ for all }f\in F\}\) is called the orthogonal complement of \( F\) in \( H\) .
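For example, in \( L^2([0,2\pi])\) the functions \( f(x)=\sin(x)\) and \( g(x)=\cos(x)\) are orthogonal, since

\[ \begin{align*} \left\langle f, g\right\rangle_{L^2([0,2\pi])}=\int_0^{2\pi}\sin(x)\cos(x)\,\mathrm{d} x=\frac{1}{2}\int_0^{2\pi}\sin(2x)\,\mathrm{d} x=0. \end{align*} \]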

For orthogonal vectors, the polar identity immediately implies the Pythagorean theorem.

Theorem 54 (Pythagorean theorem)

Let \( (H,\left\langle \cdot, \cdot\right\rangle_{H})\) be a Hilbert space, \( n\in \mathbb{N}\) , and let \( f_1, \dots, f_n \in H\) be pairwise orthogonal vectors. Then,

\[ \begin{align*} \left \|\sum_{i=1}^n f_i \right \|_H^2 = \sum_{i=1}^n \left \| f_i \right \|_H^2. \end{align*} \]
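This can be seen as follows: for \( n=2\) , the polar identity (285) together with \( \left\langle f_1, f_2\right\rangle_{H}=0\) gives \( \| f_1+f_2 \|_{H}^2=\| f_1 \|_{H}^2+\| f_2 \|_{H}^2\) . For general \( n\) , induction and the linearity of the inner product yield

\[ \begin{align*} \left \|\sum_{i=1}^n f_i \right \|_H^2 = \left \|\sum_{i=1}^{n-1} f_i \right \|_H^2 + 2\sum_{i=1}^{n-1}\left\langle f_i, f_n\right\rangle_{H} + \left \| f_n \right \|_H^2 = \sum_{i=1}^n \left \| f_i \right \|_H^2. \end{align*} \]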

A final property of Hilbert spaces that we encounter in this book is the existence of unique projections onto convex sets. For a proof, see for instance [295, Thm. 4.10].

Theorem 55

Let \( (H,\left\langle \cdot, \cdot\right\rangle_{H})\) be a Hilbert space and let \( K\neq\emptyset\) be a closed convex subset of \( H\) . Then for every \( h\in H\) there exists a unique \( k_0 \in K\) such that

\[ \begin{align*} \| h - k_0 \|_{H} = \inf\{\| h - k \|_{H}\,|\,k \in K\}. \end{align*} \]
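As an illustration, let \( H=\mathbb{R}^2\) with the Euclidean inner product and \( K=\{(t,0)\,|\,t\in\mathbb{R}\}\) . For \( h=(a,b)\) the unique minimizer is \( k_0=(a,0)\) , since

\[ \begin{align*} \| h-k \|_{H}^2=(a-t)^2+b^2\geq b^2=\| h-k_0 \|_{H}^2~~\text{for all }k=(t,0)\in K. \end{align*} \]

More generally, if \( K\) is a closed subspace of \( H\) , then \( h\mapsto k_0\) is the orthogonal projection onto \( K\) , and \( k_0\) is characterized by the condition \( h-k_0\in K^\perp\) .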

References

[1] Tomaso Poggio and Andrzej Banburski and Qianli Liao Theoretical issues in deep networks Proceedings of the National Academy of Sciences 2020 117 48 30039-30045 10.1073/pnas.1907369117

[2] Matus Telgarsky Deep Learning Theory Lecture Notes 2021 Version: 2021-10-27 v0.0-e7150f2d (alpha)

[3] Arnulf Jentzen and Benno Kuckuck and Philippe von Wurstemberger Mathematical introduction to deep learning: methods, implementations, and theory arXiv preprint arXiv:2310.20360 2023

[4] Alex Krizhevsky and Ilya Sutskever and Geoffrey E Hinton Imagenet classification with deep convolutional neural networks Advances in neural information processing systems 2012 1097–1105

[5] David Silver and Aja Huang and Chris J Maddison and Arthur Guez and Laurent Sifre and George Van Den Driessche and Julian Schrittwieser and Ioannis Antonoglou and Veda Panneershelvam and Marc Lanctot and others Mastering the game of Go with deep neural networks and tree search nature 2016 529 7587 484–489

[6] John Jumper and Richard Evans and Alexander Pritzel and Tim Green and Michael Figurnov and Olaf Ronneberger and Kathryn Tunyasuvunakool and Russ Bates and Augustin Žídek and Anna Potapenko and others Highly accurate protein structure prediction with AlphaFold Nature 2021 596 7873 583–589

[7] Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N Gomez and Łukasz Kaiser and Illia Polosukhin Attention is all you need Advances in neural information processing systems 2017 30

[8] Tom Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared D Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and others Language models are few-shot learners Advances in neural information processing systems 2020 33 1877–1901

[9] Warren S McCulloch and Walter Pitts A logical calculus of the ideas immanent in nervous activity The bulletin of mathematical biophysics 1943 5 115–133

[10] Yann LeCun and Bernhard Boser and John S Denker and Donnie Henderson and Richard E Howard and Wayne Hubbard and Lawrence D Jackel Backpropagation applied to handwritten zip code recognition Neural Computation 1989 1 4 541–551

[11] Michael M Bronstein and Joan Bruna and Taco Cohen and Petar Veličković Geometric deep learning: Grids, groups, graphs, geodesics, and gauges arXiv preprint arXiv:2104.13478 2021

[12] Sepp Hochreiter and Jürgen Schmidhuber Long short-term memory Neural Computation 1997 9 8 1735–1780

[13] Shai Shalev-Shwartz and Shai Ben-David Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press

[14] Maziar Raissi and Paris Perdikaris and George E Karniadakis Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations Journal of Computational physics 2019 378 686–707

[15] Mehryar Mohri and Afshin Rostamizadeh and Ameet Talwalkar Foundations of machine learning MIT press 2018

[16] Christoph Molnar Interpretable machine learning Lulu. com 2020

[17] Mengnan Du and Fan Yang and Na Zou and Xia Hu Fairness in Deep Learning: A Computational Perspective IEEE Intelligent Systems 2021 36 4 25-34 10.1109/MIS.2020.3000681

[18] Solon Barocas and Moritz Hardt and Arvind Narayanan Fairness and Machine Learning fairmlbook.org http://www.fairmlbook.org

[19] Aurélien Géron Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools, and techniques to build intelligent systems O'Reilly Media Sebastopol, CA

[20] Francois Chollet Deep learning with Python Simon and Schuster 2021

[21] Simon J.D. Prince Understanding Deep Learning MIT Press 2023

[22] Jürgen Schmidhuber Deep learning in neural networks: An overview Neural Networks 2015 61 85-117 https://doi.org/10.1016/j.neunet.2014.09.003

[23] Yann LeCun and Yoshua Bengio and Geoffrey Hinton Deep learning Nature 7553 436–444 may

[24] Simon S. Haykin Neural networks and learning machines Pearson Education Upper Saddle River, NJ Third

[25] Ian J. Goodfellow and Yoshua Bengio and Aaron Courville Deep Learning MIT Press 2016 Cambridge, MA, USA http://www.deeplearningbook.org

[26] Martin Anthony and Peter L. Bartlett Neural network learning: theoretical foundations Cambridge University Press, Cambridge 1999 10.1017/CBO9780511624216

[27] Ovidiu Calin Deep learning architectures Springer 2020

[28] Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun Deep residual learning for image recognition Proceedings of the IEEE conference on computer vision and pattern recognition 2016 770–778

[29] F. Rosenblatt The perceptron: A probabilistic model for information storage and organization in the brain Psychological Review 1958 65 6 386–408

[30] S. Hochreiter Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München

[31] Yoshua Bengio and Patrick Simard and Paolo Frasconi Learning Long-Term Dependencies With Gradient Descent Is Difficult IEEE Transactions on Neural Networks 2 157–166

[32] Moshe Leshno and Vladimir Ya. Lin and Allan Pinkus and Shimon Schocken Multilayer feedforward networks with a nonpolynomial activation function can approximate any function Neural Networks 1993 6 6 861-867 https://doi.org/10.1016/S0893-6080(05)80131-5

[33] Walter Rudin Functional analysis McGraw-Hill, Inc., New York 1991 International Series in Pure and Applied Mathematics Second

[34] V.Y. Lin and A. Pinkus Fundamentality of Ridge Functions Journal of Approximation Theory 1993 75 3 295-311 https://doi.org/10.1006/jath.1993.1104

[35] Harro Heuser Lehrbuch der Analysis. Teil 1 Vieweg + Teubner, Wiesbaden 2009 revised

[36] A. N. Kolmogorov On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition Dokl. Akad. Nauk SSSR 1957 114 953–956

[37] G. Cybenko Approximation by superpositions of a sigmoidal function Mathematics of Control, Signals and Systems 1989 2 4 303–314 10.1007/BF02551274

[38] Kurt Hornik and Maxwell Stinchcombe and Halbert White Multilayer feedforward networks are universal approximators Neural Networks 1989 2 5 359-366 https://doi.org/10.1016/0893-6080(89)90020-8

[39] Kurt Hornik Approximation capabilities of multilayer feedforward networks Neural Networks 1991 4 2 251-257 https://doi.org/10.1016/0893-6080(91)90009-T

[40] Ken-Ichi Funahashi On the approximate realization of continuous mappings by neural networks Neural Networks 1989 2 3 183-192 https://doi.org/10.1016/0893-6080(89)90003-8

[41] S. M. Carroll and Bradley W. Dickinson Construction of neural nets using the radon transform International 1989 Joint Conference on Neural Networks 1989 607-611 vol.1

[42] B. A. Vostrecov and M. A. Kreĭnes Approximation of continuous functions by superpositions of plane waves Dokl. Akad. Nauk SSSR 1961 140 1237–1240

[43] Braun, Jürgen and Griebel, Michael On a Constructive Proof of Kolmogorov's Superposition Theorem Constructive Approximation 2009 30 3 653-675 Dec 10.1007/s00365-009-9054-2

[44] Robert Hecht-Nielsen Kolmogorov's mapping neural network existence theorem Proceedings of the IEEE First International Conference on Neural Networks 1987 III 11–13 Piscataway, NJ: IEEE

[45] Věra Kůrková Kolmogorov's theorem and multilayer neural networks Neural Networks 1992 5 3 501-506 https://doi.org/10.1016/0893-6080(92)90012-8

[46] Hadrien Montanelli and Haizhao Yang Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem Neural Networks 2020 129 1-6 https://doi.org/10.1016/j.neunet.2019.12.013

[47] Johannes Schmidt-Hieber The Kolmogorov–Arnold representation theorem revisited Neural Networks 2021 137 119-126 https://doi.org/10.1016/j.neunet.2021.01.020

[48] Vugar E. Ismailov A three layer neural network can represent any multivariate function Journal of Mathematical Analysis and Applications 2023 523 1 127096 https://doi.org/10.1016/j.jmaa.2023.127096

[49] Federico Girosi and Tomaso Poggio Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant Neural Computation 1989 1 4 465-469 10.1162/neco.1989.1.4.465

[50] Věra Kůrková Kolmogorov's Theorem Is Relevant Neural Computation 1991 3 4 617-622 10.1162/neco.1991.3.4.617

[51] Vitaly Maiorov and Allan Pinkus Lower bounds for approximation by MLP neural networks Neurocomputing 1999 25 1 81-91 https://doi.org/10.1016/S0925-2312(98)00111-8

[52] H. N. Mhaskar and Charles A. Micchelli Approximation by superposition of sigmoidal and radial basis functions Adv. in Appl. Math. 1992 13 3 350–373 10.1016/0196-8858(92)90016-P

[53] Peter Oswald On the degree of nonlinear spline approximation in Besov-Sobolev spaces J. Approx. Theory 1990 61 2 131–157

[54] Youssef Marzouk and Zhi Ren and Sven Wang and Jakob Zech Distribution learning via neural differential equations: a nonparametric statistical perspective Journal of Machine Learning Research (accepted) 2024

[55] Hrushikesh Narhar Mhaskar Approximation properties of a multilayered feedforward artificial neural network Adv. Comput. Math. 1993 1 1 61–80

[56] Larry Schumaker Spline Functions: Basic Theory Cambridge University Press 2007 Cambridge Mathematical Library 3

[57] Hrushikesh N Mhaskar Neural networks for optimal approximation of smooth and analytic functions Neural computation 1996 8 1 164–177

[58] Hrushikesh N Mhaskar and Charles A Micchelli Degree of approximation by neural and translation networks with a single hidden layer Advances in applied mathematics 1995 16 2 151–183

[59] Yagyensh C Pati and Perinkulam S Krishnaprasad Analysis and synthesis of feedforward neural networks using discrete affine wavelet transformations IEEE Transactions on Neural Networks 1993 4 1 73–85

[60] Emmanuel Jean Candes Ridgelets: theory and applications Stanford University 1998

[61] Vugar E Ismailov Ridge functions and applications in neural networks American Mathematical Society 2021 263

[62] Helmut Bolcskei and Philipp Grohs and Gitta Kutyniok and Philipp Petersen Optimal approximation with sparsely connected deep neural networks SIAM Journal on Mathematics of Data Science 2019 1 1 8–45

[63] Philipp Petersen and Felix Voigtlaender Optimal approximation of piecewise smooth functions using deep ReLU neural networks Neural Networks 2018 108 296–330

[64] Raman Arora and Amitabh Basu and Poorya Mianjy and Anirbit Mukherjee Understanding Deep Neural Networks with Rectified Linear Units International Conference on Learning Representations 2018

[65] Juncai He and Lin Li and Jinchao Xu and Chunyue Zheng Relu deep neural networks and linear finite elements J. Comput. Math. 2020 38 3 502–527 10.4208/jcm.1901-m2018-0160

[66] J.M. Tarela and M.V. Martínez Region configurations for realizability of lattice Piecewise-Linear models Mathematical and Computer Modelling 1999 30 11 17-27 https://doi.org/10.1016/S0895-7177(99)00195-8

[67] Sergei Ovchinnikov Max-min representation of piecewise linear functions Beiträge Algebra Geom. 2002 43 1 297–302

[68] Shuning Wang and Xusheng Sun Generalization of hinging hyperplanes IEEE Transactions on Information Theory 2005 51 12 4425-4431 10.1109/TIT.2005.859246

[69] Marcello Longo and Joost A.A. Opschoor and Nico Disch and Christoph Schwab and Jakob Zech De Rham compatible Deep Neural Network FEM Neural Networks 2023 165 721-739 https://doi.org/10.1016/j.neunet.2023.06.008

[70] A. Ern and J.L. Guermond Finite Elements I: Approximation and Interpolation Springer International Publishing 2021 Texts in Applied Mathematics

[71] J. M. Tarela and E. Alonso and M. V. Martínez A representation method for PWL functions oriented to parallel processing Math. Comput. Modelling 1990 13 10 75–83 10.1016/0895-7177(90)90090-A

[72] R.A. DeVore and G.G. Lorentz Constructive Approximation Springer Berlin Heidelberg 1993 Grundlehren der mathematischen Wissenschaften

[73] P.G. Ciarlet The Finite Element Method for Elliptic Problems North Holland 1978 Studies in Mathematics and its Applications

[74] S. Brenner and R. Scott The Mathematical Theory of Finite Element Methods Springer New York 2007 Texts in Applied Mathematics

[75] Dmitry Yarotsky and Anton Zhevnerchuk The phase diagram of approximation rates for deep neural networks Advances in Neural Information Processing Systems 2020 H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin 33 13005–13015 Curran Associates, Inc.

[76] Christopher L Frenzen and Tsutomu Sasao and Jon T Butler On the number of segments needed in a piecewise linear approximation Journal of Computational and Applied mathematics 2010 234 2 437–446

[77] Matus Telgarsky Representation Benefits of Deep Feedforward Networks 2015

[78] Boris Hanin and David Rolnick Complexity of linear regions in deep networks International Conference on Machine Learning 2019 2596–2604 PMLR

[79] Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification Proceedings of the IEEE international conference on computer vision 2015

[80] Clemens Karner and Vladimir Kazeev and Philipp Christian Petersen Limitations of gradient descent due to numerical instability of backpropagation arXiv preprint arXiv:2210.00805 2022

[81] Guido F Montufar and Razvan Pascanu and Kyunghyun Cho and Yoshua Bengio On the Number of Linear Regions of Deep Neural Networks Advances in Neural Information Processing Systems 2014 Z. Ghahramani and M. Welling and C. Cortes and N. Lawrence and K.Q. Weinberger 27 Curran Associates, Inc.

[82] Maithra Raghu and Ben Poole and Jon Kleinberg and Surya Ganguli and Jascha Sohl-Dickstein On the Expressive Power of Deep Neural Networks Proceedings of the 34th International Conference on Machine Learning 2017 Precup, Doina and Teh, Yee Whye 70 Proceedings of Machine Learning Research 2847–2854 PMLR

[83] Thiago Serra and Christian Tjandraatmadja and Srikumar Ramalingam Bounding and Counting Linear Regions of Deep Neural Networks 2018

[84] Matus Telgarsky Benefits of depth in neural networks 29th Annual Conference on Learning Theory 2016 Feldman, Vitaly and Rakhlin, Alexander and Shamir, Ohad 49 Proceedings of Machine Learning Research 1517–1539 PMLR

[85] Dmitry Yarotsky Error bounds for approximations with deep ReLU networks Neural Netw. 2017 94 103-114

[86] Christoph Schwab and Jakob Zech Deep learning in high dimension: neural network expression rates for generalized polynomial chaos expansions in UQ Anal. Appl. (Singap.) 2019 17 1 19–55 10.1142/S0219530518500203

[87] Dennis Elbrächter and Dmytro Perekrestenko and Philipp Grohs and Helmut Bölcskei Deep Neural Network Approximation Theory IEEE Transactions on Information Theory 2021 67 5 2581-2623 10.1109/TIT.2021.3062161

[88] Joost A. A. Opschoor and Philipp C. Petersen and Christoph Schwab Deep ReLU networks and high-order finite element methods Analysis and Applications 2020 18 05 715-770 10.1142/S0219530519410136

[89] Shiyu Liang and Roger Srikant Why deep neural networks for function approximation? Proc. of ICLR 2017 2017 1 – 17

[90] Weinan E and Qingcan Wang Exponential convergence of the deep neural network approximation for analytic functions Sci. China Math. 2018 61 10 1733–1740 10.1007/s11425-018-9387-x

[91] J. A. A. Opschoor and Ch. Schwab and J. Zech Exponential ReLU DNN Expression of Holomorphic Maps in High Dimension Constructive Approximation 2021 10.1007/s00365-021-09542-5

[92] Konstantin Eckle and Johannes Schmidt-Hieber A comparison of deep networks with ReLU activation function and linear spline-type methods Neural Networks 2019 110 232-242 https://doi.org/10.1016/j.neunet.2018.11.005

[93] Hadrien Montanelli and Qiang Du New Error Bounds for Deep ReLU Networks Using Sparse Grids SIAM Journal on Mathematics of Data Science 2019 1 1 78-92 10.1137/18M1189336

[94] Matus Telgarsky Neural Networks and Rational Functions Proceedings of the 34th International Conference on Machine Learning 2017 Precup, Doina and Teh, Yee Whye 70 Proceedings of Machine Learning Research 3387–3393 PMLR

[95]

[96] Joost A.A. Opschoor and Christoph Schwab Deep ReLU networks and high-order finite element methods II: Chebyšev emulation Computers & Mathematics with Applications 2024 169 142-162 https://doi.org/10.1016/j.camwa.2024.06.008

[97] Ronen Eldan and Ohad Shamir The Power of Depth for Feedforward Neural Networks 29th Annual Conference on Learning Theory 2016 Feldman, Vitaly and Rakhlin, Alexander and Shamir, Ohad 49 Proceedings of Machine Learning Research 907–940 PMLR

[98] Itay Safran and Ohad Shamir Depth Separation in ReLU Networks for Approximating Smooth Non-Linear Functions ArXiv 2016 abs/1610.09887

[99] Allan Pinkus Approximation theory of the MLP model in neural networks Acta numerica, 1999 Cambridge Univ. Press, Cambridge 1999 8 Acta Numer. 143–195 10.1017/S0962492900002919

[100] Tim De Ryck and Samuel Lanthaler and Siddhartha Mishra On the approximation of functions by tanh neural networks Neural Networks 2021 143 732-750 https://doi.org/10.1016/j.neunet.2021.08.015

[101] Ingo Gühring and Mones Raslan Approximation rates for neural networks with encodable weights in smoothness spaces Neural Networks 2021 134 107–130

[102] Richard Bellman On the theory of dynamic programming Proceedings of the national Academy of Sciences 1952 38 8 716–719

[103] Ronald A DeVore Nonlinear approximation Acta numerica 1998 7 51–150

[104] Erich Novak and Henryk Woźniakowski Approximation of infinitely differentiable multivariate functions is intractable Journal of Complexity 2009 25 4 398–404

[105] Andrew R. Barron Universal approximation bounds for superpositions of a sigmoidal function IEEE Trans. Inform. Theory 1993 39 3 930–945 10.1109/18.256500

[106] Karlheinz Gröchenig Foundations of time-frequency analysis Springer Science & Business Media 2013

[107] Philipp Christian Petersen Neural Network Theory 2020 http://www.pc-petersen.eu/Neural_Network_Theory.pdf, Lecture notes

[108] Jonathan W Siegel and Jinchao Xu High-order approximation rates for shallow neural networks with cosine and ReLUk activation functions Applied and Computational Harmonic Analysis 2022 58 1–26

[109] Constantin Carathéodory Über den variabilitätsbereich der fourier’schen konstanten von positiven harmonischen funktionen Rendiconti del Circolo Matematico di Palermo (1884-1940) 1911 32 193-217

[110] Roman Vershynin High-dimensional probability: An introduction with applications in data science Cambridge University Press 2018 47

[111] Gilles Pisier Remarques sur un résultat non publié de B. Maurey Séminaire Analyse fonctionnelle (dit "Maurey-Schwartz") 1980-1981

[112] Tomaso Poggio and Hrushikesh Mhaskar and Lorenzo Rosasco and Brando Miranda and Qianli Liao Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review Int. J. Autom. Comput. 2017 14 5 503–519

[113] Elias M. Stein Singular integrals and differentiability properties of functions Princeton University Press, Princeton, N.J. 1970 Princeton Mathematical Series, No. 30

[114] Andrew R Barron Neural net approximation Proc. 7th Yale workshop on adaptive and learning systems 1992 1 69–72

[115] Chao Ma and Lei Wu and others A priori estimates of the population risk for two-layer neural networks arXiv preprint arXiv:1810.06397 2018

[116] Chao Ma and Stephan Wojtowytsch and Lei Wu and others Towards a mathematical understanding of neural network-based machine learning: what we know and what we don't arXiv preprint arXiv:2009.10713 2020

[117] E Weinan and Chao Ma and Lei Wu Barron spaces and the compositional function spaces for neural network models arXiv preprint arXiv:1906.08039 2019

[118] E Weinan and Stephan Wojtowytsch Representation formulas and pointwise properties for Barron functions Calculus of Variations and Partial Differential Equations 2022 61 2 46

[119] Andrew R Barron and Jason M Klusowski Approximation and estimation for high-dimensional deep learning networks arXiv preprint arXiv:1809.03090 2018

[120] Michael Kohler and Sophie Langer On the rate of convergence of fully connected deep neural network regression estimates The Annals of Statistics 2021 49 4 2231–2249

[121] Uri Shaham and Alexander Cloninger and Ronald R Coifman Provable approximation properties for deep neural networks Applied and Computational Harmonic Analysis 2018 44 3 537–557

[122] Charles K Chui and Hrushikesh N Mhaskar Deep nets for local manifold learning Frontiers in Applied Mathematics and Statistics 2018 4 12

[123] Minshuo Chen and Haoming Jiang and Wenjing Liao and Tuo Zhao Efficient approximation of deep relu networks for functions on low dimensional manifolds Advances in neural information processing systems 2019 32

[124] Johannes Schmidt-Hieber Deep relu network approximation of functions on a manifold arXiv preprint arXiv:1908.00695 2019

[125] Ryumei Nakada and Masaaki Imaizumi Adaptive approximation and generalization of deep neural network with intrinsic dimensionality Journal of Machine Learning Research 2020 21 174 1–38

[126] Michael Kohler and Adam Krzyżak and Sophie Langer Estimation of a function of low local dimensionality by deep neural networks IEEE transactions on information theory 2022 68 6 4032–4042

[127] Christoph Schwab and Jakob Zech Deep learning in high dimension: neural network expression rates for analytic functions in \( L^2(\mathbb{R}^d,\gamma_d)\) SIAM/ASA J. Uncertain. Quantif. 2023 11 1 199–234 10.1137/21M1462738

[128] Joost A. A. Opschoor and Christoph Schwab and Jakob Zech Deep learning in high dimension: ReLU neural network expression for Bayesian PDE inversion Optimization and control for partial differential equations—uncertainty quantification, open and closed-loop control, and shape optimization De Gruyter, Berlin 2022 29 Radon Ser. Comput. Appl. Math. 419–462 10.1515/9783110695984-015

[129] Gitta Kutyniok and Philipp Petersen and Mones Raslan and Reinhold Schneider A theoretical analysis of deep neural networks and parametric PDEs Constructive Approximation 2022 55 1 73–125

[130] Fabian Laakmann and Philipp Petersen Efficient approximation of solutions of parametric linear transport equations by ReLU DNNs Advances in Computational Mathematics 2021 47 1 11

[131] T. De Ryck and S. Mishra Error Analysis for Deep Neural Network Approximations of Parametric Hyperbolic Conservation Laws Mathematics of Computation 2023 Article electronically published on December 15, 2023 10.1090/mcom/3934

[132] Philipp Grohs and Lukas Herrmann Deep neural network approximation for high-dimensional elliptic PDEs with boundary conditions IMA Journal of Numerical Analysis 2022 42 3 2055–2082

[133] Philipp Grohs and Fabian Hornung and Arnulf Jentzen and Philippe Von Wurstemberger A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations American Mathematical Society 2023 284 1410

[134] Lukas Gonon and Christoph Schwab Deep ReLU network expression rates for option prices in high-dimensional, exponential Lévy models Finance and Stochastics 2021 25 4 615–657

[135] Martin Hutzenthaler and Arnulf Jentzen and Thomas Kruse and Tuan Anh Nguyen A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations SN partial differential equations and applications 2020 1 2 10

[136] Arnulf Jentzen and Diyora Salimova and Timo Welti A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients Commun. Math. Sci. 2021 19 5 1167–1205 10.4310/CMS.2021.v19.n5.a1

[137] Philipp Grohs and Fabian Hornung and Arnulf Jentzen and Philippe von Wurstemberger A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations Mem. Amer. Math. Soc. 2023 284 1410 v+93 10.1090/memo/1410

[138] Gleb Beliakov Interpolation of Lipschitz functions Journal of Computational and Applied Mathematics 2006 196 1 20-44 https://doi.org/10.1016/j.cam.2005.08.011

[139] A.G. Sukharev Optimal method of constructing best uniform approximations for functions of a certain class USSR Computational Mathematics and Mathematical Physics 1978 18 2 21-31 https://doi.org/10.1016/0041-5553(78)90035-6

[140] Michael A Sartori and Panos J Antsaklis A simple method to derive bounds on the size and to train multilayer neural networks IEEE transactions on neural networks 1991 2 4 467–471

[141] Y. Ito and K. Saito Superposition of linearly independent functions and finite mappings by neural networks The Mathematical Scientist 1996 21 1 27

[142] Guang-Bin Huang and Haroon A Babri Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions IEEE transactions on neural networks 1998 9 1 224–229

[143] Arnulf Jentzen and Adrian Riekert On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks arXiv preprint arXiv:2112.09684 2021

[144] Dimitri P. Bertsekas Nonlinear programming Athena Scientific, Belmont, MA 2016 Athena Scientific Optimization and Computation Series Third

[145] Yurii Nesterov Lectures on convex optimization Springer, Cham 2018 137 Springer Optimization and Its Applications Second 10.1007/978-3-319-91578-4

[146] Sébastien Bubeck Convex Optimization: Algorithms and Complexity Found. Trends Mach. Learn. 2014 8 231-357

[147] Guanghui. Lan First-order and Stochastic Optimization Methods for Machine Learning Springer International Publishing 2020 Springer Series in the Data Sciences Cham 1st ed. 2020.

[148] Guillaume Garrigos and Robert M. Gower Handbook of Convergence Theorems for (Stochastic) Gradient Methods 2023

[149] Herbert Robbins and Sutton Monro A Stochastic Approximation Method The Annals of Mathematical Statistics 1951 22 3 400 – 407 10.1214/aoms/1177729586

[150] Léon Bottou and Frank E. Curtis and Jorge Nocedal Optimization Methods for Large-Scale Machine Learning SIAM Review 2018 60 2 223-311 10.1137/16M1080173

[151] Robert Mansel Gower and Nicolas Loizou and Xun Qian and Alibek Sailanbayev and Egor Shulgin and Peter Richtárik SGD Proceedings of the 36th International Conference on Machine Learning 2019 Chaudhuri, Kamalika and Salakhutdinov, Ruslan 97 Proceedings of Machine Learning Research 5200–5209 PMLR

[152] Dimitri P. Bertsekas and John N. Tsitsiklis Neuro-dynamic programming. Athena Scientific Optimization and neural computation series

[153] Ilya Sutskever and James Martens and George Dahl and Geoffrey Hinton On the importance of initialization and momentum in deep learning Proceedings of the 30th International Conference on Machine Learning 2013 Dasgupta, Sanjoy and McAllester, David 28 3 Proceedings of Machine Learning Research 1139–1147 PMLR

[154] B.T. Polyak Some methods of speeding up the convergence of iteration methods USSR Computational Mathematics and Mathematical Physics 1964 4 5 1-17 https://doi.org/10.1016/0041-5553(64)90137-5

[155] Yu. E. Nesterov A method for solving the convex programming problem with convergence rate \( O(1/k^{2})\) Dokl. Akad. Nauk SSSR 1983 269 3 543–547

[156] Gabriel Goh Why Momentum Really Works Distill 2017 http://distill.pub/2017/momentum 10.23915/distill.00006

[157] Boris T. Polyak Introduction to optimization Optimization Software, Inc., Publications Division, New York 1987 Translations Series in Mathematics and Engineering Translated from the Russian, with a foreword by Dimitri P. Bertsekas

[158] Ning Qian On the momentum term in gradient descent learning algorithms Neural Networks 1999 12 1 145-151 https://doi.org/10.1016/S0893-6080(98)00116-6

[159] Laurent Lessard and Benjamin Recht and Andrew Packard Analysis and design of optimization algorithms via integral quadratic constraints SIAM J. Optim. 2016 26 1 57–95 10.1137/15M1009597

[160] Stephen Tu and Shivaram Venkataraman and Ashia C. Wilson and Alex Gittens and Michael I. Jordan and Benjamin Recht Breaking Locality Accelerates Block Gauss-Seidel Proceedings of the 34th International Conference on Machine Learning 2017 Precup, Doina and Teh, Yee Whye 70 Proceedings of Machine Learning Research 3482–3491 PMLR

[161] Ashia C. Wilson and Ben Recht and Michael I. Jordan A Lyapunov Analysis of Accelerated Methods in Optimization Journal of Machine Learning Research 2021 22 113 1–34

[162] Simon Weissmann and Ashia Wilson and Jakob Zech Multilevel Optimization for Inverse Problems Proceedings of Thirty Fifth Conference on Learning Theory 2022 Loh, Po-Ling and Raginsky, Maxim 178 Proceedings of Machine Learning Research 5489–5524 PMLR

[163] John Duchi and Elad Hazan and Yoram Singer Adaptive subgradient methods for online learning and stochastic optimization Journal of Machine Learning Research Jul 2121–2159

[164] Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method CoRR abs/1212.5701

[165] Diederik P Kingma and Jimmy Ba Adam: A method for stochastic optimization 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings 2015 International Conference on Learning Representations, ICLR

[166] Sebastian Ruder An overview of gradient descent optimization algorithms 2016

[167] Léon Bottou Montavon, Grégoire and Orr, Geneviève B. and Müller, Klaus-Robert Stochastic Gradient Descent Tricks 421–436 Springer Berlin Heidelberg 2012 Berlin, Heidelberg 10.1007/978-3-642-35289-8_25

[168] Shun-ichi Amari Natural Gradient Works Efficiently in Learning Neural Computation 1998 10 2 251-276 02 10.1162/089976698300017746

[169] Jorge Nocedal and Stephen J. Wright Numerical optimization Springer, New York 2006 Springer Series in Operations Research and Financial Engineering Second

[170] Martín Abadi and Ashish Agarwal and Paul Barham and Eugene Brevdo and Zhifeng Chen and Craig Citro and Greg S. Corrado and Andy Davis and Jeffrey Dean and Matthieu Devin and Sanjay Ghemawat and Ian Goodfellow and Andrew Harp and Geoffrey Irving and Michael Isard and Yangqing Jia and Rafal Jozefowicz and Lukasz Kaiser and Manjunath Kudlur and Josh Levenberg and Dandelion Mané and Rajat Monga and Sherry Moore and Derek Murray and Chris Olah and Mike Schuster and Jonathon Shlens and Benoit Steiner and Ilya Sutskever and Kunal Talwar and Paul Tucker and Vincent Vanhoucke and Vijay Vasudevan and Fernanda Viégas and Oriol Vinyals and Pete Warden and Martin Wattenberg and Martin Wicke and Yuan Yu and Xiaoqiang Zheng TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems 2015 Software available from tensorflow.org

[171] Ashia C Wilson and Rebecca Roelofs and Mitchell Stern and Nati Srebro and Benjamin Recht The Marginal Value of Adaptive Gradient Methods in Machine Learning Advances in Neural Information Processing Systems 2017 I. Guyon and U. Von Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett 30 Curran Associates, Inc.

[172] Sashank J. Reddi and Satyen Kale and Sanjiv Kumar On the Convergence of Adam and Beyond 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings 2018 OpenReview.net

[173] David E Rumelhart and Geoffrey E Hinton and Ronald J Williams Learning representations by back-propagating errors Nature 1986 323 6088 533–536

[174] Yurii Nesterov Introductory lectures on convex optimization Kluwer Academic Publishers, Boston, MA 2004 87 Applied Optimization A basic course 10.1007/978-1-4419-8853-9

[175] Stephen Boyd and Lieven Vandenberghe Convex optimization Cambridge University Press, Cambridge 2004 10.1017/CBO9780511804441

[176] Hamed Karimi and Julie Nutini and Mark Schmidt Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition Machine Learning and Knowledge Discovery in Databases 2016 Frasconi, Paolo and Landwehr, Niels and Manco, Giuseppe and Vreeken, Jilles 795–811 Springer International Publishing

[177] Eric Moulines and Francis Bach Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning Advances in Neural Information Processing Systems 2011 J. Shawe-Taylor and R. Zemel and P. Bartlett and F. Pereira and K.Q. Weinberger 24 Curran Associates, Inc.

[178] Alexander Rakhlin and Ohad Shamir and Karthik Sridharan Making gradient descent optimal for strongly convex stochastic optimization Proceedings of the 29th International Coference on International Conference on Machine Learning 2012 ICML'12 1571–1578 Omnipress

[179] A. Nemirovski and A. Juditsky and G. Lan and A. Shapiro Robust Stochastic Approximation Approach to Stochastic Programming SIAM Journal on Optimization 2009 19 4 1574-1609 10.1137/070704277

[180] Ohad Shamir and Tong Zhang Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes Proceedings of the 30th International Conference on Machine Learning 2013 Dasgupta, Sanjoy and McAllester, David 28 1 Proceedings of Machine Learning Research 71–79 PMLR

[181] D. Randall Wilson and Tony R. Martinez The general inefficiency of batch training for gradient descent learning Neural Netw. 2003 16 10 1429–1451 dec 10.1016/S0893-6080(03)00138-2

[182] Moritz Hardt and Ben Recht and Yoram Singer Train faster, generalize better: Stability of stochastic gradient descent Proceedings of The 33rd International Conference on Machine Learning 2016 Balcan, Maria Florina and Weinberger, Kilian Q. 48 Proceedings of Machine Learning Research 1225–1234 PMLR

[183] Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals Understanding deep learning requires rethinking generalization 2016

[184] Nitish Shirish Keskar and Dheevatsa Mudigere and Jorge Nocedal and Mikhail Smelyanskiy and Ping Tak Peter Tang On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR abs/1609.04836

[185] Daniel Soudry and Elad Hoffer and Mor Shpigel Nacson and Suriya Gunasekar and Nathan Srebro The Implicit Bias of Gradient Descent on Separable Data Journal of Machine Learning Research 2018 19 70 1–57

[186] Gilbert Strang Lecture 23: Accelerating Gradient Descent (Use Momentum) MIT OpenCourseWare: Matrix Methods in Data Analysis, Signal Processing, And Machine Learning 2018 https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/resources/lecture-23-accelerating-gradient-descent-use-momentum/

[187] Brendan O'Donoghue and Emmanuel Candès Adaptive restart for accelerated gradient schemes Found. Comput. Math. 2015 15 3 715–732 10.1007/s10208-013-9150-3

[188] Alexandre Défossez and Léon Bottou and Francis R. Bach and Nicolas Usunier A Simple Convergence Proof of Adam and Adagrad Trans. Mach. Learn. Res. 2022 2022

[189] Christopher M. Bishop Pattern Recognition and Machine Learning (Information Science and Statistics) Springer 1

[190] Michael A. Nielsen Neural Networks and Deep Learning

[191] Yann LeCun and Leon Bottou and Genevieve Orr and Klaus Müller Efficient BackProp Neural Networks: Tricks of the Trade Springer Berlin / Heidelberg Lecture Notes in Computer Science 2 546 10.1007/3-540-49430-8\_2

[192] Stephen Boyd and Lin Xiao and Almir Mutapcic Subgradient Methods 2003 Lecture Notes, Stanford University.

[193] Naum Z. Shor Minimization Methods for Non-Differentiable Functions Springer-Verlag 1985 3 Springer Series in Computational Mathematics Berlin, Heidelberg 10.1007/978-3-642-82118-9

[194] Arthur Jacot and Franck Gabriel and Clément Hongler Neural tangent kernel: Convergence and generalization in neural networks Advances in neural information processing systems 2018 31

[195] Jaehoon Lee and Lechao Xiao and Samuel Schoenholz and Yasaman Bahri and Roman Novak and Jascha Sohl-Dickstein and Jeffrey Pennington Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent Advances in Neural Information Processing Systems 2019 H. Wallach and H. Larochelle and A. Beygelzimer and F. d'Alché-Buc and E. Fox and R. Garnett 32 Curran Associates, Inc.

[196] Lénaïc Chizat and Edouard Oyallon and Francis Bach On Lazy Training in Differentiable Programming Advances in Neural Information Processing Systems 2019 H. Wallach and H. Larochelle and A. Beygelzimer and F. d'Alché-Buc and E. Fox and R. Garnett 32 Curran Associates, Inc.

[197] James W. Demmel Applied Numerical Linear Algebra SIAM jan 10.1137/1.9781611971446

[198] Åke Björck Numerical Methods for Least Squares Problems Society for Industrial and Applied Mathematics 1996 10.1137/1.9781611971484

[199] Trevor Hastie and Robert Tibshirani and Jerome Friedman The elements of statistical learning: data mining, inference and prediction Springer 2

[200] Gene H. Golub and Charles F. Van Loan Matrix Computations - 4th Edition Johns Hopkins University Press 2013 Philadelphia, PA 10.1137/1.9781421407944

[201] Andrey N. Tikhonov Regularization of incorrectly posed problems Soviet Mathematics Doklady 1963 4 6 1624–1627

[202] A. E. Hoerl and R. W. Kennard Ridge Regression: Biased Estimation for Nonorthogonal Problems Technometrics 55–67

[203] H.W. Engl and M. Hanke and A. Neubauer Regularization of Inverse Problems Springer Netherlands 2000 Mathematics and Its Applications

[204] A. Ben-Israel and A. Charnes Contributions to the Theory of Generalized Inverses Journal of the Society for Industrial and Applied Mathematics 1963 11 3 667-699 10.1137/0111051

[205] Nello Cristianini and John Shawe-Taylor An Introduction to Support Vector Machines and Other Kernel-based Learning Methods Cambridge University Press 1

[206] Bernhard Schölkopf and Alexander J. Smola Learning with kernels : support vector machines, regularization, optimization, and beyond MIT Press Adaptive computation and machine learning

[207] George S. Kimeldorf and Grace Wahba A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines The Annals of Mathematical Statistics 1970 41 2 495-502

[208] B. Schölkopf and R. Herbrich and A. J. Smola A Generalized Representer Theorem Proceedings of the Annual Conference on Learning Theory

[209] Bernhard E. Boser and Isabelle M. Guyon and Vladimir N. Vapnik A Training Algorithm for Optimal Margin Classifiers Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT'92) Haussler, David 144–152 ACM Press

[210] Alain Berlinet and Christine Thomas-Agnan Reproducing kernel Hilbert spaces in probability and statistics Kluwer Academic Publishers, Boston, MA 2004 With a preface by Persi Diaconis 10.1007/978-1-4419-9096-9

[211] Ingo Steinwart and Andreas Christmann Support Vector Machines Springer 2008 New York 10.1007/978-0-387-77242-4

[212] Youngmin Cho and Lawrence Saul Kernel Methods for Deep Learning Advances in Neural Information Processing Systems 2009 Y. Bengio and D. Schuurmans and J. Lafferty and C. Williams and A. Culotta 22 Curran Associates, Inc.

[213] Radford M Neal Bayesian learning for neural networks University of Toronto 1995

[214] Alexander G. de G. Matthews Sample-then-optimize posterior sampling for Bayesian linear models 2017

[215] Xavier Glorot and Yoshua Bengio Understanding the difficulty of training deep feedforward neural networks Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics 2010 Teh, Yee Whye and Titterington, Mike 9 Proceedings of Machine Learning Research 249–256 PMLR

[216] Marko D. Petković and Predrag S. Stanimirović Iterative method for computing the Moore–Penrose inverse based on Penrose equations Journal of Computational and Applied Mathematics 2011 235 6 1604-1613 https://doi.org/10.1016/j.cam.2010.08.042

[217] M. A. Aizerman and E. A. Braverman and L. Rozonoer Theoretical foundations of the potential function method in pattern recognition learning Automation and Remote Control 25 821-837

[218] C. Cortes and V. Vapnik Support Vector Networks Machine Learning 273-297

[219] Mauricio A. Álvarez and Lorenzo Rosasco and Neil D. Lawrence 2012 10.1561/2200000036

[220] Tengyuan Liang and Alexander Rakhlin Just interpolate: kernel “ridgeless'' regression can generalize Ann. Statist. 2020 48 3 1329–1347 10.1214/19-AOS1849

[221] Trevor Hastie and Andrea Montanari and Saharon Rosset and Ryan J. Tibshirani Surprises in high-dimensional ridgeless least squares interpolation The Annals of Statistics 2022 50 2 949 – 986 10.1214/21-AOS2133

[222] Daniel Beaglehole and Mikhail Belkin and Parthe Pandit On the inconsistency of kernel ridgeless regression in fixed dimensions SIAM J. Math. Data Sci. 2023 5 4 854–872 10.1137/22M1499819

[223] Zeyuan Allen-Zhu and Yuanzhi Li and Yingyu Liang Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers Advances in Neural Information Processing Systems 2019 H. Wallach and H. Larochelle and A. Beygelzimer and F. d'Alché-Buc and E. Fox and R. Garnett 32 Curran Associates, Inc.

[224] Simon Du and Jason Lee and Haochuan Li and Liwei Wang and Xiyu Zhai Gradient Descent Finds Global Minima of Deep Neural Networks Proceedings of the 36th International Conference on Machine Learning 2019 Chaudhuri, Kamalika and Salakhutdinov, Ruslan 97 Proceedings of Machine Learning Research 1675–1685 PMLR

[225] Sanjeev Arora and Simon S Du and Wei Hu and Zhiyuan Li and Russ R Salakhutdinov and Ruosong Wang On Exact Computation with an Infinitely Wide Neural Net Advances in Neural Information Processing Systems 2019 H. Wallach and H. Larochelle and A. Beygelzimer and F. d'Alché-Buc and E. Fox and R. Garnett 32 Curran Associates, Inc.

[226] Carl Edward Rasmussen and Christopher K. I. Williams Gaussian processes for machine learning. MIT Press Adaptive computation and machine learning

[227] Ali Rahimi and Benjamin Recht Random Features for Large-Scale Kernel Machines Advances in Neural Information Processing Systems 2007 J. Platt and D. Koller and Y. Singer and S. Roweis 20 Curran Associates, Inc.

[228] Jaehoon Lee and Jascha Sohl-dickstein and Jeffrey Pennington and Roman Novak and Sam Schoenholz and Yasaman Bahri Deep Neural Networks as Gaussian Processes International Conference on Learning Representations 2018

[229] Alexander G. de G. Matthews and Jiri Hron and Mark Rowland and Richard E. Turner and Zoubin Ghahramani Gaussian Process Behaviour in Wide Deep Neural Networks International Conference on Learning Representations 2018

[230] Ian J Goodfellow and Oriol Vinyals and Andrew M Saxe Qualitatively characterizing neural network optimization problems arXiv preprint arXiv:1412.6544 2014

[231] Daniel Jiwoong Im and Michael Tao and Kristin Branson An empirical analysis of deep network loss surfaces 2016

[232] Hao Li and Zheng Xu and Gavin Taylor and Christoph Studer and Tom Goldstein Visualizing the loss landscape of neural nets Advances in neural information processing systems 2018 31

[233] Luca Venturi and Afonso S Bandeira and Joan Bruna Spurious valleys in one-hidden-layer neural network optimization landscapes Journal of Machine Learning Research 2019 20 133

[234] Yann N Dauphin and Razvan Pascanu and Caglar Gulcehre and Kyunghyun Cho and Surya Ganguli and Yoshua Bengio Identifying and attacking the saddle point problem in high-dimensional non-convex optimization Advances in neural information processing systems 2014 27

[235] Jeffrey Pennington and Yasaman Bahri Geometry of neural network loss surfaces via random matrix theory International Conference on Machine Learning 2017 2798–2806 PMLR

[236] Quynh Nguyen, Mahesh Chandra Mukkamala, Matthias Hein On the Loss Landscape of a Class of Deep Neural Networks with No Bad Local Valleys International Conference on Learning Representations (ICLR) 2018

[237] Anna Choromanska and Mikael Henaff and Michael Mathieu and Gérard Ben Arous and Yann LeCun The loss surfaces of multilayer networks Artificial intelligence and statistics 2015 192–204 PMLR

[238] Timur Garipov and Pavel Izmailov and Dmitrii Podoprikhin and Dmitry P Vetrov and Andrew G Wilson Loss surfaces, mode connectivity, and fast ensembling of dnns Advances in neural information processing systems 2018 31

[239] Felix Draxler and Kambis Veschgini and Manfred Salmhofer and Fred Hamprecht Essentially no barriers in neural network energy landscape International conference on machine learning 2018 1309–1318 PMLR

[240] Sepp Hochreiter and Jürgen Schmidhuber Flat minima Neural computation 1997 9 1 1–42

[241] Pratik Chaudhari and Anna Choromanska and Stefano Soatto and Yann LeCun and Carlo Baldassi and Christian Borgs and Jennifer Chayes and Levent Sagun and Riccardo Zecchina Entropy-sgd: Biasing gradient descent into wide valleys Journal of Statistical Mechanics: Theory and Experiment 2019 2019 12 124018

[242] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio Fantastic Generalization Measures and Where to Find Them International Conference on Learning Representations (ICLR) 2019

[243] Laurent Dinh and Razvan Pascanu and Samy Bengio and Yoshua Bengio Sharp minima can generalize for deep nets International Conference on Machine Learning 2017 1019–1028 PMLR

[244] Philipp Petersen and Mones Raslan and Felix Voigtlaender Topological properties of the set of functions generated by neural networks of fixed size Foundations of computational mathematics 2021 21 375–444

[245] Paul C Kainen and Vera Kurkova and Andrew Vogt Approximation by neural networks is not continuous Neurocomputing 1999 29 1-3 47–56

[246] Federico Girosi and Tomaso Poggio Networks and the best approximation property Biological cybernetics 1990 63 3 169–176

[247] Paul C Kainen and Vera Kurkova and Andrew Vogt Best approximation by linear combinations of characteristic functions of half-spaces Journal of Approximation Theory 2003 122 2 151–159

[248] Paul C Kainen and Vera Kurkova and Andrew Vogt Continuity of approximation by neural networks in L p spaces Annals of Operations Research 2001 101 143–147

[249] Scott Mahan and Emily J King and Alex Cloninger Nonclosedness of sets of neural networks in Sobolev spaces Neural Networks 2021 137 85–96

[250] Felipe Cucker and Steve Smale On the mathematical foundations of learning Bulletin of the American mathematical society 2002 39 1 1–49

[251] Vladimir N Vapnik and A Ya Chervonenkis On the uniform convergence of relative frequencies of events to their probabilities Measures of complexity: festschrift for alexey chervonenkis Springer 2015 11–30

[252] Leslie G Valiant A theory of the learnable Communications of the ACM 1984 27 11 1134–1142

[253] Johannes Schmidt-Hieber Nonparametric regression using deep neural networks with ReLU activation function 2020

[254] Julius Berner and Philipp Grohs and Arnulf Jentzen Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations SIAM Journal on Mathematics of Data Science 2020 2 3 631–657

[255] DeVore, R., Howard, R., Micchelli, C. Optimal nonlinear approximation. Manuscripta mathematica 1989 63 4 469-478

[256] Karen Simonyan and Andrew Zisserman Very deep convolutional networks for large-scale image recognition ICLR 2014

[257] Christian Szegedy and Wei Liu and Yangqing Jia and Pierre Sermanet and Scott Reed and Dragomir Anguelov and Dumitru Erhan and Vincent Vanhoucke and Andrew Rabinovich Going deeper with convolutions Proceedings of the IEEE conference on computer vision and pattern recognition 2015 1–9

[258] Gao Huang and Zhuang Liu and Laurens Van Der Maaten and Kilian Q Weinberger Densely connected convolutional networks Proceedings of the IEEE conference on computer vision and pattern recognition 2017 1 2 3

[259] Xiaohua Zhai and Alexander Kolesnikov and Neil Houlsby and Lucas Beyer Scaling vision transformers Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022 12104–12113

[260] Mingxing Tan and Quoc V Le Efficientnet: Rethinking model scaling for convolutional neural networks Proceedings of the 36th International Conference on Machine Learning 2019 6105–6114

[261] Esteban Real and Sara Moore and Andrew Selle and Saurabh Saxena and Yutaka L Suematsu and Quoc Le and Alex Kurakin Regularized Evolution for Image Classifier Architecture Search Proceedings of the AAAI Conference on Artificial Intelligence 2019 33 4780–4789

[262] Jia Deng and Wei Dong and Richard Socher and Li-Jia Li and Kai Li and Li Fei-Fei ImageNet: A large-scale hierarchical image database 2009 IEEE Conference on Computer Vision and Pattern Recognition 2009 248-255 10.1109/CVPR.2009.5206848

[263] Mikhail Belkin and Daniel Hsu and Siyuan Ma and Soumik Mandal Reconciling modern machine-learning practice and the classical bias–variance trade-off Proceedings of the National Academy of Sciences 2019 116 32 15849–15854

[264] Peter Bartlett For valid generalization the size of the weights is more important than the size of the network Advances in neural information processing systems 1996 9

[265] Lee-Ad Gottlieb and Aryeh Kontorovich and Robert Krauthgamer Efficient regression in metric spaces via approximate lipschitz extension IEEE Transactions on Information Theory 2017 63 8 4838–4849

[266] Vladimir M Tikhomirov \( \varepsilon\) -entropy and \( \varepsilon\) -capacity of sets in functional spaces Selected Works of AN Kolmogorov: Volume III: Information Theory and the Theory of Algorithms 1993 86–170

[267] Song Mei and Andrea Montanari The generalization error of random features regression: Precise asymptotics and the double descent curve Communications on Pure and Applied Mathematics 2022 75 4 667–766

[268] Trevor Hastie and Andrea Montanari and Saharon Rosset and Ryan J Tibshirani Surprises in high-dimensional ridgeless least squares interpolation The Annals of Statistics 2022 50 2 949–986

[269] Behnam Neyshabur and Ryota Tomioka and Nathan Srebro Norm-based capacity control in neural networks Conference on learning theory 2015 1376–1401 PMLR

[270] Julius Berner and Philipp Grohs and Gitta Kutyniok and Philipp Petersen The modern mathematics of deep learning 2021

[271] Sanjeev Arora and Rong Ge and Behnam Neyshabur and Yi Zhang Stronger generalization bounds for deep nets via a compression approach International Conference on Machine Learning 2018 254–263 PMLR

[272] Huan Xu and Shie Mannor Robustness and generalization Machine learning 2012 86 391–423

[273] Olivier Bousquet and André Elisseeff Stability and generalization The Journal of Machine Learning Research 2002 2 499–526

[274] Tomaso Poggio and Ryan Rifkin and Sayan Mukherjee and Partha Niyogi General conditions for predictivity in learning theory Nature 2004 428 6981 419–422

[275] Mikhail Belkin and Siyuan Ma and Soumik Mandal To understand deep learning we need to understand kernel learning International Conference on Machine Learning 2018 541–549 PMLR

[276] Weilin Li Generalization error of minimum weighted norm and kernel interpolation SIAM Journal on Mathematics of Data Science 2021 3 1 414–438

[277] Christian Szegedy and Wojciech Zaremba and Ilya Sutskever and Joan Bruna and Dumitru Erhan and Ian Goodfellow and Rob Fergus Intriguing properties of neural networks International Conference on Learning Representations (ICLR) 2014

[278] David Stutz and Matthias Hein and Bernt Schiele Disentangling adversarial robustness and generalization Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019 6976–6987

[279] Ian J Goodfellow and Jonathon Shlens and Christian Szegedy Explaining and harnessing adversarial examples International Conference on Learning Representations (ICLR) 2015

[280] Todd Huster and Cho-Yu Jason Chiang and Ritu Chadha Limitations of the lipschitz constant as a defense against adversarial examples ECML PKDD 2018 Workshops: Nemesis 2018, UrbReas 2018, SoGood 2018, IWAISe 2018, and Green Data Mining 2018, Dublin, Ireland, September 10-14, 2018, Proceedings 18 2019 16–29 Springer

[281] Matthias Hein and Maksym Andriushchenko Formal guarantees on the robustness of a classifier against adversarial manipulation Advances in neural information processing systems 2017 30

[282] Timon Gehr and Matthew Mirman and Dana Drachsler-Cohen and Petar Tsankov and Swarat Chaudhuri and Martin Vechev Ai2: Safety and robustness certification of neural networks with abstract interpretation 2018 IEEE symposium on security and privacy (SP) 2018 3–18 IEEE

[283] Marc Fischer and Mislav Balunovic and Dana Drachsler-Cohen and Timon Gehr and Ce Zhang and Martin Vechev DL2: training and querying neural networks with logic International Conference on Machine Learning 2019 1931–1941 PMLR

[284] Maximilian Baader and Matthew Mirman and Martin Vechev Universal approximation with certified networks arXiv preprint arXiv:1909.13846 2019

[285] Zi Wang and Aws Albarghouthi and Gautam Prakriya and Somesh Jha Interval universal approximation for neural networks Proceedings of the ACM on Programming Languages 2022 6 POPL 1–29

[286] Ling Huang and Anthony D Joseph and Blaine Nelson and Benjamin IP Rubinstein and J Doug Tygar Adversarial machine learning Proceedings of the 4th ACM workshop on Security and artificial intelligence 2011 43–58

[287] Wenjie Ruan and Xinping Yi and Xiaowei Huang Adversarial robustness of deep learning: Theory, algorithms, and applications Proceedings of the 30th ACM international conference on information & knowledge management 2021 4866–4869

[288] Nicolas Papernot and Patrick McDaniel and Ian Goodfellow and Somesh Jha and Z Berkay Celik and Ananthram Swami Practical black-box attacks against machine learning Proceedings of the 2017 ACM on Asia conference on computer and communications security 2017 506–519

[289] Seyed-Mohsen Moosavi-Dezfooli and Alhussein Fawzi and Omar Fawzi and Pascal Frossard Universal adversarial perturbations Proceedings of the IEEE conference on computer vision and pattern recognition 2017 1765–1773

[290] Nicholas Carlini and David Wagner Towards evaluating the robustness of neural networks 2017 ieee symposium on security and privacy (sp) 2017 39–57 Ieee

[291] Rima Alaifari and Giovanni S Alberti and Tandri Gauksson ADef: an iterative algorithm to construct adversarial deformations arXiv preprint arXiv:1804.07729 2018

[292] Chaowei Xiao and Jun-Yan Zhu and Bo Li and Warren He and Mingyan Liu and Dawn Song Spatially transformed adversarial examples arXiv preprint arXiv:1801.02612 2018

[293] A. Klenke Wahrscheinlichkeitstheorie Springer 2006

[294] Robert Scheichl and Jakob Zech Numerical Methods for Bayesian Inverse Problems 2021 Lecture Notes

[295] Walter Rudin Real and complex analysis McGraw-Hill Book Co., New York 1987 Third

[296] John B Conway A course in functional analysis Springer 2019 96