Overfitting

Overfitting occurs when fitting the training data too closely becomes counter-productive for out-of-sample prediction.

Bias-variance tradeoff

GermanBD-92 identified the bias-variance tradeoff, which decomposes the out-of-sample error into a sum of inductive bias, model variance and irreducible noise, for random samples drawn from the true distribution.

Formally, let \(f(x,S)\) be the prediction at input \(x\) of the algorithm trained on sample \(S\). Then

\[
\mathbb{E}_{x,y,S}\big(f(x,S)-y\big)^2 \;=\; \mathbb{E}_x\big[\mathrm{bias}(x)^2\big] + \mathbb{E}_x\big[\mathrm{variance}(x)\big] + \sigma^2,
\]

where the expectation is over \(x\), \(y\) and \(S\), where \(\mathrm{bias}(x) = \mathbb{E}_{S} f(x,S) - f^*(x)\) is the bias with respect to the optimal prediction \(f^*\), where \(\mathrm{variance}(x) = \mathbb{E}_{S} f(x,S)^2 - \big(\mathbb{E}_{S} f(x,S)\big)^2\) measures how much the prediction varies from one sample to another, and \(\sigma^2 = \mathbb{E}_{x,y}\big(f^*(x)-y\big)^2\) is the unpredictable component of \(y\) given \(x\).
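
To make the decomposition concrete, here is a minimal Monte Carlo sketch (not taken from GermanBD-92). It assumes a toy setup where \(f^*(x) = \sin(2\pi x)\), \(y = f^*(x) + \varepsilon\) with Gaussian noise of standard deviation sigma, and the learner is least-squares polynomial regression; the empirical test error is then compared with the estimated bias² + variance + σ².

```python
# Minimal Monte Carlo sketch of the bias-variance decomposition (toy setup, assumptions above).
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)
sigma, n_train, n_runs = 0.3, 30, 500
x_test = np.linspace(0, 1, 200)

def decompose(degree):
    preds = np.empty((n_runs, x_test.size))
    mse = 0.0
    for r in range(n_runs):                       # draw independent samples S
        x = rng.uniform(0, 1, n_train)
        y = f_star(x) + sigma * rng.normal(size=n_train)
        coefs = np.polyfit(x, y, degree)          # train f(., S)
        preds[r] = np.polyval(coefs, x_test)      # predict on fixed test inputs
        y_test = f_star(x_test) + sigma * rng.normal(size=x_test.size)
        mse += np.mean((preds[r] - y_test) ** 2) / n_runs
    bias2 = np.mean((preds.mean(axis=0) - f_star(x_test)) ** 2)   # E_x[bias(x)^2]
    variance = np.mean(preds.var(axis=0))                         # E_x[variance(x)]
    return bias2, variance, mse

for degree in (1, 3, 9):
    b2, var, mse = decompose(degree)
    print(f"degree {degree:2d}: bias^2={b2:.3f} variance={var:.3f} "
          f"bias^2+var+sigma^2={b2 + var + sigma**2:.3f} test MSE={mse:.3f}")
```

In this toy example, increasing the polynomial degree shrinks the bias term while inflating the variance term, which is the mechanism invoked in the next paragraph.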

It has been argued that overfitting is caused by increased variance, i.e. by a learning algorithm that is too sensitive to the randomness of the sample S. This sometimes occurs when the number of parameters of the learning algorithm is too large (but not necessarily!).

PAC-learning

To provide theoretical guarantees against overfitting, Valiant-84 introduced the concept of probably approximately correct (PAC) learning. More explanations here: Wandida-16a.

In particular, the fundamental theorem of statistical learning ShalevshwartzBendavid-14 Wandida-16b provides guarantees of PAC learning when the number of data points sufficiently exceeds the VC dimension of the class of learnable hypotheses. Since this VC dimension is often essentially the number of parameters (assuming a finite representation of the parameters as floats or doubles), this means that PAC learning is guaranteed when #data >> #parameters.
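
One quantitative form of the theorem (for the agnostic case, as stated up to unspecified universal constants \(C_1 \le C_2\) in ShalevshwartzBendavid-14) bounds the number \(m_{\mathcal{H}}(\epsilon, \delta)\) of samples needed to be, with probability at least \(1-\delta\), within excess risk \(\epsilon\) of the best hypothesis in a class \(\mathcal{H}\) of VC dimension \(d\):

\[
C_1 \, \frac{d + \log(1/\delta)}{\epsilon^2} \;\le\; m_{\mathcal{H}}(\epsilon, \delta) \;\le\; C_2 \, \frac{d + \log(1/\delta)}{\epsilon^2}.
\]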

This was the conventional wisdom for a while.

Test set overfitting

There have been concerns about overfitting of test sets, as these are used more and more to measure the performance of machine learning algorithms. However, RechtRSS-19 analyze statistical patterns in reported test-set performances and argue that actual progress is still being made.

Double descent

However, this conventional wisdom is in sharp contradiction with today's success of deep neural networks ZhangBHRV-17 BelkinHMM-19 NakkiranKBYBS-19, but also of kernel methods BelkinMM-18 BelkinRT-19, ridgeless (random feature) linear regression MuthukumarVS-19 BartlettLLT-19 MeiMontanari-19 HastieMRT-19 BelkinHX-19 JacotSSHG-20 and even ensembles BelkinHMM-19. Learning algorithms often seem to achieve their best out-of-sample performance when they are massively overparameterized and perfectly fit the training data (which is called interpolation).

Intriguingly, a double descent phenomenon often occurs BelkinHMM-19 NakkiranKBYBS-19: as the model grows, the test-set performance first behaves as predicted by the bias-variance tradeoff, but then improves again and ends up outperforming what classical statistical learning theory would advise.
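
The following sketch is one hedged way to reproduce a double descent curve numerically; it is not the experimental setup of any of the cited papers. It fits minimum-norm least squares on random ReLU features and sweeps the number P of features past the number n of training points; the target function, noise level and sizes are arbitrary choices, and the exact shape of the curve depends on the random seed.

```python
# Hedged double-descent sketch: min-norm least squares on random ReLU features,
# sweeping the number of features P past the number of training points n.
import numpy as np

rng = np.random.default_rng(1)
d, n, n_test, sigma = 10, 100, 2000, 0.5
beta = rng.normal(size=d) / np.sqrt(d)
f_star = lambda Z: np.tanh(Z @ beta)              # assumed target function

Z_train = rng.normal(size=(n, d))
Z_test = rng.normal(size=(n_test, d))
y_train = f_star(Z_train) + sigma * rng.normal(size=n)
y_test = f_star(Z_test)

for P in (20, 50, 90, 100, 110, 200, 500, 2000):
    W = rng.normal(size=(d, P)) / np.sqrt(d)                     # random first layer
    X_train = np.maximum(Z_train @ W, 0.0)                       # ReLU random features
    X_test = np.maximum(Z_test @ W, 0.0)
    theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)    # minimum-norm fit
    train_mse = np.mean((X_train @ theta - y_train) ** 2)
    test_mse = np.mean((X_test @ theta - y_test) ** 2)
    print(f"P={P:5d}  train_mse={train_mse:.4f}  test_mse={test_mse:.4f}")
```

Typically, the test error peaks when P is close to n (where the training data is barely interpolated) and then decreases again in the heavily overparameterized regime.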

All such results suggest that overfitting eventually disappears as models grow, which contradicts the conventional wisdom.

Details

ZhangBHRV-17 showed that large interpolating neural networks generalize well, even for large noise in the data. They also showed that inductive bias likely plays a limited role, as neural networks still manage to fit quite efficiently data whose labels are completely shuffled. They also proved that a neural network with 2n+d parameters can interpolate n data points of dimension d.
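
The shuffled-label observation can be illustrated with a small hedged sketch (far smaller than the experiments of ZhangBHRV-17): an overparameterized MLP is driven to near-perfect training accuracy on labels drawn uniformly at random, even though no generalization is possible. Sizes and hyperparameters are arbitrary choices.

```python
# Hedged sketch: an overparameterized MLP memorizing completely random labels.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
y_random = rng.integers(0, 2, size=n)        # labels carry no information about X

net = MLPClassifier(hidden_layer_sizes=(1024,), alpha=0.0,
                    max_iter=5000, tol=1e-6, random_state=0)
net.fit(X, y_random)
print("training accuracy on random labels:", net.score(X, y_random))
# Accuracy on fresh random labels would of course stay near chance (0.5).
```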

BelkinHMM-19 observed double descent for random Fourier features (which RahimiRecht-07 proved to be intimately connected to kernel methods), neural networks, decision trees and ensemble methods.
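
For reference, here is a minimal sketch of the RahimiRecht-07 construction for a Gaussian (RBF) kernel; the bandwidth and dimensions are arbitrary choices. The inner product of the random Fourier features approximates the kernel value.

```python
# Random Fourier features approximating a Gaussian (RBF) kernel.
import numpy as np

rng = np.random.default_rng(2)
d, P, bandwidth = 5, 5000, 1.0
W = rng.normal(scale=1.0 / bandwidth, size=(d, P))   # frequencies ~ N(0, I / bandwidth^2)
b = rng.uniform(0, 2 * np.pi, size=P)                # random phases

def rff(X):
    # random Fourier feature map: sqrt(2/P) * cos(XW + b)
    return np.sqrt(2.0 / P) * np.cos(X @ W + b)

x, z = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - z) ** 2) / (2 * bandwidth ** 2))      # RBF kernel value
approx = (rff(x[None, :]) @ rff(z[None, :]).T)[0, 0]              # feature inner product
print(f"k(x,z) = {exact:.4f}   random-feature approximation = {approx:.4f}")
```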

NakkiranKBYBS-19 show that a very wide variety of deep neural networks exhibit a wide variety of double descent phenomena. Not only is there double descent with respect to the number of parameters (the width of the neural networks), but, weirdly, there also seems to be double descent with respect to the number of training epochs. They conjecture that overfitting peaks when the "effective model complexity" (the largest number of data points for which the model is able to achieve small training loss) matches the size of the training set; below and beyond this critical point, overfitting appears to vanish.
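
As a hedged illustration (not the procedure of NakkiranKBYBS-19), one can probe an "effective model complexity"-like quantity for a fixed learner by searching for the largest training-set size at which it still drives the training loss essentially to zero; the learner, threshold and data model below are arbitrary choices.

```python
# Rough probe of an "effective model complexity"-like quantity for a fixed learner
# (min-norm least squares on P random ReLU features).
import numpy as np

rng = np.random.default_rng(3)
d, P, sigma, eps = 10, 300, 0.5, 1e-3
W = rng.normal(size=(d, P)) / np.sqrt(d)

def train_mse(n):
    Z = rng.normal(size=(n, d))
    y = np.tanh(Z @ rng.normal(size=d)) + sigma * rng.normal(size=n)
    X = np.maximum(Z @ W, 0.0)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ theta - y) ** 2)

emc = max(n for n in range(50, 1001, 50) if train_mse(n) < eps)
print("estimated effective model complexity (roughly):", emc)   # expected to be close to P
```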

BelkinMM-18 present experiments showing that interpolating kernel methods also generalize well and are able to fit random labels (though in this paper they do not exhibit double descent). They also show that, because the Hilbert-space norms of interpolators grow superpolynomially with the number n of data points (as \(\exp(\Theta(n^{1/d}))\)), the usual norm-based bounds controlling overfitting are actually trivial for large datasets. This indicates the need for a radically different approach to understanding overfitting.
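
The norm growth can be observed directly with a small sketch (a toy illustration, not the experiments of BelkinMM-18): interpolating noisy labels with an RBF kernel and tracking the RKHS norm \(y^T K^{-1} y\) of the interpolant as n grows. The data model and bandwidth are arbitrary; the point is only that the norm blows up with n.

```python
# RBF-kernel interpolation of noisy labels, tracking the RKHS norm of the interpolant.
import numpy as np

rng = np.random.default_rng(4)
d, bandwidth, sigma = 5, 1.0, 0.5

def rbf_gram(X):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

for n in (50, 100, 200, 400, 800):
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + sigma * rng.normal(size=n)
    K = rbf_gram(X) + 1e-10 * np.eye(n)            # tiny jitter for numerical stability
    alpha = np.linalg.solve(K, y)                  # interpolating coefficients
    print(f"n={n:4d}  RKHS norm^2 of interpolant = {float(y @ alpha):.3e}")
```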

BelkinRT-19 show examples of singular kernel interpolators (\(K(x,y) \sim \|x-y\|^{-a}\)) that achieve optimal rates, even for improper learning (meaning that the true function to learn does not belong to the set of hypotheses).
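
For intuition, here is a hedged sketch of a Nadaraya-Watson-style weighted average with a singular kernel, only in the spirit of (and simpler than) the estimators analyzed by BelkinRT-19: the weight \(\|x - x_i\|^{-a}\) diverges at the training points, so the estimator interpolates the labels while remaining a smooth local average elsewhere. The data model and exponent are arbitrary choices.

```python
# Singular-kernel weighted average: interpolates training labels, smooth elsewhere.
import numpy as np

rng = np.random.default_rng(5)
n, a, sigma = 200, 1.0, 0.3
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + sigma * rng.normal(size=n)

def predict(x):
    dist = np.linalg.norm(X - x, axis=1)
    hit = dist < 1e-12
    if hit.any():                 # exactly on a training point: return its label
        return y[hit][0]
    w = dist ** (-a)              # singular weights
    return np.sum(w * y) / np.sum(w)

print("prediction at a training point:", predict(X[0]), " label:", y[0])
print("prediction at a new point     :", predict(np.array([0.1, -0.2])))
```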

Note that the connection between kernel methods and neural networks has been made, for instance, by RahimiRecht-07 and JacotHG-18. Essentially, random features implement approximate kernel methods, and the first layers of a neural network with random (or even trained) weights can be regarded as computing random features.
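
The "random first layer as random features for a kernel" view can be checked numerically. The sketch below compares the empirical inner product of ReLU features computed by a random layer with the closed-form kernel it approximates; the closed form used here is the degree-1 arc-cosine kernel, which is an assumption quoted from outside this page.

```python
# Empirical ReLU random-feature inner product vs. its (assumed) arc-cosine kernel limit.
import numpy as np

rng = np.random.default_rng(6)
d, P = 8, 200000
x, z = rng.normal(size=d), rng.normal(size=d)

W = rng.normal(size=(P, d))                        # random first-layer weights
empirical = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ z, 0))

theta = np.arccos(np.clip(x @ z / (np.linalg.norm(x) * np.linalg.norm(z)), -1, 1))
closed_form = (np.linalg.norm(x) * np.linalg.norm(z) / (2 * np.pi)) * (
    np.sin(theta) + (np.pi - theta) * np.cos(theta))

print(f"empirical E[relu(w.x) relu(w.z)] ≈ {empirical:.4f}   closed form ≈ {closed_form:.4f}")
```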

HastieMRT-19 use random matrix theory (to estimate the eigenvalues of \(X^T X\) when X consists of random samples) to show that ridgeless regression (regression with minimum \(\ell_2\)-norm) features "infinite double descent" as the size n of the training set grows to infinity along with the number of parameters p (they assume \(p/n \to \gamma\), and show infinite overfitting for \(\gamma \approx 1\)). This is shown both for a linear model where \(X = \Sigma^{1/2} Z\), for some fixed \(\Sigma\) and some well-behaved Z of mean 0 and variance 1, and for a component-wise nonlinearity \(X = \sigma(WZ)\). In both cases, it is assumed that \(y = x^T \beta + \varepsilon\), where \(\mathbb{E}[\varepsilon] = 0\). There are also assumptions of finite fixed variance for \(\varepsilon\) and finite fourth moment for Z (needed for random matrix theory).
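
Here is a hedged numerical sketch of the isotropic special case (\(\Sigma = I\)) of this linear model, fitted by minimum-norm least squares, with the test risk tracked as \(\gamma = p/n\) crosses 1. The constants are arbitrary, and the explosion near \(\gamma \approx 1\) is only qualitative (its magnitude varies with the seed).

```python
# Min-norm ("ridgeless") least squares in an isotropic linear model, sweeping gamma = p/n.
import numpy as np

rng = np.random.default_rng(7)
n, n_test, sigma, snr = 200, 4000, 1.0, 1.0

for gamma in (0.1, 0.5, 0.9, 1.0, 1.1, 2.0, 5.0, 10.0):
    p = int(gamma * n)
    beta = rng.normal(size=p) * np.sqrt(snr / p)      # ||beta||^2 ≈ snr
    X = rng.normal(size=(n, p))                       # Sigma = I, standard normal Z
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm solution
    X_test = rng.normal(size=(n_test, p))
    risk = np.mean((X_test @ (beta_hat - beta)) ** 2) # excess risk, excluding noise
    print(f"gamma={gamma:5.2f}  p={p:5d}  test risk={risk:8.3f}")
```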

BelkinHX-19 analyze two other data models. The first is a classical Gaussian linear regression with a huge space of features, but where the regression is performed only on a (random) subset T of the features; in this case double descent is observed, and the errors can be derived from the norms of the true regression vector restricted to the T-coordinates and to the non-T-coordinates. A similar analysis is then provided for a random Fourier feature model.

MeiMontanari-19 consider still another data model, where z is drawn uniformly at random on a d-sphere, and \(y = f(z) + \varepsilon\), such that \(\mathbb{E}[\varepsilon] = 0\) and \(\varepsilon\) has finite fourth moment. The ridgeless linear regression is then over random features \(x_i = \sigma(W z_i)\), where \(\sigma\) is applied component-wise and W has random rows of \(\ell_2\)-norm equal to 1. They prove that this also yields a double descent.

JacotSSHG-20 study a random feature model where the true function and the features are drawn from a Gaussian process. They prove an upper bound on the difference between the regularized Bayesian posterior prediction at a point x and the expectation of the (slightly differently) regularized random feature prediction at this point, which can be argued to go to zero under reasonable assumptions as the number of parameters goes to infinity. Moreover, as the regularization goes to zero, the random feature linear regression comes closer to the Bayesian posterior prediction. They also prove a bound between the expected loss of the regularized random feature model and that of the regularized kernel model. Furthermore, they show that the "effective ridge" \(\lambda^*\) (i.e. the ridge of the kernel method equivalent to the ridge \(\lambda\) of the random feature regression) is key to understanding the variance explosion; in particular, they relate it to the explosion of \(\partial \lambda^* / \partial \lambda\). An interesting insight is that the adequate linear regression regularization should thus be smaller than (and can be exactly computed from) the kernel method ridge.