Model complexity selection

[Flow chart: the system identification loop — start → design of experiment → perform experiment → collect data → determine/choose model structure → choose method → estimate parameters → model assessment (validation) → model accepted? If yes, end; if no, return to an earlier step. A priori knowledge, the planned use of the model and a new data set feed into the appropriate stages.]

Model complexity selection: given a model class M(θ), estimate the model complexity (the hyperparameter) p, i.e. choose the "best" model class M_p(θ).

Examples:
– estimate the best order of the polynomial in a polynomial fitting problem;
– estimate the best order of an ARX model.

The intended use of the model should be taken into account, and both overfitting and underfitting should be avoided. Model complexity estimation and model assessment are strongly related and are often based on common criteria.

Bear in mind the parsimony principle: out of two or more competing models that all explain the data well, the model with the smaller complexity should be chosen.

Training set: the set of data used for learning the model, i.e. for computing the estimate θ̂.

Is the minimization of J(θ) over the training set a good criterion for estimating the model complexity?

Example 1: polynomial fitting (the true model is a 3rd-order polynomial).

[Figure: polynomial fits of increasing order to the training data, from a clearly underfitting low-order model to a clearly overfitting high-order one.]

Underfitting: the model (the complexity) is not rich (large) enough to fit the data well ⇒ poor generalization.
Overfitting: the model (the complexity) is too rich (large) and adapts too closely to the training data ⇒ poor generalization.

What happens if we use the previously identified models on a new data set?

[Figure: the previously fitted polynomials evaluated on a new data set; the overfitted models generalize poorly.]

Validation set: a data set used for evaluating the predictive capabilities of the models obtained from the training set, in order to estimate the model complexity.

If the number of available samples is large enough, it is convenient to divide the data set into two parts: a training set and a validation set.

When using the training set, the higher the model complexity, the better the data fit; the prediction error is thus underestimated. This is not the case when using the validation set. With real data, the loss function decreases monotonically with the order on the training set and exhibits a U-shape on the validation set.

Example 2: ARX model fitting (the true ARX model has order n = 4).

[Figure: loss function versus model order on the training set and on the validation set.]

Example 3: AR model fitting (real data).

[Figure: loss function versus model order on the training set and on the validation set.]
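As a concrete illustration of the training/validation procedure above, the following Python/NumPy sketch fits polynomials of increasing order on a training set and evaluates the loss on a separate validation set. The data-generating polynomial, the noise level and the sample sizes are illustrative assumptions, not the values behind the figures above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a 3rd-order polynomial plus noise (assumed values).
def true_system(x):
    return 0.5 * x**3 - 2.0 * x**2 + x + 1.0

x_train = np.sort(rng.uniform(0.0, 4.5, 30))
x_val   = np.sort(rng.uniform(0.0, 4.5, 30))
y_train = true_system(x_train) + rng.normal(0.0, 1.0, x_train.size)
y_val   = true_system(x_val)   + rng.normal(0.0, 1.0, x_val.size)

def loss(order, x_eval, y_eval):
    """LS fit of a polynomial of the given order on the training set,
    then mean squared prediction error J evaluated on (x_eval, y_eval)."""
    coeffs = np.polyfit(x_train, y_train, order)   # least squares estimate
    residuals = y_eval - np.polyval(coeffs, x_eval)
    return np.mean(residuals**2)

orders = range(1, 11)
J_train = [loss(n, x_train, y_train) for n in orders]
J_val   = [loss(n, x_val, y_val) for n in orders]

# J_train decreases monotonically with the order, while J_val is typically
# U-shaped: its minimizer is the estimated model complexity.
best_order = list(orders)[int(np.argmin(J_val))]
print("estimated order:", best_order)
```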
If the validation set is used repeatedly to estimate the model complexity, the prediction error may be underestimated as well. This happens with very complex models such as neural networks. For this reason, when dealing with neural networks the data set is split into three parts; the third part, called the test set, is used for model assessment, i.e. for testing the predictive capabilities of the final chosen model. This requires a very large number of available samples.

Is it possible to design criteria that allow the model complexity to be estimated using the training set alone? The answer is yes.

The F-test

Let M_p₁(θ) and M_p₂(θ) be two model classes such that p₁ < p₂ (p = 2n for ARX models, p = n for AR models, etc.). Consider the test quantity

x = N [J(θ̂¹_N) − J(θ̂²_N)] / J(θ̂²_N)

where θ̂¹_N and θ̂²_N are the least squares estimates in the two classes.

Intuitively:
– x large ⇒ the decrease in the loss function is significant, hence M_p₂(θ) is better than M_p₁(θ);
– x small ⇒ M_p₁(θ) and M_p₂(θ) are almost equivalent, so M_p₁(θ) should be chosen according to the parsimony principle.

How can "large" and "small" be quantified?
– If M_p₁(θ) is not large enough to include the true system, J(θ̂¹_N) − J(θ̂²_N) is O(1) and x is of magnitude N.
– If M_p₁(θ) is large enough, it is possible to prove that x → χ²(p₂ − p₁) in distribution as N → ∞.

The following statistical test can therefore be performed:

H0: M_p₁(θ) is suitable to describe the system
H1: M_p₁(θ) is not suitable

that is

H0: x ∼ χ²(p₂ − p₁)
H1: not H0

After choosing the significance level α:

x ≤ χ²_α(p₂ − p₁) ⇒ accept H0
x > χ²_α(p₂ − p₁) ⇒ accept H1

The final prediction error (FPE) criterion

Let θ̂_N be the LS estimate of a model of complexity p and assume that it is used to predict future data. Assume also that a true model exists and consider the prediction error variance (the expectation is taken with respect to future data):

V(θ̂_N) = E[(y(t) − ŷ(t|t−1, θ̂_N))²]

By substituting y(t) = φᵀ(t) θ* + w(t) into V(θ̂_N) and computing the expectation we get

V(θ̂_N) = σ²_w + (θ̂_N − θ*)ᵀ Σ_φ (θ̂_N − θ*)

where Σ_φ is the covariance matrix of the regressor φ(t). Consider now the criterion function FPE = E[V(θ̂_N)], where the expectation is taken with respect to past data. By taking into account that

– E[(θ̂_N − θ*)ᵀ Σ_φ (θ̂_N − θ*)] = E[trace(Σ_φ (θ̂_N − θ*)(θ̂_N − θ*)ᵀ)]
– asymptotically √N (θ̂_N − θ*) ∼ N(0, σ²_w Σ_φ⁻¹)

so that the trace term is approximately trace(Σ_φ σ²_w Σ_φ⁻¹ / N) = σ²_w p/N, it is easy to obtain

FPE ≈ σ²_w (1 + p/N)

There is thus a penalty for using models with many (unnecessary) parameters, and we expect FPE to be minimized by the true model. In practice, an asymptotically unbiased estimate of σ²_w is needed,

σ̂²_w = (1/(N − p)) Σ_{t=1}^N ε(t, θ̂_N)²

so that

FPE = σ̂²_w (N + p)/N = [(N + p)/(N − p)] J(θ̂_N)

How is this criterion used? Compute FPE(n) for the candidate orders and choose the order that minimizes it.

Note that, for large N,

FPE ≈ J(θ̂_N) + (2p/N) J(θ̂_N)

so that, for large N, this criterion belongs to the family of criteria with complexity terms.
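The sketch below shows one way the F-test and the FPE criterion could be applied to AR order selection based on least squares fits. It uses NumPy and SciPy; the AR(4) coefficients, noise level and candidate order range are assumptions chosen for illustration, not the data of Examples 2 and 3.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Illustrative data: a stable AR(4) process (assumed coefficients).
a_true = np.array([1.7, -1.5, 0.89, -0.35])
N = 1000
y = np.zeros(N)
for t in range(4, N):
    y[t] = a_true @ y[t-4:t][::-1] + rng.normal(0.0, 0.5)

def ar_loss(order):
    """LS fit of an AR(order) model; returns J(theta_hat) and the
    number of residuals used."""
    Phi = np.column_stack([y[order - 1 - k : N - 1 - k] for k in range(order)])
    Y = y[order:]
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return np.mean((Y - Phi @ theta) ** 2), Y.size

# --- F-test between two nested AR models (p = n for AR models) ---
p1, p2, alpha = 4, 5, 0.05
J1, _    = ar_loss(p1)
J2, Neff = ar_loss(p2)     # the small difference in residual counts is ignored
x = Neff * (J1 - J2) / J2
threshold = chi2.ppf(1.0 - alpha, df=p2 - p1)   # chi-square quantile at level alpha
print("x =", x, "-> accept H0" if x <= threshold else "-> accept H1")

# --- FPE(n) over a range of candidate orders ---
def fpe(order):
    J, Neff = ar_loss(order)
    return (Neff + order) / (Neff - order) * J

orders = range(1, 13)
print("order minimizing FPE:", min(orders, key=fpe))
```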
Criteria with complexity terms

These criteria are obtained by penalizing in some way the decrease of J(θ̂_N) with increasing order; the order giving the smallest value of the criterion is selected. General form:

V(θ̂_N) = N log J(θ̂_N) + f(N, p)

where f(N, p) penalizes high-order models.

Akaike information criterion (AIC):

AIC = N log J(θ̂_N) + 2p

AIC and FPE are asymptotically equivalent. FPE and AIC do not give consistent estimates of n (the probability of overestimating the order is non-null).

To get consistent estimates, the penalizing function must be of the form

f(N, p) = k p g(N),   with lim_{N→∞} g(N) = ∞ and lim_{N→∞} g(N)/N = 0

Minimum description length (MDL) criterion:

MDL = N log J(θ̂_N) + 2p log N

MDL leads, in general, to models of lower complexity than AIC and FPE. Even though the derivation is different, the MDL approach is formally equivalent to the Bayesian information criterion (BIC).

Example: polynomial fitting (true order n = 3, see Example 1 above).

[Figure: FPE (left) and MDL (right) as functions of the polynomial order.]

Example: ARX model fitting (true order n = 4, see Example 2 above).

[Figure: FPE (left) and MDL (right) as functions of the model order.]

Example: AR model fitting (real data, see Example 3 above).

[Figure: FPE (left) and MDL (right) as functions of the model order.]
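For completeness, here is a minimal sketch of order selection by AIC and MDL, using the formulas above on the same kind of synthetic AR(4) data as in the previous sketch; all numerical values are illustrative assumptions, not the data of the examples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative AR(4) data, as in the previous sketch (assumed values).
a_true = np.array([1.7, -1.5, 0.89, -0.35])
N = 1000
y = np.zeros(N)
for t in range(4, N):
    y[t] = a_true @ y[t-4:t][::-1] + rng.normal(0.0, 0.5)

def ar_loss(order):
    """LS fit of an AR(order) model; returns J(theta_hat) and the sample count."""
    Phi = np.column_stack([y[order - 1 - k : N - 1 - k] for k in range(order)])
    Y = y[order:]
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return np.mean((Y - Phi @ theta) ** 2), Y.size

def aic(order):
    J, Neff = ar_loss(order)
    return Neff * np.log(J) + 2 * order                 # AIC = N log J + 2p

def mdl(order):
    J, Neff = ar_loss(order)
    return Neff * np.log(J) + 2 * order * np.log(Neff)  # MDL = N log J + 2p log N

orders = range(1, 13)
print("order minimizing AIC:", min(orders, key=aic))
print("order minimizing MDL:", min(orders, key=mdl))
```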