choose.k                package:mgcv                R Documentation

_B_a_s_i_s _d_i_m_e_n_s_i_o_n _c_h_o_i_c_e _f_o_r _s_m_o_o_t_h_s

_D_e_s_c_r_i_p_t_i_o_n:

     Choosing the basis dimension, and checking the choice, when using
     penalized regression smoothers.

     Penalized regression smoothers gain computational efficiency by
     virtue of being defined using a basis of relatively modest size,
     'k'. When setting up models in the 'mgcv' package, using 's' or
     'te' terms in a model formula, 'k' must be chosen. 

     In practice 'k-1' (or 'k') sets the upper limit on the degrees of
     freedom associated with an 's' smooth (1 degree of freedom is
     usually lost to the identifiability constraint on the smooth). For
     'te' smooths the upper limit of the degrees of freedom is given by
     the product of the 'k' values provided for each marginal smooth
     less one, for the constraint. However the actual effective degrees
     of freedom are controlled by the degree of penalization selected
     during fitting, by GCV, AIC or whatever is specified. The
     exception to this is if a smooth is specified using the 'fx=TRUE'
     option, in which case it is unpenalized.  

     So, exact choice of 'k' is not generally critical: it should be
     chosen to be large enough that you are reasonably sure of having
     enough degrees of freedom to represent the underlying `truth'
     reasonably well, but small enough to maintain reasonable
     computational efficiency. Clearly `large' and `small' are
     dependent on the particular problem being addressed. 

     As with all model assumptions, it is useful to be able to check
     the choice of 'k' informally. If the effective degrees of freedom
     for a model term are estimated to be much less than 'k-1' then
     this is unlikely to be very worthwhile, but as the EDF approach
     'k-1', checking can be important. A useful general purpose
     approach goes as follows: (i) fit your model and extract the
     deviance residuals; (ii) for each smooth term in your model, fit
     an equivalent, single, smooth to the residuals, using a
     substantially increased 'k' to see if there is pattern in the
     residuals that could potentially be explained by increasing 'k'.
     Examples are provided below.

     More sophisticated approaches based on partial residuals are also
     possible.

     One scenario that can cause confusion is this: a model is fitted
     with 'k=10' for a smooth term, and the EDF for the term is
     estimated as 7.6, some way below the maximum of 9. The model is
     then refitted with 'k=20' and the EDF increases to 8.7 - what is
     happening - how come the EDF was not 8.7 the first time around?
     The explanation is that the function space with 'k=20' contains a
     larger subspace of functions with EDF 8.7 than did the function
     space with 'k=10': one of the functions in this larger subspace
     fits the data a little better than did any function in the smaller
     subspace. These subtleties seldom have much impact on the
     statistical conclusions to be drawn from a model fit, however.

_A_u_t_h_o_r(_s):

     Simon N. Wood simon.wood@r-project.org

_R_e_f_e_r_e_n_c_e_s:

     Wood, S.N. (2006) Generalized Additive Models: An Introduction
     with R. CRC.

     <URL: http://www.maths.bath.ac.uk/~sw283/>

_E_x_a_m_p_l_e_s:

     ## Simulate some data ....
     library(mgcv)
     set.seed(0) 
     dat <- gamSim(1,n=400,scale=2)
     ## fit a GAM with quite low `k'
     b<-gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=6)+s(x3,k=6),data=dat)
     plot(b,pages=1)

     ## check for residual pattern, removeable by increasing `k'
     ## typically `k', below, chould be substantially larger than 
     ## the original, `k' but certainly less than n/2.
     ## Note use of cheap "cs" shrinkage smoothers, and gamma=1.4
     ## to reduce chance of overfitting...
     rsd <- residuals(b)
     gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
     gam(rsd~s(x1,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
     gam(rsd~s(x2,k=40,bs="cs"),gamma=1.4,data=dat) ## `k' too low
     gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine

     ## similar example with multi-dimensional smooth
     b1 <- gam(y~s(x0)+s(x1,x2,k=15)+s(x3),data=dat)
     rsd <- residuals(b1)
     gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
     gam(rsd~s(x1,x2,k=100,bs="ts"),gamma=1.4,data=dat) ## `k' too low
     gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
      
     ## and a `te' example
     b2 <- gam(y~s(x0)+te(x1,x2,k=4)+s(x3),data=dat)
     rsd <- residuals(b2)
     gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
     gam(rsd~te(x1,x2,k=10,bs="cs"),gamma=1.4,data=dat) ## `k' too low
     gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine

     ## same approach works with other families in the original model
     dat <- gamSim(1,n=400,scale=.25,dist="poisson")
     bp<-gam(y~s(x0,k=6)+s(x1,k=6)+s(x2,k=6)+s(x3,k=6),
             family=poisson,data=dat)
     rsd <- residuals(bp)
     gam(rsd~s(x0,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
     gam(rsd~s(x1,k=40,bs="cs"),gamma=1.4,data=dat) ## fine
     gam(rsd~s(x2,k=40,bs="cs"),gamma=1.4,data=dat) ## `k' too low
     gam(rsd~s(x3,k=40,bs="cs"),gamma=1.4,data=dat) ## fine

     rm(dat)