smooth.spline              package:stats              R Documentation

Fit a Smoothing Spline

Description:

     Fits a cubic smoothing spline to the supplied data.

Usage:

     smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
                   cv = FALSE, all.knots = FALSE, nknots = NULL,
                   keep.data = TRUE, df.offset = 0, penalty = 1,
                   control.spar = list())

Arguments:

       x: a vector giving the values of the predictor variable, or a
          list or a two-column matrix specifying x and y.

       y: responses.  If 'y' is missing, the responses are assumed to
          be specified by 'x'.

       w: optional vector of weights of the same length as 'x';
          defaults to all 1.

      df: the desired equivalent number of degrees of freedom (trace
          of the smoother matrix).

    spar: smoothing parameter, typically (but not necessarily) in
          (0,1].  The coefficient lambda of the integral of the
          squared second derivative in the fit (penalized log
          likelihood) criterion is a monotone function of 'spar';
          see the details below.

      cv: ordinary leave-one-out ('TRUE') or 'generalized'
          cross-validation (GCV) when 'FALSE'.

all.knots: if 'TRUE', all distinct points in 'x' are used as knots.
          If 'FALSE' (default), a subset of 'x[]' is used,
          specifically 'x[j]' where the 'nknots' indices are evenly
          spaced in '1:n'; see also the next argument 'nknots'.

  nknots: integer giving the number of knots to use when
          'all.knots = FALSE'.  By default, for n > 49 this is less
          than n, the number of unique 'x' values.

keep.data: logical specifying if the input data should be kept in the
          result.  If 'TRUE' (as per default), fitted values and
          residuals are available from the result.

df.offset: allows the degrees of freedom to be increased by
          'df.offset' in the GCV criterion.

 penalty: the coefficient of the penalty for degrees of freedom in
          the GCV criterion.

control.spar: optional list with named components controlling the
          root finding when the smoothing parameter 'spar' is
          computed, i.e., missing or 'NULL'; see below.

          *Note* that this is partly _experimental_ and may change
          with general spar computation improvements!

          low: lower bound for 'spar'; defaults to -1.5 (used to
               implicitly default to 0 in R versions earlier than
               1.4).

          high: upper bound for 'spar'; defaults to +1.5.

          tol: the absolute precision (*tol*erance) used; defaults
               to 1e-4 (formerly 1e-3).

          eps: the relative precision used; defaults to 2e-8
               (formerly 0.00244).

          trace: logical indicating if iterations should be traced.

          maxit: integer giving the maximal number of iterations;
               defaults to 500.

          Note that 'spar' is only searched for in the interval
          [low, high].

Details:

     The 'x' vector should contain at least four distinct values.
     _Distinct_ here means 'distinct after rounding to 6 significant
     digits', i.e., 'x' will be transformed to
     'unique(sort(signif(x, 6)))', and 'y' and 'w' are pooled
     accordingly.

     The computational lambda used (as a function of 'spar') is

         lambda = r * 256^(3*spar - 1)

     where

         r = tr(X' W X) / tr(Sigma),

     Sigma is the matrix given by

         Sigma[i,j] = Integral B''[i](t) B''[j](t) dt,

     X is given by X[i,j] = B[j](x[i]), W is the diagonal matrix of
     weights (scaled such that its trace is n, the original number of
     observations), and B[k](.) is the k-th B-spline.

     Note that with these definitions, f_i = f(x_i), and with the
     B-spline basis representation f = X c (i.e., c is the vector of
     spline coefficients), the penalized log likelihood is

         L = (y - f)' W (y - f) + lambda c' Sigma c,

     and hence c is the solution of the (ridge regression) problem

         (X' W X + lambda Sigma) c = X' W y.

     If 'spar' is missing or 'NULL', the value of 'df' is used to
     determine the degree of smoothing.  If both are missing,
     leave-one-out cross-validation (ordinary or 'generalized' as
     determined by 'cv') is used to determine lambda.

     Note that from the above relation, 'spar' is

         spar = s0 + 0.0601 * log(lambda),

     which is intentionally _different_ from the S-PLUS
     implementation of 'smooth.spline' (where 'spar' is proportional
     to lambda).  In R's (log lambda) scale, it makes more sense to
     vary 'spar' linearly.

     Note however that currently the results may become very
     unreliable for 'spar' values smaller than about -1 or -2.  The
     same may happen for values larger than 2 or so.  Don't think of
     setting 'spar' or the controls 'low' and 'high' outside such a
     safe range, unless you know what you are doing!

     The 'generalized' cross-validation method will work correctly
     when there are duplicated points in 'x'.  However, it is
     ambiguous what leave-one-out cross-validation means with
     duplicated points, and the internal code uses an approximation
     that involves leaving out groups of duplicated points.
     'cv = TRUE' is best avoided in that case.
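     The following sketch (not part of the original page; it uses
     simulated data, and the 'spar' values 0.3, 0.6 and 0.9 are
     arbitrary) illustrates the monotone spar -> lambda -> df
     relationship described above:

         ## Sketch: how 'spar' maps to the reported 'lambda' and 'df'
         set.seed(7)
         x <- sort(runif(100))
         y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
         fits <- lapply(c(0.3, 0.6, 0.9),
                        function(sp) smooth.spline(x, y, spar = sp))
         ## larger 'spar' => larger 'lambda' => smaller 'df' (more smoothing)
         sapply(fits, function(f) c(spar = f$spar, lambda = f$lambda, df = f$df))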
Value:

     An object of class '"smooth.spline"' with components

       x: the _distinct_ 'x' values in increasing order; see the
          'Details' above.

       y: the fitted values corresponding to 'x'.

       w: the weights used at the unique values of 'x'.

     yin: the y values used at the unique values of 'x'.

    data: only if 'keep.data = TRUE': itself a 'list' with components
          'x', 'y' and 'w' of the same length.  These are the
          original (x_i, y_i, w_i), i = 1, ..., n, values, where
          'data$x' may have repeated values and hence be longer than
          the above 'x' component; see details.

     lev: leverages, the diagonal values of the smoother matrix.

 cv.crit: cross-validation score, 'generalized' or true, depending on
          'cv'.

pen.crit: the penalized criterion.

    crit: the criterion value minimized in the underlying '.Fortran'
          routine 'sslvrg'.

      df: equivalent degrees of freedom used.  Note that (currently)
          this value may become quite imprecise when the true 'df' is
          between 1 and 2.

    spar: the value of 'spar' computed or given.

  lambda: the value of lambda corresponding to 'spar'; see the
          details above.

  iparms: named integer(3) vector where '..$iparms["iter"]' gives the
          number of 'spar' computing iterations used.

     fit: list for use by 'predict.smooth.spline', with components

          knot: the knot sequence (including the repeated boundary
                knots).

          nk: number of coefficients, or number of 'proper' knots
              plus 2.

          coef: coefficients for the spline basis used.

          min, range: numbers giving the corresponding quantities of
              'x'.

    call: the matched call.

Note:

     The default 'all.knots = FALSE' and 'nknots = NULL' entails
     using only O(n^0.2) knots instead of n for n > 49.  This cuts
     speed and memory requirements, but not drastically anymore since
     R version 1.5.1, where the cost is only O(nk) + O(n), with nk
     the number of knots.

     In this case, where not all unique 'x' values are used as knots,
     the result is not a smoothing spline in the strict sense, but it
     is very close unless a small smoothing parameter (or large 'df')
     is used.
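     A minimal sketch (simulated data, not from the original page)
     comparing the reduced default knot set with 'all.knots = TRUE';
     the 'nk' component of the 'fit' list reports the resulting
     number of coefficients:

         ## Sketch: default O(n^0.2) knots vs. all distinct 'x' values
         set.seed(1)
         x <- sort(runif(200))
         y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
         fit.default <- smooth.spline(x, y)                    # reduced knot set
         fit.all     <- smooth.spline(x, y, all.knots = TRUE)  # all distinct x
         c(default = fit.default$fit$nk, all.knots = fit.all$fit$nk)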
Author(s):

     R implementation by B. D. Ripley and Martin Maechler
     ('spar'/lambda, etc).

     This function is based on code in the GAMFIT Fortran program by
     T. Hastie and R. Tibshirani, which makes use of spline code by
     Finbarr O'Sullivan.  Its design parallels the 'smooth.spline'
     function of Chambers & Hastie (1992).

References:

     Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in
     S_.  Wadsworth & Brooks/Cole.

     Green, P. J. and Silverman, B. W. (1994) _Nonparametric
     Regression and Generalized Linear Models: A Roughness Penalty
     Approach_.  Chapman and Hall.

     Hastie, T. J. and Tibshirani, R. J. (1990) _Generalized Additive
     Models_.  Chapman and Hall.

See Also:

     'predict.smooth.spline' for evaluating the spline and its
     derivatives.

Examples:

     require(graphics)

     attach(cars)
     plot(speed, dist, main = "data(cars) & smoothing splines")
     cars.spl <- smooth.spline(speed, dist)
     (cars.spl)
     ## This example has duplicate points, so avoid cv = TRUE

     lines(cars.spl, col = "blue")
     lines(smooth.spline(speed, dist, df = 10), lty = 2, col = "red")
     legend(5, 120,
            c(paste("default [C.V.] => df =", round(cars.spl$df, 1)),
              "s( * , df = 10)"),
            col = c("blue", "red"), lty = 1:2, bg = "bisque")
     detach()

     ## Residual (Tukey Anscombe) plot:
     plot(residuals(cars.spl) ~ fitted(cars.spl))
     abline(h = 0, col = "gray")

     ## consistency check:
     stopifnot(all.equal(cars$dist,
                         fitted(cars.spl) + residuals(cars.spl)))

     ##-- artificial example
     y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
     xx  <- seq(1, length(y18), len = 201)
     (s2  <- smooth.spline(y18))             # GCV
     (s02 <- smooth.spline(y18, spar = 0.2))
     plot(y18, main = deparse(s2$call), col.main = 2)
     lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
     lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)

     ## The following shows the problematic behavior of 'spar' searching:
     (s2  <- smooth.spline(y18, control =
                           list(trace = TRUE, tol = 1e-6, low = -1.5)))
     (s2m <- smooth.spline(y18, cv = TRUE, control =
                           list(trace = TRUE, tol = 1e-6, low = -1.5)))
     ## both above do quite similarly (Df = 8.5 +- 0.2)
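     ## A further sketch (not part of the original examples): the fitted
     ## spline and its first derivative can be evaluated on a grid with
     ## 'predict.smooth.spline'; the grid below is an arbitrary choice.
     xg <- seq(min(cars$speed), max(cars$speed), length.out = 101)
     p1 <- predict(cars.spl, x = xg, deriv = 1)   # list with components x and y
     plot(p1, type = "l",
          main = "first derivative of the 'cars.spl' fit")
     abline(h = 0, lty = 3)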