boot                  package:boot                  R Documentation

_B_o_o_t_s_t_r_a_p _R_e_s_a_m_p_l_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Generate 'R' bootstrap replicates of a statistic applied to data. 
     Both parametric and nonparametric resampling are possible.  For
     the nonparametric bootstrap, possible resampling methods are the
     ordinary bootstrap, the  balanced bootstrap, antithetic
     resampling, and permutation. For nonparametric multi-sample
     problems stratified resampling is used.   This is specified by
     including a vector of strata in the call to boot. Importance
     resampling weights may be specified.

_U_s_a_g_e:

     boot(data, statistic, R, sim="ordinary", stype="i", 
          strata=rep(1,n), L=NULL, m=0, weights=NULL, 
          ran.gen=function(d, p) d, mle=NULL, simple=FALSE, ...)

_A_r_g_u_m_e_n_t_s:

    data: The data as a vector, matrix or data frame.  If it is a
          matrix or data frame then each row is considered as one
          multivariate observation. 

statistic: A function which when applied to data returns a vector
          containing the statistic(s) of interest.  When
          'sim="parametric"', the first argument to 'statistic' must be
          the data.  For each replicate a simulated dataset returned by
          'ran.gen' will be passed.  In all other cases 'statistic'
          must take at least two arguments.  The first argument passed
          will always be the original data. The second will be a vector
          of indices, frequencies or weights which define the bootstrap
          sample. Further, if predictions are required, then a third
          argument is required which would be a vector of the random
          indices used to generate the bootstrap predictions.  Any
          further arguments can be passed to 'statistic'  through the
          '...' argument. 

       R: The number of bootstrap replicates.  Usually this will be a
          single positive integer.  For importance resampling, some
          resamples may use one set of weights and others use a
          different set of weights. In this case 'R' would be a vector
          of integers where each component gives the number of
          resamples from each of the rows of weights. 

     sim: A character string indicating the type of simulation
          required.  Possible values are '"ordinary"' (the default),
          '"parametric"', '"balanced"', '"permutation"', or
          '"antithetic"'.  Importance resampling is specified by
          including importance weights; the type of importance
          resampling must still be specified but may only be
          '"ordinary"' or '"balanced"' in this case.   

   stype: A character string indicating what the second argument of
          statistic represents. Possible values of stype are '"i"'
          (indices - the default), '"f"' (frequencies), or '"w"'
          (weights). 
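
          As a rough illustration (a base-R sketch, not code taken from
          the package), one resample can be written in all three forms.
          The toy vector 'x' and the helpers 'mean_i', 'mean_f' and
          'mean_w' are assumptions made for this example; without
          strata the weights are taken to be the frequencies divided
          by 'n'.

```r
# One bootstrap resample expressed as indices, frequencies and weights.
set.seed(1)
x <- c(2.1, 3.4, 1.9, 5.0, 4.2)        # toy data (hypothetical)
n <- length(x)

i <- sample.int(n, n, replace = TRUE)  # stype = "i": resampled indices
f <- tabulate(i, nbins = n)            # stype = "f": count of each observation
w <- f / n                             # stype = "w": weights summing to 1

# The same statistic (the mean) written for each representation.
mean_i <- function(d, i) mean(d[i])
mean_f <- function(d, f) sum(d * f) / sum(f)
mean_w <- function(d, w) sum(d * w) / sum(w)

res <- c(i = mean_i(x, i), f = mean_f(x, f), w = mean_w(x, w))
```

          All three entries of 'res' agree, so the choice of 'stype'
          mainly affects how conveniently 'statistic' can be written.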

  strata: An integer vector or factor specifying the strata for
          multi-sample problems.   This may be specified for any
          simulation, but is ignored when  'sim' is '"parametric"'.
          When 'strata' is supplied for a nonparametric bootstrap, the
          simulations are done within the specified strata. 
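
          The idea of resampling within strata can be sketched in base
          R as follows.  This is an illustration only, not the
          package's internal code; the strata labels and the helper
          'resample_within' are hypothetical.

```r
# Resample indices within each stratum, keeping every stratum's size fixed.
set.seed(2)
strata <- factor(c("a", "a", "a", "b", "b", "c"))  # hypothetical labels
n <- length(strata)

resample_within <- function(strata) {
  idx_by_stratum <- split(seq_along(strata), strata)
  out <- integer(length(strata))
  for (idx in idx_by_stratum) {
    # sample.int() avoids the surprise of sample() on a length-1 vector
    out[idx] <- idx[sample.int(length(idx), replace = TRUE)]
  }
  out
}

i <- resample_within(strata)
```

          Every position in 'i' holds an index drawn from the same
          stratum as the observation it replaces.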

       L: Vector of influence values evaluated at the observations. 
          This is used only when 'sim' is '"antithetic"'.  If not
          supplied, they are calculated  through a call to 'empinf'. 
          This will use the infinitesimal jackknife provided that
          'stype' is '"w"', otherwise the usual jackknife is used. 

       m: The number of predictions which are to be made at each
          bootstrap replicate. This is most useful for (generalized)
          linear models.  This can only be used when 'sim' is
          '"ordinary"'.  'm' will usually be a single integer but, if
          there are strata, it may be a vector with length equal to the
          number of strata, specifying how many of the errors for
          prediction should come from each stratum.  The actual
          predictions should be  returned as the final part of the
          output of 'statistic', which should also take a vector of
          indices of the errors to be used for the predictions. 

 weights: Vector or matrix of importance weights. If a vector then it
          should have as many  elements as there are observations in
          'data'.  When simulation from more than  one set of weights
          is required, 'weights' should be a matrix where each row of 
          the matrix is one set of importance weights.  If 'weights' is
          a matrix then 'R'  must be a vector of length
          'nrow(weights)'.  This parameter is ignored if 'sim'  is not
          '"ordinary"' or '"balanced"'.  
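
          A minimal sketch of a call with two sets of importance
          weights is given below.  The data and both weight vectors
          are made up for illustration, and the example assumes the
          'boot' package is installed.

```r
library(boot)
set.seed(7)
x <- rnorm(20)                     # toy data (hypothetical)
n <- length(x)

# Two hypothetical weight sets: uniform, and one tilted toward large values.
w_unif <- rep(1, n)
w_tilt <- exp(0.5 * x)
W <- rbind(w_unif, w_tilt)         # one row per set of weights

# R is a vector: 60 resamples use the first row, 40 use the second.
b <- boot(x, function(d, i) mean(d[i]), R = c(60, 40), weights = W)
```

          The replicates in 'b$t' are stacked in the order of the rows
          of 'W', giving 'sum(R)' rows in total.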

 ran.gen: This function is used only when 'sim' is '"parametric"', in
          which case it describes how random values are to be generated.
          It should be a function of two arguments.  The first argument
          should be the observed data and the second argument consists
          of any other information needed (e.g. parameter estimates). 
          The second argument may be a list, allowing any number of
          items to be passed to 'ran.gen'.  The returned value should
          be a simulated data set of the same form as the observed data
          which will be passed to statistic to get a bootstrap
          replicate.  It is important that the returned value be of the
          same shape and type as the original dataset.  If 'ran.gen' is
          not specified, the default is a function which returns the
          original 'data', in which case all simulation should be
          included as part of 'statistic'.  Use of 'sim="parametric"'
          with a suitable 'ran.gen' allows the user to implement any
          type of nonparametric resampling which is not supported
          directly. 

     mle: The second argument to be passed to 'ran.gen'.  Typically
          these will be maximum likelihood estimates of the parameters.
           For efficiency 'mle' is often a list containing all of the
          objects needed by 'ran.gen' which can be calculated  using
          the original data set only. 

  simple: logical, only allowed to be 'TRUE' for 'sim="ordinary",
          stype="i", m=0' (otherwise ignored with a warning).  By
          default an 'n*R' index array is created: this can be large and
          if 'simple = TRUE' this is avoided by sampling separately for
          each replication, which is slower but uses less memory. 
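
          The trade-off can be pictured with a base-R sketch (an
          illustration of the two strategies, not the package's own
          code; the variable names are assumptions).

```r
set.seed(3)
n <- 10
R <- 4
x <- rnorm(n)                      # toy data (hypothetical)

# Strategy 1: build the whole index array up front (n * R integers).
idx <- matrix(sample.int(n, n * R, replace = TRUE), nrow = R)
t_all <- apply(idx, 1, function(i) mean(x[i]))

# Strategy 2: draw n indices per replicate, so only one index vector
# is held in memory at any time.
t_simple <- vapply(seq_len(R),
                   function(r) mean(x[sample.int(n, n, replace = TRUE)]),
                   numeric(1))
```

          Both strategies yield 'R' replicates; they differ only in
          peak memory use and speed.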

     ...: Any other arguments for 'statistic' which are passed
          unchanged each time it is called.  Any such arguments to
          'statistic' must follow the  arguments which 'statistic' is
          required to have for the simulation. 

_D_e_t_a_i_l_s:

     The statistic to be bootstrapped can be as simple or complicated
     as desired as long as its arguments correspond to the dataset and
     (for a nonparametric bootstrap) a vector of indices, frequencies
     or weights.  'statistic' is treated as a black box by the 'boot'
     function and is not checked to ensure that these conditions are
     met.  

     The first order balanced bootstrap is described in Davison,
     Hinkley and Schechtman (1986).  The antithetic bootstrap is
     described by Hall (1989) and is experimental, particularly when
     used with strata.  The other nonparametric simulation types are
     the ordinary bootstrap (possibly with unequal probabilities) and
     permutation, which returns random permutations of cases. All of
     these methods work independently within strata if that argument is
     supplied.
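
     The difference between balanced and permutation resampling can be
     sketched in base R.  This shows the idea only and is not
     necessarily the algorithm used inside 'boot'; the object names
     are assumptions made for the example.

```r
set.seed(4)
n <- 5
R <- 6

# First-order balance: permute R copies of 1..n, then cut the result
# into R resamples of size n.
balanced <- matrix(sample(rep(seq_len(n), R)), nrow = R, ncol = n)

# Permutation: each replicate is a random reordering of the cases.
permuted <- t(replicate(R, sample.int(n)))

# Over all balanced resamples, each observation is used exactly R times.
counts <- tabulate(as.vector(balanced), nbins = n)
```

     In 'balanced' an individual row may repeat observations, but the
     totals in 'counts' are all equal to 'R'; in 'permuted' every row
     contains each case exactly once.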

     For the parametric bootstrap it is necessary for the user to
     specify how the resampling is to be conducted.  The best way of
     accomplishing this is to  specify the function 'ran.gen' which
     will return a simulated data set from the observed data set and a
     set of parameter estimates specified in 'mle'.
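
     A minimal sketch of this arrangement for a normal model follows.
     The data, the generator 'gen' and the list 'fit' are hypothetical,
     and the example assumes the 'boot' package is installed.

```r
library(boot)
set.seed(5)
y <- rnorm(30, mean = 10, sd = 2)  # toy data (hypothetical)

# With sim = "parametric", 'statistic' receives only a simulated data set.
stat <- function(d) c(mean(d), sd(d))

# ran.gen: simulate data of the same shape from the fitted normal model.
gen <- function(d, mle) rnorm(length(d), mean = mle$mu, sd = mle$sigma)

# Everything ran.gen needs is computed once from the original data.
fit <- list(mu = mean(y), sigma = sd(y))

b <- boot(y, stat, R = 199, sim = "parametric", ran.gen = gen, mle = fit)
```

     Passing 'fit' as a list keeps the generator free of repeated
     estimation work inside each replicate.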

_V_a_l_u_e:

     The returned value is an object of class '"boot"', containing the
     following components:

      t0: The observed value of 'statistic' applied to 'data'.  

       t: A matrix with 'R' rows each of which is a bootstrap replicate
          of 'statistic'. 

       R: The value of 'R' as passed to 'boot'. 

    data: The 'data' as passed to 'boot'. 

    seed: The value of '.Random.seed' when 'boot' was called.   

statistic: The function 'statistic' as passed to 'boot'. 

     sim: Simulation type used. 

   stype: Statistic type as passed to 'boot'. 

    call: The original call to 'boot'. 

  strata: The strata used.  This is the vector passed to 'boot' if it
          was supplied, or a vector of ones if there were no strata.  It
          is not returned if 'sim' is '"parametric"'. 

 weights: The importance sampling weights as passed to 'boot' or the
          empirical  distribution function weights if no importance
          sampling weights were specified.  It is  omitted if 'sim' is
          not one of '"ordinary"' or '"balanced"'. 

  pred.i: If predictions are required ('m>0') this is the matrix of
          indices at which predictions were calculated as they were
          passed to statistic.  Omitted if 'm' is '0' or 'sim' is not
          '"ordinary"'. 

       L: The influence values used when 'sim' is '"antithetic"'.  If
          no such values were specified and 'stype' is not '"w"' then
          'L' is returned as consecutive integers  corresponding to the
          assumption that data is ordered by influence values. This
          component is omitted when 'sim' is not '"antithetic"'. 

 ran.gen: The random generator function used if 'sim' is
          '"parametric"'. This component  is omitted for any other
          value of 'sim'. 

     mle: The parameter estimates passed to 'boot' when 'sim' is
          '"parametric"'.  It is  omitted for all other values of
          'sim'. 
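
     As a short sketch of how these components might be used (the data
     are made up, and the example assumes the 'boot' package is
     installed):

```r
library(boot)
set.seed(6)
x <- rexp(25)                      # toy data (hypothetical)
b <- boot(x, function(d, i) median(d[i]), R = 99)

b$t0                               # observed statistic
dim(b$t)                           # R rows, one column per statistic
bias <- mean(b$t[, 1]) - b$t0      # bootstrap estimate of bias
se   <- sd(b$t[, 1])               # bootstrap standard error
names(b)                           # components actually present
```

     Components documented as omitted for a given 'sim', such as
     'ran.gen' and 'mle' here, are simply absent from the list.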

_R_e_f_e_r_e_n_c_e_s:

     There are many references explaining the bootstrap and its
     variations. Among them are:

     Booth, J.G., Hall, P. and Wood, A.T.A. (1993) Balanced importance
     resampling  for the bootstrap. _Annals of Statistics_, *21*,
     286-298.

     Davison, A.C. and Hinkley, D.V. (1997)  _Bootstrap Methods and
     Their Application_. Cambridge University Press.

     Davison, A.C., Hinkley, D.V. and Schechtman, E. (1986) Efficient
     bootstrap  simulation. _Biometrika_, *73*, 555-566.

     Efron, B. and Tibshirani, R. (1993) _An Introduction to the
     Bootstrap_. Chapman & Hall.

     Gleason, J.R. (1988) Algorithms for balanced bootstrap
     simulations. _American Statistician_, *42*, 263-266.

     Hall, P. (1989) Antithetic resampling for the bootstrap.
     _Biometrika_, *76*, 713-724.

     Hinkley, D.V. (1988) Bootstrap methods (with Discussion). 
     _Journal of the  Royal Statistical Society, B_, *50*, 312-337,
     355-370.

     Hinkley, D.V. and Shi, S. (1989) Importance sampling and the
     nested bootstrap. _Biometrika_, *76*, 435-446.

     Johns, M.V. (1988) Importance sampling for bootstrap confidence
     intervals. _Journal of the American Statistical Association_,
     *83*, 709-714.

     Noreen, E.W. (1989) _Computer Intensive Methods for Testing
     Hypotheses_.  John Wiley & Sons.

_S_e_e _A_l_s_o:

     'boot.array', 'boot.ci', 'censboot', 'empinf', 'jack.after.boot',
     'tilt.boot', 'tsboot'

_E_x_a_m_p_l_e_s:

     # usual bootstrap of the ratio of means using the city data
     ratio <- function(d, w)
          sum(d$x * w)/sum(d$u * w)
     boot(city, ratio, R=999, stype="w")

     # Stratified resampling for the difference of means.  In this
     # example we will look at the difference of means between the final
     # two series in the gravity data.
     diff.means <- function(d, f)
     {    n <- nrow(d)
          gp1 <- 1:table(as.numeric(d$series))[1]
          m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1])
          m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1])
          ss1 <- sum(d[gp1,1]^2 * f[gp1]) - 
                 (m1 *  m1 * sum(f[gp1]))
          ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - 
                 (m2 *  m2 * sum(f[-gp1]))
          c(m1-m2, (ss1+ss2)/(sum(f)-2))
     }
     grav1 <- gravity[as.numeric(gravity[,2])>=7,]
     boot(grav1, diff.means, R=999, stype="f", strata=grav1[,2])

     #  In this example we show the use of boot in a prediction from 
     #  regression based on the nuclear data.  This example is taken 
     #  from Example 6.8 of Davison and Hinkley (1997).  Notice also 
     #  that two extra arguments to statistic are passed through boot.
     nuke <- nuclear[,c(1,2,5,7,8,10,11)]
     nuke.lm <- glm(log(cost)~date+log(cap)+ne+ ct+log(cum.n)+pt, data=nuke)
     nuke.diag <- glm.diag(nuke.lm)
     nuke.res <- nuke.diag$res*nuke.diag$sd
     nuke.res <- nuke.res-mean(nuke.res)

     #  We set up a new data frame with the data, the standardized 
     #  residuals and the fitted values for use in the bootstrap.
     nuke.data <- data.frame(nuke,resid=nuke.res,fit=fitted(nuke.lm))

     #  Now we want a prediction of plant number 32 but at date 73.00
     new.data <- data.frame(cost=1, date=73.00, cap=886, ne=0,
                            ct=0, cum.n=11, pt=1)
     new.fit <- predict(nuke.lm, new.data)

     nuke.fun <- function(dat, inds, i.pred, fit.pred, x.pred)
     {
          #  'inds' is found through the formula's environment (this
          #  function call), so no copy in the global environment is needed.
          lm.b <- glm(fit+resid[inds] ~date+log(cap)+ne+ct+
               log(cum.n)+pt, data=dat)
          pred.b <- predict(lm.b,x.pred)
          c(coef(lm.b), pred.b-(fit.pred+dat$resid[i.pred]))
     }

     nuke.boot <- boot(nuke.data, nuke.fun, R=999, m=1, 
          fit.pred=new.fit, x.pred=new.data)
     #  The bootstrap prediction error would then be found by
     mean(nuke.boot$t[,8]^2)
     #  Basic bootstrap prediction limits would be
     new.fit-sort(nuke.boot$t[,8])[c(975,25)]


     #  Finally a parametric bootstrap.  For this example we shall look 
     #  at the air-conditioning data.  In this example our aim is to test 
     #  the hypothesis that the true value of the index is 1 (i.e. that 
     #  the data come from an exponential distribution) against the 
     #  alternative that the data come from a gamma distribution with
     #  index not equal to 1.
     air.fun <- function(data)
     {    ybar <- mean(data$hours)
          para <- c(log(ybar),mean(log(data$hours)))
          ll <- function(k) {
               if (k <= 0) out <- 1e200 # not NA
               else out <- lgamma(k)-k*(log(k)-1-para[1]+para[2])
               out
          }
          khat <- nlm(ll,ybar^2/var(data$hours))$estimate
          c(ybar, khat)
     }

     air.rg <- function(data, mle)
     #  Function to generate random exponential variates.  mle will contain 
     #  the mean of the original data
     {    out <- data
          out$hours <- rexp(nrow(out), 1/mle)
          out
     }

     air.boot <- boot(aircondit, air.fun, R=999, sim="parametric",
          ran.gen=air.rg, mle=mean(aircondit$hours))

     # The bootstrap p-value can then be approximated by
     sum(abs(air.boot$t[,2]-1) > abs(air.boot$t0[2]-1))/(1+air.boot$R)