cv.glm                  package:boot                  R Documentation

Cross-validation for Generalized Linear Models

Description:

     This function calculates the estimated K-fold cross-validation
     prediction error for generalized linear models.

Usage:

     cv.glm(data, glmfit, cost, K)

Arguments:

    data: A matrix or data frame containing the data.  The rows should
          be cases and the columns correspond to variables, one of
          which is the response.

  glmfit: An object of class '"glm"' containing the results of a
          generalized linear model fitted to 'data'.

    cost: A function of two vector arguments specifying the cost
          function for the cross-validation.  The first argument to
          'cost' should correspond to the observed responses and the
          second argument should correspond to the predicted or fitted
          responses from the generalized linear model.  'cost' must
          return a non-negative scalar value.  The default is the
          average squared error function.

       K: The number of groups into which the data should be split to
          estimate the cross-validation prediction error.  The value
          of 'K' must be such that all groups are of approximately
          equal size.  If the supplied value of 'K' does not satisfy
          this criterion then it will be set to the closest integer
          which does, and a warning is generated specifying the value
          of 'K' used.  The default is to set 'K' equal to the number
          of observations in 'data', which gives the usual
          leave-one-out cross-validation.

Details:

     The data is divided randomly into 'K' groups.  For each group the
     generalized linear model is fitted to 'data' omitting that group;
     the function 'cost' is then applied to the observed responses in
     the omitted group and the predictions made for those observations
     by the fitted model.  When 'K' is the number of observations,
     leave-one-out cross-validation is used and all possible splits of
     the data are used.
     When 'K' is less than the number of observations, the 'K' splits
     to be used are found by randomly partitioning the data into 'K'
     groups of approximately equal size.  In this case a certain
     amount of bias is introduced, which can be reduced by a simple
     adjustment (see equation 6.48 in Davison and Hinkley, 1997).  The
     second value returned in 'delta' is the estimate adjusted by this
     method.

Value:

     The returned value is a list with the following components.

    call: The original call to 'cv.glm'.

       K: The value of 'K' used for the K-fold cross-validation.

   delta: A vector of length two.  The first component is the raw
          cross-validation estimate of prediction error.  The second
          component is the adjusted cross-validation estimate.  The
          adjustment is designed to compensate for the bias introduced
          by not using leave-one-out cross-validation.

    seed: The value of '.Random.seed' when 'cv.glm' was called.

Side Effects:

     The value of '.Random.seed' is updated.

References:

     Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984)
     _Classification and Regression Trees_.  Wadsworth.

     Burman, P. (1989) A comparative study of ordinary
     cross-validation, v-fold cross-validation and repeated
     learning-testing methods.  _Biometrika_, *76*, 503-514.

     Davison, A.C. and Hinkley, D.V. (1997) _Bootstrap Methods and
     Their Application_.  Cambridge University Press.

     Efron, B. (1986) How biased is the apparent error rate of a
     prediction rule?  _Journal of the American Statistical
     Association_, *81*, 461-470.

     Stone, M. (1974) Cross-validation choice and assessment of
     statistical predictions (with Discussion).  _Journal of the Royal
     Statistical Society, B_, *36*, 111-147.

See Also:

     'glm', 'glm.diag', 'predict'

Examples:

     # leave-one-out and 6-fold cross-validation prediction error for
     # the mammals data set.
     data(mammals, package = "MASS")
     mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
     cv.err <- cv.glm(mammals, mammals.glm)
     cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)

     # As this is a linear model we could calculate the leave-one-out
     # cross-validation estimate without any extra model-fitting.
     muhat <- fitted(mammals.glm)
     mammals.diag <- glm.diag(mammals.glm)
     cv.err <- mean((mammals.glm$y - muhat)^2 / (1 - mammals.diag$h)^2)

     # leave-one-out and 11-fold cross-validation prediction error for
     # the nodal data set.  Since the response is a binary variable an
     # appropriate cost function is
     cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

     nodal.glm <- glm(r ~ stage + xray + acid, binomial, data = nodal)
     cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta
     cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta
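To make the procedure described in Details concrete, the raw (unadjusted) K-fold estimate can be sketched by hand.  This is a minimal illustration, not the boot implementation: the function name 'kfold_cv' and the 'folds' variable are invented for this sketch, it refits the model with the default gaussian family, and it omits the bias adjustment that fills the second component of 'delta'.

```r
## Sketch of the raw K-fold estimate described in Details
## (assumed names: kfold_cv, folds -- not part of boot).
kfold_cv <- function(data, formula, K = nrow(data),
                     cost = function(y, yhat) mean((y - yhat)^2)) {
  n <- nrow(data)
  ## divide the cases randomly into K groups of approximately
  ## equal size
  folds <- sample(rep(seq_len(K), length.out = n))
  errs <- numeric(K)
  for (k in seq_len(K)) {
    omit <- folds == k
    ## fit the model omitting group k
    fit <- glm(formula, data = data[!omit, , drop = FALSE])
    ## predict the omitted observations
    pred <- predict(fit, newdata = data[omit, , drop = FALSE],
                    type = "response")
    ## observed responses in the omitted group
    y <- model.response(model.frame(formula,
                                    data[omit, , drop = FALSE]))
    errs[k] <- cost(y, pred)
  }
  ## weight each group's cost by its share of the observations,
  ## giving the raw cross-validation estimate
  sum(errs * tabulate(folds, K)) / n
}

set.seed(1)
kfold_cv(mtcars, mpg ~ wt, K = 5)
```

With K = nrow(data) every group holds a single observation and the result is the leave-one-out estimate; smaller K trades extra model fits for the bias discussed above.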