cor                   package:stats                   R Documentation

Correlation, Variance and Covariance (Matrices)

Description:

     'var', 'cov' and 'cor' compute the variance of 'x' and the
     covariance or correlation of 'x' and 'y' if these are vectors.  If
     'x' and 'y' are matrices then the covariances (or correlations)
     between the columns of 'x' and the columns of 'y' are computed.

     'cov2cor' scales a covariance matrix into the corresponding
     correlation matrix _efficiently_.

Usage:

     var(x, y = NULL, na.rm = FALSE, use)

     cov(x, y = NULL, use = "everything",
         method = c("pearson", "kendall", "spearman"))

     cor(x, y = NULL, use = "everything",
         method = c("pearson", "kendall", "spearman"))

     cov2cor(V)

Arguments:

       x: a numeric vector, matrix or data frame.

       y: 'NULL' (default) or a vector, matrix or data frame with
          compatible dimensions to 'x'.  The default is equivalent to
          'y = x' (but more efficient).

   na.rm: logical. Should missing values be removed?

     use: an optional character string giving a method for computing
          covariances in the presence of missing values.  This must be
          (an abbreviation of) one of the strings '"everything"',
          '"all.obs"', '"complete.obs"', '"na.or.complete"', or
          '"pairwise.complete.obs"'.

  method: a character string indicating which correlation coefficient
          (or covariance) is to be computed.  One of '"pearson"'
          (default), '"kendall"', or '"spearman"', can be abbreviated.

       V: symmetric numeric matrix, usually positive definite such as a
          covariance matrix.

Details:

     For 'cov' and 'cor' one must _either_ give a matrix or data frame
     for 'x' _or_ give both 'x' and 'y'.

     'var' is just another interface to 'cov', where 'na.rm' is used to
     determine the default for 'use' when that is unspecified.  If
     'na.rm' is 'TRUE' then the complete observations (rows) are used
     ('use = "na.or.complete"') to compute the variance.  Otherwise, by
     default 'use = "everything"'.
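
     The relationship between 'var' and 'cov' described above can be
     sketched as follows.  This is only an illustration: the vector 'x'
     and the names 'v1', 'v2' and 'v3' are made up for the example and
     do not appear elsewhere on this page.

```r
## Illustrative data with one missing value (hypothetical, not from the page)
x <- c(2, 4, NA, 8, 10)

## Without na.rm the default use = "everything" applies, so the NA propagates
var(x)                                    # NA

## With na.rm = TRUE, var() behaves like cov() with use = "na.or.complete"
v1 <- var(x, na.rm = TRUE)
v2 <- cov(x, x, use = "na.or.complete")

## Both agree with dropping the incomplete observation by hand
v3 <- var(x[!is.na(x)])

stopifnot(is.na(var(x)),
          all.equal(v1, v2),
          all.equal(v1, v3))
```

     Passing 'use' explicitly to 'var' overrides the default that
     'na.rm' would otherwise select.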

     If 'use' is '"everything"', 'NA's will propagate conceptually,
     i.e., a resulting value will be 'NA' whenever one of its
     contributing observations is 'NA'.  If 'use' is '"all.obs"', then
     the presence of missing observations will produce an error.  If
     'use' is '"complete.obs"' then missing values are handled by
     casewise deletion (and if there are no complete cases, that gives
     an error).  '"na.or.complete"' is the same, except that it gives
     'NA' when there are no complete cases.  Finally, if 'use' has the
     value '"pairwise.complete.obs"' then the correlation or covariance
     between each pair of variables is computed using all complete
     pairs of observations on those variables.  This can result in
     covariance or correlation matrices which are not positive
     semi-definite, as well as 'NA' entries if there are no complete
     pairs for that pair of variables.  For 'cov' and 'var',
     '"pairwise.complete.obs"' only works with the '"pearson"' method.
     Note that (the equivalent of) 'var(double(0), use=*)' gives 'NA'
     for 'use = "everything"' and '"na.or.complete"', and gives an
     error in the other cases.

     The denominator n - 1 is used, which gives an unbiased estimator
     of the (co)variance for i.i.d. observations.  These functions
     return 'NA' when there is only one observation (whereas S-PLUS has
     been returning 'NaN'), and fail if 'x' has length zero.

     For 'cor()', if 'method' is '"kendall"' or '"spearman"', Kendall's
     tau or Spearman's rho statistic is used to estimate a rank-based
     measure of association.  These are more robust and have been
     recommended if the data do not necessarily come from a bivariate
     normal distribution.
     For 'cov()', a non-Pearson method is unusual but available for the
     sake of completeness.  Note that '"spearman"' basically computes
     'cor(R(x), R(y))' (or 'cov(.,.)') where
     'R(u) := rank(u, na.last="keep")'.
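
     The identity for '"spearman"' stated above can be checked directly.
     The following is a small sketch on made-up data without missing
     values; 'x', 'y', 'rho_direct' and 'rho_ranks' are illustrative
     names, and 'R' is defined locally to mirror the notation in the
     text.

```r
## Hypothetical sample data (no ties, no missing values)
x <- c(3.1, 1.5, 4.8, 2.2, 9.0, 6.4)
y <- c(2.0, 1.1, 7.5, 3.3, 5.9, 8.2)

## R(u) as written in the Details text
R <- function(u) rank(u, na.last = "keep")

## Spearman's rho computed by cor() itself ...
rho_direct <- cor(x, y, method = "spearman")
## ... and as the Pearson correlation of the ranks
rho_ranks <- cor(R(x), R(y))

stopifnot(all.equal(rho_direct, rho_ranks),
          ## the same holds for the covariance
          all.equal(cov(x, y, method = "spearman"), cov(R(x), R(y))))
```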
     In the case of missing values, the ranks are calculated depending
     on the value of 'use', either based on complete observations, or
     based on pairwise completeness with reranking for each pair.

     Scaling a covariance matrix into a correlation one can be achieved
     in many ways, mathematically most appealing by multiplication with
     a diagonal matrix from left and right, or more efficiently by
     using 'sweep(.., FUN = "/")' twice.  The 'cov2cor' function is
     even a bit more efficient, and provided mostly for didactical
     reasons.

Value:

     For 'r <- cor(*, use = "all.obs")', it is now guaranteed that
     'all(r <= 1)'.

References:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_.  Wadsworth & Brooks/Cole.

See Also:

     'cor.test' for confidence intervals (and tests).

     'cov.wt' for _weighted_ covariance computation.

     'sd' for standard deviation (vectors).

Examples:

     var(1:10)      # 9.166667

     var(1:5, 1:5)  # 2.5

     ## Two simple vectors
     cor(1:10, 2:11)  # == 1

     ## Correlation Matrix of Multivariate sample:
     (Cl <- cor(longley))
     ## Graphical Correlation Matrix:
     symnum(Cl)  # highly correlated

     ## Spearman's rho and Kendall's tau
     symnum(clS <- cor(longley, method = "spearman"))
     symnum(clK <- cor(longley, method = "kendall"))
     ## How much do they differ?
     i <- lower.tri(Cl)
     cor(cbind(P = Cl[i], S = clS[i], K = clK[i]))

     ## cov2cor() scales a covariance matrix by its diagonal
     ##           to become the correlation matrix.
     cov2cor  # see the function definition {and learn ..}
     stopifnot(all.equal(Cl, cov2cor(cov(longley))),
               all.equal(cor(longley, method="kendall"),
                         cov2cor(cov(longley, method="kendall"))))

     ##--- Missing value treatment:
     C1 <- cov(swiss)
     range(eigen(C1, only.values=TRUE)$values)  # 6.19  1921
     swM <- swiss
     swM[1,2] <- swM[7,3] <- swM[25,5] <- NA  # create 3 "missing"
     try(cov(swM, use = "all.obs"))  # Error: missing obs...
     C2 <- cov(swM, use = "complete")
     range(eigen(C2, only.values=TRUE)$values)  # 6.46  1930
     C3 <- cov(swM, use = "pairwise")
     range(eigen(C3, only.values=TRUE)$values)  # 6.19  1938

     symnum(cor(swM, method = "kendall", use = "complete"))
     ## Kendall's tau doesn't change much:
     symnum(cor(swiss, method = "kendall"))
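
     The Details section names three ways of scaling a covariance
     matrix into a correlation matrix: multiplying by a diagonal matrix
     from the left and right, applying 'sweep(.., FUN = "/")' twice, and
     calling 'cov2cor'.  The sketch below compares them on the 'longley'
     data; the names 'V', 's', 'r_diag', 'r_sweep' and 'r_fun' are
     illustrative only, and no claim about relative timing is made here.

```r
V <- cov(longley)        # covariance matrix to be rescaled
s <- sqrt(diag(V))       # standard deviations of the columns

## (1) Multiply by diag(1/s) from the left and from the right
r_diag <- diag(1 / s) %*% V %*% diag(1 / s)
dimnames(r_diag) <- dimnames(V)   # make the row/column names explicit

## (2) Divide rows, then columns, by the standard deviations
r_sweep <- sweep(sweep(V, 1, s, FUN = "/"), 2, s, FUN = "/")

## (3) Use the dedicated function
r_fun <- cov2cor(V)

## All three agree with each other and with cor() up to numerical tolerance
stopifnot(all.equal(r_diag, r_fun),
          all.equal(r_sweep, r_fun),
          all.equal(r_fun, cor(longley)))
```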