density package:stats R Documentation _K_e_r_n_e_l _D_e_n_s_i_t_y _E_s_t_i_m_a_t_i_o_n _D_e_s_c_r_i_p_t_i_o_n: The (S3) generic function 'density' computes kernel density estimates. Its default method does so with the given kernel and bandwidth for univariate observations. _U_s_a_g_e: density(x, ...) ## Default S3 method: density(x, bw = "nrd0", adjust = 1, kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "cosine", "optcosine"), weights = NULL, window = kernel, width, give.Rkern = FALSE, n = 512, from, to, cut = 3, na.rm = FALSE, ...) _A_r_g_u_m_e_n_t_s: x: the data from which the estimate is to be computed. bw: the smoothing bandwidth to be used. The kernels are scaled such that this is the standard deviation of the smoothing kernel. (Note this differs from the reference books cited below, and from S-PLUS.) 'bw' can also be a character string giving a rule to choose the bandwidth. See 'bw.nrd'. The specified (or computed) value of 'bw' is multiplied by 'adjust'. adjust: the bandwidth used is actually 'adjust*bw'. This makes it easy to specify values like 'half the default' bandwidth. kernel, window: a character string giving the smoothing kernel to be used. This must be one of '"gaussian"', '"rectangular"', '"triangular"', '"epanechnikov"', '"biweight"', '"cosine"' or '"optcosine"', with default '"gaussian"', and may be abbreviated to a unique prefix (single letter). '"cosine"' is smoother than '"optcosine"', which is the usual 'cosine' kernel in the literature and almost MSE-efficient. However, '"cosine"' is the version used by S. weights: numeric vector of non-negative observation weights, hence of same length as 'x'. The default 'NULL' is equivalent to 'weights = rep(1/nx, nx)' where 'nx' is the length of (the finite entries of) 'x[]'. width: this exists for compatibility with S; if given, and 'bw' is not, will set 'bw' to 'width' if this is a character string, or to a kernel-dependent multiple of 'width' if this is numeric. give.Rkern: logical; if true, _no_ density is estimated, and the 'canonical bandwidth' of the chosen 'kernel' is returned instead. n: the number of equally spaced points at which the density is to be estimated. When 'n > 512', it is rounded up to the next power of 2 for efficiency reasons ('fft'). from,to: the left and right-most points of the grid at which the density is to be estimated; the defaults are 'cut * bw' outside of 'range(x)'. cut: by default, the values of 'from' and 'to' are 'cut' bandwidths beyond the extremes of the data. This allows the estimated density to drop to approximately zero at the extremes. na.rm: logical; if 'TRUE', missing values are removed from 'x'. If 'FALSE' any missing values cause an error. ...: further arguments for (non-default) methods. _D_e_t_a_i_l_s: The algorithm used in 'density.default' disperses the mass of the empirical distribution function over a regular grid of at least 512 points and then uses the fast Fourier transform to convolve this approximation with a discretized version of the kernel and then uses linear approximation to evaluate the density at the specified points. The statistical properties of a kernel are determined by sig^2 (K) = int(t^2 K(t) dt) which is always = 1 for our kernels (and hence the bandwidth 'bw' is the standard deviation of the kernel) and R(K) = int(K^2(t) dt). MSE-equivalent bandwidths (for different kernels) are proportional to sig(K) R(K) which is scale invariant and for our kernels equal to R(K). This value is returned when 'give.Rkern = TRUE'. See the examples for using exact equivalent bandwidths. Infinite values in 'x' are assumed to correspond to a point mass at '+/-Inf' and the density estimate is of the sub-density on '(-Inf, +Inf)'. _V_a_l_u_e: If 'give.Rkern' is true, the number R(K), otherwise an object with class '"density"' whose underlying structure is a list containing the following components. x: the 'n' coordinates of the points where the density is estimated. y: the estimated density values. These will be non-negative, but can be zero. bw: the bandwidth used. n: the sample size after elimination of missing values. call: the call which produced the result. data.name: the deparsed name of the 'x' argument. has.na: logical, for compatibility (always 'FALSE'). The 'print' method reports 'summary' values on the 'x' and 'y' components. _R_e_f_e_r_e_n_c_e_s: Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_. Wadsworth & Brooks/Cole (for S version). Scott, D. W. (1992) _Multivariate Density Estimation. Theory, Practice and Visualization_. New York: Wiley. Sheather, S. J. and Jones M. C. (1991) A reliable data-based bandwidth selection method for kernel density estimation. _J. Roy. Statist. Soc._ *B*, 683-690. Silverman, B. W. (1986) _Density Estimation_. London: Chapman and Hall. Venables, W. N. and Ripley, B. D. (2002) _Modern Applied Statistics with S_. New York: Springer. _S_e_e _A_l_s_o: 'bw.nrd', 'plot.density', 'hist'. _E_x_a_m_p_l_e_s: require(graphics) plot(density(c(-20,rep(0,98),20)), xlim = c(-4,4))# IQR = 0 # The Old Faithful geyser data d <- density(faithful$eruptions, bw = "sj") d plot(d) plot(d, type = "n") polygon(d, col = "wheat") ## Missing values: x <- xx <- faithful$eruptions x[i.out <- sample(length(x), 10)] <- NA doR <- density(x, bw = 0.15, na.rm = TRUE) lines(doR, col = "blue") points(xx[i.out], rep(0.01, 10)) ## Weighted observations: fe <- sort(faithful$eruptions) # has quite a few non-unique values ## use 'counts / n' as weights: dw <- density(unique(fe), weights = table(fe)/length(fe), bw = d$bw) utils::str(dw) ## smaller n: only 126, but identical estimate: stopifnot(all.equal(d[1:3], dw[1:3])) ## simulation from a density() fit: # a kernel density fit is an equally-weighted mixture. fit <- density(xx) N <- 1e6 x.new <- rnorm(N, sample(xx, size = N, replace = TRUE), fit$bw) plot(fit) lines(density(x.new), col="blue") (kernels <- eval(formals(density.default)$kernel)) ## show the kernels in the R parametrization plot (density(0, bw = 1), xlab = "", main="R's density() kernels with bw = 1") for(i in 2:length(kernels)) lines(density(0, bw = 1, kernel = kernels[i]), col = i) legend(1.5,.4, legend = kernels, col = seq(kernels), lty = 1, cex = .8, y.intersp = 1) ## show the kernels in the S parametrization plot(density(0, from=-1.2, to=1.2, width=2, kernel="gaussian"), type="l", ylim = c(0, 1), xlab="", main="R's density() kernels with width = 1") for(i in 2:length(kernels)) lines(density(0, width = 2, kernel = kernels[i]), col = i) legend(0.6, 1.0, legend = kernels, col = seq(kernels), lty = 1) ##-------- Semi-advanced theoretic from here on ------------- (RKs <- cbind(sapply(kernels, function(k) density(kernel = k, give.Rkern = TRUE)))) 100*round(RKs["epanechnikov",]/RKs, 4) ## Efficiencies bw <- bw.SJ(precip) ## sensible automatic choice plot(density(precip, bw = bw), main = "same sd bandwidths, 7 different kernels") for(i in 2:length(kernels)) lines(density(precip, bw = bw, kernel = kernels[i]), col = i) ## Bandwidth Adjustment for "Exactly Equivalent Kernels" h.f <- sapply(kernels, function(k)density(kernel = k, give.Rkern = TRUE)) (h.f <- (h.f["gaussian"] / h.f)^ .2) ## -> 1, 1.01, .995, 1.007,... close to 1 => adjustment barely visible.. plot(density(precip, bw = bw), main = "equivalent bandwidths, 7 different kernels") for(i in 2:length(kernels)) lines(density(precip, bw = bw, adjust = h.f[i], kernel = kernels[i]), col = i) legend(55, 0.035, legend = kernels, col = seq(kernels), lty = 1)