agnes                 package:cluster                 R Documentation

Agglomerative Nesting (Hierarchical Clustering)

Description:

     Computes agglomerative hierarchical clustering of the dataset.

Usage:

     agnes(x, diss = inherits(x, "dist"), metric = "euclidean",
           stand = FALSE, method = "average", par.method,
           keep.diss = n < 100, keep.data = !diss)

Arguments:

       x: data matrix or data frame, or dissimilarity matrix, depending
          on the value of the 'diss' argument.

          In case of a matrix or data frame, each row corresponds to an
          observation, and each column corresponds to a variable. All
          variables must be numeric. Missing values (NAs) are allowed.

          In case of a dissimilarity matrix, 'x' is typically the
          output of 'daisy' or 'dist'. Also a vector with length
          n*(n-1)/2 is allowed (where n is the number of observations),
          and will be interpreted in the same way as the output of the
          above-mentioned functions. Missing values (NAs) are not
          allowed.

    diss: logical flag: if TRUE (default for 'dist' or 'dissimilarity'
          objects), then 'x' is assumed to be a dissimilarity matrix.
          If FALSE, then 'x' is treated as a matrix of observations by
          variables.

  metric: character string specifying the metric to be used for
          calculating dissimilarities between observations. The
          currently available options are "euclidean" and "manhattan".
          Euclidean distances are root sum-of-squares of differences,
          and manhattan distances are the sum of absolute differences.
          If 'x' is already a dissimilarity matrix, then this argument
          will be ignored.

   stand: logical flag: if TRUE, then the measurements in 'x' are
          standardized before calculating the dissimilarities.
          Measurements are standardized for each variable (column), by
          subtracting the variable's mean value and dividing by the
          variable's mean absolute deviation. If 'x' is already a
          dissimilarity matrix, then this argument will be ignored.

  method: character string defining the clustering method.
          The six methods implemented are "average" ([unweighted
          pair-]group average method, UPGMA), "single" (single
          linkage), "complete" (complete linkage), "ward" (Ward's
          method), "weighted" (weighted average linkage) and its
          generalization "flexible", which uses (a constant version of)
          the Lance-Williams formula and the 'par.method' argument.
          Default is "average".

par.method: if 'method == "flexible"', numeric vector of length 1, 3,
          or 4; see the Details section.

keep.diss, keep.data: logicals indicating if the dissimilarities and/or
          input data 'x' should be kept in the result. Setting these to
          'FALSE' can give much smaller results and hence even save
          memory allocation _time_.

Details:

     'agnes' is fully described in chapter 5 of Kaufman and Rousseeuw
     (1990). Compared to other agglomerative clustering methods such as
     'hclust', 'agnes' has the following features: (a) it yields the
     agglomerative coefficient (see 'agnes.object') which measures the
     amount of clustering structure found; and (b) apart from the usual
     tree it also provides the banner, a novel graphical display (see
     'plot.agnes').

     The 'agnes'-algorithm constructs a hierarchy of clusterings. At
     first, each observation is a small cluster by itself. Clusters are
     merged until only one large cluster remains which contains all the
     observations. At each stage the two _nearest_ clusters are
     combined to form one larger cluster.

     For 'method="average"', the distance between two clusters is the
     average of the dissimilarities between the points in one cluster
     and the points in the other cluster.
     In 'method="single"', we use the smallest dissimilarity between a
     point in the first cluster and a point in the second cluster
     (nearest neighbor method).
     When 'method="complete"', we use the largest dissimilarity between
     a point in the first cluster and a point in the second cluster
     (furthest neighbor method).
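
     The three linkage rules just described can be sketched in base R
     as follows. This is only an illustration of the definitions, not
     the code 'agnes' uses internally; the helper 'cluster_dist' and
     its arguments are hypothetical names introduced here.

```r
## Hypothetical helper (not part of package 'cluster'): distance between
## two clusters, given a 'dist' object and the member indices of each.
cluster_dist <- function(d, a, b,
                         method = c("average", "single", "complete")) {
  method <- match.arg(method)
  ## all dissimilarities between a point in 'a' and a point in 'b'
  between <- as.matrix(d)[a, b, drop = FALSE]
  switch(method,
         average  = mean(between),
         single   = min(between),
         complete = max(between))
}

## Four points on a line; compare clusters {1, 2} and {3, 4}.
## The cross-cluster dissimilarities are 4, 6, 3 and 5.
x <- c(0, 1, 4, 6)
d <- dist(x)
avg <- cluster_dist(d, 1:2, 3:4, "average")   # 4.5
sgl <- cluster_dist(d, 1:2, 3:4, "single")    # 3
cpl <- cluster_dist(d, 1:2, 3:4, "complete")  # 6
```

     At each stage the pair of clusters with the smallest such value
     would be the one merged.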

     The 'method = "flexible"' allows (and requires) more details: The
     Lance-Williams formula specifies how dissimilarities are computed
     when clusters are agglomerated (equation (32) in K. & R., p.237).
     If clusters C_1 and C_2 are agglomerated into a new cluster, the
     dissimilarity between their union and another cluster Q is given
     by

          D(C_1 ∪ C_2, Q) = alpha_1 * D(C_1, Q) + alpha_2 * D(C_2, Q)
                            + beta * D(C_1, C_2)
                            + gamma * |D(C_1, Q) - D(C_2, Q)|,

     where the four coefficients (alpha_1, alpha_2, beta, gamma) are
     specified by the vector 'par.method':

     If 'par.method' is of length 1, say = alpha, 'par.method' is
     extended to give the "Flexible Strategy" (K. & R., p.236 f) with
     Lance-Williams coefficients (alpha_1 = alpha_2 = alpha, beta = 1 -
     2*alpha, gamma = 0).
     If of length 3, gamma = 0 is used.

     *Care* and expertise are probably needed when using 'method =
     "flexible"', particularly when 'par.method' is specified with
     length greater than one.

     The _weighted average_ ('method="weighted"') is the same as
     'method="flexible", par.method = 0.5'.

Value:

     an object of class '"agnes"' representing the clustering. See
     'agnes.object' for details.

BACKGROUND:

     Cluster analysis divides a dataset into groups (clusters) of
     observations that are similar to each other.

     _Hierarchical methods_ like 'agnes', 'diana', and 'mona' construct
          a hierarchy of clusterings, with the number of clusters
          ranging from one to the number of observations.

     _Partitioning methods_ like 'pam', 'clara', and 'fanny' require
          that the number of clusters be given by the user.

References:

     Kaufman, L. and Rousseeuw, P.J. (1990). _Finding Groups in Data:
     An Introduction to Cluster Analysis_. Wiley, New York.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1996). Clustering in
     an Object-Oriented Environment. _Journal of Statistical Software_,
     *1*.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997).
     Integrating Robust Clustering Techniques in S-PLUS,
     _Computational Statistics and Data Analysis_, *26*, 17-37.

     Lance, G.N. and Williams, W.T. (1966). A General Theory of
     Classificatory Sorting Strategies, I. Hierarchical Systems.
     _Computer J._ *9*, 373-380.

See Also:

     'agnes.object', 'daisy', 'diana', 'dist', 'hclust', 'plot.agnes',
     'twins.object'.

Examples:

     data(votes.repub)
     agn1 <- agnes(votes.repub, metric = "manhattan", stand = TRUE)
     agn1
     plot(agn1)

     op <- par(mfrow=c(2,2))
     agn2 <- agnes(daisy(votes.repub), diss = TRUE, method = "complete")
     plot(agn2)
     agnS <- agnes(votes.repub, method = "flexible", par.method = 0.6)
     plot(agnS)
     par(op)

     data(agriculture)
     ## Plot similar to Figure 7 in ref
     ## Not run:
     plot(agnes(agriculture), ask = TRUE)
     ## End(Not run)
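
     As a further sketch, the Lance-Williams update from the Details
     section can be written out directly in base R. The helpers
     'lw_coef' and 'lw_update' are hypothetical names used only for
     illustration; they follow the expansion rules for 'par.method'
     stated above and are not the internal implementation of 'agnes'.

```r
## Hypothetical helper: expand 'par.method' to the four coefficients
## (alpha_1, alpha_2, beta, gamma) as described in the Details section.
lw_coef <- function(par.method) {
  p <- par.method
  switch(as.character(length(p)),
         "1" = c(alpha1 = p, alpha2 = p, beta = 1 - 2 * p, gamma = 0),
         "3" = c(alpha1 = p[1], alpha2 = p[2], beta = p[3], gamma = 0),
         "4" = c(alpha1 = p[1], alpha2 = p[2], beta = p[3], gamma = p[4]),
         stop("'par.method' must have length 1, 3, or 4"))
}

## Hypothetical helper: dissimilarity between the union of C_1 and C_2
## and another cluster Q, given D(C_1, Q), D(C_2, Q) and D(C_1, C_2).
lw_update <- function(d1q, d2q, d12, coef) {
  unname(coef["alpha1"] * d1q + coef["alpha2"] * d2q +
         coef["beta"] * d12 + coef["gamma"] * abs(d1q - d2q))
}

## Example values: D(C_1, Q) = 3, D(C_2, Q) = 5, D(C_1, C_2) = 2
wtd  <- lw_update(3, 5, 2, lw_coef(0.5))                    # 4
flex <- lw_update(3, 5, 2, lw_coef(0.6))                    # about 4.4
sgl  <- lw_update(3, 5, 2, lw_coef(c(0.5, 0.5, 0, -0.5)))   # 3 = min(3, 5)
cpl  <- lw_update(3, 5, 2, lw_coef(c(0.5, 0.5, 0,  0.5)))   # 5 = max(3, 5)
```

     With 'par.method = 0.5' the update reduces to the plain mean of
     the two dissimilarities, which is the weighted average rule. The
     length-4 vectors shown are the standard Lance-Williams
     coefficients for single and complete linkage; whether passing them
     to 'agnes' reproduces 'method = "single"' or 'method = "complete"'
     exactly has not been checked here.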