factor                  package:base                   R Documentation

Factors

Description:

     The function 'factor' is used to encode a vector as a factor (the
     terms 'category' and 'enumerated type' are also used for factors).
     If 'ordered' is 'TRUE', the factor levels are assumed to be
     ordered. For compatibility with S there is also a function
     'ordered'.

     'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the
     membership and coercion functions for these classes.

Usage:

     factor(x = character(), levels = sort(unique.default(x), na.last = TRUE),
            labels = levels, exclude = NA, ordered = is.ordered(x))

     ordered(x, ...)

     is.factor(x)
     is.ordered(x)

     as.factor(x)
     as.ordered(x)

     addNA(x, ifany = FALSE)

Arguments:

       x: a vector of data, usually taking a small number of distinct
          values.

  levels: an optional vector of the values that 'x' might have taken.
          The default is the set of values taken by 'x', sorted into
          increasing order.

  labels: _either_ an optional vector of labels for the levels (in the
          same order as 'levels' after removing those in 'exclude'),
          _or_ a character string of length 1.

 exclude: a vector of values to be excluded when forming the set of
          levels. This should be of the same type as 'x', and will be
          coerced if necessary.

 ordered: logical flag to determine if the levels should be regarded
          as ordered (in the order given).

     ...: (in 'ordered(.)'): any of the above, apart from 'ordered'
          itself.

   ifany: (in 'addNA'): Only add an 'NA' level if it is used, i.e. if
          'any(is.na(x))'.

Details:

     The type of the vector 'x' is not restricted.

     Ordered factors differ from factors only in their class, but
     methods and the model-fitting functions treat the two classes
     quite differently.

     The encoding of the vector happens as follows. First all the
     values in 'exclude' are removed from 'levels'. If 'x[i]' equals
     'levels[j]', then the 'i'-th element of the result is 'j'. If no
     match is found for 'x[i]' in 'levels', then the 'i'-th element of
     the result is set to 'NA'.
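
The encoding steps described above can be reproduced by hand. The following is a minimal sketch, not the internal implementation; the data and the variable names ('x', 'lev', 'codes', 'g', 'h', 'k') are illustrative assumptions.

```r
# Illustrative data; all variable names here are hypothetical.
x <- c("b", "a", "c", "a")

# Default call: the levels are the sorted unique values of x.
f <- factor(x)
levels(f)        # "a" "b" "c"
as.integer(f)    # 2 1 3 1

# The same codes computed by hand: the position of each x[i] in the levels.
lev <- sort(unique(x))
codes <- match(x, lev)
identical(codes, as.integer(f))

# Values listed in 'exclude' are removed from the levels first,
# so the matching elements of x are coded as NA.
g <- factor(x, exclude = "c")
levels(g)        # "a" "b"
as.integer(g)    # 2 1 NA 1

# Elements with no match in an explicit 'levels' are also coded as NA.
h <- factor(x, levels = c("a", "b"))
is.na(h)         # FALSE FALSE TRUE FALSE

# A single character string for 'labels' gets a sequence number appended.
k <- factor(x, labels = "grp")
levels(k)        # "grp1" "grp2" "grp3"
```

Using 'match' here mirrors the rule "if 'x[i]' equals 'levels[j]', the result is 'j'", which is why unmatched values come out as 'NA'.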

     Normally the 'levels' used as an attribute of the result are the
     reduced set of levels after removing those in 'exclude', but this
     can be altered by supplying 'labels'. This should either be a set
     of new labels for the levels, or a character string, in which
     case the levels are that character string with a sequence number
     appended.

     'factor(x, exclude=NULL)' applied to a factor is a no-operation
     unless there are unused levels: in that case, a factor with the
     reduced level set is returned. If 'exclude' is used it should
     also be a factor with the same level set as 'x' or a set of codes
     for the levels to be excluded.

     The codes of a factor may contain 'NA'. For a numeric 'x', set
     'exclude=NULL' to make 'NA' an extra level (prints as '<NA>'); by
     default, this is the last level.

     If 'NA' is a level, the way to set a code to be missing (as
     opposed to the code of the missing level) is to use 'is.na' on
     the left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';
     indexing inside 'is.na' does not work). Under those circumstances
     missing values are currently printed as '<NA>', i.e., identical
     to entries of level 'NA'.

     'is.factor' is generic: you can write methods to handle specific
     classes of objects, see InternalMethods.

Value:

     'factor' returns an object of class '"factor"' which has a set of
     integer codes the length of 'x' with a '"levels"' attribute of
     mode 'character'. If 'ordered' is true (or 'ordered' is used) the
     result has class 'c("ordered", "factor")'.

     Applying 'factor' to an ordered or unordered factor returns a
     factor (of the same type) with just the levels which occur: see
     also '[.factor' for a more transparent way to achieve this.

     'is.factor' returns 'TRUE' or 'FALSE' depending on whether its
     argument is of type factor or not. Correspondingly, 'is.ordered'
     returns 'TRUE' when its argument is ordered and 'FALSE'
     otherwise.

     'as.factor' coerces its argument to a factor. It is an
     abbreviated form of 'factor'.
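
The structure of the returned object can be inspected directly. The sketch below is illustrative only; the data and the variable names ('v', 'lv', 'f', 'o', 'part') are assumptions made for the example.

```r
# Illustrative data; all variable names here are hypothetical.
v  <- c("lo", "hi", "lo")
lv <- c("lo", "hi")

# An unordered factor: integer codes plus a character "levels" attribute.
f <- factor(v, levels = lv)
class(f)             # "factor"
typeof(f)            # "integer"
attr(f, "levels")    # "lo" "hi"
as.integer(f)        # 1 2 1

# ordered = TRUE (or the ordered() function) adds the "ordered" class.
o <- factor(v, levels = lv, ordered = TRUE)
class(o)             # "ordered" "factor"
identical(o, ordered(v, levels = lv))

# Subsetting keeps all levels; re-applying factor() drops unused ones.
part <- f[f == "lo"]
levels(part)         # "lo" "hi"
levels(factor(part)) # "lo"
```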

     'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'
     otherwise.

     'addNA' modifies a factor by turning 'NA' into an extra level (so
     that 'NA' values are counted in tables, for instance).

Warning:

     The interpretation of a factor depends on both the codes and the
     '"levels"' attribute. Be careful only to compare factors with the
     same set of levels (in the same order). In particular,
     'as.numeric' applied to a factor is meaningless, and may happen
     by implicit coercion. To transform a factor 'f' to its original
     numeric values, 'as.numeric(levels(f))[f]' is recommended and
     slightly more efficient than 'as.numeric(as.character(f))'.

     The levels of a factor are by default sorted, but the sort order
     may well depend on the locale at the time of creation, and should
     not be assumed to be ASCII.

     There are some anomalies associated with factors that have 'NA'
     as a level. It is suggested to use them sparingly, e.g., only for
     tabulation purposes.

Comparison operators and group generic methods:

     There are '"factor"' and '"ordered"' methods for the group
     generic 'Ops', which provide methods for the Comparison
     operators. (The rest of the group and the 'Math' and 'Summary'
     groups generate an error as they are not meaningful for factors.)

     Only '==' and '!=' can be used for factors: a factor can only be
     compared to another factor with an identical set of levels (not
     necessarily in the same ordering) or to a character vector.
     Ordered factors are compared in the same way, but the general
     dispatch mechanism precludes comparing ordered and unordered
     factors.

     All the comparison operators are available for ordered factors.
     Sorting is done by the levels of the operands: if both operands
     are ordered factors they must have the same level set.

Note:

     In earlier versions of R, storing character data as a factor was
     more space efficient if there was even a small proportion of
     repeats.
     Since R 2.6.0 identical character strings share storage, so the
     difference is now small in most cases. (Integer values are stored
     in 4 bytes whereas each reference to a character string needs a
     pointer of 4 or 8 bytes.)

References:

     Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in
     S_. Wadsworth & Brooks/Cole.

See Also:

     '[.factor' for subsetting of factors.

     'gl' for construction of balanced factors and 'C' for factors
     with specified contrasts. 'levels' and 'nlevels' for accessing
     the levels, and 'unclass' to get integer codes.

Examples:

     (ff <- factor(substring("statistics", 1:10, 1:10), levels=letters))
     as.integer(ff)  # the internal codes
     factor(ff)      # drops the levels that do not occur
     ff[, drop=TRUE] # the same, more transparently

     factor(letters[1:20], labels="letter")

     class(ordered(4:1)) # "ordered", inheriting from "factor"

     ## suppose you want "NA" as a level, and to allow missing values.
     (x <- factor(c(1, 2, NA), exclude = NULL))
     is.na(x)[2] <- TRUE
     x  # [1] 1    <NA> <NA>
     is.na(x)
     # [1] FALSE  TRUE FALSE

     ## Using addNA()
     Month <- airquality$Month
     table(addNA(Month))
     table(addNA(Month, ifany=TRUE))
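
As a further illustration of the Warning and of the comparison rules, the following sketch contrasts the integer codes with the original numeric values and shows comparisons on unordered and ordered factors. The data and the variable names ('f', 'u', 'sizes') are assumptions made for the example.

```r
# Illustrative data; all variable names here are hypothetical.
f <- factor(c(10, 2.5, 10, 7))
levels(f)                      # "2.5" "7" "10"

# as.numeric() on a factor gives the internal codes, not the values.
as.numeric(f)                  # 3 1 3 2

# Recovering the original numeric values, as recommended in the Warning.
as.numeric(levels(f))[f]       # 10 2.5 10 7
as.numeric(as.character(f))    # same values, via the labels

# Unordered factors support == and !=, including against character vectors.
u <- factor(c("a", "b", "a"))
u == "a"                       # TRUE FALSE TRUE

# Ordered factors support all comparison operators, using the level order.
sizes <- ordered(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"))
sizes < "large"                # TRUE FALSE TRUE
sizes[1] < sizes[2]            # TRUE
```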