grep                   package:base                    R Documentation

Pattern Matching and Replacement

Description:

     'grep' searches for matches to 'pattern' (its first argument)
     within the character vector 'x' (second argument).  'grepl' is an
     alternative way to return the results.  'regexpr' and 'gregexpr'
     do too, but return more detail in a different format.

     'sub' and 'gsub' perform replacement of matches determined by
     regular expression matching.

Usage:

     grep(pattern, x, ignore.case = FALSE, extended = TRUE,
          perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE,
          invert = FALSE)

     grepl(pattern, x, ignore.case = FALSE, extended = TRUE,
           perl = FALSE, fixed = FALSE, useBytes = FALSE)

     sub(pattern, replacement, x,
         ignore.case = FALSE, extended = TRUE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)

     gsub(pattern, replacement, x,
          ignore.case = FALSE, extended = TRUE, perl = FALSE,
          fixed = FALSE, useBytes = FALSE)

     regexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
             perl = FALSE, fixed = FALSE, useBytes = FALSE)

     gregexpr(pattern, text, ignore.case = FALSE, extended = TRUE,
              perl = FALSE, fixed = FALSE, useBytes = FALSE)

Arguments:

 pattern: character string containing a regular expression (or
          character string for 'fixed = TRUE') to be matched in the
          given character vector.  Coerced by 'as.character' to a
          character string if possible.

 x, text: a character vector where matches are sought, or an object
          which can be coerced by 'as.character' to a character
          vector.

ignore.case: if 'FALSE', the pattern matching is _case sensitive_ and
          if 'TRUE', case is ignored during matching.

extended: if 'TRUE', extended regular expression matching is used,
          and if 'FALSE' basic regular expressions are used.

    perl: logical.  Should perl-compatible regexps be used?  Has
          priority over 'extended'.

   value: if 'FALSE', a vector containing the ('integer') indices of
          the matches determined by 'grep' is returned, and if 'TRUE',
          a vector containing the matching elements themselves is
          returned.

   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
          is.  Overrides all conflicting arguments.

useBytes: logical.  If 'TRUE' the matching is done byte-by-byte
          rather than character-by-character.  See 'Details'.

  invert: logical.  If 'TRUE' return indices or values for elements
          that do _not_ match.

replacement: a replacement for matched pattern in 'sub' and 'gsub'.
          Coerced to character if possible.  For 'fixed = FALSE' this
          can include backreferences '"\1"' to '"\9"' to parenthesized
          subexpressions of 'pattern'.  For 'perl = TRUE' only, it can
          also contain '"\U"' or '"\L"' to convert the rest of the
          replacement to upper or lower case.

Details:

     Arguments which should be character strings or character vectors
     are coerced to character if possible.

     The two '*sub' functions differ only in that 'sub' replaces only
     the first occurrence of a 'pattern' whereas 'gsub' replaces all
     occurrences.

     For 'regexpr' it is an error for 'pattern' to be 'NA', otherwise
     'NA' is permitted and gives an 'NA' match.

     The regular expressions used are those specified by POSIX 1003.2,
     either extended or basic, depending on the value of the
     'extended' argument, unless 'perl = TRUE' when they are those of
     PCRE.  (The exact set of patterns supported may depend on the
     version of PCRE installed on the system in use if R was
     configured to use the system PCRE.)

     'useBytes' is only used if 'fixed = TRUE' or 'perl = TRUE'.  Its
     main effect is to avoid errors/warnings about invalid inputs and
     spurious matches, but for 'regexpr' it changes the interpretation
     of the output.

     PCRE only supports caseless matching for a non-ASCII pattern in a
     UTF-8 locale (and not for 'useBytes = TRUE' in any locale).

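     The difference between 'sub' and 'gsub', and the use of
     backreferences in 'replacement', can be sketched as follows.  The
     vector 'x' and the result names are illustrative only, and the
     calls deliberately leave out 'extended' so that they do not
     depend on that argument being available.

```r
# Illustrative input (not part of the original help page)
x <- c("banana", "cabbage", NA)

# 'sub' replaces only the first match in each element
first_only <- sub("a", "A", x)      # "bAnana" "cAbbage" NA

# 'gsub' replaces every match in each element
all_matches <- gsub("a", "A", x)    # "bAnAnA" "cAbbAge" NA

# Backreferences "\\1" and "\\2" refer to the parenthesized
# subexpressions of the pattern; here they are swapped
swapped <- sub("([[:alpha:]]+) ([[:alpha:]]+)", "\\2 \\1", "hello world")
# "world hello"

# With perl = TRUE, "\\U" upper-cases the rest of the replacement
shouted <- sub("(an+a)$", "\\U\\1", "banana", perl = TRUE)  # "banANA"
```

     An 'NA' element of 'x' stays 'NA' in the result rather than being
     treated as the string "NA".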
Value:

     For 'grep' a vector giving either the indices of the elements of
     'x' that yielded a match or, if 'value' is 'TRUE', the matched
     elements of 'x' (after coercion, preserving names but no other
     attributes).

     'grepl' differs only in that it returns a logical vector (match
     or no match for each element of 'x').

     For 'sub' and 'gsub' a character vector of the same length and
     with the same attributes as 'x' (after possible coercion).
     Elements of character vectors 'x' which are not substituted will
     be returned unchanged (including any declared encoding).  If
     'useBytes = FALSE', one of 'perl = TRUE' or 'fixed = TRUE' is
     set, and any element of 'pattern', 'replacement' and 'x' is
     declared to be in UTF-8, the result will be in UTF-8.  Otherwise
     changed elements of the result will have the encoding declared as
     that of the current locale (see 'Encoding') if the corresponding
     input had a declared encoding and the current locale is either
     Latin-1 or UTF-8.

     For 'regexpr' an integer vector of the same length as 'text'
     giving the starting position of the first match, or -1 if there
     is none, with attribute '"match.length"' giving the length of the
     matched text (or -1 for no match).  In a multi-byte locale these
     quantities are in characters rather than bytes unless 'useBytes =
     TRUE' is used with 'fixed = TRUE' or 'perl = TRUE'.

     For 'gregexpr' a list of the same length as 'text' each element
     of which is an integer vector as in 'regexpr', except that the
     starting positions of every (disjoint) match are given.

     If in a multi-byte locale the pattern or replacement is not a
     valid sequence of bytes, an error is thrown.  An invalid string
     in 'x' or 'text' is a non-match with a warning for 'grep' or
     'regexpr', but an error for 'sub' or 'gsub'.

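     The different return formats can be compared on one small input.
     This is a minimal sketch: the vector 'fruit' and the helper
     variables are hypothetical, and the calls leave out 'extended'.

```r
# Illustrative named input (not part of the original help page)
fruit <- c(a = "banana", b = "cherry", c = "mango")

idx  <- grep("an", fruit)                # indices of matching elements
vals <- grep("an", fruit, value = TRUE)  # matching elements, names kept
hit  <- grepl("an", fruit)               # one logical per element

# 'regexpr': start of the first match per element, -1 if none,
# with the match lengths in the "match.length" attribute
m     <- regexpr("an", fruit)
start <- as.integer(m)                        # 2 -1  2
len   <- as.integer(attr(m, "match.length"))  # 2 -1  2

# Use start and length to pull out the first match, NA where none
first <- ifelse(start > 0, substring(fruit, start, start + len - 1), NA)

# 'gregexpr': a list with the starts of every disjoint match
g <- gregexpr("an", fruit)
banana_starts <- as.integer(g[[1]])   # 2 4
cherry_starts <- as.integer(g[[2]])   # -1
```

     Here "banana" contains two disjoint matches of "an", which only
     'gregexpr' reports; 'regexpr' stops at the first.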
Warning:

     The standard regular-expression code has been reported to be very
     slow when applied to extremely long character strings (tens of
     thousands of characters or more): the code used when 'perl =
     TRUE' seems much faster and more reliable for such usages.

     The standard version of 'gsub' does not correctly substitute
     repeated word boundaries (e.g. 'pattern = "\b"').  Use 'perl =
     TRUE' for such matches.

     The 'perl = TRUE' option is only implemented for single-byte and
     UTF-8 encodings, and will warn if used in a non-UTF-8 multi-byte
     locale (unless 'useBytes = TRUE').

References:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_.  Wadsworth & Brooks/Cole ('grep')

See Also:

     regular expression (aka 'regexp') for the details of the pattern
     specification.

     'glob2rx' to turn wildcard matches into regular expressions.

     'agrep' for approximate matching.

     'tolower', 'toupper' and 'chartr' for character translations.
     'charmatch', 'pmatch', 'match'.  'apropos' uses regexps and has
     nice examples.

Examples:

     grep("[a-z]", letters)

     txt <- c("arm","foot","lefroo", "bafoobar")
     if(length(i <- grep("foo",txt)))
        cat("'foo' appears at least once in\n\t",txt,"\n")
     i # 2 and 4
     txt[i]

     ## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
     gsub("([ab])", "\\1_\\1_", "abc and ABC")

     txt <- c("The", "licenses", "for", "most", "software", "are",
       "designed", "to", "take", "away", "your", "freedom",
       "to", "share", "and", "change", "it.",
       "", "By", "contrast,", "the", "GNU", "General", "Public", "License",
       "is", "intended", "to", "guarantee", "your", "freedom", "to",
       "share", "and", "change", "free", "software", "--",
       "to", "make", "sure", "the", "software", "is",
       "free", "for", "all", "its", "users")
     ( i <- grep("[gu]", txt) ) # indices
     stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )

     ## Note that in locales such as en_US this includes B as the
     ## collation order is aAbBcCdDeE ...
     (ot <- sub("[b-e]",".", txt))
     txt[ot != gsub("[b-e]",".", txt)]#- gsub does "global" substitution

     txt[gsub("g","#", txt) !=
         gsub("g","#", txt, ignore.case = TRUE)] # the "G" words

     regexpr("en", txt)

     gregexpr("e", txt)

     ## trim trailing white space
     str <- 'Now is the time      '
     sub(' +$', '', str)  ## spaces only
     sub('[[:space:]]+$', '', str) ## white space, POSIX-style
     sub('\\s+$', '', str, perl = TRUE) ## Perl-style white space

     ## capitalizing
     gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "a test of capitalizing", perl=TRUE)
     gsub("\\b(\\w)", "\\U\\1", "a test of capitalizing", perl=TRUE)
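
     The 'fixed', 'ignore.case' and 'invert' arguments described under
     'Arguments' might be exercised as in the following sketch.  The
     vector 'files' and the result names are made up for illustration,
     and the calls leave out 'extended'.

```r
# Illustrative file names (hypothetical, not from the help page)
files <- c("a.txt", "b.csv", "notes.TXT", "atxt")

# As a regular expression, "." matches any character, so "atxt" matches too
as_regex <- grep(".txt", files, value = TRUE)                  # "a.txt" "atxt"

# With fixed = TRUE the pattern is matched as is, so "." is a literal dot
as_fixed <- grep(".txt", files, fixed = TRUE, value = TRUE)    # "a.txt"

# ignore.case = TRUE also accepts the upper-case extension
any_case <- grep("\\.txt$", files, ignore.case = TRUE, value = TRUE)
# "a.txt" "notes.TXT"

# invert = TRUE returns the elements that do not match
others <- grep("\\.txt$", files, ignore.case = TRUE, invert = TRUE,
               value = TRUE)                                   # "b.csv" "atxt"
```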