regex                   package:base                   R Documentation

Regular Expressions as used in R

Description:

     This help page documents the regular expression patterns supported
     by 'grep' and related functions 'regexpr', 'gregexpr', 'sub' and
     'gsub', as well as by 'strsplit'.

Details:

     A 'regular expression' is a pattern that describes a set of
     strings. Three types of regular expressions are used in R,
     _extended_ regular expressions, used by 'grep(extended = TRUE)'
     (its default), _basic_ regular expressions, as used by
     'grep(extended = FALSE)', and _Perl-like_ regular expressions used
     by 'grep(perl = TRUE)'.

     Other functions which use regular expressions (often via the use
     of 'grep') include 'apropos', 'browseEnv', 'help.search',
     'list.files', 'ls' and 'strsplit'. These will all use _extended_
     regular expressions, unless 'strsplit' is called with argument
     'extended = FALSE' or 'perl = TRUE'.

     Patterns are described here as they would be printed by 'cat': _do
     remember that backslashes need to be doubled when entering R
     character strings_, e.g. from the keyboard.

Extended Regular Expressions:

     This section covers the regular expressions allowed if 'extended =
     TRUE' in 'grep', 'regexpr', 'gregexpr', 'sub', 'gsub' and
     'strsplit'. They use the 'glibc' 2.7 implementation of the POSIX
     1003.2 standard.

     Regular expressions are constructed analogously to arithmetic
     expressions, by using various operators to combine smaller
     expressions.

     The fundamental building blocks are the regular expressions that
     match a single character. Most characters, including all letters
     and digits, are regular expressions that match themselves. Any
     metacharacter with special meaning may be quoted by preceding it
     with a backslash. (Escaping other characters with a backslash is
     undefined in POSIX but gives the character in the R
     implementation.) The metacharacters in EREs are '.
     \ | ( ) [ { ^ $ * + ?', but note that whether these have a special
     meaning depends on the context.

     A _character class_ is a list of characters enclosed between '['
     and ']' which matches any single character in that list; unless
     the first character of the list is the caret '^', when it matches
     any character _not_ in the list. For example, the regular
     expression '[0123456789]' matches any single digit, and '[^abc]'
     matches anything except the characters 'a', 'b' or 'c'. A range of
     characters may be specified by giving the first and last
     characters, separated by a hyphen. (Because their interpretation
     is so locale-dependent, ranges are best avoided.) The precise way
     character ranges are interpreted depends on the values of 'perl'
     and 'ignore.case'. For basic and extended regular expressions the
     collation order is taken from the OS's implementation of the
     setting of the locale category 'LC_COLLATE', so '[W-Z]' may
     include 'x' and if it does may or may not include 'w'. (In most
     English locales the collation order is 'wWxXyYzZ'.) For caseless
     matching the characters in a range are interpreted as if in lower
     case, so in an English locale '[W-z]' matches 'WXYZwxyz'. The only
     portable way to specify all ASCII letters is to list them all as a
     character class,
     '[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]'. For Perl
     regexps, the ranges are interpreted in the numerical order of the
     characters, either as bytes in a single-byte locale or as Unicode
     points in a UTF-8 locale. So in either case '[A-Za-z]' specifies
     the set of ASCII letters.

     Certain named classes of characters are predefined. Their
     interpretation depends on the _locale_ (see locales); the
     interpretation below is that of the POSIX locale.

     '[:alnum:]' Alphanumeric characters: '[:alpha:]' and '[:digit:]'.

     '[:alpha:]' Alphabetic characters: '[:lower:]' and '[:upper:]'.

     '[:blank:]' Blank characters: space and tab.

     '[:cntrl:]' Control characters.
          In ASCII, these characters have octal codes 000 through 037,
          and 177 ('DEL'). In another character set, these are the
          equivalent characters, if any.

     '[:digit:]' Digits: '0 1 2 3 4 5 6 7 8 9'.

     '[:graph:]' Graphical characters: '[:alnum:]' and '[:punct:]'.

     '[:lower:]' Lower-case letters in the current locale.

     '[:print:]' Printable characters: '[:alnum:]', '[:punct:]' and
          space.

     '[:punct:]' Punctuation characters: '! " # $ % & ' ( ) * + , - . /
          : ; < = > ? @ [ \ ] ^ _ ` { | } ~'.

     '[:space:]' Space characters: tab, newline, vertical tab, form
          feed, carriage return, and space.

     '[:upper:]' Upper-case letters in the current locale.

     '[:xdigit:]' Hexadecimal digits: '0 1 2 3 4 5 6 7 8 9 A B C D E F
          a b c d e f'.

     For example, '[[:alnum:]]' means '[0-9A-Za-z]', except the latter
     depends upon the locale and the character encoding, whereas the
     former is independent of locale and character set. (Note that the
     brackets in these class names are part of the symbolic names, and
     must be included in addition to the brackets delimiting the
     bracket list.) Most metacharacters lose their special meaning
     inside lists. To include a literal ']', place it first in the
     list. Similarly, to include a literal '^', place it anywhere but
     first. Finally, to include a literal '-', place it first or last
     (or, for 'perl = TRUE' only, precede it by a backslash). (Only
     these and '\' remain special inside character classes.)

     The period '.' matches any single character. The symbol '\w' is
     documented to be a synonym for '[[:alnum:]]' and '\W' is its
     negation. However, '\w' also matches underscore in the GNU grep
     code used in R.

     The caret '^' and the dollar sign '$' are metacharacters that
     respectively match the empty string at the beginning and end of a
     line. The symbols '\<' and '\>' respectively match the empty
     string at the beginning and end of a word.
     The symbol '\b' matches the empty string at either edge of a word,
     and '\B' matches the empty string provided it is not at an edge of
     a word.

     A regular expression may be followed by one of several repetition
     quantifiers:

     '?' The preceding item is optional and will be matched at most
          once.

     '*' The preceding item will be matched zero or more times.

     '+' The preceding item will be matched one or more times.

     '{n}' The preceding item is matched exactly 'n' times.

     '{n,}' The preceding item is matched 'n' or more times.

     '{n,m}' The preceding item is matched at least 'n' times, but not
          more than 'm' times.

     Repetition is greedy, so the maximal possible number of repeats is
     used.

     Two regular expressions may be concatenated; the resulting regular
     expression matches any string formed by concatenating two
     substrings that respectively match the concatenated
     subexpressions.

     Two regular expressions may be joined by the infix operator '|';
     the resulting regular expression matches any string matching
     either subexpression. For example, 'abba|cde' matches either the
     string 'abba' or the string 'cde'. Note that alternation does not
     work inside character classes, where '|' has its literal meaning.

     Repetition takes precedence over concatenation, which in turn
     takes precedence over alternation. A whole subexpression may be
     enclosed in parentheses to override these precedence rules.

     The backreference '\N', where 'N' is a single digit, matches the
     substring previously matched by the Nth parenthesized
     subexpression of the regular expression.

Basic Regular Expressions:

     This section covers the regular expressions allowed if 'extended =
     FALSE' in 'grep', 'regexpr', 'gregexpr', 'sub', 'gsub' and
     'strsplit'.

     In basic regular expressions the metacharacters '?', '+', '{',
     '|', '(', and ')' lose their special meaning; instead use the
     backslashed versions '\?', '\+', '\{', '\|', '\(', and '\)'. Thus
     the metacharacters are '. \ [ ^ $ *'.
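
     The extended-syntax constructs described above (alternation,
     character classes, interval quantifiers, quoted metacharacters
     and backreferences) can be tried out with a short session. This
     is only a minimal sketch: it assumes an R installation in which
     the default (non-Perl) matching of 'grepl' and 'sub' accepts
     extended regular expressions, and the sample strings and variable
     names are invented for illustration.

```r
# Illustrative input; these strings are invented for the example.
x <- c("abba", "cde", "abc123", "a.b", "hello hello")

# Alternation: TRUE where the element contains 'abba' or 'cde'.
alt <- grepl("abba|cde", x)

# Character class plus interval quantifier: three digits in a row.
digits <- grepl("[0123456789]{3}", x)

# A literal period. As 'cat' would print it the pattern is a\.b,
# so the backslash is doubled inside the R character string.
dot <- grepl("a\\.b", x)

# Backreference: collapse a word that is immediately repeated
# after a single space ("hello hello" becomes "hello").
dedup <- sub("([[:alpha:]]+) \\1", "\\1", x)

print(alt)
print(digits)
print(dot)
print(dedup)
```

     Only the last element is changed by the 'sub' call, because it is
     the only one in which the first parenthesized subexpression is
     followed by a space and an identical substring.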

Perl Regular Expressions:

     The 'perl = TRUE' argument to 'grep', 'regexpr', 'gregexpr',
     'sub', 'gsub' and 'strsplit' switches to the PCRE library that
     'implements regular expression pattern matching using the same
     syntax and semantics as Perl 5.6 or later, with just a few
     differences'. It adds some features from Perl 5.10.

     For complete details please consult the man pages for PCRE
     (especially 'man pcrepattern' and 'man pcreapi'), on your system
     or from the sources at . If PCRE support was compiled from the
     sources within R, the PCRE version is 7.8 as described here
     (version >= 7.6 is required if R is configured to use the system's
     PCRE library).

     Perl regular expressions are computed byte-by-byte rather than
     character-by-character except in UTF-8 locales. Since the only
     non-UTF-8 multibyte locales in common use are those for CJK
     languages, Perl regular expressions should be used with care in
     non-UTF-8 CJK locales.

     All the regular expressions described for extended regular
     expressions are accepted except '\<' and '\>': in Perl all
     backslashed metacharacters are alphanumeric and backslashed
     symbols are always interpreted as a literal character. '{' is not
     special if it would be the start of an invalid interval
     specification. There can be more than 9 backreferences.

     In a UTF-8 locale the named character classes only match ASCII
     characters: see '\p' below for an alternative.

     The construct '(?...)' is used for Perl extensions in a variety of
     ways depending on what immediately follows the '?'.

     Perl-like matching can work in several modes, set by the options
     '(?i)' (caseless, equivalent to Perl's '/i'), '(?m)' (multiline,
     equivalent to Perl's '/m'), '(?s)' (single line, so a dot matches
     all characters, even new lines: equivalent to Perl's '/s') and
     '(?x)' (extended, whitespace data characters are ignored unless
     escaped and comments are allowed: equivalent to Perl's '/x').
     These can be concatenated, so for example, '(?im)' sets caseless
     multiline matching.
     It is also possible to unset these options by preceding the letter
     with a hyphen, and to combine setting and unsetting such as
     '(?im-sx)'. These settings can be applied within patterns, and
     then apply to the remainder of the pattern. Additional options not
     in Perl include '(?U)' to set 'ungreedy' mode (so matching is
     minimal unless '?' is used, when it is greedy). Initially none of
     these options are set.

     If you want to remove the special meaning from a sequence of
     characters, you can do so by putting them between '\Q' and '\E'.
     This is different from Perl in that '$' and '@' are handled as
     literals in '\Q...\E' sequences in PCRE, whereas in Perl, '$' and
     '@' cause variable interpolation.

     The escape sequences '\d', '\s' and '\w' represent any decimal
     digit, space character and 'word' character (letter, digit or
     underscore in the current locale, except that in a UTF-8 locale
     only ASCII letters are considered) respectively, and their
     upper-case versions represent their negation. Unlike POSIX and
     earlier versions of Perl and PCRE, vertical tab is not regarded as
     a whitespace character.

     Escape sequence '\a' is 'BEL', '\e' is 'ESC', '\f' is 'FF', '\n'
     is 'LF', '\r' is 'CR' and '\t' is 'TAB'. In addition '\cx' is
     'cntrl-x' for any 'x', '\ddd' is the octal character 'ddd' (for up
     to three digits unless interpretable as a backreference, as '\1'
     to '\7' always are), and '\xhh' specifies a character by two hex
     digits. In a UTF-8 locale, '\x{h...}' specifies a Unicode point by
     one or more hex digits.

     Outside a character class, '\b' matches a word boundary, '\B' is
     its negation, '\A' matches at start of a subject (even in
     multiline mode, unlike '^'), '\Z' matches at end of a subject or
     before newline at end, '\z' matches at end of a subject, and '\G'
     matches at first matching position in a subject (which is subtly
     different from Perl's end of the previous match). '\C' matches a
     single byte, including a newline.
     In a UTF-8 locale, '\R' matches any Unicode newline character (not
     just CR), and '\X' matches any number of Unicode characters that
     form an extended Unicode sequence.

     In a UTF-8 locale, some Unicode properties are supported via
     '\p{xx}' and '\P{xx}' which match characters with and without
     property 'xx' respectively. For a list of supported properties see
     the PCRE documentation, but for example 'Lu' is 'upper case
     letter' and 'Sc' is 'currency symbol'.

     The same repetition quantifiers as extended POSIX are supported.
     However, if a quantifier is followed by '?', the match is
     'ungreedy', that is as short as possible rather than as long as
     possible (unless the meanings are reversed by the '(?U)' option).

     The sequence '(?#' marks the start of a comment which continues up
     to the next closing parenthesis. Nested parentheses are not
     permitted. The characters that make up a comment play no part at
     all in the pattern matching.

     If the extended option is set, an unescaped '#' character outside
     a character class introduces a comment that continues up to the
     next newline character in the pattern.

     The pattern '(?:...)' groups characters just as parentheses do but
     does not make a backreference.

     Patterns '(?=...)' and '(?!...)' are zero-width positive and
     negative lookahead _assertions_: they match if an attempt to match
     the '...' forward from the current position would succeed (or
     not), but use up no characters in the string being processed.
     Patterns '(?<=...)' and '(?<!...)' are the lookbehind equivalents.

     The 'pcrepattern' man page can be found as part of , and details
     of Perl's own implementation at .
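
     Several of the Perl-like features described above (ungreedy
     quantifiers, the '(?i)' option, '\Q...\E' quoting, non-capturing
     groups, and lookahead and lookbehind assertions) are illustrated
     in the following sketch. It assumes R was built with PCRE support
     so that 'perl = TRUE' is available; the input strings and variable
     names are made up for the example. Every backslash in a pattern is
     doubled because the patterns are entered as R character strings.

```r
# Greedy versus ungreedy repetition on the same input.
greedy   <- sub("<.+>",  "", "<b>bold</b>", perl = TRUE)
ungreedy <- sub("<.+?>", "", "<b>bold</b>", perl = TRUE)

# Caseless matching switched on inside the pattern with (?i).
caseless <- grepl("(?i)hello", "HeLLo World", perl = TRUE)

# \Q...\E removes the special meaning of the enclosed characters,
# so the period only matches a literal period.
quoted <- grepl("\\Qa.b\\E", c("a.b", "axb"), perl = TRUE)

# (?:...) groups without making a backreference, so \1 refers
# to the parenthesized 'c'.
noncap <- sub("(?:ab)+(c)", "\\1", "ababc", perl = TRUE)

# Zero-width lookahead: digits only when followed by ' USD'.
ahead <- gsub("\\d+(?= USD)", "N", "100 USD and 250 EUR", perl = TRUE)

# Zero-width lookbehind: digits only when preceded by a dollar sign.
behind <- sub("(?<=\\$)\\d+", "N", "cost $42", perl = TRUE)

print(c(greedy, ungreedy, noncap, ahead, behind))
print(caseless)
print(quoted)
```

     The greedy call removes the whole string because '.+' extends to
     the last '>', whereas the ungreedy call removes only the first
     tag. The lookahead and lookbehind calls leave the surrounding text
     untouched, since assertions use up no characters.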