This document provides a cheat sheet for using regular expressions in R. It summarizes common regex patterns for matching different types of characters, functions for extracting matches and positions, and modifiers for controlling greedy/lazy matching and case sensitivity. Key functions covered include grep(), regexpr(), stringr::str_extract(), sub(), and gsub().
Cheat Sheet

Character classes

[[:digit:]] or \\d    Digits; [0-9]
\\D                   Non-digits; [^0-9]
[[:lower:]]           Lower-case letters; [a-z]
[[:upper:]]           Upper-case letters; [A-Z]
[[:alpha:]]           Alphabetic characters; [A-Za-z]
[[:alnum:]]           Alphanumeric characters; [A-Za-z0-9]
\\w                   Word characters; [A-Za-z0-9_]
\\W                   Non-word characters
[[:xdigit:]]          Hexadecimal digits; [0-9A-Fa-f]
[[:blank:]]           Space and tab
[[:space:]] or \\s    Space, tab, vertical tab, newline, form feed, carriage return
\\S                   Not space; [^[:space:]]
[[:punct:]]           Punctuation characters; !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[[:graph:]]           Graphical characters; [[:alnum:][:punct:]]
[[:print:]]           Printable characters; [[:graph:]] plus space
[[:cntrl:]]           Control characters; \n, \r etc.

Example data used below

> string <- c("Hiphopopotamus", "Rhymenoceros", "time for bottomless lyrics")
> pattern <- "t.m"

Detect patterns

grep(pattern, string)
    [1] 1 3
grep(pattern, string, value = TRUE)
    [1] "Hiphopopotamus"
    [2] "time for bottomless lyrics"
grepl(pattern, string)
    [1] TRUE FALSE TRUE
stringr::str_detect(string, pattern)
    [1] TRUE FALSE TRUE

Locate patterns

regexpr(pattern, string)
    find starting position and length of first match
gregexpr(pattern, string)
    find starting position and length of all matches
stringr::str_locate(string, pattern)
    find starting and end position of first match
stringr::str_locate_all(string, pattern)
    find starting and end position of all matches

Extract patterns

regmatches(string, regexpr(pattern, string))
    extract first match
    [1] "tam" "tim"
regmatches(string, gregexpr(pattern, string))
    extract all matches, outputs a list
    [[1]] "tam" [[2]] character(0) [[3]] "tim" "tom"
stringr::str_extract(string, pattern)
    extract first match
    [1] "tam" NA "tim"
stringr::str_extract_all(string, pattern)
    extract all matches, outputs a list
stringr::str_extract_all(string, pattern, simplify = TRUE)
    extract all matches, outputs a matrix
stringr::str_match(string, pattern)
    extract first match + individual character groups
stringr::str_match_all(string, pattern)
    extract all matches + individual character groups

Replace patterns

sub(pattern, replacement, string)
    replace first match
gsub(pattern, replacement, string)
    replace all matches
stringr::str_replace(string, pattern, replacement)
    replace first match
stringr::str_replace_all(string, pattern, replacement)
    replace all matches

Split a string using a pattern

strsplit(string, pattern) or stringr::str_split(string, pattern)
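The detect, locate, extract, replace and split functions listed above can be tried on the cheat sheet's own example data. The following is a minimal sketch using base R only (so it runs without stringr); the variable names and the "___" replacement string are illustrative assumptions, not part of the original sheet.

```r
# Example data taken from the cheat sheet
string  <- c("Hiphopopotamus", "Rhymenoceros", "time for bottomless lyrics")
pattern <- "t.m"

# Detect: which elements contain a match?
idx   <- grep(pattern, string)                  # indices of matching elements
hits  <- grep(pattern, string, value = TRUE)    # the matching elements themselves
flags <- grepl(pattern, string)                 # one logical value per element

# Locate: where do the matches start? (-1 means "no match")
first_pos <- regexpr(pattern, string)           # first match per element
all_pos   <- gregexpr(pattern, string)          # list, all matches per element

# Extract: pull out the matched text
first_match <- regmatches(string, regexpr(pattern, string))    # drops non-matching elements
all_matches <- regmatches(string, gregexpr(pattern, string))   # list, one entry per element

# Replace: first match only vs. every match
first_replaced <- sub(pattern, "___", string)
all_replaced   <- gsub(pattern, "___", string)

# Split: cut each element at the matches
pieces <- strsplit(string, pattern)

# Character class: keep only the digits of a (made-up) string
digits <- gsub("[^[:digit:]]", "", "tel: 555-0199")

print(first_match)
print(all_replaced)
```

Note that regmatches() with regexpr() silently drops elements without a match, whereas the gregexpr() variant keeps one list entry per input element, so the two results are not aligned in the same way.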
Special metacharacters

\n    New line
\r    Carriage return
\t    Tab
\v    Vertical tab
\f    Form feed

Anchors

^      Start of the string
$      End of the string
\\b    Empty string at either edge of a word
\\B    NOT the edge of a word
\\<    Beginning of a word
\\>    End of a word

Quantifiers

*        Matches at least 0 times
+        Matches at least 1 time
?        Matches at most 1 time; optional string
{n}      Matches exactly n times
{n,}     Matches at least n times
{n,m}    Matches between n and m times

Character classes and groups

.        Any character except \n
|        Or, e.g. (a|b)
[…]      List permitted characters, e.g. [abc]
[a-z]    Specify character ranges
[^…]     List excluded characters
(…)      Grouping, enables back referencing using \\N where N is an integer
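As a rough illustration of the anchors, quantifiers, groups and back references listed above, here is a small base R sketch. The test vectors and variable names are made up for this example.

```r
x <- c("cat", "concat", "cats and dogs", "bobcat")

# Anchors: ^ and $ tie the match to the start or end of the string
starts_with_cat <- grepl("^cat", x)
ends_with_cat   <- grepl("cat$", x)

# \\b matches the empty string at the edge of a word
whole_word_cat  <- grepl("\\bcat\\b", x)

# Quantifiers applied to the letter "b"
y <- c("ac", "abc", "abbc", "abbbc")
zero_or_more <- grepl("^ab*c$", y)       # *      at least 0 times
one_or_more  <- grepl("^ab+c$", y)       # +      at least 1 time
at_most_one  <- grepl("^ab?c$", y)       # ?      at most 1 time
exactly_two  <- grepl("^ab{2}c$", y)     # {n}    exactly n times
two_to_three <- grepl("^ab{2,3}c$", y)   # {n,m}  between n and m times

# Alternation inside a group
a_or_b <- grepl("^(a|b)", c("apple", "berry", "cherry"))

# Groups with back references: \\1 and \\2 refer to the first and second group
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")

# A back reference inside the pattern itself: any doubled character
doubled <- grepl("(.)\\1", c("letter", "later"))

print(swapped)
```

Back references work both in the replacement string (as in `swapped`) and inside the pattern (as in `doubled`).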
Lookarounds and conditionals

(?=)                  Lookahead (requires perl = TRUE), e.g. (?=yx): position followed by 'yx'
(?!)                  Negative lookahead (perl = TRUE); position NOT followed by pattern
(?<=)                 Lookbehind (perl = TRUE), e.g. (?<=yx): position following 'yx'
(?<!)                 Negative lookbehind (perl = TRUE); position NOT following pattern
(?(if)then)           If-then-condition (perl = TRUE); use lookaheads, optional char. etc in if-clause
(?(if)then|else)      If-then-else-condition (perl = TRUE)

See, e.g.
http://www.regular-expressions.info/lookaround.html
http://www.regular-expressions.info/conditional.html

General modes

By default R uses extended regular expressions. You can switch to PCRE regular expressions using perl = TRUE for base or by wrapping patterns with perl() for stringr (perl() is deprecated in later stringr releases, whose default regex() engine already supports lookarounds).

All functions can be used with literal searches using fixed = TRUE for base or by wrapping patterns with fixed() for stringr.

All base functions can be made case insensitive by specifying ignore.case = TRUE.

Escaping characters

Metacharacters (. * + etc.) can be used as literal characters by escaping them. Characters can be escaped using \\ or by enclosing them in \\Q...\\E.

Greedy matching

By default the asterisk * is greedy, i.e. it always matches the longest possible string. It can be used in lazy mode by adding ?, i.e. *?.

Greedy mode can be turned off using (?U). This switches the syntax, so that (?U)a* is lazy and (?U)a*? is greedy.

Case conversions

Regular expressions can be made case insensitive using (?i). In backreferences, the strings can be converted to lower or upper case using \\L or \\U (e.g. \\L\\1). This requires perl = TRUE.

Note

Regular expressions can conveniently be created using e.g. the packages rex or rebus.
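The lookaround, greedy/lazy, escaping and case-conversion notes above can be sketched in base R as follows. The input strings and variable names are invented for illustration, and perl = TRUE is passed wherever PCRE-only syntax is involved (it is also used for the lazy quantifier and \\Q...\\E examples purely to keep the block on one engine).

```r
# Lookahead and lookbehind (PCRE only, so perl = TRUE is needed)
prices <- "cost: 25 EUR, weight: 40 kg"
eur_amount   <- regmatches(prices, regexpr("\\d+(?= EUR)", prices, perl = TRUE))
after_weight <- regmatches(prices, regexpr("(?<=weight: )\\d+", prices, perl = TRUE))

# Greedy vs. lazy matching
html     <- "<b>bold</b> and <i>italic</i>"
greedy   <- regmatches(html, regexpr("<.*>", html, perl = TRUE))      # longest possible match
lazy     <- regmatches(html, regexpr("<.*?>", html, perl = TRUE))     # shortest possible match
ungreedy <- regmatches(html, regexpr("(?U)<.*>", html, perl = TRUE))  # (?U) flips the default

# Escaping: three ways to match a literal dot
candidates <- c("a.c", "abc")
unescaped  <- grepl("a.c", candidates)                      # "." still matches any character
escaped    <- grepl("a\\.c", candidates)                    # backslash escape
quoted     <- grepl("\\Qa.c\\E", candidates, perl = TRUE)   # \\Q...\\E quoting
literal    <- grepl("a.c", candidates, fixed = TRUE)        # literal search

# Case insensitivity: inline modifier vs. function argument
inline_ci <- grepl("(?i)HIPHOP", "hiphopopotamus", perl = TRUE)
arg_ci    <- grepl("HIPHOP", "hiphopopotamus", ignore.case = TRUE)

# Case conversion in the replacement: \\U upper-cases, \\L lower-cases
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "time for lyrics", perl = TRUE)

print(c(greedy, lazy))
print(title_case)
```

The greedy pattern runs from the first "<" to the last ">", while the lazy and (?U) variants stop at the first closing ">".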
CC BY Ian Kopacka • ian.kopacka@ages.at Updated: 07/19