Regular expressions

File size
7.4KB
Lines of code
140

Regular expressions

The bane of every programmer's existence.

Intro

  • Abbreviated as regex or regexp
  • Powerful pattern-matching for specific character combinations in strings
  • Used for searching, editing and manipulating strings

Quickstart

  • Regular expressions consist of both literal characters and metacharacters
    1. Literal character: character literals (eg. a, b, c, 1, 2, 3)
    2. Metacharacter: special character with a specific meaning (eg. ., ^, $, *)

Literal character

  • Alphanumeric characters: a, b, c, ... and 1, 2, 3, ...
  • Whitespace characters: spaces, tabs, newline, ... (unless explicitly escaped)
  • Punctuation and symbols: @, #, $, ... (unless they are already designated metacharacters)

Metacharacters

  • .: matches any character except the newline character
  • ^: matches the start of a string
  • $: matches the end of a string
  • *: matches 0 or more repetitions of the preceding element
  • +: matches 1 or more repetitions of the preceding element
  • ?: matches 0 or 1 repetition of the preceding element
  • {n}: matches exactly n repetitions of the preceding element
  • {n,}: matches n or more repetitions of the preceding element
  • {n,m}: matches between n and m repetitions of the preceding element
  • []: matches any one character within the square brackets
  • |: logical OR operator
  • \: escape character that specifies the escape of a regex metacharacter (treating the metacharacter as a literal character)
  • (): groups multiple tokens together to create a capture group for extracting substrings
  • \d: matches any digit (equivalent to [0-9])
  • \D: matches any non-digit (equivalent to [^0-9])
  • \w: matches any alphanumeric character and the _ underscore (equivalent to [a-zA-Z0-9_])
  • \W: matches any character that is not a word character (equivalent to [^a-zA-Z0-9_])
  • \s: matches any whitespace character (equivalent to [ \t\n\r\f\v])
  • \S: matches any character that is not a whitespace character (equivalent to [^ \t\n\r\f\v])
  • \b: matches a position between a word character and a non-word character (word boundary)
  • \B: matches a position that is not a word boundary (non-word boundary)
  • (?:): groups multiple tokens together without creating a capture group (non-capturing group)
  • (?=): asserts that a group of characters can be matched to the right of the current position without including it in the match (positive lookahead)
  • (?!: asserts that a group of characters cannot be matched to the right of the current position (negative lookahead)
  • (?<=): asserts that a group of characters can be matched to the left of the current position (positive lookbehind)
  • (?<!: asserts that a group of characters cannot be matched to the left of the current position (negative lookbehind)

Worked example

# ----- WORKED EXAMPLE -----

hello         # this matches the exact string "hello"

h.llo         # this matches "hello", "hallo", "hxllo", etc.

^hello        # this matches "hello" only if it's at the start of a line
world$        # this matches "world" only if it's at the end of a line

a*            # this matches "a", "aa", "aaa", etc., including an empty string
a+            # this matches "a", "aa", "aaa", etc., but not an empty string
a?            # this matches "a" or an empty string
a{3}          # this matches exactly "aaa"
a{2,4}        # this matches "aa", "aaa", or "aaaa"

[abc]         # this matches "a", "b", or "c"
[^abc]        # this matches any character except "a", "b", or "c"
[a-z]         # this matches any lowercase letter
[A-Z]         # this matches any uppercase letter
[0-9]         # this matches any digit

(ab|cd)       # this matches "ab" or "cd"
(grape|apple)s # this matches "grapes" or "apples"

More on