Regex: Regularly Exploitable

Thursday, June 11, 2015

Here's a quick demonstration of why Regular Expressions (regex) can be bad for implementing character whitelisting.

I was reading through an application security assessment report recently and noticed a recommendation for preventing Operating System Command Injection (OSCI) that implemented character whitelisting on a given file name through the following regex.

/^[\/a-zA-Z0-9\-\s_]+\.rpt$/m

At first glance, the regex seems legit, right? It attempts to match any combination of letters, numbers, dashes, underscores, slashes, and whitespace, ending with the ".rpt" extension. Already knowing that there was a flaw here (we'll get to that in a moment), I put together the following proof-of-concept to demonstrate the security (or insecurity) of the filter.

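<?php
// Deliberately vulnerable proof-of-concept: do not deploy.
$file_name = $_GET["path"];
// Character whitelist taken from the assessment report.
if(!preg_match("/^[\/a-zA-Z0-9\-\s_]+\.rpt$/m", $file_name)) {
    echo "regex failed";
} else {
    // The user-supplied value is concatenated straight into a shell command.
    echo exec("/usr/bin/file -i -b " . $file_name);
}
?>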

I tried all the typical attack payloads, and sure enough, the regex prevented injection into the shell command. The key here, and why one must always use caution when implementing regex filters, is understanding what the \s character class represents. Most resources are vague and say that it includes "any whitespace character", but what does that include? In most regex implementations, whitespace includes at least [ \t\r\n\f], i.e. spaces, tabs, carriage returns, line feeds, and form feeds (recent versions of PCRE, the engine behind PHP's preg functions, add the vertical tab as well). See the problem yet?
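Rather than trusting a description of \s, it is easy to ask the engine directly. The following is a minimal sketch, assuming a command-line PHP interpreter; the function name whitespace_bytes is mine and purely illustrative. It tests every ASCII character against \s and prints the hex codes of the ones that match.

```php
<?php
// Return the hex codes of all ASCII characters that PCRE's \s class matches.
function whitespace_bytes(): array
{
    $matched = [];
    for ($code = 0; $code < 128; $code++) {
        // chr() builds a one-character string for the given byte value.
        if (preg_match('/\s/', chr($code)) === 1) {
            $matched[] = sprintf('0x%02X', $code);
        }
    }
    return $matched;
}

echo implode(' ', whitespace_bytes()), PHP_EOL;
```

The exact list can differ between PCRE versions, but the line feed (0x0A) is always part of it, and that is the character that matters for the rest of this article.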

Many testers don't think about the impact of line breaks when dealing with injections, but when we're dealing with shell commands, line breaks become very important. Consider the following attack payload.

/path/to/file%0Aid%0A.rpt

%0A is a URL-encoded line feed, which counts as whitespace, so according to the regex, this payload is safe. However, what happens when you put this into a shell? Below is the output from copying and pasting the decoded version of the above payload into a terminal prompt.

# /usr/bin/file -i -b /path/to/file
ERROR: cannot open `/path/to/file' (No such file or directory)
# id
uid=0(root) gid=0(root) groups=0(root)
# .rpt
bash: .rpt: command not found
# 

Do you see what happened? Each line break started a new command and we can see that the shell executed our arbitrary id command. Here's what it looks like through a web interface.
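The filter's verdict can also be reproduced in isolation, without a web server. The sketch below assumes a command-line PHP interpreter and uses illustrative variable names of my own. It decodes the payload with urldecode() to approximate what ends up in $_GET, applies the regex from the report, and then splits the resulting command string on line feeds to show what the shell receives.

```php
<?php
// The filter from the report, unchanged.
$pattern = '/^[\/a-zA-Z0-9\-\s_]+\.rpt$/m';

// Approximates the value of $_GET["path"] after URL decoding.
$payload = urldecode('/path/to/file%0Aid%0A.rpt');

// true means the filter considers the payload safe.
$accepted = preg_match($pattern, $payload) === 1;
var_dump($accepted);

// The shell treats each line of the concatenated string as its own command.
$command_lines = explode("\n", '/usr/bin/file -i -b ' . $payload);
print_r($command_lines);
```

Nothing is executed here; the script only shows that the filter accepts the payload and that the command string spans three lines, with id on a line by itself.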

So let's fix this. Show of hands for how many people think the below regex solves the injection issue? (I replaced the \s with a literal space.)

/^[\/a-zA-Z0-9\- _]+\.rpt$/m

If we use the same payloads as before, including the one that resulted in a successful injection, we can see that the issue has been resolved.

Or has it? Consider the following attack payload.

/path/to/file.rpt%0aid

What just happened?! Let's look at the new regex again.

/^[\/a-zA-Z0-9\- _]+\.rpt$/m

See that m at the end of the regex pattern? It means something. In PHP, the end of the pattern declaration has a spot for modifiers (other languages offer them as well, but may declare them differently). Regex modifiers change how the regex engine applies the pattern to the string. Discussing the different regex modifiers is outside the scope of this article; what we want to focus on here is that the filter pattern is using the multiline modifier (m is the flag for multiline).

The multiline modifier changes the way the beginning (^) and end ($) anchors behave. When the modifier is absent, ^ and $ match at the beginning and end of the string. When it is present, they also match at the beginning and end of every line within the string. This is an important distinction: in the payload, "/path/to/file.rpt" is a complete line that satisfies the pattern, and $ matches right before the line break. Once one line matches, we can add anything we want after the line break to execute arbitrary commands within the shell. Note that even without the multiline modifier, $ in PCRE (the engine behind PHP's preg functions) still matches just before a single trailing line break, so removing m alone does not guarantee that the pattern covers the entire string.
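The effect of the anchors is easiest to see side by side. The following is a minimal sketch, assuming a command-line PHP interpreter. The $strict pattern is my own illustrative variant, not something from the assessment report: it swaps ^ and $ for \A and \z, which PCRE always ties to the very start and very end of the string.

```php
<?php
$inputs = [
    'bypass'   => "/path/to/file.rpt\nid", // decoded /path/to/file.rpt%0aid
    'trailing' => "/path/to/file.rpt\n",   // a lone trailing line feed
    'clean'    => "/path/to/file.rpt",
];

$patterns = [
    'multiline' => '/^[\/a-zA-Z0-9\- _]+\.rpt$/m',  // the "fixed" filter from above
    'no_flag'   => '/^[\/a-zA-Z0-9\- _]+\.rpt$/',   // same pattern with m removed
    'strict'    => '/\A[\/a-zA-Z0-9\- _]+\.rpt\z/', // hypothetical stricter variant
];

// Collect the verdict of every pattern against every input.
$results = [];
foreach ($patterns as $pattern_name => $pattern) {
    foreach ($inputs as $input_name => $input) {
        $results[$pattern_name][$input_name] = preg_match($pattern, $input) === 1;
    }
}

foreach ($results as $pattern_name => $row) {
    foreach ($row as $input_name => $matched) {
        printf("%-9s %-8s %s\n", $pattern_name, $input_name, $matched ? 'match' : 'no match');
    }
}
```

Dropping m is enough to reject the bypass payload, but that pattern still accepts the input with a trailing line feed, because $ is allowed to match just before it. Only the \A and \z variant (or $ combined with the D modifier) insists on the whole string.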

There are a couple of takeaways here.

First, be mindful of how you build whitelists. Be as explicit as possible. The higher the level at which you whitelist, the better. For instance, in the above example, the optimal solution would be to build a whitelist of complete file names that are allowed, and ignore regex altogether. If the file names are not known and we need to whitelist at the character level, then we would need to build a better regex that accounts for everything included in the allowed character classes within the context of the filter, e.g. %0a and its significance to both shell commands and the multiline modifier.
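A whitelist of complete file names can be sketched in a few lines. This is only an illustration under stated assumptions: the paths in ALLOWED_REPORTS and the function name is_allowed_report are hypothetical, and a real application would load the list from wherever it keeps its known report files.

```php
<?php
// Hypothetical list of report files the application is willing to inspect.
const ALLOWED_REPORTS = [
    '/var/reports/daily.rpt',
    '/var/reports/weekly.rpt',
];

// Accept a file name only if it is exactly one of the known names.
function is_allowed_report(string $file_name): bool
{
    // Strict comparison against complete names; no pattern matching involved.
    return in_array($file_name, ALLOWED_REPORTS, true);
}

var_dump(is_allowed_report('/var/reports/daily.rpt'));
var_dump(is_allowed_report("/var/reports/daily.rpt\nid"));
```

Because the comparison is against whole strings, a line feed or any other extra character simply produces a name that is not on the list. Passing the accepted value through PHP's escapeshellarg() before building the command would add a second layer of defense.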

Second, Burp was not able to find this vulnerability when the scanner speed was set to Normal (default). It wasn't until I set the scanner speed to Thorough and hard-coded the ".rpt" extension into the payload that Burp was able to find it. There is no replacement for thorough manual testing by someone who knows what they're doing.

A shout out to John Poulin, who taught me a thing or two about exploiting regex that ultimately led to this article. Thanks John.
