Perl makes a pretty nice replacement for `grep`, `sed`, and `awk`.

2013-07-20, Somewhere over the United States

Perl makes a pretty nice replacement for grep, sed, and awk.

Say you’re looking in your web server’s logs for failed requests for a given URI. The venerable Apache common and combined log formats are space delimited. To maintain your sanity, you configured your server to log tab delimited request records instead, and you added a few fields to tell you what upstream servers did with the request.

172.16.16.128   -   -   [20/Jul/2013:13:02:25 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   200 1234    "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  www.example.com 66.249.73.104   0.065   172.16.17.132:3032  200 0.065   "-"
172.16.16.128   -   -   [20/Jul/2013:13:02:26 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   500 12345   "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  www.example.com 66.249.73.104   0.065   172.16.17.132:3032  500 0.065   "-"
172.16.16.128   -   -   [20/Jul/2013:13:02:27 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   200 1234    "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  www.example.com 66.249.73.104   0.065   172.16.17.132:3032  200 0.065   "-"
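The tabs in these records may not survive copy and paste, so to follow along it can help to generate the file rather than paste it. A minimal sketch, assuming a POSIX shell; the `record` helper and the `mktemp` path are hypothetical stand-ins for the `~/tmp/input.txt` used in the rest of this post:

```shell
# record STATUS BYTES SECOND: print one tab-delimited, 16-field log record.
# (Hypothetical helper, not part of any server tooling.)
record() {
  # The format is reused for every argument, so each of these 15 fields
  # is followed by a real tab character.
  printf '%s\t' 172.16.16.128 - - "[20/Jul/2013:13:02:$3 -0500]" \
    '"GET /uri/that/fails/intermittently HTTP/1.1"' "$1" "$2" '"-"' \
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"' \
    www.example.com 66.249.73.104 0.065 172.16.17.132:3032 "$1" 0.065
  # The 16th field ends the record with a newline instead of a tab.
  printf '%s\n' '"-"'
}

input="$(mktemp)"
{ record 200 1234 25; record 500 12345 26; record 200 1234 27; } >"$input"

# Sanity check: prints the field count of each record, which should be 16.
awk -F '\t' '{ print NF }' "$input"
```

Copy `"$input"` to `~/tmp/input.txt` to run the commands below as written.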

You can number the fields using head, tr, and nl.

$ <~/tmp/input.txt head -n 1 | tr '\t' '\n' | nl -ba 
     1  172.16.16.128
     2  -
     3  -
     4  [20/Jul/2013:13:02:25 -0500]
     5  "GET /uri/that/fails/intermittently HTTP/1.1"
     6  200
     7  1234
     8  "-"
     9  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    10  www.example.com
    11  66.249.73.104
    12  0.065
    13  172.16.17.132:3032
    14  200
    15  0.065
    16  "-"

You’re looking for 500 in field 6. nl counts lines starting with 1.

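The first tool you pull out is grep. Give it a pattern that matches 500 in the field after the fifth tab character. POSIX extended regular expressions don’t define a \t escape, so the pattern needs literal tabs; in a shell such as bash, $'...' quoting turns each \t into one.

$ <~/tmp/input.txt grep -E $'^([^\t]*\t){5}500\t'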
172.16.16.128   -   -   [20/Jul/2013:13:02:26 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   500 12345   "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  www.example.com 66.249.73.104   0.065   172.16.17.132:3032  500 0.065   "-"

Not bad, but it took you a while to get that pattern syntax right.

On its own, this is kind of a contrived job for sed, but you could use it.

$ <~/tmp/input.txt sed -r -n -e '/^([^\t]*\t){5}500\t/p'
172.16.16.128   -   -   [20/Jul/2013:13:02:26 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   500 12345   "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  www.example.com 66.249.73.104   0.065   172.16.17.132:3032  500 0.065   "-"

So sed probably isn’t the best tool for this job, but it does have the advantage of editing the line, if that’s what you’re looking to do.

$ <~/tmp/input.txt sed -r -e 's/^(([^\t]*\t){5})([^\t]*)(\t.*)/The status code was \3./'
The status code was 200.
The status code was 500.
The status code was 200.

Back to the original problem. It’s right in awk’s wheelhouse. awk splits fields on a given delimiter for you, and lets you match the text in only the field of interest. awk counts fields starting with 1.

$ <~/tmp/input.txt awk -F '\t' '$6 == 500 { print $0 }'
172.16.16.128   -   -   [20/Jul/2013:13:02:26 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   500 12345   "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  www.example.com 66.249.73.104   0.065   172.16.17.132:3032  500 0.065   "-"

If you’re printing individual fields instead of the whole line, you want to specify the output delimiter in addition to the input delimiter.

This produces space delimited output:

$ <~/tmp/input.txt awk -F '\t' '$6 == 500 { print $4,$5,$6 }'
[20/Jul/2013:13:02:26 -0500] "GET /uri/that/fails/intermittently HTTP/1.1" 500

This produces tab delimited output:

$ <~/tmp/input.txt awk 'BEGIN { OFS=FS="\t"} $6 == 500 { print $4,$5,$6 }'
[20/Jul/2013:13:02:26 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   500

At some point, you decide that remembering the details of each program’s command-line options and the various regular expression flavors is a hassle, and you just do it all with perl.

You can match lines with 500 in field 6. Perl counts array indices starting with 0, so field 6 is array index 5.

$ <~/tmp/input.txt perl -F'\t' -ane 'if ($F[5] == 500) { print join("\t", @F); }'
172.16.16.128   -   -   [20/Jul/2013:13:02:26 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   500 12345   "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  www.example.com 66.249.73.104   0.065   172.16.17.132:3032  500 0.065   "-"

You can edit the line:

$ <~/tmp/input.txt perl -pe 's/^(([^\t]*\t){5})([^\t]*)(\t.*)/The status code was $3./'
The status code was 200.
The status code was 500.
The status code was 200.

And you can print individual fields.

$ <~/tmp/input.txt perl -F'\t' -ane 'if ($F[5] == 500) { print join("\t", $F[3], $F[4], $F[5]), "\n"; }'
[20/Jul/2013:13:02:26 -0500]    "GET /uri/that/fails/intermittently HTTP/1.1"   500

The perl commands are a little more verbose than the grep, sed, and awk equivalents, but perl alone can solve all of the above problems.

Perl also has unique advantages over each of the other tools. It has Perl regular expressions, which provide features like lookaround that you can’t get in the basic or extended regular expressions of grep, sed, and awk. And Perl is a full-featured procedural programming language, making it easier to extend these solutions to more complex problems.