AWK - Regular expressions

Regular expressions

awk provides regular expressions for pattern matching; the syntax of UNIX system regular expressions is described in ``Regular expressions''.

The simplest regular expression is a string of characters matching only itself: that is, the string is a literal. In awk, a regular expression is typically enclosed within slashes in order to label it as a regular expression as opposed to an awk command, as follows:

   /Asia/

This program prints all input records that contain the substring ``Asia''; if a record contains ``Asia'' as part of a larger string like ``Asian'' or ``Pan-Asiatic'', it is also printed.
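
As a small illustration of the substring behavior described above, the following sketch feeds a few made-up records to awk from the shell. The function name match_asia and the sample lines are hypothetical, chosen only for this example.

```shell
# Hypothetical sample records; awk prints every line that contains
# "Asia" anywhere, including inside a longer word.
match_asia() {
  printf '%s\n' 'Asia' 'Asian markets' 'Pan-Asiatic' 'Europe' |
    awk '/Asia/'
}

match_asia
# prints:
#   Asia
#   Asian markets
#   Pan-Asiatic
```

Because no action follows the pattern, awk applies its default action and prints the whole matching record.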

awk provides the full range of UNIX system regular expression metacharacters; see ``Regular expressions'' for a detailed explanation. (In addition, awk recognizes the escape sequences listed in ``The echo command''.) awk also provides the regular expression operators shown in ``awk regular expression operators''.

awk regular expression operators

   Operator   Meaning
   ~          matches
   !~         does not match

To restrict a match to a specific field, you use the matching operators ~ (matches) and !~ (does not match). The following program prints the first field of all lines in which the fourth field matches ``Asia'':

   $4 ~ /Asia/ { print $1 }

This program prints the first field of all lines in which the fourth field does not match ``Asia'':

   $4 !~ /Asia/ { print $1 }
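
The two field-restricted programs can be tried side by side with a sketch like the one below. The four-field records (name, area, population, continent) and the function names sample, in_asia, and not_in_asia are assumptions made up for this example; they are not data from this document.

```shell
# Hypothetical four-field records: name, area, population, continent.
# The last record has "Asia" in field 1 but not in field 4.
sample() {
  printf '%s\n' \
    'Russia 8650 262 Asia' \
    'Canada 3852 24 NorthAmerica' \
    'China 3692 866 Asia' \
    'Asia-Pacific 1 1 Oceania'
}

# First field of records whose fourth field matches "Asia".
in_asia()     { sample | awk '$4 ~ /Asia/ { print $1 }'; }

# First field of records whose fourth field does not match "Asia".
not_in_asia() { sample | awk '$4 !~ /Asia/ { print $1 }'; }

in_asia       # prints Russia and China
not_in_asia   # prints Canada and Asia-Pacific
```

The last record shows why the field restriction matters: a bare /Asia/ pattern would select it because of its first field, whereas $4 ~ /Asia/ does not.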

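awk interprets any string or variable on the right side of a ~ or !~ as a regular expression. For example, the following program prints every record whose second field is not a string of one or more digits:

   $2 !~ /^[0-9]+$/

This program can be rewritten with the regular expression held in a variable: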

   BEGIN     { digits = "^[0-9]+$" }
   $2 !~ digits
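
A minimal sketch of the variable form in use follows. The input records and the function name non_numeric_second_field are hypothetical; the awk program itself is the one shown above.

```shell
# Print records whose second field is not made up entirely of digits.
# The regular expression is held in the string variable "digits".
non_numeric_second_field() {
  printf '%s\n' 'a 123' 'b 12x' 'c 007' 'd -5' |
    awk 'BEGIN { digits = "^[0-9]+$" }
         $2 !~ digits'
}

non_numeric_second_field
# prints:
#   b 12x
#   d -5
```

Holding the expression in a variable is convenient when the same pattern is used in several rules or is built up at run time.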

Suppose you want to search for a literal string of characters such as ^[0-9]+$, in which every metacharacter must be escaped with a backslash. When a quoted string like "^[0-9]+$" is used as a regular expression, one extra level of backslashes is needed to protect regular expression metacharacters, because one level of backslashes is removed when awk first parses the string. If a backslash is needed in front of a character to turn off its special meaning in a regular expression, then that backslash itself needs a preceding backslash to protect it in the string.

For example, suppose we want to match strings containing ``b'' followed by a dollar sign. The regular expression for this pattern is b\$. To create a string to represent this regular expression, add one more backslash, as follows:

   "b\\$"

The two regular expressions on each of the following lines are equivalent:

   x ~ "b\\$"     x ~ /b\$/
   x ~ "b\$"      x ~ /b$/
   x ~ "b$"       x ~ /b$/
   x ~ "\\t"      x ~ /\t/
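
Some of these equivalences can be checked from the shell with a sketch like the following. The function names and input lines are hypothetical, and $0 stands in for x. The "b\$" form is left out because, as an assumption worth noting, awk implementations are known to differ in how they treat a backslash before an ordinary character inside a string.

```shell
# Two hypothetical input lines: one contains "b$", the other ends in "b".
lines() { printf '%s\n' 'ab$c' 'ab'; }

# String form needs two backslashes; slash form needs one.
# Both select lines containing a literal "b$".
dollar_string() { lines | awk '$0 ~ "b\\$"'; }
dollar_slash()  { lines | awk '$0 ~ /b\$/'; }

# With no backslash, $ anchors the match at the end of the string.
anchor_string() { lines | awk '$0 ~ "b$"'; }
anchor_slash()  { lines | awk '$0 ~ /b$/'; }

# "\\t" as a string and /\t/ both match a tab character;
# tab_count_* report how many of the two lines contain one.
tab_count_string() { printf 'a\tb\nab\n' | awk '$0 ~ "\\t" { n++ } END { print n + 0 }'; }
tab_count_slash()  { printf 'a\tb\nab\n' | awk '$0 ~ /\t/  { n++ } END { print n + 0 }'; }

dollar_string   # prints ab$c
anchor_string   # prints ab
```

The awk programs are wrapped in single quotes so that the shell passes the backslashes and dollar signs to awk unchanged.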

A summary of the regular expressions and the substrings they match is given in ``awk regular expressions''. The unary operators *, +, and ? have the highest precedence, with concatenation next, and then alternation (|). All operators are left-associative. The r stands for any regular expression.

awk regular expressions

   Expression   Matches
   char         any non-metacharacter char
   \char        character char literally
   ^            beginning of string
   $            end of string
   .            any character but newline
   [s]          any character in set s
   [^s]         any character not in set s
   r*           zero or more rs
   r+           one or more rs
   r?           zero or one r
   (r)          r
   r1 r2        r1 then r2 (concatenation)
   r1|r2        r1 or r2 (alternation)
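
The precedence rules can be illustrated with a short sketch. The input words and the function names words, precedence_demo, and group_demo are hypothetical. The outer parentheses and anchors are added only so that the whole record must match.

```shell
# Hypothetical input words, one per line.
words() { printf '%s\n' 'ab' 'c' 'cddd' 'abd' 'cdcd'; }

# ab|cd* is read as (ab)|(c(d*)): the * applies only to d,
# concatenation binds next, and alternation binds last.
precedence_demo() { words | awk '/^(ab|cd*)$/'; }

# Parentheses change the grouping: (ab|cd)* repeats the whole alternation.
group_demo() { words | awk '/^(ab|cd)*$/'; }

precedence_demo   # prints ab, c, cddd
group_demo        # prints ab, cdcd
```

Comparing the two outputs shows how parentheses override the default precedence: ``cddd'' matches only when * applies to d alone, and ``cdcd'' matches only when * applies to the grouped alternation.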