A regular expression, or regexp, is a way of describing a set of strings. Because regular expressions are such a fundamental part of awk programming, their format and use deserve a separate chapter.
A regular expression enclosed in slashes (`/') is an awk pattern that matches every input record whose text belongs to that set. The simplest regular expression is a sequence of letters, numbers, or both. Such a regexp matches any string that contains that sequence. Thus, the regexp `foo' matches any string containing `foo'. Therefore, the pattern /foo/ matches any input record containing the three characters `foo' anywhere in the record. Other kinds of regexps let you specify more complicated classes of strings.
Initially, the examples in this chapter are simple. As we explain more about how regular expressions work, we will present more complicated instances.
3.1 How to Use Regular Expressions
3.2 Escape Sequences: How to write non-printing characters.
3.3 Regular Expression Operators
3.4 Using Character Lists: What can go between `[...]'.
3.5 gawk-Specific Regexp Operators: Operators specific to GNU software.
3.6 Case Sensitivity in Matching: How to do case-insensitive matching.
3.7 How Much Text Matches?: How much text matches.
3.8 Using Dynamic Regexps
A regular expression can be used as a pattern by enclosing it in slashes. Then the regular expression is tested against the entire text of each record. (Normally, it only needs to match some part of the text in order to succeed.) For example, the following prints the second field of each record that contains the string `foo' anywhere in it:
$ awk '/foo/ { print $2 }' BBS-list
Regular expressions can also be used in matching expressions. These expressions allow you to specify the string to match against; it need not be the entire current input record. The two operators `~' and `!~' perform regular expression comparisons. Expressions using these operators can be used as patterns, or in if, while, for, and do statements. (See section Control Statements in Actions.) For example:
exp ~ /regexp/
is true if the expression exp (taken as a string) matches regexp. The following example matches, or selects, all input records with the uppercase letter `J' somewhere in the first field:
$ awk '$1 ~ /J/' inventory-shipped
So does this:
$ awk '{ if ($1 ~ /J/) print }' inventory-shipped
This next example is true if the expression exp (taken as a character string) does not match regexp:
exp !~ /regexp/
The following example matches, or selects, all input records whose first field does not contain the uppercase letter `J':
$ awk '$1 !~ /J/' inventory-shipped
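The two matching operators can also be tried without a data file. The following is a minimal sketch, assuming a made-up three-line input; the sample data and the shell function names (sample, with_j, without_j) are hypothetical and are not part of the inventory-shipped file used above.

```shell
# Hypothetical sample data: a name in field 1, a count in field 2.
sample() {
    printf 'Jan 13\nFeb 15\nJun 31\n'
}

# Select records whose first field contains an uppercase J.
with_j() {
    sample | awk '$1 ~ /J/'
}

# Select records whose first field does not contain an uppercase J.
without_j() {
    sample | awk '$1 !~ /J/'
}

with_j
without_j
```

Every record is selected by exactly one of the two functions, because `!~' is simply the negation of `~'.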
When a regexp is enclosed in slashes, such as /foo/, we call it a regexp constant, much like 5.27 is a numeric constant and "foo" is a string constant.
Some characters cannot be included literally in string constants ("foo") or regexp constants (/foo/). Instead, they should be represented with escape sequences, which are character sequences beginning with a backslash (`\'). One use of an escape sequence is to include a double quote character in a string constant. Because a plain double quote ends the string, you must use `\"' to represent an actual double quote character as a part of the string. For example:
$ awk 'BEGIN { print "He said \"hi!\" to her." }'
The backslash character itself is another character that cannot be included normally; you must write `\\' to put one backslash in the string or regexp. Thus, the string whose contents are the two characters `"' and `\' must be written "\"\\".
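As a quick check of the two-character string just described, the following sketch prints the string together with its length. The function name quote_and_backslash is a hypothetical helper chosen for illustration.

```shell
# Build the string containing a double quote followed by a backslash,
# then print it along with its length (two characters).
quote_and_backslash() {
    awk 'BEGIN { s = "\"\\"; print s, length(s) }'
}

quote_and_backslash
```

Although the string constant is written with four characters between the quotes, awk stores only two, which is what length reports.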
Another use of backslash is to represent unprintable characters such as tab or newline. While there is nothing to stop you from entering most unprintable characters directly in a string constant or regexp constant, they may look ugly.
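The following table lists all the escape sequences used in awk and what they represent. Unless noted otherwise, all these escape sequences apply to both string constants and regexp constants:
\\
    A literal backslash, `\'.
\a
    The "alert" character, Ctrl-g, ASCII code 7 (BEL).
\b
    Backspace, Ctrl-h, ASCII code 8 (BS).
\f
    Formfeed, Ctrl-l, ASCII code 12 (FF).
\n
    Newline, Ctrl-j, ASCII code 10 (LF).
\r
    Carriage return, Ctrl-m, ASCII code 13 (CR).
\t
    Horizontal tab, Ctrl-i, ASCII code 9 (HT).
\v
    Vertical tab, Ctrl-k, ASCII code 11 (VT).
\nnn
    The octal value nnn, where nnn stands for one to three digits between `0' and `7'.
\xhh...
    The hexadecimal value hh, where hh stands for a sequence of hexadecimal digits (`0' through `9', and either `A' through `F' or `a' through `f'). (The `\x' escape sequence is not allowed in POSIX awk.)
\/
    A literal slash (necessary for regexp constants only). Because a regexp constant is delimited by slashes, you must escape a slash that is part of the pattern in order to tell awk to keep processing the rest of the regexp.
\"
    A literal double quote (necessary for string constants only). Because a string constant is delimited by double quotes, you must escape a quote that is part of the string in order to tell awk to keep processing the rest of the string.
In gawk, a number of additional two-character sequences that begin with a backslash have special meaning in regexps. See section gawk-Specific Regexp Operators.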
In a regexp, a backslash before any character that is not in the above table and not listed in gawk-Specific Regexp Operators means that the next character should be taken literally, even if it would normally be a regexp operator. For example, /a\+b/ matches the three characters `a+b'.
For complete portability, do not use a backslash before any character not shown in the table above.
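The following sketch shows the /a\+b/ example at work on a made-up input, next to a bracketed form, /a[+]b/, which is a common alternative way of making an operator character literal. The sample lines and the function names are assumptions made for illustration.

```shell
# Hypothetical input: only the first line contains the text a+b.
sample_plus() {
    printf 'a+b\naab\nab\n'
}

# A backslash makes the plus sign literal.
escaped_plus() {
    sample_plus | awk '/a\+b/'
}

# A one-character list also makes the plus sign literal.
bracket_plus() {
    sample_plus | awk '/a[+]b/'
}

escaped_plus
bracket_plus
```

Without the backslash or brackets, /a+b/ would mean "one or more `a' characters followed by `b'" and would select all three lines.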
To summarize:
- The escape sequences in the table above are always processed first, for both string constants and regexp constants. This happens very early, as soon as awk reads your program.
- gawk processes both regexp constants and dynamic regexps (see section Using Dynamic Regexps), for the special operators listed in gawk-Specific Regexp Operators.
- A backslash before any other character means to treat that character literally.
If you place a backslash in a string constant before something that is not one of the characters listed above, POSIX awk purposely leaves what happens as undefined. There are two choices:
Strip the backslash out
    This is what Unix awk and gawk both do. For example, "a\qc" is the same as "aqc". Because this is such an easy bug both to introduce and to miss, gawk warns you about it. Consider `FS = "[ \t]+\|[ \t]+"', intended to use vertical bars surrounded by whitespace as the field separator. There should be two backslashes in the string: `FS = "[ \t]+\\|[ \t]+"'.
Leave the backslash alone
    Some other awk implementations do this. In such implementations, "a\qc" is the same as if you had typed "a\\qc".
Suppose you use an octal or hexadecimal escape to represent a regexp metacharacter (see section Regular Expression Operators). Does awk treat the character as a literal character or as a regexp operator?
Historically, such characters were taken literally. (d.c.) However, the POSIX standard indicates that they should be treated as real metacharacters, which is what gawk does. In compatibility mode (see section Command-Line Options), gawk treats the characters represented by octal and hexadecimal escape sequences literally when used in regexp constants. Thus, /a\52b/ is equivalent to /a\*b/.
You can combine regular expressions with special characters, called regular expression operators or metacharacters, to increase the power and versatility of regular expressions.
The escape sequences described earlier (see section Escape Sequences) are valid inside a regexp. They are introduced by a `\' and are recognized and converted into the corresponding real characters as the very first step in processing regexps.
Here is a list of metacharacters. All characters that are not escape sequences and that are not listed in the table stand for themselves:
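\
    Suppresses the special meaning of a character when matching. For example, `\$' matches the character `$'.
^
    Matches the beginning of a string. It does not match the beginning of a line embedded in a string. The condition is not true in the following example:
    if ("line1\nLINE 2" ~ /^L/) ...
$
    Similar to `^', but it matches only at the end of a string. It does not match the end of a line embedded in a string. The condition is not true in the following example:
    if ("line1\nLINE 2" ~ /1$/) ...
.
    Matches any single character, including the newline character. Some versions of awk may not be able to match the NUL character.
[...]
    A character list. It matches any one of the characters enclosed in the square brackets.
[^ ...]
    A complemented character list. It matches any character except those in the square brackets.
|
    The alternation operator, used to specify alternatives.
(...)
    Parentheses group regular expressions, as in arithmetic.
*
    The preceding regular expression is repeated as many times as necessary to find a match (zero or more times).
+
    Similar to `*', but the preceding expression must be matched at least once. For example, the following prints every record in sample containing a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on:
    awk '/\(c[ad]+r x\)/ { print }' sample
?
    Similar to `*', but the preceding expression can be matched either once or not at all.
{n}
{n,}
{n,m}
    An interval expression. One number in the braces means the preceding regexp is repeated n times; a number followed by a comma means at least n times; two numbers mean n to m times. For example:
    wh{3}y      matches `whhhy', but not `why' or `whhhhy'.
    wh{3,5}y    matches `whhhy', `whhhhy', or `whhhhhy', only.
    wh{2,}y     matches `whhy' or `whhhy', and so on.
    Interval expressions were not traditionally available in awk. They were added as part of the POSIX standard to make awk and egrep consistent with each other. By default, gawk does not match interval expressions in regexps. If either `--posix' or `--re-interval' is specified (see section Command-Line Options), then interval expressions are allowed in regexps. In new programs, it is good practice to escape literal `{' and `}' in regexp constants with a backslash, so that the regexp works the same way in any version of awk.
In regular expressions, the `*', `+', and `?' operators, as well as the braces `{' and `}', have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.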
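Two of the points above, that `^' anchors only at the start of the whole string and that parentheses change what `+' applies to, can be checked with a short sketch. The input lines and the function names embedded_newline and grouping are hypothetical.

```shell
# ^ anchors at the start of the whole string, not after an embedded newline.
embedded_newline() {
    awk 'BEGIN {
        s = "line1\nLINE 2"
        print ((s ~ /^L/) ? "match" : "no match")   # L follows a newline only
        print ((s ~ /^l/) ? "match" : "no match")   # l starts the string
    }'
}

# (ab)+ repeats the whole group; ab+ repeats only the b.
grouping() {
    printf 'ab\nabab\nabb\n' | awk '
        /^(ab)+$/ { print "group:", $0 }
        /^ab+$/   { print "plain:", $0 }'
}

embedded_newline
grouping
```

The line `abab' is selected only by the grouped pattern and `abb' only by the ungrouped one, because `+' binds more tightly than concatenation.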
In POSIX awk and gawk, the `*', `+', and `?' operators stand for themselves when there is nothing in the regexp that precedes them. For example, `/+/' matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
If gawk is in compatibility mode (see section Command-Line Options), POSIX character classes and interval expressions are not available in regular expressions.
Within a character list, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, using the locale's collating sequence and character set. For example, in the default C locale, `[a-dx-z]' is equivalent to `[abcdxyz]'. Many locales sort characters in dictionary order, and in these locales, `[a-dx-z]' is typically not equivalent to `[abcdxyz]'; instead it might be equivalent to `[aBbCcDdxXyYz]', for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value `C'.
To include one of the characters `\', `]', `-', or `^' in a character list, put a `\' in front of it. For example:
[d\]]
matches either `d' or `]'.
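A minimal sketch of this character list in use follows, assuming a made-up three-line input and a hypothetical function name, bracket_members. The regexp is anchored at both ends so that only records consisting of exactly one listed character are selected.

```shell
# Select records that consist of a single d or a single ].
bracket_members() {
    printf 'd\n]\nx\n' | awk '/^[d\]]$/'
}

bracket_members
```

The record `x' is not in the list and is therefore not printed.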
This treatment of `\' in character lists is compatible with other awk implementations and is also mandated by POSIX. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.
A character class is only valid in a regexp inside the brackets of a character list. Character classes consist of `[:', a keyword denoting the class, and `:]'. Here are the character classes defined by the POSIX standard:
[:alnum:]    Alphanumeric characters.
[:alpha:]    Alphabetic characters.
[:blank:]    Space and tab characters.
[:cntrl:]    Control characters.
[:digit:]    Numeric characters.
[:graph:]    Characters that are both printable and visible. (A space is printable but not visible, whereas an `a' is both.)
[:lower:]    Lowercase alphabetic characters.
[:print:]    Printable characters (characters that are not control characters).
[:punct:]    Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:]    Space characters (such as space, tab, and formfeed, to name a few).
[:upper:]    Uppercase alphabetic characters.
[:xdigit:]   Characters that are hexadecimal digits.
For example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them, and if your character set collated differently from ASCII, this might not even match the ASCII alphanumeric characters. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
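The following sketch selects records made up entirely of alphanumeric characters. It assumes an awk that supports POSIX character classes (some older implementations do not), and the sample input and the function name alnum_only are hypothetical.

```shell
# Select records that contain only alphanumeric characters.
alnum_only() {
    printf 'abc123\nfoo-bar\n42\n' | awk '/^[[:alnum:]]+$/'
}

alnum_only
```

The record `foo-bar' is rejected because the hyphen is a punctuation character, not an alphanumeric one.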
Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character. They can also have several characters that are equivalent for collating, or sorting, purposes. (For example, in French, a plain "e" and a grave-accented "è" are equivalent.)
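Collating symbols
    A collating symbol is a multicharacter collating element enclosed between `[.' and `.]'. For example, if `ch' is a collating element, then [[.ch.]] is a regexp that matches this collating element, whereas [ch] is a regexp that matches either `c' or `h'.
Equivalence classes
    An equivalence class is a locale-specific name for a list of characters that are equal. The name is enclosed between `[=' and `=]'. For example, the name `e' might be used to represent all of `e', `é', and `è'. In this case, [[=e=]] is a regexp that matches any of `e', `é', or `è'.
These features are very valuable in non-English speaking locales.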
Caution: The library functions that gawk uses for regular expression matching currently only recognize POSIX character classes; they do not recognize collating symbols or equivalence classes.
gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (`_'):
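\w
    Matches any word-constituent character; that is, any letter, digit, or underscore. Think of it as shorthand for [[:alnum:]_].
\W
    Matches any character that is not word-constituent. Think of it as shorthand for [^[:alnum:]_].
\<
    Matches the empty string at the beginning of a word. For example, /\<away/ matches `away' but not `stowaway'.
\>
    Matches the empty string at the end of a word. For example, /stow\>/ matches `stow' but not `stowaway'.
\y
    Matches the empty string at either the beginning or the end of a word (i.e., the word boundary).
\B
    Matches the empty string that occurs between two word-constituent characters. For example, /\Brat\B/ matches `crate' but it does not match `dirty rat'. `\B' is essentially the opposite of `\y'.
There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. For other programs, gawk's regexp library routines consider the entire string to match as the buffer.
\`
    Matches the empty string at the beginning of a buffer (string).
\'
    Matches the empty string at the end of a buffer (string).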
Because `^' and `$' always work in terms of the beginning and end of strings, these operators don't add any new capabilities for awk. They are provided for compatibility with other GNU software.
In other GNU software, the word-boundary operator is `\b'. However, that conflicts with the awk language's definition of `\b' as backspace, so gawk uses a different letter. An alternative method would have been to require two backslashes in the GNU operators, but this was deemed too confusing. The current method of using `\y' for the GNU `\b' appears to be the lesser of two evils.
The various command-line options (see section Command-Line Options) control how gawk interprets characters in regexps:
No options
    In the default case, gawk provides all the facilities of POSIX regexps and the previously described GNU regexp operators. However, interval expressions are not supported.
--posix
    Only POSIX regexps are supported; the GNU operators are not special. Interval expressions are allowed.
--traditional
    Traditional Unix awk regexps are matched. The GNU operators are not special, interval expressions are not available, nor are the POSIX character classes ([[:alnum:]] and so on). Characters described by octal and hexadecimal escape sequences are treated literally, even if they represent regexp metacharacters.
--re-interval
    Interval expressions are allowed in regexps, even if `--traditional' has been provided.
Case is normally significant in regular expressions, both when matching ordinary characters (i.e., not metacharacters) and inside character sets. Thus, a `w' in a regular expression matches only a lowercase `w' and not an uppercase `W'.
The simplest way to do a case-independent match is to use a character list--for example, `[Ww]'. However, this can be cumbersome if you need to use it often and it can make the regular expressions harder to read. There are two alternatives that you might prefer.
One way to perform a case-insensitive match at a particular point in the program is to convert the data to a single case, using the tolower or toupper built-in string functions (which we haven't discussed yet; see section String Manipulation Functions). For example:
tolower($1) ~ /foo/ { ... }
converts the first field to lowercase before matching against it. This works in any POSIX-compliant awk.
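Here is a minimal sketch of the tolower approach, assuming a made-up input whose first field appears in mixed case. The function name fold_case is hypothetical.

```shell
# Print the second field of records whose first field contains foo,
# regardless of the case used in the input.
fold_case() {
    printf 'FOO 1\nFood 2\nbar 3\n' | awk 'tolower($1) ~ /foo/ { print $2 }'
}

fold_case
```

Note that tolower returns a lowercase copy for the comparison only; the field itself is left unchanged.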
Another method, specific to gawk, is to set the variable IGNORECASE to a nonzero value (see section Built-in Variables). When IGNORECASE is not zero, all regexp and string operations ignore case. Changing the value of IGNORECASE dynamically controls the case sensitivity of the program as it runs. Case is significant by default because IGNORECASE (like most variables) is initialized to zero:
x = "aB"
if (x ~ /ab/) ...   # this test will fail
IGNORECASE = 1
if (x ~ /ab/) ...   # now it will succeed
In general, you cannot use IGNORECASE to make certain rules case-insensitive and other rules case-sensitive, because there is no straightforward way to set IGNORECASE just for the pattern of a particular rule. To do this, use either character lists or tolower. However, one thing you can do with IGNORECASE only is dynamically turn case-sensitivity on or off for all the rules at once.
IGNORECASE can be set on the command line or in a BEGIN rule (see section Other Command-Line Arguments; also see section Startup and Cleanup Actions). Setting IGNORECASE from the command line is a way to make a program case-insensitive without having to edit it.
Prior to gawk 3.0, the value of IGNORECASE affected regexp operations only. It did not affect string comparison with `==', `!=', and so on. Beginning with version 3.0, both regexp and string comparison operations are affected by IGNORECASE.
Beginning with gawk 3.0, the equivalences between upper- and lowercase characters are based on the ISO-8859-1 (ISO Latin-1) character set. This character set is a superset of the traditional 128 ASCII characters that also provides a number of characters suitable for use with European languages.
The value of IGNORECASE has no effect if gawk is in compatibility mode (see section Command-Line Options). Case is always significant in compatibility mode.
Consider the following:
echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
This example uses the sub function (which we haven't discussed yet; see section String Manipulation Functions) to make a change to the input record. Here, the regexp /a+/ indicates "one or more `a' characters," and the replacement text is `<A>'.
The input contains four `a' characters. awk (and POSIX) regular expressions always match the leftmost, longest sequence of input characters that can match. Thus, all four `a' characters are replaced with `<A>' in this example:
$ echo aaaabcd | awk '{ sub(/a+/, "<A>"); print }'
For simple match/no-match tests, this is not so important. But when doing text matching and substitutions with the match, sub, gsub, and gensub functions, it is very important. See section String Manipulation Functions, for more information on these functions. Understanding this principle is also important for regexp-based record and field splitting (see section How Input Is Split into Records, and also see section Specifying How Fields Are Separated).
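The leftmost, longest rule can be observed directly with match, which records where the match starts and how long it is in RSTART and RLENGTH. The sketch below uses made-up input strings and hypothetical function names.

```shell
# Longest: at the leftmost position, all four a characters are taken.
longest_match() {
    echo aaaabcd | awk '{ if (match($0, /a+/)) print RSTART, RLENGTH }'
}

# Leftmost: the earlier run of two a characters wins over the later,
# longer run of three.
leftmost_first() {
    echo xaabaaa | awk '{ sub(/a+/, "<A>"); print }'
}

longest_match
leftmost_first
```

The second function shows that "leftmost" takes priority over "longest": the longest match is chosen only among those starting at the earliest possible position.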
The righthand side of a `~' or `!~' operator need not be a regexp constant (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated and converted to a string if necessary; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp:
BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp    { print }
This sets digits_regexp to a regexp that describes one or more digits, and tests whether the input record matches this regexp.
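Because a dynamic regexp is just a string, it can also be assembled from pieces at run time. The following sketch builds an alternation from two words passed in with `-v'; the variable names a, b, and re, the sample input, and the function name match_any_word are all hypothetical.

```shell
# Build the regexp (apple|plum)$ from two variables, then use it as a pattern.
match_any_word() {
    printf 'red apple\ngreen pear\nblue plum\n' |
        awk -v a=apple -v b=plum 'BEGIN { re = "(" a "|" b ")$" } $0 ~ re'
}

match_any_word
```

This kind of construction is not possible with a regexp constant, whose text is fixed when the program is written.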
Caution: When using the `~' and `!~' operators, there is a difference between a regexp constant enclosed in slashes and a string constant enclosed in double quotes. If you are going to use a string constant, you have to understand that the string is, in essence, scanned twice: the first time when awk reads your program, and the second time when it goes to match the string on the lefthand side of the operator with the pattern on the right. This is true of any string-valued expression (such as digits_regexp shown previously), not just string constants.
What difference does it make if the string is scanned twice? The answer has to do with escape sequences, and particularly with backslashes. To get a backslash into a regular expression inside a string, you have to type two backslashes.
For example, /\*/ is a regexp constant for a literal `*'. Only one backslash is needed. To do the same thing with a string, you have to type "\\*". The first backslash escapes the second one so that the string actually contains the two characters `\' and `*'.
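The two spellings can be compared side by side. This sketch assumes a made-up two-line input, and the function names star_constant and star_string are hypothetical.

```shell
# Regexp constant: one backslash is enough to make * literal.
star_constant() {
    printf 'a*b\nab\n' | awk '$0 ~ /\*/'
}

# String constant: the backslash must be doubled, because the string is
# scanned once as a string and again as a regexp.
star_string() {
    printf 'a*b\nab\n' | awk '$0 ~ "\\*"'
}

star_constant
star_string
```

Both functions select only the record that contains a literal asterisk.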
Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is "regexp constants," for several reasons:
- String constants are more complicated to write and more difficult to read, because every backslash meant for the regexp must be doubled.
- It is more efficient to use regexp constants. awk can note that you have supplied a regexp, and store it internally in a form that makes pattern matching more efficient. When using a string constant, awk must first convert the string into this internal form and then perform the pattern matching.
Using \n in Character Lists of Dynamic Regexps
Some commercial versions of awk do not allow the newline character to be used inside a character list for a dynamic regexp:
$ awk '$0 ~ "[ \t\n]"'
But a newline in a regexp constant works with no problem:
$ awk '$0 ~ /[ \t\n]/'
gawk does not have this problem, and it isn't likely to occur often in practice, but it's worth noting for future reference.