Linux Command Line

awk text processing

The awk programming language was designed for writing programs that perform rote data manipulations, such as formatting, validating, searching, or calculating. The language (and the interpreter program which runs it) automatically handles tasks such as reading input lines, splitting them into separate data fields, and storing them in appropriate variables. This allows awk programs to be smaller and simpler than equivalent programs in, say, C++, where programmers would need to handle these details themselves.

An awk program is a sequence of statements, each containing a pattern and an action.

Awk pattern-action statements

pattern {action}
pattern {action}

Awk reads the input file(s) one line at a time and, for each line, considers the patterns in the order they are written. When a pattern matches the current input line, the corresponding action is performed. A given input line may match multiple patterns or none at all.
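
The matching rule described above can be tried out with a small sketch. The three input lines and the two patterns are made up for illustration; the middle line satisfies both patterns, so both actions run for it, in the order the statements are written.

```shell
# Hypothetical input: three lines, each holding one number.
# The value 15 matches both patterns, so it produces two output lines.
out=$(printf '5\n15\n25\n' | awk '
$1 > 10 { print $1, "is greater than 10" }
$1 < 20 { print $1, "is less than 20" }
')
echo "$out"
```

Note that the output follows the input order first and the statement order second: both messages for 15 appear together, between the messages for 5 and 25.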

Many awk programs are quite short -- only one or two lines -- and can be composed at the command line, used once, and forgotten after. This makes awk a flexible tool for a wide range of tasks related to data manipulation. Alongside grep and sed, awk is an important tool both in its own right and in combination with others through the use of Linux pipes.

The name "awk" is derived from the names of its three creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.

It is pronounced the same as "auk", a family of aquatic birds, whose likeness often appears on books related to the tool.

A simple introduction to awk programs

The most common way to use awk is at the command line, providing both the program to run and the input file (or files) to run it on:

$ awk 'program' first_file second_file ...

You can also save your program text in a file and run it with the -f command-line flag, as in awk -f progfile first_file second_file ..., which is useful for more complicated (or frequently used) programs.
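
As a rough sketch of the program-file approach, the commands below write a one-statement program and a two-line input file into a temporary directory and then run them with -f. The file names number.awk and input.txt are invented for this example.

```shell
# Create a scratch directory for the hypothetical program and input files.
dir=$(mktemp -d)

# The program file: prefix every input line with its line number.
cat > "$dir/number.awk" <<'EOF'
{ print NR ": " $0 }
EOF

# A small sample input file.
printf 'first\nsecond\n' > "$dir/input.txt"

# Run the program from the file instead of quoting it on the command line.
out=$(awk -f "$dir/number.awk" "$dir/input.txt")
echo "$out"

# Clean up the scratch directory.
rm -r "$dir"
```

Keeping the program in a file avoids shell quoting problems and makes it easy to reuse.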

Input lines, fields, and data values

Awk automatically splits each input line into fields (at runs of spaces and tabs, by default) and assigns them to numbered variables. The first field is named $1, the second $2, and so on. The entire (un-split) line is named $0. The number of fields can change from line to line, so awk also provides the variable NF ("number of fields"), which stores the count for the current line.
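
A minimal sketch of field splitting, using two made-up input lines. The second line has several spaces between its words, which awk's default splitting treats as a single separator. Because NF holds the number of fields, $NF refers to the last field.

```shell
# Print the field count, the first field, and the last field of each line.
out=$(printf 'one two three\nfour   five\n' | awk '{ print NF, $1, $NF }')
echo "$out"
```

The first line reports three fields and the second reports two, even though the second contains more space characters.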

Awk treats all data as either numbers or strings of characters, but doesn't insist that a given variable be one or the other. That is, the value of $1 could be a number for one input line and a string for the next. Awk also allows you to introduce your own variables by naming them in an expression -- no declaration needed! If you use a new variable as a number, it is automatically initialized to zero as well.
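
The sketch below illustrates both points with invented input: the first field is a number on two lines and a word on another, and the variables count and sum (names chosen for this example) are used without any declaration. A string that does not look like a number counts as zero when used in arithmetic.

```shell
# count and sum start out as zero because they are first used as numbers.
# The word "hello" contributes 0 to the sum.
out=$(printf '42\nhello\n7\n' | awk '
{ count = count + 1; sum = sum + $1 }
END { print count, sum }
')
echo "$out"
```

This prints the number of lines read followed by the total, which here is 3 and 49.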

Patterns

A pattern is a question you ask about an input line; when the pattern matches the line, the corresponding action is performed. You can also write an action without a pattern, in which case the action is performed for every input line.

One pattern type is comparison, such as $3 >= 0 or NF != 4, which match when the comparison is true. These can be combined with the logical operators &&, ||, and ! to create more complicated patterns. This should feel familiar and comfortable if you've programmed in C (or C++, or Java, or...) before.
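
One possible use of combined comparisons is a simple validity check. The sketch below assumes, purely for illustration, that a well-formed line has exactly four fields and a non-negative third field, and it reports every line that breaks either rule.

```shell
# Flag lines that do not have four fields or whose third field is negative.
out=$(printf 'a b 1 d\na b -2 d\na b 3\n' | awk '
NF != 4 || $3 < 0 { print "bad line " NR ": " $0 }
')
echo "$out"
```

The first line passes both checks and is not printed; the second fails on the negative value and the third on the field count.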

Another pattern type is string-matching, such as /day/, which matches when the given pattern appears somewhere in the input line. The pattern between the forward slashes can be a literal string or a regular expression, giving you powerful search options. You can also search for a pattern within a specific field, such as $4 ~ /day/, to further control where awk looks for matches.
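
To see the difference between matching anywhere in the line and matching within one field, the sketch below runs both kinds of pattern over the same made-up input. It uses the second field rather than the fourth only to keep the sample lines short.

```shell
# Hypothetical three-line input.
input='Monday gym
holiday Friday
rest Sun'

# /day/ matches if "day" appears anywhere in the line.
anywhere=$(printf '%s\n' "$input" | awk '/day/')

# $2 ~ /day/ matches only if "day" appears in the second field.
in_field=$(printf '%s\n' "$input" | awk '$2 ~ /day/')

echo "$anywhere"
echo "---"
echo "$in_field"
```

The first command prints the first two lines, while the second prints only "holiday Friday", because "Monday gym" has "day" in its first field only.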

The special pattern BEGIN identifies a block which executes once before the first input line is read, while END executes once after the last input line is processed. These are useful for configuring the details of your script or displaying the results of a calculation after all data has been read.
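
A small sketch of the two special patterns, with invented input and an invented counter variable n: the BEGIN action prints a header before any input is read, the middle statement runs once per line, and the END action reports the result.

```shell
out=$(printf 'x\ny\nz\n' | awk '
BEGIN { print "start" }
      { n = n + 1 }
END   { print "lines read:", n }
')
echo "$out"
```

The header appears first and the count appears last, regardless of how many input lines come in between.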

Actions

An action is a statement (or statements) performed when the input line matches the corresponding pattern. If you write a pattern without an action, awk prints the matching line. This is equivalent to the action { print } (which is equivalent to { print $0 }).

The print action is also used to display portions of the input line (using the field name) in any combination you like. When multiple values are listed separated by a comma, they are printed with a space between them; without the comma, they are concatenated (i.e., printed side-by-side).
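
The effect of the comma can be seen by printing the same two fields both ways. The input line is made up for this sketch.

```shell
# First print: comma-separated values, joined by a space in the output.
# Second print: no comma, so the two fields are concatenated.
out=$(printf 'Ada Lovelace\n' | awk '{ print $1, $2; print $1 $2 }')
echo "$out"
```

The first output line is "Ada Lovelace" and the second is "AdaLovelace".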

Actions can also set or modify variable values, allowing awk to perform calculations. A simple awk program totaling the values appearing in the second field of each input line is shown below:

{ total = total + $2 }
END { print total }
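
As a usage sketch, the totaling program can be run on the command line against a few made-up lines whose second field is a quantity:

```shell
# Sum the second field of every line, then print the total at the end.
out=$(printf 'apples 3\npears 4\nplums 5\n' | awk '
{ total = total + $2 }
END { print total }
')
echo "$out"
```

For this input the program prints 12.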

Actions can consist of multiple statements (separated by semicolons) and can include more general programming constructs (such as loops and function calls) which will be familiar in syntax to anyone who has programmed in C before.
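
A short sketch of a multi-statement action with a loop: for each made-up input line it adds up all of the fields and prints the sum. The statements are separated here by newlines, which awk accepts as well as semicolons; the variable names sum and i are chosen for this example.

```shell
out=$(printf '1 2 3\n10 20\n' | awk '{
    sum = 0                             # reset for each input line
    for (i = 1; i <= NF; i = i + 1)     # visit every field in turn
        sum = sum + $i
    print sum
}')
echo "$out"
```

The for statement has the same three-part header as in C, and $i selects the field whose number is stored in i.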

The de facto standard reference for the awk programming language is The AWK Programming Language by Aho, Kernighan, and Weinberger (also known as "the grey book"), which was published in 1988.

The book is both an accessible introduction to awk and a comprehensive reference, making it a valuable addition to any collection. Look out for it on the bookshelves of your computer science faculty!

Some useful "one-liner" awk programs

Below is a selection of simple awk programs which are both instructive and (potentially) useful in their own right.

Print the total number of input lines

END { print NR }

Print the last field of every input line

{ print $NF }

Print the total number of lines that contain "Ray"

/Ray/ { nlines = nlines + 1 }
END { print nlines }

Print the largest value found in the first field and the line that contains it

$1 > max { max = $1; maxline = $0 }
END { print max, maxline }
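
Because max starts out as zero, the program above only finds the maximum when at least one value in the first field is positive. One possible variant, sketched here with made-up all-negative input, also accepts the very first line unconditionally so that max is seeded from real data:

```shell
# NR == 1 forces the first line to become the initial maximum.
out=$(printf '%s\n' '-5 low' '-2 high' '-9 lowest' | awk '
NR == 1 || $1 > max { max = $1; maxline = $0 }
END { print max, maxline }
')
echo "$out"
```

For this input the variant prints the value -2 followed by the line that contains it.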

Add a line number to every line of output

{ print NR, $0 }

Print every line which has more than 80 characters

length($0) > 80

Print the first two fields, in opposite order, of every line

{ print $2, $1 }

Print "FirstName LastName" as "LastName, FirstName"

{ print $2 ",", $1 }

Exchange the first two fields of every line, then print the entire (modified) line

{ temp = $1; $1 = $2; $2 = temp; print $0 }

Print in reverse order the fields of every line

{ for (i = NF; i > 0; i = i - 1) printf("%s ", $i)
  printf("\n")
}

References

  • Source: Aho, Alfred V., et al. "The AWK Programming Language." Addison-Wesley, 1988.

Awk is widely used even today, but is not without its limitations. Famously, Larry Wall credits a problem he couldn't (easily) solve with awk as the impetus to create the Perl programming language, which became the "Swiss Army chainsaw" of scripting languages and the glue holding together different web systems in the 1990s.
