grep search and sed stream editing

Linux commands are often described as tools which act on data. These tools read data from standard input; modify or process that data; and write the results to standard output. Linux pipes let you connect these tools like stations in an assembly line to solve problems by rearranging the tools you already know. Each Linux command you learn adds a tool to your toolbox that you can combine in new ways.

Two well-worn tools in the Linux toolbox are grep and sed. These two programs receive lines of text as input and are frequently combined with each other and with other programs. They are easy enough to use on simple problems and flexible enough to help solve complicated problems.

grep search

grep is a pattern matcher, a program which searches lines of text for patterns described by a regular expression. You can use grep to find information within text or to filter text by keeping certain lines while discarding others.

In many ways, grep is the prototype of a Linux software tool. It does one job, simply and effectively, and combines with other tools using pipes. It is an early model of the Unix programming philosophy that other Linux commands follow.

sed stream editing

sed is a stream editor, a program which modifies lines of text according to a script, or rule. The most common of these is substitution, a search-and-replace rule. You can use sed to quickly edit text in a pipeline, often to prepare it for another command to use.

You may notice overlap between sed rules and vi text editor commands. They were both inspired by ed, the first Unix text editor; using either one will make you feel more comfortable with the other.
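
As a rough sketch of how the two tools combine in a pipeline, the example below filters a few log lines with grep and then rewrites part of each surviving line with sed. The log text and the variable names are invented for illustration, and the substitution rule it uses is explained later on this page.

```shell
# Hypothetical log lines, invented for this sketch.
log=$(printf '%s\n' 'ERROR disk full' 'INFO backup done' 'ERROR network down')

# grep keeps only the lines containing "ERROR";
# sed then rewrites the first "ERROR" on each of those lines.
result=$(printf '%s\n' "$log" | grep 'ERROR' | sed 's/ERROR/error:/')

printf '%s\n' "$result"
```

Each command does one small job, and the pipe hands the output of one to the input of the next.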

Any discussion of grep and sed should also mention awk - together, they are the "Three Musketeers" of Linux text processing.

A proper introduction to awk requires its own reference page; "awk" is the name of a programming language as well as the awk command which interprets that language.

The grep pattern matcher

The grep command searches each line of input for a provided pattern, and prints the lines which contain a match.

grep - print lines matching a pattern

$ grep [option]... pattern [file]...

The pattern is a regular expression (enclosed in quotes). The file (or files) provide the lines of text to search. If you provide no filename (or the filename "-") then grep reads from standard input.
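
The two ways of supplying input can be sketched as follows. The file name menu.txt and its contents are assumptions made up for this example; the file is created in a temporary directory so nothing else is touched.

```shell
# Build a small sample file (hypothetical data) in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'apple pie' 'banana bread' 'apple tart' > "$tmpdir/menu.txt"

# Search a named file for lines containing "apple".
from_file=$(grep 'apple' "$tmpdir/menu.txt")

# With no file name, grep reads standard input (here, the pipe).
from_stdin=$(printf '%s\n' 'apple pie' 'banana bread' 'apple tart' | grep 'apple')

printf '%s\n' "$from_file"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Both forms should select the same two lines, since the input text is the same.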

Example: Calculator spelling

If you've owned or used a calculator with a seven-segment display, you may have entertained yourself with calculator spelling - typing numbers which look like words when read upside down. For example, 0.7734 appears like hello when turned around.

In 1972, a colleague asked Brian Kernighan if he could find all the words one can make in this way. The calculator produced the letters b, e, h, i, l, o, and s; so Brian used grep and the Unix word list (a text file containing one dictionary word per line) to come up with an answer:

$ grep '^[behilos]*$' /usr/share/dict/words

(The word list on cs-class is found at /usr/share/dict/words and contains 479,828 words.)

You can run this command on cs-class to print the 574 words which can be spelled using only these letters!

Source: Kernighan, Brian. UNIX A History and Memoir. USA: Kindle Direct Publishing, 2020, pp. 72-73.
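
To see how the pattern behaves without the full dictionary, here is a minimal sketch that runs the same regular expression over a six-word list. The list is invented for this example and is not taken from the Unix word list.

```shell
# A tiny stand-in for the dictionary, one word per line (hypothetical data).
words=$(printf '%s\n' 'hello' 'shell' 'grep' 'bobsleigh' 'oboe' 'sed')

# ^ and $ anchor the match to the whole line, so every character on the
# line must come from the set [behilos].
matches=$(printf '%s\n' "$words" | grep '^[behilos]*$')

printf '%s\n' "$matches"
```

Words such as "grep" and "bobsleigh" are rejected because they contain a letter outside the set (g in both cases). Note that the * also allows zero characters, so this pattern would match an empty line as well.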

There are quite a few option flags to control how grep works. A few of the most common and useful ones are listed below.

Some useful grep command-line options

-H - print the file name with each matching line (used by default if more than one file is searched).

-n - print the line number of each match (within its input file).

-v - invert the matching, so grep prints the non-matching lines instead.

-e pattern - introduces a pattern to match. This option can be repeated to provide more than one pattern. grep will print lines matching at least one of the patterns given.

-f pattern_file - introduces a file whose lines are search patterns. This is an alternative to using -e multiple times (especially if you use the same patterns again and again).
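
The options above can be tried out with a short sketch like the one below. The file names fruit.txt and patterns.txt and their contents are assumptions invented for this example, and the -H line assumes a grep that supports that option (GNU and BSD grep both do).

```shell
# Sample data and a pattern file in a temporary directory (hypothetical).
tmpdir=$(mktemp -d)
printf '%s\n' 'red apple' 'green pear' 'yellow banana' 'red cherry' > "$tmpdir/fruit.txt"
printf '%s\n' 'pear' 'banana' > "$tmpdir/patterns.txt"

# -n: prefix each matching line with its line number.
numbered=$(grep -n 'red' "$tmpdir/fruit.txt")

# -v: print the lines that do NOT match.
inverted=$(grep -v 'red' "$tmpdir/fruit.txt")

# -e: give several patterns; a line matching any one of them is printed.
either=$(grep -e 'apple' -e 'cherry' "$tmpdir/fruit.txt")

# -f: read the patterns from a file, one per line.
from_patterns=$(grep -f "$tmpdir/patterns.txt" "$tmpdir/fruit.txt")

# -H: prefix each matching line with its file name.
with_name=$(cd "$tmpdir" && grep -H 'pear' fruit.txt)

printf '%s\n' "$numbered" "$with_name"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

In this data, -v 'red' and -f patterns.txt happen to select the same two lines, which shows that a filter can often be written either by excluding what you do not want or by listing what you do.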

Ken Thompson wrote grep for a friend researching the authorship of the Federalist Papers. Soon after, his boss suggested a command for search within text. Ken offered to "think about it" and spent the evening polishing grep for sharing. This started a rumor that Ken wrote grep in one night!

The name grep comes from the ed text editor (also written by Thompson), which used the command g/re/p for 'global search for regex and print.'

The sed stream editor

The sed command reads each line of input and modifies it according to a provided rule.

sed - stream editor for filtering and transforming text

$ sed [option]... script [file]...

The script is a rule telling sed what to do with the input text. The file (or files) provide the lines of text to modify. If you provide no filename (or the filename "-") then sed reads from standard input.

The substitution rule

The essential rule for sed is substitution. This is a "search and replace" rule applied to each line of input.

sed's substitution rule

$ sed 's/pattern/replacement/'

The rule is enclosed in single quotes and has several parts, separated by slashes:

The s (for "substitution")
The pattern, a regular expression to search for
The replacement, the text to replace a match with

sed searches each line of input for a substring matching the pattern. If a match is found, the matching substring is removed and the replacement is inserted in its place. Lines without a match pass through unchanged.
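
A minimal sketch of the substitution rule, using a regular expression rather than a fixed word as the pattern. The input text and the file name rooms.txt are made up for this example.

```shell
# The pattern [0-9][0-9]* matches a run of one or more digits.
# Only the first run on the line is replaced.
from_stdin=$(printf '%s\n' 'room 42 floor 7' | sed 's/[0-9][0-9]*/N/')

# The same rule applied to a named file (hypothetical, in a temp directory).
tmpdir=$(mktemp -d)
printf '%s\n' 'room 42 floor 7' 'no numbers here' > "$tmpdir/rooms.txt"
from_file=$(sed 's/[0-9][0-9]*/N/' "$tmpdir/rooms.txt")

printf '%s\n' "$from_file"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

The second line of the file contains no digits, so sed prints it unchanged.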

Example: Changing day into night

Suppose we've got a text file named input containing the lines of text shown below.

We can ask sed to change "day" into "night" with a substitution rule:

$ sed 's/day/night/' < input

The output of this command is shown below.

Notice a few things about how sed applied the rule:

The rule is applied to each line, so "day" becomes "night" in each of the first two lines
The rule applies only to the first match in each line. On line three:
- The first word ("Days") doesn't match the (case-sensitive) pattern
- The third word ("days") is matched and substituted
- The fifth word ("days") is unchanged because a match was already found!
The rule matched the substring in "Sunday" (becoming "Sunnight")

Care must be taken in choosing the right regular expression for your pattern!
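
The behaviors noted above can be reproduced with a short sketch. The four input lines here are invented stand-ins chosen to show the same effects; they are not the contents of the original input file. The second command shows one crude way to tighten the pattern, by requiring a space before "day".

```shell
# Hypothetical input, invented to show the same effects.
text=$(printf '%s\n' 'A day at the beach' 'Every day counts' \
                     'Days and days and days' 'See you Sunday')

# The plain rule: first match per line, case-sensitive, matches inside words.
plain=$(printf '%s\n' "$text" | sed 's/day/night/')

# A stricter pattern: "day" must follow a space, so "Sunday" is left alone.
# (This would also miss a line that begins with "day".)
careful=$(printf '%s\n' "$text" | sed 's/ day/ night/')

printf '%s\n' "$plain"
```

With the plain rule the last line becomes "See you Sunnight"; with the stricter pattern it is left as "See you Sunday".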

There are a few sed options which are placed after the final slash in the substitution rule. The most important of these is g (for "global"), which tells sed to replace all matches in each line rather than just the first match.
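
A small sketch of the difference the g option makes, using a single made-up input line:

```shell
line='Days and days and days'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/day/night/')

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/day/night/g')

printf '%s\n' "$first_only" "$all_matches"
```

The capitalized "Days" is untouched in both cases, because the pattern is still case-sensitive.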