grep search and sed stream editing
Linux commands are often described as tools which act on data. These tools read data from standard input; modify or process that data; and write the results to standard output. Linux pipes let you connect these tools like stations in an assembly line to solve problems by rearranging the tools you already know. Each Linux command you learn adds a tool to your toolbox that you can combine in new ways.
Two well-worn tools in the Linux toolbox are grep and sed. These two programs receive lines of text as input and are frequently combined with each other and with other programs. They are easy enough to use on simple problems and flexible enough to help solve complicated problems.
grep search
grep is a pattern matcher, a program which searches lines of text for patterns described by a regular expression. You can use grep to find information within text or to filter text by keeping certain lines while discarding others.
In many ways, grep is the prototype of a Linux software tool. It does one job, simply and effectively, and combines with other tools using pipes. It is an early model of the Unix programming philosophy that other Linux commands follow.
sed stream editing
sed is a stream editor, a program which modifies lines of text according to a script, or rule. The most common of these is substitution, a search-and-replace rule. You can use sed to quickly edit text in a pipeline, often to prepare it for another command to use.
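As a rough illustration of the two tools working together in a pipeline, the sketch below filters lines with grep, rewrites them with sed, and hands the result to sort. The log lines are hypothetical sample data made up for this example.

```shell
# Hypothetical sample input: three log lines, one per line.
log=$(printf '%s\n' 'ERROR disk full' 'INFO started' 'ERROR network down')

# grep keeps only the ERROR lines, sed rewrites the prefix,
# and sort orders the result for the next reader or command.
summary=$(printf '%s\n' "$log" | grep 'ERROR' | sed 's/ERROR/error:/' | sort)

printf '%s\n' "$summary"
# prints:
# error: disk full
# error: network down
```

Each stage reads standard input and writes standard output, so any stage can be swapped out or extended without touching the others.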
You may notice overlap between sed rules and vi text editor commands. They were both inspired by ed, the first Unix text editor; using either one will make you feel more comfortable with the other.
Any discussion of grep and sed should also mention awk; together, they are the "Three Musketeers" of Linux text processing. A proper introduction to awk requires its own reference page; "awk" is the name of a programming language as well as the awk command which interprets that language.
The grep pattern matcher
The grep command searches each line of input for a provided pattern, and prints the lines which contain a match.
grep - print lines matching a pattern
$ grep [option]... pattern [file]...
The pattern is a regular expression (enclosed in quotes). The file (or files) provides the lines of text to search. If you provide no filename (or the filename "-") then grep reads from standard input.
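The three ways of supplying input can be sketched as follows. The file contents and the variable names are assumptions made up for this illustration; all three commands should print the same two lines.

```shell
# Create a small sample file (hypothetical contents, for illustration only).
tmpfile=$(mktemp)
printf '%s\n' 'apple pie' 'banana bread' 'apple tart' > "$tmpfile"

# 1. Search a named file.
from_file=$(grep 'apple' "$tmpfile")

# 2. Search standard input (no filename given).
from_stdin=$(grep 'apple' < "$tmpfile")

# 3. Search standard input explicitly, using "-" as the filename.
from_dash=$(cat "$tmpfile" | grep 'apple' -)

printf '%s\n' "$from_file"
# prints:
# apple pie
# apple tart

rm -f "$tmpfile"
```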
Example: Calculator spelling
If you've owned or used a calculator with a seven-segment display, you may
have entertained yourself with calculator spelling - typing numbers which
look like words when read upside down. For example, 0.7734 reads as "hello" when turned around.
In 1972, a colleague asked Brian Kernighan if he could find all the words one can make in this way. The calculator produced the letters b, e, h, i, l, o, and s; so Brian used grep and the Unix word list (a text file containing one dictionary word per line) to come up with an answer:
$ grep "^[behilos]*$" /usr/share/dict/words
(The word list on cs-class is found at /usr/share/dict/words and contains 479,828 words.)
You can run this command on cs-class to print the 574 words which can be spelled using only these letters!
Source: Kernighan, Brian. UNIX A History and Memoir. USA: Kindle Direct Publishing, 2020, pp. 72-73.
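To see how the pattern works without the full word list, here is a minimal sketch that runs the same regular expression over a handful of words. The sample words are hypothetical and chosen only to show which lines pass the filter.

```shell
# Hypothetical stand-in for the word list: one word per line.
sample_words=$(printf '%s\n' hello world shell grep bliss oboe Bob)

# ^ and $ anchor the match to the whole line; [behilos]* allows any
# number of the seven calculator letters and nothing else.
calculator_words=$(printf '%s\n' "$sample_words" | grep '^[behilos]*$')

printf '%s\n' "$calculator_words"
# prints:
# hello
# shell
# bliss
# oboe
```

"Bob" is rejected because the match is case-sensitive and the bracket expression lists only lowercase letters. Single quotes are used here so the shell passes `$` and `*` to grep untouched.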
There are quite a few option flags to control how grep works. A few of the most common and useful ones are listed below.
Some useful grep command-line options
-H - print the file name of each matching line (used by default if more than one file is searched).
-n - print the line number of each match (within its input file).
-v - invert the matching, so grep prints the non-matching lines instead.
-e pattern - introduces a pattern to match. This option can be repeated to provide more than one pattern. grep will print lines matching at least one of the patterns given.
-f pattern_file - introduces a file whose lines are search patterns. This is an alternative to using -e multiple times (especially if you use the same patterns again and again).
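The options above can be tried out on a small scratch file. This is only a sketch: the file names fruit.txt and patterns.txt and their contents are invented for the example.

```shell
# Build a scratch directory with hypothetical sample files.
workdir=$(mktemp -d)
printf '%s\n' 'red apple' 'green pear' 'yellow banana' 'red cherry' > "$workdir/fruit.txt"
printf '%s\n' 'pear' 'banana' > "$workdir/patterns.txt"

# -n: prefix each match with its line number.
numbered=$(grep -n 'red' "$workdir/fruit.txt")
# 1:red apple
# 4:red cherry

# -v: print the lines that do NOT match.
inverted=$(grep -v 'red' "$workdir/fruit.txt")
# green pear
# yellow banana

# -e (repeated): print lines matching at least one of the patterns.
either=$(grep -e 'pear' -e 'cherry' "$workdir/fruit.txt")
# green pear
# red cherry

# -f: read the patterns from a file, one pattern per line.
from_patterns=$(grep -f "$workdir/patterns.txt" "$workdir/fruit.txt")
# green pear
# yellow banana

# -H: prefix each match with the file name, even for a single file.
with_name=$(cd "$workdir" && grep -H 'pear' fruit.txt)
# fruit.txt:green pear

printf '%s\n' "$numbered" "$inverted" "$either" "$from_patterns" "$with_name"
rm -rf "$workdir"
```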
Ken Thompson wrote grep for a friend researching the authorship of the Federalist Papers. Soon after, his boss suggested a command for searching within text. Ken offered to "think about it" and spent the evening polishing grep for sharing. This started a rumor that Ken wrote grep in one night!
The name grep comes from the ed text editor (also written by Thompson), which used the command g/re/p for 'global search for regex and print.'
The sed stream editor
The sed command reads each line of input and modifies it according to a provided rule.
sed - stream editor for filtering and transforming text
$ sed [option]... script [file]...
The script is a rule telling sed what to do with the input text. The file (or files) provides the lines of text to modify. If you provide no filename (or the filename "-") then sed reads from standard input.
The substitution rule
The essential rule for sed is substitution. This is a "search and replace" rule applied to each line of input.
sed's substitution rule
$ sed 's/pattern/replacement/'
The rule is enclosed in single quotes and has several parts, separated by slashes:
- The s (for "substitution")
- The pattern, a regular expression to search for
- The replacement, the text to replace a match with
sed searches each line of input for a substring matching the pattern. If a match is found, the matching substring is removed and the replacement is inserted in its place. Lines with no match pass through unchanged.
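A minimal sketch of the substitution rule in action, using made-up input strings. The first command replaces a literal word; the second uses a regular expression as the pattern.

```shell
# Literal pattern: replace "world" with "Linux".
greeting=$(printf '%s\n' 'hello world' | sed 's/world/Linux/')
printf '%s\n' "$greeting"
# prints: hello Linux

# Regular-expression pattern: [0-9][0-9]* matches a run of digits.
# Only the first run on the line is replaced.
masked=$(printf '%s\n' 'room 42 floor 7' | sed 's/[0-9][0-9]*/N/')
printf '%s\n' "$masked"
# prints: room N floor 7
```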
Example: Changing day into night
Suppose we've got a text file named input containing the lines of text shown below.
We can ask sed to change "day" into "night" with a substitution rule:
$ sed 's/day/night/' < input
The output of this command is shown below.
Notice a few things about how sed applied the rule:
- The rule is applied to each line, so "day" becomes "night" in each of the first two lines
- The rule applies to the first match in each line; on line three:
  - The first word ("Days") doesn't match the (case-sensitive) pattern
  - The third word ("days") is matched and substituted
  - The fifth word ("days") is unchanged because a match was already found!
- The rule matched the substring in "Sunday" (becoming "Sunnight")
Care must be taken in choosing the right regular expression for your pattern!
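These behaviours can be reproduced with a small sketch. The three lines below are hypothetical sample text written for this illustration, not the original input file.

```shell
# Hypothetical sample lines (one per line).
sample=$(printf '%s\n' 'What a fine day' 'Days and days and days' 'See you Sunday')

# Apply the substitution rule to every line.
result=$(printf '%s\n' "$sample" | sed 's/day/night/')

printf '%s\n' "$result"
# prints:
# What a fine night
# Days and nights and days
# See you Sunnight
```

On the second line, "Days" is skipped because of its capital letter, the first lowercase "days" is changed, and the last one is left alone. On the third line, the pattern matches inside "Sunday", which is probably not what was intended.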
There are a few sed options which are placed after the final slash in the substitution rule. The most important of these is g (for "global"), which tells sed to replace all matches in each line rather than just the first match.
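The difference the g option makes can be seen by running the same rule with and without it. The input line is a hypothetical example.

```shell
# Hypothetical sample line with two lowercase matches for "day".
line='Days and days and days'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/day/night/')
printf '%s\n' "$first_only"
# prints: Days and nights and days

# With g: every match on the line is replaced.
all_matches=$(printf '%s\n' "$line" | sed 's/day/night/g')
printf '%s\n' "$all_matches"
# prints: Days and nights and nights
```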