Pipes and combining commands
A pipe is a connection between two commands. The pipe receives the output from the first command and provides it as input to the second command. Pipes let you combine the Linux commands you already know to perform complicated jobs that no single command can handle.
Combining existing programs in new ways is a key part of the Unix programming philosophy, named for the Unix operating system where it became popular.
The Unix Programming Philosophy
"... at its heart is an idea that the power of a system comes more from the relationships among programs than from the programs themselves. Many UNIX programs do quite trivial things in isolation but, combined with other programs, become general and useful tools."
Linux is a descendant of Unix and inherits this philosophy - along with many features and design choices that support it.
Combining commands with pipes
You tell bash to create a pipe between two commands with the pipe symbol ("|"), usually found on the same key as the backslash.
Creating a pipe between two commands
$ command1 [option]... [argument]... | command2 [option]... [argument]...
Data flows left-to-right; the standard output from the first command passes to the second command's standard input. The first command's standard error is not redirected by the pipe, and appears in your terminal (along with all of the second command's output).
You are not limited to a single pipe, either. You can chain together as many commands as necessary (even different invocations of the same command) to accomplish your goal.
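For example, here is a sketch of a three-command chain that counts how many files in the current directory have names ending in .txt (the grep and wc commands are covered later on this page):
$ ls | grep '\.txt$' | wc -l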
Shortly after adding pipes to Unix, programmers Ken Thompson and Dennis Ritchie "upgraded every command on the system in a single night" to take advantage of them. This mainly required allowing each to read data from stdin when no other source was specified.
Pipes also led to the invention of the stderr stream in order to keep error messages separate from piped data.
References
- Kernighan, Brian. UNIX: A History and Memoir. Kindle Direct Publishing, 2020, p. 69.
Combining command line tools
A common description of Linux commands is as tools for operating on data. These tools receive data from standard input; modify or process that data; and send the results to standard output.
Linux pipes let you connect these tools like steps in an assembly line to do complicated work over multiple steps without writing a new program to do the work. As you learn more Linux commands, you add tools to your toolbox that you can combine in new ways.
Tools to manage terminal output
Some commands can print a lot of output very quickly, flooding your terminal and making the information you care about harder to find. Piping that output to another command lets you limit what's printed or examine it in a more controlled way.
The head and tail commands limit output by printing only the first (or last) few lines they receive as input. A terminal pager like less displays only one screen of data at a time and gives you keyboard controls (similar to those used by man pages) to navigate the text.
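For example, this sketch uses head to peek at just the beginning of a long directory listing; replacing head with tail would show the last five lines instead:
$ ls -l /usr/bin | head -n 5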
less - file perusal filter (opposite of more)
$ less [option]... [file]...
Try it yourself: Managing the output of ls -l in a really big directory
(Part 1) What happens when you run the following command on cs-class? (Hint: connect to cs-class using ssh and try it)
$ ls -l /usr/bin
You just flooded your terminal with output! The /usr/bin directory has over 1800 items in it, and ls -l prints one item per line.
(Part 2) Can you think of a "piped command" that would let you view this output in a more manageable way? (Hint: try the less command above).
Use a pipe to connect ls -l to less:
$ ls -l /usr/bin | less
You'll now see only one screen's worth of output, with keyboard controls to navigate the rest.
Tools to copy terminal output
Another useful tool is tee, which works like a tee junction in plumbing.
tee - read from standard input and write to standard output and files
$ tee [option]... [file]...
You can use tee to capture a program's output in a file while still displaying it in your terminal, or passing it on to another program through a pipe.
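For example, this sketch (using the hypothetical file name listing.txt) saves a copy of a directory listing while also paging through it:
$ ls -l /usr/bin | tee listing.txt | less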
Example: Using tee to capture user input and program output
Suppose you wrote an interactive program in C++ (compiled as a.out) that asks the user questions and processes their responses. The tee command can be used (twice) to capture both the user's inputs and the program's responses in a single log file.
$ tee -a log.txt | ./a.out | tee -a log.txt
The pipeline above has three parts:
- The first tee receives standard input (from the user) and sends it both to the file log.txt and to standard output, used in the next step.
- The a.out program receives standard input (from tee) containing the user's inputs; processes them; and prints the responses to standard output for the last step.
- The second tee receives the output of a.out and sends it both to log.txt and to standard output, displayed in the terminal for the user to read.
The option -a tells tee to append to the log file rather than overwriting it - so the two copies of tee don't take turns overwriting each other's work.
Yet another useful program is xargs, whose job is to transform standard input into command-line arguments for another program.
xargs - build and execute command lines from standard input
$ xargs [option]... [command [initial-args]]
A classic use-case for xargs is with programs which take a list of filenames as command-line arguments. A command like ls can be used to print a list of filenames to standard output, which xargs can transform into an argument list. (See two different examples further down this page!)
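For example, this sketch hands every filename in the current directory to wc as command-line arguments (assuming none of the names contain spaces):
$ ls | xargs wc -l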
Tools to compare data
You can compare a command's output to "known" output line-by-line using diff. A dash ("-") as a filename tells diff to compare standard input to the named file.
diff - compare files line by line
$ diff [option]... files
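For example, this sketch (with a hypothetical program a.out and a hypothetical reference file expected.txt) checks a program's output against known-good output:
$ ./a.out | diff - expected.txt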
If you need to identify byte-by-byte differences (especially if the data isn't plain text), you can use cmp to do the job.
cmp - compare two files byte by byte
$ cmp [option]... file1 [file2]
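cmp accepts the same dash ("-") convention, so a similar sketch (again with hypothetical names) compares a program's output to a reference file byte by byte:
$ ./a.out | cmp - expected.bin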
Tools to filter and transform data
Many Linux commands behave like filters (a la Instagram, though the term comes from signal processing) which transform input data into a modified form for later use. Some simple transformations are handled by the commands below.
The sort command outputs all input lines in sorted order.
sort - sort lines of text files
$ sort [option]... [file]...
The uniq command compares adjacent lines of text and outputs only one copy of each "unique" occurrence. (Because it only detects adjacent duplicates, its input is usually sorted first.)
uniq - report or omit repeated lines
$ uniq [option]... [input]
The word count command wc counts not only words of input but also lines and individual characters. Command-line options are used to choose which counts it outputs.
wc - print newline, word, and byte counts for each file
$ wc [option]... [file]...
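These three filters combine naturally in a pipeline. For example, this sketch (with a hypothetical file names.txt) counts how many distinct lines the file contains:
$ sort names.txt | uniq | wc -l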
Tools to process text streams
Since so many Linux files contain plain text, tools that can search, format, or process text are particularly useful to learn about. The tools below are explored in greater detail on other pages.
- grep is a pattern-matching tool. It searches each line of input for a particular pattern, and prints the lines that match.
- sed is a stream editor. It can use a rule to edit or transform each line of input provided and print the results.
- awk is both a command and a programming language. It specializes in operations common in data extraction, manipulation, and report-generation.
In particular, grep is often given as the quintessential example of a Linux command-line tool, and its pattern-matching behavior proves useful in many different situations.
A few examples of piped commands using grep, sed, and xargs are given below.
Example: Counting the number of long lines in C++ source code
Programmers agree that long lines ("wide code") are harder to read, though they argue about how many characters is "too wide." The Google C++ Style Guide mandates an 80-character line limit, but admits this is a controversial choice.
Suppose you are a COSC 051 TA (or an attentive student) who wants to check for overlong lines in your main.cpp. You can use grep (and a suitable regular expression) to search for the bad lines.
$ grep -n -e '.\{81\}' main.cpp
The pattern '.\{81\}' ("81 characters of any kind") will only match lines which contain at least 81 characters (lines that are "too wide"), so grep will print those "bad" lines while filtering the rest. The -n option prints the line number of each match (so you can quickly find and split the long lines).
What if your programming project contains more than one source code file? You could type out each filename by hand, but you know a command for listing filenames: the ls command! Let's use a pipe to solve the problem instead.
$ ls | xargs grep -n -e '.\{81\}'
Here ls provides all the filenames, and xargs uses them as arguments to grep so it will search each file.
What if your project directory contains files that aren't C++ source code (like a .pdf specification)? ls names all the files, but we can filter them with grep.
$ ls | grep -e '\.cpp$' -e '\.h$' | xargs grep -n -e '.\{81\}'
The new grep checks the filenames provided by ls and filters out any that don't end in either .cpp or .h (that is, any that aren't C++ source code). Only the remaining files are passed to xargs and searched for long lines.
Example: Staging files for a git commit
Git is a version control program that creates snapshots (called commits) of your coding project as you work. You tell git which changes to save by adding them to a draft version of the next commit. This is done with the git add command.
$ git add file1 [file2]...
Suppose I'm writing a custom class, order, with an associated header and implementation file as well as client code in main.cpp.
I can ask git to print information about files that have changed since my last commit.
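The usual command for this is git status, whose output (abbreviated here, with filenames from the hypothetical project above) lists each changed file on its own line:
$ git status
...
	modified:   order.h
	modified:   order.cpp
	modified:   main.cpp
...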
This output is human-readable, but I can't use it with git add as-is. First, let's use grep to pick out lines with names of modified files. These all start with modified:.
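A sketch of that first step:
$ git status | grep 'modified:'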
Now I've got the filenames I want to add, but they aren't "clean" - the modified: shouldn't be there! Next, let's use sed to trim off that unwanted bit by replacing it with nothing.
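One possible sed rule strips the label and any whitespace around it (the exact spacing in git's output may vary, so treat this pattern as a sketch):
$ git status | grep 'modified:' | sed -e 's/.*modified:[[:space:]]*//'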
Great! Now we've got a list of files to add with one file per line. Last, let's use xargs to make that list of filenames into arguments for git add to stage.
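Putting it all together (using the same sketched sed rule, and assuming the filenames contain no spaces):
$ git status | grep 'modified:' | sed -e 's/.*modified:[[:space:]]*//' | xargs git add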
Voila! The modified files are now staged for your commit!
The nice thing about this pipeline is that it works equally well when there are three, thirty, or three hundred modified files - more than is practical to type by hand!