sed & awk

Chapter 6: Advanced sed Commands

6.3 Hold That Line

The pattern space is a buffer that contains the current input line. There is also a set-aside buffer called the hold space. The contents of the pattern space can be copied to the hold space and the contents of the hold space can be copied to the pattern space. A group of commands allows you to move data between the hold space and the pattern space. The hold space is used for temporary storage, and that's it. Individual commands can't address the hold space or alter its contents.

The most frequent use of the hold space is to have it retain a duplicate of the current input line while you change the original in the pattern space. The commands that move data between the two buffers are:

Command    Abbreviation   Function
Hold       h or H         Copy or append contents of pattern space to hold space.
Get        g or G         Copy or append contents of hold space to pattern space.
Exchange   x              Swap contents of hold space and pattern space.

Each of these commands can take an address that specifies a single line or a range of lines. The hold (h,H) commands move data into the hold space and the get (g,G) commands move data from the hold space back into the pattern space. The difference between the lowercase and uppercase versions of the same command is that the lowercase command overwrites the contents of the target buffer, while the uppercase command appends to the buffer's existing contents.

The hold command replaces the contents of the hold space with the contents of the pattern space. The get command replaces the contents of the pattern space with the contents of the hold space.

The Hold command puts a newline followed by the contents of the pattern space after the contents of the hold space. (The newline is appended to the hold space even if the hold space is empty.) The Get command puts a newline followed by the contents of the hold space after the contents of the pattern space.

The exchange command swaps the contents of the two buffers. It has no side effects on either buffer.
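The difference between the overwriting and appending forms is easiest to see by running them. The following is a minimal sketch rather than part of the original example set; the two-line input and the variable names with_H and with_h are made up for illustration. Each script collects every input line in the hold space and, on the last line, swaps the result into the pattern space and turns the embedded newlines into commas so they are visible.

```shell
# H always puts a newline in front of what it appends,
# even when the hold space is still empty.
with_H=$(printf 'a\nb\n' | sed -n '
H
${
x
s/\n/,/g
p
}')

# Using h for the first line overwrites the (empty) hold space instead,
# so no leading newline is stored.
with_h=$(printf 'a\nb\n' | sed -n '
1h
1!H
${
x
s/\n/,/g
p
}')

echo "$with_H"   # ,a,b
echo "$with_h"   # a,b
```

The leading comma in the first result is the newline that Hold placed in the empty hold space.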

Let's use a trivial example to illustrate putting lines into the hold space and retrieving them later. We are going to write a script that reverses pairs of lines. For a sample file, we'll use a list of numbers:

1
2
11
22
111
222

The object is to reverse the order of the lines beginning with 1 and the lines beginning with 2. Here's how we use the hold space: we copy the first line to the hold space - and hold on to it - while we clear the pattern space. Then sed reads the second line into the pattern space and we append the line from the hold space to the end of the pattern space. Look at the script:

# Reverse flip
/1/{
h
d
}
/2/{
G
}

Any line matching a "1" is copied to the hold space and deleted from the pattern space. Control passes to the top of the script and the line is not printed. When the next line is read, it matches the pattern "2" and the line that had been copied to the hold space is now appended to the pattern space. Then both lines are printed. In other words, we save the first line of the pair and don't output it until we match the second line.

Here's the result of running the script on the sample file:

$ sed -f sed.flip test.flip
2
1
22
11
222
111

The hold command followed by the delete command is a fairly common pairing. Without the delete command, control would reach the bottom of the script and the contents of the pattern space would be output. If the script used the next (n) command instead of the delete command, the contents of the pattern space would also be output. You can experiment with this script by removing the delete command altogether or by putting a next command in its place. You could also see what happens if you use g instead of G.
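Here is one way to try those experiments from the command line. This is only a sketch: it uses the first four lines of the sample file, and the variable names (input, flip, with_g, no_d) are ours, not part of the original script.

```shell
input=$(printf '1\n2\n11\n22\n')

# The script as shown: hold and delete, then append with G.
flip=$(printf '%s\n' "$input" | sed '
/1/{
h
d
}
/2/{
G
}')

# g instead of G: the held line overwrites the "2" line.
with_g=$(printf '%s\n' "$input" | sed '
/1/{
h
d
}
/2/{
g
}')

# No delete command: the "1" line is printed once on its own
# and again after the "2" line.
no_d=$(printf '%s\n' "$input" | sed '
/1/{
h
}
/2/{
G
}')

printf '%s\n' "$flip"     # 2 1 22 11, one per line
printf '%s\n' "$with_g"   # 1 11
printf '%s\n' "$no_d"     # 1 2 1 11 22 11
```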

Note that the logic of this script is poor, though the script is useful for demonstration purposes. If a line matches the first instruction and the next line fails to match the second instruction, the first line will not be output. This is a hole down which lines disappear.
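Two small inputs, invented here for illustration, show the hole. In the first, a second "1" line overwrites the held line before any "2" line arrives; in the second, the input ends before a "2" line is seen. The variable names overwritten and never_flushed are hypothetical.

```shell
# The line "1" is held, then replaced in the hold space by "11".
overwritten=$(printf '1\n11\n2\n' | sed '
/1/{
h
d
}
/2/{
G
}')

# The line "1" is held, but no later line matches /2/ to retrieve it.
never_flushed=$(printf '1\n3\n' | sed '
/1/{
h
d
}
/2/{
G
}')

printf '%s\n' "$overwritten"     # 2 then 11; the line "1" is lost
printf '%s\n' "$never_flushed"   # 3; the line "1" is lost
```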

6.3.1 A Capital Transformation

In the previous chapter, we introduced the transform command (y) and described how it can exchange lowercase letters for uppercase letters on a line. Since this command acts on the entire contents of the pattern space, it is something of a chore to do a letter-by-letter transformation of a portion of the line. But it is possible, though convoluted, as the following example will demonstrate.

While working on a programming guide, we found that the names of statements were entered inconsistently. They needed to be uppercase, but some were lowercase while others had an initial capital letter. While the task was simple - to capitalize the name of the statement - there were nearly 100 statements and it seemed a tedious project to write that many explicit substitutions of the form:

s/find the Match statement/find the MATCH statement/g

The transform command could do the lowercase-to-uppercase conversion but it applies the conversion to the entire line. The hold space makes this task possible because we use it to store a copy of the input line while we isolate and convert the statement name in the pattern space. Look at the script first:

# capitalize statement names
/the .* statement/{
h
s/.*the \(.*\) statement.*/\1/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.*\)\n\(.*the \).*\( statement.*\)/\2\1\3/
}

The address limits the procedure to lines that match "the .* statement". Let's look at what each command does:

h

The hold command copies the current input line into the hold space. Using the sample line "find the Match statement," we'll show the contents of the pattern space and of the hold space. After the h command, both the pattern space and the hold space are identical.

Pattern Space: find the Match statement
Hold Space: find the Match statement

s/.*the \(.*\) statement.*/\1/

The substitute command extracts the name of the statement from the line and replaces the entire line with it.

Pattern Space: Match
Hold Space: find the Match statement

y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/

The transform command changes each lowercase letter to an uppercase letter.

Pattern Space: MATCH
Hold Space: find the Match statement

G

The Get command appends the line saved in the hold space to the pattern space.

Pattern Space: MATCH\nfind the Match statement
Hold Space: find the Match statement

s/\(.*\)\n\(.*the \).*\( statement.*\)/\2\1\3/

The substitute command matches three different parts of the pattern space: 1) all characters up to the embedded newline, 2) all characters following the embedded newline and up to and including "the" followed by a space, and 3) all characters beginning with a space and followed by "statement" up to the end of the pattern space. The name of the statement as it appeared in the original line is matched but not saved. The replacement section of this command recalls the saved portions and reassembles them in a different order, putting the capitalized name of the statement in between "the" and "statement."

Pattern Space: find the MATCH statement
Hold Space: find the Match statement

Let's look at a test run. Here's our sample file:

find the Match statement
Consult the Get statement.
using the Read statement to retrieve data

Running the script on the sample file produces:

find the MATCH statement
Consult the GET statement.
using the READ statement to retrieve data

As you can see from this script, skillful use of the hold space can aid in isolating and manipulating portions of the input line.
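To try this without creating files, the script can be fed its sample lines directly from the shell. The sketch below is our own wrapper around the script shown above; the variable names caps and multi, and the extra input line with two statement names, are made up for illustration.

```shell
# The three sample lines from the text.
caps=$(printf '%s\n' \
    'find the Match statement' \
    'Consult the Get statement.' \
    'using the Read statement to retrieve data' |
sed '
/the .* statement/{
h
s/.*the \(.*\) statement.*/\1/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.*\)\n\(.*the \).*\( statement.*\)/\2\1\3/
}')

# A hypothetical line that names two statements.
multi=$(printf '%s\n' 'use the Get statement or the Read statement' |
sed '
/the .* statement/{
h
s/.*the \(.*\) statement.*/\1/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.*\)\n\(.*the \).*\( statement.*\)/\2\1\3/
}')

printf '%s\n' "$caps"
printf '%s\n' "$multi"   # use the Get statement or the READ statement
```

Because the leading ".*" is greedy, only the last statement name on a line is converted. A line that names two statements would need a loop, which this script does not attempt.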

6.3.2 Correcting Index Entries (Part II)

In the previous chapter, we looked at a shell script named index.edit. This script extracts index entries from one or more files and automatically generates a sed script consisting of a substitute command for each index entry. We mentioned that a small failing of the script was that it did not look out for regular expression metacharacters that appeared as literals in an index entry, such as the following:

.XX "asterisk (*) metacharacter"

After processing this entry, the original index.edit generated the following substitute command:

/^\.XX /s/asterisk (*) metacharacter/asterisk (*) metacharacter/

While it "knows" to escape the period before ".XX", it doesn't protect the metacharacter "*". The problem is that the pattern "(*)" does not match "(*)" and the substitute command would fail to be applied. The solution is to modify index.edit so it looks for metacharacters and escapes them. There's one more twist: a different set of metacharacters is recognized in the replacement string.

We have to maintain two copies of the index entry. The first copy we edit to escape regular expression metacharacters and then use for the pattern. The second copy we edit to escape the metacharacters special to the replacement string. The hold space keeps the second copy while we edit the first copy, then we swap the two and edit the second copy. Here's the script:

#! /bin/sh
# index.edit -- compile list of index entries for editing
#		new version that matches metacharacters
grep "^\.XX" "$@" | sort -u |
sed '
h
s/[][\\*.]/\\&/g
x
s/[\\&]/\\&/g
s/^\.XX //
s/$/\//
x
s/^\\\.XX \(.*\)$/\/^\\.XX \/s\/\1\//
G
s/\n//'

The hold command puts a copy of the current index entry into the hold space. Then the substitute command looks for any of the following metacharacters: "]", "[", "\", "*" or ".". This regular expression is rather interesting: 1) if the close bracket is the first character in a character class, it is interpreted literally, not as the closing delimiter of the class; and 2) of the metacharacters specified, only the backslash is escaped inside the character class. (POSIX bracket expressions treat a backslash as a literal character, so the doubled backslash is harmless there; the class matches a backslash either way.) Also, there is no need to escape the metacharacters "^" and "$" because they only have special meaning if in the first or last positions in a regular expression, respectively, which is impossible given the structure of the index entry. After escaping the metacharacters, the exchange command swaps the contents of the pattern space and the hold space.

Starting with a new copy of the line, the substitute command adds a backslash to escape either a backslash or an ampersand for the replacement string. Then another substitute command removes the ".XX" from the line and the one after that appends a slash (/) to the end of the line, preparing a replacement string that looks like:

"asterisk (*) metacharacter"/

Once again, the exchange command swaps the pattern space and the hold space. With the first copy in the pattern space, we need to prepare the pattern address and the substitute pattern. The substitute command saves the index entry, and replaces the line with the first part of the syntax for the substitute command:

\/^\\.XX \/s\/\1\//

Using the sample entry, the pattern space would have the following contents:

/^\.XX /s/"asterisk (\*) metacharacter"/

Then the Get command takes the replacement string in the hold space and appends it to the pattern space. Because Get also inserts a newline, the substitute command is necessary to remove it. The following line is output at the end:

/^\.XX /s/"asterisk (\*) metacharacter"/"asterisk (*) metacharacter"/
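The sed portion of the script can be checked on its own by piping index entries into it. In the sketch below, the first entry is the sample from the text; the second entry (with an ampersand) and the variable name gen are our own additions, chosen to exercise the replacement-side escaping.

```shell
gen=$(printf '%s\n' \
    '.XX "asterisk (*) metacharacter"' \
    '.XX "AT&T"' |
sed '
h
s/[][\\*.]/\\&/g
x
s/[\\&]/\\&/g
s/^\.XX //
s/$/\//
x
s/^\\\.XX \(.*\)$/\/^\\.XX \/s\/\1\//
G
s/\n//')

printf '%s\n' "$gen"
# /^\.XX /s/"asterisk (\*) metacharacter"/"asterisk (*) metacharacter"/
# /^\.XX /s/"AT&T"/"AT\&T"/
```

In the second line of output the ampersand is left alone in the pattern, where it is not special, and escaped in the replacement, where it is.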

6.3.3 Building Blocks of Text

The hold space can be used to collect a block of lines before outputting them. Some troff requests and macros are block-oriented, in that commands must surround a block of text. Usually a code at the beginning enables the format and one at the end disables the format. HTML-coded documents also contain many block-oriented constructs. For instance, "<p>" begins a paragraph and "</p>" ends it. In the next example, we'll look at placing HTML-style paragraph tags in a plain text file. For this example, the input is a file containing variable-length lines that form paragraphs; each paragraph is separated from the next one by a blank line. Therefore, the script must collect all lines in the hold space until a blank line is encountered. The contents of the hold space are retrieved and surrounded with the paragraph tags.

Here's the script:

/^$/!{
     H
     d
     }
/^$/{
     x
     s/^\n/<p>/
     s/$/<\/p>/
     G
     }

Running the script on a sample file produces:

<p>My wife won't let me buy a power saw.  She is afraid of an
accident if I use one.
So I rely on a hand saw for a variety of weekend projects like
building shelves.
However, if I made my living as a carpenter, I would
have to use a power
saw.  The speed and efficiency provided by power tools
would be essential to being productive.</p>

<p>For people who create and modify text files,
sed and awk are power tools for editing.</p>

<p>Most of the things that you can do with these programs
can be done interactively with a text editor.  However,
using these programs can save many hours of repetitive
work in achieving the same result.</p>

The script has basically two parts, one for each address. We do one thing if the input line is not blank and a different thing if it is. If the input line is not blank, it is appended to the hold space (with H), and then deleted from the pattern space. The delete command prevents the line from being output and clears the pattern space. Control passes back to the top of the script and a new line is read. The general idea is that we don't output any line of text; it is collected in the hold space.

If the input line is blank, we process the contents of the hold space. To illustrate what the second procedure does, let's use the second paragraph in the previous sample file and show what happens. After a blank line has been read, the pattern space and the hold space have the following contents:

Pattern Space: ^$
Hold Space: \nFor people who create and modify text files,\nsed and awk are power tools for editing.

A blank line in the pattern space is represented as "^$", the regular expression that matches it. The embedded newlines are represented in the hold space by "\n". Note that the Hold command puts a newline in the hold space and then appends the current line to the hold space. Even when the hold space is empty, the Hold command places a newline before the contents of the pattern space.

The exchange command (x) swaps the contents of the hold space and the pattern space. The blank line is saved in the hold space so we can retrieve it at the end of the procedure. (We could insert a newline in other ways, also.)

Pattern Space: \nFor people who create and modify text files,\nsed and awk are power tools for editing.
Hold Space: ^$

Now we make two substitutions: placing "<p>" at the beginning of the pattern space and "</p>" at the end. The first substitute command matches "^\n" because a newline is at the beginning of the pattern space as a consequence of the Hold command. The second substitute command matches the end of the pattern space. ("$" matches only at the very end of the pattern space, not at the embedded newlines.)

Pattern Space: <p>For people who create and modify text files,\nsed and awk are power tools for editing.</p>
Hold Space: ^$

Note that the embedded newline is preserved in the pattern space. The last command, G, appends the blank line in the hold space to the pattern space. Upon reaching the bottom of the script, sed outputs the paragraph we had collected in the hold space and coded in the pattern space.
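The walk-through mentions that the separating newline could be inserted in other ways. One possibility, sketched here as our own variation rather than the book's script, is to drop the G command and put the newline in the replacement string with a backslash followed by a literal newline. The four-line input and the variable name alt are made up for illustration.

```shell
alt=$(printf 'one\ntwo\n\nthree\nfour\n\n' | sed '
/^$/!{
H
d
}
/^$/{
x
s/^\n/<p>/
s/$/<\/p>\
/
}')

printf '%s\n' "$alt"
```

The exchange command still leaves the empty line in the hold space, so collection for the next paragraph starts from an empty buffer just as in the original version.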

This script illustrates the mechanics of collecting input and holding on to it until another pattern is matched. It's important to pay attention to flow control in the script. The first procedure in the script does not reach bottom because we don't want any output yet. The second procedure does reach bottom, so the pattern space is output and then cleared. The hold space is empty at that point as well, because the exchange command swapped the blank line into it, and we can begin collecting lines for the next paragraph.

This script also illustrates how to use addressing to set up exclusive addresses, in which a line must match one or the other address. You can also set up addresses to handle various exceptions in the input and thereby improve the reliability of a script. For instance, in the previous script, what happens if the last line in the input file is not blank? None of the lines collected since the last blank line will be output. There are several ways to handle this, but a rather clever one is to manufacture a blank line that the blank-line procedure will match later in the script. In other words, if the last line contains text, we append it to the hold space and then empty the pattern space with the substitute command. The current line is now blank, so it matches the procedure that outputs what has been collected in the hold space. Here's the procedure:

${
/^$/!{
     H
     s/.*//
     }
}

This procedure must be placed in the script before the two procedures shown earlier. The addressing symbol "$" matches only the last line in the file. Inside this procedure, we test for lines that are not blank. If the line is blank, we don't have to do anything with it. If the current line is not blank, then we append it to the hold space. This is what we do in the other procedure that matches a non-blank line. Then we use the substitute command to create a blank line in the pattern space.

Upon exiting this procedure, there is a blank line in the pattern space. It matches the subsequent procedure for blank lines that adds the HTML paragraph codes and outputs the paragraph.
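Putting the three procedures together in the order described gives a complete script. The sketch below runs it on a small input, invented for illustration, whose last line is not blank; the variable name tagged is ours.

```shell
# Input: a two-line paragraph, a blank line, then a final
# paragraph with no blank line after it.
tagged=$(printf 'one\ntwo\n\nthree\n' | sed '
${
/^$/!{
     H
     s/.*//
     }
}
/^$/!{
     H
     d
     }
/^$/{
     x
     s/^\n/<p>/
     s/$/<\/p>/
     G
     }')

printf '%s\n' "$tagged"
# <p>one
# two</p>
#
# <p>three</p>
```

Without the first procedure, the line "three" would be collected in the hold space and never output.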

