[Chapter 9] 9.2 String Functions

9.2 String Functions

The built-in string functions are much more significant and interesting than the numeric functions. Because awk is essentially designed as a string-processing language, a lot of its power derives from these functions. Table 9.2 lists the string functions found in awk.

Table 9.2: Awk's Built-In String Functions
Awk Function	Description
`gsub`(r,s,t)	Globally substitutes s for each match of the regular expression r in the string t. Returns the number of substitutions. If t is not supplied, defaults to $0.
`index`(s,t)	Returns position of substring t in string s or zero if not present.
`length`(s)	Returns length of string s or length of $0 if no string is supplied.
`match`(s,r)	Returns either the position in s where the regular expression r begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH.
`split`(s,a,sep)	Parses string s into elements of array a using field separator sep; returns number of elements. If sep is not supplied, FS is used. Array splitting works the same way as field splitting.
`sprintf`("fmt",expr)	Uses `printf` format specification for expr.
`sub`(r,s,t)	Substitutes s for first match of the regular expression r in the string t. Returns 1 if successful; 0 otherwise. If t is not supplied, defaults to $0.
`substr`(s,p,n)	Returns substring of string s at beginning position p up to a maximum length of n. If n is not supplied, the rest of the string from p is used.
`tolower`(s)	Translates all uppercase characters in string s to lowercase and returns the new string.
`toupper`(s)	Translates all lowercase characters in string s to uppercase and returns the new string.

The split() function was introduced in the previous chapter in the discussion on arrays.

The sprintf() function uses the same format specifications as printf(), which is discussed in Chapter 7, Writing Scripts for awk. It allows you to apply the format specifications on a string. Instead of printing the result, sprintf() returns a string that can be assigned to a variable. It can do specialized processing of input records or fields, such as performing character conversions. For instance, the following example uses the sprintf() function to convert a number into an ASCII character.

for (i = 97; i <= 122; ++i) {
	nextletter = sprintf("%c", i)
	...
}

A loop supplies numbers from 97 to 122, which produce ASCII characters from a to z.

That leaves us with three basic built-in string functions to discuss: index(), substr(), and length().

9.2.1 Substrings

The index() and substr() functions both deal with substrings. Given a string s, index(s,t) returns the leftmost position where string t is found in s. The beginning of the string is position 1 (which is different from the C language, where the first character in a string is at position 0). Look at the following example:

pos = index("Mississippi", "is")

The value of pos is 2. If the substring is not found, the index() function returns 0.

Given a string s, substr(s,p) returns the characters beginning at position p. The following example creates a phone number without an area code.

phone = substr("707-555-1111", 5)

You can also supply a third argument which is the number of characters to return. The next example returns just the area code:

area_code = substr("707-555-1111", 1, 3)

The two functions can be and often are used together, as in the next example. This example capitalizes the first letter of the first word for each input record.

awk '# caps - capitalize 1st letter of 1st word
# initialize strings
BEGIN { upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        lower = "abcdefghijklmnopqrstuvwxyz" 
}

# for each input line
{
# get first character of first word
	FIRSTCHAR = substr($1, 1, 1)
# get position of FIRSTCHAR in lowercase array; if 0, ignore
	if (CHAR = index(lower, FIRSTCHAR)) 
		# change $1, using position to retrieve
		# uppercase character 
		$1 = substr(upper, CHAR, 1) substr($1, 2)
# print record
	print $0
}'

This script creates two variables, upper and lower, consisting of uppercase and lowercase letters. Any character that we find in lower can be found at the same position in upper. The first statement of the main procedure extracts a single character, the first one, from the first field. The conditional statement tests to see if that character can be found in lower using the index() function. If CHAR is not 0, then CHAR can be used to extract the uppercase character from upper. There are two substr() function calls: the first one retrieves the capitalized letter and the second call gets the rest of the first field, extracting all characters, beginning with the second character. The values returned by both substr() functions are concatenated and assigned to $1. Making an assignment to a field as we do here is a new twist, but it has the added benefit that the record can be output normally. (If the assignment was made to a variable, you'd have to output the variable and then output the record's remaining fields.) The print statement prints the changed record. Let's see it in action:

$ caps
root user
Root user
dale
Dale
Tom
Tom

In a little bit, we'll see how to revise this program to change all characters in a string from lower- to uppercase or vice versa.

9.2.2 String Length

When presenting the awkro program in the previous chapter, we noted that the program was likely to produce lines that exceed 80 characters. After all, the descriptions are quite long. We can find out how many characters are in a string using the built-in function length(). For instance, to evaluate the length of the current input record, we specify length($0). (As it happens, if length() is called without an argument, it returns the length of $0.)

The length() function is often used to find the length of the current input record, in order to determine if we need to break the line.

One way to handle the line break, perhaps more efficiently, is to use the length() function to get the length of each field. By accumulating those lengths, we could specify a line break when a new field causes the total to exceed a certain number.

Chapter 13, A Miscellany of Scripts, contains a script that uses the length() function to break lines greater than 80 columns wide.

9.2.3 Substitution Functions

Awk provides two substitution functions: sub() and gsub(). The difference between them is that gsub() performs its substitution globally on the input string whereas sub() makes only the first possible substitution. This makes gsub() equivalent to the sed substitution command with the g (global) flag.

Both functions take at least two arguments. The first is a regular expression (surrounded by slashes) that matches a pattern and the second argument is a string that replaces what the pattern matches. The regular expression can be supplied by a variable, in which case the slashes are omitted. An optional third argument specifies the string that is the target of the substitution. If there is no third argument, the substitution is made for the current input record ($0).

The substitution functions change the specified string directly. You might expect, given the way functions work, that the function returns the new string created when the substitution is made. The substitution functions actually return the number of substitutions made. sub() will always return 1 if successful; both return 0 if not successful. Thus, you can test the result to see if a substitution was made.

For example, the following example uses gsub() to replace all occurrences of "UNIX" with "POSIX".

if (gsub(/UNIX/, "POSIX"))
	print

The conditional statement tests the return value of gsub() such that the current input line is printed only if a change is made.

As with sed, if an "&" appears in the substitution string, it will be replaced by the string matched by the regular expression. Use "\&" to output an ampersand. (Remember that to get a literal "\" into a string, you have to type two of them.) Also, note that awk does not "remember" the previous regular expression, as does sed, so you cannot use the syntax "//" to refer to the last regular expression.

The following example surrounds any occurrence of "UNIX" with the troff font-change escape sequences.

gsub(/UNIX/, "\\fB&\\fR")

If the input is "the UNIX operating system", the output is "the \fBUNIX\fR operating system".

In Chapter 4, Writing sed Scripts, we presented the following sed script named do.outline:

sed -n '
s/"//g
s/^\.Se /Chapter /p
s/^\.Ah /A. /p
s/^\.Bh /B.  /p' $*

Now here's that script rewritten using the substitution functions:

awk '
{
gsub(/"/, "")
if (sub(/^\.Se /, "Chapter ")) print
if (sub(/^\.Ah /, "\tA. ")) print
if (sub(/^\.Bh /, "\t\tB.  ")) print
}' $*

The two scripts are exactly equivalent, printing out only those lines that are changed. For the first edition of this book, Dale compared the run-time of both scripts and, as he expected, the awk script was slower. For the second edition, new timings showed that performance varies by implementation, and in fact, all tested versions of new awk were faster than sed! This is nice, since we have the capabilities in awk to make the script do more things. For instance, instead of using letters of the alphabet, we could number the headings. Here's the revised awk script:

awk '# do.outline -- number headings in chapter.
{
gsub(/"/, "")
}
/^\.Se/ {
	sub(/^\.Se /, "Chapter ") 
	ch = $2
	ah = 0
	bh = 0
	print
	next
}
/^\.Ah/ {
	sub(/^\.Ah /, "\t " ch "." ++ah " ") 
	bh = 0
	print
	next
}
/^\.Bh/ {
	sub(/^\.Bh /, "\t\t " ch "."  ah "." ++bh " ")
	print
}' $*

In this version, we break out each heading into its own pattern-matching rule. This is not necessary but seems more efficient since we know that once a rule is applied, we don't need to look at the others. Note the use of the next statement to bypass further examination of a line that has already been identified.

The chapter number is read as the first argument to the ".Se" macro and is thus the second field on that line. The numbering scheme is done by incrementing a variable each time the substitution is made. The action associated with the chapter-level heading initializes the section-heading counters to zero. The action associated with the top-level heading ".Ah" zeroes the second-level heading counter. Obviously, you can create as many levels of heading as you need. Note how we can specify a concatenation of strings and variables as a single argument to the sub() function.

$ do.outline ch02
Chapter 2 Understanding Basic Operations
         2.1 Awk, by Sed and Grep, out of Ed 
         2.2 Command-line Syntax
                 2.2.1 Scripting
                 2.2.2 Sample Mailing List
         2.3 Using Sed
                 2.3.1 Specifying Simple Instructions
                 2.3.2 Script Files
         2.4 Using Awk
         2.5 Using Sed and Awk Together

If you wanted the option of choosing either numbers or letters, you could maintain both programs and construct a shell wrapper that uses some flag to determine which program should be invoked.

9.2.4 Converting Case

POSIX awk provides two functions for converting the case of characters within a string. The functions are tolower() and toupper(). Each takes a single string argument, and returns a copy of that string, with all the characters of one case converted to the other (upper to lower and lower to upper, respectively). Their use is straightforward:

$ cat test
Hello, World!
Good-bye CRUEL world!
1, 2, 3, and away we GO!
$ awk '{ printf("<%s>, <%s>\n", tolower($0), toupper($0)) }' test
<hello, world!>, <HELLO, WORLD!>
<good-bye cruel world!>, <GOOD-BYE CRUEL WORLD!>
<1, 2, 3, and away we go!>, <1, 2, 3, AND AWAY WE GO!>

Note that nonalphabetic characters are left unchanged.

9.2.5 The match() Function

The match() function allows you to determine if a regular expression matches a specified string. It takes two arguments, the string and the regular expression. (This function is confusing because the regular expression is in the second position, whereas it is in the first position for the substitution functions.)

The match() function returns the starting position of the substring that was matched by the regular expression. You might consider it a close relation to the index() function. In the following example, the regular expression matches any sequence of capital letters in the string "the UNIX operating system".

match("the UNIX operating system", /[A-Z]+/)

The value returned by this function is 5, the character position of "U," the first capital letter in the string.

The match() function also sets two system variables: RSTART and RLENGTH. RSTART contains the same value returned by the function, the starting position of the substring. RLENGTH contains the length of the string in characters (not the ending position of the substring). When the pattern does not match, RSTART is set to 0 and RLENGTH is set to -1. In the previous example, RSTART is equal to 5 and RLENGTH is equal to 4. (Adding them together gives you the position of the first character after the match.)

Let's look at a rather simple example that prints out a string matched by a specified regular expression, demonstrating the "extent of the match," as discussed in Chapter 3, Understanding Regular Expression Syntax. The following shell script takes two command-line arguments: the regular expression, which should be specified in quotes, and the name of the file to search.

awk '# match -- print string that matches line
# for lines match pattern 
match($0, pattern) {
	# extract string matching pattern using
	# starting position and length of string in $0 
	# print string
	print substr($0, RSTART, RLENGTH)
}' pattern="$1" $2

The first command-line parameter is passed as the value of pattern. Note that $1 is surrounded by quotes, necessary to protect any spaces that might appear in the regular expression. The match() function appears in a conditional expression that controls execution of the only procedure in this awk script. The match() function returns 0 if the pattern is not found, and a non-zero value (RSTART) if it is found, allowing the return value to be used as a condition. If the current record matches the pattern, then the string is extracted from $0, using the values of RSTART and RLENGTH in the substr() function to specify the starting position of the substring to be extracted and its length. The substring is printed. This procedure only matches the first occurrence in $0.

Here's a trial run, given a regular expression that matches "emp" and any number of characters up to a blank space:

$ match "emp[^ ]*" personnel.txt
employees
employee
employee.
employment,
employer
employment
employee's
employee

The match script could be a useful tool in improving your understanding of regular expressions.

The next script uses the match() function to locate any sequence of uppercase letters so that they can be converted to lowercase. Compare it to the caps program shown earlier in the chapter.

awk '# lower - change upper case to lower case 
# initialize strings
BEGIN { upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        lower = "abcdefghijklmnopqrstuvwxyz" 
}

# for each input line
{
# see if there is a match for all caps 
	 while (match($0, /[A-Z]+/))  
		# get each cap letter
		for (x = RSTART; x < RSTART+RLENGTH; ++x) { 
			CAP = substr($0, x, 1)
			CHAR = index(upper, CAP)
			# substitute lowercase for upper 
			gsub(CAP, substr(lower, CHAR, 1))
		}
		
# print record
       print $0
}' $*

In this script, the match() function appears in a conditional expression that determines whether a while loop will be executed. By placing this function in a loop, we apply the body of the loop as many times as the pattern occurs in the current input record.

The regular expression matches any sequence of uppercase letters in $0. If a match is made, a for loop does the lookup of each character in the substring that was matched, similar to what we did in the caps sample program, shown earlier in this chapter. What's different here is how we use the system variables RSTART and RLENGTH. RSTART initializes the counter variable x. It is used in the substr() function to extract one character at a time from $0, beginning with the first character that matched the pattern. By adding RLENGTH to RSTART, we get the position of the first character after the ones that matched the pattern. That is why the loop uses "<" instead of "<=". At the end, we use gsub() to replace the uppercase letter with the corresponding lowercase letter.[2] Notice that we use gsub() instead of sub() because it offers us the advantage of making several substitutions if there are multiple instances of the same letter on the line.

[2] You may be wondering, "why not just use tolower()?" Good question. Some early versions of nawk, including the one on SunOS 4.1.x systems, don't have tolower() and toupper(); thus it's useful to know how to do it yourself.

$ cat test
Every NOW and then, a WORD I type appears in CAPS.
$ lower test
every now and then, a word i type appears in caps.

Note that you could change the regular expression to avoid matching individual capital letters by matching a sequence of two or more uppercase characters, by using: "/[A-Z][A-Z]+/." This would also require revising the way the lowercase conversion was made using gsub(), since it matches a single character on the line.

In our discussion of the sed substitution command, you saw how to save and recall a portion of a string matched by a pattern, using $ and $ to surround the pattern to be saved and \n to recall the saved string in the replacement pattern. Unfortunately, awk's standard substitution functions offer no equivalent syntax. The match() function can solve many such problems, though.

For instance, if you match a string using the match() function, you can single out characters or a substring at the head or tail of the string. Given the values of RSTART and RLENGTH, you can use the substr() function to extract the characters. In the following example, we replace the second of two colons with a semicolon. We can't use gsub() to make the replacement because "/:/" matches the first colon and "/:[^:]*:/" matches the whole string of characters. We can use match() to match the string of characters and to extract the last character of the string.

# replace 2nd colon with semicolon using match, substr
if (match($1, /:[^:]*:/)) {
	before = substr($1, 1, (RSTART + RLENGTH - 2))
	after = substr($1, (RSTART + RLENGTH))
	$1 = before ";" after
}

The match() function is placed within a conditional statement that tests that a match was found. If there is a match, we use the substr() function to extract the substring before the second colon as well as the substring after it. Then we concatenate before, the literal ";", and after, assigning it to $1.

You can see examples of the match() function in use in Chapter 12, Full-Featured Applications.


9.1 Arithmetic Functions		9.3 Writing Your Own Functions