nawk(1)
Name
nawk - data transformation, report generation language
Syntax
nawk [ -f programfile ] [ -Fs ] [ program ] [ var=value... ] [ file ... ]
Description
The nawk language is a file-processing language which is well-suited to
data manipulation and retrieval of information from text files. This
reference page provides a full technical description of nawk. If you
are unfamiliar with the language, you will probably find it helpful to
read the Guide to the nawk Utility before reading the following material.
A program consists of any number of user-defined functions and `rules'
of the form:
pattern {action}
There are two ways to specify the program:
(a) Directly on the command line. In this case, the program is a
single command line argument, usually enclosed in apostrophes.
(b) By using the -f programfile option (where programfile contains the
program). More than one -f option can appear on the command line.
The program will consist of the concatenation of the contents of
all the specified programfiles. You can use - in place of a file
name, to obtain input from the standard input.
The input data manipulated by the program is provided in files speci‐
fied on the command line. If no such files are specified, data is read
from the standard input. You can also specify a file name of - to mean
the standard input.
Input to nawk is divided into records. By default, records are separated by
new-line characters; however, you can specify a different record sepa‐
rator if you wish.
One at a time, and in order, each input record is compared with the
pattern of every `rule' in the program. When a pattern matches, the
action part of the rule is performed on the current input record. Pat‐
terns and actions often refer to separate fields within a record. By
default, fields are separated by white space (blanks, new-lines, or
horizontal tab characters); however, you can specify a different field
separator string using the -Fs option (see Input).
You can omit the pattern or action part of a rule (but not both). If
pattern is omitted, the action is performed on every input record (as
if every record matches). If action is omitted, every record matching
the pattern will be written to the standard output.
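The two omission rules can be illustrated with a short sketch. It
assumes a POSIX-compatible awk is installed under the name awk
(substitute nawk where it is available); the three input lines are
made up for the illustration.

```shell
# Pattern only: each record matching the pattern is written to the
# standard output.
printf 'apple 3\nbanana 0\ncherry 7\n' | awk '$2 > 0'
# prints:
#   apple 3
#   cherry 7

# Action only: the action is performed on every record.
printf 'apple 3\nbanana 0\ncherry 7\n' | awk '{ print $1 }'
# prints:
#   apple
#   banana
#   cherry
```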
If a line in a program contains a `#' character, the `#' and everything
after it is considered to be a comment.
Program lines can be continued by adding a backslash `\' to the end of
the line. Statement lines ending with a comma `,', double or-bars
`||', or double ampersands `&&', are automatically continued.
Options
-f programfile
Tells nawk to obtain its program from the specified file. There can
be more than one of these on the command line.
-Fs Says that s is the field separator character within records.
Variables and Expressions
There are three types of variables in nawk: identifiers, fields, and array
elements.
An identifier is a sequence of letters, digits, and underscores begin‐
ning with a letter or an underscore.
Fields are described in the Input subsection.
Arrays are associative collections of values called the elements of the
array. Array elements are referenced with constructs of the form
identifier[subscript]
where subscript has the form expr or expr,expr,... Each such expr can
have any string value. Arrays with multiple expr subscripts are imple‐
mented by concatenating the string values of each expr with a separator
character SUBSEP separating multiple expr. The initial value of SUBSEP
is set to `\034' (ASCII field separator).
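As a rough sketch of how a multiple subscript is stored, the following
program creates one element and then splits its key on SUBSEP. The
array name count and the subscripts "east" and "q1" are hypothetical,
and a POSIX-compatible awk is assumed to be installed as awk.

```shell
awk 'BEGIN {
    count["east", "q1"] = 5          # stored under the key "east" SUBSEP "q1"
    for (key in count) {
        split(key, part, SUBSEP)     # recover the individual subscripts
        print part[1], part[2], count[key]
    }
}'
# prints: east q1 5
```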
Fields and identifiers are sometimes called scalar variables to distin‐
guish them from arrays.
Variables are not declared and need not be initialized. The value of
an uninitialized variable is the empty string. Variables can be ini‐
tialized on the command line using
var=value
Such initializations can be interspersed with the names of input files
on the command line. Initializations and input files will be processed
in the order they appear on the command line. For example, the command
nawk -f progfile A=1 f1 f2 A=2 f3
sets A to 1 before input is read from f1 and sets A to 2 before input
is read from f3.
Certain built-in variables have special meaning to nawk, as described in
later sections.
Expressions consist of constants, variables, functions, regular expres‐
sions and `subscript in array' conditions (see below) combined with
operators. Each variable and expression has a string value and a cor‐
responding numeric value; the value appropriate to the context is used.
If a string is used in a numeric context, and the contents of the
string cannot be interpreted as a number, the `value' of the string is
taken to be zero.
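The choice between the string value and the numeric value can be seen
in a small sketch such as the following. The variable names s and n are
arbitrary, and a POSIX-compatible awk is assumed to be installed as
awk.

```shell
awk 'BEGIN {
    s = "abc"          # cannot be interpreted as a number
    n = "42"           # a string whose contents look numeric
    print s + 1        # numeric context: "abc" is taken to be zero
    print n + 1        # numeric context: "42" is taken to be 42
    print (n 1)        # string context: the values are concatenated
}'
# prints:
#   1
#   43
#   421
```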
Numeric constants are sequences of decimal digits.
String constants are quoted, as in "x". Escape sequences accepted in
literal strings are:
Escape    ASCII Character
───────────────────────────────
\a        audible bell
\b        backspace
\f        formfeed
\n        new-line
\r        carriage return
\t        horizontal tab
\v        vertical tab
\ooo      octal value ooo
\xdd      hexadecimal value dd
\"        quotation mark
\c        any other character c
The regular expression syntax understood by nawk is the extended regular
expressions of the egrep utility described in grep(1). Characters enclosed in slash
characters `/' are compiled as regular expressions when the program is
read. In addition, literal strings and variables are interpreted as
dynamic regular expressions on the right side of a `~' or `!~' opera‐
tor, or as certain arguments to built-in matching and substitution
functions. Note that when literal strings are used as regular expres‐
sions, extra backslashes are needed to escape regular expression
metacharacters because the backslash is also the literal string escape
character.
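The need for extra backslashes can be illustrated by matching a literal
dot in both forms. This is only a sketch, assuming a POSIX-compatible
awk installed as awk; the two input lines are made up.

```shell
# A regular expression between slashes needs one backslash to match a
# literal dot.
printf 'a.b\naxb\n' | awk '$0 ~ /a\.b/'
# prints: a.b

# The same expression written as a literal string needs two, because
# string escape processing turns "\\." into \. before the string is
# used as a regular expression.
printf 'a.b\naxb\n' | awk '$0 ~ "a\\.b"'
# prints: a.b
```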
The `subscript in array' condition is defined as:
index in array
where index looks like expr or (expr,...,expr). This condition evalu‐
ates to 1 if the string value of index is a subscript of array, and to
0 otherwise. This is a way to determine if an array element exists.
If the element does not exist, this condition will not create it.
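The following sketch contrasts the `in' condition with an ordinary
reference to an element, which in the awk implementations this editor
is aware of does create the element. The array name a and the subscript
"k" are arbitrary, and a POSIX-compatible awk is assumed to be
installed as awk.

```shell
awk 'BEGIN {
    if ("k" in a) print "present"; else print "absent"
    n = 0; for (i in a) n++
    print n                      # the test above did not create a["k"]
    x = a["k"]                   # an ordinary reference creates the element
    if ("k" in a) print "present"; else print "absent"
}'
# prints:
#   absent
#   0
#   present
```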
Symbol Table
The symbol table can be accessed through the built-in array SYMTAB.
SYMTAB[expr]
is equivalent to the variable named by the evaluation of expr. For
example,
SYMTAB["var"]
is a synonym for the variable var.
Environment
A program can determine its initial environment by examining the ENVI‐
RON array. If the environment consists of entries of the form:
name=value
then
ENVIRON[name]
has string value
"value"
For example, the following program prints each entry of the environment
in the form name=value:
BEGIN {
for (i in ENVIRON)
printf("%s=%s\n", i, ENVIRON[i])
exit
}
Operators
The usual precedence order of arithmetic operations is followed unless
overridden with parentheses; a table giving the order of operations
appears at the end of the Guide to the nawk Utility. The unary opera‐
tors are
- Negation
+ Nothing (place holder)
-- Decrement by one
++ Increment by one
where the `++' and `--' operators can be used as either postfix or pre‐
fix operators, as in C.
The binary arithmetic operators are
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulus
^ Exponentiation
The conditional operator
expr ? expr1 : expr2
evaluates to expr1 if the value of expr is non-zero, and to expr2 oth‐
erwise.
If two expressions are not separated by an operator, their string val‐
ues are concatenated.
The operator `~' yields 1 (true) if the regular expression on the right
side matches the string on the left side. The operator `!~' yields 1
when the right side has no match on the left. To illustrate:
$2 ~ /[0-9]/
selects any line where the second field contains at least one digit.
Any string or variable on the right side of `~' or `!~' is interpreted
as a dynamic regular expression.
The relational operators are the usual `<', `<=', `>', `>=', `==', and
`!='.
The boolean operators are `||' (or), `&&' (and), and `!' (not).
Values can be assigned to a variable with
var = expr
If op is a binary arithmetic operator,
var op= expr
is equivalent to
var = var op expr
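A few of the operators described in this section are combined in the
sketch below. The variable x and the sample values are arbitrary, and a
POSIX-compatible awk is assumed to be installed as awk.

```shell
awk 'BEGIN {
    x = 2
    x ^= 3                            # same as x = x ^ 3
    print x
    print 7 % 3, -x                   # modulus and unary negation
    print (x > 5 ? "big" : "small")   # conditional operator
    print "ab" "cd"                   # adjacent expressions are concatenated
}'
# prints:
#   8
#   1 -8
#   big
#   abcd
```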
Command Line Arguments
The built-in variable ARGC is set to the number of command line argu‐
ments. The built-in array ARGV has elements subscripted with digits
from zero to ARGC-1, giving command line arguments in the order they
appeared on the command line.
The ARGC count and the ARGV vector do not include command line options
(beginning with `-') or the program file (following -f). They do include the
name of the command itself, initialization statements of the form
var=value
and the names of input data files.
The language actually creates ARGC and ARGV before doing anything else.
It then walks through ARGV processing the arguments. If an element of
ARGV is the empty string, it is simply skipped. If it contains an
equals sign `=', it is interpreted as a variable assignment. If it is
a minus sign `-', it stands for the standard input and input is immedi‐
ately read from the standard input until end-of-file is encountered.
Otherwise, the argument is taken to be a file name; input will be read
from that file until end-of-file is reached. Note that the program is
executed by `walking through' ARGV in this way; thus if the program
changes ARGV, different files can be read and assignments made.
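As a small illustration of what ARGV holds, the sketch below lists the
elements after element zero. Because the program contains only a BEGIN
rule, no input is read, so the hypothetical files f1 and f2 need not
exist. A POSIX-compatible awk installed as awk is assumed.

```shell
awk -F: 'BEGIN {
    for (i = 1; i < ARGC; i++)
        print i, ARGV[i]
}' A=1 f1 f2
# prints (the -F: option and the program text are not counted):
#   1 A=1
#   2 f1
#   3 f2
```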
Input
Input is divided into records. Each record is separated from the next
with a record separator character. The value of the built-in variable
RS gives the current record separator character; by default, it begins
as the new-line `\n'. If you assign a different character to RS, nawk will
use that as the record separator character from that point on.
Records are divided into fields. Each field is separated from the next
with a field separator string, given by the value of the built-in vari‐
able FS. You can set a specific separator string by assigning a value
to FS or by specifying the -Fs option on the command line. FS can be
assigned a regular expression. For example,
FS = "[,:$]"
says that fields can be separated by commas, colons, or dollar signs.
As a special case, assigning FS a string containing only a blank char‐
acter sets the field separator to white space. In this case, any
sequence of contiguous space and/or tab characters is considered a sin‐
gle field separator. This is the default for FS. However, if FS is
assigned a string containing any other character, that character desig‐
nates the start of a new field. For example, if we set
FS="\t"
(the tab character),
texta \t textb \t \t \t textc
contains five fields, two of which only contain blanks. With the
default setting, the above would only contain three fields because the
sequence of multiple blanks and tabs would be considered a single sepa‐
rator.
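The field counts quoted above can be checked with a sketch along these
lines, which assumes a POSIX-compatible awk installed as awk. The shell
printf command is used only to produce the tab characters.

```shell
# Regular expression separator: commas, colons, or dollar signs.
printf 'a,b:c$d\n' | awk 'BEGIN { FS = "[,:$]" } { print NF, $3 }'
# prints: 4 c

# Tab as the separator: every tab starts a new field.
printf 'texta \t textb \t \t \t textc\n' | awk 'BEGIN { FS = "\t" } { print NF }'
# prints: 5

# Default separator: runs of blanks and tabs count as one separator.
printf 'texta \t textb \t \t \t textc\n' | awk '{ print NF }'
# prints: 3
```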
Various pieces of information about input are provided by the built-in
variables listed below.
NF Number of fields in the current record
NR Number of records read so far
FILENAME Name of file containing current record
FNR Number of records read from current file
Field specifiers have the form $i where i runs from 1 through NF. Such
a field specifier refers to the ith field of the current input record.
$0 (zero) refers to the entire current input record.
The getline function can read a value for a variable or $0 from the
current input, from a file, or from a pipe. The result of getline is
an integer indicating whether the read operation was successful. A
value of 1 indicates success; 0 indicates end-of-file encountered; and
-1 indicates that an error occurred. Possible forms for getline are:
getline
Reads next input record into $0 and splits the record into fields.
NF, NR, and FNR are set appropriately.
getline var
Reads next input record into the variable var. The record is not
split into fields (which means that the current $i values do not
change). NR and FNR are set appropriately.
getline <expr
Interprets the string value of expr to be a file name. The next
record from that file is read into $0 and split into fields. NF
is set appropriately.
getline var <expr
Interprets the string value of expr to be a file name, and reads
the next record from that file into the variable var. The record
is not split into fields.
expr | getline
Interprets the string value of expr as a command line to be exe‐
cuted. Output from this command is piped into getline, and read
into $0 in a manner similar to getline <expr. See The System
Function section for additional details.
expr | getline var
Executes the string value of expr as a command and pipes the out‐
put of the command into getline. The result is similar to getline
var <expr.
close(expr)
Only a limited number of files and pipes can be open at one time.
This function will close open files or pipes. The expr must be
one that came before `|' or after `<' in getline, or after `>',
`>>', or `|' in print or printf as described in the Output sec‐
tion. By closing files and pipes that are no longer needed, you
can use any number of files and pipes in the course of executing a
program.
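A typical way to use the result of getline, together with close, might
look like the following sketch. The command string is hypothetical, and
a POSIX-compatible awk installed as awk is assumed.

```shell
awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)   # 1 = record read, 0 = end-of-file
        n++
    close(cmd)                         # release the pipe once it is not needed
    print n, line
}'
# prints: 2 two
```

Comparing the result with zero, rather than testing it directly, keeps
the loop from running forever when getline returns -1.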
Built-In Arithmetic Functions
int(expr)
Returns the integer part of the numeric value of expr. If (expr)
is omitted, the integer part of $0 is returned.
exp(expr), log(expr), sqrt(expr)
Returns the exponential, natural logarithm, and square root of the
numeric value of expr. If (expr) is omitted, $0 is used.
sin(expr), cos(expr)
Returns the sine and cosine of the numeric value of expr (inter‐
preted as an angle in radians).
atan2(expr1, expr2)
Returns the arctangent of expr1/expr2 in the range of -π through π.
rand()
Returns a random floating-point number in the range 0 through 1.
srand(expr)
Sets the seed of the rand function to the integer value of expr.
If (expr) is omitted, sets a default seed (which is the same each
time nawk is invoked).
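A few of the arithmetic functions are exercised in the sketch below,
which assumes a POSIX-compatible awk installed as awk. The argument
values are arbitrary.

```shell
awk 'BEGIN {
    print int(3.9), int(-3.9)     # integer part of each value
    print sqrt(16), exp(0), log(1)
    print atan2(0, -1)            # pi, printed using the default OFMT
}'
# prints:
#   3 -3
#   4 1 0
#   3.14159
```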
Built-In String Functions
len = length(expr)
Returns the number of characters in the string value of expr.
If (expr) is omitted, $0 is used.
n = split(string, array, regexp)
Splits the string into fields. The expression regexp is a regu‐
lar expression giving the field separator string for the pur‐
poses of this operation. The elements of array are assigned the
separated fields in order; subscripts for array begin at 1. All
other elements of array are discarded. The result of split is
the number of fields into which string was divided (which is
also the maximum subscript for array). Note that regexp divides
the record in the same way that the FS field separator string
does. If regexp is omitted in the call to split, the current
value of FS will be used.
str = substr(string, m, len)
Returns the substring of string that begins in position m and is
at most len characters long. The first character of the string
has m equal to one. If len is omitted, the rest of string is
returned.
pos = index(s1, s2)
Returns the position of the first occurrence of string s2 in
string s1; if s2 is not found in s1, index returns zero.
pos = match(string, regexp)
Searches string for the first substring matching the regular
expression regexp, and returns an integer giving the position of
this substring. If no such substring is found, match returns
zero. The built-in variable RSTART is set to pos and the built-
in variable RLENGTH is set to the length of the matched string.
These are both set to zero if there is no match. The regexp can
be enclosed in slashes or given as a string.
n = gsub(regexp, repl, string)
Globally replaces all substrings of string that match the
regular expression regexp with the string repl. If string is
omitted, the current record ($0) is used. gsub returns the
number of substrings that were replaced, or zero if no match
occurred.
n = sub(regexp, repl, string)
Works like gsub except that at most one match and substitution
is attempted.
str = sprintf(fmt, expr, expr...)
Formats the expression list expr, expr, ... using specifica‐
tions from the string fmt, then returns the formatted string.
The fmt string consists of conversion specifications which con‐
vert and add the next expr to the string, and ordinary charac‐
ters which are simply added to the string. Conversion specifi‐
cations have the form
%[-][x][.y]c
where
- left justifies the field
x is the minimum field width
y is the precision
c is the conversion character
In a string, the precision is the maximum number of characters
to be printed from the string; in a number, the precision is the
number of digits to be printed to the right of the decimal point
in a floating point value. If x or y is `*' (asterisk), the
minimum field width or precision will be the value of the next
expr in the call to sprintf.
The conversion character c is one of the following:
d Decimal integer
o Unsigned octal integer
x Unsigned hexadecimal integer
u Unsigned decimal integer
f Floating point
e Floating point (scientific notation)
g The shorter of e and f (suppresses non-significant zeros)
c Single character of an integer value
s String
n = ord(expr)
Returns the integer value of first character in the string value
of expr. This is useful in conjunction with `%c' in sprintf.
str = tolower(expr)
Converts all letters in the string value of expr into lower
case, and returns the result. If expr is omitted, $0 is used.
str = toupper(expr)
Converts all letters in the string value of expr into upper
case, and returns the result. If expr is omitted, $0 is used.
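Several of the string functions above are combined in the following
sketch. The sample strings are made up, ord is left out because other
versions may not recognize it, and a POSIX-compatible awk installed as
awk is assumed.

```shell
awk 'BEGIN {
    n = split("a:b:c", part, ":")
    print n, part[2]
    print substr("hello", 2, 3), index("hello", "ll")
    pos = match("foobar", /o+/)
    print pos, RSTART, RLENGTH
    s = "banana"
    n = gsub(/an/, "AN", s)       # every match is replaced
    print n, s
    t = "banana"
    n = sub(/a/, "A", t)          # only the first match is replaced
    print n, t
    print sprintf("%-5s|%5.2f|%c", "ab", 3.14159, 65), toupper("abc")
}'
# prints:
#   3 b
#   ell 3
#   2 2 2
#   2 bANANa
#   1 bAnana
#   ab   | 3.14|A ABC
```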
The System Function
status = system(expr)
Executes the string value of expr as a command. For example,
system("tail " $1)
calls the tail command, using the string value of $1 as the file that
tail should examine. See the Restrictions section for a discussion
of the execution of the command.
User-Defined Functions
You can define your own functions using the form
function name(parameter-list) {
statements
}
A function definition can appear in the place of a pattern {action}
rule. The parameter-list contains any number of normal (scalar) and
array variables separated by commas. When a function is called, scalar
arguments are passed by value, and array arguments are passed by
reference. The names specified in the parameter-list are local to the
function; all other names used in the function are global. Local
scalar variables can be defined by adding them to the end of the param‐
eter list. These extra parameters are not used in any call to the
function.
A function returns to its caller either when the final statement in the
function is executed, or when an explicit return statement is executed.
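The argument-passing rules can be sketched as follows. The function
name fill and the variables sq and count are hypothetical, and a
POSIX-compatible awk installed as awk is assumed.

```shell
awk 'function fill(arr, n,    i) {     # i is an extra parameter, so it is local
    for (i = 1; i <= n; i++)
        arr[i] = i * i                 # the array is passed by reference
    n = 0                              # the scalar is passed by value
}
BEGIN {
    i = "untouched"; count = 3
    fill(sq, count)
    print sq[3], count, i
}'
# prints: 9 3 untouched
```

Separating the extra parameter from the real ones with additional
spaces is only a common convention to make the local variable stand
out.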
Patterns and Actions
A pattern is a regular expression, a special pattern, a pattern range,
or any arithmetic expression.
BEGIN is a special pattern used to label actions that should be per‐
formed before any input records have been read. END is a special pat‐
tern used to label actions that should be performed after all input
records have been read.
A pattern range is given as
pattern1,pattern2
This matches all lines from one that matches pattern1 to one that
matches pattern2, inclusive.
If a pattern is omitted, or if the numeric value of the pattern is non-
zero (true), the resulting action is executed for the line.
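A pattern range with no action can be tried with a sketch like the one
below, which assumes a POSIX-compatible awk installed as awk. The input
lines are made up.

```shell
printf 'one\ntwo\nthree\nfour\nfive\n' | awk '/two/,/four/'
# prints:
#   two
#   three
#   four
```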
An action is a series of statements terminated by semicolons, new-
lines, or closing braces. A condition is any expression; a non-zero
value is considered true, and a zero value is considered false. A
statement is one of the following:
expression
if (condition)
statement
[else
statement]
while (condition)
statement
do
statement
while (condition)
for (expression1; condition; expression2)
statement
The for statement is equivalent to:
expression1
while (condition) {
statement
expression2
}
The for statement can also have the form
for (i in array)
statement
The statement is executed once for each element in array; on each repe‐
tition, the variable i will contain the name of a subscript of array,
running through all the subscripts in an arbitrary order. If array is
multi-dimensional (has multiple subscripts), i will be expressed as a
single string with the SUBSEP character separating the subscripts. The
following simple statements are supported:
break Exits a for or a while loop immediately.
continue
Stops the current iteration of a for or while loop and begins
the next iteration (if there is one).
next Terminates any processing for the current input record and imme‐
diately starts processing the next input record. Processing for
the next record will begin with the first appropriate rule.
exit[ (expr) ]
Immediately goes to the END action if it exists; if there is no
END action, or if nawk is already executing the END action, the
program terminates. The exit status of the program is set to the
numeric value of expr. If (expr) is omitted, the exit status is
0.
return [expr]
Returns from the execution of a function. If an expr is speci‐
fied, the value of the expression is returned as the result of
the function. Otherwise, the function result is undefined.
delete array[i]
Deletes element i from the given array.
print expr, expr, ...
Described below.
printf fmt, expr, expr, ...
Described below.
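The next and exit statements might be combined as in the following
sketch, which assumes a POSIX-compatible awk installed as awk. The
input lines are made up, and the shell `||' is used only to report a
non-zero exit status.

```shell
printf 'a\n# note\nb\n' | awk '
/^#/ { next }              # skip these records; later rules do not see them
     { n++ }
END  { print n; exit(n) }  # the exit status is the numeric value of n
' || echo "status: $?"
# prints:
#   2
#   status: 2
```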
Output
The print and printf statements write to the standard output. Output
can be redirected to a file or pipe as described below.
If >expr is added to a print or printf statement, the string value of
expr is taken to be a file name, and output is written to that file.
Similarly, if >>expr is added, output will be appended to the current
contents of the file. The distinction between `>' and `>>' is
only important for the first print to the file expr. Subsequent out‐
puts to an already open file will append to what is there already.
In order to eliminate ambiguities, statements such as
print a > b c
are syntactically illegal. Parentheses must be used to resolve the
ambiguity.
If |expr is added to a print or printf statement, the string value of
expr is taken to be an executable command. The command is executed
with the output from the statement piped as input into the command.
As noted earlier, only a limited number of files and pipes can be open
at any time. To avoid going over the limit, you should use the close
function to close files and pipes when they are no longer needed.
The print statement prints its arguments with only simple formatting.
If it has no arguments, the current input record is printed in its
entirety. The output record separator ORS is added to the end of the
output produced by each print statement; when arguments in the print
statement are separated by commas, the corresponding output values will
be separated by the output field separator OFS. ORS and OFS are built-
in variables whose values can be changed by assigning them strings.
The default output record separator is a new-line and the default out‐
put field separator is a space. The format of numbers output by print
is given by the string OFMT. By default, the value is `%.6g'; this can
be changed by assigning OFMT a different string value.
The printf statement formats its arguments using the fmt argument.
Formatting is the same as for the built-in function sprintf. Unlike
print, printf does not add output separators automatically. This gives
the program more precise control of the output.
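The effect of OFS, ORS, and OFMT on print, and their absence in printf,
can be seen in a sketch such as this one. The separator values are
arbitrary, and a POSIX-compatible awk installed as awk is assumed.

```shell
awk 'BEGIN {
    OFS = "-"; ORS = "!\n"
    print "a", "b"            # commas insert OFS; ORS ends the output
    OFMT = "%.2f"
    print 3.14159             # numbers written by print use OFMT
    printf "%s %d\n", "n", 7  # printf adds no separators of its own
}'
# prints:
#   a-b!
#   3.14!
#   n 7
```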
Restrictions
The longest input record is restricted to 20,000 bytes and the maximum
number of fields supported is 4000. The length of the string produced
by sprintf is limited to 1024 bytes.
The ord function may not be recognized by other versions of awk. The
toupper and tolower functions and the ENVIRON array variable are found
in the Bell Labs version of awk; this version is a superset of `New
awk' as described in The AWK Programming Language by Aho, Weinberger,
and Kernighan.
The shell that is used by the functions
getline print printf system
and the return value of the system function is described in system(3).
Examples
The following example outputs the contents of the file input1 with line
numbers prepended to each line:
nawk '{print NR ":" $0}' input1
The following is an example using var=value on the command line:
nawk '{print NR SEP $0}' SEP=":" input1
The program script can also be read from a file as in the command line:
nawk -f addline.nawk input1
This example produces the same output as the previous example when the
file addline.nawk contains
{print NR ":" $0}
The following program appends all input lines starting with `January'
to the file jan (which can already exist or not), and all lines starting
with `February' or `March' to the file febmar:
/^January/ {print >> "jan"}
/^February|^March/ {print >> "febmar"}
This program prints the total and average for the last column of each
input line:
{s += $NF}
END {print "sum is", s, "average is", s/NR}
The following program interchanges the first and second fields of input
lines:
{
tmp = $1
$1 = $2
$2 = tmp
print
}
The following example inserts line numbers so that output lines are
left-aligned:
{printf "%-6d: %s\n", NR, $0}
This example prints input records in reverse order (assuming sufficient
memory):
{
a[NR] = $0 # index using record number
}
END {
for (i = NR; i>0; --i)
print a[i]
}
The next program determines the number of lines starting with the same
first field:
{
++a[$1] # array indexed using the first field
}
END { # note output will be in undefined order
for (i in a)
print a[i], "lines start with", i
}
The following program can be used to determine the number of lines in
each input file:
{
++a[FILENAME]
}
END {
for (file in a)
if (a[file] == 1)
print file, "has 1 line"
else
print file, "has", a[file], "lines"
}
This program illustrates how a two dimensional array can be used in
nawk. Assume the first field contains a product number, the second field
contains a month number, and the third field contains a quantity (bought,
sold, or whatever). The program generates a table of products versus
month.
BEGIN {NUMPROD = 5}
{
array[$1,$2] += $3
}
END {
print "\t Jan\t Feb\tMarch\tApril\t May\t" \
"June\tJuly\t Aug\tSept\t Oct\t Nov\t Dec"
for (prod = 1; prod <= NUMPROD; prod++) {
printf "%-7s", "prod#" prod
for (month = 1; month <= 12; month++){
printf "\t%5d", array[prod,month]
}
printf "\n"
}
}
As this program reads in each line of input, it reports whether the
line matches a pre-determined value:
function randint() {
return (int((rand()+1)*10))
}
BEGIN {
prize[randint(),randint()] = "$100";
prize[randint(),randint()] = "$10";
prize[1,1] = "the booby prize"
}
{
if (($1,$2) in prize)
printf "You have won %s!\n", prize[$1,$2]
}
This example prints lines whose first and last fields are the same,
reversing the order of the fields:
$1==$NF {
for (i = NF; i > 0; --i)
printf "%s", $i (i>1 ? OFS : ORS)
}
The following program prints the input files from the command line.
The infiles function first empties the array passed to it, and then
fills the array. Notice that the extra parameter i of infiles is a
local variable.
function infiles(f, i) {
for (i in f)
delete f[i]
for (i = 1; i < ARGC; i++)
if (index(ARGV[i],"=") == 0)
f[i] = ARGV[i]
}
BEGIN {
infiles(a)
for (i in a)
print a[i]
exit
}
This example is the standard recursive factorial function:
function fact(num) {
if (num <= 1)
return 1
else
return num * fact(num - 1)
}
{ print $0 " factorial is " fact($0) }
The last program illustrates the use of getline with a pipe. Here,
getline sets the current record from the output of the command. The
program prints the number of words in each input file.
function words(file, string) {
string = "wc " file
string | getline
close(string)
return ($2)
}
BEGIN {
for (i=1; i<ARGC; i++) {
fn = ARGV[i]
printf "There are %d words in %s.\n",
words(fn), fn
}
}
See Also
ed(1), grep(1), sed(1), ex(1), system(3), ascii(7)
"Awk - A Pattern Scanning and Processing Language", ULTRIX Supplementary
Documents, Vol. II: Programmer