Detailed Explanation of the Linux Awk Command


Source: ggjucheng

Link: http://www.cnblogs.com/ggjucheng/archive/2013/01/13/2858470.html

Introduction

Awk is a powerful text analysis tool. Where grep excels at searching and sed at editing, awk is particularly strong at data analysis and report generation. In simple terms, awk reads a file line by line, splits each line into fields using whitespace as the default delimiter, and then analyzes and processes the resulting fields.

There are three common implementations of awk: awk, nawk, and gawk. Unless otherwise specified, awk here generally refers to gawk, the GNU version of AWK.

The name awk comes from the surname initials of its creators, Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is in fact a language in its own right: the AWK programming language, which its three creators formally describe as a "pattern scanning and processing language." It lets you write short programs that read input files, sort and process data, perform calculations on the input, and generate reports, among many other tasks.

Usage

awk 'pattern { action }' filenames

Although the operations can be complex, the syntax always takes this form: pattern specifies what awk looks for in the data, and action is a series of commands executed when a match is found. Curly braces ({}) do not always need to appear in a program, but they are used to group the instructions that belong to a particular pattern. A pattern is typically a regular expression enclosed in slashes.
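As a minimal sketch of the pattern and action parts described above, the following runs three small programs against a few made-up lines. The sample text and the shell variable names are illustrative assumptions, not part of the original article.

```shell
# Hypothetical sample input; any text would do.
sample='alpha 1
beta 2
alpha 3'

# Pattern only: matching lines are printed unchanged (the default action).
matched=$(printf '%s\n' "$sample" | awk '/alpha/')

# Action only: the action runs for every line.
seconds=$(printf '%s\n' "$sample" | awk '{ print $2 }')

# Pattern plus action: the action runs only for matching lines.
both=$(printf '%s\n' "$sample" | awk '/alpha/ { print $2 }')

printf '%s\n' "$matched" "$seconds" "$both"
```

Either part may be omitted, but not both: a missing pattern matches every line, and a missing action prints the matching line.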

The most basic function of the awk language is to browse and extract information from files or strings based on specified rules. After extracting information, awk can perform other text operations. A complete awk script is usually used to format information in text files.

Typically, awk processes one line of a file at a time. Awk receives one line of the file and then executes the corresponding command to process the text.

Calling awk

There are three ways to call awk:

1. Command line method

awk [-F field-separator] 'commands' input-file(s)

Here commands are the actual awk commands, and [-F field-separator] is optional. input-file(s) are the files to be processed. In awk, each item on a line delimited by the field separator is called a field. Unless -F field-separator is specified, the default field separator is whitespace.

2. Shell script method

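Put all awk commands into a file and make that file executable, with the awk interpreter named on the first line of the script; the script can then be run by typing its name. In other words, the first line of a shell script, #!/bin/sh, is replaced with #!/bin/awk -f. The -f is required so that awk reads the program from the script file, and the path should be adjusted to wherever awk is installed on your system.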

3. Insert all awk commands into a separate file and then call: awk -f awk-script-file input-file(s) where the -f option loads the awk script from awk-script-file, and input-file(s) is the same as above.

This chapter focuses on the command line method.
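For comparison, here is a small sketch of the -f method next to the command line method. The temporary file created with mktemp and the two sample records are assumptions made for illustration.

```shell
# Write a one-rule awk program to a temporary file (hypothetical script file).
script=$(mktemp)
printf '%s\n' '{ print $1 }' > "$script"

# Script file method: load the program with -f.
from_file=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: -f "$script")

# Command line method: the same program given inline.
inline=$(printf 'root:x:0\ndaemon:x:1\n' | awk -F: '{ print $1 }')

printf '%s\n' "$from_file"
rm -f "$script"
```

Both invocations run the same program, so they should produce identical output; the -f form mainly helps once a program grows beyond a line or two.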

Getting Started Examples

Assuming the output of last -n 5 is as follows:

# last -n 5    # show only the first five lines

root pts/1 192.168.1.100 Tue Feb 10 11:21 still logged in

root pts/1 192.168.1.100 Tue Feb 10 00:46 - 02:28 (01:41)

root pts/1 192.168.1.100 Mon Feb  9 11:41 - 18:30 (06:48)

dmtsai pts/1 192.168.1.100 Mon Feb  9 11:41 - 11:41 (00:00)

root tty1 Fri Sep  5 14:09 - 14:10 (00:01)

If you only want to display the last 5 logged-in accounts:

# last -n 5 | awk '{print $1}'

root

root

root

dmtsai

root

The workflow of awk is as follows: it reads one record at a time, with records delimited by the newline character '\n', then splits the record into fields according to the field separator and populates the field variables. $0 represents the entire record, $1 the first field, and $n the nth field. The default field separator is whitespace (spaces or tabs), so $1 is the logged-in user, $3 is the user's IP address, and so on.
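The record and field variables can be inspected on a single line. The sketch below uses one made-up line in the style of last output, with some extra spaces added to show that a run of whitespace counts as a single separator; the variable names are assumptions.

```shell
# One hypothetical record; note the repeated spaces between some fields.
line='root   pts/1  192.168.1.100 Tue Feb 10 11:21 still logged in'

whole=$(printf '%s\n' "$line" | awk '{ print $0 }')   # entire record, spacing preserved
user=$(printf '%s\n' "$line" | awk '{ print $1 }')    # first field: root
ip=$(printf '%s\n' "$line" | awk '{ print $3 }')      # third field: 192.168.1.100
count=$(printf '%s\n' "$line" | awk '{ print NF }')   # number of fields: 10

printf '%s\n' "$whole" "$user" "$ip" "$count"
```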

If you only want to display the accounts in /etc/passwd:

# cat /etc/passwd | awk -F ':' '{print $1}'

root

daemon

bin

sys

This is an example of awk + action, where the action {print $1} is executed for each line.

-F specifies the field separator as ':'.
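The separator can also be set inside the program through the FS variable, in a BEGIN block that runs before any input is read. A minimal sketch, using two made-up passwd-style lines rather than the real /etc/passwd:

```shell
# Hypothetical passwd-style records.
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# Separator given on the command line.
with_flag=$(printf '%s\n' "$sample" | awk -F ':' '{ print $1 }')

# Same separator assigned to FS before the first record is read.
with_fs=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = ":" } { print $1 }')

printf '%s\n' "$with_flag"
```

The two forms should behave the same; assigning FS in BEGIN is convenient when the program lives in a script file.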

If you only want to display the accounts in /etc/passwd and their corresponding shells, with the account and shell separated by a tab character:

# cat /etc/passwd | awk -F ':' '{print $1"\t"$7}'

root /bin/bash

daemon /bin/sh

bin /bin/sh

sys /bin/sh

If you only want to display the accounts in /etc/passwd and their corresponding shells, with the account and shell separated by a comma, a header line name,shell before all the lines, and blue,/bin/nosh added as the last line:

cat /etc/passwd | awk -F ':' 'BEGIN {print "name,shell"} {print $1","$7} END {print "blue,/bin/nosh"}'

name,shell

root,/bin/bash

daemon,/bin/sh

bin,/bin/sh

sys,/bin/sh

….

blue,/bin/nosh

The workflow of awk is as follows: it first executes BEGIN, then reads the file one record at a time, with records delimited by the newline character '\n'. It splits each record into fields according to the field separator and populates the field variables: $0 represents the entire record, $1 the first field, and $n the nth field. It then executes the actions of the patterns that match, moves on to the second record, and so on until all records are read, and finally executes the END block.
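The order of execution can be made visible by printing NR from each stage. This is only a sketch on two made-up input lines; the labels in the output are arbitrary.

```shell
trace=$(printf 'a\nb\n' | awk '
    BEGIN { print "BEGIN: NR=" NR }        # runs once, before any input is read
          { print "record " NR ": " $0 }   # runs once for each record
    END   { print "END: NR=" NR }          # runs once, after the last record
')
printf '%s\n' "$trace"
# BEGIN: NR=0
# record 1: a
# record 2: b
# END: NR=2
```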

Search for all lines in /etc/passwd that contain the keyword root:

# awk -F: '/root/' /etc/passwd

root:x:0:0:root:/root:/bin/bash

This is an example of using a pattern, where only lines matching the pattern (here root) will execute the action (no action specified, default is to output each line’s content).

Search supports regular expressions, for example, to find lines starting with root: awk -F: '/^root/' /etc/passwd

Search for all lines in /etc/passwd that contain the keyword root and display the corresponding shell:

# awk -F: '/root/{print $7}' /etc/passwd

/bin/bash

Here, the action {print $7} is specified.
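A bare /root/ pattern is tested against the whole line, so it also matches accounts that merely have root somewhere in another field. The sketch below contrasts it with an anchored pattern and with a match restricted to one field using the ~ operator. The three passwd-style lines, including the operator account, are made up for illustration.

```shell
# Hypothetical passwd-style records; "operator" has /root as its home directory.
sample='root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# Matches "root" anywhere in the line: root and operator.
anywhere=$(printf '%s\n' "$sample" | awk -F: '/root/ { print $1 }')

# Anchored to the start of the line: only root.
anchored=$(printf '%s\n' "$sample" | awk -F: '/^root/ { print $1 }')

# Restricted to the first field: the shell of the account named exactly root.
by_field=$(printf '%s\n' "$sample" | awk -F: '$1 ~ /^root$/ { print $7 }')

printf '%s\n' "$anywhere" "$anchored" "$by_field"
```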

Built-in Variables in awk

Awk has many built-in variables used to set environment information, which can be changed. Below are some of the most commonly used variables.

ARGC Number of command-line arguments

ARGV Array of command-line arguments

ENVIRON Associative array of the system environment variables

FILENAME Name of the file awk is currently reading

FNR Record number within the current file

FS Input field separator, equivalent to the command-line -F option

NF Number of fields in the current record

NR Number of records read so far

OFS Output field separator

ORS Output record separator

RS Input record separator

Additionally, the $0 variable refers to the entire record. $1 represents the first field of the current line, $2 represents the second field of the current line, and so on.
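A short sketch of a few of these variables in use: FS and OFS are assigned in BEGIN, NR and NF are read for each record, $NF picks the last field, and ENVIRON looks up an environment variable. The sample records and the GREETING variable are assumptions made for this example.

```shell
# Hypothetical passwd-style records.
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# print with commas joins its arguments using OFS.
report=$(printf '%s\n' "$sample" |
    awk 'BEGIN { FS = ":"; OFS = " | " } { print NR, NF, $1, $NF }')

# ENVIRON is indexed by the name of the environment variable.
greeting=$(GREETING=hello awk 'BEGIN { print ENVIRON["GREETING"] }')

printf '%s\n' "$report" "$greeting"
# 1 | 7 | root | /bin/bash
# 2 | 7 | daemon | /bin/sh
# hello
```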

Print the file name, line number, number of columns, and complete content of each line in /etc/passwd:

# awk -F ':' '{print "filename:" FILENAME ",linenumber:" NR ",columns:" NF ",linecontent:"$0}' /etc/passwd

filename:/etc/passwd,linenumber:1,columns:7,linecontent:root:x:0:0:root:/root:/bin/bash

filename:/etc/passwd,linenumber:2,columns:7,linecontent:daemon:x:1:1:daemon:/usr/sbin:/bin/sh

filename:/etc/passwd,linenumber:3,columns:7,linecontent:bin:x:2:2:bin:/bin:/bin/sh

filename:/etc/passwd,linenumber:4,columns:7,linecontent:sys:x:3:3:sys:/dev:/bin/sh

Using printf instead of print can make the code more concise and readable.

awk -F ':' '{printf("filename:%10s,linenumber:%s,columns:%s,linecontent:%s\n",FILENAME,NR,NF,$0)}' /etc/passwd

Print and printf

Awk provides both print and printf for output.

print can take variables, numbers, or strings as arguments. Strings must be enclosed in double quotes, and arguments are separated by commas. Without commas, the arguments are concatenated and cannot be told apart. On output, each comma is replaced by the output field separator (OFS), which is a space by default.

printf works much like printf in C and formats its arguments according to a format string. For complex output, printf is more useful and makes the code easier to understand.
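The difference between commas, concatenation, and printf is easiest to see side by side. A minimal sketch with made-up values:

```shell
# Commas: arguments are joined by OFS (a single space by default).
with_commas=$(awk 'BEGIN { print "a", "b", 1 }')     # a b 1

# No commas: adjacent values are concatenated.
concatenated=$(awk 'BEGIN { print "a" "b" 1 }')      # ab1

# printf: explicit widths and precision; no newline unless \n is written.
formatted=$(awk 'BEGIN { printf("%-8s|%5d|%.2f\n", "root", 42, 3.14159) }')

printf '%s\n' "$with_commas" "$concatenated" "$formatted"
```

Here %-8s left-aligns the string in eight columns, %5d right-aligns the integer in five, and %.2f rounds to two decimal places.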

Awk Programming

Variables and Assignments

In addition to built-in variables, awk also allows for user-defined variables.

Below is a count of the number of accounts in /etc/passwd:

awk '{count++;print $0;} END{print "user count is ", count}' /etc/passwd

root:x:0:0:root:/root:/bin/bash

……

user count is  40

Here count is a user-defined variable. The earlier action{} blocks contained only a single print, but print is just one statement; an action{} block can contain multiple statements separated by semicolons.

count is not initialized here. Although it defaults to 0, it is better practice to initialize it explicitly:

awk 'BEGIN {count=0;print "[start]user count is ", count} {count=count+1;print $0;} END{print "[end]user count is ", count}' /etc/passwd

[start]user count is  0

root:x:0:0:root:/root:/bin/bash

…

[end]user count is  40

Sum the size in bytes of the files in a directory:

ls -l | awk 'BEGIN {size=0;} {size=size+$5;} END{print "[end]size is ", size}'

[end]size is  8657198

To display the result in MB:

ls -l | awk 'BEGIN {size=0;} {size=size+$5;} END{print "[end]size is ", size/1024/1024,"M"}'

[end]size is  8.25889 M

Note that the total does not include the contents of subdirectories.

Conditional Statements

Conditional statements in awk are borrowed from C and take the following forms:

if(expression){

statement;

statement;

……

}

if(expression){

statement;

}else{

statement2;

}

if(expression){

statement1;

}else if(expression1){

statement2;

}else{

statement3;

}

Sum the size in bytes of the files in a directory, skipping entries whose size is 4096 (these are usually directories):

ls -l | awk 'BEGIN {size=0;print "[start]size is ", size} {if($5!=4096){size=size+$5;}} END{print "[end]size is ", size/1024/1024,"M"}'

[start]size is  0

[end]size is  8.22339 M
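As an alternative sketch, the if / else if / else chain can classify entries by the type character at the start of the first column instead of by size, which avoids miscounting a regular file that happens to be 4096 bytes. The listing below is made up in the style of ls -l output, and the counter names are assumptions.

```shell
# Hypothetical listing in the style of ls -l.
listing='total 12
-rw-r--r-- 1 user user 1048576 Jan  1 00:00 a.bin
drwxr-xr-x 2 user user    4096 Jan  1 00:00 sub
-rw-r--r-- 1 user user  524288 Jan  1 00:00 b.bin'

summary=$(printf '%s\n' "$listing" | awk '
    {
        if ($1 ~ /^d/) {            # directory entry
            dirs++
        } else if ($1 ~ /^-/) {     # regular file: add its size
            files++
            size += $5
        } else {                    # anything else, such as the "total" line
            other++
        }
    }
    END {
        printf("files=%d dirs=%d other=%d size=%.2fM\n",
               files, dirs, other, size / 1024 / 1024)
    }
')
printf '%s\n' "$summary"
# files=2 dirs=1 other=1 size=1.50M
```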

Loop Statements

Loop statements in awk are also borrowed from C language, supporting while, do/while, for, break, continue, and these keywords have the same semantics as in C language.
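A brief sketch of two of these constructs on made-up input: a for loop that walks the fields of a record backwards, and a while loop that stops with break at the first field greater than 10. The variable names are illustrative.

```shell
# for loop: rebuild the record with its fields in reverse order.
reversed=$(printf 'one two three\n' | awk '{
    out = $NF
    for (i = NF - 1; i >= 1; i--)
        out = out " " $i
    print out
}')

# while loop with break: find the first field greater than 10.
first_big=$(printf '3 8 15 4 23\n' | awk '{
    i = 1
    while (i <= NF) {
        if ($i > 10)
            break
        i++
    }
    print $i
}')

printf '%s\n' "$reversed" "$first_big"
# three two one
# 15
```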

Arrays

In awk, array indices can be numbers or strings, and are usually called keys. Keys and values are stored internally in a hash table. Because a hash table is unordered, array contents may not be displayed in the order you expect. Like variables, arrays are created automatically on first use, and awk automatically determines whether they hold numbers or strings. Arrays in awk are generally used to collect information from records, for example to compute sums, count words, or track how many times a pattern is matched.

Display the accounts in /etc/passwd:

awk -F ':' 'BEGIN {count=0;} {name[count] = $1;count++;} END{for (i = 0; i < count; i++) print i, name[i]}' /etc/passwd

Here, a for loop is used to iterate through the array.
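Arrays can also be keyed by strings, which is what makes them useful for counting. The sketch below counts how many accounts use each shell in three made-up passwd-style lines and iterates with for (key in array). Because that iteration order is unspecified, the output is piped through sort to make it stable.

```shell
# Hypothetical passwd-style records.
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh'

counts=$(printf '%s\n' "$sample" | awk -F: '
    { shells[$7]++ }              # string keys; entries are created on first use
    END {
        for (s in shells)         # iteration order is unspecified
            print shells[s], s
    }
' | sort)

printf '%s\n' "$counts"
# 1 /bin/bash
# 2 /bin/sh
```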

There is a lot of content in awk programming; here only simple and commonly used usages are listed. For more, please refer to http://www.gnu.org/software/gawk/manual/gawk.html
