Detailed Explanation of the Linux awk Command

Source: ggjucheng

Link: http://www.cnblogs.com/ggjucheng/archive/2013/01/13/2858470.html

Introduction

awk is a powerful text-analysis tool. Where grep searches and sed edits, awk excels at data analysis and report generation. In simple terms, awk reads input line by line, splits each line into fields using whitespace (spaces and tabs) as the default delimiter, and then analyzes and processes those fields.

awk has three different versions: awk, nawk, and gawk. Unless otherwise specified, awk in this article refers to gawk, the GNU version of AWK.

The name awk comes from the initials of the surnames of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is in fact a language of its own, the AWK programming language, which its three creators formally define as a "pattern scanning and processing language." It lets you write short programs that read input files, sort and process data, perform calculations on the input, generate reports, and much more.

Usage

awk '{pattern + action}' {filenames}

Although the operations can be complex, the syntax always follows this form: pattern specifies what awk looks for in the data, and action is the series of commands executed when a match is found. The curly braces ({}) do not always have to appear in a program; they group the series of instructions that belong to a particular pattern. A pattern is typically a regular expression enclosed in slashes.

The most basic function of the awk language is to browse and extract information from files or strings based on specified rules. After extracting information, awk can perform other text operations. A complete awk script is usually used to format information in text files.

Typically, awk processes one line of a file at a time. awk receives one line of the file and then executes the corresponding command to process the text.
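As a minimal sketch of the pattern and action pairing described above, the following uses a hypothetical three-line input (the names and numbers are made up for illustration). The pattern /o/ selects the lines to act on, and the action prints the first field of each selected line.

```shell
# Hypothetical sample input; not taken from any real file.
data='alice 30
bob 25
carol 41'

# Pattern /o/ matches lines containing the letter "o";
# the action {print $1} runs only for those lines.
out=$(printf '%s\n' "$data" | awk '/o/ {print $1}')
echo "$out"
```

Only the lines for bob and carol contain an "o", so only their first fields are printed.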

Calling awk

There are three ways to call awk

1. Command line method

awk [-F field-separator] 'commands' input-file(s)

Here, commands are the actual awk commands, and [-F field-separator] is optional. input-file(s) are the files to be processed.

In awk, each item in a line that is delimited by the field separator is called a field. Unless a separator is specified with -F, the default field separator is whitespace.
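The effect of -F can be seen by counting fields on the same line with and without it. This sketch relies on NF, awk's built-in count of fields in the current record; the sample line is in the /etc/passwd format and is used here only for illustration.

```shell
line='root:x:0:0:root:/root:/bin/bash'

# Default separator: the line contains no whitespace, so it is one single field.
default_nf=$(printf '%s\n' "$line" | awk '{print NF}')

# With -F: the same line splits into seven colon-separated fields.
colon_nf=$(printf '%s\n' "$line" | awk -F: '{print NF}')

echo "default: $default_nf, with -F: $colon_nf"
```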

2. Shell script method

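Put all awk commands into a file and make the file executable, naming the awk interpreter on the first line of the script; the script can then be run by typing its name.

This works like the first line of a shell script: #!/bin/sh

For an awk script it becomes: #!/bin/awk -f (adjust the path to wherever awk is installed; the -f option is required so that awk reads the program from the script file)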

3. Insert all awk commands into a separate file and then call:

awk -f awk-script-file input-file(s)

Here, the -f option loads the awk script from awk-script-file, and input-file(s) is the same as above.
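A minimal sketch of the script-file method, assuming mktemp is available to create a scratch file. The two-line input and the script contents are hypothetical and only illustrate the mechanics of -f.

```shell
# Write a small awk program to a temporary file (path chosen by mktemp).
script=$(mktemp)
cat > "$script" <<'EOF'
# Print the first colon-separated field of every line.
BEGIN { FS = ":" }
{ print $1 }
EOF

# Run the program file against some sample input.
out=$(printf 'root:x:0\ndaemon:x:1\n' | awk -f "$script")
echo "$out"

rm -f "$script"
```

Setting FS in a BEGIN block inside the script has the same effect as passing -F: on the command line, which keeps the separator together with the program that depends on it.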

This article focuses on the command line method.

Getting Started Examples

Assuming the output of last -n 5 is as follows

# last -n 5    (shows only the first five lines)

root pts/1 192.168.1.100 Tue Feb 10 11:21 still logged in

root pts/1 192.168.1.100 Tue Feb 10 00:46 - 02:28 (01:41)

root pts/1 192.168.1.100 Mon Feb 9 11:41 - 18:30 (06:48)

dmtsai pts/1 192.168.1.100 Mon Feb 9 11:41 - 11:41 (00:00)

root tty1 Fri Sep 5 14:09 - 14:10 (00:01)

If you only want to display the last 5 logged-in accounts

# last -n 5 | awk '{print $1}'

root

root

root

dmtsai

root

The workflow of awk is as follows: it reads one record, delimited by a newline character, then splits the record into fields according to the field separator and fills in the field variables. $0 represents the entire record, $1 the first field, and $n the nth field. The default field separator is whitespace (spaces or tabs), so $1 is the logged-in user, $3 is the user's IP address, and so on.
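The default splitting can be checked with a line that mixes spaces and a tab. In this sketch the input line is hypothetical; it shows that a run of whitespace counts as a single separator, so $1 and $3 still pick out the user and the address.

```shell
# One record with irregular whitespace: a space, a tab, and several spaces.
out=$(printf 'root \t pts/1   192.168.1.100\n' | awk '{print $1 " from " $3}')
echo "$out"
```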

If you only want to display the accounts in /etc/passwd

# cat /etc/passwd | awk -F ':' '{print $1}'

root

daemon

bin

sys

This is an example of awk + action, where the action {print $1} is executed for each line.

-F specifies the field separator as ':'.

If you only want to display the accounts in /etc/passwd and their corresponding shells, with accounts and shells separated by a tab character

# cat /etc/passwd | awk -F ':' '{print $1"\t"$7}'

root /bin/bash

daemon /bin/sh

bin /bin/sh

sys /bin/sh

If you only want to display the accounts in /etc/passwd and their corresponding shells, separated by commas, with a header line "name,shell" before the first line and "blue,/bin/nosh" appended after the last line:

cat /etc/passwd | awk -F ':' 'BEGIN {print "name,shell"} {print $1","$7} END {print "blue,/bin/nosh"}'

name,shell

root,/bin/bash

daemon,/bin/sh

bin,/bin/sh

sys,/bin/sh

….

blue,/bin/nosh

The workflow of awk is as follows: it first executes BEGIN, then reads the file one newline-delimited record at a time. Each record is split into fields according to the field separator, so that $0 holds the entire record, $1 the first field, and $n the nth field, and then the actions whose patterns match are executed. awk then reads the next record and repeats until all records have been read, and finally executes the END block.
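The order of execution can be made visible with a small sketch in which each block announces itself. The two-record input is hypothetical; NR is awk's built-in record counter.

```shell
out=$(printf 'a:1\nb:2\n' | awk -F: '
BEGIN { print "start" }                 # runs once, before any input is read
      { print NR ": " $1 }              # runs once for every record
END   { print "done, " NR " records" }  # runs once, after the last record
')
echo "$out"
```

In the END block NR still holds the number of the last record read, which is why it can be used as a total there.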

Search for all lines in /etc/passwd that contain the keyword root

# awk -F: '/root/' /etc/passwd

root:x:0:0:root:/root:/bin/bash

This is an example of using a pattern: the action is executed only for lines that match the pattern (here, root). When no action is given, the default is to print the whole line.

Search supports regular expressions, for example, to find lines starting with root: awk -F: '/^root/' /etc/passwd

Search for all lines in /etc/passwd that contain the keyword root and display the corresponding shell

# awk -F: '/root/{print $7}' /etc/passwd

/bin/bash

Here, the action {print $7} is specified.
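The difference between an unanchored and an anchored pattern shows up as soon as another line happens to contain the keyword. This sketch uses a hypothetical three-line sample in the /etc/passwd format, in which the operator account has /root as its home directory.

```shell
# Hypothetical sample in /etc/passwd format.
passwd_sample='root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
daemon:x:1:1:daemon:/usr/sbin:/bin/sh'

# /root/ matches anywhere in the line, so operator is selected too.
anywhere=$(printf '%s\n' "$passwd_sample" | awk -F: '/root/ {print $1}')

# /^root/ matches only at the start of the line.
anchored=$(printf '%s\n' "$passwd_sample" | awk -F: '/^root/ {print $1}')

echo "anywhere: $anywhere"
echo "anchored: $anchored"
```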

Built-in Variables in awk

awk has many built-in variables used to set environment information, which can be changed. Below are some of the most commonly used variables.

ARGC Number of command-line arguments

ARGV Array of command-line arguments

ENVIRON Array of environment variables, indexed by variable name

FILENAME Name of the file awk is currently reading

FNR Number of records read so far from the current file

FS Input field separator, equivalent to the command-line -F option

NF Number of fields in the current record

NR Number of records read so far

OFS Output field separator

ORS Output record separator

RS Input record separator

Additionally, the $0 variable refers to the entire record. $1 represents the first field of the current line, $2 represents the second field of the current line, and so on.
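A short sketch of how a few of these variables interact, using a hypothetical two-line input. FS and OFS are set in a BEGIN block, and NR, NF, and $1 are printed for each record.

```shell
out=$(printf 'a:b:c\nd:e\n' | awk '
BEGIN { FS = ":"; OFS = "-" }   # split input on ":", join output with "-"
      { print NR, NF, $1 }      # record number, field count, first field
')
echo "$out"
```

The commas in the print statement are what bring OFS into play; without them the three values would be concatenated directly.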

Print the file name, line number, number of columns, and full line content for each line of /etc/passwd:

# awk -F ':' '{print "filename:" FILENAME ",linenumber:" NR ",columns:" NF ",linecontent:"$0}' /etc/passwd

filename:/etc/passwd,linenumber:1,columns:7,linecontent:root:x:0:0:root:/root:/bin/bash

filename:/etc/passwd,linenumber:2,columns:7,linecontent:daemon:x:1:1:daemon:/usr/sbin:/bin/sh

filename:/etc/passwd,linenumber:3,columns:7,linecontent:bin:x:2:2:bin:/bin:/bin/sh

filename:/etc/passwd,linenumber:4,columns:7,linecontent:sys:x:3:3:sys:/dev:/bin/sh

Using printf instead of print can make the code more concise and readable.

awk -F ':' '{printf("filename:%10s,linenumber:%s,columns:%s,linecontent:%s\n",FILENAME,NR,NF,$0)}' /etc/passwd

print and printf

awk provides both print and printf functions for output.

The print function can take variables, numbers, or strings as arguments. String literals must be enclosed in double quotes, and arguments are separated by commas. Without commas, the arguments are concatenated with nothing between them and can no longer be told apart. On output, each comma is replaced by the output field separator (OFS), which defaults to a space.

The printf function is similar to the printf in C language, allowing for formatted strings. When outputting complex data, printf is more useful and makes the code easier to understand.
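The three output styles can be compared side by side on one record. This is a minimal sketch with a hypothetical input line; the %-8s format left-justifies the first field in a column eight characters wide.

```shell
out=$(printf 'root:/bin/bash\n' | awk -F: '{
    print $1, $2                   # comma: joined by OFS (a space by default)
    print $1 $2                    # no comma: plain concatenation
    printf("%-8s|%s\n", $1, $2)    # printf: explicit format and explicit newline
}')
echo "$out"
```

Unlike print, printf does not append a newline by itself, so the format string has to end with \n.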

awk Programming

Variables and Assignments

In addition to built-in variables, awk also allows for user-defined variables.

Below is an example counting the number of accounts in /etc/passwd

awk '{count++; print $0;} END {print "user count is ", count}' /etc/passwd

root:x:0:0:root:/root:/bin/bash

……

user count is  40

count is a user-defined variable. In previous actions, there was only one print, but the action can have multiple statements separated by semicolons.

Here count is never initialized. Although it defaults to 0, it is better practice to initialize it explicitly:

awk 'BEGIN {count=0; print "[start]user count is ", count} {count=count+1; print $0;} END {print "[end]user count is ", count}' /etc/passwd

[start]user count is  0

root:x:0:0:root:/root:/bin/bash

…

[end]user count is  40

Count the byte size occupied by files in a specific folder

ls -l | awk 'BEGIN {size=0;} {size=size+$5;} END {print "[end]size is ", size}'

[end]size is  8657198

If you want to display in MB:

ls -l | awk 'BEGIN {size=0;} {size=size+$5;} END {print "[end]size is ", size/1024/1024, "M"}'

[end]size is  8.25889 M

Note that the count does not include subdirectories of the folder.
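To see what the summation does line by line, here is a sketch that feeds awk a fixed, hypothetical listing in the ls -l format instead of a real directory. The file names and sizes are invented for illustration.

```shell
# Hypothetical "ls -l" style output; the fifth column is the size in bytes.
listing='total 12
-rw-r--r-- 1 user user 1024 Jan  1 00:00 a.txt
-rw-r--r-- 1 user user 2048 Jan  1 00:00 b.txt
drwxr-xr-x 2 user user 4096 Jan  1 00:00 sub'

# Add up the fifth field of every line and print the total at the end.
out=$(printf '%s\n' "$listing" | awk '{size += $5} END {print size}')
echo "$out"
```

The leading "total" line has no fifth field, so it contributes 0 to the sum, while the directory entry itself is counted with its own size of 4096.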

Conditional Statements

Conditional statements in awk are borrowed from C and take the following forms:

if(expression){

statement;

statement;

……

}

if(expression){

statement;

}else{

statement2;

}

if(expression){

statement1;

} else if (expression1) {

statement2;

}else{

statement3;

}
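A minimal sketch of the three-way form, classifying hypothetical byte counts. The thresholds and the labels small, medium, and large are chosen only for illustration.

```shell
out=$(printf '512\n4096\n1048576\n' | awk '{
    if ($1 < 1024) {
        label = "small"
    } else if ($1 < 1024 * 1024) {
        label = "medium"
    } else {
        label = "large"
    }
    print $1, label
}')
echo "$out"
```

Note that awk has no single elseif keyword; the chain is written as else followed by a new if, exactly as in C.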

Count the byte size occupied by files in a specific folder, filtering out files of size 4096 (which are generally folders):

ls -l | awk 'BEGIN {size=0; print "[start]size is ", size} {if ($5 != 4096) {size=size+$5;}} END {print "[end]size is ", size/1024/1024, "M"}'

[start]size is  0

[end]size is  8.22339 M

Loop Statements

Loop statements in awk are also borrowed from C language, supporting while, do/while, for, break, and continue. The semantics of these keywords are identical to those in C language.
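As a small sketch of these constructs, the following sums the fields of one hypothetical record with a for loop, then uses while and break to find the first field greater than 3.

```shell
out=$(printf '3 1 4 1 5\n' | awk '{
    sum = 0
    for (i = 1; i <= NF; i++)      # for: visit every field of the record
        sum += $i

    i = 1
    while (i <= NF) {              # while with break: stop at the first field > 3
        if ($i > 3) break
        i++
    }
    print "sum=" sum, "first_gt3=" $i
}')
echo "$out"
```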

Arrays

In awk, array indices can be numbers or strings, and are usually called keys. Keys and values are stored together in an internal hash table. Because a hash table has no inherent order, the elements may not appear in the order you expect when the array is displayed. Like variables, arrays are created automatically when first used, and awk determines automatically whether they hold numbers or strings. Arrays are generally used to collect information from records, for example to compute sums, count words, or track how many times a pattern is matched.

Display the accounts in /etc/passwd

awk -F ':' 'BEGIN {count=0;} {name[count] = $1; count++;} END {for (i = 0; i < NR; i++) print i, name[i]}' /etc/passwd

0 root

1 daemon

2 bin

3 sys

4 sync

5 games

……

Here, a for loop is used to iterate through the array.
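Arrays with string keys are a natural fit for counting. This sketch, which uses a hypothetical three-line sample in the /etc/passwd format, counts how many accounts use each shell and iterates with the for (key in array) form. Because that form visits keys in no particular order, the output is piped through sort to make it stable.

```shell
# Hypothetical sample in /etc/passwd format.
passwd_sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh'

out=$(printf '%s\n' "$passwd_sample" | awk -F: '
    { shells[$7]++ }                               # key: shell path, value: count
END { for (s in shells) print s, shells[s] }       # iteration order is unspecified
' | sort)
echo "$out"
```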

There is a lot to awk programming; here are just some simple common usages. For more, please refer to http://www.gnu.org/software/gawk/manual/gawk.html
