Regular Expressions
Basic regular expression metacharacters (^, $, ., [], *)
Linux regular expressions are tools for processing large amounts of text: they define a set of rules and methods for matching specific text patterns. A regular expression consists of ordinary characters (such as the letters a to z) and special characters (also known as metacharacters) that describe how to match one or more strings when searching or replacing. Many commands in the Linux environment, such as grep, sed, awk, and find, support regular expressions for complex text-processing tasks.
The basic concepts of regular expressions include: Character classes: a set of characters can be specified with a class, such as [0-9] for digit characters and [a-z] for lowercase English letters. Quantifiers: * matches the preceding character zero or more times, + one or more times, and ? zero or one time. Boundary anchors: ^ marks the start of a line, and $ marks the end of a line. Alternation: | separates alternative patterns. Repetition counts: {n} repeats exactly n times, {n,} at least n times, and {n,m} between n and m times.
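The concepts above can be tried directly with grep. A minimal sketch; the sample file and its contents are invented for illustration:

```shell
# Build a small sample file (contents invented for this demo)
printf 'root\nrt\nroot line\n# comment\n\n' > /tmp/re_demo.txt

# ^ anchors the match to the start of the line
grep -c '^root' /tmp/re_demo.txt   # counts lines starting with "root"

# o* matches zero or more "o"s, so ro*t matches both "rt" and "root"
grep -c 'ro*t' /tmp/re_demo.txt
```

The -c option counts matching lines instead of printing them, which makes the effect of each metacharacter easy to verify.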
Examples of using regular expressions include: using the grep command to match lines containing specific text; using the sed command to replace or delete text; and using the awk command together with regular expressions for various data-processing tasks. In Linux systems, regular expressions are typically processed on a "line" basis, and the escape character \ can be used to control the behavior of special characters.
The significance of regular expressions
Regular expressions exist to process large amounts of strings and text. With a handful of special symbols, Linux administrators can quickly filter, replace, and process the strings they need, which makes their work efficient. Linux operations constantly face string-heavy content: configuration files, program code, command output, and log files. For such content, we often need to find the particular strings that meet a given work requirement, hence the emergence of regular expressions: a set of rules and methods that simplify complex tasks and improve efficiency. Among common Linux commands, regular expressions are used through the "three musketeers" (sed, awk, grep); ordinary commands generally do not interpret them. Regular expressions are also applied far beyond Linux, in languages such as Python, Java, and Perl.
Wildcards
Wildcards and regular expressions share symbols such as *, ?, and [abcd], but wildcards use them to match file names rather than text content. For example, ls *.log can find a.log, b.log, and ccc.log.
Wildcards are supported by most ordinary commands for finding files or directories, while regular expressions filter content in files (data streams) through the three musketeers commands.
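The distinction can be shown side by side. A minimal sketch; the directory and file names below are invented for the demo:

```shell
# Wildcards are expanded by the shell against file names;
# regular expressions filter text content.
mkdir -p /tmp/wc_demo
touch /tmp/wc_demo/a.log /tmp/wc_demo/b.log /tmp/wc_demo/ccc.log /tmp/wc_demo/notes.txt

ls /tmp/wc_demo/*.log             # wildcard: the shell expands *.log to the three file names
ls /tmp/wc_demo | grep '\.log$'   # regex: grep filters the listing text for names ending in .log
```

The first command never sees a regular expression; the shell has already replaced *.log with matching names before ls runs. The second treats the listing as a data stream and filters it.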
BRE Set
Matching characters, matching times, position anchoring
grep -n "^$" ck.txt finds empty lines in the ck file. grep -n -v "^$" ck.txt finds non-empty lines. grep "^#" ck.txt finds comment lines that start with a hash. grep -v "^#" ck.txt | grep -v "^$" finds lines in the ck file that are neither comments nor empty.
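The comment-and-blank filtering idiom is worth running end to end. A minimal sketch with an invented config-style sample file:

```shell
# Sample config-style file (contents invented for the demo)
printf '# a comment\n\nname=ck\n# another\nport=80\n' > /tmp/ck_demo.txt

# Drop comment lines, then drop empty lines: only real settings remain
grep -v '^#' /tmp/ck_demo.txt | grep -v '^$'
```

Only the two setting lines survive; this is the standard way to read a heavily commented configuration file at a glance.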
Usage of ^: grep -n -i "^i" ck.txt finds lines in the ck file that start with i (case-insensitive). Usage of .: grep -n "\.$" ck.txt finds lines ending with a literal dot (the escape makes the dot an ordinary character), while grep -n ".$" ck.txt matches any line whose last character is anything at all (the dot keeps its regex meaning). grep -n "." ck.txt finds non-empty lines. grep -n -i ".ab" ck.txt finds any three-character sequence ending in ab. The ^$ combination matches empty lines: grep -n "^$" ck.txt. Usage of *, which matches the preceding character 0 or more times: grep "i*" ck.txt outputs the whole file with the matched runs of i highlighted, and grep -o "i*" ck.txt prints only the matches themselves. The .* combination matches all content on a line, including spaces: grep ".*" ck.txt matches everything, and grep ".*e" ck.txt matches every line containing an e (greedily, up to the last e on the line).
In a bracket expression such as [abc], a ^ symbol in the first position means negation. The expression [abc] matches any one of the characters inside the brackets: a, b, or c. Common forms include [a-z] matching any single lowercase letter, [A-Z] any single uppercase letter, [a-zA-Z] any single letter of either case, [0-9] any single digit, and [a-zA-Z0-9] any single digit or letter.
grep "[a-z]" ck.txt finds lowercase letters. grep "[A-Z]" ck.txt finds uppercase letters. grep "[^0-5]" ck.txt finds all characters outside the range 0-5.
Matching times: * matches the preceding character any number of times (0 to infinity). \? matches the preceding character 0 or 1 time. \+ matches the preceding character 1 or more times. \{n\} matches the preceding character exactly n times. \{n,\} matches it at least n times. \{n,m\} matches it at least n and at most m times. \{,n\} (equivalent to \{0,n\}) matches it at most n times. In basic regular expressions the braces must be escaped, as in grep "[root]\{3\}" file.
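The escaped-brace repetition counts can be demonstrated on a tiny invented file:

```shell
printf 'go\ngoo\ngooo\n' > /tmp/rep_demo.txt

# In BRE the braces must be escaped: o\{2\} means exactly two "o"s in a row
grep 'go\{2\}$' /tmp/rep_demo.txt    # matches only "goo"
grep 'go\{2,\}' /tmp/rep_demo.txt    # "goo" and "gooo" (two or more)
```

Anchoring with $ in the first command is what makes the count exact; without it, "gooo" would also match because it contains two consecutive o's.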
Matching characters: . any single character; [] any one of the listed characters; [^] any character outside the listed set; .* any number of any characters; .\? any character appearing 0 or 1 time; []* the listed characters appearing any number of times. The POSIX character classes, written inside a bracket expression, are: [[:digit:]] digits, [[:lower:]] lowercase letters, [[:upper:]] uppercase letters, [[:alpha:]] all letters, [[:alnum:]] digits and letters, [[:punct:]] punctuation marks, [[:space:]] whitespace. Example: grep "^[[:space:]]" file.
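The double-bracket form of the POSIX classes is the part most often mistyped, so here is a runnable sketch on invented sample text:

```shell
printf 'abc\n123\n  indented\nA1\n' > /tmp/class_demo.txt

# POSIX classes go inside a bracket expression: [[:digit:]], [[:space:]], ...
grep '[[:digit:]]' /tmp/class_demo.txt    # lines containing any digit
grep '^[[:space:]]' /tmp/class_demo.txt   # lines starting with whitespace
```

Writing [:digit:] without the outer brackets is a common error: it would be read as a bracket expression containing the characters :, d, i, g, t.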
Position anchoring: \< is the word-start anchor, used at the left edge of a word pattern; \> is the word-end anchor, used at the right edge; \<word\> matches the complete word; ^ is the line-start anchor, used at the leftmost side of the pattern; $ is the line-end anchor, used at the rightmost side. Example: grep "\<hello\>" file.
Grouping and referencing: \( \) bundles one or more characters together so they are treated as a unit. \1 matches the same text that was matched between the first \( and its matching \) (likewise \2, \3, and so on). Example: grep "\([a-z]\+\) is \1" file.
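A back-reference re-matches the exact text the group captured, not the pattern. A minimal sketch on an invented file (note: \+ in a BRE is a GNU grep extension, assumed here):

```shell
printf 'this is this\nthat is this\nfoo is foo\n' > /tmp/ref_demo.txt

# \(...\) captures a group; \1 must match the identical text again.
# The anchors force the whole line to be "WORD is WORD".
grep '^\([a-z]\+\) is \1$' /tmp/ref_demo.txt
```

Only "this is this" and "foo is foo" match; "that is this" fails because \1 would have to repeat "that" exactly.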
Note that on the Linux platform every line ends with a newline, which cat -A displays as a $ symbol at the end of each line.
Extended regular expressions (adding ( ), { }, ?, +, | to basic regular expressions)
Extended regular expressions are activated with grep -E. Matching characters: . matches any single character; [ ] any single character in the listed set; [^] any single character outside it. Matching times: * matches the preceding character any number of times (0, 1, or more); ? 0 or 1 time; + 1 or more times; {m} exactly m times; {m,n} at least m and at most n times; {m,} at least m times; {,n} at most n times. Position anchoring: ^ line-start anchor; $ line-end anchor; \< or \b word-start anchor; \> or \b word-end anchor. Grouping and referencing: ( ) groups a pattern, and the text it matches is recorded in the regex engine's internal variables \1, \2, and so on. Alternation: a|b means a or b.
The ? symbol matches the preceding character 0 or 1 time, so ro?t can only match rot or rt. The ( ) symbol is usually combined with the | symbol to enumerate a set of alternatives; for example, the area code and number of a landline may be joined by a "-" or by a space. Grouped content can be referenced later with \n, where n is a number indicating which group to reference: \1 is the text matched by the first group from the left, \2 by the second. The + symbol matches the preceding character one or more times, so ro+t can match rot, root, and so on. The | symbol means "or", expressing branching alternatives. {m} matches the preceding character exactly m times; {m,n} at least m and at most n times; {m,} at least m times; {,n} at most n times.
grep -E "i+" ck.txt matches i one or more times. grep -E "ro?t" ck.txt matches rt or rot (the o appears 0 or 1 time). find /tmp -name "*txt" | grep -E "A|X" finds files under /tmp whose names contain A or X. grep -E "r(oo|o)t" ck.txt finds lines containing root or rot (the middle is oo or o). grep -E "(r..t).*\1" ck.txt finds lines where the same r..t text appears twice (\1 refers back to the group).
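The ERE quantifiers can be exercised in one sitting. A minimal sketch on an invented file:

```shell
printf 'rt\nrot\nroot\nrooot\n' > /tmp/ere_demo.txt

# With -E the metacharacters ? + { } ( ) | need no backslashes
grep -E '^ro?t$' /tmp/ere_demo.txt    # rt, rot          (zero or one "o")
grep -E '^ro+t$' /tmp/ere_demo.txt    # rot, root, rooot (one or more)
grep -E '^ro{2}t$' /tmp/ere_demo.txt  # root only        (exactly two)
```

Compare with the BRE set, where the same counts would be written \? \+ \{2\}.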
grep and regular expressions
The grep command searches files for lines matching a specified pattern. If a file's content matches the pattern, grep displays the line containing the match. If no file name is given, or if the file name is -, grep reads from standard input.
grep [options] [pattern] [files] — the pattern to match, then the files to search. -a searches binary files as if they were text. -c counts matching lines instead of printing them. -i ignores case, treating upper and lower case as the same. -n prints line numbers. -v inverts the match, displaying lines that do not contain the pattern (excluding matching results). --color=auto highlights the matched text in the output. -o prints only the matched parts of the lines. -w matches whole words only.
grep -n "login" /tmp/test_grep.txt finds lines related to login. grep -n -v "login" /tmp/test_grep.txt finds lines that do not contain login. grep -i "ROOT" /tmp/test_grep.txt ignores case, finding root. grep -E "root|sync" /tmp/test_grep.txt --color=auto filters lines containing root or sync. grep -c "login" /tmp/test_grep.txt counts the matching lines. grep -n -o "login" /tmp/test_grep.txt prints only the matched text. grep -w "oldboy" /tmp/test_grep.txt matches the complete word only, not substrings. grep -Ev "^#|^$" /tmp/test_grep.txt filters out comment and blank lines.
The Three Musketeers of Linux
Text-processing tools that all support regular expression engines: grep, a text filtering (pattern-matching) tool; sed, a stream editor and text-editing tool; awk, Linux's text report generator (formatted text), which on Linux is gawk.
awk
AWK is a language for processing text files, a powerful file analysis tool. It is a programming language specifically designed for text processing and is also a line processing software, commonly used for scanning, filtering, and statistical summarization. Data can come from standard input or from pipes or files.
AWK working principle: it reads the first line, tests it against the condition, executes the specified action if the condition matches, then reads the next line and repeats until the end of input. If no condition is defined, the action applies to every data line. AWK has an implicit loop: the action is executed once for every record the condition matches. It reads text line by line, splits each line into fields using spaces or tabs as the default delimiter, saves the fields into built-in variables, and executes editing commands based on patterns or conditions.
Format 1: awk [options] 'pattern or condition {action}' filename. Format 2: awk -f script_file filename.
Rows and columns: a row is called a record, and records are split on newlines by default. A column is called a field, and fields are split on spaces by default. Comparison symbols: >, <, >=, <=, !=, ==.
Built-in variables: $0 is the entire current record (the whole line). $n is the nth field of the current record, where fields are separated by the delimiter: awk -F: '{print $1,$3}'. NF is the number of fields (columns) in the current record: awk -F: '{print NF}'. $NF is the last field, and $(NF-1) the second-to-last. FNR/NR hold the line number. FS defines the input field separator: awk 'BEGIN{FS=":"}{print $1,$3}' is equivalent to awk -F: '{print $1,$3}'. OFS defines the output field separator, which defaults to a space: awk -F: 'BEGIN{OFS="==>"}{print $1,$3}' outputs $1==>$3, and can also be written awk -F: '{print $1"==>"$3}'. RS is the input record separator, which defaults to newline: awk 'BEGIN{RS="\t"}{print $1}' splits records on tabs instead of newlines.
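FS and OFS are easiest to understand side by side. A minimal sketch using a single invented /etc/passwd-style line:

```shell
# One passwd-style record, invented for the demo
printf 'root:x:0:0:root:/root:/bin/bash\n' > /tmp/awk_demo.txt

# FS splits the input into fields; $NF is always the last one
awk -F: '{print $1, $NF}' /tmp/awk_demo.txt

# OFS is what print puts between comma-separated fields on output
awk 'BEGIN{FS=":"; OFS="==>"} {print $1, $3}' /tmp/awk_demo.txt
```

The first command prints "root /bin/bash" (space is the default OFS); the second prints "root==>0" because OFS was changed before any record was read.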
Extracting columns: -F specifies the field delimiter, i.e. the marker that ends each column (default: a space, a run of spaces, or a tab). $number extracts a given column; note that in AWK, $ means "take this column". $0 means all columns, i.e. the entire line. {print xxx} prints. The delimiter can also be set with -v FS="xx", where -v modifies an AWK variable and FS is the input delimiter variable. OFS is the output field delimiter (what AWK puts between columns when it prints them). NF is the number of fields (columns) in the current record, and $NF is the last column: awk -F: '{print NF}'. ls -l | awk '{print $5,$9}' | column -t extracts the fifth and ninth columns; column -t aligns the output. awk -F: '{print $NF,$1}' /etc/passwd | column -t prints the last and first columns. awk -F: -v OFS=":" '{print $NF,$2,$3,$4,$5,$6,$1}' /etc/passwd | column -t swaps the last and first columns. awk 'NR==2,NR==5{print "Line "NR": "$0}' ck.txt prints the second through fifth lines with their line numbers.
Execution process
Pattern matching
Command format: 'pattern {action}' or 'condition {action}', for example awk -F"[/.]+" 'NR==3{print $3}'. What can serve as an AWK condition: comparison symbols (> < >= <= == !=), regular expressions, range expressions, and the special patterns BEGIN and END.
awk 'condition1{action1} condition2{action2}…' filename. Conditions (patterns): relational expressions are the most common conditions, for example x > 10 checks whether variable x is greater than 10, x == y checks whether x equals y, A ~ B checks whether string A contains a substring matching expression B, and A !~ B checks that it does not. Actions: formatted output and control-flow statements. Common parameters: -F specifies the input field delimiter, -v defines a custom variable, -f reads the AWK program from a script file, -m sets internal limits on the value of val.
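A relational expression as the pattern is the most common shape of an AWK program. A minimal sketch on invented name/score data:

```shell
printf 'alice 85\nbob 60\ncarol 92\n' > /tmp/score_demo.txt

# The action runs only for records where the condition is true
awk '$2 > 80 {print $1}' /tmp/score_demo.txt
```

Only alice and carol are printed; bob's record fails the $2 > 80 test, so its action never runs.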
Regular matching
AWK's // patterns support extended regular expressions, and AWK can target a specific column precisely, testing whether it contains or does not contain something: ~ means contains, !~ means does not contain. Applied to a column, ^… matches columns starting with …, …$ matches columns ending with …, and ^$ matches empty columns.
awk -F: '$3~/^[12]/{print $1,$3,$NF}' /etc/passwd finds lines whose third column starts with 1 or 2, displaying the first, third, and last columns. awk -F: '$3~/^1/' /etc/passwd finds lines whose third column starts with 1. awk -F: '$3~/^2/{print $1,$3,$NF}' /etc/passwd does the same for 2. awk -F: '$3~/^1|2/{print $1,$3,$NF}' /etc/passwd finds lines whose third column starts with 1 or contains a 2 (the unparenthesized alternation binds this way). awk -F: '$3~/^1|^2/{print $1,$3,$NF}' /etc/passwd and awk -F: '$3~/^(1|2)/{print $1,$3,$NF}' /etc/passwd both find lines whose third column starts with 1 or 2, displaying the first, third, and last columns.
Range patterns
/start/,/end/. Example: awk '/11:02:00/,/11:02:30/{print $1}' access.log. NR==2,NR==4 means from the second line to the fourth, similar to sed -n '2,4p'.
Special patterns
BEGIN{} executes before any input is read. It is used to: 1) perform simple statistics and calculations that do not involve reading a file (common); 2) print a header before processing the file; 3) define AWK variables (rarely used, since -v serves the same purpose).
END{} executes after all input has been read. It is used when: 1) AWK computes statistics — the usual flow is to accumulate while reading, then print the result in END (common); 2) AWK prints the contents of arrays (common).
Statistical idioms: i++ counts occurrences; sum=sum+… accumulates a total. Both i and sum are ordinary variables. Example: awk '/^$/{i++}END{print i}' /etc/services counts empty lines.
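The count-then-report-in-END flow can be verified on a file whose contents are known in advance. A minimal sketch with invented data:

```shell
# Sample file with exactly three empty lines
printf 'a\n\nb\n\n\nc\n' > /tmp/count_demo.txt

# Increment i on every empty line while reading; print the total once at END
awk '/^$/{i++} END{print i}' /tmp/count_demo.txt
```

Running the pattern against a constructed file makes the BEGIN/body/END life cycle concrete: the body runs six times here, END exactly once.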
Comparison symbols reference: awk -F"[/]+" 'NR==3{print $3}'.
Arrays
Log statistics: counting occurrences of each IP, occurrences of each status code, how many times each system user was attacked, occurrences of attacker IPs. Accumulation: summing the traffic consumed by each IP. The idiom is array[something]++, where the subscript [] is the column being counted: awk -F"[/.]+" '{array[$2]++}END{for(i in array)print i,array[i]}' ck.txt.
Bare words in AWK are treated as variable names; to use a literal string it must be enclosed in double quotes: awk 'BEGIN{a[0]="cc";a[1]="kk"; print a[0],a[1]}'. awk 'BEGIN{a[0]=123;a[1]="ck"; for(i in a) print i}' outputs the indexes. awk 'BEGIN{a[0]=2306;a[1]="ck"; for(i in a) print a[i]}' outputs the stored values. In awk -F"[/.]+" '{array[$2]++} END {for(i in array) print i, array[i]}' url.txt: -F"[/.]+" sets the field delimiter to the regular expression [/.]+, which matches one or more / or . characters. {array[$2]++} is the action part: for each line, the second field ($2) is used as the index and the corresponding element of array is incremented by one, i.e. counting the occurrences of each distinct second field. END {for(i in array) print i, array[i]} executes after all input lines have been processed: the for loop traverses array and prints each index (the value of the second field) and its occurrence count.
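The URL-counting example above can be reproduced exactly on invented data:

```shell
# URL-like sample data (hypothetical hosts, for illustration)
printf 'http://a.com/x\nhttp://b.com/y\nhttp://a.com/z\n' > /tmp/url_demo.txt

# Split on runs of "/" or "."; $2 is then the host name.
# Each host's count is accumulated in the array, printed once at END.
awk -F'[/.]+' '{cnt[$2]++} END{for(i in cnt) print i, cnt[i]}' /tmp/url_demo.txt
```

Note that for (i in cnt) visits the indexes in an unspecified order; pipe through sort when a stable order matters.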
Judging and looping
For loop
Used to loop: for(i=1;i<100;i++) print i. Summing example: awk 'BEGIN{for(i=1;i<=100;i++)sum+=i;print sum}' prints 5050.
If judgment
if (condition) print "output".
When AWK needs multiple conditions, the first is usually written as the pattern in condition{action}, and the second as an if inside the action: df -h | awk -F"[ %]+" 'NR>1{if($5>=1)print "disk not enough",$1,$5,$NF}'.
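Real df output varies per machine, so a reproducible sketch uses simulated data; the 80% threshold and field positions below are choices made for this demo, not fixed rules:

```shell
# Simulated "df"-style output (the real command's numbers vary per machine)
printf 'Filesystem Use%%\n/dev/sda1 83%%\n/dev/sda2 12%%\n' > /tmp/df_demo.txt

# NR>1 skips the header; the if inside the action applies the second condition
awk -F'[ %]+' 'NR>1{if($2>=80) print "disk not enough:", $1, $2"%"}' /tmp/df_demo.txt
```

Only /dev/sda1 trips the alert. With real df -h output the use% sits in a different column (hence the $5 in the example above).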
sed
sed is a non-interactive stream editor that processes one line of content at a time. The line currently being processed is stored in a temporary buffer called the "pattern space". sed applies its commands to the buffer's content and, when finished, sends the buffer's content to the screen. It then processes the next line, repeating until the end of the file.
Syntax: sed parameters "[address command]" file_path. Options: -e allows multiple edit commands. -r enables extended regular expressions. -n suppresses the default output (the automatic printing of the pattern space). -i edits the file in place, redirecting what would go to the screen back into the file. -i.bak creates a backup copy before modifying the file, to guard against mistakes. -f specifies a sed script file. Note that a (append after the addressed line) and i (insert before it) are sed commands, not options.
The difference between sed and vim: with sed, the rules for processing a file can be written in advance and applied as a batch, while vim edits interactively, one change at a time. Batch processing is better handled with sed, which holds only one line in memory at a time and so avoids putting excessive pressure on memory.
CRUD operations: Searching
sed -n '2p' ck.txt prints the line with the given number; sed -n '3p' ck.txt outputs line 3. sed -n '1,5p' ck.txt prints a range of lines. sed -n '3,$p' ck.txt prints from the third line to the last, where $ means the last line.
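The role of -n is the part that trips people up: without it, sed prints every line and then the addressed lines again. A minimal sketch on an invented file:

```shell
printf 'l1\nl2\nl3\nl4\n' > /tmp/sed_demo.txt

# -n suppresses the default print; p prints only the addressed lines
sed -n '2p' /tmp/sed_demo.txt     # just l2
sed -n '2,$p' /tmp/sed_demo.txt   # l2 through the last line
```

Dropping the -n from the first command would print l1 l2 l2 l3 l4: every line once by default, plus l2 again from the p command.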
CRUD operations: Deleting (deleting entire lines)
1d deletes the first line: sed '1d' ck.txt. 1,5d deletes lines one through five: sed '1,5d' ck.txt. 2,+5d deletes the second line and the five lines after it. /pattern1/d deletes lines matching pattern1. /pattern1/,/pattern2/d deletes all lines from the first match of pattern1 through the first match of pattern2. /pattern1/,10d deletes from the first match of pattern1 through line 10. 10,/pattern1/d deletes from line 10 through the first match of pattern1.
sed -nr '/^$|^#/!p' ck.txt prints only lines that are neither empty nor comments.
CRUD operations: Adding
c replaces the content of the addressed line: sed '3c 996,ckUFO' ck.txt. a appends, adding content after the addressed line (or after every line if no address is given): sed '3a 996,ckUFO' ck.txt. i inserts, adding content before the addressed line (or before every line).
CRUD operations: Replacing
s###g replaces what matches the first part with the second: sed 's#[0-9]##' ck.txt removes the first digit on each line.
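The effect of the g flag and the free choice of delimiter are easiest to see on a fixed line. A minimal sketch with invented text:

```shell
printf 'ver 1 and 2\n' > /tmp/sub_demo.txt

# Any character after s can be the delimiter; g replaces every match on the line
sed 's#[0-9]#N#g' /tmp/sub_demo.txt   # both digits become N
```

Without the g flag only the first digit would be replaced, leaving "ver N and 2".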
Back-referencing
echo 123456 | sed -r 's#(.*)#<\1>#g'. echo ck_li | sed -r 's#(.*)_(.*)#\2 \1#g' swaps the two parts around the underscore. Breaking the first command down: echo 123456 prints the string "123456" to standard output. | is the pipe symbol, feeding the previous command's standard output into the next command's standard input. sed -r invokes sed with extended regular expressions enabled. 's#(.*)#<\1>#g' is the substitution command, where: s marks a substitution. # serves as the delimiter; sed conventionally uses / as the default delimiter, but any other character can be used, such as # here. Choosing another delimiter is useful when the text itself contains /, since it avoids escaping. (.*) is a capturing group: . matches any single character (except newline), and * means 0 or more times, so .* matches the entire input string, i.e. "123456". <\1> is the replacement text, where \1 refers to the first (and in this example, the only) capturing group, i.e. the whole matched string "123456"; < and > are added around it. g is the global replacement flag, applying the replacement across the entire line.
AWK and sed
The sed command is often used to process entire lines, while AWK tends to split a line into multiple "fields" and process those. AWK reads information line by line, and the results can be printed with the print function to display field data. Logical operators can be used in AWK (&& means "and", || means "or", ! means "not"), and simple mathematical operations are available too: +, -, *, /, %, ^ represent addition, subtraction, multiplication, division, modulus, and exponentiation respectively.
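The operators above can be combined in a single action. A minimal sketch on invented two-column data:

```shell
printf '3 4\n10 2\n' > /tmp/op_demo.txt

# && joins two conditions; the action does arithmetic on the fields
awk '$1 > 2 && $2 > 3 {print $1 + $2, $1 * $2, $1 % $2}' /tmp/op_demo.txt
```

Only the first record satisfies both conditions (3 > 2 and 4 > 3), so one line of sums, products, and remainders is printed; the second record fails $2 > 3 and produces nothing.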