A Collection of Text Processing Tools in Linux Shell

From: cnblogs

Link: http://www.cnblogs.com/me115/p/3427319.html

This article introduces the most commonly used text processing tools in the Linux shell:

find, grep, xargs, sort, uniq, tr, cut, paste, wc, sed, awk.

The examples and parameters given are the most common and practical ones.

My principle for shell scripts is to write each command on a single line, preferably never more than two lines.

For more complex tasks, consider using Python.

find File Search

Find txt and pdf files

find . \( -name "*.txt" -o -name "*.pdf" \) -print

Search for .txt and .pdf using regex

find . -regex ".*\(\.txt\|\.pdf\)$"

-iregex: case-insensitive regex

Negation parameter

Find all files that are not .txt files

find . ! -name "*.txt" -print

Specify search depth

Print files in the current directory (depth of 1)

find . -maxdepth 1 -type f

Custom Search

Search by type:

find . -type d -print # List directories only

-type f for regular files, -type l for symbolic links

Search by time:

-atime access time (in days, for minutes use -amin, similar for others)

-mtime modification time (content has been modified)

-ctime change time (metadata or permission changes)

All files accessed within the last 7 days (-7 means less than 7 days ago, 7 means exactly 7 days ago, +7 means more than 7 days ago):

find . -atime -7 -type f -print

Search by size:

c for bytes, w for two-byte words, k for kibibytes (1024 bytes), M for mebibytes, G for gibibytes

Find files larger than 2k

find . -type f -size +2k

Find by permissions:

find . -type f -perm 644 -print # Find all files whose permission bits are exactly 644 (rw-r--r--)

Find by user:

find . -type f -user weber -print # Find files owned by user weber

Subsequent Actions After Finding

Delete:

Delete all .swp files under the current directory:

find . -type f -name "*.swp" -delete

Execute actions (the powerful -exec)

find . -type f -user root -exec chown weber {} \; # Change the owner of all root-owned files under the current directory to weber

Note: {} is a special string that is replaced with the file name of each matching file.

For example, copy all found files to another directory:

find . -type f -mtime +10 -name "*.txt" -exec cp {} OLD \;

Combine multiple commands

Tip: If you need to run several commands on each file, put them in a script and call the script with -exec:

-exec ./commands.sh {} \;

-print Delimiter

By default, '\n' is used as the delimiter between file names;

-print0 uses '\0' (the NUL character) instead, so that file names containing spaces can be processed safely;
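The difference between the two delimiters shows up as soon as a file name contains a space. The following is a minimal sketch, assuming find and xargs support -print0 and -0 (GNU and BSD versions do); the scratch directory and file names are made up for illustration.

```shell
# Hypothetical scratch directory holding one awkward file name.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.txt" "$demo_dir/with space.txt"

# Whitespace splitting: xargs sees ".../with" and "space.txt" as two arguments.
unsafe_count=$(find "$demo_dir" -type f -name "*.txt" -print | xargs -n 1 echo | wc -l)

# NUL splitting: each file name arrives as exactly one argument.
safe_count=$(find "$demo_dir" -type f -name "*.txt" -print0 | xargs -0 -n 1 echo | wc -l)

echo "without -print0: $unsafe_count arguments"
echo "with -print0: $safe_count arguments"
rm -rf "$demo_dir"
```

With two files on disk, the first pipeline hands three arguments to the command and the second hands two, which is why -print0 is usually paired with xargs -0 before running something destructive such as rm.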

grep Text Search

grep match_pattern file # Prints the matching lines by default

Common parameters

-o output only the matched part of each line VS -v output only the lines that do not match

-c count the number of matching lines in the file

grep -c "text" filename

-n print matching line numbers

-i ignore case when searching

-l only print filenames

Recursively search text in multi-level directories (a favorite of programmers searching code):

grep "class" . -R -n

Match multiple patterns

grep -e "class" -e "virtual" file

Make grep terminate each output file name with '\0' (-Z, used together with -l):

grep "test" file* -lZ | xargs -0 rm
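Because -c counts lines rather than individual matches, it can help to see the common options side by side. A small sketch on a made-up sample file (the file contents are an assumption for illustration only):

```shell
# Hypothetical sample file with three lines.
sample=$(mktemp)
printf 'class A; class B\nvirtual void f();\nCLASS C\n' > "$sample"

matching_lines=$(grep -c "class" "$sample")        # lines containing "class"
occurrences=$(grep -o "class" "$sample" | wc -l)   # every single match, one per output line
ignoring_case=$(grep -ci "class" "$sample")        # -i also matches "CLASS"
other_lines=$(grep -vc "class" "$sample")          # lines that do not contain "class"

echo "$matching_lines $occurrences $ignoring_case $other_lines"
rm -f "$sample"
```

To count occurrences rather than lines, combining -o with wc -l as above is a common workaround.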

xargs Command Line Argument Conversion

xargs converts its input data into command line arguments for another command, so it combines well with many commands, for example grep and find.

Convert multi-line output to single-line output

cat file.txt | xargs

'\n' is the delimiter between the lines of multi-line text

Convert single line to multi-line output

cat single.txt | xargs -n 3

-n: specify the number of fields displayed per line

xargs Parameter Description

-d define the delimiter (by default, input is split on whitespace; the delimiter between lines is '\n')

-n specify the maximum number of arguments per command line, which yields multi-line output

-I {} specify a replacement string; xargs replaces it with each input item when it expands the command, which is useful when the item must appear at a specific position among several parameters

For example:

cat file.txt | xargs -I {} ./command.sh -p {} -1

-0: specify '\0' as the input delimiter

For example: Count lines of code

find source_dir/ -type f -name "*.cpp" -print0 | xargs -0 wc -l
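The effect of -n and -I is easiest to see with echo as the command being run. A minimal sketch with made-up input:

```shell
# -n 3: regroup a stream of words into lines of at most three arguments.
grouped=$(printf '1 2 3 4 5 6 7\n' | xargs -n 3)

# -I {}: run the command once per input line, substituting {} each time.
labelled=$(printf 'a\nb\n' | xargs -I {} echo "item={}")

echo "$grouped"
echo "$labelled"
```

The first command prints three lines (1 2 3, 4 5 6, 7) and the second prints item=a and item=b on separate lines.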

sort Sorting

Option description:

-n sort numerically VS -d sort in dictionary order (only blanks and alphanumeric characters are considered)

-r reverse order sorting

-k N specify sorting by the Nth column

For example:

sort -nrk 1 data.txt

sort -bd data # Ignore leading whitespace characters such as spaces

uniq Remove Duplicate Lines

Remove duplicate lines

sort unsort.txt | uniq

Count occurrences of each line in the file

sort unsort.txt | uniq -c

Find duplicate lines

sort unsort.txt | uniq -d

You can limit which part of each line is compared: -s N skips the first N characters, -w N compares at most N characters
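sort and uniq are often chained to build a frequency ranking, since uniq only collapses adjacent duplicates. A short sketch on a made-up word list:

```shell
# Hypothetical input: one word per line, unsorted.
words=$(printf 'pear\napple\npear\nfig\napple\npear\n')

# Group identical lines, count them, then sort by the count in reverse numeric order.
ranking=$(printf '%s\n' "$words" | sort | uniq -c | sort -nr -k 1)
top=$(printf '%s\n' "$ranking" | head -n 1 | awk '{print $2}')

# Only the lines that occur more than once.
duplicates=$(printf '%s\n' "$words" | sort | uniq -d)

echo "$ranking"
echo "most frequent: $top"
```

Here pear appears three times and ends up first, while uniq -d reports apple and pear.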

Use tr for Conversion

General usage

echo 12345 | tr '0-9' '9876543210' # Simple encryption and decryption: replace each character with its counterpart

cat text | tr '\t' ' ' # Convert tabs to spaces

tr delete characters

cat file | tr -d '0-9' # Delete all digits

-c complement

cat file | tr -c '0-9\n' ' ' # Replace every character that is not a digit or newline with a space

cat file | tr -d -c '0-9 \n' # Delete everything except digits, spaces, and newlines

tr compress characters

tr -s squeezes runs of repeated characters in the text; it is most commonly used to squeeze excess spaces

cat file | tr -s ' '

Character classes

Various character classes can be used in tr:

alnum: alphanumeric

alpha: alphabetic

digit: numeric

space: whitespace

lower: lowercase

upper: uppercase

cntrl: control (non-printable) characters

print: printable characters

Usage: tr '[:class:]' '[:class:]'

eg: tr '[:lower:]' '[:upper:]'
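The three modes of tr (translate, delete, squeeze) can be tried on short strings. A minimal sketch; the input strings are made up:

```shell
# Translate: map every lowercase letter to its uppercase counterpart.
upper=$(echo "hello world" | tr '[:lower:]' '[:upper:]')

# Delete the complement: keep only digits and the trailing newline.
digits=$(echo "tel: 555-0100 ext 42" | tr -d -c '0-9\n')

# Squeeze: collapse each run of spaces into a single space.
squeezed=$(echo "a    b   c" | tr -s ' ')

echo "$upper"
echo "$digits"
echo "$squeezed"
```

This prints HELLO WORLD, 555010042, and a b c.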

cut Split Text by Column

Extract the 2nd and 4th columns of a file:

cut -f2,4 filename

Print all columns except the 3rd column:

cut -f3 --complement filename

-d specify delimiter:

cut -f2 -d";" filename

cut range

N- from the Nth field to the end

-M from the 1st field to the Mth


N-M from N to M fields

cut units

-b by bytes

-c by characters

-f by fields (using delimiter)

cut -c1-5 file # Print characters 1 to 5

cut -c-2 file # Print the first 2 characters
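The field ranges and units above can be tried on a single colon-separated line. A sketch assuming GNU cut (the --complement option is a GNU extension); the sample line imitates an /etc/passwd entry and is made up:

```shell
# Hypothetical /etc/passwd-style record.
line="root:x:0:0:root:/root:/bin/bash"

login_and_shell=$(echo "$line" | cut -d: -f1,7)            # fields 1 and 7
from_fifth=$(echo "$line" | cut -d: -f5-)                  # field 5 to the end
first_four_chars=$(echo "$line" | cut -c1-4)               # characters 1 to 4
without_second=$(echo "$line" | cut -d: -f2 --complement)  # everything except field 2

echo "$login_and_shell"
echo "$from_fifth"
echo "$first_four_chars"
echo "$without_second"
```

Note that cut keeps the original delimiter between the selected fields, so the first result is root:/bin/bash.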

paste Concatenate Text by Column

Concatenate two files column by column:

cat file1

1

2

cat file2

colin

book

paste file1 file2

1 colin

2 book

The default delimiter is a tab; a different one can be specified with -d

paste file1 file2 -d ","

1,colin

2,book

wc Tool for Counting Lines and Characters

wc -l file # Count lines

wc -w file # Count words

wc -c file # Count bytes (use -m to count characters)

sed Text Replacement Tool

First occurrence replacement

sed 's/text/replace_text/' file # Replace the first match in each line

Global replacement

sed 's/text/replace_text/g' file

By default, sed prints the result of the replacement and leaves the file untouched. To modify the original file in place, use -i:

sed -i 's/text/replace_text/g' file

Remove blank lines:

sed '/^$/d' file

Variable conversion

The matched string can be referenced using the marker &.

echo this is an example | sed 's/\w\+/[&]/g'

$>[this] [is] [an] [example]

Substring matching marker

The content matched by the first pair of parentheses is referenced using the marker \1

sed 's/hello\([0-9]\)/\1/'

Double quotes evaluation

sed expressions are usually quoted with single quotes; double quotes can also be used, in which case the shell expands variables in the expression before sed sees it:

sed "s/$var/HELLO/"

With double quotes, shell variables can be used in both the sed pattern and the replacement string.

For example:

p=pattern

r=replaced

echo "line contains a pattern" | sed "s/$p/$r/g"

$>line contains a replaced

Other examples

Insert a character into a string: convert each line of text (PEKSHA) to PEK/SHA

sed 's/^.\{3\}/&\//g' file
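In sed's default basic regular expressions, characters such as +, ( ) and { } must be escaped to act as operators. A sketch of the same three substitutions using -E (extended regular expressions, available in GNU and BSD sed), which avoids most of the backslashes; the input strings are made up:

```shell
# & stands for the whole match: wrap each word in brackets.
bracketed=$(echo "this is an example" | sed -E 's/[a-z]+/[&]/g')

# \1 stands for the first parenthesised group: keep only the digit after "hello".
digit_only=$(echo "hello7 world" | sed -E 's/hello([0-9])/\1/')

# Insert a slash after the first three characters.
split_code=$(echo "PEKSHA" | sed -E 's/^.{3}/&\//')

echo "$bracketed"
echo "$digit_only"
echo "$split_code"
```

This prints [this] [is] [an] [example], then 7 world, then PEK/SHA.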

awk Data Stream Processing Tool

awk script structure

awk 'BEGIN{ statements } statements2 END{ statements }'

Working method

1. Execute the statements in the BEGIN block;

2. Read a line from the file or stdin and execute statements2 on it; repeat until the input has been read completely;

3. Execute the statements in the END block;
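The three phases above can be sketched with a tiny summing script. The input numbers are made up, and the comments mark when each block runs:

```shell
report=$(printf '3\n4\n5\n' | awk '
    BEGIN { print "start"; sum = 0 }          # runs once, before any input is read
          { sum += $1 }                       # runs once for every input line
    END   { print "lines=" NR, "sum=" sum }   # runs once, after the last line
')
echo "$report"
```

The output is start followed by lines=3 sum=12: NR still holds the number of the last record when the END block runs.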

print Print Current Line

Using print without arguments prints the current line;

echo -e "line1\nline2" | awk 'BEGIN{print "start"} {print } END{ print "End" }'

When the arguments of print are separated by commas, they are output separated by spaces;

echo | awk '{var1 = "v1" ; var2 = "V2"; var3 = "v3";
print var1, var2, var3; }'

$>v1 V2 v3

Strings written next to each other are concatenated (here with "-" as a literal between the variables);

echo | awk '{var1 = "v1" ; var2 = "V2"; var3 = "v3";
print var1 "-" var2 "-" var3; }'

$>v1-V2-v3

Special Variables: NR NF $0 $1 $2

NR: the number of records read so far, which corresponds to the current line number during execution;

NF: the number of fields in the current line;

$0: the full text of the current line;

$1: the text of the first field;

$2: the text of the second field;

echo -e "line1 f2 f3\n line2 \n line 3" | awk '{print NR":"$0"-"$1"-"$2}'

Print the second and third fields of each line:

awk '{print $2, $3}' file

Count the number of lines in the file:

awk 'END {print NR}' file

Accumulate the first field of each line:

echo -e "1\n 2\n 3\n 4\n" | awk 'BEGIN{sum = 0;
print "begin";} {sum += $1;} END {print "=="; print sum }'

Pass External Variables

var=1000

echo | awk '{print vara}' vara=$var # Input from stdin

awk '{print vara}' vara=$var file # Input from file

Filter Lines Processed by awk with Patterns

awk 'NR < 5' # Lines whose line number is less than 5

awk 'NR==1,NR==4 {print}' file # Print lines 1 through 4

awk '/linux/' # Lines containing the text linux (any regex can be used, super powerful)

awk '!/linux/' # Lines not containing the text linux

Set Delimiter

Use -F to set the field delimiter (the default is whitespace)

awk -F: '{print $NF}' /etc/passwd

Read Command Output

Use getline to read a line of output from an external shell command into the variable cmdout;

echo | awk '{"grep root /etc/passwd" | getline cmdout; print cmdout }'

Use Loops in awk

for(i=0;i<10;i++){print $i;}

for(i in array){print array[i];}

For example:

Print lines in reverse order: (an implementation of the tac command)

seq 9 | \
awk '{lifo[NR] = $0; lno = NR}
END{ for(; lno > 0; lno--){print lifo[lno];}
}'

awk Implementing head and tail Commands

head:

awk 'NR<=10{print}' filename

tail:

awk '{buffer[NR%10] = $0;} END{for(i=NR-9;i<=NR;i++){
if(i>0) print buffer[i%10]} }' filename
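One way to check awk one-liners like these is to compare them with the real tools. In the sketch below, the tail program keeps the last ten lines in a ring buffer and starts printing at the oldest retained line (NR-9), skipping indexes that do not exist when the input is shorter than ten lines. The variable name tail_prog and the seq inputs are made up for illustration.

```shell
# Ring buffer of 10 slots; slot NR%10 always holds the most recent line with that remainder.
tail_prog='{ buffer[NR % 10] = $0 }
END { for (i = NR - 9; i <= NR; i++) if (i > 0) print buffer[i % 10] }'

awk_head=$(seq 25 | awk 'NR <= 10')
awk_tail=$(seq 25 | awk "$tail_prog")
short_tail=$(seq 3 | awk "$tail_prog")   # fewer than 10 lines of input

echo "$awk_tail"
```

For 25 input lines this prints 16 through 25, matching tail, and for 3 input lines it prints just those 3.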

Print Specified Columns

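Implemented by awk:

ls -lrt | awk '{print $6}'

Implemented by cut (cut splits on single tabs by default, so squeeze the runs of spaces first and split on a space):

ls -lrt | tr -s ' ' | cut -d' ' -f6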

Print Specified Text Area

Determine line numbers

seq 100 | awk 'NR==4,NR==6{print}'

Determine text

Print text between start_pattern and end_pattern;

awk '/start_pattern/, /end_pattern/' filename

For example:

seq 100 | awk '/13/,/15/'

cat /etc/passwd | awk '/mai.*mail/,/news.*news/'

Common Built-in Functions in awk

index(string,search_string): returns the position of search_string in string

sub(regex,replacement_str,string): replaces the first regex match in string with replacement_str;

match(string,regex): returns the position at which regex matches in string, or 0 if there is no match;

length(string): returns the length of the string

echo | awk '{"grep root /etc/passwd" | getline cmdout; print length(cmdout) }'

printf: similar to printf in the C language, formats the output

For example:

seq 10 | awk '{printf "->%4s\n", $1}'

Iterate Through Lines, Words, and Characters in a File

1. Iterate through each line in the file

Using while loop

# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line;
do
echo "$line";
done < file.txt

Using a pipeline (the loop then runs in a subshell):

cat file.txt | (while IFS= read -r line; do echo "$line"; done)

Using awk:

cat file.txt | awk '{print}'

2. Iterate through each word in a line

for word in $line;
do
echo "$word";
done

3. Iterate through each character

${string:start_pos:num_of_chars}: extract a substring from the string; (bash text slicing)

${#word}: returns the length of the variable word

for((i=0;i<${#word};i++))
do
echo "${word:i:1}";
done
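The word and character loops can be combined in one short script. This sketch assumes bash, since the slicing syntax and the C-style for loop are not available in plain sh; the sample string is made up:

```shell
line="find grep awk"

# Words: $line is left unquoted on purpose so the shell splits it on whitespace.
word_count=0
for word in $line; do
    word_count=$((word_count + 1))
done

# Characters: slice one character at a time out of the first four characters.
first="${line:0:4}"
spelled=""
for ((i = 0; i < ${#first}; i++)); do
    spelled+="${first:i:1}-"
done

echo "$word_count words, first word spelled: $spelled"
```

The script counts 3 words and builds the string f-i-n-d- from the slice find.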
