A Collection of Text Processing Tools in Linux Shell

From: cnblogs

Link: http://www.cnblogs.com/me115/p/3427319.html

This article introduces the most commonly used text-processing tools in the Linux shell:

find, grep, xargs, sort, uniq, tr, cut, paste, wc, sed, awk.

The examples and options shown are the ones most often needed in practice.

My rule of thumb for shell scripts is to keep each command on a single line, and never more than two lines.

For anything more complex, consider using Python instead.

find File Search

  • Find txt and pdf files

find . \( -name "*.txt" -o -name "*.pdf" \) -print

  • Search for .txt and .pdf using regex

find . -regex ".*\(\.txt\|\.pdf\)$"

-iregex: case-insensitive regex

  • Negation parameter

Find everything whose name does not end in .txt

find . ! -name "*.txt" -print

  • Specify search depth

Print files in the current directory (depth of 1)

find . -maxdepth 1 -type f
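
The name, negation, and depth options above can be tried safely against a throwaway directory. This is only a sketch: the directory layout and the variable names (demo_dir, txt_or_pdf, and so on) are invented for illustration, and -maxdepth assumes GNU or BSD find.

```shell
# Build a scratch tree so the find expressions have something to match.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
touch "$demo_dir/a.txt" "$demo_dir/b.pdf" "$demo_dir/c.log" "$demo_dir/sub/d.txt"

# Either extension: the escaped parentheses group the -o alternatives.
txt_or_pdf=$(find "$demo_dir" \( -name "*.txt" -o -name "*.pdf" \) -type f | sort)

# Negation: regular files whose name does not end in .txt.
not_txt=$(find "$demo_dir" -type f ! -name "*.txt" | sort)

# Depth limit: only files directly inside demo_dir, not those in sub/.
top_level_count=$(find "$demo_dir" -maxdepth 1 -type f | wc -l)

printf '%s\n' "$txt_or_pdf" "$not_txt"
echo "$top_level_count"

rm -rf "$demo_dir"
```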

Custom Search

  • Search by type:

find . -type d -print  # list directories only

-type f for files / l for symbolic links

  • Search by time:

-atime access time (in days, for minutes use -amin, similar for others)

-mtime modification time (content has been modified)

-ctime change time (metadata or permission changes)

  • All files accessed within the last 7 days (-7 means less than 7 days ago; a bare 7 means exactly 7 days ago):

find . -atime -7 -type f -print

  • Search by size:

c for bytes, w for two-byte words, k for kilobytes (1024 bytes), M for megabytes, G for gigabytes

Find files larger than 2k

find . -type f -size +2k

  • Search by permissions:

find . -type f -perm 644 -print  # find all files whose permissions are exactly 644 (rw-r--r--)

  • Search by user:

find . -type f -user weber -print  # find files owned by user weber
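
As a rough check of the type, size, and permission tests, the sketch below creates two files with known sizes and modes and queries them. The file names and variables are hypothetical, and head -c is assumed to be available, as it is in GNU and BSD userlands.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/logs"
head -c 3072 /dev/zero > "$demo_dir/big.bin"   # a 3 KiB file
printf 'hi\n' > "$demo_dir/small.txt"          # a 3-byte file
chmod 644 "$demo_dir/big.bin"
chmod 600 "$demo_dir/small.txt"

# -type d lists directories only: demo_dir itself and logs/.
dir_count=$(find "$demo_dir" -type d | wc -l)

# -size +2k matches files larger than 2 KiB.
large_files=$(find "$demo_dir" -type f -size +2k -exec basename {} \;)

# -perm 644 matches files whose mode is exactly rw-r--r--.
mode_644=$(find "$demo_dir" -type f -perm 644 -exec basename {} \;)

echo "$dir_count $large_files $mode_644"
rm -rf "$demo_dir"
```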

Subsequent Actions After Finding

  • Delete:

Delete all swp files in the current directory:

find . -type f -name "*.swp" -delete

  • Execute actions (powerful exec)

find . -type f -user root -exec chown weber {} \;  # change the owner of every root-owned file under the current directory to weber

Note: {} is a special string that will be replaced with the corresponding filename for each matching file;

For example: Copy all found files to another directory:

find . -type f -mtime +10 -name "*.txt" -exec cp {} OLD \;

  • Combine multiple commands

Tips: If you need to execute multiple commands subsequently, you can write them into a script. Then, use -exec to call the script;

-exec ./commands.sh {} \;

-print Delimiter

By default, '\n' is used as the delimiter between file names;

-print0 uses '\0' (the NUL character) as the delimiter, so file names that contain spaces are handled correctly;
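
A small sketch of -exec and -print0 working on a file name that contains a space. The scratch directory, the OLD subdirectory, and the variable names are illustrative assumptions; -maxdepth is used only to keep find from descending into OLD.

```shell
demo_dir=$(mktemp -d)
mkdir "$demo_dir/OLD"
touch "$demo_dir/plain.txt" "$demo_dir/with space.txt"

# -exec runs cp once per match; {} is replaced by the matched path.
find "$demo_dir" -maxdepth 1 -type f -name "*.txt" -exec cp {} "$demo_dir/OLD" \;
copied_count=$(find "$demo_dir/OLD" -type f | wc -l)

# -print0 ends every name with a NUL byte, so counting NUL bytes counts
# files reliably even when a name contains spaces.
nul_count=$(find "$demo_dir" -maxdepth 1 -type f -print0 | tr -cd '\0' | wc -c)

echo "$copied_count $nul_count"
rm -rf "$demo_dir"
```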

grep Text Search

grep match_pattern file  # by default, prints the matching lines

  • Common parameters

-o print only the matched part of each line VS -v print only the lines that do not match

-c count the lines in the file that contain the text

grep -c "text" filename

-n print matching line numbers

-i ignore case when searching

-l only print filenames

  • Recursively search text in multi-level directories (a favorite of programmers searching code):

grep "class" . -R -n

  • Match multiple patterns

grep -e "class" -e "virtual" file

  • Make grep end each output filename with a \0 (NUL) character: (-Z)

grep "test" file* -lZ | xargs -0 rm
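
The common options can be compared side by side on a tiny sample file. This is a sketch; the file contents and variable names are made up for illustration, and -o assumes GNU or BSD grep.

```shell
sample=$(mktemp)
printf 'class Foo\nvirtual void run()\nCLASS bar\nint x\n' > "$sample"

match_lines=$(grep -c "class" "$sample")                # lines containing "class"
match_lines_nocase=$(grep -ci "class" "$sample")        # same, ignoring case
numbered=$(grep -n "virtual" "$sample")                 # match prefixed by its line number
only_match=$(grep -o "v[a-z]*" "$sample" | head -n 1)   # just the matched text
either=$(grep -c -e "class" -e "virtual" "$sample")     # lines matching either pattern
inverted=$(grep -vc "class" "$sample")                  # lines NOT containing "class"

echo "$match_lines $match_lines_nocase $either $inverted"
echo "$numbered"
echo "$only_match"
rm -f "$sample"
```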

xargs Command Line Argument Conversion

xargs converts its input into command-line arguments for another command, so it combines well with many other commands, such as grep and find;

  • Convert multi-line output to single-line output

cat file.txt | xargs

\n is the delimiter between the lines of multi-line text

  • Convert single line to multi-line output

cat single.txt | xargs -n 3

-n: specify the number of fields displayed per line

xargs Parameter Description

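-d define the delimiter (the default is whitespace; the lines of multi-line input are separated by \n)

-n specify the maximum number of arguments per command invocation, which splits the output across multiple lines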

-I {} specify replacement string, this string will be replaced during xargs expansion, used when the command to be executed requires multiple parameters

For example:

cat file.txt | xargs -I {} ./command.sh -p {} -1

-0: specify \0 as the input delimiter

For example: Count lines of code

find source_dir/ -type f -name "*.cpp" -print0 | xargs -0 wc -l
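
The options above can be seen in isolation with echo as the target command. This is a minimal sketch; the input strings and variable names are invented for illustration.

```shell
# Several input lines are joined into one argument list (xargs runs echo by default).
one_line=$(printf '1\n2\n3\n4\n5\n6\n' | xargs)

# -n 3 passes at most three arguments to each echo invocation.
three_per_line=$(printf '1 2 3 4 5 6\n' | xargs -n 3)

# -I {} substitutes each input line wherever {} appears in the command.
templated=$(printf 'a\nb\n' | xargs -I {} echo "item={} done")

# -0 splits on NUL bytes, so "a b" stays a single argument.
nul_args=$(printf 'a b\0c\0' | xargs -0 -n 1 echo | wc -l)

printf '%s\n' "$one_line" "$three_per_line" "$templated"
echo "$nul_args"
```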

sort Sorting

Field description:

-n sort numerically VS -d sort lexicographically

-r reverse order sorting

-k N specify sorting by the Nth column

For example:

sort -nrk 1 data.txt

sort -bd data  # ignore leading whitespace such as spaces

uniq Remove Duplicate Lines

  • Remove duplicate lines

sort unsort.txt | uniq

  • Count occurrences of each line in the file

sort unsort.txt | uniq -c

  • Find duplicate lines

sort unsort.txt | uniq -d

You can specify the content to compare for duplicates: -s starting position -w number of characters to compare
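
Because uniq only collapses adjacent duplicates, it is normally fed by sort. The sketch below runs the sort and uniq options from the last two sections on a four-line sample; the data and variable names are hypothetical.

```shell
data=$(mktemp)
printf '3 pear\n10 apple\n3 pear\n1 fig\n' > "$data"

# Numeric, reversed, keyed on column 1: the largest number comes first.
by_number_desc=$(sort -nrk 1 "$data" | head -n 1)

# Number of distinct lines.
distinct=$(sort "$data" | uniq | wc -l)

# Most frequent line with its count (awk only normalizes uniq's padding).
top_count=$(sort "$data" | uniq -c | sort -nr | head -n 1 | awk '{print $1, $2, $3}')

# Lines that occur more than once.
dupes=$(sort "$data" | uniq -d)

printf '%s\n' "$by_number_desc" "$distinct" "$top_count" "$dupes"
rm -f "$data"
```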

Use tr for Conversion

  • General usage

echo 12345 | tr '0-9' '9876543210'  # simple encryption/decryption: each character is replaced by its counterpart

cat text | tr '\t' ' '  # convert tabs to spaces

  • tr delete characters

cat file | tr -d '0-9'  # delete all digits

  • -c complement

cat file | tr -c '0-9\n' ' '  # replace every character that is not a digit or newline with a space

cat file | tr -d -c '0-9 \n'  # delete everything except digits, spaces, and newlines

  • tr compress characters

tr -s compresses repeated characters in the text; most commonly used to compress excess spaces

cat file | tr -s ' '

  • Character classes

Various character classes can be used in tr:

alnum: alphanumeric

alpha: alphabetic

digit: numeric

space: whitespace

lower: lowercase

upper: uppercase

cntrl: control (non-printable) characters

print: printable characters

Usage: tr '[:class:]' '[:class:]'

eg: tr '[:lower:]' '[:upper:]'
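
A short sketch that runs the tr operations above on literal strings, so the effect of each option is visible. The input strings and variable names are illustrative only.

```shell
# Each digit d is mapped to 9-d; applying the same mapping again restores it.
swapped=$(echo 12345 | tr '0-9' '9876543210')
restored=$(echo "$swapped" | tr '0-9' '9876543210')

upper=$(echo "hello world" | tr '[:lower:]' '[:upper:]')   # character classes
no_digits=$(echo "a1b2c3" | tr -d '0-9')                   # delete the set
only_digits=$(echo "a1b2c3" | tr -d -c '0-9\n')            # delete the complement of the set
squeezed=$(echo "a    b   c" | tr -s ' ')                  # squeeze repeated spaces

printf '%s\n' "$swapped" "$restored" "$upper" "$no_digits" "$only_digits" "$squeezed"
```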

cut Split Text by Column

  • Extract the 2nd and 4th columns of a file:

cut -f2,4 filename

  • Print all columns except the 3rd column:

cut -f3 --complement filename

  • -d specify delimiter:

cut -f2 -d";" filename

  • cut range

N- from the Nth field to the end

-M from the 1st field to the Mth field

N-M from the Nth field to the Mth field

  • cut units

-b by bytes

-c by characters

-f by fields (using delimiter)

  • eg:

cut -c1-5 file  # print characters 1 to 5

cut -c-2 file  # print the first 2 characters
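
The field and character ranges can be checked on a single semicolon-separated record. This is a sketch; the sample row and variable names are invented.

```shell
row='a;b;c;d;e'

second_and_fourth=$(echo "$row" | cut -d';' -f2,4)   # fields 2 and 4
from_third=$(echo "$row" | cut -d';' -f3-)           # field 3 to the end
up_to_second=$(echo "$row" | cut -d';' -f-2)         # fields 1 to 2
first_five_chars=$(echo "abcdefgh" | cut -c1-5)      # characters 1 to 5

printf '%s\n' "$second_and_fourth" "$from_third" "$up_to_second" "$first_five_chars"
```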

paste Concatenate Text by Column

Concatenate two texts by column;

cat file1

1

2

cat file2

colin

book

paste file1 file2

1 colin

2 book

The default delimiter is a tab, which can be specified with -d

paste file1 file2 -d ","

1,colin

2,book

wc Tool for Counting Lines and Characters

wc -l file  # count lines

wc -w file  # count words

wc -c file  # count bytes (use -m to count characters)

sed Text Replacement Tool

  • First occurrence replacement

sed 's/text/replace_text/' file  # replace the first match on each line

  • Global replacement

sed 's/text/replace_text/g' file

  • By default, sed prints the result and leaves the file unchanged. To modify the original file in place, use -i:

sed -i 's/text/replace_text/g' file

  • Remove blank lines:

sed '/^$/d' file

  • Variable conversion

The matched string can be referenced using the marker &.

echo this is an example | sed 's/\w\+/[&]/g'

$>[this] [is] [an] [example]

  • Substring matching marker

The content of the first matching parentheses is referenced using the marker \1

sed 's/hello\([0-9]\)/\1/'

  • Double quotes evaluation

sed expressions are usually quoted with single quotes; double quotes can also be used, in which case the shell expands variables in the expression before sed sees it:

sed "s/$var/HELLO/"

With double quotes, shell variables can be used in both the sed pattern and the replacement string;

For example:

p=pattern

r=replaced

echo "line contains a pattern" | sed "s/$p/$r/g"

$>line contains a replaced

  • Other examples

Insert characters into strings: Convert each line of text (PEKSHA) to PEK/SHA

sed 's/^.\{3\}/&\//g' file
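
The sketch below exercises the & marker, the \1 back-reference, and the difference between single and double quotes. It sticks to basic regular expressions, so [a-z][a-z]* stands in for a word pattern; the sample strings and variable names are assumptions made for illustration.

```shell
# & stands for the whole match.
bracketed=$(echo "this is an example" | sed 's/[a-z][a-z]*/[&]/g')

# \1 stands for the text captured by the first \( \) group.
digit=$(echo "hello7 world" | sed 's/hello\([0-9]\)/\1/')

p=pattern
r=replaced
# Double quotes: the shell expands $p and $r before sed runs.
expanded=$(echo "line contains a pattern" | sed "s/$p/$r/g")
# Single quotes: sed receives the literal text $p, which does not occur in the input.
literal=$(echo "line contains a pattern" | sed 's/$p/$r/g')

# Insert a slash after the first three characters.
slashed=$(echo "PEKSHA" | sed 's/^.\{3\}/&\//')

# Delete empty lines.
no_blank=$(printf 'a\n\nb\n' | sed '/^$/d')

printf '%s\n' "$bracketed" "$digit" "$expanded" "$literal" "$slashed" "$no_blank"
```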

awk Data Stream Processing Tool

  • awk script structure

awk ' BEGIN{ statements } statements2 END{ statements } '

  • Working method

1. Execute the statements in the BEGIN block;

2. Read a line from the file or stdin and execute statements2 on it; repeat until all input has been read;

3. Execute the statements in the END block;

print Print Current Line

  • Using print without parameters will print the current line;

echo -e "line1\nline2" | awk 'BEGIN{print "start"} {print } END{ print "End" }'

  • When the arguments to print are separated by commas, they are printed separated by spaces;

echo | awk ' {var1 = "v1" ; var2 = "V2"; var3="v3"; \
print var1, var2 , var3; }'

$>v1 V2 v3

  • Using concatenation (adjacent values are joined; string literals in "" go between them);

echo | awk ' {var1 = "v1" ; var2 = "V2"; var3="v3"; \
print var1"-"var2"-"var3; }'

$>v1-V2-v3

Special Variables: NR NF $0 $1 $2

NR: represents the number of records, corresponding to the current line number during execution;

NF: represents the number of fields, corresponding to the number of fields in the current line during execution;

$0: this variable contains the text content of the current line during execution;

$1: text content of the first field;

$2: text content of the second field;

echo -e "line1 f2 f3\n line2 \n line 3" | awk '{print NR":"$0"-"$1"-"$2}'

  • Print the second and third fields of each line:

awk '{print $2, $3}' file

  • Count the number of lines in the file:

awk ' END {print NR}' file

  • Accumulate the first field of each line:

echo -e "1\n 2\n 3\n 4\n" | awk 'BEGIN{sum = 0 ;
print "begin";} {sum += $1;} END {print "=="; print sum }'
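
To see NR, NF, and the field variables change from record to record, here is a minimal sketch using printf for the input. The sample lines and variable names are hypothetical.

```shell
# NR is the record number, NF the field count, $1 the first field.
fields=$(printf 'line1 f2 f3\nline2\n' | awk '{ print NR ":" NF ":" $1 }')

# $NF is the last field, because NF holds the number of fields.
last_field=$(echo "a b c" | awk '{ print $NF }')

# Sum the first field of every record and report it in the END block.
total=$(printf '1\n2\n3\n4\n' | awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum }')

printf '%s\n' "$fields" "$last_field" "$total"
```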

Pass External Variables

var=1000

echo | awk '{print vara}' vara=$var # Input from stdin

awk '{print vara}' vara=$var file # Input from file
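
An assignment written after the program text takes effect only when awk starts reading input, so it is not visible inside a BEGIN block; the standard -v option sets the variable before BEGIN runs. A small sketch of the difference, reusing the variable name vara from above (the other names are illustrative):

```shell
var=1000

# Assignment operand: available in the main block.
from_operand=$(echo | awk '{ print vara }' vara="$var")

# The same operand is not yet set while BEGIN runs.
in_begin=$(awk 'BEGIN { print (vara == "" ? "unset" : vara) }' vara="$var")

# -v assigns before BEGIN, so the value can be used there.
from_v=$(awk -v vara="$var" 'BEGIN { print vara + 1 }')

printf '%s\n' "$from_operand" "$in_begin" "$from_v"
```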

Filter Lines Processed by awk with Patterns

awk 'NR < 5' # Lines whose line number is less than 5

awk 'NR==1,NR==4 {print}' file # Print lines 1 through 4

awk '/linux/' # Lines containing the text linux (can specify with regex, super powerful)

awk '!/linux/' # Lines not containing the text linux

Set Delimiter

Use -F to set the field delimiter (the default is whitespace)

awk -F: '{print $NF}' /etc/passwd
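
A brief sketch of line-number patterns, regex patterns, range patterns, and -F on generated input. The passwd-style record is a made-up sample rather than a line read from a real system, and seq is assumed to be available.

```shell
first_two=$(seq 5 | awk 'NR < 3')              # records 1 and 2
number_range=$(seq 10 | awk 'NR==4,NR==6')     # records 4 through 6
text_range=$(seq 20 | awk '/13/,/15/')         # from a line matching 13 to one matching 15
matching=$(printf 'linux\nbsd\n' | awk '/linux/')
not_matching=$(printf 'linux\nbsd\n' | awk '!/linux/')

# -F: splits on colons; $NF is then the last colon-separated field.
shell_field=$(echo "root:x:0:0:root:/root:/bin/bash" | awk -F: '{ print $NF }')

printf '%s\n' "$first_two" "$number_range" "$text_range" "$matching" "$not_matching" "$shell_field"
```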

Read Command Output

Use getline to read the output of an external shell command into the variable cmdout;

echo | awk '{"grep root /etc/passwd" | getline cmdout; print cmdout }'

Use Loops in awk

for(i=0;i<10;i++){print $i;}

for(i in array){print array[i];}

For example:

Print lines in reverse order: (implementation of tac command)

seq 9 | \
awk '{lifo[NR] = $0; lno=NR} \
END{ for(;lno>0;lno--){print lifo[lno];}
} '

awk Implementing head and tail Commands

  • head:

awk 'NR<=10{print}' filename

  • tail:

awk '{buffer[NR%10] = $0;} END{for(i=(NR>10?NR-9:1);i<=NR;i++){ \
print buffer[i%10]} } ' filename
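
The reverse-order and tail techniques both store lines in an array and print them in the END block. The sketch below is one way to write them, with the tail length passed in as a hypothetical variable n so that a short run is easy to verify by eye.

```shell
# Reverse order: remember every line, then walk the array backwards.
reversed=$(seq 4 | awk '{ lifo[NR] = $0 } END { for (i = NR; i > 0; i--) print lifo[i] }')

# Tail: keep only the last n lines in a ring buffer indexed by NR % n,
# then print from the oldest surviving line to the newest.
last_three=$(seq 25 | awk -v n=3 '
    { buf[NR % n] = $0 }
    END {
        start = (NR > n) ? NR - n + 1 : 1
        for (i = start; i <= NR; i++) print buf[i % n]
    }')

# Input shorter than n is printed in full.
short_input=$(seq 2 | awk -v n=3 '
    { buf[NR % n] = $0 }
    END {
        start = (NR > n) ? NR - n + 1 : 1
        for (i = start; i <= NR; i++) print buf[i % n]
    }')

printf '%s\n' "$reversed" "$last_three" "$short_input"
```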

Print Specified Columns

  • Implemented by awk:

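ls -lrt | awk '{print $6}'

  • Implemented by cut (cut splits on single tabs by default, so squeeze the spaces and set the delimiter first)

ls -lrt | tr -s ' ' | cut -d' ' -f6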

Print Specified Text Area

  • Determine line numbers

seq 100 | awk 'NR==4,NR==6{print}'

  • Determine text

Print text between start_pattern and end_pattern;

awk '/start_pattern/, /end_pattern/' filename

For example:

seq 100 | awk '/13/,/15/'

cat /etc/passwd | awk '/mai.*mail/,/news.*news/'

Common Built-in Functions in awk

index(string,search_string): returns the position of search_string in string

sub(regex,replacement_str,string): replaces the first occurrence of the regex match with replacement_str;

match(string,regex): returns the position at which the regex matches in string, or 0 if there is no match;

length(string): returns the length of the string

echo | awk '{"grep root /etc/passwd" | getline cmdout; print length(cmdout) }'

printf similar to printf in C language, formats the output

For example:

seq 10 | awk '{printf "->%4s\n", $1}'
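
Each built-in function can be tried in a BEGIN block, which needs no input file. This is a minimal sketch with made-up strings; the variable names are illustrative.

```shell
pos=$(awk 'BEGIN { print index("hello world", "world") }')        # 1-based position
len=$(awk 'BEGIN { print length("hello") }')
replaced=$(awk 'BEGIN { s = "aaa"; sub(/a/, "b", s); print s }')  # first match only

# match() returns the start position and also sets RSTART and RLENGTH.
found=$(awk 'BEGIN { m = match("foo123", /[0-9]+/); print m, RSTART, RLENGTH }')

# %4s right-aligns the value in a field four characters wide.
padded=$(awk 'BEGIN { printf "->%4s\n", 7 }')

printf '%s\n' "$pos" "$len" "$replaced" "$found" "$padded"
```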

Iterate Through Lines, Words, and Characters in a File

1. Iterate through each line in the file

  • Using while loop

while read line;

do

echo $line;

done < file.txt

Change to subshell:

cat file.txt | (while read line;do echo $line;done)

  • Using awk:

cat file.txt | awk '{print}'

2. Iterate through each word in a line

for word in $line;

do

echo $word;

done
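
The line loop and the word loop can be combined into a simple counter. In this sketch the input file and variable names are invented; IFS= and -r are added so that read keeps leading whitespace and backslashes intact. Because the loop is fed by redirection rather than a pipe, it runs in the current shell in bash, so the counters are still set after it finishes.

```shell
input=$(mktemp)
printf 'alpha beta\ngamma\n' > "$input"

line_count=0
word_count=0
while IFS= read -r line; do
    line_count=$((line_count + 1))
    # $line is deliberately unquoted: word splitting yields the individual words.
    for word in $line; do
        word_count=$((word_count + 1))
    done
done < "$input"

echo "$line_count $word_count"
rm -f "$input"
```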

3. Iterate through each character

${string:start_pos:num_of_chars}: extract num_of_chars characters from the string, starting at start_pos; (bash text slicing)

${#word}: returns the length of the variable word

for((i=0;i<${#word};i++))

do

echo ${word:i:1};

done
