Linux iconv Command

iconv is a command-line tool on Linux systems for character encoding conversion. It converts text files or input streams from one character encoding to another and is widely used for handling multilingual text, data migration, and cross-platform data exchange.

1. Introduction to iconv

iconv is a command-line tool for converting between different character encodings. On most Linux distributions it is provided by the GNU C Library (glibc); the separate GNU libiconv project offers a compatible implementation with a few extra options. Character encoding refers to the method of mapping characters to byte sequences, such as UTF-8, GBK, ISO-8859-1, etc. Since different systems and applications may use different encodings, iconv provides a convenient way to ensure the compatibility of text data.

1.1 Introduction to Character Encoding

The core of character encoding is mapping human-readable characters (such as letters, Chinese characters, symbols) to byte sequences that computers can process. Common character encodings include:

ASCII: 7-bit encoding, supports only basic Latin characters.
ISO-8859-1: 8-bit encoding, supports Western European languages.
GBK/GB2312: Chinese encoding, GBK is an extension of GB2312.
UTF-8: Variable-length encoding of Unicode, compatible with ASCII, widely used in modern applications.
UTF-16/UTF-32: Other encoding forms of Unicode. UTF-16 is variable-length (2 or 4 bytes per character), while UTF-32 always uses 4 bytes per character.

Due to historical and regional reasons, different regions and systems may use different encodings, leading to garbled text during cross-platform transmission or display. The role of iconv is to convert text from one encoding to another, solving the garbled text problem.
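
To make this concrete, the sketch below prints the bytes that the same two characters (你好) occupy in three encodings. It assumes an iconv build that knows GBK; the hex helper and the variable names are introduced here for illustration only.

```shell
# Helper (hypothetical name): print stdin as one continuous hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# "你好" as UTF-8 bytes, written with octal escapes so the example does not
# depend on the encoding of the script file or the terminal.
utf8_text='\344\275\240\345\245\275'

utf8_hex=$(printf "$utf8_text" | hex)
gbk_hex=$(printf "$utf8_text" | iconv -f UTF-8 -t GBK | hex)
utf16be_hex=$(printf "$utf8_text" | iconv -f UTF-8 -t UTF-16BE | hex)

echo "UTF-8:    $utf8_hex"
echo "GBK:      $gbk_hex"
echo "UTF-16BE: $utf16be_hex"
```

The same text takes six bytes in UTF-8 and four bytes in GBK and UTF-16BE, with entirely different byte values, which is why a program that assumes the wrong encoding shows garbled text.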

1.2 Functions of iconv

iconv is mainly used for:

File Encoding Conversion: Convert files from one encoding to another.
Stream Conversion: Process standard input (stdin) and output to standard output (stdout).
Encoding Detection and Validation: Check file encoding in conjunction with other tools.
Multilingual Support: Supports hundreds of character encodings, covering major languages worldwide.
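
Although iconv cannot detect an encoding, it can serve as a validator: a conversion fails if the input is not valid in the declared source encoding. The following is a minimal sketch of that idea; is_valid_encoding and the sample files are hypothetical names created only for this example.

```shell
# Hypothetical helper: succeed if the file decodes cleanly as encoding $1.
is_valid_encoding() {
  iconv -f "$1" -t UTF-8 "$2" >/dev/null 2>&1
}

workdir=$(mktemp -d)
printf 'caf\303\251\n' > "$workdir/good.txt"   # valid UTF-8 ("café")
printf 'caf\351\n'     > "$workdir/bad.txt"    # lone 0xE9 byte, invalid UTF-8

if is_valid_encoding UTF-8 "$workdir/good.txt"; then good_result=valid; else good_result=invalid; fi
if is_valid_encoding UTF-8 "$workdir/bad.txt";  then bad_result=valid;  else bad_result=invalid;  fi

echo "good.txt: $good_result"
echo "bad.txt:  $bad_result"
rm -rf "$workdir"
```

A successful check only shows that the bytes are legal in that encoding, not that the encoding is the one the author intended.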

2. iconv Command Syntax and Options

2.1 Basic Syntax

iconv [options]... [input file]...

Input File: The file that needs encoding conversion. If omitted, iconv reads from standard input.
Output: By default, the result is written to standard output (stdout). An output file can be specified with the -o option or with shell redirection.

2.2 Common Options

The following are the main options of iconv and their functions:

-f, --from-code=NAME: Specify the encoding of the input file (source encoding).
-t, --to-code=NAME: Specify the encoding of the output file (target encoding).
-o, --output=FILE: Specify the output file instead of the default standard output.
-c: Ignore invalid characters (characters that cannot be converted will be skipped).
--verbose: Display detailed conversion process information.
-l, --list: List all supported character encodings.
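--unicode-subst=FORMAT: Replace Unicode characters that cannot be represented in the target encoding with the specified format.
--byte-subst=FORMAT: Replace input bytes that are not valid in the source encoding with the specified format.
--widechar-subst=FORMAT: Replace invalid wide characters (when the source encoding is wchar_t) with the specified format.

Note: the three --*-subst options belong to the GNU libiconv implementation of iconv. The iconv shipped with glibc does not accept them.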

2.3 Supported Encodings

Run the following command to see all encodings supported by iconv:

iconv -l

The output will list hundreds of encodings, such as:

UTF-8, UTF-16, UTF-32
GBK, GB2312, BIG5
ISO-8859-1, ISO-8859-15
Windows-1252, CP936
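
The layout of the iconv -l output differs between implementations, so searching it for a name is not always reliable. As an alternative, the sketch below checks a name by attempting a conversion of empty input; supports_encoding is a hypothetical helper name, and NO-SUCH-ENCODING is a deliberately invalid name used for contrast.

```shell
# Hypothetical helper: succeed if iconv accepts $1 as a source encoding.
# Converting empty input fails only when the name is not recognised.
supports_encoding() {
  iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

for name in UTF-8 GBK ISO-8859-15 NO-SUCH-ENCODING; do
  if supports_encoding "$name"; then
    echo "$name: supported"
  else
    echo "$name: not supported"
  fi
done
```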

3. Basic Usage and Examples

The following specific examples demonstrate the basic usage of iconv.

3.1 Example 1: Convert GBK Encoded File to UTF-8

Assuming there is a GBK encoded text file input.txt with the content:

你好，世界！

Convert it to UTF-8 encoding and output to output.txt:

iconv -f GBK -t UTF-8 input.txt -o output.txt

-f GBK: Specify the input file as GBK encoded.
-t UTF-8: Specify the output file as UTF-8 encoded.
-o output.txt: Specify the output file.

After execution, the content of output.txt will still be:

你好，世界！

But its encoding has changed to UTF-8. You can verify with the file command:

file output.txt

The output may be:

output.txt: UTF-8 Unicode text
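
The file command only guesses from byte patterns. A stricter check is a round trip: converting the result back to the source encoding should reproduce the original bytes exactly. The sketch below builds its own small GBK sample in a temporary directory; the file names are placeholders for this example.

```shell
workdir=$(mktemp -d)

# Build a small GBK file: "你好" plus a newline (GBK bytes in octal escapes).
printf '\304\343\272\303\n' > "$workdir/input.txt"

# Forward conversion, as in the example above.
iconv -f GBK -t UTF-8 "$workdir/input.txt" > "$workdir/output.txt"

# Round trip: converting back should give the original bytes.
iconv -f UTF-8 -t GBK "$workdir/output.txt" > "$workdir/roundtrip.txt"

if cmp -s "$workdir/input.txt" "$workdir/roundtrip.txt"; then
  roundtrip=ok
else
  roundtrip=mismatch
fi
echo "round trip: $roundtrip"
rm -rf "$workdir"
```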

3.2 Example 2: Convert Encoding from Standard Input

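If no input file is given, iconv reads from standard input, so it can be used in a pipe. The piped data must actually be in the source encoding: on a UTF-8 terminal, echo "你好，世界！" emits UTF-8 bytes, and declaring them as GBK would produce garbage or an error. The example below therefore uses printf to emit the GBK bytes for 你好 and converts them to UTF-8:

printf '\304\343\272\303\n' | iconv -f GBK -t UTF-8

The output, 你好, is displayed in the terminal, encoded as UTF-8.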

3.3 Example 3: Batch Convert Multiple Files

Assuming there are multiple GBK encoded files that need to be converted to UTF-8, you can use a shell for loop with iconv for batch conversion:

for file in *.txt; do
  iconv -f GBK -t UTF-8 "$file" -o "${file%.txt}_utf8.txt"
done

*.txt: Matches all .txt files.
${file%.txt}_utf8.txt: Changes the output file name to the original file name with the _utf8 suffix.
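
For larger batches it can help to write the results into a separate directory and to count failures instead of stopping silently. The following is a sketch under those assumptions; the src and utf8 directory names and the two sample files are made up for the example.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/utf8"
printf '\304\343\272\303\n' > "$workdir/src/a.txt"   # GBK "你好"
printf '\312\300\275\347\n' > "$workdir/src/b.txt"   # GBK "世界"

converted=0
failed=0
for file in "$workdir"/src/*.txt; do
  out="$workdir/utf8/$(basename "$file")"
  if iconv -f GBK -t UTF-8 "$file" > "$out"; then
    converted=$((converted + 1))
  else
    echo "conversion failed: $file" >&2
    rm -f "$out"               # do not leave a partial result behind
    failed=$((failed + 1))
  fi
done

echo "converted=$converted failed=$failed"
rm -rf "$workdir"
```

Keeping the originals untouched in their own directory also makes it safe to rerun the loop after fixing a wrong source encoding.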

3.4 Example 4: Ignore Invalid Characters

Sometimes the input file may contain characters that cannot be converted. Using the -c option can skip these characters. For example:

iconv -c -f GBK -t UTF-8 input.txt -o output.txt

If input.txt contains characters that cannot be converted from GBK to UTF-8, these characters will be ignored instead of causing the conversion to fail.
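
The effect of -c can be seen on a tiny input. The sketch below feeds iconv a string with one stray byte that is never valid in UTF-8 and compares the behaviour with and without -c. Note that iconv may still exit with a non-zero status when -c drops something, which is why the first command ends in || true.

```shell
# Input: "abc", a stray 0xFF byte (never valid in UTF-8), then "def".
cleaned=$(printf 'abc\377def' | iconv -c -f UTF-8 -t UTF-8 2>/dev/null || true)
echo "with -c:    $cleaned"

# Without -c the same input is rejected.
if printf 'abc\377def' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  strict=accepted
else
  strict=rejected
fi
echo "without -c: $strict"
```

Because -c discards data silently, it is best reserved for cases where losing the affected characters is acceptable.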

4. Advanced Usage

The following introduces some advanced usages of iconv, suitable for complex scenarios.

4.1 Custom Replacement of Invalid Characters

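When the input contains bytes that are invalid in the source encoding, or characters that the target encoding cannot represent, iconv by default reports an error and stops the conversion. Besides skipping such characters with -c, the GNU libiconv implementation lets you specify a replacement with --unicode-subst, --byte-subst, or --widechar-subst. The iconv shipped with glibc does not accept these options.

--unicode-subst applies to characters that the target encoding cannot represent, so it never triggers when the target is UTF-8. For example, when converting UTF-8 to ISO-8859-1, replace unconvertible characters with ?:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="?" input.txt > output.txt

If input.txt contains characters that ISO-8859-1 cannot represent, they will be replaced with ?, instead of causing an error. Shell redirection is used for the output here because it works with every iconv implementation.

The format string follows printf conventions and may take the code point as its single argument. For example, replace unconvertible characters with a hexadecimal representation:

iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="[U+%04X]" input.txt > output.txt

Unconvertible characters will be replaced with the [U+XXXX] format, where XXXX is the Unicode code point of the character. To replace invalid input bytes instead, as in a damaged GBK file, use --byte-subst in the same way.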
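
The substitution options come from the GNU libiconv implementation, and not every iconv build accepts them. A script can probe for support before relying on them. The sketch below assumes that an iconv without the option rejects it with a non-zero exit status, and it falls back to -c otherwise; the variable name subst_supported is made up for this example.

```shell
# Probe: try the option on empty input and see whether iconv accepts it.
if iconv --unicode-subst='?' -f UTF-8 -t UTF-8 </dev/null >/dev/null 2>&1; then
  subst_supported=yes
else
  subst_supported=no
fi
echo "substitution options available: $subst_supported"

# Choose how to treat characters the target encoding cannot represent.
if [ "$subst_supported" = yes ]; then
  set -- --unicode-subst='[U+%04X]'
else
  set -- -c                    # fall back to dropping them
fi

# "café" in UTF-8, converted to ASCII, which has no "é".
printf 'caf\303\251\n' | iconv "$@" -f UTF-8 -t ASCII || true
```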

4.2 Encoding Detection Combined with iconv

iconv itself does not provide encoding detection functionality, but it can be combined with tools like file or enca to detect file encoding. For example, use file to check file encoding:

file input.txt

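The output may be:

input.txt: ISO-8859 text

file only guesses from byte patterns, and it often reports GBK files as ISO-8859 text as well, so treat the result as a hint. If the file really is ISO-8859-1, convert it with: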

iconv -f ISO-8859-1 -t UTF-8 input.txt -o output.txt

If more precise encoding detection is needed, you can use enca (needs to be installed):

enca -L zh input.txt

Assuming the output is GBK, then execute:

iconv -f GBK -t UTF-8 input.txt -o output.txt
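
When neither file nor enca is available, a rough alternative is trial conversion: try a list of candidate encodings and take the first one under which the whole file decodes without error. This is only a heuristic sketch. guess_encoding is a hypothetical helper, and the candidate list is an assumption that has to be adapted to the data at hand.

```shell
# Hypothetical helper: print the first candidate encoding that decodes
# the whole file without error. Usage: guess_encoding FILE ENC...
guess_encoding() {
  file=$1; shift
  for enc in "$@"; do
    if iconv -f "$enc" -t UTF-8 "$file" >/dev/null 2>&1; then
      echo "$enc"
      return 0
    fi
  done
  return 1
}

workdir=$(mktemp -d)
printf '\304\343\272\303\n'         > "$workdir/gbk.txt"    # "你好" in GBK
printf '\344\275\240\345\245\275\n' > "$workdir/utf8.txt"   # "你好" in UTF-8

# Order matters: strict encodings first, permissive single-byte ones last.
gbk_guess=$(guess_encoding "$workdir/gbk.txt" UTF-8 GBK ISO-8859-1)
utf8_guess=$(guess_encoding "$workdir/utf8.txt" UTF-8 GBK ISO-8859-1)

echo "gbk.txt:  $gbk_guess"
echo "utf8.txt: $utf8_guess"
rm -rf "$workdir"
```

Single-byte encodings such as ISO-8859-1 accept almost any byte sequence, so they belong at the end of the list, and a match there says little about the real encoding.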

4.3 Handling Large Files

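For large files, iconv can be run directly between an input file and an output file, so no intermediate copy is needed. iconv reads the file itself, which makes cat unnecessary:

iconv -f GBK -t UTF-8 large_file.txt > large_file_utf8.txt

Memory use depends on the implementation: some versions, including older glibc releases, read the entire input into memory before converting. When memory is limited, split the file into chunks first, as described in section 5.3.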

4.4 Convert CSV Files While Preserving Structure

CSV files are commonly used for data exchange, but their encoding may vary due to different sources. Assuming there is a GBK encoded CSV file data.csv:

姓名,年龄,城市
张三,25,北京
李四,30,上海

Convert it to UTF-8:

iconv -f GBK -t UTF-8 data.csv -o data_utf8.csv

After conversion, the content of data_utf8.csv remains unchanged, but the encoding changes to UTF-8. You can verify with the file command or a text editor.

4.5 Use with Other Tools for Complex Text Processing

iconv can be combined with other Linux tools (such as sed, awk) to handle complex text conversion tasks. For example, if you need to convert all GBK encoded Chinese characters in a file to UTF-8 and replace certain specific characters with others:

iconv -f GBK -t UTF-8 input.txt | sed 's/你好/Hello/g' > output.txt

Here, iconv first converts the file from GBK to UTF-8, then sed replaces “你好” with “Hello” in the text.

4.6 Handling Complex Scenarios with Multi-byte Encodings

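When dealing with multi-byte encodings such as UTF-16 and UTF-32, attention must be paid to byte order and to the BOM (Byte Order Mark). The names UTF-16BE (big-endian) and UTF-16LE (little-endian) fix the byte order explicitly and imply no BOM, whereas plain UTF-16 relies on a BOM at the start of the data to indicate the byte order.

For example, convert a UTF-16BE file to UTF-8:

iconv -f UTF-16BE -t UTF-8 input.txt -o output.txt

If the file starts with a BOM, use -f UTF-16 instead so that iconv reads the byte order from the BOM and consumes it. With an explicit UTF-16BE or UTF-16LE source, a leading BOM is typically converted like any other character and ends up as a UTF-8 BOM in the output.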
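
The difference between the explicit byte-order names and plain UTF-16 can be seen on a single character. The sketch below encodes the letter A both ways and then decodes a BOM-prefixed little-endian sample; hex is a helper name introduced only for this example.

```shell
# Helper (hypothetical name): print stdin as one continuous hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Encoding "A" with an explicit byte order writes no BOM.
le_hex=$(printf 'A' | iconv -f UTF-8 -t UTF-16LE | hex)
be_hex=$(printf 'A' | iconv -f UTF-8 -t UTF-16BE | hex)
echo "UTF-16LE: $le_hex"
echo "UTF-16BE: $be_hex"

# Little-endian "A" preceded by a BOM (FF FE), decoded as generic UTF-16:
# the BOM selects the byte order and does not appear in the output.
decoded=$(printf '\377\376A\000' | iconv -f UTF-16 -t UTF-8)
echo "decoded:  $decoded"
```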

4.7 Write Scripts for Automated Conversion

The following is a Bash script to recursively scan directories and convert all GBK files to UTF-8:

#!/bin/bash
# Recursively find all .txt files
find . -type f -name "*.txt" | while IFS= read -r file; do
  # Check file encoding (file is a heuristic and rarely reports "gbk";
  # adjust the names below to what it prints for your files)
  encoding=$(file -b --mime-encoding "$file")
  if [[ "$encoding" == "iso-8859-1" || "$encoding" == "gbk" ]]; then
    echo "Converting $file from $encoding to UTF-8"
    # Backup original file
    cp "$file" "$file.bak"
    # Convert from the backup so the input is never overwritten while being read
    iconv -f "$encoding" -t UTF-8 "$file.bak" -o "$file"
  fi
done

Save as convert.sh, grant execute permission (chmod +x convert.sh), and then run:

./convert.sh

The script will:

Find all .txt files.
Use file to check encoding.
If the encoding is GBK or ISO-8859-1, convert to UTF-8 and back up the original file.

5. Common Issues and Solutions

5.1 Garbled Text Issues

If the converted file appears garbled, possible reasons include:

Incorrect Input Encoding Specification: Use file or enca to check the actual encoding of the file.
Invalid Characters: Use -c to ignore or --unicode-subst to replace invalid characters.
BOM Issues: Some editors may add BOM to UTF-8 files, causing display anomalies. You can manually remove BOM:
```
sed -i '1s/^\xEF\xBB\xBF//' output.txt
```
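
Before removing anything, it is worth confirming that the file really starts with a UTF-8 BOM (the bytes EF BB BF). The following is a minimal check; has_utf8_bom and the sample file names are hypothetical and exist only for this example.

```shell
# Hypothetical helper: succeed if the file starts with the bytes EF BB BF.
has_utf8_bom() {
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

workdir=$(mktemp -d)
printf '\357\273\277hello\n' > "$workdir/with_bom.txt"
printf 'hello\n'             > "$workdir/plain.txt"

if has_utf8_bom "$workdir/with_bom.txt"; then bom_result=present; else bom_result=absent; fi
if has_utf8_bom "$workdir/plain.txt";    then plain_result=present; else plain_result=absent; fi

echo "with_bom.txt: BOM $bom_result"
echo "plain.txt:    BOM $plain_result"
rm -rf "$workdir"
```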

5.2 Conversion Failures

If iconv reports an error (such as illegal input sequence), possible reasons include:

The input file contains corrupted byte sequences. Use -c to skip invalid characters.
The specified source encoding does not match the actual encoding of the file. Recheck the encoding.

5.3 Performance Issues

For extremely large files, the performance of iconv may be limited. You can optimize by chunk processing or parallel conversion:

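split -C 100M large_file.txt part_
for part in part_*; do
  iconv -f GBK -t UTF-8 "$part" -o "${part}_utf8" &
done
wait
cat part_*_utf8 > large_file_utf8.txt

split: Splits the large file into chunks of at most 100MB. The -C option (GNU split) breaks only at line ends, so no multi-byte character is cut in half; splitting at a fixed byte count with -b could do that and cause an illegal input sequence error.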
&: Runs conversion tasks in parallel.
wait: Waits for all tasks to complete.
cat: Merges the results.
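
Before applying this to real data, the split-convert-merge approach can be checked on a small scale by comparing the merged result with a single-pass conversion. The sketch below does that with a six-line GBK sample and two-line chunks; all file names are placeholders, and split -l is used in place of a size limit because the sample is tiny.

```shell
workdir=$(mktemp -d)

# Six lines of GBK text ("你好" on each line).
i=0
while [ "$i" -lt 6 ]; do
  printf '\304\343\272\303\n'
  i=$((i + 1))
done > "$workdir/big.txt"

# Split on line boundaries so no multi-byte character is cut in half.
split -l 2 "$workdir/big.txt" "$workdir/part_"

# Convert the chunks in parallel, then wait for all of them.
for part in "$workdir"/part_??; do
  iconv -f GBK -t UTF-8 "$part" > "$part.utf8" &
done
wait

# Merge in chunk order (the glob sorts part_aa, part_ab, ...).
cat "$workdir"/part_??.utf8 > "$workdir/merged.txt"

# The merged result should match a single-pass conversion.
iconv -f GBK -t UTF-8 "$workdir/big.txt" > "$workdir/direct.txt"
if cmp -s "$workdir/merged.txt" "$workdir/direct.txt"; then
  merge_check=identical
else
  merge_check=different
fi
echo "merged vs direct: $merge_check"
rm -rf "$workdir"
```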

6. Precautions

Backup Files: It is recommended to back up the original files before conversion to prevent accidental data loss.
Encoding Compatibility: Ensure that the target encoding supports all characters in the input file. For example, ASCII does not support Chinese characters.
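Environment Dependency: On most Linux systems iconv ships with glibc and is already installed. On systems that use GNU libiconv instead, make sure that library is installed. The available options and encodings differ slightly between the two.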
Test Output: Before batch conversion, test a small number of files to ensure correct results.
Multi-byte Encoding: When handling UTF-16/UTF-32, pay attention to byte order and BOM.

7. Conclusion

iconv is a powerful tool for handling character encoding conversion in Linux systems, suitable for everything from simple file conversions to complex multilingual data processing. With flexible options and integration with other tools, iconv can tackle various practical scenarios. This article introduced the basic usage, options, examples, and advanced applications of iconv, and provided solutions to common issues. Whether handling Chinese GBK files, batch converting multilingual text, or optimizing large file processing, iconv is an indispensable tool.