Word segmentation is the first and most crucial step in Chinese text processing. Today we will introduce Jieba, a Python word segmentation library that is simple to use, powerful, and suitable for a wide range of text processing scenarios.

Jieba is currently one of the most popular Python tools for Chinese word segmentation. It supports precise mode, full mode, and search engine mode, and provides advanced features such as keyword extraction, part-of-speech tagging, and parallel segmentation.

Installation:
pip install jieba
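After installation, a quick sanity check (jieba.lcut is the list-returning variant of jieba.cut; note that the first call also builds jieba's prefix dictionary, so expect a short loading message):

import jieba

# lcut returns the segmentation result as a plain Python list
print(jieba.lcut("我来到北京清华大学"))
# ['我', '来到', '北京', '清华大学']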
Comparison of the segmentation modes

Jieba provides three modes suited to different scenarios.

1. Precise mode (the default) segments a sentence as accurately as possible and is suitable for text analysis. For example:
import jieba
text = "我来到北京清华大学"
seg_list = jieba.cut(text, cut_all=False)
print("Precise mode:", "/ ".join(seg_list))
Output: 我/ 来到/ 北京/ 清华大学

2. Full mode quickly scans all possible word combinations, but the result may include redundant words.
seg_list = jieba.cut(text, cut_all=True)
print("Full mode:", "/ ".join(seg_list))
Output: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

3. Search engine mode further segments the long words produced by precise mode to improve recall, making it suitable for search engine indexing.
seg_list = jieba.cut_for_search(text)
print("Search engine mode:", "/ ".join(seg_list))
Output: 我/ 来到/ 北京/ 清华/ 华大/ 大学/ 清华大学

To improve segmentation accuracy, Jieba supports custom dictionaries. Example:
jieba.add_word("南京市长") # Add new word
jieba.add_word("江大桥")
text = "南京市长江大桥"
seg_list = jieba.cut(text)
print("/ ".join(seg_list)) # Output: 南京市/ 长江大桥 (default segmentation)
You can also set a frequency to give a word higher priority.
jieba.add_word("南京市长", freq=20000) # Increase word frequency for priority
seg_list = jieba.cut(text)
print("/ ".join(seg_list)) # Output: 南京市/ 长江大桥 (may still not be ideal)
External dictionary files can also be used.
jieba.load_userdict("custom_dict.txt") # Load custom dictionary file
Each line of the dictionary file contains one entry in the format: word frequency part-of-speech, separated by spaces (the frequency and part-of-speech tag are optional).
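As an illustration, the following sketch writes a two-entry dictionary file and then loads it; the entries, frequencies, and tags here are made up for this example:

import jieba

# Write a hypothetical custom_dict.txt (frequency and POS tag are optional)
with open("custom_dict.txt", "w", encoding="utf-8") as f:
    f.write("南京市长 20000 n\n")
    f.write("江大桥 20000 nr\n")

jieba.load_userdict("custom_dict.txt")
print("/ ".join(jieba.cut("南京市长江大桥")))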
4. Keyword extraction: TF-IDF & TextRank

Jieba provides two mainstream keyword extraction algorithms.

TF-IDF (statistics-based):
import jieba.analyse
text = """自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。"""
# Extract top 5 keywords (with weights)
keywords = jieba.analyse.extract_tags(text, topK=5, withWeight=True)
for word, weight in keywords:
    print(f"{word}: {weight:.4f}")
Output:
自然语言处理: 0.3123
计算机科学: 0.2567
人工智能: 0.2145
领域: 0.1890
理论: 0.1567
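extract_tags also accepts an allowPOS parameter that restricts results to particular parts of speech. A small sketch, assuming we only want nouns (n) and verbal nouns (vn):

# Keep only keywords tagged as nouns or verbal nouns
keywords = jieba.analyse.extract_tags(text, topK=5, allowPOS=("n", "vn"))
print(keywords)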
TextRank (graph-based):
textrank_keywords = jieba.analyse.textrank(text, topK=5, withWeight=True)
for word, weight in textrank_keywords:
    print(f"{word}: {weight:.4f}")
TF-IDF is suitable for short texts such as news headlines and Weibo posts, while TextRank is better suited to long texts such as papers and reports.

5. Part-of-speech tagging

Jieba can identify the part of speech of each word, which lets us perform lexical analysis on a sentence.
import jieba.posseg as pseg
words = pseg.cut("我爱自然语言处理技术")
for word, flag in words:
    print(f"{word}({flag})", end=" ")
Output:
我(r) 爱(v) 自然语言(n) 处理(v) 技术(n)
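Each word comes back paired with a tag (the common tags are listed below). One practical use is filtering by part of speech; here is a minimal sketch that keeps only the nouns (tags starting with "n"):

import jieba.posseg as pseg

# Keep only words whose POS tag starts with "n" (the noun family)
nouns = [word for word, flag in pseg.cut("我爱自然语言处理技术")
         if flag.startswith("n")]
print(nouns)  # ['自然语言', '技术'], per the output above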
Common part-of-speech tags: n (noun), v (verb), a (adjective), r (pronoun), d (adverb).

In conclusion, Jieba is a very useful word segmentation library, and it is well worth trying out in your own Chinese text processing projects.