How to Use BERT for Chinese Sentiment Analysis in Python

In today’s digital age, sentiment analysis is a powerful text analysis tool widely used in market research, public opinion monitoring, customer service, and many other fields. The emergence of the BERT (Bidirectional Encoder Representations from Transformers) model has brought significant breakthroughs to sentiment analysis. Today, let’s delve into how to use BERT for Chinese sentiment analysis in Python.

1. Introduction to the BERT Model

BERT is a pre-trained language model released by Google in 2018 and built on the Transformer architecture. It learns rich linguistic and semantic knowledge through self-supervised pre-training on large amounts of text. The core feature of BERT is its bidirectional training: each token's representation is conditioned on both its left and right context, which makes the model particularly good at understanding text semantics. For Chinese sentiment analysis tasks, BERT can better capture the emotional tendencies and semantic associations in the text.
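To get a feel for what the pre-trained model has learned before we touch sentiment analysis, you can query its masked-language-modeling head directly. Below is a minimal sketch using the Transformers fill-mask pipeline with bert-base-chinese; the example sentence is just an illustration, and the checkpoint is downloaded on first use.

from transformers import pipeline

# Load a fill-mask pipeline backed by the Chinese BERT checkpoint
fill_mask = pipeline("fill-mask", model="bert-base-chinese")

# BERT uses both the left and right context to predict the masked character
for prediction in fill_mask("今天的天气真[MASK]。"):
    print(prediction["token_str"], round(prediction["score"], 4))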

2. Preparation

Before we begin, we need to ensure that Python and the following necessary libraries are installed.

1. Install PyTorch

PyTorch is a popular deep learning framework, and many implementations of BERT are based on it. You can install PyTorch using the following command:

pip install torch

2. Install the Transformers Library

Hugging Face's Transformers library provides a wide range of pre-trained models, including BERT. The installation command is as follows:

pip install transformers

3. Install Other Dependencies

We also need to install some common libraries, such as numpy and scikit-learn, for data processing and model evaluation.

pip install numpy scikit-learn
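Optionally, you can run a quick sanity check to confirm the environment is ready; the snippet below just prints the installed versions.

import torch
import transformers
import sklearn

# Print library versions to confirm everything imported correctly
print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("scikit-learn:", sklearn.__version__)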

3. Loading the Pre-trained BERT Model

In Python, we can easily load the pre-trained BERT model using Hugging Face's Transformers library. Here is an example:

from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained BERT model and tokenizer
model_name = "bert-base-chinese"  # Use the Chinese version of the BERT model
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

Here we selected bert-base-chinese, a checkpoint pre-trained specifically on Chinese text, which is well suited to Chinese sentiment analysis. Note that BertForSequenceClassification adds a classification head on top of the pre-trained encoder; this head is randomly initialized (Transformers will warn about newly initialized weights), so the model needs to be fine-tuned on labeled data before its sentiment predictions are meaningful (see Section 6).
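If you want to be explicit about how many sentiment classes the classification head should output, you can pass num_labels when loading the model (two labels is already the default for BertForSequenceClassification):

# Explicitly request a two-class head for binary sentiment (this is also the default)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)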

4. Text Preprocessing

Before performing sentiment analysis, we need to preprocess the text data. BERT's tokenizer uses the WordPiece method to split text into subwords; for Chinese, bert-base-chinese effectively splits the text into individual characters. Here is example code for text preprocessing:

# Example text: "This movie is really good, I like it a lot!"
text = "这部电影真的很好看,我非常喜欢!"

# Encode the text using the tokenizer
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Print the encoded input
print(inputs)

This code converts the text into the format the BERT model expects, including input IDs, token type IDs, and an attention mask.
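If you are curious how the sentence was segmented, you can map the input IDs back to tokens; this is just an optional check that reuses the tokenizer and inputs from the snippet above.

# Map the input IDs back to tokens to inspect the segmentation
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(tokens)
# For Chinese, bert-base-chinese mostly produces one token per character,
# wrapped in the special [CLS] and [SEP] tokens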

5. Sentiment Analysis Prediction

Now, we can use the loaded BERT model to predict the sentiment of the text. Here is the complete prediction flow:

import torch

# Put the model in evaluation mode and disable gradient tracking for inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction results
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1)

# Print the prediction result
print("Predicted sentiment category:", predicted_class.item())

In this example, logits are the raw scores output by the model, and torch.argmax selects the highest-scoring class as the predicted sentiment category. With two labels, the convention used here is 0 for negative and 1 for positive. Remember that these predictions only become meaningful after the classification head has been fine-tuned, as described in the next section.
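If you also want class probabilities rather than just the predicted class, you can apply a softmax to the logits. A minimal sketch reusing the logits from above:

import torch.nn.functional as F

# Convert the raw logits into a probability distribution over the sentiment classes
probabilities = F.softmax(logits, dim=1)
print("Class probabilities:", probabilities.squeeze().tolist())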

6. Model Fine-tuning (Optional)

Although the pre-trained BERT model is already very capable, we can further improve its performance by fine-tuning it on a specific domain or dataset. Fine-tuning means continuing to train the pre-trained model on our own labeled data so that it adapts better to the specific task.

Here is a simple example of fine-tuning:

from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# Example dataset: "This movie is great", "I don't like this movie",
# "The plot is exciting", "The actors' performance was poor"
texts = ["这部电影很好看", "我不喜欢这部电影", "剧情很精彩", "演员表现很差"]
labels = [1, 0, 1, 0]  # 1 indicates positive sentiment, 0 indicates negative sentiment

# Define dataset class
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            return_tensors="pt",
        )

        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "labels": torch.tensor(label, dtype=torch.long),
        }

# Split the dataset into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Create dataset instances
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer, max_length=512)
test_dataset = SentimentDataset(test_texts, test_labels, tokenizer, max_length=512)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False)

# Define the device (use the GPU if available) and move the model to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define optimizer (torch.optim.AdamW is another common choice for BERT fine-tuning)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# Fine-tuning training process
model.train()
for epoch in range(3):  # Train for 3 epochs
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

# Evaluate model performance on the test set
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        _, predicted = torch.max(outputs.logits, dim=1)

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = correct / total
print(f"Test Accuracy: {accuracy:.4f}")

In this fine-tuning process, we first defined a SentimentDataset class to wrap the texts and labels as a PyTorch dataset. We then split the data into training and test sets, created data loaders, defined the optimizer, and ran the training loop, printing the loss of the last batch at the end of each epoch. Finally, we evaluated the model on the test set and computed its accuracy.
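After fine-tuning, you will usually want to persist the model so it can be reloaded for inference later. Here is a minimal sketch using the standard save_pretrained / from_pretrained methods; the output directory name is just an example.

# Save the fine-tuned model and tokenizer to a local directory (example path)
output_dir = "./bert-chinese-sentiment"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Later, reload them for inference with the same classes used above
tokenizer = BertTokenizer.from_pretrained(output_dir)
model = BertForSequenceClassification.from_pretrained(output_dir)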

7. Conclusion

Through the above steps, we have successfully used BERT for Chinese sentiment analysis in Python. The powerful performance of the BERT model makes sentiment analysis more accurate and efficient. Of course, in practical applications, we can further optimize and improve the model according to specific needs, such as adjusting hyperparameters and using larger datasets for fine-tuning.

I hope this article helps you better understand and apply BERT for Chinese sentiment analysis. If you have any questions or suggestions, feel free to leave a comment, and let’s learn together!

Follow me for more interesting technical shares!
