How to Scrape Comments from NetEase News Using Python

Hello everyone, I am Niu Ge. Recently, I found that browsing news comments online has become increasingly interesting, with all kinds of humorous replies popping up, making it hard for me to stop. I thought about using Python to scrape these comments to see if I could discover more funny jokes. Today, I will teach you how to scrape comments from NetEase News using Python, so we can explore the humor of netizens together!

In fact, many people want to quickly and easily obtain these comment data for analysis or just for fun. The goal of this tutorial is to use Python to effortlessly scrape comments from NetEase News and save them for everyone to enjoy later.

Preparing the Development Environment

You need to install some necessary Python libraries. Don’t worry, it’s as simple as downloading an app.

# Install the requests library for sending HTTP requests
pip install requests

# Install the beautifulsoup4 library for parsing HTML
pip install beautifulsoup4

Tip: If you encounter issues during installation, try a domestic mirror source such as Douban: `pip install requests -i https://pypi.douban.com/simple/`

Getting the News URL and Comment API

We need to find the news URL and the comment API. Open a NetEase news article, press F12 to open the developer tools, go to the Network tab, refresh the page, and look through the request list for the comment API URL. It usually contains a keyword like `comment`.

import requests

# Replace with the news URL you want to scrape
news_url = "https://example.news.163.com/article/xxxxxxxxx.html"

# Find the comment API URL by observing the Network tab, usually containing keywords like comment
comment_url = "https://comment.api.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/xxxxxxxxx/comments/newList"

Tip: The comment API URL is different from the news page URL itself, and its exact form varies by article, so observe the requests in the Network tab carefully.
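To avoid pasting the API URL by hand for every article, you can derive it from the article URL. The sketch below is a hypothetical helper: it assumes the article ID is the last path segment before `.html` and reuses the product key shown above, which you should verify against the actual requests in your own Network tab.

```python
import re

# Product key taken from the example URL above -- confirm it in your Network tab.
PRODUCT_KEY = "a2869674571f77b5a0867c3d71db5856"

def build_comment_url(news_url):
    """Build the comment API URL from a news article URL.

    Assumes the article ID is the final path segment before ".html".
    """
    match = re.search(r'/([0-9A-Za-z]+)\.html', news_url)
    if not match:
        raise ValueError("Could not find an article ID in: " + news_url)
    article_id = match.group(1)
    return ("https://comment.api.163.com/api/v1/products/"
            + PRODUCT_KEY + "/threads/" + article_id + "/comments/newList")
```

With this helper, `build_comment_url(news_url)` gives you the `comment_url` used in the rest of the tutorial.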

Sending Requests to Get Comment Data

Once we find the comment API, we can use the `requests` library to send a request and obtain the comment data.

import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
}

response = requests.get(comment_url, headers=headers)
data = json.loads(response.text)

Tip: Setting the User-Agent can simulate browser behavior to avoid being blocked by anti-scraping mechanisms.
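One thing to watch for: depending on the query parameters, some comment APIs return JSONP (the JSON wrapped in a callback function) instead of plain JSON, in which case `json.loads` fails. Whether NetEase does this for your request is something to check in the Network tab; the defensive helper below unwraps a JSONP response and passes plain JSON through unchanged.

```python
import re

def strip_jsonp(text):
    """If the response looks like JSONP, e.g. callback({...}), unwrap it.

    Plain JSON passes through unchanged. This is a defensive sketch; verify
    against the actual response body you see in the Network tab.
    """
    match = re.match(r'^\s*[\w$.]+\s*\(\s*(.*)\s*\)\s*;?\s*$', text, re.S)
    return match.group(1) if match else text
```

You would then call `json.loads(strip_jsonp(response.text))` instead of `json.loads(response.text)`.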

Parsing Comment Data

After obtaining the data, we need to use the `json` library to parse it and extract the comment content.

import json

comments = data['comments']
for comment_id, comment_data in comments.items():
    content = comment_data['content']
    print(content)

Tip: Comment data is usually returned in JSON format, which can be parsed using the json library.
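Real responses are not always complete: an entry may lack a `content` field, or the `comments` key may be missing entirely, and the loop above would then raise a `KeyError`. A slightly more defensive version, assuming the same layout as above (a top-level `comments` dict mapping IDs to objects), skips incomplete entries instead of crashing:

```python
def extract_contents(data):
    """Pull comment text out of the parsed response.

    Assumes the layout used in this tutorial: a top-level "comments" dict
    mapping comment IDs to objects with a "content" field. Entries without
    content are skipped instead of raising KeyError.
    """
    results = []
    for comment in data.get('comments', {}).values():
        content = comment.get('content')
        if content:
            results.append(content)
    return results

sample = {'comments': {'1': {'content': 'first'}, '2': {'nickname': 'anon'}}}
print(extract_contents(sample))  # only entries with a "content" field survive
```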

Saving Comment Data

We can save the scraped comments to a file.

with open('comments.txt', 'w', encoding='utf-8') as f:
    for comment_id, comment_data in comments.items():
        content = comment_data['content']
        f.write(content + '\n')

Tip: Using UTF-8 encoding can prevent Chinese character garbling.
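Writing plain text throws away everything except the comment body. If you want to keep metadata for later analysis, one option is JSON Lines: one JSON object per line. In the sketch below, `content` matches the field used above, while `nickname` is an assumed field name; check the actual JSON in your Network tab before relying on it.

```python
import json

def save_comments_jsonl(comments, path):
    """Write one JSON object per line so each record stays self-contained.

    "nickname" is an assumed field name -- verify it against the real
    response before relying on it.
    """
    with open(path, 'w', encoding='utf-8') as f:
        for comment_id, comment_data in comments.items():
            record = {
                'id': comment_id,
                'content': comment_data.get('content', ''),
                'nickname': comment_data.get('nickname', ''),
            }
            # ensure_ascii=False keeps Chinese characters readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + '\n')
```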

Code Implementation

import requests

news_url = "https://example.news.163.com/article/xxxxxxxxx.html"
comment_url = "https://comment.api.163.com/api/v1/products/a2869674571f77b5a0867c3d71db5856/threads/xxxxxxxxx/comments/newList"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
}

response = requests.get(comment_url, headers=headers)
response.raise_for_status()  # stop early on an HTTP error
data = response.json()

with open('comments.txt', 'w', encoding='utf-8') as f:
    for comment_id, comment_data in data['comments'].items():
        f.write(comment_data['content'] + '\n')

print("Comments saved successfully!")
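The script above fetches only the first page of comments. Comment APIs like this one typically page their results with query parameters; `offset` and `limit` are assumed parameter names here, so confirm them against the real requests your browser makes in the Network tab. A small helper keeps the paging logic in one place:

```python
def page_params(page, page_size=30):
    """Query parameters for one page of comments.

    "offset" and "limit" are assumed parameter names -- verify them in the
    Network tab before relying on this.
    """
    return {'offset': page * page_size, 'limit': page_size}

# Usage sketch (network call, shown for illustration):
# for page in range(3):
#     resp = requests.get(comment_url, headers=headers, params=page_params(page))
```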

How about that? Isn’t it simple? Just a few lines of code and you’re done! Now you can easily collect comments from NetEase News without copying and pasting by hand. Isn’t that much more convenient?

Beyond NetEase News, the same approach works for comments on other websites; you only need to change the URL and the parsing logic. For example, to scrape Weibo comments, find Weibo’s comment API and parse its data in the same way. Flexible, isn’t it?

Conclusion

This article briefly introduced how to use Python to scrape comments from NetEase News. I hope everyone can master basic scraping techniques through this tutorial and apply them to other scenarios. Go ahead and give it a try!