60 Lines of Code to Scrape Zhihu’s Best Answers

Source: Python and Data Analysis, Author: shenzhongqiang

Scraping Zhihu’s best answers is quite simple, and in this article, we will reveal the underlying principles.

What characteristics do Zhihu’s best answers have? Let’s observe first.

Did you notice any patterns? They are concise and insightful, and they have a lot of upvotes. So to scrape Zhihu’s best answers, we only need the answers with many upvotes and few words. This takes two simple steps: first scrape Zhihu answers, then filter them. Isn’t it easy?

Scraping Zhihu Answers

First, we scrape the answers on Zhihu. There are too many answers on Zhihu, and scraping all of them at once would be time-consuming. We can select a few topics and scrape the content from those topics. The following function is used to scrape the content of a specified topic.

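import json

import pymongo
import requests


def get_answers_by_page(topic_id, page_no):
    # Offset of the first answer on this page (10 per page), for use in the URL
    offset = page_no * 10
    url = <topic_url> # topic_url is the URL corresponding to this topic
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    }
    # verify=False skips TLS certificate verification
    r = requests.get(url, verify=False, headers=headers)
    content = r.content.decode("utf-8")
    data = json.loads(content)
    is_end = data["paging"]["is_end"]  # True once the last page is reached
    items = data["data"]
    # Store the raw items in the local "zhihu" database
    client = pymongo.MongoClient()
    db = client["zhihu"]
    if len(items) > 0:
        db.answers.insert_many(items)
        # Record which page has been saved (insert() was removed in PyMongo 4)
        db.saved_topics.insert_one({"topic_id": topic_id, "page_no": page_no})
    return is_end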

The get_answers_by_page function has two parameters: the first parameter is the topic ID, and the second parameter indicates which page of content is being scraped.
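To scrape a whole topic, the function can be called page by page until it reports the last page. The following is only a minimal sketch, assuming the return value is the is_end flag as in the function above. The names scrape_topic, fetch_page, max_pages, and delay are hypothetical and do not come from the original code, and the fetch function is passed in as an argument so the loop can be tried without network access or MongoDB.

```python
import time


def scrape_topic(fetch_page, topic_id, max_pages=100, delay=0.0):
    """Call fetch_page(topic_id, page_no) for pages 0, 1, 2, ... until it returns True.

    fetch_page is expected to behave like get_answers_by_page: it saves one
    page and returns the is_end flag. Returns the number of pages fetched.
    """
    pages_fetched = 0
    for page_no in range(max_pages):
        is_end = fetch_page(topic_id, page_no)
        pages_fetched += 1
        if is_end:
            break
        if delay > 0:
            time.sleep(delay)  # pause between requests to go easy on the server
    return pages_fetched


# Stand-in for get_answers_by_page: pretends the topic has three pages.
def fake_fetch(topic_id, page_no):
    return page_no >= 2


# "example-topic" is a made-up topic ID used only for this demonstration.
print(scrape_topic(fake_fetch, "example-topic"))  # prints 3
```

In real use, get_answers_by_page would be passed in place of fake_fetch, and the max_pages cap keeps the loop from running forever if is_end never becomes true.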

There are several fields in the scraped content that need attention, highlighted in yellow in the image below.

The meanings of these fields are as follows:

  • question.title – The title of the question

  • content – The content of the answer

  • voteup_count – The number of upvotes

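These fields will be used in the next step to filter the answers. In the scraped items they sit under a target object, which is why the filtering code refers to them as target.content and target.voteup_count.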
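As a rough illustration of where these fields live, the sketch below pulls them out of one item. It assumes the fields are nested under a "target" key, as the paths used in the filtering code suggest. The function name extract_fields and the sample item are hypothetical, made up only to match the field names mentioned in this article.

```python
def extract_fields(item):
    """Pull the three fields of interest out of one scraped item."""
    target = item.get("target", {})
    return {
        "title": target.get("question", {}).get("title"),
        "content": target.get("content"),
        "voteup_count": target.get("voteup_count", 0),
    }


# Hypothetical item, shaped only after the field names used in this article.
sample_item = {
    "target": {
        "type": "answer",
        "question": {"title": "What is recursion?"},
        "content": "See: recursion.",
        "voteup_count": 1234,
    }
}

print(extract_fields(sample_item))
```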

Filtering Answers

After scraping the data, we will filter the results.

We use MongoDB’s aggregation pipeline to filter the answers. For details on the aggregation pipeline, see the Aggregation Pipeline Quick Reference at https://docs.mongodb.com/manual/meta/aggregation-quick-reference/. The code is as follows:

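import pymongo

client = pymongo.MongoClient()
db = client["zhihu"]
items = db.answers.aggregate([
    # Keep only items that are answers
    {"$match": {"target.type": "answer"}},
    # Keep answers with at least 1000 upvotes
    {"$match": {"target.voteup_count": {"$gte": 1000}}},
    # Compute the answer length in Unicode code points
    {"$addFields": {"answer_len": {"$strLenCP": "$target.content"}}},
    {"$match": {"answer_len": {"$lte": 50}}},
])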

The code above keeps every answer with at least 1,000 upvotes and at most 50 characters ($strLenCP counts Unicode code points, not words); what remains are the concise and insightful best answers. This is the core code, and the complete code has been uploaded to GitHub. You can reply with “Zhihu Best Answers” in the public account backend to get the address.
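For readers without a MongoDB instance at hand, the same filter can be sketched in plain Python. This is not the author's code, only an illustration of the three conditions in the pipeline. The names is_best_answer, min_votes, and max_len are hypothetical, as are the sample items. One assumption differs from the pipeline: an item with no content is treated here as an empty string.

```python
def is_best_answer(item, min_votes=1000, max_len=50):
    """Return True if item passes the same three checks as the pipeline."""
    target = item.get("target", {})
    if target.get("type") != "answer":
        return False
    if target.get("voteup_count", 0) < min_votes:
        return False
    # len() on a Python str counts code points, like $strLenCP.
    # Missing content is treated as "" here (an assumption of this sketch).
    return len(target.get("content") or "") <= max_len


# Hypothetical sample items for demonstration only.
samples = [
    {"target": {"type": "answer", "voteup_count": 5000, "content": "//TODO"}},
    {"target": {"type": "answer", "voteup_count": 12, "content": "short"}},
    {"target": {"type": "answer", "voteup_count": 3000, "content": "x" * 51}},
    {"target": {"type": "article", "voteup_count": 9000, "content": "hi"}},
]

best = [s for s in samples if is_best_answer(s)]
print(len(best))  # prints 1
```

Only the first sample survives: the second has too few upvotes, the third is one character too long, and the fourth is not an answer.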

Zhihu Best Answers

With the code written, let’s run it and see the results. Coincidentally, yesterday was Programmer’s Day, so we will filter the best answers related to programmers. The results are as follows, a total of 75 hilarious jokes 😂

1

Q: What are the most common “lies” programmers say?

A: //TODO

2

Q: What is it like to keep GitHub green for 365 days?

A:

I once kept it green for over 200 days, but I neglected my girlfriend, and it has been green ever since.

3

Q: How to refute the view that “programmers are useless when away from the computer”?

A: No, many programmers are useless even in front of the computer.

4

Q: What would it be like if everyone spoke in programming languages one day?

A:

hello, world.烫烫烫烫烫烫烫�d}��R�0:�v�?.

5

Q: Suddenly wanting to open a programmer-themed restaurant, named Programmer’s Dish, with dishes named after keywords in various programming languages. Any advice on its prospects?

A: A big hello world at the entrance, with the signature dish called “Braised Product Manager” would definitely be packed.

6

Q: What is recursion?

A: The definition and scope of “political content not suitable for public discussion” also belong to “political content not suitable for public discussion”.

7

Q: How to translate the basic programming term “bug”?

A:

It’s a hassle; your program has bugs again.

8

Q: What is the fun of programming?

A: A person’s sense of accomplishment comes from two things: creation and destruction.

9

Q: How to refute the view that “programmers are useless when away from the computer”?

A: Honestly, if you can chat with such a woman, do you want to sleep with her?

10

Q: As a programmer, what math-related troubles have you encountered while programming?

A: When reading papers, a single “obviously” took me an entire afternoon.

11

Q: What equipment do wealthy programmers have?

A: A girlfriend…

12

Q: Which deity should I pray to for bug-free code?

A: Pray to Yongzheng, he specializes in treating the Eighth Prince.

13

Q: Is studying IT in a good university the only way for poor kids in China to rise to the middle class?

A: Yes, there are four paths: writing code, working in finance, coding in the coding circle, and writing code in the finance circle.

14

Q: Why do programmers like to carry computer bags everywhere, even if there’s no computer inside?

A: Because they have no other bags.

15

Q: How to translate “Talk is cheap. Show me the code”?

A: Stop talking nonsense and show me the code.

16

Q: Why do programmers’ girlfriends or wives generally have a much higher appearance level than the men?

A: I admire programmers’ girlfriends’ looks. If you ask ten programmers who their girlfriend is, nine will say it’s Yui Aragaki (Gakki).

17

Q: Why do some people prefer to buy several mechanical keyboards to switch around instead of using facial masks?

A:

I don’t rely on looks to earn a living.

My hard-earned money, I can spend it however I want.

18

Q: What should be engraved on wedding rings for programmer couples?

A: 0 error 0 warning

19

Q: Do IT engineers feel uncomfortable being called “code farmers”?

A: At least we are still humans; products and designs are already dogs…

20

Q: Why would a 30-year-old male salesman invite a 24-year-old male programmer to Starbucks near the community?

A: Based on my years of experience, he must have a brilliant idea that just needs a programmer to implement it.

21

Q: How to find a girl who likes programmers as a girlfriend?

A: It depends on fate. There are so many users on Zhihu, if you follow me, that’s fate.

22

Q: How does a programmer’s girlfriend celebrate the programmer boyfriend’s birthday?

A: Tell him that the interface is ready.

23

Q: As a programmer, how did you find your girlfriend after work?

A: It’s rare for someone to be a programmer and still like girls.

24

Q: What preparations do programmers need to make if they want to switch to barbecue, and what are the advantages and disadvantages?

A: You see, you don’t even know the advantages and disadvantages of doing barbecue, so you still need a product manager.

25

Q: What can provoke programmers?

A: Passing by their computer and saying, “Oh, writing bugs again!”

26

Q: A teacher of mine said that Java is suitable for large software while C# is suitable for small and medium software. Is this true?

A: Java has a talent for turning small and medium software into large ones.

27

Q: Why were programmer salaries so high in 2014?

A: The hourly wage is not high.

28

Q: Don’t most programmers complain about low salaries?

A:

Who, who complains about high salaries?

29

Q: What should a single programmer do after solving a technical problem without a girl to show off or boast about?

A: Now you understand why so many programmers write technical blogs.

30

Q: Why do Chinese programmers prefer “windbreakers + jeans + sneakers”? If so, why has this trend formed?

A: Do you dress nicely to show it to the computer?

31

Q: As an IT professional, what tools have greatly improved your work efficiency?

A:

Single.

32

Q: Why do I think programmers are generally not good at speaking?

A:

Just take it that we have low emotional intelligence,

that way you are happy,

and we are happy too.

33

Q: In China, the oldest programmers are only about 40 years old. What can Chinese programmers do in the future?

A:

This is the same principle as why most people born in the 90s don’t live past 30.

34

Q: How to reply to a programmer’s message: “Hello world”?

A: hello nerd.

35

Q: How can you tell if an IT guy likes a girl?

A: When he tries hard to get close to you despite his habitual silence.

36

Q: Why shouldn’t programmers know how to fix computers?

A: Does Fan Bingbing need to know how to fix TVs?

37

Q: How to make a colleague realize that he is not as good as he claims to be, saying he is the best in C++ in China?

A:

To be honest, I am not pretending: my C++ level is the best in the country.

38

Q: Why do all icons shake when deleting software on iPhone?

A: Third-party software is scared, while the system’s built-in software is showing off.

39

Q: If a revolver has one bullet and shooting yourself gives you 100,000 yuan, two shots give you 1 billion, three shots 2 billion, four shots 4 billion, and five shots 16 billion, is it worth it?

A:

As long as it doesn’t hit a vital point, I can tell you, I can hit our A station to go public!!!!

40

Q: According to the current trend of iPhone processors doubling in performance every year, will they soon catch up with or even surpass desktop processors?

A: When I was young, I always thought that in two years I would be as big as my brother who is two years older than me.

41

Q: What is the smallest benefit Zhihu has brought you?

A: Killing time without feeling guilty.

42

Q: What anti-human technological inventions or designs exist?

A: Computers can’t connect to the internet, and after diagnosis, they prompt me to connect to solve it.

43

Q: Why are designers unwilling to be called beauticians?

A: As long as the salary is high, you can call me auntie.

44

Q: Why do some people think NetEase Cloud Music is a good conscience in the industry?

A: One day, it suddenly pushed me a message saying that the lyrics I wanted were found.

45

Q: Why haven’t self-destructing attack drones appeared? Have terrorists used them?

A: Are you talking about missiles?

46

Q: Since thoughts are mine, why can’t I control my negative emotions sometimes?

A: The operating system does not allow users to access, modify, or delete core system files, as this would damage the system and lead to operational abnormalities.

47

Q: Although Lu Xun is impressive, is he just a filler among the top ten literary giants in the world?

A: Why should literary giants pay for the rankings made by illiterates?

48

Q: What technologies are close to a bottleneck and have not had major breakthroughs for a long time?

A: Boiling water.

49

Q: How do you view some people’s preference for downloading software from official websites?

A: Classmate, have you never been caught in the Baidu family bucket?

50

Q: Why do many people buy laptops for gaming instead of using better-performing desktops?

A: Because they can’t afford a house…

51

Q: How shocking was the experience of hearing a good headset for the first time?

A: The first time you hear a good headset, it won’t shock you much, but when you switch back to a regular headset, the shock comes.

52

Q: Is Chrome really power-hungry?

A: Not power-hungry, I’m using Chrome right now, and after using it for so long, my laptop still has 50% battery left.

53

Q: How is the experience of installing Windows on a MacBook?

A: It’s like suddenly having a soft rib and losing armor.

54

Q: What is it like to use all Apple products at home?

A: When someone gets a call, the whole family rings.

55

Q: Why don’t you buy an iPhone X?

A: The contradiction between the ever-increasing demand for a better life and the reality of being poor.

56

Q: Why are some willing to spend thousands on an iPhone but not willing to spend a few dozen on genuine iPhone software and games?

A: Because they can’t download iPhone.

57

Q: Are there any apps with particularly stunning names?

A: Water Meter Assistant… it’s for checking express delivery…

58

Q: Why do you want to buy an external hard drive?

A: When conditions are better, I want to provide a more comfortable living for my women.

59

Q: How to remotely shut down a PC with an iPad?

A: Aim at the PC’s power button and throw it.

60

Q: How to evaluate the Apple conference on September 7, 2016?

A: I watched three conferences in six months for the new MacBook Pro…

61

Q: How to evaluate Internet Explorer?

A: A browser for downloading other browsers. A year later: IE8 and below are so bad that they make front-end developers cry.

62

Q: My parents want me to save money to buy a house, but I want to buy an Apple computer. What should I do?

A: If you can really save 500,000 for a house in three years, is it worth it to spend 17,000 on a computer, big brother?

63

Q: What are some garbage mobile apps?

A: SMS interception software! After intercepting, it tells you it has intercepted a message. I believe 99% of people will go check the intercepted message!

64

Q: What is the most headache-inducing part of making a complete PPT?

A: How to hide my skills from the leader.

65

Q: What can Vim do that Emacs can’t?

A: Help the poor children of Uganda…

66

Q: Why do Apple users choose Apple?

A: Because users who don’t use Apple are not Apple users.

67

Q: What classic rumors exist in the computer world?

A: Windows is connecting to find a solution.

68

Q: Will wired mice be replaced by wireless mice?

A: I think wired mice will not be replaced in internet cafes.

69

Q: What classic rumors exist in the computer world?

A: I have read and agree to the terms.

70

Q: What are the common sayings among computer science students?

A: My computer is running fine…

71

Q: How to view Baidu’s official blog publicly refuting rumors about Li Yanhong’s family matters?

A:

“Chinese people are not that sensitive about privacy and are willing to exchange privacy for convenience.”

——Li Yanhong

72

Q: How to chat with Jack Ma if you meet him on a plane?

A: Hello Jack, my name is Jackson.

73

Q: How to understand what Ma Yun said about houses being as cheap as green onions in eight years?

A:

Hurry up and buy green onions; they are going to rise in price!

74

Q: How to understand what Ma Yun said about “killing landlords does not mean you can become rich”?

A: His point is “Don’t kill me”

75

Q: How to view Baidu’s promise to rectify after the Wei Zexi incident, which has quietly faded the color of the ad hints?

A: Please do not criticize Baidu; I am a front-end developer, and this is just the CSS fading over time.
