Hello everyone, I am your Python learning partner! Today, I want to introduce you to an efficient and fun tool – aiohttp. As an asynchronous HTTP client, it helps us build powerful scraper systems. Whether it's scraping e-commerce product information, monitoring social media activity, or batch downloading image resources, as long as you use aiohttp correctly, these tasks can be accomplished with ease!
Imagine your scraper fetching thousands of web pages as fast as lightning, many times more efficiently than a traditional synchronous scraper that sits idle while waiting for each response. Isn't that cool? Moreover, aiohttp code is very concise: you can get a high-performance scraper running with just a few lines.
Are you ready? Let's take a look at what aiohttp can do!
1. Introduction to Aiohttp: Simple Explanation + Examples
aiohttp is an HTTP client built on asynchronous programming, and its core feature is being asynchronous and non-blocking. In simple terms, a traditional scraper must wait for one web page to finish loading before moving on to the next, whereas aiohttp can have many requests in flight at the same time, processing them efficiently like an assembly line.
What is Asynchronous?
To give an analogy: suppose you are watching a movie and suddenly want popcorn. In "synchronous mode", you have to pause the movie, queue up to buy popcorn, and then come back to continue watching. In "asynchronous mode", you open a food delivery app, order the popcorn, and keep watching while you wait for the delivery: both tasks happen at the same time without delaying each other!
In the field of scraping, asynchronous programming can significantly improve efficiency, especially when a large number of pages need to be fetched. Here is a simple example of using aiohttp to scrape multiple web pages:
import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        print(f"Fetching: {url}")
        return await response.text()

async def main():
    urls = [
        "https://example.com",
        "https://python.org",
        "https://github.com"
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print("Fetching complete!")

# Start the event loop
asyncio.run(main())
Notes
- Asynchronous Environment: aiohttp must run inside an asynchronous environment and is used together with asyncio.
- Resource Management: when using aiohttp.ClientSession, be sure to manage the session with async with, otherwise it may lead to resource leaks (see the sketch below).
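To make the resource-management point concrete: if async with does not fit your code structure for some reason, the session can also be closed by hand. This is just a minimal sketch; in practice, prefer the async with form used throughout this article.

import aiohttp
import asyncio

async def fetch_manually(url):
    # Without "async with", we are responsible for closing the session ourselves
    session = aiohttp.ClientSession()
    try:
        async with session.get(url) as response:
            return await response.text()
    finally:
        # Always close the session, even if the request raised an exception
        await session.close()

asyncio.run(fetch_manually("https://example.com"))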
2. Basic Operations: Step-by-Step Code + Function Explanation
Next, let's learn the basic operations of aiohttp through a simple scraping example.
2.1 Installing Aiohttp
First, make sure you have installed aiohttp. If not, you can get it with the following command:
pip install aiohttp
2.2 Basic Scraper Example
Below is a basic code snippet that shows how to use aiohttp to scrape web page content:
import aiohttp
import asyncio

async def fetch_page(url):
    """Fetch a single web page's content"""
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            print(f"Status code: {response.status}")
            html = await response.text()
            return html

# Test code
async def main():
    url = "https://www.example.com"
    content = await fetch_page(url)
    print(content[:500])  # Print the first 500 characters

# Start the asynchronous scraper
asyncio.run(main())
Code Explanation
- aiohttp.ClientSession: manages the HTTP session, supports connection reuse, and improves efficiency.
- session.get(url): sends a GET request to fetch the web page.
- response.text(): reads the response body as text, i.e. the HTML of the page.
Tips
- Status Code Check: checking the status code (e.g., 200) tells us whether the request was successful.
- Exception Handling: network requests may time out or raise other exceptions, so it's recommended to wrap them in try-except to catch errors (a sketch follows below).
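Here is a minimal sketch combining both tips: it checks the status code and catches the common request errors. The URL is just a placeholder.

import aiohttp
import asyncio

async def safe_fetch(url):
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                if response.status == 200:
                    return await response.text()
                print(f"Request failed with status code: {response.status}")
    except asyncio.TimeoutError:
        print(f"Timed out while fetching {url}")
    except aiohttp.ClientError as e:
        print(f"Request error for {url}: {e}")
    return None

asyncio.run(safe_fetch("https://www.example.com"))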
3. Function Extensions: Detailed Functions + Notes
aiohttp is not limited to simple page scraping; it also supports many advanced features. Below are several commonly used functions and how to implement them.
3.1 Function 1: Setting Request Headers
When scraping web pages, you sometimes need to make the request look like it comes from a browser. We can achieve this by setting request headers:
async def fetch_with_headers(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as response:
            return await response.text()
Notes: Some websites may restrict frequent access, so it’s advisable to set reasonable request intervals to avoid being banned.
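One simple way to keep the request rate reasonable is to cap concurrency with a semaphore and pause briefly between requests. The limits below are arbitrary example values, not recommendations for any particular site:

import aiohttp
import asyncio

async def polite_fetch(session, semaphore, url):
    # The semaphore caps how many requests are in flight at the same time
    async with semaphore:
        async with session.get(url) as response:
            html = await response.text()
        # A short pause before releasing the slot spreads requests out
        await asyncio.sleep(1)
        return html

async def main():
    urls = ["https://example.com", "https://python.org", "https://github.com"]
    semaphore = asyncio.Semaphore(2)  # at most 2 concurrent requests (example value)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(polite_fetch(session, semaphore, url) for url in urls))
        print(f"Fetched {len(results)} pages politely")

asyncio.run(main())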
3.2 Function 2: Asynchronous File Download
With aiohttp, we can also quickly download files such as images or videos:
async def download_file(url, filename):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            with open(filename, 'wb') as f:
                while chunk := await response.content.read(1024):
                    f.write(chunk)
    print(f"{filename} download complete!")
Summary: when downloading large files, read them in chunks (e.g., response.content.read(1024)) to avoid using too much memory.
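As a usage sketch, several downloads can run concurrently with asyncio.gather, reusing the download_file function defined above. The URLs and filenames below are made-up placeholders.

import asyncio

async def main():
    # Placeholder URLs and filenames, just to show the call pattern
    files = {
        "https://example.com/a.jpg": "a.jpg",
        "https://example.com/b.jpg": "b.jpg",
    }
    await asyncio.gather(*(download_file(url, name) for url, name in files.items()))

asyncio.run(main())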
4. Advanced Applications: Combining with Real Cases
Practical Case: Batch Fetching Titles from Multiple Web Pages
Next, we will build a simple scraper system using aiohttp to extract the title (the <title> tag content) from multiple web pages.
from bs4 import BeautifulSoup
import aiohttp
import asyncio

async def fetch_title(session, url):
    async with session.get(url) as response:
        html = await response.text()
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.title.string if soup.title else "No Title"
        print(f"The title of {url} is: {title}")
        return title

async def main():
    urls = [
        "https://example.com",
        "https://python.org",
        "https://github.com"
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_title(session, url) for url in urls]
        titles = await asyncio.gather(*tasks)
        print("All titles fetched successfully!", titles)

asyncio.run(main())
Exercise
Try modifying the code to scrape all links (<a href="...">) from each page and save them to a local file.
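If you get stuck, here is one possible starting point. It reuses the session-based pattern and the BeautifulSoup import from the case above; the output filename links.txt is just an example.

async def fetch_links(session, url):
    async with session.get(url) as response:
        html = await response.text()
    soup = BeautifulSoup(html, 'html.parser')
    # Collect the href attribute of every <a> tag that has one
    links = [a["href"] for a in soup.find_all("a", href=True)]
    with open("links.txt", "a", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")
    return links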
Practical Tips
- Asynchronous Non-Blocking: use asyncio to improve scraper efficiency.
- Session Management: manage requests through aiohttp.ClientSession to enhance performance.
- Exception Handling: handle timeouts and errors during scraping to keep the program robust (see the timeout sketch below).
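To round off the exception-handling tip: aiohttp also lets you set a timeout for the whole session, so every request made with it inherits the limit. A minimal sketch, with purely illustrative timeout values:

import aiohttp
import asyncio

async def main():
    # A session-wide timeout: every request made with this session inherits it
    timeout = aiohttp.ClientTimeout(total=15, connect=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get("https://example.com") as response:
                print(f"Status code: {response.status}")
        except asyncio.TimeoutError:
            print("The request took too long and was cancelled")

asyncio.run(main())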
Today we learned how to build a high-performance asynchronous scraper system with aiohttp. From basic operations to advanced applications, I believe you can already feel its power. Give it a try! If you run into any problems, feel free to leave a message for discussion, and let's improve together!