When developing crawlers, understanding the basics of the HTTP protocol is crucial, as crawlers communicate with target websites via the HTTP protocol. This chapter will explain the basic concepts of the HTTP protocol, common request methods, status codes, and the structure of requests and responses.
1. What is the HTTP Protocol?
HTTP (HyperText Transfer Protocol) is a protocol used for communication between clients (browsers or crawlers) and servers. Its core characteristics include:
- Request/Response Based: The client sends a request, and the server returns a response.
- Stateless: Each request is independent; the server does not retain state from previous requests.
- Text-Oriented: In HTTP/1.x, request and response messages are expressed as readable text, making them easy to understand and parse.
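Because HTTP itself is stateless, a crawler that needs to stay logged in must carry state (typically cookies) on every request. As a rough sketch (the login and profile URLs and the form fields are placeholders), a requests.Session object handles this automatically:

import requests

# A Session persists cookies across requests,
# compensating for HTTP's statelessness
session = requests.Session()
session.post("https://www.example.com/login",
             data={"username": "admin", "password": "12345"})  # hypothetical endpoint
# Subsequent requests automatically send any cookies set above
response = session.get("https://www.example.com/profile")      # hypothetical endpoint
print(response.status_code)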
2. Basic Workflow of HTTP
The process of HTTP communication includes the following steps:
- Establish Connection: The client (such as a crawler) establishes a TCP connection to the server using its IP address and port.
- Send Request: The client constructs an HTTP request and sends it to the server.
- Process Request: The server processes the request and locates or generates the requested data or resource.
- Return Response: The server returns the result (HTML, JSON, etc.) to the client.
- Close Connection: After the exchange completes, the connection may be closed or kept open (HTTP/1.1 uses persistent connections by default).
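To make these steps concrete, here is a minimal sketch that performs them by hand with Python's standard socket module (in practice, a library such as Requests does all of this for you):

import socket

# Step 1: establish a TCP connection to the server (port 80 for plain HTTP)
sock = socket.create_connection(("example.com", 80))

# Step 2: send a minimal HTTP/1.1 GET request as raw text
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)
sock.sendall(request.encode("ascii"))

# Steps 3-4: the server processes the request and returns a response
response = b""
while chunk := sock.recv(4096):
    response += chunk

# Step 5: close the connection
sock.close()
print(response.decode("utf-8", errors="replace")[:200])  # status line and headers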
3. HTTP Request Methods
HTTP defines a variety of request methods, each serving a different purpose. Crawlers mainly use GET and POST.
3.1 GET Method
- Function: Retrieve resources (such as web pages or images) from the server.
- Characteristics: Request parameters are carried in the URL as a query string, which makes GET suitable for data retrieval.
- Example:
  GET /search?q=python HTTP/1.1
  Host: www.example.com
- Code Practice:
  import requests

  url = "https://www.example.com/search"
  params = {"q": "python"}
  response = requests.get(url, params=params)
  print(response.text)
3.2 POST Method
- Function: Submit data (such as forms or files) to the server.
- Characteristics: Request parameters are carried in the request body rather than the URL, which makes POST suitable for submitting larger or sensitive data.
- Example:
  POST /login HTTP/1.1
  Host: www.example.com
  Content-Type: application/x-www-form-urlencoded

  username=admin&password=12345
- Code Practice:
  import requests

  url = "https://www.example.com/login"
  data = {"username": "admin", "password": "12345"}
  response = requests.post(url, data=data)
  print(response.text)
3.3 Other Common Methods
- PUT: Update a resource on the server.
- DELETE: Delete a resource.
- HEAD: Retrieve only the response headers, without the response body.
- OPTIONS: Query which methods the server supports.
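HEAD is useful in crawlers for inspecting a resource cheaply before downloading it. A small sketch (the file URL is a placeholder):

import requests

# HEAD returns headers only, so this is a cheap way to check
# a file's size and type before deciding to download it
response = requests.head("https://www.example.com/large-file.zip")  # hypothetical URL
print(response.status_code)
print(response.headers.get("Content-Length"))
print(response.headers.get("Content-Type"))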
4. Structure of HTTP Requests and Responses
4.1 Structure of HTTP Requests
An HTTP request consists of the following parts:
- Request Line: Contains the request method, URL, and protocol version.
  GET /index.html HTTP/1.1
- Request Headers: Contain metadata about the request (such as User-Agent and Cookie).
  User-Agent: Mozilla/5.0
- Request Body: Carries the data being submitted (common in POST requests).
  username=admin&password=12345
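You can see all three parts of a request that Requests is about to send by preparing it without sending it:

import requests

req = requests.Request("POST", "https://www.example.com/login",
                       data={"username": "admin", "password": "12345"})
prepared = req.prepare()

print(prepared.method, prepared.url)  # request line components
print(prepared.headers)               # request headers
print(prepared.body)                  # request body: username=admin&password=12345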
4.2 Structure of HTTP Responses
An HTTP response consists of the following parts:
- Status Line: Contains the protocol version, status code, and status description.
  HTTP/1.1 200 OK
- Response Headers: Describe metadata of the returned content (such as Content-Type).
  Content-Type: text/html; charset=UTF-8
- Response Body: The actual data returned (such as HTML or JSON).
  <html><body>Hello, World!</body></html>
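The Response object in Requests exposes each of these parts directly:

import requests

response = requests.get("https://www.example.com")
print(response.status_code)              # status code from the status line, e.g. 200
print(response.headers["Content-Type"])  # a response header
print(response.text[:100])               # beginning of the response body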
5. Common HTTP Status Codes
HTTP status codes are used to describe the result of processing a request. The following are the status codes that crawlers should pay close attention to:
| Status Code | Category | Meaning |
|---|---|---|
| 200 | Success | The request succeeded, and the server returned the requested content. |
| 301/302 | Redirection | The requested resource has moved; the client must follow the redirect to a new address. |
| 403 | Forbidden | The server refuses the request; the crawler may need to disguise its identity or log in. |
| 404 | Not Found | The requested resource does not exist. |
| 429 | Too Many Requests | The client is sending requests too frequently, triggering anti-crawling limits. |
| 500 | Internal Server Error | The server encountered an error and could not process the request. |
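In a crawler, these codes translate directly into control flow. A simplified sketch (the retry policy here is illustrative, not a standard):

import time
import requests

response = requests.get("https://www.example.com")

if response.status_code == 200:
    print("Success:", len(response.text), "bytes")
elif response.status_code == 429:
    # Back off before retrying; honor Retry-After if the server
    # sends it (this assumes the seconds form, not the date form)
    wait = int(response.headers.get("Retry-After", 5))
    time.sleep(wait)
    response = requests.get("https://www.example.com")
elif response.status_code in (403, 404):
    print("Skip this URL:", response.status_code)
else:
    print("Unexpected status:", response.status_code)

Note that Requests follows 301/302 redirects automatically by default, so crawler code usually only sees the final status.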
6. Importance of Headers
Headers are an important part of both HTTP requests and responses. Crawlers often need to customize request headers to disguise themselves and bypass anti-crawling mechanisms.
Common Request Headers
- User-Agent: Identifies the client (such as a browser) making the request.
  headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
  }
  response = requests.get("https://example.com", headers=headers)
- Referer: Indicates the page the request came from; crawlers often set it to disguise the request source.
- Cookie: Carries session information, commonly required on websites that need a login.
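The sketch below combines these headers in one request; the Referer value and the cookie are placeholders for whatever the target site actually expects:

import requests

headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.example.com/",  # pretend we came from the homepage
}
cookies = {"sessionid": "abc123"}            # placeholder session cookie
response = requests.get("https://www.example.com/data",
                        headers=headers, cookies=cookies)
print(response.status_code)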
7. Practice: Scraping Web Page Title
The following code demonstrates how to use Requests together with BeautifulSoup to fetch the title of a web page:
import requests
from bs4 import BeautifulSoup
# Target URL
url = "https://www.example.com"
# Add request headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}
# Send GET request
response = requests.get(url, headers=headers)
# Use BeautifulSoup to parse the HTML ("lxml" needs the lxml package installed; the built-in "html.parser" also works)
soup = BeautifulSoup(response.text, "lxml")
print("Web Page Title:", soup.title.text)
Running the above code retrieves the title of the target web page; an example result:
Web Page Title: Example Domain