When developing crawlers, understanding the basics of the HTTP protocol is crucial, as crawlers communicate with target websites via the HTTP protocol. This chapter will explain the basic concepts of the HTTP protocol, common request methods, status codes, and the structure of requests and responses.
1. What is the HTTP Protocol?
HTTP (HyperText Transfer Protocol) is a protocol used for communication between clients (browsers or crawlers) and servers. Its core characteristics include:
- Request/Response Based: The client sends a request, and the server returns a response.
- Stateless: Each request is independent; the server does not retain state from previous requests.
- Text-Oriented: In HTTP/1.x, request and response messages are expressed as readable text, making them easy to understand and parse.
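Because HTTP itself is stateless, a crawler that needs to stay logged in must carry state (typically cookies) on every request. As a rough sketch (the login and profile URLs and the form fields are placeholders), a requests.Session object handles this automatically:

import requests

# A Session persists cookies across requests,
# compensating for HTTP's statelessness
session = requests.Session()
session.post("https://www.example.com/login",
             data={"username": "admin", "password": "12345"})  # hypothetical endpoint
# Subsequent requests automatically send any cookies set above
response = session.get("https://www.example.com/profile")      # hypothetical endpoint
print(response.status_code)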
2. Basic Workflow of HTTP
The process of HTTP communication includes the following steps:
- Establish Connection: The client (such as a crawler) establishes a TCP connection to the server using its IP address and port.
- Send Request: The client constructs an HTTP request and sends it to the server.
- Process Request: The server processes the request and locates or generates the requested data or resource.
- Return Response: The server returns the result (HTML, JSON, etc.) to the client.
- Close Connection: After the exchange completes, the connection may be closed or kept open (HTTP/1.1 uses persistent connections by default).
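To make these steps concrete, here is a minimal sketch that performs them by hand with Python's standard socket module (in practice, a library such as Requests does all of this for you):

import socket

# Step 1: establish a TCP connection to the server (port 80 for plain HTTP)
sock = socket.create_connection(("example.com", 80))

# Step 2: send a minimal HTTP/1.1 GET request as raw text
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)
sock.sendall(request.encode("ascii"))

# Steps 3-4: the server processes the request and returns a response
response = b""
while chunk := sock.recv(4096):
    response += chunk

# Step 5: close the connection
sock.close()
print(response.decode("utf-8", errors="replace")[:200])  # status line and headers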
3. HTTP Request Methods
HTTP defines a variety of request methods, each serving a different purpose. Crawlers mainly use GET and POST.
3.1 GET Method
- Function: Retrieve resources (such as web pages or images) from the server.
- Characteristics: Request parameters are carried in the URL as a query string, which makes GET suitable for data retrieval.
- Example:
  GET /search?q=python HTTP/1.1
  Host: www.example.com
- Code Practice:
  import requests

  url = "https://www.example.com/search"
  params = {"q": "python"}
  response = requests.get(url, params=params)
  print(response.text)
3.2 POST Method
- Function: Submit data (such as forms or files) to the server.
- Characteristics: Request parameters are carried in the request body rather than the URL, which makes POST suitable for submitting larger or sensitive data.
- Example:
  POST /login HTTP/1.1
  Host: www.example.com
  Content-Type: application/x-www-form-urlencoded

  username=admin&password=12345
- Code Practice:
  import requests

  url = "https://www.example.com/login"
  data = {"username": "admin", "password": "12345"}
  response = requests.post(url, data=data)
  print(response.text)
3.3 Other Common Methods
- PUT: Update a resource on the server.
- DELETE: Delete a resource.
- HEAD: Retrieve only the response headers, without the response body.
- OPTIONS: Query which methods the server supports.
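HEAD is useful in crawlers for inspecting a resource cheaply before downloading it. A small sketch (the file URL is a placeholder):

import requests

# HEAD returns headers only, so this is a cheap way to check
# a file's size and type before deciding to download it
response = requests.head("https://www.example.com/large-file.zip")  # hypothetical URL
print(response.status_code)
print(response.headers.get("Content-Length"))
print(response.headers.get("Content-Type"))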
4. Structure of HTTP Requests and Responses
4.1 Structure of HTTP Requests
An HTTP request consists of the following parts:
- Request Line: Contains the request method, URL, and protocol version.
  GET /index.html HTTP/1.1
- Request Headers: Contain metadata about the request (such as User-Agent and Cookie).
  User-Agent: Mozilla/5.0
- Request Body: Carries the data being submitted (common in POST requests).
  username=admin&password=12345
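You can see all three parts of a request that Requests is about to send by preparing it without sending it:

import requests

req = requests.Request("POST", "https://www.example.com/login",
                       data={"username": "admin", "password": "12345"})
prepared = req.prepare()

print(prepared.method, prepared.url)  # request line components
print(prepared.headers)               # request headers
print(prepared.body)                  # request body: username=admin&password=12345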
4.2 Structure of HTTP Responses
An HTTP response consists of the following parts:
- Status Line: Contains the protocol version, status code, and status description.
  HTTP/1.1 200 OK
- Response Headers: Describe metadata of the returned content (such as Content-Type).
  Content-Type: text/html; charset=UTF-8
- Response Body: The actual data returned (such as HTML or JSON).
  <html><body>Hello, World!</body></html>
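The Response object in Requests exposes each of these parts directly:

import requests

response = requests.get("https://www.example.com")
print(response.status_code)              # status code from the status line, e.g. 200
print(response.headers["Content-Type"])  # a response header
print(response.text[:100])               # beginning of the response body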
5. Common HTTP Status Codes
HTTP status codes are used to describe the result of processing a request. The following are the status codes that crawlers should pay close attention to:
| Status Code | Category | Meaning |
|---|---|---|
| 200 | Success | The request succeeded, and the server returned the requested content. |
| 301/302 | Redirection | The requested resource has moved; the client must follow the redirect to a new address. |
| 403 | Forbidden | The server refuses the request; the crawler may need to disguise its identity or log in. |
| 404 | Not Found | The requested resource does not exist. |
| 429 | Too Many Requests | The client is sending requests too frequently, triggering anti-crawling limits. |
| 500 | Internal Server Error | The server encountered an error and could not process the request. |
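In a crawler, these codes translate directly into control flow. A simplified sketch (the retry policy here is illustrative, not a standard):

import time
import requests

response = requests.get("https://www.example.com")

if response.status_code == 200:
    print("Success:", len(response.text), "bytes")
elif response.status_code == 429:
    # Back off before retrying; honor Retry-After if the server
    # sends it (this assumes the seconds form, not the date form)
    wait = int(response.headers.get("Retry-After", 5))
    time.sleep(wait)
    response = requests.get("https://www.example.com")
elif response.status_code in (403, 404):
    print("Skip this URL:", response.status_code)
else:
    print("Unexpected status:", response.status_code)

Note that Requests follows 301/302 redirects automatically by default, so crawler code usually only sees the final status.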
6. Importance of Headers
Headers are an important part of both HTTP requests and responses. Crawlers often need to customize request headers to disguise themselves and bypass anti-crawling mechanisms.
Common Request Headers
- User-Agent: Identifies the client (such as a browser) making the request.
  headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
  }
  response = requests.get("https://example.com", headers=headers)
- Referer: Indicates the page the request came from; crawlers often set it to disguise the request source.
- Cookie: Carries session information, commonly required on websites that need a login.
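The sketch below combines these headers in one request; the Referer value and the cookie are placeholders for whatever the target site actually expects:

import requests

headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://www.example.com/",  # pretend we came from the homepage
}
cookies = {"sessionid": "abc123"}            # placeholder session cookie
response = requests.get("https://www.example.com/data",
                        headers=headers, cookies=cookies)
print(response.status_code)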
7. Practice: Scraping Web Page Title
The following code demonstrates how to use Requests together with BeautifulSoup to fetch the title of a web page:
import requests
from bs4 import BeautifulSoup
# Target URL
url = "https://www.example.com"
# Add request headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}
# Send GET request
response = requests.get(url, headers=headers)
# Use BeautifulSoup to parse the HTML ("lxml" needs the lxml package installed; the built-in "html.parser" also works)
soup = BeautifulSoup(response.text, "lxml")
print("Web Page Title:", soup.title.text)
Running the above code retrieves the title of the target web page; an example result:
Web Page Title: Example Domain