Introduction to Python Web Scraping: HTTP Protocol and Chrome DevTools

Learning knowledge is like building a house; without a good foundation, the house cannot withstand the wind and rain. If the foundation is built strong enough, even a typhoon will not shake it.
We did not start by explaining how to use web scraping; instead, we talked about some basic content. Once everyone masters this content, we will find it much smoother and clearer to learn practical web scraping later.

HTTP Protocol

When we mention web scraping, we cannot avoid discussing the HTTP protocol. So what is the HTTP protocol?
HTTP (HyperText Transfer Protocol) is an application-layer protocol that runs on top of TCP. In simple terms, it is a set of rules for transmitting data between clients and servers.
  • Hypertext: text that goes beyond plain text, including images, audio, video, and other files.
  • Transfer protocol: an agreed-upon fixed format in which the hypertext content, converted into strings, is transmitted between the two sides.
  • Default Port Number: 80
Moreover, HTTP is a stateless protocol; it does not persist the state of communication for requests and responses. This design aims to maintain the simplicity of the HTTP protocol, allowing it to process a large number of transactions quickly and efficiently.
Sometimes we also use the HTTPS protocol, which is actually HTTP + SSL (Secure Sockets Layer), meaning HTTP with a secure sockets layer. The default port number is 443.
  • SSL encrypts the content being transmitted (hypertext, which is the request body or response body).
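To make the default ports concrete: when a URL does not name a port explicitly, the scheme implies it, 80 for http and 443 for https. A minimal sketch using Python's standard `urllib.parse` (the helper function name is made up for illustration):

```python
from urllib.parse import urlsplit

def effective_port(url: str) -> int:
    """Return the port a request to this URL would actually use."""
    parts = urlsplit(url)
    if parts.port is not None:       # an explicit port wins
        return parts.port
    # otherwise fall back to the scheme's default
    return {"http": 80, "https": 443}[parts.scheme]

print(effective_port("http://www.example.com/index.html"))   # 80
print(effective_port("https://www.example.com/index.html"))  # 443
print(effective_port("https://www.example.com:8443/"))       # 8443
```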

Requests

Each request in the HTTP protocol carries the following content, such as the request method, request path, and protocol version, which we call the request line.
There are also fields in the format of name: value, which we call the request headers.
The last part is the request body.
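To see that three-part format concretely, here is a hypothetical raw request, exactly as it would travel over the TCP connection, taken apart in Python: the request line, then the Name: value headers, then a blank line, then the body:

```python
# A raw HTTP request is plain text: request line, headers, blank line, body.
raw_request = (
    "POST /login HTTP/1.1\r\n"        # request line: method, path, version
    "Host: www.example.com\r\n"       # request headers (Name: value)
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 27\r\n"
    "\r\n"                            # blank line separates headers from body
    "username=alice&password=123"     # request body
)

request_line, rest = raw_request.split("\r\n", 1)
header_block, body = rest.split("\r\n\r\n", 1)
method, path, version = request_line.split()
print(method, path, version)  # POST /login HTTP/1.1
print(body)                   # username=alice&password=123
```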
[Figure: the structure of an HTTP request – request line, request headers, request body]
Let’s take a look at the request situation in the browser’s developer tools under the Network tab:
[Screenshot: a request inspected under the Network tab of Chrome DevTools]
We can see that when the browser makes a request, it also carries the request line, request headers, and other content.
The most important parts of the request line are the URL and the request method. What are the request methods?
The methods in the HTTP protocol are:
GET: retrieve the resource identified by the Request-URI.
POST: submit new data to the resource identified by the Request-URI.
HEAD: retrieve only the response headers for the resource identified by the Request-URI.
PUT: ask the server to store or modify a resource under the Request-URI.
DELETE: ask the server to delete the resource identified by the Request-URI.
TRACE: ask the server to echo back the request it received, mainly for testing or diagnostics.
CONNECT: reserved for use with proxies that can switch to tunneling (e.g., for HTTPS).
OPTIONS: query the server's capabilities, or the options supported for a resource.
By default, GET requests are used unless otherwise specified. POST requests are also commonly used. The other methods mentioned above are mostly involved in API interface programming. So remember mainly two: GET and POST.
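The practical difference between the two is where the parameters travel: a GET puts them in the URL's query string, while a POST carries them in the request body. A small sketch using Python's standard `urllib` (no network traffic is sent here; the URL is a placeholder):

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "python", "page": "1"}

# GET: parameters are appended to the URL as a query string.
get_req = Request("https://www.example.com/search?" + urlencode(params))

# POST: the same parameters travel in the request body instead.
post_req = Request("https://www.example.com/search",
                   data=urlencode(params).encode("utf-8"))

print(get_req.get_method(), get_req.full_url)   # GET  ...?q=python&page=1
print(post_req.get_method(), post_req.data)     # POST b'q=python&page=1'
```

Note that `urllib` picks the method from whether `data` is present; real scrapers often use the third-party `requests` library, but the GET/POST distinction is the same.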

Request Headers Particularly Important for Web Scraping

Whether in a browser or a web scraper, when making a request, it is necessary to comply with the HTTP protocol, which means carrying the request headers.
What are the request headers in a browser? You can also check by clicking on a link in the Network tab:

[Screenshot: request headers shown in the Network tab]

Do we need to include all these request headers when making a request? No, because each website has different request headers, and they are not standardized. Therefore, web scrapers pay special attention to the following request header fields:
  • Content-Type
  • Host (host and port number)
  • Connection (connection type)
  • Upgrade-Insecure-Requests (upgrade to HTTPS request)
  • User-Agent (browser name)
  • Referer (the page where the request originated)
  • Cookie (cookie)
  • Authorization (authentication credentials for protected resources, e.g., the JWT tokens used in earlier web-development lessons)
Request headers such as User-Agent, Referer, and Cookie are the ones servers most frequently use to identify web scrapers, so they matter more than the rest. That said, the others are not unimportant: some site administrators or developers use less common request headers to detect scrapers.
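A typical scraper attaches these headers explicitly to look like a browser. A minimal sketch with the standard `urllib.request` (the URL, cookie value, and User-Agent string are all illustrative, copied-from-a-browser style):

```python
from urllib.request import Request

# Hypothetical headers -- in practice you copy these from a real
# browser session in the DevTools Network tab.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://www.example.com/",
    "Cookie": "sessionid=abc123",
}

req = Request("https://www.example.com/data", headers=headers)

# urllib normalizes header names to Capitalized-lowercase form.
print(req.get_header("User-agent"))
```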

Responses

There can be no complete HTTP protocol process without requests and responses. So what is the format of the returned response content?
[Figure: the structure of an HTTP response – status line, response headers, response body]
Each response will contain the content shown in the image above: status code, protocol version, response headers, response body, etc.
To visualize the response better, we can also check it through the browser’s Network tab. We can see the status code, remote host IP, response headers, etc.
[Screenshot: response status code, remote host IP, and response headers in the Network tab]
Another major component, the response body, is not shown there; click the Response tab, and the content displayed below it is the response body.
[Screenshot: the response body under the Response tab]
Everyone should also learn to check the status code because not all requests return a status code of 200; there can also be 304, 500, 404, etc.
Status codes consist of three digits; the first digit defines the category of the response and has five possible values:
  • 1xx: Informational – the request has been received and processing continues.
  • 2xx: Success – the request was successfully received, understood, and accepted.
  • 3xx: Redirection – further action is needed to complete the request.
  • 4xx: Client Error – the request has a syntax error or cannot be fulfilled.
  • 5xx: Server Error – the server failed to fulfill a legitimate request.
Common status codes:
200: OK – the client request succeeded.
400: Bad Request – the request has syntax errors and cannot be understood by the server.
401: Unauthorized – the request is unauthorized; this status code must be used together with the WWW-Authenticate header field.
403: Forbidden – the server received the request but refuses to serve it.
404: Not Found – the requested resource does not exist, e.g., a mistyped URL.
500: Internal Server Error – the server encountered an unexpected error.
503: Service Unavailable – the server cannot currently handle the request; it may recover after some time.
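Since the first digit alone determines the category, a scraper can classify any status code with integer division. A small illustrative helper:

```python
# Map the leading digit of a status code to its category name.
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_category(code: int) -> str:
    """Classify a three-digit HTTP status code by its first digit."""
    return CATEGORIES[code // 100]

print(status_category(200))  # Success
print(status_category(404))  # Client Error
print(status_category(503))  # Server Error
```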

Thus, when the browser makes a request, the process is as follows:

[Figure: the browser's request process]

Process of HTTP requests:

1. Whenever we visit a website, we type its domain name into the browser’s address bar, e.g., http://www.baidu.com, because domain names are easier to remember – nobody memorizes Baidu’s IP address. The lookup is handled by a DNS (domain name resolution) server, which finds the IP address matching the entered domain name and returns it to the browser.

2. After the browser obtains the IP corresponding to the domain name, it begins to make requests and obtain responses.

3. The returned response content (HTML) will include URLs for CSS, JS, images, etc., as well as AJAX code. The browser will sequentially send other requests based on the order in the response content and obtain corresponding responses.

4. The browser displays (loads) the result of each response as it arrives. CSS and JS can modify the page content, and JS can also send further requests to obtain more responses.

5. From obtaining the first response and displaying it in the browser until all responses are finally obtained, this process is called browser rendering.
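Step 3 above is worth sketching: given an HTML response body, these are the sub-resource URLs a browser (or a thorough scraper) would request next. A minimal example using Python's standard `html.parser` on a made-up page:

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect the URLs of sub-resources referenced by a page."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts are referenced via src, stylesheets via href.
        if tag in ("img", "script") and "src" in attrs:
            self.urls.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and "href" in attrs:
            self.urls.append(attrs["href"])

# A hypothetical HTML response body, like the one returned in step 2.
html = (
    '<html><head>'
    '<link rel="stylesheet" href="/static/site.css">'
    '<script src="/static/app.js"></script>'
    '</head><body><img src="/logo.png"></body></html>'
)

collector = ResourceCollector()
collector.feed(html)
print(collector.urls)  # ['/static/site.css', '/static/app.js', '/logo.png']
```

The browser then fetches each of these URLs with its own request, in order, which is exactly the loop described in steps 3 and 4.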

After discussing so much, what is the relationship between HTTP and the web scraping we are about to discuss? Please see the image below:
[Figure: the web scraping workflow, with the request-and-response steps highlighted in red]
This is the process of web scraping, and the HTTP protocol is a crucial part of this process, specifically the part about sending requests and obtaining responses.
To clarify, the request-and-response part of that workflow is work the browser normally does; when we scrape, we write a program that simulates the browser performing exactly that part of the process.
Next time, we will introduce regular expressions and their use in web scraping.
