Recently I have become fascinated with web scraping. After studying it, I found that the basic process looks like this:

1. Analyze the interface, page, request parameters, and data rendering logic
2. Initiate requests to the interface and page using HTTP tools
3. After receiving the response, filter the data from the response
4. Store the filtered data
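The four steps above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class name `ScrapePipeline`, the `<h3>` pattern, and the stubbed fetcher are all made up for the example, and the real request in step 2 would be made by whatever HTTP tool you prefer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the four steps; all names are illustrative only.
class ScrapePipeline {

    // Step 1 (analysis) is encoded as a pattern: we assume the data we want
    // is the text inside <h3> tags of the response.
    private static final Pattern TITLE = Pattern.compile("<h3>(.*?)</h3>");

    // Step 3: filter the wanted data out of the raw response body.
    static List<String> filter(String responseBody) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(responseBody);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }

    // Steps 2 to 4 chained together: request, filter, store.
    // The fetcher is passed in so that any HTTP tool (or a stub) can be used.
    static List<String> scrape(String url, Function<String, String> fetcher, List<String> store) {
        String body = fetcher.apply(url); // Step 2: send the request
        store.addAll(filter(body));       // Step 3, then Step 4: filter and store
        return store;
    }

    public static void main(String[] args) {
        // A stub fetcher stands in for a real HTTP call.
        Function<String, String> stub =
                url -> "<h3> Java Engineer </h3><p>x</p><h3>Backend Developer</h3>";
        List<String> store = scrape("https://example.com/jobs", stub, new ArrayList<>());
        System.out.println(store); // prints [Java Engineer, Backend Developer]
    }
}
```

Passing the fetcher in as a function keeps the filtering and storing logic independent of the HTTP tool, which matters once the tool changes, as it does later in this article.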
As I went deeper, I found HttpClient inconvenient for scraping in some cases, for example when a website requires login before it can be accessed. To simulate a login with HttpClient, you first have to log in yourself, obtain the logged-in user's cookie, and set it in the Cookie header of the request. Clearly, this is not a very good solution.
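To make the cookie workaround concrete, here is a minimal sketch. It uses the JDK's built-in `HttpURLConnection` rather than HttpClient so that it has no dependencies, and the cookie names (`JSESSIONID`, `user_token`), their values, and the URL are invented for illustration; a real site will use its own.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: reuse cookies copied from a manually logged-in browser.
class CookieRequest {

    // Join name=value pairs into a single Cookie header value.
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            joiner.add(e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    // Prepare (but do not send) a GET request that carries the copied cookies.
    static HttpURLConnection prepare(String url, Map<String, String> cookies) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Cookie", toCookieHeader(cookies));
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values; in practice they are copied from the browser by hand.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("user_token", "xyz");
        System.out.println(toCookieHeader(cookies));
        // Nothing is sent until the response is actually read.
        HttpURLConnection conn = prepare("https://example.com/", cookies);
        System.out.println("Prepared request for " + conn.getURL());
    }
}
```

The weakness is visible in `main`: the cookie values have to be copied by hand and go stale whenever the session expires.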
While learning about web scraping, I discovered a very interesting framework called Selenium. Anyone who does automated testing should be familiar with it. I wrote a small example with Selenium (shown as a video in the original post); let's look at how it is implemented.
Selenium is a web testing tool that drives a real browser directly and operates it much as a human would. It supports the mainstream browsers, and by writing programs against it you can automate repetitive operations on the web and let the program do the work for you. Of course, Selenium is not primarily designed for web scraping; this article simply uses it to build a simple scraper.
Since Selenium drives the browser, the first step is to install the browser driver. I am using Google Chrome, and the driver version must match the browser version; otherwise, exceptions will occur.
Chrome driver download address:
http://npm.taobao.org/mirrors/chromedriver/
Firefox driver download address:
https://github.com/mozilla/geckodriver/releases
IE driver download address:
http://selenium-release.storage.googleapis.com/index.html
My Chrome version is 80.0.3987.149, so I chose the 80.0.3987.16 Chrome driver. Download the version that matches your own browser and system environment.
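A quick way to sanity-check a browser/driver pair is to compare version prefixes. The sketch below assumes the rule that the first three components (major.minor.build) should agree, which is how the pair above matches; the class and method names are hypothetical.

```java
// Hypothetical helper: checks whether a driver version looks compatible with a
// browser version, assuming the first three components must be identical.
class DriverVersionCheck {

    // Return the "major.minor.build" prefix of a dotted version string.
    static String buildPrefix(String version) {
        String[] parts = version.trim().split("\\.");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Unexpected version: " + version);
        }
        return parts[0] + "." + parts[1] + "." + parts[2];
    }

    static boolean isCompatible(String browserVersion, String driverVersion) {
        return buildPrefix(browserVersion).equals(buildPrefix(driverVersion));
    }

    public static void main(String[] args) {
        // The pair used in this article.
        System.out.println(isCompatible("80.0.3987.149", "80.0.3987.16")); // prints true
        // A made-up driver from a different release line.
        System.out.println(isCompatible("80.0.3987.149", "81.0.4044.20")); // prints false
    }
}
```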
Requirements:
1. Implement automatic login to Lagou.com (the captcha must be solved manually)
2. Automatically search for and filter the specified positions
3. Capture the filtered job information and handle pagination
Project environment: Maven 3.3.1 + JDK 1.8 + IDEA 2019.1
Maven dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.java134</groupId>
    <artifactId>selenium_Lagou</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-server</artifactId>
            <version>3.141.59</version>
        </dependency>
    </dependencies>
</project>
1) Place the downloaded Chrome driver in the project root directory (you can also put it in a custom location and load the driver from there when the project starts)
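If the driver lives in the project root, its path can be built without hard-coding a path separator. The following is a small sketch; `DriverLocator` and its methods are hypothetical names, and the assumption is that the file is called `chromedriver.exe` on Windows and `chromedriver` elsewhere.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper that builds the driver path in an OS-independent way.
class DriverLocator {

    // Assumption: the driver is "chromedriver.exe" on Windows, "chromedriver" otherwise.
    static String driverFileName(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "chromedriver.exe" : "chromedriver";
    }

    // Resolve the driver file inside the given base directory.
    static Path driverPath(String baseDir, String osName) {
        return Paths.get(baseDir, driverFileName(osName));
    }

    public static void main(String[] args) {
        // "user.dir" is the directory the JVM was started from (the project root in an IDE run).
        Path driver = driverPath(System.getProperty("user.dir"), System.getProperty("os.name"));
        // Selenium reads this system property to find the Chrome driver executable.
        System.setProperty("webdriver.chrome.driver", driver.toString());
        System.out.println("Driver path: " + driver);
    }
}
```

Using `Paths.get` avoids writing `\` or `/` by hand, so the same code works on Windows, macOS, and Linux.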
2) Implementing Requirement 1
First, analyze the Lagou.com login page at https://passport.lagou.com/login/login.html
Then analyze the structure of the username, password, and login button on Lagou.com
Code Implementation:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

/**
 * @ClassName LagouJob
 * @Description
 * @Author Public Account: Java Resource Community - Programmer Xiao Long
 * @Date 2020/3/27 16:51
 **/
public class LagouJob {
    public static void run() throws InterruptedException {
        // Set the Chrome driver system property (note the escaped backslash in the path)
        System.setProperty("webdriver.chrome.driver", System.getProperty("user.dir") + "\\chromedriver.exe");
        // Create the browser driver instance
        WebDriver driver = new ChromeDriver();
        // Open the specified URL (the Lagou.com login page)
        driver.get("https://passport.lagou.com/login/login.html");
        // Locate the username field
        WebElement userName = driver.findElement(By.xpath("//div[@data-propertyname='username']/input"));
        // Input the phone number
        userName.sendKeys("1767xxxxxxx");
        // Locate the password field
        WebElement userPwd = driver.findElement(By.xpath("//div[@data-propertyname='password']/input"));
        // Input the password
        userPwd.sendKeys("xxxxxxxx");
        // Locate the submit button
        WebElement btnSubmit = driver.findElement(By.xpath("//div[@data-propertyname='submit']/input"));
        // Click the submit button
        btnSubmit.click();
    }
}
API Introduction:
1. WebDriver: represents a browser instance; a different implementation is created for each browser driver
// Choose the one implementation that matches your browser and driver
WebDriver driver = new ChromeDriver();           // Chrome browser
WebDriver driver = new FirefoxDriver();          // Firefox browser
WebDriver driver = new EdgeDriver();             // Edge browser
WebDriver driver = new InternetExplorerDriver(); // Internet Explorer browser
WebDriver driver = new OperaDriver();            // Opera browser
WebDriver driver = new PhantomJSDriver();        // PhantomJS
2. get: a WebDriver method that opens the specified URL in the browser
3. findElement: a WebDriver method that finds an element on the page; it is used together with By
4. By: used to locate elements on the page, similar to jQuery selectors; it provides 8 locating methods
- id
- name
- class name
- tag name
- link text
- partial link text
- xpath
- css selector
Corresponding API methods:
// Only a few commonly used locators are introduced here. If interested, you can learn more
findElement(By.id("id"))            // Locate by the element's id, similar to $("#id")
findElement(By.name("name"))        // Locate by the element's name
findElement(By.className("class"))  // Locate by the element's class name, similar to $(".class")

// XPath locating is divided into two types: 1. absolute positioning 2. relative positioning
// Absolute positioning: the index starts from 1 and distinguishes same-named elements at the same level
findElement(By.xpath("/html/body/div[1]/a[2]"))
// Relative positioning: @ selects an attribute value, which can also be a class name, a custom attribute, etc.
findElement(By.xpath("//a[@href='www.baidu.com']"))

findElement(By.tagName())
findElement(By.linkText())
findElement(By.partialLinkText())
findElement(By.cssSelector())
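The difference between absolute and relative XPath positioning can be tried out without a browser. The sketch below does not use Selenium: it evaluates the same style of expressions with the JDK's built-in `javax.xml.xpath` engine against a tiny page invented for the example, so it only illustrates the expression syntax, not how `By.xpath` behaves on a real site.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo of XPath syntax using the JDK's XPath engine (not Selenium).
class XPathDemo {

    // A tiny, well-formed page made up for this example.
    static final String PAGE =
            "<html><body>"
          + "<div><a href='www.example.com'>first</a><a href='www.baidu.com'>second</a></div>"
          + "<div><a href='www.baidu.com'>third</a></div>"
          + "</body></html>";

    // Text of the first node matched by the expression (empty string if none).
    static String first(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(PAGE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    // Number of nodes matched by the expression.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(PAGE)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Absolute positioning: indexes are 1-based, so a[2] is the second link in the first div.
        System.out.println(first("/html/body/div[1]/a[2]"));        // prints second
        // Relative positioning: match by attribute value anywhere in the document.
        System.out.println(count("//a[@href='www.baidu.com']"));    // prints 2
    }
}
```

Absolute paths break as soon as the page layout shifts, which is why the login code in this article uses relative expressions keyed on an attribute.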
5. WebElement: used to receive the elements found by findElement
6. sendKeys: a method under WebElement, used to input specified characters into a text box
7. click: a method under WebElement, used to trigger the click event of an element
3) Implementing Requirement 2
After a successful login, the browser redirects to the Lagou homepage. Since our requirement is to search for positions by condition, we first analyze the search entry on the homepage. The flow is the same as the login process above: locate the text box -> input the search condition -> click the search button.
Issue 1: The login may trigger a human-verification captcha, which Selenium currently cannot handle well, so we make the thread sleep for 10 seconds to leave time for solving the captcha manually and for the login redirect.
Issue 2: After the search redirect, an advertisement modal may appear. If the modal is not closed, it interferes with the program's next operation, so we need to check whether the modal exists and close it if it does.
Requirement 2 Search Implementation Code
// Omit Requirement 1 implementation code...
// Continue with the Requirement 1 code block
// Solve Issue 1: pause for 10 seconds to allow time for solving the captcha manually and for the login redirect
Thread.sleep(10000);
// Input the job title
driver.findElement(By.id("search_input")).sendKeys("Java");
// Click the search button
driver.findElement(By.id("search_button")).click();
// Solve Issue 2: check whether the advertisement modal exists, and close it if it does.
// findElements returns an empty list when nothing matches, whereas findElement would throw
// NoSuchElementException (requires import java.util.List)
List<WebElement> modals = driver.findElements(By.className("body-container"));
if (!modals.isEmpty() && modals.get(0).isDisplayed()) {
    driver.findElement(By.className("body-btn")).click();
}
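A fixed 10-second sleep always waits the full time, even if the captcha is solved in two seconds. A common alternative is to poll a condition until it holds or a timeout passes. The sketch below shows that idea in plain Java; `PollingWait` is a hypothetical helper, and the condition in `main` is only a stand-in for something like "the search box is now present". Selenium ships its own `WebDriverWait` for this purpose, which would be the natural choice in the real program.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating "wait until a condition holds" instead of a fixed sleep.
class PollingWait {

    // Poll the condition every pollMillis until it is true or timeoutMillis has elapsed.
    // Returns true if the condition became true in time, false otherwise.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after about 200 ms.
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 10_000, 50);
        System.out.println("Condition met: " + ready);
    }
}
```

With this shape, the 10 seconds becomes an upper bound rather than a fixed cost.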
After the search redirect, we analyze the filter condition module. Inspecting the page source shows that to select a filter condition, we must click the text of the corresponding condition.
Requirement 2 Filtering Condition Implementation Code:
// Omit the above code...
// Filtering conditions
String gzdd="Shanghai";
String gzjy="3 years or less";
String xl="Associate";
String rzjd="Unlimited";
String gsgm="Unlimited";
String hyly="Unlimited";
// Click the specified area condition
driver.findElement(By.xpath("//div[@class='other-hot-city']/div/a[contains(text(),'