To write a web crawler in Python, you can follow these steps:
- **Environment Setup**: Ensure that you have Python installed along with the necessary libraries. Commonly used libraries include `requests` (for sending HTTP requests) and `BeautifulSoup4` (for parsing HTML content). You can install both with:

```bash
pip install requests beautifulsoup4
```
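If you want to confirm the installation worked, a quick sanity check like this should import both libraries and print their version strings (the versions shown in the comments are just examples):

```python
# Optional check: both imports should succeed and print version strings.
import requests
import bs4

print(requests.__version__)  # e.g. '2.31.0'
print(bs4.__version__)       # e.g. '4.12.2'
```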
- **Create Database**: To store the crawled data, you can use an SQLite database. Create a database named `data.db` with a table to hold the data. For example:

```python
import sqlite3

def create_database():
    conn = sqlite3.connect('data.db')
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS articles
                      (id INTEGER PRIMARY KEY AUTOINCREMENT,
                       title TEXT NOT NULL,
                       link TEXT NOT NULL)''')
    conn.commit()
    conn.close()

create_database()
```
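If you want to double-check that the table was created, you can query SQLite's built-in `sqlite_master` catalog. This is just a verification snippet, not part of the crawler:

```python
import sqlite3

# List the tables in data.db; the output should include ('articles',).
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
conn.close()
```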
- **Define Crawler Class**: Create a crawler class responsible for sending requests, parsing web pages, and storing data. Here is a simple example (a dry-run test follows the code):

```python
import requests
from bs4 import BeautifulSoup
import sqlite3

class SimpleCrawler:
    def __init__(self, base_url):
        self.base_url = base_url

    def fetch_page(self, url):
        # Download the page; the timeout keeps the request from hanging forever.
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        else:
            print(f"Request failed, status code: {response.status_code}")
            return None

    def parse_page(self, html):
        # Extract (title, link) pairs from each <h2 class="entry-title"> heading.
        soup = BeautifulSoup(html, 'html.parser')
        articles = []
        for item in soup.find_all('h2', class_='entry-title'):
            title = item.get_text(strip=True)
            link = item.find('a')['href']
            articles.append((title, link))
        return articles

    def store_data(self, articles):
        # Insert the scraped rows into the SQLite table created earlier.
        conn = sqlite3.connect('data.db')
        cursor = conn.cursor()
        for title, link in articles:
            cursor.execute('INSERT INTO articles (title, link) VALUES (?, ?)',
                           (title, link))
        conn.commit()
        conn.close()

    def crawl(self):
        html = self.fetch_page(self.base_url)
        if html:
            articles = self.parse_page(html)
            self.store_data(articles)
            print("Data fetching and storage completed!")

if __name__ == "__main__":
    base_url = 'https://example-blog.com'  # Replace with the web page you want to crawl
    crawler = SimpleCrawler(base_url)
    crawler.crawl()
```
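Before crawling a live site, you can dry-run `parse_page` against a small HTML string to confirm the selectors match. This assumes the `SimpleCrawler` class above; the sample markup is made up to mimic the expected `h2.entry-title` structure:

```python
# Offline test of parse_page; no network access or database writes happen here.
sample_html = """
<html><body>
  <h2 class="entry-title"><a href="https://example-blog.com/post-1">First post</a></h2>
  <h2 class="entry-title"><a href="https://example-blog.com/post-2">Second post</a></h2>
</body></html>
"""

crawler = SimpleCrawler('https://example-blog.com')
print(crawler.parse_page(sample_html))
# [('First post', 'https://example-blog.com/post-1'),
#  ('Second post', 'https://example-blog.com/post-2')]
```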
- **Run the Program**: In a terminal or command prompt, navigate to the directory containing your script and run:

```bash
python your_script.py
```

Make sure to replace `base_url` with the web address you want to crawl. After the program finishes, the crawled data will be stored in the `data.db` database; a quick way to verify this is shown below.
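To verify the run, you can read the stored rows back out of `data.db`:

```python
import sqlite3

# Print every row the crawler stored.
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
for row in cursor.execute('SELECT id, title, link FROM articles'):
    print(row)
conn.close()
```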
- **Legal Compliance**: When writing a crawler, be sure to follow the rules in the target website's `robots.txt` file and respect copyright and privacy laws. Keep your crawling legal and compliant, and avoid putting excessive load on the target server; a minimal politeness sketch follows this list.
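As one way to honor `robots.txt` and throttle requests, here is a minimal sketch using the standard library's `urllib.robotparser`. The URLs are placeholders, and in practice you would run this check before each call to `fetch_page`:

```python
import time
from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt (placeholder URL).
rp = RobotFileParser()
rp.set_url('https://example-blog.com/robots.txt')
rp.read()

urls = ['https://example-blog.com/', 'https://example-blog.com/page/2']
for url in urls:
    if rp.can_fetch('*', url):
        print(f"Allowed: {url}")   # a real crawler would fetch the page here
    else:
        print(f"Blocked by robots.txt: {url}")
    time.sleep(1)  # pause between requests to avoid overloading the server
```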