To write a web crawler in Python, you can follow these steps:
- **Environment Setup**: Ensure that you have Python installed along with the necessary libraries. Commonly used libraries include `requests` (for sending HTTP requests) and `BeautifulSoup4` (for parsing HTML content). You can install both with:

```bash
pip install requests beautifulsoup4
```
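If you want to confirm the installation worked, a quick sanity check like this should import both libraries and print their version strings (the versions shown in the comments are just examples):

```python
# Optional check: both imports should succeed and print version strings.
import requests
import bs4

print(requests.__version__)  # e.g. '2.31.0'
print(bs4.__version__)       # e.g. '4.12.2'
```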
- **Create Database**: To store the crawled data, you can use an SQLite database. Create a database named `data.db` with a table to hold the data. For example:

```python
import sqlite3

def create_database():
    conn = sqlite3.connect('data.db')
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS articles
                      (id INTEGER PRIMARY KEY AUTOINCREMENT,
                       title TEXT NOT NULL,
                       link TEXT NOT NULL)''')
    conn.commit()
    conn.close()

create_database()
```
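If you want to double-check that the table was created, you can query SQLite's built-in `sqlite_master` catalog. This is just a verification snippet, not part of the crawler:

```python
import sqlite3

# List the tables in data.db; the output should include ('articles',).
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
conn.close()
```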
- **Define Crawler Class**: Create a crawler class responsible for sending requests, parsing web pages, and storing data. Here is a simple example (a dry-run test follows the code):

```python
import requests
from bs4 import BeautifulSoup
import sqlite3

class SimpleCrawler:
    def __init__(self, base_url):
        self.base_url = base_url

    def fetch_page(self, url):
        # Download the page; the timeout keeps the request from hanging forever.
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
        else:
            print(f"Request failed, status code: {response.status_code}")
            return None

    def parse_page(self, html):
        # Extract (title, link) pairs from each <h2 class="entry-title"> heading.
        soup = BeautifulSoup(html, 'html.parser')
        articles = []
        for item in soup.find_all('h2', class_='entry-title'):
            title = item.get_text(strip=True)
            link = item.find('a')['href']
            articles.append((title, link))
        return articles

    def store_data(self, articles):
        # Insert the scraped rows into the SQLite table created earlier.
        conn = sqlite3.connect('data.db')
        cursor = conn.cursor()
        for title, link in articles:
            cursor.execute('INSERT INTO articles (title, link) VALUES (?, ?)',
                           (title, link))
        conn.commit()
        conn.close()

    def crawl(self):
        html = self.fetch_page(self.base_url)
        if html:
            articles = self.parse_page(html)
            self.store_data(articles)
            print("Data fetching and storage completed!")

if __name__ == "__main__":
    base_url = 'https://example-blog.com'  # Replace with the web page you want to crawl
    crawler = SimpleCrawler(base_url)
    crawler.crawl()
```
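Before crawling a live site, you can dry-run `parse_page` against a small HTML string to confirm the selectors match. This assumes the `SimpleCrawler` class above; the sample markup is made up to mimic the expected `h2.entry-title` structure:

```python
# Offline test of parse_page; no network access or database writes happen here.
sample_html = """
<html><body>
  <h2 class="entry-title"><a href="https://example-blog.com/post-1">First post</a></h2>
  <h2 class="entry-title"><a href="https://example-blog.com/post-2">Second post</a></h2>
</body></html>
"""

crawler = SimpleCrawler('https://example-blog.com')
print(crawler.parse_page(sample_html))
# [('First post', 'https://example-blog.com/post-1'),
#  ('Second post', 'https://example-blog.com/post-2')]
```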
- **Run the Program**: In a terminal or command prompt, navigate to the directory containing your script and run:

```bash
python your_script.py
```

Make sure to replace `base_url` with the web address you want to crawl. After the program finishes, the crawled data will be stored in the `data.db` database; a quick way to verify this is shown below.
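To verify the run, you can read the stored rows back out of `data.db`:

```python
import sqlite3

# Print every row the crawler stored.
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
for row in cursor.execute('SELECT id, title, link FROM articles'):
    print(row)
conn.close()
```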
- **Legal Compliance**: When writing a crawler, be sure to follow the rules in the target website's `robots.txt` file and respect copyright and privacy laws. Keep your crawling legal and compliant, and avoid putting excessive load on the target server; a minimal politeness sketch follows this list.
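As one way to honor `robots.txt` and throttle requests, here is a minimal sketch using the standard library's `urllib.robotparser`. The URLs are placeholders, and in practice you would run this check before each call to `fetch_page`:

```python
import time
from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt (placeholder URL).
rp = RobotFileParser()
rp.set_url('https://example-blog.com/robots.txt')
rp.read()

urls = ['https://example-blog.com/', 'https://example-blog.com/page/2']
for url in urls:
    if rp.can_fetch('*', url):
        print(f"Allowed: {url}")   # a real crawler would fetch the page here
    else:
        print(f"Blocked by robots.txt: {url}")
    time.sleep(1)  # pause between requests to avoid overloading the server
```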