How To Write A Web Crawler In Python

To write a web crawler in Python, you can follow these steps:

  1. Environment Setup: Make sure Python is installed, along with the necessary libraries. Commonly used libraries include requests (for sending HTTP requests) and beautifulsoup4 (for parsing HTML content). You can install both with the following command:

    pip install requests beautifulsoup4
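
    To confirm the installation, you can try importing both libraries (bs4 is the import name of the beautifulsoup4 package):

    python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"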
  2. Create a Database: To store the crawled data, you can use an SQLite database. Create a database file named data.db containing a table to hold the results. For example:

    import sqlite3
    
    def create_database():
        # Connect to (or create) the local SQLite database file
        conn = sqlite3.connect('data.db')
        cursor = conn.cursor()
        # Create the articles table if it does not already exist
        cursor.execute('''CREATE TABLE IF NOT EXISTS articles (
                              id INTEGER PRIMARY KEY AUTOINCREMENT,
                              title TEXT NOT NULL,
                              link TEXT NOT NULL
                          )''')
        conn.commit()
        conn.close()
    
    create_database()
  3. Define the Crawler Class: Create a crawler class responsible for sending requests, parsing web pages, and storing the data. Here is a simple example:

    import requests
    from bs4 import BeautifulSoup
    import sqlite3
    
    class SimpleCrawler:
        def __init__(self, base_url):
            self.base_url = base_url
    
        def fetch_page(self, url):
            # Fetch the page; the timeout keeps a stalled request from hanging the crawler
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException as e:
                print(f"Request failed: {e}")
                return None
            if response.status_code == 200:
                return response.text
            print(f"Request failed, status code: {response.status_code}")
            return None
    
        def parse_page(self, html):
            soup = BeautifulSoup(html, 'html.parser')
            articles = []
            # This selector assumes titles are <h2 class="entry-title"> elements;
            # adjust it to match the markup of the site you are crawling
            for item in soup.find_all('h2', class_='entry-title'):
                link_tag = item.find('a')
                if link_tag is None:  # skip headings that contain no link
                    continue
                title = item.get_text(strip=True)
                articles.append((title, link_tag['href']))
            return articles
    
        def store_data(self, articles):
            conn = sqlite3.connect('data.db')
            cursor = conn.cursor()
            # Parameterized queries keep page content from breaking the SQL
            cursor.executemany('INSERT INTO articles (title, link) VALUES (?, ?)', articles)
            conn.commit()
            conn.close()
    
        def crawl(self):
            html = self.fetch_page(self.base_url)
            if html:
                articles = self.parse_page(html)
                self.store_data(articles)
                print("Data fetching and storage completed!")
    
    if __name__ == "__main__":
        base_url = 'https://example-blog.com'  # Replace with the web page you want to crawl
        crawler = SimpleCrawler(base_url)
        crawler.crawl()
  4. Run the Program: In the terminal or command prompt, navigate to the directory where your script is located, and run the following command:

    python your_script.py

    Make sure base_url has been replaced with the web address you want to crawl. After the program runs, the crawled data will be stored in the data.db database; a quick query to verify it is sketched below.
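
    To check what was stored, a short script reading data.db (using the articles table created in step 2) might look like this:

    import sqlite3
    
    conn = sqlite3.connect('data.db')
    # Print every stored article row: id, title, and link
    for article_id, title, link in conn.execute('SELECT id, title, link FROM articles'):
        print(article_id, title, link)
    conn.close()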

  5. Legal Compliance: When writing a crawler, be sure to follow the rules in the target website's robots.txt file and to respect copyright and privacy laws. Keep your crawling legal and compliant, and avoid putting excessive load on the target server; a simple robots.txt check with a polite request delay is sketched below.
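
    Python's standard-library urllib.robotparser can check whether a given URL may be fetched. Here is a minimal sketch (the domain, path, and user-agent string are placeholders):

    import time
    from urllib.robotparser import RobotFileParser
    
    robots = RobotFileParser()
    robots.set_url('https://example-blog.com/robots.txt')  # placeholder domain
    robots.read()  # download and parse the robots.txt file
    
    url = 'https://example-blog.com/articles'  # hypothetical page to check
    if robots.can_fetch('SimpleCrawler', url):
        # ... fetch and process the page here ...
        time.sleep(1)  # pause between requests so the server is not overloaded
    else:
        print('robots.txt disallows fetching this URL')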
