Automating PDF Processing with Python

Hello everyone! Recently, many students have asked me how to process PDF files with Python. In daily work we often have to handle large numbers of PDF documents: extracting text, merging files, adding watermarks, and so on. Today I will show how to automate these tasks with Python and make tedious document work easier.

Preparation: Install Necessary Libraries

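First, we need to install PyPDF2, a popular pure-Python PDF library. The examples below use the PdfReader, PdfWriter, and PdfMerger classes available in PyPDF2 3.x. (Development of the library has since continued under the name pypdf.) Open the command line and enter: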

pip install PyPDF2

1. Reading PDF Files

Let’s start with the basic task of reading a PDF file:

from PyPDF2 import PdfReader

# Create a PDF reader object
reader = PdfReader("example.pdf")

# Get the number of pages
print(f"The PDF file has {len(reader.pages)} pages")

# Read the text of the first page
page = reader.pages[0]
text = page.extract_text()
print("Content of the first page:")
print(text)

Tip: If the PDF contains Chinese characters, watch for encoding issues when you save the extracted text. extract_text() returns a regular Python string, so open the output file with encoding="utf-8" when writing it to disk.
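
To make the tip above concrete, here is a minimal sketch that collects the text of every page and writes it to a UTF-8 text file. The helper names collect_text and pdf_to_text_file, and the file names in the usage comment, are my own choices for illustration and are not part of PyPDF2.

```python
from pathlib import Path


def collect_text(pages):
    """Join the text of every page, separated by newlines.

    `pages` can be any iterable of objects with an extract_text() method,
    such as reader.pages from PyPDF2.
    """
    # Treat a page with no extractable text as an empty string
    return "\n".join(page.extract_text() or "" for page in pages)


def pdf_to_text_file(pdf_path, txt_path):
    """Extract all text from pdf_path and save it as UTF-8 (hypothetical helper)."""
    # Imported here so that collect_text can be reused without PyPDF2
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # encoding="utf-8" keeps Chinese and other non-ASCII characters intact
    Path(txt_path).write_text(collect_text(reader.pages), encoding="utf-8")


# Example usage (assumes example.pdf exists):
# pdf_to_text_file("example.pdf", "example.txt")
```

Passing the encoding explicitly matters mostly on systems whose default text encoding is not UTF-8.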

2. Merging PDF Files

In my work, I often need to merge multiple PDF files into one. This task is very simple to do with Python:

from PyPDF2 import PdfMerger

def merge_pdfs(files, output):
    merger = PdfMerger()
    
    # Add all PDF files
    for pdf in files:
        merger.append(pdf)
    
    # Save the merged file
    merger.write(output)
    merger.close()

# Example usage
pdf_files = ["file1.pdf", "file2.pdf", "file3.pdf"]
merge_pdfs(pdf_files, "merged.pdf")

Note: Before merging, make sure every input file exists and is readable; otherwise merger.append() will raise an error.
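
One way to act on this note is to check the inputs before touching the merger at all. The sketch below is only an illustration: find_missing and safe_merge are hypothetical helper names of mine, not PyPDF2 functions.

```python
import os


def find_missing(files):
    """Return the paths in `files` that are not existing, readable files."""
    return [f for f in files if not (os.path.isfile(f) and os.access(f, os.R_OK))]


def safe_merge(files, output):
    """Merge `files` into `output`, failing early if any input is unavailable."""
    missing = find_missing(files)
    if missing:
        raise FileNotFoundError(f"Cannot merge, missing or unreadable: {missing}")

    # Imported here so that find_missing can be used without PyPDF2
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    for pdf in files:
        merger.append(pdf)
    merger.write(output)
    merger.close()


# Example usage (assumes the three input files exist):
# safe_merge(["file1.pdf", "file2.pdf", "file3.pdf"], "merged.pdf")
```

Checking first means the error message lists every problem file at once, instead of stopping at the first one.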

3. Adding Watermarks to PDFs

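Want to mark a document as yours or label it as confidential? Adding a watermark is a simple way to do it. The watermark is itself a one-page PDF, and its first page is stamped onto every page of the original: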

from PyPDF2 import PdfReader, PdfWriter

def add_watermark(input_pdf, watermark_pdf, output_pdf):
    # Read the original PDF and watermark PDF
    reader = PdfReader(input_pdf)
    watermark = PdfReader(watermark_pdf)
    writer = PdfWriter()
    
    # Add watermark to each page
    for page in reader.pages:
        page.merge_page(watermark.pages[0])
        writer.add_page(page)
    
    # Save the result
    with open(output_pdf, "wb") as file:
        writer.write(file)

# Example usage
add_watermark("original.pdf", "watermark.pdf", "output_with_watermark.pdf")

4. Extracting Images from PDFs

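Sometimes we need to extract the images embedded in a PDF. This example uses a different library, PyMuPDF (imported as fitz), which you can install with pip install PyMuPDF: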

import fitz  # PyMuPDF; install it with: pip install PyMuPDF

def extract_images(pdf_path):
    # Open the PDF file
    doc = fitz.open(pdf_path)
    
    # Iterate through each page
    for page_num in range(len(doc)):
        page = doc[page_num]
        
        # Get images on the page
        images = page.get_images()
        
        # Save images
        for image_num, image in enumerate(images):
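            xref = image[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            # Use the image's actual format (e.g. png, jpeg) as the extension
            ext = base_image["ext"]
            
            # Save image file
            with open(f"image_page{page_num+1}_{image_num+1}.{ext}", "wb") as f:
                f.write(image_bytes)
    
    # Release the file handle when done
    doc.close()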

# Example usage
extract_images("document_with_images.pdf")

Tip: Before extracting images, check the size of the PDF file. If it is very large, process its pages in batches rather than all at once.
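
As a rough sketch of this tip, the two helpers below report a file's size and split a page count into batches. The names file_size_mb and page_batches, and the batch size of 50 in the usage comment, are arbitrary choices of mine for illustration; pick values that suit your own files.

```python
import os


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def page_batches(total_pages, batch_size):
    """Yield ranges of page indices, each covering at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


# Example usage (assumes a fitz document `doc` is already open):
# for batch in page_batches(len(doc), 50):
#     for page_num in batch:
#         page = doc[page_num]
#         ...  # extract and save this page's images
```

Working batch by batch also gives you natural points to log progress or to resume after an interruption.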

Summary and Exercises

Today we covered four practical PDF processing tasks:

  • Reading PDF text
  • Merging multiple PDFs
  • Adding watermarks
  • Extracting images

Exercises:

  1. Try writing a function to split a PDF file (divide one PDF into multiple files)
  2. Write a function that searches a PDF file for specific text and outputs the page numbers where it appears
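
If you get stuck on exercise 2, here is one possible starting point, not the only solution. The function name find_pages is my own, and the case-insensitive matching is an assumption about what "search" should mean here.

```python
def find_pages(pages, keyword):
    """Return the 1-based numbers of the pages whose text contains keyword.

    `pages` can be any iterable of objects with an extract_text() method,
    such as reader.pages from PyPDF2. Matching ignores letter case.
    """
    needle = keyword.lower()
    hits = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""
        if needle in text.lower():
            hits.append(number)
    return hits


# Example usage (assumes example.pdf exists):
# from PyPDF2 import PdfReader
# print(find_pages(PdfReader("example.pdf").pages, "invoice"))
```

Try extending it yourself, for example to report how many times the text appears on each page.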

Remember to pay attention to file permissions and size limits when processing PDF files. It is recommended to test the correctness of the code with small files before handling large ones.

Next, I encourage you to try this code yourself. If you run into any problems, feel free to leave a comment for discussion. The most important part of learning programming is practice, so let's grow together through code!

Next issue preview: I will introduce how to extract table data from PDFs with Python. Stay tuned!
