Hello everyone! Recently, many students have asked me how to process PDF files with Python. In daily work we often need to handle large numbers of PDF documents: extracting text, merging files, adding watermarks, and so on. Today I will show how to automate these tasks with Python and make tedious document work easier.
Preparation: Install Necessary Libraries
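First, we need to install PyPDF2, a pure-Python PDF processing library. The examples below use the PdfReader, PdfWriter, and PdfMerger classes from PyPDF2 3.x. The project has since continued under the name pypdf, so check its documentation if you install that package instead. Open the command line and enter: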
pip install PyPDF2
1. Reading PDF Files
Let’s start with the basic task of reading a PDF file:
from PyPDF2 import PdfReader
# Create a PDF reader object
reader = PdfReader("example.pdf")
# Get the number of pages
print(f"The PDF file has {len(reader.pages)} pages")
# Read the text of the first page
page = reader.pages[0]
text = page.extract_text()
print("Content of the first page:")
print(text)
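Tip: extract_text() returns an ordinary Python string, so Chinese characters are handled correctly in memory. Encoding only matters when you write the text to disk: open the output file with encoding="utf-8" so the characters are saved correctly on every platform.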
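To make the tip concrete, here is a minimal sketch that saves the text of every page to a UTF-8 file. The function names save_pages_as_utf8 and extract_pdf_text and the "--- Page N ---" separator are my own choices for illustration, not part of PyPDF2.

```python
from pathlib import Path


def save_pages_as_utf8(page_texts, output_path):
    """Write one text block per page to a UTF-8 encoded file."""
    lines = []
    for number, text in enumerate(page_texts, start=1):
        lines.append(f"--- Page {number} ---")
        # Pages without extractable text (e.g. scanned images) may give "" or None
        lines.append(text or "")
    Path(output_path).write_text("\n".join(lines), encoding="utf-8")


def extract_pdf_text(pdf_path):
    """Return the extracted text of every page (requires PyPDF2)."""
    # Imported here so save_pages_as_utf8 can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


# Example usage (assumes example.pdf exists in the current folder):
# save_pages_as_utf8(extract_pdf_text("example.pdf"), "example.txt")
```

Keeping the file-writing step separate from the PDF-reading step makes it easy to try the encoding part on plain strings first.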
2. Merging PDF Files
In my work, I often need to merge multiple PDF files into one. This task is very simple to do with Python:
from PyPDF2 import PdfMerger

def merge_pdfs(files, output):
    merger = PdfMerger()
    # Add all PDF files
    for pdf in files:
        merger.append(pdf)
    # Save the merged file
    merger.write(output)
    merger.close()

# Example usage
pdf_files = ["file1.pdf", "file2.pdf", "file3.pdf"]
merge_pdfs(pdf_files, "merged.pdf")
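Note: Before merging, make sure that every input file exists and is readable. If one of them is missing, Python raises an error (typically FileNotFoundError) partway through and no merged file is written.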
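One way to follow this advice is to check the whole list before starting the merge. The sketch below is only a suggestion: find_missing and merge_existing_pdfs are names I made up for this example, and the merge itself uses the same PdfMerger calls as above.

```python
from pathlib import Path


def find_missing(files):
    """Return the paths in `files` that do not point to an existing file."""
    return [str(f) for f in files if not Path(f).is_file()]


def merge_existing_pdfs(files, output):
    """Merge PDFs, refusing to start if any input file is missing."""
    missing = find_missing(files)
    if missing:
        raise FileNotFoundError(f"Cannot merge, missing files: {missing}")

    # Imported here so find_missing can be used without PyPDF2 installed
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    try:
        for pdf in files:
            merger.append(pdf)
        merger.write(output)
    finally:
        # Release the open input files even if something goes wrong
        merger.close()


# Example usage (assumes the three files exist):
# merge_existing_pdfs(["file1.pdf", "file2.pdf", "file3.pdf"], "merged.pdf")
```

Checking up front means the error message lists every missing file at once, instead of stopping at the first one.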
3. Adding Watermarks to PDFs
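Want to mark a document as yours or as confidential? Adding a watermark is a good choice. The watermark is itself a one-page PDF, and its first page is stamped onto every page of the original: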
from PyPDF2 import PdfReader, PdfWriter

def add_watermark(input_pdf, watermark_pdf, output_pdf):
    # Read the original PDF and watermark PDF
    reader = PdfReader(input_pdf)
    watermark = PdfReader(watermark_pdf)
    writer = PdfWriter()
    # Add watermark to each page
    for page in reader.pages:
        page.merge_page(watermark.pages[0])
        writer.add_page(page)
    # Save the result
    with open(output_pdf, "wb") as file:
        writer.write(file)

# Example usage
add_watermark("original.pdf", "watermark.pdf", "output_with_watermark.pdf")
4. Extracting Images from PDFs
Sometimes we need to extract images from a PDF. The example below uses PyMuPDF (imported as fitz), which you can install with pip install PyMuPDF:
import fitz  # PyMuPDF; install with: pip install PyMuPDF

def extract_images(pdf_path):
    # Open the PDF file
    doc = fitz.open(pdf_path)
    # Iterate through each page
    for page_num in range(len(doc)):
        page = doc[page_num]
        # Get images on the page
        images = page.get_images()
        # Save images
        for image_num, image in enumerate(images):
            xref = image[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            # Use the image's real format (e.g. png, jpeg) as the file extension
            ext = base_image["ext"]
            with open(f"image_page{page_num+1}_{image_num+1}.{ext}", "wb") as f:
                f.write(image_bytes)
    doc.close()

# Example usage
extract_images("document_with_images.pdf")
Tip: When extracting images, it is advisable to check the size of the PDF file first. If the file is too large, it is better to process it in batches.
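As a rough sketch of what "check the size, then process in batches" could look like: the helpers below report the file size in megabytes and split the page range into chunks. The names file_size_mb and page_batches, and the batch size of 50 pages in the usage comment, are assumptions of mine; pick a size that suits your files.

```python
import os


def file_size_mb(path):
    """Return the size of a file in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def page_batches(total_pages, batch_size):
    """Yield (start, stop) page index ranges that cover 0..total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        # stop is exclusive, like range(); the last batch may be shorter
        yield start, min(start + batch_size, total_pages)


# Example usage with the PyMuPDF document from the function above:
# print(f"{file_size_mb('document_with_images.pdf'):.1f} MB")
# for start, stop in page_batches(len(doc), 50):
#     for page_num in range(start, stop):
#         ...  # extract the images of doc[page_num]
```

Each batch can then be handled in a separate run, which makes it easier to resume if something fails halfway through a large file.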
Summary and Exercises
Today we learned four practical functions for PDF processing:
- Reading PDF text
- Merging multiple PDFs
- Adding watermarks
- Extracting images
Exercises:
- Try writing a function to split a PDF file (divide one PDF into multiple files)
- Write a function that searches a PDF file for specific text and outputs the page numbers where it appears
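If you get stuck on the search exercise, here is one possible starting point, not the only answer. It is a sketch under my own assumptions: the names find_pages and search_pdf are invented for this example, matching is a plain substring test, and text that PyPDF2 cannot extract (for example in scanned pages) will simply not be found.

```python
def find_pages(page_texts, keyword, case_sensitive=False):
    """Return the 1-based page numbers whose text contains `keyword`."""
    if not case_sensitive:
        keyword = keyword.lower()
    hits = []
    for number, text in enumerate(page_texts, start=1):
        text = text or ""  # treat pages without extractable text as empty
        haystack = text if case_sensitive else text.lower()
        if keyword in haystack:
            hits.append(number)
    return hits


def search_pdf(pdf_path, keyword):
    """Search every page of a PDF for `keyword` (requires PyPDF2)."""
    # Imported here so find_pages can be tried on plain strings first
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    texts = [page.extract_text() for page in reader.pages]
    return find_pages(texts, keyword)


# Example usage (assumes example.pdf exists):
# print(search_pdf("example.pdf", "Python"))
```

Try extending it yourself, for instance by also printing the line in which the text was found.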
Remember to pay attention to file permissions and size limits when processing PDF files. It is recommended to test the correctness of the code with small files before handling large ones.
Next, I encourage you to try this code yourself. If you run into any issues, feel free to leave a comment for discussion. The most important part of learning programming is practice, so let’s grow together through code!
Next issue preview: I will introduce how to extract table data from PDFs with Python. Stay tuned!