Python Web Scraping Notes: Basics of Streaming Video Capture

Hello everyone! The second article in the web scraping series is here~ Today we start an interesting and practical topic: streaming video capture. Over three parts, I will walk through how to legally capture and analyze streaming video content.

1. Streaming Video: The Mainstream Form of Modern Online Video

What is Streaming Media?

Streaming media is a technology that allows for playback while downloading, unlike traditional methods that require complete downloads before playback. Common streaming protocols include:

  • HLS (HTTP Live Streaming): A protocol introduced by Apple, using m3u8 and ts files

  • DASH (Dynamic Adaptive Streaming over HTTP): An adaptive-bitrate protocol standardized by MPEG, using mpd manifests and m4s segments

  • RTMP (Real-Time Messaging Protocol): A real-time messaging protocol (gradually being phased out)

Why is it Important to Understand Streaming Capture?

  • Many websites use streaming technology to deliver video

  • Direct video links are often difficult to obtain

  • Understanding the principles allows for more effective video content capture

2. Analysis of the HLS Streaming Protocol

m3u8 File: The “Directory” of Streaming Media

The m3u8 file is the core of the HLS protocol; it is a text file that contains various information about the video stream:

# Example m3u8 file content
# Each #EXTINF line gives a segment's duration; the segment URI follows on the next line
m3u8_content = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXTINF:10.0,
segment2.ts
#EXT-X-ENDLIST
"""
TS Files: The "Chunks" of Video

TS (Transport Stream) files are the actual video data blocks, with each TS file containing a few seconds of video content.
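
To see how the playlist and its segments relate, here is a minimal sketch that pairs each #EXTINF duration with the segment URI on the following line and sums the durations. The function name `summarize_playlist` and the sample text are illustrative assumptions, not part of any library, and the sketch only handles a simple media playlist like the example in this section.

```python
# Sample playlist (assumed for illustration; mirrors the example in this section)
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXTINF:10.0,
segment2.ts
#EXT-X-ENDLIST
"""


def summarize_playlist(text):
    """Return (segments, total_seconds, is_complete) for a simple media playlist.

    segments is a list of (uri, duration) tuples in playlist order.
    """
    segments = []
    pending_duration = None
    is_complete = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#EXTINF:"):
            # Format: "#EXTINF:<duration>,<optional title>"
            value = line[len("#EXTINF:"):].split(",", 1)[0]
            pending_duration = float(value)
        elif line == "#EXT-X-ENDLIST":
            # Present for on-demand playlists; live playlists keep growing
            is_complete = True
        elif not line.startswith("#"):
            # A non-tag line is a segment URI; attach the preceding duration
            segments.append((line, pending_duration))
            pending_duration = None
    total_seconds = sum(d for _, d in segments if d is not None)
    return segments, total_seconds, is_complete


segments, total_seconds, is_complete = summarize_playlist(SAMPLE_PLAYLIST)
print(len(segments), total_seconds, is_complete)  # 3 30.0 True
```

Three 10-second segments add up to a 30-second video, and the #EXT-X-ENDLIST tag tells us the list is complete.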

3. Identifying and Analyzing Streaming Resources

Using Developer Tools for Analysis

  1. Open the browser’s developer tools (F12)

  2. Switch to the Network tab

  3. Filter for “m3u8” or “ts” requests

  4. Play the video and observe the requests that appear

Common Streaming Resource Identifiers

# Common streaming file extensions and keywords
streaming_keywords ={
    'm3u8':'HLS playlist',
    'ts':'video segment',
    'm4s':'MPEG-DASH segment',
    'mpd':'DASH descriptor file',
    'f4m':'Flash media manifest',
    'key':'encryption key'
}
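
As a small illustration of how this table could be used, the sketch below classifies a request URL copied from the Network tab by its file extension. The helper name `classify_stream_url` is an assumption made for this example; it only looks at the URL path, so resources served without a recognizable extension will not be identified.

```python
import os
from urllib.parse import urlparse

# Same mapping as above, repeated so the example runs on its own
STREAMING_KEYWORDS = {
    'm3u8': 'HLS playlist',
    'ts': 'video segment',
    'm4s': 'MPEG-DASH segment',
    'mpd': 'DASH descriptor file',
    'f4m': 'Flash media manifest',
    'key': 'encryption key',
}


def classify_stream_url(url):
    """Return a description of the streaming resource, or None if unknown."""
    # Use only the path so query strings such as ?token=abc are ignored
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lstrip('.').lower()
    return STREAMING_KEYWORDS.get(extension)


print(classify_stream_url('https://example.com/live/index.m3u8?token=abc'))  # HLS playlist
print(classify_stream_url('https://example.com/about.html'))  # None
```
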
4. Basic Streaming Capture Practice

Practical Steps and Ideas:

  1. Find the m3u8 file: By analyzing the webpage source code or network requests

  2. Download the m3u8 file: Obtain video segment information

  3. Parse the m3u8 content: Extract all TS file links

  4. Download TS files: Download video segments one by one

  5. Merge TS files: Combine segments into a complete video

import requests
import re
import os
from urllib.parse import urljoin
class BasicStreamingDownloader:
    def __init__(self, output_dir='videos'):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
    def extract_m3u8_url(self, page_url):
        """
            Extract m3u8 address from the webpage
            Steps:
                1. Download webpage content
                2. Use regular expressions to match m3u8 links
                3. Return the first found m3u8 link
        """
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }
            response = requests.get(
                page_url, 
                headers=headers, 
                timeout=10
            )
            response.raise_for_status()
            # Use a regular expression to find m3u8 links
            # (the match stops at whitespace, quotes, and angle brackets)
            m3u8_pattern = r'https?://[^\s"\'<>]+?\.m3u8'
            m3u8_urls = re.findall(m3u8_pattern, response.text)
            if m3u8_urls:
                print(f"Found m3u8 address: {m3u8_urls[0]}")
                return m3u8_urls[0]
            else:
                print("No m3u8 address found")
                return None
        except Exception as e:
            print(f"Failed to extract m3u8 address: {e}")
            return None
    def download_m3u8(self, m3u8_url):
        """
        Download m3u8 file and parse it
        Steps:
            1. Send request to get m3u8 content
            2. Save to local file
            3. Return m3u8 content
        """
        try:
            headers ={
                'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Referer': 'https://example.com/'  # Set to the page that hosts the video
            }
            response = requests.get(m3u8_url, headers=headers, timeout=10)
            response.raise_for_status()
            # Save m3u8 file
            m3u8_path = os.path.join(self.output_dir,'playlist.m3u8')
            with open(m3u8_path,'w', encoding='utf-8')as f:
                f.write(response.text)
            print(f"m3u8 file saved: {m3u8_path}")
            return response.text
        except Exception as e:
            print(f"Failed to download m3u8 file: {e}")
            return None
    def parse_m3u8(self, m3u8_content, base_url):
        """
        Parse m3u8 content and extract TS file list
        Steps:
            1. Split m3u8 content by line
            2. Skip comment lines and empty lines
            3. Construct complete TS file URLs
        """
        ts_files =[]
        # Simple parsing of TS file links
        lines = m3u8_content.split('\n')
        for line in lines:
            line = line.strip()
            # Skip comments and empty lines
            if line and not line.startswith('#'):
                # Construct complete TS file URL
                if line.startswith('http'):
                    ts_url = line
                else:
                    ts_url = urljoin(base_url, line)
                ts_files.append(ts_url)
        print(f"Parsed {len(ts_files)} TS files")
        return ts_files
    def download_ts_file(self, ts_url, index):
        """
        Download a single TS file
        Steps:
            1. Send request to get TS file content
            2. Save to local file
            3. Use index to name the file for later merging
        """
        try:
            headers ={
                'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Referer': 'https://example.com/'  # Set to the page that hosts the video
            }
            response = requests.get(ts_url, headers=headers, timeout=30)
            response.raise_for_status()
            # Save TS file
            ts_filename =f'segment_{index:04d}.ts'
            ts_path =os.path.join(self.output_dir,ts_filename)
            with open(ts_path,'wb')as f:
                f.write(response.content)
            print(f"Downloaded: {ts_filename}")
            return True
        except Exception as e:
            print(f"Failed to download TS file {ts_url}: {e}")
            return False
# Example usage (please replace with actual test URL)
if __name__ == "__main__":
    downloader = BasicStreamingDownloader()
    # Step 1: Extract m3u8 address from webpage
    page_url = 'https://example.com/video-page'  # Replace with actual video page
    m3u8_url = downloader.extract_m3u8_url(page_url)
    if m3u8_url:
        # Step 2: Download m3u8 file
        m3u8_content = downloader.download_m3u8(m3u8_url)
        if m3u8_content:
            # Step 3: Parse m3u8 to get TS file list
            base_url = m3u8_url.rsplit('/', 1)[0] + '/'  # Directory that contains the playlist
            ts_files = downloader.parse_m3u8(m3u8_content, base_url)
            # Step 4: Download TS files (only the first 5 as a demonstration;
            # a real run would download all of them)
            for i, ts_url in enumerate(ts_files[:5]):
                downloader.download_ts_file(ts_url, i)
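
The class above covers steps 1 to 4 but stops before step 5, merging. For unencrypted streams, TS segments can usually be joined by appending them in order, so a minimal sketch might look like the following. The function name `merge_ts_files` and the output name `merged.ts` are assumptions for this example; it relies on the zero-padded `segment_0000.ts` naming used by download_ts_file so that sorting by name matches playback order.

```python
import glob
import os
import shutil
import tempfile


def merge_ts_files(segment_dir, output_name="merged.ts"):
    """Concatenate segment_*.ts files in name order; return the output path.

    Returns None when the directory contains no segments.
    """
    pattern = os.path.join(segment_dir, "segment_*.ts")
    # Zero-padded indexes make lexicographic order equal to numeric order
    segment_paths = sorted(glob.glob(pattern))
    if not segment_paths:
        return None
    output_path = os.path.join(segment_dir, output_name)
    with open(output_path, "wb") as merged:
        for path in segment_paths:
            with open(path, "rb") as part:
                # Stream the copy so large segments are not loaded into memory
                shutil.copyfileobj(part, merged)
    return output_path


# Demonstration with tiny placeholder files instead of real video data
with tempfile.TemporaryDirectory() as demo_dir:
    for i, payload in enumerate([b"first", b"second"]):
        with open(os.path.join(demo_dir, f"segment_{i:04d}.ts"), "wb") as f:
            f.write(payload)
    merged_path = merge_ts_files(demo_dir)
    with open(merged_path, "rb") as f:
        merged_bytes = f.read()

print(merged_bytes)  # b'firstsecond'
```

If the merged file needs to be in an MP4 container, a tool such as ffmpeg can remux it without re-encoding (for example `ffmpeg -i merged.ts -c copy output.mp4`).
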
5. Important Principles for Compliant Capture

1. Respect Copyright and Legal Regulations

# For educational and technical research purposes only, not for commercial use
legal_guidelines =[
    'Do not download copyrighted paid content',
    'Do not redistribute captured content',
    'Only capture publicly accessible content',
    "Comply with the website's robots.txt protocol",
    'Control request frequency to avoid server pressure'
]
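
Two of these guidelines, respecting robots.txt and controlling request frequency, can be enforced in code. The sketch below uses the standard library's `urllib.robotparser` to check whether a URL is allowed, plus a small throttle that spaces out requests. The class name `PoliteThrottle`, the user agent string `MyCrawler`, the one-second default interval, and the sample robots.txt text are all assumptions for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


class PoliteThrottle:
    """Make sure at least min_interval seconds pass between requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_request = None

    def wait(self):
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


# Sample rules (assumed for illustration): everything under /private/ is off limits
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(SAMPLE_ROBOTS)
print(rules.can_fetch("MyCrawler", "https://example.com/videos/a.m3u8"))   # True
print(rules.can_fetch("MyCrawler", "https://example.com/private/a.m3u8"))  # False

throttle = PoliteThrottle(min_interval=1.0)
# Call throttle.wait() right before each requests.get(...) in a download loop
```

In a real crawler, `RobotFileParser` can also fetch the file itself through its `set_url()` and `read()` methods; parsing text directly just keeps this example free of network access.
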
2. Technical Limitations and Ethical Considerations
# Ethical boundaries of technical use
ethical_considerations =[
    'Do not bypass paywalls',
    'Do not capture personal privacy content',
    'Not for competitive commercial use',
    'Cite data sources (when used legally)',
    'Follow fair use principles'
]
6. Common Questions and Solutions

1. Unable to Find m3u8 Address

def advanced_m3u8_discovery(page_url):
    """Advanced m3u8 discovery method"""
    # Method 1: Check video tags in the webpage source code
    # Method 2: Analyze XHR and Media types in network requests
    # Method 3: Use regular expressions to match various formats of m3u8 URLs
    # Method 4: Parse video information in JavaScript code
    print("Need to use browser developer tools to manually analyze the video loading process")
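
Methods 1 and 3 can be partly automated. As a rough sketch, the pattern below matches m3u8 URLs both in plain form (as in a video or source tag) and with JSON-escaped slashes (as often seen inside script blocks), and keeps any query string. The function name `find_m3u8_candidates`, the pattern, and the sample HTML are assumptions for illustration; pages that assemble the URL at runtime still need manual analysis with the developer tools.

```python
import re

# "https://" or the JSON-escaped form "https:\/\/", then everything up to
# ".m3u8", plus an optional query string such as "?token=abc"
M3U8_PATTERN = re.compile(
    r'https?:(?:\\?/){2}[^\s"\'<>]+?\.m3u8(?:\?[^\s"\'<>]*)?'
)


def find_m3u8_candidates(html):
    """Return unique m3u8 URLs found in the page source, in order of appearance."""
    candidates = []
    for match in M3U8_PATTERN.findall(html):
        url = match.replace('\\/', '/')  # Undo JSON-style slash escaping
        if url not in candidates:
            candidates.append(url)
    return candidates


# Sample page source (assumed for illustration)
SAMPLE_HTML = r'''
<video><source src="https://cdn.example.com/v/index.m3u8?token=abc" type="application/x-mpegURL"></video>
<script>var cfg = {"url": "https:\/\/cdn.example.com\/live\/master.m3u8"};</script>
'''

for url in find_m3u8_candidates(SAMPLE_HTML):
    print(url)
# https://cdn.example.com/v/index.m3u8?token=abc
# https://cdn.example.com/live/master.m3u8
```
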
2. TS File Download Failure
import time

def handle_download_failures(downloader, ts_url, index, retries=3):
    """Retry a failed TS download with exponential backoff"""
    for attempt in range(retries):
        # download_ts_file() catches its own exceptions and returns False on failure
        if downloader.download_ts_file(ts_url, index):
            return True
        print(f"Attempt {attempt + 1} failed: {ts_url}")
        if attempt < retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
    print(f"Unable to download TS file: {ts_url}")
    return False
7. Practical Exercise: Complete Process Demonstration

Exercise Objective:

Use publicly available test streaming resources to complete the following steps:

  1. Find the m3u8 file address

  2. Download and parse the m3u8 file

  3. Download some TS files

  4. Try to play the downloaded TS files

Detailed Steps:

1. Find Test Resources:

  • Use publicly available test streaming websites (e.g., https://test-videos.co.uk/)

  • Select a test video in HLS format

2. Analyze Web Structure:

# Pseudo code: Analyze webpage to obtain m3u8 address

Open the test video webpage
Right-click -> Inspect -> Network panel
Refresh the page and start playing the video
Filter m3u8 requests
Find the m3u8 file address
3. Implement the Downloader:
# Use the BasicStreamingDownloader class provided above
downloader = BasicStreamingDownloader("test_video")
m3u8_url = "Found m3u8 address"  # Placeholder: paste the m3u8 address found in step 2
# Download m3u8 file
m3u8_content = downloader.download_m3u8(m3u8_url)
if m3u8_content:  # download_m3u8() returns None on failure
    # Parse TS file list
    base_url = m3u8_url.rsplit('/', 1)[0] + '/'
    ts_files = downloader.parse_m3u8(m3u8_content, base_url)
    # Download the first few TS files
    for i, ts_url in enumerate(ts_files[:3]):
        downloader.download_ts_file(ts_url, i)
4. Verify Results:
  • Check if the downloaded TS files can play normally

  • Use VLC or other players to open individual TS files
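
Before opening a file in a player, a quick structural check can catch obvious failures, such as an HTML error page saved under a .ts name. MPEG transport streams are made of 188-byte packets that each begin with the sync byte 0x47, and the sketch below tests for that. The function name `looks_like_mpeg_ts` and the number of packets checked are assumptions for this example; passing the check does not guarantee the file will play, and encrypted segments are expected to fail it.

```python
TS_PACKET_SIZE = 188  # Fixed packet length of an MPEG transport stream
TS_SYNC_BYTE = 0x47   # First byte of every transport stream packet


def looks_like_mpeg_ts(data, packets_to_check=5):
    """Heuristic: True if the first few 188-byte packets start with 0x47."""
    if len(data) < TS_PACKET_SIZE:
        return False
    available_packets = len(data) // TS_PACKET_SIZE
    count = min(packets_to_check, available_packets)
    return all(
        data[i * TS_PACKET_SIZE] == TS_SYNC_BYTE for i in range(count)
    )


# Synthetic data for demonstration: a sync byte followed by 187 zero bytes
fake_packet = bytes([TS_SYNC_BYTE]) + bytes(TS_PACKET_SIZE - 1)
print(looks_like_mpeg_ts(fake_packet * 3))                   # True
print(looks_like_mpeg_ts(b"<html>not video</html>" * 20))    # False
```

For a downloaded file, read the first kilobyte or so in binary mode (`open(path, 'rb').read(1024)`) and pass it to the function.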

8. Learning Suggestions

  1. Practice Environment: Use publicly available test videos for practice

  2. Tool Preparation: Familiarize yourself with the use of browser developer tools

  3. Step by Step: Start with simple cases and gradually handle complex situations

  4. Legal Awareness: Always comply with laws and ethical standards

9. Next Learning Preview

In the next article, we will delve into:

  1. How to handle encrypted streaming videos

  2. AES-128 decryption principles and practices

  3. Obtaining and using decryption keys

  4. Automatically identifying and decrypting encrypted streams

Streaming video capture is a highly technical topic, and understanding its basic principles is key to successful capture. In the next article, we will explore techniques for handling encrypted streams!

Next time we will continue exploring Python. Let's make a little progress every day and keep learning together! If you have any questions, feel free to leave a comment for discussion~
