Hello everyone! The second article in the web scraping series is here~ Today we start an interesting and practical topic: streaming video capture. Over three parts, I will give a complete introduction to legally capturing and analyzing streaming video content.

1. Streaming Video: The Mainstream Form of Modern Online Video

What is Streaming Media?

Streaming media is a technology that allows for playback while downloading, unlike traditional methods that require complete downloads before playback. Common streaming protocols include:

HLS (HTTP Live Streaming): A protocol introduced by Apple, using m3u8 and ts files
DASH (Dynamic Adaptive Streaming over HTTP): An international standard (MPEG-DASH) for adaptive bitrate streaming, using mpd manifests and segment files such as m4s
RTMP (Real-Time Messaging Protocol): A real-time messaging protocol (gradually being phased out)

Why is it Important to Understand Streaming Capture?

Many websites use streaming technology to deliver video
Direct video links are often difficult to obtain
Understanding the principles allows for more effective video content capture

2. Analysis of the HLS Streaming Protocol

m3u8 File: The “Directory” of Streaming Media

The m3u8 file is the core of the HLS protocol; it is a text file that contains various information about the video stream:

# Example m3u8 file content
# Each #EXTINF tag gives a segment duration; the segment URI follows on the next line
m3u8_content = """
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.0,
segment0.ts
#EXTINF:10.0,
segment1.ts
#EXTINF:10.0,
segment2.ts
#EXT-X-ENDLIST
"""

TS Files: The "Chunks" of Video

TS (Transport Stream) files are the actual video data blocks, with each TS file containing a few seconds of video content.
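
A quick way to check that a downloaded chunk really is a transport stream: MPEG-TS data is made of 188-byte packets, and each packet starts with the sync byte 0x47. The sketch below tests the first few packets. The helper name looks_like_ts is my own, and the check only applies to unencrypted segments, since encrypted ones will not show the sync byte until they are decrypted.

```python
TS_PACKET_SIZE = 188   # Size of one MPEG-TS packet in bytes
TS_SYNC_BYTE = 0x47    # First byte of every MPEG-TS packet


def looks_like_ts(data, packets_to_check=5):
    """Return True if the first few 188-byte packets start with the sync byte."""
    if len(data) < TS_PACKET_SIZE:
        return False
    available = len(data) // TS_PACKET_SIZE
    count = min(packets_to_check, available)
    return all(data[i * TS_PACKET_SIZE] == TS_SYNC_BYTE for i in range(count))


# Synthetic data for illustration: three packets with a sync byte and zero padding
fake_segment = (bytes([TS_SYNC_BYTE]) + bytes(TS_PACKET_SIZE - 1)) * 3
print(looks_like_ts(fake_segment))
print(looks_like_ts(b'<html>error page</html>' * 20))
```

This is handy because a server sometimes answers with an HTML error page while still returning status 200.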

3. Identifying and Analyzing Streaming Resources

Using Developer Tools for Analysis

Open the browser’s developer tools (F12)
Switch to the Network tab
Filter for “m3u8” or “ts” requests
Play the video and observe the requests that appear
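
If the Network tab shows a lot of traffic, the request list can usually be exported as a HAR file (a JSON log of the requests) and filtered offline. The sketch below assumes the standard HAR layout, where each entry under log.entries carries a request.url field. The function name find_streaming_requests and the sample data are my own.

```python
import json
from urllib.parse import urlparse


def find_streaming_requests(har_text, extensions=('.m3u8', '.ts')):
    """Return request URLs from a HAR export whose path ends with a streaming extension."""
    har = json.loads(har_text)
    hits = []
    for entry in har.get('log', {}).get('entries', []):
        url = entry.get('request', {}).get('url', '')
        # Compare against the path only, so query strings such as ?token=... are ignored
        path = urlparse(url).path.lower()
        if path.endswith(tuple(extensions)):
            hits.append(url)
    return hits


# Hypothetical HAR content for illustration
sample_har = json.dumps({'log': {'entries': [
    {'request': {'url': 'https://example.com/video/index.m3u8?token=abc'}},
    {'request': {'url': 'https://example.com/style.css'}},
    {'request': {'url': 'https://example.com/video/seg0.ts'}},
]}})

for url in find_streaming_requests(sample_har):
    print(url)
```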

Common Streaming Resource Identifiers

# Common streaming file extensions and keywords
streaming_keywords = {
    'm3u8': 'HLS playlist',
    'ts': 'video segment',
    'm4s': 'MPEG-DASH segment',
    'mpd': 'DASH manifest (Media Presentation Description)',
    'f4m': 'Flash media manifest',
    'key': 'encryption key'
}
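
As a rough illustration of how such a table can be used, the sketch below looks up the file extension of a request URL and returns the matching description. The function name classify_streaming_url is my own, and the mapping is repeated inside the block only so that it runs on its own.

```python
import os
from urllib.parse import urlparse

# Same idea as the table above, repeated here so the sketch is self-contained
STREAMING_KEYWORDS = {
    'm3u8': 'HLS playlist',
    'ts': 'video segment',
    'm4s': 'MPEG-DASH segment',
    'mpd': 'DASH manifest',
    'f4m': 'Flash media manifest',
    'key': 'encryption key',
}


def classify_streaming_url(url):
    """Return a description for a streaming-related URL, or None if not recognized."""
    # Use only the path so query strings do not hide the extension
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lstrip('.').lower()
    return STREAMING_KEYWORDS.get(extension)


print(classify_streaming_url('https://example.com/live/index.m3u8?token=abc'))
print(classify_streaming_url('https://example.com/static/site.css'))
```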

4. Basic Streaming Capture Practice

Practical Steps and Ideas:

Find the m3u8 file: By analyzing the webpage source code or network requests
Download the m3u8 file: Obtain video segment information
Parse the m3u8 content: Extract all TS file links
Download TS files: Download video segments one by one
Merge TS files: Combine segments into a complete video

import os
import re
from urllib.parse import urljoin

import requests

class BasicStreamingDownloader:
    def __init__(self, output_dir='videos'):
        self.output_dir = output_dir
        # Create the output directory if it does not exist yet
        os.makedirs(output_dir, exist_ok=True)

    def extract_m3u8_url(self, page_url):
        """
        Extract m3u8 address from the webpage

        Steps:
            1. Download webpage content
            2. Use regular expressions to match m3u8 links
            3. Return the first found m3u8 link
        """
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }
            response = requests.get(
                page_url,
                headers=headers,
                timeout=10
            )
            response.raise_for_status()

            # Use regular expressions to find m3u8 links
            # The match stops at whitespace, quotes, and angle brackets,
            # so it stays inside a single attribute or string
            m3u8_pattern = r'https?://[^\s"\'<>]+?\.m3u8'
            m3u8_urls = re.findall(m3u8_pattern, response.text)

            if m3u8_urls:
                print(f"Found m3u8 address: {m3u8_urls[0]}")
                return m3u8_urls[0]
            else:
                print("No m3u8 address found")
                return None
        except Exception as e:
            print(f"Failed to extract m3u8 address: {e}")
            return None

    def download_m3u8(self, m3u8_url):
        """
        Download m3u8 file and parse it

        Steps:
            1. Send request to get m3u8 content
            2. Save to local file
            3. Return m3u8 content
        """
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Referer': 'https://example.com/'  # Set according to actual situation
            }
            response = requests.get(m3u8_url, headers=headers, timeout=10)
            response.raise_for_status()

            # Save m3u8 file
            m3u8_path = os.path.join(self.output_dir, 'playlist.m3u8')
            with open(m3u8_path, 'w', encoding='utf-8') as f:
                f.write(response.text)

            print(f"m3u8 file saved: {m3u8_path}")
            return response.text
        except Exception as e:
            print(f"Failed to download m3u8 file: {e}")
            return None

    def parse_m3u8(self, m3u8_content, base_url):
        """
        Parse m3u8 content and extract TS file list

        Steps:
            1. Split m3u8 content by line
            2. Skip comment lines and empty lines
            3. Construct complete TS file URLs
        """
        ts_files = []

        # Simple parsing of TS file links
        lines = m3u8_content.split('\n')
        for line in lines:
            line = line.strip()
            # Skip comments (tag lines starting with '#') and empty lines
            if line and not line.startswith('#'):
                # Construct complete TS file URL
                if line.startswith('http'):
                    ts_url = line
                else:
                    ts_url = urljoin(base_url, line)
                ts_files.append(ts_url)

        print(f"Parsed {len(ts_files)} TS files")
        return ts_files

    def download_ts_file(self, ts_url, index):
        """
        Download a single TS file

        Steps:
            1. Send request to get TS file content
            2. Save to local file
            3. Use index to name the file for later merging
        """
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Referer': 'https://example.com/'  # Set according to actual situation
            }
            response = requests.get(ts_url, headers=headers, timeout=30)
            response.raise_for_status()

            # Save TS file; the zero-padded index keeps files in playback order when sorted by name
            ts_filename = f'segment_{index:04d}.ts'
            ts_path = os.path.join(self.output_dir, ts_filename)
            with open(ts_path, 'wb') as f:
                f.write(response.content)

            print(f"Downloaded: {ts_filename}")
            return True
        except Exception as e:
            print(f"Failed to download TS file {ts_url}: {e}")
            return False

# Example usage (please replace with actual test URL)
if __name__ == "__main__":
    downloader = BasicStreamingDownloader()

    # Step 1: Extract m3u8 address from webpage
    page_url = 'https://example.com/video-page'  # Replace with actual video page
    m3u8_url = downloader.extract_m3u8_url(page_url)

    if m3u8_url:
        # Step 2: Download m3u8 file
        m3u8_content = downloader.download_m3u8(m3u8_url)

        if m3u8_content:
            # Step 3: Parse m3u8 to get TS file list
            base_url = m3u8_url.rsplit('/', 1)[0] + '/'  # Get base URL
            ts_files = downloader.parse_m3u8(m3u8_content, base_url)

            # Step 4: Download TS files
            # Only the first 5 are downloaded as a demonstration; a full capture needs all of them
            for i, ts_url in enumerate(ts_files[:5]):
                downloader.download_ts_file(ts_url, i)
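
The last step in the list, merging the TS files, is not covered by the class itself. A minimal sketch, assuming unencrypted segments saved with the segment_0000.ts naming used by download_ts_file: because a transport stream is a sequence of independent packets, appending the files in order usually gives a playable result. The function name merge_ts_files and the merged.ts output name are my own choices.

```python
import glob
import os
import shutil


def merge_ts_files(segment_dir, output_path):
    """Concatenate segment_*.ts files from segment_dir into output_path, in name order.

    Returns the number of segments merged.
    """
    # Zero-padded names (segment_0000.ts, segment_0001.ts, ...) sort in playback order
    segment_paths = sorted(glob.glob(os.path.join(segment_dir, 'segment_*.ts')))
    if not segment_paths:
        return 0
    with open(output_path, 'wb') as merged:
        for path in segment_paths:
            with open(path, 'rb') as segment:
                # Stream the bytes across instead of loading a whole segment into memory
                shutil.copyfileobj(segment, merged)
    return len(segment_paths)
```

A call such as merge_ts_files('videos', os.path.join('videos', 'merged.ts')) would then combine everything in the default output directory. The output name must not match the segment_*.ts pattern, otherwise a second run would pick up the merged file as a segment.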

5. Important Principles for Compliant Capture

1. Respect Copyright and Legal Regulations

# For educational and technical research purposes only, not for commercial use
legal_guidelines = [
    'Do not download copyrighted paid content',
    'Do not redistribute captured content',
    'Only capture publicly accessible content',
    "Comply with the website's robots.txt protocol",
    'Control request frequency to avoid server pressure'
]
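
The last two guidelines can be checked in code. The sketch below uses the standard library's urllib.robotparser to test a URL against robots.txt rules, plus a small throttle that enforces a minimum gap between requests. The names build_robot_rules and Throttle, the MyScraper user agent, and the one-second default are my own assumptions, to be adjusted to the site's stated policy.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt_text):
    """Parse the text of a robots.txt file and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        # Sleep only for the part of the interval that has not already elapsed
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Hypothetical robots.txt content for illustration
rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch('MyScraper', 'https://example.com/video/index.m3u8'))
print(rules.can_fetch('MyScraper', 'https://example.com/private/index.m3u8'))
```

In a download loop, calling wait() on a Throttle instance before each request keeps the request rate bounded.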

2. Technical Limitations and Ethical Considerations

# Ethical boundaries of technical use
ethical_considerations = [
    'Do not bypass paywalls',
    'Do not capture personal privacy content',
    'Not for competitive commercial use',
    'Cite data sources (when used legally)',
    'Follow fair use principles'
]

6. Common Questions and Solutions

1. Unable to Find m3u8 Address

def advanced_m3u8_discovery(page_url):
    """Advanced m3u8 discovery method"""
    # Method 1: Check video tags in the webpage source code
    # Method 2: Analyze XHR and Media types in network requests
    # Method 3: Use regular expressions to match various formats of m3u8 URLs
    # Method 4: Parse video information in JavaScript code
    print("Need to use browser developer tools to manually analyze the video loading process")

2. TS File Download Failure

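import time

def handle_download_failures(downloader, ts_url, index, retries=3):
    """Handle retry mechanism for download failures"""
    # downloader is a BasicStreamingDownloader instance; index names the saved file
    for attempt in range(retries):
        # download_ts_file catches its own exceptions and returns False on failure,
        # so the return value is checked instead of wrapping the call in try/except
        success = downloader.download_ts_file(ts_url, index)
        if success:
            return True
        print(f"Attempt {attempt + 1} failed")
        if attempt < retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
    print(f"Unable to download TS file: {ts_url}")
    return False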

7. Practical Exercise: Complete Process Demonstration

Exercise Objective:

Use publicly available test streaming resources to complete the following steps:

Find the m3u8 file address
Download and parse the m3u8 file
Download some TS files
Try to play the downloaded TS files

Detailed Steps:

1. Find Test Resources:

Use publicly available test streaming websites (e.g., https://test-videos.co.uk/)
Select a test video in HLS format

2. Analyze Web Structure:

Manual steps in the browser to obtain the m3u8 address:

Open the test video webpage
Right-click -> Inspect -> Network panel
Refresh the page and start playing the video
Filter m3u8 requests
Find the m3u8 file address

3. Implement the Downloader:

# Use the BasicStreamingDownloader class provided above
downloader = BasicStreamingDownloader("test_video")
m3u8_url = "Found m3u8 address"  # Replace with the m3u8 address found in step 2

# Download m3u8 file
m3u8_content = downloader.download_m3u8(m3u8_url)

# download_m3u8 returns None on failure, so only continue when there is content
if m3u8_content:
    # Parse TS file list
    base_url = m3u8_url.rsplit('/', 1)[0] + '/'
    ts_files = downloader.parse_m3u8(m3u8_content, base_url)

    # Download the first few TS files
    for i, ts_url in enumerate(ts_files[:3]):
        downloader.download_ts_file(ts_url, i)

4. Verify Results:

Check if the downloaded TS files can play normally
Use VLC or other players to open individual TS files

8. Learning Suggestions

Practice Environment: Use publicly available test videos for practice
Tool Preparation: Familiarize yourself with the use of browser developer tools
Step by Step: Start with simple cases and gradually handle complex situations
Legal Awareness: Always comply with laws and ethical standards

9. Next Learning Preview

In the next article, we will delve into:

How to handle encrypted streaming videos
AES-128 decryption principles and practices
Obtaining and using decryption keys
Automatically identifying and decrypting encrypted streams

Streaming video capture is a highly technical topic, and understanding its basic principles is key to successful capture. In the next article, we will explore techniques for handling encrypted streams!

Next time we will keep exploring Python. A little progress every day adds up, so let's keep learning together! If you have any questions, feel free to leave a comment~

Python Web Scraping Notes: Basics of Streaming Video Capture
