Implementing PSM-DID in Python

Author: Li Shengsheng (A Little Ant in Management)

Email:[email protected]

Advantages and Disadvantages Provided by Kimi

Feature Advantages (3 points) Disadvantages (3 points)
1. Three PSM algorithms independently encapsulated ① Function semantics are clear and highly readable;② Can be debugged/replaced independently, facilitating robustness testing;③ Decoupled from DID regression, adhering to the single responsibility principle. ① Code volume expands, with a lot of copy-paste logic (e.g., marking <span>_matched</span>) not abstracted, leading to high maintenance costs;② Matching details (caliper, bandwidth) are hard-coded, lacking flexibility;③ Random seed is not unified at the matching layer, result reproducibility depends on the outer <span>np.random.seed</span>.
2. Double fixed effects DID (individual + time) ① Explicit control of <span>year dummies</span> or using <span>time_effects=True</span>, making policy evaluation more rigorous;② Clustering at the individual level, standard errors are more robust;③ Retains the constant term, convenient for reporting baseline values. ① Manually generating dummy variables may introduce high-dimensional collinearity, consuming memory;② If the sample window is long, annual dummies will consume a lot of degrees of freedom;③ Does not consider heterogeneous treatment timing, strictly speaking, it is still “two-way fixed” rather than Callaway-Sant’Anna style heterogeneous treatment effects.
3. Fully automated process (matching → regression → Word output) ① Lowers the operational threshold, allowing beginners to run it directly;② Output format conforms to the submission habits of Chinese journals, including asterisks, standard errors, and data sources;③ Console and file dual-channel results, facilitating verification. ① Word table rows and columns are hard-coded, requiring changes to <span>n_rows</span> magic number if variables are added or removed;② Does not provide multi-format export to Excel/LaTeX, poor extensibility;③ Exception paths only print <span>traceback</span>, no log file written, making it difficult to locate issues in large projects.
4. Multiple fault tolerance and prompts ① Key steps (no valid samples, collinearity, matching failure) provide Chinese prompts, enhancing user experience;② Uses <span>try/except</span> to wrap core blocks, single point failures do not interrupt the entire process;③ Automatically deletes missing values and reports the amount deleted, ensuring data quality transparency. ① Exception capture is too broad, easily masking real data/model issues;② Fault tolerance still places <span>None</span> into <span>results</span>, subsequent table generation requires additional null checks, leading to logical coupling;③ Only prints unrecorded, reproducing issues requires rerunning, low debugging efficiency.
5. Random seed and reproducibility design ① Unifies <span>np.random.seed</span> at the beginning of the script, ensuring Monte-Carlo matching and random sampling can be reproduced;<span>LogisticRegression</span> also sets <span>random_state</span>, providing double assurance.

Python Code

import pandas as pd
import numpy as np
import statsmodels.api as sm
from linearmodels import PanelOLS
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from docx import Document
import os

# Set random seed
np.random.seed(13579)

# Read data
file_path = r"F:\信息\发表论文代码\JUA\data.dta"
df_original = pd.read_stata(file_path)

# Ensure output directory exists
output_dir = r"F:\Statalearning\do\2025\did2\Out1"
os.makedirs(output_dir, exist_ok=True)

# Define control variables
control_vars = ["indu_third", "advan", "ration", "open", "per_gdp", "science", "num"]

print("=" * 80)
print("PSM-DID Analysis - Three Matching Methods (Controlling for Time Effects)")
print("=" * 80)

def calculate_propensity_scores(df):
    """Calculate propensity scores"""
    df_temp = df.copy()
    
    try:
        X = df_temp[control_vars].copy()
        y = df_temp['treated'].copy()
        
        # Remove missing values
        valid_idx = X.notna().all(axis=1) &amp; y.notna()
        X_clean = X[valid_idx]
        y_clean = y[valid_idx]
        df_clean = df_temp[valid_idx].copy().reset_index(drop=True)
        
        if len(X_clean) == 0:
            print("No valid data after cleaning")
            return None, None
            
        # Standardize
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X_clean)
        
        # Logistic regression
        logit_model = LogisticRegression(random_state=13579, max_iter=1000)
        logit_model.fit(X_scaled, y_clean)
        propensity_scores = logit_model.predict_proba(X_scaled)[:, 1]
        df_clean['_ps'] = propensity_scores
        
        print(f"Propensity score range: [{propensity_scores.min():.3f}, {propensity_scores.max():.3f}]")
        print(f"Treatment group sample size: {(df_clean['treated'] == 1).sum()}")
        print(f"Control group sample size: {(df_clean['treated'] == 0).sum()}")
        
        return df_clean, propensity_scores
        
    except Exception as e:
        print(f"Failed to calculate propensity scores: {e}")
        return None, None

# ... (rest of the code remains unchanged) ...

Leave a Comment