Author: Li Shengsheng (A Little Ant in Management)
Email:[email protected]
Advantages and Disadvantages Provided by Kimi
| Feature | Advantages (3 points) | Disadvantages (3 points) |
|---|---|---|
| 1. Three PSM algorithms independently encapsulated | ① Function semantics are clear and highly readable;② Can be debugged/replaced independently, facilitating robustness testing;③ Decoupled from DID regression, adhering to the single responsibility principle. | ① Code volume expands, with a lot of copy-paste logic (e.g., marking <span>_matched</span>) not abstracted, leading to high maintenance costs;② Matching details (caliper, bandwidth) are hard-coded, lacking flexibility;③ Random seed is not unified at the matching layer, result reproducibility depends on the outer <span>np.random.seed</span>. |
| 2. Double fixed effects DID (individual + time) | ① Explicit control of <span>year dummies</span> or using <span>time_effects=True</span>, making policy evaluation more rigorous;② Clustering at the individual level, standard errors are more robust;③ Retains the constant term, convenient for reporting baseline values. |
① Manually generating dummy variables may introduce high-dimensional collinearity, consuming memory;② If the sample window is long, annual dummies will consume a lot of degrees of freedom;③ Does not consider heterogeneous treatment timing, strictly speaking, it is still “two-way fixed” rather than Callaway-Sant’Anna style heterogeneous treatment effects. |
| 3. Fully automated process (matching → regression → Word output) | ① Lowers the operational threshold, allowing beginners to run it directly;② Output format conforms to the submission habits of Chinese journals, including asterisks, standard errors, and data sources;③ Console and file dual-channel results, facilitating verification. | ① Word table rows and columns are hard-coded, requiring changes to <span>n_rows</span> magic number if variables are added or removed;② Does not provide multi-format export to Excel/LaTeX, poor extensibility;③ Exception paths only print <span>traceback</span>, no log file written, making it difficult to locate issues in large projects. |
| 4. Multiple fault tolerance and prompts | ① Key steps (no valid samples, collinearity, matching failure) provide Chinese prompts, enhancing user experience;② Uses <span>try/except</span> to wrap core blocks, single point failures do not interrupt the entire process;③ Automatically deletes missing values and reports the amount deleted, ensuring data quality transparency. |
① Exception capture is too broad, easily masking real data/model issues;② Fault tolerance still places <span>None</span> into <span>results</span>, subsequent table generation requires additional null checks, leading to logical coupling;③ Only prints unrecorded, reproducing issues requires rerunning, low debugging efficiency. |
| 5. Random seed and reproducibility design | ① Unifies <span>np.random.seed</span> at the beginning of the script, ensuring Monte-Carlo matching and random sampling can be reproduced;② <span>LogisticRegression</span> also sets <span>random_state</span>, providing double assurance. |
Python Code
import pandas as pd
import numpy as np
import statsmodels.api as sm
from linearmodels import PanelOLS
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from docx import Document
import os
# Set random seed
np.random.seed(13579)
# Read data
file_path = r"F:\信息\发表论文代码\JUA\data.dta"
df_original = pd.read_stata(file_path)
# Ensure output directory exists
output_dir = r"F:\Statalearning\do\2025\did2\Out1"
os.makedirs(output_dir, exist_ok=True)
# Define control variables
control_vars = ["indu_third", "advan", "ration", "open", "per_gdp", "science", "num"]
print("=" * 80)
print("PSM-DID Analysis - Three Matching Methods (Controlling for Time Effects)")
print("=" * 80)
def calculate_propensity_scores(df):
"""Calculate propensity scores"""
df_temp = df.copy()
try:
X = df_temp[control_vars].copy()
y = df_temp['treated'].copy()
# Remove missing values
valid_idx = X.notna().all(axis=1) & y.notna()
X_clean = X[valid_idx]
y_clean = y[valid_idx]
df_clean = df_temp[valid_idx].copy().reset_index(drop=True)
if len(X_clean) == 0:
print("No valid data after cleaning")
return None, None
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clean)
# Logistic regression
logit_model = LogisticRegression(random_state=13579, max_iter=1000)
logit_model.fit(X_scaled, y_clean)
propensity_scores = logit_model.predict_proba(X_scaled)[:, 1]
df_clean['_ps'] = propensity_scores
print(f"Propensity score range: [{propensity_scores.min():.3f}, {propensity_scores.max():.3f}]")
print(f"Treatment group sample size: {(df_clean['treated'] == 1).sum()}")
print(f"Control group sample size: {(df_clean['treated'] == 0).sum()}")
return df_clean, propensity_scores
except Exception as e:
print(f"Failed to calculate propensity scores: {e}")
return None, None
# ... (rest of the code remains unchanged) ...