Simulating Wage Data for 11 Million Residents in Beijing Using Python: A Comprehensive Practice from Statistics to Validation

In data analysis and economic research, simulating wage distribution is a common yet complex task.

Real wage data is often difficult to obtain due to privacy restrictions, but by simulating based on known macro indicators (such as average wage and Gini coefficient), we can generate reasonable synthetic data for algorithm testing, policy evaluation, or academic research.

This article will demonstrate how to simulate specific wage data from known statistical indicators using Python’s pandas and numpy libraries, based on publicly available data from Beijing and Shanghai, and rigorously validate the reliability of the simulation results. Through this practice, we can not only verify the feasibility of the code but also gain a deeper understanding of the underlying logic of the data generation process.

1. Background and Objectives

Wage data typically exhibits a right-skewed distribution (i.e., a small number of high earners raise the average), and is influenced by the regional economic structure. Official data often only provides macro indicators (such as average wage and Gini coefficient), while micro-level individual data is missing. The purposes of simulating this data include:

Algorithm validation: Testing wage prediction models or income inequality analysis tools.
Policy simulation: Evaluating the impact of tax and social security policy adjustments on income distribution.
Data augmentation: Generating synthetic datasets when real data is insufficient.

This article takes Beijing and Shanghai as examples, with known data including:

Resident population, employed population, average wage, social security contribution lower limit, Gini coefficient.
Objective: Generate simulated wage data for 11 million people and check that its summary statistics (median, extremes, percentiles) are consistent with a distribution derived from the macro indicators.

2. Data Preparation and Simulation Principles

2.1 Input Data

We extracted key indicators for Beijing and Shanghai from public data for the year 2024 (see table below) and stored them as a DataFrame:

City	Resident Population at End of 2024 (10,000)	Average Wage of Urban Employees (CNY/Month)	Employed Population (10,000)	Social Security Contribution Lower Limit (CNY/Month)	Gini Coefficient
Beijing	2183.20	11937.00	1100.00	7162.00	0.462
Shanghai	2480.26	12434.00	1345.00	7460.00	0.462

The Gini coefficient (0.462) indicates a high level of income inequality (the international alert line is 0.4), which requires the simulated data to capture tail differences.
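
To make the Gini coefficient concrete, here is a minimal sketch that computes it for a sample from the cumulative sums of the sorted values. The helper name gini_coefficient is an assumption for illustration and is not part of the original code.

```python
import numpy as np

def gini_coefficient(values):
    """Sample Gini coefficient of non-negative values (hypothetical helper)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    # Equivalent to the mean-absolute-difference definition, computed
    # from the cumulative sums of the sorted values.
    cum = np.cumsum(x)
    return (n + 1 - 2 * cum.sum() / total) / n

print(gini_coefficient([1, 1, 1, 1]))  # 0.0  (everyone earns the same)
print(gini_coefficient([0, 0, 0, 1]))  # 0.75 (one person earns everything)
```

Applied to a simulated wage array, the same function gives a direct check of how close the sample is to the 0.462 target.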

2.2 Simulation Principle: Why Use Log-Normal Distribution?

Wage data is commonly modeled with a log-normal distribution for the following reasons:

Wages are positive values, with a lower limit of 0 (constrained by minimum wage in practice).
The distribution is right-skewed, reflecting reality (a small number of high earners raise the average).

There is a mathematical relationship between the Gini coefficient and the parameter σ of the log-normal distribution:

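\[ G = 2\,\Phi\!\left(\frac{\sigma}{\sqrt{2}}\right) - 1 = \operatorname{erf}\!\left(\frac{\sigma}{2}\right), \qquad \sigma = \sqrt{2}\,\Phi^{-1}\!\left(\frac{G + 1}{2}\right) \]

where Φ is the standard normal cumulative distribution function. The code in this article does not invert this identity exactly; it uses the simpler expression σ = sqrt(ln(1 + √3·G)), which yields a somewhat smaller σ (about 0.77 instead of about 0.87 for G = 0.462).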

By inferring σ from the Gini coefficient and then solving the mean identity E[X] = exp(μ + σ²/2) for μ, that is μ = ln(average wage) − σ²/2, we obtain both parameters of the distribution.
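
As a worked example of this step, the sketch below computes σ in two ways (the exact log-normal identity G = erf(σ/2) and the expression used in this article's simulate_wages) and then derives μ from the Beijing average wage. The function names are illustrative assumptions, not part of the original code.

```python
import math
from statistics import NormalDist

def sigma_exact(gini):
    # Invert G = 2 * Phi(sigma / sqrt(2)) - 1 for a log-normal distribution
    return math.sqrt(2) * NormalDist().inv_cdf((gini + 1) / 2)

def sigma_heuristic(gini):
    # Expression used by simulate_wages in this article
    return math.sqrt(math.log(1 + gini * math.sqrt(3)))

def mu_from_mean(avg_wage, sigma):
    # E[X] = exp(mu + sigma^2 / 2)  =>  mu = ln(E[X]) - sigma^2 / 2
    return math.log(avg_wage) - 0.5 * sigma ** 2

gini, avg_wage = 0.462, 11937.0
for name, s in [("exact", sigma_exact(gini)), ("heuristic", sigma_heuristic(gini))]:
    mu = mu_from_mean(avg_wage, s)
    print(f"{name}: sigma={s:.4f}, mu={mu:.4f}, "
          f"median={math.exp(mu):.0f}, implied Gini={math.erf(s / 2):.3f}")
```

With the heuristic, the implied Gini of the unconstrained log-normal is roughly 0.41 and not 0.462, which is worth keeping in mind when interpreting the simulated data.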

2.3 Core Algorithm: simulate_wages Function

Function Input: Average wage, employed population, Gini coefficient, social security lower limit (as wage lower limit).

Steps:

Calculate the log-space standard deviation σ from the Gini coefficient.
Calculate the log-space mean μ so that the distribution's expected value equals the target average wage.
Generate random numbers from the log-normal distribution and rescale them so that the sample mean matches the target exactly.
Apply the wage lower limit (any wage below the social security contribution lower limit is raised to that limit).
Rescale once more to restore the target mean; because this multiplies every wage by a factor below 1, the smallest wages end up slightly below the lower limit.

import pandas as pd
import numpy as np

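data = {
    '城市': ['北京市', '上海市'],  # city: Beijing, Shanghai
    '2024年末常住人口(万)': [2183.20, 2480.26],  # resident population at end of 2024 (10,000 persons)
    '城镇单位就业人员平均工资(元/月)': [11937.00, 12434.00],  # average wage of urban employees (CNY/month)
    '常住就业人口(万)': [1100.00, 1345.00],  # employed population (10,000 persons)
    '社保缴费下限(元/月)': [7162.00, 7460.00],  # social security contribution lower limit (CNY/month)
    '基尼系数': [0.462, 0.462]  # Gini coefficient
}

df = pd.DataFrame(data)

# Add columns to hold the simulated statistics
df['工资中位数'] = None  # median wage
df['最低工资'] = None  # minimum wage
df['最高工资'] = None  # maximum wage
df['前10%平均工资'] = None  # named "top-10% average wage"; filled with the 90th percentile below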

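def simulate_wages(avg_wage, employment, gini, min_wage):
    n = int(employment * 10000)  # employment is given in units of 10,000 persons
    sigma = np.sqrt(np.log(1 + gini * np.sqrt(3)))  # log-space sigma derived from the Gini coefficient
    mu = np.log(avg_wage) - 0.5 * sigma**2  # log-space mean so that E[wage] = avg_wage
    wages = np.random.lognormal(mu, sigma, n)  # draw the initial sample
    wages = wages * (avg_wage / wages.mean())  # rescale to the target mean
    wages = np.maximum(wages, min_wage)  # apply the lower limit
    # Rescale again to the target mean; this pushes the floor slightly below min_wage
    wages = wages * (avg_wage / wages.mean())
    return wages
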
# Simulate wages for each city and record the results in the DataFrame
for idx, city in enumerate(df['城市']):
    city_data = df.iloc[idx]
    wages = simulate_wages(
        city_data['城镇单位就业人员平均工资(元/月)'],
        city_data['常住就业人口(万)'],
        city_data['基尼系数'],
        city_data['社保缴费下限(元/月)']
    )

    # Store the summary statistics
    df.at[idx, '工资中位数'] = np.median(wages)
    df.at[idx, '最低工资'] = wages.min()
    df.at[idx, '最高工资'] = wages.max()
    # Despite the column name, this is the 90th percentile, not the mean of the top 10%
    df.at[idx, '前10%平均工资'] = np.percentile(wages, 90)
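
The second rescaling in simulate_wages multiplies every wage, including those sitting at the floor, by a factor below 1, so the minimum ends up under min_wage. One possible alternative, which is not the author's method, is to scale only the part of each wage above the floor. The helper name rescale_above_floor is hypothetical, and the sketch assumes avg_wage is greater than min_wage.

```python
import numpy as np

def rescale_above_floor(wages, avg_wage, min_wage):
    """Match the target mean while keeping every wage >= min_wage.

    Hypothetical helper: only the excess above the floor is scaled.
    """
    if avg_wage <= min_wage:
        raise ValueError("avg_wage must exceed min_wage")
    floored = np.maximum(wages, min_wage)
    excess = floored - min_wage          # non-negative by construction
    if excess.mean() == 0:
        raise ValueError("no wage lies above the floor")
    scale = (avg_wage - min_wage) / excess.mean()
    return min_wage + excess * scale     # mean is exactly avg_wage

rng = np.random.default_rng(0)           # seeded for reproducibility
sigma = np.sqrt(np.log(1 + 0.462 * np.sqrt(3)))
mu = np.log(11937.0) - 0.5 * sigma**2
sample = rng.lognormal(mu, sigma, 100_000)
adjusted = rescale_above_floor(sample, 11937.0, 7162.0)
print(adjusted.mean(), adjusted.min())
```

The trade-off is that the upper tail is compressed, so the Gini coefficient of the adjusted sample shifts as well and should be re-checked.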

3. Accurate Simulation and Validation

The initial simulation may not fully match all statistical metrics (such as median, percentiles). Therefore, we introduce a second function, simulate_wages_exact, which redraws the sample up to max_iterations times, rescales each draw to the target median, and keeps the draw with the smallest total absolute error against the targets.

3.1 Accurate Simulation Algorithm

def simulate_wages_exact(avg_wage, employment, gini, min_wage, target_median, target_min, target_max, target_p90, max_iterations=100):
    n = int(employment)  # number of individuals (not units of 10,000 as in simulate_wages)
    best_error = float('inf')
    best_wages = None

    # The distribution parameters do not change between iterations
    sigma = np.sqrt(np.log(1 + gini * np.sqrt(3)))
    mu = np.log(avg_wage) - 0.5 * sigma**2

    for _ in range(max_iterations):
        # Draw a base wage sample
        wages = np.random.lognormal(mu, sigma, n)

        # Apply the lower and upper bounds
        wages = np.maximum(wages, min_wage)
        wages = np.minimum(wages, target_max)

        # Rescale so that the sample median equals the target median
        current_median = np.median(wages)
        if current_median > 0:
            wages = wages * (target_median / current_median)

        # Re-apply the bounds after rescaling
        wages = np.maximum(wages, min_wage)
        wages = np.minimum(wages, target_max)

        # Total absolute error across the four target statistics
        error = (
            abs(np.median(wages) - target_median) +
            abs(wages.min() - target_min) +
            abs(wages.max() - target_max) +
            abs(np.percentile(wages, 90) - target_p90)
        )

        if error < best_error:
            best_error = error
            best_wages = wages.copy()

        if error < 1:  # stop early once the error is small enough
            break

    return best_wages

3.2 Generating and Validating Data

Taking Beijing as an example, we generate wage data for 11,000,000 people and compare the target values with the actual values:

n = 11000000  # number of individuals to simulate
beijing_data = df[df['城市'] == '北京市'].iloc[0]

# Simulate wages for n individuals
wages_n = simulate_wages_exact(
    beijing_data['城镇单位就业人员平均工资(元/月)'],
    n,
    beijing_data['基尼系数'],
    beijing_data['社保缴费下限(元/月)'],
    beijing_data['工资中位数'],
    beijing_data['最低工资'],
    beijing_data['最高工资'],
    beijing_data['前10%平均工资']
)

# Store the n simulated wages in a new DataFrame
wages_df = pd.DataFrame({
    '序号': range(1, n+1),  # row number
    '工资(元/月)': wages_n  # wage (CNY/month)
})

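# Compare the targets with the simulated statistics
print(f"\nBeijing wage statistics for {n} people (validation):")
print(f"Target median wage: {beijing_data['工资中位数']:.2f} CNY/month")
print(f"Actual median wage: {np.median(wages_n):.2f} CNY/month")
print(f"Target minimum wage: {beijing_data['最低工资']:.2f} CNY/month")
print(f"Actual minimum wage: {wages_n.min():.2f} CNY/month")
print(f"Target maximum wage: {beijing_data['最高工资']:.2f} CNY/month")
print(f"Actual maximum wage: {wages_n.max():.2f} CNY/month")
print(f"Target 90th percentile wage: {beijing_data['前10%平均工资']:.2f} CNY/month")
print(f"Actual 90th percentile wage: {np.percentile(wages_n, 90):.2f} CNY/month")

wages_df.to_csv('large_dataset.csv', index=False)

Output results:

Beijing wage statistics for 11000000 people (validation):
Target median wage: 8191.42 CNY/month
Actual median wage: 8191.42 CNY/month
Target minimum wage: 6593.56 CNY/month
Actual minimum wage: 7162.00 CNY/month
Target maximum wage: 381444.51 CNY/month
Actual maximum wage: 351438.69 CNY/month
Target 90th percentile wage: 21877.70 CNY/month
Actual 90th percentile wage: 21884.44 CNY/month

Validation shows:

The median matches exactly, which is expected because simulate_wages_exact rescales each sample to the target median. The 90th percentile (stored under the column name 前10%平均工资, "top-10% average wage") differs by about 0.03%, so the two samples agree in shape. Note that the targets were themselves produced by simulate_wages, so this is a consistency check between the two functions and not a comparison with observed data.
The minimum wage deviation: the target (6593.56) lies below the social security lower limit because the final rescaling in simulate_wages lowers the floor, whereas simulate_wages_exact applies the limit as its last step, so the actual minimum is exactly 7162.00, which aligns with policy reality.
The maximum wage deviation: the actual maximum is about 8% below the target. This is consistent with the upper bound being applied before the median rescaling (a factor of roughly 0.92) and no longer binding afterward. Extreme values of a random sample are difficult to control precisely in any case, but the order of magnitude is consistent and a single extreme observation has little effect on the overall distribution.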
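
Because the column named 前10%平均工资 ("top-10% average wage") actually holds the 90th percentile, it helps to compute both quantities side by side, together with the mean, which the validation above does not check. The helper summarize_wages below is a hypothetical sketch, demonstrated on a small toy array so it runs instantly.

```python
import numpy as np

def summarize_wages(wages):
    """Return common summary statistics of a wage array (hypothetical helper)."""
    wages = np.asarray(wages, dtype=float)
    p90 = np.percentile(wages, 90)
    return {
        "mean": wages.mean(),
        "median": np.median(wages),
        "p90": p90,                                # threshold for the top 10%
        "top10_mean": wages[wages >= p90].mean(),  # average within the top 10%
        "min": wages.min(),
        "max": wages.max(),
    }

# Toy data: the integers 1..100
stats = summarize_wages(np.arange(1, 101))
for key, value in stats.items():
    print(f"{key}: {value:.2f}")
```

On the toy data the 90th percentile is about 90.1 while the top-10% mean is 95.5, which shows that the two statistics should not be used interchangeably.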

4. In-Depth Discussion and Application Prospects

4.1 Limitations of Simulation

Extreme value control: The simulation error of the maximum wage arises from the tail characteristics of the log-normal distribution, which can be improved through extreme value theory.
Lack of dynamics: This simulation considers static cross-sectional data and does not account for wage changes over time.
Structural assumptions: The simulation relies on the log-normal assumption. If the true distribution is more complex (e.g., bimodal), a mixture model should be used instead.
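
As an illustration of the mixture idea, the sketch below draws wages from two log-normal components, for example a large ordinary segment and a smaller high-pay segment. All weights and parameters are made-up values chosen for illustration, not estimates for Beijing or Shanghai, and simulate_mixture is a hypothetical name.

```python
import numpy as np

def simulate_mixture(n, weights, mus, sigmas, seed=None):
    """Draw n wages from a mixture of log-normal components (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize to probabilities
    # Pick a component for every individual, then draw from that component
    component = rng.choice(len(weights), size=n, p=weights)
    mu_i = np.asarray(mus, dtype=float)[component]
    sigma_i = np.asarray(sigmas, dtype=float)[component]
    return rng.lognormal(mu_i, sigma_i)

wages = simulate_mixture(
    n=100_000,
    weights=[0.8, 0.2],                    # assumed segment shares
    mus=[np.log(8000), np.log(30000)],     # assumed log-space means (segment medians)
    sigmas=[0.4, 0.6],                     # assumed log-space standard deviations
    seed=42,
)
print(np.median(wages), wages.mean())
```

With real data, the component parameters would have to be estimated, for instance by fitting a Gaussian mixture to log wages.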

4.2 Code Feasibility Verification

By comparing target values with simulated values, the feasibility of the code has been verified in the following aspects:

Distribution shape: The close agreement of the median and the 90th percentile indicates that both simulation runs reproduce the same right-skewed shape.
Policy constraints: Applying the social security lower limit as the final step keeps every simulated wage at or above the contribution floor.
Scalability: The code is easy to modify to include more variables (such as industry, education level).

4.3 Practical Application Scenarios

Academic research: Generating microdata for econometric analysis.
Corporate decision-making: Simulating wage costs in different cities to assist human resource planning.
Policy evaluation: Testing the impact of basic income and tax reforms on distribution.

Conclusion

This article demonstrates, through a complete Python example, how to simulate microdata from macro wage indicators and rigorously validate its statistical consistency.

This process not only proves the feasibility of the code but also highlights the value of data simulation in bridging information gaps. In the future, machine learning methods (such as GANs) could be further integrated to generate more complex distributions or combined with time series forecasting to predict dynamic wage changes.

This is a piece of code, but also an idea. Can you extend it to generate random ages as well?