Practical Data Analysis with Python: Uncovering Millions in E-commerce Sales Data

1. Case Background: Annual Sales Data from an E-commerce Platform

Dataset: 2024 All-Platform Order Data (Simulated)

Order ID, User ID, Product Category, Unit Price, Quantity, Payment Time, City
Dataset Size: 500,000 orders, 200,000 users, 1000+ SKUs
Analysis Goals: Identify high-potential products, optimize marketing strategies, enhance user repurchase rates

2. Data Preparation and Cleaning

1. Generate Simulated Data (Pandas)

import pandas as pd  
import numpy as np  

# Generate 500,000 simulated data  
dates = pd.date_range("2024-01-01", "2024-12-31", freq="H")  
categories = ["Mobile", "Computer", "Home Appliances", "Clothing", "Beauty", "Food"]  
  
data = pd.DataFrame({  
    "order_id": np.arange(500000),  
    "user_id": np.random.choice(200000, 500000),  
    "category": np.random.choice(categories, 500000, p=[0.3,0.2,0.15,0.15,0.1,0.1]),  
    "price": np.round(np.abs(np.random.normal(2000,1500,500000)),2),  
    "quantity": np.random.randint(1,5,500000),  
    "city": np.random.choice(["Beijing","Shanghai","Guangzhou","Shenzhen","Hangzhou"],500000),  
    "pay_time": np.random.choice(dates,500000)  
})  
  
# Add noise data  
data.loc[::1000, "price"] = np.nan  # Randomly missing prices  
data.loc[::500, "quantity"] = 999  # Abnormal order quantity

2. Key Steps in Data Cleaning

# Remove outliers  
data = data[(data["quantity"] > 0) & (data["quantity"] < 10)]  
  
# Fill missing prices (by category median)  
data["price"] = data.groupby("category")["price"].transform(  
    lambda x: x.fillna(x.median())  
)  
  
# Construct feature fields  
data["month"] = data["pay_time"].dt.month  
data["hour"] = data["pay_time"].dt.hour  
data["revenue"] = data["price"] * data["quantity"]

3. Core Analysis: Five Key Business Insights

1. Product Sales Trend Analysis

import seaborn as sns  
  
# Monthly sales trend  
monthly_sales = data.groupby(["month","category"]) ["revenue"].sum().unstack()  
sns.relplot(  
    data=monthly_sales.T,  
    kind="line",  
    dashes=False,  
    markers=True,  
    height=5,  
    aspect=2  
)  
plt.title("Monthly Sales Trends by Category in 2024")

Conclusion:

Mobile and computer categories peaked in June (618) and November (Double 11)
Food category saw a surge in demand in December (New Year purchases)

2. User Value Segmentation (RFM Model)

# Calculate RFM metrics  
rfm = data.groupby("user_id").agg({  
    "pay_time": lambda x: (pd.Timestamp.now() - x.max()).days,  
    "order_id": "count",  
    "revenue": "sum"  
}).rename(columns={  
    "pay_time": "Recency",  
    "order_id": "Frequency",  
    "revenue": "Monetary"  
})  
  
# Binning scores  
rfm["R_Score"] = pd.qcut(rfm["Recency"], 5, labels=range(5,0,-1))  
rfm["F_Score"] = pd.qcut(rfm["Frequency"], 5, labels=range(1,6))  
rfm["M_Score"] = pd.qcut(rfm["Monetary"], 5, labels=range(1,6))  
rfm["Total_Score"] = rfm[["R_Score","F_Score","M_Score"]].sum(axis=1)  
  
# User segmentation  
rfm["Segment"] = pd.cut(  
    rfm["Total_Score"],  
    bins=[0,6,9,12,15],  
    labels=["Churned Users","Regular Users","Valuable Users","Core Users"]  
)

Operational Strategy:

Core Users: Provide exclusive VIP discounts
Churned Users: Send recall coupons

3. City Consumption Characteristics Analysis

# City consumption power matrix analysis  
city_matrix = pd.pivot_table(  
    data=data,  
    values=["revenue","quantity"],  
    index="city",  
    aggfunc={"revenue":"sum", "quantity":"mean"}  
)  
  
sns.jointplot(  
    x=city_matrix["quantity"],  
    y=city_matrix["revenue"],  
    hue=city_matrix.index,  
    height=8  
)

Findings:

Beijing users: High average order value, low frequency (average order amount ¥3200)
Hangzhou users: High frequency, low average order value (average order amount ¥980)

4. 24-Hour Shopping Behavior Analysis

hourly_pattern = data.groupby("hour").agg({  
    "order_id": "count",  
    "revenue": "sum"  
})  
  
# Dual-axis chart drawing  
fig, ax1 = plt.subplots(figsize=(12,6))  
ax2 = ax1.twinx()  
  
sns.lineplot(data=hourly_pattern["order_id"], ax=ax1, color="b", label="Order Volume")  
sns.lineplot(data=hourly_pattern["revenue"], ax=ax2, color="r", label="Sales Amount")  
  
ax1.set_ylabel("Order Volume", color="b")  
ax2.set_ylabel("Sales Amount (10,000 Yuan)", color="r")  
plt.title("24-Hour Transaction Trend Analysis")

Operational Insights:

Peak traffic at 8 PM: Suitable for flash sales
Low period from 1-5 AM: Push coupons to stimulate consumption

5. Product Association Analysis (Apriori Algorithm)

from mlxtend.frequent_patterns import apriori  
  
# Convert order-product matrix  
basket = data.groupby(["order_id","category"]) ["quantity"].sum().unstack().fillna(0)  
basket = (basket > 0).astype(int)  
  
# Mine frequent itemsets  
frequent_items = apriori(basket, min_support=0.02, use_colnames=True)  
  
# Association rules  
from mlxtend.frequent_patterns import association_rules  
rules = association_rules(frequent_items, metric="lift", min_threshold=1.5)  
print(rules[rules["lift"]>3].sort_values("confidence", ascending=False))

Recommendation Strategy:

Customers who buy mobile phones have a 65% probability of buying headphones (confidence=0.65)
Customers who buy beauty products have a 58% probability of buying skincare products

4. Automating Analysis Reports (Jupyter Dashboard)

1. Building Interactive Dashboards with Panel

import panel as pn  
pn.extension()  
  
# Define controls  
category_select = pn.widgets.Select(name="Category", options=categories)  
city_select = pn.widgets.MultiSelect(name="City", options=["Beijing","Shanghai","Guangzhou","Shenzhen","Hangzhou"])  
  
# Interactive callback function  
@pn.depends(category_select.param.value, city_select.param.value)  
def update_plots(category, cities):  
    filtered = data[data["category"].isin([category]) & data["city"].isin(cities)]  
    # Generate dynamic charts...  
  
# Layout combination  
dashboard = pn.Row(  
    pn.Column(category_select, city_select),  
    update_plots  
)  
dashboard.servable()

5. Pitfall Guide: Enterprise-Level Analysis Experience

Data TimelinessOrder data must include payment status (to prevent false transactions interference)
User Privacy ProtectionData desensitization must be performed before analysis
Model OverfittingUse time series validation sets to evaluate predictive models
Cost AccountingHigh GMV does not equal high profit; cost data must be associated
Result VisualizationAvoid misleading charts like 3D pie charts

Today’s Challenge: Based on user segmentation results, design a text message recall strategy for “Churned Users” and submit your proposal in the comments section

“