Practical Data Analysis with Python: Uncovering Millions in E-commerce Sales Data
1. Case Background: Annual Sales Data from an E-commerce Platform
Dataset: 2024 All-Platform Order Data (Simulated)
- Order ID, User ID, Product Category, Unit Price, Quantity, Payment Time, City
- Dataset Size: 500,000 orders, 200,000 users, 1000+ SKUs
- Analysis Goals: Identify high-potential products, optimize marketing strategies, enhance user repurchase rates
2. Data Preparation and Cleaning
1. Generate Simulated Data (Pandas)
import pandas as pd
import numpy as np
# Generate 500,000 simulated data
dates = pd.date_range("2024-01-01", "2024-12-31", freq="H")
categories = ["Mobile", "Computer", "Home Appliances", "Clothing", "Beauty", "Food"]
data = pd.DataFrame({
"order_id": np.arange(500000),
"user_id": np.random.choice(200000, 500000),
"category": np.random.choice(categories, 500000, p=[0.3,0.2,0.15,0.15,0.1,0.1]),
"price": np.round(np.abs(np.random.normal(2000,1500,500000)),2),
"quantity": np.random.randint(1,5,500000),
"city": np.random.choice(["Beijing","Shanghai","Guangzhou","Shenzhen","Hangzhou"],500000),
"pay_time": np.random.choice(dates,500000)
})
# Add noise data
data.loc[::1000, "price"] = np.nan # Randomly missing prices
data.loc[::500, "quantity"] = 999 # Abnormal order quantity
import pandas as pd
import numpy as np
# Generate 500,000 simulated data
dates = pd.date_range("2024-01-01", "2024-12-31", freq="H")
categories = ["Mobile", "Computer", "Home Appliances", "Clothing", "Beauty", "Food"]
data = pd.DataFrame({
"order_id": np.arange(500000),
"user_id": np.random.choice(200000, 500000),
"category": np.random.choice(categories, 500000, p=[0.3,0.2,0.15,0.15,0.1,0.1]),
"price": np.round(np.abs(np.random.normal(2000,1500,500000)),2),
"quantity": np.random.randint(1,5,500000),
"city": np.random.choice(["Beijing","Shanghai","Guangzhou","Shenzhen","Hangzhou"],500000),
"pay_time": np.random.choice(dates,500000)
})
# Add noise data
data.loc[::1000, "price"] = np.nan # Randomly missing prices
data.loc[::500, "quantity"] = 999 # Abnormal order quantity
2. Key Steps in Data Cleaning
# Remove outliers
data = data[(data["quantity"] > 0) & (data["quantity"] < 10)]
# Fill missing prices (by category median)
data["price"] = data.groupby("category")["price"].transform(
lambda x: x.fillna(x.median())
)
# Construct feature fields
data["month"] = data["pay_time"].dt.month
data["hour"] = data["pay_time"].dt.hour
data["revenue"] = data["price"] * data["quantity"]
# Remove outliers
data = data[(data["quantity"] > 0) & (data["quantity"] < 10)]
# Fill missing prices (by category median)
data["price"] = data.groupby("category")["price"].transform(
lambda x: x.fillna(x.median())
)
# Construct feature fields
data["month"] = data["pay_time"].dt.month
data["hour"] = data["pay_time"].dt.hour
data["revenue"] = data["price"] * data["quantity"]
3. Core Analysis: Five Key Business Insights
1. Product Sales Trend Analysis
import seaborn as sns
# Monthly sales trend
monthly_sales = data.groupby(["month","category"]) ["revenue"].sum().unstack()
sns.relplot(
data=monthly_sales.T,
kind="line",
dashes=False,
markers=True,
height=5,
aspect=2
)
plt.title("Monthly Sales Trends by Category in 2024")
import seaborn as sns
# Monthly sales trend
monthly_sales = data.groupby(["month","category"]) ["revenue"].sum().unstack()
sns.relplot(
data=monthly_sales.T,
kind="line",
dashes=False,
markers=True,
height=5,
aspect=2
)
plt.title("Monthly Sales Trends by Category in 2024")
Conclusion:
- Mobile and computer categories peaked in June (618) and November (Double 11)
- Food category saw a surge in demand in December (New Year purchases)
2. User Value Segmentation (RFM Model)
# Calculate RFM metrics
rfm = data.groupby("user_id").agg({
"pay_time": lambda x: (pd.Timestamp.now() - x.max()).days,
"order_id": "count",
"revenue": "sum"
}).rename(columns={
"pay_time": "Recency",
"order_id": "Frequency",
"revenue": "Monetary"
})
# Binning scores
rfm["R_Score"] = pd.qcut(rfm["Recency"], 5, labels=range(5,0,-1))
rfm["F_Score"] = pd.qcut(rfm["Frequency"], 5, labels=range(1,6))
rfm["M_Score"] = pd.qcut(rfm["Monetary"], 5, labels=range(1,6))
rfm["Total_Score"] = rfm[["R_Score","F_Score","M_Score"]].sum(axis=1)
# User segmentation
rfm["Segment"] = pd.cut(
rfm["Total_Score"],
bins=[0,6,9,12,15],
labels=["Churned Users","Regular Users","Valuable Users","Core Users"]
)
# Calculate RFM metrics
rfm = data.groupby("user_id").agg({
"pay_time": lambda x: (pd.Timestamp.now() - x.max()).days,
"order_id": "count",
"revenue": "sum"
}).rename(columns={
"pay_time": "Recency",
"order_id": "Frequency",
"revenue": "Monetary"
})
# Binning scores
rfm["R_Score"] = pd.qcut(rfm["Recency"], 5, labels=range(5,0,-1))
rfm["F_Score"] = pd.qcut(rfm["Frequency"], 5, labels=range(1,6))
rfm["M_Score"] = pd.qcut(rfm["Monetary"], 5, labels=range(1,6))
rfm["Total_Score"] = rfm[["R_Score","F_Score","M_Score"]].sum(axis=1)
# User segmentation
rfm["Segment"] = pd.cut(
rfm["Total_Score"],
bins=[0,6,9,12,15],
labels=["Churned Users","Regular Users","Valuable Users","Core Users"]
)
Operational Strategy:
- Core Users: Provide exclusive VIP discounts
- Churned Users: Send recall coupons
3. City Consumption Characteristics Analysis
# City consumption power matrix analysis
city_matrix = pd.pivot_table(
data=data,
values=["revenue","quantity"],
index="city",
aggfunc={"revenue":"sum", "quantity":"mean"}
)
sns.jointplot(
x=city_matrix["quantity"],
y=city_matrix["revenue"],
hue=city_matrix.index,
height=8
)
# City consumption power matrix analysis
city_matrix = pd.pivot_table(
data=data,
values=["revenue","quantity"],
index="city",
aggfunc={"revenue":"sum", "quantity":"mean"}
)
sns.jointplot(
x=city_matrix["quantity"],
y=city_matrix["revenue"],
hue=city_matrix.index,
height=8
)
Findings:
- Beijing users: High average order value, low frequency (average order amount ¥3200)
- Hangzhou users: High frequency, low average order value (average order amount ¥980)
4. 24-Hour Shopping Behavior Analysis
hourly_pattern = data.groupby("hour").agg({
"order_id": "count",
"revenue": "sum"
})
# Dual-axis chart drawing
fig, ax1 = plt.subplots(figsize=(12,6))
ax2 = ax1.twinx()
sns.lineplot(data=hourly_pattern["order_id"], ax=ax1, color="b", label="Order Volume")
sns.lineplot(data=hourly_pattern["revenue"], ax=ax2, color="r", label="Sales Amount")
ax1.set_ylabel("Order Volume", color="b")
ax2.set_ylabel("Sales Amount (10,000 Yuan)", color="r")
plt.title("24-Hour Transaction Trend Analysis")
hourly_pattern = data.groupby("hour").agg({
"order_id": "count",
"revenue": "sum"
})
# Dual-axis chart drawing
fig, ax1 = plt.subplots(figsize=(12,6))
ax2 = ax1.twinx()
sns.lineplot(data=hourly_pattern["order_id"], ax=ax1, color="b", label="Order Volume")
sns.lineplot(data=hourly_pattern["revenue"], ax=ax2, color="r", label="Sales Amount")
ax1.set_ylabel("Order Volume", color="b")
ax2.set_ylabel("Sales Amount (10,000 Yuan)", color="r")
plt.title("24-Hour Transaction Trend Analysis")
Operational Insights:
- Peak traffic at 8 PM: Suitable for flash sales
- Low period from 1-5 AM: Push coupons to stimulate consumption
5. Product Association Analysis (Apriori Algorithm)
from mlxtend.frequent_patterns import apriori
# Convert order-product matrix
basket = data.groupby(["order_id","category"]) ["quantity"].sum().unstack().fillna(0)
basket = (basket > 0).astype(int)
# Mine frequent itemsets
frequent_items = apriori(basket, min_support=0.02, use_colnames=True)
# Association rules
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_items, metric="lift", min_threshold=1.5)
print(rules[rules["lift"]>3].sort_values("confidence", ascending=False))
from mlxtend.frequent_patterns import apriori
# Convert order-product matrix
basket = data.groupby(["order_id","category"]) ["quantity"].sum().unstack().fillna(0)
basket = (basket > 0).astype(int)
# Mine frequent itemsets
frequent_items = apriori(basket, min_support=0.02, use_colnames=True)
# Association rules
from mlxtend.frequent_patterns import association_rules
rules = association_rules(frequent_items, metric="lift", min_threshold=1.5)
print(rules[rules["lift"]>3].sort_values("confidence", ascending=False))
Recommendation Strategy:
- Customers who buy mobile phones have a 65% probability of buying headphones (confidence=0.65)
- Customers who buy beauty products have a 58% probability of buying skincare products
4. Automating Analysis Reports (Jupyter Dashboard)
1. Building Interactive Dashboards with Panel
import panel as pn
pn.extension()
# Define controls
category_select = pn.widgets.Select(name="Category", options=categories)
city_select = pn.widgets.MultiSelect(name="City", options=["Beijing","Shanghai","Guangzhou","Shenzhen","Hangzhou"])
# Interactive callback function
@pn.depends(category_select.param.value, city_select.param.value)
def update_plots(category, cities):
filtered = data[data["category"].isin([category]) & data["city"].isin(cities)]
# Generate dynamic charts...
# Layout combination
dashboard = pn.Row(
pn.Column(category_select, city_select),
update_plots
)
dashboard.servable()
import panel as pn
pn.extension()
# Define controls
category_select = pn.widgets.Select(name="Category", options=categories)
city_select = pn.widgets.MultiSelect(name="City", options=["Beijing","Shanghai","Guangzhou","Shenzhen","Hangzhou"])
# Interactive callback function
@pn.depends(category_select.param.value, city_select.param.value)
def update_plots(category, cities):
filtered = data[data["category"].isin([category]) & data["city"].isin(cities)]
# Generate dynamic charts...
# Layout combination
dashboard = pn.Row(
pn.Column(category_select, city_select),
update_plots
)
dashboard.servable()
5. Pitfall Guide: Enterprise-Level Analysis Experience
- Data TimelinessOrder data must include payment status (to prevent false transactions interference)
- User Privacy ProtectionData desensitization must be performed before analysis
- Model OverfittingUse time series validation sets to evaluate predictive models
- Cost AccountingHigh GMV does not equal high profit; cost data must be associated
- Result VisualizationAvoid misleading charts like 3D pie charts
Today’s Challenge: Based on user segmentation results, design a text message recall strategy for “Churned Users” and submit your proposal in the comments section
“