How to Use Python for Statistical Analysis?

Recently, many friends have been asking Teacher Linglu how to use Python for statistical analysis when writing papers. Teacher Linglu took some time to summarize the process in the simplest way, guiding you through the complete workflow.

Step 1: Preparation — Install the “Kitchen” and “Utensils”

Before cooking, you need to have a kitchen and pots and pans. For Python, the most commonly used “kitchen” is Anaconda, which automatically installs Python and most of the commonly used “utensils” (libraries).

1. Install Anaconda: Go to the official website to download and install it; the process is very simple, just click “Next” all the way.

2. Familiarize yourself with the four core “utensils”:

Pandas: Your “universal cutting board” and “prep area”. Almost all data operations, such as reading, cleaning, and organizing, rely on it.

NumPy: Provides “precise knife skills” for efficient scientific computing.

SciPy & StatsModels: Your “main chef’s stove”. Various complex statistical tests (such as t-tests and ANOVA) are handled here.

Matplotlib & Seaborn: Your “plating and decoration station”. Responsible for turning all results into beautiful charts.

Now, the “kitchen” is ready. We open a tool called Jupyter Notebook in the “kitchen” (it’s like a notebook where you can write and see results at any time) and start cooking.
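
Before cooking, it is worth checking that every utensil is actually in the drawer. The cell below is a minimal sketch, assuming a standard Anaconda install; the names `libraries` and `versions` are made up for this illustration.

```python
import importlib

# The libraries this tutorial relies on
libraries = ["pandas", "numpy", "scipy", "statsmodels", "matplotlib", "seaborn"]

versions = {}
for name in libraries:
    try:
        module = importlib.import_module(name)
        # Most scientific packages expose their version as __version__
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        # None marks a library that still needs to be installed
        versions[name] = None

for name, version in versions.items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name}: {status}")
```

If any line reports NOT INSTALLED, install that library before moving on.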

Step 2: Data Processing — Turning “raw meat” into “clean dishes”

The premise of any analysis is clean data.

```python
# First, import our "utensils"
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# Read data from an Excel file (assuming your data is in 'my_data.xlsx');
# this is the most likely scenario you will encounter.
# Reading .xlsx files requires the openpyxl package.
df = pd.read_excel('my_data.xlsx')

# Quickly take a look at what the data looks like
print(df.head())  # Display the first 5 rows
print(df.shape)   # Show how many rows and columns the data has
df.info()         # Show each column's data type and non-null count (info() prints by itself)
```

Beginner’s Guide:

Handling missing values: If there are blank cells in the data, you need to decide whether to delete the entire row or fill it with the average.

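```python
# Method 1: Simple and straightforward, delete rows with missing values
df_clean = df.dropna()

# Method 2: Fill missing values in a specific column (e.g., the 'age' column) with the average.
# Assign the result back to the column; calling fillna(..., inplace=True) on df['age']
# triggers a warning in recent pandas versions and may not update df.
df['age'] = df['age'].fillna(df['age'].mean())
```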
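
Before choosing between the two methods, it helps to see how much is actually missing. The following is a small sketch on a made-up table (`toy` is hypothetical data standing in for your own `df`), comparing what each method leaves behind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few blank cells
toy = pd.DataFrame({
    "age": [23, np.nan, 31, 27, np.nan],
    "score": [80.0, 75.5, np.nan, 90.0, 85.0],
})

# Count and share of missing values in each column
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()
print(missing_counts)
print(missing_share)

# Method 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Method 2: fill only the 'age' column with its mean; 'score' stays untouched
filled = toy.fillna({"age": toy["age"].mean()})

print(f"Rows before: {len(toy)}, after dropna: {len(dropped)}")
print(f"Missing 'age' values after filling: {filled['age'].isna().sum()}")
```

When a large share of rows would be dropped, deleting them can shrink the sample considerably, so it is worth looking at these counts first.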

Selecting data: You usually need to divide the data into several groups, such as experimental and control groups.

```python
# Assuming there is a column called 'group' with values 'Control' and 'Treatment'
control_group = df[df['group'] == 'Control']['measurement']      # Measurement values for the control group
treatment_group = df[df['group'] == 'Treatment']['measurement']  # Measurement values for the experimental group
```
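
A filter like this only matches labels that are spelled exactly the same way, so a stray space or a lowercase letter in the Excel sheet silently drops rows. The sketch below uses a made-up table (`toy` and its values are hypothetical) to show one way to inspect and tidy the labels before splitting.

```python
import pandas as pd

# Hypothetical data with inconsistently typed group labels
toy = pd.DataFrame({
    "group": ["Control", "control ", "Treatment", "Treatment", " Control"],
    "measurement": [5.1, 4.8, 6.3, 6.0, 5.0],
})

# Look at the labels as they were typed: near-duplicates show up as separate entries
print(toy["group"].value_counts())

# Remove surrounding spaces and make the capitalization uniform
toy["group"] = toy["group"].str.strip().str.capitalize()
print(toy["group"].value_counts())

# Now the split picks up every row that belongs to each group
control_group = toy.loc[toy["group"] == "Control", "measurement"]
treatment_group = toy.loc[toy["group"] == "Treatment", "measurement"]
print(len(control_group), len(treatment_group))
```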

Step 3: Descriptive Statistics — Informing readers about the basic situation of the “dish”

Before diving into deeper analysis, first describe the basic situation of your data. This is like introducing the main and auxiliary ingredients of a dish.

```python
# Quickly perform descriptive statistics on numeric columns.
# It will automatically calculate: count, mean, standard deviation, min, quartiles, max
description = df.describe()
print(description)

# If you only want to see a specific column (e.g., the 'score' column)
print(df['score'].mean())    # Mean
print(df['score'].std())     # Standard deviation
print(df['score'].median())  # Median
```
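
Papers usually report these numbers separately for each group. As a minimal sketch, assuming a `group` column and a `score` column (the `toy` table and its values are invented for illustration), `groupby` produces the sample size, mean, and standard deviation per group in one step.

```python
import pandas as pd

# Hypothetical scores for two groups of four subjects each
toy = pd.DataFrame({
    "group": ["Control"] * 4 + ["Treatment"] * 4,
    "score": [78, 80, 75, 83, 85, 88, 84, 91],
})

# One row per group, with the three numbers a paper usually reports
summary = toy.groupby("group")["score"].agg(["count", "mean", "std"])
print(summary.round(2))

# Format each group the way it would appear in the text of a paper
for group_name, row in summary.iterrows():
    print(f"{group_name}: n={int(row['count'])}, M={row['mean']:.1f}, SD={row['std']:.1f}")
```

Note that pandas computes the sample standard deviation (dividing by n - 1), which is the version normally reported as SD.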

In your paper, you can write:

> “This study included XX subjects. The average score of the experimental group (M=85.2, SD=5.1) was higher than that of the control group (M=78.6, SD=6.3).”

> M stands for Mean, and SD stands for Standard Deviation.

Step 4: Core Statistical Tests — Proving that your “dish” is indeed tastier

This is the essence of the paper, used to prove that your intervention is effective or that there is a relationship between two variables.

# Scenario 1: Comparing two groups of data (e.g., experimental group vs. control group)

Use the independent samples t-test.

```python
# Using the previously defined control_group and treatment_group.
# By default, ttest_ind assumes the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(treatment_group, control_group)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# How to interpret?
if p_value < 0.05:
    print("Results are significant! There is a statistical difference between the experimental and control groups.")
else:
    print("Results are not significant, cannot conclude that there is a difference between the two groups.")
```
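
The standard t-test assumes both groups have similar variances. When that is doubtful, a common alternative is Welch's t-test, which SciPy runs when `equal_var=False` is passed. The sketch below uses simulated data (the group means, spreads, and sizes are invented) and also computes Cohen's d, an effect size often reported next to the p-value.

```python
import numpy as np
from scipy import stats

# Simulated measurements; the seed makes the example reproducible
rng = np.random.default_rng(42)
control = rng.normal(loc=78, scale=6, size=30)
treatment = rng.normal(loc=85, scale=5, size=30)

# Levene's test: a small p-value suggests the variances differ
levene_stat, levene_p = stats.levene(treatment, control)
print(f"Levene p-value: {levene_p:.4f}")

# Welch's t-test does not assume equal variances
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"Welch t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")

# Cohen's d for two groups of equal size: mean difference divided by the pooled SD
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd
print(f"Cohen's d: {cohens_d:.2f}")
```

The p-value says whether a difference is likely to be real; the effect size says how large it is, which is why many journals ask for both.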

# Scenario 2: Comparing three or more groups of data

Use one-way ANOVA.

```python
# Assuming there are three groups: A, B, C
group_a = df[df['group'] == 'A']['score']
group_b = df[df['group'] == 'B']['score']
group_c = df[df['group'] == 'C']['score']

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpretation is the same: p < 0.05 indicates that at least one group mean differs from the others.
# ANOVA alone does not tell you which groups differ.
```
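
A significant ANOVA is usually followed by a post-hoc test to find out which pairs of groups differ. One common choice is Tukey's HSD, available in statsmodels as `pairwise_tukeyhsd`. This is a sketch on simulated data (the three group means are invented), not the only valid follow-up.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated scores: groups A and B are similar, group C is clearly higher
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 20),
    "score": np.concatenate([
        rng.normal(70, 5, 20),
        rng.normal(72, 5, 20),
        rng.normal(85, 5, 20),
    ]),
})

# Compare every pair of groups while controlling the overall error rate at 5%
tukey = pairwise_tukeyhsd(endog=toy["score"], groups=toy["group"], alpha=0.05)
print(tukey.summary())

# tukey.reject holds one True/False per pair: True means that pair differs significantly
print(tukey.reject)
```

The summary table lists each pair (A-B, A-C, B-C) with its mean difference, adjusted p-value, and confidence interval.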

# Scenario 3: Exploring the relationship between two variables

Use correlation analysis and regression analysis.

Correlation analysis (to see the strength of the relationship):

```python
# Exploring the relationship between 'study time' and 'exam score'
study_time = df['study_time']
exam_score = df['exam_score']

# Calculate the Pearson correlation coefficient
corr_coef, p_value = stats.pearsonr(study_time, exam_score)
print(f"Correlation coefficient r: {corr_coef:.4f}")
print(f"p-value: {p_value:.4f}")
```

`r` values range from -1 to 1. The closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation; a value close to 0 indicates no linear correlation.
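
Pearson's r only measures straight-line relationships, so a value near 0 does not rule out a curved one. The sketch below uses two tiny constructed datasets to illustrate this, and shows Spearman's rank correlation (`stats.spearmanr`) as one alternative for relationships that rise steadily but not in a straight line.

```python
import numpy as np
from scipy import stats

# A perfect U-shaped relationship: y is fully determined by x, yet not linearly
x = np.arange(-5, 6)
y = x ** 2
r_curve, p_curve = stats.pearsonr(x, y)
print(f"U-shaped data, Pearson r: {r_curve:.4f}")

# A relationship that always increases, but along a curve
x_mono = np.arange(1, 11)
y_mono = x_mono ** 3
r_mono, _ = stats.pearsonr(x_mono, y_mono)
rho_mono, _ = stats.spearmanr(x_mono, y_mono)
print(f"Curved increasing data, Pearson r: {r_mono:.4f}")
print(f"Curved increasing data, Spearman rho: {rho_mono:.4f}")
```

This is why it is a good habit to look at a scatter plot before trusting a correlation coefficient.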

Simple linear regression (to see how much one variable changes with the other):

```python
import statsmodels.api as sm

# Define independent variable (X) and dependent variable (y)
X = df['study_time']  # Independent variable: study time
y = df['exam_score']  # Dependent variable: exam score

# Add a constant term (intercept) to X
X = sm.add_constant(X)

# Build and fit the model
model = sm.OLS(y, X).fit()

# Print detailed regression results
print(model.summary())
```

In the `summary()` table, mainly look at:

`coef`: Coefficient. The coefficient of `study_time` indicates how much the score changes on average for each unit increase in study time.

`P>|t|`: p-value. If it is less than 0.05, the relationship between this independent variable and the dependent variable is statistically significant.

`R-squared`: The proportion of variance in the dependent variable explained by the model.

Step 5: Data Visualization — Adding “dish presentation images” to the paper

A picture is worth a thousand words. A good chart makes your results clear to reviewers and readers at a glance.

```python
# Set the style of the charts to make them look better
sns.set_style("whitegrid")

# 1. Boxplot: very suitable for comparing the distribution of several groups of data (replacing simple bar charts)
plt.figure(figsize=(8, 6))
sns.boxplot(x='group', y='score', data=df)
plt.title('Comparison of Score Distribution Among Different Groups')
plt.show()

# 2. Scatter plot: showing the correlation between two variables
plt.figure(figsize=(8, 6))
sns.scatterplot(x='study_time', y='exam_score', data=df)
plt.title('Relationship Between Study Time and Exam Score')
plt.xlabel('Study Time (hours)')
plt.ylabel('Exam Score')
plt.show()

# 3. Histogram: checking the distribution of a single variable
plt.figure(figsize=(8, 6))
df['score'].hist(bins=10)
plt.title('Distribution Histogram of Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()
```
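
To put a chart into the paper, it needs to be saved as an image file. The sketch below saves a histogram of simulated scores with `savefig`; the file name `score_hist.png`, the temporary folder, and the 300 dpi setting are assumptions for this example, so check your journal's figure requirements.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # Draw without a window; not needed inside Jupyter Notebook
import matplotlib.pyplot as plt
import numpy as np

# Simulated scores standing in for df['score']
rng = np.random.default_rng(7)
scores = rng.normal(80, 8, size=100)

fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(scores, bins=10)
ax.set_title("Distribution Histogram of Scores")
ax.set_xlabel("Score")
ax.set_ylabel("Frequency")

# Save at high resolution; bbox_inches="tight" trims extra white margins
out_path = os.path.join(tempfile.gettempdir(), "score_hist.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)

print(f"Figure saved to: {out_path}")
```

In your own project, replace `out_path` with a path inside your paper's folder. Call `savefig` before `plt.show()`, since the figure may be cleared once it has been shown.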

In fact, when writing a paper, you do not need to understand the complex mathematical principles behind each statistical test. What you first need to learn is “in what scenario, use what code,” and be able to correctly interpret key indicators such as p-values from the output. This process is like using a calculator; you first learn to press buttons to get the correct answer, and later gradually understand the principles behind it.

Be bold and try it out!

Finally, I wish all friends good luck with their papers!
