Have you often encountered
moments of frustration when opening a CSV file: Chinese content turns into a bunch of question marks,
numbers and text are intertwined, and date formats are all over the place?
Don’t worry, today I will share 3 Python tricks that can
help you completely say goodbye to these troubles!
Imagine these real scenarios: the reimbursement details sent by finance have
confused amount formats, the employee information provided by HR has
garbled text, and the date formats in sales data are bizarre…
Faced with these headaches, Python can easily
solve them!
First: Say goodbye to garbled text—let Chinese data see the light of day
The garbled text issue is like a dialect barrier in the data world; you speak plain language, and it responds in Martian. In fact, solving it is much simpler than you think.
Why does garbled text occur?
Different systems use different encoding methods, just like different countries use different languages.
Windows systems prefer gbk encoding, while Mac and Linux
prefer utf-8.
During file transfer, encoding information may be lost or confused.
Let’s look at a practical solution:
import pandas as pd
def read_csv_smart(file_path):
encodings = ['utf-8', 'gbk', 'gb2312', 'latin1']
for encoding in encodings:
try:
data = pd.read_csv(file_path, encoding=encoding)
print(f"Successfully read! Using encoding: {encoding}")
return data
except UnicodeDecodeError:
continue
print("Unable to automatically identify encoding, please check the file manually")
return None
# Usage example: Processing user comments data exported from social media
comments_data = read_csv_smart('user_comments_data.csv')
print(comments_data.head())
The brilliance of this method lies in its ability to detect the
actual encoding of the file first, then choose the appropriate way to read it,
just like an experienced translator who can always find the right
way to communicate!
Remember the key points of this trick:
Don’t stick to one encoding method; prepare multiple alternatives.
UTF-8 can solve 90% of the problems, but for older system files, you needgbk.
Always be prepared for error handling, leaving a way out for the program.
Second trick: Taming chaotic formats—let data fall into place
Chaotic data formats are like an unorganized bookshelf; they seem to be there, but you can’t find the information you want.
We need to find the right place for each type of data.
Common format issues include:
Text mixed with numbers, such as “150 yuan”,“2.5kg”
Inconsistent date formats, some using slashes, others using dashes.
Blank cells and illegal characters scattered everywhere.
Python can easily solve these problems:
import pandas as pd
import re
def clean_mixed_data(dataframe):
# Create a copy of the data to avoid modifying the original data
cleaned_df = dataframe.copy()
# Clean numeric columns: extract pure numeric parts
if 'Price' in cleaned_df.columns:
cleaned_df['Price'] = cleaned_df['Price'].apply(
lambda x: re.findall(r'\d+\.?\d*', str(x))[0] if re.findall(r'\d+\.?\d*', str(x)) else '0'
)
cleaned_df['Price'] = pd.to_numeric(cleaned_df['Price'])
# Standardize date formats
if 'Creation Time' in cleaned_df.columns:
cleaned_df['Creation Time'] = pd.to_datetime(
cleaned_df['Creation Time'], errors='coerce'
)
# Handle blank cells
cleaned_df = cleaned_df.fillna('Unknown')
return cleaned_df
# Processing product data exported from e-commerce platforms
product_data = pd.read_csv('product_info.csv', encoding='utf-8')
cleaned_products = clean_mixed_data(product_data)
print("Cleaned data:")
print(cleaned_products.dtypes)
The cleverness of this method lies in its ability to identify the essence of the data, organizing the seemingly chaotic information neatly.
It’s like having a meticulous assistant who helps you archive and organize messy files!
Third trick: Master of batch processing—one-click solution for multiple file issues
When you are faced not with one file, but with dozens of similar files, batch processing becomes particularly important.
The advantages of batch processing:
Save a lot of time on repetitive operations.
Ensure all files are processed uniformly.
Reduce errors caused by human operations.
Implementing a complete batch processing solution:
import os
import pandas as pd
from pathlib import Path
def batch_process_csv(folder_path, output_folder):
# Ensure the output folder exists
Path(output_folder).mkdir(exist_ok=True)
# Get all CSV files
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]
processed_count = 0
for file_name in csv_files:
try:
file_path = os.path.join(folder_path, file_name)
# Read file (automatically handle encoding)
data = pd.read_csv(file_path, encoding='utf-8')
# Data cleaning
data = data.dropna() # Remove empty rows
data = data.reset_index(drop=True) # Reset index
# Save processed file
output_path = os.path.join(output_folder, f'cleaned_{file_name}')
data.to_csv(output_path, index=False, encoding='utf-8-sig')
processed_count += 1
print(f"Processed: {file_name}")
except Exception as e:
print(f"Processing failed: {file_name}, error: {str(e)}")
print(f"Batch processing complete! Successfully processed {processed_count} files")
# Batch process last month’s sales daily reports
batch_process_csv('raw_sales_data/', 'cleaned_data/')
This batch processing solution is like forming an efficient data processing team that tirelessly helps you complete tedious data organization tasks!
Master these three tricks, and you will achieve a qualitative leap in data processing capabilities:
From being flustered to being calm and collected.
From wasting time to doubling efficiency.
From worrying about errors to being full of confidence.
Each trick brings tangible value:
Automatic encoding recognition means you no longer have to worry about garbled text.
Unified data formats make analysis smooth and unobstructed.
Batch processing capabilities significantly enhance work efficiency.
Start practicing these tricks now! Find some CSV files you encounter at work and use these methods to process them. When you see the originally chaotic data become neat and standardized, you will truly appreciate the convenience brought by Python.
Remember, good tools should not only be powerful but also easy to use. These tricks are just that; they encapsulate complex technology in simple functions, allowing you to achieve the greatest return with the least investment. From today on, let Python be your reliable assistant at your desk!
If you like it, please follow!