3 Practical Tips for Handling CSV Data with Python: Solve “Data Garbled/Format Errors” and Double Your Office Efficiency

Have you often encountered

moments of frustration when opening a CSV file: Chinese content turns into a bunch of question marks,

numbers and text are intertwined, and date formats are all over the place?

Don’t worry, today I will share 3 Python tricks that can

help you completely say goodbye to these troubles!

Imagine these real scenarios: the reimbursement details sent by finance have

confused amount formats, the employee information provided by HR has

garbled text, and the date formats in sales data are bizarre…

Faced with these headaches, Python can easily

solve them!

First: Say goodbye to garbled text—let Chinese data see the light of day

The garbled text issue is like a dialect barrier in the data world; you speak plain language, and it responds in Martian. In fact, solving it is much simpler than you think.

Why does garbled text occur?

Different systems use different encoding methods, just like different countries use different languages.

Windows systems prefer gbk encoding, while Mac and Linux

prefer utf-8.

During file transfer, encoding information may be lost or confused.

Let’s look at a practical solution:

import pandas as pd

def read_csv_smart(file_path):
    encodings = ['utf-8', 'gbk', 'gb2312', 'latin1']
    
    for encoding in encodings:
        try:
            data = pd.read_csv(file_path, encoding=encoding)
            print(f"Successfully read! Using encoding: {encoding}")
            return data
        except UnicodeDecodeError:
            continue
    
    print("Unable to automatically identify encoding, please check the file manually")
    return None

# Usage example: Processing user comments data exported from social media

comments_data = read_csv_smart('user_comments_data.csv')
print(comments_data.head())

The brilliance of this method lies in its ability to detect the

actual encoding of the file first, then choose the appropriate way to read it,

just like an experienced translator who can always find the right

way to communicate!

Remember the key points of this trick:

Don’t stick to one encoding method; prepare multiple alternatives.

UTF-8 can solve 90% of the problems, but for older system files, you needgbk.

Always be prepared for error handling, leaving a way out for the program.

Second trick: Taming chaotic formats—let data fall into place

Chaotic data formats are like an unorganized bookshelf; they seem to be there, but you can’t find the information you want.

We need to find the right place for each type of data.

Common format issues include:

Text mixed with numbers, such as “150 yuan”,“2.5kg”

Inconsistent date formats, some using slashes, others using dashes.

Blank cells and illegal characters scattered everywhere.

Python can easily solve these problems:

import pandas as pd
import re

def clean_mixed_data(dataframe):
    # Create a copy of the data to avoid modifying the original data
    cleaned_df = dataframe.copy()
    
    # Clean numeric columns: extract pure numeric parts
    if 'Price' in cleaned_df.columns:
        cleaned_df['Price'] = cleaned_df['Price'].apply(
            lambda x: re.findall(r'\d+\.?\d*', str(x))[0] if re.findall(r'\d+\.?\d*', str(x)) else '0'
        )
        cleaned_df['Price'] = pd.to_numeric(cleaned_df['Price'])
    
    # Standardize date formats
    if 'Creation Time' in cleaned_df.columns:
        cleaned_df['Creation Time'] = pd.to_datetime(
            cleaned_df['Creation Time'], errors='coerce'
        )
    
    # Handle blank cells
    cleaned_df = cleaned_df.fillna('Unknown')
    
    return cleaned_df

# Processing product data exported from e-commerce platforms

product_data = pd.read_csv('product_info.csv', encoding='utf-8')
cleaned_products = clean_mixed_data(product_data)
print("Cleaned data:")
print(cleaned_products.dtypes)

The cleverness of this method lies in its ability to identify the essence of the data, organizing the seemingly chaotic information neatly.

It’s like having a meticulous assistant who helps you archive and organize messy files!

Third trick: Master of batch processing—one-click solution for multiple file issues

When you are faced not with one file, but with dozens of similar files, batch processing becomes particularly important.

The advantages of batch processing:

Save a lot of time on repetitive operations.

Ensure all files are processed uniformly.

Reduce errors caused by human operations.

Implementing a complete batch processing solution:

import os
import pandas as pd
from pathlib import Path

def batch_process_csv(folder_path, output_folder):
    # Ensure the output folder exists
    Path(output_folder).mkdir(exist_ok=True)
    
    # Get all CSV files
    csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]
    
    processed_count = 0
    
    for file_name in csv_files:
        try:
            file_path = os.path.join(folder_path, file_name)
            # Read file (automatically handle encoding)
            data = pd.read_csv(file_path, encoding='utf-8')
            # Data cleaning
            data = data.dropna()  # Remove empty rows
            data = data.reset_index(drop=True)  # Reset index
            # Save processed file
            output_path = os.path.join(output_folder, f'cleaned_{file_name}')
            data.to_csv(output_path, index=False, encoding='utf-8-sig')
            processed_count += 1
            print(f"Processed: {file_name}")
        except Exception as e:
            print(f"Processing failed: {file_name}, error: {str(e)}")
    
    print(f"Batch processing complete! Successfully processed {processed_count} files")

# Batch process last month’s sales daily reports

batch_process_csv('raw_sales_data/', 'cleaned_data/')

This batch processing solution is like forming an efficient data processing team that tirelessly helps you complete tedious data organization tasks!

Master these three tricks, and you will achieve a qualitative leap in data processing capabilities:

From being flustered to being calm and collected.

From wasting time to doubling efficiency.

From worrying about errors to being full of confidence.

Each trick brings tangible value:

Automatic encoding recognition means you no longer have to worry about garbled text.

Unified data formats make analysis smooth and unobstructed.

Batch processing capabilities significantly enhance work efficiency.

Start practicing these tricks now! Find some CSV files you encounter at work and use these methods to process them. When you see the originally chaotic data become neat and standardized, you will truly appreciate the convenience brought by Python.

Remember, good tools should not only be powerful but also easy to use. These tricks are just that; they encapsulate complex technology in simple functions, allowing you to achieve the greatest return with the least investment. From today on, let Python be your reliable assistant at your desk!

If you like it, please follow!

Leave a Comment