New Year Goal: Mastering 'Kung Fu Panda' in Python to Become the Most Efficient Data Scientist

Produced by Big Data Digest

Source: Medium

Translated by: Zhu Shuai, Xue Qing, Xia Yawei

This is an introductory guide to pandas, where the author uses simple language and straightforward example code to showcase the overview of pandas and some advanced operations. “… It is a library that everyone engaged in data science must master,” “… pandas is one of the reasons why Python is so user-friendly.” Is pandas really that great? Let’s take a look!

Python is an open-source programming language that is very convenient to use, but it also has some inherent issues common to open-source languages: there are many libraries available to implement a function. For beginners in Python, it can be challenging to know which library is best to call for a specific function. At this point, experienced individuals are needed to provide guidance. This article aims to tell you: there is a library that all data science practitioners must master, and that library is called “pandas.”

The most interesting aspect of pandas is that it encompasses many functionalities of other Python libraries, meaning that pandas is a comprehensive library. This implies that, many times, you can accomplish most tasks using just pandas.

Pandas is like Excel in Python: its basic data structure is a table (called “DataFrame” in pandas), which allows for various operations and transformations on the data. Of course, it can do many other things as well.

If you are already familiar with Python, feel free to skip to the third section.

Let’s get started:

import pandas as pd

Don’t ask me why I use “pd” instead of “p” or some other abbreviation; the fact is, everyone uses it this way, and you should too! 🙂 (Just having a bit of fun~)

Beginner Functions in Pandas

Reading Data

data=pd.read_csv('my_file.csv')

data=pd.read_csv('my_file.csv', sep=';', encoding='latin-1', nrows=1000, skiprows=[2,5])

The sep parameter indicates the delimiter. If you are dealing with French data, the CSV delimiter used in Excel is “;”, so you need to explicitly declare the delimiter with this parameter. The encoding parameter should be set to “latin-1” to recognize French characters; nrows=1000 indicates reading the first 1000 rows of data; skiprows=[2,5] means to skip the data in the 2nd and 5th rows when reading the file.

The most commonly used functions are: read_csv and read_excel.

Other very useful functions include: read_clipboard and read_sql.

Writing Data

data.to_csv("my_new_file.csv", index=None)

By setting index=None, the data will be written to the file exactly as it is. If you do not specify index=None, the program will add an index column to the file, which will be at the front of all columns, with values 0, 1, 2, 3… up to the last row.

I generally do not use functions like .to_excel, .to_json, or .to_pickle because the .to_csv function is already very useful! Moreover, CSV is currently the most commonly used file format for storing tabular data.

Checking Data

data.shape

Validates whether the (rows, columns) information matches the number of rows and columns in the data.

data.describe()

Calculates some basic statistics, such as count, mean, standard deviation, and percentiles.

Viewing Data

data.head(3)

Prints the first 3 rows of data. Similar to the .head() function, you can also use the .tail() function to view the last few rows of data.

data.loc[8]

Prints the row with index 8. (Note that indexing starts from 0 by default.)

data.loc[8,'column_1']

Prints the data corresponding to row index 8 and column name ‘column_1’.

data.loc[range(4,6)]

Outputs the row data from index 4 to 6 (excluding 6).

Basic Functions in Pandas

Logical Operators

Use logical operators to subset data. You can use & (AND), ~ (NOT), and | (OR) as common operators, remembering to add parentheses around logical operations.

data[data['column_1'].isin(["french", "english"]) ]

If you need to use a lot of OR operations on the same column, it is common to use the .isin() function instead.

Basic Plotting Functions

This functionality is mainly due to the matplotlib library. As mentioned in the introduction, most of this library’s features can be directly used through pandas.

data['column_numerical'].plot()

New Year Goal: Mastering 'Kung Fu Panda' in Python to Become the Most Efficient Data Scientist

Example output of the .plot() function

data['column_numerical'].hist()

This function plots a distribution graph (also known as a histogram).

Example output of the .hist() function

%matplotlib inline

If you are using Jupyter, don’t forget to add this line before plotting (you only need to declare it once in the notebook).

Updating Data

data.loc[8,'column_1']='english'

Replaces the value pointed to by row index 8 and column name ‘column_1’ with “english”.

data.loc[data['column_1']=='french','column_1']='French'

Changes the values of multiple rows in one line of code.

Now you have learned some common functionalities that can be accomplished in Excel. Next, let’s explore some magical features that Excel cannot achieve!

Intermediate Functions

Counting Frequencies

Example output of the .value_counts() function.

Operations on Rows, Columns, or All Data

data['column_1'].map(len)

The len() function will be applied to each element under ‘column_1’.

The .map() operation will apply a function to each element of the specified column.

data['column_1'].map(len).map(lambda x: x/100).plot()

A very useful feature in the pandas library is the chaining method. It allows you to perform multiple operations in one line of code (like the .map() function and .plot() function here), making it both simple and efficient.

Chaining method:

https://tomaugspurger.github.io/method-chaining

The apply function will apply a function to all columns.

The applymap() function will apply a function to all cells in the table.

tqdm — A Unique Module

When dealing with large datasets, pandas takes some time to complete operations like .map(), .apply(), and .applymap(). tqdm is a very useful library that can predict when these operations will finish. (Okay, I lied; I previously said we only use the pandas library). You can install tqdm using the command “pip install tqdm”.

from tqdm import tqdm_notebook

tqdm_notebook().pandas()

Using pandas to create tqdm progress

data['column_1'].progress_map(lambda x: x.count('e'))

Replace .map() with .progress_map() function, and the same goes for .apply() and .applymap() functions.

This is the progress bar you can see after using tqdm and pandas in Jupyter.

Correlation Matrix and Scatter Matrices

data.corr()

data.corr().applymap(lambda x: int(x*100)/100)

New Year Goal: Mastering 'Kung Fu Panda' in Python to Become the Most Efficient Data Scientist

Using .corr() gives you the correlation matrix of all columns.

pd.plotting.scatter_matrix(data, figsize=(12,8))

Example of scatter matrices. It plots all combinations of values from two columns in the same graph.

Advanced Operations in Pandas

SQL Join Functionality

Join operations in Pandas are very simple.

data.merge(other_data, on=['column_1', 'column_2', 'column_3'])

With just one line of code, you can join three columns together.

Grouping Functionality

When you first start using this functionality, it may not be easy; you first need to master some syntax knowledge, but then you will find that you cannot live without this functionality.

data.groupby('column_1')['column_2'].apply(sum).reset_index()

Groups the data based on a certain column and then performs some function operation on another column’s data. The .reset_index() function can convert the data into a DataFrame format.

As previously mentioned, using chaining methods to implement as many function features as possible in one line of code can greatly optimize code structure. (Digest reminder: but also consider the readability of the code~)

Iterating Rows

dictionary={}

for i, row in data.iterrows():

dictionary[row['column_1']]=row['column_2']

The .iterrows() function retrieves two variables simultaneously and implements a loop: the row index and the row object (which are i and row in the above code).

In summary, the pandas library is one of the reasons why Python is so user-friendly.

It is difficult to comprehensively showcase all the functionalities of the Pandas library in this article, but through the above content, you should understand why a data scientist cannot do without the Pandas library. Overall, the Pandas library has the following advantages:

Convenient and easy to use, hiding all complex and abstract computational processes;
Most functionalities are implemented in a very intuitive way;
Fast, although it is not the fastest data analysis library (optimized in C language).

Pandas is a very important tool that helps data scientists quickly read and understand data, completing their work more efficiently.

Alright, if you find this article useful, please remember to give the Digest a thumbs up, haha 🙂 Additionally, the Digest also suggests that everyone regularly check the official documentation and API!

Related reports:

https://towardsdatascience.com/be-a-more-efficient-data-scientist-today-master-pandas-with-this-guide-ea362d27386

Volunteer Introduction

Reply with “Volunteer” to join us

New Year Goal: Mastering ‘Kung Fu Panda’ in Python to Become the Most Efficient Data Scientist

Leave a Comment Cancel reply

Related posts

Leave a Comment Cancel reply