Python Data Analysis and Visualization: A Beginner's Guide


To begin with, let me answer three fundamental questions to introduce you to the world of data analysis.

· Why Perform Data Analysis?

We live in an information age in which almost anything can be described by data. To understand something, we need to understand its data and the patterns in how that data changes. In competitions, appropriate data analysis supports better decisions and improves efficiency.

· What is the Process of Data Analysis?

· Data Acquisition: Data is typically acquired using web scraping, databases, and file reading methods.

· Data Cleaning: The data we acquire is often not complete or flawless, and needs to undergo data type conversion and other processes before analysis.

· Data Analysis: This typically includes descriptive statistical analysis, visualization analysis, and cluster analysis.

· Conclusion Report: Write a conclusion analysis report that integrates data, analysis process, and conclusions.

· Why Choose Python?

Python is a high-level programming language with a simple syntax that is easy to learn. It also has a rich set of data analysis libraries such as NumPy, Pandas, and powerful visualization libraries such as Matplotlib and Seaborn. This gives it a significant advantage in the field of data analysis compared to other languages.

PART 1

Environment Setup

1►

Download Python

Download the installation package directly from the official website and customize the installation location during installation. Official website:

https://www.python.org/

2►

Download VS Code and the Python Extension

VS Code is a powerful code editor, and its environment setup is relatively simple for beginners. Official website:

https://code.visualstudio.com/Download

After downloading, install the Python extension so that you can run Python programs from the editor. Besides the Python extension, I also recommend downloading the following extensions.

Students can also download Chinese localization extensions and other plugins they like as needed.

3►

Download Related Packages

This article uses three third-party libraries: NumPy, Pandas, and Matplotlib. I provide two download methods here.

· Install via pip

Press Win+R, type cmd, and press Enter to open a terminal. In China, the Tsinghua mirror site is recommended for faster downloads. Enter the following commands:

pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple

· Download from the website

Open the mirror site:

https://pypi.tuna.tsinghua.edu.cn/simple/numpy/

Find the three libraries mentioned above and download the package files that match your Python version and operating system. After downloading, install each file by running pip install followed by the path to the downloaded file.

After installing, verify by creating a new Python file in VS Code and typing:

import numpy
import pandas
import matplotlib

If there is no error, it means the installation was successful.
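As a slightly more informative check, the sketch below also prints the installed version of each library. It assumes only that the three packages were installed as described above; the `versions` dictionary is a name chosen for this illustration.

```python
import numpy
import pandas
import matplotlib

# Each library exposes its installed version as a string.
versions = {
    "numpy": numpy.__version__,
    "pandas": pandas.__version__,
    "matplotlib": matplotlib.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails, the error message names the missing package, which tells you which pip command to run again.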

PART 2

Introduction to NumPy

1►

NumPy Overview

NumPy, short for Numerical Python, is an open-source numerical computing extension for Python. It is a Python library that provides multidimensional array objects, various derived objects (such as masked arrays and matrices), and various APIs for fast array operations.

The core of NumPy is its array type. Because array operations run in compiled C code, they can be many times faster than the equivalent loops over Python lists (a speedup of up to fifty times is often quoted, though the actual figure depends on the operation and the data size). In data processing work, NumPy is typically used for numerical calculations.

· Importing the NumPy Library

np is the conventional name for numpy.

import numpy as np

· Query Functions

NumPy has many functions, and the info() function gives quick access to the related documentation.

np.info(np.info)   # Prints the documentation for the np.info() function itself

2►

NumPy Data Types

To support large-scale numerical work, NumPy provides more data types than Python does. For example, when every value in a data set lies between 0 and 255, uint8 can be used. It stores each element in one byte, whereas NumPy's default integer type on most platforms (int64) uses eight.

Data Type	Description	Type Code
bool	Boolean type stored in one byte	'?' (or 'b1')
int8	8-bit signed integer, -128 to 127	'i1'
int16	16-bit signed integer, -32768 to 32767	'i2'
uint8	8-bit unsigned integer, 0 to 255	'u1'
uint16	16-bit unsigned integer, 0 to 65535	'u2'
uint32	32-bit unsigned integer, 0 to 2 ** 32 - 1	'u4'
float16	Half-precision floating point: 16 bits (1 sign bit, 5 exponent bits, 10 mantissa bits)	'f2'
float32	Single-precision floating point: 32 bits (1 sign bit, 8 exponent bits, 23 mantissa bits)	'f4'
complex64	Complex number represented by two 32-bit floating point numbers for real and imaginary parts	'c8'
complex128	Complex number represented by two 64-bit floating point numbers for real and imaginary parts	'c16'
object	Python object	'O'
string	Fixed-length byte string	'S'
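The memory saving from a smaller data type can be checked directly. The following is a minimal sketch; the sample values and variable names are made up for the illustration.

```python
import numpy as np

values = [0, 17, 128, 255]                  # every value fits in the range 0..255

small = np.array(values, dtype=np.uint8)    # 1 byte per element
large = np.array(values, dtype=np.int64)    # 8 bytes per element

print(small.dtype, small.nbytes)            # uint8 4
print(large.dtype, large.nbytes)            # int64 32

# A type code from the table can be passed instead of the type object.
same = np.array(values, dtype="u1")
print(same.dtype)                           # uint8
```

Choosing a type that is too small is risky: a value outside the range of the chosen type cannot be stored correctly, so pick the type from the known range of the data.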

3►

Arrays


Unlike lists, arrays have a fixed size when created, and all elements in an array must have the same data type. Common types include one-dimensional arrays, two-dimensional arrays, and three-dimensional arrays.

NumPy uses the axis parameter to refer to an array's dimensions: specifying an axis means that the operation is performed along that axis.
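A small sketch of what "along an axis" means for a two-dimensional array, using sum() as the example operation (the array contents are made up):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = m.sum(axis=0)   # move down axis 0 (the rows): one sum per column
row_sums = m.sum(axis=1)   # move across axis 1 (the columns): one sum per row

print(col_sums)            # [5 7 9]
print(row_sums)            # [ 6 15]
```

The axis you name is the one that disappears from the result: summing a (2, 3) array along axis 0 leaves 3 values, and along axis 1 leaves 2.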

· Creating Arrays

· array() function

The simplest way to create an array is to pass in a list or tuple:

numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0, like=None)


Key parameter explanations:

Parameter	Description
object	Any object exposing the array interface method
dtype	Data type
copy	If True, the object is copied; otherwise, a copy is only made if __array__ returns a copy, the object is a nested sequence, or a copy is needed to satisfy any other requirements (dtype, order, etc.).
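As an illustration of the parameters above, the following sketch creates arrays from a list, a tuple, and a nested list. The variable names are chosen for this example only.

```python
import numpy as np

from_list = np.array([1, 2, 3])        # dtype is inferred from the data
from_tuple = np.array((1.5, 2, 3))     # mixed int and float values become float64
as_float32 = np.array([[1, 2], [3, 4]], dtype=np.float32)  # nested list gives a 2-D array

print(from_list)                            # [1 2 3]
print(from_tuple.dtype)                     # float64
print(as_float32.shape, as_float32.dtype)   # (2, 2) float32
```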

· arange() function

Creates an array with fixed step size by specifying the endpoints:

numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None)


Key parameter explanations:

Parameter	Description
start	Starting value, default is 0
stop	Ending value (not included)
step	Step size, default is 1
dtype	Data type of the created ndarray; if not provided, it is inferred from the other arguments.
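A few calls that show how start, stop, and step interact (a minimal sketch with made-up values):

```python
import numpy as np

first_five = np.arange(5)             # only stop given: starts at 0, step 1
every_third = np.arange(2, 10, 3)     # start 2, stop 10 (excluded), step 3
quarters = np.arange(0, 1, 0.25)      # a float step is allowed

print(first_five)     # [0 1 2 3 4]
print(every_third)    # [2 5 8]
print(quarters)       # [0.   0.25 0.5  0.75]
```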

· linspace() function

Creates an array by specifying the endpoints with a fixed number of elements:

numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)

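Unlike arange(), linspace() fixes the number of points rather than the step. The sketch below shows the effect of the endpoint and retstep parameters; the variable names are illustrative.

```python
import numpy as np

points = np.linspace(0, 1, num=5)                      # both endpoints included
open_end = np.linspace(0, 1, num=5, endpoint=False)    # stop value excluded
values, step = np.linspace(0, 1, num=5, retstep=True)  # also return the spacing

print(points)      # [0.   0.25 0.5  0.75 1.  ]
print(open_end)    # [0.  0.2 0.4 0.6 0.8]
print(step)        # 0.25
```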

· logspace() function

Creates an array whose elements are evenly spaced on a logarithmic scale, forming a geometric progression from base ** start to base ** stop:

numpy.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0)
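Note that start and stop are exponents, not the endpoint values themselves. A minimal sketch with the default base 10 and with base 2:

```python
import numpy as np

powers_of_ten = np.logspace(0, 3, num=4)           # 10**0, 10**1, 10**2, 10**3
powers_of_two = np.logspace(0, 4, num=5, base=2)   # 2**0 ... 2**4

print(powers_of_ten)   # [   1.   10.  100. 1000.]
print(powers_of_two)   # [ 1.  2.  4.  8. 16.]
```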

· empty() function

Creates an uninitialized array with the specified shape and data type (dtype). Its initial values are whatever happens to be in the allocated memory, so they are arbitrary and should be overwritten before use:

numpy.empty(shape, dtype=float, order='C', *, like=None)

· zeros() function

Creates an array filled with zeros of specified shape (shape) and data type (dtype):

numpy.zeros(shape, dtype=float, order='C', *, like=None)


· full() function

Creates an array of specified shape and fills it with specified values:

numpy.full(shape, fill_value, dtype=None, order='C', *, like=None)

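The three functions empty(), zeros(), and full() take a shape in the same way and differ only in how the array is filled. A minimal sketch (shapes and fill values are made up):

```python
import numpy as np

buffer = np.empty((2, 3))                  # contents are arbitrary at this point
buffer[:] = 1.0                            # fill it before reading any value

zeros = np.zeros((2, 3), dtype=np.int32)   # every element is 0
sevens = np.full((2, 2), 7)                # every element is 7

print(buffer)
print(zeros)
print(sevens)
```

empty() is mainly useful when every element will be assigned right away, since it skips the cost of filling the array first.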

· Basic Operations on Arrays

· Array Properties

Assuming we have an ndarray array, the important properties of the array are as follows:

Property	Description
ndarray.ndim	Rank, i.e., the number of axes or dimensions
ndarray.shape	The dimensions of the array, for matrices, n rows m columns
ndarray.size	The total number of elements in the array, equivalent to the value of n*m in .shape
ndarray.dtype	The element type of the ndarray object
ndarray.itemsize	The size of each element in the ndarray object, in bytes
ndarray.nbytes	The total number of bytes consumed by the array elements
ndarray.real	The real part of the ndarray elements (real part of complex numbers)
ndarray.imag	The imaginary part of the ndarray elements (imaginary part of complex numbers)
ndarray.T	Transpose the array
ndarray.flat	A one-dimensional iterator over the array
ndarray.flags	Memory information of the ndarray object
ndarray.strides	Tuple of the number of bytes to step in each dimension when traversing the array
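Several of these properties can be inspected on a small example array. In this sketch the dtype is fixed to int32 so that the byte counts are the same on every platform:

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)

print(a.ndim)       # 2   (two axes)
print(a.shape)      # (3, 4)
print(a.size)       # 12  (3 * 4 elements)
print(a.dtype)      # int32
print(a.itemsize)   # 4   (bytes per element)
print(a.nbytes)     # 48  (12 elements * 4 bytes)
print(a.T.shape)    # (4, 3)
print(a.strides)    # (16, 4): 16 bytes to the next row, 4 to the next column
```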

· Array Indexing and Slicing

Indexing and slicing of arrays in NumPy is very flexible, and an array can be sliced along any of its axes. An example is shown below:

import numpy as np

a = np.arange(36).reshape(6, 6)  # Generate a 6x6 two-dimensional array
print(a)

The output is:

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]
 [24 25 26 27 28 29]
 [30 31 32 33 34 35]]

Slicing it:

print(a[3, 0])        # Returns the element in the 4th row and 1st column
print(a[1])           # Returns the 2nd row
print(a[1, 3:5])      # Returns columns 4 to 5 of the 2nd row
print(a[4:, 4:])      # Returns rows 5 to 6 and columns 5 to 6
print(a[:, 2])        # Returns the 3rd column
print(a[2::2, ::2])   # Every second row starting from the 3rd, every second column

The output is:

18
[ 6  7  8  9 10 11]
[ 9 10]
[[28 29]
 [34 35]]
[ 2  8 14 20 26 32]
[[12 14 16]
 [24 26 28]]

tip

Unlike list slicing, the results of array slicing in NumPy are views of the array. Modifications to the view will affect the original array.
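The difference between a view and a copy can be seen in a short sketch (array contents are made up):

```python
import numpy as np

a = np.arange(6)
view = a[1:4]                  # a slice is a view onto the same memory
view[0] = 100                  # this also changes a[1]
print(a)                       # [  0 100   2   3   4   5]

b = np.arange(6)
independent = b[1:4].copy()    # an explicit copy owns its own memory
independent[0] = 100           # b is not affected
print(b)                       # [0 1 2 3 4 5]
```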

· Other Operations on Arrays

reshape()	Used to reshape the array, i.e., change the shape of the array. The new shape must be compatible with the original shape, and the values in the array do not change.
astype()	This method is used to convert the types of elements in the array. It returns a new array with the original array’s values after type conversion, usually requiring only the type to convert to.
copy()	This method returns a copy of the specified array; modifications to the copy do not affect the original array.
vstack()	Vertically stacks arrays
hstack()	Horizontally stacks arrays
vsplit()	Splits the array by rows
hsplit()	Splits the array by columns
delete()	Used to delete specified rows or columns from the array and returns a new array after deletion.
insert()	Used to insert rows or columns at specified positions in the array and returns a new array after insertion.
where()	Queries elements that meet certain conditions
sort()	Sorts the array by value

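The sketch below applies most of the operations in the table to one small array. The array and the variable names are made up for the illustration; each call returns a new array and leaves `a` unchanged.

```python
import numpy as np

a = np.array([[3, 1],
              [2, 4]])

flat = a.reshape(4)                               # [3 1 2 4]
as_float = a.astype(np.float64)                   # same values, float64 type
stacked_v = np.vstack([a, a])                     # shape (4, 2)
stacked_h = np.hstack([a, a])                     # shape (2, 4)
top, bottom = np.vsplit(a, 2)                     # two arrays of shape (1, 2)
no_first_row = np.delete(a, 0, axis=0)            # [[2 4]]
with_new_row = np.insert(a, 1, [9, 9], axis=0)    # [[3 1] [9 9] [2 4]]
rows, cols = np.where(a > 2)                      # positions of elements > 2
sorted_rows = np.sort(a, axis=1)                  # [[1 3] [2 4]]

print(stacked_v.shape, stacked_h.shape)           # (4, 2) (2, 4)
print(rows, cols)                                 # [0 1] [0 1]
```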

· Array Operations

When arrays a and b have the same shape, operations such as addition (+), subtraction (-), multiplication (*), division (/), exponentiation (**), integer division (//), and modulo (%) can be performed. Each element in the resulting array c corresponds to the result of the respective operation on the elements of arrays a and b.

In addition, NumPy supports matrix multiplication with the @ operator, as shown below:

import numpy as np

a = np.array([1, 2, 3, 4]).reshape(2, 2)
b = np.array([1, 0, 2, 1]).reshape(2, 2)
print(a @ b)

The output is:

[[ 5  2]
 [11  4]]
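To make the contrast with the element-wise operators concrete, here is a minimal sketch using the same two matrices. Note that * multiplies matching elements, while @ performs matrix multiplication.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 0], [2, 1]])

added = a + b          # [[2, 2], [5, 5]]
elementwise = a * b    # [[1, 0], [6, 4]]
squared = a ** 2       # [[1, 4], [9, 16]]
halved = a // 2        # [[0, 1], [1, 2]]
parity = a % 2         # [[1, 0], [1, 0]]
matrix = a @ b         # [[5, 2], [11, 4]]

print(elementwise)
print(matrix)
```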

PART 3

Introduction to Pandas

1►

Pandas Overview

Pandas is an open-source library with a BSD license that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Compared to NumPy, Pandas is more convenient for data manipulation, while NumPy has an advantage in numerical calculations. Using both together often leads to better data analysis results.

· Importing the Pandas Library

pd is the conventional name for pandas.

import pandas as pd

2►

Series and DataFrame

The pandas library provides two powerful data structure objects: Series and DataFrame. Mastering them can help simplify our data handling processes.

· Series

· Creating a Series

A Series is a one-dimensional labeled array, adding a label for indexing on top of the numpy array type. It can be created as follows:

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

Parameter explanations:

data	Data contained in the Series; supports lists, NumPy arrays, dictionaries, and scalar values
index	Row labels (indices)
dtype	Data type specified when the Series data is generated. If not specified, it will be inferred from the data automatically.
name	Name of the Series
copy	Whether to copy the input data. The default is False; in that case the Series may be a view of the input, so modifying the Series can also modify the original data.
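A minimal sketch of the different kinds of input that data accepts. The labels and values are made up for the illustration.

```python
import pandas as pd

# From a list, with explicit labels and a name for the Series.
s = pd.Series([10, 7, 7], index=["2020", "2021", "2022"], name="Max")

# From a dictionary: the keys become the index.
from_dict = pd.Series({"a": 1, "b": 2})

# From a scalar: the value is repeated for every label.
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(s["2021"])               # 7, selected by label
print(s.iloc[0])               # 10, selected by position
print(from_dict.index.tolist())   # ['a', 'b']
print(from_scalar.tolist())       # [5, 5, 5]
```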

The properties and operations of Series are similar to those of DataFrame, which will be introduced later.

· DataFrame

· Creating a DataFrame

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)


A DataFrame is a two-dimensional labeled array, where the parameter meanings are the same as in the pandas.Series() method, with an additional parameter columns used to set column labels.

· DataFrame Properties

Attribute or Method	Description
DataFrame.index	Returns row labels
DataFrame.columns	Returns column labels
DataFrame.dtypes	Returns the data type of each column
DataFrame.info()	Views basic information about the DataFrame object, including index, data types, memory information, etc.
DataFrame.values	Returns the values of all elements
DataFrame.size	Returns the number of elements
DataFrame.shape	Returns the number of rows and columns of the DataFrame object
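The following sketch builds a small DataFrame from a dictionary and reads the properties listed above. The team names and numbers are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"2020": [10, 8], "2021": [7, 5]},    # keys become column labels
    index=["Max", "Dream Team"],          # row labels
)

print(df.index.tolist())      # ['Max', 'Dream Team']
print(df.columns.tolist())    # ['2020', '2021']
print(df.shape)               # (2, 2)
print(df.size)                # 4
print(df.values.tolist())     # [[10, 7], [8, 5]]
print(df.dtypes)              # data type of each column
df.info()                     # prints a summary; note that info() is a method
```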

· DataFrame Indexing and Slicing

The flexibility and versatility of its indexing is what makes DataFrame powerful. Below, I focus on the iloc[] and loc[] indexers. Suppose we have a CSV file, team_population.csv, recording the number of students in each Mechatronics Innovation Team for the 2020 to 2022 cohorts. Judging from the program output, the file has the following layout:

Team	2020	2021	2022
Max	10	7	7
Dream Team	8	5	9
Pioneers	8	9	7
Ivy League	8	6	8

iloc[]: Selects by integer row and column positions, starting from 0. The input can be a single integer, a list or array of integers, or an integer slice, as shown below:

import pandas as pd

df = pd.read_csv('team_population.csv', index_col=0)
print(df.iloc[0])           # Single row: the Max team as a Series
print(df.iloc[[0, 1, 3]])   # Integer list, multiple rows: Max, Dream Team, Ivy League as a DataFrame
print(df.iloc[0:2])         # Slice, multiple rows: Max and Dream Team as a DataFrame
print(df.iloc[:, 1])        # Single column: the 2021 cohort of all four teams as a Series
print(df.iloc[:, [0, 2]])   # Integer list, multiple columns: the 2020 and 2022 cohorts as a DataFrame
print(df.iloc[0:3, 0:2])    # Row and column slices: the 2020 and 2021 cohorts of the first three teams
print(df.iloc[0, 2])        # Row and column position: the 2022 cohort of the Max team

The running result is as follows:

2020    10
2021     7
2022     7
Name: Max, dtype: int64
            2020  2021  2022
Max           10     7     7
Dream Team     8     5     9
Ivy League     8     6     8
            2020  2021  2022
Max           10     7     7
Dream Team     8     5     9
Max           7
Dream Team    5
Pioneers      9
Ivy League    6
Name: 2021, dtype: int64
            2020  2022
Max           10     7
Dream Team     8     9
Pioneers       8     7
Ivy League     8     8
            2020  2021
Max           10     7
Dream Team     8     5
Pioneers       8     9
7

loc[]: Selects by row and column labels. The input can be a single label, a list or array of labels, or a label slice (unlike an integer slice, a label slice includes its end label), as shown below:

import pandas as pd

df = pd.read_csv('team_population.csv', index_col=0)
print(df.loc['Max'])                              # Single row: the Max team as a Series
print(df.loc[['Max', 'Dream Team', 'Ivy League']])  # Label list, multiple rows: Max, Dream Team, Ivy League as a DataFrame
print(df.loc['Max':'Pioneers'])                   # Label slice, multiple rows: Max, Dream Team, Pioneers as a DataFrame
print(df.loc[:, '2021'])                          # Single column: the 2021 cohort of all four teams as a Series
print(df.loc[:, ['2020', '2022']])                # Label list, multiple columns: the 2020 and 2022 cohorts as a DataFrame
print(df.loc['Max':'Ivy League', '2021':'2022'])  # Row and column slices: the 2021 and 2022 cohorts of all four teams
print(df.loc['Max', '2022'])                      # Row and column label: the 2022 cohort of the Max team
print(df.loc['Max'].apply(lambda x: x >= 7))      # Anonymous function: a boolean Series marking the cohorts in which Max has at least 7 members

The running result is as follows:

2020    10
2021     7
2022     7
Name: Max, dtype: int64
            2020  2021  2022
Max           10     7     7
Dream Team     8     5     9
Ivy League     8     6     8
            2020  2021  2022
Max           10     7     7
Dream Team     8     5     9
Pioneers       8     9     7
Max           7
Dream Team    5
Pioneers      9
Ivy League    6
Name: 2021, dtype: int64
            2020  2022
Max           10     7
Dream Team     8     9
Pioneers       8     7
Ivy League     8     8
            2021  2022
Max            7     7
Dream Team     5     9
Pioneers       9     7
Ivy League     6     8
7
2020    True
2021    True
2022    True
Name: Max, dtype: bool
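The key difference between the two indexers is how slices treat the end point. The sketch below is self-contained: instead of reading team_population.csv from disk, it loads the same numbers from an in-memory string (the `csv_text` variable is an assumption made so the example runs anywhere). It also shows how a boolean condition can be used to actually filter rows.

```python
import io

import pandas as pd

# In-memory stand-in for team_population.csv (assumed layout).
csv_text = """,2020,2021,2022
Max,10,7,7
Dream Team,8,5,9
Pioneers,8,9,7
Ivy League,8,6,8
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

by_position = df.iloc[0:2]             # end position excluded: 2 rows
by_label = df.loc["Max":"Pioneers"]    # end label included: 3 rows
print(len(by_position), len(by_label))   # 2 3

# A boolean Series used as a mask keeps only the rows where it is True.
large_2021 = df[df["2021"] >= 7]
print(large_2021.index.tolist())       # ['Max', 'Pioneers']
```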

3►

Reading Files with Pandas

Pandas provides several flexible methods for reading files, which are simpler than Python’s native reading methods.

These include csv, json, excel, html, and many other commonly used file types. Below is a table summarizing these:

File Type	Read	Write
CSV	read_csv	to_csv
Fixed-Width Text File	read_fwf	–
JSON	read_json	to_json
HTML	read_html	to_html
LaTeX	–	Styler.to_latex
XML	read_xml	to_xml
Local clipboard	read_clipboard	to_clipboard
MS Excel	read_excel	to_excel
OpenDocument	read_excel	–

After reading, the function returns the corresponding Series or DataFrame for further data analysis.
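A minimal round trip through two of the formats in the table. To keep the sketch self-contained it reads from and writes to in-memory text buffers; in practice you would pass a file path such as 'data.csv' instead (that file name is hypothetical).

```python
import io

import pandas as pd

csv_text = "name,score\nAnn,90\nBob,85\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv writes the table back out; index=False omits the row numbers.
out = io.StringIO()
df.to_csv(out, index=False)
print(out.getvalue())

# to_json returns a string when no path is given.
json_text = df.to_json(orient="records")
print(json_text)
```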

4►

Data Cleaning with Pandas

When the data we acquire is incomplete or has significant deviations, we need to perform preliminary cleaning to achieve better data analysis results.

The pandas library provides many methods for data cleaning.

· Handling Null and Missing Values

Pandas provides the isnull() and notnull() methods to check for missing values.

Function Name	Function Effect
isnull()	Checks for null or missing values, returning a boolean value; returns True for missing values and False for non-missing values.
notnull()	Checks for null or missing values, returning a boolean value; returns False for missing values and True for non-missing values.
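Both methods return an object of the same shape as the input, filled with True and False. The sketch below, on a made-up table, also shows two common follow-up steps: counting missing values per column and keeping only complete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, "z"],
})

missing = df.isnull()                  # True where a value is missing
print(missing)

missing_per_column = df.isnull().sum()   # True counts as 1 when summed
print(missing_per_column)

# Keep only the rows in which every column has a value.
complete_rows = df[df.notnull().all(axis=1)]
print(complete_rows)
```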

· Handling Duplicate Values

To find rows that appear more than once in the data, pandas provides the duplicated() method.

df.duplicated(subset=None, keep='first')

Here, subset specifies the column labels to check (all columns by default), and keep controls which occurrence is left unmarked. The method returns a boolean Series in which True marks a duplicated row.
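A short sketch of duplicated() on a made-up table, including how the resulting mask can be used to drop the repeated rows:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "score": [90, 85, 90],
})

mask = df.duplicated()                   # first occurrence kept unmarked
print(mask.tolist())                     # [False, False, True]

by_name_last = df.duplicated(subset=["name"], keep="last")
print(by_name_last.tolist())             # [True, False, False]

unique_rows = df[~mask]                  # keep rows that are not duplicates
print(unique_rows)
```

pandas also offers drop_duplicates(), which performs the marking and the removal in one step.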

· Changing Data Types

When the data types in the data are inconsistent, we can use astype() to change the data types for convenience in subsequent calculations and analyses.

df.astype(dtype, copy=True, errors='raise')

Here, dtype is the target data type, and copy controls whether a copy of the data is returned.
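A typical case is a column of numbers that was read as text. The sketch below converts two columns at once by passing a dictionary that maps column names to target types; the table itself is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "year": ["2020", "2021"],     # numbers stored as strings
    "members": [10.0, 7.0],       # whole numbers stored as floats
})

converted = df.astype({"year": "int64", "members": "int32"})

print(converted.dtypes)
print(converted["year"].sum())    # 4041, arithmetic now works on the year column
```

astype() returns a new DataFrame; `df` itself keeps its original types.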

· Renaming Index Labels

When the original row or column labels do not suit our analysis goals, we can use the rename() method in pandas to rename them.

df.rename(mapper=None, *, index=None, columns=None, axis=None, copy=True, inplace=False)

Here, mapper, index, and columns are usually dictionaries, with the original label as the key and the new label as the value.
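A minimal sketch of renaming one row label and one column label with dictionaries. The new names are invented for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"2020": [10, 8]}, index=["Max", "Dream Team"])

renamed = df.rename(
    index={"Max": "Team Max"},        # old row label -> new row label
    columns={"2020": "cohort_2020"},  # old column label -> new column label
)

print(renamed)
print(df.index.tolist())   # ['Max', 'Dream Team'], the original is unchanged
```

Labels that do not appear in the dictionary, such as 'Dream Team' here, are left as they are.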

· Data Reshaping

Sometimes rearranging the layout of the data better meets our analysis needs. The stack() method in pandas moves the column labels into the row index, turning columns into rows.

df.stack(level=-1, dropna=True)

Here, dropna indicates whether to drop rows with missing values from the result; the default is True.
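The sketch below shows what stack() produces for a small made-up table, next to a plain transpose with the T attribute for comparison. Depending on the installed pandas version, calling stack() with its defaults may print a FutureWarning about an updated implementation; the result for this table is the same.

```python
import pandas as pd

df = pd.DataFrame(
    {"2020": [10, 8], "2021": [7, 5]},
    index=["Max", "Dream Team"],
)

# stack(): each (row label, column label) pair becomes one entry of a Series.
stacked = df.stack()
print(stacked)
print(stacked.loc[("Max", "2021")])   # 7

# T: rows and columns simply swap places.
transposed = df.T
print(transposed)
```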

I have only listed a few commonly used data cleaning methods in the pandas library here; for more methods, you can refer to the official pandas documentation:

https://pypandas.cn/docs/

PART 4

Introduction to Matplotlib

1►

Overview of Matplotlib

Matplotlib is a Python 2D plotting library that generates publication-quality graphics in various hardcopy formats and interactive environments across platforms. The pyplot sub-library contains a set of command-style functions that make Matplotlib work similarly to MATLAB, significantly simplifying our function plotting tasks.

· Importing the pyplot Sub-library of Matplotlib

plt is the conventional name for pyplot.

import matplotlib.pyplot as plt

2►

Plotting Operations

· plot() function

plt.plot(x, y)

plot() is the most basic plotting method, where x and y are usually lists or arrays, and the image will be plotted with x as the horizontal coordinate and y as the vertical coordinate.

tip

If you want to use plot() to obtain a continuous smooth image, you can try using the linspace() function from the numpy library introduced earlier.
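A minimal sketch of the tip, plotting a sine curve through 200 closely spaced points. Two things here are assumptions made so the script runs in any environment: the non-interactive "Agg" backend and the output file name sine.png. In an interactive session you can drop the backend line and call plt.show() instead of plt.savefig().

```python
import matplotlib
matplotlib.use("Agg")              # assumption: draw to a file, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)   # 200 evenly spaced points on [0, 2*pi]
y = np.sin(x)

plt.figure()
lines = plt.plot(x, y)             # straight segments between 200 points look smooth
plt.savefig("sine.png")            # hypothetical output file name
```

plot() always joins neighbouring points with straight segments, so the curve only looks smooth because the points are dense; with num=10 the corners would be clearly visible.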

· show() function

plt.show()

The show() function in the Matplotlib library is used to display figures. Plotting commands keep drawing onto the current figure until it is shown, so remember to call show() promptly after you finish each plot.

· Image Decoration Functions

Matplotlib provides a rich set of image decoration functions. Below are some commonly used functions.

Function	Description
title()	Adds a title to the current plot
legend()	Places a legend in the current plot
annotate()	Creates an annotation for specified data points
xlabel(s)	Sets the x-axis label
ylabel(s)	Sets the y-axis label
xticks()	Sets the positions and labels of x-axis ticks
yticks()	Sets the positions and labels of y-axis ticks
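The sketch below applies each function from the table to one simple line plot. The data, label texts, and the file name decorated.png are made up, and the "Agg" backend is an assumption so that the script runs without a display; use plt.show() in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")              # assumption: draw to a file, no window needed
import matplotlib.pyplot as plt

years = [2020, 2021, 2022]
members = [10, 7, 7]

plt.figure()
plt.plot(years, members, marker="o", label="Max")   # label is used by legend()
plt.title("Team size by cohort")
plt.xlabel("Cohort")
plt.ylabel("Members")
plt.xticks(years)                  # one tick per cohort
plt.yticks(range(0, 13, 2))        # ticks at 0, 2, ..., 12
plt.annotate("largest cohort", xy=(2020, 10), xytext=(2020.5, 11),
             arrowprops={"arrowstyle": "->"})
plt.legend()
plt.savefig("decorated.png")       # hypothetical output file name
```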

This is just a very small part of the library; for more functions, you can learn from the official documentation.

3►

Plotting Examples

The Matplotlib library allows us to plot images with very concise code. Below are examples of several commonly used images.

· Line Chart

Line charts are usually plotted using the plot() function. Below is a simple effect example.

· Bar Chart

Bar charts are usually plotted using the bar() function, with a simple function prototype as follows:

plt.bar(x, height, width=0.8, bottom=None, color=None)

Below is a simple effect example:
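The original figure is not reproduced here. As a stand-in, the following minimal sketch draws a bar chart of the 2020 cohort sizes used in the pandas examples; the colour, the file name bar.png, and the "Agg" backend are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")              # assumption: draw to a file, no window needed
import matplotlib.pyplot as plt

teams = ["Max", "Dream Team", "Pioneers", "Ivy League"]
members_2020 = [10, 8, 8, 8]

plt.figure()
bars = plt.bar(teams, members_2020, width=0.6, color="steelblue")
plt.ylabel("Members")
plt.title("2020 cohort by team")
plt.savefig("bar.png")             # hypothetical output file name
```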

· Pie Chart

Pie charts are usually plotted using the pie() function, with a simple function prototype as follows:

plt.pie(x, explode=None, labels=None, colors=None)

Below is a simple effect example:
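The original figure is not reproduced here. As a stand-in, the following minimal sketch draws a pie chart of the 2022 cohort sizes used in the pandas examples, pulling the first wedge slightly out with explode. The file name pie.png and the "Agg" backend are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")              # assumption: draw to a file, no window needed
import matplotlib.pyplot as plt

teams = ["Max", "Dream Team", "Pioneers", "Ivy League"]
members_2022 = [7, 9, 7, 8]

plt.figure()
plt.pie(
    members_2022,
    explode=(0.1, 0, 0, 0),        # offset of each wedge from the centre
    labels=teams,
)
plt.title("2022 cohort by team")
plt.savefig("pie.png")             # hypothetical output file name
```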

· Scatter Plot

Scatter plots are usually plotted using the scatter() function, with a simple function prototype as follows:

plt.scatter(x, y, s=None, c=None, marker=None)

Below is a simple effect example:
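The original figure is not reproduced here. As a stand-in, the following minimal sketch scatters 50 random points, with s setting the marker size and c colouring each point by its y value. The random data, the file name scatter.png, and the "Agg" backend are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")              # assumption: draw to a file, no window needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the picture is reproducible
x = rng.random(50)
y = rng.random(50)

plt.figure()
points = plt.scatter(x, y, s=30, c=y, marker="o")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("scatter.png")         # hypothetical output file name
```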

The above are just a few of the most commonly used chart types; for others, such as radar charts and heat maps, you can refer to the official documentation of Matplotlib:

https://www.matplotlib.org.cn/API/

PART 5

Final Thoughts from the Author

I am also a beginner in data analysis. The methods introduced in this article are the most basic applications of the libraries discussed. Python has a vast number of powerful data analysis libraries, and we need to consult the official documentation frequently and keep exploring and experimenting to better understand the nuances of data analysis. Interested students can also continue with the scikit-learn machine learning library and other tools, which provide better fitting and more persuasive mathematical support for their analyses. I hope to make progress together with everyone.

【End】


Text | Zhang Aoyu

Editor | Zhang Aoyu

Reviewer | Kou Yanlin Li Chenyang
