In routine scientific research, visualizing data in Python takes three steps:
- Identify the problem and choose the graph
- Transform data and apply functions
- Set parameters for clarity
1. First, what libraries do we use for plotting?
The most basic plotting library in Python is matplotlib, the foundation of Python data visualization. Typically, one starts with matplotlib and then expands in depth and breadth to other libraries.
Seaborn is a higher-level statistical visualization library built on matplotlib; it is often used for exploring features in data mining and machine learning work. Seaborn lets you describe multidimensional data with concise code.
Other options include Bokeh (a library for interactive visualizations in the browser that enables interaction between analysts and data) and Mapbox (a more powerful visualization tool for handling geographic data).
This article mainly uses matplotlib for case analysis.
Step 1: Identify the problem and choose the graph
Business problems can be complex; by breaking them down, we can find the specific question that the graph should answer. The McKinsey Method and The Pyramid Principle are useful references for training this kind of analytical thinking.
This is a summary of chart type selection found online.
In Python, we can summarize the following four basic visual elements to present graphs:
- Points: scatter plot for two-dimensional data, suitable for simple two-dimensional relationships
- Lines: line plot for two-dimensional data, suitable for time series
- Bars: bar plot for two-dimensional data, suitable for categorical statistics
- Colors: heatmap, suitable for displaying a third dimension
There are relationships among data such as distribution, composition, comparison, connection, and trends. Depending on the different relationships, choose the corresponding graph for display.
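As a rough illustration of the four basic elements listed above, the sketch below draws one of each on a 2x2 grid. All data is randomly generated or made up for the example, and the titles and color map are arbitrary choices rather than anything prescribed by this article.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the made-up data is reproducible

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Points: relationship between two variables
x = rng.normal(size=50)
axes[0, 0].scatter(x, 2 * x + rng.normal(size=50))
axes[0, 0].set_title('Points: scatter')

# Lines: a value changing over time
axes[0, 1].plot(np.arange(30), rng.normal(size=30).cumsum())
axes[0, 1].set_title('Lines: line plot')

# Bars: one number per category
axes[1, 0].bar(['a', 'b', 'c', 'd'], [3, 7, 5, 2])
axes[1, 0].set_title('Bars: bar plot')

# Colors: a third dimension encoded as color
im = axes[1, 1].imshow(rng.random((5, 5)), cmap='viridis')
axes[1, 1].set_title('Colors: heatmap')
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()
```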
Step 2: Transform data and apply functions
A lot of programming work in data analysis and modeling is based on data preparation: loading, cleaning, transforming, and reshaping. Our visualization steps also need to organize the data, transforming it into the format we need before applying visualization methods to complete the plotting.
Here are some common data transformation methods:
- Merging: merge (database-style joins), concat (stacking along an axis), combine_first (patches missing values in one object with values from another while keeping the union of both indexes, somewhat like a full outer join in databases)
- Reshaping: reshape (NumPy arrays); stack, unstack, and pivot (pandas; pivot is similar to Excel pivot tables)
- Deduplication: drop_duplicates
- Filling and replacing: fillna, replace
- Renaming axis indexes: rename
- Converting categorical variables into a dummy-variable matrix with the get_dummies function, restricting the values in certain columns of a DataFrame, etc.
Then pick the plotting function in Python that corresponds to the graph chosen in Step 1.
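A minimal sketch of how several of these transformations chain together before plotting. The two tables, their column names, and the choice to fill the missing amount with 0 are all hypothetical and invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical tables, invented for illustration
sales = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'B'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb', 'Feb'],
    'amount': [100.0, np.nan, 80.0, 95.0, 95.0],
})
regions = pd.DataFrame({'store': ['A', 'B'], 'region': ['north', 'south']})

clean = (
    sales.drop_duplicates()                      # remove the repeated B/Feb row
         .fillna({'amount': 0.0})                # fill the missing amount
         .merge(regions, on='store')             # database-style join on the store key
         .rename(columns={'amount': 'revenue'})  # rename a column
)

# Reshape: one row per store, one column per month
wide = clean.pivot(index='store', columns='month', values='revenue')

# Dummy-variable matrix for the categorical 'region' column
dummies = pd.get_dummies(clean['region'])
```

The wide table is already in the shape that a grouped bar chart expects, for example wide.plot(kind='bar').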
Step 3: Set parameters for clarity
After the initial graph is drawn, we can modify colors (color), line styles (linestyle), markers (marker), or other chart decorations such as the title (title), axis labels (xlabel, ylabel), axis ticks (set_xticks), and legends (legend) to make the graph more intuitive.
This step refines the result of the second step to make the graph clearer. The available parameters are listed in the documentation of each plotting function.
2. Basics of Visualization Plotting
Basics of Matplotlib Plotting
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In matplotlib, graphs live inside a Figure (the canvas), and a subplot creates the drawing area. You cannot plot on an empty Figure directly; you must first create one or more subplots, for example with add_subplot.
The figsize argument of plt.figure specifies the figure size.
# Create canvas
fig = plt.figure()
# Create a subplot; 221 means the first plot in a 2x2 grid.
ax1 = fig.add_subplot(221)
# A more common way today is to create the figure and all subplots at once; 2, 2 means a 2x2 grid that holds 4 plots
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
# The sharex and sharey parameters of plt.subplots make all subplots use the same x- and y-axis ticks.
You can use the subplots_adjust method of Figure to adjust spacing.
subplots_adjust(left=None,bottom=None,right=None,top=None,wspace=None,hspace=None)
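As a small sketch of the spacing adjustment described above, the example below removes the gaps between four subplots. The histogram data is random, and setting both spacings to 0 is just one possible choice.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax in axes.flat:
    ax.hist(np.random.randn(500), bins=50, color='k', alpha=0.5)

# wspace and hspace are fractions of the average subplot width and height
fig.subplots_adjust(wspace=0, hspace=0)
```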
Color, Marker, and Line Style
The plot function in matplotlib accepts a set of X and Y coordinates and can also accept a string abbreviation for color and line style: 'g--' means a green, dashed line. The same settings can also be specified explicitly as parameters.
Line plots can also include markers to highlight the positions of the data points. Markers can be included in the format string as well, but the marker type and line style must come after the color.
plt.plot(np.random.randn(30),color='g',linestyle='--',marker='o')
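To make the format-string shorthand concrete, this sketch draws the same style twice: once with the abbreviation and once with explicit keyword arguments. The variable names and the vertical offset of 5 are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(30).cumsum()

fig, ax = plt.subplots()
# Format string: 'g' = green, 'o' = circle marker, '--' = dashed line
short, = ax.plot(data, 'go--')
# The same styling with explicit keyword arguments, shifted up so both lines are visible
explicit, = ax.plot(data + 5, color='g', linestyle='--', marker='o')
```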
Ticks, Labels, and Legends
The pyplot functions xlim and xticks control the plotting range and the tick positions and tick labels of the chart; the corresponding Axes methods are set_xlim, set_xticks, and set_xticklabels.
When called without parameters, the current parameter values are returned; when called with parameters, the parameter values are set.
plt.plot(np.random.randn(30),color='g',linestyle='--',marker='o')
plt.xlim() # Call without parameters to return the current x-axis range
# You can try the same with xticks
(-1.4500000000000002, 30.45)
plt.plot(np.random.randn(30),color='g',linestyle='--',marker='o')
plt.xlim([0,15]) # Change the x-axis range to 0-15
(0, 15)
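The getter and setter behavior applies to ticks as well. The sketch below sets the range, tick positions, and tick labels through pyplot and then reads them back; the tick positions and label texts are made up for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

plt.figure()
plt.plot(np.random.randn(30).cumsum(), color='g', linestyle='--', marker='o')

plt.xlim(0, 20)                                      # set the visible x range
plt.xticks([0, 10, 20], ['start', 'middle', 'end'])  # tick positions and their labels

left, right = plt.xlim()      # no arguments: read the range back
locs, labels = plt.xticks()   # no arguments: read tick positions and label objects back
```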
Setting Titles, Axis Labels, Ticks, and Tick Labels
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(np.random.randn(1000).cumsum())
ticks = ax.set_xticks([0,250,500,750,1000]) # Set tick values
labels = ax.set_xticklabels(['one','two','three','four','five']) # Set tick labels
ax.set_title('My first Plot') # Set title
ax.set_xlabel('Stage') # Set axis label
Text(0.5,0,'Stage')
Legends are another important tool for identifying chart elements. You can pass a label parameter when plotting each line.
fig = plt.figure(figsize=(12,5))
ax = fig.add_subplot(111)
ax.plot(np.random.randn(1000).cumsum(),'k',label='one') # Pass the label parameter to name each line
ax.plot(np.random.randn(1000).cumsum(),'k--',label='two')
ax.plot(np.random.randn(1000).cumsum(),'k.',label='three')
# After the lines are drawn, call the legend method to display the labels.
ax.legend(loc='best') # Unless the position matters, loc='best' lets matplotlib choose the best position itself
In addition to standard chart objects, we can also add custom text annotations or arrows.
Annotations can be added using text, arrow, and annotate functions. The text function can draw text at specified x, y coordinate positions and can be customized.
plt.plot(np.random.randn(1000).cumsum())
plt.text(600,10,'test ',family='monospace',fontsize=10)
# Chinese text in annotations may not render with the default fonts; a font that supports Chinese has to be configured first (for example in the matplotlib configuration file). Please search for the specific steps.
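The annotate function mentioned above combines text with an arrow. The following is a minimal sketch that marks the highest point of a random walk; the label text, the offset of 5 units, and the arrow styling are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

walk = np.random.randn(1000).cumsum()
peak = int(walk.argmax())  # position of the highest point

fig, ax = plt.subplots()
ax.plot(walk)
ax.set_ylim(top=walk.max() + 10)  # leave room above the curve for the label

# xy is the point the arrow targets; xytext is where the label text is placed
note = ax.annotate(
    'maximum',
    xy=(peak, walk[peak]),
    xytext=(peak, walk[peak] + 5),
    arrowprops=dict(facecolor='black', shrink=0.05),
    ha='center',
)
```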
You can use plt.savefig to save the current chart to a file. The file type is determined by the extension, so a path ending in .png produces a PNG file. The main parameters are:
- fname: a string containing the file path; the extension specifies the file type
- dpi: resolution in dots per inch, default 100
- facecolor, edgecolor: background color of the image, default 'w' (white)
- format: explicitly specifies the file format ('png', 'pdf', 'svg', 'ps', 'jpg', etc.)
- bbox_inches: the part of the chart to keep; if set to 'tight', it attempts to trim the blank areas around the image
plt.savefig('./plot.jpg') # Save the image as a jpg format image named plot
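The sketch below combines several of these parameters. It writes to an in-memory buffer instead of a file only so that the example leaves nothing on disk; passing a path such as the one used above works the same way. Calling savefig on the Figure object, rather than through plt, is one way to make sure the intended figure is the one saved, since plt.savefig acts on whatever figure is current at that moment.

```python
import io
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(np.random.randn(100).cumsum())

# Save from the Figure object itself; format is required here because a buffer has no extension
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=200, bbox_inches='tight')
png_bytes = buffer.getvalue()
```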
3. Plotting Functions in Pandas
Matplotlib is the most basic plotting library and is relatively low-level: to assemble a chart, you need to call each basic component separately. Pandas has many high-level plotting methods built on matplotlib, so charts that would normally require many lines of code can be created with only a few.
We use the plotting methods built into pandas, which draw with matplotlib underneath.
import matplotlib.pyplot as plt
Both Series and DataFrame have a plot method for generating various charts. By default, they generate line charts.
s = pd.Series(np.random.randn(10).cumsum(),index=np.arange(0,100,10))
s.plot() # The index of the Series object will be passed to matplotlib as the x-axis for plotting.
df = pd.DataFrame(np.random.randn(10,4).cumsum(0),columns=['A','B','C','D'])
df.plot() # plot will automatically change colors for different variables and add legends
Parameters of the Series.plot Method
- label: label used for the chart legend
- style: style string, such as 'g--'
- alpha: fill opacity of the plot (0 to 1)
- kind: chart type ('bar', 'line', 'hist', 'kde', etc.)
- xticks: set x-axis tick values
- yticks: set y-axis tick values
- xlim, ylim: set axis limits, such as [0, 10]
- grid: display axis grid lines, off by default
- use_index: use the object's index as tick labels
- logy: use a logarithmic scale on the y-axis
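Several of the parameters above can be passed in a single call. The sketch below is one possible combination; the label text, tick values, and limits are arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))

fig, ax = plt.subplots()
s.plot(
    ax=ax,                 # draw on an existing subplot
    kind='line',
    style='g--',           # green dashed line
    label='random walk',   # name shown in the legend
    alpha=0.7,
    xticks=[0, 30, 60, 90],
    xlim=[0, 90],
    grid=True,
)
ax.legend()
```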
Parameters of the DataFrame.plot Method
In addition to the parameters in Series, DataFrame has some unique options.
- subplots: plot each DataFrame column in a separate subplot
- sharex, sharey: share the x- and y-axes between subplots
- figsize: control the figure size
- legend: add a legend, shown by default
- sort_columns: draw columns in alphabetical order; by default the current column order is used (recent pandas releases have deprecated and then removed this option)
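A short sketch of the DataFrame-specific options, assuming the same kind of random data used earlier: each column gets its own subplot, and all of them share the x-axis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(10, 4).cumsum(0), columns=['A', 'B', 'C', 'D'])

# One subplot per column, stacked vertically and sharing the x-axis;
# with subplots=True the call returns an array of Axes, one per column
axes = df.plot(subplots=True, sharex=True, figsize=(8, 8), legend=True)
```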
By adding kind='bar' or kind='barh' to the code for generating line charts, you can generate vertical or horizontal bar charts.
fig,axes = plt.subplots(2,1)
data = pd.Series(np.random.rand(10),index=list('abcdefghij'))
data.plot(kind='bar',ax=axes[0],rot=0,alpha=0.3)
data.plot(kind='barh',ax=axes[1],grid=True)
Bar charts have a very practical use:
Combined with value_counts, they display the frequency of each value in a Series (or of each distinct row in a DataFrame).
For example: df.value_counts().plot(kind='bar')
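The frequency plot can be sketched on a small Series as follows. The survey answers are hypothetical and invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey answers, invented for illustration
answers = pd.Series(['yes', 'no', 'yes', 'maybe', 'yes', 'no'])

counts = answers.value_counts()  # sorted from most to least frequent

fig, ax = plt.subplots()
counts.plot(kind='bar', ax=ax, rot=0)  # rot=0 keeps the category labels horizontal
ax.set_ylabel('count')
```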
The basic syntax for Python visualization ends here, and other chart drawing methods are largely similar.
The key is to follow the thought process of thinking, selecting, and applying through the three steps. More practice will lead to greater proficiency.
Editor / Zhang Zhihong
Reviewer / Fan Ruiqiang
Verification / Zhang Zhihong