Practical Data Analysis and Visualization with Python

This is an old article from 2022. With the support of AI now, it should be much easier to accomplish.

Background

I happened to see a video discussing national fortunes, and felt inspired after watching it, so I shared it on my video account. Later, I thought, could I find some data to support it? Thus, this video (and article) was created.

Data Acquisition

I initially wanted to find data starting from the Tang Dynasty, but then I realized that it seemed unreasonable, as it would be difficult to make horizontal comparisons. Collecting and estimating the data myself could be done, but it doesn’t seem to be my area of expertise. So I settled for existing data: GM-GDP-Dataset-v27.xlsx.

The data is on GDrive, and some friends may not be able to access it. You can get the download link by replying <span>0514GDP</span> in the public account backend.

Data Processing

File Format Conversion

The downloaded data is an <span>Excel</span> file with multiple <span>sheets</span>. To facilitate processing, we will convert it into a <span>csv</span> file. There are two methods:

  1. Simple and convenient, directly open the file with <span>Excel</span>, select the desired <span>sheet</span>, and save it as <span>csv</span>.
  2. Use a program.

<span>xlsx</span> to <span>csv</span>

  • First, import the necessary packages:
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Bar, Timeline
from pyecharts.commons.utils import JsCode
  • Convert Excel to csv:
# Read Excel, specify the workbook
df = pd.read_excel("GM-GDP-Dataset-v27.xlsx", sheet_name="data-GDP-in-columns")

# Since the original data has two rows of descriptions, we need to reprocess the column names
columns = ["geo", "name"]  # First and second columns
columns.extend(df.iloc[1, 2:].astype(int).astype(str).tolist())  # Other columns

df.columns = columns  # Reset column names
df = df[2:]  # Remove those two rows of descriptions
  • Further processing:
df = df.fillna(0)       # Fill NaN with 0
df.index = df["name"]   # Change index to country names
df.pop("geo")           # Discard abbreviation
df.pop("name")          # Discard country names (already in the index)
  • More processing:
df = df[:-7]            # The last 7 rows are data for the world and continents, discard them
  • Processing result:
df.head()
\t1800\t1801\t1802\t1803\t1804\t1805\t1806\t1807\t1808\t1809\t...\t2041\t2042\t2043\t2044\t2045\t2046\t2047\t2048\t2049\t2050
name
Afghanistan\t2.211294e+09\t2.211294e+09\t2.211294e+09\t2.211294e+09\t2.211294e+09\t2.211294e+09\t2.211294e+09\t2.211294e+09\t2.211294e+09\t2.211294e+09\t...\t1.663444e+11\t1.723039e+11\t1.784054e+11\t1.846484e+11\t1.910328e+11\t1.975585e+11\t2.042263e+11\t2.110376e+11\t2.179942e+11\t2.250976e+11
Albania\t2.982917e+08\t2.996139e+08\t3.009420e+08\t3.022761e+08\t3.036162e+08\t3.054195e+08\t3.067735e+08\t3.081335e+08\t3.094995e+08\t3.108722e+08\t...\t6.399410e+10\t6.485958e+10\t6.572149e+10\t6.657911e+10\t6.743400e+10\t6.828550e+10\t6.913390e+10\t6.997973e+10\t7.082276e+10\t7.166292e+10
Algeria\t1.969643e+09\t1.976833e+09\t1.984050e+09\t1.991293e+09\t1.998562e+09\t2.005859e+09\t2.013181e+09\t2.020530e+09\t2.027906e+09\t2.035309e+09\t...\t7.435795e+11\t7.666284e+11\t7.903464e+11\t8.147227e+11\t8.397461e+11\t8.654191e+11\t8.917401e+11\t9.186838e+11\t9.462136e+11\t9.742973e+11
Andorra\t3.551816e+06\t3.557750e+06\t3.563685e+06\t3.572586e+06\t3.578521e+06\t3.584456e+06\t3.590390e+06\t3.596325e+06\t3.605226e+06\t3.611161e+06\t...\t8.440631e+09\t8.609413e+09\t8.780199e+09\t8.951468e+09\t9.123809e+09\t9.295095e+09\t9.466711e+09\t9.638031e+09\t9.807867e+09\t9.977686e+09
Angola\t1.082731e+09\t1.086235e+09\t1.091491e+09\t1.096747e+09\t1.100251e+09\t1.105507e+09\t1.110763e+09\t1.116019e+09\t1.121275e+09\t1.124779e+09\t...\t4.941509e+11\t5.184238e+11\t5.437268e+11\t5.700874e+11\t5.975349e+11\t6.261032e+11\t6.558303e+11\t6.867555e+11\t7.189204e+11\t7.523675e+11
5 rows × 251 columns

The data is ready, let’s start plotting.

Plotting

Translation

From the data above, we can see that the country names are in English, which is not intuitive, so let’s translate them:

NameDict = {"China": "🇨🇳中国", "United States": "🇺🇸美国", "India": "🇮🇳印度", "Indonesia": "🇮🇩印尼", "Japan": "🇯🇵日本",
            "France": "🇫🇷法国", "Italy": "🇮🇹意大利", "Russia": "🇷🇺俄罗斯", "United Kingdom": "🇬🇧英国", "Germany": "🇩🇪德国",
            "Spain": "🇪🇸西班牙", "Bangladesh": "🇧🇩孟加拉", "Pakistan": "🇵🇰巴基斯坦", "Czech Republic": "🇨🇿捷克",
            "Nigeria": "🇳🇬尼日利亚", "Turkey": "🇹🇷火鸡", "Netherlands": "🇧🇶荷兰", "Ukraine": "🇺🇦乌克兰", "Poland": "🇵🇱波兰",
            "South Korea": "🇰🇷韩国", "Mexico": "🇲🇽墨西哥", "Egypt": "🇪🇬埃及", "Brazil": "🇧🇷巴西", "Canada": "🇨🇦加拿大",
            "Saudi Arabia": "🇸🇦沙特", "Ireland": "🇮🇪爱尔兰", "Belgium": "🇧🇪比利时", "Austria": "🇦🇹奥地利", "Argentina": "🇦🇷阿根廷",
            "Australia": "🇦🇺澳大利亚", "Switzerland": "『中立国』瑞士", "Philippines": "🇵🇭菲律宾", "South Africa": "🇿🇦南非",
            "Thailand": "🇹🇭泰国", "Taiwan": "🇨🇳中国台湾", "Iran": "🇮🇷伊朗"}

The reason for not translating all of them is that I made a significant discovery: the top 20 countries hardly change, so there is no need to translate the other 100 or so.

Plotting

tl = Timeline().add_schema(play_interval=300, is_loop_play=False)
for i in range(1800, 2051):
    data = df[str(i)].sort_values()[-20:]
    bar = (
        Bar(opts.InitOpts(bg_color="white"))
        .add_xaxis([NameDict.get(x, x) for x in data.index.tolist()])
        .add_yaxis("", [round(x/1000000000000, 2) for x in data.tolist()])
        .reversal_axis()
        .set_series_opts(label_opts=opts.LabelOpts(position="right"))
        .set_global_opts(
            title_opts=opts.TitleOpts(f"Purchasing Power Parity GDP (Trillions of USD)  {i} Year")
        )
        .set_series_opts(
            itemstyle_opts={
                "normal": {
                    "color": JsCode(
                        """new echarts.graphic.LinearGradient(0, 0, 0, 1, [{offset: 0,color: 'rgba(255, 0, 0, 1)'}, {offset: 1,color: 'rgba(0, 77, 167, 1)'}], false)"""
                    ),
                    "barBorderRadius": [30, 30, 30, 30], "shadowColor": "rgb(0, 160, 221)",
                }
            }
        )
    )

    tl.add(bar, "{}年".format(i))

tl.render_notebook()    # Render directly
# tl.render("GDP.html") # Save as a webpage

At this point, the data support for national fortunes has been completed.

WeChatFerry: A tool that no longer exists

Reply <span>WCF</span> to observe🤖 college students reply <span>Quantum</span> to research quantum computing

Reply <span>Insurance</span> for consultation on insurance plans

Responses from humans may be slow, please be patient

Leave a Comment