Data analysis and visualization in Python

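Python provides many libraries for data analysis and visualization, chiefly numpy, pandas, matplotlib, and seaborn. This section covers pandas, an open-source library built on top of numpy.

It supports fast analysis as well as data cleaning and preparation. Pandas also ships with built-in plotting methods that build on matplotlib, the library used in the plotting section below.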

Installation

To install pandas, run the following command in a terminal:

pip install pandas

Or, if you have Anaconda, you can use

conda install pandas
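
A quick way to check that the installation worked is to import the library and print its version. This is only a minimal sketch; the version strings it prints depend on your environment.

```python
import numpy as np
import pandas as pd

# If both imports succeed, the packages are installed and importable
pandas_version = pd.__version__
numpy_version = np.__version__

print("pandas", pandas_version)
print("numpy", numpy_version)
```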

Pandas - DataFrame

The DataFrame is the main tool when working with pandas.

Code:

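import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(50)  # fix the seed so the random values are reproducible
# 6 rows labelled a-f and 4 columns labelled w-z, filled with standard normal samples
df = pd.DataFrame(randn(6, 4), index=['a','b','c','d','e','f'], columns=['w','x','y','z'])
df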

Output

	w	x	y	z
a	-1.560352	-0.030978	-0.620928	-1.464580
b	1.411946	-0.476732	-0.780469	1.070268
c	-1.282293	-1.327479	0.126338	0.862194
d	0.696737	-0.334565	-0.997526	1.598908
e	3.314075	0.987770	0.123866	0.742785
f	-0.393956	0.148116	-0.412234	-0.160715
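
Once a DataFrame exists, most work consists of selecting rows and columns by label. The sketch below uses a small hand-written frame (the name `scores` and its values are made up for illustration) so that the results are easy to follow.

```python
import pandas as pd

# Illustrative data, not taken from the example above
scores = pd.DataFrame(
    {"w": [1, 2, 3], "x": [10, 20, 30]},
    index=["a", "b", "c"],
)

col_w = scores["w"]          # one column, returned as a Series
row_b = scores.loc["b"]      # one row, selected by its index label
cell = scores.loc["c", "x"]  # a single value by row and column label

# Create a new column from existing ones
scores["total"] = scores["w"] + scores["x"]

print(scores)
```

Label-based selection with `loc` keeps the code readable when the index carries meaning, as it does with the `a` to `f` labels above.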

Pandas - Missing data

We will look at some convenient ways to deal with missing data in pandas. Missing entries are represented as NaN.

import numpy as np
import pandas as pd

# np.nan marks the missing entries
d = {'A': [1, 2, np.nan], 'B': [9, np.nan, np.nan], 'C': [1, 4, 9]}
df = pd.DataFrame(d)
df

Output

	A	B	C
0	1.0	9.0	1
1	2.0	NaN	4
2	NaN	NaN	9

So the DataFrame above has 3 missing values.
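
Rather than counting by eye, the missing values can be counted programmatically. A minimal sketch that rebuilds the same frame and uses `isna()` together with `sum()`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [9, np.nan, np.nan], 'C': [1, 4, 9]})

# isna() gives a boolean frame; summing it counts the True values per column
missing_per_column = df.isna().sum()
# Summing again gives the count for the whole frame
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```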

# Drop every row that contains at least one NaN
df.dropna()

	A	B	C
0	1.0	9.0	1

# Drop every column that contains at least one NaN
df.dropna(axis=1)

	C
0	1
1	4
2	9

# Keep only the rows that have at least 2 non-NaN values
df.dropna(thresh=2)

	A	B	C
0	1.0	9.0	1
1	2.0	NaN	4

# Replace each NaN with the mean of its column
df.fillna(value=df.mean())

	A	B	C
0	1.0	9.0	1
1	2.0	9.0	4
2	1.5	9.0	9
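
The calls above are not the only options. The sketch below, on the same frame, shows a few variations that may be useful: dropping only rows that are entirely empty, filling with a constant, and filling each column with a different value. The variable names are illustrative. Note that none of these calls modifies `df` in place; each returns a new DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [9, np.nan, np.nan], 'C': [1, 4, 9]})

# Drop a row only if every value in it is NaN (no such row exists here)
all_nan_dropped = df.dropna(how="all")

# Replace every NaN with a constant
zero_filled = df.fillna(0)

# Use a different fill value per column: the mean for A, zero for B
per_column = df.fillna({"A": df["A"].mean(), "B": 0})

print(per_column)
```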

Pandas - Importing data

We will read a CSV file stored on the local machine (as in this example); a CSV file can also be read directly from the web.

# Import the pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("SYB61_T03_Population Growth Rates in Urban areas and Capital cities.csv", encoding="ISO-8859-1")

# Show the first five rows of the DataFrame
df.head()

Output

Get the number of rows and columns in the DataFrame (that is, in the CSV file).

# Count the number of rows and columns in the DataFrame
df.shape

Output

(4166, 9)
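
The CSV file used above is not included here, so the following sketch reads a small made-up CSV from an in-memory string instead. The column names and values are hypothetical; `read_csv` accepts a file path, a URL, or a file-like object such as `io.StringIO` in the same way.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for this sketch
csv_text = """Region,Year,Value
North,2005,1.2
North,2010,1.5
South,2005,2.1
South,2010,
"""

# read_csv treats the StringIO object like an open file
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # first two rows only
print(df.shape)    # (rows, columns)
```

The empty last field is read as NaN, which ties in with the missing-data methods shown earlier.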

Pandas - DataFrame math operations

Operations on a DataFrame can be carried out with the various statistical tools that pandas provides.

# Compute various summary statistics, excluding NaN values
df.describe()

Output

# Compute the rank of each numerical value within its column
df.rank()

Output

.....
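
Since the output for the CSV data is not shown, here is a small sketch of what `describe()` and `rank()` return, using a made-up single-column frame (the column name `score` is hypothetical).

```python
import pandas as pd

# Illustrative data with one tie (the two 30s)
df = pd.DataFrame({"score": [10, 30, 20, 30]})

# count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()

# Rank within the column; tied values share the average of their ranks by default
ranks = df.rank()

print(summary)
print(ranks)
```

With the default settings the two tied values, which would occupy ranks 3 and 4, both receive rank 3.5.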

Pandas - Plotting graphs

import matplotlib.pyplot as plt
years = [1981, 1991, 2001, 2011, 2016]

Average_populations = [716493000, 891910000, 1071374000, 1197658000, 1273986000]

plt.plot(years, Average_populations)
plt.title("Census of India: sample registration system")
plt.xlabel("Year")
plt.ylabel("Average_populations")
plt.show()

Output

Scatter plot of the data above

plt.scatter(years, Average_populations)
plt.show()

Histogram

import matplotlib.pyplot as plt

Average_populations = [716493000, 891910000, 1071374000, 1197658000, 1273986000]

plt.hist(Average_populations, bins = 10)
plt.xlabel("Average_populations")
plt.ylabel("Frequency")

plt.show()

Output
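
The plots above call matplotlib directly. The same line chart can also be drawn through the plotting methods built into pandas, which call matplotlib for you. The sketch below assumes matplotlib is installed and wraps the same figures in a Series; the name `population` is illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

years = [1981, 1991, 2001, 2011, 2016]
average_populations = [716493000, 891910000, 1071374000, 1197658000, 1273986000]

# The index becomes the x axis and the values become the y axis
population = pd.Series(average_populations, index=years, name="Average_populations")

# Series.plot returns the matplotlib Axes, so labels can be set as usual
ax = population.plot(kind="line", title="Census of India: sample registration system")
ax.set_xlabel("Year")
ax.set_ylabel("Average_populations")

# plt.show()  # uncomment to display the figure in an interactive session
```

Passing `kind="bar"` or `kind="hist"` instead of `kind="line"` produces the other common chart types from the same object.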

Nitya Raut

Updated on: 30 July 2019
