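Python Pandas - Categorical Data

In Pandas, categorical data is a data type for representing categorical variables, similar to factors in R. It is commonly used in statistics.

A categorical variable can take values such as "male" or "female", or ratings such as "poor", "average", and "excellent". Unlike numerical data, categorical data does not support mathematical operations such as addition or division.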

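Pandas stores categorical data efficiently by combining an array of category values with an array of integer codes that refer to those categories. This saves memory and can improve performance on large datasets with many repeated values.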
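The sketch below illustrates this layout on a hypothetical Series of repeated city names (the data is invented for illustration). The .cat.categories and .cat.codes accessors expose the two arrays, and memory_usage(deep=True) gives a rough size comparison. Exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated many times
s_obj = pd.Series(["London", "Paris", "Tokyo"] * 10_000)
s_cat = s_obj.astype("category")

# The distinct category values, stored once
print(s_cat.cat.categories)

# Integer codes pointing into the categories, one per row
print(s_cat.cat.codes.head())

# Compare memory footprints in bytes (exact numbers vary by pandas version)
mem_obj = s_obj.memory_usage(deep=True)
mem_cat = s_cat.memory_usage(deep=True)
print(mem_obj, mem_cat)
```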

The categorical data type is useful in the following cases:

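A string variable that contains only a few distinct values. Converting such a variable to a categorical one saves memory.
A variable whose lexical order differs from its logical order ("one", "two", "three"). By converting it to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order.
As a signal to other Python libraries that the column should be treated as a categorical variable (for example, to use suitable statistical methods or plot types).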
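The second case can be sketched as follows, using a hypothetical Series of number words. The order is declared with CategoricalDtype from pandas.api.types; this is one way to specify it, not the only one.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical values whose alphabetical order differs from their meaning
s = pd.Series(["three", "one", "two", "one"])

# Plain strings sort lexically: one, one, three, two
lexical = s.sort_values().tolist()

# Declare the logical order explicitly
order = CategoricalDtype(categories=["one", "two", "three"], ordered=True)
s_cat = s.astype(order)

# Sorting, min and max now follow the declared order
logical = s_cat.sort_values().tolist()
print(lexical)
print(logical)
print(s_cat.min(), s_cat.max())
```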

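In this tutorial, we will learn the basics of working with categorical data in Pandas, including creating categorical Series and DataFrame objects, controlling their behavior, and recovering the original data from categorical values.

Creating a Series with Categorical Data

You can create a Pandas Series or DataFrame with categorical data directly by passing dtype="category" to the Series() or DataFrame() constructor.

Example

The following is a basic example of creating a Pandas Series object with categorical data.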

import pandas as pd

# Create Series object with categorical data
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Display the categorical Series 
print('Series with Categorical Data:\n', s)

The following is the output of the above code:

Series with Categorical Data:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
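The same dtype="category" argument also works with the DataFrame() constructor. The sketch below uses a small hypothetical DataFrame and shows that each column infers its own categories from its own values.

```python
import pandas as pd

# Passing dtype="category" to the DataFrame constructor converts every column
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

print(df.dtypes)

# Each column infers its own categories from its own values
cats_a = df["A"].cat.categories.tolist()
cats_b = df["B"].cat.categories.tolist()
print(cats_a)
print(cats_b)
```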

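Example: Converting an Existing DataFrame Column to Categorical Data

This example demonstrates how to convert an existing Pandas DataFrame column to the categorical data type using the astype() method.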

import pandas as pd

# Create a DataFrame 
df = pd.DataFrame({"Col_a": list("aeeioou"), "Col_b": range(7)})

# Display the Input DataFrame
print('Input DataFrame:\n',df)
print('Verify the Data type of each column:\n', df.dtypes)

# Convert the data type of Col_a to categorical
df['Col_a'] = df["Col_a"].astype("category")

# Display the converted DataFrame
print('Converted DataFrame:\n',df)
print('Verify the Data type of each column:\n', df.dtypes)

The following is the output of the above code:

Input DataFrame:
   Col_a  Col_b
0     a      0
1     e      1
2     e      2
3     i      3
4     o      4
5     o      5
6     u      6

Verify the Data type of each column:
Col_a    object
Col_b     int64
dtype: object

Converted DataFrame:
   Col_a  Col_b
0     a      0
1     e      1
2     e      2
3     i      3
4     o      4
5     o      5
6     u      6

Verify the Data type of each column:
Col_a    category
Col_b       int64
dtype: object

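Controlling Categorical Data Behavior

By default, Pandas infers the categories from the data and treats them as unordered. To control this behavior, you can use the CategoricalDtype class from the pandas.api.types module.

Example

This example demonstrates how to apply a CategoricalDtype to an entire DataFrame.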

import pandas as pd
from pandas.api.types import CategoricalDtype

# Create a DataFrame 
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Display the Input DataFrame
print('Input DataFrame:\n',df)
print('Verify the Data type of each column:\n', df.dtypes)

# Applying CategoricalDtype to a DataFrame
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
df_cat = df.astype(cat_type)

# Display the converted DataFrame
print('Converted DataFrame:\n', df_cat)
print('Verify the Data type of each column:\n', df_cat.dtypes)

The following is the output of the above code:

Input DataFrame:
    A  B
0  a  b
1  b  c
2  c  c
3  a  d
Verify the Data type of each column:
 A    object
B    object
dtype: object
Converted DataFrame:
    A  B
0  a  b
1  b  c
2  c  c
3  a  d
Verify the Data type of each column:
 A    category
B    category
dtype: object
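To see what the explicit CategoricalDtype changes, the sketch below rebuilds the same DataFrame and inspects the result. Both columns share the full category list, and because the type is ordered, the two columns can be compared element-wise.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
df_cat = df.astype(cat_type)

# Both columns carry all four categories, even though "d" never occurs in A
cats_a = df_cat["A"].cat.categories.tolist()
print(cats_a)
print(df_cat["A"].cat.ordered)

# Ordered categoricals with identical categories can be compared element-wise
a_less_than_b = (df_cat["A"] < df_cat["B"]).tolist()
print(a_less_than_b)
```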

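Converting Categorical Data Back to Original Data

After converting a Series to categorical data, you can convert it back to its original form using Series.astype() or np.asarray().

Example

This example uses the astype() method to convert the categorical data of a Series object back to the object data type.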

import pandas as pd

# Create Series object with categorical data
s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Display the categorical Series 
print('Series with Categorical Data:\n', s)

# Display the converted Series
print('Converted Series back to original:\n ', s.astype(str))

The following is the output of the above code:

Series with Categorical Data:
 0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
Converted Series back to original:
  0    a
1    b
2    c
3    a
dtype: object
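The np.asarray() route mentioned above can be sketched in the same way. It returns a plain NumPy array of the category values, not a Series, so the index is dropped.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# np.asarray materializes the category values as a plain NumPy array
arr = np.asarray(s)
print(type(arr))
print(arr)
```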

Using the describe() Method

Calling .describe() on categorical data produces output similar to that of a Series or DataFrame of string type.

import pandas as pd
import numpy as np

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})

print(df.describe())
print(df["cat"].describe())

Its output is as follows:

       cat s
count    3 3
unique   2 2
top      c c
freq     2 2
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
