Python Pandas - Ordering and Sorting Categorical Data

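In data analysis we often work with categorical data, especially in columns that hold repeated string values such as country names, gender, or ratings. Categorical data can take only a limited number of distinct values. For example, the values "India" and "Australia" in a country column, or "Male" and "Female" in a gender column, are categorical. These values can also be ordered, which allows logical sorting.

Categorical is one of the data types Pandas provides for variables with a fixed number of possible values, known as categories. This kind of data is common in statistical analysis. In this tutorial, we will learn how to order and sort categorical data with Pandas.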
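
As a quick illustration of the idea above, the sketch below builds a small ordered categorical Series. The rating labels and their order are hypothetical values chosen for this example, not data from a real dataset.

```python
import pandas as pd

# Hypothetical rating labels, listed from lowest to highest
rating_dtype = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)

ratings = pd.Series(["medium", "low", "high", "low"], dtype=rating_dtype)

# The category list, whether it is ordered, and the integer code behind each value
print(list(ratings.cat.categories))
print(ratings.cat.ordered)
print(ratings.cat.codes.tolist())
```

Each value is stored as an integer code that points into the category list, which is why a column with many repeated strings can be kept compactly.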

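Ordering Categorical Data

An ordered categorical in Pandas carries a meaningful order among its categories. This enables operations such as min(), max(), and comparisons with < and >, and gives sorting a well-defined meaning. If you apply min() or max() to an unordered categorical, Pandas raises a TypeError. The .cat accessor provides the as_ordered() method, which converts a categorical to an ordered one using its existing category order.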

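Example

The following example shows how to create an ordered categorical Series with the .cat.as_ordered() method, and how to find the minimum and maximum values of that Series.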

import pandas as pd

# Create a categorical series
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"]).astype(pd.CategoricalDtype())

# Convert the categorical series into ordered using the .cat.as_ordered() method 
s = s.cat.as_ordered()

# Display the ordered categorical series
print('Ordered Categorical Series:\n',s)

# Perform the minimum and maximum operation on ordered categorical series
print('Minimum value of the categorical series:',s.min())
print('Maximum value of the categorical series:', s.max())

Following is the output of the above code:

Ordered Categorical Series: 
0    a
1    b
2    c
3    a
4    a
5    a
6    b
7    b
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

Minimum value of the categorical series: a
Maximum value of the categorical series: c
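
For contrast, here is a minimal sketch of what happens when min() is called on a categorical that has not been marked as ordered. The variable names are illustrative only.

```python
import pandas as pd

# An unordered categorical series (the default with dtype="category")
s = pd.Series(["a", "b", "c", "a"], dtype="category")

try:
    s.min()
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# After converting to an ordered categorical, min() and comparisons work
s_ordered = s.cat.as_ordered()
print(s_ordered.min())
print((s_ordered > "a").tolist())
```

The call on the unordered Series fails, while the same operations succeed once as_ordered() has been applied.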

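Reordering Categories

Pandas lets you reorder or reset the categories of categorical data with the .cat.reorder_categories() and .cat.set_categories() methods.

reorder_categories(): Reorders the existing categories according to the specified new_categories. The new list must contain exactly the same categories as the old one; otherwise a ValueError is raised.
set_categories(): Sets a new list of categories, which may add new categories or remove existing ones. Values whose category is removed become NaN.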

Example

The following example shows how to reorder categories with the reorder_categories() and set_categories() methods.

import pandas as pd

# Create an unordered categorical series; categories are inferred as ['a', 'b', 'c']
s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# Reorder categories using reorder_categories
s_reordered = s.cat.reorder_categories(["b", "a", "c"], ordered=True)

print("Reordered Categories:\n", s_reordered)

# Set new categories using set_categories
s_new_categories = s.cat.set_categories(["d", "b", "a", "c"], ordered=True)

print("\nNew Categories Set:\n", s_new_categories)

Following is the output of the above code:

Reordered Categories:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (3, object): ['b' < 'a' < 'c']

New Categories Set:
0    b
1    a
2    c
3    a
4    b
dtype: category
Categories (4, object): ['d' < 'b' < 'a' < 'c']
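
The difference between the two methods shows up most clearly when the new list does not match the existing categories. The following sketch, using made-up variable names, passes a mismatched list to each method.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a", "b"], dtype="category")

# reorder_categories() expects exactly the existing categories; adding "d" fails
try:
    s.cat.reorder_categories(["d", "b", "a", "c"])
    reorder_failed = False
except ValueError as exc:
    reorder_failed = True
    print("ValueError:", exc)

# set_categories() accepts a different set; values outside it ("c" here) become NaN
s_subset = s.cat.set_categories(["b", "a"], ordered=True)
print(s_subset)
print(s_subset.isna().tolist())
```

In short, reorder_categories() is the safer choice when only the order should change, because it refuses to add or drop categories silently.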

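Sorting Categorical Data

Sorting categorical data arranges the values according to the order of the categories rather than the lexical order of the values. For example, if the categories are defined as ["c", "a", "b"], sorting places "c" first, then "a", then "b". If you do not specify the categories explicitly, Pandas infers them from the data in sorted order, so the result looks like lexical (alphabetical or numeric) sorting.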

Example

The following example shows how sorting behaves for unordered and ordered categorical data in Pandas.

import pandas as pd

# Create a categorical series without any specific order
s = pd.Series(["a", "b", "c", "a", "a", "a", "b", "b"], dtype="category")

# Sort the categorical series without any predefined order (lexical sorting)
print("Lexical Sorting:\n", s.sort_values())

# Define a custom order for the categories
s = s.cat.set_categories(['c', 'a', 'b'], ordered=True)

# Sort the categorical series with the defined order
print("\nSorted with Defined Category Order:\n", s.sort_values())

Following is the output of the above code:

Lexical Sorting:
0    a
3    a
4    a
5    a
1    b
6    b
7    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Sorted with Defined Category Order:
2    c
0    a
3    a
4    a
5    a
1    b
6    b
7    b
dtype: category
Categories (3, object): ['c' < 'a' < 'b']
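
Sorting follows the category list even when the categorical is not marked as ordered. The sketch below, with illustrative variable names, defines a custom category list without ordered=True and sorts in both directions.

```python
import pandas as pd

# Custom category list, but NOT marked as ordered
s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "a", "b"]))

# sort_values() still follows the category list: c, a, b
asc = s.sort_values()
print(asc.tolist())

# ascending=False reverses that order
desc = s.sort_values(ascending=False)
print(desc.tolist())
```

The ordered flag therefore matters for min(), max(), and comparisons, while sort_values() relies only on the position of each category in the list.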

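Multi-Column Sorting with Categorical Data

If a DataFrame contains categorical columns, they can be sorted together with other columns, and each categorical column is sorted according to its defined category order.

Example

In this example, a DataFrame is created with an ordered categorical column "A" and an integer column "B". The DataFrame is then sorted by column "A" according to its category order, and rows with the same "A" value are sorted by column "B".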

import pandas as pd

# Create a DataFrame with an ordered categorical column "A" and an integer column "B"
dfs = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"], categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1]
})

# Sort by multiple columns
sorted_dfs = dfs.sort_values(by=["A", "B"])

print("Sorted DataFrame:\n", sorted_dfs)

Following is the output of the above code:

Sorted DataFrame:
   A  B
2  Y  1
3  Y  2
5  Z  1
6  Z  2
0  X  1
7  X  1
1  X  2
4  X  2
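
Each sort key can also have its own direction. As a sketch under the same setup (the name df is used here only for illustration), the ascending parameter of sort_values() accepts one boolean per column.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(["X", "X", "Y", "Y", "X", "Z", "Z", "X"],
                        categories=["Y", "Z", "X"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1],
})

# "A" ascending in category order (Y, Z, X); "B" descending within each "A" group
result = df.sort_values(by=["A", "B"], ascending=[True, False])
print(result)
```

Column "A" still follows the category order Y, Z, X, while the values of "B" run from high to low within each group.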
