Python Pandas - Stacking and Unstacking

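Stacking and unstacking are techniques for reshaping a DataFrame in Pandas so that the same data can be viewed and analyzed in different layouts. They work naturally with multi-level (hierarchical) indexes. Whether you need to compress columns into a row level or expand a row level into columns, these operations are central to working with complex datasets.

Pandas provides two main methods for this: stack() and unstack(). In this tutorial, we will learn the stacking and unstacking techniques in Pandas, with examples that include handling missing data.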

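Stacking in Pandas

Stacking in Pandas is the process of compressing DataFrame columns into rows. The DataFrame.stack() method stacks column levels into the index: it pivots a level of the (possibly hierarchical) column labels into the row labels and returns a new DataFrame or Series with a multi-level index. The result is a Series when the columns have a single level, and a DataFrame when further column levels remain.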

Example

The following example uses the df.stack() method to pivot the columns into the row index.

import pandas as pd
import numpy as np

# Create MultiIndex
tuples = [["x", "x", "y", "y", "", "f", "z", "z"],["1", "2", "1", "2", "1", "2", "1", "2"]]
index = pd.MultiIndex.from_arrays(tuples, names=["first", "second"])

# Create a DataFrame
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

# Display the input DataFrame
print('Input DataFrame:\n', df)

# Stack columns
stacked = df.stack()

print('Output Reshaped DataFrame:\n', stacked)

The output of the above code is as follows:

Input DataFrame:
                     A         B
first second
x     1       0.596485 -1.356041
      2      -1.091407  0.246216
y     1       0.499328 -1.346817
      2      -0.893557  0.014678
      1      -0.059916  0.106597
f     2      -0.315096 -0.950424
z     1       1.050350 -1.744569
      2      -0.255863  0.539803

Output Reshaped DataFrame:
first  second
x      1       A    0.596485
               B   -1.356041
       2       A   -1.091407
               B    0.246216
y      1       A    0.499328
               B   -1.346817
       2       A   -0.893557
               B    0.014678
       1       A   -0.059916
               B    0.106597
f      2       A   -0.315096
               B   -0.950424
z      1       A    1.050350
               B   -1.744569
       2       A   -0.255863
               B    0.539803
dtype: float64

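Here, the stack() method pivots columns A and B into the index, compressing the DataFrame into long format. Because the columns have only one level, the result is a Series, as the dtype: float64 line at the end of the output shows.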
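As a minimal sketch of the return type, assuming a small hypothetical DataFrame (the values and the labels x and y are made up for illustration), stacking a frame whose columns have a single level gives a Series keyed by the original row label plus the former column label:

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
df = pd.DataFrame(
    {"A": [1, 2], "B": [3, 4]},
    index=pd.Index(["x", "y"], name="first"),
)

stacked = df.stack()

# The column labels A and B become the innermost index level
print(type(stacked).__name__)   # Series
print(stacked.index.nlevels)    # 2
print(stacked.loc[("x", "B")])  # 3
```

Using fixed values instead of np.random.randn makes it easy to check by eye where each cell ends up after reshaping.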

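Unstacking in Pandas

Unstacking reverses the stacking operation by moving a row index level back into the columns. The DataFrame.unstack() method pivots a level of the row index (the innermost level by default) into columns, which is useful for converting a long-format DataFrame into wide format.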

Example

The following example demonstrates how the df.unstack() method works when unstacking a DataFrame.

import pandas as pd
import numpy as np

# Create MultiIndex
tuples = [["x", "x", "y", "y", "", "f", "z", "z"],["1", "2", "1", "2", "1", "2", "1", "2"]]
index = pd.MultiIndex.from_arrays(tuples, names=["first", "second"])

# Create a DataFrame
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

# Display the input DataFrame
print('Input DataFrame:\n', df)

# Unstack the DataFrame
unstacked = df.unstack()

print('Output Reshaped DataFrame:\n', unstacked)

The output of the above code is as follows:

Input DataFrame:
                     A         B
first second
x     1      -0.407537 -0.957010
      2       0.045479  0.789849
y     1       0.751488 -0.474536
      2      -1.043122 -0.015152
      1      -0.133349  1.094900
f     2       1.681111  2.480652
z     1       0.283679  0.769553
      2      -2.034907  0.301275

Output Reshaped DataFrame:
                A                   B          
second         1         2         1         2
first                                         
       -0.133349       NaN  1.094900       NaN
f            NaN  1.681111       NaN  2.480652
x      -0.407537  0.045479 -0.957010  0.789849
y       0.751488 -1.043122 -0.474536 -0.015152
z       0.283679 -2.034907  0.769553  0.301275
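The level that unstack() pivots can also be chosen explicitly. The sketch below, which assumes a small hypothetical DataFrame with made-up integer values, compares the default (innermost level) with level="first" and then stacks the result back to check the round trip:

```python
import pandas as pd

# Hypothetical sample data: every (first, second) combination is present
index = pd.MultiIndex.from_product(
    [["x", "y"], ["1", "2"]], names=["first", "second"]
)
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

# Default: the innermost row level ("second") moves to the columns
wide_second = df.unstack()

# A level name (or integer position) selects a different level to pivot
wide_first = df.unstack(level="first")

print(list(wide_second.columns.names))  # [None, 'second']
print(list(wide_first.columns.names))   # [None, 'first']

# Stacking the pivoted level back gives the original layout again,
# since no cells were missing in this example
restored = wide_second.stack()
print(restored.shape)  # (4, 2)
```

The round trip is only this clean when every combination of index labels exists. With gaps, as in the example above where the "" and f rows each have a single second label, unstacking introduces NaN values.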

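Handling Missing Data During Unstacking

Unstacking can produce missing values when the subgroups of the reshaped DataFrame do not share the same set of labels. By default, Pandas fills these gaps with NaN, but you can specify a custom value with the fill_value parameter.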

Example

This example demonstrates how to handle missing values when unstacking a DataFrame.

import pandas as pd
import numpy as np

# Create Data
index = pd.MultiIndex.from_product([["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"])
columns = pd.MultiIndex.from_tuples([("A", "cat"), ("B", "dog"), ("B", "cat"), ("A", "dog")], names=["exp", "animal"])

df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)

# Select a subset of rows and columns so that some (first, second) combinations are missing
df3 = df.iloc[[0, 1, 4, 7], [1, 2]]

print(df3)

# Unstack the DataFrame
unstacked = df3.unstack()

# Display the Unstacked DataFrame
print("Unstacked DataFrame without Filling:\n",unstacked)

unstacked_filled = df3.unstack(fill_value=1)
print("Unstacked DataFrame with Filling:\n",unstacked_filled)

The output of the above code is as follows:

exp                  B          
animal             dog       cat
first second                    
bar   one    -0.556587 -0.157084
      two     0.109060  0.856019
foo   one    -1.034260  1.548955
qux   two    -0.644370 -1.871248

Unstacked DataFrame without Filling:
exp            B                             
animal       dog                cat          
second       one      two       one       two
first                                        
bar    -0.556587  0.10906 -0.157084  0.856019
foo    -1.034260      NaN  1.548955       NaN
qux          NaN -0.64437       NaN -1.871248

Unstacked DataFrame with Filling:
exp            B                             
animal       dog                cat          
second       one      two       one       two
first                                        
bar    -0.556587  0.10906 -0.157084  0.856019
foo    -1.034260  1.00000  1.548955  1.000000
qux     1.000000 -0.64437  1.000000 -1.871248
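The same behavior can be checked on a smaller, deterministic case. This is a minimal sketch assuming a hypothetical Series in which the combination ("y", "2") is deliberately absent. The values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: the ("y", "2") combination is intentionally missing
index = pd.MultiIndex.from_tuples(
    [("x", "1"), ("x", "2"), ("y", "1")], names=["first", "second"]
)
s = pd.Series([10, 20, 30], index=index)

wide_nan = s.unstack()               # the missing cell becomes NaN
wide_zero = s.unstack(fill_value=0)  # the missing cell becomes 0 instead

print(wide_nan.isna().sum().sum())   # 1
print(wide_zero.loc["y", "2"])       # 0
```

Note that fill_value only fills the cells created by the reshape. A NaN that is already stored in the data is a separate matter and would need something like fillna().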
