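Python Pandas - Handling Missing Values with Interpolation

Interpolation is a technique for handling missing values in a dataset: it estimates each missing value from the other data points around it. Pandas provides the **interpolate()** method on both DataFrame and Series objects, which fills missing values using a choice of interpolation methods.

In this tutorial, we will learn how to use the Pandas **interpolate()** method to fill missing values in numeric data, time series data, and more, using different interpolation methods.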

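Basic Interpolation

The Pandas **interpolate()** method of DataFrame and Series objects fills missing values using a chosen interpolation strategy. By default, it uses linear interpolation (**method="linear"**).

Example

The following is a simple example of calling the **interpolate()** method to fill missing values.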

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
    "B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Using the interpolate() method (linear by default)
result = df.interpolate()
print("\nResultant DataFrame after applying the interpolation:")
print(result)

Following is the output of the above code:

Original DataFrame:

A B
0 1.1 0.25
1 NaN NaN
2 3.5 NaN
3 NaN 4.70
4 NaN 10.00
5 NaN 14.70
6 6.2 1.30
7 7.9 9.20


Resultant DataFrame after applying the interpolation:

A B
0 1.100 0.250000
1 2.300 1.733333
2 3.500 3.216667
3 4.175 4.700000
4 4.850 10.000000
5 5.525 14.700000
6 6.200 1.300000
7 7.900 9.200000
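
To see where these numbers come from, the sketch below recomputes column A by hand. It assumes that the default linear method amounts to drawing straight lines between the known values at their row positions, and uses NumPy's **np.interp** as a stand-in for that; the names **x_known**, **y_known**, and **manual** are illustrative only and not part of the Pandas API.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9])

# Row positions and values of the entries that are not missing.
known = a.notna()
x_known = np.flatnonzero(known)   # positions 0, 2, 6, 7
y_known = a[known].to_numpy()     # values 1.1, 3.5, 6.2, 7.9

# Evaluate the piecewise-linear function through the known points
# at every row position, including the missing ones.
manual = pd.Series(np.interp(np.arange(len(a)), x_known, y_known))

print(manual)
print(np.allclose(manual, a.interpolate()))  # compare with Pandas
```

For example, rows 3 to 5 lie between 3.5 (row 2) and 6.2 (row 6), so each step adds (6.2 - 3.5) / 4 = 0.675, giving 4.175, 4.850, and 5.525.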

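Different Interpolation Methods

Pandas supports several interpolation methods, including linear, polynomial, pchip, akima, and spline. These let you choose how missing values are filled according to the nature of the data. Most methods other than the default linear one are delegated to SciPy, so SciPy must be installed to use them; the polynomial and spline methods additionally require an **order** argument.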
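
As a small sketch of how the choice of method changes the result, the snippet below fills the same gap with **method="linear"** and **method="index"**, neither of which needs SciPy. The three-point Series and its unevenly spaced index are made up for illustration.

```python
import numpy as np
import pandas as pd

# One missing value between observations at positions 0 and 10.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# "linear" ignores the index and treats the rows as equally spaced.
linear = s.interpolate(method="linear")

# "index" uses the numeric index values as x-coordinates.
by_index = s.interpolate(method="index")

print(linear.loc[1])    # 5.0, halfway between the neighbouring rows
print(by_index.loc[1])  # 1.0, one tenth of the way from 0 to 10
```

When the index carries real spacing information (positions, depths, timestamps), an index-aware method is usually the closer match to the data.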

Example

The following example demonstrates the **interpolate()** method with the **barycentric** interpolation technique.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
    "B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Applying the interpolate() with Barycentric method
result = df.interpolate(method='barycentric')

print("\nResultant DataFrame after applying the interpolation:")
print(result)

Following is the output of the above code:

Original DataFrame:

A B
0 1.1 0.25
1 NaN NaN
2 3.5 NaN
3 NaN 4.70
4 NaN 10.00
5 NaN 14.70
6 6.2 1.30
7 7.9 9.20

Resultant DataFrame after applying the interpolation:

A B
0 1.100000 0.250000
1 2.596429 57.242857
2 3.500000 24.940476
3 4.061429 4.700000
4 4.531429 10.000000
5 5.160714 14.700000
6 6.200000 1.300000
7 7.900000 9.200000
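
Note that column B jumps to 57.24 at row 1 even though no observed value exceeds 14.7. Barycentric interpolation fits a single polynomial through all known points, and such a polynomial can swing widely between them. The sketch below, which assumes SciPy is installed, compares it with the shape-preserving **pchip** method on the same column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

b = pd.Series([0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2])

# Both methods are delegated to SciPy.
bary = b.interpolate(method="barycentric")  # one polynomial through all points
pchip = b.interpolate(method="pchip")       # piecewise cubic, shape-preserving

# Show the rows around the first gap side by side.
print(pd.DataFrame({"barycentric": bary, "pchip": pchip}).head(4))
```

With **pchip**, the filled values in rows 1 and 2 stay between their neighbours 0.25 and 4.70, which is often the safer choice when overshoot would be physically meaningless.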

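Handling Limits in Interpolation

By default, interpolation fills every missing value in a run of consecutive NaNs. You can use the **limit** parameter of the **interpolate()** method to cap how many consecutive NaN values are filled; the remaining values in each gap stay NaN.

Example

The following example fills the missing values of a Pandas DataFrame with second-order spline interpolation while using the **limit** parameter of the **interpolate()** method to fill at most one NaN per gap.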

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9],
    "B": [0.25, np.nan, np.nan, 4.7, 10, 14.7, 1.3, 9.2],
})

print("Original DataFrame:")
print(df)

# Applying the interpolate() with limit
result = df.interpolate(method='spline', order=2, limit=1)

print("\nResultant DataFrame after applying the interpolation:")
print(result)

Following is the output of the above code:

Original DataFrame:

A B
0 1.1 0.25
1 NaN NaN
2 3.5 NaN
3 NaN 4.70
4 NaN 10.00
5 NaN 14.70
6 6.2 1.30
7 7.9 9.20


Resultant DataFrame after applying the interpolation:

A B
0 1.100000 0.250000
1 2.231383 -1.202052
2 3.500000 NaN
3 4.111529 4.700000
4 NaN 10.000000
5 NaN 14.700000
6 6.200000 1.300000
7 7.900000 9.200000
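
The **limit** parameter can be combined with **limit_direction**, which controls the side from which each gap is filled. The sketch below uses the default linear method on column A so that SciPy is not needed; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.1, np.nan, 3.5, np.nan, np.nan, np.nan, 6.2, 7.9])

# Default direction is forward: only the first NaN of each gap is filled.
forward = a.interpolate(limit=1)

# "both" fills one NaN from each end of a gap, leaving only the middle empty.
both = a.interpolate(limit=1, limit_direction="both")

print(pd.DataFrame({"forward": forward, "both": both}))
```

In the three-row gap (rows 3 to 5), the forward fill produces 4.175 for row 3 and leaves rows 4 and 5 as NaN, while the two-sided fill also produces 5.525 for row 5.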

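Interpolating Time Series Data

Interpolation can also be applied to Pandas time series data, which is useful for filling gaps where data points are missing over time.

Example

The following example fills the missing values of a Series indexed by daily dates using **method="time"**, which weights each estimate by the length of the time interval between observations.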

import numpy as np
import pandas as pd

indx = pd.date_range("2024-01-01", periods=10, freq="D")
data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)
s = pd.Series(data, index=indx)
s.iloc[[1, 2, 5, 6, 9]] = np.nan

print("Original Series:")
print(s)

result = s.interpolate(method="time")

print("\nResultant Time Series after applying the interpolation:")
print(result)

Following is the output of the above code:

Original Series:

Date Value
2024-01-01 8.0
2024-01-02 NaN
2024-01-03 NaN
2024-01-04 2.0
2024-01-05 4.0
2024-01-06 NaN
2024-01-07 NaN
2024-01-08 0.0
2024-01-09 3.0
2024-01-10 NaN


Resultant Time Series after applying the interpolation:

Date Value
2024-01-01 8.000000
2024-01-02 6.000000
2024-01-03 4.000000
2024-01-04 2.000000
2024-01-05 4.000000
2024-01-06 2.666667
2024-01-07 1.333333
2024-01-08 0.000000
2024-01-09 3.000000
2024-01-10 3.000000
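
Because the dates above are evenly spaced, **method="time"** gives the same values as linear interpolation would. The difference shows up with irregular timestamps, as in this sketch; the three-point Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# One missing reading on 2 January, then nothing until 10 January.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-10"])
s = pd.Series([0.0, np.nan, 9.0], index=idx)

# "linear" treats the rows as equally spaced: halfway gives 4.5.
linear = s.interpolate(method="linear")

# "time" accounts for elapsed time: 1 day out of 9 gives about 1.0.
by_time = s.interpolate(method="time")

print(linear.iloc[1])
print(by_time.iloc[1])
```

For sensor logs or other unevenly sampled series, the time-aware result is usually the one that reflects the underlying trend.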
