Python Pandas - Removing Missing Data



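Missing data is a common problem when working with real-world datasets. The Python Pandas library offers a simple way to remove rows or columns that contain missing values (NaN or NaT) from a dataset: the dropna() method.

dropna() can drop either rows or columns, depending on what your data requires. In this tutorial, we will learn how to use dropna() to clean a dataset by removing missing data under various conditions.

The dropna() Method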

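The Pandas dropna() method removes missing values from Pandas data structures such as Series and DataFrame objects. It provides several options for controlling how rows or columns are dropped depending on where NaN values occur. The method returns a new Pandas object with the missing data removed, or None if the inplace parameter is set to True.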
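To make the difference between the default call and inplace=True concrete, here is a minimal sketch. The Series and DataFrame contents are made-up sample data, not taken from the examples in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for this illustration
s = pd.Series([1.0, np.nan, 3.0])
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# Default call: the original Series is left untouched, a new one is returned
s_clean = s.dropna()
print(len(s), len(s_clean))  # 3 2

# inplace=True: the DataFrame itself is modified, so there is nothing to assign
df.dropna(inplace=True)
print(df.shape)  # (1, 2)
```

Assigning the result (df = df.dropna()) is usually the easier style to read, since it makes the change explicit at the call site.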

Syntax

The syntax is as follows:

DataFrame.dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)

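Where:

  • axis: 0 or 'index' (the default) drops rows; 1 or 'columns' drops columns.

  • how: defaults to 'any', which drops a row or column if it contains any missing value. With 'all', a row or column is dropped only if all of its values are missing.

  • thresh: the minimum number of non-NA values a row or column must contain to be kept. It cannot be combined with how.

  • subset: the labels along the other axis to consider: a list of columns when dropping rows, or a list of rows when dropping columns.

  • inplace: if True, modifies the DataFrame in place (default False).

  • ignore_index: if True, relabels the resulting index as 0, 1, ..., n - 1 (default False).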
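The effect of ignore_index is easiest to see on a DataFrame whose index is not a simple 0..n-1 range. The sketch below uses made-up data and assumes pandas 2.0 or later, since the ignore_index parameter is not available in older releases.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a non-default index
df = pd.DataFrame(
    {"name": ["A", "B", "C"], "score": [1.0, np.nan, 3.0]},
    index=[10, 20, 30],
)

kept = df.dropna()                         # surviving rows keep their labels
renumbered = df.dropna(ignore_index=True)  # labels are reset to 0..n-1

print(list(kept.index))        # [10, 30]
print(list(renumbered.index))  # [0, 1]
```

On older pandas versions, df.dropna().reset_index(drop=True) should give the same result.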

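Let's explore how dropna() removes missing data under various conditions.

Dropping Rows with Any Missing Values

By default, dropna() drops every row that contains at least one missing value.

Example

The following example uses dropna() to drop the rows that have any missing values.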

import pandas as pd
import numpy as np

dataset = {
    "Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
    "Roll_number": [23, 45, np.nan, 18],
    "Major_Subject": ["Maths", "Physics", "Arts", "Political science"],
    "Marks": [57, np.nan, 98, np.nan],
}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop the rows that have any missing values
df_cleaned = df.dropna()
# Print the label and the frame separately so the header row stays aligned
print("\nResultant DataFrame after removing row:")
print(df_cleaned)

The output of the above code is as follows:

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing row:
  Student_name  Roll_number Major_Subject  Marks
1         Ajay         23.0         Maths   57.0
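One way to picture what the default call does is as a boolean row filter: keep only the rows in which every column holds a value. The sketch below is an illustration of that idea using a reduced version of the sample data, not a description of how pandas implements dropna() internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
        "Roll_number": [23, 45, np.nan, 18],
        "Marks": [57, np.nan, 98, np.nan],
    },
    index=[1, 2, 3, 4],
)

# True for rows in which every column holds a non-missing value
complete_rows = df.notna().all(axis=1)

manual = df[complete_rows]
print(manual.equals(df.dropna()))  # True
print(list(manual.index))          # [1]
```

Building the mask yourself is handy when you want to inspect or count the incomplete rows before discarding them.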

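Dropping Rows Where All Values Are Missing

To drop only the rows in which every value is missing, pass how='all' to dropna().

Example

The following example demonstrates how to drop the rows of a DataFrame in which all values are missing.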

import pandas as pd
import numpy as np

dataset = {
    "Student name": ["Ajay", np.nan, "Deepak", "Swati"],
    "Roll number": [23, np.nan, np.nan, 18],
    "Major Subject": ["Maths", np.nan, "Arts", "Political science"],
    "Marks": [57, np.nan, 98, np.nan],
}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop rows where all values are missing
result = df.dropna(how='all')
# Print the label and the frame separately so the header row stays aligned
print("\nResultant DataFrame after removing row:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2          NaN          NaN                NaN    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing row:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

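Keeping Rows with a Minimum Number of Non-Missing Values

The dropna() method provides the thresh parameter, which sets the minimum number of non-NA values a row must contain in order to be kept. Rows with fewer non-NA values are dropped.

Example

This example keeps only the rows that have at least two non-missing values.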

import pandas as pd
import numpy as np

dataset = {
    "Student name": ["Ajay", "Krishna", "Deepak", "Swati"],
    "Roll number": [23, np.nan, np.nan, 18],
    "Major Subject": ["Maths", np.nan, "Arts", "Political science"],
    "Marks": [57, np.nan, 98, np.nan],
}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Keep only the rows with at least 2 non-missing values
result = df.dropna(thresh=2)
# Print the label and the frame separately so the header row stays aligned
print("\nResultant DataFrame after removing row:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna          NaN                NaN    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing row:
  Student name  Roll number      Major Subject  Marks
1         Ajay         23.0              Maths   57.0
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN
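Because thresh counts non-missing values rather than missing ones, it can help to compute those counts explicitly. The following sketch reuses the data from this example and compares dropna(thresh=2) with a hand-written filter based on DataFrame.count(); it is meant as an illustration of the rule, not of the internal implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Student name": ["Ajay", "Krishna", "Deepak", "Swati"],
        "Roll number": [23, np.nan, np.nan, 18],
        "Major Subject": ["Maths", np.nan, "Arts", "Political science"],
        "Marks": [57, np.nan, 98, np.nan],
    },
    index=[1, 2, 3, 4],
)

# Number of non-missing values in each row
non_na_per_row = df.count(axis=1)
print(non_na_per_row.tolist())  # [4, 1, 3, 3]

# Keeping rows with at least 2 non-missing values matches thresh=2
manual = df[non_na_per_row >= 2]
print(manual.equals(df.dropna(thresh=2)))  # True
```

Row 2 holds only one non-missing value (the student name), which is why it is the only row removed.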

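Dropping Columns with Any Missing Values

To drop the columns that contain any missing values, set the axis parameter of dropna() to 1 or 'columns'.

Example

This example shows how dropna() drops every column that contains at least one missing value.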

import pandas as pd
import numpy as np

dataset = {
    "Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
    "Roll_number": [23, 45, np.nan, 18],
    "Major_Subject": ["Maths", "Physics", "Arts", "Political science"],
    "Marks": [57, np.nan, 98, np.nan],
}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop the columns that have any missing values
result = df.dropna(axis='columns')
# Print the label and the frame separately so the header row stays aligned
print("\nResultant DataFrame after removing columns:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN               Arts   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing columns:
  Student_name      Major_Subject
1         Ajay              Maths
2      Krishna            Physics
3       Deepak               Arts
4        Swati  Political science
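Dropping every column with a single gap can discard a lot of usable data. The how and thresh parameters also work along the column axis, as the sketch below suggests. The "Remarks" column is a hypothetical, entirely empty column added only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
        "Roll_number": [23, 45, np.nan, 18],
        "Marks": [57, np.nan, 98, np.nan],
        "Remarks": [np.nan] * 4,  # hypothetical column with no data at all
    },
    index=[1, 2, 3, 4],
)

# how='all': drop only the columns that are completely empty
only_empty_removed = df.dropna(axis="columns", how="all")
print(list(only_empty_removed.columns))  # ['Student_name', 'Roll_number', 'Marks']

# thresh=3: keep only the columns with at least 3 non-missing values
at_least_three = df.dropna(axis="columns", thresh=3)
print(list(at_least_three.columns))  # ['Student_name', 'Roll_number']
```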

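Dropping Rows Based on Missing Data in Specific Columns

With the subset parameter of dropna(), only the listed columns are checked for missing data when deciding which rows to drop.

Example

This example shows how to use the subset parameter of dropna() to drop the rows that have missing data in specific columns.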

import pandas as pd
import numpy as np

dataset = {
    "Student_name": ["Ajay", "Krishna", "Deepak", "Swati"],
    "Roll_number": [23, 45, np.nan, 18],
    "Major_Subject": ["Maths", "Physics", np.nan, "Political science"],
    "Marks": [57, np.nan, 98, np.nan],
}

df = pd.DataFrame(dataset, index=[1, 2, 3, 4])
print("Original DataFrame:")
print(df)

# Drop rows with missing data in the Roll_number or Major_Subject column
result = df.dropna(subset=['Roll_number', 'Major_Subject'])
# Print the label and the frame separately so the header row stays aligned
print("\nResultant DataFrame after removing rows:")
print(result)

The output of the above code is as follows:

Original DataFrame:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
3       Deepak          NaN                NaN   98.0
4        Swati         18.0  Political science    NaN

Resultant DataFrame after removing rows:
  Student_name  Roll_number      Major_Subject  Marks
1         Ajay         23.0              Maths   57.0
2      Krishna         45.0            Physics    NaN
4        Swati         18.0  Political science    NaN
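The subset parameter can be combined with how. The sketch below uses a small made-up DataFrame (not the one from the example above) in which the two settings give different results, so the distinction is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data chosen so that 'any' and 'all' differ
df = pd.DataFrame(
    {
        "Roll_number": [23, np.nan, np.nan],
        "Major_Subject": ["Maths", "Physics", None],
        "Marks": [57, 61, 98],
    }
)

cols = ["Roll_number", "Major_Subject"]

# Default how='any': drop a row if either listed column is missing
any_missing = df.dropna(subset=cols)
print(list(any_missing.index))  # [0]

# how='all': drop a row only if both listed columns are missing
all_missing = df.dropna(subset=cols, how="all")
print(list(all_missing.index))  # [0, 1]
```

Note that the Marks column plays no part in either decision, because it is not listed in subset.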