Machine Learning - Backward Elimination

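Backward elimination is a feature selection technique used in machine learning to choose the most significant features for a predictive model. We start with all features and iteratively remove the least significant one until we reach the subset of features that gives the best performance.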

Python Implementation

To implement backward elimination in Python, you can follow these steps:

Import the necessary libraries: pandas, numpy, and statsmodels.api.

import pandas as pd
import numpy as np
import statsmodels.api as sm

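Load your dataset into a pandas DataFrame. We will use the Pima Indians Diabetes dataset. Replace the path below with the location of diabetes.csv on your machine.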

diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

Define the predictor variables (X) and the target variable (y).

X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values
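To make the slicing above concrete, here is a minimal sketch on a made-up three-row DataFrame. The frame and its column names are hypothetical and only illustrate how `iloc` separates the last column from the rest.

```python
import pandas as pd

# Hypothetical three-row frame; the column names are illustrative only
toy = pd.DataFrame({
    "feature_a": [1, 2, 3],
    "feature_b": [10, 20, 30],
    "label": [0, 1, 0],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last
y_toy = toy.iloc[:, -1].values   # the last column only

print(X_toy.shape, y_toy.shape)
```

The same pattern works for any dataset whose target is stored in the last column.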

Add a column of ones to the predictor variables to represent the intercept.

X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)

Use ordinary least squares (OLS) from the statsmodels library to fit a multiple linear regression model that includes all predictor variables.

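# Column 0 is the intercept; columns 1-8 are the eight predictors
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

Check the p-value of each predictor variable and remove the one with the highest p-value (i.e., the least significant).

print(regressor_OLS.summary())

Repeat the previous two steps (refit the model, then drop the predictor with the highest p-value) until all remaining predictor variables have a p-value below the significance level (e.g., 0.05). Which column is dropped at each step depends on the p-values in the preceding summary; the indices below follow the complete example in the Example section.

X_opt = X[:, [0, 1, 2, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 7]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())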

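The final subset of predictor variables, all with p-values below the significance level, is the feature set selected for the model.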

Example

The following is a complete implementation of backward elimination in Python:

# Importing the necessary libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the diabetes dataset
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

# Define the predictor variables (X) and the target variable (y)
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values

# Add a column of ones to the predictor variables to represent the intercept
X = np.append(arr = np.ones((len(X), 1)).astype(int), values = X, axis = 1)

# Fit the multiple linear regression model with all the predictor variables
X_opt = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()

# Check the p-values of each predictor variable and remove the one
# with the highest p-value (i.e., the least significant)
print(regressor_OLS.summary())

# Repeat the above step until all the remaining predictor variables
# have a p-value below the significance level (e.g., 0.05)
X_opt = X[:, [0, 1, 2, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 6, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 7, 8]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

X_opt = X[:, [0, 1, 3, 5, 7]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
print(regressor_OLS.summary())

Output

When you execute this program, it produces the OLS regression summary tables for the fitted models.
