Machine Learning - Forward Feature Construction



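Forward feature construction is a feature selection method in machine learning (the same procedure is often called forward feature selection). It starts from an empty feature set and, at each iteration, adds the feature that improves model performance the most, until the desired number of features is reached.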

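The goal of feature selection is to identify the features most relevant to predicting the target variable, while discarding less important features that add noise to the model and can lead to overfitting.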

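Forward feature construction involves the following steps:

  • Initialize an empty feature set.

  • Set the maximum number of features to select.

  • Iterate until the desired number of features is reached:

    • For each remaining feature not yet in the selected set, fit a model on the selected features plus that candidate feature, and evaluate its performance on a validation set.

    • Pick the candidate that gives the best performance and add it to the selected set.

  • Return the selected feature set as the best feature set for the model.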
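
The steps above can be sketched independently of any particular model. The following is a minimal sketch, not a definitive implementation: `forward_select`, `toy_score`, and the feature names in `GAINS` are hypothetical and exist only for illustration. The scoring function stands in for "fit a model on this subset and evaluate it on a validation set".

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    score_fn: Callable[[FrozenSet[str]], float],
    max_features: int,
) -> List[Tuple[str, float]]:
    """Greedy forward selection.

    Returns (feature, score) pairs in the order the features were added.
    """
    selected: List[str] = []
    history: List[Tuple[str, float]] = []
    while len(selected) < min(max_features, len(candidates)):
        # Start below any possible score so the first candidate always wins.
        best_feature, best_score = None, float("-inf")
        for feature in candidates:
            if feature in selected:
                continue
            # Score the already selected features plus this one candidate.
            score = score_fn(frozenset(selected + [feature]))
            if score > best_score:
                best_feature, best_score = feature, score
        selected.append(best_feature)
        history.append((best_feature, best_score))
    return history


# Hypothetical per-feature contributions (made-up numbers, not real data).
GAINS = {"age": 0.05, "bmi": 0.30, "glucose": 0.20, "noise": -0.02}


def toy_score(subset: FrozenSet[str]) -> float:
    # Stand-in for a validation score: the sum of the made-up contributions.
    return sum(GAINS[name] for name in subset)


history = forward_select(list(GAINS), toy_score, max_features=3)
for name, score in history:
    print(name, round(score, 2))
```

Passing the scoring function as an argument keeps the greedy loop separate from the choice of model and metric, so the same loop could wrap a regression score, a classification accuracy, or a cross-validated estimate.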

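The main advantage of forward feature construction is computational efficiency: because it is a greedy procedure, it fits far fewer models than an exhaustive search over all feature subsets, which makes it usable on high-dimensional datasets. However, a feature that has been added is never removed, so the method does not always produce the best feature set, especially when features are highly correlated or when the relationship between the features and the target variable is nonlinear.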
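
To make the efficiency claim concrete, the sketch below counts model fits under a simple assumption: one fit per candidate subset evaluated. The function names `greedy_fits` and `exhaustive_fits` are hypothetical helpers written for this illustration.

```python
from math import comb


def greedy_fits(n_features: int, k: int) -> int:
    # At step j (0-based), forward selection tries each of the
    # n_features - j features that have not been selected yet.
    return sum(n_features - j for j in range(k))


def exhaustive_fits(n_features: int, k: int) -> int:
    # Exhaustive search tries every non-empty subset with at most k features.
    return sum(comb(n_features, size) for size in range(1, k + 1))


for n in (8, 20, 50):
    print(n, greedy_fits(n, n), exhaustive_fits(n, n))
```

When all n features may be selected, the greedy count is n(n + 1)/2, while the exhaustive count is 2^n - 1, so the gap grows very quickly with the number of features.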

Example

The following example shows how to implement forward feature construction in Python:

# Import the necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the diabetes dataset (adjust the path to where diabetes.csv is stored)
diabetes = pd.read_csv(r'C:\Users\Leekha\Desktop\diabetes.csv')

# Define the predictor variables (X) and the target variable (y)
X = diabetes.iloc[:, :-1].values
y = diabetes.iloc[:, -1].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Create an empty set of features
selected_features = set()

# Set the maximum number of features to be selected
max_features = 8

# Iterate until the desired number of features is reached
while len(selected_features) < max_features:

   # Start below any possible score: R^2 can be negative on held-out data,
   # so starting at 0 could leave best_feature as None
   best_feature = None
   best_score = float('-inf')
   
   # Iterate over all the remaining features
   for i in range(X_train.shape[1]):

      # Skip the feature if it's already selected
      if i in selected_features:
         continue
      
      # Fit a linear regression model on the selected features plus feature i
      columns = list(selected_features) + [i]
      regressor = LinearRegression()
      regressor.fit(X_train[:, columns], y_train)
      
      # Compute the R^2 score on the testing set, which serves as the
      # validation set in this example
      score = regressor.score(X_test[:, columns], y_test)

      # Update the best feature and score if the current feature performs better
      if score > best_score:
         best_feature = i
         best_score = score

   # Add the best feature to the set of selected features
   selected_features.add(best_feature)
   
   # Print the selected features (sorted for a stable order) and the score
   print('Selected Features:', sorted(selected_features))
   print('Score:', best_score)

Output

When executed, it produces the following output:

Selected Features: [1]
Score: 0.23530716168783583
Selected Features: [0, 1]
Score: 0.2923143573608237
Selected Features: [0, 1, 5]
Score: 0.3164103491569179
Selected Features: [0, 1, 5, 6]
Score: 0.3287368302427327
Selected Features: [0, 1, 2, 5, 6]
Score: 0.334586804842275
Selected Features: [0, 1, 2, 3, 5, 6]
Score: 0.3356264736550455
Selected Features: [0, 1, 2, 3, 4, 5, 6]
Score: 0.3313166516703744
Selected Features: [0, 1, 2, 3, 4, 5, 6, 7]
Score: 0.32230203252064216
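
The scores above rise until six features are selected and then fall, so selecting all eight features is not the best choice for this split. A minimal sketch of picking the subset size from the recorded scores follows. The scores are copied from the output above, and `best_subset_size` is a hypothetical helper. Because the same held-out split was used both to select features and to score them, the chosen score is likely somewhat optimistic.

```python
# Scores copied from the output above; position i holds the score
# obtained with i + 1 selected features.
scores = [
    0.23530716168783583,
    0.2923143573608237,
    0.3164103491569179,
    0.3287368302427327,
    0.334586804842275,
    0.3356264736550455,
    0.3313166516703744,
    0.32230203252064216,
]


def best_subset_size(step_scores):
    # Index of the highest score, converted to a 1-based feature count.
    best_index = max(range(len(step_scores)), key=step_scores.__getitem__)
    return best_index + 1


best_size = best_subset_size(scores)
print('Best number of features:', best_size)
print('Score at that size:', scores[best_size - 1])
```

In a loop such as the one in the example, the same idea could be applied while iterating by stopping once the best score of a round no longer exceeds the best score of the previous round.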