How to split a dataset into training and test sets in Python?


In this tutorial, we will learn how to split a dataset into training and test sets using the Python programming language.

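Introduction

When building machine learning and deep learning models, we often have a single dataset that must serve for both training and evaluation. In that case we divide the dataset into separate subsets and use each subset for one specific task, such as training. This is where training and test sets come in.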

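Why training and test sets are needed

Splitting the data is one of the most important, and simplest, preprocessing techniques. Two common problems with machine learning models are overfitting and underfitting. Overfitting means the model performs very well on the training data but fails to generalize to unseen samples. This can happen when the model learns the noise in the data.

Underfitting means the model performs poorly even on the training data and therefore cannot generalize well either. This can happen when the model is too simple for the task or when there is not enough training data.

One of the simplest ways to address these problems is to split the dataset into a training set and a test set. The training set is used to train the model, that is, to learn its parameters. The test set is used to evaluate how the model performs on unseen data, which makes any gap between training and test performance visible.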
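To make the role of the test set concrete, here is a minimal sketch that fits polynomials of increasing degree to a small noisy dataset and compares the error on the training points with the error on held-out points. The data (a noisy sine curve), the helper `mse`, and the chosen degrees are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 15 evenly spaced points on a noisy sine curve.
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

# Shuffle the indices, then keep 10 points for training and 5 for testing.
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:10], idx[10:]


def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


errors = {}
for degree in (1, 3, 9):
    # Fit on the training points only.
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    errors[degree] = (
        mse(coeffs, x[train_idx], y[train_idx]),  # training error
        mse(coeffs, x[test_idx], y[test_idx]),    # test error
    )
    print(f"degree={degree}: train MSE={errors[degree][0]:.4f}, "
          f"test MSE={errors[degree][1]:.4f}")
```

A degree-9 polynomial can pass through all 10 training points, so its training error is close to zero even though it has memorized the noise. Only the error on the held-out points reveals whether the fit generalizes.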

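Some terminology

Training set

The part of the dataset used to train the model. It is typically about 70% of the whole dataset, but other proportions such as 60% or 80% can be used depending on the use case. This part of the dataset is used to learn and fit the model's parameters.

Test set

The part of the dataset used to evaluate the model. It is typically about 30% of the whole dataset, but other proportions such as 40% or 20% can be used depending on the use case.

In practice, the dataset is usually split into training and test sets in a ratio such as 70:30 or 80:20, depending on requirements.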
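As a quick illustration of how a split ratio translates into sample counts, the sketch below applies sklearn's train_test_split to a hypothetical 100-sample array with three different values of test_size. The array contents and the `sizes` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples with 4 features each.
x = np.arange(400).reshape(100, 4)

sizes = {}
for test_size in (0.2, 0.3, 0.4):
    # test_size is the fraction of samples held out for testing.
    x_train, x_test = train_test_split(x, test_size=test_size, random_state=0)
    sizes[test_size] = (len(x_train), len(x_test))
    print(f"test_size={test_size}: {len(x_train)} train, {len(x_test)} test")
```

With 100 samples, these settings correspond to 80:20, 70:30, and 60:40 splits.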

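Splitting a dataset into training and test sets in Python

There are three basic ways to split a dataset:

Using sklearn's train_test_split
Using numpy indexing
Using pandas sample

Let us briefly look at each of these methods.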

1. Using sklearn's train_test_split

Example

import numpy as np
from sklearn.model_selection import train_test_split

# 10 samples with 5 features each, plus one binary label per sample
x = np.arange(0, 50).reshape(10, 5)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])

# Hold out 30% of the samples for testing; random_state makes the shuffle reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=4
)

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)

Output

Shape of x_train is (7, 5)
Shape of x_test is (3, 5)
Shape of y_train is (7,)
Shape of y_test is (3,)
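For classification data with imbalanced classes, train_test_split also accepts a stratify argument that keeps the class proportions roughly the same in both subsets. The sketch below is an illustrative addition, and the 80/20 label distribution is a hypothetical example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
x = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same class proportions in the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=4
)

print("Class 1 count in train:", int(y_train.sum()), "of", len(y_train))
print("Class 1 count in test:", int(y_test.sum()), "of", len(y_test))
```

Without stratify, a purely random split of a small or imbalanced dataset may leave the test set with very few samples of the minority class.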

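2. Using numpy indexing

Example

import numpy as np

# 100 samples with 5 features each, plus one target value per sample
x = np.random.rand(100, 5)
y = np.random.rand(100, 1)

# First 80 rows for training, last 20 for testing (the rows are not shuffled)
x_train, x_test = x[:80, :], x[80:, :]
y_train, y_test = y[:80, :], y[80:, :]

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)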

Output

Shape of x_train is (80, 5)
Shape of x_test is (20, 5)
Shape of y_train is (80, 1)
Shape of y_test is (20, 1)
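Plain slicing keeps the rows in their original order, which can bias the split when the data is sorted, for example by class or by date. A minimal sketch of a shuffled split with numpy alone follows. The seed, the array contents, and the names `indices` and `n_train` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: y[i] holds the row number so the pairing can be checked.
x = np.arange(500).reshape(100, 5)
y = np.arange(100).reshape(100, 1)

# Shuffle the row indices once, then cut them at the 80% mark.
indices = rng.permutation(len(x))
n_train = round(0.8 * len(x))
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Index x and y with the same arrays so features and targets stay aligned.
x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print("Shape of x_train is", x_train.shape)
print("Shape of x_test is", x_test.shape)
print("Shape of y_train is", y_train.shape)
print("Shape of y_test is", y_test.shape)
```

Using one shared permutation for both arrays is the key design choice here: shuffling x and y separately would break the correspondence between samples and targets.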

3. Using pandas sample

Example

import pandas as pd
import numpy as np

# 5 rows and 3 columns of random integers in [10, 25)
data = np.random.randint(10, 25, size=(5, 3))
df = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

# Randomly sample 80% of the rows for training; the remaining rows form the test set
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

print("Dataset shape : {}".format(df.shape))
print("Train dataset shape : {}".format(train_df.shape))
print("Test dataset shape : {}".format(test_df.shape))

Output

Dataset shape : (5, 3)
Train dataset shape : (4, 3)
Test dataset shape : (1, 3)
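An equivalent way to obtain the test rows, sketched here as an alternative, is to drop the sampled training rows by index label with DataFrame.drop. Like the isin filter above, this assumes the DataFrame index labels are unique. The 10-row DataFrame is a hypothetical example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 10 rows and 3 columns with a default unique index.
df = pd.DataFrame(
    np.arange(30).reshape(10, 3), columns=["col1", "col2", "col3"]
)

# Sample 80% of the rows for training, then drop them to get the test rows.
train_df = df.sample(frac=0.8, random_state=100)
test_df = df.drop(train_df.index)

print("Train dataset shape : {}".format(train_df.shape))
print("Test dataset shape : {}".format(test_df.shape))
```

If the index contains duplicate labels, calling df.reset_index(drop=True) before sampling is one way to make both approaches safe.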

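Conclusion

The train-test split is a very important preprocessing step for machine learning tasks in Python. Evaluating the model on held-out data helps to detect, and guard against, overfitting and underfitting.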

Mithilesh Pradhan

Updated on: December 1, 2022
