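How to Use DataLoader in PyTorch?

PyTorch is a popular open-source machine learning library. Data scientists, researchers, and developers use it widely to build AI/ML products. One of its most important features is the DataLoader class, which loads and batches data efficiently for neural network training. This article explains how to use DataLoader in PyTorch.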

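Using DataLoader in PyTorch

We can follow these basic steps to load data and train a model with PyTorch in Python:

Data preparation - Create a custom RandomDataset class that generates a random dataset of the required size. Use DataLoader to create batches of data, specifying the batch size and enabling shuffling.
Neural network definition - Define a neural network class Net with two fully connected layers and an activation function. Adjust the architecture to the number of units needed in each layer.
Initialization and optimization - Instantiate the Net class, set mean squared error (MSE) as the loss criterion, and initialize the optimizer as stochastic gradient descent (SGD) with the desired learning rate.
Training loop - Iterate over the DataLoader for the required number of epochs. For each batch, compute the network output, compute the loss, backpropagate the gradients, update the weights, and track the running loss.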

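Example

The following code defines a simple neural network and a random dataset of 1000 data points, each with 10 features. It then wraps the dataset in a DataLoader with a batch size of 32 and shuffling enabled. The network is trained with stochastic gradient descent and a mean squared error loss. The training loop iterates over the DataLoader for 10 epochs; for each batch it computes the loss, backpropagates the gradients, and updates the network weights. The running loss, averaged over the last 10 batches, is printed every 10 batches to monitor training progress.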

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    """Dataset of `size` random samples with 10 features each."""

    def __init__(self, size):
        self.data = torch.randn(size, 10)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class Net(nn.Module):
    """Two fully connected layers with a ReLU activation in between."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)
        return x


dataset = RandomDataset(1000)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
net = Net()
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(dataloader, 0):
        inputs = data
        # The dataset has no targets, so random labels are generated per batch.
        labels = torch.rand((data.shape[0], 1))
        optimizer.zero_grad()  # clear gradients left over from the previous step
        outputs = net(inputs)  # forward pass
        loss = criterion(outputs, labels)
        loss.backward()  # backpropagate the gradients
        optimizer.step()  # update the weights
        running_loss += loss.item()
        if i % 10 == 9:  # report the average loss every 10 batches
            print(f"[Epoch {epoch + 1}, Batch {i + 1}] loss: {running_loss / 10}")
            running_loss = 0.0

Output

[Epoch 1, Batch 10] loss: 0.25439725518226625
[Epoch 1, Batch 20] loss: 0.18304144889116286
[Epoch 1, Batch 30] loss: 0.1451663628220558
[Epoch 2, Batch 10] loss: 0.12896266356110572
[Epoch 2, Batch 20] loss: 0.11783223450183869
………………………………………………………..
[Epoch 10, Batch 30] loss: 0.09491728842258454
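
The log stops at batch 30 in every epoch even though 1000 samples with a batch size of 32 give 32 batches: the last batch holds only 8 samples, and the loss accumulated over batches 31 and 32 is reset at the start of the next epoch without being printed. The following is a minimal sketch of how the batch count can be checked. It assumes a stand-in TensorDataset in place of the RandomDataset class, and it uses the drop_last parameter, which the example above does not use.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the article's RandomDataset: 1000 samples, 10 features each.
dataset = TensorDataset(torch.randn(1000, 10))

loader = DataLoader(dataset, batch_size=32, shuffle=True)
num_batches = len(loader)  # ceil(1000 / 32)

# TensorDataset yields one-element tuples, so batch[0] is the feature tensor.
batch_sizes = [batch[0].shape[0] for batch in loader]

# drop_last=True discards the final, smaller batch.
loader_drop = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
num_batches_drop = len(loader_drop)  # floor(1000 / 32)

print(num_batches, batch_sizes[-1], num_batches_drop)  # prints: 32 8 31
```

Whether to drop the short final batch is a design choice; keeping it uses every sample, while dropping it keeps every batch the same size.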

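Data Sampling and Weighted Sampling

Data sampling means selecting only a subset of the data for processing. It matters in machine learning and data analysis when a large dataset cannot fit into RAM, and it supports training, testing, and validation in batches. Weighted sampling is a variant in which each data point is assigned a weight, and points with larger weights are more likely to be drawn. It is commonly used to give more influence to data points that matter more for the prediction, such as samples from an under-represented class.

Syntax

weighted_sampler = WeightedRandomSampler(<weights as an array-like object>,
num_samples=len(dataset), other parameters...)
loader = DataLoader(dataset, batch_size=batch_size,
sampler=weighted_sampler, other parameters...)

Here we define the weights as a list or another array-like object, with one weight per sample, and WeightedRandomSampler builds the sampler from them. We then pass the dataset to the DataLoader and supply the sampler through the sampler parameter. A sampler cannot be combined with shuffle=True, because the sampler already decides the order in which samples are drawn.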
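
One common way to choose the weights is to make them inversely proportional to class frequency, so that rare classes are drawn about as often as common ones. The sketch below illustrates this under stated assumptions: the imbalanced two-class dataset (900 versus 100 samples) and the names class_counts, class_weights, and sample_weights are hypothetical and introduced only for this illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced dataset: 900 samples of class 0, 100 of class 1.
labels = torch.cat([torch.zeros(900, dtype=torch.long),
                    torch.ones(100, dtype=torch.long)])
features = torch.randn(1000, 4)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse of its class frequency.
class_counts = torch.bincount(labels)      # tensor([900, 100])
class_weights = 1.0 / class_counts.float()
sample_weights = class_weights[labels]     # one weight per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=50, sampler=sampler)

# Count how often each class is drawn in one pass over the loader.
drawn = torch.zeros(2, dtype=torch.long)
for _, batch_labels in loader:
    drawn += torch.bincount(batch_labels, minlength=2)

print(drawn)  # the two counts should be roughly equal
```

Because each class carries the same total weight, a pass over the loader draws the two classes in roughly equal numbers, although class 1 makes up only 10% of the dataset.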

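Example

In the example below we implement weighted sampling with DataLoader and WeightedRandomSampler. We pass the dataset and batch_size=32 to the DataLoader, so 32 samples are processed at a time. The weights give samples with label 0 a weight of 2.0 and all other samples a weight of 1.0, and WeightedRandomSampler draws samples according to these weights. Because replacement=True, the same data point can be drawn more than once and can appear in several batches.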

import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
class CustomDataset(Dataset):
    def __init__(self):
        self.data = torch.randn((1000, 3, 32, 32))
        self.labels = torch.randint(0, 10, (1000,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label
dataset = CustomDataset()
weights = torch.where(dataset.labels == 0, torch.tensor(2.0), torch.tensor(1.0))
sampler = WeightedRandomSampler(weights, len(dataset), replacement=True)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
for batch_data, batch_labels in dataloader:
    print(batch_data.shape, batch_labels.shape)

Output

torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([32, 3, 32, 32]) torch.Size([32])
..................................................
torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([8, 3, 32, 32]) torch.Size([8])

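Multi-Process Data Loading

Loading data in parallel speeds up data loading and preprocessing. In PyTorch, the num_workers parameter enables it: DataLoader starts that many worker subprocesses (not threads), and each worker loads and preprocesses batches in parallel with the training loop. The parameter takes a non-negative integer.

Syntax

dataloader = DataLoader(dataset, num_workers=<number of workers>, other parameters...)

Here num_workers is the number of subprocesses used for data loading. The default, 0, loads the data in the main process. A common starting point is a value no larger than the number of available CPU cores. Negative values such as -1 are not accepted; DataLoader raises a ValueError for them.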
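
The following is a minimal sketch of how a worker count might be chosen and how DataLoader reacts to a negative value. The cap of 4 workers and the names available and suggested_workers are assumptions made for illustration, not recommendations from PyTorch; the best value depends on the workload. The sketch only computes the suggestion and does not start any worker processes.

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 10))

# os.cpu_count() can return None, so fall back to 0 (load in the main process).
available = os.cpu_count() or 0
# Illustrative cap of 4 workers; tune this for the actual workload.
suggested_workers = min(4, available)

# A negative value is rejected when the DataLoader is constructed.
try:
    DataLoader(dataset, batch_size=32, num_workers=-1)
    negative_rejected = False
except ValueError as err:
    negative_rejected = True
    print("ValueError:", err)

# num_workers=0 (the default) loads every batch in the main process.
loader = DataLoader(dataset, batch_size=32, num_workers=0)
print(suggested_workers, len(loader))
```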

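Example

In the code below we set num_workers to 2, so data loading and preprocessing run in 2 worker processes in parallel. We keep batch_size at 32 and shuffle=True (the sample order is shuffled before batches are formed).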

import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, num_samples):
        self.data = torch.randn((num_samples, 3, 64, 64))
        self.labels = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label
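if __name__ == "__main__":
    # The guard is needed on platforms that start worker processes with
    # spawn (Windows, and macOS by default).
    dataset = CustomDataset(num_samples=3000)
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
    for batch_data, batch_labels in dataloader:
        print("Batch data shape:", batch_data.shape)
        print("Batch labels shape:", batch_labels.shape)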

Output

Batch data shape: torch.Size([32, 3, 64, 64])
Batch labels shape: torch.Size([32])
Batch data shape: torch.Size([32, 3, 64, 64])
.......................................................
Batch data shape: torch.Size([24, 3, 64, 64])
Batch labels shape: torch.Size([24])

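Shuffling and Batch Size

As the name suggests, shuffling means randomly reordering the data points. This has several advantages, including reducing bias that comes from the order in which the data is stored. After shuffling, each batch is more likely to contain a representative mix of samples, which generally helps training. Batch size, by contrast, is the number of data points grouped together and processed in one step. It matters because a large dataset may not fit into memory all at once.

Syntax

dataloader = DataLoader(dataset, batch_size=<set a number>,
shuffle=<Boolean True or False>, other parameters...)

Here dataset is the data we want to batch and shuffle. batch_size takes an integer. shuffle takes a Boolean: if True, the data is reshuffled at every epoch; if False (the default), the data is read in its original order.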
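
The effect of shuffle is easiest to see on a tiny dataset. The sketch below assumes a hypothetical dataset of ten samples whose value equals their index, and it passes a seeded torch.Generator through the generator parameter so the shuffled order is reproducible; both are illustrative choices rather than part of the syntax above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten samples whose value equals their index, so the order is easy to read.
dataset = TensorDataset(torch.arange(10))

# shuffle=False keeps the original order.
ordered = DataLoader(dataset, batch_size=4, shuffle=False)
ordered_values = torch.cat([batch[0] for batch in ordered])

# A seeded generator makes the shuffled order reproducible across runs.
gen = torch.Generator().manual_seed(0)
shuffled = DataLoader(dataset, batch_size=4, shuffle=True, generator=gen)
epoch1 = torch.cat([batch[0] for batch in shuffled])  # first pass
epoch2 = torch.cat([batch[0] for batch in shuffled])  # new permutation drawn

print(ordered_values.tolist())  # indices in their original order
print(epoch1.tolist())          # a permutation of 0..9
print(epoch2.tolist())          # another permutation of 0..9
```

Every pass still visits each sample exactly once; only the order, and therefore the composition of each batch, changes.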

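Example

In the example below we pass two important parameters to the DataLoader class, batch_size and shuffle. We set batch_size to 128, so 128 data points are processed at a time. shuffle=True means the data is reshuffled at the start of every pass over the DataLoader. With shuffle=False no shuffling takes place, and the model may end up slightly biased by the order of the data.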

import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, num_samples):
        self.data = torch.randn((num_samples, 3, 32, 32))
        self.labels = torch.randint(0, 10, (num_samples,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label
dataset = CustomDataset(num_samples=1000)
# batch_size=128 matches the description and the output below (7 batches of 128, then 104).
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
for batch_data, batch_labels in dataloader:
    print("Batch data shape:", batch_data.shape)
    print("Batch labels shape:", batch_labels.shape)

Output

Batch data shape: torch.Size([128, 3, 32, 32])
Batch labels shape: torch.Size([128])
......................................................
Batch data shape: torch.Size([104, 3, 32, 32])
Batch labels shape: torch.Size([104])

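Conclusion

In this article we discussed how to use DataLoader in PyTorch: batching, shuffling, weighted sampling, and loading with worker processes. The batches it produces can then be used to train a neural network. These classes are also useful when training an existing model on new data, which saves time by building on work contributed by other developers and the open-source community. It is also important to understand that different models may need different hyperparameters, so the right settings depend on the available resources and the characteristics of the data.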

Asif Rahaman

Updated on: 28 July 2023
