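SageMaker - Training Machine Learning Models

Amazon SageMaker's fully managed training service lets you train machine learning models without managing the underlying infrastructure yourself.

To train a model, you can use one of SageMaker's built-in algorithms or bring your own model. In both cases, SageMaker runs the work as a training job.

How to Train a Model with Amazon SageMaker

The following Python example walks through training a model with SageMaker, step by step: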

Step 1: Prepare Your Data

First, prepare your data and store it in Amazon S3 as CSV or in another suitable format. Amazon SageMaker reads the data from S3 for the training job.
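
As a rough sketch of this step, the helper below writes rows to a CSV file in the layout the built-in XGBoost algorithm expects for CSV input: no header row, with the label in the first column. The function names, the sample field names, and the S3 key are hypothetical and chosen only for illustration. The upload helper uses boto3's upload_file call and assumes AWS credentials are already configured.

```python
import csv
from pathlib import Path


def write_xgboost_csv(rows, label_key, feature_keys, path):
    """Write rows as a headerless CSV with the label in the first column."""
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            # Label first, then the features in a fixed order
            writer.writerow([row[label_key]] + [row[k] for k in feature_keys])
    return path


def upload_to_s3(local_path, bucket, key):
    """Upload one file to S3 and return its URI (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the CSV helper works without AWS access

    boto3.client("s3").upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Example usage (hypothetical field names; replace "your-bucket" with a real bucket):
# rows = [{"churned": 1, "age": 42, "spend": 19.5}]
# local_file = write_xgboost_csv(rows, "churned", ["age", "spend"], "train.csv")
# upload_to_s3(local_file, "your-bucket", "train/train.csv")
```

Keeping the formatting and the upload in separate functions makes it possible to check the file layout locally before anything is sent to S3.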

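Step 2: Define the Estimator

Next, define an estimator. The Estimator object configures the training job. In this example, we train a model with the built-in XGBoost algorithm, as shown below: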

import sagemaker
from sagemaker import get_execution_role
from sagemaker.inputs import TrainingInput

# Define your SageMaker session and role
session = sagemaker.Session()
role = get_execution_role()

# Define the XGBoost estimator
xgboost = sagemaker.estimator.Estimator(
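    # The built-in XGBoost image needs an explicit version; "1.5-1" is an assumed example
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),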
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://your-bucket/output",
    sagemaker_session=session,
)

# Set hyperparameters
xgboost.set_hyperparameters(objective="binary:logistic", num_round=100)

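Step 3: Specify the Training Data

The training job needs to know where its input data lives. Use the TrainingInput class to specify the location of the data in S3, as shown below: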

# Specify training data in S3
train_input = TrainingInput(s3_data="s3://your-bucket/train", content_type="csv")
validation_input = TrainingInput(s3_data="s3://your-bucket/validation", content_type="csv")

Step 4: Train the Model

Finally, start the training job by calling the fit method, as shown below:

# Train the model
xgboost.fit({"train": train_input, "validation": validation_input})

When you call fit, SageMaker automatically provisions the resources, runs the training job, and saves the model output to the specified S3 location.
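
To make the output location concrete: SageMaker packages the trained model as model.tar.gz under a folder named after the training job, inside the output path you configured. The small helper below is only an illustrative sketch of that layout; the function name and the job name are made up. Once fit returns, the estimator's model_data attribute holds the actual URI, so a helper like this is mainly useful for reasoning about where to look.

```python
def model_artifact_uri(output_path, job_name):
    """Return the S3 URI where SageMaker stores the packaged model for a job."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# The job name below is a made-up placeholder; real names are generated per run.
print(model_artifact_uri("s3://your-bucket/output", "sagemaker-xgboost-example-job"))
# prints: s3://your-bucket/output/sagemaker-xgboost-example-job/output/model.tar.gz
```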

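Distributed Training with SageMaker

Amazon SageMaker supports distributed training, which lets you scale training across multiple instances. This is useful when you work with large datasets or deep learning models. SageMaker provides frameworks that support distributed training, such as TensorFlow and PyTorch.

To enable distributed training, increase the instance_count parameter on the Estimator object. The training script itself must also be written to share the work across the instances.

Example

Here is an example that uses TensorFlow: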

from sagemaker.tensorflow import TensorFlow

# Define the TensorFlow estimator with distributed training
tensorflow_estimator = TensorFlow(
    entry_point="train.py",
    role=role,
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    framework_version="2.3",
    py_version="py37",
)

# Train the model on multiple instances
tensorflow_estimator.fit({"train": train_input, "validation": validation_input})

In this example, SageMaker runs the training job on two ml.p3.2xlarge instances. Spreading the work this way can reduce the training time for large models.
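
The entry point train.py is not shown in the example, so the following is only a sketch of how such a script might discover its place in the cluster. It reads the SM_HOSTS, SM_CURRENT_HOST, SM_CHANNEL_TRAIN, and SM_MODEL_DIR environment variables that SageMaker framework containers set. The function name and the idea of deriving a "rank" from the sorted host list are assumptions made for illustration, not part of the SageMaker API.

```python
import json
import os


def read_cluster_info(environ=os.environ):
    """Read the host list and this host's position from SageMaker's environment."""
    # SM_HOSTS is a JSON-encoded list such as '["algo-1", "algo-2"]'
    hosts = sorted(json.loads(environ.get("SM_HOSTS", '["algo-1"]')))
    current = environ.get("SM_CURRENT_HOST", hosts[0])
    return {
        "hosts": hosts,
        "current_host": current,
        "rank": hosts.index(current),  # assumed convention: index in sorted host list
        "world_size": len(hosts),
        "train_dir": environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
        "model_dir": environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    }


if __name__ == "__main__":
    info = read_cluster_info()
    print(f"host {info['current_host']} is worker {info['rank']} of {info['world_size']}")
```

A script could use the rank to decide which shard of the data to read, or to let only worker 0 write the final model to model_dir. The fallback values let the same script run on a single machine outside SageMaker.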
