How can the Iliad dataset be prepared for training using Python?

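TensorFlow is a machine learning framework provided by Google. It is an open-source framework used together with Python to implement algorithms, deep learning applications and much more. It is used for both research and production.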

The 'tensorflow' package can be installed on Windows (or any other platform where pip is available) using the following command:

pip install tensorflow
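
The snippet further down also calls functions from the separate TensorFlow Text package (imported as tf_text), which is normally installed with "pip install tensorflow-text". As a minimal sketch, the following standard-library check reports whether both modules can be found in the current environment. The helper name check_packages is hypothetical and is not part of TensorFlow.

```python
import importlib.util
from typing import Dict, Iterable


def check_packages(module_names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to True if it can be located."""
    # find_spec returns None when a top-level module is not installed;
    # it does not import the module itself.
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}


# Module names used by the tutorial code.
for name, found in check_packages(["tensorflow", "tensorflow_text"]).items():
    print(name, "is installed" if found else "is missing")
```

If either module is reported as missing, installing it with pip before running the tutorial code avoids an ImportError later on.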

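Tensors are the data structure used in TensorFlow. They form the edges that connect operations in a flow diagram known as the 'data flow graph'. A tensor is essentially a multidimensional array or list.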
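
To make the "multidimensional array or list" idea concrete, here is a small sketch that uses plain nested Python lists in place of real tensors. The helper shape_of is hypothetical (it is not a TensorFlow function) and assumes regular, non-ragged nesting. In TensorFlow itself, the corresponding information is available through the shape attribute of a tensor created with tf.constant.

```python
from typing import Any, Tuple


def shape_of(nested: Any) -> Tuple[int, ...]:
    """Return the shape of a regular (non-ragged) nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:
            break
        nested = nested[0]  # descend into the first element
    return tuple(shape)


examples = {
    "scalar": 7,                        # rank 0
    "vector": [1, 2, 3],                # rank 1
    "matrix": [[1, 2, 3], [4, 5, 6]],   # rank 2
}

for name, value in examples.items():
    shape = shape_of(value)
    print(name, "rank:", len(shape), "shape:", shape)
```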

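We will be using the Iliad dataset, which contains text from three translations of the work, by William Cowper, Edward (Earl of Derby) and Samuel Butler. The model is trained to identify the translator when given a single line of text. The text files used have already been preprocessed: document headers and footers, line numbers and chapter titles have been removed.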
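
The tokenizing code below operates on (text, label) pairs, where the label says which translation a line came from. The following is a minimal standard-library sketch of how such pairs might be built. It is not the tutorial's actual loading code: the helper label_lines, the label order in TRANSLATORS and the placeholder lines are all assumptions made for illustration, and the placeholder strings are not quotations from the dataset.

```python
import random
from typing import List, Sequence, Tuple

# Assumed label order: label i refers to TRANSLATORS[i].
TRANSLATORS = ["cowper", "derby", "butler"]


def label_lines(lines_per_translator: Sequence[Sequence[str]]) -> List[Tuple[str, int]]:
    """Pair every line with the index of the translation it belongs to."""
    labeled = []
    for label, lines in enumerate(lines_per_translator):
        for line in lines:
            labeled.append((line, label))
    return labeled


# Placeholder text standing in for the three preprocessed files.
sample_lines = [
    ["placeholder line a1", "placeholder line a2"],
    ["placeholder line b1"],
    ["placeholder line c1", "placeholder line c2"],
]

all_labeled = label_lines(sample_lines)
# Shuffle so that lines from different translators are mixed together.
random.Random(0).shuffle(all_labeled)

for text, label in all_labeled:
    print(label, TRANSLATORS[label], text)
```

Keeping the label as a small integer rather than a name is a common choice because a classifier can use it directly as the target class.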

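We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.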

Example

Following is the code snippet:

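import tensorflow_text as tf_text  # separate package: pip install tensorflow-text

# all_labeled_data is assumed to be the dataset of (text, label) pairs built in an earlier step
print("Prepare the dataset for training")
tokenizer = tf_text.UnicodeScriptTokenizer()
print("Defining a function named 'tokenize' to tokenize the text data")
def tokenize(text, unused_label):
    # Convert the text to lower case, then split it into tokens
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)
tokenized_ds = all_labeled_data.map(tokenize)
print("Iterate over the dataset and print a few samples")
for text_batch in tokenized_ds.take(6):
    print("Tokens: ", text_batch.numpy())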

Code credit: https://tensorflowcn.cn/tutorials/load_data/text

Output

Prepare the dataset for training
Defining a function named 'tokenize' to tokenize the text data
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py:201: batch_gather (from tensorflow.python.ops.array_ops) is deprecated and will be removed after 2017-10-25.
Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.
Iterate over the dataset and print a few samples
Tokens: [b'but' b'i' b'have' b'now' b'both' b'tasted' b'food' b',' b'and' b'given']
Tokens: [b'all' b'these' b'shall' b'now' b'be' b'thine' b':' b'but' b'if' b'the'
b'gods']
Tokens: [b'their' b'spiry' b'summits' b'waved' b'.' b'there' b',' b'unperceived']
Tokens: [b'"' b'i' b'pray' b'you' b',' b'would' b'you' b'show' b'your' b'love'
b',' b'dear' b'friends' b',']
Tokens: [b'entering' b'beneath' b'the' b'clavicle' b'the' b'point']
Tokens: [b'but' b'grief' b',' b'his' b'father' b'lost' b',' b'awaits' b'him'
b'now' b',']

Explanation

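A function named 'tokenize' is defined. It converts each line to lower case and splits it into word and punctuation tokens, discarding the whitespace between them.
This function is applied to every element of the dataset using map.
A few samples of the tokenized dataset are displayed on the console.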
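
For readers who want to try the idea without installing TensorFlow, the sketch below approximates the same two steps (case folding, then splitting into word and punctuation tokens) with the standard library. It is only an approximation for plain English text: the real UnicodeScriptTokenizer splits on Unicode script boundaries and may behave differently on other inputs. The name approximate_tokenize is hypothetical, and the input line is reconstructed from the first sample in the output above, with its capitalization guessed.

```python
import re
from typing import List

# A token is either a run of word characters or a run of punctuation;
# whitespace matches neither alternative, so it is dropped.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def approximate_tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word and punctuation tokens."""
    return _TOKEN_PATTERN.findall(text.casefold())


line = "But I have now both tasted food, and given"
print(approximate_tokenize(line))
# ['but', 'i', 'have', 'now', 'both', 'tasted', 'food', ',', 'and', 'given']
```

Unlike the TensorFlow version, this returns ordinary Python strings rather than byte strings inside a tensor, which is why the printed tokens have no b prefix.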

AmitDiwan

Updated on: January 19, 2021
