How can TensorFlow Text and the `unicode_split()` function be used in Python to split a string by character?

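With TensorFlow Text, a string can be split into characters using the `unicode_split` method: first encode the string to be split, then assign the result of the function call to a variable. The function itself is part of TensorFlow core, where it is exposed as `tf.strings.unicode_split`.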

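We will use the Keras Sequential API, which helps build a sequential model for a plain stack of layers, where each layer has exactly one input tensor and one output tensor.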

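A neural network that contains at least one convolutional layer is called a convolutional neural network. A convolutional neural network can be used to build a learning model.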

TensorFlow Text contains a collection of text-related classes and ops that can be used with TensorFlow 2.0. It can be used for preprocessing in sequence modelling.

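We use Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser with no configuration and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.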

Tokenization is the process of breaking a string down into tokens. These tokens can be words, numbers, or punctuation marks.

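The important interfaces are Tokenizer and TokenizerWithOffsets, each of which has a single method: tokenize and tokenize_with_offsets, respectively. There are multiple tokenizers, and each one implements TokenizerWithOffsets (which extends the Tokenizer class). This includes an option to get byte offsets into the original string, which makes it possible to know which bytes of the original string a token was created from.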
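To make the idea of byte offsets concrete, here is a minimal plain-Python sketch. It does not use the TensorFlow Text API; `split_with_byte_offsets` is a hypothetical helper written only for illustration, and the sample string is an assumption. It records, for each character, where its UTF-8 bytes start and end in the encoded string.

```python
def split_with_byte_offsets(text):
    """Return (chars, starts, ends): each character with its UTF-8 byte span."""
    chars, starts, ends = [], [], []
    position = 0
    for ch in text:
        # Number of bytes this character occupies once encoded as UTF-8
        width = len(ch.encode("utf-8"))
        chars.append(ch)
        starts.append(position)
        position += width
        ends.append(position)
    return chars, starts, ends


# Hypothetical sample: one 1-byte ASCII character and two 3-byte characters
chars, starts, ends = split_with_byte_offsets("a仅今")
print(chars)
print(starts)
print(ends)
```

Because the offsets count bytes rather than characters, slicing the encoded string with a start and end pair recovers exactly the bytes of that token.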

Example

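import tensorflow as tf

print("The encoded characters are split")
# Encode the string as UTF-8 bytes, then split it into one token per Unicode character
tokens = tf.strings.unicode_split([u"仅今年前".encode('UTF-8')], 'UTF-8')
print("The tokenized data is converted to a list")
print(tokens.to_list())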

Code credit: https://tensorflowcn.cn/tutorials/tensorflow_text/intro

Output

The encoded characters are split
The tokenized data is converted to a list
[[b'\xe4\xbb\x85', b'\xe4\xbb\x8a', b'\xe5\xb9\xb4', b'\xe5\x89\x8d']]
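The tokens in the output are raw UTF-8 byte strings, which is why they appear as escape sequences. As a small plain-Python sketch (the variable names `raw_tokens` and `readable` are illustrative only), each token can be decoded back into a readable character:

```python
# The nested list printed above, copied as a Python literal
raw_tokens = [[b'\xe4\xbb\x85', b'\xe4\xbb\x8a', b'\xe5\xb9\xb4', b'\xe5\x89\x8d']]

# Decode every byte-string token back into a Python str
readable = [[token.decode("utf-8") for token in row] for row in raw_tokens]
print(readable)
```

Each token is three bytes long because each of these characters takes three bytes in UTF-8.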

Explanation

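All tokenizers return RaggedTensors whose innermost dimension of tokens maps back to the original individual strings.
As a result, the rank of the resulting shape increases by one.
When tokenizing languages that do not use whitespace to separate words, it is common to split by character.
This can be done with the unicode_split op found in TensorFlow core (tf.strings.unicode_split).
After unicode_split is called, the tokenized data is converted to a list with to_list().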
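The rank increase can be checked directly. The following is a minimal sketch, assuming TensorFlow 2.x is installed; the second input string `"ab"` is an assumption added only to show that rows of a RaggedTensor may have different lengths.

```python
import tensorflow as tf

# A batch of two strings; tf.constant stores Python str values as UTF-8 bytes
batch = tf.constant(["仅今年前", "ab"])

# Split every string in the batch into one token per Unicode character
tokens = tf.strings.unicode_split(batch, "UTF-8")

print(batch.shape.rank)   # rank of the input
print(tokens.shape.rank)  # rank of the result, one higher than the input

# Convert the RaggedTensor to nested Python lists and decode the byte tokens
decoded = [[t.decode("utf-8") for t in row] for row in tokens.to_list()]
print(decoded)
```

The first row holds four tokens and the second holds two, which a regular dense tensor could not represent without padding.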

AmitDiwan

Updated on: February 22, 2021
