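How can TensorFlow Text be used to split UTF-8 strings in Python?

TensorFlow Text can be used to split UTF-8 strings. This is done with 'UnicodeScriptTokenizer': create a 'UnicodeScriptTokenizer', then call its 'tokenize' method on the strings to split them.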

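As general background, the Keras Sequential API helps build a sequential model: a plain stack of layers in which every layer has exactly one input tensor and one output tensor.

A neural network that contains at least one convolutional layer is called a convolutional neural network, and such networks can be used to build learning models. Neither is needed for the tokenization example below.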

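TensorFlow Text contains a collection of text-related classes and ops that can be used with TensorFlow 2.0. It can be used for the preprocessing that sequence models require.

We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser with zero configuration and gives free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.

Tokenization is the process of breaking a string into tokens. These tokens can be words, numbers, or punctuation.

The important interfaces are Tokenizer and TokenizerWithOffsets, which each provide a single method: tokenize and tokenize_with_offsets, respectively. Several tokenizers are available, and each of them implements TokenizerWithOffsets (which extends the Tokenizer class). This gives the option of getting byte offsets into the original string, which shows which bytes of the original string each token was created from.

All tokenizers return RaggedTensors whose innermost dimension holds the tokens mapped to one original string. As a result, the rank of the output shape is one higher than the rank of the input.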
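To make the idea of byte offsets concrete, here is a minimal sketch in plain Python. It does not call TensorFlow Text: the helper name whitespace_tokens_with_byte_offsets is hypothetical, and simple whitespace splitting is assumed as a stand-in for a real tokenizer. It only illustrates that offsets count bytes of the UTF-8 encoding, not characters.

```python
import re

def whitespace_tokens_with_byte_offsets(text_value):
    """Return (token, start, end) triples; offsets index into the UTF-8 bytes."""
    data = text_value.encode("utf-8")
    # A bytes pattern follows ASCII rules, so every byte of a multi-byte
    # UTF-8 sequence counts as non-whitespace and stays inside its token.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(rb"\S+", data)]

for token, start, end in whitespace_tokens_with_byte_offsets("Sad☹ day"):
    print(token, start, end)
```

The first token ends at byte 6 rather than 4, because the '☹' character takes three bytes in UTF-8.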
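The rank increase can be pictured with nested Python lists. This is only an analogy under stated assumptions: plain lists stand in for RaggedTensors, and str.split stands in for a tokenizer. A batch of strings (rank 1) becomes a list of token lists (rank 2) whose rows have different lengths, which is why a ragged structure is needed.

```python
# Rank 1: one string per batch entry.
batch = ["everything not saved will be lost", "Sad day"]

# Whitespace splitting stands in for a real tokenizer here.
# Rank 2: the new innermost dimension holds the tokens of each string.
tokens = [sentence.split() for sentence in batch]

# The rows are "ragged": each string yields a different number of tokens.
row_lengths = [len(row) for row in tokens]
print(tokens)
print(row_lengths)
```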

Example

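# Requires the tensorflow-text package
import tensorflow_text as text

print("Unicode script tokenizer is being called")
# Splits on whitespace and on Unicode script boundaries
tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print("The tokenized data is converted to a list")
# tokens is a RaggedTensor; to_list() converts it to nested Python lists
print(tokens.to_list())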

Code credit: https://tensorflowcn.cn/tutorials/tensorflow_text/intro

Output

Unicode script tokenizer is being called
The tokenized data is converted to a list
[[b'everything', b'not', b'saved', b'will', b'be', b'lost', b'.'], [b'Sad', b'\xe2\x98\xb9']]

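Explanation

The tokenizer splits UTF-8 strings based on Unicode script boundaries.
The script codes correspond to International Components for Unicode (ICU) UScriptCode values.
It is similar to WhitespaceTokenizer, except that it also separates punctuation (USCRIPT_COMMON) from language text and separates text in different scripts from each other.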
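The splitting rule can be approximated in plain Python to see why 'lost.' and 'Sad☹' are each broken in two. This is a rough sketch, not how TensorFlow Text or ICU works internally: the standard library has no Unicode script property, so the hypothetical helper rough_script guesses a letter's script from the first word of its Unicode character name and treats every other non-space character as COMMON.

```python
import unicodedata
from itertools import groupby

def rough_script(ch):
    """Approximate a script label; None marks whitespace (an assumption-laden stand-in for ICU)."""
    if ch.isspace():
        return None
    if unicodedata.category(ch).startswith("L"):
        # e.g. "LATIN SMALL LETTER A" -> "LATIN"
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    # Punctuation, symbols, digits and so on are lumped together as COMMON.
    return "COMMON"

def split_on_script_change(s):
    """Start a new token at whitespace and wherever the rough script label changes."""
    return ["".join(group) for script, group in groupby(s, key=rough_script)
            if script is not None]

print(split_on_script_change("everything not saved will be lost."))
print(split_on_script_change("Sad☹"))
```

On these two inputs the sketch yields the same token boundaries as the output shown earlier, although it returns str values rather than bytes. It will differ from the real tokenizer on inputs such as combining marks, which ICU assigns to their own script category.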

AmitDiwan

Updated on: February 22, 2021
