How can Unicode strings be represented and manipulated in TensorFlow?

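Unicode strings are UTF-8 encoded by default. Passing a Unicode string to `tf.constant` represents it as a UTF-8 encoded scalar. To represent it as a UTF-16 encoded scalar, first encode the string with Python's built-in `str.encode` method (this is a Python method, not a TensorFlow one) and pass the resulting bytes to `tf.constant`.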

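Models that process natural language often handle different languages, each with its own character set. Unicode is the standard encoding system used to represent characters from almost all languages. Each character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.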
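
To make the idea of a code point concrete, here is a minimal sketch in plain Python (no TensorFlow needed). It uses only the built-ins `ord` and `chr`; the variable names are illustrative.

```python
# Map each character of a string to its Unicode code point and back.
text = "语言处理"
code_points = [ord(ch) for ch in text]
print(code_points)                       # [35821, 35328, 22788, 29702]
print([hex(cp) for cp in code_points])   # ['0x8bed', '0x8a00', '0x5904', '0x7406']

# Every code point lies in the Unicode range 0..0x10FFFF.
in_range = all(0 <= cp <= 0x10FFFF for cp in code_points)

# chr() inverts ord(), so the original string can be rebuilt.
restored = "".join(chr(cp) for cp in code_points)
print(in_range, restored == text)        # True True
```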

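Let us see how to represent Unicode strings in TensorFlow using Python, and how to manipulate them with the Unicode equivalents of standard string ops. These ops can, for instance, split a Unicode string into tokens based on script detection.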
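
As a rough illustration of script detection, the sketch below calls `tf.strings.unicode_script`, which returns one integer script code per code point (the values follow the ICU `UScriptCode` numbering). The two sample characters are chosen for illustration only, and a full tokenizer would additionally compare the scripts of neighbouring characters to find token boundaries.

```python
import tensorflow as tf

# Code points for one Han character and one Cyrillic character.
code_points = [ord("芸"), ord("Б")]

# One int32 script code per input code point.
scripts = tf.strings.unicode_script(code_points)
print(scripts.numpy().tolist())  # [17, 8], i.e. Han and Cyrillic
```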

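We use Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires no configuration, and provides free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.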

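import tensorflow as tf
print("A constant is defined")
# Bare expression: the tensor is created but not printed, so it does not appear in the output.
tf.constant(u"Thanks 😊")
print("The shape of the tensor is")
# Also a bare expression; the shape (2,) counts strings, not characters.
tf.constant([u"You are", u"welcome!"]).shape
print("Unicode string represented as UTF-8 encoded scalar")
text_utf8 = tf.constant(u"语言处理")
print(text_utf8)
print("Unicode string represented as UTF-16 encoded scalar")
# Python's str.encode produces the UTF-16-BE bytes, which TensorFlow stores unchanged.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
print(text_utf16be)
print("Unicode string represented as a vector of Unicode code points")
# ord() maps each character to its integer code point.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
print(text_chars)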

Code credit: https://tensorflowcn.cn/tutorials/load_data/unicode

Output

A constant is defined
The shape of the tensor is
Unicode string represented as UTF-8 encoded scalar
tf.Tensor(b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86', shape=(), dtype=string)
Unicode string represented as UTF-16 encoded scalar
tf.Tensor(b'\x8b\xed\x8a\x00Y\x04t\x06', shape=(), dtype=string)
Unicode string represented as a vector of Unicode code points
tf.Tensor([35821 35328 22788 29702], shape=(4,), dtype=int32)
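
The output above does not display the shape of the two-string tensor. The following sketch prints it, and also compares the byte length and character length of the UTF-8 scalar using `tf.strings.length`. It is an illustrative addition, not part of the original listing.

```python
import tensorflow as tf

# A vector holding two byte strings of different lengths (7 and 8 bytes).
words = tf.constant(["You are", "welcome!"])
print(words.shape)  # (2,)

text_utf8 = tf.constant("语言处理")
# The default unit is BYTE: each of these characters takes 3 bytes in UTF-8.
num_bytes = tf.strings.length(text_utf8)
# With unit="UTF8_CHAR" the op counts Unicode characters instead.
num_chars = tf.strings.length(text_utf8, unit="UTF8_CHAR")
print(num_bytes.numpy(), num_chars.numpy())  # 12 4
```

The shape is `(2,)` regardless of how long each string is, while the scalar `text_utf8` has shape `()` even though it holds 12 bytes.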

Explanation

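`tf.string` is a basic TensorFlow dtype.
It lets users build tensors of byte strings.
Unicode strings are UTF-8 encoded by default.
Because byte strings are treated as atomic units, a `tf.string` tensor can hold byte strings of varying lengths.
The string length is not included in the tensor dimensions.
When strings are constructed in Python, Unicode handling differs between Python 2 and Python 3. In Python 2, Unicode strings are marked with a "u" prefix.
In Python 3, strings are Unicode by default, so the "u" prefix is optional.
There are two standard ways to represent a Unicode string in TensorFlow:
String scalar: the sequence of code points is encoded using a known character encoding.
int32 vector: each position contains a single code point.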
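
The two representations can be converted into each other inside TensorFlow. The sketch below is a minimal illustration using `tf.strings.unicode_decode`, `tf.strings.unicode_encode`, and `tf.strings.unicode_transcode`; it goes beyond the listing above, which builds each representation in Python first.

```python
import tensorflow as tf

text_utf8 = tf.constant("语言处理")

# String scalar -> int32 vector of code points.
text_chars = tf.strings.unicode_decode(text_utf8, input_encoding="UTF-8")
print(text_chars.numpy().tolist())  # [35821, 35328, 22788, 29702]

# int32 vector -> string scalar in a chosen encoding.
roundtrip = tf.strings.unicode_encode(text_chars, output_encoding="UTF-8")

# String scalar in one encoding -> string scalar in another encoding.
text_utf16be = tf.strings.unicode_transcode(
    text_utf8, input_encoding="UTF-8", output_encoding="UTF-16-BE")

print(roundtrip.numpy() == "语言处理".encode("UTF-8"))         # True
print(text_utf16be.numpy() == "语言处理".encode("UTF-16-BE"))  # True
```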

AmitDiwan

Updated on: February 19, 2021
