How can Unicode strings be split and byte offsets be obtained with TensorFlow and Python?

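Use the 'unicode_split' function to split a Unicode string into its characters, and the 'unicode_decode_with_offsets' function to obtain the byte offset at which each character starts. Both functions live in the 'tf.strings' module of the 'tensorflow' package.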

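To begin with, Unicode strings are represented in Python and manipulated with the Unicode-aware equivalents of the standard string operations. With these Unicode equivalents, a Unicode string can also be split into tokens based on script detection.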

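We are using Google Colaboratory to run the code below. Google Colab, or Colaboratory, runs Python code in the browser with no configuration required and gives free access to GPUs (Graphical Processing Units). Colaboratory is built on top of Jupyter Notebook.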

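import tensorflow as tf

# Assumption: `thanks` is not defined in the original snippet; this value follows the TensorFlow Unicode tutorial.
thanks = u"Thanks 😊".encode('UTF-8')

print("Split unicode strings")
# A notebook shows this value only when it is the last expression in a cell; wrap it in print() to display it.
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')
print("Printing byte offset for characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))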

Code credit: https://tensorflowcn.cn/tutorials/load_data/unicode

Output

Split unicode strings
Printing byte offset for characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
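
The offsets 0, 4 and 8 follow from the fact that each of the three emoji occupies four bytes in UTF-8. As a cross-check, here is a minimal plain-Python sketch that reproduces the same code points and offsets without TensorFlow. The helper name decode_with_offsets is hypothetical and is not part of any library; it only imitates the shape of the result, not how TensorFlow computes it.

```python
# Plain-Python cross-check (not TensorFlow): walk the string and track UTF-8 byte offsets.
def decode_with_offsets(text):
    """Return (codepoints, offsets) for text.

    codepoints: the Unicode code point of each character.
    offsets: the byte position where each character starts in the UTF-8 encoding.
    """
    codepoints, offsets = [], []
    position = 0
    for char in text:
        codepoints.append(ord(char))
        offsets.append(position)
        # Advance by the number of bytes this character needs in UTF-8 (1 to 4).
        position += len(char.encode("utf-8"))
    return codepoints, offsets


codepoints, offsets = decode_with_offsets("🎈🎉🎊")
for codepoint, offset in zip(codepoints, offsets):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))
```

For text that mixes ASCII with other characters the offsets are no longer evenly spaced, which is why they have to be computed rather than assumed.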

Explanation

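The tf.strings.unicode_split operation splits a Unicode string into substrings of individual characters.
The character tensor produced by tf.strings.unicode_decode often has to be aligned with the original string.
To do that, the offset at which each character begins must be known.
The tf.strings.unicode_decode_with_offsets method is similar to unicode_decode, except that it returns a second tensor containing the start offset of each character.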
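
To make the alignment step concrete, the following sketch shows how start offsets can be used to cut the original UTF-8 bytes back into one substring per character. It is plain Python rather than TensorFlow, and split_by_offsets is a hypothetical helper name introduced only for this illustration.

```python
# Plain-Python illustration (not TensorFlow): align start offsets with the original bytes.
def split_by_offsets(encoded, offsets):
    """Slice UTF-8 bytes into per-character substrings using their start offsets."""
    # Each character ends where the next one starts; the last one ends at the end of the bytes.
    ends = list(offsets[1:]) + [len(encoded)]
    return [encoded[start:end] for start, end in zip(offsets, ends)]


encoded = "🎈🎉🎊".encode("utf-8")
offsets = [0, 4, 8]  # start offsets reported in the output section of this article
pieces = split_by_offsets(encoded, offsets)
print([piece.decode("utf-8") for piece in pieces])
```

Joining the pieces back together yields the original byte string, so the offsets and the character substrings describe the same data from two angles.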

AmitDiwan

Updated on: February 20, 2021

