UTF-8 Validation in C++

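Suppose we have a list of integers representing data, one byte per integer. We need to check whether it is a valid UTF-8 encoding. A UTF-8 character can be 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character, the first n bits of the first byte are all 1s and the (n+1)-th bit is 0; it is followed by n-1 bytes whose two most significant bits are 10.

The encoding therefore works as follows: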

Character number range (hexadecimal)	UTF-8 byte sequence (binary)
0000 0000-0000 007F	0xxxxxxx
0000 0080-0000 07FF	110xxxxx 10xxxxxx
0000 0800-0000 FFFF	1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
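
As a rough illustration of the table, the length of a sequence can be read off its leading byte alone. The helper name sequenceLength below is hypothetical and not part of the original solution; it is a minimal sketch that returns 0 for a byte that cannot start a sequence (a continuation byte 10xxxxxx, or a prefix of five or more 1 bits).

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes in the UTF-8 sequence that
// starts with `lead`, or 0 if `lead` cannot start a sequence.
int sequenceLength(std::uint8_t lead) {
    if ((lead >> 7) == 0b0)     return 1;  // 0xxxxxxx
    if ((lead >> 5) == 0b110)   return 2;  // 110xxxxx
    if ((lead >> 4) == 0b1110)  return 3;  // 1110xxxx
    if ((lead >> 3) == 0b11110) return 4;  // 11110xxx
    return 0;                              // 10xxxxxx or 11111xxx
}
```

For instance, sequenceLength(197) yields 2, because 197 is 11000101 in binary and starts with 110.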

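So if the input is [197, 130, 1], it represents the byte sequence 11000101 10000010 00000001, and the function returns true. It is a valid UTF-8 encoding of a 2-byte character followed by a 1-byte character.

To solve this, we will follow these steps: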
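
To check the bit patterns by hand, it can help to print each value in binary. The sketch below uses std::bitset for that; the helper name toBits is hypothetical, and it assumes that only the low 8 bits of each integer are meaningful.

```cpp
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: renders the low 8 bits of each value as a
// binary string, with the bytes separated by single spaces.
std::string toBits(const std::vector<int>& data) {
    std::string out;
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i > 0) out += ' ';
        unsigned long long byte = static_cast<unsigned long long>(data[i] & 0xFF);
        out += std::bitset<8>(byte).to_string();
    }
    return out;
}
```

Calling toBits on {197, 130, 1} gives "11000101 10000010 00000001", which matches the byte sequence described above.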

cnt := 0
for i in range 0 to size of data array - 1
- x := data[i]
- if cnt is 0, then
  - if x/32 = 110 (in binary), then set cnt as 1
  - otherwise when x/16 = 1110, then cnt = 2
  - otherwise when x/8 = 11110, then cnt = 3
  - otherwise when x/128 is not 0, then return false
- otherwise
  - if x/64 is not 10, then return false
  - decrease cnt by 1
return true when cnt is 0

Example (C++)

Let us see the following implementation to get a better understanding:

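#include <iostream>
#include <vector>
using namespace std;
class Solution {
   public:
   bool validUtf8(vector<int>& data) {
      int cnt = 0; // continuation bytes still expected
      for(size_t i = 0; i < data.size(); i++){
         int x = data[i];
         if(!cnt){
            // leading byte: its prefix tells how many bytes follow
            if((x >> 5) == 0b110){
               cnt = 1;
            }
            else if((x >> 4) == 0b1110){
               cnt = 2;
            }
            else if((x >> 3) == 0b11110){
               cnt = 3;
            }
            // anything else must be a 1-byte character (0xxxxxxx)
            else if((x >> 7) != 0) return false;
         } else {
            // continuation byte: must look like 10xxxxxx
            if((x >> 6) != 0b10) return false;
            cnt--;
         }
      }
      // valid only if no multi-byte character is left unfinished
      return cnt == 0;
   }
};
int main(){
   Solution ob;
   vector<int> v = {197,130,1};
   cout << (ob.validUtf8(v));
   return 0;
}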

Input

[197,130,1]

Output

1
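
To see how invalid input is rejected, the same counting approach can be sketched as a free function and tried on a few more lists. The name isValidUtf8 is hypothetical, and masking each value to its low 8 bits is an assumption that the original solution does not make. Like the solution above, this sketch checks only the bit patterns; it does not reject overlong encodings or code points beyond 10 FFFF.

```cpp
#include <vector>

// Sketch of the counting approach as a free function.
// Assumption: only the low 8 bits of each integer carry data.
bool isValidUtf8(const std::vector<int>& data) {
    int pending = 0;  // continuation bytes still expected
    for (int value : data) {
        int x = value & 0xFF;
        if (pending == 0) {
            if ((x >> 5) == 0b110)        pending = 1;
            else if ((x >> 4) == 0b1110)  pending = 2;
            else if ((x >> 3) == 0b11110) pending = 3;
            else if ((x >> 7) != 0)       return false;  // stray 10xxxxxx or 11111xxx
        } else {
            if ((x >> 6) != 0b10) return false;  // not a continuation byte
            --pending;
        }
    }
    return pending == 0;  // false if a character was cut short
}
```

With this sketch, {235, 140, 4} is rejected: 11101011 announces a 3-byte character, 10001100 is a valid continuation byte, but 00000100 is not. A lone {197} is rejected as well, because the announced continuation byte never arrives.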

Arnab Chakraborty

Updated on: May 2, 2020