Elasticsearch - Analysis

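When a query is processed during a search operation, the content in any index is analyzed by the analysis module. This module consists of analyzers, tokenizers, token filters and character filters. If no analyzer is defined, the built-in analyzers, tokenizers, token filters and character filters are registered with the analysis module by default.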

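In the following example, we use the standard analyzer, which is the one applied when no other analyzer is specified. It splits the sentence into words using grammar-based rules and lowercases each word.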

POST _analyze
{
   "analyzer": "standard",
   "text": "Today's weather is beautiful"
}

After running the above code, we get the response shown below:

{
   "tokens" : [
      {
         "token" : "today's",
         "start_offset" : 0,
         "end_offset" : 7,
         "type" : "<ALPHANUM>",
         "position" : 0
      },
      {
         "token" : "weather",
         "start_offset" : 8,
         "end_offset" : 15,
         "type" : "<ALPHANUM>",
         "position" : 1
      },
      {
         "token" : "is",
         "start_offset" : 16,
         "end_offset" : 18,
         "type" : "<ALPHANUM>",
         "position" : 2
      },
      {
         "token" : "beautiful",
         "start_offset" : 19,
         "end_offset" : 28,
         "type" : "<ALPHANUM>",
         "position" : 3
      }
   ]
}
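
To make the shape of this response easier to follow, here is a minimal Python sketch that reproduces the tokens, offsets and positions for this particular sentence. It is only an approximation: the regular expression and the function name `approx_standard_analyze` are assumptions made for illustration, while the real standard analyzer follows Unicode text segmentation rules and handles many more cases.

```python
import re

# Rough stand-in for a "word": runs of letters/digits, optionally joined by apostrophes.
# This pattern is an illustrative assumption, not Elasticsearch's actual rule set.
WORD = re.compile(r"[^\W_]+(?:'[^\W_]+)*")

def approx_standard_analyze(text):
    """Return token dicts shaped like the _analyze response (without "type")."""
    tokens = []
    for position, match in enumerate(WORD.finditer(text)):
        tokens.append({
            "token": match.group().lower(),   # the analyzer lowercases each token
            "start_offset": match.start(),    # index of the first character
            "end_offset": match.end(),        # index just past the last character
            "position": position,             # ordinal of the token in the stream
        })
    return tokens

for tok in approx_standard_analyze("Today's weather is beautiful"):
    print(tok)
```

Note that end_offset is exclusive, so "today's" spans offsets 0 to 7 even though its last character sits at index 6.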

Configuring the Standard Analyzer

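We can configure the standard analyzer with various parameters to fit our own requirements.

In the following example, we configure the standard analyzer with a max_token_length of 5.

To do this, we first create an index whose analyzer sets the max_token_length parameter.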

PUT index_4_analysis
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_english_analyzer": {
               "type": "standard",
               "max_token_length": 5,
               "stopwords": "_english_"
            }
         }
      }
   }
}

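Next, we apply the analyzer to the text shown below. Note two things in the response. First, any token longer than five characters is split at five-character intervals, so "weather" becomes "weath" and "er", and "today's" becomes "today" and "s". Second, the word "is" does not appear at all: it is in the _english_ stop word list configured above, so the analyzer removes it. Its position is still counted, which is why the positions in the response jump from 3 to 5.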

POST index_4_analysis/_analyze
{
   "analyzer": "my_english_analyzer",
   "text": "Today's weather is beautiful"
}

After running the above code, we get the response shown below:

{
   "tokens" : [
      {
         "token" : "today",
         "start_offset" : 0,
         "end_offset" : 5,
         "type" : "<ALPHANUM>",
         "position" : 0
      },
      {
         "token" : "s",
         "start_offset" : 6,
         "end_offset" : 7,
         "type" : "<ALPHANUM>",
         "position" : 1
      },
      {
         "token" : "weath",
         "start_offset" : 8,
         "end_offset" : 13,
         "type" : "<ALPHANUM>",
         "position" : 2
      },
      {
         "token" : "er",
         "start_offset" : 13,
         "end_offset" : 15,
         "type" : "<ALPHANUM>",
         "position" : 3
      },
      {
         "token" : "beaut",
         "start_offset" : 19,
         "end_offset" : 24,
         "type" : "<ALPHANUM>",
         "position" : 5
      },
      {
         "token" : "iful",
         "start_offset" : 24,
         "end_offset" : 28,
         "type" : "<ALPHANUM>",
         "position" : 6
      }
   ]
}
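
The two effects visible in this response, splitting at max_token_length and stop word removal that leaves a gap in the positions, can be sketched in plain Python. This is a simplified model and not Elasticsearch's implementation: the function names `split_offsets` and `analyze`, the apostrophe handling, and the short `STOPWORDS` set are all assumptions chosen so that the sketch matches the response above.

```python
# Illustrative subset of English stop words (assumption; the real _english_ list is longer).
STOPWORDS = frozenset({"a", "an", "and", "is", "it", "the", "was"})

def split_offsets(text, max_token_length=255):
    """Return (start, end) offsets of word-like runs, cut every max_token_length chars."""
    offsets = []
    start = None  # start offset of the token being built, or None between tokens
    for i, ch in enumerate(text):
        next_is_word = i + 1 < len(text) and text[i + 1].isalnum()
        # An apostrophe only counts as part of a token when it sits inside one.
        joins = ch == "'" and start is not None and next_is_word
        if ch.isalnum() or joins:
            if start is None:
                start = i
            if i + 1 - start == max_token_length:
                offsets.append((start, i + 1))  # token is full: emit it and restart
                start = None
        elif start is not None:
            offsets.append((start, i))  # a separator ends the current token
            start = None
    if start is not None:
        offsets.append((start, len(text)))  # flush the last token
    return offsets

def analyze(text, max_token_length=255, stopwords=frozenset()):
    """Lowercase each token and drop stop words while keeping their positions."""
    result = []
    for position, (start, end) in enumerate(split_offsets(text, max_token_length)):
        token = text[start:end].lower()
        if token in stopwords:
            continue  # the position is consumed, which leaves a gap in the output
        result.append({"token": token, "start_offset": start,
                       "end_offset": end, "position": position})
    return result

for tok in analyze("Today's weather is beautiful", max_token_length=5, stopwords=STOPWORDS):
    print(tok)
```

In this model the apostrophe in "Today's" is dropped because the first five characters already filled a token, so the apostrophe no longer sits inside a token when it is reached.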

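The following table lists the various analyzers and their descriptions:

No.	Analyzer and description
1	Standard analyzer (standard): The stopwords and max_token_length settings can be set for this analyzer. By default, the stop word list is empty and max_token_length is 255.
2	Simple analyzer (simple): This analyzer is composed of the lowercase tokenizer.
3	Whitespace analyzer (whitespace): This analyzer is composed of the whitespace tokenizer.
4	Stop analyzer (stop): stopwords and stopwords_path can be configured. By default, stopwords is initialized to the English stop words; stopwords_path, if set, holds the path to a text file containing stop words.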
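
The difference between the simple, whitespace and stop analyzers is easiest to see side by side. The following Python sketch imitates each one on the sentence used earlier; the function names and the short `STOPWORDS` set are assumptions for illustration, and the functions return only the token text, without offsets or positions.

```python
import re

# Illustrative subset of English stop words (assumption; the real list is longer).
STOPWORDS = frozenset({"a", "an", "and", "is", "it", "the", "was"})

def whitespace_analyze(text):
    """Imitates the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()

def simple_analyze(text):
    """Imitates the simple analyzer: split on non-letters, then lowercase."""
    return [word.lower() for word in re.findall(r"[^\W\d_]+", text)]

def stop_analyze(text, stopwords=STOPWORDS):
    """Imitates the stop analyzer: the simple analyzer plus stop word removal."""
    return [token for token in simple_analyze(text) if token not in stopwords]

sentence = "Today's weather is beautiful"
print(whitespace_analyze(sentence))
print(simple_analyze(sentence))
print(stop_analyze(sentence))
```

For this sentence, only the whitespace version keeps "Today's" intact with its capital letter and apostrophe. The other two split it into "today" and "s", and the stop version additionally drops "is".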

Tokenizers

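Tokenizers are used to generate tokens from text in Elasticsearch. Text can be broken into tokens at whitespace, punctuation or other boundaries. Elasticsearch has many built-in tokenizers, which can be used to build custom analyzers.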

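The following example uses the lowercase tokenizer, which breaks text into terms whenever it encounters a non-letter character and also lowercases every term: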

POST _analyze
{
   "tokenizer": "lowercase",
   "text": "It Was a Beautiful Weather 5 Days ago."
}

After running the above code, we get the response shown below:

{
   "tokens" : [
      {
         "token" : "it",
         "start_offset" : 0,
         "end_offset" : 2,
         "type" : "word",
         "position" : 0
      },
      {
         "token" : "was",
         "start_offset" : 3,
         "end_offset" : 6,
         "type" : "word",
         "position" : 1
      },
      {
         "token" : "a",
         "start_offset" : 7,
         "end_offset" : 8,
         "type" : "word",
         "position" : 2
      },
      {
         "token" : "beautiful",
         "start_offset" : 9,
         "end_offset" : 18,
         "type" : "word",
         "position" : 3
      },
      {
         "token" : "weather",
         "start_offset" : 19,
         "end_offset" : 26,
         "type" : "word",
         "position" : 4
      },
      {
         "token" : "days",
         "start_offset" : 29,
         "end_offset" : 33,
         "type" : "word",
         "position" : 5
      },
      {
         "token" : "ago",
         "start_offset" : 34,
         "end_offset" : 37,
         "type" : "word",
         "position" : 6
      }
   ]
}

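The following table lists the tokenizers and their descriptions:

No.	Tokenizer and description
1	Standard tokenizer (standard): A grammar-based tokenizer; max_token_length can be configured for it.
2	Edge n-gram tokenizer (edge_ngram, formerly edgeNGram): Settings such as min_gram, max_gram and token_chars can be set for this tokenizer.
3	Keyword tokenizer (keyword): Emits the entire input as a single token; buffer_size can be set for it.
4	Letter tokenizer (letter): Captures a whole word until a non-letter character is encountered.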
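
Of the tokenizers in the table, the edge n-gram tokenizer is the least obvious, so here is a small Python sketch of what min_gram and max_gram control. It is an approximation under stated assumptions: the function `edge_ngrams` and its `split_words` flag are hypothetical names for this sketch, where `split_words=True` stands in for a token_chars setting of letters and digits, and `split_words=False` stands in for leaving token_chars empty.

```python
import re

def edge_ngrams(text, min_gram=1, max_gram=2, split_words=False):
    """Return prefixes of length min_gram..max_gram for each piece of the input."""
    # split_words=False: treat the whole input as a single piece.
    # split_words=True: split on anything that is not a letter or digit.
    pieces = re.findall(r"[^\W_]+", text) if split_words else [text]
    grams = []
    for piece in pieces:
        # Pieces shorter than min_gram produce nothing.
        for size in range(min_gram, min(max_gram, len(piece)) + 1):
            grams.append(piece[:size])  # always anchored at the start of the piece
    return grams

print(edge_ngrams("Quick Fox"))
print(edge_ngrams("2 Quick Foxes.", min_gram=2, max_gram=10, split_words=True))
```

Because every n-gram is anchored at the start of a word, this kind of tokenizer is typically used for search-as-you-type matching on prefixes.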
