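A Brief Introduction to Elasticsearch (ES)

I. Getting to Know Elasticsearch

Elasticsearch (ES for short) is an open-source distributed search engine with strong performance and a rich feature set for real-time data indexing, search, and analytics.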

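1. How It Works

Inverted index
The inverted index is one of the most important ideas behind ES. For every term that appears in the documents, it records which documents contain the term and at which positions. The term is the key, and the value is the list of documents associated with it. Because the data is stored this way, looking up a keyword goes straight to the list of matching documents instead of scanning every document, which is what makes full-text search fast.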
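
As a rough illustration of the idea, here is a minimal Python sketch of an inverted index. It is not how Lucene actually stores its index; the function name `build_inverted_index` and the lowercase-and-split-on-whitespace tokenization are assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: naive tokenization (lowercase, split on whitespace).
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs",
}
index = build_inverted_index(docs)

# Looking up a term returns the matching documents directly,
# without scanning the text of every document.
print(sorted(index["quick"]))  # ids of documents containing "quick"
print(index["the"])            # postings with positions for "the"
```

Note that "dogs" and "dog" end up as different terms here; normalizing them is the job of the analyzers discussed later in this article.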

Lucene
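ES is built on top of Lucene, a high-performance full-text search library that provides analysis, indexing, and search. ES adds distributed search and analytics on top of Lucene, which lets it handle data at the petabyte scale.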

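Distributed architecture
ES is a distributed system. An index is split into shards, and each shard can have replicas. Shards are balanced automatically across the nodes of the cluster, which provides both high availability and high performance.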
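
To make the sharding idea concrete, the sketch below shows the general principle of routing a document to a primary shard: hash a routing key (the document ID by default) and take it modulo the number of primary shards. This is a simplified illustration only. The function names are hypothetical, and CRC32 is used as a stand-in for the hash function Elasticsearch uses internally.

```python
import zlib

def pick_shard(routing_key: str, number_of_primary_shards: int) -> int:
    """Return the primary shard number for a routing key."""
    # Assumption: CRC32 stands in for Elasticsearch's internal hash function.
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards

def shard_distribution(doc_ids, number_of_primary_shards):
    """Count how many documents land on each primary shard."""
    counts = [0] * number_of_primary_shards
    for doc_id in doc_ids:
        counts[pick_shard(doc_id, number_of_primary_shards)] += 1
    return counts

doc_ids = [f"doc-{i}" for i in range(1000)]
counts = shard_distribution(doc_ids, 5)
print(counts)  # documents per shard; roughly even for a well-mixed hash
```

Because the shard number depends on the number of primary shards, that number cannot simply be changed after the index has been created without re-routing the documents.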

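II. Are There Other Search Engines?

Yes. Other options on the market include Apache Solr, Amazon CloudSearch, Sphinx, and Microsoft Azure Search.

Let's take a look at these search engines and compare their characteristics and typical use cases from several angles, to help you choose the search solution that fits your needs.

1. A quick look at these four search engines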

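Apache Solr:
Apache Solr is an enterprise search platform built on Apache Lucene. It provides full-text search, distributed search, multi-language support, complex queries and filtering, highlighting, and relevance ranking. Solr is also an open-source project with strong community support. It is typically used to build search engines, to ingest and index documents, and to add search features to applications.

Amazon CloudSearch:
Amazon CloudSearch is a managed search service from Amazon that takes advantage of Amazon's scale and elastic infrastructure. It provides full-text search, customizable search, multi-language support, automatic scaling, and high reliability. With Amazon CloudSearch, users can get search running within minutes, customize the search experience, integrate with existing applications through the query API, and cover a variety of enterprise use cases.

Sphinx:
Sphinx is a fast, efficient full-text search engine for searching data from a variety of sources, such as databases and text files. It supports full-text search, real-time indexing, distributed search, several query syntaxes, extensibility, and flexible result sorting. Sphinx is written in C++ and is known for high processing speed and low memory consumption.

Microsoft Azure Search:
Microsoft Azure Search is a managed search service from Microsoft that makes it easy to add search to an application. It provides full-text search, filtering, sorting, paging, query syntax, automatic scaling, and multi-language support, and it is tightly integrated with the Azure ecosystem. Azure Search can serve search over a variety of data sources and integrates with applications built on other Microsoft services, such as Azure Cosmos DB and Azure SQL Database.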

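2. Comparing their characteristics and use cases from different angles

Architecture and scalability
Elasticsearch shards and replicates data across multiple nodes to achieve distributed storage and processing. It scales in a simple and flexible way: nodes can be added or removed to raise or lower the capacity of the system. Solr likewise uses a distributed architecture and scales horizontally. CloudSearch and Azure Search are managed services that provide automatic scaling and reliability guarantees. Sphinx is self-hosted software rather than a managed service, and it scales through its distributed search support.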

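Query and analytics features
Elasticsearch has powerful query and analytics features. It supports full-text search, fuzzy queries, multi-field queries, range queries, filtered queries, and more, along with aggregations, sorting, and highlighting. Solr offers similar features, supports complex queries and filtering, and has a rich plugin ecosystem. CloudSearch, Sphinx, and Azure Search offer a narrower set of query and analytics features and are a better fit for simpler search needs.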

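Community and ecosystem
Elasticsearch and Solr are both open-source projects with active communities and a rich set of plugins and tools. Both have a wide range of use cases and documentation to draw on. CloudSearch is a managed service from Amazon, backed by Amazon's infrastructure and ecosystem. The communities and plugin ecosystems of Sphinx and Azure Search are comparatively small.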

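Deployment and management
When self-hosted, Elasticsearch and Solr both require you to manage deployment and maintenance yourself. They provide rich configuration options and monitoring tools, but managing and tuning them takes time and effort. Sphinx is also self-hosted and has to be deployed and maintained by you. CloudSearch and Azure Search are managed services, so you do not have to worry about the underlying infrastructure and can focus on developing the application and its features.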

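Data sources and integration
Elasticsearch and Solr can ingest data from many kinds of sources, including databases, file systems, and APIs. They have broad integration and plugin support and work with a wide range of external systems. CloudSearch, Sphinx, and Azure Search lean toward specific data sources and integration options.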

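Conclusion: it is important to choose the search solution that matches your requirements and scenario. If you need flexibility and have to handle complex queries and analytics, Elasticsearch and Solr are the first choices. If you want fast deployment without managing infrastructure, the managed services CloudSearch and Azure Search are more convenient. Sphinx is a lightweight, self-hosted option when speed and a small memory footprint matter most. Before choosing, weigh architecture, features, ecosystem, management, and integration together to find the solution that best meets your search and analytics needs.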

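3. A closer look at ES and Solr

Elasticsearch:

Pros:

Simple distributed cluster management: Elasticsearch provides easy-to-use cluster management, so a distributed environment is straightforward to scale and operate.
Powerful analytics and aggregations: Elasticsearch has rich aggregation features for complex data analysis and statistics.
Advanced search: with Elasticsearch's query syntax and filters, you can implement advanced full-text search and relevance ranking.
Data replication and redundancy: Elasticsearch replicates data automatically, which provides high availability and protects data against hardware or data center failures.
Large community and wide adoption: Elasticsearch has a large open-source community and is widely adopted and recognized.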

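Cons:

High memory consumption: Elasticsearch relies heavily on memory (the JVM heap plus the operating system's file system cache) to keep searches and queries fast, so its memory footprint is relatively large.
Index update latency: Elasticsearch is near real-time rather than real-time. Newly indexed or updated documents become searchable only after the next refresh (once per second by default), so it may not fit scenarios that require writes to be searchable immediately.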

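Solr:

Pros:

Mature full-text search: Solr is built on Lucene and provides complete full-text search and query features, with many search options and advanced capabilities.
Easy to customize and extend: Solr provides flexible configuration options and a plugin mechanism for customizing and extending search.
Large community and wide adoption: Solr has a large open-source community and is widely adopted.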

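Cons:

Relatively complex deployment and configuration: compared with Elasticsearch, deploying and configuring Solr requires more technical knowledge and experience.
Weaker real-time features: Solr's real-time search and index updates may fall somewhat short of Elasticsearch's, so it may not fit some real-time data processing scenarios.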

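III. A Quick Look at Elasticsearch Analyzers

This section only gives a few brief examples. For the full details, please refer to the official ES documentation.

Standard Analyzer:

Pros: suits general text processing and search scenarios; it splits text into terms and lowercases them.
Cons: not tuned for any specific language, and it takes no account of semantics or context.

Example: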

GET /pdf_data/_analyze
{
  "analyzer": "standard",
  "text": "The quick brown fox jumps over the lazy dog."
}
-- Tokenization result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "jumps",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}

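Simple Analyzer:

Pros: simple and fast; it splits the input text at every non-letter character and lowercases the tokens.
Cons: applies no other filtering or processing, and digits are discarded along with punctuation.

Example: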

GET /pdf_data/_analyze
{
  "analyzer": "simple",
  "text": "The quick brown fox jumps over the lazy dog."
}
-- Tokenization result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jumps",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "word",
      "position" : 8
    }
  ]
}

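Whitespace Analyzer:

Pros: simple and fast; it splits the text on whitespace characters.
Cons: does no other processing (no lowercasing, and punctuation stays attached to tokens), so it does not suit every scenario.

Example: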

GET /pdf_data/_analyze
{
  "analyzer": "whitespace",
  "text": "The quick brown fox jumps over the lazy dog."
}
-- Tokenization result
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jumps",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dog.",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}
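
The difference between the Simple and Whitespace analyzers can be approximated in a few lines of Python. This is only a sketch of their behavior on plain English text, not the Lucene implementation; the function names `whitespace_analyze` and `simple_analyze` are made up for this example.

```python
import re

def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()

def simple_analyze(text):
    """Lowercase, then keep only runs of letters (digits and punctuation are dropped)."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore, i.e. a letter.
    return re.findall(r"[^\W\d_]+", text.lower())

text = "The quick brown fox jumps over the lazy dog."
print(whitespace_analyze(text))
print(simple_analyze(text))
```

The whitespace version keeps "The" capitalized and leaves the period attached to "dog.", while the simple version yields lowercase "the" and a clean "dog".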

Stop Analyzer（停用词分析器）：
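- Pros: behaves like the Simple Analyzer (splits at non-letter characters and lowercases) and then removes stop words, which suits searches that should ignore very common words. The default stop word list is English.
- Cons: in some cases it filters out words that carry meaning in context.
- Example:
```
GET /your_index/_analyze
{
  "analyzer": "stop",
  "text": "The quick brown fox jumps over the lazy dog."
}

Result: ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
```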
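
The stop-word step can be sketched in Python as follows. The tokenization is a simplified stand-in (lowercase, keep runs of letters), and the stop word set below is written from memory of Lucene's default English list, so treat both as assumptions rather than the exact implementation.

```python
import re

# Assumption: intended to mirror Lucene's default English stop word set.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase, split into runs of letters, then drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in stop_words]

tokens = stop_analyze("The quick brown fox jumps over the lazy dog.")
print(tokens)
```

With this list, both occurrences of "the" are removed, while "over" is kept because it is not a stop word.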

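Keyword Analyzer:

Pros: treats the entire input as a single token, which suits scenarios where the input must be preserved exactly as is.
Cons: performs no splitting or other processing, so individual words inside the value cannot be searched; only the complete value matches.

Example: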

GET /pdf_data/_analyze
{
  "analyzer": "keyword",
  "text": "The quick brown fox jumps over the lazy dog."
}
-- Tokenization result
{
  "tokens" : [
    {
      "token" : "The quick brown fox jumps over the lazy dog.",
      "start_offset" : 0,
      "end_offset" : 44,
      "type" : "word",
      "position" : 0
    }
  ]
}

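Pattern Analyzer:

Pros: splits the input text with a regular expression, which the user can customize.
Cons: requires understanding and writing regular expressions; best suited to text that follows a specific pattern.

Example:

# The built-in pattern analyzer splits on \W+ (non-word characters) by default and lowercases the tokens
GET /pdf_data/_analyze
{
  "analyzer": "pattern",
  "text": "The quick brown fox jumps over the lazy dog."
}
-- Tokenization result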
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "jumps",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "the",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "word",
      "position" : 8
    }
  ]
}

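Language Analyzers:

Pros: tuned to the grammar and conventions of a specific language (for example stop words and stemming), which gives better text processing and search results.
Cons: only suited to the language they were designed for; not general-purpose.

Example: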

GET /your_index/_analyze
{
  "analyzer": "english",
  "text": "The quick brown fox jumps over the lazy dog."
}
-- Tokenization result
{
  "tokens" : [
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "jump",
      "start_offset" : 20,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "lazi",
      "start_offset" : 35,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 40,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}

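Edge N-gram Analyzer (a custom analyzer built on the edge_ngram tokenizer):

Pros: generates the prefixes of each word in the input, between min_gram and max_gram characters long, which suits prefix matching and autocomplete.
Cons: produces a large number of terms and takes up more storage space.

Example:

# Create an index with a custom analyzer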
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
# Run the analyzer
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
-- Tokenization result
{
  "tokens" : [
    {
      "token" : "Qu",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Qui",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Quic",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "Quick",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "Fo",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Fox",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "Foxe",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "Foxes",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    }
  ]
}
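
The way these prefixes are produced can be sketched in Python. This is an approximation of the edge_ngram tokenizer configured above (token_chars limited to letters and digits), not the Lucene implementation; the function name `edge_ngrams` is made up for this example.

```python
import re

def edge_ngrams(text, min_gram=2, max_gram=10):
    """Emit the prefixes of each word, from min_gram to max_gram characters long."""
    tokens = []
    # Words are runs of letters and digits; everything else acts as a separator.
    for word in re.findall(r"[^\W_]+", text):
        longest = min(max_gram, len(word))
        for length in range(min_gram, longest + 1):
            tokens.append(word[:length])
    return tokens

tokens = edge_ngrams("2 Quick Foxes.")
print(tokens)
```

The single character "2" is shorter than min_gram, so it produces no token, which is consistent with the result above. Case is preserved because the custom analyzer has no lowercase filter; adding one is common for autocomplete.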

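IK Analyzer:
- Pros:
  - Dictionary-based Chinese segmentation: IK Analyzer segments Chinese text using its word dictionaries and segmentation rules, and it recognizes the different components of the text, such as Chinese characters, letters, digits, and symbols, which gives reasonably accurate results.
  - Two segmentation modes: IK Analyzer provides a finest-grained mode (ik_max_word) and a coarse-grained mode (ik_smart). You can pick the mode that fits your needs and trade recall against precision and index size.
  - Custom dictionaries: IK Analyzer lets you configure custom dictionaries to add or change vocabulary. Domain terms, brand names, and similar words can be added to the dictionary to improve segmentation accuracy for your business scenario.
  - Stop word filtering: IK Analyzer supports stop word dictionaries, which can be extended, so common function words such as prepositions and conjunctions can be filtered out to reduce noise in search results.
  - Easy to integrate: IK Analyzer is an open-source analyzer with good extensibility. It is not built into Elasticsearch; it is installed as a plugin, after which it works as a Chinese analyzer like any other.
- Note: spelling correction and synonym expansion are not provided by IK itself. In Elasticsearch they are usually handled by other features, such as suggesters and the synonym token filter, which can be combined with IK.
- Cons: not suited to other languages; for English and other languages the results may be worse than those of an analyzer designed for that language.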
- IK segmentation modes:
  - Coarse-grained mode (ik_smart): produces the coarsest segmentation. The tokens do not overlap, and the analyzer resolves ambiguous segmentations by choosing one reading.
    ```
    GET /pdf_data/_analyze
    {
      "analyzer": "ik_smart",
      "text": "我们是共产主义接班人"
    }
    -- Tokenization result
    {
      "tokens" : [
        {
          "token" : "我们",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "是",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "CN_CHAR",
          "position" : 1
        },
        {
          "token" : "共产主义",
          "start_offset" : 3,
          "end_offset" : 7,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "接班人",
          "start_offset" : 7,
          "end_offset" : 10,
          "type" : "CN_WORD",
          "position" : 3
        }
      ]
    }
    ```
  - Finest-grained mode (ik_max_word): produces the finest segmentation. It emits every word that can be formed from the text, so the tokens may overlap.
    ```
    GET /pdf_data/_analyze
    {
      "analyzer": "ik_max_word",
      "text": "我们是共产主义接班人"
    }
    -- Tokenization result
    {
      "tokens" : [
        {
          "token" : "我们",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "是",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "CN_CHAR",
          "position" : 1
        },
        {
          "token" : "共产主义",
          "start_offset" : 3,
          "end_offset" : 7,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "共产",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "CN_WORD",
          "position" : 3
        },
        {
          "token" : "主义",
          "start_offset" : 5,
          "end_offset" : 7,
          "type" : "CN_WORD",
          "position" : 4
        },
        {
          "token" : "接班人",
          "start_offset" : 7,
          "end_offset" : 10,
          "type" : "CN_WORD",
          "position" : 5
        },
        {
          "token" : "接班",
          "start_offset" : 7,
          "end_offset" : 9,
          "type" : "CN_WORD",
          "position" : 6
        },
        {
          "token" : "人",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "CN_CHAR",
          "position" : 7
        }
      ]
    }
    ```

Note:

# View the mapping of an index, including the analyzer configured for each field
GET /pdf_data/_mapping

IV. Basic Elasticsearch Query Syntax

This section only shows a few simple ES queries as examples. Complex queries will be covered in a separate article.

# Index a test document (with default settings, the pdf_data index is created automatically if it does not exist)
POST /pdf_data/_doc?pretty
{
  "id": "3",
  "name": "面试题文件1.pdf",
  "age": 18,
  "type": "file",
  "money": 1111,
  "createBy": "阿杰",
  "createTime": "2022-11-03T10:41:51.851Z",
  "attachment": {
    "content": "面试官：如何保证消息不被重复消费啊？如何保证消费的时候是幂等的啊？Kafka、ActiveMQ、RabbitMQ、RocketMQ 都有什么区别，以及适合哪些场景？",
    "date": "2022-11-02T10:41:51.851Z",
    "language": "en"
  }
}

# Query without conditions: matches all documents
GET pdf_data/_search
{
}

# Simple query with a single condition
GET /pdf_data/_search
{
  "query": {
    "match": {
      "createBy": "阿杰"
    }
  }
}

# Simple query with a single condition: full-text search on the document content
GET /pdf_data/_search
{
  "query": {
    "match": {
      "attachment.content": "面试官：如何保证消息不被重复消费啊？如何保证消费的时候是幂等的啊？"
    }
  }
}

# Query with multiple conditions combined with AND (all "must" clauses have to match)
GET /pdf_data/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "18" } },
        { "match": { "attachment.content": "Kafka、ActiveMQ、RabbitMQ、RocketMQ 都有什么区别，以及适合哪些场景？" } }
      ]
    }
  }
}

# Range query
GET /pdf_data/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 20
      }
    }
  }
}

# Search with sorting
GET /pdf_data/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "money": { "order": "asc" } },
    { "age": { "order": "desc" } }
  ]
}

# Aggregation query: group documents by the value of a field
GET /pdf_data/_search
{
  "aggs": {
    "group_by_field": {
      "terms": {
        "field": "age",
        "size": 10
      }
    }
  }
}
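
To clarify what a terms aggregation computes, here is a small Python sketch that groups in-memory documents by a field and counts them. It only imitates the shape of the buckets (key and doc_count); the function name `terms_aggregation` and the sample documents are made up for this example, and a real cluster computes the counts per shard and merges them.

```python
from collections import Counter

def terms_aggregation(docs, field, size=10):
    """Group documents by a field value and return the top `size` buckets."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Largest buckets first; ties are broken by key in ascending order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]

docs = [
    {"name": "a.pdf", "age": 18},
    {"name": "b.pdf", "age": 18},
    {"name": "c.pdf", "age": 25},
    {"name": "d.pdf"},  # no "age" field, so it falls into no bucket
]
buckets = terms_aggregation(docs, "age", size=10)
print(buckets)
```

In the query above, `size` plays the same role: it limits how many buckets are returned, not how many documents are counted.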

This article took some effort to put together, so a like would be much appreciated!