Elasticsearch 全文检索分词检索-Elasticsearch文章四-Toy模板网

这篇具有很好参考价值的文章主要介绍了Elasticsearch 全文检索分词检索-Elasticsearch文章四。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

官方文档地址

https://www.elastic.co/guide/en/enterprise-search/current/start.html

refercence文档

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-match-query.html

全文搜索体系

Full text Query中，我们只需要把如下的那么多点分为3大类，你的体系能力会大大提升

Elasticsearch 全文检索分词检索-Elasticsearch文章四,AI大数据,分布式微服务,全文检索,elasticsearch,分词

很多api都可以查得到，我们只要大概知道有支持哪些功能

Elasticsearch 全文检索分词检索-Elasticsearch文章四,AI大数据,分布式微服务,全文检索,elasticsearch,分词

match

简单查询

GET visit_log/_search
{
  "query": { "match": {
    "serverHostName": "wei"
  }},
  "sort": [
    { "_id": "asc" }
  ],
  "from": 0,
  "size": 10
}

Elasticsearch 执行上面这个 match 查询的步骤是：
1. 检查字段类型。
标题 title 字段是一个 string 类型（ analyzed ）已分析的全文字段，这意味着查询字符串本身也应该被分析。
1. 分析查询字符串。
将查询的字符串 wei cui传入标准分析器中，输出的结果是单个项 wei。因为只有一个单词项，所以 match 查询执行的是单个底层 term 查询。
1. 查找匹配文档。
用 term 查询在倒排索引中查找 wei然后获取一组包含该项的文档，本例的结果是文档：1、2 和 3 。
1. 为每个文档评分。
用 term 查询计算每个文档相关度评分 _score ，这是种将词频（term frequency，即词 quick 在相关文档的 title 字段中出现的频率）和反向文档频率（inverse document frequency，即词 quick 在所有文档的 title 字段中出现的频率），以及字段的长度（即字段越短相关度越高）相结合的计算方式。

查询结果：

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "visit_log",
        "_type" : "_doc",
        "_id" : "nUL9rokBpGsmR0pP0VSc",
        "_score" : null,
        "_source" : {
          "_class" : "org.lwd.microservice.boot.es.entity.VisitLog",
          "id" : 7,
          "tableName" : "VisitLog",
          "userLoginId" : 3,
          "serverIpAddress" : "127.0.0.1",
          "serverHostName" : "liu wei",
          "initialRequest" : "http://localhost:8023",
          "msgContent" : "test es add7",
          "createTime" : 1690446876000
        },
        "sort" : [
          "nUL9rokBpGsmR0pP0VSc"
        ]
      }
    ]
  }
}

match 多词/分词

单字段分词

查询字段包含wei cui两个词

GET visit_log/_search
{
  "query": { "match": {
    "serverHostName": "wei cui"
  }},
  "sort": [
    { "_id": "asc" }
  ],
  "from": 0,
  "size": 10
}

结果：

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "visit_log",
        "_type" : "_doc",
        "_id" : "TEL9rokBpGsmR0pPXFMo",
        "_score" : null,
        "_source" : {
          "_class" : "org.lwd.microservice.boot.es.entity.VisitLog",
          "id" : 5,
          "tableName" : "VisitLog",
          "userLoginId" : 3,
          "serverIpAddress" : "127.0.0.1",
          "serverHostName" : "wang cui",
          "initialRequest" : "http://localhost:8023",
          "msgContent" : "test es add6",
          "createTime" : 1690446876000
        },
        "sort" : [
          "TEL9rokBpGsmR0pPXFMo"
        ]
      },
      {
        "_index" : "visit_log",
        "_type" : "_doc",
        "_id" : "nUL9rokBpGsmR0pP0VSc",
        "_score" : null,
        "_source" : {
          "_class" : "org.lwd.microservice.boot.es.entity.VisitLog",
          "id" : 7,
          "tableName" : "VisitLog",
          "userLoginId" : 3,
          "serverIpAddress" : "127.0.0.1",
          "serverHostName" : "liu wei",
          "initialRequest" : "http://localhost:8023",
          "msgContent" : "test es add7",
          "createTime" : 1690446876000
        },
        "sort" : [
          "nUL9rokBpGsmR0pP0VSc"
        ]
      }
    ]
  }
}

因为 match 查询必须查找两个词（ [“liu”,“wei”] ），它在内部实际上先执行两次 term 查询，然后将两次查询的结果合并作为最终结果输出。为了做到这点，它将两个 term 查询包入一个 bool 查询中，
所以上述查询的结果，和如下语句查询结果是等同的

GET /visit_log/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "serverHostName": "liu"
          }
        },
        {
          "term": {
            "serverHostName": "cui"
          }
        }
      ]
    }
  }
}

match多个词的逻辑

上面等同于should（任意一个满足），是因为 match还有一个operator参数，默认是or, 所以对应的是should。

GET /visit_log/_search
{
  "query": {
    "match": {
      "serverHostName": {
        "query": "wang cui",
        "operator": "or"
      }
    }
  }
}

多字段分词

GET /visit_log/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "serverHostName": "cui wei" }},
        { "match": { "msgContent": "add3 add4" }}
      ]
    }
  }
}

控制match的匹配精度

如果用户给定 3 个查询词，想查找至少包含其中 2 个的文档，该如何处理？将 operator 操作符参数设置成 and 或者 or 都是不合适的。
match 查询支持 minimum_should_match 最小匹配参数，这让我们可以指定必须匹配的词项数用来表示一个文档是否相关。我们可以将其设置为某个具体数字，更常用的做法是将其设置为一个百分数，因为我们无法控制用户搜索时输入的单词数量：

GET /visit_log/_search
{
  "query": {
    "match": {
      "serverHostName": {
        "query": "wang cui wangcui",
        "minimum_should_match": "75%"
      }
    }
  }
}

当然也等同于

GET /visit_log/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "serverHostName": "wang" }},
        { "match": { "serverHostName": "cui"   }},
        { "match": { "serverHostName": "wangcui"   }}
      ],
      "minimum_should_match": 2 
    }
  }
}

match_pharse_prefix分词前缀方式

那有没有可以查询出quick brown f的方式呢？ELasticSearch在match_phrase基础上提供了一种可以查最后一个词项是前缀的方法，这样就可以查询test es a了

GET /visit_log/_search
{
  "query": {
    "match_phrase_prefix": {
      "msgContent": {
        "query": "test es a"
      }
    }
  }
}

(ps: prefix的意思不是整个text的开始匹配，而是最后一个词项满足term的prefix查询而已)

match_bool_prefix

GET /visit_log/_search
{
  "query": {
    "match_bool_prefix": {
      "msgContent": {
        "query": "es test a"
      }
    }
  }
}

所以这样你就能理解，match_bool_prefix查询中的quick,brown,f是无序的。

multi_match多字段匹配

GET /visit_log/_search
{
  "query": {
    "multi_match" : {
      "query":    "add7 wang",
      "fields": [ "msgContent", "*HostName" ] 
    }
  }
}

结果：

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.7917595,
    "hits" : [
      {
        "_index" : "visit_log",
        "_type" : "_doc",
        "_id" : "nUL9rokBpGsmR0pP0VSc",
        "_score" : 1.7917595,
        "_source" : {
          "_class" : "org.lwd.microservice.boot.es.entity.VisitLog",
          "id" : 7,
          "tableName" : "VisitLog",
          "userLoginId" : 3,
          "serverIpAddress" : "127.0.0.1",
          "serverHostName" : "liu wei",
          "initialRequest" : "http://localhost:8023",
          "msgContent" : "test es add7",
          "createTime" : 1690446876000
        }
      },
      {
        "_index" : "visit_log",
        "_type" : "_doc",
        "_id" : "TEL9rokBpGsmR0pPXFMo",
        "_score" : 1.0800905,
        "_source" : {
          "_class" : "org.lwd.microservice.boot.es.entity.VisitLog",
          "id" : 5,
          "tableName" : "VisitLog",
          "userLoginId" : 3,
          "serverIpAddress" : "127.0.0.1",
          "serverHostName" : "wang cui",
          "initialRequest" : "http://localhost:8023",
          "msgContent" : "test es add6",
          "createTime" : 1690446876000
        }
      },
      {
        "_index" : "visit_log",
        "_type" : "_doc",
        "_id" : "6UL9rokBpGsmR0pPjVOS",
        "_score" : 1.0800905,
        "_source" : {
          "_class" : "org.lwd.microservice.boot.es.entity.VisitLog",
          "id" : 6,
          "tableName" : "VisitLog",
          "userLoginId" : 3,
          "serverIpAddress" : "127.0.0.1",
          "serverHostName" : "wang ting",
          "initialRequest" : "http://localhost:8023",
          "msgContent" : "test es add6",
          "createTime" : 1690446876000
        }
      }
    ]
  }
}

*表示前缀匹配字段。

query string类型

此查询使用语法根据运算符（例如AND或）来解析和拆分提供的查询字符串NOT。然后查询在返回匹配的文档之前独立分析每个拆分的文本。
可以使用该query_string查询创建一个复杂的搜索，其中包括通配符，跨多个字段的搜索等等。尽管用途广泛，但查询是严格的，如果查询字符串包含任何无效语法，则返回错误。
例如：


GET /visit_log/_search
{
  "query": {
    "query_string": {
      "query": "(wangcui) OR (add6)",
      "fields": [ "msgContent", "*HostName" ] 
    }
  }
}

Interval类型

Intervals是时间间隔的意思，本质上将多个规则按照顺序匹配。


GET /visit_log/_search
{
  "query": {
    "intervals" : {
      "msgContent" : {
        "all_of" : {
          "ordered" : true,
          "intervals" : [
            {
              "match" : {
                "query" : "liu",
                "max_gaps" : 0,
                "ordered" : true
              }
            },
            {
              "any_of" : {
                "intervals" : [
                  { "match" : { "query" : "es" } },
                  { "match" : { "query" : "add6" } }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

因为interval之间是可以组合的，所以它可以表现的很复杂

DSL查询之Term详解

自行查官方文档，有可能后边会出详解文章来源地址https://www.toymoban.com/news/detail-628732.html

聚合查询之Bucket聚合详解

自行查官方文档，有可能后边会出详解

聚合查询之Metric聚合详解

自行查官方文档，有可能后边会出详解

聚合查询之Pipline聚合详解

自行查官方文档，有可能后边会出详解

其他

外传

😜 原创不易，如若本文能够帮助到您的同学
🎉 支持我：关注我+点赞👍+收藏⭐️
📝 留言：探讨问题，看到立马回复
💬 格言：己所不欲勿施于人 扬帆起航、游历人生、永不言弃！🔥

到了这里，关于Elasticsearch 全文检索分词检索-Elasticsearch文章四的文章就介绍完了。如果您还想了解更多内容，请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章，希望大家以后多多支持TOY模板网！

Elasticsearch 全文检索分词检索-Elasticsearch文章四

官方文档地址

refercence文档

全文搜索体系

match

简单查询

match 多词/分词

单字段分词

match多个词的逻辑

控制match的匹配精度

match_pharse_prefix分词前缀方式

match_bool_prefix

multi_match多字段匹配

query string类型

Interval类型

DSL查询之Term详解

聚合查询之Bucket聚合详解

聚合查询之Metric聚合详解

聚合查询之Pipline聚合详解

其他

外传

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏

支付宝扫一扫领取红包，优惠每天领

二维码1

二维码2

Elasticsearch 全文检索 分词检索-Elasticsearch文章四

官方文档地址

refercence文档

全文搜索体系

match

简单查询

match 多词/分词

单字段分词

match多个词的逻辑

控制match的匹配精度

match_pharse_prefix分词前缀方式

match_bool_prefix

multi_match多字段匹配

query string类型

Interval类型

DSL查询之Term详解

聚合查询之Bucket聚合详解

聚合查询之Metric聚合详解

聚合查询之Pipline聚合详解

其他

外传

相关文章

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏

支付宝扫一扫领取红包，优惠每天领

二维码1

二维码2

Elasticsearch 全文检索分词检索-Elasticsearch文章四