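What Is Full-Text Search
Unlike term-level queries, full-text queries search and match text data based on relevance. They analyze the input text, split it into terms (individual words), and apply operations such as tokenization, stemming, and normalization.
Key characteristics of full-text search:
- The input text is analyzed, and searching and matching are performed on the resulting terms.
- Searching and matching are relevance-based. A full-text query uses a relevance algorithm to determine how well each document matches the query and sorts the results by relevance. Relevance can be computed from term frequency, weights, and other factors.
- Full-text queries suit fields that hold free text, such as document content, article bodies, or product descriptions.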
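The characteristics above rest on an inverted index: analysis turns text into terms, and each term points to the documents that contain it. The following is a minimal sketch in plain Java, not Elasticsearch code. The class name InvertedIndexSketch and the hard-coded term lists are assumptions that stand in for real analyzer output.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical toy index: illustrates the idea only, not how Elasticsearch stores data.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index a document whose text has already been split into terms.
    void add(int docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of the documents that contain the given term.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Term lists are assumed analyzer output.
        index.add(1, Arrays.asList("北京", "故宫", "圆明园"));
        index.add(2, Arrays.asList("南京", "总统府", "总统", "府"));
        System.out.println(index.lookup("南京"));
        System.out.println(index.lookup("上海"));
    }
}
```

A query for a term that was never indexed simply returns an empty set, which is why a full-text query only finds documents that share at least one analyzed term with the search text.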
1. Data Preparation
PUT full_index
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1
},
"mappings": {
"properties": {
"name": {
"type": "text"
},
"age": {
"type": "long"
},
"description" : {
"type" : "text",
"analyzer": "ik_max_word",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
The test data is as follows:
{name=张三, description=北京故宫圆明园, age=11}
{name=王五, description=南京总统府, age=15}
{name=李四, description=北京市天安门广场, age=18}
{name=富贵, description=南京市中山陵, age=22}
{name=来福, description=山东济南趵突泉, age=8}
{name=憨憨, description=安徽黄山九华山, age=27}
{name=小七, description=上海东方明珠, age=31}
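2. match query
Match query: match first analyzes the search text and then looks up each resulting term.
match supports the following parameters:
- query : the text to match
- operator : how the analyzed terms are combined
  - and : every term must match
  - or : at least one term must match (default)
- minimum_should_match : the minimum number (or percentage) of analyzed terms that a document must match
DSL: search for documents whose description field matches “南京总统府”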
GET full_index/_search
{
"query": {
"match": {
"description": "南京总统府"
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.2667978,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.2667978,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京总统府"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.0751815,
"_source" : {
"name" : "富贵",
"age" : 22,
"description" : "南京市中山陵"
}
}
]
}
}
Spring Boot implementation:
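// Imports assume Spring Boot 2.x (javax.annotation), Swagger 2 annotations,
// and the Elasticsearch 7.x high-level REST client.
import io.swagger.annotations.ApiOperation;
import javax.annotation.Resource;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;

// The members below belong to the FullTextQuery controller class.
private final static Logger LOGGER = LoggerFactory.getLogger(FullTextQuery.class);
private static final String INDEX_NAME = "full_index";
@Resource
private RestHighLevelClient client;
@RequestMapping(value = "/match_query", method = RequestMethod.GET)
@ApiOperation(value = "DSL - match_query")
public void match_query() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // match query on the description field
    searchRequest.source(new SearchSourceBuilder().query(QueryBuilders.matchQuery("description","南京总统府")));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}
// Print the number of hits and the source of each hit
private void printLog(SearchResponse searchResponse) {
    SearchHits hits = searchResponse.getHits();
    System.out.println("返回hits数组长度:" + hits.getHits().length);
    for (SearchHit hit: hits.getHits()) {
        System.out.println(hit.getSourceAsMap().toString());
    }
}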
The output is as follows:
返回hits数组长度:2
{name=王五, description=南京总统府, age=15}
{name=富贵, description=南京市中山陵, age=22}
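Analysis: searching for “南京总统府” returned two documents. Why was “南京市中山陵” matched as well?
Because a full-text query splits the search text into terms. The description field was mapped with the “ik_max_word” analyzer when the index was created, and that analyzer splits “南京总统府” into the following terms, which are then looked up in the inverted index: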
POST _analyze
{
"analyzer": "ik_max_word",
"text": ["南京总统府"]
}
{
"tokens" : [
{
"token" : "南京",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "总统府",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "总统",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "府",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_CHAR",
"position" : 3
}
]
}
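One of these terms is "南京", so searching for “南京总统府” also finds the “南京市中山陵” document. The operator parameter of the match query follows from this: with and, a document must contain every term produced by the analysis.
To see this, insert one more document: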
POST /full_index/_bulk
{"index":{"_id":8}}
{"name":"张三","age":11,"description":"南京总统"}
Searching for "南京总统" with operator set to and returns two documents:
GET full_index/_search
{
"query": {
"match": {
"description": {
"query": "南京总统",
"operator": "and"
}
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.898355,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.898355,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "南京总统"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.35562,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京总统府"
}
}
]
}
}
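Searching for "南京总统府" with the same operator, however, returns only one document: the analysis produces the term "府", which the document “南京总统” does not contain.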
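The difference between or, and, and minimum_should_match can be modelled as counting how many query terms a document contains. The sketch below is plain Java, not the Elasticsearch API. The class MatchOperatorSketch is hypothetical, the query terms are the ik_max_word output for “南京总统府” shown earlier, and the term sets for “南京市中山陵” and “南京总统” are assumptions about what the analyzer would produce.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of how a match query combines analyzed terms.
class MatchOperatorSketch {
    // Count how many of the query terms occur in the document's term set.
    static int matchedTerms(Set<String> docTerms, List<String> queryTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // minimum_should_match given as an absolute number of terms.
    static boolean matches(Set<String> docTerms, List<String> queryTerms, int minimumShouldMatch) {
        return matchedTerms(docTerms, queryTerms) >= minimumShouldMatch;
    }

    // operator = or: one matching term is enough.
    static boolean matchesOr(Set<String> docTerms, List<String> queryTerms) {
        return matches(docTerms, queryTerms, 1);
    }

    // operator = and: every query term must be present.
    static boolean matchesAnd(Set<String> docTerms, List<String> queryTerms) {
        return matches(docTerms, queryTerms, queryTerms.size());
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("南京", "总统府", "总统", "府");
        Set<String> doc2 = new HashSet<>(query);                                  // 南京总统府
        Set<String> doc4 = new HashSet<>(Arrays.asList("南京市", "南京", "中山陵")); // assumed terms
        Set<String> doc8 = new HashSet<>(Arrays.asList("南京", "总统"));            // assumed terms
        System.out.println("or:  " + matchesOr(doc2, query) + " " + matchesOr(doc4, query) + " " + matchesOr(doc8, query));
        System.out.println("and: " + matchesAnd(doc2, query) + " " + matchesAnd(doc4, query) + " " + matchesAnd(doc8, query));
    }
}
```

Under these assumptions all three documents pass with or, while only the “南京总统府” document passes with and, which agrees with the responses above.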
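3. multi_match query
Multi-field query: runs the same query text against several fields, and the highest-scoring documents come first. Whether the query text is analyzed depends on the type of each field. Note: for an analyzed field, the query text is analyzed and the resulting terms are searched; for a non-analyzed field, the query text is searched as a whole.
DSL: search for documents whose “name” or “description” field matches “北京王五”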
GET full_index/_search
{
"query": {
"multi_match": {
"query": "北京王五",
"fields": ["name","description"]
}
}
}
The response is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 3.583519,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.583519,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京总统府"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.4959542,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "北京故宫圆明园"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.98645234,
"_source" : {
"name" : "李四",
"age" : 18,
"description" : "北京市天安门广场"
}
}
]
}
}
Spring Boot implementation:
@RequestMapping(value = "/multi_match", method = RequestMethod.GET)
@ApiOperation(value = "DSL - multi_match")
public void multi_match() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // multi_match query on the name and description fields
    searchRequest.source(new SearchSourceBuilder().query(
        QueryBuilders.multiMatchQuery("北京王五", new String[]{"name","description"})));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}
The output is as follows:
返回hits数组长度:3
{name=王五, description=南京总统府, age=15}
{name=张三, description=北京故宫圆明园, age=11}
{name=李四, description=北京市天安门广场, age=18}
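As noted earlier, for an analyzed field the query text is analyzed before searching, while for a non-analyzed field the query text is searched as a whole.
To test this, run the query against the non-analyzed keyword sub-field of “description”: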
GET full_index/_search
{
"query": {
"multi_match": {
"query": "北京王五",
"fields": ["name","description.keyword"]
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.583519,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.583519,
"_source" : {
"name" : "王五",
"age" : 15,
"description" : "南京总统府"
}
}
]
}
}
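With “description.keyword”, that is, without analyzing the query text for “description”, only one document is returned. It matches solely because its “name” field, “王五”, matches terms produced by analyzing the query text.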
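The two behaviours can be contrasted with a small plain-Java model, which is an illustration rather than Elasticsearch code. The class MultiMatchSketch is hypothetical. It assumes that the “name” field uses the default standard analyzer (one term per Chinese character) and that ik_max_word produces the term “北京” for the query text; both are assumptions, not output shown in this article.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model: analyzed text field versus non-analyzed keyword field.
class MultiMatchSketch {
    // Analyzed field: a hit if any query term is among the field's indexed terms.
    static boolean matchesAnalyzed(Set<String> fieldTerms, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (fieldTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // keyword field: the whole query string is compared with the whole stored value.
    static boolean matchesKeyword(String storedValue, String query) {
        return storedValue.equals(query);
    }

    public static void main(String[] args) {
        String query = "北京王五";
        List<String> nameQueryTerms = Arrays.asList("北", "京", "王", "五"); // assumed standard analyzer output
        List<String> descQueryTerms = Arrays.asList("北京", "王五");         // assumed ik_max_word output

        // Document 1: name=张三, description=北京故宫圆明园
        Set<String> nameTerms = new HashSet<>(Arrays.asList("张", "三"));
        Set<String> descTerms = new HashSet<>(Arrays.asList("北京", "故宫", "圆明园"));

        System.out.println(matchesAnalyzed(nameTerms, nameQueryTerms));  // name field
        System.out.println(matchesAnalyzed(descTerms, descQueryTerms));  // description (analyzed)
        System.out.println(matchesKeyword("北京故宫圆明园", query));       // description.keyword
    }
}
```

In this model document 1 is found only through the analyzed “description” field, so it drops out of the result once the query targets “description.keyword”.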
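4. match_phrase query
A phrase search (match_phrase) analyzes the search text, then looks up each resulting term in the index and requires the terms to appear next to each other. The slop parameter sets the maximum distance allowed between the terms.
DSL: search for documents whose “description” field contains the phrase “北京故宫”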
GET full_index/_search
{
"query": {
"match_phrase": {
"description": {
"query": "北京故宫"
}
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 3.5884824,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.5884824,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "北京故宫圆明园"
}
}
]
}
}
Spring Boot implementation:
@RequestMapping(value = "/match_phrase", method = RequestMethod.GET)
@ApiOperation(value = "DSL - match_phrase")
public void match_phrase() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // match_phrase query on the description field
    searchRequest.source(new SearchSourceBuilder().query(
        QueryBuilders.matchPhraseQuery("description","北京故宫")));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}
The output is as follows:
返回hits数组长度:1
{name=张三, description=北京故宫圆明园, age=11}
Question: the search for “北京故宫” in the “description” field returns a document, so why does a search for “北京圆明园” return nothing?
GET full_index/_search
{
"query": {
"match_phrase": {
"description": {
"query": "北京圆明园"
}
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Cause: first look at how “北京故宫圆明园” is analyzed:
POST _analyze
{
"analyzer": "ik_max_word",
"text": ["北京故宫圆明园"]
}
{
"tokens" : [
{
"token" : "北京",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "故宫",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "圆明园",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
}
]
}
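“北京” and “圆明园” are not adjacent terms; one term sits between them. This is where “slop” comes in:
the slop parameter tells the match_phrase query how far apart the terms may be while the document still counts as a match.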
GET full_index/_search
{
"query": {
"match_phrase": {
"description": {
"query": "北京圆明园",
"slop": 1
}
}
}
}
The result is as follows:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.4425511,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.4425511,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "北京故宫圆明园"
}
}
]
}
}
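The effect of slop can be sketched with token positions. The plain-Java model below is a simplification and not Lucene's implementation: each term's document position is shifted by its position in the query, and the phrase matches when the spread of the shifted positions is at most slop. The class PhraseSlopSketch is hypothetical, and it assumes that the query “北京圆明园” is analyzed into 北京 (position 0) and 圆明园 (position 1).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of match_phrase with slop.
class PhraseSlopSketch {
    // docPositions maps each term to its positions in the document;
    // the i-th entry of queryTerms is assumed to sit at query position i.
    static boolean matchesPhrase(Map<String, List<Integer>> docPositions,
                                 List<String> queryTerms, int slop) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        return search(docPositions, queryTerms, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop);
    }

    // Try every combination of positions; track min and max of (docPosition - queryPosition).
    private static boolean search(Map<String, List<Integer>> docPositions, List<String> queryTerms,
                                  int i, int min, int max, int slop) {
        if (i == queryTerms.size()) {
            return max - min <= slop;
        }
        List<Integer> positions = docPositions.get(queryTerms.get(i));
        if (positions == null) {
            return false; // a query term is missing from the document
        }
        for (int position : positions) {
            int shifted = position - i;
            if (search(docPositions, queryTerms, i + 1,
                       Math.min(min, shifted), Math.max(max, shifted), slop)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Positions taken from the _analyze output for 北京故宫圆明园.
        Map<String, List<Integer>> doc = new HashMap<>();
        doc.put("北京", Arrays.asList(0));
        doc.put("故宫", Arrays.asList(1));
        doc.put("圆明园", Arrays.asList(2));

        System.out.println(matchesPhrase(doc, Arrays.asList("北京", "故宫"), 0));
        System.out.println(matchesPhrase(doc, Arrays.asList("北京", "圆明园"), 0));
        System.out.println(matchesPhrase(doc, Arrays.asList("北京", "圆明园"), 1));
    }
}
```

With slop 0 the shifted positions must coincide, which is the adjacency requirement; allowing slop 1 tolerates the single term between 北京 and 圆明园.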
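5. query_string query
Lets us specify AND | OR | NOT conditions in a single query string and, like the multi_match query, supports searching multiple fields. It is similar to match, but match requires a field name, whereas query_string by default searches all fields when none are specified, so its scope is wider. Note: for an analyzed field the query text is analyzed before searching; for a non-analyzed field it is searched without analysis.
DSL: search all fields of the index for documents matching “安徽张三”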
GET full_index/_search
{
"query": {
"query_string": {
"query": "安徽张三"
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 2.5618675,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.5618675,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "北京故宫圆明园"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.5618675,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "南京总统"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.7342355,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黄山九华山"
}
}
]
}
}
Spring Boot implementation:
@RequestMapping(value = "/query_string", method = RequestMethod.GET)
@ApiOperation(value = "DSL - query_string")
public void query_string() throws Exception {
    // Build the search request for the index
    SearchRequest searchRequest = new SearchRequest(INDEX_NAME);
    // query_string query without an explicit field list
    searchRequest.source(new SearchSourceBuilder().query(
        QueryBuilders.queryStringQuery("安徽张三")));
    // Print the response
    printLog(client.search(searchRequest, RequestOptions.DEFAULT));
}
返回hits数组长度:3
{name=张三, description=北京故宫圆明园, age=11}
{name=张三, description=南京总统, age=11}
{name=憨憨, description=安徽黄山九华山, age=27}
Query a specific field: documents whose “description” field matches “安徽张三”
GET full_index/_search
{
"query": {
"query_string": {
"query": "安徽张三",
"fields": ["description"]
}
}
}
The response is as follows:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.7342355,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.7342355,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黄山九华山"
}
}
]
}
}
Query multiple fields: documents that match both “安徽” and “憨憨”
GET full_index/_search
{
"query": {
"query_string": {
"query": "安徽 AND 憨憨",
"fields": ["description","name"]
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 6.6615744,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 6.6615744,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黄山九华山"
}
}
]
}
}
Conditions can also be grouped with parentheses: documents that match both “安徽” and “憨憨”, or that match “张三”
GET full_index/_search
{
"query": {
"query_string": {
"query": "(安徽 AND 憨憨)OR 张三",
"fields": ["description","name"]
}
}
}
The response is as follows:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 6.6615744,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 6.6615744,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黄山九华山"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.5618675,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "北京故宫圆明园"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.5618675,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "南京总统"
}
}
]
}
}
The query_string query works like a combination of the match query and the multi_match multi-field query.
6. simple_query_string
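Similar to query_string, but it ignores invalid syntax and supports only part of the query syntax. AND, OR, and NOT are not supported as operators and are treated as ordinary text. The following operators are supported instead:
- “+” replaces “AND”
- “|” replaces “OR”
- “-” replaces “NOT”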
GET full_index/_search
{
"query": {
"simple_query_string": {
"query": "(安徽 + 憨憨) | 张三",
"fields": ["description","name"]
}
}
}
The result is as follows:
{
"took" : 41,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 6.6615744,
"hits" : [
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 6.6615744,
"_source" : {
"name" : "憨憨",
"age" : 27,
"description" : "安徽黄山九华山"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.5618675,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "北京故宫圆明园"
}
},
{
"_index" : "full_index",
"_type" : "_doc",
"_id" : "8",
"_score" : 2.5618675,
"_source" : {
"name" : "张三",
"age" : 11,
"description" : "南京总统"
}
}
]
}
}
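Both “(安徽 AND 憨憨) OR 张三” and “(安徽 + 憨憨) | 张三” express the same boolean logic over the fields “description” and “name”. The following plain-Java sketch models that logic with predicates; it is not Elasticsearch code. The class BooleanQuerySketch is hypothetical, a substring check stands in for analysis and index lookup, and the ids 5 and 7 are assumed because the article never shows them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of AND / OR / NOT over two fields.
class BooleanQuerySketch {
    static final class Doc {
        final int id;
        final String name;
        final String description;

        Doc(int id, String name, String description) {
            this.id = id;
            this.name = name;
            this.description = description;
        }
    }

    // One term clause over the fields ["description", "name"]:
    // true if either field contains the term (substring check as a stand-in for analysis).
    static Predicate<Doc> term(String text) {
        return doc -> doc.description.contains(text) || doc.name.contains(text);
    }

    // Return the ids of all documents accepted by the query, in input order.
    static List<Integer> search(List<Doc> docs, Predicate<Doc> query) {
        List<Integer> ids = new ArrayList<>();
        for (Doc doc : docs) {
            if (query.test(doc)) {
                ids.add(doc.id);
            }
        }
        return ids;
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc(1, "张三", "北京故宫圆明园"),
            new Doc(2, "王五", "南京总统府"),
            new Doc(3, "李四", "北京市天安门广场"),
            new Doc(4, "富贵", "南京市中山陵"),
            new Doc(5, "来福", "山东济南趵突泉"),   // id assumed
            new Doc(6, "憨憨", "安徽黄山九华山"),
            new Doc(7, "小七", "上海东方明珠"),     // id assumed
            new Doc(8, "张三", "南京总统"));
    }

    public static void main(String[] args) {
        // (安徽 AND 憨憨) OR 张三
        Predicate<Doc> query = term("安徽").and(term("憨憨")).or(term("张三"));
        System.out.println(search(sampleDocs(), query)); // prints [1, 6, 8]
    }
}
```

A NOT clause (the “-” operator) would map to Predicate.negate() in this model. Scoring is left out, so the ids come back in input order rather than by relevance.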