Background
Sometimes ES needs to fuzzy-match across several Chinese-related fields. You can combine those fields into a single logical field and run the fuzzy match against that one field instead.
Related information
This requires two pieces of configuration:
1. copy_to (merges several fields into one). Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/copy-to.html
2. ngram (this tokenizer is very effective for searches over pure Chinese or mixed Chinese/English text; it blindly splits the text into overlapping runs of a few characters). Official docs: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-ngram-tokenizer.html
In the configuration below, the first_name and last_name fields are copied into a single full_name field.
Note ⚠️: could the following happen? The user's input hits the middle of the merged value — say field A holds abc and field B holds def; after merging A and B, would a fuzzy ngram match hit the value cd? With copy_to you don't need to worry about this, because each source value is analyzed separately. But if the merged field is produced by string concatenation (for example during binlog synchronization), the situation in parentheses does occur.
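The difference can be sketched outside of ES. A minimal Python simulation (with a hypothetical `bigrams` helper standing in for a 2-gram tokenizer) shows that concatenating the two values creates a cross-boundary token cd, while copy_to analyzes each source value on its own:

```python
def bigrams(text):
    # Emit every 2-character substring, mimicking an ngram window of length 2.
    return {text[i:i + 2] for i in range(len(text) - 1)}

# Concatenation (e.g. a field built during binlog sync): "abc" + "def" = "abcdef"
concat_tokens = bigrams("abc" + "def")

# copy_to: each source value is analyzed independently; the resulting
# tokens are indexed into full_name without ever joining the strings.
copy_to_tokens = bigrams("abc") | bigrams("def")

print("cd" in concat_tokens)   # True  - a token spans the field boundary
print("cd" in copy_to_tokens)  # False - no cross-boundary token exists
```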
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name"
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name"
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}
PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}
GET my_index/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}
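The "operator": "and" setting means every analyzed query term must appear in the full_name field for a document to match. A rough Python sketch of that rule (simplified — real matching goes through Lucene's inverted index and scoring):

```python
def match_with_and(doc_tokens, query_tokens):
    # "operator": "and": the document matches only if every
    # analyzed query term is present in the field's token set.
    return all(term in doc_tokens for term in query_tokens)

# full_name holds the tokens copied in from first_name and last_name.
full_name_tokens = {"john", "smith"}

print(match_with_and(full_name_tokens, ["john", "smith"]))  # True
print(match_with_and(full_name_tokens, ["john", "doe"]))    # False - "doe" missing
```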
Try the default ngram tokenizer without any extra configuration. It records an offset for every token, which makes it much easier to hit the data:
GET _analyze
{
  "tokenizer": "ngram",
  "text": "我的测试"
}
# Resulting tokens
{
  "tokens": [
    {
      "token": "我",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "我的",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "的",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "的测",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "测",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "测试",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 5
    },
    {
      "token": "试",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 6
    }
  ]
}
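The window behavior behind that output can be reproduced with a short Python sketch of the default settings (min_gram=1, max_gram=2); the tokens come out in the same order as the _analyze response above:

```python
def ngram_tokens(text, min_gram=1, max_gram=2):
    # Slide a window of every allowed length across the text,
    # ordered by start offset then by length - the same order ES emits.
    tokens = []
    for start in range(len(text)):
        for length in range(min_gram, max_gram + 1):
            if start + length <= len(text):
                tokens.append(text[start:start + length])
    return tokens

print(ngram_tokens("我的测试"))
# ['我', '我的', '的', '的测', '测', '测试', '试']
```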
Configuring ngram
min_gram and max_gram govern tokenization efficiency and behave like a window sliding back and forth over the field's characters (the configuration below is a fixed window of 3 characters). The longer the window, the more specific the matches; the shorter the window, the lower the match quality.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}
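What that request returns can be approximated locally. With token_chars set to letter and digit, the text is first split on any other character, and a segment shorter than min_gram yields no tokens at all — so the lone "2" disappears from the output. A Python sketch (the regex is a simplification; the real letter class covers all Unicode letters):

```python
import re

def fixed_trigrams(text, gram=3):
    tokens = []
    # token_chars [letter, digit]: split on anything that is neither.
    for segment in re.findall(r"[A-Za-z0-9]+", text):
        # Segments shorter than min_gram produce no tokens.
        for start in range(len(segment) - gram + 1):
            tokens.append(segment[start:start + gram])
    return tokens

print(fixed_trigrams("2 Quick Foxes."))
# ['Qui', 'uic', 'ick', 'Fox', 'oxe', 'xes']
```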
A Java pseudocode example for the query side:
// Paging: convert a 1-based page number into an offset
int from = (pageNum - 1) * pageSize;
SearchSourceBuilder builder = new SearchSourceBuilder();
BoolQueryBuilder rootQuery = QueryBuilders.boolQuery();
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
// Match against the combined full_name field; AND requires every analyzed term to hit
boolQueryBuilder.must(QueryBuilders.matchQuery("full_name", "text to fuzzy-match").operator(Operator.AND));
rootQuery.filter(boolQueryBuilder);
builder.query(rootQuery); // SearchSourceBuilder exposes query(), not setQuery()
builder.from(from);
builder.size(pageSize);
SearchRequest searchRequest = new SearchRequest("indexName");
searchRequest.source(builder);
try {
    return client.search(searchRequest);
} catch (IOException e) {
    throw new BaseException("ES connection error: " + e.getMessage());
} catch (ElasticsearchException e) {
    throw new BaseException("ES query error: " + e.getMessage());
}
That wraps up this introduction to more efficient multi-field fuzzy matching with ES.