Tokenizing a Piece of Chinese Text with Elasticsearch (ES)

This article describes how to use ES to tokenize (analyze) a piece of Chinese text. Hopefully it is useful; if anything is wrong or incomplete, feedback is welcome.

The connection to ES uses org.elasticsearch.client.RestHighLevelClient. The tokens are obtained by calling the _analyze API through its low-level client with the ik_max_word analyzer (which requires the IK analysis plugin to be installed on the cluster). The code is as follows:


import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;

@Service
public class BaseDataService {
    protected Logger logger = LoggerFactory.getLogger(this.getClass());

    @Autowired
    private RestHighLevelClient restHighLevelClient;

    /**
     * Tokenize the given text with the ik_max_word analyzer.
     *
     * @param text the text to analyze
     * @return the list of tokens produced by the analyzer
     * @throws Exception if the call to ES fails
     */
    public List<String> getAnalyze(String text) throws Exception {
        List<String> list = new ArrayList<String>();
        // Call the _analyze endpoint through the low-level REST client
        Request request = new Request("GET", "/_analyze");
        JSONObject entity = new JSONObject();
        entity.put("analyzer", "ik_max_word");
        entity.put("text", text);
        request.setJsonEntity(entity.toJSONString());
        Response response = restHighLevelClient.getLowLevelClient().performRequest(request);
        // The response body looks like {"tokens":[{"token":"...","start_offset":...}, ...]};
        // collect only the "token" values
        JSONObject tokens = JSONObject.parseObject(EntityUtils.toString(response.getEntity()));
        JSONArray arrays = tokens.getJSONArray("tokens");
        for (int i = 0; i < arrays.size(); i++) {
            JSONObject obj = JSON.parseObject(arrays.getString(i));
            list.add(obj.getString("token"));
        }
        return list;
    }

}
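The injected RestHighLevelClient bean is not defined in the article. Below is a minimal sketch of a possible configuration, assuming a single ES node reachable at localhost:9200; the class name EsClientConfig, host, and port are illustrative assumptions, not part of the original code.

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EsClientConfig {

    /**
     * Builds the high-level client whose low-level client is reused in
     * BaseDataService#getAnalyze(). Host and port are assumptions; adjust
     * them to your cluster.
     */
    @Bean(destroyMethod = "close")
    public RestHighLevelClient restHighLevelClient() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
    }
}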

The unit test code is as follows (baseDataService is the autowired service shown above):

    @Test
    public void getAnalyze() throws Exception {
        String text = "点击上方蓝字关注我们!全体教职员工、家长朋友们:你们好!快乐而充实的暑期生活即将结束,新学期的各项工作即将开启。鉴于目前国内、省内严峻复杂的疫情形势,为进一步做好幼儿园疫情防控工作,为秋季开学创造良好条件,确保返园后正常的教育教学秩序,现温馨提示如下:一、做好返安准备。广大教职员工及幼儿根据开学时间以及疫情形势变化,预留足够时间,至少提前7天返安或返回居住地(即:全体教师于2022年8月20日零时前返安;全体幼儿于2022年8月24日零时前返安),并严格落实属地(单位报备、社区报备)健康管理要求。二、做好健康监测。建议从外地返安的教职工、幼儿及家长自觉进行3天2次核酸检测(至少间隔24小时),并做好7天自我健康监测。前3天原则上“两点一线”,少聚集、少聚会。时刻关注自己和家人的身体状况,如出现发热、干咳、乏力、嗅(味)觉减退、鼻塞、流涕、咽痛、结膜炎、肌痛和腹泻等症状,及时到附近的发热门诊进行排查和诊疗,就医过程尽量避免乘坐公共交通工具。三、做好重点防控。近7日内有中、高风险区旅居或与相关人员有密切接触的教师、幼儿,返安前 48 小时向目的地社区报备,在抵安后12小时内向目的地社区和幼儿园报告,并配合做好信息登记、核酸检测、集中隔离或居家健康监测等管控措施。四、做好健康登记。如实填写《汉滨区铁路幼儿园疫情防控返园承诺书及返园前健康监测登记表》,并在开学当天上交纸质版给班级教师。(电子表格已发至班级群)新学期开学在即,让我们一起做好返园前各项防控工作,确保全体教职工及幼儿安全返园。祝大家身体健康!暑假愉快!汉滨区铁路幼儿园2022年8月19日扫码关注分享给第一个想到的人";
        List<String> result = baseDataService.getAnalyze(text);
        System.out.println(JsonMapper.toJson(result));
    }

Execution result:

["点击","上方","蓝字","关注","我们","全体","教职员工","教职员","教职","职员","员工","家长","朋友们","朋友","们","你们","好","快乐","而","充实","的","暑期","生活","即将","结束","新学期","新学","学期","的","各项工作","各项","工作","即将","开启","鉴于","目前国内","目前","国内","省内","严峻","复杂","的","疫情","情形","形势","为","进一步","进一","一步","一","步","做好","幼儿园","幼儿","园","疫情","防","控","工作","为","秋季","开学","创造","良好条件","良好","条件","确保","返","园","后","正常","的","教育","教学秩序","教学","秩序","现","温馨","提示","如下","一","做好","返","安","准备","广大","教职员工","教职员","教职","职员","员工","及","幼儿","根据","开学","学时","时间","以及","疫情","情形","形势","变化","预留","留足","足够","时间","至少","少提","提前","7","天","返","安","或","返回","居住地","居住","住地","即","全体","教师","于","2022","年","8","月","20","日","零时","零","时","前","返","安","全体","幼儿","于","2022","年","8","月","24","日","零时","零","时","前","返","安","并","严格","落实","实属","属地","单位","报备","社区","报备","健康","管理","要求","二","做好","健康","监测","建议","从","外地","返","安","的","教职工","教职","职工","幼儿","及","家长","自觉","进行","3","天","2","次","核酸","检测","至少","少间","间隔","24","小时","时","并","做好","7","天","自我","健康","监测","前","3","天","原则上","原则","上","两点","两","点","一线","一","线","少","聚集","少","聚会","时刻","关注","自己","和家人","家人","的","身体状况","身体","状况","如","出现","发热","干咳","乏力","嗅","味","觉","减退","鼻塞","流涕","咽","痛","结膜炎","结膜","膜炎","肌","痛","和","腹泻","等","症状","及时","到","附近","的","发热","热门","门诊","进行","排查","和","诊疗","就医","过程","尽量","避免","乘坐","公共交通","公共","交通工具","交通","工具","三","做好","重点","防","控","近","7","日内","日","内有","中","高风险","高风","风险","险区","旅居","或与","相关","关人","人员","有","密切接触","密切","接触","的","教师","幼儿","返","安","前","48","小时","向","目的地","目的","地","社区","报备","在","抵","安","后","12","小时内","小时","时","内向","目的地","目的","地","社区","和","幼儿园","幼儿","园","报告","并","配合","合做","做好","信息","登记","核酸","检测","集中","中隔","隔离","或","居家","健康","监测","等","管","控","措施","四","做好","健康","登记","如实","填写","汉滨区","铁路","幼儿园","幼儿","园","疫情","防","控","返","园","承诺书","承诺","书","及","返","园","前","健康","监测","登记表","登记","表","并在","开学","当天","天上","上交","纸质","版","给","班级","教师","电子表格","电子表","电子","子表","表格","已","发至","班级","群","新学期","新学","学期","开学","在即","让我们","我们","一起","一","起","做好","返","园","前","各项","防","控","工作","确保全","确保","保全","全体","教职工","教职","职工","及","幼儿","安全","返","园","祝","大家","身体健康","身体","健康","暑假","愉快","汉滨区","铁路","幼儿园","幼儿","园","2022","年","8","月","19","日","扫","码","关注","分享","给","第一个","第一","一个","一","个","想到","的人"]

That concludes this article on tokenizing a piece of Chinese text with ES.
