Elasticsearch: How to deploy NLP: text embeddings and vector search


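As part of our natural language processing (NLP) blog series, we walk through an example that uses a text embedding model to generate vector representations of text content and demonstrates vector similarity search on the generated vectors. We will deploy a publicly available model on Elasticsearch and use it in an ingest pipeline to generate embeddings from text documents. We then show how to use those embeddings in a vector similarity search to find documents that are semantically similar to a given query.

Vector similarity search, often called semantic search, goes beyond traditional keyword-based search: it lets users find semantically similar documents that may not share any keywords with the query, which yields a broader set of results. Vector similarity search operates on dense vectors and uses k-nearest neighbour (kNN) search to find similar vectors. For this to work, the textual content must first be converted into a numeric vector representation with a text embedding model.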
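To make the idea of k-nearest neighbour search on dense vectors concrete, here is a minimal sketch in plain Python. It performs an exact, brute-force search with cosine similarity; Elasticsearch itself uses an approximate algorithm, so this only illustrates the concept. The three-dimensional vectors and document names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query, vectors, k):
    """Return the ids of the k vectors most similar to the query (exact search)."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy "embeddings", invented for this example; real ones have 768 dimensions.
docs = {
    "weather": [0.9, 0.1, 0.0],
    "climate": [0.8, 0.3, 0.1],
    "rna": [0.0, 0.2, 0.9],
}
nearest = knn([1.0, 0.2, 0.0], docs, k=2)
print(nearest)
```

The two documents whose vectors point in nearly the same direction as the query are returned first, even though no keywords are involved at any point.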

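We will use a public dataset from the MS MARCO Passage Ranking Task for the demonstration. It consists of real questions from the Microsoft Bing search engine and human-generated answers. The dataset is a good fit for testing vector similarity search: first, question answering is one of the most common use cases for vector search, and second, the top papers on the MS MARCO leaderboard use vector search in some form.

In our example, we take a sample of this dataset, use a model to generate text embeddings, and then run vector search against them. We also want to do a quick check of the quality of the results that vector search produces. In this walkthrough, I use Elastic Stack 8.2.

The architecture of a vector search setup can be depicted as follows:

[Figure: architecture of a vector search deployment]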

Installation

Elasticsearch and Kibana

If you have not yet installed Elasticsearch and Kibana, see the following articles:

  • How to install Elasticsearch on Linux, macOS, and Windows
  • Kibana: How to install Kibana from the Elastic Stack on Linux, macOS, and Windows

Follow the 8.x installation sections of those articles. Because uploading models with Eland is a Platinum or Enterprise feature, we start a Platinum trial license for this demo.

Eland

Eland can be installed from PyPI with pip:

python -m pip install eland

Eland can also be installed from Conda Forge with Conda:

conda install -c conda-forge eland

Users who want to run the available scripts without installing Eland can build the Docker container instead:

git clone https://github.com/elastic/eland
cd eland
docker build -t elastic/eland .

Eland wraps the conversion of a Hugging Face transformer model to its TorchScript representation, along with the chunking process, in a single Python method; it is therefore the recommended import method.

  1. Install the Eland Python client.
  2. Run the eland_import_hub_model script. For example:
eland_import_hub_model --url <clusterUrl> \
--hub-model-id elastic/distilbert-base-cased-finetuned-conll03-english \
--task-type ner
  • Specify the URL used to access your cluster, for example https://<user>:<password>@<hostname>:<port>.
  • Specify the identifier of the model in the Hugging Face model hub.
  • Specify the type of NLP task. Supported values are fill_mask, ner, text_classification, text_embedding, question_answering, and zero_shot_classification.
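If you drive the import from a script, it can help to assemble the command programmatically and reject unsupported task types before anything is sent to the cluster. The sketch below is only an illustration: `build_import_command` is a hypothetical helper, not part of Eland, and the cluster URL is a placeholder. It uses only the flags that appear in this post.

```python
import shlex

# Task types listed above as supported by eland_import_hub_model.
SUPPORTED_TASK_TYPES = {
    "fill_mask",
    "ner",
    "text_classification",
    "text_embedding",
    "question_answering",
    "zero_shot_classification",
}


def build_import_command(cluster_url, hub_model_id, task_type, start=False):
    """Build the argument list for eland_import_hub_model (hypothetical helper)."""
    if task_type not in SUPPORTED_TASK_TYPES:
        raise ValueError(f"unsupported task type: {task_type}")
    args = [
        "eland_import_hub_model",
        "--url", cluster_url,
        "--hub-model-id", hub_model_id,
        "--task-type", task_type,
    ]
    if start:
        # --start asks the script to deploy the model right after the upload.
        args.append("--start")
    return args


# Placeholder URL; substitute your own cluster address and credentials.
cmd = build_import_command(
    "https://localhost:9200",
    "elastic/distilbert-base-cased-finetuned-conll03-english",
    "ner",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Eland is installed.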

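Deploying the text embedding model

The first step is to install the text embedding model. For our model we use msmarco-distilbert-base-tas-b from Hugging Face. This is a sentence-transformer model that maps a sentence or a paragraph to a 768-dimensional dense vector. The model is optimized for semantic search and was trained specifically on the MS MARCO Passage dataset, which makes it suitable for our task. Besides this model, Elasticsearch supports many other text embedding models; the full list is in the Elastic documentation.

We install the model with the Eland Docker image built above, as in the NER example. Running the script below imports the model into our local cluster and deploys it: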

docker run -it --rm elastic/eland \
    eland_import_hub_model \
        --url https://elastic:lOwgBZT3KowJrQWMwRWm@192.168.0.3:9200/ \
        --hub-model-id sentence-transformers/msmarco-distilbert-base-tas-b \
        --task-type text_embedding \
        --insecure \
        --start       

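In the command above, replace the username and password with your own, and change the Elasticsearch address to match your setup. Because this is a self-signed installation, I use the --insecure option to skip SSL certificate verification.

If we want to upload securely using the self-signed certificate, we can use the following command instead: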

docker run -it \
     -v /Users/liuxg/elastic/elasticsearch-8.2.0/config/certs/http_ca.crt:/usr/share/http_ca.crt \
     --rm elastic/eland \
     eland_import_hub_model \
        --url https://elastic:EH3*HOpb5rmWdbDj_f4k@192.168.0.3:9200/ \
        --hub-model-id sentence-transformers/msmarco-distilbert-base-tas-b \
        --task-type text_embedding \
        --ca-cert /usr/share/http_ca.crt \
        --start       

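Above, we use the -v flag to mount a local http_ca.crt certificate into the Docker container.

Here, --task-type is set to text_embedding and the --start option is passed to the Eland script, so the model is deployed automatically without having to start it in the Model Management UI. To speed up inference, you can increase the number of inference threads with the inference_threads parameter.

The command output confirms that the model was uploaded successfully.

We can test that the model deployed successfully by running this example in the Kibana console: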

POST /_ml/trained_models/sentence-transformers__msmarco-distilbert-base-tas-b/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}

The response should contain the predicted dense vector.
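The same test can be prepared from Python. The sketch below only builds the request and reads the vector out of a response; it sends nothing. All three helpers are hypothetical. The model ID rule (replace "/" with "__") is inferred from the IDs that appear in this post, and Eland may apply further normalization. The two response shapes handled by `extract_embedding` are assumptions that may differ between Elasticsearch versions.

```python
def model_id_from_hub_id(hub_model_id):
    """Map a Hugging Face hub ID to the Elasticsearch model ID.

    Assumption inferred from this post: "/" becomes "__".
    """
    return hub_model_id.replace("/", "__")


def build_infer_request(model_id, text):
    """Return the path and JSON body of the _infer call used above."""
    path = f"/_ml/trained_models/{model_id}/deployment/_infer"
    body = {"docs": {"text_field": text}}
    return path, body


def extract_embedding(response):
    """Pull the dense vector out of an _infer response (assumed shapes)."""
    if "predicted_value" in response:
        return response["predicted_value"]
    # Assumed alternative shape: a list of per-document results.
    return response["inference_results"][0]["predicted_value"]


model_id = model_id_from_hub_id("sentence-transformers/msmarco-distilbert-base-tas-b")
path, body = build_infer_request(model_id, "how is the weather in jamaica")
print(path)
```

Any HTTP client can then POST `body` to `path` on the cluster and hand the parsed JSON to `extract_embedding`.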

After the steps above, we can view the imported model in Kibana, and we can also test the uploaded model there. This shows that the model was uploaded successfully.

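Loading the initial data

As mentioned in the introduction, we use the MS MARCO Passage Ranking dataset. The dataset is quite large, with more than 8 million passages. For our example we use a subset of it that was used in the testing phase of the 2019 TREC Deep Learning Track. The dataset for the re-ranking task, msmarco-passagetest2019-top1000.tsv, contains 200 queries, each with a list of relevant text passages extracted by a simple IR system. From this dataset we extracted all unique passages with their ids and put them into a separate tsv file, 182469 passages in total. We use this file as our dataset.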
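The extraction of unique passages can be sketched in a few lines of Python. This is not the script used for this post. It assumes that each row of the TSV file has four tab-separated columns in the order query id, passage id, query text, passage text; check that against your copy of the file. The sample rows are invented.

```python
import csv
import io


def unique_passages(rows):
    """Collect one passage text per passage id.

    rows: iterable of (qid, pid, query, passage) records (assumed column order).
    """
    passages = {}
    for _qid, pid, _query, passage in rows:
        # Keep the first text seen for each passage id.
        passages.setdefault(int(pid), passage)
    return passages


# Invented sample; the real input would be msmarco-passagetest2019-top1000.tsv.
sample = (
    "1\t10\tfirst query\tpassage A\n"
    "1\t11\tfirst query\tpassage B\n"
    "2\t10\tsecond query\tpassage A\n"
)
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
passages = unique_passages(reader)

# Write the two-column file (id, text) that is uploaded afterwards.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", quoting=csv.QUOTE_NONE, escapechar="\\")
for pid, text in passages.items():
    writer.writerow([pid, text])
print(out.getvalue())
```

For the real file, replace the `io.StringIO` objects with files opened using `newline=""` and `encoding="utf-8"`.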

We use Kibana's file upload feature to upload this dataset. Kibana file upload lets us provide custom names for the fields: we call them id, of type long, for the passage id, and text, of type text, for the passage content. The index name is collection. After the upload, we can see an index named collection that contains 182469 documents.

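Creating the pipeline

We want to process the initial data with an inference processor that adds an embedding to each passage. To do so, we create a text embedding ingest pipeline and then reindex our initial data through it.

In the Kibana console we create an ingest pipeline for text embedding and call it text-embeddings. The passages are in a field named text. As before, we define a field_map that maps text to the field text_field that the model expects. Likewise, the on_failure handler is set to index failures into a different index: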

PUT _ingest/pipeline/text-embeddings
{
  "description": "Text embedding pipeline",
  "processors": [
    {
      "inference": {
        "model_id": "sentence-transformers__msmarco-distilbert-base-tas-b",
        "target_field": "text_embedding",
        "field_map": {
          "text": "text_field"
        }
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "description": "Index document to 'failed-<index>'",
        "field": "_index",
        "value": "failed-{{{_index}}}"
      }
    },
    {
      "set": {
        "description": "Set error message",
        "field": "ingest.failure",
        "value": "{{_ingest.on_failure_message}}"
      }
    }
  ]
}
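To see what the pipeline does to a single document, here is a local imitation in Python. It does not call Elasticsearch. It only mirrors the three pieces of the definition above: field_map, target_field, and the on_failure handler. `fake_embed` is a hypothetical stand-in for the deployed model and returns a made-up two-dimensional vector.

```python
def run_pipeline(doc, index, embed):
    """Imitate the text-embeddings pipeline for one document (illustration only)."""
    result = dict(doc)
    try:
        # field_map: the document's "text" is handed to the model as "text_field".
        model_input = {"text_field": doc["text"]}
        # target_field: the model output is stored under "text_embedding".
        result["text_embedding"] = {
            "predicted_value": embed(model_input["text_field"])
        }
        return index, result
    except Exception as exc:
        # on_failure: reroute to failed-<index> and record the error message.
        result["ingest"] = {"failure": str(exc)}
        return f"failed-{index}", result


def fake_embed(text):
    """Hypothetical stand-in for the model; not a real embedding."""
    if not text:
        raise ValueError("empty input")
    return [float(len(text)), 0.0]


ok = run_pipeline({"id": 1, "text": "abc"}, "collection", fake_embed)
failed = run_pipeline({"id": 2, "text": ""}, "collection", fake_embed)
print(ok)
print(failed)
```

A document that fails inference is not lost: it lands in a separate failed-<index> index together with the error message, so it can be inspected and reprocessed later.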

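Reindex

We want to reindex the documents from the collection index into a new collection-with-embeddings index by pushing them through the text-embeddings pipeline, so that the documents in collection-with-embeddings have an additional field for the passage embedding. Before we do that, we need to create the destination index and define a mapping for it, in particular for the field text_embedding.predicted_value, where the ingest processor stores the embedding. If we skip this, the embedding is indexed into regular float fields and cannot be used for vector similarity search. The model we use produces 768-dimensional embeddings, so we use an indexed dense_vector field with 768 dimensions, as follows: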

PUT collection-with-embeddings
{
  "mappings": {
    "properties": {
      "text_embedding.predicted_value": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      },
      "text": {
        "type": "text"
      }
    }
  }
}

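Finally, we are ready to reindex. Because reindexing takes some time to process all documents and run inference on them, we run it in the background by calling the API with the wait_for_completion=false flag: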

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "collection"
  },
  "dest": {
    "index": "collection-with-embeddings",
    "pipeline": "text-embeddings"
  }
}

The call returns a task ID. We can monitor the progress of the task with:

GET _tasks/<task_id>

Alternatively, track progress by watching the inference count increase in the model stats API or in the model stats UI. When the count reaches the document count from earlier, 182469, the reindex is complete.
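When polling the task from a script, the response of GET _tasks/<task_id> can be turned into a progress figure. The sketch below assumes the usual shape of a reindex task response, with a `status` object that carries `total`, `created`, `updated`, and `deleted` counters. `reindex_progress` is a hypothetical helper, and the sample numbers are invented.

```python
def reindex_progress(task_response):
    """Return the fraction of documents processed so far, between 0.0 and 1.0."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    if total == 0:
        return 0.0
    processed = (
        status.get("created", 0)
        + status.get("updated", 0)
        + status.get("deleted", 0)
        + status.get("noops", 0)
    )
    return processed / total


# Invented sample response, trimmed to the fields used here.
sample = {
    "completed": False,
    "task": {"status": {"total": 200, "created": 50, "updated": 0, "deleted": 0}},
}
print(f"{reindex_progress(sample):.0%} done")
```

The top-level `completed` flag is still the authoritative signal that the task has finished; the fraction is only useful for estimating how far along it is.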

The reindexed documents now contain the inference result, the vector embedding. For example, one of the documents looks like this:

{
    "id": 7130104,
    "text": "This is the definition of RNA along with examples of types of RNA molecules. This is the definition of RNA along with examples of types of RNA molecules. RNA Definition",
    "text_embedding":
    {
        "predicted_value":
        [
            0.057356324046850204,
            0.1602816879749298,
            -0.18122544884681702,
            0.022277727723121643,
            ....
        ],
        "model_id": "sentence-transformers__msmarco-distilbert-base-tas-b"
    }
}

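Vector Similarity Search

Currently (in the 8.2 release used here), generating embeddings implicitly from the query terms during a search request is not supported, so our semantic search is organized as a two-step process:

  • Obtain a text embedding from the text query. For this we use the model's _infer API.
  • Use vector search to find documents that are semantically similar to the query text. Elasticsearch v8.0 introduced a new _knn_search endpoint that allows efficient approximate nearest neighbour search on indexed dense_vector fields. We use the _knn_search API to find the nearest documents.

For example, given the text query "how is the weather in jamaica", we first run the _infer API to obtain its embedding as a dense vector: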

POST /_ml/trained_models/sentence-transformers__msmarco-distilbert-base-tas-b/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}

The predicted_value in the response is a 768-dimensional vector. We then insert the resulting dense vector into a _knn_search request, as follows:

GET collection-with-embeddings/_knn_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector": [
    -0.09194609522819519,
    -0.49406030774116516,
    0.03598763048648834,
       …
    ],
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}
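Pasting 768 numbers by hand is tedious, so in practice the second step is usually scripted. The sketch below builds the same _knn_search body from a vector returned by the first step. `build_knn_search_body` is a hypothetical helper. The check that num_candidates is not smaller than k reflects my understanding of the API's validation and should be treated as an assumption.

```python
def build_knn_search_body(query_vector, k=10, num_candidates=100):
    """Build the JSON body for GET collection-with-embeddings/_knn_search."""
    if num_candidates < k:
        # Assumption: the API rejects requests with fewer candidates than k.
        raise ValueError("num_candidates must not be smaller than k")
    return {
        "knn": {
            "field": "text_embedding.predicted_value",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["id", "text"],
    }


# Truncated vector for illustration; a real one has 768 values.
body = build_knn_search_body([-0.0919, -0.4940, 0.0359])
print(body["knn"]["k"], len(body["knn"]["query_vector"]))
```

A larger num_candidates makes the approximate search consider more candidates per shard, which tends to improve accuracy at the cost of speed.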

As a result, we get the 10 documents closest to the query, sorted by their proximity to the query:

"hits" : [
      {
        "_index" : "collection-with-embeddings",
        "_id" : "6H_OsH8Bi5IvRzQ7g-Aa",
        "_score" : 0.9527166,
        "_source" : {
          "id" : 6140,
          "text" : "Ocho Rios Jamaica Weather - Winter ( December, January And February) The winters in this town are usually colder when compared to other parts of the island. The average temperature for December, January and February are 81  °F and 79  °F respectively. All three months usually have a high temperature of 84  °F."
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "6n_OsH8Bi5IvRzQ7g-Aa",
        "_score" : 0.95225316,
        "_source" : {
          "id" : 6142,
          "text" : "Jamaica Weather and When to Go. Jamaica weather essentials. For more details on the current temperature, wind, and stuff like that you can check any search engine weather feature. The rainy months, also called the rainy season, are generally from the end of April, or early May, until the end of September or early October."
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "5n_OsH8Bi5IvRzQ7g-Aa",
        "_score" : 0.9394933,
        "_source" : {
          "id" : 6138,
          "text" : "Quick Answer. Hurricane season in Jamaica starts on June 1 and ends on Nov. 30. Satellite weather forecasts work to allow tourists and island dwellers adequate time to take precautions when hurricanes approach during those months. Continue Reading."
        }
      },
…
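The _score values in these hits are not raw cosine similarities. According to the Elasticsearch documentation for dense_vector fields that use cosine similarity, the score is computed as (1 + cosine) / 2, which maps the range [-1, 1] onto [0, 1]. A small sketch of the conversion in both directions, with function names chosen for this example:

```python
def score_from_cosine(cosine):
    """kNN _score for a field mapped with similarity "cosine"."""
    return (1.0 + cosine) / 2.0


def cosine_from_score(score):
    """Recover the cosine similarity from a kNN _score."""
    return 2.0 * score - 1.0


top_hit_score = 0.9527166  # _score of the first hit shown above
print(round(cosine_from_score(top_hit_score), 4))
```

So a score of about 0.95 corresponds to a cosine similarity of about 0.91 between the query vector and the passage vector.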

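Quick verification

Since we use only a subset of the MS MARCO dataset, we cannot run a full evaluation. What we can do instead is a simple verification on a few queries, to get a sense that the results are indeed relevant rather than random. From the TREC 2019 Deep Learning Track judgements for the Passage Ranking Task, we take the last 3 queries, submit them to our vector similarity search, get the top 10 results, and consult the TREC judgements to see how relevant the results are. For the Passage Ranking Task, passages are rated as not relevant (0), related (the passage is on topic but does not answer the question) (1), highly relevant (2), and perfectly relevant (3).

Note that our verification is not a rigorous evaluation; it serves only our quick demo. Because we indexed only passages that are known to be related to the queries, it is a much easier task than the original passage retrieval task. In the future we intend to run a rigorous evaluation on the MS MARCO dataset.

Query #1124210 "tracheids are part of _____" submitted to our vector search returns the following results: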

Passage id Relevance rating Passage
2258591 2 - highly relevant Tracheid of oak shows pits along the walls. It is longer than a vessel element and has no perforation plates. Tracheids are elongated cells in the xylem of vascular plants that serve in the transport of water and mineral salts.Tracheids are one of two types of tracheary elements, vessel elements being the other. Tracheids, unlike vessel elements, do not have perforation plates.racheids provide most of the structural support in softwoods, where they are the major cell type. Because tracheids have a much higher surface to volume ratio compared to vessel elements, they serve to hold water against gravity (by adhesion) when transpiration is not occurring.
2258592 3 - perfectly relevant Tracheid. a dead lignified plant cell that functions in water conduction. Tracheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae.Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.racheids are found in the xylem of all higher plants except certain angiosperms, such as cereals and sedges, in which the water-conducting function is performed by vessels, or tracheae. Tracheids are usually polygonal in cross section; their walls have annular, spiral, or scalene thickenings or rimmed pores.
2728448 2 - highly relevant The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants.
7443586 2 - highly relevant 1 The xylem tracheary elements consist of cells known as tracheids and vessel members, both of which are typically narrow, hollow, and elongated. Tracheids are less specialized than the vessel members and are the only type of water-conducting cells in most gymnosperms and seedless vascular plants.
8026737 2 - highly relevant Its major components include xylem parenchyma, xylem fibers, tracheids, and xylem vessels. Tracheids are one of the two types of tracheary elements of vascular plants. (The other being the vessel elements). A tracheid cell loses its protoplast at maturity. Thus, at maturity, it becomes one of the non-living components of the xylem.
2258595 2 - highly relevant Summary: Vessels have perforations at the end plates while tracheids do not have end plates. Tracheids are derived from single individual cells while vessels are derived from a pile of cells. Tracheids are present in all vascular plants whereas vessels are confined to angiosperms.Tracheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements.Vessels are broader than tracheids with which they are associated.Morphology of the perforation plate is different from that in tracheids.racheids are thin whereas vessel elements are wide. Tracheids have a much higher surface-to-volume ratio as compared to vessel elements. Vessels are broader than tracheids with which they are associated. Morphology of the perforation plate is different from that in tracheids.
181177 3 - perfectly relevant Xylem tracheids are pointed, elongated xylem cells, the simplest of which have continuous primary cell walls and lignified secondary wall thickenings in the form of rings, hoops, or reticulate networks.
2258597 2 - highly relevant Thank you... In plants xylem and phloem are the complex tissues which are the components parts of conductive system. In higher plants xylem contains tracheids, vessels (tracheae), xylem fibres(wood fibres) and xylem parenchyma (wood parenchyma).Tracheids These are elongated narrow tube like cells with hard thick and lignified walls with large cell cavity.hank you... In plants xylem and phloem are the complex tissues which are the components parts of conductive system. In higher plants xylem contains tracheids, vessels (tracheae), xylem fibres(wood fibres) and xylem parenchyma (wood parenchyma).
6541866 2 - highly relevant In most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.n most plants, pitted tracheids function as the primary transport cells. The other type of tracheary element, besides the tracheid, is the vessel element. Vessel elements are joined by perforations into vessels. In vessels, water travels by bulk flow, as in a pipe, rather than by diffusion through cell membranes.

Query #1129237 "hydrogen is a liquid below what temperature" returns the following results:

Passage id Relevance rating Passage
128984 3 - perfectly relevant Hydrogen gas has the molecular formula H 2. At room temperature and under standard pressure conditions, hydrogen is a gas that is tasteless, odorless and colorless. Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel.
5906130 3 - perfectly relevant Rating Newest Oldest. Best Answer: Hydrogen, like water, can exist in 3 states....Solid, Liquid and Gas Its temperature as a solid is −259.14 °C' Hydrogen melts to liquid at −252.87 °C. It boils and vaporises at -252.125 °C Just cooling or compressing Hydrogen won't liquefy or freeze it.
4254815 1 - related Answer   The boiling point of liquid hydrogen is 20.268 K (-252.88 °C or -423.184 °F)    The freezing point of hydrogen is 14.025 K (-259.125 °C or -434.
8588222 3 - perfectly relevant User: Hydrogen is a liquid below what temperature? a. 100 degrees C c. -183 degrees C b. -253 degrees C d. 0 degrees C Weegy: Hydrogen is a liquid below 253 degrees C. User: What is the boiling point of oxygen? a. 100 degrees C c. -57 degrees C b. 8 degrees C d. -183 degrees C Weegy: The boiling point of oxygen is -183 degrees C.
4254811 3 - perfectly relevant Confidence votes 11.4K. At STP (standard temperature and pressure) hydrogen is a gas. It cools to a liquid at -423 °F, which is only about 37 degrees above absolute zero. Eleven degrees cooler, at -434 °F, it starts to solidify.
2697752 2 - highly relevant Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold... Hydrogen's state of matter is gas at standard conditions of temperature and pressure. Hydrogen condenses into a liquid or freezes solid at extremely cold temperatures. Hydrogen's state of matter can change when the temperature changes, becoming a liquid at temperatures between minus 423.18 and minus 434.49 degrees Fahrenheit. It becomes a solid at temperatures below minus 434.49 F.Due to its high flammability, hydrogen gas is commonly used in combustion reactions, such as in rocket and automobile fuels.
6080460 3 - perfectly relevant Hydrogen can exist as a liquid under high pressure and an extremely low temperature of 20.28 kelvin (−252.87°C, −423.17 °F). Hydrogen is often stored in this way as liquid hydrogen takes up less space than hydrogen in its normal gas form. Liquid hydrogen is also used as a rocket fuel.ydrogen is found in large amounts in giant gas planets and stars, it plays a key role in powering stars through fusion reactions. Hydrogen is one of two important elements found in water (H 2 O). Each molecule of water is made up of two hydrogen atoms bonded to one oxygen atom.
3905802 3 - perfectly relevant Hydrogen is found naturally in the molecular H2 form. To exist as a liquid, H2 must be cooled below hydrogen's critical point of 33 K. However, for hydrogen to be in a fully liquid state without boiling at atmospheric pressure, it needs to be cooled to 20.28 K (−423.17 °F/−252.87 °C).

Query #1133167 "how is the weather in jamaica" returns the following results:

Passage id Relevance rating Passage
3023123 2 - highly relevant Climate - Jamaica. Temperature, rainfall, prevailing weather conditions, when to go, what to pack. In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C, and minimum temperatures around 20/23 °C.
434121 2 - highly relevant Temperature, rainfall, prevailing weather conditions, when to go, what to pack. In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C (81/86 °F), and minimum temperatures around 20/23 °C (68/73 °F).
4922619 2 - highly relevant Map from Google - Jamaica. 1  In Jamaica the climate is tropical, hot all year round, with little difference between winter and summer (just a few degrees). Even in winter, the maximum temperatures are around 27/30 °C (81/86 °F), and minimum temperatures around 20/23 °C (68/73 °F).
8255706 2 - highly relevant And it's absolutely true. This is Jamaica weather! Most of our days are filled with warmth and sunshine, even during the rainy season. Jamaica has a tropical climate with hot and humid weather at sea level. The higher inland regions have a more temperate climate. (Bring a light jacket just in case you travel to the mountains where temperatures can be 10 degrees cooler or in case you go on a windy boat ride).
190806 2 - highly relevant It is always important to know what the weather in Jamaica will be like before you plan and take your vacation. For the most part, the average temperature in Jamaica is between 80 °F and 90 °F (27 °FCelsius-29 °Celsius). Luckily, the weather in Jamaica is always vacation friendly. You will hardly experience long periods of rain fall, and you will become accustomed to weeks upon weeks of sunny weather.
1824486 2 - highly relevant The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably...
4498474 3 - perfectly relevant The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.
1824480 3 - perfectly relevant The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year.

As we can see, for all 3 queries Elasticsearch returned mostly relevant results, and the top-1 result for every query is either highly relevant or perfectly relevant.
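The same observation can be put into numbers with a few lines of Python. The grades below are transcribed from the three result tables above, in the order listed. The summary statistic, the share of returned passages rated highly relevant (2) or better, is a choice made for this sketch and is not an official TREC metric.

```python
# Relevance grades of the returned passages, copied from the tables above.
grades_by_query = {
    1124210: [2, 3, 2, 2, 2, 2, 3, 2, 2],
    1129237: [3, 3, 1, 3, 3, 2, 3, 3],
    1133167: [2, 2, 2, 2, 2, 2, 3, 3],
}


def share_at_least(grades, threshold=2):
    """Fraction of results whose grade is at or above the threshold."""
    return sum(1 for grade in grades if grade >= threshold) / len(grades)


shares = {qid: share_at_least(g) for qid, g in grades_by_query.items()}
for qid, share in shares.items():
    print(qid, f"top-1 grade: {grades_by_query[qid][0]}", f"share >= 2: {share:.3f}")
```

Only one listed passage across the three queries falls below grade 2, which matches the qualitative reading of the tables.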

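Semantic search

In the demo above, a search requires the following two steps:

  1. Obtain the vector for the search string through the _infer API.
  2. Search with the vector from the previous step through the _knn_search endpoint.

Fortunately, in the more recent 8.7 release, we can combine the two steps into a single request: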

GET collection-with-embeddings/_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__msmarco-distilbert-base-tas-b",
        "model_text": "How is the weather in Jamaica?"
      }
    },
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "text"
  ]
}
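From a client application, the single-request form is convenient because only the query text has to be supplied. The sketch below builds the request body shown above. `build_semantic_search_body` is a hypothetical helper, and the field and model names are the ones used throughout this post.

```python
import json

MODEL_ID = "sentence-transformers__msmarco-distilbert-base-tas-b"


def build_semantic_search_body(query_text, model_id=MODEL_ID, k=10, num_candidates=100):
    """Build a _search body that lets Elasticsearch embed the query text itself."""
    return {
        "knn": {
            "field": "text_embedding.predicted_value",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": model_id,
                    "model_text": query_text,
                }
            },
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["id", "text"],
    }


search_body = build_semantic_search_body("How is the weather in Jamaica?")
print(json.dumps(search_body, indent=2))
```

Because the embedding is computed on the server, the client never handles the 768-dimensional vector at all.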

As a result, you receive from the collection-with-embeddings index the 10 documents whose meaning is closest to the query, sorted by their proximity to the query:

"hits" : [
      {
        "_index" : "collection-with-embeddings",
        "_id" : "47TPtn8BjSkJO8zzKq_o",
        "_score" : 0.94591534,
        "_source" : {
          "id" : 434125,
          "text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading."
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "3LTPtn8BjSkJO8zzKJO1",
        "_score" : 0.94536424,
        "_source" : {
          "id" : 4498474,
          "text" : "The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year"
        }
      },
      {
        "_index" : "collection-with-embeddings",
        "_id" : "KrXPtn8BjSkJO8zzPbDW",
        "_score" :  0.9432083,
        "_source" : {
          "id" : 190804,
          "text" : "Quick Answer. The climate in Jamaica is tropical and humid with warm to hot temperatures all year round. The average temperature in Jamaica is between 80 and 90 degrees Fahrenheit. Jamaican nights are considerably cooler than the days, and the mountain areas are cooler than the lower land throughout the year. Continue Reading"
        }
      },
      (...)
]

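Try it out

NLP is a powerful feature in the Elastic Stack with an exciting roadmap. Discover new features and keep up with the latest developments by building your cluster in Elastic Cloud. Sign up for a free 14-day trial today and try the examples in this blog.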
