Expands semantic search tutorial with hybrid search
kosabogi committed Oct 9, 2024
1 parent fb482f8 commit 9be5b82
Showing 1 changed file with 191 additions and 22 deletions.
@@ -1,21 +1,52 @@
[[semantic-search-semantic-text]]
=== Tutorial: semantic search and hybrid search with `semantic_text`
++++
<titleabbrev>Semantic search and hybrid search with `semantic_text`</titleabbrev>
++++

beta[]

This tutorial demonstrates how to perform **semantic search** using the **semantic text** feature and explains how to implement
**hybrid search**, combining semantic search with traditional full-text search.

Semantic text simplifies the {infer} workflow by providing {infer} at ingestion time and sensible default values automatically.
You don't need to define model-related settings and parameters, or create {infer} ingest pipelines.

In hybrid search, semantic search retrieves results based on the meaning of the text, while full-text search focuses
on exact word matches. By combining both methods, hybrid search delivers more relevant results, particularly in cases
where relying on a single approach may not be sufficient.

The recommended way to use <<semantic-search,semantic search>> and hybrid search in the {stack} is to follow the `semantic_text` workflow.
When you need more control over indexing and query settings, you can still use the complete {infer} workflow (refer to
<<semantic-search-inference,this tutorial>> to review the process).

This tutorial uses the <<inference-example-elser,`elser` service>> for demonstration, but you can use any service and its
supported models offered by the {infer-cap} API.


To perform a simple **semantic search**, follow these steps:

- <<semantic-text-infer-endpoint,Create the inference endpoint>>

- <<semantic-search-create-index-mapping,Create the index mapping for semantic search>>

- <<semantic-text-load-data,Load data>>

- <<semantic-search-reindex-data, Reindex the data for semantic search>>

- <<semantic-search-perform-search, Perform semantic search>>

To perform a **hybrid search**, follow these steps:

- <<semantic-text-infer-endpoint,Create the inference endpoint>>

- <<hybrid-search-create-index-mapping,Create the index mapping for hybrid search>>

- <<semantic-text-load-data,Load data>>

- <<hybrid-search-reindex-data, Reindex the data for hybrid search>>

- <<hybrid-search-perform-search, Perform hybrid search>>

[discrete]
[[semantic-text-requirements]]
@@ -65,8 +96,26 @@ If using the Python client, you can set the `timeout` parameter to a higher value
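
The <<semantic-text-infer-endpoint,Create the inference endpoint>> step is not shown in this diff. As a minimal, non-authoritative sketch, an ELSER endpoint with the `my-elser-endpoint` ID used later in this tutorial could be created as follows (the allocation settings here are illustrative, not values from this tutorial):

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]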
[[semantic-text-index-mapping]]
==== Create the index mapping

The mapping of the destination index must be created. The destination index is the index where your processed data will be
stored and searched.

[NOTE]
====
If you're using web crawlers or connectors to generate indices, you have to
<<indices-put-mapping,update the index mappings>> for these indices to
include the `semantic_text` field. Once the mapping is updated, you'll need to run
a full web crawl or a full connector sync. This ensures that all existing
documents are reprocessed and updated with the new semantic embeddings,
enabling semantic search on the updated data.
====
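
For example, adding a `semantic_text` field to an existing index could look like the following sketch (the index name `my-connector-index` and the field name `infer_field` are placeholders, not part of this tutorial):

[source,console]
------------------------------------------------------------
PUT my-connector-index/_mapping
{
  "properties": {
    "infer_field": {
      "type": "semantic_text",
      "inference_id": "my-elser-endpoint"
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]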

[discrete]
[[semantic-search-create-index-mapping]]
===== Create an index mapping for semantic search

The destination index should contain the embeddings generated by the inference endpoint based on the input text.
This allows semantic search to retrieve results based on the meaning of the query.
The index must have a field with the <<semantic-text,`semantic_text`>> field type to index the output of the used inference endpoint.

[source,console]
------------------------------------------------------------
PUT semantic-embeddings
{
  "mappings": {
    "properties": {
      "content": { <1>
        "type": "semantic_text", <2>
        "inference_id": "my-elser-endpoint" <3>
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The name of the field to contain the generated embeddings.
<2> The field to contain the embeddings is a `semantic_text` field.
<3> The `inference_id` is the inference endpoint you created in the previous step.
It will be used to generate the embeddings based on the input text.
Every time you ingest data into the related `semantic_text` field, this endpoint will be used for creating the vector representation of the text.

[discrete]
[[hybrid-search-create-index-mapping]]
===== Create an index mapping for hybrid search

The destination index should contain both the embeddings (for semantic search) and the original text field (for full-text search).
This structure enables the combination of semantic search and full-text search, allowing the search engine to consider both the
meaning and the exact words in the query.

[source,console]
------------------------------------------------------------
PUT semantic-hybrid-embeddings
{
  "mappings": {
    "properties": {
      "semantic_text": { <1>
        "type": "semantic_text", <2>
        "inference_id": "my-elser-endpoint" <3>
      },
      "content": { <4>
        "type": "text", <5>
        "copy_to": "semantic_text" <6>
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The name of the field to contain the generated embeddings for semantic search.
<2> The field to contain the embeddings is a `semantic_text` field.
<3> The `inference_id` is the inference endpoint that generates the embeddings based on the input text.
<4> The name of the field to contain the original text for lexical search.
<5> The `content` field is a `text` field that holds the raw text data.
<6> The `copy_to` option copies the contents of the `content` field into the `semantic_text` field, enabling hybrid search functionality.
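
With this mapping, indexing a document only requires populating the `content` field; the `copy_to` directive feeds the same text to the `semantic_text` field, which generates the embeddings at ingest time. A quick sketch (the document below is illustrative, not part of the dataset used later):

[source,console]
------------------------------------------------------------
POST semantic-hybrid-embeddings/_doc
{
  "content": "Gentle stretching and a gradual warm-up can help reduce muscle soreness after running."
}
------------------------------------------------------------
// TEST[skip:TBD]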


[discrete]
Expand All @@ -124,17 +197,24 @@ After the upload is complete, you will see an index named `test-data` with 182,4
[[semantic-text-reindex-data]]
==== Reindex the data

Reindex the data to create embeddings from the text, either by reindexing from the `test-data` index to the `semantic-embeddings`
index for semantic search, or to the `semantic-hybrid-embeddings` index for hybrid search.

[NOTE]
====
This step uses the reindex API to simulate data ingestion. If you are working with data that has already been indexed,
rather than using the `test-data` set, reindexing is required to ensure that the data is processed by the {infer} endpoint
and the necessary embeddings are generated.
====

[discrete]
[[semantic-search-reindex-data]]
===== Reindex the data for semantic search

Create embeddings from the text by reindexing the data from the `test-data` index to the `semantic-embeddings` index.
The data in the `content` field will be reindexed into the `content` semantic text field of the destination index.
The reindexed data will be processed by the inference endpoint associated with the `content` semantic text field.

[source,console]
------------------------------------------------------------
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 10
  },
  "dest": {
    "index": "semantic-embeddings"
  }
}
------------------------------------------------------------
// TEST[skip:TBD]

@@ -172,10 +252,56 @@ POST _tasks/<task_id>/_cancel
------------------------------------------------------------
// TEST[skip:TBD]

[discrete]
[[hybrid-search-reindex-data]]
===== Reindex the data for hybrid search

Reindex the data from the `test-data` index into the `semantic-hybrid-embeddings` index to enable both lexical and semantic search.
The data in the `content` field of the source index is copied into the `content` field of the destination index.
The `copy_to` functionality then ensures that the content is duplicated into the `semantic_text` field, where it will be
processed by the inference endpoint to generate embeddings.

[source,console]
------------------------------------------------------------
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data", <1>
    "size": 10 <2>
  },
  "dest": {
    "index": "semantic-hybrid-embeddings" <3>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The source index from which data will be reindexed.
<2> The batch size is reduced to 10 for quicker processing and progress tracking. The default is 1000.
<3> The destination index where the reindexed data will be ingested, in this case, the `semantic-hybrid-embeddings` index.

The call returns a task ID to monitor the progress:

[source,console]
------------------------------------------------------------
GET _tasks/<task_id>
------------------------------------------------------------

Reindexing large datasets can take a long time. You can test this workflow using only a subset of the dataset.

To cancel the reindexing process and generate embeddings for the subset that was reindexed:

[source,console]
------------------------------------------------------------
POST _tasks/<task_id>/_cancel
------------------------------------------------------------
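
To see how much of the subset made it into the destination index before the cancellation took effect, a simple count request (an illustrative addition, not a required step) can help:

[source,console]
------------------------------------------------------------
GET semantic-hybrid-embeddings/_count
------------------------------------------------------------
// TEST[skip:TBD]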

[discrete]
[[semantic-text-semantic-search]]
==== Perform search

[discrete]
[[semantic-search-perform-search]]
===== Perform semantic search

After the data set has been enriched with the embeddings, you can query the data using semantic search.
Provide the `semantic_text` field name and the query text in a `semantic` query type.
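
For instance, a minimal semantic query against the `semantic-embeddings` index could look like the following sketch (the query text is illustrative):

[source,console]
------------------------------------------------------------
GET semantic-embeddings/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "How to avoid muscle soreness while running?"
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]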
@@ -282,6 +408,49 @@ query from the `semantic-embedding` index:
------------------------------------------------------------
// NOTCONSOLE

[discrete]
[[hybrid-search-perform-search]]
===== Perform hybrid search

After reindexing the data into the `semantic-hybrid-embeddings` index, you can perform hybrid search, which combines
both semantic and lexical search. You can perform hybrid search using <<rrf,reciprocal rank fusion (RRF)>>. RRF is a technique
that merges the rankings from both semantic and lexical queries, giving more weight to results that rank high in
either search. This ensures that the final results are balanced and relevant.

[source,console]
------------------------------------------------------------
GET semantic-hybrid-embeddings/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": "How to avoid muscle soreness while running?" <1>
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_text", <2>
                "query": "How to avoid muscle soreness while running?"
              }
            }
          }
        }
      ]
    }
  }
}
------------------------------------------------------------
<1> Lexical search for documents that match the query terms in the `content` field.
<2> Semantic search based on the `semantic_text` field.
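
For reference, RRF computes a document's combined score by summing reciprocal ranks across the retrievers in which it appears; a sketch of the scoring, assuming the default `rank_constant` of 60:

[source,text]
------------------------------------------------------------
rrf_score(doc) = sum over retrievers r of: 1 / (rank_constant + rank_r(doc))
------------------------------------------------------------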

[discrete]
[[semantic-text-further-examples]]
==== Further examples
Expand Down
