Expands semantic search tutorial with hybrid search
kosabogi committed Oct 9, 2024
1 parent fb482f8 commit 9be5b82
Showing 1 changed file with 191 additions and 22 deletions.
@@ -1,21 +1,52 @@
[[semantic-search-semantic-text]]
=== Tutorial: semantic search and hybrid search with `semantic_text`
++++
<titleabbrev>Semantic search and hybrid search with `semantic_text`</titleabbrev>
++++

beta[]

This tutorial demonstrates how to perform **semantic search** using the **semantic text** feature and explains how to implement
**hybrid search**, combining semantic search with traditional full-text search.

Semantic text simplifies the {infer} workflow by providing {infer} at ingestion time and sensible default values automatically.
You don't need to define model-related settings and parameters, or create {infer} ingest pipelines.

In hybrid search, semantic search retrieves results based on the meaning of the text, while full-text search focuses
on exact word matches. By combining both methods, hybrid search delivers more relevant results, particularly in cases
where relying on a single approach may not be sufficient.

The recommended way to use <<semantic-search,semantic search>> and hybrid search in the {stack} is to follow the `semantic_text` workflow.
When you need more control over indexing and query settings, you can still use the complete {infer} workflow (refer to
<<semantic-search-inference,this tutorial>> to review the process).

This tutorial uses the <<inference-example-elser,`elser` service>> for demonstration, but you can use any service and its
supported models offered by the {infer-cap} API.


To perform a simple **semantic search**, follow these steps:

- <<semantic-text-infer-endpoint,Create the inference endpoint>>

- <<semantic-search-create-index-mapping,Create the index mapping for semantic search>>

- <<semantic-text-load-data,Load data>>

- <<semantic-search-reindex-data, Reindex the data for semantic search>>

- <<semantic-search-perform-search, Perform semantic search>>

To perform a **hybrid search**, follow these steps:

- <<semantic-text-infer-endpoint,Create the inference endpoint>>

- <<hybrid-search-create-index-mapping,Create the index mapping for hybrid search>>

- <<semantic-text-load-data,Load data>>

- <<hybrid-search-reindex-data, Reindex the data for hybrid search>>

- <<hybrid-search-perform-search, Perform hybrid search>>

[discrete]
[[semantic-text-requirements]]
@@ -65,8 +96,26 @@ If using the Python client, you can set the `timeout` parameter to a higher value
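
The <<semantic-text-infer-endpoint,Create the inference endpoint>> step is not shown in this diff. As a minimal, non-authoritative sketch, an ELSER endpoint with the `my-elser-endpoint` ID used later in this tutorial could be created as follows (the allocation settings here are illustrative, not values from this tutorial):

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]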
[[semantic-text-index-mapping]]
==== Create the index mapping

The mapping of the destination index must be created. The destination index is the index where your processed data will be
stored and searched.

[NOTE]
====
If you're using web crawlers or connectors to generate indices, you have to
<<indices-put-mapping,update the index mappings>> for these indices to
include the `semantic_text` field. Once the mapping is updated, you'll need to run
a full web crawl or a full connector sync. This ensures that all existing
documents are reprocessed and updated with the new semantic embeddings,
enabling semantic search on the updated data.
====
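
For example, adding a `semantic_text` field to an existing index could look like the following sketch (the index name `my-connector-index` and the field name `infer_field` are placeholders, not part of this tutorial):

[source,console]
------------------------------------------------------------
PUT my-connector-index/_mapping
{
  "properties": {
    "infer_field": {
      "type": "semantic_text",
      "inference_id": "my-elser-endpoint"
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]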

[discrete]
[[semantic-search-create-index-mapping]]
===== Create an index mapping for semantic search

The destination index should contain the embeddings generated by the inference endpoint based on the input text.
This allows semantic search to retrieve results based on the meaning of the query.
The index must have a field with the <<semantic-text,`semantic_text`>> field type to index the output of the used inference endpoint.

[source,console]
------------------------------------------------------------
PUT semantic-embeddings
{
  "mappings": {
    "properties": {
      "content": { <1>
        "type": "semantic_text", <2>
        "inference_id": "my-elser-endpoint" <3>
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The name of the field to contain the generated embeddings.
<2> The field to contain the embeddings is a `semantic_text` field.
<3> The `inference_id` is the inference endpoint you created in the previous step.
It will be used to generate the embeddings based on the input text.
Every time you ingest data into the related `semantic_text` field, this endpoint will be used for creating the vector representation of the text.

[discrete]
[[hybrid-search-create-index-mapping]]
===== Create an index mapping for hybrid search

The destination index should contain both the embeddings (for semantic search) and the original text field (for full-text search).
This structure enables the combination of semantic search and full-text search, allowing the search engine to consider both the
meaning and the exact words in the query.

[source,console]
------------------------------------------------------------
PUT semantic-hybrid-embeddings
{
  "mappings": {
    "properties": {
      "semantic_text": { <1>
        "type": "semantic_text", <2>
        "inference_id": "my-elser-endpoint" <3>
      },
      "content": { <4>
        "type": "text", <5>
        "copy_to": "semantic_text" <6>
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The name of the field to contain the generated embeddings for semantic search.
<2> The field to contain the embeddings is a `semantic_text` field.
<3> The `inference_id` is the inference endpoint that generates the embeddings based on the input text.
<4> The name of the field to contain the original text for lexical search.
<5> The `content` field is a `text` field that holds the raw text data.
<6> The `copy_to` option copies the contents of the `content` field into the `semantic_text` field, enabling hybrid search functionality.
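
With this mapping, indexing a document only requires populating the `content` field; the `copy_to` directive feeds the same text to the `semantic_text` field, which generates the embeddings at ingest time. A quick sketch (the document below is illustrative, not part of the dataset used later):

[source,console]
------------------------------------------------------------
POST semantic-hybrid-embeddings/_doc
{
  "content": "Gentle stretching and a gradual warm-up can help reduce muscle soreness after running."
}
------------------------------------------------------------
// TEST[skip:TBD]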


[discrete]
Expand All @@ -124,17 +197,24 @@ After the upload is complete, you will see an index named `test-data` with 182,4
[[semantic-text-reindex-data]]
==== Reindex the data

Reindex the data to create embeddings from the text, either by reindexing from the `test-data` index to the `semantic-embeddings`
index for semantic search, or to the `semantic-hybrid-embeddings` index for hybrid search.

[NOTE]
====
This step uses the reindex API to simulate data ingestion. If you are working with data that has already been indexed,
rather than using the `test-data` set, reindexing is required to ensure that the data is processed by the {infer} endpoint
and the necessary embeddings are generated.
====

[discrete]
[[semantic-search-reindex-data]]
===== Reindex the data for semantic search

Create embeddings from the text by reindexing the data from the `test-data` index to the `semantic-embeddings` index.
The data in the `content` field will be reindexed into the `content` semantic text field of the destination index.
The reindexed data will be processed by the inference endpoint associated with the `content` semantic text field.

[source,console]
------------------------------------------------------------
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 10
  },
  "dest": {
    "index": "semantic-embeddings"
  }
}
------------------------------------------------------------
// TEST[skip:TBD]

@@ -172,10 +252,56 @@ POST _tasks/<task_id>/_cancel
------------------------------------------------------------
// TEST[skip:TBD]

[discrete]
[[hybrid-search-reindex-data]]
===== Reindex the data for hybrid search

Reindex the data from the `test-data` index into the `semantic-hybrid-embeddings` index to enable both lexical and semantic search.
The data in the `content` field of the source index is copied into the `content` field of the destination index.
The `copy_to` functionality then ensures that the content is duplicated into the `semantic_text` field, where it will be
processed by the inference endpoint to generate embeddings.

[source,console]
------------------------------------------------------------
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data", <1>
    "size": 10 <2>
  },
  "dest": {
    "index": "semantic-hybrid-embeddings" <3>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The source index from which data will be reindexed.
<2> The batch size is reduced to 10 for quicker processing and progress tracking. The default is 1000.
<3> The destination index where the reindexed data will be ingested, in this case, the `semantic-hybrid-embeddings` index.

The call returns a task ID to monitor the progress:

[source,console]
------------------------------------------------------------
GET _tasks/<task_id>
------------------------------------------------------------

Reindexing large datasets can take a long time. You can test this workflow using only a subset of the dataset.

To cancel the reindexing process and generate embeddings for the subset that was reindexed:

[source,console]
------------------------------------------------------------
POST _tasks/<task_id>/_cancel
------------------------------------------------------------
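
To see how much of the subset made it into the destination index before the cancellation took effect, a simple count request (an illustrative addition, not a required step) can help:

[source,console]
------------------------------------------------------------
GET semantic-hybrid-embeddings/_count
------------------------------------------------------------
// TEST[skip:TBD]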

[discrete]
[[semantic-text-semantic-search]]
==== Perform search

[discrete]
[[semantic-search-perform-search]]
===== Perform semantic search

After the data set has been enriched with the embeddings, you can query the data using semantic search.
Provide the `semantic_text` field name and the query text in a `semantic` query type.
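
For instance, a minimal semantic query against the `semantic-embeddings` index could look like the following sketch (the query text is illustrative):

[source,console]
------------------------------------------------------------
GET semantic-embeddings/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "How to avoid muscle soreness while running?"
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]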
@@ -282,6 +408,49 @@ query from the `semantic-embedding` index:
------------------------------------------------------------
// NOTCONSOLE

[discrete]
[[hybrid-search-perform-search]]
===== Perform hybrid search

After reindexing the data into the `semantic-hybrid-embeddings` index, you can perform hybrid search, which combines
both semantic and lexical search. You can perform hybrid search using <<rrf,reciprocal rank fusion (RRF)>>. RRF is a technique
that merges the rankings from both semantic and lexical queries, giving more weight to results that rank high in
either search. This ensures that the final results are balanced and relevant.

[source,console]
------------------------------------------------------------
GET semantic-hybrid-embeddings/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "match": {
                "content": "How to avoid muscle soreness while running?" <1>
              }
            }
          }
        },
        {
          "standard": {
            "query": {
              "semantic": {
                "field": "semantic_text", <2>
                "query": "How to avoid muscle soreness while running?"
              }
            }
          }
        }
      ]
    }
  }
}
------------------------------------------------------------
<1> Lexical search for documents that match the query terms in the `content` field.
<2> Semantic search based on the `semantic_text` field.
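
For reference, RRF computes a document's combined score by summing reciprocal ranks across the retrievers in which it appears; a sketch of the scoring, assuming the default `rank_constant` of 60:

[source,text]
------------------------------------------------------------
rrf_score(doc) = sum over retrievers r of: 1 / (rank_constant + rank_r(doc))
------------------------------------------------------------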

[discrete]
[[semantic-text-further-examples]]
==== Further examples
Expand Down
