diff --git a/autovec-tutorial/README.md b/autovec-tutorial/README.md new file mode 100644 index 0000000..a655b50 --- /dev/null +++ b/autovec-tutorial/README.md @@ -0,0 +1,52 @@ +# Couchbase Capella AI Services Auto-Vectorization with LangChain + +This tutorial shows how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your data into vector embeddings and perform semantic search using LangChain. + +## 📋 Overview + +The main tutorial is contained in the Jupyter notebook `autovec_langchain.ipynb`, which walks you through: + +1. **Couchbase Capella Setup** - Creating an account, cluster, and access controls +2. **Data Upload & Processing** - Using sample data +3. **Model Deployment** - Deploying embedding models for vectorization +4. **Auto-Vectorization Workflow** - Setting up automated embedding generation +5. **LangChain Integration** - Building semantic search applications with vector similarity + +## 🚀 Quick Start + +### Prerequisites + +- Python 3.8 or higher +- A Couchbase Capella account +- Basic understanding of vector databases and embeddings + +### Installation Steps + +1. **Clone or download this repository** + ```bash + git clone https://github.com/couchbase-examples/vector-search-cookbook.git + cd vector-search-cookbook/autovec-tutorial + ``` + +2. **Install Python dependencies** + ```bash + pip install jupyter + pip install couchbase + pip install langchain-couchbase + pip install langchain-nvidia-ai-endpoints + ``` + +3. **Start Jupyter Notebook** + ```bash + jupyter notebook + ``` + or + ```bash + jupyter lab + ``` + +4. **Open the tutorial notebook** + - Navigate to `autovec_langchain.ipynb` in the Jupyter interface + - Follow the step-by-step instructions in the notebook + +**Note**: This tutorial is designed for educational purposes. For production deployments, ensure proper security configurations and SSL/TLS verification. 
diff --git a/autovec-tutorial/autovec_langchain.ipynb b/autovec-tutorial/autovec_langchain.ipynb new file mode 100644 index 0000000..843a16f --- /dev/null +++ b/autovec-tutorial/autovec_langchain.ipynb @@ -0,0 +1,350 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "44480f12-3bd0-4fe9-9493-25bd6a2712bb", + "metadata": {}, + "source": [ + "# Couchbase Capella AI Services Auto-Vectorization Tutorial\n", + "\n", + "This comprehensive tutorial demonstrates how to use Couchbase Capella's new AI Services auto-vectorization feature to automatically convert your data into vector embeddings and perform semantic search using LangChain.\n", + "\n", + "---\n", + "\n", + "## 📚 Table of Contents\n", + "\n", + "1. [Capella Account Setup](#1-create-and-deploy-operational-cluster-on-capella)\n", + "2. [Data Upload and Preparation](#2-data-upload-and-preparation)\n", + "3. [Deploying the Model](#3-deploying-the-model)\n", + "4. [Auto-Vectorization Process](#4-deploying-autovectorization-workflow)\n", + "5. [LangChain Vector Search](#5-vector-search)\n" + ] + }, + { + "cell_type": "markdown", + "id": "502eb13e", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "# 1. Create and Deploy Operational Cluster on Capella\n", + " To get started with Couchbase Capella, create an account and use it to deploy a cluster. For details, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n", + " ### Couchbase Capella Configuration\n", + " When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:\n", + " * Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the travel-sample bucket (Read and Write) used in the application.\n", + " * [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running." 
+ ] + }, + { + "cell_type": "markdown", + "id": "4369c925-adbc-4c7d-9ea6-04ff020cb1a6", + "metadata": {}, + "source": [ + "# 2. Data Upload and Preparation\n", + "\n", + "There are various techniques that exist to insert data into the cluster. To read about the techniques, please follow the [sample-data import](https://docs.couchbase.com/cloud/clusters/data-service/import-data-documents.html#import-sample-data) guide.\n", + "\n", + "After data upload is complete, follow the next steps to achieve vectorization for your required fields." + ] + }, + { + "cell_type": "markdown", + "id": "7e3afd3f-9949-4f5e-b96a-1aac1a3aea29", + "metadata": {}, + "source": [ + "# 3. Deploying the Model\n", + "Now, before we actually create embeddings for the documents, we need to deploy a model that will create the embeddings for us.\n", + "## 3.1: Selecting the Model \n", + "1. To select the model, you first need to navigate to the \"AI Services\" tab, then select \"Models\" and click on \"Deploy New Model\".\n", + " \n", + " \n", + "\n", + "2. Enter the model name, and choose the model that you want to deploy. After selecting your model, choose the model infrastructure and region where the model will be deployed.\n", + " \n", + " \n", + "\n", + "## 3.2 Access Control to the Model\n", + "\n", + "1. After deploying the model, go to the \"Models\" tab in the AI Services and click on \"Setup Access\".\n", + "\n", + " \n", + "\n", + "2. Enter your API key name, expiration time and the IP address from which you will be accessing the model.\n", + "\n", + " \n", + "\n", + "3. Download your API key\n", + "\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "daaf6525-d4e6-45fb-8839-fc7c20081675", + "metadata": {}, + "source": [ + "# 4. Deploying AutoVectorization Workflow\n", + "\n", + "Now, we are at the step that will help us create the embeddings/vectors. To proceed with the vectorization process, please follow the steps below:\n", + "\n", + "1. 
To deploy the auto-vectorization workflow, go to the `AI Services` tab, click `Workflows`, and then click `Create New Workflow`.\n", + "\n", + " \n", + " \n", + "2. Start your workflow deployment by giving it a name and selecting the source of the data for the auto-vectorization service. There are currently three options: `pre-processed data (JSON format) from Capella`, `pre-processed data (JSON format) from external sources (S3 buckets)` and `unstructured data from external sources (S3 buckets)`. For this tutorial, we will choose the first option, which is pre-processed data from Capella.\n", + "\n", + " \n", + "\n", + "3. Now, select the `cluster`, `bucket`, `scope` and `collection` that contain the documents you want to vectorize.\n", + "\n", + " \n", + "\n", + "4. Field mapping tells the auto-vectorization service which data will be converted to embeddings.\n", + "\n", + " There are two options:\n", + "\n", + " - All source fields - This option converts all fields in the document into a single vector field.\n", + " \n", + " \n", + "\n", + "\n", + " - Custom source fields - This option converts specific fields chosen by the user into a single vector field. In the image below, we have chosen `address`, `description` and `id` as the fields to be converted to a vector with the name `vec_addr_decr_id_mapping`.\n", + " \n", + " \n", + " \n", + "5. After choosing the type of mapping, either create a vector index on the new vector embedding field or skip index creation. Skipping is not recommended, because vector search will not work without the index.\n", + "\n", + " \n", + "\n", + "6. The screenshot below summarizes the steps above. Click `Next` to continue.\n", + "\n", + " \n", + "\n", + "\n", + "7. Select the model that will be used to create the embeddings. 
There are two options for creating the embeddings: `capella based` and `external model`.\n", + " \n", + " \n", + "\n", + " - For this tutorial, the Capella-based embedding model is used, as shown in the image above. API credentials can be uploaded from the file downloaded in `step 3.2`, or entered manually.\n", + " - You can choose between private and insecure networking.\n", + " - Clicking `Next` takes you to the final page of the workflow.\n", + "\n", + "\n", + "\n", + "8. `Workflow Summary` displays all the necessary details of the workflow, including `Data Source`, `Model Service` and `Billing Overview`, as shown in the image below.\n", + "\n", + " \n", + "\n", + "\n", + "\n", + "9. `Hurray! Workflow Deployed` In the `workflow` tab, you can now see the deployed workflow and check the status of its run.\n", + "\n", + " \n", + "\n", + "After this step, your vector embeddings for the selected fields should be ready, and you can check them out in the Capella UI. In the next step, we will demonstrate how we can use the generated vectors to perform vector search.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "e50204a4", + "metadata": {}, + "source": [ + "# 5. Vector Search\n", + "\n", + "The following code cells implement semantic vector search against the embeddings generated by the AutoVectorization workflow. \n", + " \n", + "Execute the cells in order to run the vector similarity search." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "30955126-0053-4cec-9dec-e4c05a8de7c3", + "metadata": {}, + "outputs": [], + "source": [ + "from couchbase.cluster import Cluster\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.options import ClusterOptions\n", + "\n", + "from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings \n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore" + ] + }, + { + "cell_type": "markdown", + "id": "e5be1f01", + "metadata": {}, + "source": [ + "# Cluster Connection Setup\n", + " - Defines the secure connection string, user credentials, and creates a `Cluster` object.\n", + " - Disables TLS verification by `options = ClusterOptions(auth, tls_verify='none')` ONLY for quick local testing (not recommended in production) and applies the `wan_development` profile to tune timeouts for higher-latency networks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e4c9e8d", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"couchbases://cb.xyz.com\" # Replace this with Connection String\n", + "username = \"YOUR_USERNAME\" # Replace this with your username\n", + "password = \"YOUR_PASSWORD\" # Replace this with your password\n", + "auth = PasswordAuthenticator(username, password)\n", + "# Configure cluster options with SSL verification disabled for testing; in production you should enable it\n", + "options = ClusterOptions(auth, tls_verify='none')\n", + "options.apply_profile(\"wan_development\")\n", + "cluster = Cluster(endpoint, options)" + ] + }, + { + "cell_type": "markdown", + "id": "bbeb8a4f", + "metadata": {}, + "source": [ + "# Selection of Buckets / Scope / Collection / Index / Embedder\n", + " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n", + " - Specifies the Capella Search index name created (or selected) in Step 4.5.\n", + " - Instantiates the NVIDIA embedding model that will 
transform the user's natural language query into a vector at search time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "799b2efc", + "metadata": {}, + "outputs": [], + "source": [ + "bucket_name = \"travel-sample\"\n", + "scope_name = \"inventory\"\n", + "collection_name = \"hotel\"\n", + "index_name = \"hybrid_autovec_workflow_vec_addr_descr_id\" # This is the name of the search index that was created in step 4.5 and can also be seen in the search tab of the cluster.\n", + " # It should be noted that hybrid_workflow_name_index_fieldname is the naming convention for the index created by AutoVectorization workflow where\n", + " # fieldname is the name of the field being indexed.\n", + "embedder = NVIDIAEmbeddings(\n", + " model=\"nvidia/nv-embedqa-e5-v5\", # This is the model that will be used to create the embedding of the query.\n", + " api_key=\"nvapi-xyz\" # This is the API key that will be used to access your model.\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fda36710", + "metadata": {}, + "source": [ + "# VectorStore Construction\n", + " - Creates a `CouchbaseSearchVectorStore` instance that:\n", + " * Knows where to read documents (`bucket/scope/collection`).\n", + " * Knows the embedding field (the vector produced by the AutoVectorization workflow).\n", + " * Uses the provided embedder to embed queries on-demand.\n", + " - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n", + " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "50b85f78", + "metadata": {}, + "outputs": [], + "source": [ + "vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=bucket_name,\n", + " scope_name=scope_name,\n", + " collection_name=collection_name,\n", + " embedding=embedder,\n", + " index_name=index_name,\n", + " text_key=\"address\", # Your document's text field\n", + " embedding_key=\"vec_addr_descr_id\" # This is the field in which your vector (embedding) is stored in the cluster.\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "be207963", + "metadata": {}, + "source": [ + "# Performing a Similarity Search\n", + " - Defines a natural language query (here, \"Woodhead Road\").\n", + " - Calls `similarity_search(k=3)` to retrieve the top 3 most semantically similar documents.\n", + " - Prints ranked results, extracting a `title` (if present) and the chosen `text_key` (here `address`).\n", + " - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n", + " - Adjust `k` for more or fewer results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "177fd6d5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. Glossop — Address: Woodhead Road\n", + "2. Glossop — Address: 28 Woodhead Road\n", + "3. Hadrian's Wall — Address: Greenhead, Brampton, Cumbria, CA8 7HB\n" + ] + } + ], + "source": [ + "query = \"Woodhead Road\"\n", + "results = vector_store.similarity_search(query, k=3)\n", + "\n", + "# Print out the top-k results\n", + "for rank, doc in enumerate(results, start=1):\n", + " title = doc.metadata.get(\"title\", \"\")\n", + " address_text = doc.page_content\n", + " print(f\"{rank}. {title} — Address: {address_text}\")" + ] + }, + { + "cell_type": "markdown", + "id": "f9e0d863", + "metadata": {}, + "source": [ + "## 6. 
Results and Interpretation\n", + "\n", + "As we can see, 3 (or `k`) ranked results are printed in the output.\n", + "\n", + "### What Each Part Means\n", + "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n", + "- Title: Pulled from `doc.metadata.get(\"title\", \"\")`. If your documents don't contain a `title` field, you will see an empty string.\n", + "- Address text: This is the value of the field you configured as `text_key` (in this tutorial: `address`). It represents the human-readable content we chose to display.\n", + "\n", + "### How the Ranking Works\n", + "1. Your natural language query (e.g., `\"Woodhead Road\"`) is embedded using the NVIDIA model (`nvidia/nv-embedqa-e5-v5`).\n", + "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"vec_addr_descr_id\"`).\n", + "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n", + "\n", + "\n", + "> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query, even when exact keywords do not match. Feel free to experiment with increasingly descriptive queries to observe the semantic power of the embeddings." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "autovec", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/autovec-tutorial/frontmatter.md b/autovec-tutorial/frontmatter.md new file mode 100644 index 0000000..9a29ec6 --- /dev/null +++ b/autovec-tutorial/frontmatter.md @@ -0,0 +1,18 @@ +--- +# frontmatter +path: "/tutorial-couchbase-autovectorization-langchain" +title: Auto-Vectorization with Couchbase Capella AI Services and LangChain +short_title: Auto-Vectorization with Couchbase and LangChain +description: + - Learn how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your data into vector embeddings. + - This tutorial demonstrates how to set up automated embedding generation workflows and perform semantic search using LangChain. 
+content_type: tutorial +filter: sdk +technology: + - vector search +tags: + - LangChain +sdk_language: + - python +length: 20 Mins +--- diff --git a/autovec-tutorial/img/Access_control.png b/autovec-tutorial/img/Access_control.png new file mode 100644 index 0000000..149dcd0 Binary files /dev/null and b/autovec-tutorial/img/Access_control.png differ diff --git a/autovec-tutorial/img/Create_auto_vec.png b/autovec-tutorial/img/Create_auto_vec.png new file mode 100644 index 0000000..61baeae Binary files /dev/null and b/autovec-tutorial/img/Create_auto_vec.png differ diff --git a/autovec-tutorial/img/Select_embedding_model.png b/autovec-tutorial/img/Select_embedding_model.png new file mode 100644 index 0000000..a4236f2 Binary files /dev/null and b/autovec-tutorial/img/Select_embedding_model.png differ diff --git a/autovec-tutorial/img/cluster_cloud_config.png b/autovec-tutorial/img/cluster_cloud_config.png new file mode 100644 index 0000000..c478a83 Binary files /dev/null and b/autovec-tutorial/img/cluster_cloud_config.png differ diff --git a/autovec-tutorial/img/cluster_no_nodes.png b/autovec-tutorial/img/cluster_no_nodes.png new file mode 100644 index 0000000..8a09de4 Binary files /dev/null and b/autovec-tutorial/img/cluster_no_nodes.png differ diff --git a/autovec-tutorial/img/create_cluster.png b/autovec-tutorial/img/create_cluster.png new file mode 100644 index 0000000..8af4219 Binary files /dev/null and b/autovec-tutorial/img/create_cluster.png differ diff --git a/autovec-tutorial/img/deploying_model.png b/autovec-tutorial/img/deploying_model.png new file mode 100644 index 0000000..5b83034 Binary files /dev/null and b/autovec-tutorial/img/deploying_model.png differ diff --git a/autovec-tutorial/img/download_api_key_details.png b/autovec-tutorial/img/download_api_key_details.png new file mode 100644 index 0000000..8ee7dc8 Binary files /dev/null and b/autovec-tutorial/img/download_api_key_details.png differ diff --git a/autovec-tutorial/img/import_sd.png 
b/autovec-tutorial/img/import_sd.png new file mode 100644 index 0000000..e6d1a66 Binary files /dev/null and b/autovec-tutorial/img/import_sd.png differ diff --git a/autovec-tutorial/img/imported_data_hotel.png b/autovec-tutorial/img/imported_data_hotel.png new file mode 100644 index 0000000..1aeb7f8 Binary files /dev/null and b/autovec-tutorial/img/imported_data_hotel.png differ diff --git a/autovec-tutorial/img/importing_model.png b/autovec-tutorial/img/importing_model.png new file mode 100644 index 0000000..41e80e9 Binary files /dev/null and b/autovec-tutorial/img/importing_model.png differ diff --git a/autovec-tutorial/img/login.png b/autovec-tutorial/img/login.png new file mode 100644 index 0000000..30e8b1e Binary files /dev/null and b/autovec-tutorial/img/login.png differ diff --git a/autovec-tutorial/img/login_.png b/autovec-tutorial/img/login_.png new file mode 100644 index 0000000..e171127 Binary files /dev/null and b/autovec-tutorial/img/login_.png differ diff --git a/autovec-tutorial/img/model_api_key_form.png b/autovec-tutorial/img/model_api_key_form.png new file mode 100644 index 0000000..0713a53 Binary files /dev/null and b/autovec-tutorial/img/model_api_key_form.png differ diff --git a/autovec-tutorial/img/model_setup_access.png b/autovec-tutorial/img/model_setup_access.png new file mode 100644 index 0000000..91dfae7 Binary files /dev/null and b/autovec-tutorial/img/model_setup_access.png differ diff --git a/autovec-tutorial/img/node_select_cluster_opt.png b/autovec-tutorial/img/node_select_cluster_opt.png new file mode 100644 index 0000000..a15a0f7 Binary files /dev/null and b/autovec-tutorial/img/node_select_cluster_opt.png differ diff --git a/autovec-tutorial/img/password_cluster.png b/autovec-tutorial/img/password_cluster.png new file mode 100644 index 0000000..85ad736 Binary files /dev/null and b/autovec-tutorial/img/password_cluster.png differ diff --git a/autovec-tutorial/img/select_cluster.png b/autovec-tutorial/img/select_cluster.png new file 
mode 100644 index 0000000..381439f Binary files /dev/null and b/autovec-tutorial/img/select_cluster.png differ diff --git a/autovec-tutorial/img/setup_access.png b/autovec-tutorial/img/setup_access.png new file mode 100644 index 0000000..08bf964 Binary files /dev/null and b/autovec-tutorial/img/setup_access.png differ diff --git a/autovec-tutorial/img/start_workflow.png b/autovec-tutorial/img/start_workflow.png new file mode 100644 index 0000000..23ce813 Binary files /dev/null and b/autovec-tutorial/img/start_workflow.png differ diff --git a/autovec-tutorial/img/vector_all_field_mapping.png b/autovec-tutorial/img/vector_all_field_mapping.png new file mode 100644 index 0000000..8800ac8 Binary files /dev/null and b/autovec-tutorial/img/vector_all_field_mapping.png differ diff --git a/autovec-tutorial/img/vector_custom_field_mapping.png b/autovec-tutorial/img/vector_custom_field_mapping.png new file mode 100644 index 0000000..519c475 Binary files /dev/null and b/autovec-tutorial/img/vector_custom_field_mapping.png differ diff --git a/autovec-tutorial/img/vector_data_source.png b/autovec-tutorial/img/vector_data_source.png new file mode 100644 index 0000000..f9db7e4 Binary files /dev/null and b/autovec-tutorial/img/vector_data_source.png differ diff --git a/autovec-tutorial/img/vector_field_mapping.png b/autovec-tutorial/img/vector_field_mapping.png new file mode 100644 index 0000000..dfdeacf Binary files /dev/null and b/autovec-tutorial/img/vector_field_mapping.png differ diff --git a/autovec-tutorial/img/vector_index.png b/autovec-tutorial/img/vector_index.png new file mode 100644 index 0000000..b52dd9a Binary files /dev/null and b/autovec-tutorial/img/vector_index.png differ diff --git a/autovec-tutorial/img/vector_index_page.png b/autovec-tutorial/img/vector_index_page.png new file mode 100644 index 0000000..3fa8da9 Binary files /dev/null and b/autovec-tutorial/img/vector_index_page.png differ diff --git a/autovec-tutorial/img/workflow.png 
b/autovec-tutorial/img/workflow.png new file mode 100644 index 0000000..fcf8a0c Binary files /dev/null and b/autovec-tutorial/img/workflow.png differ diff --git a/autovec-tutorial/img/workflow_deployed.png b/autovec-tutorial/img/workflow_deployed.png new file mode 100644 index 0000000..224dcfa Binary files /dev/null and b/autovec-tutorial/img/workflow_deployed.png differ diff --git a/autovec-tutorial/img/workflow_summary.png b/autovec-tutorial/img/workflow_summary.png new file mode 100644 index 0000000..f3810c1 Binary files /dev/null and b/autovec-tutorial/img/workflow_summary.png differ