diff --git a/autovec_unstructured/__frontmatter__.md b/autovec_unstructured/__frontmatter__.md
new file mode 100644
index 0000000..a33b8fb
--- /dev/null
+++ b/autovec_unstructured/__frontmatter__.md
@@ -0,0 +1,18 @@
+---
+# frontmatter
+path: "/tutorial-couchbase-autovectorization-langchain"
+title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services
+short_title: Auto-Vectorization on Unstructured Data Stored in S3 Buckets
+description:
+  - Learn how to use Couchbase Capella's AI Services auto-vectorization feature to automatically convert your unstructured data into vector embeddings.
+  - This tutorial demonstrates how to set up automated embedding generation workflows and perform semantic search using LangChain.
+content_type: tutorial
+filter: sdk
+technology:
+  - Artificial Intelligence
+tags:
+  - LangChain
+sdk_language:
+  - python
+length: 20 Mins
+---
diff --git a/autovec_unstructured/autovec_unstructured.ipynb b/autovec_unstructured/autovec_unstructured.ipynb
new file mode 100644
index 0000000..ffde5b4
--- /dev/null
+++ b/autovec_unstructured/autovec_unstructured.ipynb
@@ -0,0 +1,374 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6f623039",
+   "metadata": {
+    "jp-MarkdownHeadingCollapsed": true
+   },
+   "source": [
+    "# Auto-Vectorization on Unstructured Data Stored in S3 Buckets Using Couchbase Capella AI Services\n",
+    "This tutorial demonstrates how to use Couchbase Capella's AI Services auto-vectorization feature to import unstructured data stored in S3 buckets into Capella, automatically convert it into vector embeddings, and perform semantic search using LangChain."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a4d47a8a",
+   "metadata": {},
+   "source": [
+    "# 1. Create and Deploy Your Operational Cluster on Capella\n",
+    "To get started with Couchbase Capella, create an account and use it to deploy a cluster. For details, follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n",
+    "\n",
+    "### Couchbase Capella Configuration\n",
+    "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met:\n",
+    "- Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket you will be using for this tutorial (e.g., `Unstructured_data_bucket`) with Read and Write permissions.\n",
+    "- [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the cluster from the IP address on which the application is running."
+   ]
+  },
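+  {
+   "cell_type": "markdown",
+   "id": "b1c2d3e4",
+   "metadata": {},
+   "source": [
+    "Before moving on, you can optionally confirm that the database credentials and allowed IP address work. The sketch below is a minimal connectivity check, assuming placeholder values for the connection string, username, and password (replace them with your own); Section 4 sets up this connection again in context.\n",
+    "\n",
+    "```python\n",
+    "from datetime import timedelta\n",
+    "\n",
+    "from couchbase.auth import PasswordAuthenticator\n",
+    "from couchbase.cluster import Cluster\n",
+    "from couchbase.options import ClusterOptions\n",
+    "\n",
+    "# Placeholders -- substitute the values from your Capella cluster.\n",
+    "cluster = Cluster(\n",
+    "    \"CLUSTER_CONNECTION_STRING\",\n",
+    "    ClusterOptions(PasswordAuthenticator(\"YOUR_USERNAME\", \"YOUR_PASSWORD\")),\n",
+    ")\n",
+    "cluster.wait_until_ready(timedelta(seconds=10))\n",
+    "print(cluster.ping().as_json())  # Lists reachable services if access is configured correctly\n",
+    "```"
+   ]
+  },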
+  {
+   "cell_type": "markdown",
+   "id": "a08bd871-e20d-4362-b5c1-765737894c65",
+   "metadata": {
+    "jp-MarkdownHeadingCollapsed": true
+   },
+   "source": [
+    "# 2. Deploying the Model\n",
+    "Before we can create embeddings for the documents, we need to deploy a model that will generate them for us.\n",
+    "## 2.1: Selecting the Model\n",
+    "1. Navigate to the \"AI Services\" tab, select \"Models\", and click on \"Deploy New Model\".\n",
+    "\n",
+    "2. Enter the model name and choose the model that you want to deploy. After selecting your model, choose the model infrastructure and region where the model will be deployed.\n",
+    "\n",
+    "## 2.2: Access Control to the Model\n",
+    "\n",
+    "1. After deploying the model, go to the \"Models\" tab in AI Services and click on \"Setup Access\".\n",
+    "\n",
+    "2. Enter your API key name, expiration time, and the IP address from which you will be accessing the model.\n",
+    "\n",
+    "3. Download your API key."
+   ]
+  },
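+  {
+   "cell_type": "markdown",
+   "id": "c4f9a2b7",
+   "metadata": {},
+   "source": [
+    "To confirm the key works before building the workflow, you can call the model's OpenAI-compatible embeddings endpoint directly (Capella Model Services are OpenAI-compatible, as used later in this tutorial). This is a minimal sketch, assuming placeholder values for the endpoint and API key from the file you just downloaded, and that the `openai` package is installed (it is pulled in by `langchain-openai`, installed in Section 4).\n",
+    "\n",
+    "```python\n",
+    "from openai import OpenAI\n",
+    "\n",
+    "# Placeholders -- use the endpoint and key from the downloaded credentials file.\n",
+    "client = OpenAI(\n",
+    "    base_url=\"CAPELLA_MODEL_ENDPOINT/v1\",\n",
+    "    api_key=\"CAPELLA_MODEL_KEY\",\n",
+    ")\n",
+    "response = client.embeddings.create(\n",
+    "    model=\"nvidia/nv-embedqa-e5-v5\",\n",
+    "    input=[\"hello world\"],\n",
+    ")\n",
+    "print(len(response.data[0].embedding))  # Embedding dimensionality\n",
+    "```"
+   ]
+  },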
+  {
+   "cell_type": "markdown",
+   "id": "e7552113",
+   "metadata": {},
+   "source": [
+    "# 3. Data Upload from an S3 Bucket to Couchbase (with Chunking and Vectorization)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fc1b64dd-495b-4358-b732-d01856713b70",
+   "metadata": {},
+   "source": [
+    "To import unstructured data from an S3 bucket, you need to create a workflow that connects to your S3 bucket and chunks your unstructured data before importing it into the collections. To do so, follow the steps below:\n",
+    "1) Start by creating a new workflow: click on the `AI Services` tab, then `Workflows`, and then `Create New Workflow`.\n",
+    "\n",
+    "2) Begin your workflow deployment by giving it a name and selecting where your data will be provided to the auto-vectorization service. There are currently three options: `pre-processed data (JSON format) from Capella`, `pre-processed data (JSON format) from external sources (S3 buckets)`, and `unstructured data from external sources (S3 buckets)`. For this tutorial, we will choose the third option, unstructured data from external sources (S3 buckets). After selecting the workflow type, enter the workflow name and click on `Start Workflow`.\n",
+    "\n",
+    "3) To proceed, Capella needs to connect to your S3 bucket, which will be the source of the data. To do so, click on `+ Add New S3 Bucket`.\n",
+    "\n",
+    "4) Upon clicking `+ Add New S3 Bucket`, a new sidebar will appear that asks for the credentials of your S3 bucket.\n",
+    "\n",
+    "   - Enter the `Integration Name`, which will later be used to select your S3 bucket.\n",
+    "   - Select the AWS region where the bucket is deployed.\n",
+    "   - Enter the name of the S3 bucket deployed in AWS.\n",
+    "   - Enter the path where your unstructured data is present.\n",
+    "   - Enter your S3 bucket credentials.\n",
+    "   - Click on `Add Credentials`.\n",
+    "5) If the steps above are followed correctly, you should see a success pop-up as shown below, and the S3 bucket can then be selected from the drop-down menu.\n",
+    "\n",
+    "6) On selecting the S3 bucket, the options described below are displayed.\n",
+    "\n",
+    "- `Index Configuration` allows you to create a Search index on the generated embeddings of the imported data. If it is skipped, vector search will not be enabled and you will need to create the index later.\n",
+    "- `Destination Cluster` lets you choose the cluster, bucket, scope, and collection into which the data is imported.\n",
+    "- The `Estimated Cost` dialog box (in blue, on the right) shows the cost of the operation per document.\n",
+    "- Click on `Next`.\n",
+    "\n",
+    "7) `Configure Data Preprocessing` allows you to perform various operations on the data being imported from the S3 bucket, as described below.\n",
+    "\n",
+    "- `Page Range Selection` allows you to select a custom page range when working with PDFs. (Optional)\n",
+    "- `Layout Exclusions` allows you to skip unnecessary objects in your unstructured data. (Optional)\n",
+    "- `Optical Character Recognition (OCR)` allows you to detect text in images and PDFs. (Optional)\n",
+    "- `Chunking Strategy` is an important step for importing data and creating embeddings (vectors) in Capella:\n",
+    "  - The `Strategy` drop-down menu selects the strategy used to chunk the data; the best choice depends on the kind of data present in the S3 bucket.\n",
+    "  - `Max Token in Chunk` sets the number of tokens in each chunk.\n",
+    "  - `Chunk Overlap` sets the number of tokens that overlap between consecutive chunks, which helps preserve context across chunk boundaries.\n",
+    "- Click `Next` after the options above are configured to your requirements.\n",
+    "\n",
+    "8) Select the model that will be used to create the embeddings. There are two options: a `Capella-based` model or an `external model`.\n",
+    "\n",
+    "   - For this tutorial, a Capella-based embedding model is used, as shown in the image above. API credentials can be uploaded using the file downloaded in step 2.2, or they can be entered manually.\n",
+    "   - You can choose between private and insecure networking.\n",
+    "   - Clicking `Next` takes you to the final page of the workflow.\n",
+    "\n",
+    "9) `Workflow Summary` displays all the necessary details of the workflow, including `Data Source`, `Model Service`, `Unstructured Data Service`, and `Billing Overview`, as shown in the image below.\n",
+    "\n",
+    "10) Hurray! The workflow is deployed. In the `Workflows` tab you can now see the deployed workflow and check the status of its runs.\n",
+    "\n",
+    "After this step, your vector embeddings for the selected fields should be ready, and you can inspect them in the Capella UI. In the next step, we will demonstrate how to use the generated vectors to perform vector search."
+   ]
+  },
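+  {
+   "cell_type": "markdown",
+   "id": "d5e6f7a8",
+   "metadata": {},
+   "source": [
+    "If you prefer to inspect the imported chunks programmatically rather than in the UI, a quick SQL++ query works well. This is a sketch, assuming the `cluster` connection object created in Section 4 below and the bucket, collection, and field names used elsewhere in this tutorial (`Unstructured_data_bucket`, `_default`, `text-to-embed`, `text-embedding`).\n",
+    "\n",
+    "```python\n",
+    "# Peek at a few imported documents: the chunk text and the size of its embedding.\n",
+    "rows = cluster.query(\n",
+    "    \"SELECT meta().id, `text-to-embed`, ARRAY_LENGTH(`text-embedding`) AS dims \"\n",
+    "    \"FROM `Unstructured_data_bucket`.`_default`.`_default` LIMIT 3\"\n",
+    ")\n",
+    "for row in rows:\n",
+    "    print(row[\"id\"], row[\"dims\"], row[\"text-to-embed\"][:80])\n",
+    "```"
+   ]
+  },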
\n", + "\n", + "Before you proceed, make sure the following packages are installed by running: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0298d27f-ee03-4de2-829d-b653c39746a9", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install couchbase langchain-couchbase langchain-openai" + ] + }, + { + "cell_type": "markdown", + "id": "ea920e0f-bd81-4a74-841a-86a11cb8aec4", + "metadata": {}, + "source": [ + "`couchbase - Version: 4.4.0` \\\n", + "`langchain-couchbase - Version: 0.4.0` \\\n", + "`pip install langchain-openai - Version: 0.3.34` \n", + "\n", + "Now, please proceed to execute the cells in order to run the vector similarity search.\n", + "\n", + "# Importing Required Packages" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5e8ba0fc", + "metadata": {}, + "outputs": [], + "source": [ + "from couchbase.cluster import Cluster\n", + "from couchbase.auth import PasswordAuthenticator\n", + "from couchbase.options import ClusterOptions\n", + "\n", + "from langchain_openai import OpenAIEmbeddings\n", + "from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore" + ] + }, + { + "cell_type": "markdown", + "id": "4f8428f2-f923-42df-bf7d-beacf5b38f16", + "metadata": {}, + "source": [ + "# Cluster Connection Setup\n", + " - Defines the secure connection string, user credentials, and creates a `Cluster` object." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f44ea528-1ec1-41ce-90db-bdd0d87b5cff", + "metadata": {}, + "outputs": [], + "source": [ + "endpoint = \"CLUSTER_CONNECTION_STRING\" # Replace this with Connection String\n", + "username = \"YOUR_USERNAME\" # Replace this with your username\n", + "password = \"YOUR_PASSWORD\" # Replace this with your password\n", + "auth = PasswordAuthenticator(username, password)\n", + "\n", + "options = ClusterOptions(auth)\n", + "cluster = Cluster(endpoint, options)\n", + "\n", + "cluster.wait_until_ready(timedelta(seconds=5))" + ] + }, + { + "cell_type": "markdown", + "id": "c0874f89", + "metadata": {}, + "source": [ + "# Selection of Buckets / Scope / Collection / Index / Embedder\n", + " - Sets the bucket, scope, and collection where the documents (with vector fields) live.\n", + " - `index_name` specifies the Capella Search index name.\n", + " - `embedder` instantiates the NVIDIA embedding model that will transform the user's natural language query into a vector at search time.\n", + " - `open_api_key` is the api key token created in `step 2.3`.\n", + " - `open_api_base` is the Capella model services endpoint found in the models section.\n", + "\n", + "`Note that the Capella AI Endpoint also requires an additional /v1 from the endpoint if not shown on the UI`" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "1d77404b", + "metadata": {}, + "outputs": [], + "source": [ + "bucket_name = \"Unstructured_data_bucket\"\n", + "scope_name = \"_default\"\n", + "collection_name = \"_default\"\n", + "index_name = \"search_autovec_workflow_text-embedding\" # This is the name of the search index that was created in step 3.6 and can also be seen in the search tab of the cluster.\n", + " \n", + "# Using the OpenAI SDK for the embeddings with the capella model services and they are compatible with the OpenAIEmbeddings class in Langchain\n", + "embedder = OpenAIEmbeddings(\n", + " model=\"nvidia/nv-embedqa-e5-v5\", # This is the model that will be used to create the embedding of the query.\n", + " openai_api_key=\"CAPELLA_MODEL_KEY\",\n", + " 
openai_api_base=\"CAPELLA_MODEL_ENDPOINT/v1\",\n", + " check_embedding_ctx_length=False,\n", + " tiktoken_enabled=False, \n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a1b9ac43", + "metadata": {}, + "source": [ + "# VectorStore Construction\n", + " - Creates a `CouchbaseSearchVectorStore` instance that:\n", + " * Knows where to read documents (`bucket/scope/collection`).\n", + " * Knows the embedding field (the vector produced by the AutoVectorization workflow).\n", + " * Uses the provided embedder to embed queries on-demand.\n", + " - If your AutoVectorization workflow produced a different vector field name, update `embedding_key` accordingly.\n", + " - If you mapped multiple fields into a single vector, you can choose any representative field for `text_key`, or modify the VectorStore wrapper to concatenate fields." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "8efd0e80", + "metadata": {}, + "outputs": [], + "source": [ + "vector_store = CouchbaseSearchVectorStore(\n", + " cluster=cluster,\n", + " bucket_name=bucket_name,\n", + " scope_name=scope_name,\n", + " collection_name=collection_name,\n", + " embedding=embedder,\n", + " index_name=index_name,\n", + " text_key=\"text-to-embed\", # Your document's text field\n", + " embedding_key=\"text-embedding\" # This is the field in which your vector (embedding) is stored in the cluster.\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "17adeeed", + "metadata": {}, + "source": [ + "# Performing a Similarity Search\n", + " - Defines a natural language query (e.g., \"USA\").\n", + " - Calls `similarity_search(k=3)` to retrieve the top 3 most semantically similar documents.\n", + " - Prints ranked results, extracting the chosen `text_key` (here `text-to-embed`).\n", + " - Change `query` to any descriptive phrase (e.g., \"beach resort\", \"airport hotel near NYC\").\n", + " - Adjust `k` for more or fewer results." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "eb87c6e6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1. — Score: 0.8052 — Content: Section Title: Set Up the Java SDK\n", + "Content: Run the command mvn install to pull in all the dependencies and finish your SDK setup.\n", + "2. — Score: 0.7971 — Content: Section Title: Set Up the Java SDK\n", + "Content: To set up the Java SDK: Create the following directory structure on your computer: In the student directory, create a new file called pom. xml. Paste the following code block into your pom. xm1 file: Open a terminal window and navigate to your student directory.\n", + "3. — Score: 0.7745 — Content: Section Title: Prerequisites\n", + "Content: e You have installed the Java Software Development Kit (version 8, 11, 17, or 21). o The recommended version is the latest Java LTS release. Make sure to install the highest available patch for the LTS version.\n" + ] + } + ], + "source": [ + "query = \"How to setup java SDK?\"\n", + "results = vector_store.similarity_search_with_score(query, k=3)\n", + "\n", + "for rank, (doc, score) in enumerate(results, start=1):\n", + " text = getattr(doc, \"page_content\", None)\n", + " print(f\"{rank}. 
+  {
+   "cell_type": "markdown",
+   "id": "17adeeed",
+   "metadata": {},
+   "source": [
+    "# Performing a Similarity Search\n",
+    " - Defines a natural language query (e.g., \"How to setup java SDK?\").\n",
+    " - Calls `similarity_search_with_score` with `k=3` to retrieve the top 3 most semantically similar documents along with their scores.\n",
+    " - Prints ranked results, extracting the chosen `text_key` (here `text-to-embed`).\n",
+    " - Change `query` to any phrase that describes your data (e.g., \"Maven dependency setup\", \"supported Java versions\").\n",
+    " - Adjust `k` for more or fewer results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "eb87c6e6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1. — Score: 0.8052 — Content: Section Title: Set Up the Java SDK\n",
+      "Content: Run the command mvn install to pull in all the dependencies and finish your SDK setup.\n",
+      "2. — Score: 0.7971 — Content: Section Title: Set Up the Java SDK\n",
+      "Content: To set up the Java SDK: Create the following directory structure on your computer: In the student directory, create a new file called pom. xml. Paste the following code block into your pom. xm1 file: Open a terminal window and navigate to your student directory.\n",
+      "3. — Score: 0.7745 — Content: Section Title: Prerequisites\n",
+      "Content: e You have installed the Java Software Development Kit (version 8, 11, 17, or 21). o The recommended version is the latest Java LTS release. Make sure to install the highest available patch for the LTS version.\n"
+     ]
+    }
+   ],
+   "source": [
+    "query = \"How to setup java SDK?\"\n",
+    "results = vector_store.similarity_search_with_score(query, k=3)\n",
+    "\n",
+    "for rank, (doc, score) in enumerate(results, start=1):\n",
+    "    text = getattr(doc, \"page_content\", None)\n",
+    "    print(f\"{rank}. — Score: {score:.4f} — Content: {text}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b5ab91ee",
+   "metadata": {},
+   "source": [
+    "# Results and Interpretation\n",
+    "\n",
+    "As we can see, 3 (or `k`) ranked results are printed in the output.\n",
+    "\n",
+    "### What Each Part Means\n",
+    "- Leading number (1, 2, 3): The result rank (1 = most similar to your query).\n",
+    "- Score: The similarity score computed by the Search service (higher = more similar).\n",
+    "- Content text: This is the value of the field you configured as `text_key` (in this tutorial: `text-to-embed`). It represents the human-readable content we chose to display.\n",
+    "\n",
+    "### How the Ranking Works\n",
+    "1. Your natural language query (e.g., `query = \"How to setup java SDK?\"`) is embedded using the NVIDIA model (`nvidia/nv-embedqa-e5-v5`).\n",
+    "2. The vector store compares the query embedding to stored document embeddings in the field you configured (`embedding_key = \"text-embedding\"`).\n",
+    "3. Results are sorted by vector similarity. Higher similarity = closer semantic meaning.\n",
+    "\n",
+    "> Your vector search pipeline is working if the returned documents feel meaningfully related to your natural language query, even when exact keywords do not match. Feel free to experiment with increasingly descriptive queries to observe the semantic power of the embeddings."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/autovec_unstructured/img/S3bucketsuccess.png b/autovec_unstructured/img/S3bucketsuccess.png
new file mode 100644
index 0000000..9aa75d6
Binary files /dev/null and b/autovec_unstructured/img/S3bucketsuccess.png differ
diff --git a/autovec_unstructured/img/S3credentials.png b/autovec_unstructured/img/S3credentials.png
new file mode 100644
index 0000000..c9a74a4
Binary files /dev/null and b/autovec_unstructured/img/S3credentials.png differ
diff --git a/autovec_unstructured/img/Select_embedding_model.png b/autovec_unstructured/img/Select_embedding_model.png
new file mode 100644
index 0000000..a4236f2
Binary files /dev/null and b/autovec_unstructured/img/Select_embedding_model.png differ
diff --git a/autovec_unstructured/img/addS3bucket.png b/autovec_unstructured/img/addS3bucket.png
new file mode 100644
index 0000000..9566d59
Binary files /dev/null and b/autovec_unstructured/img/addS3bucket.png differ
diff --git a/autovec_unstructured/img/configure_data_source.png b/autovec_unstructured/img/configure_data_source.png
new file mode 100644
index 0000000..12c9797
Binary files /dev/null and b/autovec_unstructured/img/configure_data_source.png differ
diff --git a/autovec_unstructured/img/data_processing.png b/autovec_unstructured/img/data_processing.png
new file mode 100644
index 0000000..46cf504
Binary files /dev/null and b/autovec_unstructured/img/data_processing.png differ
diff --git a/autovec_unstructured/img/deploying_model.png b/autovec_unstructured/img/deploying_model.png
new file mode 100644
index 0000000..5b83034
Binary files /dev/null and b/autovec_unstructured/img/deploying_model.png differ
diff --git a/autovec_unstructured/img/download_api_key_details.png b/autovec_unstructured/img/download_api_key_details.png
new file mode 100644
index 0000000..8ee7dc8
Binary files /dev/null and b/autovec_unstructured/img/download_api_key_details.png differ
diff --git a/autovec_unstructured/img/importing_model.png b/autovec_unstructured/img/importing_model.png
new file mode 100644
index 0000000..41e80e9
Binary files /dev/null and b/autovec_unstructured/img/importing_model.png differ
diff --git a/autovec_unstructured/img/model_api_key_form.png b/autovec_unstructured/img/model_api_key_form.png
new file mode 100644
index 0000000..0713a53
Binary files /dev/null and b/autovec_unstructured/img/model_api_key_form.png differ
diff --git a/autovec_unstructured/img/model_setup_access.png b/autovec_unstructured/img/model_setup_access.png
new file mode 100644
index 0000000..91dfae7
Binary files /dev/null and b/autovec_unstructured/img/model_setup_access.png differ
diff --git a/autovec_unstructured/img/start_workflow.png b/autovec_unstructured/img/start_workflow.png
new file mode 100644
index 0000000..1d025b7
Binary files /dev/null and b/autovec_unstructured/img/start_workflow.png differ
diff --git a/autovec_unstructured/img/workflow.png b/autovec_unstructured/img/workflow.png
new file mode 100644
index 0000000..fcf8a0c
Binary files /dev/null and b/autovec_unstructured/img/workflow.png differ
diff --git a/autovec_unstructured/img/workflow_deployed.png b/autovec_unstructured/img/workflow_deployed.png
new file mode 100644
index 0000000..7386879
Binary files /dev/null and b/autovec_unstructured/img/workflow_deployed.png differ
diff --git a/autovec_unstructured/img/workflow_summary.png b/autovec_unstructured/img/workflow_summary.png
new file mode 100644
index 0000000..ffcded6
Binary files /dev/null and b/autovec_unstructured/img/workflow_summary.png differ
diff --git a/autovec_unstructured/sample.pdf b/autovec_unstructured/sample.pdf
new file mode 100644
index 0000000..381c774
Binary files /dev/null and b/autovec_unstructured/sample.pdf differ