diff --git a/authors.yaml b/authors.yaml index bbc617f0ad..064043adbc 100644 --- a/authors.yaml +++ b/authors.yaml @@ -261,4 +261,9 @@ thli-openai: erikakettleson-openai: name: "Erika Kettleson" website: "https://www.linkedin.com/in/erika-kettleson-85763196/" - avatar: "https://avatars.githubusercontent.com/u/186107044?v=4" \ No newline at end of file + avatar: "https://avatars.githubusercontent.com/u/186107044?v=4" + +cornillon-google: + name: "Matt Cornillon" + website: "https://www.linkedin.com/in/matt-cornillon/" + avatar: "https://avatars.githubusercontent.com/u/486239?v=4" \ No newline at end of file diff --git a/examples/vector_databases/alloydb/alloydb_postgres_vector_search_scann.ipynb b/examples/vector_databases/alloydb/alloydb_postgres_vector_search_scann.ipynb new file mode 100644 index 0000000000..d327f68870 --- /dev/null +++ b/examples/vector_databases/alloydb/alloydb_postgres_vector_search_scann.ipynb @@ -0,0 +1,476 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "4gyKWv8errxF" + }, + "source": [ + "# Semantic search with AlloyDB for PostgreSQL and OpenAI\n", + "\n", + "This notebook walks you through using [Google Cloud AlloyDB](https://cloud.google.com/products/alloydb) as a vector database for OpenAI embeddings. It demonstrates how to:\n", + "\n", + "1. Use and create embeddings from OpenAI API.\n", + "2. Store embeddings in an AlloyDB database.\n", + "3. Use AlloyDB with the `pgvector` extension to perform vector similarity search.\n", + "4. Create and use ScaNN indexes to boost your queries" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VsrfqaUfrrxI" + }, + "source": [ + "## Before you begin\n", + "\n", + "To run this notebook, you will need the following:\n", + "\n", + " * [A Google Cloud Project](https://developers.google.com/workspace/guides/create-project).\n", + " * [An AlloyDB cluster and instance](https://cloud.google.com/alloydb/docs/cluster-create). You can create a free trial cluster, see instructions here [AlloyDB Free Trial Cluster](https://cloud.google.com/alloydb/docs/free-trial-cluster) or use a [local AlloyDB Omni](https://cloud.google.com/alloydb/omni/current/docs/overview).\n", + " * [An OpenAI API key](https://platform.openai.com/account/api-keys).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LGzfbR2yrrxI" + }, + "source": [ + "## Install required modules\n", + "\n", + "This notebook requires several packages that you can install with `pip`:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-JxOphmrrrxI" + }, + "outputs": [], + "source": [ + "!python3 -m pip install -qU openai psycopg2-binary pandas wget" + ] + }, + { + "cell_type": "markdown", + "source": [ + "## Set your AlloyDB database values and Open AI API key\n", + "Find your database values in the [AlloyDB Instances page](https://console.cloud.google.com/alloydb/clusters) and your OpenAI API key on [your account](https://platform.openai.com/account/api-keys)." + ], + "metadata": { + "id": "OSNGkdEU6SBm" + } + }, + { + "cell_type": "code", + "source": [ + "# @title Set Your Values Here { display-mode: \"form\" }\n", + "HOST = \"xx.xx.xx.xx\" # @param {type: \"string\"}\n", + "PORT = \"5432\" # @param {type: \"string\"}\n", + "DATABASE = \"postgres\" # @param {type: \"string\"}\n", + "USER = \"postgres\" # @param {type: \"string\"}\n", + "PASSWORD = \"secure-password\" # @param {type: \"string\"}\n", + "OPENAI_KEY = \"sk-xxx\" # @param {type: \"string\"}" + ], + "metadata": { + "id": "ZOo7D2Wu460m" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Connect to your AlloyDB instance\n", + "Using the values you just set, you can now connect to your AlloyDB for Postgres." + ], + "metadata": { + "id": "_FTQ16dDsUqp" + } + }, + { + "cell_type": "code", + "source": [ + "import psycopg2\n", + "\n", + "# Instantiate connection\n", + "connection = psycopg2.connect(\n", + " host=HOST,\n", + " port=PORT,\n", + " database=DATABASE,\n", + " user=USER,\n", + " password=PASSWORD\n", + ")\n", + "connection.set_session(autocommit=True)\n", + "\n", + "# Create a new cursor object\n", + "cursor = connection.cursor()" + ], + "metadata": { + "id": "8yN47hdfBpNk" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Test the connection" + ], + "metadata": { + "id": "1yyIlaj6sv1w" + } + }, + { + "cell_type": "code", + "source": [ + "# Execute a simple query to test the connection\n", + "cursor.execute(\"SELECT 'AlloyDB is amazing!';\")\n", + "result = cursor.fetchone()\n", + "\n", + "# Check the query result\n", + "if result == (\"AlloyDB is amazing!\",):\n", + " print(\"You are ready to query!\")\n", + "else:\n", + " print(\"Connection failed.\")" + ], + "metadata": { + "id": "iCIODOUIDNUm" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Set up your AlloyDB cluster for vector search\n", + "\n", + "To store your embeddings and execute similarity search queries, we will use the PostgreSQL `pgvector` extension.\n", + "\n", + "As for the vector indexes, we will use Google ScaNN through the extension `alloydb_scann`. More information on this unique index from Google [can be found here](https://cloud.google.com/blog/products/databases/understanding-the-scann-index-in-alloydb)." + ], + "metadata": { + "id": "XyfkzoEOs4uw" + } + }, + { + "cell_type": "code", + "source": [ + "create_extensions_stmt = '''\n", + "CREATE EXTENSION IF NOT EXISTS vector;\n", + "CREATE EXTENSION IF NOT EXISTS alloydb_scann;\n", + "'''\n", + "\n", + "# Execute the SQL statements\n", + "cursor.execute(create_extensions_stmt)" + ], + "metadata": { + "id": "vtfzcFgBEOPj" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Now we will create the table that will store our data and our embeddings, using the `vector` data type." + ], + "metadata": { + "id": "T0g86mCkuduo" + } + }, + { + "cell_type": "code", + "source": [ + "create_table_stmt = f'''\n", + "DROP TABLE IF EXISTS vector_store;\n", + "CREATE TABLE IF NOT EXISTS vector_store (\n", + " id INTEGER NOT NULL,\n", + " url TEXT,\n", + " title TEXT,\n", + " content TEXT,\n", + " title_vector vector(1536),\n", + " content_vector vector(1536),\n", + " vector_id INT\n", + ")\n", + "'''\n", + "\n", + "# Execute the SQL statements\n", + "cursor.execute(create_table_stmt)" + ], + "metadata": { + "id": "QMajDx0388Rx" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "And lastly, we create the vector indexes with Google ScaNN." + ], + "metadata": { + "id": "xqiRAyQZuy6i" + } + }, + { + "cell_type": "code", + "source": [ + "create_scann_index_stmt = f'''\n", + "CREATE INDEX scann_index_title ON vector_store\n", + " USING scann (title_vector cosine)\n", + " WITH (num_leaves = 200, max_num_levels = 2);\n", + "\n", + "CREATE INDEX scann_index_content ON vector_store\n", + " USING scann (content_vector cosine)\n", + " WITH (num_leaves = 200, max_num_levels = 2);\n", + "'''\n", + "\n", + "# Execute the SQL statements\n", + "cursor.execute(create_scann_index_stmt)" + ], + "metadata": { + "id": "SC0DSqQUNCpO" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Load the data into AlloyDB\n", + "\n", + "In this cookbook, we are gonna use the Simple English Wikipedia dataset hosted by OpenAI, with pre-calculated embeddings.\n", + "\n", + "It contains 25000 rows, import will take few minutes." + ], + "metadata": { + "id": "bHVSGLdBu5Yf" + } + }, + { + "cell_type": "code", + "source": [ + "import wget\n", + "\n", + "embeddings_url = \"https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip\"\n", + "\n", + "# The file is ~700 MB so this will take some time\n", + "wget.download(embeddings_url)" + ], + "metadata": { + "id": "7Pg34IqfGLVU" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import zipfile\n", + "import os\n", + "import re\n", + "import tempfile\n", + "\n", + "current_directory = os.getcwd()\n", + "zip_file_path = os.path.join(current_directory, \"vector_database_wikipedia_articles_embedded.zip\")\n", + "output_directory = os.path.join(current_directory, \"../../data\")\n", + "\n", + "with zipfile.ZipFile(zip_file_path, \"r\") as zip_ref:\n", + " zip_ref.extractall(output_directory)\n", + "\n", + "# Check to see if the csv file was extracted\n", + "file_name = \"vector_database_wikipedia_articles_embedded.csv\"\n", + "data_directory = os.path.join(current_directory, \"../../data\")\n", + "file_path = os.path.join(data_directory, file_name)\n", + "\n", + "if os.path.exists(file_path):\n", + " print(f\"The data file {file_name} exists.\")\n", + "else:\n", + " print(f\"The data file {file_name} does not exist.\")" + ], + "metadata": { + "id": "tW662czfGjx4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We can have a quick look at the data before the ingestion." + ], + "metadata": { + "id": "mRl7nm1_vimg" + } + }, + { + "cell_type": "code", + "source": [ + "import pandas, json\n", + "data = pandas.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')\n", + "data" + ], + "metadata": { + "id": "V4LPgPMOITDd" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "And now we load the data into AlloyDB." + ], + "metadata": { + "id": "31ClylyVvtO6" + } + }, + { + "cell_type": "code", + "source": [ + "import csv\n", + "\n", + "csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'\n", + "\n", + "\n", + "with open(csv_file_path, 'r', encoding='utf-8') as file:\n", + " reader = csv.reader(file)\n", + " header = next(reader) # Skip the header row\n", + "\n", + " copy_command = f\"\"\"\n", + " COPY vector_store (id, url, title, content, title_vector, content_vector, vector_id)\n", + " FROM STDIN WITH (FORMAT CSV, HEADER FALSE, DELIMITER ',');\n", + " \"\"\"\n", + "\n", + " cursor.copy_expert(copy_command, file)\n", + "\n", + "print(f\"Successfully imported data from '{csv_file_path}' into 'vector_store'.\")" + ], + "metadata": { + "id": "1UBO8zfWJBj0" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Semantic Search with AlloyDB for PostgreSQL\n", + "\n", + "Now that our table is filled with data and embeddings, we are ready to execute similarity search queries.\n", + "\n", + "In this example, we are gonna use the cosine distance to compare our vectors and the embedding of our search string, generated by the OpenAI API." + ], + "metadata": { + "id": "kvD0ghW3v0rR" + } + }, + { + "cell_type": "code", + "source": [ + "from openai import OpenAI\n", + "\n", + "def semantic_search(query, vector_name=\"title_vector\", top_k=20):\n", + " # Creates embedding vector from user query\n", + " client = OpenAI(api_key=OPENAI_KEY)\n", + " embedded_query = client.embeddings.create(\n", + " input=query,\n", + " model=\"text-embedding-ada-002\",\n", + " ).data[0].embedding\n", + "\n", + " # Convert the embedded_query to PostgreSQL compatible format\n", + " embedded_query_pg = \"[\" + \",\".join(map(str, embedded_query)) + \"]\"\n", + "\n", + " # Create SQL query\n", + " query_sql = f\"\"\"\n", + " SELECT id, url, title, cosine_distance({vector_name},'{embedded_query_pg}') AS cosine_distance\n", + " FROM vector_store\n", + " ORDER BY {vector_name} <=> '{embedded_query_pg}'\n", + " LIMIT {top_k};\n", + " \"\"\"\n", + "\n", + " # Execute the query\n", + " cursor.execute(query_sql)\n", + " results = cursor.fetchall()\n", + "\n", + " return results" + ], + "metadata": { + "id": "HYjWJfpsOnxb" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "query_results = semantic_search(\"French history\", \"content_vector\", 20)\n", + "for i, result in enumerate(query_results):\n", + " print(f\"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})\")" + ], + "metadata": { + "id": "4203uOkxXODl" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "1. French Revolution (Score: 0.847)\n", + "2. La Marseillaise (Score: 0.832)\n", + "3. New France (Score: 0.83)\n", + "4. Bastille (Score: 0.829)\n", + "5. Restauration (Score: 0.823)\n", + "6. Angevin (Score: 0.821)\n", + "7. Salic law (Score: 0.82)\n", + "8. Fin de siècle (Score: 0.82)\n", + "9. Tennis Court Oath (Score: 0.82)\n", + "10. History of Spain (Score: 0.818)\n", + "11. Gaul (Score: 0.817)\n", + "12. Dreyfus Affair (Score: 0.816)\n", + "13. Declaration of the Rights of Man and of the Citizen (Score: 0.815)\n", + "14. Siege of Orleans (Score: 0.814)\n", + "15. Palace of Versailles (Score: 0.814)\n", + "16. Franco-Prussian War (Score: 0.811)\n", + "17. Arc de Triomphe (Score: 0.811)\n", + "18. Louise Michel Battalions (Score: 0.808)\n", + "19. Treaty of Verdun (Score: 0.808)\n", + "20. Bastide (Score: 0.808)" + ], + "metadata": { + "id": "L7JP2MMlh5lC" + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "orig_nbformat": 4, + "colab": { + "provenance": [], + "toc_visible": true, + "name": "neon-postgres-vector-search-pgvector.ipynb" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/registry.yaml b/registry.yaml index 596c29b94e..ba4674bc38 100644 --- a/registry.yaml +++ b/registry.yaml @@ -1847,3 +1847,10 @@ - audio - speech +- title: Google AlloyDB for PostgreSQL as a vector database + path: examples/vector_databases/alloydb/alloydb_postgres_vector_search_scann.ipynb + date: 2025-03-26 + authors: + - cornillon-google + tags: + - embeddings \ No newline at end of file