diff --git a/quick-search-engine/quick-search-engine.ipynb b/quick-search-engine/quick-search-engine.ipynb new file mode 100644 index 0000000..524fa0e --- /dev/null +++ b/quick-search-engine/quick-search-engine.ipynb @@ -0,0 +1,550 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b549c3c7", + "metadata": {}, + "source": [ + "# Overview\n", + "\n", + "### Quora Question Pairs\n", + "\n", + "It is a large corpus of questions used to detect similar or duplicate questions by understanding their semantic meaning.\n", + "\n", + "### Qdrant\n", + "\n", + "Qdrant is an open-source vector database and vector search engine written in Rust. It provides a fast and scalable vector similarity search service.\n", + "\n", + "### Abstract\n", + "\n", + "This notebook implements a search engine using the `Quora Duplicate Questions` dataset and the Qdrant Python client. It aims to find questions similar to a user's query.\n", + "\n", + "### Methodology\n", + "\n", + "Here is an overview of the implementation:\n", + "\n", + "- Load the Quora dataset and apply preprocessing steps.\n", + "- Vectorize the questions and store them in a vector space, where user queries are vectorized and compared in the same space. All of these steps are handled internally by the Qdrant client.\n", + "- Run several example queries to demonstrate the search engine.\n", + "\n", + "### Summary\n", + "\n", + "In summary, the notebook demonstrates how easily and efficiently a complete search engine can be built with the Qdrant vector database and client.\n", + "\n", + "### Explore More!\n", + "\n", + "- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1)\n", + "- [E-Commerce Products Search Engine Using 
Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant)\n", + "- [Qdrant](https://qdrant.tech)\n", + "- [Qdrant Documentation](https://qdrant.tech/documentation/)\n", + "- [Qdrant Python Client Documentation](https://python-client.qdrant.tech)\n", + "- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs)\n" + ] + }, + { + "cell_type": "markdown", + "id": "fcc4c336", + "metadata": { + "papermill": { + "duration": 0.007079, + "end_time": "2024-02-12T18:05:24.624181", + "exception": false, + "start_time": "2024-02-12T18:05:24.617102", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Dataset" + ] + }, + { + "cell_type": "markdown", + "id": "1289b4eb", + "metadata": {}, + "source": [ + "### Loading\n", + "1. Install `datasets` library\n", + "2. Load `Quora` dataset\n", + "3. Extract Questions\n", + "4. Concatenate all the questions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "827c1ad1", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "a790c935", + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"quora\", split=\"train\")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5234d425", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(808580,\n", + " ['What is the step by step guide to invest in share market in india?',\n", + " 'What is the step by step guide to invest in share market?',\n", + " 'What is the story of Kohinoor (Koh-i-Noor) Diamond?',\n", + " 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',\n", + " 'How can I increase the speed of my internet connection while using a VPN?',\n", + " 'How can Internet speed be increased by hacking through DNS?',\n", + " 'Why am I mentally very lonely? 
How can I solve it?',\n", + " 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?',\n", + " 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?',\n", + " 'Which fish would survive in salt water?'])" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "questions = []\n", + "for q in dataset['questions']:\n", + "\tquestions.extend(q['text'])\n", + "\n", + "len(questions), questions[:10]" + ] + }, + { + "cell_type": "markdown", + "id": "51dbb60a", + "metadata": {}, + "source": [ + "### Preprocess" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "4187f4a5", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove all duplicates\n", + "questions = list(set(questions))" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "f9ceb941", + "metadata": {}, + "outputs": [], + "source": [ + "# Keep only questions with at least min_len and fewer than max_len words\n", + "\n", + "min_len = 10\n", + "max_len = 50\n", + "\n", + "def filter_function(question):\n", + "\twords = question.split()\n", + "\tn_words = len(words)\n", + "\tif n_words in range(min_len, max_len):\n", + "\t\treturn True\n", + "\treturn False\n", + "\n", + "questions = list(filter(filter_function, questions))" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "073cf435", + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "\n", + "# Randomly sample N questions without replacement (keeps them unique),\n", + "# since the complete dataset is very large\n", + "# and would take much longer to process\n", + "N = 30_000\n", + "\n", + "questions = random.sample(questions, k=N)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "24d1f281", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(30000,\n", + " ['What is it like to be black (African migrant or African American) in Australia?',\n", + " 'What are these Canada people? Why Canada is not a state of America? 
Do they look like us?',\n", + " 'Why do we use long transmission line for longer than 240 km?',\n", + " \"I'm 11 and I want my nose pierced, I'm ok with waiting till I'm 12 which is in January, my dad said he will think about when I'm 12, can I u think?\",\n", + " 'What is If there was one thing you would like to change about Quora what would it be? That one thing where Quora need to improve?',\n", + " 'My car steering not working as my wheel got stuck?',\n", + " 'What is the best shipping option for an online business in Nigeria sending products to the USA?',\n", + " 'Who is the best center to ever play in the NBA?',\n", + " 'How are red blood cells structured and how do they function?',\n", + " 'What are some things new employees should know going into their first day at Commerce Bank?'])" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(questions), questions[:10]" + ] + }, + { + "cell_type": "markdown", + "id": "62566f8f", + "metadata": { + "papermill": { + "duration": 0.007047, + "end_time": "2024-02-12T18:05:27.943195", + "exception": false, + "start_time": "2024-02-12T18:05:27.936148", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Qdrant" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6f42199", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:05:27.960408Z", + "iopub.status.busy": "2024-02-12T18:05:27.960023Z", + "iopub.status.idle": "2024-02-12T18:06:09.747078Z", + "shell.execute_reply": "2024-02-12T18:06:09.745908Z" + }, + "papermill": { + "duration": 41.799227, + "end_time": "2024-02-12T18:06:09.750235", + "exception": false, + "start_time": "2024-02-12T18:05:27.951008", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install qdrant-client[fastembed]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "4911a12e", + "metadata": {}, + "outputs": [], + "source": [ + "# 
Name of Qdrant Collection for saving vectors\n", + "QD_COLLECTION_NAME = \"collection_name\"" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "58d75809", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:06:09.780598Z", + "iopub.status.busy": "2024-02-12T18:06:09.779364Z", + "iopub.status.idle": "2024-02-12T18:16:18.533873Z", + "shell.execute_reply": "2024-02-12T18:16:18.532156Z" + }, + "papermill": { + "duration": 608.787004, + "end_time": "2024-02-12T18:16:18.550945", + "exception": false, + "start_time": "2024-02-12T18:06:09.763941", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Completed\n" + ] + } + ], + "source": [ + "from qdrant_client import QdrantClient\n", + "\n", + "client = QdrantClient(\":memory:\")\n", + "\n", + "client.add(\n", + " collection_name=QD_COLLECTION_NAME,\n", + " documents=questions,\n", + ")\n", + "\n", + "print(\"Completed\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "48bb2415", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:16:18.582461Z", + "iopub.status.busy": "2024-02-12T18:16:18.581711Z", + "iopub.status.idle": "2024-02-12T18:16:18.589335Z", + "shell.execute_reply": "2024-02-12T18:16:18.587929Z" + }, + "papermill": { + "duration": 0.027265, + "end_time": "2024-02-12T18:16:18.592322", + "exception": false, + "start_time": "2024-02-12T18:16:18.565057", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "def pretty_print(query):\n", + " results = client.query(\n", + " collection_name=QD_COLLECTION_NAME,\n", + " query_text=query,\n", + " limit=5\n", + " )\n", + " print(\"Query:\", query)\n", + " for i, result in enumerate(results):\n", + " print()\n", + " print(f\"{i+1}) {result.document}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "c132090b", + "metadata": { + "execution": { + "iopub.execute_input": 
"2024-02-12T18:16:18.622919Z", + "iopub.status.busy": "2024-02-12T18:16:18.622422Z", + "iopub.status.idle": "2024-02-12T18:16:18.766211Z", + "shell.execute_reply": "2024-02-12T18:16:18.764524Z" + }, + "papermill": { + "duration": 0.164352, + "end_time": "2024-02-12T18:16:18.770923", + "exception": false, + "start_time": "2024-02-12T18:16:18.606571", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: what is the best earyly morning meal?\n", + "\n", + "1) What is your favorite food for a chilly winter day?\n", + "\n", + "2) Can you give me some recipes for a healthy and easy packed lunch?\n", + "\n", + "3) What is the best meal you ever had in your life?\n", + "\n", + "4) What's the first thing you put in your mouth in the morning?\n", + "\n", + "5) What is served for breakfast on a typical US army base?\n" + ] + } + ], + "source": [ + "pretty_print(\"what is the best earyly morning meal?\")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "93362c88", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:16:18.835972Z", + "iopub.status.busy": "2024-02-12T18:16:18.835126Z", + "iopub.status.idle": "2024-02-12T18:16:18.977442Z", + "shell.execute_reply": "2024-02-12T18:16:18.975638Z" + }, + "papermill": { + "duration": 0.180746, + "end_time": "2024-02-12T18:16:18.983214", + "exception": false, + "start_time": "2024-02-12T18:16:18.802468", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: How should one introduce themselves?\n", + "\n", + "1) What is the first step anyone will take before stating his/her own business?\n", + "\n", + "2) How can a fresh graduate face his 1st interview for a bank job if the question is \"say something about yourself / introduce yourself\"?\n", + "\n", + "3) What is the best way to respond to an email introduction?\n", + "\n", + "4) How 
do I give a welcoming speech to students of the freshman year?\n", + "\n", + "5) How do I speak up in front of many people?\n" + ] + } + ], + "source": [ + "pretty_print(\"How should one introduce themselves?\")" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "dbb0e763", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:16:19.053710Z", + "iopub.status.busy": "2024-02-12T18:16:19.052919Z", + "iopub.status.idle": "2024-02-12T18:16:19.160098Z", + "shell.execute_reply": "2024-02-12T18:16:19.157372Z" + }, + "papermill": { + "duration": 0.144939, + "end_time": "2024-02-12T18:16:19.162846", + "exception": false, + "start_time": "2024-02-12T18:16:19.017907", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: Why is the Earth a sphere?\n", + "\n", + "1) Why do extraterrestrial bodies always appear as a spherical shape? Why not square or cylindrical?\n", + "\n", + "2) If the earth is a sphere, how is it that wherever we stand, we never fall off?\n", + "\n", + "3) All things are supposed to fall to the earth because of gravity, then why do clouds float?\n", + "\n", + "4) Why do some people think that the Earth is flat?\n", + "\n", + "5) What is the reason for existence of Earth's magnetic field?\n" + ] + } + ], + "source": [ + "pretty_print(\"Why is the Earth a sphere?\")" + ] + }, + { + "cell_type": "markdown", + "id": "1cc0cd3d", + "metadata": { + "papermill": { + "duration": 0.016819, + "end_time": "2024-02-12T18:16:19.194619", + "exception": false, + "start_time": "2024-02-12T18:16:19.177800", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Explore More\n", + "\n", + "- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1)\n", + "- [E-Commerce Products Search Engine Using 
Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant)\n", + "- [Qdrant](https://qdrant.tech)\n", + "- [Qdrant Documentation](https://qdrant.tech/documentation/)\n", + "- [Qdrant Python Client Documentation](https://python-client.qdrant.tech)\n", + "- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs)\n" + ] + } + ], + "metadata": { + "kaggle": { + "accelerator": "none", + "dataSources": [ + { + "databundleVersionId": 323734, + "sourceId": 6277, + "sourceType": "competition" + } + ], + "dockerImageVersionId": 30646, + "isGpuEnabled": false, + "isInternetEnabled": true, + "language": "python", + "sourceType": "notebook" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + }, + "papermill": { + "default_parameters": {}, + "duration": 661.57748, + "end_time": "2024-02-12T18:16:20.604787", + "environment_variables": {}, + "exception": null, + "input_path": "__notebook__.ipynb", + "output_path": "__notebook__.ipynb", + "parameters": {}, + "start_time": "2024-02-12T18:05:19.027307", + "version": "2.5.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/quick-search-engine/requirements.txt b/quick-search-engine/requirements.txt new file mode 100644 index 0000000..c7a76c7 --- /dev/null +++ b/quick-search-engine/requirements.txt @@ -0,0 +1,2 @@ +datasets==2.18.0 +qdrant-client[fastembed]==1.8.0 \ No newline at end of file