YouTube Video Transcript Search and Idea Generation with Qdrant and OpenAI #71
Conversation
Pull Request Overview
This PR introduces an end‑to‑end example demonstrating semantic search over YouTube video transcripts with Qdrant and idea generation using OpenAI. Key changes include:
- Adding helper functions (e.g., embed_and_store, search_similar_transcripts, generate_video_idea) for handling embeddings, storage, and idea generation.
- Implementing new API endpoints and management commands for YouTube processing and periodic task scheduling.
- Expanding documentation (README.MD) to guide setup and usage.
Reviewed Changes
Copilot reviewed 71 out of 73 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
video-generation/backend/api/youtube_utils.py | Adds YouTube authentication, transcript embedding, and storage functions using Qdrant and OpenAI. |
video-generation/backend/api/views.py | Provides API endpoints for user API keys, task handling, and video generation requests. |
video-generation/backend/api/urls.py | Registers new API endpoints for task management and video processing. |
video-generation/backend/api/transcription.py | Implements audio transcription using OpenAI Whisper API. |
video-generation/backend/api/tests.py | Placeholder for tests. |
video-generation/backend/api/redis_client.py | Sets up Redis client for task status. |
video-generation/backend/api/qdrant_utils.py | Adds functions for semantic search and video idea generation via Qdrant and OpenAI. |
video-generation/backend/api/models.py | Placeholder for Django models. |
video-generation/backend/api/management/commands/schedule_tasks.py | Schedules periodic tasks for video creation and vector DB updates. |
video-generation/backend/api/management/commands/run_youtube_process.py | Provides CLI command to trigger YouTube processing tasks. |
video-generation/backend/api/management/commands/create_qdrant_collection.py | Command to ensure Qdrant collection exists before processing. |
video-generation/backend/api/apps.py | Standard Django app configuration. |
video-generation/README.MD | Updates documentation for project setup, usage, and API integration. |
Files not reviewed (2)
- video-generation/backend/.gitignore: Language not supported
- video-generation/backend/Dockerfile: Language not supported
Comments suppressed due to low confidence (1)
video-generation/backend/api/views.py:25
- The task 'test_celery_task' is referenced without being imported or defined; ensure it is correctly imported or updated to the appropriate task.
task = test_celery_task.delay(2, 3)
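For context, `delay()` is how Celery queues a task asynchronously; a minimal sketch of what the referenced task and its import might look like (the module path `api/tasks.py` and the task body are assumptions, not taken from the PR):

```python
# api/tasks.py (hypothetical module path)
from celery import shared_task

@shared_task
def test_celery_task(x, y):
    # Trivial body, useful only for checking that the Celery worker is wired up.
    return x + y

# api/views.py -- the import Copilot flags as missing:
# from .tasks import test_celery_task
# task = test_celery_task.delay(2, 3)  # enqueues the task and returns an AsyncResult
```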
```python
response = openai.chat.completions.create(
```

Copilot AI · Apr 29, 2025

The call 'openai.chat.completions.create' appears to be incorrect; update it to 'openai.ChatCompletion.create' as per the OpenAI API specification.

```diff
- response = openai.chat.completions.create(
+ response = openai.ChatCompletion.create(
```
Co-authored-by: Copilot <[email protected]>
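For reference, which of the two calls is correct depends on the installed OpenAI Python SDK version; a minimal sketch of both forms (the model name and messages are placeholders, not taken from the PR):

```python
# openai < 1.0 (legacy SDK): module-level ChatCompletion
import openai

openai.api_key = "sk-..."
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest a video idea."}],
)

# openai >= 1.0 (current SDK): client object with chat.completions
from openai import OpenAI

client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Suggest a video idea."}],
)
```

Note that `openai.ChatCompletion.create` was removed in openai >= 1.0, so whether this suggestion applies depends on which SDK version the project pins.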
@mwitiderrick Thanks a lot for contributing this application! I'm not merging it, as I don't think it belongs in the "examples" category. Please let me clarify that a bit. An example for us is a digestible piece presenting how to use a certain Qdrant functionality. However, this seems to be a fully-fledged application that requires multiple systems to run, so not that many people will be able to test it on their own.
Keeping this application in a separate repository would make sense, as we do with all the other demos. A running version should also be hosted somewhere to attract interest.
I left some minor comments, but in general, I think the idea for the app is really neat! The app looks like any standard Django application.
Thanks again for putting in this effort!
```python
# Imports implied by this excerpt (not shown in the diff hunk)
import os
import uuid
import logging

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

logger = logging.getLogger(__name__)

qdrant = QdrantClient(url=os.getenv("QDRANT_HOST"), prefer_grpc=False)


def ensure_qdrant_collection():
    if not qdrant.collection_exists("video_transcripts"):
        qdrant.create_collection(
            collection_name="video_transcripts",
            vectors_config=VectorParams(
                size=1536,
                distance=Distance.COSINE
            )
        )


def embed_and_store(user, text, metadata):
    logger.info(f"[🔑] Starting embed_and_store for user {user.id} with metadata: {metadata}")

    try:
        client = OpenAI(api_key=user.openai_api_key_decrypted)
        logger.info("[🧠] Initialized OpenAI client.")
    except Exception as e:
        logger.exception("[❌] Failed to initialize OpenAI client.")
        raise e

    try:
        response = client.embeddings.create(
            input=[text],
            model="text-embedding-ada-002"
        )
        embedding = response.data[0].embedding
        logger.info("[✅] Embedding successfully created.")
    except Exception as e:
        logger.exception("[❌] Failed to generate embedding.")
        raise e

    try:
        point_id = str(uuid.uuid4())
        logger.info(f"[🆔] Generated UUID: {point_id}")

        point = PointStruct(id=point_id, vector=embedding, payload=metadata)
        logger.info("[📦] PointStruct created.")

        qdrant.upsert("video_transcripts", [point])
        logger.info(f"[📤] Upserted into Qdrant with point ID {point_id}")

        return point_id

    except Exception as e:
        logger.exception("[❌] Failed to upsert into Qdrant.")
        raise e
```
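A quick usage sketch of the function above; the call site and variable names are illustrative, while the metadata keys follow the example payload in the PR description:

```python
# Hypothetical call site, e.g. inside a Celery task or a view.
point_id = embed_and_store(
    user=request.user,
    text=transcript_text,
    metadata={"user_id": request.user.id, "transcript": transcript_text},
)
```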
I think these functions do not belong here and should be put in `qdrant_utils.py` instead. I struggled a bit with finding them.
Is that file required at all?
```python
# Import implied by this excerpt (not shown in the diff hunk)
from cryptography.fernet import InvalidToken

# get_fernet() is presumably defined elsewhere in this module
# (a sketch of a typical implementation follows this excerpt).

def encrypt_value(value):
    if not value:
        return None
    f = get_fernet()
    return f.encrypt(value.encode()).decode()


def decrypt_value(value):
    if not value:
        return None
    try:
        f = get_fernet()
        return f.decrypt(value.encode()).decode()
    except InvalidToken:
        return "[DECRYPTION_FAILED]"
```
I love that you implemented this! Many people will store everything in plaintext.
```python
# Imports implied by this excerpt (not shown in the diff hunk);
# `User` is presumably the project's user model, defined or imported
# earlier in this module.
import uuid

from django.db import models


class Video(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    user = models.ForeignKey(User, on_delete=models.CASCADE, related_name="videos")
    title = models.CharField(max_length=255)
    description = models.TextField()
    video_url = models.URLField()
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.title
```
It is a bit confusing that the Django app is called `users` and there are some other models here. First of all, I checked the `api` app, but there were no models at all, which I found quite intriguing.
This PR adds an end-to-end example demonstrating how to use Qdrant for semantic search over YouTube video transcripts, combined with OpenAI to generate new video ideas based on past content.

Specifically, it includes:

- ✅ Setup of a `video_transcripts` collection in Qdrant with vector embeddings (`text-embedding-ada-002`) and metadata payloads (e.g., `transcript`, `user_id`).
- ✅ `embed_and_store()` function to generate embeddings for transcripts and store them with metadata in Qdrant.
- ✅ `search_similar_transcripts()` function to semantically search for transcripts similar to a query text, filtering by user ID.
- ✅ `generate_video_idea()` function that uses the OpenAI Chat Completions API to propose a new video idea based on retrieved similar transcripts (a rough sketch of these two helpers appears at the end of this description).

✨ Why This is Useful:
- Demonstrates a practical use case of combining vector search (Qdrant) with language generation (OpenAI).
- Shows how to store text and metadata together with embeddings for richer retrieval.
- Provides a real-world example relevant to creators, marketers, and content platforms.
🔧 Technologies Used:

- Qdrant Client
- OpenAI Embedding API
- OpenAI Chat Completions API
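For readers who want a feel for how the two retrieval/generation helpers fit together without opening the diff, here is a rough sketch under stated assumptions: the collection name and payload keys follow the description above, while the client setup, prompts, model names, and exact function signatures in `qdrant_utils.py` may differ.

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = OpenAI()                                   # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")  # URL is illustrative


def search_similar_transcripts(user_id, query_text, limit=5):
    # Embed the query with the same model used at ingestion time.
    query_vector = client.embeddings.create(
        input=[query_text], model="text-embedding-ada-002"
    ).data[0].embedding
    # Restrict results to the requesting user's transcripts via a payload filter.
    hits = qdrant.search(
        collection_name="video_transcripts",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
        ),
        limit=limit,
    )
    return [hit.payload.get("transcript", "") for hit in hits]


def generate_video_idea(user_id, topic):
    # Feed the retrieved transcripts to the Chat Completions API as context.
    context = "\n\n".join(search_similar_transcripts(user_id, topic))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption
        messages=[
            {"role": "system", "content": "You suggest new YouTube video ideas."},
            {
                "role": "user",
                "content": f"Past transcripts:\n{context}\n\nPropose a new video idea about: {topic}",
            },
        ],
    )
    return response.choices[0].message.content
```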