refactor: ingestion logic to handle oversized paragraphs #16
base: main
…section splitting
- Introduced a new function `splitOversizedParagraph` to handle paragraphs exceeding the defined section size.
- Updated `splitIntoSections` to utilize the new paragraph splitting logic.
- Enhanced error handling and logging in the ingestion process for documents and URLs.
- Improved code readability by standardizing formatting and using consistent semicolon usage.
- Adjusted server code to ensure proper response formatting and error handling.
- Updated Docker Compose file for consistency.
Pull Request Overview
This PR refactors the document ingestion logic to properly handle oversized paragraphs that exceed section size limits. The main change introduces a new `splitOversizedParagraph` function that recursively splits large paragraphs at sentence boundaries (or character boundaries for extremely long sentences) to ensure all sections fit within the configured `SECTION_SIZE`. Additionally, the PR standardizes code formatting across the codebase by adding semicolons and adjusting line length limits.
Key Changes:
- Introduced `splitOversizedParagraph` function to handle paragraphs exceeding section size by splitting at sentence boundaries
- Updated `splitIntoSections` to utilize the new paragraph splitting logic, preventing oversized sections
- Enhanced OpenAI embedding API error handling to include response body details
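For readers who have not opened the diff, the splitting strategy described in the overview can be sketched roughly as follows. This is not the PR's actual code: the function body, the `splitSentences` and `splitByChars` helpers, the sentence-boundary regex, and the small `SECTION_SIZE` value are all assumptions made for illustration. Only the function name and the overall idea (sentence boundaries first, character boundaries as a fallback) come from the description above.

```typescript
// Hypothetical limit for illustration; the real SECTION_SIZE comes from the project's config.
const SECTION_SIZE = 40;

// Split text at sentence boundaries: ., ! or ? followed by whitespace.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  const boundary = /[.!?]\s+/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(text)) !== null) {
    // Keep the punctuation mark with its sentence, drop the whitespace after it.
    sentences.push(text.slice(start, match.index + 1));
    start = match.index + match[0].length;
  }
  if (start < text.length) sentences.push(text.slice(start));
  return sentences;
}

// Fallback for a single sentence that is longer than the limit: cut at fixed character offsets.
function splitByChars(text: string, maxLen: number): string[] {
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += maxLen) {
    parts.push(text.slice(i, i + maxLen));
  }
  return parts;
}

function splitOversizedParagraph(paragraph: string, maxLen: number = SECTION_SIZE): string[] {
  if (paragraph.length <= maxLen) return [paragraph];

  const sentences = splitSentences(paragraph);
  // No sentence boundary to split on: fall back to character boundaries.
  if (sentences.length <= 1) return splitByChars(paragraph, maxLen);

  const sections: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (sentence.length > maxLen) {
      // Flush what has been packed so far, then split the long sentence recursively.
      if (current) {
        sections.push(current);
        current = "";
      }
      sections.push(...splitOversizedParagraph(sentence, maxLen));
      continue;
    }
    // Greedily pack sentences into the current section while they still fit.
    const candidate = current ? current + " " + sentence : sentence;
    if (candidate.length > maxLen) {
      sections.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) sections.push(current);
  return sections;
}
```

The property worth checking in review is that every returned section has length at most `maxLen`, whatever the input looks like; the character fallback is what guarantees this for text without sentence punctuation.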
Reviewed Changes
Copilot reviewed 4 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| backend/src/ingestion/index.ts | Added splitOversizedParagraph function and integrated it into splitIntoSections to handle oversized paragraphs |
| backend/src/embedding/index.ts | Enhanced error handling for OpenAI API calls and added dimension validation logic |
| backend/src/server/index.ts | Added request timeout configuration for long-running ingestion requests |
| .prettierrc.js | Increased printWidth from 80 to 160 characters |
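As an illustration of the `backend/src/embedding/index.ts` row, error handling that surfaces the response body might look something like the sketch below. The `ResponseLike` interface, `formatApiError`, `ensureOk`, the message wording, and the 500-character truncation are all hypothetical; the actual change in the PR may be structured differently.

```typescript
// Minimal response shape; a subset of what the Fetch API's Response provides.
interface ResponseLike {
  ok: boolean;
  status: number;
  statusText: string;
  text(): Promise<string>;
}

// Build an error message that includes the status line and a truncated response body.
function formatApiError(status: number, statusText: string, body: string, maxBody: number = 500): string {
  const detail = body.length > maxBody ? body.slice(0, maxBody) + "..." : body;
  const suffix = detail ? " - " + detail : "";
  return "Embedding request failed: " + status + " " + statusText + suffix;
}

// Throw a descriptive error for non-2xx responses instead of a bare status code.
async function ensureOk(res: ResponseLike): Promise<void> {
  if (res.ok) return;
  // Reading the body can itself fail; fall back to an empty string in that case.
  const body = await res.text().catch(() => "");
  throw new Error(formatApiError(res.status, res.statusText, body));
}
```

Including the body matters because API error responses usually explain the failure (rate limit, invalid model, oversized input) in the payload rather than in the status line.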
Hey, thanks for the contribution. Your pull request has conflicts with 2 files. Please resolve it.
…rver payload size management
…enhance model configuration
@nullure done
Pull Request Overview
Copilot reviewed 6 out of 9 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
backend/src/embedding/index.ts:1
- The `openai_model` configuration now defaults to an empty string instead of `undefined`, which may cause issues in the `resolveOpenAIModel` and `resolveTargetDimension` functions. The function `resolveOpenAIModel` at line 43-45 uses a truthy check `env.openai_model ||` which treats empty string as falsy, but the type has changed from `string | undefined` to `string`. Consider either keeping the original `|| undefined` to maintain the API contract or updating the logic in functions that consume this value to explicitly check for empty strings.
import { env } from '../config';
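The distinction the suppressed comment raises can be shown with a small standalone sketch. The function names and the `DEFAULT_MODEL` value below are hypothetical and do not reproduce the project's `resolveOpenAIModel`; they only contrast how a truthy check, a nullish check, and an explicit empty-string check treat `""`.

```typescript
// Assumed fallback for illustration; the project's real default is not shown in this thread.
const DEFAULT_MODEL = "text-embedding-3-small";

// Truthy check: both "" and undefined fall through to the default.
function resolveWithTruthyCheck(configured: string | undefined): string {
  return configured || DEFAULT_MODEL;
}

// Nullish check: only undefined (or null) falls through, so "" is returned as the model name.
function resolveWithNullishCheck(configured: string | undefined): string {
  return configured ?? DEFAULT_MODEL;
}

// Explicit check: suited to a config value typed as plain `string` that may be empty or blank.
function resolveWithExplicitCheck(configured: string): string {
  const trimmed = configured.trim();
  return trimmed.length > 0 ? trimmed : DEFAULT_MODEL;
}
```

With `||`, an empty-string default still resolves to the fallback model, so behavior is preserved; the risk is in any consumer that uses `??` or compares against `undefined`, where `""` would be passed through as if it were a real model name.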
Co-authored-by: Copilot <[email protected]>
Hey, thanks for the contribution. Your pull request has conflicts with 4 files. Also, please don't vibe code 100% of the code, as it makes the code longer and at one point could really affect the performance.
@nullure hey, the PR changes are pretty small. I don't know why you think it's vibe coded at all.
The project fails at large files, URL fetching, simple queries. They all must be fixed. I tried to use in my side project but the results were not even close alternatives
You still do have conflicts with 4 files.
61d3803 to 8fa3a99
ready to merge @nullure
it's mostly formatting changes btw
Please add tests to prove that it works. Also, can you explain what gemini_queue and other things you added mean? I request you to just modify the ingestion file, we are currently working, and using sudden changes might be a hurdle.
📋 Description
- Introduced a new function `splitOversizedParagraph` to handle paragraphs exceeding the defined section size.
- Updated `splitIntoSections` to utilize the new paragraph splitting logic.
🔄 Type of Change
🧪 Testing
🔍 Code Review Checklist