Status: Open
Labels: enhancement (New feature or request)
Summary
When generating very large numbers of objects (1M+), the current data generation implementation has performance and memory limitations that should be addressed post-MVP.
Current Implementation
The data generation pipeline works in two phases:
- Generation phase: Objects are created in memory using Parallel.For
- Persistence phase: Objects are persisted to PostgreSQL via EF Core
Current Limitations
- Memory pressure: All generated objects are held in memory before persistence begins
- EF Core change tracking overhead: Even with batched SaveChangesAsync, all entities remain tracked
- Single transaction batching complexity: Clearing the change tracker between batches causes duplicate key errors due to complex navigation properties (MetaverseObjectType → DataGenerationTemplateAttribute, etc.)
Proposed Optimisations
1. Raw SQL Bulk Insert (Recommended for 1M+ objects)
Use PostgreSQL's COPY command or a library like EFCore.BulkExtensions for bulk inserts:
- Bypasses EF Core change tracking entirely
- Significantly faster for large datasets
- Requires flattening navigation properties to FK IDs
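A minimal sketch of the COPY route using Npgsql's binary import API, to show what "flattening navigation properties to FK IDs" could look like in practice. The MetaverseObjectRow record, the table name, and the column names are all hypothetical and would need to match the real schema; this is an illustration under those assumptions, not the project's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Npgsql;
using NpgsqlTypes;

// Hypothetical flattened row: navigation properties are reduced to plain FK IDs.
public sealed record MetaverseObjectRow(Guid Id, int TypeId, DateTime CreatedUtc);

public static class BulkCopyWriter
{
    // Streams rows to PostgreSQL with COPY ... FROM STDIN (FORMAT BINARY).
    // Returns the number of rows the server reports as imported.
    public static async Task<ulong> CopyAsync(
        string connectionString,
        IEnumerable<MetaverseObjectRow> rows,
        CancellationToken ct = default)
    {
        await using var conn = new NpgsqlConnection(connectionString);
        await conn.OpenAsync(ct);

        // Table and column names are assumptions, not taken from the real schema.
        await using var importer = await conn.BeginBinaryImportAsync(
            "COPY \"MetaverseObjects\" (\"Id\", \"TypeId\", \"Created\") " +
            "FROM STDIN (FORMAT BINARY)", ct);

        foreach (var row in rows)
        {
            await importer.StartRowAsync(ct);
            await importer.WriteAsync(row.Id, NpgsqlDbType.Uuid, ct);
            await importer.WriteAsync(row.TypeId, NpgsqlDbType.Integer, ct);
            // Recent Npgsql versions expect DateTimeKind.Utc for timestamptz.
            await importer.WriteAsync(row.CreatedUtc, NpgsqlDbType.TimestampTz, ct);
        }

        // Without an explicit Complete, disposing the importer cancels the COPY.
        return await importer.CompleteAsync(ct);
    }
}
```

Because the rows argument is an IEnumerable, a lazy generator can feed it directly, so the full data set never has to sit in memory at once.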
2. Streaming Generation with Batched Persistence
Generate and persist objects in chunks rather than all-at-once:
- Generate batch of N objects → Persist → Clear from memory → Repeat
- Requires restructuring the generation loop
- Trade-off: Some generation patterns (e.g., manager assignments) require all objects to exist first
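The generate, persist, release loop described above could be sketched as follows. This uses only the base class library (Enumerable.Chunk requires .NET 6 or later); the factory and persistBatch delegates are hypothetical stand-ins for the real generation and EF Core persistence code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BatchedPipeline
{
    // Lazily yields objects one at a time; nothing is materialised up front.
    public static IEnumerable<T> Generate<T>(int total, Func<int, T> factory)
    {
        for (var i = 0; i < total; i++)
            yield return factory(i);
    }

    // Pulls batchSize objects, persists them, reports progress, then moves on.
    // Each batch array becomes unreachable (and collectable) after its iteration.
    public static async Task<int> RunAsync<T>(
        int total,
        int batchSize,
        Func<int, T> factory,
        Func<IReadOnlyList<T>, CancellationToken, Task> persistBatch,
        IProgress<int>? progress = null,
        CancellationToken ct = default)
    {
        var persisted = 0;
        foreach (var batch in Generate(total, factory).Chunk(batchSize))
        {
            ct.ThrowIfCancellationRequested();
            await persistBatch(batch, ct);
            persisted += batch.Length;
            progress?.Report(persisted);
        }
        return persisted;
    }
}
```

Patterns that need every object to exist first, such as manager assignments, would not fit inside this loop as written. One option, offered only as a suggestion, is a second pass that updates references by ID once all batches are persisted.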
3. Fresh DbContext per Batch
Create a new JimDbContext for each batch to avoid change tracker accumulation:
- Simpler than manual entity state management
- Requires careful handling of referenced entities (MetaverseObjectType, MetaverseAttribute)
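A rough sketch of the per-batch context idea, assuming a context factory is registered via EF Core's IDbContextFactory<TContext>. The sharedReferenceEntities parameter is hypothetical: it stands for already-persisted instances such as MetaverseObjectType and MetaverseAttribute. Attaching them before adding the batch is intended to keep them in the Unchanged state so they are not re-inserted, which is the usual cause of duplicate key errors; whether that holds for the real entity graph would need testing.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class PerBatchPersistence
{
    public static async Task PersistAsync<TContext>(
        IDbContextFactory<TContext> contextFactory,
        IEnumerable<object[]> batches,
        IEnumerable<object> sharedReferenceEntities,
        CancellationToken ct = default)
        where TContext : DbContext
    {
        foreach (var batch in batches)
        {
            // A new context per batch starts with an empty change tracker.
            await using var db = contextFactory.CreateDbContext();

            // Entities with key values set are tracked as Unchanged by Attach,
            // so references to them should not generate INSERT statements.
            db.AttachRange(sharedReferenceEntities);

            // New objects in the batch are tracked as Added.
            db.AddRange(batch);

            await db.SaveChangesAsync(ct);
        }
    }
}
```

Compared with clearing the tracker on a long-lived context, this keeps the tracked set bounded by the batch size plus the shared reference entities.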
4. Async Enumerable / Generator Pattern
Use IAsyncEnumerable<MetaverseObject> to stream objects from generation to persistence:
- Memory efficient
- Complex to implement with current parallel generation
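One way to reconcile parallel generation with a streaming consumer is a bounded channel between the two. The sketch below is an assumption about how that could look, not the existing design: it swaps Parallel.For for Parallel.ForEachAsync (.NET 6 or later) so producers can wait asynchronously when the buffer is full, and the factory delegate is a hypothetical stand-in for the real object generator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class StreamingGenerator
{
    // Generates objects in parallel and exposes them as an async stream.
    // The bounded capacity applies back-pressure: producers wait when the
    // consumer (persistence) falls behind, capping memory use.
    public static IAsyncEnumerable<T> GenerateAsync<T>(
        int total,
        Func<int, T> factory,
        int capacity = 1000,
        CancellationToken ct = default)
    {
        var channel = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            SingleReader = true,
            FullMode = BoundedChannelFullMode.Wait
        });

        // Producer runs in the background; completion (or failure) is
        // propagated to the reader through the channel.
        _ = Task.Run(async () =>
        {
            try
            {
                await Parallel.ForEachAsync(
                    Enumerable.Range(0, total),
                    ct,
                    async (i, token) =>
                        await channel.Writer.WriteAsync(factory(i), token));
                channel.Writer.TryComplete();
            }
            catch (Exception ex)
            {
                channel.Writer.TryComplete(ex);
            }
        }, CancellationToken.None);

        return channel.Reader.ReadAllAsync(ct);
    }
}
```

A consumer would iterate the result with await foreach and persist in batches. Note that objects arrive in nondeterministic order, so any logic that depends on generation order would need an explicit index on each object.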
Success Criteria
- Generate 1M objects without OOM errors
- Persistence phase completes in reasonable time (target: <5 minutes for 1M objects)
- Progress reporting continues to work during persistence
- No regression in performance for smaller datasets (10K-100K)
Related
- The current batched persistence implementation was added as part of the progress tracking feature
- The SyncPageSize setting controls batch size (default: 500)