Optimise data generation for very large object counts (1M+) #276

@JayVDZ

Description

Summary

When generating very large numbers of objects (1M+), the current data generation implementation has performance and memory limitations that should be addressed post-MVP.

Current Implementation

The data generation pipeline works in two phases:

  1. Generation phase: Objects are created in memory using Parallel.For
  2. Persistence phase: Objects are persisted to PostgreSQL via EF Core
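
The two phases above can be sketched roughly as follows (everything except Parallel.For and SaveChangesAsync — the generator method, batch size, and entity set names — is a hypothetical reconstruction, not the actual code):

```csharp
// Phase 1: generate all objects in memory in parallel.
var objects = new ConcurrentBag<MetaverseObject>();
Parallel.For(0, totalCount, i =>
{
    objects.Add(GenerateObject(i)); // hypothetical generator
});

// Phase 2: persist in batches via EF Core. Note that even after
// SaveChangesAsync, every entity remains in the change tracker.
var batch = new List<MetaverseObject>(batchSize);
foreach (var obj in objects)
{
    batch.Add(obj);
    if (batch.Count == batchSize)
    {
        db.MetaverseObjects.AddRange(batch);
        await db.SaveChangesAsync();
        batch.Clear(); // clears our list, not the change tracker
    }
}
```

This shape is what produces both limitations below: the ConcurrentBag holds the full population in memory, and the DbContext tracks every persisted entity for the lifetime of the run.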

Current Limitations

  1. Memory pressure: All generated objects are held in memory before persistence begins
  2. EF Core change tracking overhead: Even with batched SaveChangesAsync, all entities remain tracked
  3. Single transaction batching complexity: Clearing the change tracker between batches causes duplicate key errors, because shared navigation properties (MetaverseObjectType → DataGenerationTemplateAttribute, etc.) are no longer tracked and get re-attached as new entities on the next batch

Proposed Optimisations

1. Raw SQL Bulk Insert (Recommended for 1M+ objects)

Use PostgreSQL's COPY command or a library like EFCore.BulkExtensions for bulk inserts:

  • Bypasses EF Core change tracking entirely
  • Significantly faster for large datasets
  • Requires flattening navigation properties to FK IDs
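
A minimal sketch of the COPY approach using Npgsql's binary import API (the table and column names here are assumptions about the schema; the key point is that navigation properties are flattened to FK columns such as type_id):

```csharp
// Bulk insert via PostgreSQL binary COPY, bypassing EF Core entirely.
await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();

await using var writer = await conn.BeginBinaryImportAsync(
    "COPY metaverse_objects (id, type_id, display_name) FROM STDIN (FORMAT BINARY)");

foreach (var obj in objects)
{
    await writer.StartRowAsync();
    await writer.WriteAsync(obj.Id, NpgsqlDbType.Uuid);
    await writer.WriteAsync(obj.TypeId, NpgsqlDbType.Integer);
    await writer.WriteAsync(obj.DisplayName, NpgsqlDbType.Text);
}

await writer.CompleteAsync(); // commits the COPY; disposing without this rolls back
```

EFCore.BulkExtensions wraps a similar mechanism behind a BulkInsertAsync call and handles column mapping from the EF model, at the cost of an extra dependency.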

2. Streaming Generation with Batched Persistence

Generate and persist objects in chunks rather than all at once:

  • Generate batch of N objects → Persist → Clear from memory → Repeat
  • Requires restructuring the generation loop
  • Trade-off: Some generation patterns (e.g., manager assignments) require all objects to exist first
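
A sketch of the chunked loop (GenerateBatch and PersistBatchAsync are hypothetical helpers standing in for the restructured generation and persistence code):

```csharp
const int batchSize = 10_000; // illustrative; could reuse SyncPageSize
for (var offset = 0; offset < totalCount; offset += batchSize)
{
    var count = Math.Min(batchSize, totalCount - offset);
    List<MetaverseObject> batch = GenerateBatch(offset, count);
    await PersistBatchAsync(batch);
    batch.Clear(); // batch becomes collectable before the next iteration
}
// Patterns that need the full population (e.g. manager assignments)
// would have to run as a second pass over the persisted IDs.
```

Peak memory is then bounded by one batch plus whatever the persistence layer buffers, rather than by the total object count.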

3. Fresh DbContext per Batch

Create a new JimDbContext for each batch to avoid change tracker accumulation:

  • Simpler than manual entity state management
  • Requires careful handling of referenced entities (MetaverseObjectType, MetaverseAttribute)
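
One way this could look, assuming an IDbContextFactory&lt;JimDbContext&gt; is registered (the factory and the Type navigation property name are assumptions):

```csharp
foreach (var batch in batches)
{
    // Fresh context per batch: the change tracker starts empty each time.
    await using var db = await contextFactory.CreateDbContextAsync();
    foreach (var obj in batch)
    {
        db.Add(obj);
        // Mark shared reference entities (MetaverseObjectType etc.) as
        // Unchanged so EF does not try to INSERT them again.
        db.Entry(obj.Type).State = EntityState.Unchanged;
    }
    await db.SaveChangesAsync();
} // context and its tracker are disposed here
```

The EntityState.Unchanged step is the "careful handling" noted above: with a fresh context, every entity in the graph defaults to Added, including rows that already exist.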

4. Async Enumerable / Generator Pattern

Use IAsyncEnumerable<MetaverseObject> to stream objects from generation to persistence:

  • Memory efficient
  • Complex to implement with current parallel generation
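
A sketch of the producer/consumer shape (GenerateObject and PersistBatchAsync are hypothetical; a real version would need to reconcile the async stream with the existing Parallel.For generation, e.g. via System.Threading.Channels):

```csharp
// Producer: yields objects one at a time instead of materialising all of them.
async IAsyncEnumerable<MetaverseObject> GenerateObjectsAsync(int total)
{
    for (var i = 0; i < total; i++)
    {
        yield return GenerateObject(i); // hypothetical generator
        await Task.Yield(); // keep the iterator genuinely asynchronous
    }
}

// Consumer: buffers at most one batch before flushing to the database.
async Task PersistStreamAsync(IAsyncEnumerable<MetaverseObject> source, int batchSize)
{
    var batch = new List<MetaverseObject>(batchSize);
    await foreach (var obj in source)
    {
        batch.Add(obj);
        if (batch.Count == batchSize)
        {
            await PersistBatchAsync(batch); // hypothetical
            batch.Clear();
        }
    }
    if (batch.Count > 0)
        await PersistBatchAsync(batch); // flush the final partial batch
}
```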

Success Criteria

  • Generate 1M objects without OOM errors
  • Persistence phase completes in reasonable time (target: <5 minutes for 1M objects)
  • Progress reporting continues to work during persistence
  • No regression in performance for smaller datasets (10K-100K)

Related

  • Current batched persistence implementation added in progress tracking feature
  • SyncPageSize setting controls batch size (default: 500)

Labels

  • enhancement