Skip to content

FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor & Clean-ups#90

Closed
ali-mosavian wants to merge 40 commits into
Kautenja:masterfrom
ali-mosavian:master
Closed

FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor & Clean-ups#90
ali-mosavian wants to merge 40 commits into
Kautenja:masterfrom
ali-mosavian:master

Conversation

@ali-mosavian
Copy link
Copy Markdown

@ali-mosavian ali-mosavian commented Dec 16, 2022

Description

  • Adds the ability to backup/restore to/from 10+1 slots, where the default slot is -1 (used for reset)
  • Upgraded pyglet to latest version, the required version does not work on MacOS Ventura.

Type of change

Please select all relevant options:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Only manual testing.

Test Configuration

  • Operating System: MacOS Ventura
  • Python version: 3.10.8
  • C++ compiler version: Apple clang version 14.0.0 (clang-1400.0.29.202)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

@ali-mosavian ali-mosavian changed the title FEAT: Ability to backup/restore to/from 10+1 slots FEAT: Multiple backup slots, gymnasium support, ditched Scons, Refactor Dec 29, 2024
@ali-mosavian ali-mosavian changed the title FEAT: Multiple backup slots, gymnasium support, ditched Scons, Refactor FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor Dec 29, 2024
@ali-mosavian ali-mosavian changed the title FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor & Clean-ups Dec 29, 2024
Add a new VectorEmulator class that enables stepping multiple NES
emulators in parallel using C++ threads. This provides significant
performance improvements over Python multiprocessing approaches by
eliminating IPC overhead and enabling zero-copy observation access.

Key features:
- Persistent worker threads with condition variable synchronization
- Zero-copy screen and memory buffer access via strided NumPy views
- GIL released during parallel stepping for true parallelism
- Interface mirrors NESEmulator: step(), screen_buffer(), memory_buffer()
- Bounds checking on all indexed accessors with clear error messages

Performance: ~1.5-1.8x faster than gymnasium.vector.AsyncVectorEnv
at 8 environments due to eliminated IPC and serialization overhead.

Technical changes:
- Add VectorEmulator class with thread-per-emulator architecture
- Use atomic states and condition variables for synchronization
- Implement strided views for BGR->RGB conversion without copying
- Fix Makefile to use undefined dynamic_lookup on macOS
- Add bounds checking that raises IndexError for invalid indices
This commit fixes critical memory corruption bugs that occurred when
loading snapshots taken from destroyed emulator instances.

Root cause analysis:
- NES::Core contains Mapper* pointers in MainBus and PictureBus
- NES::Core contains std::function callbacks in MainBus and PPU
- These callbacks capture references to Core members
- When Core was copied via default assignment, dangling references occurred
- Loading snapshots from destroyed emulators caused segfaults/hangs

Solution:
- Add copy_state_from() methods that copy only raw data:
  - Core::copy_state_from() - orchestrates selective copying
  - MainBus::copy_ram_from() - copies RAM, preserves mapper/callbacks
  - PictureBus::copy_ram_from() - copies VRAM/palette, preserves mapper
  - PPU::copy_state_from() - copies registers/state, preserves callback
- Emulator::restore() now uses Core::copy_state_from()
- Add get_mapper() accessor to MainBus and Core

Additional fixes:
- Add worker thread readiness synchronization in VectorEmulator
- Workers signal ready after entering wait loop
- Constructor waits for all workers before returning
- Prevents race condition where load_state is called before worker ready

New VectorEmulator methods:
- step_single(idx, action) - synchronous single emulator step
- dump_state(idx) - capture emulator state snapshot
- load_state(idx, state) - restore emulator from snapshot
Pin each VectorEmulator worker thread to a specific CPU core using
platform-specific APIs:
- Linux: pthread_setaffinity_np (hard affinity)
- macOS: thread_policy_set (hint only)
- Windows: SetThreadAffinityMask

This improves performance on multi-socket/NUMA systems (e.g., AMD EPYC)
by reducing:
- Cross-NUMA memory access latency
- Cache thrashing from thread migration
- Scheduler overhead

Thread idx maps to core_id with wraparound if idx exceeds core count.
Completely redesign VectorEmulator synchronization to eliminate mutex
contention that was causing ~67% idle time per core.

Changes:
- Remove condition variables (start_cv_, done_cv_) and their mutexes
- Remove shared done_count_ atomic counter
- Use per-worker cache-line aligned atomics (AlignedAtomic struct)
- Workers busy-wait on their own atomic state (no shared mutex)
- Main thread busy-waits checking all worker states (no shared mutex)

Key benefits:
- No mutex contention - each worker only touches its own cache line
- No thundering herd from notify_all()
- No barrier synchronization overhead
- Cache-line alignment prevents false sharing

Performance improvement on M3 (4 P-cores):
- 4 envs: 1157 → 1665 env/s (44% faster)
- 8 envs: 1136 → 1712 env/s (51% faster)
- 16 envs: 1275 → 1778 env/s (39% faster)

Expected even larger gains on NUMA systems (AMD EPYC) where mutex
contention across NUMA nodes was the primary bottleneck.
Add configure_ram_reads() and ram_values() methods to VectorEmulator
for efficient batch reading of RAM addresses after each step.

Features:
- Configure once at init with list of (address, size, type) specs
- Type 0=INT (single byte), Type 1=BCD (multi-byte decimal)
- RAM values read in C++ after step, returned as numpy array
- Eliminates hundreds of thousands of Python property calls

This reduces Python overhead for info collection by ~55% by moving
RAM reads from individual Python property access to a single C++ batch
operation.
- Remove verbose "failed to execute opcode" warnings that were causing
  major I/O overhead and flooding stderr (unofficial opcodes 0x1a, 0x1c
  are common in NES games and should be silently ignored)
- Add proper ROM validation with descriptive error messages in
  Cartridge::loadFromFile() including magic bytes, PRG size, and
  truncation checks
Changes since 9.2.0:
- fix: remove opcode warning spam that severely impacted performance
- fix: add comprehensive ROM validation before starting threads
- perf: NUMA-aware thread pinning across sockets
Replace complex NUMA-aware CPU affinity logic with simple round-robin
pinning. The previous implementation assumed a dual-socket AMD EPYC
topology which caused incorrect core mapping on Intel desktop CPUs.

Now: Worker N pins to core (N % num_cores), which works on any topology.

Bump version to 9.3.1.
@Kautenja
Copy link
Copy Markdown
Owner

Thanks so much for putting this together and for the detailed work here.

This PR has been open for a long time, and its scope has drifted from the current maintenance direction for nes-py, so I am going to close it out rather than leave it in limbo. I appreciate the contribution, especially the notes around backup slots, Gymnasium support, and macOS/pyglet compatibility. If any of these pieces come back up, they will be easier to revisit as smaller, focused changes against the active branch.

@Kautenja Kautenja closed this May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants