FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor & Clean-ups#90
Closed
ali-mosavian wants to merge 40 commits into
Closed
Conversation
…ts. Upgraded pyglet to work with macos ventura
…g/resting state sligthly easier.
…mpability in step.
…d to gymnasium 1.0.0
… issues with scons and venvs
… of the SimpleNES code
…to use up to 16 cores
…w. Fixed mispelled metadata in NESEnv. Bumped version to 9.1.3
Add a new VectorEmulator class that enables stepping multiple NES emulators in parallel using C++ threads. This provides significant performance improvements over Python multiprocessing approaches by eliminating IPC overhead and enabling zero-copy observation access. Key features: - Persistent worker threads with condition variable synchronization - Zero-copy screen and memory buffer access via strided NumPy views - GIL released during parallel stepping for true parallelism - Interface mirrors NESEmulator: step(), screen_buffer(), memory_buffer() - Bounds checking on all indexed accessors with clear error messages Performance: ~1.5-1.8x faster than gymnasium.vector.AsyncVectorEnv at 8 environments due to eliminated IPC and serialization overhead. Technical changes: - Add VectorEmulator class with thread-per-emulator architecture - Use atomic states and condition variables for synchronization - Implement strided views for BGR->RGB conversion without copying - Fix Makefile to use undefined dynamic_lookup on macOS - Add bounds checking that raises IndexError for invalid indices
This commit fixes critical memory corruption bugs that occurred when loading snapshots taken from destroyed emulator instances. Root cause analysis: - NES::Core contains Mapper* pointers in MainBus and PictureBus - NES::Core contains std::function callbacks in MainBus and PPU - These callbacks capture references to Core members - When Core was copied via default assignment, dangling references occurred - Loading snapshots from destroyed emulators caused segfaults/hangs Solution: - Add copy_state_from() methods that copy only raw data: - Core::copy_state_from() - orchestrates selective copying - MainBus::copy_ram_from() - copies RAM, preserves mapper/callbacks - PictureBus::copy_ram_from() - copies VRAM/palette, preserves mapper - PPU::copy_state_from() - copies registers/state, preserves callback - Emulator::restore() now uses Core::copy_state_from() - Add get_mapper() accessor to MainBus and Core Additional fixes: - Add worker thread readiness synchronization in VectorEmulator - Workers signal ready after entering wait loop - Constructor waits for all workers before returning - Prevents race condition where load_state is called before worker ready New VectorEmulator methods: - step_single(idx, action) - synchronous single emulator step - dump_state(idx) - capture emulator state snapshot - load_state(idx, state) - restore emulator from snapshot
Pin each VectorEmulator worker thread to a specific CPU core using platform-specific APIs: - Linux: pthread_setaffinity_np (hard affinity) - macOS: thread_policy_set (hint only) - Windows: SetThreadAffinityMask This improves performance on multi-socket/NUMA systems (e.g., AMD EPYC) by reducing: - Cross-NUMA memory access latency - Cache thrashing from thread migration - Scheduler overhead Thread idx maps to core_id with wraparound if idx exceeds core count.
Completely redesign VectorEmulator synchronization to eliminate mutex contention that was causing ~67% idle time per core. Changes: - Remove condition variables (start_cv_, done_cv_) and their mutexes - Remove shared done_count_ atomic counter - Use per-worker cache-line aligned atomics (AlignedAtomic struct) - Workers busy-wait on their own atomic state (no shared mutex) - Main thread busy-waits checking all worker states (no shared mutex) Key benefits: - No mutex contention - each worker only touches its own cache line - No thundering herd from notify_all() - No barrier synchronization overhead - Cache-line alignment prevents false sharing Performance improvement on M3 (4 P-cores): - 4 envs: 1157 → 1665 env/s (44% faster) - 8 envs: 1136 → 1712 env/s (51% faster) - 16 envs: 1275 → 1778 env/s (39% faster) Expected even larger gains on NUMA systems (AMD EPYC) where mutex contention across NUMA nodes was the primary bottleneck.
Add configure_ram_reads() and ram_values() methods to VectorEmulator for efficient batch reading of RAM addresses after each step. Features: - Configure once at init with list of (address, size, type) specs - Type 0=INT (single byte), Type 1=BCD (multi-byte decimal) - RAM values read in C++ after step, returned as numpy array - Eliminates hundreds of thousands of Python property calls This reduces Python overhead for info collection by ~55% by moving RAM reads from individual Python property access to a single C++ batch operation.
- Remove verbose "failed to execute opcode" warnings that were causing major I/O overhead and flooding stderr (unofficial opcodes 0x1a, 0x1c are common in NES games and should be silently ignored) - Add proper ROM validation with descriptive error messages in Cartridge::loadFromFile() including magic bytes, PRG size, and truncation checks
Changes since 9.2.0: - fix: remove opcode warning spam that severely impacted performance - fix: add comprehensive ROM validation before starting threads - perf: NUMA-aware thread pinning across sockets
Replace complex NUMA-aware CPU affinity logic with simple round-robin pinning. The previous implementation assumed a dual-socket AMD EPYC topology which caused incorrect core mapping on Intel desktop CPUs. Now: Worker N pins to core (N % num_cores), which works on any topology. Bump version to 9.3.1.
Owner
|
Thanks so much for putting this together and for the detailed work here. This PR has been open for a long time, and its scope has drifted from the current maintenance direction for |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Type of change
Please select all relevant options:
How Has This Been Tested?
Only manual testing.
Test Configuration
Checklist