FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor & Clean-ups by ali-mosavian · Pull Request #90 · Kautenja/nes-py

ali-mosavian · 2022-12-16T10:33:07Z

Description

Adds the ability to back up to and restore from 10+1 slots. The default slot, -1, is the one used for reset.
Upgrades pyglet to the latest version, because the previously required version does not work on macOS Ventura.
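
The slot scheme described above could be sketched roughly as follows. This is a minimal illustration only: `SlotStore`, `backup`, and `restore` are hypothetical names, not the API added by this PR, and the state is assumed to be any deep-copyable Python object.

```python
import copy


class SlotStore:
    """Hypothetical store with one reset slot (-1) and ten numbered slots (0-9)."""

    RESET_SLOT = -1
    NUM_SLOTS = 10

    def __init__(self):
        self._slots = {}

    def _check(self, slot):
        # Valid slots are -1 (reset) and 0..9.
        if not (self.RESET_SLOT <= slot < self.NUM_SLOTS):
            raise IndexError(f"slot {slot} is outside -1..{self.NUM_SLOTS - 1}")

    def backup(self, state, slot=RESET_SLOT):
        self._check(slot)
        # Store a copy so later changes to `state` do not alter the backup.
        self._slots[slot] = copy.deepcopy(state)

    def restore(self, slot=RESET_SLOT):
        self._check(slot)
        if slot not in self._slots:
            raise KeyError(f"slot {slot} has no backup")
        return copy.deepcopy(self._slots[slot])


store = SlotStore()
store.backup({"ram": [1, 2]})          # default slot -1, the reset state
store.backup({"ram": [9]}, slot=3)     # one of the ten numbered slots
reset_state = store.restore()
slot3_state = store.restore(3)
```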

Type of change

Please select all relevant options:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Only manual testing.

Test Configuration

Operating System: MacOS Ventura
Python version: 3.10.8
C++ compiler version: Apple clang version 14.0.0 (clang-1400.0.29.202)

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works

Add a new VectorEmulator class that enables stepping multiple NES emulators in parallel using C++ threads. This provides significant performance improvements over Python multiprocessing approaches by eliminating IPC overhead and enabling zero-copy observation access.

Key features:
- Persistent worker threads with condition variable synchronization
- Zero-copy screen and memory buffer access via strided NumPy views
- GIL released during parallel stepping for true parallelism
- Interface mirrors NESEmulator: step(), screen_buffer(), memory_buffer()
- Bounds checking on all indexed accessors with clear error messages

Performance: ~1.5-1.8x faster than gymnasium.vector.AsyncVectorEnv at 8 environments due to eliminated IPC and serialization overhead.

Technical changes:
- Add VectorEmulator class with thread-per-emulator architecture
- Use atomic states and condition variables for synchronization
- Implement strided views for BGR->RGB conversion without copying
- Fix Makefile to use undefined dynamic_lookup on macOS
- Add bounds checking that raises IndexError for invalid indices
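
The "strided views for BGR->RGB conversion without copying" item can be illustrated with plain NumPy. This is a sketch under assumptions: the buffer here is a stand-in array rather than the emulator's memory, and the 4-bytes-per-pixel B, G, R, X layout is assumed for illustration.

```python
import numpy as np

HEIGHT, WIDTH = 240, 256  # NES screen resolution

# Stand-in for an emulator-owned frame buffer: 4 bytes per pixel,
# assumed to be laid out as B, G, R, X.
frame = np.zeros((HEIGHT, WIDTH, 4), dtype=np.uint8)

# Slice channels 2, 1, 0 with a negative step. NumPy returns a view with a
# negative stride on the last axis, so no pixel data is copied.
rgb = frame[:, :, 2::-1]

# Writing B=10, G=20, R=30 into the underlying buffer is visible through
# the view in R, G, B order.
frame[0, 0] = (10, 20, 30, 0)
first_pixel = rgb[0, 0].tolist()
```

Because `rgb` is a view, it stays valid only as long as the underlying buffer does, which is the usual trade-off of zero-copy access.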

This commit fixes critical memory corruption bugs that occurred when loading snapshots taken from destroyed emulator instances.

Root cause analysis:
- NES::Core contains Mapper* pointers in MainBus and PictureBus
- NES::Core contains std::function callbacks in MainBus and PPU
- These callbacks capture references to Core members
- When Core was copied via default assignment, dangling references occurred
- Loading snapshots from destroyed emulators caused segfaults/hangs

Solution:
- Add copy_state_from() methods that copy only raw data:
  - Core::copy_state_from() - orchestrates selective copying
  - MainBus::copy_ram_from() - copies RAM, preserves mapper/callbacks
  - PictureBus::copy_ram_from() - copies VRAM/palette, preserves mapper
  - PPU::copy_state_from() - copies registers/state, preserves callback
- Emulator::restore() now uses Core::copy_state_from()
- Add get_mapper() accessor to MainBus and Core

Additional fixes:
- Add worker thread readiness synchronization in VectorEmulator
- Workers signal ready after entering wait loop
- Constructor waits for all workers before returning
- Prevents race condition where load_state is called before worker ready

New VectorEmulator methods:
- step_single(idx, action) - synchronous single emulator step
- dump_state(idx) - capture emulator state snapshot
- load_state(idx, state) - restore emulator from snapshot
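
The selective-copy idea can be shown with a small Python analogue. It is not the C++ code from this commit: the `Bus` and `Core` classes below are simplified stand-ins, and Python cannot reproduce the dangling references themselves. The sketch only shows the pattern of copying raw data while each object keeps its own mapper and callback.

```python
class Bus:
    """Simplified stand-in: raw RAM plus wiring that belongs to one core."""

    def __init__(self, mapper, on_write):
        self.ram = bytearray(2048)   # raw data, safe to copy
        self.mapper = mapper         # wiring, must not be copied across cores
        self.on_write = on_write     # callback bound to the owning core

    def copy_ram_from(self, other):
        # Copy bytes in place; mapper and callback are left untouched.
        self.ram[:] = other.ram


class Core:
    def __init__(self, name):
        self.name = name
        self.cycles = 0
        self.bus = Bus(mapper=f"mapper-of-{name}", on_write=self._on_write)

    def _on_write(self):
        return self.name

    def copy_state_from(self, other):
        # Orchestrates the selective copy: plain values and RAM only.
        self.cycles = other.cycles
        self.bus.copy_ram_from(other.bus)


live = Core("live")
snapshot = Core("snapshot")
snapshot.cycles = 1234
snapshot.bus.ram[0] = 42

live.copy_state_from(snapshot)
del snapshot  # the source may go away; `live` holds no reference to it
```

After the copy, `live` has the snapshot's data, but its mapper and callback still point at `live` itself.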

Pin each VectorEmulator worker thread to a specific CPU core using platform-specific APIs:
- Linux: pthread_setaffinity_np (hard affinity)
- macOS: thread_policy_set (hint only)
- Windows: SetThreadAffinityMask

This improves performance on multi-socket/NUMA systems (e.g., AMD EPYC) by reducing:
- Cross-NUMA memory access latency
- Cache thrashing from thread migration
- Scheduler overhead

Thread idx maps to core_id with wraparound if idx exceeds core count.

Completely redesign VectorEmulator synchronization to eliminate mutex contention that was causing ~67% idle time per core.

Changes:
- Remove condition variables (start_cv_, done_cv_) and their mutexes
- Remove shared done_count_ atomic counter
- Use per-worker cache-line aligned atomics (AlignedAtomic struct)
- Workers busy-wait on their own atomic state (no shared mutex)
- Main thread busy-waits checking all worker states (no shared mutex)

Key benefits:
- No mutex contention - each worker only touches its own cache line
- No thundering herd from notify_all()
- No barrier synchronization overhead
- Cache-line alignment prevents false sharing

Performance improvement on M3 (4 P-cores):
- 4 envs: 1157 → 1665 env/s (44% faster)
- 8 envs: 1136 → 1712 env/s (51% faster)
- 16 envs: 1275 → 1778 env/s (39% faster)

Expected even larger gains on NUMA systems (AMD EPYC) where mutex contention across NUMA nodes was the primary bottleneck.
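
For reference, the quoted percentages follow directly from the before/after throughput figures. The short check below uses only the numbers stated in the commit message; the variable names are illustrative.

```python
# (before, after) throughput in env/s, keyed by number of environments.
results = {4: (1157, 1665), 8: (1136, 1712), 16: (1275, 1778)}

# Relative improvement, rounded to whole percent.
speedup_percent = {
    envs: round((after / before - 1) * 100)
    for envs, (before, after) in results.items()
}
```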

Add configure_ram_reads() and ram_values() methods to VectorEmulator for efficient batch reading of RAM addresses after each step.

Features:
- Configure once at init with list of (address, size, type) specs
- Type 0=INT (single byte), Type 1=BCD (multi-byte decimal)
- RAM values read in C++ after step, returned as numpy array
- Eliminates hundreds of thousands of Python property calls

This reduces Python overhead for info collection by ~55% by moving RAM reads from individual Python property access to a single C++ batch operation.
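
A pure-Python sketch of what one batch read might compute from a list of (address, size, type) specs. The function name `read_ram_values` is hypothetical, and the BCD encoding is an assumption: one decimal digit per byte, most significant digit first. The commit message does not spell out the exact encoding.

```python
INT, BCD = 0, 1  # type codes as listed in the commit message


def read_ram_values(ram, specs):
    """Read every configured (address, size, type) spec from `ram` in one pass."""
    values = []
    for address, size, kind in specs:
        if kind == INT:
            # Single byte, read as-is.
            values.append(ram[address])
        elif kind == BCD:
            # Assumed encoding: one decimal digit per byte, high digit first.
            value = 0
            for digit in ram[address:address + size]:
                value = value * 10 + digit
            values.append(value)
        else:
            raise ValueError(f"unknown RAM read type {kind}")
    return values


ram = bytearray(2048)                             # 2 KiB of NES work RAM
ram[0x10] = 7                                     # a single-byte counter
ram[0x20:0x26] = bytes([0, 1, 2, 3, 4, 5])        # six digits: 012345

values = read_ram_values(ram, [(0x10, 1, INT), (0x20, 6, BCD)])
```

The addresses above are arbitrary examples, not locations used by any particular game.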

- Remove verbose "failed to execute opcode" warnings that were causing major I/O overhead and flooding stderr (unofficial opcodes 0x1a, 0x1c are common in NES games and should be silently ignored)
- Add proper ROM validation with descriptive error messages in Cartridge::loadFromFile() including magic bytes, PRG size, and truncation checks
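
The three checks named above (magic bytes, PRG size, truncation) could look roughly like this for an iNES file. This is a Python sketch of the idea, not the C++ in Cartridge::loadFromFile(); `validate_ines` and its error messages are hypothetical, and rejecting a PRG count of zero is an assumption about what the PRG size check does.

```python
INES_MAGIC = b"NES\x1a"
HEADER_SIZE = 16
PRG_BANK_SIZE = 16 * 1024   # PRG ROM is counted in 16 KiB units
CHR_BANK_SIZE = 8 * 1024    # CHR ROM is counted in 8 KiB units
TRAINER_SIZE = 512          # optional trainer, flagged by bit 2 of byte 6


def validate_ines(data):
    """Return (prg_banks, chr_banks) or raise ValueError with a clear message."""
    if len(data) < HEADER_SIZE:
        raise ValueError("file is shorter than the 16-byte iNES header")
    if data[:4] != INES_MAGIC:
        raise ValueError("bad magic bytes: not an iNES file")
    prg_banks, chr_banks = data[4], data[5]
    if prg_banks == 0:
        raise ValueError("PRG ROM size is zero")
    trainer = TRAINER_SIZE if data[6] & 0x04 else 0
    expected = (HEADER_SIZE + trainer
                + prg_banks * PRG_BANK_SIZE + chr_banks * CHR_BANK_SIZE)
    if len(data) < expected:
        raise ValueError(f"file truncated: expected {expected} bytes, got {len(data)}")
    return prg_banks, chr_banks


# A synthetic ROM image with 1 PRG bank and 1 CHR bank, all zeros.
good_rom = INES_MAGIC + bytes([1, 1]) + bytes(10) + bytes(PRG_BANK_SIZE + CHR_BANK_SIZE)
banks = validate_ines(good_rom)
```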

Changes since 9.2.0:
- fix: remove opcode warning spam that severely impacted performance
- fix: add comprehensive ROM validation before starting threads
- perf: NUMA-aware thread pinning across sockets

Replace complex NUMA-aware CPU affinity logic with simple round-robin pinning. The previous implementation assumed a dual-socket AMD EPYC topology which caused incorrect core mapping on Intel desktop CPUs.

Now: Worker N pins to core (N % num_cores), which works on any topology.

Bump version to 9.3.1.
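
The round-robin rule is just a modulo. A tiny sketch, with `core_for_worker` as a hypothetical helper name; it only computes the mapping and does not call any affinity API.

```python
def core_for_worker(worker_idx, num_cores):
    """Map worker N to core N % num_cores, wrapping around when N >= num_cores."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return worker_idx % num_cores


# Six workers on a four-core machine: workers 4 and 5 wrap to cores 0 and 1.
assignment = [core_for_worker(n, 4) for n in range(6)]
```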

Kautenja · 2026-05-18T04:58:24Z

Thanks so much for putting this together and for the detailed work here.

This PR has been open for a long time, and its scope has drifted from the current maintenance direction for nes-py, so I am going to close it out rather than leave it in limbo. I appreciate the contribution, especially the notes around backup slots, Gymnasium support, and macOS/pyglet compatibility. If any of these pieces come back up, they will be easier to revisit as smaller, focused changes against the active branch.

ali-mosavian and others added 11 commits December 16, 2022 11:28

FEAT: Added ability to backup to 10+1 (-1 is main slot for reset) slots. Upgraded pyglet to work with macos ventura

c4d6b1a

CHG: Changed to std::array

6658e21

CHG: Renamed state to core which emulator inherits from, makes copying/resting state slightly easier.

a640ce6

CHG: Added explicit snapshot/restore to py env and fixed newer gym compatibility in step.

fc90f45

FIX: Syntax errors.

cb9c7ab

FIX: Call did reset after a restore.

76419ad

CHG: Refactor, using pybind11 instead of ctypes.cdll

d0cd94a

FEAT: Fixed tests and added build script for python 3.8-3.13, upgraded to gymnasium 1.0.0

4960d91

FIX: video.frames_per_second -> render_fps (gymnasium)

b43d857

CHG: Using super reset instead

1166a70

CHG: Ditched SCons and pivoted to Make for building emulator. Lots of issues with scons and venvs

d3f5315

ali-mosavian changed the title ~~FEAT: Ability to backup/restore to/from 10+1 slots~~ FEAT: Multiple backup slots, gymnasium support, ditched Scons, Refactor Dec 29, 2024

FIX: Removed debug stuff

b2b3281

ali-mosavian changed the title ~~FEAT: Multiple backup slots, gymnasium support, ditched Scons, Refactor~~ FEAT: Multiple backup slots, Emulator C-lib replaced with C++ class (Pybind), gymnasium support, ditched Scons, Refactor Dec 29, 2024

ali-mosavian and others added 4 commits December 29, 2024 21:48

FIX: More debug stuff removal

edba092

CHG: Some clean up

66b9600

FIX: Build for ARM64 on MacOS

c2c5406

FIX: More small fixes

80fa089

ali-mosavian and others added 11 commits December 29, 2024 23:52

CHG: Updated numpy version

4f80c9f

CHG: Minor change in setup.py

f0ed5fa

CHG: Updated setup.py

b856b3e

FIX: Reset wasn't fetching info

d454d67

FEAT: Snapshots are returned as np.ndarray, required some refactoring of the SimpleNES code

6397ed3

FIX: Minor fix

1764a09

FEAT: Added _will_restore and _did_restore callbacks. Improved build to use up to 16 cores

6c27423

FIX: Missing C++ headers

3e8ce36

FIX: Fixed JoypadSpace incompatibility with gymnasium

1f43bca

FEAT: Switched to nearest neighbour filtering when rendering to window. Fixed misspelled metadata in NESEnv. Bumped version to 9.1.3

0ea00ac

FIX: Fixed pyglet import, bumped to version 9.1.4

9d89ad8

ali-mosavian added 4 commits January 27, 2026 21:15

ali-mosavian force-pushed the master branch from c46b9dc to eb94354 Compare January 28, 2026 08:28

ali-mosavian added 9 commits January 28, 2026 11:04

fix: use C++14 compatible tuple access and fix array constructor

83e998c

feat: add timing instrumentation to VectorEmulator::step

96d32f3

feat: add per-worker timing instrumentation

d6e4ac0

perf: NUMA-aware thread pinning - spread workers across sockets

4007282

fix: correct NUMA pinning to use physical cores across sockets

49582b9

chore: bump version to 9.3.0

392f65a

Kautenja closed this May 18, 2026

ali-mosavian wants to merge 40 commits into Kautenja:master from ali-mosavian:master

2 participants
