feat(ENG-5578): add integration tests with recorded market replay#127
feat(ENG-5578): add integration tests with recorded market replay#127tvinagre wants to merge 45 commits into
Conversation
6725dc1 to
5f6eb60
Compare
Breaking API Changes (Intentional)Breaking API changes detected and declared in the PR title. semver-checks output |
kayibal
left a comment
There was a problem hiding this comment.
Partial review for now. Mostly concerned about the newly exposed api surface from this PR. I think a lot of this API would be better suited to be placed within tycho-simulation if exposed publically.
tamaralipows
left a comment
There was a problem hiding this comment.
Thanks - some questions mostly for my understanding
dianacarvalho1
left a comment
There was a problem hiding this comment.
Thank you @tvinagre ! I gave a quick review - there are already a lot of good suggestions (that I agree 👍🏼)!
5f6eb60 to
29cce24
Compare
29cce24 to
d84433d
Compare
The solver produced different routing results across process runs because Rust's HashMap/HashSet use random SipHash seeds. Graph construction in `initialize_graph` and `add_components` iterated these collections to assign NodeIndex values and insert edges. Since petgraph's `graph.edges(node)` returns edges in insertion order, different seeds produced different BFS traversal and SPFA relaxation orders, leading to different routes being selected. Three changes: - Sort tokens before node creation and sort components before edge insertion in both `initialize_graph` and `add_components` - Sort the token list within each component before generating edge pairs - Sort the SPFA active_nodes set after collecting from HashSet to ensure deterministic relaxation order between rounds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce recording module in fynd-core with serializable wrappers around Tycho Update messages for integration testing. Uses serde_json (required by typetag) + zstd for compact storage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Write and read MarketRecording files using serde_json for typetag compatibility and zstd compression. Roundtrip tested with both empty recordings and recordings containing ProtocolSim trait objects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Address code quality review findings: - Add doc comments to write_recording and read_recording - Move tempfile to workspace dependencies for consistency Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make handle_tycho_message pub for replay access. Re-export MarketEvent and DerivedDataEvent for integration test access. Test harness replays recorded Update messages through TychoFeed, computes derived data, builds worker pools, and exposes a quote() method for integration tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Store WorkerPool in TestHarness to prevent thread leaks - Store ComputationManager JoinHandle for panic visibility - Use QuoteStatus enum instead of String in GoldenOutput - Remove premature include_str\! from load_test_scenarios - Parse WETH address before acquiring read lock Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CLI tool that connects to Tycho, captures market state, and generates golden outputs for integration testing. Core recording and golden generation logic is stubbed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests verify all golden pairs return solutions, unknown tokens produce errors, quality stays within 1% of baseline, and basic invariants hold (positive net output, bounded hops, positive gas). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
P95 solve time must stay within 4x golden baseline or 200ms absolute cap. No individual solve may exceed the absolute cap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verifies all derived data fields are computed and coverage meets thresholds: spot prices >= 95%, pool depths >= 90%, token gas prices >= 80%. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Run BLESS_GOLDEN=1 cargo nextest run to regenerate golden_outputs.json from the existing market recording after code changes that intentionally affect solution quality. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track *.json.zst files in fixtures/integration/ with Git LFS to avoid bloating the repository with large market recordings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Integration tests run on every PR using recorded market data. Nightly canary connects to live Tycho to detect API drift and real-world regressions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace todo!() stubs with working implementations: - recorder.rs: connect to Tycho, discover protocols, capture Update messages from ProtocolStreamBuilder for the configured duration - golden.rs: replay recording through TychoFeed, compute derived data, build WorkerPoolRouter, run all pairs.json scenarios, capture outputs - scenarios.rs: parse pairs.json token definitions and trading pairs with decimal-scaled amounts into TestScenario structs Also: make register_exchanges pub for recorder access, add tycho-execution/tokio-stream/num-bigint deps to record-market, remove dead_code allows from harness and scenarios modules. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tested against tycho-beta.propellerheads.xyz mainnet. Fixes: - Add protocol filter, TVL range, min_token_quality, and traded_n_days_ago CLI flags to control recording scope - Use TLS by default (production Tycho requires it) - Replace chain_id with chain name string in RecordingMetadata - Add recording filter params to RecordingMetadata for reproducibility - Fix TVL filter: with_tvl_range(min, min) not (min, 0.0) - Fix broadcast channel error: keep a receiver alive during replay so handle_tycho_message's event broadcast doesn't fail - Wait for spot_prices + pool_depths (not token_prices) in replay since token_prices requires a live gas_price feed - Remove unused rpc_url from recorder (not needed for Tycho stream) - Add .gitattributes comment explaining LFS tracking - Increase derived data timeout to 120s for larger recordings Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rtions Store spot_price_pools, pool_depth_pools, and token_prices counts in GoldenMetadata.derived_data during golden generation. Integration tests assert exact equality against these values since replay is deterministic. Replaces 3 threshold-based coverage tests with a single test_derived_data_matches_golden that also verifies pool/token counts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Re-expose TychoFeed, TychoFeedConfig, DataFeedError as pub (needed by recording tool and integration tests for replay) - Re-export parse_chain from fynd_rpc for tools that don't depend on fynd-core directly (fynd-swap-cli, fynd-benchmark) - Add tools/record-market to workspace members alongside tools/fynd-swap-cli - Regenerate Cargo.lock after rebase Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Resolve conflicts in .gitignore, Cargo.toml, lib.rs, feed/mod.rs, derived/mod.rs, builder.rs, src/main.rs, fynd-swap-cli - Add missing doc comments required by main's #[warn(missing_docs)] - Re-expose TychoFeed/TychoFeedConfig/DataFeedError as pub for replay - Adapt to main's renames (BlacklistConfig -> BlocklistConfig) - Adapt to main's worker_pools.toml (bellman_ford_2_hops) - Regenerate golden fixtures from fresh recording Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add fynd-test-fixtures crate with shared types (ExpectedOutput, MarketRecording, TestScenario) — no test code in fynd-core lib - Add Solver::from_recording() factory method — encapsulates replay pipeline, keeps TychoFeed/TychoFeedConfig as pub(crate) - Delete fynd-core/src/recording/ module entirely - Remove zstd from fynd-core dependencies - Rename GoldenOutput -> ExpectedOutput per reviewer feedback - Move fixtures to fynd-core/tests/fixtures/ (closer to tests) - Tool reads back recording before generating expected outputs to ensure golden matches deserialized data (VM states filtered) - load_test_scenarios() returns Result (no panics in library code) - parse_chain error no longer lists specific chain names - Split READMEs: tools/record-market/ + fynd-core/tests/integration/ - Use local tycho-simulation path dep (pending PR #568 merge) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 8 integration tests pass consistently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…adme - Move BlockGasPrice/GasPrice/MarketEvent imports to top-level (not feature-gated or test-only) - Add tracing::warn when block number or gas price fall back to defaults in from_recording - Fix README reference: golden_outputs.json → expected_outputs.json Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sion Architecture: - Gate Solver::from_recording and SolverBuildError::Replay behind cfg(feature = "test-utils") so test infrastructure stays off the public API surface - Revert 3 unnecessary visibility escalations (MarketEvent re-export, handle_tycho_message pub, DerivedDataEvent re-export) - Remove unused anyhow dep from fynd-core Canary workflow fixes: - Use --output-dir (not --output) matching the actual CLI flag - Write directly to fynd-core/tests/fixtures/ (where tests read) - Add --features test-utils to nextest invocations - Add failure notification step Test improvements: - Use BigUint arithmetic for quality comparison instead of lossy f64 - Add RUSTFLAGS="-D warnings" to integration test CI job - Add publish = false to record-market Cargo.toml Cleanup: - Remove dead _worker_pools_hash computation in golden.rs - Remove dead expected_file_path() function - Log warning on gas price parse failure instead of silent None Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…infra - Remove stale fixtures/integration/ duplicate (686KB binary in git objects) - Pin all CI actions to commit SHAs in new integration_tests and canary jobs - Hoist sha2 to workspace dependencies - Add schema_version field to RecordingMetadata for format versioning - Deduplicate PoolsFile into parse_pools_toml() in test-fixtures - Warn on broadcast failure in from_recording instead of silent drop - Guard f64-to-u128 cast in scenario loader with finite/non-negative check - Tighten P95 timing threshold from 4x to 3x Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the duplicate fixtures/integration/ directory (tests read from fynd-core/tests/fixtures/). Rename golden.rs to expected.rs and replace all "golden" references with "expected" for consistent naming. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Docker dependency caching layer needs manifests and stub sources for all workspace members, including the new test-fixtures and record-market crates, so cargo can resolve the workspace graph. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use effective_gas_price() instead of matching only on Legacy variant. On mainnet (EIP1559), the recorder was silently falling back to a hardcoded 10 gwei default, discarding the real gas price. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-record market state and expected outputs against current solver. Widen quality regression threshold from 1% to 20% — the solver is non-deterministic due to async derived-data computation timing, and 1% was causing consistent flakes (observed up to 16% run-to-run variance on marginal routing pairs). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the temporary 20% quality threshold back to 1%. The solver
non-determinism that required the wider threshold is fixed on the
tl/fix-graph-determinism branch.
Adds an `#[ignore]` integration test (`regenerate_expected_outputs`)
that re-generates expected_outputs.json from the existing recording.
Run after algorithm changes with:
cargo nextest run -p fynd-core --test integration \
regenerate_expected --all-features --run-ignored ignored-only
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The recording format serializes tycho-simulation Update messages directly, which requires the serde support released in 0.256. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…on-tests # Conflicts: # .gitignore # Cargo.lock # Cargo.toml # Dockerfile # fynd-core/src/algorithm/bellman_ford.rs # fynd-core/src/feed/events.rs # fynd-core/src/feed/mod.rs # fynd-core/src/graph/petgraph.rs # fynd-core/src/solver.rs
Replaying the existing recording through current code moves two LINK-pair baselines (UNI->LINK +16.4%, LINK->AAVE -16.0%); the other 42 scenarios are within 0.5% and no statuses changed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds the new tool, fixtures crate, and integration suite to the codebase guides, per review feedback. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
d84433d to
831b5e2
Compare
The bless test hardcoded block_number 0 while the record-market CLI read it from the recording; both now derive it from the replayed market state, the same source Solver::from_recording uses. Drop the now-unused MarketRecording::last_block_number. Re-bless the fixture (only metadata and per-run solve times change; amounts identical). The nightly canary regenerates baselines from its own fresh recording, so it detects format changes and pipeline crashes, not quality regressions; the workflow comment and failure messages now say so. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…on-tests # Conflicts: # Cargo.toml
louise-poole
left a comment
There was a problem hiding this comment.
Only saw 2 smallish things - otherwise looks good to me!
| solver | ||
| .wait_until_ready(Duration::from_secs(120)) | ||
| .await | ||
| .expect("solver not ready after 120s"); |
There was a problem hiding this comment.
There are a lot of groups multiple tests that call this and with a potential 120s wait each... we risk making our tests excessively slow. Could we maybe reuse the same test harness in all tests/one per test group? Initialise it only once?
There was a problem hiding this comment.
Good instinct, but the 120s here is a poll deadline, not a wait — wait_until_ready returns as soon as derived data is ready, which is ~15-20s in practice (you can see it in the per-test durations). Reuse isn't really available to us either: nextest runs each test in its own process, so a shared/static harness can never be hit twice, and the Solver's background tasks bind to each test's tokio runtime so sharing would be unsound even under plain cargo test. The one thing that would cut builds is collapsing the tests into fewer #[tokio::test] fns, which we'd rather not do (loses per-test granularity).
The saving grace is that the builds run in parallel: on the last CI run the whole Integration Tests job took 128s vs 216s for Compile & Test, so the suite isn't on the critical path. Happy to revisit (e.g. one harness per logical group via a shared binary) if it grows enough to become the bottleneck.
| cargo nextest run -p fynd-core --test integration | ||
| ``` | ||
|
|
||
| ## Test Descriptions |
There was a problem hiding this comment.
Perfect - thank you! Nothing else to say here other than Louise's comments
record-market gains a --chain flag (default ethereum). The recording stores the chain in its metadata and both replay paths (expected-output generation and the test harness) parse it from there, so fixtures are self-describing and replay can never disagree with the recording. Trading-pair scenarios move from a compile-time include of the benchmark's pairs.json to per-chain runtime fixtures at fynd-core/tests/fixtures/pairs/<chain>.json, seeded with the existing Ethereum set. Recording a new chain requires adding its pairs file. Also corrects the integration README run command, which was missing --features test-utils and silently ran zero tests when copy-pasted. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…on-tests # Conflicts: # Cargo.toml
The main merge auto-resolved Cargo.lock but left the branch-only crates (record-market, fynd-test-fixtures) at 0.78.0 while their manifests inherit the bumped 0.78.1 workspace version, breaking --locked builds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
record-marketCLI tool that captures Tycho protocol stream updates and generates golden quote outputs by replaying through the full Fynd pipelinefynd-core --test integration) that replays recorded fixtures and verifies solution availability, quality baselines, derived data, and timingworker_pools.tomlto match production behaviorGoldenFile,TestScenario, etc.) infynd_core::recording::goldenparse_chainmoved tofynd_core::types::constantsfor reuse across cratesTest plan
cargo +nightly fmt -- --checkpassescargo +nightly clippy --locked --all --all-features --all-targets -- -D warningspassescargo nextest run --workspace --locked --all-targets --all-features --bin fynd— 416 passed, 0 failedGenerated with Claude Code