Auto refresh iterator with snapshot (#13354)
Summary:

# Problem

Once opened, an iterator preserves its respective RocksDB snapshot for read consistency. Unless explicitly `Refresh`ed, the iterator holds on to the `SuperVersion` assigned at `Init` time throughout its lifetime. As time goes by, this can artificially prolong the holdup of obsolete memtables (_potentially_ referenced by that superversion alone), consequently limiting the supply of reclaimable memory on the DB instance. This behavior proved especially problematic in the case of _logical_ backups (outside of RocksDB `BackupEngine`).

# Solution

Building on top of the `Refresh(const Snapshot* snapshot)` API introduced in #10594, we're adding a new opt-in `ReadOptions` knob that (when enabled) instructs the iterator to automatically refresh itself to the latest superversion, all while retaining the originally assigned explicit snapshot (supplied in `read_options.snapshot` at the time of iterator creation) for consistency. To ensure minimal performance overhead, we leverage a relaxed atomic for superversion freshness lookups.

Pull Request resolved: #13354

Test Plan:

**Correctness:** A new test demonstrates the auto-refresh behavior in contrast to the legacy iterator: `./db_iterator_test --gtest_filter=*AutoRefreshIterator*`.

**Stress testing:** We're adding a command-line parameter controlling the feature and hooking it up to as many iterator use cases in `db_stress` as we reasonably can, with random feature on/off configuration in db_crashtest.py.
# Benchmarking

The goal of this benchmark is to validate that throughput did not regress substantially. The benchmark was run on an optimized build, 3-5 times for each respective category or until convergence. In addition, we configured an aggressive threshold of 1 second for new `SuperVersion` creation. Experiments were run in parallel (at the same time) on separate DB instances within a single host to evenly spread the potential adverse impact of noisy-neighbor activity. Host specs [1].

**TL;DR:** Baseline and the new solution are practically indistinguishable from a performance standpoint. The difference (positive or negative) in throughput relative to the baseline, if any, is no more than 1-2%.

**Snapshot initialization approach:** This feature is only effective on iterators with a well-defined `snapshot` passed via the `ReadOptions` config. We modified the existing `db_bench` program to reflect that constraint. However, it quickly turned out that the actual `Snapshot*` initialization is quite expensive, especially in the case of 'tiny scans' (100 rows), where it contributes as much as 25-35 microseconds, i.e. ~20-30% of the average per-op latency, unintentionally masking a _potentially_ adverse performance impact of this change. As a result, we ended up creating a single, explicit 'global' `Snapshot*` for all future scans _before_ running multiple experiments en masse. This is also a valuable data point to keep in mind in case of any future discussions about taking implicit snapshots; now we know what the lower-bound cost could be.

## "DB in memory" benchmark

**DB Setup**

1. Allow a single memtable to grow large enough (~572MB) to fit all the rows. Upon shutdown, all the rows will be flushed to the WAL file (the inspected `000004.log` file is 541MB in size).

```
./db_bench -db=/tmp/testdb_in_mem -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -write_buffer_size=600000000 -max_write_buffer_number=2 -compression_type=none
```

2. As part of recovery in the subsequent DB open, the WAL will be processed into one or more SST files. We select a block cache (the `cache_size` parameter in the `db_bench` invocation) large enough to hold the entire DB, to test the "hot path" CPU overhead.
```
./db_bench -use_existing_db=true -db=/tmp/testdb_in_mem -statistics=false -cache_index_and_filter_blocks=true -benchmarks=seekrandom -preserve_internal_time_seconds=1 -max_write_buffer_number=2 -explicit_snapshot=1 -use_direct_reads=1 -async_io=1 -num=? -seek_nexts=? -cache_size=? -write_buffer_size=? -auto_refresh_iterator_with_snapshot={0|1}
```

| | seek_nexts=100; num=2,000,000 | seek_nexts=20,000; num=50,000 | seek_nexts=400,000; num=2,000 |
| -- | -- | -- | -- |
| baseline | 36362 (± 300) ops/sec, 928.8 (± 23) MB/s, 99.11% block cache hit | 52.5 (± 0.5) ops/sec, 1402.05 (± 11.85) MB/s, 99.99% block cache hit | 156.2 (± 6.3) ms/op, 1330.45 (± 54) MB/s, 99.95% block cache hit |
| auto refresh | 35775.5 (± 537) ops/sec, 926.65 (± 13.75) MB/s, 99.11% block cache hit | 53.5 (± 0.5) ops/sec, 1367.9 (± 9.5) MB/s, 99.99% block cache hit | 162 (± 4.14) ms/op, 1281.35 (± 32.75) MB/s, 99.95% block cache hit |

_-cache_size=5000000000 -write_buffer_size=3200000000 -max_write_buffer_number=2_

| | seek_nexts=3,500,000; num=100 |
| -- | -- |
| baseline | 1447.5 (± 34.5) ms/op, 1255.1 (± 30) MB/s, 98.98% block cache hit |
| auto refresh | 1473.5 (± 26.5) ms/op, 1232.6 (± 22.2) MB/s, 98.98% block cache hit |

_-cache_size=17680000000 -write_buffer_size=14500000000 -max_write_buffer_number=2_

| | seek_nexts=17,500,000; num=10 |
| -- | -- |
| baseline | 9.11 (± 0.185) s/op, 997 (± 20) MB/s |
| auto refresh | 9.22 (± 0.1) s/op, 984 (± 11.4) MB/s |

[1]

### Specs

| Property | Value |
| -- | -- |
| RocksDB | version 10.0.0 |
| Date | Mon Feb 3 23:21:03 2025 |
| CPU | 32 * Intel Xeon Processor (Skylake) |
| CPUCache | 16384 KB |
| Keys | 16 bytes each (+ 0 bytes user-defined timestamp) |
| Values | 100 bytes each (50 bytes after compression) |
| Prefix | 0 bytes |
| RawSize | 5.5 MB (estimated) |
| FileSize | 3.1 MB (estimated) |
| Compression | Snappy |
| Compression sampling rate | 0 |
| Memtablerep | SkipListFactory |
| Perf Level | 1 |

Reviewed By: pdillinger

Differential Revision: D69122091

Pulled By: mszeszko-meta

fbshipit-source-id: 147ef7c4fe9507b6fb77f6de03415bf3bec337a8
1 parent a30c020 · commit 8234d67 · 15 changed files with 411 additions and 31 deletions.