MultiGet Performance

Introduction

There is a lot of complexity in the underlying RocksDB implementation to lookup a key. The complexity results in a lot of computational overhead, mainly due to cache misses when probing bloom filters, virtual function call dispatches, key comparisons and IO. Users that need to lookup many keys in order to process an application level request end up calling Get() in a loop to read the required KVs. By providing a MultiGet() API that accepts a batch of keys, it is possible for RocksDB to make the lookup more CPU efficient by reducing the number of virtual function calls and pipelining cache misses. Furthermore, latency can be reduced by doing IO in parallel.

Read Path

A typical RocksDB database instance has multiple levels, with each level containing a few tens to hundreds of SST files. A point lookup goes through the following stages (in order to keep it simple, we ignore merge operands and assume everything is a Put) -

The mutable memtable is looked up. If a bloom filter is configured for memtable, the filter is probed using either the whole key or prefix. If the result is positive, the memtable rep lookup happens.
If the key was not found, 0 or more immutable memtables are looked up using the same process as #1
Next, the SST files in successive levels are looked up as follows -
1. In L0, every SST file is looked up in reverse chronological order
2. For L1 and above, each level has a vector of SST file metadata objects, with each metadata object containing, among other things, the highest and lowest key in the file. A binary search is performed in this vector to determine the file that overlaps the desired key. There is an auxiliary index that uses pre-calculated information about file ranges in the lsm to determine the set of files overlap a given file in the next level. A full binary search is performed in L1, and this index is used to narrow down the binary search bound in subsequent levels. This is known as fractional cascading.
3. Once a candidate file is found, the file's bloom filter block is loaded (either from the block cache or disk) and probed for the key. The probe is likely to result in a CPU cache miss. In many cases, the bottommost level will not have a bloom filter.
4. If the probe result is positive, the SST file index block is loaded and binary searched to find the target data block. The filter and index blocks may have to be read from disk, but typically they are either pinned in memory or accessed frequently enough to be found in the block cache.
5. The data block is loaded and binary searched to find the key. Data block lookups are more likely to miss in the block cache and result in an IO. It is important to note that each block cache lookup is also likely to result in a CPU cache miss, since the block cache is indexed by a hash table.
Step #3 is repeated for each level, with the only difference in L2 and higher being the fractional cascading for SST file lookup.

MultiGet Performance Optimizations

Let us consider the case of a workload with good locality of reference. Successive point lookups in such a workload are likely to repeatedly access the same SST files and index/data blocks. For such workloads, MultiGet provides the following optimizations -

When options.cache_index_and_filter_blocks=true is set, filter and index blocks for an SST file are fetched from the block cache on each key lookup. On a system with many threads performing reads, this results in significant lock contention on the LRU mutex. MultiGet looks up the filter and index block in the block cache only once for a whole batch of keys overlapping an SST file key range, thus drastically reducing the LRU mutex contention.
In steps 1, 2 and 3c, CPU cache misses occur due to bloom filter probes. Assuming a database with 6 levels and most keys being found in the bottommost level, with an average of 2 L0 files, we will have ~6 cache misses due to filter lookups in SST files. There may be an additional 1-2 cache misses if memtable bloom filters are configured. By batching the lookups at each stage, the filter cache line accesses can be pipelined, thus hiding the cache miss latency.
In a large database, data block reads are highly likely to require IO. This introduces latency. MultiGet has the capability to issue IO requests for multiple data blocks in the same SST file in parallel, thus reducing latency. This depends on support for parallel reads in the same thread from the underlying Env implementation. On Linux, PosixEnv has the capability to do parallel IO for MultiGet() using the IO Uring interface. IO Uring is a new asynchronous IO implementation introduced in the Linux kernel starting from 5.1. Note that there are multiple implementations of MultiGet. Only the void-returning methods perform parallel IO.

Contents

RocksDB Wiki
Overview
RocksDB FAQ
Terminology
Requirements
Contributors' Guide
Release Methodology
RocksDB Users and Use Cases
RocksDB Public Communication and Information Channels
Basic Operations
- Iterator
- Prefix seek
- SeekForPrev
- Tailing Iterator
- Compaction Filter
- Multi Column Family Iterator
- Read-Modify-Write (Merge) Operator
- Column Families
- Creating and Ingesting SST files
- Single Delete
- Low Priority Write
- Time to Live (TTL) Support
- Transactions
- Snapshot
- DeleteRange
- Atomic flush
- Read-only and Secondary instances
- Approximate Size
- User-defined Timestamp
- Wide Columns
- BlobDB
- Online Verification
Options
- Setup Options and Basic Tuning
- Option String and Option Map
- RocksDB Options File
MemTable
Journal
- Write Ahead Log (WAL)
- MANIFEST
- Track WAL in MANIFEST
Cache
- Block Cache
- SecondaryCache (Experimental)
Write Buffer Manager
Compaction
- Leveled Compaction
- Universal compaction style
- FIFO compaction style
- Manual Compaction
- Subcompaction
- Choose Level Compaction Files
- Managing Disk Space Utilization
- Trivial Move Compaction
- Remote Compaction (Experimental)
SST File Formats
- Block-based Table Format
- PlainTable Format
- CuckooTable Format
- Index Block Format
- Bloom Filter
- Data Block Hash Index
IO
- Rate Limiter
- SST File Manager
- Direct I/O
Compression
- Dictionary Compression
Full File Checksum and Checksum Handoff
Background Error Handling
Huge Page TLB Support
Tiered Storage (Experimental)
Logging and Monitoring
- Logger
- Statistics
- Compaction Stats and DB Status
- Perf Context and IO Stats Context
- EventListener
Known Issues
Troubleshooting Guide
Tests
- Stress Test
- Fuzzing
- Benchmarking
Tools / Utilities
- Administration and Data Access Tool
- How to Backup RocksDB?
- Replication Helpers
- Checkpoints
- How to persist in-memory RocksDB database
- Third-party language bindings
- RocksDB Trace, Replay, Analyzer, and Workload Generation
- Block cache analysis and simulation tools
- IO Tracer and Parser
Implementation Details
- Delete Stale Files
- Partitioned Index/Filters
- WritePrepared-Transactions
- WriteUnprepared-Transactions
- How we keep track of live SST files
- How we index SST
- Merge Operator Implementation
- RocksDB Repairer
- Write Batch With Index
- Two Phase Commit
- Iterator's Implementation
- Simulation Cache
- [To Be Deprecated] Persistent Read Cache
- DeleteRange Implementation
- unordered_write
Extending RocksDB
- RocksDB Configurable Objects
- The Customizable Class
- Object Registry
RocksJava
- RocksJava Basics
- Logging in RocksJava
- JNI Debugging
- RocksJava API TODO
- RocksJava Performance on Flash Storage
- Tuning RocksDB from Java
Lua
- Lua CompactionFilter
Performance
- Performance Benchmarks
- In Memory Workload Performance
- Read-Modify-Write (Merge) Performance
- Delete A Range Of Keys
- Write Stalls
- Pipelined Write
- MultiGet Performance
- Tuning Guide
- Memory usage in RocksDB
- Speed-Up DB Open
- Implement Queue Service Using RocksDB
- Asynchronous IO
- Off-peak in RocksDB
Projects Being Developed
Misc
- Building on Windows
- Developing with an IDE
- Open Projects
- Talks
- Publication
- Features Not in LevelDB
- How to ask a performance-related question?
- Articles about Rocks

MultiGet Performance

Introduction

Read Path

MultiGet Performance Optimizations

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!