RocksDB Repairer

Overview

The repairer performs best-effort recovery: it recovers as much data as possible after a disaster without compromising consistency. It does not guarantee bringing the database to a time-consistent state. Note: currently, column families that have never been flushed are lost after repair. This happens even if the DB is in a healthy state.

Usage

Note that the CLI command uses default options to repair your DB and only adds the column families found in the SST files. If you need to specify any options (e.g., a custom comparator or column family-specific options), or want to specify the exact set of column families, use the programmatic approach.

Programmatic

For programmatic usage, call one of the RepairDB functions declared in include/rocksdb/db.h.

CLI

For CLI usage, first build ldb, our admin CLI tool:

$ make clean && make ldb

Now use ldb's repair subcommand, specifying your DB. It prints info logs to stderr, so you may wish to redirect them. Here I run it on a DB in ./tmp where I've deleted the MANIFEST file:

$ ./ldb repair --db=./tmp 2>./repair-log.txt
$ tail -2 ./repair-log.txt 
[WARN] [db/repair.cc:208] **** Repaired rocksdb ./tmp; recovered 1 files; 926bytes. Some data may have been lost. ****

Looks successful. The MANIFEST file is back and the DB is readable:

$ ls tmp/
000006.sst  CURRENT  IDENTITY  LOCK  LOG  LOG.old.1504116879407136  lost  MANIFEST-000001  MANIFEST-000003  OPTIONS-000005
$ ./ldb get a --db=./tmp
b

Notice the lost/ directory. It holds files containing data that was potentially lost during recovery.

Repair Process

The repair process is broken into 4 phases:

Find files
Convert logs to tables
Extract metadata
Write Descriptor

Find files

The repairer goes through all the files in the directory, and classifies them based on their file name. Any file that cannot be identified by name will be ignored.
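
The classification step can be illustrated with a simplified sketch. It is not RocksDB's own file name parser: the FileKind enum and ClassifyByName function are hypothetical, and only three name shapes are recognized here (<number>.log, <number>.sst, MANIFEST-<number>). Everything else falls into kOther, which mirrors the "ignored" case.

```cpp
#include <cctype>
#include <string>

// Hypothetical categories for this sketch only.
enum class FileKind { kWal, kTable, kManifest, kOther };

// Returns true if `s` is non-empty and consists only of decimal digits.
static bool AllDigits(const std::string& s) {
  if (s.empty()) return false;
  for (unsigned char c : s) {
    if (!std::isdigit(c)) return false;
  }
  return true;
}

// Classifies a file purely by its name. Unrecognized names map to kOther.
FileKind ClassifyByName(const std::string& name) {
  const std::string manifest_prefix = "MANIFEST-";
  if (name.compare(0, manifest_prefix.size(), manifest_prefix) == 0) {
    return AllDigits(name.substr(manifest_prefix.size())) ? FileKind::kManifest
                                                          : FileKind::kOther;
  }
  const auto dot = name.rfind('.');
  if (dot == std::string::npos || !AllDigits(name.substr(0, dot))) {
    return FileKind::kOther;
  }
  const std::string ext = name.substr(dot + 1);
  if (ext == "log") return FileKind::kWal;
  if (ext == "sst") return FileKind::kTable;
  return FileKind::kOther;
}
```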

Convert logs to tables

Every log file that is active is replayed. All sections of the file where the checksum does not match are skipped over. We intentionally give preference to data consistency.
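
The skip-on-mismatch behavior can be sketched as follows. This is an illustration under stated assumptions, not the repairer's code: LogRecord, ToyChecksum, and ReplayValidRecords are hypothetical names, and the checksum is a toy FNV-1a hash, whereas the real WAL format uses CRC32C over its physical records.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory view of one log record and its stored checksum.
struct LogRecord {
  std::string payload;
  std::uint32_t stored_checksum;
};

// Toy checksum (FNV-1a), for illustration only.
std::uint32_t ToyChecksum(const std::string& data) {
  std::uint32_t h = 2166136261u;
  for (unsigned char c : data) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Returns the payloads whose checksum matches; mismatching records are
// skipped and counted in `*skipped` rather than aborting the replay.
std::vector<std::string> ReplayValidRecords(const std::vector<LogRecord>& log,
                                            std::size_t* skipped) {
  std::vector<std::string> kept;
  *skipped = 0;
  for (const LogRecord& rec : log) {
    if (ToyChecksum(rec.payload) == rec.stored_checksum) {
      kept.push_back(rec.payload);
    } else {
      ++*skipped;
    }
  }
  return kept;
}
```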

Extract metadata

We scan every table to compute:

smallest/largest keys in the table
largest sequence number in the table

If we are unable to scan the file, then we ignore the table.
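
A minimal sketch of this scan, assuming the table's entries are already decoded into memory and keys are ordered bytewise. Entry, TableMeta, and ScanEntries are hypothetical names for illustration; the real scan iterates the SST file and uses the column family's comparator.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical decoded entry: a user key plus its sequence number.
struct Entry {
  std::string user_key;
  std::uint64_t sequence;
};

// Hypothetical per-table metadata gathered by the scan.
struct TableMeta {
  std::string smallest;
  std::string largest;
  std::uint64_t max_sequence;
};

// Fills `*meta` and returns true; returns false for an empty table, which a
// caller could treat like a table that failed to scan.
bool ScanEntries(const std::vector<Entry>& entries, TableMeta* meta) {
  if (entries.empty()) return false;
  meta->smallest = entries.front().user_key;
  meta->largest = entries.front().user_key;
  meta->max_sequence = entries.front().sequence;
  for (const Entry& e : entries) {
    if (e.user_key < meta->smallest) meta->smallest = e.user_key;
    if (e.user_key > meta->largest) meta->largest = e.user_key;
    meta->max_sequence = std::max(meta->max_sequence, e.sequence);
  }
  return true;
}
```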

Write Descriptor

We generate descriptor contents:

log number is set to zero
next-file-number is set to 1 + largest file number we found
last-sequence-number is set to largest sequence# found across all tables
compaction pointers are cleared
every table file is added at level 0
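
The rules above can be sketched as a small function. The TableInfo and Descriptor structs and the BuildDescriptor function are hypothetical and model only the fields listed; the real descriptor is written as a MANIFEST file. The largest file number is passed in separately because it covers every file found, not only tables.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical summary of one recovered table.
struct TableInfo {
  std::uint64_t file_number;
  std::uint64_t max_sequence;
};

// Hypothetical model of the descriptor fields listed above.
struct Descriptor {
  std::uint64_t log_number = 0;
  std::uint64_t next_file_number = 1;
  std::uint64_t last_sequence = 0;
  std::vector<std::pair<int, std::uint64_t>> files;  // (level, file number)
};

Descriptor BuildDescriptor(const std::vector<TableInfo>& tables,
                           std::uint64_t largest_file_number_found) {
  Descriptor d;
  d.log_number = 0;                                    // log number is zero
  d.next_file_number = largest_file_number_found + 1;  // 1 + largest found
  for (const TableInfo& t : tables) {
    d.last_sequence = std::max(d.last_sequence, t.max_sequence);
    d.files.push_back(std::make_pair(0, t.file_number));  // always level 0
  }
  // Compaction pointers are not modeled; leaving them out corresponds to
  // clearing them.
  return d;
}
```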

Possible optimizations

Compute total size and use to pick appropriate max-level M
Sort tables by largest sequence# in the table
For each table: if it overlaps earlier table, place in level-0, else place in level-M.
We can provide options for time consistent recovery and unsafe recovery (ignore checksum failure when applicable)
Store per-table metadata (smallest, largest, largest-seq#, ...) in the table's meta section to speed up ScanTable.
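
The level-placement idea (sort by largest sequence number, then check overlap against earlier tables) could look roughly like the sketch below. It makes several assumptions that the list does not spell out: tables are sorted in ascending order of largest sequence number so that older data is considered first, keys compare bytewise, and the max level M is supplied by the caller rather than derived from total size. Table, Overlaps, and AssignLevels are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-table metadata plus the level chosen for it.
struct Table {
  std::string smallest;
  std::string largest;
  std::uint64_t max_sequence;
  int level;
};

// True if the closed key ranges of `a` and `b` intersect.
static bool Overlaps(const Table& a, const Table& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Sorts tables by ascending max_sequence, then places each table in level 0
// if it overlaps any earlier table, and in `max_level` otherwise.
void AssignLevels(std::vector<Table>& tables, int max_level) {
  std::sort(tables.begin(), tables.end(),
            [](const Table& a, const Table& b) {
              return a.max_sequence < b.max_sequence;
            });
  for (std::size_t i = 0; i < tables.size(); ++i) {
    bool overlaps_earlier = false;
    for (std::size_t j = 0; j < i; ++j) {
      if (Overlaps(tables[i], tables[j])) {
        overlaps_earlier = true;
        break;
      }
    }
    tables[i].level = overlaps_earlier ? 0 : max_level;
  }
}
```

Under these assumptions, the tables that end up in level M never overlap one another, since each was checked against every table placed before it.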

Limitations

If a column family was created recently and has not yet been persisted to SST files by a flush, it will be dropped during the repair process. Because of this limitation, repair might even damage a healthy DB if its column families have not been flushed yet.
