RocksDB Repairer
The repairer makes a best-effort attempt to recover as much data as possible after a disaster without compromising consistency. It does not guarantee bringing the database to a time-consistent state. Note: there is currently a limitation that unflushed column families are lost after repair. This happens even if the DB is in a healthy state.
Note that the CLI command uses default options for repairing your DB and only adds the column families found in the SST files. If you need to specify any options (e.g., a custom comparator or column family-specific options), or want to specify the exact set of column families, use the programmatic way.
For programmatic usage, call one of the RepairDB functions declared in include/rocksdb/db.h.
For CLI usage, first build ldb, our admin CLI tool:
$ make clean && make ldb
Now use ldb's repair subcommand, specifying your DB. Note that it prints info logs to stderr, so you may wish to redirect them. Here I run it on a DB in ./tmp where I've deleted the MANIFEST file:
$ ./ldb repair --db=./tmp 2>./repair-log.txt
$ tail -2 ./repair-log.txt
[WARN] [db/repair.cc:208] **** Repaired rocksdb ./tmp; recovered 1 files; 926bytes. Some data may have been lost. ****
Looks successful. MANIFEST file is back and DB is readable:
$ ls tmp/
000006.sst CURRENT IDENTITY LOCK LOG LOG.old.1504116879407136 lost MANIFEST-000001 MANIFEST-000003 OPTIONS-000005
$ ./ldb get a --db=./tmp
b
Notice the lost/ directory. It holds files containing data that was potentially lost during recovery.
The repair process is broken into four phases:
- Find files
- Convert logs to tables
- Extract metadata
- Write Descriptor
Find files
The repairer goes through all the files in the directory and classifies them based on their file names. Any file that cannot be identified by name is ignored.
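To make the classification step concrete, here is a minimal sketch of name-based classification. It is not RocksDB's actual file-name parser: the names `FileKind`, `Classified`, and `ClassifyByName` are hypothetical, and it only recognizes the three name shapes that matter for this walkthrough (numbered `.sst` tables, numbered `.log` files, and `MANIFEST-<number>` descriptors). Everything else falls into `kOther`, which mirrors "cannot be identified by name, so ignored".

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, simplified file kinds for illustration only.
enum class FileKind { kTable, kWal, kManifest, kOther };

struct Classified {
  FileKind kind = FileKind::kOther;
  uint64_t number = 0;  // File number parsed from the name, if any.
};

// Parses a non-empty run of decimal digits. Overflow is not checked in this
// sketch. Writes to *out only on success.
static bool ParseNumber(const std::string& s, uint64_t* out) {
  if (s.empty()) return false;
  uint64_t value = 0;
  for (char c : s) {
    if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    value = value * 10 + static_cast<uint64_t>(c - '0');
  }
  *out = value;
  return true;
}

// Classifies a file purely by its name, e.g. "000006.sst" or "MANIFEST-000001".
Classified ClassifyByName(const std::string& name) {
  Classified result;
  const std::string manifest_prefix = "MANIFEST-";
  if (name.compare(0, manifest_prefix.size(), manifest_prefix) == 0) {
    if (ParseNumber(name.substr(manifest_prefix.size()), &result.number)) {
      result.kind = FileKind::kManifest;
    }
    return result;
  }
  const std::size_t dot = name.rfind('.');
  if (dot == std::string::npos) return result;  // No extension: unidentified.
  uint64_t number = 0;
  if (!ParseNumber(name.substr(0, dot), &number)) return result;
  const std::string ext = name.substr(dot + 1);
  if (ext == "sst") {
    result.kind = FileKind::kTable;
    result.number = number;
  } else if (ext == "log") {
    result.kind = FileKind::kWal;
    result.number = number;
  }
  return result;
}
```

With the directory listing shown earlier, `ClassifyByName("000006.sst")` yields a table with number 6, while names such as `CURRENT` or `LOG.old.1504116879407136` come back as `kOther`.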
Convert logs to tables
Every active log file is replayed. All sections of the file where the checksum does not match are skipped over. We intentionally give preference to data consistency.
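The skip-on-mismatch behavior can be illustrated with a small sketch. This is only a model of the idea, under loud assumptions: `LogRecord`, `Checksum`, and `ReplayValidRecords` are hypothetical names, the record layout is invented, and the checksum is a plain FNV-1a hash used as a stand-in. RocksDB's real log format and checksum differ from this.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: a payload plus the checksum stored alongside it.
struct LogRecord {
  uint32_t stored_checksum;
  std::string payload;
};

// FNV-1a, used here only as a stand-in for the real log checksum.
uint32_t Checksum(const std::string& data) {
  uint32_t hash = 2166136261u;
  for (unsigned char c : data) {
    hash ^= c;
    hash *= 16777619u;
  }
  return hash;
}

// Returns the payloads whose checksum matches, in log order. Records with a
// mismatching checksum are skipped rather than trusted. If `skipped` is not
// null, it receives the number of records that were dropped.
std::vector<std::string> ReplayValidRecords(
    const std::vector<LogRecord>& records, std::size_t* skipped) {
  std::vector<std::string> replayed;
  std::size_t dropped = 0;
  for (const LogRecord& record : records) {
    if (Checksum(record.payload) != record.stored_checksum) {
      ++dropped;  // Prefer consistency: never replay unverifiable data.
      continue;
    }
    replayed.push_back(record.payload);
  }
  if (skipped != nullptr) *skipped = dropped;
  return replayed;
}
```

The design choice mirrors the text: a damaged section costs some data, but nothing that failed verification is ever written into the recovered tables.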
Extract metadata
We scan every table to compute:
- smallest and largest keys in the table
- largest sequence number in the table
If we are unable to scan the file, we ignore the table.
Write Descriptor
We generate descriptor contents:
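As a rough illustration of what the scan collects, the sketch below summarizes a list of entries. The types `Entry` and `TableSummary` and the function `SummarizeTable` are hypothetical, keys are compared bytewise with `std::string` ordering (an assumption; a DB with a custom comparator would order keys differently), and returning `false` for an empty input is this sketch's own stand-in for "unable to scan".

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical view of one table entry: a user key and its sequence number.
struct Entry {
  std::string user_key;
  uint64_t sequence;
};

// The per-table metadata the scan is after.
struct TableSummary {
  std::string smallest;
  std::string largest;
  uint64_t max_sequence = 0;
};

// Computes smallest key, largest key, and largest sequence number.
// Returns false (and leaves *out untouched) when there is nothing to scan.
bool SummarizeTable(const std::vector<Entry>& entries, TableSummary* out) {
  if (entries.empty()) return false;
  TableSummary summary;
  summary.smallest = entries.front().user_key;
  summary.largest = entries.front().user_key;
  summary.max_sequence = entries.front().sequence;
  for (const Entry& entry : entries) {
    if (entry.user_key < summary.smallest) summary.smallest = entry.user_key;
    if (entry.user_key > summary.largest) summary.largest = entry.user_key;
    if (entry.sequence > summary.max_sequence) {
      summary.max_sequence = entry.sequence;
    }
  }
  *out = summary;
  return true;
}
```

A caller would drop any table for which `SummarizeTable` returns `false` and carry the summaries of the remaining tables into the descriptor-writing phase.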
- log number is set to zero
- next-file-number is set to 1 + largest file number we found
- last-sequence-number is set to largest sequence# found across all tables
- compaction pointers are cleared
- every table file is added at level 0
Possible optimizations
- Compute the total size and use it to pick an appropriate max-level M
- Sort tables by the largest sequence# in the table
- For each table: if it overlaps an earlier table, place it in level-0; otherwise place it in level-M.
- We can provide options for time-consistent recovery and unsafe recovery (ignore checksum failures when applicable)
- Store per-table metadata (smallest, largest, largest-seq#, ...) in the table's meta section to speed up ScanTable.
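The descriptor rules and the level-placement idea from the list above can be sketched together. All names here (`TableMeta`, `Descriptor`, `BuildDescriptor`, `AssignLevels`) are hypothetical, and the structs are far simpler than a real MANIFEST record. `BuildDescriptor` follows the stated rules: log number zero, next file number one past the largest seen, last sequence as the maximum across tables, every table at level 0. `AssignLevels` models only the "place in level-M unless it overlaps an earlier table" item, assuming an ascending sort by largest sequence number and a caller-supplied M, since the text does not say how M is derived from the total size.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-table metadata gathered during the scan phase.
struct TableMeta {
  uint64_t file_number;
  std::string smallest;
  std::string largest;
  uint64_t max_sequence;
  int level;
};

// Hypothetical, simplified descriptor. Compaction pointers are "cleared" by
// simply not being represented.
struct Descriptor {
  uint64_t log_number;
  uint64_t next_file_number;
  uint64_t last_sequence;
  std::vector<TableMeta> files;
};

Descriptor BuildDescriptor(std::vector<TableMeta> tables,
                           uint64_t largest_file_number_seen) {
  Descriptor d;
  d.log_number = 0;
  d.next_file_number = largest_file_number_seen + 1;
  d.last_sequence = 0;
  for (TableMeta& table : tables) {
    d.last_sequence = std::max(d.last_sequence, table.max_sequence);
    table.level = 0;  // Every table file is added at level 0.
  }
  d.files = tables;
  return d;
}

// True when the closed key ranges [smallest, largest] intersect (bytewise).
bool Overlaps(const TableMeta& a, const TableMeta& b) {
  return !(a.largest < b.smallest || b.largest < a.smallest);
}

// Sketch of the placement idea: sort by largest sequence number (ascending,
// an assumption), then use level 0 on overlap with an earlier table and
// max_level otherwise.
void AssignLevels(std::vector<TableMeta>* tables, int max_level) {
  std::sort(tables->begin(), tables->end(),
            [](const TableMeta& a, const TableMeta& b) {
              return a.max_sequence < b.max_sequence;
            });
  for (std::size_t i = 0; i < tables->size(); ++i) {
    bool overlaps_earlier = false;
    for (std::size_t j = 0; j < i; ++j) {
      if (Overlaps((*tables)[i], (*tables)[j])) {
        overlaps_earlier = true;
        break;
      }
    }
    (*tables)[i].level = overlaps_earlier ? 0 : max_level;
  }
}
```

Placing everything at level 0 is always safe because level 0 tolerates overlapping key ranges; the placement sketch trades a quadratic overlap check for fewer level-0 files after repair.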
Limitations
If a column family was created recently and has not been persisted to SST files by a flush, it will be dropped during the repair process. Because of this limitation, repair might even damage a healthy DB if its column families have not been flushed yet.