Data Snapshot

Hongzheng Shi edited this page Dec 3, 2018 · 2 revisions

Overview

A snapshot persists the current in-memory live store to disk. It allows fast recovery of dimension tables (as an alternative to replaying every event) and enables merging and purging of redo logs.

File layout on disk

checkout here

Snapshot Process

Based on table-level configuration, each time the scheduler ticks it checks whether the number of mutations on the live store exceeds a threshold, or whether a predefined time interval has passed. If either condition is satisfied for a dimension table, a snapshot is created for that table. The snapshot manager records the current live store status (redo log file, batch offset, number of mutations, last read record), then persists the live shards to disk, after which it updates the live store and metastore with the latest status.

Recover Process

When a table is bootstrapped, the recovery process queries the metastore for the latest snapshot info and uses the latest available snapshot to rebuild the table quickly.
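A minimal Go sketch of that lookup is shown below. The names `SnapshotInfo`, `Metastore`, `RecoveryPlan`, and `planRecovery` are hypothetical, and the metastore is a plain map. The sketch also assumes that redo logs are replayed from the position recorded with the snapshot, and that a table with no snapshot falls back to a full replay; the text above does not spell out either behavior.

```go
package main

import "fmt"

// SnapshotInfo is a hypothetical record of what the metastore keeps per table.
type SnapshotInfo struct {
	RedoLogFile int64
	BatchOffset uint32
}

// Metastore is an in-memory stand-in keyed by table name.
type Metastore map[string]SnapshotInfo

// RecoveryPlan describes how a table would be rebuilt.
type RecoveryPlan struct {
	UseSnapshot      bool   // load snapshot files from disk first
	ReplayFromFile   int64  // assumed: redo log file to resume replay from
	ReplayFromOffset uint32 // assumed: batch offset to resume replay from
}

// planRecovery asks the metastore for the latest snapshot of a table.
// With a snapshot, recovery starts from it; without one, the zero-value
// plan stands for replaying all events from the beginning.
func planRecovery(ms Metastore, table string) RecoveryPlan {
	info, ok := ms[table]
	if !ok {
		return RecoveryPlan{}
	}
	return RecoveryPlan{
		UseSnapshot:      true,
		ReplayFromFile:   info.RedoLogFile,
		ReplayFromOffset: info.BatchOffset,
	}
}

func main() {
	ms := Metastore{"cities": {RedoLogFile: 1543795200, BatchOffset: 42}}
	fmt.Printf("cities:  %+v\n", planRecovery(ms, "cities"))
	fmt.Printf("drivers: %+v\n", planRecovery(ms, "drivers"))
}
```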