Skip to content
/ rsdos Public

KV store for files in local file system. Support optimize the storage with different types of compression algorithms. Implemented in Rust with python API available.

Notifications You must be signed in to change notification settings

unkcpz/rsdos

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RSDOS

An efficient (R)u(S)ty (D)isk-(O)bject(S)tore.

Installation

  • cargo binstall
  • cargo instatll
  • curl
  • python library and bin
  • apt/pacman/brew

Usage

TODO:

Design

APIs

Rust

  • insert and insert_many for insert single or many stream objects (e.g. files) to container.
  • extract and extract_many for extract single or many objects from container.
  • insert_many and extract_many should both using iterator as input/output since the number of objects can be huge. Meanwhile using iterator helps with buffer management on rw large amount of files.
  • For the container, it should implement insert, extract, insert_many and extract_many. That requires loose has insert_many implemented from insert and be the method for the container.
  • Since insert/extract can interact with either loose or packed store, I use enum-based strategy.
  • naming convention are loose and packed. To compatible with legacy dos, if legacy container exist, packs is also valid.
  • pack is the operation to move objects from loose to packed store. It calling insert_many to packed store since no overhead on DB openning/closing.
  • repack is on packed store and do the pack again using incremental pack_id (vacuum after).
  • Besides the pack and repack cases above, insert_many to packed store should not exposed to normal user.
  • To make Container a generic type, things that implement insert, extract, insert_many and extract_many should be a Container no matter it is local or not.
  • hashkey servers two purpose: 1. as the id of the object stored, this need to use sha256 to avoid duplicate 2. as the checksum to see if the lazy object read is valid, for this purpose can use cheap checksum.
  • For the Packed objects, raw_size is the uncompressed size while size is the compressed size occupied the packed file, this different from legacy dos which length is the compressed size occupied in packed file.
  • Only support one compression library, for V1 that is zlib, for V2 use zstd.
  • Support zlib and zstd (default).
  • when it is worth to compress?

Py wrapper

I think open container as context manager is a bad idea in legacy dos. Because the drop part is calling db.close() which only required for packs rw. In principle the context manager should always used since otherwise DB is not gracefully tear down and cause memory leak. Rust will take care of drop in scope so I do not need to put any drop codes for container in py wrapper. After initialize the container, the object is straightforward to use and every IO has its own connection to DB. Since I am using sled as embeded DB, it is even safe to be used in a non-blocking condition (not tested but in principle if we trust sled).

When interact with container, client side (user) have no knowledge on where the objects are stored it can be in loose or packed. Therefore, when calling insert or insert_many from python wrapper it always goes to loose. When calling extract or extract_many it will check loose first and then packed store to get the object(s). The pack operation will trigger the move from loose to packed store and result into the objects are distrubuted in two places.

Illustration

The bytes passing in the container and cross python/rust boundary is demostrated.

cross boundaries

Estimate whether to compress

In the reader maker, I put a method to the trait named worth_compress that return (SmallContent, MaybeBinary, ZFile([u8; 4]), MaybeLargeText). By default it return MaybeLargeText so should be compressed by default if compression turned on. I use the metric metioned in the "att" article above to decide whether I'll compress it or not.

Here is the decision making flow:

  • If something wrong when parsing the maybe format, just regard it "worth to compress" (e.g. MaybeLargeText).
  • If it is a file (SmallContent) < 850 bytes don't compress. (file metadata)
  • Read 2 header bytes if it is a zilb or a zstd(which is 4 bytes in header) (ZFile([u8; 4])), don't compress. (this will be override if recompress was on and different compression algorithm is assigned.)
  • Read 512 bytes and check if it is a binary (MaybeBinary) (by checking null bytes which is a heuristic for it is a binary data)
  • none of above is true, regard it as "worth to compress!" (MabyLargeText)

This avoid to run actuall compress which bring overhead.

Migration

  • The loose is the same, packs need to rename to packed.
  • Using zstd therefore read as zlib write as zstd.
  • The config.json will contain more information so use default to fill the missing field.
  • The packed DB is the most important thing to migrate, all elements are read out and injected into the new embeded DB backend. (need to be careful about size, raw_size definition w.r.t to the legacy dos)
  • To do the migration, function as CLI command is provided. I also need to provide python wrapper so it can call from AiiDA.

io_uring

TODO:

File operation timeout

Deprecated design decision: see io_uring

Since I/O operiation is synchroues, operiations on large files will block thread. No matter whether I use multithread (through tokio::task::spawn_blocking which is issuing a blocking call in general), I put a timeout to close the handler.

  • Default timeout: 10s.
  • Provide API to pass the timeout and can pass the default value as global variable.

Blocking IO

Deprecated design decision: see io_uring

There is tokio/fs [1] that slap an asynchronous IO but Linux doesn't have non-blocking file I/O so it is blocking a thread somewhere anyway. Tokio will use ordinary blocking file operations behind the scenes by using spawn_blocking treadpool to run in background.

Thus comes to the design, using one thread as default and spawning thread only when the global set for the async is turned on. Because of that, it is also consider to add timeout to the file operations.

I am considering using io-uring so should not having blocking IO in the end.

[1] https://docs.rs/tokio/latest/tokio/fs/index.html

PyO3 at boundary

When wrapping rust interface to python lib, the interfaces require explicit types, which means the traits can not be used in the types that mapping at the boundary. These are the streams that need to pass into and fetched from container defined in the rust implementation. The stream is a file-like instance that can either can read when it used in the flow that goes into the container, or it can write when it used in the flow that being fetch from the container. Typically in the python world it is BinaryIO, StringIO etc.

The write into CNT part is straitforward, since streams can be wrapped as PyFileLikeObject (a PyObject provided by pyo3-file) with read permission. It then becomes rust Reader. As for the read from CNT part, the rust implementation use object hash to lookup in the container and return an Object<R> which can be from loose or packs storages. However, the Object<R> is a generic type defined with any type of reader. If I want to directly return this object to the python world, I need to make it an explicit type and turn out to be very complex in the type: Object<Take<BufReader<File>>> as the time I writing this note. To simplify it, I use PyFileLikeObject with write to map any file-like instance from python world to a Writer. This design at the same time makes the boundary looks symmetry in turns of read and write operations.

Performance notes

Raw benchmark run on the CLI

For pack, see summary in #10

Time scales to be noticed (2009)

https://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/

  • L1 cache reference 0.5 ns
  • Branch mispredict 5 ns
  • L2 cache reference 7 ns
  • Mutex lock/unlock 100 ns
  • Main memory reference 100 ns
  • Compress 1K bytes with Zippy 10,000 ns
  • Send 2K bytes over 1 Gbps network 20,000 ns
  • Read 1 MB sequentially from memory 250,000 ns
  • Round trip within same datacenter 500,000 ns
  • Disk seek 10,000,000 ns
  • Read 1 MB sequentially from network 10,000,000 ns
  • Read 1 MB sequentially from disk 30,000,000 ns
  • Send packet CA->Netherlands->CA 150,000,000 ns

Improvement ideas

Why legacy-dos is a bit slower than rsdos

Misc

  • When read from hashkey, store the handler and meta which require allocation, and increase the cache miss.

API discrepancy with legacy dos

  • useless close(), since rust manage drop after out of scope
  • some exceptions are quite redundent such as FileNotFoundError, and NotInitialized Error which are take care by the anyhow to propogate up already.
  • getter for loose_prefix_len and pack_size_target is passing through Config in rsdos, no willing to support expose API to container.

Progress

  • Init command
  • Status command and test on large dos
  • AddFiles and then can start prepare tests cases
  • Prepare test using stream to loose and test on init/status/add-files
  • Read APIs: has_objects, get_object_hash, get_folder, get_object_stream, get_objects_stream_and_meta, list_all_objects
  • Container as an struct
  • pyo3 bindings and get object iter bind
  • 1st benchmark with python dos on loose read and write behaviors
  • Pack write
  • Pack read
  • Solve HashWriter overhead by using reference instead of mem alloc for each entry visiting
  • benchmark on packs read/write
  • profiling on packs read (db?, io?) see flamegraph of batch_packs_read.rs and it shows db is the bottleneck.
  • packs correctly on adding new packs file
  • loose -> Pack
  • benchmark on loose -> Pack without compress (more than 3x times faster)
  • API redesign to make it ergonamic and idiomatic Rust #7
  • compression (zlib) #8
  • benchmark on pack with compress #9
  • estimate on the input stream format and decide whether pack. #9
  • (v2) Use sled as k-v DB backend which should have better performance than sqlite #1
  • (v2) io_uring
  • (v2) switch to using zstd instead of zlib
  • CLI commands so can bench on large samples (200 GB SSSP repo).
  • Memory footprint tracking when packing, since rsdos use iterator it should be more memory efficient.
  • Dependency injection mode to attach progress bar to long run functions (py exposed interface as well)
  • docs as library
  • repack
  • optimize
  • validate
  • backup
  • benchmark on optimize/validate/backup ...
  • (v1) Explicit using buffer reader/writer to replace copy_by_chunk, need to symmetry use buf on reader and write for insert/extract. I need to decide in which timing to wrap reader as a BufReader, in ReaderMaker or in copy??? (no gain on performance)
  • (v1) own rust benchmark on detail performance tuning.
  • (v1) Add mutex to the pack write, panic when other thread is writing. (or io_uring take care of async?)
  • (v2) Compress on adding to loose as git not just during packs. Header definition required.
  • (v2) hide direct write to packs and shading with the same loose structure
  • (v2) Rename packs -> packed
  • (v2) migation plan and CLI tool (v1 is design to compatible to w/r legacy dos repo and same interface, v2 has API change therefore take care of migaration.)
  • (v3) integrate with OpenDAL for unified interface
  • (v3) (dup as above?) generic Container interface that can extent to host data in online storage (trait Container with insert/extract methods)

About

KV store for files in local file system. Support optimize the storage with different types of compression algorithms. Implemented in Rust with python API available.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published