You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
First of all, thank you for such a great tutorial on LSM-Tree storage engine.
After finishing 3 weeks project, I have got a lot of to share, including some questions and suggestions.
1. StorageIterator specification
Is there an implicit StorageIterator specification?
In week 1 project, there are several iterators that implement StorageIterator.
As I implemented these iterators, I often asked myself, "Have I implemented these iterators in a coherent way?"
It turns out that I wrote the same assertions again and again, so I feel there is some implicit specification that I should apply.
The spec I summarized is listed below:
fn key(&self) -> KeyType<'_>
This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
The return value must be an valid key, which means it is an non-empty key.
fn value(&self) -> &[u8]
This function can be called iff the iterator is valid, if it is called from an invalid iterator, we could panic.
fn is_valid(&self) -> bool
This function return false iff its underlying storage cannot produce a new element.
For FuseIterator, it also return false after next return an error.
fn num_active_iterators(&self) -> usize
For invalid iterator, it should return 0
For valid iterator, it should not return 0.
For FuseIterator, when the iterator is valid, it should not return 0. when the iterator is invalid, the return value is unspecified.
fn next(&mut self) -> Result<()>
For invalid iterator, it is no-op.
For valid iterator, if return Ok(())
If underlying storage cannot produce a new element, turn into invalid iterator. return Ok(())
Otherwise, store the new element. return Ok(())
The key from new element should strict greater than old key.
For valid iterator, if return Err(e)
The key, value, is_valid call should return same value as before, that means act like no-op.
For compound of iterators like MergeIterator, SstConcatIterator and TwoMergeIterator, it should remove the underlying iterator which cause the error. (I'm not sure whether is sound for SstConcatIterator and TwoMergeIterator)
For FuseIterator, it should turn into invalid iterator.
I'm not sure if it covers all situations.
2. Inconsistent function signature
As mentioned in #72. There is an inconsistent function signature in decode_block_meta.
/// The compactor will call this function with the compaction task and the list of SST ids generated. This function applies the
/// result and generates a new LSM state. The functions should only change `l0_sstables` and `levels` without changing memtables
/// and `sstables` hash map. Though there should only be one thread running compaction jobs, you should think about the case
/// where an L0 SST gets flushed while the compactor generates new SSTs, and with that in mind, you should do some sanity checks
/// in your implementation.
pubfnapply_compaction_result(
As the document does point out that LsmStorageState should be the new state, but it does not point out what Vec<usize> means.
Only after I exam solution code, I figure out that should be sst_ids which need be removed.
5. Seedable level compaction simulator
The level compaction simulator uses rand to generate the key range, so each run will produce different output, making it difficult to verify that the output is identical to the reference solution by using command line tools such as diffcmp.
One solution is to add the seed flag to the simulator, and use rand::SeedableRng to generate random numbers.
6. println!() to log
In the reference solution, the implementation of apply_result contains some of println!() to log information, making it difficult to verify that the output is identical to the reference solution by using command line tools such as diff, cmp.
I think it should be replaced with eprintln!(), which will output to stderr and does not lose logging functionality.
7. Encourage to add test
In rust project, adding test is pretty easy! and adding tests is an easiest way to contribute to repository.
For example, when implementing manifest functionality, it is good to test serde_json library first.
So I can quickly write relative test within manifest.rs.
Most of IDEs have the ability to run tests directly in Rust, which is different from C/C++ or Java, where you need some sort of test framework or use the main function to run a single simple test.
8. extract apply result logic to bind specific impl
When implementing apply_result function, I use a lot of related method of Vec in my solution,
e.g. Vec::splice, Vec::drain. But the levels Vec<(usize, Vec<usize>)> seems inefficient for doing compaction.
We could generalize levels impl like levels: Impl LevelsImpl and create a type alias type Levels = Vec<(usize, Vec<usize>)>;, and impl LevelsImpl trait for Vec<(usize, Vec<usize>)>, and we have the ability to easily switch to another implementation.
But it is of low priority, since this project is just for educational purposes.
9. immutable memtable is not immutated
Due to the interior mutability of MemTable, there is no way to forbid programmers from modifying immutable memtables in Vec<Arc<MemTable>>.
Which may (low probability) cause a nasty bug. But could be avoided with a little care.
The solution is quite simple: Use the Newtype pattern and expose only the read-only interface.
10. Migration to mvcc version is considered painful
It is kind of dirty work, and frustrating, to refactor code to use the key+ts representation.
The code that needs to be changed is scattered across many files.
Asking students to write unsafe code in Rust is awkward.
Learning on code that does not complied is quite inefficient.
11. use Arc_cycle to store weak pointer into LsmMvccInner
Strictly speaking, LsmMvccInner::new_txn can only be called with the LsmStorageInner that created it, but its semantics do not restrict this.
For example, if there are two different LsmStorageInner named storageA and storageB with associated LsmMvccInner named mvccA and mvccB,
nobody forbids us to call mvccB.new_txn(storageA) or mvccA.new_txn(storageB), which is obviously a logical error.
However, as project developers, we know that there is only one LsmStorageInner instance, so this won't be a critical issue.
But ideally, we can do better by using Arc::Arc_cycle to pass a weak pointer directly into LsmMvccInner::new.
12. TxnIterator, LsmIterator is not compatible on non-mvcc version
Currently TxnIterator uses TwoMergeIterator<TxnLocalIterator, LsmIterator> to iterate over the underlying storage.
When switching to the non-mvcc version, creating an empty transaction without mvcc can be tricky.
It might be better to use Option<A> in TwoMergeIterator and Option<Arc<Transaction>> in TxnIterator to handle this situation.
Again, thank you for such a great tutorial on LSM-Tree storage engine.
The text was updated successfully, but these errors were encountered:
First of all, thank you for such a great tutorial on LSM-Tree storage engine.
After finishing 3 weeks project, I have got a lot of to share, including some questions and suggestions.
1. StorageIterator specification
Is there an implicit StorageIterator specification?
In week 1 project, there are several iterators that implement StorageIterator.
As I implemented these iterators, I often asked myself, "Have I implemented these iterators in a coherent way?"
It turns out that I wrote the same assertions again and again, so I feel there is some implicit specification that I should apply.
The spec I summarized is listed below:
fn key(&self) -> KeyType<'_>
fn value(&self) -> &[u8]
fn is_valid(&self) -> bool
false
iff its underlying storage cannot produce a new element.FuseIterator
, it also returnfalse
afternext
return an error.fn num_active_iterators(&self) -> usize
FuseIterator
, when the iterator is valid, it should not return 0. when the iterator is invalid, the return value is unspecified.fn next(&mut self) -> Result<()>
Ok(())
Ok(())
Ok(())
Err(e)
key
,value
,is_valid
call should return same value as before, that means act like no-op.MergeIterator
,SstConcatIterator
andTwoMergeIterator
, it should remove the underlying iterator which cause the error. (I'm not sure whether is sound forSstConcatIterator
andTwoMergeIterator
)FuseIterator
, it should turn into invalid iterator.I'm not sure if it covers all situations.
2. Inconsistent function signature
As mentioned in #72. There is an inconsistent function signature in
decode_block_meta
.mini-lsm/mini-lsm-starter/src/table.rs
Lines 45 to 46 in dd333ca
mini-lsm/mini-lsm/src/table.rs
Lines 63 to 64 in dd333ca
Until week 2 day 7, the signature is fine, but it turns out that this is not sufficient for checksum functionality.
Result
,&[u8]
is more convenient thanimpl Buf
when implementing checksum functionality.3. key always non-empty
It seems that an empty key is considered invalid. So we can enforce it by adding
ParsedKey
type, we could usenutype
crate to achieve this goal.But it may not be necessary for an educational project, since it is convenient to use an empty key instead of
Option
wrapper.4.
apply_result
does not point out what it should returnmini-lsm/mini-lsm-starter/src/compact/simple_leveled.rs
Lines 41 to 48 in dd333ca
As the document does point out that
LsmStorageState
should be the new state, but it does not point out whatVec<usize>
means.Only after I exam solution code, I figure out that should be sst_ids which need be removed.
5. Seedable level compaction simulator
The level compaction simulator uses rand to generate the key range, so each run will produce different output, making it difficult to verify that the output is identical to the reference solution by using command line tools such as
diff
cmp
.One solution is to add the seed flag to the simulator, and use
rand::SeedableRng
to generate random numbers.6.
println!()
to logIn the reference solution, the implementation of
apply_result
contains some ofprintln!()
to log information, making it difficult to verify that the output is identical to the reference solution by using command line tools such asdiff
,cmp
.I think it should be replaced with
eprintln!()
, which will output tostderr
and does not lose logging functionality.7. Encourage to add test
In rust project, adding test is pretty easy! and adding tests is an easiest way to contribute to repository.
For example, when implementing manifest functionality, it is good to test
serde_json
library first.So I can quickly write relative test within
manifest.rs
.Most of IDEs have the ability to run tests directly in Rust, which is different from C/C++ or Java, where you need some sort of test framework or use the main function to run a single simple test.
8. extract apply result logic to bind specific impl
When implementing
apply_result
function, I use a lot of related method ofVec
in my solution,e.g.
Vec::splice
,Vec::drain
. But the levelsVec<(usize, Vec<usize>)>
seems inefficient for doing compaction.We could generalize levels impl like
levels: Impl LevelsImpl
and create a type aliastype Levels = Vec<(usize, Vec<usize>)>;
, and implLevelsImpl
trait forVec<(usize, Vec<usize>)>
, and we have the ability to easily switch to another implementation.But it is of low priority, since this project is just for educational purposes.
9. immutable memtable is not immutated
Due to the interior mutability of
MemTable
, there is no way to forbid programmers from modifying immutable memtables inVec<Arc<MemTable>>
.Which may (low probability) cause a nasty bug. But could be avoided with a little care.
The solution is quite simple: Use the
Newtype
pattern and expose only the read-only interface.10. Migration to mvcc version is considered painful
It is kind of dirty work, and frustrating, to refactor code to use the key+ts representation.
The code that needs to be changed is scattered across many files.
Asking students to write unsafe code in Rust is awkward.
Learning on code that does not complied is quite inefficient.
11. use
Arc_cycle
to store weak pointer into LsmMvccInnerpub fn new_txn(&self, inner: Arc<LsmStorageInner>, serializable: bool) -> Arc<Transaction>
Strictly speaking,
LsmMvccInner::new_txn
can only be called with theLsmStorageInner
that created it, but its semantics do not restrict this.For example, if there are two different
LsmStorageInner
namedstorageA
andstorageB
with associatedLsmMvccInner
namedmvccA
andmvccB
,nobody forbids us to call
mvccB.new_txn(storageA)
ormvccA.new_txn(storageB)
, which is obviously a logical error.However, as project developers, we know that there is only one
LsmStorageInner
instance, so this won't be a critical issue.But ideally, we can do better by using
Arc::Arc_cycle
to pass a weak pointer directly intoLsmMvccInner::new
.12. TxnIterator, LsmIterator is not compatible on non-mvcc version
Currently
TxnIterator
usesTwoMergeIterator<TxnLocalIterator, LsmIterator>
to iterate over the underlying storage.When switching to the non-mvcc version, creating an empty transaction without mvcc can be tricky.
It might be better to use
Option<A>
inTwoMergeIterator
andOption<Arc<Transaction>>
inTxnIterator
to handle this situation.Again, thank you for such a great tutorial on LSM-Tree storage engine.
The text was updated successfully, but these errors were encountered: