-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Compressor interface #584
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to finish util/compressor class but have done the rest. Looks good so far!
@@ -150,7 +150,7 @@ Status BlobFileBuilder::Add(const Slice& key, const Slice& value, | |||
} | |||
|
|||
BlobIndex::EncodeBlob(blob_index, blob_file_number, blob_offset, blob.size(), | |||
blob_compression_type_); | |||
blob_compressor_->GetCompressionType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the code, I am not sure if blob_compressor_ is always set. Do you know if that is true? Should there be an assert somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blob_compressor_ is initialized in BlobFileBuilder's constructor from mutable_cf_options.blob_compressor. The default constructor of MutableCFOptions sets blob_compressor_ to nullptr.
I'm not sure if we can ever run into a situation where blob_compressor_ is left null, but I'll initialize it to NoCompressor in the BlobFileBuilder's constructor to be safe. This better aligns with the previous code initializing to kNoCompression by default.
db/compaction/compaction.cc
Outdated
bool _manual_compaction, const std::string& _trim_ts, double _score, | ||
bool _deletion_compaction, bool l0_files_might_overlap, | ||
CompactionReason _compaction_reason, | ||
uint32_t _output_path_id, std::shared_ptr<Compressor> _compressor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::shared_ptr& ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in next push
@@ -1150,7 +1149,6 @@ TEST_F(DBOptionsTest, ChangeCompression) { | |||
ASSERT_OK(dbfull()->TEST_WaitForCompact()); | |||
ASSERT_TRUE(compacted); | |||
ASSERT_EQ(CompressionType::kSnappyCompression, compression_used); | |||
ASSERT_EQ(6, compression_opt_used.level); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this line was removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion was added again in the second PR (with some changes). I'm moving it back here.
@@ -26,6 +26,7 @@ | |||
#include "rocksdb/utilities/object_registry.h" | |||
#include "rocksdb/utilities/options_type.h" | |||
#include "util/autovector.h" | |||
#include "util/string_util.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that it is. Env::GenerateUniqueId calls PutBaseChars in string_util.h. Include changes in other files led to this.
|
||
// Abstract algebra teaches us that a finite cyclic group (such as the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this code go? Was it moved or is it no longer necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It moved to util/compressor.cc. It is part of the default implementation to sample and create a dictionary. The Compressor interface allows plugins to override the methods and customize dictionary creation if desired.
@@ -142,15 +142,16 @@ Status ReadAndParseBlockFromFile( | |||
BlockCreateContext& create_context, bool maybe_compressed, | |||
const UncompressionDict& uncompression_dict, | |||
const PersistentCacheOptions& cache_options, | |||
MemoryAllocator* memory_allocator, bool for_compaction, bool async_read) { | |||
MemoryAllocator* memory_allocator, bool for_compaction, | |||
std::shared_ptr<Compressor>& compressor, bool async_read) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
///should this be a const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should. Fixed.
&rep_->compressor); | ||
if (!s.ok() || rep_->compressor == nullptr) { | ||
ROCKS_LOG_ERROR(rep_->ioptions.logger, | ||
"Compression type not supported"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also set "blocks_maybe_compressed=false"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compressor::CreateFromString failing means that the compressor needed by this table is missing. Instantiating NoCompressor (condition for blocks_maybe_compressed=false) should never fail. This scenario should eventually lead to decompression failing.
table/table_builder.h
Outdated
@@ -93,8 +93,7 @@ struct TableBuilderOptions { | |||
const ImmutableOptions& _ioptions, const MutableCFOptions& _moptions, | |||
const InternalKeyComparator& _internal_comparator, | |||
const IntTblPropCollectorFactories* _int_tbl_prop_collector_factories, | |||
CompressionType _compression_type, | |||
const CompressionOptions& _compression_opts, uint32_t _column_family_id, | |||
std::shared_ptr<Compressor> _compressor, uint32_t _column_family_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const & ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -56,8 +56,8 @@ Status BlobDumpTool::Run(const std::string& filename, DisplayType show_key, | |||
reader_.reset(new RandomAccessFileReader(std::move(file), filename)); | |||
uint64_t offset = 0; | |||
uint64_t footer_offset = 0; | |||
CompressionType compression = kNoCompression; | |||
s = DumpBlobLogHeader(&offset, &compression); | |||
std::shared_ptr<Compressor> compressor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be null or a NoCompression compressor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compressor is initialized in DumpBlobLogHeader. It takes compression from BlobLogHeader (kNoCompression by default).
util/compression.cc
Outdated
@@ -1,11 +1,684 @@ | |||
// Copyright (c) 2022-present, Facebook, Inc. All rights reserved. | |||
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The year here changed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. We introduced compression.cc first in the PRs, but it got later created in RocksDB too, and the copyright date change was missed.
return true; | ||
} | ||
} | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should type be set to Unknown/None or something on error (similar to how result is cleared)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If StringToType returns false, the caller should not rely on the value of type. All uses of StringToType adhere to this expectation (except for a few tests, where the string is known to be a valid name).
@@ -666,55 +282,66 @@ inline std::string CompressionOptionsToString( | |||
return result; | |||
} | |||
|
|||
#ifdef SNAPPY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the XXX_Compress/Uncompress functions still needed in a header file or can they be local to the source file now? If they cannot be moved to the source file, can you please add a TODO comment to make that move when it is available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functions were left in the header file for ease of review. They should be moved to the source file eventually. Adding a TODO comment.
|
||
// Returns a new uncompression dictionary from the input dict. | ||
// Classes which have a ProcessedDict should override this method. | ||
virtual UncompressionDict* NewUncompressionDict(const std::string& dict) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these return a std::unique_ptr? Maybe something that should happen in the next phase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pointer returned by NewUncompressionDict is used to reset a unique_ptr in the two usages I found in the code (BlockCreateContext::Create and BlockBasedTableBuilder::EnterUnbuffered). This is similar to the previous code, where the new operator was used.
const std::string& compression_dict_samples, | ||
const std::vector<size_t>& compression_dict_sample_lens); | ||
|
||
private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would rather these private functions were not in the interface if it can be avoided. Can you explain when/where/why they are used and if there is a way to move them to something local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These private getters are used to access certain compression options: GetMaxDictBytes, GetMaxTrainBytes, UseDictTrainer, GetLevel. These options are used by some Compressor methods with default implementations (e.g., SampleDict and TrainDict). Different compressors could customize the names of the underlying options, and the getters provide a standard way to access them.
For example, SampleDict and TrainDict provide the default RocksDB dictionary creation logic. Compressors are free to add their custom logic, but I'd expect many compressors to rely on the default implementation.
RocksDB code outside Compressor does not need these options. There are other similar option getters that are public, as other RocksDB code needs to access them (GetMaxDictBufferBytes, GetParallelThreads).
Making these getters private may be a bit too restrictive. We could at least make them protected.
void Compressor::SampleDict(std::vector<std::string>& data_block_buffers, | ||
std::string* compression_dict_samples, | ||
std::vector<size_t>* compression_dict_sample_lens) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be a static method or somehow otherwise denoted? It does not appear to use anything inside of the Compressor (other than dictionary sizes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SampleDict and TrainDict provide the default RocksDB dictionary creation logic. These methods are used in the default implementation of CreateDict. I split the two phases of dictionary creation as Compressors may want to use to customize only sampling or training (I ran into that scenario). Compressors are free to customize any of these methods if they need custom logic for one or both phases.
An alternative (also covering the private getters in the previous comment) would be to place the default dictionary creation code in a utility class. Compressors would then use the utility class in their CreateDict method to use the default logic.
util/compressor.cc
Outdated
dict = ZSTD_TrainDictionary(compression_dict_samples, | ||
compression_dict_sample_lens, max_dict_bytes); | ||
} else { | ||
dict = ZSTD_FinalizeDictionary(compression_dict_samples, | ||
compression_dict_sample_lens, | ||
max_dict_bytes, level); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the dictionary training the same for all dictionaries and not dependent on ZSTD?
What happens if built without ZSTD enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ZSTD dictionary trainer is used with other compression algorithms too.
- ZSTD_TrainDictionary has conditional compilation based on ZSTD library version. If the trainer is not supported, it returns an empty string.
- There is also a ZSTD_TrainDictionarySupported function with the same library version check. It is used in CheckCompressionSupported. That only checks CompressionOptions, not Compressor options (as a Compressor could define a custom trainer).
Similar comments apply to ZSTD_FinalizeDictionary.
I added a check for ZSTD_TrainDictionarySupported in TrainDict's default implementation. If the Compressor selects to use a trainer (based on GetMaxTrainBytes and UseDictTrainer), it checks if ZSTD_TrainDictionary is supported. If not, it checks if ZSTD_FinalizeDictionary is supported. If neither is supported, it will return back the original samples.
241d920
to
1def930
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for the PR
@lucagiac81 please rebase |
1def930
to
439bc30
Compare
@ofriedma, @speedbmike - Please review |
439bc30
to
d88100c
Compare
Code is rebased on latest main based RocksDB 8.6.7 (the previous code was based on RocksDB 8.1.1). One note regarding ZSTDContext
|
|
This PR refactors compression by introducing a Compressor interface.
This is a first step towards supporting compressors as plugins. PR #585 covers the next step.
It is equivalent to PR7650 in RocksDB.
Compressor class
The Compressor class defines an interface for each compression algorithm to implement. It is a Customizable class, like other extensible components in Speedb.
A Compressor has
Built-in compressors
The existing compression algorithms (referred to as "built-in") are encapsulated in Compressor classes. The classes inherit from BuiltinCompressor, which includes functionality shared by all built-in compressors.
Built-in compressors can be referenced by their numeric id (as defined by the CompressionType enum) to ensure backward compatibility. BuiltinCompressor uses the existing CompressionOptions struct as its configurable options.
Compressor lifecycle
For this PR, compression options exposed in the public API are unchanged (exposing Compressor in the public API and options is the scope of PR #585).
TODO
Streaming compression is not included in the Compressor class yet. The plan is to cover that in a separate PR.
It could be offered through additional methods in the Compressor class. For example