refactor(silo): remove remaining parts of partitioning#1226
Conversation
|
There is no change in the changelog. This PR will not produce a new releasable version. |
a87ee14 to
27799c8
Compare
There was a problem hiding this comment.
Pull request overview
This PR fully removes SILO’s remaining partitioning code paths, consolidating storage, serialization, and query execution to operate on a single table-wide column set and a single filter bitmap.
Changes:
- Removed
TablePartitionand migrated partition validation/finalization/serialization logic intostorage::Table. - Replaced per-partition filter computation and scan iteration with a single
CopyOnWriteBitmapviacomputeFilter(...), and updated query/filter compilation accordingly. - Renamed column container/types from
*Partition*to non-partitioned equivalents (e.g.,ColumnGroup,StringColumn,SequenceColumn, etc.) and updated tests + serialized state.
Reviewed changes
Copilot reviewed 119 out of 121 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| testBaseData/siloSerializedState/1774625658/data_version.silo | Adds new serialized-state version marker for v22. |
| testBaseData/siloSerializedState/1774259904/data_version.silo | Removes old serialized-state version marker for v21. |
| src/silo/storage/table_partition.h | Removes partition type definition. |
| src/silo/storage/table_partition.cpp | Removes partition validation/finalization implementation. |
| src/silo/storage/table.h | Moves partition-owned fields/serialization/validation APIs into Table. |
| src/silo/storage/table.cpp | Implements single-table initialization, validation, serialization, load/save. |
| src/silo/storage/column_group.test.cpp | Updates tests from ColumnPartitionGroup to ColumnGroup and non-partition column types. |
| src/silo/storage/column_group.h | Renames/retargets container to ColumnGroup with non-partition column types. |
| src/silo/storage/column_group.cpp | Updates column accessors and JSON insertion to non-partition column types. |
| src/silo/storage/column/zstd_compressed_string_column.test.cpp | Updates test naming/types for ZstdCompressedStringColumn. |
| src/silo/storage/column/zstd_compressed_string_column.h | Renames ZstdCompressedStringColumnPartition to ZstdCompressedStringColumn. |
| src/silo/storage/column/zstd_compressed_string_column.cpp | Updates implementations to new class name. |
| src/silo/storage/column/string_column.test.cpp | Updates tests to StringColumn and renames one test case. |
| src/silo/storage/column/string_column.h | Renames StringColumnPartition to StringColumn. |
| src/silo/storage/column/string_column.cpp | Updates implementation to new class name. |
| src/silo/storage/column/sequence_column.test.cpp | Updates tests to SequenceColumn. |
| src/silo/storage/column/sequence_column.h | Renames SequenceColumnPartition to SequenceColumn. |
| src/silo/storage/column/sequence_column.cpp | Updates implementation and explicit template instantiations. |
| src/silo/storage/column/int_column.test.cpp | Updates tests to IntColumn. |
| src/silo/storage/column/int_column.h | Renames IntColumnPartition to IntColumn. |
| src/silo/storage/column/int_column.cpp | Updates implementation to new class name. |
| src/silo/storage/column/indexed_string_column.test.cpp | Updates tests to IndexedStringColumn. |
| src/silo/storage/column/indexed_string_column.h | Renames IndexedStringColumnPartition to IndexedStringColumn. |
| src/silo/storage/column/indexed_string_column.cpp | Updates implementation to new class name. |
| src/silo/storage/column/float_column.test.cpp | Updates tests to FloatColumn. |
| src/silo/storage/column/float_column.h | Renames FloatColumnPartition to FloatColumn. |
| src/silo/storage/column/float_column.cpp | Updates implementation to new class name. |
| src/silo/storage/column/date32_column.test.cpp | Updates tests to Date32Column. |
| src/silo/storage/column/date32_column.h | Renames Date32ColumnPartition to Date32Column. |
| src/silo/storage/column/date32_column.cpp | Updates implementation to new class name. |
| src/silo/storage/column/column_type_visitor.h | Updates visitor dispatch to non-partition column types. |
| src/silo/storage/column/bool_column.h | Renames BoolColumnPartition to BoolColumn. |
| src/silo/storage/column/bool_column.cpp | Updates implementation to new class name. |
| src/silo/query_engine/operators/zstd_decompress_node.cpp | Switches include/calls from partition filters to computeFilter, updates column types. |
| src/silo/query_engine/operators/table_scan_node.cpp | Uses computeFilter and passes bitmap to scan exec node. |
| src/silo/query_engine/operators/phylo_subtree_node.cpp | Uses a single bitmap filter and reads from table.columns instead of partitions. |
| src/silo/query_engine/operators/mutations_node.cpp | Removes partition prefiltering and evaluates mutations against a single table/bitmap. |
| src/silo/query_engine/operators/most_recent_common_ancestor_node.cpp | Uses single bitmap filter and table-wide column access. |
| src/silo/query_engine/operators/insertions_node.cpp | Removes partition prefiltering and evaluates insertions against a single table/bitmap. |
| src/silo/query_engine/operators/count_filter_node.cpp | Counts cardinality of a single bitmap filter instead of summing partitions. |
| src/silo/query_engine/operators/compute_partition_filters.h | Changes API to computeFilter returning a single bitmap (file name unchanged). |
| src/silo/query_engine/operators/compute_partition_filters.cpp | Deletes old per-partition filter computation implementation. |
| src/silo/query_engine/operators/compute_filter.cpp | Adds new single-bitmap filter computation implementation. |
| src/silo/query_engine/filter/operators/string_in_set.test.cpp | Updates operator tests to new column types. |
| src/silo/query_engine/filter/operators/string_in_set.cpp | Updates explicit template instantiations to new column types. |
| src/silo/query_engine/filter/operators/selection.test.cpp | Updates operator tests to use IntColumn. |
| src/silo/query_engine/filter/operators/selection.h | Updates CompareToValueSelection specialization for StringColumn. |
| src/silo/query_engine/filter/operators/selection.cpp | Updates specialization implementation to StringColumn. |
| src/silo/query_engine/filter/expressions/true.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/true.cpp | Compiles against table-wide sequence_count. |
| src/silo/query_engine/filter/expressions/symbol_in_set.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/symbol_in_set.cpp | Compiles symbol filters against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/symbol_equals.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/symbol_equals.cpp | Uses table-wide sequence column metadata/access in rewrite. |
| src/silo/query_engine/filter/expressions/string_search.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/string_search.cpp | Compiles search against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/string_in_set.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/string_in_set.cpp | Rewrites/compiles against table.columns and new operator types. |
| src/silo/query_engine/filter/expressions/string_equals.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/string_equals.cpp | Compiles against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/phylo_child_filter.h | Removes partition dependency and updates string column type. |
| src/silo/query_engine/filter/expressions/phylo_child_filter.cpp | Compiles against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/or.test.cpp | Updates tests to use Table without partitions and new rewrite signature. |
| src/silo/query_engine/filter/expressions/or.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/or.cpp | Rewrites/compiles children against table-only APIs. |
| src/silo/query_engine/filter/expressions/nof.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/nof.cpp | Compiles N-of expressions against table-only APIs. |
| src/silo/query_engine/filter/expressions/negation.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/negation.cpp | Rewrites/compiles negation against table-only APIs. |
| src/silo/query_engine/filter/expressions/maybe.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/maybe.cpp | Updates rewrite/compile signatures to table-only. |
| src/silo/query_engine/filter/expressions/lineage_filter.h | Removes partition dependency and updates indexed-string column type. |
| src/silo/query_engine/filter/expressions/lineage_filter.cpp | Compiles against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/is_null.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/is_null.cpp | Compiles null checks against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/int_equals.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/int_equals.cpp | Compiles int checks against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/int_between.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/int_between.cpp | Compiles int ranges against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/insertion_contains.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/insertion_contains.cpp | Compiles insertion checks against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/has_mutation.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/has_mutation.cpp | Rewrites using table.columns sequence metadata. |
| src/silo/query_engine/filter/expressions/float_equals.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/float_equals.cpp | Compiles float checks against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/float_between.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/float_between.cpp | Compiles float ranges against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/false.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/false.cpp | Compiles empty operator with table-wide row count. |
| src/silo/query_engine/filter/expressions/expression.h | Updates base interface to table-only rewrite/compile. |
| src/silo/query_engine/filter/expressions/exact.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/exact.cpp | Updates rewrite signature and compile error path. |
| src/silo/query_engine/filter/expressions/date_equals.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/date_equals.cpp | Compiles date checks against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/date_between.h | Updates date column type and removes partition dependency. |
| src/silo/query_engine/filter/expressions/date_between.cpp | Compiles date ranges against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/bool_equals.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/bool_equals.cpp | Compiles bool checks against table.columns and table-wide row count. |
| src/silo/query_engine/filter/expressions/and.h | Removes partition parameter from rewrite/compile APIs. |
| src/silo/query_engine/filter/expressions/and.cpp | Compiles AND expressions against table-only APIs. |
| src/silo/query_engine/exec_node/table_scan.h | Updates scan generator to accept a single bitmap filter and scan a single table. |
| src/silo/query_engine/exec_node/table_scan.cpp | Updates scan materialization to use table-wide row ids and non-partition column types. |
| src/silo/query_engine/exec_node/arrow_util.h | Updates Arrow builder mapping to non-partition column types. |
| src/silo/query_engine/actions/mutations.cpp | Removes partition include usage. |
| src/silo/query_engine/actions/insertions.cpp | Removes partition include usage. |
| src/silo/initialize/initializer.test.cpp | Updates metadata type assertions to new column types. |
| src/silo/initialize/initializer.cpp | Updates metadata initializer visitor to new column types. |
| src/silo/database_info.h | Removes number_of_partitions from API struct. |
| src/silo/database_info.cpp | Removes numberOfPartitions JSON output. |
| src/silo/database.test.cpp | Updates expectations accordingly. |
| src/silo/database.cpp | Updates filtering, statistics, and save/load paths for single-table serialization. |
| src/silo/common/nucleotide_symbols.h | Updates Nucleotide::Column alias to new SequenceColumn. |
| src/silo/common/data_version.h | Bumps serialization version to 22. |
| src/silo/common/aa_symbols.h | Updates AminoAcid::Column alias to new SequenceColumn. |
| src/silo/append/database_inserter.h | Removes partition inserter and simplifies to table-only inserter. |
| src/silo/append/database_inserter.cpp | Updates insertion logic and commit path to finalize/validate on table. |
| src/silo/api/lineage_definition_handler.cpp | Updates metadata lookup to new indexed-string column type. |
| endToEndTests/test/info.test.js | Updates /info contract to remove numberOfPartitions. |
| documentation/api.md | Updates /info response example and field list. |
Comments suppressed due to low confidence (1)
src/silo/storage/table.cpp:1
- With partitions removed, row ids are now global across the table. Allowing
nuc_column.sequence_count < table.sequence_count(only guarding>here) can cause table scans and filter evaluation to index sequence columns withrow_id >= nuc_column.sequence_count, leading to out-of-range access in sequence indexes/buffers. If shorter sequence columns are not a supported state, this should validatenuc_column.sequence_count == sequence_count(and similarly for amino acid sequences and other row-aligned columns) to prevent runtime memory errors. If shorter columns are intended, the scan/filter paths need to treat missing tail rows as nulls consistently.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
27799c8 to
0528ec6
Compare
fengelniederhammer
left a comment
There was a problem hiding this comment.
Nice, only minor stuff.
also the data_directories.md says:
default/
... (partition data)
Which might be worth updating.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 119 out of 121 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
src/silo/storage/table.cpp:1
loadData()no longer checks whether the returned input stream is actually open/valid before deserializing. In this file,saveData()still performs an explicit stream validity check, soloadData()should do the same and throw a load-specific exception (e.g.,persistence::LoadDatabaseException) with a clear message when the file cannot be opened/read.
src/silo/query_engine/exec_node/table_scan.cpp:176- The column lookup
table.columns.getColumns<Column>().at(column_name)is performed for everyrow_id. This has avoidable overhead (map lookup + bounds check) on hot scan paths. Hoist the column reference outside the loop (and similarly hoist any other repeated per-row lookups) so the loop only does null/value access.
arrow::Status ColumnEntryAppender::operator()(
ExecBatchBuilder& table_scan_node,
const std::string& column_name,
const storage::Table& table,
const roaring::Roaring& row_ids
) {
EVOBENCH_SCOPE("ColumnEntryAppender", columnTypeToString(Column::TYPE));
auto array = table_scan_node.getColumnTypeArrayBuilders<Column>().at(column_name);
for (auto row_id : row_ids) {
auto& column = table.columns.getColumns<Column>().at(column_name);
if (column.isNull(row_id)) {
ARROW_RETURN_NOT_OK(array->AppendNull());
} else {
if constexpr (std::is_same_v<Column, storage::column::StringColumn>) {
auto value = column.getValueString(row_id);
ARROW_RETURN_NOT_OK(array->Append(value));
} else if constexpr (std::is_same_v<Column, storage::column::IndexedStringColumn>) {
auto value = column.getValueString(row_id);
ARROW_RETURN_NOT_OK(array->Append(value));
} else if constexpr (std::is_same_v<Column, storage::column::Date32Column>) {
auto value = common::date32ToString(column.getValue(row_id));
ARROW_RETURN_NOT_OK(array->Append(value));
} else {
auto value = column.getValue(row_id);
ARROW_RETURN_NOT_OK(array->Append(value));
}
}
}
return arrow::Status::OK();
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
0528ec6 to
fd92530
Compare
Partitioning in SILO has been "disabled" for some time now, and we do not plan on reintroducing it. Rather approaches such as #1151 will be followed regarding this direction.
This removes the partition code paths from SILO