Commit e7b55b2

feat: eliminate GenericDatum in Avro reader for performance
Replace GenericDatum intermediate layer with direct Avro decoder access to improve manifest I/O performance.

Changes:
- Add avro_direct_decoder_internal.h with DecodeAvroToBuilder API
- Add avro_direct_decoder.cc implementing direct Avro→Arrow decoding
  - Primitive types: bool, int, long, float, double, string, binary, fixed
  - Temporal types: date, time, timestamp
  - Logical types: uuid, decimal (with validation)
  - Nested types: struct, list, map
  - Union type handling with bounds checking
  - Field skipping with proper multi-block handling for arrays/maps
- Modify avro_reader.cc to use DataFileReaderBase with direct decoder
  - Replace DataFileReader<GenericDatum> with DataFileReaderBase
  - Use decoder.decodeInt(), decodeLong(), etc. directly
  - Remove GenericDatum allocation and extraction overhead
- Update CMakeLists.txt to include new decoder source

Validation added:
- Union branch bounds checking
- Decimal byte width validation (uses schema fixedSize, not calculated)
- Decimal precision sufficiency validation
- Logical type presence validation
- Type mismatch error handling

Documentation:
- Comprehensive API documentation in header
- Schema evolution handling via SchemaProjection explained
- Error handling behavior documented
- Limitations noted (default values not supported)

Performance improvement:
- Before: Avro binary → GenericDatum → Extract → Arrow (3 steps)
- After: Avro binary → decoder.decodeInt() → Arrow (2 steps)

This matches Java implementation which uses Decoder directly via ValueReader interface, avoiding intermediate object allocation.

All 173 avro_test cases pass.

Issue: #332
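The two-step path described in the commit message (Avro binary → decoder call → Arrow builder) can be illustrated with a standard-library-only sketch. Everything below is hypothetical and is not the code from this commit: `ByteDecoder` stands in for the Avro decoder, `LongColumn` stands in for an Arrow builder, and `DecodeNullableLongColumn` is an invented helper. It assumes a `["null", "long"]` union whose branch index and value are both zigzag varints, and it includes the kind of union branch bounds check the message mentions.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an Avro binary decoder (not the real avro::Decoder).
class ByteDecoder {
 public:
  explicit ByteDecoder(const std::vector<uint8_t>& buf) : buf_(buf) {}

  // Reads one zigzag-encoded variable-length integer.
  int64_t DecodeLong() {
    uint64_t raw = 0;
    for (int shift = 0; shift <= 63; shift += 7) {
      if (pos_ >= buf_.size()) throw std::runtime_error("unexpected end of input");
      const uint8_t b = buf_[pos_++];
      raw |= static_cast<uint64_t>(b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        // Undo zigzag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
        return static_cast<int64_t>(raw >> 1) ^ -static_cast<int64_t>(raw & 1);
      }
    }
    throw std::runtime_error("varint too long");
  }

 private:
  const std::vector<uint8_t>& buf_;
  std::size_t pos_ = 0;
};

// Hypothetical stand-in for a columnar builder: values plus a validity vector.
struct LongColumn {
  std::vector<int64_t> values;
  std::vector<bool> valid;

  void Append(int64_t v) { values.push_back(v); valid.push_back(true); }
  void AppendNull() { values.push_back(0); valid.push_back(false); }
};

// Decodes `rows` values of an assumed ["null", "long"] union straight into the
// column, with no per-value intermediate object in between.
void DecodeNullableLongColumn(ByteDecoder& dec, std::size_t rows, LongColumn& out) {
  constexpr int64_t kNumBranches = 2;  // branch 0 = null, branch 1 = long
  for (std::size_t i = 0; i < rows; ++i) {
    const int64_t branch = dec.DecodeLong();
    if (branch < 0 || branch >= kNumBranches) {
      throw std::out_of_range("union branch index out of range");
    }
    if (branch == 0) {
      out.AppendNull();
    } else {
      out.Append(dec.DecodeLong());
    }
  }
}
```

As a usage illustration, the bytes `{0x00, 0x02, 0x02, 0x02, 0x01}` decode as three rows: null, 1, and -1. A leading byte of `0x04` selects branch 2, which the bounds check rejects.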
1 parent 9805fae commit e7b55b2

4 files changed, +689 -10 lines changed

src/iceberg/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -138,6 +138,7 @@ if(ICEBERG_BUILD_BUNDLE)
 set(ICEBERG_BUNDLE_SOURCES
 arrow/arrow_fs_file_io.cc
 avro/avro_data_util.cc
+avro/avro_direct_decoder.cc
 avro/avro_reader.cc
 avro/avro_writer.cc
 avro/avro_register.cc