Also see root AGENTS.md for cross-language standards.
- Use
Vec::with_capacity()when size is known or estimable — prefer over-estimating capacity to multiple reallocations. - Wrap large or expensive-to-clone struct fields (maps, protobuf metadata, schemas) in
Arc<T>to avoid deep copies. - Use
Box::pin(...)or.boxed()but never both —.boxed()already returnsPin<Box<...>>. - Remove dead code instead of adding
#[allow(dead_code)]. Delete unused constants instead of reducing visibility. - Use
column_by_name()forRecordBatchcolumn access in production code; usebatch["column_name"]in tests. - Use
PrimitiveArray::<T>::from(vec)(zero-copy) instead offrom_iter_values(vec)for Vec-to-PrimitiveArray conversion. - Implement
Defaulttrait on config/options structs instead of standalonedefault_*()helpers. - Place
#[cfg(test)] mod testsas a single block at the bottom of each file — no production code after it. - Place
useimports at the top of the file, not inline within function bodies. - Extract substantial new logic (bin packing, scheduling) into dedicated submodules instead of inlining into large files.
- Delete obsolete internal (
pub(crate)/ private) methods in the same PR that introduces their replacements. For public API methods, follow the deprecation path in root AGENTS.md instead. - Choose log levels by audience:
debug!for routine/high-frequency ops,info!for infrequent operator-visible state changes,warn!for unexpected conditions.
- Use
with_-prefixed builder methods for optional config (e.g.,MyStruct::new(required).with_option(v)) — don't create separate constructor variants. - For public APIs, prefer
Into<T>orAsRef<T>trait bounds for flexible inputs. - Prefer
pub(crate)overpubfor crate-internal items. Usepub usere-exports for the actual public API surface. - Use enums instead of magic numbers for format versions, variant types, and discriminators — leverage exhaustive
match. - Use strongly-typed structs instead of
HashMap<String, String>in APIs — convert to strings only at serialization boundaries. - Keep
RowAddr(physical fragment+offset) andRowId(stable logical identifier) as distinct types — never rawu64for both. - Use
RowAddressfromlance-core/src/utils/address.rsinstead of raw bitwise operations on row addresses. - Use
RowAddrTreeMap/RoaringBitmapinstead ofVec<Range<u64>>for physical row selections. - Use logical row counts (
num_rows()) instead ofphysical_rowsfor user-facing metrics — subtract deletions. - Keep traits minimal — only core abstraction methods. Move helpers to standalone functions and config to struct fields.
- Get column/field types from schema metadata — never materialize data rows just to inspect types.
- Use stable, versioned serialization formats for persistent storage (e.g., index files) — avoid unstable cross-version formats.
- Use Arrow's type-safe access (
ArrayAccessortrait bounds,as_*_arrayhelpers) instead ofarrow::compute::cast+downcast_ref. Prefer_optvariants (e.g.,as_string_opt) unless the data type has already been verified. - In
lance-io/, use single-syscall writes for local filesystem I/O — don't reuse cloud multipart upload machinery.
- Never use
.unwrap(),.expect(),panic!(), orassert!()in library code for fallible operations — use?withResultand proper error types. Reserve.unwrap()for tests only. - Avoid bare
.unwrap(); useif let,match,let ... else,?, or combinators. Never.is_none()followed by.unwrap(). If unavoidable, use.expect("reason"). - Return
LanceError::NotSupportedinstead oftodo!()orunimplemented!()for unsupported code paths. Test withResult::Errassertions, not#[should_panic]. - Match
Errorvariant to root cause:Error::invalid_inputfor caller data issues,Error::corrupt_filefor format/integrity issues,Error::not_foundfor missing resources,Error::iofor I/O failures. - Include full context in error messages — variable names, values, sizes, types, indices. Not generic messages like
"Invalid chunk size". - Use
checked_add/checked_mulinstead ofwrapping_add/wrapping_mulfor counters and IDs — return an error on overflow. - Prefer
debug_assert!overassert!for non-safety invariants; reserveassert!for conditions preventing data corruption. Always include descriptive messages. - Don't silently guard against impossible conditions — use
debug_assert!, return an explicit error, or remove the check. - Log warnings on best-effort/cleanup failures instead of silently swallowing or propagating errors.
- Log warnings for silent no-ops (skipped operations); omit warnings before errors since the error message is sufficient.
- Avoid
unwrap_or(default)on map lookups for required config params — use.ok_or_else(|| Error::...)and verify key names match between serialization and deserialization. - Advance all parallel iterators before any
continuebranches — early exits that skip.next()calls cause misalignment. - Bind
iter.next()withlet Some(x) = iter.next() else { ... }— never call.next()twice to check-then-use.
- Reserve
_-prefixed names for truly unused bindings — if a variable is read, drop the underscore. - Prefix boolean variables with
is_orhas_instead of ambiguouswith_or bare adjectives. - Name booleans so
false(zero/Default::default()) is the desired default — usedisable_*instead ofenable_*when the feature should be on by default. - Name functions to match their actual scope — e.g.,
handle_partition_system_columnsnothandle_system_columnsif only a subset is handled.
- Use
record_batch!()fromarrow_arrayto constructRecordBatchin tests instead of manual Schema/Arc/try_new boilerplate. - Use
gen_batch()builder API (.col(),.into_reader_rows()) for test data setup instead of manual Arrow construction. - Use
.try_into_batch()instead of.try_into_stream().try_collect()for scanner results in tests. - Use plain
"memory://"URIs in tests — no atomic counters or unique suffixes needed. - Assert on both error variant (
assert!(matches!(error, ErrorType::Variant { .. }))) and message content — don't just checkis_err().
- Add doc comments to public API elements that convey semantic meaning, valid values, and effects — don't restate type signatures.
- Document enum variant doc comments with behavioral semantics, not just labels. For numeric parameters, state whether it's an id, count, index, etc.
- Add doc comments to magic constants, thresholds, and non-obvious transformation functions — explain what the value represents and why it was chosen.
- Comment fallback/guard code paths with when they trigger and why they exist.
- Ensure doc comments match actual semantics — distinguish mutates-in-place (
&mut self) from returns-new-value. - Use explicit forward-looking language (
TODO,FIXME) in comments to distinguish current behavior from planned changes. - Document the semantic meaning of both present and absent states for
Option<T>fields. - Use precise domain terminology — avoid ambiguous abbreviations (e.g., "FIXED" vs "fixed-width") or incorrect terms (e.g., "fields" when meaning "fragments").
Performance-critical encoding/decoding paths have additional requirements:
- Hoist loop-invariant conditionals out of hot loops — branch once outside, then use separate loop bodies or monomorphized variants.
- Pre-allocate single contiguous buffers. Default to
buf.resize(len, 0)for safe initialization; reserveVec::with_capacity+unsafe { set_len() }for measured hot paths only, with a// SAFETY:comment explaining why the buffer will be fully initialized before read (e.g., immediately followed byread_exact). - Use
spawn_cpu()only at the async-to-CPU boundary (e.g., FSST, decompression, batch materialization) — never nest redundantspawn_cpu()calls. - Use
expect_next()and similar utility methods instead of inliningNone-checks with error returns.