Releases: lance-format/lance
v0.4.8 Better support for nested fields and more supported predicates
Previously predicates on nested (and deeply nested) fields were not properly supported. This release adds support for filtering on struct sub-fields or deeply nested structs.
We also added support for more filter predicates and fixed a regression in NULL handling for string columns.
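To make the idea of filtering on a struct sub-field concrete, here is a minimal pure-Python sketch. It is not Lance's implementation or API: `get_nested` and `filter_rows` are hypothetical helpers, and the dotted-path convention (`meta.geo.country`) is an assumption used only for illustration. A null or missing parent struct is treated as a null leaf value.

```python
from typing import Any, Optional


def get_nested(record: dict, path: str) -> Optional[Any]:
    """Resolve a dotted path such as 'meta.geo.country'.

    Returns None when any level along the path is missing or null.
    (Hypothetical helper, not part of Lance.)
    """
    value: Any = record
    for part in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


def filter_rows(rows: list, path: str, expected: Any) -> list:
    """Keep the rows whose nested field equals `expected`."""
    return [row for row in rows if get_nested(row, path) == expected]


rows = [
    {"id": 1, "meta": {"geo": {"country": "US"}}},
    {"id": 2, "meta": {"geo": {"country": "DE"}}},
    {"id": 3, "meta": None},  # null struct: the nested field resolves to None
]
matched = filter_rows(rows, "meta.geo.country", "US")
```

Only the first row matches here; the row with a null `meta` struct is skipped rather than raising an error.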
What's Changed
- Fix nested schema merge by @changhiskhan in #836
- Fix nested field filtering by @changhiskhan in #837
- Fix Projection for struct fields by @changhiskhan in #844
- [Bug] Calculating the nulls from position slices in chunks. by @eddyxu in #846
- Add tests for is_NULL, is_not_null, and invert in filter by @eddyxu in #847
Full Changelog: v0.4.7...v0.4.8
v0.4.7 Random access improvements
This version improves random access over cloud storage by allowing a higher number of parallel I/O requests.
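The general technique of issuing many small ranged reads concurrently can be sketched as follows. This is an illustration only, not the Rust code in Lance: `BLOB`, `read_range`, `take`, and the `max_parallel_io` parameter are all hypothetical names, and an in-memory byte string stands in for an object in cloud storage.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an object in cloud storage (1024 bytes).
BLOB = bytes(range(256)) * 4


def read_range(offset: int, length: int) -> bytes:
    """Hypothetical stand-in for one ranged GET against an object store."""
    return BLOB[offset:offset + length]


def take(ranges: list, max_parallel_io: int = 8) -> list:
    """Fetch several byte ranges concurrently, returning them in request order."""
    with ThreadPoolExecutor(max_workers=max_parallel_io) as pool:
        return list(pool.map(lambda r: read_range(*r), ranges))


chunks = take([(0, 4), (300, 2), (1000, 3)])
```

With a real object store each `read_range` call would be dominated by network latency, so raising the number of requests in flight is what shortens the total wait.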
What's Changed
- [Rust] increase parallelism, reduce array build overhead by @eddyxu in #830
- Recursively merge record batches by @changhiskhan in #833
- Allow user to customize BlockSize on ObjectStore. by @eddyxu in #835
Full Changelog: v0.4.6...v0.4.7
v0.4.6 Support FileFragment creation
Allows the creation of a distributed Lance dataset from scratch.
What's Changed
Full Changelog: v0.4.5...v0.4.6
v0.4.5 Preview private API for merging columns
Welcome @Mause as our newest contributor! Also, a big thank you for your work on the duckdb extension framework.
In this release we added a preview of distributed column additions. This makes it possible to distribute Lance Fragments across nodes, add a new column to each Fragment, and then write a new Lance dataset version manifest with the updated schema and files.
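The workflow described above (add a column per Fragment, then commit one new manifest) can be sketched with plain Python data structures. Everything here is hypothetical: `Fragment`, `Manifest`, `add_column`, and `commit` are illustrative stand-ins and do not reflect the private Lance API introduced in this release.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    fragment_id: int
    columns: dict  # column name -> list of values


@dataclass
class Manifest:
    version: int
    schema: list     # column names
    fragments: list  # list of Fragment


def add_column(fragment: Fragment, name: str, fn) -> Fragment:
    """Would run on a worker node: compute one new column for one fragment."""
    n_rows = len(next(iter(fragment.columns.values())))
    new_values = [
        fn({col: values[i] for col, values in fragment.columns.items()})
        for i in range(n_rows)
    ]
    return Fragment(fragment.fragment_id, {**fragment.columns, name: new_values})


def commit(old: Manifest, new_fragments: list, new_field: str) -> Manifest:
    """Would run once on the coordinator: write the next dataset version."""
    return Manifest(old.version + 1, old.schema + [new_field], new_fragments)


v1 = Manifest(1, ["x"], [Fragment(0, {"x": [1, 2]}), Fragment(1, {"x": [3]})])
updated = [add_column(f, "x_squared", lambda row: row["x"] ** 2) for f in v1.fragments]
v2 = commit(v1, updated, "x_squared")
```

In this sketch the old manifest is left untouched, so readers of version 1 still see the original schema while version 2 exposes the new column.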
What's Changed
- add support for aws profile by @Renkai in #807
- Upgrade Arrow to 37 by @changhiskhan in #810
- Schema intersection by @eddyxu in #814
- Add a check to make sure field names don't contain periods by @changhiskhan in #816
- fix(docs): correct link to docs.rs by @Mause in #819
- update arrow version in duckdb extension by @changhiskhan in #817
- Do not use lifetime on FileWriter by @eddyxu in #820
- Setting field ID after merging the fields. by @eddyxu in #821
- [Rust] Project schema by schema by @eddyxu in #822
- Merge batches from multiple datafiles in the same Fragment by @eddyxu in #815
- Update README.md by @jaichopra in #809
- [Python] Provide a private / distributed add column api in Python by @eddyxu in #823
New Contributors
Full Changelog: v0.4.4...v0.4.5
v0.4.4 Various bug fixes
#805 fixed an integer overflow bug in the plain decoder that resulted in high latency for Take (and consequently high latency for the vector search). We'll be adding continuous performance benchmarks soon to prevent issues like this from being released in the future.
We also fixed a bug in cosine similarity that occurred when the vector length does not line up perfectly with the SIMD stride on the platform.
DiskANN progress is continuing. The first milestone will be an in-memory version to support smaller datasets. A compressed, disk-based version will follow soon after that.
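The class of bug is easiest to see in a scalar sketch: a kernel that walks the vectors in fixed-width strides must also handle the leftover tail elements. The code below is an illustration in pure Python, not the Rust SIMD kernel in Lance; `LANES = 8` is an assumed stride chosen for the example.

```python
import math

LANES = 8  # assumed stride in f32 lanes; illustrative only


def dot(a: list, b: list) -> float:
    """Dot product computed in LANES-wide strides plus a scalar tail."""
    n = len(a)
    aligned = n - n % LANES
    total = 0.0
    # Full strides: in a real kernel each of these would be one SIMD operation.
    for start in range(0, aligned, LANES):
        total += sum(x * y for x, y in zip(a[start:start + LANES], b[start:start + LANES]))
    # Tail: elements that do not fill a whole stride must not be dropped.
    for i in range(aligned, n):
        total += a[i] * b[i]
    return total


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


# Length 10 is not a multiple of LANES, so the last two elements live in the tail.
tail_only = [0.0] * 9 + [1.0]
head_only = [1.0] + [0.0] * 9
```

If the tail loop were omitted, `tail_only` would look like a zero vector to the kernel, which is the kind of silent error a stride-only implementation can produce.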
What's Changed
- Fix L2 simd benchmark by @eddyxu in #793
- bugfix for dataset overwrite method by @gsilvestrin in #794
- [Rust] Minor SIMD benchmark fix set minimal CPU target for AVX2 by @eddyxu in #795
- Persist simple diskann index by @eddyxu in #787
- Fix offset overflow in plain decoder by @eddyxu in #805
- Fix cosine similarity when missing simd alignment by @changhiskhan in #808
Full Changelog: v0.4.3...v0.4.4
v0.4.3 Bug fixes and code cleanup
What's Changed
- [Rust] L2 distance on not aligned data by @eddyxu in #779
- [Rust] Move L2 to linalg module by @eddyxu in #781
- [Rust] Build DiskANN index by @eddyxu in #763
- Refactor cosine distance into linalg module by @eddyxu in #786
- google cloud storage fixes by @gsilvestrin in #782
- Fix unaligned normalization bug on arm64 by @eddyxu in #789
- Speed up vector index tests by reducing dataset size by @changhiskhan in #790
Full Changelog: v0.4.2...v0.4.3
v0.4.2 Polars, GCS, and distributed lances
A warm welcome to @hzhang86 as Lance's newest contributor. Thanks for adding TPCH benchmarks for Lance to establish a baseline. This really helps us focus our performance optimization roadmap.
This release is packed with valuable features:
- We added direct Polars scans that do not need to pull everything into memory.
- We expose FileFragments to allow distributed processing engines like Spark to access parts of a Lance dataset easily.
- Last but not least, we've added support for reading Lance data directly from Google Cloud Storage (GCS) buckets.
What's Changed
- [Rust] FileReader read range API by @eddyxu in #752
- Support direct polars scan by @changhiskhan in #755
- [Rust] Persist graph using lance file format. by @eddyxu in #756
- Refactor PQ and OPQ training function to make it usable widely by @eddyxu in #758
- Matrix::centroids method by @eddyxu in #759
- [Python] Set minimal version of Polars for python tests by @eddyxu in #765
- [Rust] Refactor RecordBatchStream trait by @eddyxu in #766
- [Rust] Expose DataFragment as public dataset api. by @eddyxu in #769
- Revert "[Python] Set minimal version of Polars for python tests (#765)" by @gsilvestrin in #770
- add python script to compare lance performance vs parquet TPCH by @hzhang86 in #749
- Expose index metadata by @changhiskhan in #768
- Google Cloud Storage support. by @gsilvestrin in #773
- [Python] Expose DataFragment via dataset by @eddyxu in #774
- Get S3 credentials from_env by @changhiskhan in #775
- Fix duckdb build by @eddyxu in #776
- [Rust] An arrow kernel to compute hash value of the array. by @eddyxu in #777
New Contributors
Full Changelog: v0.4.1...v0.4.2
v0.4.1 Support Append in Vector Search
Vector search in Lance now supports live updates. Previously, when you added new vectors to the dataset, you had to rebuild the index. Now the index is "inherited", and the vector search results combine ANN search on the indexed data with KNN on the newly appended data. This adds a small amount of latency, and recall should be the same or better.
This provides a smooth performance curve until you have inserted enough new data that re-indexing is warranted.
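The combination of an index lookup with a brute-force scan of appended rows can be sketched as below. This is a simplified illustration, not Lance's query plan: `search`, `fake_index_search`, and the in-memory data are hypothetical, and squared L2 distance is assumed as the metric.

```python
import heapq


def l2(a: list, b: list) -> float:
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query: list, k: int, index_search, appended_rows: list) -> list:
    """Merge ANN hits from the index with exact KNN over unindexed rows.

    `index_search(query, k)` returns (distance, row_id) pairs for indexed data;
    `appended_rows` is a list of (row_id, vector) pairs added after indexing.
    """
    indexed_hits = index_search(query, k)
    fresh_hits = [(l2(query, vec), row_id) for row_id, vec in appended_rows]
    return heapq.nsmallest(k, indexed_hits + fresh_hits)


# Stand-in for a real ANN index: an exact scan over the "indexed" rows.
indexed = {0: [0.0, 0.0], 1: [5.0, 5.0]}


def fake_index_search(query: list, k: int) -> list:
    return heapq.nsmallest(k, [(l2(query, v), rid) for rid, v in indexed.items()])


appended = [(2, [1.0, 1.0]), (3, [9.0, 9.0])]
hits = search([1.0, 1.0], 2, fake_index_search, appended)
```

The brute-force part costs time proportional to the number of appended rows, which is why latency grows gradually until re-indexing becomes worthwhile.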
What's Changed
- Adding secret to publish task by @gsilvestrin in #742
- [Rust] make distance function to take slice instead of Float32Array by @eddyxu in #748
- Vector search should support appending new rows by @changhiskhan in #593
- windows lapack support by @gsilvestrin in #743
- Fix LanceDataset.to_batches by @changhiskhan in #751
Full Changelog: v0.4.0...v0.4.1
v0.4.0 Windows support
A warm welcome to @gsajko! Thanks for making our tutorial notebook easier to use and understand!
Note: OPQ is disabled on Windows for the vector index. This will be addressed once LAPACK support is added.
What's Changed
- small fixes by @gsajko in #725
- Windows support by @gsilvestrin in #724
New Contributors
Full Changelog: v0.3.19...v0.4.0
v0.3.19 Bug fix for filter predicates on large-utf8 type
Also fixes publishing to crates.io.
What's Changed
- Make contract clear for KNN nodes by @eddyxu in #729
- Refactor Scan I/O plan by @eddyxu in #731
- [Rust] Use forked sqlparser to unblock rust crate release by @eddyxu in #732
- [Rust] Fix filter on large UTF8 columns by @eddyxu in #733
Full Changelog: v0.3.18...v0.3.19