Question: how to do transparent compression and decompression for binary field in the lance format #3445

Open
yanghua opened this issue Feb 13, 2025 · 4 comments
Labels
question Further information is requested

Comments

@yanghua
Collaborator

yanghua commented Feb 13, 2025

We have some binary-type fields and want to compress them with LZ4 or another compression algorithm.

Does Lance currently support this? If not, could you share an idea of how to implement it?

@yanghua yanghua added the question Further information is requested label Feb 13, 2025
@yanghua
Collaborator Author

yanghua commented Feb 13, 2025

@westonpace Any input? Thanks.

@wjones127
Contributor

I know that if you pass the field metadata metadata={"lance-encoding:compression": "zstd"}, it enables some compression, but I wasn't sure whether that applies at the page level or the cell level.

@westonpace
Contributor

westonpace commented Feb 13, 2025

I've been doing some work on compression as part of the 2.1 file format. For large values (e.g. > 4KB per value) I've been using Zstd-per-value: #3448
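To make the per-value idea concrete, here is a minimal sketch, not Lance's actual implementation. It compresses each value independently and records an offsets array so that a single value can be read back without decompressing its neighbours. It uses `zlib` from the Python standard library as a stand-in for Zstd, and the function names are hypothetical.

```python
import zlib

def compress_per_value(values):
    """Compress each value on its own and return (buffer, offsets)."""
    buffer = bytearray()
    offsets = [0]
    for value in values:
        buffer += zlib.compress(value)  # stand-in for Zstd
        offsets.append(len(buffer))     # end position of this compressed value
    return bytes(buffer), offsets

def read_value(buffer, offsets, i):
    """Decompress only value i, located through the offsets array."""
    return zlib.decompress(buffer[offsets[i]:offsets[i + 1]])

values = [bytes([i]) * 8192 for i in range(4)]  # four 8 KiB values
buffer, offsets = compress_per_value(values)
assert read_value(buffer, offsets, 2) == values[2]
```

The trade-off is that each value is compressed without any shared context, which tends to work for large values but yields little for small ones.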

For smaller values I've been using FSST and/or dictionary.
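For readers unfamiliar with dictionary encoding, a minimal sketch follows. It shows only the general technique (each distinct value stored once, with rows holding integer indices) and is not Lance's encoder. FSST is not shown here, and the function names are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, indices): distinct values plus one index per row."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices

def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and indices."""
    return [dictionary[i] for i in indices]

column = [b"red", b"green", b"red", b"red", b"green"]
dictionary, indices = dictionary_encode(column)
assert dictionary_decode(dictionary, indices) == column
```

This pays off when the column has few distinct values relative to its length.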

That covered all my test cases but it's not an exhaustive set at the moment.

I think the main case I haven't thoroughly benchmarked is values between 128 bytes and 4KB. FSST might work fine, and per-value Zstd might work fine too. If neither works well, I have an idea: chunk a few values together and still use an offsets array. I'm working on a paper that documents all of this in more detail.
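One possible reading of the chunking idea is sketched below. This is an assumption about what it could look like, not the design described in the comment: a few values are concatenated and compressed together, and a per-chunk offsets array locates each value inside the decompressed chunk. It uses `zlib` as a stand-in for Zstd, and the names and the `chunk_size` parameter are hypothetical.

```python
import zlib

def compress_chunked(values, chunk_size=4):
    """Compress groups of `chunk_size` values; keep offsets within each group."""
    chunks = []         # compressed bytes, one entry per chunk
    inner_offsets = []  # per chunk: value boundaries in the decompressed bytes
    for start in range(0, len(values), chunk_size):
        group = values[start:start + chunk_size]
        offsets = [0]
        for value in group:
            offsets.append(offsets[-1] + len(value))
        chunks.append(zlib.compress(b"".join(group)))  # stand-in for Zstd
        inner_offsets.append(offsets)
    return chunks, inner_offsets

def read_chunked_value(chunks, inner_offsets, i, chunk_size=4):
    """Decompress only the chunk holding value i, then slice the value out."""
    chunk_idx, pos = divmod(i, chunk_size)
    raw = zlib.decompress(chunks[chunk_idx])
    offsets = inner_offsets[chunk_idx]
    return raw[offsets[pos]:offsets[pos + 1]]

records = [f"record-{i}".encode() * 40 for i in range(10)]  # mid-sized values
chunks, inner_offsets = compress_chunked(records)
assert read_chunked_value(chunks, inner_offsets, 7) == records[7]
```

Compared with per-value compression, neighbouring values share compression context, at the cost of decompressing a whole chunk to read one value.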

@yanghua
Collaborator Author

yanghua commented Feb 14, 2025

@wjones127 @westonpace thank you very much! Let me do some research.

3 participants