Feat: Add support for parquet files #443

deependujha · 2025-01-06T09:38:02Z

Before submitting

Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Fixes #191

Index a Parquet dataset stored locally or in the cloud (S3 or GCS):

import litdata as ld

pq_data_uri = "gs://deep-litdata-parquet/my-parquet-data"

ld.index_parquet_dataset(pq_data_uri)

Then use it like a normal optimized dataset:

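import litdata as ld
from litdata.streaming.item_loader import ParquetLoader

# Stream the indexed Parquet dataset, using ParquetLoader as the item loader
ds = ld.StreamingDataset("gs://deep-litdata-parquet/my-parquet-data", item_loader=ParquetLoader())

# Iterate over the samples and print each one
for sample in ds:
    print(f"{sample=}")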

Benchmark on Data prep machine

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov · 2025-01-06T09:52:07Z

Codecov Report

Attention: Patch coverage is 63.75000% with 58 lines in your changes missing coverage. Please review.

Project coverage is 78%. Comparing base (ee77852) to head (48f4a45).

Additional details and impacted files

@@         Coverage Diff          @@
##           main   #443    +/-   ##
====================================
- Coverage    78%    78%    -0%     
====================================
  Files        36     37     +1     
  Lines      5217   5372   +155     
====================================
+ Hits       4088   4185    +97     
- Misses     1129   1187    +58
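The rounded percentages in the table hide the actual change, so here is a small sketch that recomputes them from the hit and line counts reported above. The variable names are my own; the numbers are copied from the Codecov report.

```python
# Numbers copied from the Codecov diff table above.
base_hits, base_lines = 4088, 5217
head_hits, head_lines = 4185, 5372

# Project coverage before and after the PR, in percent.
base_cov = 100 * base_hits / base_lines
head_cov = 100 * head_hits / head_lines
delta = head_cov - base_cov

# Patch coverage: 58 uncovered lines at 63.75% coverage implies the patch size.
missing, patch_pct = 58, 63.75
patch_lines = round(missing / (1 - patch_pct / 100))
patch_hits = patch_lines - missing

print(f"base {base_cov:.2f}%  head {head_cov:.2f}%  delta {delta:+.2f} points")
print(f"patch: {patch_hits}/{patch_lines} lines covered")
```

So the "-0%" in the table is a drop of a bit under half a percentage point that disappears when rounded to whole percents.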

tchaton · 2025-01-24T11:26:45Z

Hey @deependujha, nice progress ;)

tchaton

This is dope. If we could automatically index an S3 folder and generate an index file, it would be dope.

import fsspec
import pyarrow.parquet as pq

file_path = "s3://your-bucket/path/to/your-file.parquet"

# Open the Parquet file with fsspec
with fsspec.open(file_path, mode="rb") as f:
    # Read only the footer metadata to get the row count (no column data is loaded)
    num_rows = pq.ParquetFile(f).metadata.num_rows
    print(f"Number of rows: {num_rows}")

gitguardian · 2025-01-28T04:46:37Z

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

Since your pull request originates from a forked repository, GitGuardian is not able to associate the secrets uncovered with secret incidents on your GitGuardian dashboard.
Skipping this check run and merging your pull request will create secret incidents on your GitGuardian dashboard.

🔎 Detected hardcoded secret in your pull request

GitGuardian id	GitGuardian status	Secret	Commit	Filename
5685611	Triggered	Generic High Entropy Secret	`76efafb`	tests/streaming/test_resolver.py

🛠 Guidelines to remediate hardcoded secrets

Understand the implications of revoking this secret by investigating where it is used in your code.
Replace and store your secret safely. Learn here the best practices.
Revoke and rotate this secret.
If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider

following these best practices for managing and storing secrets including API keys and other credentials
install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
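One common way to follow the "replace and store your secret safely" guideline in a test file is to read the value from the environment and fall back to an obviously fake placeholder. This is only a sketch: the variable name `LITDATA_TEST_API_KEY` and the helper `get_test_api_key` are hypothetical, and what was actually flagged in tests/streaming/test_resolver.py is not shown in this thread.

```python
import os

# Hypothetical variable name and placeholder, chosen for this sketch only.
ENV_VAR = "LITDATA_TEST_API_KEY"
PLACEHOLDER = "dummy-api-key-for-tests"


def get_test_api_key() -> str:
    """Return the key from the environment, or an obviously fake placeholder."""
    return os.environ.get(ENV_VAR, PLACEHOLDER)
```

A test can then call `get_test_api_key()` instead of embedding a high-entropy string in the source.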

deependujha · 2025-01-28T13:00:27Z

Adding support for directly consuming HF datasets is an exciting direction!

For HF datasets, my current idea involves iterating through all the Parquet datasets in the HF repository and creating an index.json file that is stored in a cache (since modifying the original dataset is not feasible).

When using the streaming dataset/dataloader, we would then pass this separate index.json file from the cache.

At this point, I'm uncertain about the exact approach for handling HF datasets comprehensively. This PR is ready for review and lays the groundwork for future enhancements. We can discuss HF dataset integration in a subsequent PR.
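To make the cached-index idea above a bit more concrete, here is a minimal sketch. It assumes the Parquet files are already reachable as paths, and it writes a made-up JSON layout: the function names, the `count_rows` parameter, the cache folder naming, and the `source`/`chunks` fields are all hypothetical and are not litdata's actual index format. Only `pyarrow.parquet.read_metadata` is a real library call.

```python
import hashlib
import json
import os
from typing import Callable, Iterable


def count_rows_with_pyarrow(path: str) -> int:
    """Read the row count from the Parquet footer only (requires pyarrow)."""
    import pyarrow.parquet as pq  # imported lazily so the rest works without pyarrow

    return pq.read_metadata(path).num_rows


def write_cached_index(
    dataset_uri: str,
    parquet_files: Iterable[str],
    cache_root: str,
    count_rows: Callable[[str], int] = count_rows_with_pyarrow,
) -> str:
    """Write a hypothetical index.json for `parquet_files` into a cache folder.

    The original dataset is never modified; the index lives next to nothing
    but other cached indexes. Returns the path of the written file.
    """
    # One cache folder per dataset, keyed by a hash of its URI.
    key = hashlib.sha256(dataset_uri.encode("utf-8")).hexdigest()[:16]
    cache_dir = os.path.join(cache_root, key)
    os.makedirs(cache_dir, exist_ok=True)

    # Record each file name with its row count, in a stable (sorted) order.
    chunks = [
        {"filename": os.path.basename(p), "num_rows": count_rows(p)}
        for p in sorted(parquet_files)
    ]
    index = {"source": dataset_uri, "chunks": chunks}

    index_path = os.path.join(cache_dir, "index.json")
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2)
    return index_path
```

Keeping `count_rows` injectable means the same function could work with a remote reader later, and it can be exercised without any real Parquet files.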

tchaton

This is quite awesome!

src/litdata/streaming/item_loader.py

tchaton

Can you add the benchmarks to the description?

started working on adding parquet support in litdata

6bde708

deependujha requested a review from tchaton as a code owner January 6, 2025 09:38

deependujha marked this pull request as draft January 6, 2025 09:38

deependujha added 6 commits January 12, 2025 02:25

write_parquet_index fn working

34c8a64

streaming_dataset and streaming_dataset can read optimized parquet files

c1c0f13

fixed mypy issues

ff40eba

Merge branch 'main' into feat/add-hf-parquet-support

d53dc7d

update

3db6033

update

239e093

tchaton reviewed Jan 24, 2025

deependujha and others added 2 commits January 28, 2025 10:16

Merge branch 'main' into feat/add-hf-parquet-support

76efafb

[pre-commit.ci] auto fixes from pre-commit.com hooks

09d1000

for more information, see https://pre-commit.ci

deependujha added 6 commits January 28, 2025 16:04

need to test it on s3

8b880ac

update

520ef21

update

211c987

fixed test

4640756

fixed mypy error

53e8995

hip-hip hurray. working on google-storage

75b3163

deependujha marked this pull request as ready for review January 28, 2025 12:48

deependujha changed the title ~~WIP: Add support for parquet files & HF datasets~~ Feat: Add support for parquet files Jan 28, 2025

deependujha added 5 commits January 29, 2025 11:08

update readme

5a4e83b

remove assert

7da25e0

cache parquet reads

7e23490

update

95fb340

update

3376e0a

tchaton approved these changes Feb 1, 2025

tchaton approved these changes Feb 1, 2025

deependujha added 6 commits February 1, 2025 15:46

made required changes

caabd03

update

8791f1a

update

1d0b446

add type annotations

363529b

update

c55269c

update

48f4a45
