A Spark Data Source for accessing 🤗 Hugging Face Datasets:
- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3
Install with pip:

```bash
pip install pyspark_huggingface
```
Load a dataset (here `stanfordnlp/imdb`):

```python
import pyspark_huggingface

df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```
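The result is a regular Spark DataFrame, so the usual inspection methods apply, for example:

```python
df.printSchema()  # column names and types inferred from the dataset
df.show(5)        # preview the first 5 rows
```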
Save to Hugging Face:
```python
# Login with huggingface-cli login
df.write.format("huggingface").save("username/my_dataset")

# Or pass a token manually
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```
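For a self-contained sketch of the write path (assuming you are logged in, and with `username/my_dataset` as a placeholder for a repository you can write to):

```python
import pyspark_huggingface  # needed on Spark 3.x; harmless on Spark 4
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hf-write-demo").getOrCreate()

# A tiny DataFrame just to exercise the write path
df = spark.createDataFrame([("hello",), ("world",)], ["text"])
df.write.format("huggingface").save("username/my_dataset")
```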
Select a split:
```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```
Select a subset/config:
```python
df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```
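To discover which subsets and splits a dataset exposes before loading it, the standalone `datasets` library provides helpers (a side note; this assumes `datasets` is installed in your environment):

```python
from datasets import get_dataset_config_names, get_dataset_split_names

print(get_dataset_config_names("HuggingFaceFW/fineweb-edu"))  # includes "sample-10BT"
print(get_dataset_split_names("stanfordnlp/imdb"))            # ["train", "test", "unsupervised"]
```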
Filter columns and rows (especially efficient for Parquet datasets):
```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```
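For comparison, the same result can be expressed with standard DataFrame operations after loading; the `filters` and `columns` options instead apply while the data is read, which is what makes them especially efficient on Parquet-backed datasets. A minimal post-load sketch:

```python
from pyspark.sql import functions as F

# Same result, but rows and columns are pruned in Spark
# after they have already been read
df = (
    spark.read.format("huggingface")
    .load("HuggingFaceFW/fineweb-edu")
    .where(F.col("language_score") > 0.99)
    .select("text", "language_score")
)
```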
While the Data Source API was introduced in Spark 4, this package includes a backport for older versions. Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:
```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```
The import is only necessary on Spark 3.x to enable the backport. Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source.
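For code that has to run on both major versions, a minimal sketch of a version guard (the unconditional import also works, since it is harmless on Spark 4):

```python
import pyspark

# The explicit import is only needed on Spark 3.x; on Spark 4 the
# "huggingface" data source is registered automatically once installed.
if pyspark.__version__.startswith("3."):
    import pyspark_huggingface  # noqa: F401
```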
To set up a development environment, install `uv` if not already done. Then, from the project root directory, sync the dependencies and run the tests:
```bash
uv sync
uv run pytest
```