mvp-dataset is an efficient data loading library for multimodal training.
It is built around a few ideas:
- one immutable `Dataset` pipeline
- deterministic sharding across distributed ranks and worker processes
- simple on-the-fly transforms
- a practical `.cache()` boundary for expensive preprocessing
- optional `TorchLoader` integration for PyTorch `DataLoader`
```bash
uv pip install "git+https://github.com/mvp-ai-lab/mvp-dataset.git"

# Development install
uv pip install -e .
```

If you use `TorchLoader`, install PyTorch separately.
Top-level exports:
- `Dataset`
- `RuntimeContext`
- `DataLoadMesh`
- `TorchLoader`
- `set_logger(...)` / `get_logger(...)`
- `set_log_level(...)` / `get_log_level(...)`
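For example, a minimal import of the public surface (the argument accepted by `set_log_level` is assumed here to be a standard level name):

```python
from mvp_dataset import (
    DataLoadMesh,
    Dataset,
    RuntimeContext,
    TorchLoader,
    get_log_level,
    set_log_level,
)

set_log_level("INFO")  # assumed to accept standard logging level names
print(get_log_level())
```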
Dataset construction uses a single entrypoint: `Dataset.from_source(source_kind, ...)`.

Supported `source_kind` values: `"tars"`, `"jsonl"`, `"parquet"`, `"lance"`.
Chainable dataset stages:
- `.map(fn)`
- `.select(fields)`
- `.shuffle(buffer_size, initial=None)`
- `.assemble(factory, drop_last=False)`
- `.batch(batch_size, drop_last=False, collate_fn=None)`
- `.unbatch()`
- `.cache(cache_dir=".cache", cache_num_workers=1, cache_write_batch_size=1024, max_rows_per_fragment=5000)`
`TorchLoader` supports post-merge stages:
- `.unbatch()`
- `.shuffle(buffer_size, initial=None, seed=None)`
- `.assemble(factory, drop_last=False)`
- `.batch(batch_size, drop_last=False, collate_fn=None)`
`Dataset` is an immutable iterable pipeline. Every stage returns a new dataset.
```python
from mvp_dataset import Dataset

dataset = (
    Dataset.from_source("tars", "/data/shards/shard_{000000..000127}.tar")
    .shuffle(buffer_size=1024)
    .map(preprocess)
)
```

`RuntimeContext` carries rank, worker, seed, and optional mesh information.
Most users do not need to construct it manually. If `context=` is omitted, runtime values are inferred from:
- `torch.distributed`
- `torch.utils.data.get_worker_info()`
- environment variables such as `RANK` and `WORLD_SIZE`
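A minimal sketch of relying on that inference (the environment defaults below are illustrative; real launchers set these for you):

```python
import os

# Illustrative single-process defaults.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

from mvp_dataset import Dataset

# context= omitted: slot / total_slots are inferred from torch.distributed,
# get_worker_info(), or the RANK / WORLD_SIZE environment variables.
dataset = Dataset.from_source("jsonl", "/data/train.jsonl")
```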
`DataLoadMesh` is used when the training mesh contains model-parallel dimensions such as tensor parallelism.
Its purpose is to separate:
- data-parallel dimensions: different data
- model-parallel dimensions: same data
This lets TP co-members receive identical dataset shards while DP ranks remain distinct.
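As a plain-Python illustration of the split (not the library API; the rank layout shown is one common convention):

```python
# 4 ranks arranged as a 2x2 (dp, tp) mesh, with tp fastest-varying.
world_size, tp_size = 4, 2

for rank in range(world_size):
    dp_rank, tp_rank = divmod(rank, tp_size)
    # Sharding keys off dp_rank: ranks 0 and 1 read identical shards,
    # ranks 2 and 3 read identical shards, and the two DP groups differ.
    print(f"rank={rank} -> dp_rank={dp_rank}, tp_rank={tp_rank}")
```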
```python
from mvp_dataset import Dataset, TorchLoader

dataset = (
    Dataset.from_source("tars", "/data/shards/shard_{000000..000006}.tar", resample=True)
    .shuffle(buffer_size=1024)
    .map(decode_sample)
)

loader = (
    TorchLoader(dataset, num_workers=8, batch_size=32, collate_fn=lambda x: x)
    .unbatch()
    .shuffle(buffer_size=4096)
    .assemble(pack_samples)
    .batch(batch_size=64)
)

for batch in loader:
    train_step(batch)
```

Use `sidecars=` when related modalities live in separate tar shards with matching shard layouts.
```python
from mvp_dataset import Dataset

dataset = Dataset.from_source(
    "tars",
    "/data/images/shard_{000000..000006}.tar",
    sidecars=[
        ("depth", lambda shard: shard.replace("/images/", "/depth/")),
    ],
)
```

JSONL source with referenced tar fields:

```python
from mvp_dataset import Dataset

dataset = Dataset.from_source(
    "jsonl",
    "/data/meta/train.jsonl",
    ref_fields=[("image", "/data")],
)
```

Parquet source with an on-the-fly transform:

```python
from mvp_dataset import Dataset

def add_length(sample: dict[str, object]) -> dict[str, object]:
    return {**sample, "length": len(str(sample["text"]))}

dataset = (
    Dataset.from_source(
        "parquet",
        "/data/meta/train_{000000..000003}.parquet",
        columns=["text", "label"],
        batch_size=8192,
    )
    .map(add_length)
)
```

Lance source:

```python
from mvp_dataset import Dataset

dataset = Dataset.from_source(
    "lance",
    "/data/cache/train.lance",
    columns=["tokens", "label"],
)
```

Tars source signature:

```python
Dataset.from_source(
    "tars",
    shards,
    context=None,
    resample=False,
    sidecars=None,
)
```

Notes:
- inputs must resolve to `.tar` files
- tar shard count must be at least `context.total_slots`
- samples are parsed from tar members as keyed field dictionaries
JSONL source signature:

```python
Dataset.from_source(
    "jsonl",
    shards,
    context=None,
    resample=False,
    ref_fields=None,
)
```

Notes:
- inputs must resolve to `.jsonl` files
- JSONL inputs are internally split into slot-aligned logical shards
- `ref_fields=[(field_name, base_dir), ...]` resolves `tar://...`-style references
Parquet source signature:

```python
Dataset.from_source(
    "parquet",
    shards,
    context=None,
    resample=False,
    columns=None,
    batch_size=65536,
    use_threads=True,
)
```

Notes:
- inputs must resolve to `.parquet` files
- work is scheduled by parquet fragment / row group
- each row becomes one sample dict
Lance source signature:

```python
Dataset.from_source(
    "lance",
    shards,
    context=None,
    resample=False,
    columns=None,
    batch_size=65536,
    shuffle_mode="none",
)
```

Notes:
- inputs are local Lance dataset paths or URIs
- `shuffle_mode="none"` preserves ordered reads and is the default used by cached Lance datasets
- `shuffle_mode="global"` performs an exact global permutation via `take(...)`
- `shuffle_mode="random_scan"` uses `scanner(scan_in_order=False)` for a higher-throughput approximate shuffle
- `random_scan` falls back to `global` when fragment parallelism is lower than `context.total_slots`
- `.map(fn)`: applies a pure per-sample transform.
- `.select(fields)`: projects user fields while preserving internal metadata such as `__key__`.
- `.shuffle(buffer_size, initial=None)`: applies a bounded-memory sample shuffle with deterministic seeding derived from the runtime context.
- `.assemble(factory, drop_last=False)`: builds a stateful stream assembler from `factory(context)` and allows many-input to many-output transformations. This is particularly useful for data packing in LLM/VLM training.
The factory receives the runtime-resolved `RuntimeContext`, so it can safely depend on:
- `worker_id`
- `num_workers`
- distributed rank information
- mesh-aware DP semantics
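A hypothetical sketch of a packing factory like the `pack_samples` used in the examples, under the assumption that `factory(context)` returns a callable mapping an input sample iterator to an output iterator (the exact assembler protocol may differ):

```python
from typing import Iterable, Iterator

MAX_TOKENS = 4096  # illustrative packing budget

def pack_samples(context):
    # context is the runtime-resolved RuntimeContext; worker_id and
    # num_workers could seed per-worker state here if needed.
    def run(stream: Iterable[dict]) -> Iterator[dict]:
        buf: list[dict] = []
        used = 0
        for sample in stream:
            n = len(sample["tokens"])
            if buf and used + n > MAX_TOKENS:
                yield {"samples": buf, "tokens_used": used}
                buf, used = [], 0
            buf.append(sample)
            used += n
        if buf:  # tail flush; presumably dropped when drop_last=True
            yield {"samples": buf, "tokens_used": used}
    return run
```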
`.batch()` and `.unbatch()` operate directly on the dataset stream. They are valid for normal iteration, but both are intentionally rejected before `.cache()`.
`.cache()` materializes all upstream outputs into a local Lance dataset.
This is intended for expensive preprocessing such as:
- image decode
- tokenization
- feature extraction
- multimodal assembly
Basic example:

```python
from mvp_dataset import Dataset

def tokenize(sample: dict[str, object]) -> dict[str, object]:
    return {**sample, "tokens": sample["text"].split()}

dataset = (
    Dataset.from_source("jsonl", "/data/train.jsonl")
    .map(tokenize)
    .cache()
    .shuffle(buffer_size=4096)
)
```

All stages before `.cache()` are materialized once. All stages after `.cache()` run on every iteration.
```python
dataset = (
    Dataset.from_source("tars", shards)
    .map(expensive_preprocess)   # cached
    .cache()
    .map(random_augment)         # not cached
    .shuffle(buffer_size=2048)   # not cached
)
```

- storage format: Lance
- default cache root: `.cache` under the current working directory
- output path is derived from a plan fingerprint
The cache plan fingerprint is built from:
- source listing / fragment identity
- every pre-cache stage marked as traceable
Current behavior:
- `.map()` participates in invalidation
- `.select()` participates in invalidation
- `.assemble()` participates in invalidation, including `drop_last`
- `.shuffle()` is intentionally allowed before `.cache()`, but does not participate in invalidation

If `.shuffle()` appears before `.cache()`, a warning is logged to make that tradeoff explicit.
Lambdas and local closures are rejected before `.cache()` because their identity is not stable across process restarts.
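For instance, a rejected closure can be replaced with a module-level function (`shards` is a placeholder):

```python
from mvp_dataset import Dataset

# Rejected before .cache(): a lambda's identity is not stable across restarts.
# dataset = Dataset.from_source("jsonl", shards).map(lambda s: {**s, "n": 1}).cache()

# Stable alternative: a module-level function with a fixed qualified name.
def add_flag(sample: dict[str, object]) -> dict[str, object]:
    return {**sample, "n": 1}

dataset = Dataset.from_source("jsonl", shards).map(add_flag).cache()
```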
Supported: `map`, `select`, `shuffle`, `assemble`

Rejected: `batch`, `unbatch`
The current implementation preserves full-stream semantics:
- only the leading contiguous `map(...)` prefix is parallelized with `cache_num_workers`
- once a non-`map` stage is reached, all remaining pre-cache stages run on the merged stream in declaration order
This is why shuffle and assemble behave the same with and without cache.
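A sketch of how that rule partitions a pipeline (the transform names and `shards` are hypothetical):

```python
dataset = (
    Dataset.from_source("jsonl", shards)
    .map(decode)                 # leading map prefix: parallelized
    .map(tokenize)               # still in the leading map prefix
    .shuffle(buffer_size=1024)   # first non-map stage: merged stream from here
    .assemble(pack_samples)      # runs after the merge, in declaration order
    .cache(cache_num_workers=8)
)
```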
Without a mesh, sharding is based on global rank and `world_size`.
With a `DataLoadMesh`, sharding is based on data-parallel dimensions:
- `RuntimeContext.slot` uses `dp_rank`
- `RuntimeContext.total_slots` uses `dp_size`
- TP co-members receive identical data
- DP ranks receive distinct data
Example:
```python
from mvp_dataset import Dataset, RuntimeContext

ctx = RuntimeContext.from_runtime(
    device_mesh=mesh,
    dp_dims=("replicate", "shard"),
)

dataset = Dataset.from_source("tars", shards, context=ctx)
```

The cache layer is mesh-aware:
- per-slot cache shards are keyed by `dp_rank`, not global `rank`
- only one DP leader writes each per-slot cache shard
- merge consumes one shard per DP slot
This prevents TP co-members from duplicating cached samples.
`TorchLoader` wraps PyTorch `DataLoader` and applies additional post-merge stages in the main process.
Recommended high-throughput pattern:
- use worker micro-batches with `batch_size` and `collate_fn=lambda x: x`
- call `.unbatch()`
- call loader-side `.shuffle(...)`
- optionally call loader-side `.assemble(...)`
- call loader-side `.batch(...)`
Example:
```python
from mvp_dataset import Dataset, TorchLoader

dataset = Dataset.from_source("tars", shards).map(preprocess)

loader = (
    TorchLoader(dataset, num_workers=8, batch_size=32, collate_fn=lambda x: x)
    .unbatch()
    .shuffle(buffer_size=4096)
    .assemble(pack_samples)
    .batch(batch_size=128)
)
```

Loader-side shuffle uses the upstream dataset context when available, so TP co-members stay aligned by default. Loader-side assemble also resolves runtime context from the upstream dataset when available.
Tar members are parsed as `<key>.<field>`.
Examples:
- `abc.jpg` -> key `abc`, field `jpg`
- `abc.extra.jpg` with `key_dot_level=2` -> key `abc.extra`, field `jpg`

Related environment variable: `LOADER_TAR_KEY_DOT_LEVEL` (default `1`)
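A sketch of adjusting the dot level (assuming the variable is read when the dataset is built):

```python
import os

# Parse "abc.extra.jpg" as key="abc.extra", field="jpg".
os.environ["LOADER_TAR_KEY_DOT_LEVEL"] = "2"

from mvp_dataset import Dataset

dataset = Dataset.from_source("tars", "/data/shards/shard_{000000..000006}.tar")
```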
Supported URI grammar:
- `tar://<shard_path>#<key>.<field>`
- `<shard_path>#<key>.<field>`

Example:
`tar://examples/data/tars/shard_000000.tar#image_001.jpg`
Reference fields may be either:
- one URI string
- a list of URI strings
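Illustrative JSONL rows (paths and keys are hypothetical) showing both forms:

```python
import json

single = {"text": "a dog", "image": "tar://images/shard_000000.tar#dog_001.jpg"}
multi = {
    "text": "a scene",
    "image": [
        "tar://images/shard_000000.tar#scene_001.jpg",
        "tar://images/shard_000000.tar#scene_002.jpg",
    ],
}
print(json.dumps(single))
print(json.dumps(multi))
```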
Each parquet row is exposed as a sample dict containing:
- `__file__`
- `__index_in_file__`
- `__key__`
Parquet scheduling uses row groups as units of work.
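For example, inspecting those fields on the first sample (`shards` is a placeholder):

```python
from mvp_dataset import Dataset

for sample in Dataset.from_source("parquet", shards, columns=["text"]):
    print(sample["__file__"], sample["__index_in_file__"], sample["__key__"])
    break
```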
Distributed / runtime context:
- `RANK`
- `WORLD_SIZE`
- `WORKER`
- `NUM_WORKERS`
Source behavior:
- `LOADER_TAR_KEY_DOT_LEVEL`
- `LOADER_JSONL_TAR_CACHE_SIZE`
```bash
python3 examples/tar.py examples/demo_data/image/shard_{000000..000003}.tar --max-batches 2
python3 examples/jsonl.py examples/demo_data/samples.jsonl --max-batches 2
python3 examples/parquet.py "examples/demo_data/*.parquet" --max-batches 2
```