A PySpark-based data augmentation framework for trajectory data with integrated MimicGen support.
- 🚀 Intuitive API: Simple, clean pipeline design for data augmentation
- ⚡ Parallel Processing: Automatic distribution using PySpark
- 🔧 Flexible Augmentations: Easy-to-define local and global augmentations
- 🤖 MimicGen Integration: Built-in support for robotic trajectory generation
- 📦 HDF5 Support: Seamless data loading and saving
```bash
# Clone the repository
git clone <repository-url>
cd dataaug-platform

# Install in editable mode
pip install -e .
```

- Python >= 3.8
- PySpark 4.0.1
- NumPy 2.3.4
- h5py >= 3.0.0
- robosuite 1.4.1
- robomimic 0.5
- mimicgen 1.0.0
```python
from dataaug_platform import Augmentation, local_aug, Pipeline, hdf5_to_rdd
from pyspark.sql import SparkSession
import numpy as np

# Define a custom augmentation
class AddNoiseAugmentation(Augmentation):
    def __init__(self, noise_scale=0.01):
        self.noise_scale = noise_scale

    @local_aug
    def apply(self, traj):
        """Add noise to trajectory states."""
        traj_copy = traj.copy()
        noise = np.random.normal(0, self.noise_scale, traj_copy["states"].shape)
        traj_copy["states"] = traj_copy["states"] + noise
        return traj_copy

# Initialize Spark
spark = SparkSession.builder.appName("DataAugExample").getOrCreate()

# Load data
trajectories = hdf5_to_rdd(spark.sparkContext, "data.hdf5")

# Build and run pipeline
pipeline = Pipeline(spark)
pipeline.add(AddNoiseAugmentation(noise_scale=0.01))
results = pipeline.run(trajectories).collect()
```

Load, process, and save trajectories with the HDF5 helpers:

```python
from dataaug_platform import hdf5_to_rdd, write_trajectories_to_hdf5
# Load HDF5 data
trajectories_rdd = hdf5_to_rdd(spark.sparkContext, "input.hdf5")
# Process with pipeline
results = pipeline.run(trajectories_rdd).collect()
# Save results
write_trajectories_to_hdf5(results, "output.hdf5")
```

Define custom augmentations using decorators:
Processes each trajectory independently (parallelizable):
```python
class MyLocalAug(Augmentation):
    @local_aug
    def apply(self, traj):
        # Process a single trajectory and return the modified copy
        modified_traj = dict(traj)
        return modified_traj
```

Processes all trajectories together:
```python
class MyGlobalAug(Augmentation):
    @global_aug
    def apply(self, trajs):
        # Process all trajectories together and
        # return a list of new trajectories
        new_trajs = list(trajs)
        return new_trajs
```
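A dataset-wide normalization is a natural global augmentation, since it needs statistics over every trajectory at once. The sketch below assumes trajectories are dicts with a `"states"` array, as in the quick start; `NormalizeStates` is a hypothetical example, not part of the framework:

```python
import numpy as np
from dataaug_platform import Augmentation, global_aug

# Hypothetical global augmentation: standardize states using
# mean/std computed across the entire dataset.
class NormalizeStates(Augmentation):
    @global_aug
    def apply(self, trajs):
        all_states = np.concatenate([t["states"] for t in trajs], axis=0)
        mean = all_states.mean(axis=0)
        std = all_states.std(axis=0) + 1e-8  # avoid division by zero
        out = []
        for t in trajs:
            t_new = dict(t)
            t_new["states"] = (t["states"] - mean) / std
            out.append(t_new)
        return out
```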
Chain augmentations together:

```python
pipeline = Pipeline(spark)
pipeline.add(AugmentationA())
pipeline.add(AugmentationB(times=5, keep_original=True))
results = pipeline.run(data)
```

Simple functions for HDF5 I/O:
- `hdf5_to_rdd(sc, path)` - Load HDF5 to Spark RDD
- `write_trajectories_to_hdf5(trajs, path)` - Write trajectories to HDF5
- `read_hdf5_metadata(path)` - Read HDF5 metadata/attributes
- `load_hdf5_group(group)` - Load HDF5 group recursively
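The loading and saving helpers appear in the examples above; `read_hdf5_metadata` and `load_hdf5_group` can inspect a file without a Spark context. A minimal sketch, assuming the standard `/data/demo_N` layout and that `load_hdf5_group` accepts an h5py group:

```python
import h5py
from dataaug_platform import read_hdf5_metadata, load_hdf5_group

# Inspect file-level attributes without loading trajectory data
meta = read_hdf5_metadata("demos.hdf5")
print(meta)

# Load a single demo group into a nested Python dict
with h5py.File("demos.hdf5", "r") as f:
    demo_0 = load_hdf5_group(f["data/demo_0"])
print(demo_0.keys())
```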
MimicGen generates new robot demonstrations by composing and adapting existing trajectories using subtask-level reasoning.
```python
from dataaug_platform import Pipeline, MimicGenAugmentation, hdf5_to_rdd
from mimicgen.configs import MG_TaskSpec
import robomimic.utils.file_utils as FileUtils

# Load data
trajectories_rdd = hdf5_to_rdd(spark.sparkContext, "demos.hdf5")

# Get environment metadata from HDF5
env_meta = FileUtils.get_env_metadata_from_dataset("demos.hdf5")

# Configure task specification
task_spec = MG_TaskSpec()

# Subtask 1: Grasp object
task_spec.add_subtask(
    object_ref="object_name",
    subtask_term_signal="grasp",
    subtask_term_offset_range=(10, 20),
    selection_strategy="nearest_neighbor_object",
    selection_strategy_kwargs={"nn_k": 3},
    action_noise=0.05,
    num_interpolation_steps=5,
)

# Subtask 2: Final placement
task_spec.add_subtask(
    object_ref="target_name",
    subtask_term_signal=None,  # Final subtask
    selection_strategy="random",
    action_noise=0.05,
)

# Build pipeline
pipeline = Pipeline(spark)
pipeline.add(MimicGenAugmentation(
    task_spec=task_spec,
    env_meta=env_meta,                  # Pass env metadata
    env_interface_type="MG_TaskName",   # e.g., "MG_Square"
    times=10,                           # Generate 10 new trajectories
    keep_original=True,
    num_workers=5,                      # Parallel generation
))

# Run pipeline
results = pipeline.run(trajectories_rdd).collect()
```

Since robosuite environments cannot be serialized for Spark distribution, the framework uses a per-worker initialization pattern (sketched after the lists below):
- Pass `env_meta` (a serializable dict) instead of an `env` object
- Each Spark worker creates its own environment instance
- Enables true parallel trajectory generation
This approach:
- ✅ Avoids serialization issues
- ✅ Enables parallel data generation
- ✅ Follows MimicGen's own patterns
- ✅ Clean and maintainable
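A minimal sketch of the pattern (not the framework's internal code): only the serializable `env_meta` dict is shipped to workers, and each worker process lazily builds and caches its own environment. `create_env_from_metadata` comes from robomimic's env utils; `generate_on_partition` is a hypothetical partition function:

```python
import robomimic.utils.env_utils as EnvUtils

_WORKER_ENV = None  # one cached environment per Spark worker process

def _get_worker_env(env_meta):
    """Lazily create the simulation environment inside the worker."""
    global _WORKER_ENV
    if _WORKER_ENV is None:
        _WORKER_ENV = EnvUtils.create_env_from_metadata(env_meta=env_meta)
    return _WORKER_ENV

def generate_on_partition(trajs, env_meta):
    # Only env_meta (a plain dict) is captured in the task closure;
    # the robosuite environment itself never crosses process boundaries.
    env = _get_worker_env(env_meta)
    for traj in trajs:
        yield traj  # placeholder: per-trajectory generation happens here
```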
Basic example demonstrating pipeline usage with custom augmentations.
```bash
python examples/example_usage.py
```

Complete working example with MimicGen for the Square task:
- Proper environment setup with robomimic
- Datagen info preparation
- Full MimicGen data generation pipeline
- Parallel trajectory generation
```bash
python examples/mimicgen_square_example.py \
    --source square_demos.hdf5 \
    --output generated_demos.hdf5 \
    --num-demos 10
```

Features:
- Uses `hdf5_to_rdd()` for data loading
- `AddDatagenInfoAugmentation` prepares MimicGen metadata (end-effector poses, object poses, subtask signals)
- Per-worker environment initialization for parallelization
- Achieves 100% success rate in testing
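Schematically, the example chains the two augmentations as below, continuing the variables from the MimicGen example above (a sketch; the `AddDatagenInfoAugmentation` constructor arguments shown are assumptions, so see the script for the real invocation):

```python
pipeline = Pipeline(spark)
# Annotate source demos with MimicGen datagen info (poses, subtask signals)
pipeline.add(AddDatagenInfoAugmentation(env_meta=env_meta))
# Then generate new demonstrations from the annotated demos
pipeline.add(MimicGenAugmentation(
    task_spec=task_spec,
    env_meta=env_meta,
    env_interface_type="MG_Square",
    times=10,
))
results = pipeline.run(trajectories_rdd).collect()
```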
```
┌─────────────────────────────────────────────────────────────┐
│ Your Application │
│ • Define augmentations with @local_aug / @global_aug │
│ • Build pipeline with Pipeline().add() │
│ • Load data with hdf5_to_rdd() │
└──────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────┐
│ DataAug Platform │
│ • Manages Spark parallelization │
│ • Handles data distribution │
│ • Coordinates augmentation execution │
└──────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────────┐
│ PySpark │
│ • Distributed computing │
│ • RDD transformations │
│ • Automatic scaling │
└─────────────────────────────────────────────────────────────┘
```
Base class for all augmentations.
Parameters:
- `times` (int): Number of times to apply the augmentation (for global augmentations)
- `keep_original` (bool): Whether to keep the original data
Manages augmentation workflow.
Methods:
- `add(augmentation)`: Add an augmentation to the pipeline
- `run(data)`: Execute the pipeline on data
MimicGen-based trajectory generation.
Parameters:
- `task_spec` (MG_TaskSpec): Task specification with subtasks
- `env_meta` (dict): Environment metadata for worker initialization
- `env_interface_type` (str): Environment interface class name
- `times` (int): Number of trajectories to generate
- `keep_original` (bool): Keep source demonstrations
- `num_workers` (int): Number of parallel workers
- `select_src_per_subtask` (bool): Selection strategy flag
- `transform_first_robot_pose` (bool): Transform the initial pose
- `interpolate_from_last_target_pose` (bool): Interpolation strategy
- `render` (bool): Enable rendering
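The optional flags might be enabled like this (a hedged sketch; the values and exact flag semantics are assumptions based on the descriptions above):

```python
pipeline.add(MimicGenAugmentation(
    task_spec=task_spec,
    env_meta=env_meta,
    env_interface_type="MG_Square",
    times=10,
    select_src_per_subtask=True,           # pick a source demo per subtask
    transform_first_robot_pose=False,
    interpolate_from_last_target_pose=True,
    render=False,
))
```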
- `hdf5_to_rdd(sc, path)`: Load an HDF5 file with structure /data/demo_0, /data/demo_1, etc. into a Spark RDD.
- `write_trajectories_to_hdf5(trajs, path)`: Write a list of trajectory dictionaries to an HDF5 file.
- `read_hdf5_metadata(path)`: Read all metadata/attributes from an HDF5 file.
- `load_hdf5_group(group)`: Recursively load an HDF5 group into a nested Python dict.
- Start Simple: Test augmentations locally before scaling to Spark
- Monitor Memory: Use the `spark.driver.memory` config for large datasets
- Use Partitioning: Adjust `num_workers` based on cluster size
- Check Spark UI: Monitor job progress at http://localhost:4040
- Test Incrementally: Validate each augmentation before chaining
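For the first tip, a quick local smoke test might look like the following (a sketch that assumes the decorated `apply()` can be called directly on a trajectory dict):

```python
import numpy as np

# Hypothetical smoke test for AddNoiseAugmentation from the quick start
aug = AddNoiseAugmentation(noise_scale=0.01)
traj = {"states": np.zeros((100, 7))}

out = aug.apply(traj)
assert out["states"].shape == traj["states"].shape
assert not np.array_equal(out["states"], traj["states"])  # noise was added
```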
Solution: Increase Spark driver memory:
```python
spark = SparkSession.builder \
    .config("spark.driver.memory", "8g") \
    .getOrCreate()
```

Solution: Adjust parallelization:

```python
pipeline.add(MimicGenAugmentation(..., num_workers=10))
```

Solution: Use the env_meta pattern as shown in the MimicGen examples
To add new augmentations:
- Inherit from `Augmentation`
- Use the `@local_aug` or `@global_aug` decorator
- Implement the `apply()` method
- Test locally before scaling
Example:
```python
class MyAugmentation(Augmentation):
    @local_aug
    def apply(self, traj):
        modified_traj = dict(traj)
        # Your logic here
        return modified_traj
```