
Dataset v2.0 (huggingface#461)
Co-authored-by: Remi <[email protected]>
2 people authored and DomThePorcupine committed Dec 2, 2024
1 parent 040f0bd commit 9a9a0f5
Showing 70 changed files with 6,111 additions and 1,761 deletions.
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -21,7 +21,7 @@ Provide a simple way for the reviewer to try out your changes.

Examples:
```bash
DATA_DIR=tests/data pytest -sx tests/test_stuff.py::test_something
pytest -sx tests/test_stuff.py::test_something
```
```bash
python lerobot/scripts/train.py --some.option=true
8 changes: 1 addition & 7 deletions .github/workflows/nightly-tests.yml
@@ -7,10 +7,8 @@ on:
schedule:
- cron: "0 2 * * *"

env:
DATA_DIR: tests/data
# env:
# SLACK_API_TOKEN: ${{ secrets.SLACK_API_TOKEN }}

jobs:
run_all_tests_cpu:
name: CPU
@@ -30,13 +28,9 @@ jobs:
working-directory: /lerobot
steps:
- name: Tests
env:
DATA_DIR: tests/data
run: pytest -v --cov=./lerobot --disable-warnings tests

- name: Tests end-to-end
env:
DATA_DIR: tests/data
run: make test-end-to-end


5 changes: 1 addition & 4 deletions .github/workflows/test.yml
@@ -29,7 +29,6 @@ jobs:
name: Pytest
runs-on: ubuntu-latest
env:
DATA_DIR: tests/data
MUJOCO_GL: egl
steps:
- uses: actions/checkout@v4
@@ -70,7 +69,6 @@ jobs:
name: Pytest (minimal install)
runs-on: ubuntu-latest
env:
DATA_DIR: tests/data
MUJOCO_GL: egl
steps:
- uses: actions/checkout@v4
@@ -103,12 +101,11 @@ jobs:
-W ignore::UserWarning:gymnasium.utils.env_checker:247 \
&& rm -rf tests/outputs outputs
# TODO(aliberts, rcadene): redesign after v2 migration / removing hydra
end-to-end:
name: End-to-end
runs-on: ubuntu-latest
env:
DATA_DIR: tests/data
MUJOCO_GL: egl
steps:
- uses: actions/checkout@v4
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -267,7 +267,7 @@ We use `pytest` in order to run the tests. From the root of the
repository, here's how to run tests with `pytest` for the library:

```bash
DATA_DIR="tests/data" python -m pytest -sv ./tests
python -m pytest -sv ./tests
```


14 changes: 7 additions & 7 deletions README.md
@@ -153,10 +153,12 @@ python lerobot/scripts/visualize_dataset.py \
--episode-index 0
```

or from a dataset in a local folder with the root `DATA_DIR` environment variable (in the following case the dataset will be searched for in `./my_local_data_dir/lerobot/pusht`)
or from a dataset in a local folder with the `--root` option and the `--local-files-only` flag (in the following case the dataset will be searched for in `./my_local_data_dir/lerobot/pusht`)
```bash
DATA_DIR='./my_local_data_dir' python lerobot/scripts/visualize_dataset.py \
python lerobot/scripts/visualize_dataset.py \
--repo-id lerobot/pusht \
--root ./my_local_data_dir \
--local-files-only 1 \
--episode-index 0
```

@@ -208,12 +210,10 @@ dataset attributes:

A `LeRobotDataset` is serialised using several widespread file formats for each of its parts, namely:
- hf_dataset stored using Hugging Face datasets library serialization to parquet
- videos are stored in mp4 format to save space or png files
- episode_data_index saved using `safetensor` tensor serialization format
- stats saved using `safetensor` tensor serialization format
- info are saved using JSON
- videos are stored in mp4 format to save space
- metadata are stored in plain json/jsonl files

Dataset can be uploaded/downloaded from the HuggingFace hub seamlessly. To work on a local dataset, you can set the `DATA_DIR` environment variable to your root dataset folder as illustrated in the above section on dataset visualization.
Datasets can be uploaded/downloaded from the HuggingFace hub seamlessly. To work on a local dataset, you can use the `local_files_only` argument and specify its location with the `root` argument if it's not in the default `~/.cache/huggingface/lerobot` location.
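For instance, a minimal Python sketch of what this could look like (the exact `root` path below is only an illustration; check the `LeRobotDataset` docstring for the authoritative argument semantics):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Hypothetical local copy of the pusht dataset; adjust the path to wherever your
# dataset actually lives (by default, ~/.cache/huggingface/lerobot/<repo_id>).
dataset = LeRobotDataset(
    "lerobot/pusht",
    root="./my_local_data_dir/lerobot/pusht",
    local_files_only=True,
)
print(dataset.num_episodes, dataset.num_frames)
```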

### Evaluate a pretrained policy

2 changes: 1 addition & 1 deletion benchmarks/video/run_video_benchmark.py
@@ -266,7 +266,7 @@ def benchmark_encoding_decoding(
)

ep_num_images = dataset.episode_data_index["to"][0].item()
width, height = tuple(dataset[0][dataset.camera_keys[0]].shape[-2:])
width, height = tuple(dataset[0][dataset.meta.camera_keys[0]].shape[-2:])
num_pixels = width * height
video_size_bytes = video_path.stat().st_size
images_size_bytes = get_directory_size(imgs_dir)
9 changes: 2 additions & 7 deletions examples/10_use_so100.md
@@ -192,7 +192,6 @@ Record 2 episodes and upload your dataset to the hub:
python lerobot/scripts/control_robot.py record \
--robot-path lerobot/configs/robot/so100.yaml \
--fps 30 \
--root data \
--repo-id ${HF_USER}/so100_test \
--tags so100 tutorial \
--warmup-time-s 5 \
@@ -212,18 +211,16 @@ echo ${HF_USER}/so100_test
If you didn't upload with `--push-to-hub 0`, you can also visualize it locally with:
```bash
python lerobot/scripts/visualize_dataset_html.py \
--root data \
--repo-id ${HF_USER}/so100_test
```

## Replay an episode

Now try to replay the first episode on your robot:
```bash
DATA_DIR=data python lerobot/scripts/control_robot.py replay \
python lerobot/scripts/control_robot.py replay \
--robot-path lerobot/configs/robot/so100.yaml \
--fps 30 \
--root data \
--repo-id ${HF_USER}/so100_test \
--episode 0
```
@@ -232,7 +229,7 @@ DATA_DIR=data python lerobot/scripts/control_robot.py replay \

To train a policy to control your robot, use the [`python lerobot/scripts/train.py`](../lerobot/scripts/train.py) script. A few arguments are required. Here is an example command:
```bash
DATA_DIR=data python lerobot/scripts/train.py \
python lerobot/scripts/train.py \
dataset_repo_id=${HF_USER}/so100_test \
policy=act_so100_real \
env=so100_real \
@@ -248,7 +245,6 @@ Let's explain it:
3. We provided an environment as argument with `env=so100_real`. This loads configurations from [`lerobot/configs/env/so100_real.yaml`](../lerobot/configs/env/so100_real.yaml).
4. We provided `device=cuda` since we are training on a Nvidia GPU, but you can also use `device=mps` if you are using a Mac with Apple silicon, or `device=cpu` otherwise.
5. We provided `wandb.enable=true` to use [Weights and Biases](https://docs.wandb.ai/quickstart) for visualizing training plots. This is optional but if you use it, make sure you are logged in by running `wandb login`.
6. We added `DATA_DIR=data` to access your dataset stored in your local `data` directory. If you dont provide `DATA_DIR`, your dataset will be downloaded from Hugging Face hub to your cache folder `$HOME/.cache/hugginface`. In future versions of `lerobot`, both directories will be in sync.

Training should take several hours. You will find checkpoints in `outputs/train/act_so100_test/checkpoints`.

@@ -259,7 +255,6 @@ You can use the `record` function from [`lerobot/scripts/control_robot.py`](../l
python lerobot/scripts/control_robot.py record \
--robot-path lerobot/configs/robot/so100.yaml \
--fps 30 \
--root data \
--repo-id ${HF_USER}/eval_act_so100_test \
--tags so100 tutorial eval \
--warmup-time-s 5 \
9 changes: 2 additions & 7 deletions examples/11_use_moss.md
@@ -192,7 +192,6 @@ Record 2 episodes and upload your dataset to the hub:
python lerobot/scripts/control_robot.py record \
--robot-path lerobot/configs/robot/moss.yaml \
--fps 30 \
--root data \
--repo-id ${HF_USER}/moss_test \
--tags moss tutorial \
--warmup-time-s 5 \
@@ -212,18 +211,16 @@ echo ${HF_USER}/moss_test
If you didn't upload with `--push-to-hub 0`, you can also visualize it locally with:
```bash
python lerobot/scripts/visualize_dataset_html.py \
--root data \
--repo-id ${HF_USER}/moss_test
```

## Replay an episode

Now try to replay the first episode on your robot:
```bash
DATA_DIR=data python lerobot/scripts/control_robot.py replay \
python lerobot/scripts/control_robot.py replay \
--robot-path lerobot/configs/robot/moss.yaml \
--fps 30 \
--root data \
--repo-id ${HF_USER}/moss_test \
--episode 0
```
@@ -232,7 +229,7 @@ DATA_DIR=data python lerobot/scripts/control_robot.py replay \

To train a policy to control your robot, use the [`python lerobot/scripts/train.py`](../lerobot/scripts/train.py) script. A few arguments are required. Here is an example command:
```bash
DATA_DIR=data python lerobot/scripts/train.py \
python lerobot/scripts/train.py \
dataset_repo_id=${HF_USER}/moss_test \
policy=act_moss_real \
env=moss_real \
@@ -248,7 +245,6 @@ Let's explain it:
3. We provided an environment as argument with `env=moss_real`. This loads configurations from [`lerobot/configs/env/moss_real.yaml`](../lerobot/configs/env/moss_real.yaml).
4. We provided `device=cuda` since we are training on a Nvidia GPU, but you can also use `device=mps` if you are using a Mac with Apple silicon, or `device=cpu` otherwise.
5. We provided `wandb.enable=true` to use [Weights and Biases](https://docs.wandb.ai/quickstart) for visualizing training plots. This is optional but if you use it, make sure you are logged in by running `wandb login`.
6. We added `DATA_DIR=data` to access your dataset stored in your local `data` directory. If you dont provide `DATA_DIR`, your dataset will be downloaded from Hugging Face hub to your cache folder `$HOME/.cache/hugginface`. In future versions of `lerobot`, both directories will be in sync.

Training should take several hours. You will find checkpoints in `outputs/train/act_moss_test/checkpoints`.

@@ -259,7 +255,6 @@ You can use the `record` function from [`lerobot/scripts/control_robot.py`](../l
python lerobot/scripts/control_robot.py record \
--robot-path lerobot/configs/robot/moss.yaml \
--fps 30 \
--root data \
--repo-id ${HF_USER}/eval_act_moss_test \
--tags moss tutorial eval \
--warmup-time-s 5 \
123 changes: 83 additions & 40 deletions examples/1_load_lerobot_dataset.py
@@ -3,78 +3,120 @@
It illustrates how to load datasets, manipulate them, and apply transformations suitable for machine learning tasks in PyTorch.
Features included in this script:
- Loading a dataset and accessing its properties.
- Filtering data by episode number.
- Converting tensor data for visualization.
- Saving video files from dataset frames.
- Viewing a dataset's metadata and exploring its properties.
- Loading an existing dataset from the hub or a subset of it.
- Accessing frames by episode number.
- Using advanced dataset features like timestamp-based frame selection.
- Demonstrating compatibility with PyTorch DataLoader for batch processing.
The script ends with examples of how to batch process data using PyTorch's DataLoader.
"""

from pathlib import Path
from pprint import pprint

import imageio
import torch
from huggingface_hub import HfApi

import lerobot
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata

# We ported a number of existing datasets ourselves, use this to see the list:
print("List of available datasets:")
pprint(lerobot.available_datasets)

# Let's take one for this example
repo_id = "lerobot/pusht"

# You can easily load a dataset from a Hugging Face repository
# You can also browse through the datasets created/ported by the community on the hub using the hub api:
hub_api = HfApi()
repo_ids = [info.id for info in hub_api.list_datasets(task_categories="robotics", tags=["LeRobot"])]
pprint(repo_ids)

# Or simply explore them in your web browser directly at:
# https://huggingface.co/datasets?other=LeRobot

# Let's take this one for this example
repo_id = "lerobot/aloha_mobile_cabinet"
# We can have a look and fetch its metadata to know more about it:
ds_meta = LeRobotDatasetMetadata(repo_id)

# By instantiating just this class, you can quickly access useful information about the content and the
# structure of the dataset without downloading the actual data yet (only metadata files — which are
# lightweight).
print(f"Total number of episodes: {ds_meta.total_episodes}")
print(f"Average number of frames per episode: {ds_meta.total_frames / ds_meta.total_episodes:.3f}")
print(f"Frames per second used during data collection: {ds_meta.fps}")
print(f"Robot type: {ds_meta.robot_type}")
print(f"keys to access images from cameras: {ds_meta.camera_keys=}\n")

print("Tasks:")
print(ds_meta.tasks)
print("Features:")
pprint(ds_meta.features)

# You can also get a short summary by simply printing the object:
print(ds_meta)

# You can then load the actual dataset from the hub.
# Either load any subset of episodes:
dataset = LeRobotDataset(repo_id, episodes=[0, 10, 11, 23])

# And see how many frames you have:
print(f"Selected episodes: {dataset.episodes}")
print(f"Number of episodes selected: {dataset.num_episodes}")
print(f"Number of frames selected: {dataset.num_frames}")

# Or simply load the entire dataset:
dataset = LeRobotDataset(repo_id)
print(f"Number of episodes selected: {dataset.num_episodes}")
print(f"Number of frames selected: {dataset.num_frames}")

# LeRobotDataset is actually a thin wrapper around an underlying Hugging Face dataset
# (see https://huggingface.co/docs/datasets/index for more information).
print(dataset)
print(dataset.hf_dataset)
# The previous metadata class is contained in the 'meta' attribute of the dataset:
print(dataset.meta)

# And provides additional utilities for robotics and compatibility with Pytorch
print(f"\naverage number of frames per episode: {dataset.num_samples / dataset.num_episodes:.3f}")
print(f"frames per second used during data collection: {dataset.fps=}")
print(f"keys to access images from cameras: {dataset.camera_keys=}\n")
# LeRobotDataset actually wraps an underlying Hugging Face dataset
# (see https://huggingface.co/docs/datasets for more information).
print(dataset.hf_dataset)

# Access frame indexes associated to first episode
# LeRobot datasets also subclass PyTorch datasets so you can do everything you know and love from working
# with the latter, like iterating through the dataset.
# The __getitem__ iterates over the frames of the dataset. Since our datasets are also structured by
# episodes, you can access the frame indices of any episode using the episode_data_index. Here, we access
# frame indices associated to the first episode:
episode_index = 0
from_idx = dataset.episode_data_index["from"][episode_index].item()
to_idx = dataset.episode_data_index["to"][episode_index].item()

# LeRobot datasets actually subclass PyTorch datasets so you can do everything you know and love from working
# with the latter, like iterating through the dataset. Here we grab all the image frames.
frames = [dataset[idx]["observation.image"] for idx in range(from_idx, to_idx)]
# Then we grab all the image frames from the first camera:
camera_key = dataset.meta.camera_keys[0]
frames = [dataset[idx][camera_key] for idx in range(from_idx, to_idx)]

# Video frames are now float32 in range [0,1] channel first (c,h,w) to follow pytorch convention. To visualize
# them, we convert to uint8 in range [0,255]
frames = [(frame * 255).type(torch.uint8) for frame in frames]
# and to channel last (h,w,c).
frames = [frame.permute((1, 2, 0)).numpy() for frame in frames]
# The objects returned by the dataset are all torch.Tensors
print(type(frames[0]))
print(frames[0].shape)

# Finally, we save the frames to a mp4 video for visualization.
Path("outputs/examples/1_load_lerobot_dataset").mkdir(parents=True, exist_ok=True)
imageio.mimsave("outputs/examples/1_load_lerobot_dataset/episode_0.mp4", frames, fps=dataset.fps)
# Since we're using PyTorch, the shape follows the PyTorch channel-first convention (c, h, w).
# We can compare this shape with the information available for that feature
pprint(dataset.features[camera_key])
# In particular:
print(dataset.features[camera_key]["shape"])
# The shape is in (h, w, c) which is a more universal format.

# For many machine learning applications we need to load the history of past observations or trajectories of
# future actions. Our datasets can load previous and future frames for each key/modality, using timestamps
# differences with the current loaded frame. For instance:
delta_timestamps = {
# loads 4 images: 1 second before current frame, 500 ms before, 200 ms before, and current frame
"observation.image": [-1, -0.5, -0.20, 0],
# loads 8 state vectors: 1.5 seconds before, 1 second before, ... 20 ms, 10 ms, and current frame
"observation.state": [-1.5, -1, -0.5, -0.20, -0.10, -0.02, -0.01, 0],
camera_key: [-1, -0.5, -0.20, 0],
# loads 6 state vectors: 1.5 seconds before, 1 second before, ... 200 ms, 100 ms, and current frame
"observation.state": [-1.5, -1, -0.5, -0.20, -0.10, 0],
# loads 64 action vectors: current frame, 1 frame in the future, 2 frames, ... 63 frames in the future
"action": [t / dataset.fps for t in range(64)],
}
# Note that in any case, these delta_timestamps values need to be multiples of (1/fps) so that added to any
# timestamp, you still get a valid timestamp.

dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)
print(f"\n{dataset[0]['observation.image'].shape=}") # (4,c,h,w)
print(f"{dataset[0]['observation.state'].shape=}") # (8,c)
print(f"{dataset[0]['action'].shape=}\n") # (64,c)
print(f"\n{dataset[0][camera_key].shape=}") # (4, c, h, w)
print(f"{dataset[0]['observation.state'].shape=}") # (6, c)
print(f"{dataset[0]['action'].shape=}\n") # (64, c)

# Finally, our datasets are fully compatible with PyTorch dataloaders and samplers because they are just
# PyTorch datasets.
@@ -84,8 +126,9 @@
batch_size=32,
shuffle=True,
)

for batch in dataloader:
print(f"{batch['observation.image'].shape=}") # (32,4,c,h,w)
print(f"{batch['observation.state'].shape=}") # (32,8,c)
print(f"{batch['action'].shape=}") # (32,64,c)
print(f"{batch[camera_key].shape=}") # (32, 4, c, h, w)
print(f"{batch['observation.state'].shape=}") # (32, 5, c)
print(f"{batch['action'].shape=}") # (32, 64, c)
break
2 changes: 1 addition & 1 deletion examples/3_train_policy.py
@@ -40,7 +40,7 @@
# For this example, no arguments need to be passed because the defaults are set up for PushT.
# If you're doing something different, you will likely need to change at least some of the defaults.
cfg = DiffusionConfig()
policy = DiffusionPolicy(cfg, dataset_stats=dataset.stats)
policy = DiffusionPolicy(cfg, dataset_stats=dataset.meta.stats)
policy.train()
policy.to(device)

