# Dataset v2.0 #461
## Conversation
TODO after merging: #485
Beautiful work, thanks. Left some comments. Hope it helps :)
```diff
@@ -297,6 +289,7 @@ def test_flatten_unflatten_dict():
     assert json.dumps(original_d, sort_keys=True) == json.dumps(d, sort_keys=True), f"{original_d} != {d}"


+@pytest.mark.skip("TODO after v2 migration / removing hydra")
```
This test `test_backward_compatibility(repo_id)` makes me think we should probably train diffusion policy on pusht before and after this PR to compare dataset v1 vs v2.

I tried training using the new dataset and see some errors in `compute_stats.py`; should `d.stats` be changed to `d.meta.stats`?
## What this does
This PR introduces a new format for `LeRobotDataset`, which is accompanied by a new file structure. As these changes are not backward compatible, we increase `CODEBASE_VERSION` from `v1.6` to `v2.0`.

## What do I need to do?
If you already pushed a dataset using `v1.6` of our codebase, you can use the conversion script `lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py` to convert it to the new format. You will be asked to enter a prompt describing the task performed in the dataset.

Example for a single-task dataset:
```bash
python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id lerobot/aloha_sim_insertion_human_image \
    --task "Insert the peg into the socket."
```

If you recorded your dataset with one of the manipulator robots currently supported in LeRobot (or your own implementation), you can provide its configuration path to add the motor names and robot type to the dataset info, using the `--robot-config` option:
```bash
python lerobot/common/datasets/v2/convert_dataset_v1_to_v2.py \
    --repo-id aliberts/koch_tutorial \
    --task "Pick the Lego block and drop it in the box on the right." \
    --robot-config lerobot/configs/robot/koch.yaml
```

For the more complicated cases of one task per episode or multiple tasks per episode, please refer to the documentation in that script.
## Motivation
The current implementation of our `LeRobotDataset` suffers from a few shortcomings which make it not easy to use in some respects. Specifically:
- The tight coupling with `datasets` and `huggingface_hub` makes it not convenient to create datasets locally (with recording). In order to use the newly created files on disk, these libraries check if those files are present in the cache (which they won't be) and, if not, will download them even though they may already be on disk.
- `VideoFrame` is not yet integrated into `datasets`.

## Changes
Some of the biggest changes come from the new file structure and its content.

Note that this file-based structure is designed to be as versatile as possible. The parquet files are split by episode (this was already the case for videos), which allows much more granular control over which episodes one wants to use and download. The structure of the dataset is entirely described in the `info.json` file, which can easily be downloaded or viewed directly on the hub before downloading any actual data. The file types used are very simple and do not need complex tools to be read: only `.parquet`, `.json`, `.jsonl` and `.mp4` files are used (plus `.md` for the README).

## Added
- `LeRobotDataset` can now be called with an `episodes` argument (e.g. `episodes=[1, 10, 12, 40]`) to select a specific subset of episodes by their `episode_index`. By doing so, only the files corresponding to these episodes will be downloaded (if they're not already on disk). In that case, the `hf_dataset` attribute will only contain data from these episodes, as well as the `episode_data_index`.
- A new `LeRobotDatasetMetadata` class. This allows getting info about a dataset before loading the data. For an example, see the sketch after this list.
- Tasks are stored in `tasks.json`, mapped to a `task_index`, which is what's actually stored in the parquet files. Using the API, they can be accessed either with `dataset.meta.tasks` to get that mapping or through `dataset.episode_dict[episode_index]["tasks"]` if you're only interested in a particular episode.
- Dataset info is stored in `info.json` (keys, shapes, number of episodes, etc.). It serves as a source of truth for what's inside the dataset.
- `episodes.jsonl` contains per-episode information (`episode_index`, tasks in natural language and episode lengths). This is accessed through the `episode_dict` attribute in the API.
- `LeRobotDataset.create()` allows creating a new dataset from scratch, either for recording data or for porting an existing dataset to the `LeRobotDataset` format (a recording sketch is shown in the last section of this description). To that end, new methods are added:
  - `start_image_writter()`: instantiates an `ImageWriter` in the `image_writer` attribute to write images asynchronously during data recording. This is automatically called during `LeRobotDataset.create()` if specified in the arguments.
  - `stop_image_writter()`: properly stops and removes the `ImageWriter` from the dataset's attributes. Importantly: if the `image_writer` has been set to a multiprocess `ImageWriter`, this needs to be called first if you want to pass the dataset into a parallelized DataLoader, as the `ImageWriter` class is not pickleable (a requirement for objects to be transferred between processes). This is not needed when instantiating a dataset with `__init__`, as the `image_writer` is not created in that case.
  - `add_frame()`: adds the data of a single timestamp to the `episode_buffer`, which keeps data in memory temporarily. Note: this will be merged with the `DataBuffer` from #445 in a subsequent PR.
  - `add_episode()`: saves the content of the `episode_buffer` to disk and updates the metadata so that it stays in sync with the contents of the files. This method expects a `task` argument: a string prompt in natural language describing the task performed in the episode. Videos from that episode can optionally be encoded during this phase, but it's not mandatory and can be done later in order to give more flexibility on when to do that.
  - `consolidate()`: encodes videos that have not yet been encoded, cleans up the temporary image files, computes dataset statistics, checks that timestamps are in sync with the `fps` and performs additional sanity checks on the dataset. It needs to be done before uploading the dataset to the hub with `push_to_hub()`.
  - `clear_episode_buffer()`: can be used to reset the `episode_buffer` (e.g. to discard data from the current recording).
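A minimal sketch of the metadata access described above, assuming the classes are importable from `lerobot.common.datasets.lerobot_dataset` (the import path and the example repo id are assumptions, not part of this description):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata

# Fetch only the metadata files (info.json, episodes.jsonl, tasks.json, stats.json)
# without downloading any parquet or video data.
meta = LeRobotDatasetMetadata("lerobot/aloha_sim_insertion_human_image")
print(meta.tasks)  # mapping of task_index -> natural-language task description

# Once a dataset is instantiated, the same metadata is available under dataset.meta,
# and per-episode information can be read from episode_dict.
dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human_image")
print(dataset.meta.tasks)
print(dataset.episode_dict[0]["tasks"])  # tasks performed in episode 0
```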
## Changed
- Timestamp sync checking is no longer performed in `__getitem__()` and is now done during `__init__` or `consolidate()`. This has the benefit of both saving computation during `__getitem__()` and knowing immediately if there are sync issues with the timestamps.
- File paths are now specified in `info.json` to allow flexibility and to easily split chunks of files between directories, in order to avoid the hub's limit of files (10k) per folder.
- Datasets are now stored in `~/.cache/huggingface/lerobot` by default. Changing `root` or setting the `LEROBOT_HOME` env variable allows changing that location. Every call to the `huggingface_hub` download functions like `snapshot_download` or `hf_hub_download` uses the `local_dir` argument pointing to that location, so that files are not duplicated in the cache and to solve the issue of having to download again files already present on disk. A short sketch is shown after this list.
- Refactored the image-writing logic from `populate_dataset.py` into an `ImageWriter` class.
- `stats.safetensors` is now `stats.json` (the content remains the same but it's unflattened).
- `episode_data_index.safetensors` is removed, but the `episode_data_index` is still in the API to map `episode_index` to indices.
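A short sketch of how the storage location could be controlled with the `root` argument and the `LEROBOT_HOME` variable mentioned above (whether `root` points to the dataset directory itself or to a parent directory is an assumption here):

```python
import os

# Option 1: relocate the whole local store. To be safe, set LEROBOT_HOME before
# importing lerobot (or export it in your shell), since it may be read at import time.
os.environ["LEROBOT_HOME"] = "/data/lerobot"

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Option 2: override the location for a single dataset via the root argument.
dataset = LeRobotDataset("lerobot/pusht", root="/data/lerobot/pusht")
```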
## Performance
In the nominal case (no `delta_timestamps`), `LeRobotDataset.__getitem__()` is on par with the previous version, sometimes slightly improved but generally still in the same ballpark.

`__getitem__()` call time in seconds (averaged over 10k iterations):

Benchmarking code

Using `delta_timestamps`, results are more diverse depending on the dataset but still remain in the same ballpark.

`__getitem__()` call time in seconds (averaged over 10k iterations), `delta_timestamps=[-1/fps, 0, 1/fps]`:

Benchmarking code (delta_timestamps)
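A rough sketch of how such a measurement could be reproduced (10k `__getitem__()` calls, averaged). Note that `delta_timestamps` is written here as a per-key dict while the text above abbreviates it as a flat list, and both the key name and the `dataset.fps` attribute are assumptions:

```python
import random
import time

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset


def benchmark_getitem(dataset, num_iters: int = 10_000) -> float:
    """Return the average __getitem__ call time in seconds."""
    indices = [random.randrange(len(dataset)) for _ in range(num_iters)]
    start = time.perf_counter()
    for idx in indices:
        _ = dataset[idx]
    return (time.perf_counter() - start) / num_iters


dataset = LeRobotDataset("lerobot/pusht")
print(f"nominal: {benchmark_getitem(dataset):.6f} s per call")

fps = dataset.fps
dataset_dt = LeRobotDataset(
    "lerobot/pusht",
    delta_timestamps={"observation.image": [-1 / fps, 0, 1 / fps]},
)
print(f"with delta_timestamps: {benchmark_getitem(dataset_dt):.6f} s per call")
```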
## Fixes
- Fixes `load_previous_and_future_frames`, which didn't actually raise an error when the requested timestamps from `delta_timestamps` did not correspond to actual timestamps in the dataset.
- Fixes language instructions that were stored as raw tensor strings (e.g. `"tf.Tensor(b'Do something', shape=(), dtype=string)"`) in:
  - `lerobot/aloha_mobile_shrimp`
  - `lerobot/aloha_static_battery`
  - `lerobot/aloha_static_fork_pick_up`
  - `lerobot/aloha_static_thread_velcro`
  - `lerobot/uiuc_d3field`
- `lerobot/viola` is missing video keys [TODO]

## How it was tested
- Added `tests/fixtures/`, in which fixtures and fixture factories have been added to simplify writing and adding tests. These factories allow the flexibility to create partially mocked objects on the fly to be used in tests, while not relying on other components of the codebase that are not meant to be tested in a particular test (e.g. initializing a dataset using hydra). A generic illustration of this pattern is sketched after this list.
- Added `tests/test_image_writer.py`
- Added `tests/test_delta_timestamps.py`
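For illustration, a fixture factory in this spirit could look like the following hypothetical sketch (the names and fields here are invented for the example and are not the actual fixtures in `tests/fixtures/`):

```python
import pytest


@pytest.fixture
def dataset_metadata_factory():
    """Return a callable that builds minimal, partially mocked metadata dicts on the fly."""

    def _create(total_episodes: int = 3, fps: int = 30, **overrides):
        info = {
            "codebase_version": "v2.0",
            "fps": fps,
            "total_episodes": total_episodes,
            # Placeholder value; a real factory would derive this from generated episodes.
            "total_frames": total_episodes * 50,
        }
        info.update(overrides)
        return info

    return _create


def test_metadata_is_consistent(dataset_metadata_factory):
    # Each test can tweak only the fields it cares about, without touching hydra or the hub.
    meta = dataset_metadata_factory(total_episodes=5, fps=10)
    assert meta["total_episodes"] == 5
    assert meta["fps"] == 10
```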
## How to checkout & try? (for the reviewer)
Use an existing dataset:
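For example, assuming the same import path as in the sketches above and an already-converted dataset (the repo id and the use of the default PyTorch collate function are assumptions):

```python
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/pusht")  # stored under ~/.cache/huggingface/lerobot by default
print(dataset.meta.tasks)
print(dataset[0].keys())  # a single frame: observations, action, timestamps, indices, ...

# The dataset can be fed to a regular PyTorch DataLoader.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(dataloader))
```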
Try out the new feature to select / download specific episodes:
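For instance, with the `episodes` argument described in the Added section (same import-path assumption as above):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Only the files for these episodes are downloaded (if not already on disk);
# hf_dataset and episode_data_index are restricted to them.
dataset = LeRobotDataset("lerobot/pusht", episodes=[0, 10, 11, 23])
print(len(dataset.hf_dataset))     # frames from the 4 selected episodes only
print(dataset.episode_data_index)  # start/end indices for those episodes
```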
You can also create a new dataset:
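A rough recording sketch based on the creation API described in the Added section; the exact signatures of `create()` and `add_frame()` are not spelled out there, so the arguments below (`fps`, the feature keys, the frame dict) are assumptions:

```python
import numpy as np

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Create an empty dataset on disk; per the description, an ImageWriter can be started
# from create() to save camera frames asynchronously during recording.
dataset = LeRobotDataset.create("my_user/my_new_dataset", fps=30)

for episode_index in range(2):
    for t in range(100):
        # In a real recording loop these values come from the robot and its cameras.
        frame = {
            "observation.state": np.zeros(6, dtype=np.float32),
            "action": np.zeros(6, dtype=np.float32),
        }
        dataset.add_frame(frame)
    # Flush the episode buffer to disk with its natural-language task description.
    dataset.add_episode(task="Pick the Lego block and drop it in the box on the right.")

# Encode remaining videos, compute stats, run sanity checks, then upload.
dataset.consolidate()
dataset.push_to_hub()
```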