Merged
15 changes: 9 additions & 6 deletions .github/workflows/release-prep.yml
@@ -30,12 +30,15 @@ jobs:
- name: bump package version
run: |
new_version=`echo ${{ steps.bumpr.outputs.next_version }} | sed 's/^v//'`
poetry version $new_version
git config --local user.email ""
git config --local user.name "github-actions[bot]"
git add pyproject.toml
git commit -m "bump version to $new_version"
git push origin HEAD:${{ github.head_ref}}
branch_version=`poetry version | awk '{print $2}'`
echo "Current version: $branch_version"
echo "New version: $new_version"
if [ "$branch_version" != "$new_version" ]; then
echo "Version is not up to date"
exit 1
else
echo "Version is already up to date"
fi
- name: Hotfix PR to dev
if: contains(github.head_ref, 'hotfix')
env:
5 changes: 3 additions & 2 deletions CONTRIBUTING.md
@@ -50,7 +50,8 @@ In this repo, we use github actions to manage package versioning and releases. T

All PRs from `dev` to `main` should represent a new release of the package. In order to ensure that, we use the github workflow at `.github/workflows/deploy.yml` to handle changing the semantic version number and to create a new tag/new release on github.

When you are ready to release a new version of the package, you should create a PR from `dev` to `main`. On a PR from `dev` to `main` you should label the PR with one of 3 labels (bump:major, bump:minor, or bump:patch). On labeling, a workflow will be triggered that will look to the repos existing tags and from those determine what the next semantic version should be. This workflow will also make a commit to `dev` bumping the version number accordingly.
When you are ready to release a new version of the package, you should create a PR from `dev` to `main`. On a PR from `dev` to `main` you should label the PR with one of 3 labels (bump:major, bump:minor, or bump:patch). On labeling, a workflow will be triggered that will look at the repo's existing tags and from those determine what the next semantic version should be. This workflow will ensure that the branch being merged in has the correct version number in the `pyproject.toml` file. In order to update the version number, a separate PR to `dev` (or the hotfix release branch) should be created updating the version number. The release workflow will fail until the version in the release branch matches the expected version number.
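As a sketch, the version-bump PR can be prepared with commands like the following (illustrative only; `1.2.4` is a placeholder for whatever version the release check reports as expected):

```shell
# On a branch off dev (or off the hotfix release branch), set the version
# the release workflow expects, commit it, and open a PR with the change.
poetry version 1.2.4        # writes the new version into pyproject.toml
git add pyproject.toml
git commit -m "bump version to 1.2.4"
git push origin HEAD        # then open the PR targeting dev
```

Once that PR is merged, the release check on the `dev` to `main` PR should pass on rerun.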

On merging that PR to `main`, the workflow will again be triggered, this time looking to see if the PR being merged was labeled, and creating a tag associated with the new version. The workflow will also use poetry to build a python wheel and create a release including the zipped source code and the built wheel.

@@ -60,7 +61,7 @@ If a hotfix is needed, you should create a branch from `main` with a name begini

### Automated versioning

The versioning is handled by the `haya14busa/action-bumpr` action. This action uses existing tags to determine the next version number. The action will look at the tags in the repo and determine the next version based on the highest tag. If the highest tag is `v1.2.3` then the next version will be `v1.2.4`. If the highest tag is `v1.2.3` and the PR is labeled with `bump:minor` then the next version will be `v1.3.0`. If the highest tag is `v1.2.3` and the PR is labeled with `bump:major` then the next version will be `v2.0.0`. If you make a mistake with your labeling, you can remove the label and apply the proper label and the workflow will rerun and update the code with the proper version number. If you don't want to update the version number, you can remove the label, put will have to make a manual commit to `dev` to update the version number to the previous version.
The versioning is handled by the `haya14busa/action-bumpr` action, which uses the repo's existing tags to determine the next version number. The action looks at the tags in the repo and determines the next version based on the highest tag. If the highest tag is `v1.2.3` and the PR is labeled with `bump:patch`, then the next version will be `v1.2.4`. If the highest tag is `v1.2.3` and the PR is labeled with `bump:minor`, then the next version will be `v1.3.0`. If the highest tag is `v1.2.3` and the PR is labeled with `bump:major`, then the next version will be `v2.0.0`. If you make a mistake with your labeling, you can remove the label and apply the proper label, and the workflow will rerun and update the check with the proper version number.

## Merging

83 changes: 83 additions & 0 deletions docs/dataset_splits.md
@@ -0,0 +1,83 @@
# Dataset Splitting

The `TatmDataset` class includes functionality for creating simple index-based train and validation splits
that can be used to separate the dataset into training and validation sets. As implemented, the functionality
allows users to specify either a number of indices or a percentage of the dataset to be used for validation.
The split is deterministic based on an item's index in the full dataset and will not change between runs or if the
same dataset is loaded multiple times.
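The split logic lives inside `TatmDataset`, but a minimal sketch of one deterministic, index-based scheme (a hypothetical helper, not the library's actual implementation) looks like:

```python
# Hypothetical sketch: assign the leading indices to "train" and the
# trailing indices to "val". Because the assignment depends only on an
# item's position in the full dataset, it is deterministic across runs.
def split_indices(n_items, val_split_size):
    """Return train/val index lists for a dataset of n_items.

    val_split_size may be a fraction (float in (0, 1)) or an absolute count.
    """
    if isinstance(val_split_size, float):
        n_val = int(n_items * val_split_size)
    else:
        n_val = int(val_split_size)
    n_train = n_items - n_val
    return {
        "train": list(range(n_train)),         # indices 0 .. n_train - 1
        "val": list(range(n_train, n_items)),  # the remaining tail
    }

splits = split_indices(1000, 0.1)
print(len(splits["train"]), len(splits["val"]))  # 900 100
```

Passing an integer (e.g. `150`) reserves that many tail indices instead of a percentage.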

## Example loading a dataset and splitting it into training and validation sets

```python
from tatm import get_dataset

dataset = get_dataset("my_data", context_length=512, val_split_size=0.1)
print(len(dataset))
# 1000

```

The preceding code loads the dataset and tells the dataset object to prepare to split itself into a training
and a validation set, where the validation set will be 10% of the full dataset. However, if we call `len` on the dataset
at this point, we will see that the dataset is still the full dataset.

If we want to use the training portion of the split, we have two possible approaches. The first is to call the `set_split`
method on the dataset object and pass the string `"train"` as the argument. The second is to pass `"train"` as the
`split` argument when initializing the dataset.

```python
dataset.set_split("train")
print(len(dataset))
# 900

train_dataset = get_dataset("my_data", context_length=512, val_split_size=0.1, split="train")
print(len(train_dataset))
# 900
```

We can also use the same approach to get the validation set.

```python
dataset.set_split("val")
print(len(dataset))
# 100

val_dataset = get_dataset("my_data", context_length=512, val_split_size=150, split="val") # we can also pass in a number of items to use for the validation set
print(len(val_dataset))
# 150
```

Note that we can use the `set_split` method to switch between the training and validation sets at any time. If we want to operate on the full dataset, we can call `set_split(None)` or pass `None` as the `split` argument when initializing the dataset. If we have loaded a dataset without defining a split size, we can still create a split by calling the `create_split` method and passing in the desired split size. This will create a new split based on the current dataset and the specified split size. Note that the split must be created (via `create_split`, or by passing a split size when initializing the dataset) before `set_split` or the `split` argument can select it.

```python
dataset = get_dataset("my_data", context_length=512)
print(len(dataset))
# 1000
dataset.create_split(0.1) # create a split with 10% of the dataset reserved for validation
print(len(dataset))
# 1000
dataset.set_split("train")
print(len(dataset))
# 900
dataset.set_split("val")
print(len(dataset))
# 100
dataset.set_split(None) # set the split to None to use the full dataset
print(len(dataset))
# 1000
```

Once the splits are created, the indices used to return items from the dataset are remapped to only return items from the active split. Note that
this also means that, in the case of the validation split, indices are remapped so that the first item in the validation split can be retrieved with `dataset[0]`.

```python
dataset = get_dataset("my_data", context_length=512, val_split_size=0.2)
dataset.set_split(None)
val_dataset = get_dataset("my_data", context_length=512, val_split_size=0.2, split="val")
print(dataset[800] == val_dataset[0])
# True
```

With these features in place, we can easily create training and validation sets for our dataset and use them in training and evaluation loops as drop-in replacements for the full dataset.
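For example, assuming PyTorch is used for training (the `DataLoader` usage below is a sketch under that assumption; `my_data`, the batch size, and the loop bodies are illustrative, not part of the tatm API shown above):

```python
from torch.utils.data import DataLoader

from tatm import get_dataset

# Build the two split views of the same underlying dataset.
train_dataset = get_dataset("my_data", context_length=512, val_split_size=0.1, split="train")
val_dataset = get_dataset("my_data", context_length=512, val_split_size=0.1, split="val")

# Because each split supports len() and integer indexing like the full
# dataset, it can be handed to DataLoader unchanged.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

for batch in train_loader:
    ...  # training step goes here

for batch in val_loader:
    ...  # evaluation step goes here
```

The only difference from using the full dataset is which indices each loader can see.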
1 change: 1 addition & 0 deletions docs/index.md
@@ -23,6 +23,7 @@ configuration.md
:caption: Examples:
text_dataset.md
metadata.md
dataset_splits.md
```

```{toctree}