diff --git a/3.test_cases/1.megatron-lm/.gitignore b/3.test_cases/1.megatron-lm/.gitignore index 502f0226..4ca01d8b 100644 --- a/3.test_cases/1.megatron-lm/.gitignore +++ b/3.test_cases/1.megatron-lm/.gitignore @@ -1,2 +1,2 @@ gpt2 -*.yaml +*.yaml \ No newline at end of file diff --git a/3.test_cases/1.megatron-lm/Makefile b/3.test_cases/1.megatron-lm/Makefile deleted file mode 100644 index 9a040dda..00000000 --- a/3.test_cases/1.megatron-lm/Makefile +++ /dev/null @@ -1,10 +0,0 @@ -all: build clean import - -build: - docker build -t megatron-training -f 0.distributed-training.Dockerfile . - -clean: - -rm megatron-training.sqsh - -import: - enroot import -o megatron-training.sqsh dockerd://megatron-training:latest \ No newline at end of file diff --git a/3.test_cases/1.megatron-lm/README.md b/3.test_cases/1.megatron-lm/README.md old mode 100644 new mode 100755 index cb4e5ca3..9fe90e3f --- a/3.test_cases/1.megatron-lm/README.md +++ b/3.test_cases/1.megatron-lm/README.md @@ -1,406 +1,15 @@ # MegatronLM Test Case -[MegatronLM](https://github.com/NVIDIA/Megatron-LM) is a framework from Nvidia that can be used to train LLMs. We recommend that you read papers on the framework to know the different knobs you can tune and in particular these articles: +[MegatronLM](https://github.com/NVIDIA/Megatron-LM) is a framework from Nvidia designed for training large language models (LLMs). We recommend reading the following papers to understand the various tuning options available: - [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) -- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/1909.08053) +- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) +- [Reducing Activation Recomputatio in Large Transformer Models](https://arxiv.org/pdf/2205.05198) -To run a test case you will go through a series of steps described below: +To run a test case, follow these steps: -1. Prepare your environment -2. Build a container, download, and pre-process the data -3. Train! - -We describe the steps below for Slurm and Kubernetes users. - -## 1. Preparation - -This guide assumes that you have the following: - -- A functional Slurm or EKS cluster on AWS. -- Docker, for Slurm [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) need to be installed as well. -- An FSx for Lustre filesystem mounted on `/fsx` in all Slurm nodes or a persistent volume claim that can be mounted on `/fsx` in pods running on EKS. An example of setting up FSx on EKS is available [here](https://github.com/aws-samples/aws-do-eks/tree/main/Container-Root/eks/deployment/csi/fsx). - -It is recommended that you use the templates in the architectures [directory](../../1.architectures) for Parallel Cluster - -You will also setup the following variables in your terminal environment. - -```bash -export DATA_PATH=/fsx # FSx for Lustre shared file-system -``` - -Make sure that your current directory is under a shared filesystem such as `/fsx/` or the home directory when using [Parallel Cluster](../../1.architectures/aws-parallelcluster). - -## 2. Data Preprocessing - -Before running training jobs you need to retrieve input data and preprocess it. 
This section of the guide you will retrieve a container then you convert it into a Squash file via [Enroot](https://github.com/NVIDIA/enroot), you will then retrieve input data ans tokenize it using the GPT2 vocabulary. - -Below are the steps you need to follow: - -1. Copy the file `0.distributed-training.Dockerfile` or its content to your head-node or any instance where you have the [Docker](https://docs.docker.com/get-docker/) cli available. -2. Build the container image with the command below - - ```bash - docker build -t megatron-training -f 0.distributed-training.Dockerfile . - ``` - -3. Once the image is built, you can check if it is present with `docker images`. You should see an output similar to this one: - - ```text - [ec2-user@ip-10-0-10-78 ~]$ docker images - REPOSITORY TAG IMAGE ID CREATED SIZE - megatron-training latest a33c9d5bcb6e 9 seconds ago 20.7GB - ``` - -4. Prepare the image for your target environment. - - If you are using SLURM - create the squash file with the command below. - - ```bash - enroot import -o megatron-training.sqsh dockerd://megatron-training:latest - ``` - - The file will be stored in the current directory (if left as default). The output should look as below. - - ```bash - [ec2-user@ip-10-0-10-78 ~]$ enroot import -o ./megatron-training.sqsh dockerd://megatron-training:latest - [INFO] Fetching image - - e19aa13505c1710876982dc440226dc479da5177dc4770452cc79bedc8b5b41d - - [INFO] Extracting image content... - [INFO] Creating squashfs filesystem... - - Parallel mksquashfs: Using 32 processors - Creating 4.0 filesystem on /home/ec2-user/megatron-training.sqsh, block size 131072. - [==========================================================/] 299550/299550 100% - - Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072 - uncompressed data, uncompressed metadata, uncompressed fragments, uncompressed xattrs - duplicates are not removed - ... - ``` - - If you are using EKS, tag and push the image to your container registry. - - ```bash - # Tag image - export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') - export ACCOUNT=$(aws sts get-caller-identity --query Account --output text) - export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/ - docker tag megatron-training:latest ${REGISTRY}megatron-training:latest - # Create repository if needed - REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"megatron-training\" | wc -l) - if [ "$REGISTRY_COUNT" == "0" ]; then - aws ecr create-repository --repository-name megatron-training - fi - # Login to registry - echo "Logging in to $REGISTRY ..." - aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY - # Push image to registry - docker image push ${REGISTRY}megatron-training:latest - ``` - -5. Run the code below to retrieve the input datasets and vocabulary. - - SLURM: - - ```bash - #!/bin/bash - mkdir -p gpt2 - cd gpt2/ - - wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz - wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json - wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt - xz -d oscar-1GB.jsonl.xz - ``` - - EKS: - - Run the following snippet to crete a job container that mounts the fsx volume - and downloads the data on it. 
- - ```bash - cat getdata-job.yaml-template | envsubst > getdata-job.yaml - kubectl apply -f ./getdata-job.yaml - ``` - - Monitor the job progress - - ```bash - kubectl logs -f $(kubectl get pods | grep getdata | cut -d ' ' -f 1) - ``` - - When status is `Completed`, delete the job pod: - - ```bash - kubectl delete -f ./getdata-job.yaml - ``` - -6. Preprocess the data - - SLURM: - - Copy the file `1.data-preprocessing.sbatch` or its content on your SLURM cluster then submit a preprocessing jobs with the command below: - - ```bash - sbatch 1.data-preprocessing.sbatch - ``` - - You will see a new file in your current working directory called `slurm-XY.out` where `XY` is a number. - This is your output file and will capture the `STDOUT` and `STDERR` from your job. - You can check how it progresses via the command `tail -f slurm-XY.out` but with the relevant filename. - The file content will be similar to the below: - - ```text - 0: Opening /fsx/oscar-1GB.jsonl - 0: Time to startup: 0.9956498146057129 - 0: Processed 1000 documents (101.28050670002645 docs/s, 1.258563987556778 MB/s). - 0: Processed 2000 documents (188.07992853480727 docs/s, 2.3571624257619614 MB/s). - ... - 0: Processed 78000 documents (1293.9967304914383 docs/s, 16.67556064420713 MB/s). - 0: Processed 79000 documents (1298.6715286585202 docs/s, 16.763634765830606 MB/s). - ``` - - EKS: - - Launch a job pod that preprocesses the data. - - ```bash - export DATA_PATH=/fsx/gpt2 - cat prepdata-job.yaml-template | envsubst > prepdata-job.yaml - kubectl apply -f ./prepdata-job.yaml - ``` - - Monitor the job progress. - - ```bash - kubectl logs -f $(kubectl get pods | grep prepdata | cut -d ' ' -f 1) - ``` - - When the job status is `Completed`, clean up the job pod. - - ```bash - kubectl delete -f ./prepdata-job.yaml - ``` - - Voilà! You have executed the preprocessing job. Next, you will go through the steps to run your training job. - -## 3. Distributed training - -Now that the data is preprocessed, we will pretrain a GPT3 model MegatronLM. - - SLURM: - - Copy the file `2.distributed-training.sbatch` to your cluster then submit a training jobs with the command below: - - - ```bash - sbatch 2.distributed-training.sbatch - ``` - - The training starts running and should produce an output similar to below if successful. 
- - ```text - 1: iteration 25/73242187 | consumed samples: 50 | elapsed time per iteration (ms): 87.0 | learning rate: 1.638E-08 | global batch size: 2 | lm loss: 1.086954E+01 | loss scale: 4294967296.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 | - 1: iteration 26/73242187 | consumed samples: 52 | elapsed time per iteration (ms): 86.5 | learning rate: 1.704E-08 | global batch size: 2 | lm loss: 1.086217E+01 | loss scale: 4294967296.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 | - 1: iteration 27/73242187 | consumed samples: 54 | elapsed time per iteration (ms): 88.4 | learning rate: 1.769E-08 | global batch size: 2 | lm loss: 1.087129E+01 | loss scale: 4294967296.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 | - ``` - - - EKS: - - Launch a PyTorchJob - - - ```bash - export DATA_PATH=/fsx - export NUM_NODES=1 - export INSTANCE_TYPE=p5.48xlarge - export IMAGE_URI=${REGISTRY}megatron-training:latest - export GPU_PER_NODE=8 - export EFA_PER_NODE=32 - export TENSOR_PARALLEL=8 - export PIPELINE_PARALLEL=1 - export NUM_LAYERS=36 - export HIDDEN_SIZE=4096 - export NUM_ATTENTION_HEADS=32 - export SEQ_LENGTH=2048 - export MAX_POSITION_EMBEDDINGS=2048 - export MICRO_BATCH_SIZE=1 - export GLOBAL_BATCH_SIZE=288 - cat pytorchjob.yaml-template | envsubst > pytorchjob.yaml - kubectl apply -f ./pytorchjob.yaml - ``` - - The training starts running: - - ```bash - kubectl get pods - ``` - - You should see one etcd and one worker pod. - - ```text - NAME READY STATUS RESTARTS AGE - etcd-7787559c74-wpcb9 1/1 Running 0 3m10s - megatron-worker-0 1/1 Running 0 3m10s - ``` - - Log lines describing the iterations show that the training is working properly. - - ```bash - kubectl logs -f megatron-worker-0 - ``` - - An abbreviated sample log is shown below: - - ```text - ... - using torch.float16 for parameters ... - ------------------------ arguments ------------------------ - accumulate_allreduce_grads_in_fp32 .............. False - adam_beta1 ...................................... 0.9 - adam_beta2 ...................................... 0.95 - ... - -------------------- end of arguments --------------------- - setting number of micro-batches to constant 288 - > building GPT2BPETokenizer tokenizer ... - > padded vocab (size: 50257) with 943 dummy tokens (new size: 51200) - > initializing torch distributed ... - > initialized tensor model parallel with size 8 - > initialized pipeline model parallel with size 1 - > setting random seeds to 1234 ... - > compiling dataset index builder ... - make: Entering directory '/workspace/Megatron-LM/megatron/core/datasets' - ... - time to initialize megatron (seconds): 15.424 - [after megatron is initialized] datetime: 2024-07-16 22:14:01 - building GPT model ... - > number of parameters on (tensor, pipeline) model parallel rank (4, 0): 941594624 - ... - > building train, validation, and test datasets ... - > datasets target sizes (minimum size): - train: 146484375 - validation: 5863680 - test: 11520 - ... 
- iteration 1/ 508626 | consumed samples: 288 | elapsed time per iteration (ms): 255940.5 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 4294967296.0 | number of skipped iterations: 1 | number of nan iterations: 0 | - iteration 2/ 508626 | consumed samples: 576 | elapsed time per iteration (ms): 243438.3 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 2147483648.0 | number of skipped iterations: 1 | number of nan iterations: 0 | - iteration 3/ 508626 | consumed samples: 864 | elapsed time per iteration (ms): 243344.4 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 1073741824.0 | number of skipped iterations: 1 | number of nan iterations: 0 | - ... - ``` - - You can stop the training job by executing: - - ```bash - kubectl delete -f ./pytorchjob.yaml - ``` - -## 4. What's next? - -The example is based on the GPT3 example from MegatronLM's [repository](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt.sh). You can modify `NUM_ATTENTION_HEADS`, `NUM_LAYERS`, and `HIDDEN_SIZE` based on the Table 1 (Page 8) of the document [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) to change the model size. You can also run the following commands to launch training for different model sizes before submitting a job as follows: `NUM_LAYERS=64 HIDDEN_SIZE=8192 NUM_ATTENTION_HEADS=48 sbatch 3.distributed-training.sbatch` - -| Model size | Parameters | -|------------|-----------------------------------------------------------| -| 1.7B | `NUM_ATTENTION_HEADS=24 HIDDEN_SIZE=2304 NUM_LAYERS=24` | -| 3.6B | `NUM_ATTENTION_HEADS=32 HIDDEN_SIZE=3072 NUM_LAYERS=30` | -| 7.5B | `NUM_ATTENTION_HEADS=32 HIDDEN_SIZE=4096 NUM_LAYERS=36` | -| 18.4B | `NUM_ATTENTION_HEADS=48 HIDDEN_SIZE=6144 NUM_LAYERS=40` | -| 39.1B | `NUM_ATTENTION_HEADS=64 HIDDEN_SIZE=8192 NUM_LAYERS=48` | -| 76.1B | `NUM_ATTENTION_HEADS=80 HIDDEN_SIZE=10240 NUM_LAYERS=60` | -| 145.6B | `NUM_ATTENTION_HEADS=96 HIDDEN_SIZE=12288 NUM_LAYERS=80` | -| 310.1B | `NUM_ATTENTION_HEADS=128 HIDDEN_SIZE=16384 NUM_LAYERS=96` | - -## 4. Appendix - -### 4.1. 
Benchmark mode - -To run in benchmark mode (i.e., train only, no validation and test), apply these changes to `2.distributed-training.sbatch` when calling `pretrain_gpt.py`: - -```diff -- --eval-iters 40 \ -- --eval-interval 1000 \ -- --split 98,2,0 \ -+ --eval-iters 0 \ -+ --split 100,0,0 \ -``` - -Incorrect settings will cause this error message to appear in the Slurm output: - -```text -Traceback (most recent call last): - File "/workspace/Megatron-LM/pretrain_gpt.py", line 198, in - pretrain(train_valid_test_datasets_provider, - File "/workspace/Megatron-LM/megatron/training.py", line 227, in pretrain - = build_train_valid_test_data_iterators( - File "/workspace/Megatron-LM/megatron/training.py", line 1283, in build_train_valid_test_data_iterators - build_train_valid_test_data_loaders( - File "/workspace/Megatron-LM/megatron/training.py", line 1244, in build_train_valid_test_data_loaders - train_ds, valid_ds, test_ds = build_train_valid_test_datasets( - File "/workspace/Megatron-LM/megatron/training.py", line 1214, in build_train_valid_test_datasets - return build_train_valid_test_datasets_provider(train_val_test_num_samples) - File "/workspace/Megatron-LM/pretrain_gpt.py", line 186, in train_valid_test_datasets_provider - ).build() - File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 56, in build - return self._build_blended_dataset_splits() - File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 76, in _build_blended_dataset_splits - return self._build_megatron_dataset_splits(blend[0], split, self.sizes) - File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 216, in _build_megatron_dataset_splits - self.build_generic_dataset( - File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 258, in build_generic_dataset - dataset = cls(*args) - File "/workspace/Megatron-LM/megatron/core/datasets/gpt_dataset.py", line 68, in __init__ - super().__init__(indexed_dataset, indexed_indices, num_samples, index_split, config) - File "/workspace/Megatron-LM/megatron/core/datasets/megatron_dataset.py", line 42, in __init__ - assert num_samples > 0 -AssertionError -``` - -### 4.2. Adjust training steps - -By default, the .sbatch scripts specify the number of samples, then the number of training steps equals to `--train_samples` / `--global-batch-size`. To directly specify the number of steps, apply these changes to `2.distributed-training.sbatch` when calling `pretrain_gpt.py`. Note that `samples` and `iters` are mutually exclusive. - -```diff -- --train-samples 146484375 \ -- --lr-decay-samples 126953125 \ -- --lr-warmup-samples 183105 \ -+ --train-iters 50 \ -+ --lr-decay-iters 45 \ -+ --lr-warmup-iters 2 \ -``` -======= - -Following the same pattern, you can train other models. Pretraining scripts for models like -Bert, ICT, and T5 are already included in the Megatron-LM container under `/workspace/Megatron-LM`. - -## 5. Appendix: Llama2 on Slurm - -To pretrain Llama2, you must visit to download the tokenizers files (i.e., `tokenizer.json` and `tokenizer.model`). Registration required. Alternatively, you may train your own tokenizer but this is beyond the scope for this document. Either way, once you have the tokenizer files, you need to upload them to the FSx Lustre that your Slurm cluster mounts. - -The remaining steps are similar to the GPT3 example. 
For more information, please refer to the official Megatron-LM documentation on Llama2 [here](https://github.com/NVIDIA/Megatron-LM/blob/main/docs/llama2.md). - -### 5.1. Download and prepocess data - -```bash -mkdir -p llama2 -# Then, place `tokenizer.json` and `tokenizer.model` to this `llama2/` directory. - -# Download sample dataset -wget -P llama2 https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz -xz -d llama2/oscar-1GB.jsonl.xz - -sbatch 3.data-preproc-llama2.sbatch -``` - -### 5.2. Run pretraining job - -Edit `4.pre-train-llama2.sbatch` to choose the model size you want to train. Do this by commenting and uncommenting the related stanzas. Feel free to experiment with the hyperparameters such as parallelism, batches, etc. (for more details, please refer to the [Megatron-LM project](https://github.com/NVIDIA/Megatron-LM/) and the Megatron papers ([Shoeybi20](https://arxiv.org/abs/1909.08053), [Narayanan21](https://arxiv.org/abs/2104.04473)). - -```bash -sbatch 4.pre-train-llama2.sbatch -``` - -Tips: the Llama2 example prints the estimated FLOPS/GPU (enabled via `--log-throughput` in the pretrain `.sbatch` file). You might want to look at [PR-682](https://github.com/NVIDIA/Megatron-LM/pull/682) and decide whether to patch your Megatron-LM to adjust the way FLOPS/GPU is calculated. +1. Prepare your environment. +2. Build a container, download, and preprocess the data. +3. Train the model. +We provide guidance for both Slurm and Kubernetes users. For detailed instructions, refer to the [slurm](./slurm) or [kubernetes](./kubernetes) subdirectories. \ No newline at end of file diff --git a/3.test_cases/1.megatron-lm/0.distributed-training.Dockerfile b/3.test_cases/1.megatron-lm/aws-megatron-lm.Dockerfile old mode 100644 new mode 100755 similarity index 92% rename from 3.test_cases/1.megatron-lm/0.distributed-training.Dockerfile rename to 3.test_cases/1.megatron-lm/aws-megatron-lm.Dockerfile index f63165b4..97929167 --- a/3.test_cases/1.megatron-lm/0.distributed-training.Dockerfile +++ b/3.test_cases/1.megatron-lm/aws-megatron-lm.Dockerfile @@ -1,13 +1,15 @@ # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 
# SPDX-License-Identifier: MIT-0 -FROM nvcr.io/nvidia/pytorch:24.08-py3 +FROM nvcr.io/nvidia/pytorch:25.01-py3 ARG GDRCOPY_VERSION=v2.4.1 -ARG EFA_INSTALLER_VERSION=1.34.0 -ARG AWS_OFI_NCCL_VERSION=v1.11.0-aws -ARG TRANSFORMERS_VERSION=4.44.2 -ARG MEGATRON_LM_VERSION=core_r0.8.0 +ARG EFA_INSTALLER_VERSION=1.37.0 +ARG AWS_OFI_NCCL_VERSION=v1.13.2-aws +ARG NCCL_VERSION=v2.23.4-1 +ARG NCCL_TESTS_VERSION=v2.13.10 +ARG MEGATRON_LM_VERSION=core_r0.10.0 +ARG TRANSFORMERS_VERSION=4.48.1 ARG OPEN_MPI_PATH=/opt/amazon/openmpi @@ -109,7 +111,7 @@ RUN rm -rf /var/lib/apt/lists/* RUN echo "hwloc_base_binding_policy = none" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf \ && echo "rmaps_base_mapping_policy = slot" >> /opt/amazon/openmpi/etc/openmpi-mca-params.conf -RUN pip3 install awscli pynvml +RUN pip3 install awscli pynvml wandb RUN mv $OPEN_MPI_PATH/bin/mpirun $OPEN_MPI_PATH/bin/mpirun.real \ && echo '#!/bin/bash' > $OPEN_MPI_PATH/bin/mpirun \ @@ -125,10 +127,11 @@ RUN pip install transformers==${TRANSFORMERS_VERSION} sentencepiece python-etcd # Install megatron-lm ##################### RUN pip install -U setuptools==75.1.0 +RUN apt-get remove -y python3-blinker # https://github.com/triton-inference-server/server/issues/7243 RUN cd /workspace && git clone --depth 1 --branch ${MEGATRON_LM_VERSION} https://github.com/NVIDIA/Megatron-LM.git \ && cd Megatron-LM \ && python3 -m pip install nltk \ - && python -m pip install . + && python3 -m pip install . ## Set Open MPI variables to exclude network interface and conduit. ENV OMPI_MCA_pml=^cm,ucx \ diff --git a/3.test_cases/1.megatron-lm/kubernetes/README.md b/3.test_cases/1.megatron-lm/kubernetes/README.md new file mode 100755 index 00000000..9e3a4665 --- /dev/null +++ b/3.test_cases/1.megatron-lm/kubernetes/README.md @@ -0,0 +1,40 @@ +# Running Megatron-LM on Kubernetes + +This directory contains Kubernetes-specific instructions and templates for setting up and running MegatronLM on an EKS cluster. + +## 1. Preparation + +Ensure you have the following prerequisites: + +- A functional EKS cluster on AWS. +- Docker installed for building the container image. +- An FSx for Lustre filesystem mounted on `/fsx` in all nodes or a persistent volume claim that can be mounted on `/fsx` in pods running on EKS. An example of setting up FSx on EKS is available [here](https://github.com/aws-samples/aws-do-eks/tree/main/Container-Root/eks/deployment/csi/fsx). + +Set up the following environment variables in your terminal: + +```bash +export DATA_PATH=/fsx # FSx for Lustre shared file-system +``` + + +### 2. Building the Container + +1. Copy the `aws-megatron-lm.Dockerfile` (located at the root of this test case) to your local machine. + +2. Build the container image: + +```bash +docker build -t megatron-training -f aws-megatron-lm.Dockerfile . +``` + +3. Tag and push the image to your container registry (if the ECR repository does not exist yet, or you are not logged in, see the sketch after this section): + +```bash +export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') +export ACCOUNT=$(aws sts get-caller-identity --query Account --output text) +export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/ +docker tag megatron-training:latest ${REGISTRY}megatron-training:latest +docker push ${REGISTRY}megatron-training:latest +``` + +Now you are all set for distributed training with Megatron-LM on EKS! Proceed to the subdirectories for detailed instructions on training different models.
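The `docker push` above assumes that the target ECR repository already exists and that your Docker client is authenticated against the registry. If that is not the case, the sketch below (mirroring the steps from the previous, single-page version of this guide) creates the repository when missing and logs in before pushing; adjust the repository name if you use something other than `megatron-training`:

```bash
# Create the ECR repository only if it does not exist yet.
if ! aws ecr describe-repositories --repository-names megatron-training >/dev/null 2>&1; then
    aws ecr create-repository --repository-name megatron-training
fi

# Authenticate the local Docker client against the ECR registry.
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push the image tagged earlier.
docker image push ${REGISTRY}megatron-training:latest
```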
\ No newline at end of file diff --git a/3.test_cases/1.megatron-lm/kubernetes/gpt3/README.md b/3.test_cases/1.megatron-lm/kubernetes/gpt3/README.md new file mode 100644 index 00000000..c7a0d745 --- /dev/null +++ b/3.test_cases/1.megatron-lm/kubernetes/gpt3/README.md @@ -0,0 +1,229 @@ +# GPT Model Training on EKS with MegatronLM + +This directory contains Kubernetes-specific instructions and templates for setting up and running GPT model training using MegatronLM on an EKS cluster. + +## 1. Preparation + +Ensure you have the following prerequisites: + +- A functional EKS cluster on AWS. +- An FSx for Lustre filesystem mounted on `/fsx` in all nodes or a persistent volume claim that can be mounted on `/fsx` in pods running on EKS. An example of setting up FSx on EKS is available [here](https://github.com/aws-samples/awsome-distributed-training/tree/main/Container-Root/eks/deployment/csi/fsx). + +Set up the following environment variables in your terminal: + +```bash +export DATA_PATH=/fsx # FSx for Lustre shared file-system +``` + +### 2. Data Preprocessing + +1. Run the following snippet to crete a job container that mounts the fsx volume and downloads the input datasets and vocabulary on it: + + ```bash + cat getdata-job.yaml-template | envsubst > getdata-job.yaml + kubectl apply -f ./getdata-job.yaml + ``` + + Monitor the job progress: + + ```bash + kubectl logs -f $(kubectl get pods | grep getdata | cut -d ' ' -f 1) + ``` + + When status is `Completed`, delete the job pod: + + ```bash + kubectl delete -f ./getdata-job.yaml + ``` + + +2. Preprocess the data + + Launch a job pod that preprocesses the data. + + ```bash + export DATA_PATH=/fsx/gpt2 + cat prepdata-job.yaml-template | envsubst > prepdata-job.yaml + kubectl apply -f ./prepdata-job.yaml + ``` + + Monitor the job progress. + + ```bash + kubectl logs -f $(kubectl get pods | grep prepdata | cut -d ' ' -f 1) + ``` + + When the job status is `Completed`, cleanup the job pod. + + ```bash + kubectl delete -f ./prepdata-job.yaml + ``` + + Voilà! You have executed the preprocessing job. Next, you will go through the steps to run your training job. + +### 3. Distributed training + +Now that the data is preprocessed, we will pretrain a GPT3 model MegatronLM. Launch a PyTorchJob with the environment variables: + +```bash +export DATA_PATH=/fsx +export NUM_NODES=1 +export INSTANCE_TYPE=p5.48xlarge +export IMAGE_URI=${REGISTRY}megatron-training:latest +export GPU_PER_NODE=8 +export EFA_PER_NODE=32 +export TENSOR_PARALLEL=8 +export PIPELINE_PARALLEL=1 +export NUM_LAYERS=36 +export HIDDEN_SIZE=4096 +export NUM_ATTENTION_HEADS=32 +export SEQ_LENGTH=2048 +export MAX_POSITION_EMBEDDINGS=2048 +export MICRO_BATCH_SIZE=1 +export GLOBAL_BATCH_SIZE=288 +cat pytorchjob.yaml-template | envsubst > pytorchjob.yaml +kubectl apply -f ./pytorchjob.yaml +``` + +The training starts running: + +```bash +kubectl get pods +``` + +You should see one etcd and one worker pod. + +```bash +NAME READY STATUS RESTARTS AGE +etcd-7787559c74-wpcb9 1/1 Running 0 3m10s +megatron-worker-0 1/1 Running 0 3m10s +``` + +Log lines describing the iterations show that the training is working properly. + +```bash +kubectl logs -f megatron-worker-0 +``` + +An abbreviated sample log is shown below: + + An abbreviated sample log is shown below: + + ```text + ... + using torch.float16 for parameters ... + ------------------------ arguments ------------------------ + accumulate_allreduce_grads_in_fp32 .............. False + adam_beta1 ...................................... 
0.9 + adam_beta2 ...................................... 0.95 + ... + -------------------- end of arguments --------------------- + setting number of micro-batches to constant 288 + > building GPT2BPETokenizer tokenizer ... + > padded vocab (size: 50257) with 943 dummy tokens (new size: 51200) + > initializing torch distributed ... + > initialized tensor model parallel with size 8 + > initialized pipeline model parallel with size 1 + > setting random seeds to 1234 ... + > compiling dataset index builder ... + make: Entering directory '/workspace/Megatron-LM/megatron/core/datasets' + ... + time to initialize megatron (seconds): 15.424 + [after megatron is initialized] datetime: 2024-07-16 22:14:01 + building GPT model ... + > number of parameters on (tensor, pipeline) model parallel rank (4, 0): 941594624 + ... + > building train, validation, and test datasets ... + > datasets target sizes (minimum size): + train: 146484375 + validation: 5863680 + test: 11520 + ... + iteration 1/ 508626 | consumed samples: 288 | elapsed time per iteration (ms): 255940.5 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 4294967296.0 | number of skipped iterations: 1 | number of nan iterations: 0 | + iteration 2/ 508626 | consumed samples: 576 | elapsed time per iteration (ms): 243438.3 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 2147483648.0 | number of skipped iterations: 1 | number of nan iterations: 0 | + iteration 3/ 508626 | consumed samples: 864 | elapsed time per iteration (ms): 243344.4 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 1073741824.0 | number of skipped iterations: 1 | number of nan iterations: 0 | + ... + ``` + + You can stop the training job by executing: + + ```bash + kubectl delete -f ./pytorchjob.yaml + ``` + +## 4. What's next? + +The example is based on the GPT3 example from MegatronLM's [repository](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt.sh). You can modify `NUM_ATTENTION_HEADS`, `NUM_LAYERS`, and `HIDDEN_SIZE` based on the Table 1 (Page 8) of the document [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) to change the model size. You can also run the following commands to launch training for different model sizes before submitting a job as follows: `NUM_LAYERS=64 HIDDEN_SIZE=8192 NUM_ATTENTION_HEADS=48 sbatch 3.distributed-training.sbatch` + +| Model size | Parameters | +|------------|-----------------------------------------------------------| +| 1.7B | `NUM_ATTENTION_HEADS=24 HIDDEN_SIZE=2304 NUM_LAYERS=24` | +| 3.6B | `NUM_ATTENTION_HEADS=32 HIDDEN_SIZE=3072 NUM_LAYERS=30` | +| 7.5B | `NUM_ATTENTION_HEADS=32 HIDDEN_SIZE=4096 NUM_LAYERS=36` | +| 18.4B | `NUM_ATTENTION_HEADS=48 HIDDEN_SIZE=6144 NUM_LAYERS=40` | +| 39.1B | `NUM_ATTENTION_HEADS=64 HIDDEN_SIZE=8192 NUM_LAYERS=48` | +| 76.1B | `NUM_ATTENTION_HEADS=80 HIDDEN_SIZE=10240 NUM_LAYERS=60` | +| 145.6B | `NUM_ATTENTION_HEADS=96 HIDDEN_SIZE=12288 NUM_LAYERS=80` | +| 310.1B | `NUM_ATTENTION_HEADS=128 HIDDEN_SIZE=16384 NUM_LAYERS=96` | + +## 4. Appendix + +### 4.1. 
Benchmark mode + +To run in benchmark mode (i.e., train only, no validation and test), apply these changes to `2.distributed-training.sbatch` when calling `pretrain_gpt.py`: + +```diff +- --eval-iters 40 \ +- --eval-interval 1000 \ +- --split 98,2,0 \ ++ --eval-iters 0 \ ++ --split 100,0,0 \ +``` + +Incorrect settings will cause this error message to appear in the Slurm output: + +```text +Traceback (most recent call last): + File "/workspace/Megatron-LM/pretrain_gpt.py", line 198, in + pretrain(train_valid_test_datasets_provider, + File "/workspace/Megatron-LM/megatron/training.py", line 227, in pretrain + = build_train_valid_test_data_iterators( + File "/workspace/Megatron-LM/megatron/training.py", line 1283, in build_train_valid_test_data_iterators + build_train_valid_test_data_loaders( + File "/workspace/Megatron-LM/megatron/training.py", line 1244, in build_train_valid_test_data_loaders + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + File "/workspace/Megatron-LM/megatron/training.py", line 1214, in build_train_valid_test_datasets + return build_train_valid_test_datasets_provider(train_val_test_num_samples) + File "/workspace/Megatron-LM/pretrain_gpt.py", line 186, in train_valid_test_datasets_provider + ).build() + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 56, in build + return self._build_blended_dataset_splits() + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 76, in _build_blended_dataset_splits + return self._build_megatron_dataset_splits(blend[0], split, self.sizes) + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 216, in _build_megatron_dataset_splits + self.build_generic_dataset( + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 258, in build_generic_dataset + dataset = cls(*args) + File "/workspace/Megatron-LM/megatron/core/datasets/gpt_dataset.py", line 68, in __init__ + super().__init__(indexed_dataset, indexed_indices, num_samples, index_split, config) + File "/workspace/Megatron-LM/megatron/core/datasets/megatron_dataset.py", line 42, in __init__ + assert num_samples > 0 +AssertionError +``` + +### 4.2. Adjust training steps + +By default, the .sbatch scripts specify the number of samples, then the number of training steps equals to `--train_samples` / `--global-batch-size`. To directly specify the number of steps, apply these changes to `2.distributed-training.sbatch` when calling `pretrain_gpt.py`. Note that `samples` and `iters` are mutually exclusive. + +```diff +- --train-samples 146484375 \ +- --lr-decay-samples 126953125 \ +- --lr-warmup-samples 183105 \ ++ --train-iters 50 \ ++ --lr-decay-iters 45 \ ++ --lr-warmup-iters 2 \ +``` +======= + +Following the same pattern, you can train other models. Pretraining scripts for models like +Bert, ICT, and T5 are already included in the Megatron-LM container under `/workspace/Megatron-LM`. 
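Note that on EKS, the model sizes from the table above are selected through the same environment-variable and `envsubst` workflow used in section 3 (the `sbatch` command mentioned above applies to the Slurm variant of this test case). Below is a minimal sketch for the 18.4B configuration; all other variables keep the values exported earlier:

```bash
# Stop any running job first (see above), then pick the 18.4B configuration from the table.
kubectl delete -f ./pytorchjob.yaml
export NUM_ATTENTION_HEADS=48
export HIDDEN_SIZE=6144
export NUM_LAYERS=40

# Re-render the manifest and submit a new PyTorchJob.
cat pytorchjob.yaml-template | envsubst > pytorchjob.yaml
kubectl apply -f ./pytorchjob.yaml
```

As a sanity check on the sample log above: with `--train-samples 146484375` and `GLOBAL_BATCH_SIZE=288`, training runs for 146484375 / 288 ≈ 508,626 iterations, which matches the `iteration x/ 508626` counter in the log.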
diff --git a/3.test_cases/1.megatron-lm/getdata-job.yaml-template b/3.test_cases/1.megatron-lm/kubernetes/gpt3/getdata-job.yaml-template old mode 100644 new mode 100755 similarity index 100% rename from 3.test_cases/1.megatron-lm/getdata-job.yaml-template rename to 3.test_cases/1.megatron-lm/kubernetes/gpt3/getdata-job.yaml-template diff --git a/3.test_cases/1.megatron-lm/prepdata-job.yaml-template b/3.test_cases/1.megatron-lm/kubernetes/gpt3/prepdata-job.yaml-template old mode 100644 new mode 100755 similarity index 100% rename from 3.test_cases/1.megatron-lm/prepdata-job.yaml-template rename to 3.test_cases/1.megatron-lm/kubernetes/gpt3/prepdata-job.yaml-template diff --git a/3.test_cases/1.megatron-lm/pytorchjob.yaml-template b/3.test_cases/1.megatron-lm/kubernetes/gpt3/pytorchjob.yaml-template old mode 100644 new mode 100755 similarity index 100% rename from 3.test_cases/1.megatron-lm/pytorchjob.yaml-template rename to 3.test_cases/1.megatron-lm/kubernetes/gpt3/pytorchjob.yaml-template diff --git a/3.test_cases/1.megatron-lm/slurm/Makefile b/3.test_cases/1.megatron-lm/slurm/Makefile new file mode 100755 index 00000000..96ba3ab5 --- /dev/null +++ b/3.test_cases/1.megatron-lm/slurm/Makefile @@ -0,0 +1,12 @@ +all: build clean import + +IMAGE=aws-megatron-lm + +build: + docker build -t ${IMAGE} -f ${IMAGE}.Dockerfile . + +clean: + -rm ${IMAGE}.sqsh + +import: + enroot import -o ${IMAGE}.sqsh dockerd://${IMAGE}:latest \ No newline at end of file diff --git a/3.test_cases/1.megatron-lm/slurm/README.md b/3.test_cases/1.megatron-lm/slurm/README.md new file mode 100755 index 00000000..d25a9c78 --- /dev/null +++ b/3.test_cases/1.megatron-lm/slurm/README.md @@ -0,0 +1,80 @@ +# Running Megatron-LM on Slurm + +This directory contains scripts and instructions for setting up the Megatron-LM training environment on a Slurm cluster. For detailed instructions on running distributed training jobs with this environment, please refer to the subdirectories. + + +## 1. Preparation + +This guide assumes that you have the following: + +- A functional Slurm cluster on AWS. +- Docker; for Slurm, [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) need to be installed as well. +- An FSx for Lustre filesystem mounted on `/fsx` in all Slurm nodes. + +It is recommended that you use the templates for [AWS Parallel Cluster](../../../1.architectures/2.aws-parallelcluster/) or [Amazon SageMaker HyperPod Slurm](../../../1.architectures/5.sagemaker-hyperpod) to set up your cluster. + +You will also set up the following variables in your terminal environment. + +```bash +export DATA_PATH=/fsx/data # FSx for Lustre shared file-system +``` + +The following instructions assume you have cloned this repository under such a shared filesystem and changed your current directory to this directory. + + +## 2. Environment Setup + +This section of the guide describes how to build a Megatron-LM container and then convert it into a squash file via [Enroot](https://github.com/NVIDIA/enroot). + +Below are the steps you need to follow: + +1. Copy the file `aws-megatron-lm.Dockerfile` or its content to your head node or any instance where you have the [Docker](https://docs.docker.com/get-docker/) CLI available. +2. Build the container image with the command below: + + ```bash + docker build -t aws-megatron-lm -f aws-megatron-lm.Dockerfile . + ``` + +3. Once the image is built, you can check if it is present with `docker images`.
You should see an output similar to this one: + + ```text + [ubuntu@ip-10-0-10-78 ~]$ docker images + REPOSITORY TAG IMAGE ID CREATED SIZE + aws-megatron-lm latest a33c9d5bcb6e 9 seconds ago 20.7GB + ``` + +4. Prepare the image for your target environment. + + Create the squash file with the command below. + + ```bash + enroot import -o aws-megatron-lm.sqsh dockerd://aws-megatron-lm:latest + ``` + + The file will be stored in the current directory (if left as default). The output should look like the one below. + + ```bash + [ubuntu@ip-10-0-10-78 ~]$ enroot import -o ./aws-megatron-lm.sqsh dockerd://aws-megatron-lm:latest + [INFO] Fetching image + + e19aa13505c1710876982dc440226dc479da5177dc4770452cc79bedc8b5b41d + + [INFO] Extracting image content... + [INFO] Creating squashfs filesystem... + + Parallel mksquashfs: Using 32 processors + Creating 4.0 filesystem on /home/ubuntu/aws-megatron-lm.sqsh, block size 131072. + [==========================================================/] 299550/299550 100% + + Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072 + uncompressed data, uncompressed metadata, uncompressed fragments, uncompressed xattrs + duplicates are not removed + ... + ``` + +## 3. Next Steps + +Now that you have built the Megatron-LM container and created the squash file, you can scale your training jobs with this container on your Slurm cluster. The subdirectories provide detailed end-to-end instructions for different models. + + + diff --git a/3.test_cases/1.megatron-lm/1.data-preprocessing.sbatch b/3.test_cases/1.megatron-lm/slurm/gpt3/1.data-preprocessing.sbatch old mode 100644 new mode 100755 similarity index 100% rename from 3.test_cases/1.megatron-lm/1.data-preprocessing.sbatch rename to 3.test_cases/1.megatron-lm/slurm/gpt3/1.data-preprocessing.sbatch diff --git a/3.test_cases/1.megatron-lm/2.distributed-training.sbatch b/3.test_cases/1.megatron-lm/slurm/gpt3/2.distributed-training.sbatch old mode 100644 new mode 100755 similarity index 94% rename from 3.test_cases/1.megatron-lm/2.distributed-training.sbatch rename to 3.test_cases/1.megatron-lm/slurm/gpt3/2.distributed-training.sbatch index ed903fef..fbaaf20f --- a/3.test_cases/1.megatron-lm/2.distributed-training.sbatch +++ b/3.test_cases/1.megatron-lm/slurm/gpt3/2.distributed-training.sbatch @@ -33,7 +33,7 @@ set -ex; # default variables for Enroot : "${IMAGE:=$(pwd)/megatron-training.sqsh}" : "${DATA_PATH:=/fsx}" -: "${FSX_MOUNT:=$(pwd):$DATA_PATH}" +: "${FSX_MOUNT:=/fsx:/fsx}" ########################### ## Environment Variables ## @@ -108,5 +108,9 @@ srun ${AUTO_RESUME} -l "${ARGS[@]}" python -m torch.distributed.run "${TORCHRUN_ --adam-beta1 0.9 \ --adam-beta2 0.95 \ --init-method-std 0.006 \ + --wandb-name "gpt" \ + --wandb-project "megatron-lm" \ + --wandb-entity "gpt-entity" \ --fp16 \ - --recompute-activations + --recompute-activations \ + --tp-comm-overlap diff --git a/3.test_cases/1.megatron-lm/slurm/gpt3/README.md b/3.test_cases/1.megatron-lm/slurm/gpt3/README.md new file mode 100644 index 00000000..0784419e --- /dev/null +++ b/3.test_cases/1.megatron-lm/slurm/gpt3/README.md @@ -0,0 +1,192 @@ +# Megatron GPT Pretraining on Slurm + +## 1. Preparation + +Make sure to complete all the preparation steps in the [Slurm environment setup](../README.md) before proceeding. + +Also, set up the following variables in your terminal environment. Note that this path has to be on a shared file system. + +```bash +export DATA_PATH=/fsx/data # FSx for Lustre shared file-system +``` + +## 2. 
Data retrieval and preprocessing + +1. Run the code below to retrieve the input datasets and vocabulary. + + ```bash + #!/bin/bash + mkdir -p gpt2 + cd gpt2/ + + wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json + wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt + xz -d oscar-1GB.jsonl.xz + ``` + +2. Preprocess the data + + Copy the file `1.data-preprocessing.sbatch` or its content on your SLURM cluster then submit a preprocessing jobs with the command below: + + ```bash + sbatch 1.data-preprocessing.sbatch + ``` + + You will see a new file in your current working directory called `slurm-XY.out` where `XY` is a number. + This is your output file and will capture the `STDOUT` and `STDERR` from your job. + You can check how it progresses via the command `tail -f slurm-XY.out` but with the relevant filename. + The file content will be similar to the below: + + ```text + 0: Opening /fsx/oscar-1GB.jsonl + 0: Time to startup: 0.9956498146057129 + 0: Processed 1000 documents (101.28050670002645 docs/s, 1.258563987556778 MB/s). + 0: Processed 2000 documents (188.07992853480727 docs/s, 2.3571624257619614 MB/s). + ... + 0: Processed 78000 documents (1293.9967304914383 docs/s, 16.67556064420713 MB/s). + 0: Processed 79000 documents (1298.6715286585202 docs/s, 16.763634765830606 MB/s). + ``` + + Voilà! You have executed the preprocessing job. Next, you will go through the steps to run your training job. + +## 3. Distributed training + +Now that the data is preprocessed, we will pretrain a GPT3 model MegatronLM. + + Copy the file `2.distributed-training.sbatch` to your cluster then submit a training jobs with the command below: + + + ```bash + sbatch 2.distributed-training.sbatch + ``` + + The training starts running and should produce an output similar to below if successful. + + ```text + 1: iteration 25/73242187 | consumed samples: 50 | elapsed time per iteration (ms): 87.0 | learning rate: 1.638E-08 | global batch size: 2 | lm loss: 1.086954E+01 | loss scale: 4294967296.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 | + 1: iteration 26/73242187 | consumed samples: 52 | elapsed time per iteration (ms): 86.5 | learning rate: 1.704E-08 | global batch size: 2 | lm loss: 1.086217E+01 | loss scale: 4294967296.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 | + 1: iteration 27/73242187 | consumed samples: 54 | elapsed time per iteration (ms): 88.4 | learning rate: 1.769E-08 | global batch size: 2 | lm loss: 1.087129E+01 | loss scale: 4294967296.0 | grad norm: 0.000 | number of skipped iterations: 0 | number of nan iterations: 0 | + ``` + + + An abbreviated sample log is shown below: + + ```text + ... + using torch.float16 for parameters ... + ------------------------ arguments ------------------------ + accumulate_allreduce_grads_in_fp32 .............. False + adam_beta1 ...................................... 0.9 + adam_beta2 ...................................... 0.95 + ... + -------------------- end of arguments --------------------- + setting number of micro-batches to constant 288 + > building GPT2BPETokenizer tokenizer ... + > padded vocab (size: 50257) with 943 dummy tokens (new size: 51200) + > initializing torch distributed ... + > initialized tensor model parallel with size 8 + > initialized pipeline model parallel with size 1 + > setting random seeds to 1234 ... 
+ > compiling dataset index builder ... + make: Entering directory '/workspace/Megatron-LM/megatron/core/datasets' + ... + time to initialize megatron (seconds): 15.424 + [after megatron is initialized] datetime: 2024-07-16 22:14:01 + building GPT model ... + > number of parameters on (tensor, pipeline) model parallel rank (4, 0): 941594624 + ... + > building train, validation, and test datasets ... + > datasets target sizes (minimum size): + train: 146484375 + validation: 5863680 + test: 11520 + ... + iteration 1/ 508626 | consumed samples: 288 | elapsed time per iteration (ms): 255940.5 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 4294967296.0 | number of skipped iterations: 1 | number of nan iterations: 0 | + iteration 2/ 508626 | consumed samples: 576 | elapsed time per iteration (ms): 243438.3 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 2147483648.0 | number of skipped iterations: 1 | number of nan iterations: 0 | + iteration 3/ 508626 | consumed samples: 864 | elapsed time per iteration (ms): 243344.4 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 1073741824.0 | number of skipped iterations: 1 | number of nan iterations: 0 | + ... + ``` + + You can stop the training job by cancelling it with `scancel`: + + ```bash + scancel <SLURM_JOB_ID> + ``` + +## 4. What's next? + +The example is based on the GPT3 example from MegatronLM's [repository](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt.sh). You can modify `NUM_ATTENTION_HEADS`, `NUM_LAYERS`, and `HIDDEN_SIZE` based on Table 1 (page 8) of the paper [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) to change the model size. You can also override these variables on the command line when submitting a job, for example: `NUM_ATTENTION_HEADS=64 HIDDEN_SIZE=8192 NUM_LAYERS=48 sbatch 2.distributed-training.sbatch` + +| Model size | Parameters | +|------------|-----------------------------------------------------------| +| 1.7B | `NUM_ATTENTION_HEADS=24 HIDDEN_SIZE=2304 NUM_LAYERS=24` | +| 3.6B | `NUM_ATTENTION_HEADS=32 HIDDEN_SIZE=3072 NUM_LAYERS=30` | +| 7.5B | `NUM_ATTENTION_HEADS=32 HIDDEN_SIZE=4096 NUM_LAYERS=36` | +| 18.4B | `NUM_ATTENTION_HEADS=48 HIDDEN_SIZE=6144 NUM_LAYERS=40` | +| 39.1B | `NUM_ATTENTION_HEADS=64 HIDDEN_SIZE=8192 NUM_LAYERS=48` | +| 76.1B | `NUM_ATTENTION_HEADS=80 HIDDEN_SIZE=10240 NUM_LAYERS=60` | +| 145.6B | `NUM_ATTENTION_HEADS=96 HIDDEN_SIZE=12288 NUM_LAYERS=80` | +| 310.1B | `NUM_ATTENTION_HEADS=128 HIDDEN_SIZE=16384 NUM_LAYERS=96` | + +## 4. Appendix + +### 4.1. 
Benchmark mode + +To run in benchmark mode (i.e., train only, no validation and test), apply these changes to `2.distributed-training.sbatch` when calling `pretrain_gpt.py`: + +```diff +- --eval-iters 40 \ +- --eval-interval 1000 \ +- --split 98,2,0 \ ++ --eval-iters 0 \ ++ --split 100,0,0 \ +``` + +Incorrect settings will cause this error message to appear in the Slurm output: + +```text +Traceback (most recent call last): + File "/workspace/Megatron-LM/pretrain_gpt.py", line 198, in + pretrain(train_valid_test_datasets_provider, + File "/workspace/Megatron-LM/megatron/training.py", line 227, in pretrain + = build_train_valid_test_data_iterators( + File "/workspace/Megatron-LM/megatron/training.py", line 1283, in build_train_valid_test_data_iterators + build_train_valid_test_data_loaders( + File "/workspace/Megatron-LM/megatron/training.py", line 1244, in build_train_valid_test_data_loaders + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + File "/workspace/Megatron-LM/megatron/training.py", line 1214, in build_train_valid_test_datasets + return build_train_valid_test_datasets_provider(train_val_test_num_samples) + File "/workspace/Megatron-LM/pretrain_gpt.py", line 186, in train_valid_test_datasets_provider + ).build() + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 56, in build + return self._build_blended_dataset_splits() + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 76, in _build_blended_dataset_splits + return self._build_megatron_dataset_splits(blend[0], split, self.sizes) + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 216, in _build_megatron_dataset_splits + self.build_generic_dataset( + File "/workspace/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_builder.py", line 258, in build_generic_dataset + dataset = cls(*args) + File "/workspace/Megatron-LM/megatron/core/datasets/gpt_dataset.py", line 68, in __init__ + super().__init__(indexed_dataset, indexed_indices, num_samples, index_split, config) + File "/workspace/Megatron-LM/megatron/core/datasets/megatron_dataset.py", line 42, in __init__ + assert num_samples > 0 +AssertionError +``` + +### 4.2. Adjust training steps + +By default, the .sbatch scripts specify the number of samples, then the number of training steps equals to `--train_samples` / `--global-batch-size`. To directly specify the number of steps, apply these changes to `2.distributed-training.sbatch` when calling `pretrain_gpt.py`. Note that `samples` and `iters` are mutually exclusive. + +```diff +- --train-samples 146484375 \ +- --lr-decay-samples 126953125 \ +- --lr-warmup-samples 183105 \ ++ --train-iters 50 \ ++ --lr-decay-iters 45 \ ++ --lr-warmup-iters 2 \ +``` +======= + +Following the same pattern, you can train other models. Pretraining scripts for models like +Bert, ICT, and T5 are already included in the Megatron-LM container under `/workspace/Megatron-LM`. diff --git a/3.test_cases/1.megatron-lm/slurm/llama2/README.md b/3.test_cases/1.megatron-lm/slurm/llama2/README.md new file mode 100644 index 00000000..3b57603c --- /dev/null +++ b/3.test_cases/1.megatron-lm/slurm/llama2/README.md @@ -0,0 +1,47 @@ +# Llama2 Training Example on Slurm with MegatronLM + +This directory contains instructions and templates for setting up and running Llama2 model training using MegatronLM on a Slurm cluster. 
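At a high level, the workflow mirrors the GPT3 example: stage the tokenizer files and a sample dataset under a `llama2/` directory on the shared filesystem, run the preprocessing job, then launch pretraining. A quick sanity check before submitting any jobs (a sketch; it assumes the `llama2/` layout created in section 2 below):

```bash
# Run from this test case directory on the shared FSx filesystem.
# Expected after completing section 2 below: tokenizer.json, tokenizer.model, oscar-1GB.jsonl
ls -lh llama2/
```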
+ +To pretrain Llama2, you must first download the tokenizer files (i.e., `tokenizer.json` and `tokenizer.model`) from the model provider; registration is required. Alternatively, you may train your own tokenizer, but this is beyond the scope of this document. Either way, once you have the tokenizer files, upload them to the FSx for Lustre filesystem that your Slurm cluster mounts. + +The remaining steps are similar to the GPT3 example. For more information, please refer to the official Megatron-LM documentation on Llama2 [here](https://github.com/NVIDIA/Megatron-LM/blob/main/docs/llama2.md). + + +## 1. Preparation + +Ensure you have the following prerequisites: + +- A functional Slurm cluster. +- Docker, Pyxis, and Enroot installed on the head node and compute nodes. +- An FSx for Lustre filesystem mounted on `/fsx` in all nodes. + +Set up the following environment variables in your terminal: + +```bash +export DATA_PATH=/fsx # FSx for Lustre shared file-system +``` + +### 2. Download and preprocess data + +```bash +mkdir -p llama2 +# Then, place `tokenizer.json` and `tokenizer.model` in this `llama2/` directory. + +# Download sample dataset +wget -P llama2 https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz +xz -d llama2/oscar-1GB.jsonl.xz + +sbatch data-preproc-llama2.sbatch +``` + +### 3. Run pretraining job + +Edit `pretrain-llama2.sbatch` to choose the model size you want to train. Do this by commenting and uncommenting the related stanzas. Feel free to experiment with the hyperparameters such as parallelism, batches, etc. (for more details, please refer to the [Megatron-LM project](https://github.com/NVIDIA/Megatron-LM/) and the Megatron papers ([Shoeybi20](https://arxiv.org/abs/1909.08053), [Narayanan21](https://arxiv.org/abs/2104.04473))). + +```bash +sbatch pretrain-llama2.sbatch +``` + +Tip: the Llama2 example prints the estimated FLOPS/GPU (enabled via `--log-throughput` in the pretrain `.sbatch` file). You might want to look at [PR-682](https://github.com/NVIDIA/Megatron-LM/pull/682) and decide whether to patch your Megatron-LM to adjust the way FLOPS/GPU is calculated. + diff --git a/3.test_cases/1.megatron-lm/3.data-preproc-llama2.sbatch b/3.test_cases/1.megatron-lm/slurm/llama2/data-preproc-llama2.sbatch old mode 100644 new mode 100755 similarity index 100% rename from 3.test_cases/1.megatron-lm/3.data-preproc-llama2.sbatch rename to 3.test_cases/1.megatron-lm/slurm/llama2/data-preproc-llama2.sbatch diff --git a/3.test_cases/1.megatron-lm/4.pretrain-llama2.sbatch b/3.test_cases/1.megatron-lm/slurm/llama2/pretrain-llama2.sbatch old mode 100644 new mode 100755 similarity index 98% rename from 3.test_cases/1.megatron-lm/4.pretrain-llama2.sbatch rename to 3.test_cases/1.megatron-lm/slurm/llama2/pretrain-llama2.sbatch index 06bb4c77..e9ab80d1 --- a/3.test_cases/1.megatron-lm/4.pretrain-llama2.sbatch +++ b/3.test_cases/1.megatron-lm/slurm/llama2/pretrain-llama2.sbatch @@ -139,7 +139,6 @@ MEGATRON_ARGS+=( # Example how to disable all validations, hence only training steps performed. 
--split 100,0,0 ) - [[ -f ${IMAGE} ]] || { echo "Could not find enroot image: $IMAGE" ; exit -1 ; } srun -l "${ARGS[@]}" python -m torch.distributed.run "${TORCHRUN_ARGS[@]}" /workspace/Megatron-LM/pretrain_gpt.py \ "${MEGATRON_ARGS[@]}" \ @@ -157,5 +156,7 @@ srun -l "${ARGS[@]}" python -m torch.distributed.run "${TORCHRUN_ARGS[@]}" /work --weight-decay 0.1 \ --adam-beta1 0.9 \ --adam-beta2 0.95 \ + --wandb-exp-name llama2-megatron-lm \ + --wandb-project "megatron-lm" \ --init-method-std 0.006 \ --fp16 diff --git a/3.test_cases/1.megatron-lm/test_megatron_lm.py b/3.test_cases/1.megatron-lm/test_megatron_lm.py old mode 100644 new mode 100755 index 5be8c0fd..23940b91 --- a/3.test_cases/1.megatron-lm/test_megatron_lm.py +++ b/3.test_cases/1.megatron-lm/test_megatron_lm.py @@ -5,5 +5,5 @@ def test_img_megatron_training(docker_build, docker_run): print(f"module file {os.path.dirname(__file__)}") print(f"cwd {os.getcwd()}") - img = docker_build('megatron-training', '0.distributed-training.Dockerfile') + img = docker_build('megatron-training', 'aws-megatron-lm.Dockerfile') docker_run(img, ['python3', '-c', 'import torch'])
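For context on the last hunk: `test_megatron_lm.py` relies on `docker_build` and `docker_run` pytest fixtures defined elsewhere in the repository (their implementation is not part of this diff). To reproduce what the smoke test does by hand, the rough shell equivalent is sketched below; it assumes you run it from `3.test_cases/1.megatron-lm/` with Docker available, and that the image builds from the renamed Dockerfile:

```bash
# Manual equivalent of test_img_megatron_training: build the image from the renamed
# Dockerfile, then verify that PyTorch imports inside the resulting container.
docker build -t megatron-training -f aws-megatron-lm.Dockerfile .
docker run --rm megatron-training python3 -c 'import torch'
```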