In this workshop you will learn how to develop support for a new model with NeuronX Distributed Inference, using Llama 3.2 1B as the working example. You will also learn how to write your own kernel with the Neuron Kernel Interface, programming the accelerator hardware directly. Both of these tools will help you design your research proposals and experiments on Trainium.
Build on Trainium is a $110M credit program focused on AI research and university education, supporting the next generation of innovation and development on AWS Trainium. AWS Trainium chips are purpose-built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs) and latent diffusion models. Build on Trainium provides compute credits for novel AI research on Trainium, investing in leading academic teams to build innovations in critical areas including new model architectures, ML libraries, optimizations, large-scale distributed systems, and more. This multi-year initiative lays the foundation for the future of AI by inspiring the academic community to utilize, invest in, and contribute to the open-source community around Trainium. Combining these benefits with the Neuron software development kit (SDK) and the recently launched Neuron Kernel Interface (NKI), AI researchers can innovate at scale in the cloud.
AWS Trainium is an AI chip developed by AWS to accelerate building and deploying machine learning models. Built on a specialized architecture designed for deep learning, Trainium accelerates the training and inference of complex models with high throughput and scalability, making it ideal for academic researchers looking to optimize performance and costs. The architecture also emphasizes sustainability through energy-efficient design, reducing environmental impact. Amazon has established a dedicated Trainium research cluster featuring up to 40,000 Trainium chips, accessible via Amazon EC2 Trn1 instances. These instances are connected through a non-blocking, petabit-scale network using Amazon EC2 UltraClusters, enabling seamless high-performance ML training. The Trn1 instance family is optimized to deliver substantial compute power for cutting-edge AI research and development. This unique offering not only enhances the efficiency and affordability of model training but also presents academic researchers with opportunities to publish new papers on underrepresented compute architectures, thus advancing the field.
Learn more about Build on Trainium here.
This hands-on workshop is designed for academic researchers who are planning to submit proposals to Build on Trainium.
The workshop has 3 main modules:
- Setup instructions
- Run inference with Llama and NeuronX Distributed Inference (NxD)
- Write your own kernel with the Neuron Kernel Interface (NKI); a brief sketch of an NKI kernel follows this list
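To preview what the third module involves, below is a minimal NKI kernel sketch in the spirit of the tensor-add example from the Neuron documentation's NKI getting-started guide. Treat it as an illustrative sketch rather than the exact lab code; the workshop notebooks are the authoritative version.

```python
# Minimal NKI kernel sketch, modeled on the tensor-add example in the
# public NKI getting-started guide. Assumes the inputs are small enough
# to fit in a single on-chip tile.
from neuronxcc import nki
import neuronxcc.nki.language as nl


@nki.jit
def nki_tensor_add_kernel(a_input, b_input):
    """Element-wise addition of two tensors on a NeuronCore."""
    # Allocate the output tensor in device memory (HBM).
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype,
                          buffer=nl.shared_hbm)

    # Load the inputs from HBM into on-chip SBUF memory.
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)

    # Compute a + b on the accelerator.
    c_tile = a_tile + b_tile

    # Store the result back to device memory and hand it to the caller.
    nl.store(c_output, value=c_tile)
    return c_output
```

The pattern here (load tiles into on-chip memory, compute, store back to device memory) is the core idiom you will build on throughout the NKI module.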
If you are participating in an instructor-led workshop, follow the guidance provided by your instructor for accessing the environment.
If you are following the workshop steps in your own environment, you will need to take the following actions (a quick smoke test for the resulting environment follows this list):
- Launch a trn1.2xlarge instance on Amazon EC2, using the latest DLAMI with Neuron packages preinstalled.
- Use a Python virtual environment preinstalled in that DLAMI, commonly located in `/opt/aws_<xxx>`.
- Set up and manage your own development environment on that instance, such as by using VSCode or a Jupyter Lab server.
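Once the instance is running and the virtual environment is activated, you can confirm the Neuron stack is healthy with a small PyTorch smoke test. This is a minimal sketch assuming the DLAMI's preinstalled `torch-neuronx` environment; it is not part of the labs themselves.

```python
# Minimal Neuron environment smoke test (assumes the DLAMI's
# torch-neuronx virtual environment is activated).
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()      # resolves to a NeuronCore on a Trn1 instance
x = torch.ones(2, 2, device=device)
y = x + x                     # compiled for and executed on the device
print(y.cpu())                # expect a 2x2 tensor of 2.0s
```

If this prints without errors, the Neuron driver, runtime, and compiler are all reachable from Python.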
This workshop introduces developing on AWS Trainium for the academic AI research audience. As such, it is expected that the audience already has a firm understanding of machine learning fundamentals.
If you are participating in an instructor-led workshop hosted in an AWS-managed Workshop Studio environment, you will not incur any costs through using this environment. If you are following this workshop in your own environment, you will incur the costs associated with provisioning an Amazon EC2 instance. Please see the service pricing details here.
At the time of writing, this workshop uses a trn1.2xlarge instance with an on-demand rate of $1.34 per hour in supported US regions.
- Workshop instructions are available here.
- If you use the `NousResearch` Llama 3.2 1B, please note that you'll need to remove a trailing comma in the model config file. You can do this with Vim or in VSCode. If you do not take this step, you'll get an invalid-JSON error when Lab 1 tries to read the model config. If editing the file through the terminal is a little challenging, you can also download the config file from this repository (a quick way to verify the fix follows this list) with the following command: `wget https://github.com/aws-neuron/build-on-trainium-workshop/raw/main/labs/generation_config.json -O /home/ec2-user/models/llama/generation_config.json` (add a leading `!` if you run it from a Jupyter cell).
- Jupyter kernels can hold on to the NeuronCores as a Python process even after your cell has completed. This can cause issues when you try to run a new notebook, and sometimes when you try to run another cell. If you encounter a `NeuronCore not found` or similar error, restart your Jupyter kernel and/or shut down kernels from previous sessions. You can also restart the instance through the EC2 console. Once your node is back online, you can check the availability of the NeuronCores with `neuron-ls`.
- Want to see how to integrate NKI with NxD? Check out our `nki-llama` here.
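As mentioned in the config-file tip above, after removing the trailing comma you can confirm the model config is valid before starting Lab 1 by parsing it with Python's standard `json` module. The path below is illustrative; substitute wherever you stored the config.

```python
# Quick validity check for the downloaded/edited model config.
# The path is illustrative; adjust it to your own setup.
import json

with open("/home/ec2-user/models/llama/generation_config.json") as f:
    json.load(f)  # raises json.JSONDecodeError if a trailing comma remains

print("config parses as valid JSON")
```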
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.