ShadowSwarm



A streamlined framework for setting up a multi-node, GPU-accelerated distributed system for PyTorch workloads using Docker Swarm. With ShadowSWARM, you can quickly configure and deploy a scalable environment for machine learning inference or training across multiple machines.


Features

  • Automated Docker Swarm initialization and worker node setup.
  • Flexible configuration using interactive CLI (config.py).
  • Dynamic IP and hostname detection for seamless multi-node deployment.
  • Streamlined distributed PyTorch workloads with Fully Sharded Data Parallel (FSDP).
  • Integrated Streamlit interface for easy interaction with your system.

Quickstart Guide

Prerequisites

  1. Docker and NVIDIA Drivers:

    • Install Docker and NVIDIA drivers on all machines.
    • Install the NVIDIA Container Toolkit, configure the Docker runtime, and restart Docker:
      sudo apt-get install -y nvidia-container-toolkit
      sudo nvidia-ctk runtime configure --runtime=docker
      sudo systemctl restart docker
    • Verify Docker GPU support:
      docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi
  2. Python 3.8+:

    • Install Python on the master machine:
      sudo apt-get install python3 python3-pip
  3. Passwordless SSH:

    • Configure passwordless SSH from the master to all worker nodes:
      ssh-keygen -t rsa -b 2048
      ssh-copy-id user@worker-ip
      • You only need to set up SSH from the master node to the workers.
      • The worker nodes do not need SSH access to each other or the master.
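
Before moving on, you can sanity-check these prerequisites from the master node. The Python sketch below (a hypothetical helper, not part of this repository) opens a passwordless SSH session to each worker and asks it to list its GPUs; the worker hostnames are placeholders for your own.

    # check_nodes.py -- hypothetical helper, not part of this repository.
    # Confirms passwordless SSH and GPU visibility for each worker before bootstrapping.
    import subprocess

    WORKERS = ["worker1.local", "worker2.local"]  # placeholders; use your worker hostnames or IPs

    for host in WORKERS:
        # BatchMode=yes makes SSH fail immediately instead of prompting for a password.
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, timeout=30,
        )
        status = "OK" if result.returncode == 0 else "FAILED"
        print(f"{host}: {status}")
        print((result.stdout or result.stderr).strip())

Run it from the master with python3; every host should report OK and list at least one GPU.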

Installation

  1. Clone the Repository (Only on Master):

    • Clone this repository on the master node:
      git clone https://github.com/DJStompZone/shadowswarm.git
      cd shadowswarm
    • The worker nodes do not need the repository because Docker Swarm handles the deployment of containers automatically.
  2. Build the Docker Image: Build the Docker image on the master node:

    docker build -t shadowswarm-app .

Setup and Deployment

  1. Run the Configuration Script: Use the interactive CLI to gather and validate the necessary configuration:

    python3 config.py

    This script will:

    • Prompt for the master and worker node details.
    • Save the configuration to a .env file.
    • Run the bootstrap.sh script to initialize Docker Swarm and add workers (a sketch of this flow appears after this list).
  2. Verify Swarm Setup: Check the Swarm status after the bootstrap:

    docker node ls
  3. Deploy the Docker Stack: Once the Swarm is ready, deploy the application:

    docker stack deploy --compose-file docker-compose.yml shadowswarm
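
Step 1 above hands the Swarm setup off to bootstrap.sh. For orientation, the Python sketch below only illustrates that flow with placeholder hostnames and addresses; the actual bootstrap.sh in this repository is a shell script and may differ in detail.

    # swarm_bootstrap_sketch.py -- illustrative only; the repository's bootstrap.sh
    # implements this flow as a shell script and may differ in detail.
    import subprocess

    MASTER_IP = "192.168.1.10"                    # placeholder master address
    WORKERS = ["worker1.local", "worker2.local"]  # placeholder worker hostnames

    def run(cmd):
        """Run a command locally and return its stdout."""
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

    # 1. Initialize the Swarm on the master, advertising its IP to the workers.
    run(["docker", "swarm", "init", "--advertise-addr", MASTER_IP])

    # 2. Ask Docker for the worker join token (-q prints only the token).
    token = run(["docker", "swarm", "join-token", "worker", "-q"])

    # 3. Join each worker over passwordless SSH.
    for host in WORKERS:
        run(["ssh", host, "docker", "swarm", "join", "--token", token, f"{MASTER_IP}:2377"])

    print(run(["docker", "node", "ls"]))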

Access the Streamlit App

  1. Open a browser and navigate to the master node IP:

    http://<master-node-ip>:8501
    
  2. Use the Streamlit interface to interact with your distributed PyTorch system.
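
If the page does not load, you can confirm that the service is reachable with a quick check such as the one below (the master IP is a placeholder).

    # streamlit_check.py -- hypothetical helper; replace the placeholder with your master node IP.
    import urllib.request

    MASTER_IP = "192.168.1.10"  # placeholder
    url = f"http://{MASTER_IP}:8501"

    # A plain HTTP GET is enough to confirm that the Streamlit service answers.
    with urllib.request.urlopen(url, timeout=10) as response:
        print(f"{url} responded with HTTP {response.status}")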

File Structure

shadowswarm/
├── config.py            # CLI script for gathering configuration
├── bootstrap.sh         # Script for initializing Docker Swarm and adding workers
├── docker-compose.yml   # Docker Swarm stack configuration
├── Dockerfile           # Docker image definition
├── .env                 # Environment variables for the deployment
├── app/                 # Application directory
│   ├── main.py          # PyTorch and Streamlit code
│   └── utils.py         # Utility functions

How It Works

  1. Configuration:

    • config.py prompts for master and worker node details, saves them to .env, and triggers bootstrap.sh.
  2. Swarm Initialization:

    • bootstrap.sh initializes Docker Swarm on the master node and connects workers via SSH.
  3. Stack Deployment:

    • docker-compose.yml orchestrates the master and worker containers, assigning roles using environment variables.
  4. Distributed Workload:

    • The master node manages the distributed PyTorch workload across all nodes using Fully Sharded Data Parallel (FSDP).
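
The environment variables listed in the next section drive the distributed setup. As a rough sketch only (not the code in app/main.py), a minimal FSDP setup along these lines would look as follows, assuming the NCCL backend and one GPU and one process per node.

    # fsdp_sketch.py -- a rough sketch of the distributed setup, not the code in app/main.py.
    # Assumes the NCCL backend and one GPU (and one process) per node.
    import os

    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        # Role and cluster size come from the environment variables set by the stack.
        rank = int(os.environ["NODE_RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        # torch.distributed expects MASTER_ADDR / MASTER_PORT in the environment.
        os.environ.setdefault("MASTER_ADDR", os.environ["MASTER_IP"])
        os.environ.setdefault("MASTER_PORT", "29500")

        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(0)

        # Any nn.Module can be wrapped; FSDP shards its parameters across all ranks.
        model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
        model = FSDP(model)

        # ... training or inference loop goes here ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()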

Environment Variables

Variable          Description
MASTER_HOSTNAME   Hostname of the master node.
MASTER_IP         IP address of the master node.
WORKER_HOSTNAMES  Comma-separated list of worker hostnames.
NODE_RANK         Rank of the node in the distributed setup.
WORLD_SIZE        Total number of nodes in the cluster.
MASTER_PORT       Port for master-worker communication.
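
For reference, a generated .env file might look like this (illustrative values only; config.py writes the actual file):

    MASTER_HOSTNAME=master-node
    MASTER_IP=192.168.1.10
    WORKER_HOSTNAMES=worker1,worker2
    NODE_RANK=0
    WORLD_SIZE=3
    MASTER_PORT=29500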

Troubleshooting

  1. Docker Swarm Issues:

    • Check if Swarm is initialized:
      docker info
    • Verify worker nodes are connected:
      docker node ls
  2. SSH Issues:

    • Test passwordless SSH from the master:
      ssh <worker-ip>
  3. Container Logs:

    • Check the logs for the master or workers:
      docker service logs shadowswarm_master
      docker service logs shadowswarm_worker1
  4. GPU Issues:

    • Ensure GPUs are accessible:
      docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi
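
To confirm that PyTorch itself sees the GPUs inside a running container (in addition to the host-level nvidia-smi check above), a short check like the following can be copied into a service container and run with docker exec, assuming the image's Python interpreter is on PATH.

    # gpu_check.py -- hypothetical helper; run inside a container (e.g. via docker exec)
    # to confirm that PyTorch sees the GPUs exposed by the NVIDIA runtime.
    import torch

    print("CUDA available:", torch.cuda.is_available())
    print("GPU count:", torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:", torch.cuda.get_device_name(i))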

Scaling

  1. Add a new worker node to the swarm. On the master, print the worker join token with docker swarm join-token worker, then run the join command on the new worker machine:

    docker swarm join --token <worker-join-token> <master-ip>:2377
  2. Update the WORKER_HOSTNAMES in the .env file to include the new worker.

  3. Re-deploy the stack:

    docker stack deploy --compose-file docker-compose.yml shadowswarm

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have problems, suggestions, or improvements.

License

This project is licensed under the MIT License.
