vibe #1

Open: wants to merge 2 commits into main

39 changes: 39 additions & 0 deletions Dockerfile
@@ -0,0 +1,39 @@
FROM python:3.10-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    curl \
    sudo \
    wget \
    git \
    tar \
    && rm -rf /var/lib/apt/lists/*

# Install nebula (latest release for the build machine's CPU architecture)
RUN ARCH=$(uname -m) && \
    if [ "$ARCH" = "x86_64" ]; then ARCH="amd64"; elif [ "$ARCH" = "aarch64" ]; then ARCH="arm64"; fi && \
    LATEST_VERSION=$(curl -fsSL https://api.github.com/repos/slackhq/nebula/releases/latest | grep -Po '"tag_name": "\K.*?(?=")') && \
    curl -fL -o /tmp/nebula.tar.gz "https://github.com/slackhq/nebula/releases/download/${LATEST_VERSION}/nebula-linux-${ARCH}.tar.gz" && \
    mkdir -p /tmp/nebula && \
    tar -xzf /tmp/nebula.tar.gz -C /tmp/nebula && \
    cp /tmp/nebula/nebula /usr/local/bin/ && \
    cp /tmp/nebula/nebula-cert /usr/local/bin/ && \
    chmod +x /usr/local/bin/nebula && \
    chmod +x /usr/local/bin/nebula-cert && \
    rm -rf /tmp/nebula /tmp/nebula.tar.gz

# Install Python dependencies
RUN pip install --no-cache-dir torch torchvision torchft

# Set working directory
WORKDIR /app

# Copy test files
COPY test_multi_node.py /app/
COPY train_ddp.py /app/
COPY run_multi_node_test.sh /app/

# Make script executable
RUN chmod +x /app/run_multi_node_test.sh

ENTRYPOINT ["/app/run_multi_node_test.sh"]
33 changes: 33 additions & 0 deletions Makefile
@@ -0,0 +1,33 @@
.PHONY: all setup test test-local test-docker clean

# Default target
all: test

# Setup the environment (recipes run under /bin/sh, so only POSIX shell syntax is used)
setup:
	pip install torch torchvision torchft
	if ! command -v nebula > /dev/null 2>&1; then \
		if [ "$$(uname -s)" = "Darwin" ]; then \
			brew install nebula; \
		else \
			echo "Please install Nebula manually: https://github.com/slackhq/nebula/releases"; \
		fi; \
	fi

# Run the local test (requires sudo)
test-local:
	sudo ./run_multi_node_test.sh

# Run the test using Docker
test-docker:
	docker-compose build
	docker-compose up

# Run the test (defaults to Docker)
test: test-docker

# Clean up
clean:
	rm -rf test_multi_node_env
	docker-compose down
	sudo pkill nebula || true
62 changes: 59 additions & 3 deletions README.md
@@ -1,27 +1,83 @@
# KernelSwarm

## Lighthouse details

The Lighthouse currently runs on a $4/month DigitalOcean droplet. If you'd like to be added, please let us know and share your public SSH key.

The lighthouse is primarily responsible for keeping track of all nodes in the swarm.


## How to join the swarm

The intended flow for a new user is:
1. They click a button to request a client
2. We share the client with them
3. They run the client, which connects them to the swarm


## High-level infra details
1. Lighthouse on DigitalOcean
2. Nebula VPN for swarm communication
3. Clients are shell scripts for now, but should soon be Docker containers
4. A fault-tolerant PyTorch job that is responsible for the actual training
5. Results shared in a public dashboard

## Running the multi-node test

This project includes an end-to-end test that verifies the setup: multiple nodes connect via Nebula and run distributed training using torchft.

### Prerequisites

- Python 3.10+ (the version used by the Docker image; current torch releases no longer support Python 3.6)
- pip
- Nebula VPN (installed via the test script if not present)
- Docker and Docker Compose (for container-based testing)
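
When Nebula is missing, the test script downloads a release tarball whose name depends on the CPU architecture. The sketch below mirrors that mapping in Python to make the naming scheme explicit. The function names are hypothetical and `vX.Y.Z` is a placeholder tag, not a real version; the script itself resolves the latest tag from the GitHub API.

```python
import platform

# Mapping from `uname -m` machine names to the suffixes used in Nebula
# release asset names, mirroring the shell logic in run_multi_node_test.sh.
ARCH_ALIASES = {"x86_64": "amd64", "aarch64": "arm64"}


def nebula_arch(machine: str) -> str:
    """Translate a `uname -m` value; unknown values pass through unchanged."""
    return ARCH_ALIASES.get(machine, machine)


def nebula_download_url(version: str, machine: str) -> str:
    """Build the Linux release tarball URL for a given tag and machine type."""
    return (
        "https://github.com/slackhq/nebula/releases/download/"
        f"{version}/nebula-linux-{nebula_arch(machine)}.tar.gz"
    )


if __name__ == "__main__":
    # "vX.Y.Z" is a placeholder; substitute a tag from the releases page.
    print(nebula_download_url("vX.Y.Z", platform.machine()))
```

Note that an unrecognized architecture is passed through as-is, just as in the shell version, so the download fails rather than silently picking the wrong binary.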

### Running the test

There are two ways to run the test:

#### 1. Using Docker (recommended)

This method uses Docker to set up multiple containers, each representing a node in the swarm:

```bash
# Build and run using Docker Compose
make test-docker
```
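
Each service in `docker-compose.yml` receives `REPLICA_GROUP_ID` and `NUM_REPLICA_GROUPS` as environment variables. As a rough sketch of how a training script could read and validate them (the helper name and the fallback defaults of `0` and `1` are assumptions, and `train_ddp.py` may handle this differently):

```python
import os


def replica_config(env=None):
    """Return (replica_group_id, num_replica_groups) read from the environment.

    The defaults used when a variable is unset are assumptions for this
    sketch, chosen so that a single process forms a group of one.
    """
    env = os.environ if env is None else env
    group_id = int(env.get("REPLICA_GROUP_ID", "0"))
    num_groups = int(env.get("NUM_REPLICA_GROUPS", "1"))
    if not 0 <= group_id < num_groups:
        raise ValueError(
            f"REPLICA_GROUP_ID={group_id} is outside [0, {num_groups})"
        )
    return group_id, num_groups


if __name__ == "__main__":
    group_id, num_groups = replica_config()
    print(f"replica group {group_id} of {num_groups}")
```

Validating the range up front turns a mismatched Compose configuration into an immediate error instead of a hang during distributed setup.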

#### 2. Local testing

This method simulates multiple nodes on your local machine:

```bash
# Run the test locally (requires sudo for setting up Nebula interfaces)
make test-local
```

### Test options

The test supports several command-line options:

```bash
# Run with specific options
sudo ./run_multi_node_test.sh --num-nodes 3 --timeout 600
```

Available options:
- `--num-nodes`: Number of nodes to simulate (default: 2)
- `--torchft-version`: Version of torchft to install (default: latest)
- `--node-ip-prefix`: IP prefix for node addresses (default: 192.168.100.)
- `--lighthouse-ip`: IP address of the lighthouse node (default: 192.168.100.1)
- `--username`: Username for nebula client creation (default: test)
- `--timeout`: Timeout in seconds for test completion (default: 300)
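
For reference, the options above could be declared with `argparse` roughly as follows. This is only a sketch built from the documented flags and defaults; the function name `build_parser` is hypothetical, and the real `test_multi_node.py` may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(
        description="Multi-node test options (illustrative sketch)"
    )
    parser.add_argument("--num-nodes", type=int, default=2,
                        help="Number of nodes to simulate")
    parser.add_argument("--torchft-version", default="latest",
                        help="Version of torchft to install")
    parser.add_argument("--node-ip-prefix", default="192.168.100.",
                        help="IP prefix for node addresses")
    parser.add_argument("--lighthouse-ip", default="192.168.100.1",
                        help="IP address of the lighthouse node")
    parser.add_argument("--username", default="test",
                        help="Username for nebula client creation")
    parser.add_argument("--timeout", type=int, default=300,
                        help="Timeout in seconds for test completion")
    return parser


# Equivalent to: sudo ./run_multi_node_test.sh --num-nodes 3 --timeout 600
args = build_parser().parse_args(["--num-nodes", "3", "--timeout", "600"])
```

`argparse` exposes dashed flags as underscored attributes, so the values are read as `args.num_nodes`, `args.lighthouse_ip`, and so on.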

### Cleaning up

To clean up after running the tests:

```bash
make clean
```

## TBD

![swarm](./swarm.png)
30 changes: 30 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,30 @@
version: '3'

services:
  node1:
    build:
      context: .
      dockerfile: Dockerfile
    privileged: true  # Needed for creating network interfaces
    network_mode: "host"
    volumes:
      - ./:/app
    command: ["--num-nodes", "2", "--node-ip-prefix", "192.168.100.", "--lighthouse-ip", "192.168.100.1"]
    environment:
      - REPLICA_GROUP_ID=0
      - NUM_REPLICA_GROUPS=2

  node2:
    build:
      context: .
      dockerfile: Dockerfile
    privileged: true  # Needed for creating network interfaces
    network_mode: "host"
    volumes:
      - ./:/app
    command: ["--num-nodes", "2", "--node-ip-prefix", "192.168.100.", "--lighthouse-ip", "192.168.100.1"]
    environment:
      - REPLICA_GROUP_ID=1
      - NUM_REPLICA_GROUPS=2
    depends_on:
      - node1
75 changes: 75 additions & 0 deletions run_multi_node_test.sh
@@ -0,0 +1,75 @@
#!/bin/bash
# Script to run the multi-node test

set -e

# Check if script is run as root (via sudo)
if [ "$EUID" -ne 0 ]; then
    echo "This script must be run with sudo to set up Nebula VPN interfaces."
    echo "Please run: sudo $0 $*"
    exit 1
fi

# Install nebula if not already installed
if ! command -v nebula &> /dev/null; then
    echo "Nebula not found. Installing..."

    # Detect OS
    if [[ "$OSTYPE" == "linux-gnu"* ]]; then
        # For Linux (Ubuntu/Debian)
        if command -v apt-get &> /dev/null; then
            apt-get update
            apt-get install -y curl

            # Map the machine type to the suffix used in Nebula release assets
            ARCH=$(uname -m)
            if [ "$ARCH" == "x86_64" ]; then
                ARCH="amd64"
            elif [ "$ARCH" == "aarch64" ]; then
                ARCH="arm64"
            fi

            # Resolve the latest release tag; with `set -e` the script aborts if no tag is found
            LATEST_VERSION=$(curl -fsSL https://api.github.com/repos/slackhq/nebula/releases/latest | grep -Po '"tag_name": "\K.*?(?=")')
            curl -fL -o /tmp/nebula.tar.gz "https://github.com/slackhq/nebula/releases/download/${LATEST_VERSION}/nebula-linux-${ARCH}.tar.gz"

            mkdir -p /tmp/nebula
            tar -xzf /tmp/nebula.tar.gz -C /tmp/nebula
            cp /tmp/nebula/nebula /usr/local/bin/
            cp /tmp/nebula/nebula-cert /usr/local/bin/

            chmod +x /usr/local/bin/nebula
            chmod +x /usr/local/bin/nebula-cert

            rm -rf /tmp/nebula /tmp/nebula.tar.gz
        else
            echo "Unsupported Linux distribution. Please install Nebula manually."
            exit 1
        fi
    elif [[ "$OSTYPE" == "darwin"* ]]; then
        # For macOS (Homebrew refuses to run as root, so install as the invoking user)
        if command -v brew &> /dev/null; then
            sudo -u "${SUDO_USER:-$USER}" brew install nebula
        else
            echo "Homebrew not found. Please install Homebrew and then run: brew install nebula"
            exit 1
        fi
    else
        echo "Unsupported operating system. Please install Nebula manually."
        exit 1
    fi
fi

# Install pip3 if needed
if ! command -v pip3 &> /dev/null; then
    if [[ "$OSTYPE" == "linux-gnu"* ]]; then
        apt-get update
        apt-get install -y python3-pip
    elif [[ "$OSTYPE" == "darwin"* ]]; then
        echo "pip3 not found. Please install Python 3 and pip3."
        exit 1
    fi
fi

# Run the test
echo "Starting multi-node test..."
python3 test_multi_node.py "$@"