Who is this guide for?
- People who want to run containerized inference on a GPU
- People who have some experience with using containers
- People who have experience configuring and using remote resources
What are the guide's limitations?
- It assumes the remote is already set up with the appropriate dependencies, i.e. NVIDIA GPU drivers, the NVIDIA Container Toolkit, and either Docker or Podman
- It assumes Ubuntu 22.04 or Ubuntu 24.04 LTS as the reference OS
- Some steps may need adjustment for other distributions or configurations
What else do I need to know?
- You will need root/sudo access for most setup steps
- The first place to check for issues is the nvidia-smi output (a quick sanity check is sketched after the links below)
- Make sure your GPU meets the minimum requirements before starting
For detailed software installation instructions, see:
- NVIDIA Driver Installation
- NVIDIA Container Toolkit Setup
- Docker Installation or Podman Installation
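Before going further, it is worth a quick sanity check that the GPU stack works end to end. This is a minimal sketch, assuming Docker and the NVIDIA Container Toolkit are already installed; the CUDA image tag is only an example and any recent CUDA base image will do.
# Confirm the driver can see the GPU
nvidia-smi
# Confirm containers can see the GPU (example image tag; any recent CUDA base image works)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi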
There are many ways you can set up inference on a remote GPU.
We will go over two: Ollama and NVIDIA NIM.
Ollama:
- Easier to set up and manage
- Runs many models out of the box
- Good for experimentation and development
- See Ollama Setup Guide below
- GPU requirements: depends on the model selected, but lighter weight than NIMs and can go as low as 8 GB of VRAM
NVIDIA NIM:
- More steps to set up, but better performance and optimization
- Many options for configuring deployment and model optimization
- Better for production use
- See NIM Setup Guide below
- GPU requirements: depends on the model selected, but generally 24 GB of VRAM or more
General Prerequisites
- Make sure the remote is properly set up and that you have SSH access to it
- IP address: <remote-ip>
- Remote user: <remote-user>
- Be in a terminal session on the remote
- Make sure the remote is open to TCP access on a known port, i.e. <remote-port> (a quick check is sketched after this list)
- Make sure that the container runtime is properly configured
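The sketch below shows one way to confirm SSH access and open the chosen port. It assumes Ubuntu's ufw firewall is in use; skip or adapt the firewall commands if your remote is secured differently.
# From your local machine: confirm SSH access to the remote
ssh <remote-user>@<remote-ip> "echo connected"
# On the remote: open the chosen port (assumes ufw; adapt for other firewalls)
sudo ufw allow <remote-port>/tcp
sudo ufw status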
Ollama Setup Guide
- Satisfy the General Prerequisites
- Deploy Ollama Container: Pull the Ollama container onto the remote and run it
- Pull Model into Ollama Container: Exec into the container and load the desired model
- Add Ollama Container as an Endpoint: Configure the Agentic RAG app to use the model
Do the following in the remote terminal. If you encounter any errors, use an LLM to help you debug.
# Pull the Ollama container. Change the tag if you want a different one.
docker pull ollama/ollama:latest
# Run it and make sure to connect the port on the remote, <remote-port>, to the Ollama port in the container, 11434
docker run -d --name ollama \
--gpus all --restart unless-stopped \
-p <remote-port>:11434 \
-v ollama_data:/root/.ollama \
ollama/ollama:latest
# Make sure it is running properly
curl http://localhost:<remote-port>/api/tags
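If that request fails, the container logs are the first place to look. You can also confirm the GPU is visible inside the container; nvidia-smi is typically injected by the Container Toolkit when the container is started with --gpus all.
# Check the Ollama server logs for startup or GPU-detection errors
docker logs ollama
# Confirm the GPU is visible inside the container
docker exec ollama nvidia-smi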
Do the following in the remote terminal. If you encounter any errors, use an LLM to help you debug.
# Exec into the container and pull a model
docker exec ollama ollama pull llama2:7b
# Verify the model has been pulled and is available
curl http://localhost:<remote-port>/api/tags
# Submit a simple query to the model to test
curl -X POST http://localhost:<remote-port>/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"prompt": "What is the meaning of life?",
"stream": false
}'
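You can also list the models available inside the container with the Ollama CLI.
# List the models currently pulled into the container
docker exec ollama ollama list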
Do the following in a local terminal. If you encounter any errors, use an LLM to help you debug.
You will need the remote IP, <remote-ip>, and the remote port, <remote-port>.
# Test local access to the Ollama container on the remote
curl -X POST http://<remote-ip>:<remote-port>/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama2:7b", "prompt": "Hello, Ollama!"}'- Open the Agentic RAG project in AI Workbench
- Select Models
- For each component that should use the Ollama container:
- Select Self-Hosted Endpoint
- Enter the IP address, port, and model name
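If a component expects an OpenAI-style API rather than the native Ollama API, recent Ollama versions also expose an OpenAI-compatible endpoint on the same port. A quick check, using the same placeholders as above, looks like this:
# Optional: query Ollama's OpenAI-compatible endpoint
curl http://<remote-ip>:<remote-port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama2:7b", "messages": [{"role": "user", "content": "Hello, Ollama!"}]}'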
NIM Setup Guide
- Satisfy the General Prerequisites
- Ensure you have an NVIDIA NGC API key (available at https://ngc.nvidia.com/setup/api-key)
- Deploy NIM Container: Pull and run the NIM container on the remote
- Configure NIM Container: Set up the model and API key
- Add NIM Container as an Endpoint: Configure the Agentic RAG app to use the model
Do the following in the remote terminal. If you encounter any errors, use an LLM to help you debug.
# Login to NVIDIA NGC
# Use $oauthtoken as the username and your NGC API key as the password
docker login nvcr.io
# Pull the NIM container
docker pull nvcr.io/nim/deepseek-r1-distill-llama-8b:latest
# Run the container
docker run -d --name nim \
--gpus all --restart unless-stopped \
-p <remote-port>:8000 \
-e NGC_API_KEY=<your-ngc-api-key> \
nvcr.io/nim/deepseek-r1-distill-llama-8b:latest
# Verify the container is running
docker ps | grep nim
The NIM container will automatically download and configure the model when it starts. You can verify that it is working by:
# Check the container logs
docker logs nim
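# (Optional) The first start downloads the model, which can take several minutes.
# Poll the standard NIM readiness endpoint until the model is loaded.
until curl -sf http://localhost:<remote-port>/v1/health/ready > /dev/null; do
  echo "Waiting for NIM to become ready..."
  sleep 10
done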
# Test the API endpoint
# The model name must match what the NIM serves; list it with:
# curl http://localhost:<remote-port>/v1/models
curl -X POST http://localhost:<remote-port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-name>",
"messages": [
{"role": "user", "content": "Hello, NIM!"}
]
}'
Do the following in a local terminal. If you encounter any errors, use an LLM to help you debug.
You will need the remote IP, <remote-ip>, and the remote port, <remote-port>.
# Test local access to the NIM container on the remote
# Use the model name reported by the /v1/models endpoint
curl -X POST http://<remote-ip>:<remote-port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-name>",
"messages": [
{"role": "user", "content": "Hello, NIM!"}
]
}'
Add NIM Container as an Endpoint
- Open the Agentic RAG project in AI Workbench
- Select Models
- For each component that should use the NIM container:
- Select Self-Hosted Endpoint
- Enter the IP address, port, and model name
- NIM containers are optimized for production use and provide better performance than Ollama
- They require more GPU memory (typically 24 GB or more)
- The API follows the OpenAI-compatible format, making it easy to integrate with existing applications
- You can monitor the container's performance using NVIDIA's monitoring tools (a minimal sketch follows this list)
- For production deployments, consider setting up proper security measures and load balancing
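If you want a lightweight way to watch GPU utilization while the container is serving requests, the commands below rely only on the driver tools this guide already assumes.
# Refresh GPU utilization and memory usage every two seconds
watch -n 2 nvidia-smi
# Or stream per-GPU utilization samples
nvidia-smi dmon -s u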