@kostichs
Collaborator

Summary

This PR replaces the outdated "Create an AI cluster" article with three focused articles covering the complete GPU cluster lifecycle:

  1. Create a Bare Metal GPU cluster - Step-by-step cluster creation guide
  2. Spot Bare Metal GPU - Spot instances overview and creation
  3. Manage a Bare Metal GPU cluster - Cluster operations and administration

Changes by article

DOC-1145: Create a Bare Metal GPU cluster

  • New comprehensive creation guide with all configuration steps
  • 12 targeted screenshots covering each UI section (region, capacity, image, network, SSH, file shares, firewall)
  • Clear parameter explanations and prerequisites

DOC-944: Spot Bare Metal GPU

  • Dedicated article for Spot GPU clusters with preemption model explanation
  • Unique screenshots (removed duplicates from original draft)
  • Simplified structure without redundant bullet pseudo-headers
  • Fixed broken link to object storage documentation

DOC-1146: Manage a Bare Metal GPU cluster

  • Complete operations guide based on hands-on product research
  • 21 contextual screenshots for all major UI elements
  • Documented features: resize, power actions (soft/hard reboot, rebuild), networking, console access, tags, user actions (audit trail), deletion
  • Data retention table for cluster deletion (NVMe scrubbing, file share behavior)
  • Applied unified style guide improvements (impersonal voice, structured content)

Other changes

  • Updated the docs.json sidebar with new article entries under the GPU cloud group
  • Minor updates to getting-started.mdx and configure-file-shares.mdx for consistency

Stats

  • 3 new articles (587 lines of documentation)
  • 36 new screenshots
  • 42 files changed

Sergey Kostichev added 23 commits January 15, 2026 11:44
…hots

- Add step-by-step screenshots for cluster creation workflow
- Add cluster architecture section with InfiniBand and NCCL explanations
- Add file share integration section with persistence details
- Add network settings table with use cases
- Add firewall settings section (conditional)
- Add API automation section
- Add verification commands (nvidia-smi, file share mount check)
- Fix terminology explanations (InfiniBand, NCCL, DDP, Slurm)
- Fix style guide compliance (impersonal voice, present tense)
- Move screenshots to proper directory structure
- Add create-a-bare-metal-gpu-cluster to GPU cloud sidebar
- Remove old screenshots from incorrect location (images/edge-ai/)
- Screenshots moved to proper structure (images/docs/edge-ai/ai-infrastructure/)
- Create dedicated article for Spot Bare Metal GPU clusters
- Explain reclamation process (24-hour notice, email notification)
- Document data preservation (file shares, object storage not affected)
- Add best practices for checkpointing and interruption handling
- Add screenshots for Spot selector and warning dialog
- Add to GPU cloud sidebar
- Create dedicated article for post-creation cluster management
- Document cluster details page navigation
- Add resize operations (scale up, scale down, delete specific node)
- Add power actions (individual and bulk)
- Document network interface management
- Add console access instructions
- Document tags and user actions log
- Add cluster deletion with warnings
- List current limitations
- Add to GPU cloud sidebar
- Add Spot vs On-demand comparison table
- Explain capacity source difference (dedicated vs unused capacity)
- Expand reclamation process with warning block
- Add detailed terms from UI warning
- Enhance best practices with specific recommendations
- Add checkpoint interval guidance (1-4 hours)
- Improve workload suitability guidance
- Add data deletion timeline table (immediate for worker nodes, 48h for volumes)
- Clarify that data deletion happens immediately upon suspension, not after 24h
- Add billing details (per minute, aggregated hourly, entire node)
- Restructure data preservation section with actionable strategies
- Add pre-reclamation transfer recommendation
- Add explicit note that only one email is sent (no follow-up reminders)
- Clarify that deletion happens without additional warnings after 24h
- Replace 'suspension' with 'deletion' per SME interview
- Spot reclamation is direct deletion, not account suspension
- Simplify data deletion table (remove 48h volumes row - not applicable)
- Clarify that local NVMe is erased as part of deletion process
- Add minimum balance requirement (500 EUR/USD for card payments)
- Add bank transfer option with Sales contact
- Link to GPU Cloud billing page
- Clarify what appears in cluster type selector
- Describe warning banner visual appearance (yellow)
- Specify that flavor card shows hourly and monthly rates
- Add GPU cluster type selector screenshot
- Add Spot selected with warning banner screenshot
- Add Spot flavor card with pricing screenshot
- Add cluster capacity section overview screenshot
- Replace old screenshots with higher quality versions
- Improve UI-to-text correlation throughout article
- Add 'Out of Stock' explanation in Availability section
- Remove spot-selected-with-warning.png (duplicated selector + warning)
- Remove cluster-capacity-section.png (duplicated selector view)
- Keep only 3 unique screenshots: selector, warning banner, price
- Update article to remove reference to deleted screenshot
- Add step-region.png showing region selector
- Add step-gpu-cluster-type.png showing Spot selected with warning
- Update Availability section with more informative screenshot
- Add region screenshot to Creating section step 3
- Remove duplicate gpu-cluster-type-selector.png
… screenshots with unique ones
- Remove redundant content and repetitions
- Simplify text structure, remove bullet pseudo-headers
- Fix broken link to object storage
- Remove Best practices section
# Conflicts:
#	docs.json
Sergey Kostichev added 2 commits January 16, 2026 11:59
# Conflicts:
#	edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster.mdx
Sergey Kostichev added 2 commits January 16, 2026 12:07
# Conflicts:
#	edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster.mdx
@kostichs kostichs merged commit 6bdb6ba into main Jan 19, 2026
3 checks passed
@kostichs kostichs deleted the DOC-1144 branch January 19, 2026 09:49
2 participants