Lambda Labs PyTorch Training Environment

This CLAUDE.md provides context for Claude Code to manage PyTorch model training on Lambda Labs GPU infrastructure.

Important Notes

  • Always run Claude Code from a project directory, not from the home directory
  • This is a single-purpose training environment - no virtual environments needed
  • Git LFS is often not pre-installed - check and install if needed
  • NumPy 2.x breaks system PyTorch - always use NumPy <2.0
  • pip installs to the user directory - this is normal on Lambda Labs

Environment Overview

  • Platform: Lambda Cloud GPU Instance
  • OS: Ubuntu 22.04 LTS with Lambda Stack
  • Pre-installed: PyTorch, CUDA 12.x, cuDNN, Python 3.10+
  • Important: System PyTorch is compiled with NumPy 1.x
  • Storage: Root volume (ephemeral) + Persistent NFS at /lambda/nfs/<FILESYSTEM_NAME>
  • Key Point: Root volume data is deleted on instance termination

Environment Verification

# Quick environment check
python -c "import torch; import numpy; print(f'PyTorch: {torch.__version__}'); print(f'NumPy: {numpy.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"

# Check for common ML tools
for cmd in git git-lfs gh tmux nvidia-smi; do
    command -v $cmd &> /dev/null && echo "$cmd installed" || echo "$cmd NOT installed"
done
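
For a fuller GPU inventory, nvidia-smi's query mode gives a machine-readable summary:

# Per-GPU inventory in CSV form
nvidia-smi --query-gpu=index,name,memory.total,driver_version --format=csv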

Initial Setup

GitHub Authentication and Repository Setup

# Install GitHub CLI if not present
# Note: The apt version may be outdated (2.4.0). This installs the latest version.
if ! command -v gh &> /dev/null; then
    curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null
    sudo apt update && sudo apt install -y gh
fi

# Authenticate with GitHub
# Option 1: Browser authentication (manually open the URL it provides)
gh auth login --web

# Option 2: Token authentication
# Create token at: https://github.com/settings/tokens
# Scopes needed: repo, workflow, read:org
echo "YOUR_PERSONAL_ACCESS_TOKEN" | gh auth login --with-token

# Check authentication status
gh auth status

# Clone repository
gh repo clone <username>/<repo> [directory]

# Clone specific branch (use -- to pass flags to git)
gh repo clone <username>/<repo> [directory] -- --branch develop
# OR clone then checkout
gh repo clone <username>/<repo> [directory]
cd [directory] && git checkout develop

# Typical workflow
gh repo clone <username>/<repo> project
cd project

Git LFS Setup (Usually Required)

# Check if Git LFS is installed
if ! command -v git-lfs &> /dev/null; then
    # Install Git LFS
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    sudo apt-get install -y git-lfs
    git lfs install
else
    echo "Git LFS is already installed"
fi

# After cloning a repo with LFS files
git lfs pull

Installing Dependencies

Critical: NumPy Version Compatibility

# IMPORTANT: Lambda Labs PyTorch is compiled with NumPy 1.x
# Installing NumPy 2.x will cause compatibility errors!

# Check current NumPy version
python -c "import numpy; print(f'NumPy version: {numpy.__version__}')"

# If you see NumPy 2.x errors, pin NumPy version:
pip install "numpy<2.0"

# Or add to requirements file:
echo "numpy<2.0" >> requirements.txt

Check and Install Requirements

# Look for requirements files in this order
if [ -f requirements/requirements.txt ]; then
    pip install -r requirements/requirements.txt
elif [ -f requirements/base.txt ]; then
    pip install -r requirements/base.txt
elif [ -f requirements.txt ]; then
    pip install -r requirements.txt
fi

# Check for additional requirement files
for req in requirements/*.txt; do
    if [ -f "$req" ]; then
        echo "Found requirements file: $req"
    fi
done

# Note: pip will default to user installation (--user) on Lambda Labs
# This is normal and expected behavior
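
# Confirm where user-level installs land (standard Linux defaults shown):
python -m site --user-base   # typically ~/.local
python -m site --user-site   # typically ~/.local/lib/python3.x/site-packages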

# For faster installs, avoid --force-reinstall unless necessary
# For specific package issues, use --no-deps:
# pip install --no-deps package_name

# For long installations that might timeout:
# Use tmux before installing large dependencies
tmux new -s install
# Then run pip install commands
# Detach with Ctrl+B, then d if needed

# List installed top-level modules (cross-check against the project's imports)
python -c "import pkgutil; print('\n'.join([m.name for m in pkgutil.iter_modules()]))" | sort

# Find all imports in Python files (helps identify missing requirements)
find . -name "*.py" -exec grep -h "^import\|^from.*import" {} \; | sort | uniq

# Common missing packages in ML projects:
# pip install transformers  # Often used but not in requirements
# pip install wandb tensorboard  # Common monitoring tools
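
To turn that import scan into an actionable list, a small loop can try importing each top-level module the project references. A rough sketch - local packages and aliased imports will show up as false "missing" hits, so treat the output as a hint:

# Try importing every top-level module referenced in the project
find . -name "*.py" -exec grep -h "^import \|^from " {} \; \
    | awk '{print $2}' | tr -d ',' | cut -d. -f1 | sort -u \
    | while read -r mod; do
        python -c "import $mod" 2>/dev/null || echo "MISSING: $mod"
    done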

Finding and Running Training Scripts

Pre-flight Checks

# 1. Verify NumPy compatibility (compare the major version numerically;
#    a plain string comparison would misorder versions like 10.x)
python -c "import numpy; assert int(numpy.__version__.split('.')[0]) < 2, f'NumPy {numpy.__version__} may cause issues with system PyTorch'"

# 2. Test key imports before running full training
python -c "import torch, numpy; print('Core imports OK')"

# 3. Check if all imports in training script are available
# Example: python -c "from transformers import AutoModel"

Locate Training Scripts

# Check common locations for training scripts
find . -name "train*.py" -type f 2>/dev/null
find scripts -name "*.py" -type f 2>/dev/null
find src -name "train*.py" -type f 2>/dev/null

# Check script arguments
python scripts/train.py --help  # If exists
python train.py --help          # If in root

tmux for Long-Running Jobs

Essential tmux Commands

# Start new session for training
tmux new -s training

# Start session with command
tmux new -s training -d "python train.py"

# List sessions
tmux ls

# Attach to session
tmux attach -t training

# Detach from session (while in tmux)
# Press: Ctrl+B, then d

# Kill session
tmux kill-session -t training
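
For unattended runs, a useful pattern (the script and log paths here are assumptions) is to launch detached and tee output to a file, so progress can be followed from any shell even after an SSH drop:

# Launch detached and keep a log that survives the session
mkdir -p logs
tmux new -s training -d "python train.py 2>&1 | tee logs/training.log"

# Follow progress without attaching
tail -f logs/training.log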

tmux Key Shortcuts (after Ctrl+B)

  • c - Create new window
  • n - Next window
  • p - Previous window
  • % - Split pane horizontally (side-by-side panes)
  • " - Split pane vertically (stacked panes)
  • Arrow keys - Navigate between panes
  • d - Detach from session

GPU Management

Environment Variables

# Control which GPUs to use
export CUDA_VISIBLE_DEVICES=0      # Use GPU 0
export CUDA_VISIBLE_DEVICES=0,1    # Use GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Use all 4 GPUs

# Monitor GPU usage
watch -n 1 nvidia-smi
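
To confirm a CUDA_VISIBLE_DEVICES mask took effect, check what PyTorch actually sees:

# Device count and names as seen through the current mask
python -c "import torch; print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"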

Clear GPU Memory

# If you hit CUDA out-of-memory errors, call this inside the training process:
# torch.cuda.empty_cache()
# (running it via a separate python -c does nothing - each process has its
# own caching allocator)

# Find processes using the GPU, then kill by PID
sudo fuser -v /dev/nvidia*
# kill -9 <PID>                  # kill a specific process
# sudo fuser -k /dev/nvidia*     # or kill everything holding the devices

Storage Management

Working with Persistent Storage

# Check persistent storage
df -h /lambda/nfs/

# Copy data from persistent to local (faster for training)
cp -r /lambda/nfs/datasets/mydataset ~/data/

# Save important files to persistent storage
cp -r models/ /lambda/nfs/$USER/models_backup/

# Use rsync for large transfers with progress
rsync -avP source/ destination/

Important Storage Notes

  • /lambda/nfs/ is persistent across instance terminations
  • Home directory and root volume are ephemeral
  • Always back up important results to persistent storage (see the sketch after this list)
  • Persistent storage costs $0.20/GB/month
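
A low-effort way to do this (paths are assumptions - adjust to your layout) is a sync loop in a second tmux window that mirrors checkpoints to NFS every few minutes:

# Mirror checkpoints to persistent storage every 10 minutes
while true; do
    rsync -a checkpoints/ /lambda/nfs/$USER/checkpoints_backup/
    sleep 600
done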

Monitoring Training

TensorBoard

# Start TensorBoard (if training logs to tensorboard)
tensorboard --logdir logs --host 0.0.0.0 --port 6006

# Access via SSH tunnel from local machine:
# ssh -L 6006:localhost:6006 ubuntu@<instance-ip>

System Monitoring

# GPU monitoring
nvidia-smi

# CPU and memory
htop

# Disk usage
df -h

# Follow training logs
tail -f logs/training.log  # If log file exists

Environment Variables

# Useful environment settings
export PYTHONUNBUFFERED=1  # See print statements immediately
export CUDA_LAUNCH_BLOCKING=1  # For debugging CUDA errors

# Cache directories (create on persistent storage)
export HF_HOME=/lambda/nfs/$USER/cache/huggingface
export TORCH_HOME=/lambda/nfs/$USER/cache/torch
mkdir -p $HF_HOME $TORCH_HOME
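
These exports only apply to the current shell. To have new shells and tmux windows pick them up, append them to ~/.bashrc - the quoted heredoc keeps $USER unexpanded so it resolves at login:

cat >> ~/.bashrc <<'EOF'
export HF_HOME=/lambda/nfs/$USER/cache/huggingface
export TORCH_HOME=/lambda/nfs/$USER/cache/torch
EOF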

Git Best Practices

Authentication Check

# Verify GitHub authentication
gh auth status

IMPORTANT: No Commits Without Permission

# Check status but DO NOT commit unless explicitly instructed
git status
git diff

# When instructed to commit, use descriptive messages
# git add <files>
# git commit -m "descriptive message here"

# Always use git lfs for large files
git lfs track "*.pth"
git lfs track "*.ckpt"
git lfs track "*.bin"
# git lfs track writes patterns to .gitattributes - commit that file too
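
To confirm which files LFS is actually managing in the current checkout:

git lfs ls-files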

Lambda Labs Specific Information

Instance Limitations

  • Cannot pause instances (only terminate)
  • Partial GPU usage not supported (can't use half a GPU)
  • Data on root volume deleted on termination
  • Use persistent storage for anything important

Network Access

  • JupyterLab accessible through Lambda Cloud dashboard
  • Use SSH tunneling for other services:
    # From local machine:
    ssh -L 8888:localhost:8888 ubuntu@<instance-ip>  # Jupyter
    ssh -L 6006:localhost:6006 ubuntu@<instance-ip>  # TensorBoard

Quick Commands

# Check Lambda Stack version
lambda-stack-version

# System information
lsb_release -a  # Ubuntu version
python --version
pip list | grep torch

# Disk space check
df -h /
df -h /lambda/nfs/

Common Issues

NumPy Version Error

# Error: "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x"
# Fix: Downgrade NumPy
pip install "numpy<2.0" --force-reinstall

Missing Import Errors

# Before running training, verify imports:
python -c "from <module> import <function>"  # Test specific imports

# Common missing packages:
pip install transformers  # If using HuggingFace models
pip install timm  # If using PyTorch Image Models
pip install wandb  # If using Weights & Biases logging

Out of Memory

# Reduce batch size in training arguments
# Enable mixed precision if supported (see the sketch below)
# Call torch.cuda.empty_cache() inside the training process between phases
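
For reference, mixed precision boils down to an autocast context plus a gradient scaler. A toy, self-contained sketch (generic layer and dummy data, not your training loop):

python - <<'PY'
import torch

# One mixed-precision training step on dummy data
model = torch.nn.Linear(128, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 128, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

with torch.cuda.amp.autocast():    # forward pass in reduced precision
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()      # scale loss to avoid fp16 gradient underflow
scaler.step(opt)                   # unscales gradients, then steps the optimizer
scaler.update()                    # adjust the scale factor for the next step
print("loss:", loss.item())
PY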

Slow Data Loading

# Copy data to local disk first
# Increase num_workers in DataLoader
# Check disk I/O with: sudo iotop
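
The DataLoader knobs above look like this in practice (values are starting-point assumptions to tune, not Lambda-specific defaults):

# In your training code - multiple worker processes plus pinned memory:
# loader = DataLoader(dataset, batch_size=64, num_workers=8,
#                     pin_memory=True, persistent_workers=True)
nproc   # CPU core count; num_workers is often sized from this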

SSH Connection Drops

# Always use tmux for long-running jobs
# Add to ~/.ssh/config on local machine:
# Host lambda
#   ServerAliveInterval 60
#   ServerAliveCountMax 3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment