Lambda Labs PyTorch Training Environment

This CLAUDE.md provides context for Claude Code to manage PyTorch model training on Lambda Labs GPU infrastructure.

Important Notes

Always run Claude Code from a project directory, not from home directory
This is a single-purpose training environment - no virtual environments needed
Git LFS is often not pre-installed - check and install if needed
NumPy 2.x breaks system PyTorch - always use NumPy <2.0
pip installs to user directory - this is normal on Lambda Labs

Environment Overview

Platform: Lambda Cloud GPU Instance
OS: Ubuntu 22.04 LTS with Lambda Stack
Pre-installed: PyTorch, CUDA 12.x, cuDNN, Python 3.10+
Important: System PyTorch is compiled with NumPy 1.x
Storage: Root volume (ephemeral) + Persistent NFS at /lambda/nfs/<FILESYSTEM_NAME>
Key Point: Root volume data is deleted on instance termination

Environment Verification

# Quick environment check
python -c "import torch; import numpy; print(f'PyTorch: {torch.__version__}'); print(f'NumPy: {numpy.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"

# Check for common ML tools
for cmd in git git-lfs gh tmux nvidia-smi; do
    command -v $cmd &> /dev/null && echo "✓ $cmd installed" || echo "✗ $cmd NOT installed"
done

Initial Setup

GitHub Authentication and Repository Setup

# Install GitHub CLI if not present
# Note: The apt version may be outdated (2.4.0). This installs the latest version.
if ! command -v gh &> /dev/null; then
    curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null
    sudo apt update && sudo apt install -y gh
fi

# Authenticate with GitHub
# Option 1: Browser authentication (manually open the URL it provides)
gh auth login --web

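# Option 2: Token authentication
# Create token at: https://github.com/settings/tokens
# Scopes needed: repo, workflow, read:org
# Reading the token from a file keeps it out of shell history
# (token.txt is a placeholder file name):
# gh auth login --with-token < token.txt
echo "YOUR_PERSONAL_ACCESS_TOKEN" | gh auth login --with-token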

# Check authentication status
gh auth status

# Clone repository
gh repo clone <username>/<repo> [directory]

# Clone specific branch (use -- to pass flags to git)
gh repo clone <username>/<repo> [directory] -- --branch develop
# OR clone then checkout
gh repo clone <username>/<repo> [directory]
cd [directory] && git checkout develop

# After cloning a repo with LFS files (requires Git LFS; see Git LFS Setup below)
git lfs pull

# Typical workflow
gh repo clone <username>/<repo> project
cd project

Git LFS Setup (Usually Required)

# Check if Git LFS is installed
if ! command -v git-lfs &> /dev/null; then
    # Install Git LFS
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    sudo apt-get install -y git-lfs
    git lfs install
else
    echo "Git LFS is already installed"
fi

# After cloning a repo with LFS files
git lfs pull

Installing Dependencies

Critical: NumPy Version Compatibility

# IMPORTANT: Lambda Labs PyTorch is compiled with NumPy 1.x
# Installing NumPy 2.x will cause compatibility errors!

# Check current NumPy version
python -c "import numpy; print(f'NumPy version: {numpy.__version__}')"

# If you see NumPy 2.x errors, pin NumPy version:
pip install "numpy<2.0"

# Or add to requirements file:
echo "numpy<2.0" >> requirements.txt
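
Running the echo line above more than once leaves duplicate pins in the file. A minimal sketch of an idempotent version follows; ensure_numpy_pin is a hypothetical helper name, not part of pip, and it assumes one requirement per line.

```shell
# Append a "numpy<2.0" pin to a requirements file unless a numpy
# requirement is already listed (hypothetical helper, not a pip feature).
ensure_numpy_pin() {
    req_file="$1"
    # Create the file if it does not exist yet
    [ -f "$req_file" ] || : > "$req_file"
    # Match "numpy" followed by a version specifier, a space, or end of line,
    # so that packages such as numpy-financial are not mistaken for numpy
    if grep -Eiq '^numpy([<>=!~ ]|$)' "$req_file"; then
        echo "numpy already listed in $req_file"
    else
        # Add a newline first if the last line is not terminated
        [ -n "$(tail -c 1 "$req_file")" ] && echo >> "$req_file"
        echo "numpy<2.0" >> "$req_file"
        echo "added numpy<2.0 to $req_file"
    fi
}
```

Usage: ensure_numpy_pin requirements.txt. The helper only detects that numpy is listed; it does not check whether an existing specifier (for example numpy>=2) is itself compatible, so review that line by hand.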

Check and Install Requirements

# Look for requirements files in this order
if [ -f requirements/requirements.txt ]; then
    pip install -r requirements/requirements.txt
elif [ -f requirements/base.txt ]; then
    pip install -r requirements/base.txt
elif [ -f requirements.txt ]; then
    pip install -r requirements.txt
fi

# Check for additional requirement files
for req in requirements/*.txt; do
    if [ -f "$req" ]; then
        echo "Found requirements file: $req"
    fi
done
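
The lookup order above can be wrapped in a function so that later steps reuse the same choice. This is a sketch only; find_requirements is a hypothetical name, and the candidate paths are the three listed above.

```shell
# Print the first requirements file found, using the same priority order
# as the if/elif chain above. Returns non-zero if none exists.
find_requirements() {
    project_dir="${1:-.}"
    for candidate in requirements/requirements.txt requirements/base.txt requirements.txt; do
        if [ -f "$project_dir/$candidate" ]; then
            echo "$project_dir/$candidate"
            return 0
        fi
    done
    return 1
}
```

Usage: req=$(find_requirements .) && pip install -r "$req"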

# Note: pip will default to user installation (--user) on Lambda Labs
# This is normal and expected behavior

# For faster installs, avoid --force-reinstall unless necessary
# For specific package issues, use --no-deps:
# pip install --no-deps package_name

# For long installations that might timeout:
# Use tmux before installing large dependencies
tmux new -s install
# Then run pip install commands
# Detach with Ctrl+B, D if needed

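# List every importable top-level module (this shows what is installed;
# it does not by itself verify the project's imports, so compare it with the list below)
python -c "import pkgutil; print('\n'.join([m.name for m in pkgutil.iter_modules()]))" | sort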

# Find all imports in Python files (helps identify missing requirements)
find . -name "*.py" -exec grep -h "^import\|^from.*import" {} \; | sort | uniq
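
The raw import lines from the command above are easier to compare with installed packages once reduced to top-level module names. A rough sketch, assuming unindented imports; list_top_level_imports is a hypothetical helper, and for a line like "import os, sys" it reports only the first module.

```shell
# Print the unique top-level module names imported by .py files under a directory.
list_top_level_imports() {
    search_dir="${1:-.}"
    find "$search_dir" -name '*.py' -type f \
        -exec grep -hE '^(import|from)[[:space:]]' {} + |
        # $2 is the module path, e.g. "torch.nn"; keep the part before the first dot,
        # drop a trailing comma, and skip relative imports ("from . import x")
        awk '{ split($2, parts, "[.]"); name = parts[1]; sub(/,$/, "", name); if (name != "") print name }' |
        sort -u
}
```

Module names do not always match package names on PyPI (for example, the module yaml is provided by the package PyYAML), so treat the output as a checklist rather than a requirements file.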

# Common missing packages in ML projects:
# pip install transformers  # Often used but not in requirements
# pip install wandb tensorboard  # Common monitoring tools

Finding and Running Training Scripts

Pre-flight Checks

# 1. Verify NumPy compatibility
python -c "import numpy; major = int(numpy.__version__.split('.')[0]); assert major < 2, f'NumPy {numpy.__version__} may cause issues with system PyTorch'"

# 2. Test key imports before running full training
python -c "import torch, numpy; print('Core imports OK')"

# 3. Check if all imports in training script are available
# Example: python -c "from transformers import AutoModel"

Locate Training Scripts

# Check common locations for training scripts
find . -name "train*.py" -type f 2>/dev/null
find scripts -name "*.py" -type f 2>/dev/null
find src -name "train*.py" -type f 2>/dev/null

# Check script arguments
python scripts/train.py --help  # If exists
python train.py --help          # If in root

tmux for Long-Running Jobs

Essential tmux Commands

# Start new session for training
tmux new -s training

# Start session with command
tmux new -s training -d "python train.py"

# List sessions
tmux ls

# Attach to session
tmux attach -t training

# Detach from session (while in tmux)
# Press: Ctrl+B, then D

# Kill session
tmux kill-session -t training

tmux Key Shortcuts (after Ctrl+B)

c - Create new window
n - Next window
p - Previous window
% - Split pane left/right (side by side)
" - Split pane top/bottom (stacked)
Arrow keys - Navigate between panes
d - Detach from session

GPU Management

Environment Variables

# Control which GPUs to use
export CUDA_VISIBLE_DEVICES=0      # Use GPU 0
export CUDA_VISIBLE_DEVICES=0,1    # Use GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Use GPUs 0-3 (all GPUs on a 4-GPU instance)

# Monitor GPU usage
watch -n 1 nvidia-smi
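
Instance types differ in GPU count, so the device list can be generated instead of typed. A small sketch; gpu_id_list is a hypothetical helper name.

```shell
# Build a comma-separated GPU index list such as "0,1,2,3" from a GPU count.
gpu_id_list() {
    count="$1"
    ids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Prefix a comma only when the list already has an entry
        ids="${ids:+$ids,}$i"
        i=$((i + 1))
    done
    echo "$ids"
}
```

Usage, assuming nvidia-smi is on the PATH (its --list-gpus option prints one line per GPU): export CUDA_VISIBLE_DEVICES="$(gpu_id_list "$(nvidia-smi --list-gpus | wc -l)")"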

Clear GPU Memory

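# If you hit CUDA out-of-memory errors:
# torch.cuda.empty_cache() only releases cached blocks inside the process that calls it,
# so call it from the training code itself; running it in a fresh interpreter frees nothing.
# Memory held by a crashed or stale process is released only when that process exits.

# Find processes using the GPU (add -k to also kill them)
sudo fuser -v /dev/nvidia*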

Storage Management

Working with Persistent Storage

# Check persistent storage (filesystems mount at /lambda/nfs/<FILESYSTEM_NAME>; adjust the example paths below to match)
df -h /lambda/nfs/

# Copy data from persistent to local (faster for training)
cp -r /lambda/nfs/datasets/mydataset ~/data/

# Save important files to persistent storage
cp -r models/ /lambda/nfs/$USER/models_backup/

# Use rsync for large transfers with progress
rsync -avP source/ destination/
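
A large copy that fills the destination fails partway through, so checking free space first can save time. A minimal sketch; has_space_for is a hypothetical helper, and it compares sizes in KiB as reported by du and df.

```shell
# Succeed only if the filesystem holding dest_dir has room for src.
has_space_for() {
    src="$1"
    dest_dir="$2"
    # Size of the source tree in KiB
    needed_kb=$(du -sk "$src" | awk '{ print $1 }')
    # Available KiB on the destination filesystem
    # (POSIX output format: second line, fourth column)
    free_kb=$(df -Pk "$dest_dir" | awk 'NR == 2 { print $4 }')
    [ "$needed_kb" -lt "$free_kb" ]
}
```

Usage: has_space_for models/ /lambda/nfs/$USER && rsync -avP models/ /lambda/nfs/$USER/models_backup/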

Important Storage Notes

/lambda/nfs/ is persistent across instance terminations
Home directory and root volume are ephemeral
Always backup important results to persistent storage
Persistent storage costs $0.20/GB/month

Monitoring Training

TensorBoard

# Start TensorBoard (if training logs to tensorboard)
tensorboard --logdir logs --host 0.0.0.0 --port 6006

# Access via SSH tunnel from local machine:
# ssh -L 6006:localhost:6006 ubuntu@<instance-ip>

System Monitoring

# GPU monitoring
nvidia-smi

# CPU and memory
htop

# Disk usage
df -h

# Follow training logs
tail -f logs/training.log  # If log file exists
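
For a compact view of memory pressure, nvidia-smi can emit machine-readable values that a short filter turns into percentages. A sketch only; gpu_mem_percent is a hypothetical helper that expects "used, total" pairs in MiB, one GPU per line, and assumes the lines arrive in GPU index order.

```shell
# Read "used, total" memory pairs (MiB), one GPU per line, and print
# the percentage of memory in use for each GPU.
gpu_mem_percent() {
    awk -F', *' '$2 > 0 { printf "GPU %d: %d%%\n", NR - 1, ($1 * 100) / $2 }'
}
```

Usage: nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | gpu_mem_percent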

Environment Variables

# Useful environment settings
export PYTHONUNBUFFERED=1  # See print statements immediately
export CUDA_LAUNCH_BLOCKING=1  # Debugging only: makes CUDA calls synchronous, which slows training

# Cache directories (create on persistent storage)
export HF_HOME=/lambda/nfs/$USER/cache/huggingface
export TORCH_HOME=/lambda/nfs/$USER/cache/torch
mkdir -p "$HF_HOME" "$TORCH_HOME"
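
If no persistent filesystem is attached, the cache paths above cannot be created. One possible guard is sketched here; pick_cache_root is a hypothetical helper that falls back to a local directory.

```shell
# Print preferred_dir if it can be created and written to;
# otherwise create and print fallback_dir.
pick_cache_root() {
    preferred_dir="$1"
    fallback_dir="$2"
    if mkdir -p "$preferred_dir" 2>/dev/null && [ -w "$preferred_dir" ]; then
        echo "$preferred_dir"
    else
        mkdir -p "$fallback_dir"
        echo "$fallback_dir"
    fi
}
```

Usage: export HF_HOME="$(pick_cache_root /lambda/nfs/$USER/cache/huggingface ~/.cache/huggingface)". A local fallback is lost on instance termination, so downloads will repeat on the next instance.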

Git Best Practices

Authentication Check

# Verify GitHub authentication
gh auth status

IMPORTANT: No Commits Without Permission

# Check status but DO NOT commit unless explicitly instructed
git status
git diff

# When instructed to commit, use descriptive messages
# git add <files>
# git commit -m "descriptive message here"

# Always use git lfs for large files
git lfs track "*.pth"
git lfs track "*.ckpt"
git lfs track "*.bin"
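
Extension patterns miss large files with other names. A sketch for spotting them before staging; large_files is a hypothetical helper, and the k size suffix is supported by GNU and BSD find rather than strict POSIX. The 100 MiB default matches GitHub's per-file limit for regular Git objects.

```shell
# List files larger than min_kb KiB under dir, skipping Git's own metadata.
large_files() {
    dir="${1:-.}"
    min_kb="${2:-102400}"   # default: 100 MiB
    find "$dir" -type f -size +"${min_kb}k" ! -path '*/.git/*' | sort
}
```

Usage: large_files . lists candidates to add with git lfs track before any commit is requested.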

Lambda Labs Specific Information

Instance Limitations

Cannot pause instances (only terminate)
Partial GPU usage not supported (can't use half a GPU)
Data on root volume deleted on termination
Use persistent storage for anything important

Network Access

JupyterLab accessible through Lambda Cloud dashboard

Use SSH tunneling for other services:

# From local machine:
ssh -L 8888:localhost:8888 ubuntu@<instance-ip>  # Jupyter
ssh -L 6006:localhost:6006 ubuntu@<instance-ip>  # TensorBoard

Quick Commands

# Check Lambda Stack version
lambda-stack-version

# System information
lsb_release -a  # Ubuntu version
python --version
pip list | grep torch

# Disk space check
df -h /
df -h /lambda/nfs/

Common Issues

NumPy Version Error

# Error: "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x"
# Fix: Downgrade NumPy
pip install "numpy<2.0" --force-reinstall

Missing Import Errors

# Before running training, verify imports:
python -c "from <module> import <function>"  # Test specific imports

# Common missing packages:
pip install transformers  # If using HuggingFace models
pip install timm  # If using PyTorch Image Models
pip install wandb  # If using Weights & Biases logging

Out of Memory

# Reduce batch size in training arguments
# Enable mixed precision if supported
# Clear cached blocks from inside the training process: torch.cuda.empty_cache()
# (running it in a separate python -c process has no effect on the training job)

Slow Data Loading

# Copy data to local disk first
# Increase num_workers in DataLoader
# Check I/O with: iotop

SSH Connection Drops

# Always use tmux for long-running jobs
# Add to ~/.ssh/config on local machine:
# Host lambda
#   ServerAliveInterval 60
#   ServerAliveCountMax 3

gregpriday/claude_lambda_labs.md
