This CLAUDE.md provides context for Claude Code to manage PyTorch model training on Lambda Labs GPU infrastructure.
- Always run Claude Code from a project directory, not from the home directory
- This is a single-purpose training environment - no virtual environments needed
- Git LFS is often not pre-installed - check and install if needed
- NumPy 2.x breaks system PyTorch - always use NumPy <2.0
- pip installs to user directory - this is normal on Lambda Labs
- Platform: Lambda Cloud GPU Instance
- OS: Ubuntu 22.04 LTS with Lambda Stack
- Pre-installed: PyTorch, CUDA 12.x, cuDNN, Python 3.10+
- Important: System PyTorch is compiled with NumPy 1.x
- Storage: Root volume (ephemeral) + Persistent NFS at /lambda/nfs/<FILESYSTEM_NAME>
- Key Point: Root volume data is deleted on instance termination
# Quick environment check
python -c "import torch; import numpy; print(f'PyTorch: {torch.__version__}'); print(f'NumPy: {numpy.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"
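The one-liner above stops at the first failed import, which hides the state of everything after it. A minimal sketch of the same check as a script follows. The helper names `report_versions` and `numpy_major` are illustrative and not part of any library. It reports a missing package instead of raising, and flags NumPy 2.x because the system PyTorch is built against NumPy 1.x.

```python
import importlib


def report_versions(modules=("torch", "numpy")):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


def numpy_major(version):
    """Return the major version number from a string such as '1.26.4'."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    found = report_versions()
    for name, version in found.items():
        print(f"{name}: {version or 'NOT installed'}")
    if found["numpy"] and numpy_major(found["numpy"]) >= 2:
        print('NumPy 2.x detected; consider: pip install "numpy<2.0"')
    if found["torch"]:
        import torch
        print(f"CUDA available: {torch.cuda.is_available()}")
```

Comparing the parsed major version as an integer avoids the pitfalls of comparing version strings character by character.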
# Check for common ML tools
for cmd in git git-lfs gh tmux nvidia-smi; do
    command -v "$cmd" &> /dev/null && echo "✓ $cmd installed" || echo "✗ $cmd NOT installed"
done
# Install GitHub CLI if not present
# Note: The apt version may be outdated (2.4.0). This installs the latest version.
if ! command -v gh &> /dev/null; then
    curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null
    sudo apt update && sudo apt install -y gh
fi
# Authenticate with GitHub
# Option 1: Browser authentication (manually open the URL it provides)
gh auth login --web
# Option 2: Token authentication
# Create token at: https://github.com/settings/tokens
# Scopes needed: repo, workflow, read:org
echo "YOUR_PERSONAL_ACCESS_TOKEN" | gh auth login --with-token
# Check authentication status
gh auth status
# Clone repository
gh repo clone <username>/<repo> [directory]
# Clone specific branch (use -- to pass flags to git)
gh repo clone <username>/<repo> [directory] -- --branch develop
# OR clone then checkout
gh repo clone <username>/<repo> [directory]
cd [directory] && git checkout develop
# After cloning a repo with LFS files
git lfs pull
# Typical workflow
gh repo clone <username>/<repo> project
cd project
# Check if Git LFS is installed
if ! command -v git-lfs &> /dev/null; then
    # Install Git LFS
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    sudo apt-get install -y git-lfs
    git lfs install
else
    echo "Git LFS is already installed"
fi
# After cloning a repo with LFS files
git lfs pull
# IMPORTANT: Lambda Labs PyTorch is compiled with NumPy 1.x
# Installing NumPy 2.x will cause compatibility errors!
# Check current NumPy version
python -c "import numpy; print(f'NumPy version: {numpy.__version__}')"
# If you see NumPy 2.x errors, pin NumPy version:
pip install "numpy<2.0"
# Or add to requirements file:
echo "numpy<2.0" >> requirements.txt
# Look for requirements files in this order
if [ -f requirements/requirements.txt ]; then
    pip install -r requirements/requirements.txt
elif [ -f requirements/base.txt ]; then
    pip install -r requirements/base.txt
elif [ -f requirements.txt ]; then
    pip install -r requirements.txt
fi
# Check for additional requirement files
for req in requirements/*.txt; do
    if [ -f "$req" ]; then
        echo "Found requirements file: $req"
    fi
done
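The same search order can be applied from Python, for example inside a setup script. This is a minimal sketch: `find_requirements` and `CANDIDATES` are hypothetical names, and the order simply mirrors the shell snippet above.

```python
from pathlib import Path

# Search order taken from the shell snippet above.
CANDIDATES = (
    "requirements/requirements.txt",
    "requirements/base.txt",
    "requirements.txt",
)


def find_requirements(root="."):
    """Return the first requirements file that exists under root, or None."""
    for relative in CANDIDATES:
        path = Path(root) / relative
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    chosen = find_requirements()
    if chosen is None:
        print("No requirements file found")
    else:
        print(f"Install with: pip install -r {chosen}")
```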
# Note: pip will default to user installation (--user) on Lambda Labs
# This is normal and expected behavior
# For faster installs, avoid --force-reinstall unless necessary
# For specific package issues, use --no-deps:
# pip install --no-deps package_name
# For long installations that might timeout:
# Use tmux before installing large dependencies
tmux new -s install
# Then run pip install commands
# Detach with Ctrl+B, D if needed
# List every importable top-level module (to confirm a package is actually installed)
python -c "import pkgutil; print('\n'.join([m.name for m in pkgutil.iter_modules()]))" | sort
# Find all imports in Python files (helps identify missing requirements)
find . -name "*.py" -exec grep -h "^import\|^from.*import" {} \; | sort | uniq
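The grep above lists import lines but does not say which of them would fail. A minimal sketch that parses each file with the standard `ast` module and reports top-level modules that cannot be found in the current environment; `imported_modules` and `missing_modules` are hypothetical helper names.

```python
import ast
import importlib.util
from pathlib import Path


def imported_modules(source):
    """Return the set of top-level module names imported by Python source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import utils"
            names.add(node.module.split(".")[0])
    return names


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return sorted(n for n in names if importlib.util.find_spec(n) is None)


if __name__ == "__main__":
    found = set()
    for path in Path(".").rglob("*.py"):
        try:
            found |= imported_modules(path.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError):
            continue  # skip files that do not parse or decode
    print("Not importable:", missing_modules(found))
```

Run it from the project root so that the project's own packages resolve. The reported names are import names, which do not always match the pip package name (for example `yaml` is installed as `PyYAML`).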
# Common missing packages in ML projects:
# pip install transformers # Often used but not in requirements
# pip install wandb tensorboard # Common monitoring tools
# 1. Verify NumPy compatibility
python -c "import numpy; assert int(numpy.__version__.split('.')[0]) < 2, f'NumPy {numpy.__version__} may cause issues with system PyTorch'"
# 2. Test key imports before running full training
python -c "import torch, numpy; print('Core imports OK')"
# 3. Check if all imports in training script are available
# Example: python -c "from transformers import AutoModel"
# Check common locations for training scripts
find . -name "train*.py" -type f 2>/dev/null
find scripts -name "*.py" -type f 2>/dev/null
find src -name "train*.py" -type f 2>/dev/null
# Check script arguments
python scripts/train.py --help # If exists
python train.py --help # If in root
# Start new session for training
tmux new -s training
# Start session with command
tmux new -s training -d "python train.py"
# List sessions
tmux ls
# Attach to session
tmux attach -t training
# Detach from session (while in tmux)
# Press: Ctrl+B, then D
# Kill session
tmux kill-session -t training
- Ctrl+B, C - Create new window
- Ctrl+B, N - Next window
- Ctrl+B, P - Previous window
- Ctrl+B, % - Split pane horizontally
- Ctrl+B, " - Split pane vertically
- Ctrl+B, arrow keys - Navigate between panes
- Ctrl+B, D - Detach from session
# Control which GPUs to use
export CUDA_VISIBLE_DEVICES=0 # Use GPU 0
export CUDA_VISIBLE_DEVICES=0,1 # Use GPU 0 and 1
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Use GPUs 0-3 (all GPUs on a 4-GPU instance)
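The same selection can be made from inside a Python script, provided the variable is set before CUDA is initialized (in practice, before `import torch`). A minimal sketch; `select_gpus` is a hypothetical helper name, not a PyTorch API.

```python
import os


def select_gpus(indices):
    """Restrict this process to the given GPU indices via CUDA_VISIBLE_DEVICES."""
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this before importing torch so the setting is in place when CUDA initializes.
select_gpus([0, 1])

# import torch  # torch.cuda.device_count() would then report at most 2 devices
```

Inside the process the visible devices are renumbered from 0, so physical GPUs 2 and 3 selected this way appear as `cuda:0` and `cuda:1`.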
# Monitor GPU usage
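watch -n 1 nvidia-smi
# If CUDA out of memory errors: empty_cache() only releases cached memory in the
# process that calls it, so call it inside the training script, not via a separate python -c
# torch.cuda.empty_cache()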
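When out-of-memory errors persist, a smaller batch size is the usual remedy. A minimal sketch of a retry loop that halves the batch size until one step fits. `run_step` stands in for the project's own training step, and `largest_fitting_batch_size` is a hypothetical helper. It assumes the failure surfaces as a `RuntimeError` whose message contains "out of memory", which is how PyTorch typically reports CUDA OOM.

```python
def largest_fitting_batch_size(run_step, start, minimum=1):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not hide it
            batch_size //= 2
    raise RuntimeError(f"No batch size >= {minimum} fits in memory")


# Stand-in for a real training step: pretend anything above 16 samples does not fit.
def fake_step(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory (simulated)")


print(largest_fitting_batch_size(fake_step, start=128))  # prints 16
```

In a real run, the step function would build a batch of the given size and run one forward and backward pass in the training process itself.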
# Find processes using the GPU (this only lists them; stop a stale one with: kill <PID>)
sudo fuser -v /dev/nvidia*
# Check persistent storage
df -h /lambda/nfs/
# Copy data from persistent to local (faster for training)
cp -r /lambda/nfs/datasets/mydataset ~/data/
# Save important files to persistent storage
cp -r models/ /lambda/nfs/$USER/models_backup/
# Use rsync for large transfers with progress
rsync -avP source/ destination/
- /lambda/nfs/ is persistent across instance terminations
- Home directory and root volume are ephemeral
- Always backup important results to persistent storage
- Persistent storage costs $0.20/GB/month
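Because the root volume is lost on termination, results need to reach persistent storage before the instance is shut down. A minimal sketch of a backup helper that a training script could call after saving checkpoints. `backup_dir` and the timestamped folder layout are assumptions for illustration, and the destination path is a placeholder for the actual filesystem mount.

```python
import shutil
import time
from pathlib import Path


def backup_dir(source, destination_root):
    """Copy source into a new timestamped folder under destination_root and return its path."""
    source = Path(source)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(destination_root) / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # fails if target already exists, so nothing is overwritten
    return target


# Example (paths are placeholders):
# backup_dir("models", "/lambda/nfs/<FILESYSTEM_NAME>/backups")
```

Writing each backup to a fresh folder keeps earlier checkpoints intact, at the cost of extra storage.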
# Start TensorBoard (if training logs to tensorboard)
tensorboard --logdir logs --host 0.0.0.0 --port 6006
# Access via SSH tunnel from local machine:
# ssh -L 6006:localhost:6006 ubuntu@<instance-ip>
# GPU monitoring
nvidia-smi
# CPU and memory
htop
# Disk usage
df -h
# Follow training logs
tail -f logs/training.log # If log file exists
# Useful environment settings
export PYTHONUNBUFFERED=1 # See print statements immediately
export CUDA_LAUNCH_BLOCKING=1 # For debugging CUDA errors only; it slows training, so unset it for real runs
# Cache directories (create on persistent storage)
export HF_HOME=/lambda/nfs/$USER/cache/huggingface
export TORCH_HOME=/lambda/nfs/$USER/cache/torch
mkdir -p "$HF_HOME" "$TORCH_HOME"
# Verify GitHub authentication
gh auth status
# Check status but DO NOT commit unless explicitly instructed
git status
git diff
# When instructed to commit, use descriptive messages
# git add <files>
# git commit -m "descriptive message here"
# Always use git lfs for large files
git lfs track "*.pth"
git lfs track "*.ckpt"
git lfs track "*.bin"
- Cannot pause instances (only terminate)
- Partial GPU usage not supported (can't use half a GPU)
- Data on root volume deleted on termination
- Use persistent storage for anything important
- JupyterLab accessible through Lambda Cloud dashboard
- Use SSH tunneling for other services:
# From local machine:
ssh -L 8888:localhost:8888 ubuntu@<instance-ip> # Jupyter
ssh -L 6006:localhost:6006 ubuntu@<instance-ip> # TensorBoard
# Check Lambda Stack version
lambda-stack-version
# System information
lsb_release -a # Ubuntu version
python --version
pip list | grep torch
# Disk space check
df -h /
df -h /lambda/nfs/
# Error: "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.x"
# Fix: Downgrade NumPy
pip install "numpy<2.0" --force-reinstall
# Before running training, verify imports:
python -c "from <module> import <function>" # Test specific imports
# Common missing packages:
pip install transformers # If using HuggingFace models
pip install timm # If using PyTorch Image Models
pip install wandb # If using Weights & Biases logging
# Reduce batch size in training arguments
# Enable mixed precision if supported
# Clear cache from inside the training process: torch.cuda.empty_cache()
# Copy data to local disk first
# Increase num_workers in DataLoader
# Check I/O with: iotop
# Always use tmux for long-running jobs
# Add to ~/.ssh/config on local machine:
# Host lambda
#     ServerAliveInterval 60
#     ServerAliveCountMax 3