
Whisper Variants Comparison for AMD GPU ASR

A technical comparison of WhisperX, Faster-Whisper, and OpenAI Whisper for automatic speech recognition (ASR) inference on AMD GPUs with ROCm.

Quick Summary

| Feature | OpenAI Whisper | Faster-Whisper | WhisperX |
|---------|----------------|----------------|----------|
| Speed | Baseline (1x) | 4-8x faster | 4-8x faster + optimizations |
| AMD GPU Support | Native PyTorch ROCm | CTranslate2 ROCm support | PyTorch ROCm + optimizations |
| Memory Usage | High | Low (~4x reduction) | Medium |
| Accuracy | Baseline | Same as OpenAI | Same as OpenAI |
| Word Timestamps | Basic | Basic | Accurate (phoneme-level) |
| Speaker Diarization | No | No | Yes (built-in) |
| VAD (voice activity detection) | No | Yes (Silero) | Yes (improved) |
| Ease of Setup | Easy | Easy | Moderate |
| Best For | Development/testing | Production inference | Production + diarization |

Detailed Comparison

1. OpenAI Whisper (Original)

Implementation: Python, PyTorch-based
Repository: https://github.com/openai/whisper

Pros

  • ✅ Reference implementation - most compatible
  • ✅ Easy to install: pip install openai-whisper
  • ✅ Native ROCm support through PyTorch
  • ✅ Well-documented, widely tested
  • ✅ Direct model access and customization

Cons

  • ❌ Slowest inference speed
  • ❌ High memory usage
  • ❌ No built-in VAD (processes silence)
  • ❌ Basic word-level timestamps (not very accurate)
  • ❌ No speaker diarization

AMD GPU Setup

# Install with ROCm PyTorch
pip install openai-whisper torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

# May need GFX override for certain AMD GPUs
export HSA_OVERRIDE_GFX_VERSION=11.0.1  # For gfx1101 (RX 7700/7800 XT)

Typical Usage

import whisper

model = whisper.load_model("base")  # tiny, base, small, medium, large-v3
result = model.transcribe("audio.mp3")
print(result["text"])

Performance (AMD RX 7700 XT example; "Nx realtime" means audio is processed N times faster than its duration)

  • tiny: ~5-8x realtime
  • base: ~3-5x realtime
  • small: ~2-3x realtime
  • medium: ~1-1.5x realtime
  • large-v3: ~0.5-1x realtime

2. Faster-Whisper

Implementation: Python wrapper around CTranslate2 (C++/CUDA/ROCm)
Repository: https://github.com/SYSTRAN/faster-whisper

Pros

  • ✅ 4-8x faster than OpenAI Whisper
  • ✅ ~4x less memory usage (quantization support)
  • ✅ Built-in VAD (Silero) - skips silence automatically
  • ✅ ROCm support through CTranslate2
  • ✅ Int8/Float16 quantization for speed/memory savings
  • ✅ Easy drop-in replacement for OpenAI Whisper

Cons

  • ❌ Requires CTranslate2 with ROCm build
  • ❌ Basic word timestamps (same limitations as OpenAI)
  • ❌ No speaker diarization
  • ❌ Slightly more complex setup

AMD GPU Setup

# Install CTranslate2 with ROCm support
# (note: upstream CTranslate2 does not publish ROCm wheels; this may
# require a community build or compiling from source)
pip install ctranslate2-rocm

# Install faster-whisper
pip install faster-whisper

# May need GFX override
export HSA_OVERRIDE_GFX_VERSION=11.0.1

Typical Usage

from faster_whisper import WhisperModel

# ROCm builds are addressed through the "cuda" device name
model = WhisperModel("base", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, vad_filter=True)

# segments is a generator; transcription runs lazily as you iterate over it
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Performance (AMD RX 7700 XT example)

  • tiny: ~20-40x realtime (with VAD)
  • base: ~15-25x realtime
  • small: ~10-15x realtime
  • medium: ~5-8x realtime
  • large-v3: ~2-4x realtime

Quantization Options

  • float16: Best balance (minimal accuracy loss)
  • int8: Faster, lower memory, slight accuracy loss (~1-2%)
  • float32: Slower, higher memory, no benefit
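
These options map directly to the compute_type argument; a minimal sketch (model name and device as in the usage example above):

from faster_whisper import WhisperModel

# int8: lowest memory and fastest, with a small accuracy cost
model_int8 = WhisperModel("base", device="cuda", compute_type="int8")

# float16: the usual choice for GPU inference
model_fp16 = WhisperModel("base", device="cuda", compute_type="float16")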

3. WhisperX

Implementation: Python, combines Faster-Whisper + phoneme alignment + diarization
Repository: https://github.com/m-bain/whisperX

Pros

  • ✅ Same speed as Faster-Whisper (uses it internally)
  • ✅ Accurate word-level timestamps (phoneme-based alignment)
  • ✅ Built-in speaker diarization (pyannote.audio)
  • ✅ Improved VAD over Faster-Whisper
  • ✅ Best for multi-speaker scenarios
  • ✅ ROCm support (inherits from PyTorch + Faster-Whisper)

Cons

  • ❌ More complex setup (multiple dependencies)
  • ❌ Higher memory usage than Faster-Whisper (alignment models)
  • ❌ Requires HuggingFace token for diarization models
  • ❌ Slower than Faster-Whisper when using all features
  • ❌ More dependencies = more potential for conflicts

AMD GPU Setup

# Install WhisperX
pip install git+https://github.com/m-bain/whisperx.git

# Install PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

# For diarization (optional, requires HF token)
# Get token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"

# May need GFX override
export HSA_OVERRIDE_GFX_VERSION=11.0.1

Typical Usage

import os

import whisperx

device = "cuda"
audio_file = "audio.mp3"
batch_size = 16
compute_type = "float16"

# 1. Transcribe with Faster-Whisper
model = whisperx.load_model("base", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align whisper output (accurate word timestamps)
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# 3. Assign speaker labels (diarization)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

# Print results with speaker labels
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] Speaker {segment.get('speaker', 'UNKNOWN')}: {segment['text']}")

Performance (AMD RX 7700 XT example)

  • Transcription only: Same as Faster-Whisper
  • With alignment: ~20% slower than Faster-Whisper
  • With diarization: ~40-60% slower (depends on audio length)
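
Since precise timing is WhisperX's main selling point, here is a minimal sketch of writing the aligned result from the usage example above to an SRT subtitle file (the helper functions are hypothetical, written for this example):

def srt_timestamp(seconds):
    # SRT uses HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(segments, path="output.srt"):
    # segments: result["segments"] from whisperx.align()
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")

write_srt(result["segments"])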

Use Case Recommendations

Choose OpenAI Whisper if:

  • Quick testing/development
  • Need reference implementation
  • Customizing model architecture
  • Simplicity is priority
  • Speed not critical

Choose Faster-Whisper if:

  • Production inference (speed + memory critical)
  • Single-speaker audio
  • Don't need accurate word timestamps
  • Want best performance on AMD GPU
  • Recommended for most ASR use cases

Choose WhisperX if:

  • Need accurate word-level timestamps
  • Multi-speaker audio (podcasts, meetings, interviews)
  • Need speaker diarization
  • Post-processing/subtitle generation with precise timing
  • Research/analysis requiring segment-level accuracy

AMD GPU Compatibility Notes

ROCm Requirements

  • ROCm 5.7+ recommended (6.0+ preferred)
  • PyTorch with ROCm support
  • CTranslate2 with ROCm backend (for Faster-Whisper/WhisperX)

Common AMD GPU Configurations

RX 7000 Series (gfx1101 - Navi 32, e.g. RX 7700 XT / 7800 XT)

export HSA_OVERRIDE_GFX_VERSION=11.0.1
export ROCM_PATH=/opt/rocm

RX 6000 Series (gfx1030/1031/1032 - Navi 21/22/23)

export HSA_OVERRIDE_GFX_VERSION=10.3.0
export ROCM_PATH=/opt/rocm

Older GPUs (Vega, Polaris)

  • Check compatibility: rocminfo | grep gfx
  • May need older ROCm versions or CPU fallback

Verifying GPU Usage

import torch

# ROCm builds of PyTorch report availability through the CUDA API
print(f"ROCm available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Performance Optimization Tips

For All Variants

  1. Use appropriate model size: Don't use large-v3 if base meets accuracy needs
  2. Enable VAD: Reduces processing time by skipping silence
  3. Batch processing: Process multiple files concurrently (see the sketch after this list)
  4. Float16 precision: Minimal accuracy loss, significant speedup
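
A minimal sketch of concurrent file processing with Faster-Whisper, sharing one model across a thread pool (the file list and worker counts are illustrative; num_workers is the documented knob for parallel transcribe() calls):

from concurrent.futures import ThreadPoolExecutor

from faster_whisper import WhisperModel

# num_workers > 1 allows several transcribe() calls to run in parallel
model = WhisperModel("base", device="cuda", compute_type="float16", num_workers=2)

def transcribe_file(path):
    segments, _ = model.transcribe(path, vad_filter=True)
    # consuming the generator is what actually runs the transcription
    return path, " ".join(segment.text for segment in segments)

files = ["a.mp3", "b.mp3", "c.mp3"]  # illustrative
with ThreadPoolExecutor(max_workers=2) as pool:
    for path, text in pool.map(transcribe_file, files):
        print(path, text[:80])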

Faster-Whisper Specific

model = WhisperModel(
    "base",
    device="cuda",
    compute_type="float16",  # or "int8" for more speed
    num_workers=4,           # CPU threads for preprocessing
    download_root="~/ai/models/stt/faster-whisper/"  # Local model cache
)

segments, info = model.transcribe(
    audio,
    beam_size=5,        # Lower = faster (default: 5)
    vad_filter=True,    # Enable VAD
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 100
    }
)

WhisperX Specific

# Use lower batch_size if running out of memory
batch_size = 8  # or 4 for lower memory

# Skip diarization if not needed
result = model.transcribe(audio, batch_size=batch_size)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
# Skip: diarize_model and assign_word_speakers
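
If GPU memory is tight between stages, standard PyTorch housekeeping helps; a sketch (generic PyTorch, not a WhisperX-specific API):

import gc

import torch

# Free the transcription model before loading the alignment model
del model
gc.collect()
torch.cuda.empty_cache()  # ROCm builds expose this through torch.cuda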

Benchmark Summary (1 hour audio, AMD RX 7700 XT)

| Model | OpenAI Whisper | Faster-Whisper | WhisperX (full) |
|-------|----------------|----------------|-----------------|
| tiny | ~12 min | ~1.5 min | ~2 min |
| base | ~20 min | ~2.5 min | ~3.5 min |
| small | ~35 min | ~5 min | ~7 min |
| medium | ~55 min | ~9 min | ~13 min |
| large-v3 | ~90 min | ~18 min | ~25 min |

Note: Times are approximate and depend on audio characteristics (speech density, background noise, etc.)


Final Recommendations

For general ASR on AMD GPU: Faster-Whisper (best speed/accuracy/memory balance)

For multi-speaker content: WhisperX (diarization + accurate timestamps worth the overhead)

For development/testing: OpenAI Whisper (simplest setup, reference implementation)


Additional Resources

  • OpenAI Whisper: https://github.com/openai/whisper
  • Faster-Whisper: https://github.com/SYSTRAN/faster-whisper
  • WhisperX: https://github.com/m-bain/whisperX

Generated by Claude Code | Please validate technical details and benchmark numbers for your specific hardware configuration. Performance may vary based on audio characteristics, model size, and system specifications.
