A technical comparison of WhisperX, Faster-Whisper, and OpenAI Whisper for automatic speech recognition (ASR) inference on AMD GPUs with ROCm.
| Feature | OpenAI Whisper | Faster-Whisper | WhisperX |
|---|---|---|---|
| Speed | Baseline (1x) | 4-8x faster | 4-8x faster + optimizations |
| AMD GPU Support | Native PyTorch ROCm | CTranslate2 ROCm support | PyTorch ROCm + optimizations |
| Memory Usage | High | Low (~4x reduction) | Medium |
| Accuracy | Baseline | Same as OpenAI | Same as OpenAI |
| Word Timestamps | Basic | Basic | Accurate (phoneme-level) |
| Speaker Diarization | No | No | Yes (built-in) |
| VAD | No | Yes (Silero) | Yes (improved) |
| Ease of Setup | Easy | Easy | Moderate |
| Best For | Development/testing | Production inference | Production + diarization |
OpenAI Whisper
Implementation: Python, PyTorch-based
Repository: https://github.com/openai/whisper
- ✅ Reference implementation - most compatible
- ✅ Easy to install: pip install openai-whisper
- ✅ Native ROCm support through PyTorch
- ✅ Well-documented, widely tested
- ✅ Direct model access and customization
- ❌ Slowest inference speed
- ❌ High memory usage
- ❌ No built-in VAD (processes silence)
- ❌ Basic word-level timestamps (not very accurate)
- ❌ No speaker diarization
# Install with ROCm PyTorch
pip install openai-whisper torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
# May need GFX override for certain AMD GPUs
export HSA_OVERRIDE_GFX_VERSION=11.0.0 # Treat gfx1101 (RX 7700/7800 XT) as the supported gfx1100 target

import whisper
model = whisper.load_model("base") # tiny, base, small, medium, large-v3
result = model.transcribe("audio.mp3")
print(result["text"])- tiny: ~5-8x realtime
- base: ~3-5x realtime
- small: ~2-3x realtime
- medium: ~1-1.5x realtime
- large-v3: ~0.5-1x realtime
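To check these realtime factors on your own GPU, a minimal timing sketch (assumes an audio.mp3 in the working directory; Whisper decodes audio to 16 kHz, so duration = samples / SAMPLE_RATE):
import time
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")            # float32 PCM at 16 kHz
duration = len(audio) / whisper.audio.SAMPLE_RATE  # audio length in seconds

start = time.time()
result = model.transcribe(audio)
elapsed = time.time() - start

print(f"Audio {duration:.1f}s, transcribed in {elapsed:.1f}s "
      f"({duration / elapsed:.1f}x realtime)")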
Faster-Whisper
Implementation: Python wrapper around CTranslate2 (C++/CUDA/ROCm)
Repository: https://github.com/SYSTRAN/faster-whisper
- ✅ 4-8x faster than OpenAI Whisper
- ✅ ~4x less memory usage (quantization support)
- ✅ Built-in VAD (Silero) - skips silence automatically
- ✅ ROCm support through CTranslate2
- ✅ Int8/Float16 quantization for speed/memory savings
- ✅ Easy drop-in replacement for OpenAI Whisper
- ❌ Requires CTranslate2 with ROCm build
- ❌ Basic word timestamps (same limitations as OpenAI)
- ❌ No speaker diarization
- ❌ Slightly more complex setup
# Install CTranslate2 with ROCm support
pip install ctranslate2-rocm
# Install faster-whisper
pip install faster-whisper
# May need GFX override
export HSA_OVERRIDE_GFX_VERSION=11.0.0

from faster_whisper import WhisperModel
model = WhisperModel("base", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, vad_filter=True)
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")- tiny: ~20-40x realtime (with VAD)
- base: ~15-25x realtime
- small: ~10-15x realtime
- medium: ~5-8x realtime
- large-v3: ~2-4x realtime
Compute type options:
- float16: Best balance (minimal accuracy loss)
- int8: Faster, lower memory, slight accuracy loss (~1-2%)
- float32: Slower, higher memory, no benefit
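As a sketch of the int8 option on a VRAM-constrained GPU (model size and file name are placeholders; int8_float16 is a standard CTranslate2 compute type that stores int8 weights with float16 compute):
from faster_whisper import WhisperModel

# int8 weights with float16 activations: lowest GPU memory of the common options
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("audio.mp3", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(segment.text)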
WhisperX
Implementation: Python, combines Faster-Whisper + phoneme alignment + diarization
Repository: https://github.com/m-bain/whisperX
- ✅ Same speed as Faster-Whisper (uses it internally)
- ✅ Accurate word-level timestamps (phoneme-based alignment)
- ✅ Built-in speaker diarization (pyannote.audio)
- ✅ Improved VAD over Faster-Whisper
- ✅ Best for multi-speaker scenarios
- ✅ ROCm support (inherits from PyTorch + Faster-Whisper)
- ❌ More complex setup (multiple dependencies)
- ❌ Higher memory usage than Faster-Whisper (alignment models)
- ❌ Requires HuggingFace token for diarization models
- ❌ Slower than Faster-Whisper when using all features
- ❌ More dependencies = more potential for conflicts
# Install WhisperX
pip install git+https://github.com/m-bain/whisperx.git
# Install PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
# For diarization (optional, requires HF token)
# Get token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"
# May need GFX override
export HSA_OVERRIDE_GFX_VERSION=11.0.0

import whisperx
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16
compute_type = "float16"
# 1. Transcribe with Faster-Whisper
model = whisperx.load_model("base", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
# 2. Align whisper output (accurate word timestamps)
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
# 3. Assign speaker labels (diarization)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
# Print results with speaker labels
for segment in result["segments"]:
print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] Speaker {segment.get('speaker', 'UNKNOWN')}: {segment['text']}")- Transcription only: Same as Faster-Whisper
- With alignment: ~20% slower than Faster-Whisper
- With diarization: ~40-60% slower (depends on audio length)
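The alignment pass (step 2 above) is what adds per-word timing. A small sketch of reading it from the aligned result, assuming the word entries carry word, start, and end keys as in current WhisperX output:
# Continues the pipeline above: result comes from whisperx.align(...)
for segment in result["segments"]:
    for word in segment.get("words", []):
        # Words that failed to align may lack timestamps
        if "start" in word and "end" in word:
            print(f"{word['start']:7.2f}s - {word['end']:7.2f}s  {word['word']}")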
Choose OpenAI Whisper when:
- Quick testing/development
- Need reference implementation
- Customizing model architecture
- Simplicity is priority
- Speed not critical
Choose Faster-Whisper when:
- Production inference (speed + memory critical)
- Single-speaker audio
- Don't need accurate word timestamps
- Want best performance on AMD GPU
- Recommended for most ASR use cases
Choose WhisperX when:
- Need accurate word-level timestamps
- Multi-speaker audio (podcasts, meetings, interviews)
- Need speaker diarization
- Post-processing/subtitle generation with precise timing (see the SRT sketch below)
- Research/analysis requiring segment-level accuracy
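For the subtitle use case, a minimal sketch that writes segment-level SRT from the result of the pipeline above (the file name and helper are illustrative, not part of WhisperX):
def write_srt(segments, path):
    """Write segment-level subtitles; each segment needs 'start', 'end', 'text'."""
    def ts(seconds):
        ms = int(round(seconds * 1000))  # SRT timestamps are HH:MM:SS,mmm
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n\n")

# e.g. after alignment/diarization:
# write_srt(result["segments"], "audio.srt")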
ROCm requirements:
- ROCm 5.7+ recommended (6.0+ preferred)
- PyTorch with ROCm support
- CTranslate2 with ROCm backend (for Faster-Whisper/WhisperX)
# RDNA3 cards (gfx1100/gfx1101, RX 7000 series)
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export ROCM_PATH=/opt/rocm

# RDNA2 cards (gfx103x, RX 6000 series)
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export ROCM_PATH=/opt/rocm
- Check compatibility: rocminfo | grep gfx
- May need older ROCm versions or CPU fallback
Verify that PyTorch sees the GPU:
import torch
print(f"ROCm available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")- Use appropriate model size: Don't use
large-v3ifbasemeets accuracy needs - Enable VAD: Reduces processing time by skipping silence
- Batch processing: Process multiple files concurrently (see the sketch below)
- Float16 precision: Minimal accuracy loss, significant speedup
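A minimal sketch of the batch-processing tip, assuming a hypothetical audio_dir of MP3 files and one shared Faster-Whisper model; num_workers > 1 lets multiple Python threads transcribe in parallel:
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from faster_whisper import WhisperModel

# One shared model; num_workers > 1 allows concurrent transcribe() calls
model = WhisperModel("base", device="cuda", compute_type="float16", num_workers=2)

def transcribe_file(path):
    segments, _ = model.transcribe(str(path), vad_filter=True)
    return path.name, " ".join(s.text.strip() for s in segments)

files = sorted(Path("audio_dir").glob("*.mp3"))  # hypothetical input directory
with ThreadPoolExecutor(max_workers=2) as pool:
    for name, text in pool.map(transcribe_file, files):
        print(f"{name}: {text[:80]}")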
A tuned Faster-Whisper configuration:
from faster_whisper import WhisperModel

model = WhisperModel(
    "base",
    device="cuda",
    compute_type="float16",  # or "int8" for more speed
    num_workers=4,  # Allows parallel transcriptions from multiple threads
    download_root="~/ai/models/stt/faster-whisper/"  # Local model cache
)
segments, info = model.transcribe(
    audio,
    beam_size=5,  # Lower = faster (default: 5)
    vad_filter=True,  # Enable VAD
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 100
    }
)
For WhisperX, reduce memory use by lowering the batch size and skipping pipeline stages you don't need:
# Use lower batch_size if running out of memory
batch_size = 8 # or 4 for lower memory
# Skip diarization if not needed
result = model.transcribe(audio, batch_size=batch_size)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
# Skip: diarize_model and assign_word_speakers

| Model | OpenAI Whisper | Faster-Whisper | WhisperX (full) |
|---|---|---|---|
| tiny | ~12 min | ~1.5 min | ~2 min |
| base | ~20 min | ~2.5 min | ~3.5 min |
| small | ~35 min | ~5 min | ~7 min |
| medium | ~55 min | ~9 min | ~13 min |
| large-v3 | ~90 min | ~18 min | ~25 min |
Note: Times are approximate and depend on audio characteristics (speech density, background noise, etc.)
For general ASR on AMD GPU: → Faster-Whisper (best speed/accuracy/memory balance)
For multi-speaker content: → WhisperX (diarization + accurate timestamps worth the overhead)
For development/testing: → OpenAI Whisper (simplest setup, reference implementation)