A technical comparison of WhisperX, Faster-Whisper, and OpenAI Whisper for automatic speech recognition (ASR) inference on AMD GPUs with ROCm.
| Feature | OpenAI Whisper | Faster-Whisper | WhisperX |
|---|---|---|---|
| Speed | Baseline (1x) | 4-8x faster | 4-8x faster + optimizations |
| AMD GPU Support | Native PyTorch ROCm | CTranslate2 ROCm support | PyTorch ROCm + optimizations |
| Memory Usage | High | Low (~4x reduction) | Medium |
| Accuracy | Baseline | Same as OpenAI | Same as OpenAI |
| Word Timestamps | Basic | Basic | Accurate (phoneme-level) |
| Speaker Diarization | No | No | Yes (built-in) |
| VAD | No | Yes (Silero) | Yes (improved) |
| Ease of Setup | Easy | Easy | Moderate |
| Best For | Development/testing | Production inference | Production + diarization |
OpenAI Whisper
Implementation: Python, PyTorch-based
Repository: https://github.com/openai/whisper
- ✅ Reference implementation - most compatible
- ✅ Easy to install: pip install openai-whisper
- ✅ Native ROCm support through PyTorch
- ✅ Well-documented, widely tested
- ✅ Direct model access and customization
- ❌ Slowest inference speed
- ❌ High memory usage
- ❌ No built-in VAD (processes silence)
- ❌ Basic word-level timestamps (not very accurate)
- ❌ No speaker diarization
# Install with ROCm PyTorch
pip install openai-whisper torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
# May need GFX override for certain AMD GPUs
export HSA_OVERRIDE_GFX_VERSION=11.0.0 # Treat gfx1101 (RX 7700/7800 XT) as the supported gfx1100 target

import whisper
model = whisper.load_model("base") # tiny, base, small, medium, large-v3
result = model.transcribe("audio.mp3")
print(result["text"])- tiny: ~5-8x realtime
- base: ~3-5x realtime
- small: ~2-3x realtime
- medium: ~1-1.5x realtime
- large-v3: ~0.5-1x realtime
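To check these realtime factors on your own GPU, a minimal timing sketch (assumes an audio.mp3 in the working directory; Whisper decodes audio to 16 kHz, so duration = samples / SAMPLE_RATE):
import time
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")            # float32 PCM at 16 kHz
duration = len(audio) / whisper.audio.SAMPLE_RATE  # audio length in seconds

start = time.time()
result = model.transcribe(audio)
elapsed = time.time() - start

print(f"Audio {duration:.1f}s, transcribed in {elapsed:.1f}s "
      f"({duration / elapsed:.1f}x realtime)")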
Faster-Whisper
Implementation: Python wrapper around CTranslate2 (C++/CUDA/ROCm)
Repository: https://github.com/SYSTRAN/faster-whisper
- ✅ 4-8x faster than OpenAI Whisper
- ✅ ~4x less memory usage (quantization support)
- ✅ Built-in VAD (Silero) - skips silence automatically
- ✅ ROCm support through CTranslate2
- ✅ Int8/Float16 quantization for speed/memory savings
- ✅ Easy drop-in replacement for OpenAI Whisper
- ❌ Requires CTranslate2 with ROCm build
- ❌ Basic word timestamps (same limitations as OpenAI)
- ❌ No speaker diarization
- ❌ Slightly more complex setup
# Install CTranslate2 with ROCm support
pip install ctranslate2-rocm
# Install faster-whisper
pip install faster-whisper
# May need GFX override
export HSA_OVERRIDE_GFX_VERSION=11.0.0

from faster_whisper import WhisperModel
model = WhisperModel("base", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, vad_filter=True)
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")- tiny: ~20-40x realtime (with VAD)
- base: ~15-25x realtime
- small: ~10-15x realtime
- medium: ~5-8x realtime
- large-v3: ~2-4x realtime
Compute type options:
- float16: Best balance (minimal accuracy loss)
- int8: Faster, lower memory, slight accuracy loss (~1-2%)
- float32: Slower, higher memory, no benefit
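As a sketch of the int8 option on a VRAM-constrained GPU (model size and file name are placeholders; int8_float16 is a standard CTranslate2 compute type that stores int8 weights with float16 compute):
from faster_whisper import WhisperModel

# int8 weights with float16 activations: lowest GPU memory of the common options
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("audio.mp3", vad_filter=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(segment.text)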
WhisperX
Implementation: Python, combines Faster-Whisper + phoneme alignment + diarization
Repository: https://github.com/m-bain/whisperX
- ✅ Same speed as Faster-Whisper (uses it internally)
- ✅ Accurate word-level timestamps (phoneme-based alignment)
- ✅ Built-in speaker diarization (pyannote.audio)
- ✅ Improved VAD over Faster-Whisper
- ✅ Best for multi-speaker scenarios
- ✅ ROCm support (inherits from PyTorch + Faster-Whisper)
- ❌ More complex setup (multiple dependencies)
- ❌ Higher memory usage than Faster-Whisper (alignment models)
- ❌ Requires HuggingFace token for diarization models
- ❌ Slower than Faster-Whisper when using all features
- ❌ More dependencies = more potential for conflicts
# Install WhisperX
pip install git+https://github.com/m-bain/whisperx.git
# Install PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
# For diarization (optional, requires HF token)
# Get token from: https://huggingface.co/settings/tokens
export HF_TOKEN="your_token_here"
# May need GFX override
export HSA_OVERRIDE_GFX_VERSION=11.0.0

import whisperx
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16
compute_type = "float16"
# 1. Transcribe with Faster-Whisper
model = whisperx.load_model("base", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
# 2. Align whisper output (accurate word timestamps)
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
# 3. Assign speaker labels (diarization)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
# Print results with speaker labels
for segment in result["segments"]:
print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] Speaker {segment.get('speaker', 'UNKNOWN')}: {segment['text']}")- Transcription only: Same as Faster-Whisper
- With alignment: ~20% slower than Faster-Whisper
- With diarization: ~40-60% slower (depends on audio length)
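The alignment pass (step 2 above) is what adds per-word timing. A small sketch of reading it from the aligned result, assuming the word entries carry word, start, and end keys as in current WhisperX output:
# Continues the pipeline above: result comes from whisperx.align(...)
for segment in result["segments"]:
    for word in segment.get("words", []):
        # Words that failed to align may lack timestamps
        if "start" in word and "end" in word:
            print(f"{word['start']:7.2f}s - {word['end']:7.2f}s  {word['word']}")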
Choose OpenAI Whisper when:
- Quick testing/development
- Need reference implementation
- Customizing model architecture
- Simplicity is priority
- Speed not critical
Choose Faster-Whisper when:
- Production inference (speed + memory critical)
- Single-speaker audio
- Don't need accurate word timestamps
- Want best performance on AMD GPU
- Recommended for most ASR use cases
Choose WhisperX when:
- Need accurate word-level timestamps
- Multi-speaker audio (podcasts, meetings, interviews)
- Need speaker diarization
- Post-processing/subtitle generation with precise timing (see the SRT sketch below)
- Research/analysis requiring segment-level accuracy
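For the subtitle use case, a minimal sketch that writes segment-level SRT from the result of the pipeline above (the file name and helper are illustrative, not part of WhisperX):
def write_srt(segments, path):
    """Write segment-level subtitles; each segment needs 'start', 'end', 'text'."""
    def ts(seconds):
        ms = int(round(seconds * 1000))  # SRT timestamps are HH:MM:SS,mmm
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n\n")

# e.g. after alignment/diarization:
# write_srt(result["segments"], "audio.srt")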
ROCm requirements:
- ROCm 5.7+ recommended (6.0+ preferred)
- PyTorch with ROCm support
- CTranslate2 with ROCm backend (for Faster-Whisper/WhisperX)
# RDNA3 cards (gfx1100/gfx1101, RX 7000 series)
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export ROCM_PATH=/opt/rocm

# RDNA2 cards (gfx103x, RX 6000 series)
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export ROCM_PATH=/opt/rocm
- Check compatibility: rocminfo | grep gfx
- May need older ROCm versions or CPU fallback
Verify that PyTorch sees the GPU:
import torch
print(f"ROCm available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")- Use appropriate model size: Don't use
large-v3ifbasemeets accuracy needs - Enable VAD: Reduces processing time by skipping silence
- Batch processing: Process multiple files concurrently (see the sketch below)
- Float16 precision: Minimal accuracy loss, significant speedup
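A minimal sketch of the batch-processing tip, assuming a hypothetical audio_dir of MP3 files and one shared Faster-Whisper model; num_workers > 1 lets multiple Python threads transcribe in parallel:
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from faster_whisper import WhisperModel

# One shared model; num_workers > 1 allows concurrent transcribe() calls
model = WhisperModel("base", device="cuda", compute_type="float16", num_workers=2)

def transcribe_file(path):
    segments, _ = model.transcribe(str(path), vad_filter=True)
    return path.name, " ".join(s.text.strip() for s in segments)

files = sorted(Path("audio_dir").glob("*.mp3"))  # hypothetical input directory
with ThreadPoolExecutor(max_workers=2) as pool:
    for name, text in pool.map(transcribe_file, files):
        print(f"{name}: {text[:80]}")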
A tuned Faster-Whisper configuration:
from faster_whisper import WhisperModel

model = WhisperModel(
    "base",
    device="cuda",
    compute_type="float16",  # or "int8" for more speed
    num_workers=4,  # Allows parallel transcriptions from multiple threads
    download_root="~/ai/models/stt/faster-whisper/"  # Local model cache
)
segments, info = model.transcribe(
    audio,
    beam_size=5,  # Lower = faster (default: 5)
    vad_filter=True,  # Enable VAD
    vad_parameters={
        "threshold": 0.5,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 100
    }
)
For WhisperX, reduce memory use by lowering the batch size and skipping pipeline stages you don't need:
# Use lower batch_size if running out of memory
batch_size = 8 # or 4 for lower memory
# Skip diarization if not needed
result = model.transcribe(audio, batch_size=batch_size)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
# Skip: diarize_model and assign_word_speakers

| Model | OpenAI Whisper | Faster-Whisper | WhisperX (full) |
|---|---|---|---|
| tiny | ~12 min | ~1.5 min | ~2 min |
| base | ~20 min | ~2.5 min | ~3.5 min |
| small | ~35 min | ~5 min | ~7 min |
| medium | ~55 min | ~9 min | ~13 min |
| large-v3 | ~90 min | ~18 min | ~25 min |
Note: Times are approximate and depend on audio characteristics (speech density, background noise, etc.)
For general ASR on AMD GPU: → Faster-Whisper (best speed/accuracy/memory balance)
For multi-speaker content: → WhisperX (diarization + accurate timestamps worth the overhead)
For development/testing: → OpenAI Whisper (simplest setup, reference implementation)