A comprehensive guide to installing and configuring WhisperX on Windows with GPU acceleration and speaker diarization.
Last tested: January 2026
Environment: Windows 11, NVIDIA GPU (CUDA 12.6), Miniconda, Python 3.10
- Prerequisites
- Conda Environment Setup
- WhisperX Installation
- Hugging Face Configuration
- Wrapper Scripts
- Usage
- Troubleshooting
- Known Warnings
## Prerequisites

Install Miniconda via Scoop (recommended) or download it from conda.io:
```
scoop install miniconda3
```

Ensure you have recent NVIDIA drivers. The CUDA toolkit is bundled with PyTorch, so no separate installation is typically needed.
Verify GPU is detected:
```
nvidia-smi
```

Create an account at huggingface.co; this is required for the speaker diarization models.
## Conda Environment Setup

Create a dedicated environment with Python 3.10 (recommended for compatibility):
```
conda create -n whisperx python=3.10 -y
conda activate whisperx
```

Install PyTorch with CUDA support. Check pytorch.org for the latest command, but typically:
```
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
```

Or, for CUDA 12.1 via pip:
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

## WhisperX Installation

```
pip install whisperx
```

WhisperX has some dependencies that may not be automatically installed:
```
pip install requests
```

Verify the installation:

```
python -c "import whisperx; print('WhisperX OK')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```

## Hugging Face Configuration

Speaker diarization requires access to gated models on Hugging Face. This is the most common source of errors.
Create an access token:

- Go to huggingface.co/settings/tokens
- Create a new token with a descriptive name (e.g., "WhisperX")
- Enable: "Read access to contents of all public gated repos you can access"
- Save the token securely
You must visit each of these pages while logged in and accept the user agreement:
| Model | URL |
|---|---|
| Speaker Diarization 3.1 | huggingface.co/pyannote/speaker-diarization-3.1 |
| Segmentation 3.0 | huggingface.co/pyannote/segmentation-3.0 |
Look for "Gated model - You have been granted access to this model" after accepting.
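To verify programmatically that your token can actually reach both gated models, a small check with huggingface_hub (installed alongside WhisperX) is sketched below; it only fetches the model metadata, and a missing license acceptance or permission shows up as an exception:

```
# Minimal sketch: check that a token can access the gated pyannote models.
# Paste your own token below; any failure (401/403, license not accepted, typo) is reported.
from huggingface_hub import HfApi

token = "hf_YOUR_TOKEN_HERE"
api = HfApi(token=token)

for repo in ("pyannote/speaker-diarization-3.1", "pyannote/segmentation-3.0"):
    try:
        api.model_info(repo)
        print(f"OK: access granted to {repo}")
    except Exception as exc:
        print(f"FAILED for {repo}: {exc}")
```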
For persistent authentication without passing tokens:
```
conda activate whisperx
huggingface-cli login
```

Paste your token when prompted. This caches the credentials in ~/.cache/huggingface/token.
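To confirm the cached login works, you can ask the Hub who the stored token belongs to; the sketch below uses huggingface_hub, which is already present in the environment as a WhisperX dependency:

```
# Minimal sketch: verify that the credentials cached by `huggingface-cli login` are valid.
# whoami() picks up the stored token automatically when none is passed explicitly.
from huggingface_hub import whoami

try:
    info = whoami()
    print(f"Logged in as: {info.get('name', '<unknown>')}")
except Exception as exc:
    print(f"Not logged in or token invalid: {exc}")
```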
## Wrapper Scripts

WhisperX requires patches to work with modern PyTorch (2.6+) and huggingface_hub versions. Create these two files in your scripts directory (e.g., C:\dev\scripts\).
The first file, run_whisperx_safe.py, is a Python wrapper that applies the necessary compatibility patches:
```
import sys
import os
import torch
import functools
import huggingface_hub

# === PATCH 1: Fix PyTorch 2.6+ Security Change ===
# Force "weights_only=False" globally to allow loading older models
_original_load = torch.load

@functools.wraps(_original_load)
def robust_load(*args, **kwargs):
    kwargs['weights_only'] = False
    return _original_load(*args, **kwargs)

torch.load = robust_load
# ================================================

# === PATCH 2: Fix Hugging Face "use_auth_token" Deprecation ===
# The library changed argument 'use_auth_token' to 'token'.
# We intercept the call and rename the argument on the fly.
_original_hf_download = huggingface_hub.hf_hub_download

@functools.wraps(_original_hf_download)
def robust_hf_download(*args, **kwargs):
    if 'use_auth_token' in kwargs:
        kwargs['token'] = kwargs.pop('use_auth_token')
    return _original_hf_download(*args, **kwargs)

huggingface_hub.hf_hub_download = robust_hf_download
# ================================================

# === PATCH 3: Set HF_TOKEN environment variable ===
# pyannote.audio reads from environment as fallback
if len(sys.argv) > 1:
    for i, arg in enumerate(sys.argv):
        if arg == '--hf_token' and i + 1 < len(sys.argv):
            os.environ['HF_TOKEN'] = sys.argv[i + 1]
            break
# ================================================

# Import WhisperX AFTER applying patches
from whisperx.__main__ import cli

if __name__ == "__main__":
    sys.exit(cli())
```

The second file, transcribe_WhisperX.bat, is a batch script for easy transcription with drag-and-drop support:
```
@echo off
setlocal
:: ================= CONFIGURATION =================
:: PASTE YOUR HUGGING FACE TOKEN BELOW
set "HF_TOKEN=hf_YOUR_TOKEN_HERE"
:: NAME OF YOUR CONDA ENVIRONMENT
set "CONDA_ENV=whisperx"
:: SUPPRESS SYMLINK WARNINGS ON WINDOWS
set "HF_HUB_DISABLE_SYMLINKS_WARNING=1"
:: =================================================
:: Check if input file is provided
if "%~1"=="" (
    echo [ERROR] No input file provided.
    echo Usage: transcribe "path\to\audio.mp3"
    goto :EOF
)
:: Get absolute path of the input file
set "INPUT_FILE=%~f1"
:: Get directory of the input file
set "OUTPUT_DIR=%~dp1"
:: Remove trailing backslash to prevent quote escaping bugs
if "%OUTPUT_DIR:~-1%"=="\" set "OUTPUT_DIR=%OUTPUT_DIR:~0,-1%"
echo.
echo ----------------------------------------------------------------
echo Source: %INPUT_FILE%
echo Target: %OUTPUT_DIR%
echo ----------------------------------------------------------------
echo.
:: Activate the Conda environment
call conda activate %CONDA_ENV%
:: === RUN THE WRAPPER SCRIPT ===
:: We use "%~dp0" to find the python script in the same folder as this batch file.
python "%~dp0run_whisperx_safe.py" "%INPUT_FILE%" --model medium --diarize --hf_token %HF_TOKEN% --output_dir "%OUTPUT_DIR%" --device cuda --compute_type int8 --batch_size 4
echo.
echo ----------------------------------------------------------------
echo Transcription Complete!
echo ----------------------------------------------------------------
:: Deactivate
call conda deactivate
endlocal
```

Note: Replace `hf_YOUR_TOKEN_HERE` with your actual Hugging Face token.
## Usage

Add your scripts directory to your system PATH, or create a shortcut/alias for easy access.
```
C:\dev\scripts\transcribe_WhisperX.bat "C:\path\to\audio.mp3"
```

Alternatively, create a shortcut to the batch file on your desktop and drag audio files onto it to transcribe them.
Transcription creates multiple output formats in the same directory as the input file:
- `.json` — Full transcript with timestamps and speaker labels
- `.srt` — Subtitle format
- `.vtt` — WebVTT subtitle format
- `.txt` — Plain text transcript
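If you want to post-process the results, the .json file is easy to consume from Python. The sketch below assumes the layout WhisperX typically writes (a top-level "segments" list whose entries carry start, end, text and, when diarization is enabled, a speaker label); adjust the keys if your version differs, and the filename audio.json is just a placeholder:

```
# Minimal sketch: print a speaker-labelled transcript from the WhisperX JSON output.
# Assumed layout: {"segments": [{"start": ..., "end": ..., "text": ..., "speaker": ...}, ...]}
import json

with open("audio.json", encoding="utf-8") as f:
    data = json.load(f)

for seg in data.get("segments", []):
    speaker = seg.get("speaker", "UNKNOWN")
    print(f"[{seg['start']:8.2f} - {seg['end']:8.2f}] {speaker}: {seg['text'].strip()}")
```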
Common options you can modify in the batch file:
| Option | Description | Values |
|---|---|---|
| `--model` | Whisper model size | tiny, base, small, medium, large-v2, large-v3 |
| `--device` | Compute device | cuda, cpu |
| `--compute_type` | Precision | float16, int8, float32 |
| `--batch_size` | Batch size for GPU | 1-32 (depends on VRAM) |
| `--diarize` | Enable speaker diarization | flag (no value) |
| `--language` | Force language | en, nl, de, etc. (auto-detect if omitted) |
| `--min_speakers` | Minimum speakers | integer |
| `--max_speakers` | Maximum speakers | integer |
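The batch wrapper drives the WhisperX command line, but the same pipeline can also be scripted directly in Python. The sketch below follows the load_model / align / DiarizationPipeline pattern used in recent WhisperX releases; exact names and module locations vary between versions (for example, DiarizationPipeline has moved between whisperx and whisperx.diarize), so treat it as a starting point rather than a drop-in script:

```
# Sketch of programmatic use; API names follow recent WhisperX releases and may differ in yours.
import whisperx

device = "cuda"
audio_file = "audio.mp3"          # placeholder input path
hf_token = "hf_YOUR_TOKEN_HERE"   # same token as in the batch file

# 1. Transcribe with the faster-whisper backend
model = whisperx.load_model("medium", device, compute_type="int8")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=4)

# 2. Align for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device, return_char_alignments=False)

# 3. Diarize and attach speaker labels (requires the gated pyannote models)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```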
## Troubleshooting

**Authentication errors for gated models**

Cause: Hugging Face authentication failed for gated models.
Solution:
- Verify you accepted the licenses at huggingface.co/pyannote/speaker-diarization-3.1 and huggingface.co/pyannote/segmentation-3.0
- Check token has "gated repos" read permission
- Ensure token is correctly set in batch file
**Missing Python module (e.g., requests)**

Cause: A dependency that WhisperX does not install automatically is missing.
Solution:
```
conda activate whisperx
pip install requests
```

**torch.load / weights_only errors**

Cause: PyTorch 2.6+ changed the default security settings for torch.load().
Solution: Use the run_whisperx_safe.py wrapper which patches this automatically.
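If you hit this error outside the wrapper (for example in your own scripts), the same effect can be achieved per call by passing weights_only=False explicitly; only do this for checkpoints you trust, since it re-enables full unpickling. The path below is just a placeholder:

```
# Manual workaround for a single trusted checkpoint (the wrapper applies this globally).
# Only pass weights_only=False for files you trust: it allows arbitrary pickled objects to load.
import torch

checkpoint = torch.load("path/to/model.bin", map_location="cpu", weights_only=False)
```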
**use_auth_token / token argument errors**

Cause: huggingface_hub renamed the parameter from use_auth_token to token.
Solution: The wrapper script handles this. Also ensure HF_TOKEN environment variable is set.
**First run takes a long time**

Cause: Models are downloaded on first use (~3-5 GB in total).
Solution: Wait for downloads to complete. Subsequent runs use cached models from:
- ~/.cache/huggingface/
- ~/.cache/torch/
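To see how much disk space the cached models are using (for example before clearing them), a small helper like the one below works; the paths simply mirror the defaults listed above and need adjusting if you have set HF_HOME or TORCH_HOME:

```
# Minimal sketch: report the size of the default model caches listed above.
from pathlib import Path

def dir_size_gb(path: Path) -> float:
    """Total size of all files under `path`, in gigabytes."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

for cache in (Path.home() / ".cache" / "huggingface", Path.home() / ".cache" / "torch"):
    if cache.exists():
        print(f"{cache}: {dir_size_gb(cache):.2f} GB")
    else:
        print(f"{cache}: not present")
```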
**Out-of-memory errors on the GPU**

Solution: Reduce the batch size or use a smaller model:
```
python ... --batch_size 2 --model small
```

## Known Warnings

These warnings are safe to ignore — they don't affect functionality:
| Warning | Explanation |
|---|---|
| `torchaudio._backend.list_audio_backends` deprecated | Future API change in torchaudio |
| `Model was trained with pyannote.audio 0.0.1` | Version mismatch, but backward compatible |
| `TensorFloat-32 (TF32) has been disabled` | Reproducibility safeguard |
| `speechbrain.pretrained` deprecated | Auto-redirects to new API |
| `symlinks not supported` | Windows limitation; uses more disk space |
| `Lightning upgraded your checkpoint` | Automatic checkpoint format update |
| `std(): degrees of freedom <= 0` | Edge case in speaker embedding calculation |
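If you prefer a quieter console, run_whisperx_safe.py could additionally filter some of these with Python's warnings module. The message patterns below are illustrative regexes based on the table above rather than exact strings, and some of the items are emitted through logging rather than warnings, so this only silences part of them:

```
# Optional, illustrative: silence some of the known-harmless warnings from the table above.
# Add near the top of run_whisperx_safe.py, before importing whisperx.
# `message` is a regex matched against the start of the warning text; adjust to what you actually see.
import warnings

for pattern in (
    r"torchaudio\._backend",
    r"Model was trained with pyannote\.audio",
    r"TensorFloat-32 \(TF32\) has been disabled",
    r"speechbrain\.pretrained",
    r"std\(\): degrees of freedom",
):
    warnings.filterwarnings("ignore", message=pattern)
```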
To suppress the symlink warning, set in your batch file:
```
set "HF_HUB_DISABLE_SYMLINKS_WARNING=1"
```

## Quick Reference

```
# Create environment
conda create -n whisperx python=3.10 -y
conda activate whisperx
# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install whisperx requests
# Login to Hugging Face (one-time)
huggingface-cli login
# Accept model licenses (visit in browser while logged in):
# - https://huggingface.co/pyannote/speaker-diarization-3.1
# - https://huggingface.co/pyannote/segmentation-3.0
# Run transcription
python run_whisperx_safe.py "audio.mp3" --model medium --diarize --hf_token YOUR_TOKEN --output_dir . --device cuda
```

Created after extensive troubleshooting. May your transcriptions be swift and your speakers correctly identified.