Free Transcription Using WhisperX w/ Speaker Diarization
Works well on a MacBook Pro as of March 1, 2025
This guide walks you through setting up WhisperX with speaker diarization on Apple Silicon Macs (M4 MacBook Pro).
Initial Setup for MacBook Pro with Apple Silicon (M4)
Prerequisites
1. Install Miniconda (if not already installed)
brew install --cask miniconda
2. Initialize conda for your shell
conda init
You may need to restart your terminal or source your shell config (for example, `source ~/.zshrc`) for this to take effect.
3. Create a dedicated conda environment
conda create --name whisperx python=3.10 -y
4. Activate the environment (conda environments work much like those of pipenv and similar tools)
conda activate whisperx
If you have activation issues, you may need to first run:
source /opt/homebrew/Caskroom/miniconda/base/etc/profile.d/conda.sh
5. Install PyTorch (optimized for Apple Silicon), ffmpeg, and finally WhisperX
pip install torch torchvision torchaudio
brew install ffmpeg
pip install whisperx
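Before moving on, it can help to confirm that the environment is active and that the tools from step 5 are reachable. The following is a minimal sketch, not part of WhisperX itself: `in_whisperx_env` and `check_tool` are hypothetical helper names introduced here, and the only assumption about conda is that `conda activate` sets the `CONDA_DEFAULT_ENV` variable.

```shell
#!/bin/bash
# Hypothetical helper: succeed only when the "whisperx" conda environment
# is active. CONDA_DEFAULT_ENV is set by "conda activate".
in_whisperx_env() {
  [ "${CONDA_DEFAULT_ENV:-}" = "whisperx" ]
}

# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Warn if the environment is not active, then check each installed tool.
# "|| true" keeps the loop going after a missing tool.
in_whisperx_env || echo "warning: the whisperx environment is not active"
for tool in ffmpeg whisperx; do
  check_tool "$tool" || true
done
```

To confirm that PyTorch imports cleanly, `python -c "import torch; print(torch.__version__)"` should print a version number.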
Hugging Face Authentication for Speaker Diarization
Speaker diarization in WhisperX uses Hugging Face models that require authentication:
1. Create a Hugging Face account at huggingface.co if you don't have one
2. Critical step: accept the license terms for **both** required models:
* Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and accept the terms
* Visit https://huggingface.co/pyannote/segmentation-3.0 and accept the terms
3. Generate a Hugging Face access token:
* Go to https://huggingface.co/settings/tokens
* Create a new token with read access
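One way to keep the token out of your shell history is to store it in a file that only your user can read and load it when needed. This is a sketch under assumptions: the file location `~/.config/whisperx/hf_token` and the `load_hf_token` helper are hypothetical choices made for this example, not something WhisperX or Hugging Face requires.

```shell
#!/bin/bash
# Hypothetical location for the token file; any private path works.
TOKEN_FILE="${TOKEN_FILE:-$HOME/.config/whisperx/hf_token}"

# Hypothetical helper: print the token stored in the given file,
# or fail with a message if the file is missing or empty.
load_hf_token() {
  local file="$1" token
  if [ ! -s "$file" ]; then
    echo "No token found in $file" >&2
    return 1
  fi
  # Read the first line and strip any stray whitespace.
  token="$(head -n 1 "$file" | tr -d '[:space:]')"
  printf '%s\n' "$token"
}
```

After pasting the token into that file once and running `chmod 600` on it, a later command can pass it as `--hf_token "$(load_hf_token "$TOKEN_FILE")"`.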
Running WhisperX with Speaker Diarization
Finally, you’re ready to run the command.
Tip: Try a very short audio file before a larger one, so that any remaining setup issues are quick to debug.
Basic command structure:
whisperx /path/to/your/audio.mp3 \
--model large-v2 \
--compute_type float32 \
--output_format txt \
--diarize \
--hf_token YOUR_HUGGING_FACE_TOKEN
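To follow the tip about starting with a short file, one option is to cut a test clip from the start of a longer recording with ffmpeg. This is a minimal sketch: `make_clip` is a hypothetical helper name, and the `FFMPEG` variable is an illustrative override (for example, setting it to `echo` previews the command instead of running it).

```shell
#!/bin/bash
# Hypothetical helper: copy the first N seconds of an audio file
# into a new clip.
make_clip() {
  local input="$1" seconds="$2" output="$3"
  # -t limits the output duration; -c copy avoids re-encoding the audio.
  "${FFMPEG:-ffmpeg}" -i "$input" -t "$seconds" -c copy "$output"
}
```

For example, `make_clip audio.mp3 60 clip.mp3` would produce a one-minute clip, which can then be passed to the command above in place of the full file.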
Creating a Reusable Script (Optional)
Save this as `transcribe.sh`:
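#!/bin/bash
set -euo pipefail
# Replace with your actual token
export HF_TOKEN=your_token_here
# Fail with a usage message if no audio file was given
audio="${1:?Usage: ./transcribe.sh <audio file>}"
whisperx "$audio" \
--model large-v2 \
--compute_type float32 \
--output_format txt \
--diarize \
--hf_token "$HF_TOKEN"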
Make it executable & use it:
chmod +x transcribe.sh
./transcribe.sh your_audio_file.mp3
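If you have several recordings, the script can be called in a loop. The sketch below assumes the files are `.mp3` files in a single directory; `transcribe_dir` is a hypothetical helper name, and `TRANSCRIBE_CMD` is an illustrative override that defaults to the `./transcribe.sh` script above.

```shell
#!/bin/bash
# Hypothetical helper: run the transcription script on every .mp3 file
# in a directory, one file at a time.
transcribe_dir() {
  local dir="$1" file
  for file in "$dir"/*.mp3; do
    # Skip the unexpanded pattern when the directory has no .mp3 files.
    [ -e "$file" ] || continue
    echo "Transcribing: $file"
    "${TRANSCRIBE_CMD:-./transcribe.sh}" "$file"
  done
}
```

For example, `transcribe_dir ~/recordings` would process each file in turn; files are handled sequentially, so a long recording delays the ones after it.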
Common Issues and Troubleshooting
"Could not download" errors: Make sure you've accepted the terms for both required models on the Hugging Face website
Authentication errors: After accepting terms, always generate a fresh token
Performance issues: For large files, consider testing on a small clip first to verify everything works
Output Formats
I’ve tried the txt and json outputs, and both seem to work well. You may also want to try these (untested by me):
* `srt`: Subtitles with timestamps
* `vtt`: Web video text tracks
Specify your desired format with the `--output_format` parameter.
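To switch formats without editing the command each time, the format can be made an argument. This is a sketch under assumptions: `transcribe_as` is a hypothetical wrapper, the `WHISPERX` variable is an illustrative override, the accepted list covers only the four formats mentioned in this guide, and `--output_dir` is the CLI option for choosing where results are written as I understand it (check `whisperx --help` on your install).

```shell
#!/bin/bash
# Hypothetical wrapper: transcribe one file in a chosen format and
# write the result to a chosen directory.
transcribe_as() {
  local audio="$1" format="$2" outdir="$3"
  # Accept only the formats mentioned in this guide.
  case "$format" in
    txt|json|srt|vtt) ;;
    *) echo "Unsupported format in this guide: $format" >&2; return 1 ;;
  esac
  # HF_TOKEN must already be set in the environment.
  "${WHISPERX:-whisperx}" "$audio" \
    --model large-v2 \
    --compute_type float32 \
    --output_format "$format" \
    --output_dir "$outdir" \
    --diarize \
    --hf_token "${HF_TOKEN:?Set HF_TOKEN first}"
}
```

For example, `transcribe_as audio.mp3 srt subtitles` would request subtitle output in a `subtitles` directory.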