Free Transcription Using WhisperX w/ Speaker Diarization

Works well on a MacBook Pro as of March 1, 2025

Mar 02, 2025

This guide walks you through setting up WhisperX with speaker diarization on an Apple Silicon Mac. The steps were worked out on an M4 MacBook Pro.

Initial Setup for MacBook Pro with Apple Silicon (M4)

Prerequisites

1. Install Miniconda (if not already installed)

brew install miniconda

2. Initialize conda for your shell

conda init

You may need to restart your terminal or source your shell config (for example, `source ~/.zshrc`) before the change takes effect.

3. Create a dedicated conda environment

conda create --name whisperx python=3.10 -y

4. Activate the environment (conda environments work much like the ones created by pipenv and similar tools)

conda activate whisperx

If you have activation issues, you may need to first run:

source /opt/homebrew/Caskroom/miniconda/base/etc/profile.d/conda.sh
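Before installing anything, it can help to confirm that the `whisperx` environment really is the active one, so packages don't land in `base` by accident. A minimal sketch, assuming only that conda sets the `CONDA_DEFAULT_ENV` variable on activation; the function name `in_conda_env` is my own:

```shell
# Succeeds only when the named conda environment is the active one.
# CONDA_DEFAULT_ENV is set by conda when an environment is activated.
in_conda_env() {
  [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}
```

For example: `in_conda_env whisperx || echo "Run: conda activate whisperx"`.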

5. Install PyTorch (optimized for Apple Silicon), ffmpeg, and finally WhisperX

pip install torch torchvision torchaudio
brew install ffmpeg
pip install whisperx
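Once the installs finish, a quick check that the commands are actually on your PATH can save a confusing error later. This is just a sketch; `missing_tools` is a hypothetical helper name, and it only checks that each command can be found, not that it works:

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools ffmpeg whisperx python` inside the activated environment should print nothing if all three are available.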

Hugging Face Authentication for Speaker Diarization

Speaker diarization in WhisperX uses Hugging Face models that require authentication:

1. Create a Hugging Face account at huggingface.co if you don't already have one
2. Critical step: accept the license terms for **both** required models:
   1. Visit https://huggingface.co/pyannote/speaker-diarization-3.1 and accept the terms
   2. Visit https://huggingface.co/pyannote/segmentation-3.0 and accept the terms
3. Generate a Hugging Face access token:
   1. Go to https://huggingface.co/settings/tokens
   2. Create a new token with read access
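If you'd rather not paste the token into commands or scripts, one option is to keep it in a private file and read it when needed. A rough sketch under my own assumptions: the file location (say `~/.hf_token`) and the `read_token` helper are hypothetical, and nothing in WhisperX requires this layout:

```shell
# Read a token from the first line of a file, stripping all whitespace.
# The file location is a hypothetical choice, not something WhisperX requires.
read_token() {
  [ -r "$1" ] || { echo "cannot read token file: $1" >&2; return 1; }
  head -n 1 "$1" | tr -d '[:space:]'
}
```

With the token saved in `~/.hf_token` (and `chmod 600 ~/.hf_token` applied), `HF_TOKEN="$(read_token ~/.hf_token)"` loads it into a variable for the current shell.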

Running WhisperX with Speaker Diarization

Finally, you’re ready to run the command.

Tip: Try a very short audio file before a larger one, so that any remaining issues are quick to debug.
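If you don't have a short recording handy, ffmpeg (installed earlier) can cut one from a longer file. A minimal sketch: `make_clip` is a name I made up, the 60-second default is arbitrary, and `-c copy` skips re-encoding, which is fast but may not cut at an exact position in every format:

```shell
# Copy the first N seconds (default 60) of an audio file into a new file.
# -t limits the output duration, -c copy avoids re-encoding,
# and -y overwrites the output file if it already exists.
make_clip() {
  ffmpeg -y -i "$1" -t "${3:-60}" -c copy "$2"
}
```

For example, `make_clip interview.mp3 clip.mp3 30` writes the first 30 seconds of `interview.mp3` to `clip.mp3`.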

Basic command structure:

whisperx /path/to/your/audio.mp3 \
  --model large-v2 \
  --compute_type float32 \
  --output_format txt \
  --diarize \
  --hf_token YOUR_HUGGING_FACE_TOKEN

Creating a Reusable Script (Optional)

Save this as `transcribe.sh`:

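#!/bin/bash
set -euo pipefail

# Replace with your actual token
export HF_TOKEN=your_token_here

# Fail with a usage message if no audio file was given
if [ "$#" -lt 1 ]; then
  echo "Usage: $0 /path/to/audio" >&2
  exit 1
fi

whisperx "$1" \
  --model large-v2 \
  --compute_type float32 \
  --output_format txt \
  --diarize \
  --hf_token "$HF_TOKEN"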

Make it executable and run it:

chmod +x transcribe.sh

./transcribe.sh your_audio_file.mp3
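If you have several recordings, a small loop can run the script on each one in turn. This is only a sketch: `transcribe_all` is a hypothetical helper, and it assumes the command you pass (for example `./transcribe.sh`) takes a single audio path as its argument:

```shell
# Run a transcription command once per file, continuing past failures.
# First argument: the command to run; remaining arguments: audio files.
# Returns non-zero if any file failed.
transcribe_all() {
  cmd="$1"; shift
  failed=0
  for f in "$@"; do
    echo "Transcribing: $f"
    "$cmd" "$f" || { echo "Failed: $f" >&2; failed=$((failed + 1)); }
  done
  [ "$failed" -eq 0 ]
}
```

For example: `transcribe_all ./transcribe.sh recordings/*.mp3`. Files are processed one at a time, so a failure on one recording doesn't stop the rest.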

Common Issues and Troubleshooting

"Could not download" errors: Make sure you've accepted the terms for both required models on the Hugging Face website

Authentication errors: Generate a fresh token after accepting the model terms, and make sure the new token is the one you pass to `--hf_token`

Performance issues: Large files can take a long time, so test on a small clip first to verify that everything works
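To tell a bad token apart from unaccepted model terms, you can ask Hugging Face for a small file from each gated repository and look at the HTTP status. Treat this as a rough sketch resting on my assumptions: that each repository has a `config.yaml` at its root, and that gated repositories answer 401 for a missing or invalid token and 403 when the terms haven't been accepted. The helper names are hypothetical:

```shell
# Fetch the HTTP status for a small file in a gated model repository.
# Assumes the repo has a config.yaml at its root (not verified here).
# $1 = repository id, $2 = access token
model_status() {
  curl -sL -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $2" \
    "https://huggingface.co/$1/resolve/main/config.yaml"
}

# Turn a status code into a hint (the meanings are assumptions, see above).
explain_status() {
  case "$1" in
    200) echo "access OK" ;;
    401) echo "token missing or invalid" ;;
    403) echo "token valid, but model terms not accepted" ;;
    *)   echo "unexpected status $1" ;;
  esac
}
```

For example, `explain_status "$(model_status pyannote/segmentation-3.0 "$HF_TOKEN")"` prints a one-line hint for that model. Repeat it for `pyannote/speaker-diarization-3.1`.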

Output Formats

I’ve tried the txt and json outputs, and both seem to work well. You may also want to try these (untested by me):

* `srt`: Subtitles with timestamps

* `vtt`: Web video text tracks

Specify your desired format with the `--output_format` parameter.

Scott Nelson
