A bash script that transcribes audio files using OpenAI's transcription API (the `gpt-4o-transcribe-diarize` model) with optional speaker diarization (identifying who said what).
## Features

- Speaker Diarization: Identify and label individual speakers in multi-person audio
- Flexible Configuration: Support for zero to four known-speaker voice references
- Multiple Output Formats: Plain text and structured JSON segments
- Error Handling: Comprehensive validation and clear error messages
- Italian Language Support: Optimized for Italian transcription
## Requirements

- Bash (v4.0+)
- curl - for API requests
- jq - for JSON processing
- base64 - for encoding voice references
- OpenAI API Key - set as the `OPENAI_API_KEY` environment variable
- ffmpeg (optional) - for preparing voice reference files
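The tools above can be checked up front. A minimal sketch (not part of the script itself) that verifies each hard dependency is on the PATH:

```shell
#!/usr/bin/env bash
# Minimal dependency check (a sketch, not the script's own code).
# require <cmd> -> 0 if <cmd> is on PATH, 1 otherwise.
require() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "Missing required dependency: $1" >&2
  return 1
}

# The hard requirements listed above:
for cmd in curl jq base64; do
  require "$cmd" || echo "Install $cmd before running transcriptize.sh" >&2
done
```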
## Installation

- Clone or download the script, then make it executable:

```bash
chmod +x transcriptize.sh
```

- Set your OpenAI API key:

```bash
export OPENAI_API_KEY="your-api-key-here"
```

- Install dependencies (if not already installed):

```bash
# macOS
brew install jq ffmpeg

# Ubuntu/Debian
sudo apt-get install jq ffmpeg
```

## Usage

```bash
./transcriptize.sh <audio_file.mp3> [--speaker Name:voicefile.wav] [--speaker Name2:voicefile2.wav] ...
```

Example with two known speakers:

```bash
./transcriptize.sh interview.mp3 --speaker Luana:voce_luana_mini.wav --speaker Chiara:voce_chiara_mini.wav
```

Output:
- `interview_raw.json` - Complete JSON response with text and speaker-labeled segments
- `interview_diarized.txt` - Human-readable formatted transcript with merged speaker segments
- `interview_text.txt` - Plain text transcription only
Basic transcription without voice references:

```bash
./transcriptize.sh interview.mp3
```

Output includes generic speaker labels (A, B, C...).
Single known speaker:

```bash
./transcriptize.sh lecture.mp3 --speaker Professor:prof_voice.wav
```

Multiple known speakers (up to four):

```bash
./transcriptize.sh meeting.mp3 \
  --speaker Alice:alice.wav \
  --speaker Bob:bob.wav \
  --speaker Carol:carol.wav \
  --speaker Dave:dave.wav
```

## Voice Reference Requirements

Voice reference files must be:
- Format: WAV (uncompressed PCM)
- Sample Rate: 16 kHz
- Channels: Mono (1 channel)
- Duration: 1.2 - 10 seconds (3 seconds recommended)
Extract a 3-second sample from an MP3 file:
```bash
ffmpeg -i speaker_audio.mp3 -t 3 -ar 16000 -ac 1 speaker_reference.wav
```

Parameters explained:

- `-i speaker_audio.mp3` - Input file
- `-t 3` - Duration (3 seconds)
- `-ar 16000` - Sample rate (16 kHz)
- `-ac 1` - Audio channels (mono)
- `speaker_reference.wav` - Output file
The script generates three files for each transcription:
The `_raw.json` file contains the complete API response, including:
- Full text transcription
- Detailed segments with speaker identification
- Timestamps (start/end)
- Segment IDs
Example:
```json
{
  "text": "per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale. Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "text": " per questo ti volevo fare un po' di domande...",
      "speaker": "Chiara",
      "start": 0.0,
      "end": 5.35,
      "id": "seg_0"
    },
    {
      "type": "transcript.text.segment",
      "text": " Allora, la storia nostra è cominciata...",
      "speaker": "Luana",
      "start": 5.8,
      "end": 12.3,
      "id": "seg_1"
    }
  ]
}
```

To extract just the text:

```bash
jq -r '.text' (unknown)_raw.json
```

To extract just the segments:

```bash
jq '.segments' (unknown)_raw.json
```
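A further jq sketch: listing each segment as `speaker: text`. The sample data below is hypothetical but mirrors the `diarized_json` segment structure shown above; in practice you would point jq at your own `_raw.json` file.

```shell
#!/usr/bin/env bash
# Hypothetical sample mirroring the diarized_json segment structure.
cat > sample_raw.json <<'EOF'
{"segments":[
  {"speaker":"Chiara","text":"per questo ti volevo fare un po' di domande","start":0.0,"end":5.35},
  {"speaker":"Luana","text":"Allora, la storia nostra è cominciata","start":5.8,"end":12.3}
]}
EOF

# Print each segment as "speaker: text".
jq -r '.segments[] | "\(.speaker): \(.text)"' sample_raw.json
```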
The `_diarized.txt` file is a human-readable formatted transcript with:
- Speaker names in uppercase
- Timestamps in [HH:MM:SS] format
- Consecutive segments from the same speaker merged together
Example:

```
CHIARA [00:00:00]
per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale.

LUANA [00:00:05]
Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.
```
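The [HH:MM:SS] timestamps are derived from each segment's start time in seconds. A minimal bash sketch of that conversion (hypothetical helper, assuming whole-second precision is enough):

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a segment start time (seconds, possibly
# fractional) as the [HH:MM:SS] timestamp used in the diarized transcript.
format_timestamp() {
  local total=${1%%.*}   # drop the fractional part, e.g. 5.8 -> 5
  printf '[%02d:%02d:%02d]\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

format_timestamp 5.8    # -> [00:00:05]
format_timestamp 3725   # -> [01:02:05]
```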
The `_text.txt` file contains the plain text transcription extracted from the `.text` field of the API response.

Example:

```
per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale. Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.
```
## Limitations

- Maximum 4 speakers - Script enforces this limit
- WAV files only - Voice references must be pre-converted to WAV format
- Duration limits - Voice references must be 1.2-10 seconds
- API costs - OpenAI charges for API usage based on audio duration
The script validates:
- ✅ Input file exists
- ✅ `--speaker` format is correct (`Name:file.wav`)
- ✅ Voice reference files exist
- ✅ Voice reference files are WAV format
- ✅ Maximum speaker count (4)
- ✅ API response errors
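The `--speaker` format check can be sketched in pure bash (a hypothetical helper, not the script's actual implementation):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the --speaker argument validation:
# the value must look like Name:file.wav.
valid_speaker_arg() {
  [[ "$1" == ?*:?*.wav ]]
}

valid_speaker_arg "Luana:voce_luana_mini.wav" && echo "ok"
valid_speaker_arg "Luana" || echo "Invalid --speaker format. Use Name:file.wav" >&2
```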
Common errors and solutions:
| Error | Solution |
|---|---|
| `Voice file 'x.wav' not found` | Check that the file path is correct |
| `Voice file must be a WAV file` | Convert to WAV using ffmpeg |
| `Invalid --speaker format` | Use the format `Name:file.wav` |
| `Maximum 4 speakers allowed` | Reduce the number of speakers |
| `Known speaker references has duration...` | Voice reference must be 1.2-10 seconds |
## Environment Variables

- `OPENAI_API_KEY` - (Required) Your OpenAI API key
This script uses:

- Model: `gpt-4o-transcribe-diarize`
- Language: Italian (`it`)
- Response Format: `diarized_json`
- Chunking Strategy: `auto`
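Put together, the underlying request presumably resembles the multipart call below. This is only a sketch against the OpenAI `/v1/audio/transcriptions` endpoint using the settings listed above; the known-speaker reference fields are omitted, and parameter names should be verified against the current OpenAI audio API documentation.

```shell
#!/usr/bin/env bash
# Sketch of the transcription request (assumed parameter names; the
# known-speaker reference fields used for diarization are omitted).
AUDIO_FILE="interview.mp3"
args=(
  -s https://api.openai.com/v1/audio/transcriptions
  -H "Authorization: Bearer ${OPENAI_API_KEY}"
  -F "file=@${AUDIO_FILE}"
  -F "model=gpt-4o-transcribe-diarize"
  -F "language=it"
  -F "response_format=diarized_json"
  -F "chunking_strategy=auto"
)
# Uncomment to actually call the API:
# curl "${args[@]}" > interview_raw.json
printf '%s\n' "${args[@]}"
```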
## Troubleshooting

If `jq` is not installed: `brew install jq` (macOS) or `sudo apt-get install jq` (Linux).
Verify that your WAV file meets the requirements:

```bash
ffprobe -i your_voice.wav -show_streams
```

Look for:

- `codec_name: pcm_s16le` (or a similar PCM codec)
- `sample_rate: 16000`
- `channels: 1`
- `duration`: 1.2-10.0 seconds
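The 1.2-10 second bound can also be checked programmatically once you have the duration from ffprobe. A small sketch (hypothetical helper) that takes the reported duration in seconds:

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a voice reference duration (seconds)
# falls inside the accepted 1.2-10.0 second window.
duration_ok() {
  awk -v d="$1" 'BEGIN { exit !(d >= 1.2 && d <= 10.0) }'
}

duration_ok 3.0  && echo "duration ok"
duration_ok 0.5  || echo "too short: references must be at least 1.2 s" >&2
duration_ok 12.0 || echo "too long: references must be at most 10 s" >&2
```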