A bash script that transcribes audio files using OpenAI's transcription API (the `gpt-4o-transcribe-diarize` model) with optional speaker diarization (identifying who said what).
## Features

- Speaker Diarization: Identify and label individual speakers in multi-person audio
- Flexible Configuration: Support for zero to four known-speaker voice references
- Multiple Output Formats: Plain text and structured JSON segments
- Error Handling: Comprehensive validation and clear error messages
- Italian Language Support: Optimized for Italian transcription
## Requirements

- Bash (v4.0+)
- curl - for API requests
- jq - for JSON processing
- base64 - for encoding voice references
- OpenAI API Key - set as the `OPENAI_API_KEY` environment variable
- ffmpeg (optional) - for preparing voice reference files
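The tools above can be checked up front. A minimal sketch (not part of the script itself) that verifies each hard dependency is on the PATH:

```shell
#!/usr/bin/env bash
# Minimal dependency check (a sketch, not the script's own code).
# require <cmd> -> 0 if <cmd> is on PATH, 1 otherwise.
require() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "Missing required dependency: $1" >&2
  return 1
}

# The hard requirements listed above:
for cmd in curl jq base64; do
  require "$cmd" || echo "Install $cmd before running transcriptize.sh" >&2
done
```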
## Installation

- Clone or download the script, then make it executable:

```bash
chmod +x transcriptize.sh
```

- Set your OpenAI API key:

```bash
export OPENAI_API_KEY="your-api-key-here"
```

- Install dependencies (if not already installed):

```bash
# macOS
brew install jq ffmpeg

# Ubuntu/Debian
sudo apt-get install jq ffmpeg
```

## Usage

```bash
./transcriptize.sh <audio_file.mp3> [--speaker Name:voicefile.wav] [--speaker Name2:voicefile2.wav] ...
```

Example with two known speakers:

```bash
./transcriptize.sh interview.mp3 --speaker Luana:voce_luana_mini.wav --speaker Chiara:voce_chiara_mini.wav
```

Output:
- `interview_raw.json` - Complete JSON response with text and speaker-labeled segments
- `interview_diarized.txt` - Human-readable formatted transcript with merged speaker segments
- `interview_text.txt` - Plain text transcription only
Basic transcription without voice references:

```bash
./transcriptize.sh interview.mp3
```

Output includes generic speaker labels (A, B, C...).
Single known speaker:

```bash
./transcriptize.sh lecture.mp3 --speaker Professor:prof_voice.wav
```

Multiple known speakers (up to four):

```bash
./transcriptize.sh meeting.mp3 \
  --speaker Alice:alice.wav \
  --speaker Bob:bob.wav \
  --speaker Carol:carol.wav \
  --speaker Dave:dave.wav
```

## Voice Reference Requirements

Voice reference files must be:
- Format: WAV (uncompressed PCM)
- Sample Rate: 16 kHz
- Channels: Mono (1 channel)
- Duration: 1.2 - 10 seconds (3 seconds recommended)
Extract a 3-second sample from an MP3 file:
```bash
ffmpeg -i speaker_audio.mp3 -t 3 -ar 16000 -ac 1 speaker_reference.wav
```

Parameters explained:

- `-i speaker_audio.mp3` - Input file
- `-t 3` - Duration (3 seconds)
- `-ar 16000` - Sample rate (16 kHz)
- `-ac 1` - Audio channels (mono)
- `speaker_reference.wav` - Output file
The script generates three files for each transcription:
The `_raw.json` file contains the complete API response, including:
- Full text transcription
- Detailed segments with speaker identification
- Timestamps (start/end)
- Segment IDs
Example:
```json
{
  "text": "per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale. Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "text": " per questo ti volevo fare un po' di domande...",
      "speaker": "Chiara",
      "start": 0.0,
      "end": 5.35,
      "id": "seg_0"
    },
    {
      "type": "transcript.text.segment",
      "text": " Allora, la storia nostra è cominciata...",
      "speaker": "Luana",
      "start": 5.8,
      "end": 12.3,
      "id": "seg_1"
    }
  ]
}
```

To extract just the text:

```bash
jq -r '.text' (unknown)_raw.json
```

To extract just the segments:

```bash
jq '.segments' (unknown)_raw.json
```
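A further jq sketch: listing each segment as `speaker: text`. The sample data below is hypothetical but mirrors the `diarized_json` segment structure shown above; in practice you would point jq at your own `_raw.json` file.

```shell
#!/usr/bin/env bash
# Hypothetical sample mirroring the diarized_json segment structure.
cat > sample_raw.json <<'EOF'
{"segments":[
  {"speaker":"Chiara","text":"per questo ti volevo fare un po' di domande","start":0.0,"end":5.35},
  {"speaker":"Luana","text":"Allora, la storia nostra è cominciata","start":5.8,"end":12.3}
]}
EOF

# Print each segment as "speaker: text".
jq -r '.segments[] | "\(.speaker): \(.text)"' sample_raw.json
```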
The `_diarized.txt` file is a human-readable formatted transcript with:
- Speaker names in uppercase
- Timestamps in [HH:MM:SS] format
- Consecutive segments from the same speaker merged together
Example:

```
CHIARA [00:00:00]
per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale.

LUANA [00:00:05]
Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.
```
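The [HH:MM:SS] timestamps are derived from each segment's start time in seconds. A minimal bash sketch of that conversion (hypothetical helper, assuming whole-second precision is enough):

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a segment start time (seconds, possibly
# fractional) as the [HH:MM:SS] timestamp used in the diarized transcript.
format_timestamp() {
  local total=${1%%.*}   # drop the fractional part, e.g. 5.8 -> 5
  printf '[%02d:%02d:%02d]\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

format_timestamp 5.8    # -> [00:00:05]
format_timestamp 3725   # -> [01:02:05]
```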
The `_text.txt` file contains the plain text transcription extracted from the `.text` field of the API response.

Example:

```
per questo ti volevo fare un po' di domande, perché io so che Luana è abbastanza storico come come locale. Allora, la storia nostra è cominciata che noi era una famiglia dei contadini, quindi mio padre, loro erano undici figli.
```
## Limitations

- Maximum 4 speakers - Script enforces this limit
- WAV files only - Voice references must be pre-converted to WAV format
- Duration limits - Voice references must be 1.2-10 seconds
- API costs - OpenAI charges for API usage based on audio duration
The script validates:
- ✅ Input file exists
- ✅ `--speaker` format is correct (`Name:file.wav`)
- ✅ Voice reference files exist
- ✅ Voice reference files are WAV format
- ✅ Maximum speaker count (4)
- ✅ API response errors
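The `--speaker` format check can be sketched in pure bash (a hypothetical helper, not the script's actual implementation):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the --speaker argument validation:
# the value must look like Name:file.wav.
valid_speaker_arg() {
  [[ "$1" == ?*:?*.wav ]]
}

valid_speaker_arg "Luana:voce_luana_mini.wav" && echo "ok"
valid_speaker_arg "Luana" || echo "Invalid --speaker format. Use Name:file.wav" >&2
```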
Common errors and solutions:
| Error | Solution |
|---|---|
| `Voice file 'x.wav' not found` | Check that the file path is correct |
| `Voice file must be a WAV file` | Convert to WAV using ffmpeg |
| `Invalid --speaker format` | Use the format `Name:file.wav` |
| `Maximum 4 speakers allowed` | Reduce the number of speakers |
| `Known speaker references has duration...` | Voice reference must be 1.2-10 seconds |
## Environment Variables

- `OPENAI_API_KEY` - (Required) Your OpenAI API key
This script uses:

- Model: `gpt-4o-transcribe-diarize`
- Language: Italian (`it`)
- Response Format: `diarized_json`
- Chunking Strategy: `auto`
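Put together, the underlying request presumably resembles the multipart call below. This is only a sketch against the OpenAI `/v1/audio/transcriptions` endpoint using the settings listed above; the known-speaker reference fields are omitted, and parameter names should be verified against the current OpenAI audio API documentation.

```shell
#!/usr/bin/env bash
# Sketch of the transcription request (assumed parameter names; the
# known-speaker reference fields used for diarization are omitted).
AUDIO_FILE="interview.mp3"
args=(
  -s https://api.openai.com/v1/audio/transcriptions
  -H "Authorization: Bearer ${OPENAI_API_KEY}"
  -F "file=@${AUDIO_FILE}"
  -F "model=gpt-4o-transcribe-diarize"
  -F "language=it"
  -F "response_format=diarized_json"
  -F "chunking_strategy=auto"
)
# Uncomment to actually call the API:
# curl "${args[@]}" > interview_raw.json
printf '%s\n' "${args[@]}"
```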
## Troubleshooting

If `jq` is not installed: `brew install jq` (macOS) or `sudo apt-get install jq` (Linux).
Verify that your WAV file meets the requirements:

```bash
ffprobe -i your_voice.wav -show_streams
```

Look for:

- `codec_name: pcm_s16le` (or a similar PCM codec)
- `sample_rate: 16000`
- `channels: 1`
- `duration`: 1.2-10.0 seconds
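The 1.2-10 second bound can also be checked programmatically once you have the duration from ffprobe. A small sketch (hypothetical helper) that takes the reported duration in seconds:

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a voice reference duration (seconds)
# falls inside the accepted 1.2-10.0 second window.
duration_ok() {
  awk -v d="$1" 'BEGIN { exit !(d >= 1.2 && d <= 10.0) }'
}

duration_ok 3.0  && echo "duration ok"
duration_ok 0.5  || echo "too short: references must be at least 1.2 s" >&2
duration_ok 12.0 || echo "too long: references must be at most 10 s" >&2
```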