Voice-to-Text with Optional Diarization

Upload audio (mp3, wav, m4a, ...)

Language

Leave blank for auto-detect.

Enable Speaker Diarization

Uses the NVIDIA Sortformer model (up to 4 speakers; downloads ~700MB on first use).

Expected Number of Speakers

Set to 0 for automatic detection, or specify 2-4 to consolidate speakers.

Range: 0 to 4
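
The speaker-count setting above can be read as a small mapping from the slider value to a diarization target. The following is a minimal sketch, not the Space's actual code: the function name `resolve_speaker_count` and the constant `MAX_SPEAKERS` are hypothetical, and accepting a value of 1 is an assumption based only on the slider range (the description mentions 0 and 2-4).

```python
from typing import Optional

# The diarization model named on this page supports at most 4 speakers.
MAX_SPEAKERS = 4


def resolve_speaker_count(slider_value: int) -> Optional[int]:
    """Map the slider value to a target speaker count.

    Returns None when the caller asked for automatic detection (value 0),
    otherwise the requested count. Hypothetical helper for illustration.
    """
    if not 0 <= slider_value <= MAX_SPEAKERS:
        raise ValueError(f"expected 0-{MAX_SPEAKERS}, got {slider_value}")
    # 0 is the documented "automatic detection" setting.
    if slider_value == 0:
        return None
    return slider_value
```

Returning `None` for automatic detection keeps the "no target" case distinct from any real speaker count, so downstream code cannot mistake 0 for a valid number of speakers.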

Beam Size

Range: 1 to 10

Best Of

Range: 1 to 10
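
As an illustration of how the Language, Beam Size, and Best Of fields might be turned into decoding options, here is a minimal sketch. The helper names `clamp` and `build_decode_options` are hypothetical, and the assumption that the backend takes `language`, `beam_size`, and `best_of` as keyword options is not confirmed by this page.

```python
from typing import Any, Dict, Optional


def clamp(value: int, low: int, high: int) -> int:
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(high, value))


def build_decode_options(language: str, beam_size: int, best_of: int) -> Dict[str, Any]:
    """Build a keyword dict from the form fields (hypothetical helper).

    A blank language becomes None, which stands for auto-detect here.
    Beam size and best-of are clamped to the 1-10 range the sliders allow.
    """
    lang: Optional[str] = language.strip() or None
    return {
        "language": lang,
        "beam_size": clamp(beam_size, 1, 10),
        "best_of": clamp(best_of, 1, 10),
    }


if __name__ == "__main__":
    # A blank language field requests auto-detection.
    print(build_decode_options("", 5, 5))
    # → {'language': None, 'beam_size': 5, 'best_of': 5}
```

The resulting dict could then be unpacked into whatever transcription call the app uses, provided that call accepts these keyword names.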

Transcript

Detailed Segments & Speakers

Tips

Whisper model: large-v3 (first run downloads model automatically)
Diarization model: NVIDIA nvidia/diar_streaming_sortformer_4spk-v2 (streaming, max 4 speakers)
Diarization downloads ~700MB on first use (cached afterward)
Set WHISPER_MODEL_SIZE in Space Variables to change the Whisper model; medium and large-v3 give higher accuracy than the smaller sizes
Optimized for Arabic customer service calls with specialized initial prompt
Streaming configuration: high-latency preset (10s latency, better accuracy)
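
The WHISPER_MODEL_SIZE tip above suggests the app reads the model size from an environment variable. A minimal sketch of that lookup, assuming the variable is exposed to the process environment and that large-v3 is the fallback when it is unset (the function name `resolve_model_size` is hypothetical):

```python
import os
from typing import Mapping, Optional

# Assumed fallback, taken from the "Whisper model: large-v3" tip.
DEFAULT_MODEL_SIZE = "large-v3"


def resolve_model_size(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured Whisper model size, or the default if unset.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    source = os.environ if env is None else env
    value = source.get("WHISPER_MODEL_SIZE", "").strip()
    # An empty or missing variable falls back to the default size.
    return value or DEFAULT_MODEL_SIZE


if __name__ == "__main__":
    print(resolve_model_size({"WHISPER_MODEL_SIZE": "medium"}))  # → medium
    print(resolve_model_size({}))  # → large-v3
```

Treating an empty string the same as a missing variable avoids passing a blank model name to the loader when the variable is defined but left unfilled.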