Voice-to-Text with Optional Diarization

Powered by faster-whisper and NVIDIA NeMo Sortformer (runs locally on this Space).

Language

Leave blank for auto-detect.
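
As a rough illustration of the auto-detect behavior, a blank Language field can be mapped to `None`, which faster-whisper's `transcribe` treats as a request to detect the language. This is a minimal sketch, not this Space's actual code: the helper names `normalize_language` and `transcribe_file` are hypothetical, and the `large-v3` default is taken from the Tips list.

```python
from typing import List, Optional, Tuple


def normalize_language(raw: Optional[str]) -> Optional[str]:
    """Map a blank UI field to None so faster-whisper auto-detects the language."""
    if raw is None:
        return None
    code = raw.strip().lower()
    return code or None


def transcribe_file(
    audio_path: str,
    raw_language: Optional[str],
    model_size: str = "large-v3",  # assumed default, per the Tips list
) -> Tuple[List[Tuple[float, float, str]], str]:
    """Hypothetical wrapper: returns (start, end, text) segments and the language used."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, info = model.transcribe(
        audio_path, language=normalize_language(raw_language)
    )
    # `segments` is a generator; iterating it runs the actual transcription.
    return [(s.start, s.end, s.text) for s in segments], info.language
```

With this sketch, `normalize_language("")` yields `None` (auto-detect), while `normalize_language(" AR ")` yields `"ar"`.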

Uses the NVIDIA Sortformer model (max 4 speakers; downloads ~700MB on first use).
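
One common way to combine diarization output with a transcript is to give each transcribed segment the speaker whose turns overlap it the most. The sketch below assumes that approach; this page does not say how the Space merges the two, and the `(start, end, speaker)` tuple layout and the name `assign_speaker` are illustrative only.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout for a diarization turn: (start_seconds, end_seconds, speaker_label)
Turn = Tuple[float, float, str]


def assign_speaker(seg_start: float, seg_end: float, turns: List[Turn]) -> Optional[str]:
    """Return the speaker with the largest total overlap, or None if nothing overlaps."""
    totals: Dict[str, float] = {}
    for t_start, t_end, speaker in turns:
        overlap = min(seg_end, t_end) - max(seg_start, t_start)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    if not totals:
        return None
    return max(totals, key=lambda name: totals[name])
```

For example, with turns `[(0.0, 2.0, "speaker_0"), (2.0, 5.0, "speaker_1")]`, a segment spanning 1.5s to 4.0s overlaps `speaker_1` for longer and would be labeled accordingly.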

Tips

  • Whisper model: large-v3 (downloaded automatically on first run)
  • Diarization model: NVIDIA nvidia/diar_streaming_sortformer_4spk-v2 (streaming, max 4 speakers)
  • Diarization downloads ~700MB on first use (cached afterward)
  • Set WHISPER_MODEL_SIZE in Space Variables to change the Whisper model; medium and large-v3 give higher accuracy
  • Optimized for Arabic customer service calls with a specialized initial prompt
  • Streaming configuration: high-latency preset (10s latency, better accuracy)
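
Space Variables are exposed to the app as environment variables, so the model size could be read along these lines. This is a hedged sketch rather than the Space's own code: the function name `whisper_model_size` and the fallback to `large-v3` when the variable is unset are assumptions.

```python
import os
from typing import Mapping, Optional


def whisper_model_size(
    env: Optional[Mapping[str, str]] = None,
    default: str = "large-v3",  # assumed fallback, per the first tip
) -> str:
    """Read WHISPER_MODEL_SIZE, falling back to `default` when unset or blank."""
    source = os.environ if env is None else env
    value = source.get("WHISPER_MODEL_SIZE", "").strip()
    return value or default
```

Passing a plain dictionary, as in `whisper_model_size({"WHISPER_MODEL_SIZE": "medium"})`, makes the lookup easy to check without touching the real environment.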