Speech-to-Text (STT)

Definition & meaning

Definition

Speech-to-Text (STT), also called Automatic Speech Recognition (ASR), is a technology that converts spoken language into written text using AI. Modern STT systems use deep learning models such as OpenAI Whisper to transcribe audio accurately across dozens of languages, including recordings with moderate background noise. STT underpins voice assistants, meeting transcription, podcast indexing, accessibility tools, and voice-controlled interfaces. The technology has advanced from basic dictation to handling context, speaker identification, punctuation, and domain-specific terminology. Popular STT solutions include OpenAI Whisper (open-source), Google Speech-to-Text, Amazon Transcribe, and AssemblyAI.

How It Works

Modern STT systems use end-to-end deep learning architectures: a single trained network maps audio to text tokens, with no separately built acoustic, pronunciation, and language models. The audio is first converted into mel spectrograms or log-filterbank features, which capture frequency information over time. These features feed into an encoder, typically a transformer or conformer architecture, that learns acoustic representations. A decoder then converts these representations into text tokens using either connectionist temporal classification (CTC) or attention-based sequence-to-sequence decoding. OpenAI's Whisper popularized the encoder-decoder approach, trained on 680,000 hours of multilingual audio. More recent models such as Deepgram's Nova-2 and AssemblyAI's Universal-2 are trained on proprietary datasets that their vendors describe as even larger, and they are reported to approach human-level accuracy. Real-time STT adds streaming: audio is processed in chunks with look-ahead buffers, which enables transcription with under 500 ms of latency.
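
To make the CTC decoding step more concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is an illustration, not the decoder of any particular product: the toy vocabulary, the blank index of 0, and the per-frame scores are all made up for the example, and production systems typically use beam search, often with a language model, rather than this greedy rule.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC "blank" symbol


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Greedy CTC decoding: best label per frame, merge repeats, drop blanks."""
    # 1. Pick the highest-scoring label for every audio frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # 2. Merge consecutive duplicates (one sound usually spans several frames).
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks, which only separate repeated labels and fill silence.
    return "".join(vocab[label] for label in collapsed if label != blank)


# Hypothetical toy vocabulary and encoder output (one row of scores per frame).
VOCAB = ["_", "c", "a", "t"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # "c"
    [0.10, 0.60, 0.20, 0.10],  # "c" again, merged with the previous frame
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "a"
    [0.10, 0.10, 0.10, 0.70],  # "t"
    [0.10, 0.10, 0.10, 0.70],  # "t" again, merged
]

print(ctc_greedy_decode(frame_scores, VOCAB))  # prints "cat"
```

The blank symbol is what lets CTC emit genuine double letters: the frame labels c, blank, c decode to "cc", while c, c decode to a single "c".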

Why It Matters

Accurate, fast speech-to-text is foundational infrastructure for dozens of applications: meeting transcription, live captioning, voice search, podcast indexing, call center analytics, and voice-controlled interfaces. The accuracy gap between AI and human transcribers has narrowed sharply, and for clear audio in widely spoken languages it has effectively closed. For developers, this means you can build voice-first experiences with confidence. For businesses, real-time STT enables automated compliance monitoring, customer sentiment analysis on calls, and accessible content. The shift from batch processing to real-time streaming STT has been especially transformative, because it lets conversational AI agents respond to what users say while they are still speaking.

Real-World Examples

OpenAI's Whisper is the dominant open-source STT model, supporting 99 languages and available for self-hosting. Deepgram Nova-2 leads on speed and accuracy for enterprise real-time transcription. AssemblyAI offers best-in-class speaker diarization (who said what) alongside transcription. Google Cloud Speech-to-Text and Amazon Transcribe serve high-scale cloud deployments. Rev.ai combines AI with human review for maximum accuracy. On ThePlanetTools.ai, we evaluate STT platforms on word error rate (WER), real-time factor (processing time divided by audio duration), language coverage, speaker diarization quality, and pricing models.
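
For readers who want to see how two of these metrics are calculated, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, and real-time factor as processing time divided by audio duration. The function names and the lowercase-and-split normalization are assumptions made for this illustration; published benchmarks usually apply stricter text normalization, and this is not a description of any specific evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: simple normalization, lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion: reference word missing from hypothesis
                    curr[j - 1] + 1,  # insertion: extra word in hypothesis
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means the system transcribes faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rtf = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
print(f"WER: {wer:.1%}, RTF: {rtf:.2f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis contains many more words than the reference.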