
Universal

Universal is designed for seamless integration into diverse speech-to-text workflows, enabling accurate and efficient transcription across multiple languages and audio conditions.

Universal offers a scalable, robust solution suitable for a wide range of speech recognition tasks.

AssemblyAI's Universal series is a family of speech-to-text (STT) models that converts spoken language into accurate, readable text. The models are trained on over 12.5 million hours of multilingual audio data and perform well in complex real-world conversational scenarios, handling multiple speakers, varied accents, and background noise with high fidelity.

Technical Specifications

  • Architecture: Universal-1 pairs a Conformer encoder with a recurrent neural network transducer (RNN-T) to balance speed and accuracy (see the sketch after this list).
  • Encoder Details: The encoder has convolutional layers for 4x subsampling, positional encoding, and 24 Conformer layers totaling roughly 600 million parameters. Each Conformer block applies chunk-wise attention on 8-second audio chunks for faster processing and robustness to audio length variations.
  • Decoder: Two-layer LSTM predictor with a joiner, using a WordPiece tokenizer trained on multilingual corpora.
  • Parallel Processing: The design leverages highly parallelized encoder computation, enabling large-scale, low-latency inference ideal for real-time applications.
  • Timestamping: Maintains precise time alignment for accurate word-level timestamp estimation.
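
For intuition, here is a structural sketch in PyTorch of the RNN-T layout described above: a 4x-subsampling convolutional frontend, chunk-restricted attention, a two-layer LSTM predictor, and a joiner. All dimensions and module choices are illustrative assumptions, not AssemblyAI's implementation.

import torch
import torch.nn as nn

class Subsample4x(nn.Module):
    """Two stride-2 convolutions give the 4x time subsampling of the encoder frontend.
    A production encoder would stack 24 Conformer blocks (~600M parameters) on top."""
    def __init__(self, dim: int = 512, n_mels: int = 80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, n_mels) log-mel frames
        return self.conv(feats.transpose(1, 2)).transpose(1, 2)  # (B, T/4, dim)

def chunk_mask(num_frames: int, chunk: int) -> torch.Tensor:
    """Boolean attention mask restricting each frame to its own fixed-size chunk,
    mirroring the 8-second chunk-wise attention described above."""
    idx = torch.arange(num_frames) // chunk
    return idx[:, None] == idx[None, :]  # True where attention is permitted

class Predictor(nn.Module):
    """Two-layer LSTM over previously emitted WordPiece tokens."""
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, U) token ids
        return self.lstm(self.embed(tokens))[0]  # (B, U, dim)

class Joiner(nn.Module):
    """Combines each encoder frame with each predictor state into token logits."""
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, enc: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
        # enc: (B, T, dim), pred: (B, U, dim) -> logits: (B, T, U, vocab)
        return self.out(torch.tanh(enc.unsqueeze(2) + pred.unsqueeze(1)))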

Performance Benchmarks

  • Achieves state-of-the-art Word Error Rate (WER) on English, surpassing several commercial ASR providers and open-source models, including OpenAI’s Whisper Large-v3 and NVIDIA’s Canary-1B (see the WER computation sketch after this list).
  • Demonstrates improved robustness to background noise, telephony audio, and other challenging acoustic environments.
  • Shows competitive WER on Spanish, French, and German datasets with strong cross-language robustness.
  • In human evaluations, raters preferred Universal-1 transcriptions over the previous-generation Conformer-2 about 60% of the time, highlighting qualitative transcription improvements.
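
For reference, WER is the word-level edit distance (substitutions + insertions + deletions) between a hypothesis and a reference transcript, divided by the reference word count. A minimal, self-contained computation using standard dynamic programming (not AssemblyAI's evaluation code):

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167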


Training Data

Universal is trained on over 12.5 million hours of multilingual audio data. This scale and diversity of training material underpins the robustness noted above: reliable handling of multiple speakers, varied accents, background noise, and telephony-quality audio across real-world conversational conditions.

API Pricing

  • $0.004725 per minute of audio
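
At this rate, cost scales linearly with audio volume. A quick back-of-the-envelope sketch (the monthly volume below is hypothetical):

RATE_PER_MINUTE = 0.004725  # USD per minute of audio, per the pricing above

def monthly_cost(hours_per_month: float) -> float:
    """Estimated monthly spend for a given volume of transcribed audio."""
    return RATE_PER_MINUTE * 60 * hours_per_month

print(monthly_cost(1_000))  # 1,000 hours/month -> $283.50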

Core Features & Capabilities

  • High-accuracy transcription with punctuation, capitalization, and advanced text formatting.
  • Speaker diarization to identify and label individual speakers (see the SDK sketch after this list).
  • Accurate recognition and transcription of proper nouns and alphanumeric content (e.g., phone numbers, emails).
  • Low-latency real-time transcription with scalable efficiency.
  • Flexible fine-tuning and customization options for enterprise use cases.
  • Rigorous bias mitigation, content safety, and hallucination reduction strategies.
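
As referenced above, here is a short sketch of the diarization and word-timestamp features using the AssemblyAI Python SDK. The speaker_labels option and the utterances/words fields follow the SDK's documented surface, but treat the exact names as assumptions and verify against the current docs.

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# Enable speaker diarization for this request.
config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("https://example.com/meeting.mp3", config=config)

# Per-speaker utterances from diarization.
for utt in transcript.utterances:
    print(f"Speaker {utt.speaker}: {utt.text}")

# Word-level timestamps (milliseconds from the start of the audio).
for word in transcript.words[:10]:
    print(f"{word.text}: {word.start}-{word.end} ms")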

Code Sample
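
A minimal end-to-end transcription sketch with the AssemblyAI Python SDK (pip install assemblyai); the API key and audio URL are placeholders.

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.com/audio.mp3")

if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)

print(transcript.text)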

Comparison with Other Models

vs GPT-5: GPT-5 features an extremely large 400,000-token context window and advanced hierarchical reasoning capabilities. It excels at complex, multi-step tasks that go beyond speech transcription, making it better suited to large-scale language understanding and generation than to real-time STT processing.

vs GPT-4.1: GPT-4.1 specializes in coding-related tasks and structured code manipulation, with a smaller context window than GPT-5. It offers less extensive multimodal support and is optimized for developer-focused scenarios rather than broad speech recognition or multimodal integration.

vs OpenAI o3: OpenAI o3 primarily serves legacy agent tasks and provides basic image understanding capabilities. It has higher latency than AssemblyAI's Universal and less accurate multimodal reasoning, making it less suited to modern real-time transcription and multimodal applications.
