AssemblyAI: Speech-to-Text API for Voice AI Apps

Build voice AI apps with production-ready speech recognition. Accurate transcription for developers. SOC 2 compliant. Free tier available.

AssemblyAI is the leading speech-to-text API for building voice AI applications. Founded in 2017 and based in San Francisco, it processes over 40 terabytes of audio daily for thousands of companies. The platform offers industry-leading accuracy (5.6% word error rate) with Universal-3 Pro, a promptable speech language model that enables domain-specific customization without retraining. It supports 99 languages, includes speaker diarization and entity detection, and integrates natively with voice platforms like LiveKit and Twilio. It is SOC 2 Type 2, HIPAA, and GDPR compliant, which suits it to regulated industries. Pricing starts with $50 in free credits, then $0.15/hour pay-as-you-go.

Pricing

Free tier: $50 credits (185 hours pre-recorded, 333 hours streaming). Pay-as-you-go: Universal/Universal-Streaming $0.15/hr, Universal-3 Pro $0.21/hr (pre-recorded) or $0.45/hr (streaming). Speaker diarization +$0.02/hr, sentiment analysis +$0.02/hr, entity detection +$0.03/hr. Volume discounts available for high-usage customers (50,000+ hours/month).

Frequently Asked Questions

What's the difference between Universal-3 Pro and Universal models?

Universal-3 Pro is a promptable speech language model that accepts natural language prompts for fine-grained transcription control (speaker labels, disfluencies, audio tagging) without retraining. It achieves 5.6% WER on English and costs $0.21/hr (pre-recorded) or $0.45/hr (streaming). Universal is the older model at $0.15/hr. It supports 99 languages and doesn't include prompting, which makes it the better fit for general-purpose use.
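For readers unfamiliar with the metric, word error rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance table. It illustrates the general definition only; the function name is hypothetical, and it makes no assumption about how AssemblyAI normalizes text or measures its 5.6% figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

On this scale, a 5.6% WER corresponds to roughly 5 or 6 word errors per 100 reference words.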

How does AssemblyAI pricing actually work with add-on features?

Base transcription: $0.15/hr. Speaker diarization costs +$0.02/hr, sentiment analysis +$0.02/hr, entity detection +$0.03/hr. These stack on top of the base rate, so Universal with all three add-ons comes to $0.22/hr, about 1.5x the advertised $0.15/hr base. Universal-3 Pro starts higher: $0.21/hr pre-recorded and $0.45/hr streaming, which is 3x the advertised base before any add-ons and about 2.1x its own pre-recorded rate.
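The stacking arithmetic can be sketched as a small calculator. The rates are copied from the figures listed on this page; the dictionary keys and function names are hypothetical, and the sketch assumes every add-on can be combined with the pre-recorded base rates shown, which the listing does not spell out for every model.

```python
from decimal import Decimal

# Per-hour base rates in USD, taken from the pricing listed on this page.
BASE_RATES = {
    ("universal", "pre-recorded"): Decimal("0.15"),
    ("universal-3-pro", "pre-recorded"): Decimal("0.21"),
    ("universal-3-pro", "streaming"): Decimal("0.45"),
}

# Per-hour add-on rates in USD, also from this page.
ADD_ON_RATES = {
    "speaker_diarization": Decimal("0.02"),
    "sentiment_analysis": Decimal("0.02"),
    "entity_detection": Decimal("0.03"),
}


def hourly_rate(model, mode, add_ons=()):
    """Base rate plus the rate of each selected add-on."""
    extras = sum((ADD_ON_RATES[name] for name in add_ons), Decimal("0"))
    return BASE_RATES[(model, mode)] + extras


def estimate_cost(hours, model, mode, add_ons=()):
    """Total cost in USD for the given number of audio hours."""
    return Decimal(str(hours)) * hourly_rate(model, mode, add_ons)


all_add_ons = tuple(ADD_ON_RATES)
print(hourly_rate("universal", "pre-recorded", all_add_ons))         # 0.22
print(hourly_rate("universal-3-pro", "pre-recorded", all_add_ons))   # 0.28
print(estimate_cost(100, "universal", "pre-recorded", all_add_ons))  # 22.00
```

Decimal is used instead of float so that cent-level sums such as 0.15 + 0.02 come out exact.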

Is AssemblyAI HIPAA compliant for healthcare applications?

Yes. AssemblyAI's compliance coverage includes SOC 2 Type 2, HIPAA, GDPR, ISO 27001, PCI, FedRAMP, and CSA Star Level 1. It supports Business Associate Agreements (BAAs) for healthcare use cases including medical transcription, clinical documentation, and telemedicine call recording.

What's the free tier limit and when should I upgrade?

Free tier provides $50 credits: up to 185 hours of pre-recorded transcription or 333 hours of streaming. Once exhausted, pricing switches to pay-as-you-go. For commercial use or >10 hours/month of transcription, a paid account is recommended. Enterprise discounts available at 50,000+ hours/month.
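As a rough check on how far the credits go, the hours covered are the credit balance divided by the per-hour rate. The sketch below uses the streaming rates listed on this page ($0.15/hr for Universal-Streaming, $0.45/hr for Universal-3 Pro streaming); the function name is hypothetical, and it assumes credits are drawn down at a flat per-hour rate with no add-ons.

```python
from decimal import Decimal, ROUND_DOWN


def hours_covered(credit_usd, rate_per_hour):
    """Whole hours a credit balance covers at a flat per-hour rate."""
    hours = Decimal(credit_usd) / Decimal(rate_per_hour)
    return int(hours.to_integral_value(rounding=ROUND_DOWN))


print(hours_covered("50", "0.15"))  # 333 (Universal-Streaming)
print(hours_covered("50", "0.45"))  # 111 (Universal-3 Pro streaming)
```

The first result matches the 333 streaming hours quoted for the free tier.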

How does Universal-3 Pro streaming work with voice agents?

Universal-3 Pro Streaming delivers immutable low-latency transcripts with intelligent endpointing for real-time voice agent turn detection. It uses punctuation-based turn detection and supports up to 1,000 domain-specific keyterms. Native integrations with LiveKit, Twilio, Daily, and PipeCat allow deployment in under 15 minutes. Usage is billed at $0.45/hr of session duration.

Can I use AssemblyAI offline or self-host?

AssemblyAI is a cloud-only API platform, with no offline capabilities or self-hosting options. All processing happens on AssemblyAI's infrastructure. For on-premise transcription, consider the open-source Whisper model or Deepgram's enterprise self-hosting option.

What integrations does AssemblyAI support?

Native integrations: LiveKit, PipeCat, Twilio, Daily. LLM Gateway connects to OpenAI GPT, Anthropic Claude, Google Gemini. SDKs available for Python, JavaScript/Node.js, Ruby, Go. Zapier and Slack require custom webhook setup. No native Salesforce or Microsoft Teams connectors.