
Best Speech-to-Text APIs with Speaker Diarization

Transcription APIs that label who spoke when, separating speakers for meetings, interviews, and multi-party call recordings.

Required capability: Speaker diarization.
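
To make "who spoke when" concrete, here is a minimal sketch of how diarized output is often post-processed into readable speaker turns. The `Word` and `Turn` shapes and the `speaker_0`-style labels are hypothetical and chosen for illustration; they are not any listed vendor's response schema, so field names would need mapping to the API you pick.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    text: str
    start: float   # seconds from the start of the audio
    end: float     # seconds from the start of the audio
    speaker: str   # hypothetical diarization label, e.g. "speaker_0"


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def words_to_turns(words: Iterable[Word]) -> List[Turn]:
    """Merge consecutive words that share a speaker label into one turn."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


# Made-up sample data, assumed to be sorted by start time.
sample = [
    Word("Hello", 0.0, 0.4, "speaker_0"),
    Word("there", 0.4, 0.8, "speaker_0"),
    Word("Hi", 1.0, 1.2, "speaker_1"),
    Word("Thanks", 1.5, 1.9, "speaker_0"),
]
turns = words_to_turns(sample)
for t in turns:
    print(f"[{t.start:.1f}-{t.end:.1f}] {t.speaker}: {t.text}")
```

The sketch assumes words arrive in time order and that each word carries exactly one speaker label; overlapping speech would need a different representation.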

Our pick: ElevenLabs Scribe (Speech to Text)

ElevenLabs Scribe is a REST speech-to-text API supporting batch and real-time transcription across 90+ languages, with sub-150 ms latency for streaming use cases. It covers speaker diarization, word and character timestamps, entity detection and redaction, multichannel processing, and keyterm prompting, making it suitable for podcasts, video captioning, meeting documentation, and AI agent integrations. Pricing starts at $0.22 per hour of audio with a free tier of 4.5 hours per month, self-serve signup, and an enterprise plan available. The service holds SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS certifications, and ships SDKs for Python, Node.js, Swift, Kotlin, and Flutter.

Best for: Prototypes and side projects - free to start, no sales call; Regulated or enterprise workloads - compliance attestations and an enterprise plan; AI agents and automation - an agent-ready surface (MCP / llms.txt).

ElevenLabs Scribe (Speech to Text) profile →

Best for…

Best overall: ElevenLabs Scribe (Speech to Text) - our default pick: strongest across pricing, trust and breadth
Best free pick: ElevenLabs Scribe (Speech to Text) - free tier: Free plan includes 4 hours 30 minutes/month of Scribe v1/v2 transcription and 2 hours 30…
Best for enterprise: ElevenLabs Scribe (Speech to Text) - for regulated or large teams: SOC 2 Type II, HIPAA, enterprise plan
Cheapest to start: Voicegain - from $0.0015 per minute to start; compare on your real usage, not the entry price
Best for agents: ElevenLabs Scribe (Speech to Text) - easiest to wire up programmatically: MCP server + llms.txt
Broadest surface: Deepgram - 30 documented actions; breadth isn't quality, but it's the most to build on
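
The advice to compare on real usage rather than entry price can be made concrete with a little arithmetic. The sketch below is illustrative only. It uses entry prices and monthly free allowances quoted on this page, converts per-minute prices to per-hour, and assumes (a simplification, not something any vendor states) that the free allowance is simply deducted from monthly usage and that the entry rate applies flat, with no volume tiers or minimums. `Plan`, `per_minute`, and `monthly_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    price_per_hour: float              # USD per hour of audio (entry price)
    free_hours_per_month: float = 0.0  # recurring monthly allowance, if any


def per_minute(usd_per_minute: float) -> float:
    """Convert a per-minute price to a per-hour price."""
    return usd_per_minute * 60


def monthly_cost(plan: Plan, hours: float) -> float:
    """Estimated monthly bill.

    ASSUMPTION: flat entry rate, free allowance deducted from usage first.
    """
    billable_hours = max(0.0, hours - plan.free_hours_per_month)
    return billable_hours * plan.price_per_hour


# Entry prices and free allowances as listed on this page.
plans = [
    Plan("ElevenLabs Scribe", 0.22, free_hours_per_month=4.5),
    Plan("Azure AI Speech to Text", 1.00, free_hours_per_month=5.0),
    Plan("Gladia", 0.61, free_hours_per_month=10.0),
    Plan("Voicegain", per_minute(0.0015)),  # one-time signup credit ignored
]

for hours in (3, 100):
    print(f"{hours} hours of audio per month:")
    for plan in sorted(plans, key=lambda p: monthly_cost(p, hours)):
        print(f"  {plan.name}: ${monthly_cost(plan, hours):.2f}")
```

Under these assumptions, plans with a monthly free allowance cost nothing at low volume, while the lowest per-minute rate comes out ahead as volume grows. Actual bills depend on each vendor's tiers, feature surcharges, and billing rules, so treat the output as a rough ordering rather than a quote.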

Ranked (13)

#1 ElevenLabs Scribe (Speech to Text)
81 / 100
- Best overall
- Best free pick
- Best for enterprise
- Best for agents
ElevenLabs Scribe is a REST speech-to-text API supporting batch and real-time transcription across 90+ languages, with sub-150 ms latency for streaming use cases. It covers speaker diarization, word and character timestamps, entity detection and redaction, multichannel processing, and keyterm prompting, making it suitable for podcasts, video captioning, meeting documentation, and AI agent integrations. Pricing starts at $0.22 per hour of audio with a free tier of 4.5 hours per month, self-serve signup, and an enterprise plan available. The service holds SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS certifications, and ships SDKs for Python, Node.js, Swift, Kotlin, and Flutter.
Pricing: Hybrid · from $0.22 per hour of audio · free tier ✓
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001 · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- PII redaction
Used by: Revolut, Klarna, Washington Post, Deutsche Telekom
ElevenLabs Scribe (Speech to Text) profile →
#2 Azure AI Speech to Text
79 / 100
Azure AI Speech to Text is Microsoft's cloud speech recognition service, offering real-time transcription, batch processing, speaker diarization, pronunciation assessment, and speech translation across more than 30 Azure regions. It starts at $1.00 per hour of audio with a free tier of 5 hours per month, scales via usage-based pricing, and supports self-serve signup with no sales call required. SDKs cover C#, Python, JavaScript, Java, Go, and Objective-C, and the service holds SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS certifications.
Pricing: Usage · from $1 per hour of audio · free tier ✓
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001 · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- Speech translation
Used by: Microsoft Teams, Microsoft Office 365, Microsoft Edge
Azure AI Speech to Text profile →
#3 Amazon Transcribe
72 / 100
Amazon Transcribe is an automatic speech recognition service from AWS that converts audio to text via batch or real-time streaming, with support for speaker diarization, custom vocabularies, custom language models, and multi-language identification. It targets a broad range of applications including contact center analytics, clinical documentation through a dedicated medical variant, accessibility captioning, and toxic content detection in gaming. Pricing starts at $0.006 per minute on a pay-as-you-go basis, with a free tier of 60 minutes per month for the first 12 months. The service is HIPAA-eligible, SOC 2 Type II certified, and ISO 27001 and PCI DSS compliant. It is available across 25 AWS regions including GovCloud and provides SDKs for Python, JavaScript, Java, Go, C++, Ruby, and PHP.
Pricing: Usage · from $0.006 per minute · free tier ✓
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001 · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- Medical transcription
- PII redaction
Amazon Transcribe profile →
#4 Google Cloud Speech-to-Text
70 / 100
Google Cloud Speech-to-Text is a REST API from Google Cloud that converts audio to text, supporting synchronous, batch, and streaming transcription across more than a dozen languages and regional endpoints. It covers call center transcription, live captioning with WebVTT and SRT output, speaker diarization, and multi-speaker meeting transcription. Pricing starts at $0.016 per minute with a free tier of 60 minutes per month, self-serve signup, and no sales call required. The service holds SOC 2 Type II, ISO 27001, HIPAA, GDPR, and PCI DSS certifications, and ships official SDKs for Python, Node.js, Java, Go, C#, PHP, Ruby, and C++.
Pricing: Usage · from $0.016 per minute · free tier ✓
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001 · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- Medical transcription
Used by: HubSpot, InteractiveTel, Embodied, iGenius
Google Cloud Speech-to-Text profile →
#5 IBM watsonx Speech to Text
73 / 100
IBM watsonx Speech to Text is a REST API for fast, accurate transcription supporting batch, streaming, and WebSocket modes, aimed at customer self-service, call-center analytics, captioning, and accessibility applications. Pricing starts at $0.02 per minute with a 500-minute free tier and no sales call required, scaling to enterprise plans with unlimited concurrency. Deployments are available across seven global regions, SDKs cover Python, Node.js, Java, Swift, and Go, and the service holds SOC 2 Type II, HIPAA, GDPR, and ISO 27001 certifications.
Pricing: Usage · from $0.02 per minute · free tier ✓
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001
Does
- Real-time streaming
- Speaker diarization
Used by: Citibank, Bradesco, Humana
IBM watsonx Speech to Text profile →
#6 AssemblyAI
79 / 100
AssemblyAI is a voice AI platform providing speech-to-text transcription, speaker diarization, and audio intelligence features via REST API, aimed at developers building products on top of speech data. Pricing is usage-based at $0.0025 per minute with a $50 one-time free credit requiring no credit card, and enterprise plans are available. The service holds SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS certifications, with data processed in the US and EU. Customers include Zoom, Spotify, and Dovetail, and SDKs are actively maintained for Python and Node.js.
Pricing: Usage · from $0.0025 per minute · free tier ✗
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001 · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- Speech translation
- Medical transcription
- PII redaction
Used by: Zoom, Spotify, Veed, CallRail
Avoid if: You want to try it free before paying
AssemblyAI profile →
#7 Speechmatics
67 / 100
Speechmatics is a speech-to-text API supporting batch and real-time transcription across EU, US, and Australia regions, with capabilities including speaker diarization, language detection, translation, summarization, and audio event detection. It is suited to contact centers and to legal, medical, and broadcast use cases. Pricing starts at $0.0022 per minute with a free tier of 3,000 minutes per month and self-serve signup, scaling to enterprise plans with dedicated regional endpoints. The API is REST-based with SDK support for Python, Node.js, .NET, and Rust, and holds SOC 2 Type II, HIPAA, GDPR, and ISO 27001 certifications.
Pricing: Usage · from $0.0022 per minute · free tier ✓
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001
Does
- Real-time streaming
- Speaker diarization
- Speech translation
- Medical transcription
Used by: what3words, 3Play Media, Veritone, Deloitte UK
Speechmatics profile →
#8 Deepgram
59 / 100
- Broadest surface
Deepgram provides real-time and batch APIs for speech-to-text, text-to-speech, and voice agents, plus audio intelligence features such as summarization. Pricing is usage-based, published, and self-serve. It offers webhooks, four SDKs, and an official MCP server, with availability in North America and Europe. The platform carries SOC 2 Type II, HIPAA, GDPR, and PCI DSS compliance with a published SLA.
Pricing: Usage · free tier ✗
Trust: SOC 2 Type II · HIPAA · GDPR · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- PII redaction
- Self-hosted option
Avoid if: You want to try it free before paying
Deepgram profile →
#9 OpenAI Speech-to-Text
72 / 100
OpenAI Speech-to-Text is a REST API offering batch, streaming, and real-time audio transcription, speaker diarization, language detection, and translation to English, built on Whisper and newer gpt-4o-based models. It is priced at $0.003 per minute on a self-serve, pay-as-you-go basis with no sales call required, and an enterprise plan is available. The API ships official SDKs for Python, Node.js, Java, Go, Ruby, and .NET, and holds SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS certifications.
Pricing: Usage · from $0.003 per minute · free tier ✗
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001 · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- Speech translation
Used by: Speak
Avoid if: You want to try it free before paying
OpenAI Speech-to-Text profile →
#10 Gladia
71 / 100
Gladia is an audio infrastructure API covering batch and real-time speech-to-text transcription, speaker diarization, translation, summarization, sentiment and emotion analysis, and named entity recognition, targeting voice agents, contact centers, meeting assistants, and media captioning workflows. Pricing is usage-based at $0.61 per hour with a free tier of 10 hours per month and no sales call required to start. The API is REST-based with TypeScript, JavaScript, and Python SDKs, webhooks, and an MCP server, and is hosted in EU (France, default) and US regions. Gladia holds SOC 2 Type II, HIPAA, and GDPR compliance, and counts Aircall, Citibank, Samsung, Oracle, and Microsoft among its customers.
Pricing: Usage · from $0.61 per hour · free tier ✓
Trust: SOC 2 Type II · HIPAA · GDPR
Does
- Real-time streaming
- Speaker diarization
- Speech translation
- PII redaction
Used by: Aircall, Attention, Recall, VEED
Gladia profile →
#11 Rev AI
58 / 100
Rev AI is a speech-to-text API from Rev, offering both asynchronous batch transcription and real-time streaming, with capabilities including speaker diarization, word timestamps, custom vocabulary, language detection, translation, sentiment analysis, and summarization. Pricing is usage-based at $0.0017 per minute with a 5-hour free tier and self-serve signup, so no sales call is needed to start. SDKs are available for Python, Node.js, Java, and Go, and the service is SOC 2 Type II certified, HIPAA compliant, and GDPR compliant, with data residency options in the US and EU.
Pricing: Usage · from $0.0017 per minute · free tier ✗
Trust: SOC 2 Type II · HIPAA · GDPR
Does
- Real-time streaming
- Speaker diarization
- Speech translation
Avoid if: You want to try it free before paying
Rev AI profile →
#12 Voicegain
60 / 100
- Cheapest to start
Voicegain is a speech-to-text and voice AI platform aimed at contact centers, healthcare payers, and enterprises that need telephony transcription, PII/PCI redaction, real-time agent assist, and custom ASR model training. Pricing starts at $0.0015 per minute on a pay-as-you-go basis, with a $50 one-time signup credit and no credit card required; on-premise and private-cloud deployments are available but require an annual commitment. The platform holds SOC 2 Type II, HIPAA, GDPR, and PCI DSS certifications, and customers include Aetna, Samsung, and Sutherland.
Pricing: Usage · from $0.0015 per minute · free tier ✗
Trust: SOC 2 Type II · HIPAA · GDPR · PCI DSS
Does
- Real-time streaming
- Speaker diarization
- PII redaction
- Self-hosted option
Used by: Sutherland, Samsung, Aetna, LevelAI
Avoid if: You want to try it free before paying
Voicegain profile →
#13 Soniox
68 / 100
Soniox is a speech-to-text API built for real-time and batch transcription workloads, targeting voice agents, call centers, medical teams, and media producers who need multilingual support, speaker diarization, and word-level timestamps. Pricing is usage-based at $0.0017 per minute with self-serve signup and no sales call required, though free credits were discontinued in October 2025. The platform holds SOC 2 Type II, HIPAA, GDPR, and ISO 27001 certifications, with data residency options across the United States, European Union, and Japan. SDKs are available for Python, Node.js, and browser JavaScript, and an MCP server is also supported.
Pricing: Usage · from $0.0017 per minute · free tier ✗
Trust: SOC 2 Type II · HIPAA · GDPR · ISO 27001
Does
- Real-time streaming
- Speaker diarization
- Speech translation
- Medical transcription
Used by: Scribe
Avoid if: You want to try it free before paying
Soniox profile →

Scope: only APIs with the required capability, picked from published, cited data. The score is one input, not the verdict, and we lead with each one’s trade-off. No reviews yet, no paid placement. See the full Speech-to-Text & Transcription APIs directory.
