Azure AI Speech to Text

"Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations." [1]

Speech-to-Text & Transcription APIs

azure.microsoft.com/en-us/products/ai-services/ai-speech · By Microsoft · Last verified 2026-06-21 · Source confidence: high

Azure AI Speech to Text is Microsoft's cloud speech recognition service, offering real-time transcription, batch processing, speaker diarization, pronunciation assessment, and speech translation across more than 30 Azure regions. It starts at $1.00 per hour of audio with a free tier of 5 hours per month, scales via usage-based pricing, and supports self-serve signup with no sales call required. SDKs cover C#, Python, JavaScript, Java, Go, and Objective-C, and the service holds SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS certifications.

Best for / Avoid if

Best for: Prototypes and side projects - free to start, no sales call; Regulated or enterprise workloads - compliance attestations and an enterprise plan; AI agents and automation - an agent-ready surface (MCP / llms.txt)

Pricing & procurement

Pricing model: Usage-based [2]
Published pricing: Yes [3]
Free tier: Yes [4]
Source excerpt (blocksentient.com/review/microsoft-azure-speech-service/): “The F0 tier provides users with 5 audio hours free per month for both standard and custom speech-to-text (batch excluded), plus one hosted custom model monthly with automatic decommissioning after 7 days if unused.”
Source excerpt (learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-quotas-and-limits): “For the Free (F0) pricing tier, see the monthly allowances on the pricing page.”
Free tier details: Free (F0) tier: 5 audio hours per month for Standard and Custom Speech to Text (shared; batch transcription not available on F0). Resets monthly. Concurrent request limit of 1 (not adjustable). No SLA on F0 tier.
Self-serve signup: Yes
Requires sales call: No
Enterprise plan: Yes
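
The concurrent-request caps listed in this profile (1 on F0, 100 for real-time on a standard resource) can be respected on the client side. The following is a minimal sketch, assuming an asyncio-based client. `run_limited`, `InFlightCounter`, and `transcribe_stub` are hypothetical names used for illustration and are not part of the Speech SDK; the dictionary keys are illustrative labels for the two tiers.

```python
import asyncio

# Concurrent-request caps taken from this profile; the keys are illustrative.
CONCURRENCY_LIMITS = {"F0": 1, "S0": 100}


async def run_limited(jobs, tier="F0"):
    """Run coroutine factories with at most the tier's cap in flight."""
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMITS[tier])

    async def guarded(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


class InFlightCounter:
    """Hypothetical stand-in for a transcription call; records peak concurrency."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    async def transcribe_stub(self):
        self.current += 1
        self.peak = max(self.peak, self.current)
        await asyncio.sleep(0.01)  # simulate network latency
        self.current -= 1
        return "ok"
```

In a real client the stub would be replaced by the actual request coroutine; the semaphore only bounds requests issued from one process, so several workers sharing a resource would need a shared limiter.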

Published prices
Plan	Item	Per	Amount (USD)
Free (F0)	Standard real-time speech to text	5 audio hours per month	$0
Pay As You Go	Standard real-time speech to text	audio hour	$1.00
Pay As You Go	Fast transcription (synchronous file-based)	audio hour	$0.36
Pay As You Go	Batch transcription	audio hour	$0.18
Pay As You Go	Custom speech real-time transcription	audio hour	$1.20
Pay As You Go	Custom speech batch transcription	audio hour	$0.225
Pay As You Go	Custom model training	compute hour	$10.00
Pay As You Go	Custom model endpoint hosting	model per hour	$0.0538
Pay As You Go	Language identification add-on (real-time)	audio hour	$0.30
Pay As You Go	Speaker diarization add-on (real-time)	audio hour	$0.30
Pay As You Go	Pronunciation assessment add-on (real-time)	audio hour	$0.30
Commitment Tier - Standard 2,000 hrs/mo	Standard real-time speech to text	month (2,000 hours included; $0.80/hr effective)	$1,600
Commitment Tier - Standard 10,000 hrs/mo	Standard real-time speech to text	month (10,000 hours included; $0.65/hr effective)	$6,500
Commitment Tier - Standard 50,000 hrs/mo	Standard real-time speech to text	month (50,000 hours included; $0.50/hr effective)	$25,000
Commitment Tier - Custom 2,000 hrs/mo	Custom speech real-time transcription	month (2,000 hours included; $0.96/hr effective)	$1,920
Commitment Tier - Custom 50,000 hrs/mo	Custom speech real-time transcription	month (50,000 hours included; $0.60/hr effective)	$30,000
Connected Container - Standard 2,000 hrs/mo	Standard real-time speech to text (connected container)	month (2,000 hours included)	$1,520
Connected Container - Standard 50,000 hrs/mo	Standard real-time speech to text (connected container)	month (50,000 hours included)	$23,750
Connected Container - Custom 2,000 hrs/mo	Custom speech real-time transcription (connected container)	month (2,000 hours included)	$1,824
Connected Container - Custom 50,000 hrs/mo	Custom speech real-time transcription (connected container)	month (50,000 hours included)	$28,500
Disconnected Container - Standard 120,000 hrs/yr	Standard real-time speech to text (disconnected/air-gapped container)	year (120,000 hours included)	$74,100
Disconnected Container - Standard 600,000 hrs/yr	Standard real-time speech to text (disconnected/air-gapped container)	year (600,000 hours included)	$285,000
Disconnected Container - Custom 120,000 hrs/yr	Custom speech real-time transcription (disconnected/air-gapped container)	year (120,000 hours included)	$88,920
Disconnected Container - Custom 600,000 hrs/yr	Custom speech real-time transcription (disconnected/air-gapped container)	year (600,000 hours included)	$342,000
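
The effective hourly rates in the table follow from dividing each commitment fee by its included hours, and the same figures give a break-even point against pay-as-you-go. The sketch below works through that arithmetic for the standard real-time tiers. It is an illustration only: overage pricing is not listed in this profile, so `cheapest_plan` (a hypothetical helper) considers a commitment tier only when usage fits inside its allowance.

```python
PAYG_RATE = 1.00  # USD per audio hour, standard real-time (from the table)

# (monthly fee in USD, included hours) for the standard commitment tiers
COMMITMENT_TIERS = {
    "2,000 hrs/mo": (1600, 2000),
    "10,000 hrs/mo": (6500, 10000),
    "50,000 hrs/mo": (25000, 50000),
}


def effective_rate(fee, included_hours):
    """Per-hour cost when the full monthly allowance is used."""
    return fee / included_hours


def break_even_hours(fee, payg_rate=PAYG_RATE):
    """Monthly hours at which the flat fee equals pay-as-you-go spend."""
    return fee / payg_rate


def cheapest_plan(hours_per_month):
    """Pick the lowest monthly cost; overage is not modeled (assumption)."""
    options = {"Pay As You Go": hours_per_month * PAYG_RATE}
    for name, (fee, included) in COMMITMENT_TIERS.items():
        if hours_per_month <= included:
            options[name] = fee
    return min(options.items(), key=lambda item: item[1])


for name, (fee, hours) in COMMITMENT_TIERS.items():
    print(name, round(effective_rate(fee, hours), 2), break_even_hours(fee))
```

Under these figures the 2,000-hour tier only pays off above 1,600 audio hours per month; below that, pay-as-you-go is cheaper.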

Capabilities

Real-time streaming
Speaker diarization
Speech translation

Supported actions: transcribe_realtime, transcribe_batch, transcribe_fast, speaker_diarization, language_detection, word_timestamps, custom_speech_model, pronunciation_assessment, phrase_lists, speech_translation, keyword_recognition, llm_speech_transcription, post_stream_refinement [5]
Source excerpt (learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text): “The speech to text service offers the following core features: Real-time transcription: Instant transcription with intermediate results for live audio inputs. Fast transcription: Fastest synchronous output for situations with predictable latency. Batch transcription: Efficient processing for large volumes of prerecorded audio. Custom speech: Models with enhanced accuracy for specific domains and conditions.”
Source excerpt (learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text): “Diarization is the process of distinguishing and separating different speakers in an audio recording. This feature is particularly useful for transcribing conversations, meetings, or any multi-speaker audio content. The Speech service can identify up to 35 different speakers in an audio recording.”
Regions: South Africa North, East Asia, Southeast Asia, Australia East, Central India, Japan East, Japan West, Korea Central, Canada Central, Canada East, North Europe, West Europe, France Central, Germany West Central, Italy North, Norway East, Sweden Central, Switzerland North, Switzerland West, UK South, UK West, UAE North, Brazil South, Qatar Central, Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, West US 3 [6]
Languages: Afrikaans (South Africa), Amharic (Ethiopia), Arabic (20+ locales), Assamese (India), Azerbaijani, Bulgarian, Bhojpuri (India), Bengali (India), Bosnian, Catalan, Czech, Welsh, Danish, German (3 locales), Greek, English (15+ locales), Spanish (22 locales), Estonian, Basque, Persian, Finnish, Filipino, French (4 locales), Irish, Galician, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian (2 locales), Japanese, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Lao, Lithuanian, Latvian, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Norwegian Bokmål, Nepali, Dutch (2 locales), Odia, Punjabi, Polish, Pashto, Portuguese (2 locales), Romanian, Russian, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Swedish, Kiswahili, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Chinese Wu (Simplified), Chinese Cantonese (Simplified), Chinese Mandarin (Simplified), Chinese Southwestern Mandarin, Chinese Cantonese (Traditional), Chinese Taiwanese Mandarin, isiZulu - 130+ languages/locales total [7]
Input types: audio/wav (PCM, default), audio/mp3, audio/ogg (OPUS), audio/flac, AMR, AMR-WB, A-Law, Mu-Law, streaming via WebSocket (Speech SDK), file via Azure Blob Storage SAS URI, file via public URI, live microphone stream
Output types: JSON (with word-level timestamps, offset, duration, speaker labels), plain text, SRT/VTT captions (via post-processing), word-level timestamps (batch and real-time SDK), diarization speaker labels
Webhooks: Yes [8]
Source excerpt (learn.microsoft.com/en-us/azure/ai-services/speech-service/batch-transcription-create): “Instead of polling for transcription status, you can register a webhook to receive a notification when a transcription job completes (or reaches any other terminal state). The Speech service sends HTTP POST callbacks to your endpoint for transcription.created, transcription.processing, transcription.succeeded, transcription.failed, and transcription.deleted events.”
Sandbox / test mode: No [9]
SDK languages: C#/.NET, Python, JavaScript, Java, Go, Objective-C [10]
MCP server: Yes [11]
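
The output types above note that SRT/VTT captions come from post-processing rather than directly from the service. A minimal sketch of that step follows. The input shape is an assumption made for illustration: each phrase is a dict with hypothetical keys `text`, `offset_s`, and `duration_s` (seconds), which is not the service's own JSON schema, so real results would need to be mapped into it first.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(phrases):
    """Render phrases (hypothetical keys: text, offset_s, duration_s) as SRT."""
    blocks = []
    for index, phrase in enumerate(phrases, start=1):
        start = phrase["offset_s"]
        end = start + phrase["duration_s"]
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{phrase['text']}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)
```

Speaker labels from diarization could be prefixed to each cue's text in the same loop if they are present in the mapped input.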

Trust & compliance

SOC 2: SOC 2 Type II [12]
HIPAA: Yes [13]
GDPR: Yes [14]
ISO 27001: Yes [15]
PCI DSS: Yes [16]
Source excerpt (learn.microsoft.com/en-us/azure/compliance/offerings/offering-pci-dss): “Microsoft Azure maintains a PCI DSS validation using an approved Qualified Security Assessor (QSA), and is certified as compliant under PCI DSS version 4.0 at Service Provider Level 1.”
Source excerpt (learn.microsoft.com/en-us/azure/compliance/offerings/cloud-services-in-audit-scope): “See Appendices A and B in Microsoft Azure Compliance Offerings for detailed insight into which cloud services are in scope for the following compliance offerings: PCI DSS.”
Published SLA: Yes [17]
Rate limits: Real-time speech to text: 100 concurrent requests per resource (base model and custom endpoint, adjustable for S0). Fast transcription: 600 requests per minute (adjustable). Batch transcription REST API: 100 requests per 10 seconds (600/min). Free (F0) concurrent request limit: 1 (not adjustable). Batch transcription: max audio file size 1 GB, max audio length 240 min (with diarization), max 1,000 files per request, max 10,000 blobs per container. [18]
Known restrictions: Free (F0) tier does not support batch transcription; Free (F0) concurrent request limit of 1 is not adjustable; maximum audio file size: under 500 MB (fast transcription) / 1 GB (batch transcription); maximum audio length for fast transcription: under 5 hours per file; maximum audio length with diarization: 240 minutes per file or real-time session; diarization supports up to 35 speakers; data is processed only within the region of the Azure Speech resource (no cross-region processing); sovereign cloud availability is limited (Azure Government, 21Vianet) [19]
Source excerpt (learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-quotas-and-limits): “Maximum audio input file size (fast transcription): < 500 MB. Maximum audio length (fast transcription): < 5 hours per file. Maximum audio length for transcriptions with diarization enabled: 240 minutes per file. Maximum number of files per transcription request: 1,000. Maximum file size for audio input (batch): 1 GB.”
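
The size and duration limits above lend themselves to a pre-flight check before audio is uploaded. The sketch below encodes the documented figures; `AudioFile` and `check_limits` are hypothetical names, and treating MB and GB as binary units (1 MB = 1024 × 1024 bytes) is an assumption, since the profile does not say which convention applies.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: limits are read as binary megabytes


@dataclass
class AudioFile:
    size_bytes: int
    duration_min: float


def check_limits(audio, mode, diarization=False):
    """Return a list of limit violations for 'fast' or 'batch' mode."""
    problems = []
    if mode == "fast":
        if audio.size_bytes >= 500 * MB:
            problems.append("fast transcription requires a file under 500 MB")
        if audio.duration_min >= 5 * 60:
            problems.append("fast transcription requires audio under 5 hours")
    elif mode == "batch":
        if audio.size_bytes > 1024 * MB:
            problems.append("batch transcription accepts files up to 1 GB")
    else:
        raise ValueError(f"unknown mode: {mode}")
    if diarization and audio.duration_min > 240:
        problems.append("diarization is limited to 240 minutes of audio")
    return problems
```

An empty list means the file passes the checks modeled here; other quotas, such as the per-request file count, would need their own checks.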

Developer surface

Docs rendering: static · llms.txt present

Integration

API style: REST
Base URL: https://{resource}.cognitiveservices.azure.com/speechtotext/
Version: 2025-10-15
Versioning: URL
Stability: GA
Auth methods: API key, OAuth 2.0
Error format: vendor-specific
Webhook signing: HMAC-SHA256
Rate limit: 100 concurrent requests
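
Because webhook callbacks are signed with HMAC-SHA256, a receiving endpoint can recompute the signature over the raw request body and compare it with the value it received. The following is a generic sketch of that check using Python's standard library. Two details are assumptions, since this profile does not state them: that the signature is Base64-encoded, and that it arrives in a request header (the header name is not given here, so the function simply takes the received value as a string).

```python
import base64
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Base64-encoded HMAC-SHA256 of the raw request body (encoding is assumed)."""
    digest = hmac.new(secret, body, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_signature(secret: bytes, body: bytes, received: str) -> bool:
    """Compare signatures in constant time to avoid timing leaks."""
    return hmac.compare_digest(sign_payload(secret, body), received)
```

The comparison should run on the exact bytes received, before any JSON parsing or re-serialization, because whitespace changes alter the digest.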

SDKs

C#/.NET Microsoft.CognitiveServices.Speech
Python azure-cognitiveservices-speech
JavaScript microsoft-cognitiveservices-speech-sdk
Java com.microsoft.cognitiveservices.speech:client-sdk
Go github.com/Microsoft/cognitive-services-speech-sdk-go
Objective-C

Adoption & maturity

Launched: 2018-09-24
GA: 2018-09-24
Notable customers: Microsoft Teams, Microsoft Office 365, Microsoft Edge

Other Speech-to-Text & Transcription APIs

ElevenLabs Scribe (Speech to Text)
"Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."
Hybrid · free tier · public pricing · self-serve
Amazon Transcribe
"Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."
Usage · free tier · public pricing · self-serve
Google Cloud Speech-to-Text
"Accurate voice typing and transcription powered by Gemini."
Usage · free tier · public pricing · self-serve
IBM watsonx Speech to Text
"IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."
Usage · free tier · public pricing · self-serve
AssemblyAI
"Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."
Usage · public pricing · self-serve
Speechmatics
"Low-latency speech-to-text for multilingual, multi-speaker conversations."
Usage · free tier · public pricing · self-serve


References

Each field above carries a numbered source, listed here in order.

[1] Description: learn.microsoft.com
[2] Pricing model: blocksentient.com · azure.microsoft.com
[3] Published pricing: azure.microsoft.com
[4] Free tier: blocksentient.com · learn.microsoft.com
[5] Supported actions: learn.microsoft.com · learn.microsoft.com
[6] Regions: learn.microsoft.com
[7] Languages: learn.microsoft.com · learn.microsoft.com
[8] Webhooks: learn.microsoft.com
[9] Sandbox: learn.microsoft.com
[10] SDK languages: learn.microsoft.com
[11] MCP server: learn.microsoft.com
[12] SOC 2: learn.microsoft.com
[13] HIPAA: learn.microsoft.com
[14] GDPR: learn.microsoft.com
[15] ISO 27001: learn.microsoft.com
[16] PCI DSS: learn.microsoft.com · learn.microsoft.com
[17] Published SLA: azure.cn · azure.microsoft.com
[18] Rate limits: learn.microsoft.com
[19] Known restrictions: learn.microsoft.com

Change history

Every field change, who made it, and when - from our audited data pipeline and editors.

2026-06-21 Capabilities: {} → {"translation":true,"real_time_streaming":true,"speaker_diarization":true}
2026-06-21 Summary Md: (none) → Azure AI Speech to Text is Microsoft's cloud speech recognition service, offeri…
2026-06-21 Score Setup Speed: (none) → 85
2026-06-21 Score Pricing Transparency: (none) → 100
2026-06-21 Score Docs Quality: (none) → 25
2026-06-21 Score Procurement Friction: (none) → 100
2026-06-21 Score Trust Readiness: (none) → 100
2026-06-21 Best For: (none) → Prototypes and side projects - free to start, no sales call, Regulated or enter…
2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
2026-06-21 Score Agent Friendliness: (none) → 65
2026-06-21 Has Structured Data: (none) → Yes
2026-06-21 Robots Allows Agents: (none) → Yes
2026-06-21 Docs URL: (none) → https://azure.microsoft.com/en-us/resources/developers/
2026-06-21 Llms Txt URL: (none) → https://azure.microsoft.com/llms.txt
2026-06-21 Rendering: (none) → static
2026-06-21 Llms Txt Present: (none) → Yes
2026-06-21 Pricing Model: set to usage_based
2026-06-21 Has Published Pricing: set to Yes
2026-06-21 Free Tier Available: set to Yes
2026-06-21 Free Tier Details: set to Free (F0) tier: 5 audio hours per month for Standard and Custom Speech to Text …
2026-06-21 Self Serve Signup: set to Yes
2026-06-21 Requires Sales Call: set to No
2026-06-21 Enterprise Plan Available: set to Yes
2026-06-21 SOC 2: set to type_2
2026-06-21 HIPAA: set to Yes
2026-06-21 GDPR: set to Yes
2026-06-21 ISO 27001: set to Yes
2026-06-21 PCI DSS: set to Yes
2026-06-21 SLA Published: set to Yes
2026-06-21 SLA URL: set to https://azure.microsoft.com/en-us/support/legal/sla/cognitive-services/v1_1/
2026-06-21 Data Retention Policy URL: set to https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/speech-service/s…
2026-06-21 Documented Rate Limits: set to Real-time speech to text: 100 concurrent requests per resource (base model and …
2026-06-21 Rate Limit Requests: set to 100
2026-06-21 Rate Limit Window: set to concurrent
2026-06-21 Known Restrictions: set to Free (F0) tier does not support batch transcription, Free (F0) concurrent reque…
2026-06-21 Auth Methods: set to api_key, oauth2
2026-06-21 Auth Docs URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-…
2026-06-21 API Style: set to rest
2026-06-21 Base URL: set to https://{resource}.cognitiveservices.azure.com/speechtotext/
2026-06-21 API Version: set to 2025-10-15
2026-06-21 Versioning Scheme: set to url
2026-06-21 Stability: set to ga
2026-06-21 Deprecation Policy URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-…
2026-06-21 MCP URL: set to https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/services/azu…
2026-06-21 Quickstart URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-…
2026-06-21 Error Format: set to vendor-specific
2026-06-21 Webhook Signing: set to hmac_sha256
2026-06-21 Slug: set to azure-speech-to-text
2026-06-21 Requires Verification: set to No
2026-06-21 Starting Price Usd: set to 1

Suggest an edit / leave a review

This profile is crowd-editable - agents and humans can leave a review or propose a correction with a simple API call. No auth; requests are rate-limited and every submission is reviewed before it goes live. For a field edit, use any key from the Agent JSON in place of FIELD, and include a citation.

Leave a review or comment

curl -X POST https://apio.sh/api/feedback/azure-speech-to-text \
  -H 'Content-Type: application/json' \
  -d '{"kind":"review","rating":5,"body":"Your experience with this API…"}'

Suggest a correction to a field (cite a source)

curl -X POST https://apio.sh/api/suggest/azure-speech-to-text/FIELD \
  -H 'Content-Type: application/json' \
  -d '{"value":"corrected value","citations":[{"url":"https://source.example/page","excerpt":"supporting quote"}],"note":"what changed and why"}'
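
For agents that prefer not to shell out to curl, the same correction request can be assembled with Python's standard library. This is a sketch that mirrors the curl call above and only builds the request object without sending it; `build_suggestion` is a hypothetical helper name, and the payload keys are copied from the curl example.

```python
import json
import urllib.request

API_ROOT = "https://apio.sh/api"  # taken from the curl examples above


def build_suggestion(slug, field, value, url, excerpt, note):
    """Build (but do not send) the POST request for a field correction."""
    payload = {
        "value": value,
        "citations": [{"url": url, "excerpt": excerpt}],
        "note": note,
    }
    return urllib.request.Request(
        f"{API_ROOT}/suggest/{slug}/{field}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; as with the curl version, FIELD must be replaced with a real key first.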
