Azure AI Speech to Text
"Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations." [1]
Azure AI Speech to Text is Microsoft's cloud speech recognition service, offering real-time transcription, batch processing, speaker diarization, pronunciation assessment, and speech translation across more than 30 Azure regions. Pricing is usage-based: standard real-time transcription costs $1.00 per audio hour on pay-as-you-go, and batch transcription costs $0.18 per audio hour. A free (F0) tier includes 5 audio hours per month, and signup is self-serve with no sales call required. SDKs cover C#, Python, JavaScript, Java, Go, and Objective-C, and the service is listed with SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS compliance coverage.
Best for / Avoid if
Best for: Prototypes and side projects - free to start, no sales call; Regulated or enterprise workloads - compliance attestations and an enterprise plan; AI agents and automation - an agent-ready surface (MCP / llms.txt)
Pricing & procurement
- Pricing model
- Usage-based [2]
- Published pricing
- ✓ Yes [3]
- Free tier
- ✓ Yes [4]
- Free tier details
- Free (F0) tier: 5 audio hours per month for Standard and Custom Speech to Text (shared; batch transcription not available on F0). Resets monthly. Concurrent request limit of 1 (not adjustable). No SLA on F0 tier.
- Self-serve signup
- ✓ Yes
- Requires sales call
- ✗ No
- Enterprise plan
- ✓ Yes
| Plan | Item | Per | Amount (USD) |
|---|---|---|---|
| Free (F0) | Standard real-time speech to text | 5 audio hours per month | $0 |
| Pay As You Go | Standard real-time speech to text | audio hour | $1.00 |
| Pay As You Go | Fast transcription (synchronous file-based) | audio hour | $0.36 |
| Pay As You Go | Batch transcription | audio hour | $0.18 |
| Pay As You Go | Custom speech real-time transcription | audio hour | $1.20 |
| Pay As You Go | Custom speech batch transcription | audio hour | $0.225 |
| Pay As You Go | Custom model training | compute hour | $10.00 |
| Pay As You Go | Custom model endpoint hosting | model per hour | $0.0538 |
| Pay As You Go | Language identification add-on (real-time) | audio hour | $0.30 |
| Pay As You Go | Speaker diarization add-on (real-time) | audio hour | $0.30 |
| Pay As You Go | Pronunciation assessment add-on (real-time) | audio hour | $0.30 |
| Commitment Tier - Standard 2,000 hrs/mo | Standard real-time speech to text | month (2,000 hours included; $0.80/hr effective) | $1,600 |
| Commitment Tier - Standard 10,000 hrs/mo | Standard real-time speech to text | month (10,000 hours included; $0.65/hr effective) | $6,500 |
| Commitment Tier - Standard 50,000 hrs/mo | Standard real-time speech to text | month (50,000 hours included; $0.50/hr effective) | $25,000 |
| Commitment Tier - Custom 2,000 hrs/mo | Custom speech real-time transcription | month (2,000 hours included; $0.96/hr effective) | $1,920 |
| Commitment Tier - Custom 50,000 hrs/mo | Custom speech real-time transcription | month (50,000 hours included; $0.60/hr effective) | $30,000 |
| Connected Container - Standard 2,000 hrs/mo | Standard real-time speech to text (connected container) | month (2,000 hours included) | $1,520 |
| Connected Container - Standard 50,000 hrs/mo | Standard real-time speech to text (connected container) | month (50,000 hours included) | $23,750 |
| Connected Container - Custom 2,000 hrs/mo | Custom speech real-time transcription (connected container) | month (2,000 hours included) | $1,824 |
| Connected Container - Custom 50,000 hrs/mo | Custom speech real-time transcription (connected container) | month (50,000 hours included) | $28,500 |
| Disconnected Container - Standard 120,000 hrs/yr | Standard real-time speech to text (disconnected/air-gapped container) | year (120,000 hours included) | $74,100 |
| Disconnected Container - Standard 600,000 hrs/yr | Standard real-time speech to text (disconnected/air-gapped container) | year (600,000 hours included) | $285,000 |
| Disconnected Container - Custom 120,000 hrs/yr | Custom speech real-time transcription (disconnected/air-gapped container) | year (120,000 hours included) | $88,920 |
| Disconnected Container - Custom 600,000 hrs/yr | Custom speech real-time transcription (disconnected/air-gapped container) | year (600,000 hours included) | $342,000 |
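The pay-as-you-go and commitment-tier rows can be compared with simple arithmetic. The sketch below is illustrative only: it uses the standard real-time figures from the table, the function names are hypothetical, and it does not model overage beyond a tier's included hours because the table lists no overage rates.

```python
# Figures copied from the pricing table above (standard real-time speech to text).
PAYG_STANDARD_RATE = 1.00  # USD per audio hour

# (included hours per month, monthly price in USD)
STANDARD_COMMITMENT_TIERS = [(2_000, 1_600), (10_000, 6_500), (50_000, 25_000)]


def effective_rate(included_hours, monthly_price):
    """Per-hour cost if every included hour is used."""
    return monthly_price / included_hours


def break_even_hours(monthly_price, payg_rate=PAYG_STANDARD_RATE):
    """Monthly usage above which the commitment price beats pay-as-you-go."""
    return monthly_price / payg_rate


def cheapest_option(hours):
    """Return (label, cost) of the cheapest listed option covering `hours`.

    Tiers whose included hours are below `hours` are skipped, since
    overage pricing is not given in the table.
    """
    options = [("pay-as-you-go", hours * PAYG_STANDARD_RATE)]
    for included, price in STANDARD_COMMITMENT_TIERS:
        if hours <= included:
            options.append((f"commitment {included} hrs/mo", float(price)))
    return min(options, key=lambda option: option[1])


if __name__ == "__main__":
    for included, price in STANDARD_COMMITMENT_TIERS:
        print(included, effective_rate(included, price), break_even_hours(price))
    print(cheapest_option(1_800))
```

On these figures the 2,000-hour standard tier only pays off above 1,600 audio hours per month; below that, pay-as-you-go is cheaper.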
Capabilities
- Supported actions
- transcribe_realtime, transcribe_batch, transcribe_fast, speaker_diarization, language_detection, word_timestamps, custom_speech_model, pronunciation_assessment, phrase_lists, speech_translation, keyword_recognition, llm_speech_transcription, post_stream_refinement [5]
- Regions
- South Africa North, East Asia, Southeast Asia, Australia East, Central India, Japan East, Japan West, Korea Central, Canada Central, Canada East, North Europe, West Europe, France Central, Germany West Central, Italy North, Norway East, Sweden Central, Switzerland North, Switzerland West, UK South, UK West, UAE North, Brazil South, Qatar Central, Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, West US 3 [6]
- Languages
- Afrikaans (South Africa), Amharic (Ethiopia), Arabic (20+ locales), Assamese (India), Azerbaijani, Bulgarian, Bhojpuri (India), Bengali (India), Bosnian, Catalan, Czech, Welsh, Danish, German (3 locales), Greek, English (15+ locales), Spanish (22 locales), Estonian, Basque, Persian, Finnish, Filipino, French (4 locales), Irish, Galician, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian (2 locales), Japanese, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Lao, Lithuanian, Latvian, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Norwegian Bokmål, Nepali, Dutch (2 locales), Odia, Punjabi, Polish, Pashto, Portuguese (2 locales), Romanian, Russian, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Swedish, Kiswahili, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Chinese Wu (Simplified), Chinese Cantonese (Simplified), Chinese Mandarin (Simplified), Chinese Southwestern Mandarin, Chinese Cantonese (Traditional), Chinese Taiwanese Mandarin, isiZulu (130+ languages/locales in total) [7]
- Input types
- audio/wav (PCM, default), audio/mp3, audio/ogg (OPUS), audio/flac, AMR, AMR-WB, A-Law, Mu-Law, streaming via WebSocket (Speech SDK), file via Azure Blob Storage SAS URI, file via public URI, live microphone stream
- Output types
- JSON (with word-level timestamps, offset, duration, speaker labels), plain text, SRT/VTT captions (via post-processing), word-level timestamps (batch and real-time SDK), diarization speaker labels
- Webhooks
- ✓ Yes [8]
- Sandbox / test mode
- ✗ No [9]
- SDK languages
- C#/.NET, Python, JavaScript, Java, Go, Objective-C [10]
- MCP server
- ✓ Yes [11]
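The output types above list SRT/VTT captions as available "via post-processing" of word-level timestamps. A minimal sketch of that step follows, assuming the words have already been extracted into dictionaries with the hypothetical keys `text`, `offset_ms`, and `duration_ms`. The service's actual field names and time units vary by API and should be checked against its response schema before adapting this.

```python
def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into numbered SRT cues of at most `max_words` words.

    `words` is a list of dicts with the hypothetical keys
    `text`, `offset_ms`, and `duration_ms`.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = chunk[0]["offset_ms"]
        end = chunk[-1]["offset_ms"] + chunk[-1]["duration_ms"]
        text = " ".join(word["text"] for word in chunk)
        cues.append(
            f"{len(cues) + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"text": "hello", "offset_ms": 0, "duration_ms": 400},
        {"text": "world", "offset_ms": 500, "duration_ms": 500},
    ]
    print(words_to_srt(sample))
```

Splitting on a fixed word count keeps the sketch short; a production captioner would more likely break cues on pauses, punctuation, or speaker changes.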
Trust & compliance
- SOC 2
- SOC 2 Type II [12]
- HIPAA
- ✓ Yes [13]
- GDPR
- ✓ Yes [14]
- ISO 27001
- ✓ Yes [15]
- PCI DSS
- ✓ Yes [16]
- Published SLA
- ✓ Yes [17]
- Rate limits
- Real-time speech to text: 100 concurrent requests per resource (base model and custom endpoint, adjustable for S0). Fast transcription: 600 requests per minute (adjustable). Batch transcription REST API: 100 requests per 10 seconds (600/min). Free (F0) concurrent request limit: 1 (not adjustable). Batch transcription: max audio file size 1 GB, max audio length 240 min (with diarization), max 1,000 files per request, max 10,000 blobs per container. [18]
- Known restrictions
- Free (F0) tier does not support batch transcription; Free (F0) concurrent request limit of 1 is not adjustable; Maximum audio file size: 500 MB (fast transcription) / 1 GB (batch transcription); Maximum audio length for fast transcription: 5 hours per file; Maximum diarization audio length: 240 minutes per session/file; Diarization supports up to 35 speakers; Real-time diarization session max: 240 minutes; Data is processed only within the region of the Azure Speech resource (no cross-region processing); Sovereign cloud availability limited (Azure Government, 21Vianet) [19]
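The size and duration limits listed above can be checked on the client before a file is submitted. This is a sketch under stated assumptions: the numbers come from the rate limits and known restrictions in this profile, the names are hypothetical, and "MB"/"GB" are treated as binary units because the profile does not say which is meant.

```python
MB = 1024 * 1024  # assumption: binary megabytes

# Limits as listed in this profile; None means no duration limit is listed.
LIMITS = {
    "fast": {"max_bytes": 500 * MB, "max_minutes": 5 * 60},
    "batch": {"max_bytes": 1024 * MB, "max_minutes": None},
}
DIARIZATION_MAX_MINUTES = 240


def check_audio(mode, size_bytes, minutes, diarization=False):
    """Return a list of violated limits for `mode` ("fast" or "batch").

    An empty list means the file is within the limits listed here.
    """
    limits = LIMITS[mode]
    problems = []
    if size_bytes > limits["max_bytes"]:
        problems.append("file too large")
    if limits["max_minutes"] is not None and minutes > limits["max_minutes"]:
        problems.append("audio too long")
    if diarization and minutes > DIARIZATION_MAX_MINUTES:
        problems.append("audio too long for diarization")
    return problems


if __name__ == "__main__":
    print(check_audio("fast", 100 * MB, 60))
    print(check_audio("batch", 900 * MB, 300, diarization=True))
```

A check like this only avoids obviously doomed uploads; the service remains the authority on what it accepts.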
Developer surface
Integration
- API style
- rest
- Base URL
- https://{resource}.cognitiveservices.azure.com/speechtotext/
- Version
- 2025-10-15
- Versioning
- url
- Stability
- ga
- Auth methods
- api_key, oauth2
- Error format
- vendor-specific
- Webhook signing
- hmac_sha256
- Rate limit
- 100 concurrent requests
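The profile lists `hmac_sha256` as the webhook signing scheme. The general technique can be sketched as follows: recompute the HMAC-SHA256 of the raw request body with the shared secret and compare it to the received signature in constant time. The Base64 encoding and the function names are assumptions for illustration; the header that carries the signature and its exact encoding should be taken from the service documentation.

```python
import base64
import hashlib
import hmac


def sign(body: bytes, secret: str) -> str:
    """Compute a Base64-encoded HMAC-SHA256 of `body` keyed with `secret`.

    Base64 is an assumption here; some services hex-encode the digest instead.
    """
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def verify(body: bytes, secret: str, received_signature: str) -> bool:
    """Check a received signature against the raw, unparsed request body."""
    expected = sign(body, secret)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, received_signature)


if __name__ == "__main__":
    payload = b'{"event": "example"}'  # placeholder payload, not a real event
    signature = sign(payload, "shared-secret")
    print(verify(payload, "shared-secret", signature))
```

Verification has to run on the exact bytes received, before any JSON parsing or re-serialization, since a change in whitespace or key order changes the digest.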
Adoption & maturity
- Launched
- 2018-09-24
- GA
- 2018-09-24
- Notable customers
- Microsoft Teams, Microsoft Office 365, Microsoft Edge
Other Speech-to-Text & Transcription APIs
ElevenLabs Scribe (Speech to Text)
"Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."
Amazon Transcribe
"Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."
Google Cloud Speech-to-Text
"Accurate voice typing and transcription powered by Gemini."
IBM watsonx Speech to Text
"IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."
AssemblyAI
"Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."
Speechmatics
"Low-latency speech-to-text for multilingual, multi-speaker conversations."
References
- [1] Description: learn.microsoft.com
- [2] Pricing model: blocksentient.com · azure.microsoft.com
- [3] Published pricing: azure.microsoft.com
- [4] Free tier: blocksentient.com · learn.microsoft.com
- [5] Supported actions: learn.microsoft.com · learn.microsoft.com
- [6] Regions: learn.microsoft.com
- [7] Languages: learn.microsoft.com · learn.microsoft.com
- [8] Webhooks: learn.microsoft.com
- [9] Sandbox: learn.microsoft.com
- [10] SDK languages: learn.microsoft.com
- [11] MCP server: learn.microsoft.com
- [12] SOC 2: learn.microsoft.com
- [13] HIPAA: learn.microsoft.com
- [14] GDPR: learn.microsoft.com
- [15] ISO 27001: learn.microsoft.com
- [16] PCI DSS: learn.microsoft.com · learn.microsoft.com
- [17] Published SLA: azure.cn · azure.microsoft.com
- [18] Rate limits: learn.microsoft.com
- [19] Known restrictions: learn.microsoft.com
Change history
- 2026-06-21 Capabilities: {} → {"translation":true,"real_time_streaming":true,"speaker_diarization":true}
- 2026-06-21 Summary Md: (none) → Azure AI Speech to Text is Microsoft's cloud speech recognition service, offeri…
- 2026-06-21 Score Setup Speed: (none) → 85
- 2026-06-21 Score Pricing Transparency: (none) → 100
- 2026-06-21 Score Docs Quality: (none) → 25
- 2026-06-21 Score Procurement Friction: (none) → 100
- 2026-06-21 Score Trust Readiness: (none) → 100
- 2026-06-21 Best For: (none) → Prototypes and side projects - free to start, no sales call, Regulated or enter…
- 2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
- 2026-06-21 Score Agent Friendliness: (none) → 65
- 2026-06-21 Has Structured Data: (none) → Yes
- 2026-06-21 Robots Allows Agents: (none) → Yes
- 2026-06-21 Docs URL: (none) → https://azure.microsoft.com/en-us/resources/developers/
- 2026-06-21 Llms Txt URL: (none) → https://azure.microsoft.com/llms.txt
- 2026-06-21 Rendering: (none) → static
- 2026-06-21 Llms Txt Present: (none) → Yes
- 2026-06-21 Pricing Model: set to usage_based
- 2026-06-21 Has Published Pricing: set to Yes
- 2026-06-21 Free Tier Available: set to Yes
- 2026-06-21 Free Tier Details: set to Free (F0) tier: 5 audio hours per month for Standard and Custom Speech to Text …
- 2026-06-21 Self Serve Signup: set to Yes
- 2026-06-21 Requires Sales Call: set to No
- 2026-06-21 Enterprise Plan Available: set to Yes
- 2026-06-21 SOC 2: set to type_2
- 2026-06-21 HIPAA: set to Yes
- 2026-06-21 GDPR: set to Yes
- 2026-06-21 ISO 27001: set to Yes
- 2026-06-21 PCI DSS: set to Yes
- 2026-06-21 SLA Published: set to Yes
- 2026-06-21 SLA URL: set to https://azure.microsoft.com/en-us/support/legal/sla/cognitive-services/v1_1/
- 2026-06-21 Data Retention Policy URL: set to https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/speech-service/s…
- 2026-06-21 Documented Rate Limits: set to Real-time speech to text: 100 concurrent requests per resource (base model and …
- 2026-06-21 Rate Limit Requests: set to 100
- 2026-06-21 Rate Limit Window: set to concurrent
- 2026-06-21 Known Restrictions: set to Free (F0) tier does not support batch transcription, Free (F0) concurrent reque…
- 2026-06-21 Auth Methods: set to api_key, oauth2
- 2026-06-21 Auth Docs URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-…
- 2026-06-21 API Style: set to rest
- 2026-06-21 Base URL: set to https://{resource}.cognitiveservices.azure.com/speechtotext/
- 2026-06-21 API Version: set to 2025-10-15
- 2026-06-21 Versioning Scheme: set to url
- 2026-06-21 Stability: set to ga
- 2026-06-21 Deprecation Policy URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-…
- 2026-06-21 MCP URL: set to https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/services/azu…
- 2026-06-21 Quickstart URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-…
- 2026-06-21 Error Format: set to vendor-specific
- 2026-06-21 Webhook Signing: set to hmac_sha256
- 2026-06-21 Slug: set to azure-speech-to-text
- 2026-06-21 Requires Verification: set to No
- 2026-06-21 Starting Price Usd: set to 1