Azure AI Speech to Text

"Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations." [1]

azure.microsoft.com/en-us/products/ai-services/ai-speech · By Microsoft · Last verified 2026-06-21 · Source confidence: high

Azure AI Speech to Text is Microsoft's cloud speech recognition service. It offers real-time transcription, batch processing, speaker diarization, pronunciation assessment, and speech translation across more than 30 Azure regions. Standard real-time transcription costs $1.00 per audio hour on pay-as-you-go (batch transcription is $0.18 per audio hour), the free tier includes 5 audio hours per month, and signup is self-serve with no sales call required. SDKs cover C#, Python, JavaScript, Java, Go, and Objective-C. The profile lists SOC 2 Type II, ISO 27001, and PCI DSS attestations, along with HIPAA and GDPR support.

Best for / Avoid if

Best for: Prototypes and side projects - free to start, no sales call; Regulated or enterprise workloads - compliance attestations and an enterprise plan; AI agents and automation - an agent-ready surface (MCP / llms.txt)

Pricing & procurement

Pricing model
Usage-based [2]
Published pricing
Yes [3]
Free tier
Yes [4]
Free tier details
Free (F0) tier: 5 audio hours per month for Standard and Custom Speech to Text (shared; batch transcription not available on F0). Resets monthly. Concurrent request limit of 1 (not adjustable). No SLA on F0 tier.
Self-serve signup
Yes
Requires sales call
No
Enterprise plan
Yes
Published prices
Plan | Item | Per | Amount
Free (F0) | Standard real-time speech to text | 5 audio hours per month | $0
Pay As You Go | Standard real-time speech to text | audio hour | $1.00
Pay As You Go | Fast transcription (synchronous file-based) | audio hour | $0.36
Pay As You Go | Batch transcription | audio hour | $0.18
Pay As You Go | Custom speech real-time transcription | audio hour | $1.20
Pay As You Go | Custom speech batch transcription | audio hour | $0.225
Pay As You Go | Custom model training | compute hour | $10.00
Pay As You Go | Custom model endpoint hosting | model per hour | $0.0538
Pay As You Go | Language identification add-on (real-time) | audio hour | $0.30
Pay As You Go | Speaker diarization add-on (real-time) | audio hour | $0.30
Pay As You Go | Pronunciation assessment add-on (real-time) | audio hour | $0.30
Commitment Tier - Standard 2,000 hrs/mo | Standard real-time speech to text | month (2,000 hours included; $0.80/hr effective) | $1,600
Commitment Tier - Standard 10,000 hrs/mo | Standard real-time speech to text | month (10,000 hours included; $0.65/hr effective) | $6,500
Commitment Tier - Standard 50,000 hrs/mo | Standard real-time speech to text | month (50,000 hours included; $0.50/hr effective) | $25,000
Commitment Tier - Custom 2,000 hrs/mo | Custom speech real-time transcription | month (2,000 hours included; $0.96/hr effective) | $1,920
Commitment Tier - Custom 50,000 hrs/mo | Custom speech real-time transcription | month (50,000 hours included; $0.60/hr effective) | $30,000
Connected Container - Standard 2,000 hrs/mo | Standard real-time speech to text (connected container) | month (2,000 hours included) | $1,520
Connected Container - Standard 50,000 hrs/mo | Standard real-time speech to text (connected container) | month (50,000 hours included) | $23,750
Connected Container - Custom 2,000 hrs/mo | Custom speech real-time transcription (connected container) | month (2,000 hours included) | $1,824
Connected Container - Custom 50,000 hrs/mo | Custom speech real-time transcription (connected container) | month (50,000 hours included) | $28,500
Disconnected Container - Standard 120,000 hrs/yr | Standard real-time speech to text (disconnected/air-gapped container) | year (120,000 hours included) | $74,100
Disconnected Container - Standard 600,000 hrs/yr | Standard real-time speech to text (disconnected/air-gapped container) | year (600,000 hours included) | $285,000
Disconnected Container - Custom 120,000 hrs/yr | Custom speech real-time transcription (disconnected/air-gapped container) | year (120,000 hours included) | $88,920
Disconnected Container - Custom 600,000 hrs/yr | Custom speech real-time transcription (disconnected/air-gapped container) | year (600,000 hours included) | $342,000
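
The published prices imply a break-even point between pay-as-you-go and each commitment tier. The sketch below works through that arithmetic for standard real-time transcription, using only figures from the price list. The class and function names are illustrative, the free tier is ignored, and overage pricing is not listed in this profile, so a tier is only considered when the usage fits inside its included hours.

```python
from dataclasses import dataclass

# USD per audio hour for standard real-time transcription (pay-as-you-go row).
PAYG_RATE = 1.00


@dataclass(frozen=True)
class CommitmentTier:
    name: str            # illustrative label, not an official SKU name
    monthly_fee: float   # USD per month
    included_hours: int  # audio hours covered by the fee


# Figures copied from the commitment tier rows of the price list.
STANDARD_TIERS = [
    CommitmentTier("Standard 2,000", 1600.0, 2_000),
    CommitmentTier("Standard 10,000", 6500.0, 10_000),
    CommitmentTier("Standard 50,000", 25000.0, 50_000),
]


def effective_rate(tier: CommitmentTier) -> float:
    """USD per hour if every included hour is used."""
    return tier.monthly_fee / tier.included_hours


def break_even_hours(tier: CommitmentTier, payg_rate: float = PAYG_RATE) -> float:
    """Monthly hours above which the tier is cheaper than pay-as-you-go."""
    return tier.monthly_fee / payg_rate


def cheapest_option(hours: float) -> tuple[str, float]:
    """Return (label, monthly cost) for usage that fits inside a tier.

    Tiers with fewer included hours than `hours` are skipped because
    overage rates are not part of this profile.
    """
    options = [("Pay As You Go", hours * PAYG_RATE)]
    options += [
        (t.name, t.monthly_fee) for t in STANDARD_TIERS if hours <= t.included_hours
    ]
    return min(options, key=lambda option: option[1])
```

For example, the 2,000-hour tier breaks even at 1,600 hours per month: below that, pay-as-you-go is cheaper, and `cheapest_option(1800)` picks the commitment tier.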

Capabilities

  • Real-time streaming
  • Speaker diarization
  • Speech translation
Supported actions
transcribe_realtime, transcribe_batch, transcribe_fast, speaker_diarization, language_detection, word_timestamps, custom_speech_model, pronunciation_assessment, phrase_lists, speech_translation, keyword_recognition, llm_speech_transcription, post_stream_refinement [5]
Regions
South Africa North, East Asia, Southeast Asia, Australia East, Central India, Japan East, Japan West, Korea Central, Canada Central, Canada East, North Europe, West Europe, France Central, Germany West Central, Italy North, Norway East, Sweden Central, Switzerland North, Switzerland West, UK South, UK West, UAE North, Brazil South, Qatar Central, Central US, East US, East US 2, North Central US, South Central US, West Central US, West US, West US 2, West US 3 [6]
Languages
Afrikaans (South Africa), Amharic (Ethiopia), Arabic (20+ locales), Assamese (India), Azerbaijani, Bulgarian, Bhojpuri (India), Bengali (India), Bosnian, Catalan, Czech, Welsh, Danish, German (3 locales), Greek, English (15+ locales), Spanish (22 locales), Estonian, Basque, Persian, Finnish, Filipino, French (4 locales), Irish, Galician, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Icelandic, Italian (2 locales), Japanese, Javanese, Georgian, Kazakh, Khmer, Kannada, Korean, Lao, Lithuanian, Latvian, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Norwegian Bokmål, Nepali, Dutch (2 locales), Odia, Punjabi, Polish, Pashto, Portuguese (2 locales), Romanian, Russian, Sinhala, Slovak, Slovenian, Somali, Albanian, Serbian, Swedish, Kiswahili, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Chinese Wu (Simplified), Chinese Cantonese (Simplified), Chinese Mandarin (Simplified), Chinese Southwestern Mandarin, Chinese Cantonese (Traditional), Chinese Taiwanese Mandarin, isiZulu - 130+ languages/locales total [7]
Input types
audio/wav (PCM, default), audio/mp3, audio/ogg (OPUS), audio/flac, AMR, AMR-WB, A-Law, Mu-Law, streaming via WebSocket (Speech SDK), file via Azure Blob Storage SAS URI, file via public URI, live microphone stream
Output types
JSON (with word-level timestamps, offset, duration, speaker labels), plain text, SRT/VTT captions (via post-processing), word-level timestamps (batch and real-time SDK), diarization speaker labels
Webhooks
Yes [8]
Sandbox / test mode
No [9]
SDK languages
C#/.NET, Python, JavaScript, Java, Go, Objective-C [10]
MCP server
Yes [11]
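
The output types above note that SRT/VTT captions come from post-processing rather than directly from the service. A minimal sketch of that step follows. It assumes each recognized phrase has already been reduced to a dictionary with `offset_ms`, `duration_ms`, `text`, and an optional `speaker` key; those field names are hypothetical and do not mirror the service's actual JSON schema.

```python
def format_timestamp(ms: int) -> str:
    """Render milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(phrases: list[dict]) -> str:
    """Build SRT text from phrases in the hypothetical shape described above."""
    blocks = []
    for index, phrase in enumerate(phrases, start=1):
        start = phrase["offset_ms"]
        end = start + phrase["duration_ms"]
        text = phrase["text"]
        # Prefix the diarization label when one is present.
        if phrase.get("speaker") is not None:
            text = f"Speaker {phrase['speaker']}: {text}"
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        )
    # Cue blocks are separated by a blank line.
    return "\n".join(blocks)
```

If the real payload reports offsets in another unit, convert to milliseconds before calling `to_srt`.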

Trust & compliance

SOC 2
SOC 2 Type II [12]
HIPAA
Yes [13]
GDPR
Yes [14]
ISO 27001
Yes [15]
PCI DSS
Yes [16]
Published SLA
Yes [17]
Rate limits
Real-time speech to text: 100 concurrent requests per resource (base model and custom endpoint, adjustable for S0). Fast transcription: 600 requests per minute (adjustable). Batch transcription REST API: 100 requests per 10 seconds (600/min). Free (F0) concurrent request limit: 1 (not adjustable). Batch transcription: max audio file size 1 GB, max audio length 240 min (with diarization), max 1,000 files per request, max 10,000 blobs per container. [18]
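
A client can stay under the batch REST limit of 100 requests per 10 seconds by throttling on its own side. The sketch below is one possible approach, a sliding-window limiter; the class name and the injectable `clock` parameter are illustrative choices, not part of any Azure SDK.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(self, max_requests=100, window_seconds=10.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock    # injectable so the limiter can be tested
        self._sent = deque()   # timestamps of requests still inside the window

    def try_acquire(self) -> bool:
        """Record a request and return True if it fits in the window."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False
```

A caller would check `try_acquire()` before each request and wait briefly when it returns `False`. The real-time limit is a concurrency cap rather than a rate, so it needs a different mechanism, such as a semaphore.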
Known restrictions
  • Free (F0) tier does not support batch transcription.
  • Free (F0) concurrent request limit of 1 is not adjustable.
  • Maximum audio file size: 500 MB for fast transcription, 1 GB for batch transcription.
  • Maximum audio length for fast transcription: 5 hours per file.
  • Maximum diarization audio length: 240 minutes per session or file.
  • Diarization supports up to 35 speakers.
  • Real-time diarization session maximum: 240 minutes.
  • Data is processed only within the region of the Azure Speech resource (no cross-region processing).
  • Sovereign cloud availability is limited (Azure Government, 21Vianet). [19]
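
The file limits listed under Rate limits and Known restrictions can be checked before upload. The following is a minimal pre-flight sketch built only from those figures. The function name, the `"fast"`/`"batch"` mode strings, and the choice of binary megabytes (1 MB = 1,048,576 bytes) are assumptions made for illustration.

```python
MB = 1024 * 1024  # assumption: limits are interpreted as binary megabytes

FAST_MAX_BYTES = 500 * MB       # fast transcription file size limit
BATCH_MAX_BYTES = 1024 * MB     # batch transcription file size limit (1 GB)
FAST_MAX_MINUTES = 5 * 60       # fast transcription: 5 hours per file
DIARIZATION_MAX_MINUTES = 240   # diarization: 240 minutes per session/file


def check_audio(mode, size_bytes, duration_minutes, diarization=False, tier="S0"):
    """Return a list of limit violations; an empty list means none were found."""
    if mode not in ("fast", "batch"):
        raise ValueError(f"unknown mode: {mode!r}")
    problems = []
    if tier == "F0" and mode == "batch":
        problems.append("batch transcription is not available on the F0 tier")
    max_bytes = FAST_MAX_BYTES if mode == "fast" else BATCH_MAX_BYTES
    if size_bytes > max_bytes:
        problems.append(
            f"file exceeds the {max_bytes // MB} MB limit for {mode} transcription"
        )
    if mode == "fast" and duration_minutes > FAST_MAX_MINUTES:
        problems.append("audio exceeds 5 hours for fast transcription")
    if diarization and duration_minutes > DIARIZATION_MAX_MINUTES:
        problems.append("audio exceeds 240 minutes with diarization enabled")
    return problems
```

Returning every violation at once, rather than raising on the first, lets a caller report all the reasons a file would be rejected.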

Developer surface

Docs rendering: static · llms.txt present

Integration

API style
rest
Base URL
https://{resource}.cognitiveservices.azure.com/speechtotext/
Version
2025-10-15
Versioning
url
Stability
ga
Auth methods
api_key, oauth2
Error format
vendor-specific
Webhook signing
hmac_sha256
Rate limit
100 / concurrent
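
The `hmac_sha256` webhook signing listed above means a receiver can verify that a callback was signed with the shared secret. The sketch below shows the general technique with Python's standard library. It assumes the signature arrives as a Base64-encoded HMAC-SHA256 digest of the raw request body; the encoding and the header that carries it are assumptions to confirm against the vendor's webhook documentation.

```python
import base64
import hashlib
import hmac


def sign_payload(secret: str, body: bytes) -> str:
    """Compute a Base64-encoded HMAC-SHA256 digest of the raw body."""
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_signature(secret: str, body: bytes, received_signature: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(
        expected.encode("ascii"), received_signature.encode("utf-8")
    )
```

Verification must run on the exact bytes received, before any JSON parsing or re-serialization, since even whitespace changes alter the digest.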

SDKs

  • C#/.NET Microsoft.CognitiveServices.Speech · repo
  • Python azure-cognitiveservices-speech · repo
  • JavaScript microsoft-cognitiveservices-speech-sdk · repo
  • Java com.microsoft.cognitiveservices.speech:client-sdk · repo
  • Go github.com/Microsoft/cognitive-services-speech-sdk-go · repo
  • Objective-C · repo

Adoption & maturity

Launched
2018-09-24
GA
2018-09-24
Notable customers
Microsoft Teams, Microsoft Office 365, Microsoft Edge

Other Speech-to-Text & Transcription APIs

  • ElevenLabs Scribe (Speech to Text)

    "Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."

    Hybrid · free tier · public pricing · self-serve

  • Amazon Transcribe

    "Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."

    Usage · free tier · public pricing · self-serve

  • Google Cloud Speech-to-Text

    "Accurate voice typing and transcription powered by Gemini."

    Usage · free tier · public pricing · self-serve

  • IBM watsonx Speech to Text

    "IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."

    Usage · free tier · public pricing · self-serve

  • AssemblyAI

    "Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."

    Usage · public pricing · self-serve

  • Speechmatics

    "Low-latency speech-to-text for multilingual, multi-speaker conversations."

    Usage · free tier · public pricing · self-serve


References

Change history

Every field change, who made it, and when - from our audited data pipeline and editors.

  1. 2026-06-21 Capabilities: {} → {"translation":true,"real_time_streaming":true,"speaker_diarization":true}
  2. 2026-06-21 Summary Md: (none) → Azure AI Speech to Text is Microsoft's cloud speech recognition service, offeri…
  3. 2026-06-21 Score Setup Speed: (none) → 85
  4. 2026-06-21 Score Pricing Transparency: (none) → 100
  5. 2026-06-21 Score Docs Quality: (none) → 25
  6. 2026-06-21 Score Procurement Friction: (none) → 100
  7. 2026-06-21 Score Trust Readiness: (none) → 100
  8. 2026-06-21 Best For: (none) → Prototypes and side projects - free to start, no sales call, Regulated or enter…
  9. 2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
  10. 2026-06-21 Score Agent Friendliness: (none) → 65
  11. 2026-06-21 Has Structured Data: (none) → Yes
  12. 2026-06-21 Robots Allows Agents: (none) → Yes
  13. 2026-06-21 Docs URL: (none) → https://azure.microsoft.com/en-us/resources/developers/
  14. 2026-06-21 Llms Txt URL: (none) → https://azure.microsoft.com/llms.txt
  15. 2026-06-21 Rendering: (none) → static
  16. 2026-06-21 Llms Txt Present: (none) → Yes
  17. 2026-06-21 Pricing Model: set to usage_based
  18. 2026-06-21 Has Published Pricing: set to Yes
  19. 2026-06-21 Free Tier Available: set to Yes
  20. 2026-06-21 Free Tier Details: set to Free (F0) tier: 5 audio hours per month for Standard and Custom Speech to Text …
  21. 2026-06-21 Self Serve Signup: set to Yes
  22. 2026-06-21 Requires Sales Call: set to No
  23. 2026-06-21 Enterprise Plan Available: set to Yes
  24. 2026-06-21 SOC 2: set to type_2
  25. 2026-06-21 HIPAA: set to Yes
  26. 2026-06-21 GDPR: set to Yes
  27. 2026-06-21 ISO 27001: set to Yes
  28. 2026-06-21 PCI DSS: set to Yes
  29. 2026-06-21 SLA Published: set to Yes
  30. 2026-06-21 SLA URL: set to https://azure.microsoft.com/en-us/support/legal/sla/cognitive-services/v1_1/
  31. 2026-06-21 Data Retention Policy URL: set to https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/speech-service/s…
  32. 2026-06-21 Documented Rate Limits: set to Real-time speech to text: 100 concurrent requests per resource (base model and …
  33. 2026-06-21 Rate Limit Requests: set to 100
  34. 2026-06-21 Rate Limit Window: set to concurrent
  35. 2026-06-21 Known Restrictions: set to Free (F0) tier does not support batch transcription, Free (F0) concurrent reque…
  36. 2026-06-21 Auth Methods: set to api_key, oauth2
  37. 2026-06-21 Auth Docs URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-…
  38. 2026-06-21 API Style: set to rest
  39. 2026-06-21 Base URL: set to https://{resource}.cognitiveservices.azure.com/speechtotext/
  40. 2026-06-21 API Version: set to 2025-10-15
  41. 2026-06-21 Versioning Scheme: set to url
  42. 2026-06-21 Stability: set to ga
  43. 2026-06-21 Deprecation Policy URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-…
  44. 2026-06-21 MCP URL: set to https://learn.microsoft.com/en-us/azure/developer/azure-mcp-server/services/azu…
  45. 2026-06-21 Quickstart URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-…
  46. 2026-06-21 Error Format: set to vendor-specific
  47. 2026-06-21 Webhook Signing: set to hmac_sha256
  48. 2026-06-21 Slug: set to azure-speech-to-text
  49. 2026-06-21 Requires Verification: set to No
  50. 2026-06-21 Starting Price Usd: set to 1

Suggest an edit / leave a review

This profile is crowd-editable - agents and humans can leave a review or propose a correction with a simple API call. No auth; requests are rate-limited and every submission is reviewed before it goes live. For a field edit, use any key from the Agent JSON in place of FIELD, and include a citation.

Leave a review or comment

curl -X POST https://apio.sh/api/feedback/azure-speech-to-text \
  -H 'Content-Type: application/json' \
  -d '{"kind":"review","rating":5,"body":"Your experience with this API…"}'

Suggest a correction to a field (cite a source)

curl -X POST https://apio.sh/api/suggest/azure-speech-to-text/FIELD \
  -H 'Content-Type: application/json' \
  -d '{"value":"corrected value","citations":[{"url":"https://source.example/page","excerpt":"supporting quote"}],"note":"what changed and why"}'
