Azure AI Text to Speech

"Text to speech enables your applications, tools, or devices to convert text into human like synthesized speech. The text to speech capability is also known as speech synthesis. Use human like standard voices out of the box, or create a custom voice that's unique to your product or brand." [1]

Text-to-Speech APIs

azure.microsoft.com/en-us/products/ai-services/ai-speech · By Microsoft · Last verified 2026-06-21 · Source confidence: high

Azure AI Text to Speech is Microsoft's managed speech synthesis service, suited for voice agents, call center automation, audiobook narration, accessibility tools, and content creation. It is available in 35 regions (including two US Government regions), with a free tier of 500,000 characters per month and usage-based pricing starting at $15 per million characters for standard voices. SDKs are available for Python, C#, JavaScript, Java, and Go, and the service is covered by SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS compliance offerings. Custom and personal voice cloning are supported, though professional voice fine-tuning requires limited-access approval.

Best for / Avoid if

Best for: Prototypes and side projects - free to start, no sales call; Regulated or enterprise workloads - compliance attestations and an enterprise plan; AI agents and automation - an agent-ready surface (MCP / llms.txt)

Pricing & procurement

Pricing model: Usage-based [2]
Published pricing: Yes [3]
Free tier: Yes [4]
Free tier details: Free (F0) tier: 0.5 million characters per month for neural text-to-speech (recurring monthly allowance, not a one-time trial). Free tier rate limits are not adjustable.
Self-serve signup: Yes
Requires sales call: No
Enterprise plan: Yes [5]

Published prices
Plan	Item	Per	Amount	Source
Free (F0)	Speech synthesis - neural voices (recurring monthly allowance)	0.5M characters per month	$0	source
Pay As You Go (S0)	Speech synthesis - neural voices (real-time & batch)	1M characters	$16	source
Pay As You Go (S0)	Speech synthesis - neural HD voices (real-time & batch)	1M characters	$22	source
Pay As You Go (S0)	Speech synthesis - neural voices (long audio creation)	1M characters	$100	source
Pay As You Go (S0)	Speech synthesis - custom neural voice (real-time & batch)	1M characters	$24	source
Pay As You Go (S0)	Speech synthesis - custom neural HD voice (real-time & batch)	1M characters	$48	source
Pay As You Go (S0)	Speech synthesis - custom neural voice (long audio creation)	1M characters	$100	source
Pay As You Go (S0)	Custom neural voice model training	compute hour	$52	source
Pay As You Go (S0)	Custom neural voice endpoint hosting	model per hour	$4.04	source
Pay As You Go (S0)	Personal voice synthesis	1M characters	$24	source
Pay As You Go (S0)	Personal voice profile storage	1,000 profiles per month	$600	source
Commitment - 80M characters/month	Speech synthesis - neural voices	month (80M characters included)	$1024	source
Commitment - 400M characters/month	Speech synthesis - neural voices	month (400M characters included)	$4160	source
Commitment - 2,000M characters/month	Speech synthesis - neural voices	month (2,000M characters included)	$16000	source
Connected Container - 80M characters/month	Speech synthesis - neural voices	month (80M characters included)	$972.80	source
Connected Container - 400M characters/month	Speech synthesis - neural voices	month (400M characters included)	$3952	source
Connected Container - 2,000M characters/month	Speech synthesis - neural voices	month (2,000M characters included)	$15200	source
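
The arithmetic behind the table above can be sketched in a few lines. This is a minimal illustration that uses only the neural-voice figures listed in the table ($16 per 1M characters pay-as-you-go, and the three commitment tiers). The function names are hypothetical, and because overage rates are not listed in this profile, the sketch assumes a commitment tier is only considered when the monthly volume fits inside its included characters.

```python
# Figures copied from the "Published prices" table (neural voices only).
PAYG_USD_PER_MILLION = 16
COMMITMENT_TIERS = [  # (included characters per month, USD per month)
    (80_000_000, 1024),
    (400_000_000, 4160),
    (2_000_000_000, 16000),
]


def payg_cost(chars: int) -> float:
    """Pay As You Go (S0) cost in USD for a monthly character volume."""
    return chars * PAYG_USD_PER_MILLION / 1_000_000


def cheapest_plan(chars: int) -> tuple[str, float]:
    """Return (plan name, monthly USD) for the cheapest listed option.

    Assumption: a commitment tier is only a candidate when the volume
    fits within its included characters, since overage pricing is not
    part of this profile.
    """
    options = [("Pay As You Go (S0)", payg_cost(chars))]
    for included, price in COMMITMENT_TIERS:
        if chars <= included:
            name = f"Commitment - {included // 1_000_000:,}M characters/month"
            options.append((name, float(price)))
    return min(options, key=lambda option: option[1])


if __name__ == "__main__":
    for volume in (50_000_000, 70_000_000, 300_000_000):
        print(volume, cheapest_plan(volume))
```

Under these figures the 80M commitment tier breaks even with pay-as-you-go at 64M characters per month (1024 / 16), so volumes below that are cheaper on S0.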

Capabilities

Real-time streaming
Voice cloning
Voice design
SSML control
Multilingual voices
Word timestamps

Supported actions: synthesize_speech, streaming_tts, batch_synthesis, ssml_support, word_timestamps, viseme_generation, professional_voice_cloning, personal_voice_cloning, custom_voice_training, multilingual_synthesis, voice_design, audio_content_creation, text_to_speech_avatar, real_time_synthesis, async_long_audio_synthesis [6]
  learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech: “Improve text to speech output with SSML: Speech Synthesis Markup Language (SSML) is an XML-based markup language used to customize text to speech outputs. Visemes: Visemes are the key poses in observed speech, including the position of the lips, jaw, and tongue in producing a particular phoneme.”
  learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech: “Real-time speech synthesis: Use the Speech SDK or REST API to convert text to speech by using standard voices or custom voices. Asynchronous synthesis of long audio: Use the batch synthesis API to asynchronously synthesize text to speech files longer than 10 minutes”
Regions: Australia East, Brazil South, Canada Central, Canada East, Central US, East Asia, East US, East US 2, France Central, Germany West Central, India Central, Italy North, Japan East, Japan West, Korea Central, North Central US, North Europe, Norway East, Qatar Central, South Africa North, South Central US, Southeast Asia, Sweden Central, Switzerland North, Switzerland West, UAE North, UK South, UK West, West Central US, West Europe, West US, West US 2, West US 3, US Gov Arizona, US Gov Virginia [7]
  learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-text-to-speech: “These regions are supported for text to speech through the REST API: Australia East, Brazil South, Canada Central, Canada East, Central US, East Asia, East US, East US 2, France Central, Germany West Central, India Central, Italy North, Japan East, Japan West, Korea Central, North Central US, North Europe, Norway East, Qatar Central, South Africa North, South Central US, Southeast Asia, Sweden Central, Switzerland North, Switzerland West, UAE North, UK South, UK West, US Gov Arizona, US Gov Virginia, West Central US, West Europe, West US, West US 2, West US 3”
Languages: af-ZA (Afrikaans, South Africa), am-ET (Amharic, Ethiopia), ar-AE (Arabic, UAE), ar-BH (Arabic, Bahrain), ar-DZ (Arabic, Algeria), ar-EG (Arabic, Egypt), ar-IQ (Arabic, Iraq), ar-JO (Arabic, Jordan), ar-KW (Arabic, Kuwait), ar-LB (Arabic, Lebanon), ar-LY (Arabic, Libya), ar-MA (Arabic, Morocco), ar-OM (Arabic, Oman), ar-QA (Arabic, Qatar), ar-SA (Arabic, Saudi Arabia), ar-SY (Arabic, Syria), ar-TN (Arabic, Tunisia), ar-YE (Arabic, Yemen), as-IN (Assamese, India), az-AZ (Azerbaijani, Azerbaijan), bg-BG (Bulgarian, Bulgaria), bn-BD (Bangla, Bangladesh), bn-IN (Bengali, India), bs-BA (Bosnian, Bosnia and Herzegovina), ca-ES (Catalan), cs-CZ (Czech, Czechia), cy-GB (Welsh, United Kingdom), da-DK (Danish, Denmark), de-AT (German, Austria), de-CH (German, Switzerland), de-DE (German, Germany), el-GR (Greek, Greece), en-AU (English, Australia), en-CA (English, Canada), en-GB (English, United Kingdom), en-HK (English, Hong Kong SAR), en-IE (English, Ireland), en-IN (English, India), en-KE (English, Kenya), en-NG (English, Nigeria), en-NZ (English, New Zealand), en-PH (English, Philippines), en-SG (English, Singapore), en-TZ (English, Tanzania), en-US (English, United States), en-ZA (English, South Africa), es-AR (Spanish, Argentina), es-BO (Spanish, Bolivia), es-CL (Spanish, Chile), es-CO (Spanish, Colombia), es-CR (Spanish, Costa Rica), es-CU (Spanish, Cuba), es-DO (Spanish, Dominican Republic), es-EC (Spanish, Ecuador), es-ES (Spanish, Spain), es-GQ (Spanish, Equatorial Guinea), es-GT (Spanish, Guatemala), es-HN (Spanish, Honduras), es-MX (Spanish, Mexico), es-NI (Spanish, Nicaragua), es-PA (Spanish, Panama), es-PE (Spanish, Peru), es-PR (Spanish, Puerto Rico), es-PY (Spanish, Paraguay), es-SV (Spanish, El Salvador), es-US (Spanish, United States), es-UY (Spanish, Uruguay), es-VE (Spanish, Venezuela), et-EE (Estonian, Estonia), eu-ES (Basque), fa-IR (Persian, Iran), fi-FI (Finnish, Finland), fil-PH (Filipino, Philippines), fr-BE (French, Belgium), fr-CA (French, Canada), fr-CH (French, Switzerland), fr-FR (French, France), ga-IE (Irish, Ireland), gl-ES (Galician), gu-IN (Gujarati, India), he-IL (Hebrew, Israel), hi-IN (Hindi, India), hr-HR (Croatian, Croatia), hu-HU (Hungarian, Hungary), hy-AM (Armenian, Armenia), id-ID (Indonesian, Indonesia), is-IS (Icelandic, Iceland), it-IT (Italian, Italy), iu-CANS-CA (Inuktitut Syllabics, Canada), iu-LATN-CA (Inuktitut Latin, Canada), ja-JP (Japanese, Japan), jv-ID (Javanese, Indonesia), ka-GE (Georgian, Georgia), kk-KZ (Kazakh, Kazakhstan), km-KH (Khmer, Cambodia), kn-IN (Kannada, India), ko-KR (Korean, Korea), lo-LA (Lao, Laos), lt-LT (Lithuanian, Lithuania), lv-LV (Latvian, Latvia), mk-MK (Macedonian, North Macedonia), ml-IN (Malayalam, India), mn-MN (Mongolian, Mongolia), mr-IN (Marathi, India), ms-MY (Malay, Malaysia), mt-MT (Maltese, Malta), my-MM (Burmese, Myanmar), nb-NO (Norwegian Bokmål, Norway), ne-NP (Nepali, Nepal), nl-BE (Dutch, Belgium), nl-NL (Dutch, Netherlands), or-IN (Odia, India), pa-IN (Punjabi, India), pl-PL (Polish, Poland), ps-AF (Pashto, Afghanistan), pt-BR (Portuguese, Brazil), pt-PT (Portuguese, Portugal), ro-RO (Romanian, Romania), ru-RU (Russian, Russia), si-LK (Sinhala, Sri Lanka), sk-SK (Slovak, Slovakia), sl-SI (Slovenian, Slovenia), so-SO (Somali, Somalia), sq-AL (Albanian, Albania), sr-LATN-RS (Serbian Latin, Serbia), sr-RS (Serbian Cyrillic, Serbia), su-ID (Sundanese, Indonesia), sv-SE (Swedish, Sweden), sw-KE (Kiswahili, Kenya), sw-TZ (Kiswahili, Tanzania), ta-IN (Tamil, India), ta-LK (Tamil, Sri Lanka), ta-MY (Tamil, Malaysia), ta-SG (Tamil, Singapore), te-IN (Telugu, India), th-TH (Thai, Thailand), tr-TR (Turkish, Türkiye), uk-UA (Ukrainian, Ukraine), ur-IN (Urdu, India), ur-PK (Urdu, Pakistan), uz-UZ (Uzbek, Uzbekistan), vi-VN (Vietnamese, Vietnam), wuu-CN (Chinese Wu, Simplified), yue-CN (Chinese Cantonese, Simplified), zh-CN (Chinese Mandarin, Simplified), zh-HK (Chinese Cantonese, Traditional), zh-TW (Chinese Taiwanese Mandarin, Traditional) [8]
  learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech: “Standard voices: High-quality neural voices available out of the box in 100+ languages and locales”
  learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support: “The text-to-speech section contains an extensive table listing supported locales including af-ZA, am-ET, ar-AE through zh-TW, approximately 140+ language-locale combinations representing over 130 distinct languages.”
Input types: plain text, SSML (Speech Synthesis Markup Language)
Output types: mp3 (various bitrates), opus (ogg, webm containers), pcm (raw), wav (riff), alaw, mulaw, truesilk, g722, amr-wb [9]
Webhooks: No [10]
Sandbox / test mode: No
SDK languages: Python, C#, JavaScript, Java, Go [11]
MCP server: Yes [12]
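
The SSML input type listed above can be illustrated with a small builder. This is a sketch under stated assumptions, not the service's own tooling: the function names are hypothetical, the voice name `en-US-JennyNeural` is only an example value, and the 64 KB check interprets the "64 KB per WebSocket turn" limit quoted under Rate limits as 64 * 1024 UTF-8 bytes. The `speak` and `voice` elements and the synthesis namespace come from the W3C SSML specification.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"
# Assumption: "64 KB" in the quota text means 64 * 1024 bytes of UTF-8.
MAX_WEBSOCKET_TURN_BYTES = 64 * 1024


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document.

    The text is XML-escaped so characters such as & and < cannot break
    the markup; attribute values are quoted with quoteattr.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NAMESPACE}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


def fits_websocket_turn(ssml: str) -> bool:
    """Check the encoded SSML against the assumed 64 KB per-turn limit."""
    return len(ssml.encode("utf-8")) <= MAX_WEBSOCKET_TURN_BYTES


if __name__ == "__main__":
    document = build_ssml("Tom & Jerry <3")
    print(document)
    print(fits_websocket_turn(document))
```

Escaping at build time keeps user-supplied text from being interpreted as markup, which matters when the input is not already trusted SSML.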

Trust & compliance

SOC 2: SOC 2 Type II [13]
HIPAA: Yes [14]
  learn.microsoft.com/en-us/azure/compliance/offerings/offering-hipaa-us: “Microsoft will enter into BAAs with its covered entity and business associate customers. Azure has enabled the physical, technical, and administrative safeguards required by HIPAA and the HITECH Act inside the in-scope Azure services, and offers a HIPAA BAA as part of the Microsoft Product Terms to all customers who are covered entities or business associates under HIPAA”
GDPR: Yes [15]
ISO 27001: Yes [16]
PCI DSS: Yes [17]
  learn.microsoft.com/en-us/azure/compliance/offerings/offering-pci-dss: “Microsoft Azure maintains a PCI DSS validation using an approved Qualified Security Assessor (QSA), and is certified as compliant under PCI DSS version 4.0 at Service Provider Level 1.”
  learn.microsoft.com/en-us/azure/compliance/offerings/offering-pci-dss: “For a list of Microsoft online services in audit scope, see the PCI DSS Attestation of Compliance (AoC) that is available separately for Azure and Azure Government”
Published SLA: Yes [18]
Rate limits: Free (F0): 20 transactions per 60 seconds (not adjustable). Standard (S0): 200 transactions per second (TPS) default, adjustable up to 1,000 TPS upon request. Maximum audio length per request: 10 minutes. Maximum SSML message size per WebSocket turn: 64 KB. Maximum distinct voice/audio tags in SSML: 50. HD voice latency: less than 300 ms. [19]
  learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-quotas-and-limits: “Maximum number of transactions per time period for standard voices and custom voices: Free (F0): 20 transactions per 60 seconds (This limit isn't adjustable). Standard (S0): 200 transactions per second (TPS) (default value). The rate is adjustable up to 1,000 TPS for Standard (S0) resources.”
  learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-quotas-and-limits: “Maximum audio length produced per request: 10 minutes (both F0 and S0). Maximum SSML message size per turn for WebSocket: 64 KB. Maximum total number of distinct voice and audio tags in SSML: 50.”
Known restrictions: Custom voice (professional voice fine-tuning) requires limited-access application approval; Chinese characters are counted as two characters for billing, including kanji (Japanese), hanja (Korean), and hanzi (other languages); HD voices support only a subset of SSML elements (not full SSML); Personal voice does not support BYOS (Bring Your Own Storage); Dragon HD Flash voices support only zh-CN and en-US text; HD voices support real-time synthesis only (no batch synthesis); maximum of 10 minutes of audio output per real-time synthesis request; custom voice endpoint hosting is billed separately per hour; a verbal consent recording from the voice talent is required before custom voice training [20]
  learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech: “Custom voice is an umbrella term that includes professional voice fine-tuning and personal voice. Custom voice training and hosting are both calculated by hour and billed per second.”
  learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech: “Each Chinese character is counted as two characters for billing, including kanji used in Japanese, hanja used in Korean, or hanzi used in other languages.”
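
The double-counting rule for Chinese characters affects cost estimates for CJK text, and a rough counter makes the effect concrete. This is an approximation under explicit assumptions: the profile does not say which Unicode ranges the service treats as Chinese characters, so the sketch uses the CJK Unified Ideographs block and Extension A, and it assumes every other character (including spaces, punctuation, and kana) counts once. The function names are hypothetical.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """Approximate test for a Chinese character (hanzi, kanji, hanja).

    Assumption: only CJK Unified Ideographs (U+4E00-U+9FFF) and
    Extension A (U+3400-U+4DBF) are treated as double-billed.
    """
    code_point = ord(ch)
    return 0x4E00 <= code_point <= 0x9FFF or 0x3400 <= code_point <= 0x4DBF


def billable_characters(text: str) -> int:
    """Estimate billable characters: ideographs count twice, the rest once."""
    return sum(2 if is_cjk_ideograph(ch) else 1 for ch in text)


if __name__ == "__main__":
    for sample in ("Hello", "你好", "東京 Tokyo"):
        print(repr(sample), billable_characters(sample))
```

For budgeting, feeding this estimate into a per-million-character price gives an upper-level sanity check; the service's own metering remains the authority on what is actually billed.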

Developer surface

Docs rendering: static · llms.txt present

Integration

API style: rest
Base URL: https://{region}.tts.speech.microsoft.com/cognitiveservices/v1
Version: 2024-04-01
Versioning: url
Stability: ga
Auth methods: api_key, jwt
Error format: vendor-specific
Rate limit: 200 / second
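
A client can stay under the documented transaction limits (200 per second on S0, 20 per 60 seconds on F0) with a local limiter. The following is a minimal sliding-window sketch; the class name is hypothetical, and the profile does not state whether the service enforces a fixed or sliding window, so a sliding window is used here as the more conservative assumption. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds interval."""

    def __init__(
        self,
        max_calls: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self._timestamps: deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the window, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_calls:
            self._timestamps.append(now)
            return True
        return False


if __name__ == "__main__":
    # Free (F0) figures from this profile: 20 transactions per 60 seconds.
    limiter = SlidingWindowLimiter(max_calls=20, window_seconds=60.0)
    allowed = sum(limiter.try_acquire() for _ in range(25))
    print(f"{allowed} of 25 immediate calls allowed")
```

A caller would check `try_acquire()` before each synthesis request and back off when it returns False, rather than relying on the service to reject the request.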

SDKs

Python azure-cognitiveservices-speech · repo
C# Microsoft.CognitiveServices.Speech · repo
JavaScript microsoft-cognitiveservices-speech-sdk · repo
Java com.microsoft.cognitiveservices.speech:client-sdk · repo
Go github.com/Microsoft/cognitive-services-speech-sdk-go · repo
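
For callers that skip the SDKs, the Integration fields above (base URL and api_key auth) can be assembled into a REST request by hand. This sketch uses only the Python standard library. The header names (`Ocp-Apim-Subscription-Key`, `X-Microsoft-OutputFormat`), the `application/ssml+xml` content type, the region identifier `eastus`, and the output format string are not stated in this profile; they reflect the Speech REST reference as best understood here and should be verified against it. The function names and the User-Agent value are hypothetical.

```python
import urllib.request

# Base URL pattern taken from the Integration section of this profile.
BASE_URL_TEMPLATE = "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"


def build_tts_request(
    region: str,
    subscription_key: str,
    ssml: str,
    output_format: str = "audio-16khz-128kbitrate-mono-mp3",  # assumed format name
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying an SSML body."""
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,  # api_key auth (assumed header)
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "tts-profile-example",  # hypothetical client name
    }
    return urllib.request.Request(
        BASE_URL_TEMPLATE.format(region=region),
        data=ssml.encode("utf-8"),
        headers=headers,
        method="POST",
    )


def synthesize(request: urllib.request.Request) -> bytes:
    """Send the request and return the audio bytes (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    req = build_tts_request("eastus", "YOUR_KEY", "<speak/>")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the URL and headers easy to inspect before any key is used against a live resource.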

Adoption & maturity

Launched: 2018-09-24
GA: 2018-11-01

Other Text-to-Speech APIs

ElevenLabs Text to Speech
"Text to Speech with high quality, human-like AI voices"
Hybrid · free tier · public pricing · self-serve
Amazon Polly
"Amazon Polly is a cloud service that converts text into lifelike speech. You can use Amazon Polly to develop applications that increase engagement and accessibility."
Usage · free tier · public pricing · self-serve
Google Cloud Text-to-Speech
"Cloud Text-to-Speech converts text or Speech Synthesis Markup Language (SSML) input into audio data of natural human speech."
Usage · free tier · public pricing · self-serve
Cartesia (Sonic)
"The fastest and most natural text to speech model"
Hybrid · free tier · public pricing · self-serve
Murf AI
"Enterprise-grade AI voice generation with 150+ natural-sounding voices across 35 languages and 20+ speaking styles."
Usage · public pricing · self-serve
OpenAI Text to Speech (gpt-4o-mini-tts / tts-1)
"Transform text into lifelike spoken audio" - OpenAI's TTS service enabling blog narration, multilingual audio production, and realtime voice output via gpt-4o-mini-tts, tts-1, and tts-1-hd models.
Usage · public pricing · self-serve


References

Each field above carries a numbered source.

[1] Description: learn.microsoft.com
[2] Pricing model: azure.microsoft.com · learn.microsoft.com
[3] Published pricing: azure.microsoft.com
[4] Free tier: speechify.com · learn.microsoft.com
[5] Enterprise plan: learn.microsoft.com
[6] Supported actions: learn.microsoft.com · learn.microsoft.com
[7] Regions: learn.microsoft.com
[8] Languages: learn.microsoft.com · learn.microsoft.com
[9] Output types: learn.microsoft.com
[10] Webhooks: learn.microsoft.com
[11] SDK languages: learn.microsoft.com
[12] MCP server: learn.microsoft.com
[13] SOC 2: azure.microsoft.com
[14] HIPAA: learn.microsoft.com
[15] GDPR: learn.microsoft.com
[16] ISO 27001: learn.microsoft.com
[17] PCI DSS: learn.microsoft.com · learn.microsoft.com
[18] Published SLA: azure.microsoft.com
[19] Rate limits: learn.microsoft.com · learn.microsoft.com
[20] Known restrictions: learn.microsoft.com · learn.microsoft.com

Change history

Every field change, who made it, and when - from our audited data pipeline and editors.

2026-06-21 Capabilities: {} → {"ssml":true,"streaming":true,"multilingual":true,"voice_design":true,"voice_cl…
2026-06-21 Summary Md: (none) → Azure AI Text to Speech is Microsoft's managed speech synthesis service, suited…
2026-06-21 Score Docs Quality: (none) → 25
2026-06-21 Score Procurement Friction: (none) → 100
2026-06-21 Score Trust Readiness: (none) → 100
2026-06-21 Best For: (none) → Prototypes and side projects - free to start, no sales call, Regulated or enter…
2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
2026-06-21 Score Agent Friendliness: (none) → 65
2026-06-21 Score Pricing Transparency: (none) → 100
2026-06-21 Score Setup Speed: (none) → 85
2026-06-21 Llms Txt Present: (none) → Yes
2026-06-21 Rendering: (none) → static
2026-06-21 Has Structured Data: (none) → Yes
2026-06-21 Robots Allows Agents: (none) → Yes
2026-06-21 Docs URL: (none) → https://azure.microsoft.com/en-us/resources/developers/
2026-06-21 Llms Txt URL: (none) → https://azure.microsoft.com/llms.txt
2026-06-21 Pricing Model: set to usage_based
2026-06-21 Has Published Pricing: set to Yes
2026-06-21 Free Tier Available: set to Yes
2026-06-21 Free Tier Details: set to Free (F0) tier: 0.5 million characters per month for neural text-to-speech (rec…
2026-06-21 Self Serve Signup: set to Yes
2026-06-21 Requires Sales Call: set to No
2026-06-21 Enterprise Plan Available: set to Yes
2026-06-21 SOC 2: set to type_2
2026-06-21 HIPAA: set to Yes
2026-06-21 GDPR: set to Yes
2026-06-21 ISO 27001: set to Yes
2026-06-21 PCI DSS: set to Yes
2026-06-21 SLA Published: set to Yes
2026-06-21 SLA URL: set to https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-…
2026-06-21 Data Retention Policy URL: set to https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/speech-service/t…
2026-06-21 Documented Rate Limits: set to Free (F0): 20 transactions per 60 seconds (not adjustable). Standard (S0): 200 …
2026-06-21 Rate Limit Requests: set to 200
2026-06-21 Rate Limit Window: set to second
2026-06-21 Known Restrictions: set to Custom voice (professional voice fine-tuning) requires limited-access applicati…
2026-06-21 Auth Methods: set to api_key, jwt
2026-06-21 Auth Docs URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-text-to…
2026-06-21 API Style: set to rest
2026-06-21 Base URL: set to https://{region}.tts.speech.microsoft.com/cognitiveservices/v1
2026-06-21 API Version: set to 2024-04-01
2026-06-21 Versioning Scheme: set to url
2026-06-21 Stability: set to ga
2026-06-21 MCP URL: set to https://github.com/microsoft/mcp/tree/main/servers/Azure.Mcp.Server
2026-06-21 Quickstart URL: set to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-…
2026-06-21 Error Format: set to vendor-specific
2026-06-21 Slug: set to azure-text-to-speech
2026-06-21 Starting Price Usd: set to 15
2026-06-21 Price Basis: set to 1M characters
2026-06-21 Free Tier Limit: set to 500,000 characters/month
2026-06-21 Launched At: set to 2018-09-24

Suggest an edit / leave a review

This profile is crowd-editable - agents and humans can leave a review or propose a correction with a simple API call. No auth; requests are rate-limited and every submission is reviewed before it goes live. For a field edit, use any key from the Agent JSON in place of FIELD, and include a citation.

Leave a review or comment

curl -X POST https://apio.sh/api/feedback/azure-text-to-speech \
  -H 'Content-Type: application/json' \
  -d '{"kind":"review","rating":5,"body":"Your experience with this API…"}'

Suggest a correction to a field (cite a source)

curl -X POST https://apio.sh/api/suggest/azure-text-to-speech/FIELD \
  -H 'Content-Type: application/json' \
  -d '{"value":"corrected value","citations":[{"url":"https://source.example/page","excerpt":"supporting quote"}],"note":"what changed and why"}'
