Google Cloud Speech-to-Text

"Accurate voice typing and transcription powered by Gemini." [1]

Speech-to-Text & Transcription APIs

cloud.google.com/speech-to-text · By Google · Last verified 2026-06-21 · Source confidence: high

Google Cloud Speech-to-Text is a REST API from Google Cloud that converts audio to text. It supports synchronous, batch, and streaming transcription across 190+ language-region variants and 12 listed locations (global, multi-region, and regional). It covers call center transcription, live captioning with WebVTT and SRT output, speaker diarization, and multi-speaker meeting transcription. Pricing starts at $0.016 per minute with a free tier of 60 minutes per month, self-serve signup, and no sales call required. The service is covered by SOC 2 Type II, ISO 27001, and PCI DSS attestations, supports HIPAA (under a BAA) and GDPR compliance, and ships official SDKs for Python, Node.js, Java, Go, C#, PHP, Ruby, and C++.

Best for / Avoid if

Best for: Prototypes and side projects - free to start, no sales call; Regulated or enterprise workloads - compliance attestations and an enterprise plan; Teams needing broad API coverage out of the box

Pricing & procurement

Pricing model: Usage-based [2]
Published pricing: Yes [3]
Free tier: Yes [4]
Free tier details: 60 minutes of audio per month free (recurring monthly allowance); applies to both streaming and batch recognition across V1 and V2 APIs.
Self-serve signup: Yes
Requires sales call: No
Enterprise plan: Yes

Published prices
Plan	Item	Per	Amount (USD)
V2 API – Standard (unlogged)	Recognition – 0 to 500,000 min/month	minute	$0.016
V2 API – Standard (unlogged)	Recognition – 500,000 to 1,000,000 min/month	minute	$0.01
V2 API – Standard (unlogged)	Recognition – 1,000,000 to 2,000,000 min/month	minute	$0.008
V2 API – Standard (unlogged)	Recognition – 2,000,000+ min/month	minute	$0.004
V2 API – Standard (logged / data-logging opt-in)	Recognition (Logged) – 0 to 500,000 min/month	minute	$0.012
V2 API – Standard (logged / data-logging opt-in)	Recognition (Logged) – 2,000,000+ min/month	minute	$0.003
V2 API – Dynamic Batch	Dynamic Batch Recognition (unlogged)	minute	$0.003
V2 API – Dynamic Batch	Dynamic Batch Recognition (logged)	minute	$0.0023
V2 API – Free tier	Monthly free allowance (all recognition types)	minute (first 60 min/month)	$0
V1 API – with data logging	Speech recognition with data logging	minute (above 60 free min/month)	$0.016
V1 API – without data logging	Speech recognition without data logging	minute (above 60 free min/month)	$0.024
V1 API – Medical (with data logging)	Medical dictation	minute (above 60 free min/month)	$0.078
V1 API – Medical (with data logging)	Medical conversation	minute (above 60 free min/month)	$0.078
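
As a rough illustration of how the V2 Standard (unlogged) tiers above combine, the sketch below estimates a monthly bill. It assumes the tiers are graduated (each rate applies only to the minutes that fall inside its band) and that the 60 free minutes are subtracted before the first tier starts; neither assumption is confirmed by the table, and the function name is hypothetical.

```python
import math

# (upper bound of tier in minutes, USD per minute), copied from the
# V2 API - Standard (unlogged) rows of the price table.
V2_STANDARD_UNLOGGED_TIERS = [
    (500_000, 0.016),
    (1_000_000, 0.010),
    (2_000_000, 0.008),
    (math.inf, 0.004),
]


def estimate_monthly_cost(minutes, free_minutes=60, tiers=V2_STANDARD_UNLOGGED_TIERS):
    """Estimate one month's cost in USD.

    Assumption: tiers are graduated and the free allowance is applied first.
    """
    billable = max(0.0, minutes - free_minutes)
    cost = 0.0
    lower = 0.0
    for upper, rate in tiers:
        if billable <= lower:
            break
        # Charge only the minutes that fall inside this tier's band.
        cost += (min(billable, upper) - lower) * rate
        lower = upper
    return cost
```

Under these assumptions, usage at or below the free allowance costs nothing, and 1,060 minutes in a month would be billed as 1,000 minutes at the first-tier rate.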

Capabilities

Real-time streaming
Speaker diarization
Medical transcription

Supported actions: transcribe_synchronous, transcribe_batch, transcribe_streaming, speaker_diarization, automatic_punctuation, spoken_punctuation, spoken_emoji, profanity_filtering, word_timestamps, word_confidence_scores, language_detection, model_adaptation, custom_vocabulary_phrase_sets, custom_classes, srt_caption_generation, webvtt_caption_generation, recognizer_management, dynamic_batch_recognition [5]
Regions: global, us (US North America multi-region), eu (Europe multi-region), europe-west1, europe-west2, europe-west3, europe-west4, us-central1, asia-southeast1, asia-northeast1, asia-south1, northamerica-northeast1 [6]
Languages: 190+ language-region variants (BCP-47 codes) including: Afrikaans (af-ZA), Albanian (sq-AL), Amharic (am-ET), Arabic (20+ regional variants: ar-AE, ar-BH, ar-DZ, ar-EG, ar-IL, ar-IQ, ar-JO, ar-KW, ar-LB, ar-LY, ar-MA, ar-MR, ar-OM, ar-PS, ar-QA, ar-SA, ar-SY, ar-TN, ar-YE), Armenian (hy-AM), Assamese (as-IN), Azerbaijani (az-AZ), Bangla/Bengali (bn-BD, bn-IN), Basque (eu-ES), Bosnian (bs-BA), Bulgarian (bg-BG), Burmese (my-MM), Catalan (ca-ES), Chinese Cantonese (yue-Hant-HK), Chinese Mandarin (zh, zh-TW), Croatian (hr-HR), Czech (cs-CZ), Danish (da-DK), Dutch (nl-BE, nl-NL), English (en-AU, en-CA, en-GB, en-GH, en-HK, en-IE, en-IN, en-KE, en-NG, en-NZ, en-PH, en-PK, en-SG, en-TZ, en-US, en-ZA), Estonian (et-EE), Filipino (fil-PH), Finnish (fi-FI), French (fr-BE, fr-CA, fr-CH, fr-FR), Galician (gl-ES), Georgian (ka-GE), German (de-AT, de-CH, de-DE), Greek (el-GR), Gujarati (gu-IN), Hausa (ha-NG), Hebrew (iw-IL), Hindi (hi-IN), Hungarian (hu-HU), Icelandic (is-IS), Indonesian (id-ID), Italian (it-CH, it-IT), Japanese (ja-JP), Javanese (jv-ID), Kannada (kn-IN), Kazakh (kk-KZ), Khmer (km-KH), Korean (ko-KR), Lao (lo-LA), Latvian (lv-LV), Lithuanian (lt-LT), Macedonian (mk-MK), Malay (ms-MY), Malayalam (ml-IN), Marathi (mr-IN), Mongolian (mn-MN), Nepali (ne-NP), Norwegian (nb-NO), Pashto (ps-AF), Persian (fa-IR), Polish (pl-PL), Portuguese (pt-BR, pt-PT), Punjabi (pa-Guru-IN), Romanian (ro-RO), Russian (ru-RU), Serbian (sr-RS), Sinhala (si-LK), Slovak (sk-SK), Slovenian (sl-SI), Somali (so-SO), Spanish (es-AR, es-BO, es-CL, es-CO, es-CR, es-CU, es-DO, es-EC, es-ES, es-GT, es-HN, es-MX, es-NI, es-PA, es-PE, es-PR, es-PY, es-SV, es-US, es-UY, es-VE), Sundanese (su-ID), Swahili (sw-KE, sw-TZ), Swedish (sv-SE), Tamil (ta-IN, ta-LK, ta-MY, ta-SG), Telugu (te-IN), Thai (th-TH), Turkish (tr-TR), Ukrainian (uk-UA), Urdu (ur-IN, ur-PK), Uzbek (uz-UZ), Vietnamese (vi-VN), Xhosa (xh-ZA), Yoruba (yo-NG), Zulu (zu-ZA) [7]
Input types: audio/flac (FLAC), audio/l16 (LINEAR16 PCM), audio/mulaw (MULAW / μ-law), audio/mpeg (MP3), audio/amr (AMR narrowband, 8000 Hz), audio/amr-wb (AMR-WB wideband, 16000 Hz), audio/ogg; codecs=opus (OGG_OPUS), audio/speex (SPEEX_WITH_HEADER_BYTE, 16000 Hz), video/webm; codecs=opus (WEBM_OPUS), WAV (with LINEAR16 or MULAW encoding), Cloud Storage URI (gs://), local file upload (≤10 MB), live streaming via gRPC
Output types: JSON (transcription results with confidence scores), plain text transcript, word-level timestamps, word confidence scores, speaker diarization tags, SRT captions, WebVTT captions, Cloud Storage file output (TranscriptOutputConfig)
Webhooks: No [8]
Sandbox / test mode: No [9]
SDK languages: Python, Node.js, Java, Go, C#, PHP, Ruby, C++ [10]
MCP server: No [11]
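
The service lists SRT and WebVTT among its own output types. For cases where only word-level timestamps are at hand, the sketch below shows one way to group them into SRT cues on the client side. The input shape (a list of `(word, start_seconds, end_seconds)` tuples), the function names, and the fixed words-per-cue grouping are all assumptions made for illustration, not the API's response format.

```python
def srt_timestamp(seconds):
    """Format a time offset in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group (word, start_s, end_s) tuples into numbered SRT cues.

    Each cue spans from the first word's start to the last word's end.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word for word, _, _ in chunk)
        cues.append(
            f"{len(cues) + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

A real caption pipeline would usually break cues on pauses or punctuation rather than a fixed word count; the fixed count only keeps the example short.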

Trust & compliance

SOC 2: SOC 2 Type II [12]
HIPAA: Yes [13]
GDPR: Yes [14]
ISO 27001: Yes [15]
PCI DSS: Yes [16]
Published SLA: Yes [17]
Rate limits: V2 API: Resource requests 100/60s; Operation requests 150/60s; Synchronous recognition 300/60s; Batch recognition 150/60s (all per region). V1 API: 900 recognition requests per 60 seconds per project (global). Daily audio processing limit: 480 hours per day. Synchronous audio max: ~1 min. Streaming session max: ~5 min. Async audio max: ~480 min. Local file max: 10 MB. [18]
Known restrictions: Synchronous recognition limited to ~1 minute of audio; Streaming sessions limited to ~5 minutes (session must be restarted for longer audio); Asynchronous audio over ~1 minute must be referenced via a Cloud Storage URI (gs://); Local file upload limit: 10 MB per request (no limit for Cloud Storage URIs); Multi-channel audio billed per channel (stereo = 2x rate); Maximum 5,000 phrases per adaptation request, 100,000 total characters, 100 characters per phrase; 900 recognition requests per 60 seconds per project on V1 (soft quota, adjustable); 480 hours of audio per day (daily quota, resets midnight PST/PDT); Custom speech model training is an allowlist-only feature; No native webhook or push notification for async job completion (polling or Cloud Storage triggers required); HIPAA users must not opt into the data logging program; Medical models (medical_conversation, medical_dictation) are V1 API only [19]
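
The duration and size limits above imply a simple decision about which recognition mode to call. The sketch below encodes that decision using the approximate figures from this profile. The function names are hypothetical, 10 MB is assumed to mean 10 × 1024 × 1024 bytes, and the billing helper ignores any rounding increment the service may apply.

```python
import math

# Approximate limits as listed in this profile.
SYNC_MAX_SECONDS = 60                     # synchronous: ~1 min
STREAM_SESSION_MAX_SECONDS = 5 * 60       # streaming session: ~5 min
ASYNC_MAX_SECONDS = 480 * 60              # asynchronous: ~480 min
LOCAL_FILE_MAX_BYTES = 10 * 1024 * 1024   # 10 MB (assumed binary megabytes)


def choose_method(duration_s, size_bytes, live=False):
    """Return (method, session_count) for a piece of audio.

    Live audio is split into streaming sessions that each fit the
    session limit; recorded audio goes to synchronous recognition when
    it fits, otherwise to batch recognition via a gs:// URI.
    """
    if live:
        sessions = max(1, math.ceil(duration_s / STREAM_SESSION_MAX_SECONDS))
        return ("streaming", sessions)
    if duration_s > ASYNC_MAX_SECONDS:
        raise ValueError("audio exceeds the ~480 minute asynchronous limit")
    if duration_s <= SYNC_MAX_SECONDS and size_bytes <= LOCAL_FILE_MAX_BYTES:
        return ("synchronous", 1)
    return ("batch_gcs_uri", 1)


def billed_minutes(duration_s, channels=1):
    """Minutes billed when each channel is charged separately."""
    return duration_s / 60 * channels
```

For example, a 15-minute live call would need three back-to-back streaming sessions under these limits, and two minutes of stereo audio would count as four billed minutes.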

Developer surface

Docs rendering: static

Integration

API style: rest
Base URL: https://speech.googleapis.com
Version: v2
Versioning: url
Stability: ga
Auth methods: oauth2, api_key
Error format: vendor-specific (Google API error JSON: {error: {code, message, status}})
Rate limit: 300 / minute (V2 synchronous recognition, per region)
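
To make the fields above concrete, the sketch below builds (but does not send) a V2 synchronous recognition request using only the Python standard library, and parses the error envelope named under Error format. The base URL, version, and error shape come from this profile. The resource path pattern, the `_` default recognizer, the JSON field names (`autoDecodingConfig`, `languageCodes`, `model`, `content`), and the `long` model name are assumptions based on the public V2 REST reference and should be checked against it. Only the `global` location is handled here, since regional locations may require a different host. The access token is assumed to be obtained separately, for example through OAuth 2.0.

```python
import base64
import json
import urllib.request

BASE_URL = "https://speech.googleapis.com"  # from the Integration fields


def build_recognize_request(project, audio_bytes, access_token, language="en-US"):
    """Build a POST request for V2 synchronous recognition (not sent here)."""
    # Assumed path pattern; "_" is assumed to select the default recognizer.
    url = f"{BASE_URL}/v2/projects/{project}/locations/global/recognizers/_:recognize"
    body = {
        "config": {
            "autoDecodingConfig": {},      # let the service detect the encoding
            "languageCodes": [language],   # BCP-47 code
            "model": "long",               # assumed model name
        },
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_error(payload):
    """Extract (code, status, message) from {error: {code, message, status}}."""
    err = json.loads(payload).get("error", {})
    return err.get("code"), err.get("status"), err.get("message")
```

The request object can be passed to `urllib.request.urlopen` once a valid token is available. Keeping construction separate from sending makes the URL and body easy to inspect before any quota is spent.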

SDKs

Python google-cloud-speech · repo
Node.js @google-cloud/speech · repo
Java com.google.cloud:google-cloud-speech · repo
Go cloud.google.com/go/speech/apiv2 · repo
C# Google.Cloud.Speech.V2 · repo
PHP google/cloud-speech · repo
Ruby google-cloud-speech · repo
C++ google-cloud-cpp speech · repo

Adoption & maturity

Launched: 2016-01-01
GA: 2017-04-18
Notable customers: HubSpot, InteractiveTel, Embodied, iGenius, LogMeIn

Other Speech-to-Text & Transcription APIs

ElevenLabs Scribe (Speech to Text)
"Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."
Hybrid · free tier · public pricing · self-serve
Azure AI Speech to Text
"Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations."
Usage · free tier · public pricing · self-serve
Amazon Transcribe
"Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."
Usage · free tier · public pricing · self-serve
IBM watsonx Speech to Text
"IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."
Usage · free tier · public pricing · self-serve
AssemblyAI
"Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."
Usage · public pricing · self-serve
Speechmatics
"Low-latency speech-to-text for multilingual, multi-speaker conversations."
Usage · free tier · public pricing · self-serve


References

Each field above carries a numbered source, listed here.

[1] Description: cloud.google.com
[2] Pricing model: cloud.google.com · cloud.google.com
[3] Published pricing: cloud.google.com · cloud.google.com
[4] Free tier: brasstranscripts.com · cloud.google.com
[5] Supported actions: docs.cloud.google.com
[6] Regions: docs.cloud.google.com · docs.cloud.google.com
[7] Languages: docs.cloud.google.com · docs.cloud.google.com
[8] Webhooks: docs.cloud.google.com
[9] Sandbox: cloud.google.com
[10] SDK languages: github.com · docs.cloud.google.com
[11] MCP server: docs.cloud.google.com
[12] SOC 2: cloud.google.com · cloud.google.com
[13] HIPAA: cloud.google.com · cloud.google.com
[14] GDPR: cloud.google.com
[15] ISO 27001: cloud.google.com · cloud.google.com
[16] PCI DSS: cloud.google.com
[17] Published SLA: cloud.google.com
[18] Rate limits: docs.cloud.google.com · docs.cloud.google.com
[19] Known restrictions: costbench.com · docs.cloud.google.com

Change history

Every field change, who made it, and when - from our audited data pipeline and editors.

2026-06-21 Capabilities: {} → {"medical":true,"real_time_streaming":true,"speaker_diarization":true}
2026-06-21 Summary Md: (none) → Google Cloud Speech-to-Text is a REST API from Google Cloud that converts audio…
2026-06-21 Score Setup Speed: (none) → 85
2026-06-21 Score Pricing Transparency: (none) → 100
2026-06-21 Score Docs Quality: (none) → 15
2026-06-21 Score Procurement Friction: (none) → 100
2026-06-21 Score Trust Readiness: (none) → 100
2026-06-21 Best For: (none) → Prototypes and side projects - free to start, no sales call, Regulated or enter…
2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
2026-06-21 Score Agent Friendliness: (none) → 20
2026-06-21 Robots Allows Agents: (none) → Yes
2026-06-21 Status Page URL: (none) → https://status.cloud.google.com
2026-06-21 Docs URL: (none) → https://docs.cloud.google.com/
2026-06-21 Rendering: (none) → static
2026-06-21 Has Structured Data: (none) → No
2026-06-21 Llms Txt Present: (none) → No
2026-06-21 MCP Server Available: set to No
2026-06-21 Pricing Model: set to usage_based
2026-06-21 Has Published Pricing: set to Yes
2026-06-21 Free Tier Available: set to Yes
2026-06-21 Free Tier Details: set to 60 minutes of audio per month free (recurring monthly allowance); applies to bo…
2026-06-21 Self Serve Signup: set to Yes
2026-06-21 Requires Sales Call: set to No
2026-06-21 Enterprise Plan Available: set to Yes
2026-06-21 SOC 2: set to type_2
2026-06-21 HIPAA: set to Yes
2026-06-21 GDPR: set to Yes
2026-06-21 ISO 27001: set to Yes
2026-06-21 PCI DSS: set to Yes
2026-06-21 SLA Published: set to Yes
2026-06-21 SLA URL: set to https://cloud.google.com/speech-to-text/sla
2026-06-21 Data Retention Policy URL: set to https://docs.cloud.google.com/speech-to-text/docs/v1/data-usage-faq
2026-06-21 Documented Rate Limits: set to V2 API: Resource requests 100/60s; Operation requests 150/60s; Synchronous reco…
2026-06-21 Rate Limit Requests: set to 300
2026-06-21 Rate Limit Window: set to minute
2026-06-21 Known Restrictions: set to Synchronous recognition limited to ~1 minute of audio, Streaming sessions limit…
2026-06-21 Auth Methods: set to oauth2, api_key
2026-06-21 Auth Docs URL: set to https://cloud.google.com/speech-to-text/docs/authentication
2026-06-21 API Style: set to rest
2026-06-21 Base URL: set to https://speech.googleapis.com
2026-06-21 API Version: set to v2
2026-06-21 Versioning Scheme: set to url
2026-06-21 Stability: set to ga
2026-06-21 Deprecation Policy URL: set to https://cloud.google.com/terms/deprecation
2026-06-21 Error Format: set to vendor-specific (Google API error JSON: {error: {code, message, status}})
2026-06-21 Requires Verification: set to No
2026-06-21 Starting Price Usd: set to 0.016
2026-06-21 Price Basis: set to minute
2026-06-21 Free Tier Limit: set to 60 minutes/month
2026-06-21 Launched At: set to 2016-01-01

Suggest an edit / leave a review

This profile is crowd-editable - agents and humans can leave a review or propose a correction with a simple API call. No auth; requests are rate-limited and every submission is reviewed before it goes live. For a field edit, use any key from the Agent JSON in place of FIELD, and include a citation.

Leave a review or comment

curl -X POST https://apio.sh/api/feedback/google-speech-to-text \
  -H 'Content-Type: application/json' \
  -d '{"kind":"review","rating":5,"body":"Your experience with this API…"}'

Suggest a correction to a field (cite a source)

curl -X POST https://apio.sh/api/suggest/google-speech-to-text/FIELD \
  -H 'Content-Type: application/json' \
  -d '{"value":"corrected value","citations":[{"url":"https://source.example/page","excerpt":"supporting quote"}],"note":"what changed and why"}'

