Google Cloud Speech-to-Text

"Accurate voice typing and transcription powered by Gemini." [1]

cloud.google.com/speech-to-text · By Google · Last verified 2026-06-21 · Source confidence: high

Google Cloud Speech-to-Text is a REST API from Google Cloud that converts audio to text, supporting synchronous, batch, and streaming transcription across 190+ language-region variants and a dozen regional endpoints. It covers call center transcription, live captioning with WebVTT and SRT output, speaker diarization, and multi-speaker meeting transcription. V2 Standard recognition is priced from $0.016 per minute, with a free tier of 60 minutes per month, self-serve signup, and no sales call required. The profile lists SOC 2 Type II, ISO 27001, and PCI DSS attestations along with HIPAA and GDPR support, and the service ships official SDKs for Python, Node.js, Java, Go, C#, PHP, Ruby, and C++.

Best for / Avoid if

Best for:

  • Prototypes and side projects: free to start, no sales call
  • Regulated or enterprise workloads: compliance attestations and an enterprise plan
  • Teams needing broad API coverage out of the box

Pricing & procurement

Pricing model
Usage-based [2]
Published pricing
Yes [3]
Free tier
Yes [4]
Free tier details
60 minutes of audio per month free (recurring monthly allowance); applies to both streaming and batch recognition across V1 and V2 APIs.
Self-serve signup
Yes
Requires sales call
No
Enterprise plan
Yes
Published prices
Plan | Item | Per | Amount
V2 API – Standard (unlogged) | Recognition – 0 to 500,000 min/month | minute | $0.016
V2 API – Standard (unlogged) | Recognition – 500,000 to 1,000,000 min/month | minute | $0.01
V2 API – Standard (unlogged) | Recognition – 1,000,000 to 2,000,000 min/month | minute | $0.008
V2 API – Standard (unlogged) | Recognition – 2,000,000+ min/month | minute | $0.004
V2 API – Standard (logged / data-logging opt-in) | Recognition (Logged) – 0 to 500,000 min/month | minute | $0.012
V2 API – Standard (logged / data-logging opt-in) | Recognition (Logged) – 2,000,000+ min/month | minute | $0.003
V2 API – Dynamic Batch | Dynamic Batch Recognition (unlogged) | minute | $0.003
V2 API – Dynamic Batch | Dynamic Batch Recognition (logged) | minute | $0.0023
V2 API – Free tier | Monthly free allowance (all recognition types) | minute (first 60 min/month) | $0
V1 API – with data logging | Speech recognition with data logging | minute (above 60 free min/month) | $0.016
V1 API – without data logging | Speech recognition without data logging | minute (above 60 free min/month) | $0.024
V1 API – Medical (with data logging) | Medical dictation | minute (above 60 free min/month) | $0.078
V1 API – Medical (with data logging) | Medical conversation | minute (above 60 free min/month) | $0.078
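
To make the tiered prices concrete, here is a rough cost estimator for the V2 Standard (unlogged) tiers listed above. It is a sketch under stated assumptions, not Google's billing logic: it assumes the tiers are graduated (each band is billed at its own rate), that the 60 free minutes are subtracted before the tiers apply, and that each audio channel counts as separate minutes. The function name `estimate_monthly_cost` is hypothetical.

```python
from decimal import Decimal

FREE_MINUTES = Decimal(60)

# (upper bound of the band in minutes, price per minute) from the published
# V2 Standard (unlogged) prices; None marks the open-ended top band.
TIERS = [
    (Decimal(500_000), Decimal("0.016")),
    (Decimal(1_000_000), Decimal("0.01")),
    (Decimal(2_000_000), Decimal("0.008")),
    (None, Decimal("0.004")),
]


def estimate_monthly_cost(minutes, channels=1):
    """Estimate a month's bill in USD for `minutes` of audio.

    Assumptions (not confirmed by the profile): graduated tiers, free
    minutes deducted first, and per-channel billing applied as a multiplier.
    """
    billable = max(Decimal(0), Decimal(minutes) * channels - FREE_MINUTES)
    cost = Decimal(0)
    lower = Decimal(0)
    for upper, price in TIERS:
        if upper is None or billable <= upper:
            # The remaining minutes all fall inside this band.
            cost += (billable - lower) * price
            break
        # Bill the full band, then move on to the next one.
        cost += (upper - lower) * price
        lower = upper
    return cost
```

Under these assumptions, 1,060 minutes of mono audio leaves 1,000 billable minutes at $0.016, or $16. Check the provider's pricing page before relying on any figure from this sketch.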

Capabilities

  • Real-time streaming
  • Speaker diarization
  • Medical transcription
Supported actions
transcribe_synchronous, transcribe_batch, transcribe_streaming, speaker_diarization, automatic_punctuation, spoken_punctuation, spoken_emoji, profanity_filtering, word_timestamps, word_confidence_scores, language_detection, model_adaptation, custom_vocabulary_phrase_sets, custom_classes, srt_caption_generation, webvtt_caption_generation, recognizer_management, dynamic_batch_recognition [5]
Regions
global, us (US North America multi-region), eu (Europe multi-region), europe-west1, europe-west2, europe-west3, europe-west4, us-central1, asia-southeast1, asia-northeast1, asia-south1, northamerica-northeast1 [6]
Languages
190+ language-region variants (BCP-47 codes) including: Afrikaans (af-ZA), Albanian (sq-AL), Amharic (am-ET), Arabic (20+ regional variants: ar-AE, ar-BH, ar-DZ, ar-EG, ar-IL, ar-IQ, ar-JO, ar-KW, ar-LB, ar-LY, ar-MA, ar-MR, ar-OM, ar-PS, ar-QA, ar-SA, ar-SY, ar-TN, ar-YE), Armenian (hy-AM), Assamese (as-IN), Azerbaijani (az-AZ), Bangla/Bengali (bn-BD, bn-IN), Basque (eu-ES), Bosnian (bs-BA), Bulgarian (bg-BG), Burmese (my-MM), Catalan (ca-ES), Chinese Cantonese (yue-Hant-HK), Chinese Mandarin (zh, zh-TW), Croatian (hr-HR), Czech (cs-CZ), Danish (da-DK), Dutch (nl-BE, nl-NL), English (en-AU, en-CA, en-GB, en-GH, en-HK, en-IE, en-IN, en-KE, en-NG, en-NZ, en-PH, en-PK, en-SG, en-TZ, en-US, en-ZA), Estonian (et-EE), Filipino (fil-PH), Finnish (fi-FI), French (fr-BE, fr-CA, fr-CH, fr-FR), Galician (gl-ES), Georgian (ka-GE), German (de-AT, de-CH, de-DE), Greek (el-GR), Gujarati (gu-IN), Hausa (ha-NG), Hebrew (iw-IL), Hindi (hi-IN), Hungarian (hu-HU), Icelandic (is-IS), Indonesian (id-ID), Italian (it-CH, it-IT), Japanese (ja-JP), Javanese (jv-ID), Kannada (kn-IN), Kazakh (kk-KZ), Khmer (km-KH), Korean (ko-KR), Lao (lo-LA), Latvian (lv-LV), Lithuanian (lt-LT), Macedonian (mk-MK), Malay (ms-MY), Malayalam (ml-IN), Marathi (mr-IN), Mongolian (mn-MN), Nepali (ne-NP), Norwegian (nb-NO), Pashto (ps-AF), Persian (fa-IR), Polish (pl-PL), Portuguese (pt-BR, pt-PT), Punjabi (pa-Guru-IN), Romanian (ro-RO), Russian (ru-RU), Serbian (sr-RS), Sinhala (si-LK), Slovak (sk-SK), Slovenian (sl-SI), Somali (so-SO), Spanish (es-AR, es-BO, es-CL, es-CO, es-CR, es-CU, es-DO, es-EC, es-ES, es-GT, es-HN, es-MX, es-NI, es-PA, es-PE, es-PR, es-PY, es-SV, es-US, es-UY, es-VE), Sundanese (su-ID), Swahili (sw-KE, sw-TZ), Swedish (sv-SE), Tamil (ta-IN, ta-LK, ta-MY, ta-SG), Telugu (te-IN), Thai (th-TH), Turkish (tr-TR), Ukrainian (uk-UA), Urdu (ur-IN, ur-PK), Uzbek (uz-UZ), Vietnamese (vi-VN), Xhosa (xh-ZA), Yoruba (yo-NG), Zulu (zu-ZA) [7]
Input types
audio/flac (FLAC), audio/l16 (LINEAR16 PCM), audio/mulaw (MULAW / μ-law), audio/mpeg (MP3), audio/amr (AMR narrowband, 8000 Hz), audio/amr-wb (AMR-WB wideband, 16000 Hz), audio/ogg; codecs=opus (OGG_OPUS), audio/speex (SPEEX_WITH_HEADER_BYTE, 16000 Hz), video/webm; codecs=opus (WEBM_OPUS), WAV (with LINEAR16 or MULAW encoding), Cloud Storage URI (gs://), local file upload (≤10 MB), live streaming via gRPC
Output types
JSON (transcription results with confidence scores), plain text transcript, word-level timestamps, word confidence scores, speaker diarization tags, SRT captions, WebVTT captions, Cloud Storage file output (TranscriptOutputConfig)
Webhooks
No [8]
Sandbox / test mode
No [9]
SDK languages
Python, Node.js, Java, Go, C#, PHP, Ruby, C++ [10]
MCP server
No [11]
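
The regions listed above determine which host a V2 request goes to. The sketch below builds a recognize URL for a given region. The `{region}-speech.googleapis.com` hostname pattern and the `projects/.../locations/.../recognizers/_:recognize` path are assumptions based on Google's public V2 documentation, not on this profile, and the helper names are hypothetical. Verify both against the official reference before use.

```python
# Regions as listed in this profile.
REGIONS = {
    "global", "us", "eu",
    "europe-west1", "europe-west2", "europe-west3", "europe-west4",
    "us-central1", "asia-southeast1", "asia-northeast1", "asia-south1",
    "northamerica-northeast1",
}


def endpoint_for(location: str) -> str:
    """Return the assumed base URL for a region ("global" uses the bare host)."""
    if location not in REGIONS:
        raise ValueError(f"region not listed in this profile: {location}")
    if location == "global":
        return "https://speech.googleapis.com"
    return f"https://{location}-speech.googleapis.com"


def recognize_url(project: str, location: str, recognizer: str = "_") -> str:
    """Build the assumed V2 synchronous recognize URL.

    "_" is assumed to select the default recognizer for the request.
    """
    return (
        f"{endpoint_for(location)}/v2/projects/{project}"
        f"/locations/{location}/recognizers/{recognizer}:recognize"
    )
```

For example, `recognize_url("my-project", "eu")` targets the Europe multi-region host, where `my-project` is a placeholder project ID.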

Trust & compliance

SOC 2
SOC 2 Type II [12]
HIPAA
Yes [13]
GDPR
Yes [14]
ISO 27001
Yes [15]
PCI DSS
Yes [16]
Published SLA
Yes [17]
Rate limits
V2 API: Resource requests 100/60s; Operation requests 150/60s; Synchronous recognition 300/60s; Batch recognition 150/60s (all per region). V1 API: 900 recognition requests per 60 seconds per project (global). Daily audio processing limit: 480 hours per day. Synchronous audio max: ~1 min. Streaming session max: ~5 min. Async audio max: ~480 min. Local file max: 10 MB. [18]
Known restrictions

  • Synchronous recognition is limited to ~1 minute of audio.
  • Streaming sessions are limited to ~5 minutes; the session must be restarted for longer audio.
  • Asynchronous audio over ~1 minute must be referenced via a Cloud Storage URI (gs://).
  • Local file uploads are limited to 10 MB per request; no limit applies to Cloud Storage URIs.
  • Multi-channel audio is billed per channel (stereo = 2x rate).
  • Adaptation requests accept at most 5,000 phrases, 100,000 total characters, and 100 characters per phrase.
  • V1 allows 900 recognition requests per 60 seconds per project (soft quota, adjustable).
  • Audio processing is capped at 480 hours per day (daily quota, resets at midnight PST/PDT).
  • Custom speech model training is an allowlist-only feature.
  • There is no native webhook or push notification for async job completion; polling or Cloud Storage triggers are required.
  • HIPAA users must not opt into the data logging program.
  • Medical models (medical_conversation, medical_dictation) are available in the V1 API only. [19]
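
The restrictions above can be checked client-side before a request is sent. The following sketch picks a recognition mode for a prerecorded file and validates an adaptation phrase list against the stated limits. The function names are hypothetical, the limits are read as hard cut-offs even though the profile gives several as approximate, and "10 MB" is read conservatively as 10,000,000 bytes.

```python
SYNC_MAX_SECONDS = 60            # "~1 min" synchronous limit
ASYNC_MAX_SECONDS = 480 * 60     # "~480 min" asynchronous limit
LOCAL_MAX_BYTES = 10_000_000     # "10 MB", read conservatively (assumption)

MAX_PHRASES = 5_000
MAX_TOTAL_CHARS = 100_000
MAX_PHRASE_CHARS = 100


def choose_mode(duration_s: float, size_bytes: int, in_cloud_storage: bool):
    """Return (mode, needs_upload) for a prerecorded audio file.

    mode is "synchronous" or "batch"; needs_upload is True when the file
    must first be copied to Cloud Storage and referenced by a gs:// URI.
    """
    if duration_s > ASYNC_MAX_SECONDS:
        raise ValueError("audio exceeds the ~480 minute limit; split it first")
    mode = "synchronous" if duration_s <= SYNC_MAX_SECONDS else "batch"
    too_big_for_inline = duration_s > SYNC_MAX_SECONDS or size_bytes > LOCAL_MAX_BYTES
    return mode, (not in_cloud_storage and too_big_for_inline)


def validate_phrases(phrases: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the set fits."""
    problems = []
    if len(phrases) > MAX_PHRASES:
        problems.append(f"{len(phrases)} phrases exceeds {MAX_PHRASES}")
    total = sum(len(p) for p in phrases)
    if total > MAX_TOTAL_CHARS:
        problems.append(f"{total} total characters exceeds {MAX_TOTAL_CHARS}")
    too_long = [p for p in phrases if len(p) > MAX_PHRASE_CHARS]
    if too_long:
        problems.append(f"{len(too_long)} phrase(s) exceed {MAX_PHRASE_CHARS} characters")
    return problems
```

A ten-minute local recording, for instance, yields `("batch", True)`: it is too long for synchronous recognition and has to be uploaded first. Live audio is outside this sketch; it would use streaming with a session restart roughly every five minutes.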

Developer surface

Docs rendering: static

Integration

API style
rest
Base URL
https://speech.googleapis.com
Version
v2
Versioning
url
Stability
ga
Auth methods
oauth2, api_key
Error format
vendor-specific (Google API error JSON: {error: {code, message, status}})
Rate limit
300 / minute (V2 synchronous recognition, per region)
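
Two of the integration fields above lend themselves to small client-side helpers: the 300 requests per minute limit and the Google API error JSON shape. The sketch below shows a sliding-window limiter and an error parser. Both are illustrative only; the class and function names are hypothetical, and the service enforces its own quotas regardless of what the client does.

```python
import json
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(self, max_requests=300, window_seconds=60.0, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock          # injectable so the limiter can be tested
        self._sent = deque()         # timestamps of accepted requests

    def try_acquire(self) -> bool:
        """Record a request and return True if it fits, else return False."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) < self._max:
            self._sent.append(now)
            return True
        return False


def parse_google_error(body: str):
    """Extract (code, status, message) from {error: {code, message, status}}.

    Returns None when the body is not JSON or lacks an "error" object.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return None
    return error.get("code"), error.get("status"), error.get("message")
```

A caller would check `try_acquire()` before each request and back off when it returns False, then pass any non-2xx response body to `parse_google_error` to decide whether a retry makes sense.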

SDKs

  • Python google-cloud-speech
  • Node.js @google-cloud/speech
  • Java com.google.cloud:google-cloud-speech
  • Go cloud.google.com/go/speech/apiv2
  • C# Google.Cloud.Speech.V2
  • PHP google/cloud-speech
  • Ruby google-cloud-speech
  • C++ google-cloud-cpp speech

Adoption & maturity

Launched
2016-01-01
GA
2017-04-18
Notable customers
HubSpot, InteractiveTel, Embodied, iGenius, LogMeIn

Other Speech-to-Text & Transcription APIs

  • ElevenLabs Scribe (Speech to Text)

    "Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."

    Hybrid · free tier · public pricing · self-serve

  • Azure AI Speech to Text

    "Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations."

    Usage · free tier · public pricing · self-serve

  • Amazon Transcribe

    "Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."

    Usage · free tier · public pricing · self-serve

  • IBM watsonx Speech to Text

    "IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."

    Usage · free tier · public pricing · self-serve

  • AssemblyAI

    "Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."

    Usage · public pricing · self-serve

  • Speechmatics

    "Low-latency speech-to-text for multilingual, multi-speaker conversations."

    Usage · free tier · public pricing · self-serve


References

Change history

Every field change, who made it, and when - from our audited data pipeline and editors.

  1. 2026-06-21 Capabilities: {} → {"medical":true,"real_time_streaming":true,"speaker_diarization":true}
  2. 2026-06-21 Summary Md: (none) → Google Cloud Speech-to-Text is a REST API from Google Cloud that converts audio…
  3. 2026-06-21 Score Setup Speed: (none) → 85
  4. 2026-06-21 Score Pricing Transparency: (none) → 100
  5. 2026-06-21 Score Docs Quality: (none) → 15
  6. 2026-06-21 Score Procurement Friction: (none) → 100
  7. 2026-06-21 Score Trust Readiness: (none) → 100
  8. 2026-06-21 Best For: (none) → Prototypes and side projects - free to start, no sales call, Regulated or enter…
  9. 2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
  10. 2026-06-21 Score Agent Friendliness: (none) → 20
  11. 2026-06-21 Robots Allows Agents: (none) → Yes
  12. 2026-06-21 Status Page URL: (none) → https://status.cloud.google.com
  13. 2026-06-21 Docs URL: (none) → https://docs.cloud.google.com/
  14. 2026-06-21 Rendering: (none) → static
  15. 2026-06-21 Has Structured Data: (none) → No
  16. 2026-06-21 Llms Txt Present: (none) → No
  17. 2026-06-21 MCP Server Available: set to No
  18. 2026-06-21 Pricing Model: set to usage_based
  19. 2026-06-21 Has Published Pricing: set to Yes
  20. 2026-06-21 Free Tier Available: set to Yes
  21. 2026-06-21 Free Tier Details: set to 60 minutes of audio per month free (recurring monthly allowance); applies to bo…
  22. 2026-06-21 Self Serve Signup: set to Yes
  23. 2026-06-21 Requires Sales Call: set to No
  24. 2026-06-21 Enterprise Plan Available: set to Yes
  25. 2026-06-21 SOC 2: set to type_2
  26. 2026-06-21 HIPAA: set to Yes
  27. 2026-06-21 GDPR: set to Yes
  28. 2026-06-21 ISO 27001: set to Yes
  29. 2026-06-21 PCI DSS: set to Yes
  30. 2026-06-21 SLA Published: set to Yes
  31. 2026-06-21 SLA URL: set to https://cloud.google.com/speech-to-text/sla
  32. 2026-06-21 Data Retention Policy URL: set to https://docs.cloud.google.com/speech-to-text/docs/v1/data-usage-faq
  33. 2026-06-21 Documented Rate Limits: set to V2 API: Resource requests 100/60s; Operation requests 150/60s; Synchronous reco…
  34. 2026-06-21 Rate Limit Requests: set to 300
  35. 2026-06-21 Rate Limit Window: set to minute
  36. 2026-06-21 Known Restrictions: set to Synchronous recognition limited to ~1 minute of audio, Streaming sessions limit…
  37. 2026-06-21 Auth Methods: set to oauth2, api_key
  38. 2026-06-21 Auth Docs URL: set to https://cloud.google.com/speech-to-text/docs/authentication
  39. 2026-06-21 API Style: set to rest
  40. 2026-06-21 Base URL: set to https://speech.googleapis.com
  41. 2026-06-21 API Version: set to v2
  42. 2026-06-21 Versioning Scheme: set to url
  43. 2026-06-21 Stability: set to ga
  44. 2026-06-21 Deprecation Policy URL: set to https://cloud.google.com/terms/deprecation
  45. 2026-06-21 Error Format: set to vendor-specific (Google API error JSON: {error: {code, message, status}})
  46. 2026-06-21 Requires Verification: set to No
  47. 2026-06-21 Starting Price Usd: set to 0.016
  48. 2026-06-21 Price Basis: set to minute
  49. 2026-06-21 Free Tier Limit: set to 60 minutes/month
  50. 2026-06-21 Launched At: set to 2016-01-01

Suggest an edit / leave a review

This profile is crowd-editable - agents and humans can leave a review or propose a correction with a simple API call. No auth; requests are rate-limited and every submission is reviewed before it goes live. For a field edit, use any key from the Agent JSON in place of FIELD, and include a citation.

Leave a review or comment

curl -X POST https://apio.sh/api/feedback/google-speech-to-text \
  -H 'Content-Type: application/json' \
  -d '{"kind":"review","rating":5,"body":"Your experience with this API…"}'

Suggest a correction to a field (cite a source)

curl -X POST https://apio.sh/api/suggest/google-speech-to-text/FIELD \
  -H 'Content-Type: application/json' \
  -d '{"value":"corrected value","citations":[{"url":"https://source.example/page","excerpt":"supporting quote"}],"note":"what changed and why"}'
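
For callers that prefer Python to curl, the same correction request can be built with the standard library. This is a sketch that mirrors the curl command above: the helper name `build_suggestion` is hypothetical, and the payload keys are copied from the curl example rather than from any published schema.

```python
import json
import urllib.request


def build_suggestion(slug, field, value, source_url, excerpt, note):
    """Build (but do not send) a POST request proposing a field correction."""
    payload = {
        "value": value,
        "citations": [{"url": source_url, "excerpt": excerpt}],
        "note": note,
    }
    return urllib.request.Request(
        f"https://apio.sh/api/suggest/{slug}/{field}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would submit it; the builder is kept separate so the payload can be inspected first.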
