Google Cloud Speech-to-Text
"Accurate voice typing and transcription powered by Gemini." [1]
Google Cloud Speech-to-Text is a REST API from Google Cloud that converts audio to text, supporting synchronous, batch, and streaming transcription across 190+ language-region variants and 12 locations (global, multi-region, and regional endpoints). It covers call center transcription, live captioning with WebVTT and SRT output, speaker diarization, and multi-speaker meeting transcription. Pricing starts at $0.016 per minute with a free tier of 60 minutes per month, self-serve signup, and no sales call required. The service is covered by SOC 2 Type II and ISO 27001 attestations, supports HIPAA, GDPR, and PCI DSS compliance, and ships official SDKs for Python, Node.js, Java, Go, C#, PHP, Ruby, and C++.
Best for / Avoid if
Best for: Prototypes and side projects - free to start, no sales call; Regulated or enterprise workloads - compliance attestations and an enterprise plan; Teams needing broad API coverage out of the box
Pricing & procurement
- Pricing model
- Usage-based [2]
- Published pricing
- ✓ Yes [3]
- Free tier
- ✓ Yes [4]
- Free tier details
- 60 minutes of audio per month free (recurring monthly allowance); applies to both streaming and batch recognition across V1 and V2 APIs.
- Self-serve signup
- ✓ Yes
- Requires sales call
- ✗ No
- Enterprise plan
- ✓ Yes
| Plan | Item | Unit | Price (USD) |
|---|---|---|---|
| V2 API – Standard (unlogged) | Recognition – 0 to 500,000 min/month | minute | $0.016 |
| V2 API – Standard (unlogged) | Recognition – 500,000 to 1,000,000 min/month | minute | $0.01 |
| V2 API – Standard (unlogged) | Recognition – 1,000,000 to 2,000,000 min/month | minute | $0.008 |
| V2 API – Standard (unlogged) | Recognition – 2,000,000+ min/month | minute | $0.004 |
| V2 API – Standard (logged / data-logging opt-in) | Recognition (Logged) – 0 to 500,000 min/month | minute | $0.012 |
| V2 API – Standard (logged / data-logging opt-in) | Recognition (Logged) – 2,000,000+ min/month | minute | $0.003 |
| V2 API – Dynamic Batch | Dynamic Batch Recognition (unlogged) | minute | $0.003 |
| V2 API – Dynamic Batch | Dynamic Batch Recognition (logged) | minute | $0.0023 |
| V2 API – Free tier | Monthly free allowance (all recognition types) | minute (first 60 min/month) | $0 |
| V1 API – with data logging | Speech recognition with data logging | minute (above 60 free min/month) | $0.016 |
| V1 API – without data logging | Speech recognition without data logging | minute (above 60 free min/month) | $0.024 |
| V1 API – Medical (with data logging) | Medical dictation | minute (above 60 free min/month) | $0.078 |
| V1 API – Medical (with data logging) | Medical conversation | minute (above 60 free min/month) | $0.078 |
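To make the tier table concrete, here is a rough cost estimate for the V2 Standard (unlogged) rates. It is a sketch under two assumptions that the table does not state: that tiers are graduated (each rate applies only to the minutes inside its band) and that the 60 free minutes are subtracted before the first tier. The function name `estimate_monthly_cost` is made up for illustration.

```python
from decimal import Decimal

# V2 Standard (unlogged) tiers copied from the table:
# (upper bound in minutes, USD per minute). None marks the open-ended top tier.
TIERS = [
    (500_000, Decimal("0.016")),
    (1_000_000, Decimal("0.01")),
    (2_000_000, Decimal("0.008")),
    (None, Decimal("0.004")),
]
FREE_MINUTES = 60  # recurring monthly allowance


def estimate_monthly_cost(minutes: int) -> Decimal:
    """Estimate the monthly USD cost.

    ASSUMPTION: tiers are graduated (marginal) and the free allowance is
    deducted first. Check the official pricing page before relying on this.
    """
    billable = max(0, minutes - FREE_MINUTES)
    cost = Decimal("0")
    lower = 0  # lower bound of the current tier, in minutes
    for upper, rate in TIERS:
        if billable <= lower:
            break
        top = billable if upper is None else min(billable, upper)
        cost += (top - lower) * rate
        lower = top
    return cost
```

Under these assumptions, 1,060 minutes in a month would bill 1,000 minutes at $0.016, and a month with 600,060 minutes would span the first two tiers.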
Capabilities
- Supported actions
- transcribe_synchronous, transcribe_batch, transcribe_streaming, speaker_diarization, automatic_punctuation, spoken_punctuation, spoken_emoji, profanity_filtering, word_timestamps, word_confidence_scores, language_detection, model_adaptation, custom_vocabulary_phrase_sets, custom_classes, srt_caption_generation, webvtt_caption_generation, recognizer_management, dynamic_batch_recognition [5]
- Regions
- global, us (US North America multi-region), eu (Europe multi-region), europe-west1, europe-west2, europe-west3, europe-west4, us-central1, asia-southeast1, asia-northeast1, asia-south1, northamerica-northeast1 [6]
- Languages
- 190+ language-region variants (BCP-47 codes) including: Afrikaans (af-ZA), Albanian (sq-AL), Amharic (am-ET), Arabic (20+ regional variants: ar-AE, ar-BH, ar-DZ, ar-EG, ar-IL, ar-IQ, ar-JO, ar-KW, ar-LB, ar-LY, ar-MA, ar-MR, ar-OM, ar-PS, ar-QA, ar-SA, ar-SY, ar-TN, ar-YE), Armenian (hy-AM), Assamese (as-IN), Azerbaijani (az-AZ), Bangla/Bengali (bn-BD, bn-IN), Basque (eu-ES), Bosnian (bs-BA), Bulgarian (bg-BG), Burmese (my-MM), Catalan (ca-ES), Chinese Cantonese (yue-Hant-HK), Chinese Mandarin (zh, zh-TW), Croatian (hr-HR), Czech (cs-CZ), Danish (da-DK), Dutch (nl-BE, nl-NL), English (en-AU, en-CA, en-GB, en-GH, en-HK, en-IE, en-IN, en-KE, en-NG, en-NZ, en-PH, en-PK, en-SG, en-TZ, en-US, en-ZA), Estonian (et-EE), Filipino (fil-PH), Finnish (fi-FI), French (fr-BE, fr-CA, fr-CH, fr-FR), Galician (gl-ES), Georgian (ka-GE), German (de-AT, de-CH, de-DE), Greek (el-GR), Gujarati (gu-IN), Hausa (ha-NG), Hebrew (iw-IL), Hindi (hi-IN), Hungarian (hu-HU), Icelandic (is-IS), Indonesian (id-ID), Italian (it-CH, it-IT), Japanese (ja-JP), Javanese (jv-ID), Kannada (kn-IN), Kazakh (kk-KZ), Khmer (km-KH), Korean (ko-KR), Lao (lo-LA), Latvian (lv-LV), Lithuanian (lt-LT), Macedonian (mk-MK), Malay (ms-MY), Malayalam (ml-IN), Marathi (mr-IN), Mongolian (mn-MN), Nepali (ne-NP), Norwegian (nb-NO), Pashto (ps-AF), Persian (fa-IR), Polish (pl-PL), Portuguese (pt-BR, pt-PT), Punjabi (pa-Guru-IN), Romanian (ro-RO), Russian (ru-RU), Serbian (sr-RS), Sinhala (si-LK), Slovak (sk-SK), Slovenian (sl-SI), Somali (so-SO), Spanish (es-AR, es-BO, es-CL, es-CO, es-CR, es-CU, es-DO, es-EC, es-ES, es-GT, es-HN, es-MX, es-NI, es-PA, es-PE, es-PR, es-PY, es-SV, es-US, es-UY, es-VE), Sundanese (su-ID), Swahili (sw-KE, sw-TZ), Swedish (sv-SE), Tamil (ta-IN, ta-LK, ta-MY, ta-SG), Telugu (te-IN), Thai (th-TH), Turkish (tr-TR), Ukrainian (uk-UA), Urdu (ur-IN, ur-PK), Uzbek (uz-UZ), Vietnamese (vi-VN), Xhosa (xh-ZA), Yoruba (yo-NG), Zulu (zu-ZA) [7]
- Input types
- Encodings: audio/flac (FLAC), audio/l16 (LINEAR16 PCM), audio/mulaw (MULAW / μ-law), audio/mpeg (MP3), audio/amr (AMR narrowband, 8000 Hz), audio/amr-wb (AMR-WB wideband, 16000 Hz), audio/ogg; codecs=opus (OGG_OPUS), audio/speex (SPEEX_WITH_HEADER_BYTE, 16000 Hz), video/webm; codecs=opus (WEBM_OPUS), WAV (with LINEAR16 or MULAW encoding). Sources: Cloud Storage URI (gs://), local file upload (≤10 MB), live streaming via gRPC
- Output types
- JSON (transcription results with confidence scores), plain text transcript, word-level timestamps, word confidence scores, speaker diarization tags, SRT captions, WebVTT captions, Cloud Storage file output (TranscriptOutputConfig)
- Webhooks
- ✗ No [8]
- Sandbox / test mode
- ✗ No [9]
- SDK languages
- Python, Node.js, Java, Go, C#, PHP, Ruby, C++ [10]
- MCP server
- ✗ No [11]
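The capabilities above list word-level timestamps and SRT caption output. The service can emit SRT itself; the sketch below only illustrates how the two relate, by building SRT cues from word timings on the client. The `Word` type, the function names, and the fixed seven-words-per-cue grouping are illustrative choices, not part of the API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with start/end offsets in seconds (hypothetical shape)."""
    text: str
    start: float
    end: float


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into fixed-size cues and render them as an SRT document."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        index = i // max_words + 1  # SRT cue numbers start at 1
        span = f"{srt_timestamp(chunk[0].start)} --> {srt_timestamp(chunk[-1].end)}"
        cues.append(f"{index}\n{span}\n{' '.join(w.text for w in chunk)}\n")
    return "\n".join(cues)
```

WebVTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds, so the same grouping logic would carry over.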
Trust & compliance
- SOC 2
- SOC 2 Type II [12]
- HIPAA
- ✓ Yes [13]
- GDPR
- ✓ Yes [14]
- ISO 27001
- ✓ Yes [15]
- PCI DSS
- ✓ Yes [16]
- Published SLA
- ✓ Yes [17]
- Rate limits
- V2 API: Resource requests 100/60s; Operation requests 150/60s; Synchronous recognition 300/60s; Batch recognition 150/60s (all per region). V1 API: 900 recognition requests per 60 seconds per project (global). Daily audio processing limit: 480 hours per day. Synchronous audio max: ~1 min. Streaming session max: ~5 min. Async audio max: ~480 min. Local file max: 10 MB. [18]
- Known restrictions
- Synchronous recognition is limited to ~1 minute of audio. Streaming sessions are limited to ~5 minutes, and the session must be restarted for longer audio. Asynchronous audio over ~1 minute must be referenced via a Cloud Storage URI (gs://). Local file uploads are limited to 10 MB per request, with no size limit for Cloud Storage URIs. Multi-channel audio is billed per channel (stereo = 2x rate). Model adaptation allows at most 5,000 phrases per request, 100,000 total characters, and 100 characters per phrase. V1 allows 900 recognition requests per 60 seconds per project (soft quota, adjustable). The daily quota is 480 hours of audio (resets at midnight PST/PDT). Custom speech model training is an allowlist-only feature. There is no native webhook or push notification for async job completion, so polling or Cloud Storage triggers are required. HIPAA users must not opt into the data logging program. Medical models (medical_conversation, medical_dictation) are V1 API only. [19]
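The duration and size limits above suggest a simple pre-flight check before submitting audio. The following is a minimal sketch of that decision, using only the approximate limits quoted in this profile. The function names are hypothetical, and treating 10 MB as 10 × 2^20 bytes is an assumption.

```python
import math

SYNC_MAX_SECONDS = 60               # ~1 min synchronous limit
STREAM_MAX_SECONDS = 5 * 60         # ~5 min per streaming session
ASYNC_MAX_SECONDS = 480 * 60        # ~480 min asynchronous limit
LOCAL_MAX_BYTES = 10 * 1024 * 1024  # 10 MB inline limit (ASSUMES MB = 2**20 bytes)


def choose_mode(duration_s: float, size_bytes: int, in_cloud_storage: bool) -> str:
    """Pick "synchronous" or "batch" for a recorded file, or raise if neither fits."""
    if duration_s > ASYNC_MAX_SECONDS:
        raise ValueError("audio exceeds the ~480 minute limit; split it first")
    if not in_cloud_storage and size_bytes > LOCAL_MAX_BYTES:
        raise ValueError("inline audio over 10 MB; upload it to Cloud Storage first")
    if duration_s <= SYNC_MAX_SECONDS:
        return "synchronous"
    if not in_cloud_storage:
        raise ValueError("audio over ~1 minute must be referenced via a gs:// URI")
    return "batch"


def streaming_sessions_needed(duration_s: float) -> int:
    """Number of ~5 minute streaming sessions needed to cover a live stream."""
    return max(1, math.ceil(duration_s / STREAM_MAX_SECONDS))
```

Because the documented limits are approximate ("~1 min", "~5 min"), a real client would likely leave some headroom rather than cut exactly at these thresholds.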
Developer surface
Integration
- API style
- rest
- Base URL
- https://speech.googleapis.com
- Version
- v2
- Versioning
- url
- Stability
- ga
- Auth methods
- oauth2, api_key
- Error format
- vendor-specific (Google API error JSON: {error: {code, message, status}})
- Rate limit
- 300 requests / minute (V2 synchronous recognition, per region)
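As a rough illustration of the integration fields above, the sketch below builds the URL and JSON body for a synchronous v2 request and unpacks the `{error: {code, message, status}}` envelope. It sends nothing over the network. The path layout, the `_` default recognizer, the regional host prefix, and the body field names (`autoDecodingConfig`, `languageCodes`, `model`, `content`) reflect my understanding of the V2 REST reference and should be treated as assumptions to verify; the helper names are made up.

```python
import base64
import json
from typing import Optional, Tuple


def build_recognize_request(project: str, audio: bytes, location: str = "global",
                            language: str = "en-US") -> Tuple[str, dict]:
    """Return (url, json_body) for a synchronous recognize call.

    ASSUMPTION: URL shape and field names follow the V2 REST reference;
    confirm them against the official docs before use.
    """
    # ASSUMPTION: non-global locations use a "<location>-speech" host prefix.
    host = ("speech.googleapis.com" if location == "global"
            else f"{location}-speech.googleapis.com")
    url = (f"https://{host}/v2/projects/{project}/locations/{location}"
           "/recognizers/_:recognize")
    body = {
        "config": {
            "autoDecodingConfig": {},      # let the service detect the encoding
            "languageCodes": [language],
            "model": "long",               # ASSUMED model name
        },
        # Inline audio is sent base64-encoded in JSON.
        "content": base64.b64encode(audio).decode("ascii"),
    }
    return url, body


def parse_error(payload: str) -> Optional[Tuple[int, str, str]]:
    """Extract (code, status, message) from the error envelope, or None if absent."""
    error = json.loads(payload).get("error")
    if error is None:
        return None
    return error.get("code"), error.get("status"), error.get("message")
```

With OAuth 2.0, the request would carry an `Authorization: Bearer <token>` header. A quota rejection would typically surface as HTTP 429 with a `RESOURCE_EXHAUSTED` status, which is the case a retry-with-backoff wrapper would look for.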
Adoption & maturity
- Launched
- 2016-01-01
- GA
- 2017-04-18
- Notable customers
- HubSpot, InteractiveTel, Embodied, iGenius, LogMeIn
Other Speech-to-Text & Transcription APIs
ElevenLabs Scribe (Speech to Text)
"Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."
Azure AI Speech to Text
"Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations."
Amazon Transcribe
"Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."
IBM watsonx Speech to Text
"IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."
AssemblyAI
"Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."
Speechmatics
"Low-latency speech-to-text for multilingual, multi-speaker conversations."
References
- [1] Description: cloud.google.com
- [2] Pricing model: cloud.google.com · cloud.google.com
- [3] Published pricing: cloud.google.com · cloud.google.com
- [4] Free tier: brasstranscripts.com · cloud.google.com
- [5] Supported actions: docs.cloud.google.com
- [6] Regions: docs.cloud.google.com · docs.cloud.google.com
- [7] Languages: docs.cloud.google.com · docs.cloud.google.com
- [8] Webhooks: docs.cloud.google.com
- [9] Sandbox: cloud.google.com
- [10] SDK languages: github.com · docs.cloud.google.com
- [11] MCP server: docs.cloud.google.com
- [12] SOC 2: cloud.google.com · cloud.google.com
- [13] HIPAA: cloud.google.com · cloud.google.com
- [14] GDPR: cloud.google.com
- [15] ISO 27001: cloud.google.com · cloud.google.com
- [16] PCI DSS: cloud.google.com
- [17] Published SLA: cloud.google.com
- [18] Rate limits: docs.cloud.google.com · docs.cloud.google.com
- [19] Known restrictions: costbench.com · docs.cloud.google.com
Change history
- 2026-06-21 Capabilities: {} → {"medical":true,"real_time_streaming":true,"speaker_diarization":true}
- 2026-06-21 Summary Md: (none) → Google Cloud Speech-to-Text is a REST API from Google Cloud that converts audio…
- 2026-06-21 Score Setup Speed: (none) → 85
- 2026-06-21 Score Pricing Transparency: (none) → 100
- 2026-06-21 Score Docs Quality: (none) → 15
- 2026-06-21 Score Procurement Friction: (none) → 100
- 2026-06-21 Score Trust Readiness: (none) → 100
- 2026-06-21 Best For: (none) → Prototypes and side projects - free to start, no sales call, Regulated or enter…
- 2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
- 2026-06-21 Score Agent Friendliness: (none) → 20
- 2026-06-21 Robots Allows Agents: (none) → Yes
- 2026-06-21 Status Page URL: (none) → https://status.cloud.google.com
- 2026-06-21 Docs URL: (none) → https://docs.cloud.google.com/
- 2026-06-21 Rendering: (none) → static
- 2026-06-21 Has Structured Data: (none) → No
- 2026-06-21 Llms Txt Present: (none) → No
- 2026-06-21 MCP Server Available: set to No
- 2026-06-21 Pricing Model: set to usage_based
- 2026-06-21 Has Published Pricing: set to Yes
- 2026-06-21 Free Tier Available: set to Yes
- 2026-06-21 Free Tier Details: set to 60 minutes of audio per month free (recurring monthly allowance); applies to bo…
- 2026-06-21 Self Serve Signup: set to Yes
- 2026-06-21 Requires Sales Call: set to No
- 2026-06-21 Enterprise Plan Available: set to Yes
- 2026-06-21 SOC 2: set to type_2
- 2026-06-21 HIPAA: set to Yes
- 2026-06-21 GDPR: set to Yes
- 2026-06-21 ISO 27001: set to Yes
- 2026-06-21 PCI DSS: set to Yes
- 2026-06-21 SLA Published: set to Yes
- 2026-06-21 SLA URL: set to https://cloud.google.com/speech-to-text/sla
- 2026-06-21 Data Retention Policy URL: set to https://docs.cloud.google.com/speech-to-text/docs/v1/data-usage-faq
- 2026-06-21 Documented Rate Limits: set to V2 API: Resource requests 100/60s; Operation requests 150/60s; Synchronous reco…
- 2026-06-21 Rate Limit Requests: set to 300
- 2026-06-21 Rate Limit Window: set to minute
- 2026-06-21 Known Restrictions: set to Synchronous recognition limited to ~1 minute of audio, Streaming sessions limit…
- 2026-06-21 Auth Methods: set to oauth2, api_key
- 2026-06-21 Auth Docs URL: set to https://cloud.google.com/speech-to-text/docs/authentication
- 2026-06-21 API Style: set to rest
- 2026-06-21 Base URL: set to https://speech.googleapis.com
- 2026-06-21 API Version: set to v2
- 2026-06-21 Versioning Scheme: set to url
- 2026-06-21 Stability: set to ga
- 2026-06-21 Deprecation Policy URL: set to https://cloud.google.com/terms/deprecation
- 2026-06-21 Error Format: set to vendor-specific (Google API error JSON: {error: {code, message, status}})
- 2026-06-21 Requires Verification: set to No
- 2026-06-21 Starting Price Usd: set to 0.016
- 2026-06-21 Price Basis: set to minute
- 2026-06-21 Free Tier Limit: set to 60 minutes/month
- 2026-06-21 Launched At: set to 2016-01-01