OpenAI Speech-to-Text
"The Audio API provides two speech-to-text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model." [1]
OpenAI Speech-to-Text is a REST API offering batch, streaming, and real-time audio transcription, speaker diarization, language detection, and translation to English, built on Whisper and newer gpt-4o-based models. Pricing is usage-based and starts at an estimated $0.003 per minute (gpt-4o-mini-transcribe); signup is self-serve with no sales call required, and an enterprise plan is available. The API ships official SDKs for Python, Node.js, Java, Go, Ruby, and .NET, and this profile lists SOC 2 Type II, HIPAA, GDPR, ISO 27001, and PCI DSS under trust and compliance.
Best for / Avoid if
Best for: Regulated or enterprise workloads - compliance attestations and an enterprise plan; AI agents and automation - an agent-ready surface (MCP / llms.txt); Teams needing broad API coverage out of the box
Avoid if: You want to try it free before paying
Pricing & procurement
- Pricing model
- Usage-based [2]
- Published pricing
- ✓ Yes [3]
- Free tier
- ✗ No [4]
- Self-serve signup
- ✓ Yes
- Requires sales call
- ✗ No
- Enterprise plan
- ✓ Yes [5]
| Item | Per | Amount | Source |
|---|---|---|---|
| whisper-1 transcription | minute | $0.006 | source |
| gpt-4o-transcribe audio input | 1M tokens | $2.50 | source |
| gpt-4o-transcribe text output | 1M tokens | $10.00 | source |
| gpt-4o-transcribe estimated cost | minute | $0.006 | source |
| gpt-4o-mini-transcribe audio input (snapshots: gpt-4o-mini-transcribe-2025-12-15, gpt-4o-mini-transcribe-2025-03-20) | 1M tokens | $1.25 | source |
| gpt-4o-mini-transcribe text output | 1M tokens | $5.00 | source |
| gpt-4o-mini-transcribe estimated cost | minute | $0.003 | source |
| gpt-4o-transcribe-diarize audio input (speaker diarization) | 1M tokens | $2.50 | source |
| gpt-4o-transcribe-diarize text output (speaker diarization) | 1M tokens | $10.00 | source |
| gpt-4o-transcribe-diarize estimated cost (speaker diarization) | minute | $0.006 | source |
| gpt-realtime-whisper streaming transcription (audio duration) | minute | $0.017 | source |
| gpt-realtime-translate streaming speech translation (audio duration) | minute | $0.034 | source |
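As a rough illustration of how the per-minute rows above translate into a bill, the sketch below multiplies audio duration by the listed per-minute figure. The function name `estimate_cost_usd` and the dictionary layout are hypothetical. For the gpt-4o-* models the table's per-minute values are themselves estimates (billing is per token), so the results should be read as approximations only.

```python
# Per-minute figures copied from the pricing table above (USD per audio minute).
# The gpt-4o-* entries are the table's *estimated* per-minute costs; billing for
# those models is token-based, so results here are approximations.
PER_MINUTE_USD = {
    "whisper-1": 0.006,
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-transcribe-diarize": 0.006,
    "gpt-realtime-whisper": 0.017,
    "gpt-realtime-translate": 0.034,
}


def estimate_cost_usd(model: str, audio_seconds: float) -> float:
    """Return an approximate cost in USD for `audio_seconds` of audio."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    rate = PER_MINUTE_USD[model]  # raises KeyError for models not in the table
    return round(rate * audio_seconds / 60.0, 6)


if __name__ == "__main__":
    # Compare one hour of audio across every model in the table.
    for name in PER_MINUTE_USD:
        print(f"{name}: ${estimate_cost_usd(name, 3600):.3f} per hour of audio")
```

Under these figures, one hour of audio on gpt-4o-mini-transcribe comes to roughly $0.18, and the same hour on gpt-realtime-translate to roughly $2.04.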
Capabilities
- Supported actions
- transcribe_batch, transcribe_streaming, transcribe_realtime, translation_to_english, speaker_diarization, word_timestamps, segment_timestamps, language_detection, prompting_for_accuracy, logprobs_confidence_scoring, voice_activity_detection [6]
- Languages
- 99+ languages including Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bosnian, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, German, Greek, Gujarati, Haitian Creole, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba [7]
- Input types
- audio/mp3, audio/mp4, audio/mpeg, audio/mpga, audio/m4a, audio/wav, audio/webm, WebSocket (realtime streaming), WebRTC (realtime browser) [8]
- Output types
- JSON, plain text, SRT, VTT, verbose JSON, diarized JSON, word timestamps, segment timestamps, streaming transcript deltas [9]
- Webhooks
- ✗ No [10]
- Sandbox / test mode
- ✗ No [11]
- SDK languages
- Python, Node.js, Java, Go, Ruby, .NET [12]
- MCP server
- ✓ Yes [13]
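To make the batch transcription action concrete, here is a minimal sketch that builds, but does not send, a multipart upload using only the Python standard library. It assumes the `/audio/transcriptions` path under the base URL listed in this profile, bearer-token authentication with the API key, and the form fields `file`, `model`, and `response_format`. The helper name `build_transcription_request` is hypothetical, and the official SDKs wrap this step for you.

```python
import mimetypes
import urllib.request
import uuid

BASE_URL = "https://api.openai.com/v1"  # base URL as listed in this profile


def build_transcription_request(
    audio_bytes: bytes,
    filename: str,
    api_key: str,
    model: str = "gpt-4o-mini-transcribe",
    response_format: str = "json",
) -> urllib.request.Request:
    """Build (without sending) a multipart POST for a batch transcription."""
    boundary = uuid.uuid4().hex
    # Guess the part's MIME type from the file name; fall back to raw bytes.
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"

    parts = []
    for field, value in (("model", model), ("response_format", response_format)):
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{field}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         f"Content-Type: {file_type}\r\n\r\n").encode()
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary

    return urllib.request.Request(
        f"{BASE_URL}/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call. That step is left out here so the sketch can be inspected without an API key or network access.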
Trust & compliance
- SOC 2
- SOC 2 Type II [14]
- HIPAA
- ✓ Yes [15]
- GDPR
- ✓ Yes [16]
- ISO 27001
- ✓ Yes [17]
- PCI DSS
- ✓ Yes [18]
- Published SLA
- ✓ Yes [19]
- Rate limits
- whisper-1: Free 3 RPM / 200 RPD; Tier 1: 500 RPM; Tier 2: 2,500 RPM; Tier 3: 5,000 RPM; Tier 4: 7,500 RPM; Tier 5: 10,000 RPM.
- gpt-4o-transcribe / gpt-4o-transcribe-diarize: Tier 1: 500 RPM / 10K TPM; Tier 2: 2,000 RPM / 100K TPM; Tier 3: 5,000 RPM / 400K TPM; Tier 4: 10,000 RPM / 2M TPM; Tier 5: 10,000 RPM / 6M TPM.
- gpt-4o-mini-transcribe: Tier 1: 500 RPM / 50K TPM; Tier 2: 2,000 RPM / 150K TPM; Tier 3: 5,000 RPM / 600K TPM; Tier 4: 10,000 RPM / 2M TPM; Tier 5: 10,000 RPM / 8M TPM.
- gpt-realtime-whisper: Tier 1: 100 min/min; Tier 2: 350 min/min; Tier 3: 650 min/min; Tier 4: 1,000 min/min; Tier 5: 1,300 min/min. [20]
- Known restrictions
- Maximum file upload size: 25 MB; Translation endpoint outputs English only (whisper-1 only; not available on gpt-4o-transcribe models); Speaker diarization (gpt-4o-transcribe-diarize) requires chunking_strategy for audio longer than 30 seconds; gpt-4o-transcribe-diarize does not support prompts, logprobs, or timestamp_granularities[]; Prompt steering not supported for gpt-realtime-whisper in realtime sessions; Context window: 16,000 tokens, max output: 2,000 tokens (gpt-4o-transcribe models); gpt-4o-transcribe and gpt-4o-mini-transcribe output JSON or plain text only (not SRT/VTT) [21]
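Several of the known restrictions above can be checked on the client before a request is sent. The sketch below is one possible pre-flight check, not an official validator. The function name `preflight`, the `translate` flag, and the keys used in `params` (`prompt`, `logprobs`, `timestamp_granularities`, `chunking_strategy`, `response_format`) are illustrative labels for the listed restrictions rather than confirmed request field names. Reading "25 MB" as 25 × 1024 × 1024 bytes is likewise an assumption.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # "25 MB" read as 25 MiB (assumption)

DIARIZE_MODEL = "gpt-4o-transcribe-diarize"
JSON_OR_TEXT_ONLY = {"gpt-4o-transcribe", "gpt-4o-mini-transcribe"}
# Illustrative option names for features the diarize model does not support.
DIARIZE_UNSUPPORTED = ("prompt", "logprobs", "timestamp_granularities")


def preflight(model, file_size_bytes, duration_seconds, params, translate=False):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    if file_size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 25 MB upload limit")

    if translate and model != "whisper-1":
        problems.append("translation to English is available on whisper-1 only")

    fmt = params.get("response_format", "json")
    if model in JSON_OR_TEXT_ONLY and fmt not in ("json", "text"):
        problems.append(f"{model} outputs JSON or plain text only, not {fmt}")

    if model == DIARIZE_MODEL:
        for name in DIARIZE_UNSUPPORTED:
            if name in params:
                problems.append(f"{model} does not support {name}")
        if duration_seconds > 30 and "chunking_strategy" not in params:
            problems.append("diarization needs chunking_strategy for audio over 30 s")

    return problems


if __name__ == "__main__":
    # An SRT request is flagged for gpt-4o-transcribe.
    print(preflight("gpt-4o-transcribe", 1_000_000, 120, {"response_format": "srt"}))
```

Returning a list rather than raising on the first failure lets a caller report every problem at once. The server remains the authority on what is accepted.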
Developer surface
Integration
- API style
- rest
- Base URL
- https://api.openai.com/v1
- Version
- v1
- Versioning
- url
- Stability
- ga
- Auth methods
- api_key
- Error format
- vendor-specific
- Webhook signing
- hmac_sha256
- Rate limit
- 500 / minute
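A client that wants to stay under the listed 500 requests per minute could throttle itself with a sliding-window counter. The sketch below is one such approach, with hypothetical class and method names. It counts requests only: the per-tier and per-model limits and the token-based (TPM) limits listed under Trust & compliance are not modelled, and the server's own enforcement may differ.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any rolling `window` seconds."""

    def __init__(self, limit=500, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested
        self._stamps = deque()  # timestamps of calls inside the window

    def wait_time(self):
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            return 0.0
        return self.window - (now - self._stamps[0])

    def try_acquire(self):
        """Record a call and return True if allowed; otherwise return False."""
        if self.wait_time() > 0:
            return False
        self._stamps.append(self.clock())
        return True

    def acquire(self):
        """Block until a slot is free, then record the call."""
        while not self.try_acquire():
            time.sleep(self.wait_time())


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=500, window=60.0)
    limiter.acquire()  # call this before each API request
    print("slot acquired")
```

Calling `acquire()` before each request keeps a single process within the window. Several workers sharing one API key would need a shared counter instead, which this sketch does not cover.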
Adoption & maturity
- Launched
- 2023-03-01
- GA
- 2025-04-01
- Notable customers
- Speak
Other Speech-to-Text & Transcription APIs
ElevenLabs Scribe (Speech to Text)
"Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."
Azure AI Speech to Text
"Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations."
Amazon Transcribe
"Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."
Google Cloud Speech-to-Text
"Accurate voice typing and transcription powered by Gemini."
IBM watsonx Speech to Text
"IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."
AssemblyAI
"Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."
References
- [1] Description: developers.openai.com
- [2] Pricing model: developers.openai.com · developers.openai.com
- [3] Published pricing: developers.openai.com
- [4] Free tier: developers.openai.com
- [5] Enterprise plan: openai.com
- [6] Supported actions: developers.openai.com · developers.openai.com
- [7] Languages: developers.openai.com · github.com
- [8] Input types: developers.openai.com · developers.openai.com
- [9] Output types: developers.openai.com
- [10] Webhooks: developers.openai.com
- [11] Sandbox: developers.openai.com
- [12] SDK languages: developers.openai.com
- [13] MCP server: developers.openai.com
- [14] SOC 2: trust.openai.com
- [15] HIPAA: help.openai.com
- [16] GDPR: trust.openai.com
- [17] ISO 27001: trust.openai.com
- [18] PCI DSS: trust.openai.com
- [19] Published SLA: openai.com
- [20] Rate limits: developers.openai.com · developers.openai.com
- [21] Known restrictions: developers.openai.com · developers.openai.com
Change history
- 2026-06-21 Summary Md: (none) → OpenAI Speech-to-Text is a REST API offering batch, streaming, and real-time au…
- 2026-06-21 Summary Md: OpenAI Speech-to-Text offers transcription and translation via two model famili… → (none)
- 2026-06-21 Score Trust Readiness: 90 → 100
- 2026-06-21 Supported Languages: Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bosnian, Breton, … → 99+ languages including Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belar…
- 2026-06-21 Input Types: audio/mp3, audio/mp4, audio/mpeg, audio/mpga, audio/m4a, audio/wav, audio/webm,… → audio/mp3, audio/mp4, audio/mpeg, audio/mpga, audio/m4a, audio/wav, audio/webm,…
- 2026-06-21 Output Types: JSON, plain text, SRT, VTT, verbose JSON, diarized JSON, word timestamps, segme… → JSON, plain text, SRT, VTT, verbose JSON, diarized JSON, word timestamps, segme…
- 2026-06-21 PCI DSS: No → Yes
- 2026-06-21 SDK Packages: Python, Node.js, Java, Go, Ruby, .NET → Python, Node.js, Java, Go, Ruby, .NET
- 2026-06-21 Name: OpenAI Speech-to-Text (gpt-4o-transcribe / Whisper API) → OpenAI Speech-to-Text
- 2026-06-21 Supported Actions: transcribe_batch, transcribe_streaming, transcribe_realtime, translation, speak… → transcribe_batch, transcribe_streaming, transcribe_realtime, translation_to_eng…
- 2026-06-21 Known Restrictions: Maximum file upload size: 25 MB, Translation endpoint outputs English only (whi… → Maximum file upload size: 25 MB, Translation endpoint outputs English only (whi…
- 2026-06-21 Documented Rate Limits: whisper-1: Free tier 3 RPM / 200 RPD; Tier 1: 500 RPM; Tier 2: 2,500 RPM; Tier … → whisper-1: Free 3 RPM / 200 RPD; Tier 1: 500 RPM; Tier 2: 2,500 RPM; Tier 3: 5,…
- 2026-06-21 Fields Not Found: supported_regions (no explicit data residency regions listed for the STT API sp… → supported_regions (no explicit data residency regions listed for the STT API), …
- 2026-06-21 Starting Price Usd: 0.003 → 0.003
- 2026-06-21 Capabilities: {} → {"translation":true,"real_time_streaming":true,"speaker_diarization":true}
- 2026-06-21 Summary Md: (none) → OpenAI Speech-to-Text offers transcription and translation via two model famili…
- 2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
- 2026-06-21 Score Agent Friendliness: (none) → 50
- 2026-06-21 Score Pricing Transparency: (none) → 85
- 2026-06-21 Score Setup Speed: (none) → 60
- 2026-06-21 Score Docs Quality: (none) → 50
- 2026-06-21 Score Procurement Friction: (none) → 85
- 2026-06-21 Score Trust Readiness: (none) → 90
- 2026-06-21 Best For: (none) → Regulated or enterprise workloads - compliance attestations and an enterprise p…
- 2026-06-21 Avoid If: (none) → You want to try it free before paying
- 2026-06-21 Llms Txt Present: (none) → No
- 2026-06-21 Docs URL: (none) → https://developers.openai.com/api/docs
- 2026-06-21 Markdown Docs URL: (none) → https://platform.openai.com/docs/guides/speech-to-text.md
- 2026-06-21 Markdown Docs Served: (none) → Yes
- 2026-06-21 API Reference URL: (none) → https://platform.openai.com/api/reference/overview
- 2026-06-21 Robots Allows Agents: (none) → Yes
- 2026-06-21 Has Structured Data: (none) → No
- 2026-06-21 Rendering: (none) → static
- 2026-06-21 Known Restrictions: set to Maximum file upload size: 25 MB, Translation endpoint outputs English only (whi…
- 2026-06-21 Auth Methods: set to api_key
- 2026-06-21 Auth Docs URL: set to https://developers.openai.com/api/docs/quickstart
- 2026-06-21 API Style: set to rest
- 2026-06-21 Base URL: set to https://api.openai.com/v1
- 2026-06-21 API Version: set to v1
- 2026-06-21 Versioning Scheme: set to url
- 2026-06-21 Stability: set to ga
- 2026-06-21 Deprecation Policy URL: set to https://developers.openai.com/api/docs/deprecations
- 2026-06-21 MCP URL: set to https://developers.openai.com/mcp
- 2026-06-21 Quickstart URL: set to https://developers.openai.com/api/docs/guides/speech-to-text
- 2026-06-21 Error Format: set to vendor-specific
- 2026-06-21 Webhook Signing: set to hmac_sha256
- 2026-06-21 Webhook Events URL: set to https://developers.openai.com/api/docs/guides/webhooks
- 2026-06-21 Requires Verification: set to No
- 2026-06-21 Starting Price Usd: set to 0.003
- 2026-06-21 Price Basis: set to minute
Suggest an edit / leave a review
Leave a review or comment
curl -X POST https://apio.sh/api/feedback/openai-transcribe \
-H 'Content-Type: application/json' \
-d '{"kind":"review","rating":5,"body":"Your experience with this API…"}'
Suggest a correction to a field (cite a source)
curl -X POST https://apio.sh/api/suggest/openai-transcribe/FIELD \
-H 'Content-Type: application/json' \
-d '{"value":"corrected value","citations":[{"url":"https://source.example/page","excerpt":"supporting quote"}],"note":"what changed and why"}'