OpenAI Speech-to-Text

"The Audio API provides two speech-to-text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model." [1]

Speech-to-Text & Transcription APIs

platform.openai.com/docs/guides/speech-to-text · By OpenAI · Last verified 2026-06-21 · Source confidence: high

OpenAI Speech-to-Text is a REST API offering batch, streaming, and real-time audio transcription, speaker diarization, language detection, and translation to English, built on Whisper and newer gpt-4o-based models. Pricing is usage-based and starts at an estimated $0.003 per minute (gpt-4o-mini-transcribe; whisper-1 is $0.006 per minute). Signup is self-serve with no sales call required, and an enterprise plan is available. The API ships official SDKs for Python, Node.js, Java, Go, Ruby, and .NET, and this profile lists SOC 2 Type II, ISO 27001, and PCI DSS attestations along with HIPAA and GDPR support.

Best for / Avoid if

Best for: Regulated or enterprise workloads (compliance attestations and an enterprise plan); AI agents and automation (an agent-ready surface via the MCP server; no llms.txt was found); teams needing broad API coverage out of the box

Avoid if: You want to try it free before paying

Pricing & procurement

Pricing model: Usage-based [2]
Published pricing: Yes [3]
Free tier: No [4]
Self-serve signup: Yes
Requires sales call: No
Enterprise plan: Yes [5]

Published prices
Item	Per	Amount (USD)
whisper-1 transcription	minute	$0.006
gpt-4o-transcribe audio input	1M tokens	$2.50
gpt-4o-transcribe text output	1M tokens	$10.00
gpt-4o-transcribe estimated cost	minute	$0.006
gpt-4o-mini-transcribe audio input (snapshots: gpt-4o-mini-transcribe-2025-12-15, gpt-4o-mini-transcribe-2025-03-20)	1M tokens	$1.25
gpt-4o-mini-transcribe text output	1M tokens	$5.00
gpt-4o-mini-transcribe estimated cost	minute	$0.003
gpt-4o-transcribe-diarize audio input (speaker diarization)	1M tokens	$2.50
gpt-4o-transcribe-diarize text output (speaker diarization)	1M tokens	$10.00
gpt-4o-transcribe-diarize estimated cost (speaker diarization)	minute	$0.006
gpt-realtime-whisper streaming transcription (audio duration)	minute	$0.017
gpt-realtime-translate streaming speech translation (audio duration)	minute	$0.034
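
As a rough illustration of how the per-minute figures above translate into a bill, the following sketch multiplies audio duration by the listed per-minute price. The dictionary and the `estimate_cost` helper are hypothetical names introduced here; the gpt-4o-* entries use the table's own per-minute estimates, so actual token-based billing may differ.

```python
from decimal import Decimal

# Per-minute prices in USD, copied from the published price table.
# The gpt-4o-* values are the table's estimates, not exact token billing.
PRICE_PER_MINUTE = {
    "whisper-1": Decimal("0.006"),
    "gpt-4o-transcribe": Decimal("0.006"),
    "gpt-4o-mini-transcribe": Decimal("0.003"),
    "gpt-4o-transcribe-diarize": Decimal("0.006"),
    "gpt-realtime-whisper": Decimal("0.017"),
    "gpt-realtime-translate": Decimal("0.034"),
}


def estimate_cost(model: str, audio_minutes: float) -> Decimal:
    """Return the estimated USD cost of processing `audio_minutes` of audio."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    # Convert via str so that float inputs such as 2.5 stay exact in Decimal.
    return PRICE_PER_MINUTE[model] * Decimal(str(audio_minutes))
```

Under these assumptions, one hour of audio comes to about $0.18 on gpt-4o-mini-transcribe and about $0.36 on whisper-1.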

Capabilities

Real-time streaming
Speaker diarization
Speech translation

Supported actions: transcribe_batch, transcribe_streaming, transcribe_realtime, translation_to_english, speaker_diarization, word_timestamps, segment_timestamps, language_detection, prompting_for_accuracy, logprobs_confidence_scoring, voice_activity_detection [6]
  Source excerpt (developers.openai.com/api/docs/guides/speech-to-text): "Supported models: gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize, whisper-1. Speaker Diarization: Identifies and labels different speakers in audio. Word-Level Timestamps: Precise timing for individual words. Translation: Transcribe and translate audio into English."
  Source excerpt (developers.openai.com/api/docs/guides/realtime-transcription): "gpt-realtime-whisper is an alternative for live transcription... supports streaming transcript deltas as audio arrives, tunable latency via the delay parameter, voice activity detection (VAD) support."
Languages: 99+ languages including Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bosnian, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, German, Greek, Gujarati, Haitian Creole, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba [7]
  Source excerpt (developers.openai.com/api/docs/guides/speech-to-text): "Over 99 languages including major languages like English, Spanish, French, German, Mandarin, Japanese, and many others. The platform prioritizes languages with less than 50% word error rate."
  Source excerpt (github.com/openai/whisper): "Whisper supports transcription and translation for 99 languages. It was trained on 680,000 hours of supervised multilingual audio data."
Input types: audio/mp3, audio/mp4, audio/mpeg, audio/mpga, audio/m4a, audio/wav, audio/webm, WebSocket (realtime streaming), WebRTC (realtime browser) [8]
Output types: JSON, plain text, SRT, VTT, verbose JSON, diarized JSON, word timestamps, segment timestamps, streaming transcript deltas [9]
Webhooks: No [10]
Sandbox / test mode: No [11]
SDK languages: Python, Node.js, Java, Go, Ruby, .NET [12]
MCP server: Yes [13]

Trust & compliance

SOC 2: SOC 2 Type II [14]
HIPAA: Yes [15]
GDPR: Yes [16]
ISO 27001: Yes [17]
PCI DSS: Yes [18]
Published SLA: Yes [19]
Rate limits (RPM = requests per minute, RPD = requests per day, TPM = tokens per minute): [20]
  whisper-1: Free: 3 RPM / 200 RPD; Tier 1: 500 RPM; Tier 2: 2,500 RPM; Tier 3: 5,000 RPM; Tier 4: 7,500 RPM; Tier 5: 10,000 RPM
  gpt-4o-transcribe / gpt-4o-transcribe-diarize: Tier 1: 500 RPM / 10K TPM; Tier 2: 2,000 RPM / 100K TPM; Tier 3: 5,000 RPM / 400K TPM; Tier 4: 10,000 RPM / 2M TPM; Tier 5: 10,000 RPM / 6M TPM
  gpt-4o-mini-transcribe: Tier 1: 500 RPM / 50K TPM; Tier 2: 2,000 RPM / 150K TPM; Tier 3: 5,000 RPM / 600K TPM; Tier 4: 10,000 RPM / 2M TPM; Tier 5: 10,000 RPM / 8M TPM
  gpt-realtime-whisper (audio minutes per minute): Tier 1: 100 min/min; Tier 2: 350 min/min; Tier 3: 650 min/min; Tier 4: 1,000 min/min; Tier 5: 1,300 min/min
Known restrictions: [21]
  Maximum file upload size: 25 MB
  Translation endpoint outputs English only (whisper-1 only; not available on gpt-4o-transcribe models)
  Speaker diarization (gpt-4o-transcribe-diarize) requires chunking_strategy for audio longer than 30 seconds
  gpt-4o-transcribe-diarize does not support prompts, logprobs, or timestamp_granularities[]
  Prompt steering is not supported for gpt-realtime-whisper in realtime sessions
  Context window: 16,000 tokens; max output: 2,000 tokens (gpt-4o-transcribe models)
  gpt-4o-transcribe and gpt-4o-mini-transcribe output JSON or plain text only (not SRT/VTT)
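
Several of the known restrictions can be checked on the client before a request is sent. The sketch below is a minimal, hypothetical pre-flight check: `check_request` and its parameter names are illustrative rather than part of any SDK, and reading "25 MB" as 25 MiB is an assumption the source does not confirm.

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB

# Models the profile lists as JSON or plain text only (no SRT/VTT).
TEXT_ONLY_MODELS = {"gpt-4o-transcribe", "gpt-4o-mini-transcribe"}
DIARIZE_MODEL = "gpt-4o-transcribe-diarize"


def check_request(model, file_size_bytes, response_format="json",
                  duration_seconds=None, chunking_strategy=None,
                  prompt=None, translate=False):
    """Return a list of problems; an empty list means no listed restriction is hit."""
    problems = []
    if file_size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 25 MB upload limit")
    if translate and model != "whisper-1":
        problems.append("translation is available on whisper-1 only")
    if model in TEXT_ONLY_MODELS and response_format not in ("json", "text"):
        problems.append(f"{model} outputs JSON or plain text only")
    if model == DIARIZE_MODEL:
        if prompt is not None:
            problems.append("diarization model does not support prompts")
        too_long = duration_seconds is not None and duration_seconds > 30
        if too_long and chunking_strategy is None:
            problems.append("audio over 30 seconds needs chunking_strategy")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every violated restriction at once.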

Developer surface

Docs rendering: static · markdown variants served

Integration

API style: rest
Base URL: https://api.openai.com/v1
Version: v1
Versioning: url
Stability: ga
Auth methods: api_key
Error format: vendor-specific
Webhook signing: hmac_sha256
Rate limit: 500 requests / minute (Tier 1)
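
To show how these integration fields fit together, here is a minimal sketch that builds an endpoint URL and auth header and derives a request spacing from the 500-per-minute limit. The helper names are hypothetical. The `audio/transcriptions` path and the bearer-token header come from OpenAI's public API reference rather than from this profile, so treat them as assumptions here.

```python
BASE_URL = "https://api.openai.com/v1"  # base URL as listed in this profile


def endpoint(path: str) -> str:
    """Join a relative path such as 'audio/transcriptions' onto the base URL."""
    return f"{BASE_URL}/{path.lstrip('/')}"


def build_headers(api_key: str) -> dict:
    """API-key auth; assumes the key is sent as a bearer token."""
    return {"Authorization": f"Bearer {api_key}"}


def min_interval_seconds(requests_per_minute: int = 500) -> float:
    """Smallest even spacing between requests that stays within the limit."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute
```

At 500 requests per minute, evenly spaced calls would be about 0.12 seconds apart; higher tiers allow proportionally tighter spacing.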

SDKs

Python: openai
Node.js: openai
Java: com.openai:openai-java
Go: github.com/openai/openai-go
Ruby: openai
.NET: OpenAI
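
As a minimal sketch of what a batch transcription might look like with the Python package listed above: the `transcribe` wrapper is a hypothetical helper, and it assumes the client exposes `audio.transcriptions.create(model=..., file=...)` and returns an object with a `text` attribute, as the official `openai` package does for the default JSON response.

```python
def transcribe(client, audio_path: str, model: str = "gpt-4o-mini-transcribe") -> str:
    """Transcribe one audio file and return the transcript text.

    `client` is expected to behave like an `openai.OpenAI` instance; it is
    passed in so the function can be exercised without network access.
    """
    # The file is opened in binary mode and streamed to the API as an upload.
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text
```

With the SDK installed and an API key configured, a call could look like `transcribe(OpenAI(), "meeting.mp3")`, where `meeting.mp3` is a placeholder file name.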

Adoption & maturity

Launched: 2023-03-01
GA: 2025-04-01
Notable customers: Speak

Other Speech-to-Text & Transcription APIs

ElevenLabs Scribe (Speech to Text)
"Scribe v2 is the most accurate Speech to Text model" offering "real-time Speech to Text in under 150 ms" across "90+ languages."
Hybrid · free tier · public pricing · self-serve
Azure AI Speech to Text
"Azure Speech in Foundry Tools provides speech to text, text to speech, and other capabilities through a Microsoft Foundry resource. You can transcribe speech to text with high accuracy, produce natural-sounding text-to-speech voices, translate spoken audio, and conduct live AI voice conversations."
Usage · free tier · public pricing · self-serve
Amazon Transcribe
"Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Amazon Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application."
Usage · free tier · public pricing · self-serve
Google Cloud Speech-to-Text
"Accurate voice typing and transcription powered by Gemini."
Usage · free tier · public pricing · self-serve
IBM watsonx Speech to Text
"IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics."
Usage · free tier · public pricing · self-serve
AssemblyAI
"Voice AI infrastructure for developers building products that transcribe, understand, and act on speech."
Usage · public pricing · self-serve


References

Numbered citations in the fields above refer to the sources below.

[1] Description: developers.openai.com
[2] Pricing model: developers.openai.com · developers.openai.com
[3] Published pricing: developers.openai.com
[4] Free tier: developers.openai.com
[5] Enterprise plan: openai.com
[6] Supported actions: developers.openai.com · developers.openai.com
[7] Languages: developers.openai.com · github.com
[8] Input types: developers.openai.com · developers.openai.com
[9] Output types: developers.openai.com
[10] Webhooks: developers.openai.com
[11] Sandbox: developers.openai.com
[12] SDK languages: developers.openai.com
[13] MCP server: developers.openai.com
[14] SOC 2: trust.openai.com
[15] HIPAA: help.openai.com
[16] GDPR: trust.openai.com
[17] ISO 27001: trust.openai.com
[18] PCI DSS: trust.openai.com
[19] Published SLA: openai.com
[20] Rate limits: developers.openai.com · developers.openai.com
[21] Known restrictions: developers.openai.com · developers.openai.com

Change history

Every field change, who made it, and when - from our audited data pipeline and editors.

2026-06-21 Summary Md: (none) → OpenAI Speech-to-Text is a REST API offering batch, streaming, and real-time au…
2026-06-21 Summary Md: OpenAI Speech-to-Text offers transcription and translation via two model famili… → (none)
2026-06-21 Score Trust Readiness: 90 → 100
2026-06-21 Supported Languages: Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bosnian, Breton, … → 99+ languages including Afrikaans, Arabic, Armenian, Azerbaijani, Basque, Belar…
2026-06-21 Input Types: audio/mp3, audio/mp4, audio/mpeg, audio/mpga, audio/m4a, audio/wav, audio/webm,… → audio/mp3, audio/mp4, audio/mpeg, audio/mpga, audio/m4a, audio/wav, audio/webm,…
2026-06-21 Output Types: JSON, plain text, SRT, VTT, verbose JSON, diarized JSON, word timestamps, segme… → JSON, plain text, SRT, VTT, verbose JSON, diarized JSON, word timestamps, segme…
2026-06-21 PCI DSS: No → Yes
2026-06-21 SDK Packages: Python, Node.js, Java, Go, Ruby, .NET → Python, Node.js, Java, Go, Ruby, .NET
2026-06-21 Name: OpenAI Speech-to-Text (gpt-4o-transcribe / Whisper API) → OpenAI Speech-to-Text
2026-06-21 Supported Actions: transcribe_batch, transcribe_streaming, transcribe_realtime, translation, speak… → transcribe_batch, transcribe_streaming, transcribe_realtime, translation_to_eng…
2026-06-21 Known Restrictions: Maximum file upload size: 25 MB, Translation endpoint outputs English only (whi… → Maximum file upload size: 25 MB, Translation endpoint outputs English only (whi…
2026-06-21 Documented Rate Limits: whisper-1: Free tier 3 RPM / 200 RPD; Tier 1: 500 RPM; Tier 2: 2,500 RPM; Tier … → whisper-1: Free 3 RPM / 200 RPD; Tier 1: 500 RPM; Tier 2: 2,500 RPM; Tier 3: 5,…
2026-06-21 Fields Not Found: supported_regions (no explicit data residency regions listed for the STT API sp… → supported_regions (no explicit data residency regions listed for the STT API), …
2026-06-21 Starting Price Usd: 0.003 → 0.003
2026-06-21 Capabilities: {} → {"translation":true,"real_time_streaming":true,"speaker_diarization":true}
2026-06-21 Summary Md: (none) → OpenAI Speech-to-Text offers transcription and translation via two model famili…
2026-06-21 Scoring Methodology: (none) → Scores are computed deterministically from this profile's published, sourced fi…
2026-06-21 Score Agent Friendliness: (none) → 50
2026-06-21 Score Pricing Transparency: (none) → 85
2026-06-21 Score Setup Speed: (none) → 60
2026-06-21 Score Docs Quality: (none) → 50
2026-06-21 Score Procurement Friction: (none) → 85
2026-06-21 Score Trust Readiness: (none) → 90
2026-06-21 Best For: (none) → Regulated or enterprise workloads - compliance attestations and an enterprise p…
2026-06-21 Avoid If: (none) → You want to try it free before paying
2026-06-21 Llms Txt Present: (none) → No
2026-06-21 Docs URL: (none) → https://developers.openai.com/api/docs
2026-06-21 Markdown Docs URL: (none) → https://platform.openai.com/docs/guides/speech-to-text.md
2026-06-21 Markdown Docs Served: (none) → Yes
2026-06-21 API Reference URL: (none) → https://platform.openai.com/api/reference/overview
2026-06-21 Robots Allows Agents: (none) → Yes
2026-06-21 Has Structured Data: (none) → No
2026-06-21 Rendering: (none) → static
2026-06-21 Known Restrictions: set to Maximum file upload size: 25 MB, Translation endpoint outputs English only (whi…
2026-06-21 Auth Methods: set to api_key
2026-06-21 Auth Docs URL: set to https://developers.openai.com/api/docs/quickstart
2026-06-21 API Style: set to rest
2026-06-21 Base URL: set to https://api.openai.com/v1
2026-06-21 API Version: set to v1
2026-06-21 Versioning Scheme: set to url
2026-06-21 Stability: set to ga
2026-06-21 Deprecation Policy URL: set to https://developers.openai.com/api/docs/deprecations
2026-06-21 MCP URL: set to https://developers.openai.com/mcp
2026-06-21 Quickstart URL: set to https://developers.openai.com/api/docs/guides/speech-to-text
2026-06-21 Error Format: set to vendor-specific
2026-06-21 Webhook Signing: set to hmac_sha256
2026-06-21 Webhook Events URL: set to https://developers.openai.com/api/docs/guides/webhooks
2026-06-21 Requires Verification: set to No
2026-06-21 Starting Price Usd: set to 0.003
2026-06-21 Price Basis: set to minute

Suggest an edit / leave a review

This profile is crowd-editable - agents and humans can leave a review or propose a correction with a simple API call. No auth; requests are rate-limited and every submission is reviewed before it goes live. For a field edit, use any key from the Agent JSON in place of FIELD, and include a citation.

Leave a review or comment

curl -X POST https://apio.sh/api/feedback/openai-transcribe \
  -H 'Content-Type: application/json' \
  -d '{"kind":"review","rating":5,"body":"Your experience with this API…"}'

Suggest a correction to a field (cite a source)

curl -X POST https://apio.sh/api/suggest/openai-transcribe/FIELD \
  -H 'Content-Type: application/json' \
  -d '{"value":"corrected value","citations":[{"url":"https://source.example/page","excerpt":"supporting quote"}],"note":"what changed and why"}'
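
For callers scripting corrections from Python instead of curl, the same request could be assembled roughly as follows. This is a sketch only: `build_suggestion` is a hypothetical helper, it mirrors the JSON shape of the curl command above, and it builds the request without sending it.

```python
import json
import urllib.request


def build_suggestion(slug: str, field: str, value, citation_url: str,
                     excerpt: str, note: str = "") -> urllib.request.Request:
    """Build (but do not send) a POST request proposing a field correction."""
    body = {
        "value": value,
        # The page asks for a citation with every field edit.
        "citations": [{"url": citation_url, "excerpt": excerpt}],
        "note": note,
    }
    return urllib.request.Request(
        url=f"https://apio.sh/api/suggest/{slug}/{field}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it; per the note above, submissions are rate-limited and reviewed before going live.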

