Call transcription

Updated: Jun 18, 2026

Overview

The Calling API can transcribe the audio of business-initiated calls (BIC) and user-initiated calls (UIC) placed through the WhatsApp Business Cloud API. When you opt a call into transcription, both participants hear a short, legally required announcement before transcription begins. After the call ends, you receive a webhook with a media ID that you can use to download the finished transcript as a JSON document.

Transcription is opt-in on a per-call basis: you decide whether a call is transcribed when you initiate or accept it.

Transcription and call recording are independent features. You can enable either one on its own, both together, or neither. They are configured and priced separately, each has its own request object, and each delivers its result in its own webhook event. Enabling transcription does not produce an audio recording, and enabling recording does not produce a transcript. See Using transcription with recording for what changes when you enable both on the same call.

Prerequisites

Before you transcribe a call, make sure:

Enable transcription on a business-initiated call

Add a transcription object to your business-initiated call request body:

POST /<PHONE_NUMBER_ID>/calls  
{  
  "messaging_product": "whatsapp",  
  "to": "14085551234",  
  "recipient": "US.13491208655302741918",  
  "action": "connect",  
  "session": {  
    "sdp_type": "offer",  
    "sdp": "<<RFC 8866 SDP>>"  
  },  
  "transcription": {  
    "status": "ENABLED",  
    "purpose": "quality assurance",  
    "announcement_language": "en_US"  
  }  
}

Usernames and business-scoped user IDs: The recipient field lets you identify the WhatsApp user by their BSUID instead of, or in addition to, their phone number in to. For details, see Business-scoped user IDs.
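
The request body above can also be assembled in code. The following is a minimal Python sketch, not an official client: the function name build_bic_request and its parameters are hypothetical, and the check that at least one of to or recipient is present is an assumption drawn from the note above rather than a documented rule.

```python
def build_bic_request(sdp_offer, purpose, announcement_language, to=None, recipient=None):
    """Return the JSON body for a business-initiated call with transcription enabled."""
    # Assumption: the user must be identified by phone number, BSUID, or both.
    if not to and not recipient:
        raise ValueError("provide 'to' (phone number), 'recipient' (BSUID), or both")
    body = {
        "messaging_product": "whatsapp",
        "action": "connect",
        "session": {"sdp_type": "offer", "sdp": sdp_offer},
        "transcription": {
            "status": "ENABLED",
            "purpose": purpose,
            "announcement_language": announcement_language,
        },
    }
    # Only include the identifiers that were actually supplied.
    if to:
        body["to"] = to
    if recipient:
        body["recipient"] = recipient
    return body
```

The returned dictionary is meant to be serialized as JSON and sent as the body of POST /<PHONE_NUMBER_ID>/calls.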

Enable transcription on a user-initiated call

Add the same transcription object when you accept an incoming call:

POST /<PHONE_NUMBER_ID>/calls  
{  
  "messaging_product": "whatsapp",  
  "call_id": "wacid.ABGGFjFVU2AfAgo6V",  
  "action": "accept",  
  "session": {  
    "sdp_type": "answer",  
    "sdp": "<<RFC 8866 SDP>>"  
  },  
  "transcription": {  
    "status": "ENABLED",  
    "purpose": "quality assurance",  
    "announcement_language": "en_US"  
  }  
}

To accept an incoming call without transcribing it, either omit the transcription field entirely or send it with "status": "DISABLED".
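
As a sketch of the two opt-out forms described above, the hypothetical helper below builds an accept body in Python. Passing no transcription object omits the field; passing {"status": "DISABLED"} sends an explicit opt-out. The helper name and signature are illustrative assumptions.

```python
def build_accept_request(call_id, sdp_answer, transcription=None):
    """Return the JSON body for accepting a user-initiated call."""
    body = {
        "messaging_product": "whatsapp",
        "call_id": call_id,
        "action": "accept",
        "session": {"sdp_type": "answer", "sdp": sdp_answer},
    }
    # Omitting the field entirely means the call is not transcribed.
    if transcription is not None:
        body["transcription"] = transcription
    return body


# Accept without transcription by omitting the field.
plain = build_accept_request("wacid.ABGGFjFVU2AfAgo6V", "<<RFC 8866 SDP>>")

# Accept with an explicit opt-out.
opted_out = build_accept_request(
    "wacid.ABGGFjFVU2AfAgo6V", "<<RFC 8866 SDP>>", {"status": "DISABLED"}
)
```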

Before any audio is transcribed, the Calling API mixes a spoken announcement into both audio streams: yours and the WhatsApp user's. The announcement is generated from the purpose string you provide and the announcement_language you select, for example:

"The audio of this call will be transcribed for the following purpose: <your purpose string>."

Transcription starts only after the announcement has finished playing. A participant who does not consent can decline by terminating the call before or during the announcement.

The purpose field is mandatory whenever status is ENABLED. Calls submitted with transcription enabled but without a purpose are rejected with a request error.

transcription object reference

Field | Type | Required | Description
status | String | Yes | ENABLED to transcribe the call, DISABLED to explicitly opt out.
purpose | String | Yes, when status is ENABLED | The purpose of the transcription, spoken to both participants as part of the announcement. Maximum 250 characters. Provide the text in the language you specified in announcement_language.
announcement_language | String | Yes, when status is ENABLED | Locale code for the language of the spoken announcement, for example en_US or es. See Supported announcement languages.

Supported announcement languages

The following announcement_language values have a localized announcement. The Calling API speaks the matching phrase, followed by your purpose string, to both participants before transcription begins.

Language | announcement_language | Transcription announcement
English | en (also en_US, en_AU, en_CA, en_GB, en_IN, en_NZ) | The audio of this call will be transcribed for the following purpose:
French | fr | L'audio de cet appel sera transcrit aux fins suivantes :
German | de | Dieser Anruf wird zu folgenden Zwecken transkribiert:
Hindi | hi | इस कॉल के ऑडियो को इस उद्देश्य के लिए ट्रांसक्राइब किया जाएगा:
Italian | it | L'audio di questa chiamata verrà trascritto per il seguente scopo:
Kannada | kn | ಈ ಕರೆಯ ಆಡಿಯೋವನ್ನು ಈ ಕೆಳಗಿನ ಉದ್ದೇಶಕ್ಕಾಗಿ ಲಿಪ್ಯಂತರಿಸಲಾಗುತ್ತದೆ:
Portuguese (Brazil) | pt | O áudio desta ligação será transcrito para a seguinte finalidade:
Spanish | es | El audio de esta llamada se transcribirá con el siguiente propósito:
Telugu | te | ఈ కాల్ ఆడియో క్రింది ప్రయోజనం కోసం ట్రాన్‌స్క్రైబ్ చేయడం జరుగుతుంది:
Vietnamese | vi | Âm thanh của cuộc gọi này sẽ được chép lời cho mục đích sau:

The announcement_language field also accepts nl and es_ES. These values are valid, but until a localized transcription announcement is available they play the English announcement.
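
To make the fallback behavior concrete, the following Python sketch maps an announcement_language value to the language the participants would actually hear. The code sets are copied from the table and paragraph above and will go stale if the supported list changes; the function name is hypothetical.

```python
# Codes that have their own localized announcement.
LOCALIZED = {"en", "fr", "de", "hi", "it", "kn", "pt", "es", "te", "vi"}
# Regional English codes that share the English announcement.
ENGLISH_VARIANTS = {"en_US", "en_AU", "en_CA", "en_GB", "en_IN", "en_NZ"}
# Accepted codes that currently play the English announcement.
ENGLISH_FALLBACK = {"nl", "es_ES"}


def spoken_announcement_language(code):
    """Return the language code of the announcement that is played for `code`."""
    if code in LOCALIZED:
        return code
    if code in ENGLISH_VARIANTS or code in ENGLISH_FALLBACK:
        return "en"
    raise ValueError(f"unsupported announcement_language: {code!r}")
```

For example, es resolves to the Spanish announcement, while es_ES and nl currently resolve to English. Because the purpose string is spoken as provided, it may be worth checking that its language matches the announcement that will actually play.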

Using transcription with recording

Transcription and recording are fully independent. The transcription and recording objects are separate request fields, so you choose each one independently on a per-call basis:

  • Send only transcription to receive a transcript and no audio recording.
  • Send only recording to receive an audio recording and no transcript.
  • Send both objects to receive both a transcript and an audio recording.
  • Omit both (or set both to DISABLED) to receive neither.

When you enable both on the same call, participants hear a single combined announcement instead of two:

"The audio of this call will be recorded and transcribed for the following purpose: <your purpose string>."

When both objects are present, the announcement_language and purpose from the recording object are used for this combined announcement, and the corresponding values in the transcription object are ignored. You still receive a separate webhook for each enabled feature: a call_recording_available event for the recording and a call_transcription_available event for the transcript.
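
The precedence rule can be sketched as follows. This is an illustration only: it assumes the recording object carries the same status, purpose, and announcement_language fields as the transcription object, and it treats a DISABLED object as if it were absent, which the text above does not spell out.

```python
def _enabled(obj):
    # Assumption: a feature is active only when its object has status ENABLED.
    return obj is not None and obj.get("status") == "ENABLED"


def announcement_plan(transcription=None, recording=None):
    """Return which purpose/language drive the announcement and which webhooks to expect."""
    want_transcript = _enabled(transcription)
    want_recording = _enabled(recording)
    if not (want_transcript or want_recording):
        return None  # nothing enabled, so no announcement is played
    # When both are enabled, the recording object's values win.
    source = recording if want_recording else transcription
    events = []
    if want_recording:
        events.append("call_recording_available")
    if want_transcript:
        events.append("call_transcription_available")
    return {
        "combined": want_recording and want_transcript,
        "purpose": source["purpose"],
        "announcement_language": source["announcement_language"],
        "expected_events": events,
    }
```

A practical consequence is that, when both features are enabled, setting the same purpose and announcement_language on both objects avoids any surprise about which text is spoken.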

Transcription-available webhook

After the call ends and post-processing finishes (typically under one minute), the Calling API sends a call_transcription_available event under the existing calls webhook field:

{  
  "object": "whatsapp_business_account",  
  "entry": [  
    {  
      "id": "<WABA_ID>",  
      "changes": [  
        {  
          "field": "calls",  
          "value": {  
            "messaging_product": "whatsapp",  
            "metadata": {  
              "phone_number_id": "<BUSINESS_PHONE_NUMBER_ID>",  
              "display_phone_number": "<BUSINESS_DISPLAY_PHONE_NUMBER>"  
            },  
            "calls": [  
              {  
                "id": "wacid.HBgLMTQxMjYxMzYyNTMVAgASGCBGO...",  
                "from": "<USER_PHONE_NUMBER>",  
                "from_user_id": "<BSUID>",  
                "from_parent_user_id": "<PARENT_BSUID>",  
                "timestamp": "1728932177",  
                "event": "call_transcription_available",  
                "call_transcript": {  
                  "document": {  
                    "id": "1002764438271669",  
                    "sha256": "Y9vvGyeo3n76ptkXu3CwDBsnzbRFqpjHskQdMGSVqas=",  
                    "mime_type": "application/json",  
                    "url": "https://lookaside.fbsbx.com/whatsapp_business/attachments/?mid=133..."  
                  }  
                }  
              }  
            ]  
          }  
        }  
      ]  
    }  
  ]  
}
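
A webhook handler needs to pick the transcript reference out of this nested payload. The Python sketch below walks the structure shown above and collects one record per call_transcription_available event; the function name and the shape of the returned records are illustrative choices, not part of the API.

```python
def transcript_documents(payload):
    """Collect transcript document references from a calls webhook payload."""
    found = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            if change.get("field") != "calls":
                continue
            for call in change.get("value", {}).get("calls", []):
                if call.get("event") != "call_transcription_available":
                    continue
                document = call.get("call_transcript", {}).get("document")
                if not document:
                    continue
                found.append({
                    "call_id": call.get("id"),
                    # The phone number can be absent, so read it defensively.
                    "from": call.get("from"),
                    "from_user_id": call.get("from_user_id"),
                    "media_id": document.get("id"),
                    "sha256": document.get("sha256"),
                    "url": document.get("url"),
                })
    return found
```

Other call events delivered under the same calls field are skipped, so the function can sit behind a shared webhook endpoint.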

call_transcript fields

Field | Type | Description
document.id | String | Media asset ID. Use the Media API to retrieve the media URL for download.
document.sha256 | String | Base64-encoded SHA-256 hash of the transcript document. Use it to verify the downloaded file's integrity.
document.mime_type | String | MIME type of the transcript document. Currently always application/json.
document.url | String | A short-lived download URL. Issue an authenticated GET request with your access token to download the asset.
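
The integrity check against document.sha256 might look like the following Python sketch. It assumes the hash is computed over the raw bytes of the downloaded file and encoded with the standard Base64 alphabet, which matches the sample value above but is not stated explicitly.

```python
import base64
import hashlib
import hmac


def transcript_matches_hash(data, expected_b64):
    """Return True if the SHA-256 of `data` (bytes) equals the Base64 hash from the webhook."""
    actual_b64 = base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")
    # compare_digest performs a constant-time string comparison.
    return hmac.compare_digest(actual_b64, expected_b64)
```

Run the check on the bytes exactly as received, before parsing or re-serializing the JSON, because any change in whitespace or key order alters the hash.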

Usernames and business-scoped user IDs: The from_user_id and from_parent_user_id fields identify the WhatsApp user by their BSUID; the from phone number may be omitted if the user has adopted a username. For details, see Business-scoped user IDs.

Transcript language

You do not specify a transcription language in the request. The Calling API automatically detects the spoken language of the call, transcribes it, and reports the detected language in the transcript.language field of the transcript document (see Transcript document format). The detected language is an ISO 639 language code such as en. It is determined from the audio and is independent of the announcement_language you set for the spoken announcement.

The set of languages that can be automatically detected and transcribed changes over time as the underlying speech models improve. The languages currently supported include:

Afrikaans, Albanian, Arabic, Azerbaijani, Bengali, Bulgarian, Burmese, Cebuano, Chinese, Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Guarani, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Javanese, Kannada, Korean, Macedonian, Malay, Malayalam, Marathi, Norwegian, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian, Sinhala, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog (Filipino), Tamil, Telugu, Thai, Turkish, Urdu, and Vietnamese.

If a call is spoken in a language that isn't currently supported, you still receive the call_transcription_available webhook, but the returned transcript may be empty.

Transcript document format

The downloaded transcript is a JSON document with two top-level objects: metadata (information about the processed audio) and transcript (the transcribed content, including a flat text rendering, the detected language, an overall confidence, and time-stamped segments with word-level detail).

Each segment is attributed to the speaker who produced it and the channel it was spoken on: channel 0 is your business and channel 1 is the WhatsApp user. Because each party has its own channel, speaker attribution stays accurate even when participants talk over each other. The full conversation is also available as a single string in transcript.text, with each segment prefixed by its speaker label, for example [Business] or [Customer].

{  
  "metadata": {  
    "processed_at": "2026-06-18T20:16:47Z",  
    "audio": {  
      "duration": 21.76,  
      "sample_rate": 16000,  
      "channels": 2,  
      "audio_format": "stereo"  
    }  
  },  
  "transcript": {  
    "text": "[Business] Hello, how about you? [Customer] Hey, I'm good. How are you?",  
    "language": "en",  
    "duration": 21.76,  
    "confidence": 0.83,  
    "segments": [  
      {  
        "id": 1,  
        "speaker": "Business",  
        "channel": 0,  
        "start": 1.16,  
        "end": 2.44,  
        "text": "Hello, how about you?",  
        "confidence": 0.85,  
        "words": [  
          {  
            "word": "Hello,",  
            "start": 1.16,  
            "end": 1.64,  
            "confidence": 0.89,  
            "lang": "en"  
          },  
          {  
            "word": "how",  
            "start": 1.64,  
            "end": 1.8,  
            "confidence": 0.99,  
            "lang": "en"  
          },  
          {  
            "word": "about",  
            "start": 1.8,  
            "end": 2.04,  
            "confidence": 0.52,  
            "lang": "en"  
          },  
          {  
            "word": "you?",  
            "start": 2.04,  
            "end": 2.44,  
            "confidence": 0.99,  
            "lang": "en"  
          }  
        ]  
      },  
      {  
        "id": 2,  
        "speaker": "Customer",  
        "channel": 1,  
        "start": 3.66,  
        "end": 5.74,  
        "text": "Hey, I'm good. How are you?",  
        "confidence": 0.85,  
        "words": [  
          {  
            "word": "Hey,",  
            "start": 3.66,  
            "end": 4.46,  
            "confidence": 0.60,  
            "lang": "en"  
          },  
          {  
            "word": "I'm",  
            "start": 4.46,  
            "end": 4.7,  
            "confidence": 0.78,  
            "lang": "en"  
          },  
          {  
            "word": "good.",  
            "start": 4.7,  
            "end": 5.02,  
            "confidence": 0.71,  
            "lang": "en"  
          },  
          {  
            "word": "How",  
            "start": 5.02,  
            "end": 5.18,  
            "confidence": 0.99,  
            "lang": "en"  
          },  
          {  
            "word": "are",  
            "start": 5.18,  
            "end": 5.34,  
            "confidence": 0.99,  
            "lang": "en"  
          },  
          {  
            "word": "you?",  
            "start": 5.34,  
            "end": 5.74,  
            "confidence": 0.99,  
            "lang": "en"  
          }  
        ]  
      }  
    ]  
  }  
}

metadata fields

Field | Type | Description
processed_at | String | ISO 8601 UTC timestamp of when transcription post-processing completed.
audio.duration | Number | Duration of the processed call audio, in seconds.
audio.sample_rate | Integer | Sample rate of the processed audio, in Hz.
audio.channels | Integer | Number of audio channels. A two-party call has two channels.
audio.audio_format | String | Format of the processed audio mix, for example stereo.

transcript fields

Field | Type | Description
text | String | The full conversation as a single string. Each segment is prefixed with its speaker label in brackets, for example [Business] or [Customer].
language | String | The detected language of the call as an ISO 639 code, for example en. See Transcript language.
duration | Number | Total duration of the transcribed audio, in seconds.
confidence | Number | Overall confidence score for the transcript, from 0 to 1.
segments | Array | The ordered list of spoken segments. See segments fields.
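
The relationship between text and segments can be illustrated by rebuilding the flat rendering from the segment list. The Python sketch below assumes segments are joined with single spaces in order of start time, as in the sample document; the exact joining rule used by the API is not documented here.

```python
def render_conversation(document):
    """Rebuild a '[Speaker] text' rendering from the transcript segments."""
    segments = document.get("transcript", {}).get("segments", [])
    # Sort defensively by start time in case segments arrive out of order.
    ordered = sorted(segments, key=lambda segment: segment["start"])
    return " ".join(f"[{s['speaker']}] {s['text']}" for s in ordered)
```

An empty string indicates that no segments were produced, which is one way to detect the empty transcript that can be returned for an unsupported language.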

segments fields

Each segment represents a continuous span of speech from one speaker.

Field | Type | Description
id | Integer | Sequential identifier for the segment within the transcript.
speaker | String | The speaker who produced the segment, either Business or Customer.
channel | Integer | The audio channel the segment was spoken on. Channel 0 is the business; channel 1 is the WhatsApp user.
start | Number | The start time of the segment, in seconds from the beginning of the call audio.
end | Number | The end time of the segment, in seconds from the beginning of the call audio.
text | String | The full transcribed text of the segment.
confidence | Number | A confidence score from 0 to 1 for the segment transcription.
words | Array | Word-level breakdown of the segment. Each entry contains word (String), start (Number), end (Number), confidence (Number), and lang (String, the ISO 639 code of the detected language for that word), where start and end are in seconds from the beginning of the call audio.
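
Word-level confidence and segment timestamps lend themselves to simple post-processing. The Python sketch below shows two hypothetical helpers: one that flags words below a confidence threshold for human review, and one that totals speaking time per speaker. The default threshold of 0.6 is an arbitrary illustration, not a recommended value.

```python
def low_confidence_words(document, threshold=0.6):
    """Return (speaker, word, start, confidence) for words scored below `threshold`."""
    flagged = []
    for segment in document["transcript"]["segments"]:
        for word in segment.get("words", []):
            if word["confidence"] < threshold:
                flagged.append(
                    (segment["speaker"], word["word"], word["start"], word["confidence"])
                )
    return flagged


def talk_time_by_speaker(document):
    """Sum segment durations, in seconds, for each speaker label."""
    totals = {}
    for segment in document["transcript"]["segments"]:
        duration = segment["end"] - segment["start"]
        totals[segment["speaker"]] = totals.get(segment["speaker"], 0.0) + duration
    return totals
```

Applied to the sample document above, the first helper flags only the word "about" (confidence 0.52), because the comparison is strict and "Hey," sits exactly at 0.60.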

Download the transcript

Transcripts use the same download flow as media messages:

  • The url returned in the webhook is valid for 5 minutes. Issue an authenticated GET request with your access token to download the file directly.
  • If the URL has expired, use the Media API to retrieve a fresh media URL with the document.id.
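
The two-step flow above can be sketched in Python as follows. Several details are assumptions rather than facts from this page: the access token is sent as an Authorization: Bearer header, an expired URL is detected by a non-200 status, and the Media API lookup (GET on the media ID under a base URL you supply) returns a JSON object with a url field. The http_get parameter is a hypothetical callable taking (url, headers) and returning (status, body_bytes), so any HTTP library can be plugged in.

```python
import json


def download_transcript(document, access_token, http_get, media_api_base):
    """Download the transcript JSON, refreshing the URL through the Media API if needed."""
    headers = {"Authorization": f"Bearer {access_token}"}

    # Step 1: try the short-lived URL delivered in the webhook.
    url = document.get("url")
    if url:
        status, body = http_get(url, headers)
        if status == 200:
            return json.loads(body)

    # Step 2: the URL is missing or expired, so look up a fresh one by media ID.
    status, body = http_get(f"{media_api_base}/{document['id']}", headers)
    if status != 200:
        raise RuntimeError(f"media lookup failed with status {status}")
    fresh_url = json.loads(body)["url"]  # assumption: response carries a 'url' field

    status, body = http_get(fresh_url, headers)
    if status != 200:
        raise RuntimeError(f"transcript download failed with status {status}")
    return json.loads(body)
```

Injecting http_get keeps the retry logic independent of the transport and makes the fallback path easy to exercise in tests with a stub.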

Retention

Transcripts remain available for download for 7 days after the call_transcription_available webhook is delivered. After that period, the media ID expires and the underlying file is deleted. Download and persist the transcript to your own storage within the retention window if you need to keep it long-term.
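
A small Python sketch of tracking the retention window is shown below. It assumes you record the time at which your endpoint received the webhook and treat that as the delivery time; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)


def download_deadline(webhook_delivered_at):
    """Return the moment after which the transcript can no longer be downloaded."""
    return webhook_delivered_at + RETENTION


def is_downloadable(webhook_delivered_at, now=None):
    """Return True while the transcript is still inside the retention window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now < download_deadline(webhook_delivered_at)
```

Storing the deadline alongside the media ID lets a background job prioritize transcripts that are close to expiring.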

Errors

The following request errors are specific to call transcription. See Cloud API error codes for the full list.

Scenario | Description
Missing purpose | transcription.status is ENABLED but purpose is omitted or empty.
purpose too long | purpose exceeds 250 characters.
Invalid announcement_language | announcement_language is not a supported locale code.
Invalid status | status is not one of ENABLED or DISABLED.
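
These scenarios can be caught before a request is sent. The following Python sketch mirrors the table above as a client-side pre-check. It is an approximation under stated assumptions: the supported locale set is copied from this page, a missing announcement_language is reported as invalid, and purpose length is counted in Unicode code points, which may differ from how the API counts characters.

```python
SUPPORTED_LANGUAGES = {
    "en", "en_US", "en_AU", "en_CA", "en_GB", "en_IN", "en_NZ",
    "fr", "de", "hi", "it", "kn", "pt", "es", "te", "vi", "nl", "es_ES",
}
MAX_PURPOSE_LENGTH = 250


def validate_transcription(obj):
    """Return a list of scenario names from the errors table that `obj` would trigger."""
    status = obj.get("status")
    if status not in ("ENABLED", "DISABLED"):
        return ["Invalid status"]
    if status == "DISABLED":
        return []  # purpose and language are only required when enabled

    problems = []
    purpose = obj.get("purpose")
    if not purpose:
        problems.append("Missing purpose")
    elif len(purpose) > MAX_PURPOSE_LENGTH:
        problems.append("purpose too long")
    if obj.get("announcement_language") not in SUPPORTED_LANGUAGES:
        problems.append("Invalid announcement_language")
    return problems
```

An empty list means the object passes these local checks; the API remains the final authority and may reject requests for reasons not covered here.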
