AWS Audio File Transcription

About.
The client mandates language identification for every call, specifying whether the conversation occurred in Hindi, English, Telugu, or any other regional language. As part of standard compliance and quality assurance practices, each call must be transcribed in full and translated into English, irrespective of the original language. The transcription should follow a conversational dialogue format, clearly attributing each utterance to either the agent or the customer (e.g., Agent: "Hi, Good Afternoon. I am calling from the IIFL MCollect team." Customer: "Hi, tell me what this call is about."). This format aligns with industry best practices for enhancing auditability, training, and customer experience analysis. Accuracy in translation and transcription is critical to maintain the context and intent of the conversation. Additionally, adherence to data privacy and compliance regulations must be ensured throughout the process. Once completed, the transcript must be programmatically submitted via the ReverseFeed API in real time or immediately after call termination, as part of the client’s operational workflow.

AWS Audio File Transcription with Caller and Receiver Text Using Java SDK

✨ Overview

This document provides a comprehensive guide for transcribing stereo audio files using Amazon Transcribe with the AWS SDK for Java. It explains how to identify caller vs. receiver speech using channel identification and calculate transcription accuracy.

Story:
https://phonon.atlassian.net/browse/CEN-365

Under this story we carried out the research and development work and achieved 90%+ transcription accuracy.

The configuration used to achieve this is described below, and the class file is attached for reference.

Sample transcription output:

ch_1:
Thank you for calling HDFC Bank manual outbound calling service , Please enter the loan number or AVN number you wish to dial followed by हैश की । Please enter the phone number you wish
ch_0:
नंबर हाँ । लगा है । क्या नाम है मोहम्मद ।
ch_1:
हर दिन एक हजार ऐसी भी अधिक लोग साइबर अपराधियों के हाथों करोड़ो रूपया गवा रहे हैं । देखिए किसी को पैसे दोगुने करने की जल्दी होती है तो कोई जो है अनजाने नंबरों ऐसी आये हुए विडियो कॉल को जड़ से उठा लेते हैं । कोई कॉल इस ओन बेहल्फ ओफ एच । डी । एफ । सी । बैंक काइंडली नोट थाट थिस कॉल इस गेटिंग रिकॉर्डेड ।
ch_0:
नहीं । नंबर बताया । Hello हेलो हाँ सिर मोहम्मद बात कर रहे हैं ।
ch_1:
हेलो । जी मैं पेमेंट पेमेंट के लिए पेमेंट के लिए मैं कर रहा हूँ । दस मिनट में आपसे कॉल कर रहा हूँ मैं ।
ch_0:
एक मिनट एक मिनट । तो कैसे पे लिंक कर रहे है आप ।
ch_1:
जी । दस मिनट में मैं कॉल कर रहा हूँ direct call madam
ch_0:
हाँ तो सुनिए ना सुनिए दस मिनट बाद जो पेमेंट कर रहे हैं न वही अभी ओन कॉल हो जाता है न तो तुरंत अपडेट हो जायेगा तो अभी करवा दीजिए तुरंत अपडेट हो जायेगा । कैसे कर रहे हैं मुझे बताइए ।
ch_1:
मुझे करने नहीं आ रहा । मैं क्या करूँ दूसरे में डाल दूँ बोलिए ।
ch_0:
हाँ मैं प्रोसीजर मैं गाइड करती हूँ आपको प्रोसीजर बताती हूँ मैं अमाउंट एक नहीं तो लिंक भेजती हूँ लिंक कर सकते हैं ।
ch_1:
अच्छा व्हाट्सएप भेज । क्या कह रहा है ।
ch_0:
लाइन पे रहिए कॉल डिस्कनेक्ट लॉन्ग डिस्टेंस एक मिनट कॉल डिस्कनेक्ट मत कीजिये । कॉल चालू रखिए ।
ch_1:
सब । हेलो
ch_0:
हाँ बोलिए ।
ch_1:
हाँ ।
ch_0:
एक नहीं नहीं वापस मत कॉल कीजिए सिर अभी अभी लिंक आएगा ना आप को लिंक भी तुरंत अपडेट हो जाएगा सिर स्कैनर से स्कैन करोगे तो तुरंत अपडेट हो जाए मैं गाइड भी करती हूँ आप को । ठीक है । मॉर्निंग में अपन एक सेकंड लिंक होगी आप चेक कीजिये ।
ch_1:
हम्म । कितना तारीख को पेमेंट किया ।
ch_0:
एक स्पीकर पे रख दिया फोन
ch_1:
क्या क्या ?
ch_0:
स्पीकर पे रखिए मोबाइल पर आपका अभी लिंक आयी होगी जस्ट चेक कीजिये टेक्स्ट मैसेज में ।
ch_1:
अरे यार क्यूँ बोलो ।
ch_0:
तो ।
ch_1:
इसको बंद कर दिया तुम स्पीकर
ch_0:
ओपन ओपन किया क्या आपने
ch_1:
हाँ हेलो ।
ch_0:
बोलिए ।
ch_1:
Hello ?
ch_0:
हाँ बोलिए सिर ।
ch_1:
हाँ बोलिए ।
ch_0:
link open कीजिए वो ।
ch_1:
व्हाट्सएप पे लिंक भेजे हैं ।
ch_0:
WhatsApp नहीं s text message में लिंक भेजा है देख लीजिए ।
ch_1:
मैसेज में
ch_0:
आपका सादा मेसेज रहता है ना तो उधर ।
ch_1:
जी ।
ch_0:
हाँ आई । चेक कीजिये ।
ch_1:
बोलिए । क्या बोला ।
ch_0:
आया तो चेक कीजिये एक बार लिंक आया है आपको ।
ch_1:
हाँ आया है किए है बोलिए ना ओपन किए है ।
ch_0:
हाँ ओपन कीजिये ।
ch_1:
ओपन कर दिए ।
ch_0:
आगे की हाँ किया न आगे पे कीजिये । किया है । Hello ?
ch_1:
हेलो हाँ हेलो ।
ch_0:
आप पे कीजिये तो पे नाइन अमन । बताओ
ch_1:
पे आउट आ ही नहीं रहा है । इसमें हम कैसे करे ।
ch_0:
फर्स्ट टाइम पे कर रहे थे लिंक सिक्स सिक्स फॉर सेवेन ट्रिपल जीरो त्वो न एक काम कीजिये आप फोन पे नहीं तो गूगल पे उसे करते क्या ?
ch_1:
फोन पे
ch_0:
और फोन पे ना फोन पे ओपन कीजिए अपना ।
ch_1:
हाँ फोन में ओपन है बोलिए ।
ch_0:
ठीक है हम मोबाइल रिचार्ज करते हैं ना उधर आप क्रेडिट कार्ड बिल पेमेंट का ऑप्शन मिलेगा उस पे जाइए । credit card मिलते है ।
ch_1:
एक मिनट । मोबाइल रिचार्ज तरफ
ch_0:
हाँ उधर ही आपको क्रेडिट कार्ड बिल का ऑप्शन मिलेगा उससे । मिला सिर क्रेडिट कार्ड का ऑप्शन ।
ch_1:
जी देख रहा हूँ देख रहा हूँ ।
ch_0:
चेक कीजिये । नीचे स्क्रॉल कीजिये मेरे को ।
ch_1:
हाँ बोलिए ।
ch_0:
मिला क्रेडिट कार्ड पेमेंट ।
ch_1:
हां , हां ।
ch_0:
ठीक है । उधर आपका कार्ड नंबर डालिए । आईआईटी थ्री फाइव आईआईटी लास्ट कार्ड का वो डालिए । नहीं तो उधर आया होगा card number । चला क्या ?
ch_1:
कार्ड नंबर नहीं आया card number
ch_0:
क्या आ रहा है मुझे बताइए । Hello ?
ch_1:
एक मिनट credit card
ch_0:
हाँ ।
ch_1:
one two three four है ।
ch_0:
एक मिनट हाँ एक मिनट आपका कार्ड नंबर है आईआईटी थ्री फाइव आईआईटी सिर वो डालिए उधर ।
ch_1:
eight three
ch_0:
फाइव आईआईटी लास्ट डालिए हो जायेगा आईआईटी थ्री फाइव आईआईटी आजकल । डाला क्या Hello ?
ch_1:
Eight three five eight ।
ch_0:
हाँ । अभी नहीं कीजिये आगे कन्फर्म करके ।
ch_1:
क्या बोला ।
ch_0:
अभी नहीं कीजिए आगे अभी आपको अमाउंट दिख रहा है सर ।
ch_1:
नहीं दिख रहा है ।
ch_0:
नेक्स्ट कीजिये । कन्फर्म कीजिये उसके बाद दिखेगा ।
ch_1:
क्रेडिट कार्ड में जो है न ।
ch_0:
हाँ ।
ch_1:
ये एच । डी । एफ । सी कहाँ आ रहा है इसका नाम कुछ
ch_0:
हाँ तो नेक्स्ट चीज़ है ना उसपे ठीक कीजिए ।
ch_1:
हाँ
ch_0:
किया था ।
ch_1:
एक मिनट एक मिनट । हाँ एक मिनट एक मिनट ।
ch_0:
हाँ । उनके ऊपर नहीं है ।
ch_1:
काट के मैं कॉल कर रहा हूँ एक मिनट ।

✅ Use Case

In typical telephony systems (like Asterisk), recorded calls are saved in stereo:

The caller’s voice is on the left channel (ch_0)
The receiver’s voice is on the right channel (ch_1)

This guide helps you:

Upload stereo audio to Amazon S3
Start a transcription job with channel separation
Retrieve and parse the transcription output
Calculate accuracy percentage per speaker

📁 1. Project Setup

🧱 What You Need

AWS account with S3 and Transcribe access
Java 8+ environment
AWS SDK for Java v2
Stereo .wav or .mp3 audio file
AWS credentials configured locally

⚙️ Add Dependencies (Gradle)

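dependencies {
    implementation platform('software.amazon.awssdk:bom:2.25.7')
    implementation 'software.amazon.awssdk:s3'
    implementation 'software.amazon.awssdk:transcribe'
    // Needed by the JSON parsing code in section 7 (ObjectMapper, JsonNode)
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.17.0'
}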

☁️ 2. Upload Stereo Audio to Amazon S3

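Amazon Transcribe reads media from an Amazon S3 object. The bucket must be in the same AWS Region as the Transcribe endpoint you call, and the identity that starts the job needs read permission on the object.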

Java Upload Code

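import java.nio.file.Path;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Uploads a local audio file to the given S3 bucket and key.
public void uploadToS3(String bucketName, String key, Path filePath) {
    // Region and credentials are resolved from the default provider chain.
    try (S3Client s3 = S3Client.create()) {
        PutObjectRequest putRequest = PutObjectRequest.builder()
            .bucket(bucketName)
            .key(key)
            .build();

        s3.putObject(putRequest, filePath);
    }
}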

Example Usage

// Paths.get is used instead of Path.of so the call also compiles on Java 8.
uploadToS3("my-audio-bucket", "calls/call001.wav", Paths.get("/recordings/call001.wav"));

📝 3. Start the Transcription Job

With channel identification enabled, Amazon Transcribe transcribes each audio channel separately and labels the results ch_0 and ch_1, so each stereo channel maps to one speaker.

Java Transcribe Code

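import software.amazon.awssdk.services.transcribe.TranscribeClient;
import software.amazon.awssdk.services.transcribe.model.LanguageCode;
import software.amazon.awssdk.services.transcribe.model.Media;
import software.amazon.awssdk.services.transcribe.model.MediaFormat;
import software.amazon.awssdk.services.transcribe.model.Settings;
import software.amazon.awssdk.services.transcribe.model.StartTranscriptionJobRequest;

public void startTranscription(String jobName, String mediaUri, String outputBucket) {
    try (TranscribeClient transcribe = TranscribeClient.create()) {
        Media media = Media.builder()
            .mediaFileUri(mediaUri)
            .build();

        // Channel identification is a job setting, not a top-level request field.
        Settings settings = Settings.builder()
            .channelIdentification(true)
            .build();

        StartTranscriptionJobRequest request = StartTranscriptionJobRequest.builder()
            .transcriptionJobName(jobName) // must be unique within the account
            // Fixed language. For Hindi or Telugu calls use the matching code
            // (e.g. LanguageCode.HI_IN), or replace this line with
            // .identifyLanguage(true) for automatic language identification.
            .languageCode(LanguageCode.EN_US)
            .mediaFormat(MediaFormat.WAV)
            .media(media)
            .outputBucketName(outputBucket)
            .settings(settings)
            .build();

        transcribe.startTranscriptionJob(request);
    }
}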

Example

String mediaUri = "s3://my-audio-bucket/calls/call001.wav";
startTranscription("Call001Job", mediaUri, "my-transcriptions-bucket");
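
The media URI and the job name can both be derived from the bucket and key used in the upload step. The following is a minimal sketch, assuming a hypothetical helper class TranscribeNames (it is not part of the AWS SDK). The job-name rule it applies (letters, digits, '.', '_' and '-', at most 200 characters) is how the constraint is understood here and should be checked against the current API reference.

```java
// Hypothetical helper, not part of the AWS SDK.
final class TranscribeNames {

    private TranscribeNames() {}

    // Builds the s3:// form of the media URI from bucket and key.
    static String toS3Uri(String bucket, String key) {
        return "s3://" + bucket + "/" + key;
    }

    // Derives a job name from an object key: drops the file extension,
    // replaces characters outside [0-9a-zA-Z._-] with '-', caps length at 200.
    static String jobNameFor(String key) {
        int dot = key.lastIndexOf('.');
        int slash = key.lastIndexOf('/');
        String base = (dot > slash) ? key.substring(0, dot) : key;
        String cleaned = base.replaceAll("[^0-9a-zA-Z._-]", "-");
        return cleaned.length() > 200 ? cleaned.substring(0, 200) : cleaned;
    }

    public static void main(String[] args) {
        System.out.println(toS3Uri("my-audio-bucket", "calls/call001.wav")); // s3://my-audio-bucket/calls/call001.wav
        System.out.println(jobNameFor("calls/call001.wav"));                 // calls-call001
    }
}
```

Job names have to be unique, so if the same recording may be transcribed more than once, append a call ID or timestamp to the derived name.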

⏳ 4. Wait for Transcription Job Completion

Transcribe jobs are asynchronous. Poll the job status until it's complete.

Java Polling Code

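import software.amazon.awssdk.services.transcribe.TranscribeClient;
import software.amazon.awssdk.services.transcribe.model.GetTranscriptionJobRequest;
import software.amazon.awssdk.services.transcribe.model.TranscriptionJob;
import software.amazon.awssdk.services.transcribe.model.TranscriptionJobStatus;

// Polls every 5 seconds until the job reaches COMPLETED or FAILED.
public TranscriptionJob waitForJob(String jobName) throws InterruptedException {
    try (TranscribeClient transcribe = TranscribeClient.create()) {
        GetTranscriptionJobRequest request =
            GetTranscriptionJobRequest.builder().transcriptionJobName(jobName).build();

        while (true) {
            TranscriptionJob job = transcribe.getTranscriptionJob(request).transcriptionJob();
            TranscriptionJobStatus status = job.transcriptionJobStatus();

            if (status == TranscriptionJobStatus.COMPLETED) {
                return job;
            } else if (status == TranscriptionJobStatus.FAILED) {
                throw new RuntimeException("Transcription failed: " + job.failureReason());
            }

            Thread.sleep(5000); // still QUEUED or IN_PROGRESS
        }
    }
}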
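
A polling loop without an upper bound can wait forever if a job never leaves the queue. The sketch below shows one way to add a timeout and a growing delay between polls. JobPoller and its parameters are hypothetical names introduced for illustration; the status source is a plain Supplier so the logic can be exercised without AWS.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical helper, not part of the AWS SDK.
final class JobPoller {

    private JobPoller() {}

    // Polls statusSource until it reports COMPLETED or FAILED, doubling the
    // delay after each poll (capped at maxDelay). Throws if timeout elapses.
    static String pollUntilDone(Supplier<String> statusSource,
                                Duration initialDelay,
                                Duration maxDelay,
                                Duration timeout) {
        long start = System.nanoTime();
        long delayMs = Math.max(1, initialDelay.toMillis());

        while (true) {
            String status = statusSource.get();
            if ("COMPLETED".equals(status) || "FAILED".equals(status)) {
                return status;
            }
            if (System.nanoTime() - start >= timeout.toNanos()) {
                throw new IllegalStateException("Timed out; last status: " + status);
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while polling", e);
            }
            delayMs = Math.min(delayMs * 2, Math.max(1, maxDelay.toMillis()));
        }
    }

    public static void main(String[] args) {
        // Simulated status sequence standing in for real GetTranscriptionJob calls.
        Iterator<String> statuses = Arrays.asList("QUEUED", "IN_PROGRESS", "COMPLETED").iterator();
        String result = pollUntilDone(statuses::next,
            Duration.ofMillis(10), Duration.ofMillis(40), Duration.ofSeconds(5));
        System.out.println(result); // COMPLETED
    }
}
```

With the SDK, the supplier could wrap the getTranscriptionJob call and return transcriptionJobStatusAsString(); the delay and timeout values are a matter of tuning for typical call lengths.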

📜 5. Parse the Transcription Output

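Once the job completes, job.transcript().transcriptFileUri() gives the location of the result JSON. Because the job above writes to your own bucket (outputBucketName), the file is stored there as <jobName>.json and the URI is not pre-signed. The plain HTTP download below therefore works only if the object can be read without credentials; otherwise read it from the output bucket with S3Client.getObject.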

Java Download Code

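import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Reads the transcript JSON over HTTPS. This works only when the URI is
// readable without AWS credentials (for example, a pre-signed URL).
public String downloadTranscript(String transcriptUri) throws Exception {
    URL url = new URL(transcriptUri);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {

        StringBuilder result = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            result.append(line);
        }
        return result.toString();
    }
}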

📊 6. Extract Caller and Receiver Text

Each item in the result carries a channel_label. Group the text by that label, treating ch_0 as the caller and ch_1 as the receiver.

Example JSON Item

{
  "channel_label": "ch_0",
  "alternatives": [ { "content": "Hello", "confidence": "0.98" } ]
}
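
To produce the Agent/Customer style dialogue required by the client, the per-word items can be merged into speaker turns. The following is a minimal sketch under stated assumptions: the words have already been extracted from the JSON (for example with Jackson, as in section 7) into the hypothetical Word class below, each with its channel label and start time, and punctuation items (which carry no start time) have been skipped or attached to the preceding word. DialogueBuilder and the role names are illustrative, not part of any AWS API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merges channel-labelled words into dialogue lines.
final class DialogueBuilder {

    // One transcribed word, already extracted from the result JSON.
    static final class Word {
        final String channel;   // "ch_0" or "ch_1"
        final double startTime; // seconds from the start of the recording
        final String content;

        Word(String channel, double startTime, String content) {
            this.channel = channel;
            this.startTime = startTime;
            this.content = content;
        }
    }

    private DialogueBuilder() {}

    // Orders words by start time and starts a new line whenever the channel changes.
    static List<String> toDialogue(List<Word> words, Map<String, String> roles) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Double.compare(a.startTime, b.startTime));

        List<String> lines = new ArrayList<>();
        String currentChannel = null;
        StringBuilder text = new StringBuilder();

        for (Word w : sorted) {
            if (!w.channel.equals(currentChannel)) {
                flush(lines, roles, currentChannel, text);
                currentChannel = w.channel;
            }
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(w.content);
        }
        flush(lines, roles, currentChannel, text);
        return lines;
    }

    // Emits the buffered turn as: Role: "text", then clears the buffer.
    private static void flush(List<String> lines, Map<String, String> roles,
                              String channel, StringBuilder text) {
        if (channel == null || text.length() == 0) {
            return;
        }
        String role = roles.getOrDefault(channel, channel);
        lines.add(role + ": \"" + text + "\"");
        text.setLength(0);
    }

    public static void main(String[] args) {
        Map<String, String> roles = new HashMap<>();
        roles.put("ch_0", "Caller");
        roles.put("ch_1", "Receiver");

        List<Word> words = Arrays.asList(
            new Word("ch_1", 0.5, "Hello"),
            new Word("ch_0", 1.2, "Yes"),
            new Word("ch_0", 1.5, "please"),
            new Word("ch_1", 2.0, "Okay"));

        for (String line : toDialogue(words, roles)) {
            System.out.println(line);
        }
        // Receiver: "Hello"
        // Caller: "Yes please"
        // Receiver: "Okay"
    }
}
```

Which channel is the agent and which is the customer depends on how the telephony system records the call, so the role map is passed in rather than hard-coded.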

🎯 7. Calculate Transcription Accuracy Percentage

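Amazon Transcribe reports a confidence score between 0 and 1 for each word. Averaging these scores per channel gives a rough proxy for accuracy. It reflects the model's own certainty, not a comparison against a verified reference transcript.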

Java Accuracy Calculation Code

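import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Returns the average word confidence per channel, as a percentage.
public Map<String, Double> calculateChannelAccuracy(String json) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode root = mapper.readTree(json);
    JsonNode items = root.path("results").path("items");

    Map<String, List<Double>> channelConfidences = new HashMap<>();

    for (JsonNode item : items) {
        // Punctuation items carry no meaningful confidence; count words only.
        if (!item.path("type").asText().equals("pronunciation")) continue;

        // path(0) yields a missing node instead of null when there are no alternatives.
        JsonNode best = item.path("alternatives").path(0);
        if (best.isMissingNode()) continue;

        String channel = item.path("channel_label").asText();
        // Confidence is a JSON string (e.g. "0.98"); asDouble() parses it.
        double confidence = best.path("confidence").asDouble();

        channelConfidences
            .computeIfAbsent(channel, k -> new ArrayList<>())
            .add(confidence);
    }

    Map<String, Double> averageAccuracy = new HashMap<>();
    for (Map.Entry<String, List<Double>> entry : channelConfidences.entrySet()) {
        List<Double> scores = entry.getValue();
        double average = scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        averageAccuracy.put(entry.getKey(), average * 100); // percentage
    }

    return averageAccuracy;
}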

Sample Output

Accuracy Results:
ch_0 (Caller): 96.45%
ch_1 (Receiver): 93.87%
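
Where a human-verified transcript of a call is available, accuracy can be measured directly instead of inferred from confidence. A common measure is word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the reference into the machine transcript, divided by the number of reference words. The sketch below is one way to compute it; WordErrorRate and its tokenizer are hypothetical names, and the whitespace/lowercase tokenization is a simplifying assumption that may need adjusting for Hindi or mixed-language text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of word error rate via word-level edit distance.
final class WordErrorRate {

    private WordErrorRate() {}

    // WER = (substitutions + deletions + insertions) / reference word count.
    static double compute(List<String> reference, List<String> hypothesis) {
        if (reference.isEmpty()) {
            throw new IllegalArgumentException("reference must not be empty");
        }
        int n = reference.size();
        int m = hypothesis.size();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j; // empty reference prefix: j insertions
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i; // empty hypothesis prefix: i deletions
            for (int j = 1; j <= m; j++) {
                int cost = reference.get(i - 1).equals(hypothesis.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return (double) prev[m] / n;
    }

    // Simplifying assumption: lowercase and split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> reference = tokenize("please enter the loan number");
        List<String> hypothesis = tokenize("please enter loan numbers");
        // One deletion ("the") and one substitution ("number" -> "numbers"): 2 / 5.
        System.out.println(compute(reference, hypothesis)); // 0.4
    }
}
```

Accuracy can then be reported as 1 minus WER, per channel if the reference transcript is also split by speaker.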

✅ Summary Table

Step	Description
Upload	Upload stereo audio file to S3
Transcribe	Start transcription with channel separation
Wait	Poll for job completion
Parse	Download and read JSON result
Separate Channels	Extract caller and receiver speech
Accuracy	Compute average confidence (accuracy)