ASR - Automatic Speech Recognition / Speech to Text

Quick start:

    git clone https://github.com/dwani-ai/asr-indic-server.git
    cd asr-indic-server

    python -m venv --system-site-packages venv
    source venv/bin/activate

    pip install -r requirements.txt

    python src/multilingual/asr_api.py --host 0.0.0.0 --port 7863 --device cuda

ASR Indic Server

Overview

Automatic Speech Recognition (ASR) for Indian languages using IndicConformer models. The default model is set to Kannada ASR.

Demo Video

Watch a quick demo of the project in action! Click the link below to view the video on YouTube.

Watch the video

Supported Languages

22 Indian languages are supported, thanks to the AI4Bharat organisation.

Language      Code
Assamese      as
Bengali       bn
Bodo          brx
Dogri         doi
Gujarati      gu
Hindi         hi
Kannada       kn
Kashmiri      ks
Konkani       kok
Maithili      mai
Malayalam     ml
Manipuri      mni
Marathi       mr
Nepali        ne
Odia          or
Punjabi       pa
Sanskrit      sa
Santali       sat
Sindhi        sd
Tamil         ta
Telugu        te
Urdu          ur

Live Server

We have hosted an Automatic Speech Recognition (ASR) service that can be used to verify the accuracy of audio transcriptions.

Notes

  • Ensure that the audio file path (samples/kannada_sample_2.wav) is correct and accessible.
  • The language parameter in the URL specifies the language of the audio file. In the examples below, it is set to kannada.
  • The service expects the audio file to be in WAV format.

Getting Started - Development

For Development (Local)

  • Prerequisites: Python 3.10 (compatibility verified)
  • Steps:
  • Create a virtual environment:
    python -m venv venv
    
  • Activate the virtual environment:
    source venv/bin/activate
    
    On Windows, use:
    venv\Scripts\activate
    
  • Install dependencies:
    pip install -r requirements.txt
    
    • For individual language models, also install:

      pip install -r nemo-requirements.txt
      

Downloading ASR Models

Models can be downloaded from AI4Bharat's HuggingFace repository:

For the multilingual model

huggingface-cli download ai4bharat/indic-conformer-600m-multilingual

For individual language models

  • Kannada
    huggingface-cli download ai4bharat/indicconformer_stt_kn_hybrid_rnnt_large
    
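If you prefer to script the download, the same repositories can also be fetched with the huggingface_hub Python package. This is a minimal sketch; the huggingface-cli commands above remain the reference:

# Optional: programmatic download of the multilingual model into the local
# Hugging Face cache. Equivalent to the huggingface-cli command above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="ai4bharat/indic-conformer-600m-multilingual")
print("Model downloaded to:", local_path)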

Other Languages

Sample Code

For all languages

from transformers import AutoModel
import torchaudio
import torch

# Load the model
model = AutoModel.from_pretrained("ai4bharat/indic-conformer-600m-multilingual", trust_remote_code=True)

# Load an audio file
wav, sr = torchaudio.load("kannada_sample_1.wav")
wav = torch.mean(wav, dim=0, keepdim=True)

target_sample_rate = 16000  # Expected sample rate
if sr != target_sample_rate:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sample_rate)
    wav = resampler(wav)

# Perform ASR with CTC decoding
transcription_ctc = model(wav, "kn", "ctc")
print("CTC Transcription:", transcription_ctc)

# Perform ASR with RNNT decoding
transcription_rnnt = model(wav, "kn", "rnnt")
print("RNNT Transcription:", transcription_rnnt)
  • Run the code (save the snippet above as asr-multi-lingual.py):
    python asr-multi-lingual.py
    

Individual Languages

import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/indicconformer_stt_kn_hybrid_rnnt_large")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.freeze() # inference mode
model = model.to(device)

model.cur_decoder = "rnnt"
rnnt_text = model.transcribe(['samples/kannada_sample_1.wav'], batch_size=1, language_id='kn')[0]


print(rnnt_text)
  • Run the code (save the snippet above as asr_code.py):
    python asr_code.py
    

Alternative examples for Development

For Local Development

  • Gradio
    python src/ux/app_local.py
    

For Server Development

Running with FastAPI Server

Run the server using FastAPI with the desired language (e.g., Kannada):

  • For GPU:
    python src/multi-lingual/asr_api.py --port 7860 --language kn --host 0.0.0.0 --device gpu

  • For CPU only:
    python src/multi-lingual/asr_api.py --port 7860 --language kn --host 0.0.0.0 --device cpu

Evaluating Results for the FastAPI Server

You can evaluate the ASR transcription results using curl commands.

Kannada Transcription Examples

Sample 1: kannada_sample_1.wav

  • Audio File: samples/kannada_sample_1.wav
  • Command:
    curl -X 'POST' \
    'http://localhost:7860/transcribe/?language=kannada' \
    -H 'accept: application/json'   -H 'Content-Type: multipart/form-data' \
    -F 'file=@samples/kannada_sample_1.wav;type=audio/x-wav'
    
  • Expected Output: ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಯಾವುದು Translation: "What is the capital of Karnataka"

Sample 2: kannada_sample_2.wav

  • Audio File: samples/kannada_sample_2.wav
  • Command:
    curl -X 'POST' \
    'http://localhost:7860/transcribe/?language=kannada' \
    -H 'accept: application/json'   -H 'Content-Type: multipart/form-data' \
    -F 'file=@samples/kannada_sample_2.wav;type=audio/x-wav'
    
  • Expected Output: ಬೆಂಗಳೂರು ಕರ್ನಾಟಕ ರಾಜ್ಯದ ರಾಜಧಾನಿ ಆಗಿದೆ ಕರ್ನಾಟಕದಲ್ಲಿ ನಾವು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇವೆ Translation: "Bengaluru is the capital of Karnataka state. We speak Kannada in Karnataka"

Sample 3 - Song - 4 minutes

Sample 4 - Song - 6.4 minutes

Note: The ASR does not provide sentence breaks or punctuation (e.g., question marks). We plan to integrate an LLM parser for improved context in future updates.
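
The same request can also be issued from Python using the requests library. This is a minimal sketch that mirrors the curl calls above (endpoint, query parameter, and file field are taken from those examples):

# Minimal Python equivalent of the curl transcription examples above.
# Assumes the FastAPI server is running locally on port 7860.
import requests

url = "http://localhost:7860/transcribe/"
with open("samples/kannada_sample_2.wav", "rb") as audio:
    response = requests.post(
        url,
        params={"language": "kannada"},
        files={"file": ("kannada_sample_2.wav", audio, "audio/x-wav")},
    )

response.raise_for_status()
print(response.json())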

Batch Transcription Examples

Transcribe Batch Endpoint

The /transcribe_batch endpoint allows you to transcribe multiple audio files in a single request. This is useful for batch processing of audio files.

  • Command:
    curl -X 'POST' \
    'http://localhost:7860/transcribe_batch/' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -F 'files=@samples/kannada_sample_1.wav;type=audio/x-wav' \
    -F 'files=@samples/kannada_sample_2.wav;type=audio/x-wav'
    
  • Expected Output:
    {
      "transcriptions": [
        "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಯಾವುದು",
        "ಬೆಂಗಳೂರು ಕರ್ನಾಟಕ ರಾಜ್ಯದ ರಾಜಧಾನಿ ಆಗಿದೆ ಕರ್ನಾಟಕದಲ್ಲಿ ನಾವು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇವೆ"
      ]
    }
    
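The batch endpoint can be called from Python in the same way; a minimal sketch based on the curl example above:

# Minimal Python sketch of the /transcribe_batch request shown above.
# Assumes the FastAPI server is running locally on port 7860.
import requests

paths = ["samples/kannada_sample_1.wav", "samples/kannada_sample_2.wav"]

# Repeat the 'files' field once per audio file, matching the curl -F flags.
files = [("files", (path.split("/")[-1], open(path, "rb"), "audio/x-wav")) for path in paths]
try:
    response = requests.post("http://localhost:7860/transcribe_batch/", files=files)
    response.raise_for_status()
    print(response.json()["transcriptions"])
finally:
    for _, (_, handle, _) in files:
        handle.close()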

Troubleshooting

  • Transcription errors: Verify the audio file is in WAV format, mono, and sampled at 16 kHz (a quick Python format check is sketched after this list). Adjust using:
    ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav -y
    
  • Model not found: Download the required models using the huggingface-cli download commands above.
  • Port conflicts: Ensure port 7860 is free when running the FastAPI server.
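
A quick way to check the audio format before re-encoding is to inspect the file with torchaudio, which the sample code above already depends on. A minimal sketch:

# Check that an audio file is mono and sampled at 16 kHz before transcription.
import torchaudio

info = torchaudio.info("samples/kannada_sample_1.wav")
print(f"sample_rate={info.sample_rate}, channels={info.num_channels}")

if info.sample_rate != 16000 or info.num_channels != 1:
    print("Re-encode with: ffmpeg -i input.wav -ac 1 -ar 16000 output.wav -y")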

Contributing

We welcome contributions! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project.

You can also join the Discord group to collaborate.

Additional methods for Development

Running the NeMo Model

  1. Download the Nemo model:
    wget https://objectstore.e2enetworks.net/indic-asr-public/indicConformer/ai4b_indicConformer_kn.nemo -O kannada.nemo
    
  2. Adjust the audio:
    ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav -y
    
  3. Run the program:
    python nemo_asr.py
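
The repository's nemo_asr.py is not reproduced here; the sketch below shows one way the downloaded checkpoint could be loaded and used, following the individual-language example above. Treat the file names and arguments as assumptions, not the exact script:

# Hypothetical sketch of nemo_asr.py: load the locally downloaded .nemo checkpoint
# and transcribe the re-encoded audio file. Not the repository's exact script.
import torch
import nemo.collections.asr as nemo_asr

# Restore the model from the file downloaded in step 1.
model = nemo_asr.models.ASRModel.restore_from(restore_path="kannada.nemo")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.freeze()  # inference mode
model = model.to(device)

model.cur_decoder = "rnnt"
text = model.transcribe(["sample_audio_infer_ready.wav"], batch_size=1, language_id="kn")[0]
print(text)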