ASR - Automatic Speech Recognition / Speech to Text
- https://github.com/dwani-ai/asr-indic-server.git
git clone https://github.com/dwani-ai/asr-indic-server.git
cd asr-indic-server
python -m venv --system-site-packages venv
source venv/bin/activate
pip install -r requirements.txt
python src/multilingual/asr_api.py --host 0.0.0.0 --port 7863 --device cuda
ASR Indic Server
Overview
Automatic Speech Recognition (ASR) for Indian languages using IndicConformer models. The default model is set to Kannada ASR.
Demo Video
Watch a quick demo of our project in action! Click the image below to view the video on YouTube.
Table of Contents
- Supported Languages
- Live Server
- Getting Started
- For Production (Docker)
- For Development (Local)
- Downloading ASR Models
- Kannada
- Other Languages
- Running with FastAPI Server
- Live Server
- Service Modes
- How to Use the Service
- Evaluating Results
- Kannada Transcription Examples
- Batch Transcription Examples
- Building Docker Image
- Run the Docker Image
- Troubleshooting
- References
- Additional Resources
- Running Nemo Model
- Running with Transformers
Supported Languages
22 Indian languages are supported, thanks to the AI4Bharat organisation:
| Language | Code |
|---|---|
| Assamese | as |
| Bengali | bn |
| Bodo | brx |
| Dogri | doi |
| Gujarati | gu |
| Hindi | hi |
| Kannada | kn |
| Kashmiri | ks |
| Konkani | kok |
| Maithili | mai |
| Malayalam | ml |
| Manipuri | mni |
| Marathi | mr |
| Nepali | ne |
| Odia | or |
| Punjabi | pa |
| Sanskrit | sa |
| Santali | sat |
| Sindhi | sd |
| Tamil | ta |
| Telugu | te |
| Urdu | ur |
Live Server
We have hosted an Automatic Speech Recognition (ASR) service that can be used to verify the accuracy of audio transcriptions.
Notes
- Ensure that the audio file path (`samples/kannada_sample_2.wav`) is correct and accessible.
- The `language` parameter in the URL specifies the language of the audio file. In the transcription examples in this README, it is set to `kannada`.
- The service expects the audio file to be in WAV format.
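The same request can also be made from Python. The sketch below uses the `requests` library against a locally running server on port 7860 (substitute the hosted service's URL if you are using the live server); the endpoint path and form field mirror the curl examples later in this README.

```python
# Minimal client sketch using requests; the /transcribe/ path, the language
# query parameter, and the "file" form field follow the curl examples below.
import requests

url = "http://localhost:7860/transcribe/"
with open("samples/kannada_sample_2.wav", "rb") as f:
    response = requests.post(
        url,
        params={"language": "kannada"},  # language of the audio
        files={"file": ("kannada_sample_2.wav", f, "audio/x-wav")},
    )
response.raise_for_status()
print(response.json())
```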
Getting Started - Development
For Development (Local)
- Prerequisites: Python 3.10 (compatibility verified)
- Steps:
  - Create a virtual environment:
    python -m venv venv
  - Activate the virtual environment:
    source venv/bin/activate
    On Windows, use:
    venv\Scripts\activate
  - Install dependencies:
    pip install -r requirements.txt
    For individual language models, also install:
    pip install -r nemo-requirements.txt
Downloading ASR Models
Models can be downloaded from AI4Bharat's HuggingFace repository:
For the multilingual model:
huggingface-cli download ai4bharat/indic-conformer-600m-multilingual
For individual language models:
- Kannada
huggingface-cli download ai4bharat/indicconformer_stt_kn_hybrid_rnnt_large
Other Languages
- Malayalam
  huggingface-cli download ai4bharat/indicconformer_stt_ml_hybrid_rnnt_large
- Hindi
  huggingface-cli download ai4bharat/indicconformer_stt_hi_hybrid_rnnt_large
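If you prefer downloading from Python instead of the CLI, the `huggingface_hub` library can fetch the same repositories. A minimal sketch, using the repo ids listed above:

```python
# Download the same checkpoints from Python via huggingface_hub
# (equivalent to the huggingface-cli commands above).
from huggingface_hub import snapshot_download

# Multilingual model
snapshot_download(repo_id="ai4bharat/indic-conformer-600m-multilingual")

# An individual language model, e.g. Kannada
snapshot_download(repo_id="ai4bharat/indicconformer_stt_kn_hybrid_rnnt_large")
```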
Sample Code
For all languages
from transformers import AutoModel
import torchaudio
import torch
# Load the model
model = AutoModel.from_pretrained("ai4bharat/indic-conformer-600m-multilingual", trust_remote_code=True)
# Load an audio file
wav, sr = torchaudio.load("kannada_sample_1.wav")
wav = torch.mean(wav, dim=0, keepdim=True)
target_sample_rate = 16000 # Expected sample rate
if sr != target_sample_rate:
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sample_rate)
wav = resampler(wav)
# Perform ASR with CTC decoding
transcription_ctc = model(wav, "kn", "ctc")
print("CTC Transcription:", transcription_ctc)
# Perform ASR with RNNT decoding
transcription_rnnt = model(wav, "kn", "rnnt")
print("RNNT Transcription:", transcription_rnnt)
- Run the Code
python asr-multi-lingual.py
Individual Languages
import torch
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/indicconformer_stt_kn_hybrid_rnnt_large")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.freeze() # inference mode
model = model.to(device)
model.cur_decoder = "rnnt"
rnnt_text = model.transcribe(['samples/kannada_sample_1.wav'], batch_size=1, language_id='kn')[0]
print(rnnt_text)
- Run the Code
python asr_code.py
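The hybrid checkpoints also expose a CTC decoder. Continuing from the snippet above (model already loaded, frozen, and moved to the device), the small sketch below switches decoders; the `cur_decoder`/`transcribe` usage follows the RNNT example and is assumed to work the same way for CTC.

```python
# Switch the hybrid model to its CTC head and transcribe the same file.
model.cur_decoder = "ctc"
ctc_text = model.transcribe(
    ['samples/kannada_sample_1.wav'], batch_size=1, language_id='kn'
)[0]
print("CTC:", ctc_text)
```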
Alternative examples for Development
For Local Development
- Gradio
python src/ux/app_local.py
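For orientation, a Gradio UI around the multilingual model can be as small as the sketch below. This is an illustration only, reusing the model call from the sample code above; the actual interface lives in src/ux/app_local.py and may differ.

```python
# Illustrative Gradio wrapper around the multilingual model -- not the
# project's src/ux/app_local.py. The language is fixed to Kannada here.
import gradio as gr
import torch
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "ai4bharat/indic-conformer-600m-multilingual", trust_remote_code=True
)

def transcribe(audio_path: str) -> str:
    # Load, downmix to mono, and resample to the 16 kHz the model expects
    wav, sr = torchaudio.load(audio_path)
    wav = torch.mean(wav, dim=0, keepdim=True)
    if sr != 16000:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
    return model(wav, "kn", "rnnt")

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath", label="Audio (WAV)"),
    outputs="text",
).launch()
```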
For Server Development
Running with FastAPI Server
Run the server using FastAPI with the desired language (e.g., Kannada):
- For GPU:
  python src/multi-lingual/asr_api.py --port 7860 --language kn --host 0.0.0.0 --device gpu
- For CPU:
  python src/multi-lingual/asr_api.py --port 7860 --language kn --host 0.0.0.0 --device cpu
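To show the overall shape of such a server, here is a minimal sketch of a transcription endpoint. It is not the project's asr_api.py; the model loading, the language-name mapping, and the response shape are assumptions pieced together from the sample code and curl examples in this README.

```python
# Minimal FastAPI sketch of a /transcribe/ endpoint -- an illustration, not
# the repository's implementation. The model call follows the sample code above.
import io

import torch
import torchaudio
from fastapi import FastAPI, File, Query, UploadFile
from transformers import AutoModel

app = FastAPI()
model = AutoModel.from_pretrained(
    "ai4bharat/indic-conformer-600m-multilingual", trust_remote_code=True
)

@app.post("/transcribe/")
async def transcribe(file: UploadFile = File(...), language: str = Query("kannada")):
    # Read the uploaded WAV into a waveform tensor, downmix to mono, resample to 16 kHz
    wav, sr = torchaudio.load(io.BytesIO(await file.read()))
    wav = torch.mean(wav, dim=0, keepdim=True)
    if sr != 16000:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
    # Map the language name used in the URL to the model's language code (assumed mapping)
    lang_code = {"kannada": "kn", "hindi": "hi", "malayalam": "ml"}.get(language, "kn")
    return {"transcription": model(wav, lang_code, "rnnt")}
```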
Evaluating Results for FastAPI Server
You can evaluate the ASR transcription results using curl commands.
Kannada Transcription Examples
Sample 1: kannada_sample_1.wav
- Audio File: samples/kannada_sample_1.wav
- Command:
curl -X 'POST' \
  'http://localhost:7860/transcribe/?language=kannada' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@samples/kannada_sample_1.wav;type=audio/x-wav'
- Expected Output:
ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಯಾವುದು
Translation: "What is the capital of Karnataka?"
Sample 2: kannada_sample_2.wav
- Audio File: samples/kannada_sample_2.wav
- Command:
curl -X 'POST' \
  'http://localhost:7860/transcribe/?language=kannada' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@samples/kannada_sample_2.wav;type=audio/x-wav'
- Expected Output:
ಬೆಂಗಳೂರು ಕರ್ನಾಟಕ ರಾಜ್ಯದ ರಾಜಧಾನಿ ಆಗಿದೆ ಕರ್ನಾಟಕದಲ್ಲಿ ನಾವು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇವೆ
Sample 3 - Song - 4 minutes
- YT Video- Navaduva Nudiye
- Audio File: samples/kannada_sample_3.wav
- Command:
curl -X 'POST' \
  'http://localhost:7860/transcribe/?language=kannada' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@samples/kannada_sample_3.wav;type=audio/x-wav'
- Expected Output: kannada_sample_3_out.md
Sample 4 - Song - 6.4 minutes
- YT Video- Aagadu Yendu
- Audio File: samples/kannada_sample_4.wav
- Command:
curl -X 'POST' \
  'http://localhost:7860/transcribe/?language=kannada' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@samples/kannada_sample_4.wav;type=audio/x-wav'
- Expected Output: kannada_sample_4_out.md
Note: The ASR does not provide sentence breaks or punctuation (e.g., question marks). We plan to integrate an LLM parser for improved context in future updates.
Batch Transcription Examples
Transcribe Batch Endpoint
The /transcribe_batch endpoint allows you to transcribe multiple audio files in a single request. This is useful for batch processing of audio files.
- Command:
curl -X 'POST' \
  'http://localhost:7860/transcribe_batch/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@samples/kannada_sample_1.wav;type=audio/x-wav' \
  -F 'files=@samples/kannada_sample_2.wav;type=audio/x-wav'
- Expected Output:
{
  "transcriptions": [
    "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಯಾವುದು",
    "ಬೆಂಗಳೂರು ಕರ್ನಾಟಕ ರಾಜ್ಯದ ರಾಜಧಾನಿ ಆಗಿದೆ ಕರ್ನಾಟಕದಲ್ಲಿ ನಾವು ಕನ್ನಡ ಮಾತನಾಡುತ್ತೇವೆ"
  ]
}
Troubleshooting
- Transcription errors: Verify the audio file is in WAV format, mono, and sampled at 16 kHz. Adjust using:
  ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav -y
  (A quick Python check of these properties is shown after this list.)
- Model not found: Download the required models using the huggingface-cli download commands above.
- Port conflicts: Ensure port 7860 is free when running the FastAPI server.
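As a quick sanity check before uploading, the audio properties can be inspected from Python with torchaudio, which the sample code above already uses:

```python
# Verify the format the server expects: 16 kHz, mono WAV.
import torchaudio

info = torchaudio.info("samples/kannada_sample_1.wav")
print("sample rate:", info.sample_rate, "channels:", info.num_channels)
if info.sample_rate != 16000 or info.num_channels != 1:
    print("Re-encode first: ffmpeg -i in.wav -ac 1 -ar 16000 out.wav -y")
```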
Contributing
We welcome contributions! Please read the CONTRIBUTING.md file for guidelines on how to contribute to this project.
You can also join the Discord group to collaborate.
References
- AI4Bharat IndicConformerASR GitHub Repository
- Nemo - AI4Bharat
- IndicConformer Collection on HuggingFace
Additional methods for Development
Running Nemo Model
- Download the Nemo model:
  wget https://objectstore.e2enetworks.net/indic-asr-public/indicConformer/ai4b_indicConformer_kn.nemo -O kannada.nemo
- Adjust the audio:
  ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav -y
- Run the program:
  python nemo_asr.py
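The repository's nemo_asr.py is not reproduced here; as an illustration, a script along these lines restores the downloaded .nemo checkpoint and transcribes the prepared audio. The `restore_from`/`transcribe` usage mirrors the NeMo sample earlier in this README and is an assumption, not the actual file.

```python
# Hypothetical sketch of a nemo_asr.py-style script -- not the repository's
# actual file. Restores the .nemo checkpoint downloaded above and transcribes
# the ffmpeg-prepared audio.
import torch
import nemo.collections.asr as nemo_asr

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nemo_asr.models.ASRModel.restore_from("kannada.nemo")
model.freeze()  # inference mode
model = model.to(device)
model.cur_decoder = "rnnt"
text = model.transcribe(
    ["sample_audio_infer_ready.wav"], batch_size=1, language_id="kn"
)[0]
print(text)
```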