Explore Our Datasets
Every dataset is built to your spec, not pulled from a shelf. Core languages are listed below; multilingual code-mixed combinations and other languages are available on request. All data is 100% human-annotated and ethically sourced.
Hindi Conversational
Transcript Snippet
spk_02: haan lekin evening window better rahegi
Metadata Schema
{
  "language": "hindi",
  "accent": "standard_northern"
}
File Structure
├── audio.wav
├── transcript.json
└── metadata.json
Primary Use Cases
High-fidelity training data for ASR, customer support voicebots, onboarding flows, sales conversation AI, and dialect-specific NLU models.
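As a rough illustration of how a sample laid out like the file structure above might be read, here is a minimal Python sketch. The `load_sample` helper and the internal layout of `transcript.json` are assumptions made for illustration; only the three file names and the `language` and `accent` metadata keys come from the listing above.

```python
import json
import wave
from pathlib import Path


def load_sample(sample_dir):
    """Load one sample folder holding audio.wav, transcript.json, metadata.json.

    Hypothetical helper: the delivered schema may differ from this sketch.
    """
    root = Path(sample_dir)

    # Read basic audio properties with the standard-library WAV reader (PCM only).
    with wave.open(str(root / "audio.wav"), "rb") as wav:
        sample_rate = wav.getframerate()
        duration_s = wav.getnframes() / sample_rate

    # Both JSON files are parsed as-is; no particular structure is assumed.
    transcript = json.loads((root / "transcript.json").read_text(encoding="utf-8"))
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))

    return {
        "sample_rate": sample_rate,
        "duration_s": duration_s,
        "transcript": transcript,
        "metadata": metadata,
    }
```

A caller could then filter samples on fields such as `metadata["language"]` or `metadata["accent"]` before building a training set.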
Hinglish Code-Mixed
Transcript Snippet
spk_02: I am sorry sir, let me check the status
Metadata Schema
{
  "mix_ratio": "60:40",
  "domain": "ecommerce"
}
File Structure
├── batch_001.zip
├── manifest.csv
└── segments.json
Primary Use Cases
Ideal for urban Indian AI assistants, e-commerce support bots, and multilingual sentiment analysis.
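The `mix_ratio` field shown in the metadata snippet is a colon-separated string, so a consumer would likely want it as numeric shares. The sketch below shows one way to do that; `parse_mix_ratio` is a hypothetical helper, and the listing does not say which language each side of the ratio refers to.

```python
def parse_mix_ratio(mix_ratio):
    """Convert a ratio string such as "60:40" into two fractions that sum to 1.

    Assumption: the field holds exactly two non-negative numbers separated
    by a colon. Which language maps to which side is not specified here.
    """
    first, second = (float(part) for part in mix_ratio.split(":"))
    total = first + second
    if total <= 0:
        raise ValueError(f"invalid mix_ratio: {mix_ratio!r}")
    return first / total, second / total


shares = parse_mix_ratio("60:40")
print(shares)  # (0.6, 0.4)
```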
Punjabi Regional
Transcript Snippet
spk_02: vadiya vadiya, tusi daso kidda aana hoya?
Metadata Schema
{
  "dialect": "majhi",
  "setting": "indoor_quiet"
}
File Structure
├── raw_wavs/
├── trans_vtt/
└── session_logs.xml
Primary Use Cases
Regional voice search engines, agricultural advisory bots, and government service accessibility.
Marwadi Commerce
Transcript Snippet
spk_02: arey bhai, kal pakka bhej dyun.
Metadata Schema
{
  "domain": "trade_finance",
  "verified_consent": true
}
File Structure
├── audio_processed/
├── master_meta.json
└── consent_proofs/
Primary Use Cases
Hyper-local commerce bots, financial inclusion initiatives, and specialized dialect translation.
Indian English (Enterprise)
Transcript Snippet
spk_02: Sending it over via email right away, sir.
Metadata Schema
{
  "accent_profile": "pan_india_professional",
  "role": "support_agent"
}
File Structure
├── wav_16k/
├── ortho_transcripts/
└── sentiment_tags.json
Primary Use Cases
Global support automation, enterprise meeting transcription, and Indian accent-aware LLM evaluation.
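If the `wav_16k/` folder name is taken to mean 16 kHz PCM WAV audio (an assumption; the listing does not state the sample rate explicitly), a quick intake check might look like the sketch below. The `find_off_rate_wavs` helper is hypothetical and uses only the Python standard library.

```python
import wave
from pathlib import Path


def find_off_rate_wavs(folder, expected_hz=16000):
    """Return names of .wav files whose sample rate differs from expected_hz.

    Assumes uncompressed PCM WAV files, which is what the `wave` module reads.
    """
    mismatched = []
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            if wav.getframerate() != expected_hz:
                mismatched.append(path.name)
    return mismatched
```

An empty list from `find_off_rate_wavs("wav_16k")` would indicate that every file in the folder matches the expected rate.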
Multilingual Code-Mixed
Example Language Pairs
Metadata Schema
{
  "lang_pair": "marwadi-hindi",
  "per_token_lang_id": true
}
File Structure
├── audio.wav
├── transcript.json
└── metadata.json
Primary Use Cases
Code-switching ASR, multilingual LLM fine-tuning, cross-dialect NLU, and regional voice assistants that reflect how India actually speaks.
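When `per_token_lang_id` is enabled, each transcript token presumably carries a language tag, which makes code-switch statistics straightforward to compute. The sketch below counts switch points in one utterance. The `(text, lang)` token format, the `count_switch_points` helper, and the English/Hindi example utterance are all invented for illustration and are not taken from a delivered dataset.

```python
def count_switch_points(tokens):
    """Count positions where the language tag changes between adjacent tokens.

    `tokens` is a list of (text, lang) pairs; this format is hypothetical.
    """
    langs = [lang for _, lang in tokens]
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)


# Invented example utterance, tagged by hand for illustration only.
example = [
    ("order", "en"),
    ("kal", "hi"),
    ("tak", "hi"),
    ("deliver", "en"),
    ("ho", "hi"),
    ("jayega", "hi"),
]
print(count_switch_points(example))  # 3
```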
Request Dataset Specs
Tell us about your requirements and receive full schema details within 24 hours.