Dataset Specifications

Explore Dataset Formats

Every dataset is built to your spec, not pulled from a shelf. Core languages below. Multilingual code-mixed combinations and other languages available on request. Recorded by consented human contributors, then annotated and reviewed according to the agreed project scope.

What buyers receive

Each dataset is scoped around the buyer's model, language mix, use case, speaker requirements, metadata needs, QA depth, and delivery format.

Audio

Audio in agreed WAV or project-specific formats, with recording environment and session details documented where required.

Transcripts

Speaker-labelled transcripts where required, with timestamp depth scoped to the project.

Session metadata

Language, dialect or accent group, speaker ID, scenario type, recording environment, duration, and QA status.

Consent-linked records

Consent-linked references covering the agreed data use, including AI training, evaluation, or benchmarking where applicable.

QA records

Audio quality checks, language verification, scenario adherence review, and known issue notes where required.

Delivery

Secure delivery method agreed before sample or pilot sharing, with format adjustments available based on buyer workflow.

Hindi Customer Support Conversations

Noise: Medium Overlap: Yes Built to Order

Transcript Snippet

spk_01: mera order abhi tak nahi aaya, teen din ho gaye.
spk_02: sir, mujhe aapka order number dijiye, main check karta hoon.

Metadata Schema

{
"language": "hindi",
"accent": "standard_northern"
}

File Structure

session_021/
├── audio.wav
├── transcript.json
└── metadata.json

Primary Use Cases

ASR fine-tuning for customer support voicebots, onboarding flows, sales conversation AI, and dialect-specific NLU models.

Scope a Dataset

Hinglish Voice Agent Testing

Noise: High Overlap: Frequent Built to Order

Transcript Snippet

spk_01: flight cancel ho gayi, ab kya karun?
spk_02: sir, main check karta hoon aapke liye, ek minute please.

Metadata Schema

{
  "mix_ratio": "defined_per_project",
  "domain": "customer_support",
  "scenario_type": "refund_follow_up"
}

File Structure

hg_train_set_v1/
├── batch_001.zip
├── manifest.csv
└── segments.json

Primary Use Cases

Ideal for urban Indian AI assistants, e-commerce support bots, and multi-lingual sentiment analysis.

Scope a Dataset

Punjabi Conversational Speech

Noise: Low Overlap: Minimal Built to Order

Transcript Snippet

spk_01: ssa ji, ki haal chal hai?
spk_02: vadiya vadiya, tusi daso kidda aana hoya?

Metadata Schema

{
"dialect": "majhi",
"setting": "indoor_quiet"
}

File Structure

punjabi_v2_core/
├── audio/
├── transcripts/
└── metadata.json

Primary Use Cases

Regional voice search engines, agricultural advisory bots, and government service accessibility.

Scope a Dataset

Marwadi Regional Conversations

Noise: Med-High Overlap: Yes Built to Order

Transcript Snippet

spk_01: mhaare thode paise baaki hai.
spk_02: arey bhai, kal pakka bhej dyun.

Metadata Schema

{
  "domain": "trade_finance",
  "consent_reference": "linked",
  "consent_scope": "project_defined"
}

File Structure

marwadi_trade_v1/
├── audio/
├── master_meta.json
└── consent_summary.json

Primary Use Cases

Hyper-local commerce bots, financial inclusion initiatives, and specialized dialect translation.

Scope a Dataset

Indian English Support Conversations

Noise: Medium Overlap: Moderate Built to Order

Transcript Snippet

spk_01: I wanted to follow up on the refund, it has been two weeks now.
spk_02: Sure, let me pull up your account. Can I have your order number?

Metadata Schema

{
"accent_profile": "pan_india_professional",
"role": "support_agent"
}

File Structure

ie_support_set/
├── audio/
├── transcripts/
└── metadata.json

Primary Use Cases

Enterprise meeting transcription, call centre ASR, and Indian accent-aware LLM evaluation.

Scope a Dataset

Code-Switched Evaluation Set

Custom Pairs On Request Same Schema

Example Language Pairs

Hindi–English Marwadi–Hindi Haryanvi–Hindi Marwadi–English Tamil–English Marathi–Hindi + any pair your model needs

Metadata Schema

{
"lang_pair": "marwadi-hindi",
"per_token_lang_id": true
}

File Structure

custom_multilingual_v1/
├── audio.wav
├── transcript.json
└── metadata.json

Primary Use Cases

Code-switching ASR, multilingual LLM fine-tuning, cross-dialect NLU, and regional voice assistants that reflect how India actually speaks.

Discuss Your Requirements

For AI teams

Request a Scope Review

Not all voice data is the same. Tell us your use case, languages, and pipeline and we will review the scope and share the most relevant preview path for your evaluation.

Scope a Dataset

Explore Dataset Formats

What buyers receive

Hindi Customer Support Conversations

Hinglish Voice Agent Testing

Punjabi Conversational Speech

Marwadi Regional Conversations

Indian English Support Conversations

Code-Switched Evaluation Set

Request a Scope Review

Request Dataset Specs

Request Received