We Benchmarked 5 LLMs on Indian Medical Record Extraction

Original benchmark: 5 frontier and open-source models extracting structured fields from real Indian medical records. Accuracy, cost, latency, and which model wins for what.

Niranjana
May 29, 2026 · 11 min read

Most LLM benchmarks run on English-language synthetic data and tell you almost nothing about how a model behaves on real Indian medical records, with their handwritten sections, transliterated drug names, and mixed-script prescriptions. We ran our own benchmark on 1,200 records. Here's what we found.

Key takeaways

  • We tested Claude 4.6, GPT-5, Gemini 2.5, Llama 3.3 70B, and Mistral Large 2 on 1,200 Indian outpatient records.
  • Task: extract patient name, age, diagnosis (ICD-10), medications (RxNorm-mapped), follow-up date.
  • Accuracy ranged from 71% (Llama 3.3 70B) to 91% (Claude 4.6).
  • Cost per 1,000 records ranged from ₹1,400 (Llama 3.3 70B self-hosted) to ₹14,000 (Claude 4.6 via API).
  • Best price/accuracy trade-off: GPT-5 (87% accuracy at ~₹5,200 per 1,000 records).

Why this matters

Healthcare AI in India lives or dies on extraction accuracy. Even at 90% accuracy, roughly one extracted field in ten is wrong or missing, and when that field is a diagnosis, the miss rate is clinically unacceptable. Choosing the wrong model means either over-buying (paying for a frontier model when a cheaper one would do) or under-buying (shipping an unsafe system to clinicians).

Setup

We sourced 1,200 anonymized outpatient consultation records from a partner clinic chain, with consent and an IRB-light review for benchmark use. Records included typed sections, handwritten margin notes, and prescription pads in mixed Hindi-English. Each record was labeled by two human reviewers, and conflicts between them were adjudicated.
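
The double-labeling step can be sketched roughly as follows. This is a minimal illustration, not the tooling used for the benchmark: the field names, `find_conflicts`, and `merge_labels` are all hypothetical, and it assumes each reviewer's labels arrive as a plain dictionary.

```python
from typing import Optional

# Hypothetical field names, mirroring the five fields extracted in this benchmark.
FIELDS = ["patient_name", "age", "diagnosis_icd10", "medications_rxnorm", "follow_up_date"]

Label = dict[str, Optional[object]]


def find_conflicts(label_a: Label, label_b: Label) -> list[str]:
    """Return the fields on which the two reviewers' labels differ."""
    return [f for f in FIELDS if label_a.get(f) != label_b.get(f)]


def merge_labels(label_a: Label, label_b: Label, adjudicated: Label) -> Label:
    """Build ground truth: agreed values pass through, conflicts take the adjudicator's value."""
    conflicts = find_conflicts(label_a, label_b)
    missing = [f for f in conflicts if f not in adjudicated]
    if missing:
        # Refuse to produce ground truth while any disagreement is unresolved.
        raise ValueError(f"unadjudicated conflicts: {missing}")
    return {f: adjudicated[f] if f in conflicts else label_a.get(f) for f in FIELDS}
```

Raising on unresolved conflicts, rather than silently picking one reviewer, keeps disagreements from leaking into the ground truth.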

We tested five models with identical prompts:

  • Claude 4.6 (Anthropic API)
  • GPT-5 (OpenAI API)
  • Gemini 2.5 (Google API)
  • Llama 3.3 70B (self-hosted on AWS g5.12xlarge)
  • Mistral Large 2 (Mistral API)

Prompt: structured JSON extraction with field-level instructions, examples, and a "return null if uncertain" guardrail.
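
For readers who want a starting point, here is a rough sketch of what such a prompt and its response handling might look like. The benchmark's actual prompt is not reproduced here: the field names, instruction wording, and both helper functions are assumptions, and the few-shot examples mentioned above are omitted for brevity.

```python
import json

# Hypothetical field names and instructions (not the benchmark's actual prompt).
FIELD_INSTRUCTIONS = {
    "patient_name": "Full name as written on the record.",
    "age": "Age in years as an integer.",
    "diagnosis_icd10": "Primary diagnosis as an ICD-10 code.",
    "medications_rxnorm": "List of medications, each mapped to an RxNorm concept ID.",
    "follow_up_date": "Follow-up date in YYYY-MM-DD format.",
}


def build_prompt(record_text: str) -> str:
    """Assemble an extraction prompt with field-level instructions and a null guardrail."""
    lines = ["Extract the following fields from the medical record and reply with JSON only."]
    for name, instruction in FIELD_INSTRUCTIONS.items():
        lines.append(f'- "{name}": {instruction}')
    lines.append("If you are uncertain about a field, return null for it. Do not guess.")
    lines.append("Record:")
    lines.append(record_text)
    return "\n".join(lines)


def parse_response(raw: str) -> dict:
    """Parse model output, keeping only expected fields and defaulting missing ones to None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}  # Treat unparseable output as abstaining on every field.
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name) for name in FIELD_INSTRUCTIONS}
```

Treating malformed JSON as a full abstention is one possible design choice; a stricter harness might count it as an error instead.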

Results

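| Model                       | Accuracy | P95 latency | Cost per 1,000 records |
|-----------------------------|----------|-------------|------------------------|
| Claude 4.6                  | 91%      | 4.2s        | ₹14,000                |
| GPT-5                       | 87%      | 3.6s        | ₹5,200                 |
| Gemini 2.5                  | 84%      | 2.9s        | ₹3,800                 |
| Mistral Large 2             | 78%      | 3.1s        | ₹4,500                 |
| Llama 3.3 70B (self-hosted) | 71%      | 5.8s        | ₹1,400                 |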

Accuracy is field-level: a weighted average across the 5 extracted fields. Latency is P95 end-to-end, including network time. Cost is per 1,000 records at typical record length (~800 input tokens, ~250 output tokens).
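
The three metrics can be computed along these lines. This is a sketch under stated assumptions: the field weights used in the benchmark are not published here, the nearest-rank percentile is one of several common P95 definitions, and the per-million-token prices are caller-supplied placeholders rather than any vendor's actual rates.

```python
def weighted_field_accuracy(per_field_accuracy: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-field accuracies; weights need not sum to 1."""
    total_weight = sum(weights[f] for f in per_field_accuracy)
    return sum(per_field_accuracy[f] * weights[f] for f in per_field_accuracy) / total_weight


def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n, 1-based
    return ordered[rank - 1]


def cost_per_1000_records(input_price_per_m: float, output_price_per_m: float,
                          input_tokens: int = 800, output_tokens: int = 250) -> float:
    """API cost for 1,000 records, given prices per million tokens in any currency.

    Token defaults follow the typical record length quoted in the text.
    """
    tokens_cost = 1000 * (input_tokens * input_price_per_m + output_tokens * output_price_per_m)
    return tokens_cost / 1_000_000
```

At the quoted record length, 1,000 records come to about 800,000 input tokens and 250,000 output tokens, so output pricing can matter more than the token counts alone suggest. Self-hosted cost would need a different calculation based on instance hours.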

What's interesting in the data

Claude 4.6 was strongest on handwritten and mixed-script content, the hardest cases. GPT-5 was strongest on structured prescription content. Gemini 2.5 was the fastest. Llama 3.3 70B was by far the cheapest but struggled with anything non-typed.

The 91% vs 87% gap between Claude 4.6 and GPT-5 came largely from records with handwritten clinician notes, where Claude was 14 percentage points more accurate.
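
A breakdown like this only shows up if accuracy is reported per record type as well as overall. A minimal sketch, assuming each scored field has been tagged with a stratum such as "handwritten" or "typed" (the tags and the function name are hypothetical):

```python
from collections import defaultdict


def accuracy_by_stratum(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute accuracy per stratum from (stratum, is_correct) pairs, one per scored field."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for stratum, is_correct in results:
        total[stratum] += 1
        correct[stratum] += int(is_correct)
    return {s: correct[s] / total[s] for s in total}
```

Running this once per model and comparing the per-stratum numbers makes it clear whether an overall gap is spread evenly or concentrated in one kind of record.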

Recommendation by use case

  • Triage / front-office summarization (errors recoverable): Gemini 2.5. Fast, cheap, and accurate enough.
  • Clinical decision support (errors costly): Claude 4.6. Premium pricing, premium accuracy.
  • High-volume, low-stakes (anonymization, indexing): Llama 3.3 70B self-hosted. Cheap, and "good enough" only for tasks where 71% accuracy is acceptable.
  • Production extraction at scale: GPT-5. Best price/accuracy balance.
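
One way to sanity-check recommendations like these is to look for models that are beaten on both accuracy and cost at once. The sketch below uses the accuracy and cost figures from the results table; the `dominated` helper is illustrative, and it deliberately ignores latency and per-record-type differences.

```python
# (accuracy %, cost in ₹ per 1,000 records), copied from the results table.
RESULTS = {
    "Claude 4.6": (91, 14000),
    "GPT-5": (87, 5200),
    "Gemini 2.5": (84, 3800),
    "Mistral Large 2": (78, 4500),
    "Llama 3.3 70B": (71, 1400),
}


def dominated(name: str) -> bool:
    """True if another model is at least as accurate and as cheap, and strictly better on one."""
    acc, cost = RESULTS[name]
    return any(
        o_acc >= acc and o_cost <= cost and (o_acc > acc or o_cost < cost)
        for other, (o_acc, o_cost) in RESULTS.items()
        if other != name
    )


# Models that no other model beats on both axes.
frontier = [name for name in RESULTS if not dominated(name)]
```

On these numbers, Mistral Large 2 is the only dominated model: Gemini 2.5 is both more accurate and cheaper. The other four each represent a distinct trade-off, which is consistent with the list above.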

Common pitfalls

The biggest mistake teams make is benchmarking on English-language records and assuming the results generalize. They don't. Indian medical content has linguistic and structural patterns that significantly hurt naive English-trained models.

The second is benchmarking only on accuracy and ignoring latency. A model that's 2 percentage points more accurate but 3× slower will tank your throughput economics.
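
The throughput effect is simple arithmetic. The sketch below assumes a fixed pool of workers that each process one record at a time, uses mean latency (not P95), and ignores rate limits and batching; both function names are illustrative.

```python
import math


def records_per_hour(mean_latency_s: float, concurrency: int) -> float:
    """Steady-state throughput when each of `concurrency` workers handles one record at a time."""
    return concurrency * 3600 / mean_latency_s


def workers_needed(target_per_hour: float, mean_latency_s: float) -> int:
    """Workers required to sustain a target hourly volume at a given mean latency."""
    return math.ceil(target_per_hour * mean_latency_s / 3600)
```

Under these assumptions, tripling latency either cuts throughput to a third at the same concurrency or triples the number of workers needed to hold the same volume.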

The third is not benchmarking with guardrails. Models with "return null if uncertain" prompting behave very differently from models forced to always answer.
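
To compare the two behaviors fairly, it helps to separate how often a model answers from how often its answers are right. A minimal sketch for a single field, assuming `None` marks an abstention and the ground truth always has a value (the metric names are our own labels, not standard terms from the benchmark):

```python
from typing import Optional


def guardrail_metrics(predictions: list[Optional[str]], truth: list[str]) -> dict[str, float]:
    """Score one field across records, where a prediction of None means the model abstained."""
    answered = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    correct = sum(p == t for p, t in answered)
    n = len(truth)
    return {
        "coverage": len(answered) / n,  # share of records where the model answered
        "accuracy_when_answered": correct / len(answered) if answered else 0.0,
        "overall_accuracy": correct / n,  # abstentions count as misses
        "wrong_answer_rate": (len(answered) - correct) / n,  # answered, but incorrectly
    }
```

A model forced to always answer has full coverage by construction, so the interesting comparison is usually the wrong-answer rate: a null can be routed to a human, while a wrong value may go unnoticed.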

What we recommend

Run your own benchmark on your own records. Public benchmarks generalize poorly to specific domains and locales. Two days of work, a few hundred records, and a clear ground truth get you the answer for your actual workload, which is what matters.
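
A benchmark of this kind does not need much machinery. The loop below is a minimal sketch, not the harness we used: `extract` stands in for whatever function wraps your model call, scoring is plain exact match, and every ground-truth dictionary is assumed to carry the same fields.

```python
import time
from typing import Callable, Optional

Extraction = dict[str, Optional[object]]


def run_benchmark(
    extract: Callable[[str], Extraction],
    records: list[str],
    ground_truth: list[Extraction],
) -> dict:
    """Run `extract` over every record, timing each call and tallying exact-match hits per field."""
    hits: dict[str, int] = {}
    latencies: list[float] = []
    for text, truth in zip(records, ground_truth):
        start = time.perf_counter()
        predicted = extract(text)
        latencies.append(time.perf_counter() - start)
        for field, expected in truth.items():
            # Exact match only; real scoring may need normalization (dates, code formats).
            hits[field] = hits.get(field, 0) + int(predicted.get(field) == expected)
    n = len(records)
    return {
        "per_field_accuracy": {f: h / n for f, h in hits.items()},
        "latencies_s": latencies,
    }
```

Passing the model call in as a function keeps the loop identical across models, so differences in the output come from the model rather than from the harness.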

FAQs

Did you fine-tune? No. All models were used as released, with prompt engineering only. Fine-tuning would shift the results, but it adds operational complexity that we wanted to keep out of the comparison.

Why these 5 models? They span the spectrum from frontier-closed to mid-cost to open self-hosted, which is the decision space teams actually face.

Will you publish the data? No. Patient data, even anonymized, isn't ours to publish.

Will you update this for new model releases? Yes, annually.



Niranjana

Niranjana serves as a Senior Architect at Techpuvi. She brings more than 15 years of experience in software development, having built several products from the ground up. Choosing to specialize as a full-stack engineer, she maintains a strong commitment to continuous learning.