We Benchmarked 5 LLMs on Indian Medical Record Extraction
Most LLM benchmarks use English-language synthetic data and tell you almost nothing about how a model behaves on real Indian medical records: handwritten sections, transliterated drug names, mixed-script prescriptions. We ran our own benchmark on 1,200 records. Here's what we found.
Key takeaways
- We tested Claude 4.6, GPT-5, Gemini 2.5, Llama 3.3 70B, and Mistral Large 2 on 1,200 Indian outpatient records.
- Task: extract patient name, age, diagnosis (ICD-10), medications (RxNorm-mapped), follow-up date.
- Accuracy ranged from 71% (Llama 3.3 70B) to 91% (Claude 4.6).
- Cost per 1,000 records ranged from ₹1,400 (Llama 3.3 70B self-hosted) to ₹14,000 (Claude 4.6 via API).
- Best price/accuracy trade-off: GPT-5 (87% accuracy at ~₹5,200 per 1,000 records).
Why this matters
Healthcare AI in India lives or dies on extraction accuracy. Even a system that extracts 90% of fields correctly gets roughly one field in ten wrong, and when that field is a diagnosis, the miss rate is clinically unacceptable. Choosing the wrong model means either over-buying (paying for a frontier model when a cheaper one would do) or under-buying (shipping an unsafe system to clinicians).
Setup
We sourced 1,200 anonymized outpatient consultation records from a partner clinic chain (with consent and an IRB-light review for benchmark use). Records included typed sections, handwritten margin notes, and prescription pads with mixed Hindi-English text. Each record was labeled by two human reviewers, with conflicts adjudicated.
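For teams building their own ground truth, the double-labeling step can be sketched roughly as follows. This is a minimal illustration, not the tooling we used: the field names and the dict-per-record layout are assumptions made for the example.

```python
# Hypothetical field names for the five extracted fields (an assumption for this sketch).
FIELDS = ["patient_name", "age", "diagnosis_icd10", "medications_rxnorm", "follow_up_date"]


def find_conflicts(label_a: dict, label_b: dict, fields=FIELDS) -> list:
    """Return the fields on which two reviewers disagree for one record."""
    return [f for f in fields if label_a.get(f) != label_b.get(f)]


def agreement_rate(pairs, fields=FIELDS) -> float:
    """Fraction of (record, field) cells on which both reviewers agree.

    pairs: list of (reviewer_a_labels, reviewer_b_labels) dicts, one pair per record.
    """
    total = len(pairs) * len(fields)
    if total == 0:
        return 0.0
    disagreements = sum(len(find_conflicts(a, b, fields)) for a, b in pairs)
    return 1.0 - disagreements / total
```

Records where `find_conflicts` returns a non-empty list are the ones to send to an adjudicator. A low agreement rate usually means the labeling guidelines need tightening before any model is scored against them.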
We tested five models with identical prompts:
- Claude 4.6 (Anthropic API)
- GPT-5 (OpenAI API)
- Gemini 2.5 (Google API)
- Llama 3.3 70B (self-hosted on AWS g5.12xlarge)
- Mistral Large 2 (Mistral API)
Prompt: structured JSON extraction with field-level instructions, examples, and a "return null if uncertain" guardrail.
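A guardrail like "return null if uncertain" only helps if the downstream code treats malformed output the same way as an explicit null. The sketch below shows one way to do that. The field names, the 0 to 120 age range, and the ISO date convention are assumptions for illustration, not our actual schema.

```python
import json
from datetime import date

# Hypothetical field names (an assumption for this sketch).
REQUIRED_FIELDS = ("patient_name", "age", "diagnosis_icd10", "medications_rxnorm", "follow_up_date")


def parse_extraction(raw: str) -> dict:
    """Parse a model reply into the five fields; missing or malformed values become None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    out = {field: data.get(field) for field in REQUIRED_FIELDS}

    # Age must be a plausible integer; bool is excluded because it subclasses int.
    age = out["age"]
    if not (isinstance(age, int) and not isinstance(age, bool) and 0 <= age <= 120):
        out["age"] = None

    # Follow-up date must parse as an ISO date (assumed convention: YYYY-MM-DD).
    if out["follow_up_date"] is not None:
        try:
            date.fromisoformat(out["follow_up_date"])
        except (TypeError, ValueError):
            out["follow_up_date"] = None

    # Medications must be a list; anything else is treated as "uncertain".
    if not isinstance(out["medications_rxnorm"], list):
        out["medications_rxnorm"] = None
    return out
```

Treating a parse failure as a null rather than retrying or guessing keeps the failure visible in the metrics: it shows up as lower coverage instead of as a silent wrong answer.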
Results
| Model | Accuracy | P95 latency | Cost per 1,000 records |
|---|---|---|---|
| Claude 4.6 | 91% | 4.2s | ₹14,000 |
| GPT-5 | 87% | 3.6s | ₹5,200 |
| Gemini 2.5 | 84% | 2.9s | ₹3,800 |
| Mistral Large 2 | 78% | 3.1s | ₹4,500 |
| Llama 3.3 70B (self-hosted) | 71% | 5.8s | ₹1,400 |
Accuracy is field-level: a weighted average across the five extracted fields. Latency is P95 end-to-end, including network time. Cost is per 1,000 records at a typical record length (~800 input tokens, ~250 output tokens).
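The three metrics can be computed along these lines. This is a sketch under stated assumptions: the field weights are whatever your team chooses (ours are not published here), exact match is used as the correctness test, and the token prices are parameters you supply from your provider's price list.

```python
def field_accuracy(predictions, truths, field):
    """Exact-match accuracy for one field across parallel lists of record dicts."""
    correct = sum(1 for p, t in zip(predictions, truths) if p.get(field) == t.get(field))
    return correct / len(truths)


def weighted_accuracy(predictions, truths, weights):
    """Weighted average of per-field accuracies; weights maps field name -> weight."""
    total_weight = sum(weights.values())
    return sum(w * field_accuracy(predictions, truths, f) for f, w in weights.items()) / total_weight


def p95(latencies):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def cost_per_1000(input_price_per_mtok, output_price_per_mtok,
                  input_tokens=800, output_tokens=250):
    """API cost for 1,000 records, given prices per million tokens in any currency."""
    per_record = (input_tokens * input_price_per_mtok
                  + output_tokens * output_price_per_mtok) / 1_000_000
    return 1000 * per_record
```

Exact match is a strict choice for fields like medication lists, where ordering or partial matches may deserve their own scoring rule. For self-hosted models, `cost_per_1000` does not apply directly, since the cost there is driven by instance hours rather than tokens.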
What's interesting in the data
Claude 4.6 was strongest on handwritten and mixed-script content, the hardest cases. GPT-5 was strongest on structured prescription content. Gemini 2.5 was the fastest. Llama 3.3 70B was far cheaper but struggled with anything non-typed.
The 91% vs 87% gap between Claude 4.6 and GPT-5 came largely from records with handwritten clinician notes, where Claude 4.6 was 14 percentage points more accurate.
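A headline accuracy number hides this kind of gap, so it helps to break results down by record type. A minimal sketch, assuming each scored field has been tagged with a stratum label such as "typed" or "handwritten" (the labels and input layout are illustrative):

```python
from collections import defaultdict


def accuracy_by_stratum(rows):
    """rows: iterable of (stratum, is_correct) pairs, one per scored field."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [correct, seen]
    for stratum, is_correct in rows:
        totals[stratum][0] += int(is_correct)
        totals[stratum][1] += 1
    return {stratum: correct / seen for stratum, (correct, seen) in totals.items()}
```

Running this per model and comparing the "handwritten" rows side by side is what surfaces differences that the overall average smooths over.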
Recommendation by use case
- Triage / front-office summarization (errors recoverable): Gemini 2.5. Fast, cheap, accurate enough.
- Clinical decision support (errors costly): Claude 4.6. Premium pricing, premium accuracy.
- High-volume, low-stakes (anonymization, indexing): Llama 3.3 70B self-hosted. Cheap, and "good enough" for tasks where 71% accuracy is acceptable.
- Production extraction at scale: GPT-5. Best price/accuracy balance.
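One way to sanity-check recommendations like these is to ask which models are dominated, meaning some other model is at least as accurate and at least as cheap. The sketch below uses the accuracy and cost figures from the results table. It ignores latency, which is a simplification.

```python
# (accuracy %, cost in rupees per 1,000 records), taken from the results table.
RESULTS = {
    "Claude 4.6": (91, 14000),
    "GPT-5": (87, 5200),
    "Gemini 2.5": (84, 3800),
    "Mistral Large 2": (78, 4500),
    "Llama 3.3 70B": (71, 1400),
}


def pareto_frontier(results):
    """Return models that no other model beats on both accuracy and cost."""
    frontier = []
    for name, (acc, cost) in results.items():
        dominated = any(
            o_acc >= acc and o_cost <= cost and (o_acc > acc or o_cost < cost)
            for other, (o_acc, o_cost) in results.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(RESULTS))
```

On these numbers, Mistral Large 2 is the only dominated model: Gemini 2.5 is both more accurate and cheaper. The other four each represent a real trade-off, which is why the right pick depends on the use case.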
Common pitfalls
The biggest mistake teams make is benchmarking on English-language records and assuming the results generalize. They don't. Indian medical content has linguistic and structural patterns that significantly hurt naive English-trained models.
The second is benchmarking only on accuracy and ignoring latency. A model that's 2 percentage points more accurate but 3× slower will tank your throughput economics.
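The latency effect is easy to estimate on the back of an envelope. A rough sketch, assuming each concurrent request slot processes one record at a time and that latency stays flat as concurrency grows (real APIs have rate limits that break this assumption):

```python
import math


def records_per_hour(latency_s: float, concurrency: int) -> float:
    """Approximate throughput when each slot handles one record at a time."""
    return concurrency * 3600.0 / latency_s


def slots_needed(target_per_hour: float, latency_s: float) -> int:
    """Concurrent request slots needed to sustain a target hourly volume."""
    return math.ceil(target_per_hour * latency_s / 3600.0)
```

For example, `slots_needed(1000, 4.0)` is 2, while the same volume at three times the latency, `slots_needed(1000, 12.0)`, is 4. For a self-hosted model, those extra slots translate directly into extra GPU capacity.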
The third is benchmarking without guardrails. Models prompted to "return null if uncertain" behave very differently from models forced to always answer.
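When a model is allowed to abstain, a single accuracy number is no longer enough: you want to know how often it answers and how often those answers are right. A minimal sketch for one field, assuming `None` means the model abstained and that the ground truth always has a value (records where the field is truly absent would need separate handling):

```python
def selective_metrics(predictions, truths):
    """Coverage and accuracy-on-answered for parallel lists of field values."""
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in answered if p == t)
    return {
        "coverage": len(answered) / len(predictions),
        "accuracy_on_answered": correct / len(answered) if answered else 0.0,
        "wrong_answers": len(answered) - correct,
    }
```

In a clinical setting, `wrong_answers` is often the number that matters most: an abstention gets routed to a human, while a confident wrong answer may not.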
What we recommend
Run your own benchmark on your own records. Public benchmarks generalize poorly to specific domains and locales. Two days of work, a few hundred records, and a clear ground truth get you the answer for your actual workload, which is what matters.
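With a few hundred records, it is worth checking how much of an observed accuracy gap could be noise. One common tool is the Wilson score interval for a proportion. The sketch below is a standard formula, not something specific to our benchmark, and it treats every scored item as independent, which is only approximately true when several fields come from the same record.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for an accuracy estimate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half
```

For instance, 261 correct out of 300 (87%) gives an interval of roughly 83% to 90%. If two models' intervals overlap heavily on your sample, label more records before choosing between them on accuracy alone.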
FAQs
Did you fine-tune? No. We used all five models as released, with prompt engineering only. Fine-tuning would shift the results, but it adds operational complexity that we wanted to keep out of the comparison.
Why these 5 models? They span the spectrum from frontier-closed to mid-cost to open self-hosted, which is the decision space teams actually face.
Will you publish the data? No. Patient data, even when anonymized, isn't ours to publish.
Will you update this for new model releases? Yes, annually.
See how Techpuvi builds AI for healthcare. Evals first, guardrails always, models chosen by data.
