The most common question I get from health system CTOs and clinical informatics leads is some version of the same wrong question: “which is the best LLM right now?” The honest answer is that the question is malformed. There is no best clinical LLM, the same way there is no best surgical instrument. There is the right instrument for the cut you are making, with the patient in front of you, on the table you actually have.
Healthcare workloads are not a single workload. A coding suggestion on a discharge summary is not the same job as parsing a fax, which is not the same job as a longitudinal chart review across twelve years of progress notes, which is not the same job as a patient-facing intake dialogue with hard safety constraints. The model that wins one of those wins almost none of the others.
Stop asking which model is best. Start asking which model is best at the job, on the data you actually have, with the failure modes you can actually live with.
The five clinical job classes
Almost every healthcare AI workload we build at Widal decomposes into one of five jobs. Most production systems are a pipeline of two or three of them. The point of naming them is to stop treating “the AI” as one undifferentiated component.
- Extraction. Pull typed facts out of unstructured clinical text, faxes, lab PDFs, prior auth packets, intake forms. Output is a schema, not prose. Success is measured in field-level precision and recall against a held-out human gold set.
- Summarization. Compress a chart, a thread, or an encounter into a clinician-readable note with provenance. Hallucination here is not a quality issue, it is a safety issue. The summary becomes part of the chart.
- Structured tool-use. Multi-step agent loops that call EHR APIs, scheduling systems, lab order tools, eligibility checks. Success is stability over twenty-plus tool calls without the agent inventing a tool, an argument, or a result.
- Dialogue under safety constraints. Patient-facing intake, triage, education. The model has to be safely uncertain, escalate cleanly, and refuse the right things without being uselessly cautious about the wrong things.
- Long-context retrieval. Reasoning across a full chart, a research corpus, a payer policy library, or a multi-year correspondence record. The risk is not running out of context, it is silent recall failure in the middle of the window.
These five jobs have different ideal architectures, different evals, different latency budgets, and crucially, different best-fit frontier models. A team that runs one model for all five is leaving accuracy, cost, and safety headroom on the table.
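To make the extraction success measure concrete, here is a minimal sketch of field-level precision and recall against a gold set. It assumes flat string fields and exact match after trimming and lowercasing; the `FieldMap` type, the `scoreFields` helper, and the sample field names are hypothetical, and a real suite would need per-field normalization for dates, units, and codes.

```typescript
// Hypothetical flat record: field name -> extracted value (null or absent = not extracted).
type FieldMap = Record<string, string | null>;

interface FieldScore {
  precision: number; // correct fields / fields the model emitted
  recall: number;    // correct fields / fields present in the gold set
}

// Micro-averaged field-level precision and recall across paired documents.
// Assumption: exact string match after trimming and lowercasing.
function scoreFields(predicted: FieldMap[], gold: FieldMap[]): FieldScore {
  if (predicted.length !== gold.length) {
    throw new Error("predicted and gold must be paired one-to-one");
  }
  const norm = (v: string): string => v.trim().toLowerCase();
  let correct = 0;
  let predictedCount = 0;
  let goldCount = 0;

  for (let i = 0; i < gold.length; i++) {
    const g = gold[i];
    const p = predicted[i];
    // Union of field names seen on either side.
    const fields = Object.keys(g).concat(Object.keys(p).filter((k) => !(k in g)));
    for (const f of fields) {
      const gv = g[f] ?? null;
      const pv = p[f] ?? null;
      if (pv !== null) predictedCount++;
      if (gv !== null) goldCount++;
      if (pv !== null && gv !== null && norm(pv) === norm(gv)) correct++;
    }
  }

  return {
    precision: predictedCount === 0 ? 0 : correct / predictedCount,
    recall: goldCount === 0 ? 0 : correct / goldCount,
  };
}

// Illustrative data only: one field correct, one wrong, one missed.
const goldSet: FieldMap[] = [{ mrn: "12345", dob: "1980-01-02", payer: "Acme Health" }];
const modelOutput: FieldMap[] = [{ mrn: "12345", dob: "1980-02-01", payer: null }];
console.log(scoreFields(modelOutput, goldSet)); // precision 0.5, recall 1/3
```

Keeping precision and recall separate matters here: a model that invents a payer name hurts precision, while a model that silently skips the field hurts recall, and the two failures carry different clinical risk.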
What to weigh per job
For each job class, we score candidate models against a fixed rubric. The weights change per job. The categories do not.
- Structured output reliability. JSON-mode adherence, schema fidelity under adversarial input, behavior when the source is missing a field.
- Tool-use stability. Long-chain agent loops without invented tools, malformed arguments, or context-window thrash.
- Refusal calibration. Refuses the unsafe, does not refuse the merely uncomfortable. The unsafe behavior is not over-compliance, it is the wrong compliance.
- Hallucination rate on long context. The needle-in-haystack number is the floor, not the answer. Production hallucination is measured on real charts, not on synthetic recall.
- Eval headroom. How much room you have between current performance and the ceiling, measured on your task, not on MMLU.
- BAA availability. The model is unusable for PHI without a signed BAA. All three frontier labs offer one in 2026, with caveats per deployment surface.
- Latency. p50 and p95, with realistic prompt sizes, including the tool call round trips. Dashboards lie if they only report the happy path.
- Cost. Per-encounter, not per-token. A cheaper model that needs four passes is not cheaper.
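The "fixed categories, per-job weights" scheme amounts to a weighted average. A minimal sketch, assuming 1-to-5 scores per category; the category identifiers, the `weightedScore` helper, and every number below are illustrative placeholders, not measured values.

```typescript
// The eight rubric categories from the list above (identifier names are hypothetical).
type Category =
  | "structured_output"
  | "tool_use"
  | "refusal"
  | "long_context_hallucination"
  | "eval_headroom"
  | "baa"
  | "latency"
  | "cost";

type Scores = Record<Category, number>;  // 1 to 5 per category
type Weights = Record<Category, number>; // non-negative, any scale

// Weighted mean of category scores. Weights are normalized by their sum,
// so they do not need to add up to 1.
function weightedScore(scores: Scores, weights: Weights): number {
  const categories = Object.keys(weights) as Category[];
  let total = 0;
  let weightSum = 0;
  for (const c of categories) {
    if (weights[c] < 0) throw new Error(`negative weight for ${c}`);
    total += scores[c] * weights[c];
    weightSum += weights[c];
  }
  if (weightSum === 0) throw new Error("all weights are zero");
  return total / weightSum;
}

// Hypothetical weights for an agentic tool-use job: tool stability dominates.
const toolUseWeights: Weights = {
  structured_output: 2, tool_use: 4, refusal: 1, long_context_hallucination: 1,
  eval_headroom: 1, baa: 1, latency: 1, cost: 1,
};

// Hypothetical scores for one candidate model.
const candidate: Scores = {
  structured_output: 4, tool_use: 5, refusal: 4, long_context_hallucination: 3,
  eval_headroom: 3, baa: 5, latency: 3, cost: 3,
};

console.log(weightedScore(candidate, toolUseWeights).toFixed(2)); // 4.08
```

Swapping in a different weight vector for, say, patient-facing dialogue reuses the same candidate scores and the same function, which is the point of holding the categories fixed.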
Anthropic Claude
The Claude Opus and Sonnet families are, in our deployments, the default for agentic clinical workflows. The reason is mostly architectural rather than a matter of headline benchmarks. Claude is the model we trust to run a twenty-step tool chain over an EHR without inventing a function, an argument, or a record id. It holds the shape of a structured task across hundreds of thousands of tokens. Refusal calibration is the most clinically usable of the three: it declines unsafe asks without becoming theatrical about normal medical content.
Where Claude wins: long tool-use chains, structured outputs, safety-tuned refusals, long-context reasoning that has to hold its shape end-to-end. Where it does not: native image understanding for radiology or pathology is not where you start, and the multimodal surface area is narrower than Gemini's. BAA is available through Anthropic directly and through AWS Bedrock.
OpenAI GPT
The GPT-5 family is the most generally capable raw generator and the most mature in tooling ecosystem, function-calling shape, and audio. For transcription-adjacent workloads (ambient scribing, voice intake, real-time call coaching), the Whisper plus GPT pipeline is still the path of least resistance, and the most integrated. Image reads are mature, the function-calling developer ergonomics are well-trodden, and the streaming and realtime APIs are ahead.
Where GPT wins: transcription, voice agents, image input where the answer is short and verifiable, function-calling at moderate chain depth, ecosystem maturity for teams already on Azure. Where it does not: long tool chains tend to drift earlier than they do with Claude, and structured-output adherence under adversarial input is more variable. BAA is available via Azure OpenAI and via OpenAI directly for enterprise.
Google Gemini
Gemini 3 is the model we reach for when the job is natively multimodal, when the context is genuinely enormous, or when the task is document layout, chart understanding, or scientific-paper synthesis. The native multi-million token context window is not a parlor trick. It changes the architecture: instead of a retrieval pipeline plus a smaller window, you can hand Gemini the entire payer policy library or the entire longitudinal chart and ask it to reason. Cost and latency become the constraint, not context.
Where Gemini wins: native multimodal, very long context that actually behaves through the middle of the window, chart and table understanding, document layout, science-domain reasoning, and increasingly, radiology workflow assist with appropriate human-in-the-loop. Where it does not: the agentic tool-use ecosystem is younger, and the refusal behavior on clinical edge cases is still maturing. BAA is available via Vertex AI on Google Cloud.
The scoring matrix
What follows is the matrix we actually use as a starting point in an architecture review. Scores are 1 to 5, on the rubric above, averaged across our last several production deployments. They are opinionated. They will be wrong by the time you read this, which is the meta-point of this entire post.
Job class                   | Claude | GPT | Gemini | Notes
----------------------------|--------|-----|--------|------------------------------
Extraction (forms, faxes)   | 4.5    | 4.0 | 4.0    | Any of them + good eval
Summarization (chart, note) | 4.5    | 4.0 | 4.0    | Claude on refusal calibration
Structured tool-use (agent) | 5.0    | 4.0 | 3.5    | Claude on long chains
Dialogue + safety (intake)  | 4.5    | 4.0 | 3.5    | Claude on calibration
Long-context retrieval      | 4.0    | 3.5 | 5.0    | Gemini on >500k tokens
Transcription-adjacent      | 3.5    | 5.0 | 4.0    | GPT on audio pipelines
Image (rad / path / chart)  | 3.5    | 4.5 | 5.0    | Gemini on layout + chart
Multimodal native           | 3.0    | 4.5 | 5.0    | Gemini, by margin
Cost per encounter          | 4.0    | 4.0 | 4.0    | Workload-dependent
BAA + deployment surface    | 5.0    | 5.0 | 5.0    | All three available

Two honest notes about this matrix. First, on three of the rows, the right answer is “any of them with a good eval suite.” Extraction on a stable schema, basic summarization, and BAA-protected deployment are commodity in 2026. Picking by benchmark on those rows is mostly procurement theater. Second, on the rows where there is a clear winner, that winner is not permanent. Six months ago the long-context row looked different. Six months from now the agent row may.
The rubric goes stale fast. The architecture is what does not. Build the routing layer so the swap is one config change, not one quarter.
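One way to keep the swap a config change is to derive the routing from the scores rather than typing it by hand. The sketch below copies the five job-class rows from the matrix above and picks the top-scoring column per row. The `pickRouting` helper, the key names, and the tie-breaking rule (the earlier-listed candidate wins) are assumptions for illustration.

```typescript
type CandidateName = "claude" | "gpt" | "gemini";
type ScoreMatrix = Record<string, Record<CandidateName, number>>;

// Scores copied from the matrix above (five job-class rows only).
const MATRIX: ScoreMatrix = {
  extraction:      { claude: 4.5, gpt: 4.0, gemini: 4.0 },
  summarization:   { claude: 4.5, gpt: 4.0, gemini: 4.0 },
  tool_use:        { claude: 5.0, gpt: 4.0, gemini: 3.5 },
  safety_dialogue: { claude: 4.5, gpt: 4.0, gemini: 3.5 },
  long_context:    { claude: 4.0, gpt: 3.5, gemini: 5.0 },
};

// Pick the highest-scoring candidate per job.
// Assumption: on a tie, the candidate listed first in `order` wins.
function pickRouting(
  matrix: ScoreMatrix,
  order: CandidateName[] = ["claude", "gpt", "gemini"]
): Record<string, CandidateName> {
  if (order.length === 0) throw new Error("need at least one candidate");
  const routing: Record<string, CandidateName> = {};
  for (const job of Object.keys(matrix)) {
    const row = matrix[job];
    let best = order[0];
    for (const name of order) {
      if (row[name] > row[best]) best = name;
    }
    routing[job] = best;
  }
  return routing;
}

console.log(pickRouting(MATRIX));
```

When next quarter's eval run changes a row, the routing changes with it and the calling code does not.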
The routing layer is the actual product
The most important architectural decision in a clinical AI system is not which model you start with. It is whether the system is model-agnostic at the routing layer. A pipeline that hard-codes a single vendor SDK is a pipeline that will lose twelve to eighteen months of accuracy and cost gains every time a frontier release lands. We treat model choice as a per-job dispatch decision, evaluated on a continuously updated suite, with the ability to swap per job class without touching the calling code. This is one of the things our healthcare AI engagements install on day one, before any single model is picked.
At the code level, the abstraction looks roughly like this. The specifics differ per stack, but the shape does not.
// per-job dispatch, not per-vendor coupling
type Job =
  | "extract"
  | "summarize"
  | "tool_use"
  | "safety_dialogue"
  | "long_context"
  | "transcribe"
  | "image_read"

const ROUTING: Record<Job, ModelChoice> = {
  extract: { vendor: "anthropic", family: "sonnet" },
  summarize: { vendor: "anthropic", family: "sonnet" },
  tool_use: { vendor: "anthropic", family: "opus" },
  safety_dialogue: { vendor: "anthropic", family: "opus" },
  long_context: { vendor: "google", family: "gemini" },
  transcribe: { vendor: "openai", family: "gpt" },
  image_read: { vendor: "google", family: "gemini" },
}

async function run(job: Job, input: TypedInput): Promise<TypedOutput> {
  const choice = ROUTING[job] // swap per job, not per app
  return await dispatch(choice, input) // vendor SDK behind interface
}

Three properties matter more than the table values. The vendor SDK lives behind an interface, so we can replace a model in one place. The routing table is data, not code, so it can be updated by an eval pipeline rather than a deploy. The eval suite runs cross-model, not just on the incumbent, so we always know what the next-best option scores on our actual workload.
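A minimal sketch of the first two properties, assuming the routing table arrives as JSON (from a file, a config service, or an eval pipeline) and that each vendor is wrapped in a small adapter. The `ModelAdapter` interface, `parseRouting`, `runJob`, and the stub adapters are hypothetical; real adapters would wrap the vendor SDKs, which are deliberately not shown here.

```typescript
type Job =
  | "extract" | "summarize" | "tool_use" | "safety_dialogue"
  | "long_context" | "transcribe" | "image_read";
type Vendor = "anthropic" | "openai" | "google";
interface ModelChoice { vendor: Vendor; family: string }

const JOBS: Job[] = [
  "extract", "summarize", "tool_use", "safety_dialogue",
  "long_context", "transcribe", "image_read",
];
const VENDORS: Vendor[] = ["anthropic", "openai", "google"];

// Every vendor SDK sits behind this one interface (hypothetical shape).
interface ModelAdapter {
  complete(family: string, prompt: string): Promise<string>;
}

// Validate untrusted JSON into a complete routing table; fail loudly on gaps
// so a bad config never reaches a clinical workflow.
function parseRouting(json: string): Record<Job, ModelChoice> {
  const raw = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("routing table must be a JSON object");
  }
  const table = {} as Record<Job, ModelChoice>;
  for (const job of JOBS) {
    const entry = raw[job];
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`routing table is missing job "${job}"`);
    }
    let vendor: Vendor | undefined;
    for (const v of VENDORS) {
      if (v === entry.vendor) vendor = v;
    }
    if (vendor === undefined) throw new Error(`unknown vendor for job "${job}"`);
    if (typeof entry.family !== "string" || entry.family === "") {
      throw new Error(`missing family for job "${job}"`);
    }
    table[job] = { vendor, family: entry.family };
  }
  return table;
}

// Stub adapters stand in for real SDK wrappers so the sketch runs offline.
function stubAdapter(name: string): ModelAdapter {
  return {
    complete: async (family, prompt) => `[${name}/${family}] ${prompt.length} chars`,
  };
}
const ADAPTERS: Record<Vendor, ModelAdapter> = {
  anthropic: stubAdapter("anthropic"),
  openai: stubAdapter("openai"),
  google: stubAdapter("google"),
};

// Calling code names a job, never a vendor.
async function runJob(
  routing: Record<Job, ModelChoice>, job: Job, prompt: string
): Promise<string> {
  const choice = routing[job];
  return ADAPTERS[choice.vendor].complete(choice.family, prompt);
}
```

Because `parseRouting` rejects a table with a missing job or an unknown vendor, an eval pipeline can publish a new table without a deploy and a malformed update fails at load time rather than mid-encounter.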
Evals are the only thing that does not go stale
The dirty secret of frontier model selection in healthcare is that the public benchmarks barely matter. They are too generic, too contaminated, and too far from the actual prompt shape and failure modes of a production clinical workflow. The number that matters is your eval on your data, run weekly, across every candidate model, gated against the same release criteria.
A good clinical eval suite has four properties: it is locked and versioned; it covers the adversarial cases, not just the happy path; it scores against a clinically validated gold set, not against another model; and it runs cross-vendor on the same inputs. Without that, model selection is vibes. With it, the matrix above becomes a function you can recompute, not a blog post you have to trust.
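The four properties can be expressed as a small harness: one versioned suite, the same cases sent to every candidate, scores computed against gold answers, and a release gate. This is a sketch under assumptions: `EvalCase`, `EvalCandidate`, the exact-match scorer, and the threshold values are all hypothetical, and the candidate functions are synchronous stand-ins for real model calls.

```typescript
interface EvalCase {
  id: string;
  input: string;
  expected: string;     // clinically validated gold answer, not another model's output
  adversarial: boolean; // marks cases outside the happy path
}

interface EvalSuite {
  version: string;      // locked and versioned: reports record which suite produced them
  cases: ReadonlyArray<EvalCase>;
}

// A candidate is any function from input to output; a real one would wrap a model call.
type EvalCandidate = (input: string) => string;

interface EvalReport {
  suiteVersion: string;
  overall: number;
  adversarial: number;
  passesGate: boolean;
}

// Hypothetical release criteria, applied identically to every candidate.
const GATE = { overall: 0.9, adversarial: 0.8 };

// Exact-match accuracy. An empty slice scores 0, so a suite with no
// adversarial cases can never pass the gate.
function accuracy(cases: ReadonlyArray<EvalCase>, model: EvalCandidate): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    if (model(c.input).trim() === c.expected) hits++;
  }
  return hits / cases.length;
}

// Run every candidate on the same inputs and gate each against the same criteria.
function evaluate(
  suite: EvalSuite,
  candidates: Record<string, EvalCandidate>
): Record<string, EvalReport> {
  const adversarialCases = suite.cases.filter((c) => c.adversarial);
  const reports: Record<string, EvalReport> = {};
  for (const name of Object.keys(candidates)) {
    const overall = accuracy(suite.cases, candidates[name]);
    const adversarial = accuracy(adversarialCases, candidates[name]);
    reports[name] = {
      suiteVersion: suite.version,
      overall,
      adversarial,
      passesGate: overall >= GATE.overall && adversarial >= GATE.adversarial,
    };
  }
  return reports;
}
```

The per-candidate reports are what would feed the scoring matrix and, from there, the routing table. Exact match is the simplest possible scorer; summarization and dialogue jobs would need a richer one.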
The principle
The right question is never “which is the best model.” It is “which model for which job, on which data, with which failure modes, behind which routing interface, evaluated by which suite.” Teams that bet a clinical workflow on a single frontier vendor are taking on a risk that does not pay them anything: they get neither better accuracy nor lower cost nor stronger safety, and they get worse optionality every quarter as the frontier moves.
The architecture we install treats model choice as a swappable decision and the eval suite as the durable asset. The matrix in this post will be wrong by next quarter. That is fine. The system is designed so being wrong about the matrix costs you one config change, not one rebuild.
The routing layer, the eval harness, and the per-job model dispatch described above are part of our healthcare AI engineering engagements. When the work needs to land inside a clinical team and survive contact with real workflow, we embed it with forward-deployed engineers on site.