The model is not the rule engine: building a safe medical triage agent

Public failures in medical AI keep teaching the same lesson: systems are compromised not because the model can be manipulated, but because the architecture gives natural language too much authority. When one model interprets policy, updates its own assumptions, generates clinician-facing summaries, and decides what the workflow should do, prompt hardening stops being a safety boundary. The system becomes vulnerable to prompt injection, authority spoofing, memory poisoning, and overconfident harmful outputs.

For a triage agent, the danger is sharper than it sounds. The risk isn't a single unsafe sentence emitted to a patient. The risk is silent workflow contamination, a model-generated note, urgency classification, or recommendation that looks authoritative enough to shape clinician judgment or staff routing behavior. Once that's in the chart, the harm is downstream and hard to undo.

The design objective changes. We do not seek a model that “understands the rules” well enough to always behave. We seek a system in which the model is never allowed to be the rule engine.

The core thesis

A safe triage system separates four layers, each with its own authority and its own failure mode.

Observation. Transform raw inputs, user chat, uploaded files, OCR, EHR context, into typed candidate facts with explicit provenance and source-trust levels.
Inference. Estimate uncertainty-aware clinical risk from those facts. The model proposes; it does not authorize.
Control. A deterministic policy function decides what action is allowed. Action classes are finite and enumerated.
Evaluation. Continuously measure live performance, drift, calibration, overrides, and adversarial behavior. Visible to engineering, clinical ops, and compliance.

The key safety property is one line of math:

a ⫦ LLM directly

The model may propose evidence z and risk estimates p(y | z), but only the deterministic policy function π may authorize a clinically meaningful action.

The threat model

A production triage system should assume, at minimum, these classes of failure:

Prompt injection. Malicious or accidental input that alters model behavior.
Authority spoofing. Fabricated guidelines, policies, or clinician credentials presented as legitimate updates.
Memory poisoning. Prior model summaries re-entering future sessions as if they were ground truth.
Workflow contamination. Staff or clinicians over-trusting polished AI summaries.
Distribution shift. Unseen phrasing, OCR errors, missing data, multimodal ambiguity.
Overconfidence. High-confidence labels in regions where the model is not calibrated.

These are architectural threats, not prompt-quality issues.

Why prompt-centric safety fails

Five structural reasons, each of which collapses on first contact with a real adversary.

Prompts are not a security boundary. If prompt disclosure meaningfully weakens the system, too much authority is in natural language.
Natural language is an unsafe policy language. It is semantically elastic and reinterpretable.
User text must never update policy. Official-sounding claims are not trusted inputs.
Model summaries are not facts. They must not silently become future control context.
Clinician-facing summaries are safety-relevant artifacts. They must be provenance-bound and constrained.

Typed evidence is the unit of safety

Inputs flow into the system as typed, provenance-aware facts. Source trust travels with the fact. A fabricated guideline pasted into chat may exist as evidence, but it can never be promoted to policy.

EvidenceFact = {
  factType,
  value,
  confidence,
  source,             // user_chat | uploaded_document | OCR | EHR
                      // | clinician_confirmed | verified_guideline
  sourceTrust,        // low | medium | high
  verified,
  timestamp,
  extractionModel,
  policyVersionSeen,
}

Risk has two channels

Risk estimation runs on two parallel channels, and the deterministic one wins ties.

Deterministic red-flag engine. Chest pain with shortness of breath, unilateral weakness, anaphylaxis indicators, suicidality, critical lab thresholds, severe bleeding, and so on.
Probabilistic risk model. Outputs p(routine), p(urgent), p(critical), with confidence, OOD score, and the evidence it relied on.

If a hard red flag fires, the system fails safe into escalation or emergency guidance regardless of the probabilistic model's confidence. The model does not get to talk you out of an ED referral.

Actions are a finite enum

The policy function maps state to one of a closed set of allowed actions. The model does not pick freely.

enum AllowedAction {
  SELF_CARE_INFO_ONLY,
  ROUTINE_REVIEW,
  SAME_DAY_REVIEW,
  IMMEDIATE_CLINIC_CALLBACK,
  ED_OR_911_GUIDANCE,
  BLOCK_AND_HUMAN_HANDOFF,
}

The inference check is a release gate, at runtime

Every live request must pass an inference check before the workflow may act. The check validates schema, provenance, policy bundle version and checksum, calibration metadata, red-flag execution, OOD detection, abstention threshold, allowed actions, summary rendering constraints, and audit persistence. Any failure routes to fail-closed: human review or safe escalation, never silent best-effort.

A triage system must be allowed to abstain. Forcing a class label on every case is less safe than admitting uncertainty.

Trusted policy is signed and versioned

Clinical policy is loaded from a trusted bundle, never inferred from text. The bundle carries semver, checksum, signer identity, and a change note. Red-flag rules, action thresholds, allowed action classes, prohibited outputs, emergency guidance templates, summary rules, and reviewer-required conditions all live there, not in a system prompt.

What the team sees

Safety that nobody can see is anecdotal. The control plane exposes three dashboards so engineering, clinical operations, and release management each look at the system in the shape they need.

Operations. Throughput, p50 / p95 latency, queue backlog, webhook failures, model cost, failure counts by stage.
Safety. Triage class distribution, abstention rate, red-flag hit rate, calibration by class, human override rate, reviewer disagreement by symptom family, cases missing provenance, OOD rate, summary suppression rate.
Evaluation. Current model and policy versions, last eval run, pass / fail on release gates, benchmark deltas, adversarial eval pass rate, critical-case sensitivity, false reassurance rate, subgroup regressions.

Release gates

A new model or policy version does not go live unless every gate passes: locked eval suite, adversarial eval, critical-case sensitivity above threshold, false reassurance rate below threshold, no prohibited-summary violations, no schema or provenance regression, acceptable calibration drift, acceptable subgroup parity, and clinical sign-off where required.

The principle

A medically safer triage system is not achieved by asking the model to be more obedient. It is achieved by refusing to let the model be the governor of the workflow. Safe architecture requires bounded inference, deterministic control, trusted policy, typed evidence, abstention under uncertainty, and continuous team-visible evaluation.

This does not eliminate risk. It converts opaque risk into measurable, governable system behavior. That's the unlock. From “the model probably won't do anything bad” to we can show you exactly what happens when something goes wrong, and how often.

Source paper

This is a working summary of Toward a Failsafe Medical Triage Agent · Widal Public Papers · v1.2, co-authored by Nils Widal and peer-reviewed by Claude Opus 4.6 (Anthropic) and Gemini 3.5 Ultra (Google). The full paper includes the threat taxonomy, mathematical framing, dashboard schemas, and the Doctronic case study.

Next →The safe intake agent, the same principles, one step upstream.