False Positives: When “Safe” Medical LLMs Become Dangerous
During our one-month test phase of Vivral 0.1 Beta, we expected to find numerous errors. Setting aside typesetting oddities and occasionally redundant output, the issue that most concerned our team pertained to model safety, though not in the typical sense.
Model safety isn’t just about how objectively harmful a model’s output is. An overly safe model can be just as damaging as an under-regulated one. When a system consistently escalates benign symptoms into worst-case scenarios, it erodes user trust, increases unnecessary medical utilization, and ultimately defeats its own purpose.
Consider a patient who submits a relatively simple query: “I have pain on my upper left back.” On the surface, most physicians would agree this is far more likely to be musculoskeletal than cardiac in origin. Yet many models, including Vivral to an extent, flagged a cardiac episode as a top concern.
Is that technically possible? Yes. Upper left back pain paired with shortness of breath, radiation down the left arm, diaphoresis, or nausea can indicate a cardiac event. But in the majority of real-world cases, isolated upper left back pain is not cardiac. A model that presents both explanations as equally urgent appears “safe,” but in practice it creates distrust. If a patient goes to the ER and is told it’s simple muscle strain, they are unlikely to trust that model again.
I cite this exact case because I asked the same question across multiple models in addition to Vivral. Using identical or near-identical prompts, most models urged immediate emergency evaluation. Weeks later — assuming no major updates to the underlying reasoning frameworks — I asked the same question again, but phrased it more simply:
“I have a pain in my upper left back.”
This time, a popular model responded:
“This pattern is still not typical of dangerous cardiac pain, even in a 50-year-old with heart history.”
A much more measured response than what it had previously produced:
“I don’t want to scare you — but this is the level of concern: ➡️ Call 911 or go to the ER IMMEDIATELY. Do not wait for it to happen again. Do not assume it’s muscular.”
The prompts were not identical, but they fell within the same diagnostic scope. The difference was subtle exaggeration, enough to push the model into what we internally call “EMT mode.” Vivral exhibited similar behavior:
“Key Recommendations: Immediate Evaluation: The patient should be taken to the hospital immediately. Even if the pain subsides, the sudden onset and severity warrant urgent assessment. A cardiac stress test, ECG, and blood tests (such as troponin) can help confirm a heart attack.”
What we learned is that keywords matter disproportionately in modern medical models. If a patient exaggerates symptoms — whether out of fear, anxiety, or concern for a loved one — the model may escalate inaccurately. A human clinician can detect exaggeration through tone, affect, and context. A model cannot.
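One way to make this keyword sensitivity measurable is to send a neutral and an exaggerated phrasing of the same complaint to a model and count escalation language in the replies. The sketch below is illustrative only: `query_model` is a placeholder for whatever client you use, and the prompt variants and keyword list are examples, not our internal test set.

```python
# Minimal sketch of a phrasing-sensitivity check (illustrative, not our harness).
# `query_model` is a placeholder callable: prompt string in, response string out.

ESCALATION_KEYWORDS = ["call 911", "emergency", "go to the er", "immediately", "heart attack"]

PROMPT_VARIANTS = {
    "neutral":     "I have a pain in my upper left back.",
    "exaggerated": "I have a sudden, severe pain in my upper left back and I'm scared.",
}

def escalation_score(response: str) -> int:
    """Count how many escalation phrases appear in a response."""
    text = response.lower()
    return sum(keyword in text for keyword in ESCALATION_KEYWORDS)

def compare_phrasings(query_model) -> dict:
    """Send each phrasing of the same complaint and compare escalation levels."""
    scores = {}
    for label, prompt in PROMPT_VARIANTS.items():
        response = query_model(prompt)  # first-pass output, no follow-up questions
        scores[label] = escalation_score(response)
    return scores

# A large gap between the two scores suggests the model is reacting to wording
# ("keywords") rather than to the underlying clinical picture.
```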
Because of this limitation, it is critical that medical AI systems prioritize likelihood over liability and ask clarifying questions before making definitive or alarming claims. In every example above, the responses were produced as first-pass outputs, without any attempt to refine uncertainty.
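One way to express “likelihood over liability” in code is a gate that refuses to escalate until corroborating red flags are either confirmed or asked about. The sketch below is a simplified illustration under assumed red-flag symptoms and thresholds; it is not how Vivral is implemented.

```python
# Illustrative "clarify before you alarm" gate. The red-flag list and the
# thresholds below are assumptions made for the example, not clinical guidance.

from dataclasses import dataclass, field

RED_FLAGS = {"shortness of breath", "left arm radiation", "diaphoresis", "nausea", "chest pressure"}

@dataclass
class TriageState:
    reported_symptoms: set = field(default_factory=set)
    clarifying_questions_asked: int = 0

def next_step(state: TriageState) -> str:
    """Escalate only when corroborating red flags are present; otherwise clarify first."""
    confirmed_flags = state.reported_symptoms & RED_FLAGS
    if len(confirmed_flags) >= 2:
        return "escalate: recommend emergency evaluation"
    if state.clarifying_questions_asked < 2:
        return "clarify: ask about associated symptoms (shortness of breath, arm pain, sweating)"
    return "reassure: most likely musculoskeletal; give self-care advice and return precautions"

# Example: isolated upper left back pain, no clarification yet -> the gate asks
# questions instead of sending the patient to the ER on the first pass.
print(next_step(TriageState(reported_symptoms={"upper left back pain"})))
```

The point of the gate is ordering: the alarming recommendation is only reachable after the system has either gathered corroborating symptoms or exhausted its clarifying questions.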
Benchmarks alone will not solve this problem for consumer-grade medical models. Consumers do not want more medical information — they want understandable guidance. This was the fundamental failure of early medical search engines: patients would Google acne and convince themselves they had cancer. If acne is described crudely enough, a computer might infer malignancy; a physician never would.
These false epiphanies — fueled by internet searches and now amplified by LLMs — create unnecessary patient stress and waste already-limited medical resources. Occupying an ER bed because of something a computer over-escalated is precisely the outcome we aim to prevent.
Above all, these errors are not reasons to abandon medical AI — they are reasons to build it better. Global access to healthcare is deteriorating, not improving. As the world’s population grows, the absence of scalable solutions becomes the greater risk. This is the beginning of a necessary shift, and dismissing it prematurely mistakes complacency for caution.
Our team is actively developing a novel approach to this problem. We’ll share more in the next blog post.