Researchers tested ChatGPT's performance on emergency department triage recommendations in a structured review. They found version 4.0 correct 86% of the time, but big differences across studies raised questions. The work, published in a science journal, pooled earlier studies to see how AI might help sort patients fast when hospitals get crowded.

Key Takeaways

  • ChatGPT 4.0 got triage right 86% of the time across 14 studies with over 1,400 cases.
  • Older version 3.5 scored lower at 63% accuracy, with more errors.
  • Experts say it spots emergencies well but laypeople often misunderstand its advice.
  • More tests needed due to study differences and possible biases.

Background

Emergency rooms see chaos every day. Patients flood in with chest pain, broken bones, fevers. Nurses and doctors must decide who goes first. It's called triage. Get it wrong, and someone suffers. Or worse.


AI tools like ChatGPT popped up fast. People wondered if they could help. Hospitals face staff shortages. Wait times stretch long. Could a chatbot read symptoms and pick urgency levels? Researchers wanted answers.

This study pulled together 14 earlier tests covering 1,412 real patient cases or simulated scenarios. Teams fed details into ChatGPT: age, symptoms, vital signs. The AI assigned urgency colors, from red (see now) through orange (soon) down to blue (can wait). Real staff use the same scale.

And it worked, sort of. ChatGPT 4.0 beat the older version, but it wasn't perfect. Hospitals use systems like the Manchester Triage System: five levels, where red means life or death and blue means phone a doctor later.

One team checked how everyday folks read the AI's words. Experts liked it. Laypeople? Not so much. They mixed up advice. Sent some home too soon. Or rushed others in panic.


Key Details

The main analysis ran in R version 4.4. Pooled across studies, ChatGPT 4.0 hit 86% correct calls, with a confidence interval of 64% to 98%. That's a wide spread: studies differed a lot, with I² at 93%, which signals high variability.

ChatGPT 3.5 managed 63%, ranging from 43% to 81%. Also messy, at I² of 84%. Funnel plots hinted at publication bias, especially for 3.5, suggesting some studies may be missing from the published record.
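To make those heterogeneity numbers concrete, here is a minimal Python sketch of one common meta-analysis approach: inverse-variance pooling of study accuracies on the logit scale, then Cochran's Q and the I² statistic. The per-study counts below are invented for illustration; they are not the review's actual 14 studies.

```python
import math

# Hypothetical per-study results as (correct, total) -- illustrative only,
# not the data from the review.
studies = [(45, 50), (30, 40), (88, 100), (55, 70), (92, 95)]

def logit(p):
    return math.log(p / (1 - p))

# Inverse-variance fixed-effect pooling on the logit scale.
weights, effects = [], []
for k, n in studies:
    p = (k + 0.5) / (n + 1)                   # continuity-adjusted proportion
    var = 1 / (k + 0.5) + 1 / (n - k + 0.5)   # approximate logit-scale variance
    weights.append(1 / var)
    effects.append(logit(p))

pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_p = 1 / (1 + math.exp(-pooled))        # back-transform to a proportion

# Cochran's Q and I^2: the share of variation due to between-study
# differences rather than chance (the review reported I^2 = 93% for GPT-4).
Q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
df = len(studies) - 1
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0

print(f"pooled accuracy ~ {pooled_p:.2f}, I^2 = {I2:.0f}%")
```

An I² near 90% or above, as reported here, means most of the spread between studies reflects real differences in design, prompts, or case mix rather than sampling noise.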

How They Tested

Teams wrote prompts that gave the AI patient stories: short vignettes, plus real cases pulled from charts. The AI returned triage levels, and experts scored them against a gold standard.

In one set, GPT got red cases right 96% of the time, telling people to rush to the ER. For green, non-urgent cases, it correctly advised staying home 89% of the time, and it gave reasons why.

But over-triage happened in 11% of cases, sending too many people in. Under-triage was just 3.5%, better than some human rates. Trauma guidelines tolerate over-triage up to 35% but cap under-triage at 5%.
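Over-triage means the assigned level was more urgent than the true one; under-triage means less urgent. A toy Python calculation of both rates, using invented (assigned, true) level pairs where 1 is red (most urgent) and 5 is blue:

```python
# Hypothetical (assigned_level, true_level) pairs on a 1 (red) .. 5 (blue)
# scale -- illustrative, not study data. Lower number = more urgent.
pairs = [(1, 1), (2, 3), (3, 3), (1, 2), (4, 4),
         (5, 4), (3, 3), (2, 2), (4, 5), (3, 3)]

# Over-triage: assigned more urgent than true; under-triage: less urgent.
over = sum(assigned < true for assigned, true in pairs) / len(pairs)
under = sum(assigned > true for assigned, true in pairs) / len(pairs)

print(f"over-triage {over:.0%}, under-triage {under:.0%}")
```

The asymmetric thresholds (35% vs 5%) reflect that sending a stable patient in early wastes resources, while sending a sick one home can be fatal.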

Another test used the German-language Manchester Triage System. GPT-4 matched professionals with a kappa of 0.67, on par with untrained doctors at 0.68. GPT-3.5 lagged at 0.54.
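Kappa measures agreement beyond what chance alone would produce: 0 is chance-level, 1 is perfect. A self-contained sketch of Cohen's kappa, run on hypothetical triage labels rather than the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of cases where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater assigned labels independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical triage labels for 10 cases (illustrative, not study data).
gold = ["red", "orange", "green", "red", "blue",
        "green", "orange", "red", "green", "blue"]
gpt  = ["red", "orange", "green", "orange", "blue",
        "green", "orange", "red", "green", "green"]

print(round(cohens_kappa(gold, gpt), 2))
```

By common rules of thumb, kappa values in the 0.6 to 0.8 range, like the 0.67 reported for GPT-4, count as substantial agreement.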

All versions spotted true red cases. No misses there, which is a good sign. But fuzzy, borderline cases tripped them up.

"ChatGPT, particularly version 4.0, has the potential to improve triage accuracy and reduce unsafe decisions in emergency settings." – Lead researcher from the study team

They assessed study quality with QUADAS-2. Some studies were weak, with bias risks and applicability issues.

What This Means

ChatGPT could ease ER burdens. Faster sorts mean lives saved. Nurses get help on busy nights. But don't plug it in blind.

Version 4 shines. Its 86% beats human performance in some comparisons. Tired staff make more errors; the AI doesn't tire.

Laypeople struggle, though. A parent reads the AI's advice to 'watch symptoms,' thinks it's fine, and the kid worsens. That's the gap.

The variability screams caution. One study found 50% accuracy; the next, 100%. Why? Prompts? Case mix? The field needs standards.

Biases lurk too. Funnel plot asymmetry suggests positive results get published more often. The flops need publishing as well.

Hospitals test more. Add safeguards. Train on local data. Mix with nurse checks.


Under-triage low. Safety plus. Over-triage wastes beds. Fixable.

Fraser warns against standalone use. Gebrael says it should aid providers, not replace them. Knebel flags harm risks.

Real world next. Live ER trials. Not just vignettes.

Staff agreed with it at times. In one case, a doctor fixed an error after a GPT nudge.

Limits clear. No gold standard yet. More work ahead.

Frequently Asked Questions

What is triage in an emergency room?
Triage sorts patients by need. Red gets seen first. Blue last. Saves lives in crowds.

How accurate was ChatGPT 4.0?
It got 86% right in pooled studies. Better than 3.5's 63%. But results varied.

Can patients use ChatGPT for advice?
Experts say no. Laypeople misread it. See a doctor instead.
