Artificial intelligence might be able to help doctors by filling out rote paperwork, but it's not going to be useful in the ER anytime soon, a new study shows.
OpenAI's ChatGPT program provided inconsistent conclusions when presented with simulated cases of patients with chest pain, researchers report.
The AI returned different heart risk assessment levels for the exact same patient data -- not something doctors want to see when responding to a medical emergency.
"ChatGPT was not acting in a consistent manner,"said lead researcher Dr. Thomas Heston, an associate professor with Washington State University's Elson S. Floyd College of Medicine.
"Given the exact same data, ChatGPT would give a score of low risk, then next time an intermediate risk, and occasionally, it would go as far as giving a high risk,"Heston said in a university news release.
The AI also failed to perform as well as the traditional methods doctors use to judge a patient's heart risk, according to the findings published recently in the journal PLOS One.
For the study, researchers fed ChatGPT thousands of simulated cases of patients with chest pain. Earlier research showed the AI program can pass medical exams, so the hope was it would be of use when responding to medical emergencies.
Chest pains are a common complaint in the ER, and doctors must rapidly assess the urgency of a patient's condition.
Very serious cases can be easy to identify from symptoms, but lower-risk cases can be trickier, Heston said. It can be tough to decide whether a person should be kept in the hospital for observation or sent home.
Doctors today often use two measures to assess heart risk, called TIMI and HEART, Heston explained. These checklists serve as calculators that use symptoms, health history and age to determine the sickness of a heart patient.
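To make the checklist idea concrete, here is a minimal sketch of a HEART-style calculator. The real HEART score assigns 0 to 2 points to each of five elements (History, ECG, Age, Risk factors, Troponin) and maps the total to a risk category; the cutoffs below follow commonly published thresholds and are for illustration only, not clinical use.

```python
# Illustrative HEART-style risk calculator.
# Each argument is a 0-2 point value a clinician has already assigned.

def heart_score(history, ecg, age, risk_factors, troponin):
    total = history + ecg + age + risk_factors + troponin
    if total <= 3:
        category = "low"
    elif total <= 6:
        category = "intermediate"
    else:
        category = "high"
    return total, category

print(heart_score(1, 0, 2, 1, 0))  # -> (4, 'intermediate')
```

The point of the study is that a fixed calculator like this always returns the same answer for the same inputs, which is exactly the property ChatGPT lacked.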
By contrast, an AI like ChatGPT can evaluate billions of variables quickly, ostensibly meaning it might be able to analyze a complex medical situation faster and more thoroughly.
Researchers created three sets of 10,000 randomized simulated cases. The first set contained the seven variables used for the TIMI scale, the second contained the five variables used in the HEART scale, and the third had a more complex set of 44 randomized health readings.
When fed the first two data sets, ChatGPT agreed with the fixed TIMI and HEART scores only about half the time (45% and 48%, respectively).
On the last data set, researchers ran the same cases through four times and found that ChatGPT often couldn't even agree with itself. The AI returned different assessments for the same cases 44% of the time.
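The self-agreement check the researchers describe can be sketched in a few lines: run each case several times, then count the cases where the repeated answers were not all identical. The function and data below are hypothetical stand-ins for illustration, not the study's actual code.

```python
# Sketch of the self-agreement check: one inner list of repeated
# answers per case; a case "disagrees" if its answers are not all equal.

def self_disagreement_rate(runs):
    disagree = sum(1 for answers in runs if len(set(answers)) > 1)
    return disagree / len(runs)

runs = [
    ["low", "low", "low", "low"],
    ["low", "intermediate", "low", "high"],
    ["high", "high", "high", "high"],
    ["intermediate", "intermediate", "low", "intermediate"],
]
print(self_disagreement_rate(runs))  # -> 0.5
```

In the study, that rate came out to 44% on the 44-variable data set.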
The problem is likely due to the randomness built into the current version of the ChatGPT software, which helps it vary its responses to simulate natural language.
Such randomness is not helpful in health care, where treatment decisions require a single and consistent answer.
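The built-in randomness the study points to is typically implemented as temperature sampling: the model scores each possible next output, and a temperature setting controls how much it strays from the top score. The sketch below illustrates the general technique with made-up scores; it is not OpenAI's implementation.

```python
# Minimal sketch of temperature sampling. At temperature > 0 the same
# input can yield different outputs; at temperature 0 the choice is
# deterministic (always the highest-scoring option).
import math
import random

def sample(logits, temperature):
    if temperature == 0:  # greedy: always pick the top score
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.5, 0.5]  # hypothetical scores for three answers
print(sample(logits, 0))    # deterministic: always index 0
print(sample(logits, 1.0))  # stochastic: may differ run to run
```

This variability is useful for natural-sounding conversation but, as the study found, works against the single consistent answer a risk assessment requires.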
"We found there was a lot of variation, and that variation in approach can be dangerous,"Heston said. "It can be a useful tool, but I think the technology is going a lot faster than our understanding of it, so it's critically important that we do a lot of research, especially in these high-stakes clinical situations."
Despite this study, Heston said AI does have the potential to be truly helpful in the ER.
For example, a person's entire medical record could be fed into the program, and it could provide the most pertinent facts about a patient quickly in an emergency, Heston said.
Doctors also could ask the program to offer several possible diagnoses in difficult and complex cases.
"ChatGPT could be excellent at creating a differential diagnosis and that's probably one of its greatest strengths,"Heston said. "If you don't quite know what's going on with a patient, you could ask it to give the top five diagnoses and the reasoning behind each one. So it could be good at helping you think through a problem, but it's not good at giving the answer."
More information
The Mayo Clinic has more about AI in health care.
SOURCE: Washington State University, news release, May 1, 2024