
Artificial intelligence algorithms are being built into almost all aspects of health care. They’re integrated into breast cancer screenings, clinical note-taking, health insurance management and even phone and computer apps to create virtual nurses and transcribe doctor-patient conversations. Companies say that these tools will make medicine more efficient and reduce the burden on doctors and other health care workers. But some experts question whether the tools work as well as companies claim they do.
AI tools such as large language models, or LLMs, which are trained on vast troves of text data to generate humanlike text, are only as good as their training and testing. But the publicly available assessments of LLM capabilities in the medical domain are based on evaluations that use medical student exams, such as the MCAT. In fact, a review of studies evaluating health care AI models, specifically LLMs, found that only 5 percent used real patient data. Moreover, most studies evaluated LLMs by asking questions about medical knowledge. Very few assessed LLMs’ abilities to write prescriptions, summarize conversations or have conversations with patients — tasks LLMs would do in the real world.
The current benchmarks are distracting, computer scientist Deborah Raji and colleagues argue in the February New England Journal of Medicine AI. The tests can’t measure actual clinical ability; they don’t adequately account for the complexities of real-world cases that require nuanced decision-making. They also aren’t flexible in what they measure and can’t evaluate different types of clinical tasks. And because the tests are based on physicians’ knowledge, they don’t properly represent information from nurses or other medical staff.
“A lot of expectations and optimism people have for these systems were anchored to these medical exam test benchmarks,” says Raji, who studies AI auditing and evaluation at the University of California, Berkeley. “That optimism is now translating into deployments, with people trying to integrate these systems into the real world and throw them out there on real patients.” She and her colleagues argue that we need to develop evaluations of how LLMs perform when responding to complex and diverse clinical tasks.
Science News spoke with Raji about the current state of health care AI testing, concerns with it and solutions to create better evaluations. This interview has been edited for length and clarity.
SN: Why do current benchmark tests fall short?
Raji: These benchmarks are not indicative of the types of applications people are aspiring to, so the whole field should not obsess about them in the way they do and to the degree they do.
This is not a new problem or specific to health care. This is something that exists throughout machine learning, where we put together these benchmarks and we want it to represent general intelligence or general competence at this particular domain that we care about. But we just have to be really careful about the claims we make around these datasets.
The further the representation of these systems is from the situations in which they are actually deployed, the more difficult it is for us to understand the failure modes these systems hold. These systems are far from perfect. Sometimes they fail on particular populations, and sometimes, because they misrepresent the tasks, they don’t capture the complexity of the task in a way that reveals certain failures in deployment. This sort of benchmark bias issue, where we make the choice to deploy these systems based on information that doesn’t represent the deployment situation, leads to a lot of hubris.
SN: How do you create better evaluations for health care AI models?
Raji: One strategy is interviewing domain experts in terms of what the actual practical workflow is and collecting naturalistic datasets of pilot interactions with the model to see the types or range of different queries that people put in and the different outputs. There’s also this idea that [coauthor] Roxana Daneshjou has been doing in some of her work with “red teaming,” with actively gathering a group of people to adversarialy prompt the model. Those are all different approaches to getting at a more realistic set of prompts closer to how people actually interact with the systems.
Another thing we are trying is getting information from actual hospitals as usage data — like how they are actually deploying it and workflows from them about how they are actually integrating the system — and anonymized patient information or anonymized inputs to these models that could then inform future benchmarking and evaluation practices.
Sponsor Message
There are approaches that exist from other disciplines [like psychology] about how to ground your evaluations in observations of reality to be able to assess something. The same applies here — how much of our current evaluation ecosystem is grounded in the reality of what people are observing and what people are either appreciating or struggling with in terms of the actual deployment of these systems.
SN: How specialized should model benchmark testing be?
Raji: The benchmark that is geared towards question answering and knowledge recall is very different from a benchmark to validate the model on summarizing doctors’ notes or doing questioning and answering on uploaded data. That kind of nuance in terms of the task design is something that I’m trying to get to. Not that every single person should have their own personalized benchmark, but that common task that we do share needs to be way more grounded than multiple-choice tests. Because even for real doctors, those multiple-choice questions are not indicative of their actual performance.
SN: What policies or frameworks need to be in place to create such evaluations?
Raji: This is mostly a call for researchers to invest in thinking through and constructing not just benchmarks but also evaluations, at large, that are more grounded in the reality of what our expectations are for these systems once they get deployed. Right now, evaluation is very much an afterthought. We just think that there’s a lot more attention that could be paid towards the methodology of evaluation, the methodology of benchmark design and the methodology of just assessment in this space.
Second, we can ask for more transparency at the institutional level such as through AI inventories in hospitals, wherein hospitals should share the full list of different AI products that they make use of as part of their clinical practice. That’s the kind of practice at the institutional level, at the hospital level, that would really help us understand what people are currently using AI systems for. If [hospitals and other institutions] published information about the workflows that they sort of integrate these AI systems into, that can also help us think of better evaluations. That kind of thing at the hospital level will be super helpful.
At the vendor level too, sharing information about what their current evaluation practice is — what their current benchmarks rely on — helps us figure out the gap between what they are currently doing and something that could be more realistic or more grounded.
SN: What is your advice for people working with these models?
Raji: We should, as a field, be more thoughtful about the evaluations that we focus on or that we [overly base our performance on.]
It’s really easy to pick the lowest hanging fruit — medical exams are just the most available medical tests out there. And even if they are completely unrepresentative of what people are hoping to do with these models at deployment, it’s like an easy dataset to compile and put together and upload and download and run.
But I would challenge the field to be a lot more thoughtful and to pay more attention to really constructing valid representations of what we hope the models do and our expectations for these models once they are deployed.
Source link
Leave a Reply