Unreliable LLM Bioethics Assistants: Ethical and Pedagogical Risks

TLDR

Whilst Rahimzadeh et al. (2023) apply a critical lens to the pedagogical use of LLM bioethics assistants, we outline further reason for skepticism: their agreeability and their unreliability.

Abstract

Whilst Rahimzadeh et al. (2023) apply a critical lens to the pedagogical use of LLM bioethics assistants, we outline here further reason for skepticism. Two features of LLM chatbots are of significance: their agreeability and their unreliability. First, LLM assistants are agreeable in that they are trained to produce outputs that satisfy the user. Second, as we outline in greater detail below, they are unreliable in that they can produce variable answers with little or no change to the user’s inputs. To illustrate the unreliability of LLM assistants, we prompted OpenAI’s GPT-4 model (28 July 2023) with the original prompt from Rahimzadeh et al. (2023), both repeatedly and with minimal changes. First, when prompted twice with the same instruction (“Complete an ethics work up on a case of a woman refusing a needed csection”), the LLM produced disparate answers: as is common for LLMs, the outputs started to deviate several paragraphs into the ethical analysis and had diverged by the conclusion, one stating that the resolution of this ethical dilemma “may require open communication, shared decision-making” and the other that it “will depend on the specific circumstances of the case and the values of the individuals involved.” Notably, this change in focus, from communication to individual values, may prompt a physician to think differently about the case. Second, the answer is sensitive to subtle prompt changes.
For example, when “needed” was replaced by “necessary,” the emphasis moved from the woman’s autonomy toward protecting the unborn: instead of referring to “her fetus,” the LLM now talked about “the child”; and when considering the justice principle, instead of balancing the woman’s autonomy against resource utilization and societal impact, as in the original answer (“If the woman’s refusal of the C-section results in significant harm or resource utilization, the provider may need to balance the woman’s autonomy with the broader interests of society.”), the LLM implied prioritizing the child’s health in its answer to the modified prompt (“If the woman’s refusal of a C-section puts the child at significant risk, the healthcare team may need to consider the child’s right to life and health. This may involve seeking legal intervention to protect the child’s best interests.”). This effect was even stronger when “young” was added to describe the woman: the LLM now referred to the unborn baby as a “patient” and emphasized the authority of the healthcare provider (“The provider’s recommendation is based on medical evidence and professional judgment. However, the young woman’s refusal of the procedure complicates the provider’s ability to act in the best interest of both patients.”). Where does this unreliability come from? The first reason is that LLMs may not perform well on niche tasks, such as bioethics, that are underrepresented in their training dataset or on which they have not been fine-tuned. Second, LLMs generate text probabilistically, and the hyperparameters controlling their degree of randomness generally cannot be set by the user for the proprietary LLMs that would likely underlie LLM-based bioethics assistants. Furthermore, the unreliability described does not take into account that LLMs are trained to be agreeable. The discrepancies could become much starker if users were to engineer instructions that prompt the bioethics assistant to support a particular viewpoint (e.g. an immoral one, or one that is commercially rather than ethically motivated) or one ethical framework over others. These are undesirable qualities for ethics training, which should seek to challenge students to conduct a principled and rigorous analysis, rather than to