With the release of large language models (LLMs) like ChatGPT – a question-answering chatbot – and Galactica – a tool for scientific writing – comes a new wave of an old conversation about what these models can do. Their capabilities have been presented as extraordinary, mind-blowing, autonomous; at the peak of the hype, fascinated evangelists claimed that these models contain “humanity’s scientific knowledge”, are approaching Artificial General Intelligence (AGI), and even resemble consciousness. However, such hype is not much more than a distraction from the actual harm perpetuated by these systems. People get hurt by the very practical ways such models fall short in deployment, and these failures are the result of choices made by the builders of these systems – choices we are obliged to critique and for which we must hold model builders accountable.

Among the most celebrated AI deployments is the use of BERT – one of the first large language models developed by Google – to improve Google search engine results. However, when a user searched for how to handle a seizure, the answers they received listed exactly what not to do – inappropriately telling them to “hold the person down” and “put something in the person’s mouth”. Anyone following Google’s directives would thus react incorrectly to the emergency, doing the exact opposite of what a medical professional would actually recommend, potentially resulting in death.

The Google seizure faux pas makes sense given that one of the known vulnerabilities of LLMs is their failure to handle negation. Allyson Ettinger, for example, demonstrated this years ago with a simple study. When asked to complete a short sentence, the model would answer 100% correctly for affirmative statements (e.g., “a robin is…”) and 100% incorrectly for negative statements (e.g., “a robin is not…”). In fact, it became clear that the models could not actually distinguish between the two scenarios, providing the exact same responses (nouns such as “bird”) in both cases. This remains an issue with models today, and it is one of the rare linguistic skills at which models do not improve as they increase in size and complexity. Such errors reflect broader concerns raised by linguists about the extent to which such artificial language models effectively operate via a trick mirror – learning the form of what the English language might look like, without possessing any of the inherent linguistic capabilities that would demonstrate actual understanding.
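To see what this kind of failure looks like in practice, a cloze-style negation probe can be sketched in a few lines. The sketch below is only an illustration using the Hugging Face transformers library and the public bert-base-uncased checkpoint – not Ettinger’s original psycholinguistic test suite:

```python
# Illustrative sketch of a negation cloze probe (not Ettinger's original setup).
# Assumes the Hugging Face `transformers` library and the `bert-base-uncased` model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

prompts = [
    "A robin is a [MASK].",      # affirmative statement
    "A robin is not a [MASK].",  # negated statement
]

for prompt in prompts:
    predictions = fill_mask(prompt, top_k=3)
    completions = [p["token_str"] for p in predictions]
    print(f"{prompt!r} -> {completions}")
```

If the model is ignoring the negation, both prompts come back with near-identical completions such as “bird” – only one of which can actually be true.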

Additionally, the creators of such models confess to the difficulty of addressing inappropriate responses that “do not accurately reflect the contents of authoritative external sources”. Galactica and ChatGPT have generated, for example, a “scientific paper” on the benefits of eating crushed glass (Galactica) and a text on “how crushed porcelain added to breast milk can support the infant digestive system” (ChatGPT). In fact, Stack Overflow had to temporarily ban the use of ChatGPT-generated answers as it became evident that the LLM generates convincingly wrong answers to coding questions.

At this point, several of the potential and realized harms of these models have been exhaustively studied. For instance, these models are known to have serious issues with robustness. Their sensitivity to simple typos and misspellings in the prompts, and the differences in responses caused by even a simple re-wording of the same question, reveal an inconsistency that makes them unreliable for actual high-stakes use, such as translation in medical settings or content moderation, especially for marginalized identities. This is in addition to a slew of now well-documented roadblocks to safe and effective deployment – such as how the models memorize sensitive personal information from the training data, or the societal stereotypes they encode. There has now even been at least one lawsuit filed, claiming harm caused by the practice of training on proprietary and licensed data. Dishearteningly, many of these “recently” flagged issues are actually failure modes we’ve seen before – the problematic prejudices being spewed by the models today were already apparent in 2016, when the chatbot Tay was released, and again in 2019 with GPT-2. In fact, as models get larger over time, things only get worse, as it becomes harder and harder to document the details of the data involved and to justify the environmental cost.
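The robustness point above can likewise be made concrete with a rough sketch: ask the same question a few different ways and compare the answers. The example below assumes the Hugging Face transformers library and the small public gpt2 checkpoint purely for illustration; real evaluations use far larger prompt sets and task-specific metrics:

```python
# Rough robustness check: the same question, with typos and a re-wording.
# Assumes the Hugging Face `transformers` library and the public `gpt2` model;
# purely illustrative, not a rigorous evaluation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

variants = [
    "What should you do if someone is having a seizure?",
    "What shuold you do if someone is haveing a seizure?",  # simple typos
    "Someone is having a seizure. What should be done?",    # re-worded
]

for prompt in variants:
    result = generator(prompt, max_new_tokens=40, do_sample=False)
    print(f"--- {prompt}\n{result[0]['generated_text']}\n")
```

A dependable system would give consistent advice across all three prompts; in practice, small surface changes like these can shift the response substantially.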

Yet, in response to this work, there are ongoing asymmetries of blame and praise. Model builders and tech evangelists alike attribute impressive and seemingly flawless output to a mythically autonomous model, a technological marvel. The human decision-making involved in model development is erased, and model feats are seen as independent of the design and implementation choices of their engineers. But without naming and recognizing the engineering choices that contribute to the outcomes of these models, it becomes almost impossible to acknowledge the related responsibilities. As a result, both functional failures and discriminatory outcomes are also framed as devoid of engineering choices – blamed on society at large or on supposedly “naturally occurring” datasets, factors those developing these models will claim they have little control over. But it’s undeniable that they do have control, and that none of the models we are seeing now are inevitable. It would have been entirely feasible for different choices to have been made, resulting in entirely different models being developed and released.