There’s Bias in the Models

AI is a mixed bag. It doesn’t always work as intended, it’s frequently wrong, and it will turn rude or bigoted for seemingly no reason. However, when it works, it’s a nice additional bridge between a company and its customers, a user and their paper, etc. AI assistants let chatbots go in places where nobody was able to put one before, whether for lack of resources or because it wasn’t convenient. While some disclaimer is needed at the start to ensure they don’t get tricked into selling cars for a single dollar, they’re fairly harmless when it comes to general questions.

Fairly harmless. Aside from potentially unethical data collection, the second biggest issue with AI is that it’s feeding off of data put in by people. Beautiful, imperfect, wonderfully flawed people, humans with all different levels of literacy and belief in the paranormal, humans who in the Dark Ages thought smelling basil would spontaneously generate scorpions inside your skull, humans. There is an idea that if the world were run by machines, utilitarian and efficient, perfectly logical, everything would work out. Hunger would be solved. Diseases would be cured or managed. But the machines being made today and pitched as saviors are not those things. The worst fanfictions in the world are competing with Albert Einstein and Edgar Allan Poe for what word comes next in any given sentence. Even if the training data were purely nonfiction, the nonfiction is also written by people, and bias is built in – the saying “history is written by the winners” is demonstrated again and again in the gaps between different countries’ textbook accounts of the same wars.

The machine may be cold, but it’s not logical or utilitarian – every bias, every struggle with facts, every falsely held belief written down on the web is now also inscribed into the brain of basically every major LLM.

Even the ‘not technically LLM’ stuff suffers for it: a recent trend among writers has been to post screenshots of their completely insane Google Docs and Grammarly recommendations. Correctly spelled words get respelled like Olde Englishe and logically ordered sentences get turned into spaghetti, based on the collective data of everyone using the service, all across the bell curves of age and English literacy. Collecting that data would, in theory, be a hack to keep the services from becoming outdated, so nobody has to manually add new words to the built-in dictionaries every two weeks… but the machine doesn’t understand what makes unknown ‘words’ legible English, so common enough misspellings of real words become options in its algorithms.
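To make that failure mode concrete, here is a toy sketch in Python. It is not how Google Docs or Grammarly actually work (their internals aren’t public, and nothing here is taken from them); the function name `learn_vocabulary`, the `min_count` threshold, and the sample data are all made up for illustration. It only shows the general idea: if a dictionary grows by counting what users type, a misspelling that shows up often enough gets accepted right alongside the real word.

```python
from collections import Counter


def learn_vocabulary(typed_words, min_count=3):
    """Hypothetical frequency-based dictionary: accept any word typed at
    least `min_count` times. It has no notion of whether a word is
    'real' English; it only counts."""
    counts = Counter(word.lower() for word in typed_words)
    return {word for word, count in counts.items() if count >= min_count}


# Made-up usage data: one correct spelling, one common misspelling,
# and one rare misspelling.
typed = ["definitely"] * 5 + ["definately"] * 4 + ["defanitely"] * 1

learned = learn_vocabulary(typed)
print(sorted(learned))  # ['definately', 'definitely']
```

The rare typo gets filtered out, but the popular one sails through, because popularity is the only test being applied. Raising the threshold doesn’t fix that; it just changes how many people have to be wrong at once.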

Even the picture generators aren’t free of it! A recent experiment circulating on TikTok had scores of people asking image generators to ‘show a wine glass filled all the way to the top’. A human artist would almost certainly understand that query to mean ‘the liquid reaches the rim of the glass’, but even with finagling, rephrasing, and clarifying, the image generators were unable to give users a picture of a wine glass more than half full. This is because the training data doesn’t have many pictures like that. A human has imagination – an AI has an ocean of training data, and if the training data doesn’t have the right stuff, it can’t yet ‘imagine’ things it has limited data on.