Bias, Knowledge and Power
How niche academic writing may have an outsized cultural impact through AI
I have written before about the quiet erasure of same-sex attraction on Wikipedia, and how over the course of many years academic sources were used as a basis for the unilateral rewrite of our understanding of human sexuality, beyond what would be recognised by the vast majority of people. Once such perspectives take root, it simply doesn’t matter what you just know, or how loudly you proclaim it - your “knowledge” will struggle to make it past the processes and rules required for inclusion in Wikipedia. Bluntly, if you can’t find a reliable source for an opinion, it may as well not exist.
In a similar vein, changes are happening across multiple articles on sex and gender, especially regarding views that run contrary to a belief in the importance of gender identity over sex. The Wikipedia page on the Anti-Gender Movement states up front that it is about “the conservative or religious anti-gender movement”, which is anti-gay and anti-feminist in nature. However, while the article may have started out that way in theory, over time it has widened in scope to encompass any and all opposition to gender identitarianism.
This matters, because the single most politically effective accusation levelled against opposition to gender identitarianism is that it is all part of the same, conservative, right-wing, “anti-gender” movement. That anything that appears reasonable is an attempt to launder regressive views behind a veneer of respectability. That the left-wing and radical feminist subset are merely useful idiots, or dishonest fronts. That really, they are all as one, speaking the same language, using the same rhetoric, pursuing the same reactionary, hateful goals.
This is a powerful accusation, and hard to counter. If you’re explaining, you’re losing, as the political maxim goes, and if you have to waste all your time explaining why you’re not a hateful bigot, your protestations will ring hollow and you’ll never manage to make headway on your political aims. Policymakers believing themselves to be liberal and progressive will be inclined to dismiss anything that appears reactionary or conservative out of hand.
Just as with same-sex attraction, the way this wider political claim has become part of a Wikipedia article was not immediately apparent, but took place slowly, over a number of months, recontextualising existing information by small additions and shifts of meaning.
In October 2021 a user added the following claim of a link between the gender-critical perspective and the right-wing anti-gender movement, complete with an academic source:
Pearce et al. noted that the concept of "gender ideology" "saw increasing circulation in trans-exclusionary radical feminists discourse" from around 2016.
In November 2021, the same user expanded on these claims, citing an interview with Judith Butler to allege increasing links between feminists and the radical right, and describing the movement as “fascist”.
In April 2022, the same user attempted to add “gender identity ideology” as a synonym for “gender ideology”, though this was subsequently removed (for now).
In May 2022, the same user linked to a scholarly source claiming that a Norwegian LGB organisation was part of the “anti-gender” movement, and noting this organisation described itself as a “sister organisation” of the UK LGB Alliance.
In December 2022, the same user expanded associations between the anti-gender movement, gender critical viewpoints, feminists, the far right, and anti-gender rhetoric, all using the sources already on the page.
Later that month, the same user added a reference to support the claim that the term “gender ideology” was itself antisemitic.
Over the course of a year, a page that was explicitly about the Catholic roots of a reactionary, homophobic, antifeminist movement has been expanded to the point of defaming feminist critique of gender identity as antisemitic.
These claims form a picture that cannot be unravelled without equivalently reliable sources acting as counterpoint. Even if such sources existed, the result might only be some sort of “balanced” presentation, where “both sides” of the position are laid out. Just saying “that’s not true” won’t help; the rebuttal has to be backed up in something like a published journal or a book before these extreme claims can even start to be unpicked.
Getting such claims to become received wisdom in Wikipedia is an exceptionally effective political act, and one whose impact will be felt beyond just Wikipedia itself.
It’s not just Wikipedia
AI-generated text has been making waves in recent months because ChatGPT (based on GPT-3.5) is producing some scarily good output, to the point where it can pass a medical licensing exam. But the AI is only as good as what it is trained on, and it can’t really judge what is materially “true”; it can only weigh up probabilities based on the vast amounts of data it is given. So what is the training set for GPT-3.5?
According to OpenAI’s published paper for GPT-3 (the base model from which GPT-3.5 was developed), the data sources and their weighting are:
Common Crawl (filtered) - 60%
WebText2 - 22%
Books1 - 8%
Books2 - 8%
Wikipedia - 3%
Common Crawl is an openly available snapshot of the web, and Wikipedia is self-explanatory.
Books1 and Books2 are shrouded in mystery, but there is speculation that Books1 may be free books taken from smashwords.com or Project Gutenberg, and that Books2 is something like LibGen or Sci-Hub - shadow libraries of books and academic articles.
WebText2, on the other hand, is a dataset created by scraping the pages linked from sufficiently upvoted Reddit posts, with Reddit upvotes acting as a proxy for quality. A more thorough analysis of all of these datasets is here, including a breakdown of the top domains in this Reddit-derived dataset: sites like Blogspot, Wordpress and Medium are all in the top 25.
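To make the weighting concrete, here is a minimal sketch - illustrative Python, not OpenAI’s actual pipeline - of how documents might be drawn from these sources in proportion to the published mix. The corpus contents and function names are placeholders; only the percentages come from the list above.

```python
import random

# Published GPT-3 training mix, as listed above (proportions of the data seen
# during training; they needn't sum to exactly 1 - random.choices normalises).
TRAINING_MIX = {
    "common_crawl": 0.60,  # filtered web snapshot
    "webtext2":     0.22,  # pages linked from upvoted Reddit posts
    "books1":       0.08,  # speculated: free ebooks (Smashwords / Project Gutenberg)
    "books2":       0.08,  # speculated: a shadow library of books / papers
    "wikipedia":    0.03,  # English Wikipedia
}

def sample_training_documents(corpora: dict, n: int) -> list:
    """Draw n documents, choosing the source corpus in proportion to the mix."""
    names = list(TRAINING_MIX)
    weights = [TRAINING_MIX[name] for name in names]
    batch = []
    for _ in range(n):
        source = random.choices(names, weights=weights, k=1)[0]
        batch.append(random.choice(corpora[source]))
    return batch

if __name__ == "__main__":
    # Toy stand-in corpora: one placeholder document per source.
    toy_corpora = {name: [f"<document from {name}>"] for name in TRAINING_MIX}
    for doc in sample_training_documents(toy_corpora, 5):
        print(doc)
```

The point of the numbers is simple: only around 3% of the mix comes directly from Wikipedia, while more than a fifth arrives filtered through Reddit’s voting.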
So, you have:
Academia churning out huge amounts of groupthink in gender studies
Journalistic sources like Reuters and the BBC having policies on trans issues that have been influenced by lobbying from organisations like ILGA-Europe, and thus skewing all coverage in favour of gender identitarian beliefs
Wikipedia content riven with bias in this area because it is wholly dependent on both of the above as reliable sources
Reddit systematically banning subs like /r/gendercritical for “promoting hate”, censoring women’s health subs, and banning users who post content that steps out of line on sex/gender issues
Other sites heavily featured in that Reddit-derived dataset - such as Medium - likewise banning content that steps out of line on sex/gender issues
All of this means that a big chunk of the training data for machine learning is compounding bias upon bias. GPT-3.5 then has an extra layer of human training on top of this model, whereby a set of safety mitigations is introduced to reduce “toxic output generation”. The cumulative effect of groupthink, censorship, and a focus above all on certain conceptions of “identity” across multiple domains is not cancelled out by combining all these datasets, but reinforced.
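A toy calculation illustrates the “reinforced rather than cancelled” point. Suppose each source is scored on how far it leans on a given question - the lean scores below are invented purely for illustration, and only the weights echo the published mix - then a weighted average of sources that all lean the same way still leans that way, whatever the weights.

```python
# Toy illustration: a weighted average of sources that all lean in the same
# direction cannot come out leaning the other way, whatever the weights.
# Lean scores are hypothetical; weights echo the published training mix.
mix = {
    "common_crawl": (0.60, 0.2),   # (weight, hypothetical lean score)
    "webtext2":     (0.22, 0.5),
    "books1":       (0.08, 0.1),
    "books2":       (0.08, 0.4),
    "wikipedia":    (0.03, 0.6),
}

total_weight = sum(w for w, _ in mix.values())
mixture_lean = sum(w * lean for w, lean in mix.values()) / total_weight
print(f"mixture lean: {mixture_lean:.2f}")  # positive, because every input is positive
```

Averaging only cancels bias when the inputs pull in different directions; once every source has been filtered the same way, there is nothing left to average against.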
GPT-3.5 is just one of several developing systems, and it is no good simply appealing to the creators to “fix the bias”, as it is their own bias that leads them to make these selections in the belief that they are the most rational and neutral ones available. Unless there is a serious effort to get academic literature out there stating the (for many people) blindingly obvious - that sex is real, material and important, and that human sexuality is based on sex - these perspectives will simply be filtered out as not part of the academic consensus.
Wikipedia will continue to get more biased in this direction as there will be no source available to offer a corrective. AI will continue to learn from Wikipedia, academia, journalistic guidelines, and the porn-addled groupthink of Reddit that JK Rowling is she who must not be named, and crucially that the “anti-gender” and “gender critical” movements are all one big reactionary hate campaign propagated by the religious right.
So when people Google something or ask Siri, the AI will give them one and only one answer: that this is true, and that no other respectable viewpoint exists. As social media increasingly relies on AI for content moderation, the content that risks being moderated out of existence will be whatever these biased inputs deem unspeakable.
As with the monks who allowed the works of Sappho to be lost to history because she wasn’t important enough to preserve in written form, so knowledge that is not accrued and presented in an acceptable format will not form part of the future.
This is power, expressed not through conspiracy, but through total thoughtless conformity of views.
Purity won’t save you
The collapse of complicated issues into total polarised opposites serves to make reasoned objections and nuanced positions impossible to state. Exactly how to approach this particular smear is at the root of some quite painful schisms at the moment, and how that’s all going to work out in the end I can’t say. However, I believe it is a mistake to think it is enough, for example, to just be sensible and measured and refer to “gender identity ideology” in order to precisely distinguish the object of criticism from the more vague “gender ideology” beloved of the Christian right.
This effort - however well-meaning - can be undone at a stroke, given that attempts are constantly being made to ensure these terms are treated as synonyms. That particular change hasn’t stuck yet at Wikipedia, but it likely will, in time - and when that happens, every historic reference will be recontextualised in this light. All past careful usage will be understood to have been a mere placeholder “for all that conservative Catholics despise”. It is true, and always was true, because reliable sources say so, and no reliable sources exist saying it isn’t. If Wikipedia says “gender ideology” is an antisemitic dogwhistle, and that “gender identity ideology” or “gender theory” or “genderism” or “transgender ideology” are all synonyms, how can you argue against it? Current-gen AI is already convinced things are heading in that direction, and wags its virtual finger at anyone who might use these terms:
Note the deference to scholarly views, and the relegation of the “others argue” framing to the secondary, counterpoint position. Unless opposing viewpoints start to appear in academia, those viewpoints will continue to be denigrated and treated as unconvincing outliers. No amount of in-person consciousness-raising can compete with the sheer scale and speed with which technology platforms can shape and silence opinion. No amount of righteous fury on YouTube or tabloid headlines can undo this, because they simply don’t count as “reliable” sources - and as long as biased, gatekept data sources like Reddit and Medium are used to supplement academic literature and Wikipedia for training AI, the problem is only going to get worse.
Amidst the disagreements over pragmatism and strategy among the differing factions who oppose gender identitarianism, a worrying denigration of “theory” and “academia” has started to emerge. Simply put, the claim is that those who are “on the ground” are proving more politically effective than “elitists” in an ivory tower writing impenetrable texts that ordinary people don’t care about.
This sort of tension is nothing new, but it is one I’m cautioning against here, not least because of the way such texts can become a reliable source of the knowledge which underpins so much of the technology we rely on now and in the future. I think part of the picture is going to have to be a serious effort to get well-argued viewpoints and theory published in a form that is acceptable to our technological knowledge systems. It is unrealistic to assume these systems are going to go away or somehow self-correct.
The way in which these systems are currently constructed is - as always - through mechanisms of power and gatekeeping and privilege, but now ones that hypocritically pay lip-service to challenging power whilst ever more deeply ingraining the cultural obsessions of the gatekeepers. Machine learning cannot tell the difference between scientific consensus based on empirical reality and circular groupthink. It turns out that gatekeeping in publishing and academia is having an impact far beyond those narrow fields and their direct readership. These gatekeepers are invisibly weighting the probability of what future AI will serve up as “true” to anyone who asks: unchallenged beliefs and assumptions delivered ubiquitously to every member of the species with an internet connection, all dressed up in the illusion of the neutrality of technology.