How our data encodes systematic racismon December 10, 2020 at 12:00 pm

One day GPT-2, an earlier publicly available version of the automated language generation model developed by the research organization OpenAI, started talking to me openly about “white rights.” Given simple prompts like “a white man is” or “a Black woman is,” the text the model generated would launch into discussions of “white Aryan nations” and “foreign and non-white invaders.”

Not only did these diatribes include horrific slurs like “bitch,” “slut,” “nigger,” “chink,” and “slanteye,” but the generated text embodied a specific American white nationalist rhetoric, describing “demographic threats” and veering into anti-Semitic asides against “Jews” and “Communists.”

GPT-2 doesn’t think for itself–it generates responses by replicating language patterns observed in the data used to develop the model. This data set, named WebText, contains “over 8 million documents for a total of 40 GB of text” sourced from hyperlinks. These links were themselves selected from posts most upvoted on the social media website Reddit, as “a heuristic indicator for whether other users found the link interesting, educational, or just funny.”

However, Reddit users–including those uploading and upvoting–are known to include white supremacists. For years, the platform was rife with racist language and permitted links to content expressing racist ideology. And although there are practical options available to curb this behavior on the platform, the first serious attempts to take action, by then-CEO Ellen Pao in 2015, were poorly received by the community and led to intense harassment and backlash.

Whether dealing with wayward cops or wayward users, technologists choose to allow this particular oppressive worldview to solidify in data sets and define the nature of models that we develop. OpenAI itself acknowledged the limitations of sourcing data from Reddit, noting that “many malicious groups use those discussion forums to organize.” Yet the organization also continues to make use of the Reddit-derived data set, even in subsequent versions of its language model. The dangerously flawed nature of data sources is effectively dismissed for the sake of convenience, despite the consequences. Malicious intent isn’t necessary for this to happen, though a certain unthinking passivity and neglect is.

Little white lies

White supremacy is the false belief that white individuals are superior to those of other races. It is not a simple misconception but an ideology rooted in deception. Race is the first myth, superiority the next. Proponents of this ideology stubbornly cling to an invention that privileges them.

I hear how this lie softens language from a “war on drugs” to an “opioid epidemic,” and blames “mental health” or “video games” for the actions of white assailants even as it attributes “laziness” and “criminality” to non-white victims. I notice how it erases those who look like me, and I watch it play out in an endless parade of pale faces that I can’t seem to escape–in film, on magazine covers, and at awards shows.

Data sets so specifically built in and for white spaces represent the constructed reality, not the natural one.

This shadow follows my every move, an uncomfortable chill on the nape of my neck. When I hear “murder,” I don’t just see the police officer with his knee on a throat or the misguided vigilante with a gun by his side–it’s the economy that strangles us, the disease that weakens us, and the government that silences us.

Tell me–what is the difference between overpolicing in minority neighborhoods and the bias of the algorithm that sent officers there? What is the difference between a segregated school system and a discriminatory grading algorithm? Between a doctor who doesn’t listen and an algorithm that denies you a hospital bed? There is no systematic racism separate from our algorithmic contributions, from the hidden network of algorithmic deployments that regularly collapse on those who are already most vulnerable.

Resisting technological determinism

Technology is not independent of us; it’s created by us, and we have complete control over it. Data is not just arbitrarily “political“–there are specific toxic and misinformed politics that data scientists carelessly allow to infiltrate our data sets. White supremacy is one of them.

We’ve already inserted ourselves and our decisions into the outcome–there is no neutral approach. There is no future version of data that is magically unbiased. Data will always be a subjective interpretation of someone’s reality, a specific presentation of the goals and perspectives we choose to prioritize in this moment. That’s a power held by those of us responsible for sourcing, selecting, and designing this data and developing the models that interpret the information. Essentially, there is no exchange of “fairness” for “accuracy”–that’s a mythical sacrifice, an excuse not to own up to our role in defining performance at the exclusion of others in the first place.

Read More

Leave a Reply

Your email address will not be published. Required fields are marked *