The walls at the Soho20 gallery in Bushwick, Brooklyn, are covered in multi-colored Post-it notes, the kind you might find stuck to glass meeting room walls at a design firm or coworking space. But these notes are for a different kind of brainstorm.
“Intersectionality,” declares one in all-caps. “Men Explain Things to Me — Solnit,” another one reads, referencing a 2008 essay by the writer Rebecca Solnit. “Is there a feminist programming language?” asks another. “Buffy 4eva,” reads an orange Post-it, next to a blue note that proclaims, “Transwomen are women.”
These are all ideas for the themes and pieces of content that will inform the “Feminist Data Set”: a project to collect data about intersectional feminism in a feminist way. Most data is scraped from existing networks and websites or collected by surveilling people as they move through digital and physical space–as such, it reflects the biases of those existing systems. The Feminist Data Set, on the other hand, aspires to a more equitable goal: collaborative, ethical data collection.
That might sound abstract, but the stakes are high. A 2015 study found that Google is less likely to show advertisements for highly paid jobs to women than it is to men (the company responded that advertisers decide who they want to reach and Google has policies that restrict the use of ads based on some types of personal information). A recent investigation into three commonly used facial recognition algorithms showed that the accuracy when classifying a face’s gender was significantly lower for dark-skinned women than for light-skinned men.
These biases can result from flaws in the algorithm itself or in the data it was trained on. Because machine learning algorithms look for patterns in huge amounts of data, the problem often lies with the data. The facial recognition study found such wide disparities across skin tones and genders primarily because people like dark-skinned women were scarce in the algorithms’ training data. AI tends to compound the problems of a sexist data set, codifying them into a formula that can appear rational, even just, despite its bias. A key solution, then, is better data.
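The mechanism is simple enough to sketch. Below is a minimal, hypothetical illustration of how a group that is scarce in training data ends up with worse accuracy; the groups, labels, and “model” are invented for this sketch and are not drawn from the study above.

```python
# Hypothetical sketch: unrepresentative training data skews per-group accuracy.
# All data and the "model" here are invented for illustration.
from collections import Counter

# Imaginary training data as (group, label) pairs. Group "A" dominates.
training = [("A", "positive")] * 90 + [("B", "negative")] * 10

# A naive model that always predicts the most common label it saw in training.
majority_label = Counter(label for _, label in training).most_common(1)[0][0]

def predict(_example):
    return majority_label

# A balanced test set, where group B's true label differs from group A's.
test = [("A", "positive")] * 50 + [("B", "negative")] * 50

def group_accuracy(group):
    examples = [(g, y) for g, y in test if g == group]
    correct = sum(predict(g) == y for g, y in examples)
    return correct / len(examples)

print(group_accuracy("A"))  # 1.0 -- the over-represented group
print(group_accuracy("B"))  # 0.0 -- the under-represented group
```

The toy model looks perfectly “rational” on the data it was trained on, yet fails completely for the under-represented group: the disparity comes entirely from the composition of the training set.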
“Unlike machine learning projects where you’ll scrape a bunch of data, and just throw it at something, I’m interested in how you can ethically grow data,” says Caroline Sinders, the artist, designer, and machine learning researcher behind the project. “What does sustainable, collaborative data collection look like?”
A Feminist Approach To Data
Sinders is using something akin to the design thinking process, which relies heavily on brainstorming and talking to the data set’s future users. She holds workshops at various art spaces around the world, where she sits down with people–mostly artists and technologists from the communities around those art spaces–and together they sketch out what types of writing and ideas they want to add to a feminist data set.
Step one? Sinders asks everyone in the room to spend five minutes brainstorming ideologies (like femininity, virtue, and implicit bias) and specific pieces of content (like old maid, cyberfeminism, and Mary Shelley) for the data set on sticky notes. Then, the entire group organizes them into categories, from high-level ideological frameworks down to individual pieces of content. The exercise is a chance for a particular artistic community to have a say over what feminist data is, while participating in an open-source project that they’ll one day be able to use for their own purposes. Right now, the data set includes a gender-neutral dictionary, essays by Donna Haraway, and journalist Claire Evans’s new book Broad Band, a female-centric history of computing.
A Chance To Encourage Data Literacy
Besides deciding what should go in the data set, the workshops act like a crash course in what data is and how data sets work. “It provides illumination as to how do you take something that’s inherently qualitative–like an ideology or a movement like feminism–and put it into something that has to be quantitative, like an Excel spreadsheet,” Sinders says. “What ends up emerging is data literacy amongst the entire group.”
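That qualitative-to-quantitative step can be pictured as flattening an idea into a spreadsheet-style row. The sketch below is purely illustrative: the column names and entries are invented for this example, not Sinders’s actual schema.

```python
# Hypothetical sketch: turning qualitative workshop suggestions into
# quantitative, spreadsheet-style rows. Columns and entries are invented.
import csv
import io

columns = ["title", "author", "type", "themes"]
entries = [
    {"title": "Broad Band", "author": "Claire Evans",
     "type": "book", "themes": "history of computing"},
    {"title": "Men Explain Things to Me", "author": "Rebecca Solnit",
     "type": "essay", "themes": "gendered discourse"},
]

# Write the rows as CSV, the kind of flat format a spreadsheet expects.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(entries)
print(buffer.getvalue())
```

Even this trivial flattening forces choices–which columns exist, what counts as a “theme”–which is exactly the kind of decision the workshops surface and make visible to participants.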
Crucially, her workshops expose more people to the fact that data isn’t immutable–far from it. They show that data is something created by a group of people, and that it reflects the values and positions of those people.
Seeing Biased Data, Firsthand
Sinders witnessed firsthand how bias can perpetuate itself during her time working as a design researcher at the Wikimedia Foundation, where she focused on the patterns of online harassment. Inequity at Wikipedia works in insidious ways. “Wikipedia is built by volunteers under an ethos of an open source shareable encyclopedia, which means that it reflects the value of the community,” she says. “Even though the community agrees on neutrality and it’s pretty neutral, the pages that get built reflect who’s building them, and you can have all sorts of inequity from that.” Namely, you end up with a gender-skewed encyclopedia, where only 30% of biography pages are for women. (Wikipedia has acknowledged its gender problem and supports edit-a-thons to encourage more women to edit the encyclopedia.)
The original idea for the data set came from Sinders’s work with online harassment, both at Wikimedia and as an Open Labs fellow at BuzzFeed and the New York art hub Eyebeam. As she was collecting ethnographic data on the alt-right on platforms like Reddit and 4chan, she began to look for what she called “a palate cleanser.” “I started thinking about it from the opposite end,” she says. “How do you think about datasets as protest? I think it’s important as a technologist and artist to look at the systems you exist in and think of ways to make them better or protest them or intervene in them.”
The data set’s larger purpose, of course, is to create an archive that depicts what feminism looks like in 2018. To inform the project, Sinders has been traveling to different feminist bookstores and archives with the intent of including their works in the data set and listing them as collaborators. She also plans to include the transcripts of interviews she’s doing, such as one with a feminist archivist who runs the Open Archive, which helps people save information to the Internet Archive while protecting their privacy.
The Slow Data Movement
Ultimately, data is only valuable when it is used. The data set will be open source, but Sinders has a very particular use in mind for it. “I’m eventually going to use all this data to train a feminist chatbot,” she says–another art piece in itself, with no explicit educational goal. The chatbot was her original goal, but she realized that the data to train it didn’t exist, so she had to create it all herself.
She has a long way to go. Sinders just wrapped up a workshop in New York at the gallery Soho20, and still needs to add each of the suggested works to the data set. From the London workshop at a gallery called Space, she has about 15 works so far: books, essays, and images. Next, she’s headed to Belgrade, Serbia, this summer, and then to San Francisco’s Yerba Buena Center for the Arts to do more workshops.
Sinders isn’t fazed by the project’s creeping pace. That’s entirely the point. “By doing these in small iterations, I think it’s one of the better ways to work toward less biased datasets,” she says. “There’s a slow food movement. Maybe this is a slow data movement.”