Any gay lady will tell you that stereotypes are a double-edged sword. On one hand, it’s really frustrating to explain over and over that yes, you’re gay and no, you don’t hate men. On the other foot, I know I’m not the only one who relies on a girl’s swagger or her alternative lifestyle haircut to let me know if she plays for the home team. Some stereotypes exist because there’s often a tiny grain of truth buried deep in them–some pattern that has appeared over time.
Researchers at the Mitre Corporation are taking advantage of these patterns to try to figure out who Twitter users really are. In a paper presented this week at the Conference on Empirical Methods in Natural Language Processing, they showed how an algorithm that they created is capable of determining a user’s age, location, gender, and political affiliation by analyzing their tweets. The algorithm scans a user’s tweets for the presence of sociolinguistic cues–words, characters, and phrases that are more likely to be used by a specific group–to determine the user’s demographic characteristics.
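To make the idea of "scanning for cues" a little more concrete, here is a toy sketch in Python. It is not the researchers' algorithm: the cue list, the weights, and the function names (`cue_score`, `guess_gender`) are all made up for illustration, and a real system would learn its weights from a large set of labeled tweets rather than have them typed in by hand.

```python
# Hypothetical cue weights, invented for illustration (not taken from the paper).
# Positive values lean "female", negative values lean "male".
CUE_WEIGHTS = {
    "omg": 1.0,
    "!!!": 0.8,
    ":)": 0.6,
    "dude": -0.7,
    "bro": -0.9,
}


def cue_score(tweets):
    """Sum the weight of every cue occurrence across a user's tweets."""
    text = " ".join(tweets).lower()
    # str.count finds non-overlapping occurrences of each cue substring.
    return sum(weight * text.count(cue) for cue, weight in CUE_WEIGHTS.items())


def guess_gender(tweets):
    """Binary guess: a positive total leans 'female', anything else 'male'."""
    return "female" if cue_score(tweets) > 0 else "male"


if __name__ == "__main__":
    print(guess_gender(["omg!!! :)"]))
    print(guess_gender(["dude, nice game bro"]))
```

Notice that the sketch can only ever answer "female" or "male", and that a user with no cues at all falls into "male" by default. Both are side effects of forcing a two-way choice, not anything the tweets themselves say.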
While the researchers believe their technology could be useful for advertisers looking to break into niche markets on Twitter, others have suggested that it could be helpful in uncovering impostors. The Atlantic Wire’s Rebecca Greenfield thinks that this kind of technology could help us avoid another man-pretending-to-be-a-lesbian fiasco. While I like her thinking, I’m not sure if it could stop another Gay Girl in Damascus. The program was written to recognize certain features and classify users based on binary categories (I’m going to save my breath/fingertips here because talking about how problematic a gender binary is would be preaching to the (intelligent, savvy, and attractive) choir). Let’s take a look at some of the classifiers so that you get an idea of what exactly it’s scanning for.
Hilarious? Yes. Easy to trick? You bet. In the words of Rachel’s Real Computational Linguist friend, “to be honest, all someone would have to do to try and fool this thing into thinking it’s a woman is say ‘omg!!! :)’” It doesn’t take a PhD to figure out that women tend to be more effusive online; in fact, I bet Tom MacMaster or Bill Graber could tell you all about it. Besides, LGBT people tend to experience less pressure to conform to gendered expectations, making it more difficult to pinpoint the gender of a queer user based on key words alone and even tougher to spot a fraud.
Patterns, stereotypes, and formulas only get you so far. The algorithm is based on the idea that men and women use language differently, something that’s been shown over and over in research. The issue here is that while men and women as aggregate groups behave differently, it’s an ecological fallacy to believe that you can accurately gauge the characteristics of an individual man or woman from those group-level patterns.
The results of the experiment don’t do much to back up the algorithm’s accuracy either. The baseline for gender was 55%–meaning that since 55% of Twitter users in the sample are women, if the computer simply guessed “woman” for every user it would be right 55% of the time. Using the program, the computer was able to determine the user’s gender correctly 72% of the time–only a 17-percentage-point improvement over that baseline. While it’s clear that there are a variety of markets for programs that allow you to ascertain the intent and disposition of otherwise anonymous internet users, it looks like it’ll be a while before they hit the shelves.
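If you want to check the arithmetic yourself, here is a quick sketch. The only numbers taken from the article are the 55% baseline and the 72% accuracy; the variable names and the three "no-information" guessing strategies are my own framing, and I am assuming the 55% figure corresponds to always guessing the larger group.

```python
# Figures quoted above; everything else is derived from them.
share_women = 0.55   # share of women among the users in the sample
accuracy = 0.72      # accuracy reported for the program

# Three ways to guess without reading a single tweet:
always_majority = max(share_women, 1 - share_women)       # always say "woman"
coin_flip = 0.5                                           # 50/50 guess each time
proportional = share_women**2 + (1 - share_women)**2      # say "woman" 55% of the time

# How much better the program does than the best blind guess.
point_gain = accuracy - always_majority                   # in percentage points
relative_gain = point_gain / always_majority              # relative to the baseline
error_reduction = point_gain / (1 - always_majority)      # share of baseline errors fixed

if __name__ == "__main__":
    print(f"always guess majority: {always_majority:.1%}")
    print(f"coin flip:             {coin_flip:.1%}")
    print(f"proportional guessing: {proportional:.1%}")
    print(f"gain in points:        {point_gain * 100:.0f}")
    print(f"relative gain:         {relative_gain:.1%}")
    print(f"error reduction:       {error_reduction:.1%}")
```

Seen this way, the program beats the laziest possible strategy by 17 points, which still leaves it wrong about more than one user in four.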