When I told a friend about my work at Jetpac he nodded sagely and said "You just can't resist Facebook data can you? Like a dog returning to its own vomit". He's right, I'm completely entranced by the information we're pouring into the service. All my privacy investigations were by-products of my obsessive quest for data. So with Facebook's IPO looming, why do I think research using its data will be so world-changing?
Everyone is on Facebook. I know, you're not, but most organizations can treat you like someone without a phone or TV twenty years ago. The medium is so prevalent that if you're not on it, it's commercially viable to ignore you. This broad coverage also makes it possible to answer questions with the data that are impossible with other sources.
It's intriguing to know which phrases are trending on Twitter, but with only a small proportion of the population on the service, it's hard to know how much that reflects the country as a whole. The small and biased sample immediately makes every conclusion you draw suspect. There are plenty of other ways to mess up your study of course, but if you have two-thirds of a three hundred million population in your data, that makes a lot of hard problems solvable.
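To put rough numbers on why coverage matters, here's a minimal sketch of the worst-case arithmetic. It assumes you've measured how common some trait is among the people in your data, and that you know nothing at all about everyone else, so the people you can't see might all have the trait or none of them might. The function name, the 10% figure for a small service and the 40% observed rate are all made up for illustration; only the two-thirds figure comes from the paragraph above.

```python
def population_bounds(observed_rate, coverage):
    """Worst-case bounds on how common a trait is in the whole population.

    observed_rate: fraction of the people in the data who show the trait.
    coverage: fraction of the whole population that is in the data.
    Assumes nothing about the people outside the data: either all of
    them have the trait (upper bound) or none of them do (lower bound).
    """
    if not (0.0 <= observed_rate <= 1.0 and 0.0 < coverage <= 1.0):
        raise ValueError("rates must be fractions between 0 and 1")
    low = observed_rate * coverage
    high = low + (1.0 - coverage)
    return low, high


# Hypothetical numbers: 40% of the people in the data show the trait.
for label, coverage in [("small service", 0.10), ("two-thirds coverage", 2.0 / 3.0)]:
    low, high = population_bounds(0.40, coverage)
    print(f"{label}: true rate is between {low:.0%} and {high:.0%}")
```

The gap between the two bounds is always just the share of the population you can't see, so the range shrinks as coverage grows. Real studies do better than the worst case by reweighting the sample, but that relies on assumptions about the missing people that broad coverage lets you avoid.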
Love, friendship, family, cooking, travel, play, partying, sickness, entertainment, study, work: We leave traces of almost everything we care about on Facebook. We've never had records like this, outside of personal diaries. Blogs, government records, school transcripts, nothing captures such a rich slice of our lives.
The range of activities on Facebook not only lets us investigate poorly-understood areas of our behavior, it allows us to tie together many more factors than are available from any other source. How does travel affect our chances of getting sick? Are people who are close to their family different in how they date from those who are more distant?
The majority of my friends on Facebook update at least once a day, with quite a few doing multiple updates. We've found the average Jetpac user has had over 200,000 photos shared with them by their friends! This continuous and sustained instrumentation of our lives is unlike anything we've ever seen before. We generate dozens or hundreds of nuggets of information about what we're doing every week. This coverage means it's possible to follow changes over time in a way that few other sources can match.
It's at least theoretically possible for researchers to get their hands on Facebook's data in bulk. A large and increasing amount of activity on the site happens in communal spaces where people know casual friends will see it. Expectations of privacy are a fiercely fought-over issue, but the service is fundamentally about sharing in a much wider way than emails or phone calls allow.
This background means that it's technically feasible to access large amounts of data in a way that's not true for the fragmented and siloed world of email stores, and definitely isn't true for the old-school storage of phone records. The different privacy expectations also allow researchers to at least make a case for analyses like the Politico Facebook project. It's incredibly controversial, for good reason, but I expect to see some rough consensus emerge about how much we trade off privacy for the fruits of research.
I left the connections between friends until last because I think they're the least distinctive part of Facebook's data. It's nice to have the explicit friendships, but every communication network can derive much better information on relationships based on the implicit signals of who talks to whom. There are some advantages to recording the weak ties that most Facebook friendships represent, and it saves an extra analysis step, but even most social networks internally rely on implicit signals for recommendations and other applications that depend on identifying real relationships.
This is the first time in history that most people are creating a detailed record of their lives in a shared space. We've always relied on one-time, narrow surveys of a small number of people to understand ourselves. With Facebook's data we have an incredible source, so different from any existing data we can gather that it makes it possible to answer questions we've never been able to before.
We can already see glimmers of this as hackers machete their way through a jungle of technical and privacy problems, but once the working conditions improve we'll see a flood of established researchers enter the field. They've honed their skills on meagre traditional information sources, and I'll be excited when I see their results on far broader collections of data. The insights into ourselves that their research gives us will change our world radically.