Photo by Afsart
There's information about my Facebook data set scattered across multiple news articles, as well as in posts on this blog, but here's the full story of how it all went down.
I'm a software engineer; my last job was at Apple, but for the last two years I've been working on my own startup called Mailana. The name comes from 'Mail Analysis', and my goal has been to use the data sitting around in all our inboxes to help us in our day-to-day lives. I spent the first year trying (and failing) to get a toehold in the enterprise market. Last year I moved to Boulder to go through the Techstars startup program, where I met Antony Brydon, the former CEO of Visible Path. He described the immense difficulties they'd faced in the enterprise market, which persuaded me to refocus on the consumer side.
I'd already applied the same technology to Twitter to produce graphs showing who people talked to, and how their friends were clustered into groups. I set out to build that into a fully-fledged service, analyzing people's Twitter, Facebook and webmail communications to understand and help maintain their social networks. It offered features like identifying your inner circle so you could read a stream of just their updates, reminding you when you were falling out of touch with people you'd previously talked to a lot, and giving you information about people you'd just met.
It was the last feature that led me to crawl Facebook. When I meet someone for the first time, I'll often Google their name to find their Twitter and LinkedIn accounts, and maybe Facebook too if it's a social contact rather than a business one. I wanted to automate that Googling process so that, for every new person I started communicating with, I could easily follow or friend them on LinkedIn, Twitter and Facebook. My first thought was to use one of the search engine APIs, but I quickly discovered that they offer only very limited results compared to their web interfaces.
I scratched my head a bit and thought, "Well, how hard can it be to build my own search engine?" As it turned out, it was very easy. Facebook's robots.txt welcomed the web crawlers that search engines use to gather their data, so I wrote my own in PHP (very similar to this Google Profile crawler I open-sourced) and left it running for about six months. Initially all I wanted to gather was people's names and locations, so I could search on those to find public profiles. Talking to a few other startups, I found they needed the same sort of service, so I started looking into either exposing a search API or sharing that sort of 'phone book for the internet' information with them.
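For anyone curious about the mechanics, here's a stripped-down sketch of that kind of crawler in PHP. It's illustrative rather than my actual code: the host, the profile path, and the title-based parsing are all placeholders, and the robots.txt check is the bare minimum rather than a full parser.

```php
<?php
// Minimal crawler sketch: fetch a page only if the site's robots.txt
// allows it, then pull out the <title> as a stand-in for real profile
// parsing. Host, path, and parsing are placeholders, not Facebook's.

// Naive robots.txt check: false if a "Disallow" rule under
// "User-agent: *" is a prefix of $path, true otherwise.
function isAllowedByRobots($host, $path) {
    $robots = @file_get_contents("http://$host/robots.txt");
    if ($robots === false) {
        return true; // no robots.txt means no restrictions
    }
    $appliesToUs = false;
    foreach (explode("\n", $robots) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if (preg_match('/^User-agent:\s*(\S+)/i', $line, $matches)) {
            $appliesToUs = ($matches[1] === '*');
        } elseif ($appliesToUs &&
                  preg_match('/^Disallow:\s*(\S+)/i', $line, $matches)) {
            if (strpos($path, $matches[1]) === 0) {
                return false;
            }
        }
    }
    return true;
}

// Fetch a single page, identifying ourselves with a descriptive agent.
function fetchPage($url) {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_USERAGENT, 'ExampleCrawler/0.1');
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($handle);
    curl_close($handle);
    return $html;
}

$host = 'www.example.com';   // hypothetical host
$path = '/people/jane-doe';  // hypothetical public profile path

if (isAllowedByRobots($host, $path)) {
    $html = fetchPage("http://$host$path");
    if ($html !== false &&
        preg_match('/<title>(.*?)<\/title>/si', $html, $matches)) {
        echo "Found: " . trim($matches[1]) . "\n";
    }
    sleep(1); // throttle between requests; be a polite crawler
}
?>
```

A real crawler also needs a queue of URLs to visit, storage for the extracted names and locations, retries, and throttling; even a crawler that robots.txt welcomes will get its IP blocked quickly if it hammers a site.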
I noticed Facebook were offering some other interesting information too, like which pages people were fans of and links to a few of their friends. I was curious what sort of patterns would emerge if I analyzed these relationships, so as a side project I set up fanpageanalytics.com to let people explore the data. More and more people were asking about the data I was using, so before the site went live I emailed Dave Morin at Facebook to give him a heads-up and check that it was all kosher. We'd chatted a little previously, but I didn't get a reply, and he left the company a month later, so my email probably got lost in the chaos.
I had commercial hopes for fanpageanalytics; I felt there was demand for a compete.com for Facebook pages. But I was also just fascinated by how much the data could tell us about ourselves. Out of pure curiosity I created an interactive map showing how different countries, US states and cities were connected to each other, and released it. Crickets chirped, tumbleweed blew past, and nobody even replied to or retweeted my announcement. Only five or six people a day were visiting the site.
That weekend I was avoiding my real work but stuck for a blog post idea, and I'd been meaning to check out how good the online Photoshop competitors were. I'd also been chatting with Eric Kirby, a local marketing wizard, who had been explaining how effective catchy labels are for communicating complex polling data, e.g. 'soccer moms'. With that in mind, I took a screenshot of my city analysis, grabbed SumoPaint and started sketching in the patterns I'd noticed. After drawing those in, I spent a few more minutes coming up with silly names for the different areas and wrote up some commentary on them. I was a bit embarrassed by the shallowness of my analysis, and I was keen to see what professional researchers could do with the same information, so I added a postscript offering them an anonymized version of my source data. Once the post was done, I submitted it to news.ycombinator.com as I often do, then went back to coding and forgot about it.
On Sunday around 25,000 people read the article, via YCombinator and Reddit. After that a whole bunch of mainstream news sites picked it up, and over 150,000 people visited it on Monday. On Tuesday I was hanging out with my friends at Gnip trying to make sense of it all when my cell phone rang. It was Facebook's attorney.
He was with the head of their security team, whom I knew slightly because I'd reported several security holes to Facebook over the years. The attorney said that they were just about to sue me into oblivion, but in light of my previous good relationship with their security team, they'd give me one chance to stop the process. They asked for and received a verbal assurance from me that I wouldn't publish the data, and sent me a letter to sign confirming that. Their contention was that robots.txt had no legal force and that they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained; in their view, the only legal way to access any web site with a crawler was to obtain prior written permission.
Obviously this isn't the way the web has worked for the 16 years since robots.txt was introduced, but my lawyer advised me that the position had never been tested in court, and that the legal costs alone of being a test case would bankrupt me. With that in mind, I spent the next few weeks negotiating a final agreement with their attorney. They were quite accommodating on the details, such as allowing my blog post to remain up, and initially I was hopeful that they were interested in a supervised release of the data set with privacy safeguards. Unfortunately it became clear towards the end that they wanted the whole set destroyed. That meant I had to persuade the other startups I'd shared samples with to delete their copies, but finally in mid-March I was able to sign the final agreement.
I'm just glad that the whole process is over. I'm bummed that Facebook are taking a legal position that would cripple the web if it was adopted (how many people would Google need to hire to write letters to every single website they crawled?), and a bit frustrated that people don't understand that the data I was planning to release is already in the hands of lots of commercial marketing firms, but mostly I'm just looking forward to leaving the massive distraction of a legal threat behind and getting on with building my startup. I really appreciate everyone's support, stay tuned for my next project!