A friend recently sent me this link to a new legal document Facebook have added to their site:
http://www.facebook.com/apps/site_scraping_tos_terms.php

It's the first time I've seen Facebook formally lay out how they think the world should treat the web pages they've made public and, through their robots.txt, have indicated are open to crawling. What it says is what they told me when they threatened to sue me a few months ago: anyone who crawls the web must obtain prior written permission from every site.
Why should you care? It's their attempt to have their cake and eat it too. They want to make as much information as possible about their members public, so that they get traffic from search engines and drive brands to prioritize their Facebook pages, but they also know users stay trapped as long as their data is hard to transfer out of the service to any potential competitor.
So, stuck between these two incompatible goals, they've reached for the lawyers. They could change their robots.txt to disallow crawling, or remove the pages they've made public, but that would remove their valuable search traffic. There's a lot of legal backing to the rules in robots.txt, but you'll need deeper pockets than mine to contest Facebook's new interpretation.
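For anyone who hasn't looked at one, shutting out every crawler takes just two lines in a robots.txt file; here's a minimal sketch of what Facebook could publish if they actually wanted to stop automated collection:

    User-agent: *
    Disallow: /

Instead, their robots.txt does the opposite, welcoming crawlers in while the lawyers wait in the wings.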
What it means in practice is that large established companies are able to crawl (though always with the threat of legal action hanging over them), but smaller, newer startups will be attacked by Facebook's lawyers as soon as they look threatening. Google definitely fall foul of the new rules (caching web pages, the use of data for advertising purposes), so I'd be interested to know whether they've signed up. I know these changes would make it impossible for them to get started today, since they'd have to contact each and every website before they crawled it, and respond to demands like "an accounting of all uses of data collected through Automated Data Collection within ten (10) days of your receipt of Facebook's request for such an accounting". Avoiding that sort of mess was exactly why the industry agreed on robots.txt as a standard.
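To show how lightweight that standard is, here's a minimal sketch in Python using the standard library's urllib.robotparser; the crawler name and page URL are just placeholder examples:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt once, up front.
    robots = RobotFileParser()
    robots.set_url("http://www.facebook.com/robots.txt")
    robots.read()

    # A well-behaved crawler checks permission mechanically
    # before fetching each page, with no paperwork involved.
    if robots.can_fetch("ExampleBot", "http://www.facebook.com/public_page"):
        print("Allowed to crawl")
    else:
        print("Disallowed by robots.txt")

That's a few lines of code per site, versus a written permission request and a ten-day accounting obligation per site under the new terms.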
To be completely clear, I understand that Facebook need to protect their users' privacy. This does nothing to help that: anyone malicious is free to gather and analyze all the information Facebook have made public about people, since it's been left completely in the open with no technical safeguards. What this does is give Facebook a legal stick to beat anyone legitimate who tries to openly use the data they've made available in a way Facebook decides it doesn't like.