PeteSearch

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.



Five short links

Fivepuppies
Photo by Pinké

Brain of Mat Kelcey - Mat's been doing some interesting work with Common Crawl, and his blog is a must-read for anyone interested in extracting data from unstructured text.

Google releases natural language dictionaries - Based around Wikipedia page titles as their list of concepts, Google Research have released a really interesting resource, a bit like a thesaurus for machines. Even better, it's available under a liberal CC BY license, so there should be no problem using it in any sort of project.

Dumb like me - A scary story for anyone who makes their living with their mind, from my friend Russ Jurney; "Smart people, like the very attractive, get special treatment they do not know they are getting". Despite all our techno-utopianism, we're still reliant on a fallible hardware platform made of meat.

NSA's security guide to iOS 5 - A wealth of detailed practical information on securing Apple mobile devices.

Things you might not know about jQuery - A good refresher on some of the less obvious cool features of the framework. I've been using .data() extensively on Jetpac.

May 21, 2012

Want a magic wand?

RedRealmWand-560
I've been collaborating on and off with Nicholas Napp for years, including on a National Science Foundation grant for computer vision on mobile devices. He's extremely experienced in the world of traditional toys, as well as video games, and he's had a compelling dream of tying together motion sensors, smart phones, and a rich real-world game system to produce something magic. The gameplay uses precise location tracking and gestural control through a wand to give you an interface to cast spells that hurt or help other combatants in fights, or help you progress through adventures in other ways.

I love this sort of overlay of virtual layers on top of the physical world, and I believe Nick and his very talented creative collaborator Kevin Mowrer can execute on their vision. They're now on Kickstarter raising money to get started, so if you're intrigued, go check it out. I've donated $250 myself, and I can't wait to get my hands on an early wand.

May 20, 2012

Five short links

Fivedwarves
Photo by Randy Robertson

Fancy ML techniques don't matter much - "The reason I don’t like Kaggle is that it’s all about squeezing more juice out of existing data." There's a lot of hard-earned wisdom in this post, but I think he's over-estimating the professional world's familiarity with machine learning techniques, and underestimating how hard they are to acquire. I love Kaggle because it allows me to outsource a whole lot of work that requires very specialized skills, so I don't have to support a full-time ML engineer, and I don't want to spend the time and resources I'd need to train an existing team-member to be good at it when we'll only use it occasionally.

Are your cookies colluding? - The Mozilla folks have released a plugin showing how ad networks are connected, with a network graph visualization that actually seems useful, rather than just being pretty.

An interactive map of the Roman Empire - Calculates the travel time and cost for journeys in the ancient world. Tools like these bring back a perspective that anyone used to modern transport has lost, especially around the crucial power of the sea as much cheaper and faster than land for travel. I was first struck by their power when I ran across time-based maps like this for the medieval world, showing how much more connected coastal settlements were to fishing villages in other countries than to inland towns in their own. They also helped me understand how England held on to Dunkirk for so long!

Image Vision Labs - Offers advanced image-processing algorithms as a service. We seem to be locked in an escalating arms race between users determined to upload pictures of their genitals, and platforms determined to stop them.

Pilot lights are evil - Data-driven detective work on where the actual energy usage is going, with a conclusion that's given away in the title, but remains surprising!

May 18, 2012

Add humans to your data pipeline

Fargo-wood-chipper-scene

I was lucky enough to meet Chris Van Pelt of Crowdflower tonight, and it was fascinating to hear about some of the new developments bubbling away at the company. I'm a longtime fan, they add a lot of value beyond what you get from more basic crowd-sourcing services like Mechanical Turk, but I've always seen them as only an incremental improvement on their competitors. What Chris talked me through over beers felt like a true step forward though.

We started by chatting about their Real Time Foto Moderation tool. This is basically a penis removal tool for photo uploads; you feed in a stream of images and after a short delay you get back flagged results showing which were accepted according to the sort of criteria used by Apple's App Store for content. I was fascinated to hear about some of the rules - bare-chested guys are fine if they're outdoors, but not if they're inside!

This may not sound that revolutionary, but think about what this means. Your application code is calling an API, and getting results back, but behind the curtain is a workforce of humans! Chris likes to call this an RPC, a Remote Person Call. I'm not aware of any other service that allows this kind of unsupervised interaction, crowd-sourcing has always been much more of a batch process with manual transfers of inputs and outputs between the human and automated stages.

This is important because it turns human tasks into modules that can be flexibly inserted into your data pipeline just by signing up on the web site and installing a Ruby gem. This changes crowd-sourcing from a cumbersome custom process that you have to extensively plan up-front into something you can experiment with just like you would any other API. You can build prototypes in a few minutes, test ideas, benchmark against other solutions, and start shipping code much faster.

Chris is free to experiment on the other side of the abstraction layer too. He might partially or completely automate the process and applications would never need to know, as long as the quality of results is consistent. Human-driven versions are likely to be more expensive than computational ones, and the price people are willing to pay for particular services will be a strong signal of which ones are worth sinking developer time into.
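Here's a rough sketch of that abstraction layer in Python. It isn't Crowdflower's API: the function names, the fake "nsfw" filename convention, and the sample uploads are all made up for illustration. The point is only that the calling code sees one interface, so a human-backed check and an automated one can be swapped without the pipeline noticing.

```python
from typing import Callable, Iterable, List

# Any moderation backend maps an image URL to True (accept) or False (reject).
ModerationBackend = Callable[[str], bool]


def human_review(image_url: str) -> bool:
    """Hypothetical stand-in for a crowd-sourced check.

    A real version would send the image to a remote workforce and wait for
    an answer; here a naming convention fakes the verdict so the sketch runs.
    """
    return "nsfw" not in image_url


def automated_review(image_url: str) -> bool:
    """Hypothetical automated replacement with the same signature."""
    return not image_url.endswith("_nsfw.jpg")


def moderate(image_urls: Iterable[str], review: ModerationBackend) -> List[str]:
    """Pipeline stage: keep only the images the backend accepts."""
    return [url for url in image_urls if review(url)]


uploads = ["beach.jpg", "party_nsfw.jpg", "sunset.jpg"]
accepted_by_humans = moderate(uploads, human_review)
accepted_by_code = moderate(uploads, automated_review)
```

Because `moderate` only depends on the backend's signature, benchmarking one backend against another is just a matter of calling it twice and comparing the results.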

There's a lot of hard problems that benefit from a human in the loop, from sentiment analysis to transcription, and I'd love to have a library of APIs for all those that I could drop into my data pipeline as I'm working on new features. Crowdflower is starting to make this possible, so I'll be excited to follow their progress as they roll out more services. If you have an AI-hard problem that's driving you crazy, they might have a solution that lets you pretend we've solved AI!

May 18, 2012

David Thomas, RIP

Grandad

My grandfather David Thomas had a long life, and packed a lot in. He was one of the youngest lot to fight in World War II, but he didn't like to talk too much about the actual service he'd done. The easiest parts to get him talking about were the people, friends he'd lost, or who he'd stayed in touch with afterwards back into civilian life. He'd ended up in the navy, and on his way to a land base in Sierra Leone servicing torpedo bombers, he'd endured weeks below decks. He knew there wasn't much of a chance that far below if a U-boat struck, but what he remembered was the stink of so many men, without much access to a shower. He got on with it though.

That was his strength, getting on with it. At first when he came back from the war he worked on the buses, where his aircraft engine skills proved handy. When the buses went on strike, he needed to keep supporting his family and switched over to a job at the Post Office. That's one thing I remember, he always had wonderful access to catalogs showing special editions of stamps, and gave me discounted entry to the mail-order "Dinosaur Club" thanks to his connections. He was always keeping his eye out for things like that, little ways to help first his two daughters, then the grandkids like me, and finally the great-grandkids when they arrived.

He was devoted to his wife, my Nan, too, visiting her every day, all day in the hospital for months before she passed away a few years ago. He stayed active right until his end, despite an array of medical problems. It must have helped that he was surrounded by friends and family who loved him. I remember virtual traffic jams of people coming in to see him in his hospital bed, and within a few hours of a new ward the nurses would be new friends. One of the best presents I was ever able to give him was a calendar showing our pet photos, and the exact name, age, ownership, and character of all the animals in the latest one he received was a hot topic of conversation on my last visit to him two weeks ago. He devoured a box of chocolates that were another gift, but just a few days later he had a peaceful end, surrounded by family.

He's somebody I admire very much, for many reasons, but his kindness and lifetime of hard work to support his family stand out most of all. I miss him, but the positive impact he had through the way he lived his life will be around for a long time to come.

May 16, 2012

Shell Apps and Silver Bullshit

Brokenbullet
Photo by ChristianUK

I don't normally go so aggressive with a title, but this statement made me see red:

"If you plan to write an app for iOS or Android, you will save time and create a better product if you stick to Objective-C and Java, respectively."

Go to the iTunes store, download Jetpac for your iPad, and tell me whether you think it's native or HTML5. Guess what, it's heavily reliant on web code! Facebook's iOS apps take the same route, and in fact I'd recommend anybody with an 'always-online' app seriously consider the same approach.

Funnily enough, I agree with a lot of Benjamin's points about the costs of an HTML5 approach to app development. There's a lot more to creating a real app than pointing a native web view at your site, and that gets lost in the hype. What he misses is how much of a development tax you're paying when you're writing native code.

Designing in Interface Builder

"Remember web development in 2004? When you had to create pixel-perfect comps because every element on screen was an image?" I spent five years at Apple writing desktop applications, and I'm still often baffled and confused by Interface Builder. The whole process of creating native screens is an order of magnitude harder for designers, and not much better for developers. Just being able to preview the design in a web page and tweak the CSS live through Firebug is a massive time-saver.

Forgiving languages

Objective-C is finally starting to offer automatic memory management, but you'll still have to worry about buffer overflows and all sorts of other low-level details. Java is better, but not by much, and both require static typing. Modern languages like Ruby and Javascript are a lot more forgiving, and I've found that makes development faster and doesn't seem to introduce more bugs. Again, building Jetpac in a combination of server-side Ruby and client-side Javascript got us to market a lot quicker than we could have managed with a native approach. Just as one example, I forgot a retain on a single string in the little native code we do have, and that caused an intermittent crash that put our launch back a week.

Debuggable

Being able to remotely inspect a web view makes a world of difference to debugging. If you keep your app runnable in a desktop browser too, the tools for stepping through code are superb, and in a lot of ways they're now superior to something like Xcode. The amount of documentation and community answers from places like Stack Overflow for common problems is much larger for the web world than native code, which is also a massive help for resolving bugs.

Extendable

jQuery and Ruby's gem system are incredibly powerful. When we needed an autocomplete text box, grabbing a jQuery plugin was simple. Need support for Amazon S3 interaction? We could just grab a gem. The development community behind these technologies is far larger than that for native development, so you're a lot more likely to find an off-the-shelf solution to your problems.

Slow Deployment

We built up a whole set of useful practices around long installation cycles. In a lot of ways I still miss the extended stages of QA that were required for apps that shipped on physical media. The catch is, while you could theoretically do the same thing with an app that's served directly off a website, the disadvantages are so much larger that nobody does! I love being able to fix bugs and solve user problems immediately, and the iteration cycles on new features are much, much faster than anything that requires waiting on Apple's approval process.

Don't believe the hype

We still had to work hard to support hardware-accelerated scrolling and native-feeling swiping, but going HTML5 was definitely the right decision for our app. Is it for yours? I don't know, and neither does Benjamin. As always in engineering it's all about tradeoffs, and there's no alternative to researching how the drawbacks and advantages of each approach fit your unique requirements. Just don't believe anyone who tells you they know the One True Way to develop for mobile, on either side.

May 09, 2012

Five short links

Operator5
Picture from Pulp Covers

Ayasdi - A very seductive new visualization and analysis tool, it feels like they've learned a lot from Palantir's success.

Benford's Law: A revised analysis - I'd been using the original study that analyzed public company accounts for fraud over time using Benford's Law as a poster child for the application of numeric methods to journalism. I'm sorry to see that it turned out to be a bogus correlation (thanks to an increase of zeroes in revenue figures) but it's a good reminder of how important peer review and humility are as we're charging ahead with our new techniques. It's the sort of mistake that keeps me awake at nights, knowing how easy it would be to make.

Tika - A lovely collection of open source code to handle all sorts of file conversions to text. I built some similar functionality into the Data Science Toolkit, but I'm excited to see an Apache-supported alternative.

Stanford Part-of-speech Tagger - A walk-through of a slick project for categorizing words within unstructured English-language text.

The Next Big Thing - How Amazon should be using their information on customers' book habits to drive a social network. I'm convinced that implicit signals will win out over the follow/friend model when it comes to building communities of people, but nobody's built an example that actually works yet.
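For anyone who hasn't met the Benford's Law check mentioned in the links above: the law predicts that in many naturally occurring sets of numbers, the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with a 1. The Python sketch below compares observed first-digit frequencies against that expectation. It is not the method used in the study, and the sample data (powers of two, which are known to follow the law closely) is only an assumption to make the example runnable.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def benford_expected(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def leading_digit(value: float) -> int:
    """First significant digit, read off the scientific-notation mantissa."""
    return int(f"{abs(value):.15e}"[0])


def first_digit_frequencies(values: Iterable[float]) -> Dict[int, float]:
    """Observed share of each leading digit, ignoring zeroes."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Illustrative data only: the first hundred powers of two.
sample = [float(2 ** n) for n in range(1, 101)]
observed = first_digit_frequencies(sample)

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}, expected {benford_expected(d):.3f}")
```

A large gap between the two columns is a prompt to look closer, not proof of anything, which is exactly the lesson of the revised analysis.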

May 08, 2012

Five short links

Fiveofhearts
Picture by Jeff Trexler

Brenda Zulu discusses the state of the Zambian blogosphere - A reminder of the basic challenges facing technology in the developing world, with critical bloggers being chased out of the country. It's promising to hear how popular Twitter and Facebook are for microblogging though.

Rickshaw - A D3-based Javascript framework for drawing sophisticated interactive time-series graphs.

All interesting problems are scalability problems - I don't agree with the headline, but there's some spot-on observations in this post. Almost all the costs of successful software are in maintenance, but there's a heavy survivor's bias in those figures, since many codebases never even get used. There are a lot of parallels (if you'll excuse the pun) between the constraints of tiny embedded systems, and those of massively distributed software. That's what I love about engineering, the border between what's needed and what's possible is a rich fractal, with enough repeating patterns to re-apply lessons you've learned, but with plenty of variety so you've no excuse to be bored.

ArcSpread for analyzing web archives - Stanford runs a fantastic project for capturing important web pages as they change over time, and then presenting the results in a form that future historians will be able to use. This paper talks about some of the techniques they use for removing boilerplate navigation and ad content, so that researchers can work with the meat of the page.

MemoirTree - A simple but effective application for capturing oral history from the people around you. One of the joys I discovered during my forays into journalism was how everybody has an interesting story to tell you if you just sit down and ask them about their life, so I'm hoping this catches on.

April 25, 2012

Is MySQL viable for data mining?

Excavator
Photo by Aitor Escauriaza

I've been involved in an interesting Twitter conversation with Rafi Kam. I don't know anything about his background or plans, but he's obviously working on a data project. I was pleased to be able to point him to EC2 for Poets as a great introduction to Amazon's hosting, but this morning he asked "I'm concerned about nosql learning time and lack of simple querying. Can mysql be a viable back end for data mining?".

The quick answer is that MySQL and other traditional databases are absolutely viable for data mining. In most cases they're actually far superior to NoSQL solutions for anything that involves exploration and experimentation, simply because they have far more mature tools and documentation and a much more flexible interface.

My advice is to always start with a relational database when you're prototyping your product. NoSQL systems like Cassandra offer advantages once you're dealing with truly massive data sets, but relational databases will get you a long, long way. Once your queries start slowing down, that's the time to look at optimizing your database, whether it's by switching to a key/value solution, or more traditional approaches like heavier indexing or even vertically scaling by just buying a faster machine!

Now, NoSQL and the MapReduce approach to data processing are a lot of fun to play with, so I highly recommend learning more about them and using them in toy projects to get familiar with them, but unless the point of the project is to train yourself on the tools, start with something simpler.
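To make the "start relational" advice concrete, here's a minimal sketch in Python. It uses the standard library's sqlite3 module rather than MySQL so that it runs with no server to set up, and the table, columns, and rows are invented for illustration. The same SQL should carry over to MySQL with little or no change.

```python
import sqlite3

# In-memory database so the sketch needs no setup; the schema is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, country TEXT, photos INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(1, "FR", 12), (2, "FR", 3), (3, "JP", 40), (4, "JP", 8), (5, "US", 5)],
)


def photos_by_country(connection):
    """Ad-hoc exploratory question: which countries have the most photos?"""
    return connection.execute(
        "SELECT country, SUM(photos) AS total "
        "FROM visits GROUP BY country ORDER BY total DESC"
    ).fetchall()


print(photos_by_country(conn))

# When queries like this slow down on real data, an index on the column you
# filter or group by is usually the first thing to try.
conn.execute("CREATE INDEX idx_visits_country ON visits (country)")
```

The appeal for exploration is that a new question is just a new query string; there's no job to write or schema to redesign before you get an answer.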

April 24, 2012

Five short links

Starsketch
Picture by Matt Handler

Girls and coding: Female peer pressure scares them off - I wish there was more data to back this argument up, but the idea seems worth investigating. "..there are no great British young geek superstars for them to relate to, male or female" is sad but true too.

Using binary search for debugging - Binary search is useful in all sorts of circumstances beyond traditional programming, and it's great to see this list of some of the unexpected places it comes in handy. Figuring out item counts by binary-searching on URL parameters is particularly cunning.

Superfastmatch - A spectacularly-useful open source tool for quickly detecting identical sections in sets of millions of documents. Originally aimed at detecting lazy journalism using cut-and-paste from press releases or Wikipedia, it's also applicable to plagiarism more widely, or even detecting all the echoes of biblical phrases in Shakespeare's work.

Geocoder.ca sued - A Canadian group spent years building up a crowd-sourced database of postal codes, an essential foundation for almost anyone doing open geographic work, and they're then sued by the Canadian Postal Service for violating their copyright! A very depressing case, but I'm hoping the support and publicity they're receiving convinces the government to back off.

Insight Data Fellows Program - Are you a PhD or post-doc who wants to apply the analytic skills you learned to the technology industry? This six week intensive course looks like a fantastic chance to be mentored by Silicon Valley data folks, and to meet lots of potential employers too!
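The item-count trick from the binary search link above can be sketched in a few lines of Python. This is a sketch under assumptions rather than anything taken from the linked post: `page_exists` is a hypothetical callback (a real one would request something like `?page=n` and check the response), and the search assumes items are numbered 1 to N with nothing missing in between.

```python
from typing import Callable


def count_items(page_exists: Callable[[int], bool], upper_bound: int = 1 << 20) -> int:
    """Largest n in [0, upper_bound] for which page_exists(n) is True.

    Assumes page_exists is True for 1..N and False for everything above N.
    """
    low, high = 0, upper_bound
    while low < high:
        # Round up so that low = mid always makes progress.
        mid = (low + high + 1) // 2
        if page_exists(mid):
            low = mid       # item mid exists, so the count is at least mid
        else:
            high = mid - 1  # item mid is missing, so the count is below mid
    return low


# Fake site with 1,234 items standing in for real HTTP requests.
probes = []


def fake_page_exists(n: int) -> bool:
    probes.append(n)
    return 1 <= n <= 1234


total = count_items(fake_page_exists)
print(total, "items found with", len(probes), "requests")
```

With an upper bound of about a million, the search needs roughly twenty probes, compared with more than a thousand if you walked the pages one at a time.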

April 13, 2012

Does Facebook's purchase of Instagram make sense?

Instagram
Picture by Oridusartic

I've spent the last year obsessed with social photo sharing as I've been building out Jetpac, so while I can't pretend I was expecting it, Facebook's acquisition of Instagram made sense to me. Here's why:

Facebook is a photo sharing site with a social network attached

The extent to which photos have always driven the growth of the network astonished me. Unlike games or even status updates, sharing pictures was an existing social behavior that the recipients understood and welcomed, giving friends and relatives of users a strong incentive to sign up themselves. Nothing else has this kind of pull, it's the bedrock of everything else they do. They currently host 140 billion photos, and are adding 10 billion a month, and that's a crucial engine of engagement.

Instagram has cracked the creative app problem

Instagram's real value is in their experience building a creative app that everybody can use. Nobody else has built an interface that's clear enough to be approachable and yet can produce results that people appreciate. It may sound simple, but it's deceptively hard to replicate from the outside. People like the filtered images because they're expressing a creative act by the taker, something they've put thought and time into, but for a wide audience of creators to use it, it actually has to be a lot easier and quicker than it appears. This balancing act is not only hard to reverse-engineer, it's also helped by an aura of exclusivity in the early days that's near-impossible for an established company to replicate. The flip-side of large companies being able to get easy press is that nobody gets credit for telling their friends about a cool new service they've launched.

Instagram was on the verge of going mainstream

The company clearly had proven their service had wide appeal, and showed all the signs of going into a rapid expansion. Even for behemoths like Facebook it gets very expensive to acquire a startup once they truly blow up, so with the cautionary tale of Yahoo's failure to buy Google in its early days in mind, it was a last chance of sorts for an acquisition. Instagram users' interaction with photos is very different from anything that Facebook offers, so if it did become widely popular there would be a real threat that they'd siphon off users.

Instagram is the first natively-mobile app

When Somini Sengupta asked me about this story for her New York Times post, I felt like I was repeating conventional wisdom, but I realized that's not something everybody's absorbed. It's profoundly shaped the way we approach Jetpac, with a laser-focus on our iPad app, because there's a deep shift in user behavior that established web companies are struggling to adapt to. Facebook is keenly aware of how important mobile is, but they're facing a classic innovator's dilemma where their core web business will suffer if they really prioritize phones and tablets. Bringing in the pioneers in mobile-only applications can't hurt as they wrestle with the changes they know they need to make.

Facebook's own valuation gives it a strong war chest for moves like this, so in their position I see why they made the purchase. The key is understanding how central photo sharing is for their business, and how much they believe in mobile.

April 10, 2012

Five short links

Fiveposter
Picture by Pink Ponk

Why Open Science failed after the Gulf oil spill - The description of this researcher's interactions with the media rang very true. They took his reports and "eliminated a lot of the caveats and limits that Asper placed on his own results".

Sigma.js - An interactive network graph library, with support for both live force-directed layouts, and importing more complex structures from the desktop Gephi application. It has some very stylish visual defaults too.

Accumulo - I'd missed this Apache database project until now, but I'm interested in their take on the BigTable concept, especially their focus on security controls. Intriguing that it came out of the NSA too.

Visualizing live event broadcast delay - Working backwards from website traffic at different locations to figure out the broadcast delay for a TV commercial.

Online Hex Editor - Does exactly what it says on the tin. I don't know why I'm still amazed by how effective web apps can be, but it's striking how few barriers there are to replacing desktop programs.

April 09, 2012

Where am I, who am I?

Mobydickmap

"Queequeg was a native of Rokovoko, an island far away to the West and South. It is not down in any map; true places never are."

Where am I right now? Depending on who I'm talking to, I'm in SoMa, San Francisco, South Park, the City, or the Bay Area. What neighborhood is my apartment in? Craigslist had it down as Castro when it was listed. Long-time locals often describe it as Duboce Triangle, but people less concerned with fine differences lump it into the Lower Haight, since I'm only two blocks from Haight Street.

When I first started working with geographic data, I imagined this was a problem to be solved. There had to be a way to cut through the confusion and find a true definition, a clear answer to the question of "Where am I?".

What I've come to realize over the last few years is that geography is a folksonomy. Sure, there's political boundaries, but the only ones that people pay much attention to are states and countries. City limits don't have much effect on people's descriptions of where they live. Just take a look at this map of Los Angeles' official boundaries:

Lacitylimits

There's clearly little correlation between the legal city boundaries and how people describe the place that they live. You could argue that Los Angeles County is the correct region to use, but then people way out in the desert by Littlerock would be included!

The arbitrary and human nature of places is even more pronounced with neighborhoods. As I showed above, there's a surprising amount of consensus on the names of the neighborhoods, but almost none on their boundaries.

Why do I care about all this? It's crucial for data processing to recognize that if you force what the user puts in the 'Location' box into a standardized form, you're losing information. For example, knowing how somebody naturally describes where they are is going to be a lot more useful for grouping them together than a street address or latitude/longitude coordinates. If I choose the Lower Haight label, I'm more likely to be a hippy or a punk, for the Castro I want to identify with the gay image, or if I go for the Mission I'm associating myself with hipsters.

I'm glad Twitter has stuck with their free-form text fields, and I hope Facebook will become more flexible. Don't throw this data away, treasure it! It makes it a lot harder for machines to deal with the content that people produce, but unless you're shipping packages or targeting ICBMs, the payoff of richer knowledge of your users is worth it.
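As a small illustration of treasuring that data, here's a Python sketch that groups people by the label they chose for themselves, tidying only case and whitespace instead of geocoding everything to a single canonical place. The profile names and locations are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_stated_location(profiles: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group user names by their own free-form location text.

    Only case and spacing are normalized, so "Lower Haight" and "the Castro"
    stay distinct even though a geocoder might map both to San Francisco.
    """
    groups = defaultdict(list)
    for name, location in profiles:
        key = " ".join(location.lower().split())
        groups[key].append(name)
    return dict(groups)


# Made-up profiles: three people living a few blocks apart.
profiles = [
    ("alice", "Lower Haight"),
    ("bob", "lower  haight"),
    ("carol", "The Castro"),
]
groups = group_by_stated_location(profiles)
print(groups)
```

You can always geocode the label later for mapping, but you can't recover the self-description once it has been flattened into coordinates.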

April 06, 2012

Find amazing travel photos from your friends with Jetpac on the iPad

Ipadappscreenshot1

I'm interrupting my usual stream of geek consciousness to bring you a message from our sponsors. I'm very pleased to announce that the Jetpac iPad app is now available! Some of your friends are taking astonishing travel pictures that you've never seen. Get the app and we'll give you the very best of the two hundred thousand photos your friends have shared on Facebook.

Ratings are very important to help other people discover the app, so if you do enjoy it, please consider taking a few seconds to rate us too.

There's a lot of data stories from this release, and I'll be writing about them over the next few weeks, in between new features and bug-fixes for the next update!

April 04, 2012

Five short links

Handprint
Photo by Hobvias Sudoneighm

HTTP cookies, or how not to design protocols - Browser protocols feel a lot more like Windows than Unix in their design and evolution. The lack of clear principles means we'll face the same endless-but-just-about-manageable cascade of bugs that afflicted Microsoft's OS.

Nilometer - Predictive analytics from a hole in the ground. The level of the Nile was such a strong sign of the strength of the harvest months later that Cairo's biggest festival was cancelled and replaced with prayers and fasting if it didn't measure up. The 1,400 years of time series data from this instrument spawned some fascinating research in modern times too.

Earth Station: The afterlife of technology at the end of the world - What happens when the future becomes the past? An abandoned satellite tracking station vital for the moon landing, and the trailer park that now surrounds it.

Designing user experiences for imperfect data - Thinking about the UI from the start is vital to building effective data algorithms, and often turns impossible problems into solvable ones, as Matthew demonstrates.

Spatial isn't special - There's a life-cycle to every technology niche. As demand first emerges, the few developers who can serve it can make a handsome living, but gradually knowledge and tools diffuse to a wider world and the specialty becomes a skill that can be acquired rather than an expert you need to hire. This is a very good thing for the wider world, what were hard and expensive problems become cheap and easy to solve, but it's worth remembering that when the money's too good it won't last forever.

March 30, 2012

Programming and prior experience

I wanted to highlight a comment to my previous post about unpaid work, since I think it deserves to be more prominent:

--------------------------------

I'm a female who majored in computer science but then did not use my degree after graduating (I do editing work now). While I was great with things like red-black trees and k-maps, I would have trouble sometimes with implementations because it was assumed going into the field that you already had a background in it. I did not, beyond a general knowledge of computers. 

I was uncomfortable asking about unix commands (just use "man"! - but how do I interpret it?) or admitting I wasn't sure how to get my compiler running. If you hadn't been coding since middle school, you were behind. I picked up enough to graduate with honors, but still never felt like I knew "enough" to be qualified to work as a "true" programmer. 

I like editing better anyway, so I'm not unhappy with my career, but that environment can't be encouraging for any but a certain subset of people, privileged and pushed to start programming early.

---------------------------------

March 30, 2012

Five short links

Fiveways
Photo by Elliott Brown

Hollow visions, bullshit, lies, and leadership vs management - "the best creative work depends on getting the little things right", "organizations need both poets and plumbers". So much to absorb here, but it all chimes with my experiences at Apple. Steve Jobs may have been a visionary, but he also knew his business inside and out, and obsessed over details.

Exceptions in C with longjmp and setjmp - When I was learning C, I loved how it felt like complete mastery was within reach; the language was contained and logical enough to compile mentally once you had enough experience. longjmp() and setjmp() were two parts I never quite understood until now, so it was fascinating to explore them here.

Web Data Commons - Structured data extracted from 1.5 billion pages. To give you an idea of the economics behind big data, the job only cost around $600 in processing costs.

Greg's Cable Map - A lovely tool for exploring the globe-spanning physical infrastructure that's knitting our world together.

Hammer.js - "Can't touch this!" - A cross-platform JavaScript library for advanced gestures on touch devices like tablets and phones. Even just building a basic swipe gesture from tap events is a pain, so this is much needed.

March 26, 2012 | Permalink | Comments (0) | TrackBack (0)

Twelve steps to running your Ruby code across five billion web pages

Stacks2
Photo by Andrew Ferguson

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively-updated and programmatically-accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon servers, so I'll show you how to do it on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

1 - Fetch the example code from github

You'll need git to get the example source code. If you don't already have it, there's a good guide to installing it here:

http://help.github.com/mac-set-up-git/

From a terminal prompt, you'll need to run the following command to pull it from my github project:

git clone git://github.com/petewarden/common_crawl_types.git

2 - Add your Amazon keys

If you don't already have an Amazon account, go to this page and sign up:

https://aws-portal.amazon.com/gp/aws/developer/registration/index.html

Your keys should be accessible here:

https://aws-portal.amazon.com/gp/aws/securityCredentials

To access the data set, you need to supply the public and secret keys. Open up extension_map.rb in your editor and just below the CHANGEME comment add your own keys (it's currently around line 61).

3 - Sign in to the EC2 web console

To control the Amazon web services you'll use to run the code, sign in on this page:

http://console.aws.amazon.com

4 - Create four buckets on S3

Buckets are a bit like top-level folders in Amazon's S3 storage system. They need to have globally-unique names which don't clash with any other Amazon user's buckets, so when you see me using com.petewarden as a prefix, replace that with something else unique, like your own domain name. Click on the S3 tab at the top of the page and then click the Create Bucket button at the top of the left pane, and enter com.petewarden.commoncrawl01input for the first bucket. Repeat with the following three other buckets:

com.petewarden.commoncrawl01output

com.petewarden.commoncrawl01scripts

com.petewarden.commoncrawl01logging

The last part of their names is meant to indicate what they'll be used for. 'scripts' will hold the source code for your job, 'input' the files that are fed into the code, 'output' will hold the results of the job, and 'logging' will have any error messages it generates.
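If you'd rather generate the four names than type them, here's a minimal sketch of the naming scheme above. The bucket_names helper is hypothetical (it isn't part of the example project), and 'com.petewarden' stands in for whatever unique prefix you pick.

```ruby
# Hypothetical helper, not part of the common_crawl_types project.
# Builds the four bucket names used in this walkthrough from a unique prefix.
PURPOSES = %w[input output scripts logging].freeze

def bucket_names(prefix, job = 'commoncrawl01')
  PURPOSES.each_with_object({}) do |purpose, names|
    names[purpose] = "#{prefix}.#{job}#{purpose}"
  end
end

# Replace 'com.petewarden' with your own unique prefix.
buckets = bucket_names('com.petewarden')
buckets.each { |purpose, name| puts "#{purpose}: #{name}" }
```

This only builds the strings; you still create the buckets themselves through the S3 tab as described.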

5 - Upload files to your buckets

Select your 'scripts' bucket in the left-hand pane, and click the Upload button in the center pane. Select extension_map.rb, extension_reduce.rb, and setup.sh from the folder on your local machine where you cloned the git project. Click Start Upload, and it should only take a few seconds. Do the same steps for the 'input' bucket and the example_input.txt file.

6 - Create the Elastic MapReduce job

The EMR service actually creates a Hadoop cluster for you and runs your code on it, but the details are mostly hidden behind their user interface. Click on the Elastic MapReduce tab at the top, and then the Create New Job Flow button to get started.

7 - Describe the job

The Job Flow Name is only used for display purposes, so I normally put something that will remind me of what I'm doing, with an informal version number at the end. Leave the Create a Job Flow radio button on Run your own application, but choose Streaming from the drop-down menu.

8 - Tell it where your code and data are

This is probably the trickiest stage of the job setup. You need to put in the S3 URL (the bucket name prefixed with s3://) for the inputs and outputs of your job. Input Location should be the root folder of the bucket where you put the example_input.txt file, in my case 's3://com.petewarden.commoncrawl01input'. Note that this one is a folder, not a single file, and it will read whichever files are in that bucket below that location.

The Output Location is also going to be a folder, but the job itself will create it, so it mustn't already exist (you'll get an error if it does). This even applies to the root folder on the bucket, so you must have a non-existent folder suffix. In this example I'm using 's3://com.petewarden.commoncrawl01output/01/'.

The Mapper and Reducer fields should point at the source code files you uploaded to your 'scripts' bucket, 's3://com.petewarden.commoncrawl01scripts/extension_map.rb' and 's3://com.petewarden.commoncrawl01scripts/extension_reduce.rb'. You can leave the Extra Args field blank, and click Continue.
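Since this is the stage where typos creep in, it can help to see all four values side by side. The sketch below is only an illustration: streaming_job_settings is a hypothetical helper, the prefix and run number are placeholders, and the reducer points at the extension_reduce.rb file uploaded in step 5.

```ruby
# Hypothetical helper that assembles the four S3 URLs typed into the job form.
# The prefix and run_id arguments are placeholders for your own values.
def streaming_job_settings(prefix, run_id)
  base = "#{prefix}.commoncrawl01"
  {
    input:   "s3://#{base}input",                 # a folder, not a single file
    output:  "s3://#{base}output/#{run_id}/",     # must not exist before the job runs
    mapper:  "s3://#{base}scripts/extension_map.rb",
    reducer: "s3://#{base}scripts/extension_reduce.rb"
  }
end

settings = streaming_job_settings('com.petewarden', '01')
settings.each { |field, url| puts "#{field}: #{url}" }
```

Bumping the run number each time you rerun the job gives you a fresh, non-existent output folder.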

9 - Choose how many machines you'll run on

The defaults on this screen should be fine, with m1.small instance types everywhere, two instances in the core group, and zero in the task group. Once you get more advanced, you can experiment with different types and larger numbers, but I've kept the inputs to this example very small, so it should only take twenty minutes on the default three-machine cluster, which will cost you less than 30 cents. Click Continue.
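To see roughly where a figure like that comes from, here's a back-of-the-envelope sketch. Both the hourly rate of $0.095 per machine and the rounding up to whole instance-hours are assumptions chosen for illustration, not quoted Amazon prices, so check the current pricing pages before relying on the result.

```ruby
# Rough cost sketch. The hourly rate passed in below is an assumed,
# illustrative figure, and billing in whole instance-hours is also an assumption.
def estimated_cost(instances:, minutes:, hourly_rate:)
  billed_hours = (minutes / 60.0).ceil          # round up to whole hours
  (instances * billed_hours * hourly_rate).round(4)
end

# One master plus two core machines, running for about twenty minutes.
cost = estimated_cost(instances: 3, minutes: 20, hourly_rate: 0.095)
puts format('Estimated cost: $%.3f', cost)
```

Under those assumptions the total stays under 30 cents, and the same function lets you see how the bill grows as you add machines or run longer.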

10 - Set up logging

Hadoop can be a hard beast to debug, so I always ask Elastic MapReduce to write out copies of the log files to a bucket so I can use them to figure out what went wrong. On this screen, leave everything else at the defaults but put the location of your 'logging' bucket for the Amazon S3 Log Path, in this case 's3://com.petewarden.commoncrawl01logging'. A new folder with a unique name will be created for every job you run, so you can specify the root of your bucket. Click Continue.

11 - Specify a boot script

The default virtual machine images Amazon supplies are a bit old, so we need to run a script when we start each machine to install missing software. We do this by selecting the Configure your Bootstrap Actions button, choosing Custom Action for the Action Type, and then putting in the location of the setup.sh file we uploaded, e.g. 's3://com.petewarden.commoncrawl01scripts/setup.sh'. After you've done that, click Continue.

12 - Run your job

The last screen shows the settings you chose, so take a quick look to spot any typos, and then click Create Job Flow. The main screen should now contain a new job, with the status 'Starting' next to it. After a couple of minutes, that should change to 'Bootstrapping', which takes around ten minutes, and then the job itself runs, which only takes two or three minutes.

Debugging all the possible errors is beyond the scope of this post, but a good start is poking around the contents of the logging bucket, and looking at any description the web UI gives you.

Once the job has successfully run, you should see a few files beginning 'part-' inside the folder you specified on the output bucket. If you open one of these up, you'll see the results of the job.
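Because the results are split across several 'part-' files, you may want to stitch them back together once you've downloaded them. Here's a minimal sketch, assuming each line is a key and a count separated by a tab (the usual shape of Hadoop streaming text output, but check your own files). The merge_counts helper is hypothetical and not part of the example project.

```ruby
# Hypothetical helper, assuming each output line looks like "key<TAB>count".
# Sums the counts per key and returns pairs sorted by descending count.
def merge_counts(lines)
  totals = Hash.new(0)
  lines.each do |line|
    key, count = line.chomp.split("\t", 2)
    next if key.nil? || count.nil?               # skip malformed lines
    totals[key] += count.to_i
  end
  totals.sort_by { |key, total| [-total, key] }
end

# Made-up sample lines standing in for the contents of the part- files.
sample = ["html\t12\n", "jpg\t3\n", "html\t5\n"]
merge_counts(sample).each { |key, total| puts "#{key}\t#{total}" }
```

With the files downloaded to a local folder (the 'output' path here is just an example), something like merge_counts(Dir['output/part-*'].flat_map { |path| File.readlines(path) }) would feed it the real data.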

This job is just a 'Hello World' program for walking the Common Crawl data set in Ruby. It simply counts the frequency of mime types and URL suffixes, and I've only pointed it at a small subset of the data. What's important is that this gives you a starting point to write your own Ruby algorithms to analyse the wealth of information that's buried in this archive. Take a look at the last few lines of extension_map.rb to see where you can add your own code, and edit example_input.txt to add more of the data set once you're ready to sink your teeth in.
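If you haven't written a streaming job before, the general shape is a mapper that prints tab-separated key and value lines, and a reducer that receives those lines sorted by key and totals them. The sketch below illustrates that pattern for URL suffixes only. It is not the code in extension_map.rb or extension_reduce.rb, which also has to fetch and parse the archive files; the function names and the example.com URLs are made up for illustration.

```ruby
require 'stringio'
require 'uri'

# Hypothetical helper: pull a lowercase file suffix out of a URL.
def url_suffix(url)
  path = URI.parse(url).path.to_s
  ext = File.extname(path).downcase.delete('.')
  ext.empty? ? 'none' : ext
rescue URI::InvalidURIError
  'invalid'
end

# Mapper side: emit "suffix<TAB>1" for every URL read from the input.
def map_urls(input, output)
  input.each_line do |line|
    url = line.strip
    next if url.empty?
    output.puts "#{url_suffix(url)}\t1"
  end
end

# Reducer side: lines arrive sorted by key, so total each run of equal keys.
def reduce_counts(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current
      output.puts "#{current}\t#{total}" unless current.nil?
      current = key
      total = 0
    end
    total += value.to_i
  end
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Local dry run: the sort in the middle stands in for Hadoop's shuffle step.
urls = StringIO.new("http://example.com/a.html\nhttp://example.com/b.JPG\nhttp://example.com/c.html\n")
mapped = StringIO.new
map_urls(urls, mapped)
sorted = StringIO.new(mapped.string.lines.sort.join)
reduce_counts(sorted, $stdout)
```

Running the mapper and reducer over in-memory strings like this is a cheap way to catch bugs before you pay for a cluster to find them.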

Big thanks again to Ben Nagy for putting the code together, and if you're interested in understanding Hadoop and Elastic MapReduce in more detail, I created a video training session that might be helpful. I can't wait to see all the applications that come out of the Common Crawl data set, so get coding!

March 25, 2012 | Permalink | Comments (0) | TrackBack (0)

Unpaid work, sexism, and racism

 

Skatergirldevilboy

Photo by Wayan Vota

You may have been wondering why I haven't been blogging for over a week. I've got the generic excuse of being busy, but truthfully it's because I've had a draft of this post staring back at me for most of that time. God knows I'm not normally one to shy away from controversy, but I also know how tough it is to talk about racism and sexism without generating more heat than light. After two more head-slapping examples of our problem appeared just in the last few days, I couldn't hold off any longer. I'm not a good person to talk about explicit discrimination in the tech industry; for that I'd turn to somebody like Kristina Chodorow. But I have been struck by one of the more subtle reasons we discourage a lot of potential engineers from joining the profession.

I don't get paid for most of the things I spend my time on. I blog, write open source code, and speak at conferences for free, my books provide beer money, and I've only been able to pay myself a small salary for the last few months, after four years of working on startups. This isn't a plea for sympathy; I love doing what I do and see it all as a great investment in the future. I saved up money during my time at Apple precisely so I'd have the luxury of doing all these things.

I was thinking about this when I read Rebecca Murphey's post about the Fluent conference. Her complaints were mostly about things that seemed intrinsic to commercial conferences to me, but I was struck by her observation that the lack of expenses for speakers hits diversity.

I think it goes beyond conferences though (and I've actually found O'Reilly to be far better at paying contributors than most organizers, and they work very hard on discrimination problems). The media industry relies on unpaid internships as a gateway to journalism careers, which excludes a lot of people. Our tech community chooses its high-flyers from people who have enough money and confidence to spend significant amounts of time on unpaid work. Isn't this likely to exclude a lot of people too?

And yes, we do have a diversity problem. I'm not wringing my hands about this out of a vague concern for 'political correctness', I'm deeply frustrated that I have so much trouble hiring good engineers. I look around at careers that require similar skills, like actuaries, and they include a lot more women and minorities. I desperately need more good people on my team, and the statistics tell me that as a community we're failing to attract or keep a lot of the potential candidates.

We're a meritocracy. Writing, speaking, or coding for free helps talented people get noticed, and it's hard to picture our industry functioning without that process at its heart. We have to think hard about how we can preserve the aspects we need, but open up the system to people we're missing right now. Maybe that means setting up scholarships, having a norm that internships should all be paid, setting aside time for training as part of the job, or even doing a better job of reaching larval engineers earlier in education? Is part of it just talking about the career path more explicitly, so that people understand how crucial spending your weekends coding on open source, etc, can be for your career?

I don't know exactly what to do, but when I look around at yet another room packed with white guys in black t-shirts, I know we're screwing up.

March 23, 2012 | Permalink | Comments (0) | TrackBack (0)

Five short links

Twoplusthree
Photo by Bitzi

Geotagging poses security risks - An impressively level-headed look at how the quiet embedding of locations within photos can cause security issues, especially for the service members it's aimed at.

I can't stop looking at tiny homes - I was so happy to discover I'm not the only one obsessed with houses the size of dog kennels. If you're a fellow sufferer, avoid this site at all costs.

From CMS to DMS - Are we moving into an era of Data Management Systems that play the same interface role for our data that CMSs do for our content?

Drug data reveals sneaky side effects - Drew Breunig pointed me at this example of how bulk data is more than the sum of its parts. Combining a large number of adverse reaction reports revealed many new side effects caused by mixing drugs.

Gisgraphy - An intriguing open-source LGPL project that offers geocoding services based on OpenStreetMap and Geonames information. I look forward to checking this out and having a play.

March 15, 2012 | Permalink | Comments (0) | TrackBack (0)
