Amazon's Elastic MapReduce service is a godsend for anyone running big data-processing jobs. It takes the pain and suffering out of configuring Hadoop, and lets you run hundreds of machines in parallel when needed, without having to pay for them while they're idle. Unfortunately it does still have a few... quirks, so here's a brain dump of lessons I've learnt while using the service.
Don't put underscores in bucket names. The rest of S3 is quite happy with names like mailana_data_2010_1_25, but EMR really doesn't like those underscores and will fail to run any job that references them. You also can't rename buckets, and moving the data to a new bucket involves a copy that maxes out at about 20 MB/s, so fixing this can take a while.
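If you're creating buckets programmatically, it's cheaper to catch this before EMR does. Here's a minimal sketch of the kind of check I mean — `is_emr_safe_bucket` is my own helper name, and the pattern is deliberately conservative (lowercase letters, digits, dots and hyphens only), not S3's exact rules:

```python
import re

def is_emr_safe_bucket(name):
    """Return True if an S3 bucket name looks safe to use with EMR.

    Underscores are the killer, so this sticks to lowercase letters,
    digits, dots and hyphens, 3-63 characters, starting and ending
    with a letter or digit.
    """
    return bool(re.match(r'^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$', name))

# The underscore-laden name from this article fails the check,
# while a hyphenated equivalent like 'mailana-data-2010-1-25' passes.
```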
Invest in some good S3 tools. All your data and code has to live in S3, so you'll be spending a lot of time dealing with buckets. S3cmd is a great command-line tool for working with S3, but I'd also recommend Bucket Explorer for a GUI view.
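For what it's worth, the handful of s3cmd commands I end up reaching for most often look like this (the bucket and file names here are just placeholders):

```
# List your buckets, or a bucket's contents
s3cmd ls
s3cmd ls s3://my-bucket/

# Upload and download individual files
s3cmd put local-file.txt s3://my-bucket/path/
s3cmd get s3://my-bucket/path/local-file.txt

# Mirror a local folder up to S3
s3cmd sync ./results/ s3://my-bucket/results/
```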
Start off small. You're charged per machine, rounded up to the next full hour. This means if you fire up 100 machines and the job fails in 30 seconds, you'll still be charged for 100 machine-hours. If you have a job you're not sure will work, start off with a single machine instead. You'll also have a lot fewer log files to sort through to figure out what went wrong!
Use the log files. It's a bit hidden, but on the third screen of the job setup process there's an 'advanced' section you can reveal. In there, add a bucket path and your jobs' logs will be copied to that S3 location. These are life-savers when it comes to figuring out what went wrong. I'm mostly doing streaming work with PHP, so I often end up drilling down into the task_attempts folder. In there, each run on each machine gets a numbered sub-folder, and you can grab the stderr output from each of them. If a reduce step has gone wrong, there will usually be a missing number in the output file sequence, and you can use that number to find the task attempt that failed and look at its errors. You can also spot tasks that failed and were retried, since the retry count shows up as the final number in the folder name.
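That gap-spotting step is easy to automate once the output is synced locally. A small sketch, assuming the reducer outputs follow Hadoop's usual part-NNNNN naming (`missing_part_numbers` is my own helper name):

```python
import os
import re

def missing_part_numbers(output_dir):
    """Scan a Hadoop job's output folder for part-NNNNN files and report
    any gaps in the number sequence. A gap usually means one reducer
    failed, and its number points you at the task attempt logs to read."""
    numbers = []
    for f in os.listdir(output_dir):
        m = re.match(r'part-(\d+)$', f)
        if m:
            numbers.append(int(m.group(1)))
    numbers.sort()
    if not numbers:
        return []
    expected = set(range(numbers[0], numbers[-1] + 1))
    return sorted(expected - set(numbers))
```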
GZipped input. A lot of my input data had already been gzipped, but luckily if you pass -jobconf stream.recordreader.compression=gzip in the extra arguments section, Hadoop will decompress it on the fly before passing the data to your mapper.
Multiple input folders. My source data was also scattered across a lot of different folders in S3, but happily you can specify several input locations by adding an extra -input s3://<your data location> argument for each one in the extra args section.
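Putting the last two tips together, the extra-args box ends up looking something like this — the bucket paths are placeholders, and only the -input and -jobconf flags themselves come from the tips above:

```
-input s3://my-bucket/logs/2010-01/
-input s3://my-bucket/logs/2010-02/
-jobconf stream.recordreader.compression=gzip
```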
Make sure PHP has enough memory. By default PHP scripts will fail if they use more than 32MB of RAM, since PHP is designed for the web server world. If your job might be memory-intensive, especially on the reducer side, use something like ini_set('memory_limit', '1024M'); at the top of your script to make sure you have enough headroom.