I've spent a lot of the last two years wrestling with different database technologies from vanilla relational systems to exotic key/value stores, but for OpenHeatMap I'm storing all data and settings in S3. To most people that sounds insane, but I've actually been very happy with that decision. How did I get to this point?
Like most people, I started by using MySQL. This worked pretty well for small data sets, though I did have to waste more time than I'd like on housekeeping tasks. The server or process would crash, or I'd change machines, or I'd run out of space on the drive, or a file would be corrupted, and I'd have to mess around getting it running again.
As I started to accumulate larger data sets (e.g. millions of Twitter updates), MySQL required more and more work to keep running well. Indexing is great for medium-scale data sets, but once the index itself grew too large, lots of hard-to-debug performance problems popped up. By the time I was recompiling the source code and instrumenting it, I'd realized that its abstraction model was now more of a hindrance than a help. If you need to craft your SQL around the details of your database's storage and query optimization algorithms, then you might as well use a more direct low-level interface.
That led me to my dalliance with key/value stores, and my first love was Tokyo Cabinet/Tyrant. Its brutally minimal interface made it delightfully easy to get predictable performance. Unfortunately it was very high maintenance, and English-language support was very hard to find, so after a couple of projects using it I moved on. I still found the key/value interface the right level of abstraction for my work; its essential property was the guarantee that any operation would take a known amount of time, regardless of how large my data grew.
So I put Redis and MongoDB through their paces. My biggest issue was their poor handling of large data loads, and I submitted patches to implement Unix file sockets as a faster alternative to TCP/IP through localhost for that sort of upload. Mongo's support team are superb, and their responsiveness made Mongo the winner in my mind. Still, I found myself wasting too much time on the same mundane maintenance chores that had frustrated me back in the MySQL days, which led me to look into databases-as-a-service.
The best known of these is Google's AppEngine datastore, but it doesn't have any way of loading large data sets, and I wasn't going to be able to run all my code on the platform. Amazon's SimpleDB was extremely alluring on the surface, so I spent a lot of time digging into it. It didn't have a good way of loading large data sets either, so I set myself the goal of building my own tool on top of its API. I failed. Its manual sharding requirements, extremely complex programming interface and mysterious threading problems turned an apparently straightforward job into a death-march.
While I was doing all this, I had a revelation. Amazon already offered a very simple and widely used key/value database: S3. I was used to thinking of it as a file system, and anyone who's been around databases for a while knows that file systems make attractive small-scale stores that become problematic with large data sets. What I realized was that S3 was actually a massive key/value store dressed up to look like a file system, so it didn't suffer from the 'too many files in a directory' sort of scaling problems. Here are the advantages it offers:
- Widely used. I can't emphasize enough how important this is for me, especially after spending so much time on more obscure systems. There are all sorts of beneficial effects that flow from using a tool that lots of others also use, from copious online discussions to the reassurance that it won't be discontinued.
- Reliable. We have very high expectations of up-time for file systems, and S3 has had to meet these. It's not perfect, but backups are easy as pie, and with so many people relying on it there's a lot of pressure to keep it online.
- Simple interface. Everything works through basic HTTP calls, and even external client code (e.g. JavaScript via AJAX) can access public parts of the database without touching your server.
- Zero maintenance. I've never had to reboot my S3 server or repair a corrupted table. Enough said.
- Distributed and scalable. I can throw whatever I want at S3, and access it from anywhere else. The system hides all the details from me, so it's easy to have a whole army of servers and clients all hammering the store without it affecting performance.
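To make the "key/value store dressed up to look like a file system" idea concrete, here is a minimal sketch in Python. It uses an in-memory dictionary as a stand-in for a bucket, so it runs without any network access; the class name `InMemoryStore` and the key names are hypothetical, chosen only for illustration. The point it tries to show is that keys are flat strings, and that a "directory" is nothing more than a shared key prefix, which is roughly how S3's own prefix-filtered listing behaves.

```python
import json


class InMemoryStore:
    """Toy stand-in for a bucket: one flat mapping of key -> bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        # Serialize any JSON-compatible value and store it under a string key.
        self._objects[key] = json.dumps(value).encode("utf-8")

    def get(self, key):
        # Fetch a single object by its exact key; raises KeyError if missing.
        return json.loads(self._objects[key].decode("utf-8"))

    def list_keys(self, prefix=""):
        # "Directories" are only a naming convention: listing one is just
        # a prefix filter over the flat key space.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = InMemoryStore()
# Hypothetical keys; the slashes carry no special meaning to the store.
store.put("maps/example-a/settings.json", {"title": "Example map A"})
store.put("maps/example-a/data.json", [{"region": "r1", "value": 1}])
store.put("maps/example-b/settings.json", {"title": "Example map B"})

print(store.get("maps/example-a/settings.json")["title"])
print(store.list_keys("maps/example-a/"))
```

Swapping the dictionary for real HTTP PUT and GET calls against a bucket would keep the same three operations, which is about all the interface I end up needing.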
Of course there's a whole shed-load of features missing, most obviously the fact that you can't run any kind of query. The thing is, I couldn't run arbitrary queries on massive datasets anyway, no matter what system I used. At least with S3 I can fire up Elastic MapReduce and feed my data through a Hadoop pipeline to pull out analytics.
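As a rough illustration of what "no queries, but you can run a batch job" means in practice, the sketch below does a tiny single-process imitation of a map/reduce pass over every object in a store. It is not Hadoop or Elastic MapReduce, just the same shape of computation; the `category` field, the key names and the function names are all assumptions made up for the example.

```python
import json
from collections import Counter


def map_record(record):
    # Map step: emit one (key, 1) pair per record, keyed on a hypothetical
    # "category" field.
    yield record["category"], 1


def reduce_counts(pairs):
    # Reduce step: sum the values emitted for each key.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(objects):
    # objects maps key -> JSON text, standing in for every object in a bucket.
    # With no query engine, answering a question means reading everything.
    pairs = []
    for body in objects.values():
        for record in json.loads(body):
            pairs.extend(map_record(record))
    return reduce_counts(pairs)


objects = {
    "uploads/a.json": json.dumps([{"category": "x"}, {"category": "y"}]),
    "uploads/b.json": json.dumps([{"category": "x"}]),
}
print(run_job(objects))
```

The trade-off is that every question costs a full pass over the data, which is tolerable for occasional analytics but not for anything interactive.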
So that's where I've ended up, storing all of the data generated by OpenHeatMap as JSON files within both private and public S3 buckets. I'll eventually need to pull in a more complex system like MongoDB as my concurrency and flexibility requirements grow, but it's amazing how far a pseudo-file system can get you.