Virtually every web developer starts off using a relational database like MySQL. It's so easy to use joins and sorts to implement complex operations your service needs, pretty soon you end up with big chunks of application logic in your SQL queries. You don't know it, but you've just entered The SQL Trap.
I first heard the phrase from Jud Valeski, but it's something I've seen happen to every startup that deals with massive data sets, and I've struggled with it myself. You build a working service on a small scale, and that success brings in more users and data. Suddenly your database is the bottleneck.
At this point you have two choices. You can continue writing your queries in expressive high-level SQL and pour resources into speeding up the underlying system, or you can switch to the database equivalent of assembler with a key/value store and write application code to implement the complex operations on top of that.
In an ideal world a database is a black box - you send a query and the system figures out how to execute that operation speedily. Even in conventional databases though we end up deviating from that, eg by indexing certain rows we know we'll be querying on. After wrestling with speed problems I took a few steps beyond that by denormalizing my tables for common queries to avoid joins at the cost of more fragile update logic. As my data grew, even that wasn't enough, and simple sorts on indexed rows were taking several minutes. I spent some time trying to second-guess MySQL's optimizer by tweaking various opaque limits, but it still insisted on sorting the few thousand rows by writing them out to disk and running something called FILESORT.
At this point I was in the trap. Getting further would require somebody with deeper knowledge of MySQL's internal workings, and would take a lot of tweaking of my queries and my system setup. Large companies end up throwing money at consultants and database vendors at this point, which is a great result for the providers!
Instead as a starving startup I had to bite the bullet and throw out all my lovely queries. I switched to a key/value database for storage, and designed the keys and values to get a workable sub-set of the information I needed for any query. I then sucked all that data in PHP and did the sorting and refining there.
After the initial pain it was a massive relief. I was able to use my standard computer science skills to design the right data-structures and algorithms for my operations, rather than trying to second-guess the black box of SQL optimization. Sure I've now got a whole different set of problems with more complex application code to maintain, but it's taking a lot less time and resources than the SQL route.
Don't get me wrong, premature optimization is still the root of all evil, but if you're dealing with massive datasets and your database is becoming a bottleneck, consider dropping SQL and falling back to something more primitive, before you're trapped!
Comments