I either did a bad job explaining myself, or my last post was wrong, judging by the reaction from a Twitter engineer and other comments I received by email. The point I was trying to get across was that queue-based systems look temptingly simple on the surface but require a lot of work to get right. Is it possible to build a robust pipeline based on queues? Yes. Can your team do it? Even if they can, is it worth the time compared to an off-the-shelf batch solution?
I've seen enough cases to believe that there's a queue anti-pattern for building data processing pipelines. Building streaming pipelines is still a research problem, with promising projects like S4, Storm, and Google's Caffeine, but they're all very young or proprietary. It's a tempting approach because it's such an obvious next step in data processing, and it's so easy to get started stringing queues together. That's the wrong choice for most of us mere mortals though, as we'll get sucked into dealing with all the unexpected problems I described, instead of adding features to our product.
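To make the temptation concrete, here's a minimal sketch of what "stringing queues together" can look like, using nothing but Python's standard library. The stage names and the stand-in fetch/parse steps are hypothetical, and a real deployment would use a networked queue service rather than in-process queue.Queue objects, but the twenty-line version really is this seductive. What it doesn't show is any of the retry, back-pressure, monitoring, or reprocessing machinery you discover you need once messages start failing.

```python
# Sketch of a naive two-stage queue pipeline. Stage names and the
# fake fetch/parse work are illustrative, not a real system.
import queue
import threading

fetch_queue = queue.Queue()   # URLs waiting to be fetched
parse_queue = queue.Queue()   # fetched pages waiting to be parsed

def fetch_worker():
    while True:
        url = fetch_queue.get()
        page = "<html>...</html>"      # stand-in for a real HTTP fetch
        parse_queue.put((url, page))   # hand off to the next stage
        fetch_queue.task_done()

def parse_worker():
    while True:
        url, page = parse_queue.get()
        print(f"parsed {url}: {len(page)} bytes")  # stand-in for real parsing
        parse_queue.task_done()

# Each stage is just a daemon thread pulling from its input queue.
threading.Thread(target=fetch_worker, daemon=True).start()
threading.Thread(target=parse_worker, daemon=True).start()

for url in ["http://example.com/a", "http://example.com/b"]:
    fetch_queue.put(url)

fetch_queue.join()
parse_queue.join()
```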
I'm wary of using queues for data processing for the same reason I'm wary of using threads for parallelizing code. Experts can create wonderful concurrent systems using threads, but I keep shooting myself in the foot when I use them. They just aren't the right abstraction for most problems. In the same way, when you find yourself designing a pipeline in terms of queues, take a step back and ask whether you could achieve the same goals with a mature batch-based system like Hadoop instead.
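For comparison, here's a sketch of the batch equivalent using the Hadoop Streaming convention of plain scripts that read lines from stdin and write tab-separated key/value pairs to stdout. The log format, field layout, and the counting task itself are assumptions made for illustration; the point is the shape of the solution.

```python
# mapper.py -- minimal Hadoop Streaming mapper (sketch).
# Assumes each input line is a log record with a URL in the first
# whitespace-separated field; that layout is an assumption.
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:
        # Emit "url<TAB>1" so the framework can group records by URL.
        print(f"{fields[0]}\t1")
```

The matching reducer relies on Hadoop Streaming delivering its input sorted by key, so a simple running total is enough:

```python
# reducer.py -- counts occurrences per URL from the sorted mapper output.
import sys

current_url, count = None, 0
for line in sys.stdin:
    url, _, value = line.rstrip("\n").partition("\t")
    if url != current_url:
        if current_url is not None:
            print(f"{current_url}\t{count}")
        current_url, count = url, 0
    count += int(value)

if current_url is not None:
    print(f"{current_url}\t{count}")
```

Machine failures, retries, and reruns are the framework's problem here; with the queue version, every one of those is code you have to write and operate yourself.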
Queue-based data pipelines are hard, but they look easy at first. That's why I believe they're so dangerous.