I'm flying back from a preparatory meeting for Strata, and as always with the O'Reilly events it left my mental cogs whirring. One of the biggest revelations for me was that every single person was using the equivalent of string and chicken wire to build their analysis pipeline, from OKCupid analysts to folks helping the New York Times. As Hilary Mason put it, are we stuck just using grep and awk? This was actually a massive relief for me, as I've always been highly embarrassed by the hacky code behind most of my data pipelines and assumed I was just ignorant of the right tools! Though R got an honorable mention from several attendees, almost everyone just picked their favorite scripting language, then cut-and-pasted code and shell commands to get the job done.
The fundamental reason for this is the aptly named 'suffering' phase of data munging, where we transform raw HTML pages, JSON, text, CSV, XML, legacy binary or any other semi-structured format into strictly arranged data that we can then run through standard analysis tools. It turns out that (apart from the binary formats) this means writing regular expressions with a thin veneer of logic gluing them together. You can see an example of that in my crawler that gathers school ranking data from the LA Times. There's a single massive regex at the top of the file that contains a group for each piece of data I want to gather. Using PHP to pull the HTML, I match against the raw text, grab the contents of each group, and write them out as a row to a CSV file. My Buzz profile crawler uses the same idea, but with a much more complex set of regexes. To convert unemployment data files, I use a similar approach on the strange text file format the BLS uses.
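For the curious, here's roughly what that pattern looks like in practice. This is a minimal sketch of the regex-and-CSV approach, not the actual crawler code; the URL, the pattern, and the column names are made up for illustration.

```php
<?php
// Sketch of the 'one big regex with a group per column' approach.
// The URL, regex, and field names here are invented for illustration.

$html = file_get_contents('http://example.com/school-rankings.html');

// One capture group per value we want in the output row.
// The 's' modifier lets '.' span newlines; '.*?' keeps the matches lazy.
$pattern = '/<tr class="school">.*?<td>(.*?)<\/td>.*?<td>(.*?)<\/td>.*?<td>(.*?)<\/td>/s';

$output = fopen('schools.csv', 'w');
fputcsv($output, array('name', 'city', 'rank'));

if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[0] is the whole record; 1..3 are the capture groups.
        fputcsv($output, array(
            trim(strip_tags($match[1])),
            trim(strip_tags($match[2])),
            trim(strip_tags($match[3])),
        ));
    }
}

fclose($output);
```

Everything interesting lives in that one pattern, and everything fragile does too: a markup change on the page means rewriting it by hand.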
After spending so long building all this throwaway code, I think I know the tool I want. It would let me load a large data set into an interactive editor. I'd then highlight a series of 'records' that would each be transformed into a row in the output CSV file. It would automagically figure out a good regex that matched all the records but none of the intervening filler text. Then I'd select the text that represented the values for each record, the numbers or strings I'd expect to see in each column of the CSV output. By selecting equivalent values in three or four records, the tool could build nested regexes or groups that extracted them from each record. As I was doing all this, I'd see a sample of the results show up in a sidebar pane, as in the sketch below. I'd also be able to hand-tweak the underlying regular expressions driving each part and see how the matches change.
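A crude command-line stand-in for that preview pane might look something like this. It's only a hypothetical sketch: you'd run a candidate record regex (with one group per column) against a sample of the input, eyeball the first few extracted rows, tweak the pattern, and rerun. The file name and pattern are invented.

```php
<?php
// Hypothetical stand-in for the 'sidebar preview': apply a candidate
// record regex to a sample and print the first few extracted rows.

function preview_matches($text, $pattern, $limit = 5) {
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    echo count($matches) . " records matched\n";
    foreach (array_slice($matches, 0, $limit) as $i => $match) {
        // Drop the full-record match, keep just the column groups.
        echo "Record $i: " . implode(' | ', array_slice($match, 1)) . "\n";
    }
}

$sample = file_get_contents('sample.html');
preview_matches($sample, '/<li>(\w+), (\d+)%<\/li>/');
```

The tool I'm imagining would do the same loop interactively, generating and refining the pattern from the examples I highlight instead of making me type it.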
This sounds a lot like something you could build into Emacs (but then anything sounds a lot like something you could build into Emacs). I have no experience creating extensions, though, so I'd love to know if there's anything already out there, in any environment. It doesn't have to do everything I'm after; even an interactive grep a little more advanced than control-S in Emacs would be an interesting step forward compared to running it all from the command line. Is anybody aware of anything like that? Is there a completely different approach I'm missing?