I'm very pleased to announce the launch of the Data Science Toolkit. It's a collection of the most useful open source tools and data sets I've found, wrapped in an easy-to-use REST/JSON interface, and available for download as a turnkey virtual machine image.
Over the past few years I've discovered some amazing open-source tools, and built a few I'm pretty proud of myself, but they've always required a lot of effort from developers to use. Take Boilerpipe for example. It's by far the best approach I've found for extracting the main text from a news story or blog post, a vital first step for many data processing operations. But as long as it's only available as a Java library, only Java developers will be able to benefit from it. By wrapping it in a web server interface, and shipping it pre-installed on a VM, I'm hoping to get it into the hands of more developers.
The same goes for other libraries like GeoIQ/Schuyler Erle's Geocoder, a wonderful way of locating any address in the US that previously required a multi-gigabyte download and many hours of data import, or my own Geodict with its hour-long database setup. By shipping what is essentially a specialized Ubuntu distribution, those setup times are removed, at the cost of a large (5GB) download.
Another benefit of this approach is the ability to run scalably. When all the data you're querying is on the local machine, it's possible to add capacity just by throwing more servers at the problem, without the bandwidth, latency or other limits on calling an external API becoming the bottleneck.
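As a rough sketch of what "throwing more servers at the problem" could look like on the client side, here's one way to spread a batch of requests across several identical copies of the VM in round-robin order. Everything named here is an assumption for illustration: the server addresses are made up, `assign_servers` and `call_server` are hypothetical helpers, and the path you pass to `call_server` is a placeholder, so check the documentation for the real endpoints and request formats.

```python
import itertools
import json
import urllib.parse
import urllib.request

# Hypothetical addresses of identical toolkit VMs; replace with your own.
SERVERS = ["http://10.0.0.1", "http://10.0.0.2", "http://10.0.0.3"]


def assign_servers(items, servers):
    """Pair each work item with a server, cycling through the pool in order."""
    return list(zip(items, itertools.cycle(servers)))


def call_server(server, path, payload):
    """POST a JSON payload to one server and decode the JSON reply.

    The path is a placeholder, not a documented endpoint of the toolkit.
    """
    request = urllib.request.Request(
        urllib.parse.urljoin(server, path),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because every VM carries its own copy of the data, the assignment doesn't need to care which server gets which item. Adding capacity is just a matter of appending another address to the list.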
Anyway, please try out the sandbox, check out the documentation, grab the VM or just start up an EC2 instance from the public AMI ami-9e7d8ff7. These are early days, and there's already a pile of bugs along with features and APIs that didn't make it into this version, but I'm excited to see how people use it. I'd also love to see folks jump in and start hacking on it. It's all completely open-source, so it's your project as much as mine!