PeteSearch

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.



Five short links

Fivestar
Photo by Eldeeem

The Cartography of Bullshit - A righteous rant against a piece of pop-sociology digging into just how flimsy the underlying statistics are. It hits home because numbers I've mined have ended up in similar columns - a White Power group even used some of my research to 'prove' Mexicans were conquering Texas based on the numbers of Juans versus Johns! Take all studies on controversial subjects like race with a massive pinch of salt.

Welcome, recent graduates - Advice I wished I'd had when I looked for my first post-college job. 

Sublime DataConverter - We've ended up using CSV for lists of objects where the property names remain constant and JSON for messier data structures and as a programming model post-transport. We've homebrewed a limited set of routines to automatically scan headers or walk all objects and extract all possible properties so we can automatically convert between the two representations, but this project is a much more general approach to the same problem.
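To make the "scan headers or walk all objects" idea concrete, here is a minimal Python sketch of that kind of two-way conversion. It is not our homebrewed code or the Sublime DataConverter implementation; the function names and the sample records are made up for illustration, and it only handles flat objects.

```python
import csv
import io
import json


def objects_to_csv(objects):
    """Walk every object, collect the union of property names, and emit CSV."""
    fieldnames = []
    for obj in objects:
        for key in obj:
            if key not in fieldnames:
                fieldnames.append(key)
    out = io.StringIO()
    # restval fills in an empty cell when an object lacks a property.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(objects)
    return out.getvalue()


def csv_to_objects(text):
    """Scan the header row and turn each following row into a dict."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample records with inconsistent properties.
places = [
    {"name": "Place A", "city": "San Francisco"},
    {"name": "Place B", "seats": 60},
]
as_csv = objects_to_csv(places)
round_trip = csv_to_objects(as_csv)
print(as_csv)
print(json.dumps(round_trip, indent=2))
```

Note that the round trip is lossy: every value comes back as a string and missing properties come back as empty strings, which is part of why CSV suits constant property names better than messy structures.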

The Split-Apply-Combine Strategy for Data Analysis - A technical but enlightening read from Hadley Wickham, covering ways of applying the same algorithms across many different representations of data.

Nightmare after nightmare: Students trying to replicate work - Remember what I said about taking studies with a pinch of salt? Even with help from the original authors, PhD students had incredible trouble reproducing the results of published papers. This isn't just a problem for social science, all science is a messy business and we need to keep our skepticism intact. That isn't a free pass to ignore evolution and climate change though!

May 20, 2013 | Permalink | Comments (0) | TrackBack (0)

No more heatmaps that are just population maps!

I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes, and a couple of major new features, and you can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places actually have an unusually high occurrence of X, rather than just having more people.
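As a rough illustration of that normalization step, here is a minimal Python sketch. It does not call coordinates2statistics; the population figures stand in for whatever that call would return, and the place names, numbers, and the min_population cutoff are all invented for the example.

```python
def occurrences_per_capita(counts, population, min_population=1.0):
    """Divide raw counts by local population so dense areas do not dominate."""
    rates = {}
    for place, count in counts.items():
        people = population.get(place, 0.0)
        if people < min_population:
            continue  # too few people to give a meaningful rate
        rates[place] = count / people
    return rates


# Invented numbers: raw occurrences of X, and how many people live there.
counts = {"big_city": 900, "small_town": 30, "empty_desert": 1}
population = {"big_city": 1_000_000, "small_town": 10_000, "empty_desert": 0}

rates = occurrences_per_capita(counts, population)
hottest = max(rates, key=rates.get)
print(hottest)
```

On raw counts the big city would win easily, but per person the small town has more than three times the rate, which is the signal a plain heatmap hides.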

I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.

text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.

I've expanded language support, with a new Ruby gem that you can get via 'gem install dstk' (which includes unit testing), and an R Package adding the two new APIs to Ryan Elmore's original, available as RDSTK. The Python and Javascript clients have been updated to the latest APIs too.

There's also an official .ova version for people using VMware, up at http://static.datasciencetoolkit.org/dstk_0.50.ova

What's still to be done?

The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.

The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.

Unit testing has shown that text2sentences isn't working at all!

Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.

pete@jetpac.com

May 19, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Stationfive
Photo by Curtis Perry

The Declassification Engine - "Saving history from official secrecy". A fascinating concept that shows how the firehose of cheap distributed computing power fundamentally changes what privacy and secrecy mean. We can probably reconstruct a lot of information that people think they've hidden in these documents, but what are the rules?

A 63-bit floating point type for 64-bit OCaml - I've never used the language, but I adore the bit-fiddling that goes into floating-point representations, and this is a lovely hack on top of them.

Local geocoder - A lovely minimal reverse geocoder that's self-contained, including data. I've been excited to see a blossoming of open geocoding solutions, Nominatim has improved in leaps and bounds, PostGIS now has some strong capabilities, and I've been having fun with the Data Science Toolkit of course!

How to say nothing in 500 words - Ancient advice about writing that's still useful. "Call a fool a fool"!

Olympians Festival - I've been getting a lot out of the local TheaterPub nights in San Francisco, so I'm excited to make it to this twelve-night festival with a whopping 36 new plays in November! I'm also a sucker for the Greek myths, ever since I grew up with Tony 'Blackadder' Robinson's retelling of the Iliad as a kid.

May 11, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivetype
Photo by Grant Hutchinson

Assuming everybody else sucked - If an industry is behaving in an apparently irrational way, try to figure out the internal logic that's driving that behavior. You'll be much more effective at breaking the rules if you understand what they are first.

Storing and publishing sensor data - Now we're scattering sensors around like confetti, we're generating ever-growing mounds of time-series data, so here's a good overview of where you can shove it.

100,000 Stars - This WebGL exploration of the universe is so good I feel like it should have been plastered all over the internet already, but maybe I've been living under a rock?

Mapping the product manifold - I started off in image processing, carried what I'd learned to unstructured text, and now I'm fascinated to see techniques flowing back the other way. We're going to be doing crazily effective recognition of images, language, and every other kind of noisy signal within a few years.

What happened to the crypto dream? - A clear-eyed examination of where the crypto dream of the 90's ended up - "the demand for technologies that will upset that power balance is quite low".

May 07, 2013 | Permalink | Comments (0) | TrackBack (0)

We're all starting to track ourselves

Mapscreenshot

We're releasing a massive and growing amount of information about who we are, where we go, and when. There are hundreds of millions of public checkins already out there, and millions more are being created every day. People think of Foursquare as the leading source, but actually Instagram, Facebook, Twitter, Flickr, and Google Plus all produce incredible numbers of geo-located checkins, some of them many, many more than Foursquare.

This is going to cause big changes in our world. We've already taught our computers what we buy and read, now we're telling them where we spend our real-world lives. Just our presence at a location at a particular time becomes powerful data when it's combined with all the other people doing the same thing. We're instrumenting our movements at a very detailed level, and sending them out into the ether. Even more amazingly, we're adding high-resolution photos and detailed comments to the checkins.

It's hard to overstate how effective this data can be at solving intractable problems. Economists, sociologists, and epidemiologists would kill to have detailed pictures of the lives we lead at this kind of scale. There will be applications we haven't even thought of too, connecting us with people we should be talking to, introducing us to new experiences, all sorts of feedback that will change how we live.

It's a scary new world to contemplate too of course, which is why I keep blogging about what I'm up to. Recently I've been working with my team at Jetpac analyzing billions of photos from all sorts of social sources, to help both tourists and locals figure out where to go and what to do. I want to share an internal tool we use to explore the data, a map interface to the checkins that people have shared publicly. If you want to get a concrete feel for how our world's changing, check it out:

https://www.jetpac.com/map

It's still an experimental tool so apologies for any bugs, but I hope you find this glimpse of the mountain of public data we're all creating as fascinating as I do! You can find all of the individual photos and other checkins out there on the public web, but seeing them accumulated together in one place still blows my mind.

May 06, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Starknot
Photo by Neil Platform1

GeoURI - I have no earthly use for these, but I love that they exist, and are even an IETF standard!

Nathaniel Bowditch - He created the American Practical Navigator over two hundred years ago. He improved the data quality of previous works and made the results widely available in a form non-specialists could easily understand. That approach transformed navigation then, and it's still incredibly effective today across all sorts of fields.

Digital Elevation Data - On that topic, Jonathan de Ferranti has spent years painstakingly correcting open-source geographic data about the height of the earth's surface, and then releasing the results openly to anyone who needs them. It may be hard for non-geo folks to understand how tough a problem this is, and how hard he's worked on it, so here are some example renders and an independent review.

Sentiment Analysis Corpora - A fantastic summary and comparison of the raw data sets you need to build sentiment analysis algorithms.

A Major Breakthrough in Image Processing - It's time to retire Lena!

May 02, 2013 | Permalink | Comments (0) | TrackBack (0)

Open Sentiment Analysis

Smileyfingers
Photo by Courtney Carmody

Sentiment analysis is fiendishly hard to solve well, but easy to solve to a first approximation. I've been frustrated that there have been no easy free libraries that make the technology available to non-specialists like me. The problem isn't with the code, there are some amazing libraries like NLTK out there, but everyone guards their training sets of word weights jealously. I was pleased to discover that SentiWordNet is now CC-BY-SA, but even better I found that Finn Årup has made a drop-dead simple list of words available under an Open Database License!

With that in hand, I added some basic tokenizing code and was able to implement a new text2sentiment API endpoint for the Data Science Toolkit:

http://www.datasciencetoolkit.org/developerdocs#text2sentiment

Give it a try, it's as simple as a CURL call from the terminal:

curl -d "I hate this hotel" "http://www.datasciencetoolkit.org/text2sentiment"

{"score": -3.0}
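For a feel of how little code a word-list approach needs, here is a minimal Python sketch. It is not the Data Science Toolkit's implementation: the four-word list and its weights are a made-up stand-in for a real list with thousands of entries, and averaging the matched words is an assumption about how scores get combined.

```python
import re

# Hypothetical miniature word list, for illustration only.
WORD_SCORES = {"hate": -3, "awful": -3, "good": 3, "love": 3}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Average the weights of the words found in the list (0.0 if none)."""
    hits = [WORD_SCORES[token] for token in tokenize(text)
            if token in WORD_SCORES]
    return sum(hits) / len(hits) if hits else 0.0


print(sentiment_score("I hate this hotel"))
```

A scorer like this has no idea about negation or sarcasm, so "not good" still counts as positive, which is the first-approximation trade-off described above.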

I've been having a blast with it, simple-minded as it is, so I hope you do too!

April 30, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Earthlight

A Global Poverty Map Derived from Satellite Data - This is an old paper from 2006, but I love the idea of using how much light a neighborhood sends into the night sky to measure how wealthy it is. Richness is highly correlated with wastefulness, apparently.

Open Multi-lingual WordNets - We're mapping our inner worlds too, these open data sets are incredibly useful information on word meanings for anyone working with computers and human languages.

The Invisible City - A fake Canadian city briefly appeared on OpenStreetMap, complete with an elaborate public transport network. Or was it briefly a real place blinking in and out of existence, with only a lone volunteer mapper spotting it?

The Dark Side of Social Capital - We usually think of community as a good thing, but anybody who grew up in a small town can tell you that the power can be used to exclude outsiders too.

K2C 1N5 - Ervin Ruci is being hounded by the Canadian Postal Service for the crime of making a crowdsourced database of postal codes freely available, and now they've decided they own the copyright to the words "postal code" too!

April 25, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fiveoclock
Photo by Tasty Goodness

Yoyodyne - How a fictional company was born in the novels of Thomas Pynchon, was adopted by Buckaroo Banzai and Star Trek, and ended up in the GPL.

What will be left of our cities? - The nitty-gritty details of what will happen to our concrete, brick, and steel long after we're dead and gone.

On glitch art, and the fascinating mistakes computers make - I was a terrible VJ with footage, but I had so much fun with live feeds and static. Don't believe technology's mask of perfection, engineers know what a rats' nest every product is under the hood.

Is MS Office the quiet villain of global finance? - Our kids will look back on the last couple of decades as a time when we fell under the spell of cold hard numbers, without really looking at how they were produced.

Search history and accidental class warfare - A variant of the echo chamber effect, and an example of the law of unintended consequences. Recommendation algorithms are becoming our century's version of press barons.

April 22, 2013 | Permalink | Comments (0) | TrackBack (0)

Do we need a slow software movement?

Woolysnail
Photo by Tim Regan

When I was an isolated kid in the English countryside my only connections to the computing world were "Public Domain" floppy disks. Mail-order libraries would send me one of the disks in their catalog if I posted them a pound coin taped to a piece of card. I've never forgotten how important those glimpses into a wider world were, and I'll always be grateful to the people who made their demos, games, and utilities freely available. They were a lifeline to me, and I always wanted to give something back in return. My first contribution was a 'desktop palette' of 16 colors I'd selected for an especially pleasing RISC OS background, which didn't exactly set the world on fire.

That set the tone for most of my open source career - when I release a new project, I expect a deafening silence. There are occasional exceptions, but most of them don't make sense to anyone else, at least at first. The majority get quietly ignored by me and everyone else, but a few I keep working on, and they occasionally get picked up by other people too.

The Data Science Toolkit has turned into one of those sleeper projects. Over the last few months I've had a lot of bug reports, which is the best measure of how many people are actually using the code! There have been some nice companion projects too, like this wrapper for Excel or the new API library for Node. It also powers OpenHeatMap.com, which also keeps growing like a weed entirely through word of mouth. Hearing about the uses has been fascinating; academics of 19th century American literature mapping the spread of place mentions, reporters analyzing documents to track corruption in developing countries, mobile real estate app startups, university alumni associations.

The common thread for everyone using it is that they're marginal, just like I was growing up. There aren't enough of them and they don't have enough money to tempt commercial developers. Young open source software grows in the cracks between profitable problems, and survives on a starvation diet of spare-time coding. This gives it the time to find its niche, its audience, in a way that a more conventional development approach never could. Slow-growing software has the chance to reach people who'd never be found any other way, so if you're working on an unpopular project that you love, don't give up!

April 18, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivepound
Photo by Kurtis Garbutt

Geo-location estimation of Flickr images - The caption, title, and description of a photo is incredibly useful when it comes to guessing where a photo was taken, even using fairly crude language analysis algorithms. This is a great paper that parallels a lot of what we've found using unstructured text for image location at Jetpac.

Create a heatmap in Excel - Excel can be crusty and hard to learn, but I'm constantly surprised by how much you can do with it once you dive into its depths.

Death by a thousand paper cuts - I get asked the same questions over and over again by people I've just met once they detect my accent - "Where are you from?", "Do you like soccer?", "Why did you come here?". I appreciate that they're trying to connect with me, but the sheer repetition and predictability can make it hard to answer them with enthusiasm.

I can only imagine how tough it must be to deal with repetitive comments when people are behaving like jerks, rather than being nice. Julie does a good job explaining why, even when any single incident can seem fairly minor, an unending succession of them becomes impossible to deal with. The programming world tolerates people behaving like jerks in small ways towards anyone who isn't like them, over and over and over again.

Big Data and Conflict Prevention - The world ignored warning signs about famines and wars from small data, and they're doing the same thing with big data.

Helsinki Bus Station Theory - The case for sticking it through an apprenticeship so you can do something truly creative afterwards.

April 17, 2013 | Permalink | Comments (0) | TrackBack (0)

Converting to and from Google map tile coordinates in PostGIS

Google Maps' system of power-of-two tiles has become a de facto standard, widely used by all sorts of web mapping software. I've found it handy to use as a caching scheme for our data, but the PostGIS calls to use it were getting pretty messy, so I wrapped them up in a few functions. The code is up at https://github.com/petewarden/postgis2gmap, and here's a quick rundown:

tile_indices_for_lonlat(lonlat geography, zoom_level int)

Takes a PostGIS latitude/longitude point and a zoom level, and returns a geometry object where the X component is the longitude index of the tile, and the Y component is the latitude index. These values are not rounded, so for a lot of purposes you'll need to FLOOR() them both, eg;

SELECT FLOOR(X(tile_indices_for_lonlat(checkins.lonlat, 4))) AS grid_lon, FLOOR(Y(tile_indices_for_lonlat(checkins.lonlat, 4))) AS grid_lat FROM checkins;

lonlat_for_tile_indices(lat_index float8, lon_index float8, zoom_level int)

Does the inverse of the function above, turning a Google Maps tile index for a given zoom level into a PostGIS geometry point. You may notice that the coordinates are given as separate arguments rather than a single geometry object. That's an artifact of how my data is stored. Here's an example:

SELECT X(lonlat_for_tile_indices(6, 2, 4)::geometry), Y(lonlat_for_tile_indices(6, 2, 4)::geometry);

bounds_for_tile_indices(lat_index float8, lon_index float8, zoom_level int)

This takes the latitude and longitude indices for a tile, and a zoom level, and returns a geography object containing the bounding box for that tile. I mainly use this for limiting queries on geographic data to a particular tile, eg;

SELECT * FROM checkins WHERE ST_Intersects(lonlat, bounds_for_tile_indices(6, 2, 4));
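If you want to sanity-check results outside the database, the same conversions can be sketched in a few lines of Python. This uses the standard Web Mercator tile formulas, which I'm assuming match what the PostGIS functions compute; it is not a port of the repository code. Note that these hypothetical Python versions take plain floats in (x, y) order, meaning longitude index first, unlike the SQL functions above.

```python
import math


def tile_indices_for_lonlat(lon, lat, zoom):
    """Return fractional (x, y) tile indices for a point at a zoom level."""
    n = 2 ** zoom
    x = (lon + 180.0) / 360.0 * n
    lat_rad = math.radians(lat)
    y = (1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n
    return x, y


def lonlat_for_tile_indices(x, y, zoom):
    """Inverse: (lon, lat) of the point at tile indices (x, y).

    Whole-number indices give the top-left corner of that tile.
    """
    n = 2 ** zoom
    lon = x / n * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * y / n))))
    return lon, lat


def bounds_for_tile_indices(x, y, zoom):
    """Return (west, south, east, north) in degrees for tile (x, y)."""
    west, north = lonlat_for_tile_indices(x, y, zoom)
    east, south = lonlat_for_tile_indices(x + 1, y + 1, zoom)
    return west, south, east, north


# A point in San Francisco at zoom level 4; floor the fractional indices
# to get the tile that contains it.
x, y = tile_indices_for_lonlat(-122.42, 37.77, 4)
print(math.floor(x), math.floor(y))
print(bounds_for_tile_indices(math.floor(x), math.floor(y), 4))
```

As with the SQL version, the indices come back unrounded, so FLOOR-style rounding is left to the caller.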

April 09, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivehand
Photo by Alan Levine

Elephant - A beautiful open source project to store data in a way that's "as durable as S3, as portable as JSON, and as queryable as HTTP". Tim O'Reilly has talked about the web operating system, and HTTP, JSON, and REST-like APIs (without the annoyances of full REST) have become the interface layer. I know integration will be do-able whenever I see a project based around them.

Median SF rent for a one-bedroom apartment - I wish Craigslist made their data openly available. It's already public, why not enable more useful services like this?

Everything We Know About What Data Brokers Know About You - The data about us that's used for marketing purposes is essentially unregulated. As someone who works with data about people for a living I'm glad I'm able to innovate, but I'm also depressed by how little the public actually cares about how their information is passed around and used.

The Design-Fiction Slider-bar of Disbelief - A corker of a listicle from Bruce Sterling, covering the continuum from imagination to regulation.

Scrapely - I love pulling data from messy HTML pages, and it's great to see more and more support emerging. Don't give me an API, just give me an open robots.txt.

April 08, 2013 | Permalink | Comments (0) | TrackBack (0)

The Chairs and The Shrew

Unibrow
Photo by Jesse Bell

I have middlebrow tendencies, but over the years I've learned that the struggle with difficult work can pay off. I grew to love Infinite Jest, once I figured out Wallace was boring me deliberately, that he cared about the mundane details that make up people's real lives. As a teenager I figured out that subtitles on BBC2 late at night meant nudity, and I wound up appreciating French cinema despite my base motivations. My favorite play from last year was a production of Beckett's Endgame, which left me heartbroken for characters who should have been unrelatable, screaming at each other from trash cans.

On Sunday night I made it to the Cutting Ball's production of Eugene Ionesco's The Chairs. I was ready to put some work in but I hoped it would pay off. I left a little disappointed. There was a lot to chew on intellectually, and the performances were fantastic, but I never cared about what was happening. It was a puzzle, but a cold one, and I never felt there was anything at stake, despite it being set at the end of the world. The basic plot (do existentialists care about spoilers?) is that an old married couple, apparently the last people on earth, begin to host a party full of imaginary guests, and the husband prepares to give his inspiring message to the world. The 'orator' who will deliver the message appears as a flesh-and-blood person, the couple commit suicide, and then the orator delivers what turns out to be nonsensical gobbledegook. I could imagine a play that made this pack a punch, but when the couple threw themselves out of the windows, all that was going through my mind was "How long until we hear the splash?". It didn't feel like the company's fault; the translation was strong and the acting was up to the high standard I've come to expect from the Cutting Ball. The barrier I hit was Ionesco's writing. I know he was demonstrating how the "language of society" breaks down and how hard it is to communicate, but I'm bourgeois enough to want something more than an intellectual thesis from a play. I wish I'd caught The Bald Soprano by the same director a few years ago. One of my friends told me that worked much more effectively for him, so maybe that would have helped me connect with Ionesco?

I had very different expectations for last night's entertainment, The Taming of the Shrew by TheaterPub at the Cafe Royale. I discovered the group last month when they did multiple interpretations of a short experimental play, and I knew they had attracted a team of talented and enthusiastic actors, so I was excited to see how they'd tackle Shakespeare. Shrew isn't an easy play to produce; modern audiences are going to struggle to swallow the central plot, that an opinionated woman needs to be psychologically tortured until she submits to her husband. Shakespeare can't help but write fleshed-out characters though, so there was usually enough wiggle room in the interpretation to make them sympathetic to us. The only exception was Kate's final speech; even with the emphasis that she'd only bow to her husband's honest will, it was hard to see as a happy ending. Despite that quibble with the source text, the whole evening was a massive amount of fun. I loved the relish and gusto that the whole cast showed. Kim Saunder's Kate and Paul Jenning's Petruchio appropriately stole the show with big performances that had me laughing and completely believing in their tricky-to-swallow relationship. Paul seemed to be channeling the best of John Goodman and Jack Black as he played the crazy suitor, and Kim's obvious enjoyment of the tongue-lashings Kate gives to the world played perfectly. I'd also single out Shane Rhode for the energy and imagination he brought to the tough part of Grumio, playing up to the audience as he witnessed the ridiculousness unfolding along with us. Ron Talbot, Jan Marsh, Vince Faso, Brian Martin, Sam Bertken, and Sarah Stewart all deserve credit for their work too. Everyone was throwing themselves wholeheartedly into their roles, and generating a lot of laughter.

There are three more performances coming up, one tonight (Tuesday March 19th), and then Monday March 25th and Wednesday March 27th, all at 8pm. If you're looking for an experience that's all theater, with an audience and cast that are all there for pure enjoyment of a play, try to make it along. The venue is incredibly relaxed (thanks to the director Stuart Bousel for gently handling a couple of folks who were ahem, a little too relaxed, when the play started), you don't need reservations, and they have great beer!

March 19, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Lincolnmoods
Photo by Earl

Want to live somewhere nice? Be prepared to work longer - How an area's living costs affect poor and rich workers differently.

Moving towards an identity and patient records locator - As Ben Adida points out, a one-way hash of a cell phone number with less than a billion possible inputs is not very useful. The flipside of de-anonymization being easy with enough dimensions is that it should be possible to perform entity disambiguation using the same data, so just store the messy redundant information you get as input, and do joins when you need to. The problem of matching entities is hard because defining an entity is hard, don't fight it.
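To see why a small input space undermines a one-way hash, here is a minimal Python sketch that recovers a number by brute force. Everything in it is an assumption for illustration: the proposal's actual hash function isn't stated here, so this uses unsalted SHA-256, the phone number is a fictional one, and the search is cut down to the last four digits so it runs instantly.

```python
import hashlib


def hash_phone(number):
    """One-way hash of a phone number (unsalted SHA-256, an assumption)."""
    return hashlib.sha256(number.encode("ascii")).hexdigest()


def recover_phone(target_hash, prefix, digits):
    """Try every number with the given prefix until one matches the hash."""
    for n in range(10 ** digits):
        candidate = prefix + str(n).zfill(digits)
        if hash_phone(candidate) == target_hash:
            return candidate
    return None


secret = "4155550123"  # fictional number
leaked = hash_phone(secret)
recovered = recover_phone(leaked, prefix="415555", digits=4)
print(recovered)
```

The loop here covers only ten thousand candidates, but the same approach scales to every possible number, since a few billion hashes is a small job for modern hardware.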

Open Access Coalition - We're going to look back on the '00's as a golden age when data was open on the web if we're not careful.

DIY McDonald's Recipes - This is a maker project I can get behind! Fast food is my guilty pleasure, thankfully only occasional these days, but I have an engineer's appreciation for the thought that has gone into designing their recipes.

Save yourself from Reddit, Hacker News, Slashdot - A neat little productivity hack from Steve Coast!

March 19, 2013 | Permalink | Comments (0) | TrackBack (0)

Quantity has a quality all its own

Ccdarray
Photo by Kevin Collins

I used to be an image processing engineer. I'd be handed a picture, and I'd have to do something useful with it. To do that I had to take a big mental leap. Instead of seeing it as an image I had to picture it in my mind as a grid of measurements.

At first this was intensely frustrating, because they were deeply crappy measurements. A million factors introduced noise or errors, everything from lenses to sensor noise to encoding software. Gradually I began to make progress, despite all these problems. Decades of engineers before me had figured out inelegant but effective methods of getting value from an unpromising soup of pixels, and I was able to learn from their approaches.

Interesting algorithms in image processing are almost comically domain specific. Thousands of man years of work have gone into detecting and correcting the distinctive reflections that occur when people's eyes are caught in a camera flash. Compressing photos effectively requires an exhaustive knowledge of the human perception system, and very clear ideas of the likely subject matter for photos. The process behind facial recognition is like a game of Mouse Trap, with a whole series of steps that have been empirically proven to work, but which could never have been predicted from any theory.

The computer science I was taught at college grew out of mathematics, and assumed that you have a minimal set of clean inputs. Provability and understandability were prized values, and so messy ad-hoc algorithms were seen as dead ends, even if they worked for the problem at hand. Image processing taught me to value them instead, as long as they could be proven to work across the kind of inputs I was likely to encounter in practice.

Once I'd learned that, the world began to look very different. My image processing training gave me the mental tools to tackle problems that other people shied away from. If I have a large enough set of data, I know how to search for the signal, even if the noise is deafening. I'm happy to rely on correlations that aren't guaranteed to hold for all time, as long as I can test it holds in the cases I care about now, and have instrumentation to spot if the prediction stops working. I know that getting 80% of the way there and having a human fill in the blanks is often good enough.

I wasn't the only one to discover how effective this mindset can be, and it has come to be known as Data Science. It's an approach to solving problems that's light on elegance and heavy on pragmatism. It doesn't care about proofs but relies on experiments. Entirely new things are possible once you have massive amounts of data, so even if you're a grizzled old engineer like me and instinctively shy away from trendy new labels, give Big Data a try. Amongst all the marketing hype, there are some powerful techniques for building algorithms that have no right to work, but do.

March 18, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Aperture
Photo by Yersinia

The Deleted City - A spatial reinterpretation of the old Geocities sites. Having data in a single large dump instead of behind an API makes it possible to do things like this with it, things that the creators could never have foreseen.

Asteroid Discovery from 1980 to 2011 - See how our knowledge about the world around us has grown with this amazing animation. At the start new asteroids appear as discrete pinpricks days or months apart, by the end they're being discovered so fast they're a solid mass, it's like a lighthouse beam hitting fog. It's not only that we're finding out more, but that the rate of discovery is accelerating.

Open data on depression treatment in London - I love seeing mass adoption of data technologies, it's this sort of democratization of the tools that makes the real difference to the world. What's special about this approach is that it's so ordinary, what used to be elite techniques are now available to people in every walk of life.

BitDeli - I haven't used this yet, but I love the idea of being able to program custom analytics code, without the hassle of having to host it myself, and with the benefit of being able to reuse other people's approaches too.

Silicon Valley poverty - Even after twelve years here, I'm still shocked by how wide the gap is between the rich and the poor in the US. 

March 13, 2013 | Permalink | Comments (0) | TrackBack (0)

Why I'm a terrible privacy advocate

Handeyes
Photo by Michael Scott

People often think I'm a privacy researcher, thanks to the Facebook and iPhone stories. The truth is I'm just curious about undiscovered data. Because a lot of it is about people's behavior, and that's an inherently creepy area, I blog about what I'm doing to keep myself honest. It might look like I'm on a privacy crusade, but that's just a by-product of my attempts to figure out ethical ways to use these sources of information. I'm a data hacker, and I'm trying to keep my hat clean.

This has been on my mind a lot recently as I'm looking around at all the information that's publicly available about exactly where people have been. Facebook, Google+, Instagram, Flickr, and Twitter are all making rich streams of location data available, especially around photos. My vision is a world where I can make those digital footprints visible to ordinary users. Who comes to this bar? Any of my friends? What sort of people take photos at this hotel?

The raw data to do this is already out there in multiple places, and you can do some of it by going to individual sites like Foursquare, but there's something different about merging together scattered information, even if it's all theoretically public already. You have to make a choice before your activities are publicly visible from these services, but the implications of that choice aren't clear until somebody aggregates the data and demonstrates why the sum is greater than its parts.
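To make the aggregation point concrete, here's a minimal Python sketch of what merging scattered public posts might look like. Everything in it is an assumption for illustration: the service names, users, and venues are made up, and the helper functions `visitors_by_venue` and `friends_seen_at` are hypothetical, not part of any real API.

```python
from collections import defaultdict

# Hypothetical, made-up records standing in for public posts from
# different services. Each tuple is (service, user, venue); none of
# these names correspond to a real API or real people.
public_posts = [
    ("photos", "alice", "Corner Bar"),
    ("checkins", "bob", "Corner Bar"),
    ("microblog", "alice", "Grand Hotel"),
    ("photos", "carol", "Grand Hotel"),
    ("checkins", "alice", "Corner Bar"),
]


def visitors_by_venue(posts):
    """Merge posts from every service into one venue -> set-of-users index."""
    index = defaultdict(set)
    for _service, user, venue in posts:
        index[venue].add(user)
    return dict(index)


def friends_seen_at(index, venue, friends):
    """Return the sorted list of friends who appear at a venue in the merged index."""
    return sorted(index.get(venue, set()) & set(friends))


index = visitors_by_venue(public_posts)
print(friends_seen_at(index, "Corner Bar", ["alice", "carol"]))
```

No single service in this toy data answers "any of my friends at this bar?" as directly as the merged index does, which is the sense in which the aggregate tells you more than the individual sources.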

I wish I could pretend I was only worried about the privacy implications, but the truth is I'm excited about how fun and useful the applications could be!

March 12, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fiveleaves
Photo by Flood G

BetaShapes - Using geotagged Flickr photos to define San Francisco's neighborhoods as a crowd-sourced 'folksonomy'. I'm entranced by how many useful things emerge from the clouds of data exhaust we're all generating.

Bacteria farming and software design - Code is an incredibly useful tool for artists, and I love this behind-the-scenes look at how an amazing visualization was built.

Ang Lee and the uncertainty of success - The acclaimed director spent six years of his career with no visible signs of making any progress, and this post does a fantastic job highlighting how hard that must have been.

Founders and dysfunctional families - Growing up in chaos is good preparation for working in a startup.

Common Crawl URL search - Thinking of crawling the web? Check out this web interface to see if Common Crawl already has what you're looking for sitting in a handy S3 bucket!

March 11, 2013 | Permalink | Comments (0) | TrackBack (0)

Why should you care that artists are underpaid?

Starvingartists
Picture by Jamie

I've spent most of my career working closely with artists, and they were usually paid less than me. At first this was just awkward, but I began to realize it was part of a deeper problem. Most business owners didn't understand what artists were even adding to the product, and the pay was just a symptom of the lack of respect they had for their contributions. I remember at my last game industry job the owner began replacing experienced 3D artists with high-school graduates being paid a third of their salaries. 

I'm a capitalist, red in tooth and claw, so why did I have a problem with that? At that point, I'd spent six years at the coal face of a creative industry, and I knew how much those people contributed to a successful product. The trouble was it was often in subtle ways that were crucial but easy to miss. In the short term you could easily continue producing sequels, recycling assets and ideas, but it wasn't sustainable. It was a great way of cutting costs, but you ended up with a boring product that was indistinguishable from the competition, which meant much lower profits over time. If you don't value the craft of experienced creative people, you'll never get a hit, and that's where you really make money.

I learned to find places that valued artists, not just because they were better places to work, but because I thought they had a much higher chance of success. I was never an Apple user before I was asked to join the company back in 2003, and my biggest concern was that they were about to go bust(!), but I was attracted by their reputation. I wasn't disappointed, the designers were very definitely in charge! Dedication to the intangible details of products was the core of Apple's success, and that meant valuing artists. It wasn't always easy, it was a challenging pressure-cooker environment for a lot of my friends, but the importance of what they were doing was never in doubt. I always believed Steve would happily ship off the software engineering side to the other side of the world if he could, but the designers were the company.

That's why I'm sad when I see industries throw away the talent that made them great. The Visual Effects Oscar speech was cut off when the winner started to mention the bankruptcy of Rhythm and Hues, the team that was behind a lot of the shots that won the award (and where several of my friends work). It's the latest casualty of a wave of VFX closures, and a sign that film industry bosses think they can get away with cheaper, less-experienced artists, and audiences won't notice the difference. It's like Detroit in the '70s: they have enough momentum that it will take a while for the problem to be noticeable, but by the time it's obvious, the talent will have disappeared. Fit and finish matters, and when capital-intensive industries like cars, film or games forget that, disaster looms. A free market will eventually correct the problem, but only after a lot of money has been wasted, and a lot of people have gone through hell.

Learn from Apple's success; valuing artists makes you money!

February 25, 2013 | Permalink | Comments (0) | TrackBack (0)
