PeteSearch

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.


Hacks for hospital caregiving

Redcross
Photo by Adam Fagen

I've just emerged from the worst two weeks of my life. My partner suffered a major stroke, ended up in intensive care, and went through brain surgery. Thankfully she's back at home now and on the road to a full recovery. We had no warning or time to prepare, so I had to learn a lot about how to be an effective caregiver on the fly. I discovered I had several friends who'd been in similar situations, so thanks to their advice and some improvisation here are some hacks that worked for me.

Be there

My partner was unconscious or suffering from memory problems most of the time, so I ended up having to make decisions on her behalf. Doctors, nurses and assistants all work to their own schedules, so sitting by the bedside for as long as you can manage is essential to getting information and making requests. Figure out when the most important times to be there are. At Stanford's intensive care unit the nurses work twelve-hour shifts, and hand over to the next nurse at 7pm and 7am every day. Make sure you're there for that handover: it's where you'll hear the previous nurse's assessment of the patient's condition, and be able to get support from the old nurse for any requests you have for the new one to carry out. The nurses don't often see patients more than once, so you need to provide the long-term context.

The next most important time is when the doctors do their rounds. This can be unpredictable, but for Stanford's neurosurgery team it was often around 5:30am. This may be your one chance to see senior physicians involved in your loved one's care that day. They will page a doctor for you at any time if you request it, but this will usually be the most junior member of the team who's unlikely to be much help beyond very simple questions. During rounds you can catch a full set of the doctors involved, hear what they're thinking about the case, give them information, and ask them questions.

Even if there are no memory problems, the patient is going to be groggy or drugged a lot of the time, and having someone else as an informed advocate is always going to be useful, and the only way to be informed is to put in as much time as you can.

Be nice to the nurses

Nurses are the absolute rulers of their ward. I should know this myself after growing up in a nursing family, but an old friend reminded me just after I'd entered the hospital how much they control access to your loved one. They also carry a lot of weight with the doctors, who know they see the patients for a lot longer than they do, and often have many years more experience. Behaving politely can actually be very hard when you're watching someone you love in intensive care, but it pays off. I was able to spend far more than the official visiting hours with my partner, mostly because the nurses knew I was no trouble, and would actually make their jobs easier by doing mundane things for her, and reassuring her when she had memory problems. This doesn't mean you should be a pushover though. If the nurses know that you have the time to very politely keep insisting that something your loved one needs happens, and will be there to track that it does, you'll be able to get what your partner needs.

Track the drugs

The most harrowing part of the experience was seeing my loved one in pain. Because neurosurgeons need to track their patients' cognitive state closely to spot problems, they limit pain relief to a small number of drugs that don't cause too much drowsiness. I knew this was necessary, but it left my partner with very little margin before she was hit with attacks of severe pain. At first I trusted the staff to deal with it, but it quickly became clear that something was going wrong with her care. I discovered she'd had a ten-hour overnight gap in the Vicodin medication that dealt with her pain, and she spent the subsequent morning dealing with traumatic pain that was hard to bring back under control. That got me digging into her drugs, and with the help of a friendly nurse, we discovered that she was being given individual Tylenols, and the Vicodin also contained Tylenol, so she would max out on the 4,000mg daily limit of its active ingredient acetaminophen and be unable to take anything until the twenty-four hour window had passed. This was crazy because the Tylenol did exactly nothing to help the pain, but it was preventing her from taking the Vicodin that did have an effect.

Once I knew what was going on I was able to get her switched to Norco, which contains the same strong pain-killer as Vicodin but with less Tylenol. There were other misadventures along the same lines, though none that caused so much pain, so I made sure I kept a notebook of all the drugs she was taking, the frequency, any daily limits, and the times she had taken them last, so I could manually track everything and spot any gaps before they happened. Computerization meant that the nurses no longer did this sort of manual tracking, which is generally great, but also meant they were always taken by surprise when she hit something like the Tylenol daily limit, since the computer rejection would be the first they knew of it.

Start a mailing list

When someone you love falls ill, all of their family and friends will be desperate for news. When I first came up for air, I phoned everyone I could think of to let them know, and then made sure I had their email address. I would be kicked out of the ward for a couple of hours each night, so I used that time to send out a mail containing a progress report. At first I used a manual CC list, but after a few days a friend set up a private Google Group that made managing the increasingly long list of recipients a lot easier. The process of writing a daily update helped me, because it forced me to collect my thoughts and try to make sense of what had happened that day, which was a big help in making decisions. It also allowed me to put out requests for assistance, for things like researching the pros and cons of an operation, looking after our pets, or finding accommodation for family and friends from out-of-town. My goal was to focus as much of my time as possible on looking after my partner. Having a simple way to reach a lot of people at once and being able to delegate easily saved me a lot of time, which helped me give better care.

Minimize baggage

A lot of well-meaning visitors would bring care packages, but these were a problem. During our eleven-day stay, we moved wards eight times. Because my partner was in intensive care or close observation the whole time, there were only a couple of small drawers for storage, and very little space around the bed. I was sleeping in a chair by her bedside or in the waiting room, so I didn't have a hotel room to stash stuff. I was also recovering from knee surgery myself, so I couldn't carry very much!

I learned to explain the situation to visitors, and be pretty forthright in asking them to take items back. She didn't need clothing and the hospital supplied basic toiletries, so the key items were our phones, some British tea bags and honey, and one comforting blanket knitted by a friend's mother. Possessions are a hindrance in that sort of setting: the nurses hate having to weave around bags of stuff to take vital signs, there's little storage, and moving them all becomes a royal pain. Figure out what you'll actually use every day, and ask friends to take everything else away. You can always get them to bring something back if you really do need it, but cutting down our baggage was a big weight off my mind.

Sort out the decision-making

My partner was lucid enough early in the process to nominate me as her primary decision-maker when she was incapacitated, even though we're not married. As it happened, all of the treatment decisions were very black-and-white so I never really had to exercise much discretion, but if the worst had happened I would have been forced to guess at what she wanted. I knew the general outlines from the years we've spent together, but after this experience we'll both be filling out 'living wills' to make things a lot more explicit. We're under forty, so we didn't expect to be dealing with this so soon, but life is uncertain. The hospital recommended Five Wishes, which is $5 to complete online, and has legal force in most US states. Even if you don't fill out the form, just talking together about what you want is going to be incredibly useful.

Ask for help

I'm normally a pretty independent person, but my partner and I needed a large team behind us to help her get well. The staff at Stanford, her family, and her friends were all there for us, and gave us a tremendous amount of assistance. It wasn't easy to reach out and ask for simple help like deliveries of clothes and toiletries, but the people around you are looking for ways they can do something useful, and it actually makes them feel better. Take advantage of their offers; it will help you focus on your caregiving.

Thanks again to everyone who helped through this process, especially the surprising number of friends who've been through something similar and whose advice helped me with the hacks above.

June 28, 2013 | Permalink | Comments (1) | TrackBack (0)

How does name analysis work?

Inigo
Photo by Glenda Sims

Over the last few months, I've been doing a lot more work with name analysis, and I've made some of the tools I use available as open-source software. Name analysis takes a list of names, and outputs guesses for the gender, age, and ethnicity of each person. This makes it incredibly useful for answering questions about the demographics of people in public data sets. Fundamentally though, the outputs are still guesses, and end-users need to understand how reliable the results are, so I want to talk about the strengths and weaknesses of this approach.

The short answer is that it can never work any better than a human looking at somebody else's name and guessing their age, gender, and race. If you saw Mildred Hermann on a list of names, I bet you'd picture an older white woman, whereas Juan Hernandez brings to mind an Hispanic man, with no obvious age. It should be obvious that this is not always reliable for individuals (I bet there are some young Mildreds out there) but as the sample size grows, the errors tend to cancel each other out.

The algorithms themselves work by looking at data that's been released by the US Census and the Social Security Administration. These data sets list the popularity of 90,000 first names by gender and year of birth, and 150,000 family names by ethnicity. I then use these frequencies as the basis for all of the estimates. Crucially, all the guesses depend on how strong a correlation there is between a particular name and a person's characteristics, which varies for each property. I'll give some estimates of how strong these relationships are below, and link to some papers with more rigorous quantitative evaluations at the end.

If you are going to use this approach in your own work, the first thing to watch out for is that any correlations are only relevant for people in the US. Names may be associated with very different traits in other countries, and our racial categories especially are social constructs and so don't map internationally.

Gender is the most reliable signal that we can glean from names. There are some cross-over first names with a mixture of genders, like Francis, and some that are too unique to have data on, but overall the estimate of how many men and women are present in a list of names has proved highly accurate. It helps that there are some regular patterns to augment the sampled data, like names ending with an 'a' being associated with women.
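To make the frequency idea concrete, here is a minimal sketch of estimating how many women and men are in a list of first names. The counts are invented for illustration (they are not the real Census or Social Security figures), and the function name is hypothetical rather than anything from my released tools.

```python
# Hypothetical birth counts by gender for a few first names.
# The real tables cover tens of thousands of names.
FIRST_NAME_COUNTS = {
    "mildred": {"female": 9900, "male": 100},
    "juan": {"female": 50, "male": 9950},
    "francis": {"female": 1500, "male": 8500},
}


def estimate_gender_counts(names, counts=FIRST_NAME_COUNTS):
    """Return the expected number of women and men in a list of first names."""
    expected = {"female": 0.0, "male": 0.0, "unknown": 0}
    for name in names:
        entry = counts.get(name.strip().lower())
        if entry is None:
            # Names that are too rare to appear in the table are counted separately.
            expected["unknown"] += 1
            continue
        total = entry["female"] + entry["male"]
        # Each name contributes fractionally, so cross-over names like
        # Francis are split between the two groups instead of being forced
        # into one.
        expected["female"] += entry["female"] / total
        expected["male"] += entry["male"] / total
    return expected


print(estimate_gender_counts(["Mildred", "Juan", "Francis", "Zyx"]))
```

Summing fractions rather than picking the most likely gender per name is what lets individual errors cancel out as the list grows.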

Asian and Hispanic family names tend to be fairly unique to those communities, so an occurrence is a strong signal that the person is a member of that ethnicity. There are some confounding factors though, especially with Spanish-derived names in the Philippines. There are certain names, especially those from Germany and Nordic countries, that strongly indicate that the owner is of European descent, but many surnames are multi-racial. There are some associations between African-Americans and certain names like Jackson or Smalls, but these are also shared by a lot of people from other ethnic groups. These ambiguities make non-Hispanic and non-Asian measures more indicators than strong metrics, and they won't tell you much until you get into the high hundreds for your sample size.
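The same approach carries over to surnames. This sketch averages per-surname shares across a list to get a rough makeup for the whole group. The shares and category labels below are placeholders made up for the example, not the published Census values, and the function name is hypothetical.

```python
# Hypothetical share of people with each surname in each category.
SURNAME_SHARES = {
    "hernandez": {"hispanic": 0.94, "white": 0.04, "black": 0.01, "asian": 0.01},
    "jackson": {"hispanic": 0.02, "white": 0.42, "black": 0.55, "asian": 0.01},
    "nguyen": {"hispanic": 0.01, "white": 0.02, "black": 0.01, "asian": 0.96},
}


def estimate_ethnic_makeup(surnames, shares=SURNAME_SHARES):
    """Average the per-surname shares over every surname found in the table."""
    totals = {}
    matched = 0
    for surname in surnames:
        entry = shares.get(surname.strip().lower())
        if entry is None:
            continue  # surnames missing from the table are ignored
        matched += 1
        for group, share in entry.items():
            totals[group] = totals.get(group, 0.0) + share
    if matched == 0:
        return {}
    return {group: total / matched for group, total in totals.items()}


print(estimate_ethnic_makeup(["Hernandez", "Jackson", "Nguyen", "Unlisted"]))
```

A surname like Jackson spreads its weight across several groups, which is why the result only becomes informative with a large sample.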

Age has the weakest correlation with names. There are actually some strong patterns by time of birth, with certain names widely recognized as old-fashioned or trendy, but those tend to be swamped by class and ethnicity-based differences in the popularity of names. I do calculate the most popular year for every name I know about, and compensate for life expectancy using actuarial tables, but it's hard to use that to derive a likely age for a population of people unless they're equally distributed geographically and socially. There tends to be a trickle-down effect where names first become popular amongst higher-income parents, and then spread throughout society over time. That means if you have a group of higher-class people, their first names will have become most widely popular decades after they were born, and so they'll tend to appear a lot younger than they actually are. Similar problems exist with different ethnic groups, so overall treat the calculated age with a lot of caution, even with large sample sizes.
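Here is a rough sketch of what compensating for life expectancy can look like: weight each birth year's popularity by the chance that someone born then is still alive. The birth counts and survival probabilities are invented for the example (a real version would read them from the name data and actuarial tables), and this is one possible way to do the adjustment, not necessarily how my own code does it.

```python
# Hypothetical births per year for a single first name.
BIRTHS_BY_YEAR = {1920: 9000, 1950: 3000, 1980: 500}

# Hypothetical probability of still being alive at a given age.
SURVIVAL_BY_AGE = {33: 0.97, 63: 0.85, 93: 0.10}


def estimate_age(births_by_year, survival_by_age, current_year):
    """Return the average age of living people with this name, or None."""
    living = {}
    for year, births in births_by_year.items():
        age = current_year - year
        # Scale the birth count down to the number expected to be alive now.
        # Ages missing from the survival table are treated as zero survivors.
        living[age] = births * survival_by_age.get(age, 0.0)
    total = sum(living.values())
    if total == 0:
        return None
    return sum(age * count for age, count in living.items()) / total


print(estimate_age(BIRTHS_BY_YEAR, SURVIVAL_BY_AGE, current_year=2013))
```

With these made-up numbers the most popular birth year is 1920, but the survival-weighted average age comes out in the mid-sixties, because few of the 1920 cohort are still alive.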

You should treat the results of name analysis cautiously - as provisional evidence, not as definitive proof. It's powerful because it helps in cases where no other information is available, but because those cases are often highly-charged and controversial, I'd urge everyone to see it as the start of the process of investigation not the end.

I've relied heavily on the existing academic work for my analysis, so I highly recommend checking out some of these papers if you do want to work with this technique. As an engineer, I'm also working without the benefit of peer review, so suggestions on improvements or corrections would be very welcome at pete@petewarden.com.

Use of Geocoding and Surname Analysis to Estimate Race and Ethnicity - A very readable survey of the use of surname analysis for ethnicity estimation in health statistics.

Estimating Age, Gender, and Identity using First Name Priors - A neat combination of image-processing techniques and first name data to improve the estimates of people's ages and genders in snapshots.

Are Emily and Greg More Employable than Lakisha and Jamal? - Worrying proof that humans rely on innate name analysis to discriminate against minorities.

First names and crime: Does unpopularity spell trouble? - An analysis that shows uncommon names are associated with lower-class parents, and so correlate with juvenile delinquency and other ills connected to low socioeconomic status.

Surnames and a theory of social mobility - A recent classic of a paper that uses uncommon surnames to track the effects of social mobility across many generations, in many different societies and time periods.

OnoMap - A project by University College London to correlate surnames worldwide with ethnicities. Commercially-licensed, but it looks like you may be able to get good terms for academic usage.

Text2People - My open-source implementation of name analysis.

June 10, 2013 | Permalink | Comments (1) | TrackBack (0)

Fixing OpenCV's Java bindings on gcc systems

Coffee
Photo by Julian Schroeder

I just spent quite a few hours tracking down a subtle problem with OpenCV's new Java bindings on gcc platforms, like my Ubuntu servers. The short story is that the default for linked symbols was recently changed to hidden on gcc systems, and the Java native interfaces weren't updated to override that default, so any Java programs using native OpenCV functions would mysteriously fail with an UnsatisfiedLinkError. Here's my workaround:

--- a/cmake/OpenCVCompilerOptions.cmake
+++ b/cmake/OpenCVCompilerOptions.cmake
@@ -252,8 +252,8 @@ set(OPENCV_EXTRA_EXE_LINKER_FLAGS_DEBUG "${OPENCV_EXTRA_EXE_LINKER_FLAGS_DEBUG

# set default visibility to hidden
if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_OPENCV_GCC_VERSION_NUM GREATER 399)
- add_extra_compiler_option(-fvisibility=hidden)
- add_extra_compiler_option(-fvisibility-inlines-hidden)
+# add_extra_compiler_option(-fvisibility=hidden)
+# add_extra_compiler_option(-fvisibility-inlines-hidden)
endif()

The tricky part of tracking this down was that nm didn't show the .hidden attribute, so the library symbols appeared fine. It was only when I switched to objdump, after exhausting everything else I could think of, that the problem became clear.
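If you want to automate that check, here is a small sketch that scans the text printed by `objdump -t` for JNI entry points marked `.hidden`. It assumes the visibility marker appears immediately before the symbol name, and the sample lines and symbol names are illustrative, not real OpenCV output.

```python
def find_hidden_symbols(objdump_output, prefix="Java_"):
    """Return symbols with the given prefix that are marked .hidden."""
    hidden = []
    for line in objdump_output.splitlines():
        fields = line.split()
        # The last field is the symbol name; a hidden symbol has ".hidden"
        # as the field just before it.
        if len(fields) >= 2 and fields[-2] == ".hidden" and fields[-1].startswith(prefix):
            hidden.append(fields[-1])
    return hidden


# Illustrative symbol-table lines with hypothetical symbol names.
SAMPLE = "\n".join([
    "0000000000001139 g     F .text  0000000000000007 .hidden Java_com_example_Native_add",
    "0000000000001140 g     F .text  0000000000000007 Java_com_example_Native_sub",
    "0000000000004010 l     O .data  0000000000000004 .hidden __dso_handle",
])

print(find_hidden_symbols(SAMPLE))
```

In practice you would feed it the captured output of objdump run against the native library, and any name it returns is a candidate for the UnsatisfiedLinkError.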

Anyway, I wanted to leave some Google breadcrumbs for anyone else who hits this! I've filed a bug with the OpenCV folks, so hopefully it will be fixed soon.

June 07, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivesign
Photo by Leo Reynolds

External framework problems in Go - Handling dependencies well is extremely hard, and can lead to insane yak-shaving expeditions like this when things go wrong. It's like an avalanche - changing versions on one library can impact several others, so you have to update or downgrade those too, and suddenly you're facing an ever-increasing amount of work just to get back to where you were!

D-wave comparison with classical computers - I don't know enough about quantum computing problems to comment on the details of the argument, but it's awesome to see such a deep technical dive as an instant blog post, rather than having to wait months for a paper.

Blogging is dead, but have we fixed anything? - "I find my blogging here to be too useful to me to stop doing it" sums up why I'm still working in a now-archaic medium!

What statistics should do about big data - "[Statisticians] want an elegant theory that we can then apply to specific problems if they happen to come up." That's been exactly my experience, and why I've never encountered statisticians as I've followed my curiosity to new data problems. The article this is in response to contains the assumption that 'funding agencies' have driven the CS takeover of data processing, but, despite a lot of the founders having roots in academia, almost all the innovations I've seen have been incubated in commercial environments.

The hidden sexism in CS departments - A portrait of managerial cluelessness when dealing with a nasty incident. Even if each occurrence is comparatively minor, it's the steady drip-drip of unwelcoming behavior that drives non-stereotypical geeks out of our world.

June 03, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivetally
Photo by ahojohnebrause

Max Headroom and the strange world of pseudo-CGI - I've always been fascinated by cargo cult analog tributes to technology. Maybe my early exposure to Max gave me the bug?

Reidentification as basic science - Arvind does a fantastic job of explaining why the research he does is so important. I love learning more about people from data, and most of the interesting insights come from interrogating it in unusual ways and finding unexpected connections, which is what his work is all about.

A 21cm radio telescope for the cost-conscious - Beautiful geekery. Who doesn't want to map the Milky Way's radio emissions using nothing more sophisticated than a $20 USB TV dongle?

How Google Code worked - An eminently-practical guide to implementing a regular-expression search engine, from the author of the late-lamented Google Code. It even comes with working source code!

3D lightning - Calculating the three-dimensional path of a lightning bolt from two simultaneous pictures taken from different spots.

May 29, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivelocks
Photo by Tony Preece

CLAVIN - A very promising open source geotagging project that analyzes unstructured text and identifies geographic entities. It has some very neat tricks up its sleeve to disambiguate common names like 'Springfield' based on the context.

The Sokal Hoax: At whom are we laughing? - Post-modernism makes an easy target for hard scientists, but this is a good reminder that some of the giants of physics made even more meaningless pronouncements about fields they knew nothing about.

Name-cleaver - A scrumptious little project from Sunlight Labs that handles a lot of the messy data cleanup work around people and organization names.

altmetrics: a manifesto - On the topic of scientists being silly, the way we measure academic output is antiquated beyond belief, so it was great to see this from my friend Cameron Neylon. We can do way better than citations.

Improving the security of your SSH private key files - This is what happens when hackers (in the old-school sense) get interested in a topic. Martin's curiosity about how SSH works led him to find out some sub-par default settings that make a passphrase on your keys a lot less effective than you might think. I didn't know about those particular problems, but I've always followed my Apple and kept my keys on an encrypted DMG.

May 24, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivestar
Photo by Eldeeem

The Cartography of Bullshit - A righteous rant against a piece of pop-sociology digging into just how flimsy the underlying statistics are. It hits home because numbers I've mined have ended up in similar columns - a White Power group even used some of my research to 'prove' Mexicans were conquering Texas based on the numbers of Juans versus Johns! Take all studies on controversial subjects like race with a massive pinch of salt.

Welcome, recent graduates - Advice I wished I'd had when I looked for my first post-college job. 

Sublime DataConverter - We've ended up using CSV for lists of objects where the property names remain constant and JSON for messier data structures and as a programming model post-transport. We've homebrewed a limited set of routines to automatically scan headers or walk all objects and extract all possible properties so we can automatically convert between the two representations, but this project is a much more general approach to the same problem.

The Split-Apply-Combine Strategy for Data Analysis - A technical but enlightening read from Hadley Wickham, covering ways of applying the same algorithms across many different representations of data.

Nightmare after nightmare: Students trying to replicate work - Remember what I said about taking studies with a pinch of salt? Even with help from the original authors, PhD students had incredible trouble reproducing the results of published papers. This isn't just a problem for social science, all science is a messy business and we need to keep our skepticism intact. That isn't a free pass to ignore evolution and climate change though!
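The header-scanning conversion between JSON and CSV mentioned in the Sublime DataConverter item above can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not our homebrewed routines: it walks every object to collect the union of property names, writes them as the CSV header, and reads the rows back.

```python
import csv
import io
import json


def objects_to_csv(objects):
    """Write a list of dicts as CSV, using the union of all keys as the header."""
    header = []
    for obj in objects:
        for key in obj:
            if key not in header:
                header.append(key)  # keep first-seen order
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=header, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(objects)
    return output.getvalue()


def csv_to_objects(text):
    """Read CSV text back into a list of dicts, dropping empty cells."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if v != ""} for row in reader]


records = json.loads('[{"name": "Ann", "city": "Leeds"}, {"name": "Bob", "age": 41}]')
csv_text = objects_to_csv(records)
print(csv_text)
print(json.dumps(csv_to_objects(csv_text)))
```

Note that the round trip is lossy: CSV has no types, so the number 41 comes back as the string "41", and an empty string is indistinguishable from a missing property.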

May 20, 2013 | Permalink | Comments (0) | TrackBack (0)

No more heatmaps that are just population maps!

I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes, and a couple of major new features, and you can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places actually have an unusually high occurrence of X, rather than just having more people.
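As a rough illustration of that normalization, this sketch divides raw counts by population density so that crowded places don't automatically look like hotspots. The place names and numbers are made up, and the density values are hard-coded stand-ins for what you would look up per location.

```python
def normalized_rates(event_counts, population_density):
    """Divide raw counts by population density so dense places don't dominate."""
    rates = {}
    for place, count in event_counts.items():
        density = population_density.get(place)
        if not density:
            continue  # skip places with missing or zero density
        rates[place] = count / density
    return rates


# Hypothetical counts of X per place, and people per square km.
counts = {"downtown": 900, "suburb": 120, "village": 30}
density = {"downtown": 9000, "suburb": 1500, "village": 100}

rates = normalized_rates(counts, density)
print(sorted(rates, key=rates.get, reverse=True))
```

With these numbers the village ranks first after normalization, even though downtown has by far the most raw occurrences.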

I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.

text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.

I've expanded language support, with a new Ruby gem that you can get via 'gem install dstk' (which includes unit testing), and an R Package adding the two new APIs to Ryan Elmore's original, available as RDSTK. The Python and Javascript clients have been updated to the latest APIs too.

There's also an official .ova version for people using VMware, up at http://static.datasciencetoolkit.org/dstk_0.50.ova

What's still to be done?

The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.

The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.

Unit testing has shown that text2sentences isn't working at all!

Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.

pete@jetpac.com

May 19, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Stationfive
Photo by Curtis Perry

The Declassification Engine - "Saving history from official secrecy". A fascinating concept that shows how the firehose of cheap distributed computing power fundamentally changes what privacy and secrecy mean. We can probably reconstruct a lot of information that people think they've hidden in these documents, but what are the rules?

A 63-bit floating point type for 64-bit OCaml - I've never used the language, but I adore the bit-fiddling that goes into floating-point representations, and this is a lovely hack on top of them.

Local geocoder - A lovely minimal reverse geocoder that's self-contained, including data. I've been excited to see a blossoming of open geocoding solutions, Nominatim has improved in leaps and bounds, PostGIS now has some strong capabilities, and I've been having fun with the Data Science Toolkit of course!

How to say nothing in 500 words - Ancient advice about writing that's still useful. "Call a fool a fool"!

Olympians Festival - I've been getting a lot out of the local TheaterPub nights in San Francisco, so I'm excited to make it to this twelve-night festival with a whopping 36 new plays in November! I'm also a sucker for the Greek myths, ever since I grew up with Tony 'Blackadder' Robinson's retelling of the Iliad as a kid.

May 11, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Fivetype
Photo by Grant Hutchinson

Assuming everybody else sucked - If an industry is behaving in an apparently irrational way, try to figure out the internal logic that's driving that behavior. You'll be much more effective at breaking the rules if you understand what they are first.

Storing and publishing sensor data - Now we're scattering sensors around like confetti, we're generating ever-growing mounds of time-series data, so here's a good overview of where you can shove it.

100,000 Stars - This WebGL exploration of the universe is so good I feel like this should have already been plastered all over the internet, but maybe I've been living under a rock?

Mapping the product manifold - I started off in image processing, carried what I'd learned to unstructured text, and now I'm fascinated to see techniques flowing back the other way. We're going to be doing crazily effective recognition of images, language, and every other kind of noisy signal within a few years.

What happened to the crypto dream? - A clear-eyed examination of where the crypto dream of the 90's ended up - "the demand for technologies that will upset that power balance is quite low".

May 07, 2013 | Permalink | Comments (0) | TrackBack (0)

We're all starting to track ourselves

Mapscreenshot

We're releasing a massive and growing amount of information about who we are, where we go, and when. There are hundreds of millions of public checkins already out there, and millions more are being created every day. People think of Foursquare as the leading source, but actually Instagram, Facebook, Twitter, Flickr, and Google Plus all produce incredible numbers of geo-located checkins, some of them many, many more than Foursquare.

This is going to cause big changes in our world. We've already taught our computers what we buy and read, now we're telling them where we spend our real-world lives. Just our presence at a location at a particular time becomes powerful data when it's combined with all the other people doing the same thing. We're instrumenting our movements at a very detailed level, and sending them out into the ether. Even more amazingly, we're adding high-resolution photos and detailed comments to the checkins.

It's hard to overstate how effective this data can be at solving intractable problems. Economists, sociologists, and epidemiologists would kill to have detailed pictures of the lives we lead at this kind of scale. There will be applications we haven't even thought of too, connecting us with people we should be talking to, introducing us to new experiences, all sorts of feedback that will change how we live.

It's a scary new world to contemplate too of course, which is why I keep blogging about what I'm up to. Recently I've been working with my team at Jetpac analyzing billions of photos from all sorts of social sources, to help both tourists and locals figure out where to go and what to do. I want to share an internal tool we use to explore the data, a map interface to the checkins that people have shared publicly. If you want to get a concrete feel for how our world's changing, check it out:

https://www.jetpac.com/map

It's still an experimental tool so apologies for any bugs, but I hope you find this glimpse of the mountain of public data we're all creating as fascinating as I do! You can find all of the individual photos and other checkins out there on the public web, but seeing them accumulated together in one place still blows my mind.

May 06, 2013 | Permalink | Comments (0) | TrackBack (0)

Five short links

Starknot
Photo by Neil Platform1

GeoURI - I have no earthly use for these, but I love that they exist, and are even an IETF standard!

Nathaniel Bowditch - He created the American Practical Navigator over two hundred years ago. He improved the data quality of previous works and made the results widely available in a form non-specialists could easily understand. That approach transformed navigation then, and it's still incredibly effective today across all sorts of fields.

Digital Elevation Data - On that topic Jonathan de Ferranti has spent years painstakingly correcting open-source geographic data about the height of the earth's surface, and then releasing the results openly to anyone who needs them. It may be hard for non-geo folks to understand how tough a problem this is, and how hard he's worked on it, so here are some example renders and an independent review.

Sentiment Analysis Corpora - A fantastic summary and comparison of the raw data sets you need to build sentiment analysis algorithms.

A Major Breakthrough in Image Processing - It's time to retire Lena!

May 02, 2013

Open Sentiment Analysis

Smileyfingers
Photo by Courtney Carmody

Sentiment analysis is fiendishly hard to solve well, but easy to solve to a first approximation. I've been frustrated that there have been no easy free libraries that make the technology available to non-specialists like me. The problem isn't with the code; there are some amazing libraries like NLTK out there, but everyone guards their training sets of word weights jealously. I was pleased to discover that SentiWordNet is now CC-BY-SA, but even better I found that Finn Årup Nielsen has made a drop-dead simple list of words available under an Open Database License!

With that in hand, I added some basic tokenizing code and was able to implement a new text2sentiment API endpoint for the Data Science Toolkit:

http://www.datasciencetoolkit.org/developerdocs#text2sentiment

Give it a try, it's as simple as a curl call from the terminal:

curl -d "I hate this hotel" "http://www.datasciencetoolkit.org/text2sentiment"

{"score": -3.0}

I've been having a blast with it, simple-minded as it is, so I hope you do too!
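If you're curious what a word-list approach like this looks like in code, here's a minimal Python sketch. It is not the Data Science Toolkit implementation: the tiny WORD_WEIGHTS table, the tokenizing regex, and the choice to average the matched weights are all assumptions made for illustration, with the weight for "hate" picked to agree with the example response above.

```python
import re

# Hypothetical miniature word list, for illustration only.
# A real open word list has far more entries than this.
WORD_WEIGHTS = {"hate": -3.0, "love": 3.0, "great": 3.0, "terrible": -3.0}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def text2sentiment(text, weights=WORD_WEIGHTS):
    """Average the weights of the recognised words; 0.0 if none match.

    Averaging is an assumption of this sketch, not a documented
    behaviour of the real endpoint.
    """
    scores = [weights[token] for token in tokenize(text) if token in weights]
    if not scores:
        return {"score": 0.0}
    return {"score": sum(scores) / len(scores)}


print(text2sentiment("I hate this hotel"))
```

Scoring like this knows nothing about negation or sarcasm, which is a big part of why it only gets you a first approximation.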

April 30, 2013

Five short links

Earthlight

A Global Poverty Map Derived from Satellite Data - This is an old paper from 2006, but I love the idea of using how much light a neighborhood sends into the night sky to measure how wealthy it is. Richness is highly correlated with wastefulness, apparently.

Open Multi-lingual WordNets - We're mapping our inner worlds too, and these open data sets are an incredibly useful source of information on word meanings for anyone working with computers and human languages.

The Invisible City - A fake Canadian city briefly appeared on OpenStreetMap, complete with an elaborate public transport network. Or was it briefly a real place blinking in and out of existence, with only a lone volunteer mapper spotting it?

The Dark Side of Social Capital - We usually think of community as a good thing, but anybody who grew up in a small town can tell you that the power can be used to exclude outsiders too.

K2C 1N5 - Ervin Ruci is being hounded by the Canadian Postal Service for the crime of making a crowdsourced database of postal codes freely available, and now they've decided they own the copyright to the words "postal code" too!

April 25, 2013

Five short links

Fiveoclock
Photo by Tasty Goodness

Yoyodyne - How a fictional company was born in the novels of Thomas Pynchon, was adopted by Buckaroo Banzai and Star Trek, and ended up in the GPL.

What will be left of our cities? - The nitty-gritty details of what will happen to our concrete, brick, and steel long after we're dead and gone.

On glitch art, and the fascinating mistakes computers make - I was a terrible VJ with footage, but I had so much fun with live feeds and static. Don't believe technology's mask of perfection, engineers know what a rats' nest every product is under the hood.

Is MS Office the quiet villain of global finance? - Our kids will look back on the last couple of decades as a time when we fell under the spell of cold hard numbers, without really looking at how they were produced.

Search history and accidental class warfare - A variant of the echo chamber effect, and an example of the law of unintended consequences. Recommendation algorithms are becoming our century's version of press barons.

April 22, 2013

Do we need a slow software movement?

Woolysnail
Photo by Tim Regan

When I was an isolated kid in the English countryside my only connections to the computing world were "Public Domain" floppy disks. Mail-order libraries would send me one of the disks in their catalog if I posted them a pound coin taped to a piece of card. I've never forgotten how important those glimpses into a wider world were, and I'll always be grateful to the people who made their demos, games, and utilities freely available. They were a lifeline to me, and I always wanted to give something back in return. My first contribution was a 'desktop palette' of 16 colors I'd selected for an especially pleasing RISC OS background, which didn't exactly set the world on fire.

That set the tone for most of my open source career - when I release a new project, I expect a deafening silence. There are occasional exceptions, but most of them don't make sense to anyone else, at least at first. The majority get quietly ignored by me and everyone else, but a few I keep working on, and they occasionally get picked up by other people too.

The Data Science Toolkit has turned into one of those sleeper projects. Over the last few months I've had a lot of bug reports, which is the best measure of how many people are actually using the code! There have been some nice companion projects too, like this wrapper for Excel or the new API library for Node. It also powers OpenHeatMap.com, which keeps growing like a weed entirely through word of mouth. Hearing about the uses has been fascinating: academics of 19th century American literature mapping the spread of place mentions, reporters analyzing documents to track corruption in developing countries, mobile real estate app startups, university alumni associations.

The common thread for everyone using it is that they're marginal, just like I was growing up. There aren't enough of them and they don't have enough money to tempt commercial developers. Young open source software grows in the cracks between profitable problems, and survives on a starvation diet of spare-time coding. This gives it the time to find its niche, its audience, in a way that a more conventional development approach never could. Slow-growing software has the chance to reach people who'd never be found any other way, so if you're working on an unpopular project that you love, don't give up!

April 18, 2013

Five short links

Fivepound
Photo by Kurtis Garbutt

Geo-location estimation of Flickr images - The caption, title, and description of a photo are incredibly useful when it comes to guessing where it was taken, even using fairly crude language analysis algorithms. This is a great paper that parallels a lot of what we've found using unstructured text for image location at Jetpac.

Create a heatmap in Excel - Excel can be crusty and hard to learn, but I'm constantly surprised by how much you can do with it once you dive into its depths.

Death by a thousand paper cuts - I get asked the same questions over and over again by people I've just met once they detect my accent - "Where are you from?", "Do you like soccer?", "Why did you come here?". I appreciate that they're trying to connect with me, but the sheer repetition and predictability can make it hard to answer them with enthusiasm.

I can only imagine how tough it must be to deal with repetitive comments when people are behaving like jerks, rather than being nice. Julie does a good job explaining why, even when any single incident can seem fairly minor, an unending succession of them becomes impossible to deal with. The programming world tolerates people behaving like jerks in small ways towards anyone who isn't like them, over and over and over again.

Big Data and Conflict Prevention - The world ignored warning signs about famines and wars from small data, and they're doing the same thing with big data.

Helsinki Bus Station Theory - The case for sticking it through an apprenticeship so you can do something truly creative afterwards.

April 17, 2013

Converting to and from Google map tile coordinates in PostGIS

Google Maps' system of power-of-two tiles has become a de facto standard, widely used by all sorts of web mapping software. I've found it handy to use as a caching scheme for our data, but the PostGIS calls to use it were getting pretty messy, so I wrapped them up in a few functions. The code is up at https://github.com/petewarden/postgis2gmap, and here's a quick rundown:

tile_indices_for_lonlat(lonlat geography, zoom_level int)

Takes a PostGIS latitude/longitude point and a zoom level, and returns a geometry object where the X component is the longitude index of the tile, and the Y component is the latitude index. These values are not rounded, so for a lot of purposes you'll need to FLOOR() them both, eg;

SELECT FLOOR(X(tile_indices_for_lonlat(checkins.lonlat, 4))) AS grid_lon, FLOOR(Y(tile_indices_for_lonlat(checkins.lonlat, 4))) AS grid_lat FROM checkins;
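To make the arithmetic concrete, here's a rough Python sketch of the standard Web Mercator tile formula that Google-style tiles are based on. This assumes that's the math involved; the SQL in the repository is the authority on what the PostGIS function actually computes, and the plain longitude/latitude floats here stand in for the geography argument.

```python
import math


def tile_indices_for_lonlat(lon, lat, zoom_level):
    """Return unrounded (lon_index, lat_index) for a point in degrees.

    Uses the usual Web Mercator tiling, where zoom level z splits the
    world into 2**z tiles along each axis. This is a sketch of the
    standard formula, not a port of the PostGIS function.
    """
    tiles_across = 2 ** zoom_level
    lon_index = (lon + 180.0) / 360.0 * tiles_across
    lat_rad = math.radians(lat)
    # asinh(tan(lat)) is the Mercator projection of the latitude.
    mercator_y = math.asinh(math.tan(lat_rad))
    lat_index = (1.0 - mercator_y / math.pi) / 2.0 * tiles_across
    return lon_index, lat_index


# Roughly downtown San Francisco, at zoom level 4.
lon_index, lat_index = tile_indices_for_lonlat(-122.4194, 37.7749, 4)
print(math.floor(lon_index), math.floor(lat_index))
```

As with the SQL version, the indices come back unrounded, so the caller decides whether to floor them into a whole tile or keep the fractional position within it.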

lonlat_for_tile_indices(lat_index float8, lon_index float8, zoom_level int)

Does the inverse of the function above, turning a Google Maps tile index for a given zoom level into a PostGIS geometry point. You may notice that the coordinates are given as separate arguments rather than a single geometry object. That's an artifact of how my data is stored. Here's an example:

SELECT X(lonlat_for_tile_indices(6, 2, 4)::geometry), Y(lonlat_for_tile_indices(6, 2, 4)::geometry);
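The inverse can be sketched in Python the same way. Again, this is the standard Web Mercator tile formula rather than a port of the repository's SQL, and returning a plain (lon, lat) tuple in degrees is an assumption of the sketch. The argument order mirrors the signature above, with the latitude index first.

```python
import math


def lonlat_for_tile_indices(lat_index, lon_index, zoom_level):
    """Return (lon, lat) in degrees for the top-left corner of a tile.

    Fractional indices are allowed and give positions inside a tile.
    """
    tiles_across = 2 ** zoom_level
    lon = lon_index / tiles_across * 360.0 - 180.0
    # Undo the Mercator projection to get back to a latitude.
    mercator_y = math.pi * (1.0 - 2.0 * lat_index / tiles_across)
    lat = math.degrees(math.atan(math.sinh(mercator_y)))
    return lon, lat


lon, lat = lonlat_for_tile_indices(6, 2, 4)
print(lon, round(lat, 4))
```

Whole-number indices give the north-west corner of a tile, because tile rows are counted downwards from the top of the map.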

bounds_for_tile_indices(lat_index float8, lon_index float8, zoom_level int)

This takes the latitude and longitude indices of a tile, and a zoom level, and returns a geography object containing the bounding box for that tile. I mainly use this for limiting queries on geographic data to a particular tile, eg;

SELECT * FROM checkins WHERE ST_Intersects(lonlat, bounds_for_tile_indices(6, 2, 4));
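A tile's bounding box is just the corner of that tile plus the corner of its diagonal neighbour. Here's a minimal Python sketch under the same assumption as before (standard Web Mercator tiling); the (west, south, east, north) tuple is a hypothetical stand-in for the geography object the PostGIS function returns.

```python
import math


def corner_for_tile_indices(lat_index, lon_index, zoom_level):
    """Return (lon, lat) in degrees for the north-west corner of a tile."""
    tiles_across = 2 ** zoom_level
    lon = lon_index / tiles_across * 360.0 - 180.0
    mercator_y = math.pi * (1.0 - 2.0 * lat_index / tiles_across)
    lat = math.degrees(math.atan(math.sinh(mercator_y)))
    return lon, lat


def bounds_for_tile_indices(lat_index, lon_index, zoom_level):
    """Return (west, south, east, north) in degrees for one tile."""
    west, north = corner_for_tile_indices(lat_index, lon_index, zoom_level)
    # The next tile down and to the right shares our south-east corner.
    east, south = corner_for_tile_indices(lat_index + 1, lon_index + 1, zoom_level)
    return west, south, east, north


print(bounds_for_tile_indices(6, 2, 4))
```

Note that the tiles are equal widths in longitude but not in latitude, so the box gets shorter in degrees as you move away from the equator.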

April 09, 2013

Five short links

Fivehand
Photo by Alan Levine

Elephant - A beautiful open source project to store data in a way that's "as durable as S3, as portable as JSON, and as queryable as HTTP". Tim O'Reilly has talked about the web operating system, and HTTP, JSON, and REST-like APIs (without the annoyances of full REST) have become the interface layer. I know integration will be do-able whenever I see a project based around them.

Median SF rent for a one-bedroom apartment - I wish Craigslist made their data openly available. It's already public, why not enable more useful services like this?

Everything We Know About What Data Brokers Know About You - The data about us that's used for marketing purposes is essentially unregulated. As someone who works with data about people for a living I'm glad I'm able to innovate, but I'm also depressed by how little the public actually cares about how their information is passed around and used.

The Design-Fiction Slider-bar of Disbelief - A corker of a listicle from Bruce Sterling, covering the continuum from imagination to regulation.

Scrapely - I love pulling data from messy HTML pages, and it's great to see more and more support emerging. Don't give me an API, just give me an open robots.txt.

April 08, 2013

The Chairs and The Shrew

Unibrow
Photo by Jesse Bell

I have middlebrow tendencies, but over the years I've learned that the struggle with difficult work can pay off. I grew to love Infinite Jest, once I figured out Wallace was boring me deliberately, that he cared about the mundane details that make up people's real lives. As a teenager I figured out that subtitles on BBC2 late at night meant nudity, and I wound up appreciating French cinema despite my base motivations. My favorite play from last year was a production of Beckett's Endgame, which left me heartbroken for characters who should have been unrelatable, screaming at each other from trash cans.

On Sunday night I made it to the Cutting Ball's production of Eugene Ionesco's The Chairs. I was ready to put some work in but I hoped it would pay off. I left a little disappointed. There was a lot to chew on intellectually, and the performances were fantastic, but I never cared about what was happening. It was a puzzle, but a cold one, and I never felt there was anything at stake, despite it being set at the end of the world. The basic plot (do existentialists care about spoilers?) is that an old married couple, apparently the last people on earth, begin to host a party full of imaginary guests, and the husband prepares to give his inspiring message to the world. The 'orator' who will deliver the message appears as a flesh-and-blood person, and the couple commit suicide, and then the orator delivers what turns out to be nonsensical gobbledegook. I could imagine a play that made this pack a punch but when the couple threw themselves out of the windows, all that was going through my mind was "How long until we hear the splash?". It didn't feel like the company's fault, the translation was strong and the acting was up to the high standard I've come to expect from the Cutting Ball. The barrier I hit was Ionesco's writing. I know he was demonstrating how the "language of society" breaks down and how hard it is to communicate, but I'm bourgeois enough to want something more than an intellectual thesis from a play. I wish I'd caught The Bald Soprano by the same director a few years ago, one of my friends told me that worked much more effectively for him, so maybe that would have helped me connect with Ionesco?

I had very different expectations for last night's entertainment, The Taming of the Shrew by TheaterPub at the Cafe Royale. I discovered the group last month when they did multiple interpretations of a short experimental play, and I knew they had attracted a team of talented and enthusiastic actors, so I was excited to see how they'd tackle Shakespeare. Shrew isn't an easy play to produce, modern audiences are going to struggle to swallow the central plot, that an opinionated woman needs to be psychologically tortured until she submits to her husband. Shakespeare can't help but write fleshed-out characters though, so there was usually enough wiggle room in the interpretation to make them sympathetic to us. The only exception was Kate's final speech, even with the emphasis that she'd only bow to her husband's honest will it was hard to see as a happy ending. Despite that quibble with the source text, the whole evening was a massive amount of fun. I loved the relish and gusto that the whole cast showed. Kim Saunder's Kate and Paul Jenning's Petruchio appropriately stole the show with big performances that had me laughing and completely believing in their tricky-to-swallow relationship. Paul seemed to be channeling the best of John Goodman and Jack Black as he played the crazy suitor, and Kim's obvious enjoyment of the tongue-lashings Kate gives to the world played perfectly. I'd also single out Shane Rhode for the energy and imagination he brought to the tough part of Grumio, playing up to the audience as he witnessed the ridiculousness unfolding along with us. Ron Talbot, Jan Marsh, Vince Faso, Brian Martin, Sam Bertken, and Sarah Stewart all deserve credit for their work too, everyone was throwing themselves wholeheartedly into their roles, and generating a lot of laughter.

There are three more performances coming up, one tonight (Tuesday March 19th), and then Monday March 25th and Wednesday March 27th, all at 8pm. If you're looking for an experience that's all theater, with an audience and cast that are all there for pure enjoyment of a play, try to make it along. The venue is incredibly relaxed (thanks to the director Stuart Bousel for gently handling a couple of folks who were, ahem, a little too relaxed when the play started), you don't need reservations, and they have great beer!

March 19, 2013
