PeteSearch

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.



A lovely language visualization

In case you missed it on ReadWriteWeb, researchers at MIT and NYU have created a fascinating visual map of nouns. They're pulling the word relationships from WordNet, a venerable data set that maps the relationships between over 150,000 words. I'll be studying their paper to understand exactly how they grouped the nouns, since they seem to have done a great job of clustering them in a meaningful way across a 2D surface. That could be very useful for keyword similarity measurements and email grouping.

It also reminds me of a film-strip visualization I saw a couple of years back, though unfortunately I can't find the reference for it. A frame was taken every 5 seconds from a movie and shrunk to a few pixels, and then the sequence of images was arranged in a grid. You could get information about the different scenes and moods in the whole movie at once, just from the color of each section. It wasn't much, but it was enough to let your brain's visual processing machinery comprehend the structure.

In the same way, this map is a good way of presenting the whole space of noun categories in a way that's much easier to navigate than a hierarchical tree or table. A common trick for memorizing arbitrary data like long random numbers is to associate each part with physical locations, because we evolved to be really good at remembering exactly where all the fruit (and leopards!) were in the local jungle. It's easy to find and return to a given noun in this setup because you're using the same skills.

January 27, 2008 in Outlook API | Permalink | Comments (0) | TrackBack (0)

The silent rise of Sharepoint


According to a new report, over half of companies that use an Exchange mail server also use Sharepoint. This backs up my personal experience. For example, Liz works for a fairly conservative large company but even they are heavy Sharepoint users.

This is a big technological change, but it tends to slip under a lot of people's radar because it's a closed-source, me-too technology with a very traditional business model. It's successful because it's stable, uses a familiar UI, is easy to deploy, often comes for free with Office, and overall works remarkably well.

Microsoft are providing a ready-made distribution channel for getting your technology in front of employees. They're training massive numbers of people to create and consume user-generated content on the company's intranet. The great thing is that they leave plenty of room for third-party products to take advantage of this. They have some Exchange/Sharepoint integration, and no doubt will be increasing that in the future, but there's a fantastic opportunity to present all sorts of interesting mail-derived information in a place people are already looking. A good example of this would be automatically populating each employee's homepage with links to her most frequent internal and external contacts, or adding email-driven keywords there to be found by a 'FindaYoda' style search.

I'm so convinced this is an important direction that I have my own Sharepoint site to use as a testbed, hosted with Front Page Web Services [Update- They're now FPWeb.net, at http://www.fpweb.net/sharepoint-hosting/ ]. I'll be posting more about the integration opportunities as I dig deeper, as well as using it when I need to collaborate.

[Update- Eric did a great Sharepoint post on Friday too, with some interesting points on the way collaboration with Sharepoint is heavily grass-roots driven at the moment, which will mean a strong drive for the IT department to catch up]

[Second update -

January 25, 2008 in Outlook API | Permalink | Comments (1) | TrackBack (0)

Two ways you can easily find interesting phrases from an email

Maybe it was my weekly D&D game last night, but probability is on my mind. One thing I've learnt from working in games is that accuracy is overrated in AI. Most problems in that domain have no perfect solution. The trick is to find a technique that's right often enough to be useful, and then make it part of a workflow that makes coping with the incorrect guesses painless for the user.

A lot of Amazon's algorithms work like this. They recommend other books based on rough statistical measures which bring up mostly uninteresting items, but it's right often enough to justify me spending a few seconds looking at what they found. The same goes for their statistically improbable phrases. They're odd and random most of the time but usually one or two of them do give me an insight into the book's contents.

This is interesting for email because when I'm searching through a lot of messages I need a quick way to understand something about what they contain without reading the whole text. One of the key features of Google's search results is the summary they extract surrounding the keywords for each hit. This gives you a pretty good idea of what the page is actually discussing. In a similar way I want to present some key phrases from an email that very quickly give you a sense of what it's about.

The main approach I'm using is vanilla SIPs, but there are a couple of other interesting heuristics (which sounds so much more technical than 'ways of guessing'). The first is looking for capitalized phrases within sentences. These are usually proper nouns, so you'll get a rough idea of which people or places are discussed in a document. The second is to find sentences that end with a question mark, so you can see what questions are asked in an email.

These are fun because they're both reliant on easily-parsed quirks of the language, rather than deep semantic processing. This means they're quick and easy to implement. It also means that they're not very portable to other languages, German capitalizes all nouns for example, but one problem at a time!
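Here's a rough sketch of both heuristics in Python. The exact regular expressions and helper names are just illustrations, not what the plugin actually uses, but they show how shallow the parsing can be:

```python
import re

SENTENCE_SPLIT = re.compile(r'(?<=[.!?])\s+')

def capitalized_phrases(text):
    """Runs of capitalized words that aren't at the start of a sentence.

    These are usually proper nouns, so they give a rough idea of the
    people and places mentioned."""
    phrases = []
    for sentence in SENTENCE_SPLIT.split(text):
        run = []
        # Skip the first word, since it's capitalized whatever it is.
        for raw in sentence.split()[1:]:
            word = raw.strip('.,;:!?()"\'')
            if re.match(r"^[A-Z][A-Za-z'-]*$", word):
                run.append(word)
            elif run:
                phrases.append(' '.join(run))
                run = []
        if run:
            phrases.append(' '.join(run))
    return phrases

def questions(text):
    """Sentences that end with a question mark."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text)
            if s.strip().endswith('?')]

sample = ("Are we still meeting at Rancho San Antonio tomorrow? "
          "The trail crew from Santa Clara County should be there too.")
print(capitalized_phrases(sample))  # ['Rancho San Antonio', 'Santa Clara County']
print(questions(sample))  # ['Are we still meeting at Rancho San Antonio tomorrow?']
```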

January 24, 2008 in Outlook API | Permalink | Comments (0) | TrackBack (0)

How to use corporate data to identify experts


Nick over at the Disruptor Monkey blog talks about how their FindaYoda feature has proved a surprise hit. This is a way of seeing who else has a lot of material with a keyword you're looking for, and its success backs up one of the hunches that's driving my work. I know from my own experience of working in a large tech company that there's an immense amount of wheel-reinventing going on just because it's so hard to find the right person to talk to.

As a practical example I know of at least four different image comparison tools that were written by different teams for use with automated testing, with pretty much identical requirements. One of the biggest ways I helped productivity was simply by being curious about what other people were working on and making connections when I heard about overlap.

One of the tools I'd love to have is a way to map keywords to people. It's one of the selling points of Krugle's enterprise code search engine. Once you can easily search the whole company's code you can see who else has worked with an API or algorithm. Trampoline Systems aims to do something similar using a whole company's email store; they describe it as letting you discover knowledge assets. I'm trying to do something similar with my automatic tag generation for email.
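To make that concrete, the simplest version of a keyword-to-people map is just an inverted index built over the sender of each message. This is my own toy sketch of the idea, not how Krugle or Trampoline actually do it, and a real version would index generated tags rather than raw tokens:

```python
import re
from collections import defaultdict

def build_expert_index(messages):
    """Map each word to the set of people whose emails mention it.

    messages is an iterable of (sender, text) pairs."""
    index = defaultdict(set)
    for sender, text in messages:
        for token in set(re.findall(r"[a-z']+", text.lower())):
            index[token].add(sender)
    return index

index = build_expert_index([
    ('alice@example.com', 'Our automated tests compare screenshots pixel by pixel'),
    ('bob@example.com', 'Wrote yet another image comparison tool for the test rig'),
])
print(sorted(index['screenshots']))  # ['alice@example.com']
print(sorted(index['comparison']))   # ['bob@example.com']
```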

It's not only useful for the people at the coal face, it's also a benefit that seems to resonate with managers. The amount and cost of the redundant effort is often clearer to them than to the folks doing the work. Since the executives are the ones who make the purchasing decisions, that should help the sales process.

January 23, 2008 in Defrag, Implicit Web, Outlook API | Permalink | Comments (0) | TrackBack (0)

How can you solve organizational problems with visualizations?


Valdis Krebs and his Orgnet consultancy have probably been looking at practical uses of network analysis longer than anyone. They have applied their InFlow software to hundreds of different cases, with a focus on problem solving within commercial organizations, but also looking at identifying terrorists, and the role of networks in science, medicine, politics and even sport.

I am especially interested in their work helping companies solve communication and organizational issues. I've had plenty of personal experience with merged teams that fail to integrate properly, wasted a lot of time reinventing wheels because we didn't know a problem had already been solved within the company, and been stuck in badly configured hierarchies that got in the way of doing the job. To the people at the coal face the problems were usually clear, but network visualizations are a very powerful tool that could have been used to show management the reality of what was happening. In their case studies, that seems to be exactly how they've used their work: as a navigational tool for upper management to get a better grasp on what's happening in the field, and to suggest possible solutions.

Orgnet's approach is also interesting because they are solving a series of specialized problems with a bespoke, boutique service, whereas most people analyzing companies' data are trying to design mass-market tools that will solve a large problem like spam or litigation discovery with little hand-holding from the creators of the software. That gives them unique experience exploring areas that may lead to really innovative solutions to larger problems in the future.

You should check out the Network Weaving blog, written by Valdis, Jack Ricchiuto and June Holley. Another great thing about their work is that their background is in management and organizations rather than in technology. That seems to help them avoid the common problem of having a technical solution that's looking for a real-world problem to solve!

January 11, 2008 in Defrag, Implicit Web, Outlook API | Permalink | Comments (0) | TrackBack (0)

Is there anything interesting MIT isn't involved in?


MIT is the Kevin Bacon of the web research world. It's hard to investigate any bleeding-edge topic without bumping into one of their projects. For example Piggy Bank is one of the earliest attempts to build the semantic web from the bottom up, and now I've discovered their work with Social Network Fragments. Danah Boyd collaborated with Jeffrey Potter and his BuddyGraph project to explore how to derive interesting social graphs from someone's email messages.

The app they show is somewhat similar to Outlook Graph. They're using a wire-and-spring simulation system to produce the graphs, and trying to derive some idea of the underlying social groups from the positions that people end up in within the network. Unfortunately they haven't released a demo of the tool; it appears to involve more pre-processing than OG, but it does have an interface for exploring changes over time, which is not something I've implemented yet. They don't appear to be using any kind of weighting for the connections between people based on frequency of contact. It also requires some additional inputs from the user, such as mailing lists and the user's own email identities, and I'd imagine the system assumes a fairly clean set of email without too many automated or junk messages to muddy the data, though it can discard 'isolated' nodes that only have a few connections.
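The wire-and-spring idea itself is simple enough to sketch. This toy version is my own simplification rather than BuddyGraph's or Outlook Graph's actual code: springs pull connected people together, harder for heavier edges, while a general repulsion keeps everyone else apart.

```python
import random

def spring_layout(people, edges, weights=None, iterations=300,
                  spring=0.05, repulsion=0.1, step=0.1):
    """Toy force-directed layout.

    people: list of ids; edges: list of (a, b) pairs;
    weights: optional dict from an edge to its strength."""
    random.seed(42)
    pos = {p: [random.uniform(-1, 1), random.uniform(-1, 1)] for p in people}
    weights = weights or {}
    for _ in range(iterations):
        force = {p: [0.0, 0.0] for p in people}
        # Repulsion between every pair keeps unconnected people apart.
        for i, a in enumerate(people):
            for b in people[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d2 = dx * dx + dy * dy + 1e-6
                f = repulsion / d2
                force[a][0] += f * dx
                force[a][1] += f * dy
                force[b][0] -= f * dx
                force[b][1] -= f * dy
        # Springs pull connected people together, harder for heavier edges.
        for a, b in edges:
            w = weights.get((a, b), 1.0)
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += spring * w * dx
            force[a][1] += spring * w * dy
            force[b][0] -= spring * w * dx
            force[b][1] -= spring * w * dy
        for p in people:
            pos[p][0] += step * force[p][0]
            pos[p][1] += step * force[p][1]
    return pos

layout = spring_layout(['me', 'liz', 'eric', 'newsletter'],
                       [('me', 'liz'), ('me', 'eric'), ('liz', 'eric')])
print(layout)  # the three connected people settle near each other, the newsletter drifts away
```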

Here's a short demo video showing BuddyGraph in action. The project page doesn't seem to have been updated for a few years, so I'll email Danah and Jeffrey to see if they've done anything interesting in this area since then.

January 10, 2008 in Outlook API | Permalink | Comments (0) | TrackBack (0)

Where can you get free word frequency data?


The Google n-gram data-set is probably as big a word frequency list as you'll ever need, but it has very restrictive license terms that don't allow you to publish it in any form. Since I'm interested in doing some web-based services to let you query the frequency of particular words and phrases, I could fall foul of that restriction. Luckily there are some alternatives, since using the web as a source of word-frequency data has been a big topic in the linguistics community over the last few years.

The Web as Corpus site has a good collection of resources, and in particular it led me to Bill Fletcher's work. He has written kfNgram, a free tool for generating word and phrase frequency (n-gram) lists from text and HTML files, and he's also made some decent-sized data sets available himself, such as this list with over 100,000 entries.

Also very interesting is the WebCorp project. It has an online word frequency list generator which you can point at any site you're interested in and retrieve the statistics of the text on that page. It also features a search engine which adds a layer of linguistic analysis on top of standard Google search results. It has some neat features such as displaying all occurrences of the search terms within each result, rather than just the standard abbreviated summary that Google produces.
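Rolling a crude version of that yourself only takes a few lines. This sketch has nothing to do with WebCorp's actual implementation, and the tag stripping is deliberately naive, but it gives you a frequency list for any page you point it at:

```python
import re
import urllib.request
from collections import Counter

def page_word_frequencies(url):
    """Fetch a page and return a Counter of word frequencies.

    A real tool would use a proper HTML parser and handle encodings,
    scripts and boilerplate much more carefully."""
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
    text = re.sub(r'<(script|style).*?</\1>', ' ', html,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r'<[^>]+>', ' ', text)  # strip any remaining tags
    return Counter(re.findall(r"[a-z']+", text.lower()))

for word, count in page_word_frequencies('http://example.com/').most_common(10):
    print(word, count)
```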

January 09, 2008 in Outlook API | Permalink | Comments (0) | TrackBack (0)

How do you rank emails?


The core of Google's success is the order it displays search results. Back in the pre-Google days you'd get a seemingly unordered list of all pages that contained a term. Figuring out which pages were most authoritative using PageRank and putting them at the top made finding a useful result much quicker.

Searching emails needs something similar: a way of sorting the important emails from the trivial. PageRank works by analyzing links between pages, but emails don't have links like that. Instead, you need to use other connections between emails, such as how often a message was replied to or forwarded. Just as a link to another web page can be seen as a vote for it, so an action such as forwarding or replying is a hard-to-fake signal that the recipient considers the message worth spending time on.

I'm already using this principle to set the strength of connections between people in Outlook Graph: the thickness and pull of a line is determined by the minimum of the number of emails sent and received between the two people. Using the minimum helps to weed out unbalanced relationships such as automated mailers that send out a lot of bacn but never get any email in return.
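In code the weighting is as simple as it sounds. This is just a sketch of the idea, not the actual Outlook Graph source:

```python
def connection_strength(sent_counts, received_counts, person):
    """Weight a link by the smaller of the mails sent to and received from someone.

    A newsletter that sends hundreds of messages but never gets a reply
    ends up with a strength of zero."""
    return min(sent_counts.get(person, 0), received_counts.get(person, 0))

sent = {'liz@example.com': 40, 'news@bacn-sender.com': 0}
received = {'liz@example.com': 55, 'news@bacn-sender.com': 300}
print(connection_strength(sent, received, 'liz@example.com'))       # 40
print(connection_strength(sent, received, 'news@bacn-sender.com'))  # 0
```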

It's not a new idea; Clearwell has been using something similar for a while:

"To sort messages by relevance, Clearwell's program weighs the background data and content of each email for several factors, including the name of the sender, names of recipients, how many replies the message generated, who replied, how quickly replies came, how many times it was forwarded, attachments and, of course, keywords."

It's obvious enough that I don't doubt other people are doing something like this too, though I'll be interested to discover what patent landmines were laid by the first people to file. Where it gets really interesting is when you also do social graph analysis; then it's possible to throw the social distance of the people involved into the mix. The effect is to give more prominence to messages from those you know, or from friends of friends, since they're more likely to be talking about things relevant to you than strangers are.
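To make the shape of that concrete, here's the kind of scoring function I have in mind. The weights and the decay on social distance are made-up numbers for illustration, not anything Clearwell or anyone else has published:

```python
def message_score(keyword_match, replies, forwards, social_distance):
    """Toy relevance score for a message that matched a search.

    keyword_match: how well the text matches the search terms (0..1).
    replies, forwards: how often the message was replied to or passed on.
    social_distance: hops from you in the social graph (1 = direct contact)."""
    engagement = 1.0 + 2.0 * replies + 1.5 * forwards
    closeness = 1.0 / social_distance  # friends of friends still get some credit
    return keyword_match * engagement * closeness

# A note from a direct contact that sparked a discussion...
print(message_score(keyword_match=0.8, replies=3, forwards=1, social_distance=1))
# ...easily beats an unanswered broadcast from a stranger.
print(message_score(keyword_match=0.8, replies=0, forwards=0, social_distance=4))
```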

January 08, 2008 in Outlook API | Permalink | Comments (4) | TrackBack (0)

What's the secret to Amazon's SIPs algorithm?


The statistically improbable phrases that Amazon generates from a book's contents seem like they'd be useful to have for a lot of other text content, such as emails or web pages. In particular, it seems like you could do some crude but useful automatic tagging.

There's no technical information available on the algorithm they use, just a vague description of the results it's trying to achieve. They define a SIP as "a phrase that occurs a large number of times in a particular book relative to all Search Inside! books".

The obvious implementation of this for a word or series of words in a candidate text is:

  • Calculate how frequently the word or phrase occurs in the current text, by dividing the number of occurrences by the total number of words in the text. Call this Candidate Frequency.
  • Calculate the frequency of the same word or phrase in a larger reference set of texts, to get the average frequency you'd expect it to have in a typical text. Call this Usual Frequency.
  • To get the Unusualness Score for how unusual a word or phrase is, divide the Candidate Frequency by the Usual Frequency.

In practical terms, if a word appears often in the candidate text, but appears rarely in the reference texts, it will have a high value for Candidate Frequency and a low Usual Frequency, giving a high overall Unusualness Score.
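Spelled out in code, the whole thing only takes a dozen lines or so. This follows the steps above literally, apart from a little smoothing so that words missing from the reference set don't cause a divide by zero; Amazon is clearly doing more than this:

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def unusualness_scores(candidate_text, reference_text):
    """Score each word in the candidate text by Candidate Frequency / Usual Frequency."""
    candidate = Counter(tokenize(candidate_text))
    reference = Counter(tokenize(reference_text))
    candidate_total = sum(candidate.values())
    reference_total = sum(reference.values())
    scores = {}
    for word, count in candidate.items():
        candidate_freq = count / candidate_total
        # Add one to the reference count so unseen words don't divide by zero.
        usual_freq = (reference.get(word, 0) + 1) / (reference_total + 1)
        scores[word] = candidate_freq / usual_freq
    return scores

# Hypothetical dumps of one sender's mail and the whole mail store.
scores = unusualness_scores(open('one_sender.txt').read(),
                            open('all_emails.txt').read())
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:10]:
    print(word, round(score, 1))
```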

This isn't too hard to implement, so I've been experimenting using Outlook Graph. I take my entire collection of emails as a reference corpus, and then for every sender I apply this algorithm to the text of their emails to obtain the top-scoring improbable phrases. Interestingly, the results aren't as compelling as Amazon's: a lot of words that intuitively aren't very helpful show up near the top.

I have found a few discussions online from people who've attempted something similar. Most useful were Mark Liberman's initial thoughts on how we pick out key phrases, where he discusses using "simple ratios of observed frequencies to general expectations", and how they will fail because "such tests will pick out far too many words and phrases whose expected frequency over the span of text in question is nearly zero". This sounds like a plausible explanation for some of the quality problems in the results I'm seeing.

In a later post, he analyzes Amazon's SIP results, to try and understand what it's doing under the hood. The key thing he seems to uncover is that "Amazon is limiting SIPs to things that are plausibly phrases in a linguistic sense". In other words, they're not just applying a simplistic statistical model to pick out SIPs; they're doing some other sorting to determine which combinations of words are acceptable as likely phrases. I'm trying to avoid that sort of linguistic analysis, since once you get into trying to understand the meaning of a text in any way, you're suddenly looking at a mountain of hairy unsolved AI problems, and at the very least a lot of engineering effort.

As a counter-example, S Anand applied the same approach I'm using to Calvin and Hobbes, and got respectable-looking results for both single words and phrases, though he too believes that "clearly Amazon's gotten much further with their system".

There are some other explanations for the quality of the results I'm getting so far. Email is a very informal and unstructured medium compared to books. There's a lot more bumpf: stuff like header information that creeps into the main text and isn't intended for humans to read. Emails can also be a lot less focused on describing a particular subject or set of concepts, and a lot closer to natural speech, with content-free filler such as 'hello' and 'with regards'. It's possible too that trying to pull out keywords from all of a particular person's sent emails is not a solvable problem, and that there's too much variance in what any one person discusses.

One tweak I found that really improved the quality was discarding any word that only occurs once in the candidate text. That seems to remove some of the noise of junk words, since the repetition of a token usually means it's a genuine word and not just some random characters that have crept in.
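In the sketch above, that's a one-line filter over the counts, something like this (reusing tokenize() and unusualness_scores() from before):

```python
from collections import Counter

def filtered_scores(candidate_text, reference_text, min_count=2):
    """Unusualness scores, ignoring words seen fewer than min_count times."""
    counts = Counter(tokenize(candidate_text))
    scores = unusualness_scores(candidate_text, reference_text)
    return {w: s for w, s in scores.items() if counts[w] >= min_count}
```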

Another possible source of error is the reference text I'm comparing against. Using all emails has a certain elegance, since it's both easily available in this context, and will give personalized results for every user, based on what's usual in their world. As an alternative, whilst looking at a paper on Automatically Discovering Word Senses, I came across the MiniPAR project, which includes a word frequency list generated from AP news stories. It will be interesting to try both this and the large Google corpus as the reference instead, and see what difference that makes.

I'm having a lot of fun trying to wrestle this into a usable tool, it feels very promising, and surprisingly neglected. One way of looking at what I'm trying to do is as the inverse of the search problem. Instead of asking 'Which documents match the terms I'm searching for?', I'm trying to answer 'Which terms would find the document I'm looking at in a search?'. This brings up a lot of interesting avenues with search in general, such as suggesting other searches you might try based on the contents of results that seem related to what you're after. Right now though, it feels like I'm not too far from having something useful for tagging emails.

As a final note, here's an example of the top single-word results I'm getting for an old trailworking friend of mine:
[Image: the top-scoring SIP words for this sender]

The anti-immigration one is surprising, I don't remember that ever coming up, but the others are mostly places or objects that have some relevance to our emails.

One thing I always find incredibly useful, and the reason I created Outlook Graph in the first place, is transforming large data sets into something you can see. For the SIPs problem, the input variables we've got to play with are the candidate and reference frequencies of words. Essentially, I'm trying to find a pattern I can exploit, some correlation between how interesting a word is and the values it has for those two. The best way of spotting that sort of correlation is to draw your data as a 2D scatter graph and see what emerges. In this case, I'm plotting all of the words from a sender's emails over the main graph, with the horizontal axis showing the frequency in the current sender's emails, and the vertical axis showing how often a word shows up in all emails.

[Image: scatter plot of each word's frequency in this sender's emails against its frequency in all emails]

You can see there's a big log jam of words in the bottom left that are rare in both the candidate text, and the background. Towards the top-right corner are the words that are frequent in both, like 'this'. The interesting ones are towards the bottom right, which represents words frequent in the current text, but infrequent in the reference. These are things like 'trails', 'work' or 'drive' that are distinctive to this person's emails.
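Outlook Graph draws this view over its main display, but a standalone version of the same plot only needs matplotlib and the two Counter objects from the scoring sketch earlier. This is a sketch, with log scales assumed since the counts span several orders of magnitude:

```python
import matplotlib.pyplot as plt

def plot_word_frequencies(candidate_counts, reference_counts):
    """Scatter each word: x = count in this sender's emails, y = count in all emails."""
    words = [w for w in candidate_counts if w in reference_counts]
    xs = [candidate_counts[w] for w in words]
    ys = [reference_counts[w] for w in words]
    plt.scatter(xs, ys, s=8)
    for word, x, y in zip(words, xs, ys):
        plt.annotate(word, (x, y), fontsize=6)
    plt.xscale('log')
    plt.yscale('log')
    plt.xlabel("occurrences in this sender's emails")
    plt.ylabel('occurrences in all emails')
    plt.show()  # the interesting words cluster towards the bottom right
```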

January 05, 2008 in Defrag, Implicit Web, Outlook API | Permalink | Comments (0) | TrackBack (0)

Should you cross the chasm or avoid it?


I recently came across a white paper covering Ten Reasons High-Tech Companies Fail. I'm not sure that I agree with all of them, but the discussion of continuous versus discontinuous innovation really rang true.

Crossing the Chasm is a classic bible for technology marketers, focused on how to move from early adopters to the early majority in terms of the technology adoption lifecycle. It describes the gap between them as a chasm because what you need to do to sell to the mainstream is often wildly different than what it takes to get it adopted by customers who are more open to change.

What the white paper highlights is that this 'valley of death' in the adoption cycle only happens when the technology requires a change of behavior by the customer; in the author's terms, it is discontinuous. Innovations that don't require such a change are continuous. They don't have such a chasm between innovators and the majority, because the perceived cost of changing behavior is a large part of the mainstream's resistance to new technology.

This articulates an instinct I've been trying to understand for a while. I was very uncomfortable during one of the Defrag open sessions on adopting collaboration tools, because everyone but me seemed to be in the mode of 'How do we get these damn stubborn users to see how great our wikis, etc. are?'. They took it as a given that the answer to getting adoption was figuring out some way to change users' behavior. My experience is that changing people's behavior is extremely costly and likely to fail, and most of the time, if you spend enough time thinking about the problem, you can find a way to deliver 80% of the benefits of the technology through a familiar interface.

This is one of the things I really like about Unifyr: they take the file system interface and add the benefits of document management and tagging. It's the idea behind Google Hot Keys too, letting people keep searching as they always have done, but with some extra functionality. It's also why I think there's a big opportunity in email: there's so much interesting data being entered through that interface, and nobody's doing much with it. Imagine a seamless bridge between a document management system like Documentum or Sharepoint and all of the informal emails that make up the majority of a company's information flow.

Of course, there are some downsides to a continuous strategy. It's harder to get early adopters excited enough to try a product that on the surface looks very similar to what they're already using. They're novelty junkies; they really want to see something obviously new. You also often end up integrating into someone else's product, which is always a precarious position to be in.

Another important complication is that I don't think interface changes are always discontinuous. A classic example is the game Command and Conquer. I believe a lot of its success was based on inventing a new UI that people felt like they already knew. Clicking on a unit, then clicking on something else and having it perform a sensible action based on context, like moving or attacking, just felt very natural. It didn't feel like a change at all, which drove the game's massive popularity.

I hope to be able to discuss a more modern example of an innovative interface that feels like you already know it, as soon as some friends leave stealth mode!

January 04, 2008 in Defrag, Outlook API | Permalink | Comments (0) | TrackBack (0)
