PeteSearch

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.



How to visualize hot topics from conversations

[Screenshot: Twitterverse tag cloud]
I've found a new example of a good time-based visualization. Twitterverse shows a tag cloud of the last few hours, based on the world's Twitter conversations. This is one of the things I'll do with email, and it's interesting to see how it works here.

There's a lot of noise in the two-word phrases, with "just realized", "this morning", "this evening" and "this weekend" all showing up as the most common phrases. These don't give much idea of what's on people's minds, but removing them would take a large stop word system, and that runs the risk of filtering out interesting phrases too.

A surprising amount of identifiable information came through, especially with single words. For example, xpunkx showed up in the chart, which looked like a user name. Googling it led me to this Twitter account, and then to Andy Lehman's blog. It may just be a glitch in their implementation, but this would be a deal-breaker for most companies if it had been a secret codename gleaned from email messages. Of course, any visualization of common terms from recent internal emails would make a lot of executives nervous if it was widely accessible. Nobody wants to see "layoff" suddenly appear there and cause panic.

It's also surprisingly changeable. Refreshing the one-hour view every few minutes produces almost completely different sets of words. Either the phrase frequency distribution is very flat (the top phrases are only slightly more popular than the ones just below them, so they're easily displaced), or their implementation isn't calculating the tag cloud quite the way I'd expect.

The team at ideacode have done a really good job with Twitterverse, and there's an interesting early sketch of their idea here. Max Kiesler, one of the authors, also has a great overview of time-based visualization on the web with some fascinating examples.

December 07, 2007 in Defrag, Implicit Web, Outlook API | Permalink | Comments (2) | TrackBack (0)

What most email analysis is missing...

... is time. A mail store is one of the few sources of implicit data that has intrinsic time information baked in. The web has a very spotty and unreliable notion of time. In theory you should be able to tell when a page was last modified, but in practice this varies from site to site, and there's no standard way (other than the Wayback Machine) to look at the state of sites over an arbitrary period.

Once you've extracted keywords, it's possible to do something basic like Google Trends. Here's an example showing frequency of searches for Santiago and Wildfire:
[Screenshot: Google Trends comparison of searches for 'Santiago' and 'Wildfire']
A friend suggested something similar for corporate email; it would be good to get a feel for the mood of the company based on either common keywords, or some measure of positive and negative words in messages. This could be tracked over time, and whilst it would be a pretty crude measure, it could be a good indicator of what people are actually up to. Similarly, pulling out the most common terms in search queries going through the company gateway would give an insight into what people are thinking and working on. There are privacy concerns obviously, but aggregating data from a lot of people makes it a lot more anonymous and less invasive. It's harder to turn the beefburger back into a cow; the combined data is a lot less likely to contain identifying or embarrassing information.
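To make the mood-tracking idea concrete, here's a minimal sketch of counting positive and negative words per week. The word lists, the (date, text) message format and the weekly bucketing are all my own illustrative assumptions, not anyone's real implementation:

```python
from collections import defaultdict
from datetime import date

# Toy sentiment word lists; a real system would want a proper lexicon.
POSITIVE = {"great", "thanks", "shipped", "win"}
NEGATIVE = {"layoff", "delay", "problem", "bug"}

def mood_by_week(messages):
    """messages: iterable of (sent_date, text). Returns a crude mood score
    (positive hits minus negative hits) keyed by the Monday of each week."""
    mood = defaultdict(int)
    for sent, text in messages:
        week = date.fromordinal(sent.toordinal() - sent.weekday())  # Monday
        for word in text.lower().split():
            if word in POSITIVE:
                mood[week] += 1
            elif word in NEGATIVE:
                mood[week] -= 1
    return dict(mood)

print(mood_by_week([(date(2007, 12, 3), "shipped the build, thanks everyone"),
                    (date(2007, 12, 4), "another delay and a nasty bug")]))
```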

Similar to Google's trends, but with more information and better presentation is Trendpedia. Here's a comparison of Facebook, MySpace and Friendster over time:
[Screenshot: Trendpedia comparison of Facebook, MySpace and Friendster]
So far, the examples have all been of fairly standard line graphs. There's some intriguing possibilities once you start presenting discrete information on a timeline, and allowing interaction and exploration, especially with email. Here's an example of a presidential debate transcript with that sort of interface, from Jeff Clark:
[Screenshot: Jeff Clark's interactive presidential debate transcript timeline]

All of these show a vertical, one-dimensional slice of information as it changes over time. It's also possible to indicate time for two-dimensional data. The simplest way is to accumulate values onto a plane over time, so you can see how much of the time a certain part was active. Here's an example from Wired, showing how player location over time was plotted for Halo maps to help tweak the design:

[Screenshot: Halo player-location heatmap, from Wired]
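The accumulation approach is easy to sketch: bucket each (x, y) sample into a grid cell and increment it, so cells that were active for more of the time end up hotter. This is a toy version in the spirit of the Halo heatmaps; the grid and world dimensions are arbitrary illustration values:

```python
GRID_W, GRID_H = 8, 8

def accumulate(samples, world_w=100.0, world_h=100.0):
    """Accumulate (x, y) position samples onto a coarse 2D grid over time."""
    grid = [[0] * GRID_W for _ in range(GRID_H)]
    for x, y in samples:
        col = min(int(x / world_w * GRID_W), GRID_W - 1)
        row = min(int(y / world_h * GRID_H), GRID_H - 1)
        grid[row][col] += 1
    return grid

# Positions sampled every few seconds would feed in like this:
for row in accumulate([(10, 10), (12, 11), (90, 55)]):
    print("".join("#" if cell else "." for cell in row))
```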

What's even more compelling is showing an animation of 2D data as it changes over time. The downside is that it's a lot harder to implement, and I don't know of many examples. TwitterVision is one, but it's not too useful. Mostly, these sorts of animations have to be client-side applications. For email, showing the exchange of messages over time on a graph is something that could give some interesting insights.

Thanks to Matthew Hurst for pointing me to a lot of these examples through his excellent blog.

December 06, 2007 in Defrag, Implicit Web, Outlook API | Permalink | Comments (0) | TrackBack (0)

Can a computer generate a good (enough) summary?


A description of a document is really useful if you're dealing with large amounts of information. Whether I'm searching through old emails or triaging new ones as they come in, I can usually decide whether a particular message is worth reading based on a short description. Unfortunately, creating a full human-quality description is an AI-complete problem, since it requires an understanding of an email's meaning.

Automatic tag generation is a promising and practical way of creating a short-hand overview of a text, with a few unusual words pulled out. It's somewhat natural, because people do seem to classify objects mentally using a handful of subject headings, even if they wouldn't express a description that way in conversation.

If you asked someone what a particular email was about, she'd probably reply with a few complete sentences: "John was asking about the positronic generator specs. He was concerned they wouldn't be ready in time, and asked you to give him an estimate." This sort of summary also requires a full AI, but it is possible to at least mimic the general form of this type of description, even if the content won't be as high-quality.

The most common place you encounter this is on Google's search results page:
[Screenshot: Google search result snippet]
The summary is generated by finding one or two sentences in the page that contain the terms you're looking for. If there are multiple occurrences, the sentences earliest in the text are usually favored, along with ones that contain the most terms closest together. It's not a very natural-looking summary, but it does a good job of picking out the quotations relevant to your search, and of giving a good idea whether the page is actually talking about what you want to know.
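Google's actual ranking is far more sophisticated, but the mechanics are easy to sketch. Something like this toy version captures the idea: pick the sentences with the most query terms, breaking ties in favor of earlier ones:

```python
import re

def snippet(text, query_terms, max_sentences=2):
    """Pick the sentences that best match the query terms, favoring
    earlier sentences when the hit counts are equal."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    terms = {t.lower() for t in query_terms}
    scored = []
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        hits = len(terms & words)
        if hits:
            scored.append((-hits, position, sentence))
    scored.sort()  # most hits first, then earliest position
    best = sorted(scored[:max_sentences], key=lambda s: s[1])
    return " ... ".join(s for _, _, s in best)
```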

Amazon's statistically improbable phrases for books are an interesting approach: they identify combinations of words that are distinctive to a particular book. These are almost more like tags, and are found by a method similar to statistics-based automatic tagging, spotting combinations that are frequent in a particular book but not as common in a background of related items. They don't act as a very good description in practice; they're more useful as a tool for discovering distinctive content. I also discovered they've introduced capitalized phrases, which serve a similar purpose. That's an intriguing hack on the English language to discover proper nouns; I may need to copy that approach.
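I don't know Amazon's actual formula, but the general statistical recipe is simple: compare how often a phrase occurs in this book against how often it occurs in a background collection. A hedged sketch, where the background rates and the floor for unseen phrases are my own choices:

```python
from collections import Counter

def improbable_phrases(doc_phrases, background_rates, top_n=10):
    """doc_phrases: list of phrases from one book. background_rates maps
    phrase -> its rate in related books. Rank by the lift over background."""
    counts = Counter(doc_phrases)
    total = sum(counts.values())
    scored = []
    for phrase, n in counts.items():
        doc_rate = n / total
        bg_rate = background_rates.get(phrase, 1e-7)  # floor for unseen phrases
        scored.append((doc_rate / bg_rate, phrase))
    return sorted(scored, reverse=True)[:top_n]
```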

The final, and most natural, type of summary is created by picking out key sentences from the text, and possibly shortening them. Microsoft Word's implementation is the most widely used, and it isn't very good. There's also an online summarizer you can experiment with that suffers from a lot of the same problems.

There are two big barriers to getting good summaries with this method. First, it's hard to identify which parts of the document are actually important. Most methods use location (whether a sentence is a heading or starts a paragraph) and the statistical frequency of unusual words, but these aren't very good predictors. Second, even once you've picked the sentences you want to use, there's very little guarantee they'll make any sense when strung together outside the context of the full document; you often end up with a very confusing narrative. Even MS, in their description of their auto summary tool, acknowledge that at best it produces a starting point that you'll need to edit.
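Here's a Luhn-style sketch of the location-plus-frequency approach (my own toy version, not Word's algorithm): score each sentence by how many frequent non-stop words it contains, add a bonus for appearing early, and re-emit the winners in document order:

```python
import re
from collections import Counter

STOP = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
        "that", "this", "for", "on", "with", "as"}

def extractive_summary(text, num_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freq = Counter(w for w in re.findall(r"[a-z']+", text.lower())
                   if w not in STOP)
    scored = []
    for pos, sentence in enumerate(sentences):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        score = sum(freq[w] for w in words if w not in STOP) / len(words)
        score += 2.0 / (pos + 1)  # early sentences get a location bonus
        scored.append((score, pos, sentence))
    top = sorted(scored, reverse=True)[:num_sentences]
    # Keep document order so the result reads a little less disjointedly.
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```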

Overall, for my purposes displaying something like Google or Amazon's summaries for an email might be useful, though I'll have to see if it's any better than just showing the first sentence or two of a message. It doesn't look like the other approaches to producing a more natural summary are good enough to be worth using.

December 05, 2007 in Defrag, Implicit Web, Outlook API | Permalink | Comments (0) | TrackBack (0)

What puzzles can company-wide email solve?


My intuition is that a company's collection of email messages is a rich source of useful information, and people will pay for a service that gives them access to it. What could users do in practice though?

Discover experts.
By analyzing each person's sent messages, it's possible to figure out some good tags to describe them. These would need to be approved and tweaked by the subject before being published, but then you'd have a deep company directory that anyone could query. So many times I've ended up reinventing the wheel because I didn't know that somebody in another department had already tackled a particular problem.

Uncover expertise.
Email is the most heavily used content-generation system, hands-down. There's lots of valuable information in messages that never makes it to a wiki or internal blog. The trouble is, that information quickly vanishes; emails are ephemeral. Any mail that's sent to an internally public mailing list should be automatically included on an intranet page that's searchable by keyword, person or team. You should also have a button in Outlook that lets you publish any mail thread on that same page. Those published messages produce something very like a blog for each person, effortlessly.

Work together.
People collaborate by emailing each other attachments. Rather than trying to change that, put in a tool that by default uploads the attachment to Sharepoint, accessible only by the email recipients, and rewrites the message so it links to that instead. You'll need a safety-valve that allows people to override that if they really do need it as an attachment, but this method should retain most of the advantages of email collaboration (clear access control, ease-of-use) and add the collaboration benefits of change tracking and a single version of the file.

December 04, 2007 in Implicit Web, Outlook API | Permalink | Comments (0) | TrackBack (0)

Can you automatically generate good tags?

One interesting feature of Disruptor Monkey's Unifyr is their automatic generation of tags from web pages. Good tags are the basis of a folksonomy, and with them I could do some very useful classification and organization of data. With an organization's email, I'd be able to show people's areas of expertise if I knew which subjects they sent messages about. This could be the answer to the painful problem of 'Who can I ask about X?'.

Creating true human-quality tags would require an AI that could understand the content to the same level a human could, so any automatic process will fall short of that. Is there anything out there that will produce good enough results to use?

There are two main approaches to this problem, which is sometimes known as keyword extraction, since it's very similar to that search engine task. The first is to use statistical analysis to work out which words are significant, with no knowledge of what the words actually mean. This is fundamentally how Google's search works. The second is to use rules about language, and information about the meanings of words, to pick out the right ones. For example, knowing that 'notebook' means the same as 'laptop' lets both words count towards the same concept. Powerset is going to be using this approach to search. Danny Sullivan has a thought-provoking piece on why he doesn't think the method will ever live up to its promise.

KEA is an open-source package for keyword extraction, and is towards the rules-based end of the spectrum, though it sets up those rules using training and some standard thesauruses, rather than manually. I was initially very interested, because it's designed to do exactly what I need, pulling descriptive keywords from a text. Unfortunately, I'd still have to set up a thesaurus and some manually tagged documents for the system to learn from before running it on any information. I would like to start off with something completely unsupervised, so it can be deployed without a skilled operator or any involved setup.

The other alternative is using statistical analysis to identify words that are uncommon in most texts, but common in the particular one you're looking at. The simplest example I've seen is the PHP automatic keyword generation class. You'll need to register to see the code, but all it does is exclude stop words and then return the remaining words, along with two and three-word phrases, in descending order of frequency. The results are a long way from human tagging, but just good enough to make me think the approach is worth expanding.
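The whole approach fits in a few lines. This is my reconstruction of what a class like that does, with guesses at details such as whether phrases are built before or after stop word removal:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "it", "that", "this", "for", "on", "with", "as", "was"}

def keywords(text, top_n=10):
    """Drop stop words, then rank the remaining words and the two- and
    three-word phrases by raw frequency."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    counts.update(" ".join(p) for p in zip(words, words[1:]))
    counts.update(" ".join(p) for p in zip(words, words[1:], words[2:]))
    return counts.most_common(top_n)
```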

An obvious next step is to expand the stop word concept, and keep track of the general frequency of a lot more words, so you can exclude other common terms, and focus on the unusual ones. The standard way to do this is to take the frequencies from a large corpus of text, often a general one like the Brown corpus that includes hundreds of articles from a variety of sources. For my purposes, it would also be interesting to use the organization's overall email store as the corpus, and identify the words a particular employee uses that most others in the company don't. This would prevent things like the company name from appearing too often.
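That refinement is also easy to sketch. Here I compare a person's word rates against the whole company's, with add-one smoothing for words the corpus hasn't seen; the plain log-ratio scoring is my own choice for illustration:

```python
import math
from collections import Counter

def distinctive_words(person_words, corpus_words, top_n=10):
    """Rank the words a person uses by how much more often they use them
    than the background corpus does, so shared terms like the company
    name wash out."""
    p, c = Counter(person_words), Counter(corpus_words)
    p_total, c_total = sum(p.values()), sum(c.values())
    scored = []
    for word, n in p.items():
        p_rate = n / p_total
        c_rate = (c[word] + 1) / (c_total + 1)  # smooth unseen words
        scored.append((math.log(p_rate / c_rate), word))
    return sorted(scored, reverse=True)[:top_n]
```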

You'll never get human-grade tags from this sort of system, but you can get keywords that are good enough for some tasks. I hope it will be good enough to identify subject-matter experts within a company, but only battle-testing will answer that.

December 03, 2007 in Coding, Implicit Web, Outlook API | Permalink | Comments (4) | TrackBack (0)

Email data mining by Spoke and Contact Networks


I've been thinking hard about painful problems that email analysis could solve, and one of them is the use of a corporation's email store to discover internal colleagues who have existing relationships with an external company or person you want to talk to. For example, if you want to sell to IBM, maybe there's someone in your team who's already talking to someone there. Or internally, you might want an introduction to someone in another department to discuss a problem, and it would be good to know who in your team had contacts there already.
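The core lookup only needs message headers, not bodies. A minimal sketch, assuming messages arrive as (sender, recipients) address pairs and that internal addresses share one domain (ourcorp.com is a placeholder):

```python
from collections import defaultdict

def build_contact_index(messages, internal_domain="ourcorp.com"):
    """Map each external domain to the internal colleagues who have
    exchanged mail with someone there."""
    index = defaultdict(set)
    for sender, recipients in messages:
        people = [sender] + list(recipients)
        internal = {p for p in people if p.endswith("@" + internal_domain)}
        for person in people:
            if person not in internal:
                index[person.split("@")[-1]] |= internal
    return index

index = build_contact_index([("me@ourcorp.com", ["buyer@ibm.com"]),
                             ("ann@ourcorp.com", ["buyer@ibm.com"])])
print(index["ibm.com"])  # colleagues who already talk to IBM
```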

I was discussing these thoughts with George Eberstadt, co-founder of nTag, and he pointed me to a couple of successful companies who are already mining email to do this, Spoke and Contact Networks.

Spoke are interesting because they're entirely client-based, rather than running on an organization's whole message store. They work by taking data from everybody who's running their Outlook add-on, along with information pulled from publicly available sources, and feeding it into their own database. You can then search that yourself to find information on people you're interested in contacting, and people you know who've already been in touch with them. It's effectively a global social network built largely on the email patterns of everybody who belongs.

Technically, it sounds like they're doing some interesting things, such as exchanging special emails to confirm that two people really do know each other, but when I tried searching for my own name, I didn't get any useful information. I'm also surprised that companies would allow the export of their employees' email relationships to a third party. It may just be that this is happening under the radar, but it seems like the sort of thing a lot of companies would want safeguards on. The service encourages individual employees to install the software themselves, without any warning that they might be opening up the organization's data to third-party analysis. I know a lot of the companies I deal with would frown on this, to say the least.

Contact Networks seem much more focused on selling to corporations as a whole, rather than to individual employees. They build a social graph from several sources internal to the company, including email and calendars, CRM, marketing databases, and HR and billing systems. They use this to identify colleagues who know a particular individual, which is a succinct description of a 'painful problem' that companies will pay money to solve. They seem to have been very successful, with lots of big-name clients, and they were just bought out by Thomson.

It's good to see how well Contact Networks have done; it's proof there's demand for the sort of services I'm thinking of, even if they've already solved the immediate problems I was considering.

November 30, 2007 in Defrag, Implicit Web, Outlook API | Permalink | Comments (0) | TrackBack (0)

Seriosity - How much is your email worth?


I first ran across Seriosity in a recent Wall Street Journal article. It was discussing companies dealing with 'colleague spam', or the problem of bacn in the workplace, where unimportant messages crowd out significant ones in your inbox. Unfortunately the article itself is subscription-only, but after mentioning ClearContext and Xobni, it talks about Seriosity's approach.

The heart of their method is a virtual currency, Serios, that you can spend to mark a message you send as important. Essentially, it's a way of making the existing 'important' flag more useful, by making it a scarce resource. The current flag doesn't mean very much, because there's no natural regulation of its use, so some people mark all their messages as important, whilst others don't use it at all. The Serios spent on a message determine how prominently it appears in the recipient's mailbox.

The currency system is based on studies of online games, with the recipients of messages receiving the Serios the sender spent on them, and everybody having a limited store of points to use. They've already done a test deployment of their product for eBay's internal email, with some positive quotes from the company on its usefulness, though there were performance problems with the installed client, apparently fixed in a newer version.
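To illustrate the mechanics as I understand them (a toy model, not Seriosity's code): spending is capped by the sender's balance, and the points flow to the recipient, which is what keeps the 'important' signal scarce:

```python
class SeriosBank:
    """Toy model: every user starts with a fixed allowance of Serios."""

    def __init__(self, starting_balance=100):
        self.start = starting_balance
        self.balances = {}

    def balance(self, user):
        return self.balances.setdefault(user, self.start)

    def send(self, sender, recipient, amount):
        """Attach Serios to a message; the spend is capped by the sender's
        balance and transfers to the recipient on delivery."""
        spend = min(amount, self.balance(sender))
        self.balances[sender] = self.balance(sender) - spend
        self.balances[recipient] = self.balance(recipient) + spend
        return spend  # the recipient's client sorts the inbox by this

bank = SeriosBank()
priority = bank.send("boss@corp.com", "me@corp.com", 30)
```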

I'm very happy to see something so innovative in its approach to this problem, but I do think they've got some significant hurdles to overcome.

For starters, I'm not convinced that colleague spam is amenable to any algorithmic solution. In the article, they use the example of a department email announcing brownies in the kitchen. I like brownies! I wouldn't want to miss out on that message when it came in, but I might not care if the same email was about carrot cake. These sorts of messages require some understanding of the content, and of its significance to the recipient, to process. I would end up scanning the subject lines of all emails as they came in to make sure I didn't lose a message I cared about.

On a practical level, game economies are really hard to design well, and the constraints there are only about making the experience enjoyable. For a work tool, there are a lot of additional requirements. I'd imagine the CEO wouldn't want her occasional all-company messages to appear in people's spam folders, so that would either require giving her a massive store of Serios (which would encourage inflation, and by extension require Serios to be allocated by seniority) or giving her an opt-out like the existing important flag, which others would also want to use and abuse. There's also a very unbalanced pattern of communication for certain workers who need to send out a lot of informative emails without necessarily getting many back. For example, an office manager probably sends out a lot of all-building emails, some of which might be urgent, but which will be hard to allocate the right amount of currency to.

They do also mention the analysis of the currency flow as a way of charting how the organization actually works. That's an idea that's close to my own heart, since it lets you see the strength of ties between people who exchange a message, something I'm trying to do by analysing which emails are actually replied to.

Seriosity have obviously been working hard on this for the last couple of years, and it looks like they're getting close to a public release of their Attent product. I look forward to playing with it in practice, since it appears they have an open beta program I can apply for. I would link to the demo, but unfortunately the supplied http://www.seriosity.com/demo.html URL seems to be broken, though you can click on the 'view demo' link on this page to get to it.

November 29, 2007 in Outlook API | Permalink | Comments (0) | TrackBack (0)

Disruptor Monkey's Unifyr

Disruptor Monkey are a North Carolina-based startup, and they recently popped up on my radar through an article on Brad's blog. There's not much information available about their product, Unifyr, but they do have this well-produced teaser video. Based on that, I'd describe it as a way of accessing a lot of different data sources from a single interface.

The demo shows external web pages, email messages, CRM databases, and internal documents (both stored locally and on the network) all appearing in a unified file structure view. It looks like there's a way to search, sort and organize all of these sources of data from the same interface.

The workflow of using their product appears very streamlined and intuitive. It uses metaphors people already know: a button in a browser toolbar, and a folders-and-files view. This opens up the product to a lot of people who'd be put off by something geekier. They also have obviously thought about making it as hands-off and automatic as possible, with features like automatic tag generation and very easy folder sharing. I'd like to use it myself, based on what I've seen; as Nick put it in an email, I'm "well down the road drinking the kind of coolaid we enjoy".

I look forward to hearing how they position their product; it seems like it could help out with a lot of business processes, but they're not discussing any focus yet. I'm especially interested in how they approach this, since I'm struggling to find a painful enough problem to apply my own ideas to. Check out Nick Napp's blog to keep up to date with this very promising company.

[Screenshot: Unifyr interface]

November 28, 2007 in Outlook API | Permalink | Comments (1) | TrackBack (0)

Does email analysis invade privacy?

At Defrag, JP Rangaswami was talking over lunch about how he'd opened up his inbox to all his direct reports. This fascinated me, both for the behavior of his subordinates (they were most interested in his sent items, as a way of understanding what he was thinking) and because it was a very logical idea, but one I'd never heard even discussed before.

One of the great strengths of email as a communications tool is that it has a very clear security model. You explicitly list the people you want to receive the email by name, and they're the only ones who get it. There are plenty of ways information can be leaked, such as forwarding on to third parties, but these all require somebody to actively make a decision to do so. By contrast, it's much harder to know who can see an internal wiki page.

A lot of people focused on collaboration seem to believe that worrying about access is just oldthink that needs to be eradicated. The new collaboration tools will enable a new world where information is freely shared across the corporation, vaulting over traditional communication barriers. In my experience, there's a lot of non-technical reasons why people care about access.

Fundamentally, most leaders are rewarded for the results their team or department produces, since that's a lot easier to measure than the contribution their employees have made across the whole company. This means it's hard to justify spending resources collaborating with other internal teams, even though, looked at holistically, that might be best for the company. Taken to an extreme, this stovepiping can be crippling, but it's an inherent emergent feature of hierarchical organizations, so I don't see it disappearing anytime soon.

On an individual level, knowledge is power. People may have invested a lot of time building relationships within the company, and one reward is access to information others don't have. This may make them reluctant to share these sources, both for selfish reasons and so they can act as a filter for information requests, to make sure the source isn't overwhelmed with inappropriate ones.

There's also a lot of sensitive personnel-related communications that can go on. Even the knowledge that two people have ever emailed each other can be sensitive if it's a subordinate bypassing her boss and contacting human resources.

The sort of email analysis I'm interested in takes all of an organization's messages and does a global analysis to reveal useful relationships and information, especially about the social graph of the employees. This is not information that people were expecting to reveal when they sent their emails, and while there's nothing illegal about doing this (the emails all belong to the company if they're sent on company accounts), it does break the security model that people trust. That's both ethically uncomfortable and likely to be a barrier to adoption.

The solution has to be keeping the ownership and sharing of information within the user's control. One way of doing that is by default only allowing anonymous information to be publicly reported, which could include such things as how many people read or forwarded an email you sent. You could also designate certain internal mailing lists as publicly accessible across the organization. There's already an understanding that lists with open membership policies are not private, so this isn't changing the mental access model that people trust. Going a step further, you can give people tools to share certain emails, the way a lot of people share calendars at the moment. This would work particularly well tied into Sharepoint, since documents there have their own access model. In particular, it might be useful to add a special email address that publishes a message to the public intranet and makes it visible to email analysis tools.
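The anonymous-by-default reporting could be as simple as a minimum-group threshold: only publish a statistic once enough distinct people contribute to it. A sketch, with an arbitrary threshold of five:

```python
MIN_GROUP = 5  # arbitrary illustration threshold

def forward_counts(events):
    """events: iterable of (message_id, user) forward records. Returns
    per-message forward counts, suppressing messages forwarded by too
    few distinct people to stay anonymous."""
    by_message = {}
    for message_id, user in events:
        by_message.setdefault(message_id, set()).add(user)
    return {m: len(users) for m, users in by_message.items()
            if len(users) >= MIN_GROUP}
```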

It should be possible to overcome users' concerns about access and email analysis, but it will require some careful design. I can certainly understand why most existing services focus on either client-side tools, or global analysis designed to give top management or forensic analysts an unrestricted view of all emails; both approaches sidestep these issues.

November 27, 2007 in Outlook API | Permalink | Comments (0) | TrackBack (0)

Google's latest mail API


As Brad spotted, my previous post strong-armed Google into introducing a new mail migration API. Well, there was correlation, even if I'm not so sure about the causation. Looking through Google's latest offering, it's clearly aimed at one-way migration from other systems to Google Apps, rather than being a two-way interoperability standard that would allow a mix of Exchange and Gmail use within the same system.

To quote from the announcement, they introduced it because "some customers are reluctant to step into the future without bringing along the email from their past". I'd imagine there are also some customers who are 'reluctant to step into the future' if it's a one-way trip for all their email data, locking them into Google's OS going forward. Email, calendars and contacts are crying out for a nice open integration layer. The information you need is comparatively well-defined and bounded, and there are already supported standards for the components of the problem, like IMAP, vCard and iCalendar.

Microsoft has always made strong developer support a strategic priority. This is great for third-party vendors, but arguably was a factor in a lot of their security and usability issues. Google doesn't feel the same need to look after external developers, as shown by the removal of their search API. They'd much rather simplify the engineering and user experience by avoiding the clutter of hosting third-party code within their apps.

Even though it's ugly and COM-tastic, it's possible, with enough effort, to dig deep into Exchange's data stores and build deeply integrated tools. Moving to Google Apps (or most other SaaS apps I've seen), you lose that level of access. My hunch is that in a few years' time we'll see the same customer pressure that drove MS to open their enterprise tools to customization pushing SaaS companies to either offer APIs or lose business.

November 19, 2007 in Outlook API | Permalink | Comments (0) | TrackBack (0)
