PeteSearch

Ever tried. Ever failed. No matter. Try Again. Fail again. Fail better.



Krugle's approach to the enterprise

[Image: Krugle logo]

I've been interested in Krugle ever since I heard Steve Larsen speak at Defrag. They're a specialized search company, focused on returning results that are useful for software developers, including code and technical documentation. What caught my interest was that they had a product that solved a painful problem I know well from my own career: where's our damn code for X? Large companies accumulate a lot of source code over the years. The holy grail of software development has always been reuse, but with standard enterprise systems it can be more work to find old code that solves a problem than to rewrite it from scratch.

Their main public face is the open-source search site at krugle.org. Here you can run a search on both source code and code-related public web pages. I did a quick test, looking for some reasonably tricky terms from image processing: erode and dilate. Here's what Krugle finds, and for comparison here's the same query on Google's code search. Krugle's results are better in several ways. First, they seem to understand something about code structure, so they focus on source files where the keywords appear in a function name, and they show the function's definition. Most of Google's hits are constants or variable names, which are a lot less likely to be useful. Krugle also shows more relevant results for documentation, under the tech pages tab. A general Google web search for the same terms throws up a lot more pages that aren't useful to developers. Finally, Krugle knows about projects, so you can easily find out more about the context of a piece of source code, rather than just viewing it as an isolated file as you do with Google's code search.

Krugle have also teamed up with some big names like IBM and Sourceforge, to offer specialized search for the large public repositories of code that they control. Unfortunately, I wasn't able to find the Krugle interface directly through Sourceforge's site, and their default code search engine seems fairly poor, producing only two irrelevant results for erode/dilate. Using the Krugle Sourceforge interface produces a lot more; it seems a shame that Sourceforge don't replace theirs with Krugle.

So, they have a very strong solution for searching public source code. Where it gets interesting is that the same problem exists within organizations. Their solution is a network appliance server that you install inside your intranet and point at your source repositories, and it provides a similar search interface to your proprietary code. I find the appliance approach very interesting; Inboxer take a similar approach for their email search product, and of course there's the Google Search Appliance.

The code-finding problem is so painful that a lot of developers must be searching for a solution. It also seems like an easy sell to management, since they'll understand the benefits of reusing code rather than rewriting it. I wonder how the typical sale is made, though? I'd imagine it has to be an engineering initiative, and engineering typically doesn't have a discretionary budget for items like this. They do seem to have a strong presence at technical conferences like AjaxWorld, which must be good for marketing to the development folks at software businesses.

Overall, it seems like a great tool. I think there's a lot to learn from their model for anyone who's trying to turn specialized search software into a product that businesses will buy.

January 03, 2008 in Defrag, Outlook API

Here's a quick way to organize Outlook attachments

[Image: MAPILab logo]

Outlook Attachment Processor from MAPILab lets you save out all your email attachments to disk, and replaces them with links in the messages. I find it a lot easier to search and organize documents as objects on the file system than when they're embedded in emails, and this add-in makes it painless to move them over. It's got a large array of options, but they're well-explained and have good defaults, so it doesn't feel too much like the space shuttle control panel.

[Image: Attachment Processor options screenshot]

The most important options cover which messages and attachments are converted to local files, and where they end up. I like the way this addin focuses on solving a single painful problem, but with a lot of flexibility and depth for customizing that solution. It's obviously been heavily driven by user feedback.

Of course, there's a downside to saving all your attachments like this: the links break when you move to a different machine. There's an 'Update Links' tool to change them to a new location, but it shows that separating your attachments from the source PST does add some complications. You can try the add-in free for 30 days with a fully-functional trial version, and a single-user license costs $24.

MAPILab offer a range of other Microsoft plugins, including a couple of tools for Exchange. They employ 25 people, which shows the engineering effort that solid plugins like these require, and that there's market demand for their solutions. They explicitly spell out their strategy as targeting narrow problems, leaving larger companies to "focus on the creation of platforms and technological foundations".

One of the problems I'm interested in solving is making document collaboration through email less painful. Attachment Processor and some of their other tools like File Fetch and File Send Automatically are solving parts of what makes it so awkward. What I'd like to see is a more comprehensive system that offers the advantages of a wiki without having to force people away from sharing documents through email. It seems like an Exchange extension that turned attachments into links to Sharepoint documents, like Attachment Processor does for the local filesystem, would be an interesting direction to go down.

December 23, 2007 in Outlook API

Are White House emails wide open to hackers?

[Image: White House]

When I heard about the deletion of the White House emails back in April, and Karl Rove's use of a private email account, my first thought was 'wow, they must really struggle to keep that secure'. It's not often my technical research leads to a question of national security, but it turns out they don't struggle: they just leave a large part of their email system unsecured!

Emails that travel outside an organization to a private email account like Karl's go through an unencrypted, plain-text transport system: SMTP. In simple terms, a text document is passed from server to server until it reaches its destination. In theory, anybody who's sitting on the network can see the contents of those messages. Normally this isn't a big issue, since emails are low-value (typically not containing credit card numbers or other information valuable to hackers), and there are so many flying around that just being in the right place to sniff one, and picking an interesting one out from the noise, is tough.
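To make that concrete, here's a minimal sketch of a plain-text SMTP exchange, written with POSIX sockets. The server name and addresses are made up, but every byte shown would cross the wire unencrypted, readable at any hop along the route.

    // Minimal sketch of a plain-text SMTP exchange over POSIX sockets.
    // The server name and addresses are hypothetical; the point is that
    // every command and the message body travel across the network as-is.
    #include <cstdio>
    #include <string>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    static void chat(int sock, const std::string& line) {
        send(sock, line.c_str(), line.size(), 0);  // sent unencrypted
        char reply[512];
        ssize_t n = recv(sock, reply, sizeof(reply) - 1, 0);
        if (n > 0) { reply[n] = '\0'; printf("S: %s", reply); }
    }

    int main() {
        addrinfo hints = {}, *res = nullptr;
        hints.ai_socktype = SOCK_STREAM;
        // "mail.example.com" stands in for any relay along the route.
        if (getaddrinfo("mail.example.com", "25", &hints, &res) != 0) return 1;
        int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (connect(sock, res->ai_addr, res->ai_addrlen) != 0) return 1;

        char greeting[512];
        ssize_t n = recv(sock, greeting, sizeof(greeting) - 1, 0);
        if (n > 0) { greeting[n] = '\0'; printf("S: %s", greeting); }

        // Anyone sniffing the wire sees exactly these bytes, body included.
        chat(sock, "HELO sender.example.com\r\n");
        chat(sock, "MAIL FROM:<staffer@example.com>\r\n");
        chat(sock, "RCPT TO:<karl@private-mail.example>\r\n");
        chat(sock, "DATA\r\n");
        chat(sock, "Subject: strategy memo\r\n\r\nReadable at every hop.\r\n.\r\n");
        chat(sock, "QUIT\r\n");
        close(sock);
        freeaddrinfo(res);
        return 0;
    }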

David Gewirtz, a techie who runs OutlookPower magazine, has spent months researching the technical aspects of the White House's email use. He's now published a book, and it's scary reading for anyone who cares about America's security. You can read extracts from it at this site, but I recommend looking through the original articles too. Start with "Prepare to be freaked out" to understand how serious the consequences of their poor technology decisions could be. This isn't a partisan or crazy conspiracy book; email is something every administration in the last 20 years has made serious mistakes with, and David ends with recommendations on how to improve the current dire situation.

Buy the book, but here's a full list of the related articles:

  • Technical analysis: the White House email controversy
  • The White House email controversy: who runs GWB43.COM?
  • The White House email controversy: a detour into mob journalism
  • The White House email controversy: the nightmare scenario
  • The White House email controversy: an archiving plan only FEMA could love
  • 'Deep Mail' on the White House email controversy
  • The White House email controversy: migrating from Notes to Outlook
  • The White House email controversy: why does Karl Rove keep losing his BlackBerry?
  • The White House email controversy: help us find those missing messages
  • The White House email controversy: a historical perspective
  • The White House email controversy: prepare to be freaked out
  • The White House email controversy: understanding the root causes
  • The White House email controversy: our formal recommendations
  • The White House email controversy: the final questions

December 21, 2007 in Outlook API, Personal

If blog comments are dark matter, then what's the dark energy?

[Image: WMAP map of the cosmic microwave background]
Brad called blog comments the dark matter of the net. They're really hard to search, so there's a lot of useful information that's effectively lost to the world. What's driving a lot of my work is my belief that email is the dark energy.

Dark energy makes up 74% of the universe, versus 22% for dark matter. There are an estimated 200 billion emails sent every day, whereas the number of active blogs is in the low millions. I'm wandering dangerously close to Chinese math here, but even assuming the vast majority of emails are low in information content, that's a lot of untapped data that people are entering into computers.

The reason nobody's taking advantage of this is that emails are a very personal and private medium, not intended for public consumption, unlike blog posts or comments, which are explicitly published to the world. My hypothesis is that there's a category of people for whom exposing partial information about their email, possibly to a limited audience, will solve some painful problems. JP Rangaswami is my poster child; he opened up his inbox to all his direct reports as a way of mentoring and sharing information with them, as well as cutting down on how much they complain about each other! I wouldn't go that far, but I do wish I could easily expose all of my technical discussion email threads to the rest of my team.

There are practical steps that can be taken within a business setting to make a lot more information available, since that's one place where you have access to a whole set of interacting email messages. I want to find subject-matter experts within the organization, or the people who have been in contact with an external group or person you want information on. Doing social graph analysis on an Exchange server full of messages will help with that, as will statistical analysis for picking out keywords. I'm excited to see what tools I can build on these foundations. Stay tuned...
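In the meantime, here's a minimal sketch of the kind of analysis I mean. The Message struct and the sample addresses are stand-ins for whatever the mail store actually provides; real Exchange access would go through MAPI or a similar API.

    // Sketch: build a sender-to-recipient contact graph and per-person
    // keyword counts from a batch of messages. The Message struct and
    // the sample data are stand-ins for what a real mail store returns.
    #include <cstdio>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    struct Message {
        std::string sender;
        std::vector<std::string> recipients;
        std::string body;
    };

    int main() {
        std::vector<Message> messages = {
            {"alice@corp.example", {"bob@corp.example"}, "erode dilate filter code"},
            {"bob@corp.example", {"alice@corp.example"}, "filter looks good"},
        };

        // Contact graph: message count along each sender->recipient edge.
        std::map<std::pair<std::string, std::string>, int> edges;
        // Expertise hints: how often each person uses each keyword.
        std::map<std::string, std::map<std::string, int>> keywords;

        for (const auto& m : messages) {
            for (const auto& r : m.recipients) edges[{m.sender, r}]++;
            std::istringstream words(m.body);
            std::string w;
            while (words >> w) keywords[m.sender][w]++;
        }

        // Heavy edges suggest working relationships; keyword counts
        // suggest who the in-house expert on a topic might be.
        for (const auto& e : edges)
            printf("%s -> %s: %d\n", e.first.first.c_str(),
                   e.first.second.c_str(), e.second);
        for (const auto& p : keywords)
            for (const auto& kv : p.second)
                printf("%s says '%s' x%d\n", p.first.c_str(),
                       kv.first.c_str(), kv.second);
        return 0;
    }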

December 19, 2007 in Defrag, Implicit Web, Outlook API

Inboxer - An easy way to spy on your employees' emails?

[Image: Inboxer logo]
I first ran across Inboxer through their excellent Enron email exploration site. They offer a server appliance that sits inside a company's firewall, analyzes all internal email, and provides a GUI for exploring the messages. They have some sophisticated tools that surface the kinds of emails management would be interested in, such as those with objectionable content, recruitment-related messages, or mail involving external contacts. They also let you set up alerts and triggers if particular conditions are met, such as unauthorized employees emailing messages that appear to contain contracts to external addresses. You can experiment with their UI through the Enron site; it seems to be pretty well laid out, and simple enough for non-technical people to use.

[Image: Inboxer time graph]
They offer graphs of important statistics over time.

[Image: Inboxer search screenshot]
There's a set of pre-packaged searches for things management are commonly concerned about. You can drag and drop any of them onto the main pane, and you'll get a view of all the relevant emails.

They've done a great job technically with Inboxer; it seems like a well-rounded service. I'm a bit disturbed that this is what the market is demanding, though. Despite it being pretty clear from a legal standpoint that the company has no duty of privacy, most people don't treat their work emails as public documents. Some of the searches, such as those for recruitment terms, are clearly aimed at catching employees doing something they don't want management to know about, but that isn't aimed at harming the company. I worry that it would be incredibly tempting to use this as a technical fix for a management problem: instead of keeping employees from job-hunting by keeping them happy, just try to punish anyone who makes the mistake of using the company system in their search.

I believe the Inboxer team has done their homework; they've clearly tried a lot of different tools, and this is the one that seems most successful. There are a lot of legitimate uses, especially in regulated industries and government organizations, where liability issues require some email controls. I just wish a less command-and-control, top-down approach were more popular. If Inboxer also offered a client-side version, I'd much rather work for a company that used that instead. It could make it clear which emails would be flagged and looked at before they were sent, and help employees understand how public their work emails really are.

Roger Matus, the CEO of Inboxer, has collected a lot of useful email and messaging news in his blog, Death by Email. I'd recommend a visit if you're interested in their work.

December 17, 2007 in Defrag, Implicit Web, Outlook API

How to access the Enron data painlessly

[Image: Enronic screenshot]

Yesterday I gave an overview of the Enron email corpus, but since then I've discovered a lot more resources. A whole academic ecosystem has grown up around it, and it's led me to some really interesting research projects. Even better, the raw data has been put up online in several easy-to-use formats.

The best place to start is William Cohen's page, which has a direct download link for the text of the messages as a tar archive, as well as a brief history of the data and links to some of the projects using it. Another great resource is a MySQL database containing a cleaned-up version of the complete set, which could be very handy for a web-based demo.
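As a rough sketch of how handy that could be, here's how you might query the database from C++ with the standard MySQL client library. I haven't checked the dump's actual schema, so the connection details, table, and column names below are guesses; check the documentation that ships with it.

    // Sketch of pulling messages out of the Enron MySQL dump with the
    // plain C client library. Connection details, table, and column
    // names are assumptions, not the dump's verified schema.
    #include <mysql/mysql.h>
    #include <cstdio>

    int main() {
        MYSQL* conn = mysql_init(nullptr);
        if (!mysql_real_connect(conn, "localhost", "user", "password",
                                "enron", 0, nullptr, 0)) {
            fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
            return 1;
        }
        // Hypothetical schema: a 'message' table with sender and subject.
        if (mysql_query(conn, "SELECT sender, subject FROM message LIMIT 10") != 0) {
            fprintf(stderr, "query failed: %s\n", mysql_error(conn));
            return 1;
        }
        MYSQL_RES* result = mysql_store_result(conn);
        while (MYSQL_ROW row = mysql_fetch_row(result))
            printf("%s: %s\n", row[0], row[1]);
        mysql_free_result(result);
        mysql_close(conn);
        return 0;
    }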

Berkeley has done a lot of interesting work using the emails. Enronic is an email graph viewer, similar in concept to Outlook Graph but with a lot of interesting search and timeline view features. Jeffrey Heer's produced a lot of other interesting visualization work too. He's released several toolkits, and some compelling work on collaborating through visualization, like the sense.us demographic viewer and annotator.

Equally interesting was this paper on automatically categorizing emails based on their content, comparing some of the popular techniques with the categorization reflected in the email folders that the recipients had used to organize them. Ron Bekkerman has some other interesting papers too, like this one on building a social network from a user's mailbox, and then expanding it by locating the members' home pages on the web.

December 14, 2007 in Defrag, Implicit Web, Outlook API

Which corporation generously donated all their emails to the public domain?

[Image: Enron logo]
One of the challenges of trying to build a tool that does something useful with a corporation's emails is finding a good data set to experiment on. No company is going to give a random developer access to all of their internal emails. That's where Enron comes to the rescue. The Federal Energy Regulatory Commission released over 16,000 emails to the public as part of its investigation into the 2001 energy crisis.

They're theoretically available online, but through a database interface that seems designed to make it hard to access, and throws up server errors whenever I try to use it. Luckily, they do promise to send you full copies of their .pst databases through the postal system if you pay a fee. If only there were some kind of global electronic network that you could use to transmit files... I will check the license and try to make it available online myself if I can, once I receive the data.

I first became aware of this data through Trampoline Systems' Enron Explorer, which demonstrates their email analysis using this data set. Since then, I also ran across a paper analyzing human response times to emails that builds on this information.

December 13, 2007 in Outlook API

The secret to showing time in tag clouds...

[Image: animated tag cloud]

... is animation! I haven't seen this used on any commercial sites, but Moritz Stefaner has a Flash example of an animated cloud from his thesis. You should check out his other work there too; it includes some really innovative ways of displaying tags over time, like this graph showing tag usage:

Taggraph

His thesis title is "Visual tools for the socio-semantic web", and he really delivers: nine completely new ways of displaying the data, most of them time-based. Even better, he has interactive and animated examples online for almost all of them. Somebody needs to hire him to develop them further.

Moritz has his own discussion of the motivations and problems of animated tag clouds. For my purposes, I want to give people a way to spot changes in the importance of email topics over time. Static tag clouds are great for showing the relative importance of a large number of keywords at a glance, and animation is a way of bringing the rise and decline of topics to life in an easy-to-absorb way. Intuitively, a tag cloud of words from email subjects would show 'tax' suddenly blinking larger in the US in April. On a more subtle level, you could track product names in customer support emails, and get an idea of which were taking the most resources over time. Trying to pull that same information from the data arranged as a line graph is a lot harder.

There are some practical problems with animating tag clouds. Static clouds are traditionally arranged with words abutting each other, which means that when one word changes size, it shifts the position of every word after it, a very distracting effect. One way to avoid this is to accept some overlap between the words as they change size, but that makes the result a lot more cluttered and hard to read. You can also increase the average separation between terms, which cuts down the overlap but results in a much sparser cloud.
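Here's a minimal sketch of the data side of what I have in mind: count a term's frequency per time bucket (say, a month of email subjects), then interpolate its font size between buckets so it grows and shrinks smoothly, while every term keeps a fixed anchor position, accepting some overlap per the trade-off above. The numbers are invented for illustration.

    // Sketch: per-term frequency counts in time buckets, with the font
    // size linearly interpolated between buckets so terms grow and
    // shrink smoothly instead of popping. Sample data is invented.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Font size for one term at fractional time 't', measured in buckets.
    float sizeAtTime(const std::vector<int>& countsPerBucket, float t,
                     float minSize, float maxSize, int maxCount) {
        int bucket = std::min((int)t, (int)countsPerBucket.size() - 2);
        float frac = t - bucket;
        // Blend this bucket's count with the next one's.
        float count = countsPerBucket[bucket] * (1.0f - frac) +
                      countsPerBucket[bucket + 1] * frac;
        return minSize + (maxSize - minSize) * (count / maxCount);
    }

    int main() {
        // 'tax' in email subjects, January through June: spikes in April.
        std::vector<int> tax = {3, 4, 10, 40, 8, 2};
        for (float t = 0.0f; t <= 4.0f; t += 0.5f)
            printf("t=%.1f size=%.1fpt\n",
                   t, sizeAtTime(tax, t, 8.0f, 36.0f, 40));
        return 0;
    }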

I'm interested in trying out some other arrangement approaches. For example, I'm fond of the OS X dock animation model, where large icons do squeeze out their neighbors, but in a very unobtrusive way. I'm also hopeful there are some non-Flash ways to do this with just JavaScript.

December 12, 2007 in Defrag, Implicit Web, Outlook API

How to write a graph visualizer and create beautiful layouts

[Image: network graph visualization]

If your application needs a large graph, as I did with my Outlook mail viewer, the first thing you should do is check for an existing library that will work for you. Matt Hurst has a great checklist for how to evaluate packages against your needs. If you can find one off-the-shelf, it'll save a lot of time.

If you need to write your own, the best way to start is absorbing this Wikipedia article on force-based layout algorithms. It has pseudo-code describing the basic process you'll need to arrange your graph. It boils down to a physics simulation of a bunch of particles connected by springs, where the particles repel each other when they get close. If you've ever written a simple particle system, you should be able to handle the needed code.
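Here's a minimal sketch of one simulation step along those lines; the parameter values are placeholders you'd tune for your own data. Note that the repulsion loop visits every pair of nodes, which is exactly the performance problem discussed next.

    // Minimal force-directed layout step: springs pull connected nodes
    // together, all nodes repel each other, and friction bleeds off
    // energy until the layout settles. Parameter values are placeholders.
    #include <cmath>
    #include <vector>

    struct Node { float x, y, vx, vy; };
    struct Edge { int a, b; };

    void step(std::vector<Node>& nodes, const std::vector<Edge>& edges,
              float stiffness, float repulsion, float friction, float dt) {
        std::vector<float> fx(nodes.size(), 0), fy(nodes.size(), 0);

        // Repulsion: every pair pushes apart, falling off with distance squared.
        for (size_t i = 0; i < nodes.size(); i++) {
            for (size_t j = i + 1; j < nodes.size(); j++) {
                float dx = nodes[i].x - nodes[j].x;
                float dy = nodes[i].y - nodes[j].y;
                float distSq = dx * dx + dy * dy + 0.01f;  // avoid divide-by-zero
                float dist = std::sqrt(distSq);
                float f = repulsion / distSq;
                fx[i] += f * dx / dist; fy[i] += f * dy / dist;
                fx[j] -= f * dx / dist; fy[j] -= f * dy / dist;
            }
        }
        // Springs: each edge pulls its endpoints together (Hooke's law).
        for (const Edge& e : edges) {
            float dx = nodes[e.b].x - nodes[e.a].x;
            float dy = nodes[e.b].y - nodes[e.a].y;
            fx[e.a] += stiffness * dx; fy[e.a] += stiffness * dy;
            fx[e.b] -= stiffness * dx; fy[e.b] -= stiffness * dy;
        }
        // Integrate with friction so the system settles instead of oscillating.
        for (size_t i = 0; i < nodes.size(); i++) {
            nodes[i].vx = (nodes[i].vx + fx[i] * dt) * friction;
            nodes[i].vy = (nodes[i].vy + fy[i] * dt) * friction;
            nodes[i].x += nodes[i].vx * dt;
            nodes[i].y += nodes[i].vy * dt;
        }
    }

    int main() {
        std::vector<Node> nodes = {{0, 0, 0, 0}, {1, 0, 0, 0}, {0, 1, 0, 0}};
        std::vector<Edge> edges = {{0, 1}, {1, 2}};
        for (int i = 0; i < 100; i++) step(nodes, edges, 0.1f, 1.0f, 0.9f, 0.05f);
        return 0;
    }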

It's pretty easy to get something that works well for small numbers of nodes, since the calculations aren't very intensive. For larger graphs, the tricky part is handling the repulsion, since in theory every node can be repelled by every other node in the graph. This means the naive algorithm loops over every particle when calculating the repulsion for each one, which gives O(N^2) performance. The key to optimizing this is taking advantage of the fact that most nodes are only close enough to be repelled by a few others, and building a spatial data structure before each pass so you can quickly tell which nodes to look at in any particular region.

I ended up using a 2D array of buckets, each about the size of a particle's repulsion fall-off distance. That meant I could just check the buckets immediately neighboring the one a particle was in to find the others that would affect it. The biggest problem was keeping the repulsion distance small enough that the number of particles to check stayed low.
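A rough sketch of that bucketing idea, using a map as the grid for brevity; my real version used a flat 2D array, which is faster, and built the grid once per pass rather than per query.

    // Sketch of the bucket-grid optimization: hash each node into a grid
    // cell about the size of the repulsion fall-off distance, then only
    // test a node against occupants of its own and neighboring cells.
    #include <cmath>
    #include <map>
    #include <utility>
    #include <vector>

    struct Point { float x, y; };

    std::vector<int> nearbyNodes(const std::vector<Point>& nodes, int query,
                                 float cellSize) {
        // Build the grid. A real implementation would build this once per
        // simulation pass and reuse it for every node.
        std::map<std::pair<int, int>, std::vector<int>> grid;
        for (int i = 0; i < (int)nodes.size(); i++)
            grid[{(int)std::floor(nodes[i].x / cellSize),
                  (int)std::floor(nodes[i].y / cellSize)}].push_back(i);

        // Only the query node's cell and its eight neighbors can contain
        // anything within the repulsion distance.
        int cx = (int)std::floor(nodes[query].x / cellSize);
        int cy = (int)std::floor(nodes[query].y / cellSize);
        std::vector<int> result;
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++) {
                auto it = grid.find({cx + dx, cy + dy});
                if (it == grid.end()) continue;
                for (int idx : it->second)
                    if (idx != query) result.push_back(idx);
            }
        return result;
    }

    int main() {
        std::vector<Point> nodes = {{0, 0}, {0.5f, 0.5f}, {10, 10}};
        // Finds only the nearby node, skipping the distant one entirely.
        return (int)nearbyNodes(nodes, 0, 2.0f).size();
    }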

In general, tuning the physics parameters to get a particular look is extremely hard. The basic parameters you can alter are the stiffness of the springs, the repulsion force, and the system's friction. Unfortunately, it's hard to know what visual effect changing one of them will have; they're only indirectly linked to desirable properties like an even scattering of nodes. I'd recommend implementing an interface that lets you tweak them while the simulation is running, to try to find a good set for your particular data. I attempted to find some that worked well for my public release, but I do wish there were a different algorithm based on satisfying some visually-pleasing constraints as well as the physics-based ones. I did end up implementing a variant on the spring equation that repelled when the connection was too short, which seemed to help reduce the required repulsion distance, and is a lot cheaper to calculate.
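For reference, here's my reading of that spring variant: an ordinary Hooke spring with a rest length, which pulls when an edge is stretched and pushes when it's compressed, so connected nodes get some repulsion for free.

    // Sketch of a spring with a rest length: positive magnitude when
    // stretched (attracts), negative when compressed (repels). This is
    // one interpretation of the variant described above.
    #include <cmath>

    struct Force { float x, y; };

    Force springForce(float ax, float ay, float bx, float by,
                      float stiffness, float restLength) {
        float dx = bx - ax, dy = by - ay;
        float dist = std::sqrt(dx * dx + dy * dy) + 1e-6f;
        float magnitude = stiffness * (dist - restLength);
        return {magnitude * dx / dist, magnitude * dy / dist};
    }

    int main() {
        Force f = springForce(0, 0, 3, 0, 0.5f, 1.0f);  // stretched: pulls
        return f.x > 0 ? 0 : 1;
    }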

A fundamental issue I hit is that all of my nodes are heavily interconnected, which makes positioning nodes so that they are equally separated an insoluble problem. They often end up in very tight clumps in the center, since many of them want to be close to many others.

Another problem I hit was numerical explosions in velocities, because the time-step I was integrating over was too large. This is an old problem in physics simulations, with some very robust solutions, but I was able to get decent behavior with a combination of shorter fixed time steps, and a 'light-speed' maximum velocity. I also considered dynamically reducing the time-step when large velocities were present, but I didn't want to slow the simulation.
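A sketch of those two stabilizers together, with illustrative constants: integrate in short fixed steps, and cap the speed so a single bad frame can't fling nodes off to infinity.

    // Sketch: fixed short time-steps plus a 'light-speed' velocity cap.
    // The constants are illustrative, not tuned values.
    #include <cmath>

    void integrate(float& x, float& y, float& vx, float& vy,
                   float fx, float fy, float frameTime) {
        const float kFixedStep = 0.01f;   // short fixed time-step
        const float kLightSpeed = 50.0f;  // maximum allowed speed
        for (float t = 0; t < frameTime; t += kFixedStep) {
            vx += fx * kFixedStep;
            vy += fy * kFixedStep;
            float speed = std::sqrt(vx * vx + vy * vy);
            if (speed > kLightSpeed) {    // clamp, preserving direction
                vx *= kLightSpeed / speed;
                vy *= kLightSpeed / speed;
            }
            x += vx * kFixedStep;
            y += vy * kFixedStep;
        }
    }

    int main() {
        float x = 0, y = 0, vx = 0, vy = 0;
        integrate(x, y, vx, vy, 1000.0f, 0.0f, 0.1f);  // huge force, bounded speed
        return vx <= 50.0f ? 0 : 1;
    }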

I wrote my library in C++, but I've seen good ones running in Java, and I'd imagine any non-interpreted language could handle the computations. All of the display was done through OpenGL, and I actually used GLUT for the interface, since my needs were fairly basic. For profiling, Intel's VTune was very helpful in identifying where my cycles were going. I'd also recommend planning on implementing threading in your simulation code from the start, since you'll almost certainly want to allow user interaction at a higher frequency than the simulation can run with large sets of data.
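The shape of the threading I mean looks something like this. I've used std::thread for brevity, though in practice on Windows in 2007 this meant Win32 threads; the principle is the same: the simulation owns a working copy and publishes positions under a lock for the UI to snapshot.

    // Sketch: simulation runs on a worker thread, the UI thread takes a
    // locked snapshot of positions whenever it redraws. std::thread is
    // a modern stand-in for Win32 threads or pthreads.
    #include <atomic>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Positions { std::vector<float> x, y; };

    std::mutex gPositionsMutex;
    Positions gShared;                 // written by sim, read by UI
    std::atomic<bool> gRunning{true};

    void simulationThread() {
        Positions working{{0.0f}, {0.0f}};
        while (gRunning) {
            working.x[0] += 0.001f;    // stand-in for a real physics step
            std::lock_guard<std::mutex> lock(gPositionsMutex);
            gShared = working;         // publish the latest layout
        }
    }

    Positions snapshotForDrawing() {   // called from the UI/render loop
        std::lock_guard<std::mutex> lock(gPositionsMutex);
        return gShared;
    }

    int main() {
        std::thread sim(simulationThread);
        Positions snap = snapshotForDrawing();  // the UI would draw this
        (void)snap;
        gRunning = false;
        sim.join();
        return 0;
    }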

December 11, 2007 in Coding, Defrag, Implicit Web, Outlook API

See connections in your email with Outlook Graph

[Image: Outlook Graph screenshot]

Outlook Graph is a Windows application I've been working on to explore the connections that exist in large mail stores. If you're feeling brave, I've just released an alpha version:
Download OutlookGraphInstall_v002.msi.

It will examine all of the Outlook mail messages on the computer, and try to arrange them into a graph with related people close together. The frequency of contact between two people is shown by the darkness of the lines connecting them. My goal is to discover interesting patterns and groupings, as a laboratory for developing new tools based on email data.
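For the curious, the frequency-to-darkness mapping is the kind of thing sketched below; the log scale and thresholds here are illustrative rather than exactly what ships in the alpha.

    // Sketch: map contact frequency to line darkness. A log scale keeps
    // a few very chatty pairs from washing out everyone else; the result
    // would feed an alpha value into whatever draws the lines.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    float edgeDarkness(int messageCount, int maxCount) {
        float d = std::log(1.0f + messageCount) / std::log(1.0f + maxCount);
        return std::clamp(d, 0.05f, 1.0f);  // keep every edge faintly visible
    }

    int main() {
        // A pair that swapped 5 messages versus one that swapped 500.
        printf("occasional: %.2f\n", edgeDarkness(5, 500));
        printf("constant:   %.2f\n", edgeDarkness(500, 500));
        return 0;
    }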

The application sends no network traffic, and doesn't modify any data, but since it's still in the early stages of testing, I'd recommend using it only on fully backed-up machines. It runs a physics simulation to find a good graph, so on very large numbers of messages it may take a long time to process. I've been testing on 10-12,000 message mailboxes on my laptop, so I'll be interested in hearing how it scales up beyond that.

December 09, 2007 in Defrag, Implicit Web, Outlook API
