How to fix illegal character errors in PHP XML parsing

Stop
Photo by Intimaj

I'm still plagued by occasional failures in my XML parsing due to illegal characters. Explicitly setting the character encoding reduced the frequency, but they're still popping up occasionally. I have a couple of techniques I've tried. One is to use iconv() to strip out any illegal characters for the set I'm using, eg

$output = iconv("ISO-8859-1", "ISO-8859-1//IGNORE", $input);

This apparently works with more complex unicode sets, but at the moment I'm sticking with an 8 bit character encoding. The problem is that all values correspond to a defined character in ISO-8859-1. It took some head-scratching to realize that ISO-8859-1 is not the same as ISO 8859-1! The extra hyphen after ISO denotes an extended version that includes values in the range 0x00 to 0x1f, 0x7f and 0x80 to 0x9f. This fills up the range of mapped values, so that any number between 0 and 255 corresponds to a valid character in ISO-8859-1, and the line above does nothing.

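So, in theory that will fix Unicode encodings, but I need something that will handle the characters that are valid in ISO-8859-1 but that aren't allowed by the XML spec. These are the control characters in the range 0x00 to 0x1f, apart from tab (0x09), line feed (0x0a) and carriage return (0x0d), which XML 1.0 does permit. 0x7f is technically legal but discouraged, so it's worth stripping too. To replace these you can run a regular expression that looks something like this:

s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]//g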
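
In PHP the same clean-up can be done with preg_replace() before the text reaches the parser. This is only a minimal sketch: the helper name strip_xml_control_chars() is made up for illustration, and it assumes a single-byte encoding like ISO-8859-1. It leaves tab, line feed and carriage return alone, since XML 1.0 allows them.

```php
<?php
// Hypothetical helper, not part of any library: removes the control
// characters that XML 1.0 forbids (plus 0x7f), but keeps tab (0x09),
// line feed (0x0a) and carriage return (0x0d).
// Assumes single-byte input such as ISO-8859-1.
function strip_xml_control_chars($input)
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);
}

$dirty = "Hello\x00 wor\x1Fld\tagain\n";
$clean = strip_xml_control_chars($dirty);
// $clean is now "Hello world\tagain\n"
```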

I had a large file on disk that I wanted to change, so I actually used sed and its control character class shorthand. Bear in mind that this class also matches tabs and carriage returns:

sed 's/[[:cntrl:]]//g' messages.xml > messages.xml.fixed

This solved the illegal character error I was hitting. Now I'm hitting "XML error: EntityRef: expecting ';' at line 451837", and inspection of the text hasn't helped me figure out what's wrong yet. At least I've got a lot further through the file.

Even more ways to speed up IMAP Gmail importing in PHP

Bomberos
Photo by Zerega

In my last two articles on importing mail from Google in PHP I thought I'd got performance up to a pretty high level, but once I started testing with mailboxes with over 30,000 mails, I realized I had to be more creative.

The main trick I discovered in that investigation is using imap_fetch_overview() to get information on a lot of messages at once. This is a lot faster than grabbing the full header info for a single message at a time using imap_headerinfo(). The downside is that it doesn't return as much information about each message. For me the most painful loss was that you only get the first recipient. Another wrinkle is that you don't get the sender information separated into the email address and display portions, you just get a single string that may contain either both, or just the address. I had to write my own regex parser to pull out the two components.

I've updated my sample code to use the overview function, and it includes the code to split up the combined sender string too. You can try it online, or download it as evenfasterphpgmail.zip. The sender parsing code is also included below:

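// Splits a sender string such as "Display Name <user@example.com>" into
// its address and display parts, returned as an associative array.
function extract_address_from_display($full)
{
    // First look for the "Display Name <address>" form
    $matchcount = preg_match_all(
        "/(.*)<[^\._a-zA-Z0-9-]*([\._a-zA-Z0-9-]+@[\._a-zA-Z0-9-]+).*>/i",
        $full, $matches);
    if ($matchcount)
    {
        $address = $matches[2][0];
        $display = $matches[1][0];
    }
    else
    {
        // Otherwise look for a bare address anywhere in the string
        $matchcount = preg_match_all(
            "/[\._a-zA-Z0-9-]+@[\._a-zA-Z0-9-]+/i",
            $full, $matches);
        if ($matchcount)
        {
            // No separate display name, so reuse the address for both
            $address = $matches[0][0];
            $display = $address;
        }
        else
        {
            // No address found, so keep the whole string as the display name
            $address = "";
            $display = $full;
        }
    }

    return array( "address" => $address, "display" => $display);
}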
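
For very large mailboxes it may also help to fetch the overviews in batches rather than in one huge request. The sketch below isn't taken from the downloadable sample: overview_batches() is a hypothetical helper that builds the "start:end" sequence strings imap_fetch_overview() accepts, and the batch size of 500 is an arbitrary choice.

```php
<?php
// Hypothetical helper: splits message numbers 1..$total into
// "start:end" sequence strings suitable for imap_fetch_overview().
function overview_batches($total, $batchsize = 500)
{
    $batches = array();
    for ($start = 1; $start <= $total; $start += $batchsize)
    {
        $end = min($start + $batchsize - 1, $total);
        $batches[] = $start . ":" . $end;
    }
    return $batches;
}

// Usage sketch, assuming $mbox is an open connection from imap_open():
// foreach (overview_batches(imap_num_msg($mbox)) as $sequence)
// {
//     foreach (imap_fetch_overview($mbox, $sequence) as $overview)
//     {
//         $sender = extract_address_from_display($overview->from);
//     }
// }
```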

Welcome to the United States of America

Greencard

I've just been accepted as a permanent resident here in the US, with the green card (actually mostly white) arriving a few days ago. It's taken me 7 years of patience and struggle, but now I've graduated from a temporary work visa tied to a single employer, to an independent person, free to follow my dreams. It's a giddy feeling, both the new-found security that I won't have to leave the country and the liberation of having no restrictions on my professional life.

I'm counting down the days to naturalization now, just 5 years from now I can be a full citizen. I knew very quickly after arriving that I belonged here, as much as I miss my family and friends from Britain. America is full of encouragement for people dreaming big dreams, it's the best place in the world for doing something that's never been done. Thanks to everyone who's kept me going through the long process of getting this sorted out, especially Liz.

How social networks control your company

Chat
Photo by Belinketeneghe

Brokerage and Closure by Ronald Burt is a must-read for anyone interested in innovation and social networks. He's a sociologist with the Chicago Graduate School of Business who's spent years mapping and analyzing the patterns of relationships in large companies like Raytheon. This book describes how new ideas, trust and power flow directly from these networks.

The title refers to the two forces that shape who you talk to. Closure is the technical term for how insular a group of people are, measured by the strength of relationships between all the insiders, and the weakness of ties with outsiders. If you draw a graph of the communications within a group with high closure, you see a lot of lines between the members, and few contacts with others:
Closure
In everyday language, a cluster of people with high closure would be called a clique. They form because they have some big advantages. It's a lot easier to trust someone you've no experience with if you share mutual friends, because the risk to their reputation will be severe if they let you down. The dense pattern of communications also makes sure that practices and beliefs get spread and standardized quickly throughout the group.

Large organizations are made up of many of these self-contained teams, each with their own shared experiences, ideas and ways of doing things. Brokerage is the act of bridging the gaps, or structural holes, between these groups in the network. People who have connections with multiple groups that would be otherwise unconnected are known as brokers or bridges.

Broker

They play an important role in innovation because they have the chance to introduce good ideas from one team into another, or combine partial insights from multiple groups into a new approach to a problem. They also have political advantages because they have more information about the motivations and goals of other teams, and can use that knowledge to help steer decision-making to avoid conflicts and gain support for initiatives.

Where Burt really shines is the application of this general model to the wealth of data from sociological studies within companies, together with his own personal experiences of working with large businesses. He sets out to prove 4 'stylized facts' about how brokerage and closure work in practice:

Brokers do better. He uses network analysis together with personnel records to show that people who have strong connections outside their immediate team get paid more, and promoted faster.

Brokers have better ideas. Analyzing the ranking of improvements for a supply-chain management department together with the connectedness of the people suggesting them, he builds a case that the reason brokers do better is because of the quality of the ideas they come up with.

Brokerage is useless without closure. This is less of a slam-dunk, but he gathers evidence that brokers don't help when the teams themselves are fragmented and poorly coordinated. Intuitively this makes sense, groups who can't communicate internally won't be able to execute even given the best ideas.

The echo chamber amplifies closure. Treating networks as information circuits ignores the primate biases that actually guide our social behavior. In particular, etiquette demands that we avoid contradicting a conversation partner when possible. This and similar habits mean that reputations are exaggerated in a feedback loop through gossip, since people you talk to will tend to agree with your assessment of someone, even if they don't hold the same opinion. This gives the illusion of corroborating evidence for your views, and tends to tighten the bonds that bind a group together and more strongly exclude outsiders. This is a tough one to tease out from the data, but he shows that the more mutual contacts you share with someone, the stronger your opinion of them, even if that opinion disagrees with the assessments of your shared contacts.

This is vital reading for anyone dealing with social networks because of the applications of these theories to the design of our tools. At the start he talks about the delusion that having lots of contacts in a network adds value, when instead the really valuable connections are those outside your immediate group, and how this is where businesses like LinkedIn and Tacit should be focusing their efforts.

I'm particularly interested because most of my work has been aimed at making brokerage easier and faster. Defrag Connector was about establishing initial trust between conference attendees by revealing mutual friends. I'm analyzing email to reveal the existing communication networks, and identify good candidates for brokerage contacts because they're experts in a helpful area, or have external contacts that would be useful. Most of his data comes from self-reported surveys of who people talk to, I'd love to run some of his work against my large company email data sets. He mentions Valdis Krebs in the foreword, but I was disappointed I didn't see any references to his work deriving networks from implicit communication data.

Burt is writing for an academic audience, so he presents a lot of the primary data backing up his arguments, which can make it a tough read for generalists like me. He's got a readable style though, and I love some of the anecdotes that pop up throughout, such as the quote from a manager explaining that when analyzing improvement ideas "that were either too local in nature, incomprehensible, vague or too whiny, I didn't rate them."

Why the passive voice is considered harmful

Faceless
Photo by MadMannequin

I really, really hate the passive voice. I had to rewrite my bachelor's thesis after my supervisor rejected my active version. People use it to add an aura of faceless authority to what they're writing, as if it's not just someone's opinion, it's the way the world is. Things occur, there's nobody to argue with, they just are. George Orwell agreed too, including it as one of his 5 rules of effective writing.

Most companies I admire write their copy in the active voice, see Feedburner's about page for a good example. It's part of a stance that they are in a conversation with their customers as equals, not talking down to them. The passive voice says "There's no one you can talk to, this is a one-way communication". Active verbs give the feeling that you're hearing from a human being who might welcome a response. Blogs use the active voice, and that's what makes them seem so fresh and energetic.

It's tough when you're starting off to steer clear of passivity. You want as much authority as you can fake, since a big hurdle is getting anyone to take a chance on a startup with no history, but the language you use affects your thoughts and actions. Using the passive voice is all about putting distance between you and your customers, and you'll end up losing out. Be active and engage people instead.

Death of a startup

Graveyard
Photo by Auchinoon

My old roommate Dave taught me snowboarding, and one thing he said stuck with me: "If you don't fall down at least once every day, you're not pushing yourself hard enough". (He also comforted me with the claim that "chicks dig scars" after I impaled my leg on a fencepost on my first day out.) One of the things I've found liberating here in the US compared to England is that it's possible to fail without being labeled as a failure. On that topic Bob Sutton has a post on why "Am I a success or a failure?" is the wrong question to ask.

I've never been through the death-throes of a startup, but Visual Sciences, a games startup I worked at for four years, collapsed in a painful bankruptcy throwing a lot of good friends out of work. Andrew Hyde laments the sense of shame that still comes when you're involved in a failed business, and like me wishes there were more post-mortems out there to help us all learn. Nick Napp, founder of the promising Disruptor Monkey, has taken up that challenge with a post explaining what happened to the company. It's tough because it's an emotionally charged topic, and there are always details that have to remain private, but he's done a great job covering what he's learnt. Now I guess it's up to me to pick one of my own professional failures and return the favor.

Free loading animations

[Images: the 35 Ajaxload loading animation styles]


I've been experimenting with Twitter, since all the cool kids seem to be into it. One of the gems I discovered through it is Ajaxload, thanks to Daniel Mclaren. It offers 35 styles of loading animations as animated gifs, all completely free. To use it, just pick one of the styles above, choose your background and foreground colors, and download. It couldn't be simpler.

Easily create gorgeous graphs with the Google Charts API

I've looked at a lot of ways to create graphs dynamically on the web. PHP/SWF charts are fantastic if you want beautiful results, a lot of options, and interactivity, but they require Flash, which both limits the platforms that can use them, and can result in slower loading. For better compatibility, you need something that generates images on the fly.

I'd investigated using jpgraph, but the results looked really ugly and it takes up precious cycles on your own server. Then I discovered a free Google web service that generates images on the fly for you, the Charts API. The pictures above are examples of the high-quality results it produces, with clean fonts, nice 3D and most importantly antialiasing. The API is incredibly simple to use, you just pass in the data and options as parameters to the URL. You don't even need to register or get a key. Here are the URLs for the two images:

http://chart.apis.google.com/chart?cht=p3&chs=480x200&chd=s:Hellob&chl=May|June|July|August|September|October

http://chart.apis.google.com/chart?cht=lc&chd=s:pqokeYONOMEBAKPOQVTXZdecaZcglprqxuux393ztpoonkeggjp&chco=676767&chls=4.0,3.0,0.0&chs=480x200&chxt=x,y&chxl=0:|1|2|3|4|5|1:|0|50|100&chf=c,lg,90,76A4FB,0.5,ffffff,0|bg,s,EFEFEF

While it's easy to get started with this style, it does have some downsides. Since the data is encoded as part of the URL, there's a hard limit on how many points you can have, since some systems choke on URLs over 2000 characters long. The API also doesn't support as many styles or options as PHP/SWF, and no animation is possible.
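
Building one of these URLs from PHP only takes a few lines. Here's a rough sketch, assuming the API's "simple" data encoding (the s: prefix), where each value becomes one character from A-Z, a-z and 0-9 standing for 0 to 61. The helper names simple_encode() and chart_url() are my own inventions for illustration, not part of the API.

```php
<?php
// Characters used by the "simple" encoding: index 0 is 'A', index 61 is '9'.
define("SIMPLE_ENCODING_CHARS",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789");

// Hypothetical helper: scales each value into the range 0..61 and maps
// it to a single character.
function simple_encode($values, $max)
{
    $encoded = "";
    foreach ($values as $value)
    {
        $index = (int) round(61 * $value / $max);
        $index = max(0, min(61, $index));
        $encoded .= substr(SIMPLE_ENCODING_CHARS, $index, 1);
    }
    return $encoded;
}

// Hypothetical helper: assembles the chart URL from a chart type (cht),
// a size (chs), the data values (chd) and the labels (chl).
function chart_url($type, $size, $values, $max, $labels)
{
    return "http://chart.apis.google.com/chart"
        . "?cht=" . $type
        . "&chs=" . $size
        . "&chd=s:" . simple_encode($values, $max)
        . "&chl=" . implode("|", array_map("rawurlencode", $labels));
}

$months = array("May", "June", "July", "August", "September", "October");
// The six values below encode to the data string "Hellob"
echo chart_url("p3", "480x200", array(7, 30, 37, 37, 40, 27), 61, $months);
```

With one character per point the simple encoding keeps URLs short, which helps with the length limit, at the cost of only 62 distinct levels per value.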

Despite those disclaimers, this is an amazing tool, and I'll be having a lot of fun with it. One of my favorite features is the map graph type, which lets you easily specify just colors and states or countries, and it generates an image showing that on a simple map. It would be insanely easy to create some geographic data visualizations using it if you've got interesting data. Here's an example of the results:

http://chart.apis.google.com/chart?chco=f5f5f5,edf0d4,6c9642,365e24,13390a&chd=s:fSGBDQBQBBAGABCBDAKLCDGFCLBBEBBEPASDKJBDD9BHHEAACAC&chf=bg,s,eaf7fe&chtm=usa&chld=NYPATNWVNVNJNHVAHIVTNMNCNDNELASDDCDEFLWAKSWIORKYMEOHIAIDCTWYUTINILAKTXCOMDMAALMOMNCAOKMIGAAZMTMSSCRIAR&chs=440x220&cht=t

Speed up your Gmail IMAP downloading

Launch
Photo by IslandBoy

Now that I'm getting deeper into using the IMAP API to pull email from Google, I'm hitting a lot of performance issues. Most of them are on the parsing and database loading side, but while profiling I did discover a few ways I was using IMAP inefficiently. I've updated my original PHP/Gmail example with some optimizations. The main speed boost came from no longer grabbing all the email headers using imap_headers() just to get the total number of messages in the mailbox. That's very inefficient, especially on large mailboxes. Instead I just call imap_num_msg() to get the count directly, and that's much faster. Another wrinkle was asking for the INBOX mailbox to get all the messages. It's better to look for [Gmail]/All Mail if you want the complete set of non-spam email in case the user has organized their mail into different folders, though you do also get the sent mail as part of that.
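
As a rough illustration of both changes, the sketch below opens [Gmail]/All Mail and asks for the count directly. The helper names are hypothetical, and the host, port and flags in the mailbox string are the usual Gmail IMAP-over-SSL settings, included as an assumption rather than copied from the sample code.

```php
<?php
// Hypothetical helper: builds the mailbox string imap_open() expects.
// The server details are assumed, not taken from the sample code.
function gmail_mailbox_path($folder = "[Gmail]/All Mail")
{
    return "{imap.gmail.com:993/imap/ssl}" . $folder;
}

// Hypothetical helper: returns the message count, or false on failure.
function count_gmail_messages($user, $password)
{
    $mbox = imap_open(gmail_mailbox_path(), $user, $password);
    if ($mbox === false)
    {
        return false;
    }
    // Ask the server for the count directly instead of calling
    // imap_headers() and counting the results.
    $total = imap_num_msg($mbox);
    imap_close($mbox);
    return $total;
}
```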

Here's the source code as a zip, or you can give it a try online. Big thanks to Rob and Josh at EventVue for trying some of this out on their mailboxes too, they've been a fantastic help.

Do you need a privacy policy?

Hiding
Photo by kkelly2007

I'm a fan of plain English all the time, everywhere. I had to rewrite my undergraduate thesis to use the passive voice ("the experiment was performed") after I submitted a first person narrative ("I performed the experiment"). When I needed a privacy policy, I was pleasantly surprised to find that it didn't have to be in legalese. In fact, it's hard to find any hard instructions on exactly what you do need in there. California law requires that you have one if you collect any personal information, but only gives a general idea of what you need to explain.

Most privacy policies are organised as a series of questions, with the answers spelling out what you collect, how you use it and who else sees it. SafeSelling has a comprehensive article on privacy with a section halfway down the page titled "What should my privacy policy include?" that covers the sections you'll need. I love starting with a template, and ended up basing my policy on this example from the Better Business Bureau.

As I mentioned, I was impressed by how readable most policies are, even for big companies like Google. They don't have much legal force yet in the US. I wonder if they'll be drowned in latinate obscurity if they end up in court more often?

Do you want to know a secret?

Whisper
Photo by Dr John 2005

There's a strong bias in the web community towards openness. We all know information wants to be free, and we apply that to our daily lives through wikis, blogs, twittering. We're very comfortable publishing content that's available to the whole world.

Most people don't behave like this. My parents worry about people knowing they're away on vacation in case the house gets burgled, they go out of their way to make sure someone takes the mail in. Some engineers I've worked with hoard information like gold, afraid that if they give too much away they've lost their own value. Most parties require an invitation from the host, and to avoid giving offense you probably won't tell anyone about it if they're not on the list.

Email has a fine-grained, laborious but very powerful mechanism for controlling who sees any information. You specify exactly who it goes to. Others can forward it, but at least that requires an explicit action and leaves a trail.

Things start to get really interesting once you add circles of trust. JP Rangaswami opens up his mailbox to all his direct reports, but not to the rest of BT. Only my friends on Facebook can see that I've just turned 20 in hexadecimal.

Email is the biggest private silo of information by far. I'd be happy to share 75% of my mailbox with everybody I know, but there's no way to do that, yet. But I'd only be willing to open up any of it if there was some way to be certain that the items I needed to be private really were sacrosanct. The key to opening up email will be making sure there's a simple and understandable way to keep secrets. Once we do that, we can take the next leap forward in email by liberating all of the information that's currently gathering dust in everyone's inboxes.

Do you ever stare at a blank document wondering where to start?

Blankpaper
Photo by Mark78_xp

When I'm coding, designing a web page or writing up a document, it's often helpful to start with an existing example. I'll usually finish up completely rewriting it, but having a guide to the expected structure and main points to hit makes the process much faster. For business and legal documents, that's where DocStoc comes in.

They offer a platform for users to share free templates for things like business agreements, wills and expense worksheets. It's focused on professional documents, which separates it from services like Scribd, and has a flash-based interface for browsing through the material. They have a rating system that's designed to help you find the most useful content. One thing I'd love to see is an official seal of approval on some of the legal documents, at the moment I would be nervous using it for something like a will without some reassurance.

They're run out of Los Angeles, and recently announced a $3.25 million Series B funding round with Rustic Canyon Partners. I met Jason Nazar, the entrepreneur behind DocStoc, when he gave a talk at the Entrepreneurs Mentor Society last year. Back then it was still in the early stages, and it's great to see it turn into such a local success story. I do wonder if the same idea could be applied to a company intranet, so that commonly used document templates could be shared in a central location?

Have you seen PicLens?

Piclens

I've been a fan of CoolIris's work since they started over two years ago. I ran across them because their first product was the eponymous browser extension that let you view live, in-context previews of web pages by hovering over the links on certain pages. This was in the same area as my GoogleHotKeys and SearchMash projects, where I was trying to find a better interface to search results than the standard text listing.

Since then they've kept innovating and experimenting with new approaches to interacting with web content, and PicLens has been a break-out success. It's also a browser extension, but gives you a full-screen interface to a lot of popular image-based sites like Flickr and YouTube. The images or movies are shown in an infinite 3D wall, and you use a gesture-like interface to fly up and down it and zoom in. Initially what draws you in is the gorgeous rendering, with subtle but classy effects like the reflections, and smooth animation and transitions. What keeps you using it is the interface, it's like the next generation of channel surfing. After using it for a while, going back to the traditional model of a static page with embedded 2D images or movies feels very awkward and slow.

This week I was fortunate enough to spend a few hours with the CoolIris team, who I'd never met before. They were a lovely bunch of people, and had an inspiring story. The initial team spent two years self-funding all of their work, running up big credit card bills and working without a salary. Recently they closed a Series A together with Kleiner Perkins and now they're running full steam ahead on some very interesting developments that I could tell you about, but then I'd have to kill you.

They're on the lookout for good engineers, particularly people with experience of writing 3D engines for games or similar graphics applications, and anyone who's been thinking about really innovative interfaces. Contact kathy at cooliris dot com if you're interested in talking to them about this.

My medium can beat up your medium

Scarybloke
Scary bloke by AphasiaFilms

I recently indulged in some arm-waving about how email is the Big Daddy of message systems, despite all the glamorous alternatives taking the spotlight. To back this up with some data, I set out to get some rough global usage figures for the top text-based mediums out there: email, SMS, Facebook, IM, blogs and Twitter.

  • Facebook has over 70 million active users. As a closed system, it's hard to work out the message frequency, but around 2 a day seems plausible to me. That would mean around 50 billion sent each year.
  • The comScore global IM user count is 800 million. Guessing again an average of 2 messages a day, that's 600 billion messages a year.
  • SenderBase indicates that there's around 3 billion non-spam emails a day. That's around 1 trillion messages annually.
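
The arithmetic behind these estimates is easy to check. Here's a quick sketch, where annual_messages() is a made-up helper and the per-day rates are just the guesses from the list:

```php
<?php
// Rough annual volume: users times messages per user per day times 365.
function annual_messages($users, $perday)
{
    return $users * $perday * 365;
}

$facebook = annual_messages(70000000, 2);   // 51,100,000,000, around 50 billion
$im       = annual_messages(800000000, 2);  // 584,000,000,000, around 600 billion
$email    = 3000000000 * 365;               // 1,095,000,000,000, around 1 trillion
```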

Email, SMS and IM are the clear winners in raw volume. It does lead me to wonder about the driving forces behind choosing which system to use.

Privacy is obviously important. Tomi Ahonen has a great comment on this story where he talks about kids using SMS to friends in the same room, not for convenience but because a clandestine communication channel is a powerful social bonding tool. There's a widespread assumption that openness is both good and inevitable, but we're just primates at heart, and sharing secrets is one equivalent of picking fleas off each other's backs.

Using the raw numbers like this is obviously unfair. I put a lot more time into an average blog post than an email, my Facebook messages have more content than my IMs, and I seldom use anything but email for business communications. Even so, the statistics make a strong case that despite their growth, other systems will take a long time to pass email.

How to write an Ajax update function with PHP

Fetch
Photo by Bored-Now

I've been writing a lot of Ajax code to request some information from a server, and then update an element on the page with the returned HTML. The basic XMLHttpRequest code to do this is pretty simple, but I've specialized the code to do a couple of common things. First, it always replaces the HTML of the element with the ID given in $replacename, and it takes in a Javascript variable name so you can dynamically alter the URL parameters that are passed in. The second part is really useful when you want a client-side event to trigger the fetch, you can write <select onchange="yourajaxfunction(this.value);"> in a menu, and then define the values in each menu item. Here's the PHP code for the function body:

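function add_ajax_fetch_script($fetchurl, $parametersjsvar, $replacename)
{
    // Everything between the PHP tags below is emitted as Javascript
?>
    var xhr = false;
    if (window.XMLHttpRequest)
    {
        // Native object, supported by all modern browsers
        xhr = new XMLHttpRequest();
    }
    else
    {
        // Fall back to ActiveX for old versions of Internet Explorer
        try
        {
            xhr = new ActiveXObject('Msxml2.XMLHTTP');
        }
        catch (e)
        {
            try
            {
                xhr = new ActiveXObject('Microsoft.XMLHTTP');
            }
            catch (e2)
            {
                xhr = false;
            }
        }
    }

    // Give up quietly if no request object could be created
    if (!xhr)
        return;

    xhr.onreadystatechange = function()
    {
        // readyState 4 means the request has completed
        if (xhr.readyState == 4)
        {
            var target = document.getElementById("<?=$replacename?>");
            if (xhr.status == 200)
                target.innerHTML = xhr.responseText;
            else
                target.innerHTML = "Error code " + xhr.status;
        }
    };

    xhr.open("GET", "<?=$fetchurl?>" + <?=$parametersjsvar?>, true);
    xhr.send(null);

<?php
}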

To use this code, you'd write out the signature and name of your Javascript function, call add_ajax_fetch_script() and then terminate the JS function with a closing curly brace. Eg:

function yourajaxfunction(urlsuffix)
{
<?php
add_ajax_fetch_script("http://someurl.com", "urlsuffix", "someelementid");
?>
}