This is a quick run-down of the technical side of my guest post chronicling the March of Twitter on Hubspot's blog. Do go check out that article; I had a lot of fun digging into Dharmesh's data.
Putting together that analysis of the early days of Twitter involved a lot of detective work and filling in gaps, since I don't have access to Twitter's internal traffic data, so I want to cover exactly what I did to produce it.
The map of the spread of Twitter over several years was based on a dump of 4.5 million accounts from the Twitter Grader project. Dharmesh had already done some normalization on the location fields, so I first filtered out everybody with a non-US address. That left me with 1.5 million profiles to work with. I believe that Grader's collection methods make those a fairly random sampling of the full universe of users, so I could use the frequency of users in different locations over time to build a visualization that accurately showed the relative geographic presence, even if I couldn't give accurate absolute numbers. This incomplete sampling does mean that I may be missing the actual earliest user for some locations, though.
I accomplished this using ad-hoc Python code to process the large CSV files. I've published the scripts as rough snippets at http://github.com/petewarden/openheatmap/blob/master/mapfileprocess/scratchpad.py
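As a rough illustration of what that filtering pass looks like, here's a minimal sketch; the file names, column names, and the 'City, ST' normalized-location convention are assumptions for the example, since the real logic lives in the scratchpad file above.

```python
import csv

# Two-letter codes for the fifty states plus DC, used to spot US addresses.
US_STATES = {
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC',
}

def is_us_location(location):
    """Treat anything ending in a ', ST' state suffix as a US address."""
    parts = [part.strip().upper() for part in location.split(',')]
    return len(parts) > 1 and parts[-1] in US_STATES

# Copy only the US rows into a new file for the rest of the pipeline.
with open('grader_accounts.csv') as input_file, \
     open('us_accounts.csv', 'w', newline='') as output_file:
    reader = csv.DictReader(input_file)
    writer = csv.DictWriter(output_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if is_us_location(row.get('location', '')):
            writer.writerow(row)
```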
The second analysis looked at the adoption levels over the first few months. This was a lot trickier, since that sort of absolute figure wasn't obviously available. Happily I discovered that Twitter gave out id numbers sequentially in the early days, so that @biz is id number 13, @noah is 14, etc. I needed to ensure this was actually true for the whole time period I was studying, since I was planning on searching through all the possible first few thousand ids, and if some users had arbitrarily large numbers instead I would miss them. To verify this relationship held, I looked at a selection of the earliest users in the Grader data set and confirmed that all of them had low id numbers, and that the id numbers were assigned in the order they joined. This gave me confidence I could rely on the approach, at least until December 2006. There were frequent gaps where ids were either non-assigned or pointed to closed accounts, but this didn't invalidate my sampling strategy. Another potential issue, one that also affects the Twitter Grader data set, is that I'm sampling users' current locations, not the locations they had when they joined, but my hope is that most people won't have changed cities in the last four years, so the overall patterns won't be too distorted. There's also a decent number of people with no location set, but I'm hoping that doesn't impose a systematic bias either.
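To make that verification concrete, the check amounts to sorting a sample of the earliest accounts by id and confirming the join dates never run backwards. Here's a sketch of that test; the file name, column names, and date format are assumptions about the data set.

```python
import csv
from datetime import datetime

# Sort the sampled early accounts by id, then confirm that join dates are
# non-decreasing, which is what sequential id assignment implies.
with open('earliest_accounts.csv') as input_file:
    accounts = sorted(csv.DictReader(input_file),
                      key=lambda row: int(row['id']))

previous_joined = None
for account in accounts:
    joined = datetime.strptime(account['created_at'], '%Y-%m-%d')
    if previous_joined is not None and joined < previous_joined:
        print('Out-of-order join date at id %s' % account['id'])
    previous_joined = joined
```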
For the first few thousand users I went through every possible id number and pulled the user information for that account into a local file, which I then parsed into a CSV file for further processing. Once the number of new users grew larger in August, I switched to sampling only every tenth id and counting each found account as ten users joining. One hiccup was a change in late November, where Twitter appeared to switch to incrementing ids by ten instead of one, so that only ids ending in the digit 3 were valid; I compensated for this with a new script. Shortly after that, in December, I detected another change in the assignment algorithm that was causing a slew of 'no such account' messages during my lookups, so I decided to stop my gathering at that point.
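In outline, the gathering loop looked something like the sketch below. The endpoint shown is the old unauthenticated v1 user lookup, long since retired, and the cut-over ids between the three sampling phases are illustrative rather than the real values, so treat this as a picture of the approach rather than code you can run against Twitter today.

```python
import csv
import json
import time
import urllib.error
import urllib.request

def fetch_user(user_id):
    """Look up one account by numeric id, or return None for the gaps
    where an id was never assigned or the account has been closed."""
    url = 'https://api.twitter.com/1/users/show.json?user_id=%d' % user_id
    try:
        with urllib.request.urlopen(url) as response:
            return json.load(response)
    except urllib.error.HTTPError:
        return None

def candidate_ids():
    """Yield (id, weight) pairs for the three sampling phases: every id at
    first, every tenth id once sign-ups picked up, then only ids ending
    in 3 after the assignment change. Cut-over points are illustrative."""
    for user_id in range(1, 10000):
        yield user_id, 1
    for user_id in range(10000, 60000, 10):
        yield user_id, 10
    # Ids now increment by ten, so stepping by 100 over ids ending in 3
    # still samples one in every ten sign-ups.
    for user_id in range(60003, 150000, 100):
        yield user_id, 10

with open('early_users.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(['id', 'weight', 'screen_name', 'created_at', 'location'])
    for user_id, weight in candidate_ids():
        user = fetch_user(user_id)
        if user is not None:
            writer.writerow([user_id, weight, user['screen_name'],
                             user['created_at'], user.get('location', '')])
        time.sleep(1)  # Be polite and stay well under any rate limit
```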
The code for all this processing is also included in http://github.com/petewarden/openheatmap/blob/master/mapfileprocess/scratchpad.py, though it's all ad-hoc. The data for the first few thousand users is available as a Google Spreadsheet:
https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dHZjOUl3eXRzX19PLUFVQUNTU3FndFE&hl=en
You can also download the derived daily and monthly figures here:
https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dG5FU0hJZHI3RkVVMUgtaDhyczZxM1E&hl=en
https://spreadsheets.google.com/ccc?key=0ArtFa8eBGIw-dFlpS0QxSUw5blEtVjdyd2FaT2FySmc&hl=en
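For reference, the daily and monthly figures are just roll-ups of the per-account join dates, weighted by how many users each sampled account stands for. A minimal version of that aggregation, with assumed column names and an assumed ISO date format, looks something like this:

```python
import csv
from collections import Counter

# Sum the per-account weights by join day; each row's 'weight' is how many
# sign-ups it represents (1 early on, 10 once sampling dropped to every
# tenth id). Truncating to day[:7] instead gives the monthly figures.
daily_counts = Counter()
with open('early_users.csv') as input_file:
    for row in csv.DictReader(input_file):
        day = row['created_at'][:10]  # Assumes a 'YYYY-MM-DD...' format
        daily_counts[day] += int(row['weight'])

for day in sorted(daily_counts):
    print('%s,%d' % (day, daily_counts[day]))
```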
I attacked this problem because I really wanted to learn from Twitter's experiences, and it didn't seem likely that the company itself would collect and release this sort of information. Of course, I'd be overjoyed to see corrections to this provisional history of the service based on internal data, if any friendly Twitter folks care to contribute. Any other corrections or improvements to my methodology are also welcome.