On User Agent Strings and Bots

As part of our effort to drive technology forward and provide services that didn’t previously exist, Genius recently released a new URL shortening service, currently available to select customers. A URL shortener may not sound unique on the face of it, but ours integrates with the rest of Genius’ products, allowing you to track prospects all the way from the top of the funnel in a multi-channel marketing campaign to a signed deal. What does that marketing-speak mean? We provide detailed reporting on link clicks, much like Bit.ly, but in a fashion that lets sales & marketing make the most of their time & money.

When we started testing this service in the real world, we found it to be very popular: we got tons of link clicks within seconds! Now, while some of us have a great number of followers, it was clear that something funny was going on. Upon closer inspection, we could tell that the majority of clicks in the first few minutes after a link is posted to Facebook or Twitter come from robots indexing content for various search engines. No problem, filtering them out ought to be easy; just look in the user agent string for something like “robot” or “crawler”. Reality isn’t so simple.
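
For reference, that naive filter is nothing more than a keyword match over the user agent string; a minimal sketch in Python (the keyword list here is illustrative, not exhaustive):

    import re

    # Naive check: flag any user agent that contains an obvious bot keyword.
    BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)

    def looks_like_bot(user_agent):
        return bool(BOT_KEYWORDS.search(user_agent or ""))

    print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
    print(looks_like_bot("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3"))  # False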

We began by creating some short URLs and posting them on social networking sites from accounts that have no friends. Sure enough, each link drew a flurry of clicks within the first few minutes of being posted, diminishing until the last click about 15 minutes later. A number of bots identified themselves in a way that is easy to spot, putting “bot” or “spider” right in their user agent string:

Baiduspider+(+http://www.baidu.com/search/spider.htm)
bitlybot
MLBot (www.metadatalabs.com/mlbot)
Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Others didn’t use such explicit terms, but it was easy enough to add them to a list:

AppEngine-Google; (+http://code.google.com/appengine; appid: mapthislink)
Mozilla/5.0 (compatible; Butterfly/1.0; +http://labs.topsy.com/butterfly.html) Gecko/2009032608 Firefox/3.0.8
PostRank/2.0 (postrank.com)

We also found a bunch that were obviously programming libraries:

Jakarta Commons-HttpClient/3.1
Java/1.6.0_16
libwww-perl/5.816
PycURL/7.19.3
Python-urllib/2.6

Those are all fine & dandy. While it would be nicer to simply match any user agent string containing “bot”, “crawler”, or “spider”, creating or purchasing a list of known bots isn’t terribly difficult either. The frustrating thing was that we consistently got hits on the aforementioned friendless accounts from user agents like this:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

That looks a lot like Microsoft Internet Explorer 7 running on Windows Vista (NT 6.0) to me, and it is. So why is a bot pretending to be IE? And who is doing it? I answered the latter question by whoising the source IP, which turned up Microsoft. This annoyed me, and I let Twitter know what I thought of Bing, in my best passive-aggressive form:

    Bing is not alone. I would much rather match 'bot' in a string than have to grep a list of useragents #yahoo #facebook #topsy

Shortly thereafter, I got a call from an old friend and fellow CSHer, Andrew Bair, inquiring about the source of my discontent with Bing. I told him about the trouble of matching user agent strings in an attempt to divine the source of clicks, that Microsoft seemed to be running a bot that misidentified itself, and that I presumed it was related to Bing. Andrew works at Bing and said that he would talk to the Social folks to see if any of them could shed some light on the situation.

Not long after that, I was contacted by Steve Ickman, a researcher at Microsoft. After providing him with a bit of information, Steve told me that the bot I saw was indeed his, and that it is only somewhat related to Bing. The reason the robot was using an Internet Explorer user agent string, he said, is that a lot of (badly written) websites will refuse to serve content to a user agent they don’t recognize. Having programmatically scoured the web myself, I can commiserate. To make things easier on people like me, however, Steve said that he would update the user agent string his bot presents to make it clear what the crawler is doing.
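
For anyone writing a crawler, announcing yourself is just a matter of setting the User-Agent header; a minimal sketch in Python, where the bot name and info URL are made-up placeholders:

    import urllib.request

    # A descriptive user agent: name the bot and point to a page explaining what
    # it does and how to reach you. ExampleBot and its URL are placeholders.
    USER_AGENT = "Mozilla/5.0 (compatible; ExampleBot/1.0; +http://example.com/bot.html)"

    request = urllib.request.Request("http://example.com/", headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        body = response.read()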

So, is all well in the world of identifying bots? Unfortunately, no. There is another player from Washington causing trouble: Amazon. Well, maybe indicting the world’s biggest online retailer is too much. It’s really users of their Elastic Compute Cloud that are problematic:

Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.5) Gecko/2008120121 Firefox/3.0.5
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3
Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.14) Gecko/2009090217 Ubuntu/9.04 (jaunty) Firefox/3.0.14

What do all of those user agent strings have in common? Nothing of note, except that they all came from blocks of addresses used by Amazon’s EC2. It looks to me like a lot of folks writing bots need a user agent string, so they point their browser at a site that echoes it back and copy that. Or they make something up. Or they use the empty string. All of these things make detecting such programmatic visits to your website difficult, leaving you to maintain a list of bots. That is, unless we can encourage all programmers to readily identify their bots as such. But that’s like herding cats.
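
Short of herding those cats, checking the source address against the blocks that EC2 and similar hosting services use is one more signal worth folding in. A rough sketch in Python, with a couple of illustrative CIDR blocks standing in for Amazon’s published ranges:

    from ipaddress import ip_address, ip_network

    # Illustrative CIDR blocks standing in for EC2's address space; a real
    # implementation would load the current ranges that Amazon publishes.
    EC2_RANGES = [ip_network("174.129.0.0/16"), ip_network("75.101.128.0/17")]

    def from_ec2(ip):
        addr = ip_address(ip)
        return any(addr in network for network in EC2_RANGES)

    print(from_ec2("174.129.42.1"))  # True: inside one of the listed blocks
    print(from_ec2("8.8.8.8"))       # False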

  • Heewa

    What about a fuzzy approach? You can rate each hit on a fuzzy scale of humanness, with a clear bot identity scoring 0 and a clear human scoring 1. Feed all the data you have about a particular hit (IP, referrer, user agent, etc.) into a scoring function, which can do stuff like check the source IP against known Amazon EC2 ranges, and so on.

    • http://dinomite.net Drew Stephens

      Absolutely. A simple first pass for us is to use a list of bot user agent strings, but beyond that there are a number of more subtle indicators of bottiness that could be used as part of a heuristic to determine the likelihood that a visitor is human.
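
      A toy version of that scoring function might look something like this; the signals and weights are invented for illustration:

          import re
          from ipaddress import ip_address, ip_network

          BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)
          EC2_RANGES = [ip_network("174.129.0.0/16")]  # illustrative block only

          def humanness(user_agent, ip, referrer):
              """Combine weak signals into a rough 0..1 score: 0 is a bot, 1 a human."""
              score = 1.0
              if BOT_KEYWORDS.search(user_agent or ""):
                  score -= 0.9  # explicit bot keyword in the user agent
              if any(ip_address(ip) in network for network in EC2_RANGES):
                  score -= 0.5  # request came from a hosting provider's range
              if not referrer:
                  score -= 0.1  # a missing referrer is weakly suspicious
              return max(score, 0.0)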

  • Travis L

    Can you give some sort of estimate on how big a problem this is? Does it happen infrequently enough that your rough approach of searching the UA gets most of the bots? Could be a diminishing returns kind of thing.

    • http://dinomite.net Drew Stephens

      It’s quite a widespread problem; on Twitter and Facebook alone you can easily get 30 unique bots within 10 minutes of posting a link. Depending on the application, this definitely can fall to diminishing returns. Unless you really need to keep all bots away from a page, filtering based upon a list of known bot user agent strings and services that host bots (EC2, etc.) is probably sufficient. A good way to create that list of bots is with friendless accounts on social services, though that certainly doesn’t get them all.

  • Clint

    If you try a fuzzy approach, I’d also suggest you include web client fingerprinting through HTTP header analysis for scoring. Check out the browserrecon project for some interesting work in this area.

    • http://dinomite.net Drew Stephens

      HTTP header order is indeed something that had come up in our discussions, but I hadn’t heard of the browserrecon project, thanks for the tip!

  • http://eng.genius.com Jim Bob