It has recently come to my attention that there is something drastically wrong with the way search engines have been indexing my ASP.NET 2.0 blog.
As I've started to explain previously, this is because of the way the browser detection is set up. To give a brief rundown: ASP.NET 2.0 has a default browser definition which seems to assume that the default browser is fairly capable and supports common things such as JavaScript and cookies. A browser definition can be inherited by other definitions, which can then override specific properties for a particular browser or browser version.
Apparently, around March 2006 Google started rolling out updates that changed the Googlebot's user-agent string from:
"Googlebot/2.1 (+http://www.googlebot.com/bot.html)" to
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
The reason for this change is so that the Googlebot can identify itself as Mozilla/5.0 compliant, which should allow it to be accepted by more web servers. However, this breaks the detection pattern in ASP.NET. (It was always broken for Yahoo! Slurp; I just don't know if anyone ever noticed.)
When the user-agent was just "Googlebot/2.1" it could not be matched, so it fell back to the "default.browser" detection file, which defaults to a browser of reasonable capabilities. After the change it ends up in the "mozilla.browser" file because it matches on the word "Mozilla". The instructions that follow in "mozilla.browser" then try to establish exactly which platform and variant of Mozilla it is, for example whether it's Firefox running on OS X or the older Mozilla Gecko rendering engine. But because there is no definition for a generic Mozilla/5.0 compatible browser, it gets the most relevant match, which is the lowest set of Mozilla/1.0 compatible settings. Bad!
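To make the fall-through easier to picture, here is a rough toy model of the matching described above. It is only a sketch: the tree, the patterns, and the capability values are simplified assumptions for illustration, not the contents of the real "default.browser" or "mozilla.browser" files, and `resolve` is a hypothetical helper.

```python
import re

# Toy browser-definition tree. Patterns and capability values are
# illustrative assumptions, not copies of the real .browser files.
DEFINITIONS = {
    "Default": {"parent": None, "match": r".*",
                "caps": {"cookies": True, "javascript": True}},
    "Mozilla": {"parent": "Default", "match": r"Mozilla",
                "caps": {"cookies": False, "javascript": False}},
    "Firefox": {"parent": "Mozilla", "match": r"Firefox/\d+",
                "caps": {"cookies": True, "javascript": True}},
}


def resolve(user_agent):
    """Start at Default and keep descending into the first matching child.

    Returns the chain of definitions that matched and the merged
    capabilities, with each child overriding its parent's values.
    """
    node = "Default"
    path = [node]
    caps = dict(DEFINITIONS[node]["caps"])
    while True:
        child = next(
            (name for name, d in DEFINITIONS.items()
             if d["parent"] == node and re.search(d["match"], user_agent)),
            None,
        )
        if child is None:
            return path, caps
        node = child
        path.append(node)
        caps.update(DEFINITIONS[node]["caps"])


OLD_GOOGLEBOT = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
NEW_GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

for ua in (OLD_GOOGLEBOT, NEW_GOOGLEBOT):
    print(resolve(ua))
```

In this model the old string never leaves "Default" and keeps cookie support, while the new string descends into "Mozilla", finds no more specific child, and is left with the bare Mozilla settings.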
Because of this bad detection, the Mozilla/1.0 defaults assume NO COOKIES, so ASP.NET inserts the session ID into the URL and then issues a 302 response (content temporarily moved). What makes this situation even worse is that the default behavior of search engines is to follow these redirects and index the content on the other side. So basically every time some random user-agent that claims to be Mozilla/5.0 compliant hits the site, it gets Mozilla/1.0 capabilities. What is needed is something to bridge this gap.
Fortunately there is something that can be done that won't even require a recompile of your ASP.NET 2.0 application. Simply create a "genericmozilla5.browser" file in the "/App_Browsers" folder in the root of your application with the following contents:
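If you want to check whether URLs that have already been indexed carry an embedded session ID, something like the following might help. It is a minimal sketch that assumes the session ID shows up as a path segment of the form `/(S(...))/`; the example URL, the session ID in it, and both helper names are made up for illustration.

```python
import re

# Assumed shape of a cookieless session segment: /(S(<session id>))
SESSION_SEGMENT = re.compile(r"/\(S\([A-Za-z0-9]+\)\)")


def has_session_segment(url):
    """Return True if the URL appears to carry an embedded session ID."""
    return SESSION_SEGMENT.search(url) is not None


def strip_session_segment(url):
    """Remove the first embedded session segment, leaving the clean URL."""
    return SESSION_SEGMENT.sub("", url, count=1)


# Hypothetical URL of the kind a crawler would end up indexing.
indexed = "http://www.example.com/(S(lit3py55t21z5v55vlm25s55))/default.aspx"
print(has_session_segment(indexed))
print(strip_session_segment(indexed))
```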
<browsers>
<browser id="GenericMozilla5" parentID="Mozilla">
<identification>
<userAgent match="Mozilla/5\.(?'minor'\d+).*[Cc]ompatible; ?(?'browser'.+); ?\+?(http://.+)\)" />
</identification>
<capabilities>
<capability name="majorversion" value="5" />
<capability name="minorversion" value="${minor}" />
<capability name="browser" value="${browser}" />
<capability name="Version" value="5.${minor}" />
<capability name="activexcontrols" value="true" />
<capability name="backgroundsounds" value="true" />
<capability name="cookies" value="true" />
<capability name="css1" value="true" />
<capability name="css2" value="true" />
<capability name="ecmascriptversion" value="1.2" />
<capability name="frames" value="true" />
<capability name="javaapplets" value="true" />
<capability name="javascript" value="true" />
<capability name="jscriptversion" value="5.0" />
<capability name="supportsCallback" value="true" />
<capability name="supportsFileUpload" value="true" />
<capability name="supportsMultilineTextBoxDisplay" value="true" />
<capability name="supportsMaintainScrollPositionOnPostback" value="true" />
<capability name="supportsVCard" value="true" />
<capability name="supportsXmlHttp" value="true" />
<capability name="tables" value="true" />
<capability name="vbscript" value="true" />
<capability name="w3cdomversion" value="1.0" />
<capability name="xml" value="true" />
<capability name="tagwriter" value="System.Web.UI.HtmlTextWriter" />
</capabilities>
</browser>
</browsers>
This will match generic Mozilla compatible browsers and spiders with user-agent strings such as:
- Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
- Mozilla/5.0 (compatible; AbiLogicBot/1.0; +http://www.abilogic.com/bot.html)
- Mozilla/5.0 (compatible; AnyApexBot/1.0; +http://www.anyapex.com/bot.html)
- Mozilla/5.0 (compatible; BecomeBot/3.0; MSIE 6.0 compatible; +http://www.become.com/site_owners.html)
- Mozilla/5.0 (compatible; MojeekBot/2.0; http://www.mojeek.com/bot.html)
- Mozilla/5.0 (compatible; Scrubby/2.2; +http://www.scrubtheweb.com/)
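A quick way to sanity-check the pattern against the strings above is to translate it to Python's regex syntax and run it. This is a sketch under a few assumptions: the .NET named groups `(?'name'...)` are rewritten as `(?P<name>...)`, the character class is written `[Cc]` (which matches the same two letters), the pattern is applied as an unanchored search, and `identify` is a hypothetical helper, not part of ASP.NET.

```python
import re

# The userAgent pattern from the .browser file, translated to Python syntax.
PATTERN = re.compile(
    r"Mozilla/5\.(?P<minor>\d+).*[Cc]ompatible; ?(?P<browser>.+); ?\+?(http://.+)\)"
)

USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (compatible; AbiLogicBot/1.0; +http://www.abilogic.com/bot.html)",
    "Mozilla/5.0 (compatible; AnyApexBot/1.0; +http://www.anyapex.com/bot.html)",
    "Mozilla/5.0 (compatible; BecomeBot/3.0; MSIE 6.0 compatible; +http://www.become.com/site_owners.html)",
    "Mozilla/5.0 (compatible; MojeekBot/2.0; http://www.mojeek.com/bot.html)",
    "Mozilla/5.0 (compatible; Scrubby/2.2; +http://www.scrubtheweb.com/)",
]


def identify(user_agent):
    """Return the captured browser name and version, or None if no match."""
    m = PATTERN.search(user_agent)
    if m is None:
        return None
    return {"browser": m.group("browser"), "version": "5." + m.group("minor")}


for ua in USER_AGENTS:
    print(identify(ua))

# The original Googlebot string has no "Mozilla/5." prefix, so it is not matched.
print(identify("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))
```

One thing this shows: because `.+` is greedy, the BecomeBot string is matched, but the captured browser name also includes the "MSIE 6.0 compatible" part.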
Other Notes
The MSNBOT never had this problem either because, like the original Googlebot string, it was never detected and thus received the "default.browser" file settings, which support cookies.
My solution is not a complete fix, and I think Microsoft could have done one thing better here. Because the browser string goes into the "mozilla.browser" file, they need another level where, once it knows it's Mozilla/5.0 compliant, it gets the appropriate defaults before it starts to figure out exactly what browser it is. Even though the exact browsing user-agent wouldn't be established with this approach, it would at least support future browsers claiming to be compliant at a higher level than just "Mozilla".
Downloads