ASP.NET and Dirty Urls
Two things have been bothering me about the pages Google indexes from an ASP.NET application. The first is that ASP.NET Session Urls are somehow ending up in the Google index. This is bad because searchers who click these links are likely to get a 500 error (internal server error), since they will be trying to access a page that belongs to an expired session.
How is Google finding all these 'bad' urls?
Well, apparently there is no browser definition in ASP.NET 2.0 for the Googlebot's useragent string, so when the spider hits your ASP.NET page its browser capabilities are not defined.
Edit: The default browser capabilities are defined to use cookies; the issue occurs because the base Mozilla definition is defined to NOT use cookies. If the browser is not able to accept cookies, .NET gets around this by inserting the session information into the Url and issuing a 302 (Found, a temporary redirect) in the response header.
This default behaviour is both a good and a bad thing. It's good because if I'm browsing an asp.net site on a pda that doesn't support cookies, I still can. However, just about every search engine spider ever created has its own UserAgent string, which makes it a tough task to serve each of them the standard non-crufted url. One way to stop the session urls from being indexed in Google is to tell your asp.net application that Googlebot supports cookies.
To read more about the solution please see my next post.
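To make the 'bad' urls above a bit more concrete, here is a rough Python sketch that spots the session segment in a path and strips it back to the clean url. It assumes the cookieless token is a parenthesised group right after the application root, made of parts like A(...), S(...) or F(...). The function name and the sample paths are made up for illustration, and this is not something ASP.NET itself provides.

```python
import re

# Assumed token shape: "/(" + one or more X(value) parts + ")" followed by "/",
# where X is A (anonymous id), S (session) or F (forms authentication).
COOKIELESS_TOKEN = re.compile(r"/\((?:[ASF]\([^()]*\))+\)(?=/)")


def strip_cookieless_token(path: str) -> str:
    """Return the path with the first cookieless token removed, if any."""
    return COOKIELESS_TOKEN.sub("", path, count=1)


if __name__ == "__main__":
    # Hypothetical paths, only for demonstration.
    for sample in ("/(S(abc123))/Default.aspx",
                   "/(A(x1)S(y2))/blog/Post.aspx",
                   "/Default.aspx"):
        print(sample, "->", strip_cookieless_token(sample))
```

Something like this is handy for checking a list of indexed urls for crufted entries. It does not fix the underlying browser detection problem.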
Dynamic Captcha build-up
Another dynamic element used on this site is the Captcha image, and yes, Google's image spider finds those too. When I tried a Google image search on my domain, the results were littered with Captcha images! I've also added an exclude to the robots.txt file for this.
Solution for Captcha images build-up and stop-gap solution for Session Urls
Here is my "robots.txt" file so far for my SingleUserBlog install. Note the last two lines: "Disallow: /(A(*" should exclude any ASP.NET session urls (this is not recommended unless you have fixed the Mozilla detection hole), and the last line should exclude any captcha images from being indexed.
User-agent: *
Disallow: /LoginPage.aspx
Disallow: /Administration/
Disallow: /(A(*
Disallow: /Captcha.ashx*$
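To sanity-check rules like the ones above before relying on them, here is a small Python sketch of wildcard matching in the style Googlebot documents: "*" matches any run of characters, a trailing "$" anchors the end of the url, and rules match from the start of the path. The helper names and test paths are my own, and other crawlers may not honour wildcards at all, so treat it as an approximation.

```python
import re

# The Disallow rules from the robots.txt above.
DISALLOW = [
    "/LoginPage.aspx",
    "/Administration/",
    "/(A(*",
    "/Captcha.ashx*$",
]


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate one robots.txt path rule into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(path: str) -> bool:
    """True if any Disallow rule matches the path from its first character."""
    return any(rule_to_regex(rule).match(path) for rule in DISALLOW)


if __name__ == "__main__":
    # Hypothetical paths, only for demonstration.
    for path in ("/(A(abc123))/Default.aspx", "/Captcha.ashx?id=42",
                 "/Administration/Users.aspx", "/Default.aspx"):
        print(path, "blocked" if is_blocked(path) else "allowed")
```

Matching is case-sensitive here, so a rule for /LoginPage.aspx will not cover /loginpage.aspx.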