Cleaning Up ASP.NET Sessions in Google

ASP.NET and Dirty Urls

Two things have been bothering me about the pages Google indexes from an ASP.NET application. The first is that ASP.NET session Urls are somehow ending up in the Google index. This is bad because searchers who click these links are likely to get a 500 error (internal server error), since they will be trying to access a page from an expired session.

Indexed Session Urls in Google Sitemap tools

How is Google finding all these 'bad' urls?

Well, apparently there is no browser definition in ASP.NET 2.0 for Googlebot's user-agent string, so when the spider hits your ASP.NET page its browser capabilities are not defined.

Edit: The default browser capabilities are defined to use cookies; the issue occurs because the base Mozilla definition is defined to NOT use cookies. If the browser is not able to accept cookies, .NET gets around this by inserting the session information into the Url and issuing a 302 (temporary redirect) in the response.
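To make the rewritten Urls easier to spot (in server logs, say), here is a minimal Python sketch that detects and strips the cookieless token. It assumes the ASP.NET 2.0 format as I understand it, where the token is a single path segment such as /(S(...))/ or /(A(...))/, possibly with several letter-prefixed parts combined. The function name and the token values are made up for illustration.

```python
import re

# Assumed token shape: one path segment like "/(A(abc123))" or
# "/(A(abc)S(def))", i.e. letter-prefixed parts wrapped in parentheses.
COOKIELESS_TOKEN = re.compile(r"/\((?:[A-Z]\([^()]*\))+\)")


def strip_cookieless_token(path):
    """Return the path with the first cookieless token segment removed."""
    return COOKIELESS_TOKEN.sub("", path, count=1)


def has_cookieless_token(path):
    """True if the path carries a cookieless token segment."""
    return COOKIELESS_TOKEN.search(path) is not None


if __name__ == "__main__":
    for sample in ["/(A(abc123))/Default.aspx", "/Default.aspx"]:
        print(sample, "->", strip_cookieless_token(sample))
```

Stripping the token gives you the clean Url that a cookie-capable browser would have been served.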

This default behaviour is both a good and a bad thing. It's good in that I can still browse an ASP.NET site on a PDA that doesn't support cookies. However, just about every search engine spider ever created has its own user-agent string, which makes it a tough task to serve every spider the standard, non-crufted url. One fix for session urls being indexed in Google is to tell your ASP.NET application that Googlebot supports cookies.

To read more about the solution please see my next post.

Dynamic Captcha build-up

Captcha images are another dynamic element on this site, and yes, Google's image spider finds those too. When I tried a Google image search on my domain, the results were littered with Captcha images! I've added an exclude to the robots.txt file for this as well.

Captcha image build-up in Google image search

Solution for Captcha images build-up and stop-gap solution for Session Urls

Here is my "robots.txt" file so far for my SingleUserBlog install. Note the last two lines: "Disallow: /(A(*" should exclude any ASP.NET session urls (this is not recommended unless you have fixed the Mozilla detection hole), and the last line should exclude any Captcha images from being indexed.

User-agent: *
Disallow: /LoginPage.aspx
Disallow: /Administration/
Disallow: /(A(*
Disallow: /Captcha.ashx*$
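A quick way to sanity-check rules like these is to translate them into regular expressions and try a few paths. The Python sketch below assumes Google-style matching, where "*" matches any run of characters, a trailing "$" anchors the end of the url, and every rule is otherwise a prefix match. Other spiders may not honour the wildcards. The helper names and sample paths are hypothetical.

```python
import re


def rule_to_regex(pattern):
    """Translate one Disallow pattern into a compiled regex.

    Assumption: '*' is a wildcard, a trailing '$' anchors the end,
    and everything else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


# The Disallow lines from the robots.txt above.
DISALLOW = ["/LoginPage.aspx", "/Administration/", "/(A(*", "/Captcha.ashx*$"]
RULES = [rule_to_regex(p) for p in DISALLOW]


def is_blocked(path):
    """True if any Disallow rule matches the given url path."""
    return any(rule.match(path) for rule in RULES)


if __name__ == "__main__":
    for sample in ["/(A(abc123))/Default.aspx", "/Captcha.ashx?id=5", "/Default.aspx"]:
        print(sample, "blocked" if is_blocked(sample) else "allowed")
```

Python's built-in urllib.robotparser does not expand these wildcards, which is why the sketch does its own matching.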

Posted on Wednesday, December 06, 2006 11:49 PM

Comments on this post

# RE: Cleaning Up ASP.NET Sessions in Google

Blimey. Great post (and the followup). Any idea why SingleUserBlog would need session state? I know asp.net has it on by default, but it's not actually used anywhere, is it?

Cheers
Matt
Left by Matt Ellis on Dec 11, 2006 11:04 PM

# RE: Cleaning Up ASP.NET Sessions in Google

I use a session variable to track user stats with the "Online Presence" webpart. But it looks like the comment form uses session stuff too:

Session["Comment_RandomText"]
Left by Brendan on Dec 12, 2006 1:21 AM

# RE: Cleaning Up ASP.NET Sessions in Google

Ah. Looks like I'd better turn it back on then! I'll add the browser definition from your other post - dead useful, thanks.
Left by Matt Ellis on Dec 12, 2006 9:37 AM

# re: Cleaning Up ASP.NET Sessions in Google

Just a friendly helpful hint, no-takey-offense, but "it's" is a contraction for "it is".

When you want to speak of a possessive form such as "there's a dog. It's wagging its tail." you drop the apostrophe.

Hope this helps! Spread the word!
Left by Donna on Oct 19, 2008 4:39 AM

# re: Cleaning Up ASP.NET Sessions in Google

Awesome, I'll be sure to tag this post with 'English+lessons' :)
Left by Brendan on Oct 19, 2008 8:14 AM