iis

Crawl Errors in Google Webmaster Tools are a way for Google to notify you (the webmaster) that Googlebot tried to crawl a URL on your site but could not access it. The report also gives you a description of the error, the date the error was detected, and a list of links that point to that URL.

There are quite a few reasons an error could occur. What to do with a specific error is left entirely up to you. In this article I’ll explain how to fix “Not Found” crawl errors using rewrite rules in the web.config file.

The first thing you should do is check the “Linked from” list for outside links. If someone out there is linking to your site with a wrong URL, that link won’t do you much good. There are two things you could do:

  • Contact the site that’s linking to you and ask them to correct the link.
  • You can make a 301 redirect to the correct URL. This will be explained later on.

All ‘Not Found’ URLs should return a 404 HTTP status code. You can check this with the Fetch as Googlebot tool. You have two options when dealing with these:

  • If the URL is pointing to a page that was on your site but no longer exists, you should return a 410 HTTP status code. This will tell your users, Google, and other search engines that the page is ‘Gone’.
  • If the URL is pointing to a page that was moved to another URL you should do a 301 redirect that points to the correct URL.

Rewrite Rules and web.config

Rewrite rules are a set of rules that tell your web server what to do with certain URLs. These rules can be quite complex, but as you’ll only be dealing with the URLs from the Crawl Errors report, the ones here will be pretty simple.

The first thing you need to do is check whether you already have a rewrite section in your web.config file. It should be inside the system.webServer section, which in turn sits inside the configuration section. If you’re missing any part of this, just add it, along with a rules section inside the rewrite one. This is where all your rules will go. Your web.config should look something like this:

<configuration>
    <!-- ... -->
    <system.webServer>
        <!-- ... -->
        <rewrite>
            <rules>
                 <!-- Here we'll insert the rewrite rules -->
            </rules>
        </rewrite>
        <!-- ... -->
    </system.webServer>
    <!-- ... -->
</configuration>

410 - Gone

URLs to pages that were on your site but no longer exist should return a 410 HTTP status code. There are two types of rules you’ll need here, depending on whether the query string is an important part of the URL or not.

To determine which type a URL is, let’s look at two examples:

  1. www.fakestore.com/jeans/?item=24
  2. www.fakestore.com/long-jeans/

Both of these URLs currently return a 404 HTTP status code. Our fake store doesn’t sell the jeans with id 24 any more, and it no longer carries long jeans at all. The first example shows that the query string is very important: you wouldn’t want to remove the title page of the jeans department, just those particular jeans. In example 2, on the contrary, that’s exactly what you’d like to achieve. With the pretty URL movement spreading all through the web, you’ll probably be dealing with URLs like example 2 a lot more often.

Query string independent URLs

Let’s start with query string independent URLs. For this you’ll need to use this snippet of XML code:

<rule name="GoneX" stopProcessing="true">
    <match url="^URL/?$" ignoreCase="true" />
    <action type="CustomResponse" statusCode="410" statusReason="Gone" statusDescription="The requested resource is no longer available at the server and no forwarding address is known." />
</rule>

As IIS7 does not allow multiple rules with the same name, replace the X in GoneX with some unique string. I usually use consecutive numbers (Gone1, Gone2, …).

You’ll also need to replace URL with the URL you got from the Crawl Errors report, without the domain part. So if you look back at the second example, you’ll only need the long-jeans/ part.

As rewrite rules use regular expressions for matching URLs, you’ll need to be mindful of a couple of things:

  • Make sure that the ^ and the /?$ remain as they are. The ^ and $ mark the start and the end of the string respectively. And the /? says that there is an optional slash at the end.
  • Leave out the slash at the end of the URL if there is one. www.fakestore.com/long-jeans and www.fakestore.com/long-jeans/ are technically two different URLs, but people might type in one or the other. That’s why these rules contain a /? at the end.
  • You’ll also need to escape all the meta characters. The most common ones in URLs are . - + ( and ). To escape a character you need a backslash in front of it.

With all that said, here’s how the above snippet would look for example 2:

<rule name="GoneX" stopProcessing="true">
    <match url="^long\\-jeans/?$" ignoreCase="true" />
    <action type="CustomResponse" statusCode="410" statusReason="Gone" statusDescription="The requested resource is no longer available at the server and no forwarding address is known." />
</rule>

As you can see, our rule is also case insensitive. If your web server now receives a request for a URL that matches one of your rules, it will return a 410 HTTP status code. You can also change the status description by altering the statusDescription attribute of the action tag. This message will give your visitors an explanation for the error.

You’ll need to create a separate rule for each query string independent URL that points to a no longer existing page.

Query string based URLs

Now let’s look at URLs which hold important information in the query string, like our example 1. Here you won’t need separate rules for each URL, because the url attribute of the match tag only matches the URL without the domain and the query string. So there will be just one rule which will match all URLs:

<rule name="GoneQueryStrings" stopProcessing="true">
    <match url=".*" ignoreCase="true" />
    <conditions logicalGrouping="MatchAny">
        <!-- conditions go here -->
    </conditions>
    <action type="CustomResponse" statusCode="410" statusReason="Gone" statusDescription="The requested resource is no longer available at the server and no forwarding address is known." />
</rule>

Place this rule anywhere amongst your previous rules. You’ll notice the .* in the url attribute. In regular expressions, . matches any character and * says that there may be 0 or more such characters, so this pattern matches any URL. As you don’t want all the URLs on your site to start returning 410s, you need some conditions. These conditions look like this:

<add input="{QUERY_STRING}" pattern="QUERYSTRING" />

Now all you have to do is replace the QUERYSTRING in the pattern attribute with the query string from your URL and insert the condition into the above rule. Don’t forget about escaping the query string. You can have as many conditions as you want inside this rule. If any of these conditions match the query string of a request received by your web server, the rule will fire the action and return a 410 HTTP status code.
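
For instance, here’s how the rule might end up looking with the query string from example 1 plugged in. This is only a sketch: the item=24 pattern will also match longer query strings that contain it, so anchor it with ^ and $ if you want an exact match.

<rule name="GoneQueryStrings" stopProcessing="true">
    <match url=".*" ignoreCase="true" />
    <conditions logicalGrouping="MatchAny">
        <!-- any request whose query string contains item=24 gets a 410 -->
        <add input="{QUERY_STRING}" pattern="item=24" />
        <!-- add further conditions here for other removed items -->
    </conditions>
    <action type="CustomResponse" statusCode="410" statusReason="Gone" statusDescription="The requested resource is no longer available at the server and no forwarding address is known." />
</rule>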

301 - Moved permanently

If you find a URL in your Crawl Errors report that has changed (moved), you should use a 301 redirect to automatically send your users and Google to the new URL.

The difference between a 410 and a 301 rule in your web.config is in the action tag. For 301 you’ll need to use:

<action type="Redirect" url="NEWURL" />

You’ll need to replace NEWURL with the URL that you want the old URL to point at. The new URL must be a full (absolute) URL. So let’s look at an example where www.example.com/old/ is redirected to www.example.com/new/:

<rule name="Moved1" stopProcessing="true">
    <match url="^old/?$" ignoreCase="true" />
    <action type="Redirect" url="https://www.example.com/new/" />
</rule>

The same principle applies to query string based URLs. The only difference is that you need a separate rule for every URL you’re redirecting, since each one points to a different new URL. A sketch of such a rule follows below.
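
Say, for instance, that the jeans with id 24 from example 1 had moved to a new pretty URL instead of disappearing. The target www.fakestore.com/skinny-jeans/ below is made up purely for illustration, but the rule would look something like this:

<rule name="MovedQuery1" stopProcessing="true">
    <match url="^jeans/?$" ignoreCase="true" />
    <conditions>
        <!-- only redirect when the query string contains item=24 -->
        <add input="{QUERY_STRING}" pattern="item=24" />
    </conditions>
    <!-- redirectType="Permanent" (301) is the default and is spelled out here for clarity;
         appendQueryString="false" keeps ?item=24 from being tacked onto the new URL -->
    <action type="Redirect" url="https://www.fakestore.com/skinny-jeans/" redirectType="Permanent" appendQueryString="false" />
</rule>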

All in due time

When you’ve set up all of your rules, your part is done. You can check if everything works as it should with the Fetch as Googlebot tool or by manually entering URLs into your browser.

Googlebot will see these changes the next time it crawls your URLs and, according to URL removal explained, Part IV: Tracking your requests & what not to remove:

…they will drop out naturally over time as we stop crawling URLs that repeatedly 404.

As 410 is stronger than 404, your 410s should start disappearing from the Crawl Errors report over time. The URLs you redirected with a 301 should disappear faster.

For more information on creating rewrite rules go to Creating Rewrite Rules for the URL Rewrite Module.

For testing your regular expressions you can use the Regular Expression Test Tool.

sports

Yesterday was the eighth day since I started jogging. I started on Monday, the 26th of July, with a lap that roughly encircles my home village and is 4.65 kilometers long. That’s 2.89 miles for you imperial readers. I’ve made a map of the route on MapMyRun and named it Lendava GG2. From Tuesday through Saturday I continued with a lap that’s a bit smaller, Lendava GG1.

I said to myself that I wouldn’t stop moving for the entire length of the lap. I was either going to jog or walk. The walking parts serve as a kind of pause between jogging stretches.

The week started without any goals. Every day after waking up I went at it. Usually I started around 12pm, as I’m a late sleeper. I just wanted to see where my stamina was at. It turned out I wasn’t in such bad jogging shape. On Tuesday I ran about a third of Lendava GG1 before taking my first pause, and I took a total of three pauses that day. As the days went by I could jog a bit further every day, and I managed to eliminate one pause before the week ended. I took Sunday off to give my muscles a bit of time to regenerate.

During last week an internal goal started to emerge as I learned what my body is capable of. I wanted to jog the entire length of Lendava GG1 yesterday without pauses. It seemed quite an achievable goal, but I couldn’t do it. I managed two thirds of the lap before taking a break. That was the only break I took, though, and I managed to cut my fastest time from 19 to 14 minutes, which is great. I would probably have done it if my sleep hadn’t been cut short.

All in all I’m quite happy with the result, and I’d like to achieve the mentioned goal during this week. Also, this week I’m jogging in the evening, around 8pm, as I want to test out different times of day. I’ll report back about my performance next week.

Do you jog? Let me know how you do it in the comments below.

wordpress, google analytics

Recently I’ve noticed that my traffic is way higher on the days I write posts. This would be normal if my posts were published on the same day I wrote them, but that’s not always the case. Sometimes I get inspired and write a bunch of posts, publish the most relevant one immediately and delay the others for a few days. Or I might just write a draft one day and finish it later on.

So I went investigating why the traffic was significantly higher on those days. I poked around my Google Analytics reports and in Content » Top Content I saw entries like these:

Wordpress previews in Google Analytics

The p=<number>&preview=true part of the highlighted URLs was the key. Apparently the WordPress plugin I’ve got installed for Google Analytics and the likes of Bing Webmaster Tools, Yahoo Site Explorer and Google Webmaster Tools adds the JavaScript code that counts visits even on previews. This wasn’t OK, because it blurs the real traffic results.

So I had two choices: either change some PHP code in WordPress or the plugin, or make Google Analytics discard the preview visits. Neither of these choices is perfect. The problem with the first one is that when WordPress or the plugin gets updated my code will be gone and I’d have to do it all over again. The second one, on the other hand, only works for Google Analytics. If you’re using any other traffic analysis services or tools, you’d have to do this for every one of them. I chose the second option, as Google Analytics is my most used tool for traffic statistics.

The way to get rid of unwanted URLs in Google Analytics is via filters. The sad news here is that they only work from the moment you add them. You cannot change any data from the past. So I suggest doing this as soon as possible.

So to add this filter I went to my Google Analytics account and clicked on the Edit link in the actions column of the profile I use for this blog. There I clicked on the Add filter link in the upper right hand corner of the Filters Applied to Profile box. I’ve highlighted it here:

Add filter in Google Analytics

I didn’t have a filter like this in my Google Analytics account yet, so I had to create one. It was to be a custom filter that would exclude all of the URLs that have preview=true in them. I called it Discard Previews, and when I filled out the Add filter form it looked like this:

Discard Wordpress previews filter for Google Analytics
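
In text form the filter boils down to something like this (the exact field names, particularly Request URI, may differ depending on the version of the Analytics interface):

Filter Name:     Discard Previews
Filter Type:     Custom filter » Exclude
Filter Field:    Request URI
Filter Pattern:  preview=true
Case Sensitive:  No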

And that was it. The filter was set up and the traffic stats will be much more representative now.

tech

My Samsung T240 is starting to break down on me. As you can see in the pictures, two vertical lines and a yellow patch have shown up. The lines run all the way from the top to the bottom and the patch makes quite a few pixels yellower.

This started about a week ago. First I noticed the blue line and then the yellow patch, but the yellow line appeared just yesterday. Don’t know what’s going on really.

I bought this LCD about a year ago. It’s my second 24 inch monitor. The first one, a Samsung 2493HM, died on me after just 7 months. Fortunately I’ve got a laptop so I wasn’t left screenless, but going from a 24 inch to a 15.4 inch is quite a shock. It seems, though, that I’m going to have to go through it again. I just hope that the Samsung service in Slovenia works a bit faster this time, as it took them 2 months to return the money last time.

UPDATE (July 11th 2010): Today I noticed another 2 vertical lines. They are in between the previous blue and yellow lines. One of the new ones is red and the other is blue, but not as bright as the first blue one. I also think the yellow patch got a bit bigger.

UPDATE (July 15th 2010): A bunch of new vertical lines have started appearing. Some of them are at full strength now and I could swear the patch is getting bigger by the day. This is starting to get a bit annoying.

Copyright © 2024 - Goran Gligorin

Blog content licensed under the Creative Commons CC BY 4.0

Unless otherwise stated or granted, code samples licensed under the MIT license