Is Your Site Crawl Efficient?

12th August 2016

Great On-Page SEO and excellent link building are only part of the complete SEO process. To ensure the value of all On and Off Page SEO work is fully felt, a big focus also needs to be placed on the Technical side.

A hugely important area of Technical SEO, and one that encompasses several different elements, is Crawl Efficiency. Crawl Efficiency essentially means getting the search engine robot to spend its time on the parts of the site that are important (to you and the user) and stopping it from wasting time on unimportant areas of the site.

GoogleBot only spends a finite amount of time on a website (its Crawl Budget), so it is of great importance to ensure that it takes the valuable information away with it. By focussing its crawl on the important pages you're maximising the efficiency of the crawl and sending much more positive cues to Google. Dawn Anderson delivered a great seminar (now a deck on SlideShare) back in April on why crawl efficiency matters so much for visibility. Take a look; it will set this post up better than I ever could.

In this post I'll be discussing the practices that can be applied to a website in order to maximise the efficiency of the time GoogleBot spends on it.

By working through the elements below you should be able to identify areas of your site that can be updated to improve the efficiency of GoogleBot's crawl. Below are two screenshots of a client site crawl before and after we made some of the changes in this post to boost crawl efficiency.

Before that though, it's probably a good idea to outline what we'd consider to be 'unimportant' or 'wasteful' pages when it comes to a crawl:

  • T’s & C’s and Privacy Policy pages (unless required to be indexable)
    – I can't think of any times that I would Google "BRAND X Terms and Conditions" without looking on the site first. In some niches, though, you may need these to be indexable, and AdWords policies can also require it.
  • Parameter and Paginated Pages, e.g. /blog/?p=3
    – You know, when you click at the bottom of a blog page list for the next 10/15 posts, or on a page showing 20 products out of 100. This also applies to pages with user IDs or tracking parameters in the URL, and sometimes to filtered and faceted navigation pages on eCommerce sites.
  • Canonicalised Pages
    – This is a page that is essentially a duplicate of another, so you've added a canonical link to the original to mitigate any duplication risks. So why crawl it?
  • Duplicated Pages
    – Duplicate pages 🙂
  • In some cases, product pages.
    – If the products are essentially identical other than the colour or size then you may have hundreds or thousands of products that would be a waste of time to crawl.

Disallow The Same Pages in the robots.txt File

OK, so everyone's first port of call on the crawl efficiency list should be a visit to your robots.txt file. This is where you can list many of the pages, folders, file types and parameters that you don't want included in the crawl. For example, if you don't want PDFs, your Members page or your internal search pages to be crawled, you'd add the following to the file:

User-agent: *
# applies the rules below to all robots

Disallow: /members/
# blocks the /members/ folder and all of its pages

Disallow: /*.pdf$
# blocks all URLs ending with a .pdf extension

Disallow: /?s=
# blocks internal search pages (check your own search parameter). If your search results sit in a subfolder, e.g. /search/result=test+search, block that subfolder in the same way as /members/ above.

Nofollow Internal Links to Unimportant Pages

Wherever you have internal links on your website to any of these pages, it is important to add rel="nofollow" to those links. This ensures that when GoogleBot is visiting the page it doesn't follow them into its crawl.
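To illustrate (the link target and anchor text here are just placeholders), a nofollowed internal link would look something like this:

<!-- Hypothetical internal link to a page we don't want crawled -->
<a href="/terms-and-conditions/" rel="nofollow">Terms and Conditions</a>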

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"> Those Same Pages

Adding rel="nofollow" to internal links is only part of this process. Although you're telling ol' GoogleBot not to follow the internal link to that page, if it reaches the page another way round then it's still going to crawl it. Adding the NOINDEX meta tag tells the robot to leave the page and find something better to do.
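For reference, the tag sits in the <head> of the page you want kept out of the index. A minimal sketch, assuming a /members/ page like the robots.txt example above:

<!-- Hypothetical members page: kept out of the index, but links on it can still be followed -->
<head>
  <meta name="robots" content="noindex, follow">
</head>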

Why ‘FOLLOW’ and not ‘NOFOLLOW’?

This is probably a personal preference from an efficiency perspective, but my view is that if, after all these efforts, the robot still reaches a page you don't want it on and is also told not to follow any links on that page, then it's basically got only one way out: back where it came from. With 'FOLLOW' in place, at least it's got escape routes that may not yet have been crawled.

  • This being said, with 'NOINDEX' on there, it's unlikely the robot will ever look for a link to follow on an escape mission from the page; it will still head backwards. Peace of mind for me, though.

Remove Pages from XML Sitemaps

So you've blocked the pages, folders, parameters and canonicalised pages, but GoogleBot still has a way to find them: the XML sitemap files. Confusing the robot with mixed signals not only causes inefficiencies, it also probably sends information back to Google suggesting the site is a mess. It is therefore vital that any pages you have blocked in the robots.txt, nofollowed the links to, or added the noindex tag to are NOT in your sitemaps.
– This also includes any on-page sitemaps you have!

Prioritise Important Pages in XML Sitemaps

This one is a bit of a no-brainer. Ensure that you're giving the most important pages the right priority over others in the XML sitemap files. Priority ranges from 0.0 to 1.0, with 1.0 being the highest. You can also update the change frequency to reflect how often the page changes, which denotes its importance a little more and adds it to GoogleBot's 'must visit again' list.
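As a rough sketch (the URL and values here are purely illustrative), a sitemap entry with priority and change frequency set looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Hypothetical key page: highest priority, content changes weekly -->
  <url>
    <loc>https://www.mysite.co.uk/important-page/</loc>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>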

Ensure Importance of Pages is Reflected in URL Structure

Another relatively simple one to get right: you shouldn't have your important pages buried deep in the depths of your site structure, like http://mysite.co.uk/folder-1/folder-2/folder-3/important-page/.

One example we've come across had its important pages, which are integral to the business's success, buried at exactly this depth. From an efficiency point of view, GoogleBot has to crawl through three subfolders, some of which are simply empty pages holding links to other deep folders, before reaching the important page.

Ideally the structure should be as flat as possible, with at most one level of categorisation through a subfolder.
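To illustrate with the URL from above (the flat folder name is just a placeholder):

Deep: http://mysite.co.uk/folder-1/folder-2/folder-3/important-page/
Flat: http://mysite.co.uk/services/important-page/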

Redirect Broken Links

A 404 page is like a dead end to the robot and it offers little in the way of content for it to eat up. It's fairly inefficient to let the robot spend its limited time on the site finding dead end after dead end. To that end, make sure internal references to pages that no longer exist are updated, as well as 301 redirecting those old URLs to the most relevant live version. Also, check and update the XML sitemap to be sure they aren't in there!
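As an example of the redirect itself, on an Apache server a 301 in the .htaccess file might look like the line below (the paths are placeholders; other servers and CMSs have their own equivalents):

# Hypothetical rule: permanently redirect a removed page to its closest live equivalent
Redirect 301 /old-page/ /new-page/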

Internal Redirects

This is a particular bugbear of mine. When a page is no longer needed and is removed from the site, the important thing to do is 301 redirect it to the most relevant live page. That's great, but a step is missing before the job is complete: there are likely to be links to the old page referenced somewhere on the site (usually the blog) that now point at a URL which is then redirected on to another. The efficient thing for the robot is to be pointed directly at the live page rather than passed through a redirect.

Bonus tip: if there are no external links pointing to an old page and all internal references are updated to point to the new page, then there is no need for a 301 redirect at all, which saves on server overheads in the long term. For more on old pages, see my post on what's best to do with old pages.

Nofollow Redirecting Affiliate Links

Not likely to affect many, but we've seen this on a few occasions on client sites: a 'Buy Now' link points to a page on their own site which is then redirected, with an affiliate tag, to the site where the purchase can be made. Something like http://www.mysite.com/affiliate/9009 then redirects to http://www.moneysite.com/product?9009. Just rel="nofollow" these links!
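Using the example URLs above, the nofollowed 'Buy Now' link would look something like this:

<!-- Affiliate-style link that redirects off-site: nofollow it so GoogleBot doesn't waste time on it -->
<a href="http://www.mysite.com/affiliate/9009" rel="nofollow">Buy Now</a>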

Efficiency Through Load Times

With the above elements taken care of, the site is a pretty nice place for a robot to crawl! Minimal dead ends, no unnecessary redirects, an easy-to-understand structure and all unimportant pages blocked. That's not the end of the story though! A site with great technical crawl efficiency still needs fast load times in order for the robot to get through all the pages before it leaves. There are a few ways to ensure the robot has a speedy trip, and these are below. I won't dwell on them as I've covered them in great detail before:

Get a Dedicated Server IP

– GoogleBot's crawl is IP based, which means that if you share an IP address with loads of other websites then you'll only get a percentage of its time. Get your own IP for its complete attention when you're paid a visit.

Improve Site Load Times

I hope this post has given you some ideas and helped you plan how to improve your own site's crawl efficiency. If you have any questions, add them as a comment or email us!

If you’re in SEO and have any areas you’d like to add then please feel free to mention them in the comments below.

 
