The importance of a good robots.txt file for a Magento store

Written by Giles Bennett

Having covered why sitemaps are important, we now turn our attention to the importance of a good robots.txt file for a Magento store. (If you haven't read it already, both of these posts arose from our ten-point health check for your online store.)

What is a robots.txt file?

A robots.txt file is, as its name suggests, a small text file which is there for robots. Not actual robots, obviously, but search engine spiders. Website owners use it to give instructions to the search engine spiders which constantly crawl the web updating their owners' search results. Any spider which adheres to the robots.txt protocol - it's not binding, as there's no way to enforce it, but all the good spiders adhere to it - checks for the presence of a robots.txt file at the root of a site before crawling it.

What does a robots.txt file contain?

The most important part of a robots.txt file contains instructions telling spiders whether they're welcome or not. The very simplest robots.txt would ban all spiders by saying the following:

[code]
User-agent: *
Disallow: /
[/code]

The first line says "this applies to all of you spiders", and the second line says "don't crawl anything". This is fine if you're developing a site and it's not ready to be indexed yet, but beyond that it's not going to do much for your search engine optimisation.

If you want to block bots that are annoying (they put too much strain on your server by constantly crawling your site, or they belong to search engines in which you're never going to want to appear) then you can target specific bots as follows:

[code]
User-agent: FatBot
Disallow: /
[/code]
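If you'd like to see how a well-behaved spider reads those two files, here's a minimal sketch using Python's standard urllib.robotparser module. The helper name load_rules, the example.com URL and the SomeOtherBot user agent are our own placeholders, and other parsers may differ in the details.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""

BLOCK_FATBOT_ONLY = """\
User-agent: FatBot
Disallow: /
"""

everyone = load_rules(BLOCK_EVERYONE)
fatbot_only = load_rules(BLOCK_FATBOT_ONLY)

# example.com and SomeOtherBot are placeholders for illustration.
url = "http://www.example.com/some-page.html"

print(everyone.can_fetch("FatBot", url))           # False: everyone is banned
print(everyone.can_fetch("SomeOtherBot", url))     # False: everyone is banned
print(fatbot_only.can_fetch("FatBot", url))        # False: FatBot is banned
print(fatbot_only.can_fetch("SomeOtherBot", url))  # True: no rule applies
```

A bot that isn't named in the second file has no rule applying to it, so it's free to crawl.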

Preventing specific files or folders from being indexed

To disallow specific files, add the following to the robots.txt, beneath the User-agent line they should apply to:

[code]
Disallow: /file1.php
Disallow: /file2.sh
[/code]

To stop a directory and all of its contents from being crawled, simply add the following:

[code]
Disallow: /directory1/
Disallow: /directory2/
[/code]
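Disallow rules match on the start of the path, which is worth seeing in action. The sketch below again uses Python's urllib.robotparser, puts the four rules under a User-agent: * line (our assumption, so that they apply to every bot), and tries a few sample paths. AnyBot and /index.html are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /file1.php
Disallow: /file2.sh
Disallow: /directory1/
Disallow: /directory2/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path: str) -> bool:
    """Return True if a bot calling itself AnyBot may fetch the path."""
    return parser.can_fetch("AnyBot", path)


sample_paths = [
    "/file1.php",             # named file
    "/directory1/",           # the directory itself
    "/directory1/page.html",  # anything beneath it
    "/directory1",            # no trailing slash, so the rule doesn't match
    "/index.html",            # not mentioned at all
]

for path in sample_paths:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

With this parser the first three paths come back blocked and the last two allowed, so the trailing slash on a directory rule matters.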

A robots.txt file for Magento stores

Over the years we've developed a set of instructions for a robots.txt specifically for a Magento store. They're meant for all bots, so they sit beneath a User-agent: * line:

[code]
User-agent: *
Disallow: /app/
Disallow: /downloader/
Disallow: /errors/
Disallow: /cgi-bin/
Disallow: /includes/
Disallow: /lib/
Disallow: /pkginfo/
Disallow: /shell/
Disallow: /var/
Disallow: /catalogsearch/
Disallow: /catalog/seo_sitemap/category/
Disallow: /catalog/seo_sitemap/product/
Disallow: /index.php/
Disallow: /catalogsearch/result/
Disallow: /catalogsearch/result/index/
Disallow: /catalogsearch/result/index/?*
Disallow: /control/
Disallow: /contacts/
Disallow: /customer/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /STATUS.txt
[/code]
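Before deploying a rule set like this, it's worth running a few store URLs through a parser to check it does what you expect. The sketch below takes a handful of the rules above and tests them with Python's urllib.robotparser, once on their own and once beneath a User-agent: * line. The sample paths and the AnyBot name are hypothetical, and the result for the ungrouped rules is this particular parser's behaviour.

```python
from urllib.robotparser import RobotFileParser

# A small subset of the Magento rules, for illustration only.
RULES = """\
Disallow: /app/
Disallow: /var/
Disallow: /customer/
Disallow: /cron.php
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Rules with no User-agent line above them belong to no group.
ungrouped = load_rules(RULES)
# The same rules, applied to every bot.
grouped = load_rules("User-agent: *\n" + RULES)

# Hypothetical paths on a store.
for path in ["/customer/account/login/", "/cron.php", "/some-product.html"]:
    print(
        path,
        "| ungrouped:", ungrouped.can_fetch("AnyBot", path),
        "| grouped:", grouped.can_fetch("AnyBot", path),
    )
```

Python's parser ignores Disallow lines that aren't preceded by a User-agent line, so the ungrouped rules allow everything, while the grouped version blocks the first two paths and leaves the product page crawlable.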

Restraining the behaviour of bots

You can limit the speed with which bots crawl your site as follows:

[code]
User-agent: *
Crawl-delay: 10

User-agent: Baiduspider
Crawl-delay: 20
[/code]

This asks all bots to crawl no more than one page every ten seconds, then has a more specific rule for Baidu's spider which slows it to one page every twenty seconds. Bear in mind that Crawl-delay is not part of the core robots.txt protocol, so not every bot honours it - Googlebot, for one, ignores it.
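To see which delay a given bot ends up with, here's a short sketch using the crawl_delay method of Python's urllib.robotparser (available from Python 3.6). SomeOtherBot is a made-up name standing in for any bot without a rule of its own.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10

User-agent: Baiduspider
Crawl-delay: 20
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A bot with its own group gets that group's delay...
baidu_delay = parser.crawl_delay("Baiduspider")
# ...and everyone else falls back to the catch-all group.
default_delay = parser.crawl_delay("SomeOtherBot")

print("Baiduspider waits", baidu_delay, "seconds between requests")
print("SomeOtherBot waits", default_delay, "seconds between requests")
```

The more specific group wins for the bot it names, and the User-agent: * group acts as the fallback for the rest.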

Linking in your sitemap

The final use for a robots.txt is to signpost to bots where your sitemap is - since bots check in with robots.txt first, it's a nice little marker for them, particularly if your sitemap isn't at a standard location like www.yoursite.com/sitemap.xml.

[code]
# Website Sitemap
Sitemap: http://www.theyorkshirepantry.com/sitemap.xml
[/code]
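A bot picks that marker up while parsing the file. As a rough sketch, Python's urllib.robotparser exposes it through site_maps(), which needs Python 3.8 or later. The snippet parses the two lines above from a string rather than fetching anything over the network.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Website Sitemap
Sitemap: http://www.theyorkshirepantry.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if none are declared.
sitemaps = parser.site_maps() or []

for sitemap_url in sitemaps:
    print("Sitemap found at", sitemap_url)
```

The comment line is skipped, and the Sitemap line is read whichever User-agent group it happens to sit near.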

One thing to bear in mind

A robots.txt file is public - so there's no point trying to use it to hide files or folders that you don't want people to find, because the act of hiding them will, itself, be visible in the robots.txt file. Use password protection, such as HTTP authentication, instead.

About the author

Giles Bennett built his first website in 1996, and is old enough to miss Netscape Navigator. Initially a lawyer, he jumped ship to IT in 2008, and after 5 years as a freelancer, he founded HummingbirdUK in 2013. He can be reached by email at giles@hummingbirduk.com.