Every Monday, we’ll be going over a basic yet important need for any SEO strategy. This past week, Google changed the Webmaster Guidelines to state that websites must “not block a destination URL for a Google Ad Product” with a robots.txt file. However, this new guideline was removed the next day. The robots.txt file has always been important, and it only seems to be growing more essential for SEOs who want to abide by the guidelines, which is a good thing. Search engines employ a bot to crawl and index the World Wide Web. The robots.txt file on any site serves as a set of instructions that tells a bot what it is and is not allowed to crawl on your site. However, that doesn’t necessarily prevent a bot from crawling these pages anyway.
So what’s the point, then?
To underscore how important a robots.txt file is, consider that it can greatly affect your ranking in search engines like Google. For example, a poorly configured robots.txt file can result in:
- Lower quality scores for your site
- Blocked ad campaigns
- Lower organic rankings
- Other problems
For those of you who went along with the message from Google to “remove the robots.txt file completely,” either you don’t have any content you want to keep from being crawled or you’re not sure how the file works.
When to Use Robots.txt
Whenever you want some content on your site excluded from search, you would use the robots.txt file. However, if you don’t have a robots.txt file at all, the server can return a 404 or permission-denied error for it, which does happen and can cause issues. Other times, bots aren’t able to find the robots.txt file and therefore skip over the site just to be on the safe side, which is the approach Google tends to lean towards. However rare these instances may be, it’s still a good idea to err on the side of caution and include a robots.txt file, even if it simply states:
User-agent: *
Disallow:
This says to the robot about to crawl your site, “My door is open. Come right on in.” If you have any duplicate content, any low quality content, anything considered to be spam or spun, it will be crawled and your rankings will be affected.
How to Use Robots.txt with Disallow
The robots.txt is actually a very simple file to create. You can even use a robots generator. The most important step to using a robots.txt file is that it should be placed in the root folder of your domain. For example, www.yoursite.com/robots.txt is exactly how it should be for a bot to find your robots.txt file correctly and avoid any issues.
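If you want to double-check where a bot will look for the file, here is a minimal Python sketch that works out the robots.txt address for any page on a site. The function name robots_txt_url and the sample page URL are made up for illustration; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the page path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the placeholder domain used in this post.
print(robots_txt_url("http://www.yoursite.com/blog/post?id=1"))
```

Whatever page you start from, the result always points at the root folder of the domain, which is the only place a bot looks.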
It’s important to note that pages disallowed in a robots.txt file can still be indexed if they are linked from other places. You can use a robots meta tag to prevent a page from being indexed, like so:
<html>
<head>
<title>My Awesome Site</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
This prevents the page from being indexed by search engines and tells the crawler not to follow the links on that page. A crawler can only see the tag if it is allowed to fetch the page, so the page must not also be disallowed in robots.txt. However, robots may not honor this tag, as they can choose to ignore it.
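To confirm that a page really carries the robots meta tag, you can read it back out of the HTML. The following is a rough sketch using Python’s built-in html.parser module. The class name RobotsMetaParser and the helper robots_meta_directives are illustrative names, not part of any library, and the sample page is the snippet from this post.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_meta_directives(html_text):
    """Return the set of robots meta directives in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


PAGE = """<html>
<head>
<title>My Awesome Site</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
"""

directives = robots_meta_directives(PAGE)
print(sorted(directives))
```

A page without the tag gives back an empty set, which is a quick way to spot pages you meant to keep out of the index but forgot to tag.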
To disallow every crawler from accessing a site, use the following:
User-agent: *
Disallow: /
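The difference between the open-door file and the block-everything file is a single slash, so it is worth testing before you upload. Here is a minimal sketch using Python’s standard urllib.robotparser module. ExampleBot is a made-up crawler name, the page URL is a placeholder, and Python’s parser only approximates how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means nothing is off limits.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# A single slash disallows every path on the site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical page on the placeholder domain used in this post.
page = "http://www.yoursite.com/any-page/"

open_door = is_allowed(ALLOW_ALL, "ExampleBot", page)
closed_door = is_allowed(BLOCK_ALL, "ExampleBot", page)
print(open_door, closed_door)
```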
Which Crawlers Follow Robots.txt
Reputable and well-known crawlers like those from Google, Bing and Yahoo follow robots.txt files. However, there are also plenty of crawlers and bots out there that choose to ignore the file because they like to see what a site is hiding. There are even some malware bots out there that like to look at the robots.txt file. The file is also largely misunderstood: some people believe that you can tell a crawler at what time of day to crawl a site, and others think that adding a page to the robots.txt file makes it completely invisible. It doesn’t. You should never put a page with private or secure information online and rely on a robots.txt file to protect it.
There are ways to allow access only to the crawlers you choose. However, it is generally advised not to do this as it can result in lower rankings. To allow only one crawler to access your site, use the following:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Now only Googlebot is allowed to crawl your site. All other crawlers that obey the file will be blocked. If you do this, Bing, Yahoo and other search engines will not crawl your site. Alternatively, you can disallow only certain bad bots that have been spotted for spammy or otherwise nasty behavior.
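Before shipping a single-crawler setup like this, you can check who gets in. This is a small sketch with Python’s standard urllib.robotparser module. The home page URL is a placeholder, and the parser’s matching rules are Python’s own, so treat the result as a sanity check and not as a guarantee of how each search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# One group for Googlebot (nothing disallowed), one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder home page on the domain used in this post.
home = "http://www.yoursite.com/"
crawlers = ["Googlebot", "bingbot", "Slurp"]

# Map each crawler name to whether it may fetch the home page.
access = {name: parser.can_fetch(name, home) for name in crawlers}
print(access)
```

Only Googlebot comes back as allowed. The other two fall through to the catch-all group and are turned away.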
How to Disallow Pages Effectively for SEO
If you have a page of content that is old, duplicated, dummy content or unwanted for whatever other reason, you can keep crawlers away from that page by adding Disallow lines for specific directories.
To do so, you’ll use the following:
User-agent: *
Disallow: /bad_content_page/
Disallow: /disallowedpage/
Now the pages http://yoursite.com/bad_content_page/ and http://yoursite.com/disallowedpage/ will not be crawled by bots that obey the file.
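If your list of disallowed directories grows, it helps to audit which of your URLs are actually caught by the rules. Below is a minimal sketch with Python’s standard urllib.robotparser module. The helper name blocked_urls and the /blog/ URL are invented for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /bad_content_page/
Disallow: /disallowedpage/
"""


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs from urls that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


urls = [
    "http://yoursite.com/",
    "http://yoursite.com/bad_content_page/",
    "http://yoursite.com/disallowedpage/",
    "http://yoursite.com/blog/",  # hypothetical page that should stay crawlable
]

blocked = blocked_urls(ROBOTS_TXT, urls)
print(blocked)
```

Anything in the result that you did not expect to see is a sign that a Disallow line is broader than you intended.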
Importance of Robots.txt File to Mobile Searchbots
In some cases, you may notice that a desktop site shows up in a mobile search engine results page and a mobile site shows up for a desktop search. While it’s a little bit rare, it does happen, and it is most noticeable when viewing sites on a mobile phone. If you want complete control over the mobile user’s experience, then you may want to set up the robots.txt files on your desktop and mobile sites differently.
Desktop Site Example Robots.txt
User-agent: Googlebot
User-agent: bingbot
User-agent: Slurp
Allow: /

User-agent: Googlebot-Mobile
User-agent: MSNBOT_Mobile
User-agent: YahooSeeker/M1A1-R2D2
Disallow: /
Mobile Site Example Robots.txt
User-agent: Googlebot
User-agent: bingbot
User-agent: Slurp
Disallow: /

User-agent: Googlebot-Mobile
User-agent: MSNBOT_Mobile
User-agent: YahooSeeker/M1A1-R2D2
Allow: /
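Since the two files are mirror images of each other, one way to keep them in sync is to generate both from the same crawler lists. The following is only a sketch: render_robots_txt is a made-up helper, and the crawler names are simply the ones used in the examples above.

```python
DESKTOP_CRAWLERS = ["Googlebot", "bingbot", "Slurp"]
MOBILE_CRAWLERS = ["Googlebot-Mobile", "MSNBOT_Mobile", "YahooSeeker/M1A1-R2D2"]


def render_robots_txt(groups):
    """Render (user_agents, rule) pairs as robots.txt text.

    Each pair becomes one group: a User-agent line per crawler,
    followed by the single rule line that applies to all of them.
    """
    blocks = []
    for user_agents, rule in groups:
        lines = [f"User-agent: {name}" for name in user_agents]
        lines.append(rule)
        blocks.append("\n".join(lines))
    # A blank line between groups keeps the file easy to read.
    return "\n\n".join(blocks) + "\n"


desktop_robots = render_robots_txt([
    (DESKTOP_CRAWLERS, "Allow: /"),
    (MOBILE_CRAWLERS, "Disallow: /"),
])
mobile_robots = render_robots_txt([
    (DESKTOP_CRAWLERS, "Disallow: /"),
    (MOBILE_CRAWLERS, "Allow: /"),
])

print(desktop_robots)
print(mobile_robots)
```

Save the first result as the robots.txt on your desktop site and the second as the robots.txt on your mobile site. If a crawler is renamed or added, you only change it in one place.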
One thing to note is that you don’t ever want to block Googlebot or Googlebot-Mobile from your whole site because they are important for appearing in search results pages. If it is a problem for your user experience, you may need to update your site and find a better responsive design to configure your pages without sacrificing page views because of the robots.txt file.
For help with understanding your mobile and desktop site configuration or to further look into managing your site with sitemaps and htaccess, stay tuned to future Monday posts on What to Know About SEO.