Understanding Robots.txt & Best Practices for Using It

Understanding robots.txt

Robots.txt is a file that tells search engine bots which pages on your website they should not crawl.

Whenever a bot visits your website, it checks this file first; think of it as a gate that tells crawlers which pages they should not access.

This is especially useful for keeping pages you don’t want shown publicly, like login pages or pages that expose internal data, away from search engine crawlers.

Is robots.txt important?

Most websites don’t need a robots.txt file.

Over the years, Google’s crawler has become pretty smart. It can usually find and index all the important stuff on your site without any help.

But there are a few reasons why you might want to use one:

  1. Hide stuff you don’t want public – Sometimes you’ve got pages on your site that you don’t want everyone to see, like a test version of a page or a login page. Robots.txt can tell search engines to stay away from these.
  2. Help search engines focus on what matters – If you’ve got a huge site with tons of pages, search engines might have trouble crawling everything. By using robots.txt to block unimportant pages, you’re helping search engines spend more time on the pages that matter.
  3. Keep certain files private – If you’ve got PDFs, images, or other files you don’t want showing up in search results, robots.txt can help with that too (see the example just after this list).
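
For instance, a couple of rules like the sketch below would ask every bot to skip a PDF folder and an image folder (the folder names are placeholders, not from any real site):

User-agent: *
Disallow: /pdfs/
Disallow: /images/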

But here’s the thing – it’s not foolproof. Good bots (like Google) will follow the rules, but bad bots (like spam bots) might ignore them completely.

So, is robots.txt important?

It can be, especially if you’ve got a big site or stuff you want to keep private. But for most small websites, it’s not something you need to lose sleep over.

You can check all of your indexed pages in Google Search Console, under the Pages section.

[Screenshot: the Pages report in Google Search Console]

If the number of indexed pages doesn’t match the number of pages you want indexed, you need to check your robots.txt.

Difference between robots.txt, meta robots and X-Robots-Tag

Alright, let’s break this down in simple terms. These three things – robots.txt, meta robots tags, and X-Robots-Tag – all tell search engines what to do with your website’s content. But they’re not the same thing. Here’s the deal:

Robots.txt: It’s a file that sits in your website’s root directory. Its job is to tell search engine crawlers which parts of your site they can and can’t look at. It’s like giving them a map of your website and saying, “You can go here, but not there.”

Meta robots tags: These are more like little signs on each page of your website. They’re bits of code that you put in the <head> section of every webpage. Each sign tells search engines two things about that specific page:

  1. Should they include this page in search results?
  2. Should they follow the links on this page to other pages?

Look at the image below; this code sits at the page level.

[Screenshot: a meta robots tag in a page’s <head> section]
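
In text form, a meta robots tag looks something like this; this generic example asks search engines not to index the page but still follow its links:

<meta name="robots" content="noindex, follow">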

X-Robots-Tag: This is a special sign for stuff that isn’t a regular web page, like PDFs or images. You put this tag in the HTTP response header of the file. It’s a way to give instructions about files that don’t have a <head> section like regular web pages do.
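
For example, the server’s response for a PDF could include a header like this to keep the file out of search results (exactly how you add the header depends on your web server, so treat this as a generic sketch):

X-Robots-Tag: noindex
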
In short: robots.txt is for your whole site, meta robots tags are for individual pages, and X-Robots-Tag is for other kinds of files. Each has its place in telling search engines what’s on your site.

Best Practices for Using Robots.txt

Create a robots.txt file

To use robots.txt, you’ve got to create one first. It’s super simple:

  1. Open up Notepad (or any plain text editor) on your computer.
  2. Now, here’s the format you’ll use:

User-agent: X
Disallow: Y

What does this mean?

  • “User-agent” is just a way of saying which bot you’re talking to.
  • “Disallow” is where you tell the bot what it can’t look at.

Let’s say you don’t want Google looking at your images. You’d write:

User-agent: googlebot
Disallow: /images

Want to talk to all bots at once? Use an asterisk (*) like this:

User-agent: *
Disallow: /images

This tells all bots to stay out of your images folder.

Remember, this is just scratching the surface. There’s a lot more you can do with robots.txt. Google’s got a guide that you can look into to understand all the rules you can use.

The main thing is to keep it simple. Start with the basics, and you can always add more rules as you need them.
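
As a starting point, a minimal robots.txt might look like the sketch below: the empty Disallow means every bot may crawl everything, and the sitemap line uses a placeholder URL you’d swap for your own.

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml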

[Screenshot: a simple robots.txt file]

Make Your Robots.txt File Easy to Find

Alright, you’ve made your robots.txt file. Now let’s get it up and running.

Search engine bots only look for a robots.txt file in one place: the root of your domain. So make sure it lives here:

https://example.com/robots.txt

Quick heads up: the file name matters and is case sensitive. Make sure it’s all lowercase: “robots.txt”.

Check for Errors and Mistakes

You’ve got to get your robots.txt file right. One little mistake (a stray “Disallow: /”, for example, tells bots to skip your entire site) could mean your whole site disappears from search results. Yeah, it’s that serious.

You can test your robots.txt file using a free robots.txt tester (Google Search Console includes one, for example). Use it to make sure your robots.txt is doing what you want it to do.

[Screenshot: a robots.txt testing tool]

Remember, it’s always better to double-check. Take a few minutes to run your robots.txt through this tool. It could save you a big headache down the road.

Understanding more about robots.txt syntax

A robots.txt file is like a set of instructions for search engine bots. Here’s how it’s put together:

  1. It’s made up of blocks of rules (the individual rules are called “directives”)
  2. Each block names which bot it’s talking to (the “user-agent”)
  3. Then it tells that bot what it can and can’t do (“allow” or “disallow”)

Let’s look at a simple example:

User-agent: Bingbot
Disallow: /secret-stuff

User-agent: Yandexbot
Disallow: /not-for-yandex

Sitemap: https://www.yourwebsite.com/sitemap.xml

Breaking it down:

1. User-Agent: This is the bot’s name.

For example, if you don’t want Bing to look at your WordPress admin page:

User-agent: Bingbot
Disallow: /wp-admin/

2. Disallow: This tells the bot where it can’t go.
You can have multiple “disallow” lines. If you leave it empty, it means the bot can go anywhere.
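
For example, this block stacks several Disallow lines for the same bot (the folder names are made up for illustration):

User-agent: *
Disallow: /tmp/
Disallow: /checkout/
Disallow: /internal-search/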

3. Allow: This is the opposite of disallow.
It lets you say “yes” to specific pages, even if you’ve said “no” to the whole folder.

For example:

User-agent: Bingbot
Disallow: /blog
Allow: /blog/cool-post

This tells Bing not to look at your blog, except for one specific post.

4. Sitemap: This is like giving search engines a map of your website. It usually goes at the top or bottom of your robots.txt file:

Sitemap: https://www.yourwebsite.com/sitemap.xml

5. Crawl-delay: This tells bots to slow down if they’re putting too much strain on your server.

For example:

User-agent: *
Crawl-delay: 10

This tells all bots to wait 10 seconds between each page they look at.

Different search engines handle these instructions slightly differently; Google, for example, ignores the Crawl-delay directive entirely. It’s always a good idea to check how your specific target search engines interpret robots.txt files.

Best practices for using robots.txt (a combined example follows this list):

  1. Be specific with user-agents when possible
  2. Use the minimum number of rules to achieve desired results
  3. Regularly review and update your robots.txt
  4. Test your robots.txt using search engine webmaster tools
  5. Remember robots.txt is a suggestion; malicious bots may ignore it
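
Putting a few of these together, a lean file for a typical WordPress-style site might look like this sketch (the admin paths follow common WordPress conventions; the sitemap URL is a placeholder):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.yourwebsite.com/sitemap.xml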

Limitations of robots.txt:

  1. Not a security measure; sensitive content should be protected through other means
  2. Blocked pages can still be indexed if linked from other sites
  3. Different search engines may interpret the file slightly differently

The examples below use all of this syntax, so you’ll get a better grasp of each directive. We’ve also explained how each robots.txt file differs from the others and what purpose it serves.

Examples of robots.txt files from some of the known websites

In this section, we’ll look at the robots.txt files of some well-known websites. We’ll walk through parts (or all) of each file to help you understand them better.

Let’s begin!

1. Forbes

[Screenshot: part of Forbes’ robots.txt file]

Forbes uses a two-tiered approach in their robots.txt file (the points below cover only the screenshot above; the original file is much larger than we could include in this blog):

For all user agents (*):

  • They allow the crawling of specific AJAX paths, likely for dynamic content.
  • They disallow crawling of various paths including search, follow, author pages, JSON data, and admin areas.
  • Notable disallowed paths include ‘/typeahead/’, ‘/pepe/’, and ‘/media-manager/’.

For Googlebot-News specifically:

  • They disallow crawling of specific company-related paths like ‘/sites/adobe/’, ‘/sites/dell/’, etc.

This strategy allows Forbes to:

  • Keep their main content accessible to search engines
  • Protect user data and admin areas
  • Control how news-specific bots interact with their site
  • Possibly manage syndicated or sponsored content from other companies

2. Capterra

[Screenshot: part of Capterra’s robots.txt file]

Capterra’s robots.txt is more specific:

  • They disallow all bots (*) from crawling certain paths, including comparison pages, external clicks, and search pages.
  • They allow the crawling of specific comparison page formats.
  • They explicitly allow the crawling of software pages and certain image paths.

This approach helps Capterra:

  • Control how their product comparisons appear in search results
  • Prevent indexing of dynamically generated pages (like search results)
  • Ensure their main product pages are crawlable

3. Search Engine Journal

[Screenshot: part of Search Engine Journal’s robots.txt file]

SEJ’s file is highly technical:

  • They disallow the crawling of numerous specific JavaScript files and WordPress plugin paths.
  • They block access to certain URL parameters (like ?s=, ?categories=, etc.)
  • They provide a sitemap link at the end.

This configuration allows SEJ to:

  • Prevent indexing of technical files that don’t provide value to search results
  • Control how their content is indexed based on URL parameters
  • Guide search engines to their sitemap for efficient crawling

4. Trustpilot

[Screenshot: part of Trustpilot’s robots.txt file]

Trustpilot focuses on protecting user data and managing their review content:

  • They specifically target Googlebot
  • They disallow access to error pages, evaluation pages, and review transparency pages
  • They block various URL parameters that might create duplicate content

This helps Trustpilot:

  • Maintain user privacy by limiting access to individual reviews
  • Prevent indexing of potential duplicate content created by URL parameters
  • Focus search engine attention on their main, aggregated review pages

5. Tripadvisor

[Screenshot: part of Tripadvisor’s robots.txt file]

TripAdvisor has an extensive robots.txt file:

  • They have instructions for all user agents (*) and a specific section for Applebot
  • They disallow a vast array of paths, including many internal tools and user-specific pages
  • Notable blocked paths include account management, analytics, and various widget paths

This comprehensive approach allows TripAdvisor to:

  • Protect user data and internal tools from being indexed
  • Prevent crawling of dynamically generated content that might create duplicate issues
  • Focus search engine attention on their main travel content and reviews

Each of these robots.txt files reflects the complex needs of large, content-rich websites, balancing SEO needs with user privacy and content management.

Conclusion

Using robots.txt gives you better control over how search engines crawl your website, and it’s advisable for any business that needs that control to set one up.

However, having a good robots.txt file in your website’s root directory does not guarantee SEO success. Robots.txt is just one part of the whole SEO campaign.

Just a little about us: we are Serpple, an SEO tool that helps you do keyword research, track your Google ranking positions and manage your backlinks.

We have been building an SEO tool for businesses that need action-based data. Whether you’re just starting with SEO or looking to refine your strategy, Serpple’s tools can help you make the most of your efforts.