
Robots.txt for SEO: your complete guide

28/01/2022
Elizabeth De Leon

What is robots.txt, and why is it important for search engine optimization (SEO)? Robots.txt is a text file of optional directives that tell web crawlers which parts of your website they can access. Most search engines, including Google, Bing, Yahoo, and Yandex, support and use robots.txt to identify which web pages to crawl, index, and display in search results.

If you're having trouble getting your website indexed by search engines, your robots.txt file may be the problem. Robots.txt errors are among the most common technical SEO issues that appear in SEO audit reports, and they can cause a significant drop in search rankings. Even technical SEO service providers and web developers are susceptible to robots.txt errors.

As such, it's important that you understand two things: 1) what robots.txt is and 2) how to use robots.txt in WordPress and other content management systems (CMS). This will help you create a robots.txt file that is optimized for SEO and makes it easier for web spiders to crawl and index your web pages.

Let's dive into the basics of robots.txt. Read on and find out how you can leverage the robots.txt file to improve your website's crawlability and indexing capabilities.

What is Robots.txt?

Robots.txt, also known as the robots exclusion standard or protocol, is a text file located in the root or main directory of your website. It serves as an instruction to SEO spiders about which parts of your website they can and cannot crawl.

Robots.txt Timeline

The robots.txt file is a standard proposed by Aliweb creator Martijn Koster to regulate how different search engine robots and web crawlers access web content. Here is an overview of how the robots.txt file has developed over the years:

In 1994, after a rogue web spider caused problems on his servers, Koster developed robots.txt to guide search robots to the right pages and prevent them from reaching certain areas of a website.

In 1997, an Internet Draft was created to specify web robot control methods using a robots.txt file. Since then, robots.txt has been used to restrict or channel crawler bots to selected parts of a website.

On July 1, 2019, Google announced that it was working to formalize the Robots Exclusion Protocol (REP) specification and make it a web standard, 25 years after search engines first created and adopted the robots.txt file.

The goal was to detail unspecified scenarios for robots.txt parsing and matching, adapting the protocol to the modern web. This Internet Draft indicates that:

1. Any transfer protocol based on a uniform resource identifier (URI), such as HTTP, the Constrained Application Protocol (CoAP), and the File Transfer Protocol (FTP), can use robots.txt.
2. Crawlers should parse at least the first 500 kibibytes of a robots.txt file to relieve unnecessary strain on servers.
3. Robots.txt content is typically cached for up to 24 hours, which gives website owners and developers enough time to update their robots.txt file.
4. Disallowed pages are not crawled for a reasonably long period when a robots.txt file becomes inaccessible due to server issues.

Various industry efforts have been made over time to extend robot exclusion mechanisms. However, not all web crawlers support these newer protocols. To clearly understand how robots.txt works, let's first define what a web crawler is and answer an important question: how do web crawlers work?

What is a web crawler and how does it work?

A website crawler, also called a spider bot, site crawler, or search robot, is an Internet bot typically operated by search engines such as Google and Bing. A web spider crawls the web to analyze web pages and ensure that users can retrieve information whenever they need it.

What are web crawlers, and what is their role in technical SEO? To define web crawlers, it is vital that you familiarize yourself with the different types of site crawlers on the web. Each spider bot has a different purpose:

1. Search Engine Bots

What is a search engine spider? A search engine spider is one of the most common SEO crawlers, used by search engines to crawl and index the Internet. Search engine bots follow robots.txt SEO protocols to understand your web crawling preferences. Knowing the answer to "what is a search engine spider?" gives you a head start on optimizing your robots.txt file and making sure it works.

2. Commercial site crawler

A commercial site crawler is a tool developed by software solutions companies to help website owners collect data from their own platforms or public sites. Several companies provide guidelines on how to build a web crawler for this purpose. Be sure to partner with a commercial web crawling company that maximizes the efficiency of an SEO crawler to meet your specific needs.

3. Personal crawler robot

A personal crawler bot is designed to help businesses and individuals collect data from search results and/or monitor their website's performance. Unlike a search engine spider, a personal crawler bot has limited scalability and functionality. If you're curious about how to make a website crawler that performs specific jobs to support your technical SEO efforts, check out one of the many guides on the internet that show you how to build a web crawler that runs from your local device.

4. Desktop Site Crawler

A desktop crawler bot runs locally from your computer and is useful for analyzing small websites. However, desktop site crawlers are not recommended if you are analyzing tens or hundreds of thousands of web pages. This is because crawling data from large sites requires custom configuration or proxy servers that a desktop crawler bot does not support.

5. Copyright Crawling Bots

A copyright website crawler looks for content that violates copyright law. This type of search bot can be operated by any company or person that owns copyrighted material, regardless of whether they know how to build a web crawler or not.

6. Cloud-based crawler robot

Cloud-based crawling bots are used as a technical SEO services tool. A cloud-based crawler bot, usually delivered as software as a service (SaaS), runs on any device with an internet connection. This type of internet spider has become increasingly popular because it can crawl websites of any size and does not require multiple licenses for use on different devices.

Why is it important to know what web crawlers are?

Search bots are usually programmed to look for robots.txt and follow its directives. However, some crawling bots, such as spambots, email harvesters, and malware bots, often ignore the robots.txt SEO protocol and do not have the best of intentions when accessing your site's content.

Understanding web crawler behavior is a proactive measure for improving your online presence and enhancing your user experience. By making an effort to understand what a search engine spider is and how it differs from bad site crawlers, you can ensure that good search engine spiders can access your website and prevent unwanted SEO crawlers from ruining your user experience (UX) and search rankings.

Imperva's 8th Annual Bad Bot Report shows that bad web crawling bots generated 25.6% of all site traffic in 2020, while good SEO spiders generated only 15.2% of traffic. Given the many disastrous activities that bad spider crawler bots are capable of, such as click fraud, account takeover, content scraping, and spamming, it's worth knowing 1) which website crawlers are beneficial to your site and 2) which bots you need to block when creating your robots.txt file.
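For example, a minimal robots.txt sketch that keeps one unwanted crawler away while leaving legitimate search spiders unrestricted might look like the following; the "BadBot" user agent name is a placeholder for whatever malicious crawler you identify in your server logs, not a real bot.

# Example: block one unwanted bot (placeholder name)
User-agent: BadBot
Disallow: /

# All other crawlers remain unrestricted
User-agent: *
Disallow:

Keep in mind, though, that truly malicious bots often ignore robots.txt directives, so treat this as a first line of defense rather than a complete solution.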

Should marketers learn how to make a website crawler?

You don't necessarily need to learn how to make a website crawler. Leave the technical aspects of developing an SEO crawler to software solutions companies and focus instead on optimizing your own robots.txt file.

"No one creates their own web crawler unless they are specifically pulling data from a site. From a technical SEO point of view, the tools for website crawling already exist. Only if you are constantly mining tens of gigabytes of data would it be profitable to build and host your own internet crawler."

How do web crawlers work?

In this fast-paced digital landscape, simply knowing what a web crawler is isn't enough to guide your robots.txt optimization. In addition to "what are web crawlers?" you should also answer "how do web crawlers work?" to ensure that you create robots.txt directives that are appropriate for your site.

Search spiders are primarily programmed to perform automatic, repetitive searches of the web to create an index. The index is where search engines store web information to retrieve it and display it in search results relevant to the user's query.

An internet crawler follows certain processes and policies to improve how it crawls your website and to achieve its crawling goals.

So how exactly does a web crawler work? Let's take a look.

1. Discover URLs. Web spiders start crawling the web from a list of URLs, then follow the links on each page to crawl more websites. To increase your site's crawlability and indexability, be sure to prioritize your website's navigability, reference a clear sitemap in robots.txt, and submit robots.txt to Google.
2. Browse a list of seeds. Search engines provide their spiders with a list of seeds, or URLs, to check. The spiders then visit each URL in the list, identify all the links on each page, and add them to the seed list to visit. Web spiders also use sitemaps and databases of previously crawled URLs to explore more pages across the web.
3. Add pages to the index. Once a search engine spider visits the listed URLs, it locates and renders the content on each web page, including text, files, videos, and images, and adds it to the index.
4. Update the index. Search engine spiders consider key signals, such as keywords and content relevance and freshness, when analyzing a web page. Once an internet crawler detects changes to your website, it updates its search index accordingly to ensure it reflects the latest version of the page.

According to Google, computer programs determine how to crawl a website. They look at perceived importance and relevance, crawl demand, and the level of interest search engines and online users have in your website. These factors affect how often an Internet spider will crawl your web pages.

So how can you make sure a web crawler meets Google's web crawling policies and your own crawl preferences?

To better communicate with search engine spiders about how to crawl a website, technical SEO service providers and WordPress web design experts advise you to create a robots.txt file that clearly states your crawling preferences. Robots.txt is one of the protocols that web spiders use to guide their crawling of data across the Internet.

You can customize your robots.txt file to apply it to specific search spiders, prohibit access to particular files or web pages, or control your robots.txt crawl delay.
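As a preview of the directives covered below, here is a minimal sketch that combines all three kinds of customization; the user agent and paths are purely illustrative, not recommendations for your site.

# Example: illustrative sketch only
User-agent: Bingbot         # rule group for one specific search spider
Disallow: /private-files/   # block a particular directory for that spider
Crawl-delay: 10             # ask that spider to wait 10 seconds between requests

User-agent: *               # rule group for all other crawlers
Disallow: /tmp/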

User-agent

The user-agent directive refers to the name of the SEO crawler a command is intended for. It is the first line of any robots.txt rule group.

The user-agent directive can use the wildcard symbol (*), which means that the rule applies to all search robots. Rules can also be applied to specific user agents.

Each SEO crawler has a different name. Google's web crawlers are called Googlebot, the Bing SEO crawler identifies itself as Bingbot, and the Yahoo internet spider is called Slurp. Lists of common user agents are available online.

# Example 1
User-agent: *
Disallow: /wp-admin/

In this example, since we used the * wildcard, robots.txt blocks all user agents from accessing the /wp-admin/ directory.

# Example 2
User-agent: Googlebot
Disallow: /wp-admin/

Here, Googlebot is specified as the user agent. This means that all other search spiders can access /wp-admin/, but Google's crawlers cannot.

# Example 3
User-agent: Googlebot
User-agent: Slurp
Disallow: /wp-admin/

Example 3 indicates that all user agents except Google's crawler and Yahoo's web spider can access the /wp-admin/ directory.

Allow

The robots.txt allow directive indicates which content is accessible to the user agent. The robots.txt allow directive is supported by Google and Bing.

Keep in mind that the robots.txt allow directive should be followed by the path that Google's web crawlers and other SEO spiders are allowed to access. If no path is indicated, Google's crawlers will ignore the allow directive.

# Example 1
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/

In this example, the robots.txt allow directive applies to all user agents. The directives prevent all search engines from accessing the /wp-admin/ directory except for the /wp-admin/admin-ajax.php page.

# Example 2: Avoid conflicting directives like this
User-agent: *
Allow: /example
Disallow: *.php

When you create robots.txt directives like these, Google's crawlers and search spiders will be confused about what to do with the URL http://www.yourwebsite.com/example.php. It is not clear which rule to follow.

To avoid Google web crawling issues, avoid using wildcards when combining the robots.txt allow and disallow directives.
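For instance, a conflict-free sketch that achieves a similar goal spells out an exact path instead of relying on a wildcard pattern; the paths below are illustrative placeholders.

# Example: conflict-free version (illustrative paths)
User-agent: *
Allow: /example/allowed-page.php
Disallow: /example/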

Disallow

The robots.txt disallow command is used to specify which URLs should not be accessed by Google crawling robots and website crawling spiders. Like the robots.txt allow command, the robots.txt disallow directive should also be followed by the path you do not want Google web crawlers to access.

# Example 1
User-agent: *
Disallow: /wp-admin/

In this example, the disallow directive prevents all user agents from accessing the /wp-admin/ directory.

# Example 2
User-agent: *
Disallow:

This empty disallow directive tells Google's web crawler and other search robots that they may crawl the entire website, because nothing is disallowed.

Note: Although this disallow rule contains only two lines, be sure to follow the correct robots.txt format. Do not write User-agent: * Disallow: on a single line, because that is incorrect. When you create robots.txt, each directive must be on a separate line.

# Example 3
User-agent: *
Disallow: /

The / symbol represents the root of a website's hierarchy. In this example, the disallow directive is equivalent to a "disallow all" rule. Simply put, you are hiding your entire website from Google's spiders and other search robots.

Note: As in the previous example (User-agent: * Disallow:), avoid writing the one-line syntax User-agent: * Disallow: / to disallow access to your website. A robots.txt format like that would confuse Google's crawler and could cause robots.txt parsing problems in WordPress.

Sitemap

The robots.txt sitemap directive is used to point Google's spiders and other web crawlers to the XML sitemap. The sitemap directive is supported by Bing, Yahoo, Google, and Ask.

So how do you add a sitemap to robots.txt? Knowing the answer is helpful, especially if you want as many search engines as possible to access your sitemap.

# Example
User-agent: *
Disallow: /wp-admin/
Sitemap: https://yourwebsite.com/sitemap1.xml
Sitemap: https://yourwebsite.com/sitemap2.xml

In this example, the disallow directive tells all search robots not to access /wp-admin/. The robots.txt syntax also indicates that there are two sitemaps for the website. When you know how to add a sitemap to robots.txt, you can place multiple XML sitemaps in your robots.txt file.

Crawl delay

The robots.txt crawl-delay directive is supported by many spider robots, although not all of them. It helps prevent search spiders from overloading a server by letting administrators specify how many seconds a crawler should wait between crawl requests.

# Example
User-agent: *
Disallow: /wp-admin/
Disallow: /calendar/
Disallow: /events/

User-agent: Bingbot
Disallow: /calendar/
Disallow: /events/
Crawl-delay: 10

Sitemap: https://yourwebsite.com/sitemap.xml

In this example, the crawl-delay directive tells Bingbot to wait a minimum of 10 seconds before requesting another URL.

Some web spiders, such as Google's web crawler, do not support the crawl-delay directive. Be sure to run your robots.txt syntax through a robots.txt checker before submitting robots.txt to Google and other search engines to avoid parsing issues.

Baidu, for example, does not support crawl-delay directives either, but you can take advantage of Baidu Webmaster Tools to control your website's crawl frequency. You can also use Google Search Console (GSC) to manage how often Google's crawler visits your site.

Host

The host directive tells search spiders your preferred mirror domain, that is, a replica of your website hosted on a different server. A mirror domain is used to distribute traffic load and avoid latency and server load on your website.

# Example
User-agent: *
Disallow: /wp-admin/

Host: yourwebsite.com

The robots.txt host directive lets you decide whether you want search engines to display yourwebsite.com or www.yourwebsite.com.

End of string operator

The $ sign is used to indicate the end of a URL and direct a Google web crawler on how to crawl a website with parameters. It is placed at the end of the path.

# Example
User-agent: *
Disallow: *.html$

In this example, the disallow directive tells Google's crawler and other user agents not to crawl website URLs that end in .html.

This means that URLs with parameters, such as https://yourwebsite.com/page.html?lang=en, would still be crawled, since the URL doesn't end right after .html.
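If you also want to keep crawlers away from those parameterized variants, one option is to add a second pattern without the $ operator so that URLs containing ".html?" are matched as well. Treat this as an illustrative sketch and test it in a robots.txt checker before relying on it.

# Example: also matches .html URLs followed by query parameters
User-agent: *
Disallow: *.html$
Disallow: *.html?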

Comments

Comments serve as a guide for web design and development specialists, and they are preceded by the # sign. They can be placed at the beginning of a robots.txt line or after a directive. If you place a comment after a directive, make sure both are on the same line.

Everything after the # sign will be ignored by Google's crawling robots and search spiders.

# Example 1: Block access to the /wp-admin/ directory for all search robots.
User-agent: *
Disallow: /wp-admin/

# Example 2
User-agent: * # Applies to all search spiders.
Disallow: /wp-admin/ # Blocks access to the /wp-admin/ directory.

What is Robots.txt used for?

The robots.txt syntax is used to manage spider crawl traffic to your website. It plays a crucial role in making your website more accessible to search engines and online visitors.

Do you want to learn how to use robots.txt and create robots txt for your website? Here are the main ways you can improve your SEO performance with robots.txt for WordPress and other CMS:

1. Avoid overloading your website with Google web crawling and search bot requests.
2. Prevent Google's crawl robots and search spiders from crawling private sections of your website using robots.txt disallow directives (see the combined example after this list).
3. Protect your website from malicious bots.
4. Maximize your crawl budget, which is the number of pages web crawlers can crawl and index on your website within a given time period.
5. Increase the crawlability and indexability of your website.
6. Avoid duplicate content in search results.
7. Hide unfinished pages from Google web crawlers and search spiders before they are ready for publication.
8. Improve your user experience.
9. Pass link equity, or link juice, to the correct pages.
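To tie these uses together, here is a hedged sketch of a robots.txt file that applies several of them at once; every path below is a placeholder you would replace with directories from your own site.

# Example: illustrative only; adjust the paths to your own site
User-agent: *
Disallow: /staging/      # hide unfinished pages from search spiders
Disallow: /private/      # keep private sections out of crawls
Disallow: /search?       # save crawl budget on low-value parameter URLs

Sitemap: https://yourwebsite.com/sitemap.xml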

Wasting your crawl budget and resources on pages with low-value URLs can negatively impact your site's crawlability and indexability. Don't wait until your site experiences serious technical SEO issues and a significant drop in rankings before you finally prioritize learning how to create robots.txt for SEO.

Master Google robots.txt optimization and you'll protect your website from harmful bots and online threats.

Do all websites need a robots.txt file?

Not all websites need a robots.txt file. Search engines like Google already have systems for deciding how to crawl a website's pages, and they automatically ignore duplicate or unimportant versions of a page.

However, technical SEO specialists recommend that you create a robots.txt file and implement robots.txt best practices to enable better, faster web crawling and indexing by Google's crawl robots and search spiders.

New websites don't need to worry much about how to use robots.txt, since their goal is to make their web pages accessible to as many search spiders as possible. On the other hand, once your website is more than a year old, it will start gaining traffic and attracting more crawl requests from Google and other search spiders, which can cause problems.

"[When this happens] you will need to block those URLs in the WordPress robots.txt file so that your crawl budget is not affected," Dagohoy said. "Keep in mind that search engine bots crawl less on websites with a lot of broken URLs, and you don't want that for your site."

As mentioned above, knowing how to edit robots.txt for SEO gives you a significant advantage. More importantly, it gives you peace of mind knowing that your website is protected from malicious attacks from malicious bots.

Location of WordPress Robots.txt

Ready to create robots.txt? The first step toward meeting your crawl budget goals is to learn how to find robots.txt on your website. You can find the location of your WordPress robots.txt file by going to your site's URL and adding /robots.txt.

For example: yourwebsite.com/robots.txt

Besides the disallow and allow directives, the robots.txt file typically also includes a sitemap directive to point web crawlers to the XML sitemap and avoid wasting crawl budget.

Where is Robots.txt in WordPress?

WordPress is considered the most popular and widely used CMS in the world, powering approximately 40 percent of all websites on the web. It's no surprise that many website owners want to learn how to edit robots.txt in WordPress. Some even turn to WordPress web design professionals for help with robots.txt optimization for WordPress.

Where is robots.txt in WordPress? Follow these steps to access your WordPress robots.txt file:

1. Log in to your WordPress dashboard as an administrator.
2. Navigate to "SEO."
3. Click "Yoast." This is a WordPress plugin that you need to install on your website to edit WordPress robots.txt and make robots.txt updates whenever you need to.
4. Click "File Editor." This tool allows you to make quick changes to your robots.txt directives.
5. You can now view your WordPress robots.txt file and edit the WordPress robots.txt directives.

Want to access robots.txt in WordPress later and update your disallow directives? Simply follow the same process you used to find where robots.txt is located in WordPress.

Don't forget to save any changes you make to your WordPress robots.txt file to ensure your disallow and allow directives stay up to date.

How to find Robots.txt in cPanel

cPanel is one of the most popular Linux-based control panels, used to manage web hosting accounts with maximum efficiency. Web developers also use cPanel to create a robots.txt file.

How to find robots.txt in cPanel: follow these steps to access your robots.txt file in cPanel.

1. Sign in to your cPanel account.
2. Open "File Manager" and go to the root directory of your site.
3. You should find your robots.txt file in the same location as the index, or first page, of your website.

How to edit Robots.txt in cPanel

If you want to edit your robots.txt disallow directives or make any other changes to your robots.txt syntax, simply:

1. Highlight the robots.txt file.
2. Click "Editor" or "Edit Code" in the top menu to edit your robots.txt directives.
3. Click "Save Changes" to save the latest modifications to your robots.txt file.

How to create Robots.txt in cPanel

To create a robots.txt file in cPanel, perform the following steps:

1. Sign in to your cPanel account.
2. Go to the "Files" section and click "File Manager."
3. Click "New File" and press the "Create New File" button. You can now create a robots.txt file.

How to find Magento Robots.txt

 

In addition to the common question of how to access robots.txt in WordPress, many website owners also want to learn how to access, edit, and optimize the Magento robots.txt file to better communicate restricted URLs to search spiders.

Magento is an eCommerce platform built on PHP, designed to help web developers create SEO-optimized eCommerce websites. So how do you find the Magento robots.txt file?

1. Log in to your Magento dashboard.
2. Go to the admin panel, then click "Stores."
3. Go to "Settings," then select "Configuration."
4. Open the "Search Engine Robots" section. You can now view and edit your robots.txt file to determine which URLs are restricted.
5. When finished, click the "Save Settings" button.

What about creating robots.txt in Magento? The same process applies when you create a robots.txt file for Magento. You can also click the "Reset Defaults" button if you need to restore the default instructions.

Robots.txt Best Practices

Learning how to access robots.txt in WordPress and how to edit robots.txt on other platforms are just the initial steps toward optimizing your robots.txt disallow and allow directives.

To guide your robots.txt optimization process, follow these steps:

1. Run regular audits using a robots.txt checker. Google offers a free robots.txt tester to help you find any robots.txt issues on your website.
2. Learn how to add a sitemap to robots.txt and apply it to your robots.txt file.
3. Take advantage of robots.txt disallow directives to prevent search robots from accessing private files or unfinished pages on your website.
4. Check your server logs.
5. Monitor your crawl report in Google Search Console (GSC) to see how search spiders are crawling your website. The GSC report shows your total crawl requests by response, file type, purpose, and Googlebot type.
6. Check whether your website is receiving traffic and requests from malicious bots. If so, block them with disallow-all rules for those user agents in your robots.txt file.
7. If your website has a lot of 404 and 500 errors that are causing crawling problems, implement 301 redirects. If the errors grow rapidly into millions of 404 pages and 500 errors, you can use disallow rules to keep some user agents away from the affected pages and files. Be sure to optimize your robots.txt file to resolve recurring web crawling issues.
8. Request professional technical SEO services and web development solutions to correctly implement disallow-all, allow, and other directives in your robots.txt syntax.

Common Robots.txt Errors You Should Avoid

Take note of these common mistakes when creating your robots.txt file and be sure to avoid them to improve your site's crawlability and online performance:

❌ Placing robots.txt directives on a single line. Each robots.txt directive should always sit on a separate line to provide clear instructions to web crawlers on how to crawl a website.
Incorrect: User-agent: * Disallow: /
Incorrect: User-agent: * Disallow:
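The correct format places each directive on its own line:

Correct:
User-agent: *
Disallow: /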

❌ Failing to submit the updated robots.txt file to Google. Always submit your updated robots.txt file to Google. Whether you have made small changes, such as adding a disallow-all rule for specific user agents or removing a disallow-all rule, be sure to click the Submit button. This way, Google will be notified of any changes you have made to your robots.txt file.

❌ Using incorrect robots.txt directives. If you do, your website runs the risk of not being crawled by search robots, losing valuable traffic, and, worse, suffering a sudden drop in search rankings.

❌ Not placing the robots.txt file in the root directory. Putting your robots.txt file in a subdirectory makes it undetectable by web crawlers.
Incorrect: https://www.yourwebsite.com/assets/robots.txt
Correct: https://www.yourwebsite.com/robots.txt

❌ Improper use of disallow-all commands, wildcards, forward slashes, and other directives. Always run your robots.txt file through a robots.txt validator before saving it and submitting it to Google and other search engines, so that it does not generate robots.txt errors.

❌ Relying solely on a robots.txt file generator. Although a robots.txt generator is a useful tool, relying on it without manually checking the disallow rules, allow rules, and user agents in your robots.txt file is bad practice. If you have a small website, using a robots.txt generator is acceptable. But if you own an eCommerce website or offer many services, be sure to get expert help to create and optimize your robots.txt file.

❌ Ignoring robots.txt validator reports. A robots.txt validator is there for a reason, so make the most of your robots.txt checker and other tools to ensure your robots.txt SEO optimization efforts are on the right track.

Take control of your crawl budget

Dealing with robots.txt optimization and other technical SEO issues can be exhausting, especially if you don't have the proper resources, manpower, and capabilities to perform the necessary tasks. Don't stress yourself out dealing with website issues that professionals could resolve quickly.

Do you need to update your website?

Do you need any of our web design services? At IndianWebs, we have extensive experience and a team of programmers and web designers across different specialties, so we can offer a wide range of services for creating custom web pages. Whatever your project is, we will tackle it.