A file called robots.txt tells search engine spiders not to crawl specific pages or sections of a website. Most major search engines, including Google, Bing, and Yahoo, recognise and honour robots.txt requests.
A robots.txt file isn’t required for most websites.
This is because Google can generally discover and index all of your site’s key pages.
Google will also automatically exclude unimportant pages and duplicate versions of other pages from indexing.
However, there are three main reasons why you would want to use a robots.txt file.
There are times when you don’t want certain pages on your site to be indexed. You may have a staging version of a page, for example, or a login page.
These pages need to exist, but you don’t want random visitors landing on them. In this scenario, you’d use robots.txt to block search engine crawlers and bots from accessing those pages.
If you’re having trouble getting all of your pages indexed, you may have a crawl budget problem. By using robots.txt to block unimportant pages, Googlebot can spend more of its crawl budget on the pages that actually matter.
Meta directives can be just as effective as robots.txt for preventing pages from being indexed. However, meta directives don’t work for multimedia resources such as PDFs and images. That’s where the robots.txt file comes in.
What’s the bottom line? The robots.txt file tells search engine spiders not to crawl particular pages on your website.
In Google Search Console, you can see how many pages you’ve indexed.
You don’t need a robots.txt file if that number matches the number of pages you want indexed.
However, if the number is higher than you expected (and you notice URLs that shouldn’t be indexed), it’s time to create a robots.txt file for your website.
The first thing you should do is build your robots.txt file.
Because it’s a text file, you can make one with Windows Notepad.
And regardless of how you create your robots.txt file, the format remains the same:
User-agent: X
Disallow: Y
The user-agent is the bot you’re conversing with.
All of the pages or sections after “Disallow” are the ones you wish to block.
Consider the following scenario:
User-agent: Googlebot
Disallow: /images
This rule tells Googlebot not to crawl your website’s image folder.
You may also use an asterisk (*) to address any and all bots that visit your website.
Consider the following scenario:
User-agent: *
Disallow: /images
The “*” tells all spiders to stay away from your images folder.
There are many other ways to use a robots.txt file. This helpful guide from Google has more information on the different rules you can use to block or allow bots from crawling different pages of your site.
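If you want to sanity-check rules like the ones above before publishing them, Python’s standard-library robots.txt parser (urllib.robotparser) can do it locally. A minimal sketch, using the Googlebot rule from the example and placeholder example.com URLs:

```python
from urllib.robotparser import RobotFileParser

# The example rule from above: block Googlebot from the images folder.
rules = [
    "User-agent: Googlebot",
    "Disallow: /images",
]

parser = RobotFileParser()
parser.parse(rules)

# Googlebot is blocked from the images folder...
print(parser.can_fetch("Googlebot", "https://example.com/images/logo.png"))  # False
# ...but other pages remain crawlable.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
# Bots not named in the file are unaffected by this rule.
print(parser.can_fetch("Bingbot", "https://example.com/images/logo.png"))  # True
```

This only approximates how real crawlers interpret the file, but it’s a quick way to catch an obviously wrong rule before it goes live.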
Make it easy to find your robots.txt file.
It’s time to activate your robots.txt file now that you have it.
Search engine crawlers only request robots.txt from the root of your domain, so a file placed anywhere else will be ignored.
That means your robots.txt file should live at: https://example.com/robots.txt
(Please keep in mind that your robots.txt filename is case-sensitive, so make sure the filename starts with a lowercase “r.”)
Check for errors and mistakes.
It’s critical that your robots.txt file be properly configured. Your entire site might be deindexed if you make a single error.
Thankfully, you don’t have to hope that your code is set up correctly. You can use Google’s robots.txt testing tool to check your file for mistakes.
Robots.txt vs. meta directives
Why would you use robots.txt when you can block pages at the page level with the “noindex” meta tag?
As I mentioned earlier, the noindex tag is difficult to implement on multimedia resources such as videos and PDFs.
Also, if you have hundreds of pages to block, it’s sometimes easier to block the entire section of the site with robots.txt rather than adding a noindex tag to each page individually.
There are also cases where you don’t want to waste any crawl budget on Google landing on pages that have a noindex tag.
Outside of those three edge cases, however, I recommend using meta directives instead of robots.txt. They’re easier to implement, and there’s less risk of a disaster.
It’s sometimes more efficient to block multiple pages at once rather than listing them one by one. If they’re all in the same section of the website, a robots.txt file can simply block the directory that contains them.
As an example, consider the following:
Disallow: /mesa/
This means that no pages in the /mesa/ directory should be crawled.
The “/” denotes the “root” of a website’s hierarchy: the level from which all other pages branch out. It therefore includes the homepage and every page linked from it. With the directive “Disallow: /”, search engine bots can’t crawl the site at all.
In other words, a single slash can remove a whole website from the Internet’s searchable database!
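You can see the effect of that single slash locally with Python’s standard-library robots.txt parser. A minimal sketch (the example.com URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that disallows the root for every bot.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Every path starts with "/", so every URL on the site is blocked.
print(parser.can_fetch("Googlebot", "https://example.com/"))          # False
print(parser.can_fetch("Googlebot", "https://example.com/any/page"))  # False
```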
Allow: As you might expect, the “Allow” command tells bots that they are permitted to visit a certain webpage or directory. It lets you give bots access to one webpage while blocking the rest of the webpages in the file. Note that not all search engines recognise this command.
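Here is a sketch of how Allow and Disallow combine, again checked with Python’s standard-library parser and a hypothetical /mesa/ directory: the whole directory is blocked, but one page inside it stays crawlable. One caveat: Python’s parser applies rules in file order (first match wins), so the more specific Allow line is placed before the Disallow line; Google instead applies the most specific matching rule regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the /mesa/ directory, but leave one page
# inside it crawlable. The Allow rule comes first because this parser
# uses first-match semantics.
rules = [
    "User-agent: *",
    "Allow: /mesa/public.html",
    "Disallow: /mesa/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "https://example.com/mesa/private.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/mesa/public.html"))   # True
print(parser.can_fetch("Googlebot", "https://example.com/about"))              # True
```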