Over time, search engines will typically crawl all of your website’s pages and files unless instructed otherwise.
As long as there’s at least one link pointing to a page, search engines will find it.
Without going into too much detail, the most important concept to understand is that search engines use the information gathered while crawling to assign search rankings. Before a page can rank, it has to be crawled.
Some of your website’s pages, however, are probably more valuable than others.
Your website may have pages consisting of duplicate content, for instance, or it may have landing pages designated for paid advertising campaigns.
Allowing search engines to crawl these unwanted pages will only divert them away from your website’s more valuable pages.
To block search engines from crawling unwanted pages, you’ll need to create a robots.txt file.
What Is a Robots.txt File?
Also known as the robots exclusion protocol (or robots exclusion standard), a robots.txt file is a plain text file containing instructions on how search engines should crawl the website it belongs to.
You can use it to tell search engines which pages, files or directories of your website they shouldn’t crawl.
When they first land on your website, search engines check it for a robots.txt file.
If one is present, reputable crawlers will comply with its directives by avoiding the specified locations.
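For instance, here's a minimal sketch of a robots.txt file that tells every crawler to skip a hypothetical directory of paid-advertising landing pages (the directory name is a placeholder):
[su_note note_color="#E5E5E5"]
User-agent: *
Disallow: /ppc-landing-pages/
[/su_note]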
So how can you do this on your own blog or WordPress site?
Adding a Robots.txt File to Your WordPress Blog or Website
[su_youtube url="https://youtu.be/Fee4DY5gvLg"]
Taking the video tutorial one step further, here are a few best practices when adding a robots.txt file to your site.
1) Upload to Root Directory
Search engines will look for your website’s robots.txt in its root directory.
If you place it in a subdirectory, they won't look for it there.
Even if a search engine happened to find the file, it wouldn't obey its directives, since the standard specifically requires root-directory placement.
After creating a robots.txt file, upload it to the root directory where your website’s homepage is located.
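In other words, if your homepage is served from a hypothetical https://example.com, crawlers will only ever request the file at:
[su_note note_color="#E5E5E5"]
https://example.com/robots.txt
[/su_note]
A copy uploaded to a location like https://example.com/files/robots.txt will simply be ignored.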
2) Create a Single File
You should only create a single robots.txt file for your website.
Whether you want to block search engines from crawling one page or 1,000 pages, you can place all the necessary directives in a single file.
Distributing the directives across multiple robots.txt files won’t work.
You must name the file “robots.txt,” and you must place it in the root folder of your website.
Since you can’t have multiple files with the same name in the same directory, you can’t use multiple robots.txt files.
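For instance, blocking one page and blocking an entire section both happen in the same file; you simply add more directives to it (the paths below are placeholders):
[su_note note_color="#E5E5E5"]
User-agent: *
Disallow: /duplicate-page.html
Disallow: /archive/old-category/
[/su_note]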
3) Save in Plain Text Format
For search engines to recognize and honor your website’s robots.txt file, you must save it in plain text format.
It shouldn’t contain any Hypertext Markup Language (HTML) code, Hypertext Preprocessor (PHP) code or Cascading Style Sheets (CSS) code.
All it needs to convey crawling instructions to search engines is plain text.
Therefore, you can create a robots.txt file using a basic text editor like Notepad for Windows or TextEdit for macOS (in TextEdit, choose Format > Make Plain Text first).
Just remember to check the file extension before saving to ensure it shows “.txt,” which denotes a plain text format.
4) Place One Directive Per Line
When creating a robots.txt file, place one directive per line.
To block Googlebot from crawling two pages, for example, you should create two separate directives, each placed on its own line.
You don’t have to specify Googlebot twice.
Rather, you can specify Googlebot a single time directly above the pair of directives.
The robots exclusion standard requires the use of groups.
Each group should contain a line mentioning the user agent or agents for which the directives are intended, followed by the line-separated directives themselves.
If you use multiple groups, separate them with an empty line.
Here’s an example of a directive group that blocks Google from crawling two pages:
[su_note note_color="#E5E5E5"]
User-agent: Googlebot
Disallow: /category/page-one.html
Disallow: /category/page-two.html
[/su_note]
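And if you need different rules for different crawlers, add a second group separated from the first by an empty line. The user agents and paths below are just for illustration:
[su_note note_color="#E5E5E5"]
User-agent: Googlebot
Disallow: /category/page-one.html

User-agent: Bingbot
Disallow: /category/page-two.html
[/su_note]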
5) Beware of Capitalization
You need to be conscious of capitalization when creating your website’s robots.txt file.
While user agent names — the names of search engines’ crawlers — aren’t case-sensitive, filepaths are.
If a filepath listed in a directive uses lowercase letters when the actual path uses uppercase letters, search engines won’t obey it.
A filepath is a location in a directive group that points to a page, file or directory.
All filepaths should begin with a forward slash, followed by the exact location of the page, file or directory that you want search engines to avoid crawling.
Incorrect capitalization makes the directive ineffective, meaning search engines will still crawl the page you intended to block.
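For example, the directive below blocks /Landing-Pages/ exactly as written; it would not block a lowercase /landing-pages/ path (the directory name is a placeholder):
[su_note note_color="#E5E5E5"]
User-agent: *
Disallow: /Landing-Pages/
[/su_note]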
6) Test for Errors
It’s a good idea to test your website’s robots.txt file for errors.
Google offers a robots.txt testing tool that can reveal common errors.
To use it, add your website to Google Search Console, visit the testing tool’s URL and select your site from the drop-down menu of verified properties.
Google’s robots.txt file testing tool will display your website’s robots.txt file.
If there are any errors present, such as incorrect syntax, it will highlight them.
You can also use Google’s robots.txt file testing tool to verify the blockage of URLs.
If you created a directive to block Googlebot from crawling a page, enter the page’s filepath in the field at the bottom of the testing tool and click the “TEST” button.
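If you'd like to sanity-check a directive outside of Google's tool, here's a small sketch using Python's built-in urllib.robotparser module. The directives and URLs are the placeholder examples from step 4:
[su_note note_color="#E5E5E5"]
import urllib.robotparser

# Parse the example directives directly, so no live site is needed.
rules = """
User-agent: Googlebot
Disallow: /category/page-one.html
Disallow: /category/page-two.html
""".strip().splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)

# A blocked path prints False; an unlisted path prints True.
print(parser.can_fetch("Googlebot", "https://example.com/category/page-one.html"))
print(parser.can_fetch("Googlebot", "https://example.com/category/page-three.html"))
[/su_note]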
7) Specify Sitemap
While its main purpose is to block search engines from crawling specific pages, files or directories, you can use a robots.txt file to specify your website’s sitemap as well.
After uploading a sitemap to your website's root directory, you can add a special Sitemap directive to your site's robots.txt file pointing to its location.
Search engines will visit the sitemap, where they'll find an easy-to-parse list of all your website's pages.
The sitemap directive uses the following format:
[su_note note_color="#E5E5E5"]
Sitemap: https://example.com/sitemap.xml
[/su_note]
Keep in mind, the location of your website's sitemap should be a full URL, including your site's domain and its protocol prefix (such as https://).
You can omit this information when creating traditional disallow directives, which use paths relative to the root.
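Putting the pieces together, a complete robots.txt file with both a directive group and a sitemap reference might look like this (the domain and paths are placeholders):
[su_note note_color="#E5E5E5"]
User-agent: *
Disallow: /category/page-one.html
Disallow: /category/page-two.html

Sitemap: https://example.com/sitemap.xml
[/su_note]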
Wrapping Up
With a robots.txt file, you’ll have greater control over how search engines crawl your website.
You can use this simple text file to instruct search engines not to crawl specific pages, files or entire directories while also providing them with a sitemap.
Search engines don’t require it, but a robots.txt file is worth creating if you want search engines to crawl some parts of your website and not others.