Frequently Asked Questions
In a nutshell, a robots.txt file tells search engine crawlers which URLs they can access on your site.
Robots.txt files are used mainly to keep crawlers away from content you don’t want crawled and to avoid overloading your site with requests.
Google supports the following fields:
user-agent: identifies which crawler the rules apply to.
allow: a URL path that may be crawled.
disallow: a URL path that may not be crawled.
sitemap: the complete URL of a sitemap.
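For illustration, a minimal robots.txt using all four fields might look like the following (the paths and sitemap URL are placeholders):

    # Block all crawlers from the /private/ directory
    User-agent: *
    Disallow: /private/

    # Let Googlebot crawl everything except /private/
    User-agent: Googlebot
    Allow: /
    Disallow: /private/

    # Full URL of the sitemap
    Sitemap: https://www.example.com/sitemap.xml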
Limitations of robots.txt
A robots.txt file does not prevent content from being indexed by Google.
To keep a web page out of Google Search, use noindex or password-protect the page.
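For example, an HTML page can be kept out of Google’s index with a noindex rule in a robots meta tag (note that the page must remain crawlable, i.e. not disallowed in robots.txt, for the rule to be seen):

    <meta name="robots" content="noindex">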
robots.txt directives may not be supported by all search engines.
While Googlebot and other respectable web crawlers obey the instructions in a robots.txt file, other crawlers might not. Therefore, if you want to keep information secure from web crawlers, it's better to use other blocking methods, such as password-protecting private files on your server.
Different crawlers interpret syntax differently.
Although respectable web crawlers follow the directives in a robots.txt file, each crawler might interpret the directives differently. You should know the proper syntax for addressing different web crawlers as some might not understand certain instructions.
A page that's disallowed in robots.txt can still be indexed if linked to from other sites.
While Google won't crawl content blocked by a robots.txt file, Google could find and index a disallowed URL if it is linked from other places on the web. As a result, the URL and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To prevent this, use noindex.
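For content that has no HTML head, such as PDFs, the noindex rule can be sent as an X-Robots-Tag HTTP response header instead of a meta tag. A minimal sketch, assuming an Apache server with mod_headers enabled (adjust for your own setup):

    # Send a noindex header with every PDF response
    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>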
Additional information:
https://developers.google.com/search/docs/advanced/robots/robots_txt#syntax