Use a robots.txt file to prevent search engines, such as Google, from indexing all or some of your site's pages or assets.
The robots.txt file should be added to the root of your server. For most Backlight sites, that means it will sit alongside the backlight folder, like this:
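As a sketch, the layout might look like this (the galleries folder is an assumption; your installation may differ):

```
/                  (web root)
├── backlight/
├── galleries/
└── robots.txt
```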
You can add a robots.txt file to your site at any time, though the results will not be immediate if search engines have already crawled and indexed your site. They should catch up eventually.
Ultimately, it's up to you whether to use a robots.txt file, and what rules to include. But let's tackle a couple of scenarios.
Block Indexing of Thumbnail Images
One possible use of the robots.txt file is to block indexing of your thumbnail files. This would prevent them from being included in Google Images searches, for example. Why would we do this? Because we want Google to index the large-size images, not the thumbnails!
We can disallow indexing of our thumbnails by including this in our robots.txt file:
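For example (this assumes Backlight writes thumbnails to folders named thumbnails; check the actual paths on your own site):

```
User-agent: *
Disallow: /*/thumbnails/*.jpg
```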
This disallows all user-agents from indexing JPG files inside your albums' thumbnail folders.
Block Indexing of Sensitive Albums
You might like to prevent search engine indexing of specific albums or album sets, for example, if you photograph children, nudes or other sensitive subjects. Or you might have a clients area that you want to ensure doesn't show up in Google Images results.
The best way to keep Google out of your albums is to set rules in your robots.txt file.
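For example, to block a top-level set (the path /clients/ is a hypothetical name):

```
User-agent: *
Disallow: /clients/
```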
This disallows all user-agents from indexing a top-level set at the given path.
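An album or album set nested deeper in the site is blocked the same way; here the galleries folder name is an assumption about your site's structure:

```
User-agent: *
Disallow: /galleries/clients/
```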
This disallows all user-agents from indexing an album or album set named clients, located deeper within your folder structure.
As you can see, the paths are very specific. Make sure to get them right.
You can use wildcards in your path:
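For example (the folder name is hypothetical):

```
User-agent: *
Disallow: /*/clients/
```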
This would prevent indexing of any of the following:
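With a wildcard rule such as Disallow: /*/clients/, paths like these (hypothetical) would all be blocked:

```
/galleries/clients/
/portfolio/2024/clients/
```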
But would NOT safeguard a top-level set:
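For instance, a rule like Disallow: /*/clients/ would not match a set living at the site root:

```
/clients/
```

The pattern requires a path segment before /clients/, so the top-level path escapes the rule.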
You can set multiple directives for a block:
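A single block can carry several Disallow lines (paths hypothetical):

```
User-agent: *
Disallow: /clients/
Disallow: /*/clients/
Disallow: /*/thumbnails/*.jpg
```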
A crawler follows only the ruleset that most specifically matches its user-agent, and ignores the rest. So we might expand our directives to encourage more nuanced behavior. Rulesets are separated by blank lines. Consider this:
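A sketch of such a file (folder names and patterns are assumptions):

```
User-agent: *
Disallow: /*/thumbnails/*.jpg

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg

User-agent: Googlebot-Image
Disallow: /*/thumbnails/*.jpg
```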
Here we have three distinct rulesets, or groups. The first group uses a wildcard to target all user-agents, but in practice applies only to non-Google user-agents. That's because the subsequent groups provide alternative rules specific to Google's bots.
For Google, the second group essentially undoes the first, and so Google Search does index the thumbnails. Google Images searches do not, however, because the third group specifically blocks the indexing of thumbnails for Google Images.
The order of the directives is unimportant; swap the position of groups 2 and 3, and the outcome would be the same. Ordering has no bearing on specificity.
Also, a user-agent will obey or ignore a block in its entirety. Consider this example with two blocks:
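As a sketch (paths are assumptions):

```
User-agent: *
Disallow: /*/thumbnails/*.jpg
Disallow: /clients/

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg
```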
The wildcard rules are overridden in their entirety by the Googlebot block due to specificity. We're allowing Googlebot to index thumbnails, but because Googlebot now ignores the previous block of rules, it will also index the clients folder, since we're not explicitly disallowing it in the Googlebot block of directives.