Use a robots.txt file to prevent search engines, such as Google, from indexing all or some of your site's pages or assets.
The robots.txt file should be added to the root of your server. For most Backlight sites, that means it will sit alongside the backlight folder, like this:
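As a sketch, the layout might look like this (the galleries folder is an assumption; your installation may differ):

```
/                  (web root)
├── backlight/
├── galleries/
└── robots.txt
```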
You can add a robots.txt file to your site at any time, though the results will not be immediate if search engines have already crawled and indexed your site. They should catch up eventually.
Ultimately, it's up to you whether to use a robots.txt file, and what rules to include. But let's tackle a couple of scenarios.
Block Indexing of Thumbnail Images
One possible use of the robots.txt file is to block indexing of your thumbnail files. This would prevent them from being included in Google Images searches, for example. Why would we do this? Because we want Google to index the large-size images, not the thumbnails!
We can disallow indexing of our thumbnails by including this in our robots.txt file:
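For example (this assumes Backlight writes thumbnails to folders named thumbnails; check the actual paths on your own site):

```
User-agent: *
Disallow: /*/thumbnails/*.jpg
```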
This disallows all user-agents from indexing JPG files inside your albums' thumbnail folders.
Block Indexing of Sensitive Albums
You might like to prevent search engine indexing of specific albums or album sets, for example, if you photograph children, nudes or other sensitive subjects. Or you might have a clients area that you want to ensure doesn't show up in Google Images results.
The best way to keep Google out of your albums is to set rules in your robots.txt file.
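For example, to block a top-level set (the path /clients/ is a hypothetical name):

```
User-agent: *
Disallow: /clients/
```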
This disallows all user-agents from indexing a top-level set at the given path.
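An album or album set nested deeper in the site is blocked the same way; here the galleries folder name is an assumption about your site's structure:

```
User-agent: *
Disallow: /galleries/clients/
```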
This disallows all user-agents from indexing an album or album set named clients, located deeper within your folder structure.
As you can see, the paths are very specific. Make sure to get them right.
You can use wildcards in your path:
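For example (the folder name is hypothetical):

```
User-agent: *
Disallow: /*/clients/
```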
This would prevent indexing of any of the following:
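With a wildcard rule such as Disallow: /*/clients/, paths like these (hypothetical) would all be blocked:

```
/galleries/clients/
/portfolio/2024/clients/
```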
But would NOT safeguard a top-level set:
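For instance, a rule like Disallow: /*/clients/ would not match a set living at the site root:

```
/clients/
```

The pattern requires a path segment before /clients/, so the top-level path escapes the rule.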
You can set multiple directives for a block:
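A single block can carry several Disallow lines (paths hypothetical):

```
User-agent: *
Disallow: /clients/
Disallow: /*/clients/
Disallow: /*/thumbnails/*.jpg
```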
A crawler follows only the ruleset that most specifically matches its user-agent, and ignores the rest. So we might expand our directives to encourage more nuanced behavior. Rulesets are separated by blank lines. Consider this:
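A sketch of such a file (folder names and patterns are assumptions):

```
User-agent: *
Disallow: /*/thumbnails/*.jpg

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg

User-agent: Googlebot-Image
Disallow: /*/thumbnails/*.jpg
```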
Here we have three distinct rulesets, or groups. The first group uses a wildcard to target all user-agents, but in practice applies only to non-Google user-agents. That's because the subsequent groups provide alternative rules specific to Google's bots.
For Google, the second group essentially undoes the first, and so Google Search does index the thumbnails. Google Images searches do not, however, because the third group specifically blocks the indexing of thumbnails for Google Images.
The order of the directives is unimportant; swap the position of groups 2 and 3, and the outcome would be the same. Ordering has no bearing on specificity.
Also, a user-agent will obey or ignore a block in its entirety. Consider this example with two blocks:
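As a sketch (paths are assumptions):

```
User-agent: *
Disallow: /*/thumbnails/*.jpg
Disallow: /clients/

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg
```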
The wildcard rules are overridden in their entirety by the Googlebot block due to specificity. We're allowing Googlebot to index thumbnails, but because Googlebot now ignores the previous block of rules, it will also index the clients folder, since we're not explicitly disallowing it in the Googlebot block of directives.