robots.txt
Use a robots.txt file to prevent search engines, such as Google, from indexing all or some of your site's pages or assets.
Backlight ships with a default robots.txt file, though you will need to move it manually to make use of it. The file is located in:
/backlight/modules/custom-sources
The robots.txt file should be added to the root of your server. For most Backlight sites, that means it will sit alongside the backlight folder, like this:
.htaccess
backlight
galleries
index.php
robots.txt
You can add a robots.txt file to your site at any time, though the results will not be immediate if search engines have already crawled and indexed your site. They should catch up eventually.
Read Google's "About robots.txt" for complete information, or test your robots.txt with the robots.txt Tester.
Ultimately, it's up to you whether to use a robots.txt file, and what rules to include. But let's tackle a couple of scenarios.
Block Indexing of Thumbnail Images
One possible use of the robots.txt file is to block indexing of your thumbnail files. This would prevent them from being included in Google Images searches, for example. Why would we do this? Because we want Google to index the large-size images, not the thumbnails!
We can disallow indexing of our thumbnails by including this in our robots.txt file:
User-agent: *
Disallow: /*/thumbnails/*.jpg$
This disallows all user-agents from indexing JPG files in folders named thumbnails.
Block Indexing of Sensitive Albums
You might like to prevent search engine indexing of specific albums or album sets, for example, if you photograph children, nudes or other sensitive subjects. Or you might have a clients area that you want to ensure doesn't show up in Google Images results.
The best way to keep Google out of your albums is to set rules in your robots.txt file.
This disallows all user-agents from indexing a top-level set named clients:
User-agent: *
Disallow: /clients
This disallows all user-agents from indexing an album or album set named clients, located in your galleries folder:
User-agent: *
Disallow: /galleries/clients
As you can see, the paths are very specific. Make sure to get them right.
You can use wildcards in your path:
User-agent: *
Disallow: /*/nudes
This would prevent indexing of any of the following:
- /clients/nudes
- /galleries/nudes
- /galleries/female/nudes
- /galleries/nudes/female
- /galleries/nudes/male
But would NOT safeguard a top-level set:
- /nudes
You can set multiple directives for a block:
User-agent: *
Disallow: /clients
Disallow: /nudes
Disallow: /*/nudes
About Specificity
Crawlers follow the most specific applicable ruleset and ignore the rest, so we might expand our directives to encourage more nuanced behavior. Rulesets are separated by blank lines. Consider this:
User-agent: *
Disallow: /*/thumbnails/*.jpg$

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg$

User-agent: Googlebot-Image
Disallow: /*/thumbnails/*.jpg$
Here we have three distinct rulesets, or groups. The first group uses a wildcard to target all user-agents, but in practice it applies only to non-Google user-agents, because the subsequent groups provide more specific rules for Google's crawlers.
For Google Search, the second group effectively undoes the first, so the thumbnails are indexed for regular web search. They still won't appear in Google Images results, however, because the third group explicitly disallows thumbnails for Googlebot-Image.
The order of the groups is unimportant; swap the positions of groups 2 and 3, and the outcome would be the same. Ordering has no bearing on specificity.
Also, a user-agent will use or ignore a block in its entirety. Consider this example with two blocks:
User-agent: *
Disallow: /*/thumbnails/*.jpg$
Disallow: /clients

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg$
Due to specificity, Googlebot ignores the wildcard block entirely in favor of its own. We're allowing Googlebot to index thumbnails, but because Googlebot now disregards the previous block of rules, it will also index the clients folder, since nothing in the Googlebot block explicitly disallows it.
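If the intent is to keep every crawler, Googlebot included, out of the clients folder while still letting Googlebot index thumbnails, repeat the directive in the Googlebot group. A sketch:
User-agent: *
Disallow: /*/thumbnails/*.jpg$
Disallow: /clients

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg$
Disallow: /clients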
Blocking AI Bots
It is now becoming possible to block web crawlers from various AI projects using directives in your robots.txt file, at least insofar as these companies are honoring robots.txt directives (dubious!).
In an attempt to help protect our users from having their images included in datasets for generative AI tools, from Backlight 5.4.1 we have begun to include such directives in our distributed robots.txt file, though you will need to copy this file into your site root manually to use it.
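To give a rough idea of what such directives look like, the following blocks a few well-known AI-related crawlers site-wide. The agent names here (GPTBot, CCBot, Google-Extended) are common examples, not necessarily the exact list shipped in Backlight's file; see the resources below for current names:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /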
Here we endeavor to provide as complete a list of external resources as possible so that you may further customize your rules. I am updating a live document in our support forum with new bots as I discover them, and will continue to update Backlight's distributed robots.txt file.
See Blocking AI Data Scrapers in Backlight for live updates.
Reporting by 404 Media and the agents list at Dark Visitors have been good resources for staying abreast of this moving landscape. Also, Cory Dransfeldt maintains a list of AI bots to block on GitHub.
For more on the history and purpose of robots.txt, and the state of things in 2024, The Verge has published an excellent article, The text file that runs the internet, well worth reading.