robots.txt

Use a robots.txt file to prevent search engines, such as Google, from indexing all or some of your site's pages or assets.

Backlight ships with a default robots.txt file, though you will need to manually move it to make use of it. The file is located in:

/backlight/modules/custom-sources

The robots.txt file should be added to the root of your server. For most Backlight sites, that means it will sit alongside the backlight folder, like this:

.htaccess
backlight
galleries
index.php
robots.txt

You can add a robots.txt file to your site at any time, though the results will not be immediate if search engines have already crawled and indexed your site. They should catch up eventually.

Read Google's "About robots.txt" documentation for complete information, or test your robots.txt with Google's robots.txt Tester.

Ultimately, it's up to you whether to use a robots.txt file, and what rules to include. But let's tackle a couple of scenarios.

Block Indexing of Thumbnail Images

One possible use of the robots.txt file is to block indexing of your thumbnail files. This would prevent them from being included in Google Images searches, for example. Why would we do this? Because we want Google to index the large-size images, not the thumbnails!

We can disallow indexing of our thumbnails by including this in our robots.txt file:

User-agent: *
Disallow: /*/thumbnails/*.jpg$

This disallows all user-agents from indexing JPG files inside folders named thumbnails. The * wildcard matches any sequence of characters, and the trailing $ anchors the rule to URLs ending in .jpg.
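
For example, assuming a typical Backlight gallery structure (the paths below are only illustrative), the rule would block:

  • /galleries/birds/thumbnails/photo.jpg

But would not block:

  • /galleries/birds/photos/photo.jpg (not inside a thumbnails folder)
  • /galleries/birds/thumbnails/photo.png (the $ restricts the rule to URLs ending in .jpg)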

Block Indexing of Sensitive Albums

You might like to prevent search engine indexing of specific albums or album sets, for example, if you photograph children, nudes or other sensitive subjects. Or you might have a clients area that you want to ensure doesn't show up in Google Images results.

The best way to keep Google out of your albums is to set rules in your robots.txt file.

This disallows all user-agents from indexing a top-level set named clients:

User-agent: *
Disallow: /clients

This disallows all user-agents from indexing an album or album set named clients, located in your galleries folder:

User-agent: *
Disallow: /galleries/clients

As you can see, the paths are very specific, and matching is case-sensitive, so make sure to get them right.

You can use wildcards in your path:

User-agent: *
Disallow: /*/nudes

This would prevent indexing of any of the following:

  • /clients/nudes
  • /galleries/nudes
  • /galleries/female/nudes
  • /galleries/nudes/female
  • /galleries/nudes/male

But would NOT safeguard a top-level set:

  • /nudes

You can set multiple directives for a block:

User-agent: *
Disallow: /clients
Disallow: /nudes
Disallow: /*/nudes

About Specificity

Crawlers follow the ruleset whose user-agent line most specifically matches them, and ignore the rest. So we might expand our directives to encourage more nuanced behavior. Rulesets are separated by blank lines. Consider this:

User-agent: *
Disallow: /*/thumbnails/*.jpg$

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg$

User-agent: Googlebot-Image
Disallow: /*/thumbnails/*.jpg$

Here we have three distinct rulesets, or groups. The first group uses a wildcard to target all user-agents, but in practice it applies only to non-Google user-agents. That's because the subsequent groups provide alternative rules specific to Google's bots.

For Google, the second group essentially undoes the first, and so Google Search does index the thumbnails. But they won't appear in Google Images searches, because the third group specifically blocks Googlebot-Image from indexing the thumbnails.

The order of the directives is unimportant; swap the position of groups 2 and 3, and the outcome would be the same. Ordering has no bearing on specificity.

Also, a user-agent will use or ignore a block in its entirety. Consider this example with two blocks:

User-agent: *
Disallow: /*/thumbnails/*.jpg$
Disallow: /clients

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg$

Due to specificity, Googlebot ignores the wildcard block entirely and follows only its own block. We're allowing Googlebot to index thumbnails, but since the Googlebot block doesn't explicitly disallow the clients folder, Googlebot will index that too.
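
To keep the clients folder blocked for Googlebot as well, repeat that directive inside the Googlebot group. For example:

User-agent: *
Disallow: /*/thumbnails/*.jpg$
Disallow: /clients

User-agent: Googlebot
Allow: /*/thumbnails/*.jpg$
Disallow: /clients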

Blocking AI bots

It is now possible to block web crawlers from various AI projects using directives in your robots.txt file. To help protect our users from having their images included in datasets for generative AI tools, we have included such directives in our distributed robots.txt file from Backlight 5.4.1 onward, though you will need to manually copy this file into your site root to use it.

We will here endeavor to provide as complete a list of these crawlers and external resources as possible, so that you may further customize your robots.txt file.

Anthropic AI / Claude

We are including directives to block Anthropic AI and their Claude AI assistant, following patterns being used elsewhere. However, it is not clear whether these directives, which target the user-agents "anthropic-ai" and "Claude-Web", will be effective, as Anthropic has published no documentation for them.
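
Based on those patterns, the directives look like this, though again, whether they will be honored is unconfirmed:

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /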

Common Crawl

Common Crawl, whose crawler is CCBot, is a 501(c)(3) non-profit organization dedicated to providing a copy of the Internet to Internet researchers, companies and individuals at no cost for the purpose of research and analysis.
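
To keep your site out of the Common Crawl dataset, block the CCBot user-agent site-wide:

User-agent: CCBot
Disallow: /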

Google Extended

According to Google, Google Extended is "A standalone product token that web publishers can use to manage whether their sites help improve Bard and Vertex AI generative APIs, including future generations of models that power those products."

The new crawler has been added to the Google Search Central documentation on web crawlers.
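
To opt out, block the Google-Extended token; this affects only the use of your content for those generative AI products, not ordinary crawling for Google Search:

User-agent: Google-Extended
Disallow: /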

GPTBot

OpenAI's GPTBot gathers data for ChatGPT and its related products.
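
OpenAI documents how to disallow GPTBot in robots.txt; to block it site-wide:

User-agent: GPTBot
Disallow: /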