Sometimes you want to serve a different robots.txt file depending on the request URL, for example because you run your site directly from a CDN like I do. In that setup the same content is reachable on two hostnames, which means duplicate content; to avoid that, we want to serve a different robots.txt file, or set specific headers, per hostname.
As a small test scenario I’ll use www.lucasrolff.com as an example. www.lucasrolff.com points directly to a CDN, so the whole site runs from there using origin pull: the CDN fetches content from the origin (the web server) whenever it needs it. To pull that content, the CDN needs some URL or IP that reaches the web server. Since this site runs on a shared IP, pulling by IP isn’t possible (which would have been far simpler), so what has been done is to set up a separate hostname, o.lucasrolff.com, pointing at the origin directly.
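To make the problem concrete, here’s a minimal sketch of what such an origin vhost might look like; the document root path is an assumption for illustration, not my actual config. The point is that o.lucasrolff.com publicly serves exactly the same files the CDN serves under www.lucasrolff.com:

# Hypothetical origin vhost: o.lucasrolff.com answers with the same
# files that the CDN serves under www.lucasrolff.com, so a crawler
# that finds it could index the whole site twice.
<VirtualHost *:80>
    ServerName o.lucasrolff.com
    DocumentRoot /var/www/lucasrolff.com
</VirtualHost>

Because the origin hostname answers publicly with identical content, search engines could index the site twice, which is what the rest of this post works around.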
In this case, we’ll create a robots.txt file called real-robots.txt and fill in the content we want to allow or disallow:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Sitemap: https://www.lucasrolff.com/sitemap.xml.gz
We’ll also create a file called o-robots.txt and put our disallow in it:
User-agent: *
Disallow: /
Now we have two robots.txt files, one for each of our ‘environments’. Let’s do some mod_rewrite foo.
First we set up the X-Robots-Tag header, which Google honours; then come the rules for requests made by the CDN (www.lucasrolff.com in this case), followed by the rules for direct requests to the origin.
RewriteEngine On

# Send an X-Robots-Tag header on any request where NO_INDEX is set
Header set X-Robots-Tag "none" env=NO_INDEX

# CDN pulling robots.txt from the origin: the host is o.lucasrolff.com and
# the request comes from EdgeCast (some CDNs send a Via header instead),
# so serve the real robots.txt for the CDN to cache
RewriteCond %{HTTP_HOST} ^o\.lucasrolff\.com$
RewriteCond %{HTTP:Server} (EC)
RewriteCond %{REQUEST_URI} robots\.txt$ [NC]
RewriteRule ^ real-robots.txt [L]

# Any other request from the CDN to the origin is handled normally
RewriteCond %{HTTP_HOST} ^o\.lucasrolff\.com$
RewriteCond %{HTTP:Server} (EC)
RewriteRule ^ - [L]

# A direct (non-CDN) request for robots.txt on the origin hostname
# gets the disallow-everything file
RewriteCond %{HTTP_HOST} ^o\.lucasrolff\.com$
RewriteCond %{REQUEST_URI} robots\.txt$ [NC]
RewriteRule ^ o-robots.txt [L]

# Every other direct request to the origin is handled normally,
# but with NO_INDEX set so the X-Robots-Tag header gets added
RewriteCond %{HTTP_HOST} ^o\.lucasrolff\.com$
RewriteRule ^ - [L,E=NO_INDEX:1]
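One gotcha worth guarding against: in per-directory (.htaccess) context, mod_rewrite can rename environment variables with a REDIRECT_ prefix when an internal redirect happens, so the env=NO_INDEX condition may never match. If the X-Robots-Tag header doesn’t show up on the origin, a second directive that also matches the prefixed name should cover it:

# Fallback in case mod_rewrite prefixed the variable during an internal redirect
Header set X-Robots-Tag "none" env=REDIRECT_NO_INDEX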
The code above is quite a lot for doing a simple thing, but it makes sure the CDN caches the right robots.txt file, and that the origin is kept out of the index, both via its own robots.txt and via the X-Robots-Tag header set on all other requests. This means it shouldn’t affect your SEO in a bad way, because the duplicate content on the origin hostname never gets indexed.
If you have any questions or better ways of doing this, please let me know!