Trouble with SEO redirects / blocking

ChrisKingWebDev · December 4, 2020, 5:44am

Hi Leon. Thanks for making all these tools. I’m very close to having the exact WordPress setup that I want. I’m just stuck on one small thing at the moment.

I’ve got a WordPress multisite setup where each site is edited at a URL like site1. mywordpress. com, deployed to S3, and served from site1. com. I would like site1. mywordpress. com to have a robots.txt file that blocks search bots, but I need the robots.txt in the S3 bucket to be the proper one. Unfortunately, the deploy uploads the blocking robots.txt, which would keep search engines away from my real site.

The easiest solution would be a way to just ignore robots.txt during the deploy so I can manually put a different one in the bucket. I could also overwrite the crawled robots.txt using a hook, if you can point me in the right direction. Or maybe there is another option I’m not thinking of.
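To make the two variants concrete, here is a minimal sketch using Python's standard `urllib.robotparser` to check what each file would tell a crawler. The file contents, the `is_crawlable` helper, and the `example.com` URL are illustrative assumptions, not anything produced by WP2Static.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for the editing host: ask every crawler to stay out.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Assumed robots.txt for the production bucket: an empty Disallow allows everything.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/about/"  # placeholder URL
    print("editing host crawlable:", is_crawlable(BLOCK_ALL, page))
    print("production crawlable:", is_crawlable(ALLOW_ALL, page))
```

If the blocking variant ends up in the bucket, well-behaved crawlers will skip every page of the production site, which is the problem described above.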

ChrisKingWebDev · December 4, 2020, 5:45am

I had to put spaces in the urls because I’m a new user.

leonstafford · December 4, 2020, 6:12am

Hi Chris,

There should be a hook available for you to tap into - just to confirm - are you using WP2Static or Static HTML Output plugin?

My approach, though not always possible, would be to keep your dev servers non-public. Then you either don’t need a robots.txt file on them at all, or you can use the one intended for production deployment, because nobody will hit it on your dev server.

Does that make sense? Let me know if there’s a blocking reason why you can’t make your dev servers non-public and I can rethink. It removes a LOT of potential security problems if only you and your team can access the sites, using something like HTTP Basic Authentication (fairly easy) or a VPN / IP-based firewall (harder).

ChrisKingWebDev · December 4, 2020, 6:47am

I’m using WP2Static with the S3 add-on.

The main reason for keeping the main site public is that I eventually want to build it out into a hands-off hosting company, and the central WordPress site in the multisite will be the site for that, which I want public.

I tried using a plugin that made all the sites require a WordPress login. Unfortunately, with it enabled, the crawler just got the login page for every URL.

ChrisKingWebDev · December 4, 2020, 4:54pm

Another possible solution would be to let the crawler use a logged-in account on all sites. Then I could keep the force-login plugin activated. Could I add that to a pre-crawl hook?

leonstafford · December 4, 2020, 11:40pm

@ChrisKingWebDev yep, that’s available currently with support for HTTP Basic Auth login/password.

I’d advise using that on the webserver rather than a login plugin, as a) I don’t think plugin-based login is supported for crawling and b) a plugin works at the WordPress application level, which isn’t as secure as the webserver level where you’d implement HTTP Basic Authentication. For the multisite, you should only need to enable basic auth once to cover all sites. If you’re with a hosting company, they may have a one-click option to enable it; otherwise there are different steps for Nginx, Apache and other webservers, using an htpasswd file.
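For reference, HTTP Basic Authentication just means each request carries an `Authorization` header containing the base64 of `username:password`. A minimal Python sketch of the header a crawler would need to send follows; the `basic_auth_header` helper and the credentials are placeholders for illustration, not WP2Static's own code.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP Basic Authentication."""
    # Credentials are joined with a colon, then base64-encoded (RFC 7617).
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Placeholder credentials; in practice these match the htpasswd entry.
    print(basic_auth_header("crawler", "change-me"))
```

Because base64 is an encoding and not encryption, this is only safe over HTTPS.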

These are some of the hooks/filters available in WP2Static, which you could tap into to add or modify a robots.txt file.

It sounds like a new add-on for managing robots.txt files could be very useful to WP2Static, i.e. input a custom one or choose what to do with the existing one when deploying.

ChrisKingWebDev · December 5, 2020, 1:37am

Thanks for the help. I think I’ll attempt writing a different robots.txt before the deploy. Failing that, I’ll just remove it and manually upload one to the S3 bucket.

An extra plugin would be really sweet. It could cover other relevant files, but I can’t think of any at the moment.
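As a rough sketch of the "write a different robots.txt before the deploy" idea, the standalone Python helper below overwrites robots.txt in an exported site directory. It assumes you know where the static files sit on disk before upload; the `write_production_robots` name, the directory argument, and the file contents are all hypothetical, and this is not wired into any WP2Static hook.

```python
from pathlib import Path

# Assumed production robots.txt: an empty Disallow lets all crawlers in.
PRODUCTION_ROBOTS = "User-agent: *\nDisallow:\n"


def write_production_robots(export_dir, contents=PRODUCTION_ROBOTS):
    """Overwrite robots.txt in the exported site directory and return its path."""
    target = Path(export_dir) / "robots.txt"
    # write_text replaces any crawled robots.txt that is already there.
    target.write_text(contents, encoding="utf-8")
    return target
```

Running something like this between the crawl and the upload would replace the blocking file, so the S3 bucket gets the permissive one while the editing host keeps its own.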