Guru
Oct 8, 2011
Sha Menz
Project Manager, Inbound Marketing at BF Design Studio
Hi Rick,
If you wish to use the robots.txt method to disallow all or part of your site's https protocol, you simply need to load two separate robots.txt files.
The http and https protocols are basically viewed by bots as if they were two completely separate root domains (which I guess you already know as you have mentioned the fact that port 443 is used for the secure protocol).
Google's advice is that to use this method, you should have a separate robots.txt file for each protocol with code as follows:
For your http protocol (
http://www.startuploans.org/robots.txt):
User-agent: *
Allow: /
For the https protocol (
https://www.startuploans.org/robots.txt):
User-agent: *
Disallow: /
However, blocking crawlers with robots.txt is not the most reliable method for excluding pages from Search engines. The reason for this is that the page will continue to be indexed if it happens to be found via a link from another page. Basically, the robots.txt is the sign on the front door that says "Please stay out of our house", but it is never seen by the people who enter via the rear exit or climb in a window!
The most reliable method of excluding pages is to add the noindex meta tag as suggested by MagentoWebDeveloper and Alan.When a bot encounters the noindex meta tag it will send a signal to the search engine to de-index the page and there is no further problem.
I would generally use noindex, follow rather than noindex, nofollow as the nofollow tag will stop the flow of link value through your site. In most cases, as long as the noindex is in place, there is no reason to be worried about the links on the pages being followed.
You should NEVER use both methods at the same time.