How to Deindex Web Pages and Files from Google in 2021?


Do you want to deindex a page or a group of pages of your site from search engines (Google, Qwant, Bing, etc.)? Do you want to deindex a PDF file or other files without source code? Do you want to block the indexing of a new site or a site under construction? Are you wondering which methods are the most effective? Here is everything that works in 2021 to effectively deindex (or block the indexing of) a web page or any type of file hosted on your site (image, pdf, doc, …).

As Google announced in 2019, deindexing pages, files and groups of pages via the “Noindex:” directive of robots.txt has no longer been taken into account by the world’s number 1 search engine since September 1, 2019.

Here is an overview of all the alternatives to this simple method of deindexing.

ATTENTION 1: before making any changes to your site that affect indexing, I strongly advise you to surround yourself with professionals (developers and SEOs) to avoid any heavy impact on your website and your business. These techniques should ideally not be experimented with by beginners.

ATTENTION 2: indexing must not be confused with crawling. Even if a page carries a noindex directive, it can still be crawled by search engines. So, if your goal is to save your crawl budget, you will have to combine your noindex directives with blocking those same pages via robots.txt (but only once they have actually been deindexed).


1 – The html meta noindex tag to be placed in the <head> of the page

The meta noindex tag is the method most used by webmasters, developers and SEOs to deindex (or prevent the future indexing of) web pages in search engines like Google. It is also the method preferred by search engines because it is supported by all of them.

For its part, Google considers this to be THE most effective method of deindexing or index blocking (whether the noindex is placed in the html code or sent via the http headers).

Very easy to set up in the html code, it does not require much technical knowledge, but it must be used only on pages you actually want deindexed, otherwise your site’s rankings and traffic could drop drastically!

How to set up the noindex meta tag on a page or a group of html pages?

On the pages that you want blocked from indexing or removed from the Google index, you will therefore need to add this simple code in the <head> section of your web page:

<meta name="robots" content="noindex">

If you only want to block indexing for Google, you can do so with a directive like this:

<meta name="googlebot" content="noindex">

If you want to automate the addition of this type of tag on a specific group of pages, you will have to go through a developer. To check the correct implementation of these tags, Chrome extensions can be useful if you do not want to press CTRL + U and analyze the source code of the pages yourself.
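If you prefer the command line, a quick check is also possible with curl (assuming it is installed; the URL is just an example):

curl -s https://www.example.com/page-to-deindex/ | grep -i noindex

If the tag is in place, the command will print the line containing the noindex directive.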

How long does it take for Google to take noindex tags into account?

Once the noindex meta tags are in place, the deindexing will take place when Google has recrawled your pages. This can take from a few days to several weeks depending on your site and the depth of the pages set to noindex.

This is also why pages with a noindex must absolutely not be blocked in the robots.txt before they are deindexed! That would completely prevent Google from discovering the noindex tags and therefore from taking them into account.

How to speed up Google’s processing of the noindex tag?

Once the noindex tag is in place, to speed up the process for a selection of pages that you want deindexed quickly, you can use Search Console to inspect the page in question and then ask Google to crawl it again.

This action will reduce the time it takes for the meta noindex to be taken into account by the search engine; however, you will not be able to do it for all of your pages.

2 – The X-Robots-Tag noindex HTTP header to deindex pages and files without source code (pdf, images, etc.)

If you want to deindex files that have no source code, you will have no choice but to use the X-Robots-Tag: noindex HTTP header to deindex them from search engines.

To do this, you will have to call on a developer if your technical knowledge does not allow you to safely modify the .htaccess or httpd.conf file.

Concretely, to deindex a web page, a pdf file or even an image using the X-Robots-Tag http header, you just need to add the “noindex” value to it.

Here is what it should look like in the http headers of the pages and files to be deindexed:

HTTP/1.1 200 OK
Date: Tue, 25 May 2019 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)

How to implement the X-Robots-Tag noindex in concrete terms?

To set it up, you will have to edit your .htaccess file and include directives like these:

Example 1: code to place in the .htaccess at the root of the site to deindex all the pdf files of a site:


<Files ~ ".pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

Example 2: code to deindex all image files (png, jpeg, gif) of an entire site:


<Files ~ ".(png|jpe?g|gif)$">
Header set X-Robots-Tag "noindex"
</Files>

To set up a noindex on a group of pages generated via PHP, you will need to handle it in PHP, for example in header.php, with code of this type:

header("X-Robots-Tag: noindex", true);

If you want search engines not to follow the links on a page (not very logical in most cases), you can associate the “noindex” with the “nofollow” directive.
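Once deployed, you can verify that the header is actually returned with a simple curl command (the URL being just an example):

curl -I https://www.example.com/document.pdf

The response headers should contain a line such as X-Robots-Tag: noindex (or noindex, nofollow).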

If you want to consult the official Google documentation about the X-Robots-Tag, it’s here: https://developers.google.com/search/reference/robots_meta_tag?hl=fr

3 – Return an HTTP 404 or 410 response code to deindex pages and files that no longer exist (and have not been replaced)

If a web page or a file no longer exists on your site and nothing replaces it, returning a 404 (Not Found) or 410 (Gone) response code is good practice and will eventually lead to the deindexing of this page or resource.

What are the differences between code 404 and 410 in SEO?

404 Not Found: the resource was not found.
410 Gone: the resource is no longer available and no redirect address is known.

For Google, the two codes indicate that the resource does not exist; however, the 410 confirms that it existed in the past and will no longer exist in the future, so it is more precise than a 404 response code.
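On an Apache server, for example, a 410 can be returned directly from the .htaccess file (the URL of the deleted page below is of course just an example):

# Return a 410 (Gone) code for a permanently deleted page
Redirect gone /old-deleted-page/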

4 – Block the crawl of pages and files via robots.txt (when they have never been indexed in the past)

If you want to block the indexing of pages or files of a new site, or of a site that has never been crawled or indexed by search engines, then in this specific case only, blocking the crawl via robots.txt means that these pages and resources cannot be crawled, and therefore will never be indexed.

To do so, you just need to add directives of this type in your robots.txt file (available at the root of your site):

Example 1: to block the crawl (and indexing, if the site is new) of .pdf files:

Disallow: /*.pdf

Example 2: to block the crawl (and indexing, if the site is new) of the pages of a specific category:

Disallow: */categorie-a-ne-pas-indexer/*
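Note that a Disallow line only applies inside a User-agent group; a complete minimal robots.txt combining the two examples above could therefore look like this:

User-agent: *
Disallow: /*.pdf
Disallow: */categorie-a-ne-pas-indexer/*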

Before implementing such directives on your site, I strongly recommend that you use Google’s robots.txt testing tool to check them before going live.

5 – Block URLs not to be indexed with a password

Widely used to block the indexing of a preprod or dev server, this technique is also effective if search engines have never been able to crawl and index the site and its pages in the past.

To set this up, just configure password protection (HTTP Basic Authentication) via the .htaccess and .htpasswd files.

Here’s what an example code looks like in the .htaccess:


AuthType Basic
AuthName "Cet espace est interdit"
AuthUserFile /path/to/.htpasswd
Require valid-user
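The .htpasswd file itself can be generated with the htpasswd utility shipped with Apache (the path matches the AuthUserFile above; the user name is just an example):

# Create the .htpasswd file with a first user (the password is asked interactively)
htpasswd -c /path/to/.htpasswd username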

