Why Google Indexes Blocked Web Pages via @sejournal, @martinibuster

Google’s John Mueller answered a question about why Google indexes pages that are disallowed from crawling by robots.txt and why it’s safe to ignore the related Search Console reports about those crawls.

Bot Traffic To Query Parameter URLs

The person asking the question documented that bots were creating links to non-existent query-parameter URLs (?q=xyz) pointing to pages that have noindex meta tags and are also blocked in robots.txt. What prompted the question is that Google is crawling the links to those pages, getting blocked by robots.txt (without seeing the noindex robots meta tag), then reporting them in Google Search Console as “Indexed, though blocked by robots.txt.”
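For context, a rule of the kind described might look like this in robots.txt (a hypothetical example; the site’s actual file isn’t shown in the discussion). Google’s robots.txt syntax supports the * wildcard, so this rule blocks any URL containing ?q=:

```
User-agent: *
Disallow: /*?q=
```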

The person asked the following question:

“But here’s the big question: why would Google index pages when they can’t even see the content? What’s the advantage in that?”

Google’s John Mueller confirmed that if Google can’t crawl a page, it can’t see the noindex meta tag. He also made an interesting mention of the site: search operator, advising to ignore its results because average users won’t see them.

He wrote:

“Yes, you’re correct: if we can’t crawl the page, we can’t see the noindex. That said, if we can’t crawl the pages, then there’s not a lot for us to index. So while you might see some of those pages with a targeted site:-query, the average user won’t see them, so I wouldn’t fuss over it. Noindex is also fine (without robots.txt disallow), it just means the URLs will end up being crawled (and end up in the Search Console report for crawled/not indexed — neither of these statuses cause issues to the rest of the site). The important part is that you don’t make them crawlable + indexable.”


Takeaways:

1. Confirmation Of Limitations Of Site: Search

Mueller’s answer confirms the limitations of using the site: advanced search operator for diagnostic purposes. One reason is that it’s not connected to the regular search index; it’s a separate thing altogether.

Google’s John Mueller commented on the site search operator in 2021:

“The short answer is that a site: query is not meant to be complete, nor used for diagnostics purposes.

A site query is a specific kind of search that limits the results to a certain website. It’s basically just the word site, a colon, and then the website’s domain.

This query limits the results to a specific website. It’s not meant to be a comprehensive collection of all the pages from that website.”

The site: operator doesn’t reflect Google’s search index, making it unreliable for understanding which pages Google has indexed or not indexed. Like Google’s other advanced search operators, it is unreliable as a tool for understanding anything related to how Google ranks or indexes content.

2. A noindex tag without a robots.txt disallow is fine for these kinds of situations, where a bot is linking to non-existent pages that are getting discovered by Googlebot. A noindex tag on a page that is not blocked by a disallow in robots.txt allows Google to crawl the page and read the noindex directive, ensuring the page won’t appear in the search index, which is preferable if the goal is to keep the page out of Google’s search index.
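For reference, the noindex directive in question is a standard robots meta tag placed in the page’s head (the same directive can also be sent as an X-Robots-Tag HTTP header for non-HTML resources). This is a generic example, not taken from the site in the discussion:

```html
<!-- Google must be able to crawl the page to see this tag -->
<meta name="robots" content="noindex">
```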

3. URLs with the noindex tag will generate a “crawled/not indexed” entry in Search Console and won’t have a negative effect on the rest of the website.
These Search Console entries, in the context of pages that are purposely blocked, only indicate that Google crawled the page but did not index it, essentially saying that this happened, not (in this specific context) that there’s something wrong that needs fixing.

This entry is useful for alerting publishers to pages that are inadvertently blocked by a noindex tag or by some other cause that’s preventing them from being indexed. In that case, it’s something to investigate.

4. How Googlebot handles URLs with noindex tags that are blocked from crawling by a robots.txt disallow but are also discoverable by links.
If Googlebot can’t crawl a page, then it’s unable to read and apply the noindex tag, so the page may still be indexed based on URL discovery from an internal or external link.
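This crawl-blocking behavior can be sketched with Python’s standard-library robots.txt parser. The domain and the Disallow rule below are hypothetical, and note that urllib.robotparser only does simple prefix matching, not Googlebot’s full wildcard syntax:

```python
from urllib import robotparser

# Hypothetical robots.txt for example.com (not the site from the article):
#   User-agent: *
#   Disallow: /search
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",
])

# Googlebot may discover /search?q=xyz via a link, but fetching it is
# disallowed, so any noindex meta tag on that page is never seen.
print(rp.can_fetch("Googlebot", "https://example.com/search?q=xyz"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/about"))         # True
```

Because the URL is still discoverable through links, it can be indexed as a bare URL even though its content, and its noindex tag, were never read.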

Google’s documentation of the noindex meta tag has a warning about the use of robots.txt to disallow pages that have a noindex tag in the meta data:

“For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can’t access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it.”

5. How site: searches differ from regular searches in Google’s indexing process
Site: searches are limited to a specific domain and are disconnected from the primary search index, so they don’t reflect Google’s actual search index and are less useful for diagnosing indexing issues.

Read the question and answer on LinkedIn:

Why would Google index pages when they can’t even see the content?

Featured Image by Shutterstock/Krakenimages.com
