I had an e-commerce company reach out to me earlier in the year for help. They wanted to have an audit completed after making some important changes to their site.
As part of our initial communication, they prepared a bulleted list of changes that had been implemented so I would be aware of them before analyzing the site. That list included any changes in rankings, traffic and indexation.
One of those bullets stood out: They had seen a big spike in indexation after the recent changes went live. Now, this is a site that had been impacted by major algorithm updates over the years, so the combination of big site changes (without SEO guidance) and a subsequent spike in indexation scared the living daylights out of me.
I checked Google Search Console (GSC), and this is what I saw: 6,560 pages indexed jumped to 16,215 in one week. That’s an increase of about 147 percent.
It was clear that digging into this problem and finding out what happened would be a priority. My hope was that if mistakes were pushed to production, and the wrong pages were being indexed, I could surface those problems and fix them before any major damage was done.
I unleashed Screaming Frog and DeepCrawl on the site, using both Googlebot and Googlebot for Smartphones as the user-agents. I was eager to dig into the crawl data.
First, the site is not responsive. Instead, it uses dynamic serving, which means different HTML and CSS can be delivered based on user-agent.
The recent changes were made to the mobile version of the site. After those changes were implemented, Googlebot was being driven to many thin URLs via a faceted navigation (only available on the mobile pages). Those thin URLs were clearly being indexed. At a time when Google’s quality algorithms seem to be in overdrive, that’s never a good thing.
The crawls I performed surfaced a number of pages based on the mobile faceted navigation — and many of them were horribly thin or blank. In addition, the HTML Improvements report (yes, that report many people totally ignore) listed a number of those thin URLs in the duplicate title tags report.
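For readers who want to run a similar check, here is a minimal sketch of how links that exist only in the mobile HTML could be surfaced. It assumes you have already fetched the same URL twice, once per user-agent, and saved both responses. The sample HTML, the URLs and the function names are hypothetical, and this is not what Screaming Frog or DeepCrawl do internally.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def extract_links(html):
    """Return the set of link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def mobile_only_links(desktop_html, mobile_html):
    """Links present in the mobile HTML but missing from the desktop HTML."""
    return extract_links(mobile_html) - extract_links(desktop_html)


# Hypothetical responses for one URL under dynamic serving.
desktop_html = '<a href="/shoes">Shoes</a>'
mobile_html = (
    '<a href="/shoes">Shoes</a>'
    '<a href="/shoes?color=red">Red shoes</a>'  # facet link, mobile only
)

print(sorted(mobile_only_links(desktop_html, mobile_html)))
```

Any URL that shows up only in the mobile set is a candidate for a closer look, since it is reachable by Googlebot for Smartphones but not by the desktop crawler.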
I dug into GSC while the crawls were running and started surfacing many of those problematic URLs. Here’s a screenshot showing close to 4,000 thin URLs in the report. Those weren’t all of the problematic URLs, but you could see Google was finding them.
We clearly had a situation where technical SEO problems led to thin content. I’ve mentioned this problem many times while writing about major algorithm updates, and this was a great example of that happening. Now, it was time to collect as much data as possible, and then communicate the core problems to my client.
The first thing I explained was that the mobile-first index would be coming soon, and it would probably be best if the site were moved to a responsive design. Then my client could be confident that all of the pages contained the same content, structured data, directives and so on. They agreed with me, and that’s the long-term goal for the site.
Second, and directly related to the problem I surfaced, I explained that they should either canonicalize, noindex or 404 all of the thin pages being linked to from the faceted navigation on mobile. As Googlebot crawls those pages again, it should pick up the changes and start dropping them from the index.
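To verify that a fix like this is actually in place, you can inspect what each thin URL returns. The following is a rough sketch, assuming you already have the status code and HTML for a URL. The function name, the labels it returns and the example URLs are all hypothetical, and it only looks at the meta robots tag and rel=canonical in the HTML, not at HTTP headers.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Records the meta robots content and the rel=canonical href, if any."""

    def __init__(self):
        super().__init__()
        self.robots = ""
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.robots = (attributes.get("content") or "").lower()
        elif tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")


def removal_signal(url, status, html):
    """Label which of the three fixes, if any, a response carries."""
    if status == 404:
        return "404"
    signals = HeadSignals()
    signals.feed(html)
    if "noindex" in signals.robots:
        return "noindex"
    # A canonical pointing at a different URL asks for consolidation there.
    if signals.canonical and signals.canonical != url:
        return "canonicalized"
    return "none"


# Hypothetical thin URL that now carries a noindex tag.
thin_url = "https://www.example.com/shoes?color=red"
print(removal_signal(thin_url, 200, '<meta name="robots" content="noindex, follow">'))
```

A result of "none" for a URL that was supposed to be handled is a sign the fix did not make it to production for that page.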
My client asked about blocking via robots.txt, and I explained that if the pages are blocked, then Googlebot will never see the noindex tag. That’s a common question, and I know there’s a lot of confusion about that.
It’s only after those pages are removed from the index that they should be blocked via robots.txt (if you choose to go down that path). My client actually decided to 404 the pages, rolled out the changes, and then moved on to other important findings from the audit and crawl analysis.
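The robots.txt point can be illustrated with a small sketch using Python's standard `urllib.robotparser`. The robots.txt rules and URLs below are hypothetical, and the standard-library parser does simple path-prefix matching, so treat it as an approximation of how Googlebot reads the file rather than an exact model.

```python
from urllib.robotparser import RobotFileParser


def googlebot_can_fetch(robots_txt, url):
    """Return True if the given robots.txt rules let Googlebot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# Hypothetical rules that block a faceted-navigation directory.
robots_txt = """\
User-agent: *
Disallow: /filter/
"""

thin_url = "https://www.example.com/filter/red"

if not googlebot_can_fetch(robots_txt, thin_url):
    # The page is never requested, so a noindex tag on it is never seen.
    print("Blocked: Googlebot cannot see a noindex tag on", thin_url)
```

This is why the order matters: let the pages be crawled so the noindex or 404 can be picked up first, and only consider blocking afterward.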
And then my client asked an important question. It’s one that many have asked after noindexing or removing low-quality or thin pages from their sites.
“How long will it take for Google to drop those pages from the index?”
Ah, a great question — and the answer can be different for every site and situation. I explained that depending on the importance of those pages, the URLs could be removed relatively quickly, or it could take a while (even months or longer).
For example, since these were thin pages generated by a faceted navigation, they probably weren’t high on Google’s list from an importance and priority standpoint. And if that was the case, then Google might not crawl those pages frequently (or any time soon). My recommendation was to move on to other items and just monitor indexation over time.
Note: I did explain that my client could add those thin URLs to an XML sitemap file once removed from the site in order to speed up the process of Google discovering the 404s. I believe my client did that based on the mobile crawl data and the HTML Improvements reporting. That doesn’t mean the URLs would be immediately dropped from the index, but it could help with discovery.
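As an illustration of that sitemap idea, here is a minimal sketch that builds an XML sitemap from a list of removed URLs using only the Python standard library. The URLs and the function name are hypothetical, and I don't know how my client actually generated their file.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical thin URLs that now return a 404.
removed_urls = [
    "https://www.example.com/filter/red",
    "https://www.example.com/filter/blue",
]

print(build_sitemap(removed_urls))
```

A file like this is meant to be temporary. Once the URLs have dropped out of the index, it can be removed.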
So we proceeded with the remediation plan based on the crawl analysis and audit and let Google crawl the problematic pages. We monitored the Index Status report to see when those pages would start dropping, hoping that would be soon (but realistically knowing it could take a while).
And then, in late August, an email hit my inbox from my client with the subject line, “Indexation finally dropped in GSC.” It seems there was a major drop in indexation, falling right back down to where my client was before the problematic pages were indexed! In fact, there were about 500 fewer pages indexed than before the spike.
Actually, there were two drops. The first was about two months into making the changes, and then there was a much larger drop about three months in. You can see the trending below:
So, for this site and situation, it took Google about three months to drop all of those problematic pages from the index once the changes were implemented (and for that to be reflected in the Index Status report in GSC). It’s important to note that each situation can be different, and the time to deindex problematic pages can vary. However, for my client, it was three months.
Also, Google’s John Mueller has explained that the data for the Index Status report is updated several times per week, but we know the reporting graph is updated once per week. If that’s the case, then it did take Google quite a bit of time to remove these thin URLs from the index.
Google’s John Mueller explaining how often Index Status is updated (at 40:36 in the video):
Mistakenly publishing thin pages can be problematic on several levels. First, your users could be accessing those thin or low-quality pages (which can impact user happiness). Second, Google can also be crawling and indexing those pages. We know that Google will count all pages that are indexed when evaluating quality for a site, so it’s critically important to know this is happening, understand how to fix it, and then monitor indexation over time.
Here are some final thoughts and tips:
There are times when websites mistakenly publish low-quality or thin content. When that happens, it’s extremely important to identify and surface those pages quickly. And when you do, your next step is to properly handle those pages by noindexing, canonicalizing or 404ing the URLs.
Once you take care of the situation, it can take time for Google to crawl those pages, process the changes, and then drop the pages from the index. You simply need to be patient knowing you have implemented the right fix. Over time, those pages should drop — just like they did in this situation.
The post How long does it take to deindex low-quality or thin content published by accident? [case study] appeared first on Search Engine Land.