I’ve been doing SEO for a while now and something I come across frequently is indexation issues in Google. You’d be surprised by the number of large brands who seem to be unaware of duplicate, similar or poor content being indexed on their website and the potential impact it could have on their organic search performance.
For this reason, I thought I’d share an SEO analysis trick which quickly identifies and communicates any serious content and indexation issues a site may have. Any webmaster, SEO or digital marketing manager in charge of a medium to large website could complete this analysis in less than 10 minutes.
What this indexation analysis shows
In a nutshell, this analysis can show you exactly how many URLs are being indexed which you do not want Google to add to its vast databanks. The content which you do not want search engines to crawl and index is usually placed into two separate groups: duplicate content and thin/poor content.
These types of content will have an impact on a site’s organic search visibility, and if both are being indexed then you need to act.
Duplicate content vs thin content
Before I go into how to perform the quick analysis, first let’s look at why. Duplicate content is discussed A LOT across the internet and refers to blocks of content on a page which either match or are similar to another page’s content. This can be both internal (on your website) and external (on another website).
Let me start by demystifying the idea floating around the internet of a duplicate content penalty. Google has stated that there is ‘no such thing as a duplicate content penalty’: a penalty is only incurred if the duplicate content is deceptive and tries to manipulate search engine results.
HOWEVER, just because duplicate content does not incur any immediate penalty (manual or algorithmic) does not mean that it shouldn’t be managed. Duplicate content can still affect rankings and the efficiency with which search engines crawl your website.
For example, a client recently made changes to their website, causing canonical tags to be removed from the website and duplicate content to be crawled and indexed. This has impacted the SEO visibility of the website as Google chose to rank duplicate content instead of canonical URLs for important targeted keywords – even the Google algorithm gets it wrong sometimes!
A great study by Pi Datametrics also showed that external duplicate content will influence your organic search visibility. (I’m in one of the photos in the post – bonus points for anyone who spots me.) The study found that even if you published a piece of content first, if a more authoritative site nicks that content then it can outrank yours for the content which you created!
If you would like to read more about how Google handles duplicate content, take a look at their official blog post on the matter.
The second type of content is thin or poor content. The term ‘thin content’ is used quite loosely across our industry to describe pages which offer no real value to users, and I personally use it to describe content which isn’t going to rank very well in organic search results.
These are the pages you need to worry about. Why? Well, sites with thin content are going to be impacted by Panda or one of Google’s many content quality algorithms which target unhelpful and low-quality content. The only way to deal with this type of content is to roll up your sleeves and make it more valuable and useful for your users.
Below is an SEO visibility graph from Searchmetrics which shows a website with thin content being heavily affected by Panda 4.1:
(Note: they haven’t addressed any of the site architecture or content issues, so they’ve never really recovered.)
The indexation analysis
Now we’ve covered the why, let’s focus on the how. It’s SEO analysis time!
Note: throughout this I will be using Branded3 as an example.
Before we start you will need the following:
- XML sitemap uploaded to your website and Search Console (Learn about sitemaps here)
- Google Analytics installed (Learn how to install GA here)
- Google Search Console installed (Learn how to install and verify here)
- Microsoft Excel or, if you want a cheaper option, a pen and paper
Right, now onto the quick analysis (which doesn’t seem so quick now that I’ve written it out). First, you’ll need to open Excel and mark four rows as:
- URLs submitted in XML sitemap – this will show you the pages you want search engines to crawl and index
- URLs indexed in XML sitemap – this shows you the number of URLs Google has chosen to index from your XML sitemap file
- Pages receiving Google organic traffic – this shows the number of pages which Google is choosing to display in its top 10 little blue links and which are receiving natural visits
- Index report – the number of URLs actually indexed in Google, not just what you want Google to see
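To find out the number of URLs in your XML sitemap go to the Search Console > Crawl > Sitemaps and click on the ‘All’ tab. This will show you the number of URLs you have submitted to Google via your XML sitemap. The same report also shows how many of those submitted URLs Google has indexed, which is the figure for the second row.
Note down the number of URLs submitted and the number indexed from the sitemap.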
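If you want to cross-check the submitted figure against the sitemap file itself, the count can be reproduced with a few lines of Python. This is only a minimal sketch: the sitemap content and example.com URLs are made up for illustration, and it assumes a single plain sitemap (not a sitemap index file pointing at several others).

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, for illustration only.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/services/</loc></url>
  <url><loc>https://www.example.com/blog/</loc></url>
</urlset>"""

urls = sitemap_urls(EXAMPLE_SITEMAP)
print(f"URLs submitted in XML sitemap: {len(urls)}")
```

This only tells you what the file contains. The number Google has actually indexed still has to come from Search Console.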
Next go to your Google Analytics > Acquisition > Overview > Channels > Organic Search > Change the dimension to Source > Click Google > Add Landing Page as a secondary dimension. Make sure you set the date range to the last 90 days. All you really want at the moment is the number of pages receiving organic search traffic, which is the row count at the bottom of the report.
Note down the number of pages receiving Google organic search traffic.
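If the report is too long to read the row count comfortably, you could export it and count the landing pages instead. The sketch below assumes a CSV export with columns named "Landing Page" and "Sessions"; those names and the sample rows are assumptions for illustration, so adjust them to match whatever your export actually contains.

```python
import csv
import io


def pages_with_traffic(csv_text, page_col="Landing Page", sessions_col="Sessions"):
    """Return the set of landing pages with at least one organic session."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pages = set()
    for row in reader:
        # Exports sometimes format large numbers with thousands separators.
        sessions = int(row[sessions_col].replace(",", ""))
        if sessions > 0:
            pages.add(row[page_col])
    return pages


# Hypothetical export, for illustration only.
EXAMPLE_EXPORT = """Landing Page,Sessions
/,1200
/services/,340
/blog/post-a/,12
/blog/post-b/,0
"""

landing_pages = pages_with_traffic(EXAMPLE_EXPORT)
print(f"Pages receiving Google organic traffic: {len(landing_pages)}")
```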
Finally, you will need to identify the number of URLs in Google’s index. I usually use the Index Report in Google’s Search Console as it is more accurate than a site: search operator.
Note down the number of total URLs indexed in Google.
The result should be a table and a chart that looks something like the below.
Ta-dah! Now you have actionable data which gives you a quick overview of your website’s indexation status. From this quick analysis, you or your team can see if you need to investigate any indexation issues on your website and fix any content problems there might be. The numbers are unlikely to match perfectly, but a close match is what you should be aiming for if you want to dominate organic search.
In my experience, the best performing websites have roughly a 1:1 ratio of pages indexed to pages receiving organic search traffic. The more the number of pages in Google’s index exceeds the number receiving Google organic search traffic, the stronger the indication that there are indexation and content problems to be fixed.
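To make the comparison concrete, here is a rough sketch of the arithmetic behind the four rows. The function name, the output labels and every number are invented for illustration; they are not from a real site, and the sketch assumes none of the inputs are zero.

```python
def indexation_summary(submitted, indexed_from_sitemap, traffic_pages, total_indexed):
    """Summarise the four noted numbers as simple ratios."""
    return {
        # Share of the sitemap URLs that Google has chosen to index.
        "sitemap_indexed_share": indexed_from_sitemap / submitted,
        # Indexed URLs per page that receives organic traffic (ideal is close to 1).
        "indexed_per_traffic_page": total_indexed / traffic_pages,
        # Indexed URLs that brought in no organic visits during the period.
        "indexed_without_traffic": total_indexed - traffic_pages,
    }


# Hypothetical figures for a site with an indexation problem.
summary = indexation_summary(
    submitted=1000,
    indexed_from_sitemap=900,
    traffic_pages=800,
    total_indexed=4000,
)

for name, value in summary.items():
    print(f"{name}: {value}")
```

With these made-up figures there are five indexed URLs for every page receiving organic traffic, which is a long way from 1:1 and would be worth investigating.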
For example, below is the analysis I performed for one of my best performing accounts (65% of their targeted keywords are on page 1 of Google UK).
As you can see, the number of pages receiving organic search traffic vs indexed in Google is almost equal. The vast majority of content on the website is useful and the site has a great overall architecture, which means there is little duplicate or thin content across the site.
A few tips for your indexation investigation
Here are a few hints and tips for anyone who wants to investigate and fix their site’s indexation issues:
- Check which pages you actually want users to find on your website and compile a list, then make sure they’re in the XML sitemap. More often than not, webmasters and brands don’t actually know what’s on their website. Remember search engines can see everything if you let them!
- Before doing any further heavy analysis, check that you have all your XML sitemaps submitted to the Search Console. Yep – that includes separate sitemaps for blogs or news sections of the website, as more often than not they are in a different CMS like WordPress. If you haven’t submitted all the XML sitemaps, do the quick analysis again and see if the results change.
- Actually check the URLs in the XML sitemap for any crawl errors, and check whether each page needs to be in the sitemap at all, as the sitemap needs to be as accurate as possible. This is the file that highlights which important pages you’re telling search engines to crawl/index. Google has created documentation on XML sitemap best practices here.
- My first port of call in removing URLs from Google’s index is duplicate content. The main reason for this is that it doesn’t (usually) require a lot of work to be removed. If you want to know how to deal with this type of issue then I’d read Hobo Web’s detailed blog post on duplicate content. It’s a comprehensive post and full of useful tips.
- Next is dealing with thin content or pages which aren’t ranking in Google’s 10 little blue links. For any website, but especially larger ones, this is going to require a lot of work. There’s a lot of material on the internet about how to deal with content, search intent and usability. My advice here would be to begin drilling down into specific pages, focusing on important pages first, and being brutally honest about whether these pages deserve to be displayed to users.
- I would read ‘Understanding Google Panda: Definitive Algo Guide for SEOs’ by Jennifer Slegg at SEMPost and ‘Don’t blame Google’s Phantom Update for your lost site traffic’ by KresLynn Ellsworth from NetHosting. Both will help you understand what you’re aiming for with your content (from Google’s perspective anyway), and it’ll also mean you won’t spend years of your life reading up on what Google wants from you.
- While thinking about your content, assess your website architecture and the usability of your site. Important pages should be no more than three clicks away from the homepage to benefit from the link equity of the website, and every bit of content should be on a unique page. Remove any redundant categories or tags which provide little value. This post by Yoast on the subject is pretty good.
- Before you start removing or noindexing pages remember to check Google Analytics and Search Console for organic sessions and click data for the last three months (90 days) at a minimum. The last thing you want to do is lose organic search traffic!
- Last but not least, don’t be afraid to make drastic changes to the website in the name of cleaning up indexation issues. I recently told a client to remove 5,000+ pages from their news section which had a very low amount of organic search traffic. Each page had very thin content and the number of thin pages on the site outnumbered the service level pages (which we wanted to rank). As Google began to reprocess the data and deindex the thin pages, we have begun to see improvements in their SEO performance for long tail keywords! The client is even ranking on page 1 of Google UK for a targeted keyword, all through cleaning up the site’s indexation issues.
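Two of the tips above, checking what is in the XML sitemap and checking traffic before removing anything, boil down to comparing two lists of pages. The sketch below is one possible way to do that, not a prescribed method. It assumes the sitemap holds full URLs while the analytics export holds paths only, and the example.com URLs and paths are made up.

```python
from urllib.parse import urlsplit


def compare_sitemap_to_traffic(sitemap_urls, landing_paths):
    """Compare sitemap URLs with the landing pages that received organic traffic."""
    # Reduce full sitemap URLs to paths so they match the analytics export.
    sitemap_paths = {urlsplit(url).path or "/" for url in sitemap_urls}
    traffic_paths = set(landing_paths)
    return {
        # In the sitemap but no organic visits: review these for thin content.
        "in_sitemap_no_traffic": sorted(sitemap_paths - traffic_paths),
        # Receiving visits but not in the sitemap: consider adding these.
        "traffic_not_in_sitemap": sorted(traffic_paths - sitemap_paths),
    }


# Hypothetical inputs, for illustration only.
report = compare_sitemap_to_traffic(
    sitemap_urls=[
        "https://www.example.com/",
        "https://www.example.com/services/",
        "https://www.example.com/news/old-story/",
    ],
    landing_paths=["/", "/services/", "/blog/post-a/"],
)

for label, paths in report.items():
    print(label, paths)
```

A page showing up under in_sitemap_no_traffic is only a candidate for review. Check its click data in Search Console as well before you remove or noindex it.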
So there you have it: a very quick way of identifying if you have any duplicate or thin content issues. This technique also helps you quantify larger website issues and communicate them to the rest of the wider digital team.
Now that you’ve identified that there are indexation issues which need to be fixed (and have the data to back you up), you must start the long (frustrating) process of cleaning up. Hopefully my tips will steer you in the right direction.