February 17, 2019

Build a Search Intent Dashboard to Unlock Better Opportunities

Posted by scott.taft

We've been talking a lot about search intent this week, and if you've been following along, you’re likely already aware of how “search intent” is essential for a robust SEO strategy. If, however, you’ve ever laboured for hours classifying keywords by topic and search intent, only to end up with a ton of data you don’t really know what to do with, then this post is for you.

I’m going to share how to take all that sweet keyword data you’ve categorized, put it into a Power BI dashboard, and start slicing and dicing to uncover a ton of insights, faster than you ever could before.

Building your keyword list

Every great search analysis starts with keyword research and this one is no different. I’m not going to go into excruciating detail about how to build your keyword list. However, I will mention a few of my favorite tools that I’m sure most of you are using already:

  • Search Query Report — What better place to look first than the search terms already driving clicks and (hopefully) conversions to your site.
  • Answer The Public — Great for pulling a ton of suggested terms, questions and phrases related to a single search term.
  • InfiniteSuggest — Like Answer The Public, but faster and allows you to build based on a continuous list of seed keywords.
  • MergeWords — Quickly expand your keywords by adding modifiers upon modifiers.
  • Grep Words — A suite of keyword tools for expanding, pulling search volume and more.

Please note that these tools are a great way to scale your keyword collecting, but with each one you’ll need to comb through and clean your data to ensure all keywords are at least somewhat relevant to your business and audience.

Once I have an initial keyword list built, I’ll upload it to STAT and let it run for a couple days to get an initial data pull. This allows me to pull the ‘People Also Ask’ and ‘Related Searches’ reports in STAT to further build out my keyword list. All in all, I’m aiming to get to at least 5,000 keywords, but the more the merrier.

For the purposes of this blog post I have about 19,000 keywords I collected for a client in the window treatments space.

Categorizing your keywords by topic

Bucketing keywords into categories is an age-old challenge for most digital marketers but it’s a critical step in understanding the distribution of your data. One of the best ways to segment your keywords is by shared words. If you’re short on AI and machine learning capabilities, look no further than a trusty Ngram analyzer. I love to use this Ngram Tool from guidetodatamining.com — it ain’t much to look at, but it’s fast and trustworthy.

After dropping my 19,000 keywords into the tool and analyzing by unigram (or 1-word phrases), I manually select categories that fit with my client’s business and audience. I also make sure the unigram accounts for a decent number of keywords (e.g. I wouldn’t pick a unigram that has a count of only 2 keywords).
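If you'd rather not paste keywords into a web tool, the unigram count itself is easy to reproduce. Here is a minimal sketch in Python; the function name and the sample keywords are made up for illustration and are not from the client dataset.

```python
from collections import Counter


def unigram_counts(keywords):
    """Count how many keywords contain each single word (unigram)."""
    counts = Counter()
    for keyword in keywords:
        # A set ensures a word repeated inside one keyword is counted once.
        counts.update(set(keyword.lower().split()))
    return counts


# Hypothetical sample keywords for illustration only.
sample = [
    "white blinds near me",
    "living room blinds ideas",
    "how to hang curtain rods",
    "blackout curtain panels",
]
counts = unigram_counts(sample)

# Keep only unigrams that cover more than one keyword, as a rough cutoff.
candidates = sorted(word for word, n in counts.items() if n > 1)
```

The cutoff of "more than one keyword" is only a placeholder; on a list of 19,000 keywords you would likely set it much higher.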

Using this data, I then create a Category Mapping table and map a unigram, or “trigger word”, to a Category like the following:

You’ll notice that for “curtain” and “drapes” I mapped both to the Curtains category. For my client’s business, they treat these as the same product, and doing this allows me to account for variations in keywords but ultimately group them how I want for this analysis.

Using this method, I create a Trigger Word-Category mapping based on my entire dataset. It’s possible that not every keyword will fall into a category and that’s okay — it likely means that keyword is not relevant or significant enough to be accounted for.
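To make the mapping idea concrete, here is a rough sketch of applying a trigger word table to keywords in Python. Only the "curtain" and "drapes" rows come from the example above; the other rows, the function name, and the choice of substring matching are assumptions for illustration.

```python
# Hypothetical Category Mapping table: trigger word -> Category.
# Only "curtain" and "drapes" -> "Curtains" are taken from the post.
CATEGORY_MAP = {
    "curtain": "Curtains",
    "drapes": "Curtains",
    "blinds": "Blinds",
    "shades": "Shades",
}


def categorize(keyword, mapping=CATEGORY_MAP):
    """Return the sorted categories whose trigger word appears in the keyword."""
    text = keyword.lower()
    # Substring matching means "curtain" also catches "curtains".
    return sorted({category for trigger, category in mapping.items() if trigger in text})
```

A keyword can land in more than one category (for example, "drapes and blinds"), and a keyword with no trigger word returns an empty list, which matches the point that not every keyword needs a category.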

Creating a keyword intent map

As with identifying common topics by which to group your keywords, I’m going to follow the same kind of process here, but with the goal of grouping keywords by intent modifier.

Search intent is the end goal of a person using a search engine. Digital marketers can leverage the terms and modifiers in a search query to infer what types of results or actions a consumer is aiming for.

For example, if a person searches for “white blinds near me”, it is safe to infer that this person is looking to buy white blinds as they are looking for a physical location that sells them. In this case I would classify “near me” as a “Transactional” modifier. If, however, the person searched “living room blinds ideas” I would infer their intent is to see images or read blog posts on the topic of living room blinds. I might classify this search term as being at the “Inspirational” stage, where a person is still deciding what products they might be interested in and, therefore, isn’t quite ready to buy yet.

There is a lot of research on some generally accepted intent modifiers in search and I don’t intend to reinvent the wheel. This handy guide (originally published in STAT) provides a good review of intent modifiers you can start with.

I followed the same process as building out categories to build out my intent mapping and the result is a table of intent triggers and their corresponding Intent stage.
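Intent triggers are often multi-word phrases, so whole-phrase matching is a little safer than plain substring matching. The sketch below is one possible way to do that in Python. The three rows follow the examples in this post (“near me”, “ideas”, and “how to”), but the table, the function name, and the first-match-wins rule are assumptions.

```python
import re

# Hypothetical intent mapping: modifier -> Intent stage.
INTENT_MAP = {
    "near me": "Transactional",
    "ideas": "Inspirational",
    "how to": "Informational",
}


def intent_stage(keyword, mapping=INTENT_MAP):
    """Return the first stage whose modifier appears as whole words, else None."""
    text = keyword.lower()
    for modifier, stage in mapping.items():
        # \b word boundaries stop "ideas" from matching inside a longer word.
        if re.search(r"\b" + re.escape(modifier) + r"\b", text):
            return stage
    return None
```

Returning `None` for unmatched keywords keeps them in the data without forcing them into a stage.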

Intro to Power BI

There are tons of resources on how to get started with the free tool Power BI, one of which is our own founder Will Reynolds’ video series on using Power BI for Digital Marketing. This is a great place to start if you’re new to the tool and its capabilities.

Note: it’s not about the tool necessarily (although Power BI is a super powerful one). It’s more about being able to look at all of this data in one place and pull insights from it at speeds which Excel just won’t give you. If you’re still skeptical of trying a new tool like Power BI at the end of this post, I urge you to get the free download from Microsoft and give it a try.

Setting up your data in Power BI

Power BI’s power comes from linking multiple datasets together based on common “keys.” Think back to your Microsoft Access days and this should all start to sound familiar.

Step 1: Upload your data sources

First, open Power BI and you’ll see a button called “Get Data” in the top ribbon. Click that and then select the data format you want to upload. All of my data for this analysis is in CSV format so I will select the Text/CSV option for all of my data sources. Repeat these steps for each data source, clicking “Load” each time.

Step 2: Clean your data

In the Power BI ribbon menu, click the button called “Edit Queries.” This will open the Query Editor where we will make all of our data transformations.

The main things you’ll want to do in the Query Editor are the following:

  • Make sure all data formats make sense (e.g. keywords are formatted as text, numbers are formatted as decimals or whole numbers).
  • Rename columns as needed.
  • Create a domain column in your Top 20 report based on the URL column.
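If you prefer to prepare the domain column before loading the CSV into Power BI, the same transformation can be sketched in Python. This is not the Query Editor step itself, just an equivalent under some assumptions: the function name is made up, and stripping a leading "www." is a choice you may not want.

```python
from urllib.parse import urlparse


def domain_from_url(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    # urlparse only recognizes the host when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host
```

Other subdomains are kept as they are, so a host like videos.blinds.com stays separate from its parent domain in the dashboard.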

Close and apply your changes by clicking the “Close & Apply” button in the Query Editor ribbon.

Step 3: Create relationships between data sources

On the left side of Power BI is a vertical bar with icons for different views. Click the third one to see your relationships view.

In this view, we are going to connect all data sources to our ‘Keywords Bridge’ table by clicking and dragging a line from the field ‘Keyword’ in each table to ‘Keyword’ in the ‘Keywords Bridge’ table (note that for the PPC Data, I have connected ‘Search Term’, as this is the PPC equivalent of a keyword for our purposes here).

The last thing we need to do for our relationships is double-click on each line to ensure the following options are selected for each so that our dashboard works properly:

  • The cardinality is Many to 1
  • The relationship is “active”
  • The cross filter direction is set to “both”
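A Many to 1 relationship only works if the "1" side, here the ‘Keywords Bridge’ table, holds each keyword exactly once. The post doesn't show how that table is built, so the sketch below is just one plausible way to produce it in Python: take the unique keywords across all of your sources. The function name and the trim-and-lowercase normalization are assumptions.

```python
def build_keyword_bridge(*sources):
    """Return a sorted list of unique, normalized keywords across all sources."""
    unique = set()
    for keywords in sources:
        for keyword in keywords:
            # Assumed normalization: trim whitespace and lowercase.
            unique.add(keyword.strip().lower())
    return sorted(unique)


# Hypothetical keyword columns from two data sources.
top_20_keywords = ["White Blinds", "curtain rods", "white blinds"]
ppc_search_terms = ["white blinds ", "drapes"]
bridge = build_keyword_bridge(top_20_keywords, ppc_search_terms)
```

If you normalize keywords for the bridge, apply the same normalization to the keyword column in every source so the keys still match.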

We are now ready to start building our Intent Dashboard and analyzing our data.

Building the search intent dashboard

In this section I’ll walk you through each visual in the Search Intent Dashboard (as seen below):

Top domains by count of keywords

Visual type: Stacked Bar Chart visual

Axis: I’ve nested URL under Domain so I can drill down to see this same breakdown by URL for a specific Domain

Value: Distinct count of keywords

Legend: Result Types

Filter: Top 10 filter on Domains by count of distinct keywords

Keyword breakdown by result type

Visual type: Donut chart

Legend: Result Types

Value: Count of distinct keywords, shown as Percent of grand total

Metric Cards

Sum of Distinct MSV

Because the Top 20 report shows each keyword 20 times, we need to create a calculated measure in Power BI to only sum MSV for the unique list of keywords. Use this formula for that calculated measure:

Sum Distinct MSV = SUMX(DISTINCT('Table'[Keywords]), FIRSTNONBLANK('Table'[MSV], 0))
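To see what that measure is doing, here is a rough Python equivalent. It assumes every row for a given keyword carries the same MSV (which is why picking one value per keyword is safe); the function name and the sample rows are hypothetical.

```python
def sum_distinct_msv(rows):
    """Sum MSV once per keyword, given (keyword, msv) rows that repeat keywords."""
    msv_by_keyword = {}
    for keyword, msv in rows:
        # Record one non-blank MSV per keyword and skip later repeats.
        if msv is not None and keyword not in msv_by_keyword:
            msv_by_keyword[keyword] = msv
    return sum(msv_by_keyword.values())


# Hypothetical Top 20 rows: the same keyword appears once per ranking result.
rows = [
    ("white blinds", 1000),
    ("white blinds", 1000),
    ("drapes", None),
    ("drapes", 500),
]
total = sum_distinct_msv(rows)
```

A plain sum over these rows would give 2,500; counting each keyword once gives 1,500.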

Keywords

This is just a distinct count of keywords.

Slicer: PPC Conversions

Visual type: Slicer

Drop your PPC Conversions field into a slicer and set the format to “Between” to get this nifty slider visual.

Tables

Visual type: Table or Matrix (a matrix allows for drilling down similar to a pivot table in Excel)

Values: Here I have Category or Intent Stage and then the distinct count of keywords.

Pulling insights from your search intent dashboard

This dashboard is now a Swiss Army knife of data that allows you to slice and dice to your heart’s content. Below are a couple examples of how I use this dashboard to pull out opportunities and insights for my clients.

Where are competitors winning?

With this data we can quickly see who the top competing domains are, but what’s more valuable is seeing who the competitors are for a particular intent stage and category.

I start by filtering to the “Informational” stage, since it represents the most keywords in our dataset. I also filter to the top category for this intent stage which is “Blinds”. Looking at my Keyword Count card, I can now see that I’m looking at a subset of 641 keywords.

Note: To filter multiple visuals in Power BI, you need to press and hold the “Ctrl” key each time you click a new visual to maintain all the filters you clicked previously.

The top competing subdomain here is videos.blinds.com with visibility in the top 20 for over 250 keywords, most of which are for video results. I hit Ctrl+click on the Video results portion of videos.blinds.com to update the keywords table to only keywords where videos.blinds.com is ranking in the top 20 with a video result.

From all this I can now say that videos.blinds.com is ranking in the top 20 positions for about 30 percent of keywords that fall into the “Blinds” category and the “Informational” intent stage. I can also see that most of the keywords here start with “how to”, which tells me that people searching for blinds in an informational stage are most likely looking for how-to instructions and that video may be a desired content format.

Where should I focus my time?

Whether you’re in-house or at an agency, time is always a hot commodity. You can use this dashboard to quickly identify the opportunities you should be prioritizing first: the ones that can guarantee you’ll deliver bottom-line results.

To find these bottom-line results, we’re going to filter our data using the PPC conversions slicer so that our data only includes keywords that have converted at least once in our PPC campaigns.

Once I do that, I can see I’m working with a pretty limited set of keywords that have been bucketed into intent stages, but I can continue by drilling into the “Transactional” intent stage because I want to target queries that are linked to a possible purchase.

Note: Not every keyword will fall into an intent stage if it doesn’t meet the criteria we set. These keywords will still appear in the data, but this is the reason why your total keyword count might not always match the total keyword count in the intent stages or category tables.

From there I want to focus on those “Transactional” keywords that are triggering answer boxes to make sure I have good visibility, since they are converting for me on PPC. To do that, I filter to only show keywords triggering answer boxes. Based on these filters I can look at my keyword table and see most (if not all) of the keywords are “installation” keywords and I don’t see my client’s domain in the top list of competitors. This is now an area of focus for me to start driving organic conversions.

Wrap up

I’ve only just scratched the surface. There’s tons that can be done with this data inside a tool like Power BI. Having a solid data set of keywords and visuals that I can revisit repeatedly for a client and continuously pull out opportunities to help fuel our strategy is, for me, invaluable. I can work efficiently without having to go back to keyword tools whenever I need an idea. Hopefully you find this makes building an intent-based strategy more efficient and sound for your business or clients.


Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!

February 17, 2019

Detecting Link Manipulation and Spam with Domain Authority

Posted by rjonesx.

Over 7 years ago, while still an employee at Virante, Inc. (now Hive Digital), I wrote a post on Moz outlining some simple methods for detecting backlink manipulation by comparing one's backlink profile to an ideal model based on Wikipedia. At the time, I was limited in the research I could perform because I was a consumer of the API, lacking access to deeper metrics, measurements, and methodologies to identify anomalies in backlink profiles. We used these techniques in spotting backlink manipulation with tools like Remove'em and Penguin Risk, but they were always handicapped by the limitations of consumer-facing APIs. Moreover, they didn't scale. It is one thing to collect all the backlinks for a site, even a large site, and judge every individual link for source type, quality, anchor text, etc. Reports like these can be accessed from dozens of vendors if you are willing to wait a few hours for the report to complete. But how do you do this for 30 trillion links every single day?

Since the launch of Link Explorer and my residency here at Moz, I have had the luxury of far less filtered data, giving me a far deeper, clearer picture of the tools available to backlink index maintainers to identify and counter manipulation. While I in no way intend to say that all manipulation can be detected, I want to outline just some of the myriad surprising methodologies to detect spam.

The general methodology

You don't need to be a data scientist or a math nerd to understand this simple practice for identifying link spam. While there certainly is a great deal of math used in the execution of measuring, testing, and building practical models, the general gist is plainly understandable.

The first step is to get a good random sample of links from the web, which you can read about here. But let's assume you have already finished that step. Then, for any property of those random links (DA, anchor text, etc.), you figure out what is normal or expected. Finally, you look for outliers and see if those correspond with something important - like sites that are manipulating the link graph, or sites that are exceptionally good. Let's start with an easy example, link decay.

Link decay and link spam

Link decay is the natural occurrence of links either dropping off the web or changing URLs. For example, if you get links after you send out a press release, you would expect some of those links to eventually disappear as the pages are archived or removed for being old. And, if you were to get a link from a blog post, you might expect to have a homepage link on the blog until that post is pushed to the second or third page by new posts.

But what if you bought your links? What if you own a large number of domains and all the sites link to each other? What if you use a PBN? These links tend not to decay. Exercising control over your inbound links often means that you keep them from ever decaying. Thus, we can create a simple hypothesis:

Hypothesis: The link decay rate of sites manipulating the link graph will differ from sites with natural link profiles.

The methodology for testing this hypothesis is just as we discussed before. We first figure out what is natural. What does a random site's link decay rate look like? Well, we simply get a bunch of sites and record how fast links are deleted (we visit a page and see a link is gone) vs. their total number of links. We then can look for anomalies.

In this case of anomaly hunting, I'm going to make it really easy. No statistics, no math, just a quick look at what pops up when we first sort by Lowest Decay Rate and then sort by Highest Domain Authority to see who is at the tail-end of the spectrum.
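The sort described above is simple enough to sketch in a few lines of Python. This is an illustration only, not the production process: the site records, field names, and numbers below are invented, and decay rate is taken to be deleted links divided by total links, as described earlier.

```python
def decay_rate(deleted_links, total_links):
    """Share of a site's known links that have since been deleted."""
    return deleted_links / total_links if total_links else 0.0


def rank_suspects(sites):
    """Sort by lowest decay rate first, breaking ties by highest DA."""
    return sorted(
        sites,
        key=lambda s: (decay_rate(s["deleted"], s["total"]), -s["da"]),
    )


# Hypothetical sample of sites; every value here is made up.
sites = [
    {"domain": "a.example", "da": 45, "deleted": 0, "total": 1000},
    {"domain": "b.example", "da": 60, "deleted": 0, "total": 500},
    {"domain": "c.example", "da": 30, "deleted": 120, "total": 400},
]
ranked = [s["domain"] for s in rank_suspects(sites)]
```

Sites with zero decay and a high DA rise to the top of the list, which is where you would start looking by hand.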

[Image: spreadsheet of sites with high deleted link ratios]

Success! Every example we see of a good DA score but 0 link decay appears to be powered by a link network of some sort. This is the Aha! moment of data science that is so fun. What is particularly interesting is we find spam on both ends of the distribution. That is to say, sites that have 0 decay or near 100% decay rates both tend to be spammy. The first type tends to be part of a link network; the second type tends to spam their backlinks to sites others are spamming, so their links quickly shuffle off to other pages.

Of course, now we do the hard work of building a model that actually takes this into account and accurately reduces Domain Authority relative to the severity of the link spam. But you might be asking...

These sites don't rank in Google — why do they have decent DAs in the first place?

Well, this is a common problem with training sets. DA is trained on sites that rank in Google so that we can figure out who will rank above who. However, historically, we haven't (and no one to my knowledge in our industry has) taken into account random URLs that don't rank at all. This is something we're solving for in the new DA model set to launch in early March, so stay tuned, as this represents a major improvement on the way we calculate DA!

Spam Score distribution and link spam

One of the most exciting new additions to the upcoming Domain Authority 2.0 is the use of our Spam Score. Moz's Spam Score is a link-blind (we don't use links at all) metric that predicts the likelihood a domain will not be indexed in Google. The higher the score, the worse the site.

Now, we could just ignore any links from sites with Spam Scores over 70 and call it a day, but it turns out that common link manipulation schemes leave behind fascinating patterns waiting to be discovered. The same simple methodology applies: use a random sample of URLs to find out what a normal backlink profile looks like, and then see if there are anomalies in the way Spam Score is distributed among the backlinks to a site. Let me show you just one.

It turns out that acting natural is really hard to do. Even the best attempts often fall short, as did this particularly pernicious link spam network. This network had haunted me for 2 years because it included a directory of the top million sites, so if you were one of those sites, you could see anywhere from 200 to 600 followed links show up in your backlink profile. I called it "The Globe" network. It was easy to look at the network and see what they were doing, but could we spot it automatically so that we could devalue other networks like it in the future? When we looked at the link profile of sites included in the network, the Spam Score distribution lit up like a Christmas tree.

spreadsheet with distribution of spam scores

Most sites get the majority of their backlinks from low Spam Score domains and get fewer and fewer as the Spam Score of the domains go up. But this link network couldn't hide because we were able to detect the sites in their network as having quality issues using Spam Score. If we relied only on ignoring the bad Spam Score links, we would have never discovered this issue. Instead, we found a great classifier for finding sites that are likely to be penalized by Google for bad link building practices.

DA distribution and link spam

We can find similar patterns among sites with the distribution of inbound Domain Authority. It's common for businesses seeking to increase their rankings to set minimum quality standards on their outreach campaigns, often DA30 and above. An unfortunate outcome of this is that what remains are glaring examples of sites with manipulated link profiles.

Let me take a moment and be clear here. A manipulated link profile is not necessarily against Google's guidelines. If you do targeted PR outreach, it is reasonable to expect that such a distribution might occur without any attempt to manipulate the graph. However, the real question is whether Google wants sites that perform such outreach to perform better. If not, this glaring example of link manipulation is pretty easy for Google to dampen, if not ignore altogether.

spreadsheet with distribution of domain authorityA normal link graph for a site that is not targeting high link equity domains will have the majority of their links coming from DA0–10 sites, slightly fewer for DA10–20, and so on and so forth until there are almost no links from DA90+. This makes sense, as the web has far more low DA sites than high. But all the sites above have abnormal link distributions, which make it easy to detect and correct — at scale — link value.

Now, I want to be clear: these are not necessarily examples of violating Google's guidelines. However, they are manipulations of the link graph. It's up to you to determine whether you believe Google takes the time to differentiate between how the outreach was conducted that resulted in the abnormal link distribution.

What doesn't work

For every type of link manipulation detection method we discover, we scrap dozens more. Some of these are actually quite surprising. Let me write about just one of the many.

The first surprising example was the ratio of nofollow to follow links. It seems pretty straightforward that comment, forum, and other types of spammers would end up accumulating lots of nofollowed links, thereby leaving a pattern that is easy to discern. Well, it turns out this is not true at all.

The ratio of nofollow to follow links turns out to be a poor indicator, as popular sites like facebook.com often have a higher ratio than even pure comment spammers. This is likely due to the use of widgets and beacons and the legitimate usage of popular sites like facebook.com in comments across the web. Of course, this isn't always the case. There are some sites with 100% nofollow links and a high number of root linking domains. These anomalies, like "Comment Spammer 1," can be detected quite easily, but as a general measurement the ratio does not serve as a good classifier for spam or ham.

So what's next?

Moz is continually traversing the the link graph looking for ways to improve Domain Authority using everything from basic linear algebra to complex neural networks. The goal in mind is simple: We want to make the best Domain Authority metric ever. We want a metric which users can trust in the long run to root out spam just like Google (and help you determine when you or your competitors are pushing the limits) while at the same time maintaining or improving correlations with rankings. Of course, we have no expectation of rooting out all spam — no one can do that. But we can do a better job. Led by the incomparable Neil Martinsen-Burrell, our metric will stand alone in the industry as the canonical method for measuring the likelihood a site will rank in Google.


We're launching Domain Authority 2.0 on March 5th! Check out our helpful resources here, or sign up for our webinar this Thursday, February 21st for more info on how to communicate changes like this to clients and stakeholders:

Save my spot!


Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!

February 17, 2019

Detecting Link Manipulation and Spam with Domain Authority

Posted by rjonesx.

Over 7 years ago, while still an employee at Virante, Inc. (now Hive Digital), I wrote a post on Moz outlining some simple methods for detecting backlink manipulation by comparing one's backlink profile to an ideal model based on Wikipedia. At the time, I was limited in the research I could perform because I was a consumer of the API, lacking access to deeper metrics, measurements, and methodologies to identify anomalies in backlink profiles. We used these techniques in spotting backlink manipulation with tools like Remove'em and Penguin Risk, but they were always handicapped by the limitations of consumer-facing APIs. Moreover, they didn't scale. It is one thing to collect all the backlinks for a site, even a large site, and judge every individual link for source type, quality, anchor text, etc. Reports like these can be accessed from dozens of vendors if you are willing to wait a few hours for the report to complete. But how do you do this for 30 trillion links every single day?

Since the launch of Link Explorer and my residency here at Moz, I have had the luxury of far less filtered data, giving me a far deeper, clearer picture of the tools available to backlink index maintainers to identify and counter manipulation. While I in no way intend to say that all manipulation can be detected, I want to outline just some of the myriad surprising methodologies to detect spam.

The general methodology

You don't need to be a data scientist or a math nerd to understand this simple practice for identifying link spam. While there certainly is a great deal of math used in the execution of measuring, testing, and building practical models, the general gist is plainly understandable.

The first step is to get a good random sample of links from the web, which you can read about here. But let's assume you have already finished that step. Then, for any property of those random links (DA, anchor text, etc.), you figure out what is normal or expected. Finally, you look for outliers and see if those correspond with something important, like sites that are manipulating the link graph or sites that are exceptionally good. Let's start with an easy example: link decay.
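To make the three steps concrete, here is a minimal Python sketch, not the model Moz actually runs. It assumes the "normal" range can be summarized by the mean and standard deviation of the random sample, and it flags anything more than three standard deviations away. The function name `find_outliers`, the threshold, and all of the sample values are hypothetical.

```python
from statistics import mean, stdev


def find_outliers(baseline, candidates, threshold=3.0):
    """Return candidates whose value sits more than `threshold` standard
    deviations from the mean of the random baseline sample."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    outliers = {}
    for name, value in candidates.items():
        z = (value - mu) / sigma
        if abs(z) > threshold:
            outliers[name] = round(z, 2)
    return outliers


# Hypothetical property values (any per-site metric) from a random sample of sites.
random_sample = [0.28, 0.31, 0.35, 0.30, 0.33, 0.29, 0.32, 0.34, 0.27, 0.31]

# Hypothetical sites we want to compare against that baseline.
sites = {
    "site-a.example": 0.30,
    "site-b.example": 0.00,
    "site-c.example": 0.98,
}

flagged = find_outliers(random_sample, sites)
print(flagged)
```

A z-score is only one way to define "normal." For skewed properties, a percentile cutoff would probably be a better fit, but the shape of the process stays the same: sample, establish a baseline, look for outliers.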

Link decay and link spam

Link decay is the natural occurrence of links either dropping off the web or changing URLs. For example, if you get links after you send out a press release, you would expect some of those links to eventually disappear as the pages are archived or removed for being old. And, if you were to get a link from a blog post, you might expect to have a homepage link on the blog until that post is pushed to the second or third page by new posts.

But what if you bought your links? What if you own a large number of domains and all the sites link to each other? What if you use a PBN? These links tend not to decay. Exercising control over your inbound links often means that you keep them from ever decaying. Thus, we can create a simple hypothesis:

Hypothesis: The link decay rate of sites manipulating the link graph will differ from sites with natural link profiles.

The methodology for testing this hypothesis is just as we discussed before. We first figure out what is natural. What does a random site's link decay rate look like? Well, we simply get a bunch of sites and record how fast links are deleted (we visit a page and see a link is gone) vs. their total number of links. We then can look for anomalies.
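As a rough illustration of that measurement, the sketch below computes a decay rate as deleted links divided by total links observed. The function name, the domains, and the counts are all made up for the example; the real crawl data and any time-windowing are outside its scope.

```python
def decay_rate(total_links: int, deleted_links: int) -> float:
    """Share of a site's observed inbound links that were later found deleted."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    if not 0 <= deleted_links <= total_links:
        raise ValueError("deleted_links must be between 0 and total_links")
    return deleted_links / total_links


# Hypothetical crawl observations: domain -> (total links seen, links later gone).
observations = {
    "natural.example": (1000, 310),
    "network.example": (800, 0),
    "churn.example": (500, 495),
}

rates = {
    domain: decay_rate(total, deleted)
    for domain, (total, deleted) in observations.items()
}

for domain, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{domain}: {rate:.0%} of links decayed")
```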

In this case of anomaly hunting, I'm going to make it really easy. No statistics, no math, just a quick look at what pops up when we first sort by Lowest Decay Rate and then sort by Highest Domain Authority to see who is at the tail-end of the spectrum.

spreadsheet of sites with high deleted link ratios

Success! Every example we see of a good DA score but 0 link decay appears to be powered by a link network of some sort. This is the "aha!" moment that makes data science so fun. What is particularly interesting is that we find spam on both ends of the distribution: sites that have 0 decay or near 100% decay rates both tend to be spammy. The first type tends to be part of a link network; the second type tends to spam their backlinks to sites others are spamming, so their links quickly shuffle off to other pages.
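A quick way to picture checking both ends of the distribution is the sketch below. It is only an illustration: the cutoffs (`min_da=30`, `high=0.95`), the labels, and the site data are assumptions chosen for the example, and applying the DA filter to both tails is my simplification rather than anything established above.

```python
def flag_decay_tails(sites, min_da=30, low=0.0, high=0.95):
    """Flag sites with a decent DA whose decay rate sits at either extreme.

    `sites` is an iterable of (domain, domain_authority, decay_rate) tuples.
    """
    flagged = []
    for domain, da, rate in sites:
        if da < min_da:
            continue  # only look at sites with a good DA score
        if rate <= low:
            flagged.append((domain, "no decay: possible link network"))
        elif rate >= high:
            flagged.append((domain, "near-total decay: possible churned spam links"))
    return flagged


# Hypothetical sites: (domain, Domain Authority, decay rate).
sites = [
    ("natural.example", 45, 0.31),
    ("network.example", 52, 0.00),
    ("churn.example", 38, 0.99),
    ("small.example", 12, 0.00),
]

flagged = flag_decay_tails(sites)
for domain, reason in flagged:
    print(domain, "->", reason)
```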

Of course, now we do the hard work of building a model that actually takes this into account and accurately reduces Domain Authority relative to the severity of the link spam. But you might be asking...

These sites don't rank in Google — why do they have decent DAs in the first place?

Well, this is a common problem with training sets. DA is trained on sites that rank in Google so that we can figure out who will rank above whom. However, historically, we haven't (and no one to my knowledge in our industry has) taken into account random URLs that don't rank at all. This is something we're solving for in the new DA model set to launch in early March, so stay tuned, as this represents a major improvement on the way we calculate DA!

Spam Score distribution and link spam

One of the most exciting new additions to the upcoming Domain Authority 2.0 is the use of our Spam Score. Moz's Spam Score is a link-blind (we don't use links at all) metric that predicts the likelihood that a domain will not be indexed in Google. The higher the score, the worse the site.

Now, we could just ignore any links from sites with Spam Scores over 70 and call it a day, but it turns out there are fascinating patterns left behind by common link manipulation schemes, waiting to be discovered. We can find them with the same simple methodology: use a random sample of URLs to find out what a normal backlink profile looks like, then see if there are anomalies in the way Spam Score is distributed among the backlinks to a site. Let me show you just one.

It turns out that acting natural is really hard to do. Even the best attempts often fall short, as did this particularly pernicious link spam network. This network had haunted me for 2 years because it included a directory of the top million sites, so if you were one of those sites, you could see anywhere from 200 to 600 followed links show up in your backlink profile. I called it "The Globe" network. It was easy to look at the network and see what they were doing, but could we spot it automatically so that we could devalue other networks like it in the future? When we looked at the link profile of sites included in the network, the Spam Score distribution lit up like a Christmas tree.

spreadsheet with distribution of spam scores

Most sites get the majority of their backlinks from low Spam Score domains and get fewer and fewer as the Spam Score of the domains goes up. But this link network couldn't hide because we were able to detect the sites in their network as having quality issues using Spam Score. If we relied only on ignoring the bad Spam Score links, we would have never discovered this issue. Instead, we found a great classifier for finding sites that are likely to be penalized by Google for bad link building practices.
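One way to turn "the distribution looks wrong" into a number is to bucket the Spam Scores of a site's linking domains and measure how far those shares are from a baseline built from random sites. The sketch below uses total variation distance for that comparison. That choice is mine for illustration, not a description of the classifier we built, and every score in it is invented.

```python
from collections import Counter


def bucket_shares(scores, width=10, max_score=100):
    """Share of linking domains falling in each Spam Score bucket (0-9, 10-19, ...)."""
    if not scores:
        raise ValueError("scores must not be empty")
    n_buckets = max_score // width
    # A score of exactly max_score is folded into the top bucket.
    counts = Counter(min(score // width, n_buckets - 1) for score in scores)
    return [counts.get(i, 0) / len(scores) for i in range(n_buckets)]


def total_variation(p, q):
    """Total variation distance between two bucketed distributions (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical Spam Scores of linking domains for a typical site from a random sample...
baseline_scores = [1] * 50 + [12] * 25 + [25] * 12 + [40] * 8 + [65] * 4 + [90] * 1
# ...and for a site caught up in a link network.
network_scores = [2] * 30 + [15] * 10 + [72] * 35 + [88] * 25

baseline = bucket_shares(baseline_scores)
suspect = bucket_shares(network_scores)
distance = total_variation(baseline, suspect)
print(f"distance from baseline: {distance:.2f}")
```

A site whose backlinks follow the usual downward slope would score close to 0 here, while a pile-up in the high Spam Score buckets pushes the distance toward 1.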

DA distribution and link spam

We can find similar patterns among sites with the distribution of inbound Domain Authority. It's common for businesses seeking to increase their rankings to set minimum quality standards on their outreach campaigns, often DA30 and above. An unfortunate outcome is that these campaigns leave behind glaring examples of sites with manipulated link profiles.

Let me take a moment and be clear here. A manipulated link profile is not necessarily against Google's guidelines. If you do targeted PR outreach, it is reasonable to expect that such a distribution might occur without any attempt to manipulate the graph. However, the real question is whether Google wants sites that perform such outreach to perform better. If not, this glaring example of link manipulation is pretty easy for Google to dampen, if not ignore altogether.

spreadsheet with distribution of domain authority

A normal link graph for a site that is not targeting high link equity domains will have the majority of its links coming from DA0–10 sites, slightly fewer from DA10–20, and so on until there are almost no links from DA90+. This makes sense, as the web has far more low DA sites than high. But all the sites above have abnormal link distributions, which makes it easy to detect and correct link value at scale.
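A simple sketch of that shape check follows. It counts links per DA bucket and reports any bucket that holds more links than the bucket just below it, which a steadily declining profile should not do. The helper names and link data are hypothetical, and a production version would need some tolerance for small, noisy counts.

```python
def da_bucket_counts(da_values, width=10):
    """Count inbound links per DA bucket: 0-9, 10-19, ..., 90-100."""
    counts = [0] * 10
    for da in da_values:
        counts[min(da // width, 9)] += 1
    return counts


def rising_buckets(counts):
    """Indices of buckets that hold more links than the bucket just below them."""
    return [i for i in range(1, len(counts)) if counts[i] > counts[i - 1]]


# Hypothetical DA values of linking domains for two sites.
natural_links = [5] * 60 + [15] * 25 + [25] * 9 + [35] * 4 + [45] * 2
outreach_links = [5] * 10 + [15] * 5 + [35] * 40 + [45] * 30 + [55] * 15

natural_counts = da_bucket_counts(natural_links)
outreach_counts = da_bucket_counts(outreach_links)

print("natural: ", natural_counts, "rising at", rising_buckets(natural_counts))
print("outreach:", outreach_counts, "rising at", rising_buckets(outreach_counts))
```

In this made-up data, the outreach profile jumps at the DA30–39 bucket, exactly where a "DA30 and above" minimum would leave its fingerprint.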

Now, I want to be clear: these are not necessarily examples of violating Google's guidelines. However, they are manipulations of the link graph. It's up to you to determine whether you believe Google takes the time to differentiate between how the outreach was conducted that resulted in the abnormal link distribution.

What doesn't work

For every type of link manipulation detection method we discover, we scrap dozens more. Some of these are actually quite surprising. Let me write about just one of the many.

The first surprising example was the ratio of nofollow to follow links. It seems pretty straightforward that comment, forum, and other types of spammers would end up accumulating lots of nofollowed links, thereby leaving a pattern that is easy to discern. Well, it turns out this is not true at all.

The ratio of nofollow to follow links turns out to be a poor indicator, as popular sites like facebook.com often have a higher ratio than even pure comment spammers. This is likely due to the use of widgets and beacons and the legitimate usage of popular sites like facebook.com in comments across the web. Of course, this isn't always the case. There are some sites with 100% nofollow links and a high number of root linking domains. These anomalies, like "Comment Spammer 1," can be detected quite easily, but as a general measurement the ratio does not serve as a good classifier for spam or ham.
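The sketch below illustrates both halves of that finding with invented numbers: the raw ratio ranks a popular site above a comment spammer, while the narrow rule of 100% nofollow links plus many root linking domains still isolates the obvious anomaly. The domains, counts, and the `min_root_domains` cutoff are all assumptions for the example.

```python
def nofollow_ratio(nofollow: int, follow: int) -> float:
    """Share of a site's inbound links that are nofollowed."""
    total = nofollow + follow
    return nofollow / total if total else 0.0


def is_pure_nofollow_anomaly(nofollow, follow, root_domains, min_root_domains=1000):
    """True for sites with only nofollowed links spread over many root linking domains."""
    return follow == 0 and nofollow > 0 and root_domains >= min_root_domains


# Hypothetical profiles: domain -> (nofollow links, follow links, root linking domains).
profiles = {
    "popular.example": (900_000, 100_000, 250_000),
    "comment-spammer.example": (7_000, 3_000, 1_200),
    "pure-nofollow.example": (50_000, 0, 4_000),
}

ratios = {d: nofollow_ratio(nf, f) for d, (nf, f, _) in profiles.items()}
anomalies = [
    d for d, (nf, f, roots) in profiles.items()
    if is_pure_nofollow_anomaly(nf, f, roots)
]

print(ratios)
print(anomalies)
```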

So what's next?

Moz is continually traversing the link graph looking for ways to improve Domain Authority using everything from basic linear algebra to complex neural networks. The goal in mind is simple: We want to make the best Domain Authority metric ever. We want a metric that users can trust in the long run to root out spam just like Google (and help you determine when you or your competitors are pushing the limits) while at the same time maintaining or improving correlations with rankings. Of course, we have no expectation of rooting out all spam; no one can do that. But we can do a better job. With the incomparable Neil Martinsen-Burrell leading the effort, our metric will stand alone in the industry as the canonical method for measuring the likelihood a site will rank in Google.


We're launching Domain Authority 2.0 on March 5th! Check out our helpful resources here, or sign up for our webinar this Thursday, February 21st, for more info on how to communicate changes like this to clients and stakeholders.