May 22, 2018

Unwrapping the Secrets of SEO: Sports Betting Joins the SEO Fast Track

Odds are you haven’t followed the US Supreme Court’s recent ruling that struck down a ban on commercial sports betting in most states and ...

The post Unwrapping the Secrets of SEO: Sports Betting Joins the SEO Fast Track appeared first on Searchmetrics SEO Blog.

May 21, 2018

Backlink Blindspots: The State of Robots.txt

Posted by rjonesx.

Here at Moz we have committed to making Link Explorer as similar to Google as possible, specifically in the way we crawl the web. I have discussed in previous articles some metrics we use to ascertain that performance, but today I wanted to spend a little bit of time talking about the impact of robots.txt and crawling the web.

Most of you are familiar with robots.txt as the method by which webmasters can direct Google and other bots to visit only certain pages on the site. Webmasters can be selective, allowing certain bots to visit some pages while denying other bots access to the same. This presents a problem for companies like Moz, Majestic, and Ahrefs: we try to crawl the web like Google, but certain websites deny access to our bots while allowing that access to Googlebot. So, why exactly does this matter?
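
For illustration, a hypothetical robots.txt like the following welcomes Googlebot while turning away the three major SEO crawlers (an empty Disallow means "crawl everything"; "Disallow: /" blocks the entire site):

    User-agent: Googlebot
    Disallow:

    User-agent: MJ12Bot
    Disallow: /

    User-agent: AhrefsBot
    Disallow: /

    User-agent: DotBot
    Disallow: /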

Why does it matter?

Graph showing how crawlers hop from one link to another

As we crawl the web, if a bot encounters a robots.txt file that disallows it, it's blocked from crawling specific content. We can see the links that point to the site, but we're blind regarding the content of the site itself, and we can't see the outbound links from that site. This leads to an immediate deficiency in the link graph, at least in terms of similarity to Google (assuming Googlebot is not similarly blocked).

But that isn't the only issue. Being blocked by robots.txt causes a cascading failure in crawl prioritization. As a bot crawls the web, it discovers links and has to prioritize which links to crawl next. Let's say Google finds 100 links and prioritizes the top 50 to crawl. However, a different bot finds those same 100 links but is blocked by robots.txt from crawling 10 of the top 50 pages. It's forced to crawl around those, choosing a different 50 pages to crawl. This different set of crawled pages will return, of course, a different set of links. In the next round of crawling, the blocked bot will not only have a different set of pages it's allowed to crawl; the set of discovered links itself will differ from Google's, because the bot crawled different pages in the first round.
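
To make the divergence concrete, here's a toy simulation in Python. The graph, the priority rule, and every number in it are invented for illustration; none of this is taken from any real crawler:

    import random

    random.seed(42)
    PAGES = range(1000)
    # Every page links out to 20 random pages
    links = {p: random.sample(PAGES, 20) for p in PAGES}
    # Pages that robots.txt hides from the second bot but not from Googlebot
    # (page 0, the seed, is never blocked so both crawls can start)
    blocked = set(random.sample(range(1, 1000), 100))

    def crawl(rounds, respect_block):
        crawled, frontier = set(), {0}
        for _ in range(rounds):
            candidates = sorted(frontier - crawled)
            if respect_block:
                # The blocked bot must route around disallowed pages
                candidates = [p for p in candidates if p not in blocked]
            batch = candidates[:50]  # both bots "prioritize" the same top 50
            crawled.update(batch)
            frontier = {q for p in batch for q in links[p]}
        return crawled

    a = crawl(rounds=5, respect_block=False)  # the Googlebot-like view
    b = crawl(rounds=5, respect_block=True)   # the blocked crawler's view
    print(f"Overlap after 5 rounds: {len(a & b) / len(a | b):.0%}")

Even though only 10% of pages are blocked, the two crawled sets drift further apart every round, which is exactly the cascading effect described above.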

Long story short, much like the proverbial butterfly whose flapping wings eventually lead to a hurricane, small changes in robots.txt that block some bots and allow others ultimately lead to results very different from what Google actually sees.

So, how are we doing?

You know I wasn't going to leave you hanging. Let's do some research. Let's analyze the top 1,000,000 websites on the Internet according to Quantcast and determine which bots are blocked, how frequently, and what impact that might have.

Methodology

The methodology is fairly straightforward; a minimal Python sketch of the robots.txt-checking step follows the list.

  1. Download the Quantcast Top Million
  2. Download the robots.txt, if available, from each of the top million sites
  3. Parse each robots.txt to determine whether the home page and other pages are crawlable
  4. Collect link data related to blocked sites
  5. Collect total pages on-site related to blocked sites
  6. Report the differences among crawlers
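
Here is the sketch mentioned above, covering steps 2 and 3 with Python's standard-library robots.txt parser. The file name and the exact user-agent strings are my assumptions, not details from the study:

    import urllib.robotparser

    # User agents to compare: Googlebot plus the three SEO crawlers discussed above
    USER_AGENTS = ["Googlebot", "MJ12Bot", "AhrefsBot", "DotBot"]

    def check_domain(domain):
        """Fetch a domain's robots.txt and report which bots may crawl its home page."""
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"http://{domain}/robots.txt")
        try:
            parser.read()  # downloads and parses the file; a missing file allows everything
        except OSError:
            return None  # unreachable site: no verdict either way
        home = f"http://{domain}/"
        return {agent: parser.can_fetch(agent, home) for agent in USER_AGENTS}

    # "top-million.txt" is an assumed local copy of the Quantcast list, one domain per line
    with open("top-million.txt") as f:
        for domain in (line.strip() for line in f):
            verdict = check_domain(domain)
            if verdict and verdict["Googlebot"] and not all(verdict.values()):
                # Googlebot is allowed, but at least one SEO crawler is blocked
                print(domain, verdict)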

Total sites blocked

The first and easiest metric to report is the number of sites that block individual crawlers (Moz, Majestic, Ahrefs) while allowing Google. Most sites that block one of the major SEO crawlers block them all: they simply formulate robots.txt to allow major search engines while blocking other bot traffic. Lower is better.

Bar graph showing number of sites blocking each SEO tool in robots.txt

Of the sites analyzed, 27,123 blocked MJ12Bot (Majestic), 32,982 blocked AhrefsBot (Ahrefs), and 25,427 blocked DotBot (Moz). This means that among the major industry crawlers, Moz is the least likely to be turned away from a site that allows Googlebot. But what does this really mean?

Total referring domains (RLDs) blocked

As discussed previously, one big issue with disparate robots.txt entries is that they stop the flow of PageRank. If Google can see a site, it can pass link equity from referring domains through the site's outbound links on to other sites. If a site is blocked by robots.txt, it's as though the outbound lanes of traffic on all the roads going into the site are blocked. By counting all the inbound lanes of traffic, we can get an idea of the total impact on the link graph. Lower is better.

According to our research, Majestic ran into dead ends on 17,787,118 referring domains, Ahrefs on 20,072,690, and Moz on 16,598,365. Once again, Moz's robots.txt profile was the most similar to Google's. But referring domains aren't the only issue with which we should be concerned.

Total pages blocked

Most pages on the web only have internal links. Google isn't interested in creating a link graph; it's interested in creating a search engine. Thus, a bot designed to act like Google needs to be just as concerned about pages that only receive internal links as it is about pages that receive external links. Another metric we can measure is the total number of blocked pages, using Google's site: query (for example, searching Google for site:example.com returns an estimate of how many pages Google has indexed for that domain) to estimate the number of pages Google has access to that a different crawler does not. So, how do the competing industry crawlers perform? Lower is better.

Once again, Moz shines on this metric. It's not just that Moz is blocked by fewer sites; it's blocked by smaller, less important sites. Majestic misses the opportunity to crawl 675,381,982 pages, Ahrefs misses 732,871,714, and Moz misses 658,015,885. That's a difference of nearly 75 million pages between Ahrefs and Moz just in the top million sites on the web.

Unique sites blocked

Most of the robots.txt disallows facing Moz, Majestic, and Ahrefs are simply blanket blocks of all bots that don't represent major search engines. However, we can isolate the cases where a specific bot is deliberately named for exclusion while its competitors remain allowed. For example, how many times is Moz blocked while Ahrefs and Majestic are allowed? Which bot is singled out the most? Lower is better.

Ahrefs is singled out by 1,201 sites, Majestic by 7,152, and Moz by 904. It is understandable that Majestic is singled out most often, given that it has operated a very large link index for over a decade. It took Moz 10 years to accumulate 904 individual robots.txt blocks, and Ahrefs 7 years to accumulate 1,201. But let me give some examples of why this is important.

If you care about links from name.com, hypermart.net, or eclipse.org, you can't rely solely on Majestic.

If you care about links from popsugar.com, dict.cc, or bookcrossing.com, you can't rely solely on Moz.

If you care about links from dailymail.co.uk, patch.com, or getty.edu, you can't rely solely on Ahrefs.

And regardless of what you do or which provider you use, you can't see links from yelp.com, who.int, or findarticles.com.

Conclusions

While Moz's crawler, DotBot, clearly enjoys the robots.txt profile closest to Google's among the three major link indexes, there's still a lot of work to be done. We work very hard on crawler politeness to ensure that we're not a burden to webmasters, which allows us to crawl the web in a manner more like Google. We will continue working to improve our performance across the web and to bring you the best backlink index possible.

Thanks to Dejan SEO for the beautiful link graph used in the header image and Mapt for the initial image used in the diagrams.



May 20, 2018

What Google’s GDPR Compliance Efforts Mean for Your Data: Two Urgent Actions

Posted by willcritchlow

It should be quite obvious for anyone that knows me that I’m not a lawyer, and therefore that what follows is not legal advice. For anyone who doesn’t know me: I’m not a lawyer, I’m certainly not your lawyer, and what follows is definitely not legal advice.

With that out of the way, I wanted to give you some bits of information that might feed into your GDPR planning, as they come up more from the marketing side than the pure legal interpretation of your obligations and responsibilities under this new legislation. While most legal departments will be considering the direct impacts of the GDPR on their own operations, many might miss the impacts that other companies’ (namely, in this case, Google’s) compliance actions have on your data.

But I might be getting a bit ahead of myself: it’s quite possible that not all of you know what the GDPR is, and why or whether you should care. If you do know what it is, and you just want to get to my opinions, go ahead and skip down the page.

What is the GDPR?

The tweet-length version is that the GDPR (General Data Protection Regulation) is new EU legislation covering data protection and privacy for EU citizens, and it applies to all companies offering goods or services to people in the EU.

Even if you aren’t based in the EU, it applies to your company if you have customers who are, and it has teeth (fines of up to the greater of 4% of global revenue or EUR20m). It comes into force on May 25. You have probably heard about it through the myriad organizations who put you on their email list without asking and are now emailing you to “opt back in.”

In most companies, it will not fall to the marketing team to research everything that has to change and achieve compliance, though it is worth getting up to speed with at least the high-level outline, and in particular the regulation's requirements around informed consent, which it defines as:

"...any freely given, specific, informed, and unambiguous indication of the data subject's wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her."

As always, when laws are made about new technology, there are many questions to be resolved, and indeed, jokes to be made:

Can you recommend a GDPR expert?
-yes
Can I have their email address?
-no
— Adam Cleevely (@ACleevely) May 2, 2018

But my post today isn't about what you should do to get compliant (that's specific to your circumstances), and a ton has been written about this already.

My intention is not to write a general guide, but rather to warn you about two specific things you should be doing with analytics (Google Analytics in particular) as a result of changes Google is making because of GDPR.

Unexpected consequences of GDPR

When you deal directly with a person in the EU, and they give you personally identifiable information (PII) about themselves, you are typically in what is called the "data controller" role. The GDPR also identifies another role, which it calls "data processor," which is any other company your company uses as a supplier and which handles that PII. When you use a product like Google Analytics on your website, Google is taking the role of data processor. While most of the restrictions of the GDPR apply to you as the controller, the processor must also comply, and it’s here that we see some potentially unintended (but possibly predictable) consequences of the legislation.

Google is unsurprisingly seeking to minimize their risk (I say it's unsurprising because those GDPR fines could be as large as $4.4 billion, based on last year's revenue, if they get it wrong). They are doing this firstly by pushing as much of the obligation as possible onto you, the data controller, and secondly by going further than the GDPR requires by default, shutting down accounts that infringe their terms more aggressively (regardless of whether the infringement also violates the GDPR).

This is entirely rational — with GA being in most cases a product offered for free, and the value coming to Google entirely in the aggregate, it makes perfect sense to limit their risks in ways that don’t degrade their value, and to just kick risky setups off the platform rather than taking on extreme financial risk for individual free accounts.

It’s not only Google, by the way. There are other suppliers doing similar things which will no doubt require similar actions, but I am focusing on Google here simply because GA is pervasive throughout the web marketing world. Some companies are even going as far as shutting down entirely for EU citizens (like unroll.me). See this Twitter thread of others.

Consequence 1: Default data retention settings for GA will delete your data

Starting on May 25, Google will be changing the default for data retention, meaning that if you don’t take action, certain data older than the cutoff will be automatically deleted.

You can read more about the details of the change on Krista Seiden’s personal blog (Krista works at Google, but this post is written in her personal capacity).

The reason I say that this isn't strictly a GDPR thing is that it relates to changes Google is making on their end to ensure that they comply with their obligations as a data processor. It gives you tools you might need, but it isn't strictly related to your own GDPR compliance. There is no single "right" answer to how long you need to (or should, or are allowed to) keep this data stored in GA under the GDPR, but by my reading, given that it shouldn't be PII anyway (see below), it isn't really a GDPR question for most organizations. In particular, there is no reason to think that Google's default is the correct, mandated, or only setting you can choose under the GDPR.

Action: Review the promises being made by your legal team and your new privacy policy to understand the correct timeline setting for your org. In the absence of explicit promises to your users, my understanding is that you can retain any of this data you were allowed to capture in the first place, unless you receive a deletion request against it. So while most orgs will have at least some changes to make to privacy policies, most GA users can change the setting back to retain this data indefinitely (in GA, under Admin > Property > Tracking Info > Data Retention, choose "Do not automatically expire").

Consequence 2: Google is deleting GA accounts for capturing PII

It has long been against the Terms of Service to store any personally identifiable information (PII) in Google Analytics. Recently, though, it appears that Google has become far more diligent in checking for the presence of PII, and more robust in their handling of accounts found to contain any. Put more simply: Google will delete your account if they find PII.

It’s impossible to know for sure that this is GDPR-related, but being able if necessary to demonstrate to regulators that they are taking strict actions against anyone violating their PII-related terms is an obvious move for Google to reduce the risk they face as a Data Processor. It makes particular sense in an area where the vast majority of accounts are free accounts. Much like the previous point, and the reason I say that this is related to Google’s response to the GDPR coming into force, is that it would be perfectly possible to get your users’ permission to record their data in third-party services like GA, and fully comply with the regulations. Regardless of the permissions your users give you, Google’s GDPR-related crackdown (and heavier enforcement of the related terms that have been present for some time) means that it’s a new and greater risk than it was before.

Action: Audit your GA profile and implementation for PII risks:

  • There are various ways you can search within GA itself to find data that could be personally identifying in places like page titles, URLs, custom data, etc. (see these two excellent guides); a minimal sketch of this kind of pattern check follows this list
  • You can also audit your implementation by reviewing rules in Tag Manager and/or reviewing the code present on key pages. The most likely suspects are the places where people log in, take key actions on your site, give you additional personal information, or check out
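
As promised above, here is a minimal sketch of a pattern check, assuming you've exported a list of page paths or URLs from GA into a local text file. The file name and the regex are illustrative, not an official tool or a GA API call, and a real audit would also look for phone numbers, names, and other identifiers:

    import re
    import urllib.parse

    # A simple email pattern; extend with phone numbers, usernames, etc. as needed
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    # "ga-page-paths.txt" is an assumed export of page paths/URLs, one per line
    with open("ga-page-paths.txt") as f:
        for url in (line.strip() for line in f):
            # Decode percent-encoding so addresses hidden as %40 are caught too
            decoded = urllib.parse.unquote(url)
            if EMAIL.search(decoded):
                print("Possible PII:", url)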

Don’t take your EU law advice from big US tech companies

The internal effort and coordination required at Google to do their bit to comply, even "just" as data processor, is significant. Unfortunately, there are strong arguments that this kind of ostensibly user-friendly regulation, which imposes outsize compliance burdens on smaller companies, will cement the duopoly and dominance of Google and Facebook and enable them to pass the costs and burdens of compliance onto sectors that are already struggling.

Regardless of the intended or unintended consequences of the regulation, it seems clear to me that we shouldn't be basing our own businesses' (and our clients') compliance on self-interested advice and actions from the tech giants. No matter how impressive their own compliance, I've been hugely underwhelmed by the guidance content they've put out. See, for example, Google's GDPR "checklist", which is not exactly what I'd hope for:

Client Checklist: As a marketer we know you need to select products that are compliant and use personal data in ways that are compliant. We are committed to complying with the GDPR and would encourage you to check in on compliance plans within your own organisation. Key areas to think about:

  • How does your organisation ensure user transparency and control around data use?
  • Do you explain to your users the types of data you collect and for what purposes?
  • Are you sure that your organisation has the right consents in place where these are needed under the GDPR?
  • Do you have all of the relevant consents across your ad supply chain?
  • Does your organisation have the right systems to record user preferences and consents?
  • How will you show to regulators and partners that you meet the principles of the GDPR and are an accountable organisation?

So, while I’m not a lawyer, definitely not your lawyer, and this is not legal advice, if you haven’t already received any advice, I can say that you probably can’t just follow Google’s checklist to get compliant. But you should, as outlined above, take the specific actions you need to take to protect yourself and your business from their compliance activities.

