The Incapsula 2014 bot traffic report looked at 20,000 websites (of all sizes) over a 90-day period and found that bots accounted for 56% of all website traffic, with malicious bots making up 29% of all traffic. The report also found that the more you build your brand, the larger a target you become.
While there are services that automate far more advanced techniques than those shown here, this article is meant to be an easy starting point (using Excel) for understanding the basics of using log files, blocking bad bots at the server level, and cleaning up Analytics reports.
All servers keep a list of every request to the site they host. Whether a customer is using the Firefox browser or Googlebot is looking for newly created pages, all activity is recorded in a simple file.
The location of these log files depends on the type of server or host you have. Here are some details on common platforms.
Once you have found the files, consolidate them, then open them in Excel (or your preferred tool). Because some log files are very large, this is often easier said than done. For most small to medium-sized sites, a computer with a lot of processing power should be sufficient.
In this example, the .log files were manually consolidated into a new .txt file using a plain text editor, then opened in Excel using text-to-columns with a “space” delimiter, followed by a little additional cleanup to get the column headers to line up.
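For sites whose logs are too large to paste together by hand, the consolidation and column-splitting step can be sketched in Python. This is a minimal sketch that assumes the Apache/nginx "combined" log format; the folder passed to `read_logs` and the field names are illustrative assumptions, not something the original setup specifies, and other servers (IIS, for example) lay out their fields differently.

```python
import re
from pathlib import Path

# Assumption: each line follows the Apache/nginx "combined" log format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def read_logs(log_dir):
    """Yield parsed rows from every .log file in log_dir (a hypothetical folder)."""
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                row = parse_line(line)
                if row is not None:
                    yield row
```

Reading line by line keeps memory use flat, which matters when the combined logs exceed what Excel can open.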
After consolidating and opening logs in Excel, it’s fairly easy to find the number of hits by IP.
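To do this, build a pivot table on the log data with the IP column as the Row Label and a count of requests as the Value, then sort by that count in descending order.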
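When the logs are too big for Excel, the same hits-per-IP count can be approximated in a few lines of Python. This sketch assumes the client IP is the first space-separated field on each line, which holds for common Apache and nginx formats but should be checked against your own logs; the sample lines use reserved documentation addresses and are made up for illustration.

```python
from collections import Counter


def hits_by_ip(lines, top_n=10):
    """Count requests per client IP and return the top_n as (ip, hits) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1  # assumption: IP is the first field
    return counts.most_common(top_n)


# Made-up sample lines for illustration only.
sample_lines = [
    '203.0.113.5 - - [10/Oct/2014:13:55:36 -0700] "GET / HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.5 - - [10/Oct/2014:13:55:37 -0700] "GET /a HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.7 - - [10/Oct/2014:13:55:38 -0700] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2014:13:55:39 -0700] "GET /b HTTP/1.1" 200 512 "-" "-"',
]

print(hits_by_ip(sample_lines))
# [('203.0.113.5', 3), ('198.51.100.7', 1)]
```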
As a final step in identifying potential bad bots, find which user agents are associated with IPs hitting your site the most. To do this, go back to the pivot table and simply add the User Agent to the Row Label section of the Pivot Table.
Now, finding the User Agent associated with the top-hitting IP is as simple as a text search. In this case, the IP (which was from China) declared no User Agent and hit the site over 80,000 more times than any other IP.
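The IP-plus-User-Agent breakdown can also be sketched outside Excel. The example below assumes the "combined" log format, where the User Agent is the last double-quoted field on the line; the sample lines and the `(none declared)` label are illustrative choices rather than anything taken from the logs discussed here.

```python
import re
from collections import Counter

QUOTED = re.compile(r'"([^"]*)"')


def agents_by_ip(lines):
    """Count requests per (ip, user_agent) pair."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        quoted = QUOTED.findall(line)
        if not fields or not quoted:
            continue  # skip blank or unrecognized lines
        agent = quoted[-1]  # assumption: User Agent is the last quoted field
        if agent in ("", "-"):
            agent = "(none declared)"
        counts[(fields[0], agent)] += 1
    return counts


# Made-up sample lines for illustration only.
sample_lines = [
    '203.0.113.5 - - [10/Oct/2014:13:55:36 -0700] "GET / HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.5 - - [10/Oct/2014:13:55:37 -0700] "GET /a HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.7 - - [10/Oct/2014:13:55:38 -0700] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for (ip, agent), hits in agents_by_ip(sample_lines).most_common():
    print(ip, hits, agent)
```

An IP with a very high hit count and no declared User Agent, as in the case above, stands out immediately in this output.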
Now that the malicious IP has been identified, use these instructions to prevent number inflation in Analytics, then block that IP from accessing the site completely.
Using Filters in Google Analytics, you can exclude IPs. Navigate to Admin -> Choose View (always a good idea to Create New View when making changes like this) -> Filters -> + New Filter -> Predefined -> Exclude traffic from the IP addresses -> Specify IP (regular expression).
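Because the filter field takes a regular expression, the dots in an IP address need escaping; an unescaped `.` matches any character, so a loosely written pattern can exclude more traffic than intended. The helper below is a small sketch of one way to build an anchored pattern for one or more addresses. The function name and the sample addresses are assumptions made for illustration.

```python
import re


def ip_filter_pattern(ips):
    """Build one anchored regular expression that matches exactly the given IPs."""
    escaped = [re.escape(ip) for ip in ips]  # turns "." into "\."
    return "^(" + "|".join(escaped) + ")$"


pattern = ip_filter_pattern(["203.0.113.5", "198.51.100.7"])
print(pattern)
# ^(203\.0\.113\.5|198\.51\.100\.7)$

# Quick local check before pasting the pattern into the filter field.
print(bool(re.search(pattern, "203.0.113.5")))   # True
print(bool(re.search(pattern, "203.0.113.50")))  # False
```

Anchoring with `^` and `$` keeps `203.0.113.5` from also matching `203.0.113.50`.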
Tip: Google Analytics can exclude known crawlers identified by the IAB (a $14,000 value for non-members). Just navigate to Admin -> View Settings, and under “Bot Filtering,” check “Exclude all hits from known bots and spiders.” As with filters, it’s best practice to create a new view before altering view settings.
If you use Omniture, there are three methods to exclude data by IP.
Similar to identifying where the log files are located, the method of blocking IPs from accessing your site at the server level changes depending on the type of server you use.
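As one illustration, the sketch below generates blocking rules for two common servers: `Require not ip` inside a `<RequireAll>` block for Apache 2.4, and `deny` for nginx. The function names and the sample address are assumptions for the example. Where the generated lines belong (an .htaccess file, a virtual host, or an nginx `server` block) depends on your setup, and older Apache 2.2 installs use a different syntax.

```python
import ipaddress


def apache_block(ips):
    """Return an Apache 2.4 snippet that allows everyone except the given IPs."""
    lines = ["<RequireAll>", "    Require all granted"]
    for ip in ips:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        lines.append(f"    Require not ip {ip}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


def nginx_block(ips):
    """Return nginx deny directives for the given IPs."""
    lines = []
    for ip in ips:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        lines.append(f"deny {ip};")
    return "\n".join(lines)


bad_ips = ["203.0.113.5"]  # made-up address from a reserved documentation range
print(apache_block(bad_ips))
print(nginx_block(bad_ips))
```

Validating each address before writing it out guards against pasting a typo into a live server config.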
Third-party solutions route all traffic through a network to identify bots (good and bad) in real time. They don’t just look at IPs and User Agent strings, but also at HTTP headers, on-site navigation behavior and many other factors. Some sites use methods like reCAPTCHA to ensure their site’s visitors are human.
What other methods have you heard of that can help protect against the “rise of the bad bots”?