An Introductory Guide to Log File Analysis

Welcome to our guide to log file analysis. This guide is designed to help you learn what log files are, why we analyse them, how to combine that data with other datasets, and how you can optimise your site using it. We'll also be looking at some of the tooling we use, to give you inspiration for what you could do.

What are Log Files?

Every time anyone interacts with your site, human or bot, they make requests to a webserver. Ones you might have heard of include Apache, Node.js and Nginx, but there are many others, such as lighttpd, Jetty, Tomcat and so on. These are pieces of software designed to take files and serve them to end users over the internet.

Now, whilst it slows your site down very fractionally, you'll tend to have access logging enabled for your websites. Access logs allow you to view a history of what's been requested, what requested it, where the request was referred from and so on. They aren't the only kind of server log - there are also error logs, for example - but access logs are the only ones we're interested in for the purposes of this guide.

Access logs can use any format the admin specifies, but the most common is known as the combined log format, and logs the following details:

  • Host
  • Identity
  • User ID
  • Date
  • Request
  • Status code
  • Object size
  • Referrer
  • User-Agent HTTP request header

This is almost exactly the same as the common log format, with the addition of the last two fields.

For example, a common log request might look like this:

127.0.0.1 user-identifier foo [12/Jan/2018:13:55:36 -0700] "GET /logo.png HTTP/1.0" 200 1072

In the string above...

  • 127.0.0.1 is the IP address of the client which made the request (a user's computer, or a bot's server for example)
  • user-identifier is the RFC 1413 identity of the client
  • foo is the userid requesting the document
  • [12/Jan/2018:13:55:36 -0700] gives respectively the date, time, and time zone when the request was received and processed
  • "GET /logo.png HTTP/1.0" is the request itself. The three parts are the method of the request (GET), the resource requested (/logo.png), and the HTTP protocol used (HTTP/1.0)
  • 200 is the HTTP status code returned. A 2xx means the request was successful, whilst 3xx codes are redirections, 4xx codes are client errors, and 5xx codes denote server errors
  • 1072 is the size of the object which should be returned to the client, as measured in bytes

Any time in a log you see "-", that means the data was missing. So the following has no data for the identifier or user, but adds in the last two fields for the combined log format:

111.222.33.123 - - [12/Jan/2018:13:08:39 -0400] "GET /index.htm HTTP/1.1" 200 198 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

In this example, there's no referrer (as we can see from the "-"), and the User-Agent request header identifies the client as Googlebot.
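To make that concrete, here's a minimal sketch of how you might pull those fields apart in PHP. The pattern and field names are just illustrative, and assume the standard combined format described above; real server configurations can differ.

function parseCombinedLogLine($line) {
    // Assumes the standard combined log format described above;
    // the field names here are just illustrative.
    $pattern = '/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (!preg_match($pattern, trim($line), $m))
        return NULL; // the line didn't match the expected format

    return [
        'host'      => $m[1],
        'identity'  => $m[2],
        'user'      => $m[3],
        'date'      => $m[4],
        'request'   => $m[5],
        'status'    => (int) $m[6],
        'size'      => $m[7] === '-' ? 0 : (int) $m[7],
        'referrer'  => $m[8],
        'userAgent' => $m[9],
    ];
}

$line = '111.222.33.123 - - [12/Jan/2018:13:08:39 -0400] "GET /index.htm HTTP/1.1" 200 198 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"';
print_r(parseCombinedLogLine($line));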

Which brings us on to...

What is Log File Analysis?

Log file analysis is the process of taking an access log and looking at its data in aggregate, to learn which pages are being crawled too frequently or not at all, what's being crawled that shouldn't be, which errors come up repeatedly, which frequently requested pages sit behind redirects, and so on. This data can be used to help tidy up a site's architecture, and ensure the best possible user experience for both humans and bots.

This matters in the case of search engines, because search engines will only crawl a site so much. As a rough rule of thumb, the more important a search engine thinks your site is, the more they'll try to crawl it. Thus sites like Twitter, Wikipedia and large news organisations are crawled constantly, whilst a niche company with few links pointing at their site and no social activity will be rarely crawled. As a result, if you're a low authority site, you need to maximise the time Google and co spend crawling your site, and if you're large, you want to make it as efficient as possible.

Preparing Log Files

To go about analysing a site, firstly you need to get hold of its access log files and strip them to only include visits from Googlebot. That means looking for lines which match the following user agent:

"Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)"

However, you also need to validate that they come from IP addresses which are actual Googlebot servers. You can make this more efficient by making a list of all the unique IPs from the lines which appear to be from Googlebot, and then stripping any lines whose IP addresses weren't Google's.
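As a starting point, here's a minimal sketch of that first filtering pass in PHP, building on the parsing function above. It assumes a combined-format log file called access.log; the variable names are ours.

$candidateLines = [];
$uniqueIps = [];

// Assumes a combined-format log file called access.log
$handle = fopen('access.log', 'r');
while (($line = fgets($handle)) !== FALSE) {
    // Keep only lines whose User-Agent claims to be Googlebot
    if (strpos($line, 'Googlebot') === FALSE)
        continue;

    $candidateLines[] = $line;

    // The IP address is the first whitespace-separated field on the line
    $ip = strtok($line, ' ');
    $uniqueIps[$ip] = TRUE;
}
fclose($handle);

$uniqueIps = array_keys($uniqueIps); // the list of IPs to verify against Google's servers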

The best way to do that is as described on Google's Verifying Googlebot page. If you don't want to create your own tester, you can use ours at https://toughcompetent.com/tools/googlebot_tester/index.php?ip=66.249.76.72. Simply change the ip value to any IP address you want to test. It returns a JSON response, with an isGoogle parameter of true or false.

Alternatively, if you want to host it yourself, here's a sample PHP code snippet to do it:

function testGooglebot ($ip) {
    // Reject anything that isn't a valid IP address
    if (!filter_var($ip, FILTER_VALIDATE_IP))
        return FALSE;

    // Reverse DNS: the hostname should end in google.com or googlebot.com
    $hostname = gethostbyaddr($ip);
    $test = preg_match("/(^|\.)google(bot)?\.com$/", $hostname);

    // Forward DNS: the hostname must resolve back to the original IP
    $ipByHostname = gethostbyname($hostname);
    return ($ipByHostname === $ip)
        ? $test === 1
        : FALSE;
}
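To tie that back to the filtering step, here's how you might use it: verify each unique IP once, then keep only the log lines whose IP passed. The $uniqueIps and $candidateLines variables are the ones from the earlier sketch.

// Verify each unique IP once, then strip lines from unverified addresses.
// $uniqueIps and $candidateLines come from the earlier filtering sketch.
$verified = [];
foreach ($uniqueIps as $ip) {
    $verified[$ip] = testGooglebot($ip);
}

$googlebotLines = array_filter($candidateLines, function ($line) use ($verified) {
    $ip = strtok($line, ' ');
    return !empty($verified[$ip]);
});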

This can be amended for any bot, provided you know the servers it should come from. As such, you can still use this method if you're performing SEO for, say, Baidu or Yandex.

Log File Analysis

Having got your freshly prepared list of Googlebot-crawled resources, you'll now need some sort of log file parser. Many exist, including ones from Screaming Frog, and Microsoft's Log Parser. We've created our own software specifically for performing the kinds of operations we do, but at a push you could even use Excel, although that means turning your log files into CSV files and fiddling around a bit to make it work. It also vastly limits the number of rows you can look at to around 100,000 or fewer, which often isn't enough. For smaller sites, though, it can work.

We'd also recommend adding data into this dataset. We'll commonly use XML sitemaps as a canonical list of every page that should exist in a site, and our own crawl of a site to log things like the page title, header metadata and so on.
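For example, here's a rough sketch of pulling the URL list out of an XML sitemap with PHP's SimpleXML extension. It assumes a single sitemap.xml with standard <url><loc> entries; a sitemap index file would need an extra loop over its child sitemaps.

// A rough sketch: read the canonical URL list from a standard XML sitemap.
// Assumes a single sitemap.xml with <url><loc> entries.
$sitemap = simplexml_load_file('sitemap.xml');

$sitemapUrls = [];
foreach ($sitemap->url as $entry) {
    $sitemapUrls[] = (string) $entry->loc;
}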

Here are the sorts of things you're looking for:

Crawl Frequency & Volume

This gives you an idea as to how often Google's prepared to visit your site, and how long it'll spend there. Whatever happens, you want your site to be as crawl-efficient as possible, but if you're not getting crawled often or deeply, you really need to work to make sure everything gets indexed correctly.
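As a simple illustration, here's a sketch that counts verified Googlebot requests per day, building on the $googlebotLines list and parseCombinedLogLine() function assumed in the earlier sketches.

// Count verified Googlebot requests per day to gauge crawl frequency.
$requestsPerDay = [];
foreach ($googlebotLines as $line) {
    $fields = parseCombinedLogLine($line);
    if ($fields === NULL)
        continue;

    // The date field looks like "12/Jan/2018:13:08:39 -0400"
    $dt = DateTime::createFromFormat('d/M/Y:H:i:s O', $fields['date']);
    $day = $dt ? $dt->format('Y-m-d') : 'unknown';
    $requestsPerDay[$day] = ($requestsPerDay[$day] ?? 0) + 1;
}

ksort($requestsPerDay);
print_r($requestsPerDay);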

Request Volume & Status Code

This is simply a count of requests by each bot for each resource, and will let you see what's getting visited and how often, as well as how that resource performs. You may well have a page that works sometimes but is down at others, or a page that existed and has now been redirected. This type of analysis lets us see those sorts of things, as well as filtering to see errors that come up regularly. We can then address those issues, by removing or redirecting links to expired resources (40x codes), redirecting links to their new locations (30x codes), and fixing server issues around resources that crash for any reason (50x codes).
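Here's a rough sketch of that tally, again building on the parsed fields assumed earlier: for each requested path it counts hits and the status codes seen, then lists anything that ever returned a non-200 response.

// Tally hits and status codes per requested resource.
$resources = [];
foreach ($googlebotLines as $line) {
    $fields = parseCombinedLogLine($line);
    if ($fields === NULL)
        continue;

    // The request field looks like "GET /index.htm HTTP/1.1"; keep the path
    $parts = explode(' ', $fields['request']);
    $path = isset($parts[1]) ? $parts[1] : $fields['request'];

    if (!isset($resources[$path]))
        $resources[$path] = ['hits' => 0, 'statuses' => []];

    $resources[$path]['hits']++;
    $status = $fields['status'];
    $resources[$path]['statuses'][$status] = ($resources[$path]['statuses'][$status] ?? 0) + 1;
}

// List resources that ever returned something other than a 200
foreach ($resources as $path => $info) {
    $errors = array_diff_key($info['statuses'], [200 => TRUE]);
    if ($errors)
        echo $path . ': ' . json_encode($errors) . PHP_EOL;
}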

Large Resource Request Volume

We often break down a site through k-means clustering to get an idea of the distribution of assets in a site by size and type, but any way you want to approach this works. You'll often be able to see things that will take a long time to crawl, such as large image files, exceptionally large pages or JavaScript files. Any of these that aren't required for Googlebot to index can be blocked in your robots.txt file, or optimised to try and bring the file size down.

Also, anything that appears particularly large and is also an HTML file is ripe for optimisation.
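A full k-means clustering is beyond a short snippet, but here's a simpler sketch in the same spirit, grouping requests by file extension and flagging anything over an arbitrary size threshold (500 KB here, purely as an example).

// Group requests by file extension and flag anything over an arbitrary
// threshold (500 KB here, purely as an example).
$threshold = 500 * 1024;
$byType = [];
$largeResources = [];

foreach ($googlebotLines as $line) {
    $fields = parseCombinedLogLine($line);
    if ($fields === NULL)
        continue;

    $parts = explode(' ', $fields['request']);
    $path = isset($parts[1]) ? $parts[1] : '';
    $urlPath = parse_url($path, PHP_URL_PATH) ?: '';
    $ext = strtolower(pathinfo($urlPath, PATHINFO_EXTENSION)) ?: 'html';

    $byType[$ext][] = $fields['size'];
    if ($fields['size'] > $threshold)
        $largeResources[$path] = $fields['size'];
}

// Average size per type gives a rough picture of where the crawl time goes
foreach ($byType as $ext => $sizes) {
    printf("%s: %d requests, avg %d bytes\n", $ext, count($sizes), array_sum($sizes) / count($sizes));
}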

Uncrawled Resources

Anything that doesn't appear in the log files but is in the XML sitemap is always worth noting down. It could be that those pages are blocked in the robots.txt file, or that they're buried so far in the architecture that Google never finds them. In the first case they're worth removing from the sitemaps; in the second, they're a sign you need to revisit how your internal site architecture works. Commonly we see these issues where there's complex filtering or navigation on ecommerce or large media sites.
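Given the sitemap URL list from earlier, this is essentially a set difference. Here's a minimal sketch, assuming $sitemapUrls holds full URLs and $resources (the per-path tally from the status-code section) is keyed by request path.

// Which sitemap URLs never appear in the Googlebot log?
// $sitemapUrls and $resources come from the earlier sketches.
$crawledPaths = array_keys($resources);

$uncrawled = array_filter($sitemapUrls, function ($url) use ($crawledPaths) {
    $path = parse_url($url, PHP_URL_PATH) ?: '/';
    return !in_array($path, $crawledPaths, TRUE);
});

print_r(array_values($uncrawled));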

Blockable Resources

We occasionally see some particularly interesting things where Googlebot has found something it either shouldn't have or doesn't need to. Common examples of this are directories of PDFs or other resources that don't need to be indexed, URLs with parameters in them (commonly comment IDs for blog posts, utm_* URLs and similar) and even API endpoints! By breaking down assets by response type, we can see these sorts of issues.
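Here's a hedged sketch of spotting two of those patterns in the per-resource tally from earlier: URLs carrying query-string parameters, and requests for PDFs.

// Flag parameterised URLs and PDF requests from the earlier tally.
$parameterised = [];
$pdfs = [];

foreach ($resources as $path => $info) {
    if (strpos($path, '?') !== FALSE)
        $parameterised[$path] = $info['hits'];

    if (preg_match('/\.pdf$/i', parse_url($path, PHP_URL_PATH) ?: ''))
        $pdfs[$path] = $info['hits'];
}

arsort($parameterised); // most-crawled parameterised URLs first
arsort($pdfs);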

Soft 404 Errors

A soft 404 is a page that hasn't been found, and says so to the user, but returns a 200 status code. By having a crawl of the site, we can look for titles that indicate a 404 (things like "Oops! Page not found" and so on) but that have a 20x response code. We can then add those to the list of missing resources to be fixed. You might want to look these URLs up in your analytics package to decide which to fix first, based on how much traffic they receive.
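As a sketch, assuming you've exported your own crawl to an array of URL, title and status code rows, the check might look like this; the "not found" phrases are purely illustrative.

// Flag pages whose title looks like a 404 but whose status code is 2xx.
// $crawl is assumed to be an array of ['url' => ..., 'title' => ..., 'status' => ...]
// rows exported from your own crawl of the site.
$notFoundPhrases = ['page not found', 'not be found', '404'];

$soft404s = array_filter($crawl, function ($page) use ($notFoundPhrases) {
    if ($page['status'] < 200 || $page['status'] > 299)
        return FALSE;

    foreach ($notFoundPhrases as $phrase) {
        if (stripos($page['title'], $phrase) !== FALSE)
            return TRUE;
    }
    return FALSE;
});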

Faceted Navigation & Pagination

In the case of faceted navigation, you're presenting a user with a great experience in being able to drill down to any selection of products they'd like. However, it creates vast numbers of internal links for Google to crawl, which needs to be addressed or you'll risk having serious indexation issues.

For pagination, it's easy to create areas of your site where Google will disappear off into a spiral of crawl inefficiency. As a rough rule of thumb, this can be easily fixed by creating links to the mid points in the list of pagination entries for Google to crawl. This essentially turns the crawl into a binary search tree for discovering and indexing content. For more information on this, check out this great post from Portent on pagination tunnels.
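As a rough illustration of the midpoint idea (a simplification, not Portent's exact scheme), a paginated template could link each page to its neighbours plus the midpoints of the ranges either side, so any page in the series is reachable in roughly log2(n) hops:

// A rough illustration: link each page to its neighbours plus the midpoints
// of the ranges either side of it, so deep pages are reachable in ~log2(n) hops.
function paginationLinks($current, $total) {
    $links = [];
    if ($current > 1)
        $links[] = $current - 1;                 // previous page
    if ($current < $total)
        $links[] = $current + 1;                 // next page
    if ($current > 2)
        $links[] = intdiv(1 + $current, 2);      // midpoint back towards the start
    if ($current < $total - 1)
        $links[] = intdiv($current + $total, 2); // midpoint towards the end
    return array_values(array_unique($links));
}

print_r(paginationLinks(1, 100));  // [2, 50]
print_r(paginationLinks(50, 100)); // [49, 51, 25, 75]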

Duplicate Content

Any time in your site crawl you see multiple pages with the same title tag, you're either getting duplicate content, or something close enough that it's generating the same title repeatedly. In either case, it means something will need fixing; if it's duplicate pages, they'll need to be removed, and if it's templating generating pages which are very similar, how those pages are generated in your site's architecture will need looking at. A common example of the latter would be product pages that sit in multiple categories, where the site adds the category to the URL.
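Using the same assumed crawl export as in the soft-404 sketch, finding repeated titles is a simple grouping exercise:

// Group the assumed crawl export by title and keep any title used more than once.
$byTitle = [];
foreach ($crawl as $page) {
    $byTitle[$page['title']][] = $page['url'];
}

$duplicateTitles = array_filter($byTitle, function ($urls) {
    return count($urls) > 1;
});

print_r($duplicateTitles);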

Snag List Creation

Once you've created a list of all the things that need fixing, you can categorise them by how much impact they're likely to have. For example, if you're auditing to increase domain authority, you could look at pages that have expired and check for inbound links, using an export from something like Ahrefs or Majestic. That can be used to work out what pages need addressing first.

On the other hand, if you're looking to optimise crawl budget, you can look at which parts of the site produce non-200 response codes and fix those, before moving on to other issues later.

The kinds of things you'll be looking for are ways to gain greater benefit from external links, internal site architecture restructuring, and crawl budget optimisation. Where your priorities lie will determine what to work on first, and this is especially true where you've limited development time or resources. However, over time you should look to fix each problem identified.

Hopefully you've found this useful. If you're having issues with the technical side of your site's SEO, get in touch. We'd love to help.

If you've enjoyed this post, you might want to follow our founder, Pete Watson-Wailes on Twitter