Analysing server log files for crawl data
Ever feel like you’re not getting the full picture from Google Search Console? Well, maybe that’s because you’re not. If you have access to the server side of your own or your client’s website, then you have a wealth of information right at your fingertips in the form of your access log files.
Your log files can be found in different locations on the server (accessible via FTP), depending on what type of website you operate. A quick online search relating to your website’s platform should provide you with enough information to find your access logs; for instance, Branded3’s logs were found in the folder /var/log/httpd/.
Once you have selected the files you want, you will be presented with lines and lines of data in a text document that at first glance won’t appear to make any sense, much like the picture below.
Don’t worry! As daunting as it looks, the process of turning this information into something that’s actually beneficial isn’t as hard as it may initially seem.
Step 1: Now you have your ugly text file, open Excel, then create and name a tab for your raw data. Copy the complete text file and paste it into cell A1. As you will see, each line pastes neatly into its own cell, although the task may take a while, depending on the size of the file.
Next, in order to trim the data down to a more manageable size, you need to understand what it means and what you want to get out of it. My aim was to analyse a day’s worth of data, so I needed to remove everything outside the date range I was interested in.
Each row will have data in this format:
Host IP -> Date / Time -> Directory / Page Path -> Response Code -> Bandwidth -> Referrer -> Spider Details
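For reference, a single line in a typical Apache access log looks something like the illustrative example below. The exact layout depends on your server’s LogFormat setting, so treat the values here as made up for illustration rather than taken from a real file:

```
66.249.66.1 - - [21/Dec/2015:06:25:24 +0000] "GET /blog/example-post/ HTTP/1.1" 200 5043 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```

Reading left to right, that’s the host IP, two unused identity fields, the date and time, the request (page path), the response code, the bandwidth in bytes, the referrer, and the user agent string containing the spider details.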
It is now time to establish a basic understanding of the data and select the date range you want to analyse. In my case, I opted for 21/Dec/2015, simply because these were the logs available to me at the time.
In order to do this, I put all of the data into a table (CTRL + T) and applied a text filter of “does not contain” for my selected date. Once I’d done that, I was able to select the remaining visible rows, delete them, then clear the filter.
Most of the time, however, if you have a larger site, this won’t work, as Excel’s filter dropdown only displays the first 10,000 unique values. This can prove something of a nightmare, but with a bit of Excel wizardry you can get around the problem. In the column next to the data, type the formula:
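The original formula isn’t reproduced here, but a helper formula along these lines (assuming your raw data sits in column A, starting at row 1) will flag the rows that contain your chosen date:

```
=IF(ISNUMBER(SEARCH("21/Dec/2015",A1)),1,0)
```

SEARCH returns the position of the date string if it appears anywhere in the row, so the IF marks matching rows with 1 and everything else with 0.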
Make sure you replace 21/Dec with your own date value, of course. Then, once you’ve copied that formula down every row, keep every row marked “1” and delete every row marked “0”.
Managing The Data
Step 2: Now you need to split up the data into manageable chunks so you can pivot it further later. Create another tab, named something along the lines of “TextToColumns Raw”, and make a table with the following headers for columns:
Host, Blank, Blank 1, Date, Time, Page, Response Code, Bandwidth, Referrer, Crawl Details
Paste your raw data into the Host column, then go to the Data tab and run Text to Columns on the raw data using the space delimiter. Now the data should look something like this:
Cleansing The Data
Step 3: Now you just need to get the data to a stage where it’s easy to pivot. So, create another new tab called “Log File Data” – this is where you will do the final data cleanse. On this page, you need to create a table with the following columns:
Host, Date, Page, Directory, Response Codes, Bandwidth, Referrer, Crawl Details, Spider
Now, paste the corresponding data from the previous tab, where you ran Text to Columns, into the relevant columns in the new table. This should leave the Directory and Spider columns empty.
In the Directory column type the formula:
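The original formula isn’t shown, but assuming the Page path sits in column C of the table above, something along these lines will pull out the top-level directory (falling back to the full path when there is no second slash, e.g. for the homepage):

```
=IFERROR(LEFT(C2,FIND("/",C2,2)),C2)
```

FIND locates the second slash in the path, so for a page like /blog/example-post/ the formula returns /blog/.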
And in the Spider column paste:
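Again, the original formula isn’t reproduced, but a nested IF/SEARCH check against the Crawl Details column (column H in the layout above) along these lines will label the common spiders; SEARCH is case-insensitive, so “bingbot” matches however the user agent capitalises it:

```
=IF(ISNUMBER(SEARCH("Googlebot",H2)),"GoogleBot",IF(ISNUMBER(SEARCH("bingbot",H2)),"BingBot",IF(ISNUMBER(SEARCH("Yahoo",H2)),"Yahoo","Other")))
```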
This can be extended to include any regular crawl detail occurrences. These are just the crawlers that I profiled and was interested in for further analysis.
Getting The Results
Step 4: In order to further understand and analyse the data, I wanted to pivot table it, so I created a tab called “Dashboard” to throw everything together. There is a lot of information you can get out of the log file but here are some of the pivot tables I made:
- Crawl Count (Row = Spider, Values = Count of Spider) – This counts the number of times each spider, e.g. BingBot, GoogleBot or Yahoo, visits your site in the time period you set.
- Error Count By Spider (Columns = Response Code, Rows = Spider, Values = Count of Spider) – This counts the total error codes that each of the spiders has picked up, so you can question why a particular spider has picked up a certain number of errors and another hasn’t.
- Spider & Response Code Breakdown (Filters = Spider & Response Code, Rows = Page) – This lets you see which pages returned each response code for each spider, a breakdown of the previous pivot.
- Largest Crawl Referrer (Filters = Spider, Rows = Referrer, Values = Count of Referrer) – This allows you to find your top referrer for each spider and investigate whether any external links are causing your site to be crawled more regularly.
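If you want a quick sanity check without building the pivot, a COUNTIF against the Spider column (column I in the layout above, on the tab name used in this walkthrough) gives the same crawl count for a single spider:

```
=COUNTIF('Log File Data'!I:I,"GoogleBot")
```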
These are just a few filters I have experimented with so far. There are plenty more combinations you can use to analyse your log file data.
Speeding Up The Process
In an effort to speed up this tedious process, I built an automated 3-step tool in VBA to do all the leg work for me. It took quite a while to create and the tool may have a few bugs, but it appears to be working at the moment and is much quicker than performing the task manually.
Talking about this tool could fill a whole post in itself, so I’m going to cut it short here. If you’re interested in finding out more about the tool or log file analysis, feel free to comment or contact us and we will be happy to help.