
How to parse apache log files with Awk

Using awk to get information from text files

Every now and then, I find myself looking at a bunch of Apache/nginx web server logs, trying to make sense of a spike in traffic or some odd behaviour on the server. To do that efficiently, I need to extract information from the logs very quickly. In my opinion, the fastest way to do this is with awk.

Lately I have found GoAccess to be a very quick way to get a report of web traffic, but sometimes I need the information in plain text, or I don’t have access to GoAccess on the server, so I still end up relying on awk.

Example Log file

The following is the output of an apache log that I had to download from WP Engine.

40.77.188.136 example.com - [16/Jun/2021:00:17:19 +0000] "GET /wp-content/themes/salient/js/third-party/select2.min.js?ver=1623798364 HTTP/1.0" 200 66522 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b"
40.77.189.124 example.com - [16/Jun/2021:00:17:20 +0000] "GET /wp-content/plugins/popup-press/js/libs/jquery.cookie.js?ver=1.4.1 HTTP/1.0" 200 3238 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b"
40.77.190.126 example.com - [16/Jun/2021:00:17:20 +0000] "GET /wp-content/plugins/popup-press/js/libs/jquery.easing.1.3.js?ver=1.3 HTTP/1.0" 200 8305 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b"
40.77.167.41 example.com - [16/Jun/2021:00:17:23 +0000] "GET /what-we-do/training/in-person-classes/ HTTP/1.0" 200 206002 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
114.119.137.180 example.com - [16/Jun/2021:00:17:27 +0000] "GET /what-we-do/training/in-person-classes?1587668887 HTTP/1.0" 301 0 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.137.180 example.com - [16/Jun/2021:00:17:29 +0000] "GET /what-we-do/training/in-person-classes/?1587668887 HTTP/1.0" 200 206057 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
104.196.63.107 example.com - [16/Jun/2021:00:17:53 +0000] "POST /wp-cron.php?doing_wp_cron=1623802673.1307730674743652343750 HTTP/1.0" 403 1673 "https://example.com/wp-cron.php?doing_wp_cron=1623802673.1307730674743652343750" "WordPress/5.7; https://example.com"
207.46.13.24 example.com - [16/Jun/2021:00:17:53 +0000] "GET /what-we-do/training/in-person-classes/?1614351193 HTTP/1.0" 200 206022 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
114.119.131.28 example.com - [16/Jun/2021:00:17:53 +0000] "GET /what-we-do/training/in-person-classes/?1623559885 HTTP/1.0" 200 206061 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
157.55.39.165 example.com - [16/Jun/2021:00:18:06 +0000] "GET /what-we-do/training/in-person-classes/?1620248397 HTTP/1.0" 200 206024 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Awk Cheatsheet for Combined Format Logs

awk '{print $1}' combined_log         # requester IP address (%h)
awk '{print $2}' combined_log         # (%l) (here, the virtual host being requested)
awk '{print $3}' combined_log         # userid (%u) (if basic auth was used)
awk '{print $4,$5}' combined_log      # date/time (%t), including the [ ] brackets
awk '{print $6}' combined_log         # request method (GET, POST, etc.), with a leading "
awk '{print $7}' combined_log         # URL requested (/about, /blog, /xmlrpc.php, etc.)
awk '{print $8}' combined_log         # HTTP version, with a trailing "

awk '{print $9}' combined_log         # status code (%>s)
awk '{print $10}' combined_log        # size (%b)

awk '{print $11}' combined_log        # HTTP referer, in quotes
awk '{print $12}' combined_log        # first word of the user agent only

awk -F\" '{print $2}' combined_log    # request line (%r)
awk -F\" '{print $4}' combined_log    # referer
awk -F\" '{print $6}' combined_log    # user agent
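
Splitting on whitespace leaves the opening quote attached to the request method and returns only the first word of the user agent. One possible workaround, shown here as a rough sketch, is to split on double quotes first and then break the request line apart with awk's split(). The sample line is copied from the log above; the temporary file and the variable names are only for illustration.

```shell
# Write one sample line (taken from the example log) to a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
40.77.167.41 example.com - [16/Jun/2021:00:17:23 +0000] "GET /what-we-do/training/in-person-classes/ HTTP/1.0" 200 206002 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
EOF

# With " as the separator, $2 is the whole request line and $6 the whole user agent.
# split() then breaks the request line on whitespace: req[1] is the method, req[2] the URL.
fields=$(awk -F'"' '{ split($2, req, " "); print req[1], req[2] }' "$log")
agent=$(awk -F'"' '{ print $6 }' "$log")

echo "$fields"
echo "$agent"
rm -f "$log"
```

This prints the method and URL without any stray quotes, followed by the complete user agent string.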

Summarize the response codes

First, I want to get an idea of how the server has been responding and whether there are any weird response codes. Basically, I want to make sure that the server is giving a 200 OK response more than anything else.

» awk '{print $9}' combined_log | sort | uniq -c | sort -rn

89782 200
5106 301
2384 403
 272 404
 150 206
  80 304
  64 302
  33 499
   4 500
   2 444
   1 409
   1 400
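
The same summary can also be produced in a single awk pass by counting status codes in an associative array, which avoids sorting the whole log before counting. This is a minimal sketch, not the method used above; the six log lines are made-up samples in the same field layout, and the names count and summary are arbitrary.

```shell
# Hypothetical sample lines in the same field layout as the log above.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 example.com - [16/Jun/2021:00:17:19 +0000] "GET /a HTTP/1.0" 200 100 "-" "ua"
192.0.2.2 example.com - [16/Jun/2021:00:17:20 +0000] "GET /b HTTP/1.0" 301 0 "-" "ua"
192.0.2.3 example.com - [16/Jun/2021:00:17:21 +0000] "GET /c HTTP/1.0" 200 100 "-" "ua"
192.0.2.4 example.com - [16/Jun/2021:00:17:22 +0000] "POST /d HTTP/1.0" 403 1673 "-" "ua"
192.0.2.5 example.com - [16/Jun/2021:00:17:23 +0000] "GET /e HTTP/1.0" 200 100 "-" "ua"
192.0.2.6 example.com - [16/Jun/2021:00:17:24 +0000] "GET /f HTTP/1.0" 301 0 "-" "ua"
EOF

# Count each status code ($9) in an array, print "count code" at the end,
# then sort numerically so the most frequent code comes first.
summary=$(awk '{ count[$9]++ } END { for (code in count) print count[code], code }' "$log" | sort -rn)

echo "$summary"
rm -f "$log"
# prints:
# 3 200
# 2 301
# 1 403
```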

List all 403 requests and summarize them

As you can see above, 2384 responses were HTTP 403 Forbidden. I want to know what’s being denied so often, so show me all the requests that resulted in the server sending a 403 response code.

» awk '($9 ~ /403/)' combined_log

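That results in too many entries: it just outputs all 2384 requests, which I can’t parse on my own. So I pipe awk through awk and then ask for a count of the entries. Note that uniq -c only merges adjacent identical lines, and the input here is not sorted first, so the same URL can show up several times in the output.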

» awk '($9 ~ /403/)' combined_log | awk '{print $9,$7}' | uniq -c | sort -r
 22 403 /xmlrpc.php
   6 403 //?s=/abc/abc/abc/$%7B@print(eval($_POST[c]))%7D
   4 403 /xmlrpc.php
   4 403 /xmlrpc.php
   4 403 /xmlrpc.php
   3 403 /xmlrpc.php
   3 403 /xmlrpc.php
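
The repeated /xmlrpc.php rows appear because uniq only collapses identical lines that sit next to each other. A small sketch of the difference, assuming made-up sample lines in the same field layout (the file and variable names are illustrative): putting sort in front of uniq -c merges every occurrence into one count.

```shell
# Hypothetical sample lines in the same field layout as the log above.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 example.com - [16/Jun/2021:00:17:19 +0000] "POST /xmlrpc.php HTTP/1.0" 403 1673 "-" "ua"
192.0.2.2 example.com - [16/Jun/2021:00:17:20 +0000] "GET /wp-login.php HTTP/1.0" 403 1673 "-" "ua"
192.0.2.3 example.com - [16/Jun/2021:00:17:21 +0000] "POST /xmlrpc.php HTTP/1.0" 403 1673 "-" "ua"
192.0.2.4 example.com - [16/Jun/2021:00:17:22 +0000] "GET /about HTTP/1.0" 200 5120 "-" "ua"
192.0.2.5 example.com - [16/Jun/2021:00:17:23 +0000] "POST /xmlrpc.php HTTP/1.0" 403 1673 "-" "ua"
EOF

# Without sort, /xmlrpc.php is interrupted by /wp-login.php, so it is counted in two separate runs.
unsorted=$(awk '$9 == 403 { print $9, $7 }' "$log" | uniq -c)

# With sort in front, identical lines become adjacent and uniq -c merges them into one row.
merged=$(awk '$9 == 403 { print $9, $7 }' "$log" | sort | uniq -c | sort -rn)

echo "without sort:"; echo "$unsorted"
echo "with sort:";    echo "$merged"
rm -f "$log"
```

On this sample, the first pipeline reports /xmlrpc.php twice (once with count 1, once with count 2), while the second reports it once with count 3.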

Find the IPs that are causing the most 403s

This tells me that a lot of requests are being made to xmlrpc.php by different IP addresses. Now I want to know what those IP addresses are.

» awk '($9 ~ /403/)' combined_log | awk '{print $1,$7}' | uniq -c | sort -r
  22 52.15.212.3 /xmlrpc.php
   6 182.61.37.112 //?s=/abc/abc/abc/$%7B@print(eval($_POST[c]))%7D
   3 71.47.160.116 /xmlrpc.php
   3 182.61.37.112 //yp/product.php?pagesize=${${@eval%28$_POST[-62]%29}}
   2 73.252.149.81 /xmlrpc.php

As you can see, I can keep chaining the output of awk to another awk, or to sort or uniq and get all the information I need.
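
If only the per-IP totals matter, the chain can be shortened to one awk call that filters on the status code and tallies by IP address. The following is a sketch under the same assumptions as the log format above; the sample lines use documentation IP addresses and are not real traffic, and hits and top_ips are arbitrary names.

```shell
# Hypothetical sample lines in the same field layout as the log above.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.10 example.com - [16/Jun/2021:00:17:19 +0000] "POST /xmlrpc.php HTTP/1.0" 403 1673 "-" "ua"
192.0.2.20 example.com - [16/Jun/2021:00:17:20 +0000] "POST /xmlrpc.php HTTP/1.0" 403 1673 "-" "ua"
192.0.2.10 example.com - [16/Jun/2021:00:17:21 +0000] "POST /xmlrpc.php HTTP/1.0" 403 1673 "-" "ua"
192.0.2.30 example.com - [16/Jun/2021:00:17:22 +0000] "GET /about HTTP/1.0" 200 5120 "-" "ua"
192.0.2.10 example.com - [16/Jun/2021:00:17:23 +0000] "GET /wp-login.php HTTP/1.0" 403 1673 "-" "ua"
EOF

# Keep only lines whose status ($9) is 403, count them per IP ($1),
# then list the five busiest addresses with the highest count first.
top_ips=$(awk '$9 == 403 { hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' "$log" | sort -rn | head -n 5)

echo "$top_ips"
rm -f "$log"
# prints:
# 3 192.0.2.10
# 1 192.0.2.20
```

Unlike the chained version, this ignores the URL and gives one row per address, which is usually what I want before looking an IP up.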

To get more information on the offending IP addresses, I like to use Ip Info.

If you want, you can check out this post, which was the basis of my original cheatsheet.