Every now and then, I find myself looking at a bunch of Apache/nginx webserver logs, trying to make sense of some spike in traffic or some odd behaviour on the server. To do that efficiently, I need to be able to extract information from the logs very quickly. The fastest way to do this, in my opinion, is with awk.
I recently found GoAccess, which is a very quick way to get a report of web traffic, but sometimes I need the information in plain text, and I don’t always have access to GoAccess on the server, so I still end up relying on awk.
Example Log file
The following is the output of an apache log that I had to download from WP Engine.
40.77.188.136 example.com - [16/Jun/2021:00:17:19 +0000] "GET /wp-content/themes/salient/js/third-party/select2.min.js?ver=1623798364 HTTP/1.0" 200 66522 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b"
40.77.189.124 example.com - [16/Jun/2021:00:17:20 +0000] "GET /wp-content/plugins/popup-press/js/libs/jquery.cookie.js?ver=1.4.1 HTTP/1.0" 200 3238 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b"
40.77.190.126 example.com - [16/Jun/2021:00:17:20 +0000] "GET /wp-content/plugins/popup-press/js/libs/jquery.easing.1.3.js?ver=1.3 HTTP/1.0" 200 8305 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b"
40.77.167.41 example.com - [16/Jun/2021:00:17:23 +0000] "GET /what-we-do/training/in-person-classes/ HTTP/1.0" 200 206002 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
114.119.137.180 example.com - [16/Jun/2021:00:17:27 +0000] "GET /what-we-do/training/in-person-classes?1587668887 HTTP/1.0" 301 0 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
114.119.137.180 example.com - [16/Jun/2021:00:17:29 +0000] "GET /what-we-do/training/in-person-classes/?1587668887 HTTP/1.0" 200 206057 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
104.196.63.107 example.com - [16/Jun/2021:00:17:53 +0000] "POST /wp-cron.php?doing_wp_cron=1623802673.1307730674743652343750 HTTP/1.0" 403 1673 "https://example.com/wp-cron.php?doing_wp_cron=1623802673.1307730674743652343750" "WordPress/5.7; https://example.com"
207.46.13.24 example.com - [16/Jun/2021:00:17:53 +0000] "GET /what-we-do/training/in-person-classes/?1614351193 HTTP/1.0" 200 206022 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
114.119.131.28 example.com - [16/Jun/2021:00:17:53 +0000] "GET /what-we-do/training/in-person-classes/?1623559885 HTTP/1.0" 200 206061 "-" "Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"
157.55.39.165 example.com - [16/Jun/2021:00:18:06 +0000] "GET /what-we-do/training/in-person-classes/?1620248397 HTTP/1.0" 200 206024 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
Awk Cheatsheet for Combined Format Logs
awk '{print $1}' combined_log # requester ip address (%h)
awk '{print $2}' combined_log # (%l) (the virtualhost being requested)
awk '{print $3}' combined_log # userid (%u) (if basic auth was used)
awk '{print $4,$5}' combined_log # date/time (%t)
awk '{print $6}' combined_log # Request Type (GET, POST, etc..)
awk '{print $7}' combined_log # URL Requested (/about, or /blog or /xmlrpc.php) etc
awk '{print $8}' combined_log # HTTP version
awk '{print $9}' combined_log # status code (%>s)
awk '{print $10}' combined_log # size (%b)
awk '{print $11}' combined_log # http referer (still wrapped in quotes)
awk '{print $12}' combined_log # User Agent (first word only; use the -F\" form below for the full string)
awk -F\" '{print $2}' combined_log # request line (%r)
awk -F\" '{print $4}' combined_log # referer
awk -F\" '{print $6}' combined_log # user agent
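To see why the quote-delimited variants matter, here is a small sketch that runs both approaches against one of the sample lines above. The variable names are just for illustration. With whitespace splitting, `$12` only holds the first word of the user agent; with `-F'"'`, `$6` holds the whole string.

```shell
# One line taken from the sample log above.
line='40.77.167.41 example.com - [16/Jun/2021:00:17:23 +0000] "GET /what-we-do/training/in-person-classes/ HTTP/1.0" 200 206002 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"'

# Default whitespace splitting: $12 is only the first word of the
# user agent, including its opening double quote.
ua_first_word=$(printf '%s\n' "$line" | awk '{print $12}')

# Splitting on double quotes: $6 is the complete user agent string.
ua_full=$(printf '%s\n' "$line" | awk -F'"' '{print $6}')

printf '%s\n%s\n' "$ua_first_word" "$ua_full"
```

So whenever I need the full request line, referer or user agent, I reach for the `-F\"` form.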
Summarize the response codes
First, I want to get an idea of how the server has been responding and whether there are any weird response codes. Basically, I want to make sure that the server is giving a 200 OK response more than anything else.
» awk '{print $9}' combined_log | sort | uniq -c | sort -r
89782 200
5106 301
2384 403
272 404
150 206
80 304
64 302
33 499
4 500
2 444
1 409
1 400
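The same summary can also be produced in a single awk pass by counting into an array, which avoids sorting the whole log before counting. This is only a sketch: the function name `count_status` is my own, and it assumes the status code is in field 9, as in the log format shown above.

```shell
# count_status (hypothetical helper): tally the status code in field 9
# with an awk associative array, then sort the totals numerically,
# highest first. Reads the files given as arguments, or stdin if none.
count_status() {
  awk '{ count[$9]++ } END { for (code in count) print count[code], code }' "$@" |
    sort -rn
}

# Usage: count_status combined_log
```

I use `sort -rn` here because the counts printed by awk are not padded, so a plain `sort -r` would compare them as text.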
List all 403 requests and summarize them
As you can see above, there were 2384 responses that were HTTP 403 Forbidden. I want to know what’s being denied so often, so I ask awk to show me all the requests that resulted in the server sending a 403 response code.
» awk '($9 ~ /403/)' combined_log
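That results in too many entries: it just outputs all 2384 requests, which I can’t parse by eye. So I pipe awk through awk and then count the entries with uniq -c. Note that uniq only collapses adjacent duplicate lines, so without a sort in front of it the same URL can show up several times in the output.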
» awk '($9 ~ /403/)' combined_log | awk '{print $9,$7}' | uniq -c | sort -r
22 403 /xmlrpc.php
6 403 //?s=/abc/abc/abc/$%7B@print(eval($_POST[c]))%7D
4 403 /xmlrpc.php
4 403 /xmlrpc.php
4 403 /xmlrpc.php
3 403 /xmlrpc.php
3 403 /xmlrpc.php
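The repeated /xmlrpc.php rows above appear because `uniq -c` only merges adjacent identical lines. If I want one total per URL, a sketch like the following works, with a `sort` placed before `uniq -c`. The function name `summarize_403` is made up for this example, and `$9 == 403` is used as an exact match instead of the regex match.

```shell
# summarize_403 (hypothetical helper): one total per requested URL
# among the responses whose status code (field 9) is exactly 403.
# Reads the files given as arguments, or stdin if none.
summarize_403() {
  awk '$9 == 403 { print $9, $7 }' "$@" |
    sort |        # group identical lines together so uniq can merge them
    uniq -c |     # prefix each distinct line with its count
    sort -rn      # highest count first
}

# Usage: summarize_403 combined_log
```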
Find the IPs that are causing the most 403s
This tells me that a lot of requests are being made to xmlrpc.php by different IP addresses. Now I want to know what those IP addresses are.
» awk '($9 ~ /403/)' combined_log | awk '{print $1,$7}' | uniq -c | sort -r
22 52.15.212.3 /xmlrpc.php
6 182.61.37.112 //?s=/abc/abc/abc/$%7B@print(eval($_POST[c]))%7D
3 71.47.160.116 /xmlrpc.php
3 182.61.37.112 //yp/product.php?pagesize=${${@eval%28$_POST[-62]%29}}
2 73.252.149.81 /xmlrpc.php
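If I only care about which addresses are responsible, regardless of the URL, I can count per IP instead of per IP and URL pair. A minimal sketch, assuming the client IP is in field 1 and the status code in field 9 as in the log above; the function name `top_403_ips` and the cutoff of 10 are arbitrary choices of mine.

```shell
# top_403_ips (hypothetical helper): count 403 responses per client IP
# (field 1) and print the ten busiest addresses, highest count first.
# Reads the files given as arguments, or stdin if none.
top_403_ips() {
  awk '$9 == 403 { hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' "$@" |
    sort -rn |
    head -n 10
}

# Usage: top_403_ips combined_log
```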
As you can see, I can keep chaining the output of awk to another awk, or to sort or uniq and get all the information I need.
To get more information on the offending ip addresses, I like to use Ip Info
If you want, you can check out this post, which was the basis of my original cheatsheet.