Let's first list the important information that we may need from the Apache logs:
IP address
Time stamp
HTTP method
URI path
Response code
User agent
To read a log file, I prefer to read it as an array of lines:
apache_logs = File.readlines "/var/log/apache2/access.log"
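File.readlines loads the whole file into memory at once. For a very large access log, a minimal sketch using File.foreach, which yields one line at a time, might look like the following. The temporary file and its contents are hypothetical stand-ins so the example runs anywhere.

```ruby
require 'tempfile'

# Hypothetical stand-in for /var/log/apache2/access.log.
sample = Tempfile.new('access.log')
sample.write("first request line\nsecond request line\n")
sample.close

# File.foreach yields each line in turn instead of building one big array.
line_count = 0
File.foreach(sample.path) do |line|
  line_count += 1 unless line.strip.empty?
end

puts line_count
sample.unlink
```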
I was looking for a simple regular expression for Apache logs. I found one and applied a small tweak.
apache_regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/
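To see what the capture groups hold, here is a small sketch that applies the expression above to single lines. Both sample lines are assumptions written in the combined log format, and the variable names are only illustrative. Note that this version expects a literal "-" as the referer, so a line that carries a real referer does not match.

```ruby
# The regular expression from the text, unchanged.
apache_regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/

# Assumed sample line with an empty ("-") referer.
line = '127.0.0.1 - - [12/Dec/2015:20:09:05 +0300] "GET / HTTP/1.1" 200 3525 "-" "Mozilla/5.0 (X11; Linux x86_64)"'

match = line.match(apache_regex)
# Nine capture groups, in the order they appear in the expression.
ip, user, time, http_method, uri_path, protocol, code, res_size, user_agent = match.captures

puts "#{ip} #{http_method} #{uri_path} -> #{code}"

# Assumed sample line with a real referer: this expression returns nil for it.
with_referer = '127.0.0.1 - - [12/Dec/2015:20:09:05 +0300] "GET /favicon.ico HTTP/1.1" 404 500 "http://localhost/" "Mozilla/5.0 (X11; Linux x86_64)"'
no_match = with_referer.match(apache_regex)
puts no_match.inspect
```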
So I came up with this small method, which parses an Apache "access.log" file and converts it into an array of hashes containing the information we need.
#!/usr/bin/env ruby
# KING SABRI | @KINGSABRI
require 'pp'

apache_logs = File.readlines "/var/log/apache2/access.log"

def parse(logs)
  apache_regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) ([^\s]+?) "(.*)"/

  result_parse = []
  logs.each do |log|
    parser = log.scan(apache_regex)[0]

    # Report and skip any line the regular expression can't parse.
    if parser.nil?
      puts "Can't parse: #{log}\n\n"
      next
    end

    # Map the capture groups to named keys.
    entry = {
      :ip         => parser[0],
      :user       => parser[1],
      :time       => parser[2],
      :method     => parser[3],
      :uri_path   => parser[4],
      :protocol   => parser[5],
      :code       => parser[6],
      :res_size   => parser[7],
      :referer    => parser[8],
      :user_agent => parser[9]
    }
    result_parse << entry
  end

  return result_parse
end

pp parse(apache_logs)
Returns
[{:ip=>"127.0.0.1",
  :user=>"",
  :time=>"12/Dec/2015:20:09:05 +0300",
  :method=>"GET",
  :uri_path=>"/",
  :protocol=>"HTTP/1.1",
  :code=>"200",
  :res_size=>"3525",
  :referer=>"\"-\"",
  :user_agent=>"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"},
 {:ip=>"127.0.0.1",
  :user=>"",
  :time=>"12/Dec/2015:20:09:05 +0300",
  :method=>"GET",
  :uri_path=>"/icons/ubuntu-logo.png",
  :protocol=>"HTTP/1.1",
  :code=>"200",
  :res_size=>"3689",
  :referer=>"\"http://localhost/\"",
  :user_agent=>"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"},
 {:ip=>"127.0.0.1",
  :user=>"",
  :time=>"12/Dec/2015:20:09:05 +0300",
  :method=>"GET",
  :uri_path=>"/favicon.ico",
  :protocol=>"HTTP/1.1",
  :code=>"404",
  :res_size=>"500",
  :referer=>"\"http://localhost/\"",
  :user_agent=>"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}]
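Once the log is an array of hashes, simple questions become one-liners. As a rough sketch, assuming a trimmed copy of the result above (only the keys used here are kept), this counts requests per response code and lists the paths that returned a 404. The names by_code and not_found are illustrative.

```ruby
# Trimmed copy of the parsed output shown above.
parsed = [
  { :ip => "127.0.0.1", :uri_path => "/",                      :code => "200" },
  { :ip => "127.0.0.1", :uri_path => "/icons/ubuntu-logo.png", :code => "200" },
  { :ip => "127.0.0.1", :uri_path => "/favicon.ico",           :code => "404" }
]

# Count requests per response code.
by_code = Hash.new(0)
parsed.each { |entry| by_code[entry[:code]] += 1 }

# Collect the paths that were not found.
not_found = parsed.select { |entry| entry[:code] == "404" }
                  .map { |entry| entry[:uri_path] }

by_code.each { |code, count| puts "#{code}: #{count}" }
puts not_found
```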
Note: The Apache LogFormat is configured as
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
which is the default configuration.
%h is the remote host (i.e. the client IP address)
%l is the identity of the user determined by identd (not usually used since not reliable)
%u is the user name determined by HTTP authentication
%t is the time the request was received.
%r is the request line from the client (e.g. "GET / HTTP/1.0")
%>s is the status code sent from the server to the client (200, 404 etc.)
%b is the size of the response to the client (in bytes)
Referer is the page that linked to this URL.
User-agent is the browser identification string.
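The field list above maps naturally onto named capture groups. The following is a sketch of an alternative expression, not the one used in the script: the group names are my own, the sample line is an assumption, and it does not handle escaped quotes inside a field.

```ruby
# One named group per LogFormat directive, written in extended (x) mode.
combined_regex = /
  \A(?<host>\S+)\s            # %h
  (?<ident>\S+)\s             # %l
  (?<user>\S+)\s              # %u
  \[(?<time>[^\]]+)\]\s       # %t
  "(?<request>[^"]*)"\s       # %r
  (?<status>\d{3})\s          # %>s
  (?<bytes>\d+|-)\s           # %b, logged as - when no bytes are sent
  "(?<referer>[^"]*)"\s       # Referer
  "(?<user_agent>[^"]*)"      # User-agent
/x

# Assumed sample line in the combined format.
line = '127.0.0.1 - - [12/Dec/2015:20:09:05 +0300] "GET /favicon.ico HTTP/1.1" 404 500 "http://localhost/" "Mozilla/5.0 (X11; Linux x86_64)"'

m = combined_regex.match(line)

# The request line splits into method, path and protocol.
http_method, uri_path, protocol = m[:request].split(' ', 3)

puts "#{m[:host]} #{http_method} #{uri_path} #{m[:status]} #{m[:referer]}"
```

Because the method is not restricted to a fixed list here, requests such as HEAD or OPTIONS are captured as well.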
Here is a basic IIS log regular expression:
iis_regex = /(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) ([^\s]+?) (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) (\d+) (GET|POST|PUT|DELETE) ([^\s]+?) - (\d+) (\d+) (\d+) (\d+) ([^\s]+?) (.*)/
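The order of the fields in an IIS log depends on how logging is configured, so a fixed regular expression only fits one layout. As an alternative sketch (a different technique from the regular expression above), W3C-format IIS logs announce their columns in a "#Fields:" header line, and the values are separated by spaces. The log excerpt and the parse_iis helper below are hypothetical.

```ruby
# Hypothetical IIS W3C log excerpt; the field list is an assumption.
iis_log = <<~LOG
  #Software: Microsoft Internet Information Services 7.5
  #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status
  2015-12-12 20:09:05 192.168.1.10 GET /index.html - 80 - 192.168.1.20 Mozilla/5.0+(X11;+Linux+x86_64) 200
LOG

# Hypothetical helper: builds one hash per request, keyed by the #Fields names.
def parse_iis(lines)
  fields  = []
  entries = []
  lines.each do |line|
    line = line.strip
    next if line.empty?
    if line.start_with?('#Fields:')
      # Remember the column names announced by the server.
      fields = line.sub('#Fields:', '').split
      next
    end
    next if line.start_with?('#')            # other directives such as #Software
    values = line.split(' ')
    next unless values.size == fields.size   # skip lines that don't fit the header
    entries << fields.zip(values).to_h
  end
  entries
end

entries = parse_iis(iis_log.lines)
entries.each { |e| puts "#{e['c-ip']} #{e['cs-method']} #{e['cs-uri-stem']} #{e['sc-status']}" }
```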