Iterating over lines in multiple Linux log files using Python
I needed to parse through my Nginx log files to debug a problem. However, the logs are separated into many files, most of them are gzipped, and I wanted the ordering within the files reversed. So I abstracted the logic to handle this into a function. Now I can pass a glob pattern such as /var/log/nginx/cache.log* to my function, and iterate over each line in all the files as if they were one file. Here is my function. Let me know if there is a better way to do this.
Update 2010-02-24: To handle multiple log files on a remote host, see my script on GitHub.
import glob
import gzip
import re


def get_lines(log_glob):
    """Return an iterator of each line in all files matching log_glob.

    Lines are sorted most recent first.
    Files are sorted by the integer in the suffix of the log filename.
    Suffix may be one of the following:
        .X (where X is an integer)
        .X.gz (where X is an integer)
    If the filename does not end in either suffix, it is treated as if X=0
    """
    def get_suffix(fname):
        # Capture the rotation number from names like "syslog.3" or "syslog.3.gz".
        m = re.search(r'.(?:\.(\d+))?(?:\.gz)?$', fname)
        if m.lastindex:
            suf = int(m.group(1))
        else:
            suf = 0
        return suf

    filelist = glob.glob(log_glob)
    # Lowest suffix (newest file) first. key= works on Python 2.4+ and 3,
    # unlike a positional comparison function.
    for filename in sorted(filelist, key=get_suffix):
        if filename.endswith('.gz'):
            # On Python 3, use gzip.open(filename, 'rt') to get str lines.
            fh = gzip.open(filename)
        else:
            fh = open(filename)
        # Reverse each file so its newest lines come first.
        for line in reversed(fh.readlines()):
            yield line
        fh.close()
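To see why the files are sorted by the integer in the suffix rather than by name, here is a small sketch. The helper name `suffix_key` and the sample filenames are made up for illustration, and the regex is a simplified variant of the one above. A plain alphabetical sort would put `syslog.10.gz` ahead of `syslog.2.gz`; sorting on the parsed integer keeps the rotation order.

```python
import re


def suffix_key(fname):
    """Return the rotation number at the end of a log filename, or 0 if none."""
    # Hypothetical helper: matches ".X" or ".X.gz" at the end of the name.
    m = re.search(r'\.(\d+)(?:\.gz)?$', fname)
    return int(m.group(1)) if m else 0


names = ['syslog.2.gz', 'syslog.10.gz', 'syslog', 'syslog.1']

alphabetical = sorted(names)
ordered = sorted(names, key=suffix_key)

print(alphabetical)  # ['syslog', 'syslog.1', 'syslog.10.gz', 'syslog.2.gz']
print(ordered)       # ['syslog', 'syslog.1', 'syslog.2.gz', 'syslog.10.gz']
```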
Here is an example run on my machine. It prints the first 15 characters of every 1000th line of all my syslog files.
for i, line in enumerate(get_lines('/var/log/syslog*')):
    if not i % 1000:
        print(line[:15])
File listing:
$ ls -l /var/log/syslog*
-rw-r----- 1 syslog adm 169965 2010 01/23 00:18 /var/log/syslog
-rw-r----- 1 syslog adm 350334 2010 01/22 08:03 /var/log/syslog.1
-rw-r----- 1 syslog adm 18078 2010 01/21 07:49 /var/log/syslog.2.gz
-rw-r----- 1 syslog adm 16700 2010 01/20 07:43 /var/log/syslog.3.gz
-rw-r----- 1 syslog adm 18197 2010 01/19 07:52 /var/log/syslog.4.gz
-rw-r----- 1 syslog adm 15737 2010 01/18 07:45 /var/log/syslog.5.gz
-rw-r----- 1 syslog adm 16157 2010 01/17 07:54 /var/log/syslog.6.gz
-rw-r----- 1 syslog adm 20285 2010 01/16 07:48 /var/log/syslog.7.gz
Results:
Jan 22 23:57:01
Jan 22 14:09:01
Jan 22 03:51:01
Jan 21 17:35:01
Jan 21 14:37:33
Jan 21 08:35:01
Jan 20 22:12:01
Jan 20 11:56:01
Jan 20 01:41:01
Jan 19 15:18:01
Jan 19 04:53:01
Jan 18 18:35:01
Jan 18 08:40:01
Jan 17 22:10:01
Jan 17 11:32:01
Jan 17 01:05:01
Jan 16 14:27:01
Jan 16 04:01:01
Jan 15 17:25:01
Jan 15 08:50:01
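The function in this post was written for Python 2. For anyone on Python 3, here is a rough sketch of the same idea, exercised against a few throwaway files so it can be run anywhere. The name `get_lines_py3`, the `demo.log` filenames, and the file contents are all invented for the demo; the main differences I am assuming matter are opening gzip files in text mode (`'rt'`) and sorting with `key=`.

```python
import glob
import gzip
import os
import re
import shutil
import tempfile


def get_lines_py3(log_glob):
    """Yield lines from all files matching log_glob, newest first (Python 3)."""
    def suffix(fname):
        m = re.search(r'\.(\d+)(?:\.gz)?$', fname)
        return int(m.group(1)) if m else 0

    for filename in sorted(glob.glob(log_glob), key=suffix):
        # Mode 'rt' makes gzip.open return str lines instead of bytes.
        opener = gzip.open if filename.endswith('.gz') else open
        with opener(filename, 'rt') as fh:
            lines = fh.readlines()
        for line in reversed(lines):
            yield line


# Build a small set of fake rotated logs: oldest lines live in the .2.gz file.
tmpdir = tempfile.mkdtemp()
base = os.path.join(tmpdir, 'demo.log')
with open(base, 'w') as fh:
    fh.write('line 5\nline 6\n')
with open(base + '.1', 'w') as fh:
    fh.write('line 3\nline 4\n')
with gzip.open(base + '.2.gz', 'wt') as fh:
    fh.write('line 1\nline 2\n')

result = [line.strip() for line in get_lines_py3(base + '*')]
shutil.rmtree(tmpdir)  # clean up the throwaway files

print(result)  # ['line 6', 'line 5', 'line 4', 'line 3', 'line 2', 'line 1']
```

Like the original, this reads each file fully into memory before reversing it, which is fine for typical rotated logs but worth keeping in mind for very large ones.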