
Iterating over lines in multiple Linux log files using Python

I needed to parse through my Nginx log files to debug a problem. However, the logs are split across many files, most of them gzipped, and I wanted the lines within each file in reverse order (most recent first). So I abstracted the logic to handle this into a function. Now I can pass a glob pattern such as /var/log/nginx/cache.log* to my function and iterate over each line in all the matching files as if they were one file. Here is my function. Let me know if there is a better way to do this.

Update 2010-02-24: To handle multiple log files on a remote host, see my script on GitHub.

import glob
import gzip
import re
 
def get_lines(log_glob):
    """Return an iterator of each line in all files matching log_glob.
    Lines are sorted most recent first.
    Files are sorted by the integer in the suffix of the log filename.
    Suffix may be one of the following:
         .X (where X is an integer)
         .X.gz (where X is an integer)
    If the filename does not end in either suffix, it is treated as if X=0
    """
    def sort_by_suffix(a, b):
        def get_suffix(fname):
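            # e.g. 'syslog' -> 0, 'syslog.1' -> 1, 'syslog.2.gz' -> 2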
            m = re.search(r'.(?:\.(\d+))?(?:\.gz)?$', fname)
            if m.lastindex:
                suf = int(m.group(1))
            else:
                suf = 0
            return suf
        return get_suffix(a) - get_suffix(b)
 
    filelist = glob.glob(log_glob)
    for filename in sorted(filelist, sort_by_suffix):
        if filename.endswith('.gz'):
            fh = gzip.open(filename)
        else:
            fh = open(filename)
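        # read the whole file and yield its lines last-line-first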
        for line in reversed(fh.readlines()):
            yield line
        fh.close()

Here is an example run on my machine. It prints the first 15 characters of every 1000th line of all my syslog files.

for i, line in enumerate(get_lines('/var/log/syslog*')):
    if not i % 1000:
        print line[:15]

File listing:

$ ls -l /var/log/syslog*
-rw-r----- 1 syslog adm 169965 2010-01-23 00:18 /var/log/syslog
-rw-r----- 1 syslog adm 350334 2010-01-22 08:03 /var/log/syslog.1
-rw-r----- 1 syslog adm  18078 2010-01-21 07:49 /var/log/syslog.2.gz
-rw-r----- 1 syslog adm  16700 2010-01-20 07:43 /var/log/syslog.3.gz
-rw-r----- 1 syslog adm  18197 2010-01-19 07:52 /var/log/syslog.4.gz
-rw-r----- 1 syslog adm  15737 2010-01-18 07:45 /var/log/syslog.5.gz
-rw-r----- 1 syslog adm  16157 2010-01-17 07:54 /var/log/syslog.6.gz
-rw-r----- 1 syslog adm  20285 2010-01-16 07:48 /var/log/syslog.7.gz

Results:

Jan 22 23:57:01
Jan 22 14:09:01
Jan 22 03:51:01
Jan 21 17:35:01
Jan 21 14:37:33
Jan 21 08:35:01
Jan 20 22:12:01
Jan 20 11:56:01
Jan 20 01:41:01
Jan 19 15:18:01
Jan 19 04:53:01
Jan 18 18:35:01
Jan 18 08:40:01
Jan 17 22:10:01
Jan 17 11:32:01
Jan 17 01:05:01
Jan 16 14:27:01
Jan 16 04:01:01
Jan 15 17:25:01
Jan 15 08:50:01

Comments


#1 David commented:

This is a very interesting function. I'm just curious why are you ordering your files by their numbers when default ordering obviously gives the same result and saves the regex.


#2 Eliot commented:

Hi David, I should have used a better example. When the numbering gets up to 10, the ordering returned by glob() is no longer correct. I will update my example. I thought about using a shell command like "ls -t", but this way seemed more robust in case the file modification times got messed up.
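
For example, with some made-up filenames, plain alphabetical ordering puts the .10 suffix before .2:

>>> sorted(['syslog.1', 'syslog.2.gz', 'syslog.10.gz'])
['syslog.1', 'syslog.10.gz', 'syslog.2.gz']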


#3 David commented:

Thanks for the clarification, Eliot. I've gotten into the habit of padding numbers with zeros to avoid this trap when sorting in alphanumerical order. I still have to remember it when I can't control the numbering scheme, though :-/
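
For example (again with made-up filenames), zero-padded suffixes do sort correctly as plain strings:

>>> sorted(['syslog.02.gz', 'syslog.10.gz', 'syslog.01.gz'])
['syslog.01.gz', 'syslog.02.gz', 'syslog.10.gz']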

I do agree that relying on the modification date would be too brittle.