SaltyCrane Blog — Notes on JavaScript and web development

Iterating over lines in multiple Linux log files using Python

I needed to parse through my Nginx log files to debug a problem. However, the logs are split across many files, most of them gzipped, and I wanted the lines within each file in reverse order. So I abstracted the logic to handle this into a function. Now I can pass a glob pattern such as /var/log/nginx/cache.log* to my function and iterate over each line in all the files as if they were one file. Here is my function. Let me know if there is a better way to do this.

Update 2010-02-24: To handle multiple log files on a remote host, see my script on github.

import glob
import gzip
import re
def get_lines(log_glob):
    """Return an iterator of each line in all files matching log_glob.
    Lines are sorted most recent first.
    Files are sorted by the integer in the suffix of the log filename.
    Suffix may be one of the following:
         .X (where X is an integer)
         .X.gz (where X is an integer)
    If the filename does not end in either suffix, it is treated as if X=0
    """
    def sort_by_suffix(a, b):
        def get_suffix(fname):
            m = re.search(r'.(?:\.(\d+))?(?:\.gz)?$', fname)
            if m.lastindex:
                suf = int(m.group(1))
            else:
                suf = 0
            return suf
        return get_suffix(a) - get_suffix(b)
    filelist = glob.glob(log_glob)
    for filename in sorted(filelist, sort_by_suffix):
        if filename.endswith('.gz'):
            fh = gzip.open(filename)
        else:
            fh = open(filename)
        for line in reversed(fh.readlines()):
            yield line
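Note that passing a comparison function directly to sorted() is Python 2 only; Python 3 removed the cmp argument, so a Python 3 version would use a key function instead. Here is a rough sketch of that approach (the suffix_key name, the simplified regex, and the sample filenames are mine, not from the function above):

```python
import re

def suffix_key(fname):
    """Sort key: the integer suffix of a rotated log name, or 0 if none."""
    m = re.search(r'\.(\d+)(?:\.gz)?$', fname)
    return int(m.group(1)) if m else 0

files = ['syslog.10.gz', 'syslog', 'syslog.2.gz', 'syslog.1']
print(sorted(files, key=suffix_key))
# ['syslog', 'syslog.1', 'syslog.2.gz', 'syslog.10.gz']
```

A key function is also a bit cheaper than a cmp function, since the key is computed once per filename rather than once per comparison.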

Here is an example run on my machine. It prints the first 15 characters of every 1000th line of all my syslog files.

for i, line in enumerate(get_lines('/var/log/syslog*')):
    if not i % 1000:
        print line[:15]

File listing:

$ ls -l /var/log/syslog*
-rw-r----- 1 syslog adm 169965 2010 01/23 00:18 /var/log/syslog
-rw-r----- 1 syslog adm 350334 2010 01/22 08:03 /var/log/syslog.1
-rw-r----- 1 syslog adm  18078 2010 01/21 07:49 /var/log/syslog.2.gz
-rw-r----- 1 syslog adm  16700 2010 01/20 07:43 /var/log/syslog.3.gz
-rw-r----- 1 syslog adm  18197 2010 01/19 07:52 /var/log/syslog.4.gz
-rw-r----- 1 syslog adm  15737 2010 01/18 07:45 /var/log/syslog.5.gz
-rw-r----- 1 syslog adm  16157 2010 01/17 07:54 /var/log/syslog.6.gz
-rw-r----- 1 syslog adm  20285 2010 01/16 07:48 /var/log/syslog.7.gz


Jan 22 23:57:01
Jan 22 14:09:01
Jan 22 03:51:01
Jan 21 17:35:01
Jan 21 14:37:33
Jan 21 08:35:01
Jan 20 22:12:01
Jan 20 11:56:01
Jan 20 01:41:01
Jan 19 15:18:01
Jan 19 04:53:01
Jan 18 18:35:01
Jan 18 08:40:01
Jan 17 22:10:01
Jan 17 11:32:01
Jan 17 01:05:01
Jan 16 14:27:01
Jan 16 04:01:01
Jan 15 17:25:01
Jan 15 08:50:01


#1 David commented on :

This is a very interesting function. I'm just curious why you are ordering your files by their numbers when the default ordering obviously gives the same result and saves the regex.

#2 Eliot commented on :

Hi David, I should have used a better example. When the numbering gets up to 10, the ordering returned by glob() is no longer correct. I will update my example. I thought about using a shell command like "ls -t", but thought this way would be better if the file modification times got messed up.
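The trap Eliot describes is easy to demonstrate: plain string sorting (which is what an alphabetically listed directory amounts to) places ".10" between ".1" and ".2":

```python
names = ['syslog', 'syslog.1', 'syslog.2', 'syslog.10']
print(sorted(names))
# ['syslog', 'syslog.1', 'syslog.10', 'syslog.2']
```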

#3 David commented on :

Thanks for the clarification, Eliot. I'm in the habit of padding numbers with 0 to avoid this trap when sorting alphanumerically. I still have to remember this when I can't control the numbering scheme, though :-/

I do agree that relying on the modification date would be too brittle.
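The zero-padding habit mentioned above works because padded numbers sort correctly as plain strings. A quick illustration (the app.log filename is made up for the example):

```python
names = ['app.log.%02d' % n for n in (1, 2, 10)]
print(sorted(names))
# ['app.log.01', 'app.log.02', 'app.log.10']
```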