Saltycrane logo

SaltyCrane Blog

Notes on Python, Django, and web development on Ubuntu Linux

    

Find the N longest lines in a file with Python

Here's a Python problem I attempted recently:

Write a program to read a multiple line text file and write the N longest lines to a new file. Where N and the file to be read are specified on the command line. Optimization is important.

Here's my solution:

import sys

def main(filename=sys.argv[1], 
         N=int(sys.argv[2])):
    """ Finds the N longest lines in filename and writes to filename + ".new"
    """
    lines = open(filename).readlines()
    lines.sort(cmp=lambda x,y: cmp(len(y), len(x)))
    open(filename+".new", "w").write("".join(lines[:N]))

if __name__ == '__main__':
    main()

What do you think? Is there a faster way?

4 Comments — feed icon Comments feed for this post


#1 Charlie commented on 2009-06-29:

Using a heap is faster.

import sys,heapq

def main(filename=sys.argv[1],
         N=int(sys.argv[2])):
    """ Finds the N longest lines in filename and writes to filename + ".new"
    """
    heap = []
    lines = []
    lines = open(filename).readlines()
    heap = heapq.nlargest(N,lines,len)
    open(filename+".new", "w").write("".join(heap))

if __name__ == '__main__':
    main()

#2 Eliot commented on 2009-07-01:

Charlie,
Thank you very much for your solution. I profiled both methods and your method is much faster. For my 60MB test file, my algorithm using "sort" ran in 12.6 CPU seconds and your heap algorithm ran in 0.4 CPU seconds.

For those interested, here is my test:

import heapq
import sys
import cProfile

filename = "saltycrane.com.access.log"
N = 20

def main():
    print "running saltycrane"
    cProfile.run('saltycrane(filename, N)')
    print "running charlie"
    cProfile.run('charlie(filename, N)')

def saltycrane(filename, N):
    """ Finds the N longest lines in filename and writes to filename + ".new"
    """
    lines = open(filename).readlines()
    lines.sort(cmp=lambda x,y: cmp(len(y), len(x)))
    open(filename+".new", "w").write("".join(lines[:N]))

def charlie(filename, N):
    """ Finds the N longest lines in filename and writes to filename + ".new"
    """
    heap = []
    lines = []
    lines = open(filename).readlines()
    heap = heapq.nlargest(N,lines,len)
    open(filename+".new", "w").write("".join(heap))

if __name__ == '__main__':
    main()

And the results:

running saltycrane
         10981328 function calls in 12.584 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.089    0.089   12.584   12.584 <string>:1(<module>)
        1    2.288    2.288   12.494   12.494 n_longest.py:14(saltycrane)
  2745330    6.273    0.000    9.955    0.000 n_longest.py:18(<lambda>)
  2745330    1.381    0.000    1.381    0.000 {cmp}
  5490660    2.301    0.000    2.301    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.252    0.252    0.252    0.252 {method 'readlines' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {open}

running charlie
         12 function calls in 0.379 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.035    0.035    0.379    0.379 <string>:1(<module>)
        1    0.000    0.000    0.170    0.170 heapq.py:324(nlargest)
        1    0.000    0.000    0.343    0.343 n_longest.py:21(charlie)
        1    0.170    0.170    0.170    0.170 {_heapq.nlargest}
        1    0.000    0.000    0.000    0.000 {itertools.tee}
        1    0.000    0.000    0.000    0.000 {map}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.173    0.173    0.173    0.173 {method 'readlines' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {open}

Here is the documentation for heapq: http://docs.python.org/library/heapq.html


#3 Boris Terzic commented on 2009-08-10:

Using a priority queue for this sort of thing is of course optimal but if I'm not mistaken the result of "open(filename)" is iterable. You can save massive amounts of memory and quite noticeable performance by just calling heapq on the result of "open(filename)". Gives 2.5s vs 3s on a 135MB file for me.


#4 Eliot commented on 2009-08-10:

Boris,
Thank you for the improvement. I ran the following code:

def boris(filename, N):
    heap = []
    heap = heapq.nlargest(N, open(filename), len)
    open(filename+".new", "w").write("".join(heap))

and got the following results for my 102MB file:

saltycrane: 22.3 CPU seconds
charlie: 0.76 CPU seconds
boris: 0.55 CPU seconds

Post a comment

Required
Required, but not displayed
Optional

Format using Markdown. (No HTML.)
  • Code blocks: prefix each line by at least 4 spaces or 1 tab (and a blank line before and after)
  • Code span: surround with backticks
  • Blockquotes: prefix lines to be quoted with >
  • Links: <URL>
  • Links w/ description: [description](URL)
Created with Django | Hosted by Slicehost