Find the N longest lines in a file with Python

Date: 2009-06-28 | Tags: algorithms, python | 4 Comments

Here's a Python problem I attempted recently:

Write a program to read a multiple line text file and write the N longest lines to a new file. Where N and the file to be read are specified on the command line. Optimization is important.

Here's my solution:

import sys

def main(filename=sys.argv[1], 
         N=int(sys.argv[2])):
    """Find the N longest lines in filename and write to filename + ".new"
    """
    lines = open(filename).readlines()
    lines.sort(cmp=lambda x,y: cmp(len(y), len(x)))
    open(filename+".new", "w").write("".join(lines[:N]))

if __name__ == '__main__':
    main()

What do you think? Is there a faster way?

Find all combinations of a set of lists with itertools.product — posted 2011-11-01
Some more python recursion examples — posted 2011-10-05
Free Computer Science courses online — posted 2009-06-30
How to reverse words in a sentence using Python and C — posted 2009-04-22
Python recursion example to navigate tree data — posted 2008-08-19

Comments

#1 Charlie commented on 2009-06-29:

Using a heap is faster.

import sys,heapq

def main(filename=sys.argv[1],
         N=int(sys.argv[2])):
    """ Finds the N longest lines in filename and writes to filename + ".new"
    """
    heap = []
    lines = []
    lines = open(filename).readlines()
    heap = heapq.nlargest(N,lines,len)
    open(filename+".new", "w").write("".join(heap))

if __name__ == '__main__':
    main()

#2 Eliot commented on 2009-07-01:

Charlie,
Thank you very much for your solution. I profiled both methods and your method is much faster. For my 60MB test file, my algorithm using "sort" ran in 12.6 CPU seconds and your heap algorithm ran in 0.4 CPU seconds.

For those interested, here is my test:

import heapq
import sys
import cProfile

filename = "saltycrane.com.access.log"
N = 20

def main():
    print "running saltycrane"
    cProfile.run('saltycrane(filename, N)')
    print "running charlie"
    cProfile.run('charlie(filename, N)')

def saltycrane(filename, N):
    """ Finds the N longest lines in filename and writes to filename + ".new"
    """
    lines = open(filename).readlines()
    lines.sort(cmp=lambda x,y: cmp(len(y), len(x)))
    open(filename+".new", "w").write("".join(lines[:N]))

def charlie(filename, N):
    """ Finds the N longest lines in filename and writes to filename + ".new"
    """
    heap = []
    lines = []
    lines = open(filename).readlines()
    heap = heapq.nlargest(N,lines,len)
    open(filename+".new", "w").write("".join(heap))

if __name__ == '__main__':
    main()

And the results:

running saltycrane
         10981328 function calls in 12.584 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.089    0.089   12.584   12.584 <string>:1(<module>)
        1    2.288    2.288   12.494   12.494 n_longest.py:14(saltycrane)
  2745330    6.273    0.000    9.955    0.000 n_longest.py:18(<lambda>)
  2745330    1.381    0.000    1.381    0.000 {cmp}
  5490660    2.301    0.000    2.301    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.252    0.252    0.252    0.252 {method 'readlines' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {open}


running charlie
         12 function calls in 0.379 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.035    0.035    0.379    0.379 <string>:1(<module>)
        1    0.000    0.000    0.170    0.170 heapq.py:324(nlargest)
        1    0.000    0.000    0.343    0.343 n_longest.py:21(charlie)
        1    0.170    0.170    0.170    0.170 {_heapq.nlargest}
        1    0.000    0.000    0.000    0.000 {itertools.tee}
        1    0.000    0.000    0.000    0.000 {map}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.173    0.173    0.173    0.173 {method 'readlines' of 'file' objects}
        1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {open}

Here is the documentation for heapq: http://docs.python.org/library/heapq.html

#3 Boris Terzic commented on 2009-08-10:

Using a priority queue for this sort of thing is of course optimal but if I'm not mistaken the result of "open(filename)" is iterable. You can save massive amounts of memory and quite noticeable performance by just calling heapq on the result of "open(filename)". Gives 2.5s vs 3s on a 135MB file for me.

#4 Eliot commented on 2009-08-10:

Boris,
Thank you for the improvement. I ran the following code:

def boris(filename, N):
    heap = []
    heap = heapq.nlargest(N, open(filename), len)
    open(filename+".new", "w").write("".join(heap))

and got the following results for my 102MB file:

saltycrane: 22.3 CPU seconds
charlie: 0.76 CPU seconds
boris: 0.55 CPU seconds

SaltyCrane Blog — Notes on JavaScript and web development

Find the N longest lines in a file with Python

Comments

Links

Related posts

Comments

Links