Find the N longest lines in a file with Python
Here's a Python problem I attempted recently:
Write a program to read a multiple line text file and write the N longest lines to a new file. Where N and the file to be read are specified on the command line. Optimization is important.
Here's my solution:
import sys
def main(filename=sys.argv[1],
N=int(sys.argv[2])):
""" Finds the N longest lines in filename and writes to filename + ".new"
"""
lines = open(filename).readlines()
lines.sort(cmp=lambda x,y: cmp(len(y), len(x)))
open(filename+".new", "w").write("".join(lines[:N]))
if __name__ == '__main__':
main()
What do you think? Is there a faster way?
4
Comments
—
Comments feed for this post
#2 Eliot commented on 2009-07-01:
Charlie,
Thank you very much for your solution. I profiled both methods and your method is much faster. For my 60MB test file, my algorithm using "sort" ran in 12.6 CPU seconds and your heap algorithm ran in 0.4 CPU seconds.
For those interested, here is my test:
import heapq
import sys
import cProfile
filename = "saltycrane.com.access.log"
N = 20
def main():
print "running saltycrane"
cProfile.run('saltycrane(filename, N)')
print "running charlie"
cProfile.run('charlie(filename, N)')
def saltycrane(filename, N):
""" Finds the N longest lines in filename and writes to filename + ".new"
"""
lines = open(filename).readlines()
lines.sort(cmp=lambda x,y: cmp(len(y), len(x)))
open(filename+".new", "w").write("".join(lines[:N]))
def charlie(filename, N):
""" Finds the N longest lines in filename and writes to filename + ".new"
"""
heap = []
lines = []
lines = open(filename).readlines()
heap = heapq.nlargest(N,lines,len)
open(filename+".new", "w").write("".join(heap))
if __name__ == '__main__':
main()
And the results:
running saltycrane
10981328 function calls in 12.584 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.089 0.089 12.584 12.584 <string>:1(<module>)
1 2.288 2.288 12.494 12.494 n_longest.py:14(saltycrane)
2745330 6.273 0.000 9.955 0.000 n_longest.py:18(<lambda>)
2745330 1.381 0.000 1.381 0.000 {cmp}
5490660 2.301 0.000 2.301 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.252 0.252 0.252 0.252 {method 'readlines' of 'file' objects}
1 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
2 0.000 0.000 0.000 0.000 {open}
running charlie
12 function calls in 0.379 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.035 0.035 0.379 0.379 <string>:1(<module>)
1 0.000 0.000 0.170 0.170 heapq.py:324(nlargest)
1 0.000 0.000 0.343 0.343 n_longest.py:21(charlie)
1 0.170 0.170 0.170 0.170 {_heapq.nlargest}
1 0.000 0.000 0.000 0.000 {itertools.tee}
1 0.000 0.000 0.000 0.000 {map}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects}
1 0.173 0.173 0.173 0.173 {method 'readlines' of 'file' objects}
1 0.000 0.000 0.000 0.000 {method 'write' of 'file' objects}
2 0.000 0.000 0.000 0.000 {open}
Here is the documentation for heapq: http://docs.python.org/library/heapq.html
#3 Boris Terzic commented on 2009-08-10:
Using a priority queue for this sort of thing is of course optimal but if I'm not mistaken the result of "open(filename)" is iterable. You can save massive amounts of memory and quite noticeable performance by just calling heapq on the result of "open(filename)". Gives 2.5s vs 3s on a 135MB file for me.
#4 Eliot commented on 2009-08-10:
Boris,
Thank you for the improvement. I ran the following code:
def boris(filename, N):
heap = []
heap = heapq.nlargest(N, open(filename), len)
open(filename+".new", "w").write("".join(heap))
and got the following results for my 102MB file:
saltycrane: 22.3 CPU seconds
charlie: 0.76 CPU seconds
boris: 0.55 CPU seconds
Post a comment
About
I'm Eliot and this is my notepad for programming topics such as Python, Django, Ubuntu, Emacs, etc... more »
Search Blog
Tags
-
algorithms
(4)
-
aws
(8)
-
blogproject
(20)
-
c_cplusplus
(12)
-
cardstore
(8)
-
colinux
(2)
-
concurrency
(9)
-
conkeror
(2)
-
cygwin
(18)
-
datastructures
(15)
-
datetime
(3)
-
dell
(3)
-
django
(39)
-
emacs
(20)
-
files_directories
(10)
-
install_setup
(7)
-
javascript
(3)
-
keyboard
(6)
-
matplotlib
(5)
-
mercurial
(4)
-
nginx
(2)
-
preferences
(8)
-
processes
(3)
-
pyqt
(18)
-
python
(122)
-
ratpoison
(3)
-
regexes
(5)
-
rsync
(3)
-
softwaretools
(17)
-
sql
(13)
-
ssh
(7)
-
subversion
(6)
-
twisted
(6)
-
ubuntu
(60)
-
urxvt
(5)
-
vxworks
(25)
-
webservices
(4)
-
wmii
(7)
Blogroll
- Adam Gomaa
- Alex Clemesha
- Amir Salihefendic
- Armin Ronacher
- David Beazley
- David Ziegler
- Duncan McGreggor
- Gareth Rushgrave
- Glyph Lefkowitz
- Guido van Rossum
- Ian Bicking
- Jacob Kaplan-Moss
- James Bennett
- James Tauber
- Jesper Noehr
- Matt Harrison
- Nikolay Kolev
- Parand Darugar
- Peter Baumgartner
- Peter Bengtsson
- Rob Hudson
- Simon Willison
- Will McGugan
#1 Charlie commented on 2009-06-29:
Using a heap is faster.