SaltyCrane: regexes
https://www.saltycrane.com/blog/

(Not too successfully) trying to use Unix tools instead of Python utility scripts
2011-04-20T19:00:32-07:00
https://www.saltycrane.com/blog/2011/04/trying-use-unix-tools-instead-python-utility-scripts/

<p>
Inspired by articles such as
<a href="http://gregable.com/2010/09/why-you-should-know-just-little-awk.html">
<em>Why you should learn just a little Awk</em></a> and
<a href="http://www.johndcook.com/blog/2011/04/19/learn-one-sed-command/">
<em>Learn one sed command</em></a>, I am trying to make use of Unix tools
<code>sed</code>, <code>awk</code>, <code>grep</code>, <code>cut</code>, <code>uniq</code>, <code>sort</code>,
etc. instead of writing short Python utility scripts.
</p>
<p>Here is a Python script I wrote this week. It greps a file for a given
regular expression pattern and returns a unique, sorted list of the matches
captured by the pattern's parentheses.
</p>
<pre class="python"># grep2.py
import re
import sys

def main():
    patt = sys.argv[1]
    filename = sys.argv[2]
    text = open(filename).read()
    matchlist = set(m.group(1) for m in re.finditer(patt, text, re.MULTILINE))
    for m in sorted(matchlist):
        print m

if __name__ == '__main__':
    main()</pre>
<p>
As an example, I used my script to search
<a href="http://code.djangoproject.com/browser/django/tags/releases/1.3/django/contrib/admin/templates/admin/edit_inline/tabular.html">
one of the Django admin template files</a> for all the Django template markup in the file.
</p>
<pre class="console">$ python grep2.py '({{[^{}]+}}|{%[^{}]+%})' tabular.html </pre>
<p>Output:</p>
<pre>{% admin_media_prefix %}
{% blocktrans with inline_admin_formset.opts.verbose_name|title as verbose_name %}
{% cycle "row1" "row2" %}
{% else %}
{% endblocktrans %}
{% endfor %}
{% endif %}
{% endspaceless %}
{% for field in inline_admin_formset.fields %}
{% for field in line %}
{% for fieldset in inline_admin_form %}
{% for inline_admin_form in inline_admin_formset %}
{% for line in fieldset %}
{% if field.is_hidden %}
{% if field.is_readonly %}
{% if field.required %}
{% if forloop.first %}
{% if forloop.last %}
{% if inline_admin_form.form.non_field_errors %}
{% if inline_admin_form.has_auto_field %}
{% if inline_admin_form.original %}
{% if inline_admin_form.original or inline_admin_form.show_url %}
{% if inline_admin_form.show_url %}
{% if inline_admin_formset.formset.can_delete %}
{% if not field.widget.is_hidden %}
{% if not forloop.last %}
{% load i18n adminmedia admin_modify %}
{% spaceless %}
{% trans "Delete?" %}
{% trans "Remove" %}
{% trans "View on site" %}
{{ field.contents }}
{{ field.field }}
{{ field.field.errors.as_ul }}
{{ field.field.name }}
{{ field.label|capfirst }}
{{ forloop.counter0 }}
{{ inline_admin_form.deletion_field.field }}
{{ inline_admin_form.fk_field.field }}
{{ inline_admin_form.form.non_field_errors }}
{{ inline_admin_form.original }}
{{ inline_admin_form.original.id }}
{{ inline_admin_form.original_content_type_id }}
{{ inline_admin_form.pk_field.field }}
{{ inline_admin_formset.formset.management_form }}
{{ inline_admin_formset.formset.non_form_errors }}
{{ inline_admin_formset.formset.prefix }}
{{ inline_admin_formset.opts.verbose_name_plural|capfirst }}
{{ inline_admin_form|cell_count }}
{{ verbose_name }}</pre>
<p>Here's my attempt at using Unix tools:</p>
<pre class="console">$ sed -rn 's/^.*(\{\{.*\}\}|\{%.*%\}).*$/\1/gp' tabular.html | sort | uniq </pre>
<p>However, the output isn't quite the same. Because the leading <code>.*</code> is greedy, the sed substitution keeps only the last match on each line, so lines containing multiple tags lose all but one:</p>
<pre>{% admin_media_prefix %}
{% else %}
{% endblocktrans %}
{% endfor %}
{% endif %}
{% endspaceless %}
{% for field in inline_admin_formset.fields %}
{% for field in line %}
{% for fieldset in inline_admin_form %}
{% for inline_admin_form in inline_admin_formset %}
{% for line in fieldset %}
{% if field.is_readonly %}
{% if inline_admin_form.form.non_field_errors %}
{% if inline_admin_form.original or inline_admin_form.show_url %}
{% if inline_admin_formset.formset.can_delete %}
{% if not field.widget.is_hidden %}
{% load i18n adminmedia admin_modify %}
{% spaceless %}
{% trans "Remove" %}
{{ field.contents }}
{{ field.field }}
{{ field.field.errors.as_ul }}
{{ field.field.name }}
{{ field.label|capfirst }}
{{ inline_admin_form.fk_field.field }}
{{ inline_admin_form.form.non_field_errors }}
{{ inline_admin_formset.formset.management_form }}
{{ inline_admin_formset.formset.non_form_errors }}
{{ inline_admin_formset.formset.prefix }}
{{ inline_admin_formset.opts.verbose_name_plural|capfirst }}</pre>
<p>Unix tools are powerful and concise, but I still need to
get a lot more comfortable with their syntax. Please leave a comment
if you know how to fix my command.
</p>
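<p>A possible fix (my guess, not the author's solution): GNU grep's <code>-o</code> option prints each match on its own line, which handles multiple matches per line and avoids the greedy-capture problem entirely. A minimal sketch, using a small made-up <code>sample.html</code> as a stand-in for <code>tabular.html</code>:</p>

```shell
# Create a tiny sample with Django-style markup (hypothetical test input;
# %% in the printf format produces a literal %)
printf '<td>{{ field.field }}</td>\n{%% if x %%}{{ y }}{%% endif %%}\n' > sample.html

# -o prints only the matched text, one match per line, so multiple
# tags on one line are all captured; sort | uniq deduplicates
grep -oE '\{\{[^{}]+\}\}|\{%[^{}]+%\}' sample.html | sort | uniq
```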
How to search C code for division or sqrt
2008-07-24T15:12:20-07:00
https://www.saltycrane.com/blog/2008/07/how-search-c-code-division-or-sqrt/

<p>The following Python script searches through C code for division or sqrt
and prints the line of code and the line number. It skips C comments.
To use, run <code>python find_divides.py filename.c</code>
</p>
<pre class="python">#!/usr/bin/python
"""find_divides.py
usage: python find_divides.py filename
"""
import re
import sys

def main():
    filename = sys.argv[1]
    text = open(filename).read()
    lines = text.splitlines()
    lines = ["%4d: %s" % (i, line) for (i, line) in enumerate(lines)]
    text = "\n".join(lines)
    text = remove_comments_and_strings(text)
    for line in text.splitlines():
        if ("/" in line) or ("sqrt" in line):
            print line

def remove_comments_and_strings(text):
    """ remove c-style comments and strings
    text: blob of text with comments (can include newlines)
    returns: text with comments and strings removed
    """
    pattern = r"""
                            ##  --------- COMMENT ---------
      /\*                   ##  Start of /* ... */ comment
      [^*]*\*+              ##  Non-* followed by 1-or-more *'s
      (                     ##
        [^/*][^*]*\*+       ##
      )*                    ##  0-or-more things which don't start with /
                            ##    but do end with '*'
      /                     ##  End of /* ... */ comment
    |                       ##  -OR-  various things which aren't comments:
      (                     ##
                            ##  ------ " ... " STRING ------
        "                   ##  Start of " ... " string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^"\\]            ##  Non "\ characters
        )*                  ##
        "                   ##  End of " ... " string
      |                     ##  -OR-
                            ##
                            ##  ------ ' ... ' STRING ------
        '                   ##  Start of ' ... ' string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^'\\]            ##  Non '\ characters
        )*                  ##
        '                   ##  End of ' ... ' string
      |                     ##  -OR-
                            ##
                            ##  ------ ANYTHING ELSE -------
        (.                  ##  Anything other char
         [^/"'\\]*)         ##  Chars which doesn't start a comment, string
      )                     ##    or escape
    """
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
    goodstuff = [m.group(5) for m in regex.finditer(text) if m.group(5)]
    return "".join(goodstuff)

if __name__ == "__main__":
    main()</pre>
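<p>The key trick in the script above is numbering the lines <em>before</em> stripping comments and strings, so the surviving lines still report their original positions. The trick in isolation (a minimal sketch with a made-up snippet, written in Python 3 syntax):</p>

```python
# Prefix each line with its index first; filtering afterwards keeps
# the original line numbers attached to the surviving lines.
code = "int a = 1;\nb = a / 2;\nc = sqrt(b);\n"
numbered = ["%4d: %s" % (i, line) for i, line in enumerate(code.splitlines())]
for line in numbered:
    if ("/" in line) or ("sqrt" in line):
        print(line)
```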
How to remove C style comments using Python
2007-11-28T17:25:00-08:00
https://www.saltycrane.com/blog/2007/11/remove-c-comments-python/

<p>The Perl FAQ has an
entry <a href="http://perldoc.perl.org/perlfaq6.html#How-do-I-use-a-regular-expression-to-strip-C-style-comments-from-a-file%3f">How
do I use a regular expression to strip C style comments from a
file?</a> Since I've switched to Python, I've adapted the Perl
solution to Python. This regular expression was created by Jeffrey Friedl and
later modified by Fred Curtis. I'm not certain, but it
appears to use the "unrolling the loop" technique described in
Chapter 6 of <em>Mastering Regular Expressions</em>.</p>
<p>remove_comments.py:</p>
<pre class="python">import re
import sys

def remove_comments(text):
    """ remove c-style comments.
    text: blob of text with comments (can include newlines)
    returns: text with comments removed
    """
    pattern = r"""
                            ##  --------- COMMENT ---------
      /\*                   ##  Start of /* ... */ comment
      [^*]*\*+              ##  Non-* followed by 1-or-more *'s
      (                     ##
        [^/*][^*]*\*+       ##
      )*                    ##  0-or-more things which don't start with /
                            ##    but do end with '*'
      /                     ##  End of /* ... */ comment
    |                       ##  -OR-  various things which aren't comments:
      (                     ##
                            ##  ------ " ... " STRING ------
        "                   ##  Start of " ... " string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^"\\]            ##  Non "\ characters
        )*                  ##
        "                   ##  End of " ... " string
      |                     ##  -OR-
                            ##
                            ##  ------ ' ... ' STRING ------
        '                   ##  Start of ' ... ' string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^'\\]            ##  Non '\ characters
        )*                  ##
        '                   ##  End of ' ... ' string
      |                     ##  -OR-
                            ##
                            ##  ------ ANYTHING ELSE -------
        .                   ##  Anything other char
        [^/"'\\]*           ##  Chars which doesn't start a comment, string
      )                     ##    or escape
    """
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
    noncomments = [m.group(2) for m in regex.finditer(text) if m.group(2)]
    return "".join(noncomments)

if __name__ == '__main__':
    filename = sys.argv[1]
    code_w_comments = open(filename).read()
    code_wo_comments = remove_comments(code_w_comments)
    fh = open(filename+".nocomments", "w")
    fh.write(code_wo_comments)
    fh.close()</pre>
<br>Example:<br>
To test the script, I created a test file called <code>testfile.c</code>:
<pre>/* This is a C-style comment. */
This is not a comment.
/* This is another
* C-style comment.
*/
"This is /* also not a comment */"</pre>
<br>Run the script:<br>
To use the script, I put the script, <code>remove_comments.py</code>,
and my test file, <code>testfile.c</code>, in the same directory and ran the
following command:<br>
<pre>python remove_comments.py testfile.c</pre>
<br>Results:<br>
The script created a new file called <code>testfile.c.nocomments</code>:
<pre>
This is not a comment.
"This is /* also not a comment */"</pre>
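<p>For quick experimentation, the same unrolled pattern can be transcribed into a compact one-liner (my own condensed form, not from the original post; group 1 here plays the role of group 2 in the commented version):</p>

```python
import re

# Compact transcription of the commented pattern above: the comment
# alternative is left uncaptured, so group 1 holds everything that is
# NOT inside a /* ... */ comment (strings and ordinary code).
pattern = r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|("(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'|.[^/"\'\\]*)'

def remove_comments(text):
    return "".join(m.group(1) for m in re.finditer(pattern, text, re.DOTALL) if m.group(1))

print(remove_comments('a = 1; /* gone */ s = "/* kept */";'))
```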
<br><br><br>
---------------<br>
Minor note on Perl to Python migration:<br>
I modified the original regular expression comments a little bit.
In particular, I had to put at least one character after the <code>##
Non "\</code> and <code>## Non '\</code> lines because, in Python,
the backslash was escaping the following newline character and the
closing parenthesis on the following line was being treated as a
comment by the regular expression engine. This is the error I got,
before the fix:
<pre>$ python remove_comments.py
Traceback (most recent call last):
File "remove_comments.py", line 39, in <module>
regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
File "C:\Programs\Python25\lib\re.py", line 180, in compile
return _compile(pattern, flags)
File "C:\Programs\Python25\lib\re.py", line 233, in _compile
raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis</pre>
Using Python's finditer to highlight search items
2007-10-16T17:39:00-07:00
https://www.saltycrane.com/blog/2007/10/using-pythons-finditer-to-highlight/

<p>I am trying to search through various text and highlight certain search terms within that text using HTML markup. As an example, if I take a paragraph of text from Paul Prescod's <a href="http://www.prescod.net/python/IsPythonLisp.html">essay</a>, I would like to highlight the search terms <em>"lisp"</em>, <em>"python"</em>, <em>"perl"</em>, <em>"java"</em>, and <em>"C"</em> each in a different color. My first attempt at this problem looked something like:</p>
<pre class="python">for sentence in re.split(r"[?.]\s+", text):
    match = re.search(r"\blisp\b", sentence, re.I)
    if match:
        color = 'red'
    else:
        match = re.search(r"\bpython\b", sentence, re.I)
        if match:
            color = 'blue'
        else:
            match = re.search(r"\bperl\b", sentence, re.I)
            if match:
                color = 'orange'</pre>
<p>I didn't finish it because, not only is it ugly and verbose, it doesn't do what I want. Instead of matching all the search terms, it only matches the first one in each sentence. Fortunately, I took some time to rethink the problem (i.e. search the internet (this <a href="http://mail.python.org/pipermail/python-list/2004-December/298175.html">thread</a> on the Python mailing list was helpful (I guess my Perl background is still showing) as was <a href="http://effbot.org/zone/xml-scanner.htm">this article</a> which I previously referenced. (hmmm, this is starting to look like Lisp.))) and made a prettier (and correct) version using my new favorite regular expression method, <code>finditer</code>, and the <code>MatchObject</code>'s <code>lastindex</code> attribute. Here is the working example:</p>
<pre class="python">import re
COLOR = ['red', 'blue', 'orange', 'violet', 'green']
text = """Graham says that Perl is cooler than Java and Python than Perl. In some circles, maybe. Graham uses the example of Slashdot, written in Perl. But what about Advogato, written in C? What about all of the cool P2P stuff being written in all three of the languages? Considering that Perl is older than Java, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider Perl "cooler" than Java, except perhaps by virtue of the fact that Java has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and Perl is still "underground" (and thus cool, for the same reason that ambient is cool). Python is even more "underground" than Perl (and thus cooler?). Maybe all Graham has demonstrated is that proximity to Lisp drives a language underground. Except that he's got the proximity to Lisp argument backwards too."""
regex = re.compile(r"(\blisp\b)|(\bpython\b)|(\bperl\b)|(\bjava\b)|(\bc\b)", re.I)
i = 0; output = "<html>"
for m in regex.finditer(text):
    output += "".join([text[i:m.start()],
                       "<strong><span style='color:%s'>" % COLOR[m.lastindex-1],
                       text[m.start():m.end()],
                       "</span></strong>"])
    i = m.end()
print "".join([output, text[m.end():], "</html>"])</pre>
This example loops over each match in the iterator object returned by <code>finditer</code>. For each match, non-matching text and matching text surrounded with the HTML <code>&lt;span&gt;</code> tag are appended to the <code>output</code> string. <code>start()</code> and <code>end()</code> return the indices of the start and end positions of the matching text. The color of the text is determined by using <code>lastindex</code> to index into a list of colors. <code>lastindex</code> is the index of the last matched capturing group. So, it is <em>"1"</em> if <em>"lisp"</em> is matched, <em>"2"</em> if <em>"python"</em> is matched, <em>"3"</em> if <em>"perl"</em> is matched, and so on. I need to subtract 1 because list indexing starts at 0. The last line appends the rest of the non-matching text and prints the result. When viewed in a browser, it looks something like this:
<blockquote>Graham says that <strong><span style="color:orange">Perl</span></strong> is cooler than <strong><span style="color:violet">Java</span></strong> and <strong><span style="color:blue">Python</span></strong> than <strong><span style="color:orange">Perl</span></strong>. In some circles, maybe. Graham uses the example of Slashdot, written in <strong><span style="color:orange">Perl</span></strong>. But what about Advogato, written in <strong><span style="color:green">C</span></strong>? What about all of the cool P2P stuff being written in all three of the languages? Considering that <strong><span style="color:orange">Perl</span></strong> is older than <strong><span style="color:violet">Java</span></strong>, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider <strong><span style="color:orange">Perl</span></strong> "cooler" than <strong><span style="color:violet">Java</span></strong>, except perhaps by virtue of the fact that <strong><span style="color:violet">Java</span></strong> has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and <strong><span style="color:orange">Perl</span></strong> is still "underground" (and thus cool, for the same reason that ambient is cool). <strong><span style="color:blue">Python</span></strong> is even more "underground" than <strong><span style="color:orange">Perl</span></strong> (and thus cooler?). Maybe all Graham has demonstrated is that proximity to <strong><span style="color:red">Lisp</span></strong> drives a language underground. Except that he's got the proximity to <strong><span style="color:red">Lisp</span></strong> argument backwards too.</blockquote>
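<p>To make the <code>lastindex</code> mechanism concrete, here is a minimal sketch (Python 3 syntax; the three-group pattern is a cut-down version of the one above):</p>

```python
import re

# Each alternative is its own capturing group, so lastindex tells us
# which search term matched and can be used to index a color list.
regex = re.compile(r"(\blisp\b)|(\bpython\b)|(\bperl\b)", re.I)
for m in regex.finditer("Perl and Lisp, then Python"):
    print(m.lastindex, m.group())
```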
Using Python's finditer for Lexical Analysis
2007-10-16T17:35:00-07:00
https://www.saltycrane.com/blog/2007/10/using-pythons-finditer-for-lexical/

<p>Fredrik Lundh wrote a good article called <a href="http://effbot.org/zone/xml-scanner.htm">Using Regular Expressions for Lexical Analysis</a> which explains how to use Python regular expressions to read an input string and group characters into lexical units, or tokens. The author's first group of examples read in a simple expression, <code>"b = 2 + a*10"</code>, and output strings classified as one of three token types: symbols (e.g. <code>a</code> and <code>b</code>), integer literals (e.g. <code>2</code> and <code>10</code>), and operators (e.g. <code>=</code>, <code>+</code>, and <code>*</code>). His first three examples use the <code>findall</code> method and his fourth example uses the undocumented <code>scanner</code> method from the <code>re</code> module. Here is the example code from the fourth example. Note that the <em>"1"</em> in the first column of the results corresponds to the integer literals token group, <em>"2"</em> corresponds to the symbols group, and <em>"3"</em> to the operators group.</p>
<pre class="python">import re

expr = "b = 2 + a*10"
pos = 0
pattern = re.compile("\s*(?:(\d+)|(\w+)|(.))")
scan = pattern.scanner(expr)
while 1:
    m = scan.match()
    if not m:
        break
    print m.lastindex, repr(m.group(m.lastindex))</pre>
Here are the results:
<pre>2 'b'
3 '='
1 '2'
3 '+'
2 'a'
3 '*'
1 '10'</pre>
<p>Since this article was dated 2002, and the author was using Python 2.0, I wondered if this was the most current approach. The author notes that recent versions (i.e. version 2.2 or later) of Python allow you to use the <code>finditer</code> method which uses an internal <code>scanner</code> object. Using <code>finditer</code> makes the example code much simpler. Here is Fredrik's example using <code>finditer</code>:
<pre class="python">import re

expr = "b = 2 + a*10"
regex = re.compile("\s*(?:(\d+)|(\w+)|(.))")
for m in regex.finditer(expr):
    print m.lastindex, repr(m.group(m.lastindex))</pre>
</p><p>Running it produces the same results as the original.</p>
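<p>As an aside (my own variation, not from Fredrik's article), named groups let <code>lastgroup</code> report the token type directly as a string instead of a numeric group index:</p>

```python
import re

# Same tokenizer with named groups; m.lastgroup names the alternative
# that matched, so no mapping from group numbers to token types is needed.
pattern = re.compile(r"\s*(?:(?P<INT>\d+)|(?P<NAME>\w+)|(?P<OP>.))")
for m in pattern.finditer("b = 2 + a*10"):
    print(m.lastgroup, repr(m.group(m.lastindex)))
```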
Python finditer regular expression example
2007-10-03T12:15:00-07:00
https://www.saltycrane.com/blog/2007/10/python-finditer-regular-expression/

<p>I often process text line by line using
the <code>splitlines()</code> method with a <code>for</code>
loop. This works great most of the time; sometimes, however, the text
is not neatly divisible into lines, or I need to match multiple items
per line. This is where the <code>re</code>
module's <code>finditer</code> function can
help. <code>finditer</code> returns an iterator over all
non-overlapping matches for the regular expression pattern in the
string. (See <a href="http://www.python.org/doc/current/lib/node46.html">docs</a>.)
It is a powerful tool for text processing and one that I don't use
often enough.</p>
<p>Here is a simple example which demonstrates the use
of <code>finditer</code>. It reads in a page of html text, finds all
the occurrences of the word <em>"the"</em> and prints <em>"the"</em>
and the following word. It also prints the character position of each
match using the <code>MatchObject</code>'s <code>start()</code>
method. (See <a href="http://www.python.org/doc/current/lib/match-objects.html">docs</a>.)
Note that, for simplicity, I didn't mess with the HTML tags at all. I
just pretended it was plain text. Oh, and the example text is taken
from Steve Yegge's
article: <a href="http://steve-yegge.blogspot.com/2007/08/how-to-make-funny-talk-title-without.html">How
To Make a Funny Talk Title Without Using The Word "Weasel"</a></p>
<p><strong>Python code:</strong></p>
<pre class="python">import re
import urllib2

html = urllib2.urlopen('http://steve-yegge.blogspot.com/2007/08/how-to-make-funny-talk-title-without.html').read()
pattern = r'\b(the\s+\w+)\s+'
regex = re.compile(pattern, re.IGNORECASE)
for match in regex.finditer(html):
    print "%s: %s" % (match.start(), match.group(1))</pre>
<p><strong>Results:</strong></p>
<pre style="height: 300px; overflow:auto">1301: The Word
12291: The Word
13367: the cut
14025: the car
15050: the free
15513: the third
15558: the sessions
15617: the ONLY
15684: the ground
15911: the OSI
15933: The Attack
16051: The gist
16115: the term
16178: the creator
16741: the thing
16850: the same
16877: the thing
16942: the next
17131: the talk
17374: the room
17727: the hell
17782: the term
17830: the 1980s
18083: the whole
18158: the same
18230: the mountain
18305: the seat
18537: The pro
18718: the banner
18928: the poor
19006: the midst
19223: the buzzwagon
19326: the source
19437: the OSI
19855: the OSI
19927: the other
20055: the Ten
20404: The 22
20517: the OSI
20616: the book
21098: the collective
21553: the proposed
21681: the Five
21932: the nearest
22690: The rest
22858: the entertaining
23255: the crap
23561: the next
23661: the registration
23963: the registration
24114: the restaurant
24289: the people
24456: the second
24597: the current
24871: The Style
24929: the front
25047: the curtain
25132: the movie
25159: The hospital
25249: the night
25881: the way
25892: the rear
25927: the crowd
26194: the podium
26262: the front
26521: the door
26593: the front
26622: The economist
27128: the thing
27228: The next
27290: the Pirate
27409: the material
27461: the crowd
27621: the next
27916: The technician
28084: the way
28487: the technician
28735: the exciting
35709: The Next
36587: The Pinocchio
45436: the Kingdom
45679: The Truth
51623: the same
52526: The Word</pre>
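<p>The same pattern works on any string, not just fetched HTML. A self-contained variant (Python 3 syntax, with a made-up snippet standing in for the downloaded page):</p>

```python
import re

# start() reports the character offset of each match; group(1) is the
# captured "the <word>" pair. The trailing \s+ means a match at the very
# end of the text (with no following whitespace) is not reported.
text = "The quick fox jumped over the lazy dog near the riverbank."
regex = re.compile(r'\b(the\s+\w+)\s+', re.IGNORECASE)
for match in regex.finditer(text):
    print("%s: %s" % (match.start(), match.group(1)))
```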