SaltyCrane: regexes
https://www.saltycrane.com/blog/

(Not too successfully) trying to use Unix tools instead of Python utility scripts
2011-04-20T19:00:32-07:00
https://www.saltycrane.com/blog/2011/04/trying-use-unix-tools-instead-python-utility-scripts/

<p>
Inspired by articles such as
<a href="http://gregable.com/2010/09/why-you-should-know-just-little-awk.html">
<em>Why you should learn just a little Awk</em></a> and
<a href="http://www.johndcook.com/blog/2011/04/19/learn-one-sed-command/">
<em>Learn one sed command</em></a>, I am trying to make use of Unix tools
<code>sed</code>, <code>awk</code>, <code>grep</code>, <code>cut</code>, <code>uniq</code>, <code>sort</code>,
etc. instead of writing short Python utility scripts.
</p>
<p>Here is a Python script I wrote this week. It greps a file for a given
regular expression pattern and returns a unique, sorted list of the matches
captured by the pattern's parentheses.
</p>
<pre class="python"># grep2.py
import re
import sys

def main():
    patt = sys.argv[1]
    filename = sys.argv[2]
    text = open(filename).read()
    matchlist = set(m.group(1) for m in re.finditer(patt, text, re.MULTILINE))
    for m in sorted(matchlist):
        print m

if __name__ == '__main__':
    main()</pre>
<p>
As an example, I used my script to search
<a href="http://code.djangoproject.com/browser/django/tags/releases/1.3/django/contrib/admin/templates/admin/edit_inline/tabular.html">
one of the Django admin template files</a> for all the Django template markup in the file.
</p>
<pre class="console">$ python grep2.py '({{[^{}]+}}|{%[^{}]+%})' tabular.html </pre>
<p>Output:</p>
<pre>{% admin_media_prefix %}
{% blocktrans with inline_admin_formset.opts.verbose_name|title as verbose_name %}
{% cycle "row1" "row2" %}
{% else %}
{% endblocktrans %}
{% endfor %}
{% endif %}
{% endspaceless %}
{% for field in inline_admin_formset.fields %}
{% for field in line %}
{% for fieldset in inline_admin_form %}
{% for inline_admin_form in inline_admin_formset %}
{% for line in fieldset %}
{% if field.is_hidden %}
{% if field.is_readonly %}
{% if field.required %}
{% if forloop.first %}
{% if forloop.last %}
{% if inline_admin_form.form.non_field_errors %}
{% if inline_admin_form.has_auto_field %}
{% if inline_admin_form.original %}
{% if inline_admin_form.original or inline_admin_form.show_url %}
{% if inline_admin_form.show_url %}
{% if inline_admin_formset.formset.can_delete %}
{% if not field.widget.is_hidden %}
{% if not forloop.last %}
{% load i18n adminmedia admin_modify %}
{% spaceless %}
{% trans "Delete?" %}
{% trans "Remove" %}
{% trans "View on site" %}
{{ field.contents }}
{{ field.field }}
{{ field.field.errors.as_ul }}
{{ field.field.name }}
{{ field.label|capfirst }}
{{ forloop.counter0 }}
{{ inline_admin_form.deletion_field.field }}
{{ inline_admin_form.fk_field.field }}
{{ inline_admin_form.form.non_field_errors }}
{{ inline_admin_form.original }}
{{ inline_admin_form.original.id }}
{{ inline_admin_form.original_content_type_id }}
{{ inline_admin_form.pk_field.field }}
{{ inline_admin_formset.formset.management_form }}
{{ inline_admin_formset.formset.non_form_errors }}
{{ inline_admin_formset.formset.prefix }}
{{ inline_admin_formset.opts.verbose_name_plural|capfirst }}
{{ inline_admin_form|cell_count }}
{{ verbose_name }}</pre>
<p>Here's my attempt at using Unix tools:</p>
<pre class="console">$ sed -rn 's/^.*(\{\{.*\}\}|\{%.*%\}).*$/\1/gp' tabular.html | sort | uniq </pre>
<p>However, the output isn't quite the same. Because the leading <code>.*</code> is greedy, the sed substitution keeps only the last match on each line, so lines containing multiple tags lose all but one:</p>
<pre>{% admin_media_prefix %}
{% else %}
{% endblocktrans %}
{% endfor %}
{% endif %}
{% endspaceless %}
{% for field in inline_admin_formset.fields %}
{% for field in line %}
{% for fieldset in inline_admin_form %}
{% for inline_admin_form in inline_admin_formset %}
{% for line in fieldset %}
{% if field.is_readonly %}
{% if inline_admin_form.form.non_field_errors %}
{% if inline_admin_form.original or inline_admin_form.show_url %}
{% if inline_admin_formset.formset.can_delete %}
{% if not field.widget.is_hidden %}
{% load i18n adminmedia admin_modify %}
{% spaceless %}
{% trans "Remove" %}
{{ field.contents }}
{{ field.field }}
{{ field.field.errors.as_ul }}
{{ field.field.name }}
{{ field.label|capfirst }}
{{ inline_admin_form.fk_field.field }}
{{ inline_admin_form.form.non_field_errors }}
{{ inline_admin_formset.formset.management_form }}
{{ inline_admin_formset.formset.non_form_errors }}
{{ inline_admin_formset.formset.prefix }}
{{ inline_admin_formset.opts.verbose_name_plural|capfirst }}</pre>
<p>Unix tools are powerful and concise, but I still need to
get a lot more comfortable with their syntax. Please leave a comment
if you know how to fix my command.
</p>
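<p>A possible fix (my guess, not the author's solution): GNU grep's <code>-o</code> option prints each match on its own line, which handles multiple matches per line and avoids the greedy-capture problem entirely. A minimal sketch, using a small made-up <code>sample.html</code> as a stand-in for <code>tabular.html</code>:</p>

```shell
# Create a tiny sample with Django-style markup (hypothetical test input;
# %% in the printf format produces a literal %)
printf '<td>{{ field.field }}</td>\n{%% if x %%}{{ y }}{%% endif %%}\n' > sample.html

# -o prints only the matched text, one match per line, so multiple
# tags on one line are all captured; sort | uniq deduplicates
grep -oE '\{\{[^{}]+\}\}|\{%[^{}]+%\}' sample.html | sort | uniq
```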
How to search C code for division or sqrt
2008-07-24T15:12:20-07:00
https://www.saltycrane.com/blog/2008/07/how-search-c-code-division-or-sqrt/

<p>The following Python script searches through C code for division or sqrt
and prints the line of code and the line number. It skips C comments.
To use, run <code>python find_divides.py filename.c</code>
</p>
<pre class="python">#!/usr/bin/python
"""find_divides.py
usage: python find_divides.py filename
"""
import re
import sys

def main():
    filename = sys.argv[1]
    text = open(filename).read()
    lines = text.splitlines()
    lines = ["%4d: %s" % (i, line) for (i, line) in enumerate(lines)]
    text = "\n".join(lines)
    text = remove_comments_and_strings(text)
    for line in text.splitlines():
        if ("/" in line) or ("sqrt" in line):
            print line

def remove_comments_and_strings(text):
    """ remove c-style comments and strings
    text: blob of text with comments (can include newlines)
    returns: text with comments and strings removed
    """
    pattern = r"""
                            ##  --------- COMMENT ---------
      /\*                   ##  Start of /* ... */ comment
      [^*]*\*+              ##  Non-* followed by 1-or-more *'s
      (                     ##
        [^/*][^*]*\*+       ##
      )*                    ##  0-or-more things which don't start with /
                            ##    but do end with '*'
      /                     ##  End of /* ... */ comment
    |                       ##  -OR-  various things which aren't comments:
      (                     ##
                            ##  ------ " ... " STRING ------
        "                   ##  Start of " ... " string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^"\\]            ##  Non "\ characters
        )*                  ##
        "                   ##  End of " ... " string
      |                     ##  -OR-
                            ##
                            ##  ------ ' ... ' STRING ------
        '                   ##  Start of ' ... ' string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^'\\]            ##  Non '\ characters
        )*                  ##
        '                   ##  End of ' ... ' string
      |                     ##  -OR-
                            ##
                            ##  ------ ANYTHING ELSE -------
        (.                  ##  Anything other char
         [^/"'\\]*)         ##  Chars which doesn't start a comment, string
      )                     ##    or escape
    """
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
    goodstuff = [m.group(5) for m in regex.finditer(text) if m.group(5)]
    return "".join(goodstuff)

if __name__ == "__main__":
    main()</pre>
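<p>The key trick in the script above is numbering the lines <em>before</em> stripping comments and strings, so the surviving lines still report their original positions. The trick in isolation (a minimal sketch with a made-up snippet, written in Python 3 syntax):</p>

```python
# Prefix each line with its index first; filtering afterwards keeps
# the original line numbers attached to the surviving lines.
code = "int a = 1;\nb = a / 2;\nc = sqrt(b);\n"
numbered = ["%4d: %s" % (i, line) for i, line in enumerate(code.splitlines())]
for line in numbered:
    if ("/" in line) or ("sqrt" in line):
        print(line)
```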
How to remove C style comments using Python
2007-11-28T17:25:00-08:00
https://www.saltycrane.com/blog/2007/11/remove-c-comments-python/

<p>The Perl FAQ has an
entry <a href="http://perldoc.perl.org/perlfaq6.html#How-do-I-use-a-regular-expression-to-strip-C-style-comments-from-a-file%3f">How
do I use a regular expression to strip C style comments from a
file?</a> Since I've switched to Python, I've adapted the Perl
solution to Python. This regular expression was created by Jeffrey Friedl and
later modified by Fred Curtis. I'm not certain, but it
appears to use the "unrolling the loop" technique described in
Chapter 6 of <em>Mastering Regular Expressions</em>.</p>
<p>remove_comments.py:</p>
<pre class="python">import re
import sys

def remove_comments(text):
    """ remove c-style comments.
    text: blob of text with comments (can include newlines)
    returns: text with comments removed
    """
    pattern = r"""
                            ##  --------- COMMENT ---------
      /\*                   ##  Start of /* ... */ comment
      [^*]*\*+              ##  Non-* followed by 1-or-more *'s
      (                     ##
        [^/*][^*]*\*+       ##
      )*                    ##  0-or-more things which don't start with /
                            ##    but do end with '*'
      /                     ##  End of /* ... */ comment
    |                       ##  -OR-  various things which aren't comments:
      (                     ##
                            ##  ------ " ... " STRING ------
        "                   ##  Start of " ... " string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^"\\]            ##  Non "\ characters
        )*                  ##
        "                   ##  End of " ... " string
      |                     ##  -OR-
                            ##
                            ##  ------ ' ... ' STRING ------
        '                   ##  Start of ' ... ' string
        (                   ##
          \\.               ##  Escaped char
        |                   ##  -OR-
          [^'\\]            ##  Non '\ characters
        )*                  ##
        '                   ##  End of ' ... ' string
      |                     ##  -OR-
                            ##
                            ##  ------ ANYTHING ELSE -------
        .                   ##  Anything other char
        [^/"'\\]*           ##  Chars which doesn't start a comment, string
      )                     ##    or escape
    """
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
    noncomments = [m.group(2) for m in regex.finditer(text) if m.group(2)]
    return "".join(noncomments)

if __name__ == '__main__':
    filename = sys.argv[1]
    code_w_comments = open(filename).read()
    code_wo_comments = remove_comments(code_w_comments)
    fh = open(filename+".nocomments", "w")
    fh.write(code_wo_comments)
    fh.close()</pre>
<br>Example:<br>
To test the script, I created a test file called <code>testfile.c</code>:
<pre>/* This is a C-style comment. */
This is not a comment.
/* This is another
* C-style comment.
*/
"This is /* also not a comment */"</pre>
<br>Run the script:<br>
To use the script, I put the script, <code>remove_comments.py</code>,
and my test file, <code>testfile.c</code>, in the same directory and ran the
following command:<br>
<pre>python remove_comments.py testfile.c</pre>
<br>Results:<br>
The script created a new file called <code>testfile.c.nocomments</code>:
<pre>
This is not a comment.
"This is /* also not a comment */"</pre>
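<p>For quick experimentation, the same unrolled pattern can be transcribed into a compact one-liner (my own condensed form, not from the original post; group 1 here plays the role of group 2 in the commented version):</p>

```python
import re

# Compact transcription of the commented pattern above: the comment
# alternative is left uncaptured, so group 1 holds everything that is
# NOT inside a /* ... */ comment (strings and ordinary code).
pattern = r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|("(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'|.[^/"\'\\]*)'

def remove_comments(text):
    return "".join(m.group(1) for m in re.finditer(pattern, text, re.DOTALL) if m.group(1))

print(remove_comments('a = 1; /* gone */ s = "/* kept */";'))
```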
<br><br><br>
---------------<br>
Minor note on Perl to Python migration:<br>
I modified the original regular expression comments a little bit.
In particular, I had to put at least one character after the <code>##
Non "\</code> and <code>## Non '\</code> lines because, in Python,
the backslash was escaping the following newline character and the
closing parenthesis on the following line was being treated as a
comment by the regular expression engine. This is the error I got,
before the fix:
<pre>$ python remove_comments.py
Traceback (most recent call last):
File "remove_comments.py", line 39, in <module>
regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
File "C:\Programs\Python25\lib\re.py", line 180, in compile
return _compile(pattern, flags)
File "C:\Programs\Python25\lib\re.py", line 233, in _compile
raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis</pre>
Using Python's finditer to highlight search items
2007-10-16T17:39:00-07:00
https://www.saltycrane.com/blog/2007/10/using-pythons-finditer-to-highlight/

<p>I am trying to search through various text and highlight certain search terms within that text using HTML markup. As an example, if I take a paragraph of text from Paul Prescod's <a href="http://www.prescod.net/python/IsPythonLisp.html">essay</a>, I would like to highlight the search terms <em>"lisp"</em>, <em>"python"</em>, <em>"perl"</em>, <em>"java"</em>, and <em>"C"</em> each in a different color. My first attempt at this problem looked something like:</p>
<pre class="python">for sentence in re.split(r"[?.]\s+", text):
    match = re.search(r"\blisp\b", sentence, re.I)
    if match:
        color = 'red'
    else:
        match = re.search(r"\bpython\b", sentence, re.I)
        if match:
            color = 'blue'
        else:
            match = re.search(r"\bperl\b", sentence, re.I)
            if match:
                color = 'orange'</pre>
<p>I didn't finish it because, not only is it ugly and verbose, it doesn't do what I want. Instead of matching all the search terms, it only matches the first one in each sentence. Fortunately, I took some time to rethink the problem (i.e. search the internet (this <a href="http://mail.python.org/pipermail/python-list/2004-December/298175.html">thread</a> on the Python mailing list was helpful (I guess my Perl background is still showing) as was <a href="http://effbot.org/zone/xml-scanner.htm">this article</a> which I previously referenced. (hmmm, this is starting to look like Lisp.))) and made a prettier (and correct) version using my new favorite regular expression method, <code>finditer</code>, and the <code>MatchObject</code>'s <code>lastindex</code> attribute. Here is the working example:</p>
<pre class="python">import re
COLOR = ['red', 'blue', 'orange', 'violet', 'green']
text = """Graham says that Perl is cooler than Java and Python than Perl. In some circles, maybe. Graham uses the example of Slashdot, written in Perl. But what about Advogato, written in C? What about all of the cool P2P stuff being written in all three of the languages? Considering that Perl is older than Java, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider Perl "cooler" than Java, except perhaps by virtue of the fact that Java has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and Perl is still "underground" (and thus cool, for the same reason that ambient is cool). Python is even more "underground" than Perl (and thus cooler?). Maybe all Graham has demonstrated is that proximity to Lisp drives a language underground. Except that he's got the proximity to Lisp argument backwards too."""
regex = re.compile(r"(\blisp\b)|(\bpython\b)|(\bperl\b)|(\bjava\b)|(\bc\b)", re.I)
i = 0; output = "<html>"
for m in regex.finditer(text):
    output += "".join([text[i:m.start()],
                       "<strong><span style='color:%s'>" % COLOR[m.lastindex-1],
                       text[m.start():m.end()],
                       "</span></strong>"])
    i = m.end()
print "".join([output, text[m.end():], "</html>"])</pre>
This example loops over each match in the iterator object returned by <code>finditer</code>. For each match, non-matching text and matching text surrounded with the HTML <code>&lt;span&gt;</code> tag are appended to the <code>output</code> string. <code>start()</code> and <code>end()</code> return the indices of the start and end positions of the matching text. The color of the text is determined by using <code>lastindex</code> to index into a list of colors. <code>lastindex</code> is the index of the last matched capturing group. So, it is <em>"1"</em> if <em>"lisp"</em> is matched, <em>"2"</em> if <em>"python"</em> is matched, <em>"3"</em> if <em>"perl"</em> is matched, and so on. I need to subtract 1 because list indexing starts at 0. The last line appends the rest of the non-matching text and prints the result. When viewed in a browser, it looks something like this:
<blockquote>Graham says that <strong><span style="color:orange">Perl</span></strong> is cooler than <strong><span style="color:violet">Java</span></strong> and <strong><span style="color:blue">Python</span></strong> than <strong><span style="color:orange">Perl</span></strong>. In some circles, maybe. Graham uses the example of Slashdot, written in <strong><span style="color:orange">Perl</span></strong>. But what about Advogato, written in <strong><span style="color:green">C</span></strong>? What about all of the cool P2P stuff being written in all three of the languages? Considering that <strong><span style="color:orange">Perl</span></strong> is older than <strong><span style="color:violet">Java</span></strong>, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider <strong><span style="color:orange">Perl</span></strong> "cooler" than <strong><span style="color:violet">Java</span></strong>, except perhaps by virtue of the fact that <strong><span style="color:violet">Java</span></strong> has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and <strong><span style="color:orange">Perl</span></strong> is still "underground" (and thus cool, for the same reason that ambient is cool). <strong><span style="color:blue">Python</span></strong> is even more "underground" than <strong><span style="color:orange">Perl</span></strong> (and thus cooler?). Maybe all Graham has demonstrated is that proximity to <strong><span style="color:red">Lisp</span></strong> drives a language underground. Except that he's got the proximity to <strong><span style="color:red">Lisp</span></strong> argument backwards too.</blockquote>
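<p>To make the <code>lastindex</code> mechanism concrete, here is a minimal sketch (Python 3 syntax; the three-group pattern is a cut-down version of the one above):</p>

```python
import re

# Each alternative is its own capturing group, so lastindex tells us
# which search term matched and can be used to index a color list.
regex = re.compile(r"(\blisp\b)|(\bpython\b)|(\bperl\b)", re.I)
for m in regex.finditer("Perl and Lisp, then Python"):
    print(m.lastindex, m.group())
```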
Using Python's finditer for Lexical Analysis
2007-10-16T17:35:00-07:00
https://www.saltycrane.com/blog/2007/10/using-pythons-finditer-for-lexical/

<p>Fredrik Lundh wrote a good article called <a href="http://effbot.org/zone/xml-scanner.htm">Using Regular Expressions for Lexical Analysis</a> which explains how to use Python regular expressions to read an input string and group characters into lexical units, or tokens. The author's first group of examples read in a simple expression, <code>"b = 2 + a*10"</code>, and output strings classified as one of three token types: symbols (e.g. <code>a</code> and <code>b</code>), integer literals (e.g. <code>2</code> and <code>10</code>), and operators (e.g. <code>=</code>, <code>+</code>, and <code>*</code>). His first three examples use the <code>findall</code> method and his fourth example uses the undocumented <code>scanner</code> method from the <code>re</code> module. Here is the example code from the fourth example. Note that the <em>"1"</em> in the first column of the results corresponds to the integer literals token group, <em>"2"</em> corresponds to the symbols group, and <em>"3"</em> to the operators group.</p>
<pre class="python">import re

expr = "b = 2 + a*10"
pos = 0
pattern = re.compile("\s*(?:(\d+)|(\w+)|(.))")
scan = pattern.scanner(expr)
while 1:
    m = scan.match()
    if not m:
        break
    print m.lastindex, repr(m.group(m.lastindex))</pre>
Here are the results:
<pre>2 'b'
3 '='
1 '2'
3 '+'
2 'a'
3 '*'
1 '10'</pre>
<p>Since this article was dated 2002, and the author was using Python 2.0, I wondered if this was the most current approach. The author notes that recent versions (i.e. version 2.2 or later) of Python allow you to use the <code>finditer</code> method which uses an internal <code>scanner</code> object. Using <code>finditer</code> makes the example code much simpler. Here is Fredrik's example using <code>finditer</code>:
<pre class="python">import re

expr = "b = 2 + a*10"
regex = re.compile("\s*(?:(\d+)|(\w+)|(.))")
for m in regex.finditer(expr):
    print m.lastindex, repr(m.group(m.lastindex))</pre>
</p><p>Running it produces the same results as the original.</p>
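<p>As an aside (my own variation, not from Fredrik's article), named groups let <code>lastgroup</code> report the token type directly as a string instead of a numeric group index:</p>

```python
import re

# Same tokenizer with named groups; m.lastgroup names the alternative
# that matched, so no mapping from group numbers to token types is needed.
pattern = re.compile(r"\s*(?:(?P<INT>\d+)|(?P<NAME>\w+)|(?P<OP>.))")
for m in pattern.finditer("b = 2 + a*10"):
    print(m.lastgroup, repr(m.group(m.lastindex)))
```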
Python finditer regular expression example
2007-10-03T12:15:00-07:00
https://www.saltycrane.com/blog/2007/10/python-finditer-regular-expression/

<p>I often process text line by line using
the <code>splitlines()</code> method with a <code>for</code>
loop. This works great most of the time; sometimes, however, the text
is not neatly divisible into lines, or I need to match multiple items
per line. This is where the <code>re</code>
module's <code>finditer</code> function can
help. <code>finditer</code> returns an iterator over all
non-overlapping matches for the regular expression pattern in the
string. (See <a href="http://www.python.org/doc/current/lib/node46.html">docs</a>.)
It is a powerful tool for text processing and one that I don't use
often enough.</p>
<p>Here is a simple example which demonstrates the use
of <code>finditer</code>. It reads in a page of html text, finds all
the occurrences of the word <em>"the"</em> and prints <em>"the"</em>
and the following word. It also prints the character position of each
match using the <code>MatchObject</code>'s <code>start()</code>
method. (See <a href="http://www.python.org/doc/current/lib/match-objects.html">docs</a>.)
Note that, for simplicity, I didn't mess with the HTML tags at all. I
just pretended it was plain text. Oh, and the example text is taken
from Steve Yegge's
article: <a href="http://steve-yegge.blogspot.com/2007/08/how-to-make-funny-talk-title-without.html">How
To Make a Funny Talk Title Without Using The Word "Weasel"</a></p>
<p><strong>Python code:</strong></p>
<pre class="python">import re
import urllib2

html = urllib2.urlopen('http://steve-yegge.blogspot.com/2007/08/how-to-make-funny-talk-title-without.html').read()
pattern = r'\b(the\s+\w+)\s+'
regex = re.compile(pattern, re.IGNORECASE)
for match in regex.finditer(html):
    print "%s: %s" % (match.start(), match.group(1))</pre>
<p><strong>Results:</strong></p>
<pre style="height: 300px; overflow:auto">1301: The Word
12291: The Word
13367: the cut
14025: the car
15050: the free
15513: the third
15558: the sessions
15617: the ONLY
15684: the ground
15911: the OSI
15933: The Attack
16051: The gist
16115: the term
16178: the creator
16741: the thing
16850: the same
16877: the thing
16942: the next
17131: the talk
17374: the room
17727: the hell
17782: the term
17830: the 1980s
18083: the whole
18158: the same
18230: the mountain
18305: the seat
18537: The pro
18718: the banner
18928: the poor
19006: the midst
19223: the buzzwagon
19326: the source
19437: the OSI
19855: the OSI
19927: the other
20055: the Ten
20404: The 22
20517: the OSI
20616: the book
21098: the collective
21553: the proposed
21681: the Five
21932: the nearest
22690: The rest
22858: the entertaining
23255: the crap
23561: the next
23661: the registration
23963: the registration
24114: the restaurant
24289: the people
24456: the second
24597: the current
24871: The Style
24929: the front
25047: the curtain
25132: the movie
25159: The hospital
25249: the night
25881: the way
25892: the rear
25927: the crowd
26194: the podium
26262: the front
26521: the door
26593: the front
26622: The economist
27128: the thing
27228: The next
27290: the Pirate
27409: the material
27461: the crowd
27621: the next
27916: The technician
28084: the way
28487: the technician
28735: the exciting
35709: The Next
36587: The Pinocchio
45436: the Kingdom
45679: The Truth
51623: the same
52526: The Word</pre>
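<p>The same pattern works on any string, not just fetched HTML. A self-contained variant (Python 3 syntax, with a made-up snippet standing in for the downloaded page):</p>

```python
import re

# start() reports the character offset of each match; group(1) is the
# captured "the <word>" pair. The trailing \s+ means a match at the very
# end of the text (with no following whitespace) is not reported.
text = "The quick fox jumped over the lazy dog near the riverbank."
regex = re.compile(r'\b(the\s+\w+)\s+', re.IGNORECASE)
for match in regex.finditer(text):
    print("%s: %s" % (match.start(), match.group(1)))
```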