SaltyCrane Blog

Notes on Python, Django, and web development on Ubuntu Linux

    

Using Python's finditer to highlight search items

I am trying to search through a block of text and highlight certain search terms within it using HTML markup. As an example, if I take a paragraph of text from Paul Prescod's essay, I would like to highlight the search terms "lisp", "python", "perl", "java", and "C", each in a different color. My first attempt at this problem looked something like:

for sentence in re.split(r"[?.]\s+", text):
    match = re.search(r"\blisp\b", sentence, re.I)
    if match:
        color = 'red'
    else:
        match = re.search(r"\bpython\b", sentence, re.I)
        if match:
            color = 'blue'
        else:
            match = re.search(r"\bperl\b", sentence, re.I)
            if match:
                color = 'orange'

I didn't finish it because, not only is it ugly and verbose, it doesn't do what I want. Instead of matching all the search terms, it only matches the first one in each sentence. Fortunately, I took some time to rethink the problem (i.e. search the internet (this thread on the Python mailing list was helpful (I guess my Perl background is still showing) as was this article which I previously referenced. (hmmm, this is starting to look like Lisp.))) and made a prettier (and correct) version using my new favorite regular expression method, finditer, and the MatchObject's lastindex attribute. Here is the working example:

import re

COLOR = ['red', 'blue', 'orange', 'violet', 'green']

text = """Graham says that Perl is cooler than Java and Python than Perl. In some circles, maybe. Graham uses the example of Slashdot, written in Perl. But what about Advogato, written in C? What about all of the cool P2P stuff being written in all three of the languages? Considering that Perl is older than Java, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider Perl "cooler" than Java, except perhaps by virtue of the fact that Java has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and Perl is still "underground" (and thus cool, for the same reason that ambient is cool). Python is even more "underground" than Perl (and thus cooler?). Maybe all Graham has demonstrated is that proximity to Lisp drives a language underground. Except that he's got the proximity to Lisp argument backwards too."""

regex = re.compile(r"(\blisp\b)|(\bpython\b)|(\bperl\b)|(\bjava\b)|(\bc\b)", re.I)

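i = 0
output = "<html>"
for m in regex.finditer(text):
    # Append the text between the previous match and this one, then the wrapped match
    output += "".join([text[i:m.start()],
                       "<strong><span style='color:%s'>" % COLOR[m.lastindex-1],
                       text[m.start():m.end()],
                       "</span></strong>"])
    i = m.end()
# Use i rather than m.end() so this also works when there are no matches
print("".join([output, text[i:], "</html>"]))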

This example loops over each match in the iterator returned by finditer. For each match, the non-matching text that precedes it and the matching text wrapped in an HTML <span> tag are appended to the output string. start() and end() return the indices of the start and end positions of the matching text. The color of the text is determined by using lastindex to index into the list of colors. lastindex is the index of the last matched capturing group. Because each search term has its own group in the alternation, it is 1 if "lisp" is matched, 2 if "python" is matched, 3 if "perl" is matched, and so on. I need to subtract 1 because list indexing starts at 0. The last line appends the rest of the non-matching text and prints the result. When viewed in a browser, it looks something like this:

Graham says that Perl is cooler than Java and Python than Perl. In some circles, maybe. Graham uses the example of Slashdot, written in Perl. But what about Advogato, written in C? What about all of the cool P2P stuff being written in all three of the languages? Considering that Perl is older than Java, and was at one time the Next Big Language, I think you would have a hard time getting statistical evidence that programmers consider Perl "cooler" than Java, except perhaps by virtue of the fact that Java has spent a few years as the "industry standard" (and is thus uncool for the same reason that the Spice Girls are uncool) and Perl is still "underground" (and thus cool, for the same reason that ambient is cool). Python is even more "underground" than Perl (and thus cooler?). Maybe all Graham has demonstrated is that proximity to Lisp drives a language underground. Except that he's got the proximity to Lisp argument backwards too.
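
To make the role of start(), end(), and lastindex more concrete, here is a minimal sketch that prints those values for each match. The short sample string is my own assumption for illustration; the regular expression is the same one used above.

```python
import re

# Short sample string (an assumption for illustration, not the essay text).
sample = "Perl is older than Java; Lisp is older than both."

regex = re.compile(r"(\blisp\b)|(\bpython\b)|(\bperl\b)|(\bjava\b)|(\bc\b)", re.I)

# Collect (matched text, start, end, lastindex) for every match.
matches = [(m.group(), m.start(), m.end(), m.lastindex)
           for m in regex.finditer(sample)]

for row in matches:
    print(row)
```

Each tuple shows which group matched: "Perl" reports lastindex 3, "Java" reports 4, and "Lisp" reports 1, which is why subtracting 1 gives the right position in the color list.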

2 Comments


#1 Jay G commented on 2008-06-25:

Good advice on ORing the different terms into one regex. I started something like this as you did; with multiple searches. This is much cleaner. In Java I don't see an equivalent of lastindex, but a dict/map of terms to colors works fine.
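
The dict/map idea mentioned in this comment can be sketched in Python as well. This is only an illustration under my own assumptions: the TERM_COLORS mapping and the highlight function are hypothetical names, and it swaps lastindex for a dictionary lookup inside a re.sub callback.

```python
import re

# Hypothetical term-to-color mapping; keys are lowercase so lookups
# can ignore the case of the matched text.
TERM_COLORS = {"lisp": "red", "python": "blue", "perl": "orange",
               "java": "violet", "c": "green"}

def highlight(text, term_colors=TERM_COLORS):
    # Build one alternation from the escaped terms: \b(?:lisp|python|...)\b
    pattern = r"\b(?:%s)\b" % "|".join(re.escape(t) for t in term_colors)
    regex = re.compile(pattern, re.I)

    def wrap(m):
        # Look up the color by the lowercased match instead of by group index.
        color = term_colors[m.group().lower()]
        return "<strong><span style='color:%s'>%s</span></strong>" % (color, m.group())

    return regex.sub(wrap, text)

print(highlight("Perl is older than Java."))
```

With this approach, adding a search term only means adding a dictionary entry; the regular expression and the color lookup stay in sync automatically.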


#2 badc0re commented on 2012-06-06:

cool, i was seeking for a text highlight :)
