Fredrik Lundh wrote a good article called "Using Regular Expressions for Lexical Analysis", which explains how to use Python regular expressions to read an input string and group its characters into lexical units, or tokens. The author's first group of examples reads in a simple expression, "b = 2 + a*10", and outputs strings classified as one of three token types: symbols (e.g. b), integer literals (e.g. 10), and operators (e.g. *). His first three examples use the findall method and his fourth example uses the undocumented scanner method from the re module. Here is the code from the fourth example. Note that the "1" in the first column of the results corresponds to the integer literals group, "2" to the symbols group, and "3" to the operators group.
    import re

    expr = "b = 2 + a*10"

    pos = 0
    pattern = re.compile("\s*(?:(\d+)|(\w+)|(.))")
    scan = pattern.scanner(expr)
    while 1:
        m = scan.match()
        if not m:
            break
        print m.lastindex, repr(m.group(m.lastindex))

Results:

    2 'b'
    3 '='
    1 '2'
    3 '+'
    2 'a'
    3 '*'
    1 '10'
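To make the link between the group numbers and the token types explicit, here is a minimal sketch of the same scanner loop written for Python 3. It is my own adaptation rather than Fredrik's code: the TOKEN_NAMES table and the scan_tokens function are hypothetical names chosen for illustration, and since scanner is undocumented, its behavior may differ between Python versions.

```python
import re

# Hypothetical labels for the three capture groups, in pattern order:
# group 1 is (\d+), group 2 is (\w+), group 3 is (.)
TOKEN_NAMES = {1: "integer", 2: "symbol", 3: "operator"}


def scan_tokens(expr):
    """Return (token_name, text) pairs using the undocumented scanner method."""
    pattern = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
    scan = pattern.scanner(expr)
    tokens = []
    while True:
        m = scan.match()  # match at the current position, then advance
        if not m:
            break
        # lastindex is the number of the capture group that matched
        tokens.append((TOKEN_NAMES[m.lastindex], m.group(m.lastindex)))
    return tokens


for name, text in scan_tokens("b = 2 + a*10"):
    print(name, repr(text))
```

The dictionary lookup replaces the bare group number with a readable label, so the first line printed is the symbol 'b' instead of 2 'b'.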
Since this article was dated 2002, and the author was using Python 2.0, I wondered if this was the most current approach. The author notes that recent versions of Python (2.2 or later) provide the finditer method, which uses a scanner object internally. Using finditer makes the example code much simpler. Here is Fredrik's example using finditer:
    import re

    expr = "b = 2 + a*10"

    regex = re.compile("\s*(?:(\d+)|(\w+)|(.))")
    for m in regex.finditer(expr):
        print m.lastindex, repr(m.group(m.lastindex))
Running it produces the same results as the original.
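For readers on Python 3, the finditer example can be sketched with named groups, so that each match reports a token name instead of a group number. This is an assumption-laden variation, not code from Fredrik's article: the group names (integer, symbol, operator) and the tokenize function are hypothetical, and m.lastgroup is used in place of m.lastindex.

```python
import re

# Same pattern as the finditer example, with hypothetical names on each group.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<integer>\d+)|(?P<symbol>\w+)|(?P<operator>.))"
)


def tokenize(expr):
    """Yield (token_name, text) pairs for each token found in expr."""
    for m in TOKEN_RE.finditer(expr):
        # lastgroup is the name of the capture group that matched
        yield m.lastgroup, m.group(m.lastgroup)


for kind, text in tokenize("b = 2 + a*10"):
    print(kind, repr(text))
```

Naming the groups keeps the token labels next to the sub-patterns that define them, which may be easier to maintain than a separate number-to-name table if more token types are added.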