# Using Python's finditer for Lexical Analysis

Fredrik Lundh wrote a good article, "Using Regular Expressions for Lexical Analysis", which explains how to use Python regular expressions to read an input string and group its characters into lexical units, or tokens. His first group of examples reads a simple expression, `"b = 2 + a*10"`, and outputs strings classified as one of three token types: symbols (e.g. `a` and `b`), integer literals (e.g. `2` and `10`), and operators (e.g. `=`, `+`, and `*`). His first three examples use the `findall` method, and his fourth uses the undocumented `scanner` method of compiled pattern objects in the `re` module. Here is the code from the fourth example. Note that the "1" in the first column of the results corresponds to the integer literal group, "2" to the symbol group, and "3" to the operator group.

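```python
import re

expr = "b = 2 + a*10"
# Group 1: integer literal, group 2: symbol, group 3: any other single character
pattern = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
scan = pattern.scanner(expr)
while True:
    # Each call matches the next token, starting where the previous match ended
    m = scan.match()
    if not m:
        break
    # lastindex is the number of the group that matched
    print(m.lastindex, repr(m.group(m.lastindex)))
```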
Here are the results:
```
2 'b'
3 '='
1 '2'
3 '+'
2 'a'
3 '*'
1 '10'
```

Since the article is dated 2002 and the author was using Python 2.0, I wondered whether this was still the most current approach. The author notes that recent versions of Python (2.2 or later) let you use the `finditer` method, which uses an internal `scanner` object. Using `finditer` makes the example code much simpler. Here is Fredrik's example using `finditer`:

```python
import re

expr = "b = 2 + a*10"
regex = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")
# finditer yields one match object per token, in order
for m in regex.finditer(expr):
    print(m.lastindex, repr(m.group(m.lastindex)))
```

Running it produces the same results as the original.
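As a possible next step, the same `finditer` loop can report token names instead of group numbers by using named groups and the match object's `lastgroup` attribute. This is only a sketch under my own assumptions: the group names (`integer`, `symbol`, `operator`) and the `tokenize` generator are illustrative and do not come from the original article.

```python
import re

# Same pattern as above, but each alternative is a named group.
# The group names are illustrative, not from the original article.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<integer>\d+)|(?P<symbol>\w+)|(?P<operator>.))"
)


def tokenize(expr):
    """Yield (kind, text) pairs, where kind is the name of the group that matched."""
    for m in TOKEN_RE.finditer(expr):
        yield m.lastgroup, m.group(m.lastgroup)


for kind, text in tokenize("b = 2 + a*10"):
    print(kind, repr(text))
```

Because the pattern ends in a catch-all `.` and skips leading whitespace with `\s*`, `finditer` never has to skip over any non-whitespace input here, which is why it yields the same tokens as the anchored `scanner` loop.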