How to remove C style comments using Python
The Perl FAQ has an entry titled "How do I use a regular expression to strip C style comments from a file?" Since I've switched to Python, I've adapted the Perl solution to Python. The regular expression was created by Jeffrey Friedl and later modified by Fred Curtis. I'm not certain, but it appears to use the "unrolling the loop" technique described in Chapter 6 of Mastering Regular Expressions.
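To look at the comment-matching part on its own, here is a minimal sketch that pulls that sub-pattern out onto one line and compares it with the simpler lazy form. The names UNROLLED, LAZY, and find_comments are my own for illustration and are not from the Perl FAQ. Neither pattern knows about string literals here, so this only shows how the comment itself is matched.

```python
import re

# Comment sub-pattern from the script below, written on one line.
# Shape: opening, normal*, then (special normal*)*, then closing.
UNROLLED = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Simpler lazy-quantifier form, shown only for comparison.
LAZY = re.compile(r"/\*.*?\*/", re.DOTALL)


def find_comments(regex, text):
    """Return every substring of text matched by regex."""
    return [m.group(0) for m in regex.finditer(text)]


sample = "a /* one */ b /* two ** stars **/ c /* multi\n * line */ d"

print(find_comments(UNROLLED, sample))
# Both patterns should agree on well-formed input like this sample.
print(find_comments(UNROLLED, sample) == find_comments(LAZY, sample))
```

The unrolled form needs no DOTALL flag because its negated character classes already match newlines.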
remove_comments.py:
import re
import sys


def remove_comments(text):
    """Remove C-style comments.

    text: blob of text with comments (can include newlines)
    returns: text with comments removed
    """
    pattern = r"""
                            ##  --------- COMMENT ---------
           /\*              ##  Start of /* ... */ comment
           [^*]*\*+         ##  Non-* followed by 1-or-more *'s
           (                ##
             [^/*][^*]*\*+  ##
           )*               ##  0-or-more things which don't start with /
                            ##    but do end with '*'
           /                ##  End of /* ... */ comment
         |                  ##  -OR-  various things which aren't comments:
           (                ##
                            ##  ------ " ... " STRING ------
             "              ##  Start of " ... " string
             (              ##
               \\.          ##  Escaped char
             |              ##  -OR-
               [^"\\]       ##  Non "\ characters
             )*             ##
             "              ##  End of " ... " string
           |                ##  -OR-
                            ##
                            ##  ------ ' ... ' STRING ------
             '              ##  Start of ' ... ' string
             (              ##
               \\.          ##  Escaped char
             |              ##  -OR-
               [^'\\]       ##  Non '\ characters
             )*             ##
             '              ##  End of ' ... ' string
           |                ##  -OR-
                            ##
                            ##  ------ ANYTHING ELSE -------
             .              ##  Any other char
             [^/"'\\]*      ##  Chars which don't start a comment, string
           )                ##    or escape
    """
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
    # Group 2 holds the non-comment text; it is empty when a comment matched.
    noncomments = [m.group(2) for m in regex.finditer(text) if m.group(2)]
    return "".join(noncomments)


if __name__ == '__main__':
    filename = sys.argv[1]
    # Read the whole source file, strip comments, write <filename>.nocomments
    with open(filename) as fh:
        code_w_comments = fh.read()
    code_wo_comments = remove_comments(code_w_comments)
    with open(filename + ".nocomments", "w") as fh:
        fh.write(code_wo_comments)
Example:
To test the script, I created a test file called
testfile.c:
/* This is a C-style comment. */ This is not a comment. /* This is another * C-style comment. */ "This is /* also not a comment */"
Run the script:
To use the script, I put the script, remove_comments.py, and my test file, testfile.c, in the same directory and ran the following command:
python remove_comments.py testfile.c
Results:
The script created a new file called
testfile.c.nocomments:
This is not a comment. "This is /* also not a comment */"
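The same test input can also be checked in memory, without writing files. The sketch below is a variant, not the script above: it uses re.sub with a callback instead of finditer and join, and a shortened pattern that only names comments and string literals. TOKEN and strip_comments are hypothetical names chosen for this example.

```python
import re

# Match either a comment or a string literal. String literals are
# captured in group 1 so the callback can put them back unchanged.
TOKEN = re.compile(
    r"""
      /\*[^*]*\*+(?:[^/*][^*]*\*+)*/     # a /* ... */ comment
    | ( "(?:\\.|[^"\\])*"                # double-quoted string, group 1
      | '(?:\\.|[^'\\])*'                # single-quoted string, group 1
      )
    """,
    re.VERBOSE,
)


def strip_comments(text):
    """Drop /* ... */ comments while leaving string literals intact."""
    # group(1) is None when a comment matched, so it is replaced by "".
    return TOKEN.sub(lambda m: m.group(1) or "", text)


sample = (
    '/* This is a C-style comment. */ This is not a comment. '
    '/* This is another * C-style comment. */ '
    '"This is /* also not a comment */"'
)
print(strip_comments(sample))
```

Text that is neither a comment nor a string is never matched, so re.sub copies it through as-is. That is why this variant does not need the "anything else" branch.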
---------------
Minor note on Perl to Python migration:
I modified the original regular expression comments a little bit. In particular, I had to put at least one character after the backslash in the ## Non "\ and ## Non '\ comment lines because, in Python, the backslash was escaping the following newline character, so the closing parenthesis on the following line was being treated as part of the comment by the regular expression engine. This is the error I got before the fix:
$ python remove_comments.py
Traceback (most recent call last):
  File "remove_comments.py", line 39, in <module>
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
  File "C:\Programs\Python25\lib\re.py", line 180, in compile
    return _compile(pattern, flags)
  File "C:\Programs\Python25\lib\re.py", line 233, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
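The problem described in the note can be reproduced with a much smaller pattern. This is a minimal sketch: risky, safe, and compiles are names I made up for it, and it assumes the interpreter's regex parser still pairs a backslash with the character after it, even inside a verbose-mode comment. Where that holds, the first pattern should fail to compile because its closing parenthesis is swallowed by the comment on the line above.

```python
import re

# The comment on the second line ends with a bare backslash, so the
# backslash pairs up with the newline that follows it.
risky = r"""
(        # open group
a        # comment ending in a backslash \
)        # close group
"""

# Same pattern, but with at least one character after the backslash.
safe = r"""
(        # open group
a        # comment with a backslash \ then more text
)        # close group
"""


def compiles(pattern):
    """Return True if pattern compiles in verbose mode, else False."""
    try:
        re.compile(pattern, re.VERBOSE)
        return True
    except re.error:
        return False


print("risky compiles:", compiles(risky))
print("safe compiles:", compiles(safe))
```

Keeping some character after any backslash that ends a verbose-mode comment line, as in the modified comments above, avoids the issue.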
About
I'm Eliot and this is my notepad for programming topics such as Python, Django, Ubuntu, Emacs, etc.
#1 Ari commented on 2009-09-29:
Thanks for your very useful post, which I found through Google.
It is exactly what I needed! I need to write a script for work that performs some checks in C++ code, but obviously the comments should be disregarded.