How to remove C style comments using Python

Date: 2007-11-28 | Modified: 2009-11-19 | Tags: c_cplusplus, python, regexes | 8 Comments

The Perl FAQ has an entry titled "How do I use a regular expression to strip C style comments from a file?" Since switching to Python, I've adapted the Perl solution. The regular expression was created by Jeffrey Friedl and later modified by Fred Curtis. I'm not certain, but it appears to use the "unrolling the loop" technique described in Chapter 6 of Mastering Regular Expressions.
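For readers unfamiliar with the term, here is a minimal sketch of what unrolling looks like for the comment branch alone. The lazy pattern is a common alternative that is not part of the Perl FAQ solution; it is shown only for comparison. The sample string is made up, and the inner group is written as non-capturing so that findall returns whole matches.

```python
import re

# Common lazy-quantifier form (not from the Perl FAQ; shown for comparison).
lazy = re.compile(r"/\*.*?\*/", re.DOTALL)

# Unrolled form: opening, a run of "normal" chars, then (special, normal run)*
# and the closing slash. Same shape as the comment branch in remove_comments.py.
unrolled = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# Made-up sample with a tricky "** /" sequence inside the second comment.
sample = "a /* one */ b /* two ** / still two **/ c /* x */"

lazy_matches = lazy.findall(sample)
unrolled_matches = unrolled.findall(sample)

print(lazy_matches == unrolled_matches)  # both forms find the same comments
print(unrolled_matches)
```

Both forms find the same three comments in this sample. The unrolled form spells out which characters may appear at each step instead of relying on a lazy quantifier.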

remove_comments.py:

import re
import sys

def remove_comments(text):
    """ remove c-style comments.
        text: blob of text with comments (can include newlines)
        returns: text with comments removed
    """
    pattern = r"""
                            ##  --------- COMMENT ---------
           /\*              ##  Start of /* ... */ comment
           [^*]*\*+         ##  Non-* followed by 1-or-more *'s
           (                ##
             [^/*][^*]*\*+  ##
           )*               ##  0-or-more things which don't start with /
                            ##    but do end with '*'
           /                ##  End of /* ... */ comment
         |                  ##  -OR-  various things which aren't comments:
           (                ## 
                            ##  ------ " ... " STRING ------
             "              ##  Start of " ... " string
             (              ##
               \\.          ##  Escaped char
             |              ##  -OR-
               [^"\\]       ##  Non "\ characters
             )*             ##
             "              ##  End of " ... " string
           |                ##  -OR-
                            ##
                            ##  ------ ' ... ' STRING ------
             '              ##  Start of ' ... ' string
             (              ##
               \\.          ##  Escaped char
             |              ##  -OR-
               [^'\\]       ##  Non '\ characters
             )*             ##
             '              ##  End of ' ... ' string
           |                ##  -OR-
                            ##
                            ##  ------ ANYTHING ELSE -------
             .              ##  Any other char
             [^/"'\\]*      ##  Chars which don't start a comment, string
           )                ##    or escape
    """
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
    noncomments = [m.group(2) for m in regex.finditer(text) if m.group(2)]

    return "".join(noncomments)

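if __name__ == '__main__':
    filename = sys.argv[1]
    # Read the source file, strip the comments, and write the result
    # next to the original as <filename>.nocomments.
    with open(filename) as fh:
        code_w_comments = fh.read()
    code_wo_comments = remove_comments(code_w_comments)
    with open(filename + ".nocomments", "w") as fh:
        fh.write(code_wo_comments)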

Example:
To test the script, I created a test file called testfile.c:

/* This is a C-style comment. */
This is not a comment.
/* This is another
 * C-style comment.
 */
"This is /* also not a comment */"

Run the script:
To use the script, I put the script, remove_comments.py, and my test file, testfile.c, in the same directory and ran the following command:

python remove_comments.py testfile.c

Results:
The script created a new file called testfile.c.nocomments:

This is not a comment.

"This is /* also not a comment */"

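The same result can be checked in memory, without creating files. This is a minimal sketch that assumes the one-line form of the pattern behaves like the verbose one; the names PATTERN, testfile and result are only for illustration.

```python
import re

# One-line form of the same pattern; group 2 holds the non-comment text.
PATTERN = re.compile(
    r"""/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)""",
    re.DOTALL,
)

def remove_comments(text):
    """Join every non-comment chunk matched by group 2."""
    return "".join(m.group(2) for m in PATTERN.finditer(text) if m.group(2))

# Contents of testfile.c from the example above.
testfile = (
    "/* This is a C-style comment. */\n"
    "This is not a comment.\n"
    "/* This is another\n"
    " * C-style comment.\n"
    " */\n"
    '"This is /* also not a comment */"\n'
)

result = remove_comments(testfile)
print(repr(result))
```

Only the comments themselves are removed. The newlines that followed them are kept, which is why the output contains blank lines.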
---------------
Minor note on Perl to Python migration:
I modified the original regular expression comments slightly. In particular, I had to put at least one character after the backslash in the ## Non "\ and ## Non '\ comments. In Python, a backslash at the end of the line escaped the following newline, so the regular expression engine treated the closing parenthesis on the next line as part of the comment. This is the error I got before the fix:

$ python remove_comments.py
Traceback (most recent call last):
  File "remove_comments.py", line 39, in <module>
    regex = re.compile(pattern, re.VERBOSE|re.MULTILINE|re.DOTALL)
  File "C:\Programs\Python25\lib\re.py", line 180, in compile
    return _compile(pattern, flags)
  File "C:\Programs\Python25\lib\re.py", line 233, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
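The effect can be probed with a much smaller pattern. The sketch below uses two hypothetical patterns, not the full regex above, and only reports what re.compile does with each. Whether the first one fails may depend on the Python version, so no particular outcome is claimed for it.

```python
import re

# Hypothetical minimal patterns for illustration only.
# "risky": the verbose-mode comment ends with a backslash right before the newline.
risky = "(\n  a   ## ends with a backslash \\\n)*"
# "safe": at least one character follows the backslash, as in the fix described above.
safe = "(\n  a   ## ends with a backslash \\ padded\n)*"

def try_compile(pattern):
    """Return None if the pattern compiles in verbose mode, else the error text."""
    try:
        re.compile(pattern, re.VERBOSE)
    except re.error as exc:
        return str(exc)
    return None

risky_error = try_compile(risky)
safe_error = try_compile(safe)

print("risky:", risky_error)
print("safe: ", safe_error)
```

The padded comment is the safer form either way, since the backslash can no longer reach the newline.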

Comments

#1 Ari commented on 2009-09-29:

Thanks for your very useful post, which I found through Google.

It is exactly what I needed! I need to write a script for work that performs some checks in C++ code, but obviously the comments should be disregarded.

#2 Suresh Kumar Prajapati commented on 2010-08-19:

Great job, man. I've googled many times but didn't find anything. Can we write the same script in sed?

#3 Matthew Conway commented on 2011-01-18:

Thanks for this very useful code; can I possibly have your kind permission to use this code as part of a build system I'm working on?

#4 Zheng Huang commented on 2012-03-09:

I tried to use this script to remove this comment line.

/* 123 /* 456 */ 789 */

But it won't remove all of it. It left the following.

789 */

Does anyone know how to fix the regular expression?

Thanks in advance

#5 Michael Wild commented on 2012-03-30:

@Zheng: The behaviour is correct. C comments can't be nested; the first */ that appears closes the comment.
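A short sketch of that behaviour, using only the comment branch of the pattern (with a non-capturing inner group) on the line from comment #4. The variable names are illustrative.

```python
import re

# Comment branch of the pattern from the post.
comment = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

source = "/* 123 /* 456 */ 789 */"
m = comment.match(source)

matched = m.group(0)         # the comment ends at the first "*/"
leftover = source[m.end():]  # everything after it is ordinary text

print(repr(matched))   # '/* 123 /* 456 */'
print(repr(leftover))  # ' 789 */'
```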

@Elliot: Any chance of getting this to work with creative abuse of comments? E.g.:

/\
/ This is a valid line comment \
continuing here.

/\
* This is a valid C-style comment *\
/
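One possible approach, offered as a sketch and not as the author's method, is to join backslash-continued lines before stripping comments, as a C compiler does in an early translation phase. The helper join_spliced_lines is hypothetical. This sketch covers only the /* ... */ case, and joining lines changes the line numbering of the output.

```python
import re

# One-line form of the pattern from the post; group 2 holds the non-comment text.
PATTERN = re.compile(
    r"""/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)""",
    re.DOTALL,
)

def remove_comments(text):
    """Join every non-comment chunk matched by group 2."""
    return "".join(m.group(2) for m in PATTERN.finditer(text) if m.group(2))

def join_spliced_lines(text):
    """Hypothetical helper: delete every backslash-newline pair."""
    return text.replace("\\\r\n", "").replace("\\\n", "")

# The second example from the comment above, followed by a line of code.
source = "/\\\n* spliced comment *\\\n/\nint x;\n"

untouched = remove_comments(source)                    # comment is not recognised
stripped = remove_comments(join_spliced_lines(source)) # comment is removed

print(repr(untouched))
print(repr(stripped))
```

Without the joining step the pattern never sees "/*" as two adjacent characters, so the text comes back unchanged.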

#6 Kyle commented on 2012-11-01:

Excellent script, works wonderfully. The only thing that could be improved was if it removed "//" comments

Ex:

// This is a comment in .cpp

Any tips?
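One way this might be done, as a sketch and not a tested extension of the original, is to add a "//" alternative in front of the existing comment branch. The function name remove_all_comments is hypothetical. The inner groups are non-capturing here, so the kept text is in group 1 instead of group 2. A "//" comment continued onto the next line with a trailing backslash is not handled.

```python
import re

# Assumption: a "//[^\n]*" alternative is placed before the /* ... */ branch.
# The string branches still come before the catch-all, so "//" inside a
# string literal is kept.
PATTERN = re.compile(
    r"""//[^\n]*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|.[^/"'\\]*)""",
    re.DOTALL,
)

def remove_all_comments(text):
    """Remove both /* ... */ and // comments; group 1 is the kept text."""
    return "".join(m.group(1) for m in PATTERN.finditer(text) if m.group(1))

source = 'int x = 1; // set x\nchar *s = "// not a comment"; /* c */\n'
result = remove_all_comments(source)
print(repr(result))
```

The newline that ends a "//" comment is kept, so line counts stay the same for single-line comments.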

#7 Ashwin commented on 2013-03-05:

It's a great piece of code and works flawlessly. But I have some problems with C code lines such as 'int var_name_xx_2_onj = 0;'. The code returns only 'int var_name_xx_2_onj ='. I find it strange, but maybe I'm doing something wrong. Can anyone guide me?

#8 yak commented on 2014-03-03:

Good job!

Here's a minified version:

import re

def remove_comments(text):
    """Remove C-style /*comments*/ from a string."""

    p = r'/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|\'(\\.|[^\'\\])*\'|.[^/"\'\\]*)'
    return ''.join(m.group(2) for m in re.finditer(p, text, re.M|re.S) if m.group(2))

SaltyCrane Blog — Notes on JavaScript and web development
