SaltyCrane Blog — Notes on JavaScript and web development

(Not too successfully) trying to use Unix tools instead of Python utility scripts

Inspired by articles such as "Why you should learn just a little Awk" and "Learn one sed command", I am trying to make use of Unix tools (sed, awk, grep, cut, uniq, sort, etc.) instead of writing short Python utility scripts.

Here is a Python script I wrote this week. It greps a file for a given regular expression pattern and prints a unique, sorted list of the matches inside the capturing parentheses.

# grep2.py

import re
import sys


def main():
    patt = sys.argv[1]
    filename = sys.argv[2]

    with open(filename) as f:
        text = f.read()
    matchlist = set(m.group(1) for m in re.finditer(patt, text, re.MULTILINE))
    for m in sorted(matchlist):
        print(m)


if __name__ == '__main__':
    main()

As an example, I used my script to search one of the Django admin template files for all the Django template markup in the file.

$ python grep2.py '({{[^{}]+}}|{%[^{}]+%})' tabular.html 

Output:

{% admin_media_prefix %}
{% blocktrans with inline_admin_formset.opts.verbose_name|title as verbose_name %}
{% cycle "row1" "row2" %}
{% else %}
{% endblocktrans %}
{% endfor %}
{% endif %}
{% endspaceless %}
{% for field in inline_admin_formset.fields %}
{% for field in line %}
{% for fieldset in inline_admin_form %}
{% for inline_admin_form in inline_admin_formset %}
{% for line in fieldset %}
{% if field.is_hidden %}
{% if field.is_readonly %}
{% if field.required %}
{% if forloop.first %}
{% if forloop.last %}
{% if inline_admin_form.form.non_field_errors %}
{% if inline_admin_form.has_auto_field %}
{% if inline_admin_form.original %}
{% if inline_admin_form.original or inline_admin_form.show_url %}
{% if inline_admin_form.show_url %}
{% if inline_admin_formset.formset.can_delete %}
{% if not field.widget.is_hidden %}
{% if not forloop.last %}
{% load i18n adminmedia admin_modify %}
{% spaceless %}
{% trans "Delete?" %}
{% trans "Remove" %}
{% trans "View on site" %}
{{ field.contents }}
{{ field.field }}
{{ field.field.errors.as_ul }}
{{ field.field.name }}
{{ field.label|capfirst }}
{{ forloop.counter0 }}
{{ inline_admin_form.deletion_field.field }}
{{ inline_admin_form.fk_field.field }}
{{ inline_admin_form.form.non_field_errors }}
{{ inline_admin_form.original }}
{{ inline_admin_form.original.id }}
{{ inline_admin_form.original_content_type_id }}
{{ inline_admin_form.pk_field.field }}
{{ inline_admin_formset.formset.management_form }}
{{ inline_admin_formset.formset.non_form_errors }}
{{ inline_admin_formset.formset.prefix }}
{{ inline_admin_formset.opts.verbose_name_plural|capfirst }}
{{ inline_admin_form|cell_count }}
{{ verbose_name }}

Here's my attempt at using Unix tools:

$ sed -rn 's/^.*(\{\{.*\}\}|\{%.*%\}).*$/\1/gp' tabular.html | sort | uniq 

However, the output isn't quite the same. Several of the matches are missing:

{% admin_media_prefix %}
{% else %}
{% endblocktrans %}
{% endfor %}
{% endif %}
{% endspaceless %}
{% for field in inline_admin_formset.fields %}
{% for field in line %}
{% for fieldset in inline_admin_form %}
{% for inline_admin_form in inline_admin_formset %}
{% for line in fieldset %}
{% if field.is_readonly %}
{% if inline_admin_form.form.non_field_errors %}
{% if inline_admin_form.original or inline_admin_form.show_url %}
{% if inline_admin_formset.formset.can_delete %}
{% if not field.widget.is_hidden %}
{% load i18n adminmedia admin_modify %}
{% spaceless %}
{% trans "Remove" %}
{{ field.contents }}
{{ field.field }}
{{ field.field.errors.as_ul }}
{{ field.field.name }}
{{ field.label|capfirst }}
{{ inline_admin_form.fk_field.field }}
{{ inline_admin_form.form.non_field_errors }}
{{ inline_admin_formset.formset.management_form }}
{{ inline_admin_formset.formset.non_form_errors }}
{{ inline_admin_formset.formset.prefix }}
{{ inline_admin_formset.opts.verbose_name_plural|capfirst }}

Unix tools are powerful and concise, but I still need to get a lot more comfortable with their syntax. Please leave a comment if you know how to fix my command.

Comments


#1 Mike commented on :

I find perl to be more powerful than sed/awk, so I typically use that for my one-liners. (I also learned it first...). Here's the perl one-liner (it took me 20 mins to figure it out):

cat ~/tabular2.html | perl -ne 'while (/(\{\{[^\}]+\}\}|\{%[^\{\}]+%\})/g){print "$1\n"}'   | sort | uniq

#2 Eliot commented on :

Mike: Nice work, that worked for me. (I just changed the formatting of your command because your \n got eaten.) As a Python bigot, I tend to think Python can do anything Perl can do. But I concede that one-liners are one place where Perl wins. One problem: as I write increasingly complicated Perl one-liners, I will be tempted to make them into Perl scripts. Then I will be programming in Perl, and that is unacceptable! ;)

I learned Perl first also and learned regular expressions in Perl. This is why it's frustrating when grep, sed, awk, etc. don't use Perl-compatible regular expressions.


#3 hungnv commented on :

If you try to use these Python lines with very_big_text_file, it will be a pain in the ass. But it works fine with sed, awk, grep... as far as I know :-)


#4 Eliot commented on :

hungnv: yes I agree Unix tools are very fast and efficient so that is one of the reasons to choose them over custom scripts.
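
For what it's worth, the memory cost in the script comes from reading the whole file with read(). A minimal sketch of a line-at-a-time variant follows. The function name unique_matches and the sample lines are made up for illustration, and it assumes no template tag spans more than one line, which the original whole-file version does not require.

```python
import re


def unique_matches(lines, patt):
    """Collect the sorted, unique group(1) matches, scanning one line at a time."""
    regex = re.compile(patt)
    found = set()
    for line in lines:
        # finditer yields every match on the line, not just the first one
        found.update(m.group(1) for m in regex.finditer(line))
    return sorted(found)


# Any iterable of lines works. An open file object yields lines lazily,
# so passing one in avoids holding the whole file in memory at once.
sample = [
    "<td>{{ field.field }}</td>{% if forloop.last %}x{% endif %}\n",
    "<p>{{ field.field }}</p>\n",
]
result = unique_matches(sample, r"(\{\{[^{}]+\}\}|\{%[^{}]+%\})")
for item in result:
    print(item)
```

With a real file, the call would look like unique_matches(open(filename), patt). Only the set of distinct matches stays in memory.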


#5 Ari commented on :

Eliot: I've been on a similar path this past week, to parse California's legislative data. I've found Sed quite powerful and fast for (a) text substitutions that are (b) likely to occur on one line (i.e. not across \n).

A couple of notes that might give you incentive to continue with sed: there is a version of Sed, called "Super Sed", that allows you to use Perl (and therefore mostly Python) regex, with the -R switch. Best description of ssed is here, the download is here. After 'make', you can save the binary as /usr/bin/ssed (on a Mac), to avoid clashing with your existing version of Sed.

Where sed broke down for me was in finding patterns across newlines. There are tricks to do this (e.g. using the N command to pull in the next line, branching, etc.) But like you say about Perl scripts, it starts to look like programming, and the syntax is too messy to keep up with that.
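
The newline issue can be illustrated in Python as well. This is only a sketch with a made-up two-line snippet: a negated character class such as [^{}] also matches "\n", so a regex run over the whole text finds a tag that is split across lines, while the same regex run line by line misses it.

```python
import re

# Hypothetical snippet: the blocktrans tag is split across two lines.
text = '<p>{% blocktrans with name|title\n   as verbose_name %}Add{% endblocktrans %}</p>\n'

tag = re.compile(r"\{%[^{}]+%\}")

# Line-oriented matching, roughly what sed or grep see by default.
per_line = [m for line in text.splitlines() for m in tag.findall(line)]

# Whole-text matching: [^{}]+ is free to run across the newline.
whole = tag.findall(text)

print(per_line)
print(whole)
```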

BTW--thanks for your tutorials on installing virtualenv. Best I've seen on the web.


#6 Eliot commented on :

Ari: Thanks a lot for your comment. Though some warn against it, I can't help but be impressed by your resume. Thanks for the tip on ssed. I will give it a try. For Ubuntu users, it can be installed using APT:

$ sudo apt-get install ssed

#7 Ari commented on :

Yes, you do have to consider resumes (including mine) with a grain of salt. Nice to see that ssed has an Ubuntu package; I am often tempted to use apt-get on the Mac, but have been burnt before...


#8 Alvin Mites commented on :

I've found many unix tools to be much easier to work with using a number of bashrc alias commands such as:

alias grpd='grep -irn'
alias grpy='grep -irn --include=*.py'

Course I do the same with python commands (someone might want to copy and paste some of these)

alias py='python'
alias pyins='sudo python setup.py install'
alias loadvirt='source bin/activate'

alias dj='python manage.py'
alias djshell='python manage.py shell'
alias djsync='python manage.py syncdb'
alias djrun='python manage.py runserver'

#9 Enc commented on :

Your regex matches just 1 Django markup PER LINE, but your template HTML file sometimes contains more than 1 per line. The ^.* and .*$ wildcards will match _anything_ to the left and to the right of the 1 captured match, even if it is another perfectly good Django markup (because regexes are greedy). Hence, you're keeping just 1 match per line, which is why you're getting fewer results.

The trick is to match anything on the left and on the right _that isn't the start of another markup_ and to place each match on its own line (using \n). The sed command below prints each Django markup it matches surrounded by empty lines. Then the grep command removes those empty lines, and the sort command sorts the output and removes duplicate lines. The -V parameter makes the sort order come out as expected (as in your output from Python); otherwise sort tries to be smart and sorts according to the first alphabetical letters.

sed -n 's@[^}]*\(\({%[^%]*%}\)\|\({{[^}]*}}\)\)[^{]*@\n\1\n@gp' tabular.html | grep -v '^$' | sort -u -V

Hope this helps!
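
The greedy behavior described in this comment can be reproduced with Python's re module. This is a rough sketch on a made-up template line. The first pattern has the same shape as the original sed substitution, and the second collects every non-overlapping match instead.

```python
import re

# Hypothetical template line with three pieces of Django markup on it.
line = "<td>{{ field.label|capfirst }}</td>{% if field.required %}*{% endif %}"

# Same shape as the sed substitution: greedy .* on both sides of one group.
# The leading ^.* swallows as much as it can, so only the last tag survives.
greedy = re.sub(r"^.*(\{\{.*\}\}|\{%.*%\}).*$", r"\1", line)

# Collecting every non-overlapping match instead of substituting.
every = re.findall(r"\{\{[^{}]+\}\}|\{%[^{}]+%\}", line)

print(greedy)
print(every)
```

The substitution keeps a single tag from the line, while findall returns all three, which is the difference between the sed output and the Python output in the post.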


#10 Eliot commented on :

Enc: Thank you for fixing my command for me! If this were Stack Overflow, I'd give you the check mark. Thanks also for the great explanation of how everything works. I have to say, that sed command is kind of hard on my eyes. I think I'm going to stick with my Python script for this case at least.


#11 gthomas commented on :

If you want to search in HUGE files, use mmap! I.e.

import mmap

fh = open(filename, 'rb')
text = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
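
A fuller sketch of how the mmap idea might plug into the search, assuming Python 3. The temporary file and its contents are made up so the example runs on its own. Because the mapped file is exposed as bytes, the pattern has to be a bytes pattern too, and the matches are decoded afterwards (UTF-8 is an assumption here).

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample file standing in for a huge input.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(b"<td>{{ field.field }}</td>\n{% endif %}{{ field.field }}\n")
    path = tmp.name

# Bytes pattern, since the mmap object exposes the file as bytes.
patt = rb"(\{\{[^{}]+\}\}|\{%[^{}]+%\})"

with open(path, "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as text:
        # re can search the mapped file directly without read()ing it first
        matches = sorted(set(m.group(1).decode("utf-8") for m in re.finditer(patt, text)))

os.remove(path)
print(matches)
```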

#12 Eliot commented on :

gthomas: Thanks, mmap looks pretty interesting.


#13 Cristian Stroparo commented on :

egrep -o '(bla1|bla2)' | sort -u

Alternation needs no escape syntax here, i.e. it is a bare vertical bar, and -o displays only the matching parts, each match on its own line (this is GNU egrep).