Saltycrane logo

SaltyCrane Blog

Notes on Python, Django, and web development on Ubuntu Linux

    

Using Python's gzip and StringIO to compress data in memory

I needed to gzip some data in memory that would eventually end up saved to disk as a .gz file. I thought, That's easy, just use Python's built in gzip module.

However, I needed to pass the data to pycurl as a file-like object. I didn't want to write the data to disk and then read it again just to pass to pycurl. I thought, That's easy also-- just use Python's cStringIO module.

The solution did end up being simple, but figuring out the solution was a lot harder than I thought. Below is my roundabout process of finding the simple solution.

Here is my setup/test code. I am running Python 2.7.3 on Ubuntu 12.04.

import cStringIO
import gzip


STUFF_TO_GZIP = """Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?"""
FILENAME = 'myfile.json.gz'


def pycurl_simulator(fileobj):

    # Get the file size
    fileobj.seek(0, 2)
    filesize = fileobj.tell()
    fileobj.seek(0, 0)

    # Read the file data
    fout = open(FILENAME, 'wb')
    fout.write(fileobj.read())
    fout.close()

    return filesize

Try 1: seek from the end fails

Here is my first attempt using cStringIO with the gzip module.

def try1_seek_from_end_fails():

    ftemp = cStringIO.StringIO()
    fgzipped = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=ftemp)
    fgzipped.write(STUFF_TO_GZIP)
    filesize = pycurl_simulator(fgzipped)
    print filesize

I got this exception:

        Traceback (most recent call last):
          File "tmp.py", line 232, in <module>
            try1_seek_from_end_fails()
          File "tmp.py", line 83, in try1_seek_from_end_fails
            filesize = pycurl_simulator(fgzipped)
          File "tmp.py", line 25, in pycurl_simulator
            fileobj.seek(0, 2)
          File "/usr/lib/python2.7/gzip.py", line 415, in seek
            raise ValueError('Seek from end not supported')
        ValueError: Seek from end not supported

It turns out the gzip object doesn't support seeking from the end. See this thread on the Python mailing list: http://mail.python.org/pipermail/python-list/2009-January/519398.html

Try 2: data is not compressed

What if we don't seek() from the end and just tell() where we are? (It should be at the end after doing a write(), right?) Unfortunately, this gave me the uncompressed size.

Reading from the GzipFile object also gave me an error saying that I couldn't read from a writable object.

def try2_data_is_not_compressed():

    ftemp = cStringIO.StringIO()
    fgzipped = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=ftemp)
    fgzipped.write(STUFF_TO_GZIP)
    filesize = fgzipped.tell()
    print filesize

Try 5: file much too small

I googled, then looked at the source code for gzip.py. I found that the compressed data was in the StringIO object. So I performed my file operations on it instead of the GzipFile object. Now I was able to write the data out to a file. However, the size of the file was much too small.

def try5_file_much_too_small():

    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    filesize = pycurl_simulator(fgz)
    print filesize

Try 6: unexpected end of file

I saw there was a flush() method in the source code. I added a call to flush(). This time, I got a reasonable file size, however, when trying to gunzip it from the command line, I got the following error:

        gzip: myfile.json.gz: unexpected end of file
def try6_unexpected_end_of_file():

    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.flush()
    filesize = pycurl_simulator(fgz)
    print filesize

Try 7: got it working

I knew that GzipFile worked properly when writing files directly as opposed to reading from the StringIO object. It turns out the difference was that there was code in the close() method of GzipFile which wrote some extra required data. Now stuff was working.

def try7_got_it_working():

    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.flush()

    # Do stuff that GzipFile.close() does
    gzip_obj.fileobj.write(gzip_obj.compress.flush())
    gzip.write32u(gzip_obj.fileobj, gzip_obj.crc)
    gzip.write32u(gzip_obj.fileobj, gzip_obj.size & 0xffffffffL)

    filesize = pycurl_simulator(fgz)
    print filesize

Try 8: (not really) final version

Here's the (not really) final version using a subclass of GzipFile that adds a method to write the extra data at the end. If also overrides close() so that stuff isn't written twice in case you need to use close(). Also, the separate flush() call is not needed.

def try8_not_really_final_version():

    class MemoryGzipFile(gzip.GzipFile):
        """
        A GzipFile subclass designed to be used with in memory file like
        objects, i.e. StringIO objects.
        """

        def write_crc_and_filesize(self):
            """
            Flush and write the CRC and filesize. Normally this is done
            in the close() method. However, for in memory file objects,
            doing this in close() is too late.
            """
            self.fileobj.write(self.compress.flush())
            gzip.write32u(self.fileobj, self.crc)
            # self.size may exceed 2GB, or even 4GB
            gzip.write32u(self.fileobj, self.size & 0xffffffffL)

        def close(self):
            if self.fileobj is None:
                return
            self.fileobj = None
            if self.myfileobj:
                self.myfileobj.close()
                self.myfileobj = None

    fgz = cStringIO.StringIO()
    gzip_obj = MemoryGzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.write_crc_and_filesize()

    filesize = pycurl_simulator(fgz)
    print filesize

Try 9: didn't need to do that (final version)

It turns out I can close the GzipFile object and the StringIO object remains available. So that MemoryGzipFile class above is completely unnecessary. I am dumb. Here is the final iteration:

def try9_didnt_need_to_do_that():

    fgz = cStringIO.StringIO()
    gzip_obj = gzip.GzipFile(
        filename=FILENAME, mode='wb', fileobj=fgz)
    gzip_obj.write(STUFF_TO_GZIP)
    gzip_obj.close()

    filesize = pycurl_simulator(fgz)
    print filesize

References

Here is some googling I did:

3 Comments — feed icon Comments feed for this post


#1 Norman Harman commented on 2013-05-24:

Good article. I appreciate and it is nice to see your exploration / thought process.


#2 pablo trukhanov commented on 2013-05-30:
fgz = cStringIO.StringIO()

with gzip.GzipFile(filename=FILENAME, mode='wb', fileobj=fgz) as gzip_obj:
    gzip_obj.write(STUFF_TO_GZIP)

filesize = pycurl_simulator(fgz)
print filesize

works as expected


#3 Eliot commented on 2013-05-30:

pablo: Using a context manager makes it even simpler-- thanks!

Post a comment

Required
Required, but not displayed
Optional

Format using Markdown. (No HTML.)
  • Code blocks: prefix each line by at least 4 spaces or 1 tab (and a blank line before and after)
  • Code span: surround with backticks
  • Blockquotes: prefix lines to be quoted with >
  • Links: <URL>
  • Links w/ description: [description](URL)
Created with Django | Hosted by Linode