1. b.93z.org
  2. Notes

How to create .gz reproducibly

Essentially, gzip file consists of “members”, each represented by following diagram:

+---+---+--+---+---+---+---+---+---+--+     +=========================================+
|___|___|__|FLG|     MTIME     |___|OS| ... |...original file name, zero-terminated...|
+---+---+--+---+---+---+---+---+---+--+     +=========================================+

Member parts not relevant to our discussion are not named, shown or described, but, if you want specifics, you can look at GZIP file format specification version 4.3 (RFC 1952) and gzip module implementation.

FLG is a flag byte that may indicate presence of original file name; as done in Python’s gzip module (more precisely, in GzipFile._write_gzip_header):

if fname:
    flags = FNAME
# ...
if fname:
    self.fileobj.write(fname + b'\000')

MTIME is “the most recent modification time of the original file being compressed” and “if the compressed data did not come from a file, MTIME is set to the time at which compression started”. In Python optional mtime argument (POSIX timestamp) is taken and set to self._write_mtime in constructor of GzipFile. Then self._write_mtime is used in _write_gzip_header method:

mtime = self._write_mtime
if mtime is None:
    mtime = time.time()

Finally, OS indicates the type of file system, and in Python’s gzip it is set to 255—“Unknown”:

self.fileobj.write(b'\377')

If you are not sure why 255, b'\377' is 0xff, which is indeed 255:

>>> b'\377'
b'\xff'
>>> 0xff
255

I guess “Unknown” value for OS is hardcoded in gzip library to ensure portability.

So, when you are making a gzip file, there are two result-influencing factors: original filenames and timestamps that are kept in members of file (type of filesystem—OS—is set by Python’s gzip library, so no need to worry about it).

Implementation

Our goal is to build a script (mkgz.py) that creates gzip file containing file pointed to by path passed as a command-line argument.

#!/usr/bin/env python3

import gzip
import shutil
import argparse


def gz(path, *, mtime, filename, dest):
    with open(path, "rb") as src_file, open(dest, "wb") as _dest_file:
        with gzip.GzipFile(
                fileobj=_dest_file, mode="w",
                filename=filename, mtime=mtime) as dest_file:
            shutil.copyfileobj(src_file, dest_file)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("path")
    parser.add_argument("--mtime", type=float, default=None)
    parser.add_argument("--filename", default=None)
    parser.add_argument("dest")
    args = parser.parse_args()
    gz(args.path, mtime=args.mtime, filename=args.filename, dest=args.dest)

So, having tar archive out1.tar, we could use ./mkgz.py out1.tar out1.tar.gz command to create gzip file out1.tar.gz, but it would contain timestamp and filename (command to get such output is file out1.tar.gz):

out1.tar.gz: gzip compressed data, was "out1.tar", last modified: Fri Nov 11 00:00:07 2016, max compression

As you may have noticed, mkgz.py can accept --mtime and --filename command-line arguments: they are passed to gzip.GzipFile constructor. Let’s use these arguments:

$ ./mkgz.py out2.tar out2.tar.gz --mtime 1.23 --filename ''
$ file out2.tar.gz
out2.tar.gz: gzip compressed data, last modified: Thu Jan  1 00:00:01 1970, max compression

As we can see above, timestamp (“last modified”) is set to one we passed as --mtime 1.23. filename is set to empty string.

Let’s make two more gzip files where only filename will vary:

$ ./mkgz.py out3.tar out3.tar.gz --mtime 1.23
$ ./mkgz.py out3.tar out4.tar.gz --mtime 1.23 --filename ''

If you’ll compare outputs of hd out3.tar.gz | sed 2q and hd out4.tar.gz | sed 2q, you’d see that out3.tar.gz indeed contains filename of out3.tar (while out4.tar.gz does not):

00000000  1f 8b 08 08 01 00 00 00  02 ff 6f 75 74 33 2e 74  |..........out3.t|
00000010  61 72 00 ed 9b 41 6e c2  30 10 45 bd f6 29 72 83  |ar...An.0.E..)r.|
00000000  1f 8b 08 00 01 00 00 00  02 ff ed 9b 41 6e c2 30  |............An.0|
00000010  10 45 bd f6 29 72 83 7a  6c 8f 7d 9e 24 a6 2a 52  |.E..)r.zl.}.$.*R|

Now that we looked at tar and gzip separately, we can proceed and combine them.

© 2008–2017 93z.org