Archiving and Data Compression

tar: Tape ARchiver

Although we have already seen a use for tar in Chapter 12, Building and Installing Free Software, we haven't explained how it works; this section does so. Just like find, tar is a long-standing UNIX utility, so its syntax is a bit special. The syntax is:

tar [options] [files...]

Here is a list of some options. All of them have an equivalent long option, but the long forms are not listed here: refer to the manual page for them.

Note

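With tar, the initial dash (-) of short options may be omitted when the option letters are grouped together as the first argument, which is the form used in this section. The dash is still required for short options that follow a long option.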

  • c: this option is used in order to create new archives

  • x: this option is used in order to extract files from an existing archive

  • t: list files from an existing archive

  • v: list the files which are added to an archive or extracted from an archive, or, in conjunction with the t option (see above), it outputs a long listing of files instead of a short one

  • f <file>: create archive with name <file>, extract from archive <file> or list files from archive <file>. If this parameter is omitted, tar falls back on a default that depends on how it was built; traditionally this is a tape device such as /dev/rmt0, the special file associated with a streamer. If the file parameter is - (a dash), the archive is read from the standard input or written to the standard output (depending on whether you extract from an archive or create one)

  • z: tells tar that the archive to create should be compressed with gzip, or that the archive to extract from is compressed with gzip

  • j: same as z, but the program used for compression is bzip2

  • p: when extracting files from an archive, restore the permissions recorded in the archive exactly, instead of filtering them through your umask. When you extract as root, ownership is restored as well. Very useful for file system dumps.

  • r: append the list of files given on the command line to an existing archive. Note that the archive to which you want to append files should not be compressed!

  • A: append archives given on the command line to the one submitted with the f option. Similar to r, the archives should not be compressed in order for this to work.
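The r option and the special file name - are easier to grasp once you have seen them run. The following is only a minimal sketch: it works in a scratch directory created with mktemp, and every file and directory name in it (docs, a.txt, b.txt, copy) is made up for the illustration.

```shell
#!/bin/sh
# Sketch: append to an uncompressed archive (r) and pass an archive
# through a pipe (f -). All names below are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

mkdir docs
echo "first"  > docs/a.txt
echo "second" > docs/b.txt

# c + f: create an uncompressed archive holding only a.txt
tar cf docs.tar docs/a.txt

# r + f: append b.txt to the existing, uncompressed archive
tar rf docs.tar docs/b.txt

# t + f: list the members; both files should now appear
listing=$(tar tf docs.tar)
echo "$listing"

# f -: the first tar writes the archive to standard output, the second
# one reads it from standard input and unpacks a copy under copy/
mkdir copy
tar cf - docs | (cd copy && tar xf -)
ls copy/docs
```

The last command is a common way to copy a directory tree without ever writing the archive to disk.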

There are many, many, many other options, so you may want to refer to the tar(1) manual page for the entire list. See, for example, the d option. Let's proceed with an example. Say you want to create an archive of all images in /shared/images, compressed with bzip2, named images.tar.bz2 and located in your home directory. You will then type:

 #
 # Note: you must be in the directory from which
 #   you want to archive files!
 #
$ cd /shared
$ tar cjf ~/images.tar.bz2 images/

As you can see, we used three options here: c told tar that we wanted to create an archive, j that the archive should be compressed with bzip2, and f ~/images.tar.bz2 that it should be created in our home directory under the name images.tar.bz2. We may now want to check that the archive is valid. We can do this by listing its files:

 #
 # Get back to our home directory
 #
$ cd
$ tar tjvf images.tar.bz2

Here, we told tar to list (t) files from archive images.tar.bz2 (f images.tar.bz2), told it that this archive is compressed with bzip2 (j), and asked for a long listing (v). Now, say you have erased the images directory. Fortunately, your archive is intact, and you now want to extract it back to its original place, in /shared. But as you don't want to break your find command for new images, you want the files to come back with the attributes recorded in the archive, hence the p option:

 #
 # cd to the directory where you want to extract
 #
$ cd /shared
$ tar jxpf ~/images.tar.bz2

And here you are!
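If you would rather try the whole cycle without touching /shared, the sketch below replays it in a scratch directory. It is an illustration only: the directory layout under the temporary directory and the file mini.jpg are invented, and the permission behaviour noted in the comments is that of GNU tar run by an ordinary user.

```shell
#!/bin/sh
# Sketch: create, list and extract a bzip2-compressed archive, then
# compare extraction with and without p. All paths are hypothetical.
set -eu
umask 022
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/shared/images/cars" "$work/home" "$work/plain"
echo "not really a picture" > "$work/shared/images/cars/mini.jpg"
chmod 666 "$work/shared/images/cars/mini.jpg"

# Create the archive from the parent directory, then list it
cd "$work/shared"
tar cjf "$work/home/images.tar.bz2" images/
tar tjvf "$work/home/images.tar.bz2"

# Simulate the accident, then restore with p
rm -rf images
tar jxpf "$work/home/images.tar.bz2"
mode_p=$(ls -l images/cars/mini.jpg | cut -c1-10)

# Extract a second copy elsewhere without p; for an ordinary user the
# umask is then applied to the recorded permissions
cd "$work/plain"
tar jxf "$work/home/images.tar.bz2"
mode_plain=$(ls -l images/cars/mini.jpg | cut -c1-10)

echo "with p:    $mode_p"
echo "without p: $mode_plain"
```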

Now, let's say you want to extract the directory images/cars from the archive, and nothing else. Then you can type this:

$ tar jxf ~/images.tar.bz2 images/cars
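The same partial extraction can be tried on a throwaway archive. This is a sketch under assumed names: the boats directory and both .jpg files exist only for the sake of the example.

```shell
#!/bin/sh
# Sketch: extract a single directory from an archive.
# The archive contents below are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

mkdir -p images/cars images/boats
echo "car"  > images/cars/mini.jpg
echo "boat" > images/boats/dinghy.jpg

tar cjf images.tar.bz2 images/
rm -rf images

# Name the member exactly as it appears in the output of "tar tjf"
tar jxf images.tar.bz2 images/cars

# Only images/cars and its contents come back
find images
```

The member name has to match the path stored in the archive, so list the archive first if you are not sure how the path is spelled.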

If you try to back up special files, tar will take them as what they are, special files, and will not dump their contents. So yes, you can safely put /dev/mem in an archive. It also deals correctly with links, so do not worry about this either. For symbolic links, also look at the h option in the manpage.
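The difference the h option makes for symbolic links can be seen with a small experiment. Again this is only a sketch; the names data, target.txt and link.txt are invented for it.

```shell
#!/bin/sh
# Sketch: archive a symbolic link as a link (default) or as the file
# it points to (h). All names below are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

mkdir data
echo "real content" > data/target.txt
ln -s target.txt data/link.txt

tar cf  links.tar data    # default: the link itself is stored
tar chf deref.tar data    # h: the link is followed, the content is stored

mkdir as_link as_file
(cd as_link && tar xf ../links.tar)
(cd as_file && tar xf ../deref.tar)

ls -l as_link/data as_file/data
```

In as_link, link.txt is still a symbolic link; in as_file, it has become an ordinary file with the same content as target.txt.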

bzip2 and gzip: Data Compression Programs

You can see that we have already mentioned these two programs when dealing with tar. Unlike WinZip under Windows, archiving and compressing are done by separate utilities: tar for archiving, and the two programs which we will now introduce, bzip2 and gzip, for compressing data. You may also use a different compression tool: programs like zip, arj or rar also exist for GNU/Linux, but they are rarely used.

bzip2 was originally written as a replacement for gzip. Its compression ratios are generally better, but on the other hand, it requires more RAM and more time while working. gzip remains in use because it is more widespread than bzip2.

Both commands have a similar syntax:

gzip [options] [file(s)]

If no filename is given, both gzip and bzip2 will wait for data from the standard input and send the result to the standard output. Therefore, you can use both programs in pipes. Both programs also have a set of common options:
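As a quick check of the pipe behaviour just described, the sketch below sends a line of text through each compressor and straight back through the matching decompressor. The sample text and the variable names are arbitrary.

```shell
#!/bin/sh
# Sketch: with no file name, gzip and bzip2 read standard input and
# write standard output, so they can sit in the middle of a pipe.
set -eu
text="hello, compressed world"

roundtrip_gz=$(printf '%s\n' "$text" | gzip  | gzip -d)
roundtrip_bz=$(printf '%s\n' "$text" | bzip2 | bzip2 -d)

echo "gzip:  $roundtrip_gz"
echo "bzip2: $roundtrip_bz"
```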

  • -1, ..., -9: set the compression level. The higher the number, the better the compression, but better also means slower.

  • -d: uncompress file(s). This is equivalent to using gunzip or bunzip2.

  • -c: dump the result of compression/decompression of files given as parameters to the standard output.
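To get a feel for these options, you can compress the same file at two different levels and compare the sizes. This is a sketch only: sample.txt is a generated, very repetitive file, so the figures it prints say little about real data.

```shell
#!/bin/sh
# Sketch: compare two compression levels and decompress with -d.
# sample.txt and the other file names are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Build a repetitive sample file
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i of a fairly repetitive sample file"
    i=$((i + 1))
done > sample.txt

# -c sends the result to standard output, so sample.txt is left alone
gzip -1 -c sample.txt > fast.gz
gzip -9 -c sample.txt > best.gz

# Sizes in bytes: original, fastest level, best level
wc -c sample.txt fast.gz best.gz

# -d -c decompresses to standard output; the result matches the original
gzip -d -c best.gz > restored.txt
cmp sample.txt restored.txt && echo "restored.txt is identical to sample.txt"
```

The same commands work with bzip2 in place of gzip, since both programs share these options.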

Warning

By default, both gzip and bzip2 erase the original file(s) once they have compressed (or uncompressed) them, unless you use the -c option. You can avoid this in bzip2 by using the -k option. gzip accepts -k as well from version 1.6 onwards; older versions have no equivalent option.
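The behaviour described in this warning is easy to observe with bzip2. The sketch below uses two invented files, notes.txt and other.txt, in a scratch directory.

```shell
#!/bin/sh
# Sketch: bzip2 replaces its input by default and keeps it with -k.
# The file names are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

echo "keep me" > notes.txt
cp notes.txt other.txt

bzip2 notes.txt       # notes.txt is replaced by notes.txt.bz2
bzip2 -k other.txt    # other.txt stays, other.txt.bz2 is added

# -d -c prints the content and leaves the compressed file in place
bzip2 -dc notes.txt.bz2

ls
```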

Now some examples. Let's say you want to compress all files ending with .txt in the current directory using bzip2. You would type:

$ bzip2 -9 *.txt

Let's say you want to share your image archives with someone who doesn't have bzip2, only gzip. You don't need to uncompress the archive to disk and re-compress it: you can just uncompress to the standard output, use a pipe, compress from standard input and redirect the output to the new archive:

$ bzip2 -dc images.tar.bz2 | gzip -9 > images.tar.gz

You could have typed bzcat instead of bzip2 -dc. There is an equivalent for gzip but its name is zcat, not gzcat. You also have bzless for bzip2 files and zless for gzip files if you want to view compressed files right away without having to uncompress them first. As an exercise, try and find the command you would have to type in order to view compressed files without uncompressing them, and without using bzless or zless.
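If you want to convince yourself that converting an archive from bzip2 to gzip through a pipe loses nothing, you can compare checksums of the two uncompressed streams. This is a sketch on a throwaway archive; the images directory and one.jpg are made up for the occasion.

```shell
#!/bin/sh
# Sketch: recompress a .tar.bz2 as .tar.gz through a pipe and check
# that both contain the same tar stream. Names are hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

mkdir images
echo "fake image data" > images/one.jpg
tar cjf images.tar.bz2 images/

# The conversion itself: no uncompressed copy is written to disk
bzip2 -dc images.tar.bz2 | gzip -9 > images.tar.gz

# Checksum of each uncompressed stream; the two lines should match
sum_bz=$(bzip2 -dc images.tar.bz2 | cksum)
sum_gz=$(gzip -dc images.tar.gz | cksum)
echo "from bzip2: $sum_bz"
echo "from gzip:  $sum_gz"

# The new archive can be listed with the z option
tar tzf images.tar.gz
```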