Archiving and compressing

In this video we will be working with a ZIP file that you can download and unpack with

$ wget http://bit.ly/bashfile -O bfiles.zip
$ unzip bfiles.zip

Unlike the SSD or hard drive in your laptop, the filesystem on an HPC cluster is designed to store large files, ideally with parallel I/O. As a result, it handles a large number of small I/O requests (reads or writes) very poorly, sometimes bringing the entire I/O system to a halt. For this reason, we strongly recommend that you do not store many thousands of small files – instead, pack them into a small number of large archives. This is where the archiving tool tar comes in handy.
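For example, to bundle a directory of many small files into a single archive, and later inspect or unpack it, you can use commands along these lines (the directory name results/ is just a placeholder):

$ tar cvf results.tar results/    # pack the directory results/ into one archive file
$ tar tf results.tar              # list the archive's contents without extracting
$ tar xvf results.tar             # extract the files again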

Working with tar and gzip/gunzip (8 min)

Covered topics: tar and g(un)zip.
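As a quick reference for the commands covered in the video (file names here are made up), gzip compresses a single file in place, and tar's -z flag lets you archive and compress in one step:

$ gzip results.txt                  # compress; replaces the file with results.txt.gz
$ gunzip results.txt.gz             # decompress back to results.txt
$ tar czvf results.tar.gz results/  # create a gzip-compressed tar archive in one step
$ tar xzvf results.tar.gz           # extract a gzip-compressed tar archive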

Managing many files with Disk ARchiver (DAR)

tar is by far the most widely used archiving tool on UNIX-like systems. Since it was originally designed for sequential writes and reads on magnetic tapes, it does not index data for random access to its contents. A number of third-party tools can add indexing to tar. However, there is a modern alternative called DAR (short for Disk ARchiver) that has some nice features:

  • each DAR archive includes an index for fast file list/restore,
  • DAR supports full / differential / incremental backup,
  • DAR has built-in compression on a file-by-file basis, which makes archives more resilient against data corruption and lets it skip already compressed files such as video,
  • DAR supports strong encryption,
  • DAR can detect corruption in both headers and saved data and recover with minimal data loss,

and so on. Learning DAR is not part of this course; if you want to know more about working with DAR, please watch our DAR webinar (scroll down to see it).
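If you would like a quick taste before watching the webinar, a minimal DAR workflow might look like the sketch below (the archive basename work and the directory results/ are only placeholders):

$ dar -c work -R results/ -z   # create work.1.dar, compressing each file individually
$ dar -l work                  # list the archive's contents using its built-in index
$ dar -x work                  # restore the files into the current directory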