About the Author

Douglas Eadline, PhD, is both a practitioner and a chronicler of the Linux Cluster HPC revolution. He has worked with parallel computers since 1988 and is a co-author of the original Beowulf How To document. Prior to starting and editing the popular http://clustermonkey.net web site in 2005, he served as Editor-in-chief for ClusterWorld Magazine. He is currently Senior HPC Editor for Linux Magazine and a consultant to the HPC industry. Doug holds a Ph.D. in Analytical Chemistry from Lehigh University and has been building, deploying, and using Linux HPC clusters since 1995.

A blog about making HPC things (kind of) work

An HPC user found his local file system was filling up with data files. He asked me if he should get more storage. My first response was to ask him how he was using the storage server. He explained that he kept the results of his protein folding runs in the file system. Each directory held one model and represented quite a bit of computer time. He wanted to keep the historical record, but had no need to use the data in his daily research.

I then examined the file server and found that he could save a lot of space by simply archiving and compressing older directories. A typical directory could be compressed to between a third and a quarter of its original size. The compression would take about 10-15 minutes per directory using bzip2. The file server had eight cores and the load on the system was low.

To speed things up, I decided to try pbzip2, a parallel version of bzip2. Perhaps using all 8 cores I could compress the "compression" time! To test pbzip2, I first archived an example directory with the following size and MD5:

  • Uncompressed File Size: 4.2G
  • MD5: 370256432a78c6314f7808eafeb347b4
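The archive-and-checksum step above could be sketched in Python roughly as follows. This is only an illustration, not the commands used in the test: the function names `archive_directory` and `md5_of_file` are hypothetical, and the post does not say which tools produced the archive or the MD5 sum.

```python
import hashlib
import tarfile
from pathlib import Path


def archive_directory(src_dir: str, tar_path: str) -> None:
    """Pack src_dir into a single uncompressed tar archive at tar_path."""
    src = Path(src_dir)
    with tarfile.open(tar_path, "w") as tar:
        # Store paths relative to the directory name, not the full path.
        tar.add(src, arcname=src.name)


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest of the uncompressed archive before compressing gives a reference value to compare against after the archive is restored.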

I then compressed it with the sequential version of bzip2. The results were as follows:

  • Wall clock time: 13 minutes, 34 seconds
  • Compressed size: 1.3G

The compressed data was 3.2 times smaller, returning about 2.9 GB to the file server. Next, I tried using pbzip2. The results were as follows:

  • Wall clock time: 4 minutes, 22 seconds
  • Compressed size: 1.3G
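To give a feel for why a parallel compressor can stand in for the sequential one, here is a minimal Python sketch of block-parallel bzip2 compression. It is not pbzip2's code; it only illustrates the general idea of cutting the input into blocks, compressing each block as an independent bzip2 stream, and concatenating the streams in order. The function name, the 900,000-byte block size, and the use of threads are assumptions made for the example, and it holds everything in memory, which a real tool handling a multi-gigabyte archive would avoid.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2_compress(data: bytes, workers: int = 8,
                          block_size: int = 900_000) -> bytes:
    """Compress data as a sequence of independent bzip2 streams."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        # Empty input still needs one valid (empty) bzip2 stream.
        return bz2.compress(b"")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the streams line up
        # with the original block order when joined.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


if __name__ == "__main__":
    sample = b"protein folding trajectory frame\n" * 200_000
    packed = parallel_bz2_compress(sample)
    # bz2.decompress accepts concatenated streams (Python 3.3 and later).
    assert bz2.decompress(packed) == sample
```

Because each block is a complete bzip2 stream, a standard decompressor can read the concatenated result, which is what makes this kind of tool a drop-in replacement.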

The compressed size was the same as before, but the run was 3.1 times faster! More importantly, I did not have to do anything other than use a different compression program. It was a true parallel plug-and-play solution. As a final check I uncompressed the file and calculated the MD5 sum:

  • Restored File Size: 4.2G
  • MD5: 370256432a78c6314f7808eafeb347b4

A perfect match (as it should be) to the original file. The job of compressing these files just became much faster.
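The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the sizes and wall clock times reported in this post. The parallel efficiency line additionally assumes that pbzip2 actually used all eight cores, which the post does not state explicitly.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds wall clock reading to seconds."""
    return minutes * 60 + seconds


original_gb = 4.2      # uncompressed archive size
compressed_gb = 1.3    # size after bzip2 or pbzip2
cores = 8              # cores on the file server (assumed all in use)

bzip2_time = to_seconds(13, 34)    # 814 seconds
pbzip2_time = to_seconds(4, 22)    # 262 seconds

ratio = original_gb / compressed_gb           # about 3.2
saved_gb = original_gb - compressed_gb        # about 2.9 GB
speedup = bzip2_time / pbzip2_time            # about 3.1
efficiency = speedup / cores                  # about 0.39

print(f"compression ratio: {ratio:.1f}x, space saved: {saved_gb:.1f} GB")
print(f"speedup: {speedup:.1f}x, parallel efficiency: {efficiency:.0%}")
```

A speedup of about 3.1 on eight cores is well short of linear scaling, so some other limit such as disk I/O may be in play, but the time saved per directory is still substantial.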