Various types of data compression in MapReduce

When the word Hadoop comes to mind, big data follows right behind it. Big data means a very large amount of data, and whenever we work with that much data, storage space becomes scarce. So how can we, or Hadoop as an architecture, handle such a critical issue? Hadoop provides a very useful feature to rescue us here: data compression. We can compress our huge datasets using the different compression libraries that Hadoop supports.

If the benefits of data compression in Hadoop are still not clear, let me show you. First, a compressed dataset needs far less space to store. Second, as we all know, data has to move between the machines of a Hadoop cluster. Compressed data is smaller, so it takes less time to transfer over the network. Now that we understand the purpose of data compression in MapReduce programming, let’s look at the different compression options available in Hadoop MapReduce.

Types of Compression in Map Reduce

There are different types of compression techniques available. Each has different characteristics. Let’s see a list of available compression techniques and understand each one of them in detail.

  • Deflate
  • Gzip
  • Zip
  • Bzip2
  • LZO
  • LZ4
  • Snappy

We can distinguish them by the trade-off between space and time. A faster codec usually compresses less, while a codec that squeezes the data into less space usually needs more time to do it.
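
To make the space/time trade-off concrete, here is a minimal sketch using Python's standard zlib module (an implementation of deflate), not Hadoop's own codec classes. The sample log lines and the helper name compress_at_level are made up for illustration, and the exact sizes and timings will vary from machine to machine.

```python
import time
import zlib


def compress_at_level(data, level):
    """Compress data with zlib at the given level; return (size, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Lossless check: decompressing must give back the original bytes.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed


# Hypothetical sample input: repetitive log-like records.
sample = b"".join(
    b"user=%d action=click page=/item/%d\n" % (i, i * 7 % 1000)
    for i in range(50000)
)

print("original size:", len(sample))
for level in (1, 6, 9):  # 1 = fastest, 9 = smallest output
    size, seconds = compress_at_level(sample, level)
    print("level", level, "size", size, "seconds", round(seconds, 4))
```

On typical text, the higher levels produce smaller output and take longer, which is the same trade-off that separates the codecs described below.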

Deflate

Deflate is a lossless data compression format that combines LZ77 and Huffman coding. It stores data as a series of blocks. Its file extension is .deflate. It holds a single file and is not splittable.

Gzip

Gzip is built on deflate: it stores deflate-compressed data with an added header and footer. Files have a .gz extension and are created with the gzip tool. Gzip holds a single file and is not splittable. It sits in the middle of the space/time trade-off, giving a reasonably good compression ratio at moderate speed, and it uses more CPU for compression and decompression than the faster codecs below.
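
The relationship between gzip and deflate can be checked with a small sketch. It uses Python's standard gzip and zlib modules rather than Hadoop's codec classes, and the sample data is made up. It inspects the gzip magic bytes, the compression-method byte, and the trailer that the gzip format adds around the deflate stream.

```python
import gzip
import struct
import zlib

data = b"hadoop mapreduce compression " * 1000  # hypothetical sample data
gz = gzip.compress(data)

# Every gzip stream starts with the magic bytes 1f 8b.
print(gz[:2].hex())  # 1f8b

# The third byte is the compression method; 8 means deflate.
print(gz[2])  # 8

# The last 8 bytes are the trailer: CRC-32 and length of the original data.
crc, original_size = struct.unpack("<II", gz[-8:])
print(crc == zlib.crc32(data), original_size == len(data))  # True True

# zlib can decode the whole gzip stream when asked to expect a gzip wrapper.
restored = zlib.decompress(gz, wbits=16 + zlib.MAX_WBITS)
print(restored == data)  # True
```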

Zip

Zip uses the zip tool to perform compression and the deflate algorithm internally. It has a .zip extension. An archive can contain multiple files and is splittable at file boundaries.
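
As a rough illustration of why a zip archive can be divided at file boundaries, the sketch below uses Python's standard zipfile module (not a Hadoop API). The member names part-00000.txt and part-00001.txt are hypothetical. Each member is compressed on its own with deflate, so one member can be read without touching the others.

```python
import io
import zipfile

buffer = io.BytesIO()  # in-memory stand-in for a .zip file on disk

# Write two members; each one is compressed independently with deflate.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("part-00000.txt", "alpha\n" * 100)
    archive.writestr("part-00001.txt", "beta\n" * 100)

# Read back only the second member, without decompressing the first.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    second = archive.read("part-00001.txt")

print(names)        # ['part-00000.txt', 'part-00001.txt']
print(second[:10])  # b'beta\nbeta\n'
```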

Bzip2

Bzip2 is fairly slow compared with the other codecs, but it achieves a higher compression ratio. It uses the bzip2 algorithm and the bzip2 tool. It can’t contain multiple files, but it is splittable. Both compression and decompression take more time.
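
Bzip2 is splittable because its output is a sequence of independent blocks, each introduced by a fixed 48-bit marker that a reader can scan for. The sketch below uses Python's standard bz2 module (not Hadoop's BZip2Codec) and made-up sample data to show the stream header and the marker of the first block. Later block markers are not necessarily byte-aligned, so this sketch only inspects the first one.

```python
import bz2

# 48-bit marker that starts every bzip2 block (the digits of pi in hex).
BLOCK_MAGIC = bytes.fromhex("314159265359")

data = b"big data " * 10000  # hypothetical sample data
compressed = bz2.compress(data)  # default compression level is 9

# Stream header: "BZh" followed by the level digit.
print(compressed[:4])  # b'BZh9'

# The first block marker follows the 4-byte stream header directly.
print(compressed[4:10] == BLOCK_MAGIC)  # True

# The compression is lossless.
print(bz2.decompress(compressed) == data)  # True
```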

LZO

LZO is another compression technique, optimized for speed. It uses the LZO algorithm internally and the lzop tool to perform compression. It is fast, at the cost of a lower compression ratio. It cannot contain multiple files and is not splittable on its own, although an LZO file can be made splittable by building an index for it in a preprocessing step.

LZ4

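LZ4 is a compression technique that, like LZO, is optimized for speed: it compresses and decompresses very quickly but achieves a lower compression ratio than gzip or bzip2. With Hadoop’s built-in LZ4 codec, a plain LZ4-compressed file is not splittable, so for large inputs it is best used inside a splittable container format such as a sequence file.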

Snappy

Snappy aims for very high speed with a reasonable amount of compression rather than maximum compression. It is widely used in industry. A Snappy-compressed file is not splittable by itself, so it is generally used inside container formats such as Avro and sequence files, which remain splittable.

When selecting the technique that best fits our use case, splittability matters most for large files. If a large file is compressed with a format that is not splittable, a single map task has to process the whole file, so we should avoid such formats for large inputs.
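
To see why a non-splittable format forces one reader to start from the beginning of the file, here is a minimal sketch with Python's standard gzip and zlib modules (not Hadoop itself). The helper can_decompress_from and the sample records are hypothetical. It tries to decode a gzip stream from the start and then from the middle, which is roughly what a second map task would have to do.

```python
import gzip
import zlib


def can_decompress_from(compressed, offset):
    """Return True if the gzip stream can be decoded starting at offset."""
    try:
        zlib.decompress(compressed[offset:], wbits=16 + zlib.MAX_WBITS)
        return True
    except zlib.error:
        return False


# Hypothetical input: 100,000 small records in one gzip file.
records = b"".join(b"record-%06d\n" % i for i in range(100000))
compressed = gzip.compress(records)

# Reading from the beginning works.
print(can_decompress_from(compressed, 0))  # True

# Starting in the middle fails: there is no header or block marker to sync on.
print(can_decompress_from(compressed, len(compressed) // 2))  # False
```

An uncompressed text file, or a splittable format, lets each map task begin at its own offset instead.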

Conclusion

We use Hadoop MapReduce, and now that we understand the different compression techniques and the benefits of each, we can decide when to use compression and which technique best matches which scenario. I hope this helps you make the right decision when such a scenario comes up. Also, make a mental note that it is not necessary to
