Improved GNU gzip Data Compression
Date: 5th March 1999
Author: Roger Sayle
Version: 0.4
Introduction
This document describes several alternative design decisions available when implementing gzip-compatible data compression. The GNU gzip program, developed by Jean-Loup Gailly, is a de facto and Internet standard for data compression. The original program was developed to provide an efficient compression utility for the public domain, free from the many compression patents. However, in applications with different performance constraints, the design decisions and tradeoffs made in gzip need to be re-evaluated. For example, in on-the-fly file system compression, performance and efficient random access decompression are more important than compression quality. Similarly, when archiving large static databases or compressing data for redistribution, it is reasonable to spend more time and achieve better compression.
GZIP File Format
The exact specification of the gzip file format is given in Internet RFC 1952. A gzipped file consists of a short header, an arbitrary number of compressed "blocks" and an eight-byte trailer. The trailer consists of the original uncompressed file size and a cyclic redundancy check of the uncompressed data. The header is typically 10 bytes long and starts with a two-byte magic number followed by a byte selecting the "deflate" compression method. Amongst the fields in the gzip header are the original uncompressed file's modification time (or zero if the data was compressed from stdin), an indication of whether the original file was ASCII or binary, and the source operating system. If the appropriate flag bit is set in the header, it is immediately followed by a zero-terminated string giving the original uncompressed filename.
Immediately after the header are an arbitrary number of "deflate" blocks. The "deflate" compressed file format is described in detail in Internet RFC 1951. Each deflate block consists of a variable number of bits. Consecutive blocks are packed into bytes and a block need not start on a byte boundary. Each block begins with a three-bit header. The first bit indicates whether this block is the last in the compressed file, and the next two bits encode the block "type". Three different block types are currently used in deflate: stored blocks, static blocks and dynamic blocks.
Stored Blocks
The easiest block type to understand is the "stored block", which contains uncompressed data. After the two-bit stored block code is read, the decompressor skips to the next byte boundary. The following four bytes encode the block length in bytes and the one's complement of the block length. Stored blocks are allowed to contain up to 65535 bytes. After this block header, the original uncompressed data is stored unmodified. Support for this block type allows gzip to store incompressible data at an overhead of 5 bytes for every 65535 bytes, plus the size of the gzip header.
Compressed Blocks
The remaining two block types, "static blocks" and "dynamic blocks", are used to encode compressed data. The compression scheme is based upon Lempel and Ziv's substitutional compression. A substitutional compressor works by replacing a block of text with a reference to an earlier occurrence. If the reference is smaller than the text block, the resulting stream is smaller. A frequently encountered subset of substitutional compressors uses finite windows. In the sliding window scheme, each reference can only refer to a string up to some predetermined maximum number of characters earlier. For the gzip compression method, references can be between 3 and 258 characters long and refer to text strings between 1 and 32768 bytes earlier. The repeated text need not be within the same "deflate" block, and may refer to the contents of a different block type.
The coding scheme used by gzip uses a single alphabet of 286 literal/length codes, and a second alphabet of 30 distance codes, where certain length and distance codes also require extra bits to specify the precise length or distance. Literal code values 0 to 255 are used to represent the original bytes in the input stream, value 256 is used to represent end-of-block and values 257 to 285 represent length codes. Values 257 to 264 represent lengths of 3 to 10 bytes respectively. Values 265, 266, 267 and 268 represent lengths of 11 to 12 bytes, 13 to 14 bytes, 15 to 16 bytes and 17 to 18 bytes respectively; to distinguish between the two lengths, each of these codes is always followed by a single extra bit. Values 269 to 272 each represent a four-byte range and are followed by two extra bits, and so on, until value 284, which is followed by five extra bits and encodes lengths between 227 and 257 bytes, and value 285, which uniquely encodes length 258.
After a length code (values 257 to 285), the offset is encoded using a second alphabet of 30 symbols. As with the length encoding, each distance code is also associated with a number of extra bits. Distance codes 0 to 3 represent distances 1 to 4 respectively. Distance codes 4 and 5 represent distances 5-6 and 7-8 respectively and are each followed by a single extra bit, and so on, until distance code 29, which is followed by 13 extra bits and represents distances between 24577 and 32768.
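As an illustration of the scheme just described, the following C sketch (not taken from gzip or gzip2; the function names are purely illustrative) maps a match length and a match distance onto the corresponding code number, extra-bit count and extra-bit value:

    #include <stdio.h>

    /* Map a match length (3..258) to its literal/length code and extra bits. */
    static void length_code(int len, int *code, int *ebits, int *extra)
    {
        if (len == 258) { *code = 285; *ebits = 0; *extra = 0; return; }
        if (len <= 10)  { *code = 257 + (len - 3); *ebits = 0; *extra = 0; return; }
        int base = 11, first = 265, e = 1;
        while (len >= base + (4 << e)) {    /* each group of 4 codes spans 4<<e lengths */
            base += 4 << e;
            first += 4;
            e++;
        }
        *code  = first + ((len - base) >> e);
        *ebits = e;
        *extra = (len - base) & ((1 << e) - 1);
    }

    /* Map a match distance (1..32768) to its distance code and extra bits. */
    static void distance_code(int dist, int *code, int *ebits, int *extra)
    {
        if (dist <= 4) { *code = dist - 1; *ebits = 0; *extra = 0; return; }
        int base = 5, first = 4, e = 1;
        while (dist >= base + (2 << e)) {   /* each pair of codes spans 2<<e distances */
            base += 2 << e;
            first += 2;
            e++;
        }
        *code  = first + ((dist - base) >> e);
        *ebits = e;
        *extra = (dist - base) & ((1 << e) - 1);
    }

    int main(void)
    {
        int c, n, x;
        length_code(18, &c, &n, &x);        /* -> code 268, 1 extra bit, value 1     */
        printf("len 18    : code %d, %d extra bit(s), value %d\n", c, n, x);
        distance_code(24577, &c, &n, &x);   /* -> code 29, 13 extra bits, value 0    */
        printf("dist 24577: code %d, %d extra bit(s), value %d\n", c, n, x);
        return 0;
    }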
The actual encoding of these two alphabets uses Huffman coding, where each symbol's representation is limited to a maximum of 15 bits. By using canonical Huffman encoding, the bit sequences for each code can be specified by just the length of each symbol's representation. Hence encoding the above deflate representation requires 286 literal code lengths between 0 and 15, and 30 distance code lengths between 0 and 15.
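The canonical construction itself is the one given in RFC 1951: count how many codes there are of each length, derive the smallest code value of each length, and then hand out consecutive values in symbol order. A small C sketch of that procedure (the function name is illustrative):

    /* Assign canonical Huffman codes from code lengths, following the
     * procedure in RFC 1951 section 3.2.2.  len[i] is the code length of
     * symbol i (0 = symbol unused); code[i] receives its bit pattern.     */
    void build_codes(const int *len, int n, unsigned *code)
    {
        int      bl_count[16] = {0};
        unsigned next_code[16];
        unsigned c = 0;
        int      bits, i;

        for (i = 0; i < n; i++)
            bl_count[len[i]]++;              /* count codes of each length      */
        bl_count[0] = 0;

        for (bits = 1; bits <= 15; bits++) { /* smallest code of each length    */
            c = (c + bl_count[bits - 1]) << 1;
            next_code[bits] = c;
        }

        for (i = 0; i < n; i++)              /* hand out codes in symbol order  */
            if (len[i] != 0)
                code[i] = next_code[len[i]]++;
    }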
The difference between "static blocks" and "dynamic blocks" is that static blocks use a predefined table of code lengths for each literal and distance symbol, whereas dynamic blocks encode the required symbol lengths at the beginning of each block.
Static Compressed Blocks
In static compressed blocks the Huffman codes for the two alphabets are fixed. The Huffman literal code lengths are eight bits for codes 0-143, nine bits for codes 144-255, seven bits for codes 256-279 and eight bits for codes 280-287. Distance codes 0-29 are represented by fixed-length five-bit codes.
Dynamic Compressed Blocks
In dynamic compressed blocks, the Huffman code lengths for the two alphabets are stored immediately after the block type and before the actual compressed data, first the literal/length codes, then the distance codes. For even greater compactness, these 316 parameters are compressed using both run-length and Huffman encoding. For details of this encoding, refer to Internet RFC 1951. However, it is sufficient to note that encoding the code lengths uses both run-length encoding and length-limited Huffman encoding. This document will discuss the effect of different run-length encoding strategies and length-limited Huffman code selection strategies on the size of this block header. For typical data files, the Huffman coding information requires about 400 bits, or about 50 bytes.
Trivial Implementations
The stored compression method is implemented as "gzip2 –0", which simply packages the original file in maximum-sized stored blocks. On the 34506 byte "1crn.pdb" test file, the compressed output file is 34538 bytes, and on the 3989442 byte "nci95" test file, the compressed output file is 3989771 bytes. This expansion represents the worst-case scenario.
A slightly different implementation of the same idea is "gzip2 –1". This implementation performs no compression (either Lempel-Ziv or Huffman) but stores the original file in a single large static block. For the 34506 byte "1crn.pdb", this does only 3 bytes better at 34535 bytes, and for the 3989442 byte "nci95" it does slightly better at 3989468 bytes. Here the expansion is only 26 bytes, about 300 bytes better than the stored method. However, this takes advantage of the fact that the byte values 144 to 255 occur very rarely or not at all in the source files. Random binary data would not perform so well.
Repeat Detection
The principal source of compression in gzip is the identification of repeated strings for the Lempel-Ziv encoding. Two very different string-matching algorithms are used in the original GNU gzip and the investigated implementation.
The method used by the existing GNU gzip implementation is based upon the string-matching algorithm of Rabin & Karp. This maintains a hash table of all three-character strings in the input file. By default under UNIX, this hash table has 32K entries. The hash function is (c1<<10)^(c2<<5)^c3, which can be calculated incrementally. Collisions are resolved by maintaining an array of previous occurrences. This array is the same length as the input text buffer and acts as the hash table overflow list. Because there are 16 million possible triples in a binary file, the hash table and the overflow chains contain collisions, and therefore memcmp must be used to check that a given entry is indeed identical for some length.
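A minimal sketch of this hashing scheme, with illustrative constants and names rather than gzip's actual source:

    #define HASH_BITS 15                     /* 32K-entry hash table             */
    #define HASH_SIZE (1u << HASH_BITS)
    #define HASH_MASK (HASH_SIZE - 1)
    #define WSIZE     32768u                 /* sliding window size              */

    static unsigned head[HASH_SIZE];         /* most recent position per hash    */
    static unsigned prev[WSIZE];             /* chain of earlier positions       */

    /* Incremental update: after shifting in the third character c3, the hash
     * of the triple (c1,c2,c3) is ((c1<<10)^(c2<<5)^c3) masked to HASH_BITS
     * bits, so only one shift and one XOR are needed per input byte.          */
    static unsigned insert_string(const unsigned char *buf, unsigned pos, unsigned h)
    {
        h = ((h << 5) ^ buf[pos + 2]) & HASH_MASK;
        prev[pos & (WSIZE - 1)] = head[h];   /* link into the collision chain    */
        head[h] = pos;
        return h;
    }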
Searching this hash table for suitable repeats is a time consuming process, and so GNU gzip trims the searching of collision chains at different levels of compression. The first trade-off is the maximum length of the collision chain to search. GNU gzip only inspects the first 4, 8, 32, 16, 32, 128, 256, 1024 and 4096 entries on a collision chain for levels –1 to –9 respectively. In theory, a collision chain can contain 32K entries, and many of them may be collisions that do not even match the first three characters. As will be explained in the section on lazy matches, GNU gzip only searches the first quarter of this collision list limit if it already has a reasonable match. A second trade-off is that searching of a collision chain terminates as soon as a suitably long repeat length is found. These limiting lengths are 8, 16, 32, 16, 32, 128, 128, 258 and 258 respectively.
Two very different implementation strategies were compared in this evaluation. The first approach is based upon bioinformatics algorithms originally intended for identifying large repeats in bacterial genomes. This algorithm maintains two arrays, one a previous-occurrence array (similar to the collision chain used by GNU gzip above) and the other holding the length of that previous repeat. In a first pass, each byte in the input buffer is scanned sequentially and its location stored in an array of 256 positions. If that byte has occurred previously, its earlier location is stored in the previous-occurrence array and the length array is set to one; otherwise the length array is set to zero. This pass is analogous to the Rabin & Karp algorithm described above but with a 256-entry collision-free hash table used for indexing single-character strings. The following passes scan the input buffer from back to front, identifying increasingly longer repeats based upon the shorter repeats found so far. Working from the end of the buffer, the length array is checked to see if it contains a maximum length repeat so far. If it does, then the chain created by previous links with the same length is scanned. At each entry, the next character is compared to the next character of the query/starting position. If a match is found, the length is incremented and the position of the matching repeat is stored. The algorithm terminates as soon as no repeats are extended during an iteration, or the maximum repeat length, 258, is reached. Upon completion, each character has associated with it the length and position of the longest previous match. If that maximum length string appears more than once previously, the position indicates the most recent occurrence.
The algorithm above finds the longest possible match in a near optimal number of character comparisons. These longest matches would provide optimal encoding if the offset were encoded in a fixed size field. However, the Huffman encoding and extra bits used in the deflate method mean that a shorter match may compress better than a longer one if it occurs much closer. To investigate this, one of the dynamic programming algorithms described below needs to find all repeats three characters and longer at each position. To achieve this, a hybrid of the two approaches given above was used. The latter method was iterated exactly three times to produce a collision chain guaranteed to contain entries with matching first three characters. This is then scanned during the compression phase in a manner similar to that used by GNU gzip.
Repeat Selection
Having determined the longest previous repeat for each character in the input block, the next task is to decide which characters to encode as literals and which to encode as runs.
Greedy Algorithm
The simplest solution is to traverse the block starting at the beginning. If the longest previous repeat is less than three characters, emit the appropriate literal and advance to the next character. If the longest previous repeat is three characters or longer, output the length and offset of the run and advance the appropriate number of characters. This step is then repeated until the end of the block is reached. This method is implemented as "gzip2 –2", which places the entire file in a single static block. This optimization level reduces the 3989442 byte "nci95" down to 1424821 bytes, and the 34506 byte "1crn.pdb" down to 12510 bytes.
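A sketch of this greedy pass is shown below; the arrays best_len[] and best_dist[] and the emit_* routines are hypothetical stand-ins for the repeat-detection results and the output stage:

    extern int  best_len[];                    /* longest repeat at each position  */
    extern int  best_dist[];                   /* distance back to that repeat     */
    extern void emit_literal(int c);           /* hypothetical output routines     */
    extern void emit_match(int len, int dist);
    extern void emit_end_of_block(void);

    /* Greedy repeat selection over one block of n bytes. */
    void greedy_select(const unsigned char *buf, int n)
    {
        int i = 0;
        while (i < n) {
            if (best_len[i] >= 3) {                     /* encode a repeat ...     */
                emit_match(best_len[i], best_dist[i]);
                i += best_len[i];                       /* ... and skip its length */
            } else {
                emit_literal(buf[i]);                   /* otherwise a literal     */
                i++;
            }
        }
        emit_end_of_block();
    }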
Lazy Algorithm
An improvement over the above "greedy" algorithm is used by GNU gzip, called the "lazy" algorithm. This proceeds almost identically to the greedy algorithm above, but only outputs a run if the length is greater than two and is at least as long as the longest repeat for the following character. This "lazy" condition allows the encoder to peek at the next character, and output a literal if a better repeat can be started at the next position. This method is implemented as "gzip2 –3" which also generates a single static block. This compresses "nci95" from 3989442 to 1367315 bytes and "1crn.pdb" from 34506 down to 11799 bytes.
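Assuming the same hypothetical arrays and output routines as the greedy sketch above (and a zero sentinel in best_len[n]), the only change is the test inside the loop:

    /* The lazy condition: only commit to a repeat if it is at least as long
     * as the best repeat starting at the next position.                      */
    if (best_len[i] >= 3 && best_len[i] >= best_len[i + 1]) {
        emit_match(best_len[i], best_dist[i]);
        i += best_len[i];
    } else {
        emit_literal(buf[i]);          /* a longer repeat may start at i + 1   */
        i++;
    }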
As mentioned in the previous section on repeat detection, GNU gzip uses heuristics to speed up repeat detection when using the "lazy" algorithm. Because it locates repeats during the "repeat selection" pass over a data block, it is able to use different tradeoffs when a lazy match has already been found. Hence at different compression levels, GNU gzip expends less time probing the hash table and determining the maximum repeat length if there is already a suitably long repeat for the preceding character.
Simple Dynamic Programming
Both the greedy algorithm and the lazy algorithm are heuristics for efficiently achieving good compression by selecting reasonable repeats. However, the optimal selection of repeats may be achieved by using a two-pass dynamic programming algorithm. The first pass works from the back of the block to the front, determining for each character position the optimal number of bits for encoding the rest of the block. At each position, it determines whether it is better to output a literal or a maximum length run. The cost of outputting a literal is determined by adding the code length of the literal to the optimal encoding of the following position. The cost of outputting a repeat is calculated by determining the number of bits to encode the length and offset of the repeat, and adding that to the optimal encoding beginning length characters later. The algorithm then chooses the lower cost option, and stores whether a repeat or a literal was chosen. This continues until the optimal encoding from the beginning of the buffer to the end is discovered. A second pass then traverses the block from the beginning to extract the actual encoding.
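A sketch of the backward pass, using the same hypothetical best_len[]/best_dist[] arrays as before; literal_bits() and match_bits() are hypothetical helpers returning the cost in bits of a literal or of a length/distance pair under the current code lengths:

    extern long cost[];                       /* optimal bits for buf[i..n)        */
    extern int  take[];                       /* 0 = literal, 1 = repeat at i      */
    extern int  best_len[], best_dist[];
    extern long literal_bits(int c);
    extern long match_bits(int len, int dist);

    void optimal_repeats1(const unsigned char *buf, int n)
    {
        cost[n] = 0;                                   /* nothing left to encode   */
        for (int i = n - 1; i >= 0; i--) {
            cost[i] = literal_bits(buf[i]) + cost[i + 1];
            take[i] = 0;
            if (best_len[i] >= 3) {
                long rep = match_bits(best_len[i], best_dist[i])
                         + cost[i + best_len[i]];
                if (rep < cost[i]) {                   /* repeat is cheaper        */
                    cost[i] = rep;
                    take[i] = 1;
                }
            }
        }
        /* The second, forward pass simply follows take[] from position 0,
         * emitting a repeat of best_len[i] or a literal at each step.            */
    }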
This algorithm has been implemented in "gzip2" as the function "OptimalRepeats1" and its performance evaluated. Generating a single static block, this method compresses "nci95" from 3989442 bytes down to 1322354 bytes, and "1crn.pdb" from 34506 bytes down to 11287 bytes.
Because this algorithm takes two passes and examines every character position during the first pass, it is slower than the heuristic methods used above but faster than the two compression methods described next. For this reason, "OptimalRepeats1" is used as the repeat selection method of "gzip2 –7" described later.
Another issue is that this is the first repeat selection algorithm that makes use of the Huffman encoding used to encode literals, lengths and distances. Hence, literals with short encodings may be selected in preference to repeats that require long length and offset encodings. One potential pitfall is that when using dynamic blocks, some literals, lengths and distances may not be allocated a Huffman code. In these cases, the Huffman code length of that symbol appears to be zero bits. This case needs to be detected explicitly and avoided in the dynamic programming algorithm.
Length Optimal Dynamic Programming
The above dynamic programming algorithm determines the optimal repeat selection under the constraint that the only two options are to emit a repeat of maximum length or a single literal character. One improvement on this scheme is to also consider every run length longer than two characters and up to the maximum detected length. This algorithm requires only a minor modification to the one given above, and is implemented in "OptimalRepeats2" and available via the command "gzip2 –4". Level 4 compression again generates a single static block and reduces "nci95" from 3989442 bytes to 1315864 bytes, and "1crn.pdb" from 34506 bytes to 11241 bytes.
Length and Offset Optimal Dynamic Programming
A further refinement upon the "length optimal" algorithm described above is to optimize both the length of the repeat and its location. In the previous algorithm, the location of the repeat is restricted to be the location of the longest previous repeat as determined in the repeat detection pass. The central comparison of the dynamic programming algorithm can be modified to scan backwards from a given character locating previous three-character repeats, and then attempting to extend each of them (considering each length in turn). As described in the section on "repeat detection", this requires very different preprocessing to determine chains of triplets, with the greater fraction of the string matching being performed in the CPU-intensive dynamic programming loop. This method is implemented in "OptimalRepeats3". However, because it is so time consuming, it is only ever used at maximum "gzip2 –9" compression. This algorithm, when used to create a single static block, compresses the 3989442 byte "nci95" down to 1290872 bytes and the 34506 byte "1crn.pdb" down to 11235 bytes.
Buffer Management
The previous sections describe near optimal generation of GNU gzip compatible static blocks, using the default deflate fixed Huffman encoding. The one failing of the implementations described so far is the way they process the input file as fixed-size blocks. The input file is read 64K at a time into a buffer and compressed independently of the blocks before or after it. This leads to two sources of inefficiency. The first is that encoding the start of the second and subsequent blocks can't refer to repeats occurring at the end of the previous block. The second is that repeats are artificially truncated at the end of each block. Because a repeat offset can be up to 32K previously and a repeat length is restricted to 258 bytes, the former problem is more severe than the latter.
The solution is to use a buffer length of at least 96K, allowing for a 64K block and the previous 32K for detecting repeats. This can be implemented either as a circular buffer, or by using "memcpy" to always copy the last 32K to the beginning of the buffer after each block; a sketch of the latter approach is given after the table. The table below shows the effect of this modification on the repeat selection algorithms described above. Before and after data is only presented for the "nci95" test file, as "1crn.pdb" is less than 64K and therefore unaffected.
Algorithm             Before     After
greedy                1424821    1353584
lazy                  1367315    1299830
simple DP             1322354    1254425
length DP             1315864    1247424
length & offset DP    1290872    1220489
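A sketch of the memcpy-based sliding buffer mentioned above (the constants and surrounding I/O are illustrative):

    #include <stdio.h>
    #include <string.h>

    #define WSIZE      32768                /* window kept from the previous block  */
    #define BLOCK_SIZE 65536                /* bytes compressed per block           */

    static unsigned char buf[WSIZE + BLOCK_SIZE];   /* 96K buffer                   */

    /* Slide the tail of the previous block to the front of the buffer, then
     * read the next block after it.  Repeats found while compressing
     * buf[WSIZE .. WSIZE+n) may now reach back into buf[0 .. WSIZE).
     * (The very first block is read at offset WSIZE without the slide.)           */
    size_t next_block(FILE *in)
    {
        memcpy(buf, buf + BLOCK_SIZE, WSIZE);
        return fread(buf + WSIZE, 1, BLOCK_SIZE, in);
    }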
Dynamic Blocks
The previous sections have described near-optimal gzip-compatible compression using a single static block. The next step is to make use of dynamic blocks and customize the Huffman encodings to reflect the literal and distance frequencies.
The first change when using dynamic blocks in deflate compression is that multiple blocks are typically required. For each input block, the frequencies of literals, lengths and offsets are gathered in order to determine the optimal Huffman encoding for that block. The distribution of frequencies sampled in the first block need not reflect the distribution in the second and subsequent blocks. Not only can this result in degraded compression, but it is impossible to encode a character that didn't occur in the first block and therefore wasn't given a Huffman encoding. Having multiple blocks is also a beneficial feature, as it allows the compression stream to adapt to changing symbol frequencies at regular intervals. Each block can be processed independently, analyzed for symbol frequencies and compressed using variants of the algorithms described above. Each block is (currently) the size of the input buffer, 64K. Only the last buffer, detected by reading less than 64K, has the block "bfinal" bit set.
The packing of the input file into blocks with dynamic compression is one of the few areas left for improvement with the final algorithm, as will be described in the "future work" section below. By optimally selecting dynamic block boundaries, even higher levels of data compression should be attainable. Firstly, if a block has very similar symbol frequencies to the previous block, it may be beneficial to continue with the same Huffman encoding rather than emit an end-of-block code and another dynamic block header. An improvement upon this would be to try to determine how far into the input buffer the current Huffman encoding remains suitable. For example, it is possible to use the current Huffman encodings up until the first non-coded symbol is encountered. Finally, determining the continued suitability of a Huffman encoding allows the input buffer to be split into arbitrarily sized blocks. This would allow the compressor to follow rapid changes in symbol frequencies. A possible example of this is UNIX tar files, which may contain long stretches of ASCII data and then switch suddenly to binary executables or compressed data with very different characteristics. The current 64K block size is a tradeoff between the frequency at which Huffman trees are generated and the ability of the compressor to follow distribution changes. For very homogeneous files, the larger the block size the better.
Final Block Detection
One minor implementation enhancement is to speculatively test for end of file after each input block is read. The standard library "eof" function only returns true after a hard end of file has been encountered. This means that eof may return false and the next read still return zero bytes. For files that are a multiple of the input buffer size, this results in a final block encoding zero bytes, rather than setting the final block bit on the previous block. The shortest encoding for such a block is 10 bits (using the 7-bit end-of-block code in a static block). Although this happens very infrequently, adding 10 bits to particular sized files, it can easily be avoided by checking that another byte can be read after each block.
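Inside the block-reading loop, this speculative test can be as simple as the following fragment (illustrative, using standard C stdio calls):

    /* After reading each block, peek one byte ahead; if nothing follows, set
     * BFINAL on this block rather than emitting an extra zero-length block.   */
    size_t n = fread(buf + WSIZE, 1, BLOCK_SIZE, in);
    int c = getc(in);
    int bfinal = (c == EOF);
    if (!bfinal)
        ungetc(c, in);                 /* hand the byte back to the next read  */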
Dynamic Block Header Encoding
Each dynamic block starts with an encoding of the Huffman code lengths used by the literal/length and distance alphabets. The header contains the number of codes in each of these alphabets followed by an array of code lengths between 0 and 15. To compress this information even further, it is itself run-length and Huffman encoded. The development of efficient header encodings parallels the development of efficient data encodings. The run-lengths can be determined by a heuristic "greedy" algorithm or an optimal "dynamic programming" algorithm. An "initial" header encoding can be determined by using a "greedy" algorithm and then Huffman encoding (using any of the methods described below) the symbol frequencies produced by this initial encoding. An "improved" header can be determined by then applying "dynamic programming" with the code lengths determined for the initial header, and deriving Huffman codes from the symbol frequencies used in the dynamic programming solution. An "optimal" header encoding can be found by repeating the dynamic programming and Huffman encoding of the resultant frequencies until the size of the block header stops decreasing. These three implementations will be referred to as the "initial", "improved" and "optimal" header encodings in the following sections.
Huffman Encoding
This section describes compression implementations that vary only in the method used to determine the length-limited Huffman codes for the symbols occurring in an input block.
Balanced Codes
The simplest form of Huffman encoding is not to perform Huffman encoding at all but to use a single code length for all n distinct symbols. The code determination algorithm counts the number of symbols with non-zero frequency and assigns them all the code length m, the lowest integer value such that (1<<m) >= n. Hence for the test file "1crn.pdb", which contains 45 different symbols, each symbol is encoded using 6 bits. The following table lists the compressed file sizes of "1crn.pdb" and "nci95" using balanced Huffman codes for both the header and data encoding. Because there is so little information contained in the dynamic block header, the different header encodings have virtually no effect.
Header Encoding    1crn.pdb    nci95
Original           34506       3989442
Initial            25918       2994216
Improved           25918       2994215
Optimal            25918       2994215
The one minor setback with this implementation is that, although it conforms to the deflate compression method file format specification, GNU gzip's internal integrity checking reports that the files are corrupt because there are unused Huffman codes!
Almost-Balanced Codes
To overcome the problem of gzip reporting unused Huffman codes, the tree can be tweaked such that some of the codes are one bit shorter than the rest. With the "1crn.pdb" example, there are 19 (64-45) unused codes of length 6. A suitable almost-balanced Huffman tree can be generated by giving the 19 most common symbols a 5-bit code and the remaining 26 symbols a 6-bit code. The results of using almost-balanced Huffman encoding for both the header and the data are given in the table below, followed by a sketch of both code-length assignments.
Header Encoding    1crn.pdb    nci95
Original           34506       3989442
Initial            21929       2539589
Improved           21929       2539589
Optimal            21929       2539589
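A sketch of both code-length assignments described above, assuming the n symbols have been indexed in decreasing order of frequency:

    /* Balanced / almost-balanced code lengths for n distinct symbols. */
    void balanced_lengths(int n, int *len)
    {
        int m = 0;
        while ((1 << m) < n)
            m++;                        /* smallest m with (1 << m) >= n            */
        int shorter = (1 << m) - n;     /* number of otherwise-unused length-m codes */
        for (int i = 0; i < n; i++)     /* most frequent symbols get m-1 bits,       */
            len[i] = (i < shorter) ? m - 1 : m;   /* the rest get m bits             */
        /* Kraft sum: shorter*2^-(m-1) + (n-shorter)*2^-m = 1, so no codes are left
         * unused; shorter == 0 gives the purely balanced case.                      */
    }

For the "1crn.pdb" example above, n = 45 gives m = 6 and shorter = 19, reproducing the 19 five-bit and 26 six-bit codes quoted in the text.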
Length-Limited Huffman Encoding with Heuristic Balancing
The next step is to actually Huffman encode the symbol frequencies. A Huffman tree is built in the usual way, selecting the two least common symbols, combining them and replacing them with a pseudo-symbol with their combined frequency. This step is repeated iteratively until all the nodes have been combined into a single root node. The implementation used in gzip2 makes use of a heap to implement an efficient priority queue. Additionally, if two nodes have the same frequency, the node with the smallest depth is selected, where a node's depth is the length of the longest path from that node down to a leaf. This heuristic attempts to generate shorter codes for each symbol whilst providing identical compression.
This would be all that is required except for a restriction on the maximum length of a Huffman code imposed by the deflate format. Literal/length and distance codes must be a maximum of 15 bits long, and the code length codes used in the block headers must be a maximum of 7 bits long. An analysis of Huffman tree construction reveals that a 16-bit Huffman code can be formed after only about two thousand characters. This is well below the 64K block size limit, and hence we need to rebalance the tree to limit its depth to 15 bits.
The first approach is to use the heuristic used by GNU gzip. This rebalances the tree by increasing the length of the deepest non-maximum-depth node in the tree to compensate for the truncation of the over-sized codes. Consider as an example the set of frequencies {1,1,3,5,6,11,13}. Building an unrestricted Huffman tree results in the symbol lengths {5,5,4,3,2,2,2}, which has a total message length of 97 bits. Now consider restricting each symbol to be a maximum of 4 bits long. The GNU gzip heuristic results in the symbol lengths {4,4,4,4,2,2,2}, where the fourth most frequent symbol acquires a longer code to compensate for the two least frequent symbols being truncated. The total message length of the heuristically balanced tree is 100 bits, an increase of 3 bits.
The heuristic length-limited Huffman encoding algorithm was implemented in gzip2 to compare its performance. As this is the Huffman encoding technique used by GNU gzip, it serves as a benchmark. The table below gives the compressed file sizes for "1crn.pdb" and "nci95" using the heuristic rebalancing for both the data and header encoding.
Header Encoding    1crn.pdb    nci95
Original           34506       3989442
Initial            15306       1817564
Improved           15306       1817536
Optimal            15306       1817514
Optimal Length-Limited Huffman Encoding
As implied by the term "heuristic" in the previous section, the GNU gzip Huffman tree rebalancing does not guarantee optimal length-limited Huffman code assignment. Several recent papers describe the "package-merge" algorithm, which solves the length-limited Huffman code assignment problem optimally. Recalling the example given above, the package-merge algorithm generates the set of code lengths {4,4,3,3,3,2,2}, which meets the maximum code length criterion but encodes the entire message in only 98 bits: only 1 bit worse than the optimal unrestricted encoding and 2 bits better than the heuristic method.
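A sketch of the package-merge construction is given below. It is illustrative rather than gzip2's actual source, and assumes the non-zero frequencies are already sorted into ascending order, that n >= 2, and that the length limit is achievable (2^maxlen >= n). Each item in a list is either a leaf or a package of two items from the previous list; after maxlen lists have been built, the 2n-2 cheapest items of the final list determine the code lengths.

    #define MAXSYM 288                  /* enough for the literal/length alphabet   */
    #define MAXLEN 15                   /* deflate's maximum code length            */

    struct item {
        long w;                         /* weight: sum of frequencies               */
        int  sym;                       /* leaf: symbol index; package: -1          */
        int  l, r;                      /* package: indices of its two halves       */
    };                                  /* in the previous level's list             */

    static struct item lists[MAXLEN][2 * MAXSYM];
    static int         sizes[MAXLEN];

    /* Add one bit to every leaf contained in lists[level][idx]. */
    static void count_leaves(int level, int idx, int *len)
    {
        struct item *it = &lists[level][idx];
        if (it->sym >= 0)
            len[it->sym]++;
        else {
            count_leaves(level - 1, it->l, len);
            count_leaves(level - 1, it->r, len);
        }
    }

    /* freq[0..n) are non-zero frequencies sorted ascending (a real implementation
     * would sort and remap the symbols); on return len[i] is the length-limited
     * code length of the i-th symbol.                                              */
    void package_merge(const long *freq, int n, int maxlen, int *len)
    {
        int level, i, j, k;

        for (i = 0; i < n; i++) {       /* the first list holds just the leaves     */
            lists[0][i].w   = freq[i];
            lists[0][i].sym = i;
        }
        sizes[0] = n;

        for (level = 1; level < maxlen; level++) {
            struct item pack[MAXSYM];
            int npack = sizes[level - 1] / 2;

            for (i = 0; i < npack; i++) {   /* package pairs from the previous list */
                pack[i].w   = lists[level - 1][2 * i].w + lists[level - 1][2 * i + 1].w;
                pack[i].sym = -1;
                pack[i].l   = 2 * i;
                pack[i].r   = 2 * i + 1;
            }

            i = j = k = 0;                  /* merge the packages with the original */
            while (i < n || j < npack) {    /* leaves, keeping the list sorted      */
                if (j >= npack || (i < n && freq[i] <= pack[j].w)) {
                    lists[level][k].w   = freq[i];
                    lists[level][k].sym = i;
                    i++;
                } else {
                    lists[level][k] = pack[j];
                    j++;
                }
                k++;
            }
            sizes[level] = k;
        }

        for (i = 0; i < n; i++)
            len[i] = 0;
        for (i = 0; i < 2 * n - 2; i++)         /* the 2n-2 cheapest items of the   */
            count_leaves(maxlen - 1, i, len);   /* final list give the code lengths */
    }

Applied to the example frequencies {1,1,3,5,6,11,13} with a limit of 4 bits, this procedure reproduces the lengths {4,4,3,3,3,2,2} quoted above.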
The package merge algorithm was also coded up in "gzip2" and the results are given in the table below. Note that because there are no code-length overflows in "1crn.pdb", the compressed file sizes are identical to those given above.
Header Encoding    1crn.pdb    nci95
Original           34506       3989442
Initial            15306       1817561
Improved           15306       1817533
Optimal            15306       1817512
Data compression using the optimal length-limited Huffman coding algorithm and without any repeats is available from the command line as "gzip2 –5". This level of compression uses optimal header encoding and corresponds to the last line in the previous table.
Combining Repeats and Huffman Encoding
The previous sections examined near-optimal gzip-compatible compression, first using a fixed Huffman alphabet and searching for optimal repeats and then selecting an optimal Huffman encoding but without using repeats. In this final section, we evaluate methods for combining the two approaches.
Greedy and Lazy Repeat Selection
The easiest cases to consider are the use of the "greedy" and "lazy" repeat selection algorithms. The reason these are the easy cases is that they make no use of the Huffman code length information and hence will always select the same repeats independent of the symbol code lengths used. The algorithm simply selects the appropriate repeats and literals as described in the "greedy" and "lazy" repeat selection algorithms given above, and then determines the Huffman encoding from the symbol frequencies selected.
When using initial header encoding and optimal length-limited Huffman encoding, the "greedy" repeat selection algorithm compresses the 34506 byte "1crn.pdb" down to 10226 bytes and the 3989442 byte "nci95" down to 1110846 bytes. Under the same conditions, the "lazy" repeat selection algorithm compresses "1crn.pdb" down to 9601 bytes, and "nci95" down to 1072346 bytes. Interestingly, different header encodings have strange interactions with the final file size. Using optimal header encoding actually makes the "lazy" variants worse (+1 byte for "1crn.pdb" and +26 bytes for "nci95"), but slightly improves the "greedy" variant (no change for "1crn.pdb" and –13 bytes for "nci95").
Compression level 6 within "gzip2" implements the lazy repeat selection, optimal length-limited Huffman encoding and initial header encoding. This gzip2 compression-level should perform similarly to GNU gzip at maximum compression.
Two-Pass Dynamic Programming
A very similar approach can be applied to the dynamic programming repeat selection methods. Initially a dynamic programming pass is made over the current block using a set of seed code lengths, such as the static compression code set. Then, using the symbol frequencies discovered by the first dynamic programming pass, a Huffman encoding of the required symbols is generated. Finally, using these symbol code lengths, a second dynamic programming pass is made over the block.
Compression level 7 within "gzip2" currently implements both passes using the "simple DP" algorithm, optimal length-limited Huffman encoding and initial header encoding.
Iterative Dynamic Programming
In the two-pass dynamic programming implementation described above, the optimal path found in the second pass may not be the same as, or even similar to, the one found in the first pass. As a result, the frequencies of symbols found on this path may be sufficiently different to result in different Huffman codes being assigned to them. To take advantage of this, an "iterative dynamic programming" scheme has been developed for maximal data compression. This method starts off identically to the two-pass dynamic programming scheme, but keeps re-encoding the symbol frequencies and re-running the dynamic programming repeat selection until there is no further decrease in size. Typically the symbol encoding and optimal repeats become self-consistent within three or four iterations and the algorithm terminates.
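A sketch of the iteration, with hypothetical helper names for the pieces described in the preceding sections:

    #include <limits.h>

    extern void use_static_code_lengths(void);          /* seed code lengths        */
    extern void build_length_limited_codes(void);       /* codes from latest counts */
    extern long optimal_repeats(const unsigned char *block, int n);  /* DP pass     */

    /* Re-derive the Huffman codes from the latest frequency counts and repeat
     * the dynamic programming pass until the encoded size stops shrinking.         */
    void iterative_dp(const unsigned char *block, int n)
    {
        long best = LONG_MAX;
        use_static_code_lengths();                       /* seed the first pass     */
        for (;;) {
            long bits = optimal_repeats(block, n);       /* also counts the symbol  */
                                                         /* frequencies on its path */
            if (bits >= best)
                break;                                   /* self-consistent: stop   */
            best = bits;
            build_length_limited_codes();                /* codes for the next pass */
        }
        /* The repeats and code lengths from the best iteration are the ones
         * actually written to the output block.                                    */
    }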
Gzip2’s compression level 8 implements iterative dynamic programming, using "length DP" for each pass, optimal length-limited Huffman encoding and optimal header encoding. The maximum compression level, "gzip2 –9" implements iterative dynamic programming, with "length & offset DP" for each pass, optimal length-limited Huffman encoding and optimal header encoding.
Compression Type Selection
One final improvement is to modify the algorithms described above to automatically select between static and dynamic compression for blocks. As the initial pass of the two-pass DP and iterative DP determine the (optimal) encoding length using the static default Huffman lengths, this value can be compared to the best encoding found by these methods. If the optimal dynamic block encoding including the overhead of the Huffman trees is worse than the static method, then the buffer should be output as a static block. This modification is implemented for compression levels 6, 7, 8 and 9 in "gzip2", but is only ever used on poorly compressible data. A further modification still would be to allow the data to be stored in uncompressed blocks as well.
File Format Padding
It is possible to place extra bytes in gzip format files to pad out blocks. Two padding sequences are possible: emit a zero-length stored block, which occupies 35 bits rounded up to the next byte boundary, or emit a 10-bit static block containing just the end-of-block code.
Results
The following table gives a baseline of results for each of the nine compression levels in GNU gzip-1.2.4. For comparison, the compressed file sizes and times produced by the implementation described above at compression levels 6 to 9 are also given.
                   1crn.pdb                nci95
Filename           Size       Time         Size        Time
cat                34506                   3989442
gzip –1            11381                   1498873
gzip –2            11093                   1416278
gzip –3            10867                   1301633
gzip –4            10388                   1271393
gzip –5            9921                    1191353
gzip –6            9711       0.036        1117806     6.704
gzip –7            9662       0.044        1094311     9.247
gzip –8            9503       0.090        1070387     20.520
gzip –9            9484       0.165        1069573     21.742
gzip2 -6           9601       0.149        1072392     27.723
gzip2 -7           8848       0.175        1036736     31.065
gzip2 –8           8716       0.334        1023505     45.867
gzip2 –9           8597       1:00.732     1019606     46:57.62
Calgary Text Corpus
                        gzip –9                  gzip2 –9
Filename    Original    Compressed   Bits/Byte   Compressed   Bits/Byte   Ratio
bib         111261      34900        2.51        33917        2.44        2.81%
book1       768771      312281       3.25        299997       3.12        3.93%
book2       610856      206158       2.70        198100       2.59        3.91%
geo         102400      68414        5.34        65694        5.13        3.98%
news        377109      144400       3.06        140265       2.98        2.86%
obj1        21504       10320        3.84        10240        3.81        0.78%
obj2        246814      81087        2.63        78715        2.55        2.93%
paper1      53161       18543        2.79        17930        2.70        3.31%
paper2      82199       29667        2.89        28467        2.77        4.04%
pic         513216      52381        0.82        49445*       0.77*       5.61%
progc       39611       13261        2.68        12978        2.62        2.13%
progl       71646       16164        1.80        15527        1.73        3.94%
progp       49379       11186        1.81        10824        1.75        3.24%
trans       93695       18862        1.61        18286        1.56        3.05%

* The figures for the pic test are for "gzip2 –8", as maximum compression took too long!
Future Improvements
There are one or two places where it should be possible to improve further still upon the compression ratios given above. The simplest loss is that runs that span a block boundary are currently truncated. This minor problem can be solved by reading the 256 bytes following the block being compressed, and adding sufficient logic to the code to handle them. Another improvement is to avoid re-sending almost identical Huffman code lengths at the beginning of each block, by testing whether the next block can be efficiently encoded using the previous block's Huffman lengths. If it can, continue using them; otherwise emit an end-of-block code and the new set of code lengths. This can lead to much longer deflate blocks at the expense of more slowly following changing symbol frequencies. One can push this further by encoding a block just as far as the first symbol that can't be encoded by the current code lengths, and emitting a new tree for the remaining characters. These are symptoms of the more general problem of deciding when to change alphabet encodings and choosing optimal block lengths. This could be done with some form of branch-and-bound dynamic programming, but may take much longer for very little gain. Finally, an improved optimization approach could actively avoid using symbols that occur very infrequently, preferring more popular codes instead, thereby reducing the pressure on the Huffman coding.
Conclusions
The results given above indicate that it is possible to outperform the existing gzip 1.2.4 implementation in terms of compression quality, though at significantly longer execution times. For some applications, spending this extra time to achieve near optimal compression within the gzip format is a reasonable trade-off.
References