Benchmarking

Canterbury Corpus

The Canterbury Corpus is a set of small sized files that represent a wide range of data formats and is used to compare lossless compression algorithm performance. Applications can compare typical data sets with the corpus files to determine a rough estimate of compression performance. Since the benchmark does not measure compression speed the LZ4 compression utility has been compiled without modifications using GCC and run on a Windows PC to obtain the following results. The results are comparable to the original LZ4 implementation with the default command line options (fast compression).

File Category Size (bytes) Ratio (H2LS 10) Ratio (H2LS 12) Ratio (H2LS 14)
alice29.txt English text 152089 1.3079 1.5266 1.6782
asyoulik.txt Shakespeare 125179 1.2684 1.4457 1.585
cp.html HTML source 24603 1.6461 1.9097 2.0294
fields.c C source 11150 1.9161 2.0614 2.0896
grammar.lsp LISP source 3721 1.8512 1.9062 1.9141
kennedy.xls Excel Spreadsheet 1029744 2.3826 2.5877 2.7481
lcet10.txt Technical writing 426754 1.3491 1.6052 1.7899
plrabn12.txt Poetry 481861 1.1735 1.3459 1.5018
ptt5 CCITT test set 513216 5.6318 5.8115 5.9357
sum SPARC Executable 38240 1.6961 1.9148 1.9827
xargs.1 GNU manual page 4227 1.4526 1.5326 1.5632
Total   2810784 1.8572 2.0987 2.2825

The benchmark files, additional details and results for other compression algorithms can be found on the Canterbury Corpus webpage.

Silesia Corpus

The Silesia Corpus is a set of large size files that represent a wide range of data formats and is used to compare lossless compression algorithm performance. While large file sizes are not typical in embedded applications they can still be used to compare typical data sets with the corpus files to determine a rough estimate of compression performance. Since the benchmark does not measure compression speed the LZ4 compression utility has been compiled without modifications using GCC and run on a Windows PC to obtain the following results. The results are comparable to the original LZ4 implementation with the default command line options (fast compression).

File Category Size (bytes) Ratio (H2LS 10) Ratio (H2LS 12) Ratio (H2LS 14)
alice29.txt English Text 10192446 1.2268 1.4259 1.5882
mozilla Executable 51220480 1.7236 1.8536 1.9501
mr Medical image 9970564 1.5728 1.6653 1.7521
nci Database 33553445 4.5649 5.357 5.7073
ooffice Executable 6152192 1.2934 1.3811 1.466
osdb Database 10085684 1.1354 1.4785 1.965
reymont Polish pdf 6627202 1.5583 1.7694 1.8747
samba Source code 21606400 2.1857 2.5442 2.7149
sao Binary data 7251944 1.0208 1.0538 1.0945
webster HTML 41458703 1.5805 1.8459 2.0475
xml HTML 5345280 3.2759 3.8561 4.1454
x-ray Medical image 8474240 1.0036 1.0115 1.0441
Total   211938580 1.7244 1.9328 2.0910

The benchmark files, additional details and results for other compression algorithms can be found on the Silesia Corpus webpage.

LZ4 Compression Speed

The speed at which data can be compressed is directly related to the compression ratio of the data. Data that can be compressed at a higher ratio can be compressed faster because there are more matched bytes and fewer literals that must be hashed into the table. The plot below shows a synthetic benchmark created to show how well the LZ4 algorithm performs on an embedded device. The original file being compressed is 32KB and a hash table size of 4096 is used.

LZ4 Ratio Time

Fig. 1 LZ4 Ratio Time

As the compression ratio decreases the advantage becomes even more significant. This is because lower compression ratio results in less matches and more calls to the hashing function, decreasing the overall compression speed.