Benchmarking¶
Canterbury Corpus¶
The Canterbury Corpus is a set of small sized files that represent a wide range of data formats and is used to compare lossless compression algorithm performance. Applications can compare typical data sets with the corpus files to determine a rough estimate of compression performance. Since the benchmark does not measure compression speed the LZ4 compression utility has been compiled without modifications using GCC and run on a Windows PC to obtain the following results. The results are comparable to the original LZ4 implementation with the default command line options (fast compression).
File | Category | Size (bytes) | Ratio (H2LS 10) | Ratio (H2LS 12) | Ratio (H2LS 14) |
---|---|---|---|---|---|
alice29.txt |
English text | 152089 | 1.3079 | 1.5266 | 1.6782 |
asyoulik.txt |
Shakespeare | 125179 | 1.2684 | 1.4457 | 1.585 |
cp.html |
HTML source | 24603 | 1.6461 | 1.9097 | 2.0294 |
fields.c |
C source | 11150 | 1.9161 | 2.0614 | 2.0896 |
grammar.lsp |
LISP source | 3721 | 1.8512 | 1.9062 | 1.9141 |
kennedy.xls |
Excel Spreadsheet | 1029744 | 2.3826 | 2.5877 | 2.7481 |
lcet10.txt |
Technical writing | 426754 | 1.3491 | 1.6052 | 1.7899 |
plrabn12.txt |
Poetry | 481861 | 1.1735 | 1.3459 | 1.5018 |
ptt5 |
CCITT test set | 513216 | 5.6318 | 5.8115 | 5.9357 |
sum |
SPARC Executable | 38240 | 1.6961 | 1.9148 | 1.9827 |
xargs.1 |
GNU manual page | 4227 | 1.4526 | 1.5326 | 1.5632 |
Total | 2810784 | 1.8572 | 2.0987 | 2.2825 |
The benchmark files, additional details and results for other compression algorithms can be found on the Canterbury Corpus webpage.
Silesia Corpus¶
The Silesia Corpus is a set of large size files that represent a wide range of data formats and is used to compare lossless compression algorithm performance. While large file sizes are not typical in embedded applications they can still be used to compare typical data sets with the corpus files to determine a rough estimate of compression performance. Since the benchmark does not measure compression speed the LZ4 compression utility has been compiled without modifications using GCC and run on a Windows PC to obtain the following results. The results are comparable to the original LZ4 implementation with the default command line options (fast compression).
File | Category | Size (bytes) | Ratio (H2LS 10) | Ratio (H2LS 12) | Ratio (H2LS 14) |
---|---|---|---|---|---|
alice29.txt |
English Text | 10192446 | 1.2268 | 1.4259 | 1.5882 |
mozilla |
Executable | 51220480 | 1.7236 | 1.8536 | 1.9501 |
mr |
Medical image | 9970564 | 1.5728 | 1.6653 | 1.7521 |
nci |
Database | 33553445 | 4.5649 | 5.357 | 5.7073 |
ooffice |
Executable | 6152192 | 1.2934 | 1.3811 | 1.466 |
osdb |
Database | 10085684 | 1.1354 | 1.4785 | 1.965 |
reymont |
Polish pdf | 6627202 | 1.5583 | 1.7694 | 1.8747 |
samba |
Source code | 21606400 | 2.1857 | 2.5442 | 2.7149 |
sao |
Binary data | 7251944 | 1.0208 | 1.0538 | 1.0945 |
webster |
HTML | 41458703 | 1.5805 | 1.8459 | 2.0475 |
xml |
HTML | 5345280 | 3.2759 | 3.8561 | 4.1454 |
x-ray |
Medical image | 8474240 | 1.0036 | 1.0115 | 1.0441 |
Total | 211938580 | 1.7244 | 1.9328 | 2.0910 |
The benchmark files, additional details and results for other compression algorithms can be found on the Silesia Corpus webpage.
LZ4 Compression Speed¶
The speed at which data can be compressed is directly related to the compression ratio of the data. Data that can be compressed at a higher ratio can be compressed faster because there are more matched bytes and fewer literals that must be hashed into the table. The plot below shows a synthetic benchmark created to show how well the LZ4 algorithm performs on an embedded device. The original file being compressed is 32KB and a hash table size of 4096 is used.
As the compression ratio decreases the advantage becomes even more significant. This is because lower compression ratio results in less matches and more calls to the hashing function, decreasing the overall compression speed.