Benchmarking

Canterbury Corpus

The Canterbury Corpus is a set of small files that represent a wide range of data formats and is used to compare the performance of lossless compression algorithms. Applications can compare their typical data sets with the corpus files to obtain a rough estimate of compression performance. Since the benchmark does not measure compression speed, the LZ4 compression utility was compiled without modifications using GCC and run on a Windows PC to obtain the following results. The results are comparable to those of the original LZ4 implementation with the default command line options (fast compression).

File	Category	Size (bytes)	Ratio (H2LS 10)	Ratio (H2LS 12)	Ratio (H2LS 14)
`alice29.txt`	English text	152089	1.3079	1.5266	1.6782
`asyoulik.txt`	Shakespeare	125179	1.2684	1.4457	1.5850
`cp.html`	HTML source	24603	1.6461	1.9097	2.0294
`fields.c`	C source	11150	1.9161	2.0614	2.0896
`grammar.lsp`	LISP source	3721	1.8512	1.9062	1.9141
`kennedy.xls`	Excel Spreadsheet	1029744	2.3826	2.5877	2.7481
`lcet10.txt`	Technical writing	426754	1.3491	1.6052	1.7899
`plrabn12.txt`	Poetry	481861	1.1735	1.3459	1.5018
`ptt5`	CCITT test set	513216	5.6318	5.8115	5.9357
`sum`	SPARC Executable	38240	1.6961	1.9148	1.9827
`xargs.1`	GNU manual page	4227	1.4526	1.5326	1.5632
Total		2810784	1.8572	2.0987	2.2825

The benchmark files, additional details and results for other compression algorithms can be found on the Canterbury Corpus webpage.
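The Total row can be related to the per-file rows with a short calculation. The sketch below is a minimal illustration, not the tool used to produce the table: it takes the H2LS 10 column, back-computes each compressed size from the rounded ratio (an approximation), and compares a size-weighted ratio with a plain average of the per-file ratios. The function names are hypothetical, and the assumption that a ratio means original size divided by compressed size is not stated in the text above.

```python
# Rows copied from the Canterbury table: (file, size in bytes, ratio at H2LS 10).
CANTERBURY_H2LS10 = [
    ("alice29.txt", 152089, 1.3079),
    ("asyoulik.txt", 125179, 1.2684),
    ("cp.html", 24603, 1.6461),
    ("fields.c", 11150, 1.9161),
    ("grammar.lsp", 3721, 1.8512),
    ("kennedy.xls", 1029744, 2.3826),
    ("lcet10.txt", 426754, 1.3491),
    ("plrabn12.txt", 481861, 1.1735),
    ("ptt5", 513216, 5.6318),
    ("sum", 38240, 1.6961),
    ("xargs.1", 4227, 1.4526),
]


def size_weighted_ratio(rows):
    """Total original bytes divided by total estimated compressed bytes.

    Assumes ratio = original size / compressed size. The compressed sizes
    are reconstructed from rounded ratios, so they are estimates only.
    """
    total_original = sum(size for _, size, _ in rows)
    total_compressed = sum(size / ratio for _, size, ratio in rows)
    return total_original / total_compressed


def mean_ratio(rows):
    """Unweighted average of the per-file ratios."""
    return sum(ratio for _, _, ratio in rows) / len(rows)


if __name__ == "__main__":
    print(f"size-weighted ratio: {size_weighted_ratio(CANTERBURY_H2LS10):.4f}")
    print(f"mean of ratios:      {mean_ratio(CANTERBURY_H2LS10):.4f}")
```

The size-weighted figure comes out at about 1.857, which is consistent with the Total row, while the unweighted mean is about 1.97 because the highly compressible `ptt5` file pulls the average up. When estimating performance for an application, the weighted figure is usually the more relevant one if the data mix resembles the corpus.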

Silesia Corpus

The Silesia Corpus is a set of large files that represent a wide range of data formats and is used to compare the performance of lossless compression algorithms. While large files are not typical in embedded applications, the corpus files can still be compared with typical data sets to obtain a rough estimate of compression performance. Since the benchmark does not measure compression speed, the LZ4 compression utility was compiled without modifications using GCC and run on a Windows PC to obtain the following results. The results are comparable to those of the original LZ4 implementation with the default command line options (fast compression).

File	Category	Size (bytes)	Ratio (H2LS 10)	Ratio (H2LS 12)	Ratio (H2LS 14)
`dickens`	English text	10192446	1.2268	1.4259	1.5882
`mozilla`	Executable	51220480	1.7236	1.8536	1.9501
`mr`	Medical image	9970564	1.5728	1.6653	1.7521
`nci`	Database	33553445	4.5649	5.3570	5.7073
`ooffice`	Executable	6152192	1.2934	1.3811	1.4660
`osdb`	Database	10085684	1.1354	1.4785	1.9650
`reymont`	Polish PDF	6627202	1.5583	1.7694	1.8747
`samba`	Source code	21606400	2.1857	2.5442	2.7149
`sao`	Binary data	7251944	1.0208	1.0538	1.0945
`webster`	HTML	41458703	1.5805	1.8459	2.0475
`xml`	XML	5345280	3.2759	3.8561	4.1454
`x-ray`	Medical image	8474240	1.0036	1.0115	1.0441
Total		211938580	1.7244	1.9328	2.0910

The benchmark files, additional details and results for other compression algorithms can be found on the Silesia Corpus webpage.
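To turn a corpus ratio into a rough estimate for an application, the ratio can be converted into an expected compressed size and a space saving. The sketch below is only an illustration under stated assumptions: the helper names and the 64 KiB payload are hypothetical, the ratio is assumed to mean original size divided by compressed size, and H2LS is assumed to be the base-2 logarithm of the hash table size (which would match the 4096-entry table paired with H2LS 12 elsewhere on this page, but is not confirmed by the text).

```python
# Total-row ratios copied from the Silesia table, keyed by H2LS value.
SILESIA_TOTAL_RATIO = {10: 1.7244, 12: 1.9328, 14: 2.0910}


def estimated_compressed_size(original_bytes, ratio):
    """Rough compressed size, assuming ratio = original / compressed."""
    return original_bytes / ratio


def space_saving(ratio):
    """Fraction of the original size removed by compression."""
    return 1.0 - 1.0 / ratio


if __name__ == "__main__":
    payload = 64 * 1024  # hypothetical application buffer, in bytes
    for h2ls, ratio in sorted(SILESIA_TOTAL_RATIO.items()):
        # ASSUMPTION: hash table size = 2 ** H2LS (not stated in the source).
        table_size = 1 << h2ls
        size = estimated_compressed_size(payload, ratio)
        print(f"H2LS {h2ls} (table size {table_size}): "
              f"~{size:.0f} bytes, saving {space_saving(ratio):.1%}")
```

Such an estimate is only as good as the resemblance between the application data and the corpus files, so the per-file rows closest to the real data type are usually a better guide than the Total row.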

LZ4 Compression Speed

The speed at which data can be compressed is directly related to its compression ratio. Data that compresses at a higher ratio compresses faster because it contains more matched bytes and fewer literals that must be hashed into the table. The plot below shows a synthetic benchmark that illustrates how well the LZ4 algorithm performs on an embedded device. The original file being compressed is 32 KB, and a hash table size of 4096 is used.

Fig. 1 LZ4 Ratio Time

As the compression ratio decreases, the advantage becomes even more significant. This is because a lower compression ratio results in fewer matches and more calls to the hashing function, which decreases the overall compression speed.
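The link between matches and hashing work can be illustrated with a toy match finder. The sketch below is not the LZ4 implementation: it is a simplified greedy scan in the LZ4 style that hashes four bytes at each literal position, looks up a candidate in a 4096-entry table, and skips over the whole match when one is found. The hash function, the function names, and the test inputs are assumptions chosen for illustration, and the real algorithm includes details omitted here.

```python
import random

MIN_MATCH = 4   # shortest match worth encoding in this sketch
HASH_LOG = 12   # 2**12 = 4096 table entries, as in the benchmark above


def hash4(data, pos):
    """Multiplicative hash of the 4 bytes at pos (illustrative choice)."""
    value = int.from_bytes(data[pos:pos + 4], "little")
    return ((value * 2654435761) & 0xFFFFFFFF) >> (32 - HASH_LOG)


def count_hash_calls(data):
    """Return (hash_calls, matched_bytes) for a greedy LZ4-style scan."""
    table = [-1] * (1 << HASH_LOG)
    n = len(data)
    pos = hash_calls = matched = 0
    while pos <= n - MIN_MATCH:
        h = hash4(data, pos)
        hash_calls += 1
        cand = table[h]
        table[h] = pos
        if cand >= 0 and data[cand:cand + MIN_MATCH] == data[pos:pos + MIN_MATCH]:
            # Extend the match, then skip it without hashing its bytes.
            length = MIN_MATCH
            while pos + length < n and data[cand + length] == data[pos + length]:
                length += 1
            matched += length
            pos += length
        else:
            pos += 1  # literal byte: the next position must be hashed too
    return hash_calls, matched


if __name__ == "__main__":
    size = 32 * 1024
    rng = random.Random(0)
    repetitive = b"abcdefgh" * (size // 8)
    noisy = bytes(rng.getrandbits(8) for _ in range(size))
    print("repetitive:", count_hash_calls(repetitive))
    print("random:    ", count_hash_calls(noisy))
```

On the repetitive 32 KB input, almost every byte is covered by a single long match and only a handful of positions are hashed, whereas the random input yields practically no matches and nearly every position is hashed. This mirrors the relationship described above between a lower compression ratio and more calls to the hashing function.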