Fast GF(256) Cauchy MDS Block Erasure Codec in C
cm256 is a simple library for erasure codes. From given data it generates redundant data that can be used to recover the originals.
It is roughly 2x faster than Longhair, and CM256 supports input data that is not a multiple of 8 bytes.
Currently only Visual Studio 2013 is supported, though other versions of MSVC may work.
The original data should be split up into equally-sized chunks. If one of these chunks is erased, the redundant data can fill in the gap through decoding.
The erasure code is parameterized by three values (OriginalCount
, RecoveryCount
, BlockBytes
). These are:
- The number of blocks of original data (
OriginalCount
), which must be less than 256. - The number of blocks of redundant data (
RecoveryCount
), which must be no more than256 - OriginalCount
.
For example, if a file is split into 3 equal pieces and sent over a network, OriginalCount
is 3.
And if 2 additional redundant packets are generated, RecoveryCount
is 2.
In this case up to 256 - 3 = 253 additional redundant packets can be generated.
Include the cm256.* and gf256.* files in your project and consult the cm256.h header for usage.
Documentation is provided in the header file cm256.h.
When your application starts up it should call cm256_init()
to verify that the library is linked properly:
#include "cm256.h"
if (cm256_init()) {
// Wrong static library
exit(1);
}
To generate redundancy, use the cm256_encode
function. To solve for the original data use the cm256_decode
function.
Example usage:
bool ExampleFileUsage()
{
if (cm256_init())
{
exit(1);
}
cm256_encoder_params params;
// Number of bytes per file block
params.BlockBytes = 4321;
// Number of blocks
params.OriginalCount = 33;
// Number of additional recovery blocks generated by encoder
params.RecoveryCount = 12;
// Size of the original file
static const int OriginalFileBytes = params.OriginalCount * params.BlockBytes;
// Allocate and fill the original file data
uint8_t* originalFileData = new uint8_t[OriginalFileBytes];
memset(originalFileData, 1, OriginalFileBytes);
// Pointers to data
cm256_block blocks[256];
for (int i = 0; i < params.OriginalCount; ++i)
{
blocks[i].Block = originalFileData + i * params.BlockBytes;
}
// Recovery data
uint8_t* recoveryBlocks = new uint8_t[params.RecoveryCount * params.BlockBytes];
// Generate recovery data
if (cm256_encode(params, blocks, recoveryBlocks))
{
exit(1);
}
// Initialize the indices
for (int i = 0; i < params.OriginalCount; ++i)
{
blocks[i].Index = cm256_get_original_block_index(params, i);
}
//// Simulate loss of data, subsituting a recovery block in its place ////
blocks[0].Block = recoveryBlocks; // First recovery block
blocks[0].Index = cm256_get_recovery_block_index(params, 0); // First recovery block index
//// Simulate loss of data, subsituting a recovery block in its place ////
if (cm256_decode(params, blocks))
{
exit(1);
}
// blocks[0].Index will now be 0.
delete[] originalFileData;
delete[] recoveryBlocks;
return true;
}
The example above is just one way to use the cm256_decode
function.
This API was designed to be flexible enough for UDP/IP-based file transfer where the blocks arrive out of order.
CM256 demonstrates similar encoding and (worst case) decoding performance:
Encoder: 1296 bytes k = 100 m = 1 : 5.55886 usec, 23314.1 MBps
Decoder: 1296 bytes k = 100 m = 1 : 6.72915 usec, 19259.5 MBps
Encoder: 1296 bytes k = 100 m = 2 : 17.2617 usec, 7507.93 MBps
Decoder: 1296 bytes k = 100 m = 2 : 19.6023 usec, 6611.46 MBps
Encoder: 1296 bytes k = 100 m = 3 : 30.4275 usec, 4259.31 MBps
Decoder: 1296 bytes k = 100 m = 3 : 32.4755 usec, 3990.7 MBps
Encoder: 1296 bytes k = 100 m = 4 : 40.6675 usec, 3186.82 MBps
Decoder: 1296 bytes k = 100 m = 4 : 43.5932 usec, 2972.94 MBps
Encoder: 1296 bytes k = 100 m = 5 : 51.7852 usec, 2502.64 MBps
Decoder: 1296 bytes k = 100 m = 5 : 51.4926 usec, 2516.86 MBps
Encoder: 1296 bytes k = 100 m = 6 : 62.6104 usec, 2069.94 MBps
Decoder: 1296 bytes k = 100 m = 6 : 64.9509 usec, 1995.35 MBps
Encoder: 1296 bytes k = 100 m = 7 : 76.3612 usec, 1697.2 MBps
Decoder: 1296 bytes k = 100 m = 7 : 75.191 usec, 1723.61 MBps
Encoder: 1296 bytes k = 100 m = 8 : 85.1384 usec, 1522.23 MBps
Decoder: 1296 bytes k = 100 m = 8 : 83.0904 usec, 1559.75 MBps
Encoder: 1296 bytes k = 100 m = 9 : 96.2561 usec, 1346.41 MBps
Decoder: 1296 bytes k = 100 m = 9 : 95.3784 usec, 1358.8 MBps
Encoder: 1296 bytes k = 100 m = 10 : 110.592 usec, 1171.87 MBps
Decoder: 1296 bytes k = 100 m = 10 : 109.714 usec, 1181.25 MBps
Encoder: 1296 bytes k = 100 m = 20 : 223.525 usec, 579.801 MBps
Decoder: 1296 bytes k = 100 m = 20 : 209.481 usec, 618.671 MBps
Encoder: 1296 bytes k = 100 m = 30 : 372.737 usec, 347.699 MBps
Decoder: 1296 bytes k = 100 m = 30 : 322.707 usec, 401.603 MBps
Encoder: 1296 bytes k = 100 m = 40 : 471.626 usec, 274.794 MBps
Decoder: 1296 bytes k = 100 m = 40 : 434.762 usec, 298.094 MBps
Encoder: 1296 bytes k = 100 m = 50 : 592.751 usec, 218.642 MBps
Decoder: 1296 bytes k = 100 m = 50 : 545.939 usec, 237.389 MBps
(These performance numbers are out of date and not well calibrated - Decoding now takes the same time as encoding within a few microseconds thanks to the new matrix solver.)
Longhair Library Results:
Note that I hand-optimized the MemXOR.cpp implementation on this PC to run faster than what is available on github, so this is a fair comparison.
Encoded k=100 data blocks with m=1 recovery blocks in 4.09607 usec : 31640.1 MB/s
+ Decoded 1 erasures in 5.85144 usec : 22148.4 MB/s
Encoded k=100 data blocks with m=2 recovery blocks in 41.5452 usec : 3119.5 MB/s
+ Decoded 2 erasures in 43.5931 usec : 2972.94 MB/s
Encoded k=100 data blocks with m=3 recovery blocks in 80.7498 usec : 1604.96 MB/s
+ Decoded 3 erasures in 86.6013 usec : 1496.51 MB/s
Encoded k=100 data blocks with m=4 recovery blocks in 123.465 usec : 1049.69 MB/s
+ Decoded 4 erasures in 127.854 usec : 1013.66 MB/s
Encoded k=100 data blocks with m=5 recovery blocks in 76.9464 usec : 1684.29 MB/s
+ Decoded 5 erasures in 88.6493 usec : 1461.94 MB/s
Encoded k=100 data blocks with m=6 recovery blocks in 87.7717 usec : 1476.56 MB/s
+ Decoded 6 erasures in 100.352 usec : 1291.45 MB/s
Encoded k=100 data blocks with m=7 recovery blocks in 103.863 usec : 1247.8 MB/s
+ Decoded 7 erasures in 127.269 usec : 1018.32 MB/s
Encoded k=100 data blocks with m=8 recovery blocks in 118.784 usec : 1091.05 MB/s
+ Decoded 8 erasures in 145.701 usec : 889.494 MB/s
Encoded k=100 data blocks with m=9 recovery blocks in 146.871 usec : 882.406 MB/s
+ Decoded 9 erasures in 158.574 usec : 817.284 MB/s
Encoded k=100 data blocks with m=10 recovery blocks in 156.819 usec : 826.433 MB/s
+ Decoded 10 erasures in 181.102 usec : 715.619 MB/s
Encoded k=100 data blocks with m=20 recovery blocks in 282.039 usec : 459.511 MB/s
+ Decoded 20 erasures in 370.103 usec : 350.172 MB/s
Encoded k=100 data blocks with m=30 recovery blocks in 428.618 usec : 302.367 MB/s
+ Decoded 30 erasures in 614.693 usec : 210.837 MB/s
Encoded k=100 data blocks with m=40 recovery blocks in 562.323 usec : 230.472 MB/s
+ Decoded 40 erasures in 855.188 usec : 151.546 MB/s
Encoded k=100 data blocks with m=50 recovery blocks in 727.041 usec : 178.257 MB/s
+ Decoded 50 erasures in 1181.11 usec : 109.727 MB/s
Results Discussion:
For m=1 they are both running the same kind of code, so they're basically the same.
For m=2 and m=3, CM256 is 2.5x faster.
For m=4, CM256 is 3x faster in this case. Longhair could use more tuning. Back when I wrote it, the right time to switch to the Windowed decoder was at m=5, but on my new PC it seems like m=4 is a better time to do it. CM256 only has one mode so it doesn't require any tuning for best performance.
For m=5...30, CM256 performance is not quite 2x faster, maybe 1.7x or so.
For m>30, CM256 is at least 2x faster.
The approach taken in CM256 is similar to the Intel Storage Acceleration Library (ISA-L) available here:
https://01.org/intel%C2%AE-storage-acceleration-library-open-source-version/downloads
ISA-L more aggressively optimizes the matrix multiplication operation, which is the most expensive step of encoding.
CM256 takes better advantage of the m=1 case and the first recovery symbol, which is also possible with the Vandermonde matrices supported by ISA-L.
ISA-L uses a O(N^3) Gaussian elimination solver for decoding. The CM256 decoder solves the linear system using a fast O(N^2) LDU-decomposition algorithm from "Pivoting and Backward Stability of Fast Algorithms for Solving Cauchy Linear Equations" (T. Boros, T. Kailath, V. Olshevsky), which was hand-optimized for memory accesses.
This software was written entirely by myself ( Christopher A. Taylor [email protected] ). If you find it useful and would like to buy me a coffee, consider tipping.