Adventures in optimising ChaCha20 for Raaz.

reddit this

In this post, I note down some of the things I learned while optimising ChaCha20 implementation in the Raaz library.

TLDR: Do not waste time on hand optimising when there is GCC, the beast from the savanna.

A brief description of the ChaCha20

We have a chacha state which is essentially a 4x4 matrix of 32-bit unsigned word (Word32 in Haskell). The main body of the chacha transform consists of performing a quarter round, QROUND from now on, first on all the rows of the matrix and then on each of the diagonals of the matrix. For example, if the matrix is

	x0  x4  x8   x12
	x1  x5  x9   x13
	x2  x6  x10  x14
	x3  x7  x11  x15

Then a single round of ChaCha20 is the following operations.

	QROUND(x0, x4, x8,  x12); // QROUND(row0)
	QROUND(x1, x5, x9,  x13); // QROUND(row1)
	QROUND(x2, x6, x10, x14); // QROUND(row2)
	QROUND(x3, x7, x11, x15); // QROUND(row3)

	QROUND(x0, x5, x10, x15); // QROUND(diag0)
	QROUND(x1, x6, x11, x12); // QROUND(diag1)
	QROUND(x2, x7, x8,  x13); // QROUND(diag2)
	QROUND(x3, x4, x9,  x14); // QROUND(diag3)

The chacha20 stream is generated by performing this row and diagonal transformations 20 times followed by some adding up.

The first step was to get a portable version of ChaCha20 with the associated tests, which was done without much difficulty. You can have a look at it There was nothing clever here other than making sure that the variables xi’s are declared register.

Vector optimisations

The quarter round essentially consists of only addition and xor. It can easily be seen that the 4-row operations are independent of each other. Similarly the four diagonal operations are independent of each other. If we store each of the rows in a 128-bit vector consisting of 4 Word32’s, then the 4-row QROUND can be performed by a single vector version of QROUND. In other words, if we have the vectors

V0 = {x0,   x1,   x2,   x3 };
V1 = {x4,   x5,   x6,   x7 };
V2 = {x8,   x9,   x10,  x11};
V3 = {x12,  x13,  x14,  x15};

then the single QROUND(V0,V1,V2,V3) performs the QROUND for all the rows. For the diagonal ones we need to shift V1,V2, and V3 by 1, 2, and 3 positions to the left respectively and perform a single QROUND(V0,V1,V2,V3) again.

The optimisation for 256-bit vector instructions is similar, but we have to do 2-chacha blocks at a time. This is possible because the two chacha20 blocks differ only by a counter and can be done in parallel.

With the above optimisations I wrote two versions, one using the 128-bit vector the other using 256-bit version using the support for vector types in GCC. Again the implementation was delightfully simple, in fact in many ways much simpler than the portable one because now we can use vector types. The result of these optimisations were not very impressive though.

  1. The performance of 128-bit vector version as compared to the portable version does not improve at all (actually it is slightly worse) instead of the expected 2x improvement.

  2. The performance of 256-bit version with avx2 is better than the portable one by roughly a factor of 2x where as the naive expectations are a 4x improvement1.

I did look at the generate assembly and the code looked much like what was expected. The 128-bit code was using %xmm registers as expected and the 256-bit version was using the %ymm registers.

Letting the beast loose.

Cabal install and stack compiles C code with -O2. Just to see what has GCC got under its hood, I compiled stuff with the flags -O3 and what a surprise/shock I had. The portable C started competing with the 256-bit implementation. The generated assembly had the SIMD registers in it. Encouraged, I decided to throw in the -march=native and the figures blew me off. The portable C implementation was giving me an insane 14.4 Gbps where as the 256-bit one was only giving me 9.55Gbps.

Final tweaking and Cabal flags.

I can make two assumptions on the the message blocks because they are used only through FFI. Firstly, I can assume that the message block is 32-byte (256-bit) aligned by making sure that the Haskell stub function only passes 32-byte aligned buffers to the portable C implementation. Implementations of primitives in Raaz are instances of the BlockAlgorithm class. All I need to do is to ensure is that bufferStartAlignment is 32 for portable C implementation as opposed to word alignment. I can also assume that the pointer is not aliased. This two tweaks gives only minor improvements and need not be included at all.

The final performance figures are available at:

The cabal package now has two flags opt-native and opt-vectorise which results in adding the flags -O2 -ftree-vectorize -march=native. These are not enabled by default but can be enabled if you are compiling for yourself. With these changes there is no more a need for the vector implementations. I am keeping it for historical records.

Who should not enable the above flags?

If you are cross compiling, for example, when you are compiling the code for packaging for a distro, it is not advisable to enable these flags. Note that the native in this case would mean the machine on which it is compiled and not on the one where it is run. There is not much of an advantage in the hand coded vector implementation either because one cannot make assumptions on the target.

If you know that your target machine supports vector instructions, you can vectorise with -O2 -ftree-vectorize. Throw in an -mavx2, if you are sure of its supported on the target, and get most of the performance described above. This would be the use case when you are compiling on your local machine but deploying on the cloud for example.

Update: I got rid of the restrict key word in the portable c file because not all GCC versions seem to be familiar with it.

  1. The naive expectation will not hold because of some vector overheads.