System.Runtime.Intrinsics.X86

Hello,

My library has two variants: one that uses NaCl.Core and one that uses the native libsodium library. The libsodium variant runs 4 times faster, but it has deployment pitfalls: you have to ensure the correct native binary is used for the processor architecture and OS. I've worked around those pitfalls, but it got me wondering about optimizing NaCl.Core, since a fully managed solution has less friction.

I was wondering if you had looked at the `System.Runtime.Intrinsics.X86` namespace in .NET Core 3.0?

I am new to intrinsics, so don't consider the following authoritative, but I thought I'd present my experience as a data point for using intrinsics.

#### XOR proof-of-concept

I did a proof of concept on an easy bit to update - the XOR in `Snuffle.cs` (this doesn't improve performance much).

At the top:
```
#if NETCOREAPP3_0
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;
#endif
```

In `Process`:
```
            using (var owner = MemoryPool<byte>.Shared.Rent(BLOCK_SIZE_IN_BYTES))
            {
                for (var i = 0; i < numBlocks; i++)
                {
                    ProcessKeyStreamBlock(nonce, i + InitialCounter, owner.Memory.Span);

                    if (i == numBlocks - 1)
                        Xor(output, input, owner.Memory.Span, length % BLOCK_SIZE_IN_BYTES, offset, i); // last block
                    else
#if NETCOREAPP3_0
                        XorIntrinsic(output, input, owner.Memory.Span, BLOCK_SIZE_IN_BYTES, offset, i); // full block: vectorized XOR
#else
                        Xor(output, input, owner.Memory.Span, BLOCK_SIZE_IN_BYTES, offset, i);
#endif

                    owner.Memory.Span.Clear();
                }
            }
```

New method at the bottom:
```
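#if NETCOREAPP3_0
        private static unsafe void XorIntrinsic(Span<byte> output, ReadOnlySpan<byte> input, ReadOnlySpan<byte> block, int len, int offset, int curBlock)
        {
            var blockOffset = curBlock * BLOCK_SIZE_IN_BYTES;

            // To do - input length validation checks
            fixed (byte* pOut = output, pInA = input, pInB = block)
            {
                byte* pOutEnd = pOut + offset + blockOffset + len;
                byte* pOutCurrent = pOut + offset + blockOffset;
                byte* pInACurrent = pInA + blockOffset;
                byte* pInBCurrent = pInB;

                // Avx2.Xor on integer vectors needs AVX2, so check for support at runtime.
                if (Avx2.IsSupported)
                {
                    // A Vector256<byte> holds 32 bytes, so step by 32 (not 8) per iteration;
                    // a smaller step makes the loads overlap and run past the end of the block.
                    while (pOutCurrent + 32 <= pOutEnd)
                    {
                        var inputAVector = Avx.LoadVector256(pInACurrent);
                        var inputBVector = Avx.LoadVector256(pInBCurrent);
                        var outputVector = Avx2.Xor(inputAVector, inputBVector);
                        Avx.Store(pOutCurrent, outputVector);

                        pOutCurrent += 32;
                        pInACurrent += 32;
                        pInBCurrent += 32;
                    }
                }

                // Scalar loop: handles leftover bytes, or everything when AVX2 is unavailable.
                while (pOutCurrent < pOutEnd)
                {
                    *pOutCurrent++ = (byte)(*pInACurrent++ ^ *pInBCurrent++);
                }
            }
        }
#endif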
```

That compiled and benchmarked successfully, running approximately the same speed, perhaps a tiny amount faster. 

#### Conclusion

It seems that to fully implement intrinsics in the `Snuffle.cs` class (and hence get large performance gains), `ProcessKeyStreamBlock` would be the place to start, which would mean porting the implementation in `u0.h/u1.h/u4.h/u8.h` from [this directory](https://github.com/jedisct1/libsodium/tree/927dfe8e2eaa86160d3ba12a7e3258fbc322909c/src/libsodium/crypto_stream/chacha20/dolbeau) of libsodium.
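
As a rough illustration of what vectorizing `ProcessKeyStreamBlock` might involve, here is a minimal sketch of the ChaCha quarter round applied to four state columns at once with SSE2. It is not taken from NaCl.Core or libsodium: the class `ChaChaQuarterRoundSketch` and its method names are hypothetical, and it assumes the 16-word state has been loaded as four `Vector128<uint>` rows. A real port would also need the diagonal rounds (lane shuffles), the final feed-forward addition, and probably a wider AVX2 path.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration only; not part of NaCl.Core.
public static class ChaChaQuarterRoundSketch
{
    // Scalar reference: one ChaCha quarter round on four 32-bit words.
    public static void QuarterRound(ref uint a, ref uint b, ref uint c, ref uint d)
    {
        a += b; d ^= a; d = (d << 16) | (d >> 16);
        c += d; b ^= c; b = (b << 12) | (b >> 20);
        a += b; d ^= a; d = (d << 8) | (d >> 24);
        c += d; b ^= c; b = (b << 7) | (b >> 25);
    }

    // SSE2 has no 32-bit rotate instruction, so build one from two shifts and an OR.
    private static Vector128<uint> RotateLeft(Vector128<uint> v, byte bits) =>
        Sse2.Or(Sse2.ShiftLeftLogical(v, bits), Sse2.ShiftRightLogical(v, (byte)(32 - bits)));

    // Four quarter rounds in parallel: lane i of a, b, c, d holds column i of the 4x4 state.
    public static void QuarterRound4(
        ref Vector128<uint> a, ref Vector128<uint> b,
        ref Vector128<uint> c, ref Vector128<uint> d)
    {
        if (!Sse2.IsSupported)
            throw new PlatformNotSupportedException("SSE2 is required for this sketch.");

        a = Sse2.Add(a, b); d = RotateLeft(Sse2.Xor(d, a), 16);
        c = Sse2.Add(c, d); b = RotateLeft(Sse2.Xor(b, c), 12);
        a = Sse2.Add(a, b); d = RotateLeft(Sse2.Xor(d, a), 8);
        c = Sse2.Add(c, d); b = RotateLeft(Sse2.Xor(b, c), 7);
    }
}
```

Keeping the scalar `QuarterRound` next to the vector version gives a reference to test against and a fallback for processors without SSE2, which may be a reasonable pattern for introducing intrinsics piece by piece.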

Whilst I don't have the bandwidth for a pull request that is a mass update to include intrinsics, if you were to establish a style/methodology for including intrinsics, I could contribute in parts when time allows.
