Some performance tricks with .NET strings

I've created a pull request on the ASP.NET Core repository. Initially, the change was just about replacing the unsafe stackalloc code (char*) with a safe version using Span<T>, so it was a very small change. During the review, Oleksandr Kolomiiets, Günther Foidl, and David Fowler suggested a few additional changes to improve the performance. Thank you very much for taking the time to review the PR! The comments were very interesting, so I've decided to explain them in a post.

Here's the explanation of each change, and you'll see that the final code is twice as fast as the initial code!

Original code

The code encodes a long number into a base-32 string. It has already been optimized, as explained in the comment, but we'll see how we can optimize it even more!

private static readonly string s_encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

public static unsafe string GenerateId(long id)
{
    // The following routine is ~310% faster than calling long.ToString() on x64
    // and ~600% faster than calling long.ToString() on x86 in tight loops of 1 million+ iterations
    // See: https://github.com/aspnet/Hosting/pull/385

    // stackalloc to allocate array on stack rather than heap
    char* buffer = stackalloc char[13];
    buffer[0] = s_encode32Chars[(int)(id >> 60) & 31];
    buffer[1] = s_encode32Chars[(int)(id >> 55) & 31];
    buffer[2] = s_encode32Chars[(int)(id >> 50) & 31];
    buffer[3] = s_encode32Chars[(int)(id >> 45) & 31];
    buffer[4] = s_encode32Chars[(int)(id >> 40) & 31];
    buffer[5] = s_encode32Chars[(int)(id >> 35) & 31];
    buffer[6] = s_encode32Chars[(int)(id >> 30) & 31];
    buffer[7] = s_encode32Chars[(int)(id >> 25) & 31];
    buffer[8] = s_encode32Chars[(int)(id >> 20) & 31];
    buffer[9] = s_encode32Chars[(int)(id >> 15) & 31];
    buffer[10] = s_encode32Chars[(int)(id >> 10) & 31];
    buffer[11] = s_encode32Chars[(int)(id >> 5) & 31];
    buffer[12] = s_encode32Chars[(int)id & 31];
    return new string(buffer);
}
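The encoding packs the 64-bit value into 13 characters of 5 bits each, most-significant bits first (the first character only carries the top 4 bits, since 13 × 5 = 65). To make the layout concrete, here's a hypothetical decoder, not part of the PR, that maps each character back to its 5-bit value and reassembles the original number:

```csharp
using System;

static class IdDecoder
{
    private const string Encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

    public static long DecodeId(string encoded)
    {
        long value = 0;
        foreach (var c in encoded)
        {
            // Each character contributes 5 bits; the extra top bit
            // of the first character is shifted out of the long.
            value = (value << 5) | (uint)Encode32Chars.IndexOf(c);
        }

        return value;
    }
}
```

DecodeId(GenerateId(id)) returns the original id for any long, including negative values, because the sign-extension bits introduced by the arithmetic right shift are masked away by & 31 during encoding.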

Using Span<T>

My first commit replaces the unsafe code with Span<char>. It is nothing more than a small rewrite that doesn't impact the performance. This change is now possible because ASP.NET Core 3 will only support .NET Core 3; ASP.NET Core 2.2 targets .NET Standard 2.0, which doesn't contain Span<T>. If you are not familiar with spans, you should read Adam Sitnik's post: https://adamsitnik.com/Span/.

Here are the changes:

private static readonly string s_encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

// Remove unsafe keyword
public static string GenerateId(long id)
{
    // Replace char* with Span<T>
    Span<char> buffer = stackalloc char[13];
    buffer[0] = s_encode32Chars[(int)(id >> 60) & 31];
    buffer[1] = s_encode32Chars[(int)(id >> 55) & 31];
    ...
    buffer[11] = s_encode32Chars[(int)(id >> 5) & 31];
    buffer[12] = s_encode32Chars[(int)id & 31];
    return new string(buffer); // Uses the string(ReadOnlySpan<char>) constructor
}

Using string.Create

The first optimization is to use string.Create. In the previous code, a buffer is created on the stack, then the string is created on the heap from that buffer, which means the buffer is copied into the new string instance. With string.Create we avoid this copy: the method allocates the string directly on the heap, and you set its content from the delegate (code on GitHub).

private static readonly string s_encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

public static string GenerateId(long id)
{
    return string.Create(13, id, (buffer, value) =>
    {
        // Do not use id here, instead use value (read hereafter for more details)
        buffer[0] = s_encode32Chars[(int)(value >> 60) & 31];
        buffer[1] = s_encode32Chars[(int)(value >> 55) & 31];
        ...
        buffer[11] = s_encode32Chars[(int)(value >> 5) & 31];
        buffer[12] = s_encode32Chars[(int)value & 31];
    });
}

You may have noticed that I do not use id inside the lambda expression. Indeed, using it would introduce a closure. A closure is bad here because the delegate cannot be cached: a new delegate would be instantiated each time the method is called. In the end, this would decrease the performance (by about 30% in this case) and add pressure on the GC. This is why the second parameter of string.Create exists: it lets you pass the state into the lambda without creating a closure. You can read more about closures on the JetBrains blog.
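To make the difference concrete, here is a minimal sketch (not the PR code; the bodies are written as loops for brevity) contrasting the closure-based variant with the state-parameter variant:

```csharp
using System;

static class IdGenerator
{
    private static readonly string s_encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

    // Closure version: the lambda captures the parameter 'id', so the compiler
    // generates a closure class and allocates a fresh delegate on every call.
    public static string GenerateIdWithClosure(long id)
    {
        return string.Create(13, 0, (buffer, _) =>
        {
            for (var i = 0; i < 13; i++)
                buffer[12 - i] = s_encode32Chars[(int)(id >> (5 * i)) & 31];
        });
    }

    // State version: 'id' is passed as the TState argument, the lambda captures
    // nothing, and the compiler can cache a single delegate instance.
    public static string GenerateId(long id)
    {
        return string.Create(13, id, (buffer, value) =>
        {
            for (var i = 0; i < 13; i++)
                buffer[12 - i] = s_encode32Chars[(int)(value >> (5 * i)) & 31];
        });
    }
}
```

Both methods produce the same string; only the allocation behavior differs, which is what the "String.Create with closure" row in the benchmark below measures.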

The benchmark indicates using string.Create is about 35% faster!

Reversing the assignment order

David Fowler suggested reversing the assignment order (i.e. assign buffer[12], then buffer[11], and so on). This allows the JIT to remove bounds checks: if the access to index 12 succeeds, there is no reason for the lower indices to be out of range.

private static readonly string s_encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

public static string GenerateId(long id)
{
    return string.Create(13, id, (buffer, value) =>
    {
        // Assign buffer from index 12 down to 0 to avoid bounds checks
        buffer[12] = s_encode32Chars[(int)value & 31];
        buffer[11] = s_encode32Chars[(int)(value >> 5) & 31];
        ...
        buffer[1] = s_encode32Chars[(int)(value >> 55) & 31];
        buffer[0] = s_encode32Chars[(int)(value >> 60) & 31];
    });
}

While the JIT emits optimized code, the optimized version has the same performance as the previous one on my computer. As suggested by Günther Foidl, it may be because the CPU's branch predictor is doing a good job in this case. That doesn't mean the optimization is useless: other CPUs may not predict these branches as well, and the size of the emitted code is smaller, which is always a good thing.

The full assembly code generated by the JIT, with and without the optimization, is available on the following Gists:

  • Unoptimized version: you can see a bounds check at every access (lines 273, 286, 297, and so on)
  • Optimized version: you can see a bounds check only for the first access (line 76), but not for the subsequent ones (line 88)

Copying the field into a local variable

The JIT must emit code to load s_encode32Chars for each access. Even though the reference is readonly, from the JIT's point of view the underlying data could change between accesses, so fresh data must be loaded every time.

This can be avoided by copying the field into a local variable once: the JIT can then track that the local never changes, and the repeated loads go away.

private static readonly string s_encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV";

public static string GenerateId(long id)
{
    return string.Create(13, id, (buffer, value) =>
    {
        // Copy the reference
        var encode32Chars = s_encode32Chars;

        buffer[12] = encode32Chars[(int)value & 31];
        buffer[11] = encode32Chars[(int)(value >> 5) & 31];
        ...
        buffer[1] = encode32Chars[(int)(value >> 55) & 31];
        buffer[0] = encode32Chars[(int)(value >> 60) & 31];
    });
}

The benchmark indicates the code is about 4% faster than the previous version!

Removing the explicit cast

The last optimization is to remove the explicit cast to int. The indexer of a string only accepts an int, hence the cast. But we can easily change the string to a char[], which can be indexed with a long. Without the casts, the JIT produces better code on 64-bit systems. On 32-bit systems, additional code may be generated, resulting in worse performance; but most applications (especially web servers) now run on 64-bit, so this change is acceptable.

// Change from string to char[], so we can use the long
private static readonly char[] s_encode32Chars = "0123456789ABCDEFGHIJKLMNOPQRSTUV".ToCharArray();

public static string GenerateId(long id)
{
    return string.Create(13, id, (buffer, value) =>
    {
        var encode32Chars = s_encode32Chars;

        // Remove explicit cast in the indexer
        buffer[12] = encode32Chars[value & 31];
        buffer[11] = encode32Chars[(value >> 5) & 31];
        ...
        buffer[1] = encode32Chars[(value >> 55) & 31];
        buffer[0] = encode32Chars[(value >> 60) & 31];
    });
}

The benchmark indicates the code is about 5% faster than the previous version!

Benchmark

I've created a benchmark using BenchmarkDotNet to measure the performance impact of each change in the pull request. You can find the code on Gist.

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i5-6600 CPU 3.30GHz (Skylake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-009812
  [Host] : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
  Core   : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT

Job=Core  Runtime=Core
Method                     | Mean     | Median   | Allocated Memory/Op
---------------------------|----------|----------|--------------------
Original                   | 34.21 ns | 34.24 ns | 48 B
Span                       | 33.86 ns | 33.77 ns | 48 B
String.Create              | 21.87 ns | 22.15 ns | 48 B
String.Create with closure | 46.30 ns | 46.30 ns | 136 B
Reverse assignation        | 21.69 ns | 21.59 ns | 48 B
Use local                  | 18.95 ns | 18.98 ns | 48 B
Remove explicit cast       | 17.85 ns | 17.16 ns | 48 B

The final code is twice as fast as the original version, and there is no more unsafe code. Thanks to all the reviewers of the pull request for helping me discover new C# features!

Here's another great post about these changes by Nima Ara: Generating IDs in C#, 'safely' and efficiently
