Split a string into lines without any allocation

 
 
  • Gérald Barré

It's very common to split a string into lines. You can write something like that:

C#
var str = "Nickname: meziantou\r\nName: Gérald Barré";
var separators = new [] { '\r', '\n' };
var lines = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);
foreach (var line in lines)
{
    // - Allocate 1 string per line + 1 array
    // - The Split method may allocate
}

This code create one string per line and allocate one array. So, there are lots of allocations. Also, it splits the whole string even if you only need a few lines. You can also use a StringReader:

C#
var reader = new StringReader(str);
string line;
while ((line = reader.ReadLine()) != null)
{
    // - Allocate 1 string per line
    // - The ReadLine method may allocate
}

But this still allocates one StringReader and one string per line.

In the recent .NET release, you can take advantage of ReadOnlySpan<char> to avoid some allocations when working with strings. My goal is to create a method SplitLines that doesn't allocate and can be used this way:

C#
foreach (ReadOnlySpan<char> line in str.SplitLines())
{
    // whatever, but without any allocation 😊
}

Before showing the code, here're some important C# / .NET notions to understand.

##Making code compatible with the foreach operator

The foreach statement is not limited to IEnumerable<T> and can be applied to an instance of any type that satisfies the following conditions:

  • has the public parameterless GetEnumerator method whose return type is either class, struct, or interface type,
  • the return type of the GetEnumerator method has the public Current property and the public parameterless MoveNext method whose return type is Boolean.
C#
var enumerable = new List<int> { 1, 2, 3 };
foreach(var item in enumerable)
{
    Console.WriteLine(item);
}

// The foreach operator is rewritten by the compiler as the following code
var enumerator = enumerable.GetEnumerator();
while (enumerator.MoveNext())
{
    var item = enumerator.Current;
    Console.WriteLine(item);
}

That means that the GetEnumerator method can return a struct that has a bool MoveNext() method and a Current property. This way you can avoid one allocation. Indeed, the result doesn't need to be casted to IEnumerator<T> which would allocate.

##Span<T> and ReadOnlySpan<T>

These types provide a type-safe and memory-safe representation of a contiguous region of arbitrary memory. This type can represent any array-like (array, string, ArraySegment<T>) type and has the same optimizations as an array, so it is fast. Spans are very useful to create a slice of an array-like object. In the case of a string this is similar to Substring but without any allocation as Span<T> is a value type (struct).

C#
ReadOnlySpan<int> array = new int[]{1,2,3,4,5,6};
ReadOnlySpan<int> slice = array[1..^1]; // [2,3,4,5]

ReadOnlySpan<char> str = "meziantou";
ReadOnlySpan<char> substring = str[0..3]; // mez

In our case, ReadOnlySpan<char> will allow us to substring the string line by line by using the Slice method. Unless Substring, it won't create a new string on the heap, so this will prevent an allocation.

##ref struct

Span<T> and ReadOnlySpan<T> are ref struct. Instances of a ref struct type are allocated on the stack and can't escape to the managed heap. The compiler ensures that by reporting an error when you try to do an operation that may allocate the struct on the heap. For instance:

  • A ref struct can't be a declared type of a field of a class or a non-ref struct

    👉🏼 This means the enumerator that we'll implement needs to be a ref struct to contains the ReadOnlySpan<char>

  • A ref struct can't implement interfaces

    👉🏼 Hopefully, you don't need to implement an interface to be compatible with the foreach operator

  • A ref struct can't be boxed to System.ValueType or System.Object

    👉🏼 The foreach operator doesn't cast the value of the iterator, so this is ok

Note that there are other limitations explained in the documentation but they don't apply to the code of this post.

#SplitLines implementation

Now that you understand the foreach operator and ref structs, let's see the code:

C#
using System;

public static class StringExtensions
{
    public static LineSplitEnumerator SplitLines(this string str)
    {
        // LineSplitEnumerator is a struct so there is no allocation here
        return new LineSplitEnumerator(str.AsSpan());
    }

    // Must be a ref struct as it contains a ReadOnlySpan<char>
    public ref struct LineSplitEnumerator
    {
        private ReadOnlySpan<char> _str;

        public LineSplitEnumerator(ReadOnlySpan<char> str)
        {
            _str = str;
            Current = default;
        }

        // Needed to be compatible with the foreach operator
        public LineSplitEnumerator GetEnumerator() => this;

        public bool MoveNext()
        {
            var span = _str;
            if (span.Length == 0) // Reach the end of the string
                return false;

            var index = span.IndexOfAny('\r', '\n');
            if (index == -1) // The string is composed of only one line
            {
                _str = ReadOnlySpan<char>.Empty; // The remaining string is an empty string
                Current = new LineSplitEntry(span, ReadOnlySpan<char>.Empty);
                return true;
            }

            if (index < span.Length - 1 && span[index] == '\r')
            {
                // Try to consume the '\n' associated to the '\r'
                var next = span[index + 1];
                if (next == '\n')
                {
                    Current = new LineSplitEntry(span.Slice(0, index), span.Slice(index, 2));
                    _str = span.Slice(index + 2);
                    return true;
                }
            }

            Current = new LineSplitEntry(span.Slice(0, index), span.Slice(index, 1));
            _str = span.Slice(index + 1);
            return true;
        }

        public LineSplitEntry Current { get; private set; }
    }

    public readonly ref struct LineSplitEntry
    {
        public LineSplitEntry(ReadOnlySpan<char> line, ReadOnlySpan<char> separator)
        {
            Line = line;
            Separator = separator;
        }

        public ReadOnlySpan<char> Line { get; }
        public ReadOnlySpan<char> Separator { get; }

        // This method allow to deconstruct the type, so you can write any of the following code
        // foreach (var entry in str.SplitLines()) { _ = entry.Line; }
        // foreach (var (line, endOfLine) in str.SplitLines()) { _ = line; }
        // https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/functional/deconstruct?WT.mc_id=DT-MVP-5003978#deconstructing-user-defined-types
        public void Deconstruct(out ReadOnlySpan<char> line, out ReadOnlySpan<char> separator)
        {
            line = Line;
            separator = Separator;
        }

        // This method allow to implicitly cast the type into a ReadOnlySpan<char>, so you can write the following code
        // foreach (ReadOnlySpan<char> entry in str.SplitLines())
        public static implicit operator ReadOnlySpan<char>(LineSplitEntry entry) => entry.Line;
    }
}

Here're some usage examples of the SplitLines() method:

C#
var str = "Nickname: meziantou\r\nName: Gérald Barré";

foreach (ReadOnlySpan<char> line in str.SplitLines())
{
    Console.WriteLine(line.ToString());
}

foreach (var (line, endOfLine) in str.SplitLines())
{
    Console.Write(line.ToString());
    Console.Write(endOfLine.ToString());
}

foreach (var lineEntry in str.SplitLines())
{
    Console.Write(lineEntry.Line.ToString());
    Console.Write(lineEntry.Sparator.ToString());
}

#Benchmark

Let's create a quick benchmark with BenchmarkDotNet:

C#
[MemoryDiagnoser]
public class SplitLinesBenchmark
{
    private const string Data = "Nickname: meziantou\r\nFirstName: Gérald\nLastName: Barré";
    private readonly char[] Separators = new[] { '\r', '\n' };

    [Benchmark]
    public void StringReader()
    {
        var reader = new StringReader(Data);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
        }
    }

    [Benchmark]
    public void Split()
    {
        foreach (var line in Data.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
        }
    }

    [Benchmark]
    public void Split_Preallocated()
    {
        foreach (var line in Data.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
        }
    }

    [Benchmark]
    public void Span()
    {
        foreach (ReadOnlySpan<char> item in Data.SplitLines())
        {
        }
    }
}

public class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<SplitLinesBenchmark>();
}
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22307.18
  [Host]     : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.30112), X64 RyuJIT
MethodMeanErrorStdDevGen 0Allocated
StringReader37.51 ns0.639 ns0.598 ns0.0124208 B
Split73.75 ns0.854 ns0.799 ns0.0186312 B
Split_Preallocated69.16 ns1.231 ns1.152 ns0.0167280 B
Span35.87 ns0.734 ns0.785 ns--

The code is faster and doesn't allocate at all 😃

#Additional references

Do you have a question or a suggestion about this post? Contact me!

Follow me:
Enjoy this blog?Buy Me A Coffee💖 Sponsor on GitHub