This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
When we talk about new lines, most developers think about \r\n (Windows) and \n (Unix-like systems). That works most of the time, but it is not the full picture.
Unicode and several regex engines recognize additional line break characters. If your application parses user input, logs, CSV-like content, or cross-platform data, this can matter more than you think.
#The line breaks you should know
The most common line terminators are:
- \r\n (CRLF, U+000D U+000A)
- \n (LF, U+000A)
- \r (CR, U+000D)
Unicode Technical Standard #18 (RL1.6, Line Boundaries) also highlights these line boundaries:
- \u0085 (NEL, Next Line)
- \u2028 (LS, Line Separator)
- \u2029 (PS, Paragraph Separator)
RL1.6 also lists two legacy control characters, which are treated as line breaks only in some contexts:
- \v (VT, Vertical Tab, U+000B)
- \f (FF, Form Feed, U+000C)
So a robust "split by lines" implementation should not assume only CR and LF.
#Why this matters in real code
You can receive these characters from:
- Copied text from office tools or legacy systems
- Files transformed through multiple encoding and normalization steps
If you only split on \r\n or \n, some records can stay merged into one line, which may break parsing, validation, or reporting.
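To make the failure mode concrete, here is a minimal sketch. The input string is made up for illustration, and the character list is simply the set of line breaks listed earlier in this post.

```csharp
using System;

// Three records separated by LS (U+2028) and NEL (U+0085).
// This input is hypothetical, e.g. text exported from a legacy tool.
string input = "id=1\u2028id=2\u0085id=3";

// Naive split: only CRLF and LF are treated as line breaks.
string[] naive = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
Console.WriteLine(naive.Length); // 1: the three records stay merged

// Splitting on the wider set of line break characters separates them.
// Note: a plain char split would treat "\r\n" as two breaks and produce an
// empty entry; this input contains no CRLF, so that does not show up here.
char[] breaks = { '\n', '\r', '\v', '\f', '\u0085', '\u2028', '\u2029' };
string[] records = input.Split(breaks);
Console.WriteLine(records.Length); // 3
```

The char-based split is only there to show the difference. It does not treat CRLF as a single break, so it is not a complete solution on its own.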
#Safer parsing in .NET
If you need to split text into lines, prefer a pattern that handles Unicode newline sequences.
C#
using System.Text.RegularExpressions;
string input = "A\u2028B\u0085C\r\nD";
string[] lines = Regex.Split(input, @"(?>\r\n|[\n\v\f\r\u0085\u2028\u2029])");
// lines = ["A", "B", "C", "D"]
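In .NET, \R is not yet supported, so you must use an explicit pattern such as (?>\r\n|[\n\v\f\r\u0085\u2028\u2029]). The \r\n alternative comes first so that a CRLF pair is matched as a single separator rather than as two.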
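The reason the pattern lists \r\n before the character class can be seen with a small comparison. This is only a sketch: the variable names are illustrative, and the first pattern is deliberately incomplete.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "A\r\nB";

// Character class only: \r and \n each match on their own,
// so CRLF produces an empty entry between them.
string[] charClassOnly = Regex.Split(input, @"[\n\v\f\r\u0085\u2028\u2029]");
Console.WriteLine(charClassOnly.Length); // 3: "A", "", "B"

// \r\n tried first: CRLF is consumed as one separator.
string[] crlfFirst = Regex.Split(input, @"(?>\r\n|[\n\v\f\r\u0085\u2028\u2029])");
Console.WriteLine(crlfFirst.Length); // 2: "A", "B"
```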
If you want to avoid allocating a string[] and one string per line, .NET 9 also provides Regex.EnumerateSplits:
C#
using System;
using System.Text.RegularExpressions;
ReadOnlySpan<char> input = "A\u2028B\u0085C\r\nD";
foreach (Range split in Regex.EnumerateSplits(input, @"(?>\r\n|[\n\v\f\r\u0085\u2028\u2029])"))
{
    ReadOnlySpan<char> line = input[split];
    ProcessLine(line);
}

static void ProcessLine(ReadOnlySpan<char> line)
{
    // TODO: process the line without allocations
}
For normalization, .NET also provides string.ReplaceLineEndings, which is useful when you want to convert all line endings to a single convention before processing.
C#
string input = "A\u2028B\u0085C\r\nD";
string normalized = input.ReplaceLineEndings("\n");
string[] lines = normalized.Split('\n');
// normalized = "A\nB\nC\nD"
// lines = ["A", "B", "C", "D"]
If you want a dedicated helper to enumerate lines, you can also look at this implementation in Meziantou.Framework: StringExtensions.SplitLines.cs.
C#
using Meziantou.Framework;
string input = "A\r\nB\nC\rD";
foreach (var (line, separator) in input.SplitLines())
{
    Console.WriteLine($"Line='{line}', Separator='{separator}'");
}
// Output (separators shown escaped for readability):
// Line='A', Separator='\r\n'
// Line='B', Separator='\n'
// Line='C', Separator='\r'
// Line='D', Separator=''
SplitLines is great when you need to preserve the exact separator. It also avoids allocations by using ReadOnlySpan<char>, and it supports three break modes: Standard, Unicode, and UnicodeWithLegacyControls.
Note that TextReader.ReadLine() only recognizes \r\n, \n, and \r. It does not handle the Unicode line separators, so it may not be suitable for all scenarios.
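A quick way to see this is to read a string through StringReader, which is a TextReader. The input below is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// CRLF, then LS (U+2028), then NEL (U+0085).
string input = "A\r\nB\u2028C\u0085D";

var lines = new List<string>();
using (var reader = new StringReader(input))
{
    string? line;
    while ((line = reader.ReadLine()) is not null)
    {
        lines.Add(line);
    }
}

// Only the CRLF is treated as a line break; LS and NEL stay inside the second line.
Console.WriteLine(lines.Count); // 2
```

If the data may contain Unicode line separators, normalizing it first with ReplaceLineEndings or using one of the splitting approaches above is one way to work around this.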