How to correctly count the number of characters of a string

 
 
  • Gérald Barré

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

Counting characters seems trivial: just use the string.Length property. Unfortunately, it is not that simple. If you have read my previous post about comparing strings, you already know that strings can be surprisingly tricky.

Consider the character 👨‍👩‍👧‍👦. The "Family: Man, Woman, Girl, Boy" emoji is a sequence of the 👨 Man, 👩 Woman, 👧 Girl, and 👦 Boy emojis joined by zero-width joiners, displayed as a single emoji on supported platforms. You could say the string is 1 character long, or 4 (one per emoji), or 7 (counting the joiners). Let's ask .NET:

C#
Console.WriteLine("👨‍👩‍👧‍👦".Length); // 11 😲

To understand what it means, let's go back to the basics: what is a character?

Unicode is the universal character encoding maintained by the Unicode Consortium. It provides the basis for processing, storage, and interchange of text data in any language across all modern software and information technology protocols. Unicode defines, among other things, a list of characters and multiple ways to represent them:

  • UTF-32: Every code point is stored as a single fixed-length 32-bit (4-byte) code unit
  • UTF-16: Code points that fit in 16 bits are stored as a single 16-bit code unit. Code points that require more than 16 bits are stored as a pair of 16-bit code units (a surrogate pair)
  • UTF-8: Every code point is stored as a variable-length sequence of 1 to 4 bytes
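
To make the size differences concrete, here is a minimal sketch that measures the same text in the three encoding forms. The `ByteCounts` helper is a name I made up for this illustration; it only relies on `Encoding.GetByteCount`, and it assumes that `Encoding.Unicode` is the UTF-16 little-endian encoding, as it is in .NET.

```csharp
using System;
using System.Text;

// Hypothetical helper: byte count of the same text in UTF-8, UTF-16 and UTF-32.
// GetByteCount does not include a byte order mark.
static (int Utf8, int Utf16, int Utf32) ByteCounts(string text) =>
    (Encoding.UTF8.GetByteCount(text),
     Encoding.Unicode.GetByteCount(text),
     Encoding.UTF32.GetByteCount(text));

Console.WriteLine(ByteCounts("a"));            // (1, 2, 4)
Console.WriteLine(ByteCounts("\u00E9"));       // (2, 2, 4)  U+00E9 "é"
Console.WriteLine(ByteCounts("\u20AC"));       // (3, 2, 4)  U+20AC "€"
Console.WriteLine(ByteCounts("\uD852\uDF62")); // (4, 4, 4)  U+24B62 "𤭢"
```

UTF-32 always uses 4 bytes per code point, while UTF-8 and UTF-16 grow only when the code point needs it.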

In .NET, a char represents a character as a UTF-16 code unit. Some characters fit in 1 char (16 bits), while others require 2 char values (32 bits). The latter are called surrogate pairs. You can check whether two characters form a surrogate pair using char.IsSurrogatePair(char, char).

C#
Console.WriteLine(char.IsSurrogatePair('\u0061', '\u0300'));
// False: U+0061 ("a") and U+0300 (combining grave accent) are 2 separate characters, neither of which is a surrogate

Console.WriteLine(char.IsSurrogatePair('\uD852', '\uDF62'));
// True, \uD852\uDF62 is the UTF-16 representation of the Unicode character U+24B62 "𤭢"
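
If you want to see how a surrogate pair maps back to its code point, the following sketch may help. It uses `char.ConvertToUtf32` and `char.ConvertFromUtf32`, and compares the result with the arithmetic defined by UTF-16. The variable names are only for illustration.

```csharp
using System;

char high = '\uD852'; // high (leading) surrogate
char low = '\uDF62';  // low (trailing) surrogate

// Combine the pair into the code point it encodes
int codePoint = char.ConvertToUtf32(high, low);
Console.WriteLine($"U+{codePoint:X}"); // U+24B62

// Same computation by hand: each surrogate carries 10 bits of (code point - 0x10000)
int manual = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
Console.WriteLine(manual == codePoint); // True

// And back: a code point above U+FFFF becomes a string of 2 chars
string roundTrip = char.ConvertFromUtf32(codePoint);
Console.WriteLine(roundTrip.Length); // 2
```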

.NET Core 3.0 introduced a new type: System.Text.Rune. A Rune represents a Unicode scalar value and is encoding-agnostic. You can enumerate Runes in a string using string.EnumerateRunes(). For instance, the character "𤭢" is represented by 2 UTF-16 code units (a surrogate pair): 0xD852 and 0xDF62, but there is only 1 rune.

C#
using System;
using System.Linq;

// First example: character that fits in a single UTF-16 code unit
var single = "a"; // character "a"
Console.WriteLine(single); // Print "a"
Console.WriteLine(single.Length); // 1
Console.WriteLine(single.EnumerateRunes().Count()); // 1

// Second example: surrogate pair
var pair = "\uD852\uDF62"; // character "𤭢" encoded in UTF-16
Console.WriteLine(pair); // Print "𤭢"
Console.WriteLine(pair.Length); // 2
Console.WriteLine(pair.EnumerateRunes().Count()); // 1 (U+24B62)
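
Each Rune also knows how many code units it needs in UTF-16 and UTF-8. Here is a small sketch that walks through a sample string of my choosing and prints these properties:

```csharp
using System;
using System.Text;

// "a", "é" (U+00E9) and "𤭢" (U+24B62, written as a surrogate pair)
string text = "a\u00E9\uD852\uDF62";

int runeCount = 0;
foreach (Rune rune in text.EnumerateRunes())
{
    runeCount++;
    Console.WriteLine($"U+{rune.Value:X4} utf16={rune.Utf16SequenceLength} utf8={rune.Utf8SequenceLength} bmp={rune.IsBmp}");
}
// U+0061 utf16=1 utf8=1 bmp=True
// U+00E9 utf16=1 utf8=2 bmp=True
// U+24B62 utf16=2 utf8=4 bmp=False

Console.WriteLine(text.Length); // 4 UTF-16 code units
Console.WriteLine(runeCount);   // 3 Unicode scalars
```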

A third concept for dealing with characters is TextElement. .NET defines a text element as a unit of text displayed as a single character, that is, a grapheme. A combination of multiple Unicode scalars can form a single grapheme. For instance, the character "à" (latin small letter a with grave, U+00E0) can be represented using 2 Unicode scalars: the latin small letter "a" (U+0061) followed by the combining grave accent (U+0300). Although there are 2 Unicode scalars, there is only 1 grapheme. You can enumerate text elements using StringInfo.GetTextElementEnumerator(string).

C#
using System;
using System.Globalization;
using System.Linq;

// The following string contains 2 characters: "latin small letter a" (U+0061) and the "Combining grave accent" (U+0300).
// Note that this is different from the character "latin small letter a with grave" (U+00E0)!
var str = "\u0061\u0300";

Console.WriteLine(str); // Print "à", there is only one grapheme
Console.WriteLine(str.Length); // 2, there are 2 UTF-16 code units
Console.WriteLine(str.EnumerateRunes().Count()); // 2, there are 2 unicode scalars
Console.WriteLine(new StringInfo(str).LengthInTextElements); // 1, there is only 1 grapheme

TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current); // Only one item: "à"
}
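
One place where text elements matter in practice is when cutting a string. The sketch below is only an illustration: `TruncateTextElements` is a hypothetical helper (not a .NET API) that keeps at most N text elements using the same enumerator, so a base letter is never separated from its combining accent.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper: keep at most maxTextElements graphemes from the start of text
static string TruncateTextElements(string text, int maxTextElements)
{
    var builder = new StringBuilder();
    TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
    int count = 0;
    while (count < maxTextElements && enumerator.MoveNext())
    {
        builder.Append(enumerator.GetTextElement());
        count++;
    }

    return builder.ToString();
}

string input = "a\u0300bc"; // "àbc" where "à" is "a" + combining grave accent

// Substring works on UTF-16 code units and drops the accent
Console.WriteLine(input.Substring(0, 1) == "a"); // True

// Truncating by text element keeps "a" and its accent together
Console.WriteLine(TruncateTextElements(input, 1) == "a\u0300"); // True
Console.WriteLine(TruncateTextElements(input, 1).Length);       // 2
```

Whether you should truncate by code units, scalars, or text elements depends on what the limit is for (storage size versus what the user sees), so treat this as one option among others.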

.NET Framework and .NET Core 3.1 (and older) do not implement the latest version of the Unicode Text Segmentation algorithm (UAX #29), so the grapheme count can be incorrect in some cases. This was fixed in .NET 5, primarily for emojis, as shown in the following example:

C#
Console.WriteLine("👨‍👩‍👧‍👦".Length);
// 11
// Parentheses show surrogate pairs
// (\uD83D, \uDC68), \u200D, (\uD83D, \uDC69), \u200D, (\uD83D, \uDC67), \u200D, (\uD83D, \uDC66)

Console.WriteLine("👨‍👩‍👧‍👦".EnumerateRunes().Count());
// 7
// U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466

Console.WriteLine(new StringInfo("👨‍👩‍👧‍👦").LengthInTextElements);
// 1 in .NET 5+
// Previous versions of .NET don't follow the latest version of Unicode Text Segmentation (UAX #29), and return 7

#Conclusion

Unicode defines a list of characters, multiple ways to encode them (UTF-8, UTF-16, UTF-32), and how to group them into graphemes. In .NET, a System.Char represents a UTF-16 code unit, not a full character.

  • string.Length: Number of UTF-16 code units needed to represent the string
  • EnumerateRunes().Count(): Number of Unicode scalars (Rune) in the string
  • new StringInfo(s).LengthInTextElements (or counting the items returned by StringInfo.GetTextElementEnumerator(s)): Number of graphemes in the string
