Counting characters may seem trivial. You can just use the string.Length property, can't you? Unfortunately, it's not that simple. If you remember my previous post about comparing strings, you know that strings can be very tricky 😃
Let's use the character 👨‍👩‍👧‍👦. The "Family: Man, Woman, Girl, Boy" emoji is a sequence of the 👨 Man, 👩 Woman, 👧 Girl and 👦 Boy emojis. These are combined using a zero width joiner (U+200D) between each emoji and display as a single emoji on supported platforms. You could consider that there is only one character. But at the same time, it would be valid to say the string is 4 characters long, as it's a sequence of 4 emojis. Or maybe it is 7 characters long if you count the joiners. Or maybe there are more characters? Let's ask .NET:
// The zero width joiners are written as \u200D so they survive copy/paste
Console.WriteLine("👨\u200D👩\u200D👧\u200D👦".Length); // 11 😲
To understand what it means, let's go back to the basics: what is a character?
Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols. Unicode defines, among other things, a list of characters and multiple ways to represent them:
- UTF-32: Encodes every code point as a fixed-length 4-byte value
- UTF-16: Encodes code points that fit in 2 bytes as a single 2-byte code unit, and all other code points as a pair of 2-byte code units (4 bytes)
- UTF-8: Encodes each code point using a variable length of 1 to 4 bytes
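To make the difference between the three encodings concrete, here is a minimal sketch that asks .NET how many bytes a string needs in each of them. It assumes .NET 5 or later (for top-level statements), and ByteCounts is a hypothetical helper name used only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper: number of bytes needed to store `text` in each encoding.
static (int Utf8, int Utf16, int Utf32) ByteCounts(string text) =>
    (Encoding.UTF8.GetByteCount(text),
     Encoding.Unicode.GetByteCount(text), // Encoding.Unicode is UTF-16 (little endian)
     Encoding.UTF32.GetByteCount(text));

Console.WriteLine(ByteCounts("a"));            // (1, 2, 4)
Console.WriteLine(ByteCounts("\u00E0"));       // (2, 2, 4) for "à"
Console.WriteLine(ByteCounts("\uD852\uDF62")); // (4, 4, 4) for "𤭢"
```

Only UTF-32 uses the same number of bytes for every code point. The two others trade that simplicity for a smaller size on common text.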
In .NET, a char represents a UTF-16 code unit. This means some characters are represented using 1 char (16 bits) and some are represented using 2 chars (32 bits). When a character needs 2 chars, this is called a surrogate pair. You can check if 2 chars represent a surrogate pair using char.IsSurrogatePair:
// False, U+0061 ("a") and U+0300 (combining grave accent) are 2 different characters
Console.WriteLine(char.IsSurrogatePair('\u0061', '\u0300'));
// True, \uD852\uDF62 is the UTF-16 representation of the Unicode character U+24B62 "𤭢"
Console.WriteLine(char.IsSurrogatePair('\uD852', '\uDF62'));
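If you wonder how 2 chars become a single code point, the following sketch converts the surrogate pair above back to U+24B62, first with char.ConvertToUtf32 and then by hand. It assumes .NET 5 or later (for top-level statements), and the variable names are only illustrative.

```csharp
using System;

char high = '\uD852'; // high surrogate
char low = '\uDF62';  // low surrogate

// Built-in conversion from a surrogate pair to a code point
int codePoint = char.ConvertToUtf32(high, low);

// Same computation by hand: each surrogate carries 10 bits of the value above 0xFFFF
int manual = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

Console.WriteLine($"U+{codePoint:X}");                                 // U+24B62
Console.WriteLine(codePoint == manual);                                // True
Console.WriteLine(char.ConvertFromUtf32(codePoint) == "\uD852\uDF62"); // True
```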
.NET Core 3.0 introduced a new type: Rune. A rune represents a Unicode scalar value and is encoding-agnostic. You can enumerate the runes of a string using string.EnumerateRunes(). For instance, the character "𤭢" is represented using 2 UTF-16 code units (a surrogate pair), 0xD852 and 0xDF62, but there is only 1 rune.
using System.Linq; // required for Count()

// First example: character encoded as a single UTF-16 code unit
var str = "a"; // character "a"
Console.WriteLine(str); // Print "a"
Console.WriteLine(str.Length); // 1
Console.WriteLine(str.EnumerateRunes().Count()); // 1

// Second example: surrogate pair
str = "\uD852\uDF62"; // character "𤭢" encoded in UTF-16
Console.WriteLine(str); // Print "𤭢"
Console.WriteLine(str.Length); // 2
Console.WriteLine(str.EnumerateRunes().Count()); // 1 (U+24B62)
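If you want to look inside each rune, here is a small sketch that prints its scalar value and how many code units it takes in UTF-16 and UTF-8. It assumes .NET 5 or later, and Describe is a hypothetical helper written for this example, not something provided by .NET.

```csharp
using System;
using System.Text;

// Hypothetical helper: formats a rune with its UTF-16 and UTF-8 sizes.
static string Describe(Rune rune) =>
    $"U+{rune.Value:X4}: {rune.Utf16SequenceLength} char(s), {rune.Utf8SequenceLength} UTF-8 byte(s)";

foreach (Rune rune in "a\uD852\uDF62".EnumerateRunes())
{
    Console.WriteLine(Describe(rune));
}
// U+0061: 1 char(s), 1 UTF-8 byte(s)
// U+24B62: 2 char(s), 4 UTF-8 byte(s)
```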
There is a third way to deal with characters: text elements. .NET defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A combination of multiple Unicode scalars can be represented as a single grapheme. For instance, you can represent the character "à" (Latin small letter a with grave, U+00E0) using 2 Unicode scalars: the Latin small letter "a" (U+0061) followed by the combining grave accent (U+0300). While there are 2 characters (Unicode scalars), there is only 1 grapheme. You can count text elements using StringInfo.LengthInTextElements and enumerate them using StringInfo.GetTextElementEnumerator:
// Requires: using System.Globalization; using System.Linq;
// The following string contains 2 characters: "Latin small letter a" (U+0061) and the "combining grave accent" (U+0300).
// Note that this is different from the character "Latin small letter a with grave" (U+00E0)!
var str = "\u0061\u0300";
Console.WriteLine(str); // Print "à", there is only one grapheme
Console.WriteLine(str.Length); // 2, there are 2 UTF-16 code units
Console.WriteLine(str.EnumerateRunes().Count()); // 2, there are 2 Unicode scalars
Console.WriteLine(new StringInfo(str).LengthInTextElements); // 1, there is only 1 grapheme
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
// MoveNext() must be called before reading Current
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current); // Only one item: "à"
}
The current versions of .NET Framework and .NET Core don't implement the latest version of the algorithm used to count the number of graphemes (UAX #29). This means the count may be wrong in some cases, mostly with emojis. This is fixed in .NET 5, so in a few months everything should be fine! Here's an example:
// Parentheses show surrogate pairs
// (\uD83D, \uDC68), \u200D, (\uD83D, \uDC69), \u200D, (\uD83D, \uDC67), \u200D, (\uD83D, \uDC66)
// U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466
var str = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";
// 1 in .NET 5
// Previous versions of .NET don't follow the latest version of Unicode Text Segmentation (UAX #29), and display 7
Console.WriteLine(new StringInfo(str).LengthInTextElements);
Unicode defines a list of characters, multiple ways to encode them (UTF-8, UTF-16, UTF-32), and also how to group them to create graphemes. In .NET, a System.Char represents a UTF-16 code unit.
- string.Length: Number of UTF-16 code units needed to represent the string
- EnumerateRunes().Count(): Number of Unicode scalars (Rune) in the string
- new StringInfo(s).LengthInTextElements: Number of graphemes in the string (StringInfo.GetTextElementEnumerator(s) enumerates them)
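To see the three counts side by side, here is a minimal sketch that wraps them in a single helper. It assumes .NET 5 or later, and CountCharacters is a hypothetical name chosen for this example.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: returns the three "lengths" of a string.
static (int CodeUnits, int Scalars, int Graphemes) CountCharacters(string text) =>
    (text.Length,                               // UTF-16 code units
     text.EnumerateRunes().Count(),             // Unicode scalars
     new StringInfo(text).LengthInTextElements  // graphemes
    );

Console.WriteLine(CountCharacters("a"));            // (1, 1, 1)
Console.WriteLine(CountCharacters("\uD852\uDF62")); // (2, 1, 1)
Console.WriteLine(CountCharacters("\u0061\u0300")); // (2, 2, 1)
```

For the family emoji, the first two values are 11 and 7 on every version, while the grapheme count depends on the version of .NET, as explained above.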