How to correctly count the number of characters of a string
This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
- String comparisons are harder than it seems
- How to correctly count the number of characters of a string (this post)
- Correctly converting a character to lower/upper case
- How not to read a string from an UTF-8 stream
- Regex with IgnoreCase option may match more characters than expected
- How to remove diacritics from a string in .NET
Counting characters may seem trivial. After all, you can use the string.Length property, can't you? Unfortunately, it is not that simple. If you remember my previous post about comparing strings, you know that strings can be very tricky 😃
Let's use the character 👨‍👩‍👧‍👦. The "Family: Man, Woman, Girl, Boy" emoji is a sequence of the 👨 Man, 👩 Woman, 👧 Girl and 👦 Boy emojis. These are combined using a zero width joiner (U+200D) between each emoji and are displayed as a single emoji on supported platforms. You could consider that there is only one character. But it could also be valid to say the string is 4 characters long, as it's a sequence of 4 emojis. Or maybe it is 7 characters long if you count the 3 joiners. Or maybe there are even more characters? Let's ask .NET:
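// The zero width joiners (U+200D) are written as escapes so they stay visible in the source
Console.WriteLine("👨\u200D👩\u200D👧\u200D👦".Length); // 11 😲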
To understand what it means, let's go back to the basics: what is a character?
Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage, and interchange of text data in any language in all modern software and information technology protocols. Unicode defines, among other things, a list of characters and multiple ways to represent them:
- UTF-32: Each code point is stored as a fixed-length value of 4 bytes
- UTF-16: Code points that fit in 2 bytes (U+0000 to U+FFFF) are stored as a single 2-byte code unit; code points above that range are stored as a pair of 2-byte code units (4 bytes)
- UTF-8: Each code point is stored using a variable length of 1 to 4 bytes
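To make the size differences between the three encodings concrete, here is a minimal sketch that measures how many bytes a string takes in each of them. The EncodingSizes class and its GetByteCounts method are hypothetical names introduced for this example; the Encoding.UTF8, Encoding.Unicode (UTF-16) and Encoding.UTF32 properties are part of .NET.

```csharp
using System;
using System.Text;

// Hypothetical helper, introduced only for illustration
public static class EncodingSizes
{
    // Returns the number of bytes needed to encode the text in UTF-8, UTF-16 and UTF-32.
    // GetByteCount does not include a byte order mark.
    public static (int Utf8, int Utf16, int Utf32) GetByteCounts(string text)
    {
        return (
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text), // Encoding.Unicode is UTF-16 (little-endian)
            Encoding.UTF32.GetByteCount(text));
    }
}
```

For instance, EncodingSizes.GetByteCounts("a") should return (1, 2, 4), while EncodingSizes.GetByteCounts("\uD852\uDF62") (the character "𤭢") should return (4, 4, 4).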
In .NET, a char represents a UTF-16 code unit. This means some characters are represented using 1 char (16 bits) and some are represented using 2 chars (32 bits). When a character needs 2 chars, they are called a surrogate pair. You can check whether 2 chars form a surrogate pair using char.IsSurrogatePair(char, char).
Console.WriteLine(char.IsSurrogatePair('\u0061', '\u0300'));
// False, U+0061 ("a") and U+0300 ("`") are 2 different characters
Console.WriteLine(char.IsSurrogatePair('\uD852', '\uDF62'));
// True, \uD852\uDF62 is the UTF-16 representation of the Unicode character U+24B62 "𤭢"
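To see how a surrogate pair maps back to a single code point, here is a small sketch of the arithmetic. The SurrogatePairs class and its Combine method are hypothetical names used only for this example; .NET already provides the same conversion through char.ConvertToUtf32(char, char), and the reverse through char.ConvertFromUtf32(int).

```csharp
using System;

// Hypothetical helper, introduced only for illustration
public static class SurrogatePairs
{
    // Combines a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF)
    // into the code point they encode (U+10000 and above).
    public static int Combine(char high, char low)
    {
        if (!char.IsSurrogatePair(high, low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // Each surrogate carries 10 bits of the value; 0x10000 is the start of the supplementary range
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For the pair above, SurrogatePairs.Combine('\uD852', '\uDF62') should return 0x24B62, the same value as char.ConvertToUtf32('\uD852', '\uDF62').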
.NET Core 3.0 introduced a new type: Rune. A Rune represents a Unicode scalar value, and it is encoding-agnostic. You can enumerate the Runes of a string using string.EnumerateRunes(). For instance, the character "𤭢" is represented using 2 UTF-16 code units (a surrogate pair), 0xD852 and 0xDF62, but there is only 1 rune.
// Requires: using System; using System.Linq; (Count() is a LINQ extension method)
// First example: single code unit character
var str = "a"; // character "a"
Console.WriteLine(str); // Print "a"
Console.WriteLine(str.Length); // 1
Console.WriteLine(str.EnumerateRunes().Count()); // 1
// Second example: surrogate pair (the variable is reused, so there is no second declaration)
str = "\uD852\uDF62"; // character "𤭢" encoded in UTF-16
Console.WriteLine(str); // Print "𤭢"
Console.WriteLine(str.Length); // 2
Console.WriteLine(str.EnumerateRunes().Count()); // 1 (U+24B62)
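Because a Rune is encoding-agnostic, it can tell you how much space it would take in each encoding. The following sketch lists the runes of a string along with their sizes. The RuneInspector class and its Describe method are hypothetical names introduced for this example; Rune.Value, Rune.Utf16SequenceLength and Rune.Utf8SequenceLength are members of System.Text.Rune.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, introduced only for illustration
public static class RuneInspector
{
    // Returns one description per Unicode scalar value found in the string
    public static List<string> Describe(string s)
    {
        var result = new List<string>();
        foreach (Rune rune in s.EnumerateRunes())
        {
            result.Add($"U+{rune.Value:X4}: {rune.Utf16SequenceLength} UTF-16 code unit(s), {rune.Utf8SequenceLength} UTF-8 byte(s)");
        }

        return result;
    }
}
```

For the string "a\uD852\uDF62", this should produce two entries: one for U+0061 (1 UTF-16 code unit, 1 UTF-8 byte) and one for U+24B62 (2 UTF-16 code units, 4 UTF-8 bytes).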
There is a third concept to deal with characters: the text element. .NET defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A combination of multiple Unicode scalars can be represented as a single grapheme. For instance, you can represent the character "à" (Latin small letter a with grave, U+00E0) using 2 Unicode scalars: the Latin small letter "a" (U+0061) followed by the combining grave accent (U+0300). While there are 2 Unicode scalars, there is only 1 grapheme. You can enumerate text elements using StringInfo.GetTextElementEnumerator(string).
// Requires: using System; using System.Globalization; using System.Linq;
// The following string contains 2 Unicode scalars: "latin small letter a" (U+0061) and "combining grave accent" (U+0300).
// Note that this is different from the single character "latin small letter a with grave" (U+00E0)!
var str = "\u0061\u0300";
Console.WriteLine(str); // Print "à", there is only one grapheme
Console.WriteLine(str.Length); // 2, there are 2 UTF-16 code units
Console.WriteLine(str.EnumerateRunes().Count()); // 2, there are 2 Unicode scalars
Console.WriteLine(new StringInfo(str).LengthInTextElements); // 1, there is only 1 grapheme
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.Current); // Only one item: "à"
}
.NET Framework and the versions of .NET Core before .NET 5 don't implement the latest version of the algorithm used to count graphemes (Unicode Text Segmentation, UAX #29). This means the count may be wrong in some cases, mostly with emojis. This is fixed in .NET 5. The following example shows the difference:
// The zero width joiners (U+200D) are written as escapes so they stay visible in the source
var family = "👨\u200D👩\u200D👧\u200D👦";
Console.WriteLine(family.Length);
// 11
// Parentheses show surrogate pairs
// (\uD83D, \uDC68), \u200D, (\uD83D, \uDC69), \u200D, (\uD83D, \uDC67), \u200D, (\uD83D, \uDC66)
Console.WriteLine(family.EnumerateRunes().Count());
// 7
// U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466
Console.WriteLine(new StringInfo(family).LengthInTextElements);
// 1 in .NET 5
// Previous versions of .NET don't follow the latest version of Unicode Text Segmentation (UAX29), and display 7
Conclusion
Unicode defines a list of characters, multiple ways to encode them (UTF-8, UTF-16, UTF-32), and also how to group them to create graphemes. In .NET, a System.Char represents a UTF-16 code unit.
- string.Length: Number of UTF-16 code units needed to represent the string
- EnumerateRunes().Count(): Number of Unicode scalars (Rune) in the string
- StringInfo.GetTextElementEnumerator(s): Enumerates the graphemes in the string (new StringInfo(s).LengthInTextElements returns their number)
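To compare the three counts side by side, here is a minimal sketch. The CharacterCounts class and its Count method are hypothetical names introduced for this example; only string.Length, EnumerateRunes() and StringInfo.LengthInTextElements come from .NET. It assumes .NET Core 3.0 or later, since EnumerateRunes() is not available before that.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper, introduced only for illustration
public static class CharacterCounts
{
    // Returns the length of the string measured in UTF-16 code units,
    // Unicode scalars (runes) and text elements (graphemes).
    public static (int CodeUnits, int Scalars, int TextElements) Count(string s)
    {
        return (
            s.Length,
            s.EnumerateRunes().Count(),
            new StringInfo(s).LengthInTextElements); // result depends on the .NET version for some emojis
    }
}
```

For instance, CharacterCounts.Count("a\u0300") should return (2, 2, 1), and CharacterCounts.Count("\uD852\uDF62") should return (2, 1, 1).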
Additional references:
- Using Rune by Levi Broderick
- Introducing System.Rune
- .NET 5 breaking change: StringInfo and TextElementEnumerator classes are now UAX29-compliant
- UTF-8, UTF-16, UTF-32 & BOM