Correctly converting a character to lower/upper case

 
 
  • Gérald Barré

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

Strings are complicated! One common mistake is using char.IsUpper or char.ToUpper incorrectly, for instance when converting the first character of a string to uppercase for display. The naive approach, which is often incorrect, looks like this:

C#
static string FirstCharacterToUpperCaseBad(string str)
{
    if(string.IsNullOrEmpty(str) || char.IsUpper(str[0]))
        return str;
    return char.ToUpperInvariant(str[0]) + str[1..];
}

This method works well for many strings. For instance, "abc" will correctly be changed to "Abc". However, the Latin alphabet is not the only one. Consider the Osage alphabet. The character 𐓸 should become 𐓐 when converted to uppercase. However, FirstCharacterToUpperCaseBad("𐓸") returns the same string.

In .NET, a string is a sequential read-only collection of char objects. A char represents a UTF-16 code unit. UTF-16 is a character encoding that maps Unicode code points to sequences of 16-bit code units. It is a variable-length encoding, where code points are encoded using one or two 16-bit code units.

The string "𐓸" consists of two chars because the character requires two UTF-16 code units. Consequently, "𐓸".Length returns 2. The following screenshot shows how the characters a and 𐓸 are encoded in UTF-16:

[Screenshot: UTF-16 encoding of the characters a and 𐓸. Source: https://tools.meziantou.net/string-info]
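
The same information can be printed from code. The following is a minimal sketch, assuming top-level statements and that 𐓸 is the code point U+104F8 (written as an escape so the source file encoding does not matter). CodeUnits is a hypothetical helper name introduced for this illustration, not part of .NET:

```csharp
using System;
using System.Linq;

// Hypothetical helper: formats each UTF-16 code unit (char) of a string as 4 hex digits.
static string CodeUnits(string s) =>
    string.Join(" ", s.Select(c => ((int)c).ToString("X4")));

Console.WriteLine(CodeUnits("a"));            // 0061
Console.WriteLine(CodeUnits("\U000104F8"));   // D801 DCF8
Console.WriteLine("\U000104F8".Length);       // 2
```

The two values D801 and DCF8 are the high and low surrogates that together encode the single code point U+104F8.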

Accessing "𐓸"[0] retrieves only the first UTF-16 code unit of "𐓸". This represents only half of the character. Without the second half, it is impossible to determine if the character is uppercase or how to change its casing. Therefore, char.ToUpperInvariant("𐓸"[0]) returns the character unchanged.
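
As a small sketch of this behavior (again assuming top-level statements and that 𐓸 is U+104F8), the first char on its own is only a high surrogate, so the char APIs have nothing meaningful to work with:

```csharp
using System;

string osage = "\U000104F8";   // the Osage letter 𐓸, written as an escape
char first = osage[0];         // only the first UTF-16 code unit

Console.WriteLine(char.IsHighSurrogate(first));             // True
Console.WriteLine(char.IsLetter(first));                    // False
Console.WriteLine(char.IsUpper(first));                     // False
Console.WriteLine(char.ToUpperInvariant(first) == first);   // True (unchanged)
```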

The correct approach involves checking if the first character is part of a surrogate pair (composed of two chars) and using both chars for the conversion. Instead of manually handling this with char.IsSurrogate, you can use the Rune type to manage the complexity:

C#
using System.Buffers; // OperationStatus
using System.Text;    // Rune

static string FirstCharacterToUpperCase(string str)
{
    if (string.IsNullOrEmpty(str))
        return str;

    // Get the first Rune of the string (one or two chars)
    var result = Rune.DecodeFromUtf16(str, out var rune, out var charsConsumed);

    // Return the string as-is if decoding failed (e.g. a lone surrogate)
    // or if the rune is already uppercase
    if (result != OperationStatus.Done || Rune.IsUpper(rune))
        return str;

    // Convert the first rune to uppercase and concatenate it to the rest of the string
    return Rune.ToUpperInvariant(rune) + str[charsConsumed..];
}

You can now test this method with various strings:

C#
FirstCharacterToUpperCase("abc def");   // Abc def   (Latin)
FirstCharacterToUpperCase("𐓷𐓘𐓻𐓘𐓻𐓟 𐒻𐓟"); // 𐓏𐓘𐓻𐓘𐓻𐓟 𐒻𐓟 (Osage)
FirstCharacterToUpperCase("𐐿𐐱𐐻");       // 𐐗𐐱𐐻       (Deseret)
// etc. (U+10C80, U+118A0, U+16E40)

In general, when working with arbitrary text, consider using Rune instead of char.
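
For example, string.EnumerateRunes() iterates over a string one Rune at a time, so surrogate pairs are handled for you. The following is a minimal sketch, assuming top-level statements; CountUpper is a hypothetical helper introduced for this illustration, and the Osage letters are written as escapes (U+104D0 for 𐓐 and U+104F8 for 𐓸):

```csharp
using System;
using System.Text;

// Hypothetical helper: counts uppercase letters by iterating runes, not chars.
static int CountUpper(string text)
{
    int count = 0;
    foreach (Rune rune in text.EnumerateRunes())
    {
        if (Rune.IsUpper(rune))
            count++;
    }
    return count;
}

Console.WriteLine(CountUpper("Abc"));                    // 1
Console.WriteLine(CountUpper("\U000104D0\U000104F8"));   // 1
```

A char-based loop using char.IsUpper would see four surrogate code units in the second string and would not count the uppercase letter.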
