How to remove diacritics from a string in .NET

 
 
Gérald Barré

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

Diacritics are marks added to characters to convey additional information. They appear in many languages, such as French, Spanish, and German, and commonly indicate accents (e.g., é).

The post How to correctly count the number of characters of a string covers diacritics in more detail. To recap: a diacritic is a glyph added to a letter. For instance, é is composed of the letter e and the acute accent ´. In Unicode, a diacritic can be encoded as a combining character that follows its base character, so é can be written as two code points: e (U+0065 Latin Small Letter E) followed by ´ (U+0301 Combining Acute Accent). The same é can also be represented by the single precomposed character é (U+00E9 Latin Small Letter E with Acute).
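To make the two representations concrete, here is a minimal sketch that builds both with `\u` escapes, so the result does not depend on how the source file is encoded. The variable names are only illustrative.

```csharp
using System;

// Two ways to write "é"
string precomposed = "\u00E9";  // U+00E9 Latin Small Letter E with Acute
string combining = "e\u0301";   // U+0065 followed by U+0301 Combining Acute Accent

// Length counts UTF-16 code units, so the two strings differ in length
Console.WriteLine(precomposed.Length); // 1
Console.WriteLine(combining.Length);   // 2

// An ordinal comparison looks at the raw code units, so the strings are not equal
Console.WriteLine(string.Equals(precomposed, combining, StringComparison.Ordinal)); // False
```

Both strings render the same on screen, which is why this difference is easy to miss.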

In .NET, you can convert a string between its composed form (NFC) and its decomposed form (NFD) using the Normalize method. In the composed form, a diacritic is combined with its base character into a single precomposed character where one exists. In the decomposed form, the diacritic is a separate combining character. For example, the composed form of é is U+00E9, while the decomposed form is U+0065 followed by U+0301. The following code illustrates the difference:

C#
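using System;
using System.Text;

// "\u00E9" is the precomposed é; the escape keeps the example independent of the file's encoding
EnumerateRune("\u00E9");
// é (00E9 LowercaseLetter)

EnumerateRune("\u00E9".Normalize(NormalizationForm.FormD));
// e (0065 LowercaseLetter)
// ◌́ (0301 NonSpacingMark), the combining acute accent, drawn here on a dotted circle

// Prints each Unicode scalar value with its code point and Unicode category
void EnumerateRune(string str)
{
    foreach (var rune in str.EnumerateRunes())
    {
        Console.WriteLine($"{rune} ({rune.Value:X4} {Rune.GetUnicodeCategory(rune)})");
    }
}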
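As a complement, this short sketch checks which form a string is in with `IsNormalized` and shows that composing a decomposed string gives back the original precomposed character. It only uses the `\u00E9` example from above.

```csharp
using System;
using System.Text;

string composed = "\u00E9"; // precomposed é
string decomposed = composed.Normalize(NormalizationForm.FormD);

Console.WriteLine(composed.IsNormalized(NormalizationForm.FormC)); // True
Console.WriteLine(composed.IsNormalized(NormalizationForm.FormD)); // False
Console.WriteLine(decomposed.Length);                              // 2

// Composing the decomposed string restores the single precomposed character
string roundTripped = decomposed.Normalize(NormalizationForm.FormC);
Console.WriteLine(string.Equals(composed, roundTripped, StringComparison.Ordinal)); // True
```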

Now that you know how to convert a string to its decomposed form, you can remove the diacritics. A common algorithm is:

  • Normalize the string to Unicode Normalization Form D (NFD).
  • Iterate over each character and keep only those whose Unicode category is not NonSpacingMark.
  • Concatenate the remaining characters to get the final string.
C#
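using System.Globalization;
using System.Text;

public static string RemoveDiacritics(string input)
{
    // Decompose each character into its base character followed by its combining marks
    string normalized = input.Normalize(NormalizationForm.FormD);
    StringBuilder builder = new StringBuilder(normalized.Length);

    foreach (char c in normalized)
    {
        // Keep everything except the combining marks (category NonSpacingMark)
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }

    return builder.ToString();
}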
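Here is a small usage sketch of the method above, repeated as a local function so the snippet is self-contained. The sample words are my own. The last two show a limit of the approach: letters such as Ł or ß have no canonical decomposition, so they contain no combining mark to remove.

```csharp
using System;
using System.Globalization;
using System.Text;

Console.WriteLine(RemoveDiacritics("Cr\u00E8me br\u00FBl\u00E9e")); // Creme brulee
Console.WriteLine(RemoveDiacritics("fa\u00E7ade"));                 // facade
Console.WriteLine(RemoveDiacritics("\u0141\u00F3d\u017A"));         // Łodz (Ł is kept as-is)
Console.WriteLine(RemoveDiacritics("Stra\u00DFe"));                 // Straße (ß is kept as-is)

// Same algorithm as above: decompose, then drop the non-spacing marks
static string RemoveDiacritics(string input)
{
    string normalized = input.Normalize(NormalizationForm.FormD);
    StringBuilder builder = new StringBuilder(normalized.Length);

    foreach (char c in normalized)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }

    return builder.ToString();
}
```

Note that the returned string is still in decomposed form. This matters for scripts such as Hangul, where NFD splits syllables into several letters that are not non-spacing marks. If that is a concern, one option is to call `Normalize(NormalizationForm.FormC)` on the result.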

Removing diacritics is a common question on the internet, but the main use case is comparing strings in an accent-insensitive manner. For that, you can pass CompareOptions.IgnoreNonSpace to the string.Compare method instead of using the previous method. This is faster and avoids errors due to the complexity of the Unicode standard.

C#
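using System.Globalization;

public static bool AreEqualIgnoringAccents(string s1, string s2)
{
    // IgnoreNonSpace makes the comparison ignore combining marks such as accents
    return string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) == 0;
}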
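The sketch below shows a few ways the option can be used. It makes two assumptions that differ from the method above: it uses CultureInfo.InvariantCulture to keep the results independent of the machine's culture, and it assumes the application is not running in globalization-invariant mode, where culture-aware comparisons fall back to ordinal behavior. The variable names are hypothetical.

```csharp
using System;
using System.Globalization;

CultureInfo culture = CultureInfo.InvariantCulture; // assumption: fixed culture for repeatable results

// Accents are ignored, but casing still matters
bool sameIgnoringAccents = string.Compare("r\u00E9sum\u00E9", "resume", culture, CompareOptions.IgnoreNonSpace) == 0;
bool sameWithDifferentCase = string.Compare("r\u00E9sum\u00E9", "RESUME", culture, CompareOptions.IgnoreNonSpace) == 0;
Console.WriteLine(sameIgnoringAccents);   // True
Console.WriteLine(sameWithDifferentCase); // False

// Options are flags, so accent- and case-insensitivity can be combined
bool sameIgnoringBoth = string.Compare("R\u00C9SUM\u00C9", "resume", culture,
    CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0;
Console.WriteLine(sameIgnoringBoth); // True

// The same option works for searching through CompareInfo
int index = culture.CompareInfo.IndexOf("Cr\u00E8me br\u00FBl\u00E9e", "brulee", CompareOptions.IgnoreNonSpace);
Console.WriteLine(index); // 6
```

Searching this way returns an index into the original string, which is harder to get if you strip the diacritics first.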
