This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!
Diacritics are marks added to a letter to convey additional information, such as a change in pronunciation. They are used in many languages, such as French, Spanish, and German. A common example is an accent on a letter (e.g., é). In this post, I describe how to remove diacritics from a string in .NET.
In the post How to correctly count the number of characters of a string, I already wrote about diacritics. Here is a quick reminder: a diacritic is a glyph added to a letter. For instance, the letter é is composed of the letter e and the acute accent ´. In Unicode, a diacritic can be encoded as a combining character that is separate from the base character. This means that the letter é can be represented by 2 code points: e (U+0065 Latin Small Letter E) followed by ´ (U+0301 Combining Acute Accent). Note that the letter é can also be represented by the single precomposed code point é (U+00E9 Latin Small Letter E with Acute).
In .NET, you can convert a string from the composed form to the decomposed form using the Normalize method. The composed form (Normalization Form C, NFC) is the form where the diacritics are combined with the base character whenever a precomposed character exists. The decomposed form (Normalization Form D, NFD) is the form where the diacritics are separated from the base character. For instance, the composed form of the character é is U+00E9, and its decomposed form is U+0065 followed by U+0301. You can quickly see the difference by using the following code:
C#
using System;
using System.Text;

EnumerateRune("é");
// é (00E9 LowercaseLetter)

EnumerateRune("é".Normalize(NormalizationForm.FormD));
// e (0065 LowercaseLetter)
// ´ (0301 NonSpacingMark)

// Prints each Unicode scalar value (rune) with its code point and Unicode category
void EnumerateRune(string str)
{
    foreach (var rune in str.EnumerateRunes())
    {
        Console.WriteLine($"{rune} ({rune.Value:X4} {Rune.GetUnicodeCategory(rune)})");
    }
}
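As a small illustration of why the normalization form matters, here is a sketch that compares the two representations of é directly. The variable names are only for illustration, and the strings are written with escape sequences so the result does not depend on how the source file is encoded.

```csharp
using System;
using System.Text;

string composed = "\u00E9";     // é as a single precomposed code point
string decomposed = "e\u0301";  // e followed by a combining acute accent

// Both strings render the same way, but they differ in length and in ordinal comparison
Console.WriteLine(composed.Length);    // 1
Console.WriteLine(decomposed.Length);  // 2
Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False

// Normalizing to the same form makes them ordinally equal
Console.WriteLine(composed.Normalize(NormalizationForm.FormD) == decomposed); // True
Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == composed); // True

// IsNormalized reports whether a string is already in a given form
Console.WriteLine(decomposed.IsNormalized(NormalizationForm.FormC)); // False
```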
Now that you know how to convert a string to a decomposed form, you can remove the diacritics. The common algorithm to do it is as follows:
- Normalize the string to Unicode Normalization Form D (NFD).
- Iterate over each character and keep only the characters that are not non-spacing marks.
- Concatenate the characters to get the final string.
C#
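// Requires: using System.Globalization; using System.Text;
public static string RemoveDiacritics(string input)
{
    // Decompose each accented character into a base character followed by combining marks
    string normalized = input.Normalize(NormalizationForm.FormD);
    StringBuilder builder = new StringBuilder(normalized.Length);
    foreach (char c in normalized)
    {
        // Skip the combining marks (the diacritics) and keep everything else
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }

    return builder.ToString();
}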
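To see how this algorithm behaves, here is a minimal, self-contained sketch that repeats the same logic as a local function and runs it on a few sample words. The sample words are my own picks, not from a reference test suite.

```csharp
using System;
using System.Globalization;
using System.Text;

Console.WriteLine(RemoveDiacritics("crème brûlée")); // creme brulee
Console.WriteLine(RemoveDiacritics("Łódź"));         // Łodz
Console.WriteLine(RemoveDiacritics("Søren"));        // Søren

// Same algorithm as described above, repeated here so the snippet runs on its own
static string RemoveDiacritics(string input)
{
    string normalized = input.Normalize(NormalizationForm.FormD);
    var builder = new StringBuilder(normalized.Length);
    foreach (char c in normalized)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }

    return builder.ToString();
}
```

Note that letters such as Ł and ø are left untouched. Unicode defines them as standalone letters with no canonical decomposition, so there is no non-spacing mark to remove. If you need to map them to L and o, you would have to add your own replacement table on top of this approach.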
Removing diacritics is a common question on the internet, but the main use case is usually comparing strings in an accent-insensitive manner. In that case, you don't need the previous method: pass the CompareOptions.IgnoreNonSpace option to the string.Compare method instead. This is faster because it avoids building intermediate strings, and it avoids errors due to the complexity of the Unicode standard.
C#
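// Requires: using System.Globalization;
public static bool AreEqualIgnoringAccents(string s1, string s2)
{
    // IgnoreNonSpace makes the comparison ignore non-spacing combining characters such as diacritics
    return string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) == 0;
}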
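The following sketch shows how the option behaves on a few inputs. It uses CultureInfo.InvariantCulture rather than CurrentCulture, which is an assumption made here so that the results do not depend on the machine's culture. It also assumes that globalization data is available, which is the default.

```csharp
using System;
using System.Globalization;

// Assumption: the invariant culture is used to keep the results machine-independent
CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;

// Accents are ignored, so these two strings compare as equal
Console.WriteLine(compareInfo.Compare("café", "cafe", CompareOptions.IgnoreNonSpace) == 0); // True

// Case still matters unless IgnoreCase is also specified
Console.WriteLine(compareInfo.Compare("café", "CAFE", CompareOptions.IgnoreNonSpace) == 0); // False
Console.WriteLine(compareInfo.Compare("café", "CAFE",
    CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0); // True

// The same option also works when searching inside a string
Console.WriteLine(compareInfo.IndexOf("crème brûlée", "brulee", CompareOptions.IgnoreNonSpace) >= 0); // True
```

Because the comparison options can be combined, the same approach covers accent-insensitive and case-insensitive matching without creating a modified copy of either string.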