String comparisons are harder than they seem

Comparing strings is different from comparing numbers. Two numbers are equal if their values are identical: 1 is equal to 1, and 1 is not equal to 2. That's trivial. With strings, things are different. For instance, do you want a case-sensitive comparison? What about the different ways to write the same letter? The letter ß is common in German, but it can also be written ss, which is easier to type on many keyboards. In .NET, there are 6 ways to compare strings:

  • Ordinal:

    Performs a simple byte comparison that is independent of language. This is most appropriate when comparing strings that are generated programmatically or when comparing case-sensitive resources such as passwords.

  • OrdinalIgnoreCase:

    Treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language. This is most appropriate when comparing strings that are generated programmatically or when comparing case-insensitive resources such as paths and filenames.

  • InvariantCulture:

    Compares strings in a linguistically relevant manner, but it is not suitable for display in any particular culture. Its major application is to order strings in a way that will be identical across cultures.

  • InvariantCultureIgnoreCase:

    Compares strings in a linguistically relevant manner that ignores case, but it is not suitable for display in any particular culture. Its major application is to order strings in a way that will be identical across cultures.

  • CurrentCulture:

    Can be used when strings are linguistically relevant. For example, if strings are displayed to the user, or if strings are the result of user interaction, culture-sensitive string comparison should be used to order the string data.

  • CurrentCultureIgnoreCase:

    Can be used when strings are linguistically relevant but their case is not. For example, if strings are displayed to the user but case is unimportant, culture-sensitive, case-insensitive string comparison should be used to order the string data.

It's important to specify the comparison mode explicitly to avoid unexpected behavior or, even worse, security issues. For instance, if you compare two passwords using the current culture, they may be reported as equal even though they are actually different!

Examples

It's easy to understand the difference between Ordinal and OrdinalIgnoreCase. However, it may be harder to understand what a culture-specific comparison means. So, here are some examples that show how the result changes depending on the comparison used.

Let's start with some basic comparisons using Ordinal and OrdinalIgnoreCase:

// Basic comparisons
string.Equals("a", "a", StringComparison.Ordinal);           // true
string.Equals("a", "A", StringComparison.Ordinal);           // false
string.Equals("a", "a", StringComparison.OrdinalIgnoreCase); // true
string.Equals("a", "A", StringComparison.OrdinalIgnoreCase); // true

Now we can try a less obvious character. The German character ß (Eszett) can also be written ss, so a culture-aware comparison may consider the two equivalent, as you can see in the following examples:

string.Equals("ss", "ß", StringComparison.OrdinalIgnoreCase);          // false
string.Equals("ss", "ß", StringComparison.InvariantCulture);           // true on Windows / false on Linux (WSL)
string.Equals("ss", "ß", StringComparison.InvariantCultureIgnoreCase); // true on Windows / false on Linux (WSL)

Another tricky character is the letter i in Turkish. In most languages that use the Latin alphabet, there is only one i. In Turkish, there are 2: the dotless ı and the dotted i. There are also 2 different uppercase characters: I and İ. The result of a case-insensitive comparison therefore depends on the current culture:

CultureInfo.CurrentCulture = new CultureInfo("en-US");
string.Equals("i", "I", StringComparison.CurrentCultureIgnoreCase); // true
string.Equals("i", "İ", StringComparison.CurrentCultureIgnoreCase); // false

CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
string.Equals("i", "I", StringComparison.CurrentCultureIgnoreCase); // false
string.Equals("i", "İ", StringComparison.CurrentCultureIgnoreCase); // true
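
One way to stay clear of this problem for identifiers, keywords, or file names is to avoid culture-aware case conversion altogether. The sketch below is only an illustration: `IsFileKeyword` and `IsFileKeywordFragile` are hypothetical helpers, not part of .NET or of the examples above.

```csharp
using System;
using System.Globalization;

// Hypothetical helpers: is `value` the keyword "file", ignoring case?

// Fragile: ToUpper() without arguments uses the current culture.
static bool IsFileKeywordFragile(string value) => value.ToUpper() == "FILE";

// Robust: OrdinalIgnoreCase does not depend on the current culture.
static bool IsFileKeyword(string value) =>
    string.Equals(value, "FILE", StringComparison.OrdinalIgnoreCase);

CultureInfo.CurrentCulture = new CultureInfo("tr-TR");

Console.WriteLine(IsFileKeyword("file"));        // True, whatever the current culture is
Console.WriteLine(IsFileKeywordFragile("file")); // depends on the current culture and platform
```

The fragile version converts the dotted i with Turkish casing rules when the current culture is Turkish, so its result can change from one machine to another.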

To my mind, the strangest culture is the Thai culture. It doesn't contain the dot symbol (.), so comparisons are very funky!

// The Thai culture doesn't contain '.', so comparisons can be a little bit strange (tested on Ubuntu 18.04):
CultureInfo.CurrentCulture = new CultureInfo("th_TH.UTF8");
"Test".StartsWith(".", StringComparison.CurrentCulture); // true!!!
"12.4".IndexOf(".", StringComparison.CurrentCulture);    // 0

OK, the dot was nice, but we can continue with numbers. When comparing numbers, Thai digits are converted to Arabic digits (0 to 9 in Thai: ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙), as explained in the Unicode documentation:

CultureInfo.CurrentCulture = new CultureInfo("th_TH.UTF8");
string.Equals("1", "๑", StringComparison.CurrentCulture);             // true
string.Equals("1", "๑", StringComparison.InvariantCultureIgnoreCase); // false
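
When the data is machine-formatted rather than linguistic (a serialized number such as "12.4", for instance), an ordinal comparison sidesteps both surprises. A minimal sketch, assuming nothing beyond the standard string API:

```csharp
using System;

string input = "12.4";

// Ordinal comparisons only look at the UTF-16 code units,
// so the current culture has no influence on the result.
Console.WriteLine(input.IndexOf(".", StringComparison.Ordinal));          // 2
Console.WriteLine("Test".StartsWith(".", StringComparison.Ordinal));      // False
Console.WriteLine(string.Equals("1", "๑", StringComparison.Ordinal));     // False

// The char overload of IndexOf is always ordinal.
Console.WriteLine(input.IndexOf('.'));                                    // 2
```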

Null characters (\0) are ignored in non-ordinal comparisons:

string.Equals("a\0b", "ab", StringComparison.Ordinal);          // false
string.Equals("a\0b", "ab", StringComparison.InvariantCulture); // true
string.Equals("a\0b", "ab", StringComparison.CurrentCulture);   // true
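
The same pair of strings also illustrates why leaving the comparison mode implicit is risky: the overloads that take no comparison parameter do not all default to the same mode. The sketch below relies on the documented defaults (`==` and `Equals(string)` are ordinal, `string.Compare(string, string)` uses the current culture). The culture-sensitive line carries no expected output on purpose, because it depends on the culture and the platform.

```csharp
using System;

string a = "a\0b";
string b = "ab";

// Ordinal by default.
Console.WriteLine(a == b);      // False
Console.WriteLine(a.Equals(b)); // False

// Current culture by default: the result depends on the culture and the platform.
Console.WriteLine(string.Compare(a, b) == 0);

// Explicit mode: no guessing needed when reading the code.
Console.WriteLine(string.Compare(a, b, StringComparison.Ordinal) == 0); // False
```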

The normalization of the strings can also change the results between the ordinal and non-ordinal comparison modes. In the following example, the first string contains only 1 character: é. The second one contains 2 characters: e and \u0301 (Combining Acute Accent). The second string is the normalization in form D of the first string.

"é".Normalize(System.Text.NormalizationForm.FormD); // e\u0301

string.Equals("é", "e\u0301", StringComparison.Ordinal);          // false
string.Equals("é", "e\u0301", StringComparison.InvariantCulture); // true
string.Equals("é", "e\u0301", StringComparison.CurrentCulture);   // true
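
If you want an ordinal comparison that still treats both representations as the same text, one option is to normalize both sides to the same form first. This is only a sketch: `EqualsNormalized` is a hypothetical helper, and the choice of form C is an assumption made for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper: normalize both strings to form C, then compare ordinally.
static bool EqualsNormalized(string left, string right) =>
    string.Equals(
        left.Normalize(NormalizationForm.FormC),
        right.Normalize(NormalizationForm.FormC),
        StringComparison.Ordinal);

string precomposed = "\u00E9"; // é as a single character
string decomposed = "e\u0301"; // e followed by Combining Acute Accent

Console.WriteLine(string.Equals(precomposed, decomposed, StringComparison.Ordinal)); // False
Console.WriteLine(EqualsNormalized(precomposed, decomposed));                        // True
```

Normalizing once when the data enters the system, rather than on every comparison, keeps the cost low and lets the rest of the code use plain ordinal comparisons.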

I hope you now understand how important it is to specify the way you want to compare strings.

Where to specify the comparison mode

Almost all methods that work with strings have a parameter to specify the comparison mode. There are 2 ways to provide it: a StringComparison value or an IEqualityComparer<string> such as one of the StringComparer instances.

// String methods
_ = "" == ""; // Implicitly ordinal; should be string.Equals("", "", StringComparison.Ordinal);
string.Equals("", "", StringComparison.Ordinal);
string.Compare("", "", StringComparison.Ordinal);
"".Equals("", StringComparison.Ordinal);
"".IndexOf("", StringComparison.CurrentCulture);
"".EndsWith("", StringComparison.CurrentCulture);
"".StartsWith("", StringComparison.CurrentCulture);
"".ToLower(CultureInfo.CurrentCulture);
"".ToUpper(CultureInfo.CurrentCulture);

// Enumerable extensions
new [] { "" }.Contains("", StringComparer.Ordinal);
new [] { "" }.Distinct(StringComparer.Ordinal);
new [] { "" }.GroupBy(x => x, StringComparer.Ordinal);
// ...

// HashSet and Dictionary constructors
new HashSet<string>(StringComparer.Ordinal);
new Dictionary<string, object>(StringComparer.Ordinal);
new ConcurrentDictionary<string, object>(StringComparer.Ordinal);

// GetHashCode
"".GetHashCode(StringComparison.Ordinal); // .NET Core 2.0 and later
StringComparer.Ordinal.GetHashCode("");
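
To make the effect of the comparer more concrete, here is a small sketch. The header and file names are made up for the example. The comparer passed to the constructor or to the LINQ method decides which keys are considered equal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The comparer given to the constructor is used for every lookup.
var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["Content-Type"] = "text/plain",
};
Console.WriteLine(headers.ContainsKey("content-type")); // True

// Without a comparer, the dictionary uses an ordinal, case-sensitive comparison.
var caseSensitive = new Dictionary<string, string>
{
    ["Content-Type"] = "text/plain",
};
Console.WriteLine(caseSensitive.ContainsKey("content-type")); // False

// The same applies to LINQ methods that accept an IEqualityComparer<string>.
string[] names = { "readme.md", "README.md", "license" };
Console.WriteLine(names.Distinct(StringComparer.OrdinalIgnoreCase).Count()); // 2
```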

Getting warnings in the IDE using a Roslyn Analyzer

You can check the usages of these methods in your applications using a Roslyn analyzer. The good news is that the free analyzer I've made already contains rules for that: https://github.com/meziantou/Meziantou.Analyzer. In fact, these were the first rules I created, because it's so easy to forget to specify the comparison mode, especially for junior developers.
