Regex with IgnoreCase option may match more characters than expected

 
 
Gérald Barré

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

In a previous post, I explained why \d is not the same as [0-9]. This post explores why [a-zA-Z] and [a-z] with the IgnoreCase option are not equivalent.

C#
var regex1 = new Regex("^[a-zA-Z]+$");
var regex2 = new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

Console.WriteLine(regex1.IsMatch("Test")); // true
Console.WriteLine(regex2.IsMatch("Test")); // true

The results differ, however, when using the Kelvin sign:

C#
// Kelvin Sign (U+212A), written as an escape because it looks identical to the Latin letter K (U+004B)
Console.WriteLine(regex1.IsMatch("\u212A")); // false
Console.WriteLine(regex2.IsMatch("\u212A")); // true

When RegexOptions.IgnoreCase is specified, the regex engine performs case-insensitive comparisons using case mappings. Two characters match if they are considered equivalent under those casing rules (for example, if one is the lowercase version of the other).

In this case, char.ToLowerInvariant('\u212A') (Kelvin Sign) returns 'k' (Latin Small Letter K, U+006B). Therefore, with the IgnoreCase option, [a-z] matches the Kelvin sign because 'k' is in the range a-z. The Kelvin sign itself is in neither a-z nor A-Z, so [a-zA-Z] without IgnoreCase does not match it.
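The mapping described above can be checked directly. This is a minimal sketch, assuming a console app with top-level statements; the variable name kelvin is only illustrative and does not come from the original post.

```csharp
using System;

// Kelvin Sign (U+212A), written as an escape so it cannot be confused with 'K' (U+004B).
char kelvin = '\u212A';

// The two characters look alike but are different code points.
Console.WriteLine(kelvin == 'K');                        // False

// Invariant lowercasing maps the Kelvin sign to the ASCII letter 'k' (U+006B).
Console.WriteLine(char.ToLowerInvariant(kelvin) == 'k'); // True

// The mapping is one-way: uppercasing 'k' gives the ASCII 'K', not the Kelvin sign.
Console.WriteLine(char.ToUpperInvariant('k') == kelvin); // False
```

Writing the character as \u212A in source code also avoids the risk that an editor or a copy and paste silently turns it into a plain K.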

As a result, the following two regular expressions match the same inputs:

C#
new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
new Regex("^[A-Za-z\u212A]+$");
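One way to test this equivalence on your own runtime is to try every UTF-16 code unit and list the ones that the IgnoreCase pattern accepts but the plain [a-zA-Z] pattern rejects. This is a sketch, not part of the original post: the variable names are illustrative, and the result may depend on the .NET version you run it on. If the equivalence holds, the Kelvin sign should be the only entry printed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var ignoreCase = new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
var asciiOnly = new Regex("^[a-zA-Z]+$");

// Collect every single-character input accepted only by the IgnoreCase pattern.
var extras = new List<int>();
for (int codeUnit = 0; codeUnit <= char.MaxValue; codeUnit++)
{
    string input = ((char)codeUnit).ToString();
    if (ignoreCase.IsMatch(input) && !asciiOnly.IsMatch(input))
    {
        extras.Add(codeUnit);
    }
}

// Print the extra characters as code points, for example U+212A.
foreach (int codeUnit in extras)
{
    Console.WriteLine($"U+{codeUnit:X4}");
}
```

The loop uses an int counter because a char counter would overflow and never terminate at char.MaxValue.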

To learn how .NET determines which case mappings to use, see the code of the GenerateRegexCasingTable tool on GitHub.
