Regex with IgnoreCase option may match more characters than expected

 
 
  • Gérald Barré

This post is part of the series 'Strings in .NET'. Be sure to check out the rest of the blog posts of the series!

In a previous post, I explained why \d is different from [0-9]. In this post, I'll explain why the regex [a-zA-Z] is different from the regex [a-z] with the IgnoreCase option.

C#
var regex1 = new Regex("^[a-zA-Z]+$");
var regex2 = new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

Console.WriteLine(regex1.IsMatch("Test")); // true
Console.WriteLine(regex2.IsMatch("Test")); // true

However, you get different results if you use the Kelvin sign (U+212A), a distinct character that looks like the Latin capital letter K:

C#
// Kelvin Sign (U+212A), written as an escape so it cannot be confused with the Latin letter K
Console.WriteLine(regex1.IsMatch("\u212A")); // false
Console.WriteLine(regex2.IsMatch("\u212A")); // true

When a regular expression specifies RegexOptions.IgnoreCase, comparisons between the input and the pattern are case-insensitive. To support this, Regex needs to define which case mappings are used for the comparisons. A case mapping exists between two characters 'A' and 'B' when either 'A' is the ToLower() representation of 'B' or both 'A' and 'B' lowercase to the same character.

In this case, char.ToLowerInvariant('\u212A') (Kelvin Sign) is 'k' (Latin Small Letter K). So, with the IgnoreCase option, [a-z] matches the Kelvin Sign. However, the Kelvin Sign is not part of the [a-zA-Z] set, which only contains the 52 ASCII letters. That's why the regex [a-zA-Z] does not match the Kelvin Sign.
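The definition above can be sketched in a few lines. This is only an illustration built on char.ToLowerInvariant, not the code Regex actually runs: AreCaseEquivalent, KelvinSign, and extras are hypothetical names introduced here, and the list printed at the end depends on the casing data of the runtime you use.

C#
```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper mirroring the definition above: two characters are
// case-equivalent when they lowercase to the same character.
static bool AreCaseEquivalent(char a, char b) =>
    char.ToLowerInvariant(a) == char.ToLowerInvariant(b);

const char KelvinSign = '\u212A';

Console.WriteLine(AreCaseEquivalent('k', KelvinSign)); // True: 'k' is the lowercase form of the Kelvin Sign
Console.WriteLine(AreCaseEquivalent('K', KelvinSign)); // True: both lowercase to 'k'
Console.WriteLine(AreCaseEquivalent('a', 'b'));        // False

// Collect every UTF-16 code unit that is not an ASCII letter but still
// lowercases to one under the invariant culture.
var extras = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
    char c = (char)i;
    bool isAsciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    char lower = char.ToLowerInvariant(c);
    if (!isAsciiLetter && lower >= 'a' && lower <= 'z')
    {
        extras.Add(c);
    }
}

// Print each extra character with its code point and lowercase form.
foreach (char c in extras)
{
    Console.WriteLine($"U+{(int)c:X4} -> '{char.ToLowerInvariant(c)}'");
}
```

Running a scan like this on your own runtime is a quick way to see which characters outside [a-zA-Z] could be affected by a case-insensitive [a-z].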

To conclude, the following regular expressions are equivalent:

C#
new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
new Regex("^[A-Za-z\u212A]+$");
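As a quick sanity check, you can run both patterns against a handful of inputs and compare the results. This is a spot check rather than a proof of equivalence; the variable names and the sample strings are arbitrary choices made for this sketch.

C#
```csharp
using System;
using System.Text.RegularExpressions;

var ignoreCase = new Regex("^[a-z]+$", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
var explicitSet = new Regex("^[A-Za-z\u212A]+$");

// "\u212A" is the Kelvin Sign; the other samples are plain ASCII or empty.
string[] samples = { "Test", "\u212A", "kK\u212A", "123", "" };

foreach (string sample in samples)
{
    bool a = ignoreCase.IsMatch(sample);
    bool b = explicitSet.IsMatch(sample);

    // Both columns should show the same value for every sample.
    Console.WriteLine($"\"{sample}\": {a} / {b}");
}
```

If you only want ASCII letters, the explicit [a-zA-Z] set without the IgnoreCase option is the pattern that excludes the Kelvin Sign.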

If you are curious about how .NET knows which case mappings to use, you can read the code of the GenerateRegexCasingTable tool on GitHub.
