.NET Regex: \d is different from [0-9]

According to the .NET documentation for Regex, \d matches any decimal digit. What counts as a "decimal digit" depends on the regex options:

  • Without RegexOptions.ECMAScript (default): \d means \p{Nd}, i.e. any character from the Unicode category Nd ("Number, Decimal Digit")
  • With RegexOptions.ECMAScript: \d means [0-9]

The Unicode category Nd ("Number, Decimal Digit") contains the ASCII digits 0 to 9, but also digits from other scripts, such as ٣, ٧, ൩ or ໓. The full category contains 610 characters (the exact count depends on the Unicode version). The ranges below cover 240 of them:

0x0030-0x0039,  // ASCII
0x0660-0x0669,  // Arabic-Indic
0x06f0-0x06f9,  // Eastern Arabic-Indic
0x0966-0x096f,  // Devanagari
0x09e6-0x09ef,  // Bengali
0x0a66-0x0a6f,  // Gurmukhi
0x0ae6-0x0aef,  // Gujarati
0x0b66-0x0b6f,  // Oriya
0x0c66-0x0c6f,  // Telugu
0x0ce6-0x0cef,  // Kannada
0x0d66-0x0d6f,  // Malayalam
0x0e50-0x0e59,  // Thai
0x0ed0-0x0ed9,  // Lao
0x0f20-0x0f29,  // Tibetan
0x1040-0x1049,  // Myanmar
0x17e0-0x17e9,  // Khmer
0x1810-0x1819,  // Mongolian
0x1946-0x194f,  // Limbu
0xff10-0xff19,  // Fullwidth
0x1d7ce-0x1d7d7, // Math Bold
0x1d7d8-0x1d7e1, // Math Double-Struck
0x1d7e2-0x1d7eb, // Math Sans-Serif
0x1d7ec-0x1d7f5, // Math Sans-Serif Bold
0x1d7f6-0x1d7ff  // Math Monospace
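To see how many characters \d accepts on a given runtime, one option is to test every UTF-16 code unit against the pattern. The sketch below does that; the class name DigitCount and the method CountDigitMatches are made up for this illustration. It only covers U+0000 to U+FFFF because, as far as I know, .NET Regex matches one UTF-16 char at a time, so the supplementary-plane ranges (the Math digits above) appear to it as surrogate pairs and are not tested here.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET.
public static class DigitCount
{
    // Counts the UTF-16 code units (U+0000 to U+FFFF) that \d matches
    // under the given options. Each code unit is tested on its own,
    // as a one-character string.
    public static int CountDigitMatches(RegexOptions options)
    {
        var digit = new Regex(@"\d", options);
        int count = 0;
        for (int codeUnit = 0; codeUnit <= 0xFFFF; codeUnit++)
        {
            if (digit.IsMatch(((char)codeUnit).ToString()))
            {
                count++;
            }
        }
        return count;
    }
}
```

DigitCount.CountDigitMatches(RegexOptions.ECMAScript) returns 10, one for each of 0 to 9. With RegexOptions.None the result is larger and depends on the Unicode data shipped with the runtime, so no fixed number is given here.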

Here are some examples that show the difference:

// \u0030 - \u0039
Regex.IsMatch("0123456789", "\\d{10}");   // True
Regex.IsMatch("0123456789", "[0-9]{10}"); // True

// DEVANAGARI DIGIT: \u0966 - \u096F
Regex.IsMatch("०१२३४५६७८९", "\\d{10}");   // True
Regex.IsMatch("०१२३४५६७८९", "[0-9]{10}"); // False

// RegexOptions.ECMAScript
Regex.IsMatch("0123456789", "\\d{10}", RegexOptions.ECMAScript); // True
Regex.IsMatch("०१२३४५६७८९", "\\d{10}", RegexOptions.ECMAScript); // False

The next time you match a digit in a regex, decide which kind of digit you actually want: [0-9] or \p{Nd}.
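One way to make that choice explicit is to spell it out in the pattern instead of relying on \d. The following is a minimal sketch, not a recommended API: the class DigitValidation and its two methods are hypothetical names. It anchors with \A and \z rather than ^ and $, because $ also matches just before a trailing newline.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET.
public static class DigitValidation
{
    // Only the ASCII digits 0-9, whatever RegexOptions are in effect.
    private static readonly Regex AsciiDigits = new Regex(@"\A[0-9]+\z");

    // Any character from the Unicode category Nd (what \d means by default).
    private static readonly Regex UnicodeDigits = new Regex(@"\A\p{Nd}+\z");

    public static bool IsAsciiNumber(string s) => AsciiDigits.IsMatch(s);

    public static bool IsUnicodeNumber(string s) => UnicodeDigits.IsMatch(s);
}
```

IsAsciiNumber("0123") and IsUnicodeNumber("0123") both return true, while for "०१२३" only IsUnicodeNumber returns true. This matters when the validated string is parsed afterwards: to my knowledge int.TryParse only understands the ASCII digits, so a string that passed a \d check can still fail to parse.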
