Asked  7 Months ago    Answers:  5   Viewed   32 times

I just wrote a regex for use with the PHP function preg_match that contains the following part:

[\w-.]

It is meant to match any word character, as well as a minus sign and the dot. While it seems to work in preg_match, when I put it into a utility called Reggy, it complained about "Empty range in char class". Trial and error taught me that escaping the minus sign solves the issue, turning the regex into

[\w\-.]

Since the original appears to work in PHP, I am wondering why I should or should not be escaping the minus sign, and - since the dot is also a character with a meaning in PHP - why I would not need to escape the dot. Is the utility I am using just being silly, is it working with another regex dialect or is my regex really incorrect and am I just lucky that preg_match lets me get away with it?

 Answers

76

In many regex implementations, the following rules apply:

Meta characters inside a character class are:

  • ^ (negation)
  • - (range)
  • ] (end of the class)
  • \ (escape char)

So these should all be escaped. There are some corner cases though:

  • - needs no escaping if placed at the very start or the very end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a short-hand character class ([\w-abc]). This is what you observed.
  • ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, and [a^] matches either a or ^, which equals: [\^a]
  • ] needs no escaping if it's the only character in the class: []] matches the char ]
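
The corner cases above can be checked with a small sketch. It uses Python's re module, which is an assumption on my part (the question is about PHP's preg_match) and a different regex flavor from PCRE: notably, Python rejects a hyphen placed directly after a short-hand class, much as Reggy does. The helper name matches_all is hypothetical.

```python
import re

def matches_all(pattern, samples):
    """Return True if every sample string fully matches the pattern."""
    return all(re.fullmatch(pattern, s) is not None for s in samples)

# Hyphen at the very end or very start of the class is a literal.
hyphen_last = matches_all(r"[abc-]", ["a", "b", "c", "-"])
hyphen_first = matches_all(r"[-abc]", ["a", "b", "c", "-"])

# An escaped hyphen is a literal anywhere in the class.
hyphen_escaped = matches_all(r"[\w\-.]", ["x", "_", "-", "."])

# A caret that is not first is a literal; a caret that is first negates.
caret_literal = matches_all(r"[a^]", ["a", "^"])
caret_negates = re.fullmatch(r"[^a]", "a") is None

# A closing bracket placed first in the class is a literal.
bracket_literal = matches_all(r"[]]", ["]"])

# Python's re, unlike PCRE, refuses a hyphen right after a short-hand class.
try:
    re.compile(r"[\w-.]")
    shorthand_hyphen_ok = True
except re.error:
    shorthand_hyphen_ok = False
```

Moving the hyphen to the end, as in [\w.-], sidesteps the disagreement between flavors without needing an escape.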
Wednesday, March 31, 2021
 
Asher
answered 7 Months ago
100

PHP docs quote a small part of the PCRE docs. Here are some more details (emphasis mine) from PCRE 8.36:

If a compiled pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. The function pcre_study() takes a pointer to a compiled pattern as its first argument. If studying the pattern produces additional information that will help speed up matching, pcre_study() returns a pointer to a pcre_extra block, in which the study_data field points to the results of the study.

...

Studying a pattern does two things: first, a lower bound for the length of subject string that is needed to match the pattern is computed. This does not mean that there are any strings of that length that match, but it does guarantee that no shorter strings match. The value is used to avoid wasting time by trying to match strings that are shorter than the lower bound. You can find out the value in a calling program via the pcre_fullinfo() function.

Studying a pattern is also useful for non-anchored patterns that do not have a single fixed starting character. A bitmap of possible starting bytes is created. This speeds up finding a position in the subject at which to start matching. (In 16-bit mode, the bitmap is used for 16-bit values less than 256. In 32-bit mode, the bitmap is used for 32-bit values less than 256.)

Please note that in the later PCRE version (v10.00, also called PCRE2), the lib has undergone a massive refactoring and API redesign. One of the consequences is that studying is always performed in PCRE 10.00 and above. I don't know when PHP will make use of PCRE2, but it will happen sooner or later because PCRE 8.x won't get any new features from now on.

Here's a quote from the PCRE2 release announcement:

Explicit "studying" of compiled patterns has been abolished - it now always happens automatically. JIT compiling is done by calling a new function, pcre2_jit_compile() after a successful return from pcre2_compile().


As for your second question:

If the "S" modifier is used per-thread only, how does it differ from the PCRE cache of compiled regexps?

There's no cache in PCRE itself, but PHP maintains a cache of regexes to avoid recompiling the same pattern over and over again, for instance in case you use a preg_ function inside a loop.
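
The idea of compiling a pattern once and reusing it inside a loop can be sketched outside PHP as well. The example below uses Python's re module as a stand-in (an assumption; it is neither PHP's regex cache nor pcre_study), where re.compile returns a reusable pattern object. The names WORD_CHARS and count_valid are hypothetical.

```python
import re

# Compile once, outside the loop, and reuse the pattern object.
WORD_CHARS = re.compile(r"[\w.-]+")

def count_valid(candidates):
    """Count candidates made up entirely of word characters, dots and hyphens."""
    return sum(1 for c in candidates if WORD_CHARS.fullmatch(c))

valid = count_valid(["foo-bar", "a.b_c", "no spaces", "ok"])
```

Python's re module also keeps its own cache of recently compiled patterns, so the explicit compile here is mostly a matter of readability.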

Wednesday, March 31, 2021
 
ritch
answered 7 Months ago
64

Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.

For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:

.^$*+?()[{\|

and these inside character classes:

^-]\
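
A minimal sketch of the difference between the two contexts, using the dot as the example. It is written for Python's re module, which is an assumption here (it is Perl-like but not PCRE itself); all variable names are hypothetical.

```python
import re

# Outside a character class the dot is a wildcard and must be escaped
# to match a literal dot.
dot_wildcard = re.fullmatch(r"a.c", "abc") is not None
dot_escaped = re.fullmatch(r"a\.c", "abc") is None

# Inside a character class the dot is already a literal.
dot_in_class = re.fullmatch(r"a[.]c", "abc") is None
dot_in_class_literal = re.fullmatch(r"a[.]c", "a.c") is not None

# Inside a class, the caret, hyphen, closing bracket and backslash
# can all be matched literally once escaped.
special_in_class = all(
    re.fullmatch(r"[\^\-\]\\]", ch) is not None for ch in "^-]\\"
)
```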

For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):

.^$*+?()[{\|

Escaping any other characters is an error with POSIX ERE.

Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:

[]^-]
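
The "clever placement" class can be tried out as a sketch. Python's re module is not a POSIX engine (that substitution is my assumption, made only to have something runnable), but it accepts the same placement rules for this particular class. The variable names are hypothetical.

```python
import re

# "]" first, "^" not first, "-" last: all three are literals by position alone.
placement = re.compile(r"[]^-]")

literals_match = all(placement.fullmatch(ch) for ch in "]^-")
other_rejected = placement.fullmatch("a") is None
```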

In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:

.^$*[\

Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as ? and +. Escaping a character other than .^$*(){} is normally an error with BREs.

Inside character classes, BREs follow the same rule as EREs.

If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

Tuesday, June 1, 2021
 
Puneet
answered 5 Months ago
92

There's a good tutorial on rebuilding the RPM for pcre here.

If you scroll down to "Updated RPM file for..." you'll find some pre-built RPMs if you just want it to work (remember to fully restart Apache after you're done; a graceful reload is not enough).

The tl;dr version is: recompile pcre with --enable-utf8 and --enable-unicode-properties

Thursday, July 29, 2021
 
Floris
answered 3 Months ago
59

I don't know the complete set of characters - but I wouldn't rely on the knowledge anyway, and I wouldn't put it into code. Instead, I would use Regex.Escape whenever I wanted some literal text that I wasn't sure about:

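// Requires: using System.Text.RegularExpressions;
// Don't actually do this to check containment... it's just a little example.
public bool RegexContains(string haystack, string needle)
{
    // Regex.Escape turns any metacharacters in needle into literals.
    Regex regex = new Regex("^.*" + Regex.Escape(needle) + ".*$");
    return regex.IsMatch(haystack);
}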
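
The same approach can be sketched in Python, where re.escape plays the role of Regex.Escape. The function name regex_contains and the sample strings are hypothetical, and this is an illustration of escaping rather than a recommended way to test containment.

```python
import re

def regex_contains(haystack, needle):
    """Return True if haystack contains needle as literal text."""
    # re.escape backslash-escapes every character in needle that could be
    # special in a pattern, so needle is matched literally.
    return re.search(re.escape(needle), haystack) is not None

found_literal = regex_contains("price: 3.50 (approx.)", "3.50 (approx.)")
dot_not_wildcard = not regex_contains("3x50", "3.50")
```

For plain containment, `needle in haystack` is simpler; escaping matters when the literal text has to be embedded in a larger pattern.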
Monday, August 2, 2021
 
Akdeniz
answered 3 Months ago