Asked  7 Months ago    Answers:  5   Viewed   74 times

I found it in the following regex:

\[(?:[^][]|(?R))*\]

It matches square brackets (with their content) together with nested square brackets.

 Answers

16

[^][] is a character class that means all characters except [ and ].

You can avoid escaping the special characters [ and ] here because the pattern is not ambiguous for PCRE, the regex engine used by the preg_ functions.

Since [^] is invalid in PCRE, the only way for the regex to parse is for the ] to be inside the character class, which is closed later. The same goes for the [ that follows: it cannot reopen a character class inside a character class (except for a POSIX character class such as [:alnum:]). The last ] is then unambiguous; it closes the character class. However, a [ outside a character class must be escaped, since it would otherwise be parsed as the beginning of a character class.

In the same way, you can write []] or [[] or [^[] without escaping the [ or ] in the character class.
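A quick way to check these rules is a small sketch with Python's built-in re module, which accepts the same unescaped forms. The sample string and variable names are only illustrative.

```python
import re

text = "a[b]c"

# [^][] : negated class holding a literal ] and a literal [
not_brackets = re.findall(r"[^][]", text)

# []] : a class holding only a literal ]
closing = re.findall(r"[]]", text)

# [^[] : any character except [
not_opening = re.findall(r"[^[]", text)

print(not_brackets)  # ['a', 'b', 'c']
print(closing)       # [']']
print(not_opening)   # ['a', 'b', ']', 'c']
```

The [[] form is left out here because recent Python versions may emit a FutureWarning about a possible nested set for a class that starts with [[.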

Note: since PHP 7.3, you can use the inline xx modifier, which allows blank characters to be ignored even inside character classes. This way you can write these classes in a less ambiguous way, like this: (?xx) [^ ][ ] [ ] ] [ [ ] [^ [ ].

You can use this syntax with several regex flavours: PCRE (PHP, R), Perl, Python, Java, .NET, Go, awk, Tcl (if you delimit your pattern with curly brackets, thanks Donal Fellows), ...

But not with: Ruby, JavaScript (except for IE < 9), ...

As m.buettner noted, [^]] is not ambiguous because ] is the first character. By contrast, [^a]] is read as any character that is not an a, followed by a literal ]. To exclude both a and ], you must write [^a\]] or [^]a].
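The difference between these forms can be checked with a short sketch, again using Python's re module as a stand-in; the input strings are made up for the illustration.

```python
import re

# [^a]] : the class [^a] followed by a literal ], so it needs two characters
two_chars = re.fullmatch(r"[^a]]", "b]")

# [^a\]] and [^]a] : a single class that excludes both a and ]
escaped = re.findall(r"[^a\]]", "a]b")
leading = re.findall(r"[^]a]", "a]b")

print(two_chars is not None)  # True
print(escaped, leading)       # ['b'] ['b']
```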

In the particular case of JavaScript, the specification allows [] as a regex token that never matches (in other words, [] always fails) and [^] as a regex that matches any character. So [^]] is read as any character followed by a ]. Actual implementations vary, but modern browsers generally stick to the definition in the specification.

Pattern details:

\[          # literal [
(?:         # open a non-capturing group
    [^][]   # a character that is not a ] or a [
  |         # OR
    (?R)    # the whole pattern (here is the recursion)
)*          # repeat zero or more times
\]          # a literal ]
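To make the recursion step concrete, here is a hand-written sketch of what the pattern does when anchored at one position. Python's built-in re module has no (?R), so this is an emulation with an ordinary recursive function, not the regex engine itself; the name match_brackets is hypothetical.

```python
from typing import Optional


def match_brackets(s: str, i: int = 0) -> Optional[int]:
    """Return the index just past a balanced [...] starting at s[i], else None."""
    if i >= len(s) or s[i] != "[":          # \[
        return None
    i += 1
    while i < len(s):                       # (?: ... )*
        if s[i] not in "[]":                # [^][]
            i += 1
        elif s[i] == "[":                   # (?R): recurse on the nested bracket
            end = match_brackets(s, i)
            if end is None:
                return None
            i = end
        else:                               # \] closes the current level
            return i + 1
    return None                             # ran out of input before the closing ]


print(match_brackets("[a[b]c]d"))  # 7
print(match_brackets("[a[b]"))     # None
```

Each call handles one nesting level, which is the role (?R) plays in the pattern.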

In your pattern example, you don't need to escape the last ]

But you can do the same with this slightly optimized pattern, which is also more useful because it is reusable as a subpattern (with the (?-1)): (\[(?:[^][]+|(?-1))*+])

(                     # open the capturing group
    \[                # a literal [
        (?:           # open a non-capturing group
            [^][]+    # all characters but ] or [, one or more times
          |           # OR
            (?-1)     # the last opened capturing group (recursion)
                      # (the capture group where you are)
        )*+           # repeat the group zero or more times (possessive)
    ]                 # literal ] (no need to escape)
)                     # close the capturing group

Or better: (\[[^][]*(?:(?-1)[^][]*)*+]), which avoids the cost of an alternation.

Wednesday, March 31, 2021
 
nfechner
answered 7 Months ago
15

This will work only for non-nested parentheses:

    // In a heredoc, \\\\ reaches the regex engine as \\ (one literal backslash)
    $regex = <<<HERE
    /  "  ( (?:[^"\\\\]++|\\\\.)*+ ) "
     | '  ( (?:[^'\\\\]++|\\\\.)*+ ) '
     | \( ( [^)]*                  ) \)
     | [\s,]+
    /x
    HERE;

    $tags = preg_split($regex, $str, -1,
                         PREG_SPLIT_NO_EMPTY
                       | PREG_SPLIT_DELIM_CAPTURE);

The ++ and *+ will consume as much as they can and give nothing back for backtracking. This technique is described in perlre(1) as the most efficient way to do this kind of matching.
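For readers who want to try the splitting logic outside PHP, here is a rough Python port. It is a sketch under assumptions: it uses plain greedy quantifiers instead of possessive ones (Python only gained possessive quantifiers in 3.11), drops the inner + to avoid nested-quantifier backtracking, and filters empty pieces by hand in place of PREG_SPLIT_NO_EMPTY. The names SPLIT_RE and split_tags are hypothetical.

```python
import re

# Same four alternatives as the PHP pattern, without possessive quantifiers.
SPLIT_RE = re.compile(r"""
      "  ( (?:[^"\\]|\\.)* ) "     # double-quoted chunk, backslash escapes allowed
    | '  ( (?:[^'\\]|\\.)* ) '     # single-quoted chunk
    | \( ( [^)]*           ) \)    # non-nested parentheses
    | [\s,]+                       # separators: whitespace and commas
""", re.VERBOSE)


def split_tags(s):
    # re.split returns the captured groups too (None when a group did not
    # participate), so drop None and empty strings.
    return [part for part in SPLIT_RE.split(s) if part]


tags = split_tags("foo, \"bar, baz\" (a, b) 'c d'")
print(tags)  # ['foo', 'bar, baz', 'a, b', 'c d']
```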

Wednesday, March 31, 2021
 
KingCrunch
answered 7 Months ago
52

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<a\b[^>]*>(.*?)</a>   // match group one will contain the link text
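A minimal sketch of that pattern in use, written with Python's re module and a made-up HTML snippet; it assumes well-formed, non-nested anchors, as the disclaimer above warns.

```python
import re

html = '<a href="/home">Home</a> <abbr title="x">n/a</abbr> <a>Docs</a>'

# \b after "a" keeps tags such as <abbr> from being treated as anchors
link_texts = re.findall(r"<a\b[^>]*>(.*?)</a>", html)

print(link_texts)  # ['Home', 'Docs']
```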
Saturday, May 29, 2021
 
lewiguez
answered 5 Months ago
69

It is shorthand for O(g(n) log^k g(n))

Thursday, August 12, 2021
 
avon_verma
answered 3 Months ago
26

\pL is a Unicode property shortcut. It can also be written as \p{L} or \p{Letter}. It matches any kind of letter from any language.

Thursday, October 14, 2021
 
Jim
answered 1 Week ago