
I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);

It works fine.

However, it doesn't split into sentences if there are unicode characters:

$sentences = 'Entertainment media properties. Fairy Tail and Tokyo Ghoul.';

Or this scenario:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";

What can I do to make it work when unicode characters exist in the text?

Here is an ideone for testing.

Bounty info

I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with Wiktor Stribiżew for more relevant info on this issue.

 Answers

55

As should be expected, natural language processing of any sort is not a trivial task. The reason is that natural languages are evolved systems: no single person sat down and decided which ideas were good and which were not. Every rule has 20-40% exceptions. That said, a single regex that could do your bidding would be enormously complex. Still, the following solution relies mainly on regexes.


  • The idea is to gradually go over the text.
  • At any given time, the current chunk of text is held in two parts: one is the candidate substring before a sentence boundary, the other the candidate substring after it.
  • The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
  • If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.
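
To make the pair idea concrete, here is a minimal sketch with a reduced, made-up rule table of two pairs (the real table below has thirteen). The function name `classify` and both sample regexes are illustrative assumptions, not part of the translated library.

```php
<?php
// Reduced, illustrative rule table: [before-regex, after-regex, is boundary?].
$pairs = [
    ['/\b(?:Dr|Mr|Mrs)\.\s\Z/su', '/\A(?:)/su',   false], // abbreviation: not a boundary
    ['/[.!?]\s\Z/su',             '/\A\p{Lu}/su', true],  // punctuation, then a capital
];

// Hypothetical helper: returns false/true for the first pair that matches
// both halves, or null when no pair matches at this position.
function classify(string $before, string $after, array $pairs): ?bool {
    foreach ($pairs as [$beforeRe, $afterRe, $isBoundary]) {
        if (preg_match($beforeRe, $before) && preg_match($afterRe, $after)) {
            return $isBoundary; // first matching pair wins
        }
    }
    return null; // nothing matched: keep advancing through the text
}

var_dump(classify('He met Dr. ', 'Smith yest', $pairs)); // bool(false)
var_dump(classify('It works. ', 'However, i', $pairs));  // bool(true)
var_dump(classify('It wor', 'ks. Howeve', $pairs));      // NULL
```

The order of the pairs matters: the "not a boundary" rules come first so that they can veto a position before the more general boundary rules get a chance to match.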

As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance - the regexes should be highly performant, as all of them have either an \A or a \Z anchor, there are almost no repetition quantifiers, and in the places where there are, there can't be any backtracking. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.


Mandatory disclaimer: excuse my rusty php skills. The following code might not be the most idiomatic php ever, but it should still be clear enough to get the point across.


function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'"„][.!?…][\'"”]\s)|(?:[^.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[(]*\.\.\.[\])]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[.!?…]+\p{Pe} )|(?:[\[(]*…[\])]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:["”\']\s*))\Z/su',
        '/(?:(?:[.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[.!?…][\'"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
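    // Use multibyte-safe functions (mbstring extension) so that the lookahead
    // window never cuts a UTF-8 character in half; the /u regexes fail on
    // subjects that are not valid UTF-8.
    $after = mb_substr($text, 0, 10, 'UTF-8');
    $text = mb_substr($text, 10, null, 'UTF-8');

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = mb_substr($text, 0, 1, 'UTF-8');
        $text = mb_substr($text, 1, null, 'UTF-8');
        $first_from_after = mb_substr($after, 0, 1, 'UTF-8');
        $after = mb_substr($after, 1, null, 'UTF-8');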
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    // Flush whatever is left, including inputs too short to enter the loop.
    if($sentence.$after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
Wednesday, March 31, 2021
 
Xatoo
answered 7 Months ago
61

Try this:

(?:[\w-](?<!_))+

It does a simple match on anything that is encoded as a \w (or a dash) and then has a zero-width lookbehind that ensures that the character that was just matched is not an underscore.

Otherwise you could pick this one:

(?:[^_\W]|-)+

which is a more set-based approach (note the uppercase \W).

OK, I had a lot of fun with unicode in php's flavor of PCREs :D Peekaboo says there is a simple solution available:

[\p{L}\p{N}\-]+

\p{L} matches anything unicode that qualifies as a Letter (note: not a word character, thus no underscores), while \p{N} matches anything that looks like a number (including roman numerals and more exotic things).
\- is just an escaped dash. Although not strictly necessary, I tend to make it a point to escape dashes in character classes... Note that there are dozens of different dashes in unicode, thus giving rise to the following version:

[\p{L}\p{N}\p{Pd}]+

Where "Pd" is Punctuation Dash, including, but not limited to our minus-dash-thingy. (Note, again no underscore here).
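
To see how these classes differ in practice, here is a minimal PHP sketch. The sample string and the helper name `matchTokens` are made up for illustration; the `u` modifier is needed so that PCRE treats both pattern and subject as UTF-8.

```php
<?php
// Hypothetical helper: return every match of $pattern in $subject.
function matchTokens(string $pattern, string $subject): array {
    preg_match_all($pattern, $subject, $m);
    return $m[0];
}

// Illustrative sample: an underscore, an ASCII hyphen, and an en dash (U+2013).
$subject = "foo_bar a-b 10\u{2013}20";

// Set-based version: only the ASCII hyphen counts as a dash.
print_r(matchTokens('/(?:[^_\W]|-)+/u', $subject));
// foo, bar, a-b, 10, 20

// Property-based version: any Unicode dash punctuation is accepted.
print_r(matchTokens('/[\p{L}\p{N}\p{Pd}]+/u', $subject));
// foo, bar, a-b, 10–20
```

In both cases the underscore splits "foo_bar" into two tokens; the only difference is whether the en dash keeps "10–20" together.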

Wednesday, March 31, 2021
 
Chvanikoff
answered 7 Months ago
25

Try a Unicode range:

'/[\x{0410}-\x{042F}]/u'  // matches a capital cyrillic letter in the range A to Ya

Don't forget the /u flag for Unicode.

In your case:

"#\[name=([a-zA-Z0-9\x{0430}-\x{044F}\x{0410}-\x{042F} .-]+)*\]#u"

Note that the STAR in your regex is redundant. Everything already gets "eaten" by the PLUS. This would do the same:

"#\[name=([a-zA-Z0-9\x{0430}-\x{044F}\x{0410}-\x{042F} .-]+)\]#u"
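
A quick way to check the range, as a sketch: the helper name `hasCapitalCyrillic` and the sample words are illustrative assumptions, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: true when $text contains a letter in U+0410..U+042F.
function hasCapitalCyrillic(string $text): bool {
    return preg_match('/[\x{0410}-\x{042F}]/u', $text) === 1;
}

var_dump(hasCapitalCyrillic('Привет')); // bool(true): "П" is U+041F
var_dump(hasCapitalCyrillic('привет')); // bool(false): lowercase letters only
var_dump(hasCapitalCyrillic('Hello'));  // bool(false): Latin letters only
```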
Wednesday, July 28, 2021
 
huhushow
answered 3 Months ago
22

The solution is to match and capture the abbreviations and build the replacement using a callback:

var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g; 
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
var result = str.replace(re, function(m, g1, g2){
  return g1 ? g1 : g2+"\r";
});
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, 0, 4) + "</pre>";

Regex explanation:

  • \b(\w\.\w\.) - match and capture into Group 1 the abbreviation (consisting of a word character, then . and again a word character and a .) as a whole word
  • | - or...
  • ([.?!])\s+(?=[A-Za-z]):
    • ([.?!]) - match and capture into Group 2 either . or ? or !
    • \s+ - match 1 or more whitespace symbols...
    • (?=[A-Za-z]) - that are before an ASCII letter.
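
Since the question is about PHP, the same callback technique can be sketched there with `preg_replace_callback`. This is a rough port under stated assumptions, not the answerer's code: the function name `splitKeepingAbbreviations` is made up, and "\r" is assumed not to occur in the input because it serves as the split marker.

```php
<?php
// Hypothetical port of the JavaScript snippet above.
function splitKeepingAbbreviations(string $text): array {
    $marked = preg_replace_callback(
        '/\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/',
        function (array $m): string {
            // Group 1 non-empty: an abbreviation such as "e.g."; keep it as is.
            // Otherwise group 2 holds the punctuation; append a "\r" marker
            // in place of the whitespace that followed it.
            return $m[1] !== '' ? $m[1] : $m[2] . "\r";
        },
        $text
    );
    return explode("\r", $marked);
}

print_r(splitKeepingAbbreviations('Use e.g. this one. Then stop.'));
// "Use e.g. this one." and "Then stop."
```

Matching the abbreviation first consumes its dots, so the second alternative never sees them as sentence punctuation.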
Thursday, August 5, 2021
 
newbStudent
answered 3 Months ago
79

Unicode RegExp for splitting sentences: (?<=[.?!;])\s+(?=\p{Lu})

Explained demo here: http://regex101.com/r/iR7cC8
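
In PHP this pattern can be tried with `preg_split`, as in the sketch below. The function name `splitOnUppercase` and the sample text are illustrative assumptions. With the `u` modifier, `\s` also matches the no-break space (U+00A0) that caused the question's original regex to fail.

```php
<?php
// Hypothetical helper: split on whitespace that follows sentence punctuation
// and precedes an uppercase letter.
function splitOnUppercase(string $text): array {
    return preg_split('/(?<=[.?!;])\s+(?=\p{Lu})/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// "\u{00A0}" is a no-break space, as in the question's failing input.
$text = "Entertainment media properties.\u{00A0} Fairy Tail and Tokyo Ghoul. Mr. Smith agrees.";
print_r(splitOnUppercase($text));
// "Entertainment media properties.", "Fairy Tail and Tokyo Ghoul.", "Mr.", "Smith agrees."
```

Note the trade-off: the pattern is short and Unicode-aware, but it knows nothing about abbreviations, so "Mr. Smith" is split in two.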

Tuesday, August 10, 2021
 
skrilled
answered 2 Months ago