
To match a string with pattern like:


To get -TEXT-, I came to know that this works:

/-(.+?)-/ // -TEXT-

As far as I know, ? makes the preceding token optional, as in:

colou?r matches both colour and color

I initially wrote the regex to get the -TEXT- part like this:

/-(.+)-/

But it gave -TEXT-someMore-.

How does adding ? make the regex stop at the right place and return just the -TEXT- part? I thought ? only made the preceding token optional, as in the example above, not that it changed where the match ends.



As you say, ? sometimes means "zero or one", but in your regex +? is a single unit meaning "one or more — and preferably as few as possible". (This is in contrast to bare +, which means "one or more — and preferably as many as possible".)
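A minimal sketch of the difference in JavaScript. The question's sample string is not shown on this page, so the input below is an assumption chosen to reproduce the results the question describes.

```javascript
// Assumed input: the original sample string is not shown, so this one is hypothetical.
const input = "-TEXT-someMore-String";

// Greedy +: takes as much as it can, then backs off until a "-" can follow.
const greedy = input.match(/-(.+)-/);

// Lazy +?: takes as little as it can, extending only until the first "-" can follow.
const lazy = input.match(/-(.+?)-/);

console.log(greedy[0]); // -TEXT-someMore-
console.log(lazy[0]);   // -TEXT-
console.log(lazy[1]);   // TEXT
```

Both patterns accept the same set of strings; only the preferred length of the match differs.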

As the documentation puts it:

However, if a quantifier is followed by a question mark, then it becomes lazy, and instead matches the minimum number of times possible, so the pattern /\*.*?\*/ does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in \d??\d which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.
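The two patterns from that passage can be tried out in JavaScript, written with their backslashes. This is only an illustrative sketch: the sample inputs are made up, and the slashes of the comment pattern are escaped because JavaScript regex literals are delimited by /.

```javascript
// Hypothetical one-line C snippet containing two comments.
const source = "int a; /* first */ int b; /* second */";

// Greedy .* runs to the last "*/" on the line.
const greedyComment = source.match(/\/\*.*\*\//)[0];
// Lazy .*? stops at the first "*/".
const lazyComment = source.match(/\/\*.*?\*\//)[0];

console.log(greedyComment); // /* first */ int b; /* second */
console.log(lazyComment);   // /* first */

// Doubled question mark: an optional digit, matched lazily.
const oneDigit = "12".match(/\d??\d/)[0];    // prefers to skip the optional digit
const twoDigits = "12".match(/^\d??\d$/)[0]; // anchors force it to take both digits

console.log(oneDigit);  // 1
console.log(twoDigits); // 12
```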

Wednesday, March 31, 2021
answered 7 Months ago

The /u modifier enables Unicode support. It was added to JavaScript in ES2015.

Read to learn more information about unicode in regex with JavaScript.

Polish characters:

Ą \u0104
Ć \u0106
Ę \u0118
Ł \u0141
Ń \u0143
Ó \u00D3
Ś \u015A
Ź \u0179
Ż \u017B
ą \u0105
ć \u0107
ę \u0119
ł \u0142
ń \u0144
ó \u00F3
ś \u015B
ź \u017A
ż \u017C

All special Polish characters:
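One way to collect the code points listed above into a single character class is sketched below. The name POLISH_LETTERS and the sample string are hypothetical, and this is not necessarily the pattern the answer originally showed.

```javascript
// Character class built from the eighteen code points listed above.
// The g flag finds every occurrence; the u flag turns on Unicode mode.
const POLISH_LETTERS =
  /[\u0104\u0106\u0118\u0141\u0143\u00D3\u015A\u0179\u017B\u0105\u0107\u0119\u0142\u0144\u00F3\u015B\u017A\u017C]/gu;

// Hypothetical sample, written with escapes so the source stays ASCII: "Zażółć"
const sample = "Za\u017C\u00F3\u0142\u0107";

const found = sample.match(POLISH_LETTERS);
console.log(found.length); // 4
```

For these code points the class behaves the same with or without the u flag, since all of them fit in a single UTF-16 code unit.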

Wednesday, March 31, 2021
answered 7 Months ago

I'm not sure that a regex would be the best way of building a robust comparison tool. A simple regex might be part of a larger solution that used more sophisticated algorithms for non-exact matching.

There are a variety of readily-available options for English, some of which could be extended fairly simply to languages that use the Latin alphabet. Most of these algorithms have been around for years or even decades and are well-documented, though they all have limits.

I imagine that there are similar algorithms for non-Latin alphabets but I can't comment on their availability firsthand.

Phonetic Algorithms

The Soundex algorithm is nearly 100 years old and has been implemented in multiple programming languages. It encodes a string as a short code (a letter followed by three digits) based on its pronunciation. It is not precise but it may be useful for identifying similar sounding words/syllables. I've experimented with it in MS SQL Server and it is available in PHP.
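To give a feel for what Soundex does, here is a rough JavaScript sketch of the commonly described American Soundex rules. The function name soundex is my own, and built-in implementations (SQL Server, PHP) may differ in edge cases, so treat it as an illustration only.

```javascript
// Sketch of American Soundex: first letter plus three digits.
function soundex(word) {
  const codes = {
    b: "1", f: "1", p: "1", v: "1",
    c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
    d: "3", t: "3",
    l: "4",
    m: "5", n: "5",
    r: "6",
  };
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters === "") return "";

  let result = letters[0].toUpperCase();
  let previous = codes[letters[0]] || ""; // the first letter's code also blocks repeats
  for (const ch of letters.slice(1)) {
    if (result.length === 4) break;
    if (ch === "h" || ch === "w") continue; // h and w do not separate equal codes
    const code = codes[ch] || "";           // vowels and y have no code
    if (code !== "" && code !== previous) result += code;
    previous = code;                        // a vowel resets this, so a repeat is coded again
  }
  return result.padEnd(4, "0");
}

console.log(soundex("Robert")); // R163
console.log(soundex("Rupert")); // R163
```

Two names that sound alike get the same code, which is what makes the result usable as a fuzzy-match key.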

General consensus (including the PHP docs) is that Metaphone is much more accurate than Soundex when dealing with the English language. There are numerous implementations available (Wikipedia has a long list at the end of the article) and it is included in PHP.

Double Metaphone supports a second encoding of a word corresponding to an alternate pronunciation of the word.

As with Metaphone, Double Metaphone has been implemented in many programming languages.

Word Deconstruction

Levenshtein can be used to suggest alternate spellings (for example, to normalize user input) and might be useful as part of a more granular algorithm for alliteration and assonance.
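For reference, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. Below is a minimal JavaScript sketch of the usual dynamic-programming formulation; the function name levenshtein is just a label chosen here, not a built-in.

```javascript
// Levenshtein distance, keeping only two rows of the DP table at a time.
function levenshtein(a, b) {
  // Row for the empty prefix of a: turning "" into b[0..j) costs j insertions.
  let previousRow = Array.from({ length: b.length + 1 }, (_, j) => j);

  for (let i = 1; i <= a.length; i++) {
    const currentRow = [i]; // turning a[0..i) into "" costs i deletions
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      currentRow[j] = Math.min(
        previousRow[j] + 1,                   // delete a[i-1]
        currentRow[j - 1] + 1,                // insert b[j-1]
        previousRow[j - 1] + substitutionCost // substitute, or keep if equal
      );
    }
    previousRow = currentRow;
  }
  return previousRow[b.length];
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(levenshtein("colour", "color"));   // 1
```

A small distance between a user's input and a dictionary word is one possible signal that the word is a likely intended spelling.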

Logically, it would help to understand the syllabication of the words in the string so that each word could be deconstructed. The syllable break could resolve ambiguity as to how two adjacent letters should be pronounced. This thread has a few links:

PHP Syllable Detection

Wednesday, March 31, 2021
answered 7 Months ago

(?: starts a non-capturing group. It's no different to ( unless you're retrieving groups from the regex after use. See What is a non-capturing group? What does a question mark followed by a colon (?:) mean?.
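A short JavaScript sketch of what "retrieving groups" means here. The pattern and input are made up for illustration.

```javascript
// Capturing group: the text matched inside ( ) is stored at index 1.
const capturing = "colour".match(/col(ou|o)r/);

// Non-capturing group: same grouping for the alternation, nothing stored.
const nonCapturing = "colour".match(/col(?:ou|o)r/);

console.log(capturing[0], capturing[1]); // colour ou
console.log(capturing.length);           // 2
console.log(nonCapturing[0]);            // colour
console.log(nonCapturing.length);        // 1
```

Both patterns match exactly the same strings; the only difference is whether the group's text shows up in the result.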

Wednesday, June 16, 2021
answered 5 Months ago

The key difference between ? and ?? concerns their laziness. ?? is lazy, ? is not.

Let's say you want to search for the word "car" in a body of text, but you don't want to be restricted to just the singular "car"; you also want to match against the plural "cars".

Here's an example sentence:

I own three cars.

Now, if I wanted to match the word "car" and I only wanted to get the string "car" in return, I would use the lazy ?? like so:

cars??

This says, "look for the word car or cars; if you find either, return car and nothing more".

Now, if I wanted to match against the same words ("car" or "cars") and I wanted to get the whole match in return, I'd use the non-lazy ? like so:

cars?

This says, "look for the word car or cars, and return either car or cars, whatever you find".

In the world of computer programming, lazy generally means "evaluating only as much as is needed". So the lazy ?? only returns as much as is needed to make a match; since the "s" in "cars" is optional, don't return it. On the flip side, non-lazy (sometimes called greedy) operations evaluate as much as possible, hence the ? returns all of the match, including the optional "s".

Personally, I find myself using ? as a way of making other regular expression operators lazy (like the * and + operators) more often than I use it for simple character optionality, but YMMV.

See it in Code

Here's the above implemented in Clojure as an example:

(re-find #"cars??" "I own three cars.")
;=> "car"

(re-find #"cars?" "I own three cars.")
;=> "cars"

re-find is a function that takes a regular expression (here #"cars??") as its first argument and returns the first match it finds in its second argument, "I own three cars."
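Since the question at the top of the page uses JavaScript-style regex literals, here is a rough JavaScript equivalent of the Clojure calls, using String.prototype.match to get the first match.

```javascript
const sentence = "I own three cars.";

// Lazy ??: the optional "s" is skipped when the match can succeed without it.
const lazyMatch = sentence.match(/cars??/)[0];

// Greedy ?: the optional "s" is taken when it is there.
const greedyMatch = sentence.match(/cars?/)[0];

console.log(lazyMatch);   // car
console.log(greedyMatch); // cars
```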

Tuesday, June 22, 2021
answered 5 Months ago