Asked  7 Months ago    Answers:  5   Viewed   25 times

In my string I have a UTF-8 non-breaking space (0xc2a0) and I want to replace it with something else.

When I use

$str=preg_replace('~\xc2\xa0~', 'X', $str);

it works OK.

But when I use

$str=preg_replace('~\x{C2A0}~siu', 'W', $str);

the non-breaking space is not found (or replaced).

Why? What is wrong with the second regexp?

The format \x{C2A0} is correct, and I also used the u flag.

 Answers

40

Actually, the documentation about escape sequences in PHP is wrong. When you use the \xc2\xa0 syntax, the pattern matches the two bytes 0xC2 0xA0, which are the UTF-8 encoding of the character. With the \x{c2a0} syntax and the u flag, the value is treated as a Unicode code point (U+C2A0, a different character), and the engine looks for the UTF-8 encoding of that code point.

A non-breaking space is U+00A0 in Unicode but is encoded as the bytes C2 A0 in UTF-8. So if you try the pattern ~\x{00a0}~siu, it will work as expected.
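
The difference between a code point and its UTF-8 bytes can be checked outside PHP as well. Here is a minimal Python 3 sketch (Python is used only for illustration, and the variable names are made up for this example) showing that U+00A0 encodes to C2 A0, while U+C2A0 is another character with a three-byte encoding:

```python
import re

nbsp = "\u00a0"    # non-breaking space, code point U+00A0
other = "\uc2a0"   # code point U+C2A0, a different character

nbsp_bytes = nbsp.encode("utf-8")
other_bytes = other.encode("utf-8")

print(nbsp_bytes.hex())    # c2a0
print(other_bytes.hex())   # ec8aa0

# Matching by code point: only U+00A0 finds the non-breaking space.
text = "a\u00a0b"
replaced_ok = re.sub("\u00a0", "W", text)
replaced_miss = re.sub("\uc2a0", "W", text)

print(replaced_ok)              # aWb
print(replaced_miss == text)    # True, nothing was replaced
```

This mirrors the PHP behaviour described above: a pattern written with the code point C2A0 looks for a character that is not in the string.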

Wednesday, March 31, 2021
 
mcography
answered 7 Months ago
15

This will work only for non-nested parentheses:

    $regex = <<<HERE
    /  "  ( (?:[^"\\\\]++|\\\\.)*+ ) "
     | '  ( (?:[^'\\\\]++|\\\\.)*+ ) '
     | \( ( [^)]*                    ) \)
     | [\s,]+
    /x
    HERE;

    $tags = preg_split($regex, $str, -1,
                         PREG_SPLIT_NO_EMPTY
                       | PREG_SPLIT_DELIM_CAPTURE);

The ++ and *+ will consume as much as they can and give nothing back for backtracking. This technique is described in perlre(1) as the most efficient way to do this kind of matching.
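
For comparison, here is a rough Python 3 sketch of the same split. It is an illustration only: `TOKEN` and `split_tags` are hypothetical names, and it uses plain greedy quantifiers instead of possessive ones, because Python's `re` module only accepts `++` and `*+` from version 3.11 on. Filtering out `None` and empty strings stands in for `PREG_SPLIT_NO_EMPTY`.

```python
import re

# Same four alternatives as the PHP pattern, written in verbose mode.
TOKEN = re.compile(r'''
      " ( (?: [^"\\] | \\. )* ) "     # double-quoted string, escapes allowed
    | ' ( (?: [^'\\] | \\. )* ) '     # single-quoted string, escapes allowed
    | \( ( [^)]* ) \)                 # non-nested parentheses
    | [\s,]+                          # separators: whitespace and commas
''', re.VERBOSE)


def split_tags(s):
    """Split s on separators, keeping the contents of quotes and parentheses."""
    parts = TOKEN.split(s)
    # re.split returns None for groups that did not take part in a match.
    return [p for p in parts if p]


tags = split_tags('foo, "bar baz" (qux quux) \'a b\'')
print(tags)   # ['foo', 'bar baz', 'qux quux', 'a b']
```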

Wednesday, March 31, 2021
 
KingCrunch
answered 7 Months ago
52

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<a\b[^>]*>(.*?)</a>   // match group one will contain the link text
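
As a quick check of the pattern, here is a small Python 3 sketch. The sample HTML and the names `LINK` and `link_texts` are invented for the example, and the case-insensitive and dot-matches-newline flags are my own assumption about what you would usually want, not part of the pattern above:

```python
import re

# \b keeps the pattern from matching tags such as <abbr> or <article>.
LINK = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

html = '<p><a href="https://example.com">Example</a> and <a class="x">Other</a></p>'

link_texts = LINK.findall(html)
print(link_texts)   # ['Example', 'Other']
```

The same caveat applies: this only holds up while the markup is well-formed.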
Saturday, May 29, 2021
 
lewiguez
answered 5 Months ago
81

UTF-8 has an advantage in the case where ASCII characters represent the majority of characters in a block of text, because UTF-8 encodes these into 8 bits (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file.

UTF-16 is better where ASCII is not predominant, since it uses 2 bytes per character, primarily. UTF-8 will start to use 3 or more bytes for the higher order characters where UTF-16 remains at just 2 bytes for most characters.

UTF-32 will cover all possible characters in 4 bytes. This makes it pretty bloated. I can't think of any advantage to using it.
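
To make the size comparison concrete, here is a small Python 3 sketch that counts the encoded bytes for a few sample strings. The samples and the helper name `encoded_sizes` are made up for illustration, and the little-endian codec variants are used so that no byte order mark is added to the counts:

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text for each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


ascii_sizes = encoded_sizes("hello")                            # 5 ASCII characters
cjk_sizes = encoded_sizes("\u3053\u3093\u306b\u3061\u306f")     # 5 Japanese characters
emoji_sizes = encoded_sizes("\U0001F600")                       # 1 character outside the BMP

print(ascii_sizes)   # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(cjk_sizes)     # {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}
print(emoji_sizes)   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The last line shows the "primarily" part: characters outside the Basic Multilingual Plane take 4 bytes in UTF-16 as well.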

Sunday, June 6, 2021
 
pwaring
answered 5 Months ago
93

You have to encode your string as UTF-8 before quoting it, and the string should be a unicode object. You also have to declare the encoding of your source file in the coding line:

# -*- coding: utf-8 -*-

import urllib

s = u'î'
print urllib.quote(s.encode('utf-8'))

Gives me the output:

%C3%AE
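
The code above is Python 2. On Python 3 the function lives in `urllib.parse`, and strings are already Unicode, so a rough equivalent would look like this (a sketch, with variable names chosen just for the example):

```python
from urllib.parse import quote

s = "\u00ee"   # the character î

# quote() encodes str input as UTF-8 by default.
quoted = quote(s)

# Passing the UTF-8 bytes explicitly gives the same result.
quoted_explicit = quote(s.encode("utf-8"))

print(quoted)            # %C3%AE
print(quoted_explicit)   # %C3%AE
```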
Thursday, August 12, 2021
 
Ahmed Haque
answered 3 Months ago