Asked  7 Months ago    Answers:  5   Viewed   26 times

I'm having a problem removing non-UTF-8 characters from a string; they are not displaying properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

What is the best way to remove them? A regular expression, or something else?

 Answers

59

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                        # ...one or more times (up to 100 per match)
  )
| .                                 # anything else
/x
END;
$text = preg_replace($regex, '$1', $text);

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
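For readers working outside PHP, the same capture-and-discard idea can be sketched in Python. This is only an illustration under assumptions: the names `UTF8_RUN_OR_STRAY` and `strip_invalid_bytes` are made up for the example, and the pattern is a direct port of the loose one above, so it still lets through overlong encodings and similar sequences that a strict decoder would reject.

```python
import re

# Port of the pattern above to a Python bytes regex.
# Group 1 captures runs of well-formed-looking UTF-8 sequences;
# the final "." swallows any single byte that fits no sequence.
UTF8_RUN_OR_STRAY = re.compile(
    rb"""
    (
      (?: [\x00-\x7F]                # single-byte sequences
      |   [\xC0-\xDF][\x80-\xBF]     # double-byte sequences
      |   [\xE0-\xEF][\x80-\xBF]{2}  # triple-byte sequences
      |   [\xF0-\xF7][\x80-\xBF]{3}  # quadruple-byte sequences
      )+
    )
    | .                              # anything else (not captured)
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_invalid_bytes(data: bytes) -> bytes:
    """Keep what group 1 captured and drop every stray byte."""
    return UTF8_RUN_OR_STRAY.sub(lambda m: m.group(1) or b"", data)


# The bytes from the question: 0x97 is a lone continuation byte.
print(strip_invalid_bytes(b"\x97\x61\x6C\x6F"))  # b'alo'
```

Python's built-in `data.decode("utf-8", errors="ignore")` performs a stricter version of the same cleanup and is probably the simpler choice when a regex is not required.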

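It is also possible to repair the string instead, by re-encoding each invalid byte as the UTF-8 character with the same code point (that is, treating the stray byte as ISO-8859-1). If the errors are random, though, this can leave some strange symbols in the text.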

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times (up to 100 per match)
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
  if ($captures[1] != "") {
    // Valid byte sequence. Return unmodified.
    return $captures[1];
  }
  elseif ($captures[2] != "") {
    // Invalid byte of the form 10xxxxxx.
    // Encode as 11000010 10xxxxxx.
    return "\xC2".$captures[2];
  }
  else {
    // Invalid byte of the form 11xxxxxx.
    // Encode as 11000011 10xxxxxx.
    return "\xC3".chr(ord($captures[3])-64);
  }
}
$text = preg_replace_callback($regex, "utf8replacer", $text);
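A rough Python counterpart of this repair step is sketched below. It uses a different mechanism from the regex callback above: a custom decode error handler that maps each undecodable byte to the code point with the same value, which produces the same `C2 xx` / `C3 xx` bytes once the result is encoded again. The handler name `same_code_point` and the function `repair_utf8` are hypothetical names chosen for this sketch, and Python's decoder is stricter than the pattern above, so the two can disagree on overlong or otherwise malformed sequences.

```python
import codecs


def _same_code_point(err):
    """Decode the offending bytes as ISO-8859-1 and resume after them."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("latin-1"), err.end


# "same_code_point" is an arbitrary handler name chosen for this sketch.
codecs.register_error("same_code_point", _same_code_point)


def repair_utf8(data: bytes) -> str:
    """Decode UTF-8, turning each invalid byte into U+0080..U+00FF."""
    return data.decode("utf-8", errors="same_code_point")


repaired = repair_utf8(b"\x97alo")
print(repaired.encode("utf-8"))  # b'\xc2\x97alo'
```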

EDIT:

  • !empty(x) will match non-empty values ("0" is considered empty).
  • x != "" will match non-empty values, including "0".
  • x !== "" will match anything except "".

x != "" seems to be the best one to use in this case.

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

Wednesday, March 31, 2021
 
Revent
answered 7 Months ago
15

This will work only for non-nested parentheses:

    $regex = <<<'HERE'
    /  "  ( (?:[^"\\]++|\\.)*+ ) "
     | '  ( (?:[^'\\]++|\\.)*+ ) '
     | \( ( [^)]*                  ) \)
     | [\s,]+
    /x
    HERE;

    $tags = preg_split($regex, $str, -1,
                         PREG_SPLIT_NO_EMPTY
                       | PREG_SPLIT_DELIM_CAPTURE);

The ++ and *+ will consume as much as they can and give nothing back for backtracking. This technique is described in perlre(1) as the most efficient way to do this kind of matching.

Wednesday, March 31, 2021
 
KingCrunch
answered 7 Months ago
52

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<a\b[^>]*>(.*?)</a>   // match group one will contain the link text
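
To make the disclaimer concrete, here is a small Python sketch that runs the regex next to a real HTML parser on the same input. The sample markup and the class name `LinkText` are invented for the illustration; on well-formed input like this both approaches agree, and the parser is the one that keeps working when the markup gets messier.

```python
import re
from html.parser import HTMLParser

html = '<p><a href="/">Home</a> <abbr>etc</abbr> <a class="x">Docs</a></p>'

# The regex from above; \b stops "<a" from also matching "<abbr".
regex_links = re.findall(r"<a\b[^>]*>(.*?)</a>", html, re.DOTALL)
print(regex_links)  # ['Home', 'Docs']


class LinkText(HTMLParser):
    """Collects the text inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.links[-1] += data


parser = LinkText()
parser.feed(html)
print(parser.links)  # ['Home', 'Docs']
```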
Saturday, May 29, 2021
 
lewiguez
answered 5 Months ago
35

Just remove all non-ASCII characters:

>>> s.decode('utf8').encode('ascii', errors='ignore')
'http://www.google.com blah blah#%#@$^blah'

Other possible solution:

>>> import string
>>> s = '\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'http://www.google.com blah blah#%#@$^blah'

Or use Regular expressions:

>>> import re
>>> re.sub(r'[^\x00-\x7f]',r'', s)
'http://www.google.com blah blah#%#@$^blah'

Pick your favorite one.
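
The snippets above are written for Python 2, where `s` is a byte string with a `decode` method and `filter` returns a string. A possible Python 3 version of the three options is sketched below, assuming `s` is already a decoded `str`; the variable names are illustrative only.

```python
import re
import string

s = "\u201chttp://www.google.com\u201d blah blah#%#@$^blah"

# 1. Round-trip through ASCII, dropping whatever cannot be encoded.
ascii_only = s.encode("ascii", errors="ignore").decode("ascii")

# 2. Keep only characters listed in string.printable.
printable = set(string.printable)
printable_only = "".join(ch for ch in s if ch in printable)

# 3. Delete every character outside the ASCII range.
regex_only = re.sub(r"[^\x00-\x7f]", "", s)

print(ascii_only)  # http://www.google.com blah blah#%#@$^blah
```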

Tuesday, August 10, 2021
 
im1dermike
answered 3 Months ago
68

I'm guessing that the source of the URL is more at fault. Perhaps you're fixing the wrong problem? Removing "strange" characters from a URI might give it an entirely different meaning.

With that said, you may be able to remove all of the non-ASCII characters with a simple string replacement:

String fixed = original.replaceAll("[^\\x20-\\x7e]", "");

Or you can extend that to all non-four-byte-UTF-8 characters if that doesn't cover the "�" character:

String fixed = original.replaceAll("[^\\u0000-\\uFFFF]", "");
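
The difference between the two character classes is easier to see on a concrete string. The following Python sketch applies both of them to an invented sample; it shows the behaviour of Python's `re` module and is meant only as an illustration of what each class keeps.

```python
import re

# Sample text: an accented letter (U+00E9) and an emoji (U+1F600).
text = "caf\u00e9 \U0001F600 ok"

# Printable ASCII only: removes both the accented letter and the emoji.
ascii_printable = re.sub(r"[^\x20-\x7e]", "", text)

# Everything up to U+FFFF: keeps the accented letter, removes only the
# emoji, which lies outside the Basic Multilingual Plane.
bmp_only = re.sub(r"[^\u0000-\uFFFF]", "", text)

print(ascii(ascii_printable))  # 'caf  ok'
print(ascii(bmp_only))         # 'caf\xe9  ok'
```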
Friday, August 13, 2021
 
juananrey
answered 3 Months ago