
I would like to replace invalid UTF-8 characters with question marks (PHP 5.3.5).

So far I have this solution, but invalid characters are removed, instead of being replaced by '?'.

function replace_invalid_utf8($str)
{
  return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

echo mb_substitute_character()."\n";

echo replace_invalid_utf8('éééaaaàààeeé')."\n";
echo replace_invalid_utf8('eeeaaaaaaeeé')."\n";

Should output:

63 // ASCII code for '?' character
???aaa???eé // or ??aa??eé
eeeaaaaaaeeé

But currently outputs:

63
aaaee // removed invalid characters
eeeaaaaaaeeé

Any advice?

Would you do it another way (using a preg_replace() for example?)

Thanks.

 Answers

14

You can use mb_convert_encoding(), or htmlspecialchars() with the ENT_SUBSTITUTE option (available since PHP 5.4). Of course, you can use preg_match() too. If you use intl, you can use UConverter (available since PHP 5.5).

The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).

// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);

function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
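
As a quick check of the substitute-character setting described above, here is a minimal sketch that sets U+FFFD, scrubs a string containing a byte (0xFF) that can never appear in UTF-8, and then restores the previous setting. The helper name scrub_with_replacement_char() is hypothetical, and how many substitutes are emitted for longer ill-formed sequences may differ between PHP versions.

```php
<?php
// Hypothetical helper (not part of the answer above): scrub a string using
// U+FFFD as the substitute character, then restore the previous setting.
function scrub_with_replacement_char(string $str): string
{
    // Returns an int code point, or one of "none", "long", "entity".
    $previous = mb_substitute_character();

    mb_substitute_character(0xFFFD);
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');

    // Put the global setting back so other code is not affected.
    mb_substitute_character($previous);

    return $clean;
}

$clean = scrub_with_replacement_char("a\xFFb");

// The result should now be well-formed UTF-8.
var_dump(mb_check_encoding($clean, 'UTF-8'));

// Hex dump, to see where the substitute (ef bf bd) was inserted.
echo bin2hex($clean), PHP_EOL;
```

Restoring the previous value matters because mb_substitute_character() changes a process-wide setting.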

UConverter offers both a procedural and an object-oriented API.

function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest-form vulnerability. The range of trail bytes changes depending on the lead byte.

lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)

You can refer to the following resources to check the byte ranges.

  1. "Syntax of UTF-8 Byte Sequences" in RFC 3629
  2. "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
  3. "Multilingual form encoding" in W3C Internationalization

The byte range table is shown below.

      Code Points    First Byte Second Byte Third Byte Fourth Byte
  U+0000 -   U+007F   00 - 7F
  U+0080 -   U+07FF   C2 - DF    80 - BF
  U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
  U+1000 -   U+CFFF   E1 - EC    80 - BF     80 - BF
  U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
  U+E000 -   U+FFFF   EE - EF    80 - BF     80 - BF
 U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
 U+40000 -  U+FFFFF   F1 - F3    80 - BF     80 - BF    80 - BF
U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF
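
To make the dependency between lead byte and second byte concrete, here is a small sketch that looks up the allowed second-byte range from the table above. The function name utf8_second_byte_range() is hypothetical; it returns null for ASCII bytes (which have no second byte) and for bytes that are never valid lead bytes.

```php
<?php
// Hypothetical lookup based on the byte range table above.
// Returns [min, max] for the second byte, or null if $lead is ASCII
// or cannot start a multi-byte sequence (0x80-0xC1, 0xF5-0xFF).
function utf8_second_byte_range(int $lead): ?array
{
    // Lead bytes with a narrowed second-byte range.
    switch ($lead) {
        case 0xE0: return [0xA0, 0xBF]; // excludes non-shortest 3-byte forms
        case 0xED: return [0x80, 0x9F]; // excludes surrogates U+D800 - U+DFFF
        case 0xF0: return [0x90, 0xBF]; // excludes non-shortest 4-byte forms
        case 0xF4: return [0x80, 0x8F]; // excludes code points above U+10FFFF
    }

    // All remaining valid lead bytes allow the full trail-byte range.
    if ($lead >= 0xC2 && $lead <= 0xF3) {
        return [0x80, 0xBF];
    }

    return null;
}

// 0xC0 would start the non-shortest form of "/" (C0 AF), so it is rejected.
var_dump(utf8_second_byte_range(0xC0));

// E0 80 AF is also non-shortest: 0x80 is below the allowed minimum 0xA0.
var_dump(utf8_second_byte_range(0xE0));
```

The four special cases are exactly the rows of the table where the second-byte column is not 80 - BF.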

How to replace invalid byte sequences without breaking valid characters is described in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in The Unicode Standard.

The Unicode Standard shows an example:

before: <61    F1 80 80  E1 80  C2    62    80    63    80    BF    64  >
after:  <0061  FFFD      FFFD   FFFD  0062  FFFD  0063  FFFD  FFFD  0064>

Here is an implementation using preg_replace_callback() that follows the rule above.

function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                          #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]               #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]          #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    #   U+1000 -   U+CFFF
      | \xED[\x80-\x9F][\x80-\xBF]          #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}            #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})      # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                     #   U+0800 -   U+0FFF (invalid)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]       #   U+1000 -   U+CFFF (invalid)
      | \xED[\x80-\x9F]                     #   U+D000 -   U+D7FF (invalid)
      | \xF0[\x90-\xBF][\x80-\xBF]?         #  U+10000 -  U+3FFFF (invalid)
      |[\xF1-\xF3][\x80-\xBF]{1,2}          #  U+40000 -  U+FFFFF (invalid)
      | \xF4[\x80-\x8F][\x80-\xBF]?)        # U+100000 - U+10FFFF (invalid)
      |(.)                                  # invalid 1-byte
    /xs';

    // $matches[1]: valid character
    // $matches[2]: invalid 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte

    $ret = preg_replace_callback($regex, function($matches) use($substitute) {

        if (isset($matches[2]) || isset($matches[3])) {

            return $substitute;

        }

        return $matches[1];

    }, $str);

    return $ret;
}

Alternatively, you can compare bytes directly, which avoids the byte-size restriction of the preg functions.

function replace_invalid_byte_sequence6($str) {

    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';

    $pos = 0;
    // Initialized here because they are passed by reference below.
    $char = '';
    $char_size = 0;
    $valid = false;

    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }

    return $ret;
}

function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;

    if ($str_size <= $pos) {
        return false;
    }

    if ($str[$pos] < "\x80") {

        $valid = true;
        $char_size =  1;

    } else if ($str[$pos] < "\xC2") {

        $char_size = 1;

    } else if ($str[$pos] < "\xE0")  {

        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {

            $char_size = 1;

        } else {

            $valid = true;
            $char_size = 2;

        }

    } else if ($str[$pos] < "\xF0") {

        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {

            $char_size = 1;

        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {

            $char_size = 2;

        } else {

            $valid = true;
            $char_size = 3;

       }

    } else if ($str[$pos] < "\xF5") {

        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {

            $char_size = 1;

        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {

            $char_size = 2;

        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {

            $char_size = 3;

        } else {

            $valid = true;
            $char_size = 4;

        }

    } else {

        $char_size = 1;

    }

    $char = substr($str, $pos, $char_size);
    $pos += $char_size;

    return true;
}

The test case is here.

function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
         return array_map($callable, $arguments);
    }, $callables);
}

$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",

    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];

var_dump(run([
    'replace_invalid_byte_sequence', 
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));

As a note, mb_convert_encoding() has a bug: it breaks a valid character that comes just after an invalid byte sequence, or removes an invalid byte sequence that follows valid characters without adding U+FFFD.

$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];

Although preg_match() can be used instead of preg_replace_callback(), this function has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.

str_repeat('a', 10000)
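
Because the preg functions signal such failures only through their return value, it may help to check for errors explicitly. The following is a minimal sketch, assuming a hypothetical wrapper named replace_or_fail(); it relies only on the documented behavior that preg_replace_callback() returns null on error and that preg_last_error() reports the error code. Whether a given long input actually fails depends on the pattern and the PCRE build.

```php
<?php
// Hypothetical wrapper: run preg_replace_callback() and turn a PCRE
// failure (null return value) into an exception instead of losing data.
function replace_or_fail(string $regex, callable $callback, string $str): string
{
    $ret = preg_replace_callback($regex, $callback, $str);

    if ($ret === null) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    return $ret;
}

// A simple pattern handles the long test string without trouble.
$long = str_repeat('a', 10000);
$out = replace_or_fail('/a/', function ($m) { return 'b'; }, $long);
var_dump(strlen($out));

// With the /u modifier, an ill-formed subject makes PCRE fail,
// which the wrapper reports instead of returning null.
try {
    replace_or_fail('/./u', function ($m) { return $m[0]; }, "\xFF");
} catch (RuntimeException $e) {
    echo $e->getMessage(), PHP_EOL;
}
```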

Finally, the results of my benchmark are as follows.

mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparison
0.71933102607727

The benchmark code is here.

function timer(array $callables, array $arguments, $repeat = 10000)
{

    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {

        $start = microtime(true);

        do {

            array_map($callable, $arguments);

        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;

    }

    return $ret;
}

$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparison' => 'replace_invalid_byte_sequence6'
];

foreach (timer($functions, $data) as $description => $time) {

    echo $description, PHP_EOL,
         $time, PHP_EOL;

}
Wednesday, March 31, 2021
 
PandemoniumSyndicate
answered 7 Months ago
25

When [dropping] the encoding settings mentioned above all characters [are rendered] correctly but the encoding that is detected shows either windows-1252 or ISO-8859-1 depending on the browser.

Then that's what you're really sending. None of the encoding settings in your bullet list will actually modify your output in any way; all they do is tell the browser what encoding to assume when interpreting what you send. That's why you're getting those ?s - you're telling the browser that what you're sending is UTF-8, but it's really ISO-8859-1.
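
A minimal sketch of that mismatch, using "é" as an example: the ISO-8859-1 byte is not valid UTF-8 on its own, so labeling the output as UTF-8 does not make it so. The variable names are only illustrative.

```php
<?php
// The same character "é" in two encodings.
$latin1 = "\xE9";      // ISO-8859-1: a single byte
$utf8   = "\xC3\xA9";  // UTF-8: two bytes

// The ISO-8859-1 byte is ill-formed when interpreted as UTF-8,
// which is why a browser told to expect UTF-8 cannot display it.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)

// A charset declaration only labels the bytes; an actual conversion
// is needed to change them.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($converted === $utf8);                // bool(true)
```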

Wednesday, March 31, 2021
 
Whakkee
answered 7 Months ago
10

My best guess is that the filename itself isn't using UTF-8. Or at least scandir() isn't picking it up like that.

Maybe mb_detect_encoding() can shed some light?

var_dump(mb_detect_encoding($filename));

If not, try to guess the encoding (CP1252 or ISO-8859-1 would be my first guess) and convert it to UTF-8, see if the output is valid:

var_dump(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'));
var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-1'));
var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-15'));

Or using iconv():

var_dump(iconv('WINDOWS-1252', 'UTF-8', $filename));
var_dump(iconv('ISO-8859-1',   'UTF-8', $filename));
var_dump(iconv('ISO-8859-15',  'UTF-8', $filename));

Then when you've figured out which encoding is actually used, your code should look somewhat like this (assuming CP1252):

$filename = htmlentities(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'), ENT_QUOTES, 'UTF-8');
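
One way to combine these steps is sketched below: keep the string if it is already valid UTF-8, otherwise convert it from an assumed legacy encoding. The function name ensure_utf8() and the default fallback of Windows-1252 are assumptions for illustration; the right fallback depends on where the filenames come from.

```php
<?php
// Hypothetical helper: return $str unchanged if it is well-formed UTF-8,
// otherwise convert it from the assumed fallback encoding.
function ensure_utf8(string $str, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

// "café" with a Windows-1252 encoded "é" gets converted...
echo bin2hex(ensure_utf8("caf\xE9")), PHP_EOL;

// ...while an already valid UTF-8 string is left alone.
echo bin2hex(ensure_utf8("caf\xC3\xA9")), PHP_EOL;
```

Checking validity first avoids double-encoding strings that are already UTF-8.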
Wednesday, March 31, 2021
 
MDDY
answered 7 Months ago
73

I've put together some solutions and finally it works.

What I've done is the following: first, I combined all the suggested solutions by adding these lines:

ini_set('default_charset', 'UTF-8');
iconv_set_encoding("input_encoding", "UTF-8");
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");
mb_internal_encoding("UTF-8");

This did not work.

I looked at all the links, the utf8_encode - utf8_decode method didn't work. Then I took a look at the functions, I found the mbstring, so I replaced every string function with its mbstring equivalent.

This worked fine. Then, I figured out that mb_internal_encoding("UTF-8"); is enough. So now it works. Thanks for all the suggestions!
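
To illustrate why swapping in the mbstring functions matters, here is a small sketch comparing byte-oriented and character-oriented functions on the UTF-8 string "héllo". The sample string is only an example.

```php
<?php
mb_internal_encoding('UTF-8');

$word = "h\xC3\xA9llo"; // "héllo" in UTF-8 (the "é" takes two bytes)

// strlen() counts bytes, mb_strlen() counts characters.
var_dump(strlen($word));    // int(6)
var_dump(mb_strlen($word)); // int(5)

// substr() can cut a character in half, leaving ill-formed UTF-8.
$bytes = substr($word, 0, 2);
var_dump(mb_check_encoding($bytes, 'UTF-8')); // bool(false)

// mb_substr() works on whole characters.
$chars = mb_substr($word, 0, 2);
var_dump($chars === "h\xC3\xA9");             // bool(true)

// mb_strtoupper() also handles the non-ASCII letter.
var_dump(mb_strtoupper($word) === "H\xC3\x89LLO"); // bool(true)
```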

Wednesday, March 31, 2021
 
ALH
answered 7 Months ago
72

UPDATE Indeed this is a PHP bug on Windows. There are workarounds like below, but the best solution I have seen is to use the WFIO extension. This extension provides a new protocol wfio:// for file streams and allows PHP to properly handle UTF-8 characters on the Windows file-system. wfio:// supports a number of PHP functions including fopen, scandir, mkdir, copy, rename, etc.

original solution

So this problem is related to a PHP bug on Windows: http://bugs.php.net/bug.php?id=47096

Unicode characters get mangled by PHP on move_uploaded_file, although I have also seen the issue with rename and ZipArchive, so I think it's a general issue with PHP and Windows.

I have adapted a workaround from WordPress found here. I have to store the file with the mangled file name and then sanitize it on download/email/display.

Here are the adapted methods I'm using in case it's of use to someone in future. This still isn't much use if you're trying to zip files before downloading/emailing or you need to write the files to a network share.

public static function sanitizeFilename($filename, $utf8 = true)
{
    if ( self::seems_utf8($filename) == $utf8 )
        return $filename;

    // On Windows platforms, PHP will mangle non-ASCII characters, see http://bugs.php.net/bug.php?id=47096
    if ( 'WIN' == substr( PHP_OS, 0, 3 ) ) {
        if ( setlocale( LC_CTYPE, 0 ) == 'C' ) {
            // Locale has not been set and the default is being used, according to answer by Colin Morelli at
            // http://stackoverflow.com/questions/13788415/how-to-retrieve-the-current-windows-codepage-in-php
            // thus, we force the locale to be explicitly set to the default system locale
            $codepage = 'Windows-' . trim( strstr( setlocale( LC_CTYPE, '' ), '.' ), '.' );
        } else {
            $codepage = 'Windows-' . trim( strstr( setlocale( LC_CTYPE, 0 ), '.' ), '.' );
        }
        $charset = 'UTF-8';
        if ( function_exists( 'iconv' ) ) {
            if ( false == $utf8 ) {
                $filename = iconv( $charset, $codepage . '//IGNORE', $filename );
            } else {
                $filename = iconv( $codepage, $charset, $filename );
            }
        } elseif ( function_exists( 'mb_convert_encoding' ) ) {
            if ( false == $utf8 )
                $filename = mb_convert_encoding( $filename, $codepage, $charset );
            else
                $filename = mb_convert_encoding( $filename, $charset, $codepage );
        }
    }

    return $filename;
}

public static function seems_utf8($str) {
    $length = strlen($str);
    for ($i=0; $i < $length; $i++) {
            $c = ord($str[$i]);
            if ($c < 0x80) $n = 0; # 0bbbbbbb
            elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
            elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
            elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
            elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
            elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
            else return false; # Does not match any model
            for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
                    if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                            return false;
            }
    }
    return true;

}
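
Note that the bit-pattern check above only looks at the lead/continuation byte pattern, so it accepts non-shortest forms such as C0 AF as well as 5- and 6-byte sequences. Where mbstring is available, a stricter check could look like the following sketch; the function name is_strict_utf8() is hypothetical.

```php
<?php
// Hypothetical stricter alternative: rely on mbstring's validation,
// which rejects non-shortest forms and encoded surrogates.
function is_strict_utf8(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

var_dump(is_strict_utf8("\xE2\x82\xAC")); // U+20AC, well-formed
var_dump(is_strict_utf8("\xC0\xAF"));     // non-shortest form of "/"
var_dump(is_strict_utf8("\xED\xA0\x80")); // encoded surrogate U+D800
```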
Saturday, August 21, 2021
 
kiruwka
answered 2 Months ago