Asked  6 Months ago    Answers:  5   Viewed   113 times

How can I save a JSON-encoded string with international characters to the database and then parse the decoded string in the browser?

<?php           
    $string = "très agréable";  
    // to the database 
    $j_encoded = json_encode(utf8_encode($string)); 
    // get from Database 
    $j_decoded = json_decode($j_encoded); 
?>    
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
    <?= $j_decoded ?>
</html> 

 Answers

62

This is an encoding issue. It looks like at some point, the data gets represented as ISO-8859-1.

Every part of your process needs to be UTF-8 encoded.

  • The database connection

  • The database tables

  • Your PHP file (if you are using special characters inside that file as shown in your example above)

  • The content-type headers that you output
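As a minimal sketch of the PHP side only, assuming the source file is saved as UTF-8 and PHP 7.3 or later (for JSON_THROW_ON_ERROR), the round trip from the question works once utf8_encode() is dropped. The variable names are illustrative and the database step is left out:

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal below is UTF-8 bytes.
$string = "très agréable";

// utf8_encode() treats its input as ISO-8859-1; applied to text that is already
// UTF-8 it double-encodes. mb_convert_encoding() reproduces that effect here.
$doubleEncoded = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

// Encode the UTF-8 string directly instead.
$jEncoded = json_encode($string, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);
$jDecoded = json_decode($jEncoded, false, 512, JSON_THROW_ON_ERROR);

// Declare the charset of the response before any output is sent.
header('Content-Type: text/html; charset=utf-8');
echo htmlspecialchars($jDecoded, ENT_QUOTES, 'UTF-8'), "\n";
```

Comparing $doubleEncoded with $string shows the damage that the extra conversion does; the same mismatch can occur at the database connection or in the response headers.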

Wednesday, June 23, 2021
 
Juriy
answered 6 Months ago
44
  • mb_internal_encoding('UTF-8') doesn't do anything by itself, it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
  • IMO mb_detect_encoding is mostly useless since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate meta data like headers or meta tags where the encoding is specified.
  • Using mb_check_encoding to check if a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.
  • Regarding:

    does this mean I have to use all multi byte functions instead of its core functions

    If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_ functions to avoid getting wrong results. The core string functions work at the byte level, whereas you typically want character-level operations when working with strings.

  • utf8_general_ci vs. utf8_bin only makes a difference when collating, i.e. sorting and comparing strings. With utf8_bin data is treated in binary form, i.e. only identical data is identical. With utf8_general_ci some logic is applied, e.g. "é" sorts together with "e" and upper case is considered equal to lower case.
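A short sketch of the points above, assuming the mbstring extension is loaded and the source file is saved as UTF-8. requireUtf8() is a hypothetical helper name, not a built-in function:

```php
<?php
// Set once so the mb_ calls below need no explicit $encoding argument.
mb_internal_encoding('UTF-8');

/**
 * Hypothetical helper: accept a blob only if it is valid UTF-8.
 */
function requireUtf8(string $blob): string
{
    if (!mb_check_encoding($blob, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $blob;
}

$text = requireUtf8("très agréable");

$byteCount = strlen($text);    // counts bytes
$charCount = mb_strlen($text); // counts characters

echo $byteCount, " bytes, ", $charCount, " characters\n";
echo mb_strtoupper($text), "\n";

try {
    // 0xE8 is "è" in ISO-8859-1, but it is not valid UTF-8 in this position.
    requireUtf8("tr\xE8s");
} catch (InvalidArgumentException $e) {
    echo "Rejected: ", $e->getMessage(), "\n";
}
```

The two counts differ because each accented character occupies two bytes in UTF-8.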
Wednesday, March 31, 2021
 
ammezie
answered 9 Months ago
17

With the code you've got in your example, the output is:

json_encode($response, JSON_UNESCAPED_UNICODE);
"package":"zv???tkanalouce"

The question marks are introduced by mb_convert_encoding. This happens when you use encoding detection ("auto" as the third parameter) and the detection cannot handle a character in the input, so the character is replaced with a question mark. An example line:

$row['url'] = mb_convert_encoding($tmprow['url'], "UTF-8", "auto");

This also means that the data coming out of your database is not UTF-8 encoded because mb_convert_encoding($buffer, 'UTF-8', 'auto'); does not introduce question marks if $buffer is UTF-8 encoded.

Therefore you need to find out which charset is used in your database connection because the database driver will convert strings into the encoding of the connection.
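If the connection charset turns out to be a known single-byte encoding, naming it explicitly avoids the detection step altogether. The sketch below assumes ISO-8859-2 purely for illustration; the sample bytes and the $fromDb variable are hypothetical and not taken from the question:

```php
<?php
// Hypothetical sample: "zvířátka" as ISO-8859-2 bytes, standing in for a value
// read over a connection that is not using UTF-8.
$fromDb = "zv\xED\xF8\xE1tka";

// The raw bytes are not valid UTF-8.
var_dump(mb_check_encoding($fromDb, 'UTF-8'));

// Name the source encoding explicitly instead of relying on "auto".
$utf8 = mb_convert_encoding($fromDb, 'UTF-8', 'ISO-8859-2');

var_dump(mb_check_encoding($utf8, 'UTF-8'));
echo json_encode(['package' => $utf8], JSON_UNESCAPED_UNICODE), "\n";
```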

The easiest way is to tell the database link that you want UTF-8 strings and then just use them:

$mysqli = new mysqli("localhost", "my_user", "my_password", "test");

/* check connection */
if (mysqli_connect_errno()) {
    printf("Connect failed: %s\n", mysqli_connect_error());
    exit();
}

/* change character set to utf8 */
if (!$mysqli->set_charset("utf8")) {
    printf("Error loading character set utf8: %s\n", $mysqli->error);
} else {
    printf("Current character set: %s\n", $mysqli->character_set_name());
}

The previous code example just shows how to set the default client character set to UTF-8 with mysqli. It is taken from the PHP manual; see also the related material on this site, e.g. "utf 8 - PHP and MySQLi UTF8".

You can then greatly improve your code:

$response = $result->fetch_all(MYSQLI_ASSOC);

$json = json_encode($response, JSON_UNESCAPED_UNICODE);

if (FALSE === $json) {
    throw new LogicException(
        sprintf('Not json: %d - %s', json_last_error(), json_last_error_msg())
    );
}

header('Content-Type: application/json'); 
echo $json;
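
To see why the error check matters, here is a small sketch that runs without a database. The $valid and $broken arrays are made-up stand-ins for fetched rows, and the source file is assumed to be saved as UTF-8:

```php
<?php
$valid  = ['package' => 'très agréable'];
$broken = ['package' => "tr\xE8s"]; // ISO-8859-1 byte, invalid as UTF-8

// Default output escapes non-ASCII characters as \uXXXX sequences.
echo json_encode($valid), "\n";
// JSON_UNESCAPED_UNICODE keeps them as literal UTF-8.
echo json_encode($valid, JSON_UNESCAPED_UNICODE), "\n";

// Invalid UTF-8 makes json_encode() return false (it does not throw by default).
$json = json_encode($broken);
if ($json === false) {
    echo json_last_error(), ' - ', json_last_error_msg(), "\n";
}
```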
Wednesday, March 31, 2021
 
mistero
answered 9 Months ago
17

The string literal and the text in the file are not equivalent. $text is already UTF-8 (Тайна), so iconv does nothing to it. This is because the escape sequences in the literal put the actual byte values into the string. The data in the file, \xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0, is not interpreted that way: it was read from a file and stored in a variable, so it is not a string literal and the backslash sequences stay as plain text. Try this to convert the data:

$text = file_get_contents('log.txt');
$text = str_replace('\x', '', trim($text));
$text = pack('H*', $text);
var_dump($text);
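
The difference can be reproduced without a file. In this sketch a single-quoted string stands in for the file contents; the variable names are illustrative:

```php
<?php
// Single quotes keep the backslashes literal, mimicking text read from a file.
$fromFile = '\xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0';
// Double quotes make PHP interpret the escapes, producing the actual bytes.
$literal = "\xd0\xa2\xd0\xb0\xd0\xb9\xd0\xbd\xd0\xb0";

var_dump(strlen($fromFile)); // 40: ten four-character "\xNN" groups
var_dump(strlen($literal));  // 10: ten raw bytes

// Strip the "\x" markers to get plain hex, then turn the hex into bytes.
$hex = str_replace('\x', '', trim($fromFile));
$decoded = pack('H*', $hex);

var_dump($decoded === $literal);
```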
Friday, May 28, 2021
 
waylaidwanderer
answered 7 Months ago
73

I want to just filter it out

You have got an unspecified encoding/charset with your data. This is a huge problem.

You can first try to convert it into utf-8 and then strip all non-printable characters:

$str = iconv('utf-8', 'utf-8//ignore', $str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);

The problem is that the iconv function can only try: it drops any invalid character sequence. As of PHP 5.4, however, it drops the complete string if the input is invalid in the specified input encoding.

Since PHP 5.3 it already emits a warning that the input string has an invalid encoding.

You can work around this by removing all invalid UTF-8 byte sequences first:

$str = valid_utf8_bytes($str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);

/**
 * get valid utf-8 byte sequences
 *
 * take over all matching bytes, drop an invalid sequence until first
 * non-matching byte.
 * 
 * @param string $str
 * @return string
 */
function valid_utf8_bytes($str)
{
    $return = '';
    $length = strlen($str);
    $invalid = array_flip(array("\xEF\xBF\xBF" /* U-FFFF */, "\xEF\xBF\xBE" /* U-FFFE */));

    for ($i=0; $i < $length; $i++)
    {
        $c = ord($str[$o=$i]);

        if ($c < 0x80) $n=0; # 0bbbbbbb
        elseif (($c & 0xE0) === 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) === 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) === 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) === 0xF8) $n=4; # 111110bb
        else continue; # Does not match

        for ($j=++$n; --$j;) { # n bytes matching 10bbbbbb follow ?
            if ((++$i === $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                continue 2;
            }
        }

        $match = substr($str, $o, $n);

        if ($n === 3 && isset($invalid[$match])) # test invalid sequences
            continue;

        $return .= $match;
    }
    return $return;
}
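
The character-class filter can be tried in isolation. This sketch assumes a UTF-8 source file and uses made-up sample strings; it also shows what preg_replace does when the subject still contains invalid bytes:

```php
<?php
// Valid UTF-8 with a BEL and a NUL control character mixed in.
$str = "très\x07 agréable\x00";

// Keep letters, numbers, punctuation, symbols and separators; drop the rest.
$clean = preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);
var_dump($clean);

// With the /u modifier, invalid UTF-8 in the subject makes preg_replace()
// return null, which is why the invalid bytes have to be removed first.
$result = preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', "tr\xE8s");
var_dump($result === null, preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Note that this filter also removes line breaks and tabs, since they are control characters.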
Saturday, August 7, 2021
 
MasterJoe
answered 4 Months ago