Asked  8 Months ago    Answers:  5   Viewed   42 times

When using "special" Unicode characters they come out as weird garbage when encoded to JSON:

php > echo json_encode(['foo' => '馬']);
{"foo":"\u99ac"}

Why? Have I done something wrong with my encodings?

(This is a reference question to clarify the topic once and for all, since this comes up again and again.)

 Answers

26

First of all: there's nothing wrong here. This is how characters can be encoded in JSON, and it is part of the official standard. The syntax is based on how string literals are formed in JavaScript/ECMAScript (section 7.8.4, "String Literals") and is described as follows:

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

In short: any character can be encoded as \u...., where .... is the Unicode code point of the character (or the code point of half of a UTF-16 surrogate pair, for characters outside the BMP).

"馬"
"\u99ac"

These two string literals represent the exact same character; they're absolutely equivalent. When these string literals are parsed by a compliant JSON parser, they will both result in the string "馬". They don't look the same, but they mean the same thing in the JSON data encoding format.
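
As a quick check of the equivalence described above, here is a minimal sketch, assuming PHP 7 or later with the standard json extension and a UTF-8 source file. It decodes both literals and compares the results. The second half illustrates the surrogate-pair case for a character outside the BMP; the chosen code point (U+1F600) is only an example.

```php
<?php
// JSON text for the same one-character string, written two ways.
// Single quotes keep PHP from interpreting the backslash itself.
$escaped = '"\u99ac"';
$literal = '"馬"';

// Both decode to the identical PHP string.
var_dump(json_decode($escaped) === json_decode($literal)); // bool(true)

// A character outside the BMP (U+1F600) is written as two \u escapes,
// one for each half of its UTF-16 surrogate pair.
$astral = json_decode('"\ud83d\ude00"');
var_dump($astral === "\u{1F600}"); // bool(true)
```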

PHP's json_encode preferably encodes non-ASCII characters using \u.... escape sequences. Technically it doesn't have to, but it does. And the result is perfectly valid. If you prefer to have literal characters in your JSON instead of escape sequences, you can set the JSON_UNESCAPED_UNICODE flag in PHP 5.4 or higher:

php > echo json_encode(['foo' => '馬'], JSON_UNESCAPED_UNICODE);
{"foo":"馬"}

To emphasise: this is just a preference, it is not necessary in any way to transport "Unicode characters" in JSON.
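
To illustrate that the flag only changes how the JSON text looks, the following sketch (assuming a UTF-8 source file) encodes the same array both ways and confirms that the decoded data is identical:

```php
<?php
$data = ['foo' => '馬'];

$escaped   = json_encode($data);                         // {"foo":"\u99ac"}
$unescaped = json_encode($data, JSON_UNESCAPED_UNICODE); // {"foo":"馬"}

// The two JSON texts differ byte for byte...
var_dump($escaped === $unescaped); // bool(false)

// ...but they decode to the same data.
var_dump(json_decode($escaped, true) === json_decode($unescaped, true)); // bool(true)
```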

Wednesday, March 31, 2021
 
TheTechnicalPaladin
answered 8 Months ago
64

Going by Gumbo and Pekka's advice, I wrote curl_exec_utf8

/** The same as curl_exec except tries its best to convert the output to utf8 **/
function curl_exec_utf8($ch) {
    $data = curl_exec($ch);
    if (!is_string($data)) return $data;

    unset($charset);
    $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    /* 1: HTTP Content-Type: header */
    preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
    if ( isset( $matches[3] ) )
        $charset = $matches[3];

    /* 2: <meta> element in the page */
    if (!isset($charset)) {
        preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches );
        if ( isset( $matches[3] ) ) {
            $charset = $matches[3];
            /* In case we want to do further processing downstream: */
            $data = preg_replace('@(<meta\s+http-equiv="Content-Type"\s+content="[\w/]+\s*;\s*charset=)([^\s"]+)@i', '$1utf-8', $data, 1);
        }
    }

    /* 3: <xml> element in the page */
    if (!isset($charset)) {
        preg_match( '@<\?xml.+encoding="([^\s"]+)@si', $data, $matches );
        if ( isset( $matches[1] ) ) {
            $charset = $matches[1];
            /* In case we want to do further processing downstream: */
            $data = preg_replace('@(<\?xml.+encoding=")([^\s"]+)@si', '$1utf-8', $data, 1);
        }
    }

    /* 4: PHP's heuristic detection */
    if (!isset($charset)) {
        $encoding = mb_detect_encoding($data);
        if ($encoding)
            $charset = $encoding;
    }

    /* 5: Default for HTML */
    if (!isset($charset)) {
        if (strpos((string) $content_type, "text/html") === 0)
            $charset = "ISO-8859-1";
    }

    /* Convert it if it is anything but UTF-8 */
    /* You can change "UTF-8"  to "UTF-8//IGNORE" to 
       ignore conversion errors and still output something reasonable */
    if (isset($charset) && strtoupper($charset) != "UTF-8")
        $data = iconv($charset, 'UTF-8', $data);

    return $data;
}

The regexes are mostly from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type
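
Step 1 and the final conversion can be tried in isolation, without a network request. The sketch below is an illustration only: charset_from_content_type is a hypothetical helper name that is not part of the function above, and the header values and sample bytes are made up. It assumes the iconv extension is available.

```php
<?php
// Hypothetical helper (not part of curl_exec_utf8): returns the charset
// parameter of a Content-Type header value, or null if there is none.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('@([\w/+]+)(;\s*charset=(\S+))?@i', $contentType, $m) && isset($m[3])) {
        return $m[3];
    }
    return null;
}

var_dump(charset_from_content_type('text/html; charset=ISO-8859-1')); // string(10) "ISO-8859-1"
var_dump(charset_from_content_type('application/json'));              // NULL

// Converting a known ISO-8859-1 byte sequence to UTF-8 with iconv.
$latin1 = "caf\xE9";                          // "café" in ISO-8859-1
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);
var_dump($utf8 === "caf\xC3\xA9");            // bool(true)
```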

Wednesday, March 31, 2021
 
treeface
answered 8 Months ago
17

With the code you've got in your example, the output is:

json_encode($response, JSON_UNESCAPED_UNICODE);
"package":"zv???tkanalouce"

You see the question marks in there because they were introduced by mb_convert_encoding. This happens when you use encoding detection ("auto" as the third parameter) and the detection cannot handle a character in the input, so that character is replaced with a question mark. An example line of code:

$row['url'] = mb_convert_encoding($tmprow['url'], "UTF-8", "auto");

This also means that the data coming out of your database is not UTF-8 encoded because mb_convert_encoding($buffer, 'UTF-8', 'auto'); does not introduce question marks if $buffer is UTF-8 encoded.

Therefore you need to find out which charset is used in your database connection because the database driver will convert strings into the encoding of the connection.
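
To see why the connection encoding matters, here is a small sketch assuming the mbstring extension is loaded. The sample bytes are an assumption: they are the word "zvíře" encoded as ISO-8859-2, standing in for whatever a non-UTF-8 connection might return. It shows that such a string is rejected by json_encode, and that converting from a known source encoding (rather than "auto") loses nothing.

```php
<?php
// Assumed sample: "zvíře" encoded as ISO-8859-2, not as UTF-8.
$raw = "zv\xED\xF8e";

var_dump(mb_check_encoding($raw, 'UTF-8'));      // bool(false): not valid UTF-8
var_dump(json_encode($raw));                     // bool(false): json_encode refuses it
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)

// Converting from the known source encoding is lossless.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-2');
var_dump(mb_check_encoding($utf8, 'UTF-8'));     // bool(true)
echo json_encode($utf8, JSON_UNESCAPED_UNICODE), "\n";
```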

The easiest approach is to tell the database connection that you want UTF-8 strings and then use them as they are:

$mysqli = new mysqli("localhost", "my_user", "my_password", "test");

/* check connection */
if (mysqli_connect_errno()) {
    printf("Connect failed: %s\n", mysqli_connect_error());
    exit();
}

/* change character set to utf8 */
if (!$mysqli->set_charset("utf8")) {
    printf("Error loading character set utf8: %s\n", $mysqli->error);
} else {
    printf("Current character set: %s\n", $mysqli->character_set_name());
}

The previous code example shows how to set the default client character set to UTF-8 with mysqli. It is taken from the PHP manual; see also the related material on this site, e.g. "utf 8 - PHP and MySQLi UTF8".

You can then greatly improve your code:

$response = $result->fetch_all(MYSQLI_ASSOC);

$json = json_encode($response, JSON_UNESCAPED_UNICODE);

if (FALSE === $json) {
    throw new LogicException(
        sprintf('Not json: %d - %s', json_last_error(), json_last_error_msg())
    );
}

header('Content-Type: application/json'); 
echo $json;
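
As an alternative to checking for FALSE by hand, PHP 7.3 and later offer the JSON_THROW_ON_ERROR flag, which makes json_encode throw a JsonException. A minimal sketch, where encode_rows is a hypothetical helper name and the invalid byte sequence is made up to trigger the error:

```php
<?php
// Hypothetical helper: encodes rows and throws JsonException on failure.
// JSON_THROW_ON_ERROR requires PHP 7.3 or later.
function encode_rows(array $rows): string
{
    return json_encode($rows, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR);
}

try {
    // Invalid UTF-8 on purpose, to show the failure path.
    echo encode_rows([['package' => "zv\xED\xF8e"]]);
} catch (JsonException $e) {
    echo 'Not json: ', $e->getCode(), ' - ', $e->getMessage(), "\n";
}
```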
Wednesday, March 31, 2021
 
mistero
answered 8 Months ago
22

Just explaining my comment:

objects in foreach loops are always passed by reference

When you use a foreach loop over an array of objects, the loop variable is a pointer to each object, so it works like a reference: any change to the object inside the loop is also a change to the object outside. This is because:

objects are always passed by reference (@user3137702 quote)

Detailed and official explanation here.


When you copy and unset your variable:

$copyThing = $thing;
unset($copyThing->property);

you are creating another pointer to the same object and unsetting a property through it, so the original value is gone. In fact, since the foreach loop also uses a pointer, the $things array is affected as well.

Check this ideone (notice the var_dump, where the 'a' property is gone; the output is the same as the one you got).


I do not know in which version it changed, if ever, as it seems like default object/pointer behavior


As a workaround (some ideas):

  1. Copy your initial array
  2. Use clone: $x = clone($obj); (as long as the default shallow copy works for your objects)
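
The behaviour described above and the clone workaround can be sketched as follows. The variable names mirror the ones used in this answer, and the property values are made up for illustration.

```php
<?php
$thing = new stdClass();
$thing->a = 1;
$thing->b = 2;
$things = [$thing];

// Assignment copies the object handle, so both variables name one object.
$copyThing = $thing;
unset($copyThing->a);
var_dump(isset($things[0]->a)); // bool(false): the array element changed too

// clone makes a separate (shallow) copy that can be changed independently.
$clonedThing = clone $thing;
unset($clonedThing->b);
var_dump(isset($thing->b));     // bool(true): the original keeps its property
```

Note that clone copies only the top-level object; properties that hold other objects still point at the same instances unless the class implements __clone.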
Saturday, May 29, 2021
 
Kevin_Kinsey
answered 5 Months ago
42

Like jrturton mentions, ¹, ² and ³ were from a legacy character set (Latin 1) and therefore included in a different place. This also means that lots of fonts don't have support for more superscript numbers, as many only strive for Latin, Greek and Cyrillic with a few punctuation symbols thrown in. So the remaining ones are taken from a different font over which you as an author have little to no control.

As an example:

Superscript numbers

Those are the superscript numerals from 1 to 9 and 0. The run of text was formatted in Arial in Word. You see what happened to the rest of them. Contrary to what jrturton believes, there is no reshaping of existing glyphs involved. This is just font substitution.
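
The "different place" is visible in the code points themselves. A small sketch, assuming PHP 7.2 or later with mbstring and a UTF-8 source file, prints the code point of each superscript digit:

```php
<?php
$superscripts = ['¹', '²', '³', '⁴', '⁵', '⁶', '⁷', '⁸', '⁹', '⁰'];

// Map each character to its Unicode code point.
$codePoints = [];
foreach ($superscripts as $ch) {
    $codePoints[$ch] = mb_ord($ch, 'UTF-8');
    printf("%s U+%04X\n", $ch, $codePoints[$ch]);
}
```

¹, ² and ³ come out as U+00B9, U+00B2 and U+00B3 (the Latin-1 Supplement block), while the remaining digits fall in U+2070 through U+2079 (the Superscripts and Subscripts block), which many fonts do not cover.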

Wednesday, July 7, 2021
 
radmen
answered 4 Months ago