Asked  7 Months ago    Answers:  5   Viewed   40 times

Is there a function that will change UTF-8 to Unicode leaving non special characters as normal letters and numbers?

ie the German word "tchüß" would be rendered as something like "tch20AC21AC" (please note that I am making the Unicode codes up).

EDIT: I am experimenting with the following function, but although this one works well with ASCII 32-127, it seems to fail for double byte chars:

function strToHex ($string)
{
    $hex = '';
    for ($i = 0; $i < mb_strlen ($string, "utf-8"); $i++)
    {
        $id = ord (mb_substr ($string, $i, 1, "utf-8"));
        $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, "utf-8") : "&#" . $id . ";";
}

    return ($hex);
}

Any ideas?

EDIT 2: Found solution: The PHP ord() function does not work for double byte chars. Use instead: http://nl.php.net/manual/en/function.ord.php#78032

 Answers

94

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF is already an Unicode encoding.

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php

Wednesday, March 31, 2021
 
aurelijusv
answered 7 Months ago
64

Going by Gumbo and Pekka's advice, I wrote curl_exec_utf8

/** The same as curl_exec except tries its best to convert the output to utf8 **/
function curl_exec_utf8($ch) {
    $data = curl_exec($ch);
    if (!is_string($data)) return $data;

    unset($charset);
    $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    /* 1: HTTP Content-Type: header */
    preg_match( '@([w/+]+)(;s*charset=(S+))?@i', $content_type, $matches );
    if ( isset( $matches[3] ) )
        $charset = $matches[3];

    /* 2: <meta> element in the page */
    if (!isset($charset)) {
        preg_match( '@<metas+http-equiv="Content-Type"s+content="([w/]+)(;s*charset=([^s"]+))?@i', $data, $matches );
        if ( isset( $matches[3] ) ) {
            $charset = $matches[3];
            /* In case we want do do further processing downstream: */
            $data = preg_replace('@(<metas+http-equiv="Content-Type"s+content="[w/]+s*;s*charset=)([^s"]+)@i', '$1utf-8', $data, 1);
        }
    }

    /* 3: <xml> element in the page */
    if (!isset($charset)) {
        preg_match( '@<?xml.+encoding="([^s"]+)@si', $data, $matches );
        if ( isset( $matches[1] ) ) {
            $charset = $matches[1];
            /* In case we want do do further processing downstream: */
            $data = preg_replace('@(<?xml.+encoding=")([^s"]+)@si', '$1utf-8', $data, 1);
        }
    }

    /* 4: PHP's heuristic detection */
    if (!isset($charset)) {
        $encoding = mb_detect_encoding($data);
        if ($encoding)
            $charset = $encoding;
    }

    /* 5: Default for HTML */
    if (!isset($charset)) {
        if (strstr($content_type, "text/html") === 0)
            $charset = "ISO 8859-1";
    }

    /* Convert it if it is anything but UTF-8 */
    /* You can change "UTF-8"  to "UTF-8//IGNORE" to 
       ignore conversion errors and still output something reasonable */
    if (isset($charset) && strtoupper($charset) != "UTF-8")
        $data = iconv($charset, 'UTF-8', $data);

    return $data;
}

The regexes are mostly from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type

Wednesday, March 31, 2021
 
treeface
answered 7 Months ago
21

Well, finally I solved it using:

mb_convert_encoding($text,'ISO-8859-15','utf-8');
Wednesday, March 31, 2021
 
Noob_Programmer
answered 7 Months ago
60

This should do it:

var byteArray = new Uint8Array(text.length * 2);
for (var i = 0; i < text.length; i++) {
    byteArray[i*2] = text.charCodeAt(i) // & 0xff;
    byteArray[i*2+1] = text.charCodeAt(i) >> 8 // & 0xff;
}

It's the inverse of your decodeUTF16LE function. Notice that neither works with code points outside of the BMP.

Saturday, May 29, 2021
 
Farnabaz
answered 5 Months ago
92

I suspect you mean an Em Dash (—). ISO-8859-1 doesn't include this character, so you aren't going to have much luck converting it to that encoding.

You could use htmlentities(), but I'd suggest moving off ISO-8859-1 to UTF-8 for publication.

Saturday, May 29, 2021
 
Besnik
answered 5 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :