Asked  7 Months ago    Answers:  5   Viewed   32 times

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?

For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}

 Answers

35

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are those in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
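
To make the 0x80-0x9F difference concrete, here is a small sketch that decodes the same byte under both encodings. It assumes the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Decode the single byte 0x93 under both encodings (requires mbstring).
$byte = "\x93";

$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n"; // e2809c (U+201C, left double quotation mark)
echo bin2hex($asLatin1), "\n"; // c293   (U+0093, a C1 control character)
```

Both results are valid UTF-8, so the choice only affects which character you end up with, not whether the output is safe to store.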

would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional ‘strict’ parameter of mb_detect_encoding to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know if that can still happen in strict mode.
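
As a rough illustration of the strict parameter (a sketch, assuming mbstring is available; the sample strings are made up), passing UTF-8 as the only candidate makes the function return false for input that is not valid UTF-8:

```php
<?php
// Requires the mbstring extension. With UTF-8 as the only candidate and
// strict mode on, input that is not valid UTF-8 yields false.
$valid   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1 / Windows-1252

var_dump(mb_detect_encoding($valid, 'UTF-8', true));   // string(5) "UTF-8"
var_dump(mb_detect_encoding($invalid, 'UTF-8', true)); // bool(false)
```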

If you want to be sure, do it yourself using the W3C-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);
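
If mbstring is available, a shorter variant of the same idea might use mb_check_encoding instead of the regex. This is only a sketch: force_utf8 is a hypothetical helper name, and unlike the regex above this check also accepts ASCII control characters other than tab, LF and CR.

```php
<?php
/**
 * Hypothetical helper: return $string unchanged if it is already valid
 * UTF-8, otherwise assume Windows-1252 and convert it (requires mbstring).
 */
function force_utf8(string $string): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', 'Windows-1252');
}

echo bin2hex(force_utf8("caf\xE9")), "\n"; // 636166c3a9
```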
Wednesday, March 31, 2021
 
DMTintner
answered 7 Months ago
81

You have probably mixed encodings. For example, a page that is sent as ISO-8859-1 but receives UTF-8 text from MySQL or XML will typically display incorrectly.

To solve this problem, you must keep track of the encoding of every input in relation to the encoding you have chosen to use internally.

If you send the page as ISO-8859-1, the input from the user will also be ISO-8859-1.

header('Content-Type: text/html; charset=iso-8859-1');

And if MySQL sends latin1, you do not have to do anything.

But if your input is not ISO-8859-1, you must convert it before sending it to the user, or adapt it for MySQL before storing it.

mb_convert_encoding($text, mb_internal_encoding(), 'UTF-8'); // If it's UTF-8 to internal encoding

In short, you must always convert input to fit the internal encoding and convert output to match the external encoding.


This is the internal encoding I have chosen to use.

mb_internal_encoding('iso-8859-1'); // Internal encoding
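
A minimal sketch of that convert-in, convert-out rule, assuming ISO-8859-1 as the internal encoding and a UTF-8 source and consumer (the variable names are hypothetical):

```php
<?php
mb_internal_encoding('ISO-8859-1'); // internal encoding, as chosen above

// Input arrives as UTF-8 (for example from an XML feed): convert on the way in.
$fromFeed = "caf\xC3\xA9"; // "café" in UTF-8
$internal = mb_convert_encoding($fromFeed, mb_internal_encoding(), 'UTF-8');

// Output goes to a UTF-8 consumer: convert on the way out.
$outgoing = mb_convert_encoding($internal, 'UTF-8', mb_internal_encoding());

echo bin2hex($internal), "\n"; // 636166e9
echo bin2hex($outgoing), "\n"; // 636166c3a9
```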

This is the code I use.

mb_language('uni'); // Mail encoding
mb_internal_encoding('iso-8859-1'); // Internal encoding
mb_http_output('pass'); // Skip

function convert_encoding($text, $from_code='', $to_code='')
{
    if (empty($from_code))
    {
        $from_code = mb_detect_encoding($text, 'auto');
        if ($from_code == 'ASCII')
        {
            $from_code = 'iso-8859-1';
        }
    }

    if (empty($to_code))
    {
        return mb_convert_encoding($text, mb_internal_encoding(), $from_code);
    }
    return mb_convert_encoding($text, $to_code, $from_code);
}

function encoding_html($text, $code='')
{
    if (empty($code))
    {
        return htmlentities($text, ENT_NOQUOTES, mb_internal_encoding());
    }

    return mb_convert_encoding(htmlentities($text, ENT_NOQUOTES, $code), mb_internal_encoding(), $code);
}
function decoding_html($text, $code='')
{
    if (empty($code))
    {
        return html_entity_decode($text, ENT_NOQUOTES, mb_internal_encoding());
    }

    return mb_convert_encoding(html_entity_decode($text, ENT_NOQUOTES, $code), mb_internal_encoding(), $code);
}
Wednesday, March 31, 2021
 
capsid
answered 7 Months ago
66

Make sure the connection to your database is also using this character set:

$conn = mysql_connect($server, $username, $password);
mysql_set_charset("UTF8", $conn);

According to the documentation of mysql_set_charset at php.net:

Note:
This is the preferred way to change the charset. Using mysql_query() to execute 
SET NAMES .. is not recommended.

See also: http://nl3.php.net/manual/en/function.mysql-set-charset.php

Check the character set of your current connection with:

echo mysql_client_encoding($conn);

See also: http://nl3.php.net/manual/en/function.mysql-client-encoding.php

If you have done these things and then add unusual characters to your table, you will see that they are displayed correctly.

Wednesday, March 31, 2021
 
laurent
answered 7 Months ago
46

What you're looking for is the Unicode code point, i.e. the numeric identifier by which the character is known in the Unicode character table. The "cheapest" way to do this is through the UCS-2 character encoding, which maps 1:1 from bytes onto the Unicode code points:

echo bin2hex(iconv('UTF-8', 'UCS-2BE', 'あ')); // BE pins big-endian byte order
// 3042

Caveats: the returned code is always 4 hexadecimal digits long (which you may or may not like) and UCS-2 does not support characters higher than the BMP, i.e. higher than code point FFFF.
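
If you need code points above the BMP, one possible workaround is to go through UTF-32 instead. The following is a sketch under the assumption that mbstring is available; utf8_code_point is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: return the Unicode code point of the first
 * character of a UTF-8 string, including characters outside the BMP.
 */
function utf8_code_point(string $char): int
{
    // UTF-32BE stores each code point as one 4-byte big-endian integer.
    $utf32 = mb_convert_encoding($char, 'UTF-32BE', 'UTF-8');
    return unpack('N', substr($utf32, 0, 4))[1];
}

echo dechex(utf8_code_point("\xE3\x81\x82")), "\n";     // 3042
echo dechex(utf8_code_point("\xF0\x9F\x98\x80")), "\n"; // 1f600
```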

Wednesday, March 31, 2021
 
RemiX
answered 7 Months ago
79

You probably need to tell it to look in the global scope:


     function doSomething()
     {
         global $con;
         $con->tralalala();
     }
Friday, May 28, 2021
 
Jesse
answered 5 Months ago