Asked  4 Months ago    Answers:  5   Viewed   55 times

Does PHP have any standard function(s) to convert Unicode strings to plain, good old-fashioned ANSI strings (or whatever format PHP's htmlentities understands)?

Is there any function that converts UTF-8 strings to HTML that can be understood by the most popular browsers?

 Answers

61

This can't work properly. Unicode can represent many more characters than ANSI, so if you "convert" to ANSI, you will lose a lot of characters.

http://php.net/manual/en/function.htmlentities.php

You can use the Unicode (UTF-8) charset with htmlentities:

string htmlentities ( string $string [, int $flags = ENT_COMPAT [, string $charset [, bool $double_encode = true ]]] )

htmlentities($myString, ENT_COMPAT, "UTF-8"); should work.

Thursday, August 5, 2021
 
CoderGuy123
answered 4 Months ago
44
  • mb_internal_encoding('UTF-8') doesn't do anything by itself, it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
  • IMO mb_detect_encoding is mostly useless since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate meta data like headers or meta tags where the encoding is specified.
  • Using mb_check_encoding to check if a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.
  • Regarding:

    does this mean I have to use all multi byte functions instead of its core functions

    If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_ functions to avoid getting wrong results. The core string functions work only at the byte level, whereas you typically want to work at the character level when handling strings.

  • utf8_general_ci vs. utf8_bin only makes a difference when collating, i.e. sorting and comparing strings. With utf8_bin data is treated in binary form, i.e. only identical data is identical. With utf8_general_ci some logic is applied, e.g. "é" sorts together with "e" and upper case is considered equal to lower case.
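
The byte-level versus character-level distinction above does not depend on the language. The following is a minimal sketch in Python rather than PHP, used only as an illustration; the sample string and variable names are made up for this example. The byte count corresponds to what PHP's strlen reports for a UTF-8 string, and the character count to what mb_strlen reports.

```python
# Illustration only: Python stands in for PHP here.
text = "héllo"                  # 5 characters, one of them non-ASCII
raw = text.encode("utf-8")      # the bytes a byte-level function operates on

char_count = len(text)          # character-level count
byte_count = len(raw)           # byte-level count ("é" takes 2 bytes in UTF-8)

# Cutting at a byte offset can split a multibyte character in half.
cut_bytes = raw[:2]             # "h" plus only the first byte of "é"
cut_is_valid = True
try:
    cut_bytes.decode("utf-8")
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, cut_is_valid)
```

The truncated slice is no longer valid UTF-8, which is the kind of wrong result a byte-level function can produce on multibyte text.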
Wednesday, March 31, 2021
 
ammezie
answered 9 Months ago
93

The solutions are platform-dependent. On Windows, use the MultiByteToWideChar and WideCharToMultiByte API functions. On Unix/Linux platforms, the iconv library is quite popular.

Friday, July 16, 2021
 
Uours
answered 5 Months ago
73

I want to just filter it out

You have got an unspecified encoding/charset with your data. This is a huge problem.

You can first try to convert it to UTF-8 and then strip all non-printable characters:

$str = iconv('utf-8', 'utf-8//ignore', $str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);

The problem is that the iconv function can only try: it drops any invalid character sequence. As of PHP 5.4, however, it will drop the complete string if the input encoding specified is invalid.

Since PHP 5.3 you will already see a warning that the input string has an invalid encoding.

You can work around this by removing all invalid UTF-8 byte sequences first:

$str = valid_utf8_bytes($str);

echo preg_replace('/[^\pL\pN\pP\pS\pZ]/u', '', $str);

/**
 * Get valid UTF-8 byte sequences
 *
 * take over all matching bytes, drop an invalid sequence until first
 * non-matching byte.
 * 
 * @param string $str
 * @return string
 */
function valid_utf8_bytes($str)
{
    $return = '';
    $length = strlen($str);
    $invalid = array_flip(array("\xEF\xBF\xBF" /* U-FFFF */, "\xEF\xBF\xBE" /* U-FFFE */));

    for ($i=0; $i < $length; $i++)
    {
        $c = ord($str[$o=$i]);

        if ($c < 0x80) $n=0; # 0bbbbbbb
        elseif (($c & 0xE0) === 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) === 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) === 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) === 0xF8) $n=4; # 111110bb
        else continue; # Does not match

        for ($j = ++$n; --$j;) { # n bytes matching 10bbbbbb follow ?
            if ((++$i === $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                continue 2;
            }
        }

        $match = substr($str, $o, $n);

        if ($n === 3 && isset($invalid[$match])) # test invalid sequences
            continue;

        $return .= $match;
    }
    return $return;
}
Saturday, August 7, 2021
 
MasterJoe
answered 4 Months ago
94

Assuming your starting string is a Unicode string with literal backslashes, you first need a byte string to use the unicode-escape codec, but the octal escapes are UTF-8, so you'll need to convert it again to a byte string and then decode as UTF-8:

>>> s = r'training\345\256\214\346\210\220\345\276\214.txt'
>>> s
'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1')
b'training\\345\\256\\214\\346\\210\\220\\345\\276\\214.txt'
>>> s.encode('latin1').decode('unicode-escape')
'trainingå®\x8cæ\x88\x90å¾\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'training\xe5\xae\x8c\xe6\x88\x90\xe5\xbe\x8c.txt'
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'training完成後.txt'

Note that the latin1 codec does a direct translation of Unicode codepoints U+0000 to U+00FF to bytes 00-FF.
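
The chain of conversions above can be wrapped in a small helper. This is only a sketch: the function name decode_octal_utf8 is made up for this example, and it assumes the input contains nothing above U+00FF and that the escaped bytes really are UTF-8. Note that the unicode-escape codec also interprets other backslash escapes (such as \n or \x41), not just octal ones.

```python
def decode_octal_utf8(escaped: str) -> str:
    # Hypothetical helper: turns literal octal escapes that spell out
    # UTF-8 bytes back into the characters those bytes encode.
    raw = escaped.encode("latin1")             # str -> bytes (fails above U+00FF)
    unescaped = raw.decode("unicode-escape")   # backslash escapes -> code points 0-255
    utf8_bytes = unescaped.encode("latin1")    # code points 0-255 -> identical byte values
    return utf8_bytes.decode("utf8")           # interpret those bytes as UTF-8


name = decode_octal_utf8(r"training\345\256\214\346\210\220\345\276\214.txt")
print(name)
```

If the escaped bytes are not valid UTF-8, the final decode raises UnicodeDecodeError, which is probably the behaviour you want rather than silently producing garbage.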

Thursday, November 25, 2021
 
AntoineB
answered 1 Week ago