
I have this code to decode numeric HTML entities to the equivalent UTF-8 character.

I'm trying to convert this entity:

&#146;

which should output:

’

However, it just disappears (no output). (I've checked the source code of the page, and the page has the correct UTF-8 charset headers/meta tags.)

Does anyone know what is wrong with the code?

function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {    
     $string = html_entity_decode($string, $quote_style, $charset);

     $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
     $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\1")', $string);

    //this is another method, which also doesn't work.. 
     //$string = preg_replace_callback("/(&#[0-9]+;)/", "entity_decode_callback", $string);

     return $string; 
}




function chr_utf8_callback($matches) { 
     return chr_utf8(hexdec($matches[1])); 
}

function chr_utf8($num) {   
     if ($num < 128) return chr($num);
     if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
     if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
     if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
     return '';
}

function entity_decode_callback($m) { 
     return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); 
} 

 echo '=' . entity_decode('&#146;');

 Answers

56

html_entity_decode already does what you're looking for:

$string = '&#146;';

echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

It will return the character:

’   binary hex: c292

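Which is PRIVATE USE TWO (U+0092), a C1 control character with no visible glyph. Because it is a control character rather than a printable one, browsers usually render nothing for it, and your PHP configuration/version/compile might not return it at all.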
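A quick way to see why nothing visible appears is to compare the bytes. This is a minimal sketch that assumes PHP 7 or later (for the "\u{...}" escape syntax); it puts U+0092 next to the right single quotation mark (U+2019) that the question expected.

```php
<?php
// U+0092 is a C1 control character; its UTF-8 form is the two bytes C2 92.
$control = "\u{0092}";
// U+2019 (right single quotation mark) is the character the question expected.
$quote = "\u{2019}";

$controlHex = bin2hex($control); // c292
$quoteHex = bin2hex($quote);     // e28099

echo $controlHex, "\n", $quoteHex, "\n";
```

So a decoder that maps &#146; straight to code point 146 is doing what it was told; the result is just not a printable character.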

Also there are some more quirks:

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
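The browser behaviour described above can be approximated in PHP with a small lookup table. This is only a sketch: decode_cp1252_references() is a hypothetical helper name, not a PHP built-in, and it assumes PHP 7 or later. It rewrites numeric references in the 128 to 159 range as the matching Windows-1252 characters and leaves every other reference for html_entity_decode() to handle.

```php
<?php
// Hypothetical helper (not part of PHP): decode numeric character references
// in the 128-159 range as Windows-1252 bytes, the way browsers do.
function decode_cp1252_references(string $html): string
{
    // Windows-1252 byte => UTF-8 string. Bytes 0x81, 0x8D, 0x8F, 0x90 and
    // 0x9D are unassigned in Windows-1252, so they are not in the map.
    $map = [
        0x80 => "\u{20AC}", 0x82 => "\u{201A}", 0x83 => "\u{0192}",
        0x84 => "\u{201E}", 0x85 => "\u{2026}", 0x86 => "\u{2020}",
        0x87 => "\u{2021}", 0x88 => "\u{02C6}", 0x89 => "\u{2030}",
        0x8A => "\u{0160}", 0x8B => "\u{2039}", 0x8C => "\u{0152}",
        0x8E => "\u{017D}", 0x91 => "\u{2018}", 0x92 => "\u{2019}",
        0x93 => "\u{201C}", 0x94 => "\u{201D}", 0x95 => "\u{2022}",
        0x96 => "\u{2013}", 0x97 => "\u{2014}", 0x98 => "\u{02DC}",
        0x99 => "\u{2122}", 0x9A => "\u{0161}", 0x9B => "\u{203A}",
        0x9C => "\u{0153}", 0x9E => "\u{017E}", 0x9F => "\u{0178}",
    ];

    return preg_replace_callback(
        '~&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));~i',
        function (array $m) use ($map): string {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            $codePoint = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            // References outside the map are returned unchanged.
            return $map[$codePoint] ?? $m[0];
        },
        $html
    );
}

echo bin2hex(decode_cp1252_references('&#146;')), "\n"; // e28099
```

Running this before html_entity_decode() means &#146; becomes the curly apostrophe instead of an invisible control character.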

See: &#146; is getting converted as “u0092” by nokogiri in ruby on rails

Wednesday, March 31, 2021
 
THEK
answered 7 Months ago
89

This may be a job for the mb_detect_encoding() function.

In my limited experience with it, it's not 100% reliable when used as a generic "encoding sniffer" - It checks for the presence of certain characters and byte values to make an educated guess - but in this narrow case (it'll need to distinguish just between UTF-8 and ISO-8859-1 ) it should work.

<?php
$text = $entity['Entity']['title'];

echo 'Original : ', $text."<br />";
$enc = mb_detect_encoding($text, "UTF-8,ISO-8859-1");

echo 'Detected encoding '.$enc."<br />";

echo 'Fixed result: '.iconv($enc, "UTF-8", $text)."<br />";

?>

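You may get an incorrect detection result for strings that contain no special characters, but that is not a problem: plain ASCII text is byte-for-byte identical in UTF-8 and ISO-8859-1, so the conversion leaves it unchanged.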
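If mbstring is not available, the same two-way decision can be sketched with a UTF-8 validity check instead of detection. This swaps in a different technique from the answer above: to_utf8() is a hypothetical helper, and it assumes the input is either valid UTF-8 or ISO-8859-1 and that the iconv extension is enabled.

```php
<?php
// Hypothetical helper: return $text as UTF-8, assuming it is either
// already valid UTF-8 or encoded as ISO-8859-1.
function to_utf8(string $text): string
{
    // With the /u modifier PCRE rejects subjects that are not valid UTF-8,
    // so preg_match() returns 1 only for well-formed input.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    // Every byte maps to a character in ISO-8859-1, so this cannot fail
    // on invalid sequences.
    return iconv('ISO-8859-1', 'UTF-8', $text);
}

echo bin2hex(to_utf8("caf\xE9")), "\n"; // 636166c3a9
```

The limitation is the same as with detection: an ISO-8859-1 string whose bytes happen to form valid UTF-8 will be passed through unconverted.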

Wednesday, March 31, 2021
 
keisar
answered 7 Months ago
48

Your code works for me :-?

In the manual page for htmlentities() we can read:

Return Values

Returns the encoded string.

If the input string contains an invalid code unit sequence within the given encoding an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flags are set.

My guess is that the input data is not properly encoded as UTF-8 and the function is returning an empty string. (Assuming that the script is not crashing, i.e., code after that part still runs.)
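The quoted htmlentities() behaviour is easy to reproduce. The snippet below is a small illustration, assuming PHP 7 or later; it feeds an ISO-8859-1 byte to htmlentities() while claiming the input is UTF-8, with and without ENT_SUBSTITUTE.

```php
<?php
// "\xE9" is "é" in ISO-8859-1 but an invalid byte sequence in UTF-8.
$latin1 = "caf\xE9";

// No error-handling flag: the whole result is an empty string.
$plain = htmlentities($latin1, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE: the invalid byte becomes U+FFFD (replacement character).
$substituted = htmlentities($latin1, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

var_dump($plain === '');                    // bool(true)
var_dump($substituted === "caf\u{FFFD}");   // bool(true)
```

If the empty-string case matches what you see, check the encoding of the input before looking at the decoding code.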

Wednesday, March 31, 2021
 
MassiveAttack
answered 7 Months ago
11

â€“ is common mojibake for an en dash (–), which is a different character from a hyphen.

It is the result of taking the UTF-8–encoded form of the dash (0xe2 0x80 0x93) and incorrectly assuming that it is actually encoded using Windows-1252.

Interpreting those three bytes as Windows-1252: 0xe2, 0x80 and 0x93 separately represent â, € and “.

Assuming the offending character is in the blurb field, if you query SELECT HEX(blurb) FROM tpf_parks (with a suitable WHERE clause), you will see the hex encoding of the offending bytes.

If you see E28093 in there, then the database value is correctly encoded as UTF-8 and there will be a character encoding mismatch in your client or server configuration.

If, however, you see C3A2E282ACE2809C, then the character has already been encoded incorrectly in the database — i.e. interpreted incorrectly, then saved as the UTF-8 representation of those 3 characters. If this is the case you'll need to update the data to fix the issue. You could do this using iconv:

$fixedData = iconv("utf-8", "windows-1252", $badData);

This will convert the doubly-converted bytes back to the UTF-8 encoding.
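The round trip can be checked end to end. This sketch assumes PHP 7 or later and an iconv build that recognises the name 'Windows-1252' (glibc does); it first reproduces the mistake on an en dash and then undoes it with the same call as above.

```php
<?php
// UTF-8 bytes of an en dash (U+2013): E2 80 93.
$enDash = "\u{2013}";

// Simulate the mistake: treat those bytes as Windows-1252 text and
// re-encode the three resulting characters as UTF-8.
$badData = iconv('Windows-1252', 'UTF-8', $enDash);

// Reverse it: map the three characters back to single Windows-1252 bytes,
// which are exactly the original UTF-8 bytes of the dash.
$fixedData = iconv('UTF-8', 'Windows-1252', $badData);

echo strtoupper(bin2hex($badData)), "\n";   // C3A2E282ACE2809C
echo strtoupper(bin2hex($fixedData)), "\n"; // E28093
```

Note that iconv() returns false if the data contains characters that do not exist in Windows-1252, so apply the fix only to values known to be double-encoded.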

Saturday, May 29, 2021
 
Hilmi
answered 5 Months ago
36

Then maybe you need HttpUtility.HtmlDecode. It should work; you just need to add a reference to System.Web. At least this was the way in .NET Framework < 4.

For example the following code:

MessageBox.Show(HttpUtility.HtmlDecode("&amp;&copy;"));

Worked and the output was as expected (ampersand and copyright symbol). Are you sure the problem is within HtmlDecode and not something else?

UPDATE: Another class capable of doing the job, WebUtility (again via its HtmlDecode method), arrived in newer versions of .NET. However, there seem to be some problems with it. See the HttpUtility vs. WebUtility question.

Saturday, June 19, 2021
 
AlterPHP
answered 4 Months ago