Asked  7 Months ago    Answers:  5   Viewed   43 times

I am loading an HTML document from an external server. The markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž, etc. When I load the HTML with file_get_contents() like this:

$html = file_get_contents('http://example.com/foreign.html');

It messes up the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of proper UTF-8 characters.

How can I solve this?

UPDATE:

I tried both saving the HTML to a file and outputting it with UTF-8 encoding. Neither works, which suggests that file_get_contents() is already returning broken HTML.

UPDATE2:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
<head>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="Content-Language" content="sk" />
<title>Test</title>

</head>
<body>


<?php

$html = file_get_contents('http://example.com');
echo htmlentities($html);

?>

</body>
</html>

 Answers

97

Alright. I have found out that file_get_contents() is not causing this problem. There's a different reason, which I talk about in another question. Silly me.

See this question: Why Does DOM Change Encoding?
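
One way to confirm that the bytes coming back from file_get_contents() are intact is to test them for well-formed UTF-8 before anything else touches them. This is only a minimal sketch: the helper name looksLikeValidUtf8() is made up for illustration, and a literal string stands in for the downloaded HTML.

```php
<?php
// Hypothetical helper (not part of the original answer): returns true when
// $bytes is well-formed UTF-8. preg_match() with the /u modifier returns
// false instead of 1 when the subject contains malformed UTF-8.
function looksLikeValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Stand-in for the result of file_get_contents().
// "\xC5\xA1" is "š" and "\xC5\xBE" is "ž" encoded as UTF-8.
$html = "<p>\xC5\xA1 \xC5\xBE</p>";

$intact = looksLikeValidUtf8($html);       // well-formed input
$truncated = looksLikeValidUtf8("\xC5");   // lone lead byte, malformed
```

If the check passes on the raw download, the damage is happening in a later step (parsing, escaping, or output) rather than in file_get_contents() itself.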

Wednesday, March 31, 2021
 
BenOfTheNorth
answered 7 Months ago
95
$message = str_replace("$SAD", "HAPPY", $message);

needs to be:

$message = str_replace('$SAD', "HAPPY", $message);

Otherwise PHP will interpolate the variable $SAD into the double-quoted string instead of searching for the literal text. See this post for an explanation of the difference between single and double quotes.
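
The difference can be seen in a small, self-contained sketch. The variable values here are invented for illustration; $SAD is defined only so that the double-quoted version has something to interpolate.

```php
<?php
// Illustrative values, not taken from the original question.
$SAD = 'ignored';
$message = 'I feel $SAD today';   // single quotes: the text "$SAD" is literal

// Double quotes: PHP substitutes the value of $SAD, so the search string
// becomes 'ignored' and nothing in $message matches.
$wrong = str_replace("$SAD", 'HAPPY', $message);

// Single quotes: the search string is the literal four characters "$SAD".
$right = str_replace('$SAD', 'HAPPY', $message);
```

Here $wrong is left unchanged, while $right has the placeholder replaced.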

Saturday, May 29, 2021
 
Vlad
answered 5 Months ago
50

It turns out this is a bug in either Apache httpd or PHP, as well as a bug in Zend Framework v1.x.

The bug occurs when a content-length header's value exceeds the actual content length.

For example,

curl http://localhost/index.php -H "Content-Length: 3" --data "12"

In the above example, a 10-second timeout must be reached after calling file_get_contents('php://input') before the request body is returned.

In Zend Framework v1.x, setting the raw body of a Zend_Http_Client object causes a Content-Length header to be calculated and injected into the request. However, unless the request is a POST, PUT or DELETE request, the content will be omitted from the actual request, which, in turn, triggers the Apache/PHP invalid-content-length bug.
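
The triggering condition itself is easy to state in code. The following is only a sketch of the comparison, with a hypothetical helper name; it does not reproduce the timeout and is not part of Zend Framework or PHP.

```php
<?php
// Hypothetical helper: reports whether a declared Content-Length is larger
// than the body that is actually sent, which is the condition described above.
function declaredLengthExceedsBody(int $declaredLength, string $body): bool
{
    return $declaredLength > strlen($body);
}

// Mirrors the curl example: header says 3 bytes, body "12" is only 2 bytes.
$mismatch = declaredLengthExceedsBody(3, '12');

// A header that matches the body length does not trigger the condition.
$consistent = declaredLengthExceedsBody(2, '12');
```

A client could run a check like this before sending, or simply drop the Content-Length header whenever the body is omitted.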

I have opened a bug with PHP and will also open a bug with Zend Framework.

Saturday, May 29, 2021
 
MannfromReno
answered 5 Months ago
72

UPDATE: Indeed, this is a PHP bug on Windows. There are workarounds like the one below, but the best solution I have seen is to use the WFIO extension. This extension provides a new protocol, wfio://, for file streams and allows PHP to properly handle UTF-8 characters on the Windows file system. wfio:// supports a number of PHP functions including fopen, scandir, mkdir, copy, rename, etc.

original solution

So this problem is related to a PHP bug on Windows: http://bugs.php.net/bug.php?id=47096

Unicode characters get mangled by PHP on move_uploaded_file(), although I have also seen the issue with rename() and ZipArchive, so I think it's a general issue with PHP on Windows.

I have adapted a workaround from WordPress found here. I have to store the file with the mangled file name and then sanitize it on download/email/display.

Here are the adapted methods I'm using in case it's of use to someone in future. This still isn't much use if you're trying to zip files before downloading/emailing or you need to write the files to a network share.

public static function sanitizeFilename($filename, $utf8 = true)
{
    if ( self::seems_utf8($filename) == $utf8 ) {
        return $filename;
    }

    // On Windows platforms, PHP will mangle non-ASCII characters, see http://bugs.php.net/bug.php?id=47096
    if ( 'WIN' == substr( PHP_OS, 0, 3 ) ) {
        if ( setlocale( LC_CTYPE, 0 ) == 'C' ) {
            // Locale has not been set and the default is being used, according to the answer by Colin Morelli at
            // http://stackoverflow.com/questions/13788415/how-to-retrieve-the-current-windows-codepage-in-php
            // Thus, we force the locale to be explicitly set to the default system locale.
            $codepage = 'Windows-' . trim( strstr( setlocale( LC_CTYPE, '' ), '.' ), '.' );
        } else {
            $codepage = 'Windows-' . trim( strstr( setlocale( LC_CTYPE, 0 ), '.' ), '.' );
        }
        $charset = 'UTF-8';
        if ( function_exists( 'iconv' ) ) {
            if ( false == $utf8 ) {
                $filename = iconv( $charset, $codepage . '//IGNORE', $filename );
            } else {
                $filename = iconv( $codepage, $charset, $filename );
            }
        } elseif ( function_exists( 'mb_convert_encoding' ) ) {
            if ( false == $utf8 ) {
                $filename = mb_convert_encoding( $filename, $codepage, $charset );
            } else {
                $filename = mb_convert_encoding( $filename, $charset, $codepage );
            }
        }
    }

    return $filename;
}

public static function seems_utf8($str)
{
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n = 1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n = 2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n = 3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n = 4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n = 5; # 1111110b
        else return false; # Does not match any model
        for ($j = 0; $j < $n; $j++) { # n bytes matching 10bbbbbb follow?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80)) {
                return false;
            }
        }
    }
    return true;
}
Saturday, August 21, 2021
 
kiruwka
answered 2 Months ago
100

Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

EDIT 20140523: Also, watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube - it's just under ten minutes, and a wonderful explanation of the brilliant 'hack' that is UTF-8

Saturday, September 4, 2021
 
Xun Yang
answered 2 Months ago