Asked  7 Months ago    Answers:  5   Viewed   36 times

I'm trying to load and parse a Google Weather API response (Chinese response).

Here is the API call.

// This code fails with the following error
$xml = simplexml_load_file('http://www.google.com/ig/api?weather=11791&hl=zh-CN');

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xB6 0xE0 0xD4 0xC6 in C:\htdocs\weather.php on line 11

Why does loading this response fail?

How do I encode/decode the response so that simplexml loads it properly?

Edit: Here is the code and output.

<?php
$googleData = file_get_contents('http://www.google.com/ig/api?weather=11102&hl=zh-CN');
$xml = simplexml_load_string($googleData);

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0xB6 0xE0 0xD4 0xC6 in C:\htdocs\test4.php on line 3
Call Stack
#  Time    Memory  Function                                Location
1  0.0020  314264  {main}( )                               ..\test4.php:0
2  0.1535  317520  simplexml_load_string ( string(1364) )  ..\test4.php:3

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: t_system data="SI"/>

( ! ) Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in C:\htdocs\test4.php on line 3
Call Stack
#  Time    Memory  Function                                Location
1  0.0020  314264  {main}( )                               ..\test4.php:0
2  0.1535  317520  simplexml_load_string ( string(1364) )  ..\test4.php:3

 Answers

66

The problem here is that SimpleXML doesn't look at the HTTP header to determine the character encoding used in the document. It simply assumes UTF-8, even though Google's server does advertise the encoding as

Content-Type: text/xml; charset=GB2312

You can write a function that looks at that header using the super-secret magic variable $http_response_header and converts the response accordingly. Something like this:

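function sxe($url)
{
    // file_get_contents() over HTTP also populates $http_response_header in this scope
    $xml = file_get_contents($url);
    if ($xml === false)
    {
        return false;
    }

    foreach ($http_response_header as $header)
    {
        // Capture only the charset token, without trailing whitespace or further parameters
        if (preg_match('#^Content-Type:\s*text/xml;\s*charset="?([^\s;"]+)#i', $header, $m))
        {
            switch (strtolower($m[1]))
            {
                case 'utf-8':
                    // already what SimpleXML expects
                    break;

                default:
                    // iconv() handles ISO-8859-1 as well, so utf8_encode()
                    // (deprecated as of PHP 8.2) is not needed
                    $xml = iconv($m[1], 'utf-8', $xml);
            }
            break;
        }
    }

    return simplexml_load_string($xml);
}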
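
To see why the conversion step matters without hitting the network, here is a minimal offline sketch. It wraps the four bytes from the warning in a tiny XML document (the document itself is made up for illustration and is not Google's actual response), then parses it once as-is and once after converting from GB2312 with iconv(). It assumes the iconv and SimpleXML extensions are available.

```php
<?php
// The four bytes reported in the warning, placed in a made-up sample document.
$gb2312 = '<?xml version="1.0"?><condition data="' . "\xB6\xE0\xD4\xC6" . '"/>';

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Without conversion the parser assumes UTF-8 and rejects the bytes.
$before = simplexml_load_string($gb2312);
libxml_clear_errors();

// Transcode to UTF-8 first, then parse again.
$utf8  = iconv('GB2312', 'UTF-8', $gb2312);
$after = simplexml_load_string($utf8);

var_dump($before === false);   // did the raw parse fail?
var_dump($after !== false);    // did the converted parse succeed?
echo $after['data'], "\n";     // the decoded attribute value
```

The same pattern applies to any charset the server announces, as long as iconv() knows it by that name.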
Wednesday, March 31, 2021
 
Angolao
answered 7 Months ago
50
function toCelsius($deg) {
    return ($deg-32)/1.8;
}

If your temperature in °F is here: $current[0]->temp_f['data']

Then all you have to do is this: toCelsius($current[0]->temp_f['data'])
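
As a quick worked example of the conversion, the sketch below runs it on a made-up XML fragment shaped like the element mentioned above. The sample XML and the variable names $weather, $fahrenheit and $celsius are assumptions for illustration. The attribute is cast to float first, because SimpleXML returns attributes as objects.

```php
<?php
function toCelsius($deg) {
    return ($deg - 32) / 1.8;
}

// Made-up sample shaped like the element the answer refers to.
$weather = simplexml_load_string(
    '<weather><current_conditions><temp_f data="77"/></current_conditions></weather>'
);
$current = $weather->xpath('//current_conditions');

// Attributes come back as SimpleXMLElement objects, so cast before doing arithmetic.
$fahrenheit = (float) $current[0]->temp_f['data'];
$celsius    = round(toCelsius($fahrenheit), 1);

echo $fahrenheit, ' F = ', $celsius, " C\n";
```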

Saturday, May 29, 2021
 
Zulakis
answered 5 Months ago
13

From the documentation for <|>:

The parser is called predictive since q is only tried when parser p didn't consume any input (i.e. the look ahead is 1).

In your case both parsers consume "#\" before failing, so the other alternative can't be evaluated. You can use try to ensure backtracking works as expected:

The parser try p behaves like parser p, except that it pretends that it hasn't consumed any input when an error occurs.

Something like this:

try parseSpecialCharNotation <|> parseSingleChar

Side note: it is better to extract "#\" out of the parsers, because otherwise you are doing the same work twice. Something like this:

do
  string "#\\"
  try parseSpecialCharNotation <|> parseSingleChar

Also, you can use the string combinator instead of a series of char parsers.

Thursday, August 26, 2021
 
MannfromReno
answered 2 Months ago
73

Every now and then the API stops working for short periods of time, and over the last few days a 403 has been thrown more often. For my site, it happened 13 times last night. But the site immediately tries again, and on the second or third attempt the data loads without problems. As the API is unofficial, I'm not sure what's causing the 403.

Make sure you cache the data, as the API will temporarily block your IP when you make too many requests. In my case, I cache for 20 minutes, and if no data can be retrieved, the site will not try more than 10 times to reload the API. Once I forgot to turn caching back on after debugging, and as my site made many hundreds of requests (one with every visitor), the IP was blocked within an hour. If I remember correctly, the error was not a 403. Fortunately, the block lasts for less than half a day.
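
The caching and retry policy described above could be sketched roughly as follows. The function name cachedFetch, its parameters and the file-based cache are assumptions for illustration, not the answerer's actual code. The fetcher is passed in as a callable so the sketch can be run without network access.

```php
<?php
// Hypothetical helper: serve a fresh cache file if there is one,
// otherwise call $fetch up to $maxTries times and cache the first success.
function cachedFetch(callable $fetch, string $cacheFile, int $ttl = 1200, int $maxTries = 10)
{
    clearstatcache(true, $cacheFile);

    // Cache hit: the file is younger than $ttl seconds (20 minutes by default).
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }

    // Cache miss: retry immediately, but never more than $maxTries times.
    for ($try = 1; $try <= $maxTries; $try++) {
        $data = $fetch();
        if ($data !== false) {
            file_put_contents($cacheFile, $data, LOCK_EX);
            return $data;
        }
    }

    // Every attempt failed: fall back to a stale copy if one exists.
    return is_file($cacheFile) ? file_get_contents($cacheFile) : false;
}

// Simulated flaky API: fails twice, then succeeds.
$calls = 0;
$flaky = function () use (&$calls) {
    $calls++;
    return $calls < 3 ? false : '<xml_api_reply/>';
};

$cacheFile = tempnam(sys_get_temp_dir(), 'wx');
unlink($cacheFile);                            // start without a cache file

$first  = cachedFetch($flaky, $cacheFile);     // fetched on the third attempt
$second = cachedFetch($flaky, $cacheFile);     // served from the cache
unlink($cacheFile);
```

In real use the callable would wrap the HTTP request, for example a closure that returns the result of file_get_contents() on the API URL.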

Wednesday, September 1, 2021
 
Pratap Vhatkar
answered 2 Months ago
74

The numeric character reference "&#5;" is not legal in valid XML Documents. I refer you to the section 4.1 Character and Entity References in the XML recommendation:

Characters referred to using character references MUST match the production for Char.

Now if we follow the link and look at the production for Char:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

we see that there are some characters that can appear neither literally, nor as a numeric character reference in a valid XML Document.

An oddity that; I've learned something about XML today :).

See this conversation on ASCII control characters in XML for a possible workaround.
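
One possible pre-processing step, sketched here under stated assumptions, is to drop numeric character references that point outside the Char production before handing the text to the parser. The helper names isXmlChar and stripIllegalCharRefs are made up for this example. Note that the regular expression is not XML-aware, so it would also rewrite a literal "&#5;" inside a CDATA section.

```php
<?php
// True when the code point matches the Char production quoted above.
function isXmlChar(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Hypothetical helper: remove numeric character references whose target
// is not a legal XML character; all other references are left untouched.
function stripIllegalCharRefs(string $xml): string
{
    $clean = preg_replace_callback(
        '/&#(x[0-9A-Fa-f]+|[0-9]+);/',
        function (array $m): string {
            // Hexadecimal references start with a lowercase "x".
            $cp = $m[1][0] === 'x' ? hexdec(substr($m[1], 1)) : (int) $m[1];
            return ($cp <= 0x10FFFF && isXmlChar((int) $cp)) ? $m[0] : '';
        },
        $xml
    );

    // preg_replace_callback() returns null on a PCRE error; keep the input then.
    return $clean ?? $xml;
}

$dirty  = '<a>ok&#5;&#x41;&#10;</a>';
$clean  = stripIllegalCharRefs($dirty);
$parsed = simplexml_load_string($clean);

echo $clean, "\n";
```

Silently dropping characters changes the data, so whether this is acceptable depends on what the control characters were meant to carry.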

Wednesday, October 13, 2021
 
Null
answered 2 Weeks ago