Asked  7 Months ago    Answers:  5   Viewed   45 times

DOMDocument seems to convert Chinese characters into codes, for instance,

你的乱发 will become ä½ çš„ä¹±å‘

How can I keep the Chinese or other foreign language as they are instead of converting them into codes?

Below is my simple test,

$dom = new DOMDocument();
$dom->loadHTML($html);

If I add this below before loadHTML(),

$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8"); 

I get,

&#20320;&#30340;&#20081;&#21457;

Even though the converted codes will be displayed as Chinese characters, &#20320;&#30340;&#20081;&#21457; is still not 你的乱发, which is what I am after.

 Answers

40

DOMDocument seems to convert Chinese characters into codes [...]. How can I keep the Chinese or other foreign language as they are instead of converting them into codes?

$dom = new DOMDocument();
$dom->loadHTML($html);

You're using the loadHTML function to load an HTML chunk. By default, DOMDocument expects that string to be in HTML's default encoding (ISO-8859-1); however, most often the charset (sic!) is meta-information provided alongside the string you're using, not inside it. To make this more complicated, that meta-information can even be inside the string.

Anyway as you have not shared the string data of the HTML and you have not specified the encoding, it's hard to tell specifically what is going on.

I assume the HTML is UTF-8 encoded but this is not signalled within the HTML string. So the following work-around can help:

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper

It injects an encoding hint at the very beginning of the string (and removes it after the HTML has been loaded). From that point on, DOMDocument will return UTF-8 (as always).
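As a rough end-to-end check of the workaround above, the sketch below wraps it in a helper and reads the text back out. The function name `loadUtf8Html()` and the sample markup are made up for illustration, and the behaviour is assumed to match a typical PHP build with libxml2; it is not tested against every version.

```php
<?php
// Hypothetical helper (not part of DOMDocument): wraps the encoding-hint workaround.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Prepend the hint so the parser reads the bytes as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Copy the node list first so removing a node does not disturb the iteration.
    foreach (iterator_to_array($doc->childNodes) as $item) {
        if ($item->nodeType === XML_PI_NODE) {
            $doc->removeChild($item); // drop the injected hint
        }
    }
    $doc->encoding = 'UTF-8';

    return $doc;
}

// Assumed sample input: a UTF-8 encoded fragment with no charset declaration.
$doc  = loadUtf8Html('<p>你的乱发</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```

If the hint is honoured, `$text` should hold the original four characters rather than the garbled byte sequence from the question.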

Wednesday, March 31, 2021
 
jenny
answered 7 Months ago
50
<?php
$in = 'nin2 hao3 ma';
$out = 'nín hǎo ma';

// Replaces the vowels of one syllable according to its trailing tone number.
function replacer($match) {
  static $trTable = array(
    1 => array(
      'a' => 'ā',
      'e' => 'ē',
      'i' => 'ī',
      'o' => 'ō',
      'u' => 'ū',
      'ü' => 'ǖ',
      'A' => 'Ā',
      'E' => 'Ē'),
    2 => array('i' => 'í'),
    3 => array('a' => 'ǎ')
  );
  // $match[1] is the syllable, $match[2] is the tone number
  list(, $word, $i) = $match;
  return str_replace(
    array_keys($trTable[$i]),
    array_values($trTable[$i]),
    $word);
}

// Outputs: bool(true)
var_dump(preg_replace_callback('~(\w+)(\d+)~', 'replacer', $in) === $out);
Wednesday, March 31, 2021
 
rblarsen
answered 7 Months ago
91

mb_convert_encoding() isn't the correct function for what you're trying to achieve: you should really be using html_entity_decode() instead, because it will only convert the actual html entities to UTF-8, and won't affect the existing UTF-8 characters in the string.

$text = "äöü &auml; &ouml; &uuml; &#223;";
var_dump(html_entity_decode($text, ENT_COMPAT | ENT_HTML401, 'UTF-8'));

which gives

string(18) "äöü ä ö ü ß"
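Applied to the numeric entities from the question, the same call might look like the sketch below. The variable names and the mixed sample string are assumptions chosen for illustration.

```php
<?php
// Numeric character references for the four characters in the question.
$encoded = '&#20320;&#30340;&#20081;&#21457;';
$decoded = html_entity_decode($encoded, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// Characters that are already UTF-8 pass through unchanged; only entities are decoded.
$mixed = html_entity_decode('你的 &#20081;&#21457;', ENT_COMPAT | ENT_HTML401, 'UTF-8');
```

Here `$decoded` holds the plain UTF-8 characters, and `$mixed` shows that existing characters are not touched.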


Wednesday, March 31, 2021
 
tika
answered 7 Months ago
89

Even if you use string formatting, sometimes you still need whitespace at the beginning or the end of your string. For these cases, neither escaping with \ nor the xml:space attribute helps. You must use an HTML entity such as &#160; for the whitespace.

Use &#160; for non-breakable whitespace.
Use &#032; for regular space.

Wednesday, June 2, 2021
 
Naveen
answered 5 Months ago
81
$colNumber = PHPExcel_Cell::columnIndexFromString($colString);

returns 1 from a $colString of 'A', 26 from 'Z', 27 from 'AA', etc.

and the (almost) reverse

$colString = PHPExcel_Cell::stringFromColumnIndex($colNumber);

returns 'A' from a $colNumber of 0, 'Z' from 25, 'AA' from 26, etc.
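To make the off-by-one between the two directions concrete, here is a plain-PHP sketch of the same mapping. The function names `columnLettersToIndex()` and `indexToColumnLetters()` are hypothetical; this is not PHPExcel's implementation, only an illustration of the 1-based and 0-based conventions described above.

```php
<?php
// Hypothetical helper: 'A' => 1, 'Z' => 26, 'AA' => 27 (1-based, like columnIndexFromString).
function columnLettersToIndex(string $letters): int
{
    $index = 0;
    foreach (str_split(strtoupper($letters)) as $ch) {
        // Treat the letters as digits of a base-26 number without a zero digit.
        $index = $index * 26 + (ord($ch) - ord('A') + 1);
    }
    return $index;
}

// Hypothetical helper: 0 => 'A', 25 => 'Z', 26 => 'AA' (0-based, like stringFromColumnIndex).
function indexToColumnLetters(int $index): string
{
    $letters = '';
    $n = $index + 1; // switch to 1-based counting
    while ($n > 0) {
        $letters = chr(ord('A') + ($n - 1) % 26) . $letters;
        $n = intdiv($n - 1, 26);
    }
    return $letters;
}
```

A round trip through both helpers shifts the value by one, which is why the second call is only the "(almost) reverse" of the first.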

EDIT

A couple of useful tricks:

There is a toArray() method for the worksheet class:

$this->datasets = $this->objPHPExcel->setActiveSheetIndex(0)->toArray();

which accepts the following parameters:

* @param  mixed    $nullValue          Value returned in the array entry if a cell doesn't exist
* @param  boolean  $calculateFormulas  Should formulas be calculated?
* @param  boolean  $formatData         Should formatting be applied to cell values?
* @param  boolean  $returnCellRef      False - Return a simple array of rows and columns indexed by number counting from zero
*                                      True - Return rows and columns indexed by their actual row and column IDs

although it does use the iterators, so would be slightly slower

OR

Take advantage of PHP's ability to increment character strings, Perl-style:

$highestColumn = $this->objPHPExcel->setActiveSheetIndex(0)->getHighestColumn(); // e.g. "EL"
$highestRow = $this->objPHPExcel->setActiveSheetIndex(0)->getHighestRow();

$highestColumn++; // one past the last column, used as the loop's stop value
for ($row = 1; $row < $highestRow + 1; $row++) {
    $dataset = array();
    for ($column = 'A'; $column != $highestColumn; $column++) {
        $dataset[] = $this->objPHPExcel->setActiveSheetIndex(0)->getCell($column . $row)->getValue();
    }
    $this->datasets[] = $dataset;
}

and if you're processing a large number of rows, you might actually notice the performance improvement of ++$row over $row++
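The string increment that the column loop relies on can be tried in isolation, without PHPExcel. In this sketch the highest column 'AB' is an assumed sample value, and `$stop` and `$columns` are illustrative names.

```php
<?php
// Assumed sample value standing in for getHighestColumn().
$highestColumn = 'AB';

// Incrementing a letter string carries over like a spreadsheet column: 'Z' -> 'AA'.
$stop = $highestColumn;
$stop++; // 'AC', one past the last column

$columns = array();
for ($column = 'A'; $column != $stop; $column++) {
    $columns[] = $column;
}
```

The loop compares with `!=` rather than `<=` because string ordering would place 'AA' before 'B' and end the loop early.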

Sunday, August 1, 2021
 
TaylorMac
answered 3 Months ago