
I am retrieving some HTML strings from my database and I would like to parse these strings into my DOMDocument. The problem is that DOMDocument gives warnings on special characters.

Warning: DOMDocumentFragment::appendXML() [domdocumentfragment.appendxml]: Entity: line 2: parser error : Entity 'nbsp' not defined in page.php on line 189

I wonder why this happens and how to solve it. These are some code fragments from my page. How can I fix this kind of warning?

$doc = new DOMDocument();

// .. create some elements first, like some divs and a h1 ..

while($row = mysql_fetch_array($result))
{
    $messageEl = $doc->createDocumentFragment();
    $messageEl->appendXML($row['message']); // gives its warnings here!

    $otherElement->appendChild($messageEl);
}

echo $doc->saveHTML();

I also found something about validation, but when I apply that, my page won't load anymore. The code I tried for that was something like this.

$implementation = new DOMImplementation();
$dtd = $implementation->createDocumentType('html','-//W3C//DTD XHTML 1.0 Transitional//EN','http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd');

$doc = $implementation->createDocument('','',$dtd);
$doc->validateOnParse = true;
$doc->formatOutput = true;

// in the same while loop, I used the following:
$messageEl = $doc->createDocumentFragment();
$doc->validate(); // this stopped my code, without any error or warning
$messageEl->appendXml($row['message']);

Thanks in advance!

 Answers

56

There is no &nbsp; in XML. The only character entities that have an actual name defined (instead of using a numeric reference) are &amp;, &lt;, &gt;, &quot; and &apos;.

That means you have to use the numeric equivalent of a non-breaking space, which is &#160; or (in hex) &#xA0;.
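As a minimal sketch of that substitution, the named entity can be swapped for its numeric form before the string reaches appendXML(). The sample string and variable names below are made up for illustration, and the sketch assumes &nbsp; is the only HTML-only entity in the data; any other named HTML entity would need the same treatment.

```php
<?php
// Hypothetical sample data standing in for $row['message'].
$message = '<p>Hello&nbsp;world &amp; friends</p>';

// Replace the HTML-only entity with its numeric reference, which XML accepts.
$xmlSafe = str_replace('&nbsp;', '&#160;', $message);

$doc  = new DOMDocument();
$root = $doc->appendChild($doc->createElement('div'));

$frag = $doc->createDocumentFragment();
$ok   = $frag->appendXML($xmlSafe); // true when the string parsed as XML
$root->appendChild($frag);

echo $doc->saveXML($root), PHP_EOL;
```

The &amp; in the sample is left alone on purpose: it is one of the five entities XML already defines.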

If you are trying to save HTML into an XML container, then save it as text. HTML and XML may look similar, but they are distinct formats. appendXML() expects well-formed XML as an argument. Use the nodeValue property instead; it will XML-encode your HTML string without any warnings.

// document fragment is completely unnecessary
$otherElement->nodeValue = $row['message'];
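A small sketch of the "save it as text" idea, using a made-up sample string. Note that it uses createTextNode() rather than assigning nodeValue; that is a substitution for illustration, chosen because a text node is always escaped when the document is serialized. Either way, the markup ends up as visible text, not as elements.

```php
<?php
// Hypothetical sample data standing in for $row['message'].
$message = '<p>Hello&nbsp;world</p>';

$doc = new DOMDocument();
$otherElement = $doc->appendChild($doc->createElement('div'));

// Store the HTML string as plain text; no XML parsing takes place here.
$otherElement->appendChild($doc->createTextNode($message));

// On output, the special characters in the text are escaped.
$xml = $doc->saveXML($otherElement);
echo $xml, PHP_EOL;
```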
Wednesday, March 31, 2021
 
Lloydworth
answered 7 Months ago
49

Solution:

$oDom = new DOMDocument();
$oDom->encoding = 'utf-8';
$oDom->loadHTML( utf8_decode( $sString ) ); // important!

$sHtml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
$sHtml .= $oDom->saveHTML( $oDom->documentElement ); // important!

The saveHTML() method behaves differently when a node is passed to it: it serializes only that node, without the doctype. You can pass the root node ($oDom->documentElement) and prepend the desired !DOCTYPE manually. Another important step is utf8_decode(). In my case, none of the other attributes and methods of the DOMDocument class produced the desired result.
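To see the difference in isolation, here is a rough sketch that compares the two calls on a tiny made-up document. It leaves out the utf8_decode() step, so it only illustrates the doctype behaviour, not the encoding handling.

```php
<?php
$oDom = new DOMDocument();
$oDom->loadHTML('<!DOCTYPE html><html><body><p>Hello</p></body></html>');

// Without an argument, the whole document is serialized, doctype included.
$whole = $oDom->saveHTML();

// With a node argument, only that node and its descendants are serialized.
$rootOnly = $oDom->saveHTML($oDom->documentElement);

echo $rootOnly, PHP_EOL;
```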

Wednesday, March 31, 2021
 
ioleo
answered 7 Months ago
40

Yep, nodeValue will strip the tags.

You'll want to create a fragment of the data, and then append it.

<?php

$data = '<div class="dummydata"></div>';
$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE html><div id="container">contents</div>');

$el = $doc->getElementById('container');

$children = $el->childNodes;
while ($children->length)
    $el->removeChild($children->item(0));

$frag = $el->ownerDocument->createDocumentFragment();
$frag->appendXML($data);
$el->appendChild($frag);

echo $doc->saveHTML();

Be sure to replace the loadHTML and saveHTML with your code if you want to integrate it.

Output:

<!DOCTYPE html>
<html><body><div id="container"><div class="dummydata"></div></div></body></html>
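Since appendXML() needs well-formed XML, data containing HTML-only entities such as &nbsp; will still trigger the warning with the fragment approach. One possible alternative, sketched here as a different technique and not as part of the answer above, is to parse the data with loadHTML() in a temporary document and copy the nodes over with importNode(). The sample data and the wrapper div are assumptions for illustration.

```php
<?php
// Hypothetical sample data containing an HTML-only entity.
$data = '<p>Hello&nbsp;world</p><p>Second</p>';

$doc = new DOMDocument();
$doc->loadHTML('<!DOCTYPE html><div id="container"></div>');
$el = $doc->getElementById('container');

// Parse the data with the HTML parser, which knows &nbsp;.
// The wrapper div makes the parsed nodes easy to locate afterwards.
$tmp = new DOMDocument();
$tmp->loadHTML('<div>' . $data . '</div>');
$wrapper = $tmp->getElementsByTagName('div')->item(0);

// importNode() copies each node into $doc; the originals stay in $tmp.
foreach ($wrapper->childNodes as $child) {
    $el->appendChild($doc->importNode($child, true));
}

echo $doc->saveHTML($el), PHP_EOL;
```

loadHTML() assumes ISO-8859-1 input unless told otherwise, so non-ASCII data would need extra care that this sketch does not cover.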
Wednesday, March 31, 2021
 
godot
answered 7 Months ago
46

No, there is no way of specifying a particular doctype to use, or to modify the requirements of the existing one.

Your best workable solution is to suppress the parser warnings with libxml_use_internal_errors, which makes libxml collect errors internally instead of emitting them:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('...');
libxml_clear_errors();
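If you would rather inspect the problems than discard them, the collected errors can be read with libxml_get_errors() before clearing. A minimal sketch follows; the input markup is invented, and whether a given input produces any errors depends on the libxml version.

```php
<?php
// Collect libxml errors internally and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<div><p>Unclosed paragraph</div>');

// Each entry is a LibXMLError with level, code, line and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```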
Wednesday, June 2, 2021
 
ajreal
answered 5 Months ago
53

There are reserved characters, which have a special meaning. These are the general delimiters (:/?#[]@) and the sub-delimiters (!$&'()*+,;=).

There is also a set of characters called unreserved characters (alphanumerics and -._~), which are not to be encoded.

This means that anything outside the unreserved set is supposed to be %-encoded when it does not have a special meaning (e.g. when passed as part of a GET parameter).

See also RFC3986: Uniform Resource Identifier (URI): Generic Syntax
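In PHP, rawurlencode() follows this rule: it percent-encodes everything except alphanumerics and -._~. A short sketch with a made-up parameter value:

```php
<?php
// Hypothetical value to be passed as a GET parameter.
$value = 'a b&c=d/e~f';

// Reserved characters and the space are percent-encoded;
// the unreserved "~" is left as-is.
$encoded = rawurlencode($value);

echo 'page.php?q=' . $encoded, PHP_EOL; // page.php?q=a%20b%26c%3Dd%2Fe~f
```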

Tuesday, June 8, 2021
 
Dail
answered 5 Months ago