Asked  8 Months ago    Answers:  5   Viewed   34 times

I was trying to do it with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML: I used regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)

I googled around for a while looking for explanations, but didn't find anything that helped (not with the class, anyway).

So I want to capture "Capture this text 1" and "Capture this text 2" and so on.

Doesn't look too hard, but I can't figure it out :(

<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>

 Answers

22

If you want to get :

  • The text
  • that's inside a <div> tag with class="text"
  • that's, itself, inside a <div> with class="main"

I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).

Instead, I would use an XPath query on your document, using the DOMXPath class.


For example, something like this should do to load the HTML string into a DOM object and instantiate the DOMXPath class :

$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);


And, then, you can use XPath queries, with the DOMXPath::query method, that returns the list of elements you were searching for :

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}


And executing this gives me the following output :

string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
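
One thing to watch for: @class="text" compares the whole attribute value, so it stops matching as soon as the element carries a second class. Here is a minimal sketch of a token-based match, assuming made-up markup where the inner div has two classes (the sample HTML and the $texts variable are mine, not from the question) :

```php
<?php
// Made-up markup for illustration: the first inner div carries two classes,
// the second has a class name that merely contains the substring "text".
$html = <<<HTML
<div class="main">
    <div class="text highlight">Capture this text 1</div>
</div>
<div class="main">
    <div class="context">Skip this one</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so "text" matches only as a whole token,
// not as a substring of another class name such as "context".
$query = '//div[@class="main"]'
       . '/div[contains(concat(" ", normalize-space(@class), " "), " text ")]';

$texts = [];
foreach ($xpath->query($query) as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

The concat/normalize-space trick is a common XPath 1.0 idiom; if your markup always has exactly one class per element, the simpler query above is enough.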
Wednesday, March 31, 2021
 
samayo
answered 8 Months ago
57

If you're willing to sacrifice correctness for speed, then go ahead and try to roll your own parser with regular expressions.

You say "Personally, I've found it more complicated to parse HTML through DOM." Are you optimizing for correctness of results, or how easy it is for you to write the code?

If all you want is speed and code that's not complicated, why not just use this:

$array_of_photos = Array( 'booger.jpg', 'aunt-martha-on-a-horse.png' );

or maybe just

$array_of_photos = Array();

Those run in constant time, and they're easy to understand. No problem, right?

What's that? You want accurate results? Then don't parse HTML with regular expressions.

Finally, when you're working with a parser like DOM, you're working with a piece of code that has been well-tested and debugged for years. When you're writing your own regular expressions to do the parsing, you're working with code that you're going to have to write, test and debug yourself. Why would you not want to work with the tools that many people have been using for many years? Do you think you can do a better job yourself on the fly?
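
To make the correctness point concrete, here is a small sketch (the markup and the pattern are made up for illustration). The same element is written with single quotes and an extra attribute, which is enough to defeat a naive pattern while the DOM query still finds it:

```php
<?php
// Made-up markup: a valid div, but with single quotes and an extra attribute.
$html = "<div id='a' class='text'>Capture this text</div>";

// A naive pattern written against the "usual" form of the markup.
$regexHits = preg_match_all('/<div class="text">(.*?)<\/div>/s', $html, $m);

// The parser normalizes quoting and attribute order before we query.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$domHits = $xpath->query('//div[@class="text"]');

echo "regex matches: $regexHits\n";
echo "DOM matches: {$domHits->length}\n";
```

You could keep patching the pattern for each variation, but that is exactly the write, test and debug work the parser has already done.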

Wednesday, March 31, 2021
 
iftheshoefritz
answered 8 Months ago
32

You should check whether each td has an anchor child. Select the anchor tags inside the td using getElementsByTagName() and check that the selection is not empty using its length property. If the td contains an anchor, use getAttribute() to get its href attribute.

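$dom = new DOMDocument;
$dom->loadHTML($html);  // $html holds the markup to parse
$array_data = [];
foreach ($dom->getElementsByTagName('td') as $node) {
    // Look for anchors inside this cell
    $nodeAnchor = $node->getElementsByTagName("a");
    if ($nodeAnchor->length) {
        // The cell contains a link: store its href first
        $array_data[] = $nodeAnchor->item(0)->getAttribute("href");
    }
    // Always store the cell's text content
    $array_data[] = $node->nodeValue;
}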
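
If it helps, here is a self-contained variant you can run as-is. The sample table is hypothetical (the original markup isn't shown here), and I'm assuming you would rather keep each cell's text and link together than push them into one flat array:

```php
<?php
// Hypothetical sample table, made up for this sketch.
$html = <<<HTML
<table><tr>
    <td><a href="/page-1">First</a></td>
    <td>Plain cell</td>
</tr></table>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$rows = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $anchors = $td->getElementsByTagName('a');
    $rows[] = [
        'text' => trim($td->nodeValue),
        // null when the cell has no link
        'href' => $anchors->length ? $anchors->item(0)->getAttribute('href') : null,
    ];
}

print_r($rows);
```

Keeping one entry per cell means you never have to guess whether a given array item is a URL or cell text.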


Wednesday, March 31, 2021
 
Farnabaz
answered 8 Months ago
97
public DOMNode DOMNode::replaceChild ( DOMNode $newnode , DOMNode $oldnode )

http://php.net/manual/en/domnode.replacechild.php

Something like this:

$iframes = $dom->getElementsByTagName('iframe');
// The node list is live, so walk it backwards while replacing nodes
$i = $iframes->length - 1;
while ($i > -1) {
    $iframe = $iframes->item($i);
    $img = $dom->createElement("img");
    $img->setAttribute("src", $iframe->getAttribute('src'));
    $iframe->parentNode->replaceChild($img, $iframe);
    $i--;
}
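
The loop runs backwards because getElementsByTagName() returns a live list: each replaced iframe drops out of it, so a forward loop would skip nodes. An alternative sketch, assuming a made-up fragment with two iframes, is to copy the list into a plain array first and then iterate normally:

```php
<?php
// Made-up fragment with two iframes, for illustration only.
$html = '<div><iframe src="a.png"></iframe><iframe src="b.png"></iframe></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the live DOMNodeList into a plain array so that replacing nodes
// does not shift the items still to be visited.
$iframes = iterator_to_array($dom->getElementsByTagName('iframe'));

foreach ($iframes as $iframe) {
    $img = $dom->createElement('img');
    $img->setAttribute('src', $iframe->getAttribute('src'));
    $iframe->parentNode->replaceChild($img, $iframe);
}

// Re-query the document to see what is left after the replacement.
$iframeCount = $dom->getElementsByTagName('iframe')->length;
$imgCount = $dom->getElementsByTagName('img')->length;
echo "$iframeCount iframes, $imgCount images\n";
```

Either approach should work; the snapshot just lets you use a plain foreach.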
Wednesday, March 31, 2021
 
supermitch
answered 8 Months ago
57

This is what PHP Tidy is for. For example:

<?php
ob_start();
?>
<html>a html document</html>
<?php
$html = ob_get_clean();

// Specify configuration
$config = array(
           'indent'         => true,
           'output-xhtml'   => true,
           'wrap'           => 200);

// Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Output
echo $tidy;
?>

See HTML Tidy Configuration Options.

Saturday, May 29, 2021
 
e_i_pi
answered 5 Months ago