Asked  7 Months ago    Answers:  5   Viewed   87 times

How can one parse HTML/XML and extract information from it?



Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.


The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.

A basic usage example can be found in Grabbing the href attribute of an A element and a general conceptual overview can be found at DOMDocument in php

How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.


The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example can be found at getting all values from h1 tags using php

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.


The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.

A basic usage example can be found at A simple program to CRUD node and node values of xml file and there is lots of additional examples in the PHP Manual.

3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDom - Repo

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.


Wa72HtmlPageDom` is a PHP library for easy manipulation of HTML documents using It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library written in PHP5 and provides additional Command Line Interface (CLI).

Also see:


Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.


QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.


fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.


sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.


FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them listed below

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
  • Require PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.


  • A universal tokenizer and HTML/XML/RSS DOM Parser
  •    Ability to manipulate elements and their attributes
  •    Supports invalid HTML and UTF8
  •    Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported) 
  • A HTML beautifier (like HTML Tidy)
  •    Minify CSS and Javascript
  •    Sort attributes, change character case, correct indentation, etc. 
  • Extensible
  •    Parsing documents using callbacks based on current character/token
  •    Operations separated in smaller functions for easy overriding 
  • Fast and Easy

Never used it. Can't tell if it's any good.


You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you want to consider using a dedicated parser, like


A Python and PHP implementations of a HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blogpost by the W3's titled How-To for html 5 parsing that is worth checking out.


If you don't feel like programming PHP, you can also use Web services. In general, I found very little utility for these, but that's just me and my use cases.


ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they are only working for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding, or changing attributes in a tag, can make the RegEx fails when it's not properly written. You should know what you are doing before using RegEx on HTML.

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.

Also see Parsing Html The Cthulhu Way


If you want to spend some money, have a look at

  • PHP Architect's Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.

Wednesday, March 31, 2021
answered 7 Months ago

There is nothing magical about MVC. It's purpose is to split user interface interaction into three distinct roles. The important separation is that between Model and Presentation layer. The presentation layer consists of Controller (handles and delegates request from the UI to the model) and View (renders Model data).

Your model is your core application. It is likely layered itself, for instance into a Data Access layer (your AdoDB stuff), a Domain Model and a Service Layer. How you organize the Model is really up to the application you want to build. The important thing with MVC is to keep the model independent from the presentation. Your application should be able to solve the problem it was written for without the UI.The UI is just one interface on top.

Basically, as long as your controller is kept thin and does this

class SomeController
    public function someAction()
        $input   = filter_input(/* ... */);
        $adoDb   = $this->getModel('MyAdoDbClass');
        $newData = $adoDb->doSomethingWithInput($input);

and not this

class SomeController
    public function someAction()
        $input  = filter_input(/* ... */);
        $adoDb  = new AdoDb; 
            all the code that belongs to doSomethingWithInput
         echo '<html>'; 
            all the code that should belongs to the View                 ...

you are fine. Like I said, there is nothing magical about. You gotta keep 'em separated.

I suggest you have a look at other frameworks to see how they approach MVC. That's not to say you should copy or use them, but try to learn how they go about MVC. Also have a look at Rasmus Lerdorf's article The no-framework PHP MVC framework

Friday, May 28, 2021
answered 5 Months ago
$data = explode("n", $data);
$last_line = end($data);
$parts = explode("t", $last_line);
Saturday, May 29, 2021
answered 5 Months ago

You can try parsing an HTML file using a XML parser, but it’s likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don’t understand.

  • elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
  • elements that don’t require end tags; e.g., <p> <dt> <li> (their end tags can be implied)
  • elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
  • attributes with unquoted values; for example, <metacharset=utf-8>
  • attributes that are empty, with no separate value given at all; e.g., <inputdisabled>

An XML parser will fail to parse any HTML document that uses any of those features.

An HTML parser, on the other hand, will basically never fail no matter what a document contains.

All that said, there has also been work done toward developing a new type of XML parsing—so-called XML5 parsing—capable of handling things like empty/unquoted attributes attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.

The intended use is to make an HTML parser, that is part of a web crawler application

If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.

These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:

  • parse5 (node.js/JavaScript)
  • html5lib (python)
  • html5ever (rust)
  • html5 parser (java)
  • gumbo (c, with bindings for ruby, objective c, c++, per, php, c#, perl, lua, D, julia…)

Thursday, August 5, 2021
answered 3 Months ago

After looking around MSDN, I finally found an implementation solution to my problem:

    using System;
    using HtmlAgilityPack;
    using System.Xml;

    namespace HockeyStats
        class StatsParser
            private string htmlCode;
            private static string fileName = "[" + DateTime.Now.ToShortDateString() + " NHL Stats].xml";

            public StatsParser(string htmlCode)
                this.htmlCode = htmlCode;


            public void ParseHtml()

                HtmlDocument doc = new HtmlDocument();
                XmlWriter writer = null;

                    // Create an XmlWriterSettings object with the correct options. 
                    XmlWriterSettings settings = new XmlWriterSettings();
                    settings.Indent = true;
                    settings.IndentChars = ("  ");
                    settings.OmitXmlDeclaration = false;

                    // Create the XmlWriter object and write some content.
                    writer = XmlWriter.Create(@"...."+fileName, settings);
                    writer.WriteAttributeString("Date", DateTime.Now.ToShortDateString());

                // Iterate all rows within another row
                HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(".//tr/tr");
                for (int i = 0; i < rows.Count; ++i)
                    // Iterate all columns in this row
                    HtmlNodeCollection cols = rows[i].SelectNodes(".//td[@class='statBox']");
                    for (int j = 0; j < 20; ++j)
                                switch (j)
                                    case 0:
                                            writer.WriteAttributeString("Rank", cols[j].InnerText.Trim()); break;
                                    case 1: writer.WriteElementString("Name", cols[j].InnerText.Trim()); break;
                                    case 2: writer.WriteElementString("Team", cols[j].InnerText.Trim()); break;
                                    case 3: writer.WriteElementString("Pos", cols[j].InnerText.Trim()); break;
                                    case 4: writer.WriteElementString("GP", cols[j].InnerText.Trim()); break;
                                    case 5: writer.WriteElementString("G", cols[j].InnerText.Trim()); break;
                                    case 6: writer.WriteElementString("A", cols[j].InnerText.Trim()); break;
                                    case 7: writer.WriteElementString("PlusMinus", cols[j].InnerText.Trim()); break;
                                    case 8: writer.WriteElementString("PIM", cols[j].InnerText); break;
                                    case 9: writer.WriteElementString("PP", cols[j].InnerText); break;
                                    case 10: writer.WriteElementString("SH", cols[j].InnerText); break;
                                    case 11: writer.WriteElementString("GW", cols[j].InnerText); break;
                                    case 12: writer.WriteElementString("OT", cols[j].InnerText); break;
                                    case 13: writer.WriteElementString("Shots", cols[j].InnerText); break;
                                    case 14: writer.WriteElementString("ShotPctg", cols[j].InnerText); break;
                                    case 15: writer.WriteElementString("TOIPerGame", cols[j].InnerText); break;
                                    case 16: writer.WriteElementString("ShiftsPerGame", cols[j].InnerText); break;
                                    case 17: writer.WriteElementString("FOWinPctg", cols[j].InnerText); break;

                    if (writer != null)

which gives the following XML file as an output:

<?xml version="1.0" encoding="utf-8" ?> 
<Stats Date="2011-01-01">
 <Player Rank="1">
  <Name>Sidney Crosby</Name> 
Sunday, August 15, 2021
answered 2 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :