
I'm trying to parse the DMOZ content/structures XML files into MySQL, but all existing scripts to do this are very old and don't work well. How can I go about opening a large (+1GB) XML file in PHP for parsing?

 Answers

52

There are only two PHP APIs that are really suited for processing large files. The first is the old Expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).

For an example, you might want to look at this partial parser for the DMOZ catalog:

<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        // Expat-style (SAX) push parser: the handlers below are called as the
        // XML stream is fed in, so the document never sits in memory as a whole
        $this->_parser = xml_parser_create("UTF-8");
        xml_set_object($this->_parser, $this);
        xml_set_element_handler($this->_parser, "startTag", "endTag");
    }

    public function startTag($parser, $name, $attribs)
    {
        // remember the enclosing element so endTag() can restore it
        array_push($this->_stack, $this->_current);

        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Could not open " . $this->_file . "\n");
        }

        while (!feof($fh)) {
            // feed the file to the parser in 4 KB chunks; the third argument
            // flags the final chunk once the end of the file is reached
            $data = fread($fh, 4096);
            xml_parse($this->_parser, $data, feof($fh));
        }

        fclose($fh);
        xml_parser_free($this->_parser);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
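
Since the question is about getting the data into MySQL, the echo in the LINK handler is the natural place to hook in a database insert. Below is a minimal sketch using PDO; the DSN, credentials, and the dmoz_links table and columns are assumptions for illustration, not part of the original answer:

// Hypothetical connection and schema -- adjust DSN, credentials and table to your setup.
$pdo = new PDO("mysql:host=localhost;dbname=dmoz;charset=utf8", "user", "password");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$stmt = $pdo->prepare("INSERT INTO dmoz_links (topic_id, url) VALUES (?, ?)");

// One way to wire it up: pass $stmt into the parser's constructor, keep it in a
// property, and replace the echo inside startTag() with something like:
//     $this->_stmt->execute(array($this->_currentId, $attribs["R:RESOURCE"]));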
Wednesday, March 31, 2021
 
macha
answered 9 Months ago
31

In PHP, you can read extremely large XML files with XMLReader:

$reader = new XMLReader();
$reader->open($xmlfile);

Extremely large XML files should be stored in a compressed format on disk; this makes sense because XML has a high compression ratio, for example gzipped as large.xml.gz.

PHP supports that quite well with XMLReader via the compression stream wrappers:

$xmlfile = 'compress.zlib://path/to/large.xml.gz';

$reader = new XMLReader();
$reader->open($xmlfile);

XMLReader lets you operate on the current element "only". That means it's forward-only: if you need to keep any parser state, you have to build it up yourself.
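
For example, a bare read loop looks roughly like this (a minimal sketch; the item element name, the id attribute, and the counter are placeholders for whatever state you actually need to track):

$reader = new XMLReader();
$reader->open($xmlfile);

$count = 0; // any parser state you want to keep, you maintain yourself

while ($reader->read()) {
    // react only to opening <item> elements (placeholder name)
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        $count++;
        $id = $reader->getAttribute('id'); // attributes of the current element
    }
}

$reader->close();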

I often find it helpful to wrap the basic movements into a set of iterators that know how to operate on XMLReader, e.g. iterating through elements or child elements only. You'll find this outlined in Parse XML with PHP and XMLReader; a rough sketch follows below.
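
A rough sketch of such a wrapper, here as a generator (assumes PHP 5.5+; the record element name is a placeholder):

function elements(XMLReader $reader, $elementName)
{
    $doc = new DOMDocument();

    // move the cursor to the first matching element
    while ($reader->read() && $reader->name !== $elementName);

    // yield each match as a small SimpleXMLElement, then jump to the next
    // sibling of the same name, skipping over the subtree just handled
    while ($reader->name === $elementName) {
        yield simplexml_import_dom($doc->importNode($reader->expand(), true));
        $reader->next($elementName);
    }
}

$reader = new XMLReader();
$reader->open('compress.zlib://path/to/large.xml.gz');

foreach (elements($reader, 'record') as $record) {
    // work with one element at a time; memory use stays flat
}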

See also:

  • PHP open gzipped XML
Wednesday, March 31, 2021
 
Besnik
answered 9 Months ago
47

You are probably running out of memory. Try increasing your memory limit:

ini_set('memory_limit','128M');

or whatever amount of memory is necessary (it depends on your server). Here are some links describing other ways of increasing your server's memory limit:

PHP: Increase memory limit

PHP: Increase memory limit 2
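
For reference, the same limit can also be raised outside the script, for example in php.ini or just for a single command-line run (the value and script name below are examples only):

; php.ini
memory_limit = 512M

# one-off, on the command line
php -d memory_limit=512M import.php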

Friday, August 13, 2021
 
Markol
answered 4 Months ago
59

I think a fairly comfortable way of dealing with XML files in C# is using LINQ to XML.

// requires: using System.IO; using System.Linq; using System.Xml.Linq;
using (FileStream lStream = new FileStream("ConfigurationSettings.xml", FileMode.Open, FileAccess.Read))
{
    XElement lRoot = XElement.Load(lStream);
    string userLogin = "user1";
    XElement user = lRoot.Element("UserSettings").Elements("user").Where(x => x.Attribute("Key").Value == userLogin).FirstOrDefault();
    if (user != null)
    {
        // returns red
        string color = user.Element("layout").Attribute("color").Value;

        // returns 5
        string fontSize = user.Element("layout").Attribute("fontsize").Value;
    }
}
Tuesday, October 5, 2021
 
Ionică Bizău
answered 2 Months ago
59

First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually read the whole file into memory.

The problem, of course, is that, whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.

ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.

import xml.etree.ElementTree as ET

for event, elem in ET.iterparse(xmlfile, events=('end',)):
    ...

The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.

The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)

This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.

The standard solution is SAX, a callback-based API that lets you process the document a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist once you've parsed them.

Most of the best SAX examples out there are for Java or JavaScript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).

There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.

Monday, October 18, 2021
 
Peter Bridger
answered 2 Months ago