Asked  7 Months ago    Answers:  5   Viewed   81 times

I have used the XML Parser before, and even though it worked OK, I wasn't happy with it in general. It felt like I was using workarounds for things that should be basic functionality.

I recently saw SimpleXML but I haven't tried it yet. Is it any simpler? What advantages and disadvantages do both have? Any other parsers you've used?



I would have to say SimpleXML takes the cake. First, it is an extension written in C, so it is very fast. Second, the parsed document takes the form of a PHP object, so you can "query" it with expressions like $root->myElement.

Wednesday, March 31, 2021
answered 7 Months ago

You're on the right track with XMLReader. Rather conveniently it includes the method expand() which will return a copy of the current node as a DOMNode. This will let you handle each individual Tree in memory with the DOM API.

As for handling nodes - evaluate and descend recursively.


$data = [
    'var1' => 1.05,
    'var2' => 0.76
];

$dom    = new DOMDocument();
$xpath  = new DOMXPath($dom);
$reader = new XMLReader();
// $file is assumed to hold the path to the XML document (not given here).
$reader->open($file);

// Read until reaching the first Tree.
while ($reader->read() && $reader->localName !== 'Tree');

while ($reader->localName === 'Tree') {
    // Expand only the current Tree into a DOM node, so one tree is in memory at a time.
    $tree = $dom->importNode($reader->expand(), true);

    echo evaluateTree($data, $tree, $xpath), "\n";

    // Move on to the next.
    $reader->next('Tree');
}

$reader->close();

function evaluateTree(array $data, DOMElement $tree, DOMXPath $xpath)
{
    foreach ($xpath->query('./Node', $tree) as $node) {
        $field    = $xpath->evaluate('string(./SimplePredicate/@field)', $node);
        $operator = $xpath->evaluate('string(./SimplePredicate/@operator)', $node);
        $value    = $xpath->evaluate('string(./SimplePredicate/@value)', $node);

        if (evaluatePredicate($data[$field], $operator, $value)) {
            // Descend recursively.
            return evaluateTree($data, $node, $xpath);
        }
    }

    // Reached the end of the line.
    return $tree->getAttribute('id');
}

function evaluatePredicate($left, $operator, $right)
{
    switch ($operator) {
        case "lessOrEqual":
            return $left <= $right;
        case "greaterThan":
            return $left > $right;
        default:
            // Unknown operators never match.
            return false;
    }
}
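For readers outside PHP, the same "stream to each Tree, then evaluate it in memory" idea can be sketched in Python with the standard library's xml.etree.ElementTree.iterparse. This is only an analog, not the answer's code: the sample document, the element ids and the helper names below are made up for illustration, and only the Tree/Node/SimplePredicate names are taken from the PHP above.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; only the element and attribute names
# (Tree, Node, SimplePredicate, field, operator, value) come from the PHP code.
SAMPLE = b"""<Forest>
  <Tree id="t1">
    <Node id="n1">
      <SimplePredicate field="var1" operator="lessOrEqual" value="2.0"/>
      <Node id="n2">
        <SimplePredicate field="var2" operator="greaterThan" value="0.5"/>
      </Node>
    </Node>
    <Node id="n3">
      <SimplePredicate field="var1" operator="greaterThan" value="2.0"/>
    </Node>
  </Tree>
  <Tree id="t2">
    <Node id="n4">
      <SimplePredicate field="var2" operator="greaterThan" value="0.9"/>
    </Node>
  </Tree>
</Forest>"""


def evaluate_predicate(left, operator, right):
    """Compare two numbers using the operator name from the XML."""
    if operator == "lessOrEqual":
        return left <= right
    if operator == "greaterThan":
        return left > right
    return False  # unknown operators never match


def evaluate_tree(data, element):
    """Follow the first child Node whose predicate holds; return the last id reached."""
    for node in element.findall("Node"):
        pred = node.find("SimplePredicate")
        if pred is None or pred.get("field") not in data:
            continue
        left = data[pred.get("field")]
        right = float(pred.get("value"))
        if evaluate_predicate(left, pred.get("operator"), right):
            return evaluate_tree(data, node)  # descend recursively
    return element.get("id")  # no child matched: this is the end of the line


def evaluate_stream(data, source):
    """Stream the document and evaluate each Tree as soon as it is complete."""
    results = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Tree":
            results.append(evaluate_tree(data, elem))
            elem.clear()  # drop the finished Tree's children to limit memory use
    return results


results = evaluate_stream({"var1": 1.05, "var2": 0.76}, io.BytesIO(SAMPLE))
print(results)
```

Unlike the PHP version, this sketch converts the value attribute to a float explicitly instead of relying on loose comparison.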


Saturday, May 29, 2021
answered 5 Months ago

If speed and memory are no problem, dom4j is a really good option. If you need speed, using a StAX parser like Woodstox is the right way, but you have to write more code to get things done and you have to get used to processing XML as a stream.
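The trade-off between a tree API and a streaming API is not specific to Java. As a rough illustration (using Python's standard library rather than dom4j or Woodstox, and a made-up document of order elements), the two styles look like this:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: 100 <order> elements with totals 1..100.
XML = b"<orders>" + b"".join(b'<order total="%d"/>' % n for n in range(1, 101)) + b"</orders>"


def total_with_tree(data):
    """Tree style: parse the whole document first, then query it."""
    root = ET.fromstring(data)
    return sum(int(order.get("total")) for order in root.iter("order"))


def total_with_stream(data):
    """Stream style: handle each element as it is completed, then discard its contents."""
    total = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "order":
            total += int(elem.get("total"))
            elem.clear()  # release attributes and children of the processed element
    return total


print(total_with_tree(XML), total_with_stream(XML))
```

The tree version is shorter and lets you navigate freely; the streaming version needs more bookkeeping, which matches the extra code the answer warns about.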

Tuesday, June 1, 2021
answered 5 Months ago

I use jQuery for this. Here is a good example:

(EDIT: Note - the following blog seems to have gone away.)

There are also lots and lots of good examples in the jQuery documentation:

EDIT: Due to the blog for my primary example going away, I wanted to add another example that shows the basics and helps with namespace issues:

Thursday, June 24, 2021
answered 4 Months ago
str(data.frame(t(unlist(L)), stringsAsFactors=FALSE))
# 'data.frame': 1 obs. of  15 variables:
#  $ CIP.RecordList.Record.Date                          : chr "2017-05-26T00:00:00"
#  $ CIP.RecordList.Record.Grade                         : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code       : chr "R"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Description: chr "local"
#  $ CIP.RecordList.Record.Score                         : chr "xxx"
#  $ CIP.RecordList.Record.Date.1                        : chr "2017-04-30T00:00:00"
#  $ CIP.RecordList.Record.Grade.1                       : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code.1     : chr "R"
#  $ CIP.RecordList.Record.Score.1                       : chr "xyx"
#  $ Individual.General.FirstName                        : chr "MM"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry         : chr "2017-03-19"
#  $ Inquiries.InquiryList.Inquiry.Reason                : chr "cc"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry.1       : chr "2016-10-14"
#  $ Inquiries.InquiryList.Inquiry.Reason.1              : chr "er"
#  $ Inquiries.Summary.NumberOfInquiries                 : chr "2"

If you want to convert strings that have a suitable representation as numbers, assuming that df is the data frame above:

data.frame(t(lapply(df, function(x) 
               ifelse(is.na(y<-suppressWarnings(as.numeric(x))), x, y))))

Strings that do not have a number representation will not be converted.
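The two steps above (flatten a nested list into dotted names, then convert only the strings that parse as numbers) can be sketched outside R as well. The Python below is a loose analog under stated assumptions: flatten and to_number are hypothetical helpers, the nested dictionary is a cut-down stand-in for the parsed XML, and repeated sibling names (the .1 suffixes in the R output) are not handled.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into a single dict with dotted key names."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_number(text):
    """Return a float when the string parses as a number, otherwise the string itself."""
    try:
        return float(text)
    except ValueError:
        return text


# Hypothetical, cut-down stand-in for the parsed document.
record = {
    "Individual": {"General": {"FirstName": "MM"}},
    "Inquiries": {
        "InquiryList": {"Inquiry": {"DateOfInquiry": "2017-03-19", "Reason": "cc"}},
        "Summary": {"NumberOfInquiries": "2"},
    },
}

row = {name: to_number(value) for name, value in flatten(record).items()}
print(row)
```

Note that float() also accepts strings such as "nan" and "inf", so a stricter check may be needed for real data.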



A) In some comments the OP added a further request for execution speed, which is normally not an issue for one-time tasks such as data import. The solution above is based on recursion, as explicitly required in the question. Of course, traversing up and down the nodes adds a lot of overhead.

B) One recent answer here proposes a complex method based on a collection of external tools. There might of course be different nice utilities to manage XML files, but IMHO much of the XPATH work can be comfortably and efficiently done in R itself.

C) The OP wonders if it is possible to "create separate data.frames for each list object of XML".

D) I noticed that, judging by the question tags, the OP seems to require the newer xml2 package.

I address the points above using XPATH straight from R.

XPATH approach

Below I extract in a separate data frame the Record node. One can use the same approach for other (sub)nodes too.

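library(xml2)
# xx is assumed to already hold the parsed document (e.g. from read_xml()).
xx <- xml_find_all(xx, "//Record")
system.time(
    xx <- xml_find_all(xx, ".//descendant::*[not(*)]"))
#  user  system elapsed 
# 38.00    0.36   38.35 
system.time(xx <- xml_text(xx))
#  user  system elapsed 
# 68.39    0.05   68.53 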
head(data.frame(t(matrix(xx, 5))))
#                    X1 X2 X3    X4  X5
# 1 2017-05-26T00:00:00  2  R local xxx
# 2 2017-04-30T00:00:00  2  R       xyx
# 3 2017-05-26T00:00:00  2  R local xxx
# 4 2017-04-30T00:00:00  2  R       xyx
# 5 2017-05-26T00:00:00  2  R local xxx
# 6 2017-04-30T00:00:00  2  R       xyx

(You might want to add further code to name data frame columns)

Timings were measured on my average laptop.


The core of the solution lies in the XPATH expression .//descendant::*[not(*)]. The .//descendant::* part selects every element below the current context (the Record node); the [not(*)] predicate keeps only the elements that have no child elements, i.e. the leaves. This linearizes the tree structure, making it more suitable for data science modeling.

The flexibility of * comes at a price in terms of computation. However, the computational burden does not lie on R, which is an interpreted language, but on the highly efficient external C library libxml2. Results should be equal to or better than those of other utilities and libraries.
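To make the "leaf elements only" idea concrete, here is a small sketch in Python using only the standard library. It is an analog of the XPATH above, not a translation: xml.etree.ElementTree does not support the not(*) predicate, so the leaves are picked by checking that an element has no children. The sample document and the leaf_rows helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the Record nodes shown above.
SAMPLE = """<CIP><RecordList>
  <Record>
    <Date>2017-05-26T00:00:00</Date><Grade>2</Grade>
    <ReasonsList><Reason><Code>R</Code><Description>local</Description></Reason></ReasonsList>
    <Score>xxx</Score>
  </Record>
  <Record>
    <Date>2017-04-30T00:00:00</Date><Grade>2</Grade>
    <ReasonsList><Reason><Code>R</Code><Description/></Reason></ReasonsList>
    <Score>xyx</Score>
  </Record>
</RecordList></CIP>"""


def leaf_rows(root, record_tag="Record"):
    """Return one row per record, holding the text of every childless descendant."""
    rows = []
    for record in root.iter(record_tag):
        # len(element) is the number of child elements, so len == 0 marks a leaf.
        rows.append([(e.text or "") for e in record.iter() if len(e) == 0])
    return rows


rows = leaf_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Each row lines up with one line of the data frame printed above, provided every Record has the same leaf elements in the same order.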

Tuesday, October 26, 2021
answered 2 Days ago