Asked  7 Months ago    Answers:  5   Viewed   31 times

I was just reviewing a previous post of mine and noticed a number of people suggesting that I shouldn't use regex to parse XML. In that case the XML was relatively simple, and regex didn't pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I'm curious how this might pose a problem in other cases. Is this just a "don't reinvent the wheel" type of issue?



The real trouble is nested tags. Nested tags are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET and maybe a couple other flavors. But even with the power of balanced matching, an ill-placed comment could potentially throw off the regular expression.

For example, this is a tricky one to parse...

    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>

You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
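
To make the failure concrete, here is a minimal Python sketch (Python is only my choice for illustration; nothing in the answer names a language). It runs a naive non-greedy pattern and the standard library's ElementTree parser over a well-formed variant of the snippet. The pattern is an assumption about what a first regex attempt might look like, not the only possible one.

```python
import re
import xml.etree.ElementTree as ET

doc = """<div id="parse-this">
    <!-- oops</div> -->
    try to get this value with regex
</div>"""

# Naive attempt: capture everything up to the first closing tag.
# The first "</div>" sits inside the comment, so the match stops there.
naive = re.search(r'<div id="parse-this">(.*?)</div>', doc, re.DOTALL).group(1)

# A real parser knows the comment is not markup and skips it by default.
root = ET.fromstring(doc)
parsed = "".join(root.itertext()).strip()

print(repr(naive))
print(repr(parsed))
```

The regex result contains the start of the comment and misses the text entirely, while the parser returns only the element's text.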

Tuesday, June 1, 2021
answered 7 Months ago

Actually, the only time that's ever really bitten me was when I was debugging, and commented out bar():

  // bar();

Other than that, I tend to use:

if(foo) bar();

Which takes care of the above case.

EDIT: Thanks for clarifying the question. I agree, we should not write code to the lowest common denominator.

Tuesday, June 1, 2021
answered 7 Months ago

Technically, a controller should be small and compact; it should not be manipulating the DOM. A controller should only contain business logic and the binding-level logic that is called on events.

From my perspective, the reason behind "You should not manipulate DOM from the controller" is simply separation of concerns. If you do the DOM manipulation from the controller, that code becomes tightly coupled to the controller and cannot be reused. By writing that code in a directive, the same code can easily become a pluggable, reusable component. You can use the same DOM manipulation elsewhere just by placing the directive tag/element.

If you look at the directive definition, you will see that it is meant to work with the DOM: it gives you control over the DOM before rendering through the preLink function, and after rendering through the postLink function.

A directive also makes the directive element available to you. You do not need to compile it, because that element has already been compiled with jqLite, the smaller version of jQuery used in Angular. No selector is needed to get the directive element's DOM.

Friday, July 23, 2021
answered 5 Months ago

0x0 (aka NUL) is not an allowed character in XML:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Therefore your data is not XML, and any conformant XML processor must report an error such as the one you received.

You must repair the data by treating it as text, not XML, and removing any illegal characters (manually or automatically) before using it with any XML library.

For Python, see Removing control characters from a string in python for tips on how to remove NUL from a string. This must be done before treating the data as XML.
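
A minimal sketch of that cleanup step in Python, assuming the data has already been decoded to a str. The helper name strip_illegal_xml_chars and the sample string are hypothetical; the character class is simply the complement of the Char production quoted above.

```python
import re
import xml.etree.ElementTree as ET

# Everything NOT matched by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_illegal_xml_chars(text: str) -> str:
    """Remove characters that are not allowed in an XML 1.0 document."""
    return _ILLEGAL_XML_CHARS.sub("", text)

# Made-up sample: a NUL and a vertical tab embedded in otherwise valid XML.
raw = "<root>va\x00lue\x0b</root>"

# Clean the data as plain text first, and only then hand it to an XML parser.
clean = strip_illegal_xml_chars(raw)
root = ET.fromstring(clean)
print(root.text)
```

This silently drops the offending characters; whether that is acceptable depends on where the data came from, so it may be worth logging how many characters were removed.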

Saturday, August 7, 2021
answered 4 Months ago
str(data.frame(t(unlist(L)), stringsAsFactors=FALSE))
# 'data.frame': 1 obs. of  15 variables:
#  $ CIP.RecordList.Record.Date                          : chr "2017-05-26T00:00:00"
#  $ CIP.RecordList.Record.Grade                         : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code       : chr "R"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Description: chr "local"
#  $ CIP.RecordList.Record.Score                         : chr "xxx"
#  $ CIP.RecordList.Record.Date.1                        : chr "2017-04-30T00:00:00"
#  $ CIP.RecordList.Record.Grade.1                       : chr "2"
#  $ CIP.RecordList.Record.ReasonsList.Reason.Code.1     : chr "R"
#  $ CIP.RecordList.Record.Score.1                       : chr "xyx"
#  $ Individual.General.FirstName                        : chr "MM"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry         : chr "2017-03-19"
#  $ Inquiries.InquiryList.Inquiry.Reason                : chr "cc"
#  $ Inquiries.InquiryList.Inquiry.DateOfInquiry.1       : chr "2016-10-14"
#  $ Inquiries.InquiryList.Inquiry.Reason.1              : chr "er"
#  $ Inquiries.Summary.NumberOfInquiries                 : chr "2"

If you want to convert strings that have a suitable representation as numbers, assuming that df is the data frame above:

# Keep x where as.numeric() yields NA; otherwise use the numeric value y
data.frame(t(lapply(df, function(x) 
               ifelse(is.na(y <- suppressWarnings(as.numeric(x))), x, y))))

Strings that do not have a number representation will not be converted.



A) In some comments the OP added a further request for execution speed, which is normally not an issue for one-time tasks such as data import. The solution above is based on recursion, as explicitly required in the question. Of course, traversing up and down the nodes adds a lot of overhead.

B) One recent answer here proposes a complex method based on a collection of external tools. There might of course be different nice utilities to manage XML files, but IMHO much of the XPATH work can be comfortably and efficiently done in R itself.

C) The OP wonders if it is possible to "create separate data.frames for each list object of XML".

D) I noticed that, judging by the question tags, the OP seems to require the newer xml2 package.

I address the points above using XPATH straight from R.

XPATH approach

Below I extract the Record nodes into a separate data frame. One can use the same approach for other (sub)nodes too.

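library(xml2)
# xx is assumed to already hold the parsed XML document (e.g. from read_xml())
xx <- xml_find_all(xx, "//Record")
system.time(xx <- xml_find_all(xx, ".//descendant::*[not(*)]"))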
#  user  system elapsed 
# 38.00    0.36   38.35 
system.time(xx <- xml_text(xx))
#  user  system elapsed 
# 68.39    0.05   68.53 
head(data.frame(t(matrix(xx, 5))))
#                    X1 X2 X3    X4  X5
# 1 2017-05-26T00:00:00  2  R local xxx
# 2 2017-04-30T00:00:00  2  R       xyx
# 3 2017-05-26T00:00:00  2  R local xxx
# 4 2017-04-30T00:00:00  2  R       xyx
# 5 2017-05-26T00:00:00  2  R local xxx
# 6 2017-04-30T00:00:00  2  R       xyx

(You might want to add further code to name data frame columns)

Timings were measured on my average laptop.


The core of the solution lies in the XPATH expression .//descendant::*[not(*)]. The .//descendant:: part selects all descendants of the current context (the Record node); adding [not(*)] keeps only the elements that have no child elements, which flattens the layout. This linearizes the tree structure, making it more suitable for data science modeling.
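
For readers who want to see what "descendants with no children" selects, here is a rough Python equivalent using only the standard library. It is a sketch of the same idea, not the xml2 call: ElementTree's XPath subset has no not(*), so the leaf test is written as len(el) == 0. The sample document is made up to resemble the two records shown in the output above, and leaf_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the records in the answer's output.
doc = """<CIP><RecordList>
<Record>
  <Date>2017-05-26T00:00:00</Date><Grade>2</Grade>
  <ReasonsList><Reason><Code>R</Code><Description>local</Description></Reason></ReasonsList>
  <Score>xxx</Score>
</Record>
<Record>
  <Date>2017-04-30T00:00:00</Date><Grade>2</Grade>
  <ReasonsList><Reason><Code>R</Code><Description/></Reason></ReasonsList>
  <Score>xyx</Score>
</Record>
</RecordList></CIP>"""

def leaf_texts(record):
    """Text of every descendant of `record` that has no child elements."""
    return [
        (el.text or "").strip()
        for el in record.iter()               # document order, includes record itself
        if el is not record and len(el) == 0  # descendants only, leaves only
    ]

# One flat row per Record, regardless of how deeply the leaves were nested.
rows = [leaf_texts(rec) for rec in ET.fromstring(doc).iter("Record")]
for row in rows:
    print(row)
```

Intermediate containers such as ReasonsList and Reason disappear, and each Record collapses into five values, which matches the five-column layout of the data frame above.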

The flexibility of * comes at a price in terms of computation. However, the computational burden does not fall on R, which is an interpreted language, but on the highly efficient external C library libxml2. Results should be equal to or better than those of other utilities and libraries.

Tuesday, October 26, 2021
answered 1 Month ago