Asked  7 Months ago    Answers:  5   Viewed   44 times

Consider the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<OS>
    <data>
        <OSes>
            <centos>
                <v_5>
                    <i386>
                        <id>centos5-32</id>
                        <name>CentOS 5 - 32 bit</name>
                        <version>5</version>
                        <architecture>32</architecture>
                        <os>centos</os>
                    </i386>
                    <x86_64>
                        <id>centos5-64</id>
                        <name>CentOS 5 - 64 bit</name>
                        <version>5</version>
                        <architecture>64</architecture>
                        <os>centos</os>
                    </x86_64>
                </v_5>
                <v_6>
                    <i386>
                        <id>centos6-32</id>
                        <name>CentOS 6 - 32 bit</name>
                        <version>6</version>
                        <architecture>32</architecture>
                        <os>centos</os>
                    </i386>
                    <x86_64>
                        <id>centos6-64</id>
                        <name>CentOS 6 - 64 bit</name>
                        <version>6</version>
                        <architecture>64</architecture>
                        <os>centos</os>
                    </x86_64>
                </v_6>
            </centos>
            <ubuntu>
                <v_10>
                    <i386>
                        <id>ubuntu10-32</id>
                        <name>Ubuntu 10 - 32 bit</name>
                        <version>10</version>
                        <architecture>32</architecture>
                        <os>ubuntu</os>
                    </i386>
                    <amd64>
                        <id>ubuntu10-64</id>
                        <name>Ubuntu 10 - 64 bit</name>
                        <version>10</version>
                        <architecture>64</architecture>
                        <os>ubuntu</os>
                    </amd64>
                </v_10>
            </ubuntu>
        </OSes>
    </data>
</OS>

From the XML document above, I want to extract the following five element nodes:

  1. <id>
  2. <name>
  3. <version>
  4. <architecture>
  5. <os>

I want them collected in an array. I tried the following:

<?php
require_once "xml.php"; // presumably defines $xmlstr

try {
    $xml = new SimpleXMLElement($xmlstr);
    // Union of the five element names, matched anywhere in the document
    foreach ($xml->xpath('//id | //name | //version | //architecture | //os') as $record) {
        echo $record;
    }
} catch (Exception $e) {
    echo $e->getMessage();
}

The above code works, but each record is a separate object. I want to consolidate all five element nodes into one array element, something like this:

$osList = Array( [0] => Array(
                               ["id"] => "<id>",
                               ["name"] => "<name>",
                               ["version"] => "<version>",
                               ....
)
 .....
);

The syntax isn't correct, but you get the idea. Any idea how to do this?

 Answers

31

This might help:

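$obj = new SimpleXMLElement($xml); // $xml holds the XML string
$rtn = array();
$cnt = 0;
// //OSes/*/* selects the version elements (v_5, v_6, v_10)
foreach ($obj->xpath('//OSes/*/*') as $rec)
{
  // each child of a version element is one architecture (i386, x86_64, amd64)
  foreach ($rec as $rec_obj)
  {
    $rtn[$cnt] = array();

    // copy id, name, version, architecture and os as strings
    foreach ($rec_obj as $name => $val)
    {
      $rtn[$cnt][(string)$name] = (string)$val;
    }
    ++$cnt;
  }
}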
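
As an alternative sketch, assuming the same nesting as in the question (OSes, then distribution, then version, then architecture), the architecture-level elements can be selected directly with one XPath expression, and each one's children copied into a row. The shortened sample XML and the variable names $xmlstr and $osList are assumptions for illustration.

```php
<?php
// Shortened sample with the same nesting as the XML in the question (assumption).
$xmlstr = <<<XML
<OS><data><OSes>
  <centos><v_5>
    <i386><id>centos5-32</id><name>CentOS 5 - 32 bit</name><version>5</version><architecture>32</architecture><os>centos</os></i386>
    <x86_64><id>centos5-64</id><name>CentOS 5 - 64 bit</name><version>5</version><architecture>64</architecture><os>centos</os></x86_64>
  </v_5></centos>
</OSes></data></OS>
XML;

$xml = new SimpleXMLElement($xmlstr);
$osList = [];

// //OSes/*/*/* matches distribution -> version -> architecture elements.
foreach ($xml->xpath('//OSes/*/*/*') as $arch) {
    $row = [];
    // Each child (id, name, version, architecture, os) becomes one array key.
    foreach ($arch->children() as $name => $value) {
        $row[$name] = (string) $value;
    }
    $osList[] = $row;
}

print_r($osList);
```

With this sample, $osList has two entries, and $osList[0]['id'] is "centos5-32". This only works if every element three levels below OSes is an architecture entry, as it is in the question's document.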
Saturday, May 29, 2021
 
dimitarvp
answered 7 Months ago
17

You were probably not "cycling" the countries and cities:

<?php
    $xml_file = '/path';
    $xml = simplexml_load_file($xml_file);

    foreach($xml->Continent as $continent) {
        echo "<div class='continent'>".$continent['Name']."<span class='status'>".$continent['Status']."</span>";
        foreach($continent->Country as $country) {
            echo "<div class='country'>".$country['Name']."<span class='status'>".$country['Status']."</span>";
            foreach($country->City as $city) {
                echo "<div class='city'>".$city['Name']."<span class='status'>".$city['Status']."</span></div>";
            }
            echo "</div>"; // close Country div
        }
        echo "</div>"; // close Continent div
    }
?>
Saturday, May 29, 2021
 
avon_verma
answered 7 Months ago
32

Try this XPath:

/object/data[@type="me"]

Which reads as:

  • Select (/) children of the document root called object
  • Select (/) their children called data
  • Filter ([...]) that list to elements where ...
    • the attribute type (the @ means "attribute")
    • has the text value me

So:

$myDataObjects = $simplexml->xpath('/object/data[@type="me"]');

If object is not the root of your document, you might want to use //object/data[@type="me"] instead. The // means "find all descendants" rather than "find all children".
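
A minimal sketch of the attribute filter in use. The sample document here is hypothetical; only the element name data, the attribute type and the value me come from the XPath above.

```php
<?php
// Hypothetical document; element and attribute names follow the XPath above.
$simplexml = new SimpleXMLElement(
    '<object>
        <data type="me">first</data>
        <data type="other">second</data>
        <data type="me">third</data>
    </object>'
);

// Absolute path: this only matches when <object> is the root element.
$myDataObjects = $simplexml->xpath('/object/data[@type="me"]');

foreach ($myDataObjects as $data) {
    echo (string) $data, PHP_EOL;
}
```

Here the loop visits the two data elements whose type attribute is "me" and skips the one marked "other".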

Tuesday, June 1, 2021
 
koenHuybrechts
answered 7 Months ago
42

Here's another pure bash way. Works fine when your input is reasonably consistent and you don't need much flexibility in which section you pick out.

extractedNum="${testString#*:}"     # Remove through first :
extractedNum="${extractedNum#*:}"   # Remove through second :
extractedNum="${extractedNum%%:*}"  # Remove from next : to end of string

You could also filter the file while reading it, in a while loop for example:

while IFS=' ' read -r col line ; do
    # col has the column you wanted, line has the whole line
    # # #
done < <(sed -e 's/\([^:]*:\)\{2\}\([^:]*\).*/\2 &/' "yourfile")

The sed command picks out the field that follows the first two colons and separates that value from the entire line with a space. If you don't need the entire line, just remove the space+& from the replacement and drop the line variable from the read. You can pick any column by changing the number in the {2} bit. (Put the command in double quotes if you want to use a variable there.)

Saturday, October 30, 2021
 
Bitwise
answered 1 Month ago
26

As you indicate in your comments to the original question, you are prepared to program a solution. I would propose using Java and the iText PDF library. It enables you to extract text from documents as long as the text actually is extractable (you actually can put glyphs into a PDF but drop the mappings from glyphs to characters).

You can find sample code for PDF text extraction with iText in the ExtractPageContent* samples for chapter 15 of iText in Action — 2nd Edition. Especially ExtractPageContentArea is of interest in your case.

Essentially, you only have to take that sample and generalize it to extract the text from multiple areas on the page.

Saturday, November 20, 2021
 
EEk
answered 2 Weeks ago