Asked  7 Months ago    Answers:  5   Viewed   31 times

I maintain a bulletin board that saves rich text messages in HTML. Now I need to migrate all those messages into Joomla Kunena bulletin board that requires BBCode representation of HTML.

Is there any library to convert HTML to BBCode cleanly. There are bunch of scripts out there for BBCode to HTML but not the other way around.

Thanks...

 Answers

39

It should be doable with XSLT in text output mode:

<xsl:output method="text">
…
<xsl:template match="b|strong"><xsl:apply-templates/></xsl:template>
<xsl:template match="br">&#10;</xsl:template>
<xsl:template match="p">&#10;<xsl:apply-templates/>&#10;</xsl:template>
<xsl:template match="a"><xsl:apply-templates/></xsl:template>
<xsl:template match="text()"><x:value-of select="normalize-space(.)"/></xsl:template>

To get there parse HTML and use built-in XSLT processor.

Wednesday, March 31, 2021
 
Palladium
answered 7 Months ago
57

If you are able to obtain a DOMDocument object representing your HTML, then you just need to traverse it recursively and construct the data structure that you want.

Converting your HTML document into a DOMDocument should be as simple as this:

function html_to_obj($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->documentElement);
}

Then, a simple traversal of $dom->documentElement which gives the kind of structure you described could look like this:

function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

Test case

$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
    <head>
        <title> This is a test </title>
    </head>
    <body>
        <h1> Is this working? </h1>  
        <ul>
            <li> Yes </li>
            <li> No </li>
        </ul>
    </body>
</html>

EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output

{
    "tag": "html",
    "lang": "en",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "title",
                    "html": " This is a test "
                }
            ]
        },
        {
            "tag": "body",
            "html": "  n        ",
            "children": [
                {
                    "tag": "h1",
                    "html": " Is this working? "
                },
                {
                    "tag": "ul",
                    "children": [
                        {
                            "tag": "li",
                            "html": " Yes "
                        },
                        {
                            "tag": "li",
                            "html": " No "
                        }
                    ],
                    "html": "n        "
                }
            ]
        }
    ]
}

Answer to updated question

The solution proposed above does not work with the <script> element, because it is parsed not as a DOMText, but as a DOMCharacterData object. This is because the DOM extension in PHP is based on libxml2, which parses your HTML as HTML 4.0, and in HTML 4.0 the content of <script> is of type CDATA and not #PCDATA.

You have two solutions for this problem.

  1. The simple but not very robust solution would be to add the LIBXML_NOCDATA flag to DOMDocument::loadHTML. (I am not actually 100% sure whether this works for the HTML parser.)

  2. The more difficult but, in my opinion, better solution, is to add an additonal test when you are testing $subElement->nodeType before the recursion. The recursive function would become:

function element_to_obj($element) {
    echo $element->tagName, "n";
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        elseif ($subElement->nodeType == XML_CDATA_SECTION_NODE) {
            $obj["html"] = $subElement->data;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

If you hit on another bug of this type, the first thing you should do is check the type of node $subElement is, because there exists many other possibilities my short example function did not deal with.

Additionally, you will notice that libxml2 has to fix mistakes in your HTML in order to be able to build a DOM for it. This is why an <html> and a <head> elements will appear even if you don't specify them. You can avoid this by using the LIBXML_HTML_NOIMPLIED flag.

Test case with script

$html = <<<EOF
        <script type="text/javascript">
            alert('hi');
        </script>
EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output

{
    "tag": "html",
    "children": [
        {
            "tag": "head",
            "children": [
                {
                    "tag": "script",
                    "type": "text/javascript",
                    "html": "n            alert('hi');n        "
                }
            ]
        }
    ]
}
Wednesday, March 31, 2021
 
employeegts
answered 7 Months ago
21

There are various BBCode parsers available for PHP, for instance

  • http://www.php.net/manual/en/book.bbcode.php

which allows you to simply define your replacement rules by hand:

echo bbcode_parse(
    bbcode_create(
        array(
            'b' => array(
                'type'      => BBCODE_TYPE_NOARG,
                'open_tag'  => '<b>',
                'close_tag' => '</b>'
            )
        )
    ),
    'Bold Text'
);
// prints <b>Bold Text</b>

Also check the various similar questions about BBCode Parsers:

  • https://stackoverflow.com/search?q=bbcode+php
Sunday, June 20, 2021
 
WooDzu
answered 4 Months ago
13

Convert from HTML to XML with HTML Tidy

Downloadable Binaries

JRoppert, For your need, i guess you might want to look at the Sources

c:temp>tidy -help
tidy [option...] [file...] [option...] [file...]
Utility to clean up and pretty print HTML/XHTML/XML
see http://tidy.sourceforge.net/

Options for HTML Tidy for Windows released on 14 February 2006:

File manipulation
-----------------
 -output <file>, -o  write output to the specified <file>
 <file>
 -config <file>      set configuration options from the specified <file>
 -file <file>, -f    write errors to the specified <file>
 <file>
 -modify, -m         modify the original input files

Processing directives
---------------------
 -indent, -i         indent element content
 -wrap <column>, -w  wrap text at the specified <column>. 0 is assumed if
 <column>            <column> is missing. When this option is omitted, the
                     default of the configuration option "wrap" applies.
 -upper, -u          force tags to upper case
 -clean, -c          replace FONT, NOBR and CENTER tags by CSS
 -bare, -b           strip out smart quotes and em dashes, etc.
 -numeric, -n        output numeric rather than named entities
 -errors, -e         only show errors
 -quiet, -q          suppress nonessential output
 -omit               omit optional end tags
 -xml                specify the input is well formed XML
 -asxml, -asxhtml    convert HTML to well formed XHTML
 -ashtml             force XHTML to well formed HTML
 -access <level>     do additional accessibility checks (<level> = 0, 1, 2, 3).
                     0 is assumed if <level> is missing.

Character encodings
-------------------
 -raw                output values above 127 without conversion to entities
 -ascii              use ISO-8859-1 for input, US-ASCII for output
 -latin0             use ISO-8859-15 for input, US-ASCII for output
 -latin1             use ISO-8859-1 for both input and output
 -iso2022            use ISO-2022 for both input and output
 -utf8               use UTF-8 for both input and output
 -mac                use MacRoman for input, US-ASCII for output
 -win1252            use Windows-1252 for input, US-ASCII for output
 -ibm858             use IBM-858 (CP850+Euro) for input, US-ASCII for output
 -utf16le            use UTF-16LE for both input and output
 -utf16be            use UTF-16BE for both input and output
 -utf16              use UTF-16 for both input and output
 -big5               use Big5 for both input and output
 -shiftjis           use Shift_JIS for both input and output
 -language <lang>    set the two-letter language code <lang> (for future use)

Miscellaneous
-------------
 -version, -v        show the version of Tidy
 -help, -h, -?       list the command line options
 -xml-help           list the command line options in XML format
 -help-config        list all configuration options
 -xml-config         list all configuration options in XML format
 -show-config        list the current configuration settings

Use --blah blarg for any configuration option "blah" with argument "blarg"

Input/Output default to stdin/stdout respectively
Single letter options apart from -f may be combined
as in:  tidy -f errs.txt -imu foo.html
For further info on HTML see http://www.w3.org/MarkUp
Wednesday, June 23, 2021
 
rasmusx
answered 4 Months ago
45

Based on your screenshot of Settings (Preferences on Mac) | Editor | Language Injections.

Please delete 3rd language injection rule from the bottom (the one for "div" -- that has "IDE" in Scope column).

That rule injects HTML into div tag which tells IDE to treat all other code (even PHP) inside such tag as HTML/plain text.

Thursday, August 5, 2021
 
Nil
answered 3 Months ago
Nil
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :