I want to convert the below HTML to PDF using iTextSharp but don't know where to start:

<style>
.headline{font-size:200%}
</style>
<p>
  This <em>is </em>
  <span class="headline" style="text-decoration: underline;">some</span>
  <strong>sample<em> text</em></strong>
  <span style="color: red;">!!!</span>
</p>

 Answers

34

First, HTML and PDF are not related although they were created around the same time. HTML is intended to convey higher level information such as paragraphs and tables. Although there are methods to control it, it is ultimately up to the browser to draw these higher level concepts. PDF is intended to convey documents and the documents must "look" the same wherever they are rendered.

In an HTML document you might have a paragraph that's 100% wide and depending on the width of your monitor it might take 2 lines or 10 lines and when you print it it might be 7 lines and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.

Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are other things like annotations and movies but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table" but instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated so I know it will appear to be around the text".
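
To make the "I've previously calculated the width" idea concrete, here is a minimal sketch in plain Java with no PDF library involved. It assumes every character has the same width, which real fonts do not, and the names `FixedLayoutSketch`, `PlacedText` and `layout` are hypothetical, made up for this illustration. The point is only that the producer decides every line break and every X,Y position before anything is drawn.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: decide line breaks and coordinates up front, as a PDF producer must. */
final class FixedLayoutSketch {

    /** One positioned run of text, i.e. "draw this text at this exact X,Y". */
    static final class PlacedText {
        final String text;
        final double x;
        final double y;

        PlacedText(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumption: every character is charWidth points wide (real fonts vary per glyph).
    static List<PlacedText> layout(String paragraph, double left, double top,
                                   double lineWidth, double charWidth, double leading) {
        List<PlacedText> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double y = top;
        for (String word : paragraph.trim().split("\\s+")) {
            int candidateLength = current.length() == 0
                    ? word.length()
                    : current.length() + 1 + word.length();
            // Start a new line when adding the word would overflow the measured width.
            if (current.length() > 0 && candidateLength * charWidth > lineWidth) {
                lines.add(new PlacedText(current.toString(), left, y));
                current.setLength(0);
                y -= leading; // PDF's Y axis grows upward, so the next line sits lower
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            lines.add(new PlacedText(current.toString(), left, y));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (PlacedText line : layout("This is some sample text !!!", 36, 800, 60, 6, 14)) {
            System.out.println(line.text + " @ (" + line.x + ", " + line.y + ")");
        }
    }
}
```

With 6-point-wide characters and a 60-point line, the sample sentence breaks into four lines. Change the line width and the breaks move, which is the decision a browser makes at render time and a PDF producer has to make in advance.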

Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc., are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML from your framework of choice; iText won't help you. If you get an exception saying "The document has no pages", or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML; you only think you do.

Third, the built-in class that's been around for years is HTMLWorker; however, it has been replaced by XMLWorker (available for both Java and .Net). Zero work is being done on HTMLWorker, which doesn't support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you do not see the HTML attribute or CSS property and value in this file then it probably isn't supported by HTMLWorker. XMLWorker can be more complicated sometimes, but those complications also make it more extensible.

Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document that you are working on. C# and Java are very similar so it should be relatively easy to convert this. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported the class="headline" gets ignored but everything else should actually work. Example #2 is the same as the first except it uses XMLWorker instead. Example #3 also parses the simple CSS example.

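//Namespaces used by the code below
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;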

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

2017's update

There is good news for HTML-to-PDF demands. As this answer showed, the W3C standard css-break-3 will solve the problem... It is a Candidate Recommendation with a plan to become a definitive Recommendation this year, after tests.

As a not-so-standard alternative, there are solutions with plugins for C#, as shown by print-css.rocks.

Tuesday, June 1, 2021
 
muaaz
answered 7 Months ago
59

You have nested text blocks. That's illegal PDF syntax. I think recent versions of iTextSharp warn you about this, so I guess you're using an old version.

This is wrong:

cb.BeginText();
...
cb.BeginText();
...
cb.EndText();
...
cb.EndText();

This is right:

cb.BeginText();
...
cb.EndText();
...
cb.BeginText();
...
cb.EndText();

Moreover: ISO-32000-1 tells you that some operations are forbidden inside a text block.

This is wrong:

cb.BeginText();
...
cb.AddImage(img);
...
cb.EndText();

This is right:

cb.BeginText();
...
cb.EndText();
...
cb.AddImage(img);
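
The two rules above (no nested text blocks, no image inside a text block) can be checked mechanically. The following is a minimal sketch in plain Java that does not use iTextSharp at all: the operation names are just strings mirroring the calls shown above, and `TextBlockChecker` is a hypothetical helper, not part of any library. It covers only these two rules, not everything ISO-32000-1 forbids inside a text block.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: checks a sequence of operation names against the two rules above. */
final class TextBlockChecker {

    /** Returns null when the sequence is valid, otherwise a description of the first problem. */
    static String firstProblem(List<String> operations) {
        boolean insideText = false;
        for (int i = 0; i < operations.size(); i++) {
            String op = operations.get(i);
            switch (op) {
                case "BeginText":
                    if (insideText) {
                        return "nested BeginText at index " + i;
                    }
                    insideText = true;
                    break;
                case "EndText":
                    if (!insideText) {
                        return "EndText without BeginText at index " + i;
                    }
                    insideText = false;
                    break;
                case "AddImage":
                    if (insideText) {
                        return "AddImage inside a text block at index " + i;
                    }
                    break;
                default:
                    break; // other operations are not checked in this sketch
            }
        }
        return insideText ? "text block never closed" : null;
    }

    public static void main(String[] args) {
        // The first "wrong" sequence from above, then the last "right" one.
        System.out.println(firstProblem(Arrays.asList("BeginText", "BeginText", "EndText", "EndText")));
        System.out.println(firstProblem(Arrays.asList("BeginText", "EndText", "AddImage")));
    }
}
```

Logging your own drawing calls in this form and running them through such a check is one way to find the offending call when a viewer rejects the file.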

Finally, some operators are mandatory when creating a text block. For instance: you always need setFontAndSize() (I don't know what you're doing in writeText(), but I assume you're setting the font correctly).

In any case: you have chosen to use iTextSharp at the lowest level, writing PDF syntax almost manually. This assumes that you know ISO-32000-1 inside-out. If you don't, you should use some of the high-level objects, such as ColumnText to position content at absolute positions.

Saturday, July 31, 2021
 
Sean Werkema
answered 5 Months ago
72

You can change the page size at any point, and it will apply to the pages that follow. Some snippets:

Set up your document somewhere (you know that already):

  var document = new Document();
  PdfWriter pdfWriter = PdfWriter.GetInstance(
    document, new FileStream(destinationFile, FileMode.Create)
  );
  pdfWriter.SetFullCompression();
  pdfWriter.StrictImageSequence = true;
  pdfWriter.SetLinearPageMode();           

Now loop over your pages (you probably do that as well already) and decide what page size you want per page:

  for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++) {
    // Define the page size here, _before_ you start the page.
    // You can easily switch from landscape to portrait to whatever
    document.SetPageSize(new Rectangle(600, 800));

    if (document.IsOpen()) {
      document.NewPage();
    } else {
      document.Open();
    }
  }
Saturday, August 14, 2021
 
hohner
answered 4 Months ago
29

OK, it seems you can't do it directly with only OpenPDF; you have to use Flying Saucer: get flying-saucer-pdf-openpdf and then use it. An example:

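import java.io.File;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.xhtmlrenderer.pdf.ITextRenderer;

String inputFile = "my.xhtml";
String outputFile = "generated.pdf";

//Flying Saucer loads the document from a URL, so turn the file path into one
String url = new File(inputFile).toURI().toURL().toString();

ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(url);
renderer.layout();

//Write the laid-out document to disk as a PDF
try (OutputStream os = Files.newOutputStream(Paths.get(outputFile))) {
    renderer.createPDF(os);
}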


PS: Flying Saucer expects XHTML syntax. If you have problems with your HTML file, you could use Jsoup to convert it first:

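import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.xhtmlrenderer.pdf.ITextRenderer;

String inputFile = "my.html";
String outputFile = "generated.pdf";

//Read the file as UTF-8 rather than relying on the platform default charset
String html = new String(Files.readAllBytes(Paths.get(inputFile)), StandardCharsets.UTF_8);
//Let Jsoup parse the HTML and re-serialize it with XML syntax
final Document document = Jsoup.parse(html);
document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

ITextRenderer renderer = new ITextRenderer();
renderer.setDocumentFromString(document.html());
renderer.layout();

try (OutputStream os = Files.newOutputStream(Paths.get(outputFile))) {
    renderer.createPDF(os);
}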
Monday, October 4, 2021
 
heisenbergman
answered 2 Months ago
10

This is really trivial to do on your own. You didn't specify a language so the sample below uses VB.Net since (I think) it handles XML more easily. See the code comments for more details. This is targeting iTextSharp 5.4.4 but should work with pretty much any version.

''//Sample XML
Dim TextXML = <?xml version="1.0" encoding="utf-8"?>
              <catalog>
                  <cd>
                      <SR.No>14</SR.No>
                      <test>loss test</test>
                      <code>ISO-133</code>
                      <unit>gm</unit>
                      <sampleid>36</sampleid>
                      <boreholeid>21</boreholeid>
                      <pieceno>63</pieceno>
                  </cd>
                  <cd>
                      <SR.No>24</SR.No>
                      <test>sand</test>
                      <code>ISO-133</code>
                      <unit>gm</unit>
                      <sampleid>71</sampleid>
                      <boreholeid>22</boreholeid>
                      <pieceno>23</pieceno>
                  </cd>
                  <cd>
                      <SR.No>25</SR.No>
                      <test>clay</test>
                      <code>ISO-133</code>
                      <unit>mg</unit>
                      <sampleid>52</sampleid>
                      <boreholeid>21</boreholeid>
                      <pieceno>36</pieceno>
                  </cd>
              </catalog>

''//File to write to
Dim TestFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Test.pdf")

''//Standard PDF creation, nothing special here
Using fs As New FileStream(TestFile, FileMode.Create, FileAccess.Write, FileShare.None)
    Using doc As New Document()
        Using writer = PdfWriter.GetInstance(doc, fs)
            doc.Open()

            ''//Create a table with one column for every child node of <cd>
            Dim T As New PdfPTable(TextXML.<catalog>.<cd>.First.Nodes.Count)

            ''//Loop through the first item to output column headers
            For Each N In TextXML.<catalog>.<cd>.First.Elements
                T.AddCell(N.Name.ToString())
            Next

            ''//Loop through each CD row (this is so we can call complete later on)
            For Each CD In TextXML.<catalog>.Elements
                ''//Loop through each child of the current CD
                For Each N In CD.Elements
                    T.AddCell(N.Value)
                Next

                ''//Just in case any rows have too few cells fill in any blanks
                T.CompleteRow()
            Next

            ''//Add the table to the document
            doc.Add(T)

            doc.Close()
        End Using
    End Using
End Using

EDIT

Here's a C# version. I've included a helper method that creates a large XML document based on your template to show page overflow. The PdfPTable will automatically span multiple pages. You can specify the number of rows that should be considered a "header" so that they repeat on subsequent pages. You'll probably also want to apply some formatting rules, but you should be able to find those online (look for PdfPTable.DefaultCell).

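//Namespaces used by the code below
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using iTextSharp.text;
using iTextSharp.text.pdf;

private XDocument createXml() {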
    //Create our sample XML document
    var xml = new XDocument(new XDeclaration("1.0", "utf-8", "yes"));

    //Add our root node
    var root = new XElement("catalog");
    //All child nodes
    var nodeNames = new[] { "SR.No", "test", "code", "unit", "sampleid", "boreholeid", "pieceno" };
    XElement cd;

    //Create a bunch of <cd> items
    for (var i = 0; i < 1000; i++) {
        cd = new XElement("cd");
        foreach (var nn in nodeNames) {
            cd.Add(new XElement(nn) { Value = String.Format("{0}:{1}", nn, i.ToString()) });
        }
        root.Add(cd);
    }

    xml.Add(root);

    return xml;
}

private void doWork() {
    //Sample XML
    var xml = createXml();

    //File to write to
    var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");

    //Standard PDF creation, nothing special here
    using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var doc = new Document()) {
            using (var writer = PdfWriter.GetInstance(doc, fs)) {
                doc.Open();

                //Count the columns
                var columnCount = xml.Root.Elements("cd").First().Nodes().Count();

                //Create a table with one column for every child node of <cd>
                var t = new PdfPTable(columnCount);

                //Flag that the first row should be repeated on each page break
                t.HeaderRows = 1;

                //Loop through the first item to output column headers
                foreach (var N in xml.Root.Elements("cd").First().Elements()) {
                    t.AddCell(N.Name.ToString());
                }

                //Loop through each CD row (this is so we can call complete later on)
                foreach (var CD in xml.Root.Elements()) {
                    //Loop through each child of the current CD. Limit the number of children to our initial count just in case there are extra nodes.
                    foreach (var N in CD.Elements().Take(columnCount)) {
                        t.AddCell(N.Value);
                    }
                    //Just in case any rows have too few cells fill in any blanks
                    t.CompleteRow();
                }

                //Add the table to the document
                doc.Add(t);

                doc.Close();
            }
        }
    }
}
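
The inner loop above protects the table against ragged input in two ways: `Take(columnCount)` drops any extra children and `CompleteRow()` fills in whatever is missing. As a rough illustration of that truncate-then-pad idea, here is a minimal sketch in plain Java that works on lists of strings and does not touch iTextSharp. `RowNormalizer` is a hypothetical name, and padding with empty strings is an assumption standing in for the table's default cell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: force every row to have exactly columnCount cells. */
final class RowNormalizer {

    /** Extra cells are dropped; missing cells are filled with empty strings. */
    static List<String> normalize(List<String> cells, int columnCount) {
        List<String> row = new ArrayList<>(columnCount);
        for (int i = 0; i < columnCount; i++) {
            row.add(i < cells.size() ? cells.get(i) : "");
        }
        return row;
    }

    public static void main(String[] args) {
        // A row that is too short, then a row that is too long.
        System.out.println(normalize(Arrays.asList("14", "loss test"), 3));
        System.out.println(normalize(Arrays.asList("a", "b", "c", "d"), 3));
    }
}
```

Every row then contributes the same number of cells, so one malformed `<cd>` element cannot shift the columns of the rows that follow it.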
Monday, October 25, 2021
 
Moz
answered 2 Months ago