Asked  7 Months ago    Answers:  5   Viewed   36 times

I need to extract all the images from a PDF file on my server. I don't want the PDF pages, only the images at their original size and resolution.

How could I do this with Perl, PHP or any other UNIX based app (which I would invoke with the exec function from PHP)?

 Answers

33

pdfimages does just that. It's is part of the poppler-utils and xpdf-utils packages.

From the manpage:

Pdfimages saves images from a Portable Document Format (PDF) file as Portable Pixmap (PPM), Portable Bitmap (PBM), or JPEG files.

Pdfimages reads the PDF file, scans one or more pages, PDF-file, and writes one PPM, PBM, or JPEG file for each image, image-root-nnn.xxx, where nnn is the image number and xxx is the image type (.ppm, .pbm, .jpg).

NB: pdfimages extracts the raw image data from the PDF file, without performing any additional transforms. Any rotation, clipping, color inversion, etc. done by the PDF content stream is ignored.

Wednesday, March 31, 2021
 
Sagar
answered 7 Months ago
92

You need DomPDF for rendering the PDF. Check if you have the DomPDF library, otherwise download and implement it as follow:

$rendererName = PHPExcel_Settings::PDF_RENDERER_DOMPDF;
$rendererLibrary = 'domPDF0.6.0beta3';
$rendererLibraryPath = dirname(__FILE__). 'libs/classes/dompdf' . $rendererLibrary;
Wednesday, March 31, 2021
 
Akdeniz
answered 7 Months ago
60

The problem is that you're using file_get_contents incorrectly; as you've used it, it will fetch the contents of include.php, not the results of its execution. You need to expand it to a full URL in order to have it fetched that way, though I would advise not doing that. Have the script generate a PDF file and then load that, using the file output option of mpdf:

$mpdf = new mPDF();
$mpdf->WriteHTML($html);
$mpdf->Output('/path/to/files/doc.pdf', 'F');

Then run that script and attach the resulting file from PHPMailer (and delete the file afterwards):

include 'invoice.php';
$mail = new PHPMailer();
...
$mail->addAttachment('/path/to/files/doc.pdf');
$mail->send();
unlink('/path/to/files/doc.pdf');

You could skip the external file approach by using the "return a string" output mode (S) of the Output method and returning the string from the included file:

$mpdf = new mPDF();
$mpdf->WriteHTML($html);
return $mpdf->Output('doc.pdf', 'S');

and then:

$pdf = include 'invoice.php';
$mail = new PHPMailer();
...
$mail->addStringAttachment($pdf, 'doc.pdf');
$mail->send();
Wednesday, March 31, 2021
 
lechup
answered 7 Months ago
72

First you need to grab the SlidePart in which you want to get the images from:

public static SlidePart GetSlidePart(PresentationDocument presentationDocument, int slideIndex)
{
    if (presentationDocument == null)
    {
        throw new ArgumentNullException("presentationDocument", "GetSlidePart Method: parameter presentationDocument is null");
    }

    // Get the number of slides in the presentation
    int slidesCount = CountSlides(presentationDocument);

    if (slideIndex < 0 || slideIndex >= slidesCount)
    {
        throw new ArgumentOutOfRangeException("slideIndex", "GetSlidePart Method: parameter slideIndex is out of range");
    }

    PresentationPart presentationPart = presentationDocument.PresentationPart;

    // Verify that the presentation part and presentation exist.
    if (presentationPart != null && presentationPart.Presentation != null)
    {
        Presentation presentation = presentationPart.Presentation;

        if (presentation.SlideIdList != null)
        {
            // Get the collection of slide IDs from the slide ID list.
            var slideIds = presentation.SlideIdList.ChildElements;

            if (slideIndex < slideIds.Count)
            {
               // Get the relationship ID of the slide.
               string slidePartRelationshipId = (slideIds[slideIndex] as SlideId).RelationshipId;

                // Get the specified slide part from the relationship ID.
                SlidePart slidePart = (SlidePart)presentationPart.GetPartById(slidePartRelationshipId);

                 return slidePart;
             }
         }
     }

     // No slide found
     return null;
}

Then you need to search for the Picture object which will contain the image you are looking for based on the file name of the image:

Picture imageToRemove = slidePart.Slide.Descendants<Picture>().SingleOrDefault(picture => picture.NonVisualPictureProperties.OuterXml.Contains(imageFileName));
Tuesday, August 17, 2021
 
VinoCoder
answered 2 Months ago
16

As already claimed in a comment to the question,

Considering the fairly straight forward page content stream style, the data should be extractable using a not too complicated custom text extractor.

In detail:

The page content stream style

Regular table entry content is drawn entry by entry, each entry field by field in reading order. Thus, while going through the content stream we do not have to try and re-arrange the content to establish that order. This makes this task fairly easy.

So the main work will be to ignore non-entries, i.e. the header on the first page, the bars indicating where a new first letter starts, and the page numbers.

We do so by

  • ignoring graphics and non-black text which takes care of the header and the first letter bars;
  • not accepting entries not starting with data in the SCHOOL NAME column which takes care of the page numbers which only live in the CAMPUS NAME column.

(Other approaches also would have done, e.g. ignoring everything in a bottom page area to take care of the page numbers.)

Now we merely have to split the entries into their fields.

Again the document structure helps, as it is a very uniform document, the table columns have the identical position and dimensions on each page. So we merely have to dissect at fixed x values.

There is just one stumbling block: in some entries atomic text chunks contain content of different columns. E.g. sometimes the contents of the F and M columns are drawn as a single string like "YN" and the optical distance is introduced via character spacing.

So we have to process the text chunks character by character, not as a whole.

A sample implementation

I use Java and the PDF library iText (current version 5.5.7 development snapshot) here. This does not mean at all that it cannot be done a using different setup, this merely is the setup I'm most accustomed to.

As separator I use the tab character because other likely candidates also occur as part of the text and I did not want to have to cope with escaping them.

This is the custom RenderListener class introduced to cope with the content as explained above:

public class CertifiedSchoolListExtractionStrategy implements RenderListener
{
    public CertifiedSchoolListExtractionStrategy(Appendable data, Appendable nonData)
    {
        this.data = data;
        this.nonData = nonData;
    }

    //
    // RenderListener implementation
    //
    @Override
    public void beginTextBlock() { }

    @Override
    public void endTextBlock() { }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        try
        {
            Vector startPoint = renderInfo.getBaseline().getStartPoint();
            BaseColor fillColor = renderInfo.getFillColor();
            if (fillColor instanceof GrayColor && ((GrayColor)fillColor).getGray() == 0)
            {
                if (debug)
                    data.append(String.format("%4dt%3.3f %3.3ft%sn", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                for (TextRenderInfo info : renderInfo.getCharacterRenderInfos())
                {
                    renderCharacter(info);
                }
            }
            else
            {
                if (debug)
                    nonData.append(String.format("%4dt%3.3f %3.3ft%sn", chunk, startPoint.get(I1), startPoint.get(I2), renderInfo.getText()));
                if (currentField > -1)
                    finishEntry();
                entryBuilder.append(renderInfo.getText());
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            chunk++;
        }
    }

    public void renderCharacter(TextRenderInfo renderInfo) throws IOException
    {
        Vector startPoint = renderInfo.getBaseline().getStartPoint();

        float x = startPoint.get(I1);

        if (currentField > -1)
        {
            if (isInCurrentField(x))
            {
                entryBuilder.append(renderInfo.getText());
                return;
            }
            if (isInNextField(x))
            {
                currentField++;
                entryBuilder.append('t').append(renderInfo.getText());
                return;
            }
            finishEntry();
        }
        if (isInNextField(x))
        {
            finishEntry();
            currentField = 0;
        }
        entryBuilder.append(renderInfo.getText());
    }

    public void close() throws IOException
    {
        finishEntry();
    }

    boolean isInCurrentField(float x)
    {
        if (currentField == -1)
            return false;

        if (x < fieldstarts[currentField])
            return false;

        if (currentField == fieldstarts.length - 1)
            return true;

        return x <= fieldstarts[currentField + 1];
    }

    boolean isInNextField(float x)
    {
        if (currentField == fieldstarts.length - 1)
            return false;

        if (x < fieldstarts[currentField + 1])
            return false;

        if (currentField == fieldstarts.length - 2)
            return true;

        return x <= fieldstarts[currentField + 2];
    }

    void finishEntry() throws IOException
    {
        if (entryBuilder.length() > 0)
        {
            if (currentField == fieldstarts.length - 1)
            {
                data.append(entryBuilder).append('n');
            }
            else
            {
                nonData.append(entryBuilder).append('n');
            }

            entryBuilder.setLength(0);
        }
        currentField = -1;
    }

    //
    // hidden members
    //
    final Appendable data, nonData;
    boolean debug = false;

    int chunk = 0;
    int currentField = -1;
    StringBuilder entryBuilder = new StringBuilder();

    final int[] fieldstarts = {20, 254, 404, 415, 431, 508, 534};
}

(CertifiedSchoolListExtractionStrategy.java)

We can use it like this:

@Test
public void testCertifiedSchoolList_9_16_2015() throws IOException
{
    try (   Writer data = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "data.txt")), "UTF-8");
            Writer nonData = new OutputStreamWriter(new FileOutputStream(new File(RESULT_FOLDER, "non-data.txt")), "UTF-8")    )
    {
        CertifiedSchoolListExtractionStrategy strategy = new CertifiedSchoolListExtractionStrategy(data, nonData);
        PdfReader reader = new PdfReader("certified-school-list-9-16-2015.pdf");

        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int page = 1; page <= reader.getNumberOfPages(); page++)
            parser.processContent(page, strategy);
        strategy.close();
    }
}

(ExtractCertifiedSchoolList.java)

Now data.txt contains all the entries as tab-separated lines and non-data.txt everything ignored.

Behind the scenes

To understand what is happening here, one first has to know how page content in PDFs is organized and how (for the sample code given) iText operates on it.

Inside the PDF

PDF documents are structures built from a number of base object types, some primitive types (numbers, strings, ...) and some more complex ones (arrays or dictionaries of other objects or streams).

A page in a PDF document is represented by such a dictionary object containing entries defining some page properties (like page dimensions) and other entries referencing objects that define what is drawn on the page: the content streams.

Content streams essentially contain a sequence of operations, which may

  • select a color (for stroking or filling),
  • define a path (move to some point, line to some other point, curve to yet another one, ...),
  • stroke or fill such a path,
  • draw some bitmap image somewhere,
  • draw some text somewhere, or
  • do numerous other things.

For the question at hand we mostly are interested in the operations involved in drawing text. In contrast to word processors the operations are not take this long string of text and arrange it as a paragraph but instead more primitively move text position here, draw this short string here, move text position again, and draw another string there.

E.g. in the sample PDF the oeprations for drawing the table header and the first entry line are these:

/TT2 1 Tf

Select font TT2 at size 1.

9.72 0 0 9.72 20.16 687.36 Tm

Set the text matrix to move the text insertion coordinates to 20.16, 687.36 and scale everything following by a factor of 9.72.

0 g

Select the grayscale fill color black

0 Tc
0 Tw

Select additional character and word spacing to 0.

(SCHOOL)Tj

Draw "SCHOOL" here.

/TT1 1 Tf

Select font TT1.

3.4082 0 TD

Move text insertion point by 3.4082 in x direction.

<0003>Tj

Draw a space character (the current font uses a different encoding which uses 16 bit per character, not 8, and here is represented hexadecimally).

/TT2 1 Tf
.2261 0 TD
[(NAME)-17887.4(CAMPUS)]TJ

Select font, move text insertion point, and draw the string "NAME", then a gap of 17887.4 text units, then draw "CAMPUS".

/TT1 1 Tf
24.1809 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
[(NAME)-8986.6(F)-923.7(M)-459.3(CITY)-6349.9(ST)-1390.2(CAMPUS)]TJ
/TT1 1 Tf
28.5147 0 TD
<0003>Tj
/TT2 1 Tf
.2261 0 TD
(ID)Tj

Draw the rest of the header line.

/TT4 1 Tf
-56.782 -1.3086 TD

Move left by 56.782 and down by 1.3086, i.e. to the start of the first entry line.

("I)Tj
/TT3 1 Tf
.6528 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-16894.2("I)]TJ
/TT3 1 Tf
18.9997 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Am")Tj
/TT3 1 Tf
1.7783 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(School)Tj
/TT3 1 Tf
2.6919 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Inc.)-8239.9(Y)-1018.9(N)-576.7(Mount)]TJ
/TT3 1 Tf
15.189 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
[(Shasta)-2423.3(CA)-2443.7(41789)]TJ

And draw the first entry line.

As you see and as I had mentioned above, the table content is drawn in reading order. Even multi line column entries come in the needed order, e.g. the campus name "A F International of Westlake Village":

[(Inc.)-7228.7(A)]TJ
/TT3 1 Tf
9.26 0 TD 
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(F)Tj
/TT3 1 Tf
.4595 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(International)Tj
/TT3 1 Tf
5.2886 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(of)Tj
/TT3 1 Tf
.8325 0 TD
<0003>Tj
/TT4 1 Tf
.2261 0 TD
(Westlake)Tj
/TT3 1 Tf
3.7739 0 TD
<0003>Tj
/TT4 1 Tf
-11.8374 -1.3086 TD

Move down to the second line of the column.

(Village)Tj
15.4938 1.3086 TD

Move up again to the main line of the entry.

[(Y)-1018.9(N)-576.7(Westlake)]TJ 

So we can digest the text as it comes, no need for sorting (the content could be ordered in a completely different way).

But we also see that there are no obvious column start and end points. To associate the text with a column, therefore, we have to calculate the positions of each character and compare them with externally given column start positions.

Parsing supported by libraries

PDF libraries usually provide some mechanism to help parsing such content streams.

There are two basic architectures for this, a library may parse the content stream

  • as a whole and provide it as a big array of positioned text chunks or
  • or piecewise and forward individual positioned text chunks using a listener pattern.

The former variant at first seems easier to handle but may have big resource requirements (I have come across multi-MB content streams), while the second one seems a bit more difficult to handle but has smaller memory requirements.

The library I used (iText) follows the latter approach but your problem could also have been solved using a library following the former one.

RenderListener is the listener interface to implement here, the renderText methods retrieves the tindividual text chunks with positions.

In the implementation above (CertifiedSchoolListExtractionStrategy) the renderText method first checks the fill color associated with the chunk and only forwards black text for further processing in renderCharacter. That method (and some helpers) in turn checks the field the text is in (by hard coded position boundaries) and accordingly exports tab separated values. This logic would similarly have been implemented using other libraries, too.

Monday, October 18, 2021
 
MarkMoore
answered 6 Days ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :