2 May 2012 20:58
Re: Extracting PDF metadata and exploding pages
SaGS <sags5495 <at> hotmail.com>
2012-05-02 18:58:37 GMT
----- Original Message -----
From: "Scott Gifford" <sgifford <at> suspectclass.com>
To: <gs-devel <at> ghostscript.com>
Sent: Tuesday, 1 May 2012 06:43
Subject: [gs-devel] Extracting PDF metadata and exploding pages

> ...
> First, it uses poppler's pdfinfo to extract metadata from the PDF, like
> this:
>
> Title:          t10_4C
> Creator:        Adobe Illustrator CS4
> Producer:       Adobe PDF library 9.00
> CreationDate:   Fri Dec 16 18:26:22 2011
> ModDate:        Fri Dec 16 18:26:22 2011
> Tagged:         no
> Pages:          1
> Encrypted:      no
> Page size:      270 x 162 pts
> File size:      955508 bytes
> Optimized:      yes
> PDF version:    1.4

Try Ghostscript's toolbin\pdf_info.ps. It may even be more suitable,
depending on exactly what metadata you need. For example, 'Page size'
above is vague: different pages may have different sizes, and each page
also has several 'boxes' (MediaBox, CropBox, and others). If some info
you need is not already provided, you can modify pdf_info.ps with only
a little PostScript programming.

Another tool to try is pdftk; see its dump_data command.

>
> Next, it splits a multi-page PDF into many single-page PDFs, with "pdftk
> burst".
>
> After that it uses ghostscript to generate PNG thumbnails of each page.

From your description it doesn't seem you *need* those one-page PDFs.
Convert the original PDF to one PNG per page in one go by using %d in
Ghostscript's output filename. The %d gets replaced with the page
number. If you prefer fixed-width, zero-padded numbers, use something
like %04d (yes, it's just C printf() formatting).

>
> The user then re-orders the pages in a Web UI using the thumbnails.

OK (that's your app).

> Finally, it puts them back together in a different order with ghostscript.

pdftk is more suitable for this task out of the box; see its cat
command. For example,

    pdftk IN.PDF cat 3 1 10-5 2 4 11-end output OUT.PDF

shuffles the first 10 pages and leaves the rest untouched.
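
To make the two suggestions above concrete, here is a minimal Python sketch that only builds the command lines. It is not taken from the original poster's application: the function names, the "gs" executable name (on Windows it is usually gswin32c or gswin64c), the png16m device, the 72 dpi resolution and the page-%04d.png pattern are all assumptions chosen for illustration.

```python
def png_per_page_cmd(pdf_path, out_pattern="page-%04d.png", dpi=72, gs="gs"):
    """Build a Ghostscript command that renders each page to its own PNG.

    The %04d in out_pattern is expanded by Ghostscript to the page number.
    Device, resolution and executable name are illustrative assumptions.
    """
    return [
        gs,
        "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def reorder_cmd(in_pdf, out_pdf, page_order):
    """Build a pdftk 'cat' command that writes pages in the given order.

    page_order is a list of 1-based page numbers, e.g. as returned by the
    web UI after the user has rearranged the thumbnails.
    """
    if not page_order:
        raise ValueError("page_order must not be empty")
    if any(p < 1 for p in page_order):
        raise ValueError("page numbers are 1-based")
    return ["pdftk", in_pdf, "cat", *[str(p) for p in page_order],
            "output", out_pdf]
```

Each list can be handed to subprocess.run(cmd, check=True). Passing a list rather than a single string means no shell is involved, so the % in the output pattern needs no escaping.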

The page reordering can be done using Ghostscript alone, without fully
interpreting the input file and generating a brand new output PDF, but
this requires [a lot?] more PostScript programming and knowledge about
PDF internals. Start with toolbin\pdfinflt.ps. This tool loads the
input PDF without interpreting it (= without translating it to a series
of drawing operations), then writes it out with the streams
uncompressed. You can do some surgery on the PDF Page tree between
loading the input and writing the output (and there's no need to
suppress compressing the streams). A much more complex example is
lib\pdfopt.ps; this one loads a PDF and writes it out linearised
('Web-optimised').

> ...
> I would really like to be able to load the PDF file into ghostscript one
> time, extract the data I need, then convert the pages one at a time to
> individual PDF files then to PNGs. Is it possible to drive ghostscript
> like this, having it do multiple operations on each page?

Not multiple operations on each page, but I think it's possible to get
the metadata and the one-PNG-per-page output in just one execution of
Ghostscript. I haven't tried it, though; I can imagine ways of doing
this, but it's not tame at all. In any case I don't think it's worth
the trouble: your only gain is that you start Ghostscript once instead
of twice. Most of the time is spent interpreting the PDF and generating
the PNGs.

> ...
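
For the simpler two-run approach, the metadata run could look roughly like the sketch below. The flags follow the usage comment I remember from the header of pdf_info.ps (-dNODISPLAY -q -sFile=...), so please check them against the copy shipped with your Ghostscript version. The function name, the script path and the "gs" executable name are assumptions, and newer Ghostscript releases restrict file access by default, so the script may need extra permission flags there.

```python
def pdf_info_cmd(pdf_path, script="toolbin/pdf_info.ps", gs="gs"):
    """Build a Ghostscript command that runs pdf_info.ps on one PDF.

    -dNODISPLAY: no output device is needed, the script only prints text.
    -sFile=...:  how the script is told which PDF to inspect (taken from
                 the script's own usage comment; verify for your version).
    """
    return [gs, "-dNODISPLAY", "-q", f"-sFile={pdf_path}", script]
```

The script's report arrives on standard output, so subprocess.run(cmd, capture_output=True, text=True) is one way to collect it before starting the second run that produces the PNGs.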