Tuesday, December 6, 2011

PDF Metadata Extraction - Multiple Files

This is going to be just a quick, short post (hey, don't laugh - it *can* happen!) with something I wanted to pass along to all my fearless readers.

Here's the scenario: I was stuck in Windows, and had a virtual ton of PDF files from which I needed to extract metadata. No fancy commercial tools such as EnCase were at my disposal to automate the task for me, so I turned to pdfinfo. For those who are not familiar with it, pdfinfo is part of xpdf, an open source PDF viewer. PDF file metadata (author, title, revision, etc.) is primarily stored in a couple of different places within a PDF: the Info Dictionary and/or the XMP (Extensible Metadata Platform) stream. pdfinfo (which is a free utility, by the way) will extract this metadata from within a PDF file. It's a command-line utility, which is fine by me.

I had already located and exported the PDF files in question out to a single directory for parsing, and I was hoping it'd be as quick and easy as pointing pdfinfo to that directory and redirecting output to a file of my choosing. Alas, that was not to be; the tool is designed to be run like

pdfinfo.exe file.pdf

which prints to STDOUT (or can be redirected to a text file, for instance). I tried it against a single file, and that worked fine. I then tried to use my limited Windows CLI knowledge to feed the PDFs to pdfinfo, with no joy. If I were in Linux, I would've been more comfortable creating a loop to go through the files and feed a variable (i.e., the file) to pdfinfo. I messed around with looping in Windows a bit, but (another piece of the scenario) time was limited, of course! In the process of trying to work out the loops, I looked at some posts on Commandline Kung Fu and other similar (well, similar, but less awesome, no doubt) sites. I may have had a syntax error or some other minor issue that caused trouble, but I never could get a loop to work, and just didn't have time to keep at it.

So here's my solution: I ran a quick file list for that directory, and used that in a spreadsheet to build out one line per PDF file; each line parses that file's metadata and appends the output to a plain text file (it's amazing what a little =concatenate, find/replace, and merge functions can do). I copied that over to Notepad++ and saved it as a batch (cmd) file. Then I just fired off the batch file and let it run through and give me the metadata I was looking for. Not pretty, not the way that the Masters over at Commandline Kung Fu would have done it, but it got the job done. Here's an example, sanitized for public consumption.


pdfinfo.exe "t:\output\xyz001_pdf_export\United 01.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Carpet 02.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Tree 03.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Interview 04.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Local 05.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\TipTop 06.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Safety 07.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Teleport 08.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Sharp 09.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
pdfinfo.exe "t:\output\xyz001_pdf_export\Water 10.pdf" >> t:\output\xyz001_pdf_export\pdf_metadata.txt
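
For anyone who'd rather skip the spreadsheet step, here's a minimal sketch of the same idea in Python: take a list of file names and build one pdfinfo line per PDF, all appending to the same report. It isn't what I actually ran; the function name build_batch_lines and its parameters are made up for illustration, and the only thing it assumes about pdfinfo is the "pdfinfo.exe file.pdf" usage shown above.

```python
from pathlib import PureWindowsPath


def build_batch_lines(pdf_names, export_dir, report_name="pdf_metadata.txt"):
    """Return one 'pdfinfo.exe "<pdf>" >> <report>' line per PDF name.

    Hypothetical helper: it only builds the text of the batch file,
    it does not run pdfinfo itself.
    """
    folder = PureWindowsPath(export_dir)
    report = folder / report_name
    lines = []
    for name in sorted(pdf_names):
        if not name.lower().endswith(".pdf"):
            continue  # skip the report itself and any other non-PDF files
        # Quote the PDF path because the file names contain spaces.
        lines.append(f'pdfinfo.exe "{folder / name}" >> {report}')
    return lines


# Print the lines for two of the sanitized file names from the example above.
for line in build_batch_lines(
    ["United 01.pdf", "Carpet 02.pdf"], r"t:\output\xyz001_pdf_export"
):
    print(line)
```

In practice the list of names could come from os.listdir() on the export directory, and the printed lines could be redirected into a .cmd file and run exactly like the hand-built batch file.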


So there it is, a short post (perhaps my first?). Hopefully it's helpful to someone else who needs to extract metadata from PDF files.

-----------------------------------

Just a quick update. In discussions last night on Twitter, I mentioned that I thought Phil Harvey's exiftool would process PDFs for metadata. Rob Lee confirmed this, and called exiftool "the bomb-diggity." :) I said I would test it to compare against pdfinfo.

The two applications provide similar information; certainly the core info is the same (such as creation dates, permissions, author). pdfinfo provides information above and beyond exiftool, though, such as encryption, page size (actual dimensions), tags, and form. Before you go thinking that pdfinfo is the way to go, I'll say that I find exiftool's output easier to read; each file entry is clearly separated, and the layout/format is nice (to me). BTW, pdfinfo also reports the filename based on the internal "Title." This can be confusing if the two don't match up. Exiftool reports the filename as seen by the filesystem/user, and the Title per the metadata.

Exiftool also gets you past the need to do any scripting, loops, etc. That's because you can run it like this:


"exiftool(-k).exe" -P t:\output\exports\desktop_pdf_export\*.pdf >> t:\output\exports\desktop_pdf_export\pdf_metadata_3.txt


And it's much faster. So while I still think pdfinfo is a great tool, I'm leaning toward Rob Lee's "bomb-diggity" direction on exiftool. ;) If I'd thought of that for PDFs, I'd probably never have seen pdfinfo, so it's a good thing I got to try both out. I think both are good, both are useful, and I'd use both again, certainly for cross-validation.
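
If you do want to cross-validate the two tools, here's a rough sketch in Python of one way to go about it. It assumes both tools print one "Key: value" pair per line (pdfinfo as "Title:   ...", exiftool as "Title    : ..."), and it skips lines starting with "=====", which is how exiftool separates files when given more than one. The function names and the sample text are made up for illustration; they are not real case output.

```python
def parse_metadata(text):
    """Parse 'Key: value' lines (pdfinfo or exiftool style) into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("====="):
            continue  # per-file separator line, not a metadata field
        # Split on the first colon only, so values with colons (dates) survive.
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # blank line or anything else without a 'Key:' part
        fields[key.strip()] = value.strip()
    return fields


def compare_fields(a, b, keys):
    """Return {key: (value_a, value_b)} for the keys whose values differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Made-up sample output, for illustration only.
pdfinfo_sample = (
    "Title:          Quarterly Report\n"
    "Author:         J. Doe\n"
    "Pages:          3\n"
)
exiftool_sample = (
    "======== Report 01.pdf\n"
    "File Name                       : Report 01.pdf\n"
    "Title                           : Quarterly Report\n"
    "Author                          : John Doe\n"
)

differences = compare_fields(
    parse_metadata(pdfinfo_sample),
    parse_metadata(exiftool_sample),
    ["Title", "Author"],
)
print(differences)
```

A straight string comparison like this only makes sense for fields both tools format the same way (Title, Author); dates would need normalizing first, since each tool prints them in its own format.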

So there's the quick update. Enjoy!