Parsing PDF in .NET
20 January 2010
Yesterday, I spent most of my day working with iTextSharp as a tool for parsing PDF files. What I learned is that its API is oriented towards creating PDF files, not parsing them. I learned this the hard way, but this link pretty much sums it up. There are examples out in the wild that attempt to use iTextSharp to pull the text out of PDF files, like this one on Code Project. When you dig into them, you’ll find that at some level they rely on hand-rolled parsers of the raw PDF contents. I didn’t want to go down that path.
That Code Project article on getting text pointed to another tool, one it considered too heavy: PDFBox. Like iText, PDFBox comes from the Java world. It also offers .NET support through the magic of IKVM, a Java-to-.NET conversion tool. The only ‘problem’ with PDFBox, as with anything brought over from Java via IKVM, is that it’s very Java-esque, as you can see here. I spent the rest of my day with PDFBox. While I’m only starting to scratch the surface of its API, it certainly has the chops to get anything and everything out of PDFs.
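To give a feel for what Java-esque means in practice, here is a minimal sketch of pulling the text out of a PDF with PDFBox from C#. It rests on a few assumptions: the project references the IKVM-converted PDFBox assemblies along with the IKVM runtime, the PDFBox build is a 0.7.x release that exposes the org.pdfbox namespaces (later releases moved to org.apache.pdfbox), and "sample.pdf" is only a placeholder file name. Treat it as an illustration rather than a tested recipe.

```csharp
using System;
using org.pdfbox.pdmodel;   // assumption: PDFBox 0.7.x namespaces via IKVM
using org.pdfbox.util;

class PdfTextDump
{
    static void Main(string[] args)
    {
        // Hypothetical input path, used only for illustration.
        string path = args.Length > 0 ? args[0] : "sample.pdf";

        PDDocument doc = null;
        try
        {
            // Java naming survives the conversion: lower-case method names.
            doc = PDDocument.load(path);

            // PDFTextStripper walks the pages and returns their text.
            PDFTextStripper stripper = new PDFTextStripper();
            Console.WriteLine(stripper.getText(doc));
        }
        finally
        {
            // PDDocument exposes close() rather than IDisposable,
            // so a using block is not available here.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

The lower-case load, getText, and close methods, plus the explicit try/finally in place of a using block, are the sort of thing that makes the API read like Java transplanted into C#.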