Parsing PDF in .NET

20 January 2010

Yesterday, I spent most of my day working with iTextSharp as a tool to parse PDF files. What I learned is that this API is oriented toward creating PDF files, not parsing them. I learned this the hard way by trying to make it work, but this link pretty much sums it up. There are examples out in the wild that attempt to use iTextSharp to extract the text from PDF files, like this one on Code Project. When you dig into them, you’ll find that at some level they rely on hand-rolled parsers of the raw PDF contents. I didn’t want to go down that path.
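
To give a rough sense of what hand-rolled parsing of raw PDF contents involves, here is a minimal sketch in Java (the world iText and PDFBox come from). It is not taken from the Code Project example or from iTextSharp. It assumes an already-decompressed content stream and only looks for literal strings followed by the Tj (show text) operator. The class name NaiveTjExtractor and the sample stream are made up for illustration. Real PDFs also use compressed streams, TJ arrays, hex strings, and font-specific encodings, none of which this handles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy extractor: illustrates the idea, not a usable PDF parser.
class NaiveTjExtractor {
    // A literal string operand "( ... )" followed by the Tj operator.
    // Backslash-escaped characters such as \( and \) may appear inside.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    static List<String> extract(String contentStream) {
        List<String> out = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            // Drop the backslash from escaped characters. This is a
            // simplification: it mishandles \n, octal codes, and so on.
            out.add(m.group(1).replaceAll("\\\\(.)", "$1"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up, uncompressed content stream for demonstration only.
        String stream =
            "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (PDF \\(raw\\)) Tj ET";
        System.out.println(extract(stream));
    }
}
```

Even this toy needs special cases for escaping, and each feature it ignores would add more. That is the path a dedicated parsing library spares you.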

That Code Project link for getting text suggested another tool that was too heavy, PDFBox. Like iText, PDFBox comes from the Java world. It also offers .NET support through the magic of IKVM, a Java-to-.NET conversion tool. The only ‘problem’ with PDFBox, and with anything brought over from Java via IKVM, is that it’s very Java-esque, as you can see here. I spent the rest of my day with PDFBox. While I’m only starting to scratch the surface of its API, it certainly has the chops to get anything and everything out of PDFs.

Posted by sourcetonuts
Filed in Uncategorized

One Response to “Parsing PDF in .NET”

Hamid Hassan Says:

16 February 2010 at 1:25 am
helpful post. thanks.

Source to Nuts