Parsing PDF in .NET

20 January 2010

Yesterday, I spent most of my day trying to use iTextSharp to parse PDF files. What I learned is that its API is oriented towards creating PDF files, not parsing them. I learned this the hard way, by trying to make it work, but this link pretty much sums it up. There are examples out in the wild that use iTextSharp to pull the text out of PDF files, like this one on Code Project. When you dig into them, you’ll find that at some level they rely on hand-rolled parsers of the raw PDF contents. I didn’t want to go down that path.
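To give a feel for what "hand-rolled" means here, below is a minimal sketch. It is not taken from iTextSharp or from the Code Project article. It is written in Java, since both libraries come from that world, and the class and method names are made up for illustration. It assumes you already have an uncompressed page content stream as a string, and it only understands the simplest case: a literal string followed by the Tj (show text) operator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: scans an uncompressed PDF content stream
// for "(some text) Tj" and collects the string operands.
public class TjTextSketch {
    // A literal string operand followed by the Tj (show text) operator.
    // Group 1 is the string body; backslash escapes are allowed inside it.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    public static String extractShownText(String contentStream) {
        StringBuilder out = new StringBuilder();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            if (out.length() > 0) {
                out.append(' '); // assumption: separate text runs with a space
            }
            out.append(unescape(m.group(1)));
        }
        return out.toString();
    }

    // Handles only the \( \) and \\ escapes; octal and newline escapes are ignored.
    private static String unescape(String s) {
        return s.replaceAll("\\\\([()\\\\])", "$1");
    }

    public static void main(String[] args) {
        String stream =
            "BT /F1 12 Tf 72 712 Td (Hello) Tj 0 -14 Td (PDF \\(raw\\)) Tj ET";
        System.out.println(extractShownText(stream)); // prints "Hello PDF (raw)"
    }
}
```

Even this toy version skips compressed streams, TJ arrays, hex strings, font encodings, and text positioning, which is a fair hint at why a parser built this way gets fragile quickly.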
That Code Project article on getting text pointed to another tool, PDFBox, which it described as too heavy. Like iText, PDFBox comes from the Java world. It also supports .NET through the magic of IKVM, a Java-to-.NET conversion tool. The only ‘problem’ with PDFBox, and with anything else that comes from Java via IKVM, is that it’s very Java-esque, as you can see here. I spent the rest of my day with PDFBox. While I’m just starting to scratch the surface of its API, it certainly has the chops to get anything and everything out of PDFs.

One Response to “Parsing PDF in .NET”

  1. Hamid Hassan Says:

    helpful post. thanks.
