Oh the tools I use…(Mono and HTML Agility Pack)

18 July 2009

A customer needed to scrape hundreds of thousands, and eventually millions, of business listing records from a search engine. They told me that the job would be difficult because the search engine detects repeated requests and blocks the requester for a period. They suggested that I would have to work through proxies.

Since this search company runs search engines in many countries, I decided on a different approach: randomly switching among their different servers for the results (e.g. http://it.yahoo.com/, then http://fr.yahoo.com/, etc.). This approach worked really well. The code was all written in C#/.NET and ported to OS X and Linux with Mono. Mono worked great BTW!!
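For the curious, the switching idea can be sketched in a few lines of C#. This is not the original code, just a minimal illustration: the class name HostRotator and its methods are made up for this example, and the pathAndQuery argument is a placeholder for whatever search path the real site expects.

```csharp
using System;
using System.Collections.Generic;

// Sketch only (hypothetical class): picks a random regional host for each
// request so that consecutive queries do not all hit the same server.
public class HostRotator
{
    private readonly IList<string> hosts;
    private readonly Random random;

    public HostRotator(IList<string> hosts, Random random)
    {
        if (hosts == null || hosts.Count == 0)
            throw new ArgumentException("At least one host is required.", "hosts");
        this.hosts = hosts;
        // Fall back to a time-seeded generator when none is supplied.
        this.random = random ?? new Random();
    }

    // Returns a randomly chosen base URL from the configured list.
    public string NextHost()
    {
        return hosts[random.Next(hosts.Count)];
    }

    // Joins a random host with a path and query string.
    // pathAndQuery is a placeholder; the real search path is not shown here.
    public string BuildUrl(string pathAndQuery)
    {
        return NextHost().TrimEnd('/') + "/" + pathAndQuery.TrimStart('/');
    }
}
```

Constructed with the two hosts mentioned above, as in new HostRotator(new[] { "http://it.yahoo.com/", "http://fr.yahoo.com/" }, null), each call to BuildUrl lands on one of them at random. Passing in a seeded Random makes the sequence repeatable, which is handy when debugging.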

Once the HTML was ‘in hand’, I parsed it with HTML Agility Pack, an open source tool with XPath-style queries. It allowed me to walk through the HTML as if it were XML. Love it!
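To give a flavour of what that looks like, here is a small sketch, assuming the HtmlAgilityPack assembly is referenced. The ListingParser class and the XPath you pass in are invented for illustration; the real pages and queries were different.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Sketch only (hypothetical helper): runs an XPath query over raw HTML.
public static class ListingParser
{
    // Returns the trimmed text of every node matched by the XPath expression.
    public static List<string> ExtractText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        // LoadHtml tolerates sloppy markup such as unclosed tags.
        doc.LoadHtml(html);

        var results = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return results;

        foreach (HtmlNode node in nodes)
        {
            // DeEntitize turns entities such as &amp; back into plain characters.
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}
```

A call such as ListingParser.ExtractText(html, "//div[@class='listing']/h3") would pull the heading text out of each matching block. The null check matters: forgetting it is an easy way to get a NullReferenceException on pages with no results.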

This code is one of those pieces of code that takes longer to run than to write. For some reason, I find perverse pleasure in that! :>)

Oh the tools I use…

Kenny

2 Responses to “Oh the tools I use…(Mono and HTML Agility Pack)”

  1. zeno elea Says:

    Can’t get the HTML Agility Pack DLL to load in osx.

  2. sourcetonuts Says:

    Sorry to hear that, it worked fine for me. I would suggest posting your errors on stackoverflow.com with tags OSX, Mono and HTML Agility Pack.

