Oh, the tools I use… (Mono and HTML Agility Pack)
18 July 2009
A customer needed to scrape hundreds of thousands, and eventually millions, of business listing records from a search engine. They told me the job would be difficult because the search engine detects repeated requests and blocks the client for a period. They suggested that I would have to work through proxies.
Since this search company runs search engines in many countries, I decided on a different approach: I randomly switch between their different servers for the results (e.g. http://it.yahoo.com/, then http://fr.yahoo.com/, etc.). This approach worked really well. The code was all written in C#/.NET and ported to OS X and Linux with Mono. Mono worked great BTW!!
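For the curious, here is a minimal sketch of the server-switching idea, not the original code. The HostRotator class, its method names, and the choice to avoid hitting the same host twice in a row are all my assumptions for illustration. Only the two hosts mentioned above are listed, and any request path you pass in is a placeholder.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: hands out a randomly chosen regional host per request.
public sealed class HostRotator
{
    private readonly IReadOnlyList<string> _hosts;
    private readonly Random _rng;
    private int _last = -1; // index of the host returned by the previous call

    public HostRotator(IReadOnlyList<string> hosts, Random rng = null)
    {
        if (hosts == null || hosts.Count == 0)
            throw new ArgumentException("At least one host is required.", nameof(hosts));
        _hosts = hosts;
        _rng = rng ?? new Random();
    }

    // Picks a random host. Assumption: when more than one host is
    // available, re-draw until it differs from the previous pick.
    public string Next()
    {
        int i;
        do
        {
            i = _rng.Next(_hosts.Count);
        } while (_hosts.Count > 1 && i == _last);

        _last = i;
        return _hosts[i];
    }

    // Combines the next host with a relative path and query string.
    public Uri BuildUri(string pathAndQuery)
    {
        return new Uri(new Uri(Next()), pathAndQuery);
    }
}
```

Usage would look something like `new HostRotator(new[] { "http://it.yahoo.com/", "http://fr.yahoo.com/" }).BuildUri("/search?p=pizza")`, where the path and query are made up for the example. Each call to BuildUri lands on a different server than the one before it.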
Once the HTML was ‘in hand’, I parsed it with HTML Agility Pack, an open source tool that supports XPath queries. It let me walk through the HTML as if it were XML. Love it!
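To give a flavor of what that looks like, here is a small sketch rather than the actual scraper. The ListingParser class and the markup it expects (a div with class "listing" wrapping a link) are invented for illustration, since the real page structure is not shown here. The HTML Agility Pack calls themselves (HtmlDocument.LoadHtml, SelectNodes, GetAttributeValue, HtmlEntity.DeEntitize) are the library's own.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class ListingParser
{
    // Assumed (hypothetical) markup:
    //   <div class="listing"><a href="http://example.com">Business name</a></div>
    public static List<(string Name, string Url)> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of real-world, not-quite-valid HTML

        var results = new List<(string Name, string Url)>();

        // XPath query over the parsed HTML, just like querying XML.
        var links = doc.DocumentNode.SelectNodes("//div[@class='listing']/a[@href]");
        if (links == null)
            return results; // SelectNodes can return null when nothing matches

        foreach (var link in links)
        {
            var name = HtmlEntity.DeEntitize(link.InnerText).Trim();
            var url = link.GetAttributeValue("href", "");
            results.Add((name, url));
        }
        return results;
    }
}
```

The null check matters: when the XPath expression matches nothing, SelectNodes may hand back null instead of an empty collection, which is an easy way to crash a long-running scrape.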
This code is one of those pieces of code that takes longer to run than to write. For some reason, I find perverse pleasure in that! :>)