scraping data from the web is difficult (but fun!) around the edges

Monday, December 29, 2008

I've been working on a variety of data collection tools, and I plan to keep at it. It's fun, I'm getting some things done, and it helps exercise the brain. Through all of these exercises:

I'm starting to realize that the hardest part about collecting data from the web is not grabbing the data (Nokogiri makes that pretty easy). It's making the data fit into the 'holes' you want it to fit in. I'm not referring to int, varchar, etc., or to 'an array of hashes' either; that's super easy. The ASCII/non-ASCII character sets are the rather large pain. You have the web/HTML, Ruby (or Python or Java or Perl or PHP or whatever), and MySQL. Each has different constraints on text and represents some characters differently. It's especially apparent when collecting foreign and domestic names, as one example: the tildes, the umlauts, the accents, etc. It's really a set of devilish details that depends on the problem you're trying to solve. Yet another reason to master regular expressions, not to mention to grok the text representation of each system.
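To make the encoding mismatch concrete, here is a minimal Ruby sketch of coercing scraped bytes into valid UTF-8 before they go anywhere near the database. The helper name `to_utf8` and its `source_encoding` parameter are hypothetical, and it assumes you already know (or have guessed) the encoding the page was served in and that the database side is configured for UTF-8 as well.

```ruby
# Hypothetical helper: turn raw scraped bytes into a valid UTF-8 string.
# source_encoding is an assumption about what the page actually used.
def to_utf8(raw, source_encoding = "UTF-8")
  # Label the bytes with the encoding we believe they are in.
  text = raw.dup.force_encoding(source_encoding)
  # Transcode to UTF-8, replacing anything that cannot be converted.
  text = text.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  # Catch invalid bytes that survive when the source was already UTF-8.
  text.scrub("?")
end

latin1_bytes = "B\xC4CKSTR\xD6M".b        # the name as ISO-8859-1 bytes
puts to_utf8(latin1_bytes, "ISO-8859-1")
```

Funnelling every scraped string through one function like this keeps the "which encoding is this?" question in a single place instead of scattered across each scraper.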

...I'm starting to believe I need to build an Adapter to bridge the gap between all of these character interfaces. In my Ruby classes, all inputs should funnel into a form the database expects and can represent properly. Web, Ruby, and database should understand each other the way I (a human) can when looking at names and interpreting them. I.e., scraping Joakim BÄCKSTRÖM should equal Joakim BÄCKSTRÖM whether the site has Joakim BackstrÖm or Joakim Backstrom or Joakim BACKSTROM or whatever. No matter which version I scrape on any site, they should all be equal and should all lead back to PLAYERID = 623, for example, so all his data is collated and connected. The name should also appear on the output side as one 'blessed' name.
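One way the idea could look in Ruby: reduce every scraped spelling to a canonical key, then map that key to a single player record that carries the 'blessed' display name. This is only a sketch. `canonical_key`, `find_player`, and the `PLAYERS` hash are hypothetical names standing in for whatever the real classes and database table would be, and it relies on `String#unicode_normalize` (Ruby 2.2 or later) with UTF-8 input.

```ruby
# Reduce a name to a comparison key: decompose accented letters (NFD),
# drop the combining marks, collapse whitespace, and downcase.
def canonical_key(name)
  name.unicode_normalize(:nfd)
      .gsub(/\p{Mn}/, "")
      .gsub(/\s+/, " ")
      .strip
      .downcase
end

# Hypothetical lookup table; in practice this would live in the database,
# keyed on a stored canonical_key column.
PLAYERS = {
  canonical_key("Joakim BÄCKSTRÖM") => { id: 623, display_name: "Joakim BÄCKSTRÖM" }
}.freeze

# Every scraped variant resolves to the same record (or nil if unknown).
def find_player(scraped_name)
  PLAYERS[canonical_key(scraped_name)]
end

["Joakim BackstrÖm", "Joakim Backstrom", "Joakim BACKSTROM"].each do |variant|
  player = find_player(variant)
  puts "#{variant} -> PLAYERID #{player[:id]} (#{player[:display_name]})"
end
```

Stripping combining marks only covers letters that decompose into a base letter plus an accent. Characters such as ø, ł, or ß have no such decomposition, so they would need an explicit substitution table, and two different people whose names collapse to the same key would still need a tiebreaker such as team or birth date.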

Any thoughts on this issue? Any ideas of a different design pattern I could use besides Adapter?

2 comments:

sneha said...: Hi Friend,
Congratulations for this nice looking blog. All article is good. I like It.; 4:49 AM
Unknown said...: Hello All,

Web Content Extractor is the most powerful and easy-to-use data extraction software for web scraping and data extraction from the websites. Web scraping is a constantly growing phenomenon on the Internet. This is a really informative post. Thank you for sharing it with us.
Data Extraction; 7:29 PM

Mark Holton's Weblog :: Web Application Development
