Monday, December 29, 2008

scraping data from the web is difficult (but fun!) around the edges

I've been working on a variety of data collection tools, and plan to keep doing so. It's fun, I'm getting some things done, and it helps exercise the brain. Through all of these exercises:

I'm starting to realize that the hardest part about collecting data from the web is not grabbing the data (Nokogiri makes that pretty easy); it's making the data fit into the 'holes' you want it to fit in. I'm not referring to int, varchar, etc., or to 'an array of hashes' either; that's super easy. The ASCII/non-ASCII character sets are the rather large pain. You have the web/HTML, Ruby (or Python or Java or Perl or PHP or whatever), and MySQL. Each has different constraints on text and represents some characters differently. It's especially apparent when collecting foreign and domestic names, as one example: the tildes, the umlauts, the accents, etc. It's really a set of devilish details that depends on the problem you're trying to solve. Yet another reason to master regular expressions, not to mention to grok the text representation of each system.
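To make the character-set mismatch concrete, here is a minimal sketch of the first step: getting scraped bytes into one known encoding before they go anywhere near the database. It assumes a Ruby version with built-in encoding support (1.9 or later), and it assumes that anything that is not valid UTF-8 is ISO-8859-1, which is only a guess and will be wrong for some sites. The helper name `to_utf8` is hypothetical.

```ruby
# Hypothetical helper: coerce scraped bytes into a UTF-8 string.
# Assumption: bytes that are not valid UTF-8 are ISO-8859-1 (Latin-1).
def to_utf8(bytes, assumed_encoding = 'ISO-8859-1')
  # First, check whether the bytes already form valid UTF-8.
  text = bytes.dup.force_encoding('UTF-8')
  return text if text.valid_encoding?

  # Otherwise reinterpret the same bytes in the assumed encoding
  # and transcode them to UTF-8.
  bytes.dup.force_encoding(assumed_encoding).encode('UTF-8')
end

latin1_bytes = "B\xC4CKSTR\xD6M".b   # "BÄCKSTRÖM" as served by a Latin-1 page
puts to_utf8(latin1_bytes)           # prints BÄCKSTRÖM
puts to_utf8('BÄCKSTRÖM')            # already UTF-8, passes through unchanged
```

The database side needs to agree as well: the MySQL connection and the table columns have to be set to a matching character set, or the same name can still get mangled on the way in.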

...am starting to believe I need to build an Adapter to bridge the gap between all of these character interfaces. In my Ruby classes, all inputs should funnel into a form that the database expects and can represent properly. Web, Ruby, and database should understand each other the way I (a human) can when looking at names and interpreting them. E.g., scraping Joakim BÄCKSTRÖM should equal Joakim BÄCKSTRÖM whether the site has Joakim BackstrÖm or Joakim Backstrom or Joakim BACKSTROM or whatever. No matter which version I scrape on any site, they should all be equal, and should all lead back to PLAYERID = 623, for example, so all his data is collated and connected. The name should also appear on the output side as one 'blessed' name.
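One way such an adapter could look, as a rough sketch rather than a finished design: reduce every scraped name to a comparison key, and keep a lookup from that key to the player id and the blessed name. This assumes Ruby 2.2 or later for `String#unicode_normalize`, and UTF-8 input. `NameAdapter` and its methods are hypothetical names made up for illustration.

```ruby
# Hypothetical sketch: maps any scraped spelling of a name to one
# player id and one 'blessed' display name.
class NameAdapter
  def initialize
    @ids = {}      # comparison key => player id
    @blessed = {}  # player id => blessed display name
  end

  # Build a comparison key: decompose accented characters (NFD),
  # drop the combining marks, collapse whitespace, and downcase.
  def self.key_for(name)
    name.unicode_normalize(:nfd)
        .gsub(/\p{Mn}/, '')
        .strip
        .gsub(/\s+/, ' ')
        .downcase
  end

  def register(player_id, blessed_name)
    @ids[self.class.key_for(blessed_name)] = player_id
    @blessed[player_id] = blessed_name
  end

  def player_id(scraped_name)
    @ids[self.class.key_for(scraped_name)]
  end

  def blessed_name(scraped_name)
    @blessed[player_id(scraped_name)]
  end
end

adapter = NameAdapter.new
adapter.register(623, 'Joakim BÄCKSTRÖM')

adapter.player_id('Joakim BackstrÖm')      # => 623
adapter.player_id('Joakim Backstrom')      # => 623
adapter.blessed_name('Joakim BACKSTROM')   # => "Joakim BÄCKSTRÖM"
```

Stripping combining marks only handles characters that decompose into a base letter plus an accent. Letters such as ø or ß have no such decomposition and would need an explicit substitution table, and two different players whose names differ only by accents would collide on the same key.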

Any thoughts on this issue? Any ideas of a different design pattern I could use besides Adapter?

Thursday, December 25, 2008

Merry Christmas, Irish Optimists!

Irish annihilate Hawaii

What a phenomenal game for the Irish yesterday in the Hawaii Bowl. Clausen was perfect: completely accurate, with 401 yards and 5 TD passes through 2.5 quarters. Armando Allen ran back a kick for a TD, and Golden had 2 bomb TD catches plus one punt run back (called back for roughing, too bad).


The Irish looked dominant on both sides of the ball and on special teams. It was by far the best game they played all year, let alone the best game on the road, and it came against a Hawaii team that beat Fresno State and nearly beat the Big East champion, Cincinnati.

Enjoy this sequence of pics from Armando Allen's kickoff runback... check out the block by Tate as he leads Armando through the hole in the wedge... awesome:

wedge develops and Golden starts through the hole...
line up Hawaii #18...
on his back!

Tuesday, December 23, 2008

Merb and Rails unite in Rails 3!

I see this as great news for the Ruby and Rails and Merb communities... what are your thoughts?

Yehuda Katz's post is a great rundown:

http://weblog.rubyonrails.org/2008/12/23/merb-gets-merged-into-rails-3/comments/24239#comment-24239

http://yehudakatz.com/2008/12/23/rails-and-merb-merge/

http://rubyonrails.org/merb