Sunday, January 11, 2009

normalizing 3-part names -- initial stake in the ground

...all tests passing... ...collecting all 2008 PGATour data, and more Euro data now...

There are ZERO orphans in the 2008 PGATour data right now. I have collected each and every player's data for 36 tournaments in 2008, including all the players with 3-part names.

The Player class is not in its final form, but it is there and it splits names appropriately. It still doesn't flatten special wacky characters, and I'm not using any Bayesian techniques yet, but it takes care of the 3-part names accurately: Jose Maria Olazabal, David Berganio Jr., Davis Love III, etc. I also had to regex out the trailing tag in things like "Davis Love III (PB)", where the (PB) indicates the course name.

# Strip a two-character course tag such as "(PB)", plus the space before it
re = /\s*\(\w{2}\)/
processed = name.gsub(re, "").strip

...re-scraping about 75 tournaments for PGA and Euro Tour with new names in the next 15 mins... pushed the new code to the carl_spackler GitHub repo.
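For anyone following along, here is a rough sketch of how the 3-part name splitting could look. This is NOT the actual CARL_SPACKLER::Player code: the split_name helper, the SUFFIXES list, and the hash keys are all made up for illustration, and treating the middle word of "Jose Maria Olazabal" as a middle name is an assumption of the sketch.

```ruby
# Hypothetical helper for illustration only; not the real Player class.
# Assumed list of name suffixes that should not be treated as a last name.
SUFFIXES = %w[Jr. Jr Sr. Sr II III IV].freeze

def split_name(raw)
  # Drop a two-character course tag such as "(PB)" and tidy the whitespace.
  cleaned = raw.gsub(/\s*\(\w{2}\)/, "").strip
  parts = cleaned.split(/\s+/)

  # Peel off a trailing suffix (Jr., III, ...) if there is one.
  suffix = SUFFIXES.include?(parts.last) ? parts.pop : nil

  first = parts.shift   # first word is the first name
  last = parts.pop      # last remaining word is the last name (nil if none)
  middle = parts.empty? ? nil : parts.join(" ")  # whatever is left in between

  { first: first, middle: middle, last: last, suffix: suffix }
end

split_name("Davis Love III (PB)")
# first "Davis", last "Love", suffix "III", no middle name
```

Pulling the suffix off before picking the last name is what keeps "Jr." and "III" from ending up in the last-name slot.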

CARL_SPACKLER::Player class:

