Sunday, January 11, 2009

normalizing 3 part names -- initial stake in the ground

...all tests passing... ...collecting all 2008 PGATour data, and more Euro data now...

There are ZERO orphans in the 2008 PGATour data right now. I have collected each and every player's data for 36 tournaments in 2008, including all of the 3 part names.

The Player class is not in its ultimate form, but it is there and it splits names appropriately. It still doesn't flatten special wacky characters, and I'm not using any Bayesian techniques yet, but it takes care of the 3 part names accurately: Jose Maria Olazabal, David Berganio Jr., Davis Love III, etc. I also had to regex out the trailing tag in things like "Davis Love III (PB)", where the (PB) indicates the course name:

re = /\(\w{2}\)/
processed = name.gsub(re, "")

...re-scraping about 75 tournaments for PGA and Euro Tour with new names in the next 15 mins... pushed the new code to the carl_spackler GitHub repo.
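To make the course-tag stripping concrete, here is a minimal sketch of that regex on its own. The constant and method names (`COURSE_TAG`, `strip_course_tag`) are my own placeholders and are not part of the repo; the trailing `strip` is also an addition, since removing the tag leaves a space behind.

```ruby
# Hypothetical names for illustration; the regex itself is the one from the post.
COURSE_TAG = /\(\w{2}\)/

def strip_course_tag(name)
  # Remove a two-character tag in parentheses, then trim the leftover whitespace.
  name.gsub(COURSE_TAG, "").strip
end

puts strip_course_tag("Davis Love III (PB)")  # prints "Davis Love III"
puts strip_course_tag("Jose Maria Olazabal")  # prints "Jose Maria Olazabal"
```

Note that `\w{2}` only matches tags of exactly two word characters, so a longer tag such as a hypothetical "(ABC)" would be left in place.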

CARL_SPACKLER::Player class:

class Player
  SPECIALS = []
  # final names that are a complete last name on their own (last 1 name = lname)
  LAST_ONE_NAMES = ["Olazabal", "Jimenez", "Johnson", "Singh", "Thompson", "Wan", "Hicks"]
  # final names that belong to a two-part last name (last 2 names = lname)
  LAST_TWO_NAMES = ["V", "IV", "III", "II", "Jr.", "Jr", "Sr.", "Sr", "Jong", "Pelt", "Broeck"]

  # lname may include spaces to accommodate "Berganio Jr.", "Love III", etc
  attr_reader :fname, :lname, :full_name

  def initialize(scraped_full_name)
    @full_name = scraped_full_name
    @fname = ""
    @lname = ""
    parse_clean_name
  end

  def translate_crazy_name_char(special_char)
    special_char.strip # really just a stub for now
  end

  def flatten(name)
    # TODO: flatten special characters to non-freakish ASCII, e.g. make é = e (not e'')
    processed = name.gsub(/\(\w{2}\)/, "") # strip out course in parens, e.g. "Davis Love III (PB)"
    processed.gsub(/,/, "")                # get rid of commas in name
  end

  def parse_clean_name
    # take full name and break it apart based on some simple rules
    # later may use Bayesian techniques
    # names with 1 part or more than 3 parts are left with empty fname/lname for now
    names = flatten(@full_name).split(" ")
    if names.length == 2 # normal
      @fname = names[0]
      @lname = names[1]
    elsif names.length == 3
      # the final name decides how a 3 part name is split
      if LAST_ONE_NAMES.include?(names[2]) # e.g. "Jose Maria Olazabal"
        @fname = names[0] + " " + names[1]
        @lname = names[2]
      elsif LAST_TWO_NAMES.include?(names[2]) # a Jr., III, "Van Pelt" style name
        @fname = names[0]
        @lname = names[1] + " " + names[2]
      else # some untrapped 3 part name that doesn't match either case
        # split as if it's LAST_TWO_NAMES
        @fname = names[0]
        @lname = names[1] + " " + names[2]
      end
    end
  end
end
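For a quick feel of what the splitting rule produces, here is a condensed standalone sketch. It is not the class above: `split_name` and `ONE_WORD_SURNAMES` are hypothetical names, the list is only a subset of LAST_ONE_NAMES, and it assumes any 3 part name not on that list is split like a Jr./III name.

```ruby
# Hypothetical helper; the list is a small subset of the post's LAST_ONE_NAMES.
ONE_WORD_SURNAMES = %w[Olazabal Jimenez Singh].freeze

# Returns [first_name, last_name], or nil for names that are not 2 or 3 parts.
def split_name(full_name)
  parts = full_name.gsub(/\(\w{2}\)/, "").delete(",").split(" ")
  return parts if parts.length == 2
  return nil unless parts.length == 3

  if ONE_WORD_SURNAMES.include?(parts[2])
    # Two-part first name, one-part last name.
    ["#{parts[0]} #{parts[1]}", parts[2]]
  else
    # Suffixes, two-word surnames, and unknown names share this split.
    [parts[0], "#{parts[1]} #{parts[2]}"]
  end
end

p split_name("Jose Maria Olazabal")  # prints ["Jose Maria", "Olazabal"]
p split_name("David Berganio Jr.")   # prints ["David", "Berganio Jr."]
p split_name("Davis Love III (PB)")  # prints ["Davis", "Love III"]
```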
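The one piece still stubbed out is flattening the wacky characters (making é = e). One possible approach, sketched here and not what the repo currently does, is to decompose each accented letter with Unicode NFD normalization and then drop the combining marks. This assumes Ruby 2.2 or later for `String#unicode_normalize`, and `flatten_accents` is a made-up name.

```ruby
# Hypothetical helper; assumes Ruby 2.2+ and UTF-8 input.
def flatten_accents(name)
  # NFD splits "é" into "e" plus a combining accent; \p{Mn} matches those marks.
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts flatten_accents("José María Olazábal")   # prints "Jose Maria Olazabal"
puts flatten_accents("Miguel Ángel Jiménez")  # prints "Miguel Angel Jimenez"
```

Letters that have no decomposition, such as ø, pass through unchanged, so they would still need an explicit lookup table.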
