normalizing 3 part names -- initial stake in the ground
...all tests passing... ...collecting all 2008 PGATour data, and more Euro data now...
There are ZERO orphans in the 2008 PGATour data right now. I have collected each and every player's data for 36 tournaments in 2008, including the players with 3 part names.
The Player class is not in its ultimate form, but it is there and it splits names appropriately... it still doesn't flatten special wacky characters and I'm not using any Bayesian techniques yet, but it takes care of the 3 part names accurately: Jose Maria Olazabal, David Berganio Jr., Davis Love III, etc.... I also had to RegEx the course code out of things like "Davis Love III (PB)"... the (PB) indicating the course name.
re = /\(\w{2}\)/
processed = name.gsub(re, "")
...re-scraping about 75 tournaments for PGA and Euro Tour with new names in the next 15 mins... pushed the new code to the carl_spackler GitHub repo.
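A quick way to sanity-check that regex on a couple of scraped strings. This is only a sketch: the `strip_course` helper name and the `COURSE_CODE` constant are made up for illustration and are not part of carl_spackler; only the regex and the `gsub` call come from the post.

```ruby
# Hypothetical helper (not in the carl_spackler repo): strip a two-character
# course code in parens, e.g. "(PB)", from a scraped player name.
COURSE_CODE = /\(\w{2}\)/

def strip_course(name)
  # gsub removes the "(PB)" part; strip drops the space it leaves behind
  name.gsub(COURSE_CODE, "").strip
end

puts strip_course("Davis Love III (PB)")  # prints "Davis Love III"
puts strip_course("Jose Maria Olazabal")  # no course code, printed unchanged
```

Note that `\w{2}` only matches codes of exactly two word characters, so a longer code in parens would be left in place.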
CARL_SPACKLER::Player class:
class Player
  SPECIALS = []
  LAST_ONE_NAMES = ["Olazabal", "Jimenez", "Johnson", "Singh", "Thompson", "Wan", "Hicks"] # for names where last 1 name = lname
  LAST_TWO_NAMES = ["V", "IV", "III", "II", "Jr.", "Jr", "Sr.", "Sr", "Jong", "Pelt", "Broeck"] # for names where last 2 names = lname
  def initialize(scraped_full_name)
    @full_name = scraped_full_name
    @fname = ""
    @lname = ""
    self.parse_clean_name
  end
  attr_reader :fname, :lname, :full_name # lname may include spaces to accommodate "Berganio Jr.", "Love III", etc
  def translate_crazy_name_char(special_char)
    special_char.strip() # really just a stub for now
  end
  def flatten(name)
    # flatten special characters to non-freakish ASCII. E.g. different than straight flatten, make é = e (not e'')
    re = /\(\w{2}\)/
    processed = name.gsub(re, "") # strip out course in parens E.g. Davis Love III (PB)
    processed = processed.gsub(/,/, "") # get rid of commas in name
    processed
  end
  def parse_clean_name
    # take full name and break it apart based on some simple rules
    # later may use Bayesian techniques
    names = flatten(@full_name).split(" ")
    if names.length == 2 # normal
      @fname = flatten(names[0])
      @lname = flatten(names[1])
    elsif names.length == 3
      # default for an untrapped 3 part name that matches neither list:
      # split as if it's LAST_TWO_NAMES
      @fname = flatten(names[0])
      @lname = flatten(names[1]) + " " + flatten(names[2])
      # check if any parts of the scraped_full_name match with CONSTANTS
      names.each do |nm|
        if LAST_ONE_NAMES.include?(nm) # one of the names indicates it's a 3 part name
          @lname = flatten(names[2])
          @fname = flatten(names[0]) + " " + flatten(names[1])
        elsif LAST_TWO_NAMES.include?(nm) # one of the names indicates it's a jr, III name
          @lname = flatten(names[1]) + " " + flatten(names[2])
          @fname = flatten(names[0])
        end
      end
    end
  end
end
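The comment in flatten talks about turning é into e, which the class doesn't do yet. Here is a minimal sketch of one possible way to get there, assuming Ruby 2.2 or later (for `String#unicode_normalize`) and UTF-8 input. The `ascii_flatten` name is hypothetical and this is not what carl_spackler currently does.

```ruby
# Sketch only: one possible way to flatten accented characters to plain ASCII.
# ascii_flatten is a hypothetical name, not a carl_spackler method.
def ascii_flatten(name)
  # NFD decomposition splits "é" into "e" plus a combining accent mark;
  # \p{Mn} matches those combining marks so gsub can drop them.
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts ascii_flatten("José María Olazábal")   # prints "Jose Maria Olazabal"
puts ascii_flatten("Miguel Ángel Jiménez")  # prints "Miguel Angel Jimenez"
```

Letters that have no decomposition (ø, for example) pass through untouched, so a small lookup table would probably still be needed for those.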