Hello Experts,
I am a summer intern, and one of my projects is to analyze the
customer base of my company. I've got a huge list of clients including
names, emails, company names, and addresses. This list is fairly
accurate, but not perfect. For example, Microsoft could be listed as
"Microsoft", "Microsoft, Inc." or "Microsoft Corporation". Also, some
customers have used a free e-mail account (Yahoo, Gmail, ...) instead of
the company domain.
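To illustrate the name-variant problem, here is a minimal Python sketch that collapses spellings such as "Microsoft, Inc." and "Microsoft Corporation" to one key. The function name and the suffix list are assumptions made for illustration, and the list is certainly incomplete.

```python
import re

# Hypothetical, incomplete list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "co", "company", "llc", "ltd"}

def normalize_company(name: str) -> str:
    """Lowercase the name, drop punctuation, and strip trailing legal suffixes."""
    # Replace anything that is not a letter, digit, or space with a space.
    words = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    # Keep at least one word so a name like "Company" does not become empty.
    while len(words) > 1 and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)
```

With this, all three spellings of Microsoft above reduce to the same string, so records can be grouped before any lookup is attempted.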
I've been asked to individually look up the industry and employee
count for about 6,000 of the most recent customers. However, I'd
really like to impress them by categorizing as many as I can, and
I've got over 35,000 records!
For each record, I'd like to place it in one of these categories:
- Education
- Government
- Military
- Non-Profit
- Business with 1-100 Employees
- Business with 101-500 Employees
- Business with 501-1000 Employees
- Business with 1001-10000 Employees
- Business with 10000+ Employees
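The category list can be expressed as a small mapping function. This is only a sketch: the `org_type` labels are hypothetical values that some earlier lookup step would have to produce, and the "Unknown" fallback is an assumption, since the list has no bucket for records that cannot be resolved. The list puts exactly 10,000 employees in both of the last two buckets; the sketch assigns it to "1001-10000".

```python
from typing import Optional

# Upper bounds (inclusive) for the business size buckets, in ascending order.
SIZE_BUCKETS = [
    (100, "Business with 1-100 Employees"),
    (500, "Business with 101-500 Employees"),
    (1000, "Business with 501-1000 Employees"),
    (10000, "Business with 1001-10000 Employees"),
]

# Hypothetical organization-type labels mapped to the non-business categories.
SPECIAL_TYPES = {
    "education": "Education",
    "government": "Government",
    "military": "Military",
    "non-profit": "Non-Profit",
}

def categorize(org_type: str, employees: Optional[int]) -> str:
    """Return one of the nine categories, or "Unknown" if undetermined."""
    if org_type in SPECIAL_TYPES:
        return SPECIAL_TYPES[org_type]
    if employees is None or employees < 1:
        return "Unknown"  # assumption: not one of the listed categories
    for upper_bound, label in SIZE_BUCKETS:
        if employees <= upper_bound:
            return label
    return "Business with 10000+ Employees"
```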
This categorization will be used to better understand our customer
base. We will NOT be using this data for any type of spam, and it will
not be resold (this is an ethical project). Because I respect the
privacy of the customers, I cannot provide the raw data. Please
assume it's a long CSV file like this:
"Gates,Bill","Microsoft, inc"," One Microsoft Way","Redmond","WA"," 98052","jsmith@microsoft.com"
"Summers,Lawrence ","Harvard"," Massachusetts Hall"," Cambridge","MA"," 98052","lsummers@harvard.edu"
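A file in this shape could be read with Python's standard `csv` module, which copes with commas inside quoted fields such as "Gates,Bill". The sketch below is a rough illustration: the column names are assumptions (the sample has no header row), and the stray spaces inside the quoted fields are stripped.

```python
import csv
import io

# Assumed column names; the sample file has no header row.
FIELDS = ["name", "company", "street", "city", "state", "zip", "email"]

SAMPLE = (
    '"Gates,Bill","Microsoft, inc"," One Microsoft Way","Redmond","WA",'
    '" 98052","jsmith@microsoft.com"\n'
    '"Summers,Lawrence ","Harvard"," Massachusetts Hall"," Cambridge","MA",'
    '" 98052","lsummers@harvard.edu"\n'
)

def read_records(handle):
    """Yield one dict per non-empty row, with whitespace stripped from each field."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        yield dict(zip(FIELDS, (field.strip() for field in row)))

records = list(read_records(io.StringIO(SAMPLE)))
```

For the real file, the `io.StringIO(SAMPLE)` argument would be replaced by something like `open(path, newline="", encoding="utf-8")`, where the encoding is a guess.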
So here is the challenge: find an elegant way to automate the
process of categorizing these records. Write some sort of script that
can go through each record and query an online source, such as Hoovers,
Google Finance, or whatever other source you can find. It should find a
match and return an employee count and a categorization from the list above.
The script should be able to cope when a company is unlisted or
the name is slightly off. Google Finance can find the correct
company most of the time, even if the name is a variant.
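One way to tolerate slightly-off names without relying on any particular online source is approximate string matching against names already resolved. The sketch below uses `difflib` from the Python standard library. The reference table is hypothetical: the company names are fictional and the counts are placeholders, standing in for whatever data the chosen source returns. The 0.8 cutoff is likewise an arbitrary starting point.

```python
import difflib
from typing import Optional

# Hypothetical reference table: fictional names, placeholder employee counts.
KNOWN_COMPANIES = {
    "acme widgets": 250,
    "globex industries": 12000,
}

def best_match(name: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known company name, or None if nothing is close enough."""
    candidates = difflib.get_close_matches(
        name.lower().strip(), list(KNOWN_COMPANIES), n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None

def employee_count(name: str) -> Optional[int]:
    """Look up the employee count via the closest match; None means unlisted."""
    match = best_match(name)
    return KNOWN_COMPANIES[match] if match is not None else None
```

Returning `None` for unlisted companies keeps them separate from real matches, so they can be reviewed by hand rather than guessed.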
Obviously this categorization will not be perfect, but try to keep the
margin of error as low as possible. Simple checks should be included, such
as not categorizing every person with a @yahoo.com email address as a Yahoo
employee. Feel free to ask any questions.
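The free e-mail check might look like the following Python sketch. The provider list is illustrative and incomplete, and treating `.edu`, `.gov`, and `.mil` addresses as hints for the Education, Government, and Military categories is an assumption that mainly holds for US organizations.

```python
from typing import Optional

# Illustrative, incomplete set of free e-mail providers.
FREE_EMAIL_DOMAINS = {"yahoo.com", "gmail.com", "hotmail.com", "aol.com"}

# Assumed mapping from top-level domain to category (US-centric).
TLD_HINTS = {".edu": "Education", ".gov": "Government", ".mil": "Military"}

def email_domain(email: str) -> str:
    """Return the lowercased domain part, or "" if there is no "@"."""
    if "@" not in email:
        return ""
    return email.rpartition("@")[2].strip().lower()

def is_free_email(email: str) -> bool:
    """True if the address belongs to a known free provider."""
    return email_domain(email) in FREE_EMAIL_DOMAINS

def domain_hint(email: str) -> Optional[str]:
    """Return a category suggested by the domain, or None if it says nothing."""
    domain = email_domain(email)
    if not domain or domain in FREE_EMAIL_DOMAINS:
        return None  # a free address tells us nothing about the employer
    for suffix, category in TLD_HINTS.items():
        if domain.endswith(suffix):
            return category
    return None
```

Records with a free address would then fall back on the company name field alone.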