Q: Number of non-unique first+last names in US population? ( No Answer,   1 Comment )
Question  
Subject: Number of non-unique first+last names in US population?
Category: Reference, Education and News > General Reference
Asked by: clay_shirky-ga
List Price: $75.00
Posted: 27 Jul 2004 07:02 PDT
Expires: 26 Aug 2004 07:02 PDT
Question ID: 379621
Out of the US population, what percentage of the population has
non-unique names (e.g. John Smith, Susan Moon, Sanjay Patel)?
Alternatively, what percentage of the population has a first/lastname
combination that is unique within the US population? (e.g. Clay
Shirky.)

What is the commonest combination? How many people have the commonest combination?

Clarification of Question by clay_shirky-ga on 28 Jul 2004 04:15 PDT
I should be clear that I'm only interested in first+last name
combinations, no other characteristics, whether middle name or social
security, etc.

The core question is "What percentage of people in the US have a
first+last name combination that no one else shares?"

Request for Question Clarification by pafalafa-ga on 29 Jul 2004 19:05 PDT
Hello clay_shirky-ga,

I've been mulling over this question for a while because it's an
interesting challenge.  So far, though, I can't really get my hands
around a way to approach it.

In theory, one of the businesses like a phone company or large mailing
house that has access to super-large databases of names in the US
(some of them have over 200 million records!) could run a frequency
distribution and produce an answer.  I doubt any of them have the
incentive to do so, however.

But...your friendly datacrunchers at the Census Bureau once created
exactly this sort of frequency distribution for last names.  This
list, which is here:


http://www.census.gov/genealogy/names/dist.all.last



at least sheds a little light on the situation.  

The first name on this list -- no surprise -- is Smith, the most
common surname in America, shared by 1.006% of all Americans (at
least, at the time of the Census on which this was based).  Johnson,
Williams, Jones, Brown, Davis, Miller, and Wilson are next in line.  The 8
surnames cumulatively account for 5% of all Americans.

If you think about it, there are going to be very, very few unique names
included at the top of the surname list.  Sure, there's probably a
Cougat Johnson out there, or an ESPN Smith, and some other unique
names, but these are the rare exceptions.

In fact, I'd say for the top 50% of names -- where you're still
dealing with pretty familiar names, like Siegel, Clinton, Kraft,
Kauffman, etc -- there's still likely to be only a tiny percentage of
unique names here.


Way down at the bottom of the list -- where you get the Aalunds and
Aalderlinks -- the reverse is likely true.  I would guess that a very
high proportion of these names are unique names.  There might be five
or ten John Aalunds floating around, but I don't think you'll find
too many Clay Aalunds.

By the way, Shirkey (with an "e") is number 24,219 on the list.
 

The 88,799 names on the Census list account for about 90% of the
population.  That means the remaining 10% have surnames even rarer
than Aalund and Aalderlink!

I would guess, based on this, that maybe 2-5% of Americans have a
totally unique name...a higher number than I would have thought.

I'd be interested to hear what you or my fellow researchers think
about this guesstimate.

Cheers.

pafalafa-ga

Clarification of Question by clay_shirky-ga on 30 Jul 2004 04:28 PDT
Pafalafa,

Your intuition about large private databases is, I think, right. Don't
know if you speak unix, but for a notional file of everyone's name, I
want to take the national whitepages and do

sort whitepages | uniq -c | grep -E '^ *1 '

which would pull out every unique name. The population minus that
number would be the non-unique names.
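
As a rough sketch of the same counting idea outside the shell, the snippet below tallies full-name strings and reports how many people hold a unique name, how many hold a shared one, and the commonest combination. The function name `name_uniqueness`, the sample list, and the choice to ignore case and surrounding whitespace are all assumptions made for illustration; no real whitepages data is involved.

```python
from collections import Counter

def name_uniqueness(names):
    """Return (unique_people, non_unique_people, most_common_name_and_count).

    `names` is any iterable of "First Last" strings, one per person.
    Assumption: names are compared case-insensitively after trimming.
    """
    counts = Counter(n.strip().lower() for n in names if n.strip())
    total = sum(counts.values())
    # People whose first+last combination appears exactly once.
    unique = sum(1 for c in counts.values() if c == 1)
    most_common = counts.most_common(1)[0] if counts else None
    return unique, total - unique, most_common

# Hypothetical five-person "population".
sample = ["John Smith", "Clay Shirky", "john smith", "Susan Moon", "John Smith"]
unique, non_unique, top = name_uniqueness(sample)
print(unique, non_unique, top)
```

On a real file one would feed the function one name per line; for hundreds of millions of records, the sort-based pipeline above is likely to be gentler on memory.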

The problem with the Census Bureau's dist.all.last list is that names
are combinatorial -- Smith is 1%, but "John Smith" is some other, much
rarer frequency. The second problem is that it goes most to least,
whereas what I am interested in is least to most. When the second
person named Jamal Q. Public shows up, I lose interest in that name.

Don't think it matters for the nuts and bolts of the query, but here
is the framing problem: I am working on unique identifiers for
hospital patients, and am trying to make the point that as patient
record systems grow, the number of namespace clashes grows with it,
with a theoretical upper limit of population-scale overlap of names.
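
One way to put numbers on that growth, offered only as a sketch: if a record system is treated as a uniform random sample of n people drawn from a population of N, the expected share of records that clash with another record in the same system can be computed from how many people hold each name. The function `expected_clash_fraction` and the toy name counts below are hypothetical, and the uniform-sampling assumption is a simplification (real hospital catchments are regional and far from random).

```python
from math import comb

def expected_clash_fraction(name_counts, n):
    """Expected fraction of n sampled records that share a name with
    at least one other sampled record.

    name_counts: number of people holding each distinct name in the
                 whole population (assumed known; hypothetical here).
    n:           number of records, drawn uniformly without replacement.
    """
    N = sum(name_counts)
    if not 1 <= n <= N:
        raise ValueError("n must be between 1 and the population size")
    # Given one record is in the sample, the other n-1 come from N-1 people.
    denom = comb(N - 1, n - 1)
    frac = 0.0
    for k in name_counts:
        # Chance that none of the other k-1 holders of this name is sampled.
        p_alone = comb(N - k, n - 1) / denom
        frac += (k / N) * (1 - p_alone)
    return frac

# Toy population of 100 people: five shared names and three unique ones.
toy_counts = [50, 30, 10, 5, 2, 1, 1, 1]
for n in (1, 10, 50, 100):
    print(n, round(expected_clash_fraction(toy_counts, n), 3))
```

At n equal to the whole population the result is simply the population's non-unique share, which is the theoretical upper limit described above.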

At which point I asked myself "Wonder if I could put a number on that
theoretical limit?" (Though now it's become a matter of personal
interest as well.)

So an alternate way to answer the question would be "What is the
largest collection of f+l names in the US available for download or
purchase." A much less glamorous approach to the problem, but possibly
workable...
Answer  
There is no answer at this time.

Comments  
Subject: Re: Number of non-unique first+last names in US population?
From: neilzero-ga on 27 Jul 2004 23:41 PDT
 
The last part is likely easy. Likely John Smith. The title question is
difficult. Likely about 330 million as some non-unique names will
occur only twice. The First question can be calculated from the title
question = about 90%. If I am thinking clearly 10% is correct for
unique, but that seems too high.
 Keep in mind that most records in the USA use first, last AND middle
initial and a unique social security number, and an almost unique
address and telephone number. It may be helpful to tell how you will
use the answers.   Neil
