Google Answers: Point me to Research on This Type of Database Problem

Q: Point me to Research on This Type of Database Problem ( No Answer, 5 Comments )

Question

Subject: Point me to Research on This Type of Database Problem
Category: Computers
Asked by: petroglyph_1-ga
List Price: $10.00

Posted: 07 Mar 2005 16:57 PST
Expires: 06 Apr 2005 17:57 PDT
Question ID: 486447

I am looking for research on the following type of (database) problem
- data used outside the domain under which it was created.

For example, take a company which uses social security number as their
"employee identification number" in their human resources database.
This might work initially (as all employees may have SS#'s), but
companies will run into problems when they hire people w/o social
security numbers - e.g. foreign employees.

The reason they run into problems (I believe) is that social security
number is defined over a particular domain (the set of all individuals
issued numbers by the US social security administration) which may
overlap significantly with, but is distinct from, the domain of
"employees of company X".
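
A minimal Python sketch of this mismatch, assuming made-up names and numbers (the `ssn_holders` and `employees` collections and the `employee_id` helper are hypothetical, not taken from any real system). It treats the SSN as a mapping defined only on people who were issued one, then looks for employees outside that domain:

```python
# Hypothetical data for illustration only; the numbers are not real SSNs.
ssn_holders = {"alice": "000-00-0001", "bob": "000-00-0002"}  # domain: people issued an SSN
employees = {"alice", "bob", "chen"}                          # domain: employees of company X

def employee_id(name):
    """Return the SSN-based ID, or None when the person is outside the SSN domain."""
    return ssn_holders.get(name)

# Employees whom the SSN-based key cannot identify at all.
uncovered = {name for name in employees if employee_id(name) is None}
print(uncovered)  # {'chen'}
```

The key is only a partial function on the set of employees, which is one way to state the problem precisely.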

I believe that this type of problem can be generalized -- whenever you
use data outside of the domain under which it was strictly defined,
you will run into problems. (Thus, I am not looking for a solution to
the SS# example I gave -- it is just an illustration of a larger
problem).

I am looking for research that talks about this type of issue (cross
domain usage of data), the problems caused by using data in this way,
and perhaps a set of principles for addressing and avoiding such
issues.

The answer doesn't have to be limited to database theory -- other
areas of math or CS or science or logic or even linguistics might be
useful.  Just something which addresses this issue head on.

Request for Question Clarification by pafalafa-ga on 07 Mar 2005 17:47 PST

This sounds -- a bit, at least -- like the whole Y2K issue, on which
there is tons of research material.

Does Y2K fit your definition of "cross domain usage"?

pafalafa-ga

Clarification of Question by petroglyph_1-ga on 08 Mar 2005 15:38 PST

Thanks for your clarification question.

The Y2K issue doesn't appear to me to be a good path to follow.  I see
that it is connected in the following loose sense - two digit years
were being used to represent four digit years -- a smaller domain
representing a larger.  We then ran into problems when we reached
the boundary of the smaller domain.  However, it seems less relevant
because more than likely the Y2K literature is not about the
*theoretical* reasons under say, set theory, why using data from one
domain in another domain is a bad idea.  Rather, the Y2K literature
likely discusses the problem on a practical level, and discusses
strategies for dealing/migrating software and data.  I am looking for
a more theoretical treatment of the issue -- an explanation at the
general level.  However, if you happen to know of a Y2K piece that
explores this theoretical issue, I would be quite interested.
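
To make the "smaller domain representing a larger" point concrete, here is a small Python sketch. The `to_two_digit` helper is a hypothetical stand-in for how legacy systems stored years, not code from any particular system. It shows that the mapping is not one-to-one and does not preserve ordering:

```python
def to_two_digit(year):
    """Store only the last two digits of a four-digit year."""
    return year % 100

# Two different years collapse to the same stored value.
print(to_two_digit(1905) == to_two_digit(2005))  # True

# Ordering breaks at the boundary of the smaller domain.
print(to_two_digit(1999) < to_two_digit(2000))   # False
```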

Answer

There is no answer at this time.

Comments

Subject: Re: Point me to Research on This Type of Database Problem
From: frde-ga on 08 Mar 2005 03:33 PST

It's the old problem

People are always tempted to make primary keys 'mean something'

[DEPT] [ASSET TYPE] [YEAR] [SEQ]
  99       99         99 :)  9999

Oops - we just got 100 departments  ( 00 is assumed invalid )

If people do not 'import' problems, they build their own.
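
A rough Python sketch of the key layout above, assuming fixed widths of 2, 2, 2 and 4 digits (the `make_asset_key` function is hypothetical, and it does not enforce the "00 is invalid" rule). It fails as soon as a value outgrows its field:

```python
def make_asset_key(dept, asset_type, year, seq):
    """Build a DEPT / ASSET TYPE / YEAR / SEQ key with fixed-width fields."""
    fields = [(dept, 2), (asset_type, 2), (year % 100, 2), (seq, 4)]
    for value, width in fields:
        # Each field can only hold values that fit in its digit count.
        if not 0 <= value < 10 ** width:
            raise ValueError(f"{value} does not fit in {width} digits")
    return "".join(f"{value:0{width}d}" for value, width in fields)

print(make_asset_key(99, 1, 2005, 42))  # 9901050042

try:
    make_asset_key(100, 1, 2005, 42)    # the 100th department
except ValueError as err:
    print(err)  # 100 does not fit in 2 digits
```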

Subject: Re: Point me to Research on This Type of Database Problem
From: connectroy-ga on 09 Mar 2005 10:22 PST

What you need is a surrogate key.  Most of the time, programs will use a
one-up number that can go to 2 billion or larger.  The SSN would be a
good field to search on but you can also search on phone number or
something else (these fields would be indexed of course).

Going across domains or systems, you could have another table that
"maps" which key field is used for which domain.  Then if a generic
number is entered, the program can go through the mapping to see if
there are any matches.

Researching the use of surrogate keys and normalization of tables in
relational databases should help.
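
As a rough sketch of the idea, assuming Python's built-in `sqlite3` module and made-up table and column names (`employee`, `external_id`, `id_domain` and the two helper functions are all hypothetical), the surrogate key and the mapping table could look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, carries no meaning
        name        TEXT NOT NULL
    );
    CREATE TABLE external_id (
        employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
        id_domain   TEXT NOT NULL,   -- which domain the value comes from, e.g. 'ssn'
        id_value    TEXT NOT NULL,
        UNIQUE (id_domain, id_value)
    );
    CREATE INDEX external_id_value ON external_id (id_value);
""")

def add_employee(name, **ids):
    """Insert an employee plus any number of domain-specific identifiers."""
    cur = conn.execute("INSERT INTO employee (name) VALUES (?)", (name,))
    emp_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO external_id VALUES (?, ?, ?)",
        [(emp_id, domain, value) for domain, value in ids.items()],
    )
    return emp_id

def find_by_generic_number(value):
    """Look a number up across every domain in the mapping table."""
    rows = conn.execute(
        "SELECT e.employee_id, e.name, x.id_domain "
        "FROM employee e JOIN external_id x ON e.employee_id = x.employee_id "
        "WHERE x.id_value = ? ORDER BY e.employee_id",
        (value,),
    )
    return rows.fetchall()

add_employee("Alice", ssn="000-00-0001", phone="555-0100")
add_employee("Chen", passport="X0000000")   # no SSN, still gets a key
print(find_by_generic_number("X0000000"))   # [(2, 'Chen', 'passport')]
```

An employee without an SSN simply has no 'ssn' row; the primary key is unaffected.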

Thanks,
Roy

Subject: Re: Point me to Research on This Type of Database Problem
From: curious_guy-ga on 10 Mar 2005 21:18 PST

I don't know how my answer helps. But just a thought.

Why can't we use a CRC (cyclic redundancy check)? A CRC is basically a
small piece of data (a number) that represents a larger piece of data
(such as the files in your directory). Although two files might
produce the same CRC, the probability is very small according to my
understanding, so CRCs can be considered somewhat unique. For example,
for the employee data in the company, CRCs could be created from age,
sex and name or any other field. I am not sure whether this is the
best idea, but it is a thought. Here is a link about CRCs which I came
across recently:

http://www.dogma.net/markn/articles/crcman/crcman.htm
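
How small "very small" is depends on the number of records. A back-of-the-envelope Python sketch, assuming a 32-bit CRC whose values are spread uniformly (an idealization) and using the standard birthday approximation; the `collision_probability` function is a hypothetical helper:

```python
import math

def collision_probability(n_records, bits=32):
    """Birthday-bound estimate of P(at least two of n records share a checksum)."""
    pairs = n_records * (n_records - 1) / 2
    return 1.0 - math.exp(-pairs / 2 ** bits)

for n in (1_000, 100_000):
    print(n, round(collision_probability(n), 4))
```

Under these assumptions a collision is unlikely among a thousand records but more likely than not among a hundred thousand, so a CRC alone would not be safe as a unique key.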

Subject: Re: Point me to Research on This Type of Database Problem
From: curious_guy-ga on 11 Mar 2005 02:38 PST

2. CRCs can be created using age, sex and
name or any other field.
--> There are some problems with this, since age can change, while sex
mostly remains the same. Because age changes each year, your CRC will
be affected.
Probably some other field which remains fixed should be used instead
when calculating the CRC.
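
A quick Python sketch of the point, using the standard library's `zlib.crc32` (the `record_crc` helper, the "|" separator and the sample record are made up for illustration):

```python
import zlib

def record_crc(*fields):
    """CRC-32 over the fields joined with a separator."""
    data = "|".join(str(f) for f in fields).encode("utf-8")
    return zlib.crc32(data)

# Using age: the checksum changes after a birthday.
with_age_2005 = record_crc("Pat Doe", "F", 41)
with_age_2006 = record_crc("Pat Doe", "F", 42)

# Using a fixed field (date of birth): the checksum stays the same.
stable_2005 = record_crc("Pat Doe", "F", "1964-02-29")
stable_2006 = record_crc("Pat Doe", "F", "1964-02-29")

print(with_age_2005 == with_age_2006)  # False
print(stable_2005 == stable_2006)      # True
```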

Subject: Re: Point me to Research on This Type of Database Problem
From: gozzy-ga on 12 Mar 2005 09:34 PST

I had the opportunity to listen to a talk recently from a prospective
faculty candidate and they discussed semantically heterogeneous data
sources (distributed data sources in this case). You can view her
research (including dissertation) from her website at
http://www.cs.iastate.edu/~dcaragea/ or do some searching on IEEE or
ACM's websites (though you'll have to be a subscribing member or pay
for full-text articles if you find anything there). If you find an
article that's somewhat recent, it might be good to type the title
into google -- many CS people publishing will post the PDF/PS/etc.
online -- see if citeseer comes up as one of the results (since they
often have links to the downloadable full text)

Below is the abstract of the talk I sat in on. Although her research
focused on machine learning and learning of classifiers, I believe the
main concepts of semantically heterogeneous data will be of interest
to you.

Learning classifiers from distributed and semantically heterogeneous data sources 
Speaker: Dr. Doina Caragea

Abstract: In many real world applications data sources of interest are
typically distributed and semantically heterogeneous, making it
impossible to use traditional machine learning algorithms for
knowledge acquisition. However, we observe that most of the learning
algorithms use only certain statistics computed from data in the
process of generating the hypothesis that they output. We use this
observation to design a general strategy for transforming traditional
algorithms for learning from data into algorithms for learning from
distributed data. The resulting algorithms are provably exact in that
the classifiers produced by them are identical to those obtained by
the corresponding algorithms in the centralized setting (i.e., when
all of the data are available at a central location). They also
compare favorably to their centralized counterparts in terms of time
and communication complexity. To deal with the semantical
heterogeneity problem, we introduce ontology-extended data sources and
define a user perspective consisting of an ontology and a set of
interoperation constraints between data source ontologies and the user
ontology. These constraints can be used to define mappings needed to
answer statistical queries from semantically heterogeneous data viewed
from a certain user perspective. The answers to such queries are
further used to extend our approach to learning from distributed data
into a theoretically sound approach to learning from semantically
heterogeneous data.
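
A toy Python sketch of the two ideas in the abstract, not the speaker's actual algorithm: the sites, the vocabulary mapping and the `count_statistics` helper are all made up. Each site reports only counts (the statistics a simple learner such as naive Bayes needs), one site's terms are first mapped into the user's vocabulary, and the combined counts match what a centralized computation would give:

```python
from collections import Counter

def count_statistics(rows):
    """Count (feature_value, label) pairs, the statistics a naive Bayes learner needs."""
    return Counter(rows)

# Hypothetical data sources; site B uses different terms for the same concepts.
site_a = [("sunny", "yes"), ("rainy", "no")]
site_b = [("clear", "yes"), ("clear", "no"), ("wet", "no")]

# Hypothetical mapping from site B's vocabulary to the user's vocabulary.
b_to_user = {"clear": "sunny", "wet": "rainy"}
site_b_mapped = [(b_to_user[value], label) for value, label in site_b]

# Combine per-site statistics without moving the raw rows.
distributed = count_statistics(site_a) + count_statistics(site_b_mapped)

# Reference: the same statistics computed with all rows in one place.
centralized = count_statistics(site_a + site_b_mapped)

print(distributed == centralized)      # True
print(distributed[("sunny", "yes")])   # 2
```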
