Subject: Point me to Research on This Type of Database Problem
Category: Computers
Asked by: petroglyph_1-ga
List Price: $10.00
Posted: 07 Mar 2005 16:57 PST
Expires: 06 Apr 2005 17:57 PDT
Question ID: 486447
I am looking for research on the following type of (database) problem: data used outside the domain under which it was created.

For example, take a company which uses the Social Security number as its "employee identification number" in its human resources database. This might work initially (as all employees may have SSNs), but the company will run into problems when it hires people without Social Security numbers, e.g. foreign employees. The reason it runs into problems (I believe) is that the Social Security number is defined over a particular domain (the set of all individuals issued numbers by the US Social Security Administration) which may overlap significantly with, but is distinct from, the domain of "employees of company X".

I believe that this type of problem can be generalized: whenever you use data outside of the domain under which it was strictly defined, you will run into problems. (Thus, I am not looking for a solution to the SSN example I gave; it is just an illustration of a larger problem.)

I am looking for research that talks about this type of issue (cross-domain usage of data), the problems caused by using data in this way, and perhaps a set of principles for addressing and avoiding such issues. The answer doesn't have to be limited to database theory: other areas of math or CS or science or logic or even linguistics might be useful. Just something which addresses this issue head on.
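As a concrete illustration of the failure mode the question describes, here is a minimal sketch using Python's sqlite3 module. The table, column names, and data are invented for illustration, not taken from the question.

```python
# A hypothetical HR table that borrows the SSN as its employee identifier.
# It works while every employee happens to have an SSN, and breaks the moment
# one does not -- the column was defined over the SSA's domain, not over
# "employees of company X".
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        ssn  TEXT NOT NULL PRIMARY KEY,  -- identifier borrowed from another domain
        name TEXT NOT NULL
    )
""")

# Fine for a domestic hire who has an SSN...
conn.execute("INSERT INTO employees VALUES ('123-45-6789', 'Alice')")

# ...but a foreign hire with no SSN cannot be stored at all.
try:
    conn.execute("INSERT INTO employees VALUES (NULL, 'Bjarne')")
except sqlite3.IntegrityError as e:
    print("cannot store employee without an SSN:", e)
```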
There is no answer at this time.
Subject: Re: Point me to Research on This Type of Database Problem
From: frde-ga on 08 Mar 2005 03:33 PST
It's the old problem: people are always tempted to make primary keys 'mean something'.

    [DEPT] [ASSET TYPE] [YEAR] [SEQ]
      99        99        99    9999   :)

Oops - we just got 100 departments (00 is assumed invalid).

If people do not 'import' problems, they build their own.
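A small sketch of the trap frde-ga describes, in Python; the field widths follow the [DEPT][ASSET TYPE][YEAR][SEQ] layout above, and the function name is purely illustrative.

```python
# A "meaningful" fixed-width key: 2-digit department, 2-digit asset type,
# 2-digit year, 4-digit sequence. It silently encodes assumptions about the
# domain, and breaks as soon as the real world outgrows any field.
def make_asset_key(dept: int, asset_type: int, year: int, seq: int) -> str:
    for name, value, width in [("dept", dept, 2), ("asset_type", asset_type, 2),
                               ("year", year, 2), ("seq", seq, 4)]:
        if not 0 <= value < 10 ** width:
            raise ValueError(f"{name}={value} does not fit in {width} digits")
    return f"{dept:02d}{asset_type:02d}{year:02d}{seq:04d}"

print(make_asset_key(7, 12, 5, 42))        # '0712050042' - fine today
try:
    print(make_asset_key(100, 12, 5, 42))  # the 100th department
except ValueError as e:
    print(e)                               # the key scheme just broke
```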
Subject: Re: Point me to Research on This Type of Database Problem
From: connectroy-ga on 09 Mar 2005 10:22 PST
What you need is a surrogate key. Most of the time, programs will use a one-up number that can go to 2 billion or larger. The SSN would be a good field to search on, but you can also search on phone number or something else (these fields would be indexed, of course). Going across domains or systems, you could have another table that "maps" which key field is used for which domain. Then, if a generic number is entered, the program can go through the mapping to see if there are any matches. Researching the use of surrogate keys and the normalization of tables in relational databases should help.

Thanks, Roy
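A minimal sketch of Roy's suggestion, again in Python with sqlite3. The schema, the id_mappings table, and the find_employee helper are assumptions made for illustration, not something taken from the post.

```python
# Employees get a surrogate key (an autoincrementing integer with no meaning
# outside this database). External identifiers such as SSN or phone live in
# ordinary, nullable, indexed columns, and a small mapping table records which
# identifier each domain/system uses to look people up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        emp_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        name   TEXT NOT NULL,
        ssn    TEXT,                               -- optional external identifier
        phone  TEXT
    );
    CREATE INDEX idx_employees_ssn   ON employees(ssn);
    CREATE INDEX idx_employees_phone ON employees(phone);

    CREATE TABLE id_mappings (
        domain    TEXT PRIMARY KEY,  -- e.g. 'us_payroll'
        id_column TEXT NOT NULL      -- which employees column that domain searches on
    );
""")
conn.execute("INSERT INTO employees (name, ssn) VALUES (?, ?)", ("Alice", "123-45-6789"))
conn.execute("INSERT INTO employees (name, phone) VALUES (?, ?)", ("Bjarne", "+45 5555 0100"))
conn.execute("INSERT INTO id_mappings VALUES ('us_payroll', 'ssn')")
conn.execute("INSERT INTO id_mappings VALUES ('phone_directory', 'phone')")

def find_employee(domain: str, external_id: str):
    """Resolve an external identifier to the surrogate key via the mapping table."""
    (column,) = conn.execute(
        "SELECT id_column FROM id_mappings WHERE domain = ?", (domain,)).fetchone()
    # column names come from our own mapping table; real code would whitelist them
    return conn.execute(
        f"SELECT emp_id, name FROM employees WHERE {column} = ?", (external_id,)).fetchone()

print(find_employee("us_payroll", "123-45-6789"))      # (1, 'Alice')
print(find_employee("phone_directory", "+45 5555 0100"))  # (2, 'Bjarne')
```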
Subject: Re: Point me to Research on This Type of Database Problem
From: curious_guy-ga on 10 Mar 2005 21:18 PST
I don't know how much my answer helps, but just a thought: why can't we use a CRC (cyclic redundancy check)? A CRC is basically a smaller piece of data (a number) used to represent a larger piece of data (e.g. the files in your directory). Though there is a possibility that two files might produce the same CRC, the probability is very small according to my understanding, so they can be considered somewhat unique. For example, consider the employee data in the company: CRCs can be created using age, sex and name, or any other fields. I am not sure whether this is the best idea, but it's a thought. Here is a link about CRCs which I came across recently: http://www.dogma.net/markn/articles/crcman/crcman.htm
Subject: Re: Point me to Research on This Type of Database Problem
From: curious_guy-ga on 11 Mar 2005 02:38 PST
2. "CRCs can be created using age, sex and name or any other field" -> There are some problems with this, since age can change (sex mostly remains the same). In this case, since age changes each year, your CRC will be affected. Probably some other field which remains fixed should be taken into account when calculating the CRC.
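A rough sketch of the CRC idea from these two comments, using Python's zlib.crc32 and restricted to fields that do not change (name, sex, and date of birth rather than age). The helper name and field choice are assumptions for illustration; the caveat from the comments still applies, since CRC-32 has only 2^32 possible values and collisions are unlikely but not impossible, which makes it a weak choice for a guaranteed-unique key.

```python
# Derive a short identifier from stable employee fields with CRC-32.
import zlib

def record_crc(name: str, sex: str, date_of_birth: str) -> int:
    """CRC-32 over a canonical concatenation of fields that should never change."""
    payload = "|".join([name.strip().lower(), sex.strip().lower(), date_of_birth])
    return zlib.crc32(payload.encode("utf-8"))

print(f"{record_crc('Alice Example', 'F', '1970-01-01'):08x}")
# The same inputs always give the same value, but two different employees
# *can* collide, so this is a fingerprint rather than a true primary key.
```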
Subject: Re: Point me to Research on This Type of Database Problem
From: gozzy-ga on 12 Mar 2005 09:34 PST
I had the opportunity to listen to a talk recently by a prospective faculty candidate, who discussed semantically heterogeneous data sources (distributed data sources in this case). You can view her research (including her dissertation) on her website at http://www.cs.iastate.edu/~dcaragea/ or do some searching on IEEE's or ACM's websites (though you'll have to be a subscribing member or pay for full-text articles if you find anything there). If you find an article that's somewhat recent, it might be good to type the title into Google; many CS people post the PDF/PS/etc. of their publications online. Also see if CiteSeer comes up as one of the results, since it often has links to the downloadable full text.

Below is the abstract of the talk I sat in on. Although her research focuses on machine learning and the learning of classifiers, I believe the main concepts of semantically heterogeneous data will be of interest to you.

Title: Learning classifiers from distributed and semantically heterogeneous data sources
Speaker: Dr. Doina Caragea

Abstract: In many real world applications data sources of interest are typically distributed and semantically heterogeneous, making it impossible to use traditional machine learning algorithms for knowledge acquisition. However, we observe that most of the learning algorithms use only certain statistics computed from data in the process of generating the hypothesis that they output. We use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact in that the classifiers produced by them are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data are available at a central location). They also compare favorably to their centralized counterparts in terms of time and communication complexity. To deal with the semantical heterogeneity problem, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between data source ontologies and the user ontology. These constraints can be used to define mappings needed to answer statistical queries from semantically heterogeneous data viewed from a certain user perspective. The answers to such queries are further used to extend our approach to learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data.
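For a feel of the "user perspective" idea mentioned in the abstract, here is a toy Python sketch in which two sources name and code the same attribute differently, and a small set of mappings lets a statistical (count) query be answered in the user's vocabulary. This is only a loose reading of the abstract, not Dr. Caragea's actual framework or code; all names and data are invented.

```python
# Two semantically heterogeneous sources describing the same kind of record.
source_a = [{"employment_status": "full-time"}, {"employment_status": "part-time"}]
source_b = [{"emp_type": "FT"}, {"emp_type": "FT"}]

# Interoperation constraints: map each source's attribute name and values
# onto the user's ontology ("FullTime" / "PartTime").
mappings = {
    "source_a": {"attr": "employment_status",
                 "values": {"full-time": "FullTime", "part-time": "PartTime"}},
    "source_b": {"attr": "emp_type",
                 "values": {"FT": "FullTime", "PT": "PartTime"}},
}

def count_by_user_term(sources: dict, user_term: str) -> int:
    """Answer a count query posed in the user's vocabulary across all sources."""
    total = 0
    for name, rows in sources.items():
        m = mappings[name]
        total += sum(1 for row in rows if m["values"][row[m["attr"]]] == user_term)
    return total

print(count_by_user_term({"source_a": source_a, "source_b": source_b}, "FullTime"))  # 3
```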