Google Answers: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($10)

View Question

Q: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($10) ( Answered 4 out of 5 stars

Question

Subject: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($10)
Category: Computers > Programming
Asked by: rick94404-ga
List Price: $10.00

Posted: 27 Apr 2005 11:48 PDT
Expires: 27 May 2005 11:48 PDT
Question ID: 514989

Here's the problem I need to solve: I need to create a unique key in an Oracle DB from the following source data: field 1: char(2) sample data: 67 field 2: char(6) sample data: 000019 field 3: char(6) sample data: 000182 Here are the constraints: There may be duplicate values in any field, but the combination of field1, field2 and field3 will always be unique. The maximum key length must be 12 characters or less, and every character in the key must be a numeric value [0123456789] All positions in all 3 fields may contain data - I can't simply truncate the first position of field2 and field3 and concatenate all 3 to arrive at a 12 character (or less)unique key. The challenge seems to be generating the key from the source fields by a numerical calculation that yields a value that is unique and cannot be produced from the a different set of source fields. Here is an example: field 1: 12 field 2: 000019 field 3: 000182 Simple multiplication (12 x 19 x 182) yields 41496 - which fits the length and numerical character constraints, but could be produced by the same values in a different order: example: 12 x 182 x 19 also = 41496 I suspect their is an algorithm or other calculation (crypto?) that will produce the key as desired. No code required in the answer, just the calculation or a link to the calculation. Thanks and good luck! -Rick
Request for Question Clarification by studboy-ga on 27 Apr 2005 12:58 PDT Hi There's a couple of problems: 1) You mentioned multiplication-- what if you have 99, 999999, 999999, if you multiply the 3 you would get more than 12 characters, no? 2) Mathematicall, it's impossible to compress 14 characters into 12. Any type of hashing has a possibility of collision, and lossless compression would require the use of a frequency dictionary. Just curious, why does your key have to be 12 characters or less?
Clarification of Question by rick94404-ga on 27 Apr 2005 15:42 PDT Additional details- 1) You mentioned multiplication-- what if you have 99, 999999, 999999, if you multiply the 3 you would get more than 12 characters, no? Agree, this is another reason that the multiplication strategy isn't viable. 2) Mathematicall, it's impossible to compress 14 characters into 12. Any type of hashing has a possibility of collision, and lossless compression would require the use of a frequency dictionary. This approach sounds promising - although the characters are numerics, the key is really a string. My first thought was to take advantage of the numerics by performing a calculation, but your idea of employing a frequency dictionary is interesting - it would produce lossless compression and produce unique results within the 0-9 legal character set. Any further thoughts on how to implement? Since your answer is a comment, how do I accept it? -Rick Just curious, why does your key have to be 12 characters or less?
Request for Question Clarification by studboy-ga on 27 Apr 2005 22:07 PDT OK, if the fields are numbers only, then why not just represent them as hex? 999999 (6 chars) = F423F (5 chars) If this is acceptable to you, let me know.

Answer

Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($1
Answered By: mathtalk-ga on 28 Apr 2005 18:29 PDT
Rated: 4 out of 5 stars

Hi, Rick:

Thanks for inviting me to post an Answer.  The crux of the
impossibility of what you would like to do, pair up all possible
combined values for three fields:

field 1:  two digits
field 2:  six digits
field 3:  six digits

with a uniquely corresponding twelve digit string, is that there are
more of the first set of possibilities (10 to the 14th power) than
there are of the latter (10 to the 12th power).

The "pigeonhole" principle is an often used conceptual tool (theorem)
in mathematics that expresses this notion:

[Pigeonhole principle -- Wikipedia]
http://en.wikipedia.org/wiki/Pigeonhole_principle

"The pigeonhole principle states that if n pigeons are put into m
pigeonholes, and if n > m, then at least one pigeonhole must contain
more than one pigeon. Another way of stating this would be that m
holes can hold at most m objects with one object to a hole; adding
another object will force you to reuse one of the holes."

As a practical matter problems such as the one you ask do arise in
computer science, and a design decision must be made how to proceed. 
One approach that may address your needs is "hashing" with "collision
avoidance".

[hashing -- Whatis.com]
http://whatis.techtarget.com/definition/0,289893,sid9_gci212230,00.html

The word "hash" is used somewhat loosely in various aspects of
programming, often in connection with storing a set of things (but
also in cryptography).  Here we will discuss a "hash" function that
takes as arguments the three field values and returns a twelve digit
"hash".  Even though no 1-1 function can be extended to all 10 to the
14th possible inputs, there may be practical value in an approach that
defines a unique value for all inputs that will actually arise in an
application.

In this scenario you anticipate having far fewer than 10 to the 12th
records to store, but you cannot tolerate the risk of assigning the
same 12 digit "hash" to more than one record (a "collision" of hash
values).

We will assume that the hash value (twelve digit string) is to be
stored in the same table as a table that holds the three original
input strings, as well as possibly in other locations (as the hash
value is slight more compact than the combined storage of the three
original fields).  This table we will call the hash table, and it
provides a lookup to "reverse" the assignment performed by the hash
function.

The main idea is to have a fairly simple computation to give a trial
hash value, then check the table to see if has been used already.  It
is has, apply a "shift" of one sort or another until an unused value
is obtained.

Note that this assumes far fewer than 10 to the 12th power values need
to be managed in this fashion.  The practical reason for this is that
you don't want to have a huge "overhead" in shifting values around to
find an open "parking space", so to speak.  Keep in mind the "birthday
problem" as a model of how the chances of a collision can increase
dramatically even though a relatively small fraction of the possible
hash values are assigned.

[Ask Dr. Math: The Birthday Problem]
http://mathforum.org/dr.math/faq/faq.birthdayprob.html

Earlier I suggested a simple approach for "calculating" the hash value
by throwing away the two leading digits of fields 2 & 3.  It doesn't
really matter how the leftover digits are glued together.  If you are
doing it in SQL, a natural option would be to use substring( ) to drop
those leading digits and string concatenation to put the remnants
together.

The ideal "shift" function to have for avoiding collisions is one that
jumps around "wildly", is easy to compute, and yet manages to visit
(eventually) every potential value (looking for the open slot).  It's
not easy in general to satisfy all three criteria, and when the
density of collisions is low (due to the paucity of records involved)
and speed important, the first two criteria of greatest importance.

A mathematical treatment of such problems might be to pick the largest
prime below 10 to the 12th, and do all the computations modulo that
prime.  Supposing that the value of all zeroes can be excluded from
the hash values by one means or another, a "shift" function with the
nice property of visiting all the possible slots could consist of
multiplication by a "primitive root" of the prime modulus, which we
might choose to have six digits or so to give plenty of "jumping
around" of the shifted values.

If a practical algorithm of this sort would be helpful to you, I'd be
happy to elaborate upon seeing a Request for Clarification from you. 
Let me know if you are intending to do the coding in SQL or in a more
powerful language.

best wishes,
mathtalk-ga

rick94404-ga rated this answer: 4 out of 5 stars

Simple and clear explanation to complex problem. Thanks for tackling this one!

Comments

Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($1
From: vladimir-ga on 27 Apr 2005 14:25 PDT

Why not just use a sequence to assign a unique key (1, 2, 3, ...) to
each row?. As long as there are less than 10^12 unique combinations of
the three fields (theoretically there could be 10^14 different ones),
you'll be fine.

Unless I misunderstood what you're trying to achieve, there is no
significantly better or more elegant way to do it - it's impossible to
write a function that would transform elements from a bigger set to a
smaller set with no collisions.

Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($1
From: mathtalk-ga on 27 Apr 2005 15:16 PDT

As vladimir-ga and earlier studboy-ga have noted, there is no one-one
function from the set of possible length 14 numeric character strings
into the set of possible length 12 numeric character strings.  If it
were acceptable to have "collisions" then this would a question of
devising a suitable "hash" function.

A couple more observations:

1. If the data is being stored as characters, then restricting the
concatenated data to numeric characters appears to be an artificial
limitation.  Consider using a few nonnumeric characters, for then the
12 character limit could be easily attained.

2. If the longer fields are apt to have leading zeroes, I'd be tempted
to create a concatenated result by interleaving the digits of the two
longer fields and tacking the short field (1) on at the left:

field 1:             12
field 2: 0 0 0 0 1 9
field 3:  0 0 0 1 8 2
-----------------------
cat:     00000001189212

and simply dropping the two highest digits leaves twelve:

           000001189212

Character by character manipulations in SQL can be quite tedious. 
This sort of manipulation is more easily done in "client-side" code.

regards, mathtalk-ga

Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($1
From: mathtalk-ga on 28 Apr 2005 03:50 PDT

Consider the following thought experiment.  You have a magic algorithm
that takes the values of the three fields:

field 1: 12       [two decimal digits]
field 2: 000019   [six decimal digits]
field 3: 000182   [six decimal digits]

and maps these (uniquely) to a string of twelve decimal digits _without_ ambiguity.

Now divide that string in halves as field 2 and 3, and put another
pair of digits into field 1.   Apply the magic algorithm.

If it were possible to do this you could "compress" an arbitrary
amount of information into the three fields (or for that matter into
the twelve digit string since it conveys the same information), just
by repeating the procedure as many times as you want.  Since the
procedure is without ambiguity, in principle the procedure is
reversible and all the digits that go in can be "unpacked", with no
doubt devastating consequences for the telegram industry.

The impossibility of mapping all 10^14 possible values of the three
fields into 10^12 possible values of the twelve digit string is an
example of the famous mathematical "pigeonhole" principle.  If M > N
are positive integers, then any assignment of M things to N
pigeonholes will result in at least one pigeonhole being assigned more
than one thing.

regards, mathtalk-ga

Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($10)
From: rick94404-ga on 28 Apr 2005 07:52 PDT

Mathtalk-

Your comment is well worth the fee. Please post as the answer for payment.

Great explanation!

What is your background?

-Rick

Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values ($1
From: pgmer6809-ga on 03 May 2005 15:06 PDT

Most databases have a type of data called 'Numeric'. Numeric data is
typically stored as BCD, in other words two digits per byte.
If your 12 char limit is space on disk (or in the table) (as opposed
to space in a display form) you can just concatenate your 14 chars
together, and then save them as a numeric field using (in the data
base) only 8 chars. (7 for the numbers plus half a char for the sign,
and location of the decimal point).

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy