View Question
Q: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$10) ( Answered ,   5 Comments )
 Question
 Subject: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$10) Category: Computers > Programming Asked by: rick94404-ga List Price: \$10.00 Posted: 27 Apr 2005 11:48 PDT Expires: 27 May 2005 11:48 PDT Question ID: 514989
 ```Here's the problem I need to solve: I need to create a unique key in an Oracle DB from the following source data: field 1: char(2) sample data: 67 field 2: char(6) sample data: 000019 field 3: char(6) sample data: 000182 Here are the constraints: There may be duplicate values in any field, but the combination of field1, field2 and field3 will always be unique. The maximum key length must be 12 characters or less, and every character in the key must be a numeric value [0123456789] All positions in all 3 fields may contain data - I can't simply truncate the first position of field2 and field3 and concatenate all 3 to arrive at a 12 character (or less)unique key. The challenge seems to be generating the key from the source fields by a numerical calculation that yields a value that is unique and cannot be produced from the a different set of source fields. Here is an example: field 1: 12 field 2: 000019 field 3: 000182 Simple multiplication (12 x 19 x 182) yields 41496 - which fits the length and numerical character constraints, but could be produced by the same values in a different order: example: 12 x 182 x 19 also = 41496 I suspect their is an algorithm or other calculation (crypto?) that will produce the key as desired. No code required in the answer, just the calculation or a link to the calculation. Thanks and good luck! -Rick``` Request for Question Clarification by studboy-ga on 27 Apr 2005 12:58 PDT ```Hi There's a couple of problems: 1) You mentioned multiplication-- what if you have 99, 999999, 999999, if you multiply the 3 you would get more than 12 characters, no? 2) Mathematicall, it's impossible to compress 14 characters into 12. Any type of hashing has a possibility of collision, and lossless compression would require the use of a frequency dictionary. Just curious, why does your key have to be 12 characters or less?``` Clarification of Question by rick94404-ga on 27 Apr 2005 15:42 PDT ```Additional details- 1) You mentioned multiplication-- what if you have 99, 999999, 999999, if you multiply the 3 you would get more than 12 characters, no? Agree, this is another reason that the multiplication strategy isn't viable. 2) Mathematicall, it's impossible to compress 14 characters into 12. Any type of hashing has a possibility of collision, and lossless compression would require the use of a frequency dictionary. This approach sounds promising - although the characters are numerics, the key is really a string. My first thought was to take advantage of the numerics by performing a calculation, but your idea of employing a frequency dictionary is interesting - it would produce lossless compression and produce unique results within the 0-9 legal character set. Any further thoughts on how to implement? Since your answer is a comment, how do I accept it? -Rick Just curious, why does your key have to be 12 characters or less?``` Request for Question Clarification by studboy-ga on 27 Apr 2005 22:07 PDT ```OK, if the fields are numbers only, then why not just represent them as hex? 999999 (6 chars) = F423F (5 chars) If this is acceptable to you, let me know.```
 Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$1 Answered By: mathtalk-ga on 28 Apr 2005 18:29 PDT Rated:
 ```Hi, Rick: Thanks for inviting me to post an Answer. The crux of the impossibility of what you would like to do, pair up all possible combined values for three fields: field 1: two digits field 2: six digits field 3: six digits with a uniquely corresponding twelve digit string, is that there are more of the first set of possibilities (10 to the 14th power) than there are of the latter (10 to the 12th power). The "pigeonhole" principle is an often used conceptual tool (theorem) in mathematics that expresses this notion: [Pigeonhole principle -- Wikipedia] http://en.wikipedia.org/wiki/Pigeonhole_principle "The pigeonhole principle states that if n pigeons are put into m pigeonholes, and if n > m, then at least one pigeonhole must contain more than one pigeon. Another way of stating this would be that m holes can hold at most m objects with one object to a hole; adding another object will force you to reuse one of the holes." As a practical matter problems such as the one you ask do arise in computer science, and a design decision must be made how to proceed. One approach that may address your needs is "hashing" with "collision avoidance". [hashing -- Whatis.com] http://whatis.techtarget.com/definition/0,289893,sid9_gci212230,00.html The word "hash" is used somewhat loosely in various aspects of programming, often in connection with storing a set of things (but also in cryptography). Here we will discuss a "hash" function that takes as arguments the three field values and returns a twelve digit "hash". Even though no 1-1 function can be extended to all 10 to the 14th possible inputs, there may be practical value in an approach that defines a unique value for all inputs that will actually arise in an application. In this scenario you anticipate having far fewer than 10 to the 12th records to store, but you cannot tolerate the risk of assigning the same 12 digit "hash" to more than one record (a "collision" of hash values). We will assume that the hash value (twelve digit string) is to be stored in the same table as a table that holds the three original input strings, as well as possibly in other locations (as the hash value is slight more compact than the combined storage of the three original fields). This table we will call the hash table, and it provides a lookup to "reverse" the assignment performed by the hash function. The main idea is to have a fairly simple computation to give a trial hash value, then check the table to see if has been used already. It is has, apply a "shift" of one sort or another until an unused value is obtained. Note that this assumes far fewer than 10 to the 12th power values need to be managed in this fashion. The practical reason for this is that you don't want to have a huge "overhead" in shifting values around to find an open "parking space", so to speak. Keep in mind the "birthday problem" as a model of how the chances of a collision can increase dramatically even though a relatively small fraction of the possible hash values are assigned. [Ask Dr. Math: The Birthday Problem] http://mathforum.org/dr.math/faq/faq.birthdayprob.html Earlier I suggested a simple approach for "calculating" the hash value by throwing away the two leading digits of fields 2 & 3. It doesn't really matter how the leftover digits are glued together. If you are doing it in SQL, a natural option would be to use substring( ) to drop those leading digits and string concatenation to put the remnants together. The ideal "shift" function to have for avoiding collisions is one that jumps around "wildly", is easy to compute, and yet manages to visit (eventually) every potential value (looking for the open slot). It's not easy in general to satisfy all three criteria, and when the density of collisions is low (due to the paucity of records involved) and speed important, the first two criteria of greatest importance. A mathematical treatment of such problems might be to pick the largest prime below 10 to the 12th, and do all the computations modulo that prime. Supposing that the value of all zeroes can be excluded from the hash values by one means or another, a "shift" function with the nice property of visiting all the possible slots could consist of multiplication by a "primitive root" of the prime modulus, which we might choose to have six digits or so to give plenty of "jumping around" of the shifted values. If a practical algorithm of this sort would be helpful to you, I'd be happy to elaborate upon seeing a Request for Clarification from you. Let me know if you are intending to do the coding in SQL or in a more powerful language. best wishes, mathtalk-ga```
 rick94404-ga rated this answer: `Simple and clear explanation to complex problem. Thanks for tackling this one!`

 Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$1 From: vladimir-ga on 27 Apr 2005 14:25 PDT
 ```Why not just use a sequence to assign a unique key (1, 2, 3, ...) to each row?. As long as there are less than 10^12 unique combinations of the three fields (theoretically there could be 10^14 different ones), you'll be fine. Unless I misunderstood what you're trying to achieve, there is no significantly better or more elegant way to do it - it's impossible to write a function that would transform elements from a bigger set to a smaller set with no collisions.```
 Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$1 From: mathtalk-ga on 27 Apr 2005 15:16 PDT
 ```As vladimir-ga and earlier studboy-ga have noted, there is no one-one function from the set of possible length 14 numeric character strings into the set of possible length 12 numeric character strings. If it were acceptable to have "collisions" then this would a question of devising a suitable "hash" function. A couple more observations: 1. If the data is being stored as characters, then restricting the concatenated data to numeric characters appears to be an artificial limitation. Consider using a few nonnumeric characters, for then the 12 character limit could be easily attained. 2. If the longer fields are apt to have leading zeroes, I'd be tempted to create a concatenated result by interleaving the digits of the two longer fields and tacking the short field (1) on at the left: field 1: 12 field 2: 0 0 0 0 1 9 field 3: 0 0 0 1 8 2 ----------------------- cat: 00000001189212 and simply dropping the two highest digits leaves twelve: 000001189212 Character by character manipulations in SQL can be quite tedious. This sort of manipulation is more easily done in "client-side" code. regards, mathtalk-ga```
 Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$1 From: mathtalk-ga on 28 Apr 2005 03:50 PDT
 ```Consider the following thought experiment. You have a magic algorithm that takes the values of the three fields: field 1: 12 [two decimal digits] field 2: 000019 [six decimal digits] field 3: 000182 [six decimal digits] and maps these (uniquely) to a string of twelve decimal digits _without_ ambiguity. Now divide that string in halves as field 2 and 3, and put another pair of digits into field 1. Apply the magic algorithm. If it were possible to do this you could "compress" an arbitrary amount of information into the three fields (or for that matter into the twelve digit string since it conveys the same information), just by repeating the procedure as many times as you want. Since the procedure is without ambiguity, in principle the procedure is reversible and all the digits that go in can be "unpacked", with no doubt devastating consequences for the telegram industry. The impossibility of mapping all 10^14 possible values of the three fields into 10^12 possible values of the twelve digit string is an example of the famous mathematical "pigeonhole" principle. If M > N are positive integers, then any assignment of M things to N pigeonholes will result in at least one pigeonhole being assigned more than one thing. regards, mathtalk-ga```
 Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$10) From: rick94404-ga on 28 Apr 2005 07:52 PDT
 ```Mathtalk- Your comment is well worth the fee. Please post as the answer for payment. Great explanation! What is your background? -Rick```
 Subject: Re: Needed: Simple Algorithm to Reduce Length of Concatenated Numeric Values (\$1 From: pgmer6809-ga on 03 May 2005 15:06 PDT
 ```Most databases have a type of data called 'Numeric'. Numeric data is typically stored as BCD, in other words two digits per byte. If your 12 char limit is space on disk (or in the table) (as opposed to space in a display form) you can just concatenate your 14 chars together, and then save them as a numeric field using (in the data base) only 8 chars. (7 for the numbers plus half a char for the sign, and location of the decimal point).```