Google Answers Logo
View Question
 
Q: Compressing short text strings ( No Answer,   6 Comments )
Question  
Subject: Compressing short text strings
Category: Computers > Programming
Asked by: tseven-ga
List Price: $15.00
Posted: 21 Mar 2005 09:23 PST
Expires: 20 Apr 2005 10:23 PDT
Question ID: 498051
My goal: 
To compress a fairly short text string into an even shorter string.
For example:

String to compress:
813459546754565ROBERT.JOHNSON

Possible compressed string outcome:
A$21#JSK

I will be integrating this function into an Excel Spreadsheet using Visual Basic.

I am looking for at least an algorithm or example function written in
any language (preferably VB) so I may adapt it to my program.

I could settle for a free self-contained binary (exe) which will allow
me to pass the string to and from a standard text file.

Reason:
I need to take a rather large string and shorten it so that it may be
read by a bar scanner. There are some characters which the bar scanner
will not recognize.

I have seen Software Key generators take a string such as "John Smith"
and translate it to a 5 digit number. This is what I would like to do.

Links to the history of text compression will not help me. I need
source code examples or at minimum an working algorithm which I can
attempt to implement in Visual Basic.

Thanks,
Steve

Request for Question Clarification by efn-ga on 27 Mar 2005 09:58 PST
Hi Steve,

I don't know if I can help you with this, but at least I can ask some
questions to get the problem defined more specifically.

What are the input and output alphabets or character sets?

Can you be more specific about the range of input lengths than "fairly short"?

Are the inputs completely unpredictable, or do you have a set you can
analyze for patterns?

Is there any specific requirement for how much compression you need?

--efn
Answer  
There is no answer at this time.

Comments  
Subject: Re: Compressing short text strings
From: frde-ga on 21 Mar 2005 11:42 PST
 
I think that you have seen CRC16 algorithms

They take a string of bytes and produce a number between 0 and 65535
(note the 5 digits)

The numbers are not guaranteed to be unique to that string
- but the chances are high, and if you control 'both ends' you can guarantee it

Since converting '48637' back to '813459546754565ROBERT.JOHNSON'
is totally impossible without some sort of lookup table
- you might as well trash the CRC16 and use a lookup table of your own devising.

You can make a hamburger out of a cow
- but making a cow out of (partial) hamburgers is not easy
Subject: Re: Compressing short text strings
From: philnj-ga on 21 Mar 2005 13:48 PST
 
Do you need a deterministic compression method?  I mean, if you print
a bar code, must you be able to determine what the original text was
simply from the information contained in the bar code? Or can the bar
code writer and the bar code reader share a database?  If so, then you
can generate a key for each entry, and bar code the key, the reader
would read the key and do a database search to retrieve the info.

Another possibility is using Huffman Compression.  I seem to remember
that it uses a statistical approach to substitute common characters
for shorter bit sequences.  The length of the resulting code is not
guaranteed to be a ceratin size given a fix input size.

A long time ago, there was a system that I think was called radix-56
(or "radix-something") used by DEC to store strings.  Maybe there are
some old PDP-11 guys (or gals) out there who remember this system of
compression.

There is a practical limit to the amount of compression you can
accomplish and still retain a usable amount of information.

Sorry I can't be of more help.
Subject: Re: Compressing short text strings
From: tseven-ga on 21 Mar 2005 21:09 PST
 
To frde:
You are probably correct in the crc16 algorithms. I thought it was
more advanced than that.

To philnj:
Yes it must be deterministic, as I cannot predict the strings contents
nor exact length. A common database would not be possible.

Thanks for the thoughts, it has given me new avenues to look down.
Subject: Re: Compressing short text strings
From: bbcbasic-ga on 25 Mar 2005 14:57 PST
 
> A long time ago, there was a system that I think was called radix-56
> (or "radix-something") used by DEC to store strings

It was called RAD50 (where the '50' is octal for 40).  The principle
behind it was that 40-cubed is 64000, so will fit in a single 16-bit
word.  It used a limited character set of 40 characters (letters A-Z,
numbers 0-9 and four others) and could store three such characters in
a 16-bit word.  With RAD50 the string "813459546754565ROBERT.JOHNSON"
would compress to 10 words (20 bytes).

Encoding and decoding RAD50 is very straightforward.  If you want
some BASIC code to do it let me know.
Subject: Re: Compressing short text strings
From: 0x426f62-ga on 28 Mar 2005 10:35 PST
 
The constraints of your request as I understand it (that is be
deterministic, have no constraints as to length or character set,  and
reversable without a data base) render the solution basically
impossible.   Each 8-bit byte of data represents one of 256 possible
pieces of information (characters).  If you allow the constraint that
your character set is limited to the 40 characters suggested earlier,
that takes 5.5 bits to encode.  The most you save is about 2.5 bits
per character.  The most you can compress your data without using
pattern or predictive techniques will be 24 bits down to 16.  If your
input character set has to be larger than 40, then your initial
compression will be even less.  But this isn't the worst news.  You
say your bar code can't carry every possible 8-bit combination.  But
you haven't said exactly how large the character set is that you can
use to do the encoding.  Assuming it's 7-bits (just a guess), then
your best compression is going to save you 1.5 bits per 8-bit
character.  Your best reduction will be 18%.  If you allow linear
techniques, then your compression can and probably will be greater
most of the time, but no matter what technique you choose, there will
be a "worst case" out there that ends up occupying the full 24
characters for your 30 character string.

You might want to re-think your whole problem, or post the underlying question.
Subject: Re: Compressing short text strings
From: divineflight-ga on 30 Mar 2005 16:53 PST
 
I'd hate to say this but compressing '813459546754565ROBERT.JOHNSON'
to something as short as 'A$21#JSK' is nearly impossible. I say nearly
because it could happen if significant portion of the original string
is repeated. Lookup tables won?t help either and will probably result
tension.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy