Google Answers Logo
View Question
 
Q: Script (?) for Comparing Colums of Data ( No Answer,   6 Comments )
Question  
Subject: Script (?) for Comparing Colums of Data
Category: Computers > Algorithms
Asked by: tikal-ga
List Price: $10.00
Posted: 20 Jul 2003 14:59 PDT
Expires: 22 Jul 2003 11:55 PDT
Question ID: 233095
Hi,
This is my first post to GA. Here it goes :-)

I am looking to compare the contents of two sets of data  and end up
with a third set of data that consists of "terms" matched between the
two. Not only should this third row consist of exact matches, but it
should also include matches that include two terms from the first set
of data as well.

Example:

"A"             "B"             Identified Words ("C")
one             one             one
two             nine            onethree
three           ten             ten
four            onethree
ten             onenight
                 


Please note that:
 - "one" was an identified word because "one" appeared in both "a" and
"b"
 - "onethree" was an identified word because both "one" and "three"
appeared in column "a" and the term "onethree" appeared in column "b"
 - "ten" was an identified word because "ten" appeared in both "a" and
"b"
 - "onenight" WAS NOT and identified word because even though the term
"one" appeared in column "a" the term "night" did not.

Other points of note:

 - The size of the data source "a" is around 100,000 cells
 - The size of the data source "b" is around 40,000 cells
 - In terms of software, I really only have Excel and I have no
experience with databases
 - The data I have I receive in text files
 - I have a P4 2.4GHZ with 1GB RAM machine

In order to make this a complete answer, I need the following:

 - The exact script (or detailed instructions for another solution)
that will allow me to do this data extraction - The answer should not
be "here is where you find the info..."
 - If anything other than Excel is required, I will need EXACT
instructions as to what to download (it has to be free), and what to
do with the software I downloaded
 - If Excel is used, please take into account the limitation on the
number of cells in Excel (i.e. the solution you provide will now have
to work flawlessly between two columns of "a" data and basically treat
them as one) - I will consider limiting my "a" data source to 65,000
if there is no way of doing this.
 - The solution should take into account the length of time needed to
complete the extraction (i.e. I can't wait a day for the data
extraction to be completed :-)
 - The solution should allow me to update the "b" column data on a
frequent basis (daily or weekly) by "cutting and pasting" or by
"referencing" a txt file.

I thank you very much for your time reading this and would be happy to
answer any questions.

Clarification of Question by tikal-ga on 20 Jul 2003 15:10 PDT
*colums = columns

Ugh... typos in the "Subject" can't be changed :-(

Request for Question Clarification by mathtalk-ga on 22 Jul 2003 07:35 PDT
Hi, tikal-ga:

I could write a VBA program for you, that you could run from within
Excel, that solves this problem assuming that the lists A and B are
available as sorted files, and that it is acceptable to output list C
as a file.

The format of the files would be one "term" per line, just to keep
things easy.  I'd put the resulting Excel spreadsheet out on the 'net
someplace for you to download.

However the effort involved to write and test a program is more than
what your offered price of $10 would reasonably cover.  See the Google
Answers pricing guidelines here:

[How to price your question]
http://answers.google.com/answers/pricing.html

regards, mathtalk-ga

Clarification of Question by tikal-ga on 22 Jul 2003 11:50 PDT
Hi mathtalk-ga,
Thanks so much for your help. Unfortunately I think this solution will
take too long to execute within excel based on the number of cells I
need to compare.

I think I am asking for something that I will need to use a db for. I
guess I need to learn :-) Thanks for all of your help though!!
Answer  
There is no answer at this time.

Comments  
Subject: Re: Script (?) for Comparing Colums of Data
From: mathtalk-ga on 20 Jul 2003 18:05 PDT
 
Hi, tikal-ga:

I have read with interest your question.  It seems to me that in
asking for detailed directions on how to carry out the required
matching using some software, you are in fact asking not about
algorithms but about programming.

Certainly there are software packages that would be more appropriate
than Excel for solving this kind of problem.  Perhaps the most elegant
approach would be a Prolog program.  There are Prolog
compilers/interpreters which are freely available.

Likewise, as I think you perhaps anticipated, database programming
would provide another means for tackling this matching.

The basic point in both cases is that "declarative" programming
languages will allow the programmer to specify the "rules" that govern
solutions, and leave it to the language implementation to decide "how"
the solutions will be found.

You seem to propose two rules, which I'll explain using a Prolog-like
syntax.

Let InA(X) and InB(X) be two predicates which describe the contents of
list A and list B, respectively.  That is, InA(X) is true for each
item X in list A, and similarly for InB(X) and the items in list B.

match(X) if InA(X), InB(X).

match(Z) if InA(X), InB(Y), concatenate(X,Y,Z).

Once the basic "algorithm" or rules are defined, it is possible to
turn one's attention to making the implementation efficient.

regards, mathtalk-ga
Subject: Re: Script (?) for Comparing Colums of Data
From: tikal-ga on 20 Jul 2003 18:19 PDT
 
Hi mathtalk-ga,
Thanks very much for your comment. The only reason I phrased the
question in this way is because I have struggled with it for a long
time. Frustration has mounted! :-) The other reason I am looking for a
detailed solution is so that I can pick it apart the answer so I can
learn some data extraction methods. I'm seem to learn best by seeing
the answer and working back to the question - then reapplying to other
tasks... anyway. Maybe the issue with my question is the category?
Please advise.

I will take a look at prolog right now.

Thanks!
Subject: Re: Script (?) for Comparing Colums of Data
From: mathtalk-ga on 20 Jul 2003 20:05 PDT
 
Hi, tikal-ga:

Perhaps SQL is the best approach for the volume of data you are
dealing with.  There are free (open source) relational databases, such
as MySQL and GNU SQL Server.  Indeed Microsoft makes its MSDE, a
restricted version of SQL Server, freely redistributable, so it is
"free" to endusers (though a developer is needed to build an
application around it for you).

However let me correct what I did in the second Prolog rule for your
question; I made a mistake when I wrote:

match(Z) if InA(X), InB(Y), concatenate(X,Y,Z).

This would mean that a "match" can occur by combining items from
column A with column B, which is of course not the rule you outlined. 
Instead you said that a match could occur by concatenating two items
from column A into a matching item in column B.  (Question: Did you
also mean to allow for the converse, two items in column B combining
to make an item in column A?)

This would be expressed in Prolog as:

match(Z) if InB(Z), InA(X), InA(Y), concatenate(X,Y,Z).

Similar criteria can be expressed in SQL.  For example, we might have
a single column table A and another single column table B.  Then the
first rule (same item in both columns) would be expressed as a simple
"join" query:

SELECT A.item from A,B
WHERE  A.item = B.item

The second (more complex) rule would be:

SELECT B.item from B
WHERE EXISTS (
    SELECT * from A as A1, A as A2
    WHERE B.item = A1.item + A2.item
    )

Here the + operator is used in SQL to concatenate two "strings", an
item apiece from two logical copies of the table A.

Again some work to make the query more "efficient" in an
implementation would be the next practical step to take.

regards, mathtalk-ga
Subject: Re: Script (?) for Comparing Colums of Data
From: tikal-ga on 20 Jul 2003 20:17 PDT
 
Hi mathtalk-ga,
Thanks again for your comment! 

To answer the question "Did you also mean to allow for the converse,
two items in column B combining to make an item in column A?" - No.
Ideally the solution would only work by looking for combinations (2
max) of "a" to make "b".

Thanks again for the SQL reco. I plan on digging into SQL/MySQL in a
couple of months (when work slows down) but until then, I'm afraid I
would need some pretty detailed instructions on how to work with SQL
in general. I don't even know how to set up a db, add data etc. once I
get SQL installed.

Thanks very much!
Subject: Re: Script (?) for Comparing Colums of Data
From: stephenvakil-ga on 21 Jul 2003 08:16 PDT
 
Assuming you have ms access (instructions for 97) it's pretty simple:

1. Open ms access, create a new db
2. Go to File -> Get External Data -> Import
3. Change "Files of Type" to Microsoft Excel then browse to your file
4. Follow the wizard instructions (you might have to select "First Row
Contains Headers".  You do not need to create a primary key but may
optionally do so.  At the end you should have a table called Sheet1
5. Go to view->database objects->queries
6. Click New, Select Design View, OK
7. Click close on the table listing
8. Go to view->Sql View
9. Type in:  

SELECT s1.a & s2.a
FROM Sheet1 AS s1, sheet1 AS s2
WHERE (s1.a & s2.a in (select b from sheet1))
union
select b from sheet1 where b in (select a from sheet1)

10. Save this as some query name


Now you can double click on it to see the results.  If you want to
save it to an excel sheet within your existing workbook, select the
query you just created, go to save as/export, choose excel 97, and
browse to your workbook.
Subject: Re: Script (?) for Comparing Colums of Data
From: tikal-ga on 22 Jul 2003 11:55 PDT
 
Hi stephenvakil-ga,
Thanks for you help. I don't have Access though :-( I am going to take
a look at your code and try to figure it out though.

Thanks!

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy