Subject: Converting row-based text files into excel, access or spss
Category: Computers > Software
Asked by: pige-ga
List Price: $20.00
Posted: 02 Mar 2006 10:38 PST
Expires: 01 Apr 2006 10:38 PST
Question ID: 702938
I have a large number of full-text indexed newspaper stories (2000+) downloaded from a database as 'txt' files and want to know if there is a fast and easy way to convert the indices associated with each story into a usable format in either Excel, Access or SPSS. As you can see from the two examples below (I've stripped out the full text of the news stories to simplify the presentation), each index has identical row headings. My file has 2000+ news stories, so it has 2000+ indices, each with identical row headings. The result is a lot of needless duplication that impedes my ability to create, for example, a time series of news stories on a particular topic, or to simply sum up word counts for the selected stories.

SE Editorial
HD Solace without substance
WC 593 words
PD 29 December 1977
SN The Globe and Mail
SC GLOB
PG P6
LA English
CY All material copyright Thomson Canada Limited or its licensors. All rights reserved.

SE Editorial
HD An $11 billion challenge
WC 552 words
PD 1 March 1978
SN The Globe and Mail
SC GLOB
PG P6
LA English

My goal, then, is to end up with an Excel, Access or SPSS file where I have columns for each row heading, i.e., SE (section), HD (headline), WC (word count), PD (publication date), and so on, and where each row replicates the existing index data for one news story. I'd also like to strip out the text string "words" in the WC row heading. My "dream table" would look something like this:

SE         HD                        WC   PD                ETC...
Editorial  Solace without Substance  593  29 December 1977  ...
Editorial  An $11 billion challenge  552  1 March 1978      ...

By way of background, I should add that I've already done some fairly extensive searches on the Internet. From what I can tell, all the text conversion software out there, including the built-in text import features in Excel, Access and SPSS (I've tried 'em all!), is premised on the belief that data identifiers (i.e., SE, HD, WC, PD, etc.) are indicated in column headings rather than row headings, AND that the identifiers appear only once rather than multiple times in the same file. I've also stumbled across a rather cumbersome way of achieving my goal: import the downloaded indices into ENDNOTE, do some searches and replaces to remove offending bits, export the file as tab delimited, import it into Excel, sort the records by heading (SE, HD, WC, PD, etc.), cut the resulting data, and, finally, paste that data into a separate spreadsheet. Very time consuming and cumbersome, but it works. My question is aimed at finding a fast and easy way to do the same thing, i.e., to translate my big text files into usable spreadsheets or databases. I'd be willing to purchase a reasonably priced software package that could reliably handle this task in a simple way. Thanks kindly!
There is no answer at this time.
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 03 Mar 2006 09:54 PST
Your second example header list above doesn't include the "CY" field. Are the headers always the same (number and order), or do they in fact vary?
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 03 Mar 2006 21:13 PST
At the end of this comment is a Python program that does most of what you want. It will parse all .txt files in the folder in which it is run and output a single file, index.csv, with the headers converted into rows. Excel and Access will both import .csv (comma separated value) files. Given your headers, it will produce (where EOL is the end of the line, i.e. the CR/LF pair on Wintel):

SE,HD,WC,PD,SN,SC,PG,LA,CY(EOL)
Editorial,Solace without substance,593 words,29 December 1977,The Globe and Mail,GLOB,P6,English,All material copyright Thomson Canada Limited or its licensors. All rights reserved.(EOL)
Editorial,An $11 billion challenge,552 words,1 March 1978,The Globe and Mail,GLOB,P6,English,(EOL)

Note that the last line is truncated because you didn't have the CY field in the second example; of course, the program becomes much more complex if you don't always have the same fields. You should be prepared to pay much more in that case.

This is a minimal implementation. It doesn't strip the " words" from the WC column, because that is a fairly trivial addition and would make reading the code more difficult (it would also be easy to do in Excel or Access afterward). It also hard-codes the order of headers that you used to create the header row in the output file. I have loaded the resulting .csv file into Excel and it works fine. Also, I am not a Google Researcher, so I can't make this an answer; this will allow a researcher to provide a completely functional answer and fairly earn the money you have posted.

To run this program you will need an implementation of the Python programming language on your computer; you can download one from python.org. I'm teaching myself Python, and when I saw this question I contemplated how I would do it in C, in which I am fairly proficient, and then in Python, which I am interested in learning because I have an expanded need to process text files. I can tell you that doing this in C would have taken me much longer, but I haven't used C much for manipulating text files, so I haven't built any libraries that would have helped either. If I had to use Python to program an analytical instrument through a serial port, it would take me longer than in C, I'm sure. Anyway, the above editorial is the price you pay for getting the code.

Snip out all of the lines between the Begin------- and End------- lines, paste them into a file called 'parseit.py', then issue the command 'python parseit.py' in the folder where your .txt files are.

Begin parseit.py -------------------------------------
import glob

def parsefile( filename, ofile ):
    ifile = open( filename, 'r' )
    writeline = ""
    print filename
    for i in range(0,9):
        templine = ifile.readline()[:-1]
        print '\t' + templine
        writeline = writeline + ( templine[3:].strip() + ',' )
    writeline = writeline[:-1] + '\n'
    ofile.write( writeline )
    ifile.close()

filelist = glob.glob( '*.txt' )   # This is where the target files are read
filenum = len( filelist )
ofile = open( 'index.csv', 'w' )
ofile.write( "SE,HD,WC,PD,SN,SC,PG,LA,CY\n" )   # This is the header row
for i in range( 0, filenum ):
    parsefile( filelist[i], ofile )
ofile.close()
End parseit -------------------------------------------------------------
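[Editor's aside, not part of the posted script.] One caveat worth noting about the manual comma-join in the script above: if a field value itself contains a comma (a headline easily could), the resulting .csv column alignment breaks. Python's standard csv module quotes such fields automatically. A minimal sketch in current Python 3 syntax, using a made-up headline to illustrate:

```python
# Hedged sketch: csv.writer quotes fields that contain commas, which a
# plain ','.join would not. The headline here is invented for illustration.
import csv

row = ['Editorial', 'Guns, butter and budgets', '593 words', '29 December 1977']
with open('index.csv', 'w', newline='') as out:
    csv.writer(out).writerow(row)   # the comma-containing headline is quoted
```

When Excel or Access imports the resulting file, the quoted headline stays in a single column.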
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 06 Mar 2006 09:37 PST
Hi, First off, THANK YOU! I'll get myself a copy of Python and give it a whirl (I'm not a programmer, but I am starting to give serious thought to putting in the requisite time to learn at least one programming language :). In response to your question about the CY field, it was accidentally omitted from the second example (bad cut & paste job!). The fields are identical all the way through, so that should take care of the potential complications you noted. Thanks again. I'll let you know how I make out with the Python script. Cheers
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 06 Mar 2006 11:21 PST
Hammer, For now, it's a one-time need, although I have several large datasets (downloaded and indexed text files) to process. I can see myself having to do this again in the future, although I have no immediate plans as of now. As for Excel vs. Access, my preference is Excel; I have the Office 2003 version. Cheers (and thanks!) Marc-André
Subject: Re: Converting row-based text files into excel, access or spss
From: cynthia-ga on 06 Mar 2006 13:28 PST
If you convert the file to Excel, this (formulated) example might be helpful: http://www.ozgrid.com/forum/showthread.php?t=26511
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 08 Mar 2006 08:42 PST
To the person who wrote the Python script: Just a follow-up. I've finally wrapped my head around Python enough to get your script to work. Well, sort of. The script worked well for the first citation but not the rest of 'em (I have 2000+ such citations in the same txt file) -- it also seems to retain the field identifiers (PD, WC, etc.). I'm probably doing something wrong, but I'm not sure what. Also, it turns out that my citations are not as identical as I thought: some news stories have a "reporter" field (BY:) and some don't, depending on the nature of the story. Finally, I didn't note this initially (I didn't even dare hope someone would be kind enough to try their hand at programming something), but each citation also comes with the lead paragraph of the story in question, which I'm sure complicates immensely the task of parsing my very large file. Anyway, thanks again for taking a stab at this. I'm trying to teach myself enough Python to take a stab at this myself. We'll see how I make out. Cheers Marc-Andre
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 08 Mar 2006 10:20 PST
One last comment to the person who wrote the Python script, in case they look here again... You said it would be easy to strip the word "words" from a cell in Excel. How exactly would you do that? Cheers Marc-Andre
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 10 Mar 2006 11:06 PST
Hi, I wrote the Python script. I don't think it would be hard to rework it so the headers aren't hard-coded: you could just search for a regular expression that matches two uppercase characters and a space (assuming they are all in that format). While it would be a pretty simple expression, I'm not familiar enough with regular expressions to know exactly what it would be; probably something like "[A-Z][A-Z] ". You would have to either know beforehand what all of the potential headers are, though, or grow your output table dynamically by adding columns as each new header is found. That would be a little more complicated to program.

The " words" suffix would be easier to strip in Python. Either use the string method replace():

if string1.find( ' words' ) > -1:
    print string1.replace( ' words', '' )

or just print the string up to " words":

# the [:-6] indicates from the beginning of the string to 6 characters
# left of the end of the string
if string1.find( ' words' ) > -1:
    print string1[:-6]

In Excel, you would create two new (temporary) columns beside the one you wish to change (assuming your new columns are B and C, and your old one is A) and fill column B with the formula =SUBSTITUTE(A1," words",""). This trims " words" from each cell in column A. Now select the B column, Copy, then Paste Special into column C and select Values; this pastes just the text, not the formula. Finally, delete columns A and B.

I had thought from your original post that each file contained only one abstract, so you would have to rewrite the script to behave accordingly. If you put up a link to one of the files, I wouldn't mind trying to do it.
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 10 Mar 2006 11:13 PST
I just noticed I should have put a beginning-of-line character in front of the regular expression, in case there are uppercase letter pairs elsewhere. I have tested it and it works: "^[A-Z][A-Z] ". It would still cause problems if a letter pair starts a line within the abstract (e.g. a line beginning "AZ is a state in the US" would match).
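[Editor's aside.] Putting the suggestions from this thread together, here is one possible sketch in current Python 3 syntax (an illustration, not dmrmv-ga's emailed script). It uses the "^[A-Z][A-Z] " pattern to pick out header lines, assumes each citation opens with an SE line (true of the two examples in the question, but an assumption), strips the " words" suffix, and grows the column list dynamically so an optional BY field is handled. It inherits the caveat above: a lead-paragraph line that happens to start with two capitals and a space would be misread as a header.

```python
# Hedged sketch: parse row-headed citations out of .txt files into index.csv.
# Assumptions: header lines match "^[A-Z][A-Z] "; each record starts with SE.
import csv
import glob
import re

HEADER_RE = re.compile(r'^([A-Z][A-Z]) (.*)$')  # e.g. "SE Editorial"

def parse_records(lines):
    """Yield one dict per citation; a new SE line starts the next record."""
    record = {}
    for line in lines:
        m = HEADER_RE.match(line.rstrip('\n'))
        if not m:
            continue                    # skip lead paragraphs and blank lines
        tag, value = m.groups()
        if tag == 'SE' and record:      # assumption: SE opens each citation
            yield record
            record = {}
        if tag == 'WC':
            value = value.replace(' words', '')   # drop the " words" suffix
        record[tag] = value
    if record:
        yield record

records = []
for name in glob.glob('*.txt'):         # every .txt file in the folder
    with open(name) as f:
        records.extend(parse_records(f))

# Grow the header row dynamically as new tags (e.g. an optional BY) appear
fieldnames = []
for rec in records:
    for tag in rec:
        if tag not in fieldnames:
            fieldnames.append(tag)

with open('index.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()                # records missing a tag get a blank cell
    writer.writerows(records)
```

DictWriter's restval='' fills a blank cell when a record lacks one of the discovered tags, which is what makes the varying BY field workable.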
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 13 Mar 2006 10:58 PST
Hello dmrmv, Thanks again for the help and offers of help (I'm going to read up on regular expressions!). Unfortunately, I don't have a website just yet, so I'm not sure I can do the link thing (yet another mountain to climb!!). However, I can send you a file as an attachment if you send me an email at the following address: pigeom@magma.ca. Cheers
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 13 Mar 2006 16:33 PST
To dmrmv, After some fumbling around, I figured out that my Internet provider includes some webpage space with my DSL line. I've also figured out how to paste a link to one of my text files via my "website" (I use that term loosely, given the meagre quantity and quality of stuff on said site). Anyway, here's the link: http://www.magma.ca/~pigeom/index.htm. Click on "text files" to pull up a sample of one of my txt files. Thanks again for taking a stab at this. I'm learning a lot as I go. Cheers
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 14 Mar 2006 22:00 PST
Hi Marc-Andre: I downloaded your file and emailed you a script that I think will work for you. I'm not a Google Researcher; my apologies if they feel that I've poached on their preserve. However, it fills a need as a practical approach to something I needed to do anyway. Regards, DMR.
Subject: Re: Converting row-based text files into excel, access or spss
From: q_juno-ga on 15 Mar 2006 17:06 PST
If you think you'll need to do this type of thing repeatedly, you might consider Datawatch's Monarch. I've used it for years for parsing text files, from little table files to big 150 MB data downloads. The Pro version is expensive at over $600, but worth it for me; it has lookup-table and database-linking capability, parses HTML and PDF files too, and outputs to almost anything.