Subject: Converting row-based text files into excel, access or spss
Category: Computers > Software
Asked by: pige-ga
List Price: $20.00
Posted: 02 Mar 2006 10:38 PST
Expires: 01 Apr 2006 10:38 PST
Question ID: 702938
I have a large number of full-text indexed newspaper stories (2000+) downloaded from a database as 'txt' files and want to know if there is a fast and easy way to convert the indices associated with each story into a usable format in either Excel, Access or SPSS. As you can see from the two examples below (I've stripped out the full text of the news stories to simplify the presentation), each index has identical row headings. My file has 2000+ news stories, so it has 2000+ indices, each with identical row headings. The result is a lot of needless duplication that impedes my ability to create, for example, a time series of news stories on a particular topic, or to simply sum up word counts for the selected stories.

SE Editorial
HD Solace without substance
WC 593 words
PD 29 December 1977
SN The Globe and Mail
SC GLOB
PG P6
LA English
CY All material copyright Thomson Canada Limited or its licensors. All rights reserved.

SE Editorial
HD An $11 billion challenge
WC 552 words
PD 1 March 1978
SN The Globe and Mail
SC GLOB
PG P6
LA English

My goal, then, is to end up with an Excel, Access or SPSS file where I have columns for each row heading, i.e., SE (section), HD (headline), WC (word count), PD (publication date), and so on, and where each row replicates the existing index data for one news story. I'd also like to strip out the text string "words" in the WC row heading. My "dream table" would look something like this:

SE         HD                        WC   PD                ETC...
Editorial  Solace without Substance  593  29 December 1977  ...
Editorial  An $11 billion challenge  552  1 March 1978      ...

By way of background, I should add that I've already done some fairly extensive searches on the Internet. From what I can tell, all the text conversion software out there, including the built-in text import features in Excel, Access and SPSS (I've tried 'em all!), is premised on the belief that data identifiers (i.e., SE, HD, WC, PD, etc.) are indicated in column headings rather than row headings, AND that the identifiers appear only once rather than multiple times in the same file. I've also stumbled across a rather cumbersome way of achieving my goal: import the downloaded indices into ENDNOTE, do some searches and replaces to remove offending bits, export the file as tab delimited, import it into Excel, sort the records by heading (SE, HD, WC, PD, etc.), cut the resulting data, and, finally, paste that data into a separate spreadsheet. Very time consuming and cumbersome, but it works. My question is aimed at finding a fast and easy way to do the same thing, i.e., to translate my big text files into usable spreadsheets or databases. I'd be willing to purchase a reasonably priced software package that could reliably handle this task in a simple way. Thanks kindly!
There is no answer at this time.
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 03 Mar 2006 09:54 PST
Your second example header list above doesn't include the "CY" field. Are the headers always the same (number and order), or do they in fact vary?
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 03 Mar 2006 21:13 PST
At the end of this comment is a Python program that does most of what you want. It will parse all .txt files in the folder in which it is run and output a single file, index.csv, with the headers converted into rows. Excel and Access will both import .csv (comma separated value) files. Given your headers, it will produce (where EOL is the end of the line, i.e. the CR/LF pair on Wintel):

SE,HD,WC,PD,SN,SC,PG,LA,CY(EOL)
Editorial,Solace without substance,593 words,29 December 1977,The Globe and Mail,GLOB,P6,English,All material copyright Thomson Canada Limited or its licensors. All rights reserved.(EOL)
Editorial,An $11 billion challenge,552 words,1 March 1978,The Globe and Mail,GLOB,P6,English,(EOL)

Note that the last line is truncated because you didn't have the CY field in the second example; of course, the program becomes much more complex if you don't always have the same fields. You should be prepared to pay much more in that case.

This is a minimal implementation. It doesn't strip the " words" from the WC column, because that is a fairly trivial addition and would make reading the code more difficult (it would also be easy to do in Excel or Access afterward). It also hard-codes the order of headers that you used to create the header row in the output file. I have loaded the resulting .csv file into Excel and it works fine. Also, I am not a Google Researcher, so I can't make this an answer; this will allow a researcher to provide a completely functional answer and fairly earn the money you have posted.

To run this program you will need an implementation of the Python programming language on your computer; you can download one from python.org. I'm teaching myself Python, and when I saw this question I contemplated how I would do it in C, in which I am fairly proficient, and then in Python, which I am interested in learning because I have an expanded need to process text files. I can tell you that doing this in C would have taken me much longer, but I haven't used C much for manipulating text files, so I haven't built any libraries that would have helped either. If I had to use Python to program an analytical instrument through a serial port, it would take me longer than in C, I'm sure. Anyway, the above editorial is the price you pay for getting the code.

Snip out all of the lines between the Begin------- and End------- lines, paste them into a file called 'parseit.py', then issue the command 'python parseit.py' in the folder where your .txt files are.

Begin parseit.py -------------------------------------
import glob

def parsefile( filename, ofile ):
    ifile = open( filename, 'r' )
    writeline = ""
    print filename
    for i in range(0,9):
        templine = ifile.readline()[:-1]
        print '\t' + templine
        writeline = writeline + ( templine[3:].strip() + ',' )
    writeline = writeline[:-1] + '\n'
    ofile.write( writeline )
    ifile.close()

filelist = glob.glob( '*.txt' )   # This is where the target files are read
filenum = len( filelist )
ofile = open( 'index.csv', 'w' )
ofile.write( "SE,HD,WC,PD,SN,SC,PG,LA,CY\n" )   # This is the header row
for i in range( 0, filenum ):
    parsefile( filelist[i], ofile )
ofile.close()
End parseit -------------------------------------------------------------
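[Editor's aside, not part of the posted script.] One caveat worth noting about the manual comma-join in the script above: if a field value itself contains a comma (a headline easily could), the resulting .csv column alignment breaks. Python's standard csv module quotes such fields automatically. A minimal sketch in current Python 3 syntax, using a made-up headline to illustrate:

```python
# Hedged sketch: csv.writer quotes fields that contain commas, which a
# plain ','.join would not. The headline here is invented for illustration.
import csv

row = ['Editorial', 'Guns, butter and budgets', '593 words', '29 December 1977']
with open('index.csv', 'w', newline='') as out:
    csv.writer(out).writerow(row)   # the comma-containing headline is quoted
```

When Excel or Access imports the resulting file, the quoted headline stays in a single column.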
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 06 Mar 2006 09:37 PST
Hi, First off, THANK YOU! I'll get myself a copy of Python and give it a whirl (I'm not a programmer, but I am starting to give serious thought to putting in the requisite time to learn at least one programming language :). In response to your question about the CY field, it was accidentally omitted from the second example (bad cut & paste job!). The fields are identical all the way through, so that should take care of the potential complications you noted. Thanks again. I'll let you know how I make out with the Python script. Cheers
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 06 Mar 2006 11:21 PST
Hammer, For now, it's a one-time need, although I have several large datasets (downloaded and indexed text files) to process. I can see myself having to do this again in the future, although I have no immediate plans as of now. As for Excel vs. Access, my preference is Excel; I have the Office 2003 version. Cheers (and thanks!) Marc-André
Subject: Re: Converting row-based text files into excel, access or spss
From: cynthia-ga on 06 Mar 2006 13:28 PST
If you convert the file to Excel, this (formulated) example might be helpful: http://www.ozgrid.com/forum/showthread.php?t=26511
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 08 Mar 2006 08:42 PST
To the person who wrote the Python script: Just a follow-up. I've finally wrapped my head around Python enough to get your script to work. Well, sort of. The script worked well for the first citation but not the rest of 'em (I have 2000+ such citations in the same txt file) -- it also seems to retain the field identifiers (PD, WC, etc.). I'm probably doing something wrong, but I'm not sure what. Also, it turns out that my citations are not as identical as I thought: some news stories have a "reporter" field (BY:) and some don't, depending on the nature of the story. Finally, I didn't note this initially (I didn't even dare hope someone would be kind enough to try their hand at programming something), but each citation also comes with the lead paragraph of the story in question, which I'm sure complicates immensely the task of parsing my very large file. Anyway, thanks again for taking a stab at this. I'm trying to teach myself enough Python to take a stab at this myself. We'll see how I make out. Cheers Marc-Andre
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 08 Mar 2006 10:20 PST
One last comment to the person who wrote the Python script, in case they look here again... You said it would be easy to strip the word "words" from a cell in Excel. How exactly would you do that? Cheers Marc-Andre
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 10 Mar 2006 11:06 PST
Hi, I wrote the Python script. I don't think it would be hard to rework it so the headers aren't hard-coded: you could just search for a regular expression that matches two uppercase characters and a space (assuming they are all in that format). While it would be a pretty simple expression, I'm not familiar enough with regular expressions to know exactly what it would be; probably something like "[A-Z][A-Z] ". You would have to either know beforehand what all of the potential headers are, though, or grow your output table dynamically by adding columns as each new header is found. That would be a little more complicated to program.

The " words" suffix would be easier to strip in Python. Either use the string method replace():

if string1.find( ' words' ) > -1:
    print string1.replace( ' words', '' )

or just print the string up to " words":

# the [:-6] indicates from the beginning of the string to 6 characters
# left of the end of the string
if string1.find( ' words' ) > -1:
    print string1[:-6]

In Excel, you would create two new (temporary) columns beside the one you wish to change (assuming your new columns are B and C, and your old one is A) and fill column B with the formula =SUBSTITUTE(A1," words",""). This trims " words" from each cell in column A. Now select the B column, Copy, then Paste Special into column C and select Values; this pastes just the text, not the formula. Finally, delete columns A and B.

I had thought from your original post that each file contained only one abstract, so you would have to rewrite the script to behave accordingly. If you put up a link to one of the files, I wouldn't mind trying to do it.
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 10 Mar 2006 11:13 PST
I just noticed I should have put a beginning-of-line character in front of the regular expression, in case there are uppercase letter pairs elsewhere. I have tested it and it works: "^[A-Z][A-Z] ". It would still cause problems if a letter pair starts a line within the abstract (e.g. a line beginning "AZ is a state in the US" would match).
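[Editor's aside.] Putting the suggestions from this thread together, here is one possible sketch in current Python 3 syntax (an illustration, not dmrmv-ga's emailed script). It uses the "^[A-Z][A-Z] " pattern to pick out header lines, assumes each citation opens with an SE line (true of the two examples in the question, but an assumption), strips the " words" suffix, and grows the column list dynamically so an optional BY field is handled. It inherits the caveat above: a lead-paragraph line that happens to start with two capitals and a space would be misread as a header.

```python
# Hedged sketch: parse row-headed citations out of .txt files into index.csv.
# Assumptions: header lines match "^[A-Z][A-Z] "; each record starts with SE.
import csv
import glob
import re

HEADER_RE = re.compile(r'^([A-Z][A-Z]) (.*)$')  # e.g. "SE Editorial"

def parse_records(lines):
    """Yield one dict per citation; a new SE line starts the next record."""
    record = {}
    for line in lines:
        m = HEADER_RE.match(line.rstrip('\n'))
        if not m:
            continue                    # skip lead paragraphs and blank lines
        tag, value = m.groups()
        if tag == 'SE' and record:      # assumption: SE opens each citation
            yield record
            record = {}
        if tag == 'WC':
            value = value.replace(' words', '')   # drop the " words" suffix
        record[tag] = value
    if record:
        yield record

records = []
for name in glob.glob('*.txt'):         # every .txt file in the folder
    with open(name) as f:
        records.extend(parse_records(f))

# Grow the header row dynamically as new tags (e.g. an optional BY) appear
fieldnames = []
for rec in records:
    for tag in rec:
        if tag not in fieldnames:
            fieldnames.append(tag)

with open('index.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()                # records missing a tag get a blank cell
    writer.writerows(records)
```

DictWriter's restval='' fills a blank cell when a record lacks one of the discovered tags, which is what makes the varying BY field workable.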
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 13 Mar 2006 10:58 PST
Hello dmrmv, Thanks again for the help and offers of help (I'm going to read up on regular expressions!). Unfortunately, I don't have a website just yet, so I'm not sure I can do the link thing (yet another mountain to climb!!). However, I can send you a file as an attachment if you send me an email at the following address: pigeom@magma.ca. Cheers
Subject: Re: Converting row-based text files into excel, access or spss
From: pige-ga on 13 Mar 2006 16:33 PST
To dmrmv, After some fumbling around, I figured out that my Internet provider includes some webpage space with my DSL line. I've also figured out how to paste a link to one of my text files via my "website" (I use that term loosely, given the meagre quantity and quality of stuff on said site). Anyway, here's the link: http://www.magma.ca/~pigeom/index.htm. Click on "text files" to pull up a sample of one of my txt files. Thanks again for taking a stab at this. I'm learning a lot as I go. Cheers
Subject: Re: Converting row-based text files into excel, access or spss
From: dmrmv-ga on 14 Mar 2006 22:00 PST
Hi Marc-Andre: I downloaded your file and emailed you a script that I think will work for you. I'm not a Google Researcher; my apologies if they feel that I've poached on their preserve. However, it fills a need as a practical approach to something I needed to do anyway. Regards, DMR.
Subject: Re: Converting row-based text files into excel, access or spss
From: q_juno-ga on 15 Mar 2006 17:06 PST
If you think you'll need to do this type of thing repeatedly, you might consider Datawatch's Monarch. I've used it for years for parsing text files, from little table files to big 150 MB data downloads. The Pro version is expensive at over $600, but worth it for me; it has lookup-table and database-linking capability, parses HTML and PDF files too, and outputs to almost anything.