Google Answers: Script to manipulate .csv database files

View Question

Q: Script to manipulate .csv database files ( Answered 5 out of 5 stars

Question

Subject: Script to manipulate .csv database files
Category: Computers > Programming
Asked by: superunknown-ga
List Price: $55.00

Posted: 16 Sep 2004 16:41 PDT
Expires: 16 Oct 2004 16:41 PDT
Question ID: 402277

I'm looking for a script that will process a CSV file. I've looked
into writing one myself in python, but think its probably
(definetely?!) beyond me.

The script would need to be able to be made executable in windows or
OS X, so that I could launch it, enter the location of the .csv input
file at a prompt, and let it process the data.

The script would need to work as follows:

1 - Sort all records in ascending order (i.e. a,b,c...) by the sixth
column (the sixth entry on each line). Obviously the first line of the
csv file would contain the names of the columns for the subsequent
records, so this line would not be sorted.

2 - Now discard all the records that contain anything other than what
is now in the 6th column in the first record. So for example:

Line 1  A
Line 2  A
Line 3  A
Line 4  B
Line 5  C

- Lines 4 and 5 would be removed from the file. (Just want to make
sure I'm explaining myself properly!)

3 - The records would then need to be sorted according to a set of 3
priorities. Firstly - sort the records by the 4th column, again in
ascending order, as the highest priority.
The next priority would be to sort the 18th column in DESCENDING order.
And the final priority would be to sort the 6th column in ASCENDING order.

So after this columns 4, 18 and 6 would look like this:

line 1  A  Z  A
line 2  A  Z  B
line 3  B  A  A
line 4  D  A  B

4 - The final part of the script would be to seperate the records
according to what is contained in column 4, and output all records
containing the same values in column 4 to seperate files in a new
folder. Records would be stripped of all columns after number 7, and a
new field with a custom value would be appended on straight afterwards
to make 8 fields in the outputted files. A custom first line would
also be appended to each file. Ideally the files generated would take
their name from the value that their records held in column 4.

Explaining that was more difficult than I thought it would be!

I've not much idea of the difficulty of writing such a script, so if
it goes beyond the remit of google answers then just let me know, or
if the price is wildly unrealistic then please also let me know. Like
I say i've little idea how difficult this would be to pull off.

I've any questions need to be asked or more information supplied, just let me know.

Thankyou!

Request for Question Clarification by palitoy-ga on 17 Sep 2004 11:34 PDT
Hello superunknown

Does the script have to be written in python or could Perl be used instead?

Clarification of Question by superunknown-ga on 17 Sep 2004 12:54 PDT

No, the script doesn't have to be in python at all. It could be in
perl or whatever the most suitable language is for the job, just so
long as it could be easily run primarily on windows xp, but ideally
also on OS X / linux.

Answer

Subject: Re: Script to manipulate .csv database files
Answered By: leapinglizard-ga on 17 Sep 2004 18:29 PDT
Rated: 5 out of 5 stars

Dear superunknown, I have labored over the past five hours to write a Python script according to your specifications, and hereby present to you a first draft. I call it a first draft because it is unlikely that any program of this length, no matter how skillful the programmer, is free of bugs. I would be surprised if this one were. I have tested it to some extent on my own machine with a toy example, but I do not have access to your particular files. Bugs and incompatibility problems are likely to arise, as I detail below. Bear in mind, however, that I am committed to stamping out every single bug and modifying any feature until the program suits your needs precisely. My work on this question is not done until you are completely satisfied. In order to try out the script, you will first have to examine and modify the six constants defined near the beginning. Two of these are the custom field value and custom header that you requested, while the remaining four govern the behavior of the script to make it more adaptable to different circumstances. If the meaning of any of these constants is unclear, just ask. Once you have adjusted them to your liking, execute the script from a command line by typing python.exe grok.py on a Windows box or python grok.py on Linux, and see what happens. If the script crashes with a nonsensical error message, report back to me with a text dump so that I can get to work immediately on fixing the problem. If the error is reported by my own script -- you'll be able to tell from the relatively friendly message beginning with "error:" -- then it's a problem I've anticipated, and again I'll want to see a dump. Your choice of operating system may have a number of characteristics that I was unable to predict in advance. Among these unpredictable characteristics is the required reading and writing mode for .csv files. I can read and write them in binary mode on my Linux box, but your system may require text mode. A somewhat more significant problem is that of the format to which your particular CSV files adhere, since there is no single accepted CSV standard. The script at present assumes that it is reading a file in Excel's CSV format. If this is not the case, you will have to post a sample file so that I can analyze the formatting differences and implement them accordingly. Yet another difficulty is with the comparison operators, since I don't know what kinds of values you have in your fields. At present, the script properly sorts numbers and strings. If you have other kinds of data in your fields, again I'll have to see a sample file for analysis. I'm also not quite sure I've correctly interpreted your requirements for "custom value" and "custom first line", but I will work with you to adapt this and any other program feature to meet your needs. In addition, I would like to point out a few suboptimal aspects of the algorithm you specified, which I nonetheless implemented exactly to order. In particular, there is no need to sort the records by their sixth field if you merely want to find out the lowest value present in the sixth field of any record. This could be done by a traversal of the records in their existing order. Then you ask that records with any other value in the sixth field be thrown out, but later, you specify that the records be sorted according to values in the sixth field, yet the effect of the filtering step is that these values are all the same. Please don't think that I am criticizing your choice of algorithm. There may be details I don't understand, and perhaps you intend to modify the algorithm later so that these steps are necessary after all. I only wanted to bring these slight inefficiencies to your attention in case you weren't aware of them. In any event, I have implemented them according to spec. The script follows. I have included some inline comments so that it won't be entirely incomprehensible to those who have done some programming. #!/usr/bin/python # ----- begin grok.py # *** adjust constants as necessary CUSTOM_FIELD_VALUE = 'change_me' # this value is added as the eighth field of each record # remember, fields beyond seventh are previously dropped CUSTOM_BEGINNING = 'change_me_as_well' # each result file begins with this text # careful: the script adds a newline character DISCARD_FIRST_LINE = 1 # if non-zero, first line of CSV file is thrown out # if zero, first line of CSV file is preserved RESULT_FOLDER_NAME = 'test_results' # name of folder where results are to be stored OVERWRITE_RESULTS = 0 # if non-zero, existing result folder is used # if zero, new numbered result folder is made PUT_RESULTS_THERE = 1 # if non-zero, results are stored next to CSV file # if zero, results are stored at execution point # * end of constants import sys, os, string, re import csv def cmp_six(a, b): return cmp(a[5], b[5]) def cmp_complex(a, b): v = cmp(a[3], b[3]) if v != 0: return v v = cmp(b[17], a[17]) if v != 0: return v return cmp(a[5], b[5]) print 'starting grok.py' while 1: # prompt user for file name sys.stdout.write('enter a file name, or nothing to quit >') fname = sys.stdin.readline().strip() if fname == '': print '\nexiting grok.py' break # check validity of file name if not os.path.isfile(fname): print 'debug:', sys.exc_info() print 'error: "%s" is not a valid file name' % fname continue # open file and begin reading records = [] print 'reading from file "%s"' % fname try: reader = csv.reader(open(fname, 'rb'), dialect='excel') line_num = 0 # discard header if required if DISCARD_FIRST_LINE: print 'discarding header %s' % reader.next() line_num += 1 # parse all lines for line_list in reader: records.append(line_list) line_num += 1 except csv.Error: print 'debug:', sys.exc_info() print 'error: csv parser failed on line %d' % line_num continue print 'read %d lines' % line_num # attempt to sort records by sixth field try: records.sort(cmp_six) except: print 'debug:', sys.exc_info() print 'error: failed to sort records by sixth field' continue # filter out records where sixth field does not have the highest value master_value = records[0][5] new_records = [] for record in records: if record[5] == master_value: new_records.append(record) records = new_records # attempt to sort: 4th field ascending, 18th descending, 6th descending try: records.sort(cmp_complex) except: print 'debug:', sys.exc_info() print 'error: failed to sort records by fields 4, 18, 6' continue # attempt to locate or make result folder write_dir = os.getcwd() if PUT_RESULTS_THERE: write_dir = os.path.dirname(fname) result_dir = os.path.join(write_dir, RESULT_FOLDER_NAME) if os.path.exists(result_dir): if OVERWRITE_RESULTS: print 'reusing result folder "%s"' % result_dir else: i = 1 while 1: new_result_dir = '%s.%05d' % (result_dir, i) if not os.path.exists(new_result_dir): result_dir = new_result_dir print 'making result folder "%s"' % result_dir os.makedirs(result_dir) break i += 1 if i == 105: break if i == 10**5: print 'error: too many result folders already exist' continue else: print 'making result folder "%s"' % result_dir os.makedirs(result_dir) # separate records according to the fourth field, stripping fields past 7th h = {} record = None try: for record in records: val = record[3] if h.has_key(val): h[val].append(record[:7]) else: h[val] = [record[:7]] except: print 'debug:', sys.exc_info() print 'error: record separation failed on >%s<' % record continue # write out records to files, prepending custom line and appending 8th field try: keys = h.keys() keys.sort() for key in keys: res_fname = os.path.join(result_dir, 'result_%s.csv' % key) print 'making result file "%s"' % res_fname outf = open(res_fname, 'wb') outf.write(CUSTOM_BEGINNING+'\n') writer = csv.writer(outf, dialect='excel') line_num = 1 for row in h[key]: line_num += 1 writer.writerow(row + [CUSTOM_FIELD_VALUE]) outf.close() except: print 'debug:', sys.exc_info() print 'error: failed to write line %d of "%s"' % (line_num, res_fname) continue print 'completed processing "%s"' % fname # ----- end grok.py Let me reiterate that I do expect some bugs and/or incompatibilities to arise, and I understand that you may not be enamored with some of the features. But don't forget that I'm here to serve you, and that I won't rest until everything is perfect. Report any and all problems to me with Clarification Requests. Regards, leapinglizard
Request for Answer Clarification by superunknown-ga on 18 Sep 2004 00:51 PDT leapinglizard - wow, thankyou for your hard work. I don't have time to test out the script at this moment, but will do so later today, and then get back to you.
Clarification of Answer by leapinglizard-ga on 18 Sep 2004 03:04 PDT Very good. I'm looking forward to your preliminary results. leapinglizard
Request for Answer Clarification by superunknown-ga on 18 Sep 2004 13:05 PDT OK, i've tested the script on my OS X machine, and it works perfectly. Once again thanks for your efforts. I replaced the values at the top with my own, and they were inserted without a problem. With regard to the sixth field being sorted twice - sorry for the confusion that caused - it was an error on my part. I should have said the seventh field. Anyway i replaced the line: return cmp(a[5], b[5]) with: return cmp(a[6], b[6]) and as i say it worked perfectly. as regards sorting/removing in the sixth field, yes i can see the method i specified isn't optimal. If you think its worth the effort to rewrite that section then please go ahead. However i doubt its worth the effort on your part, as the existing way works fine (I'm not dealing with enormous files, so speed isn't that much of an issue. On the file I tested the script only took a couple of seconds to run). The only other possible request I could make would be to ask whether it would be possible to replace the name of the folder created with today's date. Its not a big issue, but if its easily done that would be a bonus. I haven't tested on a windows box yet. I'm sure it'll be fine, but is there anything I should be aware of when running the script on that platform? as soon as you get back to me on those two things I will of course rate the answer 5 stars. thanks for the great script leapinglizard.
Clarification of Answer by leapinglizard-ga on 18 Sep 2004 13:57 PDT Dear superunknown, I'm glad the script is working. In the new version below, I've replaced the sixth-field sort with a minimum-value search, and in the later sort I've made the correction from sixth to seventh. As for using the date in the name of the result folder, you'll see that I've modified the definition of RESULT_FOLDER_NAME accordingly. If you'd prefer to have the American date format, where the month and day come before the year, you can assign the first three elements of the time.localtime() tuple to three named variables, which you may then apply in a different order to the string format, like so. (year, month, day) = time.localtime()[:3] RESULT_FOLDER_NAME = 'results.%02d.%02d.%d' % (month, day, year) My biggest fears of script incompatibility revolved around the format of your CSV files. If you were able to process your files under OS X without a hitch, you will, in all likelihood, be able to process the same files under Windows. It's true that line terminators are traditionally different in Windows and Unix, but since I read and write files only through Python's csv module, the portability of Python should ensure the portability of my script. The only thing I'm uncertain about is the i/o mode for text files. You'll see in the script that I open files using the specifier 'rb' (for reading) and 'wb' (for writing), where the character 'b' means binary mode. I read a disquieting note in the Python documentation yesterday, according to which some platforms require binary-mode file access and others require text mode. Under Linux, both will work just fine for text files. I really don't know whether this is true under Windows XP. If there seems to be trouble with the file access, the first thing to try would be to remove that 'b' from the specifiers 'rb' and 'wb', so that they become just 'r' and 'w'. If it's anything else, get back to me and I'll troubleshoot. Regards, leapinglizard # ----- begin grok.py import sys, os, string, re import time, csv # *** adjust constants as necessary CUSTOM_FIELD_VALUE = 'change_me' # this value is added as the eighth field of each record # remember, fields beyond seventh are previously dropped CUSTOM_BEGINNING = 'change_me_as_well' # each result file begins with this text # careful: the script adds a newline character DISCARD_FIRST_LINE = 1 # if non-zero, first line of CSV file is thrown out # if zero, first line of CSV file is preserved RESULT_FOLDER_NAME = 'results.%d.%02d.%02d' % time.localtime()[:3] # name of folder where results are to be stored OVERWRITE_RESULTS = 0 # if non-zero, existing result folder is used # if zero, new numbered result folder is made PUT_RESULTS_THERE = 1 # if non-zero, results are stored next to CSV file # if zero, results are stored at execution point # * end of constants def cmp_complex(a, b): v = cmp(a[3], b[3]) if v != 0: return v v = cmp(b[17], a[17]) if v != 0: return v return cmp(a[6], b[6]) print 'starting grok.py' while 1: # prompt user for file name sys.stdout.write('enter a file name, or nothing to quit >') fname = sys.stdin.readline().strip() if fname == '': print '\nexiting grok.py' break # check validity of file name if not os.path.isfile(fname): print 'debug:', sys.exc_info() print 'error: "%s" is not a valid file name' % fname continue # open file and begin reading records = [] print 'reading from file "%s"' % fname try: reader = csv.reader(open(fname, 'rb'), dialect='excel') line_num = 0 # discard header if required if DISCARD_FIRST_LINE: print 'discarding header %s' % reader.next() line_num += 1 # parse all lines for line_list in reader: records.append(line_list) line_num += 1 except csv.Error: print 'debug:', sys.exc_info() print 'error: csv parser failed on line %d' % line_num continue print 'read %d lines' % line_num # attempt to find minimal value in sixth field try: min_value = records[0][5] for record in records: if record[5] < min_value: min_value = record[5] except: print 'debug:', sys.exc_info() print 'error: failed to find minimal value in sixth field' continue # filter out records where sixth field does not have the highest value new_records = [] for record in records: if record[5] == min_value: new_records.append(record) records = new_records # attempt to sort: 4th field ascending, 18th descending, 7th descending try: records.sort(cmp_complex) except: print 'debug:', sys.exc_info() print 'error: failed to sort records by fields 4, 18, 7' continue # attempt to locate or make result folder write_dir = os.getcwd() if PUT_RESULTS_THERE: write_dir = os.path.dirname(fname) result_dir = os.path.join(write_dir, RESULT_FOLDER_NAME) if os.path.exists(result_dir): if OVERWRITE_RESULTS: print 'reusing result folder "%s"' % result_dir else: i = 1 while 1: new_result_dir = '%s.%05d' % (result_dir, i) if not os.path.exists(new_result_dir): result_dir = new_result_dir print 'making result folder "%s"' % result_dir os.makedirs(result_dir) break i += 1 if i == 105: break if i == 10**5: print 'error: too many result folders already exist' continue else: print 'making result folder "%s"' % result_dir os.makedirs(result_dir) # separate records according to the fourth field, stripping fields past 7th h = {} record = None try: for record in records: val = record[3] if h.has_key(val): h[val].append(record[:7]) else: h[val] = [record[:7]] except: print 'debug:', sys.exc_info() print 'error: record separation failed on >%s<' % record continue # write out records to files, prepending custom line and appending 8th field try: keys = h.keys() keys.sort() for key in keys: res_fname = os.path.join(result_dir, 'result_%s.csv' % key) print 'making result file "%s"' % res_fname outf = open(res_fname, 'wb') outf.write(CUSTOM_BEGINNING+'\n') writer = csv.writer(outf, dialect='excel') line_num = 1 for row in h[key]: line_num += 1 writer.writerow(row + [CUSTOM_FIELD_VALUE]) outf.close() except: print 'debug:', sys.exc_info() print 'error: failed to write line %d of "%s"' % (line_num, res_fname) continue print 'completed processing "%s"' % fname # ----- end grok.py
Clarification of Answer by leapinglizard-ga on 18 Sep 2004 14:03 PDT I just noticed an erroneous comment. Where it says # attempt to sort: 4th field ascending, 18th descending, 7th descending this is untrue. The script does, of course, sort by ascending value of the 7th field. If you're a perfectionist, you'll want to replace this line with the following. # attempt to sort: 4th field ascending, 18th descending, 7th ascending leapinglizard

superunknown-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $10.00

Comments

Subject: Re: Script to manipulate .csv database files
From: leapinglizard-ga on 18 Sep 2004 15:07 PDT

Thank you for the five-star rating and the kind tip.

leapinglizard

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy