Google Answers Logo
View Question
 
Q: Script to manipulate .csv database files ( Answered 5 out of 5 stars,   1 Comment )
Question  
Subject: Script to manipulate .csv database files
Category: Computers > Programming
Asked by: superunknown-ga
List Price: $55.00
Posted: 16 Sep 2004 16:41 PDT
Expires: 16 Oct 2004 16:41 PDT
Question ID: 402277
I'm looking for a script that will process a CSV file. I've looked
into writing one myself in python, but think its probably
(definetely?!) beyond me.

The script would need to be able to be made executable in windows or
OS X, so that I could launch it, enter the location of the .csv input
file at a prompt, and let it process the data.

The script would need to work as follows:

1 - Sort all records in ascending order (i.e. a,b,c...) by the sixth
column (the sixth entry on each line). Obviously the first line of the
csv file would contain the names of the columns for the subsequent
records, so this line would not be sorted.

2 - Now discard all the records that contain anything other than what
is now in the 6th column in the first record. So for example:

Line 1  A
Line 2  A
Line 3  A
Line 4  B
Line 5  C

- Lines 4 and 5 would be removed from the file. (Just want to make
sure I'm explaining myself properly!)

3 - The records would then need to be sorted according to a set of 3
priorities. Firstly - sort the records by the 4th column, again in
ascending order, as the highest priority.
The next priority would be to sort the 18th column in DESCENDING order.
And the final priority would be to sort the 6th column in ASCENDING order.

So after this columns 4, 18 and 6 would look like this:

line 1  A  Z  A
line 2  A  Z  B
line 3  B  A  A
line 4  D  A  B

4 - The final part of the script would be to seperate the records
according to what is contained in column 4, and output all records
containing the same values in column 4 to seperate files in a new
folder. Records would be stripped of all columns after number 7, and a
new field with a custom value would be appended on straight afterwards
to make 8 fields in the outputted files. A custom first line would
also be appended to each file. Ideally the files generated would take
their name from the value that their records held in column 4.

Explaining that was more difficult than I thought it would be!

I've not much idea of the difficulty of writing such a script, so if
it goes beyond the remit of google answers then just let me know, or
if the price is wildly unrealistic then please also let me know. Like
I say i've little idea how difficult this would be to pull off.

I've any questions need to be asked or more information supplied, just let me know.

Thankyou!

Request for Question Clarification by palitoy-ga on 17 Sep 2004 11:34 PDT
Hello superunknown

Does the script have to be written in python or could Perl be used instead?

Clarification of Question by superunknown-ga on 17 Sep 2004 12:54 PDT
No, the script doesn't have to be in python at all. It could be in
perl or whatever the most suitable language is for the job, just so
long as it could be easily run primarily on windows xp, but ideally
also on OS X / linux.
Answer  
Subject: Re: Script to manipulate .csv database files
Answered By: leapinglizard-ga on 17 Sep 2004 18:29 PDT
Rated:5 out of 5 stars
 
Dear superunknown,

I have labored over the past five hours to write a Python script
according to your specifications, and hereby present to you a first
draft. I call it a first draft because it is unlikely that any program
of this length, no matter how skillful the programmer, is free of
bugs. I would be surprised if this one were. I have tested it to some
extent on my own machine with a toy example, but I do not have access
to your particular files. Bugs and incompatibility problems are likely
to arise, as I detail below. Bear in mind, however, that I am
committed to stamping out every single bug and modifying any feature
until the program suits your needs precisely. My work on this question
is not done until you are completely satisfied.

In order to try out the script, you will first have to examine and
modify the six constants defined near the beginning. Two of these are
the custom field value and custom header that you requested, while the
remaining four govern the behavior of the script to make it more
adaptable to different circumstances. If the meaning of any of these
constants is unclear, just ask. Once you have adjusted them to your
liking, execute the script from a command line by typing

  python.exe grok.py

on a Windows box or

  python grok.py

on Linux, and see what happens. If the script crashes with a
nonsensical error message, report back to me with a text dump so that
I can get to work immediately on fixing the problem. If the error is
reported by my own script -- you'll be able to tell from the
relatively friendly message beginning with "error:" -- then it's a
problem I've anticipated, and again I'll want to see a dump. Your
choice of operating system may have a number of characteristics that I
was unable to predict in advance.

Among these unpredictable characteristics is the required reading and
writing mode for .csv files. I can read and write them in binary mode
on my Linux box, but your system may require text mode. A somewhat
more significant problem is that of the format to which your
particular CSV files adhere, since there is no single accepted CSV
standard. The script at present assumes that it is reading a file in
Excel's CSV format. If this is not the case, you will have to post a
sample file so that I can analyze the formatting differences and
implement them accordingly. Yet another difficulty is with the
comparison operators, since I don't know what kinds of values you have
in your fields. At present, the script properly sorts numbers and
strings. If you have other kinds of data in your fields, again I'll
have to see a sample file for analysis. I'm also not quite sure I've
correctly interpreted your requirements for "custom value" and "custom
first line", but I will work with you to adapt this and any other
program feature to meet your needs.

In addition, I would like to point out a few suboptimal aspects of the
algorithm you specified, which I nonetheless implemented exactly to
order. In particular, there is no need to sort the records by their
sixth field if you merely want to find out the lowest value present in
the sixth field of any record. This could be done by a traversal of
the records in their existing order. Then you ask that records with
any other value in the sixth field be thrown out, but later, you
specify that the records be sorted according to values in the sixth
field, yet the effect of the filtering step is that these values are
all the same. Please don't think that I am criticizing your choice of
algorithm. There may be details I don't understand, and perhaps you
intend to modify the algorithm later so that these steps are necessary
after all. I only wanted to bring these slight inefficiencies to your
attention in case you weren't aware of them. In any event, I have
implemented them according to spec.

The script follows. I have included some inline comments so that it
won't be entirely incomprehensible to those who have done some
programming.




#!/usr/bin/python
# ----- begin grok.py

# ***** adjust constants as necessary

CUSTOM_FIELD_VALUE = 'change_me'
                        # this value is added as the eighth field of each record
                        # remember, fields beyond seventh are previously dropped

CUSTOM_BEGINNING = 'change_me_as_well'
                        # each result file begins with this text
                        # careful: the script adds a newline character

DISCARD_FIRST_LINE = 1  # if non-zero, first line of CSV file is thrown out
                        # if zero, first line of CSV file is preserved

RESULT_FOLDER_NAME = 'test_results'
                        # name of folder where results are to be stored

OVERWRITE_RESULTS = 0   # if non-zero, existing result folder is used
                        # if zero, new numbered result folder is made

PUT_RESULTS_THERE = 1   # if non-zero, results are stored next to CSV file
                        # if zero, results are stored at execution point

# ***** end of constants


import sys, os, string, re
import csv

def cmp_six(a, b):
    return cmp(a[5], b[5])

def cmp_complex(a, b):
    v = cmp(a[3], b[3])
    if v != 0:
        return v
    v = cmp(b[17], a[17])
    if v != 0:
        return v
    return cmp(a[5], b[5])

print 'starting grok.py'

while 1:

    # prompt user for file name
    sys.stdout.write('enter a file name, or nothing to quit >')
    fname = sys.stdin.readline().strip()
    if fname == '':
        print '\nexiting grok.py'
        break

    # check validity of file name
    if not os.path.isfile(fname):
        print 'debug:', sys.exc_info()
        print 'error: "%s" is not a valid file name' % fname
        continue

    # open file and begin reading
    records = []
    print 'reading from file "%s"' % fname
    try:
        reader = csv.reader(open(fname, 'rb'), dialect='excel')
        line_num = 0

        # discard header if required
        if DISCARD_FIRST_LINE:
            print 'discarding header %s' % reader.next()
            line_num += 1

        # parse all lines
        for line_list in reader:
            records.append(line_list)
            line_num += 1

    except csv.Error:
        print 'debug:', sys.exc_info()
        print 'error: csv parser failed on line %d' % line_num
        continue

    print 'read %d lines' % line_num

    # attempt to sort records by sixth field
    try:
        records.sort(cmp_six)
    except:
        print 'debug:', sys.exc_info()
        print 'error: failed to sort records by sixth field'
        continue

    # filter out records where sixth field does not have the highest value
    master_value = records[0][5]
    new_records = []
    for record in records:
        if record[5] == master_value:
            new_records.append(record)
    records = new_records

    # attempt to sort: 4th field ascending, 18th descending, 6th descending
    try:
        records.sort(cmp_complex)
    except:
        print 'debug:', sys.exc_info()
        print 'error: failed to sort records by fields 4, 18, 6'
        continue

    # attempt to locate or make result folder
    write_dir = os.getcwd()
    if PUT_RESULTS_THERE:
        write_dir = os.path.dirname(fname)
    result_dir = os.path.join(write_dir, RESULT_FOLDER_NAME)
    if os.path.exists(result_dir):
        if OVERWRITE_RESULTS:
            print 'reusing result folder "%s"' % result_dir
        else:
            i = 1
            while 1:
                new_result_dir = '%s.%05d' % (result_dir, i)
                if not os.path.exists(new_result_dir):
                    result_dir = new_result_dir
                    print 'making result folder "%s"' % result_dir
                    os.makedirs(result_dir)
                    break
                i += 1
                if i == 10**5:
                    break
            if i == 10**5:
                print 'error: too many result folders already exist'
                continue
    else:
        print 'making result folder "%s"' % result_dir
        os.makedirs(result_dir)

    # separate records according to the fourth field, stripping fields past 7th
    h = {}
    record = None
    try:
        for record in records:
            val = record[3]
            if h.has_key(val):
                h[val].append(record[:7])
            else:
                h[val] = [record[:7]]
    except:
        print 'debug:', sys.exc_info()
        print 'error: record separation failed on >%s<' % record
        continue

    # write out records to files, prepending custom line and appending 8th field
    try:
        keys = h.keys()
        keys.sort()
        for key in keys:
            res_fname = os.path.join(result_dir, 'result_%s.csv' % key)
            print 'making result file "%s"' % res_fname
            outf = open(res_fname, 'wb')
            outf.write(CUSTOM_BEGINNING+'\n')
            writer = csv.writer(outf, dialect='excel')
            line_num = 1
            for row in h[key]:
                line_num += 1
                writer.writerow(row + [CUSTOM_FIELD_VALUE])
            outf.close()
    except: 
        print 'debug:', sys.exc_info()
        print 'error: failed to write line %d of "%s"' % (line_num, res_fname)
        continue

    print 'completed processing "%s"' % fname


# ----- end grok.py





Let me reiterate that I do expect some bugs and/or incompatibilities
to arise, and I understand that you may not be enamored with some of
the features. But don't forget that I'm here to serve you, and that I
won't rest until everything is perfect. Report any and all problems to
me with Clarification Requests.

Regards,

leapinglizard

Request for Answer Clarification by superunknown-ga on 18 Sep 2004 00:51 PDT
leapinglizard -

wow, thankyou for your hard work. I don't have time to test out the
script at this moment, but will do so later today, and then get back
to you.

Clarification of Answer by leapinglizard-ga on 18 Sep 2004 03:04 PDT
Very good. I'm looking forward to your preliminary results.

leapinglizard

Request for Answer Clarification by superunknown-ga on 18 Sep 2004 13:05 PDT
OK, i've tested the script on my OS X machine, and it works perfectly.
Once again thanks for your efforts. I replaced the values at the top
with my own, and they were inserted without a problem.

With regard to the sixth field being sorted twice - sorry for the
confusion that caused - it was an error on my part. I should have said
the seventh field. Anyway i replaced the line:

return cmp(a[5], b[5])

with:

return cmp(a[6], b[6])

and as i say it worked perfectly. as regards sorting/removing in the
sixth field, yes i can see the method i specified isn't optimal. If
you think its worth the effort to rewrite that section then please go
ahead. However i doubt its worth the effort on your part, as the
existing way works fine (I'm not dealing with enormous files, so speed
isn't that much of an issue. On the file I tested the script only took
a couple of seconds to run).

The only other possible request I could make would be to ask whether
it would be possible to replace the name of the folder created with
today's date. Its not a big issue, but if its easily done that would
be a bonus.

I haven't tested on a windows box yet. I'm sure it'll be fine, but is
there anything I should be aware of when running the script on that
platform?

as soon as you get back to me on those two things I will of course
rate the answer 5 stars.

thanks for the great script leapinglizard.

Clarification of Answer by leapinglizard-ga on 18 Sep 2004 13:57 PDT
Dear superunknown,

I'm glad the script is working. In the new version below, I've
replaced the sixth-field sort with a minimum-value search, and in the
later sort I've made the correction from sixth to seventh.

As for using the date in the name of the result folder, you'll see
that I've modified the definition of RESULT_FOLDER_NAME accordingly.
If you'd prefer to have the American date format, where the month and
day come before the year, you can assign the first three elements of
the time.localtime() tuple to three named variables, which you may
then apply in a different order to the string format, like so.

(year, month, day) = time.localtime()[:3]
RESULT_FOLDER_NAME = 'results.%02d.%02d.%d' % (month, day, year)

My biggest fears of script incompatibility revolved around the format
of your CSV files. If you were able to process your files under OS X
without a hitch, you will, in all likelihood, be able to process the
same files under Windows. It's true that line terminators are
traditionally different in Windows and Unix, but since I read and
write files only through Python's csv module, the portability of
Python should ensure the portability of my script.

The only thing I'm uncertain about is the i/o mode for text files.
You'll see in the script that I open files using the specifier 'rb'
(for reading) and 'wb' (for writing), where the character 'b' means
binary mode. I read a disquieting note in the Python documentation
yesterday, according to which some platforms require binary-mode file
access and others require text mode. Under Linux, both will work just
fine for text files. I really don't know whether this is true under
Windows XP. If there seems to be trouble with the file access, the
first thing to try would be to remove that 'b' from the specifiers
'rb' and 'wb', so that they become just 'r' and 'w'. If it's anything
else, get back to me and I'll troubleshoot.

Regards,

leapinglizard





# ----- begin grok.py

import sys, os, string, re
import time, csv

# ***** adjust constants as necessary

CUSTOM_FIELD_VALUE = 'change_me'
                        # this value is added as the eighth field of each record
                        # remember, fields beyond seventh are previously dropped

CUSTOM_BEGINNING = 'change_me_as_well'
                        # each result file begins with this text
                        # careful: the script adds a newline character

DISCARD_FIRST_LINE = 1  # if non-zero, first line of CSV file is thrown out
                        # if zero, first line of CSV file is preserved

RESULT_FOLDER_NAME = 'results.%d.%02d.%02d' % time.localtime()[:3]
                        # name of folder where results are to be stored

OVERWRITE_RESULTS = 0   # if non-zero, existing result folder is used
                        # if zero, new numbered result folder is made

PUT_RESULTS_THERE = 1   # if non-zero, results are stored next to CSV file
                        # if zero, results are stored at execution point

# ***** end of constants


def cmp_complex(a, b):
    v = cmp(a[3], b[3])
    if v != 0:
        return v
    v = cmp(b[17], a[17])
    if v != 0:
        return v
    return cmp(a[6], b[6])

print 'starting grok.py'

while 1:

    # prompt user for file name
    sys.stdout.write('enter a file name, or nothing to quit >')
    fname = sys.stdin.readline().strip()
    if fname == '':
        print '\nexiting grok.py'
        break

    # check validity of file name
    if not os.path.isfile(fname):
        print 'debug:', sys.exc_info()
        print 'error: "%s" is not a valid file name' % fname
        continue

    # open file and begin reading
    records = []
    print 'reading from file "%s"' % fname
    try:
        reader = csv.reader(open(fname, 'rb'), dialect='excel')
        line_num = 0

        # discard header if required
        if DISCARD_FIRST_LINE:
            print 'discarding header %s' % reader.next()
            line_num += 1

        # parse all lines
        for line_list in reader:
            records.append(line_list)
            line_num += 1

    except csv.Error:
        print 'debug:', sys.exc_info()
        print 'error: csv parser failed on line %d' % line_num
        continue

    print 'read %d lines' % line_num

    # attempt to find minimal value in sixth field
    try:
        min_value = records[0][5]
        for record in records:
            if record[5] < min_value:
                min_value = record[5]
    except:
        print 'debug:', sys.exc_info()
        print 'error: failed to find minimal value in sixth field'
        continue

    # filter out records where sixth field does not have the highest value
    new_records = []
    for record in records:
        if record[5] == min_value:
            new_records.append(record)
    records = new_records

    # attempt to sort: 4th field ascending, 18th descending, 7th descending
    try:
        records.sort(cmp_complex)
    except:
        print 'debug:', sys.exc_info()
        print 'error: failed to sort records by fields 4, 18, 7'
        continue

    # attempt to locate or make result folder
    write_dir = os.getcwd()
    if PUT_RESULTS_THERE:
        write_dir = os.path.dirname(fname)
    result_dir = os.path.join(write_dir, RESULT_FOLDER_NAME)
    if os.path.exists(result_dir):
        if OVERWRITE_RESULTS:
            print 'reusing result folder "%s"' % result_dir
        else:
            i = 1
            while 1:
                new_result_dir = '%s.%05d' % (result_dir, i)
                if not os.path.exists(new_result_dir):
                    result_dir = new_result_dir
                    print 'making result folder "%s"' % result_dir
                    os.makedirs(result_dir)
                    break
                i += 1
                if i == 10**5:
                    break
            if i == 10**5:
                print 'error: too many result folders already exist'
                continue
    else:
        print 'making result folder "%s"' % result_dir
        os.makedirs(result_dir)

    # separate records according to the fourth field, stripping fields past 7th
    h = {}
    record = None
    try:
        for record in records:
            val = record[3]
            if h.has_key(val):
                h[val].append(record[:7])
            else:
                h[val] = [record[:7]]
    except:
        print 'debug:', sys.exc_info()
        print 'error: record separation failed on >%s<' % record
        continue

    # write out records to files, prepending custom line and appending 8th field
    try:
        keys = h.keys() 
        keys.sort()
        for key in keys:
            res_fname = os.path.join(result_dir, 'result_%s.csv' % key)
            print 'making result file "%s"' % res_fname
            outf = open(res_fname, 'wb')
            outf.write(CUSTOM_BEGINNING+'\n')
            writer = csv.writer(outf, dialect='excel')
            line_num = 1
            for row in h[key]:
                line_num += 1
                writer.writerow(row + [CUSTOM_FIELD_VALUE])
            outf.close()
    except:
        print 'debug:', sys.exc_info()
        print 'error: failed to write line %d of "%s"' % (line_num, res_fname)
        continue

    print 'completed processing "%s"' % fname

    
# ----- end grok.py

Clarification of Answer by leapinglizard-ga on 18 Sep 2004 14:03 PDT
I just noticed an erroneous comment. Where it says

    # attempt to sort: 4th field ascending, 18th descending, 7th descending

this is untrue. The script does, of course, sort by ascending value of
the 7th field. If you're a perfectionist, you'll want to replace this
line with the following.

    # attempt to sort: 4th field ascending, 18th descending, 7th ascending

leapinglizard
superunknown-ga rated this answer:5 out of 5 stars and gave an additional tip of: $10.00

Comments  
Subject: Re: Script to manipulate .csv database files
From: leapinglizard-ga on 18 Sep 2004 15:07 PDT
 
Thank you for the five-star rating and the kind tip.

leapinglizard

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy