Dear superunknown,
I have labored over the past five hours to write a Python script
according to your specifications, and hereby present to you a first
draft. I call it a first draft because it is unlikely that any program
of this length, no matter how skillful the programmer, is free of
bugs. I would be surprised if this one were. I have tested it to some
extent on my own machine with a toy example, but I do not have access
to your particular files. Bugs and incompatibility problems are likely
to arise, as I detail below. Bear in mind, however, that I am
committed to stamping out every single bug and modifying any feature
until the program suits your needs precisely. My work on this question
is not done until you are completely satisfied.
In order to try out the script, you will first have to examine and
modify the six constants defined near the beginning. Two of these are
the custom field value and custom header that you requested, while the
remaining four govern the behavior of the script to make it more
adaptable to different circumstances. If the meaning of any of these
constants is unclear, just ask. Once you have adjusted them to your
liking, execute the script from a command line by typing
python.exe grok.py
on a Windows box or
python grok.py
on Linux, and see what happens. If the script crashes with a
nonsensical error message, report back to me with a text dump so that
I can get to work immediately on fixing the problem. If the error is
reported by my own script -- you'll be able to tell from the
relatively friendly message beginning with "error:" -- then it's a
problem I've anticipated, and again I'll want to see a dump. Your
choice of operating system may have a number of characteristics that I
was unable to predict in advance.
Among these unpredictable characteristics is the required reading and
writing mode for .csv files. I can read and write them in binary mode
on my Linux box, but your system may require text mode. A somewhat
more significant problem is that of the format to which your
particular CSV files adhere, since there is no single accepted CSV
standard. The script at present assumes that it is reading a file in
Excel's CSV format. If this is not the case, you will have to post a
sample file so that I can analyze the formatting differences and
implement them accordingly. Yet another difficulty is with the
comparison operators, since I don't know what kinds of values you have
in your fields. At present, the script properly sorts numbers and
strings. If you have other kinds of data in your fields, again I'll
have to see a sample file for analysis. I'm also not quite sure I've
correctly interpreted your requirements for "custom value" and "custom
first line", but I will work with you to adapt this and any other
program feature to meet your needs.
In addition, I would like to point out a few suboptimal aspects of the
algorithm you specified, which I nonetheless implemented exactly to
order. In particular, there is no need to sort the records by their
sixth field if you merely want to find out the lowest value present in
the sixth field of any record. This could be done by a traversal of
the records in their existing order. Then you ask that records with
any other value in the sixth field be thrown out, but later, you
specify that the records be sorted according to values in the sixth
field, yet the effect of the filtering step is that these values are
all the same. Please don't think that I am criticizing your choice of
algorithm. There may be details I don't understand, and perhaps you
intend to modify the algorithm later so that these steps are necessary
after all. I only wanted to bring these slight inefficiencies to your
attention in case you weren't aware of them. In any event, I have
implemented them according to spec.
The script follows. I have included some inline comments so that it
won't be entirely incomprehensible to those who have done some
programming.
#!/usr/bin/python
# ----- begin grok.py
# ***** adjust constants as necessary
CUSTOM_FIELD_VALUE = 'change_me'
# this value is added as the eighth field of each record
# remember, fields beyond seventh are previously dropped
CUSTOM_BEGINNING = 'change_me_as_well'
# each result file begins with this text
# careful: the script adds a newline character
DISCARD_FIRST_LINE = 1 # if non-zero, first line of CSV file is thrown out
# if zero, first line of CSV file is preserved
RESULT_FOLDER_NAME = 'test_results'
# name of folder where results are to be stored
OVERWRITE_RESULTS = 0 # if non-zero, existing result folder is used
# if zero, new numbered result folder is made
PUT_RESULTS_THERE = 1 # if non-zero, results are stored next to CSV file
# if zero, results are stored at execution point
# ***** end of constants
import sys, os, string, re
import csv
def cmp_six(a, b):
return cmp(a[5], b[5])
def cmp_complex(a, b):
v = cmp(a[3], b[3])
if v != 0:
return v
v = cmp(b[17], a[17])
if v != 0:
return v
return cmp(a[5], b[5])
print 'starting grok.py'
while 1:
# prompt user for file name
sys.stdout.write('enter a file name, or nothing to quit >')
fname = sys.stdin.readline().strip()
if fname == '':
print '\nexiting grok.py'
break
# check validity of file name
if not os.path.isfile(fname):
print 'debug:', sys.exc_info()
print 'error: "%s" is not a valid file name' % fname
continue
# open file and begin reading
records = []
print 'reading from file "%s"' % fname
try:
reader = csv.reader(open(fname, 'rb'), dialect='excel')
line_num = 0
# discard header if required
if DISCARD_FIRST_LINE:
print 'discarding header %s' % reader.next()
line_num += 1
# parse all lines
for line_list in reader:
records.append(line_list)
line_num += 1
except csv.Error:
print 'debug:', sys.exc_info()
print 'error: csv parser failed on line %d' % line_num
continue
print 'read %d lines' % line_num
# attempt to sort records by sixth field
try:
records.sort(cmp_six)
except:
print 'debug:', sys.exc_info()
print 'error: failed to sort records by sixth field'
continue
# filter out records where sixth field does not have the highest value
master_value = records[0][5]
new_records = []
for record in records:
if record[5] == master_value:
new_records.append(record)
records = new_records
# attempt to sort: 4th field ascending, 18th descending, 6th descending
try:
records.sort(cmp_complex)
except:
print 'debug:', sys.exc_info()
print 'error: failed to sort records by fields 4, 18, 6'
continue
# attempt to locate or make result folder
write_dir = os.getcwd()
if PUT_RESULTS_THERE:
write_dir = os.path.dirname(fname)
result_dir = os.path.join(write_dir, RESULT_FOLDER_NAME)
if os.path.exists(result_dir):
if OVERWRITE_RESULTS:
print 'reusing result folder "%s"' % result_dir
else:
i = 1
while 1:
new_result_dir = '%s.%05d' % (result_dir, i)
if not os.path.exists(new_result_dir):
result_dir = new_result_dir
print 'making result folder "%s"' % result_dir
os.makedirs(result_dir)
break
i += 1
if i == 10**5:
break
if i == 10**5:
print 'error: too many result folders already exist'
continue
else:
print 'making result folder "%s"' % result_dir
os.makedirs(result_dir)
# separate records according to the fourth field, stripping fields past 7th
h = {}
record = None
try:
for record in records:
val = record[3]
if h.has_key(val):
h[val].append(record[:7])
else:
h[val] = [record[:7]]
except:
print 'debug:', sys.exc_info()
print 'error: record separation failed on >%s<' % record
continue
# write out records to files, prepending custom line and appending 8th field
try:
keys = h.keys()
keys.sort()
for key in keys:
res_fname = os.path.join(result_dir, 'result_%s.csv' % key)
print 'making result file "%s"' % res_fname
outf = open(res_fname, 'wb')
outf.write(CUSTOM_BEGINNING+'\n')
writer = csv.writer(outf, dialect='excel')
line_num = 1
for row in h[key]:
line_num += 1
writer.writerow(row + [CUSTOM_FIELD_VALUE])
outf.close()
except:
print 'debug:', sys.exc_info()
print 'error: failed to write line %d of "%s"' % (line_num, res_fname)
continue
print 'completed processing "%s"' % fname
# ----- end grok.py
Let me reiterate that I do expect some bugs and/or incompatibilities
to arise, and I understand that you may not be enamored with some of
the features. But don't forget that I'm here to serve you, and that I
won't rest until everything is perfect. Report any and all problems to
me with Clarification Requests.
Regards,
leapinglizard |
Clarification of Answer by
leapinglizard-ga
on
18 Sep 2004 13:57 PDT
Dear superunknown,
I'm glad the script is working. In the new version below, I've
replaced the sixth-field sort with a minimum-value search, and in the
later sort I've made the correction from sixth to seventh.
As for using the date in the name of the result folder, you'll see
that I've modified the definition of RESULT_FOLDER_NAME accordingly.
If you'd prefer to have the American date format, where the month and
day come before the year, you can assign the first three elements of
the time.localtime() tuple to three named variables, which you may
then apply in a different order to the string format, like so.
(year, month, day) = time.localtime()[:3]
RESULT_FOLDER_NAME = 'results.%02d.%02d.%d' % (month, day, year)
My biggest fears of script incompatibility revolved around the format
of your CSV files. If you were able to process your files under OS X
without a hitch, you will, in all likelihood, be able to process the
same files under Windows. It's true that line terminators are
traditionally different in Windows and Unix, but since I read and
write files only through Python's csv module, the portability of
Python should ensure the portability of my script.
The only thing I'm uncertain about is the i/o mode for text files.
You'll see in the script that I open files using the specifier 'rb'
(for reading) and 'wb' (for writing), where the character 'b' means
binary mode. I read a disquieting note in the Python documentation
yesterday, according to which some platforms require binary-mode file
access and others require text mode. Under Linux, both will work just
fine for text files. I really don't know whether this is true under
Windows XP. If there seems to be trouble with the file access, the
first thing to try would be to remove that 'b' from the specifiers
'rb' and 'wb', so that they become just 'r' and 'w'. If it's anything
else, get back to me and I'll troubleshoot.
Regards,
leapinglizard
# ----- begin grok.py
import sys, os, string, re
import time, csv
# ***** adjust constants as necessary
CUSTOM_FIELD_VALUE = 'change_me'
# this value is added as the eighth field of each record
# remember, fields beyond seventh are previously dropped
CUSTOM_BEGINNING = 'change_me_as_well'
# each result file begins with this text
# careful: the script adds a newline character
DISCARD_FIRST_LINE = 1 # if non-zero, first line of CSV file is thrown out
# if zero, first line of CSV file is preserved
RESULT_FOLDER_NAME = 'results.%d.%02d.%02d' % time.localtime()[:3]
# name of folder where results are to be stored
OVERWRITE_RESULTS = 0 # if non-zero, existing result folder is used
# if zero, new numbered result folder is made
PUT_RESULTS_THERE = 1 # if non-zero, results are stored next to CSV file
# if zero, results are stored at execution point
# ***** end of constants
def cmp_complex(a, b):
v = cmp(a[3], b[3])
if v != 0:
return v
v = cmp(b[17], a[17])
if v != 0:
return v
return cmp(a[6], b[6])
print 'starting grok.py'
while 1:
# prompt user for file name
sys.stdout.write('enter a file name, or nothing to quit >')
fname = sys.stdin.readline().strip()
if fname == '':
print '\nexiting grok.py'
break
# check validity of file name
if not os.path.isfile(fname):
print 'debug:', sys.exc_info()
print 'error: "%s" is not a valid file name' % fname
continue
# open file and begin reading
records = []
print 'reading from file "%s"' % fname
try:
reader = csv.reader(open(fname, 'rb'), dialect='excel')
line_num = 0
# discard header if required
if DISCARD_FIRST_LINE:
print 'discarding header %s' % reader.next()
line_num += 1
# parse all lines
for line_list in reader:
records.append(line_list)
line_num += 1
except csv.Error:
print 'debug:', sys.exc_info()
print 'error: csv parser failed on line %d' % line_num
continue
print 'read %d lines' % line_num
# attempt to find minimal value in sixth field
try:
min_value = records[0][5]
for record in records:
if record[5] < min_value:
min_value = record[5]
except:
print 'debug:', sys.exc_info()
print 'error: failed to find minimal value in sixth field'
continue
# filter out records where sixth field does not have the highest value
new_records = []
for record in records:
if record[5] == min_value:
new_records.append(record)
records = new_records
# attempt to sort: 4th field ascending, 18th descending, 7th descending
try:
records.sort(cmp_complex)
except:
print 'debug:', sys.exc_info()
print 'error: failed to sort records by fields 4, 18, 7'
continue
# attempt to locate or make result folder
write_dir = os.getcwd()
if PUT_RESULTS_THERE:
write_dir = os.path.dirname(fname)
result_dir = os.path.join(write_dir, RESULT_FOLDER_NAME)
if os.path.exists(result_dir):
if OVERWRITE_RESULTS:
print 'reusing result folder "%s"' % result_dir
else:
i = 1
while 1:
new_result_dir = '%s.%05d' % (result_dir, i)
if not os.path.exists(new_result_dir):
result_dir = new_result_dir
print 'making result folder "%s"' % result_dir
os.makedirs(result_dir)
break
i += 1
if i == 10**5:
break
if i == 10**5:
print 'error: too many result folders already exist'
continue
else:
print 'making result folder "%s"' % result_dir
os.makedirs(result_dir)
# separate records according to the fourth field, stripping fields past 7th
h = {}
record = None
try:
for record in records:
val = record[3]
if h.has_key(val):
h[val].append(record[:7])
else:
h[val] = [record[:7]]
except:
print 'debug:', sys.exc_info()
print 'error: record separation failed on >%s<' % record
continue
# write out records to files, prepending custom line and appending 8th field
try:
keys = h.keys()
keys.sort()
for key in keys:
res_fname = os.path.join(result_dir, 'result_%s.csv' % key)
print 'making result file "%s"' % res_fname
outf = open(res_fname, 'wb')
outf.write(CUSTOM_BEGINNING+'\n')
writer = csv.writer(outf, dialect='excel')
line_num = 1
for row in h[key]:
line_num += 1
writer.writerow(row + [CUSTOM_FIELD_VALUE])
outf.close()
except:
print 'debug:', sys.exc_info()
print 'error: failed to write line %d of "%s"' % (line_num, res_fname)
continue
print 'completed processing "%s"' % fname
# ----- end grok.py
|