Clarification of Answer by
mathtalk-ga
on
21 Jul 2003 09:34 PDT
Hi, twims-ga:
The Platform
============
The Visual Basic program you want could be written in a variety of
development environments. Any one of the Office VBA platforms, an ASP
page, Windows Scripting Host (WSH), or even compiled VB6/VB.Net would
be feasible.
However I've decided to recommend an Access VBA program. This engine
one you'd previously agreed to use, and it presents the possibility of
using the Access tables to "version" aspects of the conversion.
What I'm thinking is that besides the "conversion" routines, you may
want to create some routines that "query" for Word files in selected
directories and populate an Access table with the filenames and last
modified dates which are found there. Add a field to hold the
datetime when the file is converted (from Word tables to .csv files),
and then you can track the progress of your conversion processing. If
a Word file gets modified after it is converted, you'd be able detect
that and perhaps flag the file in your database for reconversion.
Add References
==============
Let's begin with adding References to an Access VBA "project". I have
Access 2002 installed, so I'll be using that as a point of reference.
Directions would be similar for other versions.
A standard way to get into the Visual Basic editor in Access is from
under the Tools > Macro menu, where you will also find the shortcut
Alt + F11. However if you are starting from a relatively empty Access
database, you will want to begin by adding a module, which Access 2002
allows you to do from the Insert > Module menu (or by using the New
Module button within the Objects, view Modules "tab").
Even before you create a module, References can be added in the Visual
Basic editor using the Tools > References... dialog. You'll see a
list of potential libraries to use with check boxes beside them. In
this context "References" means the COM interfaces that are exposed by
given object libraries. Two will be particularly important for this
project:
Microsoft Scripting Runtime (scrrun.dll)
Microsoft Word 10.0 Object Library (msword.olb)
Note that your version of the Word Object Library may have a different
number. Version 10.0 denotes the Office XP release.
The Scripting Runtime is useful for using the Windows file system,
e.g. files and directories, through the FileSystemObject and related
interfaces. The Word Object Library, of course, allows us to
manipulate the Word application and documents.
Click the checkboxes beside both of these and close the References
dialog. You may want to open the dialog again just to confirm the two
references are added by seeing that they've moved up to the beginning
of the list with the default references and any others that you've
added to your project.
Code Main Routine
=================
The top level routine, from the perspective of the immediate
conversion procedure, should probably take a single filename of a Word
document and take responsibility for parsing all of its tables. In
this approach you have the flexibility to process different Word
documents in any order, which seems to me the right granularity. One
can always wrap such a subroutine within a loop over a list of
documents, e.g. from the table of filenames suggested at top.
In this approach one would have a subroutine declaration that looks
something like this:
Public Sub Word2CSV(strWordDoc As String)
End Sub
Note that you will need to insert a module before you have a place to
insert such a code procedure.
Here is some pseudo-code for what the routine should accomplish:
0. Validate the input, e.g. that the string ends in ".doc".
1. Check that the file exists; exit (return) if it does not.
2. Open the file as a Word document, and check that this is
successful.
3. Check the count of tables in the document. If there are none, exit
(return).
4. Assuming there are Word tables in the document, loop over them &
call ParseTable on each one.
The purpose of subroutine ParseTable will be to create a .csv file for
the given Word table. It is usually a good idea to find ways to
compartmentalize the program's functionality. Here the responsibilty
for handling the conversion of a single table is assigned to the
ParseTable subroutine, and the outer routine Word2CSV only needs to
take responsibility for passing on the information that ParseTable
will need to perform its job.
As I see it there are two arguments that ParseTable will need to have
passed to it by the calling routine, Word2CSV. These are:
A) a reference to the Word table, and
B) the name of the .csv file to create for this table.
You probably have some ideas about the naming conventions that would
be useful, but from my limited knowledge of the intended application,
I'd think to modify the name of the existing Word document to get
names for various comma-separated-value files which will result.
Constructing these modified filenames then becomes a responsibility of
the outer routine Word2CSV.
What I specifically imagine is that besides changing the filename
extension from .doc to .csv, we would append a two-digit number to the
end of the proper filename. Since you indicated the Word documents
contain no more than 15 tables apiece, the extra two digits in the
filename, corresponding to the position of a table within the Word
document, will allow a unique filename for each table. If it is
possible for your Word documents to have more than 99 tables, then
this numbering scheme can be replaced by one that uses three- or
four-digit extensions.
With this in mind here are some quite detailed suggestions about tasks
0. through 4. itemized above.
0. You may have some additional validations on the filename string, or
you may feel comfortable omitting any validation because of your
confidence in the mechanism(s) by which Word2CSV is going to be
called. In any case a check that the last four characters of a string
are equal to ".doc" can be coded like this:
Dim bDocStr As Boolean
' Third argument vbTextCompare disregards case differences, .doc vs.
.DOC
If StrComp(".doc", Right(strWordDoc, 4), vbTextCompare) = 0 Then
bDocStr = True
Else
bDocStr = False
End If
If Not (bDocStr) Then Return
1. To check if a fully-qualified filename corresponds to an existing
file, you will probably want to use the FileSystemObject from the
Scripting Runtime library. Here's a code snippet to illustrate the
technique:
Dim fs As New Scripting.FileSystemObject
Dim bDocExist As Boolean
bDocExist = fs.FileExists(strWordDoc)
For a different illustration of using the Scripting Runtime's
FileSystemObject within VBA projects, see here:
[EDC Excel & VBA Tips - Using Microsoft Scripting Runtime]
http://www.erlandsendata.no/english/vba/fileaccess/scripting.htm
or consult the Access VBA help index entry under "file; size;" called
Size Property (FileSystemObject object).
Someplace later in your code, before exiting, you would want to
release the FileSystemObject object like this:
Set fs = Nothing
If you have a bit of familiarity with COM interfaces, this is the VBA
way to decrement the reference count on a COM object, allowing the
Windows operating system to reclaim the allocated memory once the
reference count drops to zero.
2. Once we know that the filename is valid and corresponds to an
existing file, we should try to open it as a Word document. This will
require an instance of the Word application to be running.
Some care is needed to avoid problems with either creating too many
instances of the Word application, or creating an instance of Word
through "automation" that winds up being "shared" as a manually
started instance of Word (launched by a desktop user). In the later
case you can have a scenario in with the interactive user closes the
shared instance of Word, causing subsequent errors for the
"automation" user (VBA code) that relies on a Word application
instance.
This last sort of thing is actually a bug, as Microsoft acknowledges:
[188546 BUG - Starting Word Manually Uses Same Instance as Automation]
http://support.microsoft.com/?kbid=188546
I suggest studying that article, and in particular their proposed
workaround, which involves creating a temporary instance of Word and
closing it after the "private" copy of Word has been created. Note
that in the code shown there, the wordApp variable is declared
"globally", so that it potentially can be created once and reused for
numerous documents.
For the sake of illustration I will simply write:
Dim wordApp As Word.Application
If TypeName(wordApp) <> "Application" Then
Set wordApp = CreateObject("Word.Application")
End If
This code fragment illustrates checking whether the variable wordApp
is already initialized, and if not using CreateObject to invoke a new
instance.
Once we have the Word application in hand, opening the document is
easy:
Dim wordDoc As Word.Document
Set wordDoc = wordApp.Documents.Open(strWordDoc)
After the instance of the Word application referenced by wordApp is no
longer needed, a couple of steps are needed to dispose of it properly:
If TypeName(wordApp) = "Application" Then
wordApp.Quit SaveChanges:=wdDoNotSaveChanges
Set wordApp = Nothing
End If
NB: The parameter wdDoNotSaveChanges is intended to coax the Word
instance into closing without protest, but it is still possible for
this quitting operation to take a bit of time. When working in a
VBScript (client-side) environment, I have found it prudent to put a
bit of delay between the Quit command and setting the object variable
to Nothing. I don't think this will be the case in your program, but
it's something to keep in mind.
3. The Word document objects have, as previously discussed, a
collection type member called Tables, which contains object references
for every Table object in the document in a kind of array. To see how
many there are, we can do this:
Dim nTables As Integer
nTables = wordDoc.Tables.Count
If nTables = 0 Then Return
4. Also as discussed before, a loop over the tables in the document
can take advantage of the "For Each" syntax supported by VBA
collection objects, ie. the wordDoc.Tables collection in this case.
We might code something like this:
Dim wordTbl As Word.Table
nTables = 0
For Each wordTbl In wordDoc.Tables
nTables = nTables + 1
Call ParseTable(wordTbl, Left(strWordDoc, Len(strWordDoc) - 4) _
& Format(nTables, "00") & ".csv")
Next wordTbl
where the subroutine ParseTable, to be discussed next, has been
declared this way:
Public Sub ParseTable(wordTbl As Word.Table, strCSVfile As String)
End Sub
Note that the integer variable nTables is being "reused" here to count
through the tables as we loop, providing the information about the
"index" of wordTbl in the collection wordDoc.Tables needed to
construct a unique filename for the .csv output that ParseTable will
require.
Code Inner Subroutine
=====================
With the framework of the program out of the way, you can now focus on
coding the core functionality. A number of the ingredients have
already been discussed. Manipulation of tables, and in particular how
to extract the string content of the cells row by row and column by
column, is amply illustrated in these pages:
[MS Word 2002 VBA - Working with Tables]
http://msdn.microsoft.com/library/en-us/vbawd10/html/wohowWorkingWithTables.asp
[Exporting Word 2000/2002 Documents to XML]
http://msdn.microsoft.com/library/techart/odc_expwordtoxml.htm
The output to CSV format may require a bit of discussion. If we are
treating all cell values as text, it would perhaps be a good idea to
output them as quoted text. To use a row from one of your examples as
an illustration, we might like to have output lines that look like:
"empname "," rate 8 Hrs "," rate 10 Hrs"
"Mike ","$25 ","$40 "
rather than simply:
empname , rate 8 Hrs , rate 10 Hrs
Mike ,$25 ,$40
The decision may depend on what your downstreaming process for these
files will be. Also, think about the possibility that some table
cells may already contain "double quote" characters. If so, some
logic will need to be written to handle them (assuming you want the
quoted strings to load properly into the database).
Another design decision concerns the leading and trailing spaces that
cell values may have (as is the case above). If you want to remove
these, then one option is to rely on the Visual Basic functions LTrim,
RTrim, or Trim (the latter removes both leading and trailing spaces).
Just as the Word document has a collection of tables called Tables,
the Word table object has a collection of rows called Rows. Since the
CSV output needs to be organized by rows, it seems to be logical to
structure ParseTable as an outer loop over the items in wordTbl.Rows
and an inner loop over the cells in a particular row.
Here then is a list of tasks to be performed by ParseTable:
1. Use a FileSystemObject to create a TextStream object, using the
CreateTextFile method on the filename.
2. Loop (outer) over the individual rows in wordTbl.
3. Loop (inner) over the individual cells in each row, appending
successive field contents to a string.
4. After the inner loop exits, write the string constructed to the
TextStream output file as a single line.
5. After the outer loop exits, close the TextStream output file; set
objects to Nothing before returning.
Something of the same structure can be seen in Kevin McDowell's
program above, which exports Word to XML. His ParseTable routine is
coded in the file XMLConverter.cls, which you will see if you download
and unzip his demo files. A slight modification of that code for your
purposes might look like this:
Public Sub ParseTable(wordTbl As Word.Table, strCSVfile As String)
Dim fs As New Scripting.FileSystemObject
Dim ts As Scripting.TextStream
Set ts = fs.CreateTextFile(strCSVfile)
Dim strTemp As String
Dim wordRow As Word.Row
Dim wordCell As Word.Cell
Dim nRow As Integer
' Outer loop on each row.
For nRow = 1 To wordTbl.Rows.Count
Set wordRow = wordTbl.Rows(nRow)
strTemp = ""
' Inner loop through each cell in current row.
For Each wordCell In wordRow.Cells
' Call cell parser... make it insert commas?
strTemp = strTemp & ParseCell(wordCell)
Next
ts.WriteLine strTemp
Next
ts.Close
Set ts = Nothing
Set fo = Nothing
Set fs = Nothing
End Sub
In appealing to yet another subroutine ParseCell (as Kevin did in his
code), I avoid some of the picky details about where to place the
commas and (possibly) quotation marks. In this sort of application
there's a tricky bit of logic needed one way or the other to suppress
the final comma. As written here the introduction of commas would be
a responsibility of ParseCell, so perhaps the smartest approach is to
make ParseCell aware (through the Parent of wordCell) when the last
cell in a row is being processed. Alternatively the last comma can
simply be "deleted" by shortening strTemp by one character just before
it is written to the output.
regards, mathtalk-ga