Hi, siam-ga:
Your question asks about QA aspects of a legacy data migration
project, but nearly all aspects of a project's planning can directly
impact the QA tasks. So I think it best to widen our discussion! :-)
CAVEAT
======
My understanding of your project is very incomplete, but from what you
have described, it seems to involve data cleanup at least as much as
pure data conversion. You seem to emphasize the semantic content of
the data more than its presentation, perhaps more strongly than your
client would. Many legacy "markup" conversion projects focus mainly
on the presentation of results, and hence target PDF or HTML output,
as for example:
[Minnesota Local Road Research Board]
http://www.lrrb.gen.mn.us/Guidelines/Appendices/appendixB.asp
Caveat: Your project is targeting XML output, but I know little about
which application(s) will use this migrated data. Normally the
target application would dictate a lot of things concerning the QA
process. In the absence of knowing more about that, however, I'm
thinking of the results of the project as being targeted for various
potential (unwritten) future uses, making it in a narrow sense
something of a "data warehousing" or "data mining" project instead of
simply Web publishing. [Something of that duality is inherent with
MathML, which allows either content orientation or presentation
orientation in representing mathematical formulas.]
CARDINAL PRINCIPLES
===================
I have two cardinal principles for data mapping projects, and I want
to state them up front before discussing how they apply to your
project:
1) Speak in complete sentences.
The idea is that the units of conversion should resemble standalone
statements, capable of being true or false (correct or incorrect) on
their own. Of course this is not entirely the case, even in
mathematics. There is always a context to what is being asserted.
Nonetheless, in constructing your "XML snippets" be careful to avoid
fragmenting the source data to the point where it can no longer be
understood as "complete sentences"; that is a red flag that the
converted data has lost its coherence.
2) Invent needed vocabulary.
Your description of the specifications process echoes experiences I've
had. Apparently there exist many basic patterns for the conversion
"template" and probably even more "exceptions" to these patterns. In
order to discuss the patterns and exceptions, and most importantly to
be able to write them into the project specifications, I'm guessing
that you will need to invent some new vocabulary. The discussion of
critical issues can break down in the specification phase because the
same imprecise words get used to describe a variety of truly distinct
phenomena. Sometimes this is fortuitous and leads to deep insights
into the similarity of tasks for the software to perform, but more
often than not it results in a false sense of confidence by the client
that difficulties have been ironed out.
A FEW WEB CITATIONS
===================
Okay, now that I've thrown out my two cents on the generalities, let
me present a few papers I found in searching around the Web. While
none describes a situation exactly like yours, each struck me as
having some good ideas to contribute with respect to quality assurance
in data conversion projects.
First up is a white paper by Colin J. White of Database Associates:
[An Analysis-Led Approach to Data Warehouse Design and Development]
http://www3.newmediasales.com/dl/1755/Colin_White_Evoke_in_DWH_V2.pdf
This paper has absolutely nothing to do with XML but champions the
notion of letting data quality dictate the design of a "data warehouse".
It presents some terminology that may be useful in selling your
project planning to the rest of the project team, such as the
importance of "staging areas" for data to minimize data quality and
data integration problems. Note that this is "version 2" of his
paper, so apparently it made a good enough impression on the first
client he used it with to make it into a second version!
The second paper is by Robert Aydelotte:
[From EDI to XML]
http://www.posc.org/ebiz/xml_edi/edi2xml.html
The author describes project planning for converting "legacy" EDI data
formats into XML/edi (XML-based EDI) but doesn't go into detail about
test cases and QA. However he gives a link to the ISIS European
XML/EDI Pilot Project, where many seminar presentations and other
project-specific documents are available. This was sufficiently far
in the past that validation of XML was discussed solely in terms of
DTD's, but what I found most interesting in this material was the
discussion of "best pracices" for creating those DTD's.
Third is a paper by Shazia Akhtar, Ronan G. Reilly, and John Dunnion:
[Automating XML Mark-up]
http://www.nyu.edu/its/humanities/ach_allc2001/papers/akhtar/
which may provide some cogent ideas toward selection of test cases.
They describe using the "self-organizing map" (SOM) learning algorithm
proposed by Kohonen to arrange documents in a two-dimensional map, so
that similar documents are located close to one another. I was
thinking that this idea might be applied in your project to the
selection of test cases. Supposing that the 2777 XML snippets were
mapped into a 2D diagram, selection of test cases from among these
could then be done so that a greater number of idiosyncratic documents
are chosen for critical examination (at the expense of using only a
relatively smaller number where very similar documents are densely
clustered).
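To make that concrete, here is a rough sketch (my own, not anything
from the paper, and the feature vectors are purely an assumption --
for example, counts of each legacy tag per snippet) of how a small
self-organizing map could be trained and then used to find sparsely
populated cells, which is where the idiosyncratic documents live:

  # Sketch only: train a tiny SOM on per-document feature vectors,
  # then bucket documents by their best-matching grid cell.
  import numpy as np

  def train_som(vectors, grid=(6, 6), iterations=2000, seed=0):
      """Fit a small SOM grid to the document feature vectors."""
      rng = np.random.default_rng(seed)
      rows, cols = grid
      dim = vectors.shape[1]
      weights = rng.random((rows, cols, dim))
      coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                     indexing="ij"))
      for t in range(iterations):
          x = vectors[rng.integers(len(vectors))]
          # best-matching unit for this sample
          dists = np.linalg.norm(weights - x, axis=2)
          bmu = np.unravel_index(np.argmin(dists), dists.shape)
          # learning rate and neighbourhood radius decay over time
          lr = 0.5 * (1 - t / iterations)
          radius = max(rows, cols) / 2 * (1 - t / iterations) + 1e-6
          grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
          influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
          weights += lr * influence[..., None] * (x - weights)
      return weights

  def map_documents(vectors, weights):
      """Assign each document index to its best-matching grid cell."""
      cells = {}
      for i, x in enumerate(vectors):
          dists = np.linalg.norm(weights - x, axis=2)
          cell = np.unravel_index(np.argmin(dists), dists.shape)
          cells.setdefault(cell, []).append(i)
      return cells

  # Sampling policy: take more test cases from sparsely populated
  # cells (idiosyncratic documents), fewer from dense clusters.

Any ordinary clustering algorithm would serve the same sampling
purpose; the SOM's extra charm is the two-dimensional picture it can
give the non-technical members of the team.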
STARTING OVER AGAIN
===================
Having tossed out all these fragmentary insights at the outset, let
me back up and divide the data migration process into three
"quasi-sequential" phases:
Data Cleanup (rectification of original data)
Data Translation (data mapping and conversion)
Data Installation (provisioning of revised data to applications)
It would be nice if the three phases were truly sequential. In
practice one allows a greater or smaller measure of parallel activity
across these phases for the sake of speedy deployment. Understanding
the interactions is a key to minimizing cost and risk.
Data Cleanup
============
In a classic waterfall process for data migration, the data cleanup
is done at the front end of the project. The temptation to defer this
upfront "intellectual effort" to a later point in the project calls
into question the "integrity" of the conversion phase. If the data is
not correct to begin with, how can a properly defined conversion
process produce correct output? GIGO traditionally stands for
"garbage in, garbage out", but it could just as well mean "good data
in, good data out".
In this particular project you've said that specifications exist for
the original "tagged" format of the data files. That this data is
organized into 63 input files seems somewhat incidental to the
structure of the entities represented by that data. As a conceptual
aid I'm thinking of those files as being somewhat like 63 tables in a
relational database (please feel free to speak up and give a better
description), guessing that each of the 2777 output files (XML
snippets?) would generically depend on the aggregate of all the input
files.
You've also indicated that these specifications were abused, that to
an extent old "markup" practices blurred the lines between content and
presentation. For example, you suggest that semantic relationships
are sometimes "coded" merely as a pattern of contiguous presentation
(ordering) in the layout.
If it is meaningful to correct these "bad practices" in situ, then it
would be advantageous to do it before trying to convert them into
"proper" XML output. For one thing it sounds as if the client has
more "resources" who understand the legacy format than who understand
the target XML format.
Of course advantage should be sought in using "tools" to assist in
this data cleanup, though it may be that the legacy format is simply
too "fragile" to support an aggressive cleanup effort.
Data Translation
================
I suggested decoupling the "conversion" into a "naive translation"
phase and a "forgetful" (throw stuff out) phase. This avoids
confusing the intentional discarding of information that is obsolete
for future purposes with the quite opposite objective of "adding
value" by reconstructing explicit semantic relations from "implied
patterns".
A naive translation phase would put the legacy data into a more robust
XML format, in which you can hope to leverage lots of existing tools
(version control, XSLT, schemas, etc.) that may have no useful
counterparts in the legacy format. The "mission statement" for this
naive translation phase would be to provide XML tagging that
duplicates the existing data in a literal fashion, so that at least in
principle the legacy data could be fully reconstructed from this
intermediate form.
Note that XML/XPath does provide syntax for the ordering of sibling
nodes. In this sense I'd hope that the "patterns" of implied
relationships could be as manifest in the naive XML translation as
they are in the legacy format.
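Just to illustrate what I mean by a literal, reversible first pass --
the tag codes below are invented, since I don't know your legacy
format -- something along these lines would wrap every input line in
an element, preserving order and raw text so that the original file
could in principle be reconstructed:

  # Sketch only: naive, line-for-line translation of a hypothetical
  # ".XX value" legacy format into XML, preserving order and text.
  import xml.etree.ElementTree as ET

  def naive_translate(legacy_lines):
      root = ET.Element("legacy-record")
      for n, line in enumerate(legacy_lines, start=1):
          if line.startswith(".") and " " in line:
              code, text = line[1:].split(" ", 1)
          else:
              code, text = "untagged", line   # keep unexpected lines too
          el = ET.SubElement(root, "field", {"code": code, "line": str(n)})
          el.text = text
      return root

  sample = [".TI Pressure ratings for PVC pipe",
            ".AU Smith, J.",
            "continuation text with no tag code"]
  print(ET.tostring(naive_translate(sample), encoding="unicode"))

The point is that nothing is interpreted yet; pattern recognition and
"adding value" come later, against this more tool-friendly form.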
I'd anticipate that a number of issues with the original data would
not be fully recognized until the conversion phase was well along.
While it is currently hoped that many of the "exceptions" recognized
late in the game will somehow fit neatly into the preconceived
architecture of rules, experience suggests that some of them will
force revisions to the patterns themselves, so budget time to rework
the conversion rules (and their specs) as these surface.
Data Installation
=================
As previously mentioned, without knowing something about the target
applications, it's hard to discuss their relative importance in the QA
process. You did mention in one clarification that the XML is to be
used to generate HTML, and that "the client wants to review final HTML
output" whereas you "feel it's much more important to look at the XML
output itself." Given the greater insight you have into the HTML
output process than I have, I'm certainly willing to adopt your point
of view and consider the XML and its correctness as the focus of this
question. It sounds as if the XML to HTML translation might be simply
a stylesheet transformation, although the designation of the XML
output files as "snippets" makes me suspect that a lot of "includes"
surround this process.
SPECIFIC QUESTIONS & ANSWERS
============================
Given this outline of the project, imperfect as only my imagination
can make it, we can at least recap the questions you raised and
discuss solutions:
1. What are some books and papers that address project planning for an
exercise like this?
This is the all-encompassing question asked in the original post.
Project plans are a means to an end, not the end in themselves.
Planning makes it more likely that you will reach the desired goal.
As Gen. Dwight Eisenhower famously observed, while the plans for
battle are useless as soon as war begins, planning is indispensable.
You obviously have a good grip on the tools of Unix and XML, so I
won't try to drive the discussion of project planning down to a
technical level. However here's a book on generic project planning
that I like:
Project Management: How to Plan and Manage Successful Projects
by Joan Knutson and Ira Bitz
It's not extremely thick, about 200 pages, and I took a short course
out of it a few years back, sponsored by the American Management
Association. One of the key points that I took away from that course
is that a project manager's role is that of facilitating, not doing
the project work. I can't say that I ever took that lesson to heart,
because I'm the quintessential player-coach on a project team, but I
really do appreciate the contributions made by project managers who
take care of the issues log, updating the schedule, drawing feedback
from the clients, etc. without involving themselves in technical
accomplishments on a day-to-day basis.
For advice on software projects I can recommend the very readable
Peopleware by Tom DeMarco and Timothy Lister (2nd ed.). I also find
food for thought in the eXtreme Programming (XP) series. As a
starting point I'd read:
Extreme Programming Explained: Embrace Change
by Kent Beck
2. How should we sample the output and assess satisfaction of
requirements (aside from XML validation), given a staggering variation
of input data and loose specs?
I mentioned an idea above for using Kohonen's self-organizing map
(SOM) to assist in selection of the test cases. You have obviously
had some discussions with the client about preparing artificial data
for use in unit testing, so clearly as you develop the conversion code
you are planning on stubbing out certain sections to allow for this
unit testing.
I might try using some "debug" code to profile which patterns are
being identified/applied, and how often, as your development code
runs against the entire input. I'm unclear about whether the conversion will have
to take all 63 files simultaneously as input, or whether it's more a
matter of processing each one individually. But in any case if you
can identify "natural" test cases for each code pathway, these will
naturally serve as good test cases for unit testing. Asking the
client to make up data for the sake of unit testing seems to me to
carry some risk of wasted effort and even introduction of conversion
issues that never existed in the original data! Just a thought
(probably a paranoid one!).
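Here is a sketch of the kind of "debug" profiling I have in mind,
with the pattern names and regular expressions invented purely for
illustration:

  # Sketch only: count how often each conversion pattern fires across
  # the full input, and collect anything that nothing matches.
  import re
  from collections import Counter

  PATTERNS = {
      "bold-heading": re.compile(r"^\.B\s+\S"),   # hypothetical tag
      "fine-print":   re.compile(r"^\.FP\s+\S"),  # hypothetical tag
      "italic-close": re.compile(r"^\.I\s+\S"),   # hypothetical tag
  }

  def profile(lines, counts=None, unmatched=None):
      counts = Counter() if counts is None else counts
      unmatched = [] if unmatched is None else unmatched
      for line in lines:
          for name, rx in PATTERNS.items():
              if rx.match(line):
                  counts[name] += 1
                  break
          else:
              unmatched.append(line)   # candidate "exception" for the specs
      return counts, unmatched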
Once the conversion software is complete enough to run in
"integration" mode, you will want to consult that "debug" log to see
what the main code pathways are, which "test cases" make good
benchmarks (illustrating expected functionality), and which are tied
to open issues. I really feel that the automated testing suite is going to
provide value on this project, despite the additional effort required
of you, the lone developer. A major headache with late changes to
specs, or even with bug fixes, is that the changes needed to add A or
resolve B wind up unexpectedly breaking C. In my experience a test
harness always provides value because it's better to discover that
Murphy's Law has struck while the code changes are still fresh in your
mind.
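As a sketch of the sort of lightweight harness I mean (the file
layout and the convert() hook are assumptions, not a prescription),
each reviewed-and-accepted snippet becomes a stored benchmark that
gets re-checked after every code change:

  # Sketch only: re-convert each benchmark legacy file and diff the
  # result against the accepted XML snippet stored beside it.
  import difflib
  import pathlib

  def check_benchmarks(convert, case_dir="benchmarks"):
      failures = {}
      for inp in sorted(pathlib.Path(case_dir).glob("*.legacy")):
          expected = inp.with_suffix(".xml").read_text()
          actual = convert(inp.read_text())
          if actual != expected:
              failures[inp.name] = "".join(difflib.unified_diff(
                  expected.splitlines(True), actual.splitlines(True)))
      return failures   # an empty dict means nothing regressed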
So as a proxy for doing something clever with the SOM map, I'd suggest
using the "profiling" counts from the test harness to decide how to
sample test cases. As the client's experts report conversion issues,
meet with the project manager to decide how the issues need to be
logged, i.e. spec change vs. bug in the code. Invent vocabulary as
needed to update the specs with clarity for all concerned.
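For instance, a very simple policy -- again just a sketch -- is to
review every instance of a rare pattern but only a capped random
sample of the common ones, driven directly by those profiling counts:

  # Sketch only: build a review plan from the cases grouped by the
  # pattern that matched them during the profiling run.
  import random

  def sample_cases(cases_by_pattern, cap=5, seed=0):
      rng = random.Random(seed)
      plan = {}
      for name, cases in cases_by_pattern.items():
          plan[name] = cases if len(cases) <= cap else rng.sample(cases, cap)
      return plan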
3. None of the project team except you are technical enough to
understand actual code. How can the specs be made more specific
without pseudo code that appears more confusing than the code itself
(heavy use of regular expressions)? How can exceptions to specs that
only the client is aware of be effectively documented (they keep
popping up in conversations)?
This is picking up where the last topic left off. Pattern matching is
a key element of much declarative programming, but it can be tough
sledding to give it "literal translation" in the specs. This is where
an astute use of jargon, specially invented for this project, can pay
off. Give the patterns that need to be discussed in the specs
colorful, even semi-humorous names. It makes them memorable and gives
the rest of the project team a feeling of belonging, of being "in on
the secret". Give a fully blown definition of the pattern _once_ in
the appropriate section of the specs, but thereafter simply refer to
it by name.
Suppose (merely for sake of illustration) that in the documents
there's a typical pattern in which you have a section of BOLDED text,
followed by exactly three sections of fine print, followed by a
section in Italics. Regardless of what the actual purpose of this
pattern is for the client's typesetting needs, you might aptly and
humorously refer to it as the Three Blind Mice template. The lead
paragraph might be called the Farmer and the closing one, the Farmer's
Wife (since she "cuts off" the tail of the pattern).
Or, if someone on the project team fancies him- or herself a chess
aficionado, let them propose names like Queen's Gambit, etc. It's a
chance for the non-technical but creative members of the project to
make an expressive connection to the nitty gritty details, and usually
enhances the commitment of the team as a whole to doing stuff the
right way, rather than just producing something "of the form".
For each section of the specs that defines a "pattern" you can have a
standard subsection that describes "known" or suspected exceptions.
As the exceptions are more clearly identified and distinguished, some
of them are likely to evolve into subvarieties of "patterns", with
their own exceptions. Listing the known exceptions can help the
project team to prioritize the evolution of new patterns based on the
depth and complexity of the existing patterns and exceptions.
I don't know what language you plan to implement with. You mention
regular expressions (and a focus on correctness rather than speed),
which leads me to think of interpreted languages like Perl or Awk. I
prefer Prolog as a declarative language with strong pattern matching
features, but in working with XML source documents of course XSLT is a
natural choice. But regardless of how pattern matching will be coded,
there needs to be an internally consistent vocabulary for all the
variations that the project team can buy into.
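One way to keep the invented jargon and the code in lockstep,
whatever the implementation language turns out to be, is a single
table that maps each spec name to its pattern. The Python and the
regular expression below are purely illustrative:

  # Sketch only: the spec defines "Three Blind Mice" once; the code
  # refers to it by exactly the same name.
  import re

  NAMED_PATTERNS = {
      # bolded lead, exactly three fine-print blocks, italic close
      "three-blind-mice": re.compile(
          r"\.B\b.*?(?:\n\.FP\b.*?){3}\n\.I\b.*", re.DOTALL),
  }

  def classify(snippet):
      """Return the names of every spec pattern this snippet matches."""
      return [name for name, rx in NAMED_PATTERNS.items()
              if rx.search(snippet)]

That way an issue logged against "Three Blind Mice" points
unambiguously at one section of the specs and one entry in the code.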
4. The client wants to review final HTML output which will be
generated from the XML, but I feel it's much more important to look at
the XML output itself and leave transformation to smaller scale
testing. How should we divide attention between the two?
You have a clear instinct about this, which I would trust. But I
think I'd try to adapt to the client's point of view in a way that
makes it seem as though they are winning the argument. Specifically
I'm thinking of serving up the XML pages with a very thin stylesheet
transformation, which in the limiting case might be the default
stylesheet used by Internet Explorer to render generic XML. If I knew
more about the target application, I might see more clearly what
incremental transforms might bridge the gap between the "raw" XML and
the ultimately desired HTML. If you are the only developer, then I
guess you'd be in the best position to judge how to finesse the
differences.
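As a sketch of the "very thin stylesheet" idea (assuming an XSLT
processor such as the one in lxml, though any would do), each element
could be rendered as a nested block labeled with its tag name, so the
client reviews something that looks like HTML while you are
effectively reviewing the raw XML:

  # Sketch only: a near-identity XSLT that exposes the XML structure
  # as labeled, nested <div> blocks for review in a browser.
  from lxml import etree

  THIN_XSLT = etree.XML(b"""\
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
      <html><body><xsl:apply-templates/></body></html>
    </xsl:template>
    <xsl:template match="*">
      <div class="{local-name()}">
        <b><xsl:value-of select="local-name()"/>: </b>
        <xsl:apply-templates/>
      </div>
    </xsl:template>
  </xsl:stylesheet>""")

  def to_review_html(xml_bytes):
      transform = etree.XSLT(THIN_XSLT)
      return str(transform(etree.XML(xml_bytes)))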
The presentation for testing will need to account for the size of the
output documents. While "snippet" suggests a single page or so of
XML, this may be wishful thinking on my part. If the documents are
really big, one might use an "outlining" stylesheet that allows for
"collapsible" sections of textual display to assist navigation within
the document. This is something I should know more about than I do;
if it's of interest, then make a Request for Clarification (RFC) with
the button at top of my Answer, and I'll put a demo together for you.
5. One more thing: How about adding interactive parser functionality
that will accept manual input if it can't recognize a pattern despite
exception handling? Or having the XML output documents edited
manually, if a problem is too specific to warrant a parser change?
Should this be allowed, given that ongoing revisions will require
repeating this manual change?
Obviously allowing for an XML output document to be edited manually
wouldn't require much programming effort on your part, whereas the
first option sounds to my uneducated ear as if it would require a lot
of effort. You can accomplish revision tracking for the output
documents more or less easily by checking them into a version control
system.
There are some issues with this. You'll need to come up with a naming
convention for the output documents which reflects their "identity"
across changes in the parser, and I have no clue how this might be
done. Also you'll need to come up with an extra "database" that
identifies which output documents are being treated as "manual"
exceptions, so that prior to a run you "check out" only those
documents which are supposed to get automated treatment. I don't
think those are insuperable obstacles, and in fact I think the
identification of the "exceptional" output documents ties in well with
what I suggested above about having "exception" subsections in the
specs.
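That "extra database" needn't be anything grander than a manifest
file that the conversion run consults before writing; a sketch, with
all of the names invented:

  # Sketch only: skip regenerating any snippet listed in a plain-text
  # manifest of hand-maintained exceptions.
  import pathlib

  def load_manual_exceptions(manifest="manual_exceptions.txt"):
      path = pathlib.Path(manifest)
      return set(path.read_text().split()) if path.exists() else set()

  def write_snippet(name, xml_text, out_dir="output", manual=None):
      manual = load_manual_exceptions() if manual is None else manual
      if name in manual:
          return False            # leave the hand-edited version alone
      pathlib.Path(out_dir).mkdir(exist_ok=True)
      pathlib.Path(out_dir, name).write_text(xml_text)
      return True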
My only real objection to this sort of approach, which may be
pragmatically best, is that in principle one would prefer to do the
cleanup on the source data, rather than in ad hoc fashion as a
post-processing phase.
Perhaps for you the concept of interactively directing the parser has
a fairly immediate and easily implemented meaning, one that is more
restrictive than simply allowing the user to do whatever they please.
One aspect of it that I'd drill down on is how the parser is to be
"interrupted" to allow manual interaction. The exceptions are likely
to include not only documents that fail to match patterns, but also
documents that match patterns they were not intended to match. In the
latter case it seems that it might be prohibitively slow to "set
breakpoints" in the software, asking a user to decide in each
circumstance whether to allow automated parsing to continue or to
"interrupt" for manual interaction.
CONCLUSIONS
===========
I've been off thinking and talking to myself about these ideas for too
long, but every time I went back to look over your notes in relation to
my ideas, I got the feeling that my ideas had at least partial
relevance to paths you'd already gone down. I need a reality check,
so I'm putting what I've got together as well as I can tonight for you
to take a look, and I'm standing by for any further clarification!
regards, mathtalk-ga