I have a 1998 book[1] which describes a dynamic programming algorithm
for generating a pairwise alignment using affine gap scores. The
method, outlined below, is from a 1982 journal article[2]. In
particular, I am interested in the application of this method to
finding alignments of amino acid sequences. Is this method is still
in common usage? If it is still used, what, if any, improvements have
been made upon the original technique? If it is not still used, what
has replaced it?
The algorithm takes as input two strings, representing the sequences
to be compared, a scoring matrix, representing the scores for any
possible pairing of two characters, and two numbers, representing the
penalties for beginning and extending a gap. The particular usage I
am dealing with typically uses BLOSUM50 as the matrix, 12 as the
penalty for beginning a gap, and 2 as the penalty for extending a gap.
The strings are letters (case independant) which are the common
single-letter abbreviations of amino acids.
Let m be the length of the first sequence, n be the length of the
second sequence, d be the gap open penalty (12 above), e be the gap
extention penalty (2 above), and s(i,j) be the value of the scoring
matrix corresponding to the ith letter in the first sequence and the
jth letter in the second sequence.
The algorithm uses three matricies, denoted M(i,j),Ix(i,j), and
Iy(i,j). It recursively builds the matricies row by row like so:
M(i,j)=max(M(i-1,j-1), Ix(i-1,j-1),Iy(i-1,j-1))+s(i,j)
Ix(i,j)=max(M(i-1,j)-d, Ix(i-1,j)-e)
Iy(i,j)=max(M(i,j-1)-d, Iy(i,j-1)-e)
It also keeps a pointer along with each matrix value indicating which
of the terms in the max it chose.
The initial values M(i,0), M(0,j), Ix(i,0) and so on depend on the
type of alignment required, as does the point you choose to build the
alignment from using the matrices. For the simplest case, overlap
alignment, we set all of the initial values to 0 and we pick the point
to be the largest value of one of the forms M(i,n), M(m,j),Ix(i,n),
and so on. In other words, we pick largest value on the bottom or
right edge of any of the three matricies.
From there, the algorithm simply traces back the pointers to build the
sequence. An entry in the M matrix represents a pairing, an entry in
one of the I matrices represents a gap. The final output has two
pieces, the final score and the alignment. The alignment looks
something like this:
VLSPAD-K
HL--AESK
where -'s represent gaps in one sequence.
The algorithm can be simplified somewhat by combining the Ix and Iy
matricies into a single matrix I using I(i,j)=max(M(i-1,j)-d,
I(i-1,j)-e,M(i,j-1)-d, I(i,j-1)-e), at the cost of a small potential
of error if any value in the scoring matrix is <-2e.
The original algorithm is O(n^2) and requires memory space on the
order of 3mn. The simplified algorithm is also O(n^2) but requires
only memory space on the order of 2mn. Have these limits been passed?
[1] Durbin, Richard, et. al. Biological sequence analysis -
Probababilistic models of sequence analysis. Cambridge University
Press, 1998.
[2] Gotoh, O. An improved algorighm for matching biological
sequences. 1982. Journal of Molecular Biology 162:705-708. |