We consider the problem of aligning multiple protein sequences with
the goal of maximizing the SP (Sum-of-Pairs) score, when the number of
sequences is large. The QOMA (Quasi-Optimal Multiple Alignment)
algorithm addressed this problem when the number of sequences is
small. However, as the number of sequences increases, QOMA becomes
impractical. This paper develops a new algorithm, QOMA2, which
optimizes the SP score of the alignment of arbitrarily large number of
sequences. Given an initial (potentially sub-optimal) alignment ,
QOMA2 selects short subsequences from this alignment by placing a
window on it. It quickly estimates the amount of improvement that can
be obtained by optimizing the alignment of the subsequences in short
windows on this alignment. This estimate is called the SW (Sum of
Weights) score. It employs a dynamic programming algorithm that
selects the set of window positions with the largest total expected
improvement. It partitions the subsequences within each window into
clusters such that the number of subsequences in each cluster is small
enough to be optimally aligned within a given time. Also, it aims to
select these clusters so that the optimal alignment of the
subsequences in these clusters produces the highest expected SP score.
The experimental results show that QOMA2 produces high SP scores
quickly even for large number of sequences. They also show that the SW
score and the resulting SP score are highly correlated. This implies
that it is promising to aim for optimizing the SW score since it is
much cheaper than aligning multiple sequences optimally. The software
and the benchmark data set are available from the authors on request.