J Michael Sauder - Large-scale comparison of protein sequence alignment algorithms with structure alignments

Version 1

      Publication Details (including relevant citation information):

      Proteins (2000) 40: 6-22.

      Sauder JM, Arthur JW, Dunbrack RL Jr


      Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.
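
      The abstract distinguishes two ways of scoring a sequence alignment against a structure alignment: accuracy on a per-residue basis, and the fraction of the structure alignment that is correctly reproduced. The sketch below is one possible reading of those two measures, not the authors' code. It assumes each alignment can be represented as a set of (i, j) residue-index pairs; the function name alignment_agreement and the toy data are hypothetical and invented purely for illustration.

```python
def alignment_agreement(seq_pairs, struct_pairs):
    """Compare a sequence alignment with a reference structure alignment.

    Each alignment is an iterable of (i, j) tuples meaning that residue i
    of protein A is aligned with residue j of protein B. (Assumed
    representation; the paper does not specify a data format here.)

    Returns (correct, per_residue, coverage):
      correct     - number of pairs present in both alignments
      per_residue - correct / pairs in the sequence alignment
      coverage    - correct / pairs in the structure alignment
    """
    seq = set(seq_pairs)
    ref = set(struct_pairs)
    correct = len(seq & ref)
    per_residue = correct / len(seq) if seq else 0.0
    coverage = correct / len(ref) if ref else 0.0
    return correct, per_residue, coverage


# Toy data, invented for illustration only (not from the paper).
structure = [(i, i + 2) for i in range(1, 11)]    # 10 reference pairs
short_aln = [(1, 3), (2, 4), (3, 5), (4, 7)]      # 4 pairs, 3 agree
long_aln = [(i, i + 2) for i in range(1, 7)] + [(7, 10), (8, 11)]  # 8 pairs, 6 agree

print(alignment_agreement(short_aln, structure))
print(alignment_agreement(long_aln, structure))
```

      In this toy case both alignments have the same per-residue agreement, but the longer one recovers twice as much of the reference alignment, which mirrors the abstract's point that a longer alignment of similar per-residue quality aligns correctly a larger fraction of the structure alignment.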

      Address (URL): http://www.ncbi.nlm.nih.gov/pubmed/10813826