HOME  |   AMDeC |   Columbia Genome Center |   Contact Us|
About The Center
Introduction
What you can do
Using the Facility
Hardware
Software
Databases
Staff
Services for Users
Access
Manuals
Support
Registration
Resources
caWorkBench3.0
Algorithm Reference
Tutorials & Examples
Links
Maps & Directions
Contact Us
 

 

Algorithm Reference and Comparison

Algorithms Accelerated on Facility High-Performance Hardware

 

BLAST
Smith-Waterman
Profile Methods
Hidden Markov Models
GeneWise

BLAST (BlastMachine)

BLAST searches for initial short exact matches, which it then extends. It provides moderate sensitivity and high specificity. Its effectiveness falls off as sequence similarity decreases.

The various translated versions allow searches involving nucleotide sequences to be conducted in the protein sequence space. This provides greater sensitivity because sequence is better conserved at the protein level than at the nucleotide level. Six-frame translation of the query and database sequences allows handling frameshifts caused by nucleotide insertions and deletions.

BLASTN

Nucleotide query vs. nucleotide database.

BLASTP

Peptide query vs. peptide database.

BLASTX

Six-frame translation of nucleotide query sequence vs peptide database.

TBLASTN

Peptide query vs six-frame translation of nucleotide database.

TBLASTX

Six-frame translation of nucleotide query vs six-frame translation of nucleotide database.

PSI-BLAST (Position Specific Iterative BLAST)

Starting with an initial peptide query against a peptide database, iteratively builds up a statistical sequence profile by aligning selected multiple hits from successive runs. At each step after the first, the profile is used as the query. Good for finding distant homologs not found by standard BLAST searches..

PHI-BLAST (Pattern-Hit Initiated BLAST)

Allows searching using a motif. "Combines matching of regular expressions with local alignments surrounding the match". (NCBI)

Mega-BLAST

"Optimized for aligning sequences that differ slightly as a result of sequencing or other similar 'errors'. It is approximately 10 times faster than more common sequence similarity programs and therefore can be used to swiftly compare two large sets of sequences against each other". "MegaBlast implements a greedy algorithm for gapped alignment search". (NCBI)

Smith-Waterman (GeneMatcher2)

The Smith-Waterman algorithm is a dynamic programming algorithm that guarantees finding the optimal local alignment of two sequences. It is slower than BLAST but more sensitive, as for example it does not require initial exactly matching regions to seed the alignment. It will thus give better dectection of remote homologs than BLAST. As it guarantees an optimal sequence alignment, it will give a lower distance between sequences than will BLAST (1).

The GeneMatcher2 also provides the Needleman-Wunsch algorithm, which finds the optimal global alignment. . That is it aligns the entire query and database sequences from beginning to end using whatever insertions/deletions are necessary.

The Paracel algorithms have the ability to find more than one alignment within a single database sequence (Mulitple High Scoring Pairs), so that potentially important hits will not be missed.

The GeneMatcher2 implementation of Smith-Waterman provides three different kinds of gap opening and extension penalties:

Affine - separate gap opening and extension penalties. The default method.

Linear - gap open penalty = gap extension penalty. Fastest but less realistic and accurate than the other penalty models.

Double-Affine - Two sets of open and delete scores are used for gaps. This allows the user to reduce the penalty of a gap extension as a gap grows longer. This algorithm is particularly advantageous when comparing chromosomal DNA to ESTs, as introns are not over-penalized when trying to find a match.

The GeneMatcher2 implementation of the Smith-Waterman algorithms provides versions capable of performing searches using six-frame translations of nucleotide query and database sequences, as described for the BLAST algorithm above.

Affine (swn)

nucleotide query vs. nucleotide database

Affine (swp)

amino acid query vs. amino acid database

Linear (swn or swp)

gap=linear

Double-Affine (swn or swp)

gap=double

Needleman-Wunsch (swn or swp)

alignquery=global, aligndata=global

Frame (swx)

Translated nucleotide query vs. amino acid database

Reverse-Frame (rframe)

Amino acid query vs. translated nucleotide (rframe) database

Double Frame (tswx)

Translated nucleotide query vs. translated nucleotide (rframe) database

GCG Profile Searches (GeneMatcher2)


Profile models are used to represent the similarities between members of a sequence family (typically a protein family or domain).

Profile searches use a position-specific scoring matrix built up from aligned sequences.  The scoring matrix reflects the probability of a given residue occurring at a specific position in the sequence. The matrix describes a particular domain or family.

In normal use, a profile model is used as a query against a sequence database to find sequences similar to a known family. The programs can also be run with the -invert option to query a sequence against a database of profiles to investigate if the sequence belongs to any known families. In the first case, the results are sorted by the query models (domains), in the second case, the results are sorted by the query sequences (assuming multiple queries are submitted).

Searches using GCG format models. The underlying algorithm used is Smith-Waterman (local alignments).

Profile (profile)

Amino acid model vs. amino acid database, or, nucleotide model vs. nucleotide database

-invert option: Amino acid sequence vs. database of amino acid models, or nucleotide sequence vs. database of nucleotide models

Profile-Frame (pframe)

Amino acid profile vs. translated nucleotide (rframe) database.

-invert option: nucleotide sequence vs. database of amino acid profiles

Hidden Markov Model and Prosite Generalized Profile Searches (GeneMatcher2)

The GeneMatcher2 supports Sean Eddy's HMMER programs for searching sequences with Hidden Markov Models (HMMs). HMMs are an enhancement of the Profile methods described above. They add position-specific scoring for insertions and deletions. One can search for local or global alignments; this is a property of the HMM model used. One advantage of using HMMs is that gap penalties have a rigorous theoretical underpinning, unlike in BLAST and Smith-Waterman where they are set empirically.

As with the Profile methods above, search results can be sorted by HMM model (what sequences contain this domain?), or by using the -invert option, sorted by database sequence (what domains are present in this sequence?).

Can also use Prosite Generalized Profile (GP) Models.

Hidden Markov Model (hmm)

amino acid model vs. amino acid database, or, nucleotide model vs. nucleotide database

-invert option : amino acid sequence vs. database of amino acid models, or nucleotide sequence vs. database of nucleotide models

HMM-Frame (hframe)

amino acid model (HMM) vs. translated nucleotide (rframe) database

-invert option: translated nucleotide sequence vs. database of amino acid models (HMMs)

-genomic option: allows for introns.

GeneWise (GeneMatcher2)

Genewise compares genomic sequences to protein reference sequences or HMM models at the protein level, allowing for introns (splicing) and frameshifts. It is "useful for predicting genes and their function in new, possibly unfinished, genomic sequences" (Paracel).

Can also use Prosite Models.

GeneWise (genewise)

amino acid model (HMM) or sequence vs. translated nucleotide (codon) database.

-invert option comparison: translated nucleotide sequence vs. database of amino acid models (HMMs).

 

Notes:

1. In a recent test, Christopher Dwan found BLAST is almost as good as Smith-Waterman for identifying very closely matching sequences, but as sequences become less similar, the results returned by the two methods rapidly diverge. Further, he reports that "Using known homologs from two closely related species, I compared a couple of the popular analyses based on alignments from both BLAST and Smith-Waterman. As might be expected, the results based on BLAST consistently indicated a greater distance between the sequences than those results shown by the optimal alignment".

Parts of this document are based directly on the Paracel FAQ Algorithm Primer.

 

 
Suggestions & Problems? Send e-mail to the Webmaster