September 23, 2005

What is BLAST?

This description is taken from the BLAST web site at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/):

"The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families."

Posted at 9:57 AM

Instructions for using NCBI BLAST and a sample exercise

1. Go to the NCBI home page: http://www.ncbi.nlm.nih.gov

2. Take the link to BLAST on the dark blue bar across the top of the page.

3. Under the heading Nucleotide, choose Nucleotide-nucleotide BLAST (blastn).

4. Copy and paste the DNA sequence below into the text box near the top of the screen. For this simple example you don't need to change anything else on the page. Click the button labelled "BLAST!" to submit your query.

5. It may take several seconds for BLAST to compare your input query sequence to the database. Go ahead and press the "Format!" button. (This will pop open another window.)

6. Your results are displayed in a couple of different ways. Scroll down the results page to first view a graphical display of your query (the thick red bar at the top) and the various hits found by BLAST. Each line represents a different hit in the database; these "subject" sequences are lined up with the portion of the query sequence they match. The color of the line shows how strong the match is.

7. Scroll down a bit farther to view the list of "sequences producing significant alignments." The links at the left of the page (starting with "gi") go to the database sequence records that produced a match. Next come the first several characters of each record's title (often too few to make much sense of). Then comes a score column; without going into details, the higher the score, the better your query matched this subject record. Each score links to a more detailed display of the alignment between your sequence and that database sequence; more on that in step 8. Last comes the E-value, a statistic estimating how many matches with a score at least this good you would expect to find by chance alone in a database of this size. The higher the E-value, the more likely it is that the match is spurious rather than a sign of a true relationship to the query sequence. (Note that the notation used, e.g., 7e-147, means 7 x 10^-147, a very tiny number. Possibly spurious matches will have E-values more like "5.")

8. Click the score link for your best hit to look at the match in detail. Here you can see the full title of the sequence record you've hit, along with a nucleotide-by-nucleotide breakdown of how good the match was. The two sequences (Query and Subject) are lined up against each other, and each matching position is marked with a line between them. Look at other matches further down the page and you'll see that they become less and less convincing towards the bottom.

9. Finally, click the link to the database record (starts with "gi") for one of the good hits. This takes you to a page with info about where this sequence came from (the "Source" organism), hopefully something about what this sequence represents (a gene, several genes, etc.), and other information about the sequence. Some records are much more detailed than others. You should at least be able to figure out what species is represented and hopefully what gene or genes.
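
If the E-value notation from step 7 is unfamiliar, here is a minimal Python sketch showing that strings like "7e-147" are ordinary scientific notation and can be compared against a cutoff. The function names and the cutoff of 1e-5 are my own choices for illustration; they are not anything BLAST defines.

```python
# Hypothetical cutoff chosen only for this example; BLAST does not define it.
EVALUE_CUTOFF = 1e-5


def parse_evalue(text):
    """Turn an E-value as displayed (e.g. "7e-147" or "5") into a float."""
    return float(text.strip())


def looks_significant(text, cutoff=EVALUE_CUTOFF):
    """Return True when the displayed E-value is at or below the cutoff."""
    return parse_evalue(text) <= cutoff


# A few E-values in the style shown on a results page.
for shown in ["7e-147", "2e-30", "0.003", "5"]:
    verdict = "worth a look" if looks_significant(shown) else "possibly spurious"
    print(f"{shown:>8}  {verdict}")
```

Where to draw the line depends on your search; the point is only that smaller E-values mean fewer matches of that quality expected by chance.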


Here's the sample sequence to BLAST:

GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG
GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC
ATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAA
GCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCC
TCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGGCAGACACGGGTACTTTGGGG
ACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTG
CAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGC
GGGCCCCCACCCTGCGAGGCCCGTGAGTGCGTCATGGCCAGGAAGAACTGCGGAGCGACG
GCAACGCCGCTGTGGCGCCGGGACGGCACCGGGCATTACCTGTGCAACTGGGCCTCAGCC
TGCGGGCTCTACCACCGCCTCAACGGCCAGAACCGCCCGCTCATCCGCCCCAAAAAGCGC
CTGCTGGTGAGTAAGCGCGCAGGCACAGTGTGCAGCCACGAGCGTGAAAACTGCCAGACA
TCCACCACCACTCTGTGGCGTCGCAGCCCCATGGGGGACCCCGTCTGCAACAACATTCAC
GCCTGCGGCCTCTACTACAAACTGCACCAAGTGAACCGCCCCCTCACGATGCGCAAAGAC
GGAATCCAAACCCGAAACCGCAAAGTTTCCTCCAAGGGTAAAAAGCGGCGCCCCCCGGGG
GGGGGAAACCCCTCCGCCACCGCGGGAGGGGGCGCTCCTATGGGGGGAGGGGGGGACCCC
TCTATGCCCCCCCCGCCGCCCCCCCCGGCCGCCGCCCCCCCTCAAAGCGACGCTCTGTAC
GCTCTCGGCCCCGTGGTCCTTTCGGGCCATTTTCTGCCCTTTGGAAACTCCGGAGGGTTT
TTTGGGGGGGGGGCGGGGGGTTACACGGCCCCCCCGGGGCTGAGCCCGCAGATTTAAATA
ATAACTCTGACGTGGGCAAGTGGGCCTTGCTGAGAAGACAGTGTAACATAATAATTTGCA
CCTCGGCAATTGCAGAGGGTCGATCTCCACTTTGGACACAACAGGGCTACTCGGTAGGAC
CAGATAAGCACTTTGCTCCCTGGACTGAAAAAGAAAGGATTTATCTGTTTGCTTCTTGCT
GACAAATCCCTGTGAAAGGTAAAAGTCGGACACAGCAATCGATTATTTCTCGCCTGTGTG
AAATTACTGTGAATATTGTAAATATATATATATATATATATATATCTGTATAGAACAGCC
TCGGAGGCGGCATGGACCCAGCGTAGATCATGCTGGATTTGTACTGCCGGAATTC
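
The BLAST web form takes the sequence exactly as pasted, line breaks and all. If you would rather handle it in a script, a small Python sketch like the one below joins the lines into one string and checks that only A, C, G and T are present. The helper names are hypothetical, and only the first two lines of the sequence are included to keep the example short.

```python
def clean_sequence(pasted):
    """Join pasted lines into one uppercase string with all whitespace removed."""
    seq = "".join(pasted.split()).upper()
    if not seq:
        raise ValueError("no sequence found")
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return seq


def gc_fraction(seq):
    """Fraction of the bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# First two lines of the sample sequence, pasted as they appear above.
pasted = """
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG
GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC
"""

seq = clean_sequence(pasted)
print(len(seq), "bases, GC fraction", round(gc_fraction(seq), 2))
```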

This sequence was created by Mark Boguski at NCBI for use in "The Lost World" by Michael Crichton (the sequel to "Jurassic Park"). He has an article in BioTechniques describing his involvement with the sequence: Boguski, Mark S. 1992. A Molecular Biologist Visits Jurassic Park. BioTechniques 12(5): 668-669.

I learned about this sequence through the NCBI NAWBIS course. If you're into conspiracy theories (of a sort...), try running a blastx search on the sequence above. Look at the pairwise alignments of your sequence against any of the first few subject sequences, and watch for a hidden message in the query's translated amino acid string.
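
A blastx search translates the nucleotide query in all six reading frames (three per strand) before comparing it to protein sequences. To get a feel for what that translation looks like, here is a rough Python sketch using the standard genetic code. The function names are mine, incomplete codons at the end of a frame are simply dropped, and this only imitates the translation step, not the search itself. The demo uses just the first line of the sample sequence.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; '*' is a stop, 'X' an unrecognized codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Translate all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    seq = "".join(seq.split()).upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# First line of the sample sequence; paste in the rest to translate it all.
first_line = "GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG"
for name, protein in six_frames(first_line).items():
    print(name, protein)
```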

Posted at 9:53 AM