The purpose of the following exercise is to help you become more familiar with using the various flavors of BLAST to gain knowledge about an "unknown" DNA sequence. The sequences have been grouped according to the disease with which they are associated. It is not necessary to carry out this exercise with each sequence listed; rather, pick the disease below which is of most interest to you and choose one sequence from within that category to work with. Sequences are from either a genomic or mRNA source. Depending on time available, you may wish to repeat the exercise with more than one sequence within the same category or from a different category. The second part of this lab deals with DNA -> protein translation.
Select the disease below which most interests you and choose one sequence to use in the exercise. Copy this sequence (FastA comment line included) into your computer's buffer.
- Familial hypercholesterolemia: sequence 1, sequence 2
- Cystic Fibrosis: sequence 1, sequence 2
- Lipodystrophy: sequence 1, sequence 2
- Nocturnal Asthma: sequence 1, sequence 2
- Myocardial Infarction (susceptibility): sequence 1, sequence 2
- Cholesterol Acyltransferase Deficiency: sequence 1, sequence 2
Go to NCBI's nucleotide BLASTN search site.
Perform a BLASTN search of the sequence against the NCBI non-redundant (nr) database. Paste your selected sequence into the Search window on the BLAST page. Be sure to uncheck the Low complexity box in the second part of the form. After clicking the BLAST! button, a page is displayed for formatting the output.
[BLASTN nr database run - input format: fasta file] Record the length of your sequence (given on the format page as number of letters). After waiting long enough for the search to run, check your results by clicking Format!.
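As a cross-check on the length reported on the format page, the letters in a FastA record can be counted locally. The following is a minimal sketch using only the Python standard library; the function name `read_fasta` and the example record are hypothetical and stand in for the sequence you copied.

```python
def read_fasta(text):
    """Return (header, sequence) for the first record in FastA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]  # comment line without the leading ">"
        else:
            chunks.append(line)
    return header, "".join(chunks)


# Hypothetical example record; substitute the sequence you copied.
example = """>example_fragment hypothetical test record
ATGGCCATTGTAATGGGCCGC
TGAAAGGGTGCCCGATAG
"""

header, seq = read_fasta(example)
print(header)
print(len(seq), "letters")
```

The comment line is not counted, so the number printed should agree with the "number of letters" shown by BLAST for the same input.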
Examine the BLASTN results and answer the following:
1. To what organism does the selected sequence belong?
2. What type of sequence is it (genomic or mRNA)?
Judging from your observations, can these questions be answered from the results of a single BLAST search?
Record the accession number of the best hit.
From the BLASTN output, choose a full-length mRNA which significantly aligns with your fragment. Go to the GenBank annotation page for that mRNA by clicking on the gi#|version#|accession# link of the hit. Use the pull down menu on the GenBank page to display a FastA formatted version of the full-length mRNA. Save this FastA file to your local machine by clicking on "Save" and responding to the prompts with "Save File", storing the file with a name of your choice.
Go back to the GenBank annotation page for the full-length mRNA. Scroll down the page to locate the following pieces of information.
3. What is the mRNA's accession number?
4. What is the mRNA's GI number?
5. On what chromosome is the mRNA sequence located?
6. What part of the sequence actually encodes a protein?
7. What is the protein's function?
LocusLink can be an invaluable source of information about a sequence of interest.
Click on LinkOut in the upper right-hand corner of the GenBank annotation page for your mRNA, and then on the link to NCBI LocusLink. To get to the LocusLink page, you need to click on the Locus ID number. Two of the several useful categories of information found in LocusLink are the Reference sequences and the GenBank Sequences associated with your sequence. From the LocusLink page, find the genomic sequence associated with your mRNA, if one is listed. This will be given as an NCBI Genome Annotation Genomic Contig. If no genomic contig is given, the genomic data currently available for this particular sequence is not of high enough quality for NCBI to use it as reference information. Other, less thoroughly annotated sources will have to be searched for genomic data.
Record the accession number for the genomic sequence.
OMIM (Online Mendelian Inheritance in Man) can be another invaluable source of information.
This resource can be reached by a number of different methods. If working from a LocusLink page, scroll down to the bottom of the page. There, in the Additional Links section, is a link to the relevant OMIM entry.
There is also usually an OMIM link at the top of a GenBank annotation page. However, clicking on this link results in a page with a list of possible OMIM entries.
You can also go directly to the OMIM home page and do a search for keywords of interest.
Since you are already on the LocusLink page, scroll to the bottom of the page and click on the OMIM number link.
Read the information given there to better understand the scope of this resource and the types of information it contains.
Run the mRNA sequence you found through a BLASTN search against the htgs (high throughput genomic sequences) database to see if you can determine the number of exons it contains. This can be done either by pasting in the mRNA's fasta sequence you saved to your local machine or by entering its accession number in the sequence window. Limit the sequence used to only the part that encodes the protein by using the From: To: boxes. Check to make sure that filtering is still turned off.
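If you paste the sequence rather than using the From: To: boxes, the coding region can be cut out locally first. The sketch below assumes the coordinates are 1-based and inclusive, as in a GenBank CDS feature such as `CDS 7..30`. The helper name `subsequence`, the toy mRNA, and the coordinates are all hypothetical.

```python
def subsequence(seq, start, end):
    """Return seq[start..end] using 1-based, inclusive coordinates."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("coordinates outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return seq[start - 1:end]


# Toy mRNA: 6 nt of 5' UTR, a 24 nt coding region, 6 nt of 3' UTR.
mrna = "GGCACC" + "ATGGCCATTGTAATGGGCCGCTGA" + "AATAAA"
cds_from, cds_to = 7, 30  # hypothetical values read from a CDS feature line

cds = subsequence(mrna, cds_from, cds_to)
print(cds)
print(len(cds), "nt; multiple of three:", len(cds) % 3 == 0)
```

A coding region taken from a complete CDS feature should have a length divisible by three, which is a quick way to catch an off-by-one error in the coordinates.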
[BLASTN htgs database run - input format: accession number]
After looking at your results, answer the following questions.
8. Does it look as if htgs is the best source for a genomic sequence in your individual case? Why?
9. Do the organism and chromosome match those listed in the mRNA file?
10. How many exons were found in the best hit?
11. Is the exon coverage complete?
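Questions 10 and 11 can be approached by listing the query (mRNA) ranges of the aligned segments for the best hit. The sketch below is only one way to tally them. It assumes each aligned segment corresponds to one candidate exon, which need not hold when BLAST splits or merges segments, and the ranges shown are hypothetical numbers rather than real results.

```python
def coverage_report(hsps, query_length):
    """Summarise how aligned query ranges cover a query of the given length.

    hsps: list of (query_start, query_end), 1-based and inclusive.
    Returns (number_of_segments, covered_bases, gaps).
    """
    covered = 0
    gaps = []
    next_uncovered = 1  # first query position not yet covered
    for start, end in sorted(hsps):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        if end >= next_uncovered:
            # count only bases not already covered by an earlier segment
            covered += end - max(start, next_uncovered) + 1
            next_uncovered = end + 1
    if next_uncovered <= query_length:
        gaps.append((next_uncovered, query_length))
    return len(hsps), covered, gaps


# Hypothetical query ranges copied from the alignments of one hit.
segments = [(1, 120), (118, 300), (305, 450)]
count, covered, gaps = coverage_report(segments, query_length=500)
print("candidate exons:", count)
print("covered bases:", covered, "of 500")
print("uncovered query ranges:", gaps)
```

Neighbouring segments often overlap by a few bases at exon junctions, so overlapping positions are counted only once. An empty list of uncovered ranges would suggest that exon coverage is complete.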
Using BLAST 2 SEQUENCES, construct an alignment between the full-length mRNA sequence and a genomic sequence associated with your fragment. Use the accession numbers for the two sequences, if possible, placing them in the form's Enter accession or GI boxes. Use the from: and to: boxes to limit the portion of the mRNA sequence used to its coding region. Be sure that filtering is turned off.
[BLAST 2 SEQUENCES run - input format: accession numbers] The genomic sequence to use depends on your individual situation. It could be the best hit from your first BLAST search, if that search found a genomic sequence in the nr database; the best hit from your htgs BLAST search; or the sequence listed as the NCBI Genome Annotation Genomic Contig. The last option only works if the accession number doesn't start with NC, because NC files contain only information on how to assemble the contig, not the actual sequence.
Based on your results, determine how many exons are in this gene. Confirm your answer by looking in Online Mendelian Inheritance in Man (OMIM), if you are not working with a mouse sequence. A direct link to the OMIM listing for your gene can be found on the GenBank annotation page for the mRNA sequence.
Use your mRNA sequence in a BLASTX search against the nr protein databases. Use only the coding region of the mRNA in the search. Be sure that filtering is turned off.
[BLASTX run - input format: accession number] Explore the impact of changing matrices on your search by repeating the BLASTX search, this time using a different matrix. If your first search found lots of very high quality hits use BLOSUM80 for the second run. If your search was light on high quality hits use BLOSUM45.
Print the results of your last run. From this set of results, select at least 8 sequences from different species, saving them as fasta files. Saving too few files will make for a poor tree later on. Give the saved files names that reflect the species of origin. Be sure to save the protein which best represents your initial selected sequence. Print two copies of the GenBank version of the data for the best protein hit and a single copy of two others. These files and hard copy will be used in lab session two.
How many sequences you save at this step depends on what you plan to do with them later. In this exercise, the aim is only to collect a small set of sequences so that later multiple alignments are more realistic and phylogenetic relationships can potentially be examined.
If you were interested in superfamily relationships, the alignments would be carefully gone over to ensure that vital regions of interest were found in each hit. Even hits of very low quality would be examined to find remote members of the family. Superfamily analysis is the topic of one of the additional optional exercises.
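For the file-saving step above, names that reflect the species of origin can be derived from the FastA comment line. The sketch below assumes the organism appears in square brackets at the end of the comment line, which is common for NCBI protein records but not guaranteed. The function name and the example comment line are hypothetical.

```python
import re


def species_filename(comment_line, extension=".fasta"):
    """Build a file name from a trailing '[Genus species]' in a FastA comment line.

    Returns None when no bracketed organism name is found.
    """
    match = re.search(r"\[([^\[\]]+)\]\s*$", comment_line)
    if match is None:
        return None
    species = match.group(1).strip()
    # replace runs of characters that are awkward in file names
    safe = re.sub(r"[^A-Za-z0-9]+", "_", species).strip("_")
    return safe + extension


# Hypothetical comment line; the identifiers are placeholders.
line = ">gi|0000000|ref|XX_000000.0| hypothetical example protein [Homo sapiens]"
print(species_filename(line))
```

If two saved hits come from the same species, the names will collide, so a suffix such as the accession number would need to be added by hand.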
Protein sequences for your gene of interest, inferred and/or experimentally derived, can be obtained in several ways. While using BLASTX is one method of doing this, the simplest way is to directly translate the DNA sequence of interest. This approach is not without its problems.
Go to one of EXPASY's translate tool sites (Canada, China, Korea, Taiwan, USA) and paste in your initial selected sequence. Click on the TRANSLATE SEQUENCE button.
[EXPASY's translate run - input format: raw sequence] The resulting page contains the translation of your sequence in all 6 reading frames (three forward and three reverse).
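To see what the tool is doing, the six-frame translation can be reproduced in a few lines. This is a minimal sketch, not the EXPASY implementation. It assumes the standard genetic code, marks Stops with `*`, writes `X` for codons containing letters other than A, C, G, and T, and uses frame labels (`+1` to `-3`) chosen here for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a Stop."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # X for codons with ambiguous bases
        for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations for three forward and three reverse frames."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rev[offset:])
    return frames


# Toy sequence for illustration only.
frames = six_frames("ATGGCCATTGTAATGGGCCGCTGA")
for name in sorted(frames):
    print(name, frames[name])
```

Trailing bases that do not fill a complete codon are ignored in each frame, which is why the three forward translations can differ in length by one residue.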
Examine the results and pick the one translation with the fewest Stops to work with. [Usually this selection will give the desired results; however, this is not always the case. It depends on the sequence being translated and its origins. A sequence from mRNA that spans the 5'UTR and the beginning of the CDS region will, most of the time, have a methionine as its true start. Noting where the fragment aligned in earlier BLAST runs helps in selecting the translation.] Click on the link for that reading frame. Another page comes up with your chosen translation and additional links that allow you to create a Swiss-Prot database entry of your data. Click on the link that best represents your understanding of the protein. Yet another page comes up. This page has a link to a FASTA format of the data. Save the fasta file.
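The selection rule above (fewest Stops) can be applied mechanically once the translations are in hand. The sketch below ranks frames by Stop count and breaks ties by the longest stretch without a Stop. The tie-break is an addition made here, not part of the exercise. It assumes Stops are written as `*`, so adjust `stop_symbol` if your output uses another character. The three translations are made-up strings.

```python
def count_stops(protein, stop_symbol="*"):
    """Number of Stop symbols in a translated sequence."""
    return protein.count(stop_symbol)


def longest_open_stretch(protein, stop_symbol="*"):
    """Length of the longest run of residues uninterrupted by a Stop."""
    return max(len(part) for part in protein.split(stop_symbol))


# Hypothetical translations of one fragment in the three forward frames.
translations = {
    "frame 1": "MAIVMGR*KGAR*",
    "frame 2": "WPL*WAAERVPD",
    "frame 3": "GHCNGPLKGCPI",
}

ranked = sorted(
    translations,
    key=lambda name: (
        count_stops(translations[name]),
        -longest_open_stretch(translations[name]),
    ),
)
for name in ranked:
    protein = translations[name]
    print(name, count_stops(protein), longest_open_stretch(protein))
```

As the bracketed note says, the top-ranked frame is only a first guess; the alignment against the known protein is what confirms it.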
To ensure that you will have data that works in the next step, also go through and save fasta files of the other two forward reading frame translations.
Use your downloaded translations and the full protein which best represents your selected fragment in an alignment comparison to determine which of these translations is real.
Go to one of EXPASY's SIM Alignment sites (Canada, China, Korea, Taiwan, USA) and paste in your two fasta formatted sequences. Change the type of entry to User-entered sequence for the two sequences. Strip off the > line from the sequences. Use default settings for the rest of the run. Click on the Submit button.
[EXPASY's SIM Alignment run - input format: raw sequences] Scroll down the results page to see if the translated sequence used in the run aligns well with the found protein sequence. A good hit will be a section of 100% identity between the two input sequences. If the first attempt fails, run another one through the process and see if that one works.
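The "section of 100% identity" criterion can also be checked locally by finding the longest stretch shared exactly by the two sequences. This sketch uses `difflib` from the Python standard library as a simple stand-in. It is not the SIM algorithm and does not handle gaps or mismatches. Both sequences are invented for illustration.

```python
from difflib import SequenceMatcher


def longest_identical_stretch(translated, protein):
    """Return (stretch, start_in_translated, start_in_protein), 0-based starts."""
    matcher = SequenceMatcher(None, translated, protein, autojunk=False)
    match = matcher.find_longest_match(0, len(translated), 0, len(protein))
    stretch = translated[match.a:match.a + match.size]
    return stretch, match.a, match.b


# Hypothetical full-length protein and a translated fragment.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
translated = "XXAKQRQISFVKSHFSRQ*LL"

stretch, t_start, p_start = longest_identical_stretch(translated, protein)
print(stretch)
print("length:", len(stretch))
print("starts at residue", p_start + 1, "of the protein")
```

A long identical stretch suggests the chosen reading frame is the real one; a stretch of only a few residues is what the wrong frames typically produce.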
The translation process cannot tell whether it has reached the end of an exon. In addition, the DNA fragment used may not respect codon boundaries, so a translation can contain additional Stops that confuse the choice of the real reading frame.
12. Did you have problems with this procedure? If so, why?