This document contains a very detailed version of the exercise given in Gene Discovery. It assumes that the user knows how to cut and paste text from a computer screen and can use a text editor (word processing program). It is divided up into sections so that a user can move to a specific area from the main lab page. To completely understand a given section, read all the material contained in its links. Detailed instructions are given in green text on the html page and are italicized in the hard copy.
![]()
Introduction
BLASTN session (sequence vs nr)
actual run
mRNA file
results
LocusLink
OMIM
BLAT
Gene Discovery
The purpose of the following exercise is to help you become more familiar with using the various flavors of BLAST to gain knowledge about an "unknown" DNA sequence. The sequences given below are from either a genomic or mRNA source.
Read the information in the BLAST link. It contains details on the composition of some of the databases accessed from the BLASTN site.
Select a sequence to use in the exercise. Copy this sequence (FastA comment line included) into your computer's buffer.
Familial hypercholesterolemia: sequence 1 (mRNA)
sequence 2 (genomic)
In this example the following sequence will be used to perform the desired tasks.
>TRIAL
CCCACAGGGGGACCGGCCCTGTGACCCCTCACCGGGGCCGTGGGCCCGAGCCCCGGACTT
CCCTAAGCCGGCAATGACCGCCTGCGCCCGCCGAGCGGGTGGGCTTCCGGACCCCGGGCT
CTGCGGTCCCGCGTGGTGGGCTCCGTCCCTGCCCCGCCTCCCCCGGGCCCTGCCCCGGCT
CCCGCTCCTGCTGCTCCTGCTTCTGCTGCAGCCCCCCGCCCTCTCCGCCGTGTTCACGGT
GGGGGTCCTGGGCCCCTGGGCTTGCGACCCCATCTTCTCTCGGGCTCGCCCGGACCTGGC
CGCCCGCCTGGCCGCCGCCCGCCTGAACCGCGACCCCGGCCTGGCAGGCGGTCCCCGCTT
CGAGGTAGCGCTGCTGCCCGAGCCTTGCCGGACGCCGGGCTCGCTGGGGGCCGTGTCCTC
CGCGCTGGCCCGCGTGTCGGGCCTCGTGGGTCCGGTGAACCCTGCGGCCTGCCGGCCAGC
BLASTN session (sequence vs nr) - actual run
Go to NCBI's nucleotide BLASTN search site.
Perform a BLASTN search of the sequence against the NCBI non-redundant (nr) database. Paste your selected sequence into the Search window on the BLAST page.
[BLASTN nr database run - input format: fasta file] This is a simple BLASTN run. Of the three parts of the page displayed, the only ones needed for this run are the first two. The first section of the page is given in the image below.
![]()
Paste your sequence into the window location shown by the shaded box in the following image.
![]()
Don't be concerned if the sequence isn't completely visible in the display window.
![]()
Be sure to uncheck the Low complexity box in the second part of the form. This ensures that areas of repeating characters within the sequence of interest will be included in the database search.
The database search will take a little longer, but areas considered to be of low complexity [repetitive characters] will be taken into account. The resulting scores will be higher than if filtering were left checked.
The following image is that of the second section of the page as it initially appears.
![]()
Use your cursor to click on the Low complexity box and turn the filtering process off. This second image is what this part of the form should look like before the search is submitted.
![]()
Once the sequence is in place and the filtering has been turned off, click on the BLAST! button marked by the black arrow in the following image. The default database for a BLASTN run is the nr database; therefore it isn't necessary to select a database for this particular run.
![]()
After clicking the BLAST! button, a page is displayed for formatting the output.
Record the length of your sequence (given on the format page as number of letters).
The length information is given in the part of the page marked by the black arrow in the following image.
![]()
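The number of letters reported on the format page can be checked locally before submitting. The following is a minimal Python sketch, not part of the original exercise; the function name `read_fasta_record` is made up for illustration, and only the first two lines of the example TRIAL sequence are used to keep it short.

```python
def read_fasta_record(text):
    """Return (comment_line, sequence) for the first record in FASTA text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; keep only the first
            header = line[1:]
        elif header is not None:
            # Drop internal spaces so wrapped or spaced sequence is counted correctly.
            chunks.append("".join(line.split()))
    if header is None:
        raise ValueError("no FASTA comment line (starting with '>') found")
    return header, "".join(chunks)


# First two lines of the example TRIAL sequence from this exercise.
example = """>TRIAL
CCCACAGGGGGACCGGCCCTGTGACCCCTCACCGGGGCCGTGGGCCCGAGCCCCGGACTT
CCCTAAGCCGGCAATGACCGCCTGCGCCCGCCGAGCGGGTGGGCTTCCGGACCCCGGGCT
"""

header, seq = read_fasta_record(example)
print(header, "contains", len(seq), "letters")
```

Running the same function on the complete sequence should give the number of letters shown on the format page.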
After waiting long enough for the search to run, check your results by clicking Format!.
NCBI's BLAST sites are currently using queues to handle their heavy searching load. When a search request is received, it is placed in a queue. There is an estimate given below the Format! button on the page as to how long the search will take. As the size of the database increases, it takes longer for a search. NCBI has very fast machines and usually a search takes less than 2 minutes regardless of what the estimate says. Searches go faster during off hours than during the work day on the east coast (5 am - 2 pm PST).
Examine the BLASTN results and answer the following:
Read the information given in the hints link. If necessary, also look at the definitions material to help clarify any confusing terms.
What follows is a portion of the BLASTN results using the example sequence against the nr database. The various parts of the output file are explained. [The information consists of images which contain no active links.]
The BLASTN results page starts by referencing the paper in which the latest version of the BLAST program was described. The length of the query sequence (the one you pasted in) is given as well as the name of the database used and its length. The length of the database is used in determining the scoring of found matches.
![]()
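The role of the database length in scoring can be illustrated with the standard relationship between a bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n is the database length. The Python sketch below is illustrative only: the function name and the database sizes are assumptions, and NCBI uses adjusted (effective) lengths, so the numbers will not match a real report exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical database sizes (in letters), chosen only to show the trend.
small_db = 10**9
large_db = 10**10

for bits in (40, 60, 100):
    e_small = expected_hits(bits, 480, small_db)
    e_large = expected_hits(bits, 480, large_db)
    print(f"bit score {bits}: E = {e_small:.3g} (small db), {e_large:.3g} (large db)")
```

For a fixed bit score, a tenfold larger database gives a tenfold larger E value, which is why the same alignment looks less significant as the database grows.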
Next is an image of the search results. The lines in this image are color coded to reflect the quality of found matches. Red and magenta lines represent high quality hits. On a real page you can go directly to an alignment by moving the mouse to the desired colored line and clicking.
![]()
In the example case, there are at least four very good full-length matches. [The red lines (bracketed in the hard copy) going the full width of the image.]
Below is the listing of the top matches and their scores. The results are given going from the best to worst match that meets the search criteria.
![]()
Each line of the list starts with a link to the actual data entry in the appropriate database. The link information is listed as the GI number, the database and the GenBank version number and finally the accession number if different from the version. Follow the links given for these numbers for more details on their meaning.
example link to database entry [second entry on the list]
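The parts of such a link can be pulled apart programmatically. The sketch below assumes the pipe-delimited layout `gi|number|database|accession.version|` that these hit lists use; the function name is made up, and the version suffix `.1` in the example string is an assumption added for illustration.

```python
def parse_hit_id(identifier):
    """Split an identifier of the form gi|number|database|accession.version|."""
    fields = identifier.strip("|").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    gi_text, database, versioned = fields[1], fields[2], fields[3]
    # The version number, when present, follows the accession after a dot.
    accession, _, version = versioned.partition(".")
    return {
        "gi": int(gi_text),
        "database": database,
        "accession": accession,
        "version": int(version) if version else None,
    }


# GI and accession are the example answers from this exercise; ".1" is assumed.
print(parse_hit_id("gi|4504216|ref|NM_000180.1|"))
```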
The line continues with as much of the definition line from the data entry for the match as will fit in the allotted space. The bit score and E value finish out the line. Clicking on the bit score will move you to the actual alignment between your sequence and the database match. The smaller the E value the better the match.
Clicking on the bit score link for the second hit in the example run results in the following alignment.
![]()
Look at the alignment and find out just how much of the hit was covered in the matching sequence. Then check in the database entry to see where the matching region is located in the database sequence.
1. Is there more than one matched segment in the database entry?
2. Do these multiple segments appear in widely separated locations?
3. Does the match span coding and non-coding regions of the database entry?
4. Is the match only from a protein-coding region of the database entry?
1. To what organism does the selected sequence belong?
Look at the best hits. Are there any with 100% matches over the entire length of the query sequence? If there is more than one of these, do they all come from the same organism? If so, this is probably the organism from which the sequence came.
The top four matches of the example run are all from human sequences and have excellent E values (0.0).
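The reasoning above can be written down as a small check. This Python sketch uses invented hit records (the `Hit` fields, accessions, and numbers are all hypothetical, not taken from the example run) and simply asks whether every full-length, 100% identical hit comes from one organism.

```python
from collections import namedtuple

# Hypothetical record layout; a real BLAST report would have to be parsed into it.
Hit = namedtuple("Hit", "accession organism percent_identity align_length")


def likely_source_organism(hits, query_length):
    """Return the organism shared by all full-length perfect hits, else None."""
    perfect = [
        h for h in hits
        if h.percent_identity == 100.0 and h.align_length >= query_length
    ]
    organisms = {h.organism for h in perfect}
    if len(organisms) == 1:
        return organisms.pop()
    return None  # no perfect full-length hit, or hits from several organisms


# Invented hits for a 480-letter query.
hits = [
    Hit("HIT_A", "Homo sapiens", 100.0, 480),
    Hit("HIT_B", "Homo sapiens", 100.0, 480),
    Hit("HIT_C", "Mus musculus", 88.5, 452),
]
print(likely_source_organism(hits, query_length=480))
```

A result of `None` corresponds to the case where a single search is not enough to name the organism.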
2. What type of sequence is it (genomic or mRNA)?
Read the link and see if you can determine the answer to the question based on the type of hits.
The example sequence is from a HUMAN mRNA sequence. It starts at the beginning of the 5'UTR region and extends 406 bases into the coding sequence of the protein.
Based on your observations, is it possible to answer these questions based on the results from a single BLAST search?
Some DNA sequences are from genes for which there is little or no information in the databases. Others could be from non-gene areas. Such DNA samples will have either only low quality matches or matches from only short sequence stretches. In such cases, good quality matches will only be possible from searching genomic sequences.
Record the accession number of the best hit.
From the BLASTN output, choose a full-length mRNA which significantly aligns with your fragment.
Look down the list of hits. Sometimes the term mRNA will be close enough to the beginning of the data entry's definition line to appear in the hit list. If not, you will need to click on the various top links until a high quality mRNA sequence is found.
Go to the GenBank annotation page for that mRNA by clicking on the gi#|version#|accession# link of the hit.
The top of such a page is given below.
![]()
Use the pull down menu on the GenBank page to display a FastA formatted version of the full-length mRNA.
If you are uncertain about what the term FastA formatted version means, go to its link and refresh your memory. This is the format of the original data file that was pasted into the BLASTN search.
The necessary menu for this task is the menu contained in the button currently displaying the term Default View. Click on the two arrows of that button to get a listing of its various options. The desired one is FASTA. Select that option and then click on the Display button to update the data on the screen.
Save this FastA file to your local machine by clicking on the Save button from the top of the NCBI page and responding to the prompts with "Save File", storing the file with a name of your choice.
Go back to the GenBank annotation page for the full-length mRNA.
Change the format of the data displayed on the screen back to the Default View (actually this is the GenBank version) by clicking on the arrows of the FASTA button to get the menu and selecting either Default View or GenBank and then clicking on the Display button.
Scroll down the page to locate the following pieces of information.
3. What is the mRNA's accession number?
4. What is the mRNA's GI number?
5. On what chromosome is the mRNA sequence located?
6. What part of the sequence actually encodes for a protein?
7. What is the protein's function?
Read the material in the hints link. It provides the necessary information to locate the answers. The link below is an image of the example's mRNA sequence. In this image the accession number is colored red, the GI number blue, the chromosome information green and the coding region data orange. [In the hard copy version these items are marked by arrows ( -> text <- ).]
resulting example sequence answers
3. NM_000180
4. 4504216
5. chromosome 17
6. 74..3385
7. guanylate cyclase
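The coding region answer (74..3385) can be turned into a length with a little arithmetic. The Python sketch below handles only the simple `start..end` form shown here; GenBank locations can be more complex, and the function name is an assumption made for illustration.

```python
def parse_cds_location(location):
    """Parse a simple 'start..end' location into 1-based inclusive integers."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid location: {location!r}")
    return start, end


start, end = parse_cds_location("74..3385")
cds_length = end - start + 1          # both end points are included
codons, remainder = divmod(cds_length, 3)

print("coding region length:", cds_length)         # 3312
print("codons (including the stop codon):", codons)  # 1104
print("left-over bases:", remainder)                # 0
```

A remainder of zero is a quick sanity check that the recorded coordinates describe a whole number of codons.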
LocusLink can be an invaluable source of information about a sequence of interest.
Click on LinkOut in the upper right hand corner of the GenBank annotation page for your mRNA,
example LinkOut link
and then the link to NCBI LocusLink.
example NCBI LocusLink link
To get to the LocusLink page, you need to click on the Locus ID number.
example LocusLink number link
Two of several useful categories of information found in LocusLink are Reference sequences associated with your sequence and GenBank Sequences associated with your segment. From the LocusLink page, find the genomic sequence associated with your mRNA, if one is listed. This will be given as an NCBI Genome Annotation Genomic Contig. If no genomic contig is given, this means that there currently isn't high enough quality genomic data for this particular sequence to be used by NCBI as reference information. Other less annotated sources will have to be searched for genomic data.
The Genomic Contig for the example is NT_026472. Since this sequence accession code does not start with NC, it actually contains sequence information. If the Genomic Contig did start with NC, it would be necessary to save the fasta format of the data in order to use actual sequence data in later steps.
Record the accession number for the genomic sequence.
OMIM (Online Mendelian Inheritance in Man) can be another invaluable source of information.
This resource can be reached by a number of different methods. If working from a LocusLink page, scroll down to the bottom of the page. There, in the Additional Links section, is a link to the relevant OMIM entry.
There is also usually an OMIM link at the top of a GenBank annotation page. However, clicking on this link results in a page with a list of possible OMIM entries.
You can also go directly to the OMIM home page and do a search for keywords of interest.
Since you are already on the LocusLink page, scroll to the bottom of the page and click on the OMIM number link.
Given below is the bottom part of the example LocusLink from the previous section.
![]()
Click on the number given on the OMIM: line to go to the relevant OMIM entry. The top of that entry for the example is given below.
![]()
If you were to try to reach this information by the GenBank annotation page, you would click on the link marked by the arrow in the following image.
![]()
This would bring up a page with a listing of possible OMIM entries to choose from. The example one is given below.
![]()
Or you could go directly to the OMIM home page and do a search there for keywords of interest. The image below is that of the OMIM home page.
![]()
Read the information given here to better understand the scope of this resource and the types of information it contains. Check to see if the number of exons contained in the gene is given.
Most OMIM entries give the number of exons contained in the gene somewhere in its description. Scan the documentation to see if you can find this information.
The OMIM entry for the example gene contains the following information:
Perrault et al. (1996) determined that the human GUC2B gene is 16 kb long and contains 20 exons; it is 87% identical at the protein level to its mouse counterpart.
Use the mRNA sequence found previously to see if you can determine the number of exons it contains. Open the saved fasta formatted file, trim off the non-coding regions and save the modified data under a new name.
For the example mRNA file this would mean trimming off the sections not highlighted in the following image since the coding region is from 74 to 3385.
![]()
FASTA formatted files from NCBI have 70 characters to a line. Do the math to figure out where to look for the starting and ending points of the coding region of interest. The desired region should start with ATG and end with a stop codon: TAA, TAG or TGA.
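The math for locating a position in a file with 70 characters per line can be sketched in Python as follows. The function names are made up for illustration, line numbers count sequence lines only (the comment line is not included), and the toy sequence at the end is invented to show the trimming check.

```python
LINE_WIDTH = 70  # characters per sequence line in an NCBI FASTA file
STOP_CODONS = {"TAA", "TAG", "TGA"}


def line_and_column(position, width=LINE_WIDTH):
    """Map a 1-based sequence position to (sequence line, column), both 1-based."""
    index = position - 1
    return index // width + 1, index % width + 1


def extract_cds(sequence, start, end):
    """Cut out positions start..end (1-based, inclusive) and sanity-check them."""
    cds = sequence[start - 1:end].upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding region is not a whole number of codons")
    if not cds.startswith("ATG"):
        raise ValueError("coding region does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        raise ValueError("coding region does not end with a stop codon")
    return cds


# Coordinates of the example coding region, 74..3385.
print(line_and_column(74))    # (2, 4): second sequence line, fourth character
print(line_and_column(3385))  # (49, 25)

# Invented toy sequence: two leading and two trailing non-coding bases.
print(extract_cds("GGATGAAATAACC", 3, 11))  # ATGAAATAA
```

If `extract_cds` raises an error on the real file, the most likely cause is an off-by-one slip when counting lines and columns.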
The url used at this point depends on the source of the sequence being used.
[BLAT run - input format: raw sequence] human sequence
Go to the human BLAT search site. Paste in only the sequence from your modified mRNA sequence file. Click on the Submit button at the top right-hand side of the page to run the search.
A filled out page is shown below.
![]()
On the resulting BLAT Search Results page, click on the details link for the hit on the proper chromosome.
The search shows two possible hits. Previously collected information gives chromosome 17 as the location of the gene.
![]()
The next page gives two sets of information. The listing on the left side of the page shows a number of blocks. The number of blocks corresponds to the number of exons in the submitted sequence.
![]()
Eighteen blocks are shown on this page. Therefore, there are 18 exons in this gene. If the entire mRNA sequence had been used, more blocks would be given, because it includes the 5' and 3' sequence stretches.
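Counting blocks can also be done from BLAT's tabular PSL output, where the `blockSizes` and `tStarts` fields are comma-separated lists. The Python sketch below uses invented values for those two fields (they are not from the example run) and the function names are assumptions; it counts the blocks and measures the genomic gaps between them. The block count matches the exon count only when each exon aligns as a single ungapped block.

```python
def parse_blocks(block_sizes, t_starts):
    """Pair up PSL blockSizes and tStarts fields (comma-separated, trailing comma)."""
    sizes = [int(x) for x in block_sizes.split(",") if x]
    starts = [int(x) for x in t_starts.split(",") if x]
    if len(sizes) != len(starts):
        raise ValueError("blockSizes and tStarts have different lengths")
    return list(zip(starts, sizes))


def gap_lengths(blocks):
    """Genomic distance between the end of each block and the start of the next."""
    return [
        next_start - (start + size)
        for (start, size), (next_start, _) in zip(blocks, blocks[1:])
    ]


# Invented field values for a three-block alignment.
blocks = parse_blocks("120,85,200,", "1000,1500,2400,")
print("blocks:", len(blocks))             # 3
print("gap lengths:", gap_lengths(blocks))  # [380, 815]
```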
mouse sequence
Go to the mouse BLAT search site. Follow the instructions given in the human sequence section to conduct your analysis.
See the data presented above.
Return to the BLAT Search Results page. This time explore the site by selecting the browser link. Check out the search information displayed on the sequence used.
Based on your results, how many exons are in this gene? For those with a human sequence, confirm your answer by looking in Online Mendelian Inheritance in Man (OMIM). A direct link to the OMIM listing for your gene can be found on the GenBank annotation page for the mRNA sequence.
Based on the human BLAT results given above, there are 18 exons in this gene. The OMIM data says that the human GUC2B gene is 16 kb long and contains 20 exons. The OMIM figure comes from work published in 1996, and the paper would have to be checked to see where the difference arises. Keep in mind that the BLAT query was trimmed to the coding region, so it omits the 5' and 3' sequence stretches.