Detailed Example for Lab 1

This document is a very detailed version of the exercise given in lab 1. It assumes that you know how to cut and paste text from a computer screen and can use a text editor (or word processing program). It is divided into sections so that you can move to a specific area from the main lab page. To completely understand a given section, read all the material contained in its links. Detailed instructions are given in green text on the HTML page and are italicized in the hard copy.

Introduction

Lab One

The purpose of the following exercise is to help you become more familiar with using the various flavors of BLAST to gain knowledge about an "unknown" DNA sequence. The sequences have been grouped according to the disease with which they are associated. It is not necessary to carry out this exercise with each sequence listed; rather, pick the disease below which is of most interest to you and choose one sequence from within that category to work with. Sequences are from either a genomic or mRNA source. Depending on time available, you may wish to repeat the exercise with more than one sequence within the same category or from a different category. The second part of this lab deals with DNA -> protein translation.

Read the information in the BLAST link. It contains details on the composition of some of the databases accessed from the BLASTN site.


Select the disease below which most interests you and choose one sequence to use in the exercise. Copy this sequence (FastA comment line included) into your computer's buffer.

Familial hypercholesterolemia: sequence 1 sequence 2
Cystic Fibrosis: sequence 1 sequence 2
Lipodystrophy: sequence 1 sequence 2
Nocturnal Asthma: sequence 1 sequence 2
Myocardial Infarction (susceptibility): sequence 1 sequence 2
Cholesterol Acyltransferase Deficiency: sequence 1 sequence 2

In this example the following sequence will be used to perform the desired tasks.

>TRIAL
CCCACAGGGGGACCGGCCCTGTGACCCCTCACCGGGGCCGTGGGCCCGAGCCCCGGACTT
CCCTAAGCCGGCAATGACCGCCTGCGCCCGCCGAGCGGGTGGGCTTCCGGACCCCGGGCT
CTGCGGTCCCGCGTGGTGGGCTCCGTCCCTGCCCCGCCTCCCCCGGGCCCTGCCCCGGCT
CCCGCTCCTGCTGCTCCTGCTTCTGCTGCAGCCCCCCGCCCTCTCCGCCGTGTTCACGGT
GGGGGTCCTGGGCCCCTGGGCTTGCGACCCCATCTTCTCTCGGGCTCGCCCGGACCTGGC
CGCCCGCCTGGCCGCCGCCCGCCTGAACCGCGACCCCGGCCTGGCAGGCGGTCCCCGCTT
CGAGGTAGCGCTGCTGCCCGAGCCTTGCCGGACGCCGGGCTCGCTGGGGGCCGTGTCCTC
CGCGCTGGCCCGCGTGTCGGGCCTCGTGGGTCCGGTGAACCCTGCGGCCTGCCGGCCAGC
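
As a rough illustration of the FastA layout used throughout this lab, the following Python sketch splits a record into its comment line and its sequence and reports the sequence length. The function name `read_fasta_record` is made up for this example, and only the first two lines of the TRIAL sequence are included to keep it short.

```python
def read_fasta_record(text):
    """Split one FastA record into (comment, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FastA record must start with a '>' comment line")
    comment = lines[0][1:].strip()          # text after the '>' marker
    sequence = "".join(lines[1:]).upper()   # join the wrapped sequence lines
    return comment, sequence


# First two lines of the TRIAL record (shortened for illustration).
record = """>TRIAL
CCCACAGGGGGACCGGCCCTGTGACCCCTCACCGGGGCCGTGGGCCCGAGCCCCGGACTT
CCCTAAGCCGGCAATGACCGCCTGCGCCCGCCGAGCGGGTGGGCTTCCGGACCCCGGGCT
"""

comment, sequence = read_fasta_record(record)
print(comment, len(sequence))
```

The length printed here corresponds to the "number of letters" that BLAST reports for a query.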



BLASTN session (sequence vs nr) - actual run

Go to NCBI's nucleotide BLASTN search site.

Perform a BLASTN search of the sequence against the NCBI non-redundant (nr) database. Paste your selected sequence into the Search window on the BLAST page.

[BLASTN nr database run - input format: fasta file]

This is a simple BLASTN run. Of the three parts of the page displayed, the only ones needed for this run are the first two. The first section of the page is given in the image below.

Paste your sequence into the window location shown by the shaded box in the following image.

Don't be concerned if the sequence isn't completely visible in the display window.

Be sure to uncheck the Low complexity box in the second part of the form.

This makes the database search take a little longer, but it ensures that areas considered to be of low complexity [repetitive characters] are taken into account. The resulting scores will be higher than if filtering were left on.
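
To give a feel for what "low complexity" means, here is a minimal Python sketch that flags windows of a sequence whose base composition is very uneven, using Shannon entropy. This is only an illustration: it is not the filter that BLAST applies, and the window size and threshold are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits per base: 0 for one repeated base, 2 for equal A/C/G/T."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(sequence, size=12, threshold=1.0):
    """Return 0-based start positions of windows whose entropy is below threshold."""
    return [
        start
        for start in range(len(sequence) - size + 1)
        if shannon_entropy(sequence[start:start + size]) < threshold
    ]


print(low_complexity_windows("AAAAAAAAAAAAACGT"))   # a run of A's is flagged
print(low_complexity_windows("ACGTACGTACGT"))       # evenly mixed bases are not
```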

The following image is that of the second section of the page as it initially appears.

Use your cursor to click on the Low complexity box and turn the filtering process off. This second image is what this part of the form should look like before the search is submitted.

Once the sequence is in place and the filtering has been turned off, click on the BLAST! button marked by the black arrow in the following image. The default database for a BLASTN run is the nr database; therefore it isn't necessary to select a database for this particular run.

After clicking the BLAST! button, a page is displayed for formatting the output.

Record the length of your sequence (given on the format page as number of letters).

The length information is given in the part of the page marked by the black arrow in the following image.

After waiting long enough for the search to run, check your results by clicking Format!.

NCBI's BLAST sites currently use queues to handle their heavy search load. When a search request is received, it is placed in a queue. An estimate of how long the search will take is given below the Format! button on the page. Searches take longer as the database grows, but NCBI has very fast machines and a search usually takes less than 2 minutes regardless of what the estimate says. Searches go faster during off hours than during the work day on the east coast (5 am - 2 pm PST).



BLASTN session - results

Examine the BLASTN results and answer the following:

hints1

Read the information given in the hints link. If necessary, also look at the definitions material to help clarify any confusing terms.

What follows is a portion of the BLASTN results using the example sequence against the nr database. The various parts of the output file are explained. [The information consists of images which contain no active links.]

The BLASTN results page starts by referencing the paper in which the latest version of the BLAST program was described. The length of the query sequence (the one you pasted in) is given as well as the name of the database used and its length. The length of the database is used in determining the scoring of found matches.

Next is an image of the search results. The lines in this image are color coded to reflect the quality of found matches. Red and magenta lines represent high quality hits. On a real page you can go directly to an alignment by moving the mouse to the desired colored line and clicking.

In the example case, there are at least four very good full-length matches (the red lines, bracketed in the hard copy, that span the full width of the image).

Below is the listing of the top matches and their scores. The results are given going from the best to worst match that meets the search criteria.

Each line of the list starts with a link to the actual data entry in the appropriate database. The link information is listed as the GI number, the database and the GenBank version number and finally the accession number if different from the version. Follow the links given for these numbers for more details on their meaning.
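
The identifier at the start of each hit line is a set of fields separated by vertical bars. The following Python sketch pulls those fields apart, assuming the layout `gi|number|database|version|` described above. The function name and the identifier used in the example are invented for illustration and do not refer to a real entry.

```python
def parse_hit_identifier(identifier):
    """Split a 'gi|number|database|version|extra' identifier into its parts."""
    fields = identifier.strip().split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError("expected an identifier of the form gi|number|db|version|")
    version = fields[3]
    return {
        "gi": fields[1],
        "database": fields[2],
        "version": version,
        "accession": version.split(".")[0],   # version without its '.N' suffix
        "extra": fields[4] if len(fields) > 4 and fields[4] else None,
    }


# Made-up identifier, used only to show the layout.
print(parse_hit_identifier("gi|1234567|gb|AB000001.1|"))
```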

example link to database entry [second entry on the list]

The line continues with as much of the definition line from the data entry for the match as will fit in the allotted space. The bit score and E value finish out the line. Clicking on the bit score will move you to the actual alignment between your sequence and the database match. The smaller the E value the better the match.
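
The link between bit score and E value can be sketched with the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. Treat this as an approximation: the values NCBI reports are computed with adjusted lengths, so they will not match exactly, and the database length below is a made-up number.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: expected number of chance hits scoring this well."""
    return query_length * database_length * 2.0 ** (-bit_score)


query_length = 480               # length of the example fragment
database_length = 1_000_000_000  # assumed size, for illustration only

for bits in (20, 40, 80):
    print(bits, expect_value(bits, query_length, database_length))
```

Each additional bit halves the E value, and doubling the database doubles it, which is why the database length matters when matches are scored.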

Clicking on the bit score link for the second hit in the example run results in the following alignment.

Look at the alignment and find out just how much of the hit was covered in the matching sequence. Then check in the database entry to see where the matching region is located in the database sequence.

1. Is there more than one matched segment in the database entry?
2. Do these multiple segments appear in widely separated locations?
3. Does the match span coding and non-coding regions of the database entry?
4. Is the match only from a protein-coding region of the database entry?



1. To what organism does the selected sequence belong?

Look at the best hits. Are there any with 100% matches over the entire length of the query sequence? If there is more than one of these, do they all come from the same organism? If so, this is probably the organism from which the sequence came.

The top four matches of the example run are all from human sequences and have excellent E values (0.0).
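
The reasoning above (look for perfect full-length hits, then check that they agree on the organism) can be written down as a short Python sketch. The record layout, field names and values are hypothetical; BLAST does not hand results back in this form.

```python
def likely_organism(hits, query_length):
    """Return the organism shared by all perfect full-length hits, else None."""
    perfect = [
        hit for hit in hits
        if hit["identities"] == query_length
        and hit["aligned_length"] == query_length
    ]
    organisms = {hit["organism"] for hit in perfect}
    return organisms.pop() if len(organisms) == 1 else None


# Hypothetical hit records, invented for illustration.
example_hits = [
    {"organism": "Homo sapiens", "identities": 480, "aligned_length": 480},
    {"organism": "Homo sapiens", "identities": 480, "aligned_length": 480},
    {"organism": "Mus musculus", "identities": 402, "aligned_length": 470},
]

print(likely_organism(example_hits, query_length=480))
```

If the perfect hits disagree, or there are none, the function returns None, which corresponds to the case where a single BLAST search is not enough to answer the question.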

2. What type of sequence is it (genomic or mRNA)?

Read the link and see if you can determine the answer to the question based on the type of hits.

The example sequence is from a human mRNA sequence. It starts at the beginning of the 5' UTR and extends 406 bases into the coding sequence of the protein.

Based on your observations, is it possible to answer these questions based on the results from a single BLAST search?

Some DNA sequences are from genes for which there is little or no information in the databases. Others could be from non-gene areas. Such DNA samples will have either only low quality matches or matches from only short sequence stretches. In such cases, good quality matches will only be possible from searching genomic sequences.

Record the accession number of the best hit.



mRNA task

From the BLASTN output, choose a full-length mRNA which significantly aligns with your fragment.

Look down the list of hits. Sometimes the term mRNA will be close enough to the beginning of the data entry's definition line to appear in the hit list. If not, you will need to click on the various top links until a high quality mRNA sequence is found.

Go to the GenBank annotation page for that mRNA by clicking on the gi#|version#|accession# link of the hit.

The top of such a page is given below.

Use the pull down menu on the GenBank page to display a FastA formatted version of the full-length mRNA. If you are uncertain about what the term FastA formatted version means, go to its link and refresh your memory. This is the format of the original data file that was pasted into the BLASTN search.

The necessary menu for this task is the menu contained in the button currently displaying the term Default View. Click on the two arrows of that button to get a listing of its various options. The desired one is FASTA. Select that option and then click on the Display button to update the data on the screen.

Save this FastA file to your local machine by clicking on the Save button from the top of the NCBI page and responding to the prompts with "Save File", storing the file with a name of your choice.
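
Outside the browser, a FastA record can also be requested from NCBI's E-utilities `efetch` service. This is not one of the lab's steps, only a sketch of an alternative. The helper names are invented and the accession is a placeholder rather than a lab answer. The download function needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession):
    """Build an efetch URL that asks for a nucleotide record as FastA text."""
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def save_fasta(accession, path):
    """Download the record and write it to path (requires network access)."""
    with urlopen(fasta_url(accession)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w") as handle:
        handle.write(text)


print(fasta_url("AB000001.1"))  # placeholder accession
```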

Go back to the GenBank annotation page for the full-length mRNA.

Change the format of the data displayed on the screen back to the Default View (actually this is the GenBank version) by clicking on the arrows of the FASTA button to get the menu and selecting either Default View or GenBank and then clicking on the Display button.

Scroll down the page to locate the following pieces of information.

hints2

3. What is the mRNA's accession number?
4. What is the mRNA's GI number?
5. On what chromosome is the mRNA sequence located?
6. What part of the sequence actually encodes for a protein?
7. What is the protein's function?

Read the material in the hints link. It provides the necessary information to locate the answers. The link below is an image of the example's mRNA sequence. In this image the accession number is colored red, the GI number blue, the chromosome information green and the coding region data orange. [In the hard copy version these items are marked by arrows ( -> text <- ).]

example mRNA data



LocusLink can be an invaluable source of information about a sequence of interest.

Click on LinkOut in the upper right hand corner of the GenBank annotation page for your mRNA,

example LinkOut link

and then the link to NCBI LocusLink.

example NCBI LocusLink link

To get to the LocusLink page, you need to click on the Locus ID number.

example LocusLink number link

Two of several useful categories of information found in LocusLink are Reference sequences associated with your sequence and GenBank Sequences associated with your segment. From the LocusLink page, find the genomic sequence associated with your mRNA, if one is listed. This will be given as an NCBI Genome Annotation Genomic Contig. If no genomic contig is given, this means that there currently isn't high enough quality genomic data for this particular sequence to be used by NCBI as reference information. Other less annotated sources will have to be searched for genomic data.

The Genomic Contig for the example is NT_026472. Since this sequence accession code does not start with NC, it actually contains sequence information. If the Genomic Contig did start with NC, it would be necessary to save the fasta format of the data in order to use actual sequence data in later steps.

Record the accession number for the genomic sequence.



OMIM (Online Mendelian Inheritance in Man) can be another invaluable source of information.

This resource can be reached by a number of different methods. If working from a LocusLink page, scroll down to the bottom of the page. There in the Additional Links section is a link to the relevant OMIM entry.

There is also usually an OMIM link at the top of a GenBank annotation page. However, clicking on this link results in a page with a list of possible OMIM entries.

You can also go directly to the OMIM home page and do a search for keywords of interest.

Since you are already on the LocusLink page, scroll to the bottom of the page and click on the OMIM number link.

Given below is the bottom part of the example LocusLink from the previous section.

Click on the number given on the OMIM: line to go to the relevant OMIM entry. The top of that entry for the example is given below.

If you were to try to reach this information by the GenBank annotation page, you would click on the link marked by the arrow in the following image.

This would bring up a page with a listing of possible OMIM entries to choose from. The example one is given below.

Or you could go directly to the OMIM home page and do a search there for keywords of interest. The image below is that of the OMIM home page.

Read the information given here to better understand the scope of this resource and the types of information it contains.



BLASTN session - htgs run

Use the mRNA sequence you found in a BLASTN run against the htgs (high throughput genomic sequences) database to see if you can determine the number of exons it contains. This can be done by either pasting in the mRNA's fasta sequence you saved to your local machine or by entering its accession number in the sequence window. Limit the sequence used to only that which encodes for the protein by using the From: To: boxes.

Use the image below to locate the necessary data entry points.

Put your mRNA accession number into the box with the red 1 to the left of it.
The beginning value for the CDS region goes into the From: box with the green 2 above it.
The ending value of the CDS region goes into the To: box with the orange 3 above it.
Use the menu marked by the blue 4 to switch the database from the default nr to htgs.
[In the hard copy, shaded numbers mark the boxes.]
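
The From: and To: values are 1-based and include both end points. A minimal Python sketch of the same restriction is given below; the function names are invented for this example, and the numbers 74 and 3385 are the CDS limits of the example mRNA quoted in the BLAST 2 SEQUENCES section.

```python
def subrange(sequence, start, end):
    """Return bases start..end, counted from 1 and inclusive at both ends."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("range must lie inside the sequence")
    return sequence[start - 1:end]


def cds_length(start, end):
    """Number of bases in an inclusive 1-based range."""
    return end - start + 1


print(subrange("AACCGGTT", 3, 6))   # toy sequence
length = cds_length(74, 3385)       # CDS limits of the example mRNA
print(length, length % 3)
```

A remainder of zero in the last line is a quick check that the chosen range holds a whole number of codons.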

Check to make sure that filtering is still turned off.

[BLASTN htgs run - input format: accession number]

Click on the BLAST! button to start the search.

The htgs database contains the high throughput genomic sequences that are currently being worked on. This search is usually slower than a normal nr database one.

The human genome still has a lot of work that needs to be done on it before it is truly finished. Once a particular sequence is finished it is put into a regular GenBank division and can be found with a search of the nr database.

After waiting long enough, format the output.



BLASTN htgs results

After looking at your results, answer the following questions.

hints3

8. Does it look as if htgs is the best source for a genomic sequence in your individual case? Why?

The htgs example results do not show any found exons. So the run was repeated using the mRNA against the nr database.

While it looks like there are some good hits, they turn out to be from other mRNAs and not genomic sequences. Not all sequence fragments will have hits in the htgs database. This is due in part to the changing nature of this database. Read the information contained in the htgs link given previously for more details on this database.

The image of a search that actually does find exons is given below. First the top of the results page is presented, followed by a link to the alignment for the hit containing the exon information.

The image shows a good match extending the entire length of the cds for the mRNA. This is the red line at the top of the matches going from ~200 to ~1550.

exon alignment

When looking at the alignment results, each match begins with the Score section and continues until another Score section is reached. The results are listed from the longest to the shortest match.

9. Do the organism and chromosome match that listed in the mRNA file?

Check to see that organisms match. Usually this can be done by just looking at the rest of the definition line in the hit line or from the information given in the alignment section for that hit.

To check for chromosome information, it is necessary to go to the GenBank file and read through it to find the needed information.

In the second example, the organism and chromosome both matched from the good hit.

10. How many exons were found in the best hit?

When the alignments were looked at for the second example, 13 sequence locations corresponding to possible exons were found.

11. Is the exon coverage complete?

The 13 locations covered the entire coding region.



BLAST 2 SEQUENCES

Using BLAST 2 SEQUENCES, construct an alignment between the full-length mRNA sequence and a genomic sequence associated with your fragment. Use the accession numbers for the two sequences, if possible, placing them in the form's Enter accession or GI boxes. Use the from: and to: boxes to limit the portion of the mRNA sequence used to its coding region. Be sure that filtering is turned off.

[BLAST 2 SEQUENCES run - input format: accession numbers]

The BLAST 2 SEQUENCES page is given below, filled out properly for the example run.

Notice that the accession numbers are bigger than the provided boxes.

The genomic sequence used depends on your individual situation. It could be your best hit in your first BLAST search, if you had hits against a genomic sequence in the nr database. Or, it could be the best hit from your htgs BLAST search. This sequence could be the one listed as the NCBI Genome Annotation Genomic Contig. However, if the latter case is true, this process will only work if the accession number doesn't start with NC. This is because NC files contain only information on how to assemble the contig, not the actual sequence.

The results of the example run are given below. The first image is from the top of the page giving information of the sequences used in the run.

Next is the portion of the results showing the graphics for the found alignments. This section is actually composed of two parts. The left hand side of the page contains the found set of gapped local alignments. Clicking on one of these areas will move you to the associated alignment. On the right hand side is a dotplot image providing an overview of sequence similarity.

example BLAST 2 Sequences results (text version)

When looking at these results, each match begins with the Score section and continues until another Score section is reached. These results, like BLAST results, are given from the longest match to the shortest.

Going through the example results turns up 18 matches. The CDS runs from 74 - 3385 in the example mRNA sequence.

example run raw hits - extracted by hand from the results page

     location in mRNA      location in genomic sequence
     74 - 794              378 - 1098
     1097 - 1453           3690 - 4045
     792 - 1101            1179 - 1488
     1821 - 2030           8099 - 8308
     2650 - 2845           10815 - 11010
     2841 - 3019           11282 - 11460
     2485 - 2651           10556 - 10722
     2026 - 2186           8403 - 8563
     2336 - 2489           9836 - 9989
     2186 - 2336           9060 - 9210
     1535 - 1643           4753 - 4861
     1638 - 1742           5258 - 5362
     3017 - 3116           11698 - 11797
     3117 - 3212           11883 - 11978
     3210 - 3299           12159 - 12248
     3297 - 3385           12399 - 12487
     1450 - 1537           4388 - 4475
     1740 - 1822           6831 - 6913

Organizing these matches into sequential order determines whether or not there is complete coverage and the number of exons.

organized hits - worked over by hand

     location in mRNA      location in genomic sequence
     74 - 794              378 - 1098
     792 - 1101            1179 - 1488
     1097 - 1453           3690 - 4045
     1450 - 1537           4388 - 4475
     1535 - 1643           4753 - 4861
     1638 - 1742           5258 - 5362
     1740 - 1822           6831 - 6913
     1821 - 2030           8099 - 8308
     2026 - 2186           8403 - 8563
     2186 - 2336           9060 - 9210
     2336 - 2489           9836 - 9989
     2485 - 2651           10556 - 10722
     2650 - 2845           10815 - 11010
     2841 - 3019           11282 - 11460
     3017 - 3116           11698 - 11797
     3117 - 3212           11883 - 11978
     3210 - 3299           12159 - 12248
     3297 - 3385           12399 - 12487

This shows that there are 18 found exons and the coverage is complete.
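
The hand sorting and coverage check can also be sketched in a few lines of Python. The hit coordinates are the ones listed in the tables of this section; the function names are invented for this example, and the coverage test is one simple way of doing the check, not a feature of BLAST 2 SEQUENCES.

```python
# (mRNA start, mRNA end), (genomic start, genomic end), in the order reported.
raw_hits = [
    ((74, 794), (378, 1098)), ((1097, 1453), (3690, 4045)),
    ((792, 1101), (1179, 1488)), ((1821, 2030), (8099, 8308)),
    ((2650, 2845), (10815, 11010)), ((2841, 3019), (11282, 11460)),
    ((2485, 2651), (10556, 10722)), ((2026, 2186), (8403, 8563)),
    ((2336, 2489), (9836, 9989)), ((2186, 2336), (9060, 9210)),
    ((1535, 1643), (4753, 4861)), ((1638, 1742), (5258, 5362)),
    ((3017, 3116), (11698, 11797)), ((3117, 3212), (11883, 11978)),
    ((3210, 3299), (12159, 12248)), ((3297, 3385), (12399, 12487)),
    ((1450, 1537), (4388, 4475)), ((1740, 1822), (6831, 6913)),
]


def organize(hits):
    """Sort hits by their start position in the mRNA."""
    return sorted(hits, key=lambda hit: hit[0][0])


def covers(hits, start, end):
    """True if the mRNA ranges, taken in order, cover start..end with no gap."""
    last_covered = start - 1
    for (m_start, m_end), _genomic in organize(hits):
        if m_start > last_covered + 1:
            return False            # a stretch of the mRNA has no match
        last_covered = max(last_covered, m_end)
    return last_covered >= end


organized = organize(raw_hits)
print(len(organized), covers(raw_hits, 74, 3385))
```

Neighbouring mRNA ranges overlap by a few bases in several places, so the check allows overlaps and only looks for gaps.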

Based on your results, determine how many exons are in this gene. Confirm your answer by looking in Online Mendelian Inheritance in Man (OMIM), if you are not working with a mouse sequence. A direct link to the OMIM listing for your gene can be found on the GenBank annotation page for the mRNA sequence.



BLASTX session

Use your mRNA sequence in a blastx search against the nr protein databases. Use only the coding region of the mRNA in the search. Be sure that filtering is turned off.

[BLASTX run - input format: accession number]

Fill in the initial page of the BLASTX form as shown below using the example data.

The time estimate on this page is usually off. It takes a lot longer than the estimate indicates.

The results for the example run are given below.

Explore the impact of changing matrices on your search by repeating the BLASTX search, this time using a different matrix. If your first search found lots of very high quality hits, use BLOSUM80 for the second run. If your search was light on high quality hits, use BLOSUM45.

Changing the matrix used in a search impacts the number and quality of found hits. Visit NCBI for more information on their matrices. The BLOSUM80 matrix is based on an alignment of sequences which were less than 80% identical to one another, while for the BLOSUM45 matrix the sequences were less than 45% identical. BLOSUM80 is best for finding short, strong similarities and BLOSUM45 is better for long, more divergent ones. For really short sequences it is best to use the older PAM matrices. The NCBI link given above has a table showing sequence sizes and recommended substitution matrices and associated penalties.

Print the results of your last run.

Print a copy of the results by selecting the Print option from the browser's File menu.

From this set of results, select at least 8 sequences from different species, saving them as fasta files. Saving too few files will make for a poor tree later on. Give the saved files names that reflect the species of origin. Be sure to save the protein which best represents your initial selected sequence. Print two copies of the GenBank version of the data for the best protein hit and a single copy of two others. These files and hard copy will be used in lab session two.

Go to the GenBank annotation page for a given hit by clicking on the gi#|version#|accession# link of the desired hit.

The top of such a page is given below.

Print a copy of the data by selecting the Print option from the browser's File menu.

Use the pull down menu on the GenBank page to display a FastA formatted version of the hit. The necessary menu for this task is the menu contained in the button currently displaying the term Default View. Click on the two arrows of that button to get a listing of its various options. The desired one is FASTA. Select that option and then click on the Display button to update the data on the screen.

Save this FastA file to your local machine by clicking on the Save button from the top of the NCBI page and responding to the prompts with "Save File", storing the file with a name of your choice.

Return to the BLASTX results by clicking on the browser's Back button until the results reappear on the screen.

Repeat this process until you have at least 8 fasta formatted sequences saved to your local machine. Select two other GenBank sequences to print. These will be used later.

hints4

How many sequences you save in this step really depends on what you plan to do later on. In the exercise, the aim is only to collect a small number of sequences so that future multiple alignments are more realistic and there is the potential for looking at phylogenetic relationships.

If you were interested in superfamily relationships, the alignments would be carefully gone over to ensure that vital regions of interest were found in each hit. Even hits of very low quality would be examined to find remote members of the family. Superfamily analysis is the topic of one of the additional optional exercises.



Translation

Protein sequences for your gene of interest, inferred and/or experimentally derived, can be obtained in several ways. While using BLASTX is one method of doing this, the simplest way is to directly translate the DNA sequence of interest. This approach is not without its problems.

Go to one of ExPASy's translate tool sites (Canada, China, Korea, Taiwan, USA) and paste in your initial selected sequence.

The translate tool looks like this.

Paste the raw sequence into the window.

Click on the TRANSLATE SEQUENCE button.

[EXPASY's translate run - input format: raw sequence]

Click on the button in the form marked by the black arrow.

The resulting page contains the translation of your sequence in all 6 reading frames (three forward and three reverse).
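
For readers who want to see what a six-frame translation involves, here is a small Python sketch using the standard genetic code. It is not the translate tool's own code; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example, and only the first 120 bases of the TRIAL sequence are used.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a Stop.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the three forward and three reverse translations."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


def fewest_stops(frames):
    """Name of the frame whose translation contains the fewest '*' characters."""
    return min(frames, key=lambda name: frames[name].count("*"))


fragment = (
    "CCCACAGGGGGACCGGCCCTGTGACCCCTCACCGGGGCCGTGGGCCCGAGCCCCGGACTT"
    "CCCTAAGCCGGCAATGACCGCCTGCGCCCGCCGAGCGGGTGGGCTTCCGGACCCCGGGCT"
)
frames = six_frames(fragment)
for name in sorted(frames):
    print(name, frames[name])
```

Counting Stops, as `fewest_stops` does, is only a first guess at the real frame; checking for a Met in a sensible position is still necessary.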

Examine the results and pick the one translation with the fewest Stop terms to work with. [Usually this selection will give the desired results; however, this is not always the case. It depends on the sequence being translated and its origins. A sequence from mRNA that spans the 5'UTR and the beginning of the CDS region will have a methionine as its true start most of the time. Being observant as to where the fragment aligns in earlier BLAST runs helps in translation selection.] Click on the link for that reading frame.

For the example sequence the translation with the fewest Stops is the second one.

Another page comes up with your chosen translation and additional links that allow you to create a Swiss-Prot database entry of your data. Click on the link that best represents your understanding of the protein.

The example translation has only one Met to choose from. From previous work, it is known that the selected sequence consists of the 5'UTR and the start of the protein.

Yet another page comes up. This page has a link to a FASTA format of the data.

Click on the FASTA format link to get the sequence displayed on the screen in this format.

The following image is the data for this example sequence.

Save the fasta file.

From the browser's File menu select the Save As option. Save the file as Text with a file name of your choice.

To ensure that you will have data that works in the next step, also go through and save fasta files of the other two forward reading frame translations.

Click the browser's Back button until you are returned to the page with the 6 reading frame translations. Select one of the other forward translations and repeat the process given above. Do this until you have saved all three of the forward reading frame translations as a fasta file.



Comparison

Use your downloaded translations and the full protein which best represents your selected fragment in an alignment comparison to determine which of these translations is real.

In the BLASTX portion of this lab, you were to save a number of protein sequences in fasta format onto your local machine.

Go to one of ExPASy's SIM Alignment sites (Canada, China, Korea, Taiwan, USA) and paste in your two fasta formatted sequences. Change the type of entry to User-entered sequence for the two sequences.

The SIM alignment page looks like this.

Use the editor to open the necessary files and copy the text into your machine's buffer.

Then paste the text into the form's appropriate window in the browser. Once the translated sequence is in place, close the file and open the file for the protein from your organism, copying and pasting it into the second box.

Click on the circle buttons to change the type of both sequences from SWISS-PROT/TrEMBL database entries to User-entered sequence. Give the two sequences distinctive names if that will help you tell them apart. A properly filled out form is given below.

Strip off the > line from the sequences.

When using user-supplied sequences, this site prefers that raw sequence information be used. Therefore, it is necessary to remove the first line of each data file and the resulting blank line. An example of a modified sequence is shown below.

MTACARRAGGLPDPGLCGPAWWAPSLPRLPRALPRLPLLLLLLLLQPPALSAVFTVGVLG
PWACDPIFSRARPDLAARLAAARLNRDPGLAGGPRFEVALLPEPCRTPGSLGAVSSALAR
VSGLVGPVNPAACRPAELLAEEAGIALVPWGCPWTQAEGTTAPAVTPAADALYALLRAFG
WARVALVTAPQDLWVEAGRSLSTALRARGLPVASVTSMEPLDLSGAREALRKVRDGPRVT
AVIMVMHSVLLGGEEQRYLLEAAEELGLTDGSLVFLPFDTIHYALSPGPEALAALANSSQ
LRRAHDAVLTLTRHCPSEGSVLDSLRRAQERRELPSDLNLQQVSPLFGTIYDAVFLLARG
VAEARAAAGGRWVSGAAVARHIRDAQVPGFCGDLGGDEEPPFVLLDTDAAGDRLFATYML
DPARGSFLSAGTRMHFPRGGSAPGPDPSCWFDPNNICGGGLEPGLVFLGFLLVVGMGLAG
AFLAHYVRHRLLHMQMVSGPNKIILTVDDITFLHPHGGTSRKVAQGSRSSLGARSMSDIR
SGPSQHLDSPNIGVYEGDRVWLKKFPGDQHIAIRPATKTAFSKLQELRHENVALYLGLFL
ARGAEGPAALWEGNLAVVSEHCTRGSLQDLLAQREIKLDWMFKSSLLLDLIKGIRYLHHR
GVAHGRLKSRNCIVDGRFVLKITDHGHGRLLEAQKVLPEPPRAEDQLWTAPELLRDPALE
RRGTLAGDVFSLAIIMQEVVCRSAPYAMLELTPEEVVQRVRSPPPLCRPLVSMDQAPVEC
ILLMKQCWAEQPELRPSMDHTFDLFKNINKGRKTNIIDSMLRMLEQYSSNLEDLIRERTE
ELELEKQKTDRLLTQMLPPSVAEALKTGTPVEPEYFEQVTLYFSDIVGFTTISAMSEPIE
VVDLLNDLYTLFDAIIGSHDVYKVETIGDAYMVASGLPQRNGQRHAAEIANMSLDILSAV
GTFRMRHMPEVPVRIRIGLHSGPCVAGVVGLTMPRYCLFGDTVNTASRMESTGLPYRIHV
NLSTVGILRALDSGYQVELRGRTELKGKGAEDTFWLVGRRGFNKPIPKPPDLQPGSSNHG
ISLQEIPPERRRKLEKARPGQFS

Use default settings for the rest of the run. Click on the Submit button.

[EXPASY's SIM Alignment run - input format: raw sequences]

The Submit button is marked with a black arrow in the previous image. Click on it to run the process.

Given below is an image of the results of this process using the example data files.

Scroll down the results page to see if the translated sequence used in the run aligns well with the found protein sequence. A good hit will be a section of 100% identity between the two input sequences. If the first attempt fails, run another one through the process and see if that one works.

A good hit is represented in the image above. However, not all exons are very big, so scroll down the list of alignments to see if there are any 100% alignments before assuming that you picked the wrong translation to run through the process.
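
A quick way to look for a stretch of 100% identity without running an alignment is to search for the longest run of residues shared by the two sequences. The Python sketch below does this with the standard library. It is a simplification, not what SIM computes: SIM produces local alignments that may contain gaps and mismatches. The function name and the four residues placed in front of the Met are invented for illustration.

```python
from difflib import SequenceMatcher


def longest_identical_stretch(translation, protein):
    """Return (shared residues, start in translation, start in protein)."""
    matcher = SequenceMatcher(None, translation, protein, autojunk=False)
    match = matcher.find_longest_match(0, len(translation), 0, len(protein))
    stretch = translation[match.a:match.a + match.size]
    return stretch, match.a, match.b


# First line of the example protein given in this section.
protein_start = "MTACARRAGGLPDPGLCGPAWWAPSLPRLPRALPRLPLLLLLLLLQPPALSAVFTVGVLG"
# Illustrative translation: four made-up residues followed by the real start.
translation = "PKPA" + "MTACARRAGGLPDPG"

print(longest_identical_stretch(translation, protein_start))
```

A long shared stretch that begins at the Met of the translation and at position 0 of the protein is what the correct reading frame should produce for a fragment that covers the start of the coding region.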

In the example run the protein translations match perfectly. This is due to the nature of the original DNA fragment [containing the first 480 bases of the mRNA sequence]. Fragments that come from genomic sequence and run past the end of an exon would have initial perfect matches followed by mismatches at the end.

The translation process cannot tell if it has reached the end of an exon or not. Also, the DNA fragment used may not respect codon boundaries and there will be additional Stops in the translation that will confuse the selection of the translation that is real.

12. Did you have problems with this procedure? If so, why?

The length and source of your initial sequence can make the translation process tricky. Review the information given previously for possible causes of any problems you might have encountered.



Copyright 2003 Regents of the University of California. All rights reserved.