Automated annotation of 5' RACE sequence tags

Prior to running this procedure, chromatogram files must be downloaded and interpreted (described in our data processing protocol), and gene trap vector sequence must be removed (described in our vector removal protocol).

The automated annotation procedure for a PGA sequence tag is as follows:

Each sequence tag is queried via BLAST against NCBI's "NR" database. Hits are limited by an E value cutoff of 1e-10, and the number of reported sequences (v) is set to 50. The output format is set to flat query-anchored, with no identities and blunt ends.
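As a rough sketch, the search settings above could be expressed as an argument list for the legacy NCBI blastall command line. This protocol does not name the BLAST executable or program, so the use of blastall, the helper name build_blast_args, and the file paths are assumptions made for illustration; the BLAST program is left to the caller.

```python
def build_blast_args(program, query_fasta, output_path, database="nr"):
    """Build an argument list for legacy blastall (an assumed executable).

    program is the BLAST program name; the protocol does not state which
    one is used, so it must be supplied by the caller.
    """
    return [
        "blastall",
        "-p", program,       # BLAST program (not specified in the protocol)
        "-d", database,      # NCBI "NR" database
        "-i", query_fasta,   # FASTA file holding one sequence tag
        "-o", output_path,   # where the report is written
        "-e", "1e-10",       # E value cutoff
        "-v", "50",          # number of reported sequences
        "-m", "6",           # flat query-anchored, no identities, blunt ends
    ]
```

The resulting list can be passed to subprocess.run(args, check=True) on a machine where blastall and a local copy of the database are installed.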

BLAST outputs are parsed to collect the names, E values, and alignments against the query sequence for all candidate sequences. For each candidate sequence, the following are determined:

  1. The longest consecutive run of aligned bases in the alignment of the candidate sequence against the query sequence (gaps of up to three bases in length are allowed).
  2. The percentage of identical bases in the longest run.
  3. Whether the last 60 bp at the 3' end of the sequence tag exactly matches part of the candidate sequence.

Candidate mouse sequences that are >95% identical over >90% of the query sequence, or that have a 60-bp match with the 3' end of the sequence tag, are kept, and the GenBank entries for these candidate sequences are retrieved.
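The three measurements and the keep rule above can be sketched as follows. This is a minimal illustration, not our production parser. It assumes each alignment is available as two equal-length strings with "-" marking gaps, that a gap stretch longer than three columns ends a run, that identity is counted over aligned (non-gap) columns only, and that the 3' match is checked on the given strand only. All function names are hypothetical.

```python
def longest_aligned_run(query_aln, cand_aln, max_gap=3):
    """Return (aligned_bases, start, end) for the longest run of aligned columns.

    Gap stretches of up to max_gap columns are bridged; longer ones end the run.
    start and end are column indices into the alignment (end is exclusive).
    """
    if len(query_aln) != len(cand_aln):
        raise ValueError("aligned strings must have equal length")
    best = (0, 0, 0)
    start, last, aligned, gap = None, None, 0, 0
    for i, (q, c) in enumerate(zip(query_aln, cand_aln)):
        if q == "-" or c == "-":
            gap += 1
            if gap > max_gap and start is not None:
                # The gap is too long: close the current run.
                if aligned > best[0]:
                    best = (aligned, start, last + 1)
                start, aligned = None, 0
        else:
            if start is None:
                start, aligned = i, 0
            aligned += 1
            last, gap = i, 0
    if start is not None and aligned > best[0]:
        best = (aligned, start, last + 1)
    return best


def percent_identity(query_aln, cand_aln, start, end):
    """Percentage of identical bases among aligned columns in [start, end)."""
    pairs = [(q, c) for q, c in zip(query_aln[start:end], cand_aln[start:end])
             if q != "-" and c != "-"]
    if not pairs:
        return 0.0
    matches = sum(q.upper() == c.upper() for q, c in pairs)
    return 100.0 * matches / len(pairs)


def three_prime_match(tag, candidate_seq, length=60):
    """True if the last `length` bases of the tag occur exactly in the candidate."""
    if len(tag) < length:
        return False  # assumption: tags shorter than 60 bp cannot qualify
    return tag[-length:].upper() in candidate_seq.upper()


def keep_candidate(is_mouse, identity, aligned_bases, query_len, has_3p_match):
    """Apply the keep rule: mouse, and (>95% over >90% of query, or 3' match)."""
    covers = query_len > 0 and aligned_bases / query_len > 0.90
    return is_mouse and ((identity > 95.0 and covers) or has_3p_match)
```

For example, a two-column gap inside an otherwise perfect alignment is bridged, so the whole alignment counts as one run, whereas a four-column gap splits it into two runs and only the longer one is reported.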

Candidates are then grouped into synonym lists. Two candidates are synonyms if they are from the same organism, their sequence lengths differ by no more than 2%, and their sequences are at least 98% identical. Within each synonym list, candidates are ranked according to accession number type, and the accession number of the most preferred type in the list is assigned as the accession number for the list. The lowest E value within a synonym list is assigned as the E value for the list.
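The grouping step can be sketched as below. Several details are assumptions made for illustration: the length difference is measured relative to the longer sequence, synonymy is treated as transitive (single-linkage grouping), and simple_identity is only a placeholder for a real pairwise alignment. The accession type preference order is not given in this protocol, so hypothetical_rank is an invented ordering and should be replaced with the real one.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    accession: str
    organism: str
    sequence: str
    evalue: float


def simple_identity(a, b):
    """Placeholder identity: percent of matching positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n


def are_synonyms(a, b, identity=simple_identity):
    """Same organism, lengths within 2%, and at least 98% identical."""
    longer = max(len(a.sequence), len(b.sequence))
    if a.organism != b.organism or longer == 0:
        return False
    length_ok = abs(len(a.sequence) - len(b.sequence)) / longer <= 0.02
    return length_ok and identity(a.sequence, b.sequence) >= 98.0


def group_synonyms(candidates, identity=simple_identity):
    """Group candidates into synonym lists using union-find (single linkage)."""
    parent = list(range(len(candidates)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if are_synonyms(candidates[i], candidates[j], identity):
                parent[find(j)] = find(i)
    groups = {}
    for i, cand in enumerate(candidates):
        groups.setdefault(find(i), []).append(cand)
    return list(groups.values())


def hypothetical_rank(accession):
    """Invented preference order for illustration only; lower is preferred."""
    prefixes = ["NM_", "XM_"]
    for rank, prefix in enumerate(prefixes):
        if accession.startswith(prefix):
            return rank
    return len(prefixes)


def summarize_list(group, rank=hypothetical_rank):
    """Return (accession, evalue) for a synonym list: best-ranked accession, lowest E value."""
    accession = min(group, key=lambda c: rank(c.accession)).accession
    evalue = min(c.evalue for c in group)
    return accession, evalue
```

Note that the list's accession number and its E value may come from different members of the list, as described above.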

If there are multiple mouse synonym lists, we use the UCSC BLAT server to determine whether they map to the same position. The accession numbers in each synonym list and their corresponding sequences are sent to the BLAT server to obtain chromosomal positions. If all synonyms hit matching chromosomal regions, all mouse synonym lists are combined into a single list.
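Once chromosomal positions are in hand, the combining rule can be sketched as follows. The sketch does not contact the BLAT server; it assumes positions have already been collected into a dictionary keyed by accession number. "Matching chromosomal regions" is interpreted here as same chromosome with overlapping coordinates, which is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def regions_match(a, b):
    """a and b are (chromosome, start, end); match = same chromosome and overlap."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def combine_if_same_locus(synonym_lists, positions):
    """Merge all synonym lists into one if every synonym maps to a matching region.

    synonym_lists: list of lists of accession numbers.
    positions: dict mapping accession number -> (chromosome, start, end).
    Lists are returned unchanged if any synonym has no position or any pair differs.
    """
    accessions = [acc for lst in synonym_lists for acc in lst]
    if any(acc not in positions for acc in accessions):
        return synonym_lists
    regions = [positions[acc] for acc in accessions]
    if all(regions_match(a, b) for a, b in combinations(regions, 2)):
        return [accessions]
    return synonym_lists
```

Requiring every pair of regions to match is the strictest reading of the rule; a looser reading would compare each region against a single reference region.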

If there are no mouse synonym lists, then any candidate sequences that are >40% identical over >80% of the query sequence are collected and grouped into homolog synonym lists.

Annotation categories:

  1. Putative Mouse ID: If there is only one mouse synonym list, then the accession number of that synonym list is reported.
  2. Multiple Mouse IDs: If there are multiple mouse synonym lists, then the accession numbers of all synonym lists are reported.
  3. Nearest Homolog: If there are no mouse synonym lists and there is at least one homolog synonym list, then the accession number of the homolog synonym list with the best E value is reported.
  4. Unidentified: If there are neither mouse nor homolog synonym lists, no accession number is reported and the cell line is marked for manual annotation.
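The category rules above, together with the homolog thresholds, reduce to a short decision function. This is a sketch under the assumption that each synonym list has already been summarized as an (accession, E value) pair; the function names and return format are hypothetical.

```python
def is_homolog_candidate(identity, aligned_bases, query_len):
    """Homolog rule: >40% identical over >80% of the query sequence."""
    covers = query_len > 0 and aligned_bases / query_len > 0.80
    return identity > 40.0 and covers


def annotate(mouse_lists, homolog_lists):
    """Return (category, accessions) for one sequence tag.

    mouse_lists and homolog_lists are lists of (accession, evalue) pairs,
    one pair per synonym list.
    """
    if len(mouse_lists) == 1:
        return "Putative Mouse ID", [mouse_lists[0][0]]
    if len(mouse_lists) > 1:
        return "Multiple Mouse IDs", [acc for acc, _ in mouse_lists]
    if homolog_lists:
        # Report only the homolog list with the best (lowest) E value.
        best = min(homolog_lists, key=lambda item: item[1])
        return "Nearest Homolog", [best[0]]
    # No accession number; the cell line is flagged for manual annotation.
    return "Unidentified", []
```

Because mouse lists are checked first, homolog lists only affect the result when no mouse synonym list exists.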

 

Last updated 16 November 2002.
Copyright 2002 Regents of the University of California. All rights reserved.