Sequence Ontology Annotation Guide

The Sequence Ontology (SO) [13] aims to unify the way in which we describe sequence annotations by providing a controlled vocabulary of terms and the relationships between them. Using SO terms to label the parts of sequence annotations ensures that annotations produced by different groups conform to a single standard, which greatly facilitates downstream analyses of annotation contents and characteristics, e.g. comparisons of UTRs, alternative splicing, etc. Because SO also specifies the relationships between features, e.g. part_of, kind_of, annotations described with SO terms are also better substrates for validation and visualization software. This document provides a step-by-step guide to producing an SO-compliant file describing a sequence annotation, using an annotated gene as an example. First we show where the terms needed to describe the gene's features are located in SO and how they relate to one another. We then show, line by line, how to format the file to construct an SO-compliant annotation of this gene.


What is sequence annotation?
Sequence annotations provide explanatory notes and critical commentary about a sequence, e.g. indicating where transcription occurs and where a regulatory region lies. They may arise from bioinformatics analysis, wet-bench analysis, or a combination of both by an expert biologist. Genomic DNA sequence is most commonly associated with annotation, but any biological sequence may be annotated, e.g. a microarray probe or an mRNA sequence. These annotations allow us to connect what we know about the biology, and the results of experiments, with the actual sequence. They enable us to readily locate features on the sequence and relate them to other features. We can also assign additional properties to these features, e.g. where (in which cell types or tissues) a gene is expressed.
There has been a proliferation of exchange formats to reflect the varying needs of the community over time. The three large genome databanks, DDBJ [10], EMBL [8] and GenBank [1] distribute their sequences as flat files, and all use an agreed-upon feature table [3] to name the features of their annotations. As model organism groups needed to exchange complex data models, other formats appeared, such as game-XML [4] and GFF [6,7]. Rendering of genomic annotation into graphical views also became an important issue, and formats such as Bioinformatics Sequence Markup Language (BSML) [2] also appeared. All of these exchange formats rely upon simple controlled lists of key words that enumerate permissible feature types.
Currently there is no single, standardized means for describing an annotation. This makes annotation exchange and analysis a much more complicated task than it need be. More importantly, what distinguishes SO from the keyword lists (feature tables) used by the big genome databases is that it formally specifies the sub-class, membership (mereological), and topological relationships that exist between the terms. Specifying these relationships in a principled way provides the basis for a readily extensible object-oriented data model. Software need not be aware of the terms themselves, but need only be aware of the nature of the possible relationships between them. This completely inverts the previous paradigm where the relationships were essentially hard-coded or implicit in their physical placement in the file. Thus, using SO, both data exchange formats and the software that manipulates their contents need only be 'aware' of the underlying relatedness of the features. Moreover, the variety of possible relationship types is much more constrained and their behavior is formally specified, making it possible to readily include additional terms or move terms about in the ontology without re-writing any code for parsing and rendering.
There are several tools for genome browsing, annotation and curation [9,12,14]. Viewing a gene in a genome browser demonstrates graphically the relationships between its features. Figure 1 depicts the Drosophila melanogaster gene CG10188. This gene is located on the reverse strand and has two annotated transcripts and a total of four exons, three of which are coding (opaque) and one of which is non-coding (transparent). If an exon includes any coding sequence at all, even one base, it is categorized as coding. There is also a transposable element of type Cr1a located within the intron of the transcripts.

The Sequence Ontology
In SO each term is defined with a descriptive definition, agreed upon by the community, and the relationships the term has to other terms provide a logical description. Describing the relationship between terms in this way restricts how they can be applied to describe a sequence. For example, the ontology states that an intron is part of a primary transcript, and a primary transcript is a transcript, whereas an mRNA is a processed transcript, and a processed transcript is a transcript. So it would be illogical to state that an intron is part of an mRNA. This is illustrated by following the relationships between the terms in Figure 2.
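The intron example can be made concrete with a minimal Python sketch. It assumes a toy subset of SO containing only the terms named in this guide, with at most one is a parent and one part of target per term; the dictionaries and function names are illustrative and are not part of SO or of any SO software.

```python
# Toy subset of the relationships described in the text (not the full SO).
# Assumption: each term has at most one is_a parent and one part_of target.
IS_A = {
    "primary transcript": "transcript",
    "processed transcript": "transcript",
    "mRNA": "processed transcript",
}
PART_OF = {
    "intron": "primary transcript",
    "exon": "transcript",
}


def ancestors(term):
    """Return the term itself followed by every term reachable via is_a."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def may_be_part_of(part, whole):
    """True if the toy ontology allows `part` to be part_of `whole`.

    The part_of target declared for `part` must be `whole` itself or one of
    its is_a ancestors, so a sub-type inherits the parts of its super-types.
    """
    target = PART_OF.get(part)
    return target is not None and target in ancestors(whole)


print(may_be_part_of("intron", "primary transcript"))  # True
print(may_be_part_of("intron", "mRNA"))                # False
print(may_be_part_of("exon", "mRNA"))                  # True
```

The check never mentions a specific term by name; it only follows the two relationship types, which is the property that lets terms be added or moved without changing the code.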
Properties are concepts that are not locatable on the sequence, but describe an aspect of a feature located on the sequence. For example, SO has terms to describe attributes of a feature, such as the kind of regulation a gene undergoes, which include maternally imprinted and negatively autoregulated. Additionally, there are terms to describe chromosome variation and consequences of mutation.

Representing SO instances
The ontology describes our current knowledge of biological sequences and their relationships to one another (since every feature is itself a sequence). It does not, however, describe actual specific instances of sequence; for this, a separate framework, i.e. a flat-file format or database schema, is required. As mentioned earlier, there are many formats and representations that are suitable for this. For example, SO terms may be used to type specific features (i.e. sequences) in a relational database, to label the features in a flat-file format such as GFF3, or to label them in a hierarchical markup such as XML.
There are currently two data exchange formats that rely on SO to type their features: Generic Feature Format 3 (GFF3 [7]) and the Chado relational schema from the Generic Model Organism Database group (GMOD [5]). In the remainder of this document, GFF3 is used to illustrate how biological features are modelled using SO. The format of GFF3 is outlined in Table 1.
The GFF3 file shown in Figure 3 represents the gene CG10188 shown graphically in Figure 1. The file starts with three comment lines that state the version of the format and the region of the genome being annotated, in this case a span of chromosome 2L. Comment lines begin with '##'. For clarity, the title of each column of the file format is also included in a comment line.
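As a rough illustration of the leading '##' lines, the sketch below collects them until the first feature line is reached. The line names gff-version and sequence-region follow the GFF3 specification; the coordinates and the ID are invented for illustration and are not those of Figure 3, and read_header is a hypothetical helper.

```python
def read_header(lines):
    """Collect the leading '##' lines of a GFF3 file as (name, arguments) pairs."""
    header = []
    for line in lines:
        if not line.startswith("##"):
            break  # the first non-'##' line ends the header
        parts = line[2:].split()
        if parts:
            header.append((parts[0], parts[1:]))
    return header


example = [
    "##gff-version 3",
    "##sequence-region 2L 1 1000",               # hypothetical coordinates
    "2L\t.\tgene\t100\t900\t.\t-\t.\tID=gene1",  # hypothetical feature line
]
print(read_header(example))
# [('gff-version', ['3']), ('sequence-region', ['2L', '1', '1000'])]
```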
The first line of the actual annotation describes the gene that is being annotated. In this case the landmark is the chromosome arm 2L. The source field is undefined and the type of feature is gene. Although it is not always clear where a gene begins and ends, in this annotation the gene extends from the five-prime-most base of the transcripts to the three-prime-most base. The score is not defined; the gene is located on the reverse strand; and this feature does not have a phase. The attributes recorded for this gene show its unique identifier, its name and a property. The second line of the annotation describes the first of the transcripts (CG10188-RA) of the gene. This transcript is typed as mRNA, as this term gives more information to the user than transcript. Because the is a relationship is transitive, mRNA inherits the relationships and attributes of transcript via processed transcript. This can be demonstrated by tracing the is a relationships from mRNA to transcript in Figure 2. The Parent tag has been used to show that the mRNA feature is part of the gene.
The next five lines in the annotation identify the protein-coding and non-protein-coding portions of the mRNA. This is done by defining the five prime UTR, the CDS and the three prime UTR. The CDS is defined as 'a contiguous sequence which begins with and includes a start codon and ends with and includes a stop codon'. The CDS sequence in the annotation must therefore contain the start and stop codons. As can be seen in Figure 1, the CDS and the five prime UTR of the transcript CG10188-RA each span more than one exon. When using these terms, it is therefore necessary to use split locations, which reflect the exon boundaries. Thus, the five prime UTR takes up two lines in the file but has a single ID. The five prime UTR is a part of the mRNA, as shown by the Parent tag-value pair. The following two lines show the coding portions of two exons, and are typed with the term CDS. This is also a split-location annotation, and the CDS is part of the mRNA. The final portion of the last exon is the three prime UTR, which is the sequence following the stop codon to the end of the transcript.

Figure 2. A selection of the terms in SO that relates kinds of transcripts, and their parts. The relationships shown are is a ('i'), which provides the sub-type hierarchy, and part of ('P'), which produces meronomies.

Table 1. A description of the columns of the GFF3 format. The format consists of one line per sequence feature with nine columns per line. If a column is not defined, the '.' symbol is used. Comment lines begin with '##'.
1 seqid: The landmark to which the coordinates are given.
2 source: The procedure that produced the feature. For example, the name of a piece of software or another database may be appropriate. Not all features have a source.
3 type: The type of feature, using either a term name or accession number from the Sequence Ontology.
4 begin: The begin coordinate of the feature relative to the landmark given in column 1, where 1-based integer coordinates are used.
5 end: The end coordinate of the feature relative to the landmark given in column 1, where 1-based integer coordinates are used.
6 score: The score attributed to the feature, if required.
7 strand: The direction of the annotation.
8 phase: The phase of the feature. Not all features have a phase.
9 attributes: The attributes of the feature are recorded as tag-value pairs, and multiple attributes are separated by semi-colons. Tags beginning with a lower-case letter are unrestricted, but tags beginning with an upper-case letter are reserved for special meanings. There are several tags with predefined meanings: ID is the identifier for the feature, and the value of this tag must be unique within the document. Name is the tag used for display purposes for the feature, so it does not have to be unique. Another commonly used reserved tag is Parent, which is used to capture part of relations. The value of this tag is the ID of the 'parent'.
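The nine-column layout of a feature line can be read with a short Python sketch such as the one below. It assumes, following the GFF3 specification rather than this guide, that columns are tab-separated, that each attribute is written as tag=value, and that reserved characters in values are percent-encoded. The function name parse_feature and the example coordinates and IDs are hypothetical and are not taken from Figure 3.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "begin", "end",
           "score", "strand", "phase", "attributes")


def parse_feature(line):
    """Split one tab-delimited GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    # An undefined column is written as '.', stored here as None.
    record = {name: (None if value == "." else value)
              for name, value in zip(COLUMNS, fields)}
    record["begin"] = int(record["begin"])
    record["end"] = int(record["end"])
    # Attributes are tag=value pairs separated by semi-colons.
    attributes = {}
    for pair in (record["attributes"] or "").split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical line: coordinates and IDs are invented for illustration.
line = "2L\t.\tmRNA\t100\t900\t.\t-\t.\tID=mRNA1;Name=CG10188-RA;Parent=gene1"
feature = parse_feature(line)
print(feature["type"], feature["strand"], feature["attributes"]["Parent"])
# mRNA - gene1
```

This sketch keeps a Parent value as a single string; a feature with several parents would need the value split further.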
There is often more than one way to annotate sequence using SO. The second transcript, CG10188-RB, is annotated with exons to demonstrate this point. An exon is a part of a transcript, so by inheritance it is also part of an mRNA. The mRNA has two exons, which, unlike the CDS annotations, have unique IDs. To be able to differentiate between the coding and non-coding portions of the transcript, more information is needed. The two lines following the exons label the locations of the UTRs.
Both ways of annotating these transcripts would validate, as they are both true to the relationships in the ontology. However, it is common practice in the model organism community to use the first method and annotate to the CDS rather than the exons, as the non-coding exon structure is often unknown.
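The difference between the two styles shows up when lines are grouped by ID: a split CDS contributes several lines to one feature, whereas each exon is a feature of its own. The Python sketch below illustrates this with invented rows; the coordinates, IDs and the helper group_by_id are hypothetical and do not reproduce Figure 3.

```python
from collections import defaultdict

# Hypothetical (type, begin, end, ID, Parent) rows; all values are invented.
rows = [
    ("CDS", 500, 800, "cds1", "mRNA_A"),    # split CDS: two lines, one ID
    ("CDS", 200, 400, "cds1", "mRNA_A"),
    ("exon", 450, 900, "exon1", "mRNA_B"),  # exons: one line per ID
    ("exon", 100, 400, "exon2", "mRNA_B"),
]


def group_by_id(rows):
    """Merge lines that share a type, ID and Parent into one feature."""
    segments = defaultdict(list)
    for ftype, begin, end, fid, parent in rows:
        segments[(ftype, fid, parent)].append((begin, end))
    return {key: sorted(spans) for key, spans in segments.items()}


features = group_by_id(rows)
for (ftype, fid, parent), spans in features.items():
    print(ftype, fid, parent, spans)
# CDS cds1 mRNA_A [(200, 400), (500, 800)]
# exon exon1 mRNA_B [(450, 900)]
# exon exon2 mRNA_B [(100, 400)]
```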
Although the introns are not explicitly annotated in this example, as they are implicit, it is possible to include them. These examples have focused on the parts of transcripts and genes, but all features that can be located on the sequence may be annotated in this way. There is a transposable element located in an intron of this gene. Transposable elements are not strictly parts of genes, although they may be located among them. The final row of Figure 3 shows the annotation of a transposable element.
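Where introns are left implicit, they can be recovered as the gaps between consecutive exons of a transcript. The sketch below assumes 1-based inclusive coordinates, as in Table 1, and non-overlapping exons from a single transcript; the function implied_introns and the coordinates are hypothetical.

```python
def implied_introns(exons):
    """Return the gaps between consecutive exons as 1-based inclusive spans."""
    ordered = sorted(exons)
    return [(prev_end + 1, next_begin - 1)
            for (_, prev_end), (next_begin, _) in zip(ordered, ordered[1:])
            if next_begin - prev_end > 1]  # skip exons that directly abut


# Hypothetical exon coordinates, given in any order.
print(implied_introns([(450, 900), (100, 400)]))
# [(401, 449)]
```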
Relationships are recorded in the final column of the GFF3 file. Currently, only the part of relationship is strictly enforced in GFF3. Other relationships may be created. For deeper annotation that details attributes of the features, the tag-value pairs are appropriate for attaching these properties to the feature. The property relation is the tag, and the term is the value; e.g. for a transcript that is negatively autoregulated, the tag would name the property relation and the value would be the term negatively autoregulated.

Figure 3. The features of the gene CG10188 represented as GFF3. The first transcript (CG10188-RA) is annotated as a split CDS feature and respective UTRs, whereas the second transcript (CG10188-RB) is annotated as exons, and the coding portion is implied using the UTRs. The transposable element Cr1a{}412-RA is also annotated in this region but is not related to the gene by a part of relationship.
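Because part of relations are carried by the Parent tag, one simple consistency check is that every Parent value names an ID defined in the same file. The sketch below assumes features already reduced to small dictionaries with a single Parent each; the IDs and the helper dangling_parents are hypothetical.

```python
def dangling_parents(features):
    """Return (ID, Parent) pairs whose Parent is not the ID of any feature."""
    known = {f["ID"] for f in features if "ID" in f}
    return [(f.get("ID"), f["Parent"])
            for f in features
            if "Parent" in f and f["Parent"] not in known]


# Hypothetical features; the last one points at an ID that does not exist.
features = [
    {"ID": "gene1", "type": "gene"},
    {"ID": "mRNA1", "type": "mRNA", "Parent": "gene1"},
    {"ID": "utr1", "type": "five_prime_UTR", "Parent": "mRNA1"},
    {"ID": "te1", "type": "transposable_element"},  # no Parent: not part of the gene
    {"ID": "cds1", "type": "CDS", "Parent": "mRNA9"},
]
print(dangling_parents(features))
# [('cds1', 'mRNA9')]
```

A feature without a Parent tag, such as the transposable element, passes the check, which matches its treatment here as a feature that is located near the gene but is not part of it.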