Position Weight Matrix, Gibbs Sampler, and the Associated Significance Tests in Motif Characterization and Prediction

Position weight matrix (PWM) is not only one of the most widely used bioinformatic methods, but also a key component in more advanced computational algorithms (e.g., Gibbs sampler) for characterizing and discovering motifs in nucleotide or amino acid sequences. However, few generally applicable statistical tests are available for evaluating the significance of site patterns, PWM, and PWM scores (PWMS) of putative motifs. Statistical significance tests of the PWM output, that is, site-specific frequencies, PWM itself, and PWMS, are in disparate sources and have never been collected in a single paper, with the consequence that many implementations of PWM do not include any significance test. Here I review PWM-based methods used in motif characterization and prediction (including a detailed illustration of the Gibbs sampler for de novo motif discovery), present statistical and probabilistic rationales behind statistical significance tests relevant to PWM, and illustrate their application with real data. The multiple comparison problem associated with the test of site-specific frequencies is best handled by false discovery rate methods. The test of PWM, due to the use of pseudocounts, is best done by resampling methods. The test of individual PWMS for each sequence segment should be based on the extreme value distribution.


Introduction
Most genetic switches are in the form of sequence motifs that interact with proteins [1]. Position weight matrix or PWM [2][3][4][5][6] is one of the key bioinformatic tools used extensively in characterizing and predicting motifs in nucleotide and amino acid sequences. e popularity of PWM has been further increased since its implementation as a component in PSI-BLAST [7], which is frequently used to generate PWM for motif characterization and prediction [8][9][10][11].
A PWM-based sequence analysis involves three types of output: the site-speci�c frequency distribution, the PWM itself, and PWMS for each input sequence (and optionally PWMS resulting from scanning new sequences with the trained PWM). Here I brie�y review the PWM method in the context of motif discovery, followed by a detailed illustration of the Gibbs sampler of which PWM is a key component, and 2 Scienti�ca then propose statistical signi�cance tests appropriate for each of the three types of PWM output.

PWM in the Context of Motif Discovery Methods
e simplest input for a PWM-based method consists of an aligned set of sequences and the speci�cation of the background (prior) frequencies. e main output of PWM, other than the PWM itself, consists of the site-speci�c information content and the motif information content [6] as well as PWMS for individual motifs, together with the associated statistical tests. We �rst illustrate the PWM method by applying it to the 246 donor splice sites of yeast introns each represented by 5 nucleotide sites on the exon side and 12 nucleotide sites on the intron side (Table 1). e four columns on the le of Table  1 headed by A, C, G, and U are the site-speci�c counts of nucleotides A, C, G, and U. When all site-speci�c counts are greater than zero, each element in the PWM, designated by PWM (where , and 4 corresponding to A, C, G, and U, respectively, and is site index), is computed as PWM log (1) where is the background frequency of nucleotide , and is the site-speci�c nucleotide frequency for nucleotide at site (e.g., A 8 / 46 in Table 1). Plotting these site-speci�c PWM values graphically over sites yields the sequence logo [49,50]. e PWM score (PWMS) for a particular motif is computed as PWMS PWM (2) where is the length of the motif which equal 17 for our example shown in Table 1. Note that PWMS is the logarithm of a likelihood ratio, or log-odds. Given a 17 mer, say, = ACGGTACCACG-TAAGTT, we have two hypotheses. e �rst hypothesis is that the 17 mer belongs to a motif, constrained to have speci�c nucleotides at speci�c sites ( Yes ), and the second is that each site in the 17 mer is sampled from a nucleotide pool with no site-speci�c constraints ( No ). e likelihoods of observing sequence , given the two different hypotheses, are speci�ed, respectively, as Yes Yes is leads to the following that is identical to (2): e background frequencies ( ) have been speci�ed in three different ways in previous publications. e �rst is simply to assume equal background frequencies [51,52] in characterizing splice sites with PWM. is is equivalent to the classic sequence logo method for graphic display of site patterns [49] which does not take background frequencies into consideration. In sequences with biased nucleotide frequencies, equal values will generate a false site pattern when there is in fact no pattern. For example, the AT-biased background genome in the yeast implies that PWM A and PWM T will be greater than PWM C and PWM G on average even when the sequences contain no site-speci�c information. Similarly, the classic sequence logo will display A and T more prominently than C and G even when the sequences of interest contains no site-speci�c information.
e second approach to specify is to compute it from the input sequences. As such, in our example, can be computed from the four columns headed by A, C, G, and U on the le side of Table 1. is approach also has a problem. Suppose a certain motif is a poly-U sequence, and all input sequences are "UUUUUUUU". is will generate background nucleotide frequencies with U and A C G 0. Note that the site-speci�c frequencies, given the input sequences all being "UUUUUUUU", are U and A C G 0. So the resulting PWM would then suggest that the motif is not informative, which is contrary to our intuition, that is, a stretch of UUUUUUUU conserved across a set of aligned sequences is likely to be biologically informative.
e third approach is to specify according to the speci�c problem one wishes to solve. For example, when characterizing splice sites of introns in a particular species, one may use the nucleotide frequencies of all transcripts (including all exons and introns) annotated in the genome as the background frequencies [28]. Similarly, a study of site patterns of branchpoint sequences in introns could have values computed from all intron sequences. I suggest that only this third approach be used to avoid the impression that PWM could have an in�nite number of null hypotheses (each associated with a different speci�cation of ).
e computer program DAMBE [53,54] offers different choices for specifying in computing PWM. Similarly, the new sequence logo method allows more appropriate speci�cation of background (prior) frequencies [50]. e resulting PWM for the 246 donor splice sites of yeast introns, with background frequencies computed from all introns, is shown in Table 1.
When some values are zero, as is the case in our example, (1) is inapplicable because the logarithm of zero is unde�ned. ree approaches can be taken to avoid this problem [55,56]. e �rst is to compute by which approaches / with increasing (where is the site-speci�c count for nucleotide or amino acid at site ). PWM values can then be computed by (1). is approach is poor when is small. e second approach is to use explicit pseudocounts by de�ning where is the frequency of nucleotide , and is then It is important to keep small (e.g., 0.0001) because the expected PWM from random sequences is 0 in (1). A large will substantially increase PWM above 0 with random sequences. e two approaches above share one main disadvantage. Suppose we have 10 aligned motifs of 10 amino acids each. Position 3 is occupied by amino acids K (lysine) and R (arginine) and position 5 by amino acid E (glutamic acid). e two approaches above will specify pseudocounts for positions 3 and 5 in the same way, which is unreasonable for the following reason. If position 3 requires a positively charged amino acid, and position 5 a negatively charged amino acid, then amino acids K, R, and H (histidine) should be more likely found than other amino acids at position 3, and amino acid D (aspartic acid) should be more likely found than other amino acids at position 5. By using other aligned protein sequence data of roughly the same divergence we can derive frequency distributions for positions that require a positively charge or negatively charged amino acid and use these frequency distributions to produce pseudocounts [56]. In our case, the pseudocounts at positions 3 and 5 will be assigned quite differently because the frequency distribution for a position requiring a positively charged amino acid is typically quite different from that for a position requiring a negatively charged amino acid.
PWM and PWMS can potentially be used to measure codon usage bias. For example, given the frequency of nucleotide as , the background frequency of a codon, say AGC, can be speci�ed as A G C , and compared to the observed frequency of AGC. Such an approach would eliminate one major weakness of commonly used codon bias indices such as CAI [57,58] and Nc [59,60].

Gibbs Sampler with PWM as a Key Component
While PWM is a technique for characterizing a set of identi�ed motifs, Gibbs sampler [61], named aer the mathematical physicist, J. W. Gibbs, is for de novo motif discovery. For example, given a set of yeast intron sequences, what and where is the branchpoint site ? All information we have is that each intron should have one branchpoint site, but what sequence signature does it have and where is it located along the intron sequence? is scenario ( Figure 1) is where the Gibbs sampler will shine. A similar scenario involves the discovery of regulatory equence motifs given a set of coexpressed genes (i.e., genes that increase or decrease their transcription level synchronously over time) by microarray [62,63], SAGE [64,65],  or deep-sequencing [66][67][68] experiments. If the coexpressed genes are also coregulated, then they may share a certain yetunknown transcription factor binding site controlled by the same or similar transcription factor. Given that the binding site is oen located upstream of the translation initiation codon, one may extracted the upstream sequences from these coexpressed genes and let the Gibbs sampler to �nd the candidate regulatory motifs. A recent study has shown that shared motifs may also present in the 5 ′ UTR of mRNA to modulate translation initiation [69].
Gibbs sampler is one of the Monte Carlo algorithms that rely on repeated random sampling to estimate desired parameters. Monte Carlo method was envisioned by the famous mathematician Stanislaw Ulam, following the successful assembly of the �rst electronic computer ENIAC in 1945, and further developed by physicists and mathematicians working on nuclear weapon projects in the Los Alamos National Laboratory in mid-1940s [70]. e term "Monte Carlo method" was coined by Nicholas Metropolis to designate this class of computational algorithms. While the general application of the method unsurprisingly followed the operation of ENIAC in 1945, the physicist Enrico Fermi is known to have independently developed and applied the method nearly 15 years earlier with mechanical calculators [70].
ere are two slightly different applications of Gibbs sampler in motif prediction. e �rst assumes that each sequence contains exactly one motif [30] and the associated algorithm is called a site sampler. e second is more �exible and allows each sequence to have none or multiple motifs [71] and the algorithm is termed a motif sampler. We will illustrate the site sampler and then brie�y discuss the motif sampler.
I numerically illustrate the Gibbs sampler algorithm for motif discovery. e main output of the Gibbs sampler is typically of three parts. e �rst is the shared motif in an aligned format (bottom panel in Figure 1). e second is a PWM summarizing the discovered motif, and the third contains the associated signi�cance tests which will be reviewed in a later section. e derived PWM, just like any other PWM, can be used to scan sequences not in the input data to discover the presence of the motif present elsewhere.

Computational Details of the Gibbs Sampler.
We will use the erythroid nucleotide sequences [85], listed in Figure 2, to illustrate the Gibbs sampler algorithm. Our main objective is to infer the location and sequence of the unknown motif shared among the sequences so that we can align the motifs as shown in the bottom panel of Figure 1. e aligned motifs will allow us to generate a PWM that characterizes the motif by site-speci�c nucleotide frequency distributions. e PWM can be used to scan for the presence of the identi�ed motif in other sequences.
We need �rst to count all nucleotides, with their numbers designated as A , C , G , and T , respectively, in the sequences. e total number of nucleotides of all 29 sequences ( Figure 2) is 1209, with A , C , G , and T equal to 325, 316, 267, and 301, respectively. ese values are needed for specifying pseudocounts (which we encountered in the previous section on PWM).
Let be the number of input sequences designated as 1 , 2 , … , , … , . Let be the length of , and be the length of the motif, which typically is of length 4-8. For our illustration, we will use . One typically would run the Gibbs sampler several times with different values if one knows little about the length of the motif. e PWM is of dimension 4 × for nucleotide sequences, and 20 × for amino acid sequences. Let A be the unknown starting position of the motif in .
e main algorithm of Gibbs sampler is of two steps. e �rst is random initialization in which a random set of A values is assigned and site-speci�c nucleotide frequencies are calculated. e second step is predictive updating until a local solution of A values is obtained, together with site-speci�c nucleotide frequencies that can be made into a PWM. is is repeated multiple times and previously stored locally optimal solutions are replaced by better ones. Convergence is typically declared when two or more local solutions are identical. ese steps are numerically illustrated in the following sections.
e second column in Table 2 will be referred to as the C0 vector with C0 A , C0 C , C0 G , and C0 T equal to 278, 279, 230, and 248, respectively. e 4 × matrix, occupying the last six columns in Table 2, will be referred to as the C matrix. e C matrix is tabulated from the 29 random motifs whereas the C0 vector is tabulated from nucleotides outside of the motifs. us, the sum of the �rst, second, third, and fourth rows of Table 2 should be equal to A , C , G , and T , respectively. Also note that each of the six columns in the C matrix should add up to 29.

Predictive
Update. e predictive update consists of obtaining (= 29 in our example) random numbers ranging from 1 to , and use these numbers as an index to choose the sequences sequentially to update the site-speci�c distribution of nucleotides (the C matrix) and the associated frequencies (the C0 vector). For example, the random numbers in my �rst run of the Gibbs sampler happen to be 11,18,26,22,2,28,12,9,7,3,17,16,1,4,21,15,14,24,19,27,29,6,10,20,13,8,23,25, and 5, respectively. is means that 11 will be used �rst, and 5 last, for the �rst cycle of the predictive update. It is important to use a random series of numbers instead of choosing sequences according to the input order. e latter increases the likelihood of trapping Gibbs sampler within a local optimum.
Our �rst randomly chosen sequence happens to be 11 and its randomly chosen motif starts at site 11, that is, A 11 11, with the motif being AGTGTG. is initial motif will now be taken out of the C matrix and put into the C0 vector. is motif has one A, zero C, three G's, and two U's. By adding these values to the C0 vector in Table 2, we obtain the C0 vector in Table 3. We also need to take this motif out of the C matrix by subtracting the �rst A from the �rst value in the �rst column in the C matrix in Table 2 (i.e., new C A,1 = old C A,1 − 1), the second G from the third value in the second column in the C matrix in Table 2 (i.e., new C G,2 = old C G,2 − 1), and so on. is converts the C matrix in Table  2 to the C matrix in Table 3.
At this point the C matrix is made of the 28 randomly chosen motifs, one from each sequence (excluding 11 ). You will notice that each of the six columns in the C matrix has a sum of 28. e reason for taking the initial motif in 11 out of the C matrix and put it back into the C0 vector is that we are going to �nd a better motif in 11 , and put it into the C matrix so that the C matrix will again be based on 29 motifs. How are we going to get a better motif? Recall that a position weight matrix (PWM) can be used to scan a sequence in a sliding window of length m to get position weight matrix scores (PWMSs) for each window. We will make a PWM out of the C0 vector and the C matrix and use the resulting PWM to scan 11 and get a new motif that has the highest PWMS.
One may wonder why such a practice would get us anywhere given the fact that the C matrix is initially made of random motifs. e resulting PWM would exhibit no pattern, and the resulting PWMSs will therefore be uninformative. e key concept here is that when one takes a random walk over a terrain with multiple peaks, one sooner or later will encounter a peak, and climbing the peak will at least bring us to a local maximum. Aer reaching the top of one peak and recording the height, we will land ourselves at another randomly chosen location and start climbing local peaks again. is process continues until we reach the highest peak or aer a �xed number of computer iterations without �nding any higher peak.
Typically, the PWM is generated by using the C0 vector as background frequencies ( ) and the C matrix as site-speci�c frequencies . However, although most algorithmic illustration of the Gibbs sampler computes this way (e.g., [32, pp. 133-147]), computed from the C0 vector has F 2: e erythroid sequences [85] for illustrating the Gibbs sampler algorithm, with the 3 ′ -end trimmed to the maximum length 50 bases to �t the page.
T 2: Site-speci�c distribution of nucleotides from the 29 random motifs of length 6. e second column lists the distribution of nucleotides outside the 29 random motifs. serious problems when input sequences are almost as short as the motif. For example, if the true motif has many nucleotide A and few nucleotide U, then the C0 vector will also have many A and few U. Now a motif with a few nucleotide U will be taken as deviating substantially from the background and will tend to have a high PWMS, leading to a biased estimate of the true motif. us, when input sequences are short, one should specify the background frequencies instead of using C0 to compute . One may refer to the previous section on PWM for more information on background frequencies.
For pseudocounts, we may use . e resulting PWM is then used to scan which is 40 bases long, with 35 ( 4 − possible motif starting points (i.e., possible A values along the sequence). e 35 PWMS values for these 35 possible motifs in (Table 4) are normalized to have a sum of 1 ( Norm in Table 4). We now proceed to update the initial A ( ) by a new A value based on result in Table 4. How should we choose the new A value?
ere are two strategies to choose the new A value. e �rst is to randomly pick up an A value according to the magnitude of Norm (Table 4). You may visualize a dartboard with 35 slices with their respective areas being proportional to Norm values. When you throw a dart at the dartboard, large slices will have a better chance of being hit than small slices. If the dart happens to land on the 7th slice, then the initial A will be updated to A 7, with the original motif AGTGTG replaced by the new motif CTCAAG.
e second strategy is simply to use the largest Norm value for updating initial A to the new A value. As the motif starting at site 25 has the largest Norm , we will set the new A equal to 25 and replace the initial motif (= AGTGTG) by the new motif (= TCACAG). With this approach we do not need as we can choose A based on the largest odds ratio in Table 4. is strategy is faster than the �rst, but did not seem to lose any sensitivity in motif discovery based on limited simulation studies. However, if one is concerned about the possibility of missing motifs, one should use the �rst strategy.
Regardless of how the new A is chosen, the updating is the same. Suppose we have taken the second strategy and set the new A equal to 25. e C matrix in Table 2 is then revised by replacing the original A motif (= AGTGTG) by the new motif (= TCACAG). is leads to an updated C0 vector and C matrix (Table 5).
We repeat this process for the rest of the sequences to update the rest of A values. Aer the last sequence has been updated, we have obtained a new set of A values, a new set of 29 motifs, together with the PWM based on the associated C0 vector and C matrix. At this point we compute a weighted alignment score (i.e., a weighted PWMS) as follows: where is the motif width, and Code is the number of different symbols in the sequences (4 for nucleotide and 20 for amino acid sequences). is a measure of the quality of alignment of the motifs. e larger the value, the better.   (8), has many different names. It has been called the Kullback-Leibler information or Kullback-Leibler divergence in information theory [86][87][88], or large-deviation rate function in statistical estimation [89]. In bioinformatics, especially in motif characterization and prediction involving a PWM, it is most oen referred to as the information content [6]. e fact that the Kullback-Leibler information is a special case of the so-called -divergence that measures the difference between two probability distributions and leads naturally to the use of the letter in (8).
e predictive updating is repeated again and again. Each time when we get a new set of A values, a new set of motifs 8 Scienti�ca T 5: Site-speci�c distribution of nucleotides from the 29 initial motifs of length 6, aer replacing the initial A 11 motif (= AGTGTG) by the new motif (= TCACAG). and the PWM based on the C0 vector and the C matrix, we compute a new value. If the new value is greater than the previously stored value, then the new value, the new set A values, and the new set of motifs will replace the previously stored ones. is continues until we reach a local maximum of or when the preset maximum number of local loops has been reached. e resulting value, the set of A values, the new set of motifs and the associated PWM are stored as the locally optimal output. In the hill-climbing analogy, represents the height of a local peak. e entire process is now repeated from the very beginning, that is, we again perform the initialization by choosing another random set of A values, and go through the local iteration to obtain another locally optimal output. If the new locally optimal output is better than previously stored ones (i.e., the new value is larger than the previously stored one), the new output will replace the previously stored output. is process is repeated multiple times until convergence is reached, that is, when new values are consistently the same as the previously stored one, or until a �xed number of computation iteration has been reached without �nding an value better than what has already been recorded. e �nal site-speci�c nucleotide distribution (Table 6) displays a much stronger pattern than the initial distribution (Table 2) from 29 randomly chosen motifs. e �nal aligned motifs (Figure 7-2 in [32]) share in general a consensus of (C/T)TATC(A/T). Its reverse complement (A/T)GATA(A/G) is known to be the binding site of GATA-binding transcription factors [90][91][92][93][94][95]. is discovery of the motif suggests that this set of sequences may indeed be coregulated by the same type of GATA-binding transcription factors. Such �ndings are crucial in transcriptomic and proteomic studies aiming to understand gene regulation networks. Algorithms such as Gibbs sampler help us understand interactions among genes and gene products.
It might be relevant here to summarize essential biology about the GATA box and GATA-binding transcription factors. A living cell is a system with many genetic switches that can be turned on or off in response to intracellular and extracellular environment. It is these switches that distinguish a normal living cell from a cancer cell or a dead cell. e GATA motif (or GATA box) is one of such switches and it is switched on or off by speci�c transcription factors (which are proteins that bind to the motif and turn on or off the transcription of the gene containing such motifs). One of the better known GATA-binding transcription factors is GATA-1 which binds to the GATA motif found in cis-elements of the vast majority of erythroid-expressed genes of all vertebrate species examined [96,97]. e core promoter of the rat platelet factor 4 (PF4) gene contains such a GATA motif and the binding of such GATA motif by GATA-binding proteins such as GATA-1 suppresses the transcription of the PF4 gene [91]. It is now known that GATA regulatory motifs and the GATA-binding transcription factors are present in a variety of organisms ranging from cellular slime mold to vertebrates, including plants, fungi, nematodes, insects, and echinoderms [98], suggesting that the function of the genetic switch is far beyond erythropoiesis. In human, the GATA motif and the GATA-binding proteins are implicated in several diseases [99]. e sequence divergence of GATA motifs and their binding proteins should shed light on the coevolution of the components of genetic switches.
One may have noted that some sequences have a strong (C/T)TATC(A/T) motif, whereas others (e.g., the second, the fourth and the �h sequences) have only weak and highly doubtful signals. Computer programs implementing Gibbs sampler typically would output a quantitative measure of the strength of the signal, and PWMS is the most oen used index for this purpose ( Table 7). Recall that PWMS is the log-odds, but one may use the odds ratio directly as a measure of relative motif strength. Also recall that an odds ratio is the ratio of two probabilities associated with two hypotheses. �e�ne Yes as the hypothesis that the 6-mer is a motif with its site-speci�c constraints, and No as the hypothesis that the 6-mer is not a motif and has its probabilities speci�ed only by the four overall nucleotide frequencies. e odds ratio is the ratio of the probability that Yes is true over the probability that No is true. One generally should take a cut-off value of 20, that is, Yes is 20 times more likely than No . One should note that Gibbs sampler, being started from a random set of A values, may not necessarily converge to the same motif. is is both an advantage and a disadvantage of the algorithm. e advantage is that repeated running of the algorithm will allow us to identify other types of hidden motifs (i.e., other than the reverse complement of the GATA motif) in the sequences. e disadvantage is that users not familiar with the algorithm oen get confused when the same input generates quite different results. For example, another set of putative motifs, in the form of RGVAGR (where R is A or G and V is "not T"), has been found to be shared among the sequences [32, p. 146].
It is possible that the input sequences may contain two or more different biologically signi�cant motifs. If one motif is much stronger (more over-represented among the input sequences) than other motifs, and if the search by the Gibbs sampler algorithm outlined before is exhaustive, then we will always end up with the strongest motif and miss all other biologically interesting motifs. However, one could run Gibbs sampler by speci�cally exclude the strongest motif already identi�ed so that weaker motifs can then be identi�ed.

Motif
Sampler. e Gibbs sampler has two versions. e one that we have just illustrated is called site sampler. It assumes that each sequence contains exactly one motif [30]. e other version is more �exible and allows each sequence to have none or multiple motifs [71] and the algorithm is termed motif sampler. e GATA-binding transcription factors comprise a protein family whose members contain either one or two highly conserved zinc �nger �NA-binding domains [98] and it is consequently likely that a sequence may contain more than one GATA box. For example, the erythroid �ruppel-like factor (���F, which is a zinc �nger transcription factor required for -globin gene expression) has in its 5 ′ -region two GATA motifs �anking an � box motif characterized by CANNTG [100]. is calls for an algorithm that can identify multiple motifs in a single sequence.
e site sampler can be extended to motif sampler by post-processing. e PWM generated from the site sampler can be used to re-scan the sequences for motifs and compute the associated PWMS or odds ratio for all 6-mers in each sequence. All what we need is to have a cut-off score to keep those motifs with a PWMS or odds ratio greater than the cutoff score.
e PMW, be it from alignment of known motifs or from running the Gibbs sampler, need to be assessed for its statistical signi�cance. One continuous problem with PWM is the lack of generally applicable and accurate signi�cance tests, either for individual sites of the motif, on PWM or on PWMS. ere are two reasons why accurate signi�cance tests are desirable. First, aer characterizing a motif with PWM, one naturally wants to know whether the characterized PWM is signi�cant, which sites contribute to the signi�cance and which sequence has a PWMS that is signi�cantly greater than random expectation. Second, aer �nding a signi�cant PWM, one typically would want to use the PWM to scan other sequences to identify new motifs, and one needs a good signi�cance test to show the reliability of the identi�ed motif. is would reduce the number of putative sequence motifs going through experimental veri�cation which is typically tedious and expensive [101,102].
In short, three separates signi�cance tests are required: one for individual sites, one for PWM per se and one for PWMS. ese tests are detailed in the following sections.

4.�. Stati�ti�al
Si��i��a��e �e�t� for ���i�i��al Site�. e statistical signi�cance of individual sites can be done by 2tests with type I error rate controlled for by the false discovery rate [103,104]. Take the data in Table 1 for example. e background frequencies are A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763, which allow us to obtain expected T 8: Evaluating statistical signi�cance of individual sites by two types of false discovery rate.
counts of A, C, G, and T. With 17 2 -tests (Table 1), we face the problem of multiple comparisons and need to control for the familywise error rate [105] which is synonymous to experimentwise error rate.
Designate the error rate by 0 , then the exact critical for rejection in individual tests is where is the number of tests and is equal to 17 in our case. If we set 0 0.05, then 0.00 0 2 05. e Bonferroni criterion is based on the approximation that which leads to 0.002 . e second order Bonferroni , when relevant assumptions are met [105], is based on which leads to 0.002 . In practice, these different values make little difference. In our case, all three values lead to the conclusion that the frequency distribution at sites 1, 2, 13, and 16 do not deviate signi�cantly from the background frequencies.
e statistical protocol for controlling for the familywise error rate has been considered too conservative, and the protocol for controlling for the false discovery rate (FDR) has consequently been proposed recently [103,104]. e classical FDR approach [103], now commonly referred to as the Benjamini-Hochberg procedure or simply the BH procedure, sorts values in descending order and computes critical⋅BH⋅ for the th value (where the subscript BH stands for the BH procedure) as where is FDR (e.g., 0.05), is the rank of the value in the sorted array of values, and is the number of tests (i.e., the number of values). If is the largest satisfying the condition of ≤ critical⋅BH⋅ , then we reject hypotheses from to . In our case, all the sites are statistically signi�cant based on critical⋅BH⋅ (Table 8).
e FDR procedure above assumes that the test statistics are independent or positively dependent (in the extreme case of perfect positive dependence, all tests are the same and there is really only just one test with no multiple comparison problem). A more conservative FDR procedure has been developed that relaxes the assumption [104]. is method, now commonly referred to as the Benjamini-Yekutieli or simply the BY procedure, computes critical⋅BY⋅ for the th hypothesis as With in our case, Σ / . 55252 . Based on critical⋅BY⋅ , the 2 -tests pertaining to sites 1 and 2 are not statistically signi�cant (Table 8). e BY procedure was found to be too conservative and several alternatives have been proposed [106]. For large , Σ / converges to ln( ) (Euler's constant equal approximately to 0.57721566). us, for 0000, Σ / is close to 10. So critical⋅BY is nearly 10 times smaller than critical⋅BH . Both FDR procedures above have been used in signi�cance tests concerning yeast splicing sites [23].
�.�. ����u�t�n� �t�t�st�c�� ���n��c�nce o� ��� ��en �seudo� counts Are Used. Whether a PWM represents a motif with site-speci�c constraints can be tested by using the statistic [6] speci�ed in (8). However, the distribution of is altered by pseudocounts as speci�ed in (5) and (7). For example, the expectation of is no longer zero with pseudocounts when there is no site-speci�c pattern.
A more straightforward method for evaluating the sig-ni�cance of PWM is by resampling. With the tetranomial distribution de�ned by ( A C G T ) , where is the nucleotide frequency of nucleotide , we can obtain a new set of sequences (246 sequences of 17 nt each) and compute . is is repeated for, say, 5000 times to obtain 5000 values. e 95th or 99th percentile of the values can be taken as critical values at 0.05 and 0.01 signi�cance levels, respectively. An observed for the PWM is signi�cant if it is greater than the critical . Based on this criterion, the PWM from the 246 donor splice sites is highly signi�cant. e same resampling technique can also be used to evaluate the signi�cance of the site-speci�c patterns in the previous section or the signi�cance of PWMS in the next section. ���� ����is�ic�l �i�ni�c�nc� o� ����� One of the purposes of constructing a PWM is to facilitate the computation of PWMSs. For example, the PWMS for sequence UAAAG-GUAUGUUUAAUU, given the PWM in Table 1 (the four columns headed by A, C, G, and U on the right side), is simply PWMS = PWM U1 + PWM A2 + ⋯ + PWM U17 .
us, we can use the PWM to predict a new donor splice site by scanning a nucleotide sequence with a window of 17 nucleotide sites and computing the PWMS. e larger the PWMS, the more likely the 17-mer is a donor splice site. However, we need to address the question of how large is large in such in silico predictions.
PWMS from random sequences follows approximately the normal distribution ( Figure 3), with mean 0 (or slightly greater than 0 when pseudocounts are used with a small ). e distribution in Figure 3 has a mean equal to 0.068884 and a standard deviation equal to 0.314714254.
Suppose we are to use our 4 × 17 PWM to scan a target sequence of 1000 nt for a possible donor splice site. ere are 984 (= 1000 − 17 + 1) different 17 mers along the sequence , resulting in 984 PWMS values. If the maximum PWMS is 1, how statistically signi�cant is it?
If the length of the target sequence were only 17 nt instead of 1000 nt, then the answer is easy. e upper 99% con�dence limit for a normal distribution with mean equal to 0.0689 and standard deviation equal to 0.3147 is 0.8808 (= 0.06888 + 2.58 × 0.3147), which implies that a PWMS of 1 is signi�cant at the 0.01 level. However, because our target sequence is 1000 nt, with the maximum of PWMS equal to 1 out of a total of 984 PWMS values, we need to go a long way to evaluate the signi�cance of this maximum PWMS value. Suppose we perform many sampling experiments from the same normal distribution ( as in Figure 3: In each experiment, we sample times to obtain 1 , 2 , … , . e maximum in each experiment is max . is is equivalent to use PWM to scan a sequence to obtain PWMS 1 , PWMS 2 ,…, PWMS , with the maximum PWMS designated as PWMS max . What is the distribution of max , designated as ( max ? Note that max is an extreme value of values, so it is natural to call ( max an extreme value distribution (EVD).
Extreme value distribution or EVD, also referred to as the Gumbel distribution in honour of the pioneer of the statistics of extremes [107], is used in BLAST [108,109] and new versions of FASTA [110] to attach statistical signi�cance to a match score between two sequences. It is also used to perform signi�cance tests involving PWM [5,6,55]. Here I will outline the mathematical framework of EVD pertaining to PWMS. e probability of getting an value smaller than max is max = max 0 ( .
Note that max can be either 1 , 2 , … , or , with possibilities. ( − 1 values are smaller than max in each experiment. is leads us to which is plotted for = 0.068884, = 0.314714254, and = 84 ( Figure 4). Compared to the distribution of ( in Figure  3, the distribution of ( max ) has been shied substantially to the right and peaks at max = 1.05.
Now we can answer the question of whether our observed max = 1 is statistically signi�cant. e probability of observing an max value equal to 1 or greater is which is approximately 0.7986, that is, it is not statistically signi�cant. A much simpler, but likely less accurate, method based on only without deriving (16)- (18), is to use the Bonferroni criterion in (10). With , /986 81 which requires a PWMS value equal to 1.292076814 to be marginally signi�cant, given 6888 , and 31 71 2 . As our observed maximum PWMS is 1 < 1 292 7681 , it is not signi�cant at the 0.05 signi�cance level.
In summary, a PWM-based sequence analysis involves three types of output: the site-speci�c deviation from the background frequencies, the position weight matrix itself and the position weight matrix score for each input sequence. e signi�cance of the �rst can be evaluated with 2 -tests using the false discovery rate as the criterion for rejection of the null hypothesis, the second by the resampling method, and the third by statistics based extreme value distribution. ese tests have been implemented in the most recent versions of DAMBE [53,54].