Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.206 Research Article

Suppression subtractive hybridization (SSH) is frequently used to unearth differentially expressed genes on a whole-genome scale. Its versatility is based on combining cDNA library subtraction and normalization, which allows the isolation of sequences of varying degrees of abundance and differential expression. SSH is a complex process with many adjustable parameters that affect the outcome of gene isolation.We present a mathematical model of SSH based on DNA hybridization kinetics for assessing the effect of various parameters to facilitate its optimization. We derive an equation for the probability that a particular differentially expressed species is successfully isolated and use this to quantify the effect of the following parameters related to the cDNA sample: (a) mRNA abundance; (b) partial sequence complementarity to other species; and (3) degree of differential expression. We also evaluate the effect of parameters related to the process, including: (a) reaction times; and (b) extent of driver excess used in the two hybridization reactions. The optimum set of process parameters for successful isolation of differentially expressed species depends on transcript abundance. We show that the reaction conditions have a significant effect on the occurrence of false-positives and formulate strategies to isolate specific subsets of differentially expressed genes. We also quantify the effect of non-specific hybridization on the false-positive results and present strategies for spiking cDNA sequences to address this problem.


Introduction
Developmental processes such as ageing, metamorphosis, and embryo development are associated with changes in gene expression (Hill et al., 2000;Lee et al., 1999;White et al., 1999). Cells exposed to different extracellular environments, at different metabolic levels or pathophysiologic states, also exhibit different profiles of gene expression (Alizadeh and Staudt, 2000;Bittner et al., 2000;DeRisi et al., 1997;Oh and Liao, 2000). The transformation of genotype to a variety of phenotypes is characterized by differential gene expression from the same repertoire of sequence information. An important first step in the elucidation of the molecular mechanisms responsible for altered physiological states or developmental pathways is the identification of genes that are differentially regulated at the transcriptional level.
Several methods to profile gene expression have been developed. The expression profile for each sample may be estimated separately and then compared by methods that depend on specific hybridization of probes to DNA microarrays (Lipshutz et al., 1999) or on the counting of tags or signatures of DNA fragments (Brenner et al., 2000;Velculescu et al., 1995). Differences in gene expression between two samples can be compared 406 C. Gadgil et al. directly by methods such as differential display (Liang and Pardee, 1992), two-colour microarray hybridization (Brown and Botstein, 1999), subtractive cloning techniques (Sagerstrom et al., 1997), and combinations of these (Pardinas et al., 1998;Yang et al., 1999). These approaches have been successfully used to identify genes differentially expressed in two populations that exhibit large changes in expression levels, or genes that are expressed at high concentrations in terms of number of copies per cell. Closed systems such as DNA microarrays require genomic sequence information in order to identify differentially expressed transcripts. Open systems have the flexibility of identifying uncatalogued sequences. However, many techniques have a low efficiency of identifying rare genes that are differentially expressed (Martin and Pardee, 2000). This problem is exacerbated when the change in expression level of rare transcripts is small. Since genes expressed at low levels also play a role in establishing differentiated phenotypes, their identification is essential for a complete mechanistic understanding of cellular changes.

Suppression subtractive hybridization
Suppression subtractive hybridization (SSH), a technique to identify a set of genes differentially expressed in two cell samples, has the promise of overcoming some of these difficulties . The singular advantage of SSH lies in the ability to identify differentially expressed genes, irrespective of the level of expression, in the absence of sequence information. SSH has been used to investigate differential expression in a variety of experimental systems, including malignant melanoma (Hipfel et al., 2000), liver regeneration (Groenink and Aad, 1996), embryo development (Simpson et al., 1999) and honeybee larval development (Evans and Wheeler, 1999). SSH identified differentially expressed sequences with no matches in the public databases in all these systems.
The SSH process normalizes the levels of rare and abundant genes, and subtracts genes expressed in both samples. Genes upregulated in one sample (referred to as tester) relative to the other sample (called the driver) can be identified. The SSH process ( Figure 1) entails two rounds of hybridization followed by two PCR reactions . Poly A + mRNA is isolated from total RNA and reverse-transcribed to give a double-stranded cDNA pool. The cDNA is digested by RsaI, resulting in fragments 0.1-2 kb long. This step reduces the size distribution of cDNA species and creates blunt ends for adaptor ligation. The tester is divided into two equal parts (referred to as tester A and tester B) and ligated with different adaptors (adaptor A and adaptor B) at the 5 end of each fragment. In the first set of hybridizations (hybridization 1A and hybridization 1B), an excess of driver sample is added to each tester fraction separately, and the reactions are allowed to proceed under identical conditions. Among species present at the same concentration in the tester, those present in similar or higher levels in the driver will form duplexes at a faster rate than those whose concentration in the driver is lower. This leads to an enrichment of single-stranded species that are present at a higher level in the tester. Due to the second-order hybridization process, normalization of the concentration of single-stranded species is also achieved, as abundant species form duplexes at a higher rate than rare species. In the second hybridization, the end products of hybridization 1A and hybridization 1B are mixed and additional excess single-stranded driver is added for further subtraction. Unsubtracted single stranded species from hybridization 1A and hybridization 1B form duplexes in which one strand has adaptor A and the other strand has adaptor B. The duplex species formed during the two hybridization steps are shown in Figure 1.
After the hybridization reactions, end-filling of duplexes with adaptor overhangs is carried out to form blunt-ended DNA. The duplexes are then amplified by PCR using adaptor A and adaptor B as primers. This leads to a differential amplification, depending on the nature of the duplex. Those duplexes in which the two strands have different adaptors are exponentially amplified in the PCR reactions. Duplexes in which both strands have identical adaptors at both ends form panhandlelike structures because of the self-complementary nature of the adaptors and are not amplified. Duplexes with an adaptor only at one end are linearly amplified.
The PCR products are then ligated into vectors that are used to transform Escherichia coli. In a successful SSH the frequency of the sequences isolated from E. coli clones is greater for genes that are expressed at a higher level in the tester. To identify genes that are downregulated in the sample

Single Strands
Driver-Driver

Hybridization 2 Products
End-filling Reaction for duplexes Figure 1. Schematic of the SSH process depicting the cDNA species formed during the hybridization reactions used as the tester, a reverse SSH is carried out by switching the samples used as tester and driver. SSH accomplishes normalization and subtraction by taking advantage of the different rates of hybridization of cDNA strands for different genes, depending on their abundance level and the degree of (differential) expression. The extent of hybridization is governed by the hybridization temperature, hybridization times and the driver : tester ratio. The effect of these operating parameters on the efficiency of normalization and subtraction, and thus the probability that a particular differentially expressed gene is isolated, differs depending on each gene's abundance level and extent of differential expression. As the conditions that lead to the highest probability of successful isolation vary with the abundance, level of differential expression, length of the transcript and degree of sequence similarity to other transcripts, there may not be a unique optimal condition of SSH for isolating all differentially expressed genes. The large number of parameters that affect the outcome of SSH suggest the use of a mathematical model to facilitate the selection of experimental conditions.

Mathematical models for subtractive hybridization
Attempts to model various techniques of isolating differentially expressed transcripts have assumed DNA hybridization to occur as a simple irreversible bimolecular reaction, and used the hybridization model developed by Wetmur and co-workers (Wetmur, 1976;Wetmur and Davidson, 1968). Ermolaeva et al. (1996;Ermolaeva and Wagner, 1995) developed a mathematical model for a subtractive hybridization process where the tester contains only three differentially expressed transcripts, which are completely absent in the driver, and presented an analytical solution for the case of large driver excess. The subtractive hybridization model of Cho and Park (1998) explored the effect of differing tester : driver ratios on target identification in cases where target sequences present in the tester are totally absent in the driver. Milner et al. (1995) have presented a model for subtractive hybridization of genomic deletion mutants and enrichment of upregulated sequences, and predicted target enrichment in genomic subtractions consistent with experimental results. This model, however, does not present an analysis of the probability of isolation of a particular differentially expressed sequence.
Since it is more likely that genes are differentially regulated from a basal level rather than being switched on or off (Gurskaya et al., 1996), a mathematical model for a process that identifies sequences differentially expressed between two cell samples should consider all possible levels of regulation. Microarray experiments and studies using serial analysis of gene expression (SAGE) (Zhang et al., 1997) have clearly demonstrated a range in the level of gene expression, with the majority of genes exhibiting limited if not negligible differential expression (Sagerstrom et al., 1997). Therefore, any attempt to model the SSH process should also consider the concentrations of non-differentially expressed species that remain at the end of the process. This concentration can then be used to estimate the number of false-positives, and the probability that the process will isolate a particular differentially expressed species. In a cDNA pool, there exist several sequences that have a partial sequence homology to each other. The formation of chimeric cDNAs during the SSH process has been observed (Zhang et al., 2000). A comprehensive mathematical model for a process based on DNA hybridization should account for nonspecific hybridizations between strands that have partial sequence homology.
In this report, we present a model for the SSH process that enables analysis of the effects of process variables, such as the amount of driver excess and the hybridization times on the probability of identifying differentially expressed cDNA species having a certain abundance, relative expression level, and degree of sequence similarity to other sequences in the hybridization mix.

Model development
We represent single-stranded cDNA for species i (i = 1, 2, . . . n T ) by M A i , M B i or M D i where the superscript denotes the type of adaptor (adaptor A, adaptor B, or no adaptor) and n T is the total number of distinct sequences. After hybridization reactions, each single-stranded species M p i (p = A, B or D) may form three types of duplexes M p i M q j : (a) homo-duplexes with complete sequence match (i = j) and with the same adaptors on both strands or both strands from the driver (p = q); (b) homoduplexes of strands that are perfectly complementary (i = j) except for the adaptors (p = q); and (c) heteroduplexes of partially complementary strands (i = j). We assume that the behaviour of the sense and antisense strands of each species will be completely symmetrical. Therefore, the model focuses on one strand.

Model development for ideal hybridization
In our initial analysis, it is assumed that only perfectly complementary strands react irreversibly to form duplexes (i.e. kf p,q i,j = 0 for i = j and kb p,q i,j = 0∀i,j). The rate constant for the duplex formation reaction is assumed to be the same for all species, i.e. kf p,q i,i = constant = k f . Initial concentrations for the single-stranded and duplex species at the beginning of the hybridization reactions are given in Table 1. Using these assumptions to simplify Equations 1 and 2, the mass balance equations for each step of the hybridization processes are formulated and listed in the Appendix. The concentration of duplexes with two different adaptors, A and B, at the end of the hybridization process can be expressed in terms of the initial concentration of the single-stranded species [M A i ] 0,1 , the relative expression ratio κ i , the hybridization rate constant k f and the hybridization time (t hyb1 , t hyb2 ) and driver excess (E 1 , E 2 ) for the two reactions as: The duplex concentration remains unchanged during the end-filling reaction. During the final PCR reaction, species are amplified exponentially, amplified linearly, or not amplified depending on the nature of the adaptors. After the PCR step, the total DNA corresponding to gene i available for ligation is: In a typical SSH, the number of PCR cycles (n PCR ) is high, and hence 2 n PCR n PCR . The right-hand side of Equation 4 is dominated by the first term, and can be expressed as: After the PCR reaction, a subtracted cDNA library is constructed and N col colonies are picked for further analysis as putative differentially expressed genes. The probability p s that species s is among those N col colonies depends on the fraction (f s ) of the PCR product corresponding to species s. The probability that, of N col colonies, none corresponds to species s is (1 − f s ) N col . Hence, the probability that at least one colony corresponds to s is: where f s is the ratio of the DNA concentration corresponding to species s available for ligation to the total DNA concentration for all n T species available for ligation: Substituting Equation 5 in Equation 7, we get the expression for the probability of identification of species s for ideal hybridizations as: The mathematical framework outlined here provides an analytical expression for the concentration of the different duplex species that are formed after the two hybridization steps as a function of the initial concentrations of the cDNA single strands and the reaction conditions (Equation 3). Substituting this expression in Equation 8, the probability of isolation of a particular differentially expressed gene can be obtained.

Non-specific hybridization of partially complementary strands
To account for non-specific hybridizations in which single strands that are not perfectly complementary hybridize to form heteroduplexes of the type M p i M q j (i = j), we consider the case where there exist two species, i and j, with varying degrees of sequence complementarity. Equations 1 and 2 are solved simultaneously to simulate the system. If the number of species that have partially complementary sequences is N s , it can be shown that we have to solve 4.5 N s (N s + 1) simultaneous ordinary differential equations to fully simulate the system. Since this number scales as the square of the species involved, the problem quickly becomes computationally intractable with just a few species.
Through numerical simulations of these model equations, the concentration of duplex species that are formed from the hybridization of one strand with adaptor A with another strand with adaptor B ([M A i M B j ]) can be simulated. An assumption is made that such a heteroduplex is not dissociated during the end-filling step. The heteroduplexes will then be amplified by the PCR process. The total amount of DNA available for ligation after the PCR steps, given for ideal hybridizations by Equation 4, can be rewritten for this situation as: As 2 n PCR n PCR , the value of D i can be approximated as: This value can then be substituted in Equation 7 to obtain: can be used to calculate the fraction of DNA corresponding to the species i and hence the probability of identification of at least one colony containing the sequence can be estimated using Equation 6.

Simulation parameters
To use the developed equations for simulation of the SSH process, the reaction rate constants and initial concentrations of M i have to be determined. The cDNA is digested with RsaI, leading to single strands with an average molecular mass of ∼150 kDa, corresponding to a length of 470 bp. The experimentally observed range of strand lengths is 200-2000 bp. Using a total cDNA concentration in the tester of 2 µg cDNA and taking reagent dilution into account, the total cDNA concentration is calculated to be 1 × 10 −7 M. PolyA + mRNA in a typical mammalian cell is divided into three abundance classes: abundant, intermediate and rare species (Hastie and Bishop, 1976). The number of species in each class and their relative abundance is shown in Table 2. This classification is also computationally convenient as it enables the estimation of the average initial concentration of a species in a particular class from the total cDNA concentration ( Table 2). The concentrations of species M i in the driver are the corresponding tester concentrations multiplied by the excess ratio E and divided by the differential expression ratio κ i . As equal volumes of tester and driver are mixed at the start of the hybridization process, the initial concentration of tester and driver species in the hybridization mixture is half that in each fraction. The second order rate constant (k f ) was taken to be 1 × 10 6 M −1 s −1 (Craig et al., 1971;Ermolaeva and Wagner, 1995). The renaturation rate constant for nonspecific hybridization depends on the percentage sequence identity. Vernier and co-workers (Vernier et al., 1996) report that the values of the renaturation rate constant decreases to 98%, 80%, and 77% of the rate for renaturation of perfectly complementary strands, respectively, for sequences sharing 94%, 83% and 77% sequence identity. These values are used for simulation of the association rates for non-specific hybridization. Anderson and Young (1985) report that the duplex dissociation rate increases by a factor of two for every 10% mismatch of the single-strand sequences. Based on this data and results on melting of chimeric duplexes reported elsewhere (Gotoh et al., 1995;Spiegelman et al., 1973), the values of the dissociation constant for 500 bp strands having partial homologies of 94% and 77% have been estimated as 1 ×10 −5 s −1 and 5 ×10 −3 s −1 , respectively, and used for simulating non-specific DNA hybridization.
The DNA hybridization process is never complete for finite hybridization times, and there is a finite probability of isolating a species that is not differentially expressed. The denominator of Equation 8 is the sum of the concentrations of duplexes of the type M A i M B i . Some of these duplexes correspond to cDNA present in a higher concentration in the tester and others represent cDNA present in equal or lower concentration in the tester than the driver. The latter category of genes can lead to false-positive results. To assess the probability of obtaining false-positive results, we divide the total number of genes n T into genes that are differentially expressed (n A , n I and n R , corresponding to differentially expressed abundant, intermediate and rare species, respectively), and genes that are not differentially expressed (n * A , n * I and n * R ). The duplexes with different adaptors but from genes that are not differentially expressed (M A i * M B j * ) give rise to false-positive results. The fraction of the tester homoduplex with different adaptors A and B on the two strands for a particular species s (Equation 8) can be rewritten as: The value baseline is the total concentration of tester homoduplex with different adaptors A and B on the two strands from genes that are not differentially regulated. These sequences lead to the formation of false-positives. The value of the baseline depends on the number of genes whose expression levels in the tester and driver are the same. A survey of the literature reveals a large variation in the number of differentially expressed genes among samples from different sources. Expression analysis of 8740 rat genes using high-density DNA array technology revealed that 873 genes exhibit statistically significant differences in gene expression levels during nephrogenesis (Stuart et al., 2001). An analysis of publicly available microarray data (http://ep.ebi.ac.uk/EP/EPCLUST/) shows that approximately 20% of genes selected for microarray construction to investigate differences in gene expression between normal and cancer cells were differentially expressed. In closely related cell types (B and T lymphocytes), 2% of the genes were found to be differentially expressed using a subtractive hybridization approach (Sagerstrom et al., 1997). In a survey of the whole genome using SAGE, approximately 1.5% of expressed genes were found to be unequally expressed in normal and cancer cells (Zhang et al., 1997). Like SAGE, SSH is an open system where all expressed transcripts are probed to isolate differentially expressed genes. We are interested in the study of cells in different metabolic states and spheroid formation in hepatocytes, i.e. probing differences in closely related cell types. Hence we assume that ∼1.5% of genes will be differentially expressed and estimate the number of genes that are not differentially expressed. We have assumed that 11 800 rare, 740 intermediate and 10 abundant genes are present in equal concentrations in the tester and driver samples, i.e. n R * = 11800, n I * = 740 and n A * = 10. These values are used to calculate the baseline concentration.
All symbolic calculations of partial derivatives were carried out using Mathematica  4.0 (Wolfram Research, Champaign, IL). All numerical calculations were carried out using Matlab  5.3.1 (The MathWorks, Natick, MA).

Results
Equation 3 is used to estimate the concentration of tester homoduplex with different adaptors on the two strands M A i M B i relative to the total concentration of false-positives, and to investigate the effect of a change in the reaction conditions. The baseline value varies with the conditions used for SSH. For easy comparison among different conditions, the concentrations M A i M B i are normalized to the baseline and denoted as i . A large value of i ( i 1) indicates a high likelihood of isolating a clone corresponding to species i . Conversely, a lower i represents a high likelihood that a large number of false-positive clones will be obtained before species i is isolated. The results shown are for strands of various initial concentrations corresponding to rare, intermediate-abundance, and abundant mRNA. Results are shown for levels of differential regulation corresponding to κ i = 1, 10, 100 and 1000. The initial concentration in the driver sample is calculated by dividing the initial concentration in the tester sample by κ i . As the number of sequences that are not differentially expressed does not change in the reverse subtraction process, the results for an abundant species with κ i = 1000 correspond to both an abundant sequence that is downregulated 1000-fold, and a rare sequence that is upregulated 1000-fold. Simulations are carried out for t hyb1 = 8h, t hyb2 = 12h, and E 1 = E 2 = 30 hybridization as a function of the relative differential expression (κ i ) for values of the reaction parameters recommended by the Clontech PCR-Select  protocol (E = 30, t hyb1 = 8 h, t hyb2 = 12h) (Clontech Manual, 1999). It is seen from the figure that the value of i for differentially expressed species is less than 1 (i.e. <0 in the log scale, as shown in Figure 2) for all abundance classes. In other words the probability of obtaining a false-positive clone is higher than that of obtaining a particular true positive. As is seen from the graph, the value of i , and hence the probability that a particular species i is identified by the SSH procedure, depends greatly on κ i and the initial concentration. For example, for abundant species that are not differentially expressed (κ i = 1), the value of i is much lower than that of rare species with κ i = 1. Thus, in the absence of non-specific hybridization, abundant species are efficiently eliminated by the SSH process under recommended process conditions. This leads to the conclusion that under these conditions, the bulk of the false-positive sequences resulting from the SSH process will consist of rare sequences that are not differentially expressed. However, for abundant sequences that are downregulated by a large extent (κ i = 1000), the value of i is close to one, and higher than the corresponding value for intermediate and rare sequences. This implies that the SSH process is biased towards sequences that are differentially expressed to a large extent. However, there is poor efficiency of identifying rare sequences that are downregulated even further. Even an on-off regulation of rare sequences (as approximated by the bar corresponding to rare sequences with κ i = 1000) does not lead to an improvement in the efficiency. For moderate differential expression levels (κ i = 10-100), the SSH process is most successful in identifying rare sequences that are upregulated 100-fold, or intermediate abundance sequences that are downregulated 100-fold.

Ideal hybridization
The effect of hybridization times and excess ratios on the relative concentrations of the M A i M B i duplex was evaluated by varying the value of one reaction parameter while keeping all the others constant. The normalized partial derivative of i with respect to the parameter being changed, Z parameter = (parameter) i ∂ i ∂(parameter) , describes how the probability of isolating a true differentially expressed transcript varies with a change in the parameter. A value of Z = +1 implies a 100% increase in the relative concentration of M A i M B i to baseline due to a 100% increase in the parameter value. Shown in Figure 3 are plots of such partial derivatives with respect to the excess ratios (E 1 and E 2 ) and reaction times (t hyb1 and t hyb2 ) for the first and second hybridization as a function of κ i . Figure 3a shows that, except for transcripts with a low (κ i < 20) level of differential expression, Z E1 is positive, and therefore increasing E 1 is beneficial as it results in an increase in the i corresponding to differentially expressed species and leads to fewer falsepositives.
The effect of changing E 2 on the relative M A i M B i concentration is shown in Figure 3b. For rare transcripts, Z E2 > 0 for all values of κ i , and the value of i increases with an increase in the excess ratio for hybridization 2. However, for abundant transcripts, the trend is opposite, i.e. i decreases as the relative concentration as E 2 is increased from a value of 30. For intermediate-abundance species, the effect of changing E 2 depends on the degree to which they are differentially expressed (κ i ). There is a beneficial effect for highly upregulated species (κ i > 30) but a negative effect on species with a lower differential expression ratio. The magnitude of this increase is not high (<0.5), showing that the number of false-positives is not very sensitive to changes in E 2 from the value of 30. Thus, doubling the excess ratio will increase the relative concentration of rare transcripts by a maximum of 50% from their original level. Increasing the ratio beyond this value is impractical in situations with a constraint on the amount of cDNA available for analysis. Figures 3c and 3d are plots of the normalized partial derivatives with respect to t hyb1 and t hyb2 . Increasing t hyb1 leads to an improvement in i for rare species and a decrease in i for differentially expressed abundant species. For intermediate abundance species with a low κ i (<50), the effect of increasing t hyb1 is to decrease i , but for highly upregulated intermediate abundance species, there is a small increase in i with increase in t hyb1 . A similar effect is seen from the plot of the normalized partial derivative with respect to t hyb2 (Figure 3d), except that the magnitude is much smaller, indicating that the relative concentration ( i ) is less sensitive to changes in t hyb2 from its value of 12h.
Significantly larger excess ratios are used in other subtractive hybridization based processes such as cDNA-representational difference analysis (cDNA-RDA; Hubank and Schatz, 1994) and normalized library construction (Ko, 1990;Patanjali, 1991;Sasaki et al., 1994). Simulation results for SSH with hybridization times and excess ratios similar to those used in these procedures are presented in Figure 4. The effect of significant decreases in hybridization time and excess ratios are also presented. The beneficial effect of increasing the excess ratios on i for differentially expressed species irrespective of abundance levels is clear from the increase observed, even at a low hybridization time of 15 min. On the other hand, increasing t hyb1 and t hyb2 alone does not yield a significant improvement in i . The graphs illustrate the beneficial effect of decreasing t hyb1 and t hyb2 for better identification of differentially expressed abundant species, and increasing t hyb1 and t hyb2 for better detection of differentially expressed rare species. Similar calculations for other combinations Model for suppression subtractive hybridization of reaction conditions can be easily carried out using this model.

Effect of non-specific hybridization
The effect of non-specific hybridization is explored using this model. The analysis presented for ideal hybridization can be repeated for conditions where the presence of partially complementary sequences leads to the formation of chimeric duplexes (heteroduplexes). These sequences may each belong to different abundance classes and have different κ values. Here we present a detailed analysis of one possible situation where both non-specifically interacting sequences M α and M β are not differentially expressed (κ α = κ α = 1) and hence contribute to the false-positives obtained in the SSH process. Thus, the objective of changing process conditions here is to minimize α and β , or at least to reduce those to values corresponding to the falsepositives for ideal hybridizations ( i for sequences with κ i = 1). In the following discussion, subscripts α and β are used to represent sequences with partial sequence similarity. Figure 5a shows the effect of the presence of an abundant species, M α , with 94% sequence homology to another abundant sequence, M β , on the false-positives that are obtained. The falsepositives that are obtained in the absence of nonspecific hybridization are also shown for comparison. As seen in Figure 5a, in the presence of nonspecific hybridization, the concentration of abundant sequences that are not differentially expressed (κ α = 1) is much higher than that of rare sequences with κ α = 1. This is the exact opposite of the case where there is no non-specific interaction, where the concentration of false-positive sequences that are abundant is much lower than that of rare sequences. It is also seen that in the presence of non-specific hybridization, α for an abundant sequence with κ α = 1 is more than 100 times greater than the i value for any rare species with κ i = 1 in the absence of non-specific interactions. The simulation results presented in Figure 5a show that the false-positive pool obtained will be dominated by such an abundant sequence with κ α = 1 and partial homology to another abundant sequence. It should be noted that although results shown in this figure are for strands corresponding to the sequence M α , both the sequences M α and M β that are partially complementary are affected the same way during the SSH process if they are both present in the same concentrations in the tester and driver. Hence, any analysis presented here for the sequence M α is equally applicable to M β and both sequences will occur to the same extent in the false-positive pool.
Non-specific hybridization results in falsepositive sequences and decreased probability of isolating genes that are truly differentially expressed. This masking effect of partially complementary, abundant sequences can be reduced by increasing the subtraction efficiency through addition of the specific sequence M α to the driver. The increase in the concentration of the driver strands for that sequence increases the rate of formation of tester-driver duplexes. The effect of such spiking is shown in Figure 5b. The figure depicts α values that are obtained when the driver is spiked with increasing amounts (no spiking, 10, 100 and 1000-fold spiking) of the sequence M α . It is seen that there is a beneficial effect of spiking M α . At a level of a 1000-fold spiking, α decreases to a level comparable to that in the absence of nonspecific interaction.
As indicated above, there is a symmetrical relationship between species M α and M β allowing us to interpret these results for the sequence M β as well. Thus, when M β is spiked, β will decrease. However, the effect of spiking one sequence (say M α ) on false-positives arising from another nonspecifically interacting sequence (M β ) may not always be positive. Such effects are simulated using the model and results are presented in Figure 5c, which shows the effects of spiking M β on α . The values of α for species of different abundance levels and κ α = 1 either remains the same or increases with increasing spiking of M β . Thus, the simulation results show that there is a detrimental effect on false-positives arising from M α when M β is spiked at increasing levels. With an increase in the amount of M β that is spiked, the concentration of M α falsepositives increases slightly.
The obvious solution to this problem is spiking both M α and M β in the driver. The effect of such a spiking is depicted in Figure 5d. It is seen that the value of α decreases when increasing amounts of M α and M β are spiked. The decrease is less than the decrease seen when only M α is spiked to the same extent. This will also be true for β , and the falsepositives arising from both M α and M β will be reduced by this spiking strategy. Although spiking both sequences clearly achieves a beneficial effect by decreasing the number of false-positives with increased spiking levels, the effect is attenuated compared to that observed in Figure 5b and any increase in concentration of spike must be more than 1000-fold to achieve a reduction in the number of false-positives to a level comparable to that in the absence of any non-specific interactions.

Discussion
The probability that a differentially expressed species s will be correctly identified as such by suppression subtractive hybridization depends on f s , the fraction of the M A i M B i duplexes corresponding to this species to all such duplexes present at the end of the second hybridization. We have derived an analytical expression to calculate this concentration. The concentration, and hence the probability of identification, depends on two categories of factors: system factors, such as abundance, degree of differential expression (κ s ) of the gene, the concentration and percentage similarity of non-specifically interacting sequences, and abundance and number of genes that are not differentially expressed in the two samples; and reaction conditions, such as the driver excess (E 1 and E 2 ) used in each hybridization, and the hybridization times t hyb1 and t hyb2 .
We have used previously reported results on the number of differentially expressed species to estimate the concentration of DNA that will lead to false-positives. The results for a different number of genes with unchanged expression level can be easily computed. The nature of the results presented here does not change appreciably if a different (lower) number of species is assumed to be present in equal concentrations in the tester and driver (results not shown). We have used reported values of the hybridization rate constants to carry out the simulations. Simulations using values that differ by one order of magnitude from this value show that the results are qualitatively identical to those presented here. We used the model to predict the effect of changes in the reaction conditions on the probability of identification of genes from different abundance classes that are differentially expressed at various levels. The model predicts that, for the process conditions typically used, the SSH method will lead to the identification of some differentially expressed genes, but will also yield a high number of false-positives. A number of studies have reported false-positive results as high as 90% (Nemeth et al., 2000a(Nemeth et al., , 2000b. The probability of identification is also a function of transcript length. Each cDNA is digested with RsaI, which results in multiple fragments of the same gene being present. As these fragments undergo independent hybridization reactions, the probability of identification of a particular gene is approximately proportional to the number of fragments, and hence to transcript length. Conversely, a long gene that is not differentially expressed will have a higher contribution to the number of falsepositives than is estimated by the model. A number of studies have reported that all the isolated sequences at the end of the SSH process are unique (Glienke et al., 2000;Grillari et al., 2000;Sandhu et al., 2000;Shen and Gudas, 2000;Wang et al., 1999). This study suggests that if there is a high efficiency of subtraction, redundant clones will be obtained, either in the form of multiple colonies having the same insert, or colonies with different fragments of the same gene.
From the analytical solution to the coupled set of differential equations, it is possible to calculate the normalized partial derivative of the relative concentration of the M A i M B i duplex with respect to process parameters. To increase the probability of isolating a differentially expressed gene i , the process parameter values should be optimized to increase i . The value of the partial derivative provides a quantitative estimate of the effect of a proposed change in one process parameter. It should be noted that the value of the partial derivative, by definition, is true only in a small interval near the values of the parameters at which it is estimated. Using the analytical expression presented here, such a calculation is easy to implement for values different from those presented here. In addition to determining the extent of the falsepositives that result from the SSH process, the nature of the false-positives can be explored using our model. In the absence of non-specific interactions, i for abundant species with κ i = 1 is less than the corresponding concentration for rare species. The basis of this counterintuitive observation lies in the slower kinetics of duplex formation for residual rare single-stranded species with κ i = 1 compared to the corresponding abundant sequences and does appear in the transient concentration profiles (data not shown).

Effect of non-specific hybridization
In the cDNA mixture used for the SSH process, sequences corresponding to proteins with conserved motifs exhibit partial complementarity to each other. Chimeric duplexes formed between strands having partial sequence homology have been observed at the end of the SSH process, in some cases at levels as high as 2% of all duplexes (Zhang et al., 2000). Such non-specifically interacting sequences include those that are differentially expressed, and those that are present at equal concentrations in the tester and driver. In our simulation we specifically address the issue of falsepositives caused by the latter type of sequences. Simulation results predict that if there exist two abundant sequences with at least 94% homology, the false-positive pool will be dominated by these sequences. Experimental results from an investigation (Korke et al., unpublished results) of the expressed species in a hybridoma cell culture reveal that the predominant false-positive obtained during a SSH belonged to a class of molecules called intracisternal A particles, reiterated murine retrovirus-like elements (Dupressoir et al., 1999) with high intrasequence similarity (Leib-Mosch et al., 1992;Rynditch et al., 1998). Microarray and Northern blot analysis reveals that these sequences are equally expressed in the hybridoma samples under consideration in the SSH study (Korke et al., unpublished results). The observation that a large fraction of the false-positive pool consists of IAP sequences supports the simulation results presented here.
The addition of specific sequences to the driver in order to reduce the concentration of a particular sequence in the final products has been reported in other subtractive hybridization approaches such as cDNA RDA (Hubank and Schatz, 1994). There are several possibilities in the choice of the sequence that is to be added or 'spiked' to the driver. Either, or both partially complementary sequences, or a consensus sequence, or a concatenated sequence may be used. We used the mathematical model to simulate the efficacy of these approaches in reducing the concentration of the target sequence in the final product mix. The effect of spiking any one of the two sequences has a detrimental effect on false-positives resulting from the sequence that is not spiked. However, the concentration of duplexes corresponding to the spiked sequence decreases appreciably. Spiking a consensus sequence is akin to spiking an interacting sequence, and hence the effect on both sequences will be detrimental. Spiking a sequence that is a concatenation of the two sequences will have an effect that depends on the stability of the duplex containing a large dangling end that will be formed. If this duplex is as stable as a perfectly complementary duplex, the effect will be beneficial. Otherwise, as is thermodynamically more likely, if the duplex thus formed has a significant melting rate, the effect will be analogous to adding a partially complementary sequence. The best result is obtained when both sequences are spiked. Higher spiking levels (>1000-fold) have to be used, but a reduction in the levels of both sequences to levels representative of false-positives in the absence of non-specific interactions can be achieved.
Simulation of non-specific interactions has been carried out assuming a homology of 94% between the two sequences. Simulations were also carried out assuming a homology of 77%, for which rate data was available. The rate of the forward reaction is one-hundredth that of perfectly complementary sequences, and the melting rate is high. The falsepositive rate obtained is the same as in the case of ideal hybridization (results not shown). Thus, non-specific hybridization between sequences with partial sequence homology of less than 80% does not affect the SSH performance.
In an actual experiment, there might be more than two sequences with a high homology. For a particular sequence α, the key step that determines α is the formation and slow dissociation of heteroduplexes formed from M A α (or M B α ) and a driver strand of the interacting sequence M D β , as is observed from an examination of the temporal kinetics during the hybridization processes (results not shown). The exact composition of the interacting sequence is not important, and therefore we contend that the simulations for a oneinteractor case may be taken as representative of a situation where multiple non-specifically interacting sequences are present. The effect of nonspecific interactions on species that are differentially expressed and/or under different process conditions can be easily simulated using the model presented here. It is seen that for extreme process conditions (t hyb = 48 h, E = 300), the presence of an abundant non-differentially expressed sequence with 94% sequence complementarity increases α values for differentially expressed rare and intermediate abundance species. However, the concentration of false-positives also increases (results not shown).
It is seen that there is a differential effect, both qualitative and quantitative, of changing process conditions on the probability of isolating differentially expressed transcripts depending on their abundance, degree of differential expression and the presence of non-specific hybridization. For some transcripts, i increases with a positive change in the process parameter and for some the relative concentration decreases. Thus, the model aids in probing the effect of a proposed change on the transcript class of interest. This mathematical model will serve as a tool to carry out virtual SSH experiments to determine the best conditions for the particular sample under consideration. The framework presented here can also be used for the analysis of the efficiency of other procedures for the isolation of differentially expressed genes.