Expression Profiling of All Protein-Coding Genes in Wild-Type and Three DNA Repair-Deficient Substrains of Escherichia coli K-12
C. Salmelin and J. Vilpo

Gene chips or cDNA arrays of the entire set of Escherichia coli (E. coli) K-12 genes were used to measure the expression, at the mRNA level, of all 4290 protein-coding genes in wild-type (WT) and three DNA repair-deficient derivative strains: (i) AB1157 (WT), (ii) LR39 (ada, ogt), (iii) MV1932 (alkA1, tag-1) and (iv) GM5555 (mutS). The aim was to investigate whether disruption of a single gene would result in significant deviation in the expression of other genes in these organisms. We describe here a simple approach for a stringent statistical evaluation of cDNA array data. This includes: (i) determination of intra- and inter-assay variation coefficients for different expression levels, (ii) rejection of biased duplicates, (iii) mathematical background determination, and (iv) comparison of expression levels of identical copies of a gene. The results demonstrated a highly significant correlation of gene expression when the mutants were individually compared with the wild type. Altogether, 81 deviations of the expression of 59 genes were noted, out of 12,870, when 3-fold or greater up- or down-regulation was used as a criterion of differential expression. In the light of current knowledge of E. coli biology, the differential expression did not follow any logical pattern. In fact, the deviations may simply represent inter-assay variation. The results obtained here with a simple model organism are different from those obtained with most mammalian knockouts: disruption of the function of a single gene does not, under good growth conditions, necessarily result in great changes in the expression of other genes.


Introduction
Gene disruption techniques, also known as gene knockout, have proved to be important tools in assessing the functions of genes in various forms of life, ranging from bacteria to man. Knockout mice, for instance, are invaluable models for the study of mutations similar to those found in human diseases. Germ line gene disruption in mammals has resulted in unexpected phenotypes. These are associated with a surprising functional redundancy of gene products as well as with an astronomical number of possible protein interactions inside the cell. Mammalian cells therefore remain very complicated systems in which to study the effect of single gene disruption on the expression and function of all other genes. The completion of the E. coli genome projects (Blattner et al., 1997) has permitted the development of new tools for genome-wide analysis in this model organism.
We have been interested in the biological effects of a therapeutic alkylating agent, chlorambucil. Wild-type (WT) and DNA repair-deficient E. coli strains have served as model organisms (Salmelin et al., 2000). The importance of DNA repair, shown by reversal of damage and attenuation of the toxicity of chlorambucil, was indicated by the susceptibility of cells lacking direct DNA repair or O6-methylguanine-DNA methyltransferase I and II (ada, ogt). Similarly, the protective role of base excision repair was substantiated by the even greater susceptibility to chlorambucil of cells lacking 3-methyladenine-DNA glycosylase I and II (alkA1, tag-1). Cells deficient in mismatch repair (mutS) appeared to be only slightly more sensitive than normal cells to chlorambucil. These results clearly demonstrated that, depending on the individual gene, gene knockout results in a specific functional disturbance of E. coli cells. The traditional interpretation of this kind of outcome is straightforward: gene disruption results in paralysis of the corresponding function. In the present paper we report the assessment of another possibility, namely, that single gene disruption might cause unexpected changes in the homeostasis of the expression of genes whose involvement is not anticipated. To this end we analyzed, at the mRNA level, the expression of all protein-coding genes of these four E. coli strains. Particular emphasis was placed on the statistical interpretation of the results.

Bacterial growth and isolation of total RNA
Samples for gene expression analyses were taken from exponentially growing cultures. It was important to assess global gene expression under conditions similar to those where the susceptibility to chlorambucil had been determined, i.e. when the cells had been permeabilized by using polymyxin B nonapeptide (PMBN), as described in detail elsewhere (Salmelin et al., 2000).
The cells were cultured in Luria-Bertani (LB) nutrient medium (Sambrook et al., 1989). A single colony of each E. coli strain was used to inoculate LB medium, which was cultured at 37°C with shaking overnight. Next, 10 ml of this culture was added to a 250 ml Erlenmeyer flask containing 40 ml fresh LB medium and incubated for 90 minutes at 37°C. After incubation, 10 mg PMBN/ml was added and the flasks were incubated for 2 hours at 37°C in a shaking incubator. This incubation time was chosen on the basis of the results of earlier experiments (Sambrook et al., 1989). After the second incubation the Erlenmeyer flasks were placed on ice and RNA isolation was started immediately.
Total RNA from logarithmic growth-phase cells was isolated according to the protocol recommended by Sigma-Genosys. In order to minimize errors originating from RNA preparation (Arfin et al., 2000), two samples from each culture were prepared simultaneously and finally pooled for the cDNA array hybridization. The detailed protocol is available on the Internet at the Sigma-Genosys homepage: http://www.genosys.com. To remove contaminating genomic DNA from purified RNA, the samples were treated with RNase-free DNase I. The RNA was then re-extracted with three acidic phenol extractions, followed by one phenol : chloroform : isoamyl alcohol (25 : 24 : 1) extraction and one chloroform : isoamyl alcohol (24 : 1) extraction. RNA was precipitated (as in the protocol used for RNA isolation) and centrifuged at 12 000 × g for 30 minutes. The pellet was washed with 70% ethanol, re-centrifuged and dissolved in diethyl pyrocarbonate (DEPC)-treated water. The dissolved RNA was quantified by absorbance at 260 nm and stored in DEPC-treated water at -20°C. High quality of the RNA was confirmed by agarose gel electrophoresis (1.2% agarose gel).

Hybridization and analysis of DNA arrays
Expression profiling was performed as described in detail elsewhere (Panorama 2 E. coli Gene Arrays, Protocol Booklet, available at http://www.genosys.com). In short, after RNA isolation the procedure consists of (i) generation of 33P-labeled cDNA from the RNA samples, (ii) hybridization of labeled cDNA to duplicate arrays (Panorama 2 E. coli Gene Arrays, Sigma-Genosys, used in this work) representing 4290 PCR-amplified open reading frames, (iii) autoradiography of the arrays, and (iv) analysis of the expression patterns.
The primary data consisted of duplicate pixel intensities for all 4290 ORFs. For intra- and inter-strain comparisons the data of each membrane were normalized by dividing all sampled intensities by the mean sampled intensity of all gene points (except the control points).
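The per-membrane normalization described above can be sketched as follows. This is a minimal illustration, not the software used in the study: the dictionary layout, the spot names and the function name `normalize_membrane` are assumptions made for the example, as is the choice to rescale the control spots along with the gene spots.

```python
from statistics import mean

def normalize_membrane(intensities, control_ids=frozenset()):
    """Divide every spot intensity by the mean intensity of the gene spots.

    intensities: dict mapping spot id -> raw pixel intensity (hypothetical layout).
    control_ids: ids of control spots; they are excluded from the mean.
    """
    gene_values = [v for k, v in intensities.items() if k not in control_ids]
    membrane_mean = mean(gene_values)
    # All spots are rescaled; only the gene spots define the scaling factor.
    return {k: v / membrane_mean for k, v in intensities.items()}

# Hypothetical raw intensities for one membrane.
raw = {"geneA": 200.0, "geneB": 100.0, "geneC": 300.0, "ctrl1": 5000.0}
normalized = normalize_membrane(raw, control_ids={"ctrl1"})
```

After this step the gene spots of every membrane have a mean of 1, so membranes with different overall signal strength can be compared directly.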

Whole genome perspective
The expression profiles of the four E. coli strains were very similar, as indicated by the between-strain correlation analysis illustrated in Figure 1. Considerably smaller variation was observed in a within-strain comparison when duplicates of individual gene points were compared. Although the intra-strain variability in gene expression as a whole was smaller than the variability between the strains (Figure 2), this may simply represent an inter-assay effect rather than an overall difference in gene expression between the strains (see also Statistical considerations, below). The inter-assay variation may also reflect differences between individual cDNA array membranes, in particular the variable background levels of different membranes (cf. Table 1 below). 'Between-slide' variation has been demonstrated by using comparative hybridization with fluorescent probes (Tseng et al., 2001).
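The kind of correlation analysis referred to here (Pearson product-moment correlation after logarithmic transformation) can be sketched as below. This is an illustrative outline only; the function name `pearson_log`, the base-10 logarithm and the input values are assumptions, and all inputs must be strictly positive for the logarithm to be defined.

```python
import math

def pearson_log(x, y):
    """Pearson r between log10-transformed paired expression values."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical percentage-expression values for four genes.
wt = [0.01, 0.1, 1.0, 10.0]
mutant = [0.02, 0.2, 2.0, 20.0]  # exactly 2-fold higher for every gene
r = pearson_log(wt, mutant)
```

In this toy example the correlation is perfect even though every gene differs 2-fold, which illustrates why a high genome-wide correlation has to be complemented by gene-by-gene fold comparisons.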
We found a total of 130 genes in the three mutant strains whose expression at the mRNA level differed 2-fold or more from that of the WT. Sixty-six genes differed 3-fold or more and these genes were selected for further scrutiny as described below. The 3-fold cut-off was chosen for two reasons: (i) we wanted to examine whether there are gross inter-strain differences between the mutants and the WT, and (ii) previous work relying on very similar cDNA array hybridization methodology concluded that a 2.5-fold expression difference indicates significantly different expression, with 99% confidence in the two tails of the data (Tao et al., 1999).
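A fold-difference screen of this kind can be sketched as follows. The signed convention (a negative value meaning down-regulation by that factor) is inferred from the notation used for fold values later in the paper and is an assumption here, as are the function names and the example data.

```python
def signed_fold(mutant, wt):
    """Signed fold difference: +x means x-fold up, -x means x-fold down."""
    ratio = mutant / wt
    return ratio if ratio >= 1.0 else -1.0 / ratio

def differential(expr_mutant, expr_wt, threshold=3.0):
    """Return genes whose absolute fold difference reaches the threshold."""
    hits = {}
    for gene, wt_value in expr_wt.items():
        fold = signed_fold(expr_mutant[gene], wt_value)
        if abs(fold) >= threshold:
            hits[gene] = fold
    return hits

# Hypothetical normalized expression values.
wt = {"g1": 1.0, "g2": 2.0, "g3": 0.9}
mut = {"g1": 3.5, "g2": 0.5, "g3": 1.0}
hits = differential(mut, wt)  # g1 is 3.5-fold up, g2 is 4-fold down
```

Lowering `threshold` to 2.0 would correspond to the less stringent 2-fold screen mentioned in the text.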

Statistical considerations
The raw data consisted of pixel intensities, in duplicate, corresponding to relative hybridization signals of mRNAs representing all open reading frames (ORFs) of E. coli. There were a total of 34 320 data points representing gene expression. Corresponding pixel intensities of background signals and known hybridization standards were also obtained. We made use of these data to calculate the reliability of low values approaching background, and analytical precision at different levels of gene expression. This information was used to re-evaluate the reliability of the basic procedure. The primary selection concerned all genes that showed 3-fold or greater differences of expression when compared with the wild-type. Originally, 66 such genes were found.
Figure 1. Gene expression in three DNA repair-deficient E. coli strains compared with WT. Least squares regression was computed according to the Pearson product-moment correlation method after logarithmic transformation of the data representing percentage expression of each individual gene. Confidence intervals were computed to correspond with 95% limits. The percentage transformation was carried out with normalized data.
Figure 2. Correlations of duplicate analyses of the expression of all ORFs in WT and three DNA repair-deficient E. coli strains. Least squares regression was computed according to the Pearson product-moment correlation method after logarithmic transformation of the data representing percentage expression of each individual gene. Confidence intervals were computed to correspond with 95% limits (obscured by data points). The percentage transformation was carried out with normalized data.
The background values for each strain were determined from 22 dedicated replicate array points. The pixel intensities of background values were transformed to percentages of whole genome expression. These figures and their statistical treatment were used to determine the sensitivity of the assay (Table 1). We chose an arbitrary detection limit such that 99.7% of all possible background values remained below this level. According to this sensitivity rule, the expression of eight out of 66 selected genes fell below the sensitivity level. Fold expressions of these genes were changed accordingly and two of these eight genes did not thereafter satisfy the '3-fold' rule. These were b4273 (yi22_6) and gatA (see Table 2B).
The other statistical concern was the precision of the method. This information is necessary for validation of individual data points. The practical question was: how large is the maximal acceptable bias between duplicates? Duplicate assays provide a very convenient tool to determine intra-assay precision. To this end, we used the formula of Reed and Henry (1974; equation 1), where SD = standard deviation, d = difference between duplicates and N = total number of determinations. We determined the SD values for the three different expression ranges of all four strains examined. This allowed us to determine the coefficients of variation (CVs) for the different expression levels. Very similar intra-assay variations were observed (Table 3). As shown below, this information was applied to the acceptance or rejection of values represented by biased duplicates. Two recent analyses of global gene expression in E. coli did not take this opportunity into account (Tao et al., 1999; Arfin et al., 2000). Unfortunately, most published investigations do not give pertinent analytical variations. However, as shown here, random errors occur and must be corrected or the data eliminated. We arbitrarily chose to accept the duplicates if they were within ±3 SDs of the average of the two values. This decision meant that all values within a 99.7% confidence interval would be accepted for further analysis. The procedure resulted in rejection of four of the selected 66 genes, as indicated in Table 2: yhiE in AB1157 and in LR39, cysB in LR39, pnhA in GM5555, and b2640 in GM5555. This kind of validation process is possible only if duplicate cDNA arrays are used, or corresponding information of assay precision is otherwise available.
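The two quality-control rules of this section, a background-based detection limit and the acceptance of duplicates within 3 SDs, could be sketched as below. Several points are assumptions rather than details taken from the paper: the detection limit is computed from a normal distribution fitted to the background points, the duplicate-based SD uses the common form sqrt(sum(d^2) / 2n) with n counted as the number of duplicate pairs, and all names and numbers are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def detection_limit(background, coverage=0.997):
    """Level below which `coverage` of background values fall (normality assumed)."""
    return NormalDist(mean(background), stdev(background)).inv_cdf(coverage)

def duplicate_sd(pairs):
    """Intra-assay SD from duplicates: sqrt(sum(d^2) / (2 * n)), n = number of pairs."""
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs)))

def cv_percent(sd, level):
    """Coefficient of variation (%) at a given mean expression level."""
    return 100.0 * sd / level

def accept_pair(pair, sd, k=3.0):
    """Accept a duplicate pair if both values lie within k SDs of their mean."""
    a, b = pair
    # Each value is |a - b| / 2 away from the pair mean.
    return abs(a - b) / 2.0 <= k * sd

# Hypothetical background points (percentage of whole-genome expression).
background = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013]
limit = detection_limit(background)

# Hypothetical duplicates from one expression range.
pairs = [(1.0, 1.2), (2.0, 1.8), (1.5, 1.5), (0.9, 1.1)]
sd = duplicate_sd(pairs)
ok = accept_pair((1.0, 1.1), sd)      # small bias between duplicates
biased = accept_pair((1.0, 3.0), sd)  # large bias between duplicates
```

In practice the SD would be estimated separately for each expression range, as described in the text, and the range-specific SD would then be used when judging a given pair.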
We also estimated analytical precision on the basis of comparison of gene expression in the wild type and each of the three mutant strains. In this case, the mean value of each duplicate determination was used, as illustrated in Figure 1. The SD values were calculated according to Reed and Henry (1974; equation 1). This allowed us to determine CVs for the different expression levels (Table 4). It should be emphasized that this variation corresponds to the total variation (intra-assay, inter-assay and inter-strain). The use of duplicate means, on the other hand, results in a small underestimation of this total variation.
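Equation 1 itself is not reproduced in this text. For orientation, the duplicate-based estimate commonly used in clinical chemistry is given below, using the symbols defined above. Two things are assumptions here: that this is the form meant by the citation, and that N counts duplicate pairs. The last line, also an assumption, uses a simple additive model with independent variance components to indicate why duplicate means slightly underestimate the total variation.

```latex
\begin{align}
  \mathrm{SD} &= \sqrt{\frac{\sum d^{2}}{2N}} \\
  \mathrm{CV}\,(\%) &= 100 \times \frac{\mathrm{SD}}{\bar{x}} \\
  \operatorname{Var}(\text{mean of two duplicates})
    &= \sigma^{2}_{\text{other}} + \frac{\sigma^{2}_{\text{intra}}}{2}
    \;<\; \sigma^{2}_{\text{other}} + \sigma^{2}_{\text{intra}}
\end{align}
```

Here $\bar{x}$ is the mean expression level of the range considered, $\sigma^{2}_{\text{intra}}$ is the intra-assay variance and $\sigma^{2}_{\text{other}}$ collects the inter-assay and inter-strain components.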
Another way to look at the global genomic data, not previously used, is to compare the results concerning mRNAs transcribed from identical copies of one gene. The cDNA of a gene should hybridize to all copies of the gene in a similar way. Examples are as follows: 6 identical copies of the insA-gene  1.6, 1.4, 1.4, 1.1. The current assay system should not make any distinction between these similar mRNA species. This indicates good assay precision, but deviations were also found. For instance, a similar comparison with 6 identical copies of the yi22 gene yielded less precise results in the strain LR39: (LR39/WT): -1.8, -2.9, -1.9, -1.8, -3.5, -3.5 compared to (MV1932/WT): 2.1, 2.2, 1.6, 1.9, 2.1, 1.7 and (GM5555/WT): -1.3, -1.3, -1.6, -1.2, -1.6, -1.7. Hence, the expression analysis of yi22 remains inconclusive in this case. Some examples of these genes and their expression are shown in Table 5. This approach led to rejection of two of the 66 selected genes, namely yi22_5 and b4273 (yi22_6). The gene b4273 (yi22_6) was thus rejected by two different criteria (see above). We recommend the application of this quality control approach in all cases where more than one copy of the pertinent gene is active.
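One simple way to automate such a comparison of identical gene copies is sketched below, using the yi22 fold values quoted in the text. The rule itself, that all copies must fall on the same side of the 3-fold cut-off, is an assumption chosen for illustration; the paper does not state a numerical criterion. The function name is hypothetical.

```python
def copies_agree(folds, threshold=3.0):
    """True if all copies fall on the same side of the fold threshold."""
    calls = {abs(f) >= threshold for f in folds}
    return len(calls) == 1

# Fold values for the six yi22 copies as quoted in the text
# (the fourth LR39 value is read as -1.8).
yi22 = {
    "LR39/WT": [-1.8, -2.9, -1.9, -1.8, -3.5, -3.5],
    "MV1932/WT": [2.1, 2.2, 1.6, 1.9, 2.1, 1.7],
    "GM5555/WT": [-1.3, -1.3, -1.6, -1.2, -1.6, -1.7],
}
agreement = {strain: copies_agree(values) for strain, values in yi22.items()}
```

With these numbers only the LR39 comparison is flagged: two copies pass the 3-fold cut-off while the other four do not, which is the kind of inconsistency the text describes.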

Functional groups and protein functions of differentially expressed genes
The final number of selected genes having 3-fold or greater expression differences in mutant strains compared with WT was 59/(3 × 4290). In other words, only 0.46% of the genes in the three mutants showed a 3-fold or greater difference compared with the WT. The functional groups, protein products, and fold differences of these genes are given in Table 4. Among these 59 genes, the expression of 40 was changed in one strain, that of 16 in two strains and only three were different from the wild type in all three mutant strains. The differentially expressed genes were evenly distributed among the 19 functional groups (Table 6). Furthermore, the three mutants had similar numbers of differentially expressed genes. MV1932 had the lowest proportion (24%) of changes in gene expression, LR39 had the highest (43%), and GM5555 was in between (31%), when the wild type was used as the expression reference. Of the 81 deviations of the 59 genes, 38 were up-regulated (MV1932 13%, LR39 34% and GM5555 53%) and 43 down-regulated (MV1932 37%, LR39 51% and GM5555 12%).
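The counts quoted in this paragraph can be cross-checked with a few lines of arithmetic. The block below only recombines numbers given in the text; the final percentage (deviations per gene-strain comparison) is a derived figure that is not stated in the paper.

```latex
\begin{align}
  \frac{59}{3 \times 4290} &= \frac{59}{12\,870} \approx 0.0046 = 0.46\% \\
  40 + 16 + 3 &= 59 \quad \text{(genes)} \\
  40 \times 1 + 16 \times 2 + 3 \times 3 &= 40 + 32 + 9 = 81 \quad \text{(deviations)} \\
  38 + 43 &= 81 \quad \text{(up-regulated + down-regulated)} \\
  \frac{81}{12\,870} &\approx 0.0063 = 0.63\%
\end{align}
```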
Are the differences between the strains real?
The statistical approach revealed several significant gene expression differences between the four E. coli strains analyzed. However, none of these indicated a biological compensation of the primary gene defect. The other approach to the question of expression differences between the strains was to investigate the functions of the differentially expressed genes and to attempt to link this information to the possible consequences of the original disrupted gene function. In particular, differentially expressed operons would be valuable in this regard. It was not possible, however, to link any of the currently known functions, whether up- or down-regulated, to the conceivable consequences of the original disruption of the DNA repair genes in these three mutant E. coli strains. Some similarities in the mutants were noted, such as relatively low expression of the cold-shock proteins CspA and CspB in MV1932 and LR39. Furthermore, b1375 (ynaE), b1544 (ydfK) and rhsB showed low expression in all three mutants, but lack of information on the cellular functions of these proteins does not allow further interpretation of this observation. We also cannot exclude the possibility that the genes showing 3-fold or greater differential expression in different mutant strains have functions, in addition to the currently known ones, which could compensate for the functions of the deleted genes. It remains to be studied whether better compensation mechanisms exist in other knockout strains or under more stressful conditions.
In conclusion, in spite of individual deviations from the common expression patterns of many genes in these four E. coli strains, no systematic patterns were revealed at the 3-fold sensitivity level used in this investigation. We emphasize the importance of careful assessment of the precision of the method as well as individual scrutiny of every gene accepted in the final expression analysis.
cDNA arrays with at least two replicates for each gene provide a very versatile tool for determining the pertinent assay precision. The current results demonstrated good inter-assay precision, as evaluated by using different E. coli substrains.