A Comparative Study of the Impact of G-Stack Probes on Various Affymetrix GeneChips of Mammalia

We have previously discovered that probes containing runs of four or more contiguous guanines are not reliable for measuring gene expression in the Human HG_U133A Affymetrix GeneChip data. These probes are not correlated with other members of their probe set, but they are correlated with each other. We now extend our analysis to different 3′ GeneChip designs of mouse, rat, and human. We find that, in all these chip designs, the G-stack probes (probes with a run of exactly four consecutive guanines) are correlated highly with each other, indicating that such probes are not reliable measures of gene expression in mammalian studies. Furthermore, there is no specific position of G-stack where the correlation is highest in all the chips. We also find that the latest designs of rat and mouse chips have significantly fewer G-stack probes compared to their predecessors, whereas there has not been a similar reduction in G-stack density across the changes in human chips. Moreover, we find significant changes in RMA values (after removing G-stack probes) as the number of G-stack probes increases.


Introduction
A sequence of nucleotides having frequent occurrences of runs of guanines is capable of forming unusual four-stranded structures called G-quadruplex structures. These are a result of the Hoogesteen hydrogen bond that binds two guanines at close to 90 degrees. These structures not only form in a single sequence of nucleotides but two or four parallel sequences can also collectively form a G-quadruplex structure. We are investigating the effect of these unusual structures on gene expression measurements using microarrays.
Microarray technology is an effective way of gene expression profiling. Affymetrix GeneChips are the most popular type of array for many model organisms. GeneChip arrays are composed of 25-base long sequences that are known as probes. These probes are arranged in the form of pairs; each pair consists of a perfect match (PM) probe and a mismatch (MM) probe. The PM probe contains the same sequence of bases as appears in the gene; whilst MM is identical to PM except that the central base (13th position base) is the complement of that in the PM probe. Each probe belongs to a particular probe set and similarly each probe set represents a particular gene; however, some genes are represented by more than one probe set.
As a probe set corresponds to a particular gene, it is therefore expected that all the probes of a probe set should be correlated with each other if the particular gene that is represented by that probe set is expressed. However, [1] previously discovered that probes containing G-stacks behaved abnormally with respect to other probes in their probe sets. In our previous work [2], we confirmed the findings of [1], but we also went further in discovering that the G-stack probes are highly correlated with each other. This indicates that they cannot be measuring gene expression but instead suggests a biophysical process occurring on the surface of GeneChips, which we associate with the formation of G-quadruplexes. The probes on a GeneChip are grown through photolithography and this results in many single-stranded DNA sequences being held in close proximity [3]. Probes are readily able to physically touch their neighbouring probes, each of which shares the same sequence. It is expected that if these closely placed parallel probes contain runs of guanines then they may form Gquadruplex structures. In [2], we have also shown that the value of the correlation coefficient changes according to the location of G-stack within the probes (using a popular human chip, the HG U133A array).
We are now investigating different GeneChip designs for two issues.
(1) The effect of the position of guanine run within the G-stack probes with an expectation that G-stack probes are highly correlated with each other. We provide a detailed discussion on this topic in Section 3.1.
(2) The position of the G-stack with the highest correlation coefficient value with an expected result that this will be position 1. Position 1 is at the free end of the probe so it can more readily come into contact with its neighbouring probes (discussed in Section 3.2).
For this study we have selected chip designs that are used to study the transcriptome of the mammalian family. We have generated contour plots to show overviews of the entire correlation surface for each of these chip designs.

Materials and Methods
The GeneChip data consist of CEL files that report the average intensity of each probe of a microarray. These fluorescent intensities are read through the Affymetrix scanners after the target sequences are hybridised to a microarray. The data from many tens of thousands of GeneChip arrays are freely available in public domains in the form of CEL files. We have downloaded CEL files from the NCBI GEO (Gene Expression Omnibus) repository [4].
We have focused on the mammalian family and have selected GeneChip data for Homo Sapiens (Human), Mus Musculus (Mouse), and Rattus Norvegicus (Rat). For each organism, three or more different chip designs have been used. We have used data from 352 randomly chosen CEL files for most of the chip designs except for a mouse chip design MG U74Bv2 (280 CEL files) and two of the human chips, HG U95D (87 CEL files), and HG U95E (86 CEL files).
We have adopted a pipeline for which a number of inhouse informatics tools have been developed. The pipeline performs the following tasks.

To Generate Contour Plots
(1) We selected probes having exactly one G-stack of length four from the probe sequence (.tab) file of the particular chip design. The.tab file, which contains the probe annotation which includes probe set ID, x and y coordinates and probe sequence with some other information is available at the Affymetrix website [5].
(2) We separated out the filtered list of G-stack probes into groups according to the position of the guaninerun (G-stack) within the G-stack probes. The possible position of a G-stack having exactly four guanines within a probe could be P = 1, 2, 3, ..., 22. In this way, we have generated 22 groups of G-stack probes. For instance, group 1 represents to all the G-stack probes in which G-stack is at position one.
(3) Rather than using the observed intensities, we used the normalised CEL files.
(4) We produced lookup tables of the x and y coordinates for each of the 22 groups.
(5) As we have generated 22 groups of G-stack probes, a 22 by 22 matrix (M) is generated in which each element represents the average correlation coefficient of two groups of probes; probes that are members of one specified group with probes that are members of another specified group. For instance, element M [5,12] of the matrix represents the average correlation between G-stack probes in groups 5 and 12.
(6) As a final step, the matrix M is used to generate a contour plot of the correlation surface.

To Analyse RMA Values with the G-Stack Probes Included and Excluded
(1) In R with Bioconductor, we used RMA [6] to obtain a set of values for each probe set in each of 352 CEL file using the standard CDF file of the specific chip. This set of RMA values reflects to the values of the probe sets with the G-stack probes included.
(2) We then masked all the G-stack probes using the code supplied by NASC [7] in order to generate a new CDF file without G-stack probes.
(3) We again used RMA with the new CDF file to obtain another set of values of the probe sets with the Gstack probes masked.
(4) These two sets of RMA values were used to analyse the effect of removing G-stack probes on RMA values (discussed in Section 3.3).

Result and Discussion
A list of chip designs involved in this study is shown in Table 1. The table also shows the chip size, the number of annotated probes, the number of G-stack probes, and the number of affected probe sets in each chip design. The lists of G-stack probes and the lists of affected probe sets (along with the number of G-stack probes in those probe sets) for the arrays analysed are available at http:// bioinformatics.essex.ac.uk/users/fnmemo/G-Tract.html.

Effect of the Position of G-Stack within the Probes.
The contour plot for human chip HG U133A, Figure 1, is almost identical to our previous work, with the discrepancy arising because of the different datasets used in the two studies. The density of G-stack probes differs according to the position of G-stack within the probes (see Table 2). The correlation coefficients of G-stack probes in all the human chips are quite high with the most marked correlation values at their diagonals. Human arrays show a fairly constant fraction of Gstack probes across different designs. For instance, for  In contrast to the human results, the mouse and rat chips show large changes in G-stack probe density across different designs, with significantly smaller fractions being found on the latest designs. Of the four mouse chips used in this study, the older designs MG U74Av2 and MG U74Bv2 have over an order of magnitude more G-stack probes than do the newer chip designs, MOE430A and MOE430B. We found that 0.15% (372/249,958) and 0.1% (252/248,704) of the annotated probes contain a G-stack in MOE430A and MOE430B, respectively. Whereas the percentages of annotated G-stack probes in MG U74Av2 and MG U74Bv2 are 3.72% (7,360/197,993) and 3.6% (7,006/197,131), respectively. Moreover, both the chips, MOE430A and MOE430B, also have an absence of G-stack probes in their middle (see Table 2) which is reflected in the contour plot of chip MOE430B in Figure 2. As with the human data, the locations for which there are probes indicate that G-stacks probes are highly correlated on mouse designs. Furthermore, the correlation is also highest when comparing two probes that have G-stacks at the same location within the probe. We also used three chip designs for rat (Rattus Norvegicus). The RG U34A chip is the oldest design and we find Table 2: In G-stack probes, the effect of the position of G-stack on the average correlation coefficient value, (n is the number of affected probes and r is the average correlation between these n probes).   that it has over an order of magnitude more G-stack probes than do the new chip designs, Rat230 2 and RAE230A. We found 0.05% (81/175,477) of the annotated probes contain a G-stack in chip design RAE230A. Similarly chip design Rat230 2 has 0.06% (208/342,410) of the annotated probes contain a G-stack. Whereas, the old Rat chip RG U34A contains 2.6% G-stack probes (3,691/140,317).
Due to the small number of G-stack probes in the new chips of rat and mouse, we expect that the gene expression measurement in these new chip designs will be less affected by G-stack probes.

Position of the G-Stack with the Highest Correlation
Coefficient Value. We have also checked whether there is any specific position of G-stack where the correlation is always most marked. We were expecting that the correlation could be highest at the 5 end of the probes, as is the case for HG-U133A in our previous work [2]. The 5 end of the probe is free and so it has a greater tendency to come into physical contact with adjacent probes. However, we find that there is no specific position of G-stack where correlation coefficient is most marked in all the chips. Furthermore, the diagonal values in the contour plot are almost always showing the highest correlation. Table 2 provides the details of diagonal values of matrix M (M is explained in Section 2.1) for all the chips analysed which is graphically illustrated in Figure 3.

Effect of Removing G-Stack Probes on RMA Values.
To examine the effect of removing G-stack probes, we used RMA to obtain values for each probe set in each of 352 HG U133 Plus 2 CEL files. We then obtained revised RMA values with the G-stack probes masked. In Table 3, we report the results in terms of a summary of the values obtained for d, which we define as the revised RMA value minus the original value. In the table there are separate columns to summarise the effects on probe sets having varying numbers of G-stack probes. The percentages reported are based on the number of probe sets shown in the second row of the table and are averaged over the 352 CEL files.
As one would hope, there are no major changes on probe sets that have no G-stack probes. As the number of G-stack probes increases, so the changes become potentially much more serious, and the effects are more variable. On average G-stack probes have higher values than other probes, so that, the majority of RMA values are reduced by the removal of the G-stack probes. However, there are also many instances where the RMA value is appreciably increased by removal of G-stack probes.

Conclusion
G-stack probes behave as outliers within their probe sets because they are usually poorly correlated with other members of their probe sets while they are highly correlated with each other. We have illustrated that this is true in various chip designs of different mammalia. Therefore, as we suggested before in our previous work [2], these probes should not be included within a calculation of the gene expression measurement. Due to the previous work, we were expecting that the correlation among the G-stack probes is at its highest when the runs of guanines start from position 1 (5 end) within the probes. It was our expectation that as the 5 end is the free/moving end, so there are more changes for the G-stack probes to attach with the neighbouring probe's G-stack at this end. Although it is true for some chip designs, for instance HG U133A, MOE430A and MOE430B, it is not true for all of them. Thus, in general, we did not find a common position of G-stack where the correlation coefficient value is high in all the chips. We also found that a much smaller fraction of G-stack probes are present in the new chip designs of rat and mouse compared to the original designs. This suggests that the change in design protocol led to a significant removal of probes which we now believe to be misinformative. It is surprising that such a change in design did not lead to a significant reduction in the amount of Gstack probes in human 3 array.
Furthermore, we find that the changes in RMA values (after removing the G-stack probes) become more serious as the number of G-stack probes increases.