The Dynamics of Germinal Centre Selection as Measured by Graph-Theoretical Analysis of Mutational Lineage Trees

We have developed a rigorous graph-theoretical algorithm for quantifying the shape properties of mutational lineage trees. We show that information about the dynamics of hypermutation and antigen-driven clonal selection during the humoral immune response is contained in the shape of mutational lineage trees deduced from the responding clones. Age and tissue related differences in the selection process can be studied using this method. Thus, tree shape analysis can be used as a means of elucidating humoral immune response dynamics in various situations.


INTRODUCTION
Memory B lymphocyte generation involves affinity maturation of the cells' antigen receptors, based on somatic hypermutation of receptor genes and antigendriven selection of the resulting mutants (Kelsoe, 1996;Wabl et al., 1999;Neuberger et al., 2000;Diaz and Casali, 2002). Hypermutation of immunoglobulin variable region genes is several orders of magnitude faster than normal somatic mutation, and there is evidence that hypermutation is generated by a different mechanism than that of normal somatic mutation (Winter and Gearhart, 1998;Neuberger et al., 1998;Cowell and Kepler, 2000). The exact mechanism of somatic hypermutation is yet unknown, although it has been shown to depend on transcription, activation-induced cytidine deaminase (AID) and DNA mismatch repair mechanisms. It is thought that the mechanism is related to that of class switch recombination (Honjo, 2002). Many questions are still open, such as how somatic hypermutation is triggered and regulated; whether immune complexes play a role (Song et al., 1998; and how the processes of hypermutation and selection interact to shape the memory B cell repertoire. Theoretical approaches utilized so far in the study of affinity maturation include the analysis of the frequencies of specific types of mutations (Dunn-Walters et al., 1998;Dorner et al., 1998;Spencer et al., 1999;Oprea and Kepler, 1999;Kim et al., 1999;Foster et al., 1999;Monson et al., 2000;Michael et al., 2002), and mathematical models exploring the dynamical interactions between somatic hypermutation and clonal selection (Sulzer et al., 1993;Kepler and Perelson, 1993;Oprea and Perelson, 1997;Shlomchik et al., 1998;Shannon and Mehr, 1999;Kesmir and de Boer, 1999). In this study we present a new approach-the analysis of the shapes of mutational lineage trees.
The generation of "lineage trees" or "dendrograms" to visualize the lineage relationships of B cell mutants in the GCs has been used in the past to confirm the role of the GC as the location of somatic hypermutation (Kocks and Rajewsky, 1988;Manser, 1989;Jacob et al., 1991), to identify lineage relationships between cells from independent GCs (Vora et al., 1999) or different tissues (Dunn-Walters et al., 1997a,b) and from additional processes of diversification such as gene conversion in the rabbit (Seghal et al., 1998;Schiaffella et al., 1999;Seghal et al., 2000;Seghal et al., 2002). The experimentally generated lineage trees reflect multiple rounds of mutations for each germline V gene that participated in the primary response. We believe that much information about the dynamics of antigen-driven clonal selection during the immune response is contained in the shape of lineage trees deduced from the final responding clones (Shannon and Mehr, 1999). For example, trees generated from clones during the peak of the primary response are much more "bushy" (Jacob and Kelsoe, 1992), but trees become less "bushy" as the response progresses (Jacob et al., 1993). The "pruned" shape of these trees has been referred to as evidence for the destructive character of somatic hypermutation. Other examples of lineage trees drawn to illustrate various aspects of the germinal centre reaction, or differences in this reaction under varying circumstances, abound in the literature. So far, however, lineage tree classification has been based only on a qualitative, intuitive assessment of the most obvious shape characteristics. Hence we set out to explore whether the information embedded in the mathematical shape characteristics of lineage trees can in any way be quantifiable, and whether it can be shown to correlate with the dynamics of the underlying immune response.
The objective of the present study was to develop a rigorous computer-aided algorithm for extracting the information contained in lineage trees, using the tools of mathematical graph theory. The algorithm we developed is composed of a module that characterizes trees according to their various graph-theoretical measures, and another module for finding correlations between these measures and the dynamical parameters of the GC response that generated the trees. Note that, for the purpose of our analysis, we are not interested in the properties of the individual cells or clones represented by the lineage tree, but rather in the overall characteristics of the lineage tree as a graphical entity. We demonstrate in the following that the information extracted using our algorithm is indeed valuable in revealing the dynamics of hypermutation and antigen-driven selection in germinal centres.

Tree Similarity and Size Scaling
Measurement of published lineage trees reveals several interesting details about our method, even though published data are too scarce for statistical analyses (only 1-2 trees are usually published as an illustration). First, when two trees develop from the same germline gene under similar conditions (two different GCs in the same response (Jacob et al., 1991), the trees are indeed similar in all aspects measured (Fig. 1a,b). While the profiles of the two trees are similar, it is obvious that most properties vary with tree size, e.g. tree II is slightly larger in most measured properties than tree I. In order to properly compare trees, we must distinguish between two types of tree properties: those that are independent of tree size and those that correlate with tree size. Examples of size-independent properties are: root degree, maximum or average outgoing degree and maximum or average distance between a leaf to the nearest split node. Examples of size-dependent properties are: the number of internal or pass-through nodes, the maximum or average path length (from root to leaf), trunk length, etc. We set out to examine whether scaling these properties by tree size gives a better measure of tree similarity or difference.
As previously mentioned, there are two different measures of tree size that could be used to scale the size-dependent tree properties. The total number of nodes seems to be the most natural measure. The number of internal nodes, or the number of pass-through nodes (which are a subset of all internal nodes), correlates well with the total number of nodes. Scaling by the number of leaves is more problematic, as it is highly sensitive to the sampling process, that is, to the number of cells from any given clone that were found in the experiment. It is also highly sensitive to the particular germline gene involved, as different germline genes differ in their potential for improvement by mutation (Shannon and Mehr, 1999). Additionally, as the response progresses, the number of nodes per leaf may grow, as the tree gets longer by the addition of mutations, and more "pruned" through the action of selection (see next section). Hence scaling should be done with care. However, when two trees are generated from the same germline gene in the same response, as in the case shown above, we find that they are very similar in all their scaled size-dependent properties, whether scaling is done by total number of nodes or by number of leaves (Fig. 1c,d). Similar results were obtained for the two trees published in (Jacob and Kelsoe, 1992).

Trees Grow and are "Pruned" as the Response Progresses
When trees are taken from a response to the same antigen, but in different times during the response (Jacob et al., 1993), the trees seem to gradually change towards a longer, more pruned, shape. There is a consistent change in several tree parameters (Fig. 2). For example, the number of leaves not only does not increase, but actually decreases with time. This is probably due to the effect of selection that "prunes" branches corresponding to useless or lower-affinity mutants. Two measures of tree "bushiness", which is expected to decrease with time as a result of selection, also decrease-the maximum and average (excluding pass-through nodes) outgoing degree of a node.
On the other hand, trees from the primary response are very similar in structure to those from the secondary response, at least in the one published example we analysed (Vora et al., 1999). Our measurements show that the two trees are similar in every aspect (Fig. 3).

Our Analysis can Distinguish between Trees from Different Sources
Trees from GCs from spleen and Peyer's patches of young and old human patients (Banerjee et al., 2002) were analysed by our algorithm. Data clearly shows that the trees from the spleen show signs of having been subject to stronger selective forces. Both the maximum and average outgoing degree were smaller in the spleen than in the Peyer's patches, indicating a more bushy, and less selected, response in the Peyer's patch. Similarly, both the maximum and average distance from last split node to leaf were increased in the spleen compared to the Peyer's patches (Fig. 4). This algorithm also showed some age-related differences in selection that concurred FIGURE 1 Similarity of two trees obtained from two different GCs in the same response. (a) Trees from (Jacob et al., 1991), drawn such that each mutation is shown as a separate node. (b) A comparison of the "profiles" (selected graphical properties) of the trees. The trees are very similar in all their scaled size-dependent properties, whether scaling is done by total number of nodes (c) or by number of leaves (d).

Analysis of Simulated Trees-"Tree Generator"
The analysis of published trees indicated that graphical tree parameters may indeed correlate with the biological parameters of the germinal centre response. However, there is not enough published data for conclusive analysis. In order to extract possible correlations between the graphical parameters describing a tree and the biological parameters describing the corresponding affinity maturation process, we would have to statistically analyse a significant number of trees with a priori known biological parameters. There are only a few tens of trees obtained experimentally and published, which are not sufficient for this purpose, and even if their numbers were sufficient, not all the biological parameters are known for experimental trees. Hence we decided to define and implement a simple simulation of the humoral immune response, which will allow us to control mutation and selection parameters, and produce lineage trees.

Simulation Parameters and their Effect on Tree Shape
This section summarizes the "biological" parameters controlling tree generation by our simulation. Varying these parameters enables us to produce trees corresponding to different values and then analyse the subsequent change in tree shape. We chose parameter values which model the affinity maturation process in the most realistic way, based on experimental data, while keeping the model as simple as possible. Simulation parameters are given in Table I.
The mutation mode parameter denotes how the simulation interprets the mutation rate parameter. Its values are Bit (the default) and Div. In Bit mode, the mutation rate is the probability of a single bit in the receptor string to mutate. In Div mode, the mutation rate is the fixed percentage of bits that mutate per division.
The selection mode parameter defines the method of selection, and how the simulation uses the selection threshold parameter, which denotes the minimum affinity required for a cell to survive selection. In Abs (absolute) selection mode, the selection threshold remains constant. In Rel (relative) selection mode, the selection threshold FIGURE 3 Comparison (c) between trees from a primary (a) and secondary (b) response (Vora et al., 1999), showing their similarity.
corresponds to the average affinity in the population; however, in the first generation the pre-defined selection threshold is used.
The selection rate parameter defines the probability for a cell to undergo selection in each generation. That is, (selection rate) 21 denotes the maximal number of mutations between two consequent selection events.
The selection start parameter defines the time selection begins to operate (in number of generations since the simulation started).
The values chosen for each of the above parameters (Table I) yielded 1920 different parameter sets. Each different parameter set was used in five different simulations, with five different random number generator seeds. Some simulations yielded more than one tree and in some cases no trees were generated (when all cells died before the end of the simulation). The total number of trees generated was 8300. Several features, which indicated that tree shapes indeed reflect the dynamics of the response, were observed during simulation development and tree generation, as follows.
1. As expected, the relative selection mode is more effective than absolute selection-in the case of relative selection, the population contains fewer cells and their average affinity is higher than in the case of absolute selection, in which there is no way to develop nor to kill the cells with relatively low affinity obtained at the first steps of the simulation. The interesting point is that this is reflected in tree shapes. Trees obtained in relative selection mode have fewer branches and their average length is higher than trees obtained in absolute selection mode. For example, the average number of leaves for all 5175 trees obtained in the absolute selection mode is 4648^4701; and the average distance from a leaf back to the last split is only  The p values for the differences between the two tissues in maximum and average outgoing degree and maximum and average distance from leaf to split node are 0.04, 0.04, 0.07 and 0.07, respectively.
1:85^0:55 nodes, while for the 3125 trees obtained in the relative selection mode, the average number of leaves is 584^911 and the average distance from a leaf back to the last split is 2:25^4:50: The trees are larger than published experimental trees because (i) here we have the full tree and not just a sample of it in each case and (ii) not all parameter sets reflect biologically relevant parameter regimes. However, the fact that the standard deviations in the number of nodes (and the leaf to last split distance in the relative selection mode) are much larger than the means shows that the means were at the lower end of the range, that is, there were many more smaller trees than larger trees. The majority of trees having non-zero and long trunks were obtained in relative selection mode as well: there were 3787 trees with no trunk among the trees obtained in the absolute selection mode, and none among the trees obtained in the relative selection mode; there were only 1201 out of 5175 trees with trunk length $ 3 in the absolute selection mode, compared with 2056 out of the 3125 trees obtained in the relative selection mode. 2. In the case of "effective" (relative mode) selection, population size is relatively small throughout the simulation, and therefore antigen consumption is lower. Thus, in most cases of absolute selection, the antigen was totally exhausted before the simulation has reached 90 time steps, while in most cases of relative selection, the simulation stopped on the 90th step with a certain amount of remaining antigen. 3. The probability of getting no tree (all cells dying by the end of the simulation) grows with mutation rate. For a mutation rate of 0.008 (4 bits flipped per generation), all cells died regardless of the values of all other parameters. For a mutation rate of 0.002 (1 bit flipped per generation), almost every simulation yielded at least one tree. Hence higher values probably reflect unrealistic mutation rates. 4. In those simulations where 5 clones were allowed to develop in parallel, at least one clone always died out. In most cases, only one or two trees (out of 5 initial clones) were generated. This is an expected result of interclonal competition. 5. In most cases, population size upon simulation completion was below 10000 cells even when an upper limit was not used (only in 0.6% of cases did the population exceed 20000 cells). In cases where the population did exceed 20000, selection was weak, and in most such cases selection mode was absolute.

Correlations between Biological and Graphical Parameters
We proceeded to search for correlations between biological parameters and graphical ones, beginning by looking for simple (linear) one-to-one correlations between each biological parameter and each graphical one. We have found a surprisingly large number of correlations that were significant, though most of them had low correlation coefficients, most likely because of the high variability between trees in almost all parameters measured. Table II gives the one-to-one linear correlation coefficients and their p-values for all graphical parameters measured. It is evident that most graphical parameters correlated only with the mutation rate and with the selection threshold. The number of initial clones (1 or 5), the rate of selection (number of mutation rounds between two rounds of selection) and the time of starting the selection (in the beginning of the simulation or 10 generations later), in the ranges used in our simulations, did not correlate strongly ðjRj # 0:1Þ with any of the graphical parameters. Similar results were obtained with the scaled graphical parameters, whether they were scaled by the number of leaves or by the number of nodes (data not shown).

DISCUSSION
The objective of the present study was to develop a rigorous algorithm for extracting the information contained in mutational lineage trees, which so far were only used as an illustration of the dynamics of the humoral immune response. The algorithm we developed is composed of a module that characterizes trees according to their various graph-theoretical measures, and another module for finding correlations between these measures and the dynamical parameters of the GC response which generated the trees. The measurement module alone is useful in analysis of trees from different experimental sources, in that it can show which tree properties are significantly different between trees from different experimental groups [Banerjee et al., 2002]. Analysis of additional data will possibly enable us to hone this method further (see appendix).
Our statistical analysis validates our basic premise, that the shapes of lineage trees contain biological information on the dynamics of the germinal centre response that generated the trees. One may ask whether methods based on non-linear functions might have been more successful in prediction of the biological parameters from the graphical ones. We have attempted to improve our predictions by using a co-evolutionary algorithm, which allowed a population of proposed solutions (general polynomial functions of the graphical parameters) to coevolve with a population of test cases (from the data analysed above). However, this method has not yielded better results than straightforward linear stepwise regression. We presume that the high variability of the trees in our simulated tree database-and possibly of experimentally-generated trees as well-precludes better prediction of biological parameters from the graphical ones, at least using regression methods.
Several more questions may be raised with respect to tree shape analysis. For example, there is the question of the reliability of the data itself, not only due to PCR errors, but also because of the way trees are generated. As far as we know, all tree generation algorithms assume that if a mutation is shared by two different cells, then it must have occurred in a common ancestor of both cells. Thus, these algorithms do not allow for the possibility of identical mutations occurring in parallel in different "branches" of the tree. As there is no way to tell which shared mutations have indeed occurred in a common ancestor, and which shared mutations actually occurred independently, we must take the trees as they are and analyse them as such, assuming that the tree generation algorithms are at least consistent in all cases.

Tree Notation and Representation
A lineage tree is a rooted tree where nodes correspond to B cell receptor genes. For two nodes u and v, we say that v is a daughter of u if the cell corresponding to v is a mutant of the cell corresponding to u, which differs from u by only one mutation, and is one mutation further than u away from the original (germline) gene. Two B cells with identical receptors will correspond to the same node. A lineage tree describes the maturation process of a B cell at a certain moment of observation-it consists only of the cells that were sampled at that moment and their ancestors back to the root. The ancestors are not necessarily sampled at the time of observation. We distinguish between three kinds of nodes (Fig. 5a): . Root-representing the original B cell (node 0).
. Leaves-representing mutant B cells, which were alive at the time of sampling and had no daughters at the time of observation (nodes 6, 11, 12, 13 and 14). . Internal nodes-representing B cells that were produced during the maturation process, which may have been killed by selection but have a live offspring. There are two types of internal nodes: Split nodesthose with more than one daughter (node 3 and 10); and Pass-Through nodes-those with exactly one daughter (nodes 1, 2, 4, 5, 7, 8 and 9).
Since trees may come from different external sources (published experimental data, simulations, etc.), we faced the need to define a universal format for tree representation. For this purpose we chose the adjacency list format. Each node in a tree has a unique identification (id) number, satisfying the following two conditions: Hence a tree is represented by a text file, where each line contains a node id followed by its daughters' id's, delimited by space(s). Lines starting with a "#" sign are considered to be comments and thus ignored by the measurement algorithm (described below). A sample file containing the adjacency-list presentation of the tree from the previous example is shown in Fig. 5b.

Tree Measurement
In this section we define the graphical parameters to be measured on the tree. A priori we measured all the graphical properties we could define on the trees, as we did not know for certain which measure would best correlate with biological parameters. We show in the "results" section which of the parameters seem to express some information of interest. Note that, for quite a few of these properties, the maximum, minimum and average values per tree may be measured; while it is obvious that minimum values are often trivial, we have again measured all possible properties and then looked for the ones which best correlate with biological parameters. The complete list of parameters measured is the following.
. Number of nodes-total number of nodes, number of leaves, internal nodes, pass-through nodes. . Path length (root-to-leaf distance)-min, max, average (over all leaves in the tree). . Outgoing degree (number of daughters per node)min, max, average, average excluding pass-through nodes, root's degree. . Distance from root to the first split node (trunk length)if root's degree is 1; otherwise this distance equals zero. . Distance between leaf and the nearest split node-min, max, average. . Distance between leaf and the first (closest to the root) split node-min, max, average. . Distance from root to split node (on each path to a leaf, not considering the root itself)-min, max, average. . Distance from root to the maximal (in terms of outgoing degree) split node-min, max, average. . Distance between two adjacent split nodes-min, max, average.
We developed a computer program that reads a tree in the format described above and measures the graphical parameters, creating a text report. As stated above, a lineage tree describes the process at the specific moment of observation. Thus, in an experimentally obtained tree, only those cells that were sampled are represented. Usually the percentage of the germinal centre cells that are FIGURE 5 A sample tree (a), with the nodes marked by their "id" numbers; the adjacency list that corresponds to the tree is shown in (b). sampled is not very high. In order to neutralize the effects of sampling, the above parameters have to be additionally scaled (divided) by number of nodes (total) or by the number of leaves. In the results section we further discuss this issue with respect to experimental tree measurement. For the time being it will suffice to note that for most of the properties given in the list above, three values were measured-unscaled, scaled by number of nodes (total) and scaled by number of leaves.

Simulation of Germinal Centre Lineage Trees
The model of the affinity process implemented by our simulation ("tree generator") is very simple, yet it captures the main features of the process (Fig. 6). Our model considers a single population of B cells consisting of several clones (cells with different antigen receptors). A B cell's receptor is represented by a 512-bit string. A certain amount of antigen is available at the beginning of the simulation, where the antigenic epitope is represented by a 512-bit string as well. The affinity of a B cell is given by a (normalised) number of non-coinciding bits in the cell's receptor and the antigen (actually computed by applying the logical function XOR to the two strings bitwise). In every simulated "time step", each cell in the population can divide into two daughter cells, each one of which may undergo mutation according to the mutation parameters. The probability of the cell to divide and mutate depends on the affinity of its receptor to the antigen, population size (relatively to the maximum allowed population size) and the amount of available antigen. A newborn cell can either immediately die due to lethal mutation, divide and mutate again, or undergo selection, whichever should happen according to the simulation parameters. As a result of selection, a cell either dies, or survives. This decision is taken according to the cell's affinity and selection parameters. Each successful selection event consumes one antigen unit. The process stops whenever the antigen is exhausted or after a specified number of time steps. Several clones may develop simultaneously in the simulation. Each clone originates from a different B cell in the initial population. A lineage tree is produced for each clone that survives to the end of the simulation.