
The inference of population dynamics from molecular sequence data is becoming an important new method for the surveillance of infectious diseases. Here, we examine how heterogeneity in contact shapes the genealogies of parasitic agents. Using extensive simulations, we find that contact heterogeneity can have a strong effect on how the structure of genealogies reflects epidemiologically relevant quantities such as the proportion of a population that is infected. Comparing the simulations to BEAST reconstructions, we also find that contact heterogeneity can increase the number of sequence isolates required to estimate these quantities over the course of an epidemic. Our results suggest that data about contact-network structure will be required in addition to sequence data for accurate estimation of a parasitic agent's genealogy. We conclude that network models will be important for progress in this area.

Epidemiology is a data-driven field, and molecular sequence data are entering it at an increasing rate. This new and growing data source has led to a call for multi-level models of the relationship between sequence data and infectious disease dynamics [

By allowing additional data to be integrated, phylodynamic modeling may improve the accuracy and quality of infectious disease surveillance. For example, the number of reported norovirus outbreaks increased in 2002, but it was not clear whether the higher numbers reflected more outbreaks or more frequent reporting of outbreaks. Case-reporting bias does not affect molecular data, however. So coalescent analysis of molecular data [

To model heterogeneity in contact, we represent individuals in a population as nodes, and we represent the potential for two hosts to infect each other as an edge that links two nodes. Researchers call the resulting networks contact networks. Contact-network structure necessarily affects the genealogy of any replicating infectious agent that is spreading through a host population. In this paper, we use the term parasite to refer to all such infectious agents, including bacteria and viruses. The genealogy of these parasites must fit inside the tree of infections that forms as the parasite spreads from host to host, and this tree of infections must fit inside the host population's contact network. While more elaborate elements of contact-network structure may be important, we here focus simply on variation in the number of edges coming out of nodes, which corresponds to heterogeneity in contact rates.

Contact heterogeneity has often not been discussed as a possible bias in coalescent analyses (e.g., [

Our primary goal here is to assess how contact heterogeneity affects the relationship between coalescent reconstructions and the reality of parasite population dynamics. First, we build contact networks with different levels of heterogeneity. Next, we simulate the spread of parasites through the networks, generating epidemic dynamics and a genealogy of the parasite with each simulation. Then, we use the BEAST software package [

We simulated infectious disease progression on networks. The nodes of the networks represented hosts and had states of being susceptible, infectious, or recovered. The edges of the network determined the set of possible transmission events; infectious hosts transmitted infection across edges shared with susceptible hosts until the infectious hosts recovered. The number of nodes in the network was kept at 10,000, and the mean degree (degree is the number of edges coming out of a node) was kept at 4. The networks were either regular, meaning that all nodes had the same degree, or had degree distributions sampled from Poisson, exponential, or Pareto distributions. The minimum degree in the Pareto networks was 1. The regular networks served as models with zero heterogeneity, Poisson networks as models with heterogeneity similar to a Poisson process, exponential networks as models with heterogeneity similar to a variety of social networks [
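As an illustration of the four network families described above, the following minimal sketch draws degree sequences with a target mean of 4. It is not the authors' code. The function name `degree_sequence`, the use of a floored exponential variate as the discrete "exponential" degree, the floored and capped Pareto variate, and the default `pareto_shape` of 4/3 are all assumptions made for this example. The Pareto branch enforces the minimum degree of 1 but does not calibrate the sample mean exactly, which is noisy for such a heavy tail.

```python
import math
import random


def degree_sequence(kind, n=10_000, mean=4.0, pareto_shape=4.0 / 3.0, seed=None):
    """Draw n node degrees for one of the four network families (hypothetical helper)."""
    rng = random.Random(seed)
    if kind == "regular":
        # Zero heterogeneity: every node has the same degree.
        degrees = [int(mean)] * n
    elif kind == "poisson":
        # Knuth's multiplication method; adequate for small means such as 4.
        limit = math.exp(-mean)
        degrees = []
        for _ in range(n):
            k, p = 0, rng.random()
            while p > limit:
                k += 1
                p *= rng.random()
            degrees.append(k)
    elif kind == "exponential":
        # Assumption: flooring an exponential variate gives a geometric
        # (discrete exponential) degree; this rate makes its mean equal `mean`.
        rate = math.log(1.0 + 1.0 / mean)
        degrees = [int(rng.expovariate(rate)) for _ in range(n)]
    elif kind == "pareto":
        # Assumption: floor a continuous Pareto variate with minimum 1 and cap
        # it at n - 1. The mean is only approximately controlled by the shape.
        degrees = [min(int(rng.paretovariate(pareto_shape)), n - 1) for _ in range(n)]
    else:
        raise ValueError(f"unknown degree distribution: {kind}")
    # A graph can only be built from a degree sequence with an even sum.
    if sum(degrees) % 2 == 1:
        degrees[rng.randrange(n)] += 1
    return degrees
```

A degree sequence produced this way could then be wired into a network with, for example, a configuration-model procedure; that step is not shown here.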

We simulated epidemics and genealogies in continuous time using a method based on the Stochastic Simulation Algorithm [
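To make the continuous-time simulation concrete, here is a minimal Gillespie-type sketch of a susceptible-infectious-recovered epidemic on a contact network. It is an illustration under stated assumptions, not the authors' implementation: the function name `simulate_sir`, the adjacency-dictionary input, and the choice to sort candidate events before drawing one (simple and reproducible, but slow for large networks) are all hypothetical. The default rates match the parameter values reported for the figures (transmission rate 2 per edge, recovery rate 1 per host).

```python
import random


def simulate_sir(adj, beta=2.0, gamma=1.0, seed_node=0, seed=None):
    """Simulate SIR on a simple graph given as {node: set_of_neighbours}.

    Returns (series, transmissions), where series holds tuples of
    (time, prevalence, incidence) and transmissions holds tuples of
    (time, source, recipient).
    """
    rng = random.Random(seed)
    state = {v: "S" for v in adj}
    state[seed_node] = "I"
    infectious = {seed_node}
    # Directed (infectious, susceptible) pairs across which transmission can occur.
    si_edges = {(seed_node, v) for v in adj[seed_node]}
    t = 0.0
    # Incidence is taken as the sum of the rates of all possible transmissions.
    series = [(t, len(infectious), beta * len(si_edges))]
    transmissions = []
    while infectious:
        rate_transmit = beta * len(si_edges)
        rate_recover = gamma * len(infectious)
        total = rate_transmit + rate_recover
        t += rng.expovariate(total)  # waiting time to the next event
        if rng.random() * total < rate_transmit:
            src, dst = rng.choice(sorted(si_edges))
            state[dst] = "I"
            infectious.add(dst)
            transmissions.append((t, src, dst))
            for w in adj[dst]:
                if state[w] == "I":
                    si_edges.discard((w, dst))  # dst is no longer susceptible
                elif state[w] == "S":
                    si_edges.add((dst, w))  # dst can now infect w
        else:
            rec = rng.choice(sorted(infectious))
            state[rec] = "R"
            infectious.remove(rec)
            for w in adj[rec]:
                si_edges.discard((rec, w))
        series.append((t, len(infectious), beta * len(si_edges)))
    return series, transmissions
```

The list of transmissions records who infected whom and when, which is the information needed to assemble an infection tree.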

Simulation source code is available from the authors upon request. The code made use of the GNU Scientific Library [

The output of a simulation included a time series of prevalence, that is, the count of infected nodes (given a fixed population of 10,000 nodes), and incidence, that is, the sum of the rates of all possible transmissions. Simulations also generated infection trees in which each transmission was a bifurcating node, each recovery was a terminal node, and branch lengths were equal to the time between events. We sampled from the full infection trees to generate the trees for input in the skyride coalescent analyses. We sampled by selecting a set of nodes uniformly at random from the full infection tree to become tip branches of an infection subtree. To generate the subtree, we cut the branches of the full infection tree at the subset of randomly selected nodes that had no descendants in the set of randomly selected nodes, and we pruned off any paths that did not terminate in this subset of nodes.
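The sampling and pruning procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the tree is assumed to be stored as a dictionary mapping each node to its parent (the root maps to `None`), and the names `draw_sample` and `sample_subtree` are hypothetical. Branch lengths are omitted, and nodes left with a single child after pruning are not collapsed.

```python
import random


def draw_sample(parent, k, seed=None):
    """Select k nodes uniformly at random from the full infection tree."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(parent), k))


def sample_subtree(parent, sampled):
    """Return (tips, subtree) for a set of sampled nodes.

    tips are the sampled nodes with no sampled descendants; subtree is the
    parent map restricted to the tips and their ancestors, so every retained
    path terminates in a tip.
    """
    sampled = set(sampled)
    # Collect every node that is a proper ancestor of some sampled node.
    ancestors = set()
    for node in sampled:
        a = parent[node]
        # If a is already marked, all of its ancestors are marked as well.
        while a is not None and a not in ancestors:
            ancestors.add(a)
            a = parent[a]
    # A sampled node with a sampled descendant is itself an ancestor, so it
    # becomes an internal node rather than a tip.
    tips = sampled - ancestors
    kept = tips | ancestors
    return tips, {node: parent[node] for node in kept}
```

For example, in a tree where node 1 is the parent of node 3 and both are sampled, node 3 becomes a tip and node 1 stays in the subtree only as an internal node on the path to the root.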

Using the sampled infection trees as genealogies, we obtained a posterior distribution for the skyride population sizes with the time-aware method of Minin et al. [

Using the posterior skyride population-size distributions, we obtained the skyride trajectories with Tracer [

To plot time series from different stochastic simulations on a common time scale, we used the time at which growth became nearly deterministic in each simulation as time zero for that simulation.

Coalescent theory is an area of population genetics that models the structure of genealogies backward in time from a set of lineages sampled from a large population. A simple coalescent process turns out to be a good model for the genealogies of a wide range of scenarios in population genetics [

The skyride uses this simple relationship between effective population size and the expected time before coalescence to estimate population size from the length of intracoalescent intervals in a genealogy. The median of a skyride reconstruction

Predicting a skyride from the dynamics of an epidemic model is simply a matter of calculating the rate at which a pair of lineages will coalesce, that is, the rate at which two chains of infection merge into a single chain. Volz et al. [

The similarity of (

To determine the effect of sampling on the ability of the skyride to reconstruct prevalence history, we simulated genealogies and pruned off a variable number of branches from the genealogies. We found that small amounts of pruning rapidly reduced the number of coalescent events in the sampled genealogy that occurred in the peak and late phases of the epidemic, thereby restricting accurate reconstruction to the early phase of the epidemic (Figure

Low levels of proportional sampling may prevent accurate reconstruction of prevalence during and after the epidemic peak. We consider reconstruction to be accurate when the skyride and the predicted skyride match. The light-blue ribbons are the middle 95% of the posterior density of the skyride reconstruction. The small bars on the

To demonstrate the effect of network structure on the reconstruction of prevalence history, epidemics were simulated on networks with varying heterogeneity. Keeping the extent of sampling equal and increasing heterogeneity compressed the coalescent events in the sampled genealogy into the beginning of the epidemic. Figure

Contact heterogeneity determines the amount of time over which the skyride estimated from the genealogy is informative of the skyride predicted by prevalence and incidence. Contact heterogeneity also affects the relationship between the skyride and prevalence trajectories. The light-blue ribbons are the middle 95% of the posterior density of the skyride reconstruction. The small bars on the

Figure

Contact-network structure, infectious disease dynamics, and genealogical structure interact. The ratio of prevalence to incidence is the generation time, which scales prevalence of the predicted skyride (up to a constant factor). Dividing the predicted skyride by the number of pairs of lineages backs out a smoothed expected length of intracoalescent intervals in the genealogy. Panel labels on the top indicate the approximate degree distribution of the contact networks. The variance of the degree distributions increases from left to right. Parameters: contact-network size = 10,000, degree distribution mean = 4, transmission rate = 2, recovery rate = 1, proportion of nodes sampled = 0.01.
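The relationships in this caption can be written out as a short calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the constant factor of 1/2 is an assumption that corresponds to a pairwise coalescence rate of twice the incidence divided by the squared prevalence. The text itself only specifies the predicted skyride up to a constant factor.

```python
from math import comb


def predicted_skyride(prevalence, incidence, factor=0.5):
    """Prevalence scaled by the generation time (prevalence / incidence).

    The value of `factor` is an assumption; the scaling is only stated
    up to a constant.
    """
    if incidence <= 0:
        raise ValueError("incidence must be positive")
    generation_time = prevalence / incidence
    return factor * prevalence * generation_time


def expected_interval(prevalence, incidence, n_lineages, factor=0.5):
    """Expected intracoalescent interval length with n_lineages extant lineages."""
    if n_lineages < 2:
        raise ValueError("at least two lineages are needed for a coalescence")
    pairs = comb(n_lineages, 2)  # number of pairs of lineages
    return predicted_skyride(prevalence, incidence, factor) / pairs
```

With a prevalence of 100, an incidence of 50, and 5 lineages, the predicted skyride is 0.5 × 100 × 2 = 100 and the expected interval is 100 / 10 = 10 time units.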

The effects of contact heterogeneity can be important in relating the structure of genealogies to infectious disease dynamics (Figure

But are the data requirements of these more complex models feasible? To begin answering this question, we next discuss the implications of obtaining the equivalent of our simulated data from a real-world system.

We knew the true infection tree in our simulations. In typical coalescent analyses of an infectious disease (e.g., [

It may be possible to work around the second problem by collecting sequences over time such that there are no branching points in the tree that are too far from every pair of tips. For the first problem, there is simply no information that the sequences alone can provide, and additional knowledge of events in the chain of infection is necessary to determine the infection tree. The panels labeled “Time to coalescence” in Figure

There also may be a need for contact tracing to establish the genealogy for airborne infections because many airborne transmissions may occur in a single day during which a single strain may be dominant in a host, as the super-spreading events in the 2003 SARS-coronavirus outbreak demonstrated [

In addition to being necessary to fill gaps in molecular data, contact tracing may be necessary because genealogies do not always match infection trees. Such discordance is likely to occur when there is relatively little time between transmissions. When there is little time for a mutant to become fixed between transmissions, the order in which alleles at loci of a sequence appear in transmitting inocula (or sequence isolates) need not match the order in which the alleles appeared in the within-host population. Measures of within-host viral load and sequence diversity may be informative of the chance of such discordance. If populations tend to be large and diverse, then sequence data may be useless for reconstructing the recent details of chains of infection but still useful in reconstructing deeper branches in the tree. Sequence data from diverse within-host populations could also be useful in parameter estimation for coalescent models (e.g., [

In our simulations, we also knew the variance of the degree distribution. We do have some data about the structure of contact networks for some systems. We have survey data about human sexual-contact networks (e.g., [

Contact heterogeneity is well known to have a strong effect on infectious disease dynamics. We have shown how the relationship between infectious disease dynamics and genealogies is similarly sensitive to the contact heterogeneity specified by a network. We have argued that direct knowledge of the tree of infections is likely needed in addition to sequence data for the accurate inference of prevalence from sequence data. Thus, it seems that understanding the structure of the contact networks for various diseases will be important for progress in phylodynamics.

This work was supported by NSF Grant EF-0742373. The Texas Advanced Computing Center at UT provided computing resources.