A Dirichlet Autoregressive Model for the Analysis of Microbiota Time-Series Data

Growing interest in understanding microbiota dynamics has motivated the development of different strategies to model microbiota time series data. However, all of them must tackle the fact that the available data are high-dimensional, posing strong statistical and computational challenges. In order to address this challenge, we propose a Dirichlet autoregressive model with time-varying parameters, which can be directly adapted to explain the effect of groups of taxa, thus reducing the number of parameters estimated by maximum likelihood. A strategy has been implemented which speeds up this estimation. The usefulness of the proposed model is illustrated by application to a case study.


Introduction
Recent studies suggest that microbiota, which denotes the collection of bacteria living either in or on the human body, plays a key role in the health status of individuals. In this respect, some studies have pointed out that the maintenance of a stable microbial ecosystem is necessary for a healthy life. In fact, it is known that a disruption of the stable state of the microbiota can be associated with different diseases such as obesity, diabetes, or cancer [1][2][3]. erefore, analyzing stability of the microbiota and understanding how quickly it recovers and reaches a new stable state are key questions in the study of the human health status. In this context, longitudinal studies can help to both understand microbiota regularity over time in healthy individuals and study the response of the microbiota to perturbations in disease scenarios.
Many proposals for microbiota data longitudinal analyses use count-based strategies (see, for instance, Section 3.5 in [4] and the references therein). However, more recent approaches suggest considering compositional vectors of relative abundances [5][6][7]. e reason is that microbiota data are generated through DNA sequencing and they are constrained by an arbitrary constant sum. is is due to the fact that the sequencing instruments used have a fixed upper bound on the number of reads delivered. erefore, the read count cannot be related to the absolute number of molecules in the input biological sample, and so microbiome datasets must be converted to either relative abundance values or normalized counts [8][9][10][11].
On the contrary, the compositional nature of microbiota longitudinal data forces the use of multivariate time series models that take into account the following two features: first, each time series is related to a bacterial taxon and, second, the vector corresponding to each time point represents nonnegative proportions that add up to one. A well-known approach to analyze compositional time series in different scenarios involves transforming the data in order to break the unit sum constraint, and so the use of standard time series techniques is appropriate. Within this strategy, log-ratio or Box-Cox transformations have been considered and Gaussian distributions have been used [12]. However, alternative approaches can also be taken into account, for instance, those based on the use of the Dirichlet distribution [13][14][15].
Focusing on the analysis of the dynamics of microbial communities, autoregressive models have been considered, with some of them using a standard Lotka-Volterra structure [16][17][18]. However, these models are based on pairwise interactions and thus fail to capture effects that a third microbe may have on an interacting pair of microbes; see [19] for more details about limitations of models based on Lotka-Volterra structure. A nonparametric approach with an additive structure, which does not presuppose any underlying functional form for community dynamics, has also been proposed [20,21]. Additive models have the advantage in that they do not need explicit specifications of the functional forms of the relationships between microbes. Nevertheless, this approach admits additivity in the relationships, which is not necessarily realistic for complex microbial communities. Also, a common practice in those works is to consider taxonomic averaging with the aim of reducing the number of parameters in the model, which can lead to inaccurate conclusions [22]. For instance, the role played by important community members can be missed if they are associated with a low abundance, and microbiome stability may be overestimated. Recent works also propose the use of state-space models, which assume that abundances are associated with a real-value hidden state variable vector that evolves through time based on a firstorder Markov process and can identify the microbial interaction [23][24][25]. Other alternatives for the analysis of microbial community temporal dynamics are linear mixed models that provide flexibility in correlated longitudinal data [26,27] or dynamic Bayesian networks, which are another class of state-space models appropriate to model the interaction of microbial taxa [28].
In this paper, we model relative abundances of microbial taxa with a Dirichlet distribution with time-varying parameters. We assume that these relative abundances, after a log-ratio transformation, can be explained by an autoregressive structure which takes into account the effect of the bacterial community as a whole. is proposal can be useful to understand the relationships between microbes and the identification of keystone members of the microbial ecosystem that may play an important role. It is worth noting that the Dirichlet distribution has a strong independence structure, which can be deduced from its definition by a set of independent, gamma-distributed, random variables, with equal scale parameter. is fact makes it inappropriate to consider this probability distribution for modeling compositional data [29]. However, it has also been found useful when used as a conditional distribution; see, for instance, [30,31].
One important feature of our proposal is the consideration of the bacterial community effect as a whole, by recurring to the geometric mean, in order to reduce dimensionality.
is formulation allows us to encapsulate information and decrease the number of parameters to estimate without removing or grouping microbial taxa. To the best of our knowledge, this is the first model developed for microbiota time series based on Dirichlet distribution with time-varying parameters. e paper proceeds as follows: In Section 2, we present some basic definitions and describe the proposed model and in Section 3 we illustrate the performance of the model with a case study. Finally, some conclusions are drawn and directions for future research are suggested.

Basic Definitions and Preliminaries.
Let y � (y 1 , y 2 , . . . , y K ) be a K-dimensional random vector which satisfies that K i�1 y i � 1 and 0 < y i < 1 for all 1 ≤ i ≤ K. e random vector y follows a Dirichlet distribution with pa- where the normalizing constant, B(α), is the multivariate beta function which can be defined in terms of the gamma function, Γ(·): Considering τ � K i�1 α i , the expectation of each component, y i , is defined as E(y i ) � α i /τ and the variance as Compositional time series are multivariate time series in which the observation vector at time t is nonnegative proportions whose sum is 1. Historically, such time series have been modeled by considering transformation of the observations and modeling them with standard multivariate techniques using Gaussian distributions. However, alternative approaches have also been considered in which the original data are modeled directly, that is, how they are observed experimentally. In this case, probability distributions on the simplex must be considered and a very popular distribution is the Dirichlet distribution.
Longitudinal data on relative taxa abundances can be regarded as a compositional time series where the vector of relative abundances corresponding to each time point is an element of the simplex: Taking into account the fact that the Dirichlet distribution with covariates and time-varying parameters can provide an adaptable covariance structure [15,30,31], our proposal is based on this probability distribution and on a reparameterization defined by the well-known additive logratio transformation (alr transformation). Note that alternative forms of transformation can also be proposed, such as the centered log-ratio transformation (clr transformation) or the isometric log-ratio transformation (ilr transformation). See [29,32], for details related to these transformations. e additive log-ratio transformation of index K is the one-to-one linear transformation from S K to R K−1 defined as 2 Complexity alr y 1 , y 2 , . . . , y K � ln y 1 y K , ln y 2 y K , . . . , ln y K−1 y K .

(4)
Zheng and Chen analyze the consideration of the additive log-ratio (alr) transformation and the centered log-ratio (clr) transformation as link function in an autoregressive moving-average model with the Dirichlet distribution [15]. In our case, we have also taken into consideration the alr transformation as the link function in an autoregressive structure but have redefined the effect of the relative abundance of the rest of the microbial community in order to reduce the number of parameters. Note that, in a standard autoregressive model (VAR model), the effect of the remaining taxa on each taxon must be defined by adding each of them as an additive term and this option increases model dimensionality. We have considered the effect of the rest of the taxa on average. Compared with the approach proposed by Zheng and Chen, we have not considered a moving-average component to reduce dimensionality. e alr transformation has been considered for biological purposes. is option allows the comparison of two particular taxa. Taking into account y K as reference, we can consider that 2.2. Model Formulation. Let y t �(y 1t , y 2t , . . ., y Kt ) be the vector of relative abundances at time t, so that y it ∈ (0, 1) and y 1t + y 2t + · · · + y Kt � 1. We assume that the vector y t |y t−1 , y t−2 , . . . , y 1 follows a Dirichlet distribution with positive parameters α t � (α 1t , α 2t , . . ., α Kt ): In order to link α t with y t−1 , y t−2 , . . . , y 1 , we propose ln(α jt /α Kt ) � η jt , where η jt is defined as Note that additive log-ratio data transformation and first-order Taylor approximation enable us to state that and thus to assume that taxon relative abundance over time is dependent on their own relative abundance as well as on the effect of the relative abundance of the rest of the microbial community at the previous time points. Note that a jp are coefficients associated with relative abundance of taxa j at the previous time points and b jp are those associated with the effect of the relative abundance of the rest of the microbial community on (geometric) average. e coefficient P is the order of the model and a j0 represents the expected mean value of E(ln(y jt /y Kt )) when significance is lacking for both the effect of the relative abundance of taxa j at the previous time points and the effect of the remaining relative abundance of the rest of the microbial community on (geometric) average. It is worth emphasizing that in equation (7) each taxon abundance is evaluated with respect to the abundance of taxon y K and thus positive values signify that the taxon in the numerator has more weight than taxon y K and conversely for negative values. e parameter τ t � α 1t + α 2t + · · · + α Kt is the concentration parameter of the Dirichlet distribution and must also be estimated. We consider that it is a time invariant parameter; that is, τ t � τ for each t. In order to reduce dimensionality, equal concentration for probability mass has been admitted at each time point. erefore, using the expression ln(α jt /α Kt ) � η jt , we can determine α jt and α Kt as follows: From these equalities, it is easy to calculate the expected value E(y it ) � α it /τ t and variance Var(

Case Study
In order to demonstrate the utility of our proposal, we analyze an available time series microbiome dataset. In this section, we show a summary of the results obtained. We have studied the applicability on prediction, variability analysis, and trends clustering. ere are other works taking similar approaches, which also provide applications to predict or detect relevant taxa, albeit only partially, and have different features. See, for instance, [33][34][35]. Conversely, our proposal is an alternative integrated framework, applicable to more than one context. In the case study performed, we have selected family as the taxonomic level and y K is defined as the family with the greatest relative abundance. We must point out that only a first-order autoregression has been considered, P � 1 in equation (7), because higher-order autoregressions were not significant. Given the concern of the presence of zeros, we have chosen only families whose relative abundance is bigger Complexity 3 than 1% for each single time point but this cutoff value can be arbitrarily defined.

Dataset.
We have analyzed the database generated by the 16S rRNA gene sequencing of stool samples of a healthy male and a female with irritable bowel syndrome (IBS). Fecal bacteria from the healthy male were monitored for 15 consecutive days. e IBS patient's fecal samples were collected in the morning on alternate days the first week and once a week thereafter. is longitudinal dataset was studied by Durbán et al. in [36,37]. Figure 1 shows the relative abundances of the microbial families in samples of the above-mentioned individuals. Note that, at family level, samples were quite similar. e families Bacteroidaceae, Porphyromonadaceae, Rikenellaceae, and Ruminococcaceae are present in both samples, while the families Erysipelotrichaceae and Lachnospiraceae were present in the individual with IBS. We must point out that in both the healthy and the IBS individuals several undefined families were also present, which we classified collectively as Other. Taxa that did not meet the 1% threshold have also been rolled into the Other category. is group of undefined detections has been taken as reference, y K . It is should be pointed out that equivalent results would be obtained if we chose another component (another family) as reference. Our proposal satisfies the permutation invariance principle. See [29] for details.

Predicting Temporal Behavior of Microbial Taxa.
In this section, we describe how the proposed model can effectively predict the future dynamics of a microbial community. Modeling microbiota time series data and using models for predicting temporal dynamics of a future state can help to gain a better understanding of the different roles played by microbes. Figure 2 displays both the predicted relative abundance, E(y it ), and the experimentally reported values for each biological family in the healthy male. e same fit was also performed for the IBS patient, the results of which are shown in Figure 3.
In order to predict the future evolution of families, E(y it+h ), and variance, Var(y it+h ), expressions (9) and (10) must be evaluated at time t + h. Table 1 shows the estimated values for the parameters a j0 , a j1 , and b j1 . e estimation procedure is detailed in the Appendix.
To analyze the predictive performance of our approach, we have compared it with a simple method which predicts the same value as the previous time point.
is simple method will provide us with a baseline value. For this comparative purpose, the data for the last three time points (L � 3) have been left to evaluate the ability of our model to predict the relative abundance of taxa, and the first twelve time points have been used to estimate the parameters. As a measure of predictive efficiency, the sum of absolute errors (SAE) has been considered. is index is defined as where y il is the relative abundance of taxa i at time points which have been left to evaluate the predictive efficiency and y il is the corresponding predicted value.
In Table 2, we can see that, under the proposed model, SAE is greater than the sum of absolute errors under the baseline approach in the healthy male. e analysis indicates that the predictions for the families Other and Rikenellaceae calculated by our proposal are not accurate for the healthy male. However, this does not happen for the remaining families. For the IBS patient, the SAE index shows that our proposal performs well. In this case, all the predictions calculated with our proposal are more accurate than the ones provided by the baseline method.
We have also evaluated the predictive efficiency of our model on other real datasets. Lloyd-Price et al. [38] tracked 132 subjects (Crohn's diseases or ulcerative colitis patients) for one year each to analyze microbial activity during the disease (up to 24 time points each). We have analyzed two of them.
ey are two school-aged subjects diagnosed with Crohn's disease recruited from Cedars-Sinai Medical Center in Los Angeles and Cincinnati Children's Hospital, respectively. e subject recruited from Cedars-Sinai Medical Center reported antibiotic use. Caporaso et al. [39] present a human microbiota time series analysis of two healthy individuals over 396 time points at four body sites. We have used their available longitudinal time series data of gut microbiome samples from both subjects over 80 time points.
We have also investigated predictive efficiency of the proposed model using two simulated datasets. e simulation study has been carried out considering five and eight microbial entities, respectively. Following the proposal established in [40], we have generated its time series over 30 time points with a generalized Lotka-Volterra structure.
is approach allows us to simulate the temporal dynamics of a bacterial community considering the interentity interactions as input. To generate the interaction matrix, we have taken into account the algorithm proposed by Klemm and Eguiluz in [41]. is algorithm generates a modular and scale-free interaction matrix that reproduces properties of a microbial network. We assigned interaction considering diagonal values to −1 and off-diagonal values by sampling from a uniform distribution between 0 and 1. We have also set the modularity parameter of the Klemm and Eguiluz algorithm to 4 and 6, for the database which considers five taxa and eight taxa, respectively, and the interaction matrix connectance to 0.4 and 0.3, respectively. Note that these parameters allow us to both measure the strength of division of a network into subcommunities and define the interaction probability between entities, respectively. e interaction matrix has been generated with a positive interactions percentage equal to 64%. We have used the R package presented in [40] called seqtime.
In all these alternative datasets, we have also selected family as the taxonomic level and families whose relative abundance is bigger than 1% for each single time point have also been considered. e last three time points have also been left to evaluate the predictive ability of our proposal. Table 2 shows the predictive efficiency of our approach on the alternative datasets. e results display the predictive 4 Complexity performance of our proposal compared to the baseline method. Additionally, in order to increase the evidence of our proposal, we have also compared it against the wellknown TGP-CODA method proposed byÄijö et al. in [33]. e results obtained on the real and simulated datasets also demonstrate validity of our model for predicting temporal behavior of microbial taxa.

Analyzing Variations in Temporal Behavior of Microbial
Taxa.
e aim of this case study is to illustrate how our model can be useful to show the relationship between microbiome variability and host health status. In this respect, the estimates of the variances for each microbial taxon, Var(y it ), differentiate between healthy and dysbiotic microbiota (Figure 4). In addition, boxplots combining all the families also show a clear difference across time between healthy and unhealthy microbiota. We can appreciate that median, interquartile range, and whiskers are larger in the IBS patient. In the healthy individual, all families present a stable variance. As we mentioned before, the estimation procedure for the parameters of the model, α it , and thus for the variance, Var(y it ), is described in the Appendix.
In order to evaluate how our variance analysis performs in terms of indicating host health status, we have compared our proposal to a microbial trend analysis (MTA). Wang et al. [42] propose an MTA framework for longitudinal microbiome data analysis.
is proposal can capture the common dynamic patterns on the microbial community and identify the dominant taxa. Additionally, MTA can also classify individuals based on its longitudinal microbial profiling. We have used MTA because this toolbox is similar to our proposal. Note that both are integrated frameworks which allow several applications to microbial longitudinal data. In the IBS case study, the distance-based classification algorithm proposed in MTA framework also classifies the subjects (IBS patient and healthy subject) into different groups.
We have also analyzed the ability of our proposal to distinguish the microbial dynamics between subjects in alternative scenarios. We have also used the simulated datasets described before and the real datasets studied by Lloyd-Price et al. and Caporaso et al., respectively. Figure 5 displays the variance time series for the alternative scenarios considered. We can also appreciate the differences, although in these cases they are lower than those observed in the IBS scenario. For these aforementioned alternative datasets, we have also compared the performance of our variance analysis to that of the MTA framework. In all scenarios, the MTA approach also classifies the subjects in different groups.  Additionally, we have also compared our proposal against MITRE, a supervised machine learning method proposed by Bogart and others in [34]. is proposal also allows predicting or inferring the status of the host analyzing microbiota time series data. Table 3 displays the probability associated with the host's status. As in our approach, MITRE also clearly discriminates the host's status on each scenario analyzed.

Clustering Groups of Taxa Sharing a Similar Pattern over
Time. Another important goal while analyzing microbiota time series data is the detection of groups of taxa which present similar trends over a timeframe. e detection of taxa with similar temporal dynamics versus taxa with alternative patterns can help to understand the principles that define the microbiome in health or cause dysbiosis in disease.
It should be remembered that, in equation (7), a j0 represents a family-specific intercept that picks up the average relative abundance of family j versus the relative abundance of family y K over time; a j1 are related to the intrinsic dynamics for each family versus family y K and b j1 to the dynamics of the remaining families considered in geometric mean. Figure 6 displays the PCA biplots with the first two principal components that together explain 98.93% and 99.45% of the total variance of the IBS patient and healthy individual, respectively. We can observe that a j1 and b j1 are positioned on opposed quadrants. ey are negatively correlated. Note that families close to each other in the biplot represent observations with similar values. erefore, we can determinate which families have measurements that are the most similar to each other.
In the PCA biplot for the IBS patient, we observe that Porphyromonadaceae and Rikenellaceae are associated with the highest values of a j1 and the lowest values of b j1 . We can also note that Lachnospiraceae is associated with the highest values of b j1 and the lowest ones of a j1 . On the other hand, in the healthy subject, we note that Rikenellaceae and Ruminococcaceae are now the families that are associated with the highest values of b j1 and the lowest values of a j1 , respectively.
Taking into account the fact that a j1 are related to the intrinsic dynamics for each family and b j1 are related to the dynamics of the remaining families in geometric mean, we are be able to cluster families sharing a similar pattern over time. For instance, in the IBS patient, the temporal dynamics of the relative abundance of Lachnospiraceae (versus family y K ) is strongly related to that of the remaining families in geometric mean. However, the dynamics of the relative abundance of both Porphyromonadaceae and Rikenellaceae (versus family y K ) are not so closely associated with it. Remember that in our proposal all relative abundances are analyzed with respect to the relative abundance of y K and this one clusters undefined taxa.   We have also evaluated the identification of taxa by comparing our proposal to the microbial trend analysis (MTA). Table 4 shows the output of MTA related to the contribution of each taxon to the longitudinal microbial profiling for each individual.
We can observe a strong correspondence between the classification of taxa defined by our proposal and the contribution values to the longitudinal microbial profiling presented by MTA. In the IBS patient, our model points to Bacteroidaceae and Lachnospiraceae as relevant taxa compared to the rest. Note that Bacteroidaceae is associated with the highest values of a j0 and Lachnospiraceae to the highest values of b j1 . We can corroborate this using the MTA analysis. In this case, Bacteroidaceae and Lachnospiraceae are the families with greater contribution to the longitudinal microbial profiling of this patient with a contribution of 0.200 and 0.180, respectively. e other families grouped by the PCA biplot present a similar contribution value around 0.04. Note that Other, which has a contribution of 0.959, has been taken as reference in our proposal. In the healthy individual, we can observe that our approach detects Bacteroidaceae and Rikenellaceae as the families associated with the highest values of a j1 and b j1 , respectively. e MTA analysis indicates that these families are the most relevant in the longitudinal microbial profiling of this subject, with a contribution of 0.459 and 0.476, respectively.
In the PCA biplot for the subjects reported by Lloyd-Price et al. (Figure 7), we observe the relevant role of Clostridiaceae, which is clearly associated with higher values of coefficient a j1 (intrinsic dynamics for each family) in the subject with antibiotic use and with higher values of coefficient a j0 (average relative abundance of family j versus the relative abundance of family y K over time) in the subject without antibiotic use. Its relevant role is also highlighted by the MTA approach. See Table 5. It should be remembered that a higher absolute value indicates a stronger contribution to a common trend.
With respect to the healthy female studied by Caporaso et al., we note the relevant association between Other and a j0 , which is also emphasized by the MTA analysis with a high  absolute value. We also appreciate that this association is not as narrow as in the male studied. is fact is corroborated by the MTA analysis. Its absolute associated value is lower in the male than in the female. In the PCA biplot for the male individual, we also observe that Lachnospiraceae and Ruminococcaceae are grouped close to a j1 and this fact is also detected by our MTA analysis with similar values around −0.20. In these alternative scenarios, Bacteroidaceae has been considered as reference family, y K . e resulting PCA biplot for five simulated taxa clearly shows that taxon 1 a j0 correlates with having a high MTA contribution value equal to 0.515. Additionally, it displays that taxon 3 and taxon 4 are associated with a j1 at a similar level, with contribution values to the longitudinal microbial profiling equal to 0.209 and 0.258, respectively. From this biplot, we can also conclude that taxon 5 is associated with the coefficient b j1 . By contrast, the biplot for eight simulated taxa indicates that taxa 4 and 7 are grouped and closely related to a j0 and taxon 1, taxon 2, and taxon 5 are also grouped and related to a j1 . is evidence has also been confirmed by its higher MTA contribution values equal to 0.449 and 0.428, and 0.212, 0.257 and 0.242, respectively. In this simulated case, taxon 2 and taxon 3 have been considered as reference, respectively. e results of all these complementary analyses have also corroborated that the taxa grouped by the PCA biplot contribute similarly to the MTA values, and the families clearly located close to the axes present higher values.
In order to address the relationship between taxa, we have also analyzed association using correlation approach. Figures 8 and 9 show the correlation matrices provided by microbiome R package [43]. Note that these correlation indexes are provided considering the clr transformation data.
In the PCA biplot for the IBS patient ( Figure 6), we observe that Porphyromonadaceae and Rikenellaceae had a similar pattern (defined by the highest values of a j1 and the lowest values of b j1 ) and it can also be observed in the correlation analysis. We can note that Porphyromonadaceae and Rikenellaceae present a significant positive correlation;     Figure 8. We can also note that Lachnospiraceae presents an alternative pattern which is associated with the highest values of b j1 and the lowest ones of a j1 . e correlation matrix displays negative correlation between Lachnospiraceae and Porphyromonadaceae and between Lachnospiraceae and Rikenellaceae.
In the healthy subject, we note that Rikenellaceae and Ruminococcaceae are now the families that are associated with the highest values of b j1 and the lowest values of a j1 , respectively.
is association between Rikenellaceae and Ruminococcaceae can also be observed in the correlation matrix with a positive correlation between these taxa. We can appreciate that the well-known correlation analysis corroborates the associations between taxa defined by our approach. In the PCA biplot for the subjects reported by Lloyd-Price et al. (Figure 7), we observe the difference between Clostridiaceae and Other.
is difference is also pointed out in the correlation analysis, which is shown with a negative index. Analyzing the rest of scenarios, we can also appreciate that similar patterns defined by our approach are associated with positive correlations and nonsimilar patterns are associated with negative ones.   Other optimization, we carried out a second optimization procedure also considering function optim. In this case, with the parameter values obtained previously with the linear system optimization as initial values and τ in the parameter search interval [1,30], we have calculated Akaike information Criterion (AIC) for each second optimization procedure and the τ value that minimizes Akaike information Criterion (AIC) has been selected. In summary, the initial value estimation procedure for likelihood optimization is as follows: Step 1: solve the linear system defined by ln (y jt /y Kt ) � η jt taking into account a linear leastsquares algorithm with L2 regularization. is step allows us to obtain the initial values for parameters except for τ. Remember that additive log-ratio data transformation and first-order Taylor approximation allow us to state that E ln y jt y Kt � E alr y jt ≈ alr E y jt � alr α jt τ � ln α jt α Kt . (A.2) Step 2: (2a) Compute a second optimization procedure with the values obtained in Step 1 and τ in interval [1,30].
(2b) Compute Akaike Information Criterion (AIC) for each optimization procedure carried out in Step (2a). is step provides the initial value for τ.
Data Availability e data and code used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that they have no conflicts of interest.