System Identification and Prediction of Dengue Fever Incidence in Rio de Janeiro

Identification, prediction, and control of a system are engineering subjects, regardless of the nature of the system. Here, the temporal evolution of the number of individuals with dengue fever weekly recorded in the city of Rio de Janeiro, Brazil, during 2007, is used to identify SIS susceptibleinfective-susceptible and SIR susceptible-infective-removed models formulated in terms of cellular automaton CA . In the identification process, a genetic algorithm GA is utilized to find the probabilities of the state transition S → I able of reproducing in the CA lattice the historical series of 2007. These probabilities depend on the number of infective neighbors. Time-varying and non-time-varying probabilities, three different sizes of lattices, and two kinds of coupling topology among the cells are taken into consideration. Then, these epidemiological models built by combining CA and GA are employed for predicting the cases of sick persons in 2008. Such models can be useful for forecasting and controlling the spreading of this infectious disease.


Introduction
The propagation of infectious diseases is one of the major public health problems in the world e.g., 1 .Cellular automaton CA; e.g., 2 has been employed for modelling the spreading of such diseases e.g., 3-6 , in order to predict epidemics and evaluate the efficacy of control strategies.In a CA epidemiological model, each cell composing the lattice can correspond either to a unique individual or to a larger fraction of the population.At each time step, each cell is in one of three states: S, I, or R. The state S represents a cell that is susceptible and therefore subject to being infected; the state I is related to a cell that is infective and hence can transmit the disease for susceptible ones; the state R is associated with removal from the process of disease propagation, meaning that the cell either is cured and immunized or is dead.Such a model is usually known by the abbreviation SIR e.g., 7 .If recovery from the infection does not confer immunity, the model is called SIS.The definition of the rules concerning the transitions among the states S, I, and R and the choice of the network topology utilized for representing the contacts among the individuals specify a SIR or SIS model based on CA.
There are several works on identification of transition rules of CA by using genetic algorithm GA; e.g., 8-11 .For instance, GA was employed for finding the probabilities of state transitions of a SIR model formulated in terms of CA in order to simulate the temporal evolution of the number of chickenpox cases registered by the Arizona Department of Health Services, USA, between 1994 and 2004 12 .Here we use a similar approach for identifying SIR and SIS models related to the incidence of dengue in 2007 in Rio de Janeiro, the second most populous city in Brazil.
Dengue is presently considered the most relevant arthropod-borne viral disease in terms of morbidity and mortality, affecting tens of millions of people annually worldwide e.g., [13][14][15][16] .Its main vector is the mosquito Aedes aegypti, usually found in urbanized areas in tropical and subtropical countries.There are four serotypes of dengue virus.The variation number 2 was the predominant serotype circulating in the city of Rio de Janeiro in 2007 and 2008 e.g., 17, 18 .Dengue is transmitted to humans by the bite of an infected mosquito, which became infected with dengue virus after biting an infected person.Because of the current lack of vaccines or specific drugs against dengue fever e.g., 14 , it is crucial to predict dengue epidemics, in order to concentrate efforts on combating the vector.In fact, the strategy commonly adopted for controlling the spreading of this disease involves the chemical attack against the mosquito using insecticides and the reduction of natural or human-made uncovered water containers like empty bottles and old automobile tires that store rainwater and can support the mosquito proliferation e.g., 13 .
There are several mathematical studies on dengue dynamics by using autoregressive models e.g., 19, 20 , cellular automata e.g., 21 , genetic algorithms e.g., 22 , neural networks e.g., 23 , ordinary differential equations e.g., 24, 25 , and partial differential equations e.g., 26 .Here we employ our originally developed approach 12 based on CA and GA to build models able of reproducing the number of dengue cases recorded in Rio de Janeiro during 2007.Then, these models are used for weekly predicting the cases of sick persons in the course of 2008.
In our CA models, humans live in a square matrix formed by n × n n 2 cells.These CA models can be different in the number of states available for each cell two states if SIS, three states if SIR , in the lattice size here 500 × 500, 1000 × 1000 or 2420 × 2420 , in the connection topology in the CA lattice here cross-shaped or square-shaped neighborhood , in the maximum radius r where connections can be made here r 1 or 2 , and in the use of time-varying or non-time-varying probabilities associated with the state transition S → I.
This manuscript is organized as follows.In Section 2, the epidemiological models written as CA rules are introduced.In Section 3, the GA employed for identifying the probabilities of the transition S → I is described.In Section 4, the numerical results obtained from different kinds of CA models are presented.In Section 5, the main conclusions are stressed.

SIR and SIS Models Based on CA
when the mosquito is uniformly distributed over the space.From September 2006 to March 2008, the densities of Aedes aegypti weekly registered in urban and suburban areas of Rio de Janeiro produced similar time series 27 .Thus, in these areas, the mosquito distribution could be roughly taken, at each week, as uniform but there were significant differences when compared to suburban slum areas .In addition, SIR e.g., 28 and SIS e.g., 29 models that do not explicitly consider the mosquito concentration were already employed in numerical studies on dengue spreading.These experimental and theoretical works give some support to our simplification about dengue transmission.
The time step of a CA simulation corresponds to one week of real time, which is the mean duration of the latent and the infective periods e.g., 13, 16 .In our model, at each time step t, there is a probability P i v of a S-cell being infected, where P i depends on the number v of infective neighbors.We assume that the probability P i of the transition S → I obeys the constraints: P i 0 0 i.e., a susceptible individual can contract the disease only if v > 0 and if v 1 < v 2 then P i v 1 ≤ P i v 2 ; i.e., P i v is a monotone increasing function of v .A S-cell that acquires the disease in t will be infective only in t 1.And, after one time step, each I-cell becomes a R-cell in a SIR model or a S-cell in a SIS model.Notice that the transitions I → R and I → S are governed by deterministic rules.
In a SIR model, a R-cell stays in such a state, representing either death or perfect and sustained immunity against the dengue virus; more precisely, against the specific serotype causing the disease e.g., 13,15 .In a SIS model, an I-cell returns to the susceptible state after either death or recovery, meaning either birth of a susceptible cell in the place of a cell killed by the infection or no cross-protective immunity against the other serotypes.In fact, cross-immunity is detected only during a few months, corresponding to a short-term effect e.g., 15 ; hence, it was not considered here.In both kinds of models, the total number of cells n 2 remains constant.The states of all cells are simultaneously updated at each time step.
In many mathematical studies on disease propagation using CA, the contact network which defines the neighborhood of each cell in the lattice is taken as a regular structure e.g., 3, 4 .In this context, cross-shaped e.g., 30 and square-shaped e.g., 31 neighborhoods are commonly employed.Cross-shaped neighborhood of radius r of a cell is formed by the cells orthogonally surrounding such a cell until the distance r for instance, if r 1 this neighborhood comprises the 4 closest neighbors: left, right, up, and down ; square-shaped Moore neighborhood of radius r of a cell is constituted by all cells surrounding such a cell until the distance r for instance, if r 2 the Moore neighborhood of a cell is formed by the 24 cells pertaining to the square matrix of size 5 centered in such a cell .Both kinds of neighborhoods with r 1 and 2 are utilized in our epidemiological models.
With these models formulated in terms of CA rules, we intend to fit the number of dengue cases observed in Rio de Janeiro, during the year of 2007, by finding appropriate values for P i .The values of these probabilities related to the transition S → I are determined by using a GA.The parameter values, the features of the genetic operators and details of the chromosome selection process of the GA were defined after preliminary tests, in which the performance of our scheme based on CA and GA and the processing time required for the numerical simulations were evaluated.A previous work 12 also guided these choices.

GA Used for Identifying the CA Models
Each generation of the GA is formed by 40 chromosomes 40 candidate-solutions .The length of the chromosomes depends on the kind and on the radius of the neighborhood chosen in the CA lattice.For instance, when cross-shaped neighborhood of r 1 is employed, each Point 2 Figure 1: Two points are randomly selected in two randomly picked chromosomes the parents and the genes in gray between these two points are swapped.Thus, two new chromosomes the children are created.
chromosome is formed by 4 genes the values of P i 1 , P i 2 , P i 3 , P i 4 ; when squareshaped neighborhood of r 1 is used, each one consists of 8 genes the values of P i 1 , P i 2 , . . ., P i 8 .The fitness of the chromosomes is evaluated in simulations with the CA models.
Firstly, the GA used in the identification process with time-varying P i v, t is described.In this case, the fitness of a chromosome is high if the number of infective cells appearing in the CA lattice in the tth time step is close to the datum recorded in the tth week of 2007.The fitness function F t,q of the qth chromosome at the time step t is written as where q 1, 2, . . ., 40; x t is the number of historical cases registered in the tth week in the year of 2007; x CA t,q is the number of infective individuals obtained in the CA model in the tth time step with the qth chromosome.
An initial population of 40 chromosomes is randomly generated, and the value g of their genes must always be real numbers between zero and one, that is, g ∈ 0, 1 , because they represent probabilities.These chromosomes are evaluated, and the 3 best ones the 3 chromosomes with highest fitness are kept apart.Other chromosomes are created from the current generation by applying a two-point crossover operation e.g., 32 .Such an operation is performed by selecting two points in two randomly picked chromosomes the parents and swapping the genes between these two points.Thus, two new chromosomes the children are produced.Figure 1 illustrates this operation.
Then, each gene of the set of 40 chromosomes composed by the 3 best parents plus 37 children has 1% of chance of suffering mutation, which changes the value of g of ±0.02 if after this operation g > 1, then we impose g 1; if g < 0, then we impose g 0 in order to get g ∈ 0, 1 .
After applying these genetic operators for producing a new generation of chromosomes, the values of g in such chromosomes are reordered if necessary as shown in Figure 2, because P i must be a monotone increasing function of v. Thus, a new generation is created and can be evaluated as described above.Figure 2: By comparing to Figure 1, observe that the values of the genes 5 and 6 in the child 3 were reordered, because P i must be a monotone increasing function of v.
The system identification performed with time-varying P i v, t can be summarized as follows.Initially, 40 chromosomes are randomly generated and each one is evaluated in a CA lattice, by using their probabilities of the state transition S → I to obtain the number of infective cells in the second time step the second week of the historical series of 2007, t 2 .Of course, the initial condition in the CA corresponds to the number of infective cases in the first week of 2007 t 1 .The transitions I → R in a SIR model and I → S in a SIS model are deterministically performed.Then, the numbers of sick cells in the 40 lattices one lattice for each chromosome are compared to the historical record.The 3 best chromosomes obtained from this comparison plus 37 chromosomes created by crossover are selected for suffering mutation.Thus, a new generation of 40 chromosomes is formed.Notice that one GA generation corresponds to one time step in the CA lattice.This is repeated for the other 50 weeks of 2007.Also, the whole identification process is fully executed 10 times.In the end, the best set of 51 chromosomes is picked out the 51 chromosomes able of weekly reproducing the numbers of infections in 2007 .This set is employed for predicting the dengue cases in 2008.
The GA used in the identification process with non-time-varying P i v is a little different.In this case, the goal is to obtain constant values for P i v that can satisfactorily fit the number of sick persons observed during 51 weeks from the initial condition which corresponds to the dengue cases in the first week of 2007 .Initially, 40 chromosomes are randomly created and they are individually evaluated for 51 time steps in a CA lattice again, one lattice for each chromosome .The fitness function G q of the qth chromosome is defined as follows: Notice that the lower the absolute value of the difference between the grand total recorded during 2007 and number of sick individuals appearing in the CA in all 52 time steps, the higher the fitness.The next GA generation is created as already described, by applying mutation in the 3 best parents plus 37 children.Notice that here one GA generation corresponds to 51 time steps 51 weeks in the CA lattice.This GA is executed by 25 generations.In the end, the best chromosome is picked out the one able of better fitting the number of infections occurred in the course of 2007 .This unique chromosome is utilized for making predictions about the infections in 2008.

Results
The simulations were performed by taking into account the dengue cases weekly recorded by the Health Department of the city of Rio de Janeiro, Brazil, in 2007 and 2008 33 .In The CA models can be SIR or SIS; with lattice size n 500, 1000, or 2420; with crossshaped or square-shaped neighborhood; with neighborhood radius r 1 or 2. Thus, there are 2 × 3 × 2 × 2 24 kinds of models.For n 500, each CA cell represents approximately 23 individuals; for n 1000, 6 individuals; for n 2420, 1 individual.
Table 1 presents the ranking of these 24 kinds of models in the identification stage, by using time-varying P i v, t .The lower the rank number, the lower the difference between the series generated by the CA model and the historical data from 2007.The number in the last column is the average error obtained in 10 simulations for each kind of model.The error in a unique simulation is given by 52 t 1 |x t − x CA t |, where x t is the number of dengue cases in the tth week and x CA t is the number of infective individuals in the tth week obtained in the CA simulation with the best chromosome found in that week in the identification process.The value of x CA t is calculated from the number of I-cells by using a spatial scale factor determined by n.Remember that x 1 x CA 1 .Table 1 shows that all CA models can satisfactorily fit the data: the average errors are of the order of 1%-20% of the grand total of cases recorded in 2007.As an example, Figure 3 exhibits the historical series of 2007 solid line and the series generated  by a CA model dotted line with SIS, cross-shaped neighborhood, r 2, and n 1000; the kind of model in fourth place in Table 1.Table 2 presents the ranking of these 24 kinds of models with time-varying P i v, t in the prediction stage.The initial amount of I-cells in the CA lattice is determined from the cases registered in the first week t 1 in 2008.The numbers of I-cells in the next 51 weeks are calculated by using the values of P i v, t corresponding to the best 51 chromosomes found in the identification stage.Thus, our ambitious intention was to make a prediction of 51 steps ahead.Observe, however, that the average error was great: even for the model in first place in Table 2, the average error is of the same order of the grand total of cases registered in 2008.In spite of that huge average error, some models found in the identification process were able of correctly predicting the outbreak occurred in 2008.Figure 4 presents the historical series of 2008 solid line and the series generated by the same CA model dotted line of Figure 3. Notice that there is a reasonable agreement between the two series.The set of chromosomes the set of P i v, t with t 1, 2, . . ., 51 that better identified the 2007 series was obtained with SIS, cross-shaped neighborhood, r 2, and n 500 the average performance of this kind of model appears in the second place in Table 1 .We performed 10 simulations with this set and calculated the average values of sick individuals in each week of 2008 and the corresponding standard deviations.Figure 5 shows the results.Notice that this model predicts a sustained high number of dengue cases after the 20th week, which in fact did not occur.However, the prediction for the first 10 weeks can be considered satisfactory.
The set of chromosomes that better predicted the 2008 series was obtained with SIS, cross-shaped neighborhood, r 2, and n 1000 the CA model employed in Figures 3 and 4 .We executed 10 simulations with this set and determined the average values of sick persons in each week of 2008 and the corresponding standard deviations.Figure 6 exhibits the results.Observe the good qualitative and quantitative agreements: the average number of cases in the CA lattice during 2008 is 129699, which is close to the real grand total 126326 cases .Also, both series have a peak around the forth month.
The set of chromosomes that better anticipated the number of cases recorded in the second week of 2008 was obtained with SIR, cross-shaped neighborhood, r 2, and n 2420.In 10 simulations, the average value predicted by this CA model is 3145, which is very close to the real number 3148 cases .The week-by-week prediction for the whole year of 2008 is not good, as shown in Figure 7.However, the difference between the total of cases recorded in 2008 and the total predicted by this model is less than 1%.
By using non-time-varying P i v , the results were not too good.As examples, Figures 8 and 9 present the series generated by a CA model in the identification and prediction stages, respectively, with SIR, square-shaped neighborhood, r 1, and n 500.The reason for the failure of the approach with non-time-varying P i v can be the seasonality of dengue fever: its incidence usually increases after the beginning of the rainy season and decreases after the beginning of the winter.Thus, the probabilities P i of the transition S → I change during the year, because the mosquito concentration varies in function of climatic factors, such as rainfall and temperature e.g., 15, 27 .But notice that in Figure 9 the predicted incidence enhances  in the first weeks of 2008, which in fact did happen.Therefore, models with non-time-varying P i v can also give an acceptable prediction, at least for a few time steps ahead.

Conclusions
Dengue fever is a public health problem in more than 100 countries, like Brazil e.g., 17 , China e.g., 20 , and Mexico e.g., 22 .Forecasting its incidence is crucial to guide prophylactic actions.Here we used data recorded in the city of Rio de Janeiro in 2007 and 2008 in order to find out the kinds of epidemiological models formulated as CA rules able of better fitting and predicting its occurrence.Tables 1 and 2 reveal that the choice of the model structure is critical: SIS models appear to be more convenient for identification while  SIR models for prediction.Neighborhood radius r 2 seems to be more adequate for identification; r 1, for prediction.These tables also show that cross-shaped or squareshaped neighborhoods do not produce very different results and lattice size n 500 can be considered a good choice of spatial scale for a discussion about the influence of the lattice size, see 12 .Moreover, time-varying probabilities P i v, t are essential for capturing the seasonality of this infectious disease.
Of course, a prediction of 51 steps ahead is too challenging.In a real situation, the probabilities P i v, t could be periodically updated by considering the latest weekly records.However, in spite of this ambitious goal, we found models that correctly predicted the epidemic occurred in 2008 just from the cases recorded in the first week of this year and by using the probabilities P i v, t determined from the 2007 series.This framework based on CA and GA should be applied to other diseases and cities, in order to check the validity of

Figure 3 :
Figure 3: Historical records from 2007 solid line and a series obtained with a CA model based on timevarying P i v, t , SIS, cross-shaped neighborhood, r 2, and n 1000 dotted line .

Figure 4 :
Figure 4: Historical records from 2008 solid line and a series obtained with the same CA model used in Figure 3 dotted line .

Figure 5 :
Figure 5: Historical records from 2008 solid line and the average temporal evolution of infective individuals obtained in 10 simulations with a CA model based on time-varying P i v, t , SIS, cross-shaped neighborhood, r 2, and n 500 dotted line .The bars correspond to the standard deviations.

Figure 6 :
Figure 6: Historical records from 2008 solid line and the average temporal evolution of infective individuals obtained in 10 simulations with the same CA model of Figure 3 dotted line .The bars correspond to the standard deviations.

Figure 7 :
Figure 7: Historical records from 2008 solid line and the average temporal evolution of infective individuals obtained in 10 simulations with a CA model based on time-varying P i v, t , SIR, cross-shaped neighborhood, r 2, and n 2420 dotted line .The bars represent the standard deviations.

Figure 8 :
Figure 8: Historical records from 2007 solid line and a series obtained with a CA model based on nontime-varying P i v , SIR, square-shaped neighborhood, r 1, and n 500 dotted line .

Figure 9 :
Figure 9: Historical records from 2008 solid line and a series obtained with the same CA model used in Figure 8 dotted line .

Table 1 :
Performances of the models in the identification process by using the historical series of 2007 .
2007, 5857900 people lived in this city and we assume that this number also represents the whole population in 2008.In 2007, 25107 people get sick by dengue; in 2008, 126326 people, a number five-times greater than in 2007.Could this epidemic have been predicted by some of our CA models in the first week of 2008, when 1510 cases were recorded?