Use of Indicator Kriging to Investigate Schistosomiasis in Minas Gerais State, Brazil

Geographic Information Systems (GISs) are composed of useful tools to map and to model the spatial distribution of events that have geographic importance as schistosomiasis. This paper is a review of the use the indicator kriging, implemented on the Georeferenced Information Processing System (SPRING) to make inferences about the prevalence of schistosomiasis and the presence of the species of Biomphalaria, intermediate hosts of Schistosoma mansoni, in areas without this information, in the Minas Gerais State, Brazil. The results were two maps. The first one was a map of Biomphalaria species, and the second was a new map of estimated prevalence of schistosomiasis. The obtained results showed that the indicator kriging can be used to better allocate resources for study and control of schistosomiasis in areas with transmission or the possibility of disease transmission.


Introduction
Schistosomiasis mansoni is an endemic disease, typical of developing countries [1,2]. In Brazil, the schistosomiasis is caused by the etiological agent Schistosoma mansoni, whose intermediate host is species of mollusk of the Biomphalaria genus.
The S. mansoni was introduced in Brazil by the African slavery trade during the sixteenth century [3]. The Brazilian port of Salvador and Recife received most of the slaves [4], originated from endemic regions. In the early eighteenth century, there was a large migration of slave labor due to the decline of sugar production in the Northeast of Brazil and the discovery of gold and diamond in the Minas Gerais State. It is estimated that one fifth of the population at that time moved to Minas Gerais [5], using the "ways of São Francisco" [6] as the main access route. Probably, in these early migrants also came to schistosomiasis.
In Minas Gerais, there are seven species of Biomphalaria genus: B. glabrata, B. straminea, B. tenagophila, B. peregrina, B. schrami, B. intermedia, and B. occidentalis [7][8][9]. Come in these, only B. glabrata, B. tenagophila, and B. straminea have been found naturally infected by S. mansoni [10,11]. B. glabrata is of great epidemiologic importance, due to its extensive geographic distribution, high infection indices, and effectiveness in the schistosomiasis transmission. Moreover, its distribution is almost always associated with disease occurrence [12]. B. tenagophila was found naturally infected by S. mansoni in state of Minas Gerais, and it is responsible for the focus maintenance in the city of Itajubá [13]. B. straminea, although had not been found infected in state of Minas Gerais, was considered responsible for Paracatu's focus [14].
They are commonly found in a wide of habitats, both natural (streams, creeks, ponds, swamps) and artificial (irrigation ditches, small dams), particularly in shallow and slow running waters (less than 30 cm/s), where the substratum can be the muddy or rocky bed and with floating or rooted vegetation, pH between 6 and 8, NaCl content below 3 by 1000, and mean temperature between 20 and 25 degrees C [15][16][17].
The study of the habitat of these mollusks, as well as their behavior in relation to the climate, results in valuable information when the goal is the disease transmission control [18].
Environmental and socioeconomic factors may influence the spatial distribution of schistosomiasis. Under these circumstances, the Geographic Information System (GIS) can be applied to characterize, to better understand the interconnection of these factors, and to provide a more complete picture of disease transmission [19]. GIS allows a complex analysis of a large number of information and displays the results of this analysis in graphical maps. These techniques have become important tools for the design and implementation of control programs [20], enabling a better distribution of state resources to allow a direction more suitable for disease control [21][22][23]. Among these tools, we can cite the indicator kriging, which allows to data spatialization aiming at map generation. It also gives information about inference uncertainties that can be used as quality restrictions of the classification process.
This study is a review of the use the indicator kriging of the Georeferenced Information Processing System (SPRING) to make inferences about the presence of the species of Biomphalaria (B. glabrata, B. tenagophila, and/or B. straminea), intermediate hosts of Schistosoma mansoni. Also, using numerical indicator kriging, a new map of estimated prevalence of schistosomiasis, in areas without information in the Minas Gerais State, Brazil, is presented.

Methodology
Kriging may be defined as a technique of statistical inference, which allows the estimation of values and the uncertainties associated with the attribute during the spatialization of a sample property [24].
To achieve the objectives, two approaches have been considered: categorical and numerical indicator kriging. The categorical indicator kriging was based on the information of the mollusks species, and the numerical indicator kriging used data from the prevalence of schistosomiasis.
The procedure for adjustment of the semivariogram is not straightforward and automatic, but interactive, because the interpreter does the first adjustment and checks the adequacy of the theoretical model [25]. After the models fitted for each class (categorical) or quartile (numerical), the indicator kriging was applied to obtain an approximation of the conditional distribution function of random variables.
The numerical indicator kriging was conducted in the entire state using the schistosomiasis prevalence data (lower quantile, median and upper quantile) from 999 localities.
The categorical indicator kriging was performed in each of the fifteen river basins (Buranhém, Doce, Grande, Itabapoana, Itanhém, Itapemirim, Jequitinhonha, Jucuruçu, Mucuri, Paraíba do Sul, Paranaíba, Pardo, Piracicaba/Jaguari, São Francisco and, São Mateus) using the mollusk data. The mollusk attributes (class of species and localization) were distributed along the drainage network of 15 River Basins, according to the methodology used by Guimarães et al. [22]. The indicator kriging was done in the software SPRING [26]. In Appendix A is described the geostatistical modeling used in the indicator kriging. Gerais had its beginning in 1986, and since 2000 has been under the coordination of the SESMG in collaboration with Municipal Health Systems. The PCE prevalence information is available for municipalities and localities [18]. The Kato Katz technique is the methodology used to determine prevalence, examining one slide per person. The spatial distribution of the schistosomiasis prevalence is presented in Figure 1(a), for the 255 municipalities and for the 999 localities used in this study.
Data on the distribution of Biomphalaria mollusks were provided by the Laboratory of Helminthiasis and Medical Malacology of the René Rachou Research Center (CPqRR/Fiocruz-MG). Mollusks were collected in breeding places from different municipalities in Minas Gerais at different periods, using scoops and tweezers, and then packed to be transported to the laboratory [27]. Specific identification was performed according to the morphology of the shells, reproductive system, and renal ridge of the mollusks [28][29][30][31][32][33], and also by low stringency polymerase chain reaction and restriction fragment length polymorphism [34]. The spatial distribution of the Biomphalaria species data are presented in Figure 1

Three species
No data 0 50 100

Test of the Hypothesis for Differences between PCE and Indicator Kriging Estimated Prevalences.
To assess whether the estimates made by kriging methods were close to those obtained by PCE, initially estimates were made by municipalities, by averaging the estimated prevalence for all the grid points belonging to that municipality. A regression line was then adjusted, using the prevalence provided by the PCE as dependent variable and the prevalence provided by kriging as independent variable, that is, where P PCE and P K are the PCE and kriging prevalences, respectively. A hypothesis test was then performed to determine, with a 95% confidence level, whether the intercept was zero and the slope parameter was equal to 1: If the null hypothesis is accepted, it can be concluded that, in average, the kriging estimates are equal to PCE prevalences at a 95% confidence level.

Results and Discussions
The indicator kriging procedure, based on the fitted semivariograms, was applied using the sample data presented in Figure 1, to generate a regular grid of 250 meters of resolution (x, y) over the Minas Gerais State.
The following results were obtained to achieve the objectives.

Using Categorical Data (Thematic).
The resulting map of the species distribution generated by applying the mode estimator (Appendix A.2.-(A.6)) is presented in Figure 2(a). Figure 2(b) presents a map of the uncertainties associated with the classification, computed by using (A.7) of Appendix A.2. The map of uncertainties shows that the higher uncertainties are concentrated among class transition areas.
The methodology was validated using a sampling procedure. The fieldwork was conducted in the São Francisco and Paraíba do Sul River Basins where no information existed about the presence of the mollusks. More details about the results for the São Francisco and Paraíba do Sul River Basins can be found in Guimarães et al. [22] and Guimarães [18] and Tibiriçá et al. [9].
The research of mollusks was accomplished in five municipalities in the São Francisco River Basin (SFRB) and in nine municipalities in the Paraíba do Sul River Basin (PSRB). The mollusks collected were sent to the analysis of the species in the Laboratory of Helminthiasis and Medical Malacology of the René Rachou Research Center (CPqRR/Fiocruz-MG). Also, the mollusks collected in the PSRB were identified at the Parasitology Laboratory in the Federal University of Juiz de Fora (UFJF) and the Entomology Laboratory of the GRS/JF, Secretary of Health in the State of Minas Gerais (SESMG). Collection and identification of the mollusks were performed according to the methodology described in Section 2.1.
Figures 2(c) and 2(d) show the historical Biomphalaria species (Figure 1(b)) at the surveyed municipalities of São Francisco and Paraíba do Sul River Basins, respectively. 0 50 100    Table 1 presents the estimated and found species, as well as the value of the uncertainty mean for searched municipality, and the collection points, where these species had been found.
To explain the differences in the two basins, some considerations should be made. Figure 2 shows the spatial distribution of Biomphalaria species according to historical data. We can observe from this figure that the SFRB (Figure 2(c)) has a better spatial distribution of species surveyed than the PSRB (Figure 2(d)).
The municipalities surveyed in the SFRB had 100% accuracy with at least one specie estimated, but this value for the PSRB was 66.67%.
About 50% of the municipalities of SFRB have historical information about the Biomphalaria species in one of eight classes (B. glabrata, B. tenagophila, B. straminea As kriging is affected by the amount and spatial distribution of input data, this may explain the differences in the two basins. This fact is also reflected in the uncertainties of the estimates. Comparing the values presented in Table 1, the SFRB had an overall uncertainty mean of 0.232, and the PSRB had an overall uncertainty mean of 0.332. Therefore, the overall uncertainty for PSRB is 43.1% greater than for SFRB. Figure 3 shows the spatial distribution of schistosomiasis-estimated prevalence by kriging (Figure 3(a)), map of the uncertainties (Figure 3(b)), estimated prevalence by kriging by type of classes (Figure 3(c)), and in Figure 3(d) the mean estimated prevalence by kriging for the 255 municipalities where the PCE prevalence information is available. The PCE prevalence values (Figure 1(a)) and the respective kriging estimates (Figure 3(d)) were plotted together. Figure 4 shows the scatter plot as well as the regression line. The hypothesis test (A.12) was performed, and the null hypothesis was accepted, indicating that there is no significant difference between the PCE prevalence and the kriging prevalence means, with a significance level of 0.05. Table 2 presents the comparison between the prevalence of PCE and the prevalence estimated by kriging by type of classes: low (prevalence among 0.001 to 5), medium (5.001 to 15), and high prevalence (above 15). From Table 2, it can be noted that (i) 60.8% of the municipalities are estimated in the same class as they belong;

Using Numerical Data.
(ii) 37.2% of the municipalities had the prevalence estimated in the adjacent class, that is, from low to medium class, from medium to high class, from high to medium class, or from medium to high class; (iii) less than 2% of the municipalities had the low prevalence estimated a high class; (km) N 0.001-5.000 5.001-15.000

>15.001
No data 0 50 100  (iv) when the estimated class is not in the same class of the PCE, the kriging has a trend of about 25.9% to overestimate and about 13.3% to underestimate the prevalence values.

Conclusions and Future Work
Indicator kriging showed to be a rather robust tool since its results presented a very good agreement with the field findings. This tool allowed to determine and to delimit, respectively, the distribution of the Biomphalaria species and the areas of risk (map of uncertainty of the Biomphalaria species).
Kriging is an auxiliary useful tool to guide the fieldwork, indicating the places with higher probability of occurrence of the considered species, with particular attention to those Journal of Tropical Medicine 7 species that are more important for disease transmission. The results of this tool can be used to better allocate the always limited resources for distribution studies and the development of strategies for mollusk control.
Some important issues, related to the nature and precision of the Biomphalaria species data, need to be considered when looking at the results: the data were obtained from historical records (most occurring before the broad usage of GPS equipment), and the information is given in a municipality level basis. Because of this, an assumption was made that the species found in the municipalities are uniformly distributed inside the municipality drainage network. The authors believe, however, that other type of distribution hypothesis would not greatly affect the results.
To improve the accuracy of an estimate using kriging, it would be necessary to obtain data with better location and spatial distribution of the information collected in the fieldwork.
Also, the kriging proved to be a suitable tool, and their results showed a good agreement with the PCE data.
This technique can be used to estimate the schistosomiasis prevalence in the municipalities of Minas Gerais where the prevalence is not determined by the PCE. The results of this tool can be used to better allocate resources for studies in areas with medium and high prevalence.
The entire methodology of this study used free software allowing the playback of the methodology in other states of Brazil where there is no information about the type of Biomphalaria and/or schistosomiasis prevalence at no cost.
Conditioned to appropriate funds existence, an extensive malacological survey is recommended for better evaluation of the methodology and also GPS utilization in all future fieldworks.
It is also recommended to obtain data on the schistosomiasis prevalence in western Minas Gerais (nonendemic region). Thus, one can obtain a better estimate of prevalence at the state level and not only in the endemic area.

A.
A.1. Indicator Kriging. A spatial attribute inside a region A of the earth surface can be modeled, from a geostatistical point of view, as a random function [35,36]. For each spatial position u ∈ A, the attribute value of a spatial data is modeled as a random variable (RV) Z(u), which can assume different values with an associated probability of occurrence. In the n sample positions u α , α = 1, 2, . . . , n, the values Z(u) are considered deterministic, that is, they can be considered RV's with probability 100% of occurrence. The distribution function of Z(u), conditioned to n sample points, can be defined by Deutsch and Journel [36] F(u; z | (n)) = Prob{Z(u) ≤ z | (n)} for a numerical attribute, F(u; z | (n)) = Prob{Z(u) = z | (n)} for a categorical attribute. (A.1) The function F(u; z | (n)) models the uncertainty about the values of z(u), in nonsampled positions u, conditioned to the n samples. This function can be evaluated from the inference procedure-denominated indicator kriging.
The indicator codification of the RV Z(u), for a cutoff value z = z k , generates the indicator RV I(u; z k ) using the following nonlinear mapping function: for a numerical attribute, for a categorical attribute. (A. 2) The cutoff values, z k , k = 1, 2 . . . , K, are defined depending on the number of samples. It is necessary that the amount of codified samples with value 1 are enough for defining, with success, a variography model for each cutoff value [37].

(A.3)
This estimation, made through ordinary kriging over the indicator values, yields a least-square estimate for the distribution function at the cutoff value z k [36]. A set of F * (u; z k | (n)) estimates, obtained for different k cutoff values, can lead to an approximation of the conditional distribution function (cdf) of Z(u).
The expression of the ordinary indicator kriging estimator is given by: The weights λ Oα (u; z k ) are obtained by solving the following system of equations: where φ(u; z k ) is a Lagrange parameter, h αβ is the vector defined by the positions u α and u β , h α is the vector defined by the positions u α and u, C 1 (h α ; z k ) is the autocovariance defined by h αβ , and C 1 (h α ; z k ) is the covariance defined by h α . The autocovariance is determined by the theoretical variography model defined by the set I when z = z k .
The indicator kriging, simple or ordinary, gives, for each cutoff value k, an estimator which is also the better least square estimator of the conditional expectation of the RV I(u; z k ). Using this property, it is possible to compute estimates of the cdf values of Z(u) for several values of z k , belonging to the Z(u) domain. The set of the estimated values of the cdf 's of Z(u), for the cutoff values, is considered a discrete approximation of the real cdf of Z(u). The greater the number of cutoff values, the better the approximation.
The indicator kriging is nonparametric. It does not consider any type of a prior distribution for the random variable. Instead, it permits the construction of a discrete approximation of the cdf of Z(u). The values of the discrete probabilities can be directly used to estimate the values that characterize the distribution, such as average, variance, mode, and quantiles.

A.2. Estimation and Uncertainty Evaluation for Categorical
(Thematic) Data. The optimal estimation z * (u), for a categorical attribute represented by K classes c k , k = 1, . . . , K, can be defined as where p k (u) = F * (u; z k | (n)), when z k = c k , is the estimated probability of the class k at the location u. This estimator is known as the mode estimator since it considers the highest probability, and the classifier that assigns classes at each position u using the mode estimator is known as mode classifier. The species classification probability for each position, inside the region of interest, is obtained with the values for indication (0 or 1) of the neighboring known samples (nearer) of u. This probability is estimated by using ordinary kriging, where the weighted value of each neighboring sample of u is calculated considering the variability model defined by the semivariograms for each class.
Usually, inferences about local uncertainties for categorical attributes, Unc(u), are made using the complement of the mode probability, which is defined as (A.7) The mode estimator, as defined in (A.4), has the advantage of classifying all locations u inside the region of interest. An alternate methodology consists in considering the uncertainties as restrictions to the classification process.
In this case, only locations with an uncertainty below a predefined threshold (U max ) are classified, that is z * (u) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎩ c k iff p k (u) > p i (u) ∧ Unc(u) < U max ∀i, k = 1, . . . , K, i / = k, φ otherwise, where z * (u) = φ means that the value was not estimated or the location was not classified. when the variable presents a symmetry level that allows to assume the normality hypothesis. For distributions highly asymmetrical, a more robust measure is the interquartile range defined as the difference between the highest and lowest quartile:

A.3. Estimation and Uncertainty Evaluation for Numerical
From the inferred cdf, F * (u; z | (n)), it is possible to derive various probability intervals such as the 95% interval [q 0.025 ; q 0.975 ], such that, Prob Z(u) ∈ q 0.025 − q 0.975 | (n) = 0, 95 (A.11) with q 0.025 and q 0.975 being the quantiles 0.025 and 0.975, respectively, of cdf. Attribute values referring to the quantiles are estimated from the fit function and of the cutoff values used in the indicator kriging. Also, uncertainties can be estimated for ranges of attribute values. The probability of a value Z(u) be within an interval (a, b] is computed as the difference between the values of cdf for the b and a thresholds: Prob{Z(u) ∈ (a, b] | (n)} = F(u, b | (n)) − F(u, a | (n)). (A.12) After model fittings, indicator kriging procedures were applied to obtain an approximation of the conditional distribution function of the random variables. Based on the estimated function, maps of mollusk spatial distributions along with the corresponding uncertainties for the entire state and also map of estimated prevalence of schistosomiasis were built.