Data Association Methodology to Improve Spatial Predictions in Alternative Marketing Circuits in Ecuador

This work proposes a methodology that reduces the error of future estimations in commercialization based on multivariate spatial prediction techniques (cokriging) considering the products with strong associations. It is based on the Apriori algorithm to find association rules in sales of agricultural products of local markets. Results show the improvement in spatial prediction accuracy after using the best association rules.


Introduction
Family farming is an economic and social sector that provides food worldwide and underpins food security within a country. In Latin America and the Caribbean, family farming is the main source of agricultural and rural employment, comprising 80% of farms and representing around 60 million people. This productive model combines agriculture, livestock, forestry, fishing, aquaculture, and grazing within the same farm and provides on average between 27% and 67% of the total food production of each country in Latin America [1].
This type of family farming has certain characteristics:
(i) High presence of family labor and family administration
(ii) It is a diverse agriculture that allows self-sufficiency but also guarantees the feeding of other families through the surplus
(iii) It has limited access to productive resources such as land, water, and working capital compared to large-scale operations
(iv) As an intangible heritage, it develops its own social and cultural dimension, which generates intergenerational links for the transfer of knowledge, traditions, and customs
(v) It generates social and community ties through the creation of cooperatives
Due to its productive and social characteristics, family farming is not a sector solely focused on production but also on the commercialization of products. The active participation of these farmers, focused on producing food for consumption in markets, is part of the sustainable and participatory development of the sector. However, there are several limitations, such as the geographical dispersion of the farms, the production volumes of each family farm, and the limited capacity to meet the quality standards that marketing chains demand for access to markets. As indicated by Contreras et al., the lack of adequate coordination between consumer and producer in the production and marketing system prevents an adequate response to new demands or to consumer dissatisfaction [2].
In response to this problem, local initiatives have emerged from family farmers to access markets, such as the alternative marketing circuit (CIALCO). These spaces, which promote direct encounters between producers and consumers, have in recent years acquired high importance within public policy agendas for the development of family farming. This importance is linked to the local valuation of production through the promotion of more local food consumption focused on the agrobiodiversity of each territory [3]. These short marketing circuits are characterized as follows:
(i) Low or no presence of intermediation in the commercialization of products
(ii) Generation of bonds of trust and closeness between producers and consumers
(iii) Valuation of the seasonality of production of each product
This type of direct marketing can take different forms, such as public purchase, fairs, or local markets for the sale of food, among other modalities. In Ecuador, these initiatives promoted by family farmers are supported by the Ministry of Agriculture and Livestock through the strengthening of local fairs that bring producers and consumers together, the baskets of family farming, the export of family farming products, and public purchase, among other initiatives. At the same time, this promotion aims to generate agricultural, environmental, and social policies that respond to the challenges and needs faced by family farmers and that make public action effective and equitable, supporting the sustained development of this sector. So far, one of the constraints on generating public policies suited to the needs of family farming is the scarcity of information on production and on the income obtained from the marketing of its products [4].
The research objective is to generate a methodology to improve the prediction of commercialization of different agricultural products using geostatistical spatial data mining techniques. It uses the existing data for the year 2014 to generate future scenarios for the evaluation of public policies that help the development of the family agricultural sector. Ecuador is crossed by the equatorial line, which means that its territory lies both north and south of latitude zero (Figure 1(a)). The provinces of Tungurahua and Chimborazo are located in the south-central region (Figure 1(b)). From these two provinces, information has been collected on the sale of agricultural products in the so-called alternative marketing circuits (CIALCO).
The paper is organized in six sections. It begins with a description of alternative marketing circuits, focuses on the main problem, which is the lack of historical data needed to apply statistical techniques, and presents the alternative use of data mining algorithms to generate future estimates of the commercialization of agricultural products by peasant families in specific places in the provinces of Tungurahua and Chimborazo in Ecuador.
In the second section, we present other works that apply the data mining techniques used in this research to different domains, establishing the validity of algorithms such as association rules, kriging, and cokriging, together with works in the same line of research.
In Section 3, the theoretical description of the methods and data mining algorithms is presented, emphasizing their mathematical development, together with a description of the data on which the different processes are applied. Section 4 describes the methodology proposed to improve future estimates of the commercialization of products: a multivariable function is generated using the products resulting from the association rules, and it is verified whether the prediction errors tend to decrease.
In Section 5, the proposed methodology is applied and, with the values obtained, the percentage of error reduction in the predicted values is calculated using multivariate algorithms.
To conclude, in the last section, using the percentage of improvement in future predictions, a tomato production scenario is presented graphically for the provinces of Tungurahua and Chimborazo in Ecuador, which makes it possible to establish policies to improve the functioning of this type of circuit.

Related Work
This section overviews relevant previous work related to this research, in both theoretical and practical aspects. Studies in various fields of science use association rules as a criterion to establish future estimates; for example, the authors of [5] analyze the stock of a supermarket, and [6] predicts admission decisions for students. In works like [7], a relationship between association rules and fuzzy classification is established.
In [8], the mathematical development of kriging and cokriging as surrogate models within an optimization framework is explained. In [9], the authors propose to improve the construction of the variogram using magnitude and direction information, applied to data from the National Network of Geomagnetic Observatories of China.
Despite a search for specific works, there is very little documentation on improving future estimates using association rules. As of 2016, among the works that focus specifically on this topic, [10] gives the first guidelines for establishing the most consumed products and the best association rules, which produce the first patterns of highest consumption of family farming products in Ecuador; using association rules, it generates the first scenarios of product consumption.
The second work [11] uses the set of products obtained from the Apriori algorithm to improve, by between twenty and thirty percent, the estimates of future sales using time series, as measured by the mean squared error; this work establishes scenarios regarding the periodicity of a product. The third work [12] focuses on estimating the commercialization of products using their geographical location and its influence on improving consumption predictions based on the set of items resulting from the application of association rules. Research in statistical science has concepts mature enough to deal with this type of approach with great solvency; however, in this particular case, there is no sequence of data spanning several years that would allow statistical techniques to be used. Available data are limited to 2014, creating an appropriate scenario to test data mining techniques that make it possible to establish future estimates of product consumption. With these results, it is expected to generate scenarios in which the implementation of policies to improve alternative marketing circuits can be evaluated.

Methods and Materials
This section presents in detail the theoretical basis of the data mining processes used and describes the data on which the future estimates are made.

Association Rules and Apriori Algorithm.
The first way to establish a relationship between products in this research is based on the number of times some products appear together in a sale transaction [13]. For this, it is necessary to discretize the transactional data file, so that a product is labeled T (true) if it is part of a transaction and F if it is not. This labeling allows the support to be computed, that is, the ratio between the number of transactions in which a product appears and the total number of transactions; this calculation is first carried out for single items. Once the single-item sets that meet the minimum support are established, a similar calculation is made with two items and so on, identifying all sets that meet the preestablished minimum support. Among these sets, the rules that meet a minimum confidence are sought, i.e., if a product appears in the antecedent of a rule, the products in the consequent appear with at least the minimum confidence level. The pseudocode is shown below [14].

Pseudocode Algorithm Apriori
Step 1. Generate all item sets L with a single element whose support is at least minsup; these sets are used to form new candidate sets with two, three, or more elements.
Step 2. For every frequent item set L′ found, determine all association rules of the form L′ − J ⟶ J and select those rules whose confidence is greater than or equal to minconf.
Repeat Step 1, including the next element into L.

One of the best known algorithms to search for association rules is the Apriori method [15]. It is based on two parameters, support and confidence:
(i) The support of a rule $A \rightarrow B$ is defined as the number of instances that the rule correctly predicts, here expressed as a fraction of the N transactions:
$$\mathrm{sup}(A \rightarrow B) = \frac{n(A \cup B)}{N},$$
where $n(A \cup B)$ is the number of transactions that contain all the items of A and B.
(ii) The confidence indicates the percentage of times that a rule is met among the instances selected by the antecedent A:
$$\mathrm{conf}(A \rightarrow B) = \frac{n(A \cup B)}{n(A)}.$$

Spatial Estimation.
As mentioned in [16], "in the geographical space everything is related to everything, but the closest spaces are more related to each other". Geostatistics uses the concept of a random function to model nondeterministic values over a region D: as x ranges over the region, a family of random variables is obtained, defined as
$$\{Z(x),\ x \in D\},$$
which constitutes a random function on the domain D.
To simplify the characterization of the random function, we consider some descriptive parameters, or moments, that summarize the information. The expectation, or first-order moment, $m(x) = E[Z(x)]$, represents the average around which the values taken by the realizations of the random function are distributed. The variance is calculated as follows:
$$\sigma^2(x) = \mathrm{var}[Z(x)] = E\{[Z(x) - m(x)]^2\}.$$
The variance and its square root, called the standard deviation, constitute measures of dispersion of Z(x) around its mean value. The centered covariance between two random variables is given by the relationship
$$C(x_1, x_2) = \mathrm{cov}[Z(x_1), Z(x_2)] = E\{[Z(x_1) - m(x_1)][Z(x_2) - m(x_2)]\}$$
and gives an elementary view of the interaction that exists between $Z(x_1)$ and $Z(x_2)$. The semivariogram, defined between the two random variables, is given by the expression
$$\gamma(x_1, x_2) = \frac{1}{2}\,\mathrm{var}[Z(x_1) - Z(x_2)]$$
and reflects the way in which a point influences another point at different distances. For a stationary random function with separation $h = x_1 - x_2$, the variogram is equal to the variance minus the covariance:
$$\gamma(h) = \sigma^2 - C(h).$$

Experimental and Modeling Variogram.
If we consider the regionalized variable z known at n sites $x_1, \ldots, x_n$, the estimator of the experimental variogram for a separation vector h is defined as follows:
$$\hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{(\alpha,\beta) \in N(h)} [z(x_\alpha) - z(x_\beta)]^2,$$
where $N(h) = \{(\alpha, \beta) : x_\alpha - x_\beta = h\}$ and $|N(h)|$ is the number of distinct pairs in N(h). The experimental variogram cannot be used directly to interpret the spatial continuity of the study variable, because it is defined only for certain distances and directions; a theoretical model should therefore be fitted to the experimental variogram.
A variogram $\gamma(h)$ is isotropic if it is identical in all directions of space, that is, if it does not depend on the orientation of the vector h but only on its magnitude |h|; otherwise, it is anisotropic [17].
In general, the modeled variogram grows from the origin and stabilizes, at a distance a, around a plateau. The two random variables Z(x) and Z(x + h) are correlated if the length of the separation vector h is less than the distance a, called the reach or zone of influence; beyond |h| = a, the variogram is constant and equal to its plateau.
A spherical variogram of reach a and plateau C is defined as
$$\gamma(h) = \begin{cases} C\left[\dfrac{3}{2}\dfrac{|h|}{a} - \dfrac{1}{2}\left(\dfrac{|h|}{a}\right)^3\right], & |h| \le a, \\ C, & |h| > a. \end{cases}$$
In processes involving geostatistics, the spatial correlation is modeled by the variogram, and the process is described by a random function Z(s) composed of the mean m and the residual e(s): $Z(s) = m + e(s)$, with constant mean $E(Z(s)) = m$ and the variogram defined as
$$\gamma(h) = \frac{1}{2} E\{[Z(s) - Z(s + h)]^2\}.$$
The variance of Z is constant, and the correlation of Z does not depend on the location s but only on the separation distance h. Then, we can form multiple pairs $\{Z(s_i), Z(s_j)\}$ that have identical separation vector $h = s_i - s_j$ and estimate the correlation between them [18,19]. An experimental variogram $\gamma(h)$ is isotropic if it is identical in all directions of space; otherwise, there is anisotropy.
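The spherical model of reach a and plateau C can be illustrated with a short Python sketch. It assumes the standard form of the spherical variogram; the function name `spherical_variogram` and the sample distances are hypothetical, while the reach (0.157) and plateau (2151) are borrowed from the tomato model reported later in the paper purely for illustration.

```python
def spherical_variogram(h, plateau, reach):
    """Spherical model: grows from the origin and equals the plateau for |h| >= reach."""
    h = abs(h)
    if h >= reach:
        return plateau
    r = h / reach
    return plateau * (1.5 * r - 0.5 * r ** 3)


# Illustrative parameters: reach 0.157 and plateau 2151 (sample distances are made up).
for h in (0.0, 0.05, 0.10, 0.157, 0.30):
    print(f"h = {h:5.3f}  gamma = {spherical_variogram(h, 2151.0, 0.157):8.1f}")
```

The function returns zero at the origin, increases with distance, and stays at the plateau once the separation exceeds the reach, which is the behavior described for the zone of influence.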
If we assume isotropy, that is, that the semivariance is independent of direction, we can replace the vector h with its magnitude ‖h‖. Under this assumption, the variogram can be estimated from the $N_h$ pairs of sample data $Z(s_i)$, $Z(s_i + h)$ for a number of distances (or distance intervals) $\tilde{h}_j$ as
$$\hat{\gamma}(\tilde{h}_j) = \frac{1}{2N_h} \sum_{i=1}^{N_h} [Z(s_i) - Z(s_i + h)]^2, \quad \forall h \in \tilde{h}_j,$$
and this estimate is called the sample variogram. The experimental variogram [20,21] measures the average dissimilarity between two data values as a function of their separation. It often presents slope changes, which indicate a change in spatial continuity beyond certain distances, and it can be modeled as the sum of several elementary models called nested models or nested structures [22].
The fit of a model is not done considering only the experimental variogram; it must consider all the available information on the regionalized variable. A more detailed explanation can be found in [23].
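The estimation of the sample variogram by distance classes can be sketched as follows. This is a minimal illustration under the isotropy assumption, not the implementation used in the study: the function name `experimental_variogram`, the tolerance parameter `tol`, and the four sites with their values are all hypothetical.

```python
import math


def experimental_variogram(coords, values, lags, tol):
    """Return (lag, semivariance, number_of_pairs) for each lag, pooling all directions."""
    rows = []
    for lag in lags:
        total, n_pairs = 0.0, 0
        for i in range(len(coords)):
            for j in range(i + 1, len(coords)):
                d = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
                if abs(d - lag) <= tol:  # pair falls in the distance class around this lag
                    total += (values[i] - values[j]) ** 2
                    n_pairs += 1
        gamma = total / (2 * n_pairs) if n_pairs else float("nan")
        rows.append((lag, gamma, n_pairs))
    return rows


# Hypothetical data: four sites on a line with made-up sales values.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
result = experimental_variogram(coords, values, lags=[1.0, 2.0, 3.0], tol=0.1)
for lag, gamma, n_pairs in result:
    print(lag, round(gamma, 3), n_pairs)
```

Each distance class averages half the squared differences of the pairs it contains, so classes with few pairs (here the largest lag) are the least reliable points when a model is fitted.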

Estimation with Kriging.
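The kriging method is considered here as a linear prediction with the best linear unbiased estimator. There are several types of kriging depending on what is known about the mean of the population, namely simple kriging and ordinary kriging; for this study, we are interested in the ordinary type. The regionalized variable is regarded as a realization of a stationary random function Z that fulfills
$$\forall x \in V,\ E[Z(x)] = m \ \text{(unknown)}, \qquad \forall x, x + h \in V,\ \mathrm{cov}[Z(x + h), Z(x)] = C(h),$$
where V is the neighborhood considered in the kriging process. The following conditions are considered:
(i) Linearity:
$$Z^*(x_0) = a + \sum_{\alpha=1}^{n} \lambda_\alpha Z(x_\alpha),$$
where $x_0$ is the place where an estimate is established, $x_\alpha$, $\alpha = 1, \ldots, n$ are the sites with known data, and $\lambda_\alpha$, $\alpha = 1, \ldots, n$ are the weights that, together with the coefficient a, are the unknowns.
(ii) Unbiased estimation constraint: the expectation of the estimation error must be zero:
$$E[Z^*(x_0) - Z(x_0)] = 0.$$
(iii) Minimum variance: find the weights that minimize the variance of the estimation error:
$$\mathrm{var}[Z^*(x_0) - Z(x_0)] \ \text{is minimal}.$$
The variogram is a tool equivalent to the covariance through the relationship
$$\gamma(h) = C(0) - C(h).$$
The calculation of ordinary kriging is done by solving, for the weights $\lambda_\beta$ and the Lagrange multiplier $\mu$, the system
$$\sum_{\beta=1}^{n} \lambda_\beta\, \gamma(x_\alpha - x_\beta) - \mu = \gamma(x_\alpha - x_0), \quad \alpha = 1, \ldots, n, \qquad \sum_{\alpha=1}^{n} \lambda_\alpha = 1,$$
with a = 0.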
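As a hedged illustration of ordinary kriging, the sketch below builds and solves the standard variogram-based system, in which the weights sum to one and a Lagrange multiplier enforces the unbiasedness constraint. It is not the gstat implementation used in the study: the helper names (`spherical`, `solve`, `ordinary_kriging`), the four sites, and their values are hypothetical, and a global neighborhood is assumed.

```python
import math


def spherical(h, plateau, reach):
    """Spherical variogram model."""
    if h >= reach:
        return plateau
    r = h / reach
    return plateau * (1.5 * r - 0.5 * r ** 3)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def ordinary_kriging(coords, values, x0, gamma):
    """Return the estimate at x0 and the kriging weights."""
    n = len(coords)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Rows 1..n: sum_b lambda_b * gamma(x_a - x_b) - mu = gamma(x_a - x_0)
    A = [[gamma(dist(coords[a], coords[b])) for b in range(n)] + [-1.0] for a in range(n)]
    A.append([1.0] * n + [0.0])  # last row: the weights sum to one
    rhs = [gamma(dist(c, x0)) for c in coords] + [1.0]
    solution = solve(A, rhs)
    weights = solution[:n]
    return sum(w * z for w, z in zip(weights, values)), weights


# Hypothetical fairs on the corners of a unit square, with made-up sales values.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [10.0, 20.0, 30.0, 40.0]
gamma = lambda h: spherical(h, 1.0, 2.0)
estimate, weights = ordinary_kriging(coords, values, (0.5, 0.5), gamma)
print(estimate, weights)
```

At the centre of the square the four sites are equidistant, so the weights are equal and the estimate is the plain average; at a data site the estimate reproduces the observed value, which reflects the exact-interpolation property of kriging without a nugget effect.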

Multivariate Prediction: Cokriging.
In this case, multiple spatial variables are analyzed together to build the prediction model. The first step is modeling a multivariable variogram, and the main tool for estimating semivariances between different variables is the crossed variogram. Two variables can have cross correlation, which means that the variables not only exhibit autocorrelation but also that the spatial variability of a variable A is correlated with that of a variable B, and vice versa. This can be extended to multiple variables; the measurements are taken at a limited set of locations, and the interpolation can be made to an unlimited number of locations.
Cokriging seeks to estimate the value of a variable considering the data of this variable and of other correlated variables; for this, it uses the following relationships [24,25]. The crossed variogram between two variables $Z_1$ and $Z_2$ is defined as follows:
$$\gamma_{12}(h) = \frac{1}{2} E\{[Z_1(x + h) - Z_1(x)][Z_2(x + h) - Z_2(x)]\}$$
and can be computed from the available data:
$$\hat{\gamma}_{12}(h) = \frac{1}{2|N(h)|} \sum_{(\alpha,\beta) \in N(h)} [z_1(x_\alpha) - z_1(x_\beta)][z_2(x_\alpha) - z_2(x_\beta)],$$
where $N(h) = \{(\alpha, \beta) \text{ such that } x_\alpha - x_\beta = h\}$, both variables $z_1$ and $z_2$ being measured at $x_\alpha$ and $x_\beta$.
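The empirical crossed variogram can be sketched in a few lines. The example assumes both variables are measured at the same sites and pools all directions into distance classes; the function name `cross_variogram`, the tolerance `tol`, and the data are hypothetical.

```python
import math


def cross_variogram(coords, z1, z2, lag, tol):
    """Return (cross semivariance, number_of_pairs) for pairs separated by about `lag`."""
    total, n_pairs = 0.0, 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            if abs(d - lag) <= tol:
                # Product of the increments of the two variables over the same pair of sites
                total += (z1[i] - z1[j]) * (z2[i] - z2[j])
                n_pairs += 1
    if n_pairs == 0:
        return float("nan"), 0
    return total / (2 * n_pairs), n_pairs


# Hypothetical sales of two products at four sites on a line.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
tomato = [1.0, 2.0, 4.0, 7.0]
onion = [2.0, 4.0, 8.0, 14.0]
cross_value, cross_pairs = cross_variogram(coords, tomato, onion, lag=1.0, tol=0.1)
direct_value, _ = cross_variogram(coords, tomato, tomato, lag=1.0, tol=0.1)
print(cross_value, direct_value, cross_pairs)
```

When the same variable is passed twice, the function reduces to the direct experimental variogram, which is a convenient consistency check.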

Materials.
The analysis is based on information from 2014, provided by the General Coordination of Marketing Networks of the Ministry of Agriculture and Livestock of Ecuador. It contains the weekly sales of agricultural products made by small farmers located in Ecuador's central highlands, specifically in the provinces of Tungurahua and Chimborazo. The available data contain information about the number and volume of sales of products such as vegetables, legumes, meat, dairy, fruits, tubers, and processed products, with an average of 1,200 items per month recorded on a weekly basis. The thirty products with the greatest relevance in the data were selected for the research and are listed in Table 1 (which contains the names in English, the scientific name, and the name in Spanish, the original language of the study). Further details of this dataset can be found in the initial part of the investigation [26]. The available record contains the products that are part of the marketing, the value of sales, the date, and the fair to which each transaction belongs (Table 2).
The first data sheet contains all the recorded transactions, organized in packages named "canastas" (baskets). Each basket represents a sale and lists the products present in the purchase; it also implicitly contains the spatial location of the operation (the location of the fair) and its time stamp (date). As shown in Table 3, the transactions have the dates and, for each of the 30 products, a label with the character "s" if the product was present in the transaction and "No" otherwise.
Table 3 contains the sales value reported for each product, aggregated by week and location. It contains the numerical attributes reflecting the weekly variation of sales, with a blank space when there is no registered value.
This second table, containing the sales information of 48 weeks with a total of 1260, was the basis for the prediction analysis.

Proposed Methodology
The proposed methodology to improve the prediction of commercialization of products consists of searching for the set of elements with the highest degree of associativity in commercialization, which is then used to reduce the error in the spatial estimate of commercialization of agricultural products. It consists of the following steps:
(i) Establish a baseline with future estimated values for the marketing of agricultural products, using the deterministic method IDW (inverse distance weighting)
(ii) Establish the set of associated products (using the Apriori algorithm of association rules)

Data Processing.
The proposed methodology has been applied to the marketing information of agricultural products provided by the Ministry of Agriculture of Ecuador, collected from the different fairs located in Tungurahua and Chimborazo, using the product sales data for July 2014 (Figure 2(a)).
To implement the proposed methodology, we use the algorithms available in the R language; the libraries used are SpatialPoints (sp), ggmap, tmap, ggplot2, GADMTools, rgeos, gdalUtils, gstat, geoR, proj4, crs, raster, maps, and readr, in version 1.0.143 [27][28][29][30]. Weka 3.7 [31] is used to generate the association rules. The first activity is the creation of the grid or mesh [32,33] that determines the prediction area; its structure is defined with the parameters cellcentre.offset x = −79.1085, y = −2.531218; cellsize x = 0.05, y = 0.05; and cells.dim x = 21, y = 32. Near the equatorial line, one degree of longitude equals 111.32 km. The extent in longitude occupied by the two provinces, from x min = −79.133499 to x max = −78.0834991, is 1.049 degrees, equivalent to 116 km. For the grid (spgridtc), the distance between cells is 5.84 km (Figure 2(b)).
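The grid arithmetic can be checked with a small Python sketch. It assumes that cellcentre.offset is the centre of the cell with the lowest coordinates, as in the sp grid topology, and uses the 111.32 km per degree factor quoted in the text; the variable names are hypothetical.

```python
KM_PER_DEGREE = 111.32  # length of one degree of longitude at the equator, as quoted in the text

offset_x, offset_y = -79.1085, -2.531218  # centre of the cell with the lowest coordinates
cell_x, cell_y = 0.05, 0.05               # cell size in degrees
n_x, n_y = 21, 32                         # number of cells in each direction

# Cell centres of the prediction grid, row by row.
xs = [offset_x + i * cell_x for i in range(n_x)]
ys = [offset_y + j * cell_y for j in range(n_y)]
grid = [(x, y) for y in ys for x in xs]

# Outer limits in longitude: half a cell beyond the first and last centres.
x_min = offset_x - cell_x / 2
x_max = x_min + n_x * cell_x
width_km = (x_max - x_min) * KM_PER_DEGREE
cell_km = cell_x * KM_PER_DEGREE

print(len(grid), round(x_min, 4), round(x_max, 4), round(width_km, 1), round(cell_km, 2))
```

Under these assumptions the limits reproduce the x min and x max reported above, and the grid has 21 × 32 = 672 prediction cells. The same conversion factor gives a cell width of roughly 5.6 km for 0.05 degrees, slightly below the 5.84 km quoted in the text, which may reflect a different rounding or spacing convention.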

Search for Association Rules.
To find association rules, the information must be discretized so that it is possible to identify whether an agricultural product is part of a purchase.
If a product is part of the transaction, it is labeled with the character "T", otherwise "F", for all months of 2014. To optimize the search for the best association rules, the value "F" is replaced by the symbol "?". The Apriori algorithm for association rules is applied to a set of 550 transactions, with a minimum support of 0.4 (220 occurrences) and a confidence of 0.8. The resulting set is as follows:
(i) Each time a white onion transaction is made, a tomato transaction is performed with a confidence of 87%; the corresponding confidences are 86% for tamarillo, 83% for carrot, and 82% for broccoli (Figure 3). Each one of these products generates an association rule with tomato, and together they constitute the multivariable set.
(ii) The product with the highest commercial ratio in the study sample is tomato.
(iii) The set of greatest associativity is structured as A = {Tomato, White Onion, Tamarillo, Carrot, Broccoli} (Figure 4).
With the set of best association rules, the data source on which the different future estimation processes are carried out is generated; this file is called Fjespacial (Table 4).
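The level-wise search for frequent item sets and rules can be sketched in Python as follows. This is a minimal illustration of the Apriori idea, not the Weka implementation used in the study: the function names `apriori` and `association_rules` and the five toy transactions are hypothetical, and only the thresholds (minimum support 0.4, minimum confidence 0.8) are taken from the text.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all item sets with support >= min_support."""
    n = len(transactions)
    support = {}
    level = [frozenset([item]) for item in sorted({i for t in transactions for i in t})]
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        frequent = {c: k / n for c, k in counts.items() if k / n >= min_support}
        support.update(frequent)
        keys = list(frequent)
        # Join frequent k-item sets into (k+1)-item candidates ...
        candidates = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}
        # ... and keep only candidates whose k-item subsets are all frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))]
    return support


def association_rules(support, min_conf):
    """Return (antecedent, consequent, support, confidence) for rules above min_conf."""
    rules = []
    for itemset, sup in support.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                antecedent = frozenset(antecedent)
                conf = sup / support[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), sup, conf))
    return rules


# Hypothetical baskets; the real study uses 550 transactions over 30 products.
transactions = [
    {"white_onion", "tomato"},
    {"white_onion", "tomato", "carrot"},
    {"tomato", "carrot"},
    {"white_onion", "tomato"},
    {"broccoli"},
]
found = apriori(transactions, min_support=0.4)
found_rules = association_rules(found, min_conf=0.8)
for antecedent, consequent, sup, conf in found_rules:
    print(sorted(antecedent), "->", sorted(consequent), round(sup, 2), round(conf, 2))
```

In this toy example, only rules whose consequent is tomato survive the confidence threshold, mirroring the structure of the result reported above, where each associated product forms a rule with tomato.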

Baseline Analysis (IDW).
In order to establish a baseline of analysis, the deterministic method inverse distance weighting (IDW) is used to calculate a first future estimate using the set of products with the greatest associativity (broccoli, white onion, tomato, tamarillo, and carrot) established in Section 5.2. The prediction of consumption for a single variable is obtained with idw((TOMATE)~1, FJespacial, spgridtc), where the variable to predict is tomato, FJespacial contains the sales values, and spgridtc is the grid or area where the prediction is made (Figure 5(a)).
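The principle behind the IDW baseline can be sketched in Python. The example assumes a power of 2 for the distance weights, which is a common choice; the function name `idw` and the two sites are hypothetical and do not reproduce the R call above.

```python
import math


def idw(coords, values, x0, power=2.0):
    """Inverse distance weighted estimate at x0 from known sites."""
    numerator = denominator = 0.0
    for (x, y), z in zip(coords, values):
        d = math.hypot(x - x0[0], y - x0[1])
        if d == 0.0:
            return z  # x0 coincides with a data site
        w = 1.0 / d ** power
        numerator += w * z
        denominator += w
    return numerator / denominator


# Hypothetical fairs with made-up tomato sales.
coords = [(0.0, 0.0), (2.0, 0.0)]
values = [10.0, 30.0]
midpoint = idw(coords, values, (1.0, 0.0))
near_first = idw(coords, values, (0.5, 0.0))
print(midpoint, near_first)
```

Because the weights depend only on distance, IDW ignores the spatial correlation structure captured by the variogram, which is why it serves here as a deterministic baseline for the geostatistical methods.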

Spatial Data Analysis.
The data used correspond to the commercialization of tomato in July 2014 in the provinces of Tungurahua and Chimborazo. This file is converted to a spatial type by transforming the location data x and y into geographic coordinates [17,34,35,36] that represent the longitude and latitude of each of the fair-type alternative marketing circuits in the study; the fourth column contains the tomato sales values.

Theoretical and Experimental Variogram Models.
The distance between the points that identify the fairs is expressed in tenths of a degree, and between each jump, there is a distribution of two fairs.
The model variogram (m), of the spherical type, has a range of 0.157, within which the spatially correlated points are found, with a plateau equal to 2151 and distance 0.473.
In Figure 3, the adjusted variogram can be observed together with the experimental variogram and the model variogram. Based on the ordinary kriging method, which is considered the best linear unbiased estimator, the values found in the interpolation vary especially around two foci on which the predictions are generated. The values closest to the data points are more strongly influenced by them than those that are far away (Figure 5(b)).

Spatial Prediction Based on Associated Products.
Because of the interrelation of products found with the Apriori algorithm, a set of associated products with the highest incidence in commercialization was identified. The five products resulting from the association rules are A = {Tomato, Broccoli, White Onion, Tamarillo, Carrot}. The correlation between the elements of set A was verified, and the model variogram of each element was generated, as can be seen in Figure 6.
At this point, the linear model of coregionalization is fitted to the multivariable sample variograms of these products.

Cokriging.
In the same way as for the tomato variable, we proceed to estimate the future sales of the target variable (tomato) with an extended model integrating all the associated products, as shown in Figure 5(c). The variable g represents a function with all the products resulting from the association rules, for which the new variogram is calculated: vmra <- variogram(g). The adjusted variogram is obtained from the interaction between the variogram of the function g (multivariable) and the model m of a single variable: vm.fit <- fit.lmc(vmra, g, m). The multivariate prediction is derived from the relation xt <- predict(vm.fit, spgridtc). A summary of the three cases of prediction of future consumption is presented in Figure 5.

Discussion.
To assess the prediction model, cross validation divides the data into two sets: the modeling subset is used to fit the model variogram, and then kriging is applied at the locations of the validation set, so that the validation measurements are compared with their predictions. The procedure known as leave-one-out cross validation (LOOCV) was applied. It performs as many iterations as there are data (N) in the set, using N − 1 data to train the model and the remaining datum for testing; the result is the arithmetic mean of the N errors obtained, $E = \frac{1}{N}\sum_{i=1}^{N} e_i$. Cross validation usually gives a pessimistic estimate of performance (bias), since most models would improve if the training set were bigger. For this reason, LOOCV has the lowest bias, since the training set contains the whole dataset except one datum. On the other hand, some authors point out that the error estimated by LOOCV may have greater variance than that of k-fold cross validation with k << n, since the size of the test sets is higher and the estimation smoother. However, this is open to discussion, as indicated in [37], since k-fold cross validation produces dependent test errors whose correlations cannot be estimated unbiasedly. As indicated in [38], in learning problems employing models with moderate or low instability (such as linear regression problems), LOOCV often has lower variability in both bias and variance.
In any case, for situations with small datasets, the variance in fitting the model tends to be higher, implying that k-fold cross validation is likely to have a high variance (as well as a higher bias) with respect to LOOCV.
This is why LOOCV is often the best choice with limited amounts of available data, as in the case study of this work, in order to make maximal use of the data when comparing the performance of alternative learning structures. The estimation error (the difference between the estimated value and the true value) is calculated at each site with data, and a statistical analysis is made of the errors committed at all data sites. The results of the cross validation for each method chosen for this study indicate that, when comparing the residual values of the predictions, the IDW and kriging methods have similar prediction values, while the cokriging process (multivariate) presents an improvement, given the smaller amplitude of its results (Figure 7). As can be seen in the box plots, for all the estimation processes of future sales, the values are located in a range between −20 and 30. Figure 8 shows the result of subtracting the residual values between the predictions of (1) IDW and kriging (left frame) and (2) IDW and cokriging (right frame). In the second part, the residual difference of the cross validation between the IDW method and ordinary cokriging (multivariable) is calculated, establishing nine positive and five negative values.
Positive samples indicate that the residual value of the multivariable function is smaller.
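The leave-one-out procedure can be sketched in Python as follows. The sketch uses a simple IDW predictor as a stand-in for any of the three methods; the function names `idw` and `loocv_errors` and the three sites with their values are hypothetical, and the error is taken as the estimated value minus the true value, following the definition given above.

```python
import math


def idw(coords, values, x0, power=2.0):
    """Inverse distance weighted estimate at x0 (stand-in predictor for the sketch)."""
    numerator = denominator = 0.0
    for (x, y), z in zip(coords, values):
        d = math.hypot(x - x0[0], y - x0[1])
        if d == 0.0:
            return z
        w = 1.0 / d ** power
        numerator += w * z
        denominator += w
    return numerator / denominator


def loocv_errors(coords, values, predict):
    """Leave each site out in turn and return the estimation errors e_i."""
    errors = []
    for i in range(len(coords)):
        train_coords = coords[:i] + coords[i + 1:]
        train_values = values[:i] + values[i + 1:]
        estimate = predict(train_coords, train_values, coords[i])
        errors.append(estimate - values[i])  # estimated value minus true value
    return errors


# Hypothetical fairs on a line with made-up sales values.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
values = [10.0, 20.0, 30.0]
errors = loocv_errors(coords, values, idw)
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```

Running the same function with two different predictors and subtracting the resulting errors site by site gives the kind of residual comparison summarized in Figure 8.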

Conclusions
This research is focused on a target area located in the Tungurahua and Chimborazo provinces of Ecuador, where there are fourteen alternative marketing circuits called fairs. These locations were used to create the grid of future sales estimates, and the analysis of sales transactions containing agricultural products generated the set of strongly associated products, based on the Apriori algorithm. As a result, the set of associated products with support = 0.4 and confidence greater than 0.8 is, in order, the following: white onion (0.87), tamarillo (0.86), carrot (0.83), and broccoli (0.82); each of these products is associated with the sale of tomato.
Using the IDW process as a baseline for comparison, leave-one-out cross validation of the predictions was carried out to compare it with geostatistical techniques based on the variogram, which generate interpolations of product sales in the target area. Based on kriging, the sales values of products were established according to their spatial locations and the influence of close neighbors. Finally, a multivariable set of products for prediction was established from the association rules with the greatest associativity (tomato, broccoli, tamarillo, white onion, and carrot). Based on this multivariable set, the prediction values are calculated using the same procedure described in the first and second stages. With this improvement in the sales prediction process, it would be possible to establish scenarios to generate consumption maps (Figure 9) that can support a better production process and its subsequent commercialization, which is reflected in a better level of economic income for farming families.
The residual value of the IDW prediction minus the kriging prediction yields eight positive values and six negative values. The same calculation for IDW minus cokriging yields nine positive and five negative values.
In the cokriging (multivariable) process, there is a greater number of cases with positive differences, which shows that this process, using the products of the association rules as the multivariable set, achieves a 16% improvement when establishing future sales estimates. The proposed methodology consists of finding a set based on association rules to establish the multivariable process, and this research has shown an acceptable improvement in prediction values.
Finally, it should be emphasized that a methodology has been established, based on the use of association rules, that allows future estimates to be improved using cokriging (multivariable) processes.
Because there is not enough data available to establish a distribution and produce future estimates with a conventional statistical process, the proposed technique and methodology are considered useful for initial case studies with a limited amount of data, especially to create a baseline. As more annual data series are obtained, the values obtained using data mining techniques can be compared with those of traditional statistical techniques.

Data Availability
The data used are provided by the General Coordination of Marketing Networks of the Ministry of Agriculture and Livestock of Ecuador, within the framework of an interinstitutional agreement with the Salesian Polytechnic University.

Disclosure
This research is part of the Doctorate in Computer Science Program being pursued by Washington R. Padilla.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Acknowledgments
This work was supported in part by Project MINECO TEC2014-57022-C2-2-R, by the Salesian Polytechnic University of Quito-Ecuador, and by the Commercial Coordination Network, Ministry of Agriculture and Livestock, Ecuador.
This research has the partial economic support of the Salesian Polytechnic University of Ecuador.
Computational Intelligence and Neuroscience