Disaggregation of Statistical Livestock Data Using the Entropy Approach

1 Faculdade de Ciências e Tecnologias, Universidade do Algarve, Edifício 8, 8005-139 Faro, Portugal
2 Universidade de Évora (UE), Centro de Estudos e Formação Avançada em Gestão e Economia Tecnologias, 7000-809 Évora, Portugal
3 Department of Management, Universidade de Évora, 7000-809 Évora, Portugal
4 Instituto de Ciências Agrárias e Ambientais Mediterrâneas, Universidade de Évora, 7000-809 Évora, Portugal


Introduction
Disaggregated statistical information is necessary to have a correct analysis of spatial patterns inside each country, but data on agricultural and forest occupation and production are frequently found only at national and subnational level [1][2][3][4], and oftentimes this problem does not have an appropriate and accurate solution.
In Portugal, southern Europe, apart from the general lack of data on agricultural and forest occupation, there is a need for up-to-date data on livestock numbers [2]. Only the General Agricultural Census (GAC), conducted by the National Institute of Statistics (NIS) every 10 years, provides information disaggregated by subregion, county, and parish. The other sources report information only by agrarian region and NUTS II [5,6]. Even the different agents operating in the territory do not have more detailed data; only the veterinary health authorities hold more accurate current data, which are not accessible to all. However, planning and devising a clear and sustainable rural development policy call for disaggregated information [4], at least for the numbers of livestock intended for breeding, mainly in regions where these variables are of great importance for farmers' income.
After Portugal's entry into the European Union, the Alentejo region, in southern Portugal, came under the influence of different policies; as a consequence, several inland rural areas face problems and decline, along with the extensification of agricultural activities [7]. In this region, the importance of livestock breeding is unquestionable [8].
A particular county with a tendency towards demographic decline is Castelo de Vide, located in the NUTS III of Alto Alentejo. Several teams from the University of Évora have carried out in-depth studies and identified this county as a case study of the decline process of the inland counties of the NUTS III of Alto Alentejo. Here, data on extensive livestock for breeding are needed to enable a correct analysis of the current state of affairs and the implementation of development policies that may contribute to increasing farm income.
Therefore, methodologies for estimating livestock data are relevant, namely, in rural areas where agricultural activity, livestock breeding, and the local economy tend to decline. Thus, this paper proposes a methodology to obtain disaggregated information on livestock units, focusing especially on forecasting the main categories of livestock units intended for breeding, as well as total livestock units.
To develop such a methodology, we considered previous studies and the few methodologies that may be adapted to this area. Fragoso et al. [2] presented a way of calculating the main livestock species intended for breeding. The authors converted the livestock into normal heads and followed the evolution of the aggregate based on the historical pattern. Thus, they simply made the data follow the aggregate variation rate with some adaptations. This procedure is very simple but may lead to some mistakes. Other studies carried out for the Alentejo region, namely, by state entities, do not use a suitable disaggregation level or present incomplete information.
Nevertheless, some studies on land use disaggregation based on entropy have been carried out in the Alentejo region and achieved a good disaggregation of information [2,9]. There are also some international studies that use entropy-based methodologies for estimating livestock data, some of which are presented next. Jongeneel [10] used a nonstationary Markov model approach, where the transition probabilities are explained by a set of exogenous (policy) variables. Karantininis [11] applied a generalized cross entropy instrumental variables estimator to recover the nonstationary transition probability matrix for the Danish pork industry. Tonini and Jongeneel [12] analyzed the evolution of the dairy farm structure of Poland during the postsocialist period and applied a generalized cross entropy Markov chain approach which incorporated prior information for estimation.
Therefore, the key question is how to adapt such studies so that they provide not only added value in data calculation but also a dynamic calculation process for livestock data, since there are indications that this methodology can provide good results for this problem and allow future trends to be predicted.
The study was applied to the Alentejo region, where livestock data were disaggregated in a first step for the subregion of Alto Alentejo and in a second step for its county of Castelo de Vide. Figure 1 shows the location of the application area.
In addition to this introduction, the paper is organized into three more parts. The second part covers the material and methods, describing the disaggregation problem and the methodological approach developed for estimating the number of livestock units. The third part presents and discusses the results. Finally, the concluding remarks are included in the last part.

The Disaggregation Problem.
The intent is to obtain disaggregated data on the main livestock categories at county level for 2005 and also to obtain a historical sequence of data. To address this problem, it must be considered that in the Alentejo region the number of livestock units is closely linked with the agricultural and forest areas, since livestock are bred mainly in extensive systems based on seeded or spontaneous pastures under Mediterranean forest.
The information from the disaggregated units (counties and subregions) is available only for the first $t$ periods ($t < T$), that is, 1989 and 1999. Therefore, the information disaggregation problem can be formulated as follows. The aim is to obtain from the known variable $Y^k(t)$ the data in $y_i^k(t)$, which is the target variable related to the livestock activities ($k$) at the disaggregated level ($i$) in a given moment ($t$). The matrix $Y$ represents the aggregated unit, and it is necessary to determine $y_i^k(t)$ $\forall i$, $k$, and $t$, for the subunits $i$ at the moment ($t$), for which there is no available information at the disaggregated level ($i$).
As noted above, the number of livestock units is connected to the agricultural and forest occupation at the disaggregated level ($S_i^j(t)$) or, in its absence, to the data at the aggregate level ($S^j(t)$). The aim is to find the livestock distribution at disaggregated level, knowing that it is a function of agricultural and forest occupation at that moment,
$$y_i^k(t) = f\left(S_i^j(t)\right),$$
where $y_i^k(t)$ represents the number of livestock units $k$ in territorial unit $i$ at the moment $t$ and $S_i^j(t)$ is the agricultural or forest occupation $j$ in territorial unit $i$ at the moment $t$. Therefore, a direct relation with the land uses, namely, pastures, is assumed. However, the number of livestock units is bounded: the composition of livestock must respect the predefined stocking-rate rules (V) as a function of agricultural and forest occupation $j$ at the moment $t$, that is, of the total physical area necessary for breeding livestock. These values are also subject to the following restriction:
$$\sum_i y_i^k(t) = Y^k(t) \quad \forall k, t.$$
This restriction requires that the number of livestock units of a given class at the aggregate level equal the sum of that class over all disaggregated territorial units at the moment $t$.

The Estimation of Livestock's Normal Heads.
The number of livestock units is a function of the farms' land occupation; hence there are coefficients that express this relation, considering that each livestock class has different feeding requirements and therefore needs a larger or smaller area. The conversion of different livestock categories into normal heads (NH) is a useful tool for considering those aspects.
Normal head is a common livestock measure that allows different categories of livestock to be expressed in the same units, as a function of species and age. There is a legal conversion table (Portaria no. 229-A/200 of 6 March 2008, [5]). For instance, according to that table a sheep over 1 year is equal to 0.15 NH and a bovine over 2 years is equal to 1 NH. In order to better understand this measure, Table 1 presents the main conversion indexes for the different species according to their age. Each territorial unit (region, subregion, and county) has a well-determined relation between NH and the agricultural and forest occupation. This relation is obtained by calculating the number of NH for each predominant livestock class in the area and establishing the relation between the number of NH and agricultural and forest occupation, namely, forage crops and permanent pastures. So, the number of livestock intended for breeding $k$ in NH, in territorial unit $i$, at the moment $t$ can be calculated as
$$NH_i^k(t) = INH^k \, y_i^k(t),$$
where $INH^k$ is the conversion index from livestock class $k$ into NH. On the other hand, the relation between livestock numbers in NH and agricultural and forest occupation is determined by a coefficient that relates total livestock to agricultural and forest occupation $j$ in territorial unit $i$. These values can then be transferred to a period $t + 1$. It is also possible to define sustainability limits on the number of NH per agricultural and forest occupation with specific additional information.
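The conversion into normal heads and its relation to land occupation can be illustrated with a short Python sketch. Only the two indexes quoted above (0.15 NH for a sheep over 1 year, 1 NH for a bovine over 2 years) come from the text; the herd sizes, the areas, the function names, and the idea of carrying the NH-per-hectare ratio unchanged into the next period are assumptions made for illustration.

```python
# Conversion indexes into normal heads (NH) quoted in the text.
INH = {"sheep_over_1y": 0.15, "bovine_over_2y": 1.0}


def to_normal_heads(animals):
    """Convert animal counts per class into NH per class."""
    return {k: INH[k] * n for k, n in animals.items()}


def nh_per_hectare(nh_by_class, area_ha):
    """Ratio between total NH and the occupied area (hypothetical unit: hectares)."""
    return sum(nh_by_class.values()) / area_ha


def project_total_nh(ratio, next_area_ha):
    """Carry the ratio to the next period, assuming it stays constant."""
    return ratio * next_area_ha


# Hypothetical county: 2,000 sheep and 300 bovines on 1,500 ha of pasture.
herd = {"sheep_over_1y": 2000, "bovine_over_2y": 300}
nh = to_normal_heads(herd)
ratio = nh_per_hectare(nh, 1500.0)
total_next = project_total_nh(ratio, 1400.0)  # pasture shrinks to 1,400 ha
print(nh, round(ratio, 3), round(total_next, 1))
```

In this hypothetical case both classes contribute the same number of NH even though sheep are far more numerous, which is the point of working in a common unit.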

The Estimation of Livestock's Percentage Weight by Class.
Up to this point, the proposed methodology allows us to estimate the total number of breeding livestock in NH, but not the percentage weight of each class. This calls for a methodology able to estimate this type of data. Howitt and Reynaud [13,14] and Fragoso et al. [2] applied an entropy methodology based on Markov processes to estimate the weight of land uses. The maximum entropy studies of Good [15] and Golan et al. [16], and the livestock studies based on Markov processes and maximum entropy referred to previously, were also considered.
Based on the methodology of Howitt and Reynaud [13, 14] and Fragoso et al. [2], we propose an analytical framework developed in the following three steps.

Advances in Operations Research
(1) Convert the data of each livestock breeding class into NH, for the years in which information is available.
(2) Disaggregate the data from a database created at aggregate level, based on the theory of maximum entropy, and calculate the livestock weight in $t + 1, \ldots, T$.
(3) Redistribute livestock numbers in NH according to estimated proportions and convert into animal numbers.
We assume that a sequence of livestock can be characterized by a first-order Markov process, considering $K$ decision states corresponding to $K$ possible strategies, indexed by $k \in \{1, \ldots, K\}$. Under this first-order Markov assumption, the probability of moving from any decision state $k \in \{1, \ldots, K\}$ in the year $t - 1$ to any decision state $k' \in \{1, \ldots, K\}$ in the year $t$ can be calculated by multiplying the respective probabilities $NH^{k}(t - 1) \times NH^{k'}(t)$. This product of probabilities can be associated with a matrix $T_{kk'}$ of dimension $(K \times K)$, which is called the probability transition matrix [2,13,14].
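A minimal Python sketch of how a first-order transition matrix carries livestock shares from one year to the next. The three classes and all numerical values are hypothetical; the only property taken from the text is that the matrix holds transition probabilities, so that each row sums to 1.

```python
def propagate(shares, transition):
    """One first-order Markov step: shares at year t-1 -> shares at year t."""
    n = len(shares)
    return [sum(shares[k] * transition[k][j] for k in range(n)) for j in range(n)]


# Hypothetical NH shares of three livestock classes in year t-1.
shares_prev = [0.5, 0.3, 0.2]

# Hypothetical transition matrix: row k gives the probabilities of moving
# from class k to each class k' in the following year.
T = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.00, 0.10, 0.90],
]

# Every row must be a probability distribution.
for row in T:
    assert abs(sum(row) - 1.0) < 1e-9

shares_next = propagate(shares_prev, T)
print([round(s, 3) for s in shares_next])
```

Because every row sums to 1, the propagated shares again sum to 1, so the result can be read directly as next year's percentage weights.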
The creation of the prior information at the aggregated level is presented now. The dynamic process of livestock's percentage weight estimation in NH, at aggregate level, is obtained by estimating the probability transition matrix at aggregate level, $T_{kk'}$, using the generalized maximum entropy (GME) approach [16]. To this end, a maximum entropy model, inspired by Howitt and Reynaud [13] and Fragoso et al. [2], was adapted to the aims of this work and the conditions of the Alentejo region. This optimization program maximizes the entropy of the probability distributions $\{p_{kk'}^1, \ldots, p_{kk'}^M\}$ $\forall k$ and $k'$ and $\{w_k^1, \ldots, w_k^N\}$ $\forall k$ and $t$, subject to restrictions (9) to (12). Equation (9) defines the dynamic process of livestock numbers in NH. Equation (10) requires that the transition probabilities out of any Markov state sum to 1. In the same way, (11) and (12) guarantee that the variables $\{p_{kk'}^1, \ldots, p_{kk'}^M\}$ and $\{w_k^1, \ldots, w_k^N\}$ are defined as probability distributions, with values between 0 and 1, thus respecting the formalism of Golan et al. [16].
According to Golan et al. [16], it is necessary to define a parameter support $z = \{z_1, \ldots, z_M\}$ of $M \geq 2$ points with $z_1 = 0$ and $z_M = 1$ and with a probability distribution $\{p_{kk'}^1, \ldots, p_{kk'}^M\}$. For the errors $e_k(t)$ we proceed in the same way, reparameterizing the error through a support vector $V = \{V_1, \ldots, V_N\}$ with $N \geq 2$ points, so that the errors are defined through the probabilities $\{w_k^1(t), \ldots, w_k^N(t)\}$ as
$$e_k(t) = \sum_{n=1}^{N} V_n \, w_k^n(t).$$
Therefore, knowing the $p_{kk'}$ probability distribution, we can easily create a database of information at aggregate level, which will later be used in the estimation of probability transition matrices at disaggregated level for calculating the number of livestock units.
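The reparameterization used in the GME formalism can be sketched in Python: each unknown is written as the expected value of a discrete support under a probability vector, and the entropy of those probability vectors is what the program maximizes. The supports {0, 0.5, 1} and {−1, 0, 1} are the ones reported later in the paper; the probability weights and the function names are hypothetical.

```python
import math


def expected_value(support, probs):
    """Recover a parameter or an error as the mean of its support."""
    return sum(s * p for s, p in zip(support, probs))


def entropy(probs):
    """Shannon entropy of a discrete distribution, with 0 * ln 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


z = [0.0, 0.5, 1.0]    # parameter support, bounded by 0 and 1
V = [-1.0, 0.0, 1.0]   # error support, symmetric around zero

p = [0.2, 0.3, 0.5]    # hypothetical weights on z
w = [0.25, 0.5, 0.25]  # hypothetical weights on V

transition_prob = expected_value(z, p)  # lies in [0, 1] by construction
error = expected_value(V, w)            # lies in [-1, 1] by construction
print(round(transition_prob, 3), round(error, 3), round(entropy(p), 4))
```

With no data constraints the entropy is largest for uniform weights, so the restrictions of the program are what pull the estimated probabilities away from the centre of each support.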
The first step in the disaggregation process can be formalized through a cross-entropy minimization program in which the previously estimated transition matrix is used as prior information. In this program, a support vector with $N \geq 2$ points is associated with the probabilities of the error vector. The model minimizes the cross-entropy of the transition probability distribution and the entropy of the probability distribution of errors (14), subject to restrictions (15)-(17). Equation (15) is a data consistency restriction that guarantees the compatibility of information at aggregate level with disaggregated information: the sum of all the disaggregated values plus the error must equal the aggregate. The last two equations ensure that the transition probabilities and the error probabilities respect the properties of a probability distribution; that is, their sum is equal to 1.
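The idea of minimizing cross-entropy against a prior can be sketched in Python on a deliberately reduced problem. This is not the program (14)-(17): it has a single distribution, one linear consistency constraint, and no error term. Under those assumptions the minimizer has the known exponential form, posterior proportional to prior × exp(λ × value), and λ is found by bisection. All names and numbers are hypothetical.

```python
import math


def tilt(prior, values, lam):
    """Exponentially tilted prior: p_k proportional to prior_k * exp(lam * values_k)."""
    weights = [q * math.exp(lam * x) for q, x in zip(prior, values)]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(p, q):
    """Sum of p * ln(p / q), the quantity being minimised."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def min_cross_entropy(prior, values, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Minimise cross_entropy(p, prior) subject to sum(p) = 1 and sum(p * values) = target.

    The target must lie strictly between min(values) and max(values).
    """
    def mean(lam):
        return sum(p * x for p, x in zip(tilt(prior, values, lam), values))

    # The constrained mean is increasing in lam, so bisection converges.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return tilt(prior, values, 0.5 * (lo + hi))


prior = [0.5, 0.3, 0.2]    # hypothetical prior shares of three classes
values = [0.15, 0.6, 1.0]  # hypothetical indicator attached to each class
posterior = min_cross_entropy(prior, values, target=0.5)
print([round(p, 4) for p in posterior], round(cross_entropy(posterior, prior), 5))
```

When the target equals the mean already implied by the prior, the posterior coincides with the prior, which is the sense in which the prior transition matrix is kept unless the data consistency restriction forces a change.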
In our case, at the county disaggregation level, the aim is to obtain livestock data specifically for the county of Castelo de Vide. Thus, instead of the simultaneous disaggregation of information for all counties of Alto Alentejo's NUTS III, one may choose the direct disaggregation of data for the county of Castelo de Vide. To do so, (15) may be rewritten in terms of $NH^k(t + 1)$, the livestock's percentage weight of the aggregate. This is useful because it addresses situations in which data are available only for some of the territorial units comprising an aggregate, and it constitutes a second variant of the proposed model. From the estimates of matrix T, one may then reproduce the percentage of livestock numbers $k$ in NH at the moment $t + 1$.

The Estimation of Livestock Number.
Assuming that the number of NH in relation to agricultural and forest occupation has already been calculated, the livestock numbers are obtained by redistributing that total according to the estimated NH percentage weights. Afterwards, the number of NH can easily be converted into the real number of animals by inverting each conversion index:
$$y_i^k(t + 1) = \frac{NH_i^k(t + 1)}{INH^k},$$
where $INH^k$ is the conversion index of livestock class $k$ into NH.
Finally, if we want to estimate the total livestock number, the conversion of the data into NH cannot be made due to the limits of the NH coefficient. If a set of explanatory variables is available, it can be used to solve this problem. If not, one may assume that the yearly variation rate of total livestock $k$ follows the rate of the livestock intended for breeding, applied over the number of years elapsed since year $t$.
This method is based on strong assumptions and its application requires the opinion of experts. It should be applied only when we do not have a stable weight for each of the units to be disaggregated. In our case we may assume that the model can be applied in a simultaneous disaggregation process, where it is only necessary to use the entropy process presented above and calculate the livestock numbers of species $k$ for unit $i$.
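The redistribution and the inverse conversion described in this paragraph can be sketched in Python as follows. The two conversion indexes are those quoted earlier in the paper; the total of 560 NH, the estimated shares, and the function names are hypothetical.

```python
# Conversion indexes into normal heads (NH) quoted earlier in the paper.
INH = {"sheep_over_1y": 0.15, "bovine_over_2y": 1.0}


def redistribute(total_nh, shares):
    """Split a total number of NH according to estimated percentage weights."""
    return {k: total_nh * s for k, s in shares.items()}


def to_animals(nh_by_class):
    """Invert the conversion index to recover animal numbers per class."""
    return {k: nh / INH[k] for k, nh in nh_by_class.items()}


# Hypothetical inputs: 560 NH in total, split evenly between the two classes.
estimated_shares = {"sheep_over_1y": 0.5, "bovine_over_2y": 0.5}
nh_by_class = redistribute(560.0, estimated_shares)
animals = to_animals(nh_by_class)
print({k: round(v) for k, v in animals.items()})
```

An equal split in NH translates into very different animal numbers, since one bovine over 2 years counts as much as several sheep.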

Results and Discussion
In order to obtain data on livestock for the county of Castelo de Vide, it was necessary to implement the disaggregation process in two stages. Figure 2 presents the disaggregation levels from I (for which there is available information), to II (subregions), and to III (the counties' level). The application of the proposed methodology to Portuguese data also required some adaptations.
The first is due to the almost complete lack of data on the classes of conversion into NH in the GAC of 1989, which means that the methodology can be applied correctly only after 1999, although it can be applied to the 1990s with partial inclusion of livestock intended for breeding.
The second point is the existence of limited data in certain classes of land occupation. This makes a correct simultaneous disaggregation process impossible for the counties of Alto Alentejo and allows only a direct disaggregation process to be applied to the breeding livestock. A third consideration is related to the conversion into NH. The relation between livestock breeding and the agricultural and forest occupation cannot be established when calculating the total number of livestock units (including younger animals) due to the conversion limits. However, similar methodologies can be used to calculate the percentage weight of each type of animal, supposing that the evolution of the total number of animals follows the same pattern as that of breeding livestock.
Given these considerations, the following implementation lines of the model were used: (1) direct disaggregation of breeding livestock for the county of Castelo de Vide considering 2 years (2003 and 2005), (2) direct disaggregation of breeding livestock for the county of Castelo de Vide considering a historical sequence, (3) simultaneous disaggregation of the total number of livestock in the subregions (NUTS III) of the Alentejo region, and (4) direct disaggregation of the total number of animals' proportions and their further conversion into number of animals for the county of Castelo de Vide.
In applying the model, different support limits for the parameters and the error had to be defined. Based on previous studies, the parameter support $z$ was defined with $M = 3$ points, $z = \{0, 0.5, 1\}$, for all models. To establish the limits of the error components, Pukelsheim's three-sigma rule recommended by Golan et al. [16] was used, and we selected the limits that produced the best improvements in the results during the calibration process, following the empirical findings of previous studies [2,9].
In estimating the number of NH for the livestock intended for breeding in 2003 and 2005, the limits V = {−1, 0, 1} were assumed at the disaggregation levels of the subregion of Alto Alentejo (NUTS III) and the county of Castelo de Vide. For the other breeding livestock disaggregation model, V = {−2, 0, 2} was considered at the first disaggregation level and V = {−5, 0, 5} at the second. For the reconstruction of historical series and of total livestock numbers of all categories, the limits V = {−0.7, 0.7} were considered in the simultaneous disaggregation process and V = {−1, 0, 1} in the direct disaggregation process.
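As an illustration of the three-sigma rule, the following Python sketch builds a symmetric three-point error support from a series of observations. It assumes that sigma is the sample standard deviation of the observed series; the series itself and the function name are hypothetical, and the limits actually adopted in the paper were chosen during calibration rather than computed in this way.

```python
import statistics


def three_sigma_support(observations):
    """Three-point error support {-3*sigma, 0, 3*sigma}."""
    sigma = statistics.stdev(observations)  # sample standard deviation
    return [-3.0 * sigma, 0.0, 3.0 * sigma]


# Hypothetical series of observed shares for one livestock class.
observed_shares = [0.42, 0.45, 0.40, 0.47, 0.44]
support = three_sigma_support(observed_shares)
print([round(v, 4) for v in support])
```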
The results are presented next, following the lines of implementation presented before.
The main livestock classes considered in the county of Castelo de Vide for the first implementation line of the model were bovine cattle, sheep, and goats (Table 2). However, the need to convert breeding livestock into NH led us to consider different divisions of bovine cattle, such as bovines of 1 to 2 years of age and bovines over 2 years of age.
The second implementation line of the model, which aimed to calculate partial estimates of breeding livestock and to reconstruct the history of the predominant livestock, considered $T = 7$ years and, owing to lack of data, only cows over 2 years of age for bovine cattle. Livestock numbers were calculated using data from 1989 (Table 3).
The third implementation line, which disaggregated data simultaneously for the different subregions, includes the following livestock types: cows of 2 or more years of age, other bovines, sheep, and goats, over a series of $T = 7$ years (Table 4). It disaggregated correctly the numbers of the different types of livestock, which were obtained directly from the cross-entropy minimization model (14) to (17).
The fourth implementation line of the model, which aimed to calculate the total proportions of livestock, considered cows of 2 or more years of age, other bovines, sheep, and goats, over a series of $T = 7$ years (Tables 5 and 6).

Validation.
The results were validated by cross-reference with statistical data, using the opinion of experts and technicians from the Portuguese Ministry of Agriculture with a good knowledge of the area of application, and by calculating the weighted percentage absolute deviation (WPAD), which measures the deviation of the estimates from the real statistical data of the 1999 GAC. The WPAD was calculated from the absolute deviation at the disaggregated level for each livestock species and class, considering the total deviation for the disaggregated territorial units, and at aggregate level by aggregating the disaggregated units. Here $\hat{y}_i^k$ is the estimated livestock number $k$ at disaggregated territorial level $i$, $y_i^k$ is the respective observed value from the available statistics, and the ratio of unit $i$ to the aggregate weighs the WPAD$_i$ according to the representativeness of disaggregated unit $i$ in the aggregate.
$D_i^k$ expresses the percentage deviation of each livestock class from the observed values (PAD), weighted by the importance of that class at disaggregated level $i$. The WPAD$_i$ is the sum of the $D_i^k$ values over the livestock classes $k$, which gives the real total deviation for unit $i$. The WPAD is the sum of the WPAD$_i$ weighted by the share of each unit $i$ in the total value or aggregate.
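One way to read the deviation measures described above is sketched below in Python. It assumes that the PAD is the absolute deviation as a percentage of the observed value, that each class is weighted by its share of the observed total of its unit, and that each unit is weighted by its share of the observed aggregate. The function names and the two units' data are hypothetical.

```python
def pad(estimated, observed):
    """Percentage absolute deviation of one class from its observed value."""
    return abs(estimated - observed) / observed * 100.0


def wpad_unit(estimated, observed):
    """Sum over classes of the PAD weighted by each class's observed share."""
    total = sum(observed.values())
    return sum(pad(estimated[k], observed[k]) * observed[k] / total for k in observed)


def wpad(estimated_by_unit, observed_by_unit):
    """Sum over units of the unit WPAD weighted by each unit's observed share."""
    grand_total = sum(sum(obs.values()) for obs in observed_by_unit.values())
    return sum(
        wpad_unit(estimated_by_unit[i], obs) * sum(obs.values()) / grand_total
        for i, obs in observed_by_unit.items()
    )


# Hypothetical observed and estimated animal numbers for two units.
observed = {"A": {"sheep": 1000, "goats": 100}, "B": {"sheep": 400, "goats": 100}}
estimated = {"A": {"sheep": 950, "goats": 130}, "B": {"sheep": 420, "goats": 90}}

for unit in observed:
    print(unit, round(wpad_unit(estimated[unit], observed[unit]), 2))
print("aggregate", round(wpad(estimated, observed), 3))
```

In unit A the goats have a PAD of 30%, yet the unit's weighted deviation stays near 7% because goats are a small part of the herd, which is how a single class can fail the 15% criterion while the unit as a whole passes it.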
In the first implementation line, the estimates of livestock intended for breeding in the county of Castelo de Vide are close to the opinion of the experts consulted and also show trends that can be observed in this county.
In the second implementation line, the model aiming to estimate part of the breeding livestock was validated for 1999 by comparison with the data from the GAC of 1999. This analysis showed that the model produced satisfactory data, since the WPAD$_i$ is below 15%, a threshold usually recommended by authors such as Fragoso et al. [2] or Hazell and Norton [17]. Following this criterion for the PAD$_i^k$, only the data on breeding goats are not valid.
The simultaneous disaggregation process of livestock developed in the third line of implementation was applied only to the Alentejo's subregions and produced values very close to reality, since the WPAD is only 6% in 1999. In terms of WPAD$_i$, only Alentejo Central stands out, with 8.11%; in Alto Alentejo it is about 3.16%.
Finally, in the last line of implementation of the model, the direct disaggregation procedures produced a reliable disaggregation of the proportions of total livestock in Castelo de Vide. The PAD$_i^k$ values for Castelo de Vide are very low for sheep (12%), which is the predominant livestock category. All the other categories show PAD$_i^k$ values under 15%, with the exception of cows of 2 or more years of age, which present a PAD$_i^k$ of 30%. The WPAD$_i$ obtained for Castelo de Vide, 14.6%, is acceptable, since it is below the threshold recommended by Fragoso et al. [2].

The Disaggregation Informational Gain.
The heterogeneity of the information calculated was measured through the disaggregation informational gain (DIG) used by Howitt and Reynaud [13,14]. According to the authors the measure of information gain from disaggregation should have the following properties: "(i) the measure of potential gain increases monotonically with the heterogeneity of the disaggregated sample; (ii) the gain from disaggregating a uniform set of samples is zero; (iii) the measure is invariant to changes in the number of disaggregated samples and the variability of the aggregated sample; (iv) the measure has an information-theoretic interpretation" [14].
The DIG is based on the cross entropy between the observed values of breeding livestock at aggregated and disaggregated level and on the cross entropy between the breeding livestock estimated by the model and the observed values. The DIG measure increases as the units' posterior distributions become closer to the true distributions. In the case of perfect disaggregation, the DIG is equal to 1. This measure may, however, be subject to some criticism, since a model with considerable errors might produce negative values. In its formula, $\hat{E}_i^k$ are the estimated shares of livestock $k$ in unit $i$, $E_i^k$ are the real shares in the disaggregated unit $i$, and $E^k$ is the aggregated share of livestock $k$. The same informational measure can be defined for each of the disaggregated units (DIG$_i$). Due to the nature of DIG and its use in previous studies [14], it was applied only to the third line of implementation of the model: the simultaneous disaggregation process. Here, the DIG revealed a percentage of recovered information of around 60.3%. This is a very satisfactory level of recovered information, which exceeds the levels obtained by Fragoso et al. [2] for the Alentejo area. The DIG$_i$ also revealed very satisfactory results: the best was obtained in Alentejo Litoral with 78%, followed closely by Baixo Alentejo with 75%. The other two subregions revealed values near 60%: Alto Alentejo and Alentejo Central presented 59% and 57%, respectively.
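The behaviour of the DIG described above can be reproduced with a short Python sketch. The functional form used here, one minus the ratio between the cross entropy of real against estimated shares and the cross entropy of real against aggregate shares, is an assumption chosen to match the properties stated in the text (1 for a perfect disaggregation, 0 when the estimate is no better than the aggregate shares, negative for a poor model); it is not copied from Howitt and Reynaud [13,14]. All shares and names are hypothetical.

```python
import math


def cross_entropy(p, q):
    """Sum of p * ln(p / q) between two share vectors."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def dig_unit(real, estimated, aggregate):
    """Assumed form of the informational gain for one disaggregated unit."""
    return 1.0 - cross_entropy(real, estimated) / cross_entropy(real, aggregate)


aggregate_shares = [0.4, 0.4, 0.2]  # hypothetical shares at aggregate level
real_shares = [0.6, 0.3, 0.1]       # hypothetical real shares in one unit

good = dig_unit(real_shares, [0.55, 0.33, 0.12], aggregate_shares)
perfect = dig_unit(real_shares, real_shares, aggregate_shares)
naive = dig_unit(real_shares, aggregate_shares, aggregate_shares)
poor = dig_unit(real_shares, [0.1, 0.3, 0.6], aggregate_shares)
print(round(good, 3), perfect, naive, round(poor, 3))
```

The negative value for the last estimate shows the limitation mentioned in the text: a model that does worse than simply assigning the aggregate shares to every unit is penalized below zero.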

Concluding Remarks
The methodology used enables us to solve the problem of lack of data regarding the main categories of breeding livestock with satisfactory results.It was possible to demonstrate that dynamic methodologies based on Howitt and Reynaud [13,14], which initially were applied to the data disaggregation of agricultural occupation, are flexible enough to be applied with adaptations, to other types of data and still obtain satisfactory results.However, there are some recommendations that should be made for future studies in this area.
The proposed model was applied using a direct disaggregation process for the county of Castelo de Vide due to the lack of data about land uses.However, if there are valid data regarding several land uses, including the relation between them and livestock, they should be inserted in future studies.This will imply better development of the land use disaggregation model, but it also may give the chance of developing a combined approach, considering the simultaneous disaggregation of the livestock breeding in extensive systems and the main land uses.This is a considerable challenge, since the studies carried out until now showed unsatisfactory results in several cases.
A future application of the methodology proposed should also consider variations in the relation between agricultural and forest occupation and the number of livestock NH through time, as well as the application to a set of counties or territorial units.Taking into account these issues, we envisage the development of this methodology in a more complex way, with the inclusion of new databases, since in spite of its simplicity it allows us to obtain relevant information regarding animal production and management.
This methodology could play an important role, considering that some regulations and policy measures may influence the livestock breeding activity and hence change its relation with agricultural and forest occupation. A good knowledge of the real situation will allow a better understanding of the consequences under different scenarios.

Table 1: The indexes of conversion into normal heads.

Table 2: Estimated livestock intended for breeding in Castelo de Vide.

Table 3: Estimated historical data of livestock intended for breeding in Castelo de Vide.

Table 4: The number of livestock estimated for each of the Alentejo's subregions.

Table 5: Proportion of total livestock numbers estimated in Castelo de Vide.

Table 6: Livestock numbers in Castelo de Vide.