Nonreplicated Static Data Allocation in Distributed Databases Using Biogeography-Based Optimization

Data allocation is one of the key design issues of a distributed database. A major cost of query execution in a distributed database system is the cost of transferring data from one site to another. The allocation of fragments among the sites of the network therefore plays an important role in the performance of the system. The main objective of data allocation is to place the data fragments at the sites in such a way that the total data transfer cost incurred while executing a set of queries is minimized. In this paper, a new biogeography-based optimization (BBO) algorithm is used to allocate the fragments during the design of a distributed database system. To show the performance of the proposed algorithm, the results of the BBO algorithm for data allocation are compared with those of a genetic algorithm.


Introduction
Distributed database technology is one of the most important developments of the past two decades in the field of database systems. It is the union of two separate branches of computer science: database systems and computer networks [1]. Because of its decentralized nature, distributed database technology has become an integral part of most business organizations. Distributed databases eliminate many of the shortcomings of centralized databases and fit more naturally into the decentralized structures of many organizations [2]. A distributed database can be defined as a collection of logically interrelated data distributed over the sites of a computer network [1]. A distributed database system has several advantages over a centralized one [1,2]: (i) reduced communication overhead, (ii) improved performance, (iii) reliability, (iv) availability, and (v) expandability.
The design of a centralized database involves two main issues: designing the conceptual schema and designing the physical database. The design of a distributed database adds two more: designing the fragmentation of global relations and allocating the fragments over the network [2]. These additional issues complicate the design. Fragmenting the database is a difficult problem in itself, and a number of techniques have been proposed for it in earlier research. This study concentrates only on the data (fragment) allocation problem.
Fragment allocation can be further divided into two categories: replicated (redundant) and nonreplicated (nonredundant) [1,2]. In a nonreplicated allocation, exactly one copy of each fragment exists across all the sites, while in a replicated allocation fragments are copied to multiple sites. Data in a distributed database system is allocated according to one of two types of access patterns: static and dynamic [1]. In a static environment, the probabilities with which applications running at different sites access the fragments never change, whereas in a dynamic environment these probabilities change over time.
Chu [3] was the first to develop a model that minimizes overall operating costs under constraints on response time and storage capacity, with a fixed number of copies of each file. Casey [4] investigated Chu's allocation model further, relaxed the assumption of a fixed number of copies, and stressed the difference between updates and retrievals. Eswaran [5] proved that Casey's formulation is NP-complete, so finding an optimal solution is not computationally feasible. Ceri et al. [6] considered the problem of file allocation for typical distributed database applications with a simple model of transaction execution. They proposed a nonreplicated allocation of data and suggested that, once the optimal nonreplicated solution has been found, replication can be handled easily by applying a greedy algorithm.
Apers [7] showed that the fragment allocation problem in a distributed database is altogether different from the file allocation problem, and proposed a data allocation method that minimizes the total data transfer cost during the execution of a set of transactions. Sarathy et al. [8] gave a nonlinear integer programming formulation for fragment allocation. Tamhankar and Ram [9] gave an integrated method for fragmentation and allocation. Corcoran and Hale [10] and March and Rho [11,12] presented genetic algorithm-based approaches for allocating operations to nodes. Loukopoulos and Ahmad [13] gave genetic algorithms for static and adaptive distributed data replication. Ahmad et al. [14] compared a genetic algorithm, a simulated evolution algorithm, a mean field annealing algorithm, and a neighborhood search algorithm for data allocation in distributed database design, and showed that the genetic algorithm is an attractive choice when efficiency and solution quality are equally important [14]. Menon [15] presented an integer programming formulation for the nonredundant version of the fragment allocation problem. Hababeh et al. [16] gave a high-performance computing method for data allocation in distributed database systems using a cluster-based approach for network sites. Rahmani et al. [17] tried to improve the allocation of Hababeh et al. by incorporating a genetic algorithm. More recently, Abdalla [18] gave a synchronized design technique for efficient data distribution.
In all of the above approaches, data allocation is based on static data access patterns. Brunstrom et al. [19] proposed an optimization algorithm for nonreplicated dynamic allocation of fragments in distributed database systems. Ulus and Uysal [20,21], Singh and Kahlon [22], and Abdallaha et al. [23] gave the threshold algorithm, the TTC algorithm, and the POE algorithm, respectively, to further improve the performance of the optimization algorithm of Brunstrom et al. [19]. Wolfson et al. [24] introduced a model for adaptive replicated data allocation for data redistribution.
In this paper, a new algorithm for nonreplicated static allocation of fragments during distributed database design, based on the biogeography-based optimization technique, is introduced. To show the performance of the proposed algorithm, its results are compared with those of the genetic algorithm of Ahmad et al. [14]. The proposed algorithm produces good-quality solutions in less time.

The Data Allocation Model
2.1. Fragment Allocation Problem. Assume a distributed database system consisting of sites $S = \{s_1, s_2, \ldots, s_m\}$ on which a set of queries $Q = \{q_1, q_2, \ldots, q_p\}$ is running. Each site has its own processing power, memory, and local database system, and all the sites are connected by a communication network. Let $F = \{f_1, f_2, \ldots, f_n\}$ be the set of fragments obtained by partitioning all global relations during the fragmentation phase of distributed database design. The allocation problem involves finding the optimal placement of the fragments ($F$) on the sites ($S$). Optimality can be defined with respect to two measures, minimal cost and performance [1].

The Cost Model. Table 1 describes the notation used in the cost model of data allocation.
There are primarily two types of costs associated with the execution of a query. The first is the cost of retrieving fragments to process the query and the second is the cost of updating fragments to process that query. The total cost of data transfer is therefore given by: total data transfer cost (TC) = retrieval cost (RC) + update cost (UC). Here $CC_{i,l}$ is the communication cost between the $i$th site and the site $s_l$ at which the $k$th fragment is allocated, and $CC_{i,l} = 0$ if $i = l$.
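The retrieval and update costs can be illustrated with a small computation. The sketch below is one plausible reading of the notation in Table 1, not the paper's exact formula: it assumes each cost term is the product of query execution frequency, access frequency, fraction of the fragment transferred, fragment size, and communication cost. The array names (`qf`, `rf`, `rp`, `uf`, `up`, `cc`, `size`) and the toy data are hypothetical.

```python
def total_transfer_cost(alloc, qf, rf, rp, uf, up, cc, size):
    """Return (RC, UC, TC) for a nonreplicated allocation (assumed cost form).

    alloc[k]           : index of the site holding fragment k
    qf[i][j]           : execution frequency of query j at site i
    rf[j][k], uf[j][k] : retrieval / update frequency of fragment k by query j
    rp[j][k], up[j][k] : fraction of fragment k retrieved / updated by query j
    cc[i][l]           : communication cost between sites i and l (cc[i][i] == 0)
    size[k]            : size of fragment k
    """
    rc = uc = 0.0
    for i, site_queries in enumerate(qf):
        for j, freq in enumerate(site_queries):
            for k, site in enumerate(alloc):
                # Data volume factor shared by the retrieval and update terms.
                volume = freq * size[k] * cc[i][site]
                rc += volume * rf[j][k] * rp[j][k]
                uc += volume * uf[j][k] * up[j][k]
    return rc, uc, rc + uc


# Toy instance (hypothetical data): 2 sites, 1 query, 2 fragments.
qf = [[2], [1]]              # query 0 runs twice at site 0, once at site 1
rf = [[1, 1]]
rp = [[1.0, 0.5]]
uf = [[1, 0]]
up = [[0.5, 0.0]]
cc = [[0, 10], [10, 0]]
size = [100, 200]

print(total_transfer_cost([0, 1], qf, rf, rp, uf, up, cc, size))  # (3000.0, 500.0, 3500.0)
print(total_transfer_cost([0, 0], qf, rf, rp, uf, up, cc, size))  # (2000.0, 500.0, 2500.0)
```

In this toy instance, placing both fragments at site 0 is cheaper because the site that runs the query most often then accesses everything locally.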

Cost Function.
The main objective of the study is to generate a fragment allocation schema that minimizes the total data transmission cost during the execution of database queries. So, the cost function to be minimized is the total data transfer cost: $$\min\; TC = RC + UC.$$

Simon [25] developed biogeography-based optimization (BBO). BBO is primarily based on "The Theory of Island Biogeography" by MacArthur and Wilson [26], who gave a mathematical model of biogeography. According to this model, the rate of change in the number of species on an island depends strongly on the balance between the immigration of new species onto the island and the emigration of established species [26].
The BBO algorithm works on a population of habitats (or islands). Each habitat represents a possible solution to the problem at hand. Each solution feature of a habitat is called a suitability index variable (SIV) of that habitat. The fitness of each habitat is represented by its habitat suitability index (HSI), a metric that determines the goodness of a candidate solution. Habitats with a high HSI tend to have a large number of species, while those with a low HSI have a small number of species. Habitats with a high HSI have many species that emigrate to nearby habitats, so they have a low species immigration rate ($\lambda$) and a high species emigration rate ($\mu$). Habitats with a low HSI have a high species immigration rate ($\lambda$) because of their sparse populations. This immigration of new species to low HSI habitats may raise the HSI of the habitat, because the suitability of a habitat is proportional to its biological diversity. However, if a habitat's HSI remains low, then the species that reside there will tend to go extinct, which opens the way for additional immigration. Low HSI habitats are therefore more dynamic in their species distribution than high HSI habitats. Figure 1 shows the relationships between the fitness of habitats (number of species), immigration rate ($\lambda$), and emigration rate ($\mu$) [25].
The immigration rate ($\lambda$) and emigration rate ($\mu$) are functions of the number of species in the habitat. They can be calculated as follows [25]: $$\lambda_k = I\left(1 - \frac{k}{n}\right), \qquad \mu_k = \frac{E\,k}{n},$$ where $I$ is the maximum possible immigration rate; $E$ is the maximum possible emigration rate; $k$ is the number of species of the $k$th individual; $n$ is the maximum number of species.
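As a small worked example of these rates, the sketch below assumes the linear migration model of [25], in which immigration falls and emigration rises linearly with the species count. The function name `migration_rates` and the default values `I = E = 1.0` are illustrative choices, not taken from the paper.

```python
def migration_rates(k, n, I=1.0, E=1.0):
    """Linear BBO migration model: return (lambda_k, mu_k).

    k : number of species in the habitat (0 <= k <= n)
    n : maximum number of species
    I : maximum immigration rate, E : maximum emigration rate
    """
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    lam = I * (1 - k / n)   # immigration is highest for an empty habitat
    mu = E * k / n          # emigration is highest for a full habitat
    return lam, mu


print(migration_rates(0, 10))    # (1.0, 0.0)
print(migration_rates(5, 10))    # (0.5, 0.5)
print(migration_rates(10, 10))   # (0.0, 1.0)
```

With `I == E`, the two rates always sum to `E`, so a habitat that is a good source of features is correspondingly unlikely to be overwritten.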

BBO Algorithm.
In biogeography-based optimization, there are two operators: migration and mutation [25]. A population of candidate solutions can be represented by different design variables. Each design variable of a particular population member is considered a suitability index variable (SIV). Migration is a probabilistic operator that improves the quality of a habitat. The immigration and emigration rates of each solution are used to probabilistically share information between habitats. For each habitat $H_i$, its immigration rate $\lambda_i$ is used to decide probabilistically whether or not to immigrate. If immigration is selected, then the emigrating habitat $H_j$ is selected probabilistically based on its emigration rate $\mu_j$. Migration is represented as [25] $$H_i(\mathrm{SIV}) \leftarrow H_j(\mathrm{SIV}).$$

Mutation is a probabilistic operator that randomly modifies a habitat's SIV. A randomly generated SIV replaces a selected SIV in the solution $H_i$ according to a predefined mutation probability. The main purpose of mutation is to increase the diversity of the population. Mutation is useful for both poor and good solutions. For low HSI solutions, mutation gives an opportunity to enhance their quality, and for high HSI solutions, mutation may improve them further [25]. The biogeography-based optimization algorithm is given in Algorithm 1 [25,27].

Encoding of Habitat.
In the proposed BBO algorithm for data allocation, the allocation of each fragment to a site of the communication network is encoded in a binary representation. For example, if a data fragment is assigned to site 2, then its assignment value is 10. The assignment values of all the data fragments are concatenated to form a habitat. For example, in the case of 3 sites and 5 fragments, a habitat is represented by 5 groups of 2 bits each, one group for each fragment. If fragment 1 is allocated to site 3, fragment 2 to site 1, fragment 3 to site 3, fragment 4 to site 2, and fragment 5 to site 3, then the habitat for these allocated fragments is [11 01 11 10 11]. The habitat suitability index (HSI) is the total cost of the data fragment allocation, so a lower total cost corresponds to a better habitat.
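The encoding in this example can be sketched in a few lines. The functions below are illustrative only: the names `encode_habitat` and `decode_habitat` are hypothetical, sites are numbered from 1 as in the example, and the bit width is assumed to be the number of bits needed to write the largest site number (2 bits for 3 sites), which the paper does not state explicitly.

```python
def encode_habitat(allocation, num_sites):
    """Encode a list of 1-based site numbers as space-separated bit groups."""
    width = num_sites.bit_length()   # assumed width: bits needed for the largest site number
    return " ".join(format(site, f"0{width}b") for site in allocation)


def decode_habitat(habitat, num_sites):
    """Decode space-separated bit groups back into 1-based site numbers."""
    sites = [int(group, 2) for group in habitat.split()]
    for site in sites:
        # Some bit patterns (e.g. 00 with 3 sites) do not name a valid site.
        if not 1 <= site <= num_sites:
            raise ValueError(f"bit group decodes to invalid site {site}")
    return sites


habitat = encode_habitat([3, 1, 3, 2, 3], num_sites=3)
print(habitat)                      # 11 01 11 10 11
print(decode_habitat(habitat, 3))   # [3, 1, 3, 2, 3]
```

A binary encoding can produce bit groups that match no site after migration or mutation, so an implementation along these lines would need some repair or rejection rule; the validity check in `decode_habitat` is one simple option.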

Results
To check the performance of the proposed BBO algorithm, it is compared with the GA of Ahmad et al. [14]. Experiments are conducted with the number of fragments ranging from 4 to 24 and the number of sites fixed at 4 and 8. All the experiments are run on a 2.8 GHz Intel Core i5 processor with 4 GB RAM and 64-bit Microsoft Windows 7 as the operating system. The BBO and GA algorithms are implemented in the MATLAB 2010 programming environment. The communication network topology, the communication cost between sites, the size of the fragments, the number of queries, the execution frequency of each query at each site, and the retrieval and update frequencies of the fragments are randomly generated from uniform distributions for each experiment [8,14,15]. Both algorithms are tested on the same data set for each experiment and the values of other parameters are given as follows:

Tables 2 and 3 summarize the experimental results obtained from BBO and GA for 4 sites and the number of fragments ranging from 4 to 24. Tables 4 and 5 show the experimental results obtained from BBO and GA for 8 sites and the number of fragments ranging from 4 to 24. These results are obtained after running both algorithms 20 times independently for each experiment.
Tables 2 and 4 show that the minimum cost achieved by the proposed BBO algorithm for the allocation of data fragments is lower than the minimum cost achieved by GA. In all the experiments, the BBO algorithm provides a better allocation schema than GA-based fragment allocation. Table 3, Figure 4, Table 5, and Figure 7 also show that the BBO algorithm is faster than GA-based fragment allocation. However, the average cost of fragment allocation for BBO is higher than that of GA in some cases, as shown in Tables 2 and 4. Figures 2, 3, 5, and 6 show the convergence of GA and BBO. In most cases BBO converges faster than GA. Overall, the proposed BBO algorithm for fragment allocation provides good-quality solutions in less time.

Conclusion
This paper presents a new biogeography-based optimization technique for nonreplicated static allocation of data fragments during the design of a distributed database. To evaluate the performance of the proposed algorithm, its results are compared with those of GA. The results show that the proposed technique for data fragment allocation provides good-quality solutions quickly and reduces the data transfer cost during the execution of a set of queries. In some cases the average allocation cost of BBO is higher than that of GA, but given its short running time and the quality of its best solutions, BBO is a capable algorithm for fragment allocation during distributed database design.

Figure 2: Convergence of GA and BBO for 4 fragments and 4 sites.

Figure 3: Convergence of GA and BBO for 24 fragments and 4 sites.

Figure 4: Minimum and average running time of GA and BBO for 4 sites.

Figure 6: Convergence of GA and BBO for 24 fragments and 8 sites.

Figure 7: Minimum and average running time of GA and BBO for 8 sites.

Table 1: Description of various notations.
The execution frequency of the $j$th query at the $i$th site.
$RF_{jk}$: the retrieval frequency of the $k$th fragment by the $j$th query.
Percentage of the $k$th fragment needed for retrieval by the $j$th query.
$UF_{jk}$: the update frequency of the $k$th fragment by the $j$th query.
Percentage of the $k$th fragment needed to be updated by the $j$th query.
$CC_{i,l}$: the communication cost from site $s_i$ to site $s_l$.
$\mathrm{Size}(f_k)$: the size of the $k$th fragment.

Table 2: Minimum and average cost achieved by GA and BBO for 4 sites.

Table 3: Minimum and average running time of GA and BBO for 4 sites.

Table 4: Minimum and average cost achieved by GA and BBO for 8 sites.

Table 5: Minimum and average running time of GA and BBO for 8 sites.
3.1. Biogeography-Based Optimization. Biogeography-based optimization (BBO) is a recently developed population-based evolutionary technique. It is based on the theory of biogeography, the study of the geographical distribution of species.

Algorithm 1: The biogeography-based optimization algorithm [25,27].
Step 1: Initialize the population size, maximum number of iterations (NI), maximum immigration rate (I), maximum emigration rate (E), mutation rate, and elitism parameter;
Step 2: Generate a random set of habitats based on the size of the population. Each habitat corresponds to a potential solution to the given problem;
Step 3: Evaluate the habitats and compute the corresponding HSI value of each habitat;
Step 4: For $t$ = 1 to NI
Step 5:   Calculate the immigration rate ($\lambda$) and emigration rate ($\mu$) for each habitat according to its HSI;
          /* Start of Migration */
Step 6:   Select a non-elite habitat $H_i$ with probability $\propto \lambda_i$ for immigration;
Step 7:   if $H_i$ is selected then select $H_j$ with probability $\propto \mu_j$ for emigration;
Step 8:     if $H_j$ is selected then randomly select an SIV from $H_j$;
Step 9:       $H_i(\mathrm{SIV}) \leftarrow H_j(\mathrm{SIV})$;
Step 10:    End if
Step 11:  End if
          /* End of Migration */
          /* Start of Mutation */
Step 12:  Select an SIV in $H_i$ with probability based on the mutation rate;
Step 13:  if $H_i(\mathrm{SIV})$ is selected then replace $H_i(\mathrm{SIV})$ with a randomly generated SIV;
Step 14:  End if
          /* End of Mutation */
Step 15:  Re-evaluate the habitats and compute the corresponding HSI value of each habitat;
Step 16: End for
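The steps listed above can be sketched as a short program. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: habitats are integer vectors (one site index per fragment) rather than bit strings, species counts are assigned by rank, migration is applied per SIV as in the common formulation of [25], and the function name `bbo_minimize`, its default parameter values, and the toy cost function are all hypothetical.

```python
import random


def bbo_minimize(cost, num_fragments, num_sites, pop_size=20, iterations=100,
                 I=1.0, E=1.0, mutation_rate=0.05, elite=2, seed=0):
    """Return the lowest-cost habitat found; habitat[k] is the site of fragment k."""
    rng = random.Random(seed)
    # Step 2: random initial habitats.
    pop = [[rng.randrange(num_sites) for _ in range(num_fragments)]
           for _ in range(pop_size)]
    n = pop_size
    for _ in range(iterations):                          # Step 4
        pop.sort(key=cost)                               # Steps 3/15: lowest cost first
        # Step 5: rank r gets species count k = n - r (the best habitat has k = n).
        lam = [I * (1 - (n - r) / n) for r in range(n)]
        mu = [E * (n - r) / n for r in range(n)]
        new_pop = [h[:] for h in pop]                    # the first `elite` stay unchanged
        for i in range(elite, n):                        # non-elite habitats only
            for d in range(num_fragments):
                if rng.random() < lam[i]:                # Step 6: immigrate into H_i?
                    j = rng.choices(range(n), weights=mu)[0]   # Step 7: pick emigrating H_j
                    new_pop[i][d] = pop[j][d]            # Step 9: H_i(SIV) <- H_j(SIV)
                if rng.random() < mutation_rate:         # Steps 12-13: mutation
                    new_pop[i][d] = rng.randrange(num_sites)
        pop = new_pop
    return min(pop, key=cost)


# Toy cost (hypothetical): number of fragments placed away from a preferred site.
preferred = [2, 0, 1, 3, 0, 2]
toy_cost = lambda h: sum(a != p for a, p in zip(h, preferred))

best = bbo_minimize(toy_cost, num_fragments=6, num_sites=4)
print(best, toy_cost(best))
```

Because the elite habitats are copied unchanged into each new generation, the best cost in the population never gets worse from one iteration to the next. In a real run, `toy_cost` would be replaced by the total data transfer cost of the allocation.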