Subquery Allocation Problem and Heuristics for Secret Sharing Distributed Database System

We discuss query optimization in a secure distributed database system, called the Secret Sharing Distributed DataBase System (SSDDBS). We have to consider not only subquery allocations to distributed servers and data transfer on the network but also decoding distributed shared data. At first, we formulated the subquery allocation problem as a constraints satisfaction problem. Since the subquery allocation problem is NP-complete in general, it is not easy to obtain the optimal solution in practical time. Secondly, we proposed aheuristic evaluationfunction for the best-first search. We constructed an optimization model on an available optimization software, and evaluated the proposed method. The results showed that feasible solutions could be obtained by using the proposed method in practical time, and that quality of the obtained solutions was good.


Introduction
The need for distributed databases (DDB) [1] is increasing as information technologies develop and the amount of data to be handled is growing exponentially. For databases such as those containing personal data, confidentiality, dependability, and robustness are becoming increasingly important. We have proposed a Secret Sharing Distributed Database (SSDDB) [2] that combines a secure distributed storage system [3][4][5] with a relational database system. The relations are divided into fragments in the SSDDB, and the fragments are managed by encrypting with a (k, n) threshold scheme [6]. Thus, the fragment must be both encrypted and decrypted every time there is a request to the database system to perform a search or some other functions. This makes it essential to have optimized parallel operation.
Optimization of the query process in an SSDDB proceeds in 3 stages: optimization of the query structure, optimization of subquery allocation to servers employed, one for each fragment request, and optimization of the combination of subquery results. This paper addresses the issue of subquery allocation to servers.
Optimization of the query process in an ordinary DDB is precisely described in [1]. A database management system must decide which server to use for processing when the same data are in distributed locations due to replication, even for ordinary distributed databases. Optimization of the query process for such a case is not sufficiently described in [1]. Evrendilek et al. proposed a query optimization problem which considers the data replication in [7]. However, with SSDDB, a server must be selected for decrypting fragments, and the procedures suggested for allocating subqueries to servers in ordinary DDBs are not very effective here. There exists no research on optimization of the query process for SSDDB as far as we know.
There are three factors to consider when allocating subqueries to servers: firstly, subquery allocation to servers; next, determination of the server storing the fragment needed to handle the subquery; finally, determination of the server for the necessary decrypting of the fragment. Subquery allocation to servers in an ordinal distributed database is known to be an NP-complete problem [7]. Since the subquery allocation problem in an ordinal distributed database is a special case of that of SSDDB, subquery 2 Journal of Computer Systems, Networks, and Communications allocation to servers becomes an NP-complete problem. This paper is an investigation of best-first search methods with a view to online use, and proposes heuristic methods for the best-first search.
Section 2 provides an overview of SSDDB and Section 3 demonstrates the formulation of the subquery allocation problem as a constraint satisfaction problem. A heuristic evaluation function for a best-first search is then proposed. Section 4 shows the execution of an experimental calculation and demonstrates the effectiveness of the proposed heuristic evaluation function.

Secret Sharing Distributed Database System
2.1. The (k, n) Threshold Scheme. The secret sharing distributed database system is software that safely performs distributed sharing of confidential information. The system described herein uses the (k, n) threshold scheme demonstrated by Shamir [6].
Let F represent an algebraic number field and a confidential datum s be an element of F. A (k−1)th degree polynomial equation is created using s and random coefficients where r 0 = s. The value w j = f (x j ) is calculated for each of the n distinct values of x j ∈ F ( j = 1, . . . , n, x j / = 0). The value w j is called a share and x j is called the ID of share w j .
Since f (x) is a (k − 1)th degree polynomial, it is uniquely defined by its values at k different points (x j , f (x − j)). The confidential datum s can be retrieved by evaluating s = f (0).
In the (k, n) threshold scheme, it is impossible to decrypt s using fewer than k shares. There is no dependence on how shares are selected, so even if up to n − k shares are lost, it remains possible to decrypt s.

SSDDB
Structure. An outline of the structure of an SSDDB is presented in Figure 1. The SSDDB is composed of a database management system (DBMS) and a secret sharing storage system (SSSS). The DBM process on the DBMS and the SSS process on the SSSS are assumed to be connected to each other over the Internet and each is able to communicate with the other individually.
When the DBM receives a query from an external source, it creates a query processing plan. It carries out this processing while assigning subqueries to other DBMs for parts of the processing and sending requests to DBMs and SSSs to retrieve or decrypt the necessary data sets. Thus, the user can approach the DBMS in the same way as approaching an ordinary distributed database management system, but an SSSS is used for storage, so the specificity of an SSSS must be considered in the query optimization, as described below.
The data sets are divided into small sets (fragments) and encrypted using the (k, n) threshold scheme. This process is managed by the SSSS. A DBM is also capable of caching data sets. Only DBM3 in the figure is caching data. The dashed lines in the figure indicate physical boundaries. DBM1, SSS1, DBM2, and SSS2 are each separate processes running on the same computer. In other words, no SSS process is running on the computer where DBM3 is running. A DBM can request an SSS to retrieve and decrypt a fragment on a different computer. However, this generates a cost due to transmission of data over the network.

Outline of Query Processing in an SSDDB.
Optimization of query processing in an SSDDB consists of the three stages of firstly, query structure optimization, secondly, optimization of subquery allocation to multiple servers, one for each fragment request, and thirdly, optimization of the combination of the subquery results.
In query structure optimization, the query is subdivided by employing information in the fragments. The first task is to eliminate any redundant expressions sometimes included in user-generated queries. Queries are also rewritten or divided into subqueries in order to reduce the volume of intermediate data. At this point, SSDDB specificity is not yet required, so the ordinary procedure for a DDBS [1] can be used as is.
The generated subqueries can be processed by separate servers. In the second stage, a server for processing a subquery is selected so as to minimize the processing cost of the query, by considering fragment allocations and server loads. In the SSDDB, all fragments except for those in the cache are distributed as shares in the SSSS. Thus, no particular fragment is assigned to any particular server. The cost of recreating distributed fragments must be calculated. Sections 3.2 and 3.3 address the issues of allocating subqueries to servers while considering the specificity of the SSDDB.
Finally, the results of processing the subqueries assigned to the different servers are combined. As with the first step, SSDDB specificity is not required, so again the ordinary procedure for a DDBS [1,7] can be applied as is.

Subquery Allocation Problem to Servers.
Let S be a finite set of servers with |S| = p. Here, the term server is assumed to mean a physical computer, at most one DBM process and at most one SSS process are running on each server. F is a finite set of fragments with |F| = m, and SQ is a finite set of subqueries with |SQ| = r.
Let d j,h ∈ R + be the cost required to decrypt some fragment f j ∈ F on some server s h ∈ S. Here, R + is the set of all nonnegative real numbers. Costs are set considerably higher in servers where an SSS process is not running, in order to avoid assigning decryption requests to such servers. When a fragment f j is decrypted on a server s h , the cost for transmitting shares from other servers is assumed to be The cost for processing a subquery sq i on a server s h is assumed to be w i,h ∈ R + . As was the case with decrypting costs, this should be set rather high for servers where a DBM process is not running. The cost necessary in order to transmit a fragment f j from a server s h to a servers h is written t f j,h,h . The load cost of a server s h is assumed to be l h ∈ R + . Some additional variables are employed here. v i, j is set to the value of 1 when a subquery sq i calls a fragment f j ; otherwise, to the value of 0. c j,h is set to 1 when a fragment f j is cached in a server s h ; otherwise, to 0. x i,h is set to 1 when a subquery sq i is allocated to a server s h ; otherwise, to 0. y h1, j,h2 is set to 1 when a fragment f j is transmitted to a server s h1 , where it is required, from a server s h2 ; otherwise, it is 0. z j,h is set to 1 when a fragment f j is decrypted at a server s h ; otherwise, it is 0.
Normally, parameters of disk I/O, CPUs and the network can be used when constructing a query processing cost model in a distributed database. For an SSDDB, disk I/O can be considered a part of the share decryption cost; the following two costs are employed.
The calculation cost SC h in a server s h is defined with the following expression: The network cost NC h for a server s h is defined thus: The above expressions indicate that the subquery allocation problem to servers becomes a constraint satisfaction problem for the decision variables x i,h , y h1, j,h2 , and z j,h min max subject to Here, the object function prevents the subqueries from being concentrated on a single server. Equation (5) is to ensure that each subquery is processed only once by one of the servers. Equation (6) indicates that the fragment a server requires must always be transmitted from some server (please note that the h 1 = h 2 case is permitted). Equation (7) indicates that when a fragment f j is transmitted from a server s h1 to a server s h2 , s h1 must have either decrypted f j or been holding it in its cache.

Subquery Allocation to Servers Based on Best-First
Searches. Subquery allocation to servers in an ordinal distributed database is known to be an NP-complete problem [7]. If k is set to 1 in the (k, n) threshold scheme in an SSDDB, this describes an ordinal distributed database. Accordingly, the issue of subquery allocation to servers becomes an NPcomplete problem. This paper is an investigation of best-first search methods with a view to online use. In combinatorial optimization problem, searching the optimal solution is done by expanding states iteratively. The search process could be described by a tree structure, called the search tree. Figure 2 is an example of the search tree.
Based on preliminary experiments, we heuristically decided the order of decision variables as x i,h , z j,h , y h1, j,h2 as shown in Figure 2. The depth of the search tree of each step is r, p * m, and r * m * p. But at the time when variables x i,h and z j,h are decided, most of variables y h1, j,h2 have been decided by evaluating (5), (6), and (7).
In the best-first search, the most promising state is chosen by evaluating a function, called the heuristic evaluation function. Let us consider a situation expanding the state s in Figure 2; we are going to choose a server to process the subquery sq i . The heuristic evaluation function G(x i,h ) shows an expected cost when choosing a server s h , and the search process chooses the state whose G(x i,h ) is minimal.
The fragments in the SSDDB are encrypted using the (k, n) threshold scheme and the shares are distributed to the various servers. Therefore, in order to process a subquery, the shares are transmitted and the fragment must be decrypted. Let us assume here that a subquery sq i is processed by a server s h , and that in addition to the cost w i,h of processing sq i on s h , there is a cost for the decryption of the fragment f j , which is needed for the processing in sq i , and transmitting f j to server s h . Accordingly, the minimum possible value for the cost of processing a subquery sq i on a server s h is as follows: When a subquery sq i is supposed to be processed on a server s h , a server to decrypt the corresponding fragment f j must be decided. The server s h2 which gives the minimal value in (8) is the most promising one. Accordingly, a heuristic evaluation function for choosing a server to decrypt a fragment f j is given as follows: Since most of variables y h1, j,h2 could be decided by evaluating (5), (6), and (7), we have not prepared any heuristic evaluation function for the variable y h1, j,h2 .

Computational Experiment
This problem of subquery allocation to servers was solved with an ILOG solver (Ver. 5.1, ILOG). The problem was described using OPL. The operating system was Linux (CPU: Celeron 2.0 GHz, Memory: 1 GB).
The following 2 procedures were used for comparison.
No Heuristic. The ILOG Solver engine was used as is without heuristic evaluation functions. This solver determines variables in the order they are described. In the experimental model, they were described in the order x, z, y.

EDN. The heuristic evaluation function employed by
Evrendilek et al. [7], which is referred to as EDN, was used in (10).
Equation (10) ensures that the processing for subqueries is distributed as widely as possible.
The results are presented in Table 1. Symbols p, m, and r in the table denote the numbers of servers, fragments, and subqueries, respectively. D denotes the dominant cost among the decryption cost (d), subquery processing cost (w) and transmission cost for shares and fragments (t). As indicated by n, all of these costs are about the same. The dominant cost was selected at random from the range [100, 999] and the other costs were randomly selected from [10, 99]. Five problems were constructed using identical conditions. The calculation time was limited to 1 hour and f was the number of searches that were not completed within the time limit. Blank columns indicate that searches were completed for all problems. The value t i was the mean time required to find a feasible solution (initial feasible solution). The value t o was the mean time required to find the optimal solution. The value t f was the mean time required to complete the search. The means for t o and t f were taken over the searches that were completed. All times were stated in seconds.
The value is the ratio between the evaluated values of the initial feasible solution and the optimal solution and represents the mean over problems for which the optimal solution was identified.
The results when p, m, and r were set to identical values and the scale of the problem was varied are listed in the upper part of Table 1. The time t i was within 0.1 s and the feasible solution was easily found with all the methods. Also, as shows, the proposed procedure proved to be better than the other procedures at finding the initial feasible solution.
The time required to complete the search grew exponentially with the scale of the problem. Furthermore, the search was never completed within the time limit under the No Heuristic procedure, even when the problem scale was small. Thus, it is essential to develop a procedure for quickly deriving a useful feasible solution for the subquery allocation problem.
The proposed method was able to find optimal solutions for a greater quantity of the problems within the time limit than the EDN approach. Also, on average, the proposed method was superior from the perspective of times to find the optimal solution and complete the search. It should be noted that the values for t o and t f are the means of problems seeking optimal solutions within the time limit. For example, under p = 8, m = 8, r = 8, and D = w, for the proposed Thus, the proposed method is superior to both EDN and the No Heuristic procedure in terms of both initial feasible solution and calculation time.
Let us turn to some observations about the relationship between problem characteristics and solution methods. First, when the transmission cost is dominant (D = t), the searches are completed more quickly than for other problems of similar scale. On the other hand, when the subquery processing cost or decryption cost is dominant (D = w or D = d), the time required for search completion increases with the scale of the problem. This seems to be because transmission cost depends greatly on server selection.
We observed changes in the calculation time when only one of the parameters p, m, or r was changed (lower part of Table 1).
It takes more time to derive an initial feasible solution when the number of servers is increased; on the other hand, the time to identify an initial feasible solution remained within 1 s when either the number of fragments or subqueries was increased. The value of was raised relatively high by expanding the number of servers. Future studies must address the issue of developing procedures for identifying good feasible solutions at high speed in systems incorporating large numbers of servers.

Conclusions
It is essential to optimize query processing, as this has a great influence over database performance. This study provides a formulation of the subquery allocation problem in a secret sharing distributed database in terms of a constraint satisfaction problem. A heuristic evaluation function is proposed for solutions in best-first searches.
An experiment was conducted using an ILOG Solver to examine the effectiveness of the above function. Some issues remain for the proposed procedure when there are a large number of servers, and further research will be necessary. It will also be necessary to evaluate the performance of this method on an actual system.