BISGA: Recalculating the Entire Boolean-Valued Information System from Aggregates Using a Genetic Algorithm

A Boolean-valued information system (BIS) is an application of a soft set in which the data are mapped in binary form; it is used in applications including, but not limited to, decision-making, medical diagnosis, game theory


Introduction
Soft sets are used to turn uncertain and vague data into crisp and clear data. Pawlak introduced the concepts of soft sets in 1994 [1]. Molodtsov then defined them in 1999 [2]. Soft sets can be used in decision-based applications such as game theory [3], medical diagnosis [4], and financial problems [5]. A BIS is an application of a soft set. It maps the values of the soft set into a table in binary form, which helps in finding more appropriate choices by weighing all objects. The objects that satisfy more parameters are considered the better choices. Table 1 shows the representation of a BIS.
Wrong decisions can be made with an incomplete BIS, which can result in losses to organizations and individuals. A BIS having incomplete data is called an incomplete information system (IIS). There are several causes of an IIS, e.g., improper data entry, errors in communication, and virus attacks. Researchers have been trying to solve this problem. These research works fall into two categories: the preprocessed category, which contains parity bits [6], supported sets, or aggregate sets [7], and the unprocessed category. Supported sets, aggregate sets, or parity bits can be extracted from the BIS before the data are lost or corrupted. The lost data can be recovered by using these sets [8]. The unprocessed category uses the remaining available data in the BIS to recover the lost data. Techniques in the unprocessed category include weighted average [9], probability [10], DFIS [11], ADFIS [12], and DFPAIS [13]. Table 2 shows an incomplete Boolean-valued information system, where missing data are represented by "*".

Rose et al. [6] first introduced the concept of data filling in soft sets through parity bits. 0s and 1s are inserted to make the number of 1s in the data even or odd. Then, they introduced the concept of aggregate sets [7]. There are four aggregate sets: the row aggregate set, column aggregate set, left-right diagonal aggregate set, and right-left diagonal aggregate set, as shown in Tables 3-5. The sum of the values of each row is recorded in the row aggregate set. The sum of the values of each column is recorded in the column aggregate set. The left-right diagonal aggregate set and the right-left diagonal aggregate set are calculated in the same way as the row and column aggregate sets. This method of aggregates is more accurate than parity bits for filling partially missing values in the BIS.

Khan et al. [8] introduced the concept of recalculating the entire BIS from these four aggregates. They solved a problem manually using nonsimultaneous linear equations as a proof of concept. They inserted 1s and 0s according to universal and empty sets. Universal sets are those sets in which every value is 1, and empty sets are those sets in which every value is 0. They also made suppositions when necessary. When a supposition failed, they had to backtrack and start again with an alternative supposition from the binary domain. Because of these suppositions, that solution is hard to implement, and even if it were implemented, it would not be as generic for every BIS as a genetic algorithm (GA). Furthermore, many different BISs can have the same sets of aggregates because of circular patterns in the BIS, as shown in Figure 1, which contains four BISs with the same sets of aggregates. More interestingly, all four tables in Figure 1 have the same aggregates as the example solved manually in our base paper by Khan et al. [8], where the authors recalculated only one table and addressed neither the implementation of their technique nor the other possible tables that would have been calculated had they implemented it. This seems to be one of the possible weaknesses of their approach. It should be noted that each of the different recalculated BISs still satisfies all aggregates and hence can be used for decision purposes. This investigation becomes possible by applying GA to the recalculation of BISs. However, further investigation will be required for recalculating only the original BIS from the set of aggregates.
GA [14] is a bio-inspired metaheuristic algorithm used for search and optimization problems in many domains. The main principle of GA is survival of the fittest: it tries to select the fittest individuals, or chromosomes, from the available population. Two powerful operators, crossover [15] and mutation [16], are applied to the selected chromosomes to further filter the best alleles and genes from them. The fitness of the crossed and mutated chromosomes is then checked, and the best of them are crossed over and mutated again, coming closer to the fittest solution over several iterations until a satisfactory solution is found.
Therefore, this paper recalculates the entire BIS in soft sets from aggregates using GA and presents an algorithm named BISGA. While the main aim is to propose BISGA, this article makes the following four initial contributions:
(1) To identify the constraints which can be applied to chromosomes to narrow down the search space
(2) To find the appropriate genetic operators and customize them for use in the 2D environment
(3) To derive a four-dimensional fitness function
(4) To analyze the accuracy by comparing with the original BIS and the efficiency through the average number of generations
The rest of the paper is organized as follows: the literature review is provided in Section 2. The proposed algorithm is discussed in Section 3. Section 4 contains the results and discussion, and the paper is concluded in Section 5.
Table 2: Incomplete information system.

Literature Review
This section consists of three subsections. In the first subsection, we provide the literature review for soft sets. In the second subsection, we discuss incomplete soft sets and the techniques for handling them, and in the last subsection, we provide some discussion on the genetic algorithm.
2.1. Soft Sets. A soft set is defined as follows: "let U be an initial universal set and P be a set of parameters. A pair (F, P) is called a soft set (over U) if and only if F is a mapping of P into the set of all subsets of the set U." For example, let $U = \{p_1, p_2, p_3, p_4, p_5, p_6\}$ be a set of phones and let P be a set of six features representing "latest phone," "budget," "wireless charging," "5G," "high-resolution camera," and "edge," respectively.
We suppose that the latest phones are $p_2$, $p_3$, $p_4$, and $p_6$, the budget phone is $p_5$, the phones with wireless charging are $p_1$, $p_2$, $p_4$, and $p_6$, the 5G phones are $p_2$, $p_3$, $p_5$, and $p_6$, the phones $p_3$, $p_4$, $p_5$, and $p_6$ have a high-resolution camera, and $p_1$, $p_2$, $p_3$, $p_4$, and $p_6$ are the edge phones. These data are represented in Table 1. The data tell us that $p_6$ is not a budget phone, but it has all the other features, which makes it the most appropriate choice.
Majumdar and Samanta [17] used a soft set to diagnose diseases. Moreover, Kharal [5] used a soft set to point out financial problems. Furthermore, Deli and Cagman [3] demonstrated applications of soft set theory in game theory. They employed some set operations which determine the soft game's solution, making the game easy to apply.
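Returning to the phone example, the mapping from a soft set to a Boolean table can be sketched in a few lines of Python. This is only an illustration: the feature keys (`latest`, `budget`, and so on) and the helper `to_bis` are hypothetical names chosen here, not notation from the paper.

```python
# Hypothetical encoding of the phone example; the feature keys are illustrative.
phones = ["p1", "p2", "p3", "p4", "p5", "p6"]
soft_set = {
    "latest": {"p2", "p3", "p4", "p6"},
    "budget": {"p5"},
    "wireless_charging": {"p1", "p2", "p4", "p6"},
    "5G": {"p2", "p3", "p5", "p6"},
    "high_res_camera": {"p3", "p4", "p5", "p6"},
    "edge": {"p1", "p2", "p3", "p4", "p6"},
}


def to_bis(objects, mapping):
    """Return the Boolean table: one row per object, one column per parameter."""
    return [[1 if obj in members else 0 for members in mapping.values()]
            for obj in objects]


bis = to_bis(phones, soft_set)
# Number of parameters each phone satisfies (the row sums of the table).
scores = {obj: sum(row) for obj, row in zip(phones, bis)}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Here `p6` satisfies five of the six parameters, which is why the text singles it out as the most appropriate choice.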
If, however, the data are not completely available, it is not easy to decide; such a system is called a BIS with missing information, as shown in Table 2.

BIS with Missing Information.
The information in a BIS can be lost because of errors, improper entry, virus attacks, etc.; the resulting system is called an incomplete information system. Researchers have proposed techniques for finding missing or lost data in an incomplete soft set. These techniques are divided into two main categories: the preprocessed category and the unprocessed category.
Figure 1: Four BISs having the same aggregates due to circular patterns; only row and column aggregates are shown here. Readers can check that diagonal aggregates are also the same for these BISs.

Complexity 3
The first attempt in the unprocessed category was made by Zou and Xiao [9], who presented the weighted average technique. In this technique, the decision can be made without finding the missing data. Kong et al. [10] presented a probability-based technique to estimate the missing or lost data. This technique finds values in the range between 0 and 1. Qin et al. [11] presented DFIS, which recovers the lost data by using parameter associations. Consistent associations consider equal values between parameters, and inconsistent associations consider opposite values between parameters. First, the associations between all parameters are calculated, and the lost data are then recovered according to them. Khan et al. [12] proposed in ADFIS that the associations must be recalculated after each iteration. When one piece of data is decisively recovered, the associations involving that piece of data must be recalculated. The recalculation continues with the recently included data until the final piece of data is recovered. Kong et al. [13] presented some cases showing that ADFIS does not work every time. They suggested in DFPAIS that the decision should not be based on the highest association alone. All associations should take part in the decision process. The highest association still has the biggest impact but does not hold all the authority. These works were tested on UCI benchmark datasets, namely four datasets: the zoo, flag, congressional, and heart datasets.
The proposed work belongs to the preprocessed category. Therefore, we discuss that category with its necessary mathematical details, as some of its equations are used in the proposed work. Preprocessed category techniques are only usable when some compressed or extracted form of the lost data is available. Khan et al. [8] proposed that the entire BIS can be regenerated. They used the four aggregate sets introduced by Mohd Rose et al. [7]. Two of these four aggregate sets are the sum of the 1s in every row and the sum of the 1s in every column, yielding the row aggregate and column aggregate sets, respectively, as shown in Table 3. A row aggregate is the sum of the values in the current row taken over all the columns of the BIS, where "R" denotes a row, "u" is the current row, and "p" represents all the columns of the BIS.
Similarly, a column aggregate is the sum of the values in the current column taken over all the rows of the BIS, where "C" denotes a column, "p" is the current column, and "u" represents all the rows of the BIS.
Since diagonals can be taken in two directions, the other two aggregates are the left-right (LR) and right-left (RL) diagonal aggregates. The aggregates of these LR and RL diagonals are calculated as the arithmetic sum of the values in each diagonal, as shown in Tables 4 and 5, where each diagonal is highlighted in a single color. The number of LR and RL diagonals is calculated from |U| and |P|, the numbers of rows and columns in the BIS, respectively; a table with |U| rows and |P| columns has |U| + |P| - 1 diagonals in each direction. Both LR and RL diagonal aggregates are calculated in the two cases given below.
Case 1. When 1 ≤ k ≤ |A|: the LR diagonals starting from the first row and ending with the last column, and the RL diagonals starting from the first row and ending with the first column.
Case 2. When |A| < k < D: the LR diagonals starting from the first column and ending with the last row, and the RL diagonals starting from the last column and ending with the last row.

Selection operators are used to select individuals from the population for breeding [18]. In roulette wheel selection, a wheel or pie is divided among the individuals based on their fitness values. Individuals having better fitness values take a larger slice of the pie.
Crossover operators are used to exchange information among individuals [15]. Two or more individuals are taken, and at least one new individual is produced. Single-point crossover splits each parent's chromosome at one point into two parts and exchanges one part with the corresponding part of the other parent.
Mutation operators mutate the newly born individuals to diversify them from the current individuals [16]. In bit flip mutation, bits are flipped randomly: zero becomes one, and one becomes zero. Swap mutation, in contrast, swaps genes within the chromosome.
Aj and Pd [19] and Larranaga et al. [20] reported that binary genetic algorithms have difficulties with irregular patterns and the Hamming cliff and struggle to attain accuracy. A two-dimensional GA was presented by Tsai et al. [21] for airline scheduling problems. In their study, 2D genomes were treated as a single-dimensional array when applying the crossover operator. A more recent work on GA addresses multivariate missing data imputation, including the filling of continuous and discrete missing data [22], but it differs from the proposed work in that it fills partially missing information rather than the entire matrix.

Proposed Work: BISGA
This section presents BISGA step by step, accompanied by a step-by-step solved example. The algorithm for BISGA is presented in pseudocode 2 and visualized with the help of the flowchart shown in Figure 2.
3.1. Population. For BISGA, the population consists of all possible BISs of the same size. Each BIS is called a chromosome or individual. The size of a chromosome is that of a table in which the number of columns is equal to the length of the column aggregate set and the number of rows is equal to the length of the row aggregate set.
The initial population is generated by randomly assigning ones and zeros to the table, subject to a constraint of row aggregate satisfaction. This constraint is considered to be the first contribution of BISGA and is stated as follows: "each row must be assigned a number of ones equal to the value of its aggregate." With this constraint, BISGA does not need to calculate and check the fitness of the row aggregates at each iteration; only the other three aggregates need to be checked for satisfaction. This constraint not only appears to increase the efficiency of BISGA by 25%, since one of the four aggregates no longer needs checking, but also improves its performance, because the overall number of ones in each chromosome is exactly equal to the number in the original BIS and remains fixed throughout the execution. At least two chromosomes, or parents, must be selected for crossover. Selecting as many parents as possible increases the chances of finding the fittest chromosomes at this early stage.
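The row aggregate constraint described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `seed` parameter, and the population size of four (borrowed from the solved example) are choices made here.

```python
import random


def random_chromosome(row_aggregates, n_cols, rng=random):
    """Build one candidate BIS whose row sums equal row_aggregates."""
    chromosome = []
    for ones in row_aggregates:
        row = [1] * ones + [0] * (n_cols - ones)
        rng.shuffle(row)  # random placement of the ones within the row
        chromosome.append(row)
    return chromosome


def initial_population(row_aggregates, n_cols, size=4, seed=None):
    """Generate `size` chromosomes that all satisfy the row aggregates."""
    rng = random.Random(seed)
    return [random_chromosome(row_aggregates, n_cols, rng) for _ in range(size)]


# Row aggregates {3, 2, 1, 2} of the 4 by 4 solved example.
population = initial_population([3, 2, 1, 2], n_cols=4, size=4, seed=0)
for chromosome in population:
    print(chromosome)
```

Because every row is built with the required number of ones, the row aggregates are satisfied by construction and never need to be scored.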

Selection.
For selecting parents from the population for crossover, both roulette wheel selection and tournament selection operators are suitable for the problem, and both have similar impacts. For the results reported later, we used roulette wheel selection.
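A roulette wheel for this problem has to favor low fitness values, since zero is the best score. The sketch below shows one way to do that; the weighting `1 / (1 + fitness)` is an assumption made here for illustration, as the text does not state how slices are sized, and `roulette_select` is a hypothetical name.

```python
import random


def roulette_select(population, fitnesses, k=2, rng=random):
    """Pick k parents; a lower fitness value gets a larger slice of the wheel."""
    # Assumption: slice size is 1 / (1 + fitness), so zero fitness is the largest slice.
    weights = [1.0 / (1.0 + f) for f in fitnesses]
    # random.choices samples with replacement, so a parent may be drawn twice.
    return rng.choices(population, weights=weights, k=k)


parents = roulette_select(["chromosome 01", "chromosome 02", "chromosome 03"],
                          [10, 0, 4], k=2, rng=random.Random(1))
print(parents)
```

Tournament selection would be an equally reasonable substitute here, consistent with the remark that both operators have similar impacts.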

Fitness Function.
The fitness function in genetic algorithms is used to assess chromosomes or individuals. The fitness function is newly derived for BISGA and is considered to be its main contribution. Initially, the fitness for an aggregate x of an individual is calculated as the absolute difference between the individual's aggregate set and the actual aggregate set using equation (9) as follows:
$\mathrm{fitness}(agg_x) = \sum_{i=1}^{N} \left| ind_x(i) - act_x(i) \right|$, (9)
where "ind" indicates the aggregate set of the current individual (chromosome), N indicates the aggregate set length, and "act" indicates the actual aggregate set of the BIS. The aggregate fitness values of an individual are then added together to find the cumulative fitness of the whole individual using equation (10). The individual is considered fitter the closer this difference is to zero. Zero fitness means that the input aggregates have been satisfied by the individual. The fitness function formula is given below. Further discussion on fitness can be found in the relevant section of the discussion.
$\mathrm{Fitness}(ind) = \mathrm{fitness}(agg_{LR}) + \mathrm{fitness}(agg_{RL}) + \mathrm{fitness}(agg_{C})$. (10)
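The fitness calculation described in this section can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names and the dictionary keys are hypothetical, and the order in which diagonals are listed (from the bottom-left corner to the top-right corner for LR, and the mirrored order for RL) is an assumption. Any fixed order works, provided the individual and the actual BIS use the same one.

```python
import numpy as np


def aggregates(bis):
    """Return the row, column, LR-diagonal and RL-diagonal sums of a 0/1 table."""
    m = np.asarray(bis)
    n_rows, n_cols = m.shape
    offsets = range(-(n_rows - 1), n_cols)  # |U| + |P| - 1 diagonals per direction
    return {
        "R": m.sum(axis=1).tolist(),
        "C": m.sum(axis=0).tolist(),
        "LR": [int(np.trace(m, offset=k)) for k in offsets],
        # Mirroring the columns turns RL diagonals into LR diagonals.
        "RL": [int(np.trace(np.fliplr(m), offset=k)) for k in offsets],
    }


def aggregate_fitness(ind_agg, act_agg):
    """Sum of absolute differences between two aggregate sets (equation (9))."""
    return sum(abs(i - a) for i, a in zip(ind_agg, act_agg))


def fitness(individual, actual_aggregates):
    """Total fitness over LR, RL and column aggregates (equation (10))."""
    ind = aggregates(individual)
    return sum(aggregate_fitness(ind[x], actual_aggregates[x])
               for x in ("LR", "RL", "C"))


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
target = aggregates(identity)
print(target)
print(fitness(identity, target))
```

Row aggregates are computed but left out of `fitness`, mirroring equation (10), because the initial constraint keeps them satisfied at all times.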

Crossover.
Single-point crossover is proposed initially for BISGA. The first half of the rows of parent 1 and the second half of the rows of parent 2 are combined to make the first child (offspring 1). Similarly, the remaining second half of parent 1 and the first half of parent 2 produce the second offspring. If more than two parents are taken, they are crossed in the same fashion, based on their fitness, to produce more offspring. It should be noted that the initial constraint applied to the row aggregates is maintained in this type of crossover.
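The row-wise single-point crossover described above can be sketched as follows. The function name is hypothetical, and cutting at `len(parent) // 2` for tables with an odd number of rows is an assumption made here; the two parent tables are made-up examples with row aggregates {3, 2, 1, 2}.

```python
def row_crossover(parent1, parent2):
    """Single-point, row-wise crossover at the middle row; returns two offspring."""
    cut = len(parent1) // 2
    # Offspring 1: first half of parent 1 followed by second half of parent 2.
    child1 = [row[:] for row in parent1[:cut] + parent2[cut:]]
    # Offspring 2: first half of parent 2 followed by second half of parent 1.
    child2 = [row[:] for row in parent2[:cut] + parent1[cut:]]
    return child1, child2


p1 = [[1, 1, 1, 0], [0, 1, 0, 1], [1, 0, 0, 0], [1, 0, 1, 0]]
p2 = [[1, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 1]]
c1, c2 = row_crossover(p1, p2)
print(c1)
print(c2)
```

Since whole rows are exchanged and each row keeps its position, every offspring row still carries the number of ones required by its row aggregate.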

Mutation.
A row-wise swap mutation operator is used for mutation in BISGA. For BISs having fewer than one hundred values, one mutation in the whole BIS is enough, while for bigger BISs, mutation of one to two percent of all values is required. When there is one mutation in the whole BIS, the swap must be performed within a single row to maintain the constraint of row aggregate satisfaction. Similarly, when there are multiple mutations in a BIS, the constraint of row aggregate satisfaction must be respected whether the mutations are selected randomly in a single row or in multiple rows of an individual. More discussion related to mutation can be found in the relevant section of the discussion in this article.
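A row-wise swap mutation that keeps the row aggregates intact can be sketched as follows. The function name and the `n_mutations` parameter are hypothetical; as a design choice made here (not stated in the text), each mutation swaps a one with a zero in the same row so that it always changes the table.

```python
import random


def swap_mutation(individual, n_mutations=1, rng=random):
    """Swap a 1 and a 0 within one randomly chosen row, n_mutations times."""
    mutated = [row[:] for row in individual]
    for _ in range(n_mutations):
        # Only rows holding both a 0 and a 1 can be changed by a swap.
        mixed = [r for r, row in enumerate(mutated) if 0 < sum(row) < len(row)]
        if not mixed:
            break
        r = rng.choice(mixed)
        one = rng.choice([c for c, v in enumerate(mutated[r]) if v == 1])
        zero = rng.choice([c for c, v in enumerate(mutated[r]) if v == 0])
        mutated[r][one], mutated[r][zero] = 0, 1
    return mutated


individual = [[1, 1, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [1, 0, 1, 0]]
mutant = swap_mutation(individual, n_mutations=1, rng=random.Random(3))
print(mutant)
```

For larger BISs, `n_mutations` would be set to roughly one to two percent of the number of cells, following the guidance above.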

Pseudocode 1: Fitness Function.
Requirements are as follows: row aggregate sets, column aggregate sets, left-right diagonal aggregate sets, right-left diagonal aggregate sets, and the individual's chromosomes:
(1) Find the aggregate sets of the individual (using equations (2) to (8))
(2) Find the fitness of each aggregate set for the individual (using equation (9))
(3) Calculate the fitness of the individual (using equation (10))

3.5.2. Pseudocode 2: BISGA. Requirements are as follows: aggregates of the actual BIS, chromosomes, and fitness function:
(1) Generate two or more chromosomes of BIS size
(2) Place ones randomly in each row, as many as the value of that row aggregate, and fill the remaining cells of the chromosomes with zeros
(3) Calculate the fitness of each chromosome (using pseudocode 1)
(4) End if at least one chromosome has zero fitness; otherwise, go to the next step
(5) Cross over the two or more fittest chromosomes with each other row wise, then the next fittest with each other, and so on
(6) Mutate two percent of the genes row wise using swapping and go back to step 3

3.6. Solved Example for BISGA. In this section, we demonstrate an example of BISGA as a proof of concept. Consider a 4 by 4 BIS; Table 3 will be referred to as the actual BIS, and its aggregate sets are taken as the input.

3.6.1. Step 1: Initial Population. Four chromosomes are generated randomly as the initial population, as given in Figure 3, and the size of each chromosome is equal to the number of row aggregates multiplied by the number of column aggregates.

3.6.2. Step 2: Initial Constraints. The number of ones generated randomly in the first row is 3, which is equal to the first row aggregate. Similarly, two, one, and two ones are inserted in the second, third, and fourth rows, respectively. The remaining cells are filled with zeros.

3.6.3. Step 3: Calculating Fitness. Consider the fitness calculation of chromosome 01. As the row aggregates of the actual BIS and chromosome 01 are the same, the fitness for this aggregate is 0. The column aggregates of the actual BIS are {3, 0, 3, 2}, while those of chromosome 01 are {3, 2, 2, 1}. The absolute differences of these column aggregates are {0, 2, 1, 1}, and their total is 4, which is the fitness of the column aggregate for chromosome 01. Similarly, the fitness values of the left-right and right-left diagonals for chromosome 01 are 2 and 4, respectively. Adding the fitness of each aggregate set (0 + 4 + 2 + 4) gives 10, which is the fitness of chromosome 01. The fitness of the other three chromosomes in Figure 3 can be calculated in the same fashion.
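The arithmetic of this step can be checked with a few lines of Python. The column aggregates are those given in the text; the diagonal fitness values 2 and 4 are taken from the text as stated, since the tables needed to recompute them are not reproduced here. The variable names are illustrative.

```python
actual_columns = [3, 0, 3, 2]
chromosome_columns = [3, 2, 2, 1]

# Absolute difference per column, then the total for the column aggregate.
differences = [abs(a - c) for a, c in zip(actual_columns, chromosome_columns)]
column_fitness = sum(differences)

# Row fitness is 0 by construction; LR and RL fitness values are quoted from the text.
total_fitness = 0 + column_fitness + 2 + 4
print(differences, column_fitness, total_fitness)
```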

3.6.4. Step 4: Selection. Based on the fitness of the chromosomes calculated in the previous step, chromosome 01 and chromosome 04 are selected as the fittest among all for further operations. The selected chromosome 01 is shown with a black background and chromosome 04 with a white background; the bold borders in Figure 1 help in understanding the next step, crossover.

3.6.5. Step 5: Crossover. Chromosome 01 and chromosome 04 are crossed over row wise and uniformly, such that the first two rows of chromosome 01 and the last two rows of chromosome 04 are combined to generate a new offspring, as shown in Figure 4. Similarly, the last two rows of chromosome 01 and the first two rows of chromosome 04 are combined to make the second offspring from the selected chromosomes. Chromosome 02 and chromosome 03 are crossed over in a similar fashion, which is not shown here for simplicity.

3.6.6. Step 6: Mutation. As the right numbers of 0s and 1s were already inserted in each chromosome by the initial constraint, only swap mutation can be used now. Suppose the first and the last elements of the third row of offspring 01 are selected for mutation; then the first element, 1, becomes 0 and the last element, 0, becomes 1, as underlined in Figure 3. Hence, the aggregate value of row 3 is not affected and remains equal to 1, its previous value. All other offspring are mutated similarly.

Following the flowchart of BISGA given in Figure 2 and pseudocode 2, the fitness of each offspring is now checked as in Step 3. If no offspring has zero fitness, Steps 3 to 6 are repeated until an offspring with zero fitness is found. In our example, offspring 01 after mutation is the same as the actual BIS, and the algorithm terminates after finding its fitness equal to zero.
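Putting the steps of the solved example together, the overall loop of pseudocode 2 can be sketched as below. This is a simplified illustration under several assumptions, not the authors' Python implementation: parents are chosen by rank rather than by roulette wheel, the parents are carried over unchanged to the next generation, each offspring receives one swap mutation, the run stops after `max_generations` even if no zero-fitness table is found, and all names and default values are hypothetical. The example table passed in at the end is made up for illustration.

```python
import random
import numpy as np


def aggregates(bis):
    """Row, column, LR-diagonal and RL-diagonal sums of a 0/1 table."""
    m = np.asarray(bis)
    offsets = range(-(m.shape[0] - 1), m.shape[1])
    return {
        "R": m.sum(axis=1).tolist(),
        "C": m.sum(axis=0).tolist(),
        "LR": [int(np.trace(m, offset=k)) for k in offsets],
        "RL": [int(np.trace(np.fliplr(m), offset=k)) for k in offsets],
    }


def fitness(ind, target):
    """Sum of absolute differences over column, LR and RL aggregates."""
    agg = aggregates(ind)
    return sum(abs(a - b) for x in ("C", "LR", "RL")
               for a, b in zip(agg[x], target[x]))


def random_chromosome(target, rng):
    """Random table whose row sums already match the row aggregates."""
    n_cols = len(target["C"])
    rows = []
    for ones in target["R"]:
        row = [1] * ones + [0] * (n_cols - ones)
        rng.shuffle(row)
        rows.append(row)
    return rows


def crossover(p1, p2):
    """Row-wise single-point crossover at the middle row."""
    cut = len(p1) // 2
    return ([r[:] for r in p1[:cut] + p2[cut:]],
            [r[:] for r in p2[:cut] + p1[cut:]])


def mutate(ind, rng):
    """Swap one 1 with one 0 inside a single row (in place)."""
    mixed = [r for r, row in enumerate(ind) if 0 < sum(row) < len(row)]
    if mixed:
        r = rng.choice(mixed)
        one = rng.choice([c for c, v in enumerate(ind[r]) if v == 1])
        zero = rng.choice([c for c, v in enumerate(ind[r]) if v == 0])
        ind[r][one], ind[r][zero] = 0, 1
    return ind


def bisga(target, pop_size=20, max_generations=5000, seed=None):
    """Return (best table found, generations used); pop_size should be a multiple of 4."""
    rng = random.Random(seed)
    population = [random_chromosome(target, rng) for _ in range(pop_size)]
    for generation in range(max_generations):
        population.sort(key=lambda ind: fitness(ind, target))
        if fitness(population[0], target) == 0:
            return population[0], generation
        parents = population[: pop_size // 2]  # rank-based choice (assumption)
        offspring = []
        for a, b in zip(parents[::2], parents[1::2]):  # fittest with next fittest
            offspring.extend(crossover(a, b))
        population = parents + [mutate(child, rng) for child in offspring]
    population.sort(key=lambda ind: fitness(ind, target))
    return population[0], max_generations


target = aggregates([[1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0], [1, 0, 1, 0]])
best, generations = bisga(target, max_generations=200, seed=1)
print(best, generations, fitness(best, target))
```

A returned table with zero fitness satisfies all four aggregate sets, but it may be an equivalent BIS rather than the original one.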

Discussion and Results
In this section, we present some of the initial results obtained through BISGA after implementing it in Python. The results are followed by some necessary discussion related to these results and to the BISGA operators.

Results.
Two main results of BISGA are already included in this article in the form of Figure 1 and a solved example. Of these, the results in Figure 1 are the more important because this example is taken from the base paper, where Khan et al. [8] recalculated only one BIS, while BISGA calculated four different tables, which shows that BISGA is more powerful than the approach of Khan et al. [8]. In addition to those four BISs, a fifth BIS was calculated which is not shown in Figure 1. We mention it here for the information of readers, who can verify that the aggregates of all these BISs are the same as those of the original BIS. If BIS 1 is considered to be 100 percent accurate, as it is the original BIS, then the accuracy of the other BISs is as given in Figure 5.
When BISGA was run ten times to recalculate the BIS of the base example, the original BIS was calculated four times, while the other four equivalent BISs were calculated in the remaining tests. The frequency with which the original and equivalent BISs were recalculated through BISGA is given in Figure 6, which also shows the number of iterations used for these calculations. Figure 7 illustrates the progression of solutions through random generations. Wrong values are highlighted in the tables.
The second important result already presented is the self-explanatory step-by-step solved example, which elaborates the concept of BISGA.
In addition to these two important results, we ran the Python implementation of BISGA with different sizes of BIS on UCI benchmark datasets and dummy Boolean datasets. The UCI benchmark comprises four datasets which were already used by ADFIS for data filling in soft sets [12]. First, 10 × 10 values are taken from each of the four datasets. The average accuracy percentage and efficiency are given in Figure 8. The experiments are also performed on dummy data to obtain more generic results for the proposed algorithm.
Other experiments were conducted on random binary data. BISs of varied sizes, i.e., 5 × 5, 5 × 6, 6 × 6, 6 × 7, 7 × 7, 7 × 8, 8 × 8, 8 × 9, 9 × 9, 9 × 10, and 10 × 10, were randomly created. Twenty random BISs were generated for each of these sizes. Then, their aggregates were extracted and given as input to BISGA. Every BIS was regenerated twenty times by the algorithm. The solution was then compared with the original BIS. The average accuracy achieved for these BISs is given in Figure 9. It should be noted that an accuracy of less than 100% for larger BISs means that BISs different from the original, but equivalent to it in terms of their aggregates, were recalculated. The average number of generations it took to reach the solution is given in Figure 10.

supposition. This raises the question of whether the technique of Khan et al. [8] may be more accurate in recalculating the original BIS. However, in the same example of Figure 1, every equivalent BIS has the same universal and empty aggregates, which makes it clear that even if empty and universal aggregates are considered in BISGA, equivalent BISs will still be calculated.
Second, as the size of the BIS increases, two issues arise. The first is that it takes more iterations and more time to reach the solution, and the second is that the number of equivalent BISs increases, so it becomes difficult to identify the original BIS. Both of these issues are natural for larger BISs, and specialized techniques and operators will be needed to minimize them.
Third, low accuracy for larger BISs does not mean that every calculated BIS has low accuracy, as is apparent from Figure 9. If we succeed in recognizing the original BIS, the accuracy is 100% in that case. For instance, in our experiments, three of the 20 BISs recalculated for a 10 × 10 BIS were the original, but the accuracy shown is the average over all 20, of which 17 are equivalent BISs. Therefore, averaging has reduced the accuracy to 57%. It is worth noting that the fitness of all of those BISs, including the 17 inaccurate ones, is zero, which is why they are called equivalent.
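The notion of equivalent BISs can be made concrete with a small check. The two 4 by 4 tables below are an illustrative pair constructed for this sketch (they are not the tables of Figure 1): they differ in half of their cells, yet all four aggregate sets coincide. Measuring accuracy as the percentage of matching cells is an assumption made here, and the function name is hypothetical.

```python
import numpy as np


def aggregates(bis):
    """Row, column, LR-diagonal and RL-diagonal sums of a 0/1 table."""
    m = np.asarray(bis)
    offsets = range(-(m.shape[0] - 1), m.shape[1])
    return {
        "R": m.sum(axis=1).tolist(),
        "C": m.sum(axis=0).tolist(),
        "LR": [int(np.trace(m, offset=k)) for k in offsets],
        "RL": [int(np.trace(np.fliplr(m), offset=k)) for k in offsets],
    }


# Illustrative pair: two different tables with identical aggregates.
A = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
B = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])

same_aggregates = aggregates(A) == aggregates(B)
# Assumed accuracy measure: percentage of cells on which the two tables agree.
accuracy = float((A == B).mean()) * 100
print(same_aggregates, accuracy)
```

If `A` were the original BIS, a run that returned `B` would have zero fitness and still score only 50% accuracy, which is how an average over many runs can fall well below 100%.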
As mentioned earlier, we have checked BISGA on benchmark datasets and have also used dummy binary data for our experiments, which may not reflect any specific soft sets. However, BISGA works the same for any binary data, regardless of whether it is a soft set or not. Similarly, 2D binary data from any other domain can be recalculated using BISGA.
It is important to discuss the initial constraint of row aggregate satisfaction for each individual. It fixes the total numbers of zeros and ones in the whole BIS, with the number of ones equal to the sum of any aggregate set. Without that constraint, the initial population of BISs would be generated with random zeros and ones whose counts do not necessarily match any aggregate. Hence, it would take more iterations both to fix the numbers of ones and zeros and to arrange them in an order that satisfies the fitness function. Further investigation would help determine whether applying the constraint to the row aggregates is best or whether applying it to a diagonal aggregate yields a faster solution. The iterations of BISGA would be further reduced if such constraints could be applied initially to more than one aggregate in the future.

Last but not least, why is mutation used in BISGA even though the values have a very limited binary domain and there is a 99.99% chance that both zeros and ones will be present in any reasonable population selected? The reason is that the crossover used here is single point and row wise. Without mutation, a cell or gene may not get its required bit in combination with the other bits in the same row or column. Mutation therefore makes it possible to flip the bit in any cell where it is needed.

Conclusion
In this paper, we have presented BISGA, which can recalculate the entire BIS. A constraint is applied to the selection of the initial population which satisfies one fourth of the fitness and is maintained throughout the algorithm. Appropriate genetic operators are selected, namely single-point crossover and swap mutation. A fitness function is derived for this problem. Experiments are conducted on UCI benchmark datasets and randomized datasets to determine the performance of the algorithm. The genetic algorithm is evaluated based on accuracy and efficiency. Results show that as the size of the BIS increases, the chances of circular patterns increase, which affects the efficiency and accuracy of the algorithm. In future work, other aggregates can be added to avoid the circular patterns. There are also some expected future applications of BISGA in data integrity and data compression.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this article.

Figure 5: Accuracy of BISs recalculated through BISGA for the base example.

Figure 7: Solution's progression of the base example (wrong values are highlighted).

Figure 10: Efficiency for different BISs recalculated using BISGA; the number represents the number of generations in which the solution was found.

Table 3: Row aggregate set and column aggregate set.

Table 4: Left-right diagonal aggregate set.

Table 5: Right-left diagonal aggregate set.