Systems with high-dimensional input spaces require long processing times and large amounts of memory. Most attribute selection algorithms are limited in the number of input dimensions they can handle and suffer from information storage problems. These problems are eliminated by the developed feature reduction software, which uses a new modified selection mechanism that adds solution candidates from the middle region. The hybrid software is constructed to reduce the input attributes of systems with a large number of input variables. The software also supports the roulette wheel selection mechanism, and linear order crossover is used as the recombination operator. In genetic algorithm based soft computing methods, locking to local solutions is also a problem, and it is eliminated by the developed software. Faster and more effective results are obtained in the test procedures. The twelve input variables of the urological system have been reduced to reducts (reduced input attributes) with seven, six, and five elements. The results obtained with the urological test data show that the developed software with modified selection has advantages in memory allocation, execution time, classification accuracy, sensitivity, and specificity when compared with the other reduction algorithms.
Information based systems with high-dimensional input spaces require long processing times and large amounts of memory. Feature reduction algorithms determine the dominant and significant attributes that represent the whole data with no or minimal information loss. Reduction systems aim to reduce computation times and prevent information storage problems when the data are processed with artificial intelligence techniques. Rough sets theory is very significant in data mining and is used to select the input attributes that represent the whole data set [
Most reduction algorithms based on rough sets suffer from input space limits and high memory demand [
Different versions of rough sets methodologies are used in data mining based systems for reduction and knowledge discovery purposes. In data mining and knowledge discovery systems, the input database is represented by inputs called attributes. Each column of the information system represents an attribute and each row represents a case or an event. Data mining based systems investigate the hidden data embedded in knowledge based systems, and rough sets based methodologies determine the significant input attributes. Data mining based procedures help to overcome the problems caused by high-dimensional data and allow the data to be classified by artificial intelligence based systems such as artificial neural network classifiers. Different versions of feature selection algorithms are used for clustering and data mining purposes [
An information system is expressed as
The genetic algorithms are the computational models used for generating solutions for specified areas and use the solution candidate models that are named as chromosomes. The genetic algorithm based strategies explore the solution candidates by constructing the generations. These algorithms apply recombination operators and mutation operators to these structures to obtain critical solution candidates. Crossover methods are applied to the selected chromosomes for obtaining different solution candidates. These optimization algorithms evaluate the potential solutions and produce new solutions for finding the optimal solution. Selection algorithms are used for obtaining the generation that is used for crossover and mutation operators. The goodness of a solution is represented typically by the fitness value and calculated according to the specific problem.
The genetic algorithm based models propose the advanced solution techniques for calculating the optimal results by producing new solution candidates. These soft computing methods aim to find the better solutions by applying genetic algorithm operators like selection, crossover, and mutation. These operators are the computational mathematical models for finding the optimal solutions for the investigated problem. An implementation of a genetic algorithm begins with a population of chromosomes. Then the genetic algorithm based system uses the genetic algorithm operators for finding better solutions. In a broader usage of the term, a genetic algorithm is any population based model that uses selection and recombination operators to generate new sample solution points in a search space.
The aim of the study is to construct reduction software that supports systems with a large number of inputs while using memory and processing time effectively. Locking to local solutions and long computation times are also problems in genetic algorithm selection mechanisms such as roulette wheel selection and some other selection strategies. These problems have been solved by the developed hybrid software, which uses a newly proposed modified selection based on the artificial selection system and obtains faster and more efficient reduction with optimum memory usage. The developed software finds the reducts (reduced input attributes) faster and more efficiently, and the designed modified artificial selection algorithm also solves the problem of locking to local solutions. In Section
In this study, feature reduction software using genetic algorithm with new modified selection and rough sets based hybrid system (FRSGR) has been developed. Delphi 7 programming language has been used for designing the interface of FRSGR. In the designed system, a new modified selection system that depends on artificial selection method is proposed and used. The developed system not only supports the medical systems but also the information systems with high dimensional input spaces.
In the constructed software, a genetic algorithm that uses the new modified selection system is integrated with a rough sets attribute reduction system to find the optimal reducts (reduced input attributes that represent the whole data) of medical and information based systems with high input spaces. The attribute dependency value of rough sets methodology is used as the fitness value for the genetic algorithm based solution candidate generation system. The software can be stopped according to the fitness value or the maximum number of generations set by the user. A new selection mechanism, a modified version of the artificial selection algorithm, is designed for the genetic algorithm part of the software; FRSGR supports both this modified artificial selection and roulette wheel selection. The linear order crossover algorithm is used as the recombination operator, and the arbitrary two input change and three input change methods are used as the mutation operators. In addition, test software based on the decision relative discernibility matrix was constructed in the Delphi programming language to compare performance with the designed FRSGR. The modified selection decreases the computation time when compared with the classical approach and the roulette wheel mechanism, and it finds the solution candidates effectively by preventing locking to local solution candidates. Better results are obtained than with the classical artificial selection and roulette wheel selection mechanisms.
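As an illustration of the fitness value described above, the following is a minimal Python sketch of the attribute dependency value, computed as the size of the positive region divided by the size of the universe. It is not the FRSGR implementation (which is written in Delphi); the function name `attribute_dependency`, the dictionary-based table layout, and the toy data are assumptions made for the example.

```python
from collections import defaultdict

def attribute_dependency(rows, attrs, decision):
    """Return |POS_attrs(decision)| / |U| for a decision table.

    rows     -- list of dicts, one per object (hypothetical layout)
    attrs    -- condition attribute names forming the candidate reduct
    decision -- name of the decision attribute
    """
    # Group the objects into equivalence classes of the indiscernibility
    # relation induced by the chosen condition attributes.
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        classes[key].append(row[decision])
    # A class lies in the positive region when all of its objects
    # share a single decision value.
    positive = sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)
    return positive / len(rows)

# Toy decision table (invented values, not the urological data).
table = [
    {"a": 1, "b": 1, "d": 1},
    {"a": 1, "b": 2, "d": 1},
    {"a": 2, "b": 1, "d": 2},
    {"a": 2, "b": 1, "d": 3},
]
dep_ab = attribute_dependency(table, ["a", "b"], "d")
dep_a = attribute_dependency(table, ["a"], "d")
```

In this toy table the last two objects are indiscernible but have different decisions, so only two of the four objects fall in the positive region. A chromosome whose attributes give a dependency value at or above the user-defined threshold would be accepted as a reduct under this reading.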
In the roulette wheel selection mechanism, larger regions of the wheel are assigned to the chromosomes with larger fitness values, and the chromosomes with smaller fitness values receive smaller regions. The strategy then selects a random point on the wheel. Chromosomes with higher fitness values are selected more frequently because a larger region is more likely to contain the random point. In the roulette wheel selection strategy of FRSGR, the size of each region is determined by the attribute dependency value of the solution candidate, calculated by the rough sets based strategy.
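The mechanism can be sketched in a few lines of Python. This is a generic roulette wheel selection, not code taken from FRSGR; the function name and the `rng` parameter are assumptions, and the fitness values stand in for attribute dependency values.

```python
import random

def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate wheel: fall back to a uniform choice.
        return rng.choice(population)
    # Random point on a wheel whose circumference is the total fitness.
    point = rng.uniform(0, total)
    cumulative = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        cumulative += fitness
        # The chromosome whose region contains the point is selected.
        if point < cumulative:
            return chromosome
    # Guard against floating-point round-off at the end of the wheel.
    return population[-1]

# Hypothetical candidates with invented dependency values.
candidates = ["c1", "c2", "c3"]
dependencies = [0.2, 0.3, 0.5]
picked = roulette_wheel_select(candidates, dependencies, random.Random(0))
```

Because the best candidates dominate the wheel, repeated draws tend to return the same few chromosomes, which is one way to picture the locking to local solutions discussed in the text.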
A rough set is the approximation of a set by a pair of precise concepts called the lower and upper approximations [
The lower approximation and upper approximation concepts are expressed by (
The positive, negative, and boundary regions of the rough sets are expressed by [
Attribute (feature) dependency values of rough sets methodology are used for the fitness value of the generated candidates in FRSGR. The constructed software uses the feature dependency value of rough sets methodology for each chromosome for finding the optimal reducts with high performance. In the rough sets theory, feature dependency value is the ratio of the positive region to the solution space and is expressed in (
As a stopping criterion and attribute evaluation mechanism,
The stopping criterion for the developed hybrid system is accepted as a threshold level for attribute (feature) dependency value which is calculated by the proportion of the number of the elements in the positive region to the elements in the universal set and shown in (
Selection systems are significant for genetic algorithm based systems [
In contrast to the classical approach, chromosomes with middle fitness values are added to the chromosomes with the best and worst values. These modifications lead to the reducts (reduced input attributes) being obtained faster.
The abbreviation “sol. can.” denotes the “solution candidate” that is used for the chromosomes in the generation. The solution chromosomes that are equal or higher than
In the FRSGR, the chromosomes of the last two generations are listed in descending order of their attribute dependency values. The middle region proposed in the modified version starts from the middle point of the list and continues downwards. The “Mid. Sol
This modification decreases the computation time and prevents the algorithm from being locked in local solution points. The percentage of chromosomes selected from the middle region is called the solution addition percentage in the developed software interface, is abbreviated as “Mid. Sol.%”, and is shown in (
The reduction procedure also decreases the training times of the artificial neural network classifier system. Delphi programming language and interface have been used for developing FRSGR and variable input artificial neural network test software that uses back propagation algorithm.
The software is designed to work with databases that have multiple inputs, and the selection method determines the gene pool for the crossover and mutation operators of the genetic algorithm. Linear order crossover is used as the recombination operator. Falkenauer and Bouffouix proposed a modified version of order crossover, the linear order crossover (LOX) [ Random points are selected from the parent chromosomes to determine sublists. The crossover points may start at different locations in the two parents, but the sublists have the same length. The sublists taken from the parents are interchanged and placed into the holes previously defined. Repetitions among the chromosome genes are prevented while the gene order of the parent chromosomes is preserved, and the positions to the left and right of the crossover points are filled with the genes taken from the parent.
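The steps above can be sketched in Python as follows. This is the commonly used form of LOX in which both parents are cut at the same positions, a simplifying assumption (the text allows the sublists to start at different locations); it also assumes each chromosome is a permutation of attribute indices. The function name and the example parents are hypothetical.

```python
def linear_order_crossover(parent_a, parent_b, start, end):
    """Build one LOX child from two permutation chromosomes.

    The child keeps parent_a[start:end] in place; every other position is
    filled left to right with the remaining genes in parent_b's order,
    so no gene is repeated.
    """
    segment = parent_a[start:end]
    in_segment = set(segment)
    # Genes of parent_b that are not in the sublist, in their original order.
    filler = [gene for gene in parent_b if gene not in in_segment]
    # Fill the left side, insert the sublist, then fill the right side.
    return filler[:start] + segment + filler[start:]

# Hypothetical parents over eight attribute indices.
p1 = [1, 2, 3, 4, 5, 6, 7, 8]
p2 = [8, 6, 4, 2, 7, 5, 3, 1]
child1 = linear_order_crossover(p1, p2, 2, 5)
child2 = linear_order_crossover(p2, p1, 2, 5)
```

In practice the cut points `start` and `end` would be drawn at random for each pair of parents; they are fixed here only to keep the example reproducible.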
Mutation operators are used together with the high crossover rates of the developed system so that different solution candidates are generated. The arbitrary two input change and three input change methods are used as the mutation operators. Two randomly selected inputs are changed in the arbitrary two input change method, and three randomly selected inputs are changed in the arbitrary three input change method [
With FRSGR, high classification accuracy, sensitivity, specificity, PPV, and NPV values were obtained when the neural network classifier was used. The selection algorithm reduced the processing times and removed the input number restrictions found in most reduction algorithms. In genetic algorithm based systems, locking to local solutions is also a serious problem that increases computation times and prevents the solution space from being searched for the optimal solutions. The developed modified artificial selection system solves this problem by generating the first two startup generations randomly and by using not only the best and worst solution candidates but also the chromosomes with middle (intermediate) attribute dependency values.
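One possible reading of the modified artificial selection is sketched below in Python. It is an interpretation under stated assumptions, not the FRSGR code: the candidates of the last two generations are ranked in descending order of fitness, and the gene pool is built from the top of the list, from a middle region that starts at the middle point and runs downwards, and from the bottom of the list. The function name, the `pool_size` parameter, and the integer rounding of the percentages are hypothetical; the default 33%/34% split follows the best and intermediate percentages reported in the tests.

```python
def modified_selection(candidates, fitness, pool_size, best_pct=33, mid_pct=34):
    """Select best, middle-region, and worst candidates for the gene pool."""
    ranked = sorted(candidates, key=fitness, reverse=True)
    if pool_size > len(ranked) // 2:
        # Keeps the three regions from overlapping in this sketch.
        raise ValueError("pool_size must not exceed half of the candidates")
    n_best = pool_size * best_pct // 100
    n_mid = pool_size * mid_pct // 100
    n_worst = pool_size - n_best - n_mid  # remainder goes to the worst group
    middle = len(ranked) // 2
    best = ranked[:n_best]
    # Middle region: starts at the middle point and continues downwards.
    mid = ranked[middle:middle + n_mid]
    worst = ranked[len(ranked) - n_worst:] if n_worst else []
    return best + mid + worst

# Twenty hypothetical candidates whose fitness equals their own value.
last_two_generations = list(range(1, 21))
pool = modified_selection(last_two_generations, lambda c: c, pool_size=10)
```

Drawing part of the pool from the middle of the ranking keeps intermediate candidates in circulation, which is the property the text credits with preventing locking to local solutions.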
An artificial neural network (ANN) software is constructed and added to the output of the reduct generation system. The general structure of the generated software is shown in Figure
General structure of the developed software.
General structure of software interface (FRSGR).
Artificial neural network and test part of the developed software.
Another software developed for reducing the input attributes by using the decision relative discernibility approach.
A second program, which uses a decision relative discernibility matrix and function based reducing mechanism, was constructed to compare performance with the designed FRSGR. This decision relative discernibility based reduction software was accelerated and optimized for the comparison. The discernibility matrix based system uses Boolean algebra and set theory to obtain the reducts that represent the whole medical system.
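To make the comparison method concrete, the following Python sketch builds a decision relative discernibility matrix for a small decision table. It illustrates the general technique only and is not the Delphi test software; the function name, the dictionary-based table layout, and the toy data are assumptions.

```python
def decision_relative_discernibility(rows, attrs, decision):
    """Map each object pair (i, j), i > j, with different decisions
    to the set of condition attributes on which the two objects differ."""
    matrix = {}
    for i in range(len(rows)):
        for j in range(i):
            # Only pairs with different decision values need to be discerned.
            if rows[i][decision] != rows[j][decision]:
                matrix[(i, j)] = {
                    a for a in attrs if rows[i][a] != rows[j][a]
                }
    return matrix

# Toy decision table (invented values, not the urological data).
table = [
    {"a": 1, "b": 1, "d": 1},
    {"a": 1, "b": 2, "d": 2},
    {"a": 2, "b": 2, "d": 2},
]
entries = decision_relative_discernibility(table, ["a", "b"], "d")
```

The number of entries grows with the square of the number of rows, and each entry can hold up to as many attributes as there are inputs, which gives a rough sense of why this approach becomes memory hungry as the table grows.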
A discernibility matrix is expressed by using a decision table
The results obtained from the FRSGR are compared with the Johnson algorithm based reducer of the Rosetta software. The Johnson based reducer derives the reducts by using a variation of the greedy algorithm. This algorithm has a natural bias towards finding a single prime implicant of minimal length. The reduct named as “ All sets If
Uroflowmetry is a diagnostic test that is made for checking for abnormalities in the flow rate of a patient’s urine. Uroflowmetric measurements are very important for determining the urological illnesses like kidney problems, urethral obstructions, urethral strictures, abnormal bladder activities, prostatic diseases, and bladder obstructions. The volume of urine left in the bladder is measured after uroflowmetry test. The residual urine volume that cannot be voided by the bladder shows the volume of urine left in the bladder after the test. Voiding time shows the time passed for voiding the bladder and measured by the devices of uroflowmetry [
The FRSGR generates solution candidates with multiple input variables and can produce solution candidates within a desired range of input numbers. The system has 12 input variables and 1 classification variable (decision variable). The input variables of the system are the uroflowmetric measurements maximum flow rate (mlt./s), average flow rate (mlt./s), and residual urine volume (mlt.), and the sampled flow rate values (mlt./s) from the uroflowmetry graph in the period of
The input variables
Some of the transactions in the medical (urological) database.
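No. | Input 1 | Input 2 | Input 3 | Input 4 | Input 5 | Input 6 | Input 7 | Input 8 | Input 9 | Input 10 | Input 11 | Input 12 | Decision |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|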
1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 1 | 1 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
3 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 |
4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |
5 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 1 | 2 | 1 |
6 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |
7 | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 |
8 | 3 | 3 | 1 | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 3 |
9 | 3 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
10 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |
11 | 4 | 3 | 1 | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 3 | 3 | 3 |
12 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 |
13 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 1 |
14 | 3 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 3 |
15 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 1 |
Maximum Flow rate (mlt./s) 1—Very Low: 2—Low: 3—Medium: 4—High:
Average Flow rate (mlt./s) 1—Very Low: 2—Low: 3—Medium: 4—High:
Residual Urine Volume (mlt.) 1—None: 0 mlt 2—Medium: 3—High: 4—Very High:
Some of the transactions in the medical (urological) database are shown in Table
The artificial neural network (ANN) software with variable input processing feature is constructed by visual programming language. The output of the FRSGR is attached to developed flexible artificial neural network classifier software and the dominant attributes representing the data sets (reducts) are accepted as the input variables. The number of input variables and the hidden neurons, the error rate and the learning rate variables can be determined by the user of the interface. Back propagation method is used in classification software. Calculated weights can be saved to the text files and read from them for faster processing purposes. Output value of ANN is calculated by forward propagation. Updating of the weights is made in the backward propagation phase. Net and output values for middle layer neurons are calculated by (
In the backward propagation algorithm the initial weights are updated according to the position of the neurons. The updated weights are applied to the next iteration. Updating of the weights between the middle and output layer is made by using (
In (
The new values of the weights are calculated by (
In the update phase of the weights between the middle layer and the input layer, [
In the neural network classifier system, normalization procedure is made according to the values in the columns. The Normalization equation used in the procedure is expressed in (
The classification accuracy used for the testing is calculated from the proportion of the number of patterns that are classified correctly to the number of all test patterns and expressed by (
Sensitivity and specificity are performance measures used in classification systems. Sensitivity is the proportion of true positives to the sum of true positives and false negatives, that is, the fraction of the patients with the disease who are correctly classified as positive (sick). A true positive means that the patient has the disease and the classification (test) is positive. A false positive means that the patient does not have the disease but the classification is positive. A true negative means that the patient does not have the disease and the classification is negative. A false negative means that the patient has the disease but the classification is negative.
The reduced input attributes (columns) are tested in the neural network part of the developed software. In the classification procedure, the very risky and risky groups are accepted as positive (unhealthy or risky group), and the healthy group is accepted as negative. Sensitivity is expressed in (
The terms used for the calculation of sensitivity, specificity, NPV, and PPV.
Diagnostic Test or Classification | Disease Positive | Disease Negative |
---|---|---|
Test Positive | True Positive (T. P.) | False Positive (F. P.) |
Test Negative | False Negative (F. N.) | True Negative (T. N.) |
Column Total | (T. P.) + (F. N.) | (F. P.) + (T. N.) |
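The terms in the table combine into the performance measures as in the following Python sketch, which uses the standard definitions of sensitivity, specificity, PPV, NPV, and accuracy. The function name and the counts in the example are hypothetical and are not results from this study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the standard diagnostic measures from the four counts."""
    return {
        # Patients with the disease who are classified as positive.
        "sensitivity": tp / (tp + fn),
        # Patients without the disease who are classified as negative.
        "specificity": tn / (tn + fp),
        # Positive classifications that are correct.
        "ppv": tp / (tp + fp),
        # Negative classifications that are correct.
        "npv": tn / (tn + fn),
        # All correct classifications over all test patterns.
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Invented counts for illustration only.
metrics = diagnostic_metrics(tp=45, fp=5, fn=5, tn=45)
```

The sketch does not guard against empty rows or columns of the table; a count combination such as `tp + fn == 0` would raise a `ZeroDivisionError`.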
Positive predictive value and negative predictive value are two performance values of the tests and are calculated by using (
The average classification sensitivity and specificity values were calculated and compared in (Section
In the constructed FRSGR, the urological database with 12 input variables is reduced, and the reducts are tested against the urological test database in the ANN part of the software. High average classification accuracy, sensitivity, specificity, PPV, and NPV results were obtained during the classification tests. The modified version of the artificial selection algorithm was compared with the classical artificial selection algorithm and the roulette wheel selection mechanism. On average, the computation time decreased by about 50% relative to roulette wheel selection and by about 40% relative to classical artificial selection when the crossover and mutation rates were set to 50% and the percentages of good (best), intermediate, and worst solutions were set to 33%, 34%, and 33%, respectively. The designed software also supports classical artificial selection, for which the best and worst percentages were set to 40% and 60% in the test procedure. During the tests, taking solution candidates from the middle point of the list downwards also increased performance by preventing the genetic algorithm from being locked into local solutions, and it helped the system find the reducts more rapidly. With the modified artificial selection algorithm, the system finds the reducts in 2 to 20 minutes, depending on the number of individuals in the starting generations. The tests were run on a Core2Quad 3.0 processor with 8 GB RAM. Locking to local solutions, a serious problem in genetic algorithm based systems, is addressed by the modified selection algorithm, which forms the initial two generations randomly. The developed system also prevents memory errors by using memory more efficiently, whereas most reduction algorithms do not support systems with high input spaces.
In the FRSGR, the attribute dependency value of rough sets methodology is used as the fitness value, and the threshold value can be changed by the user of the interface. For the test procedure, a further program was constructed in the Delphi 7 programming language to compare performance with FRSGR: an attribute dependency reduction system without a genetic search strategy that explores the full set of combinations (all subsets of the attributes). This exhaustive approach supports systems with 11 input variables but not systems with 12 or more, because with 12 variables the allocated memory demand exceeds 899 MB and causes memory errors. The input number restriction and memory tests were made with a data set of 120 transactions (rows). In the developed FRSGR, the input number restriction problems are eliminated and the software supports systems with up to 100 inputs. The system was also compared with the decision relative discernibility based reduction test software, which was likewise written in Delphi. The decision relative discernibility based attribute reduction system also has input number restrictions and supports data sets with at most 12 input variables, because this approach also demands an extreme amount of storage.
The decision relative discernibility matrix and function based reducing procedures, like most rough sets based reduction algorithms, require high memory usage, which leads to memory errors and long computation times. The discernibility matrix and function based attribute reducing software supports at most 12 inputs when 120 transactions (rows) are used. When test data with 15 input variables (uroflowmetric data) are processed with the decision relative discernibility matrix and function based system, the processing time exceeds 4 hours and the memory demand exceeds the memory allocated by the operating system, which causes memory errors. During these tests, the Memory-Peak Working Set shown in the task manager of the operating system exceeds 860 MB. The average classification accuracy of the decision relative discernibility matrix based approach is about 80% when the urological test data are used. In Table
Average time interval and memory usage levels of the tested software.
No. | Tested System | Selection | Number of Inputs | Time (average) | Allocated Memory Peak Working Set (MB), (average res.) |
---|---|---|---|---|---|
1 | FRSGR | Modified | 12 | 2–20 min. | 70–250 MB |
1 | FRSGR | Artificial | 12 | 3.5–33 min. | 65–300 MB |
1 | FRSGR | Roulette | 12 | 4–35 min. | 75–320 MB |
1 | FRSGR | Modified | 15 | 4–30 min. | 75–320 MB |
2 | Decision Relative Discernibility | | 12 | 70 min. | 380 MB |
2 | Decision Relative Discernibility | | 15 | Exceeds 4 hours | Exceeds 860 MB and causes memory error (insufficient memory) |
3 | Attribute dependency reduction without genetic search | | 12 | Exceeds 2 hours | Exceeds 899 MB and causes memory error |
The modified artificial selection algorithm used in the software decreases computation times and prevents the genetic algorithm part from being locked to local solution candidates. The software can be run for different threshold values (calculated attribute dependency values) and different ranges of attribute numbers. The twelve input variables of the medical system have been reduced to reducts with seven, six, and five elements. Some of the reducts found by the FRSGR are listed with the attribute dependency values in Table
Some of the reducts found by the developed FRSGR.
| Element Number | The Reducts | Attribute Dependency Value |
|---|---|---|
| 7 | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| 6 | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 1 |
| | | 0.975 |
| | | 0.967 |
| 6 | | 0.975 |
| | | 0.983 |
| | | 0.975 |
| 5 | | 1 |
FRSGR finds the significant attributes of the medical risk degree determination system for urological illnesses such as urethral obstructions and urethral strictures, and determines the risk factor according to the urological measurements (uroflowmetric measurements and residual urine volume). The reducts that are named as
In the tests made with the full and reduced medical data sets, the training times of the artificial neural network system were reduced by more than 70% on average. The full data set is the urological data set with 12 input variables used in the test procedures.
Decision relative discernibility function is expressed below (The decision based discernibility equation is abbreviated). “+” shows the union “
After the second simplification procedure the reducts obtained from the decision relative discernibility based approach are given below.
We tested the average classification accuracies of the decision relative discernibility and Johnson reducer algorithms with the urological test database of 12 input variables used in this study. Average classification accuracies of 80% and 55% were obtained for the decision relative discernibility and Johnson reducer algorithms, respectively. With the same database, the average classification accuracy obtained for FRSGR is above 95%. In addition, higher average sensitivity, specificity, and positive and negative predictive values are obtained with FRSGR. The average classification accuracies, sensitivities, specificities, and PPV and NPV percentages of the reducts of FRSGR, decision relative discernibility, and the Johnson reducer are shown in Table
Classification accuracies, sensitivities, specificities, PPV and NPV of FRSGR, decision relative discernibility, and Johnson reducer.
Tested System Software | Average Classification Accuracy (%) | Average Sensitivity (%) | Average Specificity (%) | PPV (%) | NPV (%) |
---|---|---|---|---|---|
FRSGR | 95 | 97 | 93 | 95 | 95 |
Decision relative discernibility | 80 | 82 | 78 | 84 | 74 |
Johnson reducer | 55 | 52 | 60 | 66 | 45 |
In the tests of the Johnson reducer of the Rosetta software, the reducts found with 2 or more elements were evaluated (average of the reducts of full and object related discernibility).
During the classification test procedures, the inputs included in the reducts are tested in the neural network classification part of the FRSGR software. Some of the reducts that are found by the Johnson reducer system are expressed below
FRSGR has found reducts with a significantly reduced number of input attributes and with high classification accuracy, sensitivity, specificity, PPV, and NPV values when compared with the Johnson algorithm (average of the reducts of full discernibility and object related discernibility) and the decision relative discernibility based reduction system.
The extreme memory demand and input space restriction problems of most rough sets based and feature reduction systems are solved by the designed software, which also finds the reducts (reduced input attributes) faster and more efficiently. The developed system can produce different reducts according to the user-defined attribute dependency parameter, and changing this threshold level makes it possible to control the quality of classification. High classification accuracy, sensitivity, specificity, and positive and negative predictive values are obtained for FRSGR when the ANN based classifier is used for the testing procedure.
Most of the reducts with high attribute (feature) dependency values include the inputs named as
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interests. The research was not influenced by a secondary interest, such as financial gain.
This project is supported by the Scientific Research Projects Unit of Selcuk University.