Enhancing the Efficiency of a Decision Support System through the Clustering of Complex Rule-Based Knowledge Bases and Modification of the Inference Algorithm

Decision support systems founded on rule-based knowledge representation should be equipped with rule management mechanisms. Effective exploration of new knowledge in every domain of human life requires new algorithms of knowledge organization and a thorough search of the created data structures. In this work, the author introduces an optimization of both the knowledge base structure and the inference algorithm. Hence, a new, hierarchically organized knowledge base structure is proposed as it draws on the cluster analysis method and a new forward-chaining inference algorithm which searches only the so-called representatives of rule clusters. Making use of the similarity approach, the algorithm tries to discover new facts (new knowledge) from rules and facts already known. The author defines and analyses four various representative generation methods for rule clusters. Experimental results contain the analysis of the impact of the proposed methods on the efficiency of a decision support system with such knowledge representation. In order to do this, four representative generation methods and various types of clustering parameters (similarity measure, clustering methods, etc.) were examined. As can be seen, the proposed modification of both the structure of knowledge base and the inference algorithm has yielded satisfactory results.


Introduction
Big Data is no longer just about processing a huge number of bytes, but doing things with data that you could not do previously. It is not just tabular data you can easily stick into a spreadsheet or a database [1]. Where computer scientists were once limited to mere gigabytes or terabytes of information, they are now studying petabytes and even exabytes of information. At the same time, the tools to sift all that data are getting better as computer scientists refine and improve the algorithms they use to extract meaning from the deluge of data [2]. There is no doubt that big data are now rapidly expanding in all science and engineering domains. While the potential of these massive data is undoubtedly significant, fully making sense of them requires new ways of thinking and novel learning techniques to address the various challenges.
Most traditional machine learning techniques are not inherently efficient or scalable enough to handle the data with the characteristics of large volume, different types, high speed, uncertainty and incompleteness, and low value density. In response, machine learning needs to reinvent itself for big data processing [3]. Current hot topics in the quest to improve effectiveness of the machine learning techniques include search for compact knowledge representation methods and better tools for knowledge discovery and integration.
The main subject of the author's scientific work lies at the boundary of artificial intelligence, methods of representation and exploration of domain knowledge, statistical methods of data analysis, and machine learning methods. Recent work focuses on managing complex knowledge bases with rule representation and the development of new inference algorithms in such data sets.
In order to extract useful domain knowledge from the studied area, a lot of data should be collected beforehand. Much also depends on how the rules are induced. For example, effective rule induction algorithms can generate a compressed set of several dozen or several hundred rules for a data set consisting of several thousand objects. That is why, when talking about domain knowledge bases, files with several thousand rules are often considered to be too large [4]. The author's experience of working on such amounts of data is presented in [5]. In this research, the author has focused on discovering the optimal methods for big data storage, management, and exploration. In order to do this, preliminary experiments using medium-sized knowledge bases with various types and sizes of data were carried out. The goal is to specify the most important parameters that facilitate a quick and effective discovery of new knowledge in knowledge bases.
In inference processes based on rule-based knowledge bases, we explore new domain knowledge by activating the rules (components of a rule-based system of the form IF premises THEN conclusion) whose premises are true, i.e., covered by the facts given a priori. Activating a rule results in treating its conclusion as a new fact. The more rules and initial facts in a given knowledge base, the more rules can be activated. Of course, recent solutions in the area of decision support systems are additionally required to perform the task in the shortest time and with the least human involvement. Let us take an example of a medical system, in which we aim to make a decision as fast as possible, based on the knowledge (facts) about a particular patient. The system searches a knowledge base with rules in order to find all the rules relevant to the given set of facts. The classic approach has to search every rule in the knowledge base, which for a big data set with many rules can be too time-consuming. Thus, new solutions need to be discovered and developed. Such solutions should be no less effective than the classic approach while working as quickly and as efficiently as possible. This requires a deep analysis of the knowledge stored in the knowledge bases and exploration of the information about a given domain, for example, in the form of so-called meta-knowledge (knowledge about knowledge). In the literature, there is a lot of research devoted to the subject of meta-knowledge and meta-rules [6][7][8].
It is widely known that the best way to learn a new field is to use generalization skills. Generalization is the process of discovering general features, important features, and the features common for a given class of objects. Following this path, the generalization of the information saved in the rules allows us to gain knowledge about those rules. By attributing similar rules to one group and through the generalization of such groups, we obtain knowledge about many rules without having to review each rule separately.
The notion proposed in this paper is built around the idea of the similarity analysis between the rules and then their subsequent clustering. Among numerous clustering algorithms, the agglomerative hierarchical clustering (AHC) algorithm was chosen (the author previously analysed many other algorithms as well [9,10]). Its most important feature (and advantage) is the fact that it clusters (agglomerates) the most similar rules and forms a group from them. Regarding the rules in the knowledge base, we must take into account that from a certain moment of clustering, the rules cease to be similar in any respect and there is no reason to cluster them any longer. Thus, the classic AHC algorithm requires a modification. Furthermore, to find the right group of rules to activate efficiently and quickly, it is necessary to describe the groups optimally. The author has recently devoted much attention to the proposal and analysis of methods for representing groups of rules, using the generalization approach [11]. This paper is aimed at verifying the effectiveness of inference, i.e., the ability to activate rules by reviewing only a selected part of the entire knowledge base, the part most relevant to the given facts. An inference process can be considered successfully finished when only a small part of the entire knowledge base is searched and we are able to find and activate a given rule (or rules).
It turns out that some clustering parameters have a significant impact on the structure of groups of rules (a tendency to create small or large clusters, to identify atypical rules and separate them from groups). Moreover, certain methods of representation of rule clusters (representative generation methods) are characterized by a tendency to create overly general (or sometimes empty) representatives or overly detailed representatives that have ceased to reflect the content of the whole group. Knowing which clustering parameters and which representative generation methods ensure the best efficiency, we will be able to strive for optimal results.
The structure of the paper is as follows. Section 2 introduces the rule-based knowledge bases and inference processes in decision support systems. Managing of rules in knowledge bases is the main subject of Section 3. The proposed approach with a description of the clustering algorithm and inference algorithm for a hierarchical structure of a knowledge base with rule clusters is presented in Section 4. The results of experiments with their interpretation are included in Section 5. The summary is presented in Section 6.

Knowledge-Based Systems
The knowledge-based system (KBS) is a system that uses artificial intelligence to solve problems. It focuses on using knowledge-based techniques to support human decision making, learning, and action. Such systems are capable of cooperating with human users and are fit for purpose. We may even say that they are better than humans are, as they are enriched with the virtues of efficiency and effectiveness. They are able to diagnose diseases, repair electrical networks, control industrial workplaces, create geological maps, etc. Representation of knowledge is difficult because expert knowledge can be imprecise and/or uncertain. In general, the knowledge is represented as a large set of simple rules. Conclusions are generally obtained through the inference process. The expert systems have been pioneers in the field of knowledge-based systems. They replace one or more experts for problem solving. In many situations, they may be more useful than traditional computer-based information systems. There are many circumstances when they become particularly useful: when an expert is not available, when expertise is to be stored for future use or when expertise is to be cloned or multiplied, when intelligent assistance and/or training are required for decision-making or problem-solving, or when more than one expert's knowledge has to be stored on one platform. All these situations make them very useful nowadays, and thus, it is very important to improve their performance and usability. The improvement may concern both the structure of the knowledge base and the inference algorithms.
2.1. Rule-Based Knowledge Bases. Among various methods of knowledge representation, rules are the most popular form. Rule-based knowledge representation uses the Horn clause form: "if premise then conclusion." This is one of the most natural ways for domain experts to explain and present their knowledge. Activation of the rules during the inference process results in adding their conclusions as new facts (new knowledge). Let us assume that the knowledge base KB is a set of N rules, $KB = \{r_1, \ldots, r_N\}$, with each rule $r$ of the form $cond_1^r \wedge \cdots \wedge cond_m^r \rightarrow concl^r$, where $cond_1^r \wedge \cdots \wedge cond_m^r$ is the conjunction of the rule's conditions (premises) and $concl^r$ is the conclusion of the rule $r$.
Rules may be generated automatically using one of many possible algorithms based on the machine learning techniques. The knowledge base can be composed of different types of rules: classification rules, association rules, regression rules, or the so-called survival ones [12]. In addition, the rule set can be obtained by transforming the decision tree [13]. Rules can also be given by experts, but such a process is a very difficult task. Usually, the value of experts' knowledge is rated so highly that experts are reluctant to share it. Therefore, to carry out the right number of experiments, it was decided to use the knowledge base with rules generated automatically from data shared within the UCI machine learning repository [14]. An efficient algorithm for generating rules automatically from data is the LEM algorithm [15]. It is based on the rough set theory [16][17][18] and induces a set of certain rules from the lower approximation (a description of the domain objects that are known with certainty to belong to the subset of interest) and, respectively, a set of possible rules from the upper approximation (a description of the objects that possibly belong to the subset of interest). This algorithm follows a classical greedy scheme which produces a local covering of each decision concept. It covers all examples from the given approximation using a minimal set of rules.
The procedure for preparing knowledge bases for this work was as follows. Each selected set of data from the repository was rewritten as a decision table, which was then subject to the process of rule induction (LEM2 algorithm) using the RSES tool [19].
When the size of the input data (which rules are to be generated from) increases, the number of generated rules does too. Let us look at the diabetes data set [14]. It contains the data for 768 objects described with 8 continuous attributes. Processing these data with the implementation of the LEM2 algorithm in RSES produced 490 rules. For the nursery dataset, which originally contains 12,960 instances described with 9 conditional attributes, 867 rules have been generated. Rule sets of this size are difficult or even impossible for a person to analyse. It is also important to note that the generated rules might have a varying number of premises. It can be said that the fewer premises a rule has, the easier it is to determine if it is true (fewer conditions need to be covered). On the other hand, making a decision dependent on the highest possible number of conditions may suggest that if all the conditions have been met, the decision must be correct.
When looking globally at a knowledge base with rules, it turns out that it may contain a large number of short rules (with one premise or a few) but also some rules described with a large number of premises, only a few of which differentiate them. This, in turn, brings about various problems at the rule analysis stage in the inference process. When there is a set of many long rules (described with several premises) which differ from one another by a single premise, it can extend the inference process, which then attempts to check all the rules deemed fit to be activated. Another possible outcome is an uneven distribution of rules connected with given premises in a knowledge base. This may result in a large group of rules dedicated to one area only and one or very few rules describing other areas of the domain (the particular part of the domain has not been sufficiently explored). Finding rare rules might become a nontrivial task. When taking into consideration the matter of big sets of often dispersed rules, it turns out that for the effectiveness of the inference processes, decision support systems founded on rule-based knowledge representation should be equipped with rule management mechanisms, that is, methods and tools which help to review the rules effectively and quickly find those to be activated. One of the available solutions is rule clustering. In the subject literature, this issue has been extensively described and most of the time it focuses on cluster analysis [21]. Treating every rule cluster as a group of similar rules, it is possible to create its representative as a set of all the features that describe the group in the best possible way. Let us imagine there is a knowledge base with a large number of rules which are subject to clustering. As a result, there will be a structure of groups of rules which are similar to one another. The extent of cohesiveness of a knowledge base will translate into the number and size of the resulting clusters of rules. There are several possible scenarios: a small number of clusters, each containing a large number of rules, or a large number of clusters, each containing a few rules. Of course, the scenarios described above are at the extreme ends of the scale. However, the generated structure of clusters may also be well-balanced, where each cluster contains a comparable number of rules and the number of clusters is close to the size of each cluster (e.g., 100 rules divided into 10 clusters with 10 rules in each).
Subsequently, the effectiveness of the knowledge extraction from rule clusters depends on the rule cluster quality and the efficiency of inference algorithms. For rule clusters, we create representatives and they are then searched in the process of inference. Due to the fact that the quality of representatives and the optimization of inference processes are so important, better solutions are still being sought.
To make the rule activation process possible, apart from the gathered knowledge, an inference mechanism is necessary. The following subsection presents the definition of inference and a short description of the existing inference algorithms and discusses the parameters and the inference control strategies.

Inference Algorithm.
An inference engine is a software program that refers to the existing knowledge, manipulates the knowledge in line with needs, and makes decisions about actions to be taken. It generally utilizes pattern matching and search techniques for conclusions. Through these procedures, the inference engine examines existing facts and rules and adds new facts when possible. There are two common methods of deriving new facts from rules and known facts. These are data-driven (forward chaining) and goal-driven (backward chaining) inference algorithms. The most popular one, with respect to the usability in real-life applications, is the data-driven algorithm based on the modus ponens rule, a common inference strategy. It is simple and easy to understand [22]. The framework can be given as follows: when A is known to be true and a rule states "if A, then B," it is valid to conclude that B is true.
The data-driven algorithm starts with some facts and applies rules to find all possible conclusions. It is applicable when the goal of inference is undefined. When a goal is given, inference continues until this goal is derived as a new fact. The case in which there is more than one rule that can be activated in a given iteration of the inference algorithm is called in the literature a conflict set, and the method which deals with the issue is called the conflict set resolution strategy [23]. It should be emphasized that, especially in the case of a big dataset, such a situation occurs very often. There are many possible strategies proposed in the literature, but the most popular ones are to use the FIFO (First In First Out) or LIFO (Last In First Out) techniques familiar in programming languages. When there are many rules and facts involved in an expert system, classic inference algorithms become ineffective. Inference times become unacceptable, and the number of newly generated facts exceeds the limit of the new knowledge that can be properly absorbed.
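The classic data-driven scheme described above can be illustrated with a short sketch. It is only a minimal illustration of plain forward chaining with a FIFO conflict-set strategy, not the modified algorithm proposed in this paper; the representation of a rule as a (premises, conclusion) pair and the function name `forward_chain` are assumptions made for the example.

```python
from collections import deque


def forward_chain(rules, facts):
    """Classic forward chaining (illustrative sketch).

    rules: list of (premises, conclusion) pairs; premises is a set of facts.
    facts: iterable of initially known facts.
    Returns the set of all facts derivable by repeated modus ponens.
    """
    known = set(facts)
    fired = set()          # indices of rules that were already activated
    changed = True
    while changed:
        changed = False
        # Conflict set: every not-yet-fired rule whose premises are all known.
        conflict_set = deque(
            i for i, (premises, _) in enumerate(rules)
            if i not in fired and premises <= known
        )
        # FIFO resolution: activate rules in the order they entered the set.
        while conflict_set:
            i = conflict_set.popleft()
            fired.add(i)
            conclusion = rules[i][1]
            if conclusion not in known:
                known.add(conclusion)   # the conclusion becomes a new fact
                changed = True
    return known


# Hypothetical toy knowledge base: every rule is scanned in each pass.
rules = [
    ({"A"}, "B"),
    ({"B", "C"}, "D"),
    ({"E"}, "F"),
]
print(sorted(forward_chain(rules, {"A", "C"})))  # ['A', 'B', 'C', 'D']
```

Note that each pass scans the whole rule list, which is exactly the linear search over all rules that becomes costly for large knowledge bases.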
In such cases, it is necessary to find new inference algorithms which ensure effective management of the analysis process for rules to be activated. One may also consider changing the structure of the knowledge base with the rules to organize them in a specific and well-described structure so that later its search would be effective.
In this paper, the author continues her research on modification of a knowledge base structure with rules into a hierarchical one where the quality of representatives of the created rule clusters is as important as the quality of these clusters.
Therefore, the author proposes the following method of optimization. At the first stage, the knowledge base structure is modified. In the classic approach, where the knowledge base is a set of rules written without any specific order, it is necessary to search the entire set of rules. The author proposes to cluster the rules with similar premises into rule clusters. Among various methods, the agglomerative hierarchical clustering algorithm is used in this research (the author has also studied the use of other algorithms [10]). Its classic approach assumes merging, in every iteration, the two most similar rules or groups of rules into one group. The proposed modification of this approach is based on finding the optimal moment to cut the created hierarchical structure of rules. Clustering should stop when there is not enough similarity between the rules or groups of rules which remain to be clustered. Details of the proposed approach are presented in the following section.

Rule Clustering
Too many rules in the knowledge base can negatively affect the effectiveness of management of rules. One of the ways of managing the rules is to cluster them into groups and to describe the groups by their representatives. Each cluster is described using a so-called group representative (Profile). The notion of cluster analysis indicates that objects in the analysed dimension are split into clusters which collect the objects most similar to one another and the resulting clusters are as different as possible [21]. The optimal structure of rule clusters assumes a maximum internal similarity and a minimal external similarity between groups of rules. It guarantees an optimum internal cohesion and external separateness of clusters. In the next subsection, the author briefly introduces other clustering methods.

A Short Characteristic of Clustering Algorithms.
Within the scope of cluster analysis algorithms, it is possible to select either partitional (sometimes called k-optimizing algorithms, as exemplified by k-means) or hierarchical algorithms (which provide additional knowledge about the order of clustering the most similar objects together, e.g., the agglomerative hierarchical clustering algorithm (AHC)). Both partitional and hierarchical algorithms utilize the distance or similarity measurement in the process of finding similar objects. Moreover, there are algorithms based on the intracluster density (DBSCAN [24] and OPTICS [25]) and, most recently, spectral analysis algorithms (SMS (spectral mean shift) [26]).
Assuming that clustering is an automated process performed on a random set of rules with an unknown structure, the best solution, which helps to avoid several possible problems, is to use a hierarchical algorithm. These problems are, among others, an inability to determine an optimum number of clusters (necessary for partitional algorithms), the need to separate rare objects (rules) from the created clusters, and a motivation to gain additional knowledge on the sequence of rule clustering so that for each rule, another most similar rule or cluster can be found. In the density-based algorithms, similarly to partitional algorithms, additional clustering parameters like a minimum proximity threshold or the number of objects in a cluster need to be defined. The agglomerative hierarchical clustering algorithm (AHC) is free of such limitations [9,10]. This algorithm has many modifications which differ from the original in the stop condition of the clustering process.

Agglomerative Hierarchical Clustering Algorithm.
The author proposes the clustering of rules with similar premises which produces a hierarchical structure (dendrogram). In the classic form of the agglomerative hierarchical clustering algorithm (AHC), the clustering process of individual rules is continued until a single cluster of rules is obtained, with the reservation that at each step a cluster is created by joining the pair of the most similar rules or clusters of rules. Accordingly, for N rules in a knowledge base, the number of the algorithm's iterations is equal to N − 1. It is easy to notice that for large knowledge bases the running time might be a problem. This is an unacceptable feature for big knowledge bases, and modifications which reduce the number of iterations are welcome.

Clustering Parameters.
There are various clustering parameters that help to achieve optimal clustering results. In this research, the author has analysed such parameters as similarity measures, the number of clusters to create, and clustering methods.

Similarity Measures.
Clustering of similar objects requires that similarities (or distances) between the objects be defined. In the literature, there is a lot of research devoted to the analysis of available measures of similarity and dissimilarity of objects [27,28]. In this paper, these measures have been used to determine the similarities of rules to one another as well as the similarities of rules and clusters of rules in relation to the cluster representatives. The same measures can be subsequently used to measure the similarity of representatives for clusters of rules and facts in the inference process. To provide the universality of the solution, both the single rules and clusters use the conjunction of pairs which consist of an attribute and its value. The values of attributes may be symbolic and continuous.
Generally, a similarity value for a pair of rules $r_i$ and $r_j$ which belong to a set of rules R is calculated in the following way: $sim(r_i, r_j) = \sum_{f=1}^{d} w_f \cdot sim_f(r_i, r_j)$, where $sim_f$ is a similarity value between two rules $r_i$ and $r_j$ in relation to the $f$-th attribute and the value $w_f$ is the weight of the attribute $a_f$ (usually determined as $w_f = 1/d$, where d is the number of attributes). Alternatively, weights 0 and 1 can be used for attributes (where 0 for the $f$-th attribute's weight means that the attribute does not appear in the rule while 1 means that a given attribute constitutes the rule's premise part). The similarity value can be obtained by using one of various possible similarity measures. The author dealt with the influence of measures of similarity on the clustering quality in [29,30]. In [29], nine various measures were described and analysed: SMC (simple matching coefficient) and its modification wSMC (weighted simple matching coefficient), Gower's measure (widely known in the literature), two measures used for information search in large text files (OF and IOF), and four measures based on the probability of occurrence for a given feature in the description of a rule or a group of rules (Goodall's measures) [27,28]. In this research, the author uses the same set of similarity measures (in the experimental stage, each of these methods was used). The measures have been widely described by the author in [29,30]; therefore, the issue is not discussed again in this work. For example, the similarity value $sim_f$ based on the wSMC equals 1 if rules $r_i$ and $r_j$ contain the same value for the $f$-th attribute. Otherwise it equals 0.
Hence, only if rules $r_i$ and $r_j$ contain the same values for every attribute in their premises, and the weight is determined as $w_f = 1/d$ for $f = 1, \ldots, d$, where d is the number of attributes, does the similarity value $sim(r_i, r_j)$ equal 1. If the rules differ in at least one attribute, the value is less than 1. The value 0 for $sim(r_i, r_j)$ (in the case of the wSMC similarity measure) means that there was not even one attribute for which rules $r_i$ and $r_j$ would have the same value. Some of the analysed measures determine the similarity of rules using the frequency $f(r_{if})$ of occurrence of a certain pair of an attribute and its value in the entire set of rules ($f(r_{if})$ denotes the number of times a premise $r_{if}$ appears in rules), while others are based on probabilities $p_f(r_{if})$, where $p_f(r_{if})$ denotes the sample probability of the case when a premise $r_{if}$ appears in rules: $p_f(r_{if}) = f(r_{if}) / N$.
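The weighted sum behind the wSMC measure can be made concrete with a small sketch. It assumes that a rule's premise part is stored as a dictionary mapping attribute names to values, that equal weights $w_f = 1/d$ are used by default, and that an attribute missing from either rule contributes 0; the function name `wsmc_similarity` and the attribute names are hypothetical.

```python
def wsmc_similarity(rule_a, rule_b, attributes, weights=None):
    """Weighted simple matching similarity between two rules (sketch).

    rule_a, rule_b: dicts mapping attribute name -> value (premise part).
    attributes: list of all d attribute names.
    weights: optional dict attribute -> w_f; defaults to w_f = 1/d.
    """
    d = len(attributes)
    if weights is None:
        weights = {f: 1.0 / d for f in attributes}
    total = 0.0
    for f in attributes:
        # sim_f is 1 only when both rules carry attribute f with equal values
        # (assumption: an attribute absent from either rule counts as 0).
        same = f in rule_a and f in rule_b and rule_a[f] == rule_b[f]
        total += weights[f] * (1.0 if same else 0.0)
    return total


attributes = ["age", "bp", "glucose", "bmi"]
r1 = {"age": "young", "bp": "high", "glucose": "high", "bmi": "normal"}
r2 = {"age": "young", "bp": "high", "glucose": "high", "bmi": "obese"}
r3 = {"age": "old", "bp": "low", "glucose": "low", "bmi": "low"}

print(wsmc_similarity(r1, r2, attributes))  # 0.75 (three of four attributes match)
```

With these toy rules, identical rules score 1, rules differing in one of four attributes score 0.75, and rules with no common value score 0, in line with the description above.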

Number of Clusters.
Determining an optimum similarity threshold might be impossible if the algorithm needs to be made independent of the type of data. It must be remembered that when similar rules are to be clustered, the threshold has to be set at a reasonably high level; otherwise, clustering within a knowledge base can be initiated for rules which are practically dissimilar to one another, and it might be impossible to reach a high level of similarity. In [9,10], the author has presented an approach based on the termination of clustering when the intercluster similarity is no greater than the intracluster similarity. Unfortunately, the computations required for this approach are too burdensome as far as the clustering algorithm is concerned. Another solution is the termination of clustering at a certain level as an attempt to impose the number of clusters. Then, the AHC algorithm joins the rules and their clusters until the assumed number of clusters is reached. The above-described solution is presented in this paper.
In the literature, there are multiple papers which deal with the issue of an optimum selection of the number of clusters in the clustering algorithms [31,32]. The most prevalent approach to be found in these papers underlines the necessity to perform numerous iterations for a gradually changing number of clusters and then to choose an optimum solution. Theoretically, it means that the number of candidate partitions for a knowledge base with N rules equals N because, having 5 rules to cluster, we may group them into 1, 2, 3, 4, or even 5 clusters. Of course, the first and last solutions do not make sense (we would achieve one big cluster with the entire set of rules or 5 singleton rule clusters). For this reason, the starting value of the parameter pertaining to the number of groups is 2, and it increases by 1 in every partition as long as the number of clusters is smaller than the number of rules. For large knowledge bases, such an approach would not be time-effective.
The author has attempted to propose heuristics which help to determine an optimum number of clusters. The number of clusters K to be created is calculated with respect to the equations $K_1 = \sqrt{N} + i \cdot \%N$ and $K_2 = \sqrt{N} - i \cdot \%N$. $K_1$ and $K_2$ are the numbers of clusters to create, and N denotes the number of rules. It is easy to see that the modification consists in the clustering for a gradually changing (one step at a time, iteratively relative to the variable i, for $i = 1, 2, \ldots$) parameter K. Such a solution makes it possible to find the optimal number of clusters to create and does not require checking all possible scenarios but only some of them. For example, in the case of a heart disease dataset with 99 rules, all the possible rule partitions, based on the proposed heuristics, are as follows: $K = 1, \ldots, 20$. Hence, instead of generating 99 different rule partitions, only 20 are created and analysed.

Clustering Methods.
In this paper, the author has used the four most popular methods found in the literature. The first of them, the single-link method (SL), measures the distance between clusters $R_p$ and $R_q$ as the minimum distance between any pair of rules $r_i$ and $r_j$ where $r_i \in R_p$ and $r_j \in R_q$. The second one is called the complete-link method (CL) and defines the distance between the clusters $R_p$ and $R_q$ as the longest distance between any two objects in the two clusters. There are two more methods known in the literature: the average link method and the centroid link method. The former, marked as AL in this paper, measures the distance between the clusters $R_p$ and $R_q$ as the average distance of all pairs of objects located within the examined clusters. The latter, marked in this paper as CoL, always calculates the distance between the clusters $R_p$ and $R_q$ as the distance between their centroids. A centroid is a pseudo-object whose attribute values are mean values of all objects in the cluster.
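The four linkage definitions can be compared on a toy example. The sketch below assumes, purely for illustration, that objects are numeric vectors compared with the Euclidean distance (`math.dist`, Python 3.8+); the rules in this paper are described by attribute-value pairs, so the actual implementation would plug in one of the rule similarity measures instead. All function names are hypothetical.

```python
import math
from itertools import product


def single_link(cluster_p, cluster_q):
    """SL: smallest distance over all cross-cluster pairs."""
    return min(math.dist(a, b) for a, b in product(cluster_p, cluster_q))


def complete_link(cluster_p, cluster_q):
    """CL: largest distance over all cross-cluster pairs."""
    return max(math.dist(a, b) for a, b in product(cluster_p, cluster_q))


def average_link(cluster_p, cluster_q):
    """AL: mean distance over all cross-cluster pairs."""
    pairs = list(product(cluster_p, cluster_q))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def centroid(cluster):
    """Pseudo-object whose coordinates are the per-attribute means."""
    return tuple(sum(column) / len(cluster) for column in zip(*cluster))


def centroid_link(cluster_p, cluster_q):
    """CoL: distance between the two cluster centroids."""
    return math.dist(centroid(cluster_p), centroid(cluster_q))


# Two toy clusters of two-dimensional points.
R_p = [(0.0, 0.0), (0.0, 2.0)]
R_q = [(3.0, 0.0), (3.0, 2.0)]

for name, link in [("SL", single_link), ("CL", complete_link),
                   ("AL", average_link), ("CoL", centroid_link)]:
    print(name, round(link(R_p, R_q), 3))
```

For these clusters the single-link and centroid-link distances coincide (both equal 3), while the complete-link distance is the diagonal, the square root of 13, and the average-link value lies in between.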

Proposed Approach
Once groups of similar rules have been obtained, only a small part of the knowledge base needs to be searched. The previous object-by-object analysis, where the searched objects need to match the given knowledge as closely as possible, can be reduced to matching the input data to each cluster's representative and selecting the best matching representative.
4.1.Hierarchical Structure of a Knowledge Base.As the resulting structure is one or more binary trees with M number of nodes, it is easier to reduce the computing complexity of the inference algorithm from the linear to the log 2 M complexity as the former emerges from the necessity of review of all rules in the knowledge base in order to find a set of activable rules.The knowledge base's structure with rule clusters shall be defined as a sorted pair PR, Prof iles PR where PR = R 1 , R 2 , … , R K represents the structure of a K number of clusters and Prof iles PR = Prof ile R 1 , Prof ile R 2 , … , Prof ile R K constitute a set of representatives for these clusters (for K ≪ N).The following two conditions must be met: ⋃ j=1,2,…,K R j = KB and R l ∩ R j = ∅ for j ≠ l and j, l = 1, 2, … , K. A hierarchical knowledge base contains a structure of clusters of rules together with their representatives.As a result of the application of the AHC algorithm with a set criterion of stopping the agglomeration, we get a number of clusters (equal to K) containing other rule clusters or single rules.This structure is then searched in the inference process.
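The two conditions imposed on PR (the clusters cover the whole knowledge base and are pairwise disjoint) can be checked mechanically. The sketch below assumes that rules are identified by hashable labels and that each cluster is a Python set; the function name is hypothetical.

```python
def is_valid_partition(kb, clusters):
    """Return True if the clusters cover kb and are pairwise disjoint."""
    covered = set().union(*clusters)
    # If any rule belonged to two clusters, the sizes would add up to more
    # than the number of distinct rules covered.
    disjoint = sum(len(c) for c in clusters) == len(covered)
    return covered == set(kb) and disjoint


kb = {"r1", "r2", "r3"}
print(is_valid_partition(kb, [{"r1", "r2"}, {"r3"}]))
```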

Agglomerative Hierarchical Clustering: A Proposed Approach. The pseudocodes of the hierarchical clustering algorithm for rules and of the data-driven inference algorithm for rule clusters are presented in Figure 1. Iteratively, until the given number of clusters (K) is reached, a similarity matrix is created for all rule clusters at every step of the clustering process. Each cell contains the similarity value for a pair of rule clusters R_l and R_j. The cell with the highest similarity is then chosen. At the end of each iteration, a new cluster R_q containing the merged clusters R_l and R_j is created, R_l and R_j are removed from the structure PR, and R_q is added to it. In effect, the cluster analysis produces fairly homogeneous groups of rules together with their representatives.
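The agglomeration loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm from Figure 1: each rule is represented only by the frozenset of its (attribute, value) premises, the Jaccard coefficient stands in for the similarity measures examined in the paper, and cluster similarity is taken as the similarity of the closest pair of rules. All names are hypothetical.

```python
from itertools import combinations


def jaccard(rule_a, rule_b):
    """Share of common premises among all premises of the two rules."""
    union = rule_a | rule_b
    return len(rule_a & rule_b) / len(union) if union else 0.0


def cluster_similarity(c1, c2):
    """Single-link style: similarity of the most similar pair of rules."""
    return max(jaccard(a, b) for a in c1 for b in c2)


def ahc(rules, k):
    """Merge the most similar clusters until only k clusters remain."""
    clusters = [[r] for r in rules]
    while len(clusters) > max(k, 1):
        # Similarity "matrix": one entry per unordered pair of clusters.
        sim = {(i, j): cluster_similarity(clusters[i], clusters[j])
               for i, j in combinations(range(len(clusters)), 2)}
        i, j = max(sim, key=sim.get)          # cell with the highest similarity
        merged = clusters[i] + clusters[j]    # new cluster R_q
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


rules = [
    frozenset({("a", "a1"), ("c", "c2")}),
    frozenset({("a", "a2"), ("c", "c2")}),
    frozenset({("d", "d1")}),
    frozenset({("d", "d1"), ("e", "e1")}),
]
for cluster in ahc(rules, 2):
    print([sorted(r) for r in cluster])
```

Recomputing the full similarity matrix in every iteration keeps the sketch close to the description; a practical implementation would update only the rows affected by the merge.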

Knowledge Extraction in Rule Clusters. The decision-making process consists of the extraction of new knowledge based on both the rules in a knowledge base and the facts. Since the rules have been merged into groups, the inference process must operate on the rule clusters. The idea proposed by the author is based on a method widely known in the literature on information retrieval systems and on searching within hierarchical structures. Rule clustering with the AHC algorithm creates a hierarchical structure in the form of a dendrogram. A similar structure was obtained in the SMART system [33], where textual documents were subject to clustering. The clusters therein were defined as sets of documents in which each item is similar to all the remaining items of the set. The obtained hierarchy of documents was then searched by analysing the similarity between the groups' representatives and the given query. At each level of the hierarchy, the most similar group was chosen. The process ended when the most relevant group (document) was found [34]. The objective of the procedure is to maximise search efficiency by matching a request with only a small subset of the stored documents while minimizing the loss of relevant documents retrieved in the search. Because it is the cluster representatives that are analysed, the efficiency of searching within documents depends on the quality of the representatives. There are many possible ways to build a cluster representative. For example, document clusters can be represented by the set of features most common to all the documents in a given cluster. A representative can be general or specific, which is very important in the context of inference efficiency. General representatives, being short descriptions, may be easy to analyse, but it takes more time to find a given document with them. Specific representatives usually contain many features in their descriptions; analysing a single representative therefore takes much more time, but a given document can usually be found easily.
In this project, the author works with rules in a knowledge base, which are a very specific data type and thus require a specific way of managing them. They may have different lengths and may contain not only different attribute values but, above all, completely different attributes, which significantly affects the ability to compare them and to look for similarities.

Rule Clusters' Representatives.
When a set of clusters has been generated, it is possible to construct a representative classification vector for each cluster, called a centroid vector, such that the property assignment of the centroid reflects the typical, or average, values of the corresponding properties for all elements within the given cluster. Various methods can be used to generate the centroid vectors. Considering that rules in a knowledge base are a specific type of data, and that most of the time those rules are recorded with various types of data, the author proposes an approach which considers both nominal and numeric features in a representative's description. To find out which form of representative (general or detailed) provides greater effectiveness of the resulting structure and inference processes, the author proposes several different approaches. It should be noted that in her previous research [11], the author also analysed other methods of generating cluster representatives. Each rule cluster R_q ∈ PR is assigned a representative called a profile, Profile(R_q). In the basic approach (further referred to as the threshold approach), a representative consists of all attributes which have appeared in at least k% of the rules in a given group (default k = 30%). The count is obtained with frequency(getAttr(p_s)), which returns the number of times the attribute of a given premise p_s appears in the conditional parts of the rules in the group R_q. If a given attribute reaches the set threshold then, depending on its type, its value (for symbolic features) or a mean (for numeric features) is added to the representative. As this method analyses only the attribute part of the pairs (attribute, value), the searching process may not be as precise as it is for the other methods. Finding a similar representative with this technique means only that a rule cluster containing a given attribute has been found.
The conditional and decision parts of every rule are built from a given set of pairs (attribute, value). For a set of attributes A = {a, b, c, d, e, dec} with their value sets, including V_dec = {A, B, C} for the decision attribute, we may consider a few different scenarios (for simplicity's sake, let us assume in the example that all the attributes are on a nominal scale). For the knowledge base KB = {r_1, r_2, r_3, r_4}, we may say that rule r_3 is unlike the others (it is described by other attributes), while rules r_1 and r_2 are quite similar because, besides the same premise (c, c_2), they also contain a similar premise with the attribute a. Rule r_4 is (like rule r_3) unlike the others, but looking only at the attribute part, we may say that it is more similar to rules r_1 and r_2 than rule r_3 is, as it contains the attribute a.
Assuming that the selected clustering algorithm first joins rules r_1 and r_2 and then includes rule r_4 in the same cluster, the representative created with the threshold method (with the k parameter set to 50%) is Profile({r_1, r_2, r_4}) = {(a, a_1), (a, a_2), (a, a_3), (c, c_2)}. The undeniable advantages of approximating sets on the basis of rough set theory are discussed in numerous papers, such as [16][17][18]. A rough set is the approximation of a vague concept (set) by a pair of precise concepts, called the lower and upper approximations. The lower approximation is a description of the domain objects which are known with certainty to belong to the subset of interest, whereas the upper approximation is a description of the objects which possibly belong to the subset. Using these notions, a representative can be created with the lower or the upper approximation method. The lower approximation method defines a cluster's representative as all pairs (attribute, value) which appear in the conditional part of every rule in the analysed cluster. Conversely, a cluster's representative designated with the upper approximation method contains all pairs (attribute, value) which appear in the conditional part of at least one rule in the cluster. The definition of the lower approximation for the profile of a group R_q is as follows:
Profile(R_q) = {p_s : ∀ r_i ∈ R_q, p_s ∈ cond(r_i)},
and the analogous definition for the upper approximation method is
Profile(R_q) = {p_s : ∃ r_i ∈ R_q, p_s ∈ cond(r_i)},
where cond(r_i) denotes the conditional part of the i-th rule r_i and p_s is a single premise in this rule. The representative of the rule cluster {r_1, r_2, r_4} obtained with the lower approximation-based method is, regrettably, an empty set, while the upper approximation-based approach yields the following features: Profile({r_1, r_2, r_4}) = {(a, a_1), (a, a_2), (a, a_3), (c, c_2)}. It is imprecise, as it contains features which cover less than 30% of the rules in a given group. Hence, it seems justifiable to control the level of coverage of the features selected for group representatives. This has led to an alternative way of creating cluster representatives, namely, the weighted representative method. In this method, given a weight (expressed as k%), a representative is created from all pairs (attribute, value) which have appeared in at least k% of the rules in a given group.
The representative of the group of rules r_1, r_2, and r_4 selected with this approach (with the k parameter set to 50%) is Profile({r_1, r_2, r_4}) = {(c, c_2)}, because only this particular premise appears in at least 50% of the rules in this group. This clearly shows the difference between the threshold and weighted approaches. It must be emphasized that representatives of clusters are created at the same time as the clusters of rules, and as a result, there might be empty (blank) representatives even though a cluster has been created. This happens when the representative designation method is too restrictive (the conditions for including a feature in a representative are relatively high and difficult to fulfil) and, simultaneously, the stop condition has not been reached because the created structure still has more groups than the assumed threshold and the groups continue to be clustered. Such restrictive requirements are characteristic of the lower approximation method. This method stipulates that a feature included in a representative's description must be a common feature of all rules that constitute the cluster. This condition is usually too difficult to fulfil, especially when rules in a knowledge base are short and rarely have common premises. In consequence, at some stage (when groups are clustered into groups at a higher level of the hierarchy), there are clusters without representatives. Such structures have to be avoided, as they hinder the review of such groups and the use of clustering as a tool in the exploration of knowledge bases. Conversely, an excessive relaxation of the conditions examined when designating representatives makes them too detailed and often inadequate for the described clusters. For instance, with the upper approximation method, or with too low a threshold in the weighted or threshold representative methods (e.g., a 25% threshold for a cluster of four rules), a feature need only appear as a premise in a single rule to be included in the cluster's representative.
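The four representative generation methods can be compared side by side in a short sketch. It is an illustration only: it handles nominal attributes (the mean for numeric features is omitted), represents each rule by the set of its premises, and uses three hypothetical rules chosen to be consistent with the profiles quoted above for the cluster {r_1, r_2, r_4}. Function names are assumptions.

```python
from collections import Counter


def threshold_profile(cluster, k=0.3):
    """Keep every pair whose ATTRIBUTE occurs in at least k of the rules."""
    n = len(cluster)
    attr_counts = Counter(a for rule in cluster for a in {attr for attr, _ in rule})
    kept = {a for a, count in attr_counts.items() if count / n >= k}
    return {(a, v) for rule in cluster for (a, v) in rule if a in kept}


def lower_profile(cluster):
    """Pairs present in the conditional part of every rule."""
    return set.intersection(*(set(rule) for rule in cluster))


def upper_profile(cluster):
    """Pairs present in the conditional part of at least one rule."""
    return set.union(*(set(rule) for rule in cluster))


def weighted_profile(cluster, k=0.5):
    """Keep every PAIR (attribute, value) that occurs in at least k of the rules."""
    n = len(cluster)
    pair_counts = Counter(pair for rule in cluster for pair in set(rule))
    return {pair for pair, count in pair_counts.items() if count / n >= k}


# Hypothetical premises for r1, r2, r4 (the rules themselves are assumptions).
r1 = {("a", "a1"), ("c", "c2")}
r2 = {("a", "a2"), ("c", "c2")}
r4 = {("a", "a3")}
cluster = [r1, r2, r4]

print(sorted(threshold_profile(cluster, k=0.5)))
print(sorted(lower_profile(cluster)))
print(sorted(upper_profile(cluster)))
print(sorted(weighted_profile(cluster, k=0.5)))
```

The only difference between the threshold and weighted variants is what is counted: attributes in the former, whole (attribute, value) pairs in the latter.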

4.5. Inference Process in a Hierarchical Knowledge Base. At the core of big data analytics is data science (deep knowledge discovery through data inference and exploration). A knowledge representation requires some process that, given a description of a situation, can use the knowledge to draw conclusions. When the knowledge is properly represented, the inference reaches appropriate conclusions in a timely fashion. Thus, the knowledge must be adapted to the inference strategy to ensure that certain inferences are made from the knowledge. Inference in classic knowledge bases matches the entire set of rules to the known facts to deduce new facts. For big knowledge bases, working on the entire set of rules and facts is infeasible. Therefore, in this and previous research [9], the author defines the model of a hierarchical knowledge base with rule clusters and their representatives.
Inference in a hierarchical knowledge base involves using the properties of the hierarchy to optimize the search of clusters of rules. The results of inference and the course of the inference process itself depend strongly on the goal of inference.
When considering forward (data-driven) inference, we need to distinguish inference with a given hypothesis to prove from inference without one. In the first case, we review the representatives of clusters of rules at each level and eventually select the rule or rule cluster most relevant to the given facts. If a selected rule can be activated, a new fact is added to the knowledge base. When this new fact is also the goal of the inference, the process ends successfully. When the goal of the inference is not specified, we proceed as long as there are any rules that can be activated. As a result, the implemented inference algorithm leads to the discovery of a number of new facts, and one of the measures of inference efficiency is the percentage of new facts relative to those given at the beginning. The more new facts, the more effective the reasoning process is.

In the classic approach, the premises of each rule are examined to see whether they match the set of facts. If they do, the rule is activated and its conclusion is added to the set of facts. If this new fact is the hypothesis to be proved, the process ends successfully. If no goal of inference is given, the process is repeated as long as there is at least one rule that can be activated.
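The classic procedure can be sketched in a few lines. The sketch assumes that a rule is a pair (premises, conclusion), where premises is a frozenset of (attribute, value) pairs and the conclusion is a single such pair; the function name and the example rules are hypothetical.

```python
def forward_chain(rules, facts, goal=None):
    """Classic forward chaining: scan all rules until nothing new can fire.

    Returns the final set of facts and a flag saying whether the goal
    (if one was given) has been confirmed.
    """
    facts = set(facts)
    if goal is not None and goal in facts:
        return facts, True
    fired = True
    while fired:
        fired = False
        for premises, conclusion in rules:
            # A rule is activable when all premises are known facts and
            # its conclusion would add something new.
            if conclusion not in facts and premises <= facts:
                facts.add(conclusion)
                fired = True
                if conclusion == goal:
                    return facts, True
    return facts, False


rules = [
    (frozenset({("a", "a1")}), ("b", "b1")),
    (frozenset({("b", "b1"), ("c", "c1")}), ("dec", "A")),
    (frozenset({("d", "d1")}), ("dec", "B")),
]
facts, reached = forward_chain(rules, {("a", "a1"), ("c", "c1")}, goal=("dec", "A"))
print(reached, sorted(facts))
```

Every pass of the outer loop touches all N rules, which is the linear cost that the clustered structure is meant to avoid.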
In the approach proposed in this research, only the representatives of the created rule clusters are analysed, which significantly shortens the time of inference. Usually, the number of created rule clusters is significantly smaller than the number of rules being clustered. However, the success of the inference process depends on the quality of the clustering and on the approach to creating the representatives. For a structure of K clusters with their representatives, the inference process looks as follows. For the given set of input facts, we examine the cluster representatives starting from the highest level of the created hierarchical structure, and at every level of the hierarchy, going from the root to the leaves, we choose the cluster most relevant to the facts. If the selected group is already a single rule and all its premises match the given set of facts, then the rule is activated and its conclusion is added as a new fact to the knowledge base. If the new fact is also the goal to be proved, the inference process is successful. Otherwise, the search continues until the requested goal of the inference is confirmed or there are no more rules to activate. In the most optimistic case, the process lasts only one iteration, during which one rule is activated and its conclusion matches the given goal of inference, which ends the process successfully. Of course, the inference process also succeeds if the given hypothesis is proved in more than one iteration, or if any rule was activated (when no hypothesis was specified). For this reason, in the experimental stage, the author examined the following cases: was the goal specified, was it achievable, and was it eventually achieved? It was additionally examined whether any rule had been activated, how many rule clusters had been searched, and whether an empty representative had occurred during the searching process.
Verification of the correctness of the proposed solution consists of comparing the result of the inference for a hierarchical knowledge base with rule clusters with the result obtained for a classic knowledge base (without rule clusters) and classic inference (analyzing all the rules one by one).In the course of verification, it was checked how frequently the specified goal of inference had been confirmed or any new knowledge had been deduced from the rules and facts.
The pseudocode of the data-driven inference algorithm for rule clusters is presented as Algorithm 2 in Figure 1.
The most important procedure is the one which makes it possible to find first the rule cluster most relevant to the set F and then the most relevant rule in the selected group. For each cluster R_i, its representative Profile(R_i) is compared to the set of facts F, and as a result, the group with the maximum similarity is selected (i = 1, 2, …, K). The review time needed in the classic approach to search every rule is reduced to the time needed to search the cluster representatives. Most of the time, K (the number of clusters) is significantly smaller than N (the number of rules). The selected rule is activated, and the inference process finishes successfully if the new fact is the requested goal of inference. If not, the process continues.
4.6. Analysis of the Proposed Idea. For a structure containing about a thousand clusters of rules, only about a dozen representatives are compared to find the group which is most similar to the given information. Due to the logarithmic computational complexity of the algorithm, the more rules we group, the greater the time gain from browsing the cluster structure. This is undoubtedly the biggest advantage of the approach, and such solutions are particularly useful with big data sets. The disadvantage may be the omission of other rules relevant to the given facts. This approach is more efficient than the one presented in the author's previous research [9,10]. The optimization arises from the following fact: if, at a given level of the analysed structure of rule clusters, the group selected as more relevant contains other clusters (which means additional subsequent searches), we check whether the other cluster (less relevant and omitted at this level) is a single rule. If that is the case and the premises of this rule match the facts, the rule is activated, which makes it possible to finish the inference process earlier.
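The descent through the hierarchy, including the sibling check described in Section 4.6, can be sketched as follows. This is not Algorithm 2 from Figure 1 but a simplified illustration under explicit assumptions: profiles are built with the upper approximation method, the similarity between a profile and the facts is the number of shared (attribute, value) pairs, and the class and function names are hypothetical.

```python
class Node:
    """A leaf holds one rule; an inner node holds child clusters."""

    def __init__(self, premises=None, conclusion=None, children=()):
        self.premises = frozenset(premises or ())
        self.conclusion = conclusion
        self.children = tuple(children)
        # Assumption: upper-approximation profile (union of all premises).
        self.profile = (set(self.premises) if not self.children
                        else set().union(*(c.profile for c in self.children)))

    def is_rule(self):
        return not self.children


def similarity(profile, facts):
    """Assumed measure: number of pairs shared by the profile and the facts."""
    return len(profile & facts)


def can_fire(node, facts):
    return (node.is_rule() and node.premises <= facts
            and node.conclusion not in facts)


def select_rule(root, facts):
    """Descend from the root, always following the most similar child."""
    node, compared = root, 0
    while not node.is_rule():
        ranked = sorted(node.children,
                        key=lambda c: similarity(c.profile, facts), reverse=True)
        compared += len(ranked)
        best, rest = ranked[0], ranked[1:]
        if not best.is_rule():
            # Optimization: a skipped sibling that is a single activable
            # rule is fired immediately instead of descending further.
            for other in rest:
                if can_fire(other, facts):
                    return other, compared
        node = best
    return (node if can_fire(node, facts) else None), compared


def infer(root, facts, goal=None):
    """Repeat the descent until the goal is confirmed or nothing can fire."""
    facts = set(facts)
    while True:
        rule, _ = select_rule(root, facts)
        if rule is None:
            return facts, False
        facts.add(rule.conclusion)
        if rule.conclusion == goal:
            return facts, True


# Hypothetical two-cluster knowledge base.
r1 = Node({("a", "a1")}, ("b", "b1"))
r2 = Node({("a", "a1"), ("b", "b1")}, ("dec", "A"))
r3 = Node({("d", "d1")}, ("dec", "B"))
r4 = Node({("d", "d1"), ("e", "e1")}, ("dec", "C"))
root = Node(children=[Node(children=[r3, r4]), Node(children=[r1, r2])])
print(infer(root, {("a", "a1")}, goal=("dec", "A")))
```

Because each descent follows a single branch, a rule in a less similar cluster may never be examined, which is the omission risk noted in the analysis above.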

Example of Rule Clustering and the Inference Process for Rule Clusters. Let us assume that a given knowledge base contains five rules, r_1 to r_5. The course of the AHC clustering algorithm for this knowledge base, using the wSMC similarity measure, is presented in Figure 2. As a result, two clusters of rules are generated: R_1, which contains rules r_3 and r_4, and R_2, which contains r_1, r_2, and r_5. Lower and upper approximation-based representatives are generated for these groups, and the given input set of facts is F = {(a, a_1), (b, b_1)}. The course of the inference, taking into account the type of representatives, is presented in Table 1. This basic example clearly illustrates how the representative generation method influences the efficiency of the inference process, producing different results. In the case of the LowerApp method, no rule would be activated and no new knowledge would be extracted. When considering big data sets, one should bear in mind that the chosen cluster representation method can significantly affect the amount of new knowledge extracted from a knowledge base of hundreds or thousands of rules. The lower approximation method (which produces general descriptions of rule clusters) can unfortunately make the discovery of new knowledge from rules and facts impossible (because of empty representatives).

Experiments
The experiments were aimed at investigating whether the presented clustering methods (SL, CL, AL, and CoL) and representative generation methods (Threshold, LowerApp, UpperApp, and Weighted) influence the efficiency of inference and the quality of the created rule clusters. The subjects of the experiments are four datasets: heart, libra, weather, and krukenberg, with various numbers of attributes and rules [14]. The smallest knowledge base contains 5 attributes and 5 rules, the greatest number of rules is 200, and the greatest number of attributes is 165. In the experiments, many possible combinations were examined for each knowledge base: nine similarity measures, four clustering methods, and four representative generation methods with three different percentage thresholds and various numbers of clusters. The total number of experiments equals 178,200. It results from the necessity of using all possible combinations of similarity measures, clustering methods, numbers of clusters, representative generation methods (with various values of the threshold k), and additional parameters related to the inference process, such as the number of facts and whether or not a hypothesis to be proved was given. All tables summarize the results obtained for all 178,200 experiments performed.
Tables 2-4 present the results of the analysis of the influence of various rule cluster representative methods on the inference efficiency. Table 2 presents the frequency of finishing the inference successfully (the goal of the inference has been reached and/or a new fact has been induced from the rules and facts already known) and the frequency with which the amount of new knowledge (new facts) reached at least 100% of the input knowledge. Table 3 presents a description of the created clusters for the different representative generation methods in the form of the following factors: BCS (biggest cluster's size), O (the number of outliers), and ARL/BRL (average/biggest representative's length). Table 4 contains a description of inference efficiency presented as the average number of fired rules, the average number of empty representatives, the average number of new facts, and the number of searched clusters. The representative generation method which confirms a given goal most often is the UpperApp method (in 21.91% of cases, while the LowerApp method confirms the goal in only 11.96% of cases). If the aim is to obtain many new facts (new knowledge), then the representative generation method which yields new knowledge amounting to at least 100% of the input knowledge is the LowerApp method (in 52.09% of cases). The New knowledge column with the value At least 100% corresponds to the case where, for a given set of input facts, at least the same number of new facts was generated during the inference process.
The UpperApp method generates the biggest cluster size, the greatest number of outliers, and a much wider range of representatives than the other representative generation methods. Only for the UpperApp and Threshold representative generation methods are empty representatives not generated at all.
Tables 5-7 contain the same kind of information as Tables 2-4 but for the various clustering methods.
The SL clustering method makes it possible to confirm a given goal of inference most often. This method also generates the smallest size of the biggest cluster, the smallest number of outliers, and the shortest representatives for the created rule clusters. It also yields the smallest number of fired rules, the earliest time of achieving empty representatives, and the smallest number of searched clusters.

Conclusions
Decision support systems founded on rule-based knowledge representation should be equipped with rule management mechanisms. Effective exploration of new knowledge in every domain of human life requires new algorithms for organizing knowledge and for searching the created data structures. The optimization proposed by the author in this paper is based on the cluster analysis method and on a modification of the inference algorithm, which searches within the representatives of the created rule clusters instead of the rules themselves. This article presents both a description of the proposed approach and the results of the experiments carried out for the chosen knowledge bases. Among various clustering algorithms, the agglomerative hierarchical clustering algorithm was selected, with a modification proposed by the author in which rule clusters are built until a given number of clusters is reached. For every rule cluster, a representative is created. During the inference process, only representatives are analysed, and at every level of the created hierarchical structure, the most relevant representative is selected and analysed further. This means it is possible to search only a small part of the whole knowledge base with the same accuracy that would be achieved if the whole knowledge base were searched. Previous experiments showed that for big knowledge bases (with more than a thousand rules), only 1.5% of the whole KB has to be analysed to finish the inference process successfully. Tables 2-4 present the results of the examined cluster representative generation methods across every combination of the clustering parameters (similarity measures, number of clusters, and others). Tables 5-7 present the corresponding results for the four clustering methods.
As expected, the UpperApp representative method produces the biggest clusters and the longest representatives. As a result, this method leads to a successful conclusion more frequently. It is therefore recommended to analyse further both the representative generation methods and the inference algorithm in order to propose new optimizations and achieve a higher efficiency.

Figure 1: The pseudocodes of the hierarchical clustering algorithm for rules and the data-driven algorithm for rule clusters.

Figure 2: The course of the AHC clustering algorithm for a given knowledge base.

Table 1: The course of knowledge exploration for an example knowledge base.

Table 2: Inference efficiency vs. representative generation methods.
a Empty representative found during inference.

Table 3: The quality of rule clusters vs. representative generation methods.

Table 4: Description of inference efficiency vs. representative generation methods.
a Empty representative found during inference.

Table 6: Quality of rule clusters vs. clustering methods.

Table 7: Description of inference efficiency vs. clustering methods.