Text Case-Based Reasoning Framework for Fault Diagnosis and Predication by Cloud Computing



Introduction
Discrete Event Systems (DESs), such as railway onboard systems, produce a large amount of text data by recording the maintenance process. These records consist of symptoms corresponding to faulty parts, observed failure modes, and repair actions taken to fix the faults. Hundreds of thousands of such repair verbatims are collected and used for health status estimation, fault detection, and fault diagnosis and predication. However, the overwhelming size of the repair verbatim data restricts its effective utilization in the process of fault diagnosis and predication. The main reason is that this complex task requires large computational effort to identify the fault cause and the optimal maintenance actions. This has stimulated research on advanced computing frameworks aimed at reducing the complexity of fault diagnosis and predication by using Artificial Intelligence technologies and distributed processing architectures [1]. Through in-depth study and data mining of these Big Data, the life cycle of DESs is obtained and intelligent maintenance decision support, including fault detection, fault location, and repair advice, is provided. In this paper, we take a railway signalling system as an example of a DES.
However, the automatic discovery of knowledge from the repair verbatims is a nontrivial exercise, mainly for the following reasons: (1) Unstructured repair verbatims: in maintenance documents, the repair records are typically written as unstructured text, which also includes noisy textual data resulting from synonyms, abbreviations, and erroneous records. This presents a big challenge for text processing.
(2) Massive repair verbatims: in maintenance documents, there are hundreds of thousands of repair verbatims collected during the diagnosis episodes.
(3) High-dimensional data: in maintenance documents, there are tens of thousands of distinct terms or tokens from the viewpoint of text mining. After elimination of stop words and stemming, the set of features is still too large for many machine learning algorithms.
In order to tackle the above challenges, text mining based on knowledge discovery from historical datasets has recently been proposed. Text mining [2-4] is a knowledge-intensive task that is gaining wider attention in several Original Equipment Manufacturing (OEM) industries, for example, aerospace, automotive, power plants, medical, biomedicine, manufacturing, and sales and marketing divisions. In the fault diagnosis domain, Rajpathak et al. [5] propose an ontology-based text mining method for automatically constructing and updating a D-matrix by mining hundreds of thousands of repair verbatims. Rajpathak et al. [6] also present a real-life reliability system that fuses field warranty failure data with the failure modes extracted from unstructured repair verbatim data, using an ontology-based natural language processing technique to facilitate accurate estimation of component reliability. Wang and Xu et al. [7] present a bilevel feature extraction-based text mining method that integrates features extracted at both the syntax and semantic levels with the aim of improving fault classification performance for railway onboard equipment. In a maintenance situation, operators often search for solutions to fault diagnosis and predication problems that are very similar to system states that have been processed previously. In these cases, the corresponding fault diagnosis and predication solutions are expected to be correlated with these similar system states. Hence, the fault diagnosis and predication problem can be solved quickly by a Case-Based Reasoning (CBR) module, which tries to infer from historical information the hidden relationship between system states and the corresponding historical solutions [8, 9]. CBR is an effective technique for problem solving in fields in which it is hard to establish a quantitative mathematical model, such as fault diagnosis, health management, or industrial systems [10]. Following this idea, He [11] proposes a framework that uses text mining and Web 2.0 technologies to improve and enhance CBR systems in order to provide a better user experience. He also suggests that text mining and Web 2.0 are promising ways to bring additional value to CBR and that they should be incorporated into the CBR design and development process for the benefit of CBR users.
In order to tackle the vast amount of maintenance records, cloud computing has recently been used in large-scale maintenance processes. Cloud computing is "a distributed computing technology that provides dynamically scalable computing resources including storage, computation power, and applications delivered as a service over the Internet" [12, 13]. It has several advantages such as location independence, cost effectiveness, maintenance, and scalability [14]. Bahga et al. [15] present a cloud computing framework, CloudView, for storage, processing, and analysis of massive machine maintenance data collected from a large number of sensors embedded in industrial machines. A CBR approach is adopted for machine fault predication, where the past cases of failure from a large number of machines are collected in a cloud. Case-base creation jobs are formulated using the MapReduce parallel data processing model. Yang and Xu et al. [16] describe an agent-based heterogeneous data integration and maintenance decision support system for high-speed railway signal systems, in which ontology and CBR are integrated for fault diagnosis.
However, the measured variables of a railway signalling monitoring system come from a large number of subsystems, such as track, switch, power, and interlocking subsystems, each of which produces a large number of operational and maintenance logs. In addition, after case retrieval, the number of rows of the knowledge base increases dynamically in order to maintain a comprehensive set of historical solutions, and the number of data samples can be of the order of several thousand for a realistic railway signalling system, which makes the overall problem intractable. Therefore, there is an urgent need to combine a practical text mining system with cloud computing that can quickly analyze such data to maintain equipment safety and reliability. In this paper, a cloud computing-based framework with text case-based reasoning (TCBR) is proposed, which borrows ideas from [15, 17-19]. The main idea is to extract fault features by text mining, reduce attributes by rough set theory [20-25], and solve the fault diagnosis and predication problem by deploying a CBR module on the Hadoop platform with the MapReduce framework [26, 27], which is a computing paradigm for Big Data management created at Google. This computing paradigm is able to scale the computation up to thousands of processors and terabytes (or petabytes) of data. A case study on fault diagnosis and predication of train onboard equipment is presented to demonstrate the efficiency of the framework.
The rest of the paper is organized as follows. In Section 2, the system framework integrating rough set theory and TCBR is proposed. In Section 3, the methodology of TCBR for fault diagnosis/predication is presented. In Section 4, the effectiveness of the proposed method is analyzed by application to a railway onboard system, and Section 5 draws the conclusion of the paper.

System Framework
Borrowing ideas from [15], a TCBR framework for fault diagnosis and predication by cloud computing in railway signalling systems is presented in Figure 1. The proposed framework allows massive data collection and analysis in a computing cloud, with the benefit of real-time diagnosis and predication of equipment failure based on information from cases related to faults. The framework is capable of analyzing the maintenance records gathered from field devices (e.g., switches, track circuits, and cables in railway signalling systems) in a number of railway signalling maintenance sectors (e.g., Changsha, Guangzhou, and Chenzhou in Figure 1). By processing these data, a case-base (CB) is created, which is used for fault diagnosis and predication by dispatching the CB to the maintenance sectors. These sectors carry out fault diagnosis and predication by comparing the gathered fault symptoms with the cases in the CB.
In the proposed cloud computing framework, HDFS and MapReduce are utilized to store and process large volumes of monitoring data. HDFS stores files across a collection of nodes in a cluster. Large files are split into blocks and each block is written to multiple nodes (the default is three) for fault tolerance. MapReduce is a parallel data processing model that has two phases: Map and Reduce. In the Map phase, data are read from a distributed file system (such as HDFS), partitioned among a set of computing nodes in the cluster, and sent to the nodes as a set of key-value pairs. The Map tasks process the input records independently of each other and produce intermediate results as key-value pairs. When all the Map tasks are completed, the Reduce phase begins, in which the intermediate data with the same key are aggregated. In this paper, the tasks of fault feature reduction and case retrieval, which involve massive data analysis, are carried out in the cloud using MapReduce jobs.
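The two-phase model described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the helper names (`map_phase`, `shuffle`, `reduce_phase`, `emit_fault_part`) and the sample records are hypothetical, and the example only counts how often each faulty part appears across sectors.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply map_fn to every record independently and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Aggregate all values that share the same key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def emit_fault_part(record):
    """Hypothetical Map function: emit (fault part, 1) for one maintenance record."""
    _sector, part = record
    return [(part, 1)]

# Hypothetical maintenance records: (sector, faulty part).
records = [("Changsha", "SDU"), ("Guangzhou", "relay"), ("Chenzhou", "SDU")]
counts = reduce_phase(shuffle(map_phase(records, emit_fault_part)),
                      lambda key, values: sum(values))
print(counts)
```

In a real deployment the Map calls would run on different nodes and the grouping step would be performed by the framework; the sketch only mirrors the data flow.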

Methodology
The proposed framework comprises four main steps: data acquisition, feature extraction, rough set-based attribute reduction, and cloud computing-based case retrieval.
Figure 2 outlines the steps of the presented approach and the related components. First, in the data acquisition and feature extraction module, the historical maintenance records are analyzed by lexical analysis and a diagnosis ontology in order to annotate the key terms recorded in the repair verbatims. The annotated terms are extracted and used to identify fault features, such as fault mode, fault phenomena, fault part, and corrective actions. The extracted fault features are further reduced by rough set theory. Then a CB is generated by collecting the reduced fault features. When a new maintenance log is entered, case retrieval is conducted by similarity calculation, in which the diagnosis ontology is used to compute the similarity between different concepts. Finally, the results of CBR are used for fault diagnosis or predication. The tasks of rough set-based fault feature reduction and case retrieval, which involve massive data analysis, are carried out by cloud computing.

Data Acquisition and Feature Extraction.
In the event of a railway signalling malfunction, the diagnostic trouble symptoms are generated and transmitted to the monitoring center database by wired/wireless communications. After every diagnosis episode a repair verbatim is recorded, which consists of a textual description mixing fault symptoms (i.e., fault terms), e.g., 'faults', a fault symptom associated with a specific part, e.g., 'SDU (Speed Distance Unit)', failure modes (i.e., fault classes), and finally the corrective actions taken to fix the faults, e.g., 'replaced SDU'. In the railway industry, millions of such repair verbatims are generated every year. Table 1 [7] gives a simple example with two verbatims. They provide useful data from which knowledge must be discovered for efficient fault diagnosis and handling of similar cases in the future. From repair verbatim data, text mining techniques can be used to establish the associations between fault terms and fault classes, such that these associations can be used to improve the precision of fault diagnosis and predication.
The data acquisition component collects unstructured information from maintenance records; conducts fault feature extraction, including lexical analysis, identification of fault mode, fault phenomena, fault part, and corrective actions, and synonym recognition; and then passes the results to the fault feature reduction module.
In order to realize the fault term/feature extraction, we utilize ontology technology. The term "ontology" is defined as the study of the existence of knowledge and has been widely applied in different information systems. It is an explicit and formal specification of a conceptualization and an advanced knowledge organization technique. Informally speaking, an ontology is a conceptual model that specifies the terms and the relationships between the concepts explicitly and formally, which in turn represents the knowledge for a specific domain. In railway fault diagnosis and predication, the ontology stores information about solutions for previous fault diagnoses and the related fault symptoms, fault parts, and corrective actions, as shown in Figure 3.
Ontology is used to describe knowledge for fault diagnosis and maintenance cases. In the figure, there are concepts, such as System, Subsystem, Component, Fault, Symptom, and Maintenance Support, that describe the knowledge of railway signalling maintenance. The terms, such as train control system, onboard equipment, and SDU, are the instances corresponding to the concepts System, Subsystem, and Component, respectively. At the abstract level, the railway fault ontology is a structure of the form $O_f = (C_i, H_c, I_{c_i}, R_{c_i \to c_j})$ [16]. The $C_i$ represents concepts in the railway fault diagnostic domain. More specific concepts are represented by formalizing them in terms of the concept-subconcept hierarchy $H_c$. The instances, $I_{c_i}$, formalize the domain specific concepts in terms of the data associated with the objects in the real world; e.g., the concept fault mode can be instantiated by defining instances such as SDU failure or SDU relevant failures. These instances provide the knowledge base that can be used for annotating the fault features in the repair verbatims.
The binary relations $R_{c_i \to c_j}$ are used to represent the associations in the railway domain. These relations are used in CBR to verify the associations between the fault features, which are extracted by lexical analysis and the fault diagnosis ontology.
One of the problems we must face is synonym identification: synonyms are regarded as data noise, and resolving them plays a very important role in the subsequent TCBR. Figure 4 shows the process for synonym identification.
Let $W$ be the synonym term under consideration; the context of $W$ in an instance is formed by the terms surrounding $W$. This is done by feature selection in Figure 4. The context is then mapped to a feature vector $((f_1, v_1), (f_2, v_2), \ldots, (f_n, v_n))$, where $f_i$ is a feature and $v_i$ is its corresponding value. Surrounding words of $W$ within a fixed window size are used as the universe. Obviously, large window sizes capture dependencies at longer range but also dilute the effect of the words closer to the term. Leacock et al. [28] used a window size of 50, while Yarowsky [29] argued that a small window size of 3 or 4 had better performance. A small window size has the advantage of requiring less system space and running time [30]. Here, we use a small window size of 4 to conduct the synonym term identification, which is similar to [31]. In order to extract features in a window of size 4, a document annotation by ontology is needed. The main aim of document annotation is to attach metainformation to each repair verbatim by highlighting the key terms recorded in it. Firstly, the concept instances from the fault ontology are matched with the terms/phrases written in each repair verbatim. The document annotation helps us extract relevant terms (i.e., fault attributes) for the subsequent TCBR. A number of document preprocessing steps are taken to reduce the document dimensions: tokenization (a process of breaking a stream of text into words and phrases), stop word deletion, and lexical matching. The nondescriptive stop words (for example, 'a', 'an', and 'the') are deleted to reduce the noise.
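As a rough illustration of the preprocessing and windowing steps described above, the following Python sketch tokenizes a verbatim, deletes stop words, and collects up to four words on either side of the term under consideration. The whitespace tokenizer, the three-word stop list, the function name `context_features`, and the sample sentence are simplifying assumptions; ontology-based annotation and lexical matching are not modeled.

```python
STOP_WORDS = {"a", "an", "the"}  # the nondescriptive stop words named in the text

def context_features(text, term, window=4):
    """Return, for each occurrence of `term`, the words within `window`
    positions on either side after stop word deletion."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    contexts = []
    for i, token in enumerate(tokens):
        if token == term.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts

# Hypothetical repair verbatim, using a window of 2 to keep the output short.
example = context_features("the SDU fault was found and the unit replaced",
                           "fault", window=2)
print(example)
```

The resulting word lists are what a feature vector would be built from, for instance by counting how often each context word appears.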
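We select the repair verbatim set, $(V_1, \ldots, V_n)$, containing the synonym term $W$ as training data. In order to avoid feature sparsity, four words on either side of $W$ and the contextual information specified in terms of the components $(C_1, \ldots, C_i)$, symptoms $(S_1, \ldots, S_j)$, and actions $(A_1, \ldots, A_k)$ cooccurring with $W$ are collected. Under the assumption that components, symptoms, and actions are independent, the synonym term identification is formulated as
$$W^{*} = \arg\max_{W_m} P(W_m \mid C_1, \ldots, C_i, S_1, \ldots, S_j, A_1, \ldots, A_k) = \arg\max_{W_m} P(W_m) \prod_{p=1}^{i} P(C_p \mid W_m) \prod_{q=1}^{j} P(S_q \mid W_m) \prod_{r=1}^{k} P(A_r \mid W_m),$$
where $W_m$ ranges over the candidate terms for $W$. By the naive Bayes assumption, symptoms and actions are independent of each other, and $P(S_q \mid W_m)$ is calculated by
$$P(S_q \mid W_m) = \frac{\operatorname{count}(S_q, W_m)}{\operatorname{count}(W_m)},$$
where $\operatorname{count}(S_q, W_m)$ is the number of cooccurrences of $S_q$ and $W_m$.
After the Bayesian model for synonym term identification is built, the data noise resulting from synonym terms can be resolved.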
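A naive Bayes identification of this kind can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names `train` and `identify`, the training examples, and the add-one (Laplace) smoothing are assumptions introduced here so that unseen context words do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (candidate_term, context_words) pairs."""
    term_counts = Counter()
    cooccurrence = defaultdict(Counter)
    vocabulary = set()
    for term, words in examples:
        term_counts[term] += 1
        for word in words:
            cooccurrence[term][word] += 1
            vocabulary.add(word)
    return term_counts, cooccurrence, vocabulary

def identify(model, words):
    """Return the candidate term with the highest posterior log-probability."""
    term_counts, cooccurrence, vocabulary = model
    total = sum(term_counts.values())
    best_term, best_score = None, -math.inf
    for term, n in term_counts.items():
        # Add-one smoothing (an assumption, not taken from the paper).
        denominator = sum(cooccurrence[term].values()) + len(vocabulary)
        score = math.log(n / total)
        for word in words:
            score += math.log((cooccurrence[term][word] + 1) / denominator)
        if score > best_score:
            best_term, best_score = term, score
    return best_term

# Hypothetical training contexts for two candidate terms.
model = train([
    ("SDU", ["speed", "sensor", "replaced"]),
    ("SDU", ["speed", "distance", "unit"]),
    ("DMI", ["display", "driver", "screen"]),
])
print(identify(model, ["speed", "replaced"]))
```

Working in log space avoids numerical underflow when many context words are multiplied together.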

Fault Case, Attribute Reduction, and Weight Calculation.
Fault cases are defined as a fault detection, diagnosis, and fix situation in terms of fault symptom descriptions and related solutions.In a CBR method, a case is described by a set of attributes (fault features) or aims that identify the instance of a fault, its symptom, and its solution.It can be used for specification of a fault situation and its relevant attributes, which is used to facilitate the retrieval of suitable maintenance records.These features are viewed as the attributes of each fault record in the database.
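Attribute reduction is employed to remove redundant conditional attributes from discrete-valued datasets while retaining their information content [24]. Attribute or feature selection aims to identify the significant features, eliminate the features that are irrelevant or dispensable to the learning task, and build a good learning model. It refers to choosing a subset of attributes from the set of original attributes. For reducing the dataset and assigning weights to the case feature attributes we use rough set theory [22, 23]. Rough set theory is a mathematical tool for dealing with problems of vagueness and uncertainty. We now provide a few key definitions of rough set theory:
(1) Information System: an Information System is defined as a pair $S = (U, A)$, where $U$ is a finite nonempty set called the universe that includes all the cases and $A$ is a finite nonempty set of attributes. If we distinguish two disjoint classes of attributes, called condition and decision attributes, then the information system is called a decision table, $T = (U, C \cup D, V, f)$, where $A = C \cup D$ and $C \cap D = \emptyset$. Each attribute $a \in A$ is associated with a set $V_a$ of its values by the function $f$, called the domain of $a$.
(2) Indiscernibility Relation: the Indiscernibility Relation $IND(B)$ for any subset $B \subseteq A$ is defined as $IND(B) = \{(x, y) \in U \times U : \forall a \in B, f_a(x) = f_a(y)\}$. Two entities are considered indiscernible by the attributes in $B$ if and only if they have the same value for every attribute in $B$. $IND(B)$ is an equivalence relation that partitions $U$ into equivalence classes, which are denoted by $U/IND(B)$.
(3) Lower and upper approximations: for any set of cases $X \subseteq U$ and attribute subset $R \subseteq A$, the lower approximation of $X$, $R_{*}(X)$, is the set of objects of $U$ that are surely in $X$, whereas the upper approximation of $X$, $R^{*}(X)$, is the set of objects of $U$ that are possibly in $X$.
(4) Positive region: the positive region of the decision classes $U/IND(D)$ with respect to the condition attributes $C$ is denoted by $POS_C(D) = \bigcup_{X \in U/IND(D)} C_{*}(X)$. It is the set of objects of $U$ that can be classified with certainty into the classes $U/IND(D)$ by employing the attributes of $C$.
(5) Reduct: a subset $R \subseteq C$ is said to be a $D$-reduct of $C$ if $POS_R(D) = POS_C(D)$ and there is no $R' \subset R$ such that $POS_{R'}(D) = POS_C(D)$. In other words, a reduct is a minimal set of attributes that can differentiate all equivalence classes.
(6) Core: the Core is the set of attributes that are contained in all reducts, defined as $CORE_D(C) = \bigcap RED_D(C)$, where $RED_D(C)$ is the set of $D$-reducts of $C$. In other words, the Core is the set of attributes that cannot be removed without changing the positive region; i.e., all attributes present in the Core are indispensable.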
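The notions of equivalence classes, positive region, and Core can be made concrete with a small Python sketch on a toy decision table. The table values, the attribute names `c1` to `c3` and `d`, and the function names are hypothetical; the Core is computed directly from its characterization as the set of attributes whose removal changes the positive region.

```python
from collections import defaultdict

def partition(table, attrs):
    """Equivalence classes of the indiscernibility relation over `attrs`,
    returned as sets of row indices."""
    classes = defaultdict(set)
    for index, row in enumerate(table):
        classes[tuple(row[a] for a in attrs)].add(index)
    return list(classes.values())

def positive_region(table, cond, dec):
    """Rows whose condition class lies entirely inside one decision class."""
    decision_classes = partition(table, dec)
    positive = set()
    for block in partition(table, cond):
        if any(block <= d for d in decision_classes):
            positive |= block
    return positive

def core(table, cond, dec):
    """Attributes whose removal changes the positive region."""
    full = positive_region(table, cond, dec)
    return [a for a in cond
            if positive_region(table, [c for c in cond if c != a], dec) != full]

# Toy decision table (hypothetical values).
table = [
    {"c1": 1, "c2": 0, "c3": 0, "d": 1},
    {"c1": 1, "c2": 1, "c3": 0, "d": 2},
    {"c1": 0, "c2": 1, "c3": 0, "d": 3},
    {"c1": 0, "c2": 1, "c3": 1, "d": 3},
]
print(core(table, ["c1", "c2", "c3"], ["d"]))
```

In this toy table, dropping `c3` leaves the positive region unchanged, so `c3` is dispensable, while `c1` and `c2` are both needed to keep all four rows classifiable.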
The rough set attribute reduction by cloud computing is formulated as the following problem.
Theorem 1 (Zhang [25]). [...]
According to Theorem 1, each subdecision table can compute its equivalence classes independently. At the same time, the equivalence classes of different subdecision tables can be combined if their information sets are the same. Following [32], the attribute reduction by cloud computing is listed as Algorithms 1, 2, and 3.
In Algorithm 1, each sample $x$ can be seen as a key/value pair: the key is $f_c(x)$ ($c \in C$) and the value is $f_d(x)$ ($d \in D$). This step is executed in parallel by multiple Map operations. In Algorithm 2, the samples that have the same key are merged to generate [...]. In Algorithm 3, on the basis of Algorithms 1 and 2, the less significant attributes are removed and the attribute reduction is obtained.
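The key/value formulation can be simulated in plain Python. The sketch below is not a reproduction of Algorithms 1 to 3: it assumes that the key is the tuple of condition attribute values and the value is the decision value, maps two hypothetical subdecision tables independently, and merges samples that share a key. As one possible use of the merged groups, it counts the samples that fall in consistent groups, i.e., the size of the positive region.

```python
from collections import defaultdict
from itertools import chain

def map_sample(sample, cond, dec):
    """Map step: key = condition attribute values, value = decision value."""
    return tuple(sample[c] for c in cond), sample[dec]

def map_subtable(subtable, cond, dec):
    """Each subdecision table is mapped independently of the others."""
    return [map_sample(s, cond, dec) for s in subtable]

def reduce_by_key(pairs):
    """Reduce step: merge the decision values of samples with the same key."""
    merged = defaultdict(list)
    for key, decision in pairs:
        merged[key].append(decision)
    return dict(merged)

def consistent_count(merged):
    """Number of samples in groups that carry a single decision value."""
    return sum(len(v) for v in merged.values() if len(set(v)) == 1)

# Two hypothetical subdecision tables, e.g., stored on different nodes.
sub1 = [{"c1": 1, "c2": 0, "c3": 0, "d": 1}, {"c1": 1, "c2": 1, "c3": 0, "d": 2}]
sub2 = [{"c1": 0, "c2": 1, "c3": 0, "d": 3}, {"c1": 0, "c2": 1, "c3": 1, "d": 3}]

cond = ["c2", "c3"]
merged = reduce_by_key(chain(map_subtable(sub1, cond, "d"),
                             map_subtable(sub2, cond, "d")))
print(merged, consistent_count(merged))
```

Because samples from different subtables with the same key end up in the same group, the result does not depend on how the decision table was split.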
In this weighting method we use the significant attribute dependence coefficient, which measures the reduction of information. The significant dependence coefficient is computed as
$$\mu(a_i) = \frac{\operatorname{card}(POS_R(D)) - \operatorname{card}(POS_{(R - a_i)}(D))}{\operatorname{card}(POS_R(D))},$$
where $a_i$ is the $i$th attribute, for which we are computing the weight; $\operatorname{card}$ is the cardinality; $POS_R(D)$ is the positive region of all relations (features) present in the reducts; and, finally, $POS_{(R - a_i)}(D)$ is the positive region of all relations present in the reducts after removing attribute $a_i$.
The weight of $a_i$ is computed as [...]. Considering that the calculation of the attribute weights is similar to the computation carried out in Algorithms 1 and 2, the weights can be computed by the same MapReduce jobs. For simplicity, we omit the details here.
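A possible reading of this weighting scheme is sketched below in Python. Two assumptions are made and should be checked against the original formulas: the coefficient of an attribute is taken to be the relative drop in the size of the positive region when that attribute is removed, and the weights are obtained by normalizing the coefficients so that they sum to one. The toy table and the function names are hypothetical.

```python
from collections import defaultdict

def pos_size(table, cond, dec):
    """Size of the positive region: rows whose condition values determine
    the decision value uniquely."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[c] for c in cond)].append(row[dec])
    return sum(len(v) for v in groups.values() if len(set(v)) == 1)

def significance(table, cond, dec):
    """Assumed coefficient: relative loss of positive region per attribute.
    Requires a nonempty positive region for the full attribute set."""
    full = pos_size(table, cond, dec)
    return {a: (full - pos_size(table, [c for c in cond if c != a], dec)) / full
            for a in cond}

def normalize(coefficients):
    """Assumed weighting: scale the coefficients so that they sum to one."""
    total = sum(coefficients.values())
    return {a: value / total for a, value in coefficients.items()}

# Toy decision table (hypothetical values), already reduced to c1 and c2.
table = [
    {"c1": 1, "c2": 0, "d": 1},
    {"c1": 1, "c2": 1, "d": 2},
    {"c1": 0, "c2": 1, "d": 3},
    {"c1": 0, "c2": 1, "d": 3},
]
sig = significance(table, ["c1", "c2"], "d")
weights = normalize(sig)
print(sig, weights)
```

Here removing `c1` leaves only one row classifiable with certainty, so `c1` receives the larger weight.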

Case Retrieval and Ranking.
Case retrieval is one of the most important processes in TCBR design and in the TCBR component of the system. When a new failure situation occurs, the TCBR system retrieves, from a case-base, previous cases that are similar to the new failure situation. Case retrieval and ranking therefore amount to searching for the most similar cases in the generated CB. This job is easily parallelized in the cloud computing framework by assigning a partial task to each node in a cluster, in which the $k$ most similar cases are found by the Map function. After that, the Reduce task aggregates these similar cases and finds the most similar case, which is used for fault diagnosis, predication, and maintenance decisions.
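The parallel retrieval scheme can be sketched in Python as follows. The sketch runs in a single process, and the function names, the case identifiers, and the simple attribute-matching similarity are hypothetical stand-ins: each Map call returns the most similar cases of its own partition, and the Reduce step merges the partial lists.

```python
import heapq

def matching_similarity(query, case):
    """Hypothetical similarity: fraction of attributes with equal values."""
    return sum(query[f] == case[f] for f in query) / len(query)

def map_top_k(partition, query, similarity, k):
    """Map step: the k most similar cases within one partition of the case-base."""
    scored = ((similarity(query, case), case_id) for case_id, case in partition)
    return heapq.nlargest(k, scored)

def reduce_top_k(partials, k):
    """Reduce step: merge the partial results and keep the overall top k."""
    return heapq.nlargest(k, (item for part in partials for item in part))

# Hypothetical case-base split over two nodes.
partitions = [
    [("case1", {"c1": 1, "c2": 0}), ("case2", {"c1": 1, "c2": 1})],
    [("case3", {"c1": 0, "c2": 1})],
]
query = {"c1": 1, "c2": 1}
partials = [map_top_k(p, query, matching_similarity, k=2) for p in partitions]
best = reduce_top_k(partials, k=1)
print(best)
```

Keeping only the top candidates per partition bounds the amount of data sent to the Reduce step, which is what makes the scheme attractive for large case-bases.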
To retrieve similar cases from historical ones, a similarity measurement is commonly used in case retrieval. The value of similarity is between 0 (not similar) and 1 (most similar). The total similarity value is calculated following [8, 33] from the similarity values between case features, which will be described as semantic similarity.
Semantic similarity between two concepts $c_1$ and $c_2$ is considered to be governed by the shortest path length as well as the depth of the concepts in the ontology [33]; that is,
$$sim(c_1, c_2) = e^{-\alpha l} \cdot \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}},$$
where $\alpha \geq 0$ and $\beta \geq 0$ are parameters scaling the contribution of the shortest path length $l$ and the depth $h$, respectively. The case retrieval Algorithm 4 works as follows: we assume that the case-base has stored the historical data related to a given fault situation. When a new fault record is input into the system, the case-base is searched to retrieve all cases with a similar profile. The similarity of the new problem to the stored cases is determined by calculating the similarity value between case features. Then, once the most similar cases have been obtained from the case-base, they are used in fault diagnosis and predication to generate a maintenance decision.
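A path-and-depth-based measure of this kind is easy to compute once the shortest path length and the depth are known. The Python sketch below assumes the commonly used form in which similarity decays exponentially with path length and grows with depth through a hyperbolic tangent; the default values of `alpha` and `beta`, the function names, and the weighted average used to combine feature similarities into a total similarity are assumptions for illustration only.

```python
import math

def semantic_similarity(path_length, depth, alpha=0.2, beta=0.6):
    """exp(-alpha * l) * tanh(beta * h); tanh(x) equals
    (e^x - e^-x) / (e^x + e^-x). alpha and beta are illustrative defaults."""
    return math.exp(-alpha * path_length) * math.tanh(beta * depth)

def total_similarity(local_similarities, weights):
    """Assumed aggregation: weighted average of the feature similarities."""
    weighted = sum(weights[f] * local_similarities[f] for f in weights)
    return weighted / sum(weights.values())

# Concepts that are closer in the ontology are more similar.
near = semantic_similarity(path_length=1, depth=3)
far = semantic_similarity(path_length=4, depth=3)

# Hypothetical feature similarities and attribute weights.
total = total_similarity({"c1": 1.0, "c2": 0.5}, {"c1": 0.6, "c2": 0.4})
print(near, far, total)
```

With nonnegative parameters the measure stays between 0 and 1, which matches the similarity range stated above.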

Application to Fault Diagnosis of a Railway Onboard System
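We now describe a typical use case of fault diagnosis using the proposed framework. Table 2 shows part of a case-base of train onboard equipment failures with data comprising ten different attributes after feature extraction (i.e., $c_1$-$c_{10}$) and the corresponding fault types $d$, where $c_1$ means ATP (Automatic Train Protection) failure, $c_2$ communication interruption between DMI (Driver Machine Interface) and CPU, $c_3$ emergency brake fault, $c_4$ ATP startup failure, $c_5$ normal brake failure, $c_6$ brake test failure, $c_7$ bypass relay being invalid, $c_8$ relay interface of normal brake being invalid, $c_9$ redundant relay interface being invalid, $c_{10}$ relay interface of emergency brake being invalid, "d=1" brake test failure, "d=2" normal brake failure, and "d=3" brake bypass failure.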
To evaluate the performance of the proposed framework, we perform a number of experiments by varying the number of cases and attributes and compare the case retrieval times. The local node used for the evaluations had an Intel(R) Pentium(R) CPU G630 2.7 GHz processor with 4 GB of memory and 500 GB of disk space. The local node contained the case-base in a text file database, and the case retrieval jobs were programmed in Java on the Eclipse platform. Figure 5 illustrates the comparison of case retrieval times with different numbers of attributes. It shows that attribute reduction clearly reduces the time consumed by case retrieval and fault diagnosis/predication. Figure 6 shows the comparison of case retrieval times with different numbers of nodes (i.e., compute units). We observe that even with a large case-base of up to 7,000 cases in the local node, case retrieval and fault diagnosis/predication can be done on a timescale of seconds. For the analysis of 7,000 cases with the proposed method, the experiments show a speedup of up to 2 times using a computing cluster (with 3 compute units) as compared to a single node.

[...] to reduce the attributes of fault cases, which significantly decreases the complexity of case retrieval; (3) case retrieval by cloud computing is able to scale to large case-bases and improves the efficiency of fault diagnosis and predication. The effectiveness of the proposed algorithm is demonstrated through its use in fault diagnosis and predication of a railway onboard system. Optimization of text mining and rough set reduction in cloud computing needs further research.

Figure 4: Process for synonym identification.

Figure 5: Comparison of case retrieval time with different numbers of attributes.

Figure 6: Comparison of case retrieval time with different numbers of nodes.

Table 1: Two examples of railway signaling maintenance records.
table, $T = (U, C \cup D, V, f)$, where $A = C \cup D$ and $C \cap D = \emptyset$. Each attribute $a \in A$ is associated with a set $V_a$ of its values by the function $f$, called the domain of $a$. (2) Indiscernibility Relation: the Indiscernibility Relation $IND(B)$ for any subset $B \subseteq A$ is defined as $IND(B) = \{(x, y) \in U \times U : \forall a \in B, f_a(x) = f_a(y)\}$. Two entities are considered indiscernible by the attributes in $B$ if and only if they have the same value for every attribute in $B$. $IND(B)$ is an equivalence relation that partitions $U$ into equivalence classes, which are denoted by $U/IND(B)$. (3) Lower and upper approximations: for any set of cases $X \subseteq U$ and attribute subset $R \subseteq A$, the lower approximation of $X$, $R_{*}(X)$, is the set of objects of $U$ that are surely in $X$, whereas the upper approximation of $X$, $R^{*}(X)$, is the set of objects of $U$ that are possibly in $X$.
(4) Positive region: the positive region of the decision classes $U/IND(D)$ with respect to the condition attributes $C$ is denoted by $POS_C(D) = \bigcup_{X \in U/IND(D)} C_{*}(X)$. It is the set of objects of $U$ that can be classified with certainty into the classes $U/IND(D)$ by employing the attributes of $C$.

Table 2: Case-base of train onboard equipment failures.

Table 2 shows part of a case-base of train onboard equipment failures with data comprising ten different attributes after feature extraction (i.e., $c_1$-$c_{10}$) and the corresponding fault types $d$, where $c_1$ means ATP (Automatic Train Protection) failure, $c_2$ communication interruption between DMI (Driver Machine Interface) and