Text Matching and Categorization: Mining Implicit Semantic Knowledge from Tree-Shape Structures

The diversity of large-scale semistructured data makes the extraction of implicit semantic information enormously difficult. This paper proposes an automatic, unsupervised method of text categorization in which tree-shape structures represent semantic knowledge and implicit information is explored by mining hidden structures, without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which largely enhances the accuracy of matching and classifying texts. The experimental results show that the proposed algorithm remarkably reduces the time and effort spent in training and classifying and outperforms established competitors in correctness and effectiveness.


Introduction
The rapid development of social networks brings explosive growth in users as well as dramatic changes in the services provided. As a result, large-scale text classification and retrieval have regained the interest of researchers [1]. Traditional knowledge representations are highly targeted and are powerful in expressing empirical knowledge or rules, but they are insufficient for representing the complex and uncertain knowledge found on the social web. Texts share various forms of common structural components, from simple nodes and edges to paths [2,3], subtrees [4], and summaries [5] [6]. Direct semantic information can be found easily, but hidden semantic information is extremely difficult to detect. Zaki and Aggarwal [4] propose a structural rule-based classifier for semistructured data, called XMiner, which mines both parent-child and ancestor-descendant frequent branches and handles structured or semistructured data well; its shortcoming is the lack of semantic information in text representation.
Semantic similarity assessment [7,8] can be exploited to improve the accuracy of current information retrieval techniques [9], to automatically annotate documents [10,11], to protect privacy [12,13], to match web services [14], and to resolve problems based on knowledge reuse [15]. Semantic networks [16][17][18] are more concerned with semantic information. Because semantic data mining can be based on text analysis, many semantic community detection algorithms use the latent Dirichlet allocation (LDA) model as their core model, a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar [19,20]. However, semantic analysis based on LDA [16,21] is complicated, and semantic information mining is important for text matching and categorization, so a more efficient and convenient method that still yields precise and accurate results is needed.
A relation between two words can be one-way or bidirectional, depending on the interrelationship between them, so it is reasonable to use graphs or trees to express a text. The proposed method mines implicit semantic information without cumbersome lexical analysis: links express semantic knowledge, and pointers record a traversal sequence that describes the different abilities of nodes in expressing a text. The method not only extracts semantic information by creating trees but also calculates the similarities of coexisting hidden structures to measure the similarities of texts. The three main contributions of this work are as follows. The first is to represent all semantic information in a text using tree-shape structures. The second is to generate semantic trees by combining pointers with a fixed traversal strategy and to use subtrees as addenda structures. The third is to discover implicit knowledge by analyzing semantic trees and mining coexisting hidden structures.

Algorithm 1 (building a SemGraph), steps 6 to 20:
(6) IF the SemGraph = NIL THEN
(7) add the node to the SemGraph
(8) Count(node) = 1
(9) ELSE
(10) IF the node does not appear in the SemGraph THEN
(11) add the node to the SemGraph
(12) add all edges of the node to the SemGraph
(13) Count(node) = 1
(14) IF the node appears in the SemGraph THEN
(15) IF the Counts of them are equal THEN
(18) set the direction of the pointer randomly
(19) IF the Counts of them are unequal THEN
(20) set the direction of the pointer from the node with a bigger Count to the one with a smaller Count

Representation of Semantic Information
Because a knowledge model is highly dependent on relations, it is reasonable to use trees to express a text. This paper employs tree-shape structures to describe a text, from which semantic information can be mined without cumbersome lexical analysis.

Semantic Graphs.
A text is deemed a sequence of sentences. Based on the assumption that words in one sentence have semantic relationships (a relationship exists between any two distinct words of the same sentence), the nodes arising in the same sentence are linked with each other in the SemGraph.
Definition 2 (isolated node). The node has neither in-degree pointers nor out-degree pointers.
The process of building SemGraph is as in Algorithm 1.
The creation of a SemGraph is illustrated by the example shown in Figure 1. The text is scanned sentence by sentence. Supposing nouns $n_1$ and $n_2$ appear in the first sentence, they are added to the SemGraph directly and their Counts are each set to 1 (Figure 1(a)). Since the Counts of the two nodes are equal, the direction of the pointer is set randomly. Figure 1(b) supposes that $n_1$ and $n_3$ coexist in the second sentence. Because $n_1$ already exists in the SemGraph, there is no need to add it again, but its Count must be updated (Count = Count + 1). As a new node, $n_3$ is added to the SemGraph directly and its Count is set to 1. Because Count($n_1$) = 2 > Count($n_2$) = 1, the pointer is directed from $n_1$ to $n_2$. Similarly, since Count($n_1$) = 2 > Count($n_3$) = 1, the pointer between them is set from $n_1$ to $n_3$. In short, pointers run from the more frequent node to the less frequent one. In Figure 1(c), the following sentences in the text are supposed to be ⟨$n_4$ $n_5$⟩, ⟨$n_4$ $n_6$ $n_7$⟩, ⟨$n_4$ $n_6$⟩, ⟨$n_2$ $n_4$⟩, ⟨$n_1$ $n_5$⟩, ⟨$n_3$ $n_7$⟩, ⟨$n_6$ $n_3$⟩. After each sentence is processed in the same way, the final result is shown in Figure 1(c). If the last sentence is ⟨$n_6$ $n_3$⟩, the Counts of $n_3$ and $n_6$ each increase by one. Because Count($n_3$) = 4 > Count($n_1$) = 3 and Count($n_6$) = 4 > Count($n_4$) = 3, the pointers between $n_1$-$n_3$ and $n_4$-$n_6$ should be changed, as shown in Figure 1(d).
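The construction walked through above can be sketched in Python. This is a minimal illustration rather than the paper's Algorithm 1: the function name `build_semgraph`, the `seed` parameter, and the choice to orient all pointers once at the end (instead of updating them after each sentence) are assumptions made for brevity. The final orientation is the same except where equal Counts are broken randomly.

```python
import random
from collections import Counter
from itertools import combinations

def build_semgraph(sentences, seed=0):
    """Build a SemGraph from sentences given as lists of node names."""
    rng = random.Random(seed)   # only used to break ties between equal Counts
    count = Counter()           # Count of each node
    edges = set()               # undirected links stored as sorted pairs
    for sentence in sentences:
        words = list(dict.fromkeys(sentence))   # ignore repeats inside a sentence
        count.update(words)
        for a, b in combinations(words, 2):
            edges.add(tuple(sorted((a, b))))
    # Orient each pointer from the node with the bigger Count to the smaller one.
    pointers = set()
    for a, b in sorted(edges):
        if count[a] == count[b]:
            src, dst = rng.sample([a, b], 2)    # equal Counts: random direction
        elif count[a] > count[b]:
            src, dst = a, b
        else:
            src, dst = b, a
        pointers.add((src, dst))
    return count, pointers

# First two sentences of the Figure 1 walk-through.
count, pointers = build_semgraph([["n1", "n2"], ["n1", "n3"]])
# count["n1"] is 2, and both pointers leave n1.
```

Storing links as sorted pairs keeps the sketch independent of the order in which words appear within a sentence.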
After an original SemGraph has been built, the redundant nodes (those whose Counts fall below threshold P) must be pruned to simplify the graph, as they describe the text only weakly.
Threshold P is set from the following quantities: the sum of Counts, the number of nodes, the number of characters in the text, the average number of characters in the samples (V), and an optional manually set value. Together these parameters give a measurement of the SemGraph, and P keeps this measurement within the range (0, 1). If some texts are more authoritative or represent a class more strongly, the manually set value is reset to a smaller value based on specialist knowledge: the smaller this value is, the more important the text is. A smaller P ultimately retains more of the information in the text.
Further explanations are as follows.
(1) P is inversely proportional to the mean of Counts in SemGraph.
(2) More characters in a text lead to more redundant information, so the size of the text is used to fine-tune the threshold.
(3) Experts can manually select some representative texts and assign them a smaller manually set value, so that a SemGraph representing a class (denoted as class-SemGraph) is built quickly.

Algorithm 2, step (8): IF the link indicator of the node pair equals 1 in both SemGraphs THEN (9)

Fusion
Algorithm 2, continued: IF [...] < $\Delta_{\text{subedge}}$ THEN delete the link (16) delete [...]

A relationship that falls below $\Delta_{\text{subedge}}$ is weak or nonexistent, so it should be deleted from the class-SemGraph.
By adding newly classified texts to the corresponding class-SemGraph, high accuracy and real-time performance can be ensured. In conclusion, a merging operation must be performed that combines text-SemGraphs to create or update class-SemGraphs. The implementation strategy is given in Algorithm 2.
The link indicator used in Algorithm 2 judges whether there is a link between two nodes in a SemGraph: it equals 1 if a link exists between the nodes and 0 otherwise.
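The merging step can be illustrated with a short Python sketch. It is not Algorithm 2 itself. It assumes that a node's weight is its Count divided by the sum of all Counts in the class, and that a link is added once it occurs in more than $\Delta_{\text{addedge}}$ text-SemGraphs. The name `merge_semgraphs` and the parameter `delta_addedge` are hypothetical, and deletion under $\Delta_{\text{subedge}}$ is not shown.

```python
from collections import Counter

def merge_semgraphs(text_graphs, delta_addedge=1):
    """Merge text-SemGraphs, each a (Counts, links) pair, into class-level data."""
    total = Counter()       # summed Counts of same-named nodes
    link_seen = Counter()   # number of text-SemGraphs in which each link occurs
    for count, links in text_graphs:
        total.update(count)
        link_seen.update({tuple(sorted(link)) for link in links})
    denom = sum(total.values())
    # Assumed weight formula: Count of the node over the sum of all Counts.
    weights = {node: c / denom for node, c in total.items()}
    # Keep a link only if it occurs more often than the add-edge threshold.
    links = {link for link, seen in link_seen.items() if seen > delta_addedge}
    return weights, links

g1 = (Counter({"n1": 2, "n2": 1, "n3": 1}), {("n1", "n2"), ("n1", "n3")})
g2 = (Counter({"n1": 1, "n2": 1}), {("n2", "n1")})
weights, links = merge_semgraphs([g1, g2], delta_addedge=1)
# weights["n1"] is 0.5; only the n1-n2 link occurs in both text-SemGraphs.
```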
Insignificant nodes in the class-SemGraph must be deleted periodically to ensure timeliness, which is done by Algorithm 3. A node is deemed less capable of describing a class when both of the following hold: the sum of its in-degree and out-degree pointers is below a threshold, and its Count is below a threshold, where the threshold is set artificially and is proportional to the average length of texts. The root of the class-SemGraph, which is the start node when traversing the graph or mining frequent structures, should be relocated whenever the network changes. Finding the root is also done by Algorithm 3.

Algorithm 3: deleNode(G, t), final steps:
(5) delete the node and all edges of the node
(6) R records the node that has the maximum Count
(7) return R
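The pruning condition can be sketched as follows. The function name `prune_nodes` and the two threshold parameters `min_degree` and `min_count` are hypothetical. How the thresholds are derived from the average text length is left out, since the source gives no formula for it.

```python
def prune_nodes(count, links, min_degree, min_count):
    """Drop nodes whose pointer degree and Count are both below their thresholds."""
    degree = {node: 0 for node in count}
    for a, b in links:          # each pointer adds to in-degree + out-degree
        degree[a] += 1
        degree[b] += 1
    weak = {n for n in count if degree[n] < min_degree and count[n] < min_count}
    kept_count = {n: c for n, c in count.items() if n not in weak}
    kept_links = {(a, b) for a, b in links if a not in weak and b not in weak}
    return kept_count, kept_links

count = {"n1": 3, "n2": 2, "n3": 1}
links = {("n1", "n2"), ("n1", "n3")}
kept_count, kept_links = prune_nodes(count, links, min_degree=2, min_count=2)
# n3 fails both tests and is removed together with its pointer.
```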

Formation of Trees.
To analyze implicit frequent structures, SemGraphs should be decomposed into several trees, so that studying the features of the trees is equivalent to processing the SemGraph. Plain Depth-First Search (DFS) or Breadth-First Search (BFS) cannot meet the requirements of social networks, because a fixed traversal strategy would miss or destroy some important relationships; pointers are therefore needed to obtain correct mining results when traversing graphs. The root is chosen as follows: (1) choose the node with the maximum Count; (2) if more than one node has the maximum Count, choose the one with more out-degree pointers.
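The two-step root rule maps directly onto a lexicographic maximum. A minimal sketch, in which the function name `choose_root` and the data layout (a Count dictionary plus a set of directed pointers) are assumptions:

```python
def choose_root(count, pointers):
    """Pick the node with the largest Count; break ties by out-degree."""
    out_degree = {node: 0 for node in count}
    for src, _dst in pointers:
        out_degree[src] += 1
    # Compare by Count first, then by number of outgoing pointers.
    return max(count, key=lambda n: (count[n], out_degree[n]))

root = choose_root({"a": 3, "b": 3, "c": 1}, {("a", "b"), ("a", "c")})
# "a" and "b" tie on Count; "a" has more outgoing pointers, so root is "a".
```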
BuildTree() is the semantic graph searching method proposed in this paper; it is based on DFS or BFS and does not lose semantic information. BuildTree() usually generates more than one tree, so that several trees together express all semantic relationships between nodes. Algorithm 4 is the semantic graph searching strategy based on DFS.
In Figure 1, $n_6$ has the biggest Count, so $n_6$ is chosen as the root. BuildTree() creates three trees based on DFS and three based on BFS, shown in Figures 2 and 3, respectively. Although the two results differ, they do not affect follow-up work, as they express exactly the same semantic information.
$T_0$ is a master subtree, while both $T_1$ and $T_2$ are auxiliary subtrees. Plain DFS or BFS creates only master subtrees, which omit some vital semantic relationships. For instance, $T_0$ suggests that $n_2$ and $n_4$ have no semantic relation, but they actually have one in the SemGraph, so $T_1$ is essential to supply this missing relationship.
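The idea of a master subtree plus auxiliary subtrees can be sketched as repeated DFS over the links that earlier trees left uncovered. This is a simplified illustration, not Algorithm 4: it ignores pointer directions, assumes no self-links, and restarts from the smallest remaining node name when the root has no links left. The name `build_trees` is hypothetical.

```python
def build_trees(links, root):
    """Split an undirected graph into DFS trees that together cover every link."""
    remaining = {frozenset(link) for link in links}
    trees = []
    while remaining:
        adjacency = {}
        for link in remaining:
            a, b = tuple(link)
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
        if root not in adjacency:       # root is isolated now: pick another start
            root = min(adjacency)
        tree, visited, stack = [], {root}, [root]
        while stack:                    # iterative depth-first search
            node = stack[-1]
            nxt = next((m for m in sorted(adjacency[node]) if m not in visited), None)
            if nxt is None:
                stack.pop()
            else:
                visited.add(nxt)
                tree.append((node, nxt))
                stack.append(nxt)
        trees.append(tree)
        remaining -= {frozenset(edge) for edge in tree}   # drop visited edges
    return trees

trees = build_trees({("a", "b"), ("b", "c"), ("a", "c")}, root="a")
# The first tree holds a-b and b-c; the second recovers the skipped a-c link.
```

The second tree plays the role of an auxiliary subtree: it carries a relationship that a single DFS pass would omit.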

Mining Implicit Frequent Structures
Definition 3 (implicit frequent structure (IStruc)). IStruc is a frequent structure of SemGraph that preserves ancestor-descendant relationships.
That is, if at least two connected nodes in an IStruc are not linked in the SemGraph, the frequent structure is called an implicit frequent structure.

Algorithm 4: BuildTree(R), final steps:
(11) delete the edges that have been visited in G
(12) IF node is an isolated node THEN delete node
(13) ELSE BuildTree(node)

Definition 5 (branch root). It meets the following conditions: [...] is the smallest one among all [...]'s, and [...] is the smallest one among all [...]'s.
Definition 6 (List). It is denoted as [text ID, tree ID, Scope, (Scope of branch root)$^{k}$], where the text ID identifies a text, the tree ID identifies a tree, and $k$ is the number of branch nodes.
To mine IStrucs, the new structures generated by connecting nodes one by one must be analyzed. There is no need to connect all the nodes, however. For example, if two nodes in a tree do not have a common ancestor node, they should not be connected. It is therefore essential to judge whether the nodes meet some preconditions.
To illustrate the process of mining IStrucs by computing the Scopes of nodes, two sets of trees representing two texts are shown in Figure 4. The index of $n_1$ in Tree 0 of Text 1, as determined by DFS, is 0, so the first bound of its Scope is 0. The direct successor nodes of $n_1$ are {$n_2$, $n_3$, $n_4$}, whose indices are {1, 4, 5}. Since $n_4$ has the maximum index (5), the second bound of $n_1$ is set to 5 and the Scope of $n_1$ is [0, 5].
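The Scope computation in this example can be reproduced with a preorder DFS. In the sketch below, the tree layout is an assumption chosen so that the direct successors of `n1` receive the indices 1, 4, and 5 as in the text; the children `a` and `b` of `n2` are hypothetical. The helper `is_descendant` encodes one reading of the containment precondition on Scopes.

```python
def compute_scopes(tree, root):
    """Return {node: (index, upper)} where index is the preorder DFS number."""
    scopes = {}
    counter = 0

    def visit(node):
        nonlocal counter
        index = counter
        counter += 1
        upper = index                   # a leaf's Scope is [index, index]
        for child in tree.get(node, []):
            upper = max(upper, visit(child))
        scopes[node] = (index, upper)
        return upper

    visit(root)
    return scopes

def is_descendant(scope_x, scope_y):
    """True when the Scope of x lies inside the Scope of y (also true for x == y)."""
    return scope_x[0] >= scope_y[0] and scope_x[1] <= scope_y[1]

tree = {"n1": ["n2", "n3", "n4"], "n2": ["a", "b"]}   # hypothetical layout
scopes = compute_scopes(tree, "n1")
# scopes["n1"] is (0, 5); n2, n3, n4 get the indices 1, 4, 5.
```

Because every descendant's index falls inside its ancestor's Scope, the ancestor test reduces to two integer comparisons.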
Lists of nodes are as follows. The format of an item in a list is ⟨text ID, tree ID, Scope⟩.
A node that appears in only one text cannot be part of a common IStruc, so such nodes are deleted. $n_2$, $n_3$, and $n_4$ appear only in Text 1, so they are deleted. After the redundant nodes are deleted, the remaining nodes are {$n_1$, $n_5$, $n_6$}. Assuming that $n_1$ is a root node, it is linked with the other nodes that meet the preconditions to create new IStrucs. Therefore, {$n_1$ $n_5$} and {$n_1$ $n_6$} are created, as shown in (3).
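The filtering step above amounts to a set intersection followed by pairing the root with each surviving node. A small sketch, in which the node list of Text 2 is an assumption consistent with the example and the precondition check on Scopes is omitted:

```python
def common_nodes(texts):
    """Keep only the nodes that occur in every text."""
    node_sets = [set(nodes) for nodes in texts.values()]
    return set.intersection(*node_sets)

texts = {
    1: ["n1", "n2", "n3", "n4", "n5", "n6"],
    2: ["n1", "n5", "n6"],              # assumed node list for Text 2
}
shared = common_nodes(texts)
# Candidate two-node structures rooted at n1 (preconditions not checked here).
candidates = [("n1", node) for node in sorted(shared) if node != "n1"]
# shared is {"n1", "n5", "n6"}; candidates pairs n1 with n5 and with n6.
```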

Scoring Tactics
Common IStrucs between semantic trees do not by themselves prove that an association exists, so it is essential to analyze the authority of IStrucs. The following scoring tactic for IStrucs computes similarities between two texts or between an unknown-class text and a class. In the scoring rules, one symbol denotes a node in an IStruc and another denotes a node in a SemGraph; Δ is the degree of variance between a text and a class.

Experiment
Three datasets are used in this paper.
(1) SND: the dataset is gathered from sina (http://www.sina.com/) repeatedly. Training data are collected in different periods, and the testing set is collected dynamically from websites, which keeps it timely and focused on hot topics. The training set contains 5200 documents in 5 different classes, while the testing set has 2500 documents.
(2) TREC: the dataset (http://trec.nist.gov/data.html), based on a subset of the AP newswire stories, has 242,918 stories. Over 50,000 texts are selected from TREC randomly, reporting events from areas as different as politics, finance, media, and entertainment.
Three sets of baseline approaches are chosen for the experiments.
(1) $k$-NN approach: this approach finds the $k$ nearest neighbors in the training set. It then counts how many of these neighbors belong to the $i$th class. The probability of a test point belonging to each class is obtained by dividing these counts by $k$.
(2) Term vector model: it is an algebraic model for representing texts as vectors of identifiers, which is used in information filtering, retrieval, indexing, and relevancy rankings. VSM [22] indicates the importance of topics by term weights computed from the term frequencies.
(3) Multilabel classification approaches: MetaLabeler [23] can determine the relevant set of labels for each instance without intensive human involvement or expensive cross-validation.Two steps are involved: one is to construct the metadata; the other is to learn a metamodel.The first step can be considered as a multiclass classification problem.
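As an illustration of the first baseline, the class probabilities of a $k$-NN vote can be computed as below. Euclidean distance and the toy feature vectors are assumptions; the source does not state which distance or features were used.

```python
import math
from collections import Counter

def knn_class_probabilities(train, query, k):
    """train is a list of (vector, label); return each label's share of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _vector, label in nearest)
    return {label: n / k for label, n in votes.items()}

train = [
    ((0, 0), "Car"), ((0, 1), "Car"), ((1, 0), "Car"),
    ((5, 5), "Book"), ((6, 5), "Book"),
]
probs = knn_class_probabilities(train, query=(0.2, 0.1), k=4)
# Three of the four nearest neighbors are "Car", so probs["Car"] is 0.75.
```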
The size of the training dataset should be kept within a reasonable range. Too little data undermines the authority of the SemGraph, while too much data incurs unnecessary computational cost. After class-SemGraphs have been established, texts of unknown class continue to be learned to keep the corresponding class-SemGraph timely and of high quality, so a bigger training dataset is not necessarily better. To determine the size of the training dataset, texts are added in sets of ten to build or update a class-SemGraph. If the increment of information is small and the added information is of low importance, the learning process stops.
In Figure 6, a training dataset size within [70, 100] is reasonable. If a class contains a relatively large amount of information, such as Book, the size of the training set should be set to a bigger value.
Details of the training sets and the class-SemGraphs are shown in Table 1. The number of nodes in SemGraphs is compared with the size of datasets in Figure 7. Taking Car as an example, the knowledge generated after studying 100 texts in Car is shown in Table 2. The weights in Table 2 are the sums of the weights of the same nodes in different texts, calculated by the algorithms described above.

Adding new texts that have been categorized into the corresponding class-SemGraph helps to update it in real time. Manual analysis of the training dataset shows that most of the texts can be classified into one or two classes (Figure 8), but only the closest matching class is selected. Mapping one text to several matched classes would blur the distinctions between classes and make text classification more difficult. For example, Car normally contains features of the following classes: finance, energy, transportation, and environmental protection. If texts in Car were used to update the SemGraph of finance, the two class-SemGraphs would become more similar and harder to distinguish, so this paper assigns a text only to its most similar class. The final classification results are shown in Table 3.
The experimental results in Table 4 indicate that the proposed algorithm is effective: it outperforms the other algorithms and is stable across different kinds of data.
Analysis of the relationships between the wrongly classified texts and their assigned classes shows that the errors of the proposed algorithm are acceptable and reasonable. The errors will not affect users' experiences but may indirectly influence the accuracy of class-SemGraphs. This shortcoming is simple to address by adding a check that filters inappropriate texts before class-SemGraphs are updated. Instead of using all the new texts to update class-SemGraphs, the improved method selects only the texts that match one class highly and have a low matching degree with the other classes. If class-SemGraphs are to be kept entirely pure, only the texts matching a single class are chosen to update the corresponding class-SemGraph. Figure 8 shows that the texts belonging to one class are the most numerous, so this method is feasible.
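The filtering rule described above could look like the following sketch. The function name `select_update_class` and the thresholds `high` and `low` are hypothetical; the source does not give concrete values for what counts as a high or low matching degree.

```python
def select_update_class(scores, high=0.8, low=0.3):
    """Return the class whose SemGraph should be updated, or None to skip the text."""
    best = max(scores, key=scores.get)
    others = [s for cls, s in scores.items() if cls != best]
    # Update only on a clear match: high for one class, low for all the others.
    if scores[best] >= high and all(s <= low for s in others):
        return best
    return None

clear = select_update_class({"Car": 0.9, "Finance": 0.2})      # "Car"
mixed = select_update_class({"Car": 0.9, "Finance": 0.6})      # None
```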

Conclusion
Compared with other mainstream methods, the proposed method is simple and able to discover implicit knowledge. In addition, the algorithm is more stable in dealing with different kinds of data. Analysis of the classification results shows that errors fall within a reasonable range and that the relationship between an incorrectly classified text and the wrongly assigned class is meaningful.

Figure 1: The creation of a SemGraph. Black symbols represent the items changed. Nodes are denoted as [node name: Count].
Strategy. SemGraphs that represent texts are denoted as text-SemGraphs. A class-SemGraph is generated by merging the text-SemGraphs of the same class. Two problems need to be considered in the merge.
(i) Weights of nodes must be recalculated. The formula is $w_i = \mathrm{Count}(n_i) / \sum_j \mathrm{Count}(n_j)$, where $n_i$ represents a node and $w_i$ embodies the importance of node $n_i$ in the corresponding class.
(ii) If the number of occurrences of a new relationship is larger than threshold $\Delta_{\text{addedge}}$, the link between the two nodes is added to the class-SemGraph. If the number of disappearances of an old relation is less than threshold $\Delta_{\text{subedge}}$, the link is deleted from the class-SemGraph.

Algorithm 2: MSGraph().
Input: newly added SemGraph; time = 0
Output: class-SemGraph
(1) IF the new SemGraph = NIL THEN exit
(2) IF the class-SemGraph = NIL THEN class-SemGraph = new SemGraph
(3) ELSE
(4) unify names in the two SemGraphs
(5) calculate the weight for each node
(6) calculate the sum of weights of same nodes
(7) FOR each pair of distinct nodes in the SemGraphs DO
(8) ...
(17) deleNode(class-SemGraph, time)
(18) time = time + 3

Figure 2: The result of semantic graph searching strategy based on DFS.

Figure 3: The result of semantic graph searching strategy based on BFS.

Figure 4: Two sets of trees representing different texts. Text 1 is expressed by two trees, while Text 2 is expressed by one tree. The bold numbers represent the node IDs generated by DFS.

Figure 5(a) existing in Tree 0 of Text 1. ⟨2, 0, [0, 2], {[2, 2]}⟩ means that Tree 0 of Text 2 has the structure shown in Figure 5(b). Nodes {$n_5$, $n_1$, $n_6$} are not directly connected in the original trees, so $n_5$-$n_1$-$n_6$ is an implicit frequent structure. Although the mining results have different structures in SemGraphs, they contain the same hidden knowledge. Without mining IStrucs, some implicit relations of texts are ignored entirely, which greatly reduces the accuracy of text matching.

Figure 6: The ratio of incremental information against the number of texts.

Figure 7: Data comparisons. The number of nodes in different class-SemGraphs is compared with the size of datasets.

Figure 8: The numbers of classes by numbers of covering documents.
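Algorithm 3: deleNode(G, t).
Input: SemGraph G; the time interval parameter t
Output: a simplified SemGraph G; the root node R
(1) IF t meets the restriction of time THEN exit
(2) ELSE
(3) FOREACH node IN G
(4) IF the pointer degree of the node is below its threshold && Count(node) is below its threshold THEN

Definition 4 (Scope). It represents the Scope of a node in a tree, whose format is $[l, u]$, where $l$ is the index of the node generated by traversing the tree according to DFS or BFS and $u$ is the maximum value of the $l$'s among all successor nodes of the node.

Algorithm 4: BuildTree(R).
Input: SemGraph G; the root node R; int $i$ = 0
Output: trees (denoted as $T_i$, $i$ = {1, 2, 3, ...})

Preconditions. If node $x$ (with Scope $[l_x, u_x]$ in its List) and node $y$ (with Scope $[l_y, u_y]$ in its List) are linked in an IStruc, they must meet one of the following conditions.
(1) If $l_x \ge l_y$ && $u_x \le u_y$, $x$ is a child node of $y$ in the SemGraph.
(2) If the Scopes of $x$ and $y$ do not overlap and $x$ and $y$ have the same ancestor node, $x$ is a brother node of $y$ in the SemGraph.
(3) If the containment in (1) holds between the branch-root Scopes of $x$ and $y$, and the two nodes have the same text ID and the same tree ID, $x$ is a child node of $y$'s branch root in the SemGraph.
(4) If the branch-root Scopes of $x$ and $y$ do not overlap, and the two nodes have the same text ID and the same tree ID, $x$ is a brother node of $y$'s branch root in the SemGraph.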

Table 1: Details of the training sets and SemGraphs.