Open Relation Extraction in Patent Claims with a Hybrid Network

Research on relation extraction from patent documents, a high-priority topic in natural language processing in recent years, is of great significance to a series of downstream patent applications, such as patent content mining, patent retrieval, and patent knowledge base construction. Due to the lengthy sentences, cross-domain technical terms, and complex structure of patent claims, it is extremely difficult to extract open triples with traditional Natural Language Processing (NLP) parsers. In this paper, we propose an Open Relation Extraction (ORE) approach that transforms the relation extraction problem in patent claims into a sequence labeling problem and extracts non-predefined relation triples from patent claims with a hybrid neural network architecture based on a multihead attention mechanism. The hybrid framework combines Bi-LSTM and CNN to extract argument phrase features and relation phrase features simultaneously. The Bi-LSTM network captures long-distance dependency features, and the CNN obtains local content features; a multihead attention mechanism is then applied to capture potential dependency relationships across the time steps of the recurrent model. Results on our constructed open patent relation dataset show that our method outperforms both traditional machine learning classification algorithms and state-of-the-art neural network classification models in terms of Precision, Recall, and F1.


Introduction
With the development of the economy, patent documents, an extremely important knowledge carrier, record a large number of valuable inventions, creative ideas, and excellent design concepts. Automatically extracting non-predefined relation triples from patent claims, which set out a series of rights granted by a government for a limited period, is a vital basic research task for upper-level applications of patent document analysis, such as patent information retrieval [1,2], patent classification [3], patent categorization [4], and patent knowledge graph construction [5].
However, relation extraction from patent documents is not an easy task. On the one hand, the specification requirements for patent writing lead to lengthy and complex sentences, which are difficult to parse with general-purpose NLP tools; on the other hand, traditional approaches, namely the NLP-based linguistic method, the statistics-based machine learning method, and the multimethod hybrid method [6], cannot capture temporal information or long, sentence-level global dependency features.
In this paper, we propose a hybrid neural network model for open relation extraction that extracts relation triples from patent claims. The Bi-LSTM network obtains temporal information from the whole sentence, and CNN pooling captures local content information; at the same time, multihead attention is incorporated to extract content dependency features so as to better serve the sequence label classification problem. Our main contributions are summarized as follows: (

Related Work
As for traditional semantic relation extraction from patent documents, there are mainly three methods: the NLP-based linguistic method, the statistics-based machine learning method, and the multimethod hybrid method. In the early period of semantic relation extraction from patent documents, NLP-based linguistic methods were dominant, and most existing methods made use of linguistic analysis. Regular expression pattern matching techniques were proposed to parse, annotate, and extract target semantic information for knowledge sharing in the machine-readable OWL format [7]; hyponymy lexical relations were extracted from patent documents using lexicosyntactic patterns [8], and knowledge was extracted from unstructured patent data in combination with a domain ontology [9]. Data-intensive methods were incorporated into patent claim analysis, combined with symbolic grammar formalisms, to enhance analysis robustness [10]. Conceptual graphs were extracted from patent claims for patent similarity analysis in any domain of interest [5]. A patent processing system named PATExpert was designed for summarizing patent claims, in which deep syntactic dependency analysis operates on the deep-syntactic structures of the claims to improve their readability [11]. Gabriela et al. [12] proposed extracting verbal content relations from patent claims using deep syntactic structures. Fantoni et al. [13] proposed a method for automatically detecting and extracting information about the functions, physical behaviours, and states of a system from patent text with a large knowledge base and a series of NLP tools. Lee et al. [14] proposed a hierarchical keyword vector for representing the dependency relationships among claim elements and a tree matching algorithm for comparing claim elements of patents to assess patent infringement risks. Taeyeoun Roh et al. [15] proposed a series of rules to structure and layer technological information in patent claims through NLP tools.
Statistics-based machine learning, in turn, has frequently been applied to patent analysis in recent years. Gabriela [16] proposed a two-stage method of rule-based claim paragraph segmentation and conditional random field (CRF)-based segmentation of lengthy sentences, which helps automatically detect division phrases for forming meaningful shorter sentences. Wang et al. [17] presented an approach to extracting principle knowledge from process patents by classifying with a contraction matrix. Okamoto et al. [18] proposed an information-based technique to grasp the patent claim structure through entity mention extraction and relation extraction with the DeepDive [19] platform, which uses Markov logic network-based inference [20] and distant supervision-based labeling [21] to extract relations from unstructured text. Deng et al. [22] proposed constructing a knowledge graph to facilitate technology transfer, in which a common knowledge base can reveal the technical details of technical documents and assist with the identification of suitable technologies.
Besides, with the rise of deep learning, especially its wide application in natural language processing, hybrid technologies have emerged for patent mining tasks such as patent information extraction, patent relation extraction, and the construction of patent semantic knowledge bases. Yang and Soo [5] proposed a method to convert a patent claim into a formally defined conceptual graph with hybrid techniques of part-of-speech tags, conceptual graphs, domain ontology, and dependency trees. Korobkin et al. [23] proposed a hybrid methodology of LDA-based statistical and semantic text analysis to extract physical knowledge in the form of physical effects and their practical applications. Carvalho et al. [24] presented a hybrid method of extracting semantic information from patent claims by using semantic annotations on phrasal structures, abstracting domain ontology information, and outputting ontology-friendly structures to achieve generalization. Lv et al. [25] proposed a hybrid method of patent terminology relation extraction that combines an attention mechanism with the Bi-LSTM [26] model to construct a patent knowledge graph.
Unlike traditional relation extraction, where the categories of relationships are defined in advance, open relation extraction (ORE) extracts non-predefined triples from unstructured text. ORE was first defined by Banko et al. [27], who proposed extracting non-predefined relations from the web, attracting extensive attention and follow-up research in various fields. Del Corro and Gemulla [28] then proposed a dependency parsing-based clause IE framework to detect and extract "useful" pieces of information from clauses. Neural networks have also been incorporated into ORE [29,30] with end-to-end sequence models or encoder-decoder models.
Our work is similar to that of Lv et al. [25] and to [29][30][31], but we are the first to combine Bi-LSTM and an attention mechanism with open relation extraction to extract non-predefined relationships from patent documents in the form of Subject-Relation-Object triples. Because we believe that NLP-based parsing tools cannot capture the long-range dependencies of lengthy patent sentences, we expect attention applied at different phases to improve end-to-end sequence labeling classification. We propose a hybrid neural network framework for extracting open relations from patent claims with multihead attention. Although Bidirectional Encoder Representations from Transformers (BERT) [32][33][34][35], another neural network model based on the bidirectional Transformer, performs excellently in a series of natural language processing tasks including sequence tagging, we leave it for future work.

Our Hybrid ORE Neural Framework
This paper proposes a supervised neural network for extracting open relations from patent claims without predefined relation categories, which enables a supervised machine learning approach to ORE in patent claims. We define the task as a sequence tagging problem, and we develop an end-to-end neural model with Bi-LSTM and CNN with multihead attention to classify the labels. In the first stage, to handle the lengthy and complex structure of claims, a machine learning-based method is used to detect segmentation words or phrases that split a claim into short, meaningful sentences. Word features and part-of-speech features are then fed into the Bi-LSTM network. Next, a multihead attention mechanism is applied to the Bi-LSTM features to aid dependency relationship label classification. Finally, a postprocessing step assembles Argument1-relation-Argument2 triples. Our neural ORE architecture is shown in Figure 1.
Wireless Communications and Mobile Computing

Feature Embedding.
Word embedding transforms a word token into a real-valued vector that represents syntactic and semantic information from its context. Given a sentence consisting of $n$ words, $S = \{x_1, x_2, x_3, \ldots, x_n\}$, every word $x_i$ is converted into a real-valued vector $e_i$ by looking up the embedding matrix $W^{word} \in \mathbb{R}^{d_w \times |V|}$, where $V$ stands for the whole vocabulary and $d_w$ is the size of the word embedding. We use GloVe [26] as our word embedding model. Part-of-speech (POS) embedding transforms the POS tag of each word $x_i$ in sentence $S$ into a one-hot vector $p_i$; the tag set comes from the annotated Brown corpus with 36 types. Finally, the concatenation of the word embedding $e_i$ and the POS embedding $p_i$ is the input feature of our neural model.
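The lookup-and-concatenate step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy vocabulary, the three-tag POS set, the 8-dimensional random matrix standing in for pretrained GloVe vectors, and the helper name `embed_sentence` are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical toy vocabulary and POS tag set (illustrative only).
vocab = {"<unk>": 0, "a": 1, "flexible": 2, "press": 3, "plate": 4}
pos_tags = ["DT", "JJ", "NN"]

d_w = 8  # toy embedding size; the paper reports 300-dimensional word vectors
rng = np.random.default_rng(0)
# Random stand-in for a pretrained embedding matrix, stored row-wise as
# |V| x d_w, i.e. the transpose of W^word as written in the text.
W_word = rng.normal(size=(len(vocab), d_w))


def embed_sentence(words, tags):
    """Return an (n, d_w + |POS|) matrix: e_i concatenated with one-hot p_i."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in words]
    e = W_word[ids]                                   # (n, d_w) lookup
    p = np.zeros((len(tags), len(pos_tags)))          # (n, |POS|) one-hot rows
    p[np.arange(len(tags)), [pos_tags.index(t) for t in tags]] = 1.0
    return np.concatenate([e, p], axis=-1)            # concatenate per word


features = embed_sentence(
    ["a", "flexible", "press", "plate"], ["DT", "JJ", "NN", "NN"]
)
print(features.shape)  # (4, 11)
```

Each row of `features` is the per-word input vector; with the sizes reported in the paper the same construction would simply yield wider rows.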

Bi-LSTM Network.
As deep learning and natural language processing become more closely combined, the long short-term memory (LSTM) network, first proposed by Hochreiter and Schmidhuber in 1997 to address the vanishing gradient problem, has proved effective at capturing long-distance relationships in various NLP subtasks. The transfer diagram of adjacent units in an LSTM network is shown in Figure 2.
The core design philosophy of LSTM is an adaptive gating mechanism, which decides the degree to which LSTM units keep the previous state and memorize the extracted features of the current data input [36]. In the calculation process, a typical LSTM network consists of four parts: one forget gate $f_t$, one input gate $i_t$, one current cell state $C_t$, and one output gate $O_t$. Through the iterative calculation of these four parts, cell units decide whether to take the inputs, forget the memory stored before, and output the state generated later. A bidirectional LSTM network is the combination of a forward LSTM network and a backward LSTM network, where the hidden layer of the latter flows in the opposite direction to that of the former, so it can capture future information as well as past information. The Bi-LSTM model is thus able to exploit information from both the past and the future, which makes it more suitable for sequence tagging tasks. In this paper, we use the Bi-LSTM model to obtain the semantic and syntactic information from the sentence, and we obtain the combined hidden state $h_i$ by an element-wise sum of the forward hidden state $\overrightarrow{h_i}$ and the backward hidden state $\overleftarrow{h_i}$ of the two subnetworks: $h_i = \overrightarrow{h_i} \oplus \overleftarrow{h_i}$.
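To make the gating and the bidirectional combination concrete, the sketch below implements one LSTM step and a Bi-LSTM pass in NumPy. It uses the standard textbook LSTM formulation, which is an assumption here because the paper's own gate equations are not reproduced in this text; the parameter layout, the helper names, and the toy sizes are likewise hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step; gates are stacked as [f, i, o, candidate]."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f_t = sigmoid(z[0:d])               # forget gate
    i_t = sigmoid(z[d:2 * d])           # input gate
    o_t = sigmoid(z[2 * d:3 * d])       # output gate
    c_hat = np.tanh(z[3 * d:4 * d])     # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat    # current cell state C_t
    h_t = o_t * np.tanh(c_t)            # hidden state
    return h_t, c_t


def run_lstm(X, params, reverse=False):
    """Run an LSTM over X of shape (n, m); return hidden states of shape (n, d)."""
    W, U, b = params
    d = U.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    H = np.zeros((len(X), d))
    steps = range(len(X) - 1, -1, -1) if reverse else range(len(X))
    for t in steps:
        h, c = lstm_step(X[t], h, c, W, U, b)
        H[t] = h
    return H


def init_params(m, d, rng):
    """Small random weights (hypothetical initialisation for the demo)."""
    return (rng.normal(scale=0.1, size=(4 * d, m)),
            rng.normal(scale=0.1, size=(4 * d, d)),
            np.zeros(4 * d))


rng = np.random.default_rng(0)
n, m, d = 5, 11, 6                      # toy sentence length, input size, hidden size
X = rng.normal(size=(n, m))
params_fwd = init_params(m, d, rng)
params_bwd = init_params(m, d, rng)

H_fwd = run_lstm(X, params_fwd)                 # reads left to right
H_bwd = run_lstm(X, params_bwd, reverse=True)   # reads right to left
H = H_fwd + H_bwd                               # element-wise sum of both directions
print(H.shape)  # (5, 6)
```

Summing rather than concatenating keeps the hidden size at `d`, which is the design choice the text describes.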
3.4. Multihead Attention. The attention mechanism has become a predominant concept in the neural network literature in recent years and has been studied within the artificial intelligence (AI) community in a large number of applications, such as speech recognition, computer vision, natural language processing, and statistical learning. In this paper, we adopt multihead attention, which has shown excellent performance in many tasks, such as reading comprehension [ ., 2017). The essence of multihead attention is to perform multiple self-attention calculations, which enables a sequence-to-sequence neural model to obtain more features from different representation subspaces, so that the model can capture more context information from sentences. In the relevant attention equations, $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix of the multihead attention mechanism, with $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, and $V \in \mathbb{R}^{n \times d_k}$, and $i \in \{1, \ldots, h\}$ indexes the attention heads. For each attention head, we compute the attention weight by Equation (2); finally, we concatenate all heads as the output of the attention layer.
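A minimal NumPy sketch of multihead self-attention follows. It assumes the standard scaled dot-product form $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})V$ from the Transformer literature, since the paper's own equations are not reproduced in this text; the projection matrices `W_q`, `W_k`, `W_v`, `W_o`, the helper names, and the toy dimensions are hypothetical, while the four heads match the setting reported in the experiments.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n, d_k). Returns the attended values and the (n, n) weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


def multihead_self_attention(H, W_q, W_k, W_v, W_o):
    """H: (n, d_model). W_q, W_k, W_v: (h, d_model, d_k). W_o: (h * d_k, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        head, _ = scaled_dot_product_attention(H @ Wq_i, H @ Wk_i, H @ Wv_i)
        heads.append(head)
    # Concatenate the heads along the feature axis, then project back.
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 4          # toy sizes; h = 4 heads
d_k = d_model // h
H = rng.normal(size=(n, d_model))             # e.g. Bi-LSTM hidden states
W_q = rng.normal(size=(h, d_model, d_k))
W_k = rng.normal(size=(h, d_model, d_k))
W_v = rng.normal(size=(h, d_model, d_k))
W_o = rng.normal(size=(h * d_k, d_model))

out = multihead_self_attention(H, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8)
```

Each head attends over the same sequence through its own projections, which is what lets the heads specialise in different representation subspaces.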

CNN Network.
Convolutional neural networks (CNNs) are a good means of capturing salient local features from a whole sequence, owing to their capability of learning local semantic patterns through a flexible convolutional structure for multidimensional feature extraction [39]. Convolution is often thought of as the product of a weight vector and a sequence vector, where the weight matrix is regarded as the filter for the convolution [40]. Given convolution windows of various lengths, the different outputs are fed to a max-pooling layer, which yields a feature vector of fixed length.
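The following NumPy sketch illustrates why max-pooling over convolution outputs gives a fixed-length vector regardless of sentence length. The window lengths (2, 3, 4), the three filters per window, and the function names are assumptions chosen for illustration; the paper does not state these values in this text.

```python
import numpy as np


def conv_max_pool(X, filters):
    """X: (n, d) sequence features. filters: (num_filters, k, d).

    Slides each filter over all windows of k consecutive rows and keeps the
    maximum response per filter (max-pooling over time).
    """
    n, d = X.shape
    num_filters, k, _ = filters.shape
    windows = np.stack([X[t:t + k] for t in range(n - k + 1)])   # (n-k+1, k, d)
    # Dot product of every window with every filter: (n-k+1, num_filters).
    feature_maps = np.tensordot(windows, filters, axes=([1, 2], [1, 2]))
    return feature_maps.max(axis=0)                              # (num_filters,)


def multi_window_features(X, filter_banks):
    """Concatenate pooled features from filter banks of different window lengths."""
    return np.concatenate([conv_max_pool(X, f) for f in filter_banks])


rng = np.random.default_rng(0)
d = 6
filter_banks = [rng.normal(size=(3, k, d)) for k in (2, 3, 4)]

short = multi_window_features(rng.normal(size=(7, d)), filter_banks)
longer = multi_window_features(rng.normal(size=(12, d)), filter_banks)
print(short.shape, longer.shape)  # (9,) (9,)
```

Both sentences map to nine features (three filters for each of three window lengths), so the next layer always sees the same input size.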
3.6. CRF Layer. The outputs of the softmax layer are independent of one another, even though Bi-LSTM can learn semantic and syntactic information about the content. For some tasks, however, such as noun chunking and Named Entity Recognition (NER), output labels are mutually constrained. Taking "a/B_Arg1 flexible/I_Arg1 press/I_Arg1 plate/E_Arg1" as an example, label B_Arg1 must come before I_Arg1, label E_Arg1 must come after labels B_Arg1 and I_Arg1, and any other order is illegal. The label sequence in the CRF layer is computed by dynamic programming, so a model with the CRF layer is expected to outperform a model without it on this sequence estimation problem.
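The effect of such label constraints can be sketched with a small Viterbi decoder. This is a simplified illustration, not the paper's CRF layer: the transition scores are fixed by hand (zero for legal moves, a large negative value for illegal ones) rather than learned, only the Arg1 tags plus O are modelled, and the emission scores are made-up numbers.

```python
import numpy as np

TAGS = ["O", "B_Arg1", "I_Arg1", "E_Arg1", "S_Arg1"]
NEG = -1e9  # score used to forbid a transition


def allowed(prev, nxt):
    """A span opened by B must continue with I or E; otherwise start fresh."""
    if prev.startswith(("B_", "I_")):
        return nxt.startswith(("I_", "E_"))
    return nxt == "O" or nxt.startswith(("B_", "S_"))


def build_transitions():
    T = np.zeros((len(TAGS), len(TAGS)))
    for i, p in enumerate(TAGS):
        for j, q in enumerate(TAGS):
            if not allowed(p, q):
                T[i, j] = NEG
    return T


def viterbi(emissions, T):
    """emissions: (n, num_tags) per-token scores. Returns the best legal tag path."""
    n, k = emissions.shape
    start = np.array([0.0 if allowed("O", t) else NEG for t in TAGS])
    end = np.array([0.0 if allowed(t, "O") else NEG for t in TAGS])
    score = start + emissions[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + T          # cand[i, j]: best path ending in i, then j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    score = score + end                    # a sentence may not end inside a span
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return [TAGS[i] for i in reversed(best)]


# Made-up emission scores for "a flexible press plate".
emissions = np.array([
    [0.0, 2.0, 0.0, 0.0, 0.0],   # a        -> B_Arg1
    [0.0, 0.0, 2.0, 0.0, 0.0],   # flexible -> I_Arg1
    [0.0, 0.0, 1.0, 1.2, 0.0],   # press    -> E_Arg1 slightly preferred locally
    [0.0, 0.0, 0.0, 2.0, 0.0],   # plate    -> E_Arg1
])

greedy = [TAGS[i] for i in emissions.argmax(axis=1)]
decoded = viterbi(emissions, build_transitions())
print(greedy)   # ['B_Arg1', 'I_Arg1', 'E_Arg1', 'E_Arg1']
print(decoded)  # ['B_Arg1', 'I_Arg1', 'I_Arg1', 'E_Arg1']
```

Picking the best tag per token independently yields an illegal E_Arg1 after E_Arg1, whereas decoding over whole sequences recovers the legal labeling from the example in the text.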

Experiments
4.1. Dataset. We extracted 1309 claims from USPTO patent documents, and thirty undergraduates annotated the claims over about 2 months. The constructed dataset contains 29850 sentences, of which 60% are used for training, 20% for validation, and 20% for testing. For argument1 and argument2, we use the BIEOS labeling scheme, which is also suitable for relation phrase labels. There are three relationships in the whole labeled dataset, and each relationship is labeled with a single tag "S" or with two or more tags, "BE" or "BIE." The statistics of all labels are shown in Table 1. Finally, we evaluate our patent ORE model on this dataset. The results are measured by Precision (P), Recall (R), and F1-score, which are defined in Table 2.
4.2. Hyperparameters Setting. We implement our model with Python 3.5 in Keras on an NVIDIA Quadro P2000. The Adam method is used to optimize our model; the learning rate is set to 0.01, and the batch size is 50. For multihead attention, we set the number of attention heads to 4. We use GloVe as the word embedding model, and the dimension of the word vectors is set to 300. The part-of-speech embedding is a one-hot vector of size 26, and the relation label embedding size is set to 12. The dropout rate is set to 0.1 to prevent overfitting, and L2 regularization is also employed in training to
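The BIEOS scheme and the postprocessing of predicted tags into triples can be sketched as follows. The example sentence, the label names `Rel` and `Arg2`, the tag format `B_Arg1`, and both helper functions are assumptions made for illustration; only the Arg1 tags for "a flexible press plate" follow an example given in the paper.

```python
def encode_bieos(n_tokens, spans):
    """spans: list of (start, end_exclusive, label). Returns one tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S_{label}"            # single-token phrase: "S"
        else:
            tags[start] = f"B_{label}"            # two tokens give "BE",
            for i in range(start + 1, end - 1):   # longer phrases give "BI...E"
                tags[i] = f"I_{label}"
            tags[end - 1] = f"E_{label}"
    return tags


def decode_bieos(tokens, tags):
    """Collect well-formed phrases per label; malformed spans are dropped."""
    phrases, current, label = {}, [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            current, label = [], None
            continue
        prefix, lab = tag.split("_", 1)
        if prefix == "S":
            phrases.setdefault(lab, []).append(tok)
            current, label = [], None
        elif prefix == "B":
            current, label = [tok], lab
        elif prefix in ("I", "E") and label == lab:
            current.append(tok)
            if prefix == "E":
                phrases.setdefault(lab, []).append(" ".join(current))
                current, label = [], None
        else:
            current, label = [], None             # I/E without a matching B
    return phrases


tokens = "a flexible press plate is connected to the frame".split()
spans = [(0, 4, "Arg1"), (4, 7, "Rel"), (7, 9, "Arg2")]

tags = encode_bieos(len(tokens), spans)
phrases = decode_bieos(tokens, tags)
triple = (phrases["Arg1"][0], phrases["Rel"][0], phrases["Arg2"][0])
print(tags)
print(triple)  # ('a flexible press plate', 'is connected to', 'the frame')
```

Encoding is what an annotation pipeline would produce for training, and decoding is one plausible form of the postprocessing step that turns predicted tags back into Argument1-relation-Argument2 triples.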
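For reference, the reported settings can be gathered into a single configuration object. The class name, field names, and the derived `input_feature_dim` property are hypothetical; the values are those stated above, and the L2 regularization strength is omitted because it is not given in this text. The derived width assumes the three embeddings are concatenated along the last dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ORETrainingConfig:
    """Hyperparameters as reported in the text (field names are hypothetical)."""
    optimizer: str = "adam"
    learning_rate: float = 0.01
    batch_size: int = 50
    num_attention_heads: int = 4
    word_embedding_dim: int = 300        # GloVe word vectors
    pos_embedding_dim: int = 26          # one-hot part-of-speech vector
    relation_label_embedding_dim: int = 12
    dropout_rate: float = 0.1

    @property
    def input_feature_dim(self) -> int:
        # Assumes word, POS, and relation label embeddings are concatenated.
        return (self.word_embedding_dim
                + self.pos_embedding_dim
                + self.relation_label_embedding_dim)


config = ORETrainingConfig()
print(config.input_feature_dim)  # 338
```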

Experiments and Discussion.
In our model, the label embedding of our hybrid neural network consists of word embedding, part-of-speech embedding, and relation tagging embedding. More feature information is incorporated into the embedding layer through concatenation along the last dimension for each word. The attention mechanism used in our model is multihead attention, and the attention layer is followed by the CNN layer. From the series of experiments in Table 4, we conclude that the hybrid neural network model performs better than a traditional neural network model such as Bi-LSTM (compare model 1 and model 2 in Table 4), and that neural network models with label embedding perform better than those without it (compare model 3 and model 1), in terms of Precision, Recall, and F1 score. In comparison with the other neural models, our model with multihead attention also performs better.

Conclusion and Future Work.
In this paper, we propose a Patent Open Relation Extraction neural model. Instead of employing feature engineering, we use a hybrid Bi-LSTM+CNN+CRF neural model with a multihead attention mechanism. The hybrid model clearly outperforms the individual models on our self-constructed patent sequence tagging dataset. In the future, we will consider incorporating a Transformer-based model, such as Bidirectional Encoder Representations from Transformers (BERT), into our model, and we will also consider patent-domain word embeddings, which we think could further improve performance.

Data Availability
The dataset used to support the findings of this study has not been made available because it also forms part of an ongoing study.