A Deep Paraphrase Identification Model Interacting Semantics with Syntax

Paraphrase identification is central to many natural language applications. Based on the insight that a successful paraphrase identification model needs to adequately capture the semantics of the language objects as well as their interactions, we present a deep paraphrase identification model interacting semantics with syntax (DPIM-ISS) for paraphrase identification. DPIM-ISS introduces the linguistic features manifested in syntactic features to produce more explicit structures and encodes the semantic representation of sentences on different syntactic structures by means of interacting semantics with syntax. Then, DPIM-ISS learns the paraphrase pattern from this representation interacting semantics with syntax by exploiting a convolutional neural network with a convolution-pooling structure. Experiments are conducted on the Microsoft Research Paraphrase corpus (MSRP), the PAN 2010 corpus, and the PAN 2012 corpus for paraphrase plagiarism detection. The experimental results demonstrate that DPIM-ISS outperforms the classical word-matching approaches, the syntax-similarity approaches, the convolutional neural network-based models, and some deep paraphrase identification models.


Introduction
The goal of paraphrase identification is to determine whether two texts have the same meaning [1]. It focuses on how best to model the semantics of sentences [2]. Paraphrase identification is one of the most basic problems in many applications of natural language processing, such as machine translation [3], question answering [4], plagiarism detection [5,6], and document retrieval [7].
Although paraphrase identification is commonly defined in semantic terms [2], early methods for paraphrase identification were usually based on word (or word n-gram) matching or on vector similarity in the word space, without considering the semantics of words or sentences. The bag-of-words model [8], the n-gram model [9], the TF-IDF (term frequency-inverse document frequency) model [10], and so on were commonly applied to represent the text, and then text similarity measures (such as edit distance, longest common substring, Jaccard coefficient, and cosine distance) were exploited to measure the degree of paraphrase between the two texts.
However, paraphrasing is usually done by word replacement with synonyms/antonyms, syntactic modification, sentence reduction, combination, reorganization, word shuffling, concept generalization, and specification, changing the appearance of the original text while retaining the semantics of the source sentence [11]. This makes it difficult for the above methods to further improve performance using only word matching or vector similarity in the word space. Syntactic feature-based methods, another line of work that does not consider semantics, have also been used in paraphrase identification [11][12][13], especially in cross-language paraphrase identification [14]. These studies assume that similar texts have similar syntactic structures [12,15].
That is, if two sentences describe the same thing, they are likely to have similar syntactic structures [16]. However, simply relying on the similarity of syntactic structures without regard to semantics cannot solve the problem of "the same semantics but different syntactic structures" [17].
In recent years, paraphrase identification models have shifted from traditional models to deep models [18]. A variety of deep models have been introduced into the research field of paraphrase identification [19][20][21][22][23][24]. These models utilize the distributed representation of text and focus on identifying paraphrases by learning matching structures and matching degrees.
Beyond the widely adopted distributed semantic representation in deep paraphrase identification models, researchers have also paid attention to the role of syntax in representing text and computing semantic similarity, and have proposed some deep paraphrase identification models that integrate syntax [16,25].
These studies demonstrated the validity of syntactic features in deep paraphrase identification.
Goldberg noted that linguistic features providing more explicit general concepts can be very valuable [26]. Hu et al. proposed that a successful sentence-matching algorithm needs to capture not only the internal structures of sentences but also the rich patterns in their interactions [21]. We deem that the linguistic features manifested in syntactic features can produce more explicit structures for the representation of sentences, and that modeling the semantics on these syntactic features by means of the interaction of semantics with syntax can better represent the sentences and help to identify paraphrases.
Based on this, we propose a novel deep paraphrase identification model interacting semantics with syntax, denoted as DPIM-ISS. DPIM-ISS represents a sentence as a semantic vector on syntactic features and characterizes the syntactic role of the semantics of words or phrases by interacting semantic and syntactic information. Exploiting this representation, DPIM-ISS models the semantic representation on syntactic features explicitly and permits the model to learn the paraphrase pattern from the semantics on different linguistic features.
The experimental results show that the proposed model outperforms the traditional word-matching approaches, the syntax-similarity approaches, the distributed-representations-of-sentences-based models, the CNN-based models, and several deep models for paraphrase identification. The contributions of this paper can be summarized as follows: (i) the idea of modeling the semantic representation of a sentence on different syntactic structures by means of interacting semantics with syntax; (ii) a new deep architecture, namely DPIM-ISS, that exploits the sentence representation interacting semantics with syntax for paraphrase identification; (iii) experiments on three datasets (i.e., MSRP, PAN 2010, and PAN 2012) that show the benefits of our model. The remainder of this paper is organized as follows: Section 2 analyzes the issues of paraphrase identification. Section 3 introduces the details of DPIM-ISS. The experimental results are reported in Section 4. Section 5 discusses the related work. Section 6 concludes our work.

Analysis of Paraphrase Identification
Taking the data of MSRP and PAN (detailed statistics of the two datasets can be found in Section 4.1) as examples, we investigate the semantic similarity of sentences from the aspects of lexical similarity and syntactic similarity to characterize paraphrase.

Paraphrase Sentences with High Lexical Similarity.
From the perspective of word matching, two sentences are likely to be paraphrases if they use the same or similar words. We randomly selected 1000 pairs of paraphrase sentences and 1000 pairs of nonparaphrase sentences from the MSRP dataset and compared their lexical similarity using the Jaccard coefficient, as Figure 1 shows. Figure 1 reveals that when the Jaccard coefficient is higher than 0.6, most of the sentence pairs are paraphrases, while when it is lower than 0.25, most of the sentence pairs are nonparaphrases.
Analyzing the examples of paraphrase sentences, we find that if a paraphrase rewrites the source sentence by simple duplication, the syntactic structures of the two sentences are the same or similar; if the paraphrase rewrites the source by text manipulation such as adjusting word order or modifying the syntactic structure, the syntactic structures of the two sentences will differ, but the words remain the same or similar. This shows that word matching is still valuable in the paraphrase identification task. When the Jaccard coefficient is between 0.25 and 0.6, however, it is difficult to distinguish paraphrase from nonparaphrase.

Paraphrase Sentences with the Same (Similar) Syntactic Structures but Different Words.
From the view of the syntactic structure, some paraphrase sentences have the same or similar syntactic structures but different words. Figure 2 gives a pair of paraphrase sentences from PAN 2012 with low lexical similarity but high syntactic similarity. Figure 2 exemplifies a lexical paraphrase, where underlined words are replaced with synonyms, and short phrases or words are inserted to change the appearance of the text. Although much of the text is changed, the paraphrase retains the semantics of the source. This is a common type of case in paraphrase identification. The higher the degree of paraphrase, the more difficult it is to identify the paraphrase by word matching alone.
If word matching is not considered and only the syntactic features are exploited, such pairs of paraphrase sentences are more similar in syntactic structure. Figure 3 compares the Jaccard coefficients of syntactic features computed from 1000 pairs of paraphrase and nonparaphrase sentences randomly selected from the training dataset of MSRP. The x-axis records the Jaccard coefficients, and the y-axis is the number of samples. The statistics in Figure 3 show that the number of paraphrase sentence pairs grows significantly higher than that of nonparaphrase sentence pairs as the similarity of the syntactic feature sequence increases. For example, when the Jaccard coefficients of the sentence pairs are between 0.8 and 0.9, there are 137 pairs of paraphrase sentences and only 28 pairs of nonparaphrase sentences. Therefore, the similarity of the syntactic structure is useful for the task of paraphrase identification.

Nonparaphrase Sentences with Similar Words and Similar Syntax Structures.
Figure 4 describes an example of nonparaphrase sentences with similar words (black part) and similar syntax structures (see the dependency parsing trees corresponding to the two sentences). In this example, $S_1$ and $S_2$ share a large number of the same words. Without regard to the semantics, the two sentences would be recognized as paraphrases due to the high level of word matching.
Similarly, $S_1$ and $S_2$ could be identified as paraphrases since they have essentially the same syntactic structures. However, if we compare the semantics of the words defined on the dependency tree, we find that the semantics of the verbs "appeared" and "surrendered" are completely different, which leads to the semantic difference between the two sentences.

Different Words and Different Syntax, but the Same Semantics.
Figure 5 shows an example from the MSRP corpus with different words and different syntactic structures, but the same semantics.
In Figure 5, there are few identical words between the two paraphrase sentences, and the syntactic structures vary considerably. However, if we map the semantics of words to the substructures expressed by the dependency trees of the sentences and compare the semantics of words in the syntactic substructures, such as "refused" and "denied" on VBD, the semantic similarity of the two sentences can be found.
A sentence written in natural language is not a simple collection of words but text with a syntactic structure governed by grammar.
There exist correlations between semantics and syntax: when we need to convey and express a message in a proper way, the semantics and syntax of the sentence work together, which encourages us to interact syntax and semantics in paraphrase identification to boost performance.

Deep Paraphrase Identification Model Interacting Semantics with Syntax
The architecture of the deep paraphrase identification model interacting semantics with syntax (DPIM-ISS) contains two components: the sentence representation interacting semantics with syntax and the extraction of the matching pattern based on a convolutional neural network. In this section, we introduce DPIM-ISS in detail.

Overview of DPIM-ISS.
Paraphrase identification is usually formalized as a binary classification task [29]: given two sentences $(s_k, s_p)$, the paraphrase identification model $M$ determines whether they roughly have the same meaning. We propose DPIM-ISS to learn $M$, as shown in Figure 6.
In the architecture of DPIM-ISS illustrated in Figure 6, the model contains two main parts: (1) the sentence representation interacting semantics with syntax and (2) the extraction of the paraphrase matching pattern based on a convolutional neural network. In what follows, we describe these components in detail.

The Sentence Representation Interacting Semantics with Syntax.
In recent years, the tensor has attracted much attention due to its ability to model the interaction between objects. For example, Socher et al. proposed a neural tensor network to model the interaction of two entities [30], and Qiu et al. used tensors to model the interaction between questions and answers in the task of community question answering [31]. In the study of Yu et al., the idea of the tensor was exploited to model the interaction between semantic information and structural information [32]. The motivation of these methods is to use the tensor as a tool to capture the interaction between different features. Inspired by these studies, DPIM-ISS uses the tensor to interact the semantics and the syntactic structures to model the sentence representation. Figure 7 gives a detailed example.
[Figure 2: a source passage from PAN 2012 beginning "In the early 1800's doc bowles built the first hotel, a three story frame building. ..." and its paraphrased version, annotated word by word with insertions, substitutions, and synonym replacements.]
Let $e^{(k)}_{w_i}$ denote the semantic feature vector of $w^{(k)}_i$, represented as a word embedding, and let $g^{(k)}_{w_i}$ be the syntax feature vector of $w^{(k)}_i$ that provides the syntax role of $w^{(k)}_i$. DPIM-ISS uses the tensor product $\otimes$ of $e^{(k)}_{w_i}$ and $g^{(k)}_{w_i}$ to project the structure interacting semantics with syntax for the word $w^{(k)}_i$, represented by the notation $x^{(k)}_{w_i}$:

$$x^{(k)}_{w_i} = e^{(k)}_{w_i} \otimes g^{(k)}_{w_i}. \quad (1)$$

Let $g^{(1)}, \ldots, g^{(m)}, \ldots, g^{(M)}$ denote a predefined syntax feature vector template of size $M$. Each $g^{(m)}$ represents a fixed syntax feature such as the subject (a syntactic component), the noun (a part of speech), and so on.
$g^{(k)}_{w_i}$ is a binary vector representing the categorical variables: all entries are mapped to zero except those corresponding to the syntactic features that $w^{(k)}_i$ has. We use syntactic parsing to obtain the syntax feature vector $g^{(k)}_{w_i}$: $g^{(m)} = 1$ indicates that $w^{(k)}_i$ has the $m$-th syntactic feature; on the contrary, $g^{(m)} = 0$ indicates that $w^{(k)}_i$ does not act in the role of the $m$-th syntactic feature.
Using the semantic feature vector $e^{(k)}_{w_i}$ and the syntactic feature vector $g^{(k)}_{w_i}$, we then generate the word embedding representation $x^{(k)}_{w_i}$ interacting the semantics with syntax using equation (1). Each $x^{(k)}_{w_i}$ is a two-dimensional matrix, shown on the left of Figure 7. Furthermore, if each word in the sentence $s_k$ is represented using the word embedding interacting semantics with syntax, we obtain a three-dimensional tensor $M_{s_k}$ representing the sentence $s_k$, shown on the right of Figure 7.
$M_{s_k}$ has three dimensions: the words $w$ in the sentence $s$, the semantic feature vector $e$, and the syntactic feature vector $g$. DPIM-ISS captures the interactions between semantic features and syntactic features using the tensor product, depicts the semantics of words on their syntactic roles, and decomposes the sentence into syntactic subsections with semantics.
In order to obtain the expression of a sentence, we sum the word embeddings interacting semantics with syntax to map $M_{s_k}$ into a two-dimensional space of semantic and syntactic dimensions, as shown in equation (2). We then obtain the representation $T^{(k)}$ of the sentence $s_k$, called the sentence representation interacting semantics with syntax in this paper:

$$T^{(k)} = \sum_{i} x^{(k)}_{w_i}. \quad (2)$$

Furthermore, given two sentences $s_k$ and $s_p$, we represent the interaction between them as $A_{k,p}$, defined in equation (3). Then, the feature representation $A_{k,p}$ is further fed to a convolutional neural network to extract the paraphrase matching pattern.
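To make equations (1) and (2) concrete, here is a minimal NumPy sketch (dimensions and inputs are illustrative, not the paper's settings): the outer product plays the role of the tensor product $\otimes$, and summing the per-word matrices yields $T^{(k)}$.

```python
import numpy as np

# Hypothetical sizes: M syntactic features, n-dimensional word embeddings.
M, n = 30, 300

def interact(e, g):
    """Equation (1): tensor (outer) product of the semantic vector e (n,)
    and the binary syntax vector g (M,) -> an M x n matrix x."""
    return np.outer(g, e)

def sentence_representation(embeddings, syntax_vectors):
    """Equation (2): sum the per-word interaction matrices to obtain
    the M x n sentence representation T."""
    return sum(interact(e, g) for e, g in zip(embeddings, syntax_vectors))

# Toy sentence of 5 words with random embeddings and random syntax roles.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=n) for _ in range(5)]
syntax_vectors = []
for _ in range(5):
    g = np.zeros(M)
    g[rng.integers(M)] = 1.0  # each word marked with one syntactic role here
    syntax_vectors.append(g)

T = sentence_representation(embeddings, syntax_vectors)
print(T.shape)  # (30, 300): syntactic dimension x semantic dimension
```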

Extracting the Paraphrase Matching Pattern Based on Convolutional Neural Network.
Convolutional neural networks have been applied to learn effective feature representations in a number of language tasks in recent years. In DPIM-ISS, we use a convolutional neural network to extract the features of paraphrase matching. Then, the extracted features are fed into a multilayer perceptron classifier to identify the paraphrase.

Convolutional Layer.
We use wide one-dimensional convolution [33], proposed by Kalchbrenner et al., to define the convolution kernels that extract features from $A_{k,p}$ for paraphrase identification. In DPIM-ISS, $A_{k,p}$ is the interacting representation between the two sentences; it is an $m \times n$ matrix, where $m$ is the number of syntactic features and $n$ is the dimension of the semantic features. The convolution layer exploits $U$ convolution kernels of size $1 \times n$, and each convolution kernel contains two parameters: $W$ and $b$, where $W = [w_1, \ldots, w_n]$ is the feature weight vector of the kernel and $b$ is its bias. A convolution kernel performs the convolution operation on the interaction matrix $A_{k,p}$ to get an $m \times 1$ vector $V_u$, which represents the expression of a semantic feature on a syntactic feature. $V_u$ is defined as follows:

$$V_u = \mathrm{Cov}(W, A_{k,p}) + b, \quad (4)$$

where $\mathrm{Cov}(W, A_{k,p})$ denotes the convolution operation on $A_{k,p}$ using parameter $W$:

$$\mathrm{Cov}(W, A_{k,p})(j) = \sum_{i=1}^{n} w_i \cdot A_{k,p}(j, i), \quad j = 1, \ldots, m. \quad (5)$$

The convolution operation explores the $U$ convolution kernels to produce a matrix $A'_{k,p} \in \mathbb{R}^{U \times m \times 1}$, composed as

$$A'_{k,p} = [V_1, V_2, \ldots, V_U], \quad (6)$$

where $m$ is the number of syntactic features and $U$ is the number of convolution kernels.
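The following NumPy sketch illustrates equations (4)-(6) under the reconstruction above (the kernel count and dimensions are assumptions): each $1 \times n$ kernel collapses the semantic dimension of $A_{k,p}$ row by row, producing one $m$-vector per kernel.

```python
import numpy as np

def conv_layer(A, kernels, biases):
    """Each kernel W (n,) is applied to the m x n matrix A row-wise:
    V_u(j) = sum_i W[i] * A[j, i] + b  -> an m-vector per kernel (eqs (4)-(5)).
    Stacking the U outputs gives A' of shape (U, m) (eq. (6))."""
    return np.stack([A @ W + b for W, b in zip(kernels, biases)])

m, n, U = 30, 300, 64           # syntactic features, semantic dims, kernels (hypothetical)
rng = np.random.default_rng(1)
A = rng.normal(size=(m, n))     # interaction matrix A_{k,p}
kernels = rng.normal(size=(U, n)) * 0.01
biases = np.zeros(U)

A_prime = conv_layer(A, kernels, biases)
print(A_prime.shape)  # (64, 30): one m-vector V_u per kernel
```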

Max Pooling.
The outputs of the convolutional layer are passed to the pooling layer, which extracts the top $k$ values from each dimension of $A'_{k,p}$ to reduce the number of features. On each column of $A'_{k,p}$, we set the size of the nonoverlapping pooling window to $w$. The $k$ features with the highest values are extracted from each window, and the matrix $A''_{k,p} \in \mathbb{R}^{U \times (m/w) \times 1}$, made up of $k \cdot m/w$ values per kernel, is generated as follows:

$$A''_{k,p} = [V''_1, V''_2, \ldots, V''_U], \quad (7)$$

where each $V''_u$ is defined as follows:

$$V''_u(j) = \text{top-}k\big(V_u((j-1)w + 1), \ldots, V_u(jw)\big), \quad j = 1, \ldots, m/w. \quad (8)$$

Then, the resulting features of $A''_{k,p}$ obtained by max pooling are combined to form a $k \times m/w$-dimensional vector $Z$.
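A small sketch of this pooling step, assuming $w$ divides $m$ evenly (the window size and $k$ are illustrative):

```python
import numpy as np

def top_k_pool(V, w, k):
    """Split the m-vector V into non-overlapping windows of size w and
    keep the k largest values from each window (eqs (7)-(8))."""
    windows = V.reshape(-1, w)              # (m/w, w)
    top = np.sort(windows, axis=1)[:, -k:]  # k largest values per window
    return top.reshape(-1)                  # (k * m/w,)

rng = np.random.default_rng(2)
U, m, w, k = 64, 30, 5, 2
A_prime = rng.normal(size=(U, m))

# Pool each kernel's output, then concatenate into the feature vector Z.
Z = np.concatenate([top_k_pool(V, w, k) for V in A_prime])
print(Z.shape)  # (U * k * m/w,) = (64 * 2 * 6,) = (768,)
```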

Further Enhancements.
Madnani et al. showed that machine translation (MT) metrics significantly boost the performance of paraphrase identification [6]. For each pair of sentences, we construct a vector $L$ of lexical similarity features from the METEOR automatic MT evaluation metric, including precision, recall, F1, Fmean, penalty, and the METEOR score [34]. We refer to this vector as the lexical features and incorporate it into the proposed DPIM-ISS by appending it to the vector $Z$. We conducted several experiments both with and without these features, which are discussed below.

Identifying Paraphrase.
We pass $Z$ together with $L$ to a two-layer perceptron, as shown in equation (9):

$$[p_0, p_1] = \delta_2\big(W_2\, \delta_1\big(W_1 [Z; L] + b_1\big) + b_2\big), \quad (9)$$

where $p_0$ and $p_1$ indicate the identification results, $W_i$ and $b_i$ are the weight matrix and the bias of the $i$-th layer of the perceptron, respectively, and $\delta_1$ is the ReLU activation function [35], defined as follows:

$$\delta_1(x) = \max(0, x), \quad (10)$$

and $\delta_2$ is the SoftMax function that outputs the value of $p_k$:

$$p_k = \frac{e^{a_k}}{\sum_{j} e^{a_j}}, \quad (11)$$

where $a_k$ is the output value after the ReLU activation function in the last layer.
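For illustration, a NumPy sketch of the classifier in equations (9)-(11); the hidden size and feature dimensions are placeholders:

```python
import numpy as np

def relu(x):                     # delta_1, equation (10)
    return np.maximum(0.0, x)

def softmax(a):                  # delta_2, equation (11)
    e = np.exp(a - a.max())
    return e / e.sum()

def classify(Z, L, W1, b1, W2, b2):
    """Equation (9): feed [Z; L] through a two-layer perceptron and
    return (p0, p1), the non-paraphrase / paraphrase probabilities."""
    h = relu(W1 @ np.concatenate([Z, L]) + b1)
    return softmax(W2 @ h + b2)

rng = np.random.default_rng(3)
Z = rng.normal(size=768)                 # pooled CNN features
L = rng.normal(size=6)                   # six METEOR-based lexical features
W1, b1 = rng.normal(size=(128, 774)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(2, 128)) * 0.01, np.zeros(2)

p0, p1 = classify(Z, L, W1, b1, W2, b2)
print(f"p(non-paraphrase)={p0:.3f}, p(paraphrase)={p1:.3f}")
```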

Training the Model.
During the training phase, the parameters of DPIM-ISS are updated with respect to a cross-entropy loss between the predicted results and the ground truth, and regularization is adopted to avoid overfitting. The loss function is defined as follows:

$$\mathcal{L} = -\sum_{i} \big[y^{(i)} \log p_1^{(i)} + (1 - y^{(i)}) \log p_0^{(i)}\big] + \lambda \big(\|W_1\|^2 + \|W_2\|^2\big), \quad (12)$$

where $y^{(i)}$ is the label of the $i$-th training example, $\lambda$ is the regularization coefficient, and $W_1$ and $W_2$ are the parameters of the two-layer perceptron. To train the model, we use the backpropagation algorithm [36] with the Adam update rule [37]. The parameters are updated as follows:

$$W_t = W_{t-1} - \eta \frac{m_t}{\sqrt{v_t} + \epsilon}, \quad (13)$$

where $t$ is the current timestep, $W_t$ is the weight at the $t$-th timestep, $W_{t-1}$ is the weight from the last round of training, $\eta$ is the learning rate, $\epsilon$ is a smoothing parameter, and $m_t$ and $v_t$ are the bias-corrected moment estimates that control the direction of the gradient. We set $\eta = 1e{-}4$ and $\epsilon = 1e{-}08$. The whole sentence representation interacting semantics with syntax and the training process are detailed in Appendix A.
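A hedged PyTorch sketch of this training objective, with a placeholder network standing in for the full model; the value of $\lambda$ is an assumption, while $\eta$ and $\epsilon$ follow the stated settings:

```python
import torch
import torch.nn as nn

# Placeholder classifier standing in for the MLP on top of the CNN features.
model = nn.Sequential(nn.Linear(774, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8)
criterion = nn.CrossEntropyLoss()
lam = 1e-4  # regularization coefficient lambda (assumed value)

def train_step(features, labels):
    """One update: cross-entropy loss plus an L2 penalty on the linear
    layers' weights (eq. (12)), parameters updated with Adam (eq. (13))."""
    optimizer.zero_grad()
    logits = model(features)
    l2 = sum((p ** 2).sum() for p in model.parameters() if p.dim() > 1)
    loss = criterion(logits, labels) + lam * l2
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: 8 feature vectors (CNN features + lexical features) with labels.
x = torch.randn(8, 774)
y = torch.randint(0, 2, (8,))
print(train_step(x, y))
```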

Datasets.
We conduct our experiments on three datasets: the Microsoft Research Paraphrase corpus (MSRP) [27], PAN 2010 [6], and PAN 2012 [28]. MSRP is a classical dataset for paraphrase identification developed by Microsoft, and the latter two datasets are constructed from the data of the 2010 and 2012 editions of the Uncovering Plagiarism, Authorship and Social Software Misuse (PAN) shared task.

MSRP.
The MSRP corpus is a well-known corpus for paraphrase identification. MSRP was created by mining news articles on the web and then extracting paraphrase sentences from 9,516,684 sentences in 32,408 news clusters using a semiautomatic method. It contains 5,801 sentence pairs, split into 4,076 training pairs (2,753 paraphrase, 1,323 not) and 1,725 test pairs (1,147 paraphrase, 578 not).

PAN 2010.
Madnani and Tetreault used the human-created plagiarism instances in the test collection of the PAN 2010 plagiarism detection competition to create the PAN 2010 paraphrase sentence corpus. They utilized bag-of-words overlap and length ratios to generate the pairs of paraphrase sentences and selected sentence pairs from the same document that had at least 4 words in common as the pairs of nonparaphrase sentences. Then, they sampled randomly from both the positive and negative instances to create a training set of 10,000 sentence pairs and a test set of 3,000 sentence pairs.

PAN 2012.
We constructed the PAN 2012 paraphrase sentence pair dataset using the training and test data of the PAN 2012 paraphrase plagiarism detection corpus. Let $d_{plg}$ and $d_{src}$ denote the plagiarized document and its source document, and let $(s, r)$ be a pair of plagiarism texts annotated by PAN ($s \in d_{plg}$, $r \in d_{src}$). Let $s_i \in s$ be a sentence of $s$, let $r_j \in d_{src}$ denote a sentence of $d_{src}$, and let $T = \{y, (s_i, r_j)\}$ represent the training dataset. $y$ and $r_j$ are defined as follows:

$$y = 1, \quad r_j = \arg\max_{r_j \in r} \cos(s_i, r_j),$$
$$y = 0, \quad r_j = \arg\max_{r_j \in d_{src},\, r_j \notin r} \cos(s_i, r_j),$$

where $\cos(s_i, r_j)$ is the cosine similarity of $s_i$ and $r_j$. Using this method, we obtained 341,426 and 50,114 pairs of paraphrase sentences from the artificial-high-obfuscation subcorpora of the PAN 2012 training and test corpus, respectively. Then, 15,932 training pairs and 7,966 test pairs whose length ratios were more than 50% were randomly selected to generate our PAN 2012 paraphrase sentence pair dataset. The statistics of the three datasets are described in Table 1.
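A sketch of this pairing rule, using TF-IDF cosine similarity as an assumed implementation of $\cos(s_i, r_j)$ (the sentence lists are toy inputs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_pairs(plagiarized_sents, annotated_src_sents, other_src_sents):
    """For each plagiarized sentence s_i: the most similar sentence inside the
    annotated source passage r becomes a positive pair (y=1); the most similar
    sentence elsewhere in the source document becomes a negative pair (y=0)."""
    vec = TfidfVectorizer().fit(plagiarized_sents + annotated_src_sents + other_src_sents)
    S = vec.transform(plagiarized_sents)
    R = vec.transform(annotated_src_sents)
    D = vec.transform(other_src_sents)
    pairs = []
    for i, s in enumerate(plagiarized_sents):
        pos = cosine_similarity(S[i], R).argmax()
        neg = cosine_similarity(S[i], D).argmax()
        pairs.append((1, s, annotated_src_sents[pos]))
        pairs.append((0, s, other_src_sents[neg]))
    return pairs

pairs = build_pairs(
    ["The committee denied the request."],
    ["The board refused the petition.", "A new hotel was built."],
    ["The weather was mild that spring.", "Tourists came to the springs."],
)
print(pairs)
```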

Baselines.
We evaluate the effectiveness of our model against several baseline methods, including the traditional word-matching approaches, the syntax-similarity approaches, the distributed-representations-of-sentences-based models, and the CNN-based models. We also select multiple deep paraphrase identification models as baselines. We give a detailed description of these baselines as follows: (1) Word-Matching Approaches. We select four typical word-matching approaches as baselines.

Jaccard.
The Jaccard method first calculates the Jaccard coefficient of the two sentences and then selects the pairs whose Jaccard coefficients are greater than a threshold t as paraphrase sentence pairs. In the experiments, we varied t from 0.0 to 1.0 with an incremental step of 0.01. We selected the parameter t on the training corpus by optimizing accuracy; the corresponding t was then applied to the test corpus. On the MSRP dataset, t = 0.34; on PAN 2010, t = 0.24; on PAN 2012, t = 0.27.
Cosine. Similar to the Jaccard method, the cosine method calculates the similarity of the two sentences using the cosine distance and uses a threshold t to decide the paraphrase sentence pairs. On the MSRP dataset, t = 0.28; on PAN 2010, t = 0.34; on PAN 2012, t = 0.20.
METEOR. We applied the six METEOR evaluation metrics as features to learn a classifier using logistic regression (in DPIM-ISS, these lexical features are integrated into the extracted features interacting semantics with syntax). All parameters were obtained on the training data by optimizing F1.
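For concreteness, a minimal version of the Jaccard baseline with its threshold sweep (toy data; the real sweep runs over the training corpus):

```python
def jaccard(s1, s2):
    """Jaccard coefficient over the word sets of two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def best_threshold(pairs, labels, step=0.01):
    """Sweep t in [0, 1] and keep the value maximizing training accuracy."""
    sims = [jaccard(s1, s2) for s1, s2 in pairs]
    best_t, best_acc = 0.0, 0.0
    t = 0.0
    while t <= 1.0:
        acc = sum((sim > t) == bool(y) for sim, y in zip(sims, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
        t += step
    return best_t

pairs = [("the cat sat", "the cat sat down"), ("a b c", "x y z")]
print(best_threshold(pairs, [1, 0]))
```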
(2) Syntax-Similarity-Based Approaches (Syntax-sim). For syntactic similarity, we refer to the method proposed in [11], denoted as Syntax-sim (syntax similarity). In Syntax-sim, we consider the text as a string of syntactic sequences derived from Stanford POS tagging instead of the actual words and utilize the Jaccard coefficient to compute the similarity of the syntactic sequences for the final decision.

(3) Distributed-Representations-of-Sentences-Based Model (Paragraph Vector).
In our DPIM-ISS, we focus on the distributed representation of sentences. Thus, we select a distributed-representations-of-sentences-based model, the paragraph vector proposed in [38], as a baseline for comparison. The paragraph vector uses an unsupervised algorithm to learn sentence representations. We utilized the gensim toolkit to learn the sentence vectors and applied the cosine distance to compute the similarity of the two sentences. The parameter settings are as follows: the size of the context window is 5, the lowest word frequency is 5, the learning rate is 0.025, and the dimension of the sentence vector is 300.
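A gensim sketch of this paragraph-vector baseline with the stated settings (the two-sentence corpus and min_count=1 are toy simplifications; the paper uses a minimum word frequency of 5):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

# Toy corpus; real training would use the task's sentence collection.
docs = [TaggedDocument(words=s.lower().split(), tags=[i])
        for i, s in enumerate(["The board refused the petition.",
                               "The committee denied the request."])]
model = Doc2Vec(docs, vector_size=300, window=5, min_count=1, alpha=0.025)

v1 = model.infer_vector("the board refused the petition .".split())
v2 = model.infer_vector("the committee denied the request .".split())
print(1 - cosine(v1, v2))  # cosine similarity of the two sentence vectors
```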
(4) CNN-Based Models (ARC-I). DPIM-ISS exploits a convolutional neural network to extract the paraphrase patterns from the interacting sentence representation. We therefore select a CNN-based paraphrase identification model, ARC-I [21], as a baseline. In the experiment, we reimplemented ARC-I, as no code is publicly available, using the network structure and parameter settings described in the original paper. The word embedding used for ARC-I was the same as for DPIM-ISS (described below). All parameters were obtained on the training data by optimizing F1.
Except for experimental results already reported in the existing literature, all the parameters of the baselines and of DPIM-ISS were tuned to optimize the F1 score on the training corpus, and the best parameter settings were used on the test corpus.

Evaluation Metrics.
Following previous research, the task of paraphrase identification is formalized as a binary classification problem, and the performance is evaluated with accuracy and F1 score:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP is true positive, TN means true negative, FP is false positive, and FN represents false negative. The F1 score is the harmonic mean of precision and recall:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where the precision and recall are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}.$$

Word Embedding.
The word embeddings required by the DPIM-ISS model and ARC-I were learned on the One Billion Word Benchmark Corpus (http://www.statmt.org/lm-benchmark/), which contains nearly one billion words of English text. We chose the CBOW model provided by gensim [40,41] as the learning model. The dimension of the word embedding was set to 300, the size of the context window was set to 5, the lowest word frequency was 5, and the learning rate was 0.0002.
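A gensim sketch matching the stated configuration (CBOW, 300 dimensions, window 5, minimum frequency 5, learning rate 0.0002); the corpus path is a placeholder, and the call uses the gensim 4 API:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# "corpus.txt" is a placeholder for the One Billion Word Benchmark files,
# one tokenized sentence per line.
sentences = LineSentence("corpus.txt")
model = Word2Vec(
    sentences,
    sg=0,            # sg=0 selects CBOW
    vector_size=300, # embedding dimension
    window=5,        # context window size
    min_count=5,     # lowest word frequency
    alpha=0.0002,    # learning rate stated in the paper
)
model.wv.save("embeddings.kv")
```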

Syntactic Features.
We used Stanford's parser (https://nlp.stanford.edu/software/lex-parser.shtml) to obtain the dependency trees of sentences. The parser output describes the syntactic relationships in a sentence in terms of parts of speech and interword dependencies. In our experiment, we preserved only the part-of-speech tags and the word dependency tags. These markers were used as the syntactic features, and we simplified the tags; for example, we simplified the tag nmod:including to nmod.
Then, only 30 syntactic tags were preserved, as shown in Table 2.
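The tag simplification itself is a trivial normalization, e.g.:

```python
def simplify_tag(tag):
    """Drop the subtype of a dependency tag: 'nmod:including' -> 'nmod'."""
    return tag.split(":")[0]

print(simplify_tag("nmod:including"))  # nmod
print(simplify_tag("nsubj"))           # nsubj
```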

Experimental Results and Analysis.
The experimental results are summarized in three parts. In Section 4.3.1, we compare DPIM-ISS to the traditional word-matching approaches, the syntax-similarity approaches, the distributed-representations-of-sentences deep models, and the CNN-based models. We compare the performance of DPIM-ISS with other deep models for paraphrase identification in Section 4.3.2. In Section 4.3.3, we analyze the performance of each substructure in our model. The comparison results are listed in Table 3.
First, we compare the performance of DPIM-ISS with the word-matching approaches. We observe that DPIM-ISS outperforms the Jaccard approach, the Cosine approach, and the METEOR approach on F1 score and accuracy. Comparing DPIM-ISS with METEOR, the experimental results show that DPIM-ISS performs better than the method using only lexical features.
In addition, on the PAN 2010 and PAN 2012 datasets, the METEOR approach, which takes synonym matching into account, scores significantly higher than the other baselines on accuracy and F1. This is closely related to the synonym replacement method used in the construction of the PAN datasets.
Then, we analyze the performance of DPIM-ISS and the syntax-similarity approaches. The experimental results show that DPIM-ISS achieves a significant improvement over the syntax-similarity approach. We also note that the improvement on the MSRP dataset is lower than that on the PAN 2010 and PAN 2012 datasets. Similarly, comparing DPIM-ISS with Sentence2Vector and ARC-I, we found that the performance improvements on MSRP are lower than those on PAN 2010 and PAN 2012. We conclude that this performance gap is attributable to the different construction methods of the MSRP dataset and the PAN datasets.
To analyze the differences in performance, we investigated the three datasets and found two main issues: (1) the syntactic structures in MSRP are more similar than those in the PAN datasets, and (2) compared with MSRP, the word usage in the PAN datasets differs significantly between sentences.
Since the MSRP dataset was constructed from topic-clustered news data, it does not involve deliberate obfuscation, which results in small lexical differences and similar syntactic structures between the two sentences in MSRP. Therefore, DPIM-ISS does not gain much more benefit than the traditional deep learning methods. For the two PAN datasets, the source sentences are paraphrased in order to avoid plagiarism detection: the vocabulary shows significant variation, and the syntactic structure takes on marked differences. By decomposing a sentence's syntactic structure using the dependency tree, we obtain the key substructures of the sentence. The same substructures may be shared by the two sentences simultaneously (such as the predicate verb). Although these substructures present different appearances in terms of words, they may have similar semantics. DPIM-ISS uses the sentence expression interacting semantics with syntax to obtain the semantic expression on the syntactic structures and learns the patterns of paraphrase from these semantic expressions using a CNN. It pays attention to the different roles of semantic matching in different syntactic structures in paraphrase identification and, to a certain extent, solves the issues of different syntactic structures as well as different words.

Comparison with Other Deep Models for Paraphrase Identification.
Based on the MSRP dataset, we compare the performance of DPIM-ISS with other major deep models for paraphrase identification. We choose the MSRP dataset because the results of various deep models for paraphrase identification can be obtained directly from the literature that proposed these models. The data listed in Table 4 come from the experimental results presented in the corresponding literature.
From Table 4, we can see that uRAE and DPIM-ISS, which are both built on syntactic information, perform much better than the other baselines.
Though the best performance of our model (83.55%) is still slightly worse than that of uRAE on F1 score (83.6%) [22], uRAE relies heavily on pretraining on an external large dataset annotated with parse tree information to learn the representation of phrase features for each node in a parse tree. Compared with uRAE, DPIM-ISS only needs to parse the two sentences to be recognized to obtain the syntactic structures, without any additional pretraining.

Model Analysis.
First, we analyze the influence of the lexical features on DPIM-ISS. We remove the lexical features from DPIM-ISS and use the features captured by the convolutional neural network from the interacting sentence expression directly as the input of the MLP to learn the classifier. The model without the lexical features is denoted as DPIM-ISS-L. Table 5 lists the performance comparison between DPIM-ISS-L and DPIM-ISS. The experimental results in Table 5 demonstrate that the lexical features help to improve the performance of paraphrase identification, especially on the PAN 2012 dataset. We attribute this to the fact that the METEOR evaluation measures take synonym replacement into account, which is one of the main construction strategies of the PAN 2012 dataset. However, on the MSRP dataset, there are few changes in the use of words and the syntactic structures, so the additional lexical features do not lead to as significant an improvement on MSRP as on PAN 2012.
For the number of syntactic feature parameters, we compared the performance of 30 syntactic features with that of 67 syntactic features. On the MSRP training corpus, we obtained 0.7119 accuracy and 0.8173 F1 when we used 30 syntactic features (the syntactic features in Table 2). However, when we used 67 syntactic features (the 30 syntactic features in Table 2 plus another 37 syntactic features), we obtained 0.6805 accuracy and 0.8041 F1. We also tried two commonly used dimensions of word embedding, 300 and 600, on the MSRP training corpus. The 300-dimensional word embedding achieved 0.7119 accuracy and 0.8173 F1, while the 600-dimensional word embedding achieved 0.6786 accuracy and 0.7891 F1. These two experiments show that too many features affect the classification performance given the size of the network that we designed. To further improve the performance of DPIM-ISS, we could expand the network size or add network layers to enhance the representation ability of DPIM-ISS.
Table 3: Performance comparisons with the word-matching-based approaches, the syntax-similarity approaches, the text-semantic-representation-based deep models, and the CNN-based models.

Related Work
Early work on paraphrase identification usually relied on lexical, semantic, or syntactic similarity measures to identify paraphrases.
Lexical approaches used bag-of-words representations without considering the semantics of the words, which inevitably led to the problems of "polysemy" and "synonymy" in paraphrase identification.
Some methods resorted to knowledge bases (such as WordNet) to measure word semantic similarity in order to alleviate the restrictions of word matching-based paraphrase identification methods. For example, Mihalcea et al. utilized WordNet-based measures to compute word semantic similarity [8], Mohamed and Oussalah used WordNet and Wikipedia to compute word semantic similarity and named-entity semantic relatedness for paraphrase identification [42], Madnani et al. exploited the METEOR (WordNet-based) machine translation metrics as classifier features to determine paraphrase [6], and Islam et al. [43] and Bollegala et al. [44] computed semantic similarity using corpus-based measures. The main advantage of knowledge base-based semantic approaches is that they can make full use of the prior knowledge of experts. However, their limitations mainly include the following: the knowledge base needs human maintenance and updating, vocabulary coverage is limited, and there is a lack of sufficient context information to determine the exact concepts.
On the other hand, researchers have noticed the role of syntactic features in paraphrase identification and presented some syntax-based methods. For example, Das and Smith believed that paraphrase was related to syntactic structure and used the part-of-speech tags and the syntactic dependencies of words as features to learn a classifier [2], Koroutchev et al. exploited the Lempel-Ziv algorithm to compare the syntactic and morphological features of two texts to detect text similarity [13], Elhadi and Al-Tobi utilized part-of-speech sequences to represent text and detect plagiarism [12,15], Potthast et al. employed n-grams of the syntactic structure sequence to detect plagiarism in European languages [14], and Mohammad et al. extracted POS tags as syntactic classifier features to identify paraphrases in Arabic [45]. However, these methods could not work effectively when the syntactic structures changed greatly.
To avoid the disadvantages of a single class of similarity measures, a different way to approach paraphrase identification is to rely on supervised learning to combine lexical, syntactic, and semantic features to classify a sentence pair as paraphrase or not [46].
In recent years, the distributed representation of words and text has advanced semantic representation. Manning pointed out that having a dense, multidimensional representation of similarity between all words is incredibly useful in natural language processing [47]. The distributed representation uses vectors in a continuous semantic space to project the linguistic units, so that the similarities of words can be calculated using the distances between word vectors.
Thus, two sentences, represented as two vectors in the low-dimensional semantic space, can still have a high similarity even if they do not share any term [39].
Inspired by the recent success of deep neural networks, paraphrase identification has moved toward deep paraphrase identification models, including the fully connected neural network-based models such as DSSMs (deep structured semantic models) [19]; the CNN-based (convolutional neural network) models such as CDSSMs (convolutional deep structured semantic models) [20,39], ARC-I (Architecture-I) [21], ARC-II (Architecture-II) [21], MatchPyramid [1], and Match-SRNN (match with spatial recurrent neural network) [23]; the recurrent neural network-based (RNN) models such as MV-LSTM [24]; CNN- and RNN-based models such as Deep-Paraphrase [48]; and attention-based alignment models such as pt-DecAtt [49]. These methods focus on the distributed representation of text and identify paraphrases through the learning of matching degrees and matching patterns, which reduces the dependence on the design of handcrafted features.
Researchers have also introduced the features of syntactic structures into the framework of deep paraphrase identification models. For example, Socher et al. deemed that syntactic and semantic analysis was needed for paraphrase detection and presented recursive autoencoders (RAEs) and the unfolding recursive autoencoder (uRAE) to encode the words, the multiword phrases, and the sentences in syntactic trees [25]. Zhou et al. followed the idea of Socher and used a weighted uRAE to encode the phrase and sentence embeddings obtained from parse trees [50]. Wang et al. proposed the DeepMatch Tree to match two short texts relying on a tree-mining algorithm [16]. Based on the dependency tree, DeepMatch Tree represented the two sentences as binary matching models composed of subtree pairs and utilized a deep neural network to learn the matching pattern. Considering the influence of syntactic structure on semantic computation, Liu et al. [51] exploited syntactic features for paraphrase identification. In their method, based on the syntactic tree, the TreeLSTM [52] was used to model the sentences and represent the semantic composition. In particular, they introduced the attention mechanism to extract cross-sentence features. Xu et al. also made use of syntactic features to indicate the dependency relations between words [53]. They incorporated lexical, syntactic, and sentential encodings for paraphrase identification. In their approach, integrating the syntactic features was verified to contribute to performance improvement. However, the high performance cannot be divorced from a large-scale pretrained model such as BERT (bidirectional encoder representations from transformers) [54]. The above approaches enjoyed the advantages of integrating syntactic features into paraphrase identification. They all exploited dependency trees to obtain the local substructures of words or phrases on the syntactic structures at different granularities and learned the semantic representation of these substructures. In this regard, the ideas of this paper are the same as those of the existing work. The difference lies in the semantic representation and interaction on syntactic structures. DPIM-ISS is designed to interact the semantics and syntactic features to obtain the semantic representation on syntactic structures. Furthermore, we exploit the explicit syntactic structure to model the semantic interaction on syntactic structures between two sentences. This allows us to learn the paraphrase pattern from the semantics on different linguistic features, which was not performed in the RAE, uRAE, weighted uRAE, and DeepMatch Tree.

Conclusions
In this paper, we present DPIM-ISS, a novel deep paraphrase identification model interacting semantics with syntax. In DPIM-ISS, we introduce syntactic information by capturing the syntactic structures and represent the semantics by means of the distributed representation method. Then, we exploit the tensor to interact the semantics and syntax for representing the sentences and use a convolutional neural network to extract the paraphrase patterns in the text matching space. Experiments on the MSRP, PAN 2010, and PAN 2012 corpora demonstrate that DPIM-ISS achieves comparable or better performance against the traditional word-matching approaches, the syntax-similarity approaches, the distributed-representations-of-sentences-based models, the CNN-based models, and some deep paraphrase identification methods.
There is an important direction for improving the performance of DPIM-ISS. We note that the acquisition of syntactic features currently relies on the results of syntactic parsing. The advantage of this kind of approach is that it captures explicit syntactic structures. However, we could try another way of exploiting syntactic features, for example, integrating the representation and learning of the syntactic features directly into the network of DPIM-ISS.
This will be part of our future work.