Phonetics and Ambiguity Comprehension Gated Attention Network for Humor Recognition

Humor refers to the quality of being amusing. With the development of artificial intelligence, humor recognition is attracting a lot of research attention. Although phonetics and ambiguity have been introduced by previous studies, existing recognition methods still lack suitable feature design for neural networks. In this paper, we illustrate that phonetic structure and ambiguity associated with confusing words need to be learned as their own representations via the neural network. Then, we propose the Phonetics and Ambiguity Comprehension Gated Attention network (PACGA) to learn phonetic structures and semantic representations for humor recognition. The PACGA model can effectively represent phonetic information and semantic information with ambiguous words, which is of great benefit to humor recognition. Experimental results on two public datasets demonstrate the effectiveness of our model.


Introduction
Humor is frequently used in daily communication [1]. When interacting with people, if artificial intelligence (AI) systems, such as chatbots, can detect humor within the conversation, it will help them better understand human emotions and make more appropriate decisions. Therefore, humor computation deserves particular attention, as it has the potential to turn computers into creative and motivational tools for human activity [2].
Humor recognition refers to determining whether a sentence in a given context expresses a certain degree of humor. Yang et al. [3] identified three semantic structures and a phonetic structure behind humor. Experimental results show that ambiguity and phonetic structures are important for humor recognition.
Phonetic structures, used as devices in humorous texts, usually take the form of alliteration or rhyme. Alliteration, rhyme, or word repetition is often used to evoke or enhance the effect of humor even if the content is not humorous. Exp 1. "You can tune a piano, but you can't tuna fish." In Exp 1, the humor does not come from the content of the sentence; rather, the words "tune" and "tuna" have nearly the same pronunciation, which produces a comic effect. This shows that phonetic structures, such as alliteration, rhyme, and word repetition, play an important role in humorous texts.
Ambiguity [4] arises when words with multiple meanings allow different interpretations of a sentence. Ambiguity and humor often go together [5], and ambiguity is a crucial component of many humorous texts [6]. Exp 2. "Did you hear about the guy whose whole left side was cut off? He's all right now." Exp 2 shows humor caused by ambiguity. The word "right" is the ambiguous word, meaning "right side" or "okay".
For the detection of phonetic structures and ambiguity in a humorous text, the most popular methods are based on complex feature engineering, such as semantic similarity and the number of rhyme chains. The idea of feature engineering is simple, but it is time consuming and cannot easily capture the latent semantic information behind humor. Recently, due to their strong feature extraction capabilities, neural network-based approaches have become mainstream for this task. However, most researchers simply use deeper neural networks without modeling phonetic structure and ambiguity. Moreover, the results of such humor recognition models are difficult to interpret.
To solve this problem, we propose an end-to-end neural network named the Phonetics and Ambiguity Comprehension Gated Attention network to detect humor in text. The proposed model captures phonetic information with Convolutional Neural Networks (CNN), combines Bidirectional Gated Recurrent Units (Bi-GRU) with an attention mechanism to model the context and ambiguous words, and applies a gated mechanism to adjust the effects of the two kinds of information in the task of humor recognition. Our work makes three contributions: (1) To model phonetic structure and ambiguity features in humor recognition, we propose a novel framework named the Phonetics and Ambiguity Comprehension Gated Attention network (PACGA), which learns phonetic representations with a CNN and latent semantic representations associated with ambiguous words with Bi-GRU and an attention mechanism. (2) We propose a gated attention strategy to exploit the combination of phonetic structure and ambiguity in humor recognition. Experimental results show that it is useful for humor recognition. (3) Experimental results on the Pun-of-the-Day [3] and Oneliners-16000 [7] datasets demonstrate that our method achieves state-of-the-art performance compared with strong baselines. Furthermore, a detailed analysis reveals the interpreting ability of our proposed model in humor recognition.

Related Work.
In this section, we review related work on machine learning-based methods and deep learning-based methods for humor recognition. Machine learning-based methods have been widely used to detect humor in text; they usually depend on feature extraction from text to train classifiers. Mihalcea and Strapparava [8] brought empirical evidence that computational methods can be successfully applied to the task of humor recognition in text. Zhang and Liu [9] designed about fifty features in five categories derived from influential humor theories, linguistic norms, and affective dimensions. Barbieri and Saggion [10] proposed a rich set of features, including ambiguity and phonetic structure. In recent work, Liu and Zhang [11] modeled sentiment association between discourse units to detect humor. They found that some syntactic structure features consistently correlated with humor in a separate paper [12]. Most of the abovementioned experimental results show that phonetic structure and ambiguity are primary features in humor recognition. However, the cost of constructing a large number of features is high, and it also limits the generalization capability of the model.
Recently, deep learning-based methods have garnered considerable success in humor recognition. Bertero and Fung [13] combined word-level and audio frame-level features and used RNN and CNN to predict humorous utterances. In their other paper [14], CNN was used to encode utterances, and Bi-LSTM was then used to predict humor in dialogues [15]. The performance of CNN-based humor recognition was systematically compared with well-established conventional methods using manual features. Chen and Soo [16] used CNN and Highway Networks to increase the depth of networks for humor detection. Zhao et al. [17] proposed a tensor embedding method that captures lexical similarity to detect humor. Blinov et al. [18] collected a dataset of jokes and funny dialogues in Russian and used language model fine-tuning for text classification. There is no doubt that deep learning-based methods can extract high-dimensional features automatically and achieve high performance in humor recognition. However, previous studies did not take the linguistic features of humor into account when using deep learning. They ignored the guidance of humor theory, and most of their experimental results are difficult to illustrate and explain.

Methods
In this section, we introduce our model, PACGA. Our model improves humor recognition by considering both phonetic representations and latent semantic information associated with ambiguous words. The overall architecture of PACGA is shown in Figure 1. The framework consists mainly of three parts: (1) a convolutional neural network for phonetic structure comprehension, (2) a Bi-GRU combined with an attention mechanism for semantic comprehension associated with ambiguous words, and (3) a gated attention strategy that leverages phonetic representations and semantic representations to recognize humor. We describe the details of our model in the following sections.

Phonetics Comprehension Network (PCN).
Many humorous texts play with sounds, creating incongruous sounds or words [3]. Mihalcea and Strapparava [7] claim that the phonetic features of humorous texts are at least as important as their content. For example, in "More sun and air for son and heir," "sun" and "son" and "air" and "heir" are homophones. They make the sentence not only harmonious and pleasant but also interesting and humorous. The pronunciation of words is not exactly the same as their spelling. In order to obtain the phonetic representation of words, we use the Carnegie Mellon University (CMU) pronouncing dictionary. The current phoneme set of CMU has 39 phonemes, and the version with lexical stress is more accurate than the version without it. We convert each word into its corresponding phonemes. For example, the pronunciation of "word" is ["W," "ER," "D"]. It should be noted that a word may have more than one pronunciation in CMU. We use all the pronunciations of a dictionary entry and match any of them as the phonetic extension of a word. Following Jaech's [19] work, we apply a substitution matrix between vowels and between consonants. A phoneme-substituted form can be used as a phonetic extension of the original word when its pronunciation is found in CMU.
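As an illustration of how phonetic cues such as alliteration and rhyme can be read off CMU-style phoneme sequences, here is a minimal sketch using a tiny hand-coded subset of the dictionary; the entries and helper names below are illustrative assumptions, not the paper's code:

```python
# Tiny hand-coded subset of the CMU Pronouncing Dictionary (illustrative;
# the paper uses the full dictionary with all pronunciation variants).
TOY_CMUDICT = {
    "sun":  ["S", "AH", "N"],
    "son":  ["S", "AH", "N"],
    "air":  ["EH", "R"],
    "heir": ["EH", "R"],
    "tune": ["T", "UW", "N"],
    "tuna": ["T", "UW", "N", "AH"],
}

def phonemes(word):
    """Look up the phoneme sequence for a word (first pronunciation only)."""
    return TOY_CMUDICT.get(word.lower())

def alliterate(w1, w2):
    """Two words alliterate if their pronunciations start with the same phoneme."""
    p1, p2 = phonemes(w1), phonemes(w2)
    return bool(p1 and p2) and p1[0] == p2[0]

def rhyme(w1, w2, tail=2):
    """Two words rhyme if their last `tail` phonemes match."""
    p1, p2 = phonemes(w1), phonemes(w2)
    return bool(p1 and p2) and p1[-tail:] == p2[-tail:]

print(rhyme("air", "heir"), alliterate("sun", "son"))  # True True
```

Homophone pairs like "sun"/"son" satisfy both predicates, which matches the example above.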

Phonetics Embedding Layer.
In the phonetics embedding layer, the pronunciation of each word is mapped to a high-dimensional feature space to capture meaningful phonetic information. For each word w_i in a sentence S = (w_1, w_2, ..., w_N), with w_i ∈ R^d, we convert w_i into its pronunciation P = (p_1, p_2, ..., p_l), with p_j ∈ R^(d′), where d and d′ are the embedding dimensions, N is the length of the sentence, and l is the number of phonemes in w_i. The phonetics embeddings are randomly initialized.

Permute Layer.
The permute layer permutes the dimensions of the input according to a given pattern. In our work, we aim to find patterns of alliteration or rhyme via the permute layer. The transposed matrix represents the pronunciations of different words aligned by corresponding phonemes and is fed to the convolutional layer.

Convolutional Layer.
We adopt the convolution operation in order to learn the local features of the phonetic representation. In general, a convolutional layer uses a filter to extract local n-gram features: a filter slides a window of L words to generate a new feature map. A feature c_t is produced from a window of words x_(t:t+L−1) as follows:

c_t = f(w · x_(t:t+L−1) + b),

where f is the nonlinear ReLU function, w is the filter that produces the feature map c_t, L is the length of the window, and b is the bias.

MaxPooling Layer.
GlobalMaxPool2D is used to generate the phonetic representation after the local phonetic features are captured by the two-dimensional CNN. At this point, we obtain the phonetic representation r_p of a target sentence from the Phonetics Comprehension Network.
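The PCN pipeline above (phoneme embeddings, convolution, then global max pooling) can be sketched in NumPy as follows; the shapes, filter count, and random weights are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_global_maxpool(x, filters, L):
    """x: (n_phonemes, d') embedding matrix; filters: (n_filters, L, d').
    Returns one pooled feature per filter (the phonetic representation r_p)."""
    n, d = x.shape
    n_filters = filters.shape[0]
    pooled = np.empty(n_filters)
    for k in range(n_filters):
        # Feature map: one ReLU-activated value per window of L phoneme rows.
        c = [np.maximum(0.0, np.sum(filters[k] * x[t:t + L]))
             for t in range(n - L + 1)]
        pooled[k] = max(c)  # global max pooling over the feature map
    return pooled

x = rng.standard_normal((10, 8))    # 10 phonemes, 8-dim phonetics embeddings
w = rng.standard_normal((4, 3, 8))  # 4 filters, window of 3 phonemes
r_p = conv_global_maxpool(x, w, L=3)
print(r_p.shape)  # (4,)
```

Each filter contributes one pooled value, so r_p has one dimension per filter, mirroring the GlobalMaxPool2D output.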

Ambiguity Comprehension Network (ACN).
Ambiguity arises when a word has multiple possible meanings [20]. Humor and ambiguity often go together when a listener expects one meaning but is forced to use another [3]. Consider the humorous example "it is so hot that all the fans left after the baseball game." The surface meaning of "fans" is baseball spectators, but the implication may be that the electric fans are off. An ambiguous word with multiple possible meanings may lead readers to misunderstand the sentence; it is the keyword that triggers humor. Furthermore, we also note that the multiple meanings of an ambiguous word are often quite different. To sum up, we focus on capturing the ambiguous words in a sentence, which helps us improve humor recognition.

Word Embedding.
In this layer, each word of a humorous text is mapped to a high-dimensional feature space to capture meaningful semantic regularities. Here, GloVe [21] is applied as the pretrained word vector in order to produce the word embeddings for detecting humor.

Ambiguous Word Embedding.
The definition of an ambiguous word here is a word in a humorous sentence with multiple meanings that have the highest semantic similarity. Our work is strongly based on the intuition that humor arises from ambiguous words. In other words, the more meanings a word has and the higher the semantic distance between them, the more it contributes to humorous sentences. Here, we use WordNet to identify ambiguous words for detecting humor. Firstly, we ignore the stop words of a sentence. Then, we compute the number of synsets for each word through WordNet and select the top T words as candidate ambiguous words. The semantic similarity is then computed among the meanings of each candidate word, using the cosine similarity function to measure the semantic distance.
Let x_i1, x_i2, ..., x_iK be the synsets of x_i, and K be the number of senses of the word x_i. The similarity between two senses x_im and x_in is calculated with the cosine function:

cos(x_im, x_in) = (v_im · v_in) / (||v_im|| ||v_in||),

where v_im is the vector representation of sense x_im. As a result, the word whose senses have the highest similarity is selected as the ambiguous word that expresses humor in a sentence.
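The selection rule above can be sketched as follows. WordNet supplies the real sense inventories; here the sense vectors and synset counts are toy assumptions for illustration only:

```python
import numpy as np

# word -> list of one vector per sense (toy stand-ins for WordNet synsets)
SENSE_VECTORS = {
    "fans": [np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0])],
    "hot":  [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])],
    "game": [np.array([0.5, 0.5, 0.0])],  # single sense: not a candidate
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def max_pairwise_similarity(senses):
    """Highest cosine similarity among a word's sense vectors."""
    best = 0.0
    for m in range(len(senses)):
        for n in range(m + 1, len(senses)):
            best = max(best, cosine(senses[m], senses[n]))
    return best

def pick_ambiguous_word(candidates):
    """Among multi-sense candidates, return the word whose senses are most
    similar, following the selection rule described above."""
    multi = {w: s for w, s in candidates.items() if len(s) > 1}
    return max(multi, key=lambda w: max_pairwise_similarity(multi[w]))

print(pick_ambiguous_word(SENSE_VECTORS))  # fans
```

With these toy vectors, "fans" wins because its two sense vectors are nearly parallel, while the senses of "hot" are orthogonal.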
The ambiguous word is represented as x_a. To combine the information of ambiguity and context, we learn ambiguous word embeddings for humor recognition. Since common word embedding representations exhibit a linear structure, words can be meaningfully combined by elementwise addition of their vector representations [22]. In order to better exploit the information within ambiguous words, we append the ambiguous word representation to each word embedding in the text. The ambiguous word embedding x_i′ of a word x_i for a specific target is

x_i′ = x_i ⊕ x_a,

where ⊕ is the vector concatenation operation.

Bidirectional Gated Recurrent Units (Bi-GRU).
We leverage a Bi-GRU on top of the ambiguous word embeddings to capture the features for humor recognition. The Bi-GRU runs over X to generate a hidden vector sequence (h_1, h_2, ..., h_N). At each step s, the hidden vector h_s is computed from the current input x_s and the previous hidden state h_(s−1). The formulas are as follows:

z_s = σ(W_z x_s + U_z h_(s−1) + b_z),
r_s = σ(W_r x_s + U_r h_(s−1) + b_r),
h̃_s = tanh(W_h x_s + U_h (r_s ⊙ h_(s−1)) + b_h),
h_s = (1 − z_s) ⊙ h_(s−1) + z_s ⊙ h̃_s,

where σ is the sigmoid function, z_s is the update gate, r_s is the reset gate, x_s represents the input, h̃_s is the candidate hidden state, h_s is the hidden state at time s, and ⊙ represents the elementwise multiplication operation.
Bi-GRU contains two hidden states at each time step s: one from the forward GRU, h→_s, and the other from the backward GRU, h←_s. Finally, the two parts are concatenated:

h_s = h→_s ⊕ h←_s.
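A single GRU step following the standard update/reset formulation can be sketched in NumPy; the weights here are randomly initialized placeholders, not trained parameters:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, W, U, b):
    """One GRU step. W, U, b each hold the update (z), reset (r), and
    candidate (h) parameters keyed by 'z', 'r', 'h'."""
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])  # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])  # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])
    return (1.0 - z) * h_prev + z * h_tilde             # new hidden state

rng = np.random.default_rng(1)
d_in, d_h = 6, 4
W = {k: rng.standard_normal((d_h, d_in)) for k in "zrh"}
U = {k: rng.standard_normal((d_h, d_h)) for k in "zrh"}
b = {k: np.zeros(d_h) for k in "zrh"}
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), W, U, b)
print(h.shape)  # (4,)
```

Running the same cell over the reversed sequence and concatenating the two hidden states per step gives the Bi-GRU output.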

Ambiguity Attention Bi-GRU.
The standard Bi-GRU cannot pay attention to the ambiguity for humor recognition, even though we add ambiguous information in the embedding layer. To address this issue, we utilize the attention mechanism to capture the key parts of the sentence in response to a given ambiguous word.
For each time step, the Bi-GRU produces a hidden vector h_i. The ambiguous word representation x_a is concatenated to each hidden vector to form the matrix H. Then, we use the attention mechanism to produce an attention weight vector α and the weighted hidden vector r_a. The formulas are as follows:

M = tanh(W_a H),
α = softmax(w_α^T M),
r_a = H α^T,

where M ∈ R^(2d×N), α ∈ R^N, and r_a ∈ R^(2d); W_a and w_α are learned parameters. α is the vector of ambiguity attention weights, and r_a is a weighted representation of the given sentence conditioned on the ambiguous word. At this point, we obtain the ambiguity representation r_a from the Ambiguity Comprehension Network.
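The ambiguity attention step can be sketched as follows; every dimension and weight below is an illustrative assumption:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def ambiguity_attention(H, x_a, W_a, w_alpha):
    """H: (2d, N) Bi-GRU outputs; x_a: ambiguous-word vector, tiled so
    that it is appended to every column of H before scoring."""
    M = np.tanh(W_a @ np.vstack([H, np.tile(x_a[:, None], H.shape[1])]))
    alpha = softmax(w_alpha @ M)  # attention weights over the N positions
    return H @ alpha, alpha       # weighted representation r_a and weights

rng = np.random.default_rng(2)
two_d, N, d_a = 8, 5, 3
H = rng.standard_normal((two_d, N))
x_a = rng.standard_normal(d_a)
W_a = rng.standard_normal((two_d, two_d + d_a))
w_alpha = rng.standard_normal(two_d)
r_a, alpha = ambiguity_attention(H, x_a, W_a, w_alpha)
print(r_a.shape)  # (8,)
```

The weights sum to one, so r_a is a convex combination of the hidden vectors, biased toward positions that score highly against the ambiguous word.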

Gated Attention Mechanism.
After learning by the phonetics and ambiguity comprehension networks, we combine the two parts to obtain the integrated representation. Intuitively, phonetic structure and ambiguity contribute differently to humor. Therefore, gated attention is leveraged to model the confidence of the clues provided by the two parts. We calculate the value of the attention gate as follows:

g = σ(w^T [r_p ; r_a] + b),

where σ is the sigmoid function, w is the weight vector, and b is the bias.
In order to balance the phonetic and ambiguous information, we use the value of the attention gate g and 1 − g as the combination weights. The final representation of a sentence is as follows:

r_pa = g ⊙ r_p + (1 − g) ⊙ r_a,

where r_pa is the integrated representation, r_p is the phonetic representation, r_a is the ambiguous semantic representation, g is the combination weight, and ⊙ is elementwise multiplication.
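The gated combination can be sketched as a few lines of NumPy; the weight shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_combine(r_p, r_a, w, b):
    """Blend the phonetic representation r_p and ambiguity representation
    r_a with a scalar gate g learned from their concatenation."""
    g = sigmoid(w @ np.concatenate([r_p, r_a]) + b)  # confidence in phonetics
    return g * r_p + (1.0 - g) * r_a, g

rng = np.random.default_rng(3)
r_p, r_a = rng.standard_normal(4), rng.standard_normal(4)
w = rng.standard_normal(8)
r_pa, g = gated_combine(r_p, r_a, w, 0.0)
print(r_pa.shape)  # (4,)
```

Because g lies strictly between 0 and 1, the gate interpolates between the two representations rather than discarding either one.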
Humor recognition can be formalized as text classification. r_pa is the vector representation of the text, and it is used as the input to obtain the final classification result:

p = softmax(W_p r_pa + b_p),

where p is the predicted probability distribution over humorous and nonhumorous classes, W_p is the weight matrix, and b_p is the bias.

Model Training.
The model can be trained in an end-to-end way by backpropagation, and we use cross-entropy loss as the loss function. Let y be the true distribution and ŷ be the predicted distribution for the text dataset. The goal of training is to minimize the loss between y and ŷ for all samples. We can formalize this process as follows:

J(θ) = − Σ_i Σ_j y_i^j log ŷ_i^j + λ ||θ||^2,

where i is the index of sentences, j is the index of classes, λ is the L2-regularization coefficient, and θ is the parameter set.
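The training objective described above can be sketched numerically as follows; the regularization strength and toy distributions are illustrative assumptions:

```python
import numpy as np

def objective(y_true, y_pred, params, lam=1e-4):
    """Cross-entropy between true and predicted class distributions plus
    an L2 penalty on the parameter arrays in `params`."""
    ce = -np.sum(y_true * np.log(y_pred))           # cross-entropy term
    l2 = lam * sum(np.sum(p ** 2) for p in params)  # L2 regularization term
    return ce + l2

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot labels, 2 samples
y_good = np.array([[0.9, 0.1], [0.2, 0.8]])   # confident, mostly correct
y_bad = np.array([[0.4, 0.6], [0.6, 0.4]])    # mostly wrong
theta = [np.ones((2, 2))]
print(objective(y_true, y_good, theta) < objective(y_true, y_bad, theta))  # True
```

Better-calibrated predictions yield a lower objective, which is what gradient descent exploits during training.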

Experiments
In this section, we first introduce the datasets and evaluation metrics. Then, we compare the performance of our model with several strong baselines in humor recognition. Finally, we give a detailed analysis of our method, including ablation experiments, visualization results, and error analysis.

Datasets and Evaluation Metrics.
We conduct experiments on the widely used Pun-of-the-Day and Oneliners-16000 datasets. Table 1 shows their detailed statistical distributions.

Pun-of-the-Day (Puns).
This dataset was constructed by Yang et al. [3]. The humorous texts of this dataset are from the Pun of the Day website, and the negative samples are from AP News, New York Times, Yahoo! Answers, and Proverb. The dataset contains an equal number of positive and negative samples. The average sentence length is 13.5 words.

Oneliners-16000 (Oliners).
This dataset was constructed by [7]. The one-liners in this dataset are from some famous humor websites, and the negative samples are from the titles of Reuters news. It is also a balanced dataset. The average sentence length is 12.6 words.

Evaluation Metrics.
We use Accuracy (Acc), Precision (P), Recall (R), and F-measure (F1) in our experiments to measure performance in humor recognition. We use 5-fold cross-validation with a grid search to select the optimal parameters. In detail, for each parameter setting, the following cross-validation procedure is performed. (1) The original dataset is randomly divided into five equally sized subsets. (2) Four subsets are used to train the model, and the remaining subset is used as validation data for testing the model. (3) We repeat step (2) five times such that each of the five subsets is used as the validation data once. (4) The five results from the folds are averaged. Finally, the parameter setting with the highest averaged result in the cross-validation process is chosen as optimal. In our experiments, the dropout rate is 0.35, the optimizer is Adam, the filter sizes are [2, 3, 4], and T is 3.
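The 5-fold procedure described above can be sketched as follows; `train_and_score` is a hypothetical stand-in for training the model and scoring it on the held-out fold:

```python
import random

def five_fold_indices(n, seed=0):
    """Shuffle sample indices and split them into five equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::5] for k in range(5)]

def cross_validate(n, train_and_score):
    """Run the 5-fold loop: each fold serves as validation exactly once,
    and the five scores are averaged."""
    folds = five_fold_indices(n)
    scores = []
    for k in range(5):
        val = set(folds[k])
        train = [i for i in range(n) if i not in val]
        scores.append(train_and_score(train, sorted(val)))
    return sum(scores) / 5.0

folds = five_fold_indices(100)
avg = cross_validate(100, lambda tr, va: len(va) / 100.0)  # dummy scorer
print([len(f) for f in folds], avg)  # [20, 20, 20, 20, 20] 0.2
```

Wrapping this loop in an outer iteration over candidate parameter settings gives the grid search used for model selection.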

Comparison with Existing Methods.
We compare our proposed model with several baselines:

Support Vector Machine (SVM).
This method uses all the features mentioned in [3].

CNN.
This method was proposed by Chen and Lee [15].

CNN + HN + F.
This method was proposed by Chen and Soo [16].

TM.
This method was proposed by Zhao et al. [17].

Bi-LSTM + CNN.
This method is a complete reimplementation of the method proposed by Bertero and Fung [14].

Bi-GRU.
We employ word embedding and learn the latent semantic representations through Bi-GRU.

Bi-GRU + F.
In addition to employing semantic representations learned automatically by Bi-GRU, the artificial features mentioned above are also incorporated into the network.

Bi-GRU + Att.
We implement a Bi-GRU architecture with an attention mechanism, with a focus on recognizing humorous text.

We expected a higher performance than Bi-GRU, but the results obtained are instead much lower on most of the evaluation metrics. The manually constructed input features may conflict with the semantic features that are automatically learned by the Bi-GRU. Therefore, adding too many artificial features to deep learning methods cannot effectively improve humor recognition. (6) Bi-GRU + Att uses the attention mechanism without the information of the ambiguous word. Its experimental performance is not greatly improved, largely because it cannot pay close attention to features strongly related to humor. This shows that our proposed phonetic information, ambiguity information, and gated attention mechanism have superior performance in humor recognition. (8) Compared with the baseline methods, our model achieves a higher accuracy score and F1 score on Puns, but lower precision and recall. We argue that the different types of additional information cause this phenomenon. Our model can learn latent semantic and phonetic information behind humor, such as phonetic structure and ambiguity, and the gated attention mechanism is applied to adjust the weight between them, providing more relevant features driven by humor theory, while the other methods usually employ only semantic information, obtaining higher precision and recall than PACGA. Our model achieves comparable performance on the two datasets, which shows that it has a better generalization capability.

Detailed Analysis.
We conduct extra experiments to analyze our model in detail. In addition to phonetic information, we also try to distinguish humor using only semantic information. Next, we design an ACN model that employs word embeddings and ambiguous word information to learn potential humorous features based on Bi-GRU and the attention mechanism. Finally, we evaluate our proposed model, PACGA. Tables 4 and 5 show the performance of all the models on both datasets: (1) Tables 4 and 5 show that Bi-GRU achieves worse performance, which is consistent with our intuition. Without phonetic structure and ambiguous word information, the performance of Bi-GRU in humor recognition is unsatisfactory. (2) PCN uses only phonetic information, and its performance is significantly lower than the other models on both datasets. Obviously, a single model that captures only phonetic features for detecting humor cannot give a competitive performance; semantic information plays an important role in the identification of humor. (3) Compared with Bi-GRU, the performance of ACN is slightly improved. This shows that ambiguous word information and the attention mechanism help Bi-GRU focus on the latent semantic features of humor. (4) Among all the methods, PACGA achieves the best performance for this task. The reason is that our model considers phonetic information, word information with ambiguous words, and the gated attention mechanism.

Impact of Different Combination Strategies.
The combination strategy may affect the performance of humor recognition and reflects the importance of our two main parts. Therefore, we design a series of experiments to explore the impact of different combination strategies. We adopt three strategies. (1) PAC-ST1: it directly combines the phonetic representation and the ambiguity representation. (2) PAC-ST2: it assumes that the two parts of information are of equal importance, and the parameter g is a constant set to 0.5. (3) PAC-ST3: the two parts of information have different importance, and the gated attention is used to model the confidence of the clues provided by the two parts.
We compare the single models and the combination models with different strategies, and the results are given in Table 6. From the results, we find that all the combined models outperform the single models, which shows that both phonetic structure and semantic information contribute to humor recognition. Among the combination models, the performances of PAC-ST1 and PAC-ST2 are roughly the same, with PAC-ST2 slightly better.
Furthermore, PAC-ST3 beats both of them by a large margin (1.48% and 1.56% on F1) on the two datasets. This shows that our gated attention strategy for assembling information can better capture the inherent features behind humor.

Visualization of Attention.
In order to validate the effectiveness of our model, PACGA, we visualize the attention layers for the sentences whose labels are correctly predicted.
From Figure 2, we can see that common words, such as "is" and "does," are afforded little attention by our model, which matches the intuition that common words make little contribution to identifying humor. Meanwhile, some specific words are crucial for humor. In Figure 2(a), the words "war," "right," "determines," and "left" have higher attention weights, which implies our model pays attention to those words, as we expect. It shows that ambiguous words can provide useful information for their context to adjust attention, and they play a great role in the humor recognition task. In Figure 2(b), the ambiguity is obviously not the main reason for the humor, and the model pays much more attention to the phonetic structure, which implies our model can learn the relative importance of phonetic structure and ambiguity for humor recognition. Thus, through PACGA, we can model phonetic structure and ambiguity well, respectively, and then combine their representations by the gated attention mechanism, which is helpful for humor recognition.

Error Analysis.
We also conduct a preliminary error analysis in this section. Our aim is to find problematic issues by studying some misclassified test cases and to improve the humor recognition of our model in the future. Exp 3. "The one who invented the door knocker got a no bell prize."
Exp 4. "A tidy desk is a sign of a cluttered desk drawer." For Exp 3, the true label is "humor," but our model predicted its label as "nonhumor." In this example, the punch line is "no bell prize," which sounds like "Nobel Prize." Obviously, this type of humor is caused by similarity in pronunciation, but "Nobel Prize" does not appear in the sentence, so our model cannot capture the relevant phonetic information. Hence, some background knowledge would be required in order to predict the label correctly. For Exp 4, "tidy" and "cluttered" are opposites, and this kind of conflict makes the sentence humorous. Humor sometimes relies on two or more inconsistent, unsuitable, or incongruous parts or circumstances. Therefore, our model also needs to be able to identify such inconsistencies.

Conclusions and Future Work
In this paper, we design an automatic computational neural network named the Phonetics and Ambiguity Comprehension Gated Attention network (PACGA) to detect humor. The main idea of PACGA is to use phonetic structure and ambiguity for humor recognition. In our model, the phonetics comprehension network uses CNN to learn phonetic representations from the CMU pronouncing dictionary. The ambiguity comprehension network learns latent semantic representations associated with ambiguous words via Bi-GRU and attention. On top of the two networks, the gated attention mechanism models the confidence of their clues. Experiments on the Puns and Oliners datasets verify that our proposed PACGA can learn effective phonetic and semantic information, which provides significant signals for detecting humor. In addition, the detailed analysis and the visualization of attention also show the validity and interpretability of the model from different perspectives.
In the future, we would like to investigate further how to integrate humor characteristics into deep learning models. How to use commonsense knowledge for humor recognition is also an issue deserving study.

Data Availability
All data analyzed during this study are public corpora, which can be obtained by sending an e-mail to the dataset builders. The "Pun of the Day" data that support the findings of this study are openly available in [3]. The "Oneliners-16000" data that support the findings of this study are openly available in [7].

Figure 2: Visualization of attention. A darker color means more importance. The pie chart shows the weights of the two parts based on the gated attention mechanism.

Conflicts of Interest