Egocentric Video Summarization Based on People Interaction Using Deep Learning

The availability of wearable cameras in the consumer market has motivated users to record their daily life activities and post them on social media. This exponential growth of egocentric videos demands automated techniques that effectively summarize first-person video data. Egocentric videos are now commonly used to record lifelogs because low-cost wearable cameras are widely available. However, egocentric videos are challenging to process because the placement of the camera results in video with a great deal of variation in object appearance, illumination conditions, and movement. This paper presents an egocentric video summarization framework based on detecting important people in the video. The proposed method generates a compact summary of egocentric videos that contains information about the people with whom the camera wearer interacts. Our approach focuses on identifying the interaction of the camera wearer with important people. We used the AlexNet convolutional neural network to filter the key-frames (frames where the camera wearer interacts closely with people). The network has five convolutional layers, two fully connected hidden layers, and an output layer. Dropout regularization is used to reduce overfitting in the fully connected layers. The performance of the proposed method is evaluated on the standard UT Ego dataset. Experimental results demonstrate the effectiveness of the proposed method in summarizing egocentric videos.


Introduction
The introduction of wearable cameras in the 1990s by Steve Mann has revolutionized the IT industry and had a deep impact on our daily lives. The availability of low-cost wearable cameras and social media has resulted in an exponential growth of the video content generated by users on a daily basis. Managing such massive video content is a challenging task. Moreover, much of the video content recorded by the camera wearer is redundant. For example, Narrative Clip and GoPro cameras record a large amount of unconstrained video that contains many insignificant or redundant events besides the significant ones. Therefore, video summarization methods [1,2] have been proposed to address the issues associated with handling such massive and redundant content.
Egocentric videos are more challenging to summarize because of the jitter caused by the camera wearer's movement. Accurate feature tracking, uniform sampling, and broad streaming data with very refined boundaries are additional challenges in lifelogging video summarization. To address these challenges, effective and efficient methods are needed to generate summaries of full-length lifelogging videos. Some distinctive egocentric video recording gadgets are shown in Figure 1. The focus of these egocentric video recordings is on activities, social interaction, and the user's interests. The objective of the proposed research work is to exploit these properties for the summarization of egocentric videos. Egocentric video summarization has useful applications in many domains, e.g., law enforcement [1,3], health care [4], surveillance [5], sports [6], and media [7,8].
The generation and transmission of vast amount of egocentric video content in the cyberspace have motivated the researchers to propose effective video summarization techniques for wearable camera data. Existing frameworks [9][10][11][12] have used supervised as well as unsupervised methods for egocentric video summarization.
Existing methods have used supervised learning techniques for summarization based on activity detection [13][14][15][16][17], object detection [18], and significant event detection [19]. The goal of egocentric video synopsis is to detect the significant events in lifelogging video data and generate a summarized video. Kang et al. [20] proposed a technique to identify new objects encountered by the camera wearer. Ren et al. [21] proposed a bottom-up, motion-based approach that segments the foreground object in egocentric videos to improve recognition accuracy. Hwang et al. [22] proposed a summarization technique based on identifying the important objects and individuals that interact with the camera wearer. Similarly, Yang et al. [23,24] analyzed lifelogging video data to summarize the daily life activities of the camera wearer. The summarized video contains the frames in which the user interacts with important people and objects. Choi et al. [25] presented a video summarization method that identifies some common human activities (e.g., talking) based on crowd perception. Lee et al. [26][27][28] proposed egocentric video summarization techniques to detect the exciting events of a camera wearer's entire day. These approaches [18][26][27][28][29][30] use region cues (i.e., nearness to hands, gaze, and recurrence of event) to detect the key events. These cues are used to evaluate the relative significance of any new region.
Egocentric video summarization methods have also used unsupervised learning for sports action categorization [7], scene discovery [31], key-frame extraction, and summarization [29,32]. Choudhury et al. [33] presented a model of the pattern of influence among people that builds on the social network. This algorithm [33] is used to find the interaction between people during a conversation. A wearable sensor package called the "sociometer" is used to measure face-to-face interaction. Yu et al. [34,35] proposed an eigenvector analysis method to address the issue of automated face recognition. This method, named "decision modularity cut", is used to evaluate the performance in terms of the social network. Fathi et al. [17] presented an approach to detect and recognize social interactions in a first-person video. Location and orientation information is used to compute the pattern of attention of the various persons, after which different roles are assigned. Finally, the roles and location information are analyzed to determine the social interactions.
In the last few years, Convolutional Neural Networks (CNNs) have been heavily explored because of their ability to learn image content remarkably well and to characterize video at immense scale [36][37][38]. Encouraged by the success of CNNs, a few research works [39,40] adopted deep learning features (e.g., CNN features) to recognize long-term activities and achieved significant progress. Poleg et al. [39,41] applied a compact 3D Convolutional Neural Network (CNN) architecture for long-term activity detection in egocentric lifelogging video data. It is common practice to use a large and diverse dataset for CNN training in video summarization applications [42][43][44][45]. The training process then uses only a restricted amount of task-specific training data. Jain et al. [46,47] used CNN features for visual detection tasks, for example, object localization, scene identification, and classification. Alom et al. [40] used cellular simultaneous recurrent networks (CSRNs) for feature extraction. CNN features computed through supervised learning are translation-invariant.
Egocentric video recordings lack a suitable structure and are unconstrained in nature. Generally, there is no emphasis on the important things the user needs to record. Most work in egocentric vision has concentrated on activity recognition, identification, and video summarization. We propose an effective egocentric video summarization method based on identifying the interaction of the camera wearer with important people. The proposed research work aims to produce more informative summaries with minimum redundancy. The representative key-frames for the summary are selected on the basis of people's interaction with the camera wearer. Our video summary focuses on the most important people who interact with the camera wearer while neglecting other content. We treat interactions in which the camera wearer is having a discussion with people and is fully engaged with them as important moments. Performance of the proposed technique is evaluated on a standard egocentric video dataset, "UT Ego" [24,30,48]. Experimental results show the effectiveness of the proposed method in identifying important people for egocentric video summarization. Our method provides superior detection performance and generates more useful summaries with minimum redundancy as compared to existing state-of-the-art methods.
The rest of the paper is organized as follows. Section 2 describes the proposed framework for egocentric video summarization. Section 3 provides the results of the experiments performed on the proposed method along with a discussion of the results. Finally, Section 4 concludes the paper.

Materials and Methods
The proposed egocentric video summarization framework is presented in this section. Our method takes a full-length egocentric video as input and generates a concise video that contains the most interesting segments (i.e., people's interaction with the camera wearer). The flow of the proposed summarization method for first-person video data is shown in Figure 2.
We trained a regression model to compute the scores of region's likelihoods. The input video is partitioned into a series of n sub-shots, $V = \{s_1, \ldots, s_n\}$. We trained the AlexNet CNN model for classification. The details of our classification model are provided in Section 2.1. The architecture of the proposed framework is provided in Figure 3.

AlexNet
Architecture. The proposed network contains eight trainable layers: five convolutional layers followed by two fully connected hidden layers and an output layer. We used the ReLU activation function in all the trainable layers except the last fully connected layer, where we applied the softmax function. Moreover, our system contains three pooling layers, two normalization layers, and a dropout layer. The AlexNet architecture of the proposed framework is shown in Figure 4.
In the first convolutional layer, relatively large convolutional kernels of size 11×11 are used. For the next layer, the kernel size is reduced to 5×5, and for the third, fourth, and fifth layers we applied convolutional kernels of size 3×3. In addition, the first, second, and fifth convolutional layers are followed by overlapping pooling operations with a pool size of 3×3 and a stride of 2×2. Our proposed architecture has two fully connected hidden layers with 4096 nodes each. The last fully connected layer feeds a 1000-way softmax function that produces a distribution over the 1000 class labels. The details of the convolutional and fully connected layers are provided in Figure 5.
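As a rough check of how the layer sizes described above fit together, the following sketch traces the spatial size of the feature maps through the convolution and pooling layers. The kernel and pooling sizes come from the text; the strides, paddings, and channel counts are assumptions taken from the commonly used AlexNet configuration and are not stated in this paper.

```python
def window_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad, out_channels).
# Kernel and pool sizes follow the text; strides, paddings, and channel
# counts are ASSUMED from the standard AlexNet configuration.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]

def trace_shapes(input_size=227, in_channels=3):
    """Return (name, channels, height, width) after every layer."""
    size, channels = input_size, in_channels
    shapes = []
    for name, kernel, stride, pad, out_channels in LAYERS:
        size = window_out(size, kernel, stride, pad)
        if out_channels is not None:   # pooling keeps the channel count
            channels = out_channels
        shapes.append((name, channels, size, size))
    return shapes

for name, c, h, w in trace_shapes():
    print(f"{name}: {c} x {h} x {w}")
```

Under these assumptions the last pooling layer yields a 256 × 6 × 6 volume, which is flattened before the two 4096-node fully connected layers.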

ReLu Non-Linearity. In recent years, the Rectified Linear Unit (ReLU) has become a popular unit because it takes less time to train than other units. With gradient descent, saturating nonlinearities train much more slowly than non-saturating ones. We used the rectified linear unit function to train the network. The range of ReLU is [0, ∞), which implies that the activation is unbounded and can grow very large. The ReLU activation function is defined as
$$f(I) = \max(0, I) \tag{1}$$
where I represents the input. If I < 0 the output is zero, whereas the function is linear when I ≥ 0. It is also used as a classification function. tanh is the hyperbolic tangent function, which works like the sigmoid function. The tanh function lies in the range (-1, 1) and is computed as follows:
$$\tanh(I) = \frac{e^{I} - e^{-I}}{e^{I} + e^{-I}} \tag{2}$$
In this manner, negative input values to tanh lead to negative outputs. The sigmoid function lies in the range (0, 1) and is computed as follows:
$$\sigma(I) = \frac{1}{1 + e^{-I}} \tag{3}$$
Deep convolutional neural networks with tanh units take more time to train than those with ReLUs. As demonstrated in Figure 6, sigmoid and tanh are computationally more complex for training purposes than ReLU.
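The three activation functions discussed above can be sketched in a few lines. The gradient helpers are an addition for illustration: they show why saturating units slow down gradient descent, since the gradients of tanh and sigmoid shrink towards zero for large inputs while the ReLU gradient stays at 1.

```python
import math

def relu(x):
    """Zero for negative input, identity otherwise."""
    return max(0.0, x)

def tanh(x):
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigmoid(x):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Gradients (illustrative helpers, not taken from the paper).
def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    return 1.0 - tanh(x) ** 2

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-5.0, 0.5, 5.0):
    print(x, relu_grad(x), tanh_grad(x), sigmoid_grad(x))
```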

Softmax Function and Response Normalization.
In the proposed architecture, we employed the softmax function as the nonlinearity at the output layer. This activation function transforms the output values into soft class probabilities. We applied a response normalization scheme in the first two layers. Let $a^{i}_{q,p}$ denote the activity of a neuron computed by applying kernel i at position (q, p) and then applying the ReLU nonlinearity. The response-normalized activity $b^{i}_{q,p}$ is computed as follows:
$$b^{i}_{q,p} = \frac{a^{i}_{q,p}}{\left(V + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{q,p}\right)^{2}\right)^{\beta}} \tag{4}$$
where the sum runs over n "adjacent" kernels at the same spatial position and N is the total number of kernels in the layer. The constants V, n, α, and β are hyperparameters whose values are determined using a validation set. The response normalization scheme is used to reduce the test error rate of the proposed network.
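A minimal sketch of the two operations described above, written for a single spatial position (q, p). The default values of V, n, alpha, and beta below are assumptions borrowed from the original AlexNet paper; this paper only states that they are chosen on a validation set.

```python
import math

def softmax(scores):
    """Turn raw output scores into class probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def response_normalize(a, n=5, V=2.0, alpha=1e-4, beta=0.75):
    """Response normalization across kernels at one position (q, p).

    a[i] is the ReLU activity of kernel i at that position.
    The default hyperparameters are ASSUMED, not taken from this paper.
    """
    N = len(a)
    half = n // 2
    out = []
    for i in range(N):
        lo = max(0, i - half)
        hi = min(N - 1, i + half)
        energy = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (V + alpha * energy) ** beta)
    return out

print(softmax([2.0, 1.0, 0.1]))
print(response_normalize([0.5, 3.0, 1.5, 0.0, 2.0]))
```

The normalization divides each activity by a term that grows with the squared activities of neighbouring kernels, so a strongly responding kernel suppresses its neighbours at the same position.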

Pooling Layer.
We applied overlapping pooling throughout the network. In a CNN, pooling layers summarize the outputs of neighbouring groups of neurons in the same kernel map; traditionally, the neighbourhoods summarized by adjacent pooling units do not overlap. A pooling layer requires two hyperparameters: the spatial extent w and the stride h. More specifically, a pooling layer can be viewed as a grid of pooling units spaced h pixels apart, each summarizing a neighbourhood of size w × w. If we set h = w, we obtain the traditional local pooling commonly used in CNNs. By setting h < w, we obtain overlapping pooling, which gave a lower error rate in detailed experiments with our framework. Therefore, we used overlapping pooling in the entire network with h = 2 and w = 3. This overlapping scheme reduces the computational cost by decreasing the size of the feature maps and also lowers the error rate.
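The difference between traditional and overlapping pooling can be illustrated with a short sketch. Taking the maximum over each window is an assumption here (the text only says that a pooling unit summarizes its neighbourhood); the parameters w and h are the spatial extent and stride defined above.

```python
def pooled_size(size, w, h):
    """Number of pooling units along one dimension (no padding)."""
    return (size - w) // h + 1

def max_pool_2d(fmap, w=3, h=2):
    """Pool a 2D feature map with w x w windows placed h pixels apart.

    Max pooling is ASSUMED; h < w gives overlapping windows.
    """
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - w + 1, h):
        out_row = []
        for c in range(0, cols - w + 1, h):
            out_row.append(max(fmap[r + dr][c + dc]
                               for dr in range(w) for dc in range(w)))
        out.append(out_row)
    return out

fmap = [[r * 5 + c for c in range(5)] for r in range(5)]
print(max_pool_2d(fmap, w=3, h=2))   # overlapping: h < w
print(max_pool_2d(fmap, w=2, h=2))   # traditional: h = w
```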

Dropout.
Dropout roughly doubles the number of iterations that our network requires to converge. Neurons that are "dropped out" do not participate in the forward pass or in backpropagation. We used dropout in the first two fully connected layers because it substantially reduces overfitting in our proposed framework.
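The dropout mechanism described above can be sketched as follows. The drop probability of 0.5 and the "inverted" scaling of the surviving activations are assumptions for illustration; the text does not state which variant or probability was used.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Zero each activation with probability p_drop during training.

    Survivors are scaled by 1 / (1 - p_drop) so the expected value is
    unchanged ("inverted" dropout, an ASSUMED variant). At test time the
    activations pass through untouched.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                       # fixed seed for repeatability
print(dropout([0.2, 1.5, 0.7, 3.0], rng=rng))
print(dropout([0.2, 1.5, 0.7, 3.0], training=False))
```

A dropped neuron outputs zero in the forward pass, so no gradient flows through it during backpropagation for that iteration.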

Results and Discussion
This section provides a comprehensive discussion of the results obtained through the experiments designed for performance analysis of the proposed framework. The details of the standard dataset used for classifier training and testing are also provided in this section. In addition, we discuss the evaluation metrics used for measurement.

Dataset.
We used the standard UT Ego dataset for performance evaluation of the proposed method. UT Ego [24,30,48] is specifically designed to measure the performance of egocentric video summarization approaches. The UT Ego dataset comprises four egocentric videos that were captured in uncontrolled environments. The videos are 3-5 hours long, with a resolution of 320×480 and a frame rate of 15 fps. These videos capture different daily life activities, including eating, purchasing, attending a lecture, and driving a car. We divided the UT Ego dataset into two classes: one where the camera wearer interacts with people and another where the camera wearer interacts with other objects.

Training and Implementation Details.
The input frame is resized to 227×227 for training. We used stochastic gradient descent to train our network, with a mini-batch size of 10, momentum of 0.9, and weight decay of 0.0005. The weight decay value of 0.0005 reduces the error rate of our model during training. The update rule for the weights w is as follows:
$$m_{i+1} = 0.9\, m_i - 0.0005\, \epsilon\, w_i - \epsilon \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i} \tag{5}$$
$$w_{i+1} = w_i + m_{i+1} \tag{6}$$
where i, m, and ε represent the iteration index, momentum variable, and learning rate, respectively, and $\left\langle \partial L / \partial w \,|_{w_i} \right\rangle_{D_i}$ is the average over the i-th batch $D_i$ of the derivative of the objective L with respect to w, evaluated at $w_i$. The weights in each layer are initialized from a zero-mean Gaussian distribution with a standard deviation of 0.01. Neuron biases in the second, fourth, and fifth convolutional layers are initialized to 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. The remaining neuron biases are initialized to 0. All layers in our network use the same learning rate, which can be adjusted during the training stage. The details of training and implementation are provided in Figure 6. The learning rate was initially fixed at 0.01 for more reliable training and then gradually decreased to 0.0001 as the optimization progressed.
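The weight update described above (momentum 0.9, weight decay 0.0005) can be written as a small function. This is a sketch over plain Python lists; the name `sgd_step` and the list-based representation of the weights are illustrative choices, not part of the paper.

```python
def sgd_step(w, m, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and weight decay.

    w    : current weights
    m    : momentum variable (same length as w)
    grad : batch-averaged gradient of the objective with respect to w
    Returns the updated (w, m).
    """
    m_next = [momentum * mi - weight_decay * lr * wi - lr * gi
              for wi, mi, gi in zip(w, m, grad)]
    w_next = [wi + mi for wi, mi in zip(w, m_next)]
    return w_next, m_next

# Two consecutive steps on a toy weight vector with a constant gradient.
w, m = [1.0, -0.5], [0.0, 0.0]
for _ in range(2):
    w, m = sgd_step(w, m, grad=[2.0, -1.0])
print(w, m)
```

The weight decay term pulls every weight slightly towards zero at each step, in addition to the step taken along the negative gradient.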

Evaluation on the Validation Set.
The output feature maps of our convolution layers are obtained through dropout regularization and batch normalization, which we applied layer by layer. Our model overfits if we use a dropout layer before the output layer. We observed that the validation accuracy improves if we increase the number of learned features; the model is then made to generalize by using dropout in each convolution layer. We used 70% of the images of the entire dataset for training and the remaining 30% for validation. A few snapshots of the training sample images, the training progress, and four sample validation images along with their predicted labels are shown in Figure 7.
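The 70%/30% training and validation split mentioned above could be produced along the following lines. Random shuffling with a fixed seed is an assumption; the text does not say how the frames were assigned to the two sets, and the file names below are purely hypothetical.

```python
import random

def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle items and split them into training and validation sets.

    Random assignment is ASSUMED; the paper only gives the 70/30 ratio.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical frame file names, used only to show the call.
frames = [f"frame_{i:05d}.jpg" for i in range(10)]
train, val = split_dataset(frames)
print(len(train), len(val))
```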

Evaluation Metrics.
To evaluate the performance of the proposed method, three objective evaluation metrics are used, namely precision, recall, and accuracy. The details of how these metrics are computed are provided in this subsection.
Precision represents the ratio of correctly labelled images of the positive class (i.e., people interacting with the camera wearer) to the total number of images retrieved as the positive class. Precision is calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$
where a true positive (TP) is a frame containing people's interaction with the camera wearer that is correctly detected by the classifier, and a false positive (FP) is a frame classified as positive (i.e., people interaction detected) that actually belongs to the negative class (i.e., frames in which people do not interact with the camera wearer).
Recall represents the ratio of correctly detected people interaction frames to the actual number of people interaction frames in the video and is computed as
$$\text{Recall} = \frac{TP}{TP + FN} \tag{8}$$
where a false negative (FN) in (8) is a positive class image that is misclassified as negative. Accuracy represents the ratio of correctly labelled images of the positive class (i.e., images with people interaction) and the negative class (i.e., images without people interaction) to all images and is computed as
$$\text{Accuracy} = \frac{TP + TN}{\text{Positive} + \text{Negative}} \tag{9}$$
where a true negative (TN) is a negative class image that is correctly labelled, and Positive and Negative in (9) represent the total number of positive and negative samples in our dataset.
We compare the overall performance of egocentric video summarization using different classifiers, namely Support Vector Machine (SVM) [51], Extreme Learning Machine (ELM) [52,53], K-Nearest Neighbor (KNN) [54], Regularized Extreme Learning Machine (RELM) [55], and Decision Trees [56].
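The precision, recall, and accuracy definitions given in this subsection can be sketched directly from per-frame labels, where 1 marks a frame with people interaction and 0 a frame without. The function names and the toy labels are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = people interaction)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def accuracy(tp, fp, tn, fn):
    total = tp + fp + tn + fn          # Positive + Negative samples
    return (tp + tn) / total if total else 0.0

# Toy ground-truth and predicted labels for eight frames.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, tn, fn))
```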

Performance
In addition, we also compared the results obtained with these classifiers against the proposed method. The objective of this experiment is to find the classification model that achieves the best accuracy for egocentric video summarization based on people interaction. We used three different feature descriptors, namely histogram of oriented gradients (HoG), local binary patterns (LBPs), and local tetra patterns (LTrPs), to train each of these classifiers individually (i.e., SVM, KNN, ELM, RELM, and decision trees). More specifically, we trained each classifier (e.g., SVM) using the HoG descriptor in the first phase, followed by LBP and LTrP in the second and third phases, respectively. Finally, the results obtained in each phase are combined to obtain the average precision, recall, and accuracy, as shown in Figure 8.
For feature extraction, we applied HoG, LBP, and LTrP to each input video frame and represented each frame as a feature vector for training.
For HoG descriptor representation, we decomposed the input image into windows of size 64×64. A histogram of oriented gradients is computed for each window and then normalized. The feature extraction process for HoG is shown in Figure 9.
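The per-window HoG computation described above can be sketched as follows. Several details are assumptions made for illustration: 9 unsigned orientation bins over [0°, 180°), magnitude-weighted voting, L2 normalization, and non-overlapping windows. The text only specifies the 64×64 window size and that each histogram is normalized.

```python
import math

def hog_window(window, bins=9):
    """Normalized orientation histogram for one grayscale window.

    The bin count, unsigned angles, and L2 normalization are ASSUMED.
    """
    rows, cols = len(window), len(window[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = window[r][c + 1] - window[r][c - 1]   # horizontal gradient
            gy = window[r + 1][c] - window[r - 1][c]   # vertical gradient
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle // bin_width), bins - 1)
            hist[index] += magnitude                   # magnitude-weighted vote
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist

def hog_descriptor(image, win=64, bins=9):
    """Concatenate the histograms of non-overlapping win x win windows."""
    features = []
    for r in range(0, len(image) - win + 1, win):
        for c in range(0, len(image[0]) - win + 1, win):
            window = [row[c:c + win] for row in image[r:r + win]]
            features.extend(hog_window(window, bins))
    return features

# A 64 x 128 horizontal intensity ramp yields two windows of 9 bins each.
image = [[c for c in range(128)] for _ in range(64)]
print(len(hog_descriptor(image)))
```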
For LBP representation, we divided the input image into small square blocks of size 3×3 for processing. LBP is computed by comparing the centre pixel value with the neighbouring pixel values as follows:
$$\mathrm{LBP}_{n} = \sum_{k=1}^{n} 2^{k-1}\, f(g_k - g_c) \tag{10}$$
$$f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & \text{otherwise} \end{cases} \tag{11}$$
where $g_c$ and $g_k$ represent the grayscale values of the centre pixel and the k-th neighbouring pixel, respectively, and n represents the total number of neighbours, which is set to 8 in our case. Once the local binary patterns are computed for all blocks, the entire image is represented by the histogram
$$H(l) = \sum_{x=1}^{M} \sum_{y=1}^{N} \delta\big(\mathrm{LBP}(x, y),\, l\big), \qquad \delta(a, b) = \begin{cases} 1, & a = b \\ 0, & \text{otherwise} \end{cases} \tag{12}$$
where M × N represents the size of the image. The entire process of LBP feature extraction is shown in Figure 10.
For LTrP representation, we first resize the input image I, convert it to grayscale, and then calculate the first-order derivatives along the 0° and 90° directions as follows:
$$I^{1}_{0^\circ}(X_c) = I(X_h) - I(X_c) \tag{13}$$
$$I^{1}_{90^\circ}(X_c) = I(X_v) - I(X_c) \tag{14}$$
where $X_h$ and $X_v$ denote the horizontal and vertical neighbours of the central pixel $X_c$.
Depending on the signs of the first-order derivatives, (15) assigns one of four directions to the centre pixel:
$$I^{1}_{\mathrm{Dir}}(X_c) = \begin{cases} 1, & I^{1}_{0^\circ}(X_c) \ge 0 \text{ and } I^{1}_{90^\circ}(X_c) \ge 0 \\ 2, & I^{1}_{0^\circ}(X_c) < 0 \text{ and } I^{1}_{90^\circ}(X_c) \ge 0 \\ 3, & I^{1}_{0^\circ}(X_c) < 0 \text{ and } I^{1}_{90^\circ}(X_c) < 0 \\ 4, & I^{1}_{0^\circ}(X_c) \ge 0 \text{ and } I^{1}_{90^\circ}(X_c) < 0 \end{cases} \tag{15}$$
The values of the four directions are 1, 2, 3, and 4. Finally, a tetra bit pattern is generated by comparing the direction of the centre pixel $X_c$ with the directions of all its neighbouring pixels. Once we obtain the LTrP, we represent the entire image through a histogram as shown in (12).
After representing the frames as feature vectors, we trained the classifiers one by one using each of these three descriptors. We divided the dataset into two halves: the first half is used for training the classifiers and the remaining half for testing. To be precise, we used 152165 frames each for training and validation. For SVM classification, we obtained an average precision of 88%, recall of 85%, and accuracy of 86%. For KNN, an average precision, recall, and accuracy of 86%, 84%, and 81%, respectively, are achieved. For ELM classification, we obtained an average precision of 75%, recall of 74%, and accuracy of 67%. For RELM classification, we obtained an average precision of 76%, recall of 75%, and accuracy of 68%. Similarly, for decision trees, an average precision of 75%, recall of 74%, and accuracy of 73% are achieved. As mentioned in the previous experiment, the proposed method achieves an average precision of 97%, recall of 95%, and accuracy of 96%. These results show that the proposed method outperforms the SVM, KNN, decision tree, ELM, and RELM classifiers. We conclude from these results that the proposed method is very effective at generating informative summaries of full-length lifelogging video data.
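For reference, the LBP code, the LBP histogram, and the LTrP direction assignment described in this subsection can be sketched as below. The clockwise neighbour ordering, the use of a sliding 3×3 neighbourhood around every interior pixel, and the choice of the right-hand and upper pixels as the horizontal and vertical neighbours are assumptions; the text does not fix these conventions.

```python
# Neighbours visited clockwise from the top-left corner (ASSUMED ordering).
OFFSETS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def lbp_code(patch):
    """8-neighbour LBP code of a 3x3 grayscale patch (list of 3 rows)."""
    center = patch[1][1]
    code = 0
    for k, (r, c) in enumerate(OFFSETS):
        if patch[r][c] - center >= 0:      # threshold against the centre
            code += 2 ** k
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist

def ltrp_direction(image, r, c):
    """Direction (1 to 4) of the pixel at (r, c) from the signs of its
    first-order derivatives along 0 and 90 degrees."""
    center = image[r][c]
    d0 = image[r][c + 1] - center      # horizontal neighbour (ASSUMED: right)
    d90 = image[r - 1][c] - center     # vertical neighbour (ASSUMED: above)
    if d0 >= 0 and d90 >= 0:
        return 1
    if d0 < 0 and d90 >= 0:
        return 2
    if d0 < 0 and d90 < 0:
        return 3
    return 4

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(lbp_code(image), ltrp_direction(image, 1, 1))
```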

Receiver Operating Characteristics Curves Analysis. In our third experiment, we plotted receiver operating characteristic (ROC) curves to evaluate the performance of the different classifiers along with the proposed method. ROC curves plot the true positive rate (TPR) against the false positive rate (FPR), which are computed as
$$\text{FPR} = \frac{FP}{FP + TN} \tag{16}$$
$$\text{TPR} = \frac{TP}{TP + FN} \tag{17}$$
In the proposed method, each frame is assigned a discrete class label. A (FPR, TPR) pair is obtained for each discrete classification approach, which corresponds to a single point on the ROC curve. Each point on the curve represents a pair of sensitivity and specificity values. ROC curves for SVM, KNN, decision trees, ELM, RELM, and the proposed method are plotted in Figure 11. From the results, we can observe that the proposed technique achieves the best ROC curve among the compared classifiers. In addition, SVM and KNN provide reasonable classification accuracy because we have a binary classification problem. These results indicate that the proposed method is very effective at detecting people's interaction with the camera wearer to generate more informative video summaries.
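The (FPR, TPR) computation described above can be sketched as follows. The `roc_curve` helper, which sweeps a decision threshold over per-frame scores to trace a full curve, is an illustrative addition: the text does not describe how the curves in Figure 11 were generated from the classifier outputs.

```python
def roc_point(tp, fp, tn, fn):
    """Return (FPR, TPR) for one set of confusion counts."""
    fpr = fp / (fp + tn) if fp + tn else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return fpr, tpr

def roc_curve(y_true, scores):
    """Trace ROC points by lowering a threshold over the scores.

    Threshold sweeping is an ASSUMED procedure, shown for illustration.
    """
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        points.append(roc_point(tp, fp, tn, fn))
    return points

# Toy labels and classifier scores for four frames.
print(roc_curve([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```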

Performance Comparison of the Proposed Framework with Existing State-of-the-Art Approaches. In our last experiment, we compare the performance of the proposed method against recent state-of-the-art methods [23,30,49] for egocentric video summarization. Aghaei et al. [49] proposed a technique for egocentric photo-streams captured through a low temporal resolution wearable camera. This technique [49] was deployed for multi-face detection, social signal interpretation, and social interaction detection (i.e., presence or absence of people interaction). Hough-Voting for F-Formation (HVFF) and Long Short-Term Memory (LSTM) approaches were used for social interaction detection. Yang et al. [23] proposed an egocentric summarization technique for social interaction using some common interactive features such as head movement, body language, and emotional expression during communication. Moreover, a hidden Markov support vector machine (HM-SVM) was used to summarize the video. Aghaei et al. [30] proposed an approach based on Long Short-Term Memory (LSTM) for the detection and categorization of the social interactions of people. This method [30] used low temporal resolution image sequences to detect the social interactions and estimated the distances between the people and the camera wearer. Su et al. [50] proposed a video summarization approach to detect engagement using long-term ego-motion cues (i.e., gaze). This approach [50] consists of three stages: frame prediction, interval prediction, and classification with the trained model.
The classification performance of the proposed and comparative methods is presented in Table 2. The F1-score is used for performance comparison because it is a reliable measure in cases where some methods have better precision but lower recall, and vice versa. The detailed statistics of the datasets used by each of the comparative methods are also provided in Table 2, including video length, format, frame rate, resolution, and quantity. From Table 2, we can observe that the proposed framework achieves encouraging results compared with the existing methods.
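The F1-score used for this comparison is the harmonic mean of precision and recall, which is why it penalizes a method that is strong on one measure but weak on the other. A minimal sketch follows; the precision and recall pairs in the usage lines are made-up values for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: a balanced method versus an unbalanced one with
# the same arithmetic mean of precision and recall.
print(f1_score(0.80, 0.80))
print(f1_score(0.95, 0.65))
```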

Conclusion
In this paper, we proposed an effective method to generate a precise and informative summary of a full-length lifelogging video with minimum redundancy. Our proposed method produces the summary on the basis of people's interaction with the camera wearer. The proposed scheme combines ideas from deep convolutional neural networks and fully connected conditional random fields for key-frame extraction. The proposed method achieves an average accuracy of 96% on challenging egocentric videos, which demonstrates the effectiveness of our method. In our experiments, we used different combinations of feature descriptors with different classifiers and compared the results with our method in terms of precision, recall, and accuracy. In addition, the proposed method is also compared with existing state-of-the-art egocentric video summarization methods in terms of F1-score. Experimental results indicate that the proposed technique outperforms the existing state-of-the-art techniques in generating useful video summaries. Currently, we are looking to design our own egocentric video dataset with the aim of increasing the diversity of the data. We intend to investigate the performance of our method on a more diverse egocentric video dataset in the future.

Data Availability
The authors have used the standard UT Ego dataset, which is publicly available at http://vision.cs.utexas.edu/projects/egocentric_data/UT_Egocentric_Dataset.html.

Conflicts of Interest
There are no conflicts of interest.