Multiplayer online games (MOGs) have become increasingly popular because of the opportunities they provide for collaboration, communication, and interaction. However, compared with ordinary human communication, MOGs still have several limitations, especially in communication using facial expressions. Although detailed facial animation has already been achieved in a number of MOGs, players have to use text commands to control the expressions of avatars. In this paper, we propose an automatic expression recognition system that can be integrated into an MOG to control the facial expressions of avatars. To meet the specific requirements of such a system, a number of algorithms are studied, improved, and extended. In particular, the Viola-Jones face-detection method is extended to detect small-scale key facial components, and fixed facial landmarks are used to reduce the computational load with little degradation in recognition accuracy.
Multiplayer online games (MOGs) have become popular over the last few years. The collaboration, communication, and interaction abilities of MOGs enable players to cooperate or compete with each other on a large scale, allowing them to experience relationships as real as those in the real world. This "real feeling" makes MOGs attractive to an increasing number of players, despite the significant amounts of time and the subscription fees required to play. Taking youths in China as an example, according to "Pacific Epoch's 2006 On-line Game Report" [
Despite the advances in the interactive realism of MOGs, the interfaces are still primitive when compared with real-world human communication. For example, in most existing MOGs, text chat is used rather than real-time voice during conversations, and avatars exhibit no natural body gestures, facial expressions, and so forth.
Among the problems mentioned above, this paper focuses on facial communication in particular. In everyday life, the manifestation of facial expressions is a significant part of our social communication; our underlying emotions are conveyed by different facial expressions. To feel as immersed and socially aware as in the real world, players must have an efficient method of conveying and observing changes in emotional states.
Existing MOGs allow players to convey their expressions mainly through
text-based commands augmented by facial and body expressions [
Although a number of existing MOGs have already achieved detailed animation, text commands do not offer an efficient way to control an avatar's expressions easily and naturally. They are simple and straightforward, but not easy to use. First, players have to memorize all the commands, so the more sophisticated the facial system is, the harder it is to use. Second, humans convey emotions through expressions in real time; players cannot type text commands every few seconds to update their current mood. Third, facial communication should happen naturally and effortlessly, and typing commands ruins the realism.
The goal of this paper is to automatically recognize
the player's facial expressions, so that the recognition results can be used to
drive the “facial expression engine” of a multiplayer online game. While many
facial recognition systems have been reported, MOGs pose unique requirements on
the system that have not been well addressed. In summary, a facial
expression recognition system for MOGs should meet the following requirements
[
- The recognition has to be performed automatically and in real time.
- The system should consume minimal system resources.
- The system should be robust under different lighting conditions and against complex backgrounds.
- The system should be user-independent (e.g., it should handle users of different genders, ages, and ethnicities).
- The input device should be easy to obtain and impose no constraints, so only a single regular web camera should be used.
- The system should be insensitive to the distance between the user and the camera (i.e., it should handle a relatively wide range of face resolutions).
- Players usually face the computer screen while playing, so the input to the system should be the user's frontal face, with a certain degree of tolerance to head rotations.
- Because the system serves entertainment purposes, the recognition accuracy requirement need not be overly conservative.
In this paper, we propose a real-time automatic system that meets these requirements. It recognizes players' facial expressions, so that the recognition results can be used to control avatars' expressions by driving the MOG's "animation engine" instead of text commands. Section
In computer vision, a facial expression is usually considered to be the deformation of facial components and their spatial relations, or changes in the pigmentation of the face. An automatic facial expression recognition system (AFERS) is a computer system that attempts to classify these changes or deformations into abstract classes automatically. A large number of approaches have been proposed since the mid-1970s in the computer vision community. Early works were surveyed by Samal and Iyengar [
Generally, an AFERS consists of three processing stages: face detection, facial feature extraction and representation, and facial expression recognition. The face-detection stage seeks to automatically locate the face region in an input image or image sequence. As the first step of an AFERS, its reliability has a major influence on the performance and usability of the entire system. The face detector may detect faces frame by frame, or detect a face in the first frame and then track it through the subsequent frames of the sequence.
After the face has been detected, the next step is to extract and represent the information about the facial expression to be recognized. The extraction process forms a high-level description of the expression as a function of the image pixel data. This description, commonly referred to as a "feature vector", is used for subsequent expression classification. Geometric features, which represent the shape and locations of facial components, and spectral-transform-based features, which are obtained by applying image filters to face images, are often used to represent the information of facial expressions. Irrespective of the feature-extraction approach employed, the essential information about the displayed expression should be preserved: the extracted features should possess high discriminative power and high stability across different expressions.
Facial expression classification is the last stage of an AFERS, and it is a decision procedure. The facial changes can be identified as facial action units (AUs) [
To attain successful performance, almost all the existing facial expression recognition approaches require some control over the imaging conditions, such as high-resolution faces, good lighting, and uncluttered backgrounds. With these constraints, the existing methods in the literature cannot be directly applied in most real-world applications, which always require operational flexibility. Although deployment of the existing methods in fully unconstrained environments is still in the relatively distant future, integrating and extending these algorithms to develop a facial expression recognition system for a certain application context such as MOG is achievable.
Based on the specific requirements of MOGs, a facial expression recognition system is proposed in this section. The system categorizes each frame of the user's facial video sequence into one of the six prototypic emotional expressions.
We hypothesize that recognition of the six prototypic
emotional expressions would serve an MOG well in most cases, since players may
not have enough time to perceive more subtle facial changes. Figure
The proposed system for MOGs.
The face region is located in an input image by
implementing one of the boosting methods proposed by Viola and Jones [
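As an illustration of this step, the following minimal sketch runs Viola-Jones-style detection with OpenCV's pretrained frontal-face Haar cascade; it stands in for the detector used in this work, and the input filename is hypothetical.

```python
import cv2

# Minimal sketch of Viola-Jones-style face detection using OpenCV's
# pretrained frontal-face cascade (a stand-in for the detector trained
# in this work). "player_frame.jpg" is a hypothetical input file.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("player_frame.jpg")          # one frame from the webcam stream
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # cascades operate on grayscale

# detectMultiScale scans the image at multiple scales and returns
# bounding boxes (x, y, w, h) for each detected face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face_roi = gray[y:y + h, x:x + w]  # region passed to the later stages
```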
To extract the facial features automatically, facial landmarks need to be detected without manual effort. Automatic facial landmark localization is a complex process: to find accurate landmark positions, most landmark-detection methods involve multiple classification steps and require a great number of training samples [
According to the results of the facial landmark
location tolerance test conducted in our previous work [
To take advantage of the computational efficiency of Haar-like features and of the highly efficient cascade structure used in the Viola-Jones AdaBoost face-detection method, the "AdaBoost" detection principle is still adopted to search for the key facial components (the mouth and eyes) within the detected face area. However, a low detection rate was observed when the conventional Viola-Jones method was trained on the facial components and employed in the detection process. This is probably due to the lack of structural information in the facial components (compared with the entire face); in particular, the structure of the facial components becomes less detectable when the detected face is at low resolution. Another possible cause of the low detection rate is the substantial variation in component shape, especially of the mouth, among the different expressions conveyed by the same or different people. This is true even for high-resolution face images. To solve these problems, we improved the "AdaBoost" detection method with extended Haar-like features, modified training criteria, regional scanning, and probabilistic selection of candidate subwindows.
An extended feature set with 14 Haar-like features
(Figure
The extended Haar-like feature set.
In the conventional Viola-Jones method, the cascade classifier is trained based on the desired hit rate and false-positive rate. An additional stage is added to the cascade classifier if the false-positive rate is still too high. However, as the false-positive rate decreases, the hit rate also decreases. In the case of facial-component detection, the hit rate falls dramatically for low-resolution face images if the cascade classifier is trained toward a low target false-positive rate.
To ensure that low-resolution facial components could be detected, a minimum overall hit rate is set before training. For each stage in the training, the training goal is set to achieve a high hit rate and an acceptable false-positive rate. The number of features used is then increased until the target hit rate and false-positive rate are met for the stage. If the overall hit rate is still greater than the minimum value, another stage is added to the cascade to reduce the overall false-positive rate. In this way, the trained detectors will detect the facial components at a guaranteed hit rate though some false positives will occur, which can be reduced or removed by the scanning scheme introduced below.
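This training criterion can be summarized by the following schematic sketch (not runnable as-is): train_stage, evaluate_hit_rate, and evaluate_fp_rate are hypothetical helpers standing in for boosted-stage training and validation-set evaluation, and all rate values are illustrative assumptions rather than the paper's settings.

```python
# Schematic sketch of the modified cascade-training criterion.
MIN_OVERALL_HIT = 0.95    # minimum overall hit rate fixed before training (assumed)
STAGE_HIT_GOAL = 0.995    # high per-stage hit-rate goal (assumed)
STAGE_FP_GOAL = 0.50      # acceptable per-stage false-positive rate (assumed)

def train_cascade(positives, negatives):
    cascade, overall_hit = [], 1.0
    # Keep adding stages to shrink the overall false-positive rate for as
    # long as the guaranteed overall hit rate stays above the minimum.
    while overall_hit * STAGE_HIT_GOAL >= MIN_OVERALL_HIT:
        # Features are added to the stage until both per-stage goals are met.
        stage = train_stage(positives, negatives,
                            hit_goal=STAGE_HIT_GOAL, fp_goal=STAGE_FP_GOAL)
        cascade.append(stage)
        overall_hit *= evaluate_hit_rate(stage, positives)
        # Only negatives that survive the current cascade train the next stage.
        negatives = [n for n in negatives if all(s(n) for s in cascade)]
    return cascade
```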
Rather than rescaling the classifier as proposed by Viola and Jones, multiscale searching is achieved by resizing the input face images to a range of predicted sizes and applying a fixed-size classifier for facial-component detection. Because of the regular structure of the face, the expected component size can be predicted from the face size relative to the size of the component samples used for training; in this way, computing the whole image pyramid is avoided. If the facial component is larger than the training size, the downsampling produces fewer false positives; when the component is smaller than the training samples, the input image is scaled up to match the training size.
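A sketch of this fixed-classifier search, under assumed values for the training window size and the component-to-face size ratios:

```python
import cv2

# Sketch of the fixed-classifier search: instead of rescaling the
# classifier over a full image pyramid, the face image is resized to a
# few predicted sizes derived from the training-sample size.
TRAIN_W = 24                   # component width used during training (assumed)
RATIOS = (0.25, 0.30, 0.35)    # predicted component/face width ratios (assumed)

def detect_at_predicted_scales(face_img, detector):
    h, w = face_img.shape[:2]
    for r in RATIOS:
        s = (w * r) / TRAIN_W  # scale mapping the expected component to TRAIN_W
        resized = cv2.resize(face_img, (int(w / s), int(h / s)))
        hits = detector.detectMultiScale(resized)
        if len(hits):
            # Map detections back to original face coordinates.
            return [(int(x * s), int(y * s), int(bw * s), int(bh * s))
                    for (x, y, bw, bh) in hits]
    return []
```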
In addition, prior knowledge of the face structure is used to partition the scanning region. The top region of the face image is used for eye detection, and the mouth is searched for in the lower region of the face. This regional scanning not only reduces false positives but also lowers the computational cost.
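A sketch of this regional scanning, assuming OpenCV cascade detectors for the eyes and mouth; the 0.55/0.45 split points are illustrative, not the paper's values.

```python
# Prior knowledge of face structure restricts each detector to a
# subregion of the face; the split points below are assumed values.
def scan_regions(face_img, eye_detector, mouth_detector):
    h, w = face_img.shape[:2]
    top = face_img[0:int(0.55 * h), :]       # eyes are searched in the upper face
    bottom = face_img[int(0.45 * h):h, :]    # mouth is searched in the lower face
    eyes = eye_detector.detectMultiScale(top)
    mouths = mouth_detector.detectMultiScale(bottom)
    # Offset mouth boxes back into full-face coordinates.
    mouths = [(x, y + int(0.45 * h), bw, bh) for (x, y, bw, bh) in mouths]
    return eyes, mouths
```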
To select the true subwindow containing the facial component, it is assumed that the central position of each facial component across different persons follows a normal distribution. The probability that a candidate component at a given position is the true one can therefore be estimated from this distribution, and the candidate with the highest probability is selected.
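A minimal sketch of this selection rule, with placeholder values for the distribution parameters (which would be estimated from training data):

```python
import numpy as np

# Candidate centers are scored under a 2D normal distribution of
# component positions; the most probable candidate is kept. The mean and
# standard deviations below are placeholders, not estimated values.
mu = np.array([0.30, 0.40])      # mean eye center in normalized face coords (assumed)
sigma = np.array([0.05, 0.05])   # per-axis standard deviation (assumed)

def select_candidate(boxes, face_w, face_h):
    best, best_p = None, -1.0
    for (x, y, w, h) in boxes:
        c = np.array([(x + w / 2) / face_w, (y + h / 2) / face_h])
        # Independent-axis Gaussian density (up to a constant factor).
        p = np.exp(-0.5 * np.sum(((c - mu) / sigma) ** 2))
        if p > best_p:
            best, best_p = (x, y, w, h), p
    return best
```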
Two cascade classifiers are trained for the mouth: one for detecting closed mouths and the other for open mouths. During scanning, if the closed-mouth detector fails to find a mouth, the open-mouth detector is triggered. In addition, the left-eye and right-eye classifiers are trained separately.
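In code, the fallback logic might look like the following sketch; the cascade file names are hypothetical stand-ins for the two trained detectors.

```python
import cv2

# Two-detector mouth search: the closed-mouth cascade is tried first, and
# the open-mouth cascade is triggered only on failure. File names are
# hypothetical stand-ins for the trained detectors.
closed_mouth = cv2.CascadeClassifier("closed_mouth_cascade.xml")
open_mouth = cv2.CascadeClassifier("open_mouth_cascade.xml")

def detect_mouth(lower_face):
    hits = closed_mouth.detectMultiScale(lower_face)
    if len(hits) == 0:                  # fall back to the open-mouth detector
        hits = open_mouth.detectMultiScale(lower_face)
    return hits
```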
After the areas of the key facial components, the mouth and eyes, have been detected, face images are normalized based on the centers of the components; finally, the mean coordinates of facial landmarks obtained from the "location tolerance test" are used as the landmarks. Figure
The landmark localization process: (from left to right) detection of face and facial components, normalised face, and fixed set of facial landmarks on the normalised face.
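A minimal sketch of this normalization, assuming canonical component positions (the values below are illustrative): an affine warp maps the detected centers onto a fixed-size face, after which the precomputed mean landmarks apply directly.

```python
import cv2
import numpy as np

FACE_SIZE = 96                                      # normalized face size (assumed)
# Canonical centers for left eye, right eye, and mouth (assumed values).
CANON = np.float32([[30, 35], [66, 35], [48, 72]])

def normalize_face(gray, left_eye, right_eye, mouth):
    """Warp the face so the three detected centers land on CANON."""
    src = np.float32([left_eye, right_eye, mouth])
    M = cv2.getAffineTransform(src, CANON)          # exact three-point mapping
    return cv2.warpAffine(gray, M, (FACE_SIZE, FACE_SIZE))

# After this normalization, one fixed set of mean landmark coordinates
# (from the location tolerance test) can be reused for every face.
```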
As stated previously, the extracted features should possess high discriminative power and high stability across different expressions. Among the feature extraction algorithms proposed in the literature, research has demonstrated that Gabor filters are more discriminative for facial expressions and more robust against various types of noise than other methods [
A 2D Gabor function is a plane wave with wave vector k, restricted by a Gaussian envelope function.
In our implementation, we set
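To make the extraction step concrete, the following sketch builds a small Gabor filter bank with OpenCV and samples the responses at the fixed landmarks; the 5-scale, 8-orientation layout and all parameter values are common choices assumed here, not necessarily the settings used in this work.

```python
import cv2
import numpy as np

def gabor_bank(ksize=21, wavelengths=(4, 6, 8, 11, 15), n_orient=8):
    """Bank of real Gabor kernels; layout and parameters are assumed."""
    kernels = []
    for lam in wavelengths:                  # wavelength sets the filter scale
        for j in range(n_orient):
            theta = j * np.pi / n_orient     # orientation of the plane wave
            kernels.append(cv2.getGaborKernel((ksize, ksize), sigma=lam / 2,
                                              theta=theta, lambd=lam,
                                              gamma=1.0, psi=0))
    return kernels

def gabor_features(face, landmarks, kernels):
    """Sample every filter response at every landmark into one vector."""
    face = face.astype(np.float32)
    # Absolute responses of real kernels; complex magnitudes are also common.
    responses = [np.abs(cv2.filter2D(face, -1, k)) for k in kernels]
    return np.array([r[y, x] for r in responses for (x, y) in landmarks])
```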
A wide range of
classifiers in pattern recognition literature have been applied to expression
classification. We evaluated a number of classification methods in [
SVMs belong to the class of kernel-based supervised learning machines and have been successfully employed in general-purpose pattern-recognition tasks. Based on statistical learning theory, SVMs find the maximum margin separating different classes. The kernel functions employed in SVMs efficiently map input data, which may not be linearly separable, to a high-dimensional feature space where linear methods can then be applied. There are often only subtle differences between expressions posed by different people; for example, "anger" and "disgust" are very similar. The high discrimination ability of SVMs therefore plays a major role in designing classifiers that can distinguish such expressions. SVMs also perform relatively well when only a modest amount of training data is available, which also makes them suitable for the system under consideration. Furthermore, only inner products are involved in SVM computation, so learning and prediction are much faster than with some traditional classifiers such as multilayer neural networks.
In the implementation, classifiers are trained to classify the Gabor coefficient vectors obtained from the feature extraction process into one of the six basic emotional expressions or a neutral expression. Since SVMs are binary classifiers and there are 7 categories to distinguish, 21 SVMs are trained to discriminate between all pairs of expressions. A multiclass classifier is obtained by combining the SVM outputs through a voting principle. For example, if one SVM decides that the input is "happiness" and not "sadness", then happiness gets +1 and sadness gets −1. After all SVMs have made their decisions, the votes for each category are summed, and the expression with the highest score is taken as the final decision.
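For illustration, scikit-learn's SVC realizes the same pairwise decomposition: it trains one binary SVM per class pair (21 for 7 classes) and predicts by majority vote. The data below is a random placeholder for real Gabor coefficient vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the pairwise voting scheme. SVC internally trains one binary
# SVM per class pair (7*6/2 = 21 here) and predicts by majority vote.
EXPRESSIONS = ["happiness", "sadness", "fear", "disgust",
               "surprise", "anger", "neutral"]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(70, 800))     # placeholder Gabor vectors (assumed dims)
y_train = np.repeat(np.arange(7), 10)    # 10 samples per expression (placeholder)

clf = SVC(kernel="rbf")                  # trains all 21 pairwise SVMs internally
clf.fit(X_train, y_train)

test_vector = rng.normal(size=(1, 800))  # one extracted feature vector
print(EXPRESSIONS[int(clf.predict(test_vector)[0])])
```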
As introduced in Section
The trained detectors were tested on the BioID database [
Mouth detection results. Both detectors were trained using the same dataset.
Original “AdaBoost”
Improved “AdaBoost”
Facial component detection results for different resolution faces from BioID database.
FG-NET database [
Recognition results for 7-expression classification.
| Expression | Recognition rate |
|---|---|
| Happiness | 85.2% |
| Sadness | 78.9% |
| Fear | 80.7% |
| Disgust | 81.6% |
| Surprise | 86.3% |
| Anger | 83.3% |
| Neutral | 84.9% |
Recognition results for 4-expression classification.
| Expression | Recognition rate |
|---|---|
| Happy | 85.2% |
| Unhappy | 85.6% |
| Surprise | 86.3% |
| Neutral | 84.9% |
Recognition samples from FG-NET.
Recognition samples for real-time test.
In this section, we outline how the proposed system can be incorporated into an MOG. A typical MOG is a complex distributed system connecting thousands of users. Two main types of network architecture are employed, namely, client-server and
peer-to-peer [
The system presented in this paper is implemented on the client side, as it constitutes a user-interface enhancement. The system outputs a classification of the player's current emotion, which is transmitted to the server; an XML-based description of the emotions could be employed. The game-logic module running on the centralized server would incorporate a parser for the XML message and send the appropriate message to the game-world module, which in turn issues the messages that allow the correct view of the avatar to be generated. Thus, the facial expression recognition system allows the appropriate avatar to be rendered with the required emotion in clients' world views.
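As a concrete illustration, the sketch below constructs one possible client-side message; the paper only suggests that an XML description could be used, so the tag and attribute names here are entirely hypothetical.

```python
import xml.etree.ElementTree as ET

# One possible client-side emotion message; the schema (tag and attribute
# names) is hypothetical, since the paper only suggests an XML format.
def emotion_message(player_id, expression, confidence):
    msg = ET.Element("emotion", playerId=str(player_id))
    ET.SubElement(msg, "expression").text = expression      # e.g., "happiness"
    ET.SubElement(msg, "confidence").text = f"{confidence:.2f}"
    return ET.tostring(msg, encoding="unicode")

# The serialized message would be sent to the game server, whose parser
# forwards the expression to the game-world module for avatar rendering.
print(emotion_message(42, "happiness", 0.86))
```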
In this paper, we presented an automatic facial expression recognition system for MOGs. Several algorithms were improved and extended to meet the specific requirements of such games. Despite recent advances in computer vision techniques for face detection, facial landmark localization, and feature extraction, building a facial expression recognition system for real-life applications remains challenging.