Personalized avatars for mobile entertainment

With the evolution of computer and mobile networking technologies comes the challenge of offering novel and complex multimedia applications and end-user services in heterogeneous environments for both developers and service providers. This paper describes one novel service, called LiveMail, that explores the potential of existing face animation technologies for innovative and attractive services intended for the mobile market. This prototype service allows mobile subscribers to communicate using personalized 3D face models created from images taken by their phone cameras. The user can take a snapshot of someone’s face – a friend, famous person, themselves, even a pet – using the mobile phone’s camera. After a quick manipulation on the phone, a 3D model of that face is created and can be animated simply by typing in some text. Speech and appropriate animation of the face are created automatically by speech synthesis. Furthermore, these highly personalized animations can be sent to others as real 3D animated messages or as short videos in MMS. The clients were implemented on different platforms, using different network and face animation techniques, and connected into one complex system. This paper presents the architecture of the system and the experience gained in building it.


Introduction
The two most dynamic forces in communications technology today are certainly mobility and the Internet. In parallel with the fast worldwide growth of mobile subscriptions, the fixed Internet and its service offerings have grown at a rate far exceeding all expectations. The number of people connected to the Internet is continuing to increase, and new generations of mobile networks are enabling connectivity virtually everywhere, at any time, with any device. With advances in computer and networking technologies comes the challenge of offering new multimedia applications and end-user services in heterogeneous environments for both developers and service providers.
Graphical simulations of real or imaginary persons capable of talking and gesturing have been an active research area for a long time. The advances in animation systems have prompted interest in use of animation to enrich the human-to-computer interface and also human-to-human communication. Together with ever more widespread multimedia and multimodal end-user equipment, ranging from devices specifically designed for a particular purpose to generic laptops and mobile phones, a wide range of new applications for face animation systems may be envisioned.
The goal of the project described in this chapter was to explore the potential of existing face animation technologies [7] for innovative and attractive services intended for the mobile market, exploiting in particular the advantages of technologies like MMS and GPRS. The new service allows customers to take pictures of people using the mobile phone camera and obtain a personalized 3D face model of the person in the picture through a simple manipulation on the mobile phone. In this chapter we present the architecture of the LiveMail system. We describe how unique personalized virtual characters are created with our face adaptation procedure. We also describe the clients implemented on different platforms, most interestingly the mobile platform, since 3D graphics on mobile platforms is still in its early stages. Various network and face animation techniques were connected into one complex system, and we present the main performance issues of such a system. Furthermore, if correctly implemented, face animation systems have low bandwidth requirements and high interaction capacity, making them a natural replacement for video streaming in "talking head" scenarios. Combined with facial feature tracking, the technology presented in this chapter could be used for very low bandwidth visual communication.
The chapter is organized as follows. In the next section we give a brief introduction to the standards and technologies used. Next, we present the overall system functionalities and architecture, continuing with more details on server and client implementation in the following sections.

Background
Before going further into the in-depth description of the LiveMail system, some additional details on the technologies and standards used are given. The purpose of this section is not to give in-depth coverage of particular standards and technologies, but to provide enough background information for understanding the rest of the chapter.

3D face modeling and visualization
Creating personalized 3D face models is not as simple a task as it seems. There are various techniques that can be used; however, most of them cannot be applied on mobile terminals, nor are they appropriate for end-user usage. For example, using a 3D modeling tool such as 3D Studio Max or Maya for manual construction of 3D models is often expensive and time-consuming, and sometimes does not result in a desirable model. Another way is to use specialized 3D scanners. Face models produced this way can be of very high quality, but using such scanners in a mobile environment is not practical. There have also been methods that use two cameras placed at a certain angle and picture-processing algorithms to create the 3D model [19]. Other methods, like [5], use three perspective images taken from different angles to adjust deformable contours on a generic head model.
Our approach in creating animated personalized face models is based on the adaptation of an existing generic face model, similar to [18]. However, in order to achieve simplicity on a camera-equipped mobile device, our adaptation method uses a single picture as an input.
Another important issue, besides 3D face modeling, is 3D visualization on mobile devices [13]. Because of the vast computational power required to achieve usable performance, 3D graphics rendering on handheld devices can still present a problem. The hardware and software capabilities of handheld devices are clearly inferior to those of desktop computers in many ways: lower speed, smaller display size and resolution, and less memory for running and storing programs are just some of the restrictions. With the introduction of more powerful processors, mobile phones are becoming capable of rendering 3D graphics at interactive frame rates. The first attempts to implement 3D graphics accelerators on mobile phones have already been made. Mitsubishi Electric Corp. announced its first 3D graphics LSI core for mobile phones, called Z3D, in March 2003, and other manufacturers like Fuetrek, Sanshin Electric, Imagination Technologies and ATI published their 3D hardware solutions for mobile devices a few months later.
Besides hardware solutions, another important prerequisite for 3D graphics on mobile devices is the availability of open-standard, well-performing application programming interfaces (APIs) that are supported by handset manufacturers, operators, and developers alike. The efforts made by the Khronos Group were clearly one of the biggest steps towards this goal. The Khronos Group is an industry consortium founded by a number of leading media-centric companies in January 2000. One of the open standard APIs it has created is OpenGL ES (OpenGL for Embedded Systems), announced in July 2003. It is a royalty-free, cross-platform, low-level API for full-function 2D and 3D graphics on embedded systems.
Another important effort was the Mobile 3D Graphics API (JSR-184) specification, developed under the Java Community Process and published by the end of October 2003. JSR-184 (also known as M3G) is a high-level API for Java mobile 3D graphics, designed to provide an efficient 3D graphics API suitable for the J2ME platform and, in particular, for the CLDC/MIDP profile. The OpenGL ES low-level API and the M3G high-level API represent de facto industry standards in the area of 3D graphics APIs for mobile devices.

MPEG-4 Facial Animation
Creating animated human faces using computer graphics techniques has been a popular research topic for the last few decades [1], and such synthetic faces, or virtual humans, have recently reached a broader public through movies, computer games, and the World Wide Web. Current and future uses include a range of applications, such as human-computer interfaces, avatars, video communication, and virtual guides, salespersons, actors, and newsreaders [8].
The created personalized face model can be animated using speech synthesis [15] or audio analysis (lip synchronization) [14]. Our face animation system is based on the MPEG-4 standard on Face and Body Animation (FBA) [9,10]. This standard specifies a set of Facial Animation Parameters (FAPs) used to control the animation of a face model. The FAPs are based on the study of minimal facial actions and are closely related to muscle actions. They represent a complete set of basic facial actions, and therefore allow the representation of most natural facial expressions. The lips are particularly well defined, and it is possible to precisely define the inner and outer lip contour. Exaggerated values permit the definition of actions that are normally not possible for humans, but could be desirable for cartoon-like characters.
All the parameters involving translational movement are expressed in terms of the Facial Animation Parameter Units (FAPU) (Fig. 1). These units are defined in order to allow interpretation of the FAPs on any facial model in a consistent way, producing reasonable results in terms of expression and speech pronunciation. They correspond to fractions of distances between some key facial features (e.g. eye distance). The fractional units used are chosen to allow enough precision.
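As a concrete illustration (a minimal sketch, not the normative MPEG-4 text): most FAPUs are defined as a small fraction (1/1024) of a key facial distance measured on the model, so decoding a translational FAP amounts to a single multiplication:

```cpp
#include <cassert>

// Most MPEG-4 FAPUs are 1/1024 of a key facial distance measured on the
// model (e.g. eye separation, mouth width); a translational FAP amplitude
// converts to a displacement in model units by multiplying by the FAPU.
float fapToDisplacement(int fapAmplitude, float keyDistance) {
    float fapu = keyDistance / 1024.0f;   // one FAPU for this key distance
    return fapAmplitude * fapu;           // displacement in model units
}
```

Because the displacement scales with the model's own key distances, the same FAP stream produces proportionally consistent motion on differently sized face models, which is exactly why the FAPUs exist.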
The FAP set contains two high-level FAPs for selecting facial expressions and visemes, and 66 low-level FAPs. The low-level FAPs are expressed as movements of feature points in the face, and MPEG-4 defines 84 such points (Fig. 2). The feature points not affected by FAPs are used to control the static shape of the face. The viseme parameter allows rendering visemes on the face without having to express them in terms of other parameters, or to enhance the result of other parameters, ensuring the correct rendering of visemes. A viseme is a visual correlate to a phoneme: it specifies the shape of the mouth and tongue for a minimal sound unit of speech. MPEG-4 includes 14 clearly distinguished static visemes in the standard set; for example, the phonemes /p/, /b/ and /m/ map to one viseme. An important consideration for visualization is that the shape of the mouth of a speaking human is influenced not only by the current phoneme, but also by the previous and the following phoneme. In MPEG-4, transitions from one viseme to the next are defined by blending two visemes with a weighting factor. The expression parameter allows the definition of high-level facial expressions.
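The two-viseme transition described above can be sketched as a per-vertex linear blend of the corresponding morph targets. The types and names below are illustrative, not taken from the MPEG-4 reference software:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// A vertex displacement relative to the neutral face (illustrative type).
struct Disp { float x, y, z; };

// Blend two viseme morph targets with a weighting factor, as MPEG-4 does
// for viseme transitions: the result is a linear mix of the two shapes.
// 'blend' in [0,1]: 1 selects viseme a entirely, 0 selects viseme b.
std::vector<Disp> blendVisemes(const std::vector<Disp>& a,
                               const std::vector<Disp>& b,
                               float blend) {
    std::vector<Disp> out(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        out[i].x = blend * a[i].x + (1.0f - blend) * b[i].x;
        out[i].y = blend * a[i].y + (1.0f - blend) * b[i].y;
        out[i].z = blend * a[i].z + (1.0f - blend) * b[i].z;
    }
    return out;
}
```

Sweeping the weighting factor over time moves the mouth smoothly from one viseme to the next, which is what produces coarticulation-like transitions from only two stored shapes.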
The FAPs can be efficiently compressed and included in a Face and Body Animation (FBA) bit stream for low bit rate storage or transmission. An FBA bit stream can be decoded and interpreted by any MPEG-4 compliant face animation system [2,12], and a synthetic, animated face can be visualized.

System functionalities
After this short review of technologies and standards, we present the basic functionalities of the LiveMail service. The LiveMail service provides simple creation of a personalized face model, and the transmission and display of such a virtual character as an animated message on various platforms. The procedure for creating such a personalized face model can be described simply as recognition of characteristic face lines in the taken picture and adjustment of a generic face model to those face lines. The most important face lines are the size and position of the face, eyes, nose and mouth.
Once a user creates a personalized face model, he or she can communicate with others by sending animated messages of that face model. An animated message contains speech animation synthesized from text that the user inputs.
Personalization of virtual characters and creation of animated messages with speech and lip synchronization is a time-consuming process that requires a lot of computational power. The capabilities of mobile devices have improved in the last few years, but they are still clearly not capable of such demanding computation. Therefore, all time-consuming processes are done on one or more servers, while mobile devices act as clients. A simplified LiveMail use-case scenario is depicted in Fig. 3. The server's basic task is to perform computationally expensive processes, like the previously mentioned personalization of virtual characters and creation of animated messages with speech and lip synchronization. The client side is responsible for displaying the animated message and handling user requests for the creation of a new personalized virtual character and its animated message through a proper user interface. When creating a personalized character, the interface allows the user to input a facial image and place a simple mask on it to mark the main features. Although this could be done automatically, recognition of face features is outside the scope of this work. After selecting the main face features, the client uploads the picture and face feature parameters to the server. The server then builds and stores the personalized face model and notifies the client (Fig. 4). When creating animated messages, the user selects a previously created virtual character, inputs text and addresses the receiver. The client application then sends a request for creation of the animated message to the server, which synthesizes the speech and creates matching facial animation using the text-to-speech framework. Because the animated message preview completely depends on the recipient's phone capabilities, the animation is adjusted prior to sending it to the recipient.
For mobile devices that cannot display even medium quality 3D animation, the animated message is converted to a short animated GIF that contains only the essential frames of the animation. The animated GIF is sent to the recipient as MMS (Fig. 5).

System architecture
The LiveMail system architecture is based on a client/server communication model. The tasks that the server must perform are very time-consuming; to be more precise, it takes up to 10 seconds to build a new personalized virtual character on today's standard PC (CPU 2-3 GHz, 512 MB RAM). To establish LiveMail as a broad-based service that serves many users simultaneously, the server application must be scalable. A generic LiveMail system architecture is depicted in Fig. 6. It is a 3-tier architecture with a thick client and a scalable application server.
The 3D face model adaptor module is used to create personalized virtual characters, while the 3D face model animator and text-to-speech engine modules are used to create animated messages. The 3D face model OpenGL renderer module creates the virtual character's preview images for the client user interface; it is also used in the process of converting an animated message to an animated GIF. The communication module is responsible for communicating with the MMS service center (MMSC) and LiveMail clients. The connection to the MMSC is established through the MM7 protocol, while the LiveMail client communicates with the server via a simple text protocol built on top of HTTP. Other application protocols like XML-RPC or SOAP could be used for client/server communication, but HTTP is more widespread and has minimal data overhead [4].

LiveMail server
While the described architecture is best for commercial deployment of the LiveMail service, as an enterprise application it is very complex and hard to implement. Therefore, the first step in that direction was to build a prototype that has the same modules and shares the concept with the described architecture, only in smaller proportions.
The prototype of the LiveMail server is a single application (written in C++) that consists of a lightweight HTTP server and an application server. The HTTP server provides clients with the user interface and receives requests, while the application server processes client requests: it creates new personalized 3D face models and animated messages. The user interface is dynamically generated using XSL transformations from an XML database each time a client makes a request. The database holds information about user accounts, their 3D face models and the contents of animated messages. Like any server, LiveMail is naturally multithreaded. The lightweight HTTP server can simultaneously receive many client requests and pass them to the application server for processing. The application server consists of many modules assigned to specific tasks, such as 3D face model personalization and animated message creation. During 3D face model personalization and animated message creation, there are resources that are not safe to use from multiple threads concurrently and therefore need to be shared among modules in a controlled way. Microsoft's Text-to-Speech engine, which is used for speech synthesis and as the basis for 3D model animation, is an example of such a shared resource. To use such a resource, all application server modules need to be synchronized. It is clear, then, that the technologies used to build the LiveMail server prototype have a direct impact on its architecture.
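The synchronization described above can be sketched with a simple wrapper that serializes access to the engine. `TtsEngine` here is a hypothetical stand-in for the native Microsoft Text-to-Speech interface, not its real API:

```cpp
#include <mutex>
#include <string>

// Hypothetical stand-in for the native text-to-speech engine; the real
// prototype drives Microsoft's Text-to-Speech engine through its own API.
class TtsEngine {
public:
    std::string synthesize(const std::string& text) {
        return "audio(" + text + ")";  // placeholder for real synthesis
    }
};

// Serializes access to a shared, non-thread-safe resource: only one
// application-server module at a time may drive the engine; all other
// requests block here, matching the queue described in the text.
class SharedTts {
public:
    std::string synthesize(const std::string& text) {
        std::lock_guard<std::mutex> lock(mutex_);
        return engine_.synthesize(text);
    }
private:
    std::mutex mutex_;
    TtsEngine engine_;
};
```

The lock turns concurrent client requests into a first-come, first-served queue, which is the behaviour later measured for animated message creation.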
The server's most valuable task is virtual character adaptation. The client's request for creation of a new virtual character holds the input parameters for the adaptation process: the picture of the person whose 3D model is to be created, and the characteristic facial points in that picture (forming a face mask). The server also has a generic 3D face model that is used in adaptation. Based on these inputs, the server deforms and textures the generic model in such a way that it becomes similar to the face in the picture, thus producing a new 3D model ready for animation. The model is stored on the server in VRML format for later use.
The virtual character adaptation algorithm starts by mapping characteristic facial points from the facial picture onto the characteristic points of the corresponding generic 3D face model. The initial mapping is followed by three main stages of adaptation. The first stage is normalization: the highest and the lowest point of the generic 3D face model are modulated according to the characteristic facial points in the facial picture.
The generic 3D face model is then translated to the modulated highest point. As a rule, the translated model does not match the size of the facial picture mask, so it has to be scaled. Thus, the vertical ratio of the generic model and the facial picture mask is calculated, and every point of the generic model is moved by the corresponding ratio. There is vertical and horizontal scaling. Horizontal scaling involves adjusting the horizontal distance between the highest point and every other characteristic facial point, relative to the face's axis of symmetry. Vertical scaling is easier, because no axis of symmetry is involved.
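Under the assumptions stated above (vertical scaling relative to the highest point, horizontal scaling relative to the vertical symmetry axis), the scaling step might look like the sketch below. The names and the exact choice of reference points are illustrative, not the paper's implementation:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct P3 { float x, y, z; };

// Sketch of the scaling step: the model is scaled vertically by the ratio
// of mask height to model height (anchored at the highest point), and
// horizontally each point's distance from the vertical face symmetry axis
// is scaled by a separately computed horizontal ratio.
void scaleModel(std::vector<P3>& model,
                float modelTop, float modelBottom,
                float maskTop, float maskBottom,
                float axisX, float horizRatio) {
    float vertRatio = (maskTop - maskBottom) / (modelTop - modelBottom);
    for (P3& p : model) {
        // vertical scaling relative to the highest point
        p.y = modelTop + (p.y - modelTop) * vertRatio;
        // horizontal scaling relative to the face symmetry axis
        p.x = axisX + (p.x - axisX) * horizRatio;
    }
}
```

Anchoring the vertical scale at the highest point keeps the top of the head fixed while the rest of the model stretches or shrinks to the mask's proportions.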
The next stage is processing, which covers both texture processing and model processing. The N existing points map out a net of triangles that covers the entire normalized space. It is important to note that beyond the face the net must be uniform.
Based on the known points and known triangles, the interpolator algorithm is able to determine the coordinates of any new point in that space using interpolation in barycentric coordinates. The interpolator used here is described in [6]. We forward the characteristic points received from the client's request to the interpolator, which, based on the triangle net, determines the locations of the other points needed for the creation of the new model.
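A minimal sketch of barycentric interpolation over one triangle of the net (the actual interpolator is the one described in [6]; this only illustrates the principle):

```cpp
#include <cassert>
#include <cmath>

struct P2 { float x, y; };

// Barycentric coordinates of point p with respect to triangle (a, b, c),
// computed from signed areas.
void barycentric(P2 p, P2 a, P2 b, P2 c, float& u, float& v, float& w) {
    float d = (b.y - c.y) * (a.x - c.x) + (c.x - b.x) * (a.y - c.y);
    u = ((b.y - c.y) * (p.x - c.x) + (c.x - b.x) * (p.y - c.y)) / d;
    v = ((c.y - a.y) * (p.x - c.x) + (a.x - c.x) * (p.y - c.y)) / d;
    w = 1.0f - u - v;
}

// Interpolate a displaced position: each triangle corner has a known
// target location (da, db, dc); the new location of p is the same
// barycentric mix of those targets.
P2 interpolate(P2 p, P2 a, P2 b, P2 c, P2 da, P2 db, P2 dc) {
    float u, v, w;
    barycentric(p, a, b, c, u, v, w);
    return { u * da.x + v * db.x + w * dc.x,
             u * da.y + v * db.y + w * dc.y };
}
```

Because barycentric weights are affine-invariant, every non-characteristic vertex inside a triangle moves consistently with the three characteristic points that enclose it.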
The last stage of the algorithm is renormalization. It is practically identical to the first stage, except that it rolls the model back from the normalized space to the space used prior to the start of the algorithm.
The described adaptation algorithm is very time-consuming. Measuring each step of new virtual character creation shows that 98% of the entire process is spent in the adaptation algorithm; the remaining 2% is spent on model storage and preview images for the client user interface. With a generic 3D face model constructed of approx. 200 polygons, the adaptation algorithm takes 7.16 seconds on an AMD Athlon XP 2800+ (Fig. 7). Each adaptation started by a client's request runs in a separate thread, so multiprocessor systems can handle simultaneous client requests much faster.
The server's next important task is the creation of animated messages. Its execution time depends on the length of the message text (Fig. 8). Also, the Microsoft Text-to-Speech engine cannot process simultaneous text-to-speech conversions, so client requests are handled one at a time (while all others wait in a queue).

LiveMail client
The LiveMail client can easily be implemented on various client platforms because the animation is generated on the server. Our approach was to build an independent face animation player core that can be ported to any platform that supports 3D graphics.
The face animation player is an MPEG-4 FBA decoder that decodes MPEG-4 Facial Animation Parameters (FAPs). After decoding, the FAPs are applied to the face model. Interpolation from key positions is used as the facial animation method, similar to the morph target approach in computer animation and the MPEG-4 Face Animation Tables (FAT) approach. FATs define the spatial deformation of a model as a function of the amplitude of the FAPs. We took the interpolation approach because it is very simple to implement and can therefore be easily ported to various platforms. It can also be easily adapted to conventional computer animation because of the similar methodology.
Here is how the player works. Each FAP (both low- and high-level) is defined as a key position of the face, or morph target. Each morph target is described by the relative position of each vertex with respect to its position in the neutral face, as well as the relative rotation and translation of each transform node; linear interpolation between these key positions has proved sufficient in all situations tested so far. The vertex and transform movements of the low-level FAPs are added together to produce the final facial animation frames. In the case of high-level FAPs, the movements are blended by averaging, rather than added together.
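The combination rule described here (summing low-level contributions, averaging high-level ones) can be sketched per frame as follows; the displacement type and container layout are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Per-vertex displacement relative to the neutral face (illustrative type).
struct Disp { float x, y, z; };

// Combine weighted morph-target displacements into one animation frame:
// low-level FAP contributions are summed, while high-level FAP
// contributions (expressions, visemes) are averaged on top of them.
std::vector<Disp> combineFrame(const std::vector<std::vector<Disp>>& lowLevel,
                               const std::vector<std::vector<Disp>>& highLevel,
                               size_t vertexCount) {
    std::vector<Disp> frame(vertexCount, Disp{0, 0, 0});
    for (const auto& t : lowLevel)          // additive low-level FAPs
        for (size_t i = 0; i < vertexCount; ++i) {
            frame[i].x += t[i].x;
            frame[i].y += t[i].y;
            frame[i].z += t[i].z;
        }
    if (!highLevel.empty()) {               // averaged high-level FAPs
        float n = static_cast<float>(highLevel.size());
        for (const auto& t : highLevel)
            for (size_t i = 0; i < vertexCount; ++i) {
                frame[i].x += t[i].x / n;
                frame[i].y += t[i].y / n;
                frame[i].z += t[i].z / n;
            }
    }
    return frame;
}
```

Summing suits the low-level FAPs because each controls a disjoint facial region, while averaging keeps simultaneous high-level expressions from overshooting.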

Various programming languages are used to implement face animation player on a variety of platforms. Implementation of the face animation player is easy due to its simplicity and low requirements.

LiveMail client for mobile platforms
Usage of the LiveMail system on different platforms is one of its key features. Mobile platforms are in many ways restrictive in comparison to desktop PCs: lower processing speed, small display size, and storage and memory limitations are just some of the main disadvantages.
As mentioned earlier, there are three distinct use cases for the LiveMail service. The first is the creation of a face model from a snapshot of someone's face. After taking a photo with the camera, the user needs to adjust a mask outlining the key parts of the face. The mask is used to define 26 feature points on the face that are then, together with the picture, sent to the server for face adaptation, as described previously. The user interface used to create the personalized face model requires access to some additional services of the phone. For this reason it is implemented on the Java 2 Micro Edition (J2ME) platform with support for an additional package called the Mobile Media API (MMAPI). MMAPI allows access to native multimedia services, for example camera manipulation and picture access, which are not available in the standard J2ME platform. This user interface is also implemented for the Symbian platform, where it is used in the same way as on the J2ME platform (Fig. 9).
The second use case is the process of creating and sending a LiveMail message with previously generated personalized characters. Because of the low requirements in this case, the functionality can be implemented on virtually any of today's mobile phone platforms. In the most restrictive case, the LiveMail WAP service can be used.
The third use case is receiving and previewing the animated message on the client platform. This is the most demanding use case, since it requires implementation of the face animation player, which has to be able to decode a face animation stream and render the 3D face model with the corresponding animation. For the LiveMail client on the mobile platform, part of this functionality is currently implemented as selecting the subset of triangles of which the virtual face is formed. Generally speaking, picking or selecting objects in a 3D scene with a device pointer is not as easy as it seems at first glance. There are several ways of selecting objects, like ray casting, color-coding, name lists, etc. Each of these methods tests some kind of intersection with the 3D objects. DieselEngine does not have any function for testing intersections, so that has to be done manually. For the face animation player on the Symbian platform, a rendering method was used [16].
In the rendering method, each 3D object, or more precisely every triangle of its face mesh, is transformed into screen coordinates, so the intersection is tested in a 2D coordinate system. Testing intersection with each triangle of the object is time-consuming, so bounding volumes were used to speed things up; in this case, a bounding sphere. Instead of testing intersection with a couple of dozen triangles for each object, intersection is tested with the one bounding sphere, which is very simple to implement using the rendering method. However, forming the bounding sphere itself is not as clear-cut, because its centre and radius have to be determined in such a way that all vertices lie inside a sphere of minimum radius. For primitive objects like spheres, boxes, cones or cylinders this is simple to determine.
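After the projection to screen coordinates, the per-object test reduces to a point-in-circle check, sketched here with illustrative parameter names:

```cpp
#include <cassert>

// 2D hit test against a projected bounding sphere: once the sphere's
// centre and radius are transformed to screen coordinates, picking with
// the device pointer reduces to a point-in-circle test.
bool hitsSphere(float pointerX, float pointerY,
                float centreX, float centreY, float screenRadius) {
    float dx = pointerX - centreX;
    float dy = pointerY - centreY;
    // squared distances avoid a square root per tested object
    return dx * dx + dy * dy <= screenRadius * screenRadius;
}
```

Only objects whose circle passes this cheap test need any further per-triangle work, which is the whole point of the bounding volume.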
For other complex objects or face meshes there are a number of algorithms that perform this task, with speed versus quality tradeoffs. For this implementation, Ritter's algorithm [11], which creates a near-optimal bounding sphere, was used. If there are multiple intersections, the Z-buffer values are compared and the nearest point is selected.
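A compact version of Ritter's algorithm, as it might be implemented for a face mesh (the point type is illustrative; the original formulation is in [11]):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct P3 { float x, y, z; };

static float dist(const P3& a, const P3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Ritter's near-optimal bounding sphere: pick a point, find the point x
// farthest from it, then the point y farthest from x; the initial sphere
// spans the segment x..y. A second pass grows the sphere minimally to
// enclose any vertex still outside it.
void ritterSphere(const std::vector<P3>& pts, P3& centre, float& radius) {
    const P3* x = &pts[0];
    for (const P3& p : pts)
        if (dist(pts[0], p) > dist(pts[0], *x)) x = &p;
    const P3* y = x;
    for (const P3& p : pts)
        if (dist(*x, p) > dist(*x, *y)) y = &p;
    centre = { (x->x + y->x) / 2, (x->y + y->y) / 2, (x->z + y->z) / 2 };
    radius = dist(*x, *y) / 2;
    for (const P3& p : pts) {
        float d = dist(centre, p);
        if (d > radius) {
            float grow = (d - radius) / 2;   // enlarge just enough
            radius += grow;
            float s = grow / d;              // shift centre toward p
            centre.x += (p.x - centre.x) * s;
            centre.y += (p.y - centre.y) * s;
            centre.z += (p.z - centre.z) * s;
        }
    }
}
```

Two linear passes over the vertices make it cheap enough to run once per mesh at load time, and the result is typically within a few percent of the optimal radius.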

LiveMail client for desktop PCs
A parallel web service is offered as a full-featured interface to the LiveMail system: it is possible to create new personalized virtual characters and compose messages that can be sent as e-mail. The system maintains a database of all previously created characters and sent messages for a specific user, identified by e-mail address and password.
Creating new personalized virtual characters (Fig. 10) is based on Java Applet technology and provides the same functionality as described in the previous section. Compared to mobile platform clients, Java Applets on desktop computers provide considerably more possibilities in terms of user interfaces. More precision is possible in defining feature points, since the adjustment of the mask is performed on larger displays. Also, the cost of data transfer, which is significantly cheaper compared to mobile devices, makes it possible to use higher resolution portrait pictures, resulting in better-looking avatars.
It is possible to create new messages using any of the previously created characters. The animation, the LiveMail, is stored on the server, and a link, a dynamically generated URL, is sent to the recipient's e-mail address (Fig. 11). The player used to view the animation is also a Java Applet, based on the Shout3D rendering engine. It shows satisfying performance of 24-60 frames per second with average-sized face models on a standard desktop computer.
Since the web client is completely Java based and requires no plug-ins, the LiveMail service is widely accessible, from any computer with a Java Virtual Machine installed.

Conclusion
This chapter presents a prototype system that enables users to deliver personalized, attractive content using simple manipulations on their phones. They create their own, fully personalized content and send it to other people. By engaging in a creative process – taking a picture, producing a 3D face from it, composing the message – the users have more fun, and the ways they use the application are limited only by their imagination.
LiveMail is expected to appeal to a younger customer base and to promote services like GPRS and MMS. It is expected to directly boost revenues from these services by increasing their usage. Due to the highly visual and innovative nature of the application, there is considerable marketing potential. The 3D faces can be delivered through various marketing channels, including the web and TV, and used for branding purposes.
The introduced system can be used as a pure entertainment application, but there are also other benefits to the system. Barriers were crossed with the face animation player on the mobile platform and with 3D face model personalization on the server. Various network technologies and face animation techniques were connected into one complex system, and the experience gained building such a system was presented. This face animation system, based on the MPEG-4 FBA standard, has low bandwidth requirements; combined with suitable facial feature tracking, the technology presented in this chapter could be used for very low bandwidth visual communication.