In several multimedia application contexts (educational, extreme gaming), interaction with the user requires that the system be able to render music in an expressive way. Expressiveness is the added value of a performance and is part of the reason that music is interesting to listen to. Understanding and modeling expressive content communication is important for many engineering applications in information technology (e.g., Music Information Retrieval, as well as several applications in the affective computing field). In this paper, we present an original approach to modifying the expressive content of a performance in a gradual way, applying a smooth morphing among performances with different expressive content in order to adapt the audio expressive character to the user’s desires. The system won the final stage of Rencon 2011. This performance RENdering CONtest is a research project that organizes contests for computer systems generating expressive musical performances.
In recent years, several services based on Web 2.0 technologies have been developed, proposing new modalities of social interaction for music creation and fruition [
Our studies on music expressiveness [
In this paper we present a system, namely, CaRo 2.0 (CAnazza-ROdà, from the names of the two main authors; besides,
The contribution of the performer to expression communication has two aspects: to clarify the composer’s message by elucidating the musical structure and to add his personal interpretation of the piece. A mechanical performance of a score is perceived as lacking musical meaning and is considered as dull and inexpressive as a text read without any prosodic inflection. Indeed, human performers never respect tempo, timing, and loudness notations in a mechanical way when they play a score: some deviations are always introduced, even if the performer explicitly wants to play mechanically [
Most studies on musical expression aim at understanding the systematic presence of
At a physical information level, the main parameters considered in the models of the musical expression, called
The analysis of these systematic deviations has led to the formulation of several models that try to describe their structure, with the aim of explaining where, how, and why a performer modifies, sometimes unconsciously, what is indicated by the notation in the score. It should be noted that, although deviations are only the external surface of something deeper and often not directly accessible, they are quite easily measurable and thus widely used to develop computational models for performance understanding and generation.
In general, musical expression refers both to the means used by the performer to convey the composer’s message and to his/her own contribution to enrich the musical message. Expressiveness related to the musical structure may depend on the dramatic narrative developed by the performer, on stylistic expectations based on cultural norms (e.g., jazz versus classical music), and on the actual performance situation (e.g., audience engagement). Recently, more interest has been given to the expressive component due to the personal interpretation of the performer [
Many studies (see, e.g., [
Note that the expressive intentions the performer tries to convey can sometimes be in contrast with the character of the musical piece. A slightly broader interpretation of expression as
When we talk of deviations, it is important to define the reference used for computing them. Very often the score is taken as
Most people have an informal understanding of musical expression. While its importance is generally acknowledged, its basic constituents are less clear. Often the simple expressive-inexpressive range is used. Regarding the affective component of music expression, the two theoretical traditions that have most strongly determined past research in this area are
The assumption of the
The focus of the
The most used representation in music research is the two-dimensional Valence-Arousal (V-A) space, even though other dimensions have been explored in several studies (e.g., see [
Unlike the studies on music and emotions, the authors focused on expressive intentions described by sensorial adjectives [
Correlation between coordinate axes and acoustic parameters.
| | Tempo | Legato | Intensity |
|---|---|---|---|
| Dim. 1 | | −0.28 | −0.25 |
| Dim. 2 | 0.33 | | |
Kinetics-Energy space, as a mid-level representation of expressive intentions.
We can use this interpretation of the Kinetics-Energy space as an indication of how listeners organised the performances in their own minds when focusing on sensory aspects. The robustness of this representation was confirmed by synthesising different and varying expressive intentions in a musical performance. We can notice that this representation lies at an abstraction level between the semantic one (such as emotion) and the physical one (such as timing deviations) and can thus be more effective in representing the listener’s evaluation criteria [
While the models described in the previous section were mainly developed for analysis and understanding purposes, they are also often used for synthesis. Starting from models of musical expression, several software systems for rendering expressive musical performances have been developed (see [
The systems for automatic expressive performances can be grouped into three general categories: (i) autonomous, (ii) feedforward, and (iii) feedback systems. Examples for each category are presented.
Given a score, the purpose of all the systems is to calculate the so-called
As an example,
A historical example is
Example of an application of the KTH rule system (adapted from [
The
In the VirtualPhilharmony [
The differences between the systems described above concern both the algorithms for computing the expressive deviations and the aspects related to user interaction. In the autonomous systems, the user can interact with the system only through the selection of the training set and, as a consequence, of the performance style that the system will learn. The feedforward systems allow a deeper interaction with the model parameters: the user can set the parameters, listen to the results, and then fine-tune the parameters again until the results are satisfying. The feedback systems allow real-time control of the parameters: the user can change the parameters of the performance while (s)he is listening to it, similarly to what a human musician does. Since the models for music performance usually have numerous parameters, a crucial aspect of feedback systems is how to allow the user to control all these parameters simultaneously in real time. VirtualPhilharmony allows the user to control only two parameters (intensity and tempo) in real time; the other ones are defined offline by means of a so-called performance template.
The system developed by the authors and described in the next sections is also a feedback system. As explained later, real-time control of the parameters is achieved by means of a control space based on a semantic description of different expressive intentions. Starting from the trajectories drawn by the user on this control space, the system maps concepts such as emotions or sensations onto the low-level parameters of the model.
CaRo 2.0 simulates the tasks of a pianist who reads a musical score, decodes its symbols, plans the expressive choices, and finally executes the actions in order to actually play the instrument. Usually, pianists analyse the score very carefully before the performance and they add annotations and cues to the musical sheet in order to emphasise rhythmic-melodic structures, section subdivisions, and other interpretative indications. As shown in Figure
System architecture: CaRo 2.0 receives as input an annotated score in MusicXML format and generates in real-time messages to play a MIDI controlled piano.
The specificity of CaRo 2.0, in comparison to the systems presented in Section
The graphical interface of CaRo 2.0 (see Figure
The first structure is a list of note events, each described by the following fields: ID number, onset (ticks), duration (ticks), bar position (ticks), bar length (beats), voice number, grace (boolean), and pitch, where ticks and beats are common units of measure for representing musical durations (for the complete, detailed MIDI specification, see
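As an illustration, the first structure could be represented as in the following minimal Python sketch; the class and field names are ours and do not reflect the actual CaRo 2.0 implementation.

```python
# Hypothetical representation of one entry of the note-event list.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    note_id: int       # ID number
    onset: int         # onset, in ticks
    duration: int      # duration, in ticks
    bar_position: int  # position of the enclosing bar, in ticks
    bar_length: int    # length of the bar, in beats
    voice: int         # voice number
    grace: bool        # True if the note is a grace note
    pitch: int         # pitch (e.g., as a MIDI note number)
```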
The second structure is a list of expressive cues, as specified in the score. Each entry is defined by type, ID of the linked event, and voice number (a sketch of this structure and of the cue-to-parameter correspondence is given after the table). Among the many different ways to represent expressive cues in a score, the system is currently able to recognise and render the expressive cues listed in Table
Correspondence among graphical expressive cues, XML representation, and rendering parameters.
| Graphical symbol | MusicXML code | Event value |
|---|---|---|
| • (staccato) | <articulations><staccato default-x="3" default-y="13" placement="above"/></articulations> | DR |
| > (accent) | <articulations><accent default-x="-1" default-y="13" placement="above"/></articulations> | KV |
| – (tenuto) | <articulations><tenuto default-x="1" default-y="14" placement="above"/></articulations> | DR |
| (breath mark) | <articulations><breath-mark default-x="16" default-y="18" placement="above"/></articulations> | |
| (slur) | <notations><slur number="1" placement="above" type="start"/></notations> | |
| (pedal) | <direction-type><pedal default-y="-91" line="no" type="start"/></direction-type> | KV |
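The sketch below (with names of our choosing, not taken from the CaRo 2.0 code) illustrates the second structure and the cue-to-parameter correspondence of the table above; DR and KV are read here as the duration and key-velocity rendering parameters, and the cues whose target parameter could not be recovered from the table are left out of the mapping.

```python
# Hypothetical representation of an expressive-cue entry and of the
# correspondence between cue type and affected rendering parameter.
from dataclasses import dataclass

@dataclass
class ExpressiveCue:
    cue_type: str   # e.g., "staccato", "accent", "tenuto", "pedal"
    event_id: int   # ID of the linked note event
    voice: int      # voice number

CUE_TO_PARAMETER = {
    "staccato": "DR",  # affects the rendered duration
    "accent":   "KV",  # affects the key velocity
    "tenuto":   "DR",  # affects the rendered duration
    "pedal":    "KV",  # affects the key velocity
    # breath mark and slur are recognised as well, but their target
    # parameter is not recoverable from the table above
}
```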
The third data structure describes the hierarchical structure of the piece, that is, its subdivision, from top to bottom, into periods, phrases, subphrases, and motifs. Each section is specified by an entry composed of begin (ticks), end (ticks), and hierarchical level number.
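A corresponding sketch, under the same assumptions, for the third structure:

```python
# Hypothetical representation of one section of the hierarchical structure.
from dataclasses import dataclass

@dataclass
class Section:
    begin: int   # begin, in ticks
    end: int     # end, in ticks
    level: int   # hierarchical level (period, phrase, subphrase, motif)
```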
For clarity, it is useful to express the duration of the musical events in seconds instead of metric units such as ticks and beats. This conversion is possible by taking the tempo marking written in the score, if any, or by normalising the total score length to the performance duration. This representation is called
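As a worked illustration of this conversion, assuming the usual MIDI convention of a fixed number of ticks per quarter note and a metronome marking expressed in beats per minute (both assumptions of ours, not details taken from the paper):

```python
def ticks_to_seconds(ticks: int, bpm: float, ticks_per_quarter: int) -> float:
    """Convert a duration in ticks to seconds, given the tempo marking."""
    seconds_per_tick = 60.0 / (bpm * ticks_per_quarter)
    return ticks * seconds_per_tick

# Example: at 120 BPM with 480 ticks per quarter note,
# 960 ticks (two quarter notes) correspond to 1.0 second.
assert abs(ticks_to_seconds(960, 120, 480) - 1.0) < 1e-9
```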
When the user selects the command to start the execution of the score, a timer is instantiated to provide a temporal reference for the processing of the musical events. Let
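A minimal, purely illustrative sketch of such a timer-driven playback loop follows; the event fields (onset_seconds, pitch, key_velocity) and the MIDI-output callback are placeholders of ours, not part of the actual system.

```python
import time

def play(events, send_midi_note_on):
    """Send each note event to the MIDI output when its onset time is reached."""
    start = time.monotonic()                   # timer giving the temporal reference
    for event in sorted(events, key=lambda e: e.onset_seconds):
        wait = event.onset_seconds - (time.monotonic() - start)
        if wait > 0:
            time.sleep(wait)                   # wait until the event is due
        send_midi_note_on(event.pitch, event.key_velocity)
```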
Schema of the
Multilayer model.
The abstract control space
Let
The abstract control space constitutes the interface between the user’s concepts of expression and their internal representation in the system. The user, following his/her own preferences, can create the semantic space which controls the expressive intention of the performance. Each user can select his/her own expressive ideas and then design the mapping of these concepts to positions and movements on this space (a sketch of one possible position-to-parameter mapping is given after the figures below). A preferences window (see Figure
The options window (designed for musicians) that allows the fine tuning of the model. For general public, it is possible to load preset configurations (see the button “load parameters from file”).
Map between the
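As anticipated above, one plausible way to turn a pointer position on the control space into the expressive parameters of the model is to interpolate among the user-defined expressive intentions. The following sketch uses inverse-distance weighting and entirely hypothetical parameter values; it is an illustration of the idea, not the actual CaRo 2.0 mapping.

```python
import math

def interpolate_parameters(pointer, anchors):
    """pointer: (x, y); anchors: list of ((x, y), {parameter: value}) pairs."""
    weighted, total = [], 0.0
    for (ax, ay), params in anchors:
        d = math.hypot(pointer[0] - ax, pointer[1] - ay)
        if d < 1e-9:
            return dict(params)        # pointer exactly on an anchor point
        w = 1.0 / (d * d)              # inverse-distance (squared) weight
        weighted.append((w, params))
        total += w
    result = {}
    for w, params in weighted:
        for name, value in params.items():
            result[name] = result.get(name, 0.0) + (w / total) * value
    return result

# Example: blending two hypothetical intentions placed at opposite corners.
anchors = [((0.0, 0.0), {"tempo": 0.8, "legato": 1.2, "intensity": 1.3}),  # e.g., "heavy"
           ((1.0, 1.0), {"tempo": 1.2, "legato": 0.7, "intensity": 0.8})]  # e.g., "light"
print(interpolate_parameters((0.5, 0.5), anchors))  # halfway blend of the two
```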
As a particular and significant case, the emotional and sensory control metaphors can be used. For example, a musician may find it more convenient to use the Kinetics-Energy space (see Figure
A control space designed for an educational context. The position of the emoticons follows the results of [
Specific and new expressive intentions can be associated with objects/avatars placed on the screen, and a movement of the pointer from one object to another continuously changes the resulting music expression. For example, a different type of music can be associated with each of the characters represented in the image. Moreover, the objects, with their expressive labels and parameters, can be moved, so that the control space can vary dynamically (see, e.g., Figure
The control space is used to change interactively the sound comment of the multimedia product (a cartoon from the movie “The Good, The Bad and The Ugly” by Sergio Leone, in this case). The author can associate different expressive intentions with the characters (here, see the blue labels 1, 2, and 3).
More generally, the system allows a creative design of abstract spaces, according to the artist’s fantasies and needs, by inventing new expressions and their spatial relations.
This section reports the results of some expressive music renderings obtained with CaRo 2.0, with the aim of showing how the system works. The beginning of the 3rd Movement of the Piano Sonata number 16 by L. van Beethoven (see Figure
The music excerpt used to evaluate the CaRo 2.0 software. The score has been transcribed in MusicXML format, including all the expressive indications (slurs, articulation marks, and dynamic cues).
The graphical user interface of CaRo 2.0. The dotted line shows the trajectory drawn for rendering the
Piano roll representation of the performances characterized by heavy (b) and light (c) expressive intention. The nominal performance (a) has been added as a reference.
Nominal
Heavy
Light
The system rendered the different performances, correctly playing the sequences of notes written in the score. Moreover, expressive deviations were computed depending on the user’s preferences specified by means of the control space. Figure
The Key-Velocity values of the performances characterised by the different expressive intentions (bright, hard, heavy, light, and soft). Only the melodic line is reported in order to have a more readable graph.
The values of the three main expressive parameters of the performance obtained drawing the trajectory of Figure
The most important public forum worldwide in which computer systems for expressive music performance are assessed is the Performance RENdering CONtest (Rencon) which was initiated in 2002 by Haruhiro Katayose and colleagues as a competition among different systems (
Results of the final stage of the Rencon 2011 contest. Each system was required to render two different performances of the same music piece. Each performance has been evaluated both by the audience attending the contest live and by people connected online through a streaming service.
| System | A (Live) | A (Online) | A (Total) | B (Live) | B (Online) | B (Total) | Total score |
|---|---|---|---|---|---|---|---|
| CaRo 2.0 | 464 | 56 | 520 | 484 | 61 | 545 | 1065 |
| YQX | 484 | 64 | 548 | 432 | 65 | 497 | 1045 |
| VirtualPhilharmony | 452 | 46 | 498 | 428 | 32 | 460 | 958 |
| Director Musices | 339 | 33 | 372 | 371 | 13 | 384 | 756 |
| Shunji System | 277 | 24 | 301 | 277 | 20 | 297 | 598 |
CaRo 2.0 could be integrated into the browser engines of music community services. In this way, (i) the databases would keep track of the expressiveness of the songs played and (ii) the users could create/browse playlists based not only on the song title or the artist name but also on the music expressiveness. As a consequence, (i) in rhythm games, the performance would be rated on the basis of the expressive intentions of the user, with benefits for the educational value of the game and the involvement of its users, and (ii) in music communities, the user would be able to search for happy or sad music, according to his/her affective state or preferences.
In rhythm games, despite their big commercial success (often bigger than that of the original music albums), the gameplay is almost entirely oriented around the player’s interaction with a musical score or individual songs by pressing specific buttons or activating controls on a specialized game controller. In these games, the reaction of the virtual audience (i.e., the rating assigned) and the result of the battle/cooperative mode are based on the player’s performance as judged by the computer. Up to now, game designers have not considered the player’s expressiveness. As future work, we intend to embed the CaRo system in these games, so that a music performance is rated as “perfect” not only when it is played exactly as written in the score, but also taking into account its interpretation and expressiveness. In this way, these games could also be used profitably in the educational field.
The authors declare that there is no conflict of interests regarding the publication of this paper.