Generation of Variations on Theme Music Based on Impressions of Story Scenes

Abstract: This paper describes the construction of a system that transforms theme music to fit story scenes represented by texts and/or pictures, and generates variations on the theme music. Inputs to the proposed system are the original theme music and numerical information on the given story scenes. The system varies (1) melodies, (2) tempos, (3) tonalities, and (4) accompaniments of the given theme music based on impressions of the story scenes. Neural network models are applied to the music generation in order to reflect the user's sensitivity to music and stories. This paper also describes evaluation experiments that confirm whether the generated variations on theme music appropriately reflect impressions of the generated story scenes.


INTRODUCTION
Music, pictures, and/or text information are combined into multimedia content in which they interact with one another [1]. The effectiveness of multimodal communication that combines different modal media has been analyzed in the field of cognitive psychology [1]. Owing to the development of information technology, multimodal communication is expected to become part of everyday life [2]. However, interaction among different modal media does not necessarily arise from a simple, random combination of them. The features and impressions of each medium must be considered carefully in order to create effective multimedia content. Creating multimodal content therefore costs more time and labor than creating single-modal content. For this reason, support systems for the creation of multimodal content and for the flexible combination of different modal media have attracted interest [3,4].
The authors are studying the construction of a system that generates variations on theme music fitting each story scene represented by texts and/or pictures [5]. This system varies melodies, tempos, tones, tonalities, and accompaniments of a given theme music based on impressions of story scenes. The system has two sections, which represent (a) relations between story scenes and musical images and (b) relations between features of variations and musical impressions. Since people differ in how they feel about stories and music [6], and since this difference is important in multimedia content creation, these relations must be considered separately for each user. In [5] these relations are obtained from questionnaire data, that is, offline; the present paper proposes a method that adjusts the relations for each user online. In this paper, the transformation of theme music is defined as follows: tunes, tones, musical performances, rhythms, and tempos are varied according to story scenes [7].

Inputs and outputs
Inputs to the present system are the original theme music and numerical information on the given story scenes. Outputs are MIDI files of variations on the original theme music generated according to each story scene. This paper deals with the generation of variations on theme music fitting stories obtained by the system [8] that generates story-like linguistic [...]. The information on story scenes, shown in Table 1, is acquired from each picture [8]. These are the inputs to the present system.

System structure
The present system consists of two sections, a musical image acquisition (MIA) section and a theme music transformation (TMT) section, as shown in Figure 1. The MIA section converts the information on story scenes shown in Table 1 into transformation image parameters (TIPs) by means of modular neural network (MNN) models [9]. The TMT section transforms the inputted original theme music based on the values of TIPs, and generates a set of MIDI-formatted candidate variations on the theme music for each story scene. The TMT section applies genetic algorithms (GAs), with MNN models as fitness functions, to the generation of candidate variations. MNN models consist of three neural network models: an average model network (AMN), an individual variation model network (IVMN), and a gating network. AMN is a hierarchical neural network model expressing users' average feeling of music and stories. IVMN is a radial basis function (RBF) network model expressing differences among users' feelings of music and stories. The gating network switches between AMN and IVMN. The present system adjusts the IVMN and the gating network for each user.

MUSICAL IMAGE ACQUISITION (MIA) SECTION
The MIA section is constructed from MNN models. The inputs to the MNN models are shown in Table 1. The MNN models estimate the values of TIPs, which represent the musical image used to transform the original theme music. In this paper, the TIPs consist of pairs of adjectives selected with reference to a study that retrieves musical works of many genres using pairs of adjectives representing musical image [10]. The candidate pairs are happy-sad, heavy-light, hard-soft, stable-unstable, clear-muddy, calm-violent, smooth-rough, and thick-thin. Pre-experiments are performed in order to confirm which pairs of adjectives are necessary for the TIPs. The procedure of the pre-experiments is as follows.
(1) With musical instruments, tempos, tonalities, tones, chords in a melody part, and accompaniment part patterns set at random, 125 variations are generated.
(3) If the subjects feel that it is difficult to evaluate the difference among the variations with some pairs of adjectives, they report those pairs.
The results of the pre-experiments show that it is difficult to evaluate the difference among the variations using the adjective pairs hard-soft, stable-unstable, smooth-rough, or thick-thin. These four pairs are therefore not used in this paper. That is, the four remaining pairs of adjectives (happy-sad, heavy-light, clear-muddy, and calm-violent), which are parameters on the degree of change from the original theme music shown in Table 2, are used. Each parameter value is a real number in [0.0, 1.0]. The MIA section estimates the values of TIPs from information on a picture scene. When variations on theme music are generated to fit story scenes, the information on story scenes needed to estimate the values of TIPs depends on the media representing the story (for example, pictures, texts, or animations) and on the contents of the story (for example, a serious story or a story for children), and is not determined uniquely. It is therefore necessary to consider which information on picture scenes should be selected for the estimation of the values of TIPs. However, since the input to the present system is limited in this paper to information on picture scenes, the paper does not discuss this point. In the future it will be necessary to change the information according to the media representing a story or the form of a story.
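As a concrete illustration of the TIPs described above, the following minimal Python sketch stores the four adjective pairs that remain after the pre-experiments and enforces the stated range [0.0, 1.0]. The class name `TIPs`, the field names, and the validation behavior are assumptions made for illustration; the source does not specify a data layout or which end of each adjective pair corresponds to 0.0 or 1.0.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TIPs:
    """Hypothetical container for the four transformation image parameters."""
    happy_sad: float
    heavy_light: float
    clear_muddy: float
    calm_violent: float

    def __post_init__(self):
        # Each parameter is a real number in [0.0, 1.0] according to the text.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0.0, 1.0], got {value}")

    def as_vector(self):
        # Fixed ordering so the values can be fed to a model as a plain list.
        return [getattr(self, f.name) for f in fields(self)]
```

A call such as `TIPs(0.2, 0.5, 0.9, 0.1).as_vector()` yields a four-element list, while an out-of-range value raises `ValueError`.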

Procedure on generation of variations [5]
Inputs to the TMT section are the original theme music and the values of TIPs obtained by the MIA section, and outputs are MIDI files of variations on the theme music. The MIDI files consist of the melody part and six accompaniment parts. The accompaniment parts consist of an obbligato part, backing parts 1 and 2, a bass part, a pad part, and a drums part. The TMT section modifies the impression of the inputted original theme music by varying the following components of the MIDI files [5]: (1) scores of melody parts, (2) tempos, (3) tonalities, (4) accompaniment patterns of accompaniment parts, and (5) tones.

Structure of TMT section
The TMT section transforms the given original theme music according to the inputted TIPs and outputs sets of MIDI-formatted candidate variations on the given theme music, as shown in Figure 2. GAs are applied to the transformation of the given theme music to fit the TIPs, where a variation generated from the given theme music is represented by a chromosome in the framework of GAs. In this paper, GA parameters are abbreviated as follows.

Structure of chromosome
Each variation consists of three kinds of chromosomes: a melody chromosome, an accompaniment chromosome, and a status chromosome.
The melody chromosome holds the score information of the melody part, represented in the format shown in Figure 3. The given original theme music is represented as the initial chromosome. The accompaniment chromosome holds the accompaniment part information. The playing pattern number and the performance type of the obbligato part in the accompaniment part are represented by chromosomes, where each item is represented with 1 byte as shown in Figure 3. Initial chromosomes are given random values for these items. The status chromosome holds information on tempo, tonality, and tone. Tempo, tonality, melody part tone, and obbligato part tone are represented by a chromosome as shown in Figure 3. Tempo (60-200 [BPM]), tonality (a major or minor scale), and tone are each represented with 1 byte. Initial chromosomes are given random values for these items.
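The three-chromosome layout described above can be sketched as plain Python data classes. This is only an illustrative reading of the text: the actual byte layout is given in Figure 3, which is not reproduced here, so the field names, the use of MIDI note numbers for the melody, and the helper `random_variation` are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List

TEMPO_MIN, TEMPO_MAX = 60, 200  # BPM range stated in the text


@dataclass
class AccompanimentChromosome:
    pattern_number: int   # playing pattern number, one byte (0-255)
    obbligato_type: int   # performance type of the obbligato part, one byte


@dataclass
class StatusChromosome:
    tempo_bpm: int        # 60-200 BPM, fits in one byte
    is_major: bool        # tonality: major or minor scale
    melody_tone: int      # tone of the melody part, one byte
    obbligato_tone: int   # tone of the obbligato part, one byte


@dataclass
class Variation:
    melody: List[int]     # assumption: melody score as MIDI note numbers
    accompaniment: AccompanimentChromosome
    status: StatusChromosome


def random_variation(theme_melody, rng):
    """Initial individual: melody copied from the theme, the rest random."""
    return Variation(
        melody=list(theme_melody),
        accompaniment=AccompanimentChromosome(rng.randrange(256), rng.randrange(256)),
        status=StatusChromosome(
            tempo_bpm=rng.randint(TEMPO_MIN, TEMPO_MAX),
            is_major=rng.random() < 0.5,
            melody_tone=rng.randrange(256),
            obbligato_tone=rng.randrange(256),
        ),
    )
```

Calling `random_variation([60, 62, 64], random.Random(0))` produces one initial individual whose melody equals the theme and whose accompaniment and status fields are drawn at random, mirroring the initialization described in the text.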

Calculation of fitness value [5]
Fitness values of chromosomes are calculated according to the inputted values of TIPs and the melodies of the original theme music [5]. Let $i$ ($i = 1, 2, \ldots, N$) be the chromosome number, that is, the variation number, and let $\mathit{Fitness}_i$ denote the fitness value of the $i$th variation. $\mathit{Fitness}_i$ is defined from two terms: $\mathit{MelodyFitness}_i$, the fitness value of the score information in the melody part of the $i$th variation, defined with reference to [11], and $\mathit{ImpressionFitness}_i$, the fitness value of the impressions of the $i$th variation [5]. Impression values of variations are estimated by MNN models. These impression values are the degrees of the four pairs of adjectives used in the TIPs estimation. The MNN models are obtained from the relation between the feature spaces of variations and the impression values.
International Journal of Computer Games Technology
The smaller the value of $\mathit{Fitness}_i$, the better the $i$th variation. The procedure for calculating fitness values is shown in Figure 4.

GA operations
$(N - N_{new})$ individuals are selected as parent candidates by tournament selection according to the fitness values obtained in 4.2.2. Crossover with probability $P_c$ and mutation with probability $P_m$ are applied to the parent candidates. $N_{new}$ individuals are generated at random. Crossover and mutation are performed as follows.
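The composition of one generation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the tournament size `k=2`, the pairing of consecutive parents, and the callables `crossover`, `mutate`, and `random_individual` are hypothetical placeholders. Lower fitness is treated as better, as stated in the text.

```python
import random


def tournament_select(population, fitness, rng, k=2):
    """Return the individual with the lowest fitness among k random contenders."""
    contenders = rng.sample(range(len(population)), k)
    best = min(contenders, key=lambda i: fitness[i])
    return population[best]


def next_generation(population, fitness, n_new, p_c, p_m,
                    crossover, mutate, random_individual, rng):
    n = len(population)
    # (N - N_new) parent candidates chosen by tournament selection.
    parents = [tournament_select(population, fitness, rng) for _ in range(n - n_new)]
    children = []
    for i in range(0, len(parents), 2):
        a, b = parents[i], parents[(i + 1) % len(parents)]
        if rng.random() < p_c:          # crossover with probability P_c
            a, b = crossover(a, b, rng)
        children.extend([a, b])
    children = children[:n - n_new]     # trim the extra child when the count is odd
    # Mutation with probability P_m per individual.
    children = [mutate(c, rng) if rng.random() < p_m else c for c in children]
    # N_new individuals generated at random.
    children.extend(random_individual(rng) for _ in range(n_new))
    return children
```

With a population of 100 and `n_new=20`, the returned list again has 100 individuals, of which the last 20 come from `random_individual`.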

Crossover
Uniform crossover is applied to melody chromosomes group by group, where the groups are obtained by the grouping structure analysis of the generative theory of tonal music [12].
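One possible reading of group-wise uniform crossover is sketched below: each melody is assumed to be already segmented into aligned groups, and for every group a coin flip decides whether the two parents exchange that group. The grouping analysis itself is not implemented here, and the representation of a melody as a list of note-number lists, the function name, and the 0.5 swap probability are assumptions for illustration.

```python
import random


def groupwise_uniform_crossover(parent_a, parent_b, rng, swap_prob=0.5):
    """Uniform crossover that treats each pre-computed group as one exchange unit.

    parent_a and parent_b are lists of groups; each group is a list of notes.
    Both parents are assumed to have the same number of groups.
    """
    child_a, child_b = [], []
    for group_a, group_b in zip(parent_a, parent_b):
        if rng.random() < swap_prob:
            group_a, group_b = group_b, group_a   # exchange this group
        child_a.append(list(group_a))             # copy so parents stay untouched
        child_b.append(list(group_b))
    return child_a, child_b
```

For every group position, the two children together contain exactly the two parental groups, so no material is lost or duplicated by the operator.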

Mutation
Random values are assigned to the accompaniment chromosome and the status chromosome. The method of varying the score information of the melody part described in [5] is applied to melody chromosomes.

MNN STRUCTURE
The present system uses MNN models to represent (1) relations between story scenes and values of TIPs in the MIA section, and (2) relations between features of variations and musical impressions in the TMT section. The MNN models in the present system consist of AMN, IVMN, and the gating network, as shown in Figure 5. When the present system adjusts its MNN models for each user, IVMN and the gating network are obtained by learning from the user's data on individual variation in feeling of music and stories. AMN is a hierarchical neural network model that consists of sigmoid neurons. AMN is constructed using questionnaire data on subjects' feelings for music and stories. The questionnaire data are obtained with reference to [6].
IVMN is a hierarchical neural network model that consists of RBF neurons. An RBF responds to input values only within a local area, so an RBF network can be adjusted online easily and quickly. When a user is not satisfied with the outputs of MNN, learning data for IVMN are generated and saved in the present system. The input values of the learning data are the input values of MNN. The output values of the learning data are the evaluation values given by each user.
The gating network is an RBF network that switches between AMN and IVMN. The gating network judges whether or not the input values of MNN are close to the area learned by IVMN. When a user is not satisfied with the outputs of MNN, learning data for IVMN are generated and saved in the present system. The learning data of the gating network are the input values of MNN. The output values of the learning data of the IVMN are the evaluation values given by users.
IVMN and the gating network are constructed by the method proposed in [13] using all data saved in the present system.
The outputs of the MNN models are defined in terms of the following quantities: $g(x)$ is the output value of the gating network, $f_{\mathrm{personal}}(x)$ is the output value of the IVMN, $f_{\mathrm{average}}(x)$ is the output value of the AMN, and $t$ is the threshold for switching between AMN and IVMN.
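The switching behavior described above can be illustrated with a small Python sketch. The defining equation is not available in this text, so the rule used here (use the IVMN output when the gating output reaches the threshold, otherwise the AMN output) is an assumption inferred from the surrounding description. The Gaussian RBF, the `gate` function that takes the strongest response over stored inputs, and the `width` parameter are likewise hypothetical stand-ins for the networks trained by the method of [13].

```python
import math


def gaussian_rbf(x, center, width):
    """Local response: 1.0 at the center, decaying with squared distance."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * width ** 2))


def gate(x, centers, width):
    """Hypothetical gating output g(x): strongest RBF response over stored inputs."""
    return max((gaussian_rbf(x, c, width) for c in centers), default=0.0)


def mnn_output(x, f_average, f_personal, centers, width=0.1, t=0.75):
    """Assumed switching rule: IVMN near user-corrected inputs, AMN elsewhere."""
    if gate(x, centers, width) >= t:
        return f_personal(x)
    return f_average(x)
```

With one stored input at `(0.2, 0.8)`, a query at the same point is routed to `f_personal`, while a distant query such as `(0.9, 0.1)` falls back to `f_average`. This reflects the local nature of RBF responses that makes online adjustment practical.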

EXPERIMENTS
Experiments with 8 undergraduate/graduate students as subjects are performed to evaluate the present system. In the experiments, the GA parameters are set to the following values: $N = 100$, $T = 100$, $N_{user} = 3$, $N_{new} = 20$, $P_c = 70\%$, $P_m = 20\%$. The threshold for switching between AMN and IVMN by the gating network is set to 0.75. Musical works are chosen at random from seventeen prepared MIDI files of classical tunes or folk tunes, and are used as the theme music of stories.

Construction of IVMN and gating network
IVMN and the gating network for each subject are constructed by the following procedure.
(1) Story scenes and theme music are inputted to the present system. The present system generates $N$ variations according to each story scene and outputs them.
(2) When the subject is satisfied with one of the outputted variations, go to (8). When the subject is not satisfied with any of the variations, go to (3).
(3) The subject looks at the values of TIPs estimated by the present system. The values of TIPs are presented to the subject in the form shown in Figure 6.
(4) The present system adjusts the MNN models according to two cases, as shown in Figure 7: the subject feels that (a) the presented musical image is not suitable for the story scenes, or (b) the generated variations are different from the presented musical image.
(a) When the subject feels that the presented musical image is not suitable for the story scenes, the subject evaluates whether the values of TIPs fit the story [...] that, although the same scenes are given to the subjects, the theme tunes are transformed in various ways by the present system. Variations on theme music generated by the present system depend on each subject's impressions of the story expressed by the pictures. Therefore, even if the same pictures are given, the generated variations differ among subjects. Nevertheless, the subjects themselves are satisfied with the generated variations. It is thus found that the present system generates variations on theme music that fit the subjects' impressions of the story well. However, a subject's impressions of a story usually change with time and with the environment in which the subject is placed. The present system does not deal with variations that depend on such factors as time and environment. This is left for future work.

CONCLUSIONS
This paper presents a system that transforms theme music to fit story scenes represented by texts and/or pictures, and generates variations on the theme music. The present system varies (1) melodies, (2) tempos, (3) tones, (4) tonalities, and (5) accompaniments of a given theme music based on impressions of story scenes, using neural network models and GAs. Differences in people's feelings of music and stories are important in multimedia content creation. This paper proposes a method that adjusts the models in the present system for each user. The results of the experiments show that the system transforms theme music so that it reflects the user's impressions of story scenes.

Table 2: Transformation image parameters.

(2) Some subjects, who do not have more than 3 years of experience playing musical instruments, listen to the variations and express their impressions of them with 8 pairs of adjectives.

(1) $N$: population size
(2) $T$: maximum number of generations
(3) $N_{new}$: number of individuals generated randomly
(4) $N_{user}$: partial population size presented to the user
(5) $P_c$: crossover probability
(6) $P_m$: mutation probability

Procedures in the TMT section are as follows.
(1) $N$ variations are generated from the inputted theme music in the form of chromosomes.
(2) Fitness values of chromosomes are calculated according to the inputted values of TIPs and the melodies of the original theme music.
(3) GA operations of crossover and mutation are performed, and the next-generation population is generated. Go back to step (2).
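The three-step procedure above amounts to a standard generational loop, which can be sketched as follows. The callables `init_population`, `evaluate`, and `step` are hypothetical placeholders for the chromosome initialization, fitness calculation, and GA operations. Stopping after $T$ generations and returning the $N_{user}$ individuals with the lowest fitness are assumptions, since the text states only that $N_{user}$ is the size of the partial population presented to the user.

```python
def run_ga(init_population, evaluate, step, max_generations, n_user):
    """Generational loop: evaluate, apply GA operations, repeat up to T times."""
    population = init_population()                 # step (1)
    for _ in range(max_generations):
        fitness = [evaluate(ind) for ind in population]   # step (2)
        population = step(population, fitness)            # step (3)
    # Final evaluation; lower fitness is better.
    fitness = [evaluate(ind) for ind in population]
    ranked = sorted(range(len(population)), key=lambda i: fitness[i])
    return [population[i] for i in ranked[:n_user]]
```

As a toy usage, with numbers as individuals, `abs` as the fitness, and a `step` that halves every individual, three generations starting from `[8.0, -4.0, 2.0]` and `n_user=2` return `[0.25, -0.5]`.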

Table 1: Information on story scenes.