Recently, shift-invariant tensor factorisation algorithms have been proposed for the purposes of sound source separation of pitched musical instruments. However, in practice, existing algorithms require the use of log-frequency spectrograms to allow shift invariance in frequency, which causes problems when attempting to resynthesise the separated sources. Further, it is difficult to impose harmonicity constraints on the recovered basis functions. This paper proposes a new additive synthesis-based approach which allows the use of linear-frequency spectrograms as well as imposing strict harmonic constraints, resulting in an improved model. Further, these additional constraints allow the addition of a source-filter model to the factorisation framework, as well as an extended model capable of separating mixtures of pitched and percussive instruments simultaneously.

The use of factorisation-based approaches for
the separation of musical sound sources dates back to the early 1980s when
Stautner used principal component analysis (PCA) to separate different tabla
strokes [

Factorisation-based approaches were initially applied
to single channel separation of musical sources [

A commonly used cost function is the generalised
Kullback-Leibler divergence proposed by Lee and Seung [
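As a concrete illustration (in Python/NumPy rather than the Matlab used later in the paper, and with a hypothetical function name), the generalised Kullback-Leibler divergence between two nonnegative arrays can be computed elementwise as follows:

```python
import numpy as np

def gen_kl_divergence(X, Y, eps=1e-12):
    """Generalised Kullback-Leibler divergence of Lee and Seung:
    D(X || Y) = sum( X * log(X / Y) - X + Y ), summed elementwise
    over nonnegative arrays. A small eps guards against log(0)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))
```

Unlike the Euclidean distance, this cost is scale-sensitive and reduces to zero only when the approximation matches the data exactly, which is why it is a popular choice for magnitude spectrogram factorisation.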

Regardless of the cost function used, the resultant
decomposition is linear, and as a result each basis function pair typically
corresponds to a single note or chord played by a given pitched instrument.
Therefore, in order to achieve sound source separation, some method is required
to group the basis functions by source or instrument. Different grouping
methods have been proposed in [

When dealing
with tensor notation, we use the conventions described by Bader and Kolda in [

Elementwise multiplication and division are
represented by

Recently, the
above matrix factorisation techniques have been extended to tensor
factorisation models to deal with stereo or multichannel signals by FitzGerald et al.
[

As a first approximation, many commercial stereo
recordings can be considered to have been created by obtaining single-channel
recordings of each instrument individually and then summing and distributing
these recordings across the two channels, with the result that for any given
instrument, the only difference between the two channels lies in the gain of
the instrument [
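Under this approximation, mixing reduces to applying a pair of per-source gains. A minimal sketch (hypothetical function name, Python/NumPy for illustration):

```python
import numpy as np

def mix_stereo(sources, gains):
    """Mix K single-channel sources into a stereo pair, where the only
    difference between the two channels is the gain of each source.
    sources: (K, N) array of mono signals; gains: (K, 2) left/right gains."""
    sources = np.asarray(sources, dtype=float)
    gains = np.asarray(gains, dtype=float)
    return gains.T @ sources  # (2, N) stereo signal
```

The relative gains between channels then encode each instrument's position in the stereo field, which is the information the tensor models exploit.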

The concept of
incorporating shift invariance in factorisation algorithms for sound source
separation was introduced in the convolutive factorisation algorithms proposed by
Smaragdis [

Shift invariance in the frequency basis functions was
later developed as a means of overcoming the problem of grouping the frequency
basis functions to sources, particularly in the case where different notes
played by the same instrument occurred over the course of a spectrogram
[

When incorporating shift invariance in the frequency
basis functions, it is assumed that all notes played by a single pitched
instrument consist of translated versions of a single frequency basis function.
This single instrument basis function is then assumed to represent the typical
frequency characteristics of that instrument. This is a simplification of the
real situation, where in practice, the timbre of a given instrument does change
with pitch [

Until now, the incorporation of shift invariance in
the frequency basis functions required the use of a spectrogram with
log-frequency resolution, such as the constant

If the frequency resolution of the log-frequency
transform is set so that the center frequencies of the bands are given by

In the context of this paper, translation of basis
functions is carried out by means of translation tensors, though other
formulations, such as the shift operator method proposed by Smaragdis [
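In the matrix case, a translation tensor slice is simply a shifted identity matrix; applying it to a basis function moves the spectral energy up in frequency. A sketch (hypothetical function name):

```python
import numpy as np

def translation_matrix(n_bins, shift):
    """Translation matrix T such that T @ b shifts a frequency basis
    function b up by `shift` bins (the vacated low bins are zero-filled)."""
    return np.eye(n_bins, k=-shift)

b = np.zeros(8)
b[2] = 1.0                            # single partial at bin 2
shifted = translation_matrix(8, 3) @ b  # the partial now sits at bin 5
```

Stacking such matrices for every allowable shift yields the translation tensors used in the models below.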

Research has also been done on allowing more general
forms of invariance, such as that of Eggert et al. on transformation invariant
NMF [

All of the
algorithms incorporating shift invariance can be seen as special cases of a
more general model, shifted 2D nonnegative tensor factorisation (SNTF),
proposed by FitzGerald [

Summary of the tensors used, their dimensions, and function, in the various shift-invariant factorisation models included in this paper. Tensors that occur in multiple models are not repeated.

Model        | Tensor function
SNTF         | Signal spectrograms
             | Approximation of the signal spectrograms
             | Instrument gains
             | Translation tensor (freq.)
             | Instrument basis functions
             | Note activations
             | Translation tensor (time)
SSNTF        | Harmonic dictionary
             | Harmonic weights
SF-SSNTF     | Formant filters
SF-SSNTF + N | Noise instrument gains
             | Noise basis functions
             | Noise activations
             | Noise translation tensor

When using SNTF, a given pitched instrument is
modelled by an instrument spectrogram which is translated up and down in
frequency to give different notes played by the instrument. The gain parameters
are then used to position the instrument in the correct position in the stereo
field. A spectrogram of the

As noted previously, the mapping from log-frequency to
linear-frequency domains is an approximate mapping and this can have an adverse
effect on the sound quality of the resynthesis. Various methods for performing
this mapping and obtaining an inverse CQT have been investigated [

In order to overcome these resynthesis problems,
Schmidt et al. proposed using the recovered spectrograms to create masks which are
then used to refilter the original spectrogram [

While SNTF has
been shown to be capable of separating mixtures of harmonic pitched instruments
[

An alternative approach to the problem of imposing
harmonicity constraints on the basis functions is to note that the magnitude
spectrum of a windowed sinusoid can be calculated directly in closed-form as a
shifted and scaled version of the window's frequency response [

For a given pitch and a given number of harmonics, the
magnitude spectra of the individual sinusoids can be stored in a matrix of size

It is also possible to take into account inharmonicity
in the positioning of the partials through the use of inharmonicity factors.
For example, in the case of instruments containing stretched strings,
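For stretched strings, the standard stiff-string formula places partial k at f_k = k f_0 sqrt(1 + B k^2), where B is the inharmonicity coefficient; a sketch (hypothetical function name):

```python
import numpy as np

def partial_frequencies(f0, n_partials, B=0.0):
    """Partial frequencies for a stiff (stretched) string:
    f_k = k * f0 * sqrt(1 + B * k**2). B = 0 recovers exact
    harmonics; B > 0 stretches the upper partials progressively
    sharper, as in piano strings."""
    k = np.arange(1, n_partials + 1)
    return k * f0 * np.sqrt(1.0 + B * k ** 2)
```

Using these stretched positions instead of exact integer multiples when building the harmonic dictionary allows instruments such as the piano to be modelled more accurately.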

These update equations are similar to those of SNTF,
just replacing

An advantage of SSNTF is that the separation problem is now completely formulated in the linear-frequency domain, thereby eliminating the need to use an approximate mapping from log to linear frequency domains at any point in the algorithm, which removes the potential for resynthesis artifacts due to the mapping. Resynthesis of the separated time-domain waveforms can be carried out in a similar manner to that of SNTF, or alternatively, one can take advantage of the use of the additive synthesis model to reconstruct the separated signal using additive synthesis.
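The additive-synthesis resynthesis can be sketched as a sum of weighted harmonic sinusoids, using the harmonic weights recovered by the factorisation (hypothetical function name; a real implementation would also track time-varying gains and phase):

```python
import numpy as np

def additive_resynth(f0, weights, duration, sr=44100):
    """Resynthesise a separated note as a sum of harmonically related
    sinusoids, one per recovered harmonic weight."""
    t = np.arange(int(duration * sr)) / sr
    out = np.zeros_like(t)
    for h, w in enumerate(weights, start=1):
        out += w * np.sin(2 * np.pi * h * f0 * t)
    return out
```

Because this synthesis never leaves the linear-frequency domain, it avoids the smearing introduced by mapping a log-frequency spectrogram back to linear frequency.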

The SSNTF algorithm was implemented in Matlab using
the Tensor Toolbox available from [

As an example
of the improved reconstruction that SSNTF can provide, Figure

Spectra of flute note, original, SNTF, and SSNTF, respectively.

Figure

Spectrogram of piano and flute mixture.

Spectrogram of flute signal, (a) original unmixed, (b) SNTF, (c) refiltered SNTF, (d) SSNTF, (e) source-filter SSNTF.

Spectrogram of piano signal, (a) original unmixed, (b) SNTF, (c) refiltered SNTF, (d) SSNTF, (e) source-filter SSNTF.

It should also be noted that the addition of harmonic constraints imposes restrictions on the solutions that can be returned by the factorisation algorithms. This is of considerable benefit when incorporating additional parameters into the models, as will be seen in the following sections.

As noted
previously in Section

When applied in the context of shifted instrument
basis functions, the instrument basis function represents a harmonic excitation
pattern which can be shifted up and down in frequency to generate different
pitches. A single fixed filter is then applied to these translated excitation
patterns, with the filter representing the instrument's resonant structure.
This results in a system where the instrument timbre varies with pitch,
resulting in a more realistic model. The instrument formant filters can be
incorporated into the shifted tensor factorisation framework through a formant
filter tensor
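The core of the source-filter idea, reduced to a single note, is an elementwise product of a translated excitation pattern with a fixed filter envelope; a minimal sketch (hypothetical function name):

```python
import numpy as np

def source_filter_spectrum(excitation, formant_filter, shift):
    """Source-filter model of one note: the harmonic excitation pattern
    is translated up in frequency by `shift` bins, then multiplied
    elementwise by a fixed formant filter representing the instrument's
    resonant structure."""
    shifted = np.zeros_like(excitation)
    if shift == 0:
        shifted[:] = excitation
    else:
        shifted[shift:] = excitation[:-shift]
    return shifted * formant_filter
```

Because the filter stays fixed while the excitation moves, higher notes are shaped differently from lower notes, giving the pitch-dependent timbre described above.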

Unfortunately, attempts to incorporate the
source-filter model into the SNTF framework were unsuccessful. The resultant
algorithm had too many parameters to optimise and it was difficult to obtain
good separation results. However, the additional constraints imposed by SSNTF
were found to make the problem tractable. The resultant model can then be described
as

Again using the generalised Kullback-Leibler divergence
as a cost function, the following update equations were
derived:

Filter returned for flute when using source-filter SSNTF.

On listening to the resynthesis, there was a marked
improvement in the sound quality of the flute in comparison with SSNTF, with
less high-frequency energy present. The resynthesis of the piano also improved,
though less so than that of the flute.
Figures

As a further example of source-filter SSNTF,
Figure

Spectrograms for (a) original flute spectrogram, (b) spectrogram recovered using source-filter SSNTF, and (c) spectrogram recovered using SSNTF.

Filter returned for solo flute example in
Figure

The above examples demonstrate the utility of using the source-filter approach as a means of improving the accuracy of the SSNTF model. This is borne out by the improved resynthesis of the separated sources.

Musical
signals, especially popular music, typically contain unpitched instruments such
as drum sounds in addition to pitched instruments. While allowing shift
invariance in both frequency and time is suitable for separating mixtures of
pitched instruments, it is not suitable for dealing with percussion instruments
such as the snare and kick drums, or other forms of noise in general. These
percussion instruments can be successfully captured by algorithms which allow
shift invariance in time only without the use of frequency shift invariance. In
order to deal with musical signals containing both pitched and percussive
instruments, or signals containing additional noise, it is necessary to have an algorithm
which handles both these cases. This can be done by simply adding the two
models together. This has previously been done by Virtanen in the context of
matrix factorisation algorithms [

Extending the concept to the case of tensor
factorisation techniques results in a generalised tensor factorisation model
for the separation of pitched and percussive instruments, which still allows
the use of a source-filter model for pitched instruments. The model can be
described by
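The additive combination of the two models can be illustrated in a heavily simplified form: a single dictionary holding both harmonic (pitched) and broadband (noise/percussive) basis functions side by side, with only the activations updated via the standard multiplicative rule for the generalised KL divergence. This is a sketch of the principle only; the full tensor model also updates gains, harmonic weights, filters, and translations.

```python
import numpy as np

def kl_nmf_activations(X, W, n_iter=200, eps=1e-12):
    """Fit X ~ W @ H for a fixed nonnegative dictionary W whose columns
    may mix harmonic and noise basis functions. H is updated with the
    multiplicative rule for the generalised KL divergence:
    H <- H * (W.T @ (X / (W @ H))) / (W.T @ 1)."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.T @ np.ones_like(X) + eps)
    return H
```

Partitioning the columns of W (and the corresponding rows of H) by type then yields separate pitched and percussive reconstructions from a single joint fit.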

As an example of the use of the combined model,
Figure

Mixture spectrograms of piano, flute, trumpet, snare, hi-hats, and kick drum.

Original spectrograms of (a) piano, (b) flute, (c) trumpet, (f) snare, (g) hi-hats, and (h) kick drum.

Separated spectrograms of (a) piano, (b) flute, (c) trumpet, (f) snare, (g) hi-hats and (h) kick drum.

The performances of SNTF, SNTF using refiltering,
SSNTF, source-filter SSNTF, and source-filter SSNTF with noise basis functions
in the context of modelling mixtures of pitched instruments were compared using
a set of 40 test mixtures. In the case of source-filter SSNTF with noise basis
functions, two noise basis functions were learned in order to aid the
elimination of noise and artifacts from the harmonic sources. The 40 test
signals were 4 seconds in duration and contained mixtures of melodies played by
different instruments and created by using a large library of orchestral
samples [

The 40 test signals consisted of 20 single channel
mixtures of 2 instruments and 20 stereo mixtures of 3 instruments, and these
mixtures were created by linear mixing of individual single channel instrument
signals. In the case of the single channel mixtures, the source signals were
mixed with unity gain, and in the case of the stereo mixtures, mixing was done
according to

Spectrograms were obtained for the mixtures, using a short-time Fourier transform with a Hann window of 4096 samples, with a hopsize of 1024 samples between frames. All variables were initialised randomly, with the exception of the frequency basis functions for SNTF-based separation, which were initialised with harmonic basis functions at the frequency of the lowest note played by each instrument in each example. This was done to put SNTF on an equal footing with the SSNTF-based algorithms, where the pitch of the lowest note of each source was provided. The number of allowable notes was set to the largest pitch range covered by an instrument in the test signal and the number of harmonics used in SSNTF was set to 12. The algorithms were run for 300 iterations, and the separated source spectrograms were estimated by carrying out contracted tensor multiplication on the tensor slices associated with an individual source. The recovered source spectrograms were resynthesised using the phase information from the mixture spectrograms. The phase of the channel where the source was strongest was used in the case of the stereo mixtures.
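The analysis settings above can be reproduced with a straightforward framed FFT (hypothetical function name; a library STFT would serve equally well):

```python
import numpy as np

def spectrogram(x, n_fft=4096, hop=1024):
    """Magnitude spectrogram with a Hann window of 4096 samples and a
    hopsize of 1024 samples between frames, matching the analysis
    settings used for the test mixtures. Returns (bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

The 75% overlap gives enough temporal redundancy for the mixture phase to be reused when inverting the separated magnitude spectrograms.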

Using the original source signals as a reference, the
performance of the different algorithms was evaluated using commonly used
metrics, namely the signal-to-distortion ratio (SDR), which provides an overall
measure of the sound quality of the source separation, the
signal-to-interference ratio (SIR), which measures the presence of other
sources in the separated sounds, and the signal-to-artifacts ratio (SAR), which
measures the artifacts present in the recovered signal due to separation and
resynthesis. Details of these metrics can be found in [
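As a simplified illustration of the idea behind these metrics, the SDR can be sketched by projecting the estimate onto the reference and treating the residual as distortion. The full BSS Eval metrics additionally decompose that residual into interference (SIR) and artifact (SAR) components using all the reference sources; the hypothetical function below captures only the SDR principle.

```python
import numpy as np

def sdr(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: the target component
    is the orthogonal projection of the estimate onto the reference,
    and everything else counts as distortion."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(error, error))
```

Higher values indicate a cleaner separation; adding noise to an otherwise perfect estimate lowers the score accordingly.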

A number of different tests were run to determine the
effect of signal duration on the performance of the algorithms and to determine
the effect of using different numbers of allowable shifts in time. For the
tests on signal duration, the mixture signals were truncated to lengths of 1,
2, 3, and 4 seconds, the number of time shifts was set to 5, and the
performance of the algorithms was evaluated. A summary of the results obtained
is shown in Figure

Performance evaluation of SNTF (circle solid), refiltered SNTF (diamond solid), SSNTF (square dash-dotted), source-filter SSNTF (triangle solid), and source-filter SSNTF with noise basis functions (star dashed) for various signal durations.

In the case of the algorithms incorporating source filtering, performance improved with increased signal duration. This is particularly evident in the case of the SIR metric. This demonstrates that longer signal durations are required to properly capture filters for each instrument. This is to be expected, as increased numbers of notes played by each instrument provide more information from which to learn the filter, while the harmonic model, with fewer parameters, does not require as much information for training. It should be noted that this trend was less evident in the stereo mixtures than in the mono mixtures, suggesting that the spatial positioning of sources in the stereo field may affect the ability to learn the source filters. This could be tested by measuring the separation of the sources while varying the mixing coefficients and is an area for future investigation. Nonetheless, it can be seen that at longer durations the source-filter approaches outperform SSNTF, with the basic source-filter model performing better in terms of SDR and SAR, while the source-filter plus noise approach performs better in terms of SIR.

The results from testing the effect of the number of
time shifts on the separation of the sources are shown in
Figure

Performance evaluation of SNTF (circle solid), refiltered SNTF (diamond solid), SSNTF (square dash-dotted), source-filter SSNTF (triangle solid), and source-filter SSNTF with noise basis functions (star dashed) for various allowable shifts in time.

On listening to the separated sources, the SSNTF-based approaches clearly outperform SNTF. It should be noted that in some cases, SNTF using refiltering resulted in audio quality comparable to the SSNTF-based approaches; however, this was only in a small number of examples. In the majority of cases the addition of the source filter improves on the results obtained by SSNTF. On comparing the source-filter approach to the source-filter plus noise model, it was observed that the results varied from mixture to mixture, with a considerable improvement in resynthesis quality of some sources and a reduction in quality in other cases, while in a large number of tests no major differences could be heard in the results. This shows that in many cases, for clean mixture signals of pitched instruments, there is no need to incorporate noise basis functions. Nevertheless, the use of noise basis functions is still useful in the presence of noise or percussion instruments. It should also be noted that in half of the test mixtures SNTF did not manage to correctly separate the sources, which, in conjunction with the distortion caused by the smearing of the frequency bins in the mapping from log to linear frequency, goes a long way towards explaining the negative SDR and SIR scores. While SNTF using refiltering resulted in improved resynthesis in the cases where the sources had been separated correctly, it also suffered from the reliability issues of the underlying SNTF technique, and this is reflected in the poor scores for all metrics. This indicates that the SSNTF-based techniques are considerably more robust than SNTF-based techniques.

The separated sources can also be resynthesised via an additive synthesis approach, and on listening, the results obtained were comparable to those obtained from the spectrogram-based resynthesis. However, as the additive synthesis approach uses different phase information than the spectrogram-based resynthesis, the results are not comparable using the metrics used in this paper. This highlights the need to develop a set of perceptually-based metrics for sound source separation and is an area for future research.

Also investigated was the goodness of fit of the
models to the original spectrogram data, as measured by the cost function. It
was observed that the results obtained for SSNTF were on average 64% smaller
than those for SNTF, despite the fact that SSNTF has a smaller number of free
parameters, as the number of harmonics was considerably smaller than the number
of frequency bins used in the constant

Overall it can be seen that the methods proposed in this paper offer a considerable improvement over previous separation methods using SNTF. Large improvements can be seen in the performance metrics over the previous SNTF method, and it can also be seen that the proposed models result in an improved fit to the original data.

The use of shift-invariant tensor factorisations for the purposes of musical sound source separation, with a particular emphasis on pitched instruments, has been discussed, and problems with existing algorithms were highlighted. The problem of grouping notes to sources can be overcome by incorporating shift invariance in frequency into the factorisation framework, but comes at the price of requiring the use of a log-frequency representation. This causes considerable problems when attempting to resynthesise the separated sources as there is no exact mapping available to map from a log-frequency representation back to a linear-frequency representation, which results in considerable degradation in the sound quality of the separated sources. While refiltering can overcome this problem to some extent, there are still problems with resynthesis.

A further problem with existing techniques was also highlighted, in particular the lack of a strict harmonic constraint on the recovered frequency basis functions. Previous attempts to impose harmonicity used an ad hoc constraint where the basis functions were zeroed in regions where no harmonic activity was expected. While this does guarantee that there will be no activity in these regions, it does not guarantee that the basis functions recovered will have the shape that a sinusoid would have if present in these regions.

Sinusoidal shifted 2D nonnegative tensor factorisation was then proposed as a means of overcoming both of these problems simultaneously. It takes advantage of the fact that a closed form solution exists for calculating the spectrum of a sinusoid of known frequency, and uses an additive-synthesis inspired approach for modeling pitched instruments, where each note played by an instrument is modelled as the sum of a fixed number of weighted sinusoids in harmonic relation to each other. These weights are considered to be invariant to changes in the pitch, and so each note is modelled using the same weights regardless of pitch. The frequency spectrum of the individual harmonics is calculated in the linear frequency domain, eliminating the need to use a log-frequency representation at any point in the algorithm, and harmonicity constraints are imposed explicitly by using a signal dictionary of harmonic sinusoid spectra. Results show that using this signal model results in a better fit to the original mixture spectrogram than algorithms involving the use of a log-frequency representation, thereby demonstrating the benefits of being able to perform the optimisation solely in the linear-frequency domain.

However, it should be noted that the proposed model is not without drawbacks. In particular, best results were obtained if the pitch of the lowest note of each pitched instrument was provided to the algorithm. In most cases this information will not be readily available, and this necessitates the use of the standard shifted 2D nonnegative tensor factorisation algorithm to estimate these pitches before using the sinusoidal model. Research is currently ongoing on other methods to overcome this problem, but despite this, it is felt that the advantages of the new algorithm more than outweigh this drawback.

Using the same harmonic weights or instrument basis function regardless of pitch is only an approximation to the real world situation where the timbre of an instrument does change with pitch. To overcome this limitation, the incorporation of a source-filter model into the tensor factorisation framework had previously been proposed by others. Unfortunately, in the context of sound source separation, it was found that it was difficult to obtain good results using this approach as there were too many parameters to optimise. However, the addition of the strict harmonicity constraint proposed in this paper was found to restrict the range of solutions sufficiently to make the problem tractable.

It had previously been observed that the addition of harmonic constraints was required to create a system which could handle both pitched and percussive instrumentations simultaneously. However, previous attempts at such systems suffered due to the use of log-frequency representations and the lack of a strict harmonic constraint. The combined model presented here extends this earlier work from single channel to multichannel signals, and overcomes these problems by use of sinusoidal constraints applied in the linear-frequency domain, as well as incorporating the source filter model into the system, and so represents a more general model than those previously proposed.

In testing using common source separation performance metrics, the extended algorithms proposed were found to considerably outperform existing tensor factorisation algorithms, with considerably reduced signal distortion and artifacts in the resynthesis. The extended algorithms were also found to be more reliable than SNTF-based approaches.

In conclusion, it has been demonstrated that use of an additive-synthesis based approach for modelling instruments in a factorisation framework overcomes problems associated with previous approaches, as well as allowing extensions to existing models. Future work will concentrate on the improvement of the proposed models, both in terms of increased generality and in improved resynthesis of the separated sources, as well as investigating the effects of the mixing coefficients on the separations obtained. It is also proposed to investigate the use of frequency domain performance metrics as a means of increasing the perceptual relevance of source separation metrics.

This research was part of the IMAAS project funded by Enterprise Ireland. The authors wish to thank Mikel Gainza, Matthew Hart, and Dan Barry for their helpful discussions and comments during the preparation of this paper. The authors also wish to thank the reviewers for their helpful comments which resulted in a much improved paper.