Blind Audio Source Separation with Sparse Nonnegative Matrix Factorization

This study proposes a new source separation technique based on Two-Dimensional Nonnegative Matrix Factorization (NMF2D) with the Beta-divergence. The Time-Frequency (TF) profile of each source is modeled as a two-dimensional convolution of the temporal code and the spectral basis. In addition, an adaptive sparsity constraint is imposed to reduce the ambiguity and provide uniqueness to the solution. The proposed model uses the Beta-divergence as its cost function and is updated by maximizing the joint probability of the mixing spectral basis and temporal codes using multiplicative update rules. Experimental tests have been conducted on an audio application to blindly separate the sources in a musical mixture. The results show the effectiveness of the algorithm in separating the audio sources from a single-channel mixture.


INTRODUCTION
Nonnegative Matrix Factorization (NMF) (Lee and Seung, 1999) has become one of the most promising techniques in signal processing. NMF has been successfully applied in various applications such as automatic music transcription (FitzGerald, 2004), cryptography (Xie et al., 2008) and pattern recognition (Biciu et al., 2007), among others. One of the most useful properties of NMF is that the nonnegativity constraint by itself enforces a sparse representation of the data. This representation makes the encoded data easier to estimate because the data are encoded using only a few active components. Given a matrix $\mathbf{Y}$ of dimension $F \times N$ with nonnegative elements, NMF is the problem of finding the approximate factorization:
$$\mathbf{Y} \approx \mathbf{W}\mathbf{H} \qquad (1)$$
where $\mathbf{W} \in \Re_{+}^{F \times C}$ and $\mathbf{H} \in \Re_{+}^{C \times N}$ are nonnegative matrices. $F$ represents the number of frequency bins while $N$ represents the number of time slots in the TF domain. $\mathbf{W}$ contains the spectral basis vectors while $\mathbf{H}$ represents the amplitude of each basis vector at each time point. $C$ is the number of components used to represent the data and is chosen such that $FC + CN \ll FN$, so that the data are compressed to their integral components. This problem can be formulated as the minimization of an objective function:
$$\min_{\mathbf{W}, \mathbf{H} \geq 0} D(\mathbf{Y} \,|\, \mathbf{W}\mathbf{H}) = \sum_{f=1}^{F} \sum_{n=1}^{N} d\left(Y_{f,n} \,|\, [\mathbf{W}\mathbf{H}]_{f,n}\right) \qquad (2)$$
where $d$ is a scalar divergence. A common way to measure how close $\mathbf{Y}$ and $\mathbf{W}\mathbf{H}$ are is to use the so-called Beta-divergence (Kompass, 2005; Fevotte et al., 2009), defined by:
$$d_{\beta}(x \,|\, y) = \begin{cases} \dfrac{1}{\beta(\beta - 1)} \left( x^{\beta} + (\beta - 1)\, y^{\beta} - \beta\, x\, y^{\beta - 1} \right), & \beta \in \Re \setminus \{0, 1\} \\[2mm] x \log \dfrac{x}{y} - x + y, & \beta = 1 \\[2mm] \dfrac{x}{y} - \log \dfrac{x}{y} - 1, & \beta = 0 \end{cases} \qquad (3)$$
The special cases β = 0, 1 and 2 (the first two obtained as limits) correspond to the Itakura-Saito (IS) divergence, the Kullback-Leibler (KL) divergence and the Least Square (LS) distance, respectively. In particular, they underlie multiplicative Gamma observation noise, Poisson noise and additive Gaussian observation noise, respectively.
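As an illustration of the Beta-divergence and its three special cases, the following is a minimal NumPy sketch. It assumes the parametrization of Fevotte et al. (2009), in which β = 0, 1 and 2 give the IS divergence, the KL divergence and half the squared Euclidean distance; the function name `beta_divergence` is hypothetical and the inputs are assumed to be strictly positive.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of element-wise Beta-divergences d_beta(x | y) for strictly positive arrays.

    Assumed parametrization: beta = 0 -> Itakura-Saito, beta = 1 -> Kullback-Leibler,
    beta = 2 -> half the squared Euclidean (Least Square) distance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 0:
        ratio = x / y
        return float(np.sum(ratio - np.log(ratio) - 1.0))
    if beta == 1:
        return float(np.sum(x * np.log(x / y) - x + y))
    num = x**beta + (beta - 1.0) * y**beta - beta * x * y**(beta - 1.0)
    return float(np.sum(num / (beta * (beta - 1.0))))

# Example: compare a small "spectrogram" with an approximation of it.
Y = np.array([1.0, 2.0, 3.0])
Y_hat = np.array([1.5, 2.0, 2.5])
for b in (0, 0.9, 1, 2):
    print(b, beta_divergence(Y, Y_hat, b))
```

The divergence is zero when the two arguments coincide and, for values of β close to 0 or 1, the general expression approaches the IS and KL formulas.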
The recently developed Two-Dimensional NMF (NMF2D) model (Morup and Schmidt, 2006) extends the NMF model in order to provide a decomposition that can efficiently capture the temporal dependency of the frequency patterns within the source. In NMF2D, the Time-Frequency (TF) profile of each source is modeled as a two-dimensional convolution of the temporal code and the spectral basis. This significantly reduces the number of components per source needed in the decomposition. So far, no research work has applied the general framework of the Beta-divergence to NMF2D. This study accommodates the Beta-divergence cost function in the NMF2D model and investigates the effect of β on the performance of the algorithm. To further improve the algorithm, this study proposes a sparseness constraint to be imposed in the cost function to reduce the ambiguity associated with the estimation of the spectral basis and temporal codes.

METHODOLOGY
Two-dimensional nonnegative matrix factorization:
In deriving the nonnegative matrix factorization framework, we first consider a source model of $\mathbf{Y}$ defined as follows:
$$\mathbf{Y} \approx \sum_{j=1}^{J} \sum_{\tau} \sum_{\phi} \overset{\downarrow \phi}{\mathbf{W}_{j}^{\tau}} \; \overset{\rightarrow \tau}{\mathbf{H}_{j}^{\phi}} \qquad (4)$$
where $J$ is the number of sources.
Proposed separation method: In this section, a new algorithm of two-dimensional sparse nonnegative matrix factorization using the sparse Beta-divergence NMF2D is developed. The algorithm optimizes the parameters of the signal model based on the multiplicative update rule using gradient descent. A sparse representation is a representation of data in which most coefficients are zero. It is proving to be a particularly interesting and powerful tool, especially for the analysis and processing of audio signals. If each signal to be separated has a sparse representation, then there is a good chance that there will be little overlap between the small sets of coefficients used to represent the different source signals. Therefore, by selecting the coefficients used by each source signal, we can restore each of the original signals with most of the interference from the unwanted signals removed. We now incorporate the Beta-divergence defined in (3) with the sparsity constraint, so that the cost function to be minimized is:
$$C(\mathbf{W}, \mathbf{H}) = D_{\beta}\left( \mathbf{Y} \,\Big|\, \sum_{\tau} \sum_{\phi} \overset{\downarrow \phi}{\mathbf{W}^{\tau}} \; \overset{\rightarrow \tau}{\mathbf{H}^{\phi}} \right) + \lambda f(\mathbf{H}) \qquad (5)$$
where $\mathbf{W}^{\tau}$ and $\mathbf{H}^{\phi}$ collect the factors of all $J$ sources. The parameter λ controls the sparsity constraint and $f(\mathbf{H})$ can be any function with positive derivative, such as the $L_1$-norm given by $f(\mathbf{H}) = \|\mathbf{H}\|_{1} = \sum_{\phi, j, n} H^{\phi}_{j,n}$.
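To make the NMF2D source model and the sparse cost function concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes zero-filled shifts, stores the slices as arrays `W[tau]` (shape F x C) and `H[phi]` (shape C x N), and uses an L1 penalty for f(H). All function names are hypothetical.

```python
import numpy as np

def shift_down(A, k):
    """Move every row of A down by k positions, filling the top rows with zeros."""
    out = np.zeros_like(A)
    if k < A.shape[0]:
        out[k:, :] = A[:A.shape[0] - k, :]
    return out

def shift_right(A, k):
    """Move every column of A right by k positions, filling the left columns with zeros."""
    out = np.zeros_like(A)
    if k < A.shape[1]:
        out[:, k:] = A[:, :A.shape[1] - k]
    return out

def nmf2d_model(W, H):
    """Sum over tau and phi of shift_down(W[tau], phi) @ shift_right(H[phi], tau).

    W has shape (n_tau, F, C) and H has shape (n_phi, C, N); the result is F x N.
    """
    n_tau, F, _ = W.shape
    n_phi, _, N = H.shape
    approx = np.zeros((F, N))
    for tau in range(n_tau):
        for phi in range(n_phi):
            approx += shift_down(W[tau], phi) @ shift_right(H[phi], tau)
    return approx

def beta_divergence(x, y, beta):
    """Summed Beta-divergence; beta = 0 and beta = 1 are handled as the IS and KL limits."""
    if beta == 0:
        return float(np.sum(x / y - np.log(x / y) - 1.0))
    if beta == 1:
        return float(np.sum(x * np.log(x / y) - x + y))
    num = x**beta + (beta - 1.0) * y**beta - beta * x * y**(beta - 1.0)
    return float(np.sum(num / (beta * (beta - 1.0))))

def sparse_cost(Y, W, H, beta, lam, eps=1e-12):
    """Beta-divergence between Y and the NMF2D model plus lam times the L1 norm of H."""
    approx = nmf2d_model(W, H) + eps  # eps guards against log(0) and division by zero
    return beta_divergence(Y + eps, approx, beta) + lam * float(H.sum())
```

With a single slice in each direction (`n_tau = n_phi = 1`) the model reduces to the ordinary product `W[0] @ H[0]`, which is a convenient sanity check.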
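This resolves the ambiguity between the factors by imposing sparseness on $\mathbf{H}^{\phi}$ and forcing the structure onto $\mathbf{W}^{\tau}$. In this study, we employ the multiplicative gradient descent approach, which consists of updating each parameter by multiplying its value at the previous iteration by a certain coefficient. Writing $\mathbf{\Lambda} = \sum_{\tau} \sum_{\phi} \overset{\downarrow \phi}{\mathbf{W}^{\tau}} \, \overset{\rightarrow \tau}{\mathbf{H}^{\phi}}$ for the model approximation of $\mathbf{Y}$, the derivatives of (5) with respect to $\mathbf{W}^{\tau}$ and $\mathbf{H}^{\phi}$ are given by:
$$\frac{\partial C}{\partial \mathbf{W}^{\tau}} = \sum_{\phi} \overset{\uparrow \phi}{\left( \mathbf{\Lambda}^{\beta - 1} - \mathbf{Y} \cdot \mathbf{\Lambda}^{\beta - 2} \right)} \left( \overset{\rightarrow \tau}{\mathbf{H}^{\phi}} \right)^{T} \qquad (6)$$
$$\frac{\partial C}{\partial \mathbf{H}^{\phi}} = \sum_{\tau} \left( \overset{\downarrow \phi}{\mathbf{W}^{\tau}} \right)^{T} \overset{\leftarrow \tau}{\left( \mathbf{\Lambda}^{\beta - 1} - \mathbf{Y} \cdot \mathbf{\Lambda}^{\beta - 2} \right)} + \lambda \frac{\partial f(\mathbf{H})}{\partial \mathbf{H}^{\phi}} \qquad (7)$$
where the arrows ↑φ and ←τ denote the upward and left shift operators and the powers of $\mathbf{\Lambda}$ are taken element-wise. Thus, by applying the standard gradient descent approach:
$$\mathbf{W}^{\tau} \leftarrow \mathbf{W}^{\tau} - \eta_{W} \cdot \frac{\partial C}{\partial \mathbf{W}^{\tau}} \qquad (8)$$
$$\mathbf{H}^{\phi} \leftarrow \mathbf{H}^{\phi} - \eta_{H} \cdot \frac{\partial C}{\partial \mathbf{H}^{\phi}} \qquad (9)$$
where $\eta_{W}$ and $\eta_{H}$ are positive learning rates which can be obtained by following Lee and Seung (1999), namely:
$$\eta_{W} = \frac{\mathbf{W}^{\tau}}{\sum_{\phi} \overset{\uparrow \phi}{\mathbf{\Lambda}^{\beta - 1}} \left( \overset{\rightarrow \tau}{\mathbf{H}^{\phi}} \right)^{T}}, \qquad \eta_{H} = \frac{\mathbf{H}^{\phi}}{\sum_{\tau} \left( \overset{\downarrow \phi}{\mathbf{W}^{\tau}} \right)^{T} \overset{\leftarrow \tau}{\mathbf{\Lambda}^{\beta - 1}} + \lambda \dfrac{\partial f(\mathbf{H})}{\partial \mathbf{H}^{\phi}}}$$
Thus, the multiplicative learning rules become:
$$\mathbf{W}^{\tau} \leftarrow \mathbf{W}^{\tau} \cdot \frac{\sum_{\phi} \overset{\uparrow \phi}{\left( \mathbf{Y} \cdot \mathbf{\Lambda}^{\beta - 2} \right)} \left( \overset{\rightarrow \tau}{\mathbf{H}^{\phi}} \right)^{T}}{\sum_{\phi} \overset{\uparrow \phi}{\mathbf{\Lambda}^{\beta - 1}} \left( \overset{\rightarrow \tau}{\mathbf{H}^{\phi}} \right)^{T}} \qquad (10)$$
$$\mathbf{H}^{\phi} \leftarrow \mathbf{H}^{\phi} \cdot \frac{\sum_{\tau} \left( \overset{\downarrow \phi}{\mathbf{W}^{\tau}} \right)^{T} \overset{\leftarrow \tau}{\left( \mathbf{Y} \cdot \mathbf{\Lambda}^{\beta - 2} \right)}}{\sum_{\tau} \left( \overset{\downarrow \phi}{\mathbf{W}^{\tau}} \right)^{T} \overset{\leftarrow \tau}{\mathbf{\Lambda}^{\beta - 1}} + \lambda \dfrac{\partial f(\mathbf{H})}{\partial \mathbf{H}^{\phi}}} \qquad (11)$$
For (10) and (11), $\mathbf{A} \cdot \mathbf{B}$ denotes element-wise multiplication and $\frac{\mathbf{A}}{\mathbf{B}}$ denotes element-wise division. The updates are repeated until convergence. Table 1 presents the main steps of the proposed algorithm for blind separation using sparse NMF2D with the Beta-divergence.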
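The multiplicative learning rules can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the gradient of the Beta-divergence is split into its positive part (Λ^(β-1)) and negative part (Y·Λ^(β-2)), shifts are zero-filled, the sparsity term is an L1 penalty whose derivative is 1, and W is not normalised between iterations (sparse NMF variants often normalise W so that the penalty cannot be absorbed by rescaling). The names `shift`, `model` and `update` are hypothetical.

```python
import numpy as np

def shift(A, rows=0, cols=0):
    """Zero-filled shift: rows > 0 moves down, rows < 0 up; cols > 0 right, cols < 0 left."""
    F, N = A.shape
    out = np.zeros_like(A)
    if abs(rows) >= F or abs(cols) >= N:
        return out
    src_r = slice(max(-rows, 0), F - max(rows, 0))
    dst_r = slice(max(rows, 0), F - max(-rows, 0))
    src_c = slice(max(-cols, 0), N - max(cols, 0))
    dst_c = slice(max(cols, 0), N - max(-cols, 0))
    out[dst_r, dst_c] = A[src_r, src_c]
    return out

def model(W, H):
    """NMF2D approximation; W has shape (n_tau, F, C), H has shape (n_phi, C, N)."""
    approx = np.zeros((W.shape[1], H.shape[2]))
    for tau in range(W.shape[0]):
        for phi in range(H.shape[0]):
            approx += shift(W[tau], rows=phi) @ shift(H[phi], cols=tau)
    return approx

def update(Y, W, H, beta=0.9, lam=0.5, eps=1e-12):
    """One multiplicative update of every W[tau], followed by every H[phi]."""
    n_tau, n_phi = W.shape[0], H.shape[0]

    # Update the spectral bases using the current approximation.
    approx = model(W, H) + eps
    neg, pos = Y * approx ** (beta - 2), approx ** (beta - 1)
    W_new = np.empty_like(W)
    for tau in range(n_tau):
        num = sum(shift(neg, rows=-phi) @ shift(H[phi], cols=tau).T for phi in range(n_phi))
        den = sum(shift(pos, rows=-phi) @ shift(H[phi], cols=tau).T for phi in range(n_phi))
        W_new[tau] = W[tau] * num / (den + eps)

    # Update the temporal codes; lam is the derivative of the L1 penalty times lambda.
    approx = model(W_new, H) + eps
    neg, pos = Y * approx ** (beta - 2), approx ** (beta - 1)
    H_new = np.empty_like(H)
    for phi in range(n_phi):
        num = sum(shift(W_new[tau], rows=phi).T @ shift(neg, cols=-tau) for tau in range(n_tau))
        den = sum(shift(W_new[tau], rows=phi).T @ shift(pos, cols=-tau) for tau in range(n_tau))
        H_new[phi] = H[phi] * num / (den + lam + eps)
    return W_new, H_new
```

Because each factor is only ever multiplied by a ratio of nonnegative terms, nonnegative initial values stay nonnegative throughout the iterations, which is the main practical appeal of the multiplicative form.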

Reconstruction of the separated sources:
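From the mixture $\mathbf{Y}$, we seek the two estimated sources, which are:
$$\hat{\mathbf{X}}_{j} = \sum_{\tau} \sum_{\phi} \overset{\downarrow \phi}{\mathbf{W}_{j}^{\tau}} \; \overset{\rightarrow \tau}{\mathbf{H}_{j}^{\phi}}, \qquad j = 1, 2$$
Then, by using the binary masking technique (Wang, 2005), we obtain the mask $\mathbf{M}_{j}$ as follows:
$$M_{j}(f, n) = \begin{cases} 1, & \text{if } \hat{X}_{j}(f, n) > \hat{X}_{i}(f, n) \text{ for } i \neq j \\ 0, & \text{otherwise} \end{cases}$$
Then, the time-domain estimated signal $\hat{x}_{j}$ is obtained by resynthesizing $\mathbf{M}_{j}$ with the mixture $\mathbf{Y}$, i.e.,
$$\hat{x}_{j} = \text{Resynthesize}\left( \mathbf{M}_{j} \cdot \mathbf{Y} \right)$$
Here, 'resynthesize' signifies the inverse mapping of the log-frequency axis to the original frequency axis, followed by the inverse Short-Time Fourier Transform (STFT) back to the time domain.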
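The masking step can be illustrated with a short NumPy sketch. It assumes that each TF bin is assigned to the source whose estimate is strictly larger than all the others, so that ties receive a zero in every mask, and it covers only the masking itself, not the inverse log-frequency mapping or the inverse STFT. The function names are hypothetical.

```python
import numpy as np

def binary_masks(estimates):
    """Return one binary mask per source from estimates of shape (J, F, N), J >= 2.

    A bin is 1 for source j only where its estimate strictly exceeds every other source.
    """
    est = np.asarray(estimates, dtype=float)
    masks = np.zeros_like(est)
    for j in range(est.shape[0]):
        strongest_other = np.delete(est, j, axis=0).max(axis=0)
        masks[j] = est[j] > strongest_other
    return masks

def apply_masks(Y, masks):
    """Element-wise product of each mask with the mixture spectrogram Y."""
    return masks * Y

# Example with two small source estimates.
X1 = np.array([[3.0, 1.0], [0.0, 2.0]])
X2 = np.array([[1.0, 4.0], [0.0, 1.0]])
M = binary_masks([X1, X2])
separated = apply_masks(X1 + X2, M)
```

Each masked spectrogram in `separated` would then be passed to the resynthesis stage to obtain the time-domain signal.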

RESULTS AND DISCUSSION
The proposed algorithm is tested on audio signals containing a synthetic piano sound and a trumpet sound. The mixture is approximately 5 sec long and sampled at 16 kHz. In this experiment, an STFT using a 2048-point Hamming window with 50% overlap was used, which gives 175 frequency bins in the log-frequency spectrogram within the range of 50 Hz to 8 kHz with 24 bins/octave. This corresponds to twice the resolution of the equal tempered musical scale. Figure 1 shows the original TF representations of the piano and trumpet sources as well as their mixture. For audio separation, after conducting Monte-Carlo experiments over 50 independent realizations of the mixture, the parameters for the convolutive factors τ and the shifts are set to $\tau_{\max} = 8$ and $L = 32$. This setting was found to give the best representation of the temporal code and spectral basis in the factorization for most music signals. To evaluate the proposed algorithm, the performance is measured using the Signal-to-Distortion Ratio (SDR), which measures the overall sound quality of the source separation, together with the Source-to-Artifacts Ratio (SAR) and the Source-to-Interference Ratio (SIR). The MATLAB implementation of these measures can be found in Fevotte et al. (2005).
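The analysis parameters quoted above can be checked with a few lines of arithmetic. The sketch below only reproduces the stated numbers (16 kHz sampling, 2048-point window, 50% overlap, 50 Hz to 8 kHz at 24 bins/octave); taking the floor of the bin count is an assumption about how the log-frequency axis is truncated, and the variable names are illustrative.

```python
import math

fs = 16_000              # sampling rate in Hz
n_fft = 2048             # window length in samples
overlap = 0.5            # 50% overlap between consecutive windows
f_min, f_max = 50.0, 8_000.0
bins_per_octave = 24     # twice the 12 semitones per octave

window_ms = 1000 * n_fft / fs            # window duration in milliseconds
hop = int(n_fft * (1 - overlap))         # hop size in samples
octaves = math.log2(f_max / f_min)       # number of octaves covered
n_log_bins = math.floor(bins_per_octave * octaves)

print(window_ms, hop, round(octaves, 3), n_log_bins)
```

The range from 50 Hz to 8 kHz spans about 7.32 octaves, which at 24 bins/octave gives the 175 log-frequency bins reported in the text.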
Beta performance analysis: We now investigate the effect of β on the performance of the proposed algorithm. Figure 2 shows the average SDR values obtained for various values of β using the multiplicative update NMF2D algorithm. The value of β was varied from 0 to 2 in steps of 0.1, which covers the Least Square (LS) distance, the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence NMF2D. The average separation performance was obtained from the estimated SDR value for each source in a trumpet-piano mixture, thereby providing a measure of overall separation for each signal. From Fig. 2, the performance increases with β and reaches its peak at β = 0.9, where an average SDR value of 13.5 dB is obtained for each source. A tail-off in performance occurs as β increases from 0.9 to 2. This experiment suggests that β around 0.9 is an optimal value for audio separation, and this value is used in the experiment in the next sub-section.

Fig. 2: Separation results for various values of β

Blind audio source separation results:
Here, we compare the audio source separation performance of the proposed Beta-divergence sparse NMF2D algorithm with that of the same algorithm without the sparsity constraint. We set β = 0.9 for both algorithms. The best value of the sparsity parameter λ was identified as 0.5 after conducting Monte-Carlo experiments over 50 independent realizations. $L_1$-norm regularization is used to resolve the ambiguity by forcing all structure in $\mathbf{H}$ onto $\mathbf{W}$. Figure 3 shows the separation results in the log-frequency spectrogram for both algorithms. Compared with the original sources in Fig. 1, it is visually clear that separation by Beta-divergence NMF2D without sparseness in Fig. 3A and B leads to a poor result, since the factorization still contains the mixed signal (indicated by the box-marked area). This is because the absence of the sparsity constraint leads to component ambiguity, i.e., lack of uniqueness in the decomposition. In contrast, better performance is obtained when the decomposition of the spectral bases and temporal codes is performed with the sparsity constraint.
From Table 2, both Beta-divergence NMF2D algorithms in general provide good results in terms of SDR, SIR and SAR. An SDR of over 10 dB has been recorded for both methods. However, the performance of Beta-divergence NMF2D with the sparsity constraint is superior, with an average SDR improvement of 2.7 dB per source compared with the algorithm without sparseness. In percentage terms, this translates to an average improvement of 20%. From this result, it can be inferred that the sparsity constraint has a significant effect on the separation performance.

CONCLUSION
The use of the Beta-divergence for audio source separation with the NMF2D model has been investigated. The Beta-divergence with β = 0.9 was found to produce the best result. Furthermore, the proposed sparse Beta-divergence NMF2D is developed under a probabilistic framework, which enables sparseness to be incorporated in the solution. This significantly reduces the ambiguity problem in the factorization. We confirmed through an experiment that the proposed algorithm performs exceptionally well in separation of

The matrix $\mathbf{W}^{\tau}$ represents the $\tau$-th slice of the spectral basis and $\mathbf{H}^{\phi}$ represents the $\phi$-th slice of the temporal code for each spectral basis element. The vertical arrow in $\overset{\downarrow \phi}{\mathbf{W}^{\tau}}$ denotes the downward shift operator, which moves each element in the matrix down by $\phi$ rows. By the same token, the horizontal arrow in $\overset{\rightarrow \tau}{\mathbf{H}^{\phi}}$ denotes the right shift operator, which moves each element in the matrix by $\tau$ columns to the right. The factorization for the NMF2D source model in (4) is based on a model that represents temporal structure and pitch change. In audio processing, the model represents each instrument by a single time-frequency profile convolved in both time and frequency by a time-pitch weight matrix. This model greatly decreases the number of components needed to model various instruments and efficiently solves the monaural source separation problem. In the following, a novel algorithm of sparse NMF2D with Beta-divergence is proposed to estimate the parameter of

Fig. 1: Log-frequency spectrogram of (A) trumpet, (B) piano, (C) convolutive mixed signal

Table 2: Separation results for NMF2D with beta-divergence