Noise Robust Speech Parameterization Using Relative Spectra and Auditory Filterbank

In the present study, a new feature extraction method based on relative spectra and a gammachirp auditory filterbank is proposed for robust recognition of noisy speech. Relative spectral filtering is applied to the logarithm of the output of the gammachirp filterbank, which incorporates the properties of the cochlear filter, in order to remove uncorrelated additive noise components. The performance of this method has been evaluated on isolated speech words corrupted by real-world noisy environments, using continuous Gaussian-mixture density Hidden Markov Models. The experimental results show that the proposed method achieves better recognition rates than conventional techniques such as Perceptual Linear Prediction (PLP), Linear Predictive Cepstral Coefficients (LPCC) and Mel-Frequency Cepstral Coefficients (MFCC).


INTRODUCTION
In many practical applications, the performance of an Automatic Speech Recognition (ASR) system is limited by its lack of robustness in the presence of background noise. ASR relies on speech feature vectors which contain the information needed to distinguish between different speech sounds. To increase the robustness of ASR systems, the speech features must be less sensitive to background noise while retaining good discriminative properties (Gajic and Paliwal, 2006). The most commonly used feature extraction algorithms, such as PLP (Perceptual Linear Prediction) (Hermansky, 1990), LPCC (Linear Prediction Cepstral Coefficients) (Atal, 1974) and MFCC (Mel-Frequency Cepstral Coefficients) (Davis and Mermelstein, 1980), are strongly affected by noisy environments. Other algorithms aim at improving noise robustness by combining the classic algorithms with additional techniques such as RASTA (Relative Spectra) filtering (Hermansky and Morgan, 1994) or CMN (Cepstral Mean Normalization) (Liu et al., 1993; Shao et al., 2007; Droppo and Acero, 2008).
In addition, the human auditory system has a remarkable ability to recognize speech in noisy environments. This ability has inspired the development of many feature extraction algorithms which take into account certain knowledge of human speech perception (Gajic and Paliwal, 2006). These algorithms usually use the gammatone filter as the auditory filter model in order to simulate cochlear filtering (Wang and Brown, 2006; Meddis et al., 2010). A more recent auditory filter, known as the gammachirp filter, was developed by Irino and Patterson (1997, 2006). This filter, which has an asymmetric amplitude spectrum, provides a good approximation to the asymmetry and level-dependent characteristics of cochlear filtering (Meddis et al., 2010).
A robust feature extractor for noisy speech recognition is presented in this study. The proposed method is based on relative spectra and a gammachirp filterbank. The relative spectral processing is a band-pass time filtering applied to the logarithm of the output spectral representation of the gammachirp filterbank in order to reduce linear channel distortions, which appear as additive components in the logarithmic spectral domain. The filterbank consists of 34 gammachirp filters covering the frequency range from 50 to 8000 Hz (Zouhir and Ouni, 2013, 2014). The gammachirp filter is used as a model of the auditory filter to provide a spectrum reflecting the spectral behavior of the cochlea (Irino and Patterson, 1997, 2006; Patterson et al., 2003).
The HTK (Hidden Markov Model Toolkit) recognizer (Young et al., 2009) is employed for isolated-word speech recognition with whole-word HMM-GM (HMM with four-component Gaussian mixture densities) models. Each isolated word is modeled by a five-state HMM with four mixtures per state.
Isolated speech words extracted from the TIMIT database (Garofolo et al., 1990) and corrupted by real-world noisy environments are used to evaluate the performance of the proposed feature extractor.
For comparison, the following conventional techniques are used: PLP, LPCC and MFCC. Experimental results in the presence of ambient background noise show that the proposed feature extractor outperforms all of these classical techniques.

Classical features for speech recognition:
The classical feature extractors MFCC, PLP and LPCC share several processing stages. In Fig. 1, these similar stages are linked by broken arrows. The procedure used to obtain the coefficients of each technique is briefly described here.
The MFCC coefficients: A Discrete Fourier Transform (DFT) is computed for each frame of windowed speech to obtain a short-term power spectrum. The power spectrum of the speech signal is then weighted by the magnitude frequency response of a Mel-scale filterbank which uses triangular-shaped windows. Logarithmic compression is applied to the Mel-filterbank output. The cepstral coefficients are then obtained by a Discrete Cosine Transform (DCT) (Davis and Mermelstein, 1980).
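The MFCC chain described above (power spectrum, triangular Mel filterbank, logarithm, DCT) can be sketched as follows. This is a minimal illustration, not the configuration used in the experiments: the number of filters (26), the FFT size (512), the particular Mel-scale formula and the unnormalized DCT-II are assumptions that the text does not specify.

```python
import numpy as np

def hz_to_mel(f):
    # One common Mel-scale formula (assumed; the paper does not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    fbank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def mfcc_frame(frame, fs, n_filters=26, n_ceps=12, n_fft=512):
    """Return n_ceps cepstral coefficients for one speech frame."""
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2      # short-term power spectrum
    energies = mel_filterbank(n_filters, n_fft, fs) @ power
    log_energies = np.log(np.maximum(energies, 1e-12))     # floor avoids log(0)
    # Unnormalized DCT-II, keeping coefficients 1..n_ceps.
    k = np.arange(1, n_ceps + 1)[:, None]
    j = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (j + 0.5) / n_filters)
    return basis @ log_energies
```

A 25 msec frame at 16 kHz contains 400 samples, so `mfcc_frame(frame, 16000)` with a 400-sample `frame` returns 12 static coefficients.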
Fig. 1: Flowcharts for MFCC, PLP and LPCC feature extraction techniques
The PLP coefficients: As in the MFCC procedure, the power spectrum is first calculated using the discrete Fourier transform. An auditory-based warping of the frequency axis is then employed to weight the obtained spectrum. The window shape used in PLP analysis is designed to simulate the critical-band masking curves. After the filtered power spectrum is pre-emphasized by an equal-loudness curve, a cubic-root compression of the critical-band energies is applied, whereas MFCC uses logarithmic compression. The resulting spectrum is converted into LP coefficients using Auto-Regressive (AR) modelling. The PLP coefficients are computed by applying a cepstral transformation to the LP coefficients (Hermansky, 1990).
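The AR modelling stage can be illustrated with the standard Levinson-Durbin recursion applied to an autocorrelation sequence obtained by inverse DFT of a power spectrum. The sketch below is generic; the function names are hypothetical, and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is an assumption, since the text does not fix one.

```python
import numpy as np

def autocorr_from_power(power, order):
    """Autocorrelation lags 0..order from a one-sided power spectrum (inverse DFT)."""
    return np.fft.irfft(power)[: order + 1]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an all-pole model.

    r     : autocorrelation values r[0..order]
    return: (a, err) with a[0] = 1 and A(z) = sum_k a[k] z^-k,
            err being the final prediction error power.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for model order i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

For an autocorrelation of the form r[k] = 0.5^k (a first-order AR process), the recursion returns a = [1, -0.5, 0, ...], which is a quick sanity check of the convention.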
The LPCC coefficients: The LPC coefficients are first extracted from each speech frame using the autocorrelation method. Then, 12 cepstral coefficients, which constitute the LPCC coefficients, are computed from the obtained eight LPC coefficients using a cepstral transform (Atal, 1974).
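The cepstral transform from LP coefficients to cepstral coefficients is usually implemented as a short recursion. The sketch below assumes the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and returns the cepstrum of 1/A(z) without a gain term; both choices are assumptions made for illustration, and `lpc_to_cepstrum` is a hypothetical name.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model 1/A(z).

    a : array [1, a_1, ..., a_p]; n_ceps may exceed p
        (e.g. 12 cepstral coefficients from 8 LP coefficients).
    """
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

For a single pole at 0.5, i.e. a = [1, -0.5], the result is c_n = 0.5^n / n, which matches the power series of -ln(1 - 0.5 z^-1).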

METHODOLOGY
Proposed feature extractor:
The proposed feature extraction method is based on relative spectra and a gammachirp auditory filterbank for robust noisy speech recognition. An illustrative block diagram of the various steps of the proposed feature extractor is shown in Fig. 2.
In the first step, the speech signal is divided into frames (the analysis frame length is 25 msec with a frame shift of 10 msec) and each frame is windowed using a Hamming window. The squared magnitude of the Discrete Fourier Transform (DFT) of each windowed segment is then computed to obtain the power spectrum.
The second step is the relative spectra-gammachirp filterbank. In this step, the power spectrum is analyzed using a 34-channel gammachirp filterbank whose centre frequencies cover the range 50-8000 Hz (sampling frequency = 16 kHz) and are distributed according to the ERB-rate scale. This filterbank was developed to provide a realistic auditory filterbank for auditory perception models. The complex analytic form of the gammachirp filter is (Irino and Patterson, 1997):
$$g_c(t) = a\,t^{n-1}\exp\bigl(-2\pi b\,\mathrm{ERB}(f_r)\,t\bigr)\exp\bigl(j2\pi f_r t + jc\ln t + j\phi\bigr) \qquad (1)$$
where $a$, $\phi$, $c$ and $f_r$ are, respectively, the amplitude, the phase, the chirp factor and the asymptotic frequency; $b$ and $n$ are parameters defining the gamma distribution envelope; the time $t > 0$; "ln" is the natural logarithm operator; and $\mathrm{ERB}(f_r)$ is the equivalent rectangular bandwidth of the auditory filter at $f_r$. The ERB value at frequency $f$ in Hz is defined by (Glasberg and Moore, 1990; Moore, 2012; Wang and Brown, 2006):
$$\mathrm{ERB}(f) = 24.7 + 0.108 f \qquad (2)$$
The equivalent rectangular bandwidth rate at frequency $f$ is given by (Glasberg and Moore, 1990; Moore, 2012; Wang and Brown, 2006):
$$\mathrm{ERBrate}(f) = 21.4\log_{10}(0.00437 f + 1) \qquad (3)$$
The Fourier magnitude spectrum of the gammachirp filter (Irino and Patterson, 2006; Patterson et al., 2003) is given by:
$$|G_C(f)| = \frac{a\,|\Gamma(n + jc)|\,e^{c\theta}}{\Bigl(2\pi\sqrt{\bigl(b\,\mathrm{ERB}(f_r)\bigr)^2 + (f - f_r)^2}\Bigr)^{n}}, \qquad \theta = \arctan\frac{f - f_r}{b\,\mathrm{ERB}(f_r)} \qquad (4)$$
where $\Gamma(\cdot)$ is the gamma function.
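As a rough illustration of how such a filterbank can be laid out, the sketch below places 34 centre frequencies uniformly on the ERB-rate scale between 50 and 8000 Hz and generates the real part of a gammachirp impulse response. It uses the Glasberg and Moore expressions ERB(f) = 24.7 + 0.108 f and ERBrate(f) = 21.4 log10(0.00437 f + 1). The filter parameters `n`, `b` and `c`, the inclusion of both band edges as centre frequencies, and all function names are illustrative assumptions; the text does not state the values used in the experiments.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore form)."""
    return 24.7 + 0.108 * f

def erb_rate(f):
    """ERB-rate scale value at frequency f in Hz."""
    return 21.4 * np.log10(0.00437 * f + 1.0)

def erb_rate_to_hz(e):
    """Inverse of erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def centre_frequencies(n_channels=34, f_lo=50.0, f_hi=8000.0):
    """Centre frequencies equally spaced on the ERB-rate scale (edges included)."""
    return erb_rate_to_hz(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_channels))

def gammachirp(fr, fs=16000, duration=0.025, a=1.0, n=4, b=1.019, c=-2.0, phi=0.0):
    """Real part of the gammachirp impulse response at asymptotic frequency fr.

    n, b and c are placeholder values chosen for illustration only.
    """
    t = np.arange(1, int(round(duration * fs)) + 1) / fs   # strictly positive time
    envelope = a * t ** (n - 1) * np.exp(-2.0 * np.pi * b * erb(fr) * t)
    phase = 2.0 * np.pi * fr * t + c * np.log(t) + phi
    return envelope * np.cos(phase)
```

Setting `c=0` removes the chirp term and reduces the response to a gammatone filter, which is one way to compare the two auditory filter models.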
Fig. 2: Diagram of the proposed feature extraction method
Afterward, a relative spectral band-pass filter is applied to the filterbank outputs in the logarithmic domain in order to remove uncorrelated additive noise components. The transfer function of this filter is defined by (Hermansky and Morgan, 1994):
$$H(z) = 0.1\,z^{4}\,\frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{1 - 0.98\,z^{-1}} \qquad (5)$$
In the third step, the inverse logarithm of the relative logarithmic spectrum is calculated, yielding a relative auditory spectrum (Hermansky and Morgan, 1994). In the fourth step, this spectrum is weighted by an equal-loudness pre-emphasis to compensate for the unequal sensitivity of the human hearing system across frequency. In the fifth step, a cubic-root compression, which simulates the non-linear relation between sound intensity and perceived loudness, is applied to the pre-emphasized spectrum. The sixth step consists of obtaining the autoregressive coefficients of the all-pole model, which estimates the auditory-like spectrum of speech, using the inverse DFT and the Levinson-Durbin recursion (Hermansky, 1990; Zouhir and Ouni, 2014). In the seventh step, the proposed features are obtained by performing a cepstral transformation.
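The relative spectral filtering step can be sketched as a causal IIR filter run along the time axis of each log filterbank channel. The coefficients below follow the RASTA filter of Hermansky and Morgan (1994), with numerator 0.1 (2 + z^-1 - z^-3 - 2 z^-4) and a single pole at 0.98; the pole value is an assumption here, since later implementations often use a different one, and the non-causal z^4 advance is dropped, which only delays the output by four frames.

```python
import numpy as np

def rasta_filter(log_spec, pole=0.98):
    """Band-pass filter each channel trajectory of a log spectrogram.

    log_spec : array of shape (n_frames, n_channels), log filterbank outputs
    return   : filtered array of the same shape
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])    # FIR (numerator) part
    x = np.asarray(log_spec, dtype=float)
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        acc = np.zeros(x.shape[1])
        for k, bk in enumerate(b):
            if t - k >= 0:
                acc += bk * x[t - k]
        if t > 0:
            acc += pole * y[t - 1]                     # recursive (pole) part
        y[t] = acc
    return y
```

Because the numerator coefficients sum to zero, a constant offset in the log domain (for example a fixed channel gain) is driven to zero after the initial transient, which is the property the method relies on.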

EXPERIMENTAL RESULTS
This section presents the results of the experiments performed with the various techniques, using an isolated-word speech recognizer, in the presence of several types of ambient background noise. A total of 13227 isolated words were manually extracted from the TIMIT database, which contains speech signals from 630 speakers of eight English dialect regions, sampled at 16 kHz (Garofolo et al., 1990). The training set consisted of 9702 isolated words and the testing set of 3525 isolated words. All isolated words used in the testing phase were corrupted by different background noises (passing-car, shopping-mall, rain and sea-waves noise) at SNRs ranging from -3 to 9 dB. These noises were taken from PacDV (PacDV Sound Effects, 2014).
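The text does not describe how the noise recordings were mixed with the test words. One common approach, shown here purely as an assumption, scales the noise so that the ratio of average speech power to average noise power matches the target SNR; `mix_at_snr` is a hypothetical helper name.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the resulting SNR equals snr_db (in dB)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(speech.size / noise.size))
    noise = np.tile(noise, reps)[: speech.size]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g such that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Calling the function once per level in `(-3, 0, 3, 6, 9)` would produce the five noisy versions of a test word considered in the experiments.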
The HTK 3.4.1 toolkit (Young et al., 2009) was used to implement an isolated-word HMM recognizer. Each isolated word was represented by a simple left-to-right HMM (HMM-GM) of five states with four diagonal Gaussians per state.
The recognition performance of the proposed PLPrGc (Perceptual Linear Predictive relative spectra-Gammachirp) feature has been compared to that of the baseline PLP, LPCC and MFCC features. The feature vector of each technique consisted of 39 coefficients: 12 static coefficients plus energy (E), together with their first-order (∆) and second-order (A) differential coefficients.
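The 39-dimensional vector (13 static values, 13 first-order and 13 second-order differentials) can be assembled as sketched below. The regression formula and the window of two frames follow the usual HTK-style definition of differential coefficients; the window length and the edge handling by frame replication are assumptions, as the text does not state them.

```python
import numpy as np

def deltas(features, window=2):
    """Regression-based differential coefficients along the time axis.

    features : array of shape (n_frames, n_coeffs)
    """
    x = np.asarray(features, dtype=float)
    n_frames = x.shape[0]
    # Replicate the first and last frames so every frame has full context.
    padded = np.pad(x, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    out = np.zeros_like(x)
    for th in range(1, window + 1):
        ahead = padded[window + th: window + th + n_frames]
        behind = padded[window - th: window - th + n_frames]
        out += th * (ahead - behind)
    return out / denom

def stack_39(static_12, energy):
    """Build [12 static + E, first-order, second-order] vectors of size 39."""
    base = np.column_stack([static_12, energy])    # (n_frames, 13)
    d1 = deltas(base)
    d2 = deltas(d1)
    return np.hstack([base, d1, d2])
```

For a trajectory that increases by one unit per frame, the first-order coefficients equal 1 away from the edges, which is a convenient check of the normalization.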
Tables 1 to 4 present the recognition rates obtained using the proposed PLPrGc feature and the three baseline features (PLP, LPCC and MFCC) for the four types of ambient background noise at SNRs of -3, 0, 3, 6 and 9 dB. These tables show the effectiveness of the proposed feature compared to the baseline features for all four ambient background noises. The PLPrGc feature gives higher recognition rates at all SNR levels, particularly at low SNR. For example, in the case of passing-car noise at 0 dB SNR, the recognition rate of PLPrGc is higher than that of PLP, MFCC and LPCC by 14.1, 17.39 and 24.51, respectively.

CONCLUSION
In this study, we have presented a robust feature extractor based on relative spectra and a gammachirp filterbank for noisy speech recognition. Speech recognition results were reported on isolated speech words from TIMIT corrupted by real-world noisy environments, and performance was compared with that of PLP, LPCC and MFCC. Four different background noises at SNRs ranging from -3 to 9 dB were used. Experimental results show that the proposed feature extractor outperformed all the classical feature extractors at all SNR levels.

Table 1: Comparison of recognition rates of the proposed and the other classical features with passing-car noise at various SNRs for HMM-4-GM

Table 3: Comparison of recognition rates of the proposed and the other classical features with sea waves noise at various SNRs for HMM-4-GM