2013 ExtractingDeepBottleneckFeature

From GM-RKB

Subject Headings: Bottleneck Layer.

Notes

Cited By

Quotes

Abstract

In this work, a novel training scheme for generating bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the bottleneck layer and an additional layer are added and the whole network is fine-tuned to predict target phoneme states. We perform experiments on a Cantonese conversational telephone speech corpus and find that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available. Pre-training on additional unlabeled data yields further gains. Evaluations on larger datasets and on different system setups demonstrate the general applicability of our approach. In terms of word error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combinations with MFCCs), and 8.7% (Switchboard) are achieved.

1. INTRODUCTION

The two main approaches for incorporating artificial neural networks (ANNs) in acoustic modeling today are hybrid systems and tandem systems. In the former, a neural network is trained to estimate the emission probabilities for Hidden Markov Models (HMM) [1]. In contrast, tandem systems use neural networks to generate discriminative features as input values for the common combination of Gaussian Mixture Models (GMM) and HMMs. This is done by training a network to predict phonetic targets, and then either using the estimated target probabilities (“probabilistic features” [2]) or the activations of a narrow hidden layer (“bottleneck features”, BNF [3]). Those features are usually fused with standard input features and decorrelated for modeling with GMMs.
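The tandem pipeline described above can be illustrated with a minimal NumPy sketch (not code from the paper; all array names and dimensions are illustrative stand-ins): bottleneck activations from a trained network are fused with standard MFCC input features and then decorrelated, here via PCA, so that diagonal-covariance GMMs can model them.

```python
import numpy as np

rng = np.random.default_rng(0)

def pca(X, k):
    """Project X onto its top-k principal components (decorrelates columns)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1][:k]
    return Xc @ vecs[:, order]

# Stand-ins: 100 frames of 13-dim MFCCs and the activations of a 40-unit
# bottleneck layer (in practice these come from a trained neural network).
mfcc = rng.random((100, 13))
bnf = rng.random((100, 40))

# Fuse the bottleneck features with the standard input features, then
# decorrelate the result before GMM/HMM modeling.
fused = np.concatenate([mfcc, bnf], axis=1)   # (100, 53)
features = pca(fused, k=42)                   # (100, 42), decorrelated
```

PCA is used here only as a stand-in for the decorrelation step; concrete systems may use a KLT or LDA transform instead.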

In the field of machine learning, deep learning deals with the efficient training of deep neural networks (DNN), with algorithms generally inspired by the greedy, unsupervised, layer-wise pre-training scheme first proposed by Hinton et al. [4]. While Hinton et al. used a deep belief network (DBN) where each layer is modeled by a restricted Boltzmann machine (RBM), later works showed that other architectures like auto-encoders [5] or convolutional neural networks [6] are suitable for building deep networks using similar schemes as well.

Neural networks were successfully used for acoustic modeling more than two decades ago [7],[8], but were mostly abandoned in favor of GMM/HMM acoustic models throughout the late 1990s. However, the ability to train networks with millions of parameters in a feasible amount of time has caused a renewed interest in connectionist models [9]. Recently, deep neural networks have been used with great success in hybrid DNN/HMM systems, resulting in strong improvements on challenging large-vocabulary tasks such as Switchboard [10] or Bing voice search data [11].

2. RELATED WORK

While bottleneck features have been used in speech recognition systems for some time now, only a few works on applying deep learning techniques to this task have been published.

In 2011, Yu & Seltzer applied a deep belief network as proposed by Hinton et al. for extracting bottleneck features, with the bottleneck being a small RBM placed in the middle of the network [12].

The network was pre-trained on frames of MFCCs including deltas and delta-deltas, and then fine-tuned to predict either phoneme or senone targets. They found that pre-training the RBMs increases the accuracy of the recognition system, and that additional strong improvements can be achieved by using context-dependent targets for supervised training. However, they noted that possibly due to their symmetric placement of the bottleneck layer, increasing the number of layers in the network to more than 5 did not improve recognition performance any further. In a more recent work, it was also argued that RBMs are not suitable for modeling decorrelated data like MFCCs [13].

Sainath et al. introduced DBN training in a previously proposed architecture based on training an auto-encoder on phonetic class probabilities estimated by a neural network [14],[15]. In their work, they first trained a stack of RBMs for classification of speaker-adapted PLP features and applied a 2-step auto-encoder to reduce the output of the resulting DBN to 40 bottleneck features. These features outperformed a strong GMM/HMM system using the same input, but they found that performance gains are higher when systems are trained on little data.

This work proposes a different approach that profits from increasing the model capacity by adding more hidden layers, and enables the supervised training of the bottleneck layer in order to retrieve useful features for a GMM/HMM acoustic model. Instead of pre-training the layers with restricted Boltzmann machines, we use auto-encoders, which are straightforward to set up and train.
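The scheme the paper describes (greedy, layer-wise pre-training of denoising auto-encoders, then adding a bottleneck layer, an additional layer and a softmax output for supervised fine-tuning on phoneme-state targets) can be sketched in NumPy. This is not the authors' implementation: data, layer sizes, learning rates and epoch counts are illustrative, and for brevity only the added layers are trained supervised, whereas the paper fine-tunes the whole network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_denoising_autoencoder(X, n_hidden, noise=0.2, lr=0.1, epochs=10):
    """One greedy, unsupervised layer: reconstruct X from a corrupted copy."""
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        X_tilde = X * (rng.random(X.shape) > noise)   # masking noise
        H = sigmoid(X_tilde @ W + b)                  # encode
        R = sigmoid(H @ W.T + c)                      # decode (tied weights)
        dR = (R - X) * R * (1 - R)                    # squared-error gradient
        dH = (dR @ W) * H * (1 - H)
        W -= lr * (X_tilde.T @ dH + (H.T @ dR).T) / len(X)
        b -= lr * dH.sum(axis=0) / len(X)
        c -= lr * dR.sum(axis=0) / len(X)
    return W, b

# Stand-in data: 200 "frames" of 39-dim features, 10 phoneme-state labels.
X = rng.random((200, 39))
Y = np.eye(10)[rng.integers(0, 10, size=200)]         # one-hot targets

# 1) Layer-wise, unsupervised pre-training of the auto-encoder stack.
layers, H = [], X
for n_hidden in (128, 128):
    W, b = train_denoising_autoencoder(H, n_hidden)
    layers.append((W, b))
    H = sigmoid(H @ W + b)

# 2) Add the bottleneck layer, an additional layer and a softmax output,
#    then train on the phoneme-state targets (sketch: added layers only).
W1 = rng.normal(0, 0.1, (H.shape[1], 40)); b1 = np.zeros(40)   # bottleneck
W2 = rng.normal(0, 0.1, (40, 128)); b2 = np.zeros(128)         # extra layer
W3 = rng.normal(0, 0.1, (128, 10)); b3 = np.zeros(10)          # states
lr = 0.5
for _ in range(50):
    B = sigmoid(H @ W1 + b1)
    A = sigmoid(B @ W2 + b2)
    Z = A @ W3 + b3
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                 # softmax
    dZ = (P - Y) / len(H)                             # cross-entropy grad
    dA = (dZ @ W3.T) * A * (1 - A)
    dB = (dA @ W2.T) * B * (1 - B)
    W3 -= lr * A.T @ dZ; b3 -= lr * dZ.sum(axis=0)
    W2 -= lr * B.T @ dA; b2 -= lr * dA.sum(axis=0)
    W1 -= lr * H.T @ dB; b1 -= lr * dB.sum(axis=0)

# The bottleneck activations are the features passed on to the GMM/HMM.
bnf = sigmoid(H @ W1 + b1)
print(bnf.shape)   # (200, 40)
```

Adding more auto-encoders to the stack before the bottleneck is how the paper increases model capacity; the pre-training loop above is what makes those deeper stacks trainable.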

3. MODEL DESCRIPTION

References

Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel. (2013). "Extracting Deep Bottleneck Features Using Stacked Auto-encoders."