Marek Grochowski, PhD, Department of Informatics, NCU: “Speaker adaptation of deep acoustic model”
In automatic speech recognition (ASR) systems, the acoustic model is responsible for detecting the phoneme sequence in the input sound signal. Deep neural networks trained on hundreds of hours of recordings have recently become the state-of-the-art methods in this area, capable of producing acoustic models with very high accuracy. Speaker-specific differences in pronunciation (from speech defects, accents, etc.) may negatively affect the quality of phoneme recognition and, consequently, increase the number of words incorrectly recognized by the ASR system. However, given a small sample of recordings (one or several utterances) of a particular subject, we are able to significantly improve the accuracy of speech recognition for that person by fine-tuning the neural network, without the need for expensive training of a new model. During the talk, several speaker adaptation techniques used for fine-tuning deep LSTM networks will be presented.
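As an illustration of the general idea, the sketch below shows one simple adaptation technique of the kind the abstract describes: fine-tuning only the output layer of a pretrained LSTM acoustic model on a small speaker-specific sample, while the recurrent layers stay frozen. This is a minimal PyTorch sketch with illustrative dimensions and random toy data; the model class, layer sizes, and phoneme count are assumptions, not the speaker's actual setup.

```python
import torch
import torch.nn as nn

# Hypothetical acoustic model: stacked LSTM over acoustic feature frames,
# with a linear layer producing per-frame phoneme logits.
class AcousticModel(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_phones=45):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_phones)

    def forward(self, x):
        h, _ = self.lstm(x)          # h: (batch, frames, hidden)
        return self.out(h)           # logits: (batch, frames, n_phones)

model = AcousticModel()              # stands in for a pretrained model

# Speaker adaptation by fine-tuning: freeze the LSTM layers and update
# only the output layer on the few available adaptation utterances.
for p in model.lstm.parameters():
    p.requires_grad = False

opt = torch.optim.SGD(model.out.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Toy adaptation sample: one utterance of 50 frames of 40-dim features,
# with frame-level phoneme labels (random here, real alignments in practice).
x = torch.randn(1, 50, 40)
y = torch.randint(0, 45, (1, 50))

for _ in range(5):
    opt.zero_grad()
    logits = model(x)
    loss = loss_fn(logits.reshape(-1, 45), y.reshape(-1))
    loss.backward()
    opt.step()
```

Freezing most of the network keeps the adaptation cheap and reduces overfitting to the tiny speaker sample; other techniques presented in the talk (e.g. inserting small speaker-dependent transforms) follow the same pattern of updating only a few parameters.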