Neural Netw. 2015 Apr;64:39-48. doi: 10.1016/j.neunet.2014.08.005. Epub 2014 Sep 16.

Deep Convolutional Neural Networks for large-scale speech tasks.

Author information

1. IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: tsainath@google.com.
2. IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: bedk@us.ibm.com.
3. IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: gsaon@us.ibm.com.
4. IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: soltau@google.com.
5. Department of Computer Science, University of Toronto, United States. Electronic address: asamir@cs.toronto.edu.
6. Department of Computer Science, University of Toronto, United States. Electronic address: gdahl@cs.toronto.edu.
7. IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: bhuvana@us.ibm.com.

Abstract

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model the spectral correlations that exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech than Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot be modeled directly by CNNs because they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on three LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%-14% relative improvement in word error rate (WER) over a strong DNN system, achieving state-of-the-art results on these three tasks.
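The idea that convolution and pooling along frequency can reduce spectral variation may be easier to see in a toy example. The sketch below is not the architecture from the paper; it is a minimal pure-Python illustration under stated assumptions. The function names, the 3-tap kernel, the pooling size of 3, and the 9-band input frame are all hypothetical choices made for illustration only.

```python
def conv1d_freq(bands, kernel):
    """Valid 1-D cross-correlation of one frame along the frequency axis."""
    k = len(kernel)
    return [sum(bands[i + j] * kernel[j] for j in range(k))
            for i in range(len(bands) - k + 1)]


def relu(xs):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in xs]


def max_pool(xs, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]


# Hypothetical 9-band spectral frame with one energy peak (illustrative values).
frame = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
# The same peak moved up by one frequency band, e.g. a different speaker.
shifted = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]

kernel = [1.0, 2.0, 1.0]  # hypothetical filter shared across all frequency positions

pooled = max_pool(relu(conv1d_freq(frame, kernel)), 3)
pooled_shifted = max_pool(relu(conv1d_freq(shifted, kernel)), 3)

print(pooled)
print(pooled_shifted)
```

In this toy case the strongest filter response lands in the same pooled unit for both frames, even though the peak moved by one band. That is the kind of tolerance to small spectral shifts that pooling is meant to provide; a real system would learn many filters and operate on actual log-mel features.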

KEYWORDS: Deep learning; Neural networks; Speech recognition

PMID: 25439765
DOI: 10.1016/j.neunet.2014.08.005
[Indexed for MEDLINE]
