BiLSTM with Attention Pooling for Speech Act Recognition

This research project was conducted for the 2019 Fall Brain-Mind-Behavior independent research course at Seoul National University. It extends work from my summer internship at KIXLab, KAIST, where I developed a speech act classifier for a speech-act-based chatbot project. The project won the best research award at the 2019 Fall Brain-Mind-Behavior independent research presentation.
In this research, I implemented an LSTM-based RNN model on top of a pretrained word embedding model. I also designed a self-attentive pooling method that produces sentence-level embeddings and compared it with existing pooling methods. Lastly, the attention layer was used for a qualitative analysis of feature attribution. Details are given in the article below.


A speech act (or dialogue act) is the role an utterance plays within a dialogue, such as a command, question, or acknowledgment. Speech act analysis focuses on “what language does,” and analyzing a speaker's speech acts helps infer the speaker's intention and status. Because understanding speakers’ intention and status is critical for human-like conversation, automatic speech act recognition has great potential to improve automatic dialogue applications such as chatbots.

Table 1. Examples of speech act classification. Retrieved from [2].

This study aims to implement an automatic speech act classifier using a deep learning model based on the two-level RNN structure (Figure 2) used by existing models (Kumar et al. [1], Chen et al. [2]). In addition, the Attention Mechanism [7] is applied to the pooling method, which previous studies did not focus on, and the result is compared with existing pooling methods.

Many studies have addressed converting natural-language words to numerical data. Encoding words as one-hot vectors is not only inefficient but also cannot represent relationships between words. To solve this problem, there have been many attempts to find a vector space, a so-called embedding space, that represents the relationships between words well. Starting from word2vec [3] and GloVe [4], efforts to obtain good word vectors have continued up to BERT [5], which has recently revolutionized the field of natural language processing. However, relatively little research has addressed vectors for sentence-level units. In many cases, sentence vectors are obtained with simple methods, such as averaging the vectors of the words in the sentence or taking only the last output of an RNN. With this in mind, this study seeks to obtain sentence (or utterance) vectors that better represent the meaning of sentences through a novel pooling method.
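The limitation of one-hot encoding can be illustrated with a minimal sketch in plain Python. The three-word vocabulary and the two-dimensional dense vectors are made-up toy values, not taken from word2vec or GloVe, and the helper names `one_hot` and `cosine` are illustrative only.

```python
import math

def one_hot(index, vocab_size):
    """Return a one-hot vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["question", "answer", "banana"]
onehots = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Toy dense vectors (assumed values, for illustration only).
dense = {"question": [0.9, 0.1], "answer": [0.8, 0.3], "banana": [0.0, 1.0]}

# Any two distinct one-hot vectors are orthogonal, so their similarity is 0.
onehot_sim = cosine(onehots["question"], onehots["answer"])

# Dense vectors can place related words closer together than unrelated ones.
related_sim = cosine(dense["question"], dense["answer"])
unrelated_sim = cosine(dense["question"], dense["banana"])
```

The length of each one-hot vector also grows with the vocabulary size, which is the inefficiency mentioned above, whereas the dimension of a dense embedding is fixed independently of the vocabulary.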


Building Block

Figure 1. (a) LSTM unit architecture and formula. (b) Bi-LSTM data flow.

Speech act classification is a natural language processing (NLP) task. Natural language can be treated as sequential (time-series) data, and this study uses RNN (Recurrent Neural Network) structures, which are suited to such data. In particular, LSTM (Long Short-Term Memory) is used as the unit building block in this study, as in previous studies ([1], [2]). LSTM extends the RNN by incorporating memory units and update/forget gates, which mitigates the vanishing gradient problem and facilitates learning long-term dependencies. Furthermore, arranging LSTMs bidirectionally (Bi-LSTM, Bidirectional LSTM) lets information flow both along and against the time sequence, so that each position sees both its preceding and its following context. Bi-LSTM is an effective model for speech act classification because understanding context is important for inferring speakers’ intention ([1], [2]).
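For reference, one commonly used formulation of the LSTM update is written out below. This is the standard textbook form and may differ in notation or detail from the formula shown in Figure 1-(a); the symbols (weights W and U, biases b, input x_t, hidden state h_t, cell state c_t) are assumptions introduced here for illustration.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input / update gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)} \\
h_t^{\mathrm{bi}} &= [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t] && \text{(Bi-LSTM output)}
\end{aligned}
```

The additive form of the memory update is what lets gradients pass through many time steps without shrinking as quickly as in a plain RNN. The last line assumes the usual convention of concatenating the forward and backward hidden states at each position.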

Layer Architecture

Figure 2. Overall architecture of speech act classification.

A dialogue has a two-level hierarchical structure: a dialogue is a sequence of utterances, and each utterance is a sequence of words. Because the speech acts of utterances within a dialogue are highly correlated (for example, an “answer” is likely to follow a “question”), the temporal dependency between utterances matters as well as the temporal dependency between words. Taking this into account, the existing models ([1], [2]) and the model proposed in this study have a two-level hierarchical structure (Figure 2). The first level is the Utterance feature extractor, which extracts an utterance vector from a sequence of words, and the second level, Logit, infers speech act categories from the utterance vectors. Both the Utterance feature extractor and Logit use Bi-LSTM as their core.

The Utterance feature extractor consists of an Embedding layer, a Bi-LSTM, and a Pooling layer (Fig. 3-(b)). The Embedding layer holds a pretrained word embedding model, which converts words to embedding vectors. Word2vec [3] and GloVe [4] are tested, and GloVe with 300 dimensions is adopted. To augment the word vectors, morpheme vectors extracted with the Natural Language Toolkit (NLTK) are concatenated to them. The Pooling layer produces an utterance-level embedding vector by aggregating the hidden states produced by the Bi-LSTM. It is discussed in more detail in the Pooling method section.

Logit (Fig. 3-(a)) has a relatively simple structure. The utterance vectors are processed by a Bi-LSTM, and the probability of each speech act category is obtained by applying Softmax to the output hidden states of the Bi-LSTM. Before the utterance vectors enter the Bi-LSTM, they are augmented with additional informative features: a flag identifying the speaker and a flag indicating whether the utterance ends with a question mark.
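The feature augmentation step can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `augment_utterance_vector`, the encoding of the two speakers as 0.0 and 1.0, and appending the flags at the end of the vector are all assumptions.

```python
def augment_utterance_vector(utterance_vector, speaker, text):
    """Append a speaker flag and a question-mark flag to an utterance vector."""
    # Assumed encoding: speaker "A" -> 0.0, speaker "B" -> 1.0.
    speaker_flag = 1.0 if speaker == "B" else 0.0
    # 1.0 when the utterance text ends with "?" (ignoring trailing whitespace).
    question_flag = 1.0 if text.rstrip().endswith("?") else 0.0
    return list(utterance_vector) + [speaker_flag, question_flag]

# Toy 3-dimensional utterance vector; real vectors come from the Pooling layer.
augmented = augment_utterance_vector([0.2, -0.5, 0.1], "B", "Is that right? ")
```

Under these assumptions, each utterance vector grows by two dimensions before it enters the second-level Bi-LSTM.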

Figure 3. Architecture of utterance feature extractor (b) and logit (a).

Pooling method

Figure 4. Diagram of each pooling method. (a) attention pooling, (b) average pooling, (c) last pooling.

This study proposes Attention pooling, a novel pooling method based on the Attention Mechanism, and compares it with existing methods: Average pooling [1] and Last pooling. Pooling aggregates the sequence of hidden state vectors produced by the Bi-LSTM into a single utterance vector. Average pooling (Fig. 4-(b)) averages the hidden state vectors, and Last pooling (Fig. 4-(c)) uses only the last hidden state. Attention pooling (Fig. 4-(a)), proposed in this study, weights the hidden state vectors by attention weights computed by the Attention module. As a result, the Attention module is trained to yield attention weights that reflect the relative significance of each hidden state. An FC (Fully Connected) network and a Bi-LSTM are tested as candidate structures for the Attention module.
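The three pooling methods can be compared side by side in a small plain-Python sketch. It rests on several assumptions that the text does not specify: the attention scores are normalized with a softmax, the FC scorer has the form context · tanh(weight · h + bias), and padded positions are ignored. All parameter values are toy numbers, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def last_pooling(hidden_states):
    """Use only the final hidden state."""
    return list(hidden_states[-1])

def average_pooling(hidden_states):
    """Average the hidden states over time, dimension by dimension."""
    steps = len(hidden_states)
    return [sum(column) / steps for column in zip(*hidden_states)]

def fc_score(hidden, weight, bias, context):
    """Hypothetical FC scorer: context . tanh(weight @ hidden + bias)."""
    projected = [
        math.tanh(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weight, bias)
    ]
    return sum(c * p for c, p in zip(context, projected))

def attention_pooling(hidden_states, score_fn):
    """Weighted sum of hidden states; weights come from softmax over scores."""
    weights = softmax([score_fn(h) for h in hidden_states])
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights

# Three time steps of 2-dimensional hidden states (toy values).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Untrained toy parameters for the scorer (assumed values).
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
context = [2.0, 0.0]

pooled, weights = attention_pooling(
    states, lambda h: fc_score(h, weight, bias, context)
)
# With a constant score, attention pooling reduces to average pooling.
uniform_pooled, uniform_weights = attention_pooling(states, lambda h: 0.0)
```

Seen this way, Average pooling and Last pooling are special cases of a weighted sum with fixed weights (uniform, or all weight on the last step), while Attention pooling learns the weights from the data.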

Dataset & Experimental setting

The SwDA (Switchboard Dialog Act Corpus, 2000) dataset is used to test the performance of the proposed models. The SwDA consists of 1,155 telephone conversations with a total of about 210,000 utterances. Each utterance is labeled with one of 42 speech act categories according to a taxonomy derived from DAMSL (Dialog Act Markup in Several Layers) [6]. The dataset split follows existing studies, which use 1,115 conversations for training and validation (Training: 1,003, Validation: 112) and 19 conversations for testing.

The Adam optimizer is used with an initial learning rate of 0.01. After 100 epochs, the learning rate is lowered to 0.0001 for fine-tuning. The hidden state dimension of every Bi-LSTM is 128, the batch size is 128, and the dropout rate is 0.2. The dimension of the Attention Module is 16. The Embedding layer uses the pretrained model (GloVe [4]) without further training. The number of words in each utterance is fixed to 36 with zero-padding. For training, sequences of 8 consecutive utterances are used as input; for evaluation, the entire conversation is used as input to calculate the accuracy. The model was implemented in Python 3.6 using PyTorch 1.10 and trained on an NVIDIA Titan X GPU (12GB).
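The input preparation and the learning rate schedule described above can be sketched in plain Python. Several details are assumptions rather than facts from the study: utterances longer than 36 words are truncated, training windows are non-overlapping by default with a shorter final window kept, epochs are counted from 0, and all function and constant names are hypothetical.

```python
MAX_WORDS = 36   # fixed number of words per utterance
WINDOW = 8       # consecutive utterances per training input
PAD_ID = 0       # token id used for zero-padding

def pad_utterance(token_ids, max_len=MAX_WORDS, pad_id=PAD_ID):
    """Truncate (assumed) or zero-pad a list of token ids to a fixed length."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

def training_windows(dialogue, window=WINDOW, stride=WINDOW):
    """Split a dialogue (list of utterances) into windows of consecutive utterances.

    The stride is an assumption; stride=1 would give overlapping windows instead.
    """
    return [dialogue[i:i + window] for i in range(0, len(dialogue), stride)]

def learning_rate(epoch):
    """0.01 for the first 100 epochs (0-indexed), then 0.0001 for fine-tuning."""
    return 0.01 if epoch < 100 else 0.0001

# A toy dialogue of 20 utterances, each a short list of token ids.
dialogue = [pad_utterance([1, 2, 3]) for _ in range(20)]
window_sizes = [len(w) for w in training_windows(dialogue)]
```

At evaluation time no windowing is applied in this sketch: the whole list of padded utterances for a conversation would be passed as a single input.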


Comparison of Pooling Methods

Table 2. Accuracy(precision) of each method and previous study.

Table 2 shows the results of training models with different pooling methods. Attention pooling shows the highest performance, with a validation accuracy of 80.7 percent, which is higher than that of the existing pooling methods, Last pooling and Average pooling. The performance of the proposed model is also comparable to that of the state-of-the-art model by Chen et al. [2], which supports the validity of the model.

The structure of the Attention Module, FC or Bi-LSTM, has no significant effect on performance. A likely reason is that the hidden states from the Bi-LSTM of the Utterance feature extractor, which are the input to the Attention Module, already carry contextual information. Therefore, even with the FC structure, context is reflected in the attention weights.

Qualitative Analysis of Attention

Figure 5. Qualitative analysis of attention weights.

One benefit of the Attention Mechanism is that it makes it easy to show intuitively how the model works. In this study, attention weights are extracted from the Attention Module for qualitative analysis. The attention weights near the subject words of an utterance tend to be higher, meaning that information from those parts strongly affects the classification.
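A minimal sketch of how extracted attention weights might be inspected is given below. The words and weights are made-up values for illustration only and are not results from this study; the helper names `top_attended_words` and `render_bars` are hypothetical.

```python
def top_attended_words(words, weights, k=3):
    """Return the k (word, weight) pairs with the largest attention weights."""
    if len(words) != len(weights):
        raise ValueError("words and weights must have the same length")
    ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def render_bars(words, weights, width=20):
    """Draw one text bar per word, scaled to the largest weight."""
    peak = max(weights)
    return [
        f"{word:<8}{'#' * round(width * weight / peak)}"
        for word, weight in zip(words, weights)
    ]

# Made-up attention weights (illustration only, not measured values).
words = ["do", "you", "like", "it", "?"]
weights = [0.30, 0.35, 0.15, 0.08, 0.12]

top_two = top_attended_words(words, weights, k=2)
bars = render_bars(words, weights)
```

Printing `bars` line by line gives a rough text version of the kind of per-word visualization shown in Figure 5.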

From this result, we can infer that subjects and predicates might be critical for recognizing speech acts. Because those parts are adjacent in English sentence structure, the attention weights near the subject words would be expected to have high values.


In summary, the model proposed in this study achieved 80.73% validation accuracy. This is quite competitive with the existing SOTA model (Chen et al., CRF-ASN [2]: 80.8%) and the reported human accuracy (84%). It is slightly less accurate than CRF-ASN, but whereas CRF-ASN applies a very complex methodology, this model has a relatively simple structure.

The pooling method using the Attention Mechanism not only improves performance but also allows us to interpret the model's process intuitively, which contributes to the field of interpretable artificial intelligence. CRF-ASN also applies the Attention Mechanism, not to the pooling method but to the CRF (Conditional Random Field) structure. In other words, the model in this study has the advantage that its operation can be interpreted at the word level, showing which parts of an utterance are critical for recognizing speech acts.


[1] Kumar, H., Agarwal, A., Dasgupta, R., Joshi, S., & Kumar, A. (2017). Dialogue Act Sequence Labeling using Hierarchical encoder with CRF.

[2] Chen, Z., Yang, R., Zhao, Z., Cai, D., & He, X. (2018). Dialogue Act Recognition via CRF-Attentive Structured Network. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval  - SIGIR ’18, 225–234.

[3] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119)

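[4] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).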

[5] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

[6] Core, M. G., & Allen, J. (1997, November). Coding dialogs with the DAMSL annotation scheme. In AAAI fall symposium on communicative action in humans and machines (Vol. 56).

[7] Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.