Hierarchical Network with Label Embedding for Contextual Emotion Recognition

Emotion recognition has been widely used in various applications such as mental health monitoring and emotional management. Usually, emotion recognition is regarded as a text classification task; however, it is a more complex problem, and the relations between the emotions expressed in a text are nonnegligible. In this paper, a hierarchical model with label embedding is proposed for contextual emotion recognition. Specifically, a hierarchical model is utilized to learn the emotional representation of a given sentence based on its contextual information. To enable emotion correlation-based recognition, a label embedding matrix is trained by joint learning, which contributes to the final prediction. Comparison experiments are conducted on the Chinese emotional corpus RenCECps, and the experimental results indicate that our approach achieves satisfying performance on the textual emotion recognition task.


Introduction
As an essential element of human nature, emotions have been widely studied in psychology. Emotion recognition involves the identification of detailed emotional states, which mainly refer to a wide range of mental states, such as happiness, anger, and fear [1]. Textual emotion recognition (TER) is a kind of fine-grained sentiment analysis. It aims to classify a textual expression into one or several emotion classes depending on the underlying emotion theories employed [2]. TER supports many common applications in the field of natural language processing, such as mental health monitoring [3], emotional management [4,5], sinister tone analysis in social networks [6], and human-computer interaction systems [7]. In recent decades, TER tasks have gained considerable interest in the research community.
Recent research on TER has mainly been conducted at the sentence level, aiming to recognize subtle emotions based on word- and concept-based features extracted from the given sentence. However, emotional expression is complicated, and the same sentence can convey different emotions in different contexts. In the absence of contextual information, even humans cannot make confident emotional judgments. Therefore, it is necessary to utilize contextual information for sentence-level emotion recognition.
Given a sentence, its context generally refers to the sentences that appear around it. For example, for a sentence from a blog, the context refers to the sentences surrounding the current sentence; for an utterance from a dialogue, the context generally refers to the preceding utterances. Such contextual information has been explored in previous works, such as HANs [8], TreeLSTM [9], and CLSTM [10]. Under different circumstances, the contextual sentences of the current sentence contribute differently to the final prediction, and attention mechanism-based networks are widely utilized to address this problem. Inspired by HANs (Hierarchical Attention Networks), we explore effective encoders for sentence-level and contextual-level encoding to generate a more accurate representation of the emotion expressed in the given sentence.
Emotional expression is very complicated. Some emotions often cooccur, such as the pair "Joy" and "Love," while some are opposed and rarely appear together, such as "Joy" and "Anxiety." Emotion correlation has always been a significant factor in emotion recognition tasks, and to recognize emotions accurately, it is necessary to fully consider the correlations between emotions. This paper explores a hierarchical model to learn contextual representations, which encodes the emotional information of a given sentence based on its context. Besides, to realize emotion correlation learning, we train a label embedding matrix by joint learning, which benefits emotion correlation-based prediction. Our contributions are summarized below:
(1) This paper proposes a hierarchical model to learn contextual representations for sentence-level emotion recognition. We take the pretrained language model BERT as the sentence-level encoder and an attention-based bidirectional GRU as the context-level encoder, which are aimed at learning the emotional information of the given sentence based on its context.
(2) To give emotion correlation-based predictions, the label embedding matrix is learned by joint learning. Emotion correlation is obtained by calculating the similarity features between the sentence representation and each label embedding, which contributes to the final prediction.
(3) To guarantee the effectiveness of both emotion prediction and label embedding, the proposed network is trained with an assembled training objective. The experimental results indicate that our approach achieves satisfying performance on the TER task.
The rest of this paper is organized as follows: Section 2 presents related works on textual emotion recognition and contextual modeling. Section 3 describes the methodology of the proposed hierarchical network with label embedding. Experimental results and discussion are presented in Sections 4 and 5.
Finally, Section 6 concludes this paper.

Textual Emotion Recognition.
Deep learning-based techniques have achieved significant improvements on TER tasks [11]. Word embedding techniques are aimed at learning latent, low-dimensional representations from language structure, alleviating the sparse-feature and high-dimensionality problems of traditional bag-of-words models. Some well-established embedding models are widely used in many NLP tasks, such as Word2vec [12,13] and GloVe [14]. They are trained on large-scale unlabeled data and aim to capture fine-grained syntactic and semantic regularities. Recently, the emergence of pretrained language models opened the pretraining era in the NLP field. Pretrained language models provide useful general knowledge and can be fine-tuned for almost all downstream tasks. CoVe and ELMo [15] generate dynamic, context-sensitive word embeddings, by which the same word in different contexts is given different word vectors; they greatly alleviate ambiguity. BERT utilizes a large amount of unlabeled data during training, which helps the model learn useful linguistic knowledge. BERT performs well in encoding contextual grammatical knowledge and has achieved satisfactory results in many NLP tasks.
Emotion label correlation is always a critical problem in TER tasks. Deep canonical correlation analysis (DCCA) performs well in feature-aware label embedding and label-correlation aware prediction [16,17]. Some studies explore label correlations by transforming multilabel classification tasks into ranking or generation tasks. In [18], the multilabel emotion classification task is transformed into a relevant emotion ranking task, which is aimed at generating a ranked list of relevant emotions according to emotional intensity. In [19], multilabel classification is regarded as a sequence generation task, and the Sequence Generation Model (SGM) is proposed to learn label correlations. Some approaches attempt to estimate label correlations implicitly by modifying the loss function. A label-correlation sensitive loss function was first proposed in [20] with the BP-MLL algorithm. The joint binary cross-entropy (JBCE) loss [21] is proposed to train the joint binary neural network (JBNN) to capture label relations. To reduce the computational complexity, partial label dependence can also contribute to this task, as demonstrated in [22]. A semisupervised multilabel method is proposed in [23], where label correlations are incorporated by modifying the loss function. Multilabel classification and label correlation learning can also be realized in a joint learning framework [24,25].

Contextual Modeling.
Emotion recognition of each sentence in hierarchical texts, such as documents or dialogues, highly depends on contextual cues, and context modeling is indispensable. To realize contextual emotion recognition, it is necessary to model not only the current sentence but also the contextual sentences, which helps to capture the overall emotional tendency of the document or dialogue.
Context modeling architectures can mainly be summarized into two kinds: flatten context modeling and hierarchical context modeling. In flatten context modeling, the context sentences and the current sentence are concatenated, and all tokens are flattened into a single word sequence. This sequence is fed into neural networks for contextual information extraction and final prediction [26,27]. However, emotions flow naturally through the contextual sentences, and this sequential nature is nonnegligible. Flatten context processing makes the word sequence too long, ignores the time step, and destroys the hierarchical structural information of the contextual sentences.
The hierarchical structure is a natural characteristic of text: words form sentences and sentences form contexts. This structural knowledge can be incorporated into the model architecture for better representation. Inspired by this fact, some works stack deep learning architectures to provide specialized understanding at each level [28]. In hierarchical context modeling, each sentence is embedded into a vector representation by a sentence-level encoder, and context information is further extracted by a hierarchical context encoder [29]. Hierarchical Attention Networks (HANs) are proposed in [8], mirroring documents' hierarchical structure.
HANs mainly consist of two layers applied at the word and sentence level, respectively. Each layer contains a GRU-based sequence encoder along with an attention layer, which is aimed at paying more attention to important words and sentences that benefit the final prediction.

Methodology
3.1. Problem Definition. Assume that we have N training samples X along with their contextual information C. Each sample x ∈ X is a sentence from a hierarchical text, such as a dialogue or blog, and its contextual information c = {c_1, c_2, ⋯, c_n} ∈ C refers to the preceding n sentences that appear before x. Each sample x is annotated with K emotional labels {e_1, ⋯, e_k, ⋯, e_K}, which is denoted as a binary label vector y = {y_1, ⋯, y_k, ⋯, y_K} ∈ ℝ^{1×K}, in which y_k = 1 if x contains emotion e_k and y_k = 0 otherwise.
For each sample x ∈ X, a multilabel emotion recognition model F is trained to transform x into a predicted distribution p = {p_1, ⋯, p_k, ⋯, p_K} based on its contextual information c and then give a final prediction of all possible emotion labels. The function F is denoted as p = F(x, c).
3.2. Hierarchical Network with Label Embedding. To model a sentence x along with its contextual information c = {c_1, c_2, ⋯, c_n}, the simplest way is to utilize flatten context modeling, by which x and the contextual sentences c are concatenated as x′ = {c_1, c_2, ⋯, c_n, x}, and all tokens in x′ are flattened into a word sequence. However, emotions flow naturally in each sentence. Such flatten processing makes the word sequence too long and ignores the time step, which destroys the hierarchical structural information. The sequential nature of context is nonnegligible, and such hierarchical information can better contribute to emotion prediction. Motivated by Hierarchical Attention Networks (HANs), we focus on hierarchical context modeling. Each sentence in x′ = {c_1, c_2, ⋯, c_n, x} is first encoded into a sentence-level representation h^s = {h^s_{c_1}, h^s_{c_2}, ⋯, h^s_{c_n}, h^s_x} by a sentence-level encoder En_s, and then contextual information is further encoded by a hierarchical context encoder. The framework of the proposed hierarchical network with label embedding is shown in Figure 1.

Sentence-Level Modeling.
At the sentence level, for each sentence s = {w_1, w_2, ⋯} in x′ = {c_1, c_2, ⋯, c_n, x}, the function En_s encodes s into a sentence-level representation h^s, which is denoted as h^s = En_s(s). Inspired by pretrained language models and transfer learning techniques, the pretrained BERT model [30] is taken as the sentence-level encoder En_s in this paper. BERT stands for Bidirectional Encoder Representations from Transformers, and it is designed to pretrain deep bidirectional representations from unlabeled textual data by jointly conditioning on both left and right context in all layers. It remedies the limitation of insufficient training corpora and contributes to syntactically and semantically rich sentence representations.
In this way, for the sentences in x′ = {c_1, c_2, ⋯, c_n, x}, the sentence-level representations h^s = {h^s_{c_1}, h^s_{c_2}, ⋯, h^s_{c_n}, h^s_x} are generated by the pretrained BERT model.
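To illustrate this encoding flow, the sketch below substitutes a toy mean-pooled character encoder for the pretrained BERT encoder En_s; the function names and the dimension are illustrative stand-ins, not part of the proposed model.

```python
# Toy sketch of the hierarchical encoding flow (Section 3.2). A mean-pooled
# bag-of-characters encoder stands in for the pretrained BERT encoder En_s;
# encode_sentence, encode_hierarchically, and DIM are illustrative names.
DIM = 8

def encode_sentence(sentence):
    """Stand-in for En_s: map a sentence to a fixed-size vector h_s."""
    vec = [0.0] * DIM
    for ch in sentence:
        vec[ord(ch) % DIM] += 1.0
    n = max(len(sentence), 1)
    return [v / n for v in vec]

def encode_hierarchically(contexts, sentence):
    """Encode the context sentences c_1..c_n and the current sentence x
    into the sentence-level representation sequence h_s."""
    return [encode_sentence(s) for s in contexts + [sentence]]

h_s = encode_hierarchically(["I passed the exam.", "We celebrated."],
                            "I feel so happy today!")
assert len(h_s) == 3 and all(len(v) == DIM for v in h_s)
```

In the actual model, each of these per-sentence vectors would come from BERT; the point of the sketch is only the two-stage shape of the pipeline.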

Contextual-Level Modeling.
At the contextual level, the function En_c encodes the sentence-level representations h^s = {h^s_{c_1}, h^s_{c_2}, ⋯, h^s_{c_n}, h^s_x} into a context-level representation h^c, which is denoted as h^c = En_c(h^s). In the proposed model, the function En_c mainly consists of two layers: a BiGRU (Bidirectional Gated Recurrent Unit) network and an attention network.
BiGRU is aimed at dealing with the sequential information of contexts: taking the sentence-level representations h^s as input, it generates hidden states that encode both forward and backward context. The attention network is aimed at making the network pay more attention to essential contexts. The attention mechanism considers the contribution of each previously occurring contextual sentence c_i ∈ c to the prediction of the current sentence x, and more attention weight is assigned to related contexts. The attention weight a_i and the weighted emotional feature vector h^c are defined as follows:

a_i = σ(w_2 tanh(w_1 h^s_{c_i} + b_1) + b_2),
h^c = Σ_i a_i h^s_{c_i},

in which σ indicates the sigmoid activation function and w_1, b_1, w_2, b_2 indicate model parameters.
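A simplified sketch of this attention pooling is given below. To keep it self-contained, each context vector is collapsed to its mean so that w_1, b_1, w_2, b_2 become scalars; the parameter values are illustrative stand-ins for the learned ones, not the paper's.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def attention_pool(h_list, w1=0.5, b1=0.0, w2=1.0, b2=0.0):
    """Simplified sketch: gate each context vector h_i with a sigmoid
    weight a_i and return the weighted sum h_c (illustrative scalar
    parameters in place of the learned w_1, b_1, w_2, b_2)."""
    weights = [sigmoid(w2 * math.tanh(w1 * sum(h) / len(h) + b1) + b2)
               for h in h_list]
    dim = len(h_list[0])
    h_c = [sum(a * h[d] for a, h in zip(weights, h_list)) for d in range(dim)]
    return weights, h_c

weights, h_c = attention_pool([[0.2, 0.4], [0.9, 0.1], [0.5, 0.5]])
assert len(h_c) == 2 and all(0.0 < a < 1.0 for a in weights)
```

In the real model the weights are matrices and vectors applied to the full BiGRU hidden states; the sketch only shows how the gates weight and pool the context representations.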
In a typical contextual network, h^c is fed into the classifier for the final prediction. The classifier typically consists of a linear transformation followed by a sigmoid operation to normalize the outputs so that each element of p lies in the range [0, 1]. A multilabel neural network is typically trained by minimizing the binary cross entropy (BCE) between the true label distribution Y and the predicted distribution P:

L_BCE = −Σ_i Σ_k [y_ik log p_ik + (1 − y_ik) log(1 − p_ik)],

in which p_ik is the predicted probability of emotion e_k in the ith sample, and y_ik ∈ {0, 1} is the true label.
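For reference, the BCE objective for one sample can be sketched in plain Python; the eps clipping avoids log(0), and the function name is illustrative.

```python
import math

def bce_loss(y_true, p_pred, eps=1e-12):
    """Per-sample binary cross-entropy over K emotion labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident correct predictions cost less than uniform guessing.
assert bce_loss([1, 0, 0, 1], [0.9, 0.2, 0.1, 0.8]) < \
       bce_loss([1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5])
```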
The above-mentioned typical network is intuitive and straightforward and is widely utilized in multilabel classification problems. However, emotion recognition is a more complex problem: this typical network with the BCE loss function can be less effective and generalize poorly due to its ignorance of label correlations. To capture label correlations, a joint learning label embedding network is proposed, which is detailed in Section 3.2.3.

Label Embedding Network.
The label embedding is supposed to represent the semantics of and relations between emotion labels. The embedding is denoted by E ∈ ℝ^{K×d}, in which K is the number of emotion labels and d is the dimension of the label embedding. Each row of E represents an emotion label.
To make the label embedding contribute to the emotion recognition network, the most intuitive way is to compare the emotion representation of the contextual input with the label embedding through emotional interaction.
Let the function En_e be an emotion projector that maps the contextual-level representation h^c into an emotion representation h_e: h_e = En_e(h^c).
Thus, the prediction of all possible emotion labels can be given based on the feature interaction between the emotional feature h_e and the label embedding matrix E. We first apply an element-wise product to h_e and the label embedding E_k of each emotion e_k:

h_{e,k} = h_e ⊙ E_k,

where h_{e,k} denotes the label-aware feature representation, which incorporates the information of the input and a particular emotion label e_k. In this way, the probability of containing emotion e_k is defined as

p_k = σ(w_c h_{e,k} + b_c),

in which σ indicates the sigmoid activation function that normalizes the prediction of each emotion p_k to the range [0, 1], and w_c, b_c indicate model parameters. The final prediction is given by P = {p_1, ⋯, p_k, ⋯, p_K}.
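A minimal sketch of this interaction layer follows, with toy numbers; the values of w_c, b_c, and the embeddings are illustrative, not learned parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_emotions(h_e, E, w_c, b_c):
    """For each emotion k: h_{e,k} = h_e (element-wise *) E_k, then a shared
    linear scorer with sigmoid gives p_k."""
    probs = []
    for E_k in E:                                   # one row per emotion
        h_ek = [a * b for a, b in zip(h_e, E_k)]    # element-wise product
        probs.append(sigmoid(sum(w * v for w, v in zip(w_c, h_ek)) + b_c))
    return probs

h_e = [1.0, -0.5]
E = [[1.0, 0.0], [-1.0, 1.0]]     # K = 2 toy emotion embeddings
p = predict_emotions(h_e, E, w_c=[1.0, 1.0], b_c=0.0)
assert len(p) == 2 and p[0] > 0.5 > p[1]
```

An emotion whose embedding aligns with h_e receives a higher score, which is the intended effect of the label-aware interaction.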

Training Objectives.
For the multilabel emotion recognition task, the training objective is often based on binary cross-entropy (BCE). However, the BCE loss function treats each emotion as an independent individual and does not consider their relationships. Emotion correlation plays an essential role in this task, which makes emotion recognition a more complex problem than traditional text classification. To guide the model to learn emotion correlations during training, we propose an assembled training objective that considers all aspects.
Figure 1: The framework of the proposed hierarchical network with label embedding.

Training Objective on the Output Layer.
To minimize the loss between the true label distribution and the output distribution, a label-correlation aware multilabel loss function is applied at the output layer, determined as follows:

L_M = Σ_i (1 / (|Y_i| |Ȳ_i|)) Σ_{k∈Y_i} Σ_{l∈Ȳ_i} exp(−(p_i^k − p_i^l)),

where Y_i denotes the set of positive emotions for the ith sample x_i, and Ȳ_i denotes the negative emotions. p_i^k and p_i^l are the output probabilities of positive emotion e_k and negative emotion e_l, respectively. Therefore, training with the above loss function is equivalent to maximizing the difference (p_i^k − p_i^l), which leads the system to output larger values for positive emotions and smaller values for the others.
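This pairwise exponential term (in the style of BP-MLL) can be sketched for a single sample as follows; the function name is illustrative.

```python
import math

def pairwise_exp_loss(y_true, p_pred):
    """Average exp(-(p_k - p_l)) over positive/negative label pairs of one
    sample; small when every positive emotion outscores every negative one."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    if not pos or not neg:
        return 0.0   # no pairs to compare
    total = sum(math.exp(-(pk - pl)) for pk in pos for pl in neg)
    return total / (len(pos) * len(neg))

good = pairwise_exp_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
bad = pairwise_exp_loss([1, 0, 1, 0], [0.4, 0.6, 0.3, 0.7])
assert good < bad   # correct ranking of positives over negatives costs less
```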

Training Objective on Label Embedding.
Given a contextual input x_i, its positive label set is Y_i and its negative label set is Ȳ_i. The emotion representation h_e is learned as in Equation (7). In the proposed network, nonlinear label embedding is utilized to guide the final prediction P through its similarity features with h_e. In this way, we assume that h_e can in turn be used in the training of the label embedding by being pulled closer to the embeddings of positive emotions and pushed farther from those of negative emotions.
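This pull-push behavior can be sketched with a standard cosine-embedding formulation: 1 − cos for positive labels and a hinged cos for negative ones. The zero margin and the toy vectors below are illustrative choices, not values from the paper.

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosine_embedding_loss(h_e, E, y_true, margin=0.0):
    """Pull h_e toward embeddings of positive emotions (1 - cos) and push it
    away from negative ones (hinged cos); margin is an illustrative choice."""
    loss = 0.0
    for y, E_k in zip(y_true, E):
        c = cos(h_e, E_k)
        loss += (1.0 - c) if y == 1 else max(0.0, c - margin)
    return loss

h_e = [1.0, 0.0]
E = [[2.0, 0.0], [0.0, 3.0]]   # aligned positive, orthogonal negative
assert cosine_embedding_loss(h_e, E, [1, 0]) < 1e-9
```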
To measure the distance between the emotion representation h_e and the label embeddings, a cosine embedding loss is utilized:

CosLoss = Σ_{k∈Y_i} (1 − cos(h_e, E_k)) + Σ_{l∈Ȳ_i} max(0, cos(h_e, E_l)).

To guarantee that the label embedding can encode semantic features among labels, we introduce an additional network to recognize each emotion from its corresponding label embedding. For each emotion e_i, with label embedding E_i, the prediction ê_i based on E_i is given by a classification layer over E_i, trained against the true label e_i. In summary, the assembled training objective of the proposed method is the sum of the multilabel loss, the cosine embedding loss, and the label embedding loss.

Dataset.
RenCECps is an annotated emotional corpus of Chinese blog texts. The corpus is annotated at the document, paragraph, and sentence level [31]. Each level is annotated with eight emotional categories ("Joy," "Hate," "Love," "Sorrow," "Anxiety," "Surprise," "Anger," and "Expect"). Our experiments are conducted at the sentence level, and the preceding two sentences of the current sentence are taken as context information. After preprocessing, there are 24310 contextual sentences in the training data and 6746 in the testing data.
For each emotion e_i, its cumulative number CN_i is calculated as

CN_i = Σ_n y_{n,i},

in which y_{n,i} is the annotation of emotion e_i in the nth sample. The statistical results are shown in Figure 2. Label cardinality (LCard) is a standard measure of multilabeledness and means the average number of emotions contained per sentence of the corpus [32]. In RenCECps, LCard is 1.4468.
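LCard can be computed directly from the binary annotation matrix; a small sketch with toy annotations (one row per sentence, one column per emotion):

```python
def label_cardinality(Y):
    """Average number of positive emotion labels per sentence."""
    return sum(sum(row) for row in Y) / len(Y)

Y = [[1, 0, 0],   # sentence 1: one emotion
     [1, 1, 0],   # sentence 2: two emotions
     [0, 1, 1]]   # sentence 3: two emotions
assert label_cardinality(Y) == 5 / 3
```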

Evaluation Metrics.
The global performance is evaluated by the micro- and macro-F1 scores. The F1 score is the harmonic mean of precision and recall; the micro-F1 score gives each sample the same importance, while the macro-F1 score treats all classes as equally important. Some popular evaluation measures typically utilized in multilabel classification are also used to measure the efficiency of the proposed method. Hamming Loss (HL) is the fraction of labels that are incorrectly predicted. Coverage evaluates how far, on average, it is needed to go down the ranked emotion list to cover all the relevant emotions of an instance. One Error (OE) evaluates the fraction of sentences whose top-ranked emotion is not in the relevant emotion set. Ranking Loss (RL) evaluates the average fraction of label pairs that are reversely ordered for an instance.
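These measures can be illustrated with minimal single-instance reference implementations (a sketch; real evaluations average over the whole test set, and library definitions may differ in off-by-one conventions for coverage):

```python
def hamming_loss(y_true, y_pred):
    """Fraction of labels predicted incorrectly (binary predictions)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def one_error(y_true, scores):
    """1 if the top-ranked emotion is not a relevant one, else 0."""
    top = max(range(len(scores)), key=lambda k: scores[k])
    return 0 if y_true[top] == 1 else 1

def ranking_loss(y_true, scores):
    """Fraction of positive/negative label pairs ranked in reverse order."""
    pos = [k for k, y in enumerate(y_true) if y == 1]
    neg = [k for k, y in enumerate(y_true) if y == 0]
    bad = sum(scores[k] <= scores[l] for k in pos for l in neg)
    return bad / (len(pos) * len(neg))

def coverage(y_true, scores):
    """Rank position (0-based) needed to cover all relevant emotions."""
    ranked = sorted(range(len(scores)), key=lambda k: -scores[k])
    return max(ranked.index(k) for k, y in enumerate(y_true) if y == 1)

y, s = [1, 0, 1, 0], [0.9, 0.3, 0.6, 0.5]
assert hamming_loss([1, 0, 1, 0], [1, 0, 0, 0]) == 0.25
assert one_error(y, s) == 0 and ranking_loss(y, s) == 0.0 and coverage(y, s) == 1
```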

Experimental Details.
For a given sentence, its preceding two sentences are taken as contextual sentences. There are in total 8 emotion labels annotated for each sentence, and the dimension of the label embedding is set to 256. The dimension of the hidden state of the GRU cell is set to 384, half of 768, which is the dimension of the sentence-level embedding.
During the model training, the learning rate is set to 2e-5, and the batch size is set to 128. Adam optimization method is applied to train the model by minimizing the proposed training objective.

Baselines.
In this section, we report the experimental results of our proposed method and the baseline models. Additionally, we analyze the influence of the training objectives on the output layer and the label embedding.
We compare our proposed model with six baseline methods as follows.
(1) RERc [18]: a novel framework based on relevant emotion ranking to identify multiple emotions and produce the rankings of relevant emotions from text.
(2) HANs [8]: it has a hierarchical structure that mirrors the hierarchical structure of documents, with two levels of attention mechanisms applied at the word and sentence level. In our experiments, the sentence-level encoder of HANs is replaced by the pretrained BERT model.
(3) EDL [33]: Emotion Distribution Learning, it learns a mapping function from texts to their emotion distributions describing multiple emotions and their respective intensities based on label distribution learning.
(4) EmoDetect [34]: it outputs the emotion distribution based on a dimensionality reduction method using nonnegative matrix factorization, which combines several constraints such as emotions bindings, topic correlations, and emotion lexicons in a constraint optimization framework.
(6) Rank-SVM [36]: it adapts the maximum margin strategy to deal with multilabel data and focuses on distinguishing relevant labels from irrelevant ones while neglecting the rankings among the relevant ones.
The comparison experiments of the baseline HANs are implemented based on the open-source code shared on GitHub, and the results of the other baselines are adopted from the published papers. The comparison results are shown in Table 1; the best result on each metric is in italics.

Experimental Results and Discussions
As shown in Table 1, our proposed method significantly outperforms the baseline models and achieves satisfactory performance on the RenCECps dataset. For example, compared to the baseline RERc, our model achieves an improvement of 10.73% in micro-F1 score. On the multilabel evaluation measures, our model achieves reductions of 46.15% in ranking loss and 21.78% in one error. Compared to the other baselines, our model also achieves satisfactory results, which demonstrates the effectiveness of the proposed method.

The Effectiveness of Label Embedding Layer.
Our proposed model is an extension of the HANs baseline; in our experiments, the sentence-level encoder of HANs is replaced by the pretrained BERT model. Therefore, by comparing the results of these two models, we can assess whether the addition of the label embedding layer is effective for the subtask of emotion correlation learning.
As the results in Table 1 show, the proposed model outperforms the HANs baseline, improving the micro-F1 score from 0.5573 to 0.5665 and the macro-F1 score from 0.4003 to 0.4186. On the multilabel evaluation measures, our model reduces the ranking loss from 0.1136 to 0.1131, the one error from 0.3623 to 0.3559, and the hamming loss from 0.2075 to 0.1998.
Both the proposed method and the HANs baseline give predictions based on the contextual representation learned from a hierarchical network. HANs directly feeds it into the output layer for the final prediction, which mainly consists of a fully connected layer and an activation function such as sigmoid. This implementation is intuitive and straightforward, and it is also a common approach in most multilabel classification tasks. However, such an implementation treats the emotion recognition task as a general text classification task. It does not consider the correlation between emotion labels, such as the fact that the probability of cooccurrence of "Love" and "Happy" is higher than that of "Love" and "Sad." In our proposed model, a label embedding space is introduced for emotion correlation learning. The final prediction is based on the interaction between the emotion representation of the input text and the label embedding matrix. To guarantee that the label embedding matrix learns the semantic features among labels, the training objective on label embedding is utilized to guide the training. The predicted probability given by the label embedding matrix, as in Equation (13), is visualized in Figure 3. The results in the figure clearly show that the label embedding matrix can accurately predict the corresponding emotion, which means that the emotional information of each label has actually been learned in the label embedding matrix.

The Effectiveness of Training Objectives. As described in Section 3.3, we proposed an assembled training objective to realize the joint learning of both the emotion recognition task and the label embedding task. To evaluate the effectiveness of the training objectives and the label embedding network, we train the proposed model with different training objectives; the results are shown in Table 2. The symbols "M," "C," and "L" denote the multilabel loss, as in Equation (10), the cosine embedding loss, as in Equation (11), and the label embedding loss, as in Equation (14), which are utilized for training. As Table 2 shows, compared with the assembled training objective ("M+C+L"), the proposed model trained with only the multilabel loss ("M") on the output layer suffers a reduction of 2.22% in micro-F1 and 1.39% in macro-F1 and an increase of 12.37% in ranking loss, 6.41% in one error, and 4.52% in coverage. This suggests that the proposed assembled training objective contributes to the classification improvement.
The experimental results of the proposed model trained with "M+C" and "M" indicate the contribution of the cosine embedding loss to the training of the label embedding matrix. The cosine embedding loss guides the training of the label embedding by pulling the emotion representation of the input closer to the embeddings of positive emotion labels and pushing it farther from those of negative emotion labels.
The comparison of the proposed model trained with "M+L" and "M" indicates that the addition of the label embedding loss is effective. The label embedding loss guarantees that the trained label embedding matrix is able to encode semantic features among emotion labels.

Conclusions
In this paper, we proposed a hierarchical network with label embedding for contextual emotion recognition. Our method hierarchically encodes the given sentence based on its contextual information and trains a label embedding matrix with an assembled training objective to realize emotion correlation learning. The experimental results show the strong ability of the proposed method to learn emotional features for contextual emotion recognition. In the future, it would be interesting to incorporate background resources, such as emotion lexicons and knowledge graphs, to make the system more accurate and robust.