Human Activity Recognition Based on Residual Network and BiLSTM

Due to the wide application of human activity recognition (HAR) in sports and health, a large number of HAR models based on deep learning have been proposed. However, many existing models ignore the effective extraction of the spatial and temporal features of human activity data. This paper proposes a deep learning model based on a residual block and bi-directional LSTM (BiLSTM). The model first extracts the spatial features of the multidimensional signals of MEMS inertial sensors automatically using the residual block, and then obtains the forward and backward dependencies of the feature sequence using BiLSTM. Finally, the obtained features are fed into a Softmax layer to complete the human activity recognition. The optimal parameters of the model are obtained by experiments. A homemade dataset containing six common human activities (sitting, standing, walking, running, going upstairs and going downstairs) was developed. The proposed model is evaluated on our dataset and on two public datasets, WISDM and PAMAP2. The experimental results show that the proposed model achieves accuracies of 96.95%, 97.32% and 97.15% on our dataset, WISDM and PAMAP2, respectively. Compared with some existing models, the proposed model has better performance and fewer parameters.


Introduction
HAR has received a lot of attention in recent years for its applications in smart homes, fall detection for the elderly, sports training, medical rehabilitation, and misbehavior recognition [1,2]. For example, by analyzing the movements of elderly people living alone, falls can be detected so that family members can be alerted in time. Fitness enthusiasts can obtain their own exercise data by counting steps and recognizing exercise status to achieve scientific exercise and fitness management. Doctors can diagnose patients with knee diseases by gait analysis. In the rehabilitation phase, the rehabilitation plan can be adjusted based on the movement data of patients with lower limb diseases. HAR technologies can be divided into two categories: camera-based and sensor-based [3,4]. The camera-based method extracts human activity features from the video stream of a camera placed in the human surroundings. Although this approach can visually display the details of human action, it suffers from privacy issues, and its performance depends on the background illumination. In contrast, the sensor-based approach has many advantages: it is unaffected by the surrounding environment, promises higher accuracy, and does not raise privacy concerns. Therefore, sensor-based approaches are more suitable for human activity recognition [5]. In this paper, we mainly discuss the problem of sensor-based HAR.
In existing studies, researchers placed smart devices on the waist [6], in a pants pocket [7], or on the wrist [8], using the inertial measurement unit (IMU) in the device to collect human activity data. The IMU, which includes an accelerometer and a gyroscope, measures the acceleration and angular rate of the human body. However, many existing models do not fully exploit the spatial and temporal information of the sensor data. Some networks [6,25,29] are relatively complex, making them difficult to run on devices with limited computing resources and memory. To solve these problems, in this study we propose a new DL model that cascades a residual network with a BiLSTM. First, we use residually connected convolutions (ResNet) [30] to extract the spatial features of the sensor data. Then we use a BiLSTM to obtain the forward and backward dependencies of the feature sequence.
The primary contributions of this work are as follows: (1) A new model, combining ResNet with BiLSTM, is proposed to capture the spatial and temporal features of sensor data. The rationality of this model is explained from the perspective of human lower limb movement and the corresponding IMU signal. (2) We introduce BiLSTM into ResNet to extract the forward and backward dependencies of the feature sequence, which is useful for improving the performance of the network. We analyze the impact of the model parameters on the classification accuracy, and the optimal network parameters are selected through experiments. (3) An HAR dataset, in which the human activity data are collected by a self-developed IMU board, was constructed. The IMU board is attached to the human shank to collect the activity data of the lower limbs. Our model performs well on this dataset. The proposed model was also tested on both the WISDM and PAMAP2 HAR datasets and outperforms existing solutions.
The rest of this paper is organized as follows. In Section 2, the proposed model is described. Section 3 describes the collection of the sensor data, the public HAR datasets, the experimental setup, and the experimental results and discussion. Section 4 concludes this paper.

Proposed Approach
With the massive application of MEMS IMUs in smartphones and wearable systems, HAR is gradually shifting from image-based to sensor-based. Figure 1 shows the signals of IMUs attached to the shanks of different people while running. The traditional HAR method first calculates features of the IMU signal over a period, such as the mean, variance, and maximum value of the sensor data in a sliding window, or the correlation coefficients between different channel signals. Then, the calculated features are used to judge the activity category by preset thresholds, or they are fed into a machine learning model for training and classification. As we know, for the same activity, there are differences between different people's movements and in one person's movements at different times; therefore, the calculated features vary considerably, and the features of different actions calculated by traditional methods tend to overlap. As shown in Figure 1, the signal amplitudes at certain points (c, d, e, f) in the running cycles of the two people are obviously different, which leads to large differences in the hand-crafted features. Therefore, it is hard to recognize human activity using such features, and more powerful feature extraction methods are needed.
ResNet [30] is an important improved CNN model with powerful local spatial feature extraction capability, widely used in the field of image recognition. It can also be used to extract the local features between different channels of the IMU signal within a small sampling segment, that is, the local spatial features of the IMU signal. However, human motion, especially the motion of the lower limbs, is non-rigid. There are irregular changes in the IMU signal caused by the movement of the lower limbs within a short period. For example, there is an irregular spike at point a for subject A, but the signal is smooth at point b for subject B in Figure 1. Extracting only the spatial features of the IMU signal may therefore easily lead to false recognition. Over a long time span, the sensor signal is relatively flat and periodic due to the stability and periodicity of human gait. We can exploit the dependence of the sensor signal over a long time to improve the recognition accuracy. Therefore, we consider using LSTM to extract the long-term dependency of the IMU signal.
BiLSTM is a special LSTM that can extract both forward and backward dependencies of a time sequence [23]. Based on the above analysis of the characteristics of human limb movement and the IMU signal, we propose a new model merging ResNet and BiLSTM. The architecture of the proposed model is presented in Figure 2. As shown in the figure, the input data are first processed by the residual block to extract the local spatial features of the data. Then, the flattened features are fed into the BiLSTM. A dropout layer follows the BiLSTM to avoid overfitting. After a dense layer, a Softmax layer yields a probability distribution over the classes at the end of the model.
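The paper does not provide source code; as a rough illustration, the pipeline described above (residual block → flatten → BiLSTM → dropout → dense → Softmax) can be sketched in PyTorch as follows. Layer sizes follow the parameters selected later in the paper (32 kernels of size 2 × 2, 64 LSTM units, dropout ratio 0.5, window length 100, six sensor channels); the projection convolution on the skip path and the last-timestep readout are our assumptions, not details stated by the authors.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 2x2 convolutions (stride 2, then 1) with BN and ReLU; a 2x2
    stride-2 convolution on the skip path matches the output shape."""
    def __init__(self, in_ch=1, out_ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=2, padding="same")
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))  # y = f(x) + h(x)

class ResNetBiLSTM(nn.Module):
    def __init__(self, n_classes=6, channels=6):
        super().__init__()
        self.res = ResidualBlock(1, 32)
        self.bilstm = nn.LSTM(32 * (channels // 2), 64,
                              batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(0.5)
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                      # x: (batch, 1, window, channels)
        z = self.res(x)                        # (batch, 32, window/2, channels/2)
        z = z.permute(0, 2, 1, 3).flatten(2)   # (batch, time, features)
        out, _ = self.bilstm(z)
        h = self.drop(out[:, -1, :])           # last-step BiLSTM output
        return self.fc(h)                      # class logits
```

The module returns logits; during training the Softmax is typically folded into the cross-entropy loss.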

Spatial Feature Extraction Based on ResNet
Due to the differences in lower limb movements between individuals and the complexity of human movements, the hand-crafted features extracted for different human activities easily overlap. It is difficult to separate these features by a threshold or a machine learning model, and manual features that are easy to distinguish are difficult to design. Traditional methods that rely on manual features therefore do not recognize human activities well. The CNN model can automatically extract the local spatial features of the sensor signal by learning from large numbers of samples. This powerful feature extraction capability improves the accuracy of HAR, and CNNs have been widely used in sensor-based HAR. However, increasing the number of convolutional layers in the model results in accuracy saturation. ResNet was proposed to solve this problem, and even in shallow networks the residual module is beneficial to network performance [30]. As shown in Figure 2, the residual module is composed of two convolutional layers connected in sequence, with a parallel skip connection added. In order to obtain the spatial features of the different channels of the sensor signal, a two-dimensional convolutional residual network is used. The first convolutional layer has 32 kernels of size 2 × 2, with a stride of 2. A Batch Normalization (BN) layer is added after this convolutional layer to speed up the training process and to mitigate internal covariate shift. The BN layer is followed by the ReLU activation function, which helps avoid gradient vanishing. The second convolutional layer has the same parameters except that its stride is 1. In order to make the output dimension of the two convolutional layers consistent with the original input dimension, the same two-dimensional convolution is performed in the parallel skip connection.
Let the input of the residual block be x; the output y can then be expressed as

y = f(x) + h(x) (1)

where f(x) represents the mapping learned by the stacked convolutional layers and h(x) represents the mapping learned by the shortcut connection. With the shortcut connection added, the residual network tends to learn identity mappings once the network reaches its optimal performance, so the added layers at least do not deteriorate the network performance, while the additional parameters give the model greater fitting capacity. The model parameters are described later.

BiLSTM Layer
As mentioned above, extracting only the local spatial features of the sensor signal of human activity is not enough for HAR. The RNN model has the ability to capture temporal information from time sequences. However, Bengio et al. [31] showed that RNNs can retain information for only a short time, owing to the vanishing and exploding gradient problem. LSTM is a special type of RNN that solves the long-term dependence problem of time series thanks to its special memory cells [32]. In this study, we use a variant of LSTM named BiLSTM to analyze the local spatial feature sequence and obtain the long-term regularity of the sensor signal. Figure 3 shows the cell of LSTM. LSTM is implemented through three gates: the input gate, the forget gate and the output gate. An LSTM unit can be defined and explained as follows, where the U, W are weight matrices and the b are bias terms:

i_t = σ(W_i · x_t + U_i · h_(t−1) + b_i) (2)

where i_t is the input gate at time t, · is matrix multiplication, σ is the sigmoid function, x_t is the input data at time t, and h_(t−1) is the output of the previous LSTM unit. The input gate determines which information from the previous unit needs to be updated.

f_t = σ(W_f · x_t + U_f · h_(t−1) + b_f) (3)

where f_t represents the forget gate, which weighs the importance of the stored information and forgets some old information.

The candidate state c̃_t is calculated with the tanh function as depicted in Equation (4). Then, the present cell state is computed as expressed in Equation (5), where ⊙ denotes element-wise multiplication:

c̃_t = tanh(W_c · x_t + U_c · h_(t−1) + b_c) (4)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ c̃_t (5)

In Equation (6), the output gate o_t is calculated. In Equation (7), h_t, the output of the LSTM unit, is obtained:

o_t = σ(W_o · x_t + U_o · h_(t−1) + b_o) (6)
h_t = o_t ⊙ tanh(c_t) (7)
The baseline LSTM predicts the current human activity based only on past data. Obviously, some information may be lost if the data are considered in only one direction. The BiLSTM is made up of two LSTM layers running in opposite directions. As shown in Figure 4, the output of the BiLSTM is determined jointly by the LSTMs in the forward and backward layers. In the BiLSTM, the output y_t is expressed as follows [33]:

y_t = [→h_t, ←h_t] (8)

where →h_t and ←h_t represent the results of the forward and backward LSTM units, respectively. The output y_t is formed by concatenating these two results.
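To make the gate equations concrete, the following pure-NumPy sketch implements one LSTM step and a bidirectional pass with concatenated outputs. The weights are random and the helper names (`lstm_step`, `bilstm`, `make_params`) are ours for illustration; this is not the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate 'i', 'f', 'c', 'o'."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])    # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])    # forget gate
    c_hat = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat                          # cell state (elementwise)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])    # output gate
    h_t = o_t * np.tanh(c_t)                                  # unit output
    return h_t, c_t

def make_params(d_in, d_hid):
    gates = 'ifco'
    W = {g: rng.normal(size=(d_hid, d_in)) for g in gates}
    U = {g: rng.normal(size=(d_hid, d_hid)) for g in gates}
    b = {g: np.zeros(d_hid) for g in gates}
    return W, U, b

def bilstm(xs, fwd_params, bwd_params, d_hid):
    """Run one LSTM forward and one backward over the sequence, then
    concatenate the hidden states at each step: y_t = [h_fwd_t, h_bwd_t]."""
    hf, cf = np.zeros(d_hid), np.zeros(d_hid)
    hb, cb = np.zeros(d_hid), np.zeros(d_hid)
    fwd, bwd = [], []
    for x in xs:
        hf, cf = lstm_step(x, hf, cf, *fwd_params)
        fwd.append(hf)
    for x in reversed(xs):
        hb, cb = lstm_step(x, hb, cb, *bwd_params)
        bwd.append(hb)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

With hidden size d, each y_t has dimension 2d, which is why the dense layer after the BiLSTM in the proposed model takes twice the number of LSTM units as input.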

The Collection of Homemade Dataset
Unlike most studies, in which devices such as smartphones are fixed to a person's waist to collect human activity data, in this paper the self-developed IMU module is fixed to the human body below the knee, at the shin, as shown in Figure 5. Among human activities, lower limb activities have attracted much attention. For example, runners want to know their amount of exercise, and monitoring the walking time of patients with lower extremity diseases during rehabilitation is also of interest. Therefore, we mainly consider the recognition of human lower limb activities. The sensor is placed below the knee to directly obtain the movement information of the lower limbs. The sensor module can be fixed together with commonly used knee pads and is very convenient to use in real applications. The IMU module is powered by a lithium battery (3.7 V, 230 mAh), which can be charged by USB. A 9-axis inertial sensor (MPU9250) is integrated on the module, including a 3-axis accelerometer, a 3-axis gyroscope, and a 3-axis magnetometer. The module carries a processor (Texas Instruments CC2642), which integrates an ARM Cortex-M4F microcontroller unit. The size of the module is about 3.5 cm (length) × 2.5 cm (width) × 1 cm (height). As the magnetometer in the IMU module may be affected by the surrounding environment, only the accelerometer and gyroscope are used here. The 3-axis acceleration and 3-axis angular rate data were collected by the accelerometer and gyroscope at a sampling frequency of 50 Hz. The dynamic ranges of the accelerometer and gyroscope outputs were set to ±8 g and ±2000 dps, respectively. The sensor data were first stored in the Flash memory of the IMU module, and then transmitted to the computer through a serial port for processing. Figures 6 and 7 show the acceleration and angular rate data collected during running and sitting within 77 s, respectively.
From Figure 6, we can see that running is an intense activity, and the fluctuation range of both the acceleration and the angular rate is relatively large. In this paper, five volunteers (aged from 22 to 32) from the laboratory participated in the data collection. Each volunteer was asked to perform six classes of human activities: sitting, standing, walking, running, going upstairs and going downstairs. Finally, a dataset containing 130,056 samples was made. The detailed information is shown in Table 1. The whole dataset was divided into training and validation sets, with 70% for training and 30% for validation.

The WISDM dataset includes 1,098,209 samples [7]. Thirty-six subjects used a smartphone placed in their trouser pockets to complete six daily activities. The data were collected by the accelerometer in the mobile phone at a sampling frequency of 20 Hz. The collected activities were walking (38.6%), jogging (31.2%), upstairs (11.2%), downstairs (9.1%), sitting (5.5%), and standing (4.4%).
The PAMAP2 dataset contains 18 daily activities, including 12 protocol activities (walking, running, rope skipping, vacuum cleaning, etc.) and six optional activities (watching TV, folding clothes, etc.). The activity information of nine subjects was collected by a heart rate meter, three IMU modules and a thermometer. The three IMU modules were placed on different positions on the subjects: one on the arm, one on the chest, and one on the ankle. The sampling frequency of the sensors is 100 Hz [34].

Data Preprocessing

Figure 8 shows the raw sensor data processing flow. The accelerometer and gyroscope sample at the same frequency of 50 Hz to obtain time series. After a period of sampling, the i-th sample vector obtained from the six-axis IMU can be represented as

X_i = (a_(x,i), a_(y,i), a_(z,i), g_(x,i), g_(y,i), g_(z,i)), i = 1, 2, . . . , N (9)

where (a_(x,i), a_(y,i), a_(z,i)) is the triaxial data sampled by the accelerometer and (g_(x,i), g_(y,i), g_(z,i)) is the triaxial data sampled by the gyroscope. A label y_i is assigned to each vector X_i to obtain I = {(X_i, y_i), i = 1, 2, . . . , N}. As the scales of the raw data from different sensors vary greatly, unprocessed data with large fluctuations degrade the performance of the network [35]. In addition, standardizing the data range helps training find the global optimum compared with using the raw data. Therefore, it is necessary to standardize the data. In this paper, the mean and standard deviation are used to standardize the accelerometer and gyroscope data. The x-axis accelerometer data are standardized according to Equation (10):

a'_x = (a_x − µ_x) / σ_x (10)
where a_x is the x-axis data of the accelerometer, and µ_x and σ_x represent the average value and standard deviation of all the raw x-axis accelerometer data, respectively. The output data of the other five axes are standardized in a similar way.
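For illustration, the per-axis standardization of Equation (10) is a one-liner in NumPy; the sample values below are hypothetical.

```python
import numpy as np

# hypothetical raw x-axis accelerometer samples
a_x = np.array([0.12, 9.81, -3.4, 5.0, 1.7, -0.6])

mu_x, sigma_x = a_x.mean(), a_x.std()
a_x_std = (a_x - mu_x) / sigma_x   # Equation (10): zero mean, unit variance
```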
In current research, the segmentation of basic actions mostly adopts a sliding window with a fixed window size and a fixed overlap rate. The cycle of dynamic actions such as walking is 1–2 s. In this paper, we also use a sliding window, with a window length of 100 samples (2 s at 50 Hz) and an overlap of 50%, to segment human actions.
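The windowing step can be sketched as follows; the function name `segment` is ours, and the shapes assume a (samples, channels) array as produced by the six-axis IMU.

```python
import numpy as np

def segment(data, win=100, overlap=0.5):
    """Split a (samples, channels) array into windows of length `win`
    with the given fractional overlap (stride = win * (1 - overlap))."""
    step = int(win * (1 - overlap))
    return np.stack([data[s:s + win]
                     for s in range(0, len(data) - win + 1, step)])
```

With win=100 and 50% overlap, consecutive windows start 50 samples apart, so a 500-sample recording yields 9 windows.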

Experimental Environment
The network in this paper was trained on a computer equipped with an Intel Core i9-9900 CPU, 16 GB RAM and a graphics processor (GPU) (NVIDIA GeForce GTX 1060 with 6 GB memory). The algorithm was implemented in Python 3.

Evaluation Index
To evaluate the performance of the proposed model for HAR, the following metrics [36] were used:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 · Precision · Recall / (Precision + Recall)
F_w = Σ_i ω_i (F1-score)_i

where TP and TN are the numbers of true positives and true negatives, respectively, and FN and FP are the numbers of false negatives and false positives. ω_i is the proportion of samples of class i.
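Given a confusion matrix like those reported later (Tables 3–5), these metrics can be computed directly; the helper `har_metrics` below is an illustrative sketch, not the authors' evaluation code.

```python
import numpy as np

def har_metrics(cm):
    """Per-class precision/recall/F1 plus overall accuracy and the
    weighted F1 (F_w) from a confusion matrix cm[true, predicted]."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as class i but wrong
    fn = cm.sum(axis=1) - tp          # class-i samples missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    omega = cm.sum(axis=1) / cm.sum() # class proportions ω_i
    f_w = (omega * f1).sum()          # weighted F1
    return accuracy, precision, recall, f1, f_w
```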

Optimization of Model Parameters
In order to obtain the optimal parameters of the proposed model, i.e., the size of convolution kernels, the number of convolution kernels, the number of LSTM units, and the dropout ratio, we adjust them in turn and finally select the appropriate parameters.
First, we analyze the effect of the size of the convolution kernel on the classification accuracy. Seven different convolution kernel sizes are tested. Figure 9 shows the effect of the convolution kernel size on the recognition accuracy. It can be seen that the network accuracy degrades as the size of the convolution kernel increases. We choose a convolution kernel size of 2 × 2 because it gives the best recognition accuracy.
Based on the optimal convolution kernel size, we analyze the effect of the number of convolution kernels on the recognition accuracy. We set the number of convolution kernels to 4, 8, 16, 32, 64, and 128 and record the accuracy. It can be seen from Figure 10 that as the number of convolution kernels increases, the recognition accuracy improves. When this parameter reaches 32, the further improvement in accuracy is very small; the accuracies for 32 and 64 kernels are basically the same. As a larger number of convolution kernels increases the model size and training cost, we choose 32 as the number of convolution kernels.

Figure 11 shows the effect of the number of LSTM units in the BiLSTM module on the recognition accuracy. It can be seen from the figure that the recognition accuracy improves as the number of LSTM units increases at first. When the number of LSTM units reaches 128, the accuracy of the proposed network becomes worse. In this paper, we choose 64 as the number of LSTM units. Similarly, we vary the dropout ratio from 0.1 to 0.9 in steps of 0.1 and test the accuracy of the network for each value. Figure 12 shows that the best network accuracy is obtained when the dropout ratio is 0.5.

Hyperparameters of the Model Trained
The proposed model was trained by minimizing the cross entropy using the Adam optimizer as in [37]. As in [38], the learning rate was first set to several constants, i.e., 0.00001, 0.0001, 0.001 and 0.01. By comparing the resulting accuracies, we find that the optimal learning rate probably lies between 0.0001 and 0.001 for our dataset and the WISDM dataset, and between 0.00001 and 0.0001 for the PAMAP2 dataset. We then search [0.0001, 0.001] with a step of 0.0001 for our dataset and WISDM, and [0.00001, 0.0001] with a step of 0.00001 for PAMAP2. Finally, we select 0.0003, 0.0006 and 0.00003 as the learning rates for our dataset, WISDM, and PAMAP2, respectively. The batch size used to train the model on the three datasets is 64, and the model is trained for 80 epochs. The hyperparameters used for model training are shown in Table 2.
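The coarse-to-fine search described above can be expressed compactly; `train_and_eval` is a hypothetical placeholder standing in for a full training run that returns validation accuracy for a given learning rate.

```python
def coarse_to_fine_lr(train_and_eval):
    """Two-stage learning-rate search: pick the best decade from a
    coarse grid, then scan that decade in ten equal steps."""
    coarse = [1e-5, 1e-4, 1e-3, 1e-2]
    best = max(coarse, key=train_and_eval)      # best coarse candidate
    fine = [best * k for k in range(1, 11)]     # e.g. 1e-4 -> 1e-4..1e-3
    return max(fine, key=train_and_eval)
```

In practice each call to `train_and_eval` is a full training run, so the coarse stage keeps the total cost to 14 runs rather than a dense scan over four decades.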

Experiment Result
The proposed model was trained and tested using the homemade human activity dataset. The model was trained for 80 epochs. The experimental results in Figure 13a show that both the training loss and the validation loss decrease as the number of epochs increases. The two curves are very close after 20 epochs, and after 70 epochs the curves remain basically stable. Figure 13b shows the accuracy curve of the model on the homemade dataset; the accuracy on the validation set finally reached 96.95%. The result indicates that the model performs well on the homemade human activity dataset.

Table 3 shows the confusion matrix of the proposed model on our dataset. In the table, HA1 = running, HA2 = walking, HA3 = standing, HA4 = sitting, HA5 = going upstairs, HA6 = going downstairs, PRC = precision, RCL = recall, and F1S = F1-score. The diagonal elements of the confusion matrix represent the numbers of correctly recognized activities, and the off-diagonal elements represent the numbers of incorrectly recognized activities. There is some confusion between going upstairs and going downstairs. This is mainly due to the great similarity between these two activities, which results in similar data being collected by the accelerometer and gyroscope. HA1 and HA4 are also sometimes confused. It can be seen from Figure 6 that the pedestrian is in a static state at the beginning of running, and the signal at this stage is similar to the signal of sitting. In addition, the amount of running data in the dataset is relatively small, which makes the model less sensitive to this mode. Therefore, the trained model incorrectly identifies a few running samples as sitting. The table lists the indicators for each category. Among them, the F1-scores of walking and sitting are up to 0.99; the lowest F1-score is that of running, at just over 0.92. The F_w is 0.9712, indicating that the network has good classification performance.
We compared the accuracy of the proposed model and two baseline models on the homemade dataset. The test was performed 10 times; each time, the dataset was randomly divided into a training set (70%) and a validation set (30%), and CNN, BiLSTM and our model were tested in turn. We then saved the accuracy of each model on the validation set and averaged it over the 10 runs. In the experiment, the CNN has two convolutional layers, each with 32 convolution kernels of size 2 × 2, and the BiLSTM model has 64 hidden units. The training hyperparameters are shown in Table 2. Figure 14 shows the average accuracies of our model, the CNN and the BiLSTM on the homemade dataset. As shown in the figure, the accuracy of the proposed model is higher than that of the CNN and the BiLSTM. The proposed model can extract more comprehensive feature information from the human activity data, so its accuracy is better than that of the CNN model and the BiLSTM model.

Model Performance on Public Datasets
Although the proposed model has achieved good results on our homemade dataset, to comprehensively verify its performance, the model was also tested on the public WISDM and PAMAP2 datasets.

Performance on WISDM Dataset
The WISDM dataset was collected by a single accelerometer in a smartphone placed in the trouser pocket with a sampling frequency of 20 Hz [7]. As in the preprocessing of our homemade dataset, the sliding window size is set to 40 and the overlap is 50%. The whole dataset is divided into a training set (70%) and a validation set (30%). The training hyperparameters are set as in Table 2. As shown in Figure 15, the accuracy of the proposed model on WISDM is 97.32%. Table 4 is the confusion matrix on the WISDM dataset, where HA1 = downstairs, HA2 = jogging, HA3 = sitting, HA4 = standing, HA5 = upstairs, and HA6 = walking. The table shows that the F1-scores of HA2, HA3, HA4, and HA6 reach 0.99, while HA1 and HA5 are easily confused, as the accelerations produced by these activities are similar. Figure 16 shows the upstairs and downstairs signals from WISDM. As shown in the figure, there is no obvious difference between the signals for downstairs and upstairs; the amplitude and frequency of the signals of the two activities are similar. The F_w is 97.31%.

Performance on PAMAP2 Dataset
In this paper, the data of the 18 channels collected by the three IMU modules in the PAMAP2 dataset are selected to test the proposed model. The data from subjects 5 and 6 are used as the validation set, and the data from the other subjects are used as the training set. The sliding window size is set to 100 and the overlap is 50%. The training hyperparameters are shown in Table 2. The 12 protocol activities in the dataset were used for testing. As shown in Figure 17, the accuracy of the model on this dataset is 97.15% and the F_w is 97.35%. Table 5 shows the confusion matrix of the proposed model on the PAMAP2 dataset. In the table, HA1 = lying, HA2 = sitting, HA3 = standing, HA4 = walking, HA5 = running, HA6 = cycling, HA7 = Nordic walking, HA8 = ascending stairs, HA9 = descending stairs, HA10 = vacuum cleaning, HA11 = ironing, and HA12 = rope jumping.

Comparison with Existing Work
In the following, we compare the proposed model with related work from recent years. The selected works, based on the WISDM and PAMAP2 datasets, are listed in Table 6. It can be noted that the recognition accuracy and F_w of our model are improved compared with the related work. First, our model is more accurate than CNN, TSE-CNN and SC-CNN, because these models ignore the long-term dependence of the sensor feature data. Second, our model is more accurate than hybrid models of CNN and LSTM, or of CNN and GRU, because BiLSTM can extract both forward and backward information from the time series, while LSTM or GRU considers only the forward information. In addition, the accuracy and F_w of our model are higher than those of the attention-based model. This shows that the performance of our model is better than that of most existing models. In terms of the number of parameters, our model is also smaller than those in [6,29,39]; smaller models are easier to deploy on mobile devices.

Conclusions
A model incorporating a residual network and BiLSTM is proposed for the problem of HAR. The model can comprehensively extract the local spatial features of the sensor data and the bi-directional long-term dependencies within those spatial features. A new homemade HAR dataset, in which the human lower limb activity data are collected by a self-developed IMU module, was constructed. The proposed model was tested on this homemade dataset, achieving an accuracy of 96.95%. We also tested the proposed model on the WISDM and PAMAP2 datasets, and the results show that the accuracy of our model is 97.32% and 97.15%, respectively. Through comparison with previous work, we find that the proposed model clearly achieves improved accuracy and F_w. In addition, our model has fewer parameters than some existing models. We believe that our work has important application prospects in the fields of physical training, health management, human lower limb rehabilitation therapy, etc. The model will be further improved in the future to enhance its accuracy.