Identification of Multi-Class Drugs Based on Near Infrared Spectroscopy and Bidirectional Generative Adversarial Networks

Drug detection and identification technology is of great significance in drug supervision and management. To determine the exact source of drugs, it is often necessary to directly identify multiple varieties of drugs produced by multiple manufacturers. Near-infrared (NIR) spectroscopy combined with chemometrics is generally used in these cases. However, existing NIR classification modeling methods have great limitations in dealing with a large number of categories and spectra, especially when samples are insufficient and unbalanced and the cost of identification errors is sensitive. Therefore, this paper proposes a NIR multi-classification modeling method based on a modified Bidirectional Generative Adversarial Network (Bi-GAN). It makes full use of Bi-GAN's powerful feature extraction ability and good sample generation quality, using generated samples with obvious features, an equal number between classes, and a sufficient number within classes to replace the unbalanced and insufficient real samples in the course of spectral classification. In total, 1721 samples of four kinds of drugs produced by 29 manufacturers were used as experimental materials, and the results demonstrate that this method is superior to the comparative methods in drug NIR classification scenarios, with an optimal accuracy rate above 99% under ideal conditions.


Introduction
In the drug market, different drugs and different brands have different pricing. Sellers can put fake packaging on low-cost pharmaceutical products and sell them as high-priced drugs. They may also pass off inferior brands of a drug as famous-brand products and sell them at high prices. Therefore, identifying the true source of drugs through classification and identification of multiple drugs produced by multiple manufacturers is of great significance for drug supervision.
"DeepFake" programs [43][44][45] based on modified GANs can even generate synthetic faces that cannot be distinguished from real ones by either humans or machines, and this excellent sample-generation quality has led to discussions of artificial intelligence ethics. Unfortunately, in the field of near-infrared spectrum detection (including drug near-infrared spectrum detection), and even in the wider field of sensor data processing, there are no reports on the application of GAN methods. Therefore, we can only adapt an appropriate GAN method best suited for feature extraction and the on-demand generation of category samples. The candidate methods are Info-GAN [46], Bi-GAN, VAE-GAN [47], etc. Based on our experience and the implementation difficulty, we finally selected Bi-GAN as the modification object to achieve the goal of this paper.
Based on this background, this paper constructs a multi-classification model of drugs based on near-infrared spectroscopy and Bi-GAN sample generation, so that it can classify correctly in scenarios with a large number of categories and spectra, and effectively solve the problems of insufficient samples, unbalanced samples, and cost-sensitive classification errors in the classification process.
The problems of insufficient samples, unbalanced samples, and cost-sensitive classification errors in scenarios with many categories and many spectra are common in other sensor data classification tasks as well, so this method can also be applied to classification of data obtained by other sensors.

Materials
All the materials used in this paper were obtained from the China Institute for Food and Drug Control (Beijing, China). A total of 1721 samples of four drugs (metformin hydrochloride tablets, chlorpromazine hydrochloride tablets, chlorphenamine maleate tablets, cefuroxime axetil tablets) produced by 29 manufacturers were collected. All samples were measured by FTIR spectroscopy (Matrix F spectrometer, Bruker Corporation, Billerica, MA, USA). Before sample collection, the instrument passed a self-diagnosis test and calibration. The spectral range of the data is 4000–11,995 cm−1, and the resolution is 4 cm−1.
The near-infrared spectra of the drugs were recorded using a diffuse reflection optical fiber probe. SMA 905 standard interfaces were used for coupling the optical fiber, light source, and spectrometer. The ambient temperature was 18–30 °C, and the air humidity was less than 70%. All samples used the same determination background. The measurement operation followed a unified operation protocol, as shown in Figure 1.
Sample information is shown in Table 1. It can be seen from this table that the number of samples is not balanced: some classes have as many as 135 samples, while others have as few as 21. The sample numbers within each class, sorted from high to low, form the column chart shown in Figure 2, and the distribution histogram is shown in Figure 3.
As can be seen from Figures 2 and 3, the samples are concentrated in the top eight categories, which account for 46.31% of the total, while the remaining 21 categories together represent only 53.69%. For more than half of the classes, the number of samples ranges from 21 to 54, which is less than the average sample size of 59 and far less than the highest sample count of 135. It can therefore be asserted that the number of samples in this dataset is extremely unbalanced, and that many categories have insufficient samples.
The spectra of the four kinds of drugs produced by the various manufacturers are shown in the figures that follow.
Generally, the time from R&D to final registration of an original drug is about 15 years, and it must undergo four phases of clinical trials at a cost of hundreds of millions of dollars. Such drugs cannot be imitated until the patent has expired, and they enjoy the protection of policies such as separate pricing. Generic drugs only replicate the main components of the original drug, and even with a huge investment in the generic process, their price is only about 1/3 or even 1/6 that of the original drug. It is therefore understandable that manufacturers make generic drugs as consistent with the original drug as possible, so that they cannot be distinguished. This is very challenging for classification modeling: the algorithm must be able to distinguish subtle differences between classes when extracting class features.

Methods
Under the above severe classification requirements, we build a classifier based on the Bi-GAN sample-generation method to achieve fair and accurate classification.
Its main idea is to use a generative adversarial network to generate artificial samples that supplement and rebalance the original samples, as shown in Figure 6. Through a fair and reasonable sampling strategy, each category receives sufficient attention during model construction, which effectively alleviates the shortcomings caused by insufficient intra-class samples and unbalanced inter-class samples in drug near-infrared spectrum classification, so that the cost-sensitive misclassification problem can be handled effectively.

The key to its realization lies in the modification of the original Bi-GAN. On the one hand, the original Bi-GAN must gain the ability to generate samples of a specified class instead of random "real" samples; on the other hand, classification supervision must run through every stage of Bi-GAN training, so that the training of the generator and discriminator is guided by the classification loss.

Original Bi-GAN
The internal structure of the original Bi-GAN is shown in Figure 7. Its main objective is shown in Equation (1):

min_{G,E} max_D V(D, E, G), (1)

where:

V(D, E, G) = E_x[log D(x, E(x))] + E_z[log(1 − D(G(z), z))]. (2)

In Equations (1) and (2), G is the generator, which can be regarded as the decoder; D is the discriminator and E is the encoder. x represents a real sample, and E(x) represents its representation encoded into the latent space, which is also the extracted feature. z is a random sample of the prior distribution, and G(z) represents the sample generated from z. y indicates the data source: if the data to be discriminated comes from a real sample x, then y = 1; if it comes from a generated sample G(z), then y = 0.
Equation (2) shows that Bi-GAN binds the original spectrum x to its extracted feature E(x) and binds the generated sample G(z) to its prior-distribution sample z; the two pairs are then labeled 1 and 0, respectively. The discriminator D is required to distinguish them to the maximum extent, while the generator G is required to prevent D from distinguishing them. After training D and G alternately, they reach a Nash equilibrium. At this point, the generated samples can be considered almost indistinguishable from the "real" samples, and G has become a usable sample generator. The effectiveness of this method in sample generation and feature extraction has been confirmed in reference [22].
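The pairing described above can be sketched numerically. The following toy example (random linear maps standing in for E, G, and D; all dimensions are illustrative, not the paper's) evaluates the adversarial value function on the joint pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy dimensions: spectra of length 8, latent codes of length 3.
n, d_x, d_z = 16, 8, 3

# Stand-ins for the three networks (random linear maps for illustration).
W_E = rng.normal(size=(d_x, d_z))          # encoder E: x -> E(x)
W_G = rng.normal(size=(d_z, d_x))          # generator G: z -> G(z)
w_D = rng.normal(size=d_x + d_z)           # discriminator D on joint pairs

x = rng.normal(size=(n, d_x))              # "real" spectra
z = rng.normal(size=(n, d_z))              # prior samples

# Bi-GAN feeds *pairs* to the discriminator:
real_pair = np.hstack([x, x @ W_E])        # (x, E(x)), labeled y = 1
fake_pair = np.hstack([z @ W_G, z])        # (G(z), z), labeled y = 0

D_real = sigmoid(real_pair @ w_D)
D_fake = sigmoid(fake_pair @ w_D)

# Value function of Equation (2): D maximizes it, G and E minimize it.
V = np.mean(np.log(D_real)) + np.mean(np.log(1.0 - D_fake))
print(V)
```

At the Nash equilibrium described above, D cannot separate the two kinds of pairs and the value settles near its theoretical optimum.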
However, the prior-distribution sampling of the original Bi-GAN is usually random sampling from the standard normal distribution N(0,1), and the category of the generated sample G(z) is not guaranteed, so it is impossible to determine whether a generated sample belongs to a specified class. This does not meet our goal of generating samples of a specific class. Therefore, we need to modify the original Bi-GAN so that the generator G can generate random samples of "specified categories" on demand.

The Modifications of Original Bi-GAN
The overall modified design based on the original Bi-GAN is shown in Figure 8, in which we made the following changes to Bi-GAN: (1) The sampling of z is limited.
We limit the sampling of z and set the mean and variance of P(E(xi)) as shown in Formula (3). The values of σ are all set to 1 at first and are then automatically adjusted according to the previous five history records during training. When generating a spectrum, a real spectral template xi must be specified, which also determines its class label ci. xi is encoded into E(xi) by the encoder E, and the mean and variance of the prior normal distribution P(E(xi)) are then determined according to Formula (3), so that the feature vector z can be randomly sampled with the fixed mean and the locally averaged variance.
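As a rough illustration of change (1), the sketch below draws z around an encoded template. The helper name `sample_z` is hypothetical, and the "average of the last five history records" rule is our reading of the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_z(e_xi, sigma_history):
    """Draw z around the encoded template E(x_i).

    sigma starts at the default value 1 and, once enough history exists,
    is replaced by the average of the last five recorded deviations.
    """
    if len(sigma_history) < 5:
        sigma = np.ones_like(e_xi)        # default before history builds up
    else:
        sigma = np.mean(sigma_history[-5:], axis=0)
    return rng.normal(loc=e_xi, scale=sigma)

e_xi = np.array([0.5, -1.2, 3.0])         # encoded template E(x_i), 3-dim toy
history = [np.abs(rng.normal(size=3)) + 0.1 for _ in range(7)]
z = sample_z(e_xi, history)
print(z)
```

Because the mean is pinned to E(xi), samples stay in the template's neighborhood of the latent space while the adaptive σ keeps some diversity.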
(2) The random G(z) is limited to a specified class. We do this by building a classifier. The classifier C in this paper is composed of an MLP and softmax. In pre-training, the real sample xi is used as its input; in formal training, the generated sample G(z) is used as its input. Its output is the predicted class ĉi.
The classifier is pre-trained, with the loss function shown in Equation (4):

Loss_classification = − Σ_{j=1}^{k} 1[ci = j] log ĉi,j, (4)

where ci is the class label of the real sample xi in pre-training (for a generated sample, the class label of its spectral template xi is taken), k is the total number of drug categories, and ĉi is the predicted category. G, D, and C are alternately optimized by the gradient descent algorithm, and the optimization objective is changed to the form shown in Equations (5) and (6). Under this objective, the loss of the discriminator during training is calculated as Equation (7):

Loss_D = −[yi log ŷi + (1 − yi) log(1 − ŷi)] + Loss_classification, (7)

where yi represents the data source (yi = 1 if the data to be discriminated comes from a real sample, yi = 0 if it comes from a generated sample), ŷi is the discriminator's prediction, and Loss_classification is the result of Equation (4). The loss of the generator G during training is calculated as Equation (8):

Loss_G = −[(1 − yi) log ŷi + yi log(1 − ŷi)] + Loss_classification. (8)

In this way, the classifier is involved in the whole training process of G and D. Over the iterations, the generator increasingly tends to generate samples of the same class as the template spectra.
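A minimal numerical sketch of these losses follows. The classification term is the cross-entropy described for Equation (4); the exact form of the generator's adversarial term is our assumption (a label-flipped binary cross-entropy):

```python
import numpy as np

def cross_entropy(c_true, c_pred):
    """Classification loss: c_true is a class index, c_pred a softmax
    probability vector over the k classes."""
    return -np.log(c_pred[c_true])

def discriminator_loss(y, y_hat, loss_cls):
    """Binary cross-entropy on the real/fake label plus the
    classification loss, so class supervision reaches D as well."""
    bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return bce + loss_cls

def generator_loss(y, y_hat, loss_cls):
    """The generator tries to flip D's prediction (assumed label-flipped
    BCE) and is also penalized when its sample is misclassified."""
    bce = -((1 - y) * np.log(y_hat) + y * np.log(1 - y_hat))
    return bce + loss_cls

c_pred = np.array([0.1, 0.7, 0.2])     # softmax output over k = 3 toy classes
l_cls = cross_entropy(1, c_pred)       # true class index 1
print(discriminator_loss(1.0, 0.8, l_cls))
print(generator_loss(0.0, 0.2, l_cls))
```

Adding `l_cls` to both losses is what lets the classification supervision "run through" every training step, as the text requires.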

Sampling Strategy in Data Set Processing
In this paper, after the spectra of each category are divided into a training set and a test set according to the chosen proportion, they do not directly participate in training and testing except for the pre-training of the classifier. Instead, the number of spectra of each category participating in training and testing is fixed at an equal number, and the spectra are extracted from the data set by random sampling with replacement.
The advantages of this method are as follows. Firstly, each class contributes an equal number of spectra to the training course, so equal attention is paid to every category, and categories with fewer samples are not ignored during training.
Secondly, even for the same spectrum template, the random sampling step in the generation phase means the finally generated spectra will differ, so diversity is guaranteed to a certain extent.
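The sampling strategy above can be sketched as follows (the helper `balanced_resample` and the toy two-class data are hypothetical illustrations, not the paper's implementation):

```python
import random
from collections import Counter

def balanced_resample(dataset, per_class, seed=0):
    """Draw an equal, fixed number of spectra per class by random
    sampling with replacement, so small classes are not ignored."""
    rng = random.Random(seed)
    by_class = {}
    for spectrum, label in dataset:
        by_class.setdefault(label, []).append(spectrum)
    resampled = []
    for label, spectra in by_class.items():
        resampled += [(rng.choice(spectra), label) for _ in range(per_class)]
    return resampled

# Hypothetical unbalanced toy set: class "A" has 5 spectra, "B" only 2.
data = [(f"spec_A{i}", "A") for i in range(5)] + [(f"spec_B{i}", "B") for i in range(2)]
balanced = balanced_resample(data, per_class=10)
print(Counter(label for _, label in balanced))
```

Sampling with replacement is what allows the fixed per-class count to exceed a small class's real sample count; the generator then turns each drawn template into a fresh synthetic spectrum.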

Application of Classifier
After the training of Bi-GAN, the three trained networks E, G, and C are taken out to form the structure shown in Figure 9, which is used to predict categories: the spectrum to be classified is input at x, and the predicted category ci is output.
For each true spectrum x to be predicted, we can repeat the input a fixed number of times. Owing to P(E(x)), the model produces a different synthetic spectrum each time, consistent with the real spectrum in category but with good diversity. Most of the prediction results should be consistent with the category of x, except for one or two abnormal values. By counting the frequency of the output results, we can select the category with the highest frequency as the final prediction for x.
In this way, when sampling from P(E(x)) occasionally produces a small-probability anomaly, the model is not disturbed by it, and the correct category is still selected.
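The voting procedure can be sketched as follows; the predictor here is a hypothetical stand-in for the stochastic E → P(E(x)) → G → C pipeline:

```python
import random
from collections import Counter

def vote_predict(predict_once, x, repeats=15):
    """Run the stochastic prediction pipeline several times and
    return the most frequent predicted category."""
    votes = Counter(predict_once(x) for _ in range(repeats))
    return votes.most_common(1)[0][0]

# Hypothetical stochastic predictor: usually class 7, occasionally an outlier.
rng = random.Random(42)
def noisy_predictor(spectrum):
    return 7 if rng.random() > 0.1 else 3

print(vote_predict(noisy_predictor, x=None))
```

Majority voting suppresses the occasional anomalous draw exactly as the paragraph above describes: a few outlier votes cannot outweigh the consistent majority.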

Experimental Environment
This paper uses the following hardware environment for the data modeling experiments: CPU Xeon 2678v3 (12 cores, 24 threads), 64 GB memory, 1 TB SSD, NVIDIA Tesla V100 GPU.

Multi-Classification Results
In the experiment, E, G, and C are constructed as multi-layer perceptrons (MLPs). The E network uses a 2074-120-30 MLP; the 120-unit and 30-unit layers are each preceded by dropout (0.2) and followed by batch normalization (BN). The activation function is ReLU.
The G network uses a 30-360-2074 MLP, and the activation function is also ReLU.
The C network classifier is designed as 2074-150-30-softmax, and the activation function is sigmoid.
All networks use the RMSprop optimizer with Keras' default parameters; the batch size is set to 60, and training runs for 150 epochs.
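To make the layer sizes concrete, the following sketch runs a random-weight forward pass through the three MLPs. Dropout, BN, and training are omitted and the weights are random, so this only checks the dimensions, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

def mlp(sizes):
    # Random weights only; this sketch verifies layer dimensions, not behavior.
    return [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(weights, v, softmax_out=False):
    for W in weights[:-1]:
        v = relu(v @ W)
    v = v @ weights[-1]
    if softmax_out:
        e = np.exp(v - v.max())
        return e / e.sum()
    return relu(v)

E = mlp([2074, 120, 30])     # encoder: spectrum -> 30-dim feature
G = mlp([30, 360, 2074])     # generator: 30-dim code -> spectrum
C = mlp([2074, 150, 30])     # classifier body before the softmax output

x = rng.normal(size=2074)            # a stand-in spectrum
z = forward(E, x)
x_gen = forward(G, z)
p = forward(C, x_gen, softmax_out=True)
print(z.shape, x_gen.shape, p.shape)
```

Note how the 30-dimensional feature layer of E matches the input dimension of G, which is what lets the encoded template E(xi) serve as the mean of the prior for z.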
The training set is used for modeling, and the test set is used to verify the effectiveness of the model. For example, in the first row of Table 1, for the metformin hydrochloride tablets produced by Shanghai Xinyi Pharmaceutical Factory Co., Ltd. (Zhengzhou, China), if the 94 samples are divided 9:1 into a training set and a test set, 85 samples are randomly selected for the training set for modeling, and the remaining nine samples are put in the test set to verify the effectiveness of the model. The nine test samples are invisible during the modeling period. They are "non-existent" external samples for the training process, but they are internal samples for the whole data set, because their distribution and internal properties are similar to those of the samples participating in the training.
Each experiment was conducted 10 times, and the best results were recorded. The experimental results are shown in Table 2. As can be seen from Table 2, when the training set proportion is more than 50%, the classification accuracy of the multi-classification model in this paper is above 99%; as the proportion decreases toward 4:6, the accuracy does not decrease accordingly, as if it had little relationship with the division of training and test sets. However, when the training set is only 40% of the total, the accuracy drops sharply from 99% to 92%. Since most categories in the data set have 30-50 spectra, missing data clearly begins to take effect when the training set accounts for less than 40%. When the training set proportion is only 30%, the smallest category has only six spectra available for training; when it reaches 20%, the smallest category has only four spectra, and most categories (15 out of 29) have 10 or fewer spectra for training.
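The per-proportion counts quoted above follow directly from the smallest class size of 21 (the rounding convention used here is our assumption):

```python
# The smallest class has 21 spectra. Reproduce, under a simple rounding
# assumption, how many of them remain for training at each proportion.
min_class = 21
for proportion in (0.9, 0.5, 0.4, 0.3, 0.2):
    n_train = round(min_class * proportion)
    print(f"{proportion:.0%} training set -> {n_train} spectra of the smallest class")
```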
To investigate the classification errors of each category, we draw the confusion matrix in the case of the most favorable classification (90% of the training set), as shown in Figure 10, and the confusion matrix in the case of the worst classification (20% of the training set), as shown in Figure 11.
From Figure 10, it can be seen that in the most favorable case (90% training set), all classes were classified perfectly except categories 1, 3, and 9 (for tooling reasons, the figure counts classes from 0, while the information table counts categories from 1; we apologize for the inconvenience).
The intra-class spectrum counts of categories 1, 3, and 9 are 94, 67, and 48, respectively; only one of them falls into the insufficient-data interval of 21-54 shown in Figure 2. Moreover, Category 4, which has the fewest spectra within its class, is classified well. This shows that even when classification errors occur in this situation, they are not caused by a lack of intra-class spectra.
As can be seen from Figure 11, the classes with lower classification accuracy are scattered fairly evenly across the intervals shown in Figure 2 rather than concentrated in the areas with insufficient data. This shows that the method in this paper plays its intended role in eliminating the adverse effect of insufficient intra-class spectra.
It can also be seen from Figure 11 that the classification errors are mainly misclassifications among different manufacturers of the same drug. Besides, in both Figure 10 and Figure 11, the classification accuracy of Categories 25-29 (corresponding to cefuroxime axetil tablets) is almost unaffected by the decrease of the training spectrum proportion. It follows that, in this method, the most important factor affecting classification accuracy is still the inherent characteristics of the spectra of the various pharmaceutical products, which is consistent with our original classification purpose.

Comparative Methods Results
The experimental results are compared with three kinds of algorithms: traditional linear classification algorithms, mainly PLS-DA and linear SVM; traditional nonlinear classification algorithms, mainly RBF SVM, k-NN, and BP-ANN; and deep learning algorithms of recent years, mainly DBN, SAE, and CNN. Among them: the number of components of PLS-DA is the same as the dimension of z in this paper, which is set to 30; the C value of the linear SVM is 1.
The k-NN's k is set to 1. The RBF SVM's gamma value is set to 0.0001, and its C value is set to 1. The BP-ANN takes two layers, with the number of units set to 2074-29, and the activation function is sigmoid.
In the DBN, only one layer of RBM (1037 units) is set, followed by a fully connected layer and a softmax classifier as the output.
The SAE's encoder and decoder each take two layers, with 2074-180-30 and 30-180-2074 units, respectively. The feature layer (30 units) is fully connected to a softmax classifier as the output.
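The SAE architecture can be made concrete with a structural sketch. The weights below are random placeholders purely to exhibit the layer shapes; a real SAE would be pre-trained layer-wise to minimize reconstruction error and the classifier then fine-tuned.

```python
# Structural sketch of the SAE baseline: a 2074-180-30 encoder, a
# 30-180-2074 decoder, and a softmax classifier on the 30-unit feature
# layer. Random weights are placeholders for the trained parameters.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))   # placeholder activation

def forward(x, sizes):
    for n_in, n_out in zip(sizes, sizes[1:]):
        W = rng.normal(scale=0.01, size=(n_in, n_out))
        x = sigmoid(x @ W)
    return x

x = rng.normal(size=(1, 2074))            # one spectrum of 2074 points
z = forward(x, [2074, 180, 30])           # 30-dim feature vector
x_rec = forward(z, [30, 180, 2074])       # reconstruction of the spectrum

W_cls = rng.normal(scale=0.01, size=(30, 29))
logits = z @ W_cls
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over 29 classes
print(z.shape, x_rec.shape, probs.shape)
```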
The CNN is constructed according to the optimal design of reference [15]. After all the models are well trained, we run the same tests following the method proposed in this paper. The accuracy comparison is shown in Table 3, from which it can be seen that, overall, the Bi-GAN classifier is the most accurate. The deep learning algorithms DBN, SAE, and CNN take second place, while PLS-DA and linear SVM remain usable, since traditional linear classification algorithms also discriminate the composition of drugs to a certain extent.
Except for PLS-DA, every algorithm encounters an inflection point in classification accuracy when the partition of training and test sets becomes extreme and the necessary class data are lacking. Among them, the nonlinear algorithms are more sensitive than the linear ones.
Among the drug multi-classification algorithms, the long-established PLS-DA algorithm still performs well, and it remains worthy of attention when only the influence of drug components matters and nonlinear factors can be ignored.
Although Bi-GAN has high accuracy, it is the most sensitive to missing data: once data are seriously lacking, its results easily deviate. The training and inferring times of the algorithms are shown in Table 4.
It can be seen from the table that, except for Bi-GAN, the training time of all algorithms decreases as the training-set proportion decreases, while the inferring time trends upward because the test set expands correspondingly.
The k-NN algorithm has the shortest training time, but its accuracy is the worst, and its inferring time is longer than that of most of the others.
Although PLS-DA, linear SVM, RBF-SVM, k-NN, and BP-ANN compute on the CPU, their training completes in less than 1 s because of their simple structure. However, the inferring time of linear SVM and RBF-SVM is longer than that of the nonlinear algorithms, including the deep learning algorithms.
The deep learning algorithms' training time is longer than that of linear algorithms, but their inference time is shorter.
Among the deep learning algorithms, the training and inference time costs of our method are average. Its cost is fixed and does not vary greatly with the division of training and test sets, which is a merit. PLS-DA, linear SVM, RBF-SVM, k-NN, and BP-ANN are trained and tested with the scikit-learn library, which computes on the CPU using only a single core and a single thread.
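The timing figures of Table 4 can be measured in a uniform way by bracketing the fit and predict calls of each model; the following sketch does so for the 1-NN baseline on synthetic data of the paper's dimensionality.

```python
# Sketch: measuring training and inference time for a scikit-learn
# baseline with time.perf_counter. Data are random placeholders.
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2074))
y_train = rng.integers(0, 29, size=200)
X_test = rng.normal(size=(50, 2074))

clf = KNeighborsClassifier(n_neighbors=1)

t0 = time.perf_counter()
clf.fit(X_train, y_train)                 # training phase
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
pred = clf.predict(X_test)                # inference phase
infer_time = time.perf_counter() - t0

print(f"train: {train_time:.4f}s, infer: {infer_time:.4f}s")
```

Because k-NN defers all distance computation to prediction, this sketch also illustrates why its training time is minimal while its inferring time is comparatively long.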

Discussion
Constructing a NIR classification model for multi-variety, multi-manufacturer drugs involves a large amount of data and complex categories, and its application scenarios are challenging. This paper starts from an analysis of the problems of insufficient samples within classes and unbalanced samples between classes in the near-infrared spectral classification of drugs, and then analyzes the cost-sensitive misclassification problems these cause. Through the modified Bi-GAN, quantitatively generated samples replace the original uneven real samples as the classification training basis, which effectively mitigates the above problems to an extent.
By constructing an appropriate network connection, using a suitable combination of cost functions, and adopting a fair sampling strategy, we achieved excellent classification results in the experiments. The results demonstrate that, in this scenario, the proposed method achieves a classification accuracy above 99% in most cases where the training set accounts for more than 50% of the whole data set. Moreover, although the accuracy of this method drops greatly when the training-set proportion is reduced to 40%, the classification accuracies are relatively stable before that point. As for time cost, although the training and inferring times of this method are at an average level compared with other deep learning methods, its cost is relatively constant and does not fluctuate with the increase or decrease of the number of samples in the dataset.
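The fair-sampling idea can be sketched as follows: draw an equal quota of samples per class with replacement. Here the balanced set is resampled from a real pool for illustration, whereas in the paper the balanced training set comes from the Bi-GAN generator; `n_per_class` is an illustrative quota.

```python
# Sketch of class-balanced sampling with replacement: every class
# contributes the same number of training samples regardless of how
# unbalanced the original pool is. Data are synthetic placeholders.
import numpy as np

def balance_by_class(X, y, n_per_class, rng):
    """Return a class-balanced resample of (X, y), drawn with replacement."""
    Xb, yb = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        pick = rng.choice(idx, size=n_per_class, replace=True)
        Xb.append(X[pick])
        yb.append(np.full(n_per_class, c))
    return np.concatenate(Xb), np.concatenate(yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2074))
y = np.repeat([0, 1, 2], [40, 15, 5])     # deliberately unbalanced classes
Xb, yb = balance_by_class(X, y, n_per_class=20, rng=rng)
print(np.bincount(yb))                    # → [20 20 20]
```

Sampling with replacement is what allows a class with only a handful of real (or generated) spectra to still fill its full quota.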
By comparing with the traditional and three newer kinds of drug classification algorithms, we can conclude that this method achieves our expectation by solving the pre-set problems. The accuracy and stability of multi-class drug identification by near-infrared spectroscopy are also improved to a certain extent, which can provide a useful reference for similar scenarios of near-infrared spectral analysis and sensor-signal data processing.

Conclusions
We propose an improved Bi-GAN method to classify the near-infrared spectra of drugs, addressing the problems of insufficient samples within classes and sample imbalance between classes. By limiting the mean and variance of the latent variables, adding a classification loss constraint, and using a fair strategy of sampling with replacement, we achieve the desired results. The experimental results show that the best classification accuracy on 1721 NIR spectra of four kinds of drugs produced by 29 manufacturers reached 99.4%. Compared with eight other NIR multi-classification methods of recent years, this method has obvious advantages.
The problems addressed in this paper may also exist in other sensor-data classification tasks, so the method we propose can serve as a useful reference for readers dealing with multi-classification problems in other scenarios.