Personalized predictions of adverse side effects of the COVID-19 vaccines

Background Misconceptions about adverse side effects are thought to influence public acceptance of the Coronavirus disease 2019 (COVID-19) vaccines negatively. To address such perceived disadvantages of vaccines, a novel machine learning (ML) approach was designed to generate personalized predictions of the most common adverse side effects following injection of six different COVID-19 vaccines based on personal and health-related characteristics.
Methods Prospective data on adverse side effects following COVID-19 vaccination in 19943 participants from Iran and Switzerland were used. Six vaccines were studied: AZD1222, Sputnik V, BBIBP-CorV, COVAXIN, BNT162b2, and mRNA-1273. Eight side effects were considered as the model outputs: fever, fatigue, headache, nausea, chills, joint pain, muscle pain, and injection site reactions. The first and second dose prediction models used 46 and 54 input features, respectively, including age, gender, lifestyle variables, and medical history. The performance of multiple ML models was compared using the Area Under the Receiver Operating Characteristic Curve (ROC-AUC).
Results The numbers of people receiving the first dose of the AZD1222, Sputnik V, BBIBP-CorV, COVAXIN, BNT162b2, and mRNA-1273 vaccines were 6022, 7290, 5279, 802, 277, and 273, respectively. For the second dose, the numbers were 2851, 5587, 3841, 599, 242 and 228. The Logistic Regression model for predicting different side effects of the first dose achieved ROC-AUCs of 0.620–0.686, 0.685–0.716, 0.632–0.727, 0.527–0.598, 0.548–0.655, 0.545–0.712 for the AZD1222, Sputnik V, BBIBP-CorV, COVAXIN, BNT162b2 and mRNA-1273 vaccines, respectively. The second dose models yielded ROC-AUCs of 0.777–0.867, 0.795–0.848, 0.857–0.906, 0.788–0.875, 0.683–0.850, and 0.486–0.680, respectively.
Conclusions Using a large cohort of recipients vaccinated with COVID-19 vaccines, a novel and personalized strategy was established to predict the occurrence of the most common adverse side effects with high accuracy. This technique can serve as a tool to inform COVID-19 vaccine selection and generate personalized factsheets to curb concerns about adverse side effects.


Introduction
The devastating Coronavirus disease 2019 (COVID-19) pandemic, which was initially deemed impossible to control despite numerous strategies, such as strict personal hygiene guidelines and social distancing, required establishing a global vaccination strategy [1]. The COVID-19 vaccines are one part of the solution to control the crisis. Fortunately, the steps toward using vaccination as the primary tactic against the pandemic were accelerated by the World Health Organization (WHO) Emergency Use Listing (EUL) issuance, designating the COVID-19 approved vaccines [2].
Although vaccination is essential to limit the spread of COVID-19, its success depends on enough individuals being willing to get vaccinated, and part of the general public remains hesitant. This vaccine hesitancy originates from various concerns, ranging from distrust of governments and pharmaceutical companies to fear of the adverse side effects of vaccines. Vaccines are one of the most potent weapons against many infectious diseases, but their side effects still raise concerns among diverse populations [3][4][5][6][7].
COVID-19 vaccines have shown numerous adverse side effects ranging from local side effects to systemic side effects. These adverse side effects mostly include minor and mild side effects such as headache, fever and pain in the injection area [8,9]. On the other hand, some rare but concerning severe side effects such as thrombotic events and myocarditis cases have been reported [10][11][12][13].
Vaccine adverse effects are correlated to the activity of the immune system, and the latter is closely related to sex, age, underlying disorders, and drug history [14]. In 2018, Kopsaftis Z et al. reported enhanced injection site side effects of influenza vaccines in elderly and Chronic obstructive pulmonary disease (COPD) patients [15]. Immunocompromised patients with primary immunodeficiency and hematological malignancies might be susceptible to vaccine-derived infections and stronger levels of adverse effects [16,17]. There have also been studies investigating the correlations between medical and personal characteristics and adverse side effects following the injection of COVID-19 vaccines, which have revealed a clear correlation between some aspects of vaccine recipients such as age and sex with the experienced adverse reactions [9,18].
Recognizing, anticipating, and predicting the adverse side effects of vaccines, including COVID-19 vaccines, can decrease anxiety and pave the way for the next steps toward a personalized vaccinology approach [19]. To the best of our knowledge, no study to date has addressed this critical matter for any drug or vaccine.
Based on the correlation between health-related traits and adverse side effects of vaccines, medical and personal records may support a personalized estimate of each individual's adverse effects. Finding a correlation between medical and personal characteristics and the occurrence of an adverse reaction is only achievable with a large dataset. For this reason, such predictions can only be calculated for milder adverse side effects that occur with high frequency in the population; a correlation between an individual's health-related traits and rare severe adverse side effects cannot be established. Of course, even predicting the more common adverse side effects depends on many factors.
The presence of an extensive number of parameters potentially affecting the adverse effects that one would experience makes the predictions of these side effects a complex issue. Finding correlation and building prediction models between these high numbers of parameters can be best handled with more complex methodologies such as Machine Learning (ML) and Artificial intelligence (AI) [20].
AI in healthcare has undergone meaningful progress in recent years; AI has been used as a tool for diagnosis, prognosis and risk stratification, disease screening, drug discovery, and data analysis in clinical trials [21][22][23].
Since the onset of this pandemic, AI has played a pivotal role in mitigating the impacts of COVID-19, from predicting COVID-19 dynamics, screening previously approved drugs for candidates, and supporting vaccine development to predicting the severity of COVID-19 infection and even analyzing behavioral changes towards COVID-19 vaccination [24][25][26][27][28][29].
In this study, a machine learning approach was designed to predict the potential adverse side effects after COVID-19 vaccination using health-related characteristics and personal traits.
For a standardized representation of the methods and result section of this paper, a modified version of the Transparent Reporting of Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guideline was followed [30].

Source of data and participants
Prospective data on the most common adverse side effects following COVID-19 vaccination in 19943 participants were used. Data collection was performed using a completely anonymous online survey of health care personnel at 90 hospitals in Iran and at the University Hospital of Lausanne in Switzerland. In line with our anonymity strategy, no personal information was gathered during the process. The healthcare authorities asked vaccine recipients at the vaccination centers to fill out the online survey at least 72 h following vaccination. The 72 h interval was chosen because some adverse reactions may appear more than 24-48 h after vaccination.
There were no additional selection criteria outside the criteria used for the vaccine eligibility. The final database used to design the models was obtained by aggregating the data of 19943 vaccine recipients who completed the survey before Aug 25, 2021. About 33.07% of these participants had not received the second dose of their vaccine upon completing the survey; the data from these participants were only used to train the first dose models.

Outcome
The participants' adverse side effects were considered as the outputs of the prediction models. Side effects were clustered into eight most common categories: fever, fatigue, headache, nausea, chills, joint pain, muscular pain, and injection site complications (including swelling, redness, and pain).

Predictors
Forty-six parameters including the recipients' age, sex, blood group, smoking history, drug abuse, alcohol dependency status, BMI, comorbidities, use of specific medication, and prior COVID-19 infection status (history of COVID-19 infection, degree of severity, and symptoms) were used as the predictors for the models following the first dose of vaccines.
The prediction model for the second dose included all 46 predictors from the first dose models, plus the side effects from the first dose of vaccine as additional input data (8 parameters). The total number of predictors for the second dose models was 54. Input parameters are listed in Supplementary Table 1.
The selection of variables as predictors was based on the available recorded data. All these predictors were recorded via an online survey explicitly filled by the healthcare personnel three days or more after their vaccination.

Missing data
Study participants completed an online survey that required an answer to all the questions. Due to the absence of missing data, there was no imputation of missing values.

Pre-processing
Most input and output parameters were encoded as binary variables using one-hot encoding [31]. Continuous predictors, including age and BMI, were normalized using a MinMax scaler to avoid feeding models with outlier values (for example, incorrect data entered due to unintentional mistakes while completing the form) [32].
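The two pre-processing steps can be illustrated with a minimal, dependency-free sketch. The records, feature names, and helper functions below are hypothetical and are not the study's code; they only mirror what a min-max scaler and a one-hot encoder compute.

```python
def min_max_scale(values):
    """Rescale numbers linearly to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical survey records (not taken from the study data).
ages = [25, 40, 55, 70]
blood_groups = ["A", "O", "B", "O"]

scaled_ages = min_max_scale(ages)
categories, encoded = one_hot(blood_groups)

print(scaled_ages)
print(categories, encoded)
```

Min-max scaling bounds every value to the interval [0, 1], but a single extreme entry still compresses the remaining values toward one end of that range, so implausible entries may additionally need a separate plausibility check.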

Machine learning methods
To ensure that models will not be overfitted on training data and are generalizable to unseen real-world data, 20% of the data was kept as a test dataset. A 5-fold cross-validation algorithm was performed on the remaining 80% of the data [33]. For this purpose, all records were randomly split into five subsets. Four subsets were used as training data, and one subset was held for model testing as a validation set. The cross-validation process was repeated four more times, with each of the five subsets being used once as the validation data. Model performance metrics were subsequently calculated separately for each training and validation model.
To make training and validation metrics comparable, the sets need similar distributions of positive and negative data points. This was achieved by separating vaccine recipients who showed a specific side effect from those who did not, dividing each group into five stratified subsets, and then combining them into the final five subsets. The same proportion of positive and negative cases was maintained for every side effect in each of the final five subsets.
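The hold-out split and stratified 5-fold procedure can be sketched with scikit-learn as follows. The data are synthetic, only a single side effect is stratified, and stratifying the 20% test split is an assumption of this sketch rather than something the text states.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 46))                 # synthetic stand-in for 46 predictors
y = (rng.random(1000) < 0.3).astype(int)   # synthetic label for one side effect

# Keep 20% of the records as an unseen test set (stratified here by assumption).
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 5-fold cross-validation on the remaining 80%.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = []
for train_idx, val_idx in skf.split(X_dev, y_dev):
    # Fraction of positive cases in this validation fold.
    fold_rates.append(float(y_dev[val_idx].mean()))

print(fold_rates)
```

Each validation fold ends up with nearly the same share of positive cases, which is what makes the per-fold metrics comparable.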
The Scikit-learn machine learning library was used to implement both preprocessing algorithms and models [36]. Also, the XGBoost package was used for training Gradient Boosted Decision Trees [37].

Model performance evaluation
For the first dose models, all six method types were trained for each side effect. The models' performance in 5-fold cross-validation was evaluated using accuracy, AUC-ROC, precision, and recall [38,39].
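For reference, the four metrics can be computed by hand as in the sketch below. The labels and scores are made up, and the 0.5 decision threshold is an assumption; the AUC is computed as the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for one side effect.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Unlike accuracy, precision, and recall, the AUC does not depend on the chosen threshold, which is one reason it is convenient for comparing models.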
It is important to note that the first and second dose models have been trained and validated independently. The training procedure for the second dose models was similar to the first dose, except that this time models also had access to the first dose side effect data as the input.

Model hyperparameter tuning
A separate hyperparameter tuning analysis was run for each model and each target side effect to achieve the best possible performance. GridSearchCV (RandomizedSearchCV for models with more parameters to tune) with stratified cross-validation was used for this purpose [40]. The best model configuration was selected using the mean AUC-ROC value on the validation set. Notably, because each iteration of hyperparameter tuning tries to improve the performance metric on the validation dataset, tuning can overfit the model to the validation data. To ensure that the models were not overfitted, metrics from the training, validation and unseen test sets were compared.
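A grid search of this kind might look like the following sketch. The data are synthetic, and the parameter grid (only the regularization strength C of a logistic regression) is an illustrative assumption; the grids actually used are those reported in the supplementary material.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((400, 10))  # synthetic predictors
# Synthetic label that depends on the first feature plus noise.
y = (X[:, 0] + 0.3 * rng.standard_normal(400) > 0.5).astype(int)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # assumed grid, for illustration
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # select by mean validation AUC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,
)
search.fit(X, y)

best_c = search.best_params_["C"]
best_val_auc = search.best_score_
best_train_auc = search.cv_results_["mean_train_score"][search.best_index_]
print(best_c, best_train_auc, best_val_auc)
```

Comparing `best_train_auc` with `best_val_auc`, and later with the score on the held-out test set, gives a simple check for overfitting during tuning.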

Model input-output correlations and feature importance
The LR model coefficients were used to demonstrate each predictor variable's effect on each side effect's outcome. LR calculates a probability P for each input X with the following formula (given here in its standard form), where e is Napier's constant, x_i is the value of feature i, β_i is the coefficient for feature i, and β_0 is the intercept:
$$P(X) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i} \beta_i x_i\right)}}$$
Finding true correlations that make clinical sense requires a very large dataset; therefore, this process was done only for the AZD1222, Sputnik V and BBIBP-CorV vaccines, which had large cohorts of participants.
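The way an LR model turns coefficients into a probability can be sketched in a few lines. The coefficients, intercept, and feature values below are hypothetical and are not taken from the fitted models; the function implements the standard logistic form.

```python
import math


def logistic_probability(x, coefficients, intercept=0.0):
    """P = 1 / (1 + exp(-(intercept + sum_i beta_i * x_i)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two features: scaled age, prior infection (0/1).
coefficients = [-1.2, 0.8]
intercept = 0.2

p_young = logistic_probability([0.1, 1], coefficients, intercept)
p_old = logistic_probability([0.9, 1], coefficients, intercept)
print(p_young, p_old)
```

A negative coefficient lowers the predicted probability as the feature grows and a positive one raises it, which is why the sign and magnitude of each coefficient can be read as the direction and strength of that feature's effect.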

Participants
The median age of subjects was 43 years with an interquartile range (IQR) of 33-57. A total of 9344 subjects (46.9%) were men, and 10599 (53.1%) were women. Overall, 5639 subjects (28.28%) out of the total participants were previously infected with COVID-19.
The 46 parameters and their availability are outlined in Supplementary Table 2. The occurrence frequency of each side effect in our dataset is shown in Fig. 1. For all 12 groups (6 vaccines, 2 doses each), local side effects such as injection site pain, redness or swelling were the most frequent (58.17%). Nausea was the least frequent side effect (10.13%). Full details of all the side effects' frequencies are available in Supplementary Table 3.

Model specification
Six machine learning methods were evaluated for every dose of each vaccine (12 groups in total) which are listed with the used hyperparameters in Supplementary Table 4. The best parameter for each of the six methods has been calculated by the hyperparameter tuning using Cross-validation (the test dataset was kept unseen in this step).

First dose side effect predictions
Because we need models with both strong predictions and generalizability to unseen data, models should not be overfitted on training data and should have near-equal performance on the training, validation, and test sets. As shown in Table 1, all the model types (except KNN, which seems to be overfitted on training data) have comparable performance on validation sets. The average AUCs of the validation sets across all side effects are 0.654, 0.650, 0.684, 0.568, 0.630 and 0.583 for the AZD1222, Sputnik V, BBIBP-CorV, COVAXIN, BNT162b2, and mRNA-1273 vaccines, respectively. These models, however, differ in their training set values. By comparing the models' performance on the validation and test sets, we concluded that LR had the best overall performance and the least overfitting to training data.

Second dose side effect predictions
As expected, the addition of first dose side effects as input features improved the model predictions for second-dose side effects (Table 2). Except for KNN, which showed poor performance on the validation sets, the other models showed an average AUC-ROC of 0.783.
Here again, as with the first dose predictions, some models were overfitted to training data, so LR was selected as the most efficient model. Predictions using LR achieved an AUC-ROC of over 0.90 for some side effects (Table 2). The full performance report of all the models for both doses can be found in Supplementary Table 5.
A supplementary analysis for the prediction of side effects for the second dose of the AZD1222, Sputnik V and BBIBP-CorV vaccines without including first dose side effects and solely using the original 46 parameters was also performed. The LR models achieved AUCs of 0.687, 0.651 and 0.645 for the training, validation, and test sets for all the three mentioned vaccines. This performance is similar to  Table 6).

Extra-validation test and generalizability
As we need to ensure that our models can be utilized on real-world data, 20% of initial data was left unseen in both training and hyperparameter optimization steps to preclude information leakage from this test set to the model. By comparing the model's performance on various sets (Tables 1 and 2), it can be concluded that our performance on the unseen test sets is comparable to the training and validation sets, which can hint at our models' generalizability. XGBoost, KNN, and RF have been overfitted on training data in both the first and second dose models.
The SVM, MLP, and LR showed average AUCs of 0.683, 0.659, and 0.716 for training sets in all different side effects for the first dose of vaccines, respectively. For the second dose models, the average training set AUCs are 0.839, 0.812, and 0.860.

Model input-output correlations and feature importance
To compare LR coefficients for different features, the continuous variables were first normalized to avoid over- or under-estimation of feature effects on prediction.
The feature importance and positive or negative correlation are shown in Figs. 2 and 3 for the first and second doses of AZD1222, Sputnik V and BBIBP-CorV vaccines. For both doses, the feature effect was similar; however, in the case of the second dose, the first dose side effects were included as additional input features. As expected, the presence of a particular side effect following the first dose has a positive effect on the prediction of its second dose counterpart.
The predictive value of input features for all three vaccines was also included separately. The detailed presentation of each input's predictive value on all eight side effects for the three mentioned vaccines is available in Supplementary

Limited featured models
To investigate the contributions of the strongest predictors to the models' efficiency, limited featured models were developed based on only five parameters for each dose of vaccines and side effects. The most important features for every dose and every side effect model are shown in Supplementary Table 7. The predictive values were averaged for every input parameter in the eight different side effect groups of the AZD1222, Sputnik V and BBIBP-CorV vaccines. Subsequently, the LR models were run solely based on the five most important input features as described below.
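The ranking step behind such limited featured models can be sketched as follows. The coefficient values and feature names are hypothetical, only two side effects are shown, and averaging absolute coefficients is an assumption of this sketch, since the text does not state whether signed or absolute predictive values were averaged.

```python
def top_k_features(coefficients_by_side_effect, k=5):
    """Rank features by mean absolute coefficient across side effects."""
    models = list(coefficients_by_side_effect.values())
    features = list(models[0].keys())
    mean_abs = {
        f: sum(abs(coefs[f]) for coefs in models) / len(models)
        for f in features
    }
    # Highest average magnitude first; keep only the k strongest features.
    return sorted(mean_abs, key=mean_abs.get, reverse=True)[:k]


# Hypothetical LR coefficients for two side effects (not study results).
coefficients_by_side_effect = {
    "fever": {"age": -1.0, "sex": 0.6, "prior_covid": 0.8, "bmi": 0.1,
              "smoking": -0.2, "cancer": 0.4, "diabetes": 0.05},
    "headache": {"age": -0.8, "sex": 0.7, "prior_covid": 0.4, "bmi": -0.1,
                 "smoking": -0.3, "cancer": 0.3, "diabetes": 0.1},
}

top5 = top_k_features(coefficients_by_side_effect, k=5)
print(top5)
```

A reduced model would then be refitted using only the columns named in `top5`, and its AUC compared with that of the full model.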

Discussion
In this study, a novel machine-learning based approach was designed to predict the occurrence possibility of each common side effect for six widely approved COVID-19 vaccines solely based on recipients' personal and health-related traits. To the best of our knowledge, this is the first study to use a machine learning method to predict the occurrence of adverse side effects of any vaccine or drug based on an individual's personal and health-related characteristics.
Our findings indicate a significant correlation between the vaccine recipients' personalized characteristics and their adverse reactions. Age had the most substantial impact on the prediction of the side effects of the first dose, which was inversely proportional to the side effects occurrence. This effect is likely due to a more robust immune response in younger individuals leading to more side effects. Interestingly, one of the other influential factors was a history of COVID-19 infection. Participants with a history of COVID-19 infection experienced more vaccine-related adverse effects. In addition, more vaccine-related adverse effects were experienced by participants with a history of cancer.
The presence of specific side effects following the first dose of vaccine substantially impacted the occurrence of that same side effect after the second dose injection. This phenomenon was also observed in previous side effects studies [41,42].
Many differences were found between genders in the presence of various side effects; women had a higher chance of experiencing all the side effects over the 12 groups of injections in the study. This finding has been supported for other drugs as well and can be explained by a mix of factors, including inherent immune system differences between men and women and the injection dose [43].
The efficiency and acceptance of COVID-19 vaccination programs have been limited by distrust of some portions of the public [44]. Educating the public in this area can help to accelerate the speed of vaccination and establish appropriate herd immunity. The results of our study may provide support to educate the general public and provide assurance of a monitoring process of adverse events.
Since the start of this pandemic, due to uncertainties and lack of data, governments have made decisions on vaccination programs that are potentially influenced by cognitive biases; therefore, actuating strategies replaced proficient strategies [45]. During the COVID-19 pandemic, AI has played a prominent role in tailoring rapid and cost-effective strategies and policies for policymakers against the spread of COVID-19 [44,46]. AI-based programs are not only straightforward and accessible but also affordable and accurate.
Diverse types of machine learning methods were tested for our approach, from simple linear models (LR) to more complex models like XGBoost and Multi-Layer Perceptrons. Regarding model performance, the LR shows superior prediction performance as well as simplified and generalizable explanations that help in seeing how it decides on each outcome and how each feature affects the output.
Our study had several strengths. First, to the best of our knowledge, no other model to this date has predicted the adverse side effects of any type of vaccine based on the health-related characteristics of an individual. Second, this research is the first step toward personalized vaccinology based on side effects and can be employed for other vaccines.
With our model, personalized fact sheets on adverse side effects can be provided to individuals prior to vaccination. However, our study had many limitations. Due to the slow speed of the vaccination process at the time of the study, this project was built on limited data for a limited number of vaccines. Increasing the dataset size may help models achieve higher generalizability to unseen data for the COVAXIN, BNT162b2, and mRNA-1273 vaccines. Moreover, at the time of the study, the booster doses had not been rolled out, and it seems logical that the models should be extended to cover the booster vaccines to increase the practicality of the approach. The inability to predict severe adverse side effects is another limitation of our model; because these reactions are rare, such predictions seem unlikely to be achievable. Furthermore, the statistical significance of the input-output correlation is not as strong for the COVAXIN, BNT162b2, and mRNA-1273 vaccines, which is mainly due to a lower number of participants for these vaccines.
In future studies, this approach can be enhanced by including more input and output data for both more practicality and more accuracy. Furthermore, this approach can easily be generalized to other vaccines and drugs and should not remain exclusive to COVID-19 vaccines. Moreover, a clinical validation study to support the real-world application of these predictions among the general public is another subject that must be studied in further investigations of this approach.
Ultimately, we anticipate that providing the public with a personalized prediction of their adverse side effects following vaccination can curb the general public's concerns about the adverse reactions of the COVID-19 vaccines. To increase the model's applicability, a user-friendly web interface was set up (https://podsaf.org) that allows each individual to enter their own characteristics and see a prediction of their side effects following the COVID-19 vaccines.