Using Artificial Intelligence to Learn Optimal Regimen Plan for Alzheimer’s Disease

Background: Alzheimer’s Disease (AD) is a progressive neurological disorder with no specific curative medications. While only a few medications are approved by the FDA (i.e., donepezil, galantamine, rivastigmine, and memantine) to relieve symptoms (e.g., cognitive decline), sophisticated clinical skills are crucial to optimize the appropriate regimens given the multiple coexisting comorbidities in this patient population. Objective: Here, we propose a study to leverage reinforcement learning (RL) to learn the clinicians’ decisions for AD patients based on the longitudinal records from Electronic Health Records (EHR). Methods: In this study, we extracted 1,736 patients fulfilling our criteria from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We focused on the two most frequent concomitant diseases, depression and hypertension, resulting in five main cohorts: 1) whole data, 2) AD-only, 3) AD-hypertension, 4) AD-depression, and 5) AD-hypertension-depression. We modeled the treatment learning as an RL problem by defining the three RL factors (i.e., states, actions, and rewards) in multiple strategies: a regression model and a decision tree were developed to generate states, six main medication groups extracted from the data (i.e., no drugs, cholinesterase inhibitors, memantine, hypertension drugs, a combination of cholinesterase inhibitors and memantine, and supplements or other drugs) serve as actions, and Mini-Mental State Exam (MMSE) scores serve as rewards. Results: Given a proper dataset, the RL model can generate an optimal policy (regimen plan) that outperforms the clinician’s treatment regimen. With the smallest data samples, the optimal policies (i.e., policy iteration and Q-learning) gained lower rewards than the clinician’s policy (mean −2.68 and −2.76 vs. −2.66, respectively), but they gained higher rewards once the data size increased (mean −3.56 and −2.48 vs. −3.57, respectively).
Conclusions: Our results highlight the potential of using RL to generate the optimal treatment based on patients’ longitudinal records. Our work can pave the way toward the development of RL-based decision support systems that could facilitate daily practice in managing Alzheimer’s disease with comorbidities.


Introduction
Alzheimer's Disease (AD) is a progressive neurological disorder causing cognitive impairment and brain atrophy. Approximately 5.8 million people in the United States age 65 and older live with Alzheimer's disease, and approximately 60%-70% of the 50 million people worldwide with dementia are estimated to be diagnosed with AD [1]. Currently, the etiology of AD is still unknown [2]. The scientific community has conflicting opinions about the cause of AD. Researchers have explored both neuroimaging and genetic paths to better understand the underlying causes of AD, yet there has not been any conclusive outcome. Some researchers claim that β-amyloid plaque formation and aggregation causes AD [2], whereas others claim the Apolipoprotein E (ApoE) gene along with various environmental factors does [3]; however, there is no consensus. Limited knowledge about AD's pathogenesis has resulted in a wide range of speculations regarding the risk factors for AD, such as vascular diseases, type-2 diabetes, traumatic brain injury, epilepsy, depression, smoking, diet, physical exercise, and alcohol consumption [4]. Due to the unknowns of AD's etiology and risk factors, drug development has not made significant progress, and available drugs like cholinesterase inhibitors and memantine only treat the disease superficially. These drugs help to temporarily ameliorate memory and thinking problems, but they do not treat the root cause of AD nor slow the rate of decline of a patient's condition [5]; they aim only at modifying the disease symptoms [6,7].
Alzheimer's disease management is further complicated by the high rate of comorbidities observed in patients [8]. Approximately 90% of AD patients are diagnosed with comorbid conditions [9]. Chronic diseases such as hypertension and depression are frequent among AD patients [10,11]. Patients are mostly on several medications for other comorbidities. The relationship between AD and these comorbid conditions warrants further investigation of whether they act as risk factors or by-products of AD. Due to the limited knowledge of AD and its comorbid conditions, there is no consensus among physicians on how to manage such conditions [12]. This further complicates the management of AD. As a result, clinicians treat patients differently based on their own understanding and training experiences. Medication management ends up being a trial until a regimen temporarily relieves symptoms. As a result, it can take years of experience for a physician to medically manage AD with comorbidities [13]. A medication regimen learning tool can be very beneficial in this case to provide support to physicians. Such a learning tool would help physicians learn how to properly treat AD patients with possible comorbidities. For instance, a medication regimen learning tool could suggest a particular combination of drugs for each disease state of a patient instead of suggesting multiple combinations of drugs.
Furthermore, Artificial Intelligence (AI) has made it possible to create such medication regimen learning tools. It has been used to create decision-support system models that predict drugs based on patient reviews [14]. Reinforcement Learning (RL) is an AI technology for learning a set of actions that yields the most reward during the interaction of an agent with a specific environment (e.g., a computer game). Since RL is capable of learning human-like behavior and has achieved great success in diverse applications that require human interaction (e.g., Go [15]), there is a trend to adapt RL in healthcare (e.g., the regimen plans learned for Parkinson's Disease [15] and sepsis [16]). Such technology can learn from existing clinical data and provide suggestions to junior physicians who have less experience compared to senior physicians. This could revolutionize health care by transferring senior physicians' years of experience to junior physicians through AI technologies. Here, we propose a study to learn the clinician's treatment plan to facilitate the clinical practice of junior clinicians in managing AD patients. Specifically, we learn an RL-based model. It is a model based on stepwise decision-making and consists of states, actions, and rewards. A virtual agent checks the current state, explores different actions, and picks one that maximizes the future reward (Fig. 1). We have demonstrated that the proposed model outperforms the data-derived methods (e.g., a transition-probability-based model) for patients with concomitant conditions (e.g., depression and hypertension). We came to this conclusion by comparing the Mini-Mental State Exam (MMSE) score from the data to the MMSE predicted by our RL model. The results highlight the ability of the proposed study to generate the clinician's regimen plan for AD patients.
Fig. 1. Overview of the proposed approach: (1) actions and rewards are extracted from the data; (2) 13 different states are defined using the decision tree; (3) a reinforcement learning model is prepared based on the states from step (2) and the actions and rewards from the data in step (1); (4) the best medication/action is selected for each state using reinforcement learning.

Data
The data are derived from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu), which is the most frequently used open-access data source in pharmacogenomic studies for AD [17]. ADNI is a longitudinal multicenter study designed to support advances in AD intervention, prevention, and treatment through the development of clinical, imaging, genetic, and biochemical biomarkers [18].

Patient Cohort
We selected patients based on the following criteria: a minimum of two clinic visits, complete medical history, and clinical assessment data (Table 1). A total of 1,736 patients were selected (957 males and 779 females). Across all selected patients, the total number of visits was 10,082. The mean follow-up duration per patient was 32.17 months, and the mean number of visits per patient was 6.42. Table 1. Patient demographics for the different data cohorts (whole data, AD-only, AD-hypertension, AD-depression, and AD-hypertension-depression).


RL-based Modeling
The traditional medical method of treating AD is assessing a patient's current state and prescribing medication accordingly, then following up on the patient's symptoms afterward. We utilized Reinforcement Learning (RL), a subfield of artificial intelligence (AI), to measure AD progression based on selected consecutive decisions. This consecutive decision-making nature of RL models is best described as a Markov decision process (MDP). An MDP consists of states, actions, and rewards, where a state is Markovian if and only if the next state depends on the current state only. It is based on an agent at a certain state selecting different actions to maximize the rewards. The defined factors are described below.
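The MDP components described here can be sketched as simple data structures. The state and action sets below mirror the paper's setup (13 decision-tree-derived states, six medication groups), but the labels and the toy policy are our own illustrative shorthand, not the fitted model:

```python
import random

# Illustrative MDP pieces: 13 decision-tree-derived progression states and
# the six medication groups used as actions (labels are our shorthand).
STATES = list(range(13))
ACTIONS = ["no_drugs", "cholinesterase_inhibitors", "memantine",
           "hypertension_drugs", "chei_plus_memantine", "supplements_other"]

def reward(mmse_now, mmse_prev):
    """Reward: change in MMSE between consecutive visits."""
    return mmse_now - mmse_prev

# A deterministic policy maps every state to one action.
random.seed(0)
toy_policy = {s: random.choice(ACTIONS) for s in STATES}
```

A policy here is just a state-to-action map; the RL methods described below search over such maps to maximize the cumulative reward.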
State s: We define states as a finite set of a patient's progression states at the latest clinic visit. Raw data on participants' states were converted to discrete states. We picked statistically significant features such as the Alzheimer's Disease Assessment Scale (ADAS13) and age (Table 2) to predict the Mini-Mental State Exam (MMSE) score using regression. We then chose the significant variables and derived a decision tree (Fig. 2 and Table 2) to predict the MMSE scores. The decision tree divides the data into different ranges and then predicts the MMSE score. For example, an age of fewer than 70 years and an ADAS score of more than 20 could predict an MMSE score of 20. The predicted MMSE scores at the leaf nodes of the decision tree are our derived discrete states (Fig. 2). We grouped each visit according to the criteria specified by the decision tree and ignored states with fewer than 50 occurrences to avoid states without enough visits.
Action a: We define actions as the six main medication groups extracted from the data (i.e., no drugs, cholinesterase inhibitors, memantine, hypertension drugs, a combination of cholinesterase inhibitors and memantine, and supplements or other drugs). Hypertension drugs and other supplements are also included to explore treatment across five data cohorts: whole data, AD-only, AD with hypertension, AD with depression, and AD with hypertension and depression. Please note that hypertension drugs and supplements are not traditional treatments for AD and are included for patients with coexisting hypertension and other conditions [19][20][21].
(This preprint was posted on medRxiv on January 29, 2023, under a CC-BY 4.0 International license and was not certified by peer review. The copyright holder, the author/funder, has granted medRxiv a license to display the preprint in perpetuity.)
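As a sketch of this state-derivation step, the following fits a shallow regression tree predicting MMSE from ADAS13 and age and treats each sufficiently large leaf as one discrete state. The data are synthetic and the tree depth and thresholds are illustrative assumptions, not the fitted tree from the paper; scikit-learn availability is assumed:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic visit table: columns are ADAS13 and age; target is MMSE.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(10, 40, 500),   # ADAS13 score
                     rng.uniform(55, 90, 500)])  # age
y = 30 - 0.5 * X[:, 0] + rng.normal(0, 1, 500)   # synthetic MMSE

# Shallow regression tree; min_samples_leaf enforces the >=50-visit rule.
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=50).fit(X, y)

# Each leaf node becomes one discrete state; visits are grouped by leaf.
leaf_ids = tree.apply(X)
states = {leaf: i for i, leaf in enumerate(np.unique(leaf_ids))}
```

The `min_samples_leaf` setting plays the role of the 50-occurrence cutoff described above: every resulting state is guaranteed to contain at least 50 visits.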
Reward r: We defined the reward as the clinical assessment of the patient's medication response. While multiple assessment scores are used in clinical practice (e.g., the Rey Auditory Verbal Learning Test (RAVLT) and the Montreal Cognitive Assessment (MoCA)), we used MMSE assessment scores in our study because the MMSE is a widely used tool to assess cognitive function in both routine clinical practice and research settings [22,23]. The maximum score for the MMSE is 30 points, with scores from 20 to 24 indicating mild dementia, 13 to 20 indicating moderate dementia, and less than 12 indicating severe dementia [24]. We calculated the difference between the MMSE in the current visit and the previous visit to measure the rate of progression of Alzheimer's. A discount rate gamma, 0 ≤ γ ≤ 1, was also introduced to determine the present value of future rewards [25]. We used the discount factor γ = 0.3. Our total discounted return is represented by:
$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$
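With γ = 0.3, the discounted return over a sequence of per-visit MMSE differences can be computed directly; the visit deltas below are invented for illustration:

```python
def discounted_return(rewards, gamma=0.3):
    """Total discounted return: sum over k of gamma**k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# MMSE differences across a hypothetical patient's consecutive visits.
mmse_deltas = [-1, 0, -2, 1]
g = discounted_return(mmse_deltas, gamma=0.3)
# g = -1 + 0*0.3 + (-2)*0.09 + 1*0.027 = -1.153
```

The small γ used in the study means the return is dominated by the reward at the next visit, with later visits contributing only weakly.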

Policies
The policy is a map from states to actions: it assigns an action to every possible state in the system. In other words, a policy can be described as a possible strategy an agent uses in each state to obtain rewards, and it is defined by probability. For example, if an agent uses action a1 in state s1, a2 in state s2, and so on, this can be considered a policy of the agent. On the state-action map, for state s1, a1 has the highest probability value, and for state s2, a2 has the highest probability value. There are many possible policies, as different actions can be used for the same states; however, one policy will yield the maximum reward.

Optimal policies learned by RL
We generated policies using two different RL methods: model-free Q-learning and model-based policy iteration. Model-based methods rely on planning and transition probabilities, while model-free methods rely on learning or experience [25].

Policy iteration
First, we compute the state-value function v(s) for an arbitrary policy π. The value function v(s) estimates future rewards in a given state when performing a particular action based on the transition probabilities. The transition probability is the probability of transitioning from one state, s, to another state, s', after a certain action is applied. This step is called policy evaluation. After computing the value function for a policy, we check whether there is a particular action that gives a better value for that state. This is repeated until a better policy is found and is called policy improvement. We repeat these evaluation and improvement cycles until we find the optimal policy.
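A minimal sketch of policy iteration on a small synthetic MDP, assuming the transition probabilities P[s, a, s'] and expected rewards R[s, a] have already been estimated from the visit data (here they are random placeholders):

```python
import numpy as np

# Toy MDP: random transition tensor P[s, a, s'] and reward matrix R[s, a].
n_states, n_actions, gamma = 4, 2, 0.3
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(0, 1, (n_states, n_actions))

def policy_iteration(P, R, gamma):
    n_s, n_a = R.shape
    policy = np.zeros(n_s, dtype=int)
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v exactly.
        P_pi = P[np.arange(n_s), policy]
        R_pi = R[np.arange(n_s), policy]
        v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. the evaluated values.
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy

pi_star, v_star = policy_iteration(P, R, gamma)
```

The loop alternates exact evaluation (a linear solve) with greedy improvement and terminates when the policy is stable, which for a finite MDP yields an optimal policy.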

Q Learning
We used this off-policy temporal-difference algorithm to create more variety in optimal policies. Q-learning uses Q-values from a Q-table to find the best actions for each state. The Q-value is an estimate of how good an action is in a particular state. The Q-table is an m×n matrix, where m is the number of states and n is the number of actions. An agent applies an action in a particular state and updates the Q-table with the reward it receives for that state-action combination. Then the agent applies different actions for the same state. Through numerous repetitions, the best action for each state is picked, and the result is an optimal policy independent of the policy being followed. In other words, it does not depend on the transition probabilities derived from the data set.
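A tabular Q-learning sketch over logged (state, action, reward, next-state) transitions, mimicking learning from EHR visit sequences. The state/action counts, learning rate, and synthetic log are our own illustrative choices, not the study's data:

```python
import numpy as np

n_states, n_actions = 4, 2
gamma, alpha = 0.3, 0.4          # discount factor; alpha is a mid-range
rng = np.random.default_rng(2)   # value from the Test 3 sweep below

# Synthetic log: each tuple is one observed (s, a, r, s') transition.
log = [(rng.integers(n_states), rng.integers(n_actions),
        rng.normal(), rng.integers(n_states)) for _ in range(5000)]

Q = np.zeros((n_states, n_actions))
for s, a, r, s_next in log:
    # Off-policy update: bootstrap from the best next action, not the
    # action the behavior policy actually took next.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

greedy_policy = Q.argmax(axis=1)   # best medication per state
```

Because the update bootstraps from max over next actions, the learned greedy policy does not depend on the policy that generated the log, which is the property the text describes.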

Clinicians' policy by a data-driven approach
We used transition probabilities to find the clinician's policy from the data. We followed an approach similar to policy iteration: we applied the policy evaluation and policy improvement process just once, based on the existing transition probabilities from the data, and took the resultant policy as the clinician's policy. Since this policy is based entirely on the data, we can safely assume it is very close to the real clinician's policy.
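The paper derives the clinician's policy from one evaluation-and-improvement cycle over the empirical transition probabilities; as a simpler frequency-based sketch of recovering a behavior policy from logged visits (synthetic log, shorthand action labels, not the paper's exact method):

```python
from collections import Counter, defaultdict

# Synthetic visit log: (state, action prescribed by the clinician).
log = [(0, "memantine"), (0, "memantine"), (0, "no_drugs"),
       (1, "chei"), (1, "chei"), (2, "htn_drugs")]

# Count how often each action was taken in each state.
counts = defaultdict(Counter)
for state, action in log:
    counts[state][action] += 1

# Behavior policy: the most frequent logged action per state.
clinician_policy = {s: c.most_common(1)[0][0] for s, c in counts.items()}
```

Normalizing these counts per state also yields the empirical action probabilities μ(a|s) needed for the off-policy evaluation described below.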

Other policies
We also created a zero policy and a random policy to compare with our RL-based and clinician policies. The zero policy implies that no drugs are applied as actions in any state, and the random policy implies that random drugs are applied as actions without assessing the patient's condition.

Evaluation and comparison
We used offline evaluation to estimate the value of target policies (the policies being learned) based on a behavior policy (the policy used to generate the behavior) [25] derived from the offline log data. Offline evaluation is very useful in settings where online interaction involves high risks and costs (e.g., medication recommendation systems) [26]. We used importance sampling (IS), a commonly used off-policy evaluation technique, to estimate expected values under one distribution given samples from another [25]. It estimates the value of a target policy from a behavior policy derived from the data by re-weighting states based on the frequency of their occurrence [27]. In our study, we used stepwise weighted importance sampling (step-WIS), which is the most practical point estimator among importance sampling techniques because of its low variance [15][28] and error [29].
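A generic sketch of the step-WIS estimator, not the study's exact implementation: `pi` and `mu` are callables returning the target and behavior action probabilities, and each trajectory is a list of (state, action, reward) tuples:

```python
import numpy as np

def step_wis(trajectories, pi, mu, gamma=0.3):
    """Stepwise weighted importance sampling value estimate."""
    n = len(trajectories)
    horizon = max(len(t) for t in trajectories)
    rho = np.ones(n)   # running cumulative importance ratios
    value = 0.0
    for t in range(horizon):
        step_rewards = np.zeros(n)
        for i, traj in enumerate(trajectories):
            if t < len(traj):
                s, a, r = traj[t]
                rho[i] *= pi(s, a) / mu(s, a)
                step_rewards[i] = r
        # Normalize the discounted step rewards by the summed ratios
        # at this step (the "weighted" part of WIS).
        value += (gamma ** t) * (rho * step_rewards).sum() / rho.sum()
    return value
```

When the target and behavior policies coincide, every ratio is 1 and the estimator reduces to the mean discounted return of the logged trajectories; the per-step normalization is what gives step-WIS its lower variance relative to ordinary importance sampling.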

Tests
Test 1: The first test evaluated the impact of data size on generating policies from the AD data, in order to create a policy with higher accuracy that is closest to the clinician's policy. We split the data 60%/20%/20% into training, validation, and testing sets. We further divided the training set into four scenarios with different data sizes (100%, 80%, 50%, and 30%) to feed the models. All training groups were trained 50 times to generate an optimal policy. We repeated this cycle 100 times to eliminate any potential bias in our final reward. A total of 13 states and 6 actions were used for this test.
Test 2: The second test evaluated how the proposed work performs over different patient cohorts (e.g., patients with different concomitant diseases). We separated the data into five groups based on the disease diagnosis: whole data (13 states, 6 actions), Alzheimer's only (9 states, 6 actions), hypertension and Alzheimer's (10 states, 6 actions), depression and Alzheimer's (9 states, 6 actions), and hypertension, Alzheimer's, and depression (10 states, 6 actions). Hypertension and depression were the two most prevalent concomitant diseases in the population in the data. Also, depression is one of the most prevalent psychiatric conditions in AD patients [30][31][32]. We then followed the same splitting method used in Test 1. We also wanted to check how RL's medicine predictions for the different states compare to the clinician's predictions.
Test 3: For our third test, we wanted to learn how our proposed Q-learning model performed over different learning rates, α. This test was to confirm that our Q-learning was robust enough to learn the real clinician's policy. We used our existing Alzheimer's disease-only cohort and compared the results for alphas from 0.1 to 0.9. We then followed the same splitting method used in Test 1.
Test 4: For our final test, we wanted to learn how our proposed Q-learning model performed over different numbers of states while keeping the data constant. We changed the total number of discrete states given by the decision tree based on the number of samples. For example, for the whole data, we got 13 states when we used leaf nodes of the decision tree that had more than 50 samples and 9 states with leaf nodes that had more than 200 samples, and we compared the results (Fig. 7). We then followed the same splitting method used in Test 1.
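The 60/20/20 split with subsampling of the training portion used across these tests can be sketched as follows; the proportions come from the text, while the shuffling and subsampling details are our own assumptions:

```python
import random

def split_patients(patient_ids, seed=0):
    """60/20/20 train/validation/test split of patient IDs."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def subsample(train_ids, fraction, seed=0):
    """Take a fraction of the training set (e.g., 1.0, 0.8, 0.5, 0.3)."""
    k = int(fraction * len(train_ids))
    return random.Random(seed).sample(train_ids, k)

# Splitting the study's 1,736 patients by index.
train, val, test = split_patients(range(1736))
```

Splitting at the patient level (rather than the visit level) keeps all visits of one patient in the same partition, which avoids leakage between training and evaluation.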

Results

Test 1
Test 1 revealed that an appropriate data size resulted in RL performance comparable to the clinician's performance. Smaller data samples yielded worse results for policy iteration and Q-learning than the clinician's policy, but as the data size increased, policy iteration and Q-learning performed better than the clinician's policy. For comparison, we also evaluated the zero policy (no drugs applied) and the random policy (random drugs applied). The zero policy repeatedly yielded the lowest mean reward (approximately −9) and the lowest single reward (approximately −15), and the random policy was slightly better than the zero policy (mean = −4.76).
The suggestions made by both optimal policies and the clinicians' policy are somewhat similar (Table 3). Both policy iteration and Q-learning start off by recommending no drugs when patients are in the first state, whereas the clinicians recommend memantine. In state 11, all the policies recommend hypertension drugs, whereas in state 12 the recommendation of each policy is totally different. In state 6, both optimal policies recommend hypertension drugs, whereas clinicians recommend memantine. The different action recommendations for each state for the AD-Hypertension-Depression cohort can be found in Supplement Figure 2.
Fig. 3. Comparison of rewards represented by the MMSE score (y-axis) for different-sized data for all policies. Policy Iteration and Q-learning are the optimal policies, and the Clinician policy is derived from the data.

Test 2
We noticed that the model is comparable with the clinician's policy when the data are split around AD itself. Since hypertension and depression are frequently seen in AD patients and our actions are mainly the medications for AD, policy iteration showed better results than the clinician's policy in all three cohorts (Fig. 4). We also compared the MMSE score predictions of Q-learning's decision-making and the clinician's policy-making throughout every data cohort and concluded that Q-learning's rewards are more coherent than the clinician's. For all the data cohorts, Q-learning's reward predictions are scattered around 0 (lower negative values), whereas the clinician's reward predictions are scattered around higher negative values (Fig. 5). Q-learning's reward prediction for the whole data can be found in Supplement Figure 3. The predicted reward for each patient for both Q-learning and the clinician's policy can be found in Supplement Figure 1.

Test 3
We also confirmed that the Q-learning policy is not always better with high learning rate (alpha) values. There is a general trend of increasing rewards from a learning rate of 0.1 to 0.4. The reward is then stable from an alpha value of 0.3 to around 0.8, with a mean from −1.28 to −1.30, and it decreases at 0.9 with a reward of −1.42 (Fig. 6).
Fig. 6. Comparison of the Q-learning policy for different learning rates for the AD-only cohort. The learning rate ranges from 0.1 to 0.9.
(medRxiv preprint: https://doi.org/10.1101/2023.01.26.23285064)

Test 4
We did not find any concrete connection between changing the number of states and the mean reward prediction (Fig. 7). In the AD-only data cohort, 10 states from leaf nodes with a sample size greater than 50 predicted a better mean reward [mean = −1.34] compared to 7 states from leaf nodes with a sample size greater than 200 [mean = −1.54] and 9 states from leaf nodes with a sample size greater than 100 [mean = −1.46]. In the AD-depression data cohort, 9 states from leaf nodes with a sample size greater than 50 predicted a worse mean reward [mean = −0.97] compared to 6 states from leaf nodes with a sample size greater than 100 [mean = −0.95] (Fig. 7). This analysis for the whole data cohort can be found in Supplement Figure 4.

Discussion
Our current study proposed an RL-based model to investigate the optimal AD treatment regimen plan based on the EHR. We adopted two RL methods, model-free Q-learning and model-based policy iteration, to generate the regimen plans. In comparison to the policy (i.e., treatment regimen plan) learned simply from the existing data (i.e., the clinician's policy based on the transition-probability-based method), the experiments showed that RL models can optimize the treatment regimen for AD given sufficient patient data, as suggested by previous studies on Parkinson's [15] and sepsis [16]. However, our current study has notable differences compared to those studies. First, we argued that the AI models can only estimate an ideal physician, which is not comparable to a real clinician's policy, unlike previous studies that strongly suggest AI-based policies can outperform physician policies [15]. Secondly, in previous studies, all the policies were generated based on on-policy methods (e.g., SARSA and value iteration [15], [16]), which consider the target policy to be identical to the behavior policy. This is problematic in an offline setting because our target policy is very different from the behavior policy, as we are using different actions for different states in order to find an optimal action for a particular state. In response, we conducted an evaluation that fairly compared the offline model-free models (i.e., Q-learning) with the behavior policy. Lastly, we incorporated the importance of data volume to learn an ideal model for real-world implementation, in addition to focusing on the RL model performance. Experiments on different data cohorts revealed better RL-based model performance on larger data cohorts. Our experiments showed that harmonization should be achieved between the data and the method to generate an optimal policy. In our study, we found the optimal policy by repeating experiments with the training and validation data 50 times.
For generalizability, we used 100 bootstrap samples of training and testing data on the resulting optimal policy to find our final reward. This study provided a robust guide for treatment plan learning and has adaptable potential in guiding the treatment of AD patients for junior physicians.
Our results were promising and demonstrated the high potential of RL-based models to learn real clinicians' policies; however, there are a few limitations to address. First, we could not obtain definitive results from the latest offline reinforcement learning algorithms, such as Conservative Q-Learning (CQL), as CQL consistently predicted supplements as the optimal action. This is due to the high number of cases in which supplements (N=4573) (e.g., vitamins and sleeping medication) were prescribed. This contrasts with previous studies examining Parkinson's Disease (PD), which did not have higher rates of prescribed supplements (N=442) compared to PD medications (levodopa=1157 and dopamine agonists=447). There is considerable potential to perform this study using the latest RL algorithms like CQL if evenly distributed medication data are collected in the future.
A second limitation lies in the accuracy of calculating disease progression with only cognitive assessment data. We could not incorporate neuroimaging and other biomarkers data as these were not available. Although there's no exact way to measure the progression of AD, neuroimaging has been widely used to diagnose AD and monitor disease progression [33]. Due to the unavailability of such data, we had to rely on commonly used cognitive tests like MMSE, ADAS, and CDRSB. A more in-depth study can be performed by incorporating other measures (e.g., mobility) or biomarkers (e.g., amyloid-beta and tau).
Thirdly, we encountered many negative values in our rewards. This could be the result of the small data set, inconsistent data entry of MMSE scores for patients, and the high number of missing values in the record. We tried to minimize the missing values by filling the missing spots with data from previous visits. The rewards would be much better if accurate MMSE scores were present at each visit for all patients.
Lastly, there was not an active RL environment to test our algorithms as it is almost impossible to have an active testing environment for medical patients. Off-policy RL algorithms are only successful when they receive direct feedback from an active environment (e.g. a video game). With a proper dataset with evenly distributed medications and fewer missing values, we could use highly effective offline RL algorithms like CQL in the future to avoid this problem [34].