Elucidation of Genome-wide Understudied Proteins targeted by PROTAC-induced degradation using Interpretable Machine Learning

Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules that induce the degradation of a target protein by recruiting an E3 ligase to it. PROTACs can inactivate disease-related genes that are considered understudied, and thus have great potential as a new type of therapy for currently incurable diseases. However, only hundreds of proteins have been experimentally tested for amenability to PROTACs, and it remains unclear which other proteins in the human genome can be targeted. For the first time, we have developed PrePROTAC, an interpretable machine learning model that combines a transformer-based protein sequence descriptor with random forest classification to predict, genome-wide, PROTAC-induced targets degradable by CRBN, one of the E3 ligases. In benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, a PR-AUC of 0.84, and over 40% sensitivity at a false positive rate of 0.05. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method to identify positions in the protein structure that play key roles in PROTAC activity. The key residues identified were consistent with existing knowledge. We applied PrePROTAC to identify more than 600 novel understudied proteins that are potentially degradable by CRBN, and proposed PROTAC compounds for three novel drug targets associated with Alzheimer’s disease.
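As an illustration of the last benchmark metric, sensitivity at a fixed false positive rate can be read off a ranked list of predictions: lower the decision threshold until the fraction of negatives accepted would exceed the limit, and report the fraction of positives recovered so far. The following is a minimal sketch under that definition, not the evaluation code used for PrePROTAC; the function name `sensitivity_at_fpr` and the toy labels and scores are hypothetical.

```python
def sensitivity_at_fpr(labels, scores, max_fpr=0.05):
    """Return the highest sensitivity (TPR) reachable with FPR <= max_fpr.

    labels: iterable of 0/1 (1 = degradable); scores: predicted probabilities.
    """
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples tied at one score must be accepted or rejected together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / n_neg > max_fpr:
            break  # lowering the threshold further only adds false positives
        best_tpr = tp / n_pos
    return best_tpr


# Toy data (made up for illustration): 5 positives, 20 negatives.
toy_labels = [1, 1, 0, 1, 0, 1, 1] + [0] * 18
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2] + [0.1] * 18
print(sensitivity_at_fpr(toy_labels, toy_scores, max_fpr=0.05))
```

With 20 negatives, a limit of 0.05 tolerates exactly one false positive, which is why the metric is sensitive to the size of the negative set in small benchmarks.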


Introduction
proteins to be degraded by CRBN were predicted by the soft voting model, and the distribution of probability scores is presented in S23 Fig. It is well known that only a subset of the human genome is considered druggable based on studies of drug-like small molecules, and only limited information is available for understudied disease proteins [49]. Hence, we built an understudied human protein database by removing the druggable proteins (Tclin and Tchem) in Pharos [50] and Casas's druggable proteins [51] from the human disease-associated genome database [52], and applied our method to predict the probability of these understudied disease proteins to be degraded

bioRxiv preprint doi: https://doi.org/10.1101/2023.02.23.529828; this version posted February 24, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

protein-CRBN complex structures were different from those in the original CRBN-ligand complex structure, but they formed similar hydrogen bonds with the oxygen atoms on the ring of the CRBN ligand moiety.

Many residues on these target proteins also formed favorable interactions with the warhead parts of the PROTACs. The interactions between CRBN and the CRBN ligand moiety, and between the target proteins and the warheads, indicate that the predicted PROTACs could bind CRBN and the target proteins simultaneously, and thus have the potential to bring the target proteins to CRBN and induce their degradation.

Key positions were also identified for these three proteins by eSHAP analysis. In order to compare with

explored by the PROTAC. The most promising advantage of PROTACs should be the ability to induce the degradation of these proteins. Our results for understudied human proteins could be a starting point for developing PROTACs against them. Building on our predictions, more detailed studies such as sequence and structure comparison, complex structure modeling, and linker modeling can be applied to help design PROTACs for these proteins.

Our method is the first machine learning method that uses features from the protein sequence to predict PROTAC-induced degradation of target proteins. In this work, 23 different feature descriptors, three classification methods, and combinations of them were used to build machine learning models. These models
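One common way to combine several classifiers is soft voting, in which the positive-class probabilities from each model are averaged for every protein before a threshold is applied. The sketch below illustrates that idea only; the text here does not specify which models or weights were combined, so the equal default weights, the 0.5 cutoff, the function name `soft_vote`, and the classifier names and scores are all assumptions made for illustration.

```python
def soft_vote(model_probs, weights=None):
    """Average positive-class probabilities across models, protein by protein.

    model_probs: dict mapping a model name to a list of probabilities,
                 one per protein, all lists in the same protein order.
    weights:     optional dict of per-model weights (equal weights if None).
    """
    names = list(model_probs)
    if not names:
        raise ValueError("at least one model is required")
    n = len(model_probs[names[0]])
    if any(len(model_probs[m]) != n for m in names):
        raise ValueError("all models must score the same proteins")
    if weights is None:
        weights = {m: 1.0 for m in names}  # assumption: equal weighting
    total = sum(weights[m] for m in names)
    return [
        sum(weights[m] * model_probs[m][i] for m in names) / total
        for i in range(n)
    ]


# Hypothetical scores from three classifiers for two proteins.
toy_probs = {
    "clf_a": [0.2, 0.9],
    "clf_b": [0.4, 0.7],
    "clf_c": [0.3, 0.8],
}
combined = soft_vote(toy_probs)
# Assumed cutoff of 0.5 for calling a protein degradable.
predicted_degradable = [p >= 0.5 for p in combined]
```

Averaging probabilities rather than counting hard votes lets a confident model outweigh two marginal ones, which is the usual motivation for preferring soft over hard voting.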

Fig 7. Structural mapping of the key positions for P12575, Q9UER7 and Q9H1I8. Pink represents strands, blue represents helices, green represents loop regions, yellow represents the co-crystallized ligand, and magenta represents the top-ranked positions.
