LIGHTHOUSE illuminates therapeutics for a variety of diseases including COVID-19

Summary One of the bottlenecks in the application of basic research findings to patients is the enormous cost, time, and effort required for high-throughput screening of potential drugs for given therapeutic targets. Here we have developed LIGHTHOUSE, a graph-based deep learning approach for discovery of the hidden principles underlying the association of small-molecule compounds with target proteins. Without any 3D structural information for proteins or chemicals, LIGHTHOUSE estimates protein-compound scores that incorporate known evolutionary relations and available experimental data. It identified therapeutics for cancer, lifestyle related disease, and bacterial infection. Moreover, LIGHTHOUSE predicted ethoxzolamide as a therapeutic for coronavirus disease 2019 (COVID-19), and this agent was indeed effective against alpha, beta, gamma, and delta variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that are rampant worldwide. We envision that LIGHTHOUSE will help accelerate drug discovery and fill the gap between bench side and bedside.


INTRODUCTION
Despite enormous efforts to eradicate serious medical conditions such as cancer and infectious diseases, the translation of innovative research results into clinical practice progress slowly (Klein et al., 2017), leaving a large gap between bench side and bedside. The difficulty in identifying bioactive chemicals for a given target protein is one reason for this slow progress, with high-throughput screening (HTS) of a sufficiently diverse compound library being required for each target. About 10 60 natural compounds with a molecular mass of <500 Da are thought to exist (Dobson, 2004), but HTS in most cases has been performed with only $10 6 compounds. Over the past few decades, molecular docking simulations have become widely adopted to reduce the cost, time, and effort required for HTS. This approach has been successful for some proteins whose crystal structures have been solved. More recently, with the advent of AlphaFold2 (Jumper et al., 2021), the ability to predict protein structures has been greatly improved, but it remains difficult to identify pockets of proteins that are potential drug targets and drug discovery without three-dimensional (3D) structural information therefore remains a challenge. Given that high-resolution 3D structural data are not available for most proteins to date and the high computational requirements of molecular docking simulations, the application of this approach has been limited.
Mathematical approaches have gained popularity in various fields, including sensing technologies (Greybush et al., 2019;Mohammadi Estakhri et al., 2019), clinical stratification (Shimizu and Nakayama, 2021), and other medical areas (Rajkomar et al., 2019). In particular, recent advances in artificial intelligence (AI) have demonstrated its potential in the pharmaceutical industry (Paul et al., 2021). Although many AI-based drug discovery methods have been proposed, they have had limited success in translational medicine. Whereas some studies have presented AI models with biological validation experiments (Stokes et al., 2020) or de novo molecular design (Zhavoronkov et al., 2019), many others have performed only computer-based validation without proof-of-concept biomedical experiments (Huang et al., 2021;Ö ztü rk et al., 2018;Tsubaki et al., 2019). In addition, most platforms to date have been trained with small datasets, such as Directory of Useful Decoys Enhanced (DUD-E), that have known biases (Chen et al., 2019) and are far from reflecting real-world data. Furthermore, many existing methods are based on a single network structure, whereas ensemble learning, which combines multiple network structures with different properties, might be expected to be more accurate and appropriate for AI-based drug discovery (Hansen and Salamon, 1990).

LIGHTHOUSE predicts confidence and IC 50 -related scores for any protein-chemical pair
We developed an end-to-end framework that relies on a message passing neural network (MPNN) for compound embedding (Gilmer et al., 2017) to calculate scores for the association between any protein and any chemical. This chemical encoder takes simplified molecular-input line-entry system (SMILES) chemical encoding as input, considers the compounds as (mathematical) graph structures, and transforms them into low-dimensional vector representations. We adopted three different embedding methods for protein sequences: CNN (convolutional neural network) (Ö ztü rk et al., 2018), Transformer (Vaswani et al., 2017), and AAC (amino acid composition up to 3-mers) (Reczko and Bohr, 1994). These methods take amino acid sequences and embed them in numerical vectors that take into account nearby (CNN) or distant (Transformer) sequences or physicochemical properties (AAC). The products of these chemical and protein encoding steps are then concatenated and entered into a feed forward dense neural decoder network. Each chemical-protein pair is converted into a single score after this series of computations ( Figure 1A). We used this architecture to estimate both the confidence level for chemical protein pairs and their median inhibitory concentration (IC 50 ) values.
To train the platform to estimate confidence, we used $1.3 million compound-(human) protein interactions (CPIs) stratify-sampled from STITCH (Table S1), which is one of the largest CPI databases (Szklarczyk et al., 2016) and registers compound-protein pairs together with confidence scores. These scores are based on experimental data, evolutionary evidence such as homologous protein and compound relations, and cooccurrence frequencies in literature abstracts (scores range from 0 to 1, with 1 being the most reliable). To avoid overfitting, we randomly divided the overall data into training (80%), validation (10%), and test (10%) datasets ( Figure S1A).
We fed the network with protein primary structures and chemicals and trained it to output the scores from the STITCH training dataset ( Figure 1B). When we trained the three models (CNN, AAC, and Transformer for protein encoders) separately, the mean squared error (MSE) for the validation data was gradually decreased, and the area under the receiver operating characteristic curve (AUROC) was also improved (Figures S1B-S1G). These findings indicated that our AI models learned the approximation of the hidden 1D relation underlying the compound-protein pairs without overfitting the training data. We examined the performance of the models with the test dataset at the end of the training and (epoch-wise) validation phases, and we discovered that the AUROC for all three models was >0.80 (Table S2) iScience Article equivalent to or better than those of cutting-edge 3D docking simulations (Hsin et al., 2016;Moussa et al., 2021;Wang et al., 2020). It is of note that our AI models can be applied to proteins for which 3D structural information is not available. We took the harmonic mean of the three scores to define the confidence score ( Figure 1B).
We also trained the models to predict scores based on IC 50 values. For this purpose, we used data from BindingDB (Gilson et al., 2016), which collects a variety of experimental findings, and we divided the data into training (80%), validation (10%), and test (10%) datasets ( Figure S2A). The same architecture was adopted to train the AI models to predict scaled IC 50 values ( Figure 1C), yielding an interaction score, and we confirmed that the models adequately learned how to predict IC 50 from amino acid sequencechemical pairs ( Figures S2B-S2G). Finally, we assessed the performance of the models with undisclosed test data, finding that they performed well in predicting IC 50 (Table S3).

LIGHTHOUSE architecture outperforms state-of-the-art methods
We next compared the performance of LIGHTHOUSE with that of similar existing methods. To ensure a fair comparison, we used DUD-E data as were used in previous studies (Tsubaki et al., 2019;Wallach et al., 2015). In brief, we randomly split DUD-E data (102 target proteins) into training (72 proteins) and test (30 proteins) data and then trained the LIGHTHOUSE architecture with the DUD-E training data for classification of compounds as active or decoy with regard to the protein in question. After this training, we examined LIGHTHOUSE performance with the DUD-E test data ( Figure S3A). Of note, we used only amino acid sequences of the proteins for training and evaluation of LIGHTHOUSE, even though DUD-E provides structural data (as PDB files) for proteins. In addition, we used the balanced dataset of DUD-E-the training samples comprise 22,886 active (positive) and 22,886 decoy (negative) samples-as in a previous study (Tsubaki et al., 2019). LIGHTHOUSE yielded an AUROC for the DUD-E test data of 0.956 ( Figure S3B), which was higher than the values produced by state-of-the-art methods including 3D-CNN (Ragoza et al., 2017), AtomNet (Wallach et al., 2015), and a graph-based deep learning method proposed by Tsubaki et al. (2019) (Figure S3C).
For further comparison of LIGHTHOUSE with the second best method, we downloaded CPI data for human and Caenorhabditis elegans from the GitHub repository of Tsubaki et al. (2019). Both of these datasets were generated previously (Liu et al., 2015). Similar to Tsubaki et al., we retrained the LIGHTHOUSE architecture with these training data, and we found that LIGHTHOUSE outperformed this other method on the basis of both AUROC and F1 metrics (Table S4). These bodies of evidence thus show that LIGHTHOUSE is one of the best architectures for drug discovery available to date.

In silico verification of LIGHTHOUSE
We next evaluated the performance of LIGHTHOUSE in terms of its ability to predict known CPIs. We generated two datasets for this purpose: a ''Positive'' dataset consisting of reliable CPIs (STITCH Figure 1. Development of LIGHTHOUSE for discovery of drug candidates without 3D structural data (A) The basic network structure of LIGHTHOUSE consists of encoder and decoder networks. The basic network encodes the amino acid sequence of the protein of interest as numerical vectors by one of three independent methods: CNN, AAC, and Transformer. It also takes the SMILES representation of each small-molecule compound and computes the neural representation with the MPNN algorithm. The network then concatenates the protein and compound representations and calculates a ''Score.'' (B and C) LIGHTHOUSE consists of two modules. Module 1 estimates the association between a given compound-protein pair, and module 2 predicts a scaled IC 50 value for the pair. In each module, the three different streams of the basic network (MPNN_CNN, MPNN_AAC, and MPNN_Transformer) are used, and the harmonic mean of the three scores is presented as the final ensemble score. Each of the three streams in module 1 (B) is trained to minimize the error between the predicted ''Score'' and the score registered in the STITCH database, which contains millions of known and estimated CPIs. The higher the confidence score (closer to 1), the more confident LIGHTHOUSE is that there is some relation between the compound and the protein; conversely, the lower the confidence score (closer to 0), the more confident LIGHTHOUSE is that there is no such relation. Each of the three streams in module 2 (C) is trained to predict scaled IC 50 values with the use of BindingDB data. For instance, an interaction score of 4 means that, if the compound has inhibitory activity, the IC 50 would be $100 mM, whereas an interaction score of 9 means that, if the compound has inhibitory activity, the IC 50 would be $1 nM. Note that module 2 only works if the compound and protein interact, so this module is auxiliary to module 1. See also Figures S1, S2 and Tables S1, S2, S3. iScience Article confidence score of >0.9), and a ''Negative'' dataset in which the amino acid sequences of the ''Positive'' dataset were inverted so that they would no longer be expected to interact with the corresponding chemicals. Calculation by LIGHTHOUSE of the confidence scores for both datasets revealed that those for the ''Positive'' dataset were heavily skewed toward 1 ( Figure 2A). Receiver operating characteristic (ROC) curve analysis showed that the two datasets could be distinguished on the basis of their LIGHTHOUSE confidence scores ( Figure 2B). Given that the STITCH database used for the training of LIGHTHOUSE relies not only on experimental CPI data but also on co-appearance of chemicals and proteins in the literature, some well-studied molecules, such as ATP, have high values even in the ''Negative'' dataset. We calculated confidence and interaction scores for ATP as well as for the tyrosine kinase inhibitor drugs sorafenib and sunitinib as true positive examples. ATP showed a skewed distribution with respect to the confidence score, indicating that it is a false positive compound ( Figure S4A). In contrast, the distribution of confidence scores for both sorafenib and sunitinib ( Figures S4B and S4C) was completely different from that for ATP. Despite the presence of such false positives, LIGHTHOUSE proved to be effective in predicting the degree of association between protein-chemical pairs solely on the basis of protein primary structure.
To demonstrate further the predictive power of LIGHTHOUSE, we searched all 10 known targets of sorafenib registered in the DrugBank database (Wishart et al., 2018). We found that 9 out of the 10 known targets were located in the promising compartment (confidence score of >0.7 and interaction score of >7.0) ( Figure S5A), which constitutes statistically significant enrichment (p = 2.4 3 10 À14 , Fisher exact test). Furthermore, 8 of the 10 known sorafenib targets were among the top 25 candidate proteins (confidence score of >0.75 and interaction score of >7.5) ( Figure S5B). In addition to these 8 known sorafenib targets, the top 25 candidate proteins included an additional 2 kinases that may be unknown targets of this multikinase inhibitor. Thus, 40% of the top candidates (10 out of 25) were found to be experimentally verified or likely true positives.
We next validated the effectiveness of LIGHTHOUSE for well-studied compound-protein pairs. LIGHTHOUSE yielded high confidence scores for adrenergic receptors (a1, a2, b1, b2, and b3) and epinephrine ( Figure 2C). Histamine receptors are classified into four subtypes (Seifert et al., 2013), with HRH1 and HRH2 being targets of anti-allergy and antiulcer drugs, respectively. LIGHTHOUSE predicted that the HRH1 antagonist fexofenadine would associate to a greater extent with HRH1 than with HRH2, whereas the HRH2 inhibitor famotidine would associate to a greater extent with HRH2 than with HRH1 (Figure 2D). These results suggested that LIGHTHOUSE is able to accurately discriminate receptor subtypelevel differences solely on the basis of amino acid sequences.
LIGHTHOUSE also proved informative both for macrocyclic chemicals such as rapamycin, yielding a high confidence score for this drug and mechanistic target of rapamycin (MTOR) ( Figure 2E), as well as for peptide drugs such as bortezomib (used for treatment of multiple myeloma), leuprorelin (hormone-responsive cancers), and semaglutide (type 2 diabetes) ( Figure 2F), yielding high confidence scores for these drugs and their known targets: proteasome subunit PSMB1 (Berkers et al., 2005), gonadotropin-releasing hormone receptor (GNRHR) (Borroni et al., 2000), and glucagon-like peptide-1 (GLP-1) receptor (GLP1R) (Knudsen and Lau, 2019), respectively. Given the rapidly growing demand for peptide drugs (Muttenthaler et al., 2021), LIGHTHOUSE will prove useful for the development of novel peptide therapeutics for a variety of promising targets.
We also applied LIGHTHOUSE to five drugs that were approved by the US Food and Drug Administration (FDA) in 2020 but which had not yet been registered in the STITCH database. LIGHTHOUSE successfully  iScience Article predicted the association between these new drugs and their target proteins ( Figure 2G), indicating the expandability of LIGHTHOUSE to a much larger exploration space than that encompassed by STITCH.
Furthermore, we also evaluated IC 50 data from BindingDB, which are derived from actual bioassays. We found that the predicted value and observed value correlate well for the BindingDB test dataset ( Figure S6). This series of findings thus demonstrated the ability of LIGHTHOUSE to discover new drugs for a broad spectrum of diseases.
LIGHTHOUSE discovers an inhibitor of PPAT, a key metabolic enzyme for cancer treatment We investigated whether LIGHTHOUSE can identify compounds for potentially important therapeutic targets. As such a target, we chose phosphoribosyl pyrophosphate amidotransferase (PPAT), a rate-limiting enzyme in the de novo nucleotide synthesis pathway, given that its expression is most correlated among all metabolic enzymes with poor prognosis in various human cancers and that its depletion markedly inhibits tumor growth (Kodama et al., 2020). Although no PPAT inhibitor has been developed and the 3D structure of the protein has not been solved, we attempted to discover an inhibitor for PPAT by LIGHTHOUSE solely on the basis of its amino acid sequence. We virtually screened $10 9 commercially available compounds in the ZINC database (Sterling and Irwin, 2015) ( Figure S7). To reduce the calculation time, we adopted a stepby-step application of LIGHTHOUSE ( Figure 3A). The MPNN_CNN model excluded most of the chemicals unrelated to PPAT, with only 2.41% of the starting compounds having a score of >0.5 in this initial screening ( Figure 3B). The selected compounds were then processed by the MPNN_AAC and MPNN_Transformer models, which reduced the number of candidate chemicals to 0.0356% of the initial compounds. We also calculated interaction scores by LIGHTHOUSE and visualized them in a 2D plot ( Figure 3C, left). We hypothesized that we would achieve better prediction with the use of both confidence and interaction scores, and the model indeed improved by $17.9% as determined on the basis of the AUROC (Figures 2A, 2B, and S8).
The best candidates would be expected to have high confidence and interaction scores, appearing in the upper right corner of the 2D plot. Indeed, this criterion was met by several well-known drug-target combinations ( Figure 3C, right). To thoroughly investigate known drug-protein interactions, we downloaded the ''Model List of Essential Medicines'' published by WHO (World Health Organization, 2021) and then searched the DrugBank database (Wishart et al., 2018) for each essential drug and excluded those without known target proteins. If multiple known target proteins were registered, the protein at the top of the target list was used as a representative example, and its amino acid sequence was obtained from UniProt, which is linked to DrugBank. Of the 112 drug-protein combinations that we calculated by LIGHTHOUSE, most (79 combinations, 70.5%) (Table S5) were located in the upper right compartment depicted in Figure 3C (right).
Among the >333,000 final compounds in the PPAT analysis, the top candidate PPAT inhibitor with the highest confidence score was ZINC8551105 (riboflavin 5 0 -monophosphate), with a predicted IC 50 of 1-10 mM ( Figure 3D). We performed a biochemical assay to test this prediction and found that riboflavin 5 0 -monophosphate indeed markedly inhibited PPAT activity with an actual IC 50 of 7.5 mM ( Figure 3E). This compound, discovered by LIGHTHOUSE solely on the basis of the PPAT amino acid sequence, is thus a potential lead compound for the development of new therapeutics targeted to a variety of cancers. It is also of note that we tested only this compound, so other top candidates might also inhibit PPAT activity.

LIGHTHOUSE identifies an inhibitor of drug-resistant bacterial growth
Bacterial infections pose a clinical problem worldwide, especially in developing countries, and the emergence of drug-resistant bacterial strains as a result of the overuse of antibiotics has exacerbated this problem. b-Lactamase enzymes produced by antibiotic-resistant bacteria (Tooke et al., 2019) target the b-lactam ring of antibiotics of the penicillin family. We therefore applied LIGHTHOUSE to search for antibiotics not dependent on b-lactam structure.
LIGHTHOUSE predicted that pyridoxal 5 0 -phosphate might associate with penicillin binding proteins such as PBP2 (mrdA), PBP3 (ftsI), and PBP5 (dacA), all of which are essential for cell wall synthesis in Escherichia coli (Macheboeuf et al., 2006) ( Figure 4A). This compound indeed suppressed the growth of wild-type E. coli strain JM109 in a concentration-dependent manner ( Figure 4B). Importantly, pyridoxal 5 0 -phosphate also markedly inhibited the growth of an ampicillin-resistant E. coli transformant that produces b-lactamase . Discovery of lead compounds for treatment of cancer (A) Scheme for PPAT inhibitor discovery. The amino acid sequence of PPAT (517 residues) and the SMILE representation for each chemical were entered into the MPNN_CNN model. If the predicted score was >0.5, the compound was entered into MPNN_AAC, and if the new predicted score was >0.5, the compound was entered into MPNN_Transformer. The harmonic mean of the three scores was then computed to obtain the confidence score.
(B) Almost 1 billion compounds in the ZINC database were processed as in (A). The first filter (MPNN_CNN score >0.5) and subsequent two filters (MPNN_AAC score >0.5, MPNN_Transformer score >0.5) greatly reduced the initial chemical space (to 0.0356%). The interaction scores for these selected candidates were then also calculated.
(C) A 2D map of the 333,290 selected candidates from (B) is shown on the left. Ideal candidates would be expected to have high confidence and interaction scores and would be plotted in the upper right corner of the map. Indeed, some wellknown drug-target pairs meet this criterion, as shown on the right, with compounds represented by the blue circles in the shaded area potentially possessing inhibitory activity for PPAT. ABL1, ABL proto-oncogene 1; COX1, cyclooxygenase 1; HMG-CoA, 3-hydroxy-3-methylyglutaryl-coenzyme A. iScience Article ( Figure 4C). These results thus suggested that, even though it was trained with human proteins, LIGHTHOUSE can also be applied to nonhuman (even bacterial) proteins.

LIGHTHOUSE informs optimization of lead compounds
Diabetes mellitus is also a serious public health concern, with the number of affected individuals expected to increase markedly in the coming decades (Zheng et al., 2018). Dipeptidyl peptidase-4 (DPP-4) cleaves and inactivates the incretin hormones GLP-1 and glucose-dependent insulinotropic polypeptide (GIP), and DPP-4 inhibitors are a new class of antidiabetes drug (Deacon, 2020). Given that LIGHTHOUSE also predicts interaction scores, we examined whether it might also contribute to the optimization step of drug development. Indeed, LIGHTHOUSE accurately predicted the rank order of potency for several recently identified DPP-4 inhibitor derivatives (Li et al., 2016) ( Figure 5A). Furthermore, LIGHTHOUSE predicted that removal of the phosphate group would reduce the inhibitory potency of riboflavin 5 0 -monophosphate for PPAT ( Figure 3E), and this prediction was confirmed correct by the finding that the IC 50 value was increased from 7.5 to 49.9 mM ( Figure 5B). These data suggested the possibility that LIGHTHOUSE is capable of predicting activity cliffs.
LIGHTHOUSE is also able to estimate the effect of point mutations on CPIs. For example, the T315I mutation of ABL1 in leukemia cells reduces the efficacy of imatinib (Jabbour et al., 2008), and LIGHTHOUSE accurately predicted the effect of this mutation ( Figure 5C). LIGHTHOUSE is able to provide such insight from only wild-type amino acid sequences, given the lack of variant information in the original (C) The JM109 strain of E. coli was transformed with the pBlueScript II SK + plasmid, which contains an ampicillin resistance gene as a selection marker, and the cells were plated on LB agar plates containing ampicillin in the absence or presence of pyridoxal 5 0 -phosphate (3 mg/mL) and were incubated overnight. The pH of pyridoxal 5 0 -phosphate was adjusted to 7.0 in order to avoid potential nonspecific toxicity.

OPEN ACCESS
iScience 25, 105314, November 18, 2022 9 iScience Article training dataset. Our results suggest that LIGHTHOUSE is able to predict the effects of small changes in protein or chemical structure, and that this will be the case even if such variants do not exist in nature.

LIGHTHOUSE identifies potential on-and off-targets of given compounds
Opposite to the mode of drug discovery for a given protein, LIGHTHOUSE might also be able to identify proteins as potential on-or off-targets for a given compound. To verify this notion, we examined statins, which are HMG-CoA reductase inhibitors widely administered for the treatment of hyperlipidemia. Epidemiological studies have shown that statins not only lower cholesterol, however, but also have effects on iScience Article cancer, although the target molecules for these effects have remained unclear (Mei et al., 2017). We therefore applied LIGHTHOUSE to three representative statins (atorvastatin, cerivastatin, and fluvastatin) and computed confidence scores for all human protein-coding genes ( Figures 6A and 6B, Table S6). We then sorted the genes on the basis of these confidence scores and performed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis for the top 500 potential statin targets. In addition to  iScience Article lipid-related pathways such as atherosclerosis and fatty liver, ''pathways in cancer'' was one of the most enriched KEGG pathways ( Figure 6C), consistent with previous findings (Ahern et al., 2014;Alfaqih et al., 2017;Matusewicz et al., 2015;Mullen et al., 2016). Potential targets of statins for cancer treatment identified by LIGHTHOUSE included STAT3, CCND1, AKT1, and CCL2 ( Figure 6D).
Given that side effects of drugs often manifest in organs that express target proteins, we hypothesized that LIGHTHOUSE might be able to identify which organs are at risk of damage from a given drug. We performed another enrichment analysis for the same top 500 potential statin target genes to determine which organs or cell types preferentially express these genes. The top three candidates were the liver, adipocytes, and lung ( Figure 6E), consistent with the liver being the primary site of statin metabolism and interstitial pneumonia being one of the most severe side effects of statins (Momo et al., 2018). Prediction of potential target proteins for a given drug by LIGHTHOUSE will thus provide insight into which organs warrant close monitoring by physicians during treatment with the drug, especially in first-in-human clinical trials.

LIGHTHOUSE identifies potential therapeutics for COVID-19
SARS-CoV-2 emerged at the end of 2019 and has caused a pandemic of infectious pulmonary disease, COVID-19 (Hu et al., 2021). We noticed that genes whose expression is up-regulated after SARS-CoV-2 infection (Blanco-Melo et al., 2020;Riva et al., 2020;Wyler et al., 2021) were enriched in the list of potential statin targets identified by LIGHTHOUSE ( Figure 7A). Indeed, previous studies have shown that statins prevent exacerbation of COVID-19 (Gupta et al., 2021;Zhang et al., 2020). With this finding that LIGHTHOUSE is also effective for COVID-19 drug discovery, we applied it to the virtual screening of $10,000 approved drugs, given that drug repurposing may allow faster delivery of effective agents to patients in need. We calculated scores for angiotensin-converting enzyme 2 (ACE2), which is targeted by SARS-CoV-2 for infection of host cells (Walls et al., 2020), and the top drug candidate, ethoxzolamide, was selected for validation analysis ( Figure 7B). Immunocytofluorescence analysis revealed that ethoxzolamide blocks proliferation of SARS-CoV-2 in Vero-TMPRSS2 cells ( Figure 7C). Furthermore, ethoxzolamide was effective against not only the wild-type virus but also the alpha, beta, gamma, and delta variants. It thus rescued virus-challenged cells in a concentration-dependent manner without affecting non-infected cells (median cytotoxicity concentration of >50 mM) ( Figures 7D, 7E, and S9; Table S7), and it reduced the virus load present in the culture supernatant of the cells (Figures 7F, 7G, and S10). Ethoxzolamide is approved for the treatment of seizures and glaucoma (Ghorai et al., 2020;Pospelov et al., 2021), and its pharmacodynamics are therefore known. It is therefore immediately available for repurposing for the treatment of patients with COVID-19, with its further optimization having the potential to save many lives.
Experimental evaluation of the potential inhibitory effect of acetazolamide, which was ranked second in the list of potential ACE2-targeting drugs predicted by LIGHTHOUSE and has a known mechanism of action (carbonic anhydrase inhibitor) similar to that of ethoxzolamide, revealed that acetazolamide did not inhibit SARS-CoV-2 infection, despite its LIGHTHOUSE confidence score for ACE2 being very similar to that of ethoxzolamide (0.837 versus 0.881, respectively). We therefore hypothesized that ethoxzolamide might preferentially target a SARS-CoV-2 protein rather than host ACE2. To test this hypothesis, we computed the scores of ethoxzolamide and acetazolamide for each of the SARS-CoV-2 protein sequences (4789 entries in the UniProt database, including predictions) ( Figure S11). The virus-derived protein with the largest iScience Article difference in confidence scores (ethoxzolamide minus acetazolamide) was spike protein S, and ORF7, a protein involved in host-virus interaction, showed the second-largest difference (Table S8). These results suggested that ethoxzolamide also targets the viral S protein, whereas acetazolamide does not. Molecular docking simulation suggested that ethoxzolamide binds to the interface between ACE2 and S protein and thereby blocks virus entry into host cells ( Figure S12).
We tested another 10 drugs and found another 2 compounds that also inhibited SARS-CoV-2 infection (Table S9). This high hit rate of 25% (3 hits in 12 compounds) shows the potential power of LIGHTHOUSE for drug discovery.

DISCUSSION
Although recent advances in biological and medical research have uncovered various proteins as promising therapeutic targets in a variety of diseases, the clinical application of these research findings has been limited because of the difficulty in identifying therapeutic chemicals for these targets in a cost-effective and high-throughput manner. Acquisition of 3D structural data for target proteins has been laborintensive, and processing of such data requires a huge amount of computer capacity and time, resulting in a delay in the translation of research findings from the laboratory to the clinic. We have now shown that LIGHTHOUSE facilitates the identification, from a vast chemical space, of candidate compounds for given target proteins solely on the basis of the primary structure of these proteins. Furthermore, the AUROC for LIGHTHOUSE is equivalent to or better than that for state-of-the-art 3D docking simulation methods as well as for other AI methodologies.
Existing in silico drug discovery methods can be broadly classified into two categories: SBDD (structure-based drug discovery) and LBDD (ligand-based drug discovery). SBDD requires the 3D structure of the target protein and manual configuration of the binding box. LBDD requires a list of existing compounds with activity data for the target protein. These substantial requirements limit the applicability of AI-powered drug discovery. LIGHTHOUSE is a computational method that eliminates the drawbacks of these two approaches: It does not require the 3D structure of the target protein or the setting of the binding box, and it can be applied to ''undruggable'' targets for which no inhibitors are currently available. Conventional methods, in particular SBDD, require desktop computers or high-performance cloud computing (such as Amazon EC2) and can calculate only up to 10 CPIs per minute. In contrast, LIGHTHOUSE requires only a simple laptop computer and is able to calculate >1000 CPIs per minute. The time and cost savings of LIGHTHOUSE are therefore several orders of magnitude relative to conventional docking simulation techniques.
We have applied LIGHTHOUSE to attractive targets for various diseases, including cancer, bacterial infection, metabolic diseases, and COVID-19. We have presented three examples that show the effectiveness of LIGHTHOUSE with experimental validation. The actual targets of identified compounds require further biological investigation, especially in the case of pyridoxal 5 0 -phosphate and PBPs. As for ethoxzolamide, we attempted to obtain clinical evidence in support of its effectiveness against COVID-19. However, ethoxzolamide is an old drug and is essentially no longer prescribed as a result of the development of more potent agents such as acetazolamide. We were therefore not able to find data to address whether COVID-19 patients taking ethoxzolamide have a better clinical outcome. We were also not able to decipher the mechanism of action of ethoxzolamide, but our results suggest that the drug might block the interaction between ACE2 and the viral S protein ( Figure S12). The beta, gamma, and delta variants of SARS-CoV-2 appeared to have a higher threshold for drug effectiveness compared with the original and alpha strains ( Figures 7F, 7G, and S10). However, when the drug concentration exceeds several micro molar, the virus load sharply drops to the level of undetected (please bear in mind that the y-axis is log 10 transformed). The important point with regard to our manuscript is that the virus-challenged cells could be rescued by applying ethoxzolamide, thereby providing experiential validation of the prediction of LIGHTHOUSE.
As for COVID-19, there are many published AI systems to help clinicians. For example, various imaging data (X-ray, CT, MRI, etc.), blood tests, age, gender, region, etc., are integrated to predict the severity of COVID-19 and the occurrence of various symptoms, which definitely help physicians to choose the appropriate approved drugs for that patient (Jamshidi et al., 2020). LIGHTHOUSE differs from these AI models in that it discovers drugs on the basis of the amino acid sequence of a target protein, not on that of clinical information. Furthermore, LIGHTHOUSE is able to identify potential drugs not only for COVID-19 but also for other diseases including cancer, bacterial infection, and lifestyle-related conditions. ll OPEN ACCESS iScience 25, 105314, November 18, 2022 iScience Article The hit compounds themselves identified in the present study are not sufficient for immediate clinical use. An additional potential application of LIGHTHOUSE is drug optimization. One promising method to support such optimization is to apply LIGHTHOUSE and either reinforcement learning (Pereira et al., 2021) ( Figure S13) or Metropolis-Hasting (MH) approaches together. Virtual libraries can be generated from identified lead compounds in an intensive manner with the use of sophisticated chemoinformatics algorithms such as RECAP (Retrosynthetic Combinatorial Analysis Procedure) (Lewell et al., 1998). LIGHTHOUSE can score the generated virtual compounds and help to narrow down the candidates with better scores than the original hit compound. Selected candidates can then be synthesized in collaboration with organic chemists and their effects tested. Given the recent success of the MH approach in various life science fields (Biswas et al., 2021;Chen et al., 2021), LIGHTHOUSE should be able to facilitate the optimization of drug candidates by iterating these steps.
LIGHTHOUSE is not limited to compounds for drug repurposing; it can be applied to any compound for which an SMILES representation is available. LIGHTHOUSE facilitates the discovery of compounds that have the potential to interact with targets for which no binding agents have previously been identified. For instance, PPAT has had no known inhibitors and has therefore been categorized as undruggable. In addition, the 3D structure of PPAT has not been reported, with the result that conventional docking simulation methods cannot be applied to discover potential inhibitors. With the use of the amino acid sequence of PPAT alone, LIGHTHOUSE was able to predict potential inhibitors, one of which, compound ZINC8551105, was indeed shown to inhibit PPAT activity. PPAT is therefore no longer an undruggable target. Furthermore, with the use of a hit compound as a starting point, LIGHTHOUSE can be applied to drug optimization. We have calculated the LIGHTHOUSE scores for thousands of analogs of hit compounds that interact with PPAT, and have thereby discovered several PPAT inhibitors that are more potent than the original compounds (manuscript in preparation). LIGHTHOUSE can thus be applied to de novo drug discovery and drug optimization as well as to drug repurposing.
Open Targets is a tool that allows data mining on the basis of the function of a gene of interest and a summary of genome-wide association study (GWAS) data available for the relation of the gene to disease (Ochoa et al., 2022). PandaOmics automatically executes differentially expressed gene (DEG) searches and network analysis on the basis of publicly available data from The Cancer Genome Atlas and GEO databases or user-uploaded data. In addition, the AI system of PandaOmics displays and ranks genes that are the most successful therapeutic targets for a disease (Zhavoronkov et al., 2019). Although both of these methods are relatively effective for target identification, they are not designed to identify medications for the target genes. The identification of promising therapeutic target genes by Open Targets and PandaOmics can thus be followed by the prediction of lead compounds for these targets by LIGHTHOUSE.
An important improvement will be to reduce the number of false positives, which may necessitate taking advantage of CPIs that have not been experimentally explored with the use of methods such as collaborative filtering. The next step for more sophisticated AI-dependent drug discovery and its clinical application will be the use of huge-scale and robust protein-chemical binding (and nonbinding) data, with the recent introduction of academic journals that specialize in big data reflecting the growth of this field. Moreover, development of improved approaches to the handling of imbalanced data  is another active research field in computer science. These advances in both data and methodology should allow more reliable modeling of physical interactions and further facilitate drug discovery in the coming years. For de novo drug design, it will be necessary to develop an approach to efficiently explore the huge chemical space (10 60 compounds), and the development of algorithms using Bayesian optimization with appropriate acquisition function as well as chemical generators ( Figure S13) is desirable.
In summary, we have developed LIGHTHOUSE as a means to discover promising lead compounds for any target protein irrespective of its 3D structural information. Furthermore, we have demonstrated the power of LIGHTHOUSE by identifying and validating therapeutics for various global health concerns including COVID-19. LIGHTHOUSE will serve as a guide for researchers in all areas of biomedicine, paving the way for a wide range of future applications.

Limitations of the study
The biggest limitation of LIGHTHOUSE is the generation of false positives, which is due in part to the fact that the confidence score provided by STITCH is not based solely on experimental data but also on other factors iScience Article such as co-occurrence in the literature. This confidence score therefore does not necessarily reflect actual interaction for a given protein-chemical pair, and well-studied molecules are thus prone to score higher than others. On the other hand, low confidence scores do not necessarily mean that the protein and chemical in question do not interact. We modeled this score because current biologically determined CPI datasets contain fewer CPI pairs. In addition, as a result of publication bias and biological experimental conditions, it is sometimes difficult to tell whether a protein and chemical do not interact, whether they did not bind under the specific assay condition, or whether they were not tested. This drawback of LIGHTHOUSE can be mitigated partially by combining the three different models (CNN, AAC, and Transformer). It may also be important to perform a counter-virtual screening to determine whether an identified small molecule reacts specifically with the target protein or whether it scores highly with many proteins. Such an approach has the potential to reduce the number of false positives and provide more accurate guidance. Furthermore, we also modeled IC 50 data from BindingDB, which is derived from actual bioassays. We found that the predicted value and observed value correlate well in the BindingDB test dataset ( Figure S6). By combining the confidence score (from STITCH) and the interaction score (from BindingDB) provided by LIGHTHOUSE, we were able to discover a clinically approved drug that was able to block SARS-CoV-2 infection.
Another potential limitation is that the performance for STITCH data might be overestimated because we downsampled the original imbalanced data (Table S1). There are several ways to tackle imbalanced data, including downsampling, oversampling, and more complex sampling methods such as SMOTE (Chawla et al., 2011). Each of these approaches has potential limitations, however (Kulkarni et al., 2021;Yu and Zhou, 2021). In the present study, we adopted downsampling, as in a previous study (Tsubaki et al., 2019), so the performance for STITCH data may be overestimated when compared with use of the entire STITCH data. Furthermore, LIGHTHOUSE can in principle handle only low molecular weight compounds, and it is therefore challenging to apply it directly to the design of nucleic acid, which is gaining popularity as biopharmaceuticals and biosensors.
Despite these limitations, LIGHTHOUSE proved to be effective for the identification of lead compounds for all conditions tested. It can theoretically be applied to any protein of any organism, and even to proteins that do not exist naturally. This is an advantage over 3D docking simulation methods, which require prior 3D structural knowledge of the protein of interest. LIGHTHOUSE computes and embeds structural information in numerical vectors, which are then readily retrieved by the subsequent decoding module. Given the accelerating development of protein embedding technologies (Bepler and Berger, 2021) and graph-based chemoinformatics approaches, LIGHTHOUSE has the potential to be a cornerstone of drug discovery. It is also of note that we split the STITCH dataset once, given that a previous study (Tsubaki et al., 2019) showed it could obtain high performance with a relatively small range of hyperparameter tuning. In-depth hyperparameter tuning with cross-validation may further boost the performance of the model.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following: Article where e vw represents edge feature between nodes v and w, N(v) denotes the neighbor nodes of vertex v in graph G, and message functions M t and update functions U t are learned differentiable functions. After T (= 3) cycles of message passing and subsequent update, a readout function (average) is used to extract the embedding vectors at the graph level.
CNN is powerful for computer vision, but here we used a multilayer 1D CNN for protein sequence, as described previously (Ö ztü rk et al., 2018). In brief, the target amino acid is decomposed to each individual character and is encoded with an embedding layer and then fed into the CNN convolutions. We used three consecutive 1D convolutional layers with an increasing number of filters, with the second layer having double and the third layer having triple the number of filters in the first layer (32, 64, and 96 filters for the three layers). The convolution layers are followed by a global max-pooling layer. AAC is an 8420-length vector in which each position corresponds to a sequence of three amino acids (Reczko and Bohr, 1994). Transformer uses a self-attention-based transformer encoder (Vaswani et al., 2017) that operates on the substructure partition fingerprint of proteins. Algorithmically speaking, Transformer follows O(n 2 ) in computation time and memory, where n is the input size. This bottleneck prevented us from considering each amino acid as a token. We therefore used partition fingerprints to decompose amino acid sequence into protein substructures of moderate size and then fed each of the partitions into the model as a token (Huang et al., 2021).
As for the decoder, we exploited a previously described architecture (Paul et al., 2021). In brief, encoder outputs are concatenated and entered into a three-layer feed forward dense neural network (1024,1024, and 512 nodes), which finally outputs one value. We used Rectified Linear Unit (ReLU) (Shimizu and Nakayama, 2020), g(x) = max(0,x), as the activation function in the decoder network.
We defined our loss function with MSE (Equation 4): where P i is the LIGHTHOUSE-predicted score for the ith compound-protein pair and Y i is the true label in the corresponding training data, with a batch size of 128. We trained three architectures (MPNN_CNN, MPNN_AAC, MPNN_Transformer) separately for the STITCH and BindingDB training data with the Adam optimizer and a learning rate of 0.001. For evaluation metrics, we used MSE, concordance index, and Pearson correlation as well as AUROC. For every 10 epochs, we compared the current loss (in the validation dataset) with that of 10 epochs ago; if the loss was not decreasing, we terminated the training for that model. As a result of this early termination, we trained MPNN_CNN for 40 epochs, MPNN_AAC for 70 epochs, and MPNN_Transformer for 100 epochs with regard to the confidence score (Figures S1B-S1G). As for the models for the interaction score, we trained MPNN_CNN for 70 epochs, MPNN_AAC for 100 epochs, and MPNN_Transformer for 70 epochs ( Figures S2B-S2G), according to the same guidelines. After the training was completed, we finally evaluated the models with the test datasets, which were kept aside during the training and so had not previously been seen by the models.

Linear combination of confidence and interaction scores
We generated a combined score by linear combination of the confidence and interaction scores (confidence + alpha*interaction), where alpha represents the relative weight of the interaction score. We initially set alpha in order to scale the maximum values of the confidence (max 1) and interaction (max 10) scores. We then conducted a grid search with regard to alpha, which ranges from 0.05 to 0.2, and identified the optimal alpha to distinguish between positive and negative data. Finally, the combination score was defined as confidence score plus 0.075*interaction score.

Generation of virtual chemical libraries and prediction by LIGHTHOUSE
We prepared nearly 1 billion purchasable substances, which were downloaded from the ZINC database (Sterling and Irwin, 2015) as of 30 July 2020, for virtual PPAT inhibitor screening. For drug repurposing, we obtained approved drugs from the KEGG-DRUG database (Kanehisa et al., 2021) as of 24 January 2021. For calculation of confidence and interaction scores, we fixed the proteins of interest (PPAT or ACE2) and changed the compounds iteratively, which yielded lists of predicted scores for all the ll OPEN ACCESS 22 iScience 25, 105314, November 18, 2022 iScience Article compounds tested. As for the peptide drugs shown in Figure 2, we converted them as for small-molecule compounds with the use of SMILES.

Virtual identification of statin targets and enrichment analyses
Three representative statins were fixed as chemical inputs, and all human protein-coding genes in the UniProt database were iteratively changed. The harmonic mean of the three confidence scores was calculated as an affinity score for statins, and the human protein-coding genes were sorted on the basis of this score. The resulting top 500 potential targets were then subjected to enrichment analyses with the use of the Metascape web server (Zhou et al., 2019).

Molecular docking simulation
The crystal structure of the receptor binding domain of the spike protein of SARS-CoV-2 in complex with ACE2 (Shang et al., 2020) was downloaded from PDB (accession number: 6VW1) and converted to a pdbqt file with ADFRsuite (version 1.0) according to the recommendations of the developers. A grid box was then set with the following parameters in angstroms: center (X, Y, Z) = (80, 0, 180) and dimensions (X, Y, Z) = (30, 30, 30). Docking simulation was performed with the use of AutoDock Vina software (version 1.2.3) (Trott and Olson, 2010) and with an exhaustiveness of 32.

QUANTIFICATION AND STATISTICAL ANALYSIS
A p value of <0.05 was regarded as statistically significant. Statistical tests are indicated in the results or STAR+methods sections as well as in figure legends. We used R software (version 4.1.3) for statistical analysis, with the exception of the IC 50 calculation presented in Figure 3, for which we used JMP Pro (version 15.1.0).