Selection of SARS-CoV-2 main protease inhibitor using structure-based virtual screening

Background: Conserved domains within SARS coronavirus 2 nonstructural proteins represent key targets for the design of novel inhibitors. Methods: The authors aimed to identify potential SARS coronavirus 2 NSP5 inhibitors using the ZINC database along with structure-based virtual screening and molecular dynamics simulation. Results: Of 13,840 compounds, 353 with robust docking scores were initially chosen, of which ten hit compounds were selected as candidates for detailed analyses. Three compounds were selected as coronavirus NSP5 inhibitors after passing absorption, distribution, metabolism, excretion and toxicity study; root and mean square deviation; and radius of gyration calculations. Conclusion: ZINC000049899562, ZINC000169336666 and ZINC000095542577 are potential NSP5 protease inhibitors that warrant further experimental studies.

and F219L) have previously been documented [5,10]. The catalytic dyad of NSP5 in coronaviruses involves the residues His41 and Cys145.
The current treatment protocols include drugs that were designed to treat viral diseases other than COVID-19 [13][14][15]. Antiviral agents specific to COVID-19 are indeed a global requirement. The lack of specific antiviral drugs for combating SARS-CoV-2 has led researchers to conduct randomized clinical trials on COVID-19 patients using previously approved drugs, such as HIV protease inhibitors [16]. In this study, the authors aimed to screen the ZINC database for potential NSP5 protease inhibitors using structure-based virtual screening and then evaluate them using molecular dynamics (MD) simulations.

Database used
The ZINC database is used to screen for chemical compounds with inhibitory potential against enzymes and other targets in a specific fashion [17,18]. A comprehensive library of antiviral compounds containing 13,840 potential antiviral phytochemicals and synthetic compounds was generated from the ZINC database using multiple ligand file formats (e.g., .sdf ). Adjustment of hydrogens and lone pairs was allowed for all molecules, and charges were performed using Merck molecular force field 94x when required. After protonation of the molecules and by using a root mean square (RMS) gradient of 0.1 kcal/mol/Å, geometry minimization was performed on the molecules to optimize the arrangements [19]. Following molecule optimization, the file was saved in .mdb format (molecular operating environment [MOE] database) for subsequent virtual analysis.

Structure-based virtual screening
The complex structure of SARS-CoV-2 NSP5 protease with boceprevir provides a basis for the detection of lead inhibitors via screening in silico. To perform virtual screening, an in-house database of potential binding compounds was utilized for molecular docking simulations via the MOE software [20] suite 910. The SARS-CoV-2 3CLpro model (Protein Data Bank [PDB] ID: 7brp, chains A and B) was used for molecular docking. PDB code 7brp is a polyprotein replicase with a resolution of 1.80Å that cleaves the C-terminus at 11 sites. The boceprevir ligand of 7brp was used as a control. To increase the selectivity of the compounds, MOE was used to dock the selected hits at the M pro binding site. Before the preliminary docking protocol was applied, the selected ligand was removed from the complex protein crystal structure, and the docking protocol was then repeated on the protein binding site, thus validating the procedure. The RMS deviation (RMSD) between cocrypted and redocked ligands was measured using MOE scientific vector language script at 1.86Å [21]. The docking was successful in reproducing a determined binding mode for the intended protein-ligand complex.
The target protein's crystal structure was protonated using the Protonate 3D method [22], and energy minimization was accompanied by the MOE-implemented AMBER12 force field. To specifically identify the active site, the grid was set to match hits in the PDB. The triangular matcher algorithm was used as a placement method during docking. The London dG scoring and generalized Born volume integral/weighted surface area dG rescoring functions were also used. In a given binding site, all compounds were docked. Next, a protein-ligand interaction fingerprint (PLIF) module in MOE was applied to fingerprint the binding interaction of crucial residues between compounds and protein. More associations and binding poses were inspected visually using MOE 2015.10 (Chemical Computing Group, Montreal, Canada). Based on the binding profiles, five compounds from the authors' database were subjected to MD simulation.
PLIF analysis PLIF analysis was performed using the PLIF module in MOE 2015. 10. This analysis provides valuable information regarding protein-ligand interactions by converting 3D structural binding information into a 1D binary string [23]. Protein preparation steps were first performed on the five selected proteins. These steps included the addition of hydrogen atoms, bond order assignments, disulfide fixation and protonation state assignment followed by energy minimization, which was performed utilizing the AMBER10:extended Hückel theory force field, followed by superposing the protein structures. Finally, the interaction fingerprints were generated by labeling all common binding site residues in the five proteases. These fingerprints provide valuable information that can be used for pharmacophore model generation, virtual screening and clustering.

MD simulations
MD simulation was performed using the NAMD engine [24], which is integrated into the MOE software. The AMBER12:extended Hückel theory all-atom optimized potentials for liquid simulations force field (group II ion and group VIII parameters) [25] was applied, as it is often used to describe macromolecular systems. The system was solvated and neutralized by adding Clor Na + ions and water molecules, and the total number of solvent molecules was 9295 (equivalent to 1.023 g/cm 3 ). A periodic boundary condition was utilized, and the box size was 38.29 × 38.44 × 26.57Å. Following steepest descent geometry optimization, an equilibration of 100 steps was done in the constant pressure/constant temperature/constant number of particles ensemble with particle mesh Ewald for efficient full electrostatics. The SHAKE algorithm was utilized to ensure equilibrium values for all bonds involving hydrogen atoms [26]. Finally, the whole system was subjected to MD production runs at a temperature of 300 K for 100 ns in the absolute temperature/number of particles/volume ensemble.
For the last 100-ns MD production round, an integration time step of 2.0 fs was carried out, and the trajectories were captured after every 1 ps. The simulation analysis and visualization molecular dynamic tool use graphic processing units to accelerate the calculations. To predict the grounds for compound stability, RMSD, RMS fluctuation (RMSF), radius of gyration (RoG) and hydrogen bond stabilization were determined in the simulated systems. Prism 8 (GraphPad Software, CA, USA) was used to plot the graphs.

Free energy calculations
Calculations of the binding free energy for the ligand-protein complex were estimated from the MD production trajectories. The molecular mechanics-generalized Born surface area/molecular mechanics-Poisson-Boltzmann surface area approach incorporated into AmberTools20 was utilized for the free energy calculations [27]. The following method was carried out to estimate the binding free energy ( G bind ) within each ns [28]: The absolute free energy (G) for every case of the complex, receptor and ligand was calculated utilizing the formulas: in which E MM is the molecular mechanics energy (sum of internal, electrostatic and van der Waals energy terms), G sol is the desolvation free energy, G sol-ele is the electrostatic solvation component and G SASA is the non-electrostatic solvation component. Runs of 100 ns were utilized for normal mode calculations for amino acid residues around the ligand of less than 12Å.
Absorption, distribution, metabolism, excretion and toxicity studies Absorption, distribution, metabolism, excretion and toxicity (ADMET) characteristics were foreseen using the ADMET descriptors in BIOVIA Discovery Studio 4.5 (Dassault Systèmes, Accelrys, CA, USA). This module uses six mathematical models to properly quantify and predict ADMET properties [29]. The ADMET absorption model can predict human intestinal absorption (HIA) after oral administration. The 95% and 99% confidence ellipses in the ADMET 2D polar surface area (2D PSA) and ADMET AlogP98 plane define the absorption levels of the HIA model [30]. The ellipses highlight the domains where compounds with excellent absorption are likely to be selected. The upper limits of the 2D PSA values for the 95% and 99% confidence ellipsoids are 131.62 and 148.12, respectively. The ADMET aqueous solubility model predicts the solubility of each compound in water at 25 • C. This model is based on a genetic partial least squares method using 784 compounds in which solubility was measured experimentally [31].
The ADMET blood-brain barrier (BBB) model predicts the BBB penetration of a ligand following oral administration. This model originated from a quantitative linear regression study to predict BBB penetration and the 95% and 99% confidence ellipses analogous to those of HIA in the plane of ADMET 2D PSA and ADMET AlogP98. These values were obtained from many molecules that enter the CNS following oral administration [32].
An ADMET plasma protein binding model was used to predict whether a compound has high binding characteristics to blood carrier proteins. Calculations are based on AlogP98 and 1D symmetry to two marker compounds. One group is used to mark the binding at a level of ≥90%, whereas the other group flags the binding at a higher level of ≥95%. The predicted binding affinity is modulated based on the situations used to calculate logP [33]. The ADMET CYP2D6 binding model predicts the inhibition of the cytochrome P450 2D6 enzyme using probability speculation and the 2D chemical structure. These predictions are based on a training set of 100 molecules that have CYP2D6 inhibition characteristics. The ADMET hepatotoxicity model predicts the potential human liver toxicity of many miscellaneous compounds. The predictions are based on a model-based recursive partitioning of 382 training compounds that are toxic to the liver. Liver toxicity occurs as a result of a reaction to certain substances or drugs and can be characterized by neoplasia or elevated aminotransferase levels in a dose-dependent manner in >10% of the human population [34].

Results & discussion
Validation of the docking procedure The first step in the docking study was assessing the MOE-Dock program accuracy. This was performed by extracting the co-crystallized ligand from the active site and redocking within the inhibitor's binding cavity of the NSP5 protease. Initially, the RMSD value was found to be 1.86Å (Figure 1), proving that the authors' docking procedure was valid for the studied inhibitors [35] and the MOE-Dock method was thus valid for docking these inhibitors.

Coronavirus M pro & SARS-CoV-2 M pro inhibition
According to a previous study [20], the mechanism of inhibition of boceprevir was determined in the crystal structure of SARS-CoV-2 M pro -boceprevir through studies of this complex at a resolution of 2.1Å. Only one polypeptide was found in the asymmetric array. Nevertheless, two polypeptides, protomers A and B, were shown to be connected by a twofold crystallographic axis of symmetry to form a dimer ( Figure 2A). On the electron density maps, residues 1-306 were visible. There were three domains within the promoter ( Figure 2B). Residues 8-101 formed the first domain (domain I), whereas residues 102-184 formed domain II. Both domains were shown to have an antiparallel β-structure. Residues 201-303 formed domain III and were composed of five α-helices arranged in antiparallel spherical symmetry as well as a spacer region of short length (residues 185-200) that connected domain III to domain II. In addition, 3CLpro has a Cys-His dyad, which was found in the groove between domains I and II. In the proteolysis process, cysteine thiol serves as a nucleophile [20]. Proteolysis is initiated upon NSP5 binding with its peptide-like substrates in the S1 pocket to form a replication-transcription complex [36,37].
Coronavirus utilizes chymotrypsin-like activity as well as papain protease to treat and cleave viral long polyprotein precursor to afford independent functional nonstructural proteins [38]. Proteases of coronavirus species have several conserved domains. The locations of the intended subsite amino acids in the enzyme active site are known as S1, S1 , S2, S3 and S4 ( Figure 3A) [39,40]. In the NSP5 active site region, Cys145, Gly143 and Ser144 contribute the S1 residues, which also serve as oxyanion holes. The S1 group contains His163 residue, whereas Glu166 and Gln189 are located at the S2 position. The S3 subsite is completely exposed to the solvent [36,41]. Finally, the S4 site was made up of amino acids with bulky side-chains i.e. Gln189 and Pro168 ( Figure 3B) [42].

PLIF analysis
As seen in Figure 4, an automatic visualization of the crucial interactions with the residues was constructed through the utilization of PLIF analysis. The fingerprint bits showed several interactions, including ionic and surface interactions and side chain acceptor and backbone acceptor interactions. Nevertheless, the fingerprint bits revealed some important residues, such as Thr26, His41, Ser46, Met49, Leu141, Asn142, Gly143, Cys145, His164, Met165, Glu166, Leu167, Arg188, Gln189, Thr190, Ala191 and Gln192, that contributed to the protein-ligand interaction. Within all complexes, His41 and Cyc145 demonstrated surface and side chain interactions, respectively. In addition, Cys145 was shown to display side chain acceptor, bond donor and backbone acceptor interactions in most of the complexes. Finally, the aromatic ring in His41 caused surface interactions with the ligands.

Structure-based virtual screening
The antiviral compound library was utilized for structure-based virtual screening for SARS-CoV-2 M pro . The process relies on the determination of computationally suitable molecules in the 3D structure and active site of the target protein, after which these compounds are ranked according to their interaction profiles. The interaction profiles were studied using molecular docking that was carried out on NSP5 (PDB ID: 7brp) using MOE 2015. 10.
The binding site of NSP5 contains a conserved Cys145 and His41 catalytic dyad with other hot spot amino acids, such as residues Met165, Glu166, Gln189 and Thr190 [19]. The docking, which is used to filter the prepared database, relies on predicting possible chemical probes with highest binding affinity to those active site residues. To cover all of the active site residues, the ligand (boceprevir) was used as a template. Of the 13,840 compounds docked, the authors selected 353 compounds with substantial affinities that achieved docking scores of less than -12.0. The key amino acids of the binding site are indicated in red (S1 ), green (S1), blue (S2), purple (S3) and yellow (S4).
The PLIF module in MOE was used to analyze the interactions between the active site residues and selected candidate compounds. The results of candidate compound selection revealed hydrogen bond interactions with the catalytic dyad. Therefore, among the 353 compounds, ten hits were selected based on the significance of the interactions with the catalytic dyad and other active site residues. The docking scores indicated that compound ZINC000095542577 showed the highest binding affinity compared with the others (-16.499 kcal/mol) ( Figure 5). The authors observed significant interactions of all ten selected compounds via the Cys145 and His41 catalytic dyad with residues of the S1-S4 subsite pocket (Table 1). To determine the stability of the selected compounds, MD simulation of the selected hits was carried out.

MD simulation
MD simulation is a solid method for studying conformational stability and protein dynamics. Five systems were selected for 100-ns simulation studies: ZINC000169336666, ZINC000049899562, ZINC000095539404, ZINC000028387333 and ZINC000095542577.

Binding free energy
By utilizing the MMPBSA.py method, binding free energy calculations were performed using the default settings. Table 2 shows the total binding free energy values for the final 20 ns of simulations as well as the energy contribution of each component. The results showed that ZINC000028387333 had a higher binding affinity for M pro . Nevertheless, the higher binding affinity for compound ZINC000028387333 resembled the binding affinity for the reference compound. These results are consistent with the molecular docking studies. In fact, all compounds displayed higher affinity for M pro , but the affinity was less than that observed for ZINC000028387333. Electrostatic interactions predominated the total binding energy of compound ZINC000028387333 to the target than van der Waals interactions. However, van der Waals interactions predominated the binding energy for all the remaining complexes as well as boceprevir. The nonpolar solvation energy was similar in all complexes and favorable for binding, but the polar solvation energy was similar in ZINC000049899562, ZINC000169336666 and ZINC000095539404 complexes and dissimilar in ZINC000095542577, ZINC000028387333 and boceprevirmds complexes.

RMSD analysis
RMSD provides information about the overall stability of a protein complex with regard to variance from the initial structure. All selected systems were considerably stable, with an estimated RMSD of about 1.4Å ( Figure 6). Upon estimation of the average variance, ZINC000095542577, ZINC00000049899562, ZINC00016933666 and ZINC000028387333 demonstrated overall RMSDs of 1.472, 1.410, 1.379 and 1.359Å, respectively, with a slight Cα backbone fluctuation of about 5000 ps that was stabilized in the remaining simulations. The study of the Cα backbone RMSD of ZINC000095539404 yielded average RMSDs of 2.1432Å. However, a tendency toward a slight drop was observed from 5000 to 8000 ps and was accompanied by significant variation, which indicated stability of the complexes. At 1000 and approximately 4000 ps, a marginal fluctuation was observed, which stabilized during the simulation. The data showed that all systems had stable internal motion.

RMSF analysis
RMSF of protein residues demonstrated a small amount of fluctuation on the backbone residues when binding the compound to the M pro target. The RMSD values were determined by the least square fitting to the initial structure as a reference frame running over 100 ns trajectory. When no ligand was bound, the RMSF values of the M pro protein residues were compared with the reference (Figure 7). Throughout all systems, the loop region fluctuated to a certain degree, whereas active residue regions remained unchanged during the simulation process.
The results showed that the interaction of all five hits with the target protein led to their stabilization. Binding of the five chosen compounds in the active site domain caused a decrease in fluctuations with respect to protein structure fluctuations on the loop, whereas binding in the active site domain of all chosen compounds did not result in additional fluctuations in the protein structure. In addition, these fluctuations were observed primarily on the third to sixth and eighth helices. It can be assumed that no backbone conformational changes took place in the  M pro during ligand binding. Furthermore, the active binding domain was on the second, third, eighth and ninth helix regions. Notably, a sharp peak was seen most significantly on the close residue from 142 to 164.

RoG & binding analysis of different ligands with M pro
RoG after ligand binding provides information about the folding of the protein structure. Consequently, RoG was evaluated to assess the system's compactness over time. Less compact protein (less folded) with high conformational entropy was seen with higher RoG values, whereas more folded protein with higher structural stability was associated with low RoG values. As shown in Figure 8, the average gyration scores of all systems were between 21.5 and 24Å ( Table 3). The simulation data revealed that all of the systems were compact, suggesting that the systems converged well.  A simulation of hydrogen bond stability was conducted to further validate the stability of the hit compounds. This was performed by calculating all possible hydrogen bond donors and acceptors in the active site domain of the 3CLpro protein. The hydrogen bond formations between the donors and acceptors were subjected to a fixed geometric criterion of ≤3.5Å and a fixed angle of 30 • between the donors and acceptors. Figure 9 shows the      hydrogen bond formation between the hit compounds and the active residues, which were completely preserved during the MD simulation run. As shown in Figure 8, a maximum of 16 hydrogen bonds were observed for ZINC000028387333 with the 3CLpro protein. Compound ZINC000095542577 also preserved 15 hydrogen bonds within the active site domain. In addition, compounds ZINC000049899562, ZINC000095539404 and ZINC000169336666 maintained 14 hydrogen bonds within the active site domain. In addition to the stability of the hydrogen bond formations, compounds ZINC000169336666, ZINC000049899562, ZINC000095539404, ZINC000028387333 and ZINC000095542577 showed smaller RMSDs of around 1.79, 1.73, 1.76, 1.92 and 1.92Å, respectively ( Table 4). The findings demonstrated that the binding of these five compounds to the active amino acids was located predominantly in the loop region, with some of the compounds present in the active pockets. The screening hits for this study resulted in a diversity of chemical compounds.

ADMET study
Pharmacokinetics and toxicity effects play essential roles in choosing lead compounds and designing safe and effective compounds. Therefore, ADMET properties of the selected candidates were analyzed. To determine the compounds' drug-like properties, Lipinski's rule of five [43] and Veber rules [44] were applied (Tables 5 & 6). As depicted in Table 5, BIOVIA Discovery Studio 4.5 was used to determine the ADMET parameters. Furthermore, an ADMET plot was constructed using the calculated AlogP98 versus 2D PSA. The plot indicated the confidence level of the predictions for the HIA and BBB penetration models. As shown in Figure 10, the plot that corresponded to the HIA and BBB penetration models demonstrated two analogous 95% and 99% confidence ellipses. The PSA value of a given compound was inversely proportional to the HIA value of any compound and thus displayed an inverse relationship with cell wall permeability [32]. AlogP98 was shown to be lipophilic. Moreover, since the value is a ratio, it can be used to estimate a compound's hydrophilicity and hydrophobicity. As a result, the hydrogen-bonding characteristics determined by calculating PSA may play a crucial role along with the AlogP98 calculation [30]. The ellipse with 95% confidence indicated a chemical boundary that included compounds with high absorption (90%). By contrast, the ellipse with 99% confidence showed the region of chemical boundary with compounds that had excellent absorption through the cell membrane. The criteria of PSA <140Å2 and AlogP98 <5 should be fulfilled by any compound with optimum cell permeability.
Compound ZINC000029040250 was selected for the ADMET study and showed good HIA (absorption level 0). Compound ZINC000003583397 showed moderate absorption, whereas compounds ZINC000014945752, ZINC000014907365 and ZINC000028387333 had low absorption. The remaining compounds exhibited very low absorption. When investigating aqueous solubility, ZINC000095543045, ZINC000169336666, ZINC000049899562 and ZINC000095542577 showed extremely low aqueous solubility. Compounds ZINC000003583397 and ZINC000014945752 had low aqueous solubility. Com-    , whereas the other compounds showed undefined plasma protein binding (level 4). All selected compounds, with the exception of compounds ZINC000029040250 and ZINC000003583397, were found to be outside of the AlogP98 versus 2D PSA confidence ellipse. All selected compounds, with the exception of ZINC000029040250 and ZINC000003583397, fell outside the ADMET model ellipse filter, suggesting low intestinal absorption and BBB penetration capability. Figure 9 presents a plot of PSA and AlogP for selected compounds.

Conclusion
SARS-CoV-2 M pro is an interesting target for drug discovery. Several diverse compounds have been screened to explore SARS-CoV-2 M pro inhibition utilizing pharmacophores [45], natural and bioactive compounds [46,47] as well as peptidomimetic inhibitors [48]. Based on the crystal structure of SARS-CoV-2 NSP5 protease, the ZINC database, which contains potential antiviral compounds, was virtually screened for potential NSP5 inhibitors. After the initial selection of 353 compounds out of 13,840, ten compounds were chosen using a PLIF approach, and five compounds were finally identified. The interaction between SARS-CoV-2 NSP5 and the selected hits was evaluated using MD simulation. RMSD and RMSF analyses and RoG calculation have all been used to analyze known antiviral drugs [19]. However, in this study, we used the ZINC database to virtually screen anti-NSP5 compounds.
To recapitulate, it can be concluded that compounds ZINC000049899562, ZINC000169336666 and ZINC000095542577 may act as anti-coronavirus agents. Computational studies proved that these compounds have great medicinal and pharmacological actions. Further experimental studies are needed to confirm these findings.

Future perspective
The current study demonstrates the utilization of computer modeling and simulation for the field of drug discovery and development. The current COVID-19 pandemic continues to challenge health and economy. In addition, coronaviruses are well known for their ability to mutate and resist treatments and vaccines. Our study has aided in the identification of several potential anti-COVID-19 hits that can be considered the first step in developing potent antiviral drugs.

Summary points
• SARS coronavirus 2 (SARS-CoV-2) main protease is an interesting target for drug discovery.
• Computer modeling and simulation aided the discovery of potential hits.
• The ZINC antiviral database contains different chemical scaffolds.
• Virtual screening of the ZINC antiviral database yielded several hits with excellent binding interaction with the SARS-CoV-2 main protease. • Molecular docking analysis aided in the identification of the potential hits.
• Molecular dynamics simulations were utilized to further analyze the hits.
• Three compounds with high binding interaction with NSP5 have been discovered.

Financial & competing interests disclosure
Financial support was received from the Institute of Research and Consulting Studies at King Khalid University (27-19-S-2020). The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.