A dose-response model for statistical analysis of chemical genetic interactions in CRISPRi screens

An important application of CRISPR interference (CRISPRi) technology is identifying chemical-genetic interactions (CGIs). Discovery of genes that interact with exposure to antibiotics can yield insights into drug targets and mechanisms of action or resistance. The objective is to identify CRISPRi mutants whose relative abundance is suppressed (or enriched) in the presence of a drug when the target protein is depleted, reflecting synergistic behavior. Different sgRNAs for a given target can induce a wide range of protein depletion and differential effects on growth rate. The effect of sgRNA strength can be partially predicted based on sequence features. However, the actual growth phenotype depends on the sensitivity of cells to depletion of the target protein. For essential genes, sgRNA efficiency can be empirically measured by quantifying effects on growth rate. We observe that the most efficient sgRNAs are not always optimal for detecting synergies with drugs. sgRNA efficiency interacts in a non-linear way with drug sensitivity, producing an effect where the concentration-dependence is maximized for sgRNAs of intermediate strength (and less so for sgRNAs that induce too much or too little target depletion). To capture this interaction, we propose a novel statistical method called CRISPRi-DR (for Dose-Response model) that incorporates both sgRNA efficiencies and drug concentrations in a modified dose-response equation. We use CRISPRi-DR to re-analyze data from a recent CGI experiment in Mycobacterium tuberculosis to identify genes that interact with antibiotics. This approach can be generalized to non-CGI datasets, which we show via a CRISPRi dataset for E. coli growth on different carbon sources. The performance is competitive with the best of several related analytical methods. However, for noisier datasets, some of these methods generate far more significant interactions, likely including many false positives, whereas CRISPRi-DR maintains higher precision, which we observed in both empirical and simulated data.

This produces baseline counts in the range of single digits up to a few thousand, with a mean count in the hundreds (see histogram), which is typical of what is seen in real CRISPRi sequencing datasets.

Sampling sgRNA efficiencies
To simulate the effect of CRISPRi depletion, an efficiency is chosen for each individual sgRNA $i$ in each gene $g$ from a uniform distribution: $E_{gi} \sim U(-25, 0)$. This range was chosen based on the data from the Mtb CRISPRi library in the Li, Poulton (1) paper, where efficiencies were based on empirical estimates of growth rates (fitness defects), fit to a piecewise linear model and extrapolated to predicted LFCs at 25 generations [2]. sgRNA efficiencies empirically span a range of -25 to 0, where 0 represents no depletion and -25 represents high depletion (in induced vs. uninduced).
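As a minimal sketch of this sampling step, the following draws one efficiency per sgRNA from the uniform range described above. The function name, the gene and sgRNA counts, and the fixed seed are illustrative assumptions and are not taken from the original simulation code.

```python
import random


def sample_efficiencies(n_genes, sgrnas_per_gene, low=-25.0, high=0.0, seed=0):
    """Draw one efficiency per sgRNA from Uniform(low, high).

    0 stands for no depletion and -25 for strong depletion, following the
    range described in the text. All argument names are hypothetical.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        [rng.uniform(low, high) for _ in range(sgrnas_per_gene)]
        for _ in range(n_genes)
    ]


# Example: 3 genes with 5 sgRNAs each (sizes chosen only for illustration)
efficiencies = sample_efficiencies(n_genes=3, sgrnas_per_gene=5)
```

Each inner list holds the simulated efficiencies for the sgRNAs of one gene.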

Sampling concentration-dependent slopes
Next, concentration-dependent coefficients (slopes) are chosen for each gene $g$. For interacting genes, slopes are chosen from a Normal distribution around +K2 or -K2, where K2 is a parameter. For non-interacting genes, slopes are chosen from a Normal distribution around 0. The standard deviation in each case is $\sigma_{K_2}$ (sK2). The larger the variance, the greater the risk that the slopes of non-interacting genes overlap with those of interacting genes.
$$\beta_g \sim \begin{cases} N(0, \sigma_{K_2}^2) & \text{if } g \text{ is a non-interacting gene} \\ N(+K_2, \sigma_{K_2}^2) & \text{if } g \text{ is a positive interaction} \\ N(-K_2, \sigma_{K_2}^2) & \text{if } g \text{ is a negative interaction} \end{cases}$$
Using K2=0.8 and sK2=0.2, the histogram below shows the overlapping distributions of slopes for the 3 types of genes.

Simulation of sgRNA counts
Next, mean abundances for each sgRNA at each concentration were generated by sampling from the dose-response equation. First, a linear model was used to estimate a level, which was then transformed to an abundance (indexed by gene, sgRNA, and concentration) by the sigmoid transformation, s, reproducing the modified dose-response equation described in the main text.
The coefficients K0 and K1 were set to 3 and 0.3, respectively, to simulate sensitivity to protein depletion.
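The gene-level slope sampling that feeds this simulation can be sketched as follows. This is only an illustration under stated assumptions: the counts of 50 negative and 50 positive interacting genes follow the ground truth used later in the text, while the number of non-interacting genes (`n_null`), the function name, and the seed are hypothetical.

```python
import random


def sample_slopes(n_negative=50, n_positive=50, n_null=900,
                  k2=0.8, sigma_k2=0.2, seed=0):
    """Return (gene_type, slope) pairs, one per simulated gene.

    Slopes are drawn from Normal(-K2, sK2), Normal(+K2, sK2), or
    Normal(0, sK2) depending on the gene type. n_null is an assumed value.
    """
    rng = random.Random(seed)
    centers = (
        [("negative", -k2)] * n_negative
        + [("positive", +k2)] * n_positive
        + [("non-interacting", 0.0)] * n_null
    )
    # random.gauss takes the mean and the standard deviation (not variance)
    return [(kind, rng.gauss(mu, sigma_k2)) for kind, mu in centers]


slopes = sample_slopes()
mean_negative = sum(s for kind, s in slopes if kind == "negative") / 50
```

With K2=0.8 and sK2=0.2, the three groups are separated by four standard deviations between adjacent centers, so only the tails of the distributions overlap.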

Simulating noise between concentrations
In addition to noise between replicate counts (controlled by Pnb) and noise affecting drug sensitivity (variability of slopes, controlled by sK2), another source of noise was simulated by randomly shifting the abundance at each concentration. The shifts were sampled from a Normal distribution, inflating or deflating the expected counts at a given concentration so that they deviate from a perfect linear trend. This was modeled as a gene-level effect, with the parameter sC controlling the noise between concentrations.

Results
The other parameters are set as follows:
• sgRNA efficiency sensitivity: K0=3, K1=0.3
• concentration dependence: q=2, K2=0.8, sK2=0.2
We generated simulated datasets by varying the 2 noise parameters:
• noise between concentrations: low: sC=0.01, med: sC=0.15, high: sC=0.3
• noise between replicates: low: Pnb=0.9, med: Pnb=0.5, high: Pnb=0.1
By forming all combinations of these parameters, we generated the following 9 scenarios:
• LL (Low noise between concentrations, Low noise within concentrations): sC=0.01, Pnb=0.9
• LM (Low noise between concentrations, Medium noise within concentrations): sC=0.01, Pnb=0.5
• LH (Low noise between concentrations, High noise within concentrations): sC=0.01, Pnb=0.1
• ML (Medium noise between concentrations, Low noise within concentrations): sC=0.15, Pnb=0.9
• MM (Medium noise between concentrations, Medium noise within concentrations): sC=0.15, Pnb=0.5
• MH (Medium noise between concentrations, High noise within concentrations): sC=0.15, Pnb=0.1
• HL (High noise between concentrations, Low noise within concentrations): sC=0.3, Pnb=0.9
• HM (High noise between concentrations, Medium noise within concentrations): sC=0.3, Pnb=0.5
• HH (High noise between concentrations, High noise within concentrations): sC=0.3, Pnb=0.1
Fig 1 illustrates the different effects of simulated noise between concentrations vs. within concentrations (among replicates) in the low-high combination scenarios, for representative sgRNAs in negative interacting genes. The decreasing trend of counts is more variable in the HL and HH scenarios than in the LL and LH scenarios, whereas the dispersion of counts within a concentration is greater in LH and HH. The medium noise scenarios (LM, ML, MM, MH, HM) fall somewhere in between these four levels of dispersion.
We analyzed all 9 scenarios with MAGeCK-MLE, MAGeCK-RRA and CRISPRi-DR. For CRISPRi-DR, the criteria used to identify significant interactions were adjusted P-value < 0.05 and |Zslope| > 2, as described in the main text. For MAGeCK-MLE, the criterion used to identify significant interactions was an adjusted P-value based on the Wald test < 0.05. MAGeCK-RRA was run 3 times independently, once for each drug concentration: 2 µM, 4 µM, 8 µM. Each was compared to the simulated no-drug control (DMSO). A single P-value per gene was calculated from the three analyses using Fisher's method, which was then adjusted using Benjamini-Hochberg. Additionally, a single LFC value per gene was set equal to the most significant LFC across the three concentrations. Significant interactions were classified as combined adjusted P-value < 0.05 and |gene LFC| > 1.
Consistent with observations made with the experimental data, the numbers of genes reported by CRISPRi-DR span a narrower range than the numbers of genes reported by MAGeCK-RRA. In this case, however, we can see that the datasets with the highest noise are the ones with the largest discrepancies between the numbers of hits, for both depleted and enriched genes. Thus, we can infer that the experimental datasets in Fig 5 with higher discrepancy are those with high noise, possibly resulting in a higher number of false positives. By comparison, as seen in the bottom two panels of Fig 2, the discrepancies between CRISPRi-DR and MAGeCK-MLE are high for all datasets, regardless of noise. CRISPRi-DR hits do not exceed the number of simulated interacting genes in either the depleted or the enriched case, whereas MAGeCK-MLE hits nearly always do. Given that there were 50 simulated depleted and 50 simulated enriched genes, the up to 300 depleted or enriched calls made by MAGeCK-MLE include many false positives.
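The nine noise scenarios listed above can be enumerated programmatically. The following is a small sketch; the dictionary names and keys are illustrative assumptions, while the parameter values are the ones given in the text.

```python
from itertools import product

# Noise between concentrations (sC) and between replicates (Pnb),
# keyed by noise level: Low, Medium, High.
SIGMA_C = {"L": 0.01, "M": 0.15, "H": 0.3}
P_NB = {"L": 0.9, "M": 0.5, "H": 0.1}

# First letter: noise between concentrations; second: noise within them.
scenarios = {
    between + within: {"sC": SIGMA_C[between], "Pnb": P_NB[within]}
    for between, within in product("LMH", repeat=2)
}

for name, params in scenarios.items():
    print(name, params)
```

Note that a lower Pnb means more noise between replicates, so the "H" level maps to the smallest Pnb value.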
Table 1 shows detailed confusion matrices of the calls made by the three methods, using the 50 simulated negative and 50 simulated positive interactions (genes 0-49 and genes 50-99) as the ground truth. Correct predictions are on the descending diagonals, and the off-diagonal entries represent errors (either false positives, FP, or false negatives, FN). The center of each matrix holds the correctly identified non-interacting genes. This is the largest cell in all noise scenarios for all three methods, and it is excluded from the recall/precision calculations in order to focus on predictive performance for depleted and enriched genes.
MAGeCK-MLE identifies a large number of interacting genes for all the simulated datasets and as a result has a consistently high recall rate. For example, in the lowest noise scenario (LL), CRISPRi-DR identified 74% of the simulated interacting genes, MAGeCK-RRA identified 56.5%, and MAGeCK-MLE identified 99.9%. However, this also means MAGeCK-MLE makes many calls that are false positives. CRISPRi-DR falsely identifies an average of 1.4 genes, MAGeCK-RRA falsely identifies 3.9 genes, and MAGeCK-MLE falsely identifies nearly 542.
As noise is increased through adjustments of the sC or Pnb parameters, the number of calls made by MAGeCK-RRA increases, raising its recall rate to be similar to that of MAGeCK-MLE. In the HH scenario, the recall rate of MAGeCK-MLE remains quite high at 88.3% and that of MAGeCK-RRA increases to 87.5%. Conversely, with the increased noise, CRISPRi-DR makes fewer calls, resulting in a decreased recall rate of 30.1% in the HH scenario. However, CRISPRi-DR sustains a low false positive rate of 2.2%, whereas the false positive rates of MAGeCK-MLE and MAGeCK-RRA increase substantially (MLE = 42.5%, RRA = 42.1%), diluting the sets of predicted enriched and depleted genes with non-interacting genes. In summary, with increasing noise CRISPRi-DR identifies fewer of the true interacting genes, yet maintains its ability to keep the set of reported interacting genes from being diluted with non-interacting genes.
Table 1. Evaluation of CRISPRi-DR, MAGeCK-RRA, and MAGeCK-MLE performances in the nine noise scenarios. For each of the noise scenarios, CRISPRi-DR, MAGeCK-RRA and MAGeCK-MLE were run. Each confusion matrix shown is the average of the confusion matrices from the 10 runs per noise scenario. The metrics are calculated as follows. For depleted genes, RecallD=TPD/(TPD+FND) and PrecisionD=TPD/(TPD+FPD). For enriched genes, RecallE=TPE/(TPE+FNE) and PrecisionE=TPE/(TPE+FPE). For overall results, depleted and enriched genes are combined as follows: Recall=(TPD+TPE)/(TPD+FND+TPE+FNE) and Precision=(TPD+TPE)/(TPD+FPD+TPE+FPE). Also, Recall=TPR and Precision=1-FPR, where FPR here denotes the fraction of reported hits that are false positives, FP/(TP+FP).
The effect of noise on the results of the three methodologies can also be visualized using a bar chart of true and false positives, as in Fig 3 (calculated from enriched and depleted genes combined). We calculated the number of significant genes while increasing the amount of noise using either the sC or the Pnb parameter (resulting in the 9 noise scenarios). In the left panel, the average number of significant genes was calculated for each method at a specific sC value, across the possible Pnb values. The error bars show the 95% confidence interval of the number of significant genes. For example, the orange bar for sC=0.01 represents the average number of genes found significant by MAGeCK-RRA in the LL, LM, and LH scenarios (where sC=0.01, but Pnb varies). The same is done in the right panel, using Pnb. Noise between concentrations increases as sC is increased, and noise between replicates increases as Pnb is decreased.
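The recall and precision definitions given with Table 1 can be sketched as a small helper. The function name and the example counts below are hypothetical and are not values from Table 1; they only illustrate how the depleted and enriched cells are combined.

```python
def precision_recall(tp_d, fp_d, fn_d, tp_e, fp_e, fn_e):
    """Recall and precision for depleted (D), enriched (E), and combined calls.

    True negatives (correctly identified non-interacting genes) are not
    used, matching the exclusion of the center cell described in the text.
    """
    def ratio(num, den):
        # Undefined when a method makes no calls or there are no true genes
        return num / den if den else float("nan")

    return {
        "recall_D": ratio(tp_d, tp_d + fn_d),
        "precision_D": ratio(tp_d, tp_d + fp_d),
        "recall_E": ratio(tp_e, tp_e + fn_e),
        "precision_E": ratio(tp_e, tp_e + fp_e),
        "recall": ratio(tp_d + tp_e, tp_d + fn_d + tp_e + fn_e),
        "precision": ratio(tp_d + tp_e, tp_d + fp_d + tp_e + fp_e),
    }


# Illustrative counts only (not taken from Table 1)
metrics = precision_recall(tp_d=40, fp_d=2, fn_d=10, tp_e=35, fp_e=3, fn_e=15)
```

Because the matrices in Table 1 are averages over 10 runs, the inputs may be non-integer; the formulas apply unchanged.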
All three methods make a comparable number of true positive calls, regardless of noise parameters. When noise increases for either parameter, the number of false positive calls for all three methods also increases. However, the number of false positive calls is much higher for MAGeCK-RRA and MAGeCK-MLE than for CRISPRi-DR.
In this plot, it is clear how much more MAGeCK-RRA is affected by noise among replicates than by noise between concentrations. The orange bar for Pnb=0.1 represents the results for MAGeCK-RRA in the LH, MH, and HH scenarios, and it shows the highest number of false positives compared to the other Pnb values as well as the sC values, consistent with the observations made in Fig 2. This is likely a result of stochastic fluctuations of counts at individual drug concentrations that are not necessarily supported at other concentrations. This could help explain the poor performance of MAGeCK-RRA on certain datasets that are especially noisy, for which it often generates a large number of hits; our analysis suggests that many of these hits could be false positives. CRISPRi-DR and MAGeCK-MLE are more affected by noise between concentrations than by noise among replicates, since these methods rely on increasing or decreasing trends in abundance that must be (at least somewhat) consistent across concentrations, and thus they are less affected by within-replicate noise than MAGeCK-RRA is.

Effect of Multiple Concentrations on Significant Genes Detected
To evaluate the benefit of profiling a CRISPRi library at multiple concentrations on the performance of CRISPRi-DR and the MAGeCK methods, we adapted the simulation above to compare their precision and recall when using one, two, or three drug concentrations. We re-ran the simulation (10 iterations per number of concentrations), keeping the same concentration range (2-8 µM), and used the high-noise (HH) parameter settings (sC=0.1, Pnb=0.1). In all runs, we always kept the highest concentration (8 µM), along with the no-drug control. First, we evaluated performance using only one concentration (the highest, 8 µM), then the two highest (4 and 8 µM), then all three concentrations (the full range, 2-8 µM). For MAGeCK-RRA, each concentration was compared independently to the no-drug control. A single LFC value per gene was then taken as the most significant LFC across the concentrations, and a single P-value per gene was calculated using Fisher's method and then adjusted with Benjamini-Hochberg. For CRISPRi-DR and MAGeCK-MLE, the no-drug control was treated as the lowest concentration, and it was combined with either the highest concentration, the top two, or all three drug concentrations, which were then used for the regressions (over 2-4 effective concentration points).
As seen in Fig 6, the recall of CRISPRi-DR and MAGeCK-MLE (number of true positives) held constant as concentrations were added, whereas the number of true positives identified by MAGeCK-RRA increased slightly with more concentrations. Adding concentration points caused a marked increase in the number of false positives found by MAGeCK-RRA. These are non-interacting simulated genes that are classified as significant due to high variation in barcode counts in this high-noise setting. MAGeCK-RRA is susceptible to false positives even when evaluating only a single concentration point, and this is amplified as more concentrations are added: each concentration is evaluated independently and the hits (including false positive genes) are combined post-analysis, so errors accumulate and precision drops.
In contrast, CRISPRi-DR and MAGeCK-MLE are more robust with respect to false positive errors, because regression incorporates data from all available concentrations and looks for significant trends. This allows CRISPRi-DR to maintain higher precision, which does not decrease as additional concentration points are added.

Algorithms Summary
The performance of CRISPRi-DR is compared to six other software packages intended for analyzing CRISPRi and CRISPRko libraries.
Below are general overviews of existing methodologies comparable to CRISPRi-DR that are used to assess CRISPR screens. We describe the general math used in the analyses, as well as how we adjusted our datasets to be run using each methodology. Some methods work directly on counts, whereas others require log fold changes. Some of these methods produce results at the sgRNA level, which were combined post hoc to obtain gene-level information. Additionally, some of these methods do not take multiple concentrations into account, so we combined outputs of analyses at different concentrations using Fisher's method of combining P-values:
$$X_g = -2 \sum_{c=1}^{n} \log p_{gc} = -2 \left( \log p_{g,\mathrm{low}} + \log p_{g,\mathrm{med}} + \log p_{g,\mathrm{high}} \right)$$
where n=3 for the 3 concentration levels (low, medium, high) for each gene $g$. Under the null hypothesis, $X_g$ follows a chi-squared distribution with $2n$ degrees of freedom.
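The combination step can be illustrated with a short standard-library sketch of Fisher's method in its standard form, followed by a Benjamini-Hochberg adjustment. The function names are illustrative assumptions, and this is not the code used in the study. The closed-form chi-squared tail used here is valid because the degrees of freedom (2n) are always even.

```python
import math


def fisher_combine(pvalues):
    """Combine independent P-values with Fisher's method.

    The statistic -2 * sum(log p) is chi-squared with 2n degrees of
    freedom under the null; for even degrees of freedom the upper tail is
    exp(-x/2) * sum_{k<n} (x/2)^k / k!.
    """
    n = len(pvalues)
    # Clamp to avoid log(0) when a tool reports a P-value of exactly 0
    half = -sum(math.log(max(p, 1e-300)) for p in pvalues)  # = statistic / 2
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return min(1.0, math.exp(-half) * total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One gene with P-values at low, medium, and high concentration
combined = fisher_combine([0.20, 0.04, 0.01])
```

In use, `fisher_combine` would be applied per gene to the three per-concentration P-values, and `benjamini_hochberg` would then be applied once to the full list of combined P-values.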

DEBRA
To use CRISPRi datasets as inputs to DEBRA, counts of the most efficient sgRNA for each gene were provided. The output of DEBRA for these datasets is log fold changes along with P-values calculated using the Wald test. Since DEBRA does not account for increasing concentration, each concentration (low, medium and high) in a given dataset was run separately, with the corresponding DMSO condition set as the control. The gene-wise P-values from the three concentration outputs were combined using Fisher's method and then adjusted using Benjamini-Hochberg. Genes were ranked by combined P-value and marked significant if the adjusted combined P-value was < 0.05.

CGA-LMM
Command Line Prompt:
> Rscript ./CGA_LMM.R single_sgRNA_per_gene_counts.txt CGA_LMM_out
Designed for hypomorph libraries, CGA-LMM [4] assesses the concentration-dependent variation in mutant abundance (in a library) using slope coefficients derived from linear mixed models. CGA-LMM assumes one set of counts per gene and was not designed to incorporate information from multiple sgRNAs. This method uses a conservative population-based approach, identifying genes as significant only if their slopes are outliers when compared to the general population.
To use CRISPRi data as the input for this method, we used the most efficient sgRNA for each gene. In line with the approach of Dutta, DeJesus (4), significant genes were marked as those with adjusted P-value < 0.5 and |Zrobust| > 3.5. Genes were ranked by P-value.

CRISPhieRmix
R Function Call:
> CRISPhieRmix(log2fc_from_DESeq2, geneIds, negCtrl, nMesh = 100, PLOT = TRUE, VERBOSE = TRUE, mu=-6, pq=0.1, BIMODAL=T)
CRISPhieRmix [5] is a methodology created specifically to address the variable guide efficiency present in CRISPRi and CRISPRa screens. The method takes log fold changes of sgRNA counts between two conditions, typically derived from DESeq2 or edgeR outputs. It fits a hierarchical mixture model to these log fold changes, based on the assumption that genes are represented by a mixture distribution of effective and ineffective guides. CRISPhieRmix then computes False Discovery Rates (FDRs) as the posterior probability that a gene is non-essential. These probabilities are aggregated over all possible mixtures to finalize the FDRs for each gene.
In analyzing our CRISPRi datasets, we utilized the bimodal option available in the method's implementation. Typically, this analysis resulted in one positive and one negative mode, consistent with the option's assumption that genes are a mixture of ineffective, depleting, or enriching guides. Since CRISPhieRmix does not account for increasing concentration, each concentration (low, medium, and high) in each dataset was run separately, with the corresponding DMSO condition set as the control. In the actual counts files, negative control genes were excluded, as they were causing noise in the mixture model estimations. The local FDRs from the analysis of the three concentrations were combined using Fisher's method. Genes were ranked by combined local FDR and marked significant if the combined local FDR was < 0.05.

MAGeCK (Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout)
Command Line Prompt:
Li, Xu (6) designed the Robust Ranking Algorithm (RRA), one of the first algorithms for CRISPRko screens, available to researchers as MAGeCK [7]. The input to the method is raw control and experimental sgRNA counts. These counts are fitted to a Negative Binomial model to assess whether counts vary significantly (similar to DESeq2). The sgRNA-level P-values for each gene are then combined using a modified version of robust rank aggregation (RRA) to evaluate whether a subset of them is enriched, resulting in a list of genes with False Discovery Rates (FDRs) for both positive and negative interactions.
Since MAGeCK does not account for multiple concentrations, each concentration (low, medium, and high) in a given dataset was run separately, with the corresponding DMSO condition set as the control. MAGeCK requires that a separate set of controls be provided, from which it calculates its P-values. We provided "Negative" sgRNA controls in list form to the method for this purpose (using 1750 non-targeting sgRNAs in the Mtb CRISPRi library). For each concentration, we determined a gene's overall P-value based on the lowest FDR, whether positive or negative. We then merged the resulting gene P-values from the three concentrations using Fisher's method and adjusted them using Benjamini-Hochberg. Genes were ranked by combined P-value and marked significant if the adjusted combined P-value was < 0.05. Direct comparisons of the significant genes in the CGI libraries from Li, Poulton (1) found by CRISPRi-DR and MAGeCK-RRA can be seen in Supplemental Table S3, where significant calls made by MAGeCK-RRA have an additional constraint of |LFC| > 1, to be more comparable to the publication's analysis of the data.

DrugZ
Command Line Prompt:
DrugZ [8] is a method for analyzing chemical-genetic interactions in CRISPRi, CRISPRko and CRISPRa libraries treated with drugs. Raw sgRNA counts of the control and experimental conditions are provided as input to the method. The log fold changes of the normalized counts are then calculated, along with guide-level z-scores whose variance is estimated by empirical Bayes. The z-scores at the guide level are summed to get gene-level z-scores, from which P-values can be obtained from a Normal distribution. The output of the method is similar to that of MAGeCK in that it provides statistics for both the suppressive and synergistic interactions of the genes.
In our approach to analyzing CRISPRi datasets with drugZ, sgRNA counts were used directly without modification. Like many other methodologies, drugZ does not simultaneously accommodate multiple concentrations. Thus, each concentration (low, medium, and high) in a given dataset was run separately, with the corresponding DMSO condition set as the control. At each concentration, we determined a gene's overall P-value based on the lowest P-value, whether suppressive or synergistic. We then merged the resulting gene P-values across the three concentrations using Fisher's method and adjusted them using Benjamini-Hochberg. Genes were ranked by combined P-value and marked significant if the adjusted combined P-value was < 0.05.

MAGeCK MLE
Command Line Prompt:
> mageck mle --norm-method control -n MLE_out --genes-var 0 --update-efficiency --count-table subsampled_drug_counts.txt --threads 16 --control-sgrna negatives.txt --design-matrix design_matrix.txt --sgrna-efficiency squashed_sgRNA_info.txt --sgrna-eff-name-column 1 --sgrna-eff-score-column 2
MAGeCK MLE [9], an extension of MAGeCK, estimates gene effects across multiple conditions (i.e. cell lines or drug treatments), while accounting for sgRNA knockout efficiency. Like MAGeCK, the method takes a set of raw sgRNA counts as input, but it also requires a design matrix specifying which counts come from which condition, along with sgRNA efficiencies in the range of 0 to 1. The sgRNA-level counts are fitted to a Negative Binomial GLM with a log link. Maximum likelihood estimation (MLE), fitting the guide counts across all samples, is used to calculate the beta scores. The significance of these beta scores is calculated through the Wald test.
When using MAGeCK MLE with default settings, we found that a maximum of 39 sgRNAs per gene could be processed without triggering "gene too large" errors. Therefore, for genes with more than 39 sgRNAs in our CRISPRi datasets, we randomly selected 39 sgRNAs for analysis. In our design matrices, we treated increasing concentration as a time series variable. Additionally, we included sgRNA efficiencies, expressed as estimated log fold change values normalized to a 0-1 scale, where 1 represents higher sgRNA efficiency. Like MAGeCK, MAGeCK MLE requires a list of control sgRNAs, which we supplied as a list of "Negative" sgRNAs in each CRISPRi dataset. Genes in the MAGeCK MLE analysis results were ranked by Wald P-value, and significant genes were marked as those with Wald FDR < 0.05.
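The two preprocessing steps described above (capping each gene at 39 sgRNAs and rescaling efficiencies to 0-1) might be sketched as follows. Both helpers are hypothetical. In particular, the text does not specify the normalization formula, so min-max scaling is an assumption here, with the most negative predicted log fold change (strongest depletion) mapped to 1.

```python
import random


def subsample_guides(guides, max_guides=39, seed=0):
    """Randomly keep at most max_guides sgRNAs for one gene."""
    guides = list(guides)
    if len(guides) <= max_guides:
        return guides
    # Sampling without replacement; the seed is fixed only for the example
    return random.Random(seed).sample(guides, max_guides)


def normalize_efficiencies(lfcs):
    """Min-max scale predicted LFCs to [0, 1] (assumed formula).

    The most negative LFC (strongest depletion) maps to 1 and the least
    negative maps to 0.
    """
    lo, hi = min(lfcs), max(lfcs)
    if hi == lo:
        return [0.0 for _ in lfcs]  # no spread: nothing to rescale
    return [(hi - x) / (hi - lo) for x in lfcs]


kept = subsample_guides(range(100))
scaled = normalize_efficiencies([-25.0, 0.0, -12.5])
```

The resulting efficiency scores would then be written to the file passed to `--sgrna-efficiency`, with the sgRNA name and score columns matching the column options in the command above.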

Combining Results from Multiple Concentrations using Fisher's Method for P-values
In our study, for methods that were not designed to handle multiple concentrations simultaneously, we ran each method three times for a given CRISPRi drug dataset, once for each concentration level: low, medium, and high. We combined the significance of the genes across the three analysis results using Fisher's method. Fig 7 illustrates the impact of this combined approach. The ROC curves are for RIF with 1 day of pre-depletion, setting the target genes as the 75 conditionally essential genes (adjusted P-value < 0.5 from resampling in Transit) from a previously published TnSeq study of M. tuberculosis H37Rv exposed to sub-MIC concentrations of various antibiotics, including rifampicin [10]. While changes in essentiality due to transposon insertions are not technically the same as fitness defects resulting from CRISPRi depletion, there is substantial overlap between essentiality and vulnerability [2].
Panel A displays ROC curves for all runs of the tested methods, with gene rankings determined by the respective P-values. For CRISPhieRmix, we used "locFDR" to rank the genes, and for MAGeCK and drugZ, which provide statistics for both positive and negative interactions, we selected the minimum P-value for each gene to rank them. In this case, the high concentration runs performed the best for methods that did not account for multiple concentrations simultaneously. CRISPRi-DR performs among the best (black curve, panel A), partly because it utilizes information from all 3 concentrations. However, one cannot consider only the highest concentration in this analysis: the MIC of a drug is sometimes unknown, and the highest concentration might be excessively strong, potentially leading to an overestimation of depletion effects and to false positives compared to the control. In Panel B, the ROC curves after application of Fisher's method closely resembled, but were not identical to, the high concentration curves in Panel A. This indicates that lower concentrations also contribute to a gene's overall significance. Using Fisher's method to combine the significance of the three concentrations provides a more comprehensive representation of the overall significance of gene interactions across different concentration levels. Nonetheless, CRISPRi-DR still has a ROC curve that is competitive with the best of these other analysis methods (Panel B).

Analysis of RIF D1 CRISPRi dataset by CRISPR Methods
CRISPRi-DR performs as well as the best of the other methods in identifying the target genes by significance-based ranking (ROC curves). While the highest AUC (0.866) is achieved by MAGeCK-MLE on the RIF D1 dataset, CRISPRi-DR has a similar AUC of 0.850. However, for the other methods, the numbers of significant genes, both by adjusted P-value < 0.05 per concentration and (where combining was necessary) by Fisher's combined adjusted P-value < 0.05, include a much higher number of false positives. The methods that find nearly all the target genes are CRISPhieRmix, MAGeCK, and MAGeCK MLE, but these methods report a very high number of false positives (thousands of genes that are putatively statistically significant). Although CRISPRi-DR and drugZ are the two methods that report a relatively lower number of total hits, they both detect only ~40 of the 75 target genes as significant, and drugZ reports many more false positives (433 overall). Overall, the incorporation of both sgRNA efficiency and increasing concentration in a model such as CRISPRi-DR, along with the additional Z-score constraint, allows for rankings similar to the other CRISPR analysis techniques while reducing the number of false positives reported. Thus, although CRISPRi-DR identified fewer of the target genes as significant, it marked fewer false positive genes than the other methods and overall had the highest F1-score of the methodologies tested. CRISPRi-DR and CGA-LMM both show higher precision than the other methods. However, CRISPRi-DR has a much higher precision than CGA-LMM, perhaps attributable to its ability to utilize both drug concentration and sgRNA efficiency in the model, whereas CGA-LMM only uses concentration. As seen in Table 5, CRISPRi-DR is the best method based on precision and F1-score for nearly all the datasets. Quite a few methods show high recall in the datasets (usually 100% recall). Precision here is calculated as TP/(FP+TP) and Recall is calculated as TP/(TP+FN). Based on AUC, MAGeCK and MAGeCK-MLE seem to have the highest values. However, Supplemental Table 2 shows that for certain datasets, such as RIF D5, the AUC values are comparably high for most methods.

E. coli CRISPRi Dataset for growth on different carbon sources
Mathis, Otto and Reynolds (11) quantified growth rate changes as a function of varied gene expression levels using CRISPRi through a library of modified single guide RNAs (sgRNAs). The dataset consisted of a library of 5927 sgRNAs targeting 88 genes in Escherichia coli MG1655, whose effects on growth rate were observed on media with different carbon sources. To generate diversity, they incrementally added mutations to sgRNAs in the targeting sequence or position in the gene, and showed a link between the number of mutations and their impact on growth rate.
The authors demonstrated that many gene-environment interactions not detected at maximum knockdown levels are seen at intermediate levels of expression interference. The authors quantified growth rate effects of modified sgRNAs in different carbon sources (glucose and glycerol) under turbidostat growth conditions. Compared to the conventional method of using a single (maximal-efficiency) knockdown per gene, Mathis, Otto and Reynolds (11) found 37% more interacting genes by assessing the differences in the fitted parameters of a logistic fit. This fit included the quantified growth rates and the Hill coefficient.
While this is not technically a chemical-genetics (CGI) experiment, the data included multiple time points, along with sgRNAs designed to span a range of efficiencies, satisfying the requirements for our model. Thus, the abundances in this dataset can be represented through a modified version of Eq. (3) in the main text, in which time replaces log concentration and growth rate replaces sgRNA efficiency. After log-sigmoid transformation, the intercept of the resulting equation folds in the two inflection points: the time that results in 50% depletion and the growth rate that results in 50% depletion.
We therefore fit two linear regressions to the dataset (one for glucose and one for glycerol), analogous to Eq (5) in the main text for fitting sgRNA abundances in chemical-genetic interaction screens.
As seen in

Predic(ng uninduced Abundances from SCV
The dataset generated by Mathis, OVo and Reynolds (11) quanXfied growth rates of the sgRNA mutants but they did not have any measurements equivalent to our uninduced abundances, i.e. abundances of the mutants in the growth mediums without ATC inducXon.However, the uninduced abundances could be esXmated from the induced (no drug) using the SCV (standard coefficient of variaXon).We observed (on other datasets) that genes with greater depleXon due to CRISPR interference had higher noise among their counts (abundance), which could be quanXfied by SCV.Fig 10 demonstrates the correlaXon seen of the SCV of the abundances at all concentraXons is correlated with the uninduced abundances in the simulated highest noise scenario (HH).The more depleted sgRNAs have higher SCVs since lower number of total counts can increase the amount of noise.We observe similar relaXonships in the CRISPRi datasets as well.
The uninduced abundances are calculated from the SCV of the available induced counts, using the baseline abundance calculated for each gene, the counts across all concentrations (including the counts for that gene specifically at concentration 0), and a parameter set to 2 for this dataset. We used this method to estimate uninduced abundances for the E. coli CRISPRi data.

Results
We ran CRISPRi-DR independently for each carbon source, using the SCV method to generate uninduced baseline abundances. In both analyses, a significant number of genes exhibited notable depletion (indicating reduced fitness) or had a time-dependence coefficient q-value of less than 0.05. This may be because many of the 88 genes in this experiment were selected specifically because they are essential for growth (on either carbon source).
The analysis of this dataset demonstrates that the CRISPRi-DR method can be applied to other datasets, including those not explicitly designed for chemical-genetics. The modified Dose-Response model incorporates the simultaneous effects of time and of the variable efficiencies of sgRNAs on mutant abundance.

Minimum Number of sgRNAs per gene needed for CRISPRi-DR
Creating a library of sgRNAs can be expensive and time-consuming. A user may want to know how many sgRNAs per gene are necessary to span a range of predicted efficiencies and reflect genuine interactions with a given treatment. Based on our investigation below, we recommend at least 5 sgRNAs per gene.
We subsampled our existing library of about 96,000 total sgRNAs such that each gene has a maximum of 2, 4, 6, 8, 10, and so on sgRNAs. We re-ranked the genes after running these subsampled libraries through the CRISPRi-DR model. ROC curves with target genes obtained from Xu, DeJesus (10) reveal that 2 sgRNAs per gene are not enough to capture expected interactions, but at least 5 sgRNAs spanning a range of predicted efficiencies are sufficient.
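The area under an ROC curve for a gene ranking can be computed directly from the ranking and a set of known targets. The sketch below is one standard way to do this, using the rank-sum formulation. It is not taken from the CRISPRi-DR code, it assumes a ranking without ties, and the function and argument names are hypothetical.

```python
def roc_auc(ranked_genes, known_targets):
    """AUC of a ranking: the fraction of (target, non-target) pairs in which
    the target is ranked above the non-target.

    ranked_genes: genes ordered from most to least significant.
    known_targets: collection of genes treated as true positives.
    """
    targets = set(known_targets)
    n_pos = sum(1 for g in ranked_genes if g in targets)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one target and one non-target in the ranking")
    negatives_seen = 0
    misordered_pairs = 0  # pairs where a non-target outranks a target
    for g in ranked_genes:
        if g in targets:
            misordered_pairs += negatives_seen
        else:
            negatives_seen += 1
    return 1.0 - misordered_pairs / (n_pos * n_neg)
```

An AUC near 1 means the known targets are concentrated at the top of the ranking, while an AUC near 0.5 is what a random ordering would give.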

Rankings of Select Genes with Reduced Sets of sgRNAs
In a library treated with isoniazid at 1 day pre-depletion, we sampled all the genes to have a maximum of 2-20 sgRNAs, incrementing at intervals of 2, and ran the sampled libraries through the CRISPRi-DR model. We repeated this sampling 10 times at every increment.
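The sampling scheme can be sketched as follows. This is an illustrative outline only: it assumes sgRNAs are drawn uniformly at random within each gene, which the text does not specify, and the data structure (a mapping from sgRNA identifier to gene) and function names are hypothetical.

```python
import random
from collections import defaultdict

def subsample_library(sgrna_to_gene, max_per_gene, rng):
    """Keep at most max_per_gene sgRNAs per gene, chosen uniformly at random (assumption)."""
    by_gene = defaultdict(list)
    for sgrna, gene in sgrna_to_gene.items():
        by_gene[gene].append(sgrna)
    kept = []
    for gene in sorted(by_gene):
        guides = sorted(by_gene[gene])
        if len(guides) > max_per_gene:
            guides = rng.sample(guides, max_per_gene)
        kept.extend(guides)
    return kept

def subsampling_plan(sgrna_to_gene, levels=range(2, 21, 2), repeats=10, seed=0):
    """Build `repeats` independent subsampled libraries for each sampling level.

    The default levels (2, 4, ..., 20) and 10 repeats follow the text; the seed
    is only there to make the sketch reproducible.
    """
    rng = random.Random(seed)
    return {
        level: [subsample_library(sgrna_to_gene, level, rng) for _ in range(repeats)]
        for level in levels
    }
```

Each subsampled library would then be analyzed separately, giving 10 sets of gene rankings per sampling level.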
Panel A of Fig 12 shows inhA (enoyl-ACP reductase, in the mycolic acid pathway), an essential gene that is the target of INH [12], and nadA, an enriched gene in this dataset (with all sgRNAs). The figure depicts changes in the ranking of these two genes as the number of sampled sgRNAs is increased. Depletion ranking in this context is defined as the order of genes by increasing concentration-dependence slope, and enrichment ranking as the order of genes by decreasing concentration-dependence slope. With all the sgRNAs, the depletion ranking of inhA is #12 and the enrichment ranking of nadA is #21. The shaded region surrounding each line is the standard deviation of the rankings across the 10 iterations performed at a particular sampling level. At the left end of the plot, with a low number of sgRNAs sampled per gene, the standard deviations are high, at 30.2 for inhA and 134.6 for nadA. These standard deviations decrease substantially, and the rankings for both genes start to converge to their true rankings (based on all sgRNAs), at around the 5 sgRNA sampling level.
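The two ranking definitions and the spread across iterations can be written compactly. In this sketch, the slope tables are hypothetical dictionaries mapping gene names to concentration-dependence slopes, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics

def depletion_rank(slopes, gene):
    """1-based rank of `gene` when genes are ordered by increasing slope."""
    ordered = sorted(slopes, key=lambda g: slopes[g])
    return ordered.index(gene) + 1

def enrichment_rank(slopes, gene):
    """1-based rank of `gene` when genes are ordered by decreasing slope."""
    ordered = sorted(slopes, key=lambda g: slopes[g], reverse=True)
    return ordered.index(gene) + 1

def rank_spread(slope_tables, gene, rank_fn):
    """Mean and sample standard deviation of a gene's rank across iterations.

    slope_tables: one {gene: slope} dictionary per subsampling iteration.
    """
    ranks = [rank_fn(slopes, gene) for slopes in slope_tables]
    return statistics.mean(ranks), statistics.stdev(ranks)
```

Applied to the 10 iterations at a given sampling level, `rank_spread` yields the center line and the width of the shaded band for one gene.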
We observed a similar phenomenon in Panel B of Fig 12, with rpoC and eccD3, two genes that interact with rifampicin with 1 day pre-depletion. The rankings of these genes are more variable than those in the isoniazid library depicted in Panel A, with a standard deviation of 283.8 for eccD3 and 202.6 for rpoC at the 2 sgRNA sampling level. However, the rankings of these genes also converge at the 5 sgRNA subsampling level.
In the corresponding ROC curves, the shaded regions surrounding the lines are the standard deviation of the true positive rate at each false positive rate. In both panels, the worst-performing sampling level is 2 sgRNAs, with the lowest AUC across the 10 runs. Starting at 4 sgRNAs, the performance of the library remains constant.
Thus, we conclude that at least 5 sgRNAs (of diverse efficiency) should be included for each gene when designing a CRISPRi library, to ensure adequate regression fits with CRISPRi-DR.

Fig 1. Abundances of select sgRNA(s) in the low and high noise scenarios. For each noise scenario, the counts of a representative sgRNA from simulated negative interacting genes are seen here with the means across replicates marked by the horizontal black line. The medium noise scenarios are somewhere in between the levels of dispersion depicted.

Fig 2. Comparisons of the numbers of significantly depleted and enriched genes reported by CRISPRi-DR, MAGeCK and MAGeCK-MLE. The top two panels are comparisons of CRISPRi-DR to MAGeCK and the bottom panels are comparisons to MAGeCK-MLE. The left panels are comparisons of depleted genes, and the right panels are comparisons of enriched genes. The number of hits (both enriched and depleted) is greater in MAGeCK-MLE than in the CRISPRi-DR model and in some cases MAGeCK-RRA, with the greatest difference coming from HH noise datasets.

Fig 3. Bar chart of the average true positive and false positive calls for CRISPRi-DR and MAGeCK-MLE as noise parameters are adjusted to increase noise. The horizontal dashed line in both panels is the number of total simulated interacting genes (100 total). sC is increased to increase the noise between concentrations and Pnb is decreased to increase noise between replicates of concentrations. The leftmost bars of the plot are the lowest noise, and the rightmost bars are the highest noise.

Fig 4. Distributions of slopes sampled for simulated non-interacting and interacting genes at multiple K2 values. The leftmost panel shows the highest amount of overlap of the distributions of simulated slopes, when K2 is low, and the rightmost panel shows the lowest amount of overlap, when K2 is high. The middle panel strikes a balance between these two.

Figure 5 shows the true and false positive calls made by CRISPRi-DR, MAGeCK-MLE and MAGeCK-RRA within the lowest noise scenario (LL) for a range of K2 values, from 0.4 to 1.2. As expected, all three methods show increases in true positives and decreases in false positives as K2 increases. Regardless of K2 value, MAGeCK-MLE makes many more calls than CRISPRi-DR and MAGeCK-RRA, and thus makes the highest number of true positive calls across the range of K2 values but also the highest number of false positive calls. Comparatively, the number of false positive calls made by MAGeCK-RRA and CRISPRi-DR is low regardless of K2 value.

Fig 5. Calls made by CRISPRi-DR and MAGeCK-MLE at different K2 values. CRISPRi-DR and MAGeCK-MLE were both run using the LL scenario for 3 total concentrations at a range of K2 values. Each K2 value was run 10 times.

Fig 6. Trends of select sgRNAs in the various noise scenarios. For each of the sgRNAs seen here, the blue dots are abundances for the three simulated replicates at each concentration. The gray line shows the change in abundance mean as concentration increases.

Fig 7. ROC Curves of RIF in 1 day pre-depletion using known target genes from TnSeq screens as true positives. A) ROC curves per run of a tested methodology. Rankings are based on gene P-values calculated by the methodology, or with minimal post-processing. B) One ROC curve per method, based on the ranking of final P-values after combining multiple concentrations for methods that do not account for them.

Fig 9. Time vs. Abundance Curves for fbaA for growth in glucose and glycerol. A) Relative abundances in fbaA versus time in glucose show sigmoidal curves that can be linearized, revealing a strong depletion of fbaA for growth in glucose. B) Relative abundances in fbaA versus time in glycerol do not show such obvious sigmoidal curves that can be linearized, revealing almost an enrichment of fbaA for growth in glycerol.

Fig 10. Correlation of SCV across concentrations vs. uninduced abundances in the simulated HH scenario. The points are linearly correlated, with higher depletion resulting in higher SCV.

Fig 11. Coefficients of Time-Dependence in CRISPRi-DR Analyses of Glucose and Glycerol Data. A) Z-score distribution of the coefficients. B) Correlation plot of the coefficients of the genes. The solid diagonal line is y=x. The fuchsia labeled points closest to the line are genes involved in both gluconeogenesis and glycolysis. The points farther away from this line, the orange labeled points (pfkA and fbaA), are genes involved in glycolysis but not gluconeogenesis; they have more negative coefficients in glucose than in glycerol.

Fig 12. The rankings for select genes based on maximum number of sgRNAs sampled per gene. (A) Select genes from INH 1 day pre-depletion. inhA is significantly depleted and nadA is significantly enriched in the presence of isoniazid. (B) Select genes from RIF 1 day pre-depletion. rpoC is highly depleted and eccD3 is highly enriched in the presence of rifampicin. Each sampled sgRNA library is run through the CRISPRi-DR model 10 times. The shaded regions represent the standard deviations of rankings over 10 iterations at every sgRNA sampling value. In both panels, gene rankings converge at about 5 sgRNAs.

Fig 14. Standard deviations of r² for fitting of select genes based on maximum number of sgRNAs sampled per gene. (A) Select genes from INH 1 day pre-depletion. (B) Select genes from RIF 1 day pre-depletion. Each sampled sgRNA library is run through the CRISPRi-DR model 10 times and the r² values extracted for the select genes. The standard deviations of these r² values are obtained for each sampling level and plotted here. In both panels, the standard deviation starts to converge at about 5 sgRNAs.

Table 5. Assessment of Best CRISPR Method for the EMB, INH, LEVO, VAN and RIF CRISPRi screens at 1, 5 and 10 days of pre-depletion, based on recall, precision, AUC and F1-score.
For the methods that require individual dosage runs, the results were combined across concentrations using Fisher's method and then assessed. CRISPRi-DR is the best method by F1-score for most of the drug-treated datasets.

As depicted in Fig 11A, most of the coefficients of time-dependence have a Z-score between -1 and 1. While the two curves are similar, they are not identical. The deviation in the distribution curves around a Z-score of 1 can be traced back to two genes labeled in orange in Fig 11B (fbaA and pfkA). In Fig 11B, we observe a strong correlation in the coefficients of time dependence of most genes; they align closely with the y=x line. These time-parameter coefficients (one for glucose and one for glycerol) are analogous to the coefficients of concentration dependence in typical CRISPRi-DR outputs, which reflect the interaction of a gene with the chemical. The glycolytic genes, highlighted in fuchsia, are necessary for growth on both carbon sources. These genes are closest to the y=x line, showing similar fitness changes over time in both carbon sources, as expected. As mentioned previously, there are two notable genes (highlighted in orange): fbaA (fructose bisphosphate aldolase) and pfkA (phosphofructokinase). These genes, identified in the Mathis, Otto and Reynolds (11) analysis, are well-known examples of genes involved in glycolysis but not in the incorporation of glycerol, and accordingly have more negative coefficients in the glucose analysis than in the glycerol analysis.
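For the Table 5 assessment, per-concentration results were combined with Fisher's method. As a reminder of what that combination does, the following is a minimal standard-library sketch, not the code used in the analysis. The function name is hypothetical. It uses the closed-form survival function of a chi-squared distribution with an even number of degrees of freedom, which is exactly the case that arises in Fisher's method.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-squared distribution with
    2k degrees of freedom under the null, where k is the number of p-values.
    For even degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)
```

With a single p-value the function returns that p-value unchanged, and several moderately small p-values for the same gene at different concentrations combine into a smaller overall p-value.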