Critical evaluation of the reliability of DNA methylation probes on the Illumina MethylationEPIC BeadChip microarrays

DNA methylation (DNAm) plays a crucial role in a number of complex diseases. However, the reliability of DNAm levels measured using Illumina arrays varies across different probes. Previous research primarily assessed probe reliability by comparing duplicate samples between the 450k-450k or 450k-EPIC platforms, with limited investigations on Illumina EPIC arrays. We conducted a comprehensive assessment of the EPIC array probe reliability using 138 duplicated blood DNAm samples generated by the Alzheimer’s Disease Neuroimaging Initiative study. We introduced a novel statistical measure, the modified intraclass correlation, to better account for the disagreement in duplicate measurements. We observed higher reliability in probes with average methylation beta values of 0.2 to 0.8, and lower reliability in type I probes or those within the promoter and CpG island regions. Importantly, we found that probe reliability has significant implications in the analyses of Epigenome-wide Association Studies (EWAS). Higher reliability is associated with more consistent effect sizes in different studies, the identification of differentially methylated regions (DMRs) and methylation quantitative trait loci (mQTLs), and significant correlations with downstream gene expression. Moreover, blood DNAm measurements obtained from probes with higher reliability are more likely to show concordance with brain DNAm measurements. Our findings, which provide crucial reliability information for probes on the EPIC array, will serve as a valuable resource for future DNAm studies.


INTRODUCTION
DNA methylation (DNAm) is a widely studied epigenetic mechanism characterized by the addition or removal of a methyl group at the 5th position of cytosine [1]. Alterations in DNAm levels have been implicated in many diseases, such as Alzheimer's disease [2][3][4][5][6]. Methylated DNA is relatively stable and can be easily detected; thus, it is a viable source of biomarkers [7]. Although whole-genome bisulfite sequencing and long-read platforms are still too costly for large-scale epidemiological studies, array-based technologies offer a cost-effective and comprehensive approach to measure DNAm profiles on a genome-wide scale. The Illumina Infinium HumanMethylation450 BeadChip and its updated version, the Infinium MethylationEPIC BeadChip, provide probes that target over 485,000 and 850,000 CpG sites per sample, respectively [8,9]. In recent years, several studies have examined the reliability (i.e., reproducibility) of DNAm levels from the same DNA samples measured twice using Illumina arrays, and found that reliability varies across probes [10][11][12][13][14][15]. However, most previous studies have compared duplicate samples between the 450k-450k or 450k-EPIC platforms, and there is a lack of studies on EPIC-EPIC comparisons (Supplementary Table 1).
To address these critical gaps and assess the reliability of blood DNAm levels measured using EPIC arrays, we conducted an analysis using 138 duplicated blood DNAm samples from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study [16]. We compared the magnitudes and patterns of reliability observed in the EPIC-EPIC comparison with findings from previous studies. Our study aimed to provide valuable insights into the reliability of blood DNAm levels measured by EPIC arrays, where such information is lacking. To improve the measurement of probe reliability, we created a new statistic: the modified intraclass correlation coefficient (mICC). We then evaluated the impact of probe reliability on epigenome-wide association studies (EWAS). Higher reliability of methylation levels increases the likelihood of reproducible findings, which are essential for the development of biomarkers or identifying actionable targets. Therefore, our study provides a valuable resource for future DNA methylation studies.

Study dataset
We analyzed a subset of whole-blood DNAm samples generated by the Alzheimer's Disease Neuroimaging Initiative (ADNI) study [16], in which the same blood samples were measured twice (technical replicates). To create a dataset of samples from independent subjects, we selected the initial visit data for each subject from the longitudinal ADNI study. Our study included 138 samples measured on 69 independent subjects aged 65-94 years during their initial visits. To avoid confounding batch effects (see also Discussion), only duplicates measured on different methylation plates were analyzed. The ADNI study datasets can be accessed at adni.loni.usc.edu.

Preprocessing of DNA methylation data
DNA methylation was measured using the Illumina HumanMethylation EPIC BeadChip, which includes more than 850,000 CpGs. We preprocessed the DNAm data using the SeSAMe 2 pipeline described by Welsh et al. (2023), which was found to perform the best and produced the largest percentage of reliable CpG probes in a recent comparison of various preprocessing and normalization pipelines [17].
Supplementary Table 2 shows the number of CpGs at each pre-processing step.
First, we removed CpGs that overlapped with single nucleotide polymorphisms (SNPs), non-CpG probes, cross-reactive probes [18], and probes located on X or Y chromosomes. Samples and probes were further filtered using the iterative Greedy-cut algorithm (with a p-value threshold of 0.01) in the RnBeads R package, which iteratively removes the probe or sample with the highest fraction of unreliable measurements one at a time [19]. Next, we removed additional probes that had missing values in more than 5% of samples or were masked by the pOOBAH (P-value with out-of-band array hybridization) algorithm in the SeSAMe R package in more than 20% of samples. Finally, we performed a noob (normal-exponential using out-of-band probes) background correction and nonlinear dye-bias correction [20]. These analyses were performed using the RnBeads and SeSAMe R packages.
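The two per-probe thresholds in the last filtering step can be illustrated with a minimal sketch. This is not the RnBeads or SeSAMe implementation; the function name `keep_probe` and the input layout (one beta value and one mask flag per sample) are assumptions made only for illustration.

```python
def keep_probe(betas, masked, max_missing=0.05, max_masked=0.20):
    """Return True if a probe passes both filters described in the text.

    betas  : beta values for one probe across samples, None when missing
             (hypothetical input layout).
    masked : booleans, True when the detection-p-value step masked the
             measurement in that sample (hypothetical input layout).
    """
    n = len(betas)
    missing_fraction = sum(b is None for b in betas) / n
    masked_fraction = sum(masked) / n
    # A probe is removed only when a fraction is strictly above its threshold.
    return missing_fraction <= max_missing and masked_fraction <= max_masked


# Example with 20 samples: one missing value (5%) and four masked values (20%).
betas = [0.5] * 19 + [None]
masked = [True] * 4 + [False] * 16
print(keep_probe(betas, masked))
```

The strict "more than" wording in the text is the reason the comparison keeps probes that sit exactly on a threshold.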

Estimation of probe reliability
To estimate the reliability of CpG probes, we computed intraclass correlations (ICCs) for each probe based on methylation beta values, which were measured in duplicates of blood samples collected from the same subject and at the same visit. The ICC is defined as $\sigma_b^2 / (\sigma_b^2 + \sigma_w^2)$, where $\sigma_b^2$ is the between-subject variance and $\sigma_w^2$ is the within-subject variance. As recommended by Koo and Li (2016) [21], ICC values were computed using a two-way random effect, absolute agreement, and single-rating model, as implemented in the icc() function of the irr R package.
As discussed below, probes with a large between-subject variance could yield high ICC values, even in the presence of conspicuous differences between duplicate measurements. To explicitly account for the amount of disagreement in duplicate measurements, we proposed to assess reliability using a modified ICC, which is defined as the ICC minus the half-width of the 95% confidence Limits of Agreement (HoLA). The HoLA was calculated as $1.96\,\sigma_d$, where $\sigma_d$ is the standard deviation of the differences between two duplicate measures.
Smaller values of HoLA indicate greater agreement between methylation levels in duplicate samples, whereas larger values of HoLA indicate more disagreement.
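The quantities defined above can be sketched for a single probe in plain Python. This is an illustration rather than the irr implementation used in the study: the ICC below uses the standard mean-squares formula for a two-way random-effects, absolute-agreement, single-rating model, and the use of the sample standard deviation for the differences is an assumption. All function names are hypothetical.

```python
from statistics import mean, stdev


def icc_absolute_agreement(pairs):
    """ICC for duplicate measurements (two-way random effects,
    absolute agreement, single rating), from ANOVA mean squares."""
    n, k = len(pairs), 2
    grand = mean(v for pair in pairs for v in pair)
    row_means = [mean(pair) for pair in pairs]
    col_means = [mean(pair[j] for pair in pairs) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for pair in pairs for v in pair)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                 # between-subject mean square
    msc = ss_cols / (k - 1)                 # between-replicate mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def hola(pairs):
    """Half-width of the 95% limits of agreement: 1.96 x SD of differences."""
    return 1.96 * stdev(a - b for a, b in pairs)


def modified_icc(pairs):
    """Modified ICC: the ICC penalized by HoLA."""
    return icc_absolute_agreement(pairs) - hola(pairs)


# Toy beta values for four subjects measured twice (made-up numbers).
pairs = [(0.10, 0.20), (0.50, 0.40), (0.90, 0.80), (0.30, 0.35)]
print(icc_absolute_agreement(pairs), hola(pairs), modified_icc(pairs))
```

Because HoLA is never negative, the modified ICC can only be equal to or lower than the ICC for the same probe.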

Comparison of reliability of probes with different characteristics
To compare the reliability of probes with different characteristics (e.g., type I probes vs. type II probes), we performed mixed-effects model analyses using the lmerTest R package [22]. For each comparison, we fitted a mixed-effects model with probe reliability as the outcome variable and probe characteristics (e.g., type I vs. type II probes) as the fixed effect variable. To account for correlations in the probes on the EPIC array, we additionally included random effects for chromosomes, genes, and co-methylated clusters. The co-methylated clusters of probes were identified using the coMethDMR R package [23] with methylation beta values as input. As both fixed and random effects are included, these models fall into the general class of linear mixed-effects models. By including random effects, the mixed-effects model acknowledges that the observations (i.e., probes) within the same random effect (i.e., chromosomes, genes, or co-methylated clusters) are more similar to each other than to observations from different groups or clusters. This allows for a more accurate estimation of the fixed effects, while properly accounting for the correlated structure of the data. We also assessed the relationship between reliability and the mean and standard deviation of the methylation beta values. The means of beta values were computed using all samples, and the standard deviations of beta values were computed after randomly selecting one sample from two duplicate samples.
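The last step, computing the mean from all samples but the standard deviation from one randomly chosen replicate per subject, can be sketched as follows. The function name, the fixed seed, and the use of the sample standard deviation are assumptions for illustration only.

```python
import random
from statistics import mean, stdev


def probe_mean_and_sd(pairs, seed=0):
    """Mean over all samples; SD over one randomly chosen replicate per subject."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    all_values = [v for pair in pairs for v in pair]
    one_per_subject = [rng.choice(pair) for pair in pairs]
    return mean(all_values), stdev(one_per_subject)


# Toy duplicate beta values for three subjects (made-up numbers).
print(probe_mean_and_sd([(0.21, 0.19), (0.42, 0.40), (0.58, 0.61)]))
```

Selecting one replicate per subject keeps the standard deviation a between-subject quantity, since it is not mixed with the differences between duplicates.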

Evaluating the impact of probe reliability on mQTL analysis, DNAm-to-gene expression correlations, and surrogate variables
We searched mQTLs for CpG probes using the GoDMC database [24], which was downloaded from http://mqtldb.godmc.org.uk/downloads. To select significant blood mQTLs in GoDMC, we used the same criteria as in the original study [25], that is, considering a cis P-value smaller than $10^{-8}$ and a trans P-value smaller than $10^{-14}$ as significant. For DNAm-to-mRNA association analysis, we analyzed matched gene expression (Affymetrix Human Genome U219 array) and DNA methylation (EPIC array) data from 263 independent subjects in the ADNI study. In this analysis, we examined probes located in promoter regions (within ±2 kb of the transcription start site; TSS) and distal regions (more than 2 kb from the TSS) separately.
Specifically, for CpGs located in the promoter region, we computed Spearman correlations between CpG methylation and expression levels of the target genes. On the other hand, for CpGs in distal regions, we computed the Spearman correlations between CpG methylation and expression levels of ten genes upstream and downstream, following the approach used in previous integrative DNAm-to-gene expression analyses for probes in distal regions [26,27]. Subsequently, we selected the most significant P-value for each probe and considered P-values less than $1 \times 10^{-5}$ to be statistically significant, as in several previous analyses of DNA methylation in blood samples [5,16,28]. To evaluate the reliability of the estimated cell-type proportions, we computed major immune cell type proportions in the blood, including B lymphocytes, natural killer cells, CD4+ T lymphocytes, CD8+ T lymphocytes, monocytes, neutrophils, and eosinophils using the EpiDISH R package [29].
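The decision rules in this section (mQTL significance thresholds, the promoter/distal split, and the choice of the most significant expression P-value per probe) can be written out as a small sketch. The function and constant names are hypothetical, and treating a distance of exactly 2 kb as promoter is an assumption.

```python
CIS_P_THRESHOLD = 1e-8          # cis mQTL threshold stated in the text
TRANS_P_THRESHOLD = 1e-14       # trans mQTL threshold stated in the text
PROMOTER_WINDOW_BP = 2000       # within 2 kb of the TSS
EXPRESSION_P_THRESHOLD = 1e-5   # DNAm-to-expression threshold stated in the text


def is_significant_mqtl(p_value, is_cis):
    """Apply the cis or trans threshold to one mQTL association."""
    return p_value < (CIS_P_THRESHOLD if is_cis else TRANS_P_THRESHOLD)


def probe_region(distance_to_tss_bp):
    """Classify a probe by its signed distance to the TSS, in base pairs."""
    return "promoter" if abs(distance_to_tss_bp) <= PROMOTER_WINDOW_BP else "distal"


def best_expression_hit(p_values):
    """Keep the most significant P-value for a probe and flag significance."""
    best = min(p_values)
    return best, best < EXPRESSION_P_THRESHOLD


print(is_significant_mqtl(1e-9, is_cis=True), probe_region(-1500),
      best_expression_hit([0.2, 3e-6, 0.01]))
```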

The probes on EPIC arrays have higher reliability than those in 450k arrays
The  [10] were 0.703, 0.724, and 0.729, respectively. Consistent with previous studies [14,15], we observed similar ICC estimates when using either methylation M-values or methylation beta values (data not shown).
The ICCs for probes in our EPIC-EPIC comparison ranged from -0.362 to 0.999, with a mean of 0.381 and a median of 0.325 (Figure 1). The ICC values are generally interpreted as follows: < 0.4 (Poor), 0.4-0.6 (Fair), 0.6-0.75 (Good), and > 0.75 (Excellent) [30]. We found that these EPIC-EPIC comparisons resulted in higher ICCs than those of the 450k-EPIC comparisons in Sugden et al. (2020) [15], where the estimated ICCs for the 450k-EPIC comparison ranged from -0.28 to 1.00, with a mean of 0.21 and a median of 0.09 (P-value < 0.001, Supplementary Figure 1). A comparison of the 333,588 common probes in both studies showed that a larger number of probes achieved good reliability (n = 38,528) and excellent reliability (n = 64,141) in our EPIC-EPIC comparison study compared to the 450k-EPIC comparison in Sugden et al.
In contrast, the ICCs in the EPIC-EPIC comparison were only slightly higher than those in the 450k-450k comparison estimated by Bose et al. (2014), which ranged from 0 to 0.998, with a mean of 0.366 and a median of 0.296 [10] (Supplementary Figure 2). When comparing the 333,600 common probes in both studies, we found a larger number of probes with good or excellent reliability (good probes: n = probes had the same classification in both the studies (Supplementary Table 4).
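The interpretation bands quoted above can be expressed as a small helper. How values exactly at 0.4, 0.6, and 0.75 are assigned is an assumption here: the sketch treats "good or excellent" as strictly greater than 0.6, and the function name is hypothetical.

```python
def reliability_class(icc):
    """Map an ICC (or modified ICC) value to the four interpretation bands."""
    if icc < 0.4:
        return "poor"
    if icc <= 0.6:
        return "fair"
    if icc <= 0.75:
        return "good"
    return "excellent"


# The median and mean ICC reported for the EPIC-EPIC comparison.
print(reliability_class(0.325), reliability_class(0.381))
```

Under these bands, both the median (0.325) and the mean (0.381) of the EPIC-EPIC ICCs fall in the poor category.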

The distribution of ICC in the EPIC-EPIC comparison shows similar patterns to those obtained in the 450k-EPIC and 450k-450k comparisons
We found that the distribution of ICC values in our EPIC-EPIC comparison showed patterns similar to those of previous comparisons [10][11][12][13]15]. Specifically, as indicated in the formula for defining ICC, we observed that ICC values increased with the between-subject variance $\sigma_b^2$ of DNA methylation levels (Figure 2). The highest ICC values were observed for probes with average methylation values ranging between 0.2 and 0.8 (Supplementary Figure 3). Overall, the ICC values were lower for probes located in the CpG island and TSS200 regions (Supplementary Figures 4-5). Type II probes on the EPIC array, which use a single sequence per CpG site to measure DNAm levels, showed significantly higher ICC values than type I probes, which use two separate sequences per CpG site (P-value < $2.2 \times 10^{-16}$, Supplementary Figure 6). Given that a substantial proportion of type I probes are located in CpG-rich regions, such as CpG islands and promoter regions [9,31], this result is consistent with the observed lower ICC values at CpG islands and promoter regions.

The modified ICC explicitly accounts for agreement between duplicate DNAm measurements
As noted previously [14,32], a challenge when assessing the agreement of methylation levels in duplicate samples is that the ICC is dependent on between-subject variance. Because ICC is defined by $\sigma_b^2 / (\sigma_b^2 + \sigma_w^2)$, its values can be strongly influenced by the between-subject variance $\sigma_b^2$ of the methylation levels.
Figure 3 shows a correlation plot for cg14089881, which had an ICC value of 0.8099, indicating excellent reliability. For this probe, the estimated limits of agreement [33], or the 95% confidence interval for differences in methylation beta values in the duplicates, were (-0.233, 0.272), indicating considerable differences between the duplicate measurements. Note that the standard deviation (SD) for this CpG was 0.2116, which ranked 1166th (0.18%) among the 640,960 probes. Therefore, some probes, such as cg14089881, can achieve high ICC values because of their relatively high between-subject variance, $\sigma_b^2$, even when there is substantial disagreement in duplicate measurements.
Therefore, we propose a modification to ICC by incorporating the Half-width of the 95% confidence Limits of Agreement (HoLA). We define the modified ICC as the ICC minus HoLA, which penalizes the ICC by the amount of disagreement in duplicate measurements for the probe. Note that the modified ICC prioritizes probes with a large ICC and small HoLA. On the other hand, probes with a large ICC and substantial disagreement in duplicate measurements will have a reduced modified ICC compared with the ICC. For example, the modified ICC for cg14089881 mentioned above was 0.5579, indicating only fair reliability for this probe after correction for disagreement based on HoLA.
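As a worked check of the cg14089881 example, the reported values can be combined directly. The ICC (0.8099) and HoLA (0.2521, from the Figure 3 caption) are taken from the text; the symbol $\sigma_d$ for the standard deviation of the differences between duplicates is notation chosen here.

```latex
\begin{align*}
\text{HoLA} &= 1.96\,\sigma_d = 0.2521, \\
\text{mICC} &= \text{ICC} - \text{HoLA} = 0.8099 - 0.2521 \approx 0.558 .
\end{align*}
```

This agrees with the reported modified ICC of 0.5579 up to rounding of the inputs, and the half-width of the reported limits of agreement, (0.272 + 0.233)/2 ≈ 0.25, is consistent with the HoLA value.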
In the following sections, we will assess the impact of probe reliability, as measured by the modified ICC, in relation to different types of analyses in epigenome-wide association studies (EWAS).

Higher probe reliability is associated with the presence of mQTLs and significant correlations with downstream gene expression
We hypothesized that probe reliability might affect the effectiveness of integrative analyses that correlate DNAm with other types of omics variants, such as mQTLs or downstream gene expression. Methylation quantitative trait loci (mQTLs) are genetic variants that influence the patterns of DNAm. Min et al. (2021) performed a large mQTL study involving 32,851 subjects and found that approximately 45% of DNAm sites on the Illumina array were influenced by genetic variants [24]. Supplementary Figure 12 shows that, consistent with previous observations [15], CpG probes with mQTLs have significantly higher reliability (P-value < $2.2 \times 10^{-16}$). Specifically, CpG probes influenced by mQTLs had a median modified ICC of 0.546 compared to the other CpGs with a median modified ICC of 0.214.
Similarly, as DNAm is a key epigenetic modification that influences gene activity by regulating gene expression, we next investigated the impact of probe reliability on DNAm-to-mRNA correlations in blood samples. In this analysis, we examined probes located in promoter regions (within ±2 kb of the transcription start site; TSS) and distal regions (more than 2 kb from the TSS) separately. For both groups of probes, we found that the modified ICCs were higher for those probes significantly associated with downstream gene expression levels compared to other probes (P-value = $3.24 \times 10^{-5}$ and P-value = $1.41 \times 10^{-9}$, respectively; Supplementary Figure 13). These findings highlight the importance of considering probe reliability in EWAS and its potential implications for understanding the relationship between DNAm, genetic variation, and gene expression.

Higher probe reliability is associated with larger blood-brain DNAm correlations
For neurological disorders, such as AD, it is preferable to use disease-relevant tissues for epigenetic studies.
However, obtaining methylation levels in brain tissue from living human subjects is currently not feasible.
As a practical alternative, measuring methylation levels in accessible tissues, such as the blood, is often employed. Previous research by Hannon et al. (2015) examined matched DNAm profiles of pre-mortem blood samples and post-mortem brain tissues in the London dataset and found that only a small proportion of CpGs showed significant brain-blood correlations in DNAm levels [34].
To assess the impact of probe reliability on cross-tissue associations, we examined modified ICCs for the probes in relation to the correlation of DNAm measured in the brain prefrontal cortex and blood in the London dataset. Supplementary Figure 14 demonstrates that probes with higher brain-blood correlations in DNAm levels also had higher reliability. Specifically, for probes with high (> 0.75), medium (0.4-0.75), and low (< 0.4) brain-blood correlations, the median modified ICCs were 0.869, 0.786, and 0.215, respectively. These results are consistent with another recent study, which found an increase in reliability estimates in the 450k-EPIC comparison [15] for probes with moderate to high brain-blood correlations.
These findings suggest that reliable probes, by providing more accurate representations of DNAm levels, could facilitate the identification of potential biomarkers in brain disorders, such as dementia.

Higher probe reliability is associated with more consistent association signals
A previous study by Sugden et al. (2020) analyzed the effect of tobacco smoking on DNAm across 22 studies and observed that the number of replications of individual probes across studies positively correlated with reliability [15]. We previously conducted a study on DNAm associated with Alzheimer's disease diagnosis using two large clinical datasets generated by the ADNI and Australian Imaging, Biomarkers, and Lifestyle (AIBL) consortia. We hypothesized that poor probe reliability would impact the consistency of the DNAm-to-AD associations estimated in the ADNI and AIBL datasets. Indeed, we found that the differences in estimated effect sizes for DNAm-to-AD associations between the two studies were smallest for probes with excellent reliability (modified ICC > 0.75) and largest for probes with poor reliability (modified ICC < 0.4) (Figure 4). Furthermore, we also analyzed the results of our sex-specific study in AD [35] and found that the pattern of association between probe reliability and consistency in estimated effect sizes was similar for both males and females (Supplementary Figure 15).
Biologically, DNAm levels are often correlated across the genome and occur as a regional phenomenon [36]. Differentially methylated regions (DMRs) refer to specific regions in the genome where the levels of DNAm consistently and significantly differ between different conditions. We also considered the impact of probe reliability on DMR identification. Interestingly, as shown in Figure 5, we found that probes located within DMRs that we had previously identified to be associated with AD [5] exhibited significantly higher modified ICCs than other probes (P-value = 0.049). Specifically, the median modified ICC for probes within DMRs was 0.726, which was significantly higher than the median of 0.275 observed for the other probes. We observed similar patterns of association in sex-specific analysis of AD-associated DMRs (Supplementary Figure 16).
These findings highlight the importance of probe reliability in studying DNAm associations, as reliable probes help to minimize discrepancies between different datasets and lead to more consistent results.
Therefore, probe reliability is crucial for detecting genuine DNAm associations in EWAS.

Surrogate variables for cell-type proportions are reliable
In EWAS, one common approach for accounting for cell-type heterogeneity across different samples is to estimate the proportions of various cell types within each sample. These estimated cell-type proportions are then included as covariates in the regression models. We assessed the reliability of these estimated cell-type proportions and found that they showed good agreement between duplicate samples (Supplementary Table 5). Specifically, the modified ICC ranged from 0.693 to 0.930 across the different cell types. Notably, the proportions of NK (Natural Killer) cells and B cells were observed to have the lowest and highest reliabilities, respectively. These results are consistent with a previous study that examined DNAm levels in newborn and 14-year-old samples, which reported a high correlation in estimated cell-type proportions between duplicate samples [13]. Taken together, these findings provide strong support for the use of cell-type proportions to adjust for cell-type heterogeneity in EWAS.

DISCUSSION
In this study, we comprehensively evaluated the reliability of probes on Illumina EPIC arrays and created a valuable resource for EWAS studies (Supplementary Table 6). We carefully selected DNAm samples from the ADNI dataset. To avoid batch effects and ensure accurate assessment of probe reliability, we included duplicate samples that were placed on different methylation plates within the ADNI dataset.
Compared to previous studies that estimated probe reliability using duplicate measurements on either the 450k-EPIC or 450k-450k arrays, we found that the overall reliability estimates were higher in our study, in which both duplicates were measured using the EPIC arrays, consistent with results from previous studies [15,17] (Supplementary Table 1). Several factors could potentially account for the higher reliability estimates observed in our EPIC-EPIC comparisons. First, compared to the 450k arrays, the EPIC arrays included a significantly larger number of type II probes (~374,697) [9], which tend to have higher reliability than type I probes (Supplementary Figure 6). Additionally, during the design of EPIC arrays, some probes from the 450k arrays that were found to be unreliable were removed [9,37].
The source of DNA used in the reliability experiments may also have played a role in our higher reliability estimates. Technical replicates for the EPIC-EPIC and 450k-450k comparisons in most studies, including the ADNI study, were generated around the same time. However, replicates for 450k-EPIC comparisons were more likely to be generated later. We speculate that the 450k-450k and EPIC-EPIC comparisons may have utilized DNA extracted only once, while the DNA used in 450k-EPIC comparisons likely involved separate extractions. Therefore, reliability estimates in 450k-EPIC comparisons additionally account for variations in DNA extractions, potentially contributing to the lower reliability estimates observed in these comparisons.
It is important to consider additional factors that contribute to probe reliability. During our preprocessing step, we incorporated the pOOBAH algorithm (P-value with Out-Of-Band Array Hybridization) in the SeSAMe R package. This step specifically identifies and removes probes with hybridization issues [20]. Therefore, low reliability is unlikely to be attributed to failed probe hybridization with the target DNA, as previously demonstrated [15]. However, an important factor that might impact reliability estimates is batch effects [10]. Notably, when we calculated the reliability using duplicated samples placed on the same methylation plate, we observed an increase in the median ICC for the EPIC-EPIC comparison from 0.325 to 0.733. This highlights the impact of the batch effect on reliability estimates and emphasizes the importance of accounting for batch effects in methylation studies.
Consistent with previous studies, we observed that the reliability of the probes was influenced by the mean and variance of methylation levels as measured by beta values [14,32,38]. To understand the dependency of ICC on DNAm variances, note that ICC is defined by $\sigma_b^2 / (\sigma_b^2 + \sigma_w^2)$ [14], where $\sigma_b^2$ and $\sigma_w^2$ are between-subject and within-subject variances, respectively. Therefore, ICC is influenced by between-subject variances of the probes, and probes with low variation in methylation levels, corresponding to low between-subject variance, would result in low ICC values.
It is well known that the mean and variance of methylation beta values follow an inverse U relationship.
If we consider the beta value as the proportion (p) of methylated cells in a large population of cells, the mean-variance relationship of beta values is consistent with the theory that for a binomial proportion p, the variance $p(1-p)/n$, where n is the total number of cells, is the highest when p is 0.5. Therefore, between-subject variance peaks around a mean beta value of 0.5, and is lower for probes with the lowest and highest beta values [15].
Consequently, probes with extremely low or high average beta values have lower variances and are more likely to be classified as unreliable than probes with intermediate beta values. Consistent with previous observations by Xu and Taylor (2013) [14], we also found that among probes with very low or high average methylation beta values (below 0.1 or above 0.9), only a small portion (14.12% or 52,471 out of 371,531 probes) were classified as having good or excellent reliability (i.e., ICC > 0.6). In contrast, for probes with intermediate average methylation beta values, a significantly higher proportion (58.5% or 157,617 out of 269,429 probes) was classified as having good or excellent reliability. These results are consistent with our observation of lower ICC values in the TSS and CpG island regions, which are evolutionarily conserved in gene regulation [39].
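The binomial argument and the two proportions quoted above can be checked with a short sketch. The cell count n = 1000 is an arbitrary illustrative value, and the function names are hypothetical.

```python
def binomial_proportion_variance(p, n):
    """Variance of a binomial proportion: p(1 - p) / n."""
    return p * (1 - p) / n


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100 * part / whole


# The variance is largest at p = 0.5 and symmetric around it.
for p in (0.05, 0.1, 0.5, 0.9, 0.95):
    print(p, binomial_proportion_variance(p, n=1000))

# Proportions of probes with ICC > 0.6, using the counts reported in the text.
print(round(percent(52_471, 371_531), 2))   # extreme mean beta values
print(round(percent(157_617, 269_429), 1))  # intermediate mean beta values
```

The two probe groups add up to the 640,960 probes analyzed (371,531 + 269,429).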
On the other hand, probes with large between-subject variance could yield high ICC values, even in the presence of sizable differences between duplicate measurements. To explicitly account for the disagreement in duplicate measurements, we proposed a modified ICC that incorporates a penalty based on the half-width of the 95% confidence limits of agreement. We found that the modified ICC maintained the characteristics of the original ICC, with similar patterns of association with the mean and standard deviations of beta values. It also had lower values in CpG islands and TSS200 regions, as well as for CpGs measured by type I probes. The classification of probes based on the modified ICC largely agreed with the classifications based on the original ICC, except for probes with substantial disagreement in duplicate measurements.
We next studied the implications of probe reliability in EWAS, and our findings highlight its significant impact on downstream analyses. Consistent with observations in previous studies that used ICCs to assess reliability [10,15], our analyses using modified ICCs, which explicitly account for the agreement between duplicate DNAm measurements, also revealed that probe reliability plays a crucial role in the success of integrative analyses involving DNAm and other types of omics data, such as mQTLs and mRNA gene expression. Furthermore, we observed that blood DNAm measurements obtained using probes with higher reliability were more likely to show concordance with brain DNAm. Finally, we demonstrated that higher reliability is associated with more consistent effect sizes and the identification of DMRs in EWAS. This is likely because methylation signals from unreliable probes can be contaminated with noise, thereby increasing the likelihood of generating false positives. We found that surrogate variables for cell-type proportions demonstrated good-to-excellent reliability across all major immune cell types in the blood, supporting the use of estimated cell-type proportions in addressing cell-type heterogeneity in EWAS.
Several limitations and areas for future study should be noted. First, our analyses focused on methylation levels measured in whole blood; therefore, the results may be applicable only to EWAS conducted using blood samples. Additional studies that assess the reliability of DNAm measured by EPIC array probes in target tissues (e.g., the brain) and other accessible tissues (e.g., saliva) are needed. Second, due to the specific criteria we applied, including selecting independent subjects and ensuring that samples were placed on different methylation plates, we were able to include only a relatively small number of 69 pairs of duplicate samples in this study. Future studies with larger sample sizes are needed to confirm our findings.
Finally, it is important to note that reliability estimates can be influenced by the choice of normalization procedure. In this study, we used the SeSAMe 2 pipeline, which was found to have the best performance and yielded the largest percentage of reliable CpG probes in a recent comparison of various preprocessing and normalization pipelines in a small study with 16 pairs of duplicate samples [17]. Additional studies with larger sample sizes are needed to thoroughly evaluate and compare different preprocessing procedures for estimating the reliability of DNAm levels in EPIC arrays.
To reduce the burden of multiple comparisons in EWAS, some authors have proposed excluding probes with low reliability a priori [10,40], whereas others cautioned against this approach, as it may potentially exclude probes with low variability that are located in important gene regulatory regions [14]. To this end, we recommend a practical strategy for performing EWAS analysis: first, based on the specific sample size, determine the maximum number of probes, denoted as m, that can be tested with sufficient power (e.g., 80%), considering corrections for multiple comparisons [41]. The primary analysis would concentrate on examining the m probes with the highest reliability. Subsequently, secondary analyses can be carried out to investigate the remaining probes. This ensures a focused and structured approach to exploring associations in EWAS, prioritizing reliability, and ensuring power in the analysis. An interesting topic for further research is to rigorously design sequential multiple comparison procedures that maximally leverage the reliability information of all probes while controlling the overall Type I error rate in EWAS. The idea is to test all the probes, but with the most reliable probes first, and appropriately adjust the significance level of each probe analysis to account for the increased chance of obtaining a false-positive result when conducting multiple comparisons.
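The recommended two-stage strategy can be sketched as a ranking step followed by a per-test significance level. The text does not name a specific correction, so the Bonferroni threshold below is an assumption; the probe IDs, reliability values, and function names are hypothetical, and m is taken as given rather than derived from a power calculation.

```python
def split_primary_secondary(reliability_by_probe, m):
    """Rank probes by reliability; the top m form the primary analysis set."""
    ranked = sorted(reliability_by_probe, key=reliability_by_probe.get, reverse=True)
    return ranked[:m], ranked[m:]


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction (assumed here)."""
    return alpha / n_tests


# Hypothetical probes with made-up modified ICC values.
reliability = {"cgA": 0.90, "cgB": 0.20, "cgC": 0.70}
primary, secondary = split_primary_secondary(reliability, m=2)
print(primary, secondary, bonferroni_threshold(0.05, len(primary)))
```

Restricting the primary analysis to m probes means the correction is applied over m tests rather than over every probe on the array, which is where the gain in power comes from.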
In summary, we comprehensively evaluated the reliability of probes on EPIC arrays. We observed that probes with substantial between-subject variance could still yield high ICC values even when there were notable differences between duplicate measurements. To explicitly account for the disagreement between duplicate measurements, we propose a new statistical measure, the modified ICC. Our findings revealed that probe reliability has significant implications for various downstream analyses in EWAS.
Importantly, we generated a valuable resource for DNAm research by identifying a set of high-quality probes on the EPIC array, which will contribute to optimizing the robustness and potential of EWAS.

Figure 3 DNA methylation levels at cg14089881 for 69 pairs of duplicate samples in the ADNI dataset show that the ICC value can be high (ICC = 0.8099) even when there is substantial disagreement in duplicate measurements (HoLA = 0.2521). In (A), each point corresponds to methylation levels of duplicate samples from the same subject, and the centerline indicates perfect agreement. In both (A) and (B), dashed lines indicate 95% confidence limits of agreement, which are computed as 1.96 × the standard deviation of the differences between the two duplicate measurements. In (B), a histogram of the differences in methylation levels in the duplicate samples is shown. Abbreviation: HoLA = Half-width of Limits of Agreement

Figure 1 Distribution of estimated intraclass correlation coefficients (ICCs) for 138 technical duplicate DNA methylation samples generated from 69 independent subjects in the ADNI study. Dashed lines indicate the median (red) and mean (blue) of the ICC values, at 0.325 and 0.381, respectively.

Figure 2 Probe reliability (ICC) increased as standard deviation (SD) of DNA methylation levels increased.


Figure 4 Higher

Table 1 presents the probe classifications based on the modified ICC and demonstrates good agreement with the original ICC, except for a small group of CpGs with substantial disagreement in duplicate measurements. The modified ICC retains the properties of the ICC, including similar patterns of association with the mean and standard deviations of beta values, and lower values in CpG islands and TSS200 regions and for CpGs measured by type I probes (Supplementary Figures 7-11).

Table 1 Comparison of probe reliability classifications based on ICC and modified ICC, which is computed as ICC - HoLA, where HoLA refers to the Half-width of Limits of Agreement. The probe classifications based on modified ICC agree well with the classifications based on the original ICC, except for a small group of probes (highlighted in red) with substantial disagreement as measured by HoLA in duplicate measurements.