Charité - University Medicine, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117, Germany.
Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178, Germany.
BMC Bioinformatics. 2020 Jan 30;21(1):36. doi: 10.1186/s12859-020-3364-6.
In methylation analyses such as epigenome-wide association studies, a large number of biomarkers are tested for an association between the measured continuous outcome and different covariates. A continuous covariate such as smoking pack years (SPY), a measure of lifetime exposure to tobacco toxins, can show a spike at zero: all non-smokers form a peak at zero, while the smokers are distributed over the remaining SPY values. A spike can also occur on the right side of the covariate distribution if a category such as "heavy smoker" is defined. Here, we focus on methylation data with a spike at the left or the right end of the distribution of a continuous covariate. After the methylation data are generated, the analysis usually consists of preprocessing, quality control, and determination of differentially methylated sites, often performed in pipeline fashion. That is, the data are processed by a chain of methods that are available in one software package. The pipelines can distinguish between categorical covariates, used for group comparisons, and continuous covariates, used for linear regression. The differential methylation analysis is often done internally by a linear regression without checking its inherent assumptions. A spike in the continuous covariate is therefore ignored and can cause biased results.
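To make the notion of a spike concrete, the following minimal sketch flags a covariate whose smallest or largest value is shared by a large fraction of the observations. It is an illustration only, not the algorithm from the paper: the function name `find_spike`, the 20% threshold `min_fraction`, and the SPY values are assumptions made for this example.

```python
from collections import Counter


def find_spike(values, min_fraction=0.2):
    """Return (side, value, fraction) if an extreme value is heavily tied, else None.

    Hypothetical helper: the 20% default threshold is an arbitrary assumption.
    """
    n = len(values)
    if n == 0:
        return None
    counts = Counter(values)
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant covariate has no distribution to compare the spike against.
        return None
    # Share of observations sitting exactly on the left and right extremes.
    candidates = [("left", lo, counts[lo] / n), ("right", hi, counts[hi] / n)]
    side, value, fraction = max(candidates, key=lambda c: c[2])
    if fraction >= min_fraction:
        return side, value, fraction
    return None


# Made-up smoking pack years: four non-smokers and six smokers.
spy = [0, 0, 0, 0, 5, 12.5, 20, 33, 41, 60]
print(find_spike(spy))  # ('left', 0, 0.4)
```

A covariate without ties at either end, such as `[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`, returns `None` under the same threshold.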
We reanalysed five data sets, four of them freely available from ArrayExpress, that include methylation data and smoking habits reported as smoking pack years. For this purpose, we developed an algorithm that checks for suspicious interactions between the values associated with the spike position and those at the non-spike positions of the covariate. Our algorithm helps to decide whether a suspicious interaction is present and whether further investigations should be carried out. This is particularly important because the information on the differentially methylated sites is used for post-hoc analyses such as pathway analyses.
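One simple way to picture such a check, for a single methylation site, is to fit a straight line to the non-spike observations only and compare its prediction at the spike position with the mean outcome actually observed there. The sketch below does this in plain Python. It is not the published algorithm, whose details are not given in this abstract: the helper names `fit_line` and `spike_discrepancy` and all data values are assumptions chosen for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def spike_discrepancy(x, y, spike_value):
    """Observed mean outcome at the spike minus the value that a line
    fitted to the non-spike observations predicts at the spike."""
    spike_y = [yi for xi, yi in zip(x, y) if xi == spike_value]
    rest_x = [xi for xi in x if xi != spike_value]
    rest_y = [yi for xi, yi in zip(x, y) if xi != spike_value]
    slope, intercept = fit_line(rest_x, rest_y)
    predicted = intercept + slope * spike_value
    observed = sum(spike_y) / len(spike_y)
    return observed - predicted


# Made-up data for one site: three non-smokers (SPY = 0) and four smokers.
spy = [0, 0, 0, 10, 20, 30, 40]
methylation = [0.80, 0.79, 0.81, 0.60, 0.58, 0.56, 0.54]

gap = spike_discrepancy(spy, methylation, spike_value=0)
print(round(gap, 3))  # approximately 0.18
```

In this toy example the non-smokers lie well above the line extrapolated from the smokers, so one pooled regression would misrepresent both groups. How large a gap counts as suspicious is a separate modelling decision that this sketch does not address.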
Our algorithm helps to check the validity of the linear regression assumptions in a methylation analysis pipeline. These assumptions should also be considered for machine learning approaches. In addition, the algorithm can detect outliers in the continuous covariate. Using it as a preprocessing step should therefore yield statistically more robust results in methylation analysis.