Detection of suspicious interactions of spiking covariates in methylation data.

Author information

Charité - University Medicine, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Institute of Biometry and Clinical Epidemiology, Charitéplatz 1, Berlin, 10117, Germany.

Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Straße 2, Berlin, 10178, Germany.

Publication information

BMC Bioinformatics. 2020 Jan 30;21(1):36. doi: 10.1186/s12859-020-3364-6.

Abstract

BACKGROUND

In methylation analyses such as epigenome-wide association studies, a large number of biomarkers are tested for an association between the measured continuous outcome and different covariates. A continuous covariate such as smoking pack years (SPY), a measure of lifetime exposure to tobacco toxins, can show a spike at zero: all non-smokers pile up at zero, while the smokers are distributed over the remaining SPY values. A spike can also occur on the right side of the covariate distribution if a category such as "heavy smoker" is defined. Here, we focus on methylation data with a spike at the left or the right end of the distribution of a continuous covariate. After the methylation data are generated, the analysis usually consists of preprocessing, quality control, and determination of differentially methylated sites, often run as a pipeline, that is, as a chain of methods available in one software package. The pipelines can distinguish between categorical covariates (for group comparisons) and continuous covariates (for linear regression). The differential methylation analysis is often done internally by a linear regression without checking its inherent assumptions. A spike in the continuous covariate is then ignored and can cause biased results.
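To make the notion of a spike concrete, the following minimal sketch flags a covariate whose smallest or largest value is shared by a large fraction of samples. It is an illustration only, not part of the published method: the function name `find_edge_spike` and the 20% threshold `min_fraction` are assumptions chosen for this example.

```python
from collections import Counter


def find_edge_spike(values, min_fraction=0.2):
    """Return (side, value, fraction) if the minimum or maximum of
    `values` is shared by at least `min_fraction` of the samples,
    otherwise None. The threshold is a hypothetical default."""
    counts = Counter(values)
    n = len(values)
    best = None
    # Only the two ends of the distribution are inspected, matching the
    # left-spike (non-smokers) and right-spike (capped category) cases.
    for side, edge_value in (("left", min(values)), ("right", max(values))):
        fraction = counts[edge_value] / n
        if fraction >= min_fraction and (best is None or fraction > best[2]):
            best = (side, edge_value, fraction)
    return best
```

For example, a hypothetical SPY vector in which half of the participants are non-smokers, such as `[0, 0, 0, 0, 0, 0, 5, 10, 12.5, 20, 30, 40]`, would be reported as a left spike at zero.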

RESULTS

We reanalysed five data sets, four of them freely available from ArrayExpress, that include methylation data and smoking habits reported as smoking pack years. For this purpose, we developed an algorithm that checks for suspicious interactions between the values associated with the spike position and those at the non-spike positions of the covariate. The algorithm helps to decide whether a suspicious interaction is present and whether further investigation should be carried out. This is especially important because the information on the differentially methylated sites is used in post-hoc analyses such as pathway analyses.
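The abstract does not spell out the algorithm, so the following is only a rough sketch of one way such a check could look, not the authors' procedure. It assumes a single methylation site, fits an ordinary least-squares line to the non-spike samples, extrapolates it to the spike position, and compares the extrapolated value with the mean methylation actually observed at the spike. The function name `spike_discrepancy` and the scaling by the residual standard deviation are choices made for this illustration.

```python
import numpy as np


def spike_discrepancy(x, y, spike_value=0.0):
    """Standardized gap between the mean outcome observed at the spike
    and the value extrapolated from a line fitted to non-spike samples.
    Illustrative sketch only; not the published algorithm."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    at_spike = np.isclose(x, spike_value)
    x_rest, y_rest = x[~at_spike], y[~at_spike]

    # Ordinary least squares using the non-spike samples only.
    slope, intercept = np.polyfit(x_rest, y_rest, 1)

    # What the non-spike trend predicts at the spike position.
    predicted = slope * spike_value + intercept
    observed = y[at_spike].mean()

    # Residual standard deviation of the non-spike fit (n - 2 degrees
    # of freedom), used to put the gap on an interpretable scale.
    residuals = y_rest - (slope * x_rest + intercept)
    residual_sd = np.std(residuals, ddof=2)
    return (observed - predicted) / residual_sd
```

A large absolute value suggests that the spike samples do not follow the trend seen in the remaining samples, so a single regression line across all samples may be misleading for that site. Any cut-off for "large" would have to be chosen and justified separately.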

CONCLUSIONS

Our algorithm helps to check the validity of the linear regression assumptions in a methylation analysis pipeline. These assumptions should also be considered for machine learning approaches. In addition, the algorithm can detect outliers in the continuous covariate. Using it as a preprocessing step should therefore produce statistically more robust results in methylation analysis.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4511/6993406/73b71ea92a96/12859_2020_3364_Fig1_HTML.jpg
