A Novel Computational Framework to Predict Disease-Related Copy Number Variations by Integrating Multiple Data Sources.

Author information

Yuan Lin, Sun Tao, Zhao Jing, Shen Zhen

Affiliations

School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China.

School of Computer and Software, Nanyang Institute of Technology, Nanyang, China.

Publication information

Front Genet. 2021 Jun 29;12:696956. doi: 10.3389/fgene.2021.696956. eCollection 2021.

DOI:10.3389/fgene.2021.696956

PMID:34267783

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8276077/

Abstract

Copy number variation (CNV) may contribute to the development of complex diseases. However, because the mechanisms underlying path associations are complex and sufficient samples are lacking, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene, and disease label data provides an opportunity to design a new machine learning framework for predicting potential disease-related CNVs. In this paper, we developed a novel machine learning approach, IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict CNV-disease path associations from a data set containing CNVs, disease state labels, and gene data. CNVs, genes, and diseases are connected by edges to form a biological association network. To construct this network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. We then used logistic regression with an L1 penalty (LLR) to detect disease-related genes. When applying self-adaptive BM and LLR, we added a stability selection strategy, which can effectively reduce false positives. Finally, a weighted path search algorithm was applied to find the top path associations and important CNVs. Experimental results on both simulated and prostate cancer data show that IHI-BMLLR performs significantly better than two state-of-the-art CNV detection methods (CCRET and DPtest) under false-positive control. Furthermore, applying IHI-BMLLR to prostate cancer data revealed significant path associations. Three new cancer-related genes were discovered in these paths; they remain to be verified by future biological research.
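For readers who want a concrete picture of the building blocks named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It shows the standard (non-adaptive) biweight midcorrelation, a generic half-sample stability selection loop, and a simple ranking of CNV -> gene -> disease paths. The abstract does not specify the self-adaptive BM variant, the LLR solver, or the path weighting, so the function names (`bicor`, `stability_selection`, `top_paths`), the thresholds, and the path score |correlation| x |gene weight| are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import median


def _biweight_terms(values):
    """Return normalized, biweight-weighted deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; biweight weights are undefined")
    terms = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        # Points far from the median (|u| >= 1) receive zero weight.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    norm = sqrt(sum(t * t for t in terms))
    return [t / norm for t in terms]


def bicor(x, y):
    """Standard biweight midcorrelation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(a * b for a, b in zip(_biweight_terms(x), _biweight_terms(y)))


def stability_selection(n_samples, select, n_rounds=100, threshold=0.6, seed=0):
    """Keep items that `select` returns in at least `threshold` of the rounds.

    `select` is a hypothetical callable: it receives a list of sample indices
    (a random half of the data) and returns the set of items it selects,
    e.g. CNV-gene pairs with large |bicor| or genes with nonzero L1 weights.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_rounds):
        idx = rng.sample(range(n_samples), n_samples // 2)
        for item in select(idx):
            counts[item] = counts.get(item, 0) + 1
    return {item for item, c in counts.items() if c / n_rounds >= threshold}


def top_paths(cnv_gene_corr, gene_weight, k=3):
    """Rank CNV -> gene -> disease paths by |correlation| * |gene weight|.

    The product score is an assumption for this sketch, not the paper's
    weighting. `cnv_gene_corr` maps (cnv, gene) to a correlation and
    `gene_weight` maps a gene to its disease-association weight.
    """
    paths = [
        (abs(r) * abs(gene_weight[gene]), cnv, gene)
        for (cnv, gene), r in cnv_gene_corr.items()
        if gene in gene_weight  # only genes retained as disease-related
    ]
    paths.sort(reverse=True)
    return paths[:k]
```

In this sketch, `bicor` would supply the CNV-gene edge weights, `stability_selection` would wrap both the edge selection and the gene selection step, and `top_paths` would combine the two edge types into ranked paths. Because the biweight function gives zero weight to extreme observations, a single outlying sample has far less influence on the estimate than it would under Pearson correlation.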
