Identification of Cancer-Related Long Non-Coding RNAs Using XGBoost With High Accuracy.

Authors

Zhang Xuan, Li Tianjun, Wang Jun, Li Jing, Chen Long, Liu Changning

Affiliations

CAS Key Laboratory of Tropical Plant Resources and Sustainable Use, Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, Kunming, China.

University of Chinese Academy of Sciences, Beijing, China.

Publication information

Front Genet. 2019 Aug 9;10:735. doi: 10.3389/fgene.2019.00735. eCollection 2019.

Abstract

In the past decade, hundreds of long noncoding RNAs (lncRNAs) have been identified as significant players in diverse types of cancer; however, the functions and mechanisms of most lncRNAs in cancer remain unclear. Several computational methods have been developed to detect associations between cancer and lncRNAs, yet those approaches have limitations in both sensitivity and specificity. With the goal of improving the prediction accuracy for associations of lncRNA with cancer, we upgraded our previously developed cancer-related lncRNA classifier, CRlncRC, to generate CRlncRC2. CRlncRC2 is an eXtreme Gradient Boosting (XGBoost) machine learning framework, including Synthetic Minority Over-sampling Technique (SMOTE)-based over-sampling, along with Laplacian Score-based feature selection. Ten-fold cross-validation showed that the AUC value of CRlncRC2 for identification of cancer-related lncRNAs is much higher than previously reported by CRlncRC and others. Compared with CRlncRC, the number of features used by CRlncRC2 dropped from 85 to 51. Finally, we identified 439 cancer-related lncRNA candidates using CRlncRC2. To evaluate the accuracy of the predictions, we first consulted the cancer-related long non-coding RNA database Lnc2Cancer v2.0 and relevant literature for supporting information, then conducted statistical analysis of somatic mutations, distance from cancer genes, and differential expression in tumor tissues, using various data sets. The results showed that our approach was highly reliable for identifying cancer-related lncRNA candidates. Notably, the highest ranked candidate, lncRNA AC074117.1, has not been reported previously; however, integrated multi-omics analyses demonstrate that it is the target of multiple cancer-related miRNAs and interacts with adjacent protein-coding genes, suggesting that it may act as a cancer-related competing endogenous RNA, which warrants further investigation. 
In conclusion, CRlncRC2 is an effective and accurate method for identification of cancer-related lncRNAs, and has potential to contribute to the functional annotation of lncRNAs and guide cancer therapy.
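The abstract names two preprocessing steps ahead of the XGBoost classifier: SMOTE-based over-sampling and Laplacian Score-based feature selection. The sketch below illustrates the general idea of both techniques in plain Python. It is not the authors' implementation: the function names, the parameters `k`, `t`, and `seed`, and the choice of a symmetric k-nearest-neighbour graph with heat-kernel weights are assumptions made for illustration, and the classifier itself is not shown. In practice one would more likely rely on established libraries such as imbalanced-learn and xgboost.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples (illustrative SMOTE).

    Each synthetic sample lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def laplacian_scores(X, k=5, t=1.0):
    """Return one Laplacian Score per feature column of X.

    A lower score means the feature varies little between neighbouring
    samples relative to its overall (degree-weighted) variance.
    """
    n = len(X)
    # Symmetric kNN graph with heat-kernel weights exp(-d^2 / t).
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(X[i], X[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            W[i][j] = W[j][i] = math.exp(-d * d / t)
    degree = [sum(row) for row in W]
    total_degree = sum(degree)

    scores = []
    for r in range(len(X[0])):
        f = [row[r] for row in X]
        # Remove the degree-weighted mean of the feature.
        mean = sum(fi * di for fi, di in zip(f, degree)) / total_degree
        f = [fi - mean for fi in f]
        # f^T L f equals the sum over edges of w_ij * (f_i - f_j)^2.
        num = sum(
            W[i][j] * (f[i] - f[j]) ** 2
            for i in range(n)
            for j in range(i + 1, n)
        )
        # f^T D f, the degree-weighted variance term.
        den = sum(di * fi * fi for fi, di in zip(f, degree))
        scores.append(num / den if den > 0 else float("inf"))
    return scores
```

Under these assumptions, features would be ranked by ascending score and the top-ranked ones retained, and `smote` would be applied to the training portion of each cross-validation fold only, so that synthetic samples do not leak into the held-out data.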


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cb56/6701491/8a4a07a0e38f/fgene-10-00735-g001.jpg
