Contextualizing Genes by Using Text-Mined Co-Occurrence Features for Cancer Gene Panel Discovery.

Author information

Chen Hui-O, Lin Peng-Chan, Liu Chen-Ruei, Wang Chi-Shiang, Chiang Jung-Hsien

Affiliations

Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Cheng Kung University, Tainan, Taiwan.

Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan.

Publication information

Front Genet. 2021 Oct 25;12:771435. doi: 10.3389/fgene.2021.771435. eCollection 2021.

Abstract

Developing a biomedically explainable and validatable text mining pipeline can help in cancer gene panel discovery. We created a pipeline that contextualizes genes by using text-mined co-occurrence features. We applied Biomedical Natural Language Processing (BioNLP) techniques to literature mining for cancer gene panels. A literature-derived 4,679 × 4,630 gene term-feature matrix was built. The L858R, T790M, and V600E genetic variants are important mutation term features in text mining and are frequently mutated in cancer. We validated the cancer gene panel against the mutational landscape of different cancer types. The cosine similarity of gene frequency between the text mining result and a statistical result from clinical sequencing data is 80.8%. Across different machine learning models, the best accuracy for the prediction of two gene panels, MSK-IMPACT (Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets) and the Oncomine cancer gene panel, is 0.959 and 0.989, respectively. Receiver operating characteristic (ROC) curve analysis confirmed that the neural net model has a better prediction performance (area under the ROC curve (AUC) = 0.992). The use of text-mined co-occurrence features can contextualize each gene. We believe the approach can be used to evaluate several existing gene panels, and we show that part of a gene panel set can be used to predict the remaining genes for cancer discovery.
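
The abstract reports a cosine similarity between gene frequencies from text mining and from clinical sequencing data. As a minimal sketch of how such a comparison could be computed, the Python below aligns two gene-to-frequency mappings on a shared gene list and takes the cosine of the resulting vectors. The function name `cosine_similarity`, the dictionary layout, and the toy counts are illustrative assumptions, not the authors' code or data.

```python
import math


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two gene -> frequency mappings.

    Genes missing from one mapping are treated as frequency 0
    (an assumption of this sketch, not stated in the abstract).
    """
    # Align both mappings on the union of gene symbols.
    genes = sorted(set(freq_a) | set(freq_b))
    a = [float(freq_a.get(g, 0.0)) for g in genes]
    b = [float(freq_b.get(g, 0.0)) for g in genes]

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # Define the similarity as 0 when either vector is all zeros.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy counts, made up purely for illustration.
    text_mined = {"EGFR": 120, "BRAF": 80, "KRAS": 60}
    sequenced = {"EGFR": 30, "BRAF": 25, "TP53": 40}
    print(f"cosine similarity: {cosine_similarity(text_mined, sequenced):.3f}")
```

Because cosine similarity depends only on the direction of the two vectors, raw literature counts and sequencing-derived mutation frequencies can be compared without first rescaling them to a common range.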

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dbd2/8573063/89025321c1b1/fgene-12-771435-g001.jpg