

Splicing signature database development to delineate cancer pathways using literature mining and transcriptome machine learning.

Author information

Lee Kyubin, Hyung Daejin, Cho Soo Young, Yu Namhee, Hong Sewha, Kim Jihyun, Kim Sunshin, Han Ji-Youn, Park Charny

Affiliations

Research Institute, National Cancer Center, 232 Ilsan-ro, Goyang-si, Gyeonggi-do 10408, Republic of Korea.

Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA.

Publication information

Comput Struct Biotechnol J. 2023 Mar 2;21:1978-1988. doi: 10.1016/j.csbj.2023.02.052. eCollection 2023.

Abstract

Alternative splicing (AS) events modulate certain pathways and phenotypic plasticity in cancer. Although previous studies have analyzed splicing events computationally, it remains a challenge to uncover the biological functions induced by reliable AS events among an enormous number of candidates. To provide essential splicing event signatures for assessing pathway regulation, we developed a database from two sources: (i) the published literature and (ii) cancer transcriptome profiles. The former comprises knowledge-based splicing signatures for 202 pathways, collected from 63,229 PubMed abstracts using natural language processing. The latter comprises machine learning-based splicing signatures identified from pan-cancer transcriptomes for 16 cancer types and 42 pathways. We trained six different learning models to classify pathway activity, using splicing profiles as the learning dataset. The AS events ranked highest by model feature importance formed the signature for each pathway. To validate the learning results, we evaluated them against (i) performance metrics, (ii) differential AS sets acquired from external datasets, and (iii) our knowledge-based signatures. The area under the receiver operating characteristic curve (AUROC) values did not differ drastically among the learning models. However, the random-forest model clearly performed best when compared against the AS sets identified from external datasets and against our knowledge-based signatures. Therefore, we used the signatures obtained from the random-forest model. The database provides clinical characteristics of the AS signatures, including survival test results, molecular subtype, and tumor microenvironment. Regulation by splicing factors was also investigated. The database supports retrieval and visualization of the developed signatures.
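The signature-selection and validation steps described in the abstract can be illustrated with a small sketch. It assumes that a signature is simply the top-k AS events by feature importance and that agreement with a reference set is scored with a hypergeometric overlap test. The abstract does not state the cutoff or the comparison statistic, so both are assumptions, and the event identifiers, importance scores, and function names below are hypothetical.

```python
from math import comb


def top_k_signature(importances, k):
    """Return the k events with the highest importance scores (assumed cutoff rule)."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [event for event, _ in ranked[:k]]


def overlap_p_value(signature, reference, universe_size):
    """Hypergeometric tail probability of seeing at least the observed overlap.

    This is an assumed way to compare a learned signature with a reference
    set; the source does not specify the statistic it used.
    """
    sig, ref = set(signature), set(reference)
    n_hits = len(sig & ref)
    n, K, N = len(sig), len(ref), universe_size
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(n_hits, min(n, K) + 1)
    )
    return n_hits, tail / total


# Hypothetical feature importances for one pathway (not real data).
importances = {
    "GENE_A:SE:1": 0.30,
    "GENE_B:RI:2": 0.05,
    "GENE_C:SE:3": 0.25,
    "GENE_D:A3:4": 0.10,
    "GENE_E:SE:5": 0.20,
    "GENE_F:MX:6": 0.10,
}
# Hypothetical knowledge-based signature for the same pathway.
knowledge_based = {"GENE_A:SE:1", "GENE_C:SE:3", "GENE_F:MX:6"}

signature = top_k_signature(importances, k=3)
hits, p = overlap_p_value(signature, knowledge_based, universe_size=len(importances))
print(signature, hits, round(p, 3))
```

In a real analysis the importance scores would come from a fitted model, and the universe would be the full set of AS events tested for that pathway rather than six toy entries.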


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3a37/10023904/71e048901cba/ga1.jpg
