Faculty of Mathematics and Informatics, Sofia University, "St. Kliment Ohridski", 5 James Bourchier Blvd., Sofia, 1164, Bulgaria.
Department of Biotechnology, Boku University, Vienna, 1180, Austria.
Biol Direct. 2019 Nov 21;14(1):22. doi: 10.1186/s13062-019-0249-6.
Recently, high-throughput technologies have been widely used alongside clinical tests to study various types of cancer. Data generated in such large-scale studies are heterogeneous and come in different types and formats. In the absence of effective integration strategies, novel models are needed for efficient and practical data integration, so that clinical and molecular information can be joined effectively for storage, access and ease of use. Such models, combined with machine learning methods for accurate prediction of survival time in cancer studies, can yield novel insights into disease development and lead to precise personalized therapies.
We developed an approach for intelligent data integration of two cancer datasets (breast cancer and neuroblastoma) provided in the CAMDA 2018 'Cancer Data Integration Challenge', and compared models for prediction of survival time. We developed a novel semantic network-based data integration framework that utilizes NoSQL databases, in which we combined clinical and expression profile data, using both raw data records and external knowledge sources. Using the integrated data, we introduced the Tumor Integrated Clinical Feature (TICF), a new feature for accurate prediction of patient survival time. Finally, we applied and validated several machine learning models for survival time prediction.
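As a rough illustration of the integration idea described above, the following minimal sketch merges clinical and expression records into one document per patient (the shape a document-oriented NoSQL store would hold) and then expresses each document as subject-predicate-object edges of a small semantic network. It is not the authors' framework: the record fields, the gene names, the values, and the helper functions `integrate` and `to_triples` are all hypothetical, and no external knowledge source or TICF computation is modelled.

```python
# Hypothetical toy records; field names and values are illustrative only
# and are not taken from the CAMDA 2018 datasets.
clinical = [
    {"patient_id": "P1", "age_years": 52, "stage": "II"},
    {"patient_id": "P2", "age_years": 3, "stage": "IV"},
]
expression = [
    {"patient_id": "P1", "gene": "GENE_A", "value": 7.1},
    {"patient_id": "P1", "gene": "GENE_B", "value": 2.4},
    {"patient_id": "P2", "gene": "GENE_A", "value": 5.0},
]


def integrate(clinical_rows, expression_rows):
    """Merge clinical and expression rows into one document per patient."""
    docs = {
        row["patient_id"]: {"clinical": dict(row), "expression": {}}
        for row in clinical_rows
    }
    for row in expression_rows:
        # Patients with expression data but no clinical row still get a document.
        doc = docs.setdefault(
            row["patient_id"], {"clinical": {}, "expression": {}}
        )
        doc["expression"][row["gene"]] = row["value"]
    return docs


def to_triples(docs):
    """Express each document as (subject, predicate, object) edges."""
    triples = []
    for pid, doc in docs.items():
        for key, value in doc["clinical"].items():
            if key != "patient_id":
                triples.append((pid, f"has_{key}", value))
        for gene in doc["expression"]:
            triples.append((pid, "expresses", gene))
    return triples


documents = integrate(clinical, expression)
triples = to_triples(documents)
```

In this sketch, patients that share a gene node become indirectly connected through it, which is one simple way a network representation can expose relations across records from different studies.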
We developed a framework for semantic integration of clinical and omics data that can borrow information across multiple cancer studies. By linking data with external domain knowledge sources, our approach enriches the studied data through the discovery of internal relations. The proposed and validated machine learning models for survival time prediction yielded accurate results.
This article was reviewed by Eran Elhaik, Wenzhong Xiao and Carlos Loucera.