Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, No. 40-1 Beijing South Road, Urumqi 830011, Xinjiang, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Gigascience. 2020 Jun 1;9(6). doi: 10.1093/gigascience/giaa032.
The explosive growth of genomic, chemical, and pathological data brings both new opportunities and new challenges for understanding life activities in cells. However, few computational models aggregate diverse bioentities to comprehensively reveal the physical and functional landscape of biological systems.
We constructed a molecular association network that contains 18 types of edges (relationships) among 8 types of nodes (bioentities). Based on this network, we propose Bioentity2vec, a new method for representing bioentities that integrates information about both the attributes and the behaviors of a bioentity. Using a random forest classifier, we achieved promising performance across the 18 relationships, with an area under the receiver operating characteristic curve of 0.9608 and an area under the precision-recall curve of 0.9572.
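The two reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`pair_features`, `roc_auc`, `average_precision`), the use of plain concatenation to combine attribute and behavior vectors, and the toy labels and scores are all assumptions made for illustration. No classifier is trained here, and average precision is used as one common estimate of the area under the precision-recall curve; the paper may compute it differently.

```python
from typing import List, Sequence


def pair_features(attr_a: Sequence[float], behav_a: Sequence[float],
                  attr_b: Sequence[float], behav_b: Sequence[float]) -> List[float]:
    """Build one sample for a candidate relationship between bioentities a and b.

    Assumption: each entity has an attribute vector and a behavior (network
    embedding) vector, and the four vectors are simply concatenated.
    """
    return list(attr_a) + list(behav_a) + list(attr_b) + list(behav_b)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a random
    positive sample is scored above a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: the mean of the precision values measured at the
    rank of each positive sample, with samples sorted by descending score."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive sample")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    ap = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank
    return ap / total_pos


if __name__ == "__main__":
    # Toy data: 1 = known association, 0 = no known association.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]  # hypothetical classifier outputs
    print(pair_features([1, 0], [0.5], [0, 1], [0.2]))
    print(roc_auc(labels, scores))
    print(average_precision(labels, scores))
```

In practice the scores would come from a classifier, such as the random forest mentioned above, trained on the pair features; the metric functions only need the true labels and the predicted scores.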
Our study shows that constructing a network with rich topological and biological information is important for a systematic understanding of the biological landscape at the molecular level. Our results show that Bioentity2vec can effectively represent biological entities and provides features that are easy to separate in classification tasks. Our method can also predict relationships of a single type and of multiple types simultaneously, which could accelerate progress in biological experimental research and industrial product development.