Ramadan Emad, Alinsaif Sadiq, Hassan Md Rafiul
Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia.
BMC Bioinformatics. 2016 Jul 25;17(Suppl 7):274. doi: 10.1186/s12859-016-1095-5.
Massive biological datasets are generated at sites all over the world. Analyzing these datasets is necessary to extract knowledge that may help biologists, physicians and pharmacists. Analysis of biological networks has recently received considerable attention, because understanding a network can reveal information about life at the cellular level. Biological networks can be built to capture interactions between proteins or relationships among genes at the expression level. Extracting information from biological networks is recognized as a significant challenge because of the inherent complexity of their structures. Computational techniques have been used to analyze such complex networks with varying success.
In this paper, we develop a new method for predicting phenotype-gene associations in breast cancer using biological network analysis. Several network topological measures are computed and fed as features into two classification models to investigate phenotype-gene associations in breast cancer. More importantly, to overcome the problem of skewed datasets, the synthetic minority oversampling technique (SMOTE) is adapted to transform an imbalanced dataset into a balanced one. We applied our method to a gene co-expression network (GCN), a protein-protein interaction network (PPI), and an integrated functional interaction network (FI), which combines PPIs and gene co-expression, amongst others. We assess the quality of the proposed method using a slightly modified cross-validation.
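The pipeline described above (topological measures as features, followed by oversampling of the minority class) can be illustrated with a minimal sketch. The abstract does not name the measures or classifiers used, so everything here is an assumption for illustration: degree and clustering coefficient stand in for the topological measures, the gene names `G1` to `G6` and their labels are hypothetical, and `smote` is a simplified interpolation in the spirit of SMOTE rather than the authors' adapted version.

```python
import random
from itertools import combinations

def topological_features(adj):
    """Return {node: (degree, clustering coefficient)} for an undirected graph."""
    feats = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        # Count edges that connect two neighbours of `node`.
        links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in adj[u])
        cc = 2.0 * links / (k * (k - 1)) if k > 1 else 0.0
        feats[node] = (float(k), cc)
    return feats

def smote(minority, n_new, k=2, seed=0):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours to create n_new synthetic points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Hypothetical toy network; edges are illustrative, not from the paper.
edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"),
         ("G3", "G4"), ("G4", "G5"), ("G4", "G6")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

feats = topological_features(adj)

# Hypothetical labels: G1 and G3 are the (minority) phenotype-associated genes.
positives = {"G1", "G3"}
minority = [feats[g] for g in sorted(adj) if g in positives]
majority = [feats[g] for g in sorted(adj) if g not in positives]

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + synthetic
```

The balanced feature vectors would then be passed to a classifier and evaluated with cross-validation. In practice, the oversampling is usually applied only to the training folds so that synthetic points do not leak into the held-out data; whether this is the "slight modification" the authors refer to is not stated in the abstract.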
Our method can identify phenotype-gene associations in breast cancer. Moreover, the integrated functional interaction network (FI) has the potential to reveal more information and hidden patterns than the other networks. The software and accompanying examples are freely available at http://faculty.kfupm.edu.sa/ics/eramadan/NetTop.zip.