Zhang Jingpu, Deng Lianping, Deng Lei
School of Computer and Data Science, Henan University of Urban Construction, 467000, Pingdingshan, China.
School of Computer Science and Engineering, Central South University, 410075, Changsha, China.
BMC Genomics. 2025 Apr 10;23(Suppl 6):869. doi: 10.1186/s12864-024-11117-0.
Domains can be viewed as portable units of protein structure, folding, function, evolution, and design. Small proteins often consist of a single domain, while most large proteins combine multiple domains to carry out composite cellular functions. Domain dysfunction may impair protein function in some diseases. Inferring disease-related domains will therefore improve our understanding of the mechanisms of complex human diseases.
In this study, we first build a global heterogeneous information network of structure-based domains, proteins, and diseases. We then extract topological features of the network from the meta-paths between domain and disease nodes. Finally, we train a binary classifier based on the XGBOOST (eXtreme Gradient Boosting) algorithm to predict potential associations between domains and diseases. The results show that the XGBOOST-based binary classification model performs significantly better than models using other machine learning algorithms, achieving an AUC (Area Under Curve) score of 0.94 in the leave-one-out cross-validation experiment.
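As a rough illustration of meta-path-based topological features, the sketch below counts path instances between a domain and a disease in a toy heterogeneous network. The node identifiers, the two meta-paths (domain-protein-disease and domain-protein-protein-disease), and the use of raw path counts as features are all assumptions made for this example; the abstract does not specify which meta-paths or feature definitions the authors used.

```python
from collections import defaultdict

# Toy heterogeneous network. All identifiers are hypothetical.
domain_protein = {("D1", "P1"), ("D1", "P2"), ("D2", "P2"), ("D3", "P3")}
protein_protein = {("P1", "P2"), ("P2", "P3")}
protein_disease = {("P1", "X"), ("P2", "X"), ("P3", "Y")}


def adjacency(edges, symmetric=False):
    """Build a node -> set-of-neighbors map from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if symmetric:
            adj[v].add(u)
    return adj


dp = adjacency(domain_protein)                    # domain  -> proteins
pp = adjacency(protein_protein, symmetric=True)   # protein <-> protein
pd = adjacency(protein_disease)                   # protein -> diseases

# Assumed meta-paths, each written as a sequence of hops.
META_PATHS = {
    "domain-protein-disease": [dp, pd],
    "domain-protein-protein-disease": [dp, pp, pd],
}


def count_paths(start, hops):
    """Count path instances that start at `start` and follow `hops` in order."""
    frontier = {start: 1}
    for adj in hops:
        nxt = defaultdict(int)
        for node, n_paths in frontier.items():
            for neighbor in adj.get(node, ()):
                nxt[neighbor] += n_paths
        frontier = nxt
    return dict(frontier)


def meta_path_features(domain, disease):
    """Return one path count per meta-path for a (domain, disease) pair."""
    return [count_paths(domain, hops).get(disease, 0)
            for hops in META_PATHS.values()]


print(meta_path_features("D1", "X"))  # [2, 2]
print(meta_path_features("D3", "X"))  # [0, 1]
```

Feature vectors of this kind, one per candidate domain-disease pair with a 0/1 association label, could then be passed to a gradient boosting classifier such as `xgboost.XGBClassifier`. That pairing is only one plausible reading of the pipeline described above.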
We develop a method that builds a binary classifier from meta-path-based topological features and predicts potential associations between domains and diseases. Its predictive performance on independent test sets shows that the method is effective. Moreover, representing domains and diseases by integrating more multi-omic data is expected to further improve predictive performance.