GNPI: Graph normalization to integrate phylogenetic information for metagenomic host phenotype prediction.

Affiliations

Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China.

Mathematics and Science College, Shanghai Normal University, Shanghai, China.

Publication information

Methods. 2022 Sep;205:11-17. doi: 10.1016/j.ymeth.2022.05.007. Epub 2022 May 27.

DOI:10.1016/j.ymeth.2022.05.007

PMID:35636652

Abstract

Microorganisms play important roles in our lives, especially in metabolism and disease. Determining the probability that a person suffers from a specific disease, and the severity of that disease, based on microbial genes is crucial for understanding the relationship between microbes and diseases. Previous studies extracted the topological information of phylogenetic trees and integrated it into metagenomic datasets, enabling classifiers to learn more from limited datasets and thereby improving model performance. In this paper, we propose GNPI, a model that better learns the structure of phylogenetic trees. GNPI maintains the original vector format of metagenomic datasets, whereas previous methods had to change the input form to matrices. The vector form of the input data can be readily adopted by baseline machine learning models and is also usable by deep learning models. Datasets processed with GNPI enhance the accuracy of machine learning and deep learning models on three different datasets. GNPI is an interpretable data processing method for host phenotype prediction and other bioinformatics tasks.
